strerror_r is not portable; on GNU libc it returns char * and sometimes
does not fill in the supplied buffer. Use autoconf to test which
version this platform uses and adapt.
Clean up the random calls to strerror and strerror_r (along with all
their private little one-use buffers) and regularize the code to use
cpp_strerror almost everywhere. Where changed, any negation of the
error code is also removed, since cpp_strerror() will do that.
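A minimal sketch of a wrapper that copes with both strerror_r variants. It is not the actual cpp_strerror implementation. Where the commit uses an autoconf check, this sketch uses overload resolution on the return type, purely for illustration. The names portable_strerror and result_of_strerror_r are hypothetical.

```cpp
#include <cassert>
#include <cerrno>
#include <string>
#include <string.h>

// XSI variant: strerror_r returns int and writes the message into buf.
inline const char *result_of_strerror_r(int rc, const char *buf) {
  return rc == 0 ? buf : "unknown error";
}

// GNU variant: strerror_r returns char*, which may or may not point into buf.
inline const char *result_of_strerror_r(const char *msg, const char *) {
  return msg;
}

// Hypothetical helper: accepts err or -err and prefixes the numeric code.
std::string portable_strerror(int err) {
  char buf[128] = {0};
  if (err < 0)
    err = -err;  // callers no longer need to negate the code themselves
  const char *msg =
      result_of_strerror_r(strerror_r(err, buf, sizeof(buf)), buf);
  return "(" + std::to_string(err) + ") " + msg;
}
```

Because the helper normalizes the sign itself, portable_strerror(-ENOENT) and portable_strerror(ENOENT) yield the same string.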
Note: some tools were using their own calls to strerror/strerror_r, so
will now get a (%d) in their output that wasn't there before; hence
the change to test/cli/monmaptool/print-nonexistent.t
Fixes: #8041
Signed-off-by: Dan Mick <dan.mick@inktank.com>
We split global_init_postfork() in two: start and finish, with the first
keeping much of postfork()'s tasks except closing stderr, which we leave
open until just before we daemonize. This allows the user to see any
error messages that the monitor may spit out before it daemonizes, making
sense of the error code (which we were already returning).
Fixes: #7489
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
A mon is considered to exist if the mon-data directory exists and is not
empty. If ceph-mon --mkfs is run twice, it will succeed the second time
around and display an informative message.
Signed-off-by: Loic Dachary <loic@dachary.org>
It is the same flag that is given to common_preinit. The service thread
is not initialized if CINIT_FLAG_NO_DAEMON_ACTIONS is set.
Signed-off-by: Loic Dachary <loic@dachary.org>
The ceph-mon command usage is updated to document all of the ceph-mon
specific options.
The ceph tell usage examples for log and debug are using a deprecated syntax.
Signed-off-by: Loic Dachary <loic@dachary.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This makes it easy to identify problems with (early) shutdown with a
loop like
while [ ! -e core ] ; do ./ceph-mds -i a -c ceph.conf -f ; done
and a vstart cluster.
Signed-off-by: Sage Weil <sage@inktank.com>
Commands such as 'mon_status', 'quorum_status', 'sync_status' and
'sync_force' didn't support any formatter besides json. Regardless of
'--format=foo' being specified, they would always output in json.
This commit changes that behavior, allowing a format to be passed. These
functions do not output in plain-text however. Plain-text will default
to 'json' -- the reason: the information they provide is better output
in a structured fashion, and I was too lazy to come up with a plain-text
version that could be at least as good.
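The fallback described above could be sketched as follows. The function name effective_format is hypothetical, and treating an empty request the same as 'plain' is an assumption, not something the commit states.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: pick the formatter name for a structured-only command.
// "plain" (and, by assumption, an empty request) falls back to "json";
// any other requested format is passed through unchanged.
std::string effective_format(const std::string &requested) {
  if (requested.empty() || requested == "plain")
    return "json";
  return requested;
}
```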
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
leveldb doesn't seem to like this when it races with an internal compaction
attempt (see below). Instead, let the store get opened by the ceph_mon
caller, and pull a bit of the logic into the caller to make the flow a
little easier to follow.
-2> 2013-06-25 17:49:25.184490 7f4d439f8780 10 needs_conversion
-1> 2013-06-25 17:49:25.184495 7f4d4065c700 5 asok(0x13b1460) entry start
0> 2013-06-25 17:49:25.316908 7f4d3fe5b700 -1 *** Caught signal (Segmentation fault) **
in thread 7f4d3fe5b700
ceph version 0.64-667-g089cba8 (089cba8fc0e8ae8aef9a3111cba7342ecd0f8314)
1: ceph-mon() [0x649f0a]
2: (()+0xfcb0) [0x7f4d435dccb0]
3: (leveldb::Table::BlockReader(void*, leveldb::ReadOptions const&, leveldb::Slice const&)+0x154) [0x806e54]
4: ceph-mon() [0x808840]
5: ceph-mon() [0x808b39]
6: ceph-mon() [0x806540]
7: (leveldb::DBImpl::DoCompactionWork(leveldb::DBImpl::CompactionState*)+0xdd) [0x7f363d]
8: (leveldb::DBImpl::BackgroundCompaction()+0x2c0) [0x7f4210]
9: (leveldb::DBImpl::BackgroundCall()+0x68) [0x7f4cc8]
10: ceph-mon() [0x80b3af]
11: (()+0x7e9a) [0x7f4d435d4e9a]
12: (clone()+0x6d) [0x7f4d4196bccd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Signed-off-by: Sage Weil <sage@inktank.com>
In 3c5706163b we made exit() not actually
exit so that the leak checking would behave for a non-forking case.
That is only needed for the normal exit case; every other case expects
exit() to actually terminate and not continue execution.
Instead, make a signal_exit() method that signals the parent (if any)
and then lets you return. exit() goes back to its usual behavior,
fixing the many other calls in main().
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
Put it on the heap so that we can destroy it before the g_ceph_context
cct that it references. This fixes a crash like
*** Caught signal (Segmentation fault) **
in thread 4034a80
ceph version 0.63-204-gcf9aa7a (cf9aa7a003)
1: ceph-mon() [0x59932a]
2: (()+0xfcb0) [0x4e41cb0]
3: (Mutex::Lock(bool)+0x1b) [0x6235bb]
4: (PerfCountersCollection::remove(PerfCounters*)+0x27) [0x6a0877]
5: (LevelDBStore::~LevelDBStore()+0x1b) [0x582b2b]
6: (LevelDBStore::~LevelDBStore()+0x9) [0x582da9]
7: (main()+0x1386) [0x48db16]
8: (__libc_start_main()+0xed) [0x658076d]
9: ceph-mon() [0x4909ad]
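A minimal sketch of the ordering this fix relies on: holding the store on the heap lets it be destroyed explicitly while the context it references is still alive. Context and Store here are hypothetical stand-ins, not the real CephContext or LevelDBStore classes.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the context; records its own destruction.
struct Context {
  std::vector<std::string> *log;
  explicit Context(std::vector<std::string> *l) : log(l) {}
  ~Context() { log->push_back("context destroyed"); }
};

// Hypothetical stand-in for the store; its destructor touches the context.
struct Store {
  Context *cct;  // borrowed pointer; the context must outlive the store
  explicit Store(Context *c) : cct(c) {}
  ~Store() { cct->log->push_back("store destroyed"); }
};

// Returns the destruction order observed when the store is heap-allocated
// and released explicitly before the context goes away.
std::vector<std::string> run_and_teardown() {
  std::vector<std::string> log;
  std::unique_ptr<Context> cct(new Context(&log));
  std::unique_ptr<Store> store(new Store(cct.get()));
  store.reset();  // destroy the store first, while cct is still valid
  cct.reset();    // only now tear down the context
  return log;
}
```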
Signed-off-by: Sage Weil <sage@inktank.com>
We made the common_init_finish and chdir conditional on daemonize in commit
2e0dd5ae6c, breaking init (asok at least)
when -f is specified (as with upstart).
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
leveldb has static state that prevents it from recreating its worker thread
after our fork(), even when we close and reopen the database (tsk tsk!).
Avoid this by forking early, before we touch leveldb.
Hide the details in a Preforker class. This is modeled after what
ceph-fuse already does; we should convert it later.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
In order of interest/priority:
- our latest monmap version
- a backup monmap version created during sync start, if the store
appears to be in a post-aborted sync state
- a mkfs monmap version
If none of these are found, we should go ahead and try to build a
monmap from ceph.conf to join an existing cluster.
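The lookup order above can be sketched as a simple priority chain. MonmapSources and pick_monmap are hypothetical names; an empty string stands in for "not present in the store".

```cpp
#include <cassert>
#include <string>

// Hypothetical view of what the store holds; "" means "not present".
struct MonmapSources {
  std::string latest;       // our latest monmap version
  std::string sync_backup;  // backup created when a sync started
  bool sync_was_aborted;    // store looks like a post-aborted sync
  std::string mkfs;         // monmap written at mkfs time
};

// Returns true and fills *out with the chosen monmap, or returns false
// when the caller should build a monmap from ceph.conf instead.
bool pick_monmap(const MonmapSources &s, std::string *out) {
  if (!s.latest.empty()) {
    *out = s.latest;
    return true;
  }
  if (s.sync_was_aborted && !s.sync_backup.empty()) {
    *out = s.sync_backup;
    return true;
  }
  if (!s.mkfs.empty()) {
    *out = s.mkfs;
    return true;
  }
  return false;
}
```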
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
We already have a throttler that lets us limit the amount of memory
consumed by messages from a given source. Currently this is based only
on the size of the message payload. Add a second throttler that limits
the number of messages so that we can effectively throttle small requests
as well.
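A minimal sketch of throttling on both dimensions at once. MessageThrottle is a hypothetical class, and it is non-blocking for simplicity; this is an assumption of the sketch and says nothing about how the real throttler waits.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical throttle limiting both payload bytes and message count.
class MessageThrottle {
  std::size_t max_bytes_, max_msgs_;
  std::size_t cur_bytes_ = 0, cur_msgs_ = 0;

public:
  MessageThrottle(std::size_t max_bytes, std::size_t max_msgs)
      : max_bytes_(max_bytes), max_msgs_(max_msgs) {}

  // Admit a message only if both budgets still have room for it.
  bool try_get(std::size_t payload_bytes) {
    if (cur_msgs_ + 1 > max_msgs_ || cur_bytes_ + payload_bytes > max_bytes_)
      return false;
    ++cur_msgs_;
    cur_bytes_ += payload_bytes;
    return true;
  }

  // Release the budget once the message has been handled.
  void put(std::size_t payload_bytes) {
    assert(cur_msgs_ > 0 && cur_bytes_ >= payload_bytes);
    --cur_msgs_;
    cur_bytes_ -= payload_bytes;
  }
};
```

With a byte-only limit, a flood of one-byte requests would never be refused; the count limit is what stops it.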
Signed-off-by: Sage Weil <sage@inktank.com>
We used to assert() instead, which didn't shed enough light on the cause
and could confuse the user into believing something *terrible* had
happened.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
With the single-paxos patches we shifted from an approach with multiple
paxos instances (one for each paxos service) keeping their own versions
to a single paxos instance for all the paxos services, thus ending up
with a single global version for paxos.
With the release of v0.52, the monitor started tracking these global
versions, keeping them for the single purpose of making it possible to
convert the store to a single-paxos format.
This patch now introduces a mechanism to convert a GV-enabled store to
the single-paxos format store when the monitor is upgraded.
As we require the global versions to be present, we first check if the
store has the GV feature set: if not we will not proceed, but we will
start the conversion otherwise.
In the end of the conversion, the monitor data directory will have a
brand new 'store.db' directory, where the key/value store lies,
alongside the old store. This makes it possible to revert to a
previous monitor version if things go sideways, without jeopardizing the
data in the store.
The conversion is done during a rolling upgrade, without any
intervention by the user. Fire up the new monitor version on an old
store, and the monitor itself will convert the store, trim any lingering
versions that might not be required, and proceed to start as expected.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
The init() function always implicitly created a new store if it was
missing.
This patch makes init() a private function accepting a bool that is used
to specify whether or not we want to create the store if it does not
exist, and creates two functions: open() and create_and_open().
open() will fail if the store we are trying to open does not exist;
create_and_open() maintains the same behavior as the previous behavior of
init() and will create the store if it does not exist before opening it.
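The split described above might look roughly like this. KVStore is a hypothetical class, and an in-memory set stands in for "which store paths exist on disk"; returning -ENOENT from open() is an assumption of the sketch.

```cpp
#include <cassert>
#include <cerrno>
#include <set>
#include <string>

// Hypothetical store whose backing "disk" is a set of existing paths.
class KVStore {
  std::set<std::string> &existing_;
  std::string path_;
  bool opened_ = false;

  // Private: the bool says whether a missing store may be created.
  int init(bool create_if_missing) {
    if (existing_.count(path_) == 0) {
      if (!create_if_missing)
        return -ENOENT;
      existing_.insert(path_);
    }
    opened_ = true;
    return 0;
  }

public:
  KVStore(std::set<std::string> &existing, const std::string &path)
      : existing_(existing), path_(path) {}

  int open() { return init(false); }            // fails if store is missing
  int create_and_open() { return init(true); }  // old init() behavior
  bool is_open() const { return opened_; }
};
```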
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
We are converting the monitor subsystem to a Single-Paxos architecture,
backed by a key/value store. The previous architecture used a Paxos
instance for each Paxos Service, backed by a nasty Monitor Store that
provided few to no consistency guarantees whatsoever, which led to a fair
amount of workarounds.
Changes:
* Paxos:
- Add k/v store support
- Add documentation describing the new Paxos storage layout and behavior
- Get rid of the stashing code, which was used as a consistency point
mechanism (we no longer need it, because of our k/v store)
- Debug level of 30 will output json-formatted transaction dumps
- Allows for proposal queueing, to be proposed in the same order as
they were queued.
- No more 'is_leader()' function, using instead the Monitor's for
enhanced simplicity.
- Add 'is_lease_valid()' function.
- Disregard 'stashed versions'
- Make the paxos 'state' variable a bit-map, so we lock the proposal
mechanism while maintaining the state [5].
- Related notes: [3]
* PaxosService:
- Add k/v store support, creating wrappers to be used by the services
- Add documentation
- Support single-paxos behavior, creating wrappers to be used by the
services and service-specific version
- Rearrange variables so they are neatly organized in the beginning of
the class
- Add a trim_to() function to be used by the services, instead of letting
them rely on Paxos::trim_to(), which is no longer adequate to the job
at hand
- Debug level of 30 will output json-formatted transaction dumps
- Support proposal queueing, taking it into consideration when
assessing the current state of the service (active, writeable,
readable, ...)
- Redefine the conditions for 'is_{active,readable,writeable}()' given
the new single-paxos approach, with proposal queueing [1].
- Use our own waiting_for_* callback lists, which now must be
dissociated from their Paxos counterparts [2].
- Related notes: [3], [4]
* Monitor:
- Add k/v store support
- Use only one Paxos instance and pass it down to each service instance
- Crank up CEPH_MON_PROTOCOL to 10
* {Auth,Log,MDS,Monmap,OSD,PG}Monitor:
- Add k/v store support
- Add single-paxos support
* AuthMonitor:
- Don't always propose full versions: if the KeyServer doesn't have
keys, we cannot propose a full version. This should only happen when
we start with a brand new store and we are creating the first
pending proposal, and if we were to commit a full version filled
with nothing but a big void of nothingness, we could eventually end
up with a corrupted version.
* Elector:
- Add k/v store support
- Add single-paxos support
* ceph-mon:
- Use the monitor's k/v store instead of MonitorStore
* MMonPaxos:
- remove the machine_id field: This field was used to identify from/to
which paxos service a given message belonged. We no longer have a Paxos
for each service, so this field became obsolete.
Notes:
[1] Redefine the conditions for 'is_{active,readable,writeable}()' on
the PaxosService class, to be used with single-paxos and proposal
queueing:
We should not rely on the Paxos::is_*() functions, since they do not apply
directly to the PaxosService.
All the PaxosService classes share the same Paxos class, but they do not
rely on its values. Each service only relies, uses and updates its own
values on the k/v store. Thus, we may have a given service (e.g., the
OSDMonitor) proposing a new value, hence updating or waiting to update its
store, and we may still consider the LogMonitor as being able to read and
write its own values on the k/v store. In a nutshell, different services
do not overlap on their access to their own store when it comes to reading,
and since the Paxos will queue their updates and deal with them in a FIFO
order, their updates won't overlap either.
Therefore, the conditions for the PaxosService::is_{active,readable,
writeable} differ from those on the Paxos::is_{active,readable,writeable}.
* PaxosService::is_active() - the PaxosService will be considered as
active iff it is not proposing and the Paxos is not recovering. This
means that a given PaxosService (e.g., the OSDMonitor) may be considered
as being active even though some other service (e.g., the LogMonitor) is
proposing a new value and the Paxos is on the UPDATING state. This means
that the OSDMonitor will be able to read its own versions and queue any
changes on to the Paxos. However, if the Paxos is on state RECOVERING,
we cannot be considered as active.
* PaxosService::is_writeable() - We will be able to propose new values
iff we are the Leader, we have a valid lease, and we are not already
proposing. If we are proposing, we must wait for our proposal to finish
in order to proceed with writing to our k/v store; otherwise we could
end up assuming that our last committed version was, say, 10; then
assign map epochs/versions taking that into consideration, make changes
to the store based on those values, only to end up clobbering previously
proposed values on the store. We really don't want that. To be fair,
there was a chance we could assume we were always writable, but there
may be unforeseen consequences to this; so we take the conservative
approach here for now, and we will relax it in the future if we believe
it to be fruitful.
* PaxosService::is_readable() - We will be readable iff we are not
proposing and the Paxos is not recovering; if our last committed version
exists; and if we are either a cluster of one or we have a valid lease.
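The three conditions above can be written down as plain predicates. ServiceState and its field names are hypothetical; they only name the inputs that the text mentions.

```cpp
#include <cassert>

// Hypothetical snapshot of the inputs the three predicates depend on.
struct ServiceState {
  bool proposing;            // this service has a proposal in flight
  bool paxos_recovering;     // the shared Paxos is in state RECOVERING
  bool is_leader;
  bool lease_valid;
  bool have_last_committed;  // our last committed version exists
  int cluster_size;          // number of monitors in the cluster
};

// Active iff we are not proposing and the Paxos is not recovering.
bool is_active(const ServiceState &s) {
  return !s.proposing && !s.paxos_recovering;
}

// Writeable iff we are the Leader, hold a valid lease, and are not proposing.
bool is_writeable(const ServiceState &s) {
  return s.is_leader && s.lease_valid && !s.proposing;
}

// Readable iff active, our last committed version exists, and we are
// either a cluster of one or hold a valid lease.
bool is_readable(const ServiceState &s) {
  return !s.proposing && !s.paxos_recovering && s.have_last_committed &&
         (s.cluster_size == 1 || s.lease_valid);
}
```

Note that none of the predicates look at whether some other service is proposing, which is the point of decoupling them from the Paxos::is_*() functions.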
[2] Use own waiting_for_* callback lists on PaxosService, which now must
be dissociated from their Paxos counterparts:
We were relying on Paxos to wait for state changes, but since our state
became somewhat independent from the Paxos state, we have to deal with
callbacks waiting for 'readable', 'writable' or 'active' on different
terms than those that Paxos provides.
So, basically, we will take one of two approaches when it comes to waiting:
* If we are proposing, queue ourselves on our own list, waiting for the
proposal to finish;
* Otherwise, the cause for the need to wait comes from Paxos, so queue
the callback directly on Paxos.
This approach means that we must make sure to check our desired state
whenever the callback is fired up, and re-queue ourselves if the state
didn't quite change (or if it changed but our waiting condition result
didn't). For instance, if we were waiting for a proposal to finish due to
a failed 'is_active()', we will need to recheck if we are active before
continuing once the callback is fired. This is mainly because we may have
finished our proposal, but a new Election may have been called and the
Paxos may not be active.
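A minimal sketch of the two-list waiting scheme with the re-check on wake-up. PaxosStub, ServiceStub and fire are hypothetical names, and the Paxos state is reduced to a single bool for illustration.

```cpp
#include <cassert>
#include <functional>
#include <list>

using Callback = std::function<void()>;

// Hypothetical stand-in for the shared Paxos and its waiting list.
struct PaxosStub {
  std::list<Callback> waiting;
};

struct ServiceStub {
  PaxosStub *paxos;
  bool proposing = false;
  bool paxos_active = true;
  std::list<Callback> waiting_for_finished_proposal;

  explicit ServiceStub(PaxosStub *p) : paxos(p) {}

  bool is_active() const { return !proposing && paxos_active; }

  // Run fn once we are active; otherwise queue a retry on whichever list
  // matches the reason we have to wait.
  void wait_for_active(Callback fn) {
    if (is_active()) {
      fn();
      return;
    }
    // The retry re-checks the state, since it may still not be what we need.
    Callback retry = [this, fn]() { wait_for_active(fn); };
    if (proposing)
      waiting_for_finished_proposal.push_back(retry);
    else
      paxos->waiting.push_back(retry);
  }
};

// Fire every callback currently on the list; callbacks may re-queue.
void fire(std::list<Callback> &l) {
  std::list<Callback> ready;
  ready.swap(l);
  for (auto &cb : ready)
    cb();
}
```

If the proposal finishes but an election has made the Paxos inactive in the meantime, the retry simply moves the callback to the Paxos list instead of running it.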
[3] Propose everything in the queue before bootstrapping, but don't
allow new proposals:
The MonmapMonitor may issue bootstraps once it is updated. We must ensure
that we propose every single pending proposal before we actually do it.
However, we don't want to propose if we are going to bootstrap; otherwise,
we may end up losing proposals.
[4] Handle the case when first_committed_version equals 0 on a
PaxosService
In a nutshell, the services do not set the first committed version, as
they consider it as a SEP (Somebody Else's Problem). They do rely on it
though, and we, the PaxosService, must ensure that it contains a valid
value (that is, higher than zero) at all times.
Since we will only have a first_committed version equal to zero once,
and that is before the service's first proposal, we are safe to simply
read the variable from the store and assign the first_committed the same
value as the last_committed iff the first_committed version is zero.
This also affects trimming, since trimming relies on the first_committed
version as the lower bound for version trimming. Even though the k/v store
will gracefully ignore any problem from trying to remove non-existent
versions, the main issue would still stand: we'd be removing a non-existent
version and that just doesn't make any sense.
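The fix-up for a zero first_committed is small enough to sketch directly. effective_first_committed is a hypothetical helper name, and version_t is declared locally for the sketch.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint64_t version_t;

// Hypothetical helper: a first_committed of zero only occurs before the
// service's first proposal, so in that case use last_committed instead.
version_t effective_first_committed(version_t first_committed,
                                    version_t last_committed) {
  return first_committed == 0 ? last_committed : first_committed;
}
```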
[5] 'lock' paxos when we are running some internal proposals
Force the paxos services to wait for us to complete whatever we are
doing before they can proceed. This is required because on certain
occasions we might need to run internal proposals, not tied to any of
the paxos services (for instance, when learning an old value), and we need
them to stay put, or they might end up in an erroneous state and crash the
monitor.
This could have been done with an extra bool, but there was no point
in creating a new variable when we can just as easily reuse the
'state' variable for our twisted interests.
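A sketch of reusing a bit-map 'state' variable as a lock, assuming hypothetical bit names and values; the real constants may differ.

```cpp
#include <cassert>

// Hypothetical state bits; names and values are assumptions of this sketch.
enum : unsigned {
  STATE_RECOVERING = 1u << 0,
  STATE_ACTIVE = 1u << 1,
  STATE_UPDATING = 1u << 2,
  STATE_LOCKED = 1u << 3,  // set while an internal proposal is running
};

struct PaxosState {
  unsigned state = STATE_RECOVERING;

  // Locking only toggles one bit, so the underlying state is preserved.
  void lock() { state |= STATE_LOCKED; }
  void unlock() { state &= ~STATE_LOCKED; }

  bool is_locked() const { return (state & STATE_LOCKED) != 0; }
  bool is_active() const { return (state & STATE_ACTIVE) != 0; }

  // Services may propose only when the Paxos is active and not locked.
  bool can_propose() const { return is_active() && !is_locked(); }
};
```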
Fixes: #4175
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
These functions are like the non-safe versions, but assert that
there were no disk errors and have void return types. Change a
bunch of callers who weren't checking the return code to use
these variants instead.
(Unfortunately we can't make them default safe because several of
the callers depend on getting back the length, and are perfectly happy
with ENOENT producing a 0 return value.)
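A minimal sketch of the safe/non-safe pairing. TinyStore and its methods are hypothetical, and an in-memory map with a settable failure flag stands in for the disk.

```cpp
#include <cassert>
#include <cerrno>
#include <map>
#include <string>

// Hypothetical store; disk_ok_ simulates whether the disk is healthy.
class TinyStore {
  std::map<std::string, std::string> data_;
  bool disk_ok_ = true;

public:
  void set_disk_ok(bool ok) { disk_ok_ = ok; }

  // Non-safe variant: returns 0 or a negative errno the caller must check.
  int put(const std::string &key, const std::string &val) {
    if (!disk_ok_)
      return -EIO;
    data_[key] = val;
    return 0;
  }

  // Safe variant: void return; a disk error trips the assert instead.
  void put_safe(const std::string &key, const std::string &val) {
    int r = put(key, val);
    assert(r == 0);
    (void)r;
  }

  // Returns the value length; a missing key yields 0 rather than an error,
  // which is why this one cannot simply be made "safe" by default.
  int get(const std::string &key, std::string *out) const {
    if (!disk_ok_)
      return -EIO;
    std::map<std::string, std::string>::const_iterator it = data_.find(key);
    if (it == data_.end())
      return 0;
    *out = it->second;
    return static_cast<int>(out->size());
  }
};
```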
Signed-off-by: Greg Farnum <greg@inktank.com>
Before the mon, and lockdep, in particular.
#0 __pthread_mutex_lock (mutex=0x30) at pthread_mutex_lock.c:50
#1 0x0000000000816092 in ceph::log::Log::submit_entry (this=0x0, e=0x2f4a270) at log/Log.cc:138
#2 0x00000000007ee0f8 in handle_fatal_signal (signum=11) at global/signal_handler.cc:100
#3 <signal handler called>
#4 0x00000000008e1300 in lockdep_will_lock (name=0x959aa7 "SignalHandler::lock", id=17) at common/lockdep.cc:163
#5 0x00000000008867fc in Mutex::_will_lock (this=0x2f20428) at ./common/Mutex.h:56
#6 0x0000000000886605 in Mutex::Lock (this=0x2f20428, no_lockdep=false) at common/Mutex.cc:81
#7 0x00000000007eeb95 in SignalHandler::entry (this=0x2f20300) at global/signal_handler.cc:198
#8 0x00000000008b0bd1 in Thread::_entry_func (arg=0x2f20300) at common/Thread.cc:43
#9 0x00007f36fefd6b50 in start_thread (arg=<optimized out>) at pthread_create.c:304
#10 0x00007f36fd80b6dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#11 0x0000000000000000 in ?? ()
#0 0x00007f36fefd7e75 in pthread_join (threadid=139874129766144, thread_return=0x0) at pthread_join.c:89
#1 0x00000000008b11ec in Thread::join (this=0x2f20300, prval=0x0) at common/Thread.cc:130
#2 0x00000000007eeae7 in SignalHandler::shutdown (this=0x2f20300) at global/signal_handler.cc:186
#3 0x00000000007ee9cf in SignalHandler::~SignalHandler (this=0x2f20300, __in_chrg=<optimized out>) at global/signal_handler.cc:175
#4 0x00000000007eea58 in SignalHandler::~SignalHandler (this=0x2f20300, __in_chrg=<optimized out>) at global/signal_handler.cc:176
#5 0x00000000007ee643 in shutdown_async_signal_handler () at global/signal_handler.cc:324
#6 0x00000000006de9d2 in main (argc=7, argv=0x7fffbfb8a1e8) at ceph_mon.cc:439
Signed-off-by: Sage Weil <sage@inktank.com>
Three helpers:
- legacy features (if file isn't present)
- required features
- supported features
Write out the feature file on startup with legacy values if it isn't
present, so that everything else can assume it is there.
Signed-off-by: Sage Weil <sage@inktank.com>