This ensures we release our in-progress recovery counters, which prevents
recovery from getting blocked indefinitely when a pool removal races with
recovery ops.
Fixes: #4217
Backport: bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
With the single-paxos patches we shifted from an approach with multiple
paxos instances (one for each paxos service) keeping their own versions
to a single paxos instance for all the paxos services, thus ending up
with a single global version for paxos.
With the release of v0.52, the monitor started tracking these global
versions, keeping them for the single purpose of making it possible to
convert the store to a single-paxos format.
This patch now introduces a mechanism to convert a GV-enabled store to
the single-paxos format store when the monitor is upgraded.
As we require the global versions to be present, we first check
whether the store has the GV feature set: if it does not, we do not
proceed; otherwise, we start the conversion.
At the end of the conversion, the monitor data directory will have a
brand new 'store.db' directory, where the key/value store lives,
alongside the old store. This makes it possible to revert to a
previous monitor version if things go sideways, without jeopardizing the
data in the store.
The conversion is done during a rolling upgrade, without any
intervention by the user. Fire up the new monitor version on an old
store, and the monitor itself will convert the store, trim any lingering
versions that are no longer required, and proceed to start as expected.
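A rough sketch of that flow, with illustrative stubs (these helper
names are not Ceph's actual conversion API):

    #include <string>

    // Illustrative stubs; the real checks and conversion live elsewhere.
    static bool store_db_exists(const std::string&) { return false; }
    static bool has_gv_feature(const std::string&)  { return true; }
    static int  convert_store(const std::string&)   { return 0; }

    // On startup: convert a GV-enabled store to the single-paxos format,
    // leaving the old store in place so a rollback remains possible.
    static int maybe_convert(const std::string& mon_data_dir)
    {
      if (store_db_exists(mon_data_dir + "/store.db"))
        return 0;                   // already converted: nothing to do
      if (!has_gv_feature(mon_data_dir))
        return -1;                  // no global versions: do not proceed
      return convert_store(mon_data_dir);
    }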
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
This tool will convert an old monitor store format (bobtail) to the new
key/value store-backed, single-paxos format.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
The init() function always implicitly created a new store if it was
missing.
This patch makes init() a private function taking a bool that
specifies whether to create the store if it does not exist, and adds
two functions: open() and create_and_open().
open() will fail if the store we are trying to open does not exist;
create_and_open() preserves the previous behavior of init() and will
create the store if it does not exist before opening it.
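A condensed sketch of the resulting interface, using stand-in types
(the real MonitorDBStore signatures differ):

    #include <stdexcept>

    class StoreSketch {
      bool exists_ = false;            // stand-in for the on-disk check
      void init(bool create) {         // now private
        if (!exists_) {
          if (!create)
            throw std::runtime_error("store does not exist");
          exists_ = true;              // create the store
        }
        // ... open the store ...
      }
    public:
      void open() { init(false); }            // fails if the store is missing
      void create_and_open() { init(true); }  // previous init() behavior
    };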
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Synchronize two monitor stores when one of the monitors has diverged
significantly from the remaining monitor cluster.
This process roughly consists of the following steps:
0. mon.X tries to join the cluster;
1. mon.X verifies that it has diverged from the remaining cluster;
2. mon.X asks the leader to sync;
3. the leader allows mon.X to sync, pointing out a mon.Y from
which mon.X should sync;
4. mon.X asks mon.Y to sync;
5. mon.Y sends its own store in one or more chunks;
6. mon.X acks each received chunk; go to 5;
7. mon.X receives the last chunk from mon.Y;
8. mon.X informs the leader that it has finished synchronizing;
9. the leader acks mon.X's finished sync;
10. mon.X bootstraps and retries joining the cluster (goto 0.)
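Condensed as an illustrative sketch (step numbers refer to the list
above; these are not the actual op or message names):

    // Illustrative only; not the actual wire protocol types.
    enum class SyncStep {
      PROBE,        // 0-1: mon.X probes and notices it has diverged
      START,        // 2-3: mon.X asks the leader; leader points at mon.Y
      CHUNK,        // 4-5: mon.X asks mon.Y; mon.Y ships a store chunk
      CHUNK_ACK,    // 6:   mon.X acks the chunk; mon.Y sends the next (goto 5)
      LAST_CHUNK,   // 7:   mon.Y sends the final chunk
      FINISH,       // 8:   mon.X tells the leader it has finished
      FINISH_ACK    // 9-10: leader acks; mon.X bootstraps and rejoins
    };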
This is the simplest and most straightforward process one can hope
for. However, things may go sideways at any time (monitors failing, for
instance), which could potentially lead to a corrupted monitor store.
There are, however, mechanisms at work to avoid such scenarios at every
step of the process.
Some of these mechanisms include:
- aborting the sync if the leader fails or leadership changes;
- state barriers on synchronization functions to avoid stray/outdated
messages from interfering on the normal monitor behavior or on-going
synchronization;
- store clean-up before any synchronization process starts;
- store clean-up if a sync process fails;
- resuming sync from a different monitor mon.Z if mon.Y fails mid-sync;
- several timeouts to guarantee that all the involved parties are still
alive and participating in the sync effort;
- request forwarding when mon.X contacts a monitor outside the quorum
that might know who the leader is (or might know someone who does)
[4].
Changes:
- Adapt the MMonProbe message for the single-paxos approach, dropping
the version map and using a lower and upper bound version instead.
- Remove old slurp code.
- Add 'sync force' command; 'sync_force' through the admin socket.
Notes:
[1] It's important to keep track of the paxos version at the time at
which a store sync starts. Given that after the sync we end up with
the same state as the monitor we are synchronizing from, there is a
chance that we might end up with an uncommitted paxos version if we
are synchronizing with the leader (there's some paxos stashing done
prior to commit on the leader). By keeping track of the version at
which the sync started, we can then tell the requester to which
version it should cap its paxos store.
[2] Furthermore, the enforced paxos cap, described in [1], is even more
important if we consider the need to reapply the paxos versions that
were received during the sync, to make sure the paxos store is
consistent. If we happened to have some yet-uncommitted version in
the store, we could end up applying it.
[3] What is described in [1] and [2]:
Fixes: #4026
Fixes: #4037
Fixes: #4040
[4] Whenever a given monitor mon.X is on the probing phase and notices
that there is a mon.Y with a paxos version considerably higher than
the one mon.X has, then mon.X will attempt to synchronize from
mon.Y. This is the basis for the store sync. While this generally
holds true, there is a chance that, by the time mon.Y handles the
sync request from mon.X, mon.Y is already attempting a sync itself
with some other mon.Z. In this case, the appropriate thing for mon.Y
to do is to forward mon.X's request to mon.Z, as mon.Z should be part
of the quorum and know who the leader is, or be the leader itself --
if not, it is at least guaranteed that mon.Z has a higher version
than both mon.X and mon.Y, so it should be okay to sync from it.
Fixes: #4162
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
The monitor's synchronization process requires a specific message type
to carry the required information. Since this process significantly
differs from slurping, reusing the MMonProbe message is not an option,
as it would require major changes and, for all intents and purposes, it
would be far outside the scope of the MMonProbe message.
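A hypothetical sketch of the kind of payload such a message carries
(field names are illustrative, not the actual MMonSync layout):

    #include <cstdint>
    #include <string>
    #include <vector>

    struct SyncMsgSketch {
      uint32_t op;                  // start / chunk / chunk-ack / finish ...
      uint64_t version;             // paxos version bound for this sync
      std::string last_key;         // cursor: where the next chunk resumes
      std::vector<uint8_t> chunk;   // raw key/value data for one chunk
    };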
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
We created an interface specific to the MonitorDBStore, which can be used
to create iterators to obtain chunks for sync.
Two different iterators were defined: one that will iterate over the whole
store, focusing on the specified set of prefixes; another that will
iterate over only one specific prefix.
These two different iterators allow us to build the sync process in two
distinct phases: 1) obtain all key/value pairs for paxos and all paxos
services, bundle them in chunks and send them over the wire; and 2) obtain
all the paxos versions, bundle them in chunks and send them over the wire.
Also, we currently consider a chunk to be (at most) 1 MB worth of
data, although this can be tuned using the 'mon_sync_max_payload_size'
option.
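A minimal sketch of the chunking idea, assuming a flat map keyed by
"prefix/key" (this is not the real MonitorDBStore interface; max_bytes
plays the role of 'mon_sync_max_payload_size'):

    #include <cstddef>
    #include <map>
    #include <set>
    #include <string>
    #include <utility>
    #include <vector>

    typedef std::vector<std::pair<std::string, std::string> > Chunk;

    // Iterate over the requested prefixes only, cutting a new chunk
    // whenever the accumulated payload would exceed max_bytes.
    std::vector<Chunk> chunk_store(
        const std::map<std::string, std::string>& store,
        const std::set<std::string>& prefixes, size_t max_bytes)
    {
      std::vector<Chunk> chunks(1);
      size_t sz = 0;
      for (const auto& kv : store) {
        std::string prefix = kv.first.substr(0, kv.first.find('/'));
        if (!prefixes.count(prefix))
          continue;                      // skip prefixes we were not asked for
        if (sz + kv.second.size() > max_bytes && !chunks.back().empty()) {
          chunks.push_back(Chunk());     // start a new chunk
          sz = 0;
        }
        chunks.back().push_back(kv);
        sz += kv.second.size();
      }
      return chunks;
    }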
mon: MonitorDBStore: add crc support when --mon-sync-debug is set
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Instead of directly modifying the store whenever we want to trim our Paxos
state, we should do it through Paxos, proposing the trim to the quorum and
commit it once accepted.
This enforces three major invariants that we will be able to leverage later
on during the store synchronization:
1) The Leader will set the pace for trimming across the system. No one
will trim their state unless they are committing the value proposed by
the Leader;
2) Following (1), the monitors in the quorum will trim at the same time.
There will be no diverging states due to trimming on different monitors.
3) Each trim will be kept as a transaction in the Paxos' store allowing
us to obtain a consistent state during synchronization, by shipping
the Paxos versions to the other monitor and applying them. We could
end up in an inconsistent state if the trim happened without
constraints, without being logged; by going through Paxos this concern
is no longer relevant.
The trimming itself may be triggered each time a proposal finishes, which
is the time at which we know we have committed a new version on the store.
It shall be triggered iff we are sure we have enough versions on the store
to fill the gap of any monitor that might become alive and still hasn't
drifted enough to require synchronization. Roughly speaking, we will check
if the number of available versions is higher than 'paxos_max_join_drift'.
Furthermore, we added a new option, 'paxos_trim_tolerance', so we can
avoid trimming every single time the above condition is met -- otherwise
we would trim a version, then propose a new one, then trim again, and so
on. Instead, we simply tolerate a couple of commits before trimming
again.
Finally, we added support to enable/disable trimming, which will be
essential during the store synchronization process.
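A sketch of the resulting trigger (the option names match the config
options above; the surrounding structure is illustrative):

    #include <cstdint>

    struct TrimSketch {
      uint64_t first_committed = 0, last_committed = 0;
      uint64_t paxos_max_join_drift = 10;  // assumed value for the sketch
      uint64_t paxos_trim_tolerance = 30;  // extra commits we tolerate
      bool trim_disabled = false;          // e.g. while a sync is running

      // Checked when a proposal finishes: only trim once we exceed the
      // join drift by the tolerance, so we don't re-trim on every commit.
      bool should_trim() const {
        return !trim_disabled &&
               last_committed - first_committed >
                   paxos_max_join_drift + paxos_trim_tolerance;
      }
    };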
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
We are converting the monitor subsystem to a Single-Paxos architecture,
backed by a key/value store. The previous architecture used a Paxos
instance for each Paxos Service, backed by a nasty Monitor Store that
provided few to no consistency guarantees whatsoever, which led to a fair
amount of workarounds.
Changes:
* Paxos:
- Add k/v store support
- Add documentation describing the new Paxos storage layout and behavior
- Get rid of the stashing code, which was used as a consistency point
mechanism (we no longer need it, because of our k/v store)
- Debug level of 30 will output json-formatted transaction dumps
- Allow proposal queueing; queued proposals are proposed in the same
order as they were queued.
- Drop the 'is_leader()' function, using the Monitor's instead, for
enhanced simplicity.
- Add 'is_lease_valid()' function.
- Disregard 'stashed versions'
- Make the paxos 'state' variable a bit-map, so we lock the proposal
mechanism while maintaining the state [5].
- Related notes: [3]
* PaxosService:
- Add k/v store support, creating wrappers to be used by the services
- Add documentation
- Support single-paxos behavior, creating wrappers to be used by the
services, along with service-specific versions
- Rearrange variables so they are neatly organized in the beginning of
the class
- Add a trim_to() function to be used by the services, instead of letting
them rely on Paxos::trim_to(), which is no longer adequate to the job
at hand
- Debug level of 30 will output json-formatted transaction dumps
- Support proposal queueing, taking it into consideration when
assessing the current state of the service (active, writeable,
readable, ...)
- Redefine the conditions for 'is_{active,readable,writeable}()' given
the new single-paxos approach, with proposal queueing [1].
- Use our own waiting_for_* callback lists, which now must be
dissociated from their Paxos counterparts [2].
- Related notes: [3], [4]
* Monitor:
- Add k/v store support
- Use only one Paxos instance and pass it down to each service instance
- Crank up CEPH_MON_PROTOCOL to 10
* {Auth,Log,MDS,Monmap,OSD,PG}Monitor:
- Add k/v store support
- Add single-paxos support
* AuthMonitor:
- Don't always propose full versions: if the KeyServer doesn't have
keys, we cannot propose a full version. This should only happen when
we start with a brand new store and we are creating the first
pending proposal, and if we were to commit a full version filled
with nothing but a big void of nothingness, we could eventually end
up with a corrupted version.
* Elector:
- Add k/v store support
- Add single-paxos support
* ceph-mon:
- Use the monitor's k/v store instead of MonitorStore
* MMonPaxos:
- remove the machine_id field: This field was used to identify from/to
which paxos service a given message belonged. We no longer have a Paxos
for each service, so this field became obsolete.
Notes:
[1] Redefine the conditions for 'is_{active,readable,writeable}()' on
the PaxosService class, to be used with single-paxos and proposal
queueing:
We should not rely on the Paxos::is_*() functions, since they do not apply
directly to the PaxosService.
All the PaxosService classes share the same Paxos class, but they do not
rely on its values. Each service only relies, uses and updates its own
values on the k/v store. Thus, we may have a given service (e.g., the
OSDMonitor) proposing a new value, hence updating or waiting to update its
store, and we may still consider the LogMonitor as being able to read and
write its own values on the k/v store. In a nutshell, different services
do not overlap on their access to their own store when it comes to reading,
and since the Paxos will queue their updates and deal with them in a FIFO
order, their updates won't overlap either.
Therefore, the conditions for the PaxosService::is_{active,readable,
writeable} differ from those on the Paxos::is_{active,readable,writeable}.
* PaxosService::is_active() - the PaxosService will be considered as
active iff it is not proposing and the Paxos is not recovering. This
means that a given PaxosService (e.g., the OSDMonitor) may be considered
as being active even though some other service (e.g., the LogMonitor) is
proposing a new value and the Paxos is in the UPDATING state. This means
that the OSDMonitor will be able to read its own versions and queue any
changes onto the Paxos. However, if the Paxos is in state RECOVERING,
we cannot be considered as active.
* PaxosService::is_writeable() - We will be able to propose new values
iff we are the Leader, we have a valid lease, and we are not already
proposing. If we are proposing, we must wait for our proposal to finish
in order to proceed with writing to our k/v store; otherwise we could
end up assuming that our last committed version was, say, 10; then
assign map epochs/versions taking that into consideration and make
changes to the store based on those values, only to smash previously
proposed values on the store. We really don't want that. To be fair,
there was a chance we could assume we were always writable, but there
may be unforeseen consequences to this; so we take the conservative
approach here for now, and we will relax it in the future if we believe
it to be fruitful.
* PaxosService::is_readable() - We will be readable iff we are not
proposing and the Paxos is not recovering; if our last committed version
exists; and if we are either a cluster of one or we have a valid lease.
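The three predicates, condensed into an illustrative sketch (member
names are stand-ins for the real Monitor/Paxos state):

    struct PaxosServiceSketch {
      bool proposing = false;         // this service has a proposal in flight
      bool paxos_recovering = false;  // Paxos is in state RECOVERING
      bool is_leader = false;
      bool lease_valid = false;
      bool has_last_committed = false;
      int  quorum_size = 1;

      bool is_active() const {
        return !proposing && !paxos_recovering;
      }
      bool is_writeable() const {
        return is_leader && lease_valid && !proposing;
      }
      bool is_readable() const {
        return is_active() && has_last_committed &&
               (quorum_size == 1 || lease_valid);
      }
    };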
[2] Use own waiting_for_* callback lists on PaxosService, which now must
be dissociated from their Paxos counterparts:
We were relying on Paxos to wait for state changes, but since our state
became somewhat independent from the Paxos state, we have to deal with
callbacks waiting for 'readable', 'writable' or 'active' on different
terms than those that Paxos provides.
So, basically, we will take one of two approaches when it comes to waiting:
* If we are proposing, queue ourselves on our own list, waiting for the
proposal to finish;
* Otherwise, the cause for the need to wait comes from Paxos, so queue
the callback directly on Paxos.
This approach means that we must make sure to check our desired state
whenever the callback is fired up, and re-queue ourselves if the state
didn't quite change (or if it changed but our waiting condition result
didn't). For instance, if we were waiting for a proposal to finish due to
a failed 'is_active()', we will need to recheck if we are active before
continuing once the callback is fired. This is mainly because we may have
finished our proposal, but a new Election may have been called and the
Paxos may not be active.
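The queueing rule, as an illustrative sketch (wait_for_active_paxos()
stands in for queueing the callback on Paxos itself):

    #include <functional>
    #include <list>
    #include <utility>

    struct WaiterSketch {
      bool proposing = false;
      std::list<std::function<void()> > waiting_for_finished_proposal;

      void wait_for_active(std::function<void()> cb) {
        if (proposing) {
          // our own wait: fired when our proposal finishes; the callback
          // must recheck is_active() and re-queue itself if needed
          waiting_for_finished_proposal.push_back(std::move(cb));
        } else {
          // the cause for waiting comes from Paxos, so queue it there
          wait_for_active_paxos(std::move(cb));
        }
      }
      void wait_for_active_paxos(std::function<void()>) { /* stub */ }
    };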
[3] Propose everything in the queue before bootstrapping, but don't
allow new proposals:
The MonmapMonitor may issue bootstraps once it is updated. We must ensure
that we propose every single pending proposal before we actually do it.
However, we don't want to propose if we are going to bootstrap; otherwise,
we may end up losing proposals.
[4] Handle the case when first_committed_version equals 0 on a
PaxosService
In a nutshell, the services do not set the first committed version, as
they consider it as a SEP (Somebody Else's Problem). They do rely on it
though, and we, the PaxosService, must ensure that it contains a valid
value (that is, higher than zero) at all times.
Since we will only have a first_committed version equal to zero once,
and that is before the service's first proposal, we are safe to simply
read the variable from the store and assign the first_committed the same
value as the last_committed iff the first_committed version is zero.
This also affects trimming, since trimming relies on the first_committed
version as the lower bound for version trimming. Even though the k/v store
will gracefully ignore any problem from trying to remove non-existent
versions, the main issue would still stand: we'd be removing a non-existent
version and that just doesn't make any sense.
[5] 'lock' paxos when we are running some internal proposals
Force the paxos services to wait for us to complete whatever we are
doing before they can proceed. This is required because on certain
occasions we might need to run internal proposals, not tied to any of
the paxos services (for instance, when learning an old value), and we need
them to stay put, or they might end up in an erroneous state and crash the
monitor.
This could have been done with an extra bool, but there was no point
in creating a new variable when we can just as easily reuse the
'state' variable for our twisted interests.
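An illustrative sketch of the bit-map (values here are made up, not
the real Paxos state constants):

    #include <cstdint>

    struct PaxosStateSketch {
      static const uint32_t STATE_BITS = 0x0fff; // regular state value
      static const uint32_t LOCKED     = 0x1000; // internal proposal running
      uint32_t state = 0;

      bool is_locked() const { return state & LOCKED; }
      void lock()            { state |= LOCKED; }   // services must stay put
      void unlock()          { state &= ~LOCKED; }
      void set_state(uint32_t s) {
        state = (state & LOCKED) | (s & STATE_BITS); // keep the lock bit
      }
    };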
Fixes: #4175
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Use Boost.Spirit's qi to parse a strictly formatted set of key/value
pairs. Be picky about whitespace. Any subset of the recognized keys is
allowed. Parse the same set of keys as the ceph.*.layout.* vxattrs.
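A minimal sketch of such a qi grammar, assuming integer values and
single-space separators (the actual recognized keys and value types
may differ):

    #include <boost/spirit/include/qi.hpp>
    #include <boost/fusion/include/std_pair.hpp>
    #include <map>
    #include <string>

    namespace qi = boost::spirit::qi;

    // Accept any subset of the recognized keys as "key=value" pairs
    // separated by exactly one space; no skipper, so stray whitespace
    // makes the parse fail.
    bool parse_layout(const std::string& in,
                      std::map<std::string, long long>& out)
    {
      qi::rule<std::string::const_iterator, std::string()> key =
          qi::string("object_size") | qi::string("stripe_unit") |
          qi::string("stripe_count") | qi::string("pool");
      std::string::const_iterator it = in.begin(), end = in.end();
      bool ok = qi::parse(it, end, (key >> '=' >> qi::long_long) % ' ', out);
      return ok && it == end;  // require the whole string to match
    }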
Signed-off-by: Sage Weil <sage@inktank.com>
Allow user to control the minimum level to go to syslog for the client-
and server-side submission paths for the cluster log, along with the syslog
'facility'. See syslog(3) man page.
Also move the level checks into a LogEntry method.
Closes: #3704
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Luis <joao.luis@inktank.com>
Methods called by write_if_dirty() (e.g., get_osdmap()) assert that
the PG is locked.
Backport: bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Fixes: #4177
Backport: bobtail
Listing multipart uploads had a typo, and was requiring the
wrong resource (uploadId instead of uploads).
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Fixes: #4150
Backport: bobtail
When an object is copied onto itself, the object will not be fully
copied: the tail reference count stays the same and only the head part
is rewritten.
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
The existing code overlays the placement of PGs from different pools
because it simply adds the ps to the pool id to form the CRUSH input.
That means that the layout/placement for pg 0.10 == 1.9 == 2.8 == 3.7
== 4.6 == ..., which is not optimal.
Instead, use hash(ps, poolid). This avoids the initial problem of the
sequence being adjacent to other pools' sequences. It also avoids the
(small) possibility that hash(poolid) will drop us somewhere in the
output number space where our sequence of outputs overlaps with some
other pool; instead, our output sequence will be fully random (for a
well-behaved hash).
Use the multi-input hash functions used by CRUSH for this.
Default to the legacy behavior for now. We won't enable this until
deployed systems and kernel code catch up.
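Illustrative comparison of the two inputs (mix() is a stand-in mixer;
the real code uses CRUSH's multi-input hash, not this function):

    #include <cstdint>

    static uint32_t mix(uint32_t a, uint32_t b) {
      a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2);  // placeholder mixer
      return a;
    }

    // legacy: adjacent pools yield overlapping input sequences,
    // e.g. (ps=10, pool=0) and (ps=9, pool=1) both map to 10
    uint32_t legacy_input(uint32_t ps, uint32_t poolid) {
      return ps + poolid;
    }

    // HASHPSPOOL-style: hash ps and poolid together first, so each
    // pool gets its own well-spread output sequence
    uint32_t hashed_input(uint32_t ps, uint32_t poolid) {
      return mix(ps, poolid);
    }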
Fixes: #4128
Signed-off-by: Sage Weil <sage@inktank.com>
In bash, '! command' does not fail the script properly, even with -e
(wtf!). Also, the last pool deletion command succeeds because the pool
'--yes-i-really-really-mean-it' does not exist. So drop that test.
Signed-off-by: Sage Weil <sage@inktank.com>