In preparation for deglobalizing CephContext, remove from ceph_clock_now()
and ceph::real_clock::now() the CephContext* parameter, which carries a
configurable offset.
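A rough illustration of the shape of the change (FakeContext and the two
helpers below are stand-ins for exposition, not the actual Ceph symbols;
only the removal of the context parameter matters):

  #include <chrono>

  // Stand-in for the old interface: the context carried a configurable
  // clock offset that was applied to every reading.
  struct FakeContext { std::chrono::seconds clock_offset{0}; };

  // Old shape: every caller had to thread a context through.
  std::chrono::system_clock::time_point clock_now_old(const FakeContext *cct) {
    return std::chrono::system_clock::now() +
           (cct ? cct->clock_offset : std::chrono::seconds(0));
  }

  // New shape: no context parameter at all.
  std::chrono::system_clock::time_point clock_now_new() {
    return std::chrono::system_clock::now();
  }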
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
As it's required to upgrade to hammer before moving to jewel, and
hammer already uses the single-paxos monitor, there is no longer any
need to convert the store.db when starting up from an old (bobtail)
monitor. We stopped doing the conversion in 1d814b7.
Signed-off-by: Kefu Chai <kchai@redhat.com>
Use explicit keyword for constructors with one argument to
prevent implicit usage as conversion functions.
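For illustration, the effect on a hypothetical single-argument constructor
(MonCommand here is made up for the example):

  #include <string>

  struct MonCommand {
    explicit MonCommand(const std::string &name) : name(name) {}
    std::string name;
  };

  void run(const MonCommand &c);

  // run("status");               // no longer compiles: no implicit conversion
  // run(MonCommand("status"));   // the conversion must now be spelled out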
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
Previously we would queue lots of distinct encoded Transactions from various
callers, usually one per PaxosService. These would be sent through paxos
one at a time.
If there is a completed transaction there is no reason to delay; it is
more efficient to push it through immediately. Since we will propose
anything pending right when we finish, there is minimal opportunity for
other work to get done.
Instead, accumulate everything in a single MonitorDBStore::Transaction and
propose all pending changes all at once. Encode at propose time and
expose the Transaction to the callers so they can add their changes.
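Schematically, the pattern moves from "each caller queues its own pre-encoded
transaction" to "each caller appends to one shared pending transaction that is
encoded only at propose time". A minimal sketch, with placeholder names rather
than the real MonitorDBStore API:

  #include <memory>
  #include <string>
  #include <utility>
  #include <vector>

  // Placeholder standing in for MonitorDBStore::Transaction.
  struct Txn {
    std::vector<std::pair<std::string, std::string>> ops;
    void put(const std::string &k, const std::string &v) { ops.emplace_back(k, v); }
    std::string encode() const {              // serialize everything at once
      std::string out;
      for (const auto &op : ops) { out += op.first; out += '\0'; out += op.second; out += '\0'; }
      return out;
    }
  };

  struct PaxosSketch {
    std::shared_ptr<Txn> pending;

    // Callers (the PaxosServices) add their changes here instead of queueing
    // their own encoded transactions.
    std::shared_ptr<Txn> get_pending_transaction() {
      if (!pending)
        pending = std::make_shared<Txn>();
      return pending;
    }

    // Encode the single accumulated transaction only when we actually propose.
    std::string propose_pending() {
      if (!pending || pending->ops.empty())
        return {};
      std::string bl = pending->encode();
      pending.reset();
      return bl;   // in the real code this blob is what goes through paxos
    }
  };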
Signed-off-by: Sage Weil <sage@redhat.com>
This distinction is important: the update-previous state should not be
writeable, as reflected by PaxosService::is_writeable().
Signed-off-by: Sage Weil <sage@redhat.com>
Move into the WRITING state and do the write to leveldb (or whatever the
backend is) asynchronously.
A few tricks here:
- we can't do the is_updating() state check because we will always be in
REFRESH. Instead, make commit_proposal() tolerate the case where it is
called but the top proposal isn't the one we just did (or the list is
empty), as sketched below. This makes the callers simpler.
- do_refresh() may call bootstrap. If we do bootstrap while in REFRESH,
don't do a sync/flush on the backend store because *we* are the async
completion thread and we'll deadlock. All other callers need to wait
for this, though!
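A sketch of the tolerance described in the first point above (the list and
names are placeholders for the queued-proposal bookkeeping, not the actual
Paxos members):

  #include <list>

  struct Context { virtual ~Context() = default; virtual void finish(int r) = 0; };
  struct ProposalSketch { Context *onfinished; };

  // Only complete the front proposal if there is one and it really is the
  // proposal whose commit just finished; otherwise this is a no-op.
  void commit_proposal_sketch(std::list<ProposalSketch> &proposals, bool front_is_ours) {
    if (proposals.empty() || !front_is_ours)
      return;
    ProposalSketch p = proposals.front();
    proposals.pop_front();
    p.onfinished->finish(0);
  }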
Signed-off-by: Sage Weil <sage@redhat.com>
One part happens before the txn starts, the other after. Move all of the
internal state updates to the bottom half. Eventually this will matter.
Signed-off-by: Sage Weil <sage@redhat.com>
Each commit should match with exactly one proposal; finish it when we
actually commit it and make sensible asserts.
The old finish_proposal() turns into finish_round(), and performs
generic checks and cleanup associated with the transition from
updating -> active.
Signed-off-by: Sage Weil <sage@inktank.com>
Consider:
- paxos starts a commit N+1
- a majority of the peers ack it
- paxos::commit() writes N+1 to disk
- tells peers to commit
- peers commit N+1, *and* refresh_from_paxos(), and generate N+1 full map
- leader does _scrub on N+1, without latest full osdmap
- peers do _scrub on N+1, with latest full osdmap
- leader finishes paxos gather, does refresh_from_paxos()
-> scrub fails.
Fix this by doing the refresh_from_paxos() at commit time and not when
the paxos round finishes. We move the refresh out of finish_proposal
and into its own helper, and update all callers accordingly. This
keeps on-disk state more tightly in sync with in-memory state and
avoids the need for, e.g., a kludgey workaround in the scrub code.
We also simplify the bootstrap checks a bit by bootstrapping immediately
and relying on the normal bootstrap paxos reset paths to clean up
any waiters.
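Schematically, the ordering change looks like this (the member functions
below are stand-ins; only the position of the refresh matters):

  #include <functional>

  struct MonitorSketch {
    std::function<void()> write_commit;        // persist version N+1
    std::function<void()> refresh_from_paxos;  // reload services from the store
    std::function<void()> finish_round;        // lease/round bookkeeping only

    // Old order (problematic): refresh deferred until the round finishes,
    // so peers can observe N+1 before the leader has refreshed.
    void commit_old() { write_commit(); /* ...acks... */ finish_round(); refresh_from_paxos(); }

    // New order: refresh right after the commit hits disk, keeping in-memory
    // state in step with what anyone else can see.
    void commit_new() { write_commit(); refresh_from_paxos(); /* ...acks... */ finish_round(); }
  };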
Signed-off-by: Sage Weil <sage@inktank.com>
The sync no longer cares if we trim Paxos versions as we go, as long as we
don't trim so fast that we fall behind between GET_CHUNK messages, which
we can consider a tuning problem.
Remove this extra complexity!
Signed-off-by: Sage Weil <sage@inktank.com>
We were using paxos_max_join_drift to control the minimum number of
paxos transactions to keep around. Instead, make this explicit, and
separate from the join drift.
Signed-off-by: Sage Weil <sage@inktank.com>
The previous sync implementation was highly stateful and very complex.
This made it very hard to understand and to debug, and there were bugs
still lurking in the timeout code (at least).
Replace it with something much simpler:
- sync providers are almost stateless; they keep an iterator, identified
by a unique cookie, that times out in a simple way (see the sketch below).
- sync requesters sync from whomever they fancy, namely anyone with newer
committed paxos state.
There are a few extra fields that might allow sync continuation later, but
this is complex and not necessary at this point.
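A rough sketch of the provider side (the struct and field names are
illustrative, not the actual Monitor code):

  #include <cstdint>
  #include <map>
  #include <string>

  // Each provider entry is little more than an iterator position plus a
  // deadline; if the requester disappears, the entry simply times out.
  struct SyncProviderSketch {
    uint64_t cookie;       // unique handle the requester echoes back
    std::string last_key;  // where the store iterator left off
    double timeout;        // absolute expiry, refreshed on every GET_CHUNK
  };

  std::map<uint64_t, SyncProviderSketch> sync_providers;  // keyed by cookie

  void expire_providers(double now) {
    for (auto it = sync_providers.begin(); it != sync_providers.end(); ) {
      if (it->second.timeout < now)
        it = sync_providers.erase(it);   // stateless cleanup, no handshake needed
      else
        ++it;
    }
  }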
Signed-off-by: Sage Weil <sage@inktank.com>
In bug #5424 I observed leveldb failing internally and then returning
bad info. We then hit a random/confusing assert. Try to detect this
earlier by verifying that a get of a just-written last_committed gives
us back the right thing.
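Conceptually, the check amounts to a read-after-write assertion (store API
simplified here; the real code goes through MonitorDBStore):

  #include <cassert>
  #include <cstdint>
  #include <map>
  #include <string>

  std::map<std::string, std::string> store;   // stand-in for the backing kv store

  void put_last_committed(uint64_t v) {
    store["paxos/last_committed"] = std::to_string(v);

    // Read back what we just wrote; if the backend silently failed, fail
    // loudly here instead of hitting a confusing assert much later.
    auto it = store.find("paxos/last_committed");
    assert(it != store.end() && it->second == std::to_string(v));
  }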
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
- make states mutually exclusive (an enum)
- rename locked -> updating_previous
- set state prior to begin() to simplify things a bit
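Roughly the resulting state set (an illustrative subset; later commits add
writing/refresh states on top of these):

  enum paxos_state_t {
    STATE_RECOVERING,        // leader/peon recovery after an election
    STATE_ACTIVE,            // idle, lease valid, new proposals allowed
    STATE_UPDATING,          // a proposal of our own is in flight
    STATE_UPDATING_PREVIOUS, // re-proposing an uncommitted value from recovery
  };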
Signed-off-by: Sage Weil <sage@inktank.com>
Do the paxos refresh inside finish_proposal, ordered *after* the leader
assertion so that MonmapMonitor::update_from_paxos() calling bootstrap()
does not kill us.
Also, remove unnecessary finish_queued_proposal() and move the logic inline
where the bad leader assertion is obvious.
Signed-off-by: Sage Weil <sage@inktank.com>
In the scenario:
- leader wins, peons lose
- leader sees it is too far behind on paxos and bootstraps
- leader tries to sync with someone, waits for a quorum of the others
- peons sit around forever waiting
The problem is that the peons never time out because paxos never issues a
lease; the lease timeout is normally what lets them detect a leader failure.
Avoid this by starting the lease timeout as soon as we lose the election.
The timeout callback just does a bootstrap and does not rely on any other
state.
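A sketch of the idea (the timer wiring below is simplified; bootstrap()
stands in for the monitor's usual bootstrap path):

  #include <functional>

  struct PeonSketch {
    std::function<void(double, std::function<void()>)> add_timer;  // delay, callback
    double lease_timeout_interval = 10.0;

    void lose_election() {
      // ...usual lose_election bookkeeping...
      // Arm the same timer a lease extension would normally reset.
      add_timer(lease_timeout_interval, [this] {
        // No lease arrived in time: the leader is stuck or gone; start over.
        bootstrap();
      });
    }

    void bootstrap() { /* probe peers and rejoin; placeholder body */ }
  };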
I see one possible danger here: there may be some "normal" cases where the
leader takes a long time to issue its first lease that we currently
tolerate, but won't with this new check in place. I hope that raising
the lease interval/timeout or reducing the allowed paxos drift will make
that a non-issue. If it is problematic, we will need a separate explicit
"i am alive" from the leader while it is getting ready to issue the lease
to prevent a live-lock.
Backport: cuttlefish, bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
- trim more at a time (by an order of magnitude)
- rename fields to paxos_trim_{min,max}; only trim when there are at least
min items that are trimmable, and trim at most max items at a time (see
the sketch below).
- adjust the paxos_service_trim_{min,max} values up by a factor of 2.
Since we are compacting every time we trim, adjusting these up means less
frequent compactions and less overall work for the monitor.
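The gating logic, schematically (option names match the new fields; the
surrounding code is simplified):

  #include <algorithm>
  #include <cstdint>

  // Only trim when at least paxos_trim_min versions could go, and never
  // remove more than paxos_trim_max in one pass; returns the new
  // first_committed.
  uint64_t maybe_trim(uint64_t first_committed, uint64_t trim_to,
                      uint64_t paxos_trim_min, uint64_t paxos_trim_max) {
    if (trim_to <= first_committed)
      return first_committed;                  // nothing trimmable
    uint64_t trimmable = trim_to - first_committed;
    if (trimmable < paxos_trim_min)
      return first_committed;                  // not worth a compaction yet
    return first_committed + std::min(trimmable, paxos_trim_max);
  }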
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
We don't need it after all. If we are in the middle of some proposal,
then we guarantee that said proposal is likely to be retried. If we
haven't yet proposed, then it's all the more likely that a client will
eventually retry the message that triggered this proposal.
Basically, this mechanism attempted to fix a non-problem, and was in
fact triggering some unforeseen issues that would have required increasing
the code complexity for no good reason.
Fixes: #5102
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
We go to the effort of keeping a map of the peer's first/last committed
so that we can send the right commits during the first phase of paxos,
but we forgot to record the first value. This appears to simply be an
oversight. It is mostly harmless; it just means we send extra states
that the peer already has.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
When proposing an older value learned during recovery, we don't create
a queued proposal -- we go straight through Paxos. Therefore, when
finishing a proposal, we must be sure that we have a proposal in the queue
before dereferencing it, otherwise we will segfault.
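The essence of the fix, schematically (the list below stands in for the
queued-proposal list; the guard is what matters):

  #include <list>

  struct Ctx { void complete(int) {} };
  std::list<Ctx*> proposals;   // stand-in for the queued proposals

  void finish_proposal_sketch() {
    // A value learned during recovery was proposed directly, so nothing may
    // have been queued; guard before touching the front element.
    if (!proposals.empty()) {
      Ctx *c = proposals.front();
      proposals.pop_front();
      c->complete(0);
    }
    // ...rest of the post-commit bookkeeping...
  }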
Fixes: #4250
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
With the single-paxos patches we shifted from an approach with multiple
paxos instances (one for each paxos service) keeping their own versions
to a single paxos instance for all the paxos services, thus ending up
with a single global version for paxos.
With the release of v0.52, the monitor started tracking these global
versions, keeping them for the single purpose of making it possible to
convert the store to a single-paxos format.
This patch introduces a mechanism to convert a GV-enabled store to the
single-paxos format when the monitor is upgraded.
As we require the global versions to be present, we first check if the
store has the GV feature set: if it does not, we do not proceed; otherwise
we start the conversion.
At the end of the conversion, the monitor data directory will have a
brand new 'store.db' directory, where the key/value store lives,
alongside the old store. This makes it possible to revert to a
previous monitor version if things go sideways, without jeopardizing the
data in the store.
The conversion is done as part of a rolling upgrade, without any
intervention by the user. Fire up the new monitor version on an old
store, and the monitor itself will convert the store, trim any lingering
versions that might not be required, and proceed to start as expected.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Synchronize two monitor stores when one of the monitors has diverged
significantly from the remaining monitor cluster.
This process roughly consists of the following steps:
0. mon.X tries to join the cluster;
1. mon.X verifies that it has diverged from the remaining cluster;
2. mon.X asks the leader to sync;
3. the leader allows mon.X to sync, pointing out a mon.Y from
which mon.X should sync;
4. mon.X asks mon.Y to sync;
5. mon.Y sends its own store in one or more chunks;
6. mon.X acks each received chunk; go to 5;
7. mon.X receives the last chunk from mon.Y;
8. mon.X informs the leader that it has finished synchronizing;
9. the leader acks mon.X's finished sync;
10. mon.X bootstraps and retries joining the cluster (goto 0.)
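For orientation, the requester's side of the steps above can be summarized
as a small state progression (the state names below are descriptive, not
the exact identifiers used in the sync code):

  enum class SyncState {
    NONE,            // 0-1: probing; noticed we have diverged
    START_LEADER,    // 2-3: asked the leader; told which mon.Y to pull from
    START_PROVIDER,  // 4:   asked mon.Y to start sending
    CHUNKS,          // 5-7: receive a chunk, ack it, repeat until the last one
    FINISH,          // 8-9: tell the leader we are done and wait for its ack
    // 10: bootstrap and go back to probing (NONE)
  };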
This is the simplest and most straightforward process one can hope
for. However, things may go sideways at any time (monitors failing, for
instance), which could potentially lead to a corrupted monitor store.
There are, however, mechanisms at work to avoid such a scenario at any
step of the process.
Some of these mechanisms include:
- aborting the sync if the leader fails or leadership changes;
- state barriers on synchronization functions to keep stray/outdated
messages from interfering with normal monitor behavior or an on-going
synchronization;
- store clean-up before any synchronization process starts;
- store clean-up if a sync process fails;
- resuming sync from a different monitor mon.Z if mon.Y fails mid-sync;
- several timeouts to guarantee that all the involved parties are still
alive and participating in the sync effort;
- request forwarding when mon.X contacts a monitor outside the quorum
that might know who the leader is (or might know someone who does)
[4].
Changes:
- Adapt the MMonProbe message for the single-paxos approach, dropping
the version map and using a lower and upper bound version instead.
- Remove old slurp code.
- Add 'sync force' command; 'sync_force' through the admin socket.
Notes:
[1] It's important to keep track of the paxos version at the time at
which a store sync starts. Given that after the sync we end up with
the same state as the monitor we are synchronizing from, there is a
chance that we might end up with an uncommitted paxos version if we
are synchronizing with the leader (there's some paxos stashing done
prior to commit on the leader). By keeping track of the version at which
the sync started, we can then tell the requester to which version it
should cap its paxos store.
[2] Furthermore, the enforced paxos cap, described in [1], is even more
important if we consider the need to reapply the paxos versions that
were received during the sync, to make sure the paxos store is
consistent. If we happened to have some yet-uncommitted version in
the store, we could end up applying it.
[3] What is described in [1] and [2]:
Fixes: #4026
Fixes: #4037
Fixes: #4040
[4] Whenever a given monitor mon.X is in the probing phase and notices
that there is a mon.Y with a paxos version considerably higher than
the one mon.X has, mon.X will attempt to synchronize from mon.Y.
This is the basis for the store sync. However, there is a chance
that, by the time mon.Y handles the sync request from mon.X, mon.Y
is already attempting a sync itself with some other mon.Z. In this
case, the appropriate thing for mon.Y to do is to forward mon.X's
request to mon.Z, as mon.Z should be part of the quorum, know who
the leader is, or be the leader itself -- if not, at least it is
guaranteed that mon.Z has a higher version than both mon.X and
mon.Y, so it should be okay to sync from it.
Fixes: #4162
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>