91 Commits

Author SHA1 Message Date
Adam C. Emerson
750ad8340c common: Unskew clock
In preparation to deglobalizing CephContext, remove the CephContext*
parameter to ceph_clock_now() and ceph::real_clock::now() that carries
a configurable offset.

Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
2016-12-22 13:55:37 -05:00
Yunchuan Wen
1531da2e6c mon/Paxos.h: remove unneeded forward declaration
Signed-off-by: Yunchuan Wen <yunchuan.wen@kylin-cloud.com>
2016-12-01 09:03:00 +00:00
Michal Jarzabek
d21357a7e0 mon/Paxos: move classes to .cc file
Signed-off-by: Michal Jarzabek <stiopa@gmail.com>
2016-09-23 19:43:56 +01:00
cxwshawn
32bff51f07 MON: optimize header file dependency.
same work as PR: https://github.com/ceph/ceph/pull/9161

Signed-off-by: Xiaowei Chen <chen.xiaowei@h3c.com>
2016-05-19 22:18:11 -04:00
Li Peng
21b827bd0d mon: remove duplicated words
Signed-off-by: Li Peng <lip@dtdream.com>
2016-05-03 14:48:20 +08:00
Kefu Chai
00cb296a52 mon: remove remove_legacy_versions()
It is required to upgrade to hammer before moving to jewel, and
hammer already uses the single-paxos monitor. We used to convert
the store.db when starting up from an old monitor (bobtail), but we
stopped doing that conversion in 1d814b7.

Signed-off-by: Kefu Chai <kchai@redhat.com>
2016-03-28 12:37:11 +08:00
Danny Al-Gaaf
bbf0582342 make ctors with one argument explicit
Use explicit keyword for constructors with one argument to
prevent implicit usage as conversion functions.

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
2016-01-29 23:48:58 +01:00
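As an illustration of the change in bbf0582342, here is a minimal sketch of what `explicit` buys on a one-argument constructor. The `Version` type and `next_version()` helper are hypothetical and not taken from the Ceph tree.

```cpp
#include <type_traits>

// Hypothetical type for illustration only.
struct Version {
  unsigned long value;
  // 'explicit' stops the compiler from using this constructor as an
  // implicit conversion, e.g. `Version v = 5;` no longer compiles.
  explicit Version(unsigned long v) : value(v) {}
};

// Callers must spell out the conversion: next_version(Version(41)).
inline unsigned long next_version(const Version& v) { return v.value + 1; }

// Direct construction still works, implicit conversion does not.
static_assert(std::is_constructible<Version, unsigned long>::value,
              "direct construction is allowed");
static_assert(!std::is_convertible<unsigned long, Version>::value,
              "implicit conversion is rejected");
```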
Weijun Duan
fa9c8e09a3 mon: paxos is_recovering calc error
Fixes: #14368

Signed-off-by: Weijun Duan <duanweijun@h3c.com>
2016-01-13 21:16:07 -05:00
Joao Eduardo Luis
2c83e1e2b0 mon: Paxos: have wait_for_* functions requiring ops
Signed-off-by: Joao Eduardo Luis <joao@suse.de>
2015-07-16 18:06:07 +01:00
Joao Eduardo Luis
c713d9a632 mon: optracker (1): support MonOpRequestRef
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
2015-07-16 18:03:35 +01:00
Alexandre Marangone
7f03c8891a be gender neutral
Signed-off-by: Alexandre Marangone <amarango@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2015-03-23 16:59:55 -07:00
Kefu Chai
c1e792d8aa doc: update doc with latest code
* also silence some warnings of doxygen

Signed-off-by: Kefu Chai <kchai@redhat.com>
2015-02-24 16:05:12 +08:00
Sage Weil
364b86813f mon/Paxos: consolidate finish_round()
Signed-off-by: Sage Weil <sage@redhat.com>
2015-01-13 14:51:22 -08:00
Sage Weil
67a90dd75c mon: accumulate a single pending transaction and propose it all at once
Previously we would queue lots of distinct encoded Transactions from various
callers, usually one per PaxosService.  These would be sent through paxos
one at a time.

If there is a completed transaction there is no reason to delay; it is
more efficient to push it through immediately.  Since we will propose
anything pending right when we finish, there is minimal opportunity for
other work to get done.

Instead, accumulate everything in a single MonitorDBStore::Transaction and
propose all pending changes all at once.  Encode at propose time and
expose the Transaction to the callers so they can add their changes.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-01-13 14:51:04 -08:00
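The batching idea in 67a90dd75c can be sketched roughly as follows. `Transaction`, `Proposer`, and their members are simplified, hypothetical stand-ins (the real MonitorDBStore::Transaction is richer, and encoding at propose time is omitted here).

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a store transaction: an ordered list of puts.
struct Transaction {
  std::vector<std::pair<std::string, std::string>> ops;
  void put(const std::string& key, const std::string& value) {
    ops.emplace_back(key, value);
  }
  bool empty() const { return ops.empty(); }
};
using TransactionRef = std::shared_ptr<Transaction>;

class Proposer {
  TransactionRef pending = std::make_shared<Transaction>();
  std::vector<TransactionRef> rounds;  // one entry per proposed batch

public:
  // Callers add their changes to the single shared pending transaction.
  TransactionRef get_pending() { return pending; }

  // Propose everything accumulated so far as one batch.
  // Returns the number of operations proposed (0 if nothing was pending).
  std::size_t propose_pending() {
    if (pending->empty())
      return 0;
    std::size_t n = pending->ops.size();
    rounds.push_back(std::move(pending));
    pending = std::make_shared<Transaction>();  // start a fresh batch
    return n;
  }

  std::size_t rounds_proposed() const { return rounds.size(); }
};
```

Two services that each add a change before `propose_pending()` is called end up in the same round, rather than in two rounds proposed one after the other.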
Joao Eduardo Luis
5461368968 mon: paxos: queue next proposal after waking up callbacks
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
2015-01-09 17:41:17 -08:00
Sage Weil
b1cf210475 mon/Paxos: WRITING != WRITING_PREVIOUS
This distinction is important: the update-previous state should not be
writeable, as reflected by PaxosService::is_writeable().

Signed-off-by: Sage Weil <sage@redhat.com>
2014-08-27 14:36:08 -07:00
Sage Weil
a0e0b9bb2c mon/Paxos: make backend write async
Move into the WRITING state and do the write to leveldb (or whatever the
backend is) asynchronously.

A few tricks here:
 - we can't do the is_updating() state check because we will always be in
   REFRESH.  Instead, make commit_proposal() tolerate the case where it is
   called but the top proposal isn't the one we just did (or the list is
   empty).  This makes the callers simpler.
 - do_refresh() may call bootstrap.  If we do bootstrap while in REFRESH,
   don't do a sync/flush on the backend store because *we* are the async
   completion thread and we'll deadlock.  All other callers need to wait
   for this, though!

Signed-off-by: Sage Weil <sage@redhat.com>
2014-08-27 14:36:08 -07:00
Sage Weil
6a71159ed1 mon/Paxos: add writing and refresh states
The new transition will be

 (updating or updating-previous) -> writing -> refresh -> active

Signed-off-by: Sage Weil <sage@redhat.com>
2014-08-27 14:36:08 -07:00
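The transition chain described in 6a71159ed1 can be written down as a small check. This is only a sketch: the enum and function names are hypothetical, and it covers just the chain quoted in that message, not the transitions into the updating states or the later WRITING_PREVIOUS distinction.

```cpp
// Hypothetical state names mirroring the commit message.
enum class State { Active, Updating, UpdatingPrevious, Writing, Refresh };

// True when moving from `from` to `to` follows
// (updating or updating-previous) -> writing -> refresh -> active.
constexpr bool valid_transition(State from, State to) {
  return ((from == State::Updating || from == State::UpdatingPrevious) &&
          to == State::Writing) ||
         (from == State::Writing && to == State::Refresh) ||
         (from == State::Refresh && to == State::Active);
}
```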
Sage Weil
08f331bee2 mon/Paxos: break commit() into two pieces
One part happens before the txn starts, the other after.  Move all of the
internal state update to the bottom half.  Eventually this will matter.

Signed-off-by: Sage Weil <sage@redhat.com>
2014-08-27 14:36:07 -07:00
Sage Weil
a6a1e994f9 mon: interact with MonitorDBStore::Transactions by shared_ptr Ref
TransactionRef everywhere!

Signed-off-by: Sage Weil <sage@redhat.com>
2014-08-27 14:36:07 -07:00
Sage Weil
b09b8563d3 mon/Paxos: add perfcounters for most paxos operations
I'm focusing primarily on the ones that result in IO here.

Signed-off-by: Sage Weil <sage@redhat.com>
2014-08-12 21:05:40 -07:00
Dmitry Smirnov
f22e2e9a02 spelling corrections
2014-04-17 12:43:30 +10:00
Loic Dachary
ab69d99309 mon: fix typo and remove redundant sentence
Signed-off-by: Loic Dachary <loic@dachary.org>
2013-09-04 12:34:23 +02:00
Loic Dachary
7c09ede7a2 mon: fix typo in comment
Signed-off-by: Loic Dachary <loic@dachary.org>
2013-09-04 12:33:15 +02:00
Sage Weil
7e0848d8f8 mon/Paxos: return whether store_state stored anything
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-08-20 11:27:09 -07:00
Sage Weil
99e605455f mon/Paxos: accepted_pn_from has no semantic meaning
Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-22 14:12:51 -07:00
Sage Weil
a61635e852 ceph-monstore-tool: dump paxos transactions
Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-22 14:12:51 -07:00
Sage Weil
40a8bbdc53 Merge remote-tracking branch 'gh/wip-mon-report' into next
2013-07-15 14:23:40 -07:00
Sage Weil
c711203c0d mon/Paxos: separate proposal commit from the end of the round
Each commit should match with exactly one proposal; finish it when we
actually commit it and make sensible asserts.

The old finish_proposal() turns into finish_round(), and performs
generic checks and cleanup associated with the transition from
updating -> active.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-15 13:42:53 -07:00
Sage Weil
f1ce8d7c95 mon: fix scrub vs paxos race: refresh on commit, not round completion
Consider:

 - paxos starts a commit N+1
 - a majority of the peers ack it
  - paxos::commit() writes N+1 to disk
  - tells peers to commit
 - peers commit N+1, *and* refresh_from_paxos(), and generate N+1 full map
 - leader does _scrub on N+1, without latest full osdmap
 - peers do _scrub on N+1, with latest full osdmap
 - leader finishes paxos gather, does refresh_from_paxos()
 -> scrub fails.

Fix this by doing the refresh_from_paxos() at commit time and not when
the paxos round finishes.  We move the refresh out of finish_proposal
and into its own helper, and update all callers accordingly.  This
keeps on-disk state more tightly in sync with in-memory state and
avoids the need for, e.g., a kludgey workaround in the scrub code.

We also simplify the bootstrap checks a bit by doing so immediately
and relying on the normal bootstrap paxos reset paths to clean up
any waiters.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-15 12:54:56 -07:00
Sage Weil
56c36fa914 mon: include paxos info in report
Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-14 16:16:55 -07:00
Sage Weil
ccceeee57b mon/Paxos: remove unnecessary trim enable/disable
The sync no longer cares if we trim Paxos versions as we go, as long as we
don't trim so fast that we fall behind between GET_CHUNK messages, which
we can consider a tuning problem.

Remove this extra complexity!

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-09 11:05:48 -07:00
Sage Weil
aa33bc88aa mon/Paxos: config min paxos txns to keep separately
We were using paxos_max_join_drift to control the minimum number of
paxos transactions to keep around.  Instead, make this explicit, and
separate from the join drift.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-09 11:05:47 -07:00
Sage Weil
da0aff28ab mon: implement a simpler sync
The previous sync implementation was highly stateful and very complex.
This made it very hard to understand and to debug, and there were bugs
still lurking in the timeout code (at least).

Replace it with something much simpler:

 - sync providers are almost stateless.  they keep an iterator, identified
   by a unique cookie, that times out in a simple way.
 - sync requesters sync from whomever they fancy.  namely anyone with newer
   committed paxos state.

There are a few extra fields that might allow sync continuation later, but
this is complex and not necessary at this point.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-09 11:05:47 -07:00
Sage Weil
516445bebc mon/Paxos: simplify trim()
Collapse all the trim methods into a single simple method.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-06-26 06:55:02 -07:00
Sage Weil
ac63b2e095 mon/Paxos: clean up removal of pre-conversion paxos states
Use a helper, independent of trim machinery, and call on leader, too.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-06-26 06:55:02 -07:00
Sage Weil
ad9c294850 mon/Paxos: assert that the store gives us back what we just wrote
In bug #5424 I observed leveldb failing internally and then returning
bad info.  We then hit a random/confusing assert.  Try to detect this
earlier by verifying that a get of a just-written last_committed gives
us back the right thing.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-06-25 21:25:04 -07:00
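The read-back check from ad9c294850 amounts to the pattern below. The in-memory `Store` and the `commit_version()` helper are hypothetical stand-ins for the leveldb-backed store, used only to show the shape of the assertion.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical in-memory store standing in for the real backend.
struct Store {
  std::map<std::string, std::uint64_t> kv;
  void put(const std::string& key, std::uint64_t value) { kv[key] = value; }
  std::uint64_t get(const std::string& key) const {
    auto it = kv.find(key);
    return it == kv.end() ? 0 : it->second;
  }
};

// Write last_committed, then immediately read it back and fail fast if
// the backend returns something other than what was just written.
inline std::uint64_t commit_version(Store& store, std::uint64_t version) {
  store.put("last_committed", version);
  std::uint64_t readback = store.get("last_committed");
  assert(readback == version);  // catch a misbehaving backend early
  return readback;
}
```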
Sage Weil
ee34a21960 mon: simplify states
- make states mutually exclusive (an enum)
- rename locked -> updating_previous
- set state prior to begin() to simplify things a bit

Signed-off-by: Sage Weil <sage@inktank.com>
2013-06-19 11:27:05 -07:00
Sage Weil
7b7ea8e30e mon/Paxos: cleanup: drop unused PREPARING state bit
This is never set when we block, and nobody looks at it.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-06-19 11:27:05 -07:00
Sage Weil
a42d7582f8 mon/Paxos: do paxos refresh in finish_proposal; and refactor
Do the paxos refresh inside finish_proposal, ordered *after* the leader
assertion so that MonmapMonitor::update_from_paxos() calling bootstrap()
does not kill us.

Also, remove unnecessary finish_queued_proposal() and move the logic inline
where the bad leader assertion is obvious.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-06-19 11:27:04 -07:00
Sage Weil
f1ccb2d808 mon: start lease timer from peon_init()
In the scenario:

 - leader wins, peons lose
 - leader sees it is too far behind on paxos and bootstraps
 - leader tries to sync with someone, waits for a quorum of the others
 - peons sit around forever waiting

The problem is that they never time out because paxos never issues a lease,
which is the normal timeout that lets them detect a leader failure.

Avoid this by starting the lease timeout as soon as we lose the election.
The timeout callback just does a bootstrap and does not rely on any other
state.

I see one possible danger here: there may be some "normal" cases where the
leader takes a long time to issue its first lease that we currently
tolerate, but won't with this new check in place.  I hope that raising
the lease interval/timeout or reducing the allowed paxos drift will make
that a non-issue.  If it is problematic, we will need a separate explicit
"i am alive" from the leader while it is getting ready to issue the lease
to prevent a live-lock.

Backport: cuttlefish, bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-31 17:09:19 -07:00
Sage Weil
6b8e74f064 mon/Paxos: adjust trimming defaults up; rename options
- trim more at a time (by an order of magnitude)
- rename fields to paxos_trim_{min,max}; only trim when there are min items
  that are trimmable, and trim at most max items at a time.
- adjust the paxos_service_trim_{min,max} values up by a factor of 2.

Since we are compacting every time we trim, adjusting these up means less
frequent compactions and less overall work for the monitor.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-31 17:05:03 -07:00
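The min/max rule described in 6b8e74f064 reduces to a few lines. The function name and parameters here are hypothetical, and no particular default values are implied.

```cpp
#include <algorithm>
#include <cstdint>

// How many versions to trim right now: nothing unless at least
// `trim_min` versions are trimmable, and never more than `trim_max`.
inline std::uint64_t versions_to_trim(std::uint64_t trimmable,
                                      std::uint64_t trim_min,
                                      std::uint64_t trim_max) {
  if (trimmable < trim_min)
    return 0;  // not enough work to justify a trim (and a compaction)
  return std::min(trimmable, trim_max);
}
```

Raising the minimum makes trims (and the compaction that follows each one) rarer, which is the effect the commit message is after.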
Joao Eduardo Luis
e15d290945 mon: Paxos: get rid of the 'prepare_bootstrap()' mechanism
We don't need it after all.  If we are in the middle of some proposal,
then we guarantee that said proposal is likely to be retried.  If we
haven't yet proposed, then it's forever more likely that a client will
eventually retry the message that triggered this proposal.

Basically, this mechanism attempted to fix a non-problem, and was in
fact triggering some unforeseen issues that would have required increasing
the code complexity for no good reason.

Fixes: #5102

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-05-22 17:12:38 +01:00
Sage Weil
3a6138b25e mon/Paxos: don't ignore peer first_committed
We go to the effort of keeping a map of the peer's first/last committed
so that we can send the right commits during the first phase of paxos,
but we forgot to record the first value.  This appears to simply be an
oversight.  It is mostly harmless; it just means we send extra states
that the peer already has.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-01 10:57:47 -07:00
Yan, Zheng
cea2ff8615 mon: Fix leak of context
Use Context::complete() to finish the context; it frees the context
after executing Context::finish().

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-04-28 21:15:25 -07:00
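The ownership rule behind cea2ff8615 can be sketched as below: calling `complete()` runs the callback and then frees the object, whereas calling `finish()` alone would leave the context allocated. The classes are simplified, hypothetical stand-ins rather than the real Ceph definitions.

```cpp
// Simplified stand-in for a completion-callback base class.
class Context {
protected:
  virtual void finish(int r) = 0;

public:
  virtual ~Context() {}
  // Run the callback, then free the object. Only valid for contexts
  // allocated with `new`, since it ends with `delete this`.
  void complete(int r) {
    finish(r);
    delete this;
  }
};

// Counters let a caller observe that finish() ran and the object was freed.
struct Counters {
  int finished = 0;
  int destroyed = 0;
};

class C_Record : public Context {
  Counters& counters;
  void finish(int) override { ++counters.finished; }

public:
  explicit C_Record(Counters& c) : counters(c) {}
  ~C_Record() override { ++counters.destroyed; }
};
```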
Joao Eduardo Luis
b99367bfb2 mon: Paxos: only finish a queued proposal if there's actually *any*
When proposing an older value learned during recovery, we don't create
a queued proposal -- we go straight through Paxos.  Therefore, when
finishing a proposal, we must be sure that we have a proposal in the queue
before dereferencing it, otherwise we will segfault.

Fixes: #4250

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-03-13 15:03:00 -07:00
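The guard added in b99367bfb2 follows a familiar pattern: check that the queue is non-empty before touching its front element. A minimal sketch, with a hypothetical `Proposal` type and helper name:

```cpp
#include <list>
#include <string>

// Hypothetical queued proposal.
struct Proposal {
  std::string name;
};

// Finish the proposal at the front of the queue, if there is one.
// Returns true if a queued proposal was finished, false if the queue was
// empty (e.g. the value came from recovery and was never queued).
inline bool finish_front_proposal(std::list<Proposal>& queue) {
  if (queue.empty())
    return false;  // nothing queued, so nothing to dereference
  queue.pop_front();
  return true;
}
```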
Danny Al-Gaaf
26e8577d29 Paxos.h: pass string name function parameter by reference
Pass 'const string name' function parameter by reference.

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
2013-02-26 18:50:13 +01:00
Danny Al-Gaaf
350481f90f Paxos.h: fix dangerous use of c_str()
No need to use c_str() in get_statename(), simply return a
std::string instead.

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
2013-02-25 14:10:20 +01:00
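One common way c_str() becomes dangerous is returning a pointer into a std::string that is destroyed when the function returns; whether that was the exact pattern fixed in 350481f90f is an assumption. Returning std::string by value avoids the problem, as in this sketch (the enum and state names are illustrative):

```cpp
#include <string>

// Illustrative subset of states; not the real list.
enum class State { Recovering, Active, Updating };

// Returning std::string by value gives the caller its own copy, so no
// pointer can outlive the buffer it points into.
inline std::string get_statename(State s) {
  switch (s) {
    case State::Recovering: return "recovering";
    case State::Active:     return "active";
    case State::Updating:   return "updating";
  }
  return "UNKNOWN";
}
```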
Joao Eduardo Luis
cb85fb7d9a mon: ceph-mon: convert an old monitor store to the new format
With the single-paxos patches we shifted from an approach with multiple
paxos instances (one for each paxos service) keeping their own versions
to a single paxos instance for all the paxos services, thus ending up
with a single global version for paxos.

With the release of v0.52, the monitor started tracking these global
versions, keeping them for the single purpose of making it possible to
convert the store to a single-paxos format.

This patch now introduces a mechanism to convert a GV-enabled store to
the single-paxos format store when the monitor is upgraded.

As we require the global versions to be present, we first check if the
store has the GV feature set: if not we will not proceed, but we will
start the conversion otherwise.

At the end of the conversion, the monitor data directory will have a
brand new 'store.db' directory, where the key/value store lies,
alongside the old store.  This makes it possible to revert to a
previous monitor version if things go sideways, without jeopardizing the
data in the store.

The conversion happens during a rolling upgrade, without any
intervention by the user.  Fire up the new monitor version on an old
store, and the monitor itself will convert the store, trim any lingering
versions that might not be required, and proceed to start as expected.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-02-21 18:02:23 +00:00
Joao Eduardo Luis
cab3411b4a mon: Monitor: Add monitor store synchronization support
Synchronize two monitor stores when one of the monitors has diverged
significantly from the remaining monitor cluster.

This process roughly consists of the following steps:

  0. mon.X tries to join the cluster;
  1. mon.X verifies that it has diverged from the remaining cluster;
  2. mon.X asks the leader to sync;
  3. the leader allows mon.X to sync, pointing out a mon.Y from
     which mon.X should sync;
  4. mon.X asks mon.Y to sync;
  5. mon.Y sends its own store in one or more chunks;
  6. mon.X acks each received chunk; go to 5;
  7. mon.X receives the last chunk from mon.Y;
  8. mon.X informs the leader that it has finished synchronizing;
  9. the leader acks mon.X's finished sync;
 10. mon.X bootstraps and retries joining the cluster (goto 0.)

This is the simplest and most straightforward process that can be hoped
for. However, things may go sideways at any time (monitors failing, for
instance), which could potentially lead to a corrupted monitor store.
There are, however, mechanisms at work to avoid such a scenario at any
step of the process.

Some of these mechanisms include:

 - aborting the sync if the leader fails or leadership changes;
 - state barriers on synchronization functions to avoid stray/outdated
   messages from interfering on the normal monitor behavior or on-going
   synchronization;
 - store clean-up before any synchronization process starts;
 - store clean-up if a sync process fails;
 - resuming sync from a different monitor mon.Z if mon.Y fails mid-sync;
 - several timeouts to guarantee that all the involved parties are still
   alive and participating in the sync effort;
 - request forwarding when mon.X contacts a monitor outside the quorum
   that might know who the leader is (or might know someone who does)
   [4].

Changes:
  - Adapt the MMonProbe message for the single-paxos approach, dropping
    the version map and using a lower and upper bound version instead.
  - Remove old slurp code.
  - Add 'sync force' command; 'sync_force' through the admin socket.

Notes:

[1] It's important to keep track of the paxos version at the time at
    which a store sync starts.  Given that after the sync we end up with
    the same state as the monitor we are synchronizing from, there is a
    chance that we might end up with an uncommitted paxos version if we
    are synchronizing with the leader (there's some paxos stashing done
    prior to commit on the leader).  By keeping track of the version at
    which the sync started, we can then let the requester know at which
    version it should cap its paxos store.

[2] Furthermore, the enforced paxos cap, described in [1], is even more
    important if we consider the need to reapply the paxos versions that
    were received during the sync, to make sure the paxos store is
    consistent.  If we happened to have some yet-uncommitted version in
    the store, we could end up applying it.

[3] What is described in [1] and [2]:

Fixes: #4026
Fixes: #4037
Fixes: #4040

[4] Whenever a given monitor mon.X is on the probing phase and notices
    that there is a mon.Y with a paxos version considerably higher than
    the one mon.X has, then mon.X will attempt to synchronize from
    mon.Y.  This is the basis for the store sync.  While this generally
    holds, there is a chance that, by the time mon.Y handles the sync
    request from mon.X, mon.Y is already attempting a sync itself with
    some other mon.Z.  In this case, the appropriate thing for mon.Y to
    do is to forward mon.X's request to mon.Z, as mon.Z should be part
    of the quorum, know who the leader is, or be the leader itself.  If
    not, it is at least guaranteed that mon.Z has a higher version than
    both mon.X and mon.Y, so it should be okay to sync from it.

Fixes: #4162

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-02-21 18:02:22 +00:00
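Steps 5 to 7 of the sync process in cab3411b4a (the provider sends its store in chunks, the requester acks each one) can be sketched in miniature as follows. The types and function names are hypothetical, a chunk size of at least 1 is assumed, and messaging, timeouts, and failure handling are left out.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Store = std::map<std::string, std::string>;
using Chunk = std::vector<std::pair<std::string, std::string>>;

// Provider side (mon.Y): cut the store into chunks of at most
// `chunk_size` key/value pairs (chunk_size >= 1 assumed).
inline std::vector<Chunk> make_chunks(const Store& src, std::size_t chunk_size) {
  std::vector<Chunk> chunks;
  Chunk current;
  for (const auto& kv : src) {
    current.emplace_back(kv.first, kv.second);
    if (current.size() == chunk_size) {
      chunks.push_back(std::move(current));
      current.clear();  // reset the moved-from chunk before reuse
    }
  }
  if (!current.empty())
    chunks.push_back(std::move(current));
  return chunks;
}

// Requester side (mon.X): apply every chunk and count one ack per chunk.
inline std::size_t apply_chunks(const std::vector<Chunk>& chunks, Store& dst) {
  std::size_t acks = 0;
  for (const auto& chunk : chunks) {
    for (const auto& kv : chunk)
      dst[kv.first] = kv.second;
    ++acks;
  }
  return acks;
}
```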