RepoMirrors/ceph

mirror of https://github.com/ceph/ceph synced 2025-01-12 14:10:27 +00:00

Author	SHA1	Message	Date
Adam C. Emerson	750ad8340c	common: Unskew clock In preparation to deglobalizing CephContext, remove the CephContext* parameter to ceph_clock_now() and ceph::real_clock::now() that carries a configurable offset. Signed-off-by: Adam C. Emerson <aemerson@redhat.com>	2016-12-22 13:55:37 -05:00
Yunchuan Wen	1531da2e6c	mon/Paxos.h: remove unneeded forward declaration Signed-off-by: Yunchuan Wen <yunchuan.wen@kylin-cloud.com>	2016-12-01 09:03:00 +00:00
Michal Jarzabek	d21357a7e0	mon/Paxos: move classes to .cc file Signed-off-by: Michal Jarzabek <stiopa@gmail.com>	2016-09-23 19:43:56 +01:00
cxwshawn	32bff51f07	MON: optimize header file dependency. same work as PR: https://github.com/ceph/ceph/pull/9161 Signed-off-by: Xiaowei Chen <chen.xiaowei@h3c.com>	2016-05-19 22:18:11 -04:00
Li Peng	21b827bd0d	mon: remove duplicated words Signed-off-by: Li Peng <lip@dtdream.com>	2016-05-03 14:48:20 +08:00
Kefu Chai	00cb296a52	mon: remove remove_legacy_versions() as it's required to upgrade to hammer before moving to jewel, and hammer is already using the single-paxos monitor. and we convert the store.db when starting up from an old monitor (bobtail). we stopped the conversion since `1d814b7`. Signed-off-by: Kefu Chai <kchai@redhat.com>	2016-03-28 12:37:11 +08:00
Danny Al-Gaaf	bbf0582342	make ctors with one argument explicit Use explicit keyword for constructors with one argument to prevent implicit usage as conversion functions. Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>	2016-01-29 23:48:58 +01:00
Weijun Duan	fa9c8e09a3	mon: paxos is_recovering calc error Fix:#14368 Signed-off-by: Weijun Duan <duanweijun@h3c.com>	2016-01-13 21:16:07 -05:00
Joao Eduardo Luis	2c83e1e2b0	mon: Paxos: have wait_for_* functions requiring ops Signed-off-by: Joao Eduardo Luis <joao@suse.de>	2015-07-16 18:06:07 +01:00
Joao Eduardo Luis	c713d9a632	mon: optracker (1): support MonOpRequestRef Signed-off-by: Joao Eduardo Luis <joao@redhat.com>	2015-07-16 18:03:35 +01:00
Alexandre Marangone	7f03c8891a	be gender neutral Signed-off-by: Alexandre Marangone <amarango@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>	2015-03-23 16:59:55 -07:00
Kefu Chai	c1e792d8aa	doc: update doc with latest code * also silence some warnings of doxygen Signed-off-by: Kefu Chai <kchai@redhat.com>	2015-02-24 16:05:12 +08:00
Sage Weil	364b86813f	mon/Paxos: consolidate finish_round() Signed-off-by: Sage Weil <sage@redhat.com>	2015-01-13 14:51:22 -08:00
Sage Weil	67a90dd75c	mon: accumulate a single pending transaction and propose it all at once Previous we would queue lots of distinct encoded Transactions from various callers, usually one per PaxosService. These would be sent through paxos one at a time. If there is a completed transaction there is no reason to delay; it is more efficient to push it through immediately. Since we will propose anything pending right when we finish, there is minimal opportunity for other work to get done. Instead, accumulate everything in a single MonitorDBStore::Transaction and propose all pending changes all at once. Encode at propose time and expose the Transaction to the callers so they can add their changes. Signed-off-by: Sage Weil <sage@redhat.com>	2015-01-13 14:51:04 -08:00
Joao Eduardo Luis	5461368968	mon: paxos: queue next proposal after waking up callbacks Signed-off-by: Joao Eduardo Luis <joao@redhat.com>	2015-01-09 17:41:17 -08:00
Sage Weil	b1cf210475	mon/Paxos: WRITING != WRITING_PREVIOUS This distinction is important: the update-previous state should not be writeable, as reflected by PaxosService::is_writeable(). Signed-off-by: Sage Weil <sage@redhat.com>	2014-08-27 14:36:08 -07:00
Sage Weil	a0e0b9bb2c	mon/Paxos: make backend write async Move into the WRITING state and do the write to leveldb (or whatever the backend is) asynchronously. A few tricks here: - we can't do the is_updating() state check because we will always be in REFRESH. Instead, make commit_proposal() tolerate the case where it is called but the top proposal isn't the one we just did (or the list is empty). This makes the callers simpler. - do_refresh() may call bootstrap. If we do bootstrap while in REFRESH, don't do a sync/flush on the backend store because we are async completion thread and we'll deadlock. All other callers need to wait for this, though! Signed-off-by: Sage Weil <sage@redhat.com>	2014-08-27 14:36:08 -07:00
Sage Weil	6a71159ed1	mon/Paxos: add writing and refresh states The new transition will be (updating or updating-previous) -> writing -> refresh -> active Signed-off-by: Sage Weil <sage@redhat.com>	2014-08-27 14:36:08 -07:00
Sage Weil	08f331bee2	mon/Paxos: break commit() into two pieces One part happens before the txn starts, the other after. Move all of the internal state update to the bottom half. Eventually this will matter. Signed-off-by: Sage Weil <sage@redhat.com>	2014-08-27 14:36:07 -07:00
Sage Weil	a6a1e994f9	mon: interact with MonitorDBStore::Transactions by shared_ptr Ref TransactionRef everywhere! Signed-off-by: Sage Weil <sage@redhat.com>	2014-08-27 14:36:07 -07:00
Sage Weil	b09b8563d3	mon/Paxos: add perfcounters for most paxos operations I'm focusing primarily on the ones that result in IO here. Signed-off-by: Sage Weil <sage@redhat.com>	2014-08-12 21:05:40 -07:00
Dmitry Smirnov	f22e2e9a02	spelling corrections	2014-04-17 12:43:30 +10:00
Loic Dachary	ab69d99309	mon: fix typo and remove redundant sentence Signed-off-by: Loic Dachary <loic@dachary.org>	2013-09-04 12:34:23 +02:00
Loic Dachary	7c09ede7a2	mon: fix typo in comment Signed-off-by: Loic Dachary <loic@dachary.org>	2013-09-04 12:33:15 +02:00
Sage Weil	7e0848d8f8	mon/Paxos: return whether store_state stored anything Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>	2013-08-20 11:27:09 -07:00
Sage Weil	99e605455f	mon/Paxos: accepted_pn_from has no semantic meaning Signed-off-by: Sage Weil <sage@inktank.com>	2013-07-22 14:12:51 -07:00
Sage Weil	a61635e852	ceph-monstore-tool: dump paxos transactions Signed-off-by: Sage Weil <sage@inktank.com>	2013-07-22 14:12:51 -07:00
Sage Weil	40a8bbdc53	Merge remote-tracking branch 'gh/wip-mon-report' into next	2013-07-15 14:23:40 -07:00
Sage Weil	c711203c0d	mon/Paxos: separate proposal commit from the end of the round Each commit should match with exactly one proposal; finish it when we actually commit it and make sensible asserts. The old finish_proposal() turns into finish_round(), and performs generic checks and cleanup associated with the transition from updating -> active. Signed-off-by: Sage Weil <sage@inktank.com>	2013-07-15 13:42:53 -07:00
Sage Weil	f1ce8d7c95	mon: fix scrub vs paxos race: refresh on commit, not round completion Consider: - paxos starts a commit N+1 - a majority of the peers ack it - paxos::commit() writes N+1 it to disk - tells peers to commit - peers commit N+1, and refresh_from_paxos(), and generate N+1 full map - leader does _scrub on N+1, without latest full osdmap - peers do _scrub on N+1, with latest full osdmap - leader finishes paxos gather, does refresh_from_paxos() -> scrub fails. Fix this by doing the refresh_from_paxos() at commit time and not when the paxos round finishes. We move the refresh out of finish_proposal and into its own helper, and update all callers accordingly. This keeps on-disk state more tightly in sync with in-memory state and avoids the need for a e.g., kludgey workaround in the scrub code. We also simplify the bootstrap checks a bit by doing so immediately and relying on the normal bootstrap paxos reset paths to clean up any waiters. Signed-off-by: Sage Weil <sage@inktank.com>	2013-07-15 12:54:56 -07:00
Sage Weil	56c36fa914	mon: include paxos info in report Signed-off-by: Sage Weil <sage@inktank.com>	2013-07-14 16:16:55 -07:00
Sage Weil	ccceeee57b	mon/Paxos: remove unnecessary trim enable/disable The sync no longer cares if we trim Paxos versions as we go, as long as we don't trim so fast that we fall behind between GET_CHUNK messages, which we can consider a tuning problem. Remove this extra complexity! Signed-off-by: Sage Weil <sage@inktank.com>	2013-07-09 11:05:48 -07:00
Sage Weil	aa33bc88aa	mon/Paxos: config min paxos txns to keep separately We were using paxos_max_join_drift to control the minimum number of paxos transactions to keep around. Instead, make this explicit, and separate from the join drift. Signed-off-by: Sage Weil <sage@inktank.com>	2013-07-09 11:05:47 -07:00
Sage Weil	da0aff28ab	mon: implement a simpler sync The previous sync implementation was highly stateful and very complex. This made it very hard to understand and to debug, and there were bugs still lurking in the timeout code (at least). Replace it with something much simpler: - sync providers are almost stateless. they keep an iterator, identified by a unique cookie, that times out in a simple way. - sync requesters sync from whomever they fancy. namely anyone with newer committed paxos state. There are a few extra fields that might allow sync continuation later, but this is complex and not necessary at this point. Signed-off-by: Sage Weil <sage@inktank.com>	2013-07-09 11:05:47 -07:00
Sage Weil	516445bebc	mon/Paxos: simplify trim() Collapse all the trim methods into a single simple method. Signed-off-by: Sage Weil <sage@inktank.com>	2013-06-26 06:55:02 -07:00
Sage Weil	ac63b2e095	mon/Paxos: clean up removal of pre-conversion paxos states Use a helper, independent of trim machinery, and call on leader, too. Signed-off-by: Sage Weil <sage@inktank.com>	2013-06-26 06:55:02 -07:00
Sage Weil	ad9c294850	mon/Paxos: assert that the store gives us back what we just wrote In bug #5424 I observed leveldb failing internally and then returning bad info. We then hit a random/confusing assert. Try to detect this earlier by verifying that a get of a just-written last_committed gives us back the right thing. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>	2013-06-25 21:25:04 -07:00
Sage Weil	ee34a21960	mon: simplify states - make states mutually exclusive (an enum) - rename locked -> updating_previous - set state prior to begin() to simplify things a bit Signed-off-by: Sage Weil <sage@inktank.com>	2013-06-19 11:27:05 -07:00
Sage Weil	7b7ea8e30e	mon/Paxos: cleanup: drop unused PREPARING state bit This is never set when we block, and nobody looks at it. Signed-off-by: Sage Weil <sage@inktank.com>	2013-06-19 11:27:05 -07:00
Sage Weil	a42d7582f8	mon/Paxos: do paxos refresh in finish_proposal; and refactor Do the paxos refresh inside finish_proposal, ordered after the leader assertion so that MonmapMonitor::update_from_paxos() calling bootstrap() does not kill us. Also, remove unnecessary finish_queued_proposal() and move the logic inline where the bad leader assertion is obvious. Signed-off-by: Sage Weil <sage@inktank.com>	2013-06-19 11:27:04 -07:00
Sage Weil	f1ccb2d808	mon: start lease timer from peon_init() In the scenario: - leader wins, peons lose - leader sees it is too far behind on paxos and bootstraps - leader tries to sync with someone, waits for a quorum of the others - peons sit around forever waiting The problem is that they never time out because paxos never issues a lease, which is the normal timeout that lets them detect a leader failure. Avoid this by starting the lease timeout as soon as we lose the election. The timeout callback just does a bootstrap and does not rely on any other state. I see one possible danger here: there may be some "normal" cases where the leader takes a long time to issue its first lease that we currently tolerate, but won't with this new check in place. I hope that raising the lease interval/timeout or reducing the allowed paxos drift will make that a non-issue. If it is problematic, we will need a separate explicit "i am alive" from the leader while it is getting ready to issue the lease to prevent a live-lock. Backport: cuttlefish, bobtail Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>	2013-05-31 17:09:19 -07:00
Sage Weil	6b8e74f064	mon/Paxos: adjust trimming defaults up; rename options - trim more at a time (by an order of magnitude) - rename fields to paxos_trim_{min,max}; only trim when there are min items that are trimmable, and trim at most max items at a time. - adjust the paxos_service_trim_{min,max} values up by a factor of 2. Since we are compacting every time we trim, adjusting these up mean less frequent compactions and less overall work for the monitor. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>	2013-05-31 17:05:03 -07:00
Joao Eduardo Luis	e15d290945	mon: Paxos: get rid of the 'prepare_bootstrap()' mechanism We don't need it after all. If we are in the middle of some proposal, then we guarantee that said proposal is likely to be retried. If we haven't yet proposed, then it's forever more likely that a client will eventually retry the message that triggered this proposal. Basically, this mechanism attempted at fixing a non-problem, and was in fact triggering some unforeseen issues that would have required increasing the code complexity for no good reason. Fixes: #5102 Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>	2013-05-22 17:12:38 +01:00
Sage Weil	3a6138b25e	mon/Paxos: don't ignore peer first_committed We go to the effort of keeping a map of the peer's first/last committed so that we can send the right commits during the first phase of paxos, but we forgot to record the first value. This appears to simply be an oversight. It is mostly harmless; it just means we send extra states that the peer already has. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>	2013-05-01 10:57:47 -07:00
Yan, Zheng	cea2ff8615	mon: Fix leak of context Use Context::complete() to finish context, it frees the context after executing Context::finish(). Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>	2013-04-28 21:15:25 -07:00
Joao Eduardo Luis	b99367bfb2	mon: Paxos: only finish a queued proposal if there's actually any When proposing an older value learned during recovery, we don't create a queued proposal -- we go straight through Paxos. Therefore, when finishing a proposal, we must be sure that we have a proposal in the queue before dereferencing it, otherwise we will segfault. Fixes: #4250 Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>	2013-03-13 15:03:00 -07:00
Danny Al-Gaaf	26e8577d29	Paxos.h: pass string name function parameter by reference Pass 'const string name' function parameter by reference. Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>	2013-02-26 18:50:13 +01:00
Danny Al-Gaaf	350481f90f	Paxos.h: fix dangerouse use of c_str() No need to use c_str() in get_statename(), simply return a std::strin instead. Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>	2013-02-25 14:10:20 +01:00
Joao Eduardo Luis	cb85fb7d9a	mon: ceph-mon: convert an old monitor store to the new format With the single-paxos patches we shifted from an approach with multiple paxos instances (one for each paxos service) keeping their own versions to a single paxos instance for all the paxos services, thus ending up with a single global version for paxos. With the release of v0.52, the monitor started tracking these global versions, keeping them for the single purpose of making it possible to convert the store to a single-paxos format. This patch now introduces a mechanism to convert a GV-enabled store to the single-paxos format store when the monitor is upgraded. As we require the global versions to be present, we first check if the store has the GV feature set: if not we will not proceed, but we will start the conversion otherwise. In the end of the conversion, the monitor data directory will have a brand new 'store.db' directory, where the key/value store lies, alongside with the old store. This makes it possible to revert to a previous monitor version if things go sideways, without jeopardizing the data in the store. The conversion is done as during a rolling upgrade, without any intervention by the user. Fire up the new monitor version on an old store, and the monitor itself will convert the store, trim any lingering versions that might not be required, and proceed to start as expected. Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>	2013-02-21 18:02:23 +00:00
Joao Eduardo Luis	cab3411b4a	mon: Monitor: Add monitor store synchronization support Synchronize two monitor stores when one of the monitors has diverged significantly from the remaining monitor cluster. This process roughly consists of the following steps: 0. mon.X tries to join the cluster; 1. mon.X verifies that it has diverged from the remaining cluster; 2. mon.X asks the leader to sync; 3. the leader allows mon.X to sync, pointing out a mon.Y from which mon.X should sync; 4. mon.X asks mon.Y to sync; 5. mon.Y sends its own store in one or more chunks; 6. mon.X acks each received chunk; go to 5; 7. mon.X receives the last chunk from mon.Y; 8. mon.X informs the leader that it has finished synchronizing; 9. the leader acks mon.X's finished sync; 10. mon.X bootstraps and retries joining the cluster (goto 0.) This is the most simple and straightforward process that can be hoped for. However, things may go sideways at any time (monitors failing, for instance), which could potentially lead to a corrupted monitor store. There are however mechanisms at work to avoid such scenario at any step of the process. Some of these mechanisms include: - aborting the sync if the leader fails or leadership changes; - state barriers on synchronization functions to avoid stray/outdated messages from interfering on the normal monitor behavior or on-going synchronization; - store clean-up before any synchronization process starts; - store clean-up if a sync process fails; - resuming sync from a different monitor mon.Z if mon.Y fails mid-sync; - several timeouts to guarantee that all the involved parties are still alive and participating in the sync effort. - request forwarding when mon.X contacts a monitor outside the quorum that might know who the leader is (or might know someone who does) [4]. Changes: - Adapt the MMonProbe message for the single-paxos approach, dropping the version map and using a lower and upper bound version instead. - Remove old slurp code. - Add 'sync force' command; 'sync_force' through the admin socket. Notes: [1] It's important to keep track of the paxos version at the time at which a store sync starts. Given that after the sync we end up with the same state as the monitor we are synchronizing from, there is a chance that we might end up with an uncommitted paxos version if we are synchronizing with the leader (there's some paxos stashing done prior to commit on the leader). By keeping track at which version the sync started, we can then let the requester to which version he should cap its paxos store. [2] Furthermore, the enforced paxos cap, described on [1], is even more important if we consider the need to reapply the paxos versions that were received during the sync, to make sure the paxos store is consistent. If we happened to have some yet-uncommitted version in the store, we could end up applying it. [3] What is described in [1] and [2]: Fixes: #4026 Fixes: #4037 Fixes: #4040 [4] Whenever a given monitor mon.X is on the probing phase and notices that there is a mon.Y with a paxos version considerably higher than the one mon.X has, then mon.X will attempt to synchronize from mon.Y. This is the basis for the store sync. However this might hold true, the fact is that there might be a chance that, by the time mon.Y handles the sync request from mon.X, mon.Y might already be attempting a sync himself with some other mon.Z. In this case, the appropriate thing for mon.Y to do is to forward mon.X's request to mon.Z, as mon.Z should be part of the quorum, know who the leader is or be the leader himself -- if not, at least it is guaranteed that mon.Z has a higher version than both mon.X and mon.Y, so it should be okay to sync from him. Fixes: #4162 Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>	2013-02-21 18:02:22 +00:00

1 2

91 Commits