mon/Paxos: commit only after entire quorum acks

If only a subset of the quorum has accepted the proposal when we commit, we
will start sharing the new state.  However, any mon that has not yet replied
with an accept may still be sharing the old, stale value.

The simplest way to prevent this is not to commit until the entire quorum
replies.  In the common case there are no failures and this is just fine.
In the failure case, we will call a new election, form a smaller quorum of
(live) nodes, and recommit the same value.
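
The difference between the old and new commit conditions can be sketched as
two predicates. This is only an illustration, not the Ceph code: `RankSet`,
`majority_accepted` and `whole_quorum_accepted` are hypothetical names, and
monitor ranks are assumed to be plain ints. The old code compared the accept
count with `==` so that it fired exactly once; the sketch uses `>=` for
clarity.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical stand-in: a set of monitor ranks, mirroring `accepted` and
// the value returned by mon->get_quorum() in the diff.
using RankSet = std::set<int>;

// Old rule: commit as soon as a strict majority of the monmap has accepted.
bool majority_accepted(const RankSet& accepted, std::size_t monmap_size) {
  assert(monmap_size > 0);
  return accepted.size() >= monmap_size / 2 + 1;
}

// New rule: commit only once every member of the current quorum has accepted.
bool whole_quorum_accepted(const RankSet& accepted, const RankSet& quorum) {
  return accepted == quorum;
}
```

With five mons in the quorum and accepts from three of them, the old rule
commits and the new rule does not. If an election shrinks the quorum to those
same three mons, the new rule is satisfied as well.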

A more performant solution would be to have a separate message invalidate
the old state and commit once we have all invalidations and a majority of
accepts.  This will lower latency a bit in the non-failure case, but not
change the failure case significantly.  Later!
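
The deferred alternative can be written as a single predicate. This is a
sketch under stated assumptions, since nothing like it exists in this commit:
the `invalidated` set (mons that have acknowledged dropping the old state),
the `RankSet` alias and the function name are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical stand-in: a set of monitor ranks.
using RankSet = std::set<int>;

// Commit once every quorum member has invalidated the old state and a
// strict majority of the monmap has accepted the new value.
bool can_commit_with_invalidation(const RankSet& accepted,
                                  const RankSet& invalidated,
                                  const RankSet& quorum,
                                  std::size_t monmap_size) {
  assert(monmap_size > 0);
  const bool all_invalidated = (invalidated == quorum);
  const bool majority = accepted.size() >= monmap_size / 2 + 1;
  return all_invalidated && majority;
}
```

The latency gain would come from the invalidation not having to wait for the
persist, so the slowest mon delays the commit only by a cheap acknowledgement.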

Fixes: #7736
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Sage Weil 2014-03-17 16:21:17 -07:00
parent 1048763e33
commit fa1d957c11

@@ -691,22 +691,21 @@ void Paxos::handle_accept(MMonPaxos *accept)
   assert(g_conf->paxos_kill_at != 6);
-  // new majority?
-  if (accepted.size() == (unsigned)mon->monmap->size()/2+1) {
+  // only commit (and expose committed state) when we get *all* quorum
+  // members to accept. otherwise, they may still be sharing the now
+  // stale state.
+  // FIXME: we can improve this with an additional lease revocation message
+  // that doesn't block for the persist.
+  if (accepted == mon->get_quorum()) {
     // yay, commit!
     // note: this may happen before the lease is reextended (below)
-    dout(10) << " got majority, committing" << dendl;
+    dout(10) << " got majority, committing, done with update" << dendl;
     commit();
     if (!do_refresh())
       goto out;
     if (is_updating())
       commit_proposal();
     finish_contexts(g_ceph_context, waiting_for_commit);
-  }
-  // done?
-  if (accepted == mon->get_quorum()) {
-    dout(10) << " got quorum, done with update" << dendl;
     // cancel timeout event
     mon->timer.cancel_event(accept_timeout_event);
     accept_timeout_event = 0;