osd/PG: fix DeferRecovery vs AllReplicasRecovered race

- DeferRecovery event queued by AsyncReserver due to preemption
  event.  We are in Recovering state with RECOVERING bit set.
- We finish recovery, clear RECOVERING state bit, and queue
  AllReplicasRecovered from PrimaryLogPG::start_recovery_ops()
- DeferRecovery event arrives, moving us from Recovering -> NotRecovering
- AllReplicasRecovered event arrives, crashing us.

This is all hard to deal with because the events are queued and may
arrive later.  Solve the problem here by tolerating a delayed
DeferRecovery event: if the RECOVERING pg state bit isn't set, ignore
it (it's old).  The async reserver cancel events are unpredictable.
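
The sequence above can be replayed with a small toy model. This is only a
sketch, not Ceph code: ToyPG, its fields, and replay() are hypothetical
stand-ins. A plain bool models the RECOVERING state bit, and a thrown
exception models the crash on an unexpected event.

```cpp
#include <deque>
#include <stdexcept>

// Hypothetical stand-ins for the PG recovery state machine; not Ceph code.
enum class State { Recovering, NotRecovering, Recovered };
enum class Event { DeferRecovery, AllReplicasRecovered };

struct ToyPG {
  State state = State::Recovering;
  bool recovering_bit = true;   // models the RECOVERING pg state bit
  bool tolerate_stale_defer;    // true models the behaviour after this fix
  std::deque<Event> queue;      // events are queued and handled later

  explicit ToyPG(bool tolerate) : tolerate_stale_defer(tolerate) {}

  // Models finishing recovery: clear the bit, queue AllReplicasRecovered.
  void finish_recovery() {
    recovering_bit = false;
    queue.push_back(Event::AllReplicasRecovered);
  }

  void handle(Event e) {
    if (state == State::Recovering && e == Event::DeferRecovery) {
      if (tolerate_stale_defer && !recovering_bit)
        return;  // stale DeferRecovery: ignore it and stay in Recovering
      state = State::NotRecovering;
    } else if (state == State::Recovering &&
               e == Event::AllReplicasRecovered) {
      state = State::Recovered;
    } else {
      // Models the crash: no reaction exists for this event in this state.
      throw std::logic_error("unexpected event for current state");
    }
  }

  void drain() {
    while (!queue.empty()) {
      Event e = queue.front();
      queue.pop_front();
      handle(e);
    }
  }
};

// Replays the sequence from the list above and returns the final state.
State replay(bool tolerate) {
  ToyPG pg(tolerate);
  pg.queue.push_back(Event::DeferRecovery);  // queued due to preemption
  pg.finish_recovery();                      // clears bit, queues the next event
  pg.drain();
  return pg.state;
}
```

In this model, replay(false) throws when AllReplicasRecovered reaches the
NotRecovering state, while replay(true) discards the stale DeferRecovery and
ends in Recovered.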

Fixes: http://tracker.ceph.com/issues/23860
Signed-off-by: Sage Weil <sage@redhat.com>
Sage Weil 2018-04-27 15:00:58 -05:00
parent 049e9097a9
commit cfe59cf20c

@@ -7685,6 +7685,12 @@ boost::statechart::result
 PG::RecoveryState::Recovering::react(const DeferRecovery &evt)
 {
   PG *pg = context< RecoveryMachine >().pg;
+  if (!pg->state_test(PG_STATE_RECOVERING)) {
+    // we may have finished recovery and have an AllReplicasRecovered
+    // event queued to move us to the next state.
+    ldout(pg->cct, 10) << "got defer recovery but not recovering" << dendl;
+    return discard_event();
+  }
   ldout(pg->cct, 10) << "defer recovery, retry delay " << evt.delay << dendl;
   pg->state_set(PG_STATE_RECOVERY_WAIT);
   pg->osd->local_reserver.cancel_reservation(pg->info.pgid);