mirror of https://github.com/ceph/ceph (synced 2025-02-22 18:47:18 +00:00)
osd/PG: fix DeferRecovery vs AllReplicasRecovered race
- DeferRecovery event queued by AsyncReserver due to preemption event.
  We are in Recovering state with RECOVERING bit set.
- We finish recovery, clear RECOVERING state bit, and queue
  AllReplicasRecovered from PrimaryLogPG::start_recovery_ops()
- DeferRecovery event arrives, moving us from Recovering -> NotRecovering
- AllReplicasRecovered event arrives, crashing us.

This is all hard to deal with because the events are queued and may
arrive later. Solve the problem here by tolerating a delayed
DeferRecovery event: if the RECOVERING pg state bit isn't set, ignore it
(it's old). The async reserver cancel events are unpredictable.

Fixes: http://tracker.ceph.com/issues/23860
Signed-off-by: Sage Weil <sage@redhat.com>
parent 049e9097a9
commit cfe59cf20c
@@ -7685,6 +7685,12 @@ boost::statechart::result
 PG::RecoveryState::Recovering::react(const DeferRecovery &evt)
 {
   PG *pg = context< RecoveryMachine >().pg;
+  if (!pg->state_test(PG_STATE_RECOVERING)) {
+    // we may have finished recovery and have an AllReplicasRecovered
+    // event queued to move us to the next state.
+    ldout(pg->cct, 10) << "got defer recovery but not recovering" << dendl;
+    return discard_event();
+  }
   ldout(pg->cct, 10) << "defer recovery, retry delay " << evt.delay << dendl;
   pg->state_set(PG_STATE_RECOVERY_WAIT);
   pg->osd->local_reserver.cancel_reservation(pg->info.pgid);
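
The event ordering in the commit message can be replayed in a small standalone model. This is only a sketch under stated assumptions, not Ceph code: `ToyPG`, `State`, `Event`, `finish_recovery()`, `deliver()`, `drain()` and `replay()` are hypothetical names, the boost::statechart machine is reduced to a plain enum, and the crash on an unhandled AllReplicasRecovered event is modeled as a thrown `std::logic_error`.

```cpp
#include <cassert>
#include <deque>
#include <stdexcept>

// Hypothetical, simplified model of the race; none of these names are Ceph's.
enum class State { Recovering, NotRecovering, Recovered };
enum class Event { DeferRecovery, AllReplicasRecovered };

struct ToyPG {
  State state = State::Recovering;
  bool recovering_bit = true;   // stands in for PG_STATE_RECOVERING
  bool tolerate_stale_defer;    // true models the patched react()
  std::deque<Event> queue;      // events are queued and delivered later

  explicit ToyPG(bool tolerate) : tolerate_stale_defer(tolerate) {}

  // Recovery completes: clear the bit, then queue the transition event.
  void finish_recovery() {
    assert(recovering_bit);     // recovery can only finish while it is running
    recovering_bit = false;
    queue.push_back(Event::AllReplicasRecovered);
  }

  void deliver(Event e) {
    if (e == Event::DeferRecovery) {
      if (state != State::Recovering) return;   // other states are not modeled
      // The fix: a DeferRecovery that arrives after the bit was cleared is
      // stale, so drop it instead of leaving the Recovering state.
      if (tolerate_stale_defer && !recovering_bit) return;
      state = State::NotRecovering;
      return;
    }
    // AllReplicasRecovered is only handled while in Recovering.
    if (state != State::Recovering)
      throw std::logic_error("AllReplicasRecovered outside Recovering");
    state = State::Recovered;
  }

  // Deliver all queued events in FIFO order.
  void drain() {
    while (!queue.empty()) {
      Event e = queue.front();
      queue.pop_front();
      deliver(e);
    }
  }
};

// Replays the sequence from the commit message. Returns true if the PG
// reaches Recovered, false if the modeled crash occurs.
bool replay(bool tolerate) {
  ToyPG pg(tolerate);
  pg.queue.push_back(Event::DeferRecovery);  // queued due to preemption
  pg.finish_recovery();                      // recovery finishes before delivery
  try {
    pg.drain();
  } catch (const std::logic_error &) {
    return false;
  }
  return pg.state == State::Recovered;
}
```

In this model `replay(false)` takes the Recovering -> NotRecovering transition on the stale event and then fails on AllReplicasRecovered, while `replay(true)` discards the stale DeferRecovery and reaches `Recovered`. Checking the state bit in the handler, rather than trying to cancel the queued event, keeps the fix local to the one place where the stale event is consumed.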