Preventing Stale Reads
======================

We write synchronously to all replicas before sending an ACK to the
client, which limits the potential for inconsistency
in the write path.  However, by default we serve reads from just
one replica (the lead/primary OSD for each PG), and the
client will use whatever OSDMap it has to select the OSD from which to read.
In most cases, this is fine: either the client map is correct,
or the OSD that we think is the primary for the object knows that it
is not the primary anymore, and can feed the client an updated map
that indicates a newer primary.

The key is to ensure that this is *always* true.  In particular, we
need to ensure that an OSD that is fenced off from its peers and has
not learned about a map update does not continue to service read
requests from similarly stale clients at any point after which a new
primary may have been allowed to make a write.


We accomplish this via a mechanism that works much like a read lease.
Each pool may have a ``read_lease_interval`` property which defines
how long the lease lasts, although by default we simply set it to
``osd_pool_default_read_lease_ratio`` (default: 0.8) times the
``osd_heartbeat_grace``.  (This way the lease will generally have
expired by the time we mark a failed OSD down.)

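
The default lease length is simple arithmetic.  The sketch below is
illustrative only: the function name is hypothetical, and the 20-second
value for ``osd_heartbeat_grace`` is an assumption chosen for the example
rather than something stated in this document.

.. code-block:: python

   def default_read_lease_interval(heartbeat_grace, lease_ratio=0.8):
       """Lease length in seconds when the pool sets no read_lease_interval.

       heartbeat_grace stands in for osd_heartbeat_grace and lease_ratio
       for osd_pool_default_read_lease_ratio.
       """
       return lease_ratio * heartbeat_grace


   # Assumed grace of 20 seconds: the lease lasts about 16 seconds, so it
   # runs out roughly 4 seconds before the grace period itself elapses.
   interval = default_read_lease_interval(20.0)
   margin = 20.0 - interval
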

readable_until
--------------

Primary and replica both track a couple of values:

* *readable_until* is the time until which we are allowed to service
  (read) requests, that is, the point at which *our* "lease" expires.
* *readable_until_ub* is an upper bound on *readable_until* for any
  OSD in the acting set.


The primary manages these two values by sending *pg_lease_t* messages
to replicas that increase the upper bound.  Once all acting OSDs have
acknowledged they've seen the higher bound, the primary increases its
own *readable_until* and shares that (in a subsequent *pg_lease_t*
message).  The resulting invariant is that any acting OSD's
*readable_until* is always <= any acting OSD's *readable_until_ub*.

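
The two-step renewal can be sketched as follows.  This is a simplified
model, not the OSD implementation: the class and method names are
hypothetical, all OSDs are assumed to share one clock (clock translation
is ignored here), and a lease is modelled as a plain
``(readable_until, readable_until_ub)`` tuple.

.. code-block:: python

   from dataclasses import dataclass


   @dataclass
   class LeaseState:
       readable_until: float = 0.0     # when our own lease expires
       readable_until_ub: float = 0.0  # upper bound for any acting OSD


   class Replica:
       def __init__(self):
           self.state = LeaseState()

       def on_lease(self, readable_until, readable_until_ub):
           """Apply a lease from the primary; return the bound to acknowledge."""
           s = self.state
           s.readable_until_ub = max(s.readable_until_ub, readable_until_ub)
           s.readable_until = max(s.readable_until, readable_until)
           return s.readable_until_ub


   class Primary:
       def __init__(self, replica_ids, interval):
           self.state = LeaseState()
           self.interval = interval
           self.replica_ids = list(replica_ids)
           self.acked = {}  # replica id -> highest bound it has acknowledged

       def propose(self, now):
           """Step 1: raise the upper bound and build the lease to send."""
           s = self.state
           s.readable_until_ub = max(s.readable_until_ub, now + self.interval)
           return (s.readable_until, s.readable_until_ub)

       def on_ack(self, replica_id, acked_ub):
           """Step 2: extend our own lease once every replica has acked."""
           self.acked[replica_id] = max(self.acked.get(replica_id, 0.0), acked_ub)
           if all(r in self.acked for r in self.replica_ids):
               lowest = min(
                   (self.acked[r] for r in self.replica_ids),
                   default=self.state.readable_until_ub,
               )
               self.state.readable_until = max(self.state.readable_until, lowest)
           return self.state.readable_until

Because the primary only moves *readable_until* up to a bound that every
replica has already recorded as its own *readable_until_ub*, no OSD's
lease can outlive the upper bound held by any other acting OSD.
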

In order to avoid any problems with clock skew, we use monotonic
clocks (which are only accurate locally and unaffected by time
adjustments) throughout to manage these leases.  Peer OSDs calculate
upper and lower bounds on the deltas between OSD-local clocks,
allowing the primary to share timestamps based on its local clock
while replicas translate that to an appropriate bound for their own
local clocks.

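
One way to derive such bounds is from message send and receive stamps,
relying only on the fact that a message is received after it is sent.
The sketch below is an assumption about how this could work, not a
description of the OSD code: the function names are hypothetical, the
delta is defined here as replica clock minus primary clock, and times
are integer milliseconds.

.. code-block:: python

   def delta_upper_bound(primary_send_stamp, replica_recv_stamp):
       """Bound from a primary -> replica message.

       At the send instant the replica clock read at most
       replica_recv_stamp, so (replica - primary) <= recv - send.
       """
       return replica_recv_stamp - primary_send_stamp


   def delta_lower_bound(replica_send_stamp, primary_recv_stamp):
       """Bound from a replica -> primary message.

       At the send instant the primary clock read at most
       primary_recv_stamp, so (replica - primary) >= send - recv.
       """
       return replica_send_stamp - primary_recv_stamp


   def to_replica_clock(primary_time, delta_lb, delta_ub):
       """Return (earliest, latest) replica-local equivalents of primary_time."""
       return primary_time + delta_lb, primary_time + delta_ub


   # Example: the replica clock is really 1000 ms ahead and each message
   # takes 2 ms in transit.
   ub = delta_upper_bound(primary_send_stamp=50, replica_recv_stamp=1052)
   lb = delta_lower_bound(replica_send_stamp=1060, primary_recv_stamp=62)
   earliest, latest = to_replica_clock(116, lb, ub)

A replica would then take the earlier value when translating
*readable_until* (so it never serves reads too long) and the later value
when translating *readable_until_ub* (so the bound stays an upper bound).
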

Prior Intervals
---------------

Whenever there is an interval change, we need to have an upper bound
on the *readable_until* values for any OSDs in the prior interval.
All OSDs from that interval have this value (*readable_until_ub*), and
share it as part of the *pg_history_t* during peering.

Because peering may involve OSDs that were not already communicating
before and may not have bounds on their clock deltas, the bound in
*pg_history_t* is shared as a simple duration: the time remaining
before the upper bound expires.  This means that the bound slips
forward in time due to the transit time for the peering message, but
that is generally quite short, and moving the bound later in time is
safe since it is an *upper* bound.

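
The effect of sending a duration instead of a timestamp can be
illustrated with a short sketch.  The function names are hypothetical
and the clock readings are made-up integer milliseconds; the only point
is that the decoded bound lands later than the true expiry by the
transit time.

.. code-block:: python

   def encode_prior_ub(readable_until_ub, sender_now):
       """Time remaining before the bound expires, on the sender's clock."""
       return max(0, readable_until_ub - sender_now)


   def decode_prior_ub(remaining, receiver_now):
       """Rebuild the bound on the receiver's clock at receipt time."""
       return receiver_now + remaining


   # Sender: the bound is 116000 and its clock reads 110000 at send time.
   remaining = encode_prior_ub(116000, sender_now=110000)

   # Receiver: its unrelated monotonic clock read 5000000 when the message
   # was sent and reads 5000500 when it arrives (500 ms in transit).
   true_expiry = 5000000 + remaining
   decoded = decode_prior_ub(remaining, receiver_now=5000500)
   slip = decoded - true_expiry
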

PG "laggy" state
----------------

While the PG is active, *pg_lease_t* and *pg_lease_ack_t* messages are
regularly exchanged.  However, if a client request comes in and the
lease has expired (*readable_until* has passed), the PG will go into a
*LAGGY* state and the request will be blocked.  Once the lease is
renewed, the request(s) will be requeued.

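
The block-and-requeue behaviour can be modelled in a few lines.  This is
a toy model under stated assumptions: the class and its fields are
hypothetical, requests are plain strings, and "serving" a request just
records it in a list.

.. code-block:: python

   from collections import deque


   class LaggyPG:
       def __init__(self):
           self.readable_until = 0.0
           self.laggy = False
           self.waiting = deque()  # requests blocked while laggy
           self.served = []        # requests that were allowed through

       def handle_read(self, op, now):
           """Serve op if the lease is valid; otherwise block it."""
           if now >= self.readable_until:
               self.laggy = True
               self.waiting.append(op)
               return False
           self.served.append(op)
           return True

       def renew_lease(self, new_readable_until, now):
           """Extend the lease and requeue anything that was blocked."""
           self.readable_until = max(self.readable_until, new_readable_until)
           if self.laggy and now < self.readable_until:
               self.laggy = False
               pending, self.waiting = self.waiting, deque()
               for op in pending:
                   self.handle_read(op, now)
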

PG "wait" state
---------------

If peering completes but the prior interval's OSDs may still be
readable, the PG will go into the *WAIT* state until sufficient time
has passed.  Any OSD requests will block during that period.  Recovery
may proceed while in this state, since the logical, user-visible
content of objects does not change.

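
A minimal sketch of that gating decision, assuming the prior interval's
bound has already been translated to the local clock.  The function
names and the ``"client"`` / ``"recovery"`` labels are hypothetical.

.. code-block:: python

   def in_wait_state(now, prior_readable_until_ub):
       """True while an OSD from the prior interval might still serve reads."""
       return now < prior_readable_until_ub


   def may_proceed(op_kind, now, prior_readable_until_ub):
       """Client requests block during WAIT; recovery is allowed through."""
       if not in_wait_state(now, prior_readable_until_ub):
           return True
       return op_kind == "recovery"
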

Dead OSDs
---------

Generally speaking, we need to wait until prior intervals' OSDs *know*
that they should no longer be readable.  If an OSD is known to have
crashed (e.g., because the process is no longer running, which we may
infer because we get an ``ECONNREFUSED`` error), then we can infer that
it is not readable.

Similarly, if an OSD is marked down, gets a map update telling it so,
and then informs the monitor that it knows it was marked down, we can
infer that it is not still serving requests for a prior interval.

When a PG is in the *WAIT* state, it will watch new maps for OSDs'
*dead_epoch* value indicating they are aware of their dead-ness.  If
all down OSDs from the prior interval are so aware, we can exit the
WAIT state early.

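
The early-exit check might look like the sketch below.  Everything here
is an assumption made for illustration: the function name, the use of
plain dictionaries in place of the OSDMap, and in particular the rule
that an OSD counts as "aware" when its *dead_epoch* is newer than the
epoch at which it last came up.

.. code-block:: python

   def can_exit_wait_early(prior_down_osds, up_from, dead_epoch):
       """True if every down OSD from the prior interval knows it is dead.

       prior_down_osds: ids of prior-interval OSDs that are now down
       up_from:         osd id -> epoch at which it last came up
       dead_epoch:      osd id -> epoch at which it acknowledged being
                        marked down (missing if it never did)
       """
       if not prior_down_osds:
           # No down OSDs means no evidence; keep waiting for the bound.
           return False
       return all(
           dead_epoch.get(osd, 0) > up_from.get(osd, 0)
           for osd in prior_down_osds
       )
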