======================
last_epoch_started
======================

``info.last_epoch_started`` records an activation epoch ``e`` for interval ``i``
such that all writes committed in ``i`` or earlier are reflected in the
local info/log and no writes after ``i`` are reflected in the local
info/log. Because no committed write is ever divergent, we can leave our
own ``info.last_epoch_started`` alone even if we receive an authoritative
log/info with an older ``info.last_epoch_started``: no writes could
have committed in any intervening interval (see ``PG::proc_master_log``).

``info.history.last_epoch_started`` records a lower bound on the most
recent interval in which the PG as a whole went active and accepted
writes. On a particular OSD it is also an upper bound on the
activation epoch of intervals in which writes in the local PG log
occurred: we update it before accepting writes. Because all
committed writes are committed by all acting set OSDs, any
non-divergent writes ensure that ``history.last_epoch_started`` was
recorded by all acting set members in the interval. Once peering has
queried one OSD from each interval back to some seen
``history.last_epoch_started``, it follows that no interval after the max
``history.last_epoch_started`` can have reported writes as committed
(since we record it before recording client writes in an interval).
Thus, the minimum ``last_update`` across all infos with
``info.last_epoch_started >= MAX(history.last_epoch_started)`` must be an
upper bound on writes reported as committed to the client.

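The bound described above can be sketched in a few lines. This is only an
illustrative model, not Ceph's implementation: the ``PeerInfo`` class, the
``committed_upper_bound`` helper, the integer versions, and the sample values
are all assumptions made up for the example.

.. code:: python

  from dataclasses import dataclass


  @dataclass
  class PeerInfo:
      """Hypothetical, simplified stand-in for one peer's PG info."""
      osd: int
      last_update: int   # simplified to a single integer version
      info_les: int      # models info.last_epoch_started
      history_les: int   # models info.history.last_epoch_started


  def committed_upper_bound(infos):
      """Minimum last_update over infos with info_les >= max(history_les)."""
      max_history_les = max(i.history_les for i in infos)
      candidates = [i for i in infos if i.info_les >= max_history_les]
      if not candidates:
          return None  # no queried peer is recent enough to give a bound
      return min(i.last_update for i in candidates)


  # Made-up sample data: osd.7 is stale and does not take part in the bound.
  infos = [
      PeerInfo(osd=0, last_update=302, info_les=473, history_les=473),
      PeerInfo(osd=4, last_update=300, info_les=473, history_les=473),
      PeerInfo(osd=7, last_update=250, info_les=400, history_les=400),
  ]
  bound = committed_upper_bound(infos)

In this sample, only osd.0 and osd.4 satisfy the ``info_les`` condition, so the
bound is the smaller of their two ``last_update`` values.
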
We update ``info.last_epoch_started`` with the initial activation message,
but we only update ``history.last_epoch_started`` after the new
``info.last_epoch_started`` is persisted (possibly along with the first
write). This ensures that we do not require an OSD with the most
recent ``info.last_epoch_started`` until all acting set OSDs have recorded
it.

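The ordering of the two updates can be modelled roughly as follows. This is a
toy sketch under stated assumptions, not the actual peering code: the
``OSDState`` class and the ``activate``, ``persist`` and
``maybe_advance_history`` helpers are hypothetical names, and the model
assumes that "persisted" means every acting set member has recorded the new
``info.last_epoch_started``.

.. code:: python

  class OSDState:
      """Hypothetical per-OSD view of the two last_epoch_started values."""

      def __init__(self, osd, les):
          self.osd = osd
          self.info_les = les       # models info.last_epoch_started
          self.history_les = les    # models history.last_epoch_started
          self.persisted_les = les  # last info_les known to be on disk


  def activate(acting, epoch):
      """The activation message carries the new info_les to each member."""
      for o in acting:
          o.info_les = epoch


  def persist(osd):
      """Model the OSD durably recording its current info_les."""
      osd.persisted_les = osd.info_les


  def maybe_advance_history(acting):
      """Advance history_les only once every member has persisted info_les."""
      epoch = acting[0].info_les
      if all(o.persisted_les == epoch for o in acting):
          for o in acting:
              o.history_les = epoch
          return True
      return False


  acting = [OSDState(0, 473), OSDState(4, 473)]
  activate(acting, 477)
  persist(acting[0])
  advanced_early = maybe_advance_history(acting)  # osd.4 has not persisted
  persist(acting[1])
  advanced_late = maybe_advance_history(acting)   # all members persisted

Until the second ``persist`` call, ``history_les`` stays at its old value in
this model, so peering would not yet demand an OSD that knows the newer
``info_les``.
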
In ``find_best_info``, we do include ``info.last_epoch_started`` values when
calculating ``max_last_epoch_started_found`` because we want to avoid
designating a log entry divergent which in a prior interval would have
been non-divergent since it might have been used to serve a read. In
``activate()``, we use the peer's ``last_epoch_started`` value as a bound on
how far back divergent log entries can be found.

However, in a case like

.. code::

  calc_acting osd.0 1.4e( v 473'302 (292'200,473'302] local-les=473 n=4 ec=5 les/c 473/473 556/556/556
  calc_acting osd.1 1.4e( v 473'302 (293'202,473'302] lb 0//0//-1 local-les=477 n=0 ec=5 les/c 473/473 556/556/556
  calc_acting osd.4 1.4e( v 473'302 (120'121,473'302] local-les=473 n=4 ec=5 les/c 473/473 556/556/556
  calc_acting osd.5 1.4e( empty local-les=0 n=0 ec=5 les/c 473/473 556/556/556

since osd.1 is the only one which recorded ``info.les=477``, while osd.4 and
osd.0 (which were the acting set in that interval) did not (osd.4 restarted
and osd.0 did not get the message in time), the PG is marked incomplete when
either osd.4 or osd.0 would have been valid choices. To avoid this, we do not
consider ``info.les`` for incomplete peers when calculating
``max_last_epoch_started_found``. Such a peer would not have been in the acting
set, so we must have another OSD from that interval anyway (if
``maybe_went_rw``). If that OSD does not remember that ``info.les``, then we
cannot have served reads.

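The effect of skipping incomplete peers can be illustrated with the values
from the ``calc_acting`` output above. This is a simplified sketch, not the
real ``find_best_info``: the ``PeerInfo`` class and both helper functions are
hypothetical, and treating osd.1 as the incomplete peer (because of its
``lb 0//0//-1``) is an assumption about how to read that log line.

.. code:: python

  from dataclasses import dataclass


  @dataclass
  class PeerInfo:
      """Hypothetical, simplified stand-in for one peer's PG info."""
      osd: int
      info_les: int             # "local-les" in the log output
      history_les: int          # first number of "les/c" in the log output
      incomplete: bool = False  # assumed from "lb 0//0//-1" on osd.1


  def max_les_found(infos, skip_incomplete=True):
      """Running maximum of history_les and (optionally filtered) info_les."""
      found = 0
      for i in infos:
          found = max(found, i.history_les)
          if skip_incomplete and i.incomplete:
              continue  # ignore info_les reported by incomplete peers
          found = max(found, i.info_les)
      return found


  def usable_peers(infos, skip_incomplete=True):
      """Complete peers whose info_les reaches the computed maximum."""
      bound = max_les_found(infos, skip_incomplete)
      return [i.osd for i in infos
              if not i.incomplete and i.info_les >= bound]


  peers = [
      PeerInfo(osd=0, info_les=473, history_les=473),
      PeerInfo(osd=1, info_les=477, history_les=473, incomplete=True),
      PeerInfo(osd=4, info_les=473, history_les=473),
      PeerInfo(osd=5, info_les=0, history_les=473),
  ]
  with_skip = usable_peers(peers)
  without_skip = usable_peers(peers, skip_incomplete=False)

When osd.1's value is counted, the maximum rises to 477 and no complete peer
qualifies in this model, which corresponds to the PG being marked incomplete.
When it is skipped, the maximum stays at 473 and both osd.0 and osd.4 remain
candidates.
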