mirror of https://github.com/ceph/ceph
94 lines
4.0 KiB
ReStructuredText
94 lines
4.0 KiB
ReStructuredText
====================
|
|
Backfill Reservation
|
|
====================
|
|
|
|
When a new OSD joins a cluster all PGs with it in their acting sets must
|
|
eventually backfill. If all of these backfills happen simultaneously
|
|
they will present excessive load on the OSD: the "thundering herd"
|
|
effect.
|
|
|
|
The ``osd_max_backfills`` tunable limits the number of outgoing or
|
|
incoming backfills that are active on a given OSD. Note that this limit is
|
|
applied separately to incoming and to outgoing backfill operations.
|
|
Thus there can be as many as ``osd_max_backfills * 2`` backfill operations
|
|
in flight on each OSD. This subtlety is often missed, and Ceph
|
|
operators can be puzzled as to why more ops are observed than expected.
|
|
|
|
Each ``OSDService`` now has two AsyncReserver instances: one for backfills going
|
|
from the OSD (``local_reserver``) and one for backfills going to the OSD
|
|
(``remote_reserver``). An ``AsyncReserver`` (``common/AsyncReserver.h``)
|
|
manages a queue by priority of waiting items and a set of current reservation
|
|
holders. When a slot frees up, the ``AsyncReserver`` queues the ``Context*``
|
|
associated with the next item on the highest priority queue in the finisher
|
|
provided to the constructor.
|
|
|
|
For a primary to initiate a backfill it must first obtain a reservation from
|
|
its own ``local_reserver``. Then it must obtain a reservation from the backfill
|
|
target's ``remote_reserver`` via a ``MBackfillReserve`` message. This process is
|
|
managed by sub-states of ``Active`` and ``ReplicaActive`` (see the sub-states
|
|
of ``Active`` in PG.h). The reservations are dropped either on the ``Backfilled``
|
|
event, which is sent on the primary before calling ``recovery_complete``
|
|
and on the replica on receipt of the ``BackfillComplete`` progress message),
|
|
or upon leaving ``Active`` or ``ReplicaActive``.
|
|
|
|
It's important to always grab the local reservation before the remote
|
|
reservation in order to prevent a circular dependency.
|
|
|
|
We minimize the risk of data loss by prioritizing the order in
|
|
which PGs are recovered. Admins can override the default order by using
|
|
``force-recovery`` or ``force-backfill``. A ``force-recovery`` with op
|
|
priority ``255`` will start before a ``force-backfill`` op at priority ``254``.
|
|
|
|
If recovery is needed because a PG is below ``min_size`` a base priority of
|
|
``220`` is used. This is incremented by the number of OSDs short of the pool's
|
|
``min_size`` as well as a value relative to the pool's ``recovery_priority``.
|
|
The resultant priority is capped at ``253`` so that it does not confound forced
|
|
ops as described above. Under ordinary circumstances a recovery op is
|
|
prioritized at ``180`` plus a value relative to the pool's ``recovery_priority``.
|
|
The resultant priority is capped at ``219``.
|
|
|
|
If backfill is needed because the number of acting OSDs is less than
|
|
the pool's ``min_size``, a priority of ``220`` is used. The number of OSDs
|
|
short of the pool's ``min_size`` is added as well as a value relative to
|
|
the pool's ``recovery_priority``. The total priority is limited to ``253``.
|
|
|
|
If backfill is needed because a PG is undersized,
|
|
a priority of ``140`` is used. The number of OSDs below the size of the pool is
|
|
added as well as a value relative to the pool's ``recovery_priority``. The
|
|
resultant priority is capped at ``179``. If a backfill op is
|
|
needed because a PG is degraded, a priority of ``140`` is used. A value
|
|
relative to the pool's ``recovery_priority`` is added. The resultant priority
|
|
is capped at ``179`` . Under ordinary circumstances a
|
|
backfill op priority of ``100`` is used. A value relative to the pool's
|
|
``recovery_priority`` is added. The total priority is capped at ``139``.
|
|
|
|
.. list-table:: Backfill and Recovery op priorities
|
|
:widths: 20 20 20
|
|
:header-rows: 1
|
|
|
|
* - Description
|
|
- Base priority
|
|
- Maximum priority
|
|
* - Backfill
|
|
- 100
|
|
- 139
|
|
* - Degraded Backfill
|
|
- 140
|
|
- 179
|
|
* - Recovery
|
|
- 180
|
|
- 219
|
|
* - Inactive Recovery
|
|
- 220
|
|
- 253
|
|
* - Inactive Backfill
|
|
- 220
|
|
- 253
|
|
* - force-backfill
|
|
- 254
|
|
-
|
|
* - force-recovery
|
|
- 255
|
|
-
|
|
|