=====================
Asynchronous Recovery
=====================

Ceph Placement Groups (PGs) maintain a log of write transactions to
facilitate speedy recovery of data. During recovery, each of these PG logs
is used to determine which content in each OSD is missing or outdated.
This obviates the need to scan all RADOS objects.
See :ref:`Log Based PG <log-based-pg>` for more details on this process.
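
As a rough illustration of why the log removes the need for a full scan, the
sketch below derives a peer's missing set from log entries newer than the
peer's last applied version. The names ``LogEntry``, ``missing_from_log`` and
the plain integer versions are hypothetical simplifications, not the OSD's
actual data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    version: int       # hypothetical: a single increasing write version
    object_name: str   # object touched by this write


def missing_from_log(log, peer_last_update):
    """Map each object the peer is behind on to the newest version it needs."""
    missing = {}
    for entry in log:
        if entry.version > peer_last_update:
            # Only the newest version of each object has to be recovered.
            current = missing.get(entry.object_name, 0)
            missing[entry.object_name] = max(current, entry.version)
    return missing


log = [
    LogEntry(1, "a"),
    LogEntry(2, "b"),
    LogEntry(3, "a"),
    LogEntry(4, "c"),
]
behind = missing_from_log(log, peer_last_update=2)
```

Only the objects named in log entries after the peer's last update are
examined; objects untouched since then never enter the calculation.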

Prior to the Nautilus release this recovery process was synchronous: it
blocked writes to a RADOS object until it was recovered. In contrast,
backfill could allow writes to proceed (assuming enough up-to-date replicas
were available) by temporarily assigning a different acting set, and
backfilling an OSD outside of the acting set. In some circumstances
this ends up being significantly better for availability, e.g. if the
PG log contains 3000 writes to disjoint objects.  When the PG log contains
thousands of entries, it could actually be faster (though not as safe) to
force backfill instead of log-based recovery, by deleting and redeploying the
containing OSD, than to iterate through the PG log.  Recovering several megabytes
of RADOS object data (or even worse, several megabytes of omap keys,
notably RGW bucket indexes) can drastically increase latency for a small
update, and combined with requests spread across many degraded objects
it is a recipe for slow requests.

To avoid this we can perform recovery in the background on an OSD
out-of-band of the live acting set, similar to backfill, but still using
the PG log to determine what needs to be done. This is known as *asynchronous
recovery*.

The threshold for performing asynchronous recovery instead of synchronous
recovery is not clear-cut. There are a few criteria which
need to be met for asynchronous recovery:

* Try to keep ``min_size`` replicas available
* Use the approximate magnitude of the difference in length of
  logs combined with historical missing objects to estimate the cost of
  recovery
* Use the parameter ``osd_async_recovery_min_cost`` to determine
  when asynchronous recovery is appropriate
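
The criteria above can be sketched as a small selection routine. This is a
minimal illustration under stated assumptions, not the OSD's implementation:
the function ``choose_async_targets``, its arguments, and the example costs
are hypothetical, and ``min_cost`` stands in for the value of
``osd_async_recovery_min_cost``.

```python
def choose_async_targets(costs, primary, min_size, min_cost):
    """Split candidate OSDs into an acting set and async recovery targets.

    costs maps each candidate OSD id (including the primary) to its
    approximate recovery cost.
    """
    acting = set(costs)
    async_targets = []
    # Consider the most expensive candidates first.
    by_cost = sorted(costs.items(), key=lambda kv: kv[1], reverse=True)
    for osd, cost in by_cost:
        if len(acting) <= min_size:
            break  # keep at least min_size replicas in the acting set
        if osd == primary or cost <= min_cost:
            continue  # cheap enough to recover synchronously
        acting.remove(osd)
        async_targets.append(osd)
    return sorted(acting), async_targets


# Illustrative values only.
acting, targets = choose_async_targets(
    {0: 0, 1: 500, 2: 300}, primary=0, min_size=2, min_cost=100
)
```

In this example both OSD 1 and OSD 2 exceed the cost threshold, but only the
most expensive one is moved out of the acting set, because removing both
would drop the acting set below ``min_size``.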

With the existing peering process, when we choose the acting set we
have not fetched the PG log from each peer; we have only the bounds of
it and other metadata from their ``pg_info_t``. It would be more expensive
to fetch and examine every log at this point, so for now we rely only on an
approximate comparison of log lengths. In Nautilus, we improved
the accounting of missing objects, so post-Nautilus this information
is also used to determine the cost of recovery.
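
A minimal sketch of such an estimate follows, assuming that only a
last-update version and a missing-object count are known for each peer. The
``PeerInfo`` class, its fields, and the integer versions are hypothetical
stand-ins for the metadata carried in ``pg_info_t``.

```python
from dataclasses import dataclass


@dataclass
class PeerInfo:
    last_update: int   # hypothetical: version of the newest log entry
    num_missing: int   # hypothetical: objects already known to be missing


def approx_recovery_cost(auth, peer):
    """Estimate recovery cost without fetching the peer's full log."""
    # The version gap approximates the difference in log length.
    log_gap = abs(auth.last_update - peer.last_update)
    return log_gap + peer.num_missing


auth = PeerInfo(last_update=1200, num_missing=0)
stale = PeerInfo(last_update=1000, num_missing=25)
cost = approx_recovery_cost(auth, stale)
```

The estimate uses only values that are already at hand during peering, which
is why it can be computed before any log is transferred.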

While async recovery is occurring, writes to members of the acting set
may proceed, but we need to send their log entries to the async
recovery targets (just like we do for backfill OSDs) so that they
can completely catch up.
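
The write path described above can be modelled roughly as follows. The
function ``fan_out_write`` and the message layout are hypothetical. Sending
only the log entry (and no object data) to async recovery targets is a
simplifying assumption made for this sketch, covering the case where the
target has not yet recovered the object.

```python
def fan_out_write(entry, acting, async_targets):
    """Return the per-OSD messages generated by one client write."""
    messages = {}
    for osd in acting:
        # Acting set members apply the write and record it in their log.
        messages[osd] = {"log_entry": entry, "apply_data": True}
    for osd in async_targets:
        # Assumption: the target records the entry now and recovers
        # the object contents later, in the background.
        messages[osd] = {"log_entry": entry, "apply_data": False}
    return messages


msgs = fan_out_write((5, "obj"), acting=[0, 2], async_targets=[1])
```

Because every target sees every log entry, its log stays contiguous with the
acting set's log, so it knows exactly which objects remain to be recovered.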