doc/dev: doc/dev/osd_internals caps, formatting, clarity
Signed-off-by: Anthony D'Atri <anthony.datri@gmail.com>
commit 33931a8330 (parent 8e530674ff)

@@ -89,12 +89,12 @@ scheme between replication and erasure coding depending on
its usage and each pool can be placed in a different storage
location depending on the required performance.

Regarding how to use, please see ``osd_internals/manifest.rst``

Usage Patterns
==============

The different Ceph interface layers present potentially different opportunities
and costs for deduplication and tiering in general.

RadosGW

@@ -107,7 +107,7 @@ overwrites. As such, it makes sense to fingerprint and dedup up front.
Unlike cephfs and rbd, radosgw has a system for storing
explicit metadata in the head object of a logical s3 object for
locating the remaining pieces. As such, radosgw could use the
refcounting machinery (``osd_internals/refcount.rst``) directly without
needing direct support from rados for manifests.

RBD/Cephfs

@@ -131,14 +131,14 @@ support needs robust support for snapshots.
RADOS Machinery
===============

For more information on rados redirect/chunk/dedup support, see ``osd_internals/manifest.rst``.
For more information on rados refcount support, see ``osd_internals/refcount.rst``.

Status and Future Work
======================

At the moment, there exists some preliminary support for manifest
objects within the OSD as well as a dedup tool.

RadosGW data warehouse workloads probably represent the largest
opportunity for this feature, so the first priority is probably to add

@@ -146,6 +146,6 @@ direct support for fingerprinting and redirects into the refcount pool
to radosgw.

Aside from radosgw, completing work on manifest object support in the
OSD, particularly as it relates to snapshots, would be the next step for
rbd and cephfs workloads.

@@ -2,46 +2,52 @@
Asynchronous Recovery
=====================

Ceph Placement Groups (PGs) maintain a log of write transactions to
facilitate speedy recovery of data. During recovery, each of these PG logs
is used to determine which content in each OSD is missing or outdated.
This obviates the need to scan all RADOS objects.
See :ref:`Log Based PG <log-based-pg>` for more details on this process.

Prior to the Nautilus release this recovery process was synchronous: it
blocked writes to a RADOS object until it was recovered. In contrast,
backfill could allow writes to proceed (assuming enough up-to-date replicas
were available) by temporarily assigning a different acting set, and
backfilling an OSD outside of the acting set. In some circumstances
this ends up being significantly better for availability, e.g. if the
PG log contains 3000 writes to disjoint objects. When the PG log contains
thousands of entries, it could actually be faster (though not as safe) to
trade backfill for recovery by deleting and redeploying the containing
OSD than to iterate through the PG log. Recovering several megabytes
of RADOS object data (or even worse, several megabytes of omap keys,
notably RGW bucket indexes) can drastically increase latency for a small
update, and combined with requests spread across many degraded objects
it is a recipe for slow requests.

To avoid this, we can perform recovery in the background on an OSD
out-of-band of the live acting set, similar to backfill, but still using
the PG log to determine what needs to be done. This is known as *asynchronous
recovery*.

The threshold for performing asynchronous recovery instead of synchronous
recovery is not clear-cut. There are a few criteria which
need to be met for asynchronous recovery:

* Try to keep ``min_size`` replicas available
* Use the approximate magnitude of the difference in length of
  logs combined with historical missing objects to estimate the cost of
  recovery
* Use the parameter ``osd_async_recovery_min_cost`` to determine
  when asynchronous recovery is appropriate
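
As a rough illustration of the cost check described in the list above, consider
the following C++ sketch. The names (``PeerRecoveryEstimate``,
``should_recover_async``) are invented for the example and do not correspond to
the actual OSD code; only the shape of the comparison against
``osd_async_recovery_min_cost`` is taken from the description above. ::

   #include <cstdint>

   // Hypothetical inputs, for illustration only: not the actual OSD types.
   struct PeerRecoveryEstimate {
     uint64_t approx_missing_objects;  // historical missing-object count
     uint64_t log_length_diff;         // |primary log length - peer log length|
   };

   // Combine the approximate log-length difference with the missing-object
   // estimate and compare against the configured minimum cost.
   bool should_recover_async(const PeerRecoveryEstimate& peer,
                             uint64_t osd_async_recovery_min_cost)
   {
     const uint64_t approx_cost =
         peer.approx_missing_objects + peer.log_length_diff;
     return approx_cost >= osd_async_recovery_min_cost;
   }
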

With the existing peering process, when we choose the acting set we
have not fetched the PG log from each peer; we have only the bounds of
it and other metadata from their ``pg_info_t``. It would be more expensive
to fetch and examine every log at this point, so we only consider an
approximate check for log length for now. In Nautilus, we improved
the accounting of missing objects, so post-Nautilus this information
is also used to determine the cost of recovery.

While async recovery is occurring, writes to members of the acting set
may proceed, but we need to send their log entries to the async
recovery targets (just like we do for backfill OSDs) so that they
can completely catch up.

@@ -2,64 +2,91 @@
Backfill Reservation
====================

When a new OSD joins a cluster, all PGs with it in their acting sets must
eventually backfill. If all of these backfills happen simultaneously
they will present excessive load on the OSD: the "thundering herd"
effect.

The ``osd_max_backfills`` tunable limits the number of outgoing or
incoming backfills that are active on a given OSD. Note that this limit is
applied separately to incoming and to outgoing backfill operations.
Thus there can be as many as ``osd_max_backfills * 2`` backfill operations
in flight on each OSD. This subtlety is often missed, and Ceph
operators can be puzzled as to why more ops are observed than expected.

Each ``OSDService`` now has two ``AsyncReserver`` instances: one for backfills going
from the OSD (``local_reserver``) and one for backfills going to the OSD
(``remote_reserver``). An ``AsyncReserver`` (``common/AsyncReserver.h``)
manages a queue by priority of waiting items and a set of current reservation
holders. When a slot frees up, the ``AsyncReserver`` queues the ``Context*``
associated with the next item on the highest priority queue in the finisher
provided to the constructor.
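
The reservation pattern described above can be pictured with the following
small, self-contained C++ sketch. It is a simplified stand-in, not the actual
``common/AsyncReserver.h`` implementation: the item and callback types here are
assumptions, and grants run inline rather than through a finisher. ::

   #include <deque>
   #include <functional>
   #include <iterator>
   #include <map>
   #include <set>
   #include <string>
   #include <utility>

   // Simplified reserver: waiting items are queued by priority and granted
   // as slots free up, mirroring the behavior described above.
   class SimpleReserver {
    public:
     explicit SimpleReserver(unsigned max_allowed) : max_allowed(max_allowed) {}

     void request(const std::string& item, unsigned prio,
                  std::function<void()> on_grant) {
       queues[prio].push_back({item, std::move(on_grant)});
       do_queues();
     }

     void cancel(const std::string& item) {
       granted.erase(item);   // frees a slot if the item held one
       do_queues();
     }

    private:
     struct Waiter {
       std::string item;
       std::function<void()> on_grant;
     };

     void do_queues() {
       // Grant from the highest-priority queue first.
       while (granted.size() < max_allowed && !queues.empty()) {
         auto hi = std::prev(queues.end());
         Waiter w = std::move(hi->second.front());
         hi->second.pop_front();
         if (hi->second.empty())
           queues.erase(hi);
         granted.insert(w.item);
         w.on_grant();  // in the OSD this would be queued on a finisher
       }
     }

     unsigned max_allowed;
     std::map<unsigned, std::deque<Waiter>> queues;  // priority -> waiters
     std::set<std::string> granted;
   };
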

For a primary to initiate a backfill it must first obtain a reservation from
its own ``local_reserver``. Then it must obtain a reservation from the backfill
target's ``remote_reserver`` via a ``MBackfillReserve`` message. This process is
managed by sub-states of ``Active`` and ``ReplicaActive`` (see the sub-states
of ``Active`` in ``PG.h``). The reservations are dropped either on the ``Backfilled``
event (which is sent on the primary before calling ``recovery_complete``
and on the replica on receipt of the ``BackfillComplete`` progress message),
or upon leaving ``Active`` or ``ReplicaActive``.

It's important to always grab the local reservation before the remote
reservation in order to prevent a circular dependency.

We minimize the risk of data loss by prioritizing the order in
which PGs are recovered. Admins can override the default order by using
``force-recovery`` or ``force-backfill``. A ``force-recovery`` with op
priority ``255`` will start before a ``force-backfill`` op at priority ``254``.

If a recovery is needed because a PG is below ``min_size``, a base priority of
``220`` is used. This is incremented by the number of OSDs short of the pool's
``min_size`` as well as a value relative to the pool's ``recovery_priority``.
The resultant priority is capped at ``253`` so that it does not confound forced
ops as described above. Under ordinary circumstances a recovery op is
prioritized at ``180`` plus a value relative to the pool's ``recovery_priority``.
The resultant priority is capped at ``219``.

If a backfill op is needed because the number of acting OSDs is less than
the pool's ``min_size``, a priority of ``220`` is used. The number of OSDs
short of the pool's ``min_size`` is added as well as a value relative to
the pool's ``recovery_priority``. The total priority is limited to ``253``.
If a backfill op is needed because a PG is undersized,
a priority of ``140`` is used. The number of OSDs below the size of the pool is
added as well as a value relative to the pool's ``recovery_priority``. The
resultant priority is capped at ``179``. If a backfill op is
needed because a PG is degraded, a priority of ``140`` is used. A value
relative to the pool's ``recovery_priority`` is added. The resultant priority
is capped at ``179``. Under ordinary circumstances a
backfill op priority of ``100`` is used. A value relative to the pool's
``recovery_priority`` is added. The total priority is capped at ``139``.

.. list-table:: Backfill and Recovery op priorities
   :widths: 20 20 20
   :header-rows: 1

   * - Description
     - Base priority
     - Maximum priority
   * - Backfill
     - 100
     - 139
   * - Degraded Backfill
     - 140
     - 179
   * - Recovery
     - 180
     - 219
   * - Inactive Recovery
     - 220
     - 253
   * - Inactive Backfill
     - 220
     - 253
   * - force-backfill
     - 254
     -
   * - force-recovery
     - 255
     -
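
The base-plus-offset scheme above can be summarized with a small, illustrative
C++ sketch. The function and parameter names are hypothetical, and the clamping
mirrors the caps listed in the table rather than the exact OSD implementation. ::

   #include <algorithm>
   #include <cstdint>

   // Illustrative only: compute a recovery op priority from a base value,
   // the pool's recovery_priority, and (for inactive recovery) the number
   // of OSDs short of min_size, clamped to the maximums listed above.
   inline unsigned recovery_op_priority(bool below_min_size,
                                        int pool_recovery_priority,
                                        unsigned osds_short_of_min_size)
   {
     if (below_min_size) {
       // "Inactive Recovery": base 220, capped at 253.
       int prio = 220 + static_cast<int>(osds_short_of_min_size) +
                  pool_recovery_priority;
       return static_cast<unsigned>(std::clamp(prio, 220, 253));
     }
     // Ordinary recovery: base 180, capped at 219.
     int prio = 180 + pool_recovery_priority;
     return static_cast<unsigned>(std::clamp(prio, 180, 219));
   }
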

@@ -2,42 +2,42 @@
last_epoch_started
======================

``info.last_epoch_started`` records an activation epoch ``e`` for interval ``i``
such that all writes committed in ``i`` or earlier are reflected in the
local info/log and no writes after ``i`` are reflected in the local
info/log. Since no committed write is ever divergent, even if we
get an authoritative log/info with an older ``info.last_epoch_started``,
we can leave our ``info.last_epoch_started`` alone since no writes could
have committed in any intervening interval (see ``PG::proc_master_log``).

``info.history.last_epoch_started`` records a lower bound on the most
recent interval in which the PG as a whole went active and accepted
writes. On a particular OSD it is also an upper bound on the
activation epoch of intervals in which writes in the local PG log
occurred: we update it before accepting writes. Because all
committed writes are committed by all acting set OSDs, any
non-divergent writes ensure that ``history.last_epoch_started`` was
recorded by all acting set members in the interval. Once peering has
queried one OSD from each interval back to some seen
``history.last_epoch_started``, it follows that no interval after the max
``history.last_epoch_started`` can have reported writes as committed
(since we record it before recording client writes in an interval).
Thus, the minimum ``last_update`` across all infos with
``info.last_epoch_started >= MAX(history.last_epoch_started)`` must be an
upper bound on writes reported as committed to the client.
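
To make that bound concrete, here is a small, illustrative C++ sketch that
computes it over a set of peer infos. The ``PeerInfo`` struct and the plain
integer stand-in for ``eversion_t`` are assumptions made for the example; they
are not the real ``pg_info_t``. ::

   #include <algorithm>
   #include <cstdint>
   #include <vector>

   // Minimal stand-ins for the fields used in the rule above.
   struct PeerInfo {
     uint32_t last_epoch_started;          // info.last_epoch_started
     uint32_t history_last_epoch_started;  // info.history.last_epoch_started
     uint64_t last_update;                 // stand-in for eversion_t
   };

   // Upper bound on writes reported as committed: the minimum last_update
   // across infos whose last_epoch_started >= max(history.last_epoch_started).
   uint64_t committed_upper_bound(const std::vector<PeerInfo>& infos)
   {
     uint32_t max_hles = 0;
     for (const auto& i : infos)
       max_hles = std::max(max_hles, i.history_last_epoch_started);

     uint64_t bound = UINT64_MAX;
     for (const auto& i : infos)
       if (i.last_epoch_started >= max_hles)
         bound = std::min(bound, i.last_update);
     return bound;  // UINT64_MAX if no info qualifies
   }
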

We update ``info.last_epoch_started`` with the initial activation message,
but we only update ``history.last_epoch_started`` after the new
``info.last_epoch_started`` is persisted (possibly along with the first
write). This ensures that we do not require an OSD with the most
recent ``info.last_epoch_started`` until all acting set OSDs have recorded
it.

In ``find_best_info``, we do include ``info.last_epoch_started`` values when
calculating ``max_last_epoch_started_found`` because we want to avoid
designating a log entry divergent which in a prior interval would have
been non-divergent since it might have been used to serve a read. In
``activate()``, we use the peer's ``last_epoch_started`` value as a bound on
how far back divergent log entries can be found.

However, in a case like

@@ -49,12 +49,12 @@ However, in a case like
  calc_acting osd.4 1.4e( v 473'302 (120'121,473'302] local-les=473 n=4 ec=5 les/c 473/473 556/556/556
  calc_acting osd.5 1.4e( empty local-les=0 n=0 ec=5 les/c 473/473 556/556/556

since osd.1 is the only one which recorded ``info.les=477``, while osd.4 and osd.0
(which were the acting set in that interval) did not (osd.4 restarted and osd.0
did not get the message in time), the PG is marked incomplete when
either osd.4 or osd.0 would have been valid choices. To avoid this, we do not
consider ``info.les`` for incomplete peers when calculating
``min_last_epoch_started_found``. It would not have been in the acting
set, so we must have another OSD from that interval anyway (if
``maybe_went_rw``). If that OSD does not remember that ``info.les``, then we
cannot have served reads.

@@ -11,7 +11,7 @@ Why PrimaryLogPG?
-----------------

Currently, consistency for all Ceph pool types is ensured by primary
log-based replication. This goes for both erasure-coded (EC) and
replicated pools.

Primary log-based replication

@@ -19,25 +19,25 @@ Primary log-based replication
Reads must return data written by any write which completed (where the
client could possibly have received a commit message). There are lots
of ways to handle this, but Ceph's architecture makes it easy for
everyone at any map epoch to know who the primary is. Thus, the easy
answer is to route all writes for a particular PG through a single
ordering primary and then out to the replicas. Though we only
actually need to serialize writes on a single RADOS object (and even then,
the partial ordering only really needs to provide an ordering between
writes on overlapping regions), we might as well serialize writes on
the whole PG since it lets us represent the current state of the PG
using two numbers: the epoch of the map on the primary in which the
most recent write started (this is a bit stranger than it might seem
since map distribution itself is asynchronous -- see Peering and the
concept of interval changes) and an increasing per-PG version number
-- this is referred to in the code with type ``eversion_t`` and stored as
``pg_info_t::last_update``. Furthermore, we maintain a log of "recent"
operations extending back at least far enough to include any
*unstable* writes (writes which have been started but not committed)
and objects which aren't up to date locally (see recovery and
backfill). In practice, the log will extend much further
(``osd_min_pg_log_entries`` when clean and ``osd_max_pg_log_entries`` when not
clean) because it's handy for quickly performing recovery.
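
The ``(epoch, version)`` pair described above can be pictured with a tiny C++
sketch. ``SimpleEversion`` is a simplified stand-in invented for the example,
not the actual ``eversion_t`` definition in ``osd_types.h``. ::

   #include <cstdint>
   #include <tuple>

   // Simplified stand-in for eversion_t: the epoch in which the write
   // started plus an increasing per-PG version number. Ordering compares
   // the epoch first, then the version.
   struct SimpleEversion {
     uint32_t epoch = 0;
     uint64_t version = 0;

     bool operator<(const SimpleEversion& rhs) const {
       return std::tie(epoch, version) < std::tie(rhs.epoch, rhs.version);
     }
     bool operator==(const SimpleEversion& rhs) const {
       return epoch == rhs.epoch && version == rhs.version;
     }
   };

   // A log entry newer than pg_info_t::last_update cannot have been
   // reported as committed to a client by this replica.
   inline bool newer_than_last_update(const SimpleEversion& entry,
                                      const SimpleEversion& last_update) {
     return last_update < entry;
   }
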

Using this log, as long as we talk to a non-empty subset of the OSDs

@@ -49,27 +49,27 @@ between the oldest head remembered by an element of that set (any
newer cannot have completed without that log containing it) and the
newest head remembered (clearly, all writes in the log were started,
so it's fine for us to remember them) as the new head. This is the
main point of divergence between replicated pools and EC pools in
``PG/PrimaryLogPG``: replicated pools try to choose the newest valid
option to avoid the client needing to replay those operations and
instead recover the other copies. EC pools instead try to choose
the *oldest* option available to them.

The reason for this gets to the heart of the rest of the differences
in implementation: one copy will not generally be enough to
reconstruct an EC object. Indeed, there are encodings where some log
combinations would leave unrecoverable objects (as with a ``k=4,m=2`` encoding
where 3 of the shards remember a write, but the other 3 do not -- we
cannot reconstruct either version). For this reason, log entries
representing *unstable* writes (writes not yet committed to the
client) must be rollbackable using only local information on EC pools.
Log entries in general may therefore be rollbackable (and in that case,
via a delayed application or via a set of instructions for rolling
back an in-place update) or not. Replicated pool log entries are
never able to be rolled back.

For more details, see ``PGLog.h/cc``, ``osd_types.h:pg_log_t``,
``osd_types.h:pg_log_entry_t``, and peering in general.

ReplicatedBackend/ECBackend unification strategy
================================================

@@ -77,13 +77,13 @@ ReplicatedBackend/ECBackend unification strategy
PGBackend
---------

The fundamental difference between replication and erasure coding
is that replication can do destructive updates while erasure coding
cannot. It would be really annoying if we needed to have two entire
implementations of ``PrimaryLogPG`` since there
are really only a few fundamental differences:

#. How reads work -- async only, requires remote reads for EC
#. How writes work -- either restricted to append, or must write aside and do a
   tpc
#. Whether we choose the oldest or newest possible head entry during peering

@@ -101,81 +101,81 @@ and so many similarities
Instead, we choose a few abstractions (and a few kludges) to paper over the differences:

#. ``PGBackend``
#. ``PGTransaction``
#. ``PG::choose_acting`` chooses between ``calc_replicated_acting`` and ``calc_ec_acting``
#. Various bits of the write pipeline disallow some operations based on pool
   type -- like omap operations, class operation reads, and writes which are
   not aligned appends (officially, so far) for EC
#. Misc other kludges here and there

``PGBackend`` and ``PGTransaction`` enable abstraction of differences 1 and 2 above
and the addition of 4 as needed to the log entries.

The replicated implementation is in ``ReplicatedBackend.h/cc`` and doesn't
require much additional explanation. More detail on the ``ECBackend`` can be
found in ``doc/dev/osd_internals/erasure_coding/ecbackend.rst``.

PGBackend Interface Explanation
===============================

Note: this is from a design document that predated the Firefly release
and is probably out of date w.r.t. some of the method names.

Readable vs Degraded
--------------------

For a replicated pool, an object is readable IFF it is present on
the primary (at the right version). For an EC pool, we need at least
``k`` shards present to perform a read, and we need it on the primary. For
this reason, ``PGBackend`` needs to include some interfaces for determining
when recovery is required to serve a read vs a write. This also
changes the rules for when peering has enough logs to prove that it

Core Changes:

- | ``PGBackend`` needs to be able to return ``IsPG(Recoverable|Readable)Predicate``
  | objects to allow the user to make these determinations.

Client Reads
------------

Reads from a replicated pool can always be satisfied
synchronously by the primary OSD. Within an erasure coded pool,
the primary will need to request data from some number of replicas in
order to satisfy a read. ``PGBackend`` will therefore need to provide
separate ``objects_read_sync`` and ``objects_read_async`` interfaces where
the former won't be implemented by the ``ECBackend``.

``PGBackend`` interfaces:

- ``objects_read_sync``
- ``objects_read_async``
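
As a purely hypothetical illustration of that split (the real declarations live
in ``PGBackend.h`` and differ), a sync path can return data directly while the
async path hands the result to a completion callback. All names and signatures
below are invented for the sketch. ::

   #include <cstdint>
   #include <functional>
   #include <string>
   #include <vector>

   // Hypothetical, simplified read interface: not the actual PGBackend API.
   struct ExampleReadBackend {
     // Replicated pools can serve reads directly from local data.
     virtual int objects_read_sync(const std::string& oid, uint64_t off,
                                   uint64_t len, std::vector<char>* out) = 0;

     // EC pools must gather shards from other OSDs, so the result is
     // delivered through a completion callback instead.
     virtual void objects_read_async(
         const std::string& oid, uint64_t off, uint64_t len,
         std::function<void(int ret, std::vector<char> data)> on_complete) = 0;

     virtual ~ExampleReadBackend() = default;
   };
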

Scrubs
------

We currently have two scrub modes with different default frequencies:

#. [shallow] scrub: compares the set of objects and metadata, but not
   the contents
#. deep scrub: compares the set of objects, metadata, and a CRC32 of
   the object contents (including omap)

The primary requests a scrubmap from each replica for a particular
range of objects. The replica fills out this scrubmap for the range
of objects including, if the scrub is deep, a CRC32 of the contents of
each object. The primary gathers these scrubmaps from each replica
and performs a comparison identifying inconsistent objects.
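
The comparison step can be pictured with a small, illustrative C++ sketch. The
``ScrubEntry`` type and the choice of treating the first map as the reference
are assumptions made for the example, not the real scrub code, and the sketch
ignores objects missing from the reference map. ::

   #include <cstdint>
   #include <map>
   #include <set>
   #include <string>
   #include <vector>

   // Illustrative per-object scrub record: object size and (for deep
   // scrub) a CRC32 of its contents.
   struct ScrubEntry {
     uint64_t size = 0;
     uint32_t crc32 = 0;
   };
   using ScrubMap = std::map<std::string, ScrubEntry>;  // object name -> entry

   // Compare the maps gathered from all replicas and report objects that
   // are missing somewhere or whose recorded size/CRC disagree.
   std::set<std::string> find_inconsistent(const std::vector<ScrubMap>& maps)
   {
     std::set<std::string> bad;
     if (maps.empty())
       return bad;
     const ScrubMap& ref = maps.front();  // e.g. the primary's map
     for (const auto& [name, entry] : ref) {
       for (size_t i = 1; i < maps.size(); ++i) {
         auto it = maps[i].find(name);
         if (it == maps[i].end() || it->second.size != entry.size ||
             it->second.crc32 != entry.crc32) {
           bad.insert(name);
           break;
         }
       }
     }
     return bad;
   }
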

Most of this can work essentially unchanged with erasure coded PG with
the caveat that the ``PGBackend`` implementation must be in charge of
actually doing the scan.

``PGBackend`` interfaces:

- ``be_*``

Recovery
--------

@@ -187,17 +187,17 @@ With the erasure coded strategy, we probably want to read the
minimum number of replica chunks required to reconstruct the object
and push out the replacement chunks concurrently.

Another difference is that objects in an erasure coded PG may be
unrecoverable without being unfound. The ``unfound`` state
should probably be renamed to ``unrecoverable``. Also, the
``PGBackend`` implementation will have to be able to direct the search
for PG replicas with unrecoverable object chunks and to be able
to determine whether a particular object is recoverable.

Core changes:

- ``s/unfound/unrecoverable``

``PGBackend`` interfaces:

@@ -6,14 +6,14 @@ Manifest
Introduction
============

As described in ``../deduplication.rst``, adding transparent redirect
machinery to RADOS would enable a more capable tiering solution
than RADOS currently has with "cache/tiering".

See ``../deduplication.rst``

At a high level, each object has a piece of metadata embedded in
the ``object_info_t`` which can map subsets of the object data payload
to (refcounted) objects in other pools.

This document exists to detail:

@@ -29,22 +29,22 @@ Intended Usage Model
RBD
---

For RBD, the primary goal is for either an OSD-internal agent or a
cluster-external agent to be able to transparently shift portions
of the constituent 4MB extents between a dedup pool and a hot base
pool.

As such, RBD operations (including class operations and snapshots)
must have the same observable results regardless of the current
status of the object.

Moreover, tiering/dedup operations must interleave with RBD operations
without changing the result.

Thus, here is a sketch of how I'd expect a tiering agent to perform
basic operations:

* Demote cold RBD chunk to slow pool:

  1. Read object, noting current user_version.
  2. In memory, run CDC implementation to fingerprint object.

@@ -52,12 +52,12 @@ basic operations:
     using the CAS class.
  4. Submit operation to base pool:

     * ``ASSERT_VER`` with the user version from the read to fail if the
       object has been mutated since the read.
     * ``SET_CHUNK`` for each of the extents to the corresponding object
       in the base pool.
     * ``EVICT_CHUNK`` for each extent to free up space in the base pool.
       Results in each chunk being marked ``MISSING``.

  RBD users should then either see the state prior to the demotion or
  subsequent to it.

@@ -65,23 +65,23 @@ basic operations:
  Note that between 3 and 4, we potentially leak references, so a
  periodic scrub would be needed to validate refcounts.

* Promote cold RBD chunk to fast pool.

  1. Submit ``TIER_PROMOTE``

For clones, all of the above would be identical except that the
initial read would need a ``LIST_SNAPS`` to determine which clones exist
and the ``PROMOTE`` or ``SET_CHUNK``/``EVICT`` operations would need to include
the ``cloneid``.
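
Step 2 of the demote flow above fingerprints the object in memory. The C++
sketch below illustrates the idea with fixed-size extents; real deployments
would use content-defined chunking and a cryptographic fingerprint such as
SHA-256, while ``std::hash`` here is only a stand-in and the function name is
invented for the example. ::

   #include <cstddef>
   #include <cstdint>
   #include <functional>
   #include <string_view>
   #include <utility>
   #include <vector>

   // Split an object's payload into fixed-size extents and fingerprint each
   // one. Each (offset, fingerprint) pair would then drive a SET_CHUNK call
   // against the corresponding chunk-pool object.
   std::vector<std::pair<uint64_t, uint64_t>>
   fingerprint_extents(std::string_view payload, size_t chunk_size = 4096)
   {
     std::vector<std::pair<uint64_t, uint64_t>> out;
     for (size_t off = 0; off < payload.size(); off += chunk_size) {
       std::string_view extent = payload.substr(off, chunk_size);
       // std::hash is NOT collision-resistant; a real dedup pool would use
       // the pool's fingerprint_algorithm (sha1/sha256/sha512).
       uint64_t fp = std::hash<std::string_view>{}(extent);
       out.emplace_back(off, fp);
     }
     return out;
   }
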

RadosGW
-------

For reads, RADOS Gateway (RGW) could operate as RBD does above, relying on the
manifest machinery in the OSD to hide the distinction between the object
being dedup'd or present in the base pool.

For writes, RGW could operate as RBD does above, but could
optionally have the freedom to fingerprint prior to doing the write.
In that case, it could immediately write out the target objects to the
CAS pool and then atomically write an object with the corresponding

@@ -104,8 +104,8 @@ At a high level, our future work plan is:
- Snapshots: We want to be able to deduplicate portions of clones
  below the level of the rados snapshot system. As such, the
  rados operations below need to be extended to work correctly on
  clones (e.g.: we should be able to call ``SET_CHUNK`` on a clone, clear the
  corresponding extent in the base pool, and correctly maintain OSD metadata).
- Cache/tiering: Ultimately, we'd like to be able to deprecate the existing
  cache/tiering implementation, but to do that we need to ensure that we
  can address the same use cases.

@@ -116,22 +116,22 @@ Cleanups
The existing implementation has some things that need to be cleaned up:

* ``SET_REDIRECT``: Should create the object if it doesn't exist, otherwise
  one couldn't create an object atomically as a redirect.
* ``SET_CHUNK``:

  * Appears to trigger a new clone as user_modify gets set in
    ``do_osd_ops``. This probably isn't desirable; see the Snapshots section
    below for some options on how generally to mix these operations
    with snapshots. At a minimum, ``SET_CHUNK`` probably shouldn't set
    user_modify.
  * Appears to assume that the corresponding section of the object
    does not exist (sets ``FLAG_MISSING``) but does not check whether the
    corresponding extent exists already in the object. Should always
    leave the extent clean.
  * Appears to clear the manifest unconditionally if not chunked;
    that's probably wrong. We should return an error if it's a
    ``REDIRECT`` ::

      case CEPH_OSD_OP_SET_CHUNK:
        if (oi.manifest.is_redirect()) {

@@ -140,33 +140,33 @@ The existing implementation has some things that need to be cleaned up:
        }

* ``TIER_PROMOTE``:

  * ``SET_REDIRECT`` clears the contents of the object. ``PROMOTE`` appears
    to copy them back in, but does not unset the redirect or clear the
    reference. This violates the invariant that a redirect object
    should be empty in the base pool. In particular, as long as the
    redirect is set, it appears that all operations will be proxied
    even after the promote, defeating the purpose. We do want ``PROMOTE``
    to be able to atomically replace a redirect with the actual
    object, so the solution is to clear the redirect at the end of the
    promote.
  * For a chunked manifest, we appear to flush prior to promoting.
    Promotion will often be used to prepare an object for low latency
    reads and writes; accordingly, the only effect should be to read
    any ``MISSING`` extents into the base pool. No flushing should be done.

* High Level:

  * It appears that ``FLAG_DIRTY`` should never be used for an extent pointing
    at a dedup extent. Writing the mutated extent back to the dedup pool
    requires writing a new object since the previous one cannot be mutated,
    just as it would if it hadn't been dedup'd yet. Thus, we should always
    drop the reference and remove the manifest pointer.

  * There isn't currently a way to "evict" an object region. With the above
    change to ``SET_CHUNK`` to always retain the existing object region, we
    need an ``EVICT_CHUNK`` operation to then remove the extent.


Testing

@@ -176,18 +176,18 @@ We rely really heavily on randomized failure testing. As such, we need
to extend that testing to include dedup/manifest support as well. Here's
a short list of the touchpoints:

* Thrasher tests like ``qa/suites/rados/thrash/workloads/cache-snaps.yaml``

  That test, of course, tests the existing cache/tiering machinery. Add
  additional files to that directory that instead set up a dedup pool. Add
  support to ``ceph_test_rados`` (``src/test/osd/TestRados*``).

* RBD tests

  Add a test that runs an RBD workload concurrently with blind
  promote/evict operations.

* RGW

  Add a test that runs an RGW workload concurrently with blind
  promote/evict operations.

@@ -196,39 +196,39 @@ a short list of the touchpoints:
Snapshots
---------

Fundamentally, we need to be able to manipulate the manifest
status of clones because we want to be able to dynamically promote,
flush (if the state was dirty when the clone was created), and evict
extents from clones.

As such, the plan is to allow the ``object_manifest_t`` for each clone
to be independent. Here's an incomplete list of the high level
tasks:

* Modify the op processing pipeline to permit ``SET_CHUNK`` and ``EVICT_CHUNK``
  to operate directly on clones.
* Ensure that recovery checks the object_manifest prior to trying to
  use the overlaps in clone_range. ``ReplicatedBackend::calc_*_subsets``
  are the two methods that would likely need to be modified.

See ``snaps.rst`` for a rundown of the ``librados`` snapshot system and OSD
support details. I'd like to call out one particular data structure
we may want to exploit.

The dedup-tool needs to be updated to use ``LIST_SNAPS`` to discover
clones as part of leak detection.

An important question is how we deal with the fact that many clones
will frequently have references to the same backing chunks at the same
offset. In particular, ``make_writeable`` will generally create a clone
that shares the same ``object_manifest_t`` references with the exception
of any extents modified in that transaction. The metadata that
commits as part of that transaction must therefore map onto the same
refcount as before, because otherwise we'd have to first increment
refcounts on backing objects (or risk a reference to a dead object).
Thus, we introduce a simple convention: consecutive clones which
share a reference at the same offset share the same refcount. This
means that a write that invokes ``make_writeable`` may decrease refcounts,
but not increase them. This has some consequences for removing clones.
Consider the following sequence ::

@@ -257,9 +257,9 @@ Consider the following sequence ::
  10 : [0, 512) aaa, [512, 1024) bbb
       refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=1

What should the refcount for ``aaa`` be at the end? By our
above rule, it should be ``2`` since the two ``aaa`` refs are not
contiguous. However, consider removing clone ``20`` ::

  initial:
   head: [0, 512) aaa, [512, 1024) bbb

@@ -271,22 +271,22 @@ contiguous. However, consider removing clone 20 ::
  10 : [0, 512) aaa, [512, 1024) bbb
       refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=0

At this point, our rule dictates that ``refcount(aaa)`` is ``1``.
This means that removing ``20`` needs to check for refs held by
the clones on either side which will then match.

See ``osd_types.h:object_manifest_t::calc_refs_to_drop_on_removal``
for the logic implementing this rule.

This seems complicated, but it gets us two valuable properties:

1) The refcount change from make_writeable will not block on
   incrementing a ref
2) We don't need to load the ``object_manifest_t`` for every clone
   to determine how to handle removing one -- just the ones
   immediately preceding and succeeding it.

All clone operations will need to consider adjacent ``chunk_maps``
when adding or removing references.
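
The adjacency rule can be illustrated with a small C++ sketch. The data layout
below (a vector of clones, each mapping offsets to chunk fingerprints) is a
simplification invented for the example, not the real ``object_manifest_t``. ::

   #include <cstdint>
   #include <map>
   #include <string>
   #include <vector>

   // Simplified clone manifest: offset -> fingerprint of the backing chunk.
   using CloneChunks = std::map<uint64_t, std::string>;

   // When removing clones[idx], a reference at a given offset is dropped
   // only if neither the preceding nor the succeeding clone shares the
   // same chunk at that offset (consecutive sharers share one refcount).
   std::vector<std::string> refs_to_drop_on_removal(
       const std::vector<CloneChunks>& clones, size_t idx)
   {
     std::vector<std::string> to_drop;
     auto shares = [&](size_t other, uint64_t off, const std::string& fp) {
       auto it = clones[other].find(off);
       return it != clones[other].end() && it->second == fp;
     };
     for (const auto& [off, fp] : clones[idx]) {
       bool shared = (idx > 0 && shares(idx - 1, off, fp)) ||
                     (idx + 1 < clones.size() && shares(idx + 1, off, fp));
       if (!shared)
         to_drop.push_back(fp);
     }
     return to_drop;
   }
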

Cache/Tiering

@@ -296,10 +296,10 @@ There already exists a cache/tiering mechanism based on whiteouts.
One goal here should ultimately be for this manifest machinery to
provide a complete replacement.

See ``cache-pool.rst``

The manifest machinery already shares some code paths with the
existing cache/tiering code, mainly ``stat_flush``.

In no particular order, here's an incomplete list of things that need
to be wired up to provide feature parity:

@@ -308,7 +308,7 @@ to be wired up to provide feature parity:
  for maintaining bloom filters which provide estimates of access
  recency for objects. We probably need to modify this to permit
  hitset maintenance for a normal pool -- there are already
  ``CEPH_OSD_OP_PG_HITSET*`` interfaces for querying them.
* Tiering agent: The OSD already has a background tiering agent which
  would need to be modified to instead flush and evict using
  manifests.

@@ -318,7 +318,7 @@ to be wired up to provide feature parity:
  - hitset
  - age, ratio, bytes

* Add tiering-mode to ``manifest-tiering``

  - Writeback
  - Read-only

@@ -326,8 +326,8 @@ to be wired up to provide feature parity:
Data Structures
===============

Each RADOS object contains an ``object_manifest_t`` embedded within the
``object_info_t`` (see ``osd_types.h``):

::

@@ -342,15 +342,15 @@ object_info_t (see osd_types.h):
    std::map<uint64_t, chunk_info_t> chunk_map;
  }

The ``type`` enum reflects three possible states an object can be in:

1. ``TYPE_NONE``: normal RADOS object
2. ``TYPE_REDIRECT``: object payload is backed by a single object
   specified by ``redirect_target``
3. ``TYPE_CHUNKED``: object payload is distributed among objects with
   size and offset specified by the ``chunk_map``. ``chunk_map`` maps
   the offset of the chunk to a ``chunk_info_t`` as shown below, also
   specifying the ``length``, target OID, and ``flags``.

::

@@ -367,7 +367,7 @@ The type enum reflects three possible states an object can be in:
    cflag_t flags; // FLAG_*


``FLAG_DIRTY`` at this time can happen if an extent with a fingerprint
is written. This should be changed to drop the fingerprint instead.
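
To illustrate how a ``chunk_map`` keyed by offset resolves a read, here is a
small C++ sketch using a simplified chunk record; the field and function names
are assumptions for the example rather than the actual ``chunk_info_t``. ::

   #include <cstdint>
   #include <map>
   #include <optional>
   #include <string>

   // Simplified chunk record: where the extent lives in the chunk pool.
   struct SimpleChunkInfo {
     uint64_t length = 0;
     std::string target_oid;  // object in the chunk/CAS pool
   };

   // chunk_map: extent offset in the base object -> chunk record.
   using ChunkMap = std::map<uint64_t, SimpleChunkInfo>;

   // Find the chunk covering a byte offset: take the last entry whose key
   // is <= offset and check that the extent actually covers it.
   std::optional<SimpleChunkInfo> chunk_for_offset(const ChunkMap& m,
                                                   uint64_t offset)
   {
     auto it = m.upper_bound(offset);
     if (it == m.begin())
       return std::nullopt;
     --it;
     if (offset < it->first + it->second.length)
       return it->second;
     return std::nullopt;
   }
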

@@ -375,50 +375,48 @@ Request Handling
================

Similarly to cache/tiering, the initial touchpoint is
``maybe_handle_manifest_detail``.

For manifest operations listed below, we return ``NOOP`` and continue onto
dedicated handling within ``do_osd_ops``.

For redirect objects which haven't been promoted (apparently ``oi.size >
0`` indicates that it's present?) we proxy reads and writes.

For reads on ``TYPE_CHUNKED``, if ``can_proxy_chunked_read`` (basically, all
of the ops are reads of extents in the ``object_manifest_t chunk_map``),
we proxy requests to those objects.

RADOS Interface
===============

To set up deduplication, one must provision two pools. One will act as the
base pool and the other will act as the chunk pool. The base pool needs to be
configured with the ``fingerprint_algorithm`` option as follows.

::

  ceph osd pool set $BASE_POOL fingerprint_algorithm sha1|sha256|sha512 \
    --yes-i-really-mean-it

Create objects ::

  rados -p base_pool put foo ./foo
  rados -p chunk_pool put foo-chunk ./foo-chunk

Make a manifest object ::

  rados -p base_pool set-chunk foo $START_OFFSET $END_OFFSET --target-pool chunk_pool foo-chunk $START_OFFSET --with-reference
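
The same setup could also be driven from C++ through ``librados``. The sketch
below is an illustration of the call flow only: it assumes that the
``set_chunk()`` signature quoted under "Operations" below is exposed on
``librados::ObjectWriteOperation``, and the pool and object names are the ones
from the CLI example above. Adjust to the headers of your Ceph version. ::

   #include <rados/librados.hpp>
   #include <string>

   // Assumption: set_chunk() is available on ObjectWriteOperation with the
   // signature documented below; verify against your librados headers.
   int make_manifest_object(uint64_t src_off, uint64_t src_len, uint64_t tgt_off)
   {
     librados::Rados cluster;
     int r = cluster.init("admin");          // connect as client.admin
     if (r < 0) return r;
     cluster.conf_read_file(nullptr);        // read the default ceph.conf
     r = cluster.connect();
     if (r < 0) return r;

     librados::IoCtx base, chunk;
     cluster.ioctx_create("base_pool", base);
     cluster.ioctx_create("chunk_pool", chunk);

     // Map [src_off, src_off + src_len) of "foo" onto "foo-chunk".
     librados::ObjectWriteOperation op;
     op.set_chunk(src_off, src_len, chunk, "foo-chunk", tgt_off);
     r = base.operate("foo", &op);

     cluster.shutdown();
     return r;
   }
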

Operations:

* ``set-redirect``

  Set a redirection between a ``base_object`` in the ``base_pool`` and a ``target_object``
  in the ``target_pool``.
  A redirected object will forward all operations from the client to the
  ``target_object``. ::

    void set_redirect(const std::string& tgt_obj, const IoCtx& tgt_ioctx,
                      uint64_t tgt_version, int flag = 0);

@@ -426,8 +424,8 @@ Operations:
    rados -p base_pool set-redirect <base_object> --target-pool <target_pool>
      <target_object>

  Returns ``ENOENT`` if the object does not exist (TODO: why?)
  Returns ``EINVAL`` if the object already is a redirect.

  Takes a reference to target as part of operation, can possibly leak a ref
  if the acting set resets and the client dies between taking the ref and
@ -435,19 +433,19 @@ Operations:
|
||||
|
||||
Truncates object, clears omap, and clears xattrs as a side effect.
|
||||
|
||||
At the top of do_osd_ops, does not set user_modify.
|
||||
At the top of ``do_osd_ops``, does not set user_modify.
|
||||
|
||||
This operation is not a user mutation and does not trigger a clone to be created.
|
||||
|
||||
The purpose of set_redirect is two.
|
||||
There are two purposes of ``set_redirect``:
|
||||
|
||||
1. Redirect all operation to the target object (like proxy)
|
||||
2. Cache when tier_promote is called (redirect will be cleared at this time).
|
||||
2. Cache when ``tier_promote`` is called (redirect will be cleared at this time).
|
||||
|
||||
* ``set-chunk``

  Set the ``chunk-offset`` in a ``source_object`` to make a link between it and a
  ``target_object``. ::

    void set_chunk(uint64_t src_offset, uint64_t src_length, const IoCtx& tgt_ioctx,
                   std::string tgt_oid, uint64_t tgt_offset, int flag = 0);

@ -455,10 +453,10 @@ Operations:

    rados -p base_pool set-chunk <source_object> <offset> <length> --target-pool
      <caspool> <target_object> <target-offset>

  Returns ``ENOENT`` if the object does not exist (TODO: why?)
  Returns ``EINVAL`` if the object already is a redirect.
  Returns ``EINVAL`` on an ill-formed parameter buffer.
  Returns ``ENOTSUPP`` if existing mapped chunks overlap with new chunk mapping.

  Takes references to targets as part of operation, can possibly leak refs
  if the acting set resets and the client dies between taking the ref and

@ -468,36 +466,36 @@ Operations:

  This operation is not a user mutation and does not trigger a clone to be created.

  TODO: ``SET_CHUNK`` appears to clear the manifest unconditionally if it's not chunked. ::

    if (!oi.manifest.is_chunked()) {
      oi.manifest.clear();
    }

* ``evict-chunk``

  Clears an extent from an object leaving only the manifest link between
  it and the ``target_object``. ::

    void evict_chunk(
      uint64_t offset, uint64_t length, int flag = 0);

    rados -p base_pool evict-chunk <offset> <length> <object>

  Returns ``EINVAL`` if the extent is not present in the manifest.

  Note: this does not exist yet.

* ``tier-promote``

  Promotes the object, ensuring that subsequent reads and writes will be local. ::

    void tier_promote();

    rados -p base_pool tier-promote <obj-name>

  Returns ``ENOENT`` if the object does not exist.

  For a redirect manifest, copies data to head.

@ -506,17 +504,17 @@ Operations:

  For a chunked manifest, reads all MISSING extents into the base pool;
  subsequent reads and writes will be served from the base pool.

  Implementation Note: For a chunked manifest, calls ``start_copy`` on itself. The
  resulting ``copy_get`` operation will issue reads which will then be redirected by
  the normal manifest read machinery.

  Does not set the ``user_modify`` flag.

  Future work will involve adding support for specifying a ``clone_id``.

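  As an illustration (not from the original text), a librados C++ sketch of
  promoting an object ahead of a read-heavy workload, assuming ``tier_promote``
  is exposed on ``librados::ObjectWriteOperation`` as in the signature above;
  the object name is a placeholder. ::

    #include <rados/librados.hpp>

    // Pull all missing extents of "foo" into the base pool so that
    // subsequent reads and writes are served locally.
    int promote_object(librados::IoCtx& base) {
      librados::ObjectWriteOperation op;
      op.tier_promote();
      return base.operate("foo", &op);
    }
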
* ``unset-manifest``

  Unset the manifest info in an object that has a manifest. ::

    void unset_manifest();

@ -525,63 +523,61 @@ Operations:

  Clears manifest chunks or redirect. Lazily releases references, may
  leak.

  ``do_osd_ops`` seems not to include it in the ``user_modify=false`` ignorelist,
  and so will trigger a snapshot. Note that this will be true even for a
  redirect though ``SET_REDIRECT`` does not flip ``user_modify``. This should
  be fixed -- ``unset-manifest`` should not be a ``user_modify``.

* ``tier-flush``

  Flush the object which has chunks to the chunk pool. ::

    void tier_flush();

    rados -p base_pool tier-flush <obj-name>

  Included in the ``user_modify=false`` ignorelist, does not trigger a clone.

  Does not evict the extents.

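  As a companion to the ``tier-promote`` sketch above (again illustrative,
  assuming ``tier_flush`` is exposed on ``librados::ObjectWriteOperation`` as in
  the signature above; the object name is a placeholder). ::

    #include <rados/librados.hpp>

    // Write the chunked data of "foo" back to the chunk pool.  Per the note
    // above, this does not evict the extents from the base pool.
    int flush_object(librados::IoCtx& base) {
      librados::ObjectWriteOperation op;
      op.tier_flush();
      return base.operate("foo", &op);
    }
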
ceph-dedup-tool
===============

``ceph-dedup-tool`` has two features: finding an optimal chunk offset for dedup chunking
and fixing the reference count (see ``./refcount.rst``).

* Find an optimal chunk offset

  a. Fixed chunk

  To find an optimal fixed chunk length, run the following command multiple
  times while changing the ``chunk_size``. ::

    ceph-dedup-tool --op estimate --pool $POOL --chunk-size chunk_size
      --chunk-algorithm fixed --fingerprint-algorithm sha1|sha256|sha512

  b. Rabin chunk (Rabin-Karp algorithm)

  Rabin-Karp is a string-searching algorithm based
  on a rolling hash. But a rolling hash is not enough to do deduplication because
  we don't know the chunk boundary. So, we need content-based slicing using
  a rolling hash for content-defined chunking.
  The current implementation uses the simplest approach: look for chunk boundaries
  by inspecting the rolling hash for a pattern (like the
  lower N bits are all zeroes).

  - Usage

  Users who want to use deduplication need to find an ideal chunk offset.
  To find the ideal chunk offset, users should discover
  the optimal configuration for their data workload via ``ceph-dedup-tool``.
  This information will then be used for object chunking through
  the ``set-chunk`` API. ::

    ceph-dedup-tool --op estimate --pool $POOL --min-chunk min_size
      --chunk-algorithm rabin --fingerprint-algorithm rabin

  ``ceph-dedup-tool`` has several options for tuning ``rabin chunk``. ::

    --mod-prime <uint64_t>
    --rabin-prime <uint64_t>
@ -591,37 +587,37 @@ and fixing the reference count (see ./refcount.rst).
    --min-chunk <uint32_t>
    --max-chunk <uint64_t>

  Users need to refer to the following equation to use the above options for ``rabin chunk``. ::

    rabin_hash =
      (rabin_hash * rabin_prime + new_byte - old_byte * pow) % (mod_prime)

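  To make the equation concrete, here is an illustrative C++ sketch of
  content-defined chunking (not the actual ``ceph-dedup-tool`` implementation):
  it maintains the rolling hash above over a sliding window and declares a
  chunk boundary whenever the lower N bits of the hash are all zeroes. The
  function and parameter names are placeholders that mirror the options above. ::

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Return the byte offsets at which chunks end.  Assumes mod_prime and
    // rabin_prime are small enough (< 2^32) that the intermediate products
    // fit in 64 bits.
    std::vector<size_t> find_boundaries(const std::vector<uint8_t>& data,
                                        uint64_t rabin_prime, uint64_t mod_prime,
                                        size_t window, unsigned n_bits) {
      std::vector<size_t> boundaries;
      // pow = rabin_prime^(window-1) mod mod_prime: the weight of the byte
      // that drops out of the window at each step.
      uint64_t pow = 1;
      for (size_t i = 1; i < window; ++i)
        pow = (pow * rabin_prime) % mod_prime;

      const uint64_t mask = (1ull << n_bits) - 1;
      uint64_t rabin_hash = 0;
      for (size_t i = 0; i < data.size(); ++i) {
        uint64_t new_byte = data[i];
        uint64_t old_byte = (i >= window) ? data[i - window] : 0;
        // rabin_hash = (rabin_hash * rabin_prime + new_byte - old_byte * pow) % mod_prime
        rabin_hash = (rabin_hash * rabin_prime % mod_prime + new_byte) % mod_prime;
        rabin_hash = (rabin_hash + mod_prime - (old_byte * pow) % mod_prime) % mod_prime;
        // Boundary test: the lower n_bits of the rolling hash are all zeroes.
        if (i + 1 >= window && (rabin_hash & mask) == 0)
          boundaries.push_back(i + 1);
      }
      return boundaries;
    }
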
  c. Fixed chunk vs content-defined chunk

  Content-defined chunking may or may not be the optimal solution.
  For example,

  Data chunk ``A`` : ``abcdefgabcdefgabcdefg``

  Let's think about Data chunk ``A``'s deduplication. The ideal chunk offset is
  from ``1`` to ``7`` (``abcdefg``). So, if we use fixed chunk, ``7`` is the optimal
  chunk length. But, in the case of content-based slicing, the optimal chunk
  length may not be found (the dedup ratio will not be 100%), because we need to
  find optimal parameters such as the boundary bit, window size, and prime
  value, which is not as simple as with fixed chunking.
  But content-defined chunking is very effective in the following case.

  Data chunk ``B`` : ``abcdefgabcdefgabcdefg``

  Data chunk ``C`` : ``Tabcdefgabcdefgabcdefg``

* Fix reference count

  The key idea behind reference counting for dedup is that it can be
  false-positive, which means ``(manifest object (no ref), chunk object (has ref))``
  can happen instead of ``(manifest object (has ref), chunk 1 (no ref))``.
  To fix such inconsistencies, ``ceph-dedup-tool`` supports ``chunk_scrub``. ::

    ceph-dedup-tool --op chunk_scrub --chunk_pool $CHUNK_POOL

@ -2,8 +2,8 @@

OSD Throttles
=============

There are three significant throttles in the FileStore OSD back end:
wbthrottle, op_queue_throttle, and a throttle based on journal usage.

WBThrottle
----------

@ -17,7 +17,7 @@ flushing and block in FileStore::_do_op if we have exceeded any hard

limits until the background flusher catches up.

The relevant config options are filestore_wbthrottle*. There are
different defaults for XFS and Btrfs. Each set has hard and soft
limits on bytes (total dirty bytes), ios (total dirty ios), and
inodes (total dirty fds). The WBThrottle will begin flushing
when any of these hits the soft limit and will block in throttle()

@ -2,9 +2,9 @@

Partial Object Recovery
=======================

Partial Object Recovery improves the efficiency of log-based recovery (vs
backfill). Original log-based recovery calculates missing_set based on pg_log
differences.

The whole object should be recovered from one OSD to another
if the object is indicated modified by pg_log regardless of how much

@ -26,11 +26,11 @@ Scrubbing Behavior Table

State variables
---------------

- Periodic tick state is ``!must_scrub && !must_deep_scrub && !time_for_deep``
- Periodic tick after ``osd_deep_scrub_interval`` state is ``!must_scrub && !must_deep_scrub && time_for_deep``
- Initiated scrub state is ``must_scrub && !must_deep_scrub && !time_for_deep``
- Initiated scrub after ``osd_deep_scrub_interval`` state is ``must_scrub && !must_deep_scrub && time_for_deep``
- Initiated deep scrub state is ``must_scrub && must_deep_scrub``

Scrub Reservations
------------------

@ -27,7 +27,7 @@ See OSD::make_writeable

Ondisk Structures
-----------------

Each object has in the PG collection a *head* object (or *snapdir*, which we
will come to shortly) and possibly a set of *clone* objects.
Each hobject_t has a snap field. For the *head* (the only writeable version
of an object), the snap field is set to CEPH_NOSNAP. For the *clones*, the

@ -68,7 +68,7 @@ removal, we maintain a mapping from snap to *hobject_t* using the

See PrimaryLogPG::SnapTrimmer, SnapMapper

This trimming is performed asynchronously by the snap_trim_wq while the
PG is clean and not scrubbing.

#. The next snap in PG::snap_trimq is selected for trimming
#. We determine the next object for trimming out of PG::snap_mapper.

@ -90,7 +90,7 @@ pg is clean and not scrubbing.

Recovery
--------
Because the trim operations are implemented using repops and log entries,
normal PG peering and recovery maintain the snap trimmer operations with
the caveat that push and removal operations need to update the local
*SnapMapper* instance. If the purged_snaps update is lost, we merely
retrim a now empty snap.

@ -117,12 +117,12 @@ is constant length. These keys have a bufferlist encoding

pair<snapid, hobject_t> as a value. Thus, creating or trimming a single
object does not involve reading all objects for any snap. Additionally,
upon construction, the *SnapMapper* is provided with a mask for filtering
the objects in the single SnapMapper keyspace belonging to that PG.

Split
-----
The snapid_t -> hobject_t key entries are arranged such that for any PG,
up to 8 prefixes need to be checked to determine all hobjects in a particular
snap for a particular PG. Upon split, the prefixes to check on the parent
are adjusted such that only the objects remaining in the PG will be visible.
The children will immediately have the correct mapping.