doc/dev: doc/dev/osd_internals caps, formatting, clarity

Signed-off-by: Anthony D'Atri <anthony.datri@gmail.com>
Anthony D'Atri 2020-10-04 18:52:00 -07:00
parent 8e530674ff
commit 33931a8330
10 changed files with 383 additions and 354 deletions


@ -89,12 +89,12 @@ scheme between replication and erasure coding depending on
its usage and each pool can be placed in a different storage
location depending on the required performance.
Regarding how to use, please see osd_internals/manifest.rst
Regarding how to use, please see ``osd_internals/manifest.rst``
Usage Patterns
==============
The different ceph interface layers present potentially different oportunities
The different Ceph interface layers present potentially different opportunities
and costs for deduplication and tiering in general.
RadosGW
@ -107,7 +107,7 @@ overwrites. As such, it makes sense to fingerprint and dedup up front.
Unlike cephfs and rbd, radosgw has a system for storing
explicit metadata in the head object of a logical s3 object for
locating the remaining pieces. As such, radosgw could use the
refcounting machinery (osd_internals/refcount.rst) directly without
refcounting machinery (``osd_internals/refcount.rst``) directly without
needing direct support from rados for manifests.
RBD/Cephfs
@ -131,14 +131,14 @@ support needs robust support for snapshots.
RADOS Machinery
===============
For more information on rados redirect/chunk/dedup support, see osd_internals/manifest.rst.
For more information on rados refcount support, see osd_internals/refcount.rst.
For more information on rados redirect/chunk/dedup support, see ``osd_internals/manifest.rst``.
For more information on rados refcount support, see ``osd_internals/refcount.rst``.
Status and Future Work
======================
At the moment, there exists some preliminary support for manifest
objects within the osd as well as a dedup tool.
objects within the OSD as well as a dedup tool.
RadosGW data warehouse workloads probably represent the largest
opportunity for this feature, so the first priority is probably to add
@ -146,6 +146,6 @@ direct support for fingerprinting and redirects into the refcount pool
to radosgw.
Aside from radosgw, completing work on manifest object support in the
osd particularly as it relates to snapshots would be the next step for
OSD particularly as it relates to snapshots would be the next step for
rbd and cephfs workloads.


@ -2,46 +2,52 @@
Asynchronous Recovery
=====================
PGs in Ceph maintain a log of writes to allow speedy recovery of data.
Instead of scanning all of the objects to see what is missing on each
osd, we can examine the pg log to see which objects we need to
recover. See :ref:`Log Based PG <log-based-pg>` for more details on this process.
Ceph Placement Groups (PGs) maintain a log of write transactions to
facilitate speedy recovery of data. During recovery, each of these PG logs
is used to determine which content in each OSD is missing or outdated.
This obviates the need to scan all RADOS objects.
See :ref:`Log Based PG <log-based-pg>` for more details on this process.
Until now, this recovery process was synchronous - it blocked writes
to an object until it was recovered. In contrast, backfill could allow
writes to proceed (assuming enough up-to-date copies of the data were
available) by temporarily assigning a different acting set, and
backfilling an OSD outside of the acting set. In some circumstances,
Prior to the Nautilus release this recovery process was synchronous: it
blocked writes to a RADOS object until it was recovered. In contrast,
backfill could allow writes to proceed (assuming enough up-to-date replicas
were available) by temporarily assigning a different acting set, and
backfilling an OSD outside of the acting set. In some circumstances
this ends up being significantly better for availability, e.g. if the
pg log contains 3000 writes to different objects. Recovering several
megabytes of an object (or even worse, several megabytes of omap keys,
like rgw bucket indexes) can drastically increase latency for a small
PG log contains 3000 writes to disjoint objects. When the PG log contains
thousands of entries, it could actually be faster (though not as safe) to
trade backfill for recovery by deleting and redeploying the containing
OSD than to iterate through the PG log. Recovering several megabytes
of RADOS object data (or even worse, several megabytes of omap keys,
notably RGW bucket indexes) can drastically increase latency for a small
update, and combined with requests spread across many degraded objects
it is a recipe for slow requests.
To avoid this, we can perform recovery in the background on an OSD out
of the acting set, similar to backfill, but still using the PG log to
determine what needs recovery. This is known as asynchronous recovery.
To avoid this we can perform recovery in the background on an OSD
out-of-band of the live acting set, similar to backfill, but still using
the PG log to determine what needs to be done. This is known as *asynchronous
recovery*.
Exactly when we perform asynchronous recovery instead of synchronous
recovery is not a clear-cut threshold. There are a few criteria which
The threshold for performing asynchronous recovery instead of synchronous
recovery is not clear-cut. There are a few criteria which
need to be met for asynchronous recovery:
* try to keep min_size replicas available
* use the approximate magnitude of the difference in length of
logs combined with historical missing objects as the cost of recovery
* use the parameter osd_async_recovery_min_cost to determine
* Try to keep ``min_size`` replicas available
* Use the approximate magnitude of the difference in length of
logs combined with historical missing objects to estimate the cost of
recovery
* Use the parameter ``osd_async_recovery_min_cost`` to determine
when asynchronous recovery is appropriate
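As a rough, hypothetical sketch of that cost check (not the actual peering
code; the real accounting lives in the OSD's peering machinery, and all names
below are illustrative): ::

    #include <cstdint>

    // Hypothetical sketch of the decision described above.  approx_missing is
    // the historical missing-object count for the candidate OSD, the log
    // lengths are the approximate bounds known from pg_info_t, and min_cost
    // stands in for the osd_async_recovery_min_cost option.
    bool should_recover_async(uint64_t approx_missing,
                              uint64_t auth_log_length,
                              uint64_t candidate_log_length,
                              uint64_t min_cost)
    {
      uint64_t log_gap = auth_log_length > candidate_log_length
                           ? auth_log_length - candidate_log_length
                           : 0;
      uint64_t cost = log_gap + approx_missing;
      // Expensive-to-recover peers become async recovery targets instead of
      // blocking writes as members of the acting set.
      return cost > min_cost;
    }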
With the existing peering process, when we choose the acting set we
have not fetched the pg log from each peer, we have only the bounds of
it and other metadata from their pg_info_t. It would be more expensive
have not fetched the PG log from each peer; we have only the bounds of
it and other metadata from their ``pg_info_t``. It would be more expensive
to fetch and examine every log at this point, so we only consider an
approximate check for log length for now. In Nautilus, we improved
the accounting of missing objects, so post nautilus, this information
the accounting of missing objects, so post-Nautilus this information
is also used to determine the cost of recovery.
While async recovery is occurring, writes on members of the acting set
While async recovery is occurring, writes to members of the acting set
may proceed, but we need to send their log entries to the async
recovery targets (just like we do for backfill osds) so that they
recovery targets (just like we do for backfill OSDs) so that they
can completely catch up.


@ -2,64 +2,91 @@
Backfill Reservation
====================
When a new osd joins a cluster, all pgs containing it must eventually backfill
to it. If all of these backfills happen simultaneously, it would put excessive
load on the osd. osd_max_backfills limits the number of outgoing or
incoming backfills on a single node. The maximum number of outgoing backfills is
osd_max_backfills. The maximum number of incoming backfills is
osd_max_backfills. Therefore there can be a maximum of osd_max_backfills * 2
simultaneous backfills on one osd.
When a new OSD joins a cluster all PGs with it in their acting sets must
eventually backfill. If all of these backfills happen simultaneously
they will present excessive load on the OSD: the "thundering herd"
effect.
Each OSDService now has two AsyncReserver instances: one for backfills going
from the osd (local_reserver) and one for backfills going to the osd
(remote_reserver). An AsyncReserver (common/AsyncReserver.h) manages a queue
by priority of waiting items and a set of current reservation holders. When a
slot frees up, the AsyncReserver queues the Context* associated with the next
item on the highest priority queue in the finisher provided to the constructor.
The ``osd_max_backfills`` tunable limits the number of outgoing or
incoming backfills that are active on a given OSD. Note that this limit is
applied separately to incoming and to outgoing backfill operations.
Thus there can be as many as ``osd_max_backfills * 2`` backfill operations
in flight on each OSD. This subtlety is often missed, and Ceph
operators can be puzzled as to why more ops are observed than expected.
For a primary to initiate a backfill, it must first obtain a reservation from
its own local_reserver. Then, it must obtain a reservation from the backfill
target's remote_reserver via a MBackfillReserve message. This process is
managed by substates of Active and ReplicaActive (see the substates of Active
in PG.h). The reservations are dropped either on the Backfilled event, which
is sent on the primary before calling recovery_complete and on the replica on
receipt of the BackfillComplete progress message), or upon leaving Active or
ReplicaActive.
Each ``OSDService`` now has two ``AsyncReserver`` instances: one for backfills going
from the OSD (``local_reserver``) and one for backfills going to the OSD
(``remote_reserver``). An ``AsyncReserver`` (``common/AsyncReserver.h``)
manages a queue by priority of waiting items and a set of current reservation
holders. When a slot frees up, the ``AsyncReserver`` queues the ``Context*``
associated with the next item on the highest priority queue in the finisher
provided to the constructor.
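For illustration only, a much-simplified, self-contained sketch of the
reservation pattern described above (the real ``AsyncReserver`` in
``common/AsyncReserver.h`` is templated, thread-safe, and hands the
``Context*`` to a ``Finisher``; the class and names below are hypothetical): ::

    #include <functional>
    #include <map>
    #include <utility>
    #include <vector>

    // Illustrative only: grants up to max_allowed reservations and queues the
    // rest by priority (higher value wins, so force-recovery at 255 would be
    // granted before force-backfill at 254).
    class ToyReserver {
      unsigned max_allowed;
      unsigned in_use = 0;
      // priority -> FIFO of callbacks to run when a slot is granted
      std::map<unsigned, std::vector<std::function<void()>>,
               std::greater<unsigned>> waiting;

    public:
      explicit ToyReserver(unsigned max) : max_allowed(max) {}

      void request(unsigned priority, std::function<void()> on_grant) {
        if (in_use < max_allowed) {
          ++in_use;
          on_grant();                 // slot free: grant immediately
        } else {
          waiting[priority].push_back(std::move(on_grant));
        }
      }

      void release() {                // a backfill finished or was cancelled
        --in_use;
        if (waiting.empty())
          return;
        auto it = waiting.begin();    // highest-priority bucket
        auto cb = std::move(it->second.front());
        it->second.erase(it->second.begin());
        if (it->second.empty())
          waiting.erase(it);
        ++in_use;
        cb();
      }
    };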
It's important that we always grab the local reservation before the remote
For a primary to initiate a backfill it must first obtain a reservation from
its own ``local_reserver``. Then it must obtain a reservation from the backfill
target's ``remote_reserver`` via a ``MBackfillReserve`` message. This process is
managed by sub-states of ``Active`` and ``ReplicaActive`` (see the sub-states
of ``Active`` in PG.h). The reservations are dropped either on the ``Backfilled``
event, which is sent on the primary before calling ``recovery_complete``
and on the replica on receipt of the ``BackfillComplete`` progress message),
or upon leaving ``Active`` or ``ReplicaActive``.
It's important to always grab the local reservation before the remote
reservation in order to prevent a circular dependency.
We want to minimize the risk of data loss by prioritizing the order in
which PGs are recovered. A user can override the default order by using
force-recovery or force-backfill. A force-recovery at priority 255 will start
before a force-backfill at priority 254.
We minimize the risk of data loss by prioritizing the order in
which PGs are recovered. Admins can override the default order by using
``force-recovery`` or ``force-backfill``. A ``force-recovery`` with op
priority ``255`` will start before a ``force-backfill`` op at priority ``254``.
If a recovery is needed because a PG is below min_size a base priority of 220
is used. The number of OSDs below min_size of the pool is added, as well as a
value relative to the pool's recovery_priority. The total priority is limited
to 253. Under ordinary circumstances a recovery is prioritized at 180 plus a
value relative to the pool's recovery_priority. The total priority is limited
to 219.
If a recovery is needed because a PG is below ``min_size`` a base priority of
``220`` is used. This is incremented by the number of OSDs short of the pool's
``min_size`` as well as a value relative to the pool's ``recovery_priority``.
The resultant priority is capped at ``253`` so that it does not confound forced
ops as described above. Under ordinary circumstances a recovery op is
prioritized at ``180`` plus a value relative to the pool's ``recovery_priority``.
The resultant priority is capped at ``219``.
If a backfill is needed because the number of acting OSDs is less than min_size,
a priority of 220 is used. The number of OSDs below min_size of the pool is
added as well as a value relative to the pool's recovery_priority. The total
priority is limited to 253. If a backfill is needed because a PG is undersized,
a priority of 140 is used. The number of OSDs below the size of the pool is
added as well as a value relative to the pool's recovery_priority. The total
priority is limited to 179. If a backfill is needed because a PG is degraded,
a priority of 140 is used. A value relative to the pool's recovery_priority is
added. The total priority is limited to 179. Under ordinary circumstances a
backfill is priority of 100 is used. A value relative to the pool's
recovery_priority is added. The total priority is limited to 139.
If a backfill op is needed because the number of acting OSDs is less than
the pool's ``min_size``, a priority of ``220`` is used. The number of OSDs
short of the pool's ``min_size`` is added as well as a value relative to
the pool's ``recovery_priority``. The total priority is limited to ``253``.
If a backfill op is needed because a PG is undersized,
a priority of ``140`` is used. The number of OSDs below the size of the pool is
added as well as a value relative to the pool's ``recovery_priority``. The
resultant priority is capped at ``179``. If a backfill op is
needed because a PG is degraded, a priority of ``140`` is used. A value
relative to the pool's ``recovery_priority`` is added. The resultant priority
is capped at ``179``. Under ordinary circumstances a
backfill op priority of ``100`` is used. A value relative to the pool's
``recovery_priority`` is added. The total priority is capped at ``139``.
.. list-table:: Backfill and Recovery op priorities
   :widths: 20 20 20
   :header-rows: 1

   * - Description
     - Base priority
     - Maximum priority
   * - Backfill
     - 100
     - 139
   * - Degraded Backfill
     - 140
     - 179
   * - Recovery
     - 180
     - 219
   * - Inactive Recovery
     - 220
     - 253
   * - Inactive Backfill
     - 220
     - 253
   * - force-backfill
     - 254
     -
   * - force-recovery
     - 255
     -
Description        Base priority  Maximum priority
-----------        -------------  ----------------
Backfill           100            139
Degraded Backfill  140            179
Recovery           180            219
Inactive Recovery  220            253
Inactive Backfill  220            253
force-backfill     254
force-recovery     255
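As a hypothetical sketch of the arithmetic described above (the base values and
caps follow the table; the clamping details of the real OSD code differ): ::

    #include <algorithm>
    #include <cstdint>

    // Hypothetical illustration of "base priority plus a bounded adjustment".
    // Base/cap pairs follow the table above, e.g. ordinary backfill 100/139,
    // degraded backfill 140/179, recovery 180/219, inactive recovery/backfill
    // 220/253.  The clamp to [base, cap] is a simplification.
    inline unsigned op_priority(unsigned base, unsigned cap,
                                int64_t pool_recovery_priority,
                                unsigned osds_missing = 0)
    {
      int64_t p = static_cast<int64_t>(base) + osds_missing + pool_recovery_priority;
      return static_cast<unsigned>(std::clamp<int64_t>(p, base, cap));
    }

    // e.g. op_priority(220, 253, 0, 1) yields 221 for an inactive PG missing one OSD.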


@ -2,42 +2,42 @@
last_epoch_started
======================
info.last_epoch_started records an activation epoch e for interval i
such that all writes committed in i or earlier are reflected in the
local info/log and no writes after i are reflected in the local
``info.last_epoch_started`` records an activation epoch ``e`` for interval ``i``
such that all writes committed in ``i`` or earlier are reflected in the
local info/log and no writes after ``i`` are reflected in the local
info/log. Since no committed write is ever divergent, even if we
get an authoritative log/info with an older info.last_epoch_started,
we can leave our info.last_epoch_started alone since no writes could
get an authoritative log/info with an older ``info.last_epoch_started``,
we can leave our ``info.last_epoch_started`` alone since no writes could
have committed in any intervening interval (See PG::proc_master_log).
info.history.last_epoch_started records a lower bound on the most
recent interval in which the pg as a whole went active and accepted
writes. On a particular osd, it is also an upper bound on the
activation epoch of intervals in which writes in the local pg log
occurred (we update it before accepting writes). Because all
committed writes are committed by all acting set osds, any
non-divergent writes ensure that history.last_epoch_started was
``info.history.last_epoch_started`` records a lower bound on the most
recent interval in which the PG as a whole went active and accepted
writes. On a particular OSD it is also an upper bound on the
activation epoch of intervals in which writes in the local PG log
occurred: we update it before accepting writes. Because all
committed writes are committed by all acting set OSDs, any
non-divergent writes ensure that ``history.last_epoch_started`` was
recorded by all acting set members in the interval. Once peering has
queried one osd from each interval back to some seen
history.last_epoch_started, it follows that no interval after the max
history.last_epoch_started can have reported writes as committed
queried one OSD from each interval back to some seen
``history.last_epoch_started``, it follows that no interval after the max
``history.last_epoch_started`` can have reported writes as committed
(since we record it before recording client writes in an interval).
Thus, the minimum last_update across all infos with
info.last_epoch_started >= MAX(history.last_epoch_started) must be an
Thus, the minimum ``last_update`` across all infos with
``info.last_epoch_started >= MAX(history.last_epoch_started)`` must be an
upper bound on writes reported as committed to the client.
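To make that rule concrete, here is a small hypothetical sketch (toy types
standing in for ``pg_info_t``) of computing the bound from a set of peer infos: ::

    #include <algorithm>
    #include <cstdint>
    #include <utility>
    #include <vector>

    using version_t = std::pair<uint64_t, uint64_t>;   // (epoch, version)

    struct peer_info {                 // hypothetical subset of pg_info_t
      uint64_t last_epoch_started;
      uint64_t history_last_epoch_started;
      version_t last_update;
    };

    // Minimum last_update over infos whose les >= max(history.les): an upper
    // bound on writes that could have been reported committed to clients.
    version_t committed_upper_bound(const std::vector<peer_info>& infos)
    {
      uint64_t max_hles = 0;
      for (const auto& i : infos)
        max_hles = std::max(max_hles, i.history_last_epoch_started);

      version_t bound{UINT64_MAX, UINT64_MAX};
      for (const auto& i : infos)
        if (i.last_epoch_started >= max_hles)
          bound = std::min(bound, i.last_update);
      return bound;
    }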
We update info.last_epoch_started with the initial activation message,
but we only update history.last_epoch_started after the new
info.last_epoch_started is persisted (possibly along with the first
write). This ensures that we do not require an osd with the most
recent info.last_epoch_started until all acting set osds have recorded
We update ``info.last_epoch_started`` with the initial activation message,
but we only update ``history.last_epoch_started`` after the new
``info.last_epoch_started`` is persisted (possibly along with the first
write). This ensures that we do not require an OSD with the most
recent ``info.last_epoch_started`` until all acting set OSDs have recorded
it.
In find_best_info, we do include info.last_epoch_started values when
calculating the max_last_epoch_started_found because we want to avoid
In ``find_best_info``, we do include ``info.last_epoch_started`` values when
calculating ``max_last_epoch_started_found`` because we want to avoid
designating a log entry divergent which in a prior interval would have
been non-divergent since it might have been used to serve a read. In
activate(), we use the peer's last_epoch_started value as a bound on
``activate()``, we use the peer's ``last_epoch_started`` value as a bound on
how far back divergent log entries can be found.
However, in a case like
@ -49,12 +49,12 @@ However, in a case like
calc_acting osd.4 1.4e( v 473'302 (120'121,473'302] local-les=473 n=4 ec=5 les/c 473/473 556/556/556
calc_acting osd.5 1.4e( empty local-les=0 n=0 ec=5 les/c 473/473 556/556/556
since osd.1 is the only one which recorded info.les=477 while 4,0
which were the acting set in that interval did not (4 restarted and 0
did not get the message in time) the pg is marked incomplete when
either 4 or 0 would have been valid choices. To avoid this, we do not
consider info.les for incomplete peers when calculating
min_last_epoch_started_found. It would not have been in the acting
set, so we must have another osd from that interval anyway (if
maybe_went_rw). If that osd does not remember that info.les, then we
since osd.1 is the only one which recorded info.les=477, while osd.4 and osd.0
(which were the acting set in that interval) did not (osd.4 restarted and osd.0
did not get the message in time), the PG is marked incomplete when
either osd.4 or osd.0 would have been valid choices. To avoid this, we do not
consider ``info.les`` for incomplete peers when calculating
``min_last_epoch_started_found``. It would not have been in the acting
set, so we must have another OSD from that interval anyway (if
``maybe_went_rw``). If that OSD does not remember that ``info.les``, then we
cannot have served reads.


@ -11,7 +11,7 @@ Why PrimaryLogPG?
-----------------
Currently, consistency for all ceph pool types is ensured by primary
log-based replication. This goes for both erasure-coded and
log-based replication. This goes for both erasure-coded (EC) and
replicated pools.
Primary log-based replication
@ -19,25 +19,25 @@ Primary log-based replication
Reads must return data written by any write which completed (where the
client could possibly have received a commit message). There are lots
of ways to handle this, but ceph's architecture makes it easy for
of ways to handle this, but Ceph's architecture makes it easy for
everyone at any map epoch to know who the primary is. Thus, the easy
answer is to route all writes for a particular pg through a single
answer is to route all writes for a particular PG through a single
ordering primary and then out to the replicas. Though we only
actually need to serialize writes on a single object (and even then,
actually need to serialize writes on a single RADOS object (and even then,
the partial ordering only really needs to provide an ordering between
writes on overlapping regions), we might as well serialize writes on
the whole PG since it lets us represent the current state of the PG
using two numbers: the epoch of the map on the primary in which the
most recent write started (this is a bit stranger than it might seem
since map distribution itself is asynchronous -- see Peering and the
concept of interval changes) and an increasing per-pg version number
-- this is referred to in the code with type eversion_t and stored as
pg_info_t::last_update. Furthermore, we maintain a log of "recent"
concept of interval changes) and an increasing per-PG version number
-- this is referred to in the code with type ``eversion_t`` and stored as
``pg_info_t::last_update``. Furthermore, we maintain a log of "recent"
operations extending back at least far enough to include any
*unstable* writes (writes which have been started but not committed)
and objects which aren't uptodate locally (see recovery and
backfill). In practice, the log will extend much further
(osd_min_pg_log_entries when clean, osd_max_pg_log_entries when not
(``osd_min_pg_log_entries`` when clean and ``osd_max_pg_log_entries`` when not
clean) because it's handy for quickly performing recovery.
Using this log, as long as we talk to a non-empty subset of the OSDs
@ -49,27 +49,27 @@ between the oldest head remembered by an element of that set (any
newer cannot have completed without that log containing it) and the
newest head remembered (clearly, all writes in the log were started,
so it's fine for us to remember them) as the new head. This is the
main point of divergence between replicated pools and ec pools in
PG/PrimaryLogPG: replicated pools try to choose the newest valid
main point of divergence between replicated pools and EC pools in
``PG/PrimaryLogPG``: replicated pools try to choose the newest valid
option to avoid the client needing to replay those operations and
instead recover the other copies. EC pools instead try to choose
the *oldest* option available to them.
The reason for this gets to the heart of the rest of the differences
in implementation: one copy will not generally be enough to
reconstruct an ec object. Indeed, there are encodings where some log
combinations would leave unrecoverable objects (as with a 4+2 encoding
reconstruct an EC object. Indeed, there are encodings where some log
combinations would leave unrecoverable objects (as with a ``k=4,m=2`` encoding
where 3 of the replicas remember a write, but the other 3 do not -- we
don't have enough shards of either version). For this reason, log entries
representing *unstable* writes (writes not yet committed to the
client) must be rollbackable using only local information on ec pools.
client) must be rollbackable using only local information on EC pools.
Log entries in general may therefore be rollbackable (and in that case,
via a delayed application or via a set of instructions for rolling
back an inplace update) or not. Replicated pool log entries are
never able to be rolled back.
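A toy sketch of that peering choice (hypothetical helper; it assumes at least
one contacted peer and ignores rollback constraints, which the real decision
must honor): ::

    #include <algorithm>
    #include <cstdint>
    #include <utility>
    #include <vector>

    using version_t = std::pair<uint64_t, uint64_t>;   // (epoch, version)

    // Hypothetical illustration: given the log heads of the contacted OSDs,
    // replicated pools roll forward to the newest head, while EC pools roll
    // back to the oldest head so that unstable writes can be rolled back
    // locally.  Assumes peer_heads is non-empty.
    version_t choose_head(const std::vector<version_t>& peer_heads, bool ec_pool)
    {
      return ec_pool
        ? *std::min_element(peer_heads.begin(), peer_heads.end())
        : *std::max_element(peer_heads.begin(), peer_heads.end());
    }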
For more details, see PGLog.h/cc, osd_types.h:pg_log_t,
osd_types.h:pg_log_entry_t, and peering in general.
For more details, see ``PGLog.h/cc``, ``osd_types.h:pg_log_t``,
``osd_types.h:pg_log_entry_t``, and peering in general.
ReplicatedBackend/ECBackend unification strategy
================================================
@ -77,13 +77,13 @@ ReplicatedBackend/ECBackend unification strategy
PGBackend
---------
So, the fundamental difference between replication and erasure coding
The fundamental difference between replication and erasure coding
is that replication can do destructive updates while erasure coding
cannot. It would be really annoying if we needed to have two entire
implementations of PrimaryLogPG, one for each of the two, if there
implementations of ``PrimaryLogPG`` since there
are really only a few fundamental differences:
#. How reads work -- async only, requires remote reads for ec
#. How reads work -- async only, requires remote reads for EC
#. How writes work -- either restricted to append, or must write aside and do a
tpc
#. Whether we choose the oldest or newest possible head entry during peering
@ -101,81 +101,81 @@ and so many similarities
Instead, we choose a few abstractions (and a few kludges) to paper over the differences:
#. PGBackend
#. PGTransaction
#. PG::choose_acting chooses between calc_replicated_acting and calc_ec_acting
#. ``PGBackend``
#. ``PGTransaction``
#. ``PG::choose_acting`` chooses between ``calc_replicated_acting`` and ``calc_ec_acting``
#. Various bits of the write pipeline disallow some operations based on pool
type -- like omap operations, class operation reads, and writes which are
not aligned appends (officially, so far) for ec
not aligned appends (officially, so far) for EC
#. Misc other kludges here and there
PGBackend and PGTransaction enable abstraction of differences 1, 2,
``PGBackend`` and ``PGTransaction`` enable abstraction of differences 1 and 2 above
and the addition of 4 as needed to the log entries.
The replicated implementation is in ReplicatedBackend.h/cc and doesn't
require much explanation, I think. More detail on the ECBackend can be
found in doc/dev/osd_internals/erasure_coding/ecbackend.rst.
The replicated implementation is in ``ReplicatedBackend.h/cc`` and doesn't
require much additional explanation. More detail on the ``ECBackend`` can be
found in ``doc/dev/osd_internals/erasure_coding/ecbackend.rst``.
PGBackend Interface Explanation
===============================
Note: this is from a design document from before the original firefly
Note: this is from a design document that predated the Firefly release
and is probably out of date w.r.t. some of the method names.
Readable vs Degraded
--------------------
For a replicated pool, an object is readable iff it is present on
the primary (at the right version). For an ec pool, we need at least
M shards present to do a read, and we need it on the primary. For
this reason, PGBackend needs to include some interfaces for determining
For a replicated pool, an object is readable IFF it is present on
the primary (at the right version). For an EC pool, we need at least
``k`` shards present to perform a read, and we need it on the primary. For
this reason, ``PGBackend`` needs to include some interfaces for determining
when recovery is required to serve a read vs a write. This also
changes the rules for when peering has enough logs to prove that it
Core Changes:
- | PGBackend needs to be able to return IsPG(Recoverable|Readable)Predicate
- | ``PGBackend`` needs to be able to return ``IsPG(Recoverable|Readable)Predicate``
| objects to allow the user to make these determinations.
Client Reads
------------
Reads with the replicated strategy can always be satisfied
synchronously out of the primary OSD. With an erasure coded strategy,
Reads from a replicated pool can always be satisfied
synchronously by the primary OSD. Within an erasure coded pool,
the primary will need to request data from some number of replicas in
order to satisfy a read. PGBackend will therefore need to provide
separate objects_read_sync and objects_read_async interfaces where
the former won't be implemented by the ECBackend.
order to satisfy a read. ``PGBackend`` will therefore need to provide
separate ``objects_read_sync`` and ``objects_read_async`` interfaces where
the former won't be implemented by the ``ECBackend``.
PGBackend interfaces:
``PGBackend`` interfaces:
- objects_read_sync
- objects_read_async
- ``objects_read_sync``
- ``objects_read_async``
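A minimal sketch of what such a split interface could look like (hypothetical
names and signatures, not the real ``PGBackend`` declarations): ::

    #include <cstdint>
    #include <functional>
    #include <string>
    #include <vector>

    using read_completion = std::function<void(int r, std::vector<char> data)>;

    struct ToyPGBackend {              // hypothetical interface, for illustration
      // Synchronous read: only possible when the whole object (at the right
      // version) lives on the primary, so a replicated backend can offer it...
      virtual int objects_read_sync(const std::string& oid, uint64_t off,
                                    uint64_t len, std::vector<char>* out) = 0;
      // ...while an EC backend must gather shards from other OSDs and therefore
      // only provides the asynchronous form.
      virtual void objects_read_async(const std::string& oid, uint64_t off,
                                      uint64_t len, read_completion on_done) = 0;
      virtual ~ToyPGBackend() = default;
    };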
Scrub
-----
Scrubs
------
We currently have two scrub modes with different default frequencies:
#. [shallow] scrub: compares the set of objects and metadata, but not
the contents
#. deep scrub: compares the set of objects, metadata, and a crc32 of
#. deep scrub: compares the set of objects, metadata, and a CRC32 of
the object contents (including omap)
The primary requests a scrubmap from each replica for a particular
range of objects. The replica fills out this scrubmap for the range
of objects including, if the scrub is deep, a crc32 of the contents of
of objects including, if the scrub is deep, a CRC32 of the contents of
each object. The primary gathers these scrubmaps from each replica
and performs a comparison identifying inconsistent objects.
Most of this can work essentially unchanged with erasure coded PG with
the caveat that the PGBackend implementation must be in charge of
the caveat that the ``PGBackend`` implementation must be in charge of
actually doing the scan.
PGBackend interfaces:
``PGBackend`` interfaces:
- be_*
- ``be_*``
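A toy sketch of the comparison step described above (hypothetical types; a real
scrubmap carries much more per-object state): ::

    #include <cstdint>
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    struct toy_scrub_entry { uint64_t size; uint32_t data_crc; };  // hypothetical
    using toy_scrubmap = std::map<std::string, toy_scrub_entry>;   // oid -> entry

    // Return the objects on which the gathered replica scrubmaps disagree:
    // missing on some replica, or differing size/CRC between replicas.
    std::set<std::string> find_inconsistent(const std::vector<toy_scrubmap>& maps)
    {
      std::set<std::string> all, bad;
      for (const auto& m : maps)
        for (const auto& kv : m)
          all.insert(kv.first);
      for (const auto& oid : all) {
        const toy_scrub_entry* ref = nullptr;
        for (const auto& m : maps) {
          auto it = m.find(oid);
          if (it == m.end()) { bad.insert(oid); break; }
          if (!ref) { ref = &it->second; continue; }
          if (it->second.size != ref->size || it->second.data_crc != ref->data_crc) {
            bad.insert(oid);
            break;
          }
        }
      }
      return bad;
    }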
Recovery
--------
@ -187,17 +187,17 @@ With the erasure coded strategy, we probably want to read the
minimum number of replica chunks required to reconstruct the object
and push out the replacement chunks concurrently.
Another difference is that objects in erasure coded pg may be
unrecoverable without being unfound. The "unfound" concept
should probably then be renamed to unrecoverable. Also, the
PGBackend implementation will have to be able to direct the search
for pg replicas with unrecoverable object chunks and to be able
Another difference is that objects in an erasure coded PG may be
unrecoverable without being unfound. The ``unfound`` state
should probably be renamed to ``unrecoverable``. Also, the
``PGBackend`` implementation will have to be able to direct the search
for PG replicas with unrecoverable object chunks and to be able
to determine whether a particular object is recoverable.
Core changes:
- s/unfound/unrecoverable
- ``s/unfound/unrecoverable``
PGBackend interfaces:


@ -6,14 +6,14 @@ Manifest
Introduction
============
As described in ../deduplication.rst, adding transparent redirect
As described in ``../deduplication.rst``, adding transparent redirect
machinery to RADOS would enable a more capable tiering solution
than RADOS currently has with "cache/tiering".
See ../deduplication.rst
See ``../deduplication.rst``
At a high level, each object has a piece of metadata embedded in
the object_info_t which can map subsets of the object data payload
the ``object_info_t`` which can map subsets of the object data payload
to (refcounted) objects in other pools.
This document exists to detail:
@ -29,22 +29,22 @@ Intended Usage Model
RBD
---
For RBD, the primary goal is for either an osd-internal agent or a
For RBD, the primary goal is for either an OSD-internal agent or a
cluster-external agent to be able to transparently shift portions
of the constituent 4MB extents between a dedup pool and a hot base
pool.
As such, rbd operations (including class operations and snapshots)
As such, RBD operations (including class operations and snapshots)
must have the same observable results regardless of the current
status of the object.
Moreover, tiering/dedup operations must interleave with rbd operations
Moreover, tiering/dedup operations must interleave with RBD operations
without changing the result.
Thus, here is a sketch of how I'd expect a tiering agent to perform
basic operations:
* Demote cold rbd chunk to slow pool:
* Demote cold RBD chunk to slow pool:
1. Read object, noting current user_version.
2. In memory, run CDC implementation to fingerprint object.
@ -52,12 +52,12 @@ basic operations:
using the CAS class.
4. Submit operation to base pool:
* ASSERT_VER with the user version from the read to fail if the
* ``ASSERT_VER`` with the user version from the read to fail if the
object has been mutated since the read.
* SET_CHUNK for each of the extents to the corresponding object
* ``SET_CHUNK`` for each of the extents to the corresponding object
in the base pool.
* EVICT_CHUNK for each extent to free up space in the base pool.
Results in each chunk being marked MISSING.
* ``EVICT_CHUNK`` for each extent to free up space in the base pool.
Results in each chunk being marked ``MISSING``.
RBD users should then either see the state prior to the demotion or
subsequent to it.
@ -65,23 +65,23 @@ basic operations:
Note that between 3 and 4, we potentially leak references, so a
periodic scrub would be needed to validate refcounts.
* Promote cold rbd chunk to fast pool.
* Promote cold RBD chunk to fast pool.
1. Submit TIER_PROMOTE
1. Submit ``TIER_PROMOTE``
For clones, all of the above would be identical except that the
initial read would need a LIST_SNAPS to determine which clones exist
and the PROMOTE or SET_CHUNK/EVICT operations would need to include
the cloneid.
initial read would need a ``LIST_SNAPS`` to determine which clones exist
and the ``PROMOTE`` or ``SET_CHUNK``/``EVICT`` operations would need to include
the ``cloneid``.
RadosGW
-------
For reads, RadosGW could operate as RBD above relying on the manifest
machinery in the OSD to hide the distinction between the object being
dedup'd or present in the base pool
For reads, RADOS Gateway (RGW) could operate as RBD does above, relying on the
manifest machinery in the OSD to hide the distinction between the object
being dedup'd or present in the base pool.
For writes, RadosGW could operate as RBD does above, but it could
For writes, RGW could operate as RBD does above, but could
optionally have the freedom to fingerprint prior to doing the write.
In that case, it could immediately write out the target objects to the
CAS pool and then atomically write an object with the corresponding
@ -104,8 +104,8 @@ At a high level, our future work plan is:
- Snapshots: We want to be able to deduplicate portions of clones
below the level of the rados snapshot system. As such, the
rados operations below need to be extended to work correctly on
clones (e.g.: we should be able to call SET_CHUNK on a clone, clear the
corresponding extent in the base pool, and correctly maintain osd metadata).
clones (e.g.: we should be able to call ``SET_CHUNK`` on a clone, clear the
corresponding extent in the base pool, and correctly maintain OSD metadata).
- Cache/tiering: Ultimately, we'd like to be able to deprecate the existing
cache/tiering implementation, but to do that we need to ensure that we
can address the same use cases.
@ -116,22 +116,22 @@ Cleanups
The existing implementation has some things that need to be cleaned up:
* SET_REDIRECT: Should create the object if it doesn't exist, otherwise
* ``SET_REDIRECT``: Should create the object if it doesn't exist, otherwise
one couldn't create an object atomically as a redirect.
* SET_CHUNK:
* ``SET_CHUNK``:
* Appears to trigger a new clone as user_modify gets set in
do_osd_ops. This probably isn't desirable, see Snapshots section
``do_osd_ops``. This probably isn't desirable, see Snapshots section
below for some options on how generally to mix these operations
with snapshots. At a minimum, SET_CHUNK probably shouldn't set
with snapshots. At a minimum, ``SET_CHUNK`` probably shouldn't set
user_modify.
* Appears to assume that the corresponding section of the object
does not exist (sets FLAG_MISSING) but does not check whether the
does not exist (sets ``FLAG_MISSING``) but does not check whether the
corresponding extent exists already in the object. Should always
leave the extent clean.
* Appears to clear the manifest unconditionally if not chunked;
that's probably wrong. We should return an error if it's a
REDIRECT ::
``REDIRECT`` ::
case CEPH_OSD_OP_SET_CHUNK:
if (oi.manifest.is_redirect()) {
@ -140,33 +140,33 @@ The existing implementation has some things that need to be cleaned up:
}
* TIER_PROMOTE:
* ``TIER_PROMOTE``:
* SET_REDIRECT clears the contents of the object. PROMOTE appears
* ``SET_REDIRECT`` clears the contents of the object. ``PROMOTE`` appears
to copy them back in, but does not unset the redirect or clear the
reference. This violates the invariant that a redirect object
should be empty in the base pool. In particular, as long as the
redirect is set, it appears that all operations will be proxied
even after the promote defeating the purpose. We do want PROMOTE
even after the promote defeating the purpose. We do want ``PROMOTE``
to be able to atomically replace a redirect with the actual
object, so the solution is to clear the redirect at the end of the
promote.
* For a chunked manifest, we appear to flush prior to promoting.
Promotion will often be used to prepare an object for low latency
reads and writes, accordingly, the only effect should be to read
any MISSING extents into the base pool. No flushing should be done.
any ``MISSING`` extents into the base pool. No flushing should be done.
* High Level:
* It appears that FLAG_DIRTY should never be used for an extent pointing
* It appears that ``FLAG_DIRTY`` should never be used for an extent pointing
at a dedup extent. Writing the mutated extent back to the dedup pool
requires writing a new object since the previous one cannot be mutated,
just as it would if it hadn't been dedup'd yet. Thus, we should always
drop the reference and remove the manifest pointer.
* There isn't currently a way to "evict" an object region. With the above
change to SET_CHUNK to always retain the existing object region, we
need an EVICT_CHUNK operation to then remove the extent.
change to ``SET_CHUNK`` to always retain the existing object region, we
need an ``EVICT_CHUNK`` operation to then remove the extent.
Testing
@ -176,18 +176,18 @@ We rely really heavily on randomized failure testing. As such, we need
to extend that testing to include dedup/manifest support as well. Here's
a short list of the touchpoints:
* Thrasher tests like qa/suites/rados/thrash/workloads/cache-snaps.yaml
* Thrasher tests like ``qa/suites/rados/thrash/workloads/cache-snaps.yaml``
That test, of course, tests the existing cache/tiering machinery. Add
additional files to that directory that instead set up a dedup pool. Add
support to ceph_test_rados (src/test/osd/TestRados*).
support to ``ceph_test_rados`` (``src/test/osd/TestRados*``).
* RBD tests
Add a test that runs an rbd workload concurrently with blind
Add a test that runs an RBD workload concurrently with blind
promote/evict operations.
* RadosGW
* RGW
Add a test that runs an RGW workload concurrently with blind
promote/evict operations.
@ -196,39 +196,39 @@ a short list of the touchpoints:
Snapshots
---------
Fundamentally, I think we need to be able to manipulate the manifest
Fundamentally we need to be able to manipulate the manifest
status of clones because we want to be able to dynamically promote,
flush (if the state was dirty when the clone was created), and evict
extents from clones.
As such, the plan is to allow the object_manifest_t for each clone
As such, the plan is to allow the ``object_manifest_t`` for each clone
to be independent. Here's an incomplete list of the high level
tasks:
* Modify the op processing pipeline to permit SET_CHUNK, EVICT_CHUNK
* Modify the op processing pipeline to permit ``SET_CHUNK``, ``EVICT_CHUNK``
to operate directly on clones.
* Ensure that recovery checks the object_manifest prior to trying to
use the overlaps in clone_range. ReplicatedBackend::calc_*_subsets
use the overlaps in clone_range. ``ReplicatedBackend::calc_*_subsets``
are the two methods that would likely need to be modified.
See snaps.rst for a rundown of the librados snapshot system and osd
See ``snaps.rst`` for a rundown of the ``librados`` snapshot system and OSD
support details. I'd like to call out one particular data structure
we may want to exploit.
The dedup-tool needs to be updated to use LIST_SNAPS to discover
The dedup-tool needs to be updated to use ``LIST_SNAPS`` to discover
clones as part of leak detection.
An important question is how we deal with the fact that many clones
will frequently have references to the same backing chunks at the same
offset. In particular, make_writeable will generally create a clone
that shares the same object_manifest_t references with the exception
offset. In particular, ``make_writeable`` will generally create a clone
that shares the same ``object_manifest_t`` references with the exception
of any extents modified in that transaction. The metadata that
commits as part of that transaction must therefore map onto the same
refcount as before because otherwise we'd have to first increment
refcounts on backing objects (or risk a reference to a dead object).
Thus, we introduce a simple convention: consecutive clones which
share a reference at the same offset share the same refcount. This
means that a write that invokes make_writeable may decrease refcounts,
means that a write that invokes ``make_writeable`` may decrease refcounts,
but not increase them. This has some consequences for removing clones.
Consider the following sequence ::
@ -257,9 +257,9 @@ Consider the following sequence ::
10 : [0, 512) aaa, [512, 1024) bbb
refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=1
What should be the refcount for aaa be at the end? By our
above rule, it should be two since the two aaa refs are not
contiguous. However, consider removing clone 20 ::
What should the refcount for ``aaa`` be at the end? By our
above rule, it should be ``2`` since the two ``aaa`` refs are not
contiguous. However, consider removing clone ``20`` ::
initial:
head: [0, 512) aaa, [512, 1024) bbb
@ -271,22 +271,22 @@ contiguous. However, consider removing clone 20 ::
10 : [0, 512) aaa, [512, 1024) bbb
refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=0
At this point, our rule dictates that refcount(aaa) is 1.
This means that removing 20 needs to check for refs held by
At this point, our rule dictates that ``refcount(aaa)`` is ``1``.
This means that removing ``20`` needs to check for refs held by
the clones on either side which will then match.
See osd_types.h:object_manifest_t::calc_refs_to_drop_on_removal
See ``osd_types.h:object_manifest_t::calc_refs_to_drop_on_removal``
for the logic implementing this rule.
This seems complicated, but it gets us two valuable properties:
1) The refcount change from make_writeable will not block on
incrementing a ref
2) We don't need to load the object_manifest_t for every clone
2) We don't need to load the ``object_manifest_t`` for every clone
to determine how to handle removing one -- just the ones
immediately preceding and succeeding it.
All clone operations will need to consider adjacent chunk_maps
All clone operations will need to consider adjacent ``chunk_maps``
when adding or removing references.
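A hypothetical sketch of that adjacency rule (toy types, not the actual
``object_manifest_t`` logic): ::

    #include <cstddef>
    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    // offset -> target object id: a toy stand-in for a clone's chunk_map.
    using clone_chunks = std::map<uint64_t, std::string>;

    // Hypothetical illustration of the rule: when removing clone `idx` from a
    // list ordered oldest..newest, drop a chunk's reference only if neither
    // adjacent clone shares the same target at the same offset.
    std::vector<std::string> refs_to_drop(const std::vector<clone_chunks>& clones,
                                          std::size_t idx)
    {
      std::vector<std::string> drop;
      auto shares = [&](std::size_t other, uint64_t off, const std::string& tgt) {
        if (other >= clones.size())
          return false;
        auto it = clones[other].find(off);
        return it != clones[other].end() && it->second == tgt;
      };
      for (const auto& [off, tgt] : clones[idx]) {
        bool in_prev = idx > 0 && shares(idx - 1, off, tgt);
        bool in_next = shares(idx + 1, off, tgt);
        if (!in_prev && !in_next)
          drop.push_back(tgt);
      }
      return drop;
    }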
Cache/Tiering
@ -296,10 +296,10 @@ There already exists a cache/tiering mechanism based on whiteouts.
One goal here should ultimately be for this manifest machinery to
provide a complete replacement.
See cache-pool.rst
See ``cache-pool.rst``
The manifest machinery already shares some code paths with the
existing cache/tiering code, mainly stat_flush.
existing cache/tiering code, mainly ``stat_flush``.
In no particular order, here's an incomplete list of things that need
to be wired up to provide feature parity:
@ -308,7 +308,7 @@ to be wired up to provide feature parity:
for maintaining bloom filters which provide estimates of access
recency for objects. We probably need to modify this to permit
hitset maintenance for a normal pool -- there are already
CEPH_OSD_OP_PG_HITSET* interfaces for querying them.
``CEPH_OSD_OP_PG_HITSET*`` interfaces for querying them.
* Tiering agent: The osd already has a background tiering agent which
would need to be modified to instead flush and evict using
manifests.
@ -318,7 +318,7 @@ to be wired up to provide feature parity:
- hitset
- age, ratio, bytes
* Add tiering-mode to manifest-tiering.
* Add tiering-mode to ``manifest-tiering``
- Writeback
- Read-only
@ -326,8 +326,8 @@ to be wired up to provide feature parity:
Data Structures
===============
Each object contains an object_manifest_t embedded within the
object_info_t (see osd_types.h):
Each RADOS object contains an ``object_manifest_t`` embedded within the
``object_info_t`` (see ``osd_types.h``):
::
@ -342,15 +342,15 @@ object_info_t (see osd_types.h):
std::map<uint64_t, chunk_info_t> chunk_map;
}
The type enum reflects three possible states an object can be in:
The ``type`` enum reflects three possible states an object can be in:
1. TYPE_NONE: normal rados object
2. TYPE_REDIRECT: object payload is backed by a single object
specified by redirect_target
3. TYPE_CHUNKED: object payload is distributed among objects with
size and offset specified by the chunk_map. chunk_map maps
the offset of the chunk to a chunk_info_t shown below further
specifying the length, target oid, and flags.
1. ``TYPE_NONE``: normal RADOS object
2. ``TYPE_REDIRECT``: object payload is backed by a single object
specified by ``redirect_target``
3. ``TYPE_CHUNKED``: object payload is distributed among objects with
size and offset specified by the ``chunk_map``. ``chunk_map`` maps
the offset of the chunk to a ``chunk_info_t`` as shown below, also
specifying the ``length``, target OID, and ``flags``.
::
@ -367,7 +367,7 @@ The type enum reflects three possible states an object can be in:
cflag_t flags; // FLAG_*
FLAG_DIRTY at this time can happen if an extent with a fingerprint
``FLAG_DIRTY`` at this time can happen if an extent with a fingerprint
is written. This should be changed to drop the fingerprint instead.
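For illustration, a toy lookup over such a ``chunk_map`` (the types below are
hypothetical stand-ins for ``chunk_info_t``, not the real definitions): ::

    #include <cstdint>
    #include <map>
    #include <optional>
    #include <string>
    #include <utility>

    struct toy_chunk_info {            // hypothetical stand-in for chunk_info_t
      uint64_t length;
      std::string target_oid;          // object in the (refcounted) chunk pool
    };

    // chunk_map keyed by the offset of each chunk within the logical object.
    using toy_chunk_map = std::map<uint64_t, toy_chunk_info>;

    // Find the chunk (if any) covering a given byte offset of the object.
    std::optional<std::pair<uint64_t, toy_chunk_info>>
    chunk_for_offset(const toy_chunk_map& m, uint64_t off)
    {
      auto it = m.upper_bound(off);    // first chunk starting strictly after off
      if (it == m.begin())
        return std::nullopt;
      --it;                            // chunk starting at or before off
      if (off < it->first + it->second.length)
        return std::make_pair(it->first, it->second);
      return std::nullopt;
    }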
@ -375,50 +375,48 @@ Request Handling
================
Similarly to cache/tiering, the initial touchpoint is
maybe_handle_manifest_detail.
``maybe_handle_manifest_detail``.
For manifest operations listed below, we return NOOP and continue onto
dedicated handling within do_osd_ops.
For manifest operations listed below, we return ``NOOP`` and continue onto
dedicated handling within ``do_osd_ops``.
For redirect objects which haven't been promoted (apparently oi.size >
0 indicates that it's present?) we proxy reads and writes.
For redirect objects which haven't been promoted (apparently ``oi.size >
0`` indicates that it's present?) we proxy reads and writes.
For reads on TYPE_CHUNKED, if can_proxy_chunked_read (basically, all
of the ops are reads of extents in the object_manifest_t chunk_map),
For reads on ``TYPE_CHUNKED``, if ``can_proxy_chunked_read`` (basically, all
of the ops are reads of extents in the ``object_manifest_t chunk_map``),
we proxy requests to those objects.
RADOS Interface
================
To set up deduplication pools, you must have two pools. One will act as the
To set up deduplication one must provision two pools. One will act as the
base pool and the other will act as the chunk pool. The base pool needs to be
configured with fingerprint_algorithm option as follows.
configured with the ``fingerprint_algorithm`` option as follows.
::
ceph osd pool set $BASE_POOL fingerprint_algorithm sha1|sha256|sha512
--yes-i-really-mean-it
1. Create objects ::
Create objects ::
- rados -p base_pool put foo ./foo
rados -p base_pool put foo ./foo
rados -p chunk_pool put foo-chunk ./foo-chunk
- rados -p chunk_pool put foo-chunk ./foo-chunk
Make a manifest object ::
2. Make a manifest object ::
- rados -p base_pool set-chunk foo $START_OFFSET $END_OFFSET --target-pool
chunk_pool foo-chunk $START_OFFSET --with-reference
rados -p base_pool set-chunk foo $START_OFFSET $END_OFFSET --target-pool chunk_pool foo-chunk $START_OFFSET --with-reference
Operations:
* set-redirect
* ``set-redirect``
set a redirection between a base_object in the base_pool and a target_object
in the target_pool.
Set a redirection between a ``base_object`` in the ``base_pool`` and a ``target_object``
in the ``target_pool``.
A redirected object will forward all operations from the client to the
target_object. ::
``target_object``. ::
void set_redirect(const std::string& tgt_obj, const IoCtx& tgt_ioctx,
uint64_t tgt_version, int flag = 0);
@ -426,8 +424,8 @@ Operations:
rados -p base_pool set-redirect <base_object> --target-pool <target_pool>
<target_object>
Returns ENOENT if the object does not exist (TODO: why?)
Returns EINVAL if the object already is a redirect.
Returns ``ENOENT`` if the object does not exist (TODO: why?)
Returns ``EINVAL`` if the object already is a redirect.
Takes a reference to target as part of operation, can possibly leak a ref
if the acting set resets and the client dies between taking the ref and
@ -435,19 +433,19 @@ Operations:
Truncates object, clears omap, and clears xattrs as a side effect.
At the top of do_osd_ops, does not set user_modify.
At the top of ``do_osd_ops``, does not set user_modify.
This operation is not a user mutation and does not trigger a clone to be created.
The purpose of set_redirect is two.
There are two purposes of ``set_redirect``:
1. Redirect all operations to the target object (like a proxy)
2. Cache when tier_promote is called (redirect will be cleared at this time).
2. Cache when ``tier_promote`` is called (redirect will be cleared at this time).
* set-chunk
* ``set-chunk``
set the chunk-offset in a source_object to make a link between it and a
target_object. ::
Set the ``chunk-offset`` in a ``source_object`` to make a link between it and a
``target_object``. ::
void set_chunk(uint64_t src_offset, uint64_t src_length, const IoCtx& tgt_ioctx,
std::string tgt_oid, uint64_t tgt_offset, int flag = 0);
@ -455,10 +453,10 @@ Operations:
rados -p base_pool set-chunk <source_object> <offset> <length> --target-pool
<caspool> <target_object> <target-offset>
Returns ENOENT if the object does not exist (TODO: why?)
Returns EINVAL if the object already is a redirect.
Returns EINVAL if on ill-formed parameter buffer.
Returns ENOTSUPP if existing mapped chunks overlap with new chunk mapping.
Returns ``ENOENT`` if the object does not exist (TODO: why?)
Returns ``EINVAL`` if the object already is a redirect.
Returns ``EINVAL`` on an ill-formed parameter buffer.
Returns ``ENOTSUPP`` if existing mapped chunks overlap with new chunk mapping.
Takes references to targets as part of operation, can possibly leak refs
if the acting set resets and the client dies between taking the ref and
@ -468,36 +466,36 @@ Operations:
This operation is not a user mutation and does not trigger a clone to be created.
TODO: SET_CHUNK appears to clear the manifest unconditionally if it's not chunked. ::
TODO: ``SET_CHUNK`` appears to clear the manifest unconditionally if it's not chunked. ::
if (!oi.manifest.is_chunked()) {
oi.manifest.clear();
}
* evict-chunk
* ``evict-chunk``
Clears an extent from an object leaving only the manifest link between
it and the target_object. ::
it and the ``target_object``. ::
void evict_chunk(
uint64_t offset, uint64_t length, int flag = 0);
rados -p base_pool evict-chunk <offset> <length> <object>
Returns EINVAL if the extent is not present in the manifest.
Returns ``EINVAL`` if the extent is not present in the manifest.
Note: this does not exist yet.
* tier-promote
* ``tier-promote``
promotes the object ensuring that subsequent reads and writes will be local ::
Promotes the object ensuring that subsequent reads and writes will be local ::
void tier_promote();
rados -p base_pool tier-promote <obj-name>
Returns ENOENT if the object does not exist
Returns ``ENOENT`` if the object does not exist
For a redirect manifest, copies data to head.
@ -506,17 +504,17 @@ Operations:
For a chunked manifest, reads all ``MISSING`` extents into the base pool;
subsequent reads and writes will be served from the base pool.
Implementation Note: For a chunked manifest, calls start_copy on itself. The
resulting copy_get operation will issue reads which will then be redirected by
Implementation Note: For a chunked manifest, calls ``start_copy`` on itself. The
resulting ``copy_get`` operation will issue reads which will then be redirected by
the normal manifest read machinery.
Does not set the user_modify flag.
Does not set the ``user_modify`` flag.
Future work will involve adding support for specifying a clone_id.
Future work will involve adding support for specifying a ``clone_id``.
* unset-manifest
* ``unset-manifest``
unset the manifest info in the object that has manifest. ::
Unset the manifest info in the object that has a manifest. ::
void unset_manifest();
@ -525,63 +523,61 @@ Operations:
Clears manifest chunks or redirect. Lazily releases references, may
leak.
do_osd_ops seems not to include it in the user_modify=false ignorelist,
``do_osd_ops`` seems not to include it in the ``user_modify=false`` ``ignorelist``,
and so will trigger a snapshot. Note, this will be true even for a
redirect though SET_REDIRECT does not flip user_modify. This should
be fixed -- unset-manifest should not be a user_modify.
redirect though ``SET_REDIRECT`` does not flip ``user_modify``. This should
be fixed -- ``unset-manifest`` should not be a ``user_modify``.
* tier-flush
* ``tier-flush``
flush the object which has chunks to the chunk pool. ::
Flush the object which has chunks to the chunk pool. ::
void tier_flush();
rados -p base_pool tier-flush <obj-name>
Included in the user_modify=false ignorelist, does not trigger a clone.
Included in the ``user_modify=false`` ``ignorelist``, does not trigger a clone.
Does not evict the extents.
Dedup tool
==========
ceph-dedup-tool
===============
Dedup tool has two features: finding an optimal chunk offset for dedup chunking
and fixing the reference count (see ./refcount.rst).
``ceph-dedup-tool`` has two features: finding an optimal chunk offset for dedup chunking
and fixing the reference count (see ``./refcount.rst``).
* find an optimal chunk offset
* Find an optimal chunk offset
a. fixed chunk
a. Fixed chunk
To find out a fixed chunk length, you need to run the following command many
times while changing the chunk_size. ::
To find out a fixed chunk length, you need to run the following command many
times while changing the ``chunk_size``. ::
ceph-dedup-tool --op estimate --pool $POOL --chunk-size chunk_size
--chunk-algorithm fixed --fingerprint-algorithm sha1|sha256|sha512
b. rabin chunk(Rabin-karp algorithm)
b. Rabin chunk (Rabin-Karp algorithm)
As you know, Rabin-karp algorithm is string-searching algorithm based
on a rolling-hash. But rolling-hash is not enough to do deduplication because
we don't know the chunk boundary. So, we need content-based slicing using
a rolling hash for content-defined chunking.
The current implementation uses the simplest approach: look for chunk boundaries
by inspecting the rolling hash for pattern(like the
lower N bits are all zeroes).
Rabin-Karp is a string-searching algorithm based
on a rolling hash, but a rolling hash alone is not enough for deduplication
because we don't know the chunk boundaries. So we need content-based slicing
using a rolling hash for content-defined chunking.
The current implementation uses the simplest approach: look for chunk boundaries
by inspecting the rolling hash for a pattern (such as the
lower N bits all being zeroes).
- Usage
Users who want to use deduplication need to find an ideal chunk offset.
To find out ideal chunk offset, Users should discover
the optimal configuration for their data workload via ceph-dedup-tool.
And then, this chunking information will be used for object chunking through
set-chunk api. ::
Users who want to use deduplication need to find an ideal chunk offset.
To find it, users should discover
the optimal configuration for their data workload via ``ceph-dedup-tool``.
This information will then be used for object chunking through
the ``set-chunk`` API. ::
ceph-dedup-tool --op estimate --pool $POOL --min-chunk min_size
--chunk-algorithm rabin --fingerprint-algorithm rabin
ceph-dedup-tool has many options to utilize rabin chunk.
These are options for rabin chunk. ::
``ceph-dedup-tool`` has many options for tuning ``rabin`` chunking.
These are the ``rabin`` chunking options. ::
--mod-prime <uint64_t>
--rabin-prime <uint64_t>
@ -591,37 +587,37 @@ and fixing the reference count (see ./refcount.rst).
--min-chunk <uint32_t>
--max-chunk <uint64_t>
Users need to refer following equation to use above options for rabin chunk. ::
Users should refer to the following equation when using the above options for ``rabin`` chunking. ::
rabin_hash =
(rabin_hash * rabin_prime + new_byte - old_byte * pow) % (mod_prime)
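A toy implementation of that equation together with the low-bits boundary test
(the parameter defaults below are arbitrary placeholders, not ``ceph-dedup-tool``
defaults): ::

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Toy content-defined chunker following the rolling-hash equation above.
    // A boundary is declared when the low num_bits of the hash are all zero.
    std::vector<std::size_t> chunk_boundaries(const std::vector<uint8_t>& data,
                                              uint64_t rabin_prime = 31,
                                              uint64_t mod_prime = 1000000007ULL,
                                              std::size_t window = 48,
                                              unsigned num_bits = 13)
    {
      std::vector<std::size_t> cuts;
      uint64_t pow = 1;                          // rabin_prime^window mod mod_prime
      for (std::size_t i = 0; i < window; ++i)
        pow = (pow * rabin_prime) % mod_prime;

      const uint64_t mask = (1ULL << num_bits) - 1;
      uint64_t hash = 0;
      for (std::size_t i = 0; i < data.size(); ++i) {
        uint64_t new_byte = data[i];
        uint64_t old_byte = (i >= window) ? data[i - window] : 0;  // byte leaving the window
        // rabin_hash = (rabin_hash * rabin_prime + new_byte - old_byte * pow) % mod_prime
        hash = (hash * rabin_prime % mod_prime + new_byte
                + mod_prime - (old_byte * pow) % mod_prime) % mod_prime;
        if (i + 1 >= window && (hash & mask) == 0)
          cuts.push_back(i + 1);                 // chunk boundary after byte i
      }
      return cuts;
    }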
c. Fixed chunk vs content-defined chunk
Content-defined chunking may or not be optimal solution.
For example,
Content-defined chunking may or may not be the optimal solution.
For example,
Data chunk A : abcdefgabcdefgabcdefg
Data chunk ``A`` : ``abcdefgabcdefgabcdefg``
Let's think about Data chunk A's deduplication. Ideal chunk offset is
from 1 to 7 (abcdefg). So, if we use fixed chunk, 7 is optimal chunk length.
But, in the case of content-based slicing, the optimal chunk length
could not be found (dedup ratio will not be 100%).
Because we need to find optimal parameter such
as boundary bit, window size and prime value. This is as easy as fixed chunk.
But, content defined chunking is very effective in the following case.
Let's think about Data chunk ``A``'s deduplication. The ideal chunk offset is
from ``1`` to ``7`` (``abcdefg``). So, if we use fixed chunking, ``7`` is the
optimal chunk length. But, in the case of content-based slicing, the optimal
chunk length might not be found (the dedup ratio will not be 100%), because we
need to find optimal parameters such as the boundary bit, window size, and
prime value. This is not as easy as fixed chunking.
However, content-defined chunking is very effective in the following case.
Data chunk B : abcdefgabcdefgabcdefg
Data chunk ``B`` : ``abcdefgabcdefgabcdefg``
Data chunk C : Tabcdefgabcdefgabcdefg
Data chunk ``C`` : ``Tabcdefgabcdefgabcdefg``
* fix reference count
* Fix reference count
The key idea behind reference counting for dedup is to tolerate false positives, which means
(manifest object (no ref), chunk object(has ref)) happen instead of
(manifest object (has ref), chunk 1(no ref)).
To fix such inconsistency, ceph-dedup-tool supports chunk_scrub. ::
``(manifest object (no ref), chunk object (has ref))`` can happen instead of
``(manifest object (has ref), chunk 1 (no ref))``.
To fix such inconsistencies, ``ceph-dedup-tool`` supports ``chunk_scrub``. ::
ceph-dedup-tool --op chunk_scrub --chunk_pool $CHUNK_POOL


@ -2,8 +2,8 @@
OSD Throttles
=============
There are three significant throttles in the filestore: wbthrottle,
op_queue_throttle, and a throttle based on journal usage.
There are three significant throttles in the FileStore OSD back end:
wbthrottle, op_queue_throttle, and a throttle based on journal usage.
WBThrottle
----------
@ -17,7 +17,7 @@ flushing and block in FileStore::_do_op if we have exceeded any hard
limits until the background flusher catches up.
The relevant config options are filestore_wbthrottle*. There are
different defaults for xfs and btrfs. Each set has hard and soft
different defaults for XFS and Btrfs. Each set has hard and soft
limits on bytes (total dirty bytes), ios (total dirty ios), and
inodes (total dirty fds). The WBThrottle will begin flushing
when any of these hits the soft limit and will block in throttle()


@ -2,9 +2,9 @@
Partial Object Recovery
=======================
Partial Object Recovery devotes to improving the efficiency of
log-based recovery rather than backfill. Original log-based recovery
calculates missing_set based on the difference between pg_log.
Partial Object Recovery improves the efficiency of log-based recovery (vs
backfill). Original log-based recovery calculates the ``missing_set`` based on
``pg_log`` differences.
The whole object would be recovered from one OSD to another
if the object is marked as modified by the ``pg_log``, regardless of how much


@ -26,11 +26,11 @@ Scrubbing Behavior Table
State variables
---------------
- Periodic tick state is !must_scrub && !must_deep_scrub && !time_for_deep
- Periodic tick after osd_deep_scrub_interval state is !must_scrub && !must_deep_scrub && time_for_deep
- Initiated scrub state is must_scrub && !must_deep_scrub && !time_for_deep
- Initiated scrub after osd_deep_scrub_interval state is must scrub && !must_deep_scrub && time_for_deep
- Initiated deep scrub state is must_scrub && must_deep_scrub
- Periodic tick state is ``!must_scrub && !must_deep_scrub && !time_for_deep``
- Periodic tick after ``osd_deep_scrub_interval`` state is ``!must_scrub && !must_deep_scrub && time_for_deep``
- Initiated scrub state is ``must_scrub && !must_deep_scrub && !time_for_deep``
- Initiated scrub after ``osd_deep_scrub_interval`` state is ``must_scrub && !must_deep_scrub && time_for_deep``
- Initiated deep scrub state is ``must_scrub && must_deep_scrub``
Scrub Reservations
------------------


@ -27,7 +27,7 @@ See OSD::make_writeable
Ondisk Structures
-----------------
Each object has in the pg collection a *head* object (or *snapdir*, which we
Each object has in the PG collection a *head* object (or *snapdir*, which we
will come to shortly) and possibly a set of *clone* objects.
Each hobject_t has a snap field. For the *head* (the only writeable version
of an object), the snap field is set to CEPH_NOSNAP. For the *clones*, the
@ -68,7 +68,7 @@ removal, we maintain a mapping from snap to *hobject_t* using the
See PrimaryLogPG::SnapTrimmer, SnapMapper
This trimming is performed asynchronously by the snap_trim_wq while the
pg is clean and not scrubbing.
PG is clean and not scrubbing.
#. The next snap in PG::snap_trimq is selected for trimming
#. We determine the next object for trimming out of PG::snap_mapper.
@ -90,7 +90,7 @@ pg is clean and not scrubbing.
Recovery
--------
Because the trim operations are implemented using repops and log entries,
normal pg peering and recovery maintain the snap trimmer operations with
normal PG peering and recovery maintain the snap trimmer operations with
the caveat that push and removal operations need to update the local
*SnapMapper* instance. If the purged_snaps update is lost, we merely
retrim a now empty snap.
@ -117,12 +117,12 @@ is constant length. These keys have a bufferlist encoding
pair<snapid, hobject_t> as a value. Thus, creating or trimming a single
object does not involve reading all objects for any snap. Additionally,
upon construction, the *SnapMapper* is provided with a mask for filtering
the objects in the single SnapMapper keyspace belonging to that pg.
the objects in the single SnapMapper keyspace belonging to that PG.
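As a toy illustration of such an index (hypothetical structure; the real
*SnapMapper* encodes prefixed keys in omap rather than using in-memory
containers): ::

    #include <cstdint>
    #include <set>
    #include <string>
    #include <utility>
    #include <vector>

    // Toy snap -> object index.  Ordering keys by (snap, oid) lets us enumerate
    // exactly the objects that have a clone in a given snap with one range
    // scan, which is the property the SnapMapper keys provide.
    struct ToySnapIndex {
      std::set<std::pair<uint64_t, std::string>> by_snap;

      void add(uint64_t snap, const std::string& oid)    { by_snap.emplace(snap, oid); }
      void remove(uint64_t snap, const std::string& oid) { by_snap.erase({snap, oid}); }

      std::vector<std::string> objects_in_snap(uint64_t snap) const {
        std::vector<std::string> out;
        for (auto it = by_snap.lower_bound({snap, std::string()});
             it != by_snap.end() && it->first == snap; ++it)
          out.push_back(it->second);
        return out;
      }
    };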
Split
-----
The snapid_t -> hobject_t key entries are arranged such that for any pg,
The snapid_t -> hobject_t key entries are arranged such that for any PG,
up to 8 prefixes need to be checked to determine all hobjects in a particular
snap for a particular pg. Upon split, the prefixes to check on the parent
are adjusted such that only the objects remaining in the pg will be visible.
snap for a particular PG. Upon split, the prefixes to check on the parent
are adjusted such that only the objects remaining in the PG will be visible.
The children will immediately have the correct mapping.