========
Manifest
========

Introduction
============

As described in ``../deduplication.rst``, adding transparent redirect
machinery to RADOS would enable a more capable tiering solution
than RADOS currently has with "cache/tiering".

See ``../deduplication.rst``

At a high level, each object has a piece of metadata embedded in
the ``object_info_t`` which can map subsets of the object data payload
to (refcounted) objects in other pools.

This document exists to detail:

1. Manifest data structures
2. Rados operations for manipulating manifests.
3. Status and Plans

Intended Usage Model
====================

RBD
---

For RBD, the primary goal is for either an OSD-internal agent or a
cluster-external agent to be able to transparently shift portions
of the constituent 4MB extents between a dedup pool and a hot base
pool.

As such, RBD operations (including class operations and snapshots)
must have the same observable results regardless of the current
status of the object.

Moreover, tiering/dedup operations must interleave with RBD operations
without changing the result.

Thus, here is a sketch of how I'd expect a tiering agent to perform
basic operations:

* Demote cold RBD chunk to slow pool (a code sketch of step 4 follows below):

  1. Read object, noting current user_version.
  2. In memory, run CDC implementation to fingerprint object.
  3. Write out each resulting extent to an object in the cold pool
     using the CAS class.
  4. Submit operation to base pool:

     * ``ASSERT_VER`` with the user version from the read to fail if the
       object has been mutated since the read.
     * ``SET_CHUNK`` for each of the extents to the corresponding object
       in the cold pool.
     * ``EVICT_CHUNK`` for each extent to free up space in the base pool.
       Results in each chunk being marked ``MISSING``.

  RBD users should then either see the state prior to the demotion or
  subsequent to it.

  Note that between 3 and 4, we potentially leak references, so a
  periodic scrub would be needed to validate refcounts.

* Promote cold RBD chunk to fast pool.

  1. Submit ``TIER_PROMOTE``

For clones, all of the above would be identical except that the
initial read would need a ``LIST_SNAPS`` to determine which clones exist
and the ``PROMOTE`` or ``SET_CHUNK``/``EVICT`` operations would need to include
the ``cloneid``.
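
As a concrete illustration of step 4 of the demote sequence, here is a
hedged C++ sketch against the librados API. It assumes the ``set_chunk``
signature quoted in the RADOS Interface section below; the helper name
``demote_extent`` is illustrative, and ``EVICT_CHUNK`` appears only as a
comment since it does not exist yet (see Cleanups). ::

  #include <rados/librados.hpp>
  #include <string>

  // Submit step 4 for a single extent, assuming its contents were
  // already written to cas_oid in the cold pool via the CAS class.
  int demote_extent(librados::IoCtx& base_pool, librados::IoCtx& cold_pool,
                    const std::string& oid, uint64_t user_version,
                    uint64_t offset, uint64_t length,
                    const std::string& cas_oid)
  {
    librados::ObjectWriteOperation op;
    // ASSERT_VER: fail if the object has been mutated since the
    // read in step 1.
    op.assert_version(user_version);
    // SET_CHUNK: map [offset, offset+length) onto the CAS object in
    // the cold pool.
    op.set_chunk(offset, length, cold_pool, cas_oid, 0);
    // EVICT_CHUNK would go here once implemented, marking the extent
    // MISSING and freeing space in the base pool.
    return base_pool.operate(oid, &op);
  }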

RadosGW
-------

For reads, RADOS Gateway (RGW) could operate as RBD does above, relying on the
manifest machinery in the OSD to hide the distinction between the object
being dedup'd or present in the base pool.

For writes, RGW could operate as RBD does above, but could
optionally have the freedom to fingerprint prior to doing the write.
In that case, it could immediately write out the target objects to the
CAS pool and then atomically write an object with the corresponding
chunks set.

Status and Future Work
======================

At the moment, initial versions of a manifest data structure along
with IO path support and rados control operations exist. This section
is meant to outline next steps.

At a high level, our future work plan is:

- Cleanups: Address immediate inconsistencies and shortcomings outlined
  in the next section.
- Testing: Rados relies heavily on teuthology failure testing to validate
  features like cache/tiering. We'll need corresponding tests for
  manifest operations.
- Snapshots: We want to be able to deduplicate portions of clones
  below the level of the rados snapshot system. As such, the
  rados operations below need to be extended to work correctly on
  clones (e.g.: we should be able to call ``SET_CHUNK`` on a clone, clear the
  corresponding extent in the base pool, and correctly maintain OSD metadata).
- Cache/tiering: Ultimately, we'd like to be able to deprecate the existing
  cache/tiering implementation, but to do that we need to ensure that we
  can address the same use cases.

Cleanups
--------

The existing implementation has some things that need to be cleaned up:

* ``SET_REDIRECT``: Should create the object if it doesn't exist; otherwise,
  one couldn't create an object atomically as a redirect.
* ``SET_CHUNK``:

  * Appears to trigger a new clone as user_modify gets set in
    ``do_osd_ops``. This probably isn't desirable; see the Snapshots section
    below for some options on how generally to mix these operations
    with snapshots. At a minimum, ``SET_CHUNK`` probably shouldn't set
    user_modify.
  * Appears to assume that the corresponding section of the object
    does not exist (sets ``FLAG_MISSING``) but does not check whether the
    corresponding extent exists already in the object. It should always
    leave the extent clean.
  * Appears to clear the manifest unconditionally if not chunked;
    that's probably wrong. We should return an error if it's a
    ``REDIRECT`` ::

      case CEPH_OSD_OP_SET_CHUNK:
        if (oi.manifest.is_redirect()) {
          result = -EINVAL;
          goto fail;
        }

* ``TIER_PROMOTE``:

  * ``SET_REDIRECT`` clears the contents of the object. ``PROMOTE`` appears
    to copy them back in, but does not unset the redirect or clear the
    reference. This violates the invariant that a redirect object
    should be empty in the base pool. In particular, as long as the
    redirect is set, it appears that all operations will be proxied
    even after the promote, defeating the purpose. We do want ``PROMOTE``
    to be able to atomically replace a redirect with the actual
    object, so the solution is to clear the redirect at the end of the
    promote.
  * For a chunked manifest, we appear to flush prior to promoting.
    Promotion will often be used to prepare an object for low latency
    reads and writes; accordingly, the only effect should be to read
    any ``MISSING`` extents into the base pool. No flushing should be done.

* High Level:

  * It appears that ``FLAG_DIRTY`` should never be used for an extent pointing
    at a dedup extent. Writing the mutated extent back to the dedup pool
    requires writing a new object since the previous one cannot be mutated,
    just as it would if it hadn't been dedup'd yet. Thus, we should always
    drop the reference and remove the manifest pointer.

  * There isn't currently a way to "evict" an object region. With the above
    change to ``SET_CHUNK`` to always retain the existing object region, we
    need an ``EVICT_CHUNK`` operation to then remove the extent.

Testing
-------

We rely heavily on randomized failure testing. As such, we need
to extend that testing to include dedup/manifest support as well. Here's
a short list of the touchpoints:

* Thrasher tests like ``qa/suites/rados/thrash/workloads/cache-snaps.yaml``

  That test, of course, tests the existing cache/tiering machinery. Add
  additional files to that directory that instead set up a dedup pool. Add
  support to ``ceph_test_rados`` (``src/test/osd/TestRados*``).

* RBD tests

  Add a test that runs an RBD workload concurrently with blind
  promote/evict operations.

* RGW

  Add a test that runs an RGW workload concurrently with blind
  promote/evict operations.

Snapshots
---------

Fundamentally, we need to be able to manipulate the manifest
status of clones because we want to be able to dynamically promote,
flush (if the state was dirty when the clone was created), and evict
extents from clones.

As such, the plan is to allow the ``object_manifest_t`` for each clone
to be independent. Here's an incomplete list of the high level
tasks:

* Modify the op processing pipeline to permit ``SET_CHUNK``, ``EVICT_CHUNK``
  to operate directly on clones.
* Ensure that recovery checks the object_manifest prior to trying to
  use the overlaps in clone_range. ``ReplicatedBackend::calc_*_subsets``
  are the two methods that would likely need to be modified.

See ``snaps.rst`` for a rundown of the ``librados`` snapshot system and OSD
support details. I'd like to call out one particular data structure
we may want to exploit.

The dedup-tool needs to be updated to use ``LIST_SNAPS`` to discover
clones as part of leak detection.

An important question is how we deal with the fact that many clones
will frequently have references to the same backing chunks at the same
offset. In particular, ``make_writeable`` will generally create a clone
that shares the same ``object_manifest_t`` references with the exception
of any extents modified in that transaction. The metadata that
commits as part of that transaction must therefore map onto the same
refcount as before, because otherwise we'd have to first increment
refcounts on backing objects (or risk a reference to a dead object).
Thus, we introduce a simple convention: consecutive clones which
share a reference at the same offset share the same refcount. This
means that a write that invokes ``make_writeable`` may decrease refcounts,
but not increase them. This has some consequences for removing clones.
Consider the following sequence ::

  write foo [0, 1024)
  flush foo ->
    head: [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=1, refcount(bbb)=1
  snapshot 10
  write foo [0, 512) ->
    head: [512, 1024) bbb
    10  : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=1, refcount(bbb)=1
  flush foo ->
    head: [0, 512) ccc, [512, 1024) bbb
    10  : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=1, refcount(bbb)=1, refcount(ccc)=1
  snapshot 20
  write foo [0, 512) (same contents as the original write)
    head: [512, 1024) bbb
    20  : [0, 512) ccc, [512, 1024) bbb
    10  : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=?, refcount(bbb)=1
  flush foo
    head: [0, 512) aaa, [512, 1024) bbb
    20  : [0, 512) ccc, [512, 1024) bbb
    10  : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=1

What should the refcount for ``aaa`` be at the end? By our
above rule, it should be ``2`` since the two ``aaa`` refs are not
contiguous. However, consider removing clone ``20`` ::

  initial:
    head: [0, 512) aaa, [512, 1024) bbb
    20  : [0, 512) ccc, [512, 1024) bbb
    10  : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=2, refcount(bbb)=1, refcount(ccc)=1
  trim 20
    head: [0, 512) aaa, [512, 1024) bbb
    10  : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=0

At this point, our rule dictates that ``refcount(aaa)`` is ``1``.
This means that removing ``20`` needs to check for refs held by
the clones on either side, which will then match.

See ``osd_types.h:object_manifest_t::calc_refs_to_drop_on_removal``
for the logic implementing this rule.
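
To illustrate the convention, here is a simplified, hypothetical sketch
(not the actual ``calc_refs_to_drop_on_removal`` code; a plain string
stands in for the target object id): ::

  #include <cstdint>
  #include <map>
  #include <string>
  #include <vector>

  // Stand-in for a clone's chunk_map: offset -> target object id.
  using chunk_map_t = std::map<uint64_t, std::string>;

  // Refs to drop when removing a clone, per the convention that
  // consecutive clones sharing a target at the same offset share
  // one refcount.
  std::vector<std::string> refs_to_drop_on_removal(
      const chunk_map_t& removed,    // the clone being trimmed
      const chunk_map_t& preceding,  // adjacent older clone (may be empty)
      const chunk_map_t& succeeding) // adjacent newer clone or head
  {
    auto matches = [](const chunk_map_t& m, uint64_t off,
                      const std::string& tgt) {
      auto it = m.find(off);
      return it != m.end() && it->second == tgt;
    };
    std::vector<std::string> to_drop;
    // A ref held by the removed clone is dropped only if neither
    // neighbour shares it; otherwise a neighbour keeps the refcount.
    for (const auto& [off, tgt] : removed) {
      if (!matches(preceding, off, tgt) && !matches(succeeding, off, tgt))
        to_drop.push_back(tgt);
    }
    // Where the two neighbours match each other but the removed clone
    // differed, their two refcounts collapse into one, so one ref is
    // dropped (the refcount(aaa)=2 -> 1 case above).
    for (const auto& [off, tgt] : preceding) {
      if (matches(succeeding, off, tgt) && !matches(removed, off, tgt))
        to_drop.push_back(tgt);
    }
    return to_drop;
  }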

This seems complicated, but it gets us two valuable properties:

1) The refcount change from make_writeable will not block on
   incrementing a ref
2) We don't need to load the ``object_manifest_t`` for every clone
   to determine how to handle removing one -- just the ones
   immediately preceding and succeeding it.

All clone operations will need to consider adjacent ``chunk_maps``
when adding or removing references.

Cache/Tiering
-------------

There already exists a cache/tiering mechanism based on whiteouts.
One goal here should ultimately be for this manifest machinery to
provide a complete replacement.

See ``cache-pool.rst``

The manifest machinery already shares some code paths with the
existing cache/tiering code, mainly ``stat_flush``.

In no particular order, here's an incomplete list of things that need
to be wired up to provide feature parity:

* Online object access information: The OSD already has pool configs
  for maintaining bloom filters which provide estimates of access
  recency for objects. We probably need to modify this to permit
  hitset maintenance for a normal pool -- there are already
  ``CEPH_OSD_OP_PG_HITSET*`` interfaces for querying them.
* Tiering agent: The OSD already has a background tiering agent which
  would need to be modified to instead flush and evict using
  manifests.

* Use existing features regarding the cache flush policy, such as:

  - hitset
  - age, ratio, bytes

* Add tiering-mode to ``manifest-tiering``:

  - Writeback
  - Read-only

Data Structures
===============

Each RADOS object contains an ``object_manifest_t`` embedded within the
``object_info_t`` (see ``osd_types.h``):

::

  struct object_manifest_t {
    enum {
      TYPE_NONE = 0,
      TYPE_REDIRECT = 1,
      TYPE_CHUNKED = 2,
    };
    uint8_t type;  // redirect, chunked, ...
    hobject_t redirect_target;
    std::map<uint64_t, chunk_info_t> chunk_map;
  };

The ``type`` enum reflects three possible states an object can be in:

1. ``TYPE_NONE``: normal RADOS object
2. ``TYPE_REDIRECT``: object payload is backed by a single object
   specified by ``redirect_target``
3. ``TYPE_CHUNKED``: object payload is distributed among objects with
   size and offset specified by the ``chunk_map``. ``chunk_map`` maps
   the offset of the chunk to a ``chunk_info_t`` as shown below, also
   specifying the ``length``, target ``OID``, and ``flags``.

::

  struct chunk_info_t {
    typedef enum {
      FLAG_DIRTY = 1,
      FLAG_MISSING = 2,
      FLAG_HAS_REFERENCE = 4,
      FLAG_HAS_FINGERPRINT = 8,
    } cflag_t;
    uint32_t offset;
    uint32_t length;
    hobject_t oid;
    cflag_t flags;   // FLAG_*
  };

``FLAG_DIRTY`` at this time can happen if an extent with a fingerprint
is written. This should be changed to drop the fingerprint instead.
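
As an illustration, a 1 MiB object fully deduplicated into two 512 KiB
chunks would carry a ``chunk_map`` along these lines. This is a
self-contained, simplified sketch: a plain string stands in for
``hobject_t``, the fingerprint-style names are made up, and reading the
``offset`` field as the offset within the target object (matching
``set-chunk``'s ``<target-offset>`` argument) is an assumption. ::

  #include <cstdint>
  #include <map>
  #include <string>

  // Simplified stand-in for the chunk_info_t above.
  struct chunk_info {
    uint32_t offset;   // offset within the target object (assumed)
    uint32_t length;
    std::string oid;   // hobject_t in the real structure
    uint32_t flags;    // cflag_t in the real structure
  };

  int main() {
    const uint32_t HAS_REF = 4, HAS_FP = 8; // FLAG_HAS_REFERENCE/_FINGERPRINT
    std::map<uint64_t, chunk_info> chunk_map;
    // Key: offset within the base object; value: refcounted chunk
    // object in the chunk pool.
    chunk_map[0]          = {0, 512 * 1024, "sha256:aaa...", HAS_REF | HAS_FP};
    chunk_map[512 * 1024] = {0, 512 * 1024, "sha256:bbb...", HAS_REF | HAS_FP};
    return 0;
  }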

Request Handling
================

Similarly to cache/tiering, the initial touchpoint is
``maybe_handle_manifest_detail``.

For manifest operations listed below, we return ``NOOP`` and continue on to
dedicated handling within ``do_osd_ops``.

For redirect objects which haven't been promoted (apparently ``oi.size >
0`` indicates that it's present?) we proxy reads and writes.

For reads on ``TYPE_CHUNKED``, if ``can_proxy_chunked_read`` (basically, all
of the ops are reads of extents in the ``object_manifest_t``'s ``chunk_map``),
we proxy requests to those objects.
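
A simplified, hypothetical sketch of that containment check (a plain
offset -> length map stands in for the real ``chunk_map``, each read is
assumed to fall within a single chunk, and the real
``can_proxy_chunked_read`` also inspects the op types): ::

  #include <cstdint>
  #include <map>
  #include <vector>

  struct extent { uint64_t off, len; };

  // True only if every requested extent lies wholly inside a single
  // mapped chunk (offset -> length).
  bool can_proxy_reads(const std::map<uint64_t, uint64_t>& chunk_map,
                       const std::vector<extent>& reads)
  {
    for (const auto& r : reads) {
      auto it = chunk_map.upper_bound(r.off); // first chunk past r.off
      if (it == chunk_map.begin())
        return false;                         // no chunk starts at/before r.off
      --it;                                   // candidate chunk covering r.off
      if (r.off + r.len > it->first + it->second)
        return false;                         // read spills out of the chunk
    }
    return true;
  }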

RADOS Interface
================

To set up deduplication one must provision two pools. One will act as the
base pool and the other will act as the chunk pool. The base pool needs to be
configured with the ``fingerprint_algorithm`` option as follows.

::

  ceph osd pool set $BASE_POOL fingerprint_algorithm sha1|sha256|sha512
    --yes-i-really-mean-it

Create objects ::

  rados -p base_pool put foo ./foo
  rados -p chunk_pool put foo-chunk ./foo-chunk

Make a manifest object ::

  rados -p base_pool set-chunk foo $START_OFFSET $END_OFFSET --target-pool chunk_pool foo-chunk $START_OFFSET --with-reference

Operations:

* ``set-redirect``

  Set a redirection between a ``base_object`` in the ``base_pool`` and a ``target_object``
  in the ``target_pool``.
  A redirected object will forward all operations from the client to the
  ``target_object``. ::

    void set_redirect(const std::string& tgt_obj, const IoCtx& tgt_ioctx,
                      uint64_t tgt_version, int flag = 0);

    rados -p base_pool set-redirect <base_object> --target-pool <target_pool>
      <target_object>

  Returns ``ENOENT`` if the object does not exist (TODO: why?)
  Returns ``EINVAL`` if the object already is a redirect.

  Takes a reference to target as part of operation, can possibly leak a ref
  if the acting set resets and the client dies between taking the ref and
  recording the redirect.

  Truncates object, clears omap, and clears xattrs as a side effect.

  At the top of ``do_osd_ops``, does not set user_modify.

  This operation is not a user mutation and does not trigger a clone to be created.

  There are two purposes of ``set_redirect``:

  1. Redirect all operations to the target object (like a proxy)
  2. Cache when ``tier_promote`` is called (the redirect will be cleared at this time).

* ``set-chunk``

  Set the ``chunk-offset`` in a ``source_object`` to make a link between it and a
  ``target_object``. ::

    void set_chunk(uint64_t src_offset, uint64_t src_length, const IoCtx& tgt_ioctx,
                   std::string tgt_oid, uint64_t tgt_offset, int flag = 0);

    rados -p base_pool set-chunk <source_object> <offset> <length> --target-pool
      <caspool> <target_object> <target-offset>

  Returns ``ENOENT`` if the object does not exist (TODO: why?)
  Returns ``EINVAL`` if the object already is a redirect.
  Returns ``EINVAL`` on an ill-formed parameter buffer.
  Returns ``ENOTSUPP`` if existing mapped chunks overlap with the new chunk mapping.

  Takes references to targets as part of operation, can possibly leak refs
  if the acting set resets and the client dies between taking the ref and
  recording the redirect.

  Truncates object, clears omap, and clears xattrs as a side effect.

  This operation is not a user mutation and does not trigger a clone to be created.

  TODO: ``SET_CHUNK`` appears to clear the manifest unconditionally if it's not chunked. ::

    if (!oi.manifest.is_chunked()) {
      oi.manifest.clear();
    }

* ``evict-chunk``

  Clears an extent from an object leaving only the manifest link between
  it and the ``target_object``. ::

    void evict_chunk(
      uint64_t offset, uint64_t length, int flag = 0);

    rados -p base_pool evict-chunk <offset> <length> <object>

  Returns ``EINVAL`` if the extent is not present in the manifest.

  Note: this does not exist yet.

* ``tier-promote``

  Promotes the object, ensuring that subsequent reads and writes will be local. ::

    void tier_promote();

    rados -p base_pool tier-promote <obj-name>

  Returns ``ENOENT`` if the object does not exist.

  For a redirect manifest, copies data to head.

  TODO: Promote on a redirect object needs to clear the redirect.

  For a chunked manifest, reads all ``MISSING`` extents into the base pool;
  subsequent reads and writes will be served from the base pool.

  Implementation Note: For a chunked manifest, calls ``start_copy`` on itself. The
  resulting ``copy_get`` operation will issue reads which will then be redirected by
  the normal manifest read machinery.

  Does not set the ``user_modify`` flag.

  Future work will involve adding support for specifying a ``clone_id``.
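
  A hedged usage sketch via the C++ API quoted above (the wrapper name
  ``promote`` is illustrative): ::

    #include <rados/librados.hpp>
    #include <string>

    // Promote oid in the base pool so subsequent I/O is served locally.
    int promote(librados::IoCtx& base_pool, const std::string& oid)
    {
      librados::ObjectWriteOperation op;
      op.tier_promote();
      return base_pool.operate(oid, &op);
    }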

* ``unset-manifest``

  Unset the manifest info in an object that has a manifest. ::

    void unset_manifest();

    rados -p base_pool unset-manifest <obj-name>

  Clears manifest chunks or redirect. Lazily releases references, may
  leak.

  ``do_osd_ops`` seems not to include it in the ``user_modify=false`` ``ignorelist``,
  and so will trigger a snapshot. Note that this will be true even for a
  redirect, though ``SET_REDIRECT`` does not flip ``user_modify``. This should
  be fixed -- ``unset-manifest`` should not be a ``user_modify``.

* ``tier-flush``

  Flush the object which has chunks to the chunk pool. ::

    void tier_flush();

    rados -p base_pool tier-flush <obj-name>

  Included in the ``user_modify=false`` ``ignorelist``, does not trigger a clone.

  Does not evict the extents.

ceph-dedup-tool
===============

``ceph-dedup-tool`` has two features: finding an optimal chunk offset for dedup chunking
and fixing the reference count (see ``./refcount.rst``).

* Find an optimal chunk offset

  a. Fixed chunk

     To find a good fixed chunk length, run the following command many
     times while varying the ``chunk_size``. ::

       ceph-dedup-tool --op estimate --pool $POOL --chunk-size chunk_size
         --chunk-algorithm fixed --fingerprint-algorithm sha1|sha256|sha512

  b. Rabin chunk (Rabin-Karp algorithm)

     Rabin-Karp is a string-searching algorithm based
     on a rolling hash. But a rolling hash is not enough to do deduplication because
     we don't know the chunk boundary. So, we need content-based slicing using
     a rolling hash for content-defined chunking.
     The current implementation uses the simplest approach: look for chunk boundaries
     by inspecting the rolling hash for a pattern (such as the
     lower N bits all being zero).

     Users who want to use deduplication need to find an ideal chunk offset.
     To find an ideal chunk offset, users should discover
     the optimal configuration for their data workload via ``ceph-dedup-tool``.
     This information will then be used for object chunking through
     the ``set-chunk`` API. ::

       ceph-dedup-tool --op estimate --pool $POOL --min-chunk min_size
         --chunk-algorithm rabin --fingerprint-algorithm rabin

     ``ceph-dedup-tool`` has many options for tuning ``rabin chunk``. ::

       --mod-prime <uint64_t>
       --rabin-prime <uint64_t>
       --pow <uint64_t>
       --chunk-mask-bit <uint32_t>
       --window-size <uint32_t>
       --min-chunk <uint32_t>
       --max-chunk <uint64_t>

     These options relate to the following rolling-hash equation. ::

       rabin_hash =
         (rabin_hash * rabin_prime + new_byte - old_byte * pow) % (mod_prime)
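
     As a concrete, unofficial sketch of this update rule (parameter
     names mirror the options above; the boundary test and overflow
     handling are simplified assumptions, not the tool's exact
     arithmetic): ::

       #include <cstdint>
       #include <deque>

       struct rabin_roller {
         uint64_t rabin_prime, mod_prime;
         uint64_t pow;          // rabin_prime^(window_size-1) % mod_prime
         uint64_t window_size;
         uint64_t hash = 0;
         std::deque<uint8_t> window;

         void roll(uint8_t new_byte) {
           uint64_t old_byte = 0;
           if (window.size() == window_size) {
             old_byte = window.front(); // byte sliding out of the window
             window.pop_front();
           }
           window.push_back(new_byte);
           // rabin_hash = (rabin_hash*rabin_prime + new_byte
           //               - old_byte*pow) % mod_prime
           hash = (hash * rabin_prime + new_byte
                   + mod_prime - (old_byte * pow) % mod_prime) % mod_prime;
         }

         // Declare a chunk boundary when the low chunk_mask_bit bits
         // are all zero (cf. --chunk-mask-bit).
         bool is_boundary(uint32_t chunk_mask_bit) const {
           return (hash & ((uint64_t(1) << chunk_mask_bit) - 1)) == 0;
         }
       };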

  c. Fixed chunk vs content-defined chunk

     Content-defined chunking may or may not be the optimal solution.
     For example,

     Data chunk ``A`` : ``abcdefgabcdefgabcdefg``

     Let's think about Data chunk ``A``'s deduplication. The ideal chunk offset is
     from ``1`` to ``7`` (``abcdefg``). So, if we use a fixed chunk, ``7`` is the
     optimal chunk length. But, in the case of content-based slicing, the optimal
     chunk length might not be found (the dedup ratio will not be 100%), because
     we need to find optimal parameters such as the boundary bit, window size,
     and prime value. This is not as easy as with a fixed chunk.
     But content-defined chunking is very effective in the following case.

     Data chunk ``B`` : ``abcdefgabcdefgabcdefg``

     Data chunk ``C`` : ``Tabcdefgabcdefgabcdefg``

* Fix reference count

  The key idea behind reference counting for dedup is tolerating false
  positives, which means ``(manifest object (no ref), chunk object (has ref))``
  can happen instead of ``(manifest object (has ref), chunk object (no ref))``.
  To fix such inconsistencies, ``ceph-dedup-tool`` supports ``chunk_scrub``. ::

    ceph-dedup-tool --op chunk_scrub --chunk_pool $CHUNK_POOL