===============
 Deduplication
===============


Introduction
============

Applying data deduplication to an existing software stack is not easy,
because it requires additional metadata management and changes to the
original data-processing path.

In a typical deduplication system, an input data object is split into
multiple chunks by a chunking algorithm. The deduplication system then
compares each chunk with the chunks already stored in the system. To do
this efficiently, it maintains a fingerprint index that stores the hash
value of each chunk, so that an existing chunk can be found by comparing
hash values rather than scanning all of the data in the underlying storage.
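
As a minimal sketch of such a fingerprint index (illustrative only, not Ceph
code; std::hash stands in for a real fingerprint such as sha1/sha256), a
lookup keyed by chunk fingerprints could be expressed as follows. ::

  // Sketch of a fingerprint index: fingerprint -> location of the stored chunk.
  // std::hash is a stand-in for a real fingerprint algorithm.
  #include <cstdint>
  #include <functional>
  #include <iostream>
  #include <map>
  #include <string>
  #include <vector>

  using Fingerprint = std::size_t;

  struct ChunkLocation {
    std::string pool;   // pool that holds the chunk
    std::string oid;    // chunk object name
  };

  int main() {
    std::map<Fingerprint, ChunkLocation> fingerprint_index;
    std::vector<std::string> incoming_chunks = {"aaaa", "bbbb", "aaaa"};

    for (const auto& chunk : incoming_chunks) {
      Fingerprint fp = std::hash<std::string>{}(chunk);
      auto it = fingerprint_index.find(fp);
      if (it == fingerprint_index.end()) {
        // New chunk: store it and record where it lives.
        fingerprint_index[fp] = {"chunk_pool", "chunk_" + std::to_string(fp)};
        std::cout << "stored new chunk\n";
      } else {
        // Duplicate: only a reference to the existing chunk is needed.
        std::cout << "dedup hit on " << it->second.oid << "\n";
      }
    }
  }
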

There are many challenges in implementing deduplication on top of Ceph.
Among them, two issues are essential: first, keeping the fingerprint index
scalable; second, ensuring compatibility between the newly introduced
deduplication metadata and the existing metadata, which is complex.

Key Idea
========

1. Content hashing (double hashing): Each client can find the object data
for an object ID using CRUSH; with CRUSH, a client knows the object's
location in the Base tier.
By hashing the object's content at the Base tier, a new OID (chunk ID) is
generated. The Chunk tier then stores, under this new OID, a chunk holding
part of the original object's content.

  Client 1 -> OID=1 -> HASH(1's content)=K -> OID=K ->
  CRUSH(K) -> chunk's location

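A minimal sketch of this double-hashing flow (illustrative only; the
fingerprint function and OID formatting below are assumptions, not Ceph's
actual implementation). ::

  // Sketch: derive a chunk OID from the chunk's content ("double hashing").
  // CRUSH on the original OID locates the object in the base tier; a second
  // hash over the content names the chunk object, and CRUSH on that name
  // locates the chunk in the chunk tier.
  #include <functional>
  #include <iostream>
  #include <string>

  // Stand-in for a real fingerprint algorithm (sha1/sha256/sha512).
  std::string content_fingerprint(const std::string& content) {
    return std::to_string(std::hash<std::string>{}(content));
  }

  int main() {
    std::string base_oid = "1";                    // object ID known to the client
    std::string content = "partial object data";   // one chunk of the object

    std::string chunk_oid = content_fingerprint(content);  // OID = HASH(content)
    // CRUSH(chunk_oid) would then map this name to a location in the chunk pool.
    std::cout << "base OID " << base_oid << " -> chunk OID " << chunk_oid << "\n";
  }
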
2. Self-contained object: An external metadata design makes integration with
existing storage features difficult, because those features cannot recognize
the additional external data structures. If the deduplication system is
designed without any external component, the original storage features can
be reused as they are.

More details in https://ieeexplore.ieee.org/document/8416369

Design
======

.. ditaa::

           +-------------+
           | Ceph Client |
           +------+------+
                  ^
     Tiering is   |
    Transparent   |               Metadata
      to Ceph     |           +---------------+
     Client Ops   |           |               |
                  |    +----->+   Base Pool   |
                  |    |      |               |
                  |    |      +-----+---+-----+
                  |    |            |   ^
                  v    v            |   |  Dedup metadata in Base Pool
           +------+----+--+         |   |  (Dedup metadata contains chunk offsets
           |   Objecter   |         |   |   and fingerprints)
           +-----------+--+         |   |
                       ^            |   |  Data in Chunk Pool
                       |            v   |
                       |      +-----+---+-----+
                       |      |               |
                       +----->|   Chunk Pool  |
                              |               |
                              +---------------+
                                    Data

Pool-based object management:
We define two pools.
The metadata pool stores metadata objects and the chunk pool stores
chunk objects. Since the two pools are divided by purpose and usage,
each pool can be managed according to its own characteristics. The base
pool and the chunk pool can independently select a redundancy scheme
(replication or erasure coding) depending on their usage, and each pool
can be placed on different storage depending on the required performance.

Manifest object:
Metadata objects are stored in the base pool and contain the metadata
for data deduplication. ::

  struct object_manifest_t {
    enum {
      TYPE_NONE = 0,       // not a manifest object
      TYPE_REDIRECT = 1,   // the object redirects all ops to redirect_target
      TYPE_CHUNKED = 2,    // the object's data is split into chunks
    };
    uint8_t type;          // redirect, chunked, ...
    hobject_t redirect_target;                   // used when type is TYPE_REDIRECT
    std::map<uint64_t, chunk_info_t> chunk_map;  // offset -> chunk, used when type is TYPE_CHUNKED
  };

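As an illustration of how a chunk_map keyed by object offset can be used
(a simplified sketch; ChunkInfo below is a hypothetical stand-in for
chunk_info_t, not its real definition), a read offset can be routed to the
chunk that covers it as follows. ::

  // Sketch: route an offset within a manifest object to the covering chunk.
  #include <cstdint>
  #include <iostream>
  #include <map>
  #include <string>

  struct ChunkInfo {
    uint64_t length;    // length of the chunk
    std::string oid;    // chunk object (e.g. its fingerprint) in the chunk pool
  };

  int main() {
    // chunk_map: offset within the original object -> chunk covering that range.
    std::map<uint64_t, ChunkInfo> chunk_map = {
      {0,    {4096, "fp_a"}},
      {4096, {4096, "fp_b"}},
    };

    uint64_t read_offset = 5000;
    // Find the last chunk that starts at or before read_offset.
    auto it = chunk_map.upper_bound(read_offset);
    if (it != chunk_map.begin()) {
      --it;
      if (read_offset < it->first + it->second.length)
        std::cout << "offset " << read_offset << " -> chunk " << it->second.oid
                  << " at chunk offset " << (read_offset - it->first) << "\n";
    }
  }
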
A chunk object:
Chunk objects are stored in the chunk pool. A chunk object contains chunk
data and its reference count information.


Although chunk objects and manifest objects have a different purpose from
ordinary objects, they can be handled in the same way as ordinary objects.
Therefore, no additional dedup-specific operations are needed to support
existing features such as replication.

Usage
=====

To set up deduplication pools, you must have two pools: one acts as the
base pool and the other acts as the chunk pool. The base pool needs to be
configured with the fingerprint_algorithm option as follows. ::

  ceph osd pool set $BASE_POOL fingerprint_algorithm sha1|sha256|sha512
  --yes-i-really-mean-it

1. Create objects ::

     rados -p base_pool put foo ./foo

     rados -p chunk_pool put foo-chunk ./foo-chunk

2. Make a manifest object ::

     rados -p base_pool set-chunk foo $START_OFFSET $LENGTH --target-pool
     chunk_pool foo-chunk $START_OFFSET --with-reference


Interface
=========

* set-redirect

  Set a redirection between a base_object in the base_pool and a target_object
  in the target_pool.
  A redirected object forwards all operations from the client to the
  target_object. ::

    rados -p base_pool set-redirect <base_object> --target-pool <target_pool>
    <target_object>

* set-chunk

  Set the chunk-offset in a source_object to make a link between it and a
  target_object. ::

    rados -p base_pool set-chunk <source_object> <offset> <length> --target-pool
    <caspool> <target_object> <target-offset>

* tier-promote

  Promote the object (including chunks). ::

    rados -p base_pool tier-promote <obj-name>

* unset-manifest

  Unset the manifest option on an object that has a manifest. ::

    rados -p base_pool unset-manifest <obj-name>


Dedup tool
==========

The dedup tool has two features: finding the optimal chunk offset for dedup
chunking and fixing the reference count.

* find optimal chunk offset

  a. fixed chunk

  To find a good fixed chunk length, you need to run the following command
  many times while changing the chunk_size. ::

    ceph-dedup-tool --op estimate --pool $POOL --chunk-size chunk_size
    --chunk-algorithm fixed --fingerprint-algorithm sha1|sha256|sha512

  b. rabin chunk (Rabin-Karp algorithm)

  The Rabin-Karp algorithm is a string-searching algorithm based on a
  rolling hash, but a rolling hash alone is not enough for deduplication
  because we do not know the chunk boundaries. So we need content-based
  slicing using a rolling hash for content-defined chunking.
  The current implementation uses the simplest approach: it looks for chunk
  boundaries by inspecting the rolling hash for a pattern (for example, the
  lower N bits are all zeroes).

  - Usage

    Users who want to use deduplication need to find an ideal chunk offset.
    To do so, users should discover the optimal configuration for their data
    workload via ceph-dedup-tool. This chunking information will then be used
    for object chunking through the set-chunk api. ::

      ceph-dedup-tool --op estimate --pool $POOL --min-chunk min_size
      --chunk-algorithm rabin --fingerprint-algorithm rabin

  ceph-dedup-tool has many options for rabin chunking. These are the
  options for the rabin chunk algorithm. ::

    --mod-prime <uint64_t>
    --rabin-prime <uint64_t>
    --pow <uint64_t>
    --chunk-mask-bit <uint32_t>
    --window-size <uint32_t>
    --min-chunk <uint32_t>
    --max-chunk <uint64_t>

  Users should refer to the following equation when using the above options
  for rabin chunking. ::

    rabin_hash =
      (rabin_hash * rabin_prime + new_byte - old_byte * pow) % (mod_prime)

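  A minimal sketch of content-defined chunking with a rolling hash of this
  form is shown below (the parameter values are illustrative only, not
  ceph-dedup-tool defaults). A boundary is declared when the lower mask bits
  of the hash are all zero. ::

    // Sketch: rolling hash of the form
    //   hash = (hash * rabin_prime + new_byte - old_byte * pow) % mod_prime
    // with a chunk boundary wherever the low mask_bits bits are zero.
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
      const uint64_t mod_prime   = 1000000007ULL;  // illustrative value
      const uint64_t rabin_prime = 3;              // illustrative value
      const uint32_t window_size = 8;              // bytes in the rolling window
      const uint32_t mask_bits   = 2;              // low bits that must be zero
      const uint64_t mask = (1ULL << mask_bits) - 1;

      // pow = rabin_prime^(window_size - 1) % mod_prime: weight of the byte
      // that slides out of the window.
      uint64_t pow = 1;
      for (uint32_t i = 0; i + 1 < window_size; ++i)
        pow = (pow * rabin_prime) % mod_prime;

      std::string data = "abcdefgabcdefgabcdefgabcdefg";
      uint64_t hash = 0;
      std::vector<size_t> boundaries;

      for (size_t i = 0; i < data.size(); ++i) {
        uint64_t new_byte = static_cast<unsigned char>(data[i]);
        uint64_t old_byte = (i >= window_size)
            ? static_cast<unsigned char>(data[i - window_size]) : 0;
        // Same recurrence as above, rearranged to avoid unsigned underflow.
        hash = ((hash * rabin_prime) % mod_prime + new_byte + mod_prime
                - (old_byte * pow) % mod_prime) % mod_prime;
        if (i + 1 >= window_size && (hash & mask) == 0)
          boundaries.push_back(i + 1);   // cut the chunk after byte i
      }

      for (size_t b : boundaries)
        std::cout << "chunk boundary at offset " << b << "\n";
    }
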
  c. Fixed chunk vs content-defined chunk

  Content-defined chunking may or may not be the optimal solution.
  For example,

    Data chunk A : abcdefgabcdefgabcdefg

  Consider deduplicating data chunk A. The ideal chunk is the repeating
  sequence "abcdefg", so with fixed chunking a chunk length of 7 is optimal.
  With content-based slicing, however, the optimal chunk length may not be
  found (the dedup ratio will not be 100%), because we first have to find
  suitable parameters such as the boundary bits, window size, and prime
  value, which is not as easy as picking a fixed chunk length.
  But content-defined chunking is very effective in the following case,
  where the data is shifted by one byte (see the sketch after these
  examples).

    Data chunk B : abcdefgabcdefgabcdefg

    Data chunk C : Tabcdefgabcdefgabcdefg

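  The following sketch illustrates the fixed-chunk side of this comparison:
  with a fixed length of 7, chunk A dedups perfectly, while the one-byte
  shift in chunk C produces chunks that no longer match A's. (Illustrative
  code only, not part of ceph-dedup-tool.) ::

    // Sketch: fixed chunking of the example data, counting duplicate chunks.
    #include <iostream>
    #include <set>
    #include <string>
    #include <vector>

    // Split data into fixed-size chunks (the tail may be shorter).
    std::vector<std::string> fixed_chunks(const std::string& data, size_t len) {
      std::vector<std::string> chunks;
      for (size_t off = 0; off < data.size(); off += len)
        chunks.push_back(data.substr(off, len));
      return chunks;
    }

    int main() {
      std::string chunk_a = "abcdefgabcdefgabcdefg";    // Data chunk A
      std::string chunk_c = "Tabcdefgabcdefgabcdefg";   // Data chunk C (shifted)

      for (const std::string& data : {chunk_a, chunk_c}) {
        auto chunks = fixed_chunks(data, 7);
        std::set<std::string> unique(chunks.begin(), chunks.end());
        std::cout << data << ": " << chunks.size() << " chunks, "
                  << unique.size() << " unique\n";
      }
    }
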
* fix reference count

  The key idea behind reference counting for dedup is to tolerate false
  positives: the state (manifest object has no reference, chunk object has a
  reference) may occur, instead of the unsafe state (manifest object has a
  reference, chunk object has no reference).
  To fix such inconsistency, ceph-dedup-tool supports chunk_scrub. ::

    ceph-dedup-tool --op chunk_scrub --pool $POOL --chunk_pool $CHUNK_POOL

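  A minimal sketch of the chunk-scrub idea (hypothetical in-memory data, not
  ceph-dedup-tool internals): a chunk may record a reference from a manifest
  that no longer points at it, and scrub drops such stale references. ::

    // Sketch: drop chunk references that no manifest actually holds.
    #include <iostream>
    #include <map>
    #include <set>
    #include <string>

    int main() {
      // manifest object -> chunks it references
      std::map<std::string, std::set<std::string>> manifest_refs = {
        {"foo", {"chunk_A"}},
      };
      // chunk object -> manifests recorded as referencing it
      std::map<std::string, std::set<std::string>> chunk_refs = {
        {"chunk_A", {"foo"}},
        {"chunk_B", {"foo"}},   // stale: foo no longer references chunk_B
      };

      for (auto& [chunk, refs] : chunk_refs) {
        for (auto it = refs.begin(); it != refs.end();) {
          auto m = manifest_refs.find(*it);
          bool valid = (m != manifest_refs.end()) && m->second.count(chunk);
          if (!valid) {
            std::cout << "scrub: dropping stale ref " << *it << " -> " << chunk << "\n";
            it = refs.erase(it);
          } else {
            ++it;
          }
        }
      }
    }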