ceph/doc/dev/deduplication.rst

===============
 Deduplication
===============


Introduction
============

Applying data deduplication on an existing software stack is not easy
due to additional metadata management and original data processing
procedure.

In a typical deduplication system, the input source as a data
object is split into multiple chunks by a chunking algorithm.
The deduplication system then compares each chunk with
the existing data chunks, stored in the storage previously.
To this end, a fingerprint index that stores the hash value
of each chunk is employed by the deduplication system
in order to easily find the existing chunks by comparing
hash value rather than searching all contents that reside in
the underlying storage.

There are many challenges in order to implement deduplication on top
of Ceph. Among them, two issues are essential for deduplication.
First is managing scalability of fingerprint index; Second is
it is complex to ensure compatibility between newly applied
deduplication metadata and existing metadata.

Key Idea
========
1. Content hashing (Double hashing): Each client can find an object data
for an object ID using CRUSH. With CRUSH, a client knows object's location
in Base tier.
By hashing object's content at Base tier, a new OID (chunk ID) is generated.
Chunk tier stores in the new OID that has a partial content of original object.

 Client 1 -> OID=1 -> HASH(1's content)=K -> OID=K ->
 CRUSH(K) -> chunk's location


2. Self-contained object: The external metadata design
makes difficult for integration with storage feature support
since existing storage features cannot recognize the
additional external data structures. If we can design data
deduplication system without any external component, the
original storage features can be reused.

More details in https://ieeexplore.ieee.org/document/8416369

Design
======

.. ditaa::

           +-------------+
           | Ceph Client |
           +------+------+
                  ^
     Tiering is   |
    Transparent   |               Metadata
        to Ceph   |           +---------------+
     Client Ops   |           |               |
                  |    +----->+   Base Pool   |
                  |    |      |               |
                  |    |      +-----+---+-----+
                  |    |            |   ^
                  v    v            |   |   Dedup metadata in Base Pool
           +------+----+--+         |   |   (Dedup metadata contains chunk offsets
           |   Objecter   |         |   |    and fingerprints)
           +-----------+--+         |   |
                       ^            |   |   Data in Chunk Pool
                       |            v   |
                       |      +-----+---+-----+
                       |      |               |
                       +----->|  Chunk Pool   |
                              |               |
                              +---------------+
                                    Data


Pool-based object management:
We define two pools.
The metadata pool stores metadata objects and the chunk pool stores
chunk objects. Since these two pools are divided based on
the purpose and usage, each pool can be managed more
efficiently according to its different characteristics. Base
pool and the chunk pool can separately select a redundancy
scheme between replication and erasure coding depending on
its usage and each pool can be placed in a different storage
location depending on the required performance.

Regarding how to use, please see ``osd_internals/manifest.rst``

Usage Patterns
==============

Each Ceph interface layer presents unique opportunities and costs for
deduplication and tiering in general.

RadosGW
-------

S3 big data workloads seem like a good opportunity for deduplication.  These
objects tend to be write once, read mostly objects which don't see partial
overwrites.  As such, it makes sense to fingerprint and dedup up front.

Unlike cephfs and rbd, radosgw has a system for storing
explicit metadata in the head object of a logical s3 object for
locating the remaining pieces.  As such, radosgw could use the
refcounting machinery (``osd_internals/refcount.rst``) directly without
needing direct support from rados for manifests.

RBD/Cephfs
----------

RBD and CephFS both use deterministic naming schemes to partition
block devices/file data over rados objects.  As such, the redirection
metadata would need to be included as part of rados, presumably
transparently.

Moreover, unlike radosgw, rbd/cephfs rados objects can see overwrites.
For those objects, we don't really want to perform dedup, and we don't
want to pay a write latency penalty in the hot path to do so anyway.
As such, performing tiering and dedup on cold objects in the background
is likely to be preferred.

One important wrinkle, however, is that both rbd and cephfs workloads
often feature usage of snapshots.  This means that the rados manifest
support needs robust support for snapshots.

RADOS Machinery
===============

For more information on rados redirect/chunk/dedup support, see ``osd_internals/manifest.rst``.
For more information on rados refcount support, see ``osd_internals/refcount.rst``.

Status and Future Work
======================

At the moment, there exists some preliminary support for manifest
objects within the OSD as well as a dedup tool.

RadosGW data warehouse workloads probably represent the largest
opportunity for this feature, so the first priority is probably to add
direct support for fingerprinting and redirects into the refcount pool
to radosgw.

Aside from radosgw, completing work on manifest object support in the
OSD particularly as it relates to snapshots would be the next step for
rbd and cephfs workloads.

How to use deduplication
========================

 * This feature is highly experimental and is subject to change or removal.

Ceph provides deduplication using RADOS machinery.
Below we explain how to perform deduplication.

Prerequisite
------------

If the Ceph cluster is started from Ceph mainline, users need to check
``ceph-test`` package which is including ceph-dedup-tool is installed.

Deatiled Instructions
---------------------

Users can use ceph-dedup-tool with ``estimate``, ``sample-dedup``,
``chunk-scrub``, and ``chunk-repair`` operations. To provide better
convenience for users, we have enabled necessary operations through
ceph-dedup-tool, and we recommend using the following operations freely
by using any types of scripts.


1. Estimate space saving ratio of a target pool using ``ceph-dedup-tool``.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: bash

    ceph-dedup-tool --op estimate
      --pool [BASE_POOL]
      --chunk-size [CHUNK_SIZE]
      --chunk-algorithm [fixed|fastcdc]
      --fingerprint-algorithm [sha1|sha256|sha512]
      --max-thread [THREAD_COUNT]

This CLI command will show how much storage space can be saved when deduplication
is applied on the pool. If the amount of the saved space is higher than user's expectation,
the pool probably is worth performing deduplication.
Users should specify the ``BASE_POOL``, within which the object targeted for deduplication
is stored. The users also need to run ceph-dedup-tool multiple time
with varying ``chunk_size`` to find the optimal chunk size. Note that the
optimal value probably differs in the content of each object in case of fastcdc
chunk algorithm (not fixed).

Example output:

.. code:: bash

    {
      "chunk_algo": "fastcdc",
      "chunk_sizes": [
        {
          "target_chunk_size": 8192,
          "dedup_bytes_ratio": 0.4897049
          "dedup_object_ratio": 34.567315
          "chunk_size_average": 64439,
          "chunk_size_stddev": 33620
        }
      ],
      "summary": {
        "examined_objects": 95,
        "examined_bytes": 214968649
      }
    }

The above is an example output when executing ``estimate``. ``target_chunk_size`` is the same as
``chunk_size`` given by the user. ``dedup_bytes_ratio`` shows how many bytes are redundant from
examined bytes. For instance, 1 - ``dedup_bytes_ratio`` means the percentage of saved storage space.
``dedup_object_ratio`` is the generated chunk objects / ``examined_objects``. ``chunk_size_average``
means that the divided chunk size on average when performing CDC---this may differnet from ``target_chunk_size``
because CDC genarates different chunk-boundary depending on the content. ``chunk_size_stddev``
represents the standard deviation of the chunk size.


2. Create chunk pool.
^^^^^^^^^^^^^^^^^^^^^

.. code:: bash

  ceph osd pool create [CHUNK_POOL]


3. Run dedup command (there are two ways).
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- **sample-dedup**

.. code:: bash

    ceph-dedup-tool --op sample-dedup
      --pool [BASE_POOL]
      --chunk-pool [CHUNK_POOL]
      --chunk-size [CHUNK_SIZE]
      --chunk-algorithm [fastcdc]
      --fingerprint-algorithm [sha1|sha256|sha512]
      --chunk-dedup-threshold [THRESHOLD]
      --max-thread [THREAD_COUNT]
      --sampling-ratio [SAMPLE_RATIO]
      --wakeup-period [WAKEUP_PERIOD]
      --loop
      --snap

The ``sample-dedup`` comamnd spawns threads specified by ``THREAD_COUNT`` to deduplicate objects on
the ``BASE_POOL``. According to sampling-ratio---do a full search if ``SAMPLE_RATIO`` is 100, the threads selectively
perform deduplication if the chunk is redundant over ``THRESHOLD`` times during iteration.
If --loop is set, the theads will wakeup after ``WAKEUP_PERIOD``. If not, the threads will exit after one iteration.

Example output:

.. code:: bash

   $ bin/ceph df
   --- RAW STORAGE ---
   CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
   ssd    303 GiB  294 GiB  9.0 GiB   9.0 GiB       2.99
   TOTAL  303 GiB  294 GiB  9.0 GiB   9.0 GiB       2.99

   --- POOLS ---
   POOL   ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
   .mgr    1    1  577 KiB        2  1.7 MiB      0     97 GiB
   base    2   32  2.0 GiB      517  6.0 GiB   2.02     97 GiB
   chunk   3   32   0  B          0    0   B      0     97 GiB

   $ bin/ceph-dedup-tool --op sample-dedup --pool base --chunk-pool chunk
     --fingerprint-algorithm sha1 --chunk-algorithm fastcdc --loop --sampling-ratio 100
     --chunk-dedup-threshold 2 --chunk-size 8192 --max-thread 4 --wakeup-period 60

   $ bin/ceph df
   --- RAW STORAGE ---
   CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
   ssd    303 GiB  298 GiB  5.4 GiB   5.4 GiB       1.80
   TOTAL  303 GiB  298 GiB  5.4 GiB   5.4 GiB       1.80

   --- POOLS ---
   POOL   ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
   .mgr    1    1  577 KiB        2  1.7 MiB      0     98 GiB
   base    2   32  452 MiB      262  1.3 GiB   0.50     98 GiB
   chunk   3   32  258 MiB   25.91k  938 MiB   0.31     98 GiB

- **object dedup**

.. code:: bash

    ceph-dedup-tool --op object-dedup
      --pool [BASE_POOL]
      --object [OID]
      --chunk-pool [CHUNK_POOL]
      --fingerprint-algorithm [sha1|sha256|sha512]
      --dedup-cdc-chunk-size [CHUNK_SIZE]

The ``object-dedup`` command triggers deduplication on the RADOS object specified by ``OID``.
All parameters shown above must be specified. ``CHUNK_SIZE`` should be taken from
the results of step 1 above.
Note that when this command is executed, ``fastcdc`` will be set by default and other parameters
such as ``fingerprint-algorithm`` and ``CHUNK_SIZE`` will be set as defaults for the pool.
Deduplicated objects will appear in the chunk pool. If the object is mutated over time, user needs to re-run
``object-dedup`` because chunk-boundary should be recalculated based on updated contents.
The user needs to specify ``snap`` if the target object is snapshotted. After deduplication is done, the target
object size in ``BASE_POOL`` is zero (evicted) and chunks objects are genereated---these appear in ``CHUNK_POOL``.

4. Read/write I/Os
^^^^^^^^^^^^^^^^^^

After step 3, the users don't need to consider anything about I/Os. Deduplicated objects are
completely compatible with existing RADOS operations.


5. Run scrub to fix reference count
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Reference mismatches can on rare occasions occur to false positives when handling reference counts for
deduplicated RADOS objects. These mismatches will be fixed by periodically scrubbing the pool:

.. code:: bash

    ceph-dedup-tool --op chunk-scrub
      --chunk-pool [CHUNK_POOL]
      --pool [POOL]
      --max-thread [THREAD_COUNT]

The ``chunk-scrub`` command identifies reference mismatches between a
metadata object and a chunk object. The ``chunk-pool`` parameter tells
where the target chunk objects are located to the ceph-dedup-tool.

Example output:

A reference mismatch is intentionally created by inserting a reference (dummy-obj) into a chunk object (2ac67f70d3dd187f8f332bb1391f61d4e5c9baae) by using chunk-get-ref.

.. code:: bash

    $ bin/ceph-dedup-tool --op dump-chunk-refs --chunk-pool chunk --object 2ac67f70d3dd187f8f332bb1391f61d4e5c9baae
    {
      "type": "by_object",
      "count": 2,
    	"refs": [
        {
          "oid": "testfile2",
        	"key": "",
        	"snapid": -2,
        	"hash": 2905889452,
        	"max": 0,
        	"pool": 2,
        	"namespace": ""
      	},
        {
          "oid": "dummy-obj",
          "key": "",
          "snapid": -2,
          "hash": 1203585162,
          "max": 0,
          "pool": 2,
          "namespace": ""
        }
      ]
    }

    $ bin/ceph-dedup-tool --op chunk-scrub --chunk-pool chunk --max-thread 10
    10 seconds is set as report period by default
    join
    join
    2ac67f70d3dd187f8f332bb1391f61d4e5c9baae
    --done--
    2ac67f70d3dd187f8f332bb1391f61d4e5c9baae ref 10:5102bde2:::dummy-obj:head: referencing pool does not exist
    --done--
     Total object : 1
     Examined object : 1
     Damaged object : 1

6. Repair a mismatched chunk reference
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If any reference mismatches occur after the ``chunk-scrub``, it is
recommended to perform the ``chunk-repair`` operation to fix reference
mismatches. The ``chunk-repair`` operation helps in resolving the
reference mismatch and restoring consistency.

.. code:: bash

    ceph-dedup-tool --op chunk-repair
      --chunk-pool [CHUNK_POOL_NAME]
      --object [CHUNK_OID]
      --target-ref [TARGET_OID]
      --target-ref-pool-id [TARGET_POOL_ID]

``chunk-repair`` fixes the ``target-ref``, which is a wrong reference of
an ``object``. To fix it correctly, the users must enter the correct
``TARGET_OID`` and ``TARGET_POOL_ID``.

.. code:: bash

    $ bin/ceph-dedup-tool --op chunk-repair --chunk-pool chunk --object 2ac67f70d3dd187f8f332bb1391f61d4e5c9baae --target-ref dummy-obj --target-ref-pool-id 10
    2ac67f70d3dd187f8f332bb1391f61d4e5c9baae has 1 references for dummy-obj
    dummy-obj has 0 references for 2ac67f70d3dd187f8f332bb1391f61d4e5c9baae
     fix dangling reference from 1 to 0

    $ bin/ceph-dedup-tool --op dump-chunk-refs --chunk-pool chunk --object 2ac67f70d3dd187f8f332bb1391f61d4e5c9baae
    {
      "type": "by_object",
      "count": 1,
      "refs": [
        {
          "oid": "testfile2",
          "key": "",
          "snapid": -2,
          "hash": 2905889452,
          "max": 0,
          "pool": 2,
          "namespace": ""
        }
      ]
    }