doc:dev update how to use deduplication

Add prerequisite for deduplication, example outputs, and chunk-repair operation. Signed-off-by: Sungmin Lee <sung_min.lee@samsung.com>
2024-12-28 22:43:29 +00:00 · 2023-05-26 17:05:46 +09:00 · 2023-05-26 17:05:46 +09:00 · 9b74a33fa1
commit 9b74a33fa1
parent d01e128db2
1 changed files with 189 additions and 20 deletions
--- a/doc/dev/deduplication.rst
+++ b/doc/dev/deduplication.rst
@ -157,25 +157,46 @@ How to use deduplication
 Ceph provides deduplication using RADOS machinery.
 Below we explain how to perform deduplication. 

+Prerequisite
+------------
+
+If the Ceph cluster is started from Ceph mainline, users need to check
+``ceph-test`` package which is including ceph-dedup-tool is installed.
+
+Deatiled Instructions
+---------------------
+
+Users can use ceph-dedup-tool with ``estimate``, ``sample-dedup``, 
+``chunk-scrub``, and ``chunk-repair`` operations. To provide better
+convenience for users, we have enabled necessary operations through
+ceph-dedup-tool, and we recommend using the following operations freely
+by using any types of scripts.
+

 1. Estimate space saving ratio of a target pool using ``ceph-dedup-tool``.
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 .. code:: bash

-    ceph-dedup-tool --op estimate --pool $POOL --chunk-size chunk_size  
-      --chunk-algorithm fixed|fastcdc --fingerprint-algorithm sha1|sha256|sha512
-      --max-thread THREAD_COUNT
+    ceph-dedup-tool --op estimate
+      --pool [BASE_POOL]
+      --chunk-size [CHUNK_SIZE]
+      --chunk-algorithm [fixed|fastcdc]
+      --fingerprint-algorithm [sha1|sha256|sha512]
+      --max-thread [THREAD_COUNT]

 This CLI command will show how much storage space can be saved when deduplication
 is applied on the pool. If the amount of the saved space is higher than user's expectation,
 the pool probably is worth performing deduplication. 
-Users should specify $POOL where the object---the users want to perform
-deduplication---is stored. The users also need to run ceph-dedup-tool multiple time
+Users should specify the ``BASE_POOL``, within which the object targeted for deduplication 
+is stored. The users also need to run ceph-dedup-tool multiple time
 with varying ``chunk_size`` to find the optimal chunk size. Note that the
 optimal value probably differs in the content of each object in case of fastcdc
-chunk algorithm (not fixed). Example output:
+chunk algorithm (not fixed).

-::
+Example output:
+
+.. code:: bash

    {
      "chunk_algo": "fastcdc",
@ -204,54 +225,202 @@ represents the standard deviation of the chunk size.


 2. Create chunk pool. 
+^^^^^^^^^^^^^^^^^^^^^

 .. code:: bash

-  ceph osd pool create CHUNK_POOL
+  ceph osd pool create [CHUNK_POOL]
    

 3. Run dedup command (there are two ways).
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+- **sample-dedup**
  
 .. code:: bash

-    ceph-dedup-tool --op sample-dedup --pool POOL --chunk-pool CHUNK_POOL --chunk-size 
-    CHUNK_SIZE --chunk-algorithm fastcdc --fingerprint-algorithm sha1|sha256|sha512 
-    --chunk-dedup-threshold THRESHOLD --max-thread THREAD_COUNT ----sampling-ratio SAMPLE_RATIO
-    --wakeup-period WAKEUP_PERIOD --loop --snap
+    ceph-dedup-tool --op sample-dedup
+      --pool [BASE_POOL]
+      --chunk-pool [CHUNK_POOL]
+      --chunk-size [CHUNK_SIZE]
+      --chunk-algorithm [fastcdc]
+      --fingerprint-algorithm [sha1|sha256|sha512]
+      --chunk-dedup-threshold [THRESHOLD]
+      --max-thread [THREAD_COUNT]
+      --sampling-ratio [SAMPLE_RATIO]
+      --wakeup-period [WAKEUP_PERIOD]
+      --loop 
+      --snap

 The ``sample-dedup`` comamnd spawns threads specified by ``THREAD_COUNT`` to deduplicate objects on
-the ``POOL``. According to sampling-ratio---do a full search if ``SAMPLE_RATIO`` is 100, the threads selectively
+the ``BASE_POOL``. According to sampling-ratio---do a full search if ``SAMPLE_RATIO`` is 100, the threads selectively
 perform deduplication if the chunk is redundant over ``THRESHOLD`` times during iteration.
 If --loop is set, the theads will wakeup after ``WAKEUP_PERIOD``. If not, the threads will exit after one iteration.

+Example output:
+
 .. code:: bash

-    ceph-dedup-tool --op object-dedup --pool POOL --object OID --chunk-pool CHUNK_POOL
-      --fingerprint-algorithm sha1|sha256|sha512 --dedup-cdc-chunk-size CHUNK_SIZE
+   $ bin/ceph df
+   --- RAW STORAGE ---
+   CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
+   ssd    303 GiB  294 GiB  9.0 GiB   9.0 GiB       2.99
+   TOTAL  303 GiB  294 GiB  9.0 GiB   9.0 GiB       2.99
+
+   --- POOLS ---
+   POOL   ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
+   .mgr    1    1  577 KiB        2  1.7 MiB      0     97 GiB
+   base    2   32  2.0 GiB      517  6.0 GiB   2.02     97 GiB
+   chunk   3   32   0  B          0    0   B      0     97 GiB
+
+   $ bin/ceph-dedup-tool --op sample-dedup --pool base --chunk-pool chunk
+     --fingerprint-algorithm sha1 --chunk-algorithm fastcdc --loop --sampling-ratio 100
+     --chunk-dedup-threshold 2 --chunk-size 8192 --max-thread 4 --wakeup-period 60
+
+   $ bin/ceph df
+   --- RAW STORAGE ---
+   CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
+   ssd    303 GiB  298 GiB  5.4 GiB   5.4 GiB       1.80
+   TOTAL  303 GiB  298 GiB  5.4 GiB   5.4 GiB       1.80
+
+   --- POOLS ---
+   POOL   ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
+   .mgr    1    1  577 KiB        2  1.7 MiB      0     98 GiB
+   base    2   32  452 MiB      262  1.3 GiB   0.50     98 GiB
+   chunk   3   32  258 MiB   25.91k  938 MiB   0.31     98 GiB
+
+- **object dedup**
+
+.. code:: bash
+
+    ceph-dedup-tool --op object-dedup
+      --pool [BASE_POOL]
+      --object [OID]
+      --chunk-pool [CHUNK_POOL]
+      --fingerprint-algorithm [sha1|sha256|sha512]
+      --dedup-cdc-chunk-size [CHUNK_SIZE]

 The ``object-dedup`` command triggers deduplication on the RADOS object specified by ``OID``.
 All parameters shown above must be specified. ``CHUNK_SIZE`` should be taken from
 the results of step 1 above.
 Note that when this command is executed, ``fastcdc`` will be set by default and other parameters
-such as ``FP`` and ``CHUNK_SIZE`` will be set as defaults for the pool.
+such as ``fingerprint-algorithm`` and ``CHUNK_SIZE`` will be set as defaults for the pool.
 Deduplicated objects will appear in the chunk pool. If the object is mutated over time, user needs to re-run
 ``object-dedup`` because chunk-boundary should be recalculated based on updated contents.
 The user needs to specify ``snap`` if the target object is snapshotted. After deduplication is done, the target
-object size in ``POOL`` is zero (evicted) and chunks objects are genereated---these appear in ``CHUNK_POOL``.
-
+object size in ``BASE_POOL`` is zero (evicted) and chunks objects are genereated---these appear in ``CHUNK_POOL``.

 4. Read/write I/Os
+^^^^^^^^^^^^^^^^^^

 After step 3, the users don't need to consider anything about I/Os. Deduplicated objects are
 completely compatible with existing RAODS operations.


 5. Run scrub to fix reference count 
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 Reference mismatches can on rare occasions occur to false positives when handling reference counts for
 deduplicated RADOS objects. These mismatches will be fixed by periodically scrubbing the pool:

 .. code:: bash

-    ceph-dedup-tool --op chunk-scrub --op chunk-scrub --chunk-pool CHUNK_POOL --pool POOL --max-thread THREAD_COUNT
-  
+    ceph-dedup-tool --op chunk-scrub
+      --chunk-pool [CHUNK_POOL]
+      --pool [POOL]
+      --max-thread [THREAD_COUNT]
+
+The ``chunk-scrub`` command identifies reference mismatches between a
+metadata object and a chunk object. The ``chunk-pool`` parameter tells
+where the target chunk objects are located to the ceph-dedup-tool.
+
+Example output:
+
+A reference mismatch is intentionally created by inserting a reference (dummy-obj) into a chunk object (2ac67f70d3dd187f8f332bb1391f61d4e5c9baae) by using chunk-get-ref.
+
+.. code:: bash
+
+    $ bin/ceph-dedup-tool --op dump-chunk-refs --chunk-pool chunk --object 2ac67f70d3dd187f8f332bb1391f61d4e5c9baae
+    {
+      "type": "by_object",
+      "count": 2,
+    	"refs": [
+        {
+          "oid": "testfile2",
+        	"key": "",
+        	"snapid": -2,
+        	"hash": 2905889452,
+        	"max": 0,
+        	"pool": 2,
+        	"namespace": ""
+      	},
+        {
+          "oid": "dummy-obj",
+          "key": "",
+          "snapid": -2,
+          "hash": 1203585162,
+          "max": 0,
+          "pool": 2,
+          "namespace": ""
+        }
+      ]
+    }
+
+    $ bin/ceph-dedup-tool --op chunk-scrub --chunk-pool chunk --max-thread 10
+    10 seconds is set as report period by default
+    join
+    join
+    2ac67f70d3dd187f8f332bb1391f61d4e5c9baae
+    --done--
+    2ac67f70d3dd187f8f332bb1391f61d4e5c9baae ref 10:5102bde2:::dummy-obj:head: referencing pool does not exist
+    --done--
+     Total object : 1
+     Examined object : 1
+     Damaged object : 1
+
+6. Repair a mismatched chunk reference
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If any reference mismatches occur after the ``chunk-scrub``, it is
+recommended to perform the ``chunk-repair`` operation to fix reference
+mismatches. The ``chunk-repair`` operation helps in resolving the
+reference mismatch and restoring consistency.
+
+.. code:: bash
+
+    ceph-dedup-tool --op chunk-repair
+      --chunk-pool [CHUNK_POOL_NAME]
+      --object [CHUNK_OID]
+      --target-ref [TARGET_OID]
+      --target-ref-pool-id [TARGET_POOL_ID]
+
+``chunk-repair`` fixes the ``target-ref``, which is a wrong reference of
+an ``object``. To fix it correctly, the users must enter the correct
+``TARGET_OID`` and ``TARGET_POOL_ID``.
+
+.. code:: bash
+
+    $ bin/ceph-dedup-tool --op chunk-repair --chunk-pool chunk --object 2ac67f70d3dd187f8f332bb1391f61d4e5c9baae --target-ref dummy-obj --target-ref-pool-id 10
+    2ac67f70d3dd187f8f332bb1391f61d4e5c9baae has 1 references for dummy-obj
+    dummy-obj has 0 references for 2ac67f70d3dd187f8f332bb1391f61d4e5c9baae
+     fix dangling reference from 1 to 0
+
+    $ bin/ceph-dedup-tool --op dump-chunk-refs --chunk-pool chunk --object 2ac67f70d3dd187f8f332bb1391f61d4e5c9baae
+    {
+      "type": "by_object",
+      "count": 1,
+      "refs": [
+        {
+          "oid": "testfile2",
+          "key": "",
+          "snapid": -2,
+          "hash": 2905889452,
+          "max": 0,
+          "pool": 2,
+          "namespace": ""
+        }
+      ]
+    }
+
+
+