Deduplication
=============
In the context of filesystems, deduplication is the process of looking up
identical data blocks that are tracked separately and creating a shared
logical link while removing one of the copies. This saves data space at the
cost of increased metadata consumption.

There are two main deduplication types:

* **in-band** *(sometimes also called on-line)* -- all newly written data are
  considered for deduplication before writing
* **out-of-band** *(sometimes also called offline)* -- data for deduplication
  have to be actively looked for and deduplicated by the user application

Both have their pros and cons. BTRFS implements **only the out-of-band** type.

BTRFS provides the basic building blocks for deduplication, allowing other
tools to choose the strategy and scope of the deduplication. There are multiple
tools that take different approaches to deduplication, offer additional
features or make trade-offs. The following table lists tools that are known to
be up-to-date, maintained and widely used.

.. list-table::
   :header-rows: 1

   * - Name
     - File based
     - Block based
     - Incremental
   * - `BEES <https://github.com/Zygo/bees>`_
     - No
     - Yes
     - Yes
   * - `duperemove <https://github.com/markfasheh/duperemove>`_
     - Yes
     - No
     - Yes

File based deduplication
------------------------

The tool takes a list of files and tries to find duplicates only among data
from those files. This is suitable e.g. for files that originated from the
same base image or are the source of a reflinked file. Optionally the tools
can track a database of hashes and deduplicate blocks from additional files,
or use it for repeated runs, updating the database incrementally.

Block based deduplication
-------------------------

The tool typically scans the filesystem and builds a database of file block
hashes, then finds candidate files and deduplicates the matching ranges. The
hash database is kept as an ordinary file and can be scaled according to the
needs. As files change, the hash database may get out of sync and the scan
has to be done repeatedly.
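
As an illustration, the scan phase can be sketched in a few lines of C. This
is a minimal sketch, not how any of the tools above is implemented: it assumes
4 KiB blocks, uses FNV-1a as a stand-in for the stronger hashes real tools
use, and prints candidates to stdout instead of maintaining a hash database:

.. code-block:: c

   /* Hash every 4 KiB block of the files given on the command line and
    * print (hash, file, offset). Sorting the output by the first column
    * groups candidate duplicate blocks together. */
   #include <stdio.h>
   #include <stdint.h>

   #define BLOCK_SIZE 4096

   /* FNV-1a: illustrative only; real tools use stronger hashes. */
   static uint64_t fnv1a(const unsigned char *buf, size_t len)
   {
       uint64_t h = 0xcbf29ce484222325ULL;

       for (size_t i = 0; i < len; i++) {
           h ^= buf[i];
           h *= 0x100000001b3ULL;
       }
       return h;
   }

   int main(int argc, char **argv)
   {
       unsigned char buf[BLOCK_SIZE];

       for (int i = 1; i < argc; i++) {
           FILE *f = fopen(argv[i], "rb");
           long off = 0;

           if (!f) {
               perror(argv[i]);
               continue;
           }
           /* Partial blocks at the end of a file are skipped, as
            * deduplication works on whole aligned blocks. */
           while (fread(buf, 1, BLOCK_SIZE, f) == BLOCK_SIZE) {
               printf("%016llx %s %ld\n",
                      (unsigned long long)fnv1a(buf, BLOCK_SIZE),
                      argv[i], off);
               off += BLOCK_SIZE;
           }
           fclose(f);
       }
       return 0;
   }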

Safety of block comparison
--------------------------

Deduplication inside the filesystem is implemented as an ``ioctl`` that takes
a source file, a destination file and the range. The blocks from both files
are compared byte by byte for an exact match before being merged to the same
range (i.e. there's no hash-based comparison). Pages representing the extents
in memory are locked prior to deduplication, which prevents concurrent
modification by buffered or mmapped writes.
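
A minimal sketch of calling this interface through the generic
``FIDEDUPERANGE`` ioctl from ``linux/fs.h`` (the VFS entry point for this
operation); the file names and the 4 KiB length are illustrative assumptions:

.. code-block:: c

   #include <fcntl.h>
   #include <stdio.h>
   #include <stdlib.h>
   #include <string.h>
   #include <sys/ioctl.h>
   #include <linux/fs.h>

   int main(void)
   {
       /* Illustrative file names; both ranges must hold identical data. */
       int src = open("a.img", O_RDONLY);
       int dst = open("b.img", O_RDWR);
       struct file_dedupe_range *arg;

       if (src < 0 || dst < 0) {
           perror("open");
           return 1;
       }
       /* One destination range; the structure allows batching several. */
       arg = calloc(1, sizeof(*arg) + sizeof(struct file_dedupe_range_info));
       if (!arg)
           return 1;
       arg->src_offset = 0;
       arg->src_length = 4096;            /* must be block aligned */
       arg->dest_count = 1;
       arg->info[0].dest_fd = dst;
       arg->info[0].dest_offset = 0;

       if (ioctl(src, FIDEDUPERANGE, arg) < 0) {
           perror("FIDEDUPERANGE");
           return 1;
       }
       /* The kernel reports the byte-comparison result per destination. */
       if (arg->info[0].status == FILE_DEDUPE_RANGE_DIFFERS)
           printf("ranges differ, nothing deduplicated\n");
       else if (arg->info[0].status < 0)
           fprintf(stderr, "dedupe failed: %s\n",
                   strerror(-arg->info[0].status));
       else
           printf("deduplicated %llu bytes\n",
                  (unsigned long long)arg->info[0].bytes_deduped);
       free(arg);
       return 0;
   }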

Limitations, compatibility
--------------------------

Files that are subject to deduplication must have the same status regarding
COW, i.e. both regular COW files with checksums, or both NOCOW, or both files
COW but without checksums (the NODATASUM attribute is set).
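
The NOCOW part of that rule can be checked from the inode flags
(``FS_NOCOW_FL``, what ``lsattr`` shows as *C*). A minimal sketch comparing
two files before attempting deduplication; the file names are illustrative
and the check covers only the NOCOW attribute, not the checksum status:

.. code-block:: c

   #include <fcntl.h>
   #include <stdio.h>
   #include <sys/ioctl.h>
   #include <linux/fs.h>

   /* Returns 1 if the NOCOW attribute is set, 0 if not, -1 on error. */
   static int nocow(int fd)
   {
       int flags = 0;

       if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
           return -1;
       return (flags & FS_NOCOW_FL) != 0;
   }

   int main(void)
   {
       /* Illustrative file names. */
       int a = open("a.img", O_RDONLY);
       int b = open("b.img", O_RDONLY);

       if (a < 0 || b < 0) {
           perror("open");
           return 1;
       }
       if (nocow(a) != nocow(b)) {
           fprintf(stderr, "COW status differs, deduplication would fail\n");
           return 1;
       }
       printf("COW status matches\n");
       return 0;
   }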

If deduplication is in progress on any file in the filesystem, the *send*
operation cannot be started as it relies on the extent layout being unchanged.