|
|
|
@@ -737,169 +737,13 @@ priority, not the btrfs mount options).
|
|
|
|
|
CHECKSUM ALGORITHMS
|
|
|
|
|
-------------------
|
|
|
|
|
|
|
|
|
|
Several checksum algorithms are supported. The default and backward
compatible one is *crc32c*. Since kernel 5.5 there are three more, with
different characteristics and trade-offs regarding speed and strength. The
following list may help you decide which one to select.
|
|
|
|
|
|
|
|
|
|
CRC32C (32bit digest)
        default, best backward compatibility, very fast, modern CPUs have
        instruction-level support, not collision-resistant but still good error
        detection capabilities

XXHASH (64bit digest)
        can be used as CRC32C successor, very fast, optimized for modern CPUs utilizing
        instruction pipelining, good collision resistance and error detection

SHA256 (256bit digest)
        a cryptographic-strength hash, relatively slow but with possible CPU
        instruction acceleration or specialized hardware cards, FIPS certified and
        in wide use

BLAKE2b (256bit digest)
        a cryptographic-strength hash, relatively fast with possible CPU acceleration
        using SIMD extensions, not standardized but based on BLAKE which was a SHA3
        finalist, in wide use, the algorithm used is BLAKE2b-256 that's optimized for
        64bit platforms
|
|
|
|
|
|
|
|
|
|
The *digest size* affects the overall size of the data block checksums stored
in the filesystem. Metadata blocks reserve a fixed area of up to 256 bits
(32 bytes) for the checksum, so a larger digest does not increase their size.
Each data block has a separate checksum stored, with the additional overhead
of the b-tree leaves.
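
As a rough illustration of how the digest size scales the stored data
checksums, the sketch below computes the raw checksum bytes needed for a given
amount of data. It assumes one checksum per 4KiB data block (the block size is
an assumption here, it matches the unit used in the performance table) and
ignores the b-tree leaf overhead mentioned above. The names *DIGEST_BYTES* and
*checksum_bytes* are made up for this example.

.. code-block:: python

        # Digest sizes in bytes, taken from the list above (32, 64, 256 and 256 bits).
        DIGEST_BYTES = {"crc32c": 4, "xxhash": 8, "sha256": 32, "blake2b": 32}


        def checksum_bytes(data_bytes, algorithm, block_size=4096):
            """Raw checksum bytes for data_bytes of data, one digest per block.

            block_size=4096 is an assumption; b-tree overhead is not included.
            """
            blocks = -(-data_bytes // block_size)  # ceiling division
            return blocks * DIGEST_BYTES[algorithm]


        if __name__ == "__main__":
            tib = 1 << 40
            for name in DIGEST_BYTES:
                mib = checksum_bytes(tib, name) // (1 << 20)
                print(f"{name:8s} {mib:6d} MiB of checksums per TiB of data")

Under these assumptions the raw checksum data grows linearly with the digest
size, so a 256bit digest needs eight times the space of CRC32C.
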
|
|
|
|
|
|
|
|
|
|
Approximate relative performance of the algorithms, measured against CRC32C
using reference software implementations on a 3.5GHz Intel CPU:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
========  ============  =======  ================
Digest    Cycles/4KiB   Ratio    Implementation
========  ============  =======  ================
CRC32C    1700          1.00     CPU instruction
XXHASH    2500          1.44     reference impl.
SHA256    105000        61       reference impl.
SHA256    36000         21       libgcrypt/AVX2
SHA256    63000         37       libsodium/AVX2
BLAKE2b   22000         13       reference impl.
BLAKE2b   19000         11       libgcrypt/AVX2
BLAKE2b   19000         11       libsodium/AVX2
========  ============  =======  ================
|
|
|
|
|
|
|
|
|
|
Many kernels are configured with SHA256 as built-in and not as a module.
The accelerated versions are however provided by the modules and must be loaded
explicitly (**modprobe sha256**) before mounting the filesystem to make use of
them. You can check in */sys/fs/btrfs/FSID/checksum* which one is used. If you
see *sha256-generic*, then you may want to unmount and mount the filesystem
again, because changing that on a mounted filesystem is not possible.
Check the file */proc/crypto*; when the implementation is built-in, you'd find
|
|
|
|
|
|
|
|
|
|
.. code-block:: none

        name         : sha256
        driver       : sha256-generic
        module       : kernel
        priority     : 100
        ...
|
|
|
|
|
|
|
|
|
|
while an accelerated implementation is e.g.
|
|
|
|
|
|
|
|
|
|
.. code-block:: none

        name         : sha256
        driver       : sha256-avx2
        module       : sha256_ssse3
        priority     : 170
        ...
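
The comparison of the two listings can be automated. The following sketch
parses text in the */proc/crypto* format and reports the driver with the
highest priority for a given algorithm name. It assumes that entries are
separated by blank lines and that the kernel prefers the implementation with
the highest priority value. The helper names *parse_crypto* and
*preferred_driver* are hypothetical, and the sample text only repeats the two
listings above.

.. code-block:: python

        SAMPLE = """\
        name         : sha256
        driver       : sha256-generic
        module       : kernel
        priority     : 100

        name         : sha256
        driver       : sha256-avx2
        module       : sha256_ssse3
        priority     : 170
        """


        def parse_crypto(text):
            """Split /proc/crypto-style text into a list of key/value dicts."""
            entries = []
            for block in text.strip().split("\n\n"):
                entry = {}
                for line in block.splitlines():
                    key, sep, value = line.partition(":")
                    if sep:
                        entry[key.strip()] = value.strip()
                if entry:
                    entries.append(entry)
            return entries


        def preferred_driver(text, name):
            """Return the highest-priority driver for name, or None if absent."""
            candidates = [
                e for e in parse_crypto(text)
                if e.get("name") == name and "driver" in e and "priority" in e
            ]
            if not candidates:
                return None
            return max(candidates, key=lambda e: int(e["priority"]))["driver"]


        if __name__ == "__main__":
            # On a live system, read the real file instead of SAMPLE:
            #   text = open("/proc/crypto").read()
            print(preferred_driver(SAMPLE, "sha256"))

If the result is *sha256-generic*, no accelerated module is loaded and the
steps described above apply.
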
|
|
|
|
|
.. include:: ch-checksumming.rst
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
COMPRESSION
|
|
|
|
|
-----------
|
|
|
|
|
|
|
|
|
|
Btrfs supports transparent file compression. There are three algorithms
available: ZLIB, LZO and ZSTD (since v4.14). Compression is applied on a
file-by-file basis. A single btrfs mount point can hold some files that are
uncompressed, some that are compressed with LZO and some with ZLIB, for
instance (though you may not want it that way, it is supported).
|
|
|
|
|
|
|
|
|
|
To enable compression, mount the filesystem with options *compress* or
*compress-force*. Please refer to section *MOUNT OPTIONS*. Once compression is
enabled, all new writes will be subject to compression. Some files may not
compress very well, and these are typically written uncompressed instead.
|
|
|
|
|
|
|
|
|
|
Each compression algorithm has different speed/ratio trade-offs. The levels
can be selected by a mount option and affect only the resulting size (i.e.
no compatibility issues).
|
|
|
|
|
|
|
|
|
|
Basic characteristics:
|
|
|
|
|
|
|
|
|
|
ZLIB
        * slower, higher compression ratio
        * levels: 1 to 9, mapped directly, default level is 3
        * good backward compatibility

LZO
        * faster compression and decompression than zlib, worse compression ratio, designed to be fast
        * no levels
        * good backward compatibility

ZSTD
        * compression comparable to zlib with higher compression/decompression speeds and different ratio
        * levels: 1 to 15
        * since 4.14, levels since 5.1
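
The level ranges listed above can be checked before a mount option string is
put together. The sketch below is only an illustration: it assumes the
*type:level* form of the *compress* and *compress-force* options described in
section *MOUNT OPTIONS*, and the names *LEVEL_RANGES* and *compress_option*
are made up for this example.

.. code-block:: python

        # Level ranges from the list above; None means the algorithm has no levels.
        LEVEL_RANGES = {"zlib": (1, 9), "lzo": None, "zstd": (1, 15)}


        def compress_option(algorithm, level=None, force=False):
            """Build a compress= or compress-force= mount option string."""
            algorithm = algorithm.lower()
            if algorithm not in LEVEL_RANGES:
                raise ValueError(f"unknown algorithm: {algorithm}")
            key = "compress-force" if force else "compress"
            if level is None:
                return f"{key}={algorithm}"
            allowed = LEVEL_RANGES[algorithm]
            if allowed is None:
                raise ValueError(f"{algorithm} does not support levels")
            low, high = allowed
            if not low <= level <= high:
                raise ValueError(f"{algorithm} level must be {low} to {high}")
            return f"{key}={algorithm}:{level}"


        if __name__ == "__main__":
            print(compress_option("zstd", 3))
            print(compress_option("lzo", force=True))

The resulting string would be passed to **mount -o**. Kernels older than the
versions listed above may not accept a level at all.
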
|
|
|
|
|
|
|
|
|
|
The differences depend on the actual data set and cannot be expressed by a
single number or recommendation. Higher levels consume more CPU time and may
not bring a significant improvement; lower levels are close to real time.
|
|
|
|
|
|
|
|
|
|
The algorithms can be mixed in one file as they're stored per extent. The
compression of a file can be changed by the **btrfs filesystem defrag** command,
using the *-c* option, or by **btrfs property set** using the *compression*
property. Setting compression by the **chattr +c** utility will set it to zlib.
|
|
|
|
|
|
|
|
|
|
INCOMPRESSIBLE DATA
|
|
|
|
|
^^^^^^^^^^^^^^^^^^^
|
|
|
|
|
|
|
|
|
|
Files with already compressed data, or with data that won't compress well
within the CPU and memory constraints of the kernel implementations, are
handled by a simple decision logic. If the first portion of data being
compressed is not smaller than the original, the compression of the file is
disabled -- unless the filesystem is mounted with *compress-force*. In that
case compression will always be attempted on the file, only to be later
discarded. This is not optimal and subject to optimizations and further
development.
|
|
|
|
|
|
|
|
|
|
If a file is identified as incompressible, a flag is set (NOCOMPRESS) and it's
sticky. On that file compression won't be performed unless forced. The flag
can also be set by **chattr +m** (since e2fsprogs 1.46.2) or by properties with
value *no* or *none*. An empty value will reset it to the default that's
currently applicable on the mounted filesystem.
|
|
|
|
|
|
|
|
|
|
There are two ways to detect incompressible data:
|
|
|
|
|
|
|
|
|
|
* actual compression attempt - data are compressed, if the result is not smaller,
  it's discarded, so this depends on the algorithm and level
* pre-compression heuristics - a quick statistical evaluation on the data is
  performed and based on the result either compression is performed or skipped,
  the NOCOMPRESS bit is not set just by the heuristic, only if the compression
  algorithm does not make an improvement
|
|
|
|
|
|
|
|
|
|
PRE-COMPRESSION HEURISTICS
|
|
|
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
|
|
|
|
|
|
The heuristics aim to do a few quick statistical tests on the data to be
compressed in order to avoid a potentially costly compression that would turn
out to be inefficient. Compression algorithms could have internal detection of
incompressible data too, but this leads to more overhead as the compression is
done in another thread and has to write the data anyway. The heuristic is
read-only and can utilize cached memory.
|
|
|
|
|
|
|
|
|
|
The tests performed are based on the following: data sampling, long repeated
pattern detection, byte frequency, Shannon entropy.
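
To illustrate the last of these tests, the sketch below computes the Shannon
entropy of a byte sample in bits per byte. This is not the kernel
implementation, only the textbook formula. The function names and the
threshold of 7.0 bits per byte are assumptions chosen for the example.

.. code-block:: python

        import math
        from collections import Counter


        def shannon_entropy(sample: bytes) -> float:
            """Shannon entropy of sample in bits per byte (0.0 to 8.0)."""
            if not sample:
                return 0.0
            total = len(sample)
            entropy = 0.0
            for count in Counter(sample).values():
                p = count / total
                entropy -= p * math.log2(p)
            return entropy


        def looks_compressible(sample: bytes, threshold_bits: float = 7.0) -> bool:
            """Guess compressibility; threshold_bits is a hypothetical cut-off."""
            return shannon_entropy(sample) < threshold_bits


        if __name__ == "__main__":
            print(shannon_entropy(b"a" * 4096))        # a single repeated byte
            print(shannon_entropy(bytes(range(256))))  # every byte value once

A sample of one repeated byte has zero entropy and compresses very well, while
data where every byte value is equally frequent reaches the maximum of 8 bits
per byte, which is typical for already compressed or encrypted content.
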
|
|
|
|
|
|
|
|
|
|
COMPATIBILITY WITH OTHER FEATURES
|
|
|
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
|
|
|
|
|
|
Compression is done using the COW mechanism so it's incompatible with
*nodatacow*. Direct IO works on compressed files but will fall back to buffered
writes. Currently *nodatasum* and compression don't work together.
|
|
|
|
|
|
|
|
|
|
.. include:: ch-compression.rst
|
|
|
|
|
|
|
|
|
|
FILESYSTEM EXCLUSIVE OPERATIONS
|
|
|
|
|
-------------------------------
|
|
|
|
@@ -1249,83 +1093,7 @@ that report space usage: **filesystem df**, **device usage**. The command
|
|
|
|
|
SEEDING DEVICE
|
|
|
|
|
--------------
|
|
|
|
|
|
|
|
|
|
The COW mechanism and multiple devices under one hood enable an interesting
concept, called a seeding device: extending a read-only filesystem on a single
device with another device that captures all writes. For example, imagine an
immutable golden image of an operating system enhanced with another device
that allows the data from the golden image to be used during normal operation.
This idea originated with CD-ROMs carrying a base OS that could be used for
live systems, but this became obsolete. There are technologies providing
similar functionality, like *unionmount*, *overlayfs* or *qcow2* image
snapshots.
|
|
|
|
|
|
|
|
|
|
The seeding device starts as a normal filesystem. Once the contents are ready,
**btrfstune -S 1** is used to flag it as a seeding device. Mounting such a
device will not allow any writes, except adding a new device by
**btrfs device add**. Then the filesystem can be remounted as read-write.
|
|
|
|
|
|
|
|
|
|
Given that the filesystem on the seeding device is always recognized as
read-only, it can be used to seed multiple filesystems at the same time. The
UUID that is normally attached to a device is automatically changed to a random
UUID on each mount.
|
|
|
|
|
|
|
|
|
|
Once the seeding device is mounted, it needs the writable device. After adding
it, something like **mount -o remount,rw /path** makes the filesystem at
*/path* ready for use. The simplest use case is to throw away all changes by
unmounting the filesystem when convenient.
|
|
|
|
|
|
|
|
|
|
Alternatively, deleting the seeding device from the filesystem can turn it into
|
|
|
|
|
a normal filesystem, provided that the writable device can also contain all the
|
|
|
|
|
data from the seeding device.
|
|
|
|
|
|
|
|
|
|
The seeding device flag can be cleared again by **btrfstune -f -S 0**, e.g.
to allow updating it with newer data, but please note that this will invalidate
all existing filesystems that use this particular seeding device. This works
for some use cases, not for others, and a forcing flag to the command is
mandatory to avoid accidental mistakes.
|
|
|
|
|
|
|
|
|
|
Example of how to create and use a seeding device:
|
|
|
|
|
|
|
|
|
|
.. code-block:: bash

        # mkfs.btrfs /dev/sda
        # mount /dev/sda /mnt/mnt1
        # ... fill mnt1 with data
        # umount /mnt/mnt1
        # btrfstune -S 1 /dev/sda
        # mount /dev/sda /mnt/mnt1
        # btrfs device add /dev/sdb /mnt/mnt1
        # mount -o remount,rw /mnt/mnt1
        # ... /mnt/mnt1 is now writable
|
|
|
|
|
|
|
|
|
|
Now */mnt/mnt1* can be used normally. The device */dev/sda* can be mounted
again with another writable device:
|
|
|
|
|
|
|
|
|
|
.. code-block:: bash

        # mount /dev/sda /mnt/mnt2
        # btrfs device add /dev/sdc /mnt/mnt2
        # mount -o remount,rw /mnt/mnt2
        # ... /mnt/mnt2 is now writable
|
|
|
|
|
|
|
|
|
|
The writable device (*/dev/sdb*) can be decoupled from the seeding device and
|
|
|
|
|
used independently:
|
|
|
|
|
|
|
|
|
|
.. code-block:: bash

        # btrfs device delete /dev/sda /mnt/mnt1
|
|
|
|
|
|
|
|
|
|
As the contents originated in the seeding device, it's possible to turn
*/dev/sdb* into a seeding device again and repeat the whole process.
|
|
|
|
|
|
|
|
|
|
A few things to note:
|
|
|
|
|
|
|
|
|
|
* it's recommended to use only a single device for the seeding device; it works
  for multiple devices but the *single* profile must be used in order to make
  the seeding device deletion work
* block group profiles *single* and *dup* support the use cases above
* the label is copied from the seeding device and can be changed by **btrfs filesystem label**
* each new mount of the seeding device gets a new random UUID
|
|
|
|
|
|
|
|
|
|
.. include:: ch-seeding-device.rst
|
|
|
|
|
|
|
|
|
|
RAID56 STATUS AND RECOMMENDED PRACTICES
|
|
|
|
|
---------------------------------------
|
|
|
|
|