From c6be84840fa740433bebb5ddebb044c4d9a07c8c Mon Sep 17 00:00:00 2001
From: David Sterba
Date: Thu, 9 Dec 2021 20:46:42 +0100
Subject: [PATCH] btrfs-progs: docs: add more chapters (part 2)

The feature pages share their contents with the manual page section 5,
so put the contents into separate files.

Progress: 2/3.

Signed-off-by: David Sterba
---
 Documentation/Auto-repair.rst       |   6 +-
 Documentation/Convert.rst           |   2 +-
 Documentation/Deduplication.rst     |  42 +++++-
 Documentation/Defragmentation.rst   |  20 ++-
 Documentation/Flexibility.rst       |  16 ++-
 Documentation/Qgroups.rst           |   2 +-
 Documentation/Reflink.rst           |  27 +++-
 Documentation/Resize.rst            |  10 +-
 Documentation/Scrub.rst             |   2 +-
 Documentation/Send-receive.rst      |  25 +++-
 Documentation/btrfs-convert.rst     |  97 +-------------
 Documentation/btrfs-quota.rst       | 197 +--------------------------
 Documentation/btrfs-scrub.rst       |  28 +---
 Documentation/ch-checksumming.rst   |  76 +++++++++++
 Documentation/ch-compression.rst    | 153 +++++++++++++++++++++
 Documentation/ch-convert-intro.rst  |  97 ++++++++++++++
 Documentation/ch-quota-intro.rst    | 198 ++++++++++++++++++++++++++++
 Documentation/ch-scrub-intro.rst    |  28 ++++
 Documentation/ch-seeding-device.rst |  78 +++++++++++
 19 files changed, 772 insertions(+), 332 deletions(-)
 create mode 100644 Documentation/ch-checksumming.rst
 create mode 100644 Documentation/ch-compression.rst
 create mode 100644 Documentation/ch-convert-intro.rst
 create mode 100644 Documentation/ch-quota-intro.rst
 create mode 100644 Documentation/ch-scrub-intro.rst
 create mode 100644 Documentation/ch-seeding-device.rst

diff --git a/Documentation/Auto-repair.rst b/Documentation/Auto-repair.rst
index 1d6c60b7..31760d09 100644
--- a/Documentation/Auto-repair.rst
+++ b/Documentation/Auto-repair.rst
@@ -1,4 +1,8 @@
 Auto-repair on read
 ===================
 
-...
+Data or metadata that are found to be damaged (eg. because the checksum does
+not match) at the time they're read from the device can be salvaged if the
+filesystem has another valid copy, ie. when using a block group profile with
+redundancy (DUP, RAID1, RAID5/6). The correct data are returned to the user
+application and the damaged copy is replaced by it.
diff --git a/Documentation/Convert.rst b/Documentation/Convert.rst
index c1f85959..0c13cc8a 100644
--- a/Documentation/Convert.rst
+++ b/Documentation/Convert.rst
@@ -1,4 +1,4 @@
 Convert
 =======
 
-...
+.. include:: ch-convert-intro.rst
diff --git a/Documentation/Deduplication.rst b/Documentation/Deduplication.rst
index 9f491a91..0a3abeed 100644
--- a/Documentation/Deduplication.rst
+++ b/Documentation/Deduplication.rst
@@ -1,4 +1,44 @@
 Deduplication
 =============
 
-...
+In the context of filesystems, deduplication is a process of looking up
+identical data blocks that are tracked separately and creating a shared
+logical link while removing one of the copies of the data blocks. This leads
+to data space savings while it increases metadata consumption.
+
+There are two main deduplication types:
+
+* **in-band** *(sometimes also called on-line)* -- all newly written data are
+  considered for deduplication before writing
+* **out-of-band** *(sometimes also called offline)* -- data for deduplication
+  have to be actively looked for and deduplicated by the user application
+
+Both have their pros and cons. BTRFS implements only the **out-of-band** type.
+
+BTRFS provides the basic building blocks for deduplication, allowing other
+tools to choose the strategy and scope of the deduplication.
+There are multiple tools that take different approaches to deduplication,
+offer additional features or make different trade-offs. The following table
+lists tools that are known to be up-to-date, maintained and widely used.
+
+.. list-table::
+   :header-rows: 1
+
+   * - Name
+     - File based
+     - Block based
+     - Incremental
+   * - `BEES `_
+     - No
+     - Yes
+     - Yes
+   * - `duperemove `_
+     - Yes
+     - No
+     - Yes
+
+Legend:
+
+- *File based*: the tool takes a list of files and deduplicates blocks only from that set
+- *Block based*: the tool enumerates blocks and looks for duplicates
+- *Incremental*: repeated runs of the tool utilize information gathered from previous runs
diff --git a/Documentation/Defragmentation.rst b/Documentation/Defragmentation.rst
index 89f4fc1f..87bed47d 100644
--- a/Documentation/Defragmentation.rst
+++ b/Documentation/Defragmentation.rst
@@ -1,4 +1,22 @@
 Defragmentation
 ===============
 
-...
+Defragmentation of files is supposed to make the layout of the file extents
+more linear, or at least to coalesce the file extents into larger ones that
+can be stored on the device more efficiently. The need for defragmentation
+stems from the COW design that BTRFS is built on and is inherent to it.
+Fragmentation is caused by in-place rewrites of the same file data, which have
+to be handled by creating a new copy that may lie at a distant location on the
+physical device. Fragmentation is the worst problem on rotational hard disks
+due to the delay caused by moving the drive heads to the distant location.
+With modern seek-less devices it's not a problem, though defragmentation may
+still make sense because it reduces the size of the metadata that's needed to
+track the scattered extents.
+
+File data that are in use can be safely defragmented because the whole process
+happens inside the page cache, which is the central point caching the file data
+and takes care of synchronization. Once a filesystem sync or flush is started
+(either manually or automatically) all the dirty data get written to the
+devices. This however reduces the chances to find an optimal layout as the
+writes happen together with other data and the result depends on the remaining
+free space layout and fragmentation.
diff --git a/Documentation/Flexibility.rst b/Documentation/Flexibility.rst
index e0c00e63..09cef47e 100644
--- a/Documentation/Flexibility.rst
+++ b/Documentation/Flexibility.rst
@@ -1,6 +1,18 @@
 Flexibility
 ===========
 
-* dynamic inode creation (no preallocated space)
+The underlying design of the BTRFS data structures allows a lot of flexibility,
+so changes can be made after filesystem creation, like resizing, adding/removing
+space or enabling some features on-the-fly.
 
-* block group profile change on-the-fly
+* **dynamic inode creation** -- there's no fixed space or tables for tracking
+  inodes so the number of inodes that can be created is bounded only by the
+  metadata space and its utilization
+
+* **block group profile change on-the-fly** -- the block group profiles can be
+  changed on a mounted filesystem by running the balance operation and
+  specifying the conversion filters
+
+* **resize** -- the space occupied by the filesystem on each device can be
+  resized up (grow) or down (shrink) as long as the amount of data can still
+  be contained on the device
diff --git a/Documentation/Qgroups.rst b/Documentation/Qgroups.rst
index 3f9cb701..dde68744 100644
--- a/Documentation/Qgroups.rst
+++ b/Documentation/Qgroups.rst
@@ -1,4 +1,4 @@
 Quota groups
 ============
 
-...
+.. include:: ch-quota-intro.rst
diff --git a/Documentation/Reflink.rst b/Documentation/Reflink.rst
index 00efe09b..98c1e232 100644
--- a/Documentation/Reflink.rst
+++ b/Documentation/Reflink.rst
@@ -1,4 +1,29 @@
 Reflink
 =======
 
-...
+Reflink is a type of shallow copy of file data that shares the blocks but
+otherwise leaves the files independent, so any change to one of them will not
+affect the other. This builds on the underlying COW mechanism. A reflink
+effectively creates only separate metadata pointing to the shared blocks, which
+is typically much faster than a deep copy of all blocks.
+
+A reflink is typically made for whole files, but a partial file range can also
+be copied, though there are no ready-made tools for that.
+
+.. code-block:: shell
+
+   cp --reflink=always source target
+
+There are some constraints:
+
+- cross-filesystem reflink is not possible, the filesystems have nothing in
+  common so the block sharing can't work
+- reflink crossing two mount points of the same filesystem does not work due
+  to an artificial limitation in VFS (this may change in the future)
+- reflink requires that the source and target file have the same status
+  regarding NOCOW and checksums, for example if the source file is NOCOW (once
+  created with the chattr +C attribute) then the above command won't work
+  unless the target file is pre-created with the +C attribute as well, or the
+  NOCOW attribute is inherited from the parent directory (chattr +C on the
+  directory), or the whole filesystem is mounted with *-o nodatacow*, which
+  would create the NOCOW files by default
diff --git a/Documentation/Resize.rst b/Documentation/Resize.rst
index 0ffdf672..5efca120 100644
--- a/Documentation/Resize.rst
+++ b/Documentation/Resize.rst
@@ -1,4 +1,12 @@
 Resize
 ======
 
-...
+A mounted BTRFS filesystem can be resized after creation, grown or shrunk. On a
+multi-device filesystem the space occupied on each device can be resized
+independently. Data that reside in the area that would be beyond the new size
+are relocated to the remaining space below the limit, so this constrains the
+minimum size to which a filesystem can be shrunk.
+
+Growing a filesystem is quick as it only needs to take note of the available
+space, while shrinking a filesystem needs to relocate potentially lots of data
+and this is IO intensive. It is possible to shrink a filesystem in smaller steps.
diff --git a/Documentation/Scrub.rst b/Documentation/Scrub.rst
index 35199289..8b076e76 100644
--- a/Documentation/Scrub.rst
+++ b/Documentation/Scrub.rst
@@ -1,4 +1,4 @@
 Scrub
 =====
 
-...
+.. include:: ch-scrub-intro.rst
diff --git a/Documentation/Send-receive.rst b/Documentation/Send-receive.rst
index 29e0b4df..a965ff6a 100644
--- a/Documentation/Send-receive.rst
+++ b/Documentation/Send-receive.rst
@@ -1,4 +1,23 @@
-Balance
-=======
+Send/receive
+============
 
-...
+Send and receive are complementary features that allow transferring data from
+one filesystem to another in a streamable format. The send part traverses a
+given read-only subvolume and either creates a full stream representation of
+its data and metadata (*full mode*), or, given a set of subvolumes for
+reference, generates a difference relative to that set (*incremental mode*).
+
+Receive on the other hand takes the stream and reconstructs a subvolume with
+files and directories equivalent to those on the filesystem that was used to
+produce the stream. The result is not exactly 1:1, eg. inode numbers can be
+different and other unique identifiers can also differ (like the subvolume UUIDs).
The full +mode starts with an empty subvolume, creates all the files and then turns the +subvolume to read-only. At this point it could be used as a starting point for a +future incremental send stream, provided it would be generated from the same +source subvolume on the other filesystem. + +The stream is a sequence of encoded commands that change eg. file metadata +(owner, permissions, extended attributes), data extents (create, clone, +truncate), whole file operations (rename, delete). The stream can be sent over +network, piped directly to the receive command or saved to a file. Each command +in the stream is protected by a CRC32C checksum. diff --git a/Documentation/btrfs-convert.rst b/Documentation/btrfs-convert.rst index da61c290..4c7da323 100644 --- a/Documentation/btrfs-convert.rst +++ b/Documentation/btrfs-convert.rst @@ -9,102 +9,7 @@ SYNOPSIS DESCRIPTION ----------- -**btrfs-convert** is used to convert existing source filesystem image to a btrfs -filesystem in-place. The original filesystem image is accessible in subvolume -named like *ext2_saved* as file *image*. - -Supported filesystems: - -* ext2, ext3, ext4 -- original feature, always built in - -* reiserfs -- since version 4.13, optionally built, requires libreiserfscore 3.6.27 - -* ntfs -- external tool https://github.com/maharmstone/ntfs2btrfs - -The list of supported source filesystem by a given binary is listed at the end -of help (option *--help*). - -.. warning:: - If you are going to perform rollback to the original filesystem, you - should not execute **btrfs balance** command on the converted filesystem. This - will change the extent layout and make **btrfs-convert** unable to rollback. - -The conversion utilizes free space of the original filesystem. The exact -estimate of the required space cannot be foretold. The final btrfs metadata -might occupy several gigabytes on a hundreds-gigabyte filesystem. - -If the ability to rollback is no longer important, the it is recommended to -perform a few more steps to transition the btrfs filesystem to a more compact -layout. This is because the conversion inherits the original data blocks' -fragmentation, and also because the metadata blocks are bound to the original -free space layout. - -Due to different constraints, it is only possible to convert filesystems that -have a supported data block size (ie. the same that would be valid for -**mkfs.btrfs**). This is typically the system page size (4KiB on x86_64 -machines). - -**BEFORE YOU START** - -The source filesystem must be clean, eg. no journal to replay or no repairs -needed. The respective **fsck** utility must be run on the source filesytem prior -to conversion. Please refer to the manual pages in case you encounter problems. - -For ext2/3/4: - -.. code-block:: bash - - # e2fsck -fvy /dev/sdx - -For reiserfs: - -.. code-block:: bash - - # reiserfsck -fy /dev/sdx - -Skipping that step could lead to incorrect results on the target filesystem, -but it may work. - -**REMOVE THE ORIGINAL FILESYSTEM METADATA** - -By removing the subvolume named like *ext2_saved* or *reiserfs_saved*, all -metadata of the original filesystem will be removed: - -.. code-block:: bash - - # btrfs subvolume delete /mnt/ext2_saved - -At this point it is not possible to do a rollback. The filesystem is usable but -may be impacted by the fragmentation inherited from the original filesystem. - -**MAKE FILE DATA MORE CONTIGUOUS** - -An optional but recommended step is to run defragmentation on the entire -filesystem. 
This will attempt to make file extents more contiguous. - -.. code-block:: bash - - # btrfs filesystem defrag -v -r -f -t 32M /mnt/btrfs - -Verbose recursive defragmentation (*-v*, *-r*), flush data per-file (*-f*) with -target extent size 32MiB (*-t*). - -**ATTEMPT TO MAKE BTRFS METADATA MORE COMPACT** - -Optional but recommended step. - -The metadata block groups after conversion may be smaller than the default size -(256MiB or 1GiB). Running a balance will attempt to merge the block groups. -This depends on the free space layout (and fragmentation) and may fail due to -lack of enough work space. This is a soft error leaving the filesystem usable -but the block group layout may remain unchanged. - -Note that balance operation takes a lot of time, please see also -``btrfs-balance(8)``. - -.. code-block:: bash - - # btrfs balance start -m /mnt/btrfs +.. include:: ch-convert-intro.rst OPTIONS ------- diff --git a/Documentation/btrfs-quota.rst b/Documentation/btrfs-quota.rst index a81ad9b9..da26e754 100644 --- a/Documentation/btrfs-quota.rst +++ b/Documentation/btrfs-quota.rst @@ -36,202 +36,7 @@ gradually improving and issues found and fixed. HIERARCHICAL QUOTA GROUP CONCEPTS --------------------------------- -The concept of quota has a long-standing tradition in the Unix world. Ever -since computers allow multiple users to work simultaneously in one filesystem, -there is the need to prevent one user from using up the entire space. Every -user should get his fair share of the available resources. - -In case of files, the solution is quite straightforward. Each file has an -'owner' recorded along with it, and it has a size. Traditional quota just -restricts the total size of all files that are owned by a user. The concept is -quite flexible: if a user hits his quota limit, the administrator can raise it -on the fly. - -On the other hand, the traditional approach has only a poor solution to -restrict directories. -At installation time, the harddisk can be partitioned so that every directory -(eg. /usr, /var/, ...) that needs a limit gets its own partition. The obvious -problem is that those limits cannot be changed without a reinstallation. The -btrfs subvolume feature builds a bridge. Subvolumes correspond in many ways to -partitions, as every subvolume looks like its own filesystem. With subvolume -quota, it is now possible to restrict each subvolume like a partition, but keep -the flexibility of quota. The space for each subvolume can be expanded or -restricted on the fly. - -As subvolumes are the basis for snapshots, interesting questions arise as to -how to account used space in the presence of snapshots. If you have a file -shared between a subvolume and a snapshot, whom to account the file to? The -creator? Both? What if the file gets modified in the snapshot, should only -these changes be accounted to it? But wait, both the snapshot and the subvolume -belong to the same user home. I just want to limit the total space used by -both! But somebody else might not want to charge the snapshots to the users. - -Btrfs subvolume quota solves these problems by introducing groups of subvolumes -and let the user put limits on them. It is even possible to have groups of -groups. In the following, we refer to them as 'qgroups'. - -Each qgroup primarily tracks two numbers, the amount of total referenced -space and the amount of exclusively referenced space. 
- -referenced - space is the amount of data that can be reached from any of the - subvolumes contained in the qgroup, while -exclusive - is the amount of data where all references to this data can be reached - from within this qgroup. - -SUBVOLUME QUOTA GROUPS -^^^^^^^^^^^^^^^^^^^^^^ - -The basic notion of the Subvolume Quota feature is the quota group, short -qgroup. Qgroups are notated as 'level/id', eg. the qgroup 3/2 is a qgroup of -level 3. For level 0, the leading '0/' can be omitted. -Qgroups of level 0 get created automatically when a subvolume/snapshot gets -created. The ID of the qgroup corresponds to the ID of the subvolume, so 0/5 -is the qgroup for the root subvolume. -For the *btrfs qgroup* command, the path to the subvolume can also be used -instead of '0/ID'. For all higher levels, the ID can be chosen freely. - -Each qgroup can contain a set of lower level qgroups, thus creating a hierarchy -of qgroups. Figure 1 shows an example qgroup tree. - -.. code-block:: none - - +---+ - |2/1| - +---+ - / \ - +---+/ \+---+ - |1/1| |1/2| - +---+ +---+ - / \ / \ - +---+/ \+---+/ \+---+ - qgroups |0/1| |0/2| |0/3| - +-+-+ +---+ +---+ - | / \ / \ - | / \ / \ - | / \ / \ - extents 1 2 3 4 - - Figure1: Sample qgroup hierarchy - -At the bottom, some extents are depicted showing which qgroups reference which -extents. It is important to understand the notion of *referenced* vs -*exclusive*. In the example, qgroup 0/2 references extents 2 and 3, while 1/2 -references extents 2-4, 2/1 references all extents. - -On the other hand, extent 1 is exclusive to 0/1, extent 2 is exclusive to 0/2, -while extent 3 is neither exclusive to 0/2 nor to 0/3. But because both -references can be reached from 1/2, extent 3 is exclusive to 1/2. All extents -are exclusive to 2/1. - -So exclusive does not mean there is no other way to reach the extent, but it -does mean that if you delete all subvolumes contained in a qgroup, the extent -will get deleted. - -Exclusive of a qgroup conveys the useful information how much space will be -freed in case all subvolumes of the qgroup get deleted. - -All data extents are accounted this way. Metadata that belongs to a specific -subvolume (i.e. its filesystem tree) is also accounted. Checksums and extent -allocation information are not accounted. - -In turn, the referenced count of a qgroup can be limited. All writes beyond -this limit will lead to a 'Quota Exceeded' error. - -INHERITANCE -^^^^^^^^^^^ - -Things get a bit more complicated when new subvolumes or snapshots are created. -The case of (empty) subvolumes is still quite easy. If a subvolume should be -part of a qgroup, it has to be added to the qgroup at creation time. To add it -at a later time, it would be necessary to at least rescan the full subvolume -for a proper accounting. - -Creation of a snapshot is the hard case. Obviously, the snapshot will -reference the exact amount of space as its source, and both source and -destination now have an exclusive count of 0 (the filesystem nodesize to be -precise, as the roots of the trees are not shared). But what about qgroups of -higher levels? If the qgroup contains both the source and the destination, -nothing changes. If the qgroup contains only the source, it might lose some -exclusive. - -But how much? The tempting answer is, subtract all exclusive of the source from -the qgroup, but that is wrong, or at least not enough. There could have been -an extent that is referenced from the source and another subvolume from that -qgroup. 
This extent would have been exclusive to the qgroup, but not to the -source subvolume. With the creation of the snapshot, the qgroup would also -lose this extent from its exclusive set. - -So how can this problem be solved? In the instant the snapshot gets created, we -already have to know the correct exclusive count. We need to have a second -qgroup that contains all the subvolumes as the first qgroup, except the -subvolume we want to snapshot. The moment we create the snapshot, the -exclusive count from the second qgroup needs to be copied to the first qgroup, -as it represents the correct value. The second qgroup is called a tracking -qgroup. It is only there in case a snapshot is needed. - -USE CASES -^^^^^^^^^ - -Below are some usecases that do not mean to be extensive. You can find your -own way how to integrate qgroups. - -SINGLE-USER MACHINE -""""""""""""""""""" - -``Replacement for partitions`` - -The simplest use case is to use qgroups as simple replacement for partitions. -Btrfs takes the disk as a whole, and /, /usr, /var, etc. are created as -subvolumes. As each subvolume gets it own qgroup automatically, they can -simply be restricted. No hierarchy is needed for that. - -``Track usage of snapshots`` - -When a snapshot is taken, a qgroup for it will automatically be created with -the correct values. 'Referenced' will show how much is in it, possibly shared -with other subvolumes. 'Exclusive' will be the amount of space that gets freed -when the subvolume is deleted. - -MULTI-USER MACHINE -"""""""""""""""""" - -``Restricting homes`` - -When you have several users on a machine, with home directories probably under -/home, you might want to restrict /home as a whole, while restricting every -user to an individual limit as well. This is easily accomplished by creating a -qgroup for /home , eg. 1/1, and assigning all user subvolumes to it. -Restricting this qgroup will limit /home, while every user subvolume can get -its own (lower) limit. - -``Accounting snapshots to the user`` - -Let's say the user is allowed to create snapshots via some mechanism. It would -only be fair to account space used by the snapshots to the user. This does not -mean the user doubles his usage as soon as he takes a snapshot. Of course, -files that are present in his home and the snapshot should only be accounted -once. This can be accomplished by creating a qgroup for each user, say -'1/UID'. The user home and all snapshots are assigned to this qgroup. -Limiting it will extend the limit to all snapshots, counting files only once. -To limit /home as a whole, a higher level group 2/1 replacing 1/1 from the -previous example is needed, with all user qgroups assigned to it. - -``Do not account snapshots`` - -On the other hand, when the snapshots get created automatically, the user has -no chance to control them, so the space used by them should not be accounted to -him. This is already the case when creating snapshots in the example from -the previous section. - -``Snapshots for backup purposes`` - -This scenario is a mixture of the previous two. The user can create snapshots, -but some snapshots for backup purposes are being created by the system. The -user's snapshots should be accounted to the user, not the system. The solution -is similar to the one from section 'Accounting snapshots to the user', but do -not assign system snapshots to user's qgroup. +.. 
include:: ch-quota-intro.rst
 
 SUBCOMMAND
 ----------
diff --git a/Documentation/btrfs-scrub.rst b/Documentation/btrfs-scrub.rst
index 5f19365e..75079eec 100644
--- a/Documentation/btrfs-scrub.rst
+++ b/Documentation/btrfs-scrub.rst
@@ -9,33 +9,7 @@ SYNOPSIS
 DESCRIPTION
 -----------
 
-**btrfs scrub** is used to scrub a mounted btrfs filesystem, which will read all
-data and metadata blocks from all devices and verify checksums. Automatically
-repair corrupted blocks if there's a correct copy available.
-
-.. note::
-   Scrub is not a filesystem checker (fsck) and does not verify nor repair
-   structural damage in the filesystem. It really only checks checksums of data
-   and tree blocks, it doesn't ensure the content of tree blocks is valid and
-   consistent. There's some validation performed when metadata blocks are read
-   from disk but it's not extensive and cannot substitute full *btrfs check*
-   run.
-
-The user is supposed to run it manually or via a periodic system service. The
-recommended period is a month but could be less. The estimated device bandwidth
-utilization is about 80% on an idle filesystem. The IO priority class is by
-default *idle* so background scrub should not significantly interfere with
-normal filesystem operation. The IO scheduler set for the device(s) might not
-support the priority classes though.
-
-The scrubbing status is recorded in */var/lib/btrfs/* in textual files named
-*scrub.status.UUID* for a filesystem identified by the given UUID. (Progress
-state is communicated through a named pipe in file *scrub.progress.UUID* in the
-same directory.) The status file is updated every 5 seconds. A resumed scrub
-will continue from the last saved position.
-
-Scrub can be started only on a mounted filesystem, though it's possible to
-scrub only a selected device. See **scrub start** for more.
+.. include:: ch-scrub-intro.rst
 
 SUBCOMMAND
 ----------
diff --git a/Documentation/ch-checksumming.rst b/Documentation/ch-checksumming.rst
new file mode 100644
index 00000000..96cd27a4
--- /dev/null
+++ b/Documentation/ch-checksumming.rst
@@ -0,0 +1,76 @@
+Data and metadata are checksummed by default, the checksum is calculated before
+write and verified after reading the blocks. There are several checksum
+algorithms supported. The default and backward compatible one is *crc32c*. Since
+kernel 5.5 there are three more with different characteristics and trade-offs
+regarding speed and strength. The following list may help you to decide which
+one to select.
+
+CRC32C (32bit digest)
+   default, best backward compatibility, very fast, modern CPUs have
+   instruction-level support, not collision-resistant but still good error
+   detection capabilities
+
+XXHASH (64bit digest)
+   can be used as a CRC32C successor, very fast, optimized for modern CPUs
+   utilizing instruction pipelining, good collision resistance and error
+   detection
+
+SHA256 (256bit digest)
+   a cryptographic-strength hash, relatively slow but with possible CPU
+   instruction acceleration or specialized hardware cards, FIPS certified and
+   in wide use
+
+BLAKE2b (256bit digest)
+   a cryptographic-strength hash, relatively fast with possible CPU acceleration
+   using SIMD extensions, not standardized but based on BLAKE which was a SHA3
+   finalist, in wide use, the algorithm used is BLAKE2b-256 that's optimized for
+   64bit platforms
+
+The *digest size* affects the overall size of the data block checksums stored in
+the filesystem. The metadata blocks have a fixed area of up to 256 bits (32 bytes),
+so there's no increase.
Each data block has a separate checksum stored, with +additional overhead of the b-tree leaves. + +Approximate relative performance of the algorithms, measured against CRC32C +using reference software implementations on a 3.5GHz intel CPU: + + +======== ============ ======= ================ +Digest Cycles/4KiB Ratio Implementation +======== ============ ======= ================ +CRC32C 1700 1.00 CPU instruction +XXHASH 2500 1.44 reference impl. +SHA256 105000 61 reference impl. +SHA256 36000 21 libgcrypt/AVX2 +SHA256 63000 37 libsodium/AVX2 +BLAKE2b 22000 13 reference impl. +BLAKE2b 19000 11 libgcrypt/AVX2 +BLAKE2b 19000 11 libsodium/AVX2 +======== ============ ======= ================ + +Many kernels are configured with SHA256 as built-in and not as a module. +The accelerated versions are however provided by the modules and must be loaded +explicitly (**modprobe sha256**) before mounting the filesystem to make use of +them. You can check in */sys/fs/btrfs/FSID/checksum* which one is used. If you +see *sha256-generic*, then you may want to unmount and mount the filesystem +again, changing that on a mounted filesystem is not possible. +Check the file */proc/crypto*, when the implementation is built-in, you'd find + +.. code-block:: none + + name : sha256 + driver : sha256-generic + module : kernel + priority : 100 + ... + +while accelerated implementation is e.g. + +.. code-block:: none + + name : sha256 + driver : sha256-avx2 + module : sha256_ssse3 + priority : 170 + ... + + diff --git a/Documentation/ch-compression.rst b/Documentation/ch-compression.rst new file mode 100644 index 00000000..10c343e4 --- /dev/null +++ b/Documentation/ch-compression.rst @@ -0,0 +1,153 @@ +Btrfs supports transparent file compression. There are three algorithms +available: ZLIB, LZO and ZSTD (since v4.14), with various levels. +The compression happens on the level of file extents and the algorithm is +selected by file property, mount option or by a defrag command. +You can have a single btrfs mount point that has some files that are +uncompressed, some that are compressed with LZO, some with ZLIB, for instance +(though you may not want it that way, it is supported). + +Once the compression is set, all newly written data will be compressed, ie. +existing data are untouched. Data are split into smaller chunks (128KiB) before +compression to make random rewrites possible without a high performance hit. Due +to the increased number of extents the metadata consumption is higher. The +chunks are compressed in parallel. + +The algorithms can be characterized as follows regarding the speed/ratio +trade-offs: + +ZLIB + * slower, higher compression ratio + * levels: 1 to 9, mapped directly, default level is 3 + * good backward compatibility +LZO + * faster compression and decompression than zlib, worse compression ratio, designed to be fast + * no levels + * good backward compatibility +ZSTD + * compression comparable to zlib with higher compression/decompression speeds and different ratio + * levels: 1 to 15 + * since 4.14, levels since 5.1 + +The differences depend on the actual data set and cannot be expressed by a +single number or recommendation. Higher levels consume more CPU time and may +not bring a significant improvement, lower levels are close to real time. + +How to enable compression +------------------------- + +Typically the compression can be enabled on the whole filesystem, specified for +the mount point. 
Note that the compression mount options are shared among all
+mounts of the same filesystem, either bind mounts or subvolume mounts.
+Please refer to section *MOUNT OPTIONS*.
+
+.. code-block:: shell
+
+   $ mount -o compress=zstd /dev/sdx /mnt
+
+This will enable the ``zstd`` algorithm on the default level (which is 3).
+The level can be specified manually too, like ``zstd:3``. Higher levels compress
+better at the cost of time. This in turn may cause increased write latency; low
+levels are suitable for real-time compression and on a reasonably fast CPU don't
+cause performance drops.
+
+.. code-block:: shell
+
+   $ btrfs filesystem defrag -czstd file
+
+The command above will start defragmentation of the whole *file* and apply
+the compression, regardless of the mount option. (Note: specifying a level is
+not yet implemented.) The compression algorithm is not persistent and applies
+only to the defragmentation command, for any other writes other compression
+settings apply.
+
+Persistent settings on a per-file basis can be set in two ways:
+
+.. code-block:: shell
+
+   $ chattr +c file
+   $ btrfs property set file compression zstd
+
+The first command uses the legacy interface of file attributes inherited from
+the ext2 filesystem and is not flexible, so by default the *zlib* compression is
+set. The second command sets a property on the file with the given algorithm.
+(Note: setting the level that way is not yet implemented.)
+
+Compression levels
+------------------
+
+The level support of ZLIB has been added in v4.14, LZO does not support levels
+(the kernel implementation provides only one), ZSTD level support has been added
+in v5.1.
+
+There are 9 levels of ZLIB supported (1 to 9), mapping 1:1 from the mount option
+to the algorithm defined level. The default is level 3, which provides a
+reasonably good compression ratio and is still reasonably fast. The difference
+in compression gain of levels 7, 8 and 9 is comparable but the higher levels
+take longer.
+
+The ZSTD support includes levels 1 to 15, a subset of the full range of what
+ZSTD provides. Levels 1-3 are real-time, 4-8 slower with improved compression
+and 9-15 try even harder though the resulting size may not be significantly
+improved.
+
+Level 0 always maps to the default. The compression level does not affect
+compatibility.
+
+Incompressible data
+-------------------
+
+Files with already compressed data, or with data that won't compress well given
+the CPU and memory constraints of the kernel implementations, are handled using
+simple decision logic. If the first portion of data being compressed is not
+smaller than the original, the compression of the file is disabled -- unless the
+filesystem is mounted with *compress-force*. In that case compression will
+always be attempted on the file only to be later discarded. This is not optimal
+and subject to optimizations and further development.
+
+If a file is identified as incompressible, a flag is set (*NOCOMPRESS*) and it's
+sticky. On that file compression won't be performed unless forced. The flag
+can also be set by **chattr +m** (since e2fsprogs 1.46.2) or by properties with
+value *no* or *none*. An empty value will reset it to the default that's currently
+applicable on the mounted filesystem.
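+
+For illustration, the property interface mentioned above can also be used to
+inspect or reset the per-file setting (a brief sketch; *file* is a placeholder
+name):
+
+.. code-block:: shell
+
+   # show the per-file compression algorithm, empty output means none is set
+   $ btrfs property get file compression
+
+   # request that the file is not compressed, same effect as the *no*/*none* value
+   $ btrfs property set file compression none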
+
+There are two ways to detect incompressible data:
+
+* actual compression attempt - data are compressed, if the result is not smaller,
+  it's discarded, so this depends on the algorithm and level
+* pre-compression heuristics - a quick statistical evaluation of the data is
+  performed and based on the result either compression is performed or skipped,
+  the NOCOMPRESS bit is not set just by the heuristic, only if the compression
+  algorithm does not make an improvement
+
+.. code-block:: shell
+
+   $ lsattr file
+   ---------------------m file
+
+Forcing compression is not recommended, the heuristics are supposed to decide
+that and compression algorithms internally detect incompressible data too.
+
+Pre-compression heuristics
+--------------------------
+
+The heuristics aim to do a few quick statistical tests on the data to be
+compressed in order to avoid a probably costly compression that would turn out
+to be inefficient. Compression algorithms could have internal detection of
+incompressible data too but this leads to more overhead as the compression is
+done in another thread and has to write the data anyway. The heuristic is
+read-only and can utilize cached memory.
+
+The tests performed are based on the following: data sampling, long repeated
+pattern detection, byte frequency, Shannon entropy.
+
+Compatibility
+-------------
+
+Compression is done using the COW mechanism so it's incompatible with
+*nodatacow*. Direct IO works on compressed files but will fall back to buffered
+writes, which leads to recompression. Currently *nodatasum* and compression
+don't work together.
+
+The compression algorithms have been added over time so the version
+compatibility should also be considered, together with other tools that may
+access the compressed data like bootloaders.
diff --git a/Documentation/ch-convert-intro.rst b/Documentation/ch-convert-intro.rst
new file mode 100644
index 00000000..b3fdd162
--- /dev/null
+++ b/Documentation/ch-convert-intro.rst
@@ -0,0 +1,97 @@
+The **btrfs-convert** tool can be used to convert an existing source filesystem
+image to a btrfs filesystem in-place. The original filesystem image is
+accessible in a subvolume named like *ext2_saved* as file *image*.
+
+Supported filesystems:
+
+* ext2, ext3, ext4 -- original feature, always built in
+
+* reiserfs -- since version 4.13, optionally built, requires libreiserfscore 3.6.27
+
+* ntfs -- external tool https://github.com/maharmstone/ntfs2btrfs
+
+The source filesystems supported by a given binary are listed at the end
+of help (option *--help*).
+
+.. warning::
+   If you are going to perform rollback to the original filesystem, you
+   should not execute **btrfs balance** command on the converted filesystem. This
+   will change the extent layout and make **btrfs-convert** unable to rollback.
+
+The conversion utilizes free space of the original filesystem. The exact
+estimate of the required space cannot be foretold. The final btrfs metadata
+might occupy several gigabytes on a hundreds-gigabyte filesystem.
+
+If the ability to rollback is no longer important, then it is recommended to
+perform a few more steps to transition the btrfs filesystem to a more compact
+layout. This is because the conversion inherits the original data blocks'
+fragmentation, and also because the metadata blocks are bound to the original
+free space layout.
+
+Due to different constraints, it is only possible to convert filesystems that
+have a supported data block size (ie. the same that would be valid for
+**mkfs.btrfs**). This is typically the system page size (4KiB on x86_64
+machines).
+
+**BEFORE YOU START**
+
+The source filesystem must be clean, eg. no journal to replay or no repairs
+needed. The respective **fsck** utility must be run on the source filesystem
+prior to conversion. Please refer to the manual pages in case you encounter
+problems.
+
+For ext2/3/4:
+
+.. code-block:: bash
+
+   # e2fsck -fvy /dev/sdx
+
+For reiserfs:
+
+.. code-block:: bash
+
+   # reiserfsck -fy /dev/sdx
+
+Skipping that step could lead to incorrect results on the target filesystem,
+but it may work.
+
+**REMOVE THE ORIGINAL FILESYSTEM METADATA**
+
+By removing the subvolume named like *ext2_saved* or *reiserfs_saved*, all
+metadata of the original filesystem will be removed:
+
+.. code-block:: bash
+
+   # btrfs subvolume delete /mnt/ext2_saved
+
+At this point it is not possible to do a rollback. The filesystem is usable but
+may be impacted by the fragmentation inherited from the original filesystem.
+
+**MAKE FILE DATA MORE CONTIGUOUS**
+
+An optional but recommended step is to run defragmentation on the entire
+filesystem. This will attempt to make file extents more contiguous.
+
+.. code-block:: bash
+
+   # btrfs filesystem defrag -v -r -f -t 32M /mnt/btrfs
+
+Verbose recursive defragmentation (*-v*, *-r*), flush data per-file (*-f*) with
+target extent size 32MiB (*-t*).
+
+**ATTEMPT TO MAKE BTRFS METADATA MORE COMPACT**
+
+Optional but recommended step.
+
+The metadata block groups after conversion may be smaller than the default size
+(256MiB or 1GiB). Running a balance will attempt to merge the block groups.
+This depends on the free space layout (and fragmentation) and may fail due to
+lack of enough work space. This is a soft error leaving the filesystem usable
+but the block group layout may remain unchanged.
+
+Note that the balance operation takes a lot of time, please see also
+``btrfs-balance(8)``.
+
+.. code-block:: bash
+
+   # btrfs balance start -m /mnt/btrfs
+
diff --git a/Documentation/ch-quota-intro.rst b/Documentation/ch-quota-intro.rst
new file mode 100644
index 00000000..abd71606
--- /dev/null
+++ b/Documentation/ch-quota-intro.rst
@@ -0,0 +1,198 @@
+The concept of quota has a long-standing tradition in the Unix world. Ever
+since computers have allowed multiple users to work simultaneously in one
+filesystem, there has been the need to prevent one user from using up the
+entire space. Every user should get his fair share of the available resources.
+
+In case of files, the solution is quite straightforward. Each file has an
+*owner* recorded along with it, and it has a size. Traditional quota just
+restricts the total size of all files that are owned by a user. The concept is
+quite flexible: if a user hits his quota limit, the administrator can raise it
+on the fly.
+
+On the other hand, the traditional approach has only a poor solution to
+restrict directories.
+At installation time, the hard disk can be partitioned so that every directory
+(eg. /usr, /var, ...) that needs a limit gets its own partition. The obvious
+problem is that those limits cannot be changed without a reinstallation. The
+btrfs subvolume feature builds a bridge. Subvolumes correspond in many ways to
+partitions, as every subvolume looks like its own filesystem. With subvolume
+quota, it is now possible to restrict each subvolume like a partition, but keep
+the flexibility of quota. The space for each subvolume can be expanded or
+restricted on the fly.
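+
+As a brief illustration (the mount point and subvolume path below are only
+examples), enabling quota accounting and limiting one subvolume could look
+like this, see ``btrfs-qgroup(8)`` for the command details:
+
+.. code-block:: shell
+
+   # turn on quota accounting for the whole filesystem
+   $ btrfs quota enable /mnt
+
+   # limit the referenced space of one subvolume to 1GiB
+   $ btrfs qgroup limit 1G /mnt/home/alice
+
+   # show the current usage and limits
+   $ btrfs qgroup show -re /mnt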
+ +As subvolumes are the basis for snapshots, interesting questions arise as to +how to account used space in the presence of snapshots. If you have a file +shared between a subvolume and a snapshot, whom to account the file to? The +creator? Both? What if the file gets modified in the snapshot, should only +these changes be accounted to it? But wait, both the snapshot and the subvolume +belong to the same user home. I just want to limit the total space used by +both! But somebody else might not want to charge the snapshots to the users. + +Btrfs subvolume quota solves these problems by introducing groups of subvolumes +and let the user put limits on them. It is even possible to have groups of +groups. In the following, we refer to them as *qgroups*. + +Each qgroup primarily tracks two numbers, the amount of total referenced +space and the amount of exclusively referenced space. + +referenced + space is the amount of data that can be reached from any of the + subvolumes contained in the qgroup, while +exclusive + is the amount of data where all references to this data can be reached + from within this qgroup. + +SUBVOLUME QUOTA GROUPS +^^^^^^^^^^^^^^^^^^^^^^ + +The basic notion of the Subvolume Quota feature is the quota group, short +qgroup. Qgroups are notated as *level/id*, eg. the qgroup 3/2 is a qgroup of +level 3. For level 0, the leading '0/' can be omitted. +Qgroups of level 0 get created automatically when a subvolume/snapshot gets +created. The ID of the qgroup corresponds to the ID of the subvolume, so 0/5 +is the qgroup for the root subvolume. +For the ``btrfs qgroup`` command, the path to the subvolume can also be used +instead of *0/ID*. For all higher levels, the ID can be chosen freely. + +Each qgroup can contain a set of lower level qgroups, thus creating a hierarchy +of qgroups. Figure 1 shows an example qgroup tree. + +.. code-block:: none + + +---+ + |2/1| + +---+ + / \ + +---+/ \+---+ + |1/1| |1/2| + +---+ +---+ + / \ / \ + +---+/ \+---+/ \+---+ + qgroups |0/1| |0/2| |0/3| + +-+-+ +---+ +---+ + | / \ / \ + | / \ / \ + | / \ / \ + extents 1 2 3 4 + + Figure1: Sample qgroup hierarchy + +At the bottom, some extents are depicted showing which qgroups reference which +extents. It is important to understand the notion of *referenced* vs +*exclusive*. In the example, qgroup 0/2 references extents 2 and 3, while 1/2 +references extents 2-4, 2/1 references all extents. + +On the other hand, extent 1 is exclusive to 0/1, extent 2 is exclusive to 0/2, +while extent 3 is neither exclusive to 0/2 nor to 0/3. But because both +references can be reached from 1/2, extent 3 is exclusive to 1/2. All extents +are exclusive to 2/1. + +So exclusive does not mean there is no other way to reach the extent, but it +does mean that if you delete all subvolumes contained in a qgroup, the extent +will get deleted. + +Exclusive of a qgroup conveys the useful information how much space will be +freed in case all subvolumes of the qgroup get deleted. + +All data extents are accounted this way. Metadata that belongs to a specific +subvolume (i.e. its filesystem tree) is also accounted. Checksums and extent +allocation information are not accounted. + +In turn, the referenced count of a qgroup can be limited. All writes beyond +this limit will lead to a 'Quota Exceeded' error. + +INHERITANCE +^^^^^^^^^^^ + +Things get a bit more complicated when new subvolumes or snapshots are created. +The case of (empty) subvolumes is still quite easy. 
If a subvolume should be part of a qgroup, it has to be added to the qgroup
+at creation time. To add it at a later time, it would be necessary to at least
+rescan the full subvolume for a proper accounting.
+
+Creation of a snapshot is the hard case. Obviously, the snapshot will
+reference the exact amount of space as its source, and both source and
+destination now have an exclusive count of 0 (the filesystem nodesize to be
+precise, as the roots of the trees are not shared). But what about qgroups of
+higher levels? If the qgroup contains both the source and the destination,
+nothing changes. If the qgroup contains only the source, it might lose some
+exclusive.
+
+But how much? The tempting answer is, subtract all exclusive of the source from
+the qgroup, but that is wrong, or at least not enough. There could have been
+an extent that is referenced from the source and another subvolume from that
+qgroup. This extent would have been exclusive to the qgroup, but not to the
+source subvolume. With the creation of the snapshot, the qgroup would also
+lose this extent from its exclusive set.
+
+So how can this problem be solved? In the instant the snapshot gets created, we
+already have to know the correct exclusive count. We need to have a second
+qgroup that contains all the subvolumes as the first qgroup, except the
+subvolume we want to snapshot. The moment we create the snapshot, the
+exclusive count from the second qgroup needs to be copied to the first qgroup,
+as it represents the correct value. The second qgroup is called a tracking
+qgroup. It is only there in case a snapshot is needed.
+
+USE CASES
+^^^^^^^^^
+
+Below are some usecases that are not meant to be exhaustive. You can find your
+own way to integrate qgroups.
+
+SINGLE-USER MACHINE
+"""""""""""""""""""
+
+``Replacement for partitions``
+
+The simplest use case is to use qgroups as a simple replacement for partitions.
+Btrfs takes the disk as a whole, and /, /usr, /var, etc. are created as
+subvolumes. As each subvolume gets its own qgroup automatically, they can
+simply be restricted. No hierarchy is needed for that.
+
+``Track usage of snapshots``
+
+When a snapshot is taken, a qgroup for it will automatically be created with
+the correct values. 'Referenced' will show how much is in it, possibly shared
+with other subvolumes. 'Exclusive' will be the amount of space that gets freed
+when the subvolume is deleted.
+
+MULTI-USER MACHINE
+""""""""""""""""""
+
+``Restricting homes``
+
+When you have several users on a machine, with home directories probably under
+/home, you might want to restrict /home as a whole, while restricting every
+user to an individual limit as well. This is easily accomplished by creating a
+qgroup for /home, eg. 1/1, and assigning all user subvolumes to it.
+Restricting this qgroup will limit /home, while every user subvolume can get
+its own (lower) limit.
+
+``Accounting snapshots to the user``
+
+Let's say the user is allowed to create snapshots via some mechanism. It would
+only be fair to account space used by the snapshots to the user. This does not
+mean the user doubles his usage as soon as he takes a snapshot. Of course,
+files that are present in his home and the snapshot should only be accounted
+once. This can be accomplished by creating a qgroup for each user, say
+'1/UID'. The user home and all snapshots are assigned to this qgroup.
+Limiting it will extend the limit to all snapshots, counting files only once.
+To limit /home as a whole, a higher level group 2/1 replacing 1/1 from the
+previous example is needed, with all user qgroups assigned to it.
+
+``Do not account snapshots``
+
+On the other hand, when the snapshots get created automatically, the user has
+no chance to control them, so the space used by them should not be accounted to
+him. This is already the case when creating snapshots in the example from
+the previous section.
+
+``Snapshots for backup purposes``
+
+This scenario is a mixture of the previous two. The user can create snapshots,
+but some snapshots for backup purposes are being created by the system. The
+user's snapshots should be accounted to the user, not the system. The solution
+is similar to the one from section 'Accounting snapshots to the user', but do
+not assign system snapshots to the user's qgroup.
+
diff --git a/Documentation/ch-scrub-intro.rst b/Documentation/ch-scrub-intro.rst
new file mode 100644
index 00000000..796d0a24
--- /dev/null
+++ b/Documentation/ch-scrub-intro.rst
@@ -0,0 +1,28 @@
+Scrub is a pass over all filesystem data and metadata that verifies the
+checksums. If a valid copy is available (replicated block group profiles) then
+the damaged one is repaired. All copies of the replicated profiles are validated.
+
+.. note::
+   Scrub is not a filesystem checker (fsck) and does not verify nor repair
+   structural damage in the filesystem. It really only checks checksums of data
+   and tree blocks, it doesn't ensure the content of tree blocks is valid and
+   consistent. There's some validation performed when metadata blocks are read
+   from disk but it's not extensive and cannot substitute a full *btrfs check*
+   run.
+
+The user is supposed to run it manually or via a periodic system service. The
+recommended period is a month but it could be less. The estimated device
+bandwidth utilization is about 80% on an idle filesystem. The IO priority class
+is by default *idle* so background scrub should not significantly interfere with
+normal filesystem operation. The IO scheduler set for the device(s) might not
+support the priority classes though.
+
+The scrubbing status is recorded in */var/lib/btrfs/* in textual files named
+*scrub.status.UUID* for a filesystem identified by the given UUID. (Progress
+state is communicated through a named pipe in file *scrub.progress.UUID* in the
+same directory.) The status file is updated every 5 seconds. A resumed scrub
+will continue from the last saved position.
+
+Scrub can be started only on a mounted filesystem, though it's possible to
+scrub only a selected device. See **btrfs scrub start** for more.
diff --git a/Documentation/ch-seeding-device.rst b/Documentation/ch-seeding-device.rst
new file mode 100644
index 00000000..93136c2f
--- /dev/null
+++ b/Documentation/ch-seeding-device.rst
@@ -0,0 +1,78 @@
+The COW mechanism and multiple devices under one hood enable an interesting
+concept, called a seeding device: extending a read-only filesystem on a single
+device with another device that captures all writes. For example, imagine an
+immutable golden image of an operating system enhanced with another device
+that allows using the data from the golden image together with normal writable
+operation. This idea originated on CD-ROMs carrying a base OS, allowing them
+to be used for live systems, but that has since become obsolete. There are
+technologies providing similar functionality, like *unionmount*, *overlayfs*
+or *qcow2* image snapshots.
+
+The seeding device starts as a normal filesystem, once the contents are ready,
+**btrfstune -S 1** is used to flag it as a seeding device. Mounting such a
+device will not allow any writes, except adding a new device by **btrfs device
+add**. Then the filesystem can be remounted as read-write.
+
+Given that the filesystem on the seeding device is always recognized as
+read-only, it can be used to seed multiple filesystems at the same time. The
+UUID that is normally attached to a device is automatically changed to a random
+UUID on each mount.
+
+Once the seeding device is mounted, it needs the writable device. After adding
+it, something like **mount -o remount,rw /path** makes the filesystem at
+*/path* ready for use. The simplest usecase is to throw away all changes by
+unmounting the filesystem when convenient.
+
+Alternatively, deleting the seeding device from the filesystem can turn it into
+a normal filesystem, provided that the writable device can also contain all the
+data from the seeding device.
+
+The seeding device flag can be cleared again by **btrfstune -f -S 0**, eg.
+allowing the seeding device to be updated with newer data, but please note that
+this will invalidate all existing filesystems that use this particular seeding
+device. This works for some usecases, not for others, and the forcing flag to
+the command is mandatory to avoid accidental mistakes.
+
+An example of how to create and use a seeding device:
+
+.. code-block:: bash

+   # mkfs.btrfs /dev/sda
+   # mount /dev/sda /mnt/mnt1
+   # ... fill mnt1 with data
+   # umount /mnt/mnt1
+   # btrfstune -S 1 /dev/sda
+   # mount /dev/sda /mnt/mnt1
+   # btrfs device add /dev/sdb /mnt/mnt1
+   # mount -o remount,rw /mnt/mnt1
+   # ... /mnt/mnt1 is now writable
+
+Now */mnt/mnt1* can be used normally. The device */dev/sda* can be mounted
+again with another writable device:
+
+.. code-block:: bash
+
+   # mount /dev/sda /mnt/mnt2
+   # btrfs device add /dev/sdc /mnt/mnt2
+   # mount -o remount,rw /mnt/mnt2
+   # ... /mnt/mnt2 is now writable
+
+The writable device (*/dev/sdb*) can be decoupled from the seeding device and
+used independently:
+
+.. code-block:: bash
+
+   # btrfs device delete /dev/sda /mnt/mnt1
+
+As the contents originated in the seeding device, it's possible to turn
+*/dev/sdb* into a seeding device again and repeat the whole process.
+
+A few things to note:
+
+* it's recommended to use only a single device for the seeding device, it works
+  for multiple devices but the *single* profile must be used in order to make
+  the seeding device deletion work
+* block group profiles *single* and *dup* support the usecases above
+* the label is copied from the seeding device and can be changed by **btrfs filesystem label**
+* each new mount of the seeding device gets a new random UUID
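+
+For example, the label of the new writable filesystem can be changed once it
+has been decoupled (the mount point and label below are placeholders):
+
+.. code-block:: bash
+
+   # btrfs filesystem label /mnt/mnt1 writable-copy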