btrfs-progs: docs: add more chapters (part 2)

The feature pages share the contents with the manual page section 5 so
put the contents to separate files. Progress: 2/3.

Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba 2021-12-09 20:46:42 +01:00
parent b871bf49f3
commit c6be84840f
19 changed files with 772 additions and 332 deletions


@ -1,4 +1,8 @@
Auto-repair on read
===================
...
Data or metadata that are found to be damaged at the time they're read from the
device (eg. because the checksum does not match) can be salvaged if the
filesystem has another valid copy, ie. when a block group profile with
redundancy is used (DUP, RAID1, RAID5/6). The correct data are returned to the
user application and the damaged copy is replaced by the correct one.
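One way to observe that such repairs happen is to check the per-device error
counters. The counter names below follow the **btrfs device stats** output, the
device and mount point are examples only and the exact format may differ
between versions:

.. code-block:: shell

   # btrfs device stats /mnt
   [/dev/sda].write_io_errs    0
   [/dev/sda].read_io_errs     0
   [/dev/sda].flush_io_errs    0
   [/dev/sda].corruption_errs  0
   [/dev/sda].generation_errs  0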


@ -1,4 +1,4 @@
Convert
=======
...
.. include:: ch-convert-intro.rst


@ -1,4 +1,44 @@
Deduplication
=============
...
Going by the definition in the context of filesystems, it's a process of
looking up identical data blocks tracked separately and creating a shared
logical link while removing one of the copies of the data blocks. This leads to
data space savings while it increases metadata consumption.
There are two main deduplication types:

* **in-band** *(sometimes also called on-line)* -- all newly written data are
  considered for deduplication before writing
* **out-of-band** *(sometimes also called offline)* -- data for deduplication
  have to be actively looked for and deduplicated by the user application

Both have their pros and cons. BTRFS implements **only the out-of-band** type.
BTRFS provides the basic building blocks for deduplication allowing other tools
to choose the strategy and scope of the deduplication. There are multiple
tools that take different approaches to deduplication, offer additional
features or make trade-offs. The following table lists tools that are known to
be up-to-date, maintained and widely used.
.. list-table::
   :header-rows: 1

   * - Name
     - File based
     - Block based
     - Incremental
   * - `BEES <https://github.com/Zygo/bees>`_
     - No
     - Yes
     - Yes
   * - `duperemove <https://github.com/markfasheh/duperemove>`_
     - Yes
     - No
     - Yes
Legend:

- *File based*: the tool takes a list of files and deduplicates blocks only from that set
- *Block based*: the tool enumerates blocks and looks for duplicates
- *Incremental*: repeated runs of the tool utilize information gathered from previous runs
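As an illustration only (check the tool's documentation for the exact options),
a recursive out-of-band deduplication run with *duperemove* over a directory
could look like this:

.. code-block:: shell

   # duperemove -dr /path/to/dir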


@ -1,4 +1,22 @@
Defragmentation
===============
...
Defragmentation of files is supposed to make the layout of the file extents
more linear, or at least to coalesce the file extents into larger ones that can
be stored on the device more efficiently. The need for defragmentation stems
from the COW design that BTRFS is built on and is inherent. The fragmentation
is caused by rewrites of the same file data in-place, which have to be handled
by creating a new copy that may lie at a distant location on the physical
device. Fragmentation is the worst problem on rotational hard disks due to the
delay caused by moving the drive heads to the distant location. With modern
seek-less devices it's not a problem, though defragmentation may still make
sense because of the reduced size of the metadata that's needed to track the
scattered extents.
File data that are in use can be safely defragmented because the whole process
happens inside the page cache, which is the central point caching the file data
and takes care of synchronization. Once a filesystem sync or flush is started
(either manually or automatically), all the dirty data get written to the
devices. This however reduces the chances to find an optimal layout as the
writes happen together with other data and the result depends on the remaining
free space layout and fragmentation.
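A minimal sketch of a recursive defragmentation run (the mount point and the
32MiB target extent size are example values only):

.. code-block:: shell

   # btrfs filesystem defragment -r -t 32M /mnt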


@ -1,6 +1,18 @@
Flexibility
===========
The underlying design of the BTRFS data structures allows a lot of flexibility
for making changes after filesystem creation, like resizing, adding or removing
space, or enabling some features on-the-fly.
* **dynamic inode creation** -- there's no fixed space or tables for tracking
  inodes so the number of inodes that can be created is bounded only by the
  metadata space and its utilization
* **block group profile change on-the-fly** -- the block group profiles can be
  changed on a mounted filesystem by running the balance operation and
  specifying the conversion filters
* **resize** -- the space occupied by the filesystem on each device can be
  resized up (grow) or down (shrink) as long as the amount of data can still be
  contained on the device (see the examples below)
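As a sketch of the last two points (the target profiles, the size and the mount
point are examples only), a profile conversion and a shrink could look like this:

.. code-block:: shell

   # btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt
   # btrfs filesystem resize -10G /mnt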


@ -1,4 +1,4 @@
Quota groups
============
...
.. include:: ch-quota-intro.rst


@ -1,4 +1,29 @@
Reflink
=======
...
Reflink is a type of shallow copy of file data that shares the blocks but
otherwise the files are independent and any change to one of the files will not
affect the other. This builds on the underlying COW mechanism. A reflink
effectively creates only separate metadata pointing to the shared blocks, which
is typically much faster than a deep copy of all blocks.

The reflink is typically meant for whole files, but a partial file range can
also be copied, though there are no ready-made tools for that.

.. code-block:: shell

   cp --reflink=always source target
There are some constraints:

- cross-filesystem reflink is not possible, there's nothing in common between
  the two filesystems so the block sharing can't work
- reflink crossing two mount points of the same filesystem does not work due
  to an artificial limitation in VFS (this may change in the future)
- reflink requires that the source and target files have the same status
  regarding NOCOW and checksums; for example, if the source file is NOCOW (once
  created with the chattr +C attribute) then the above command won't work unless
  the target file is pre-created with the +C attribute as well, or the NOCOW
  attribute is inherited from the parent directory (chattr +C on the directory),
  or the whole filesystem is mounted with *-o nodatacow*, which would create
  the NOCOW files by default (see the sketch below)
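A minimal sketch of the last constraint, reflinking a NOCOW source file by
pre-creating the target with the same attribute (the file names are examples):

.. code-block:: shell

   touch target
   chattr +C target
   cp --reflink=always source target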


@ -1,4 +1,12 @@
Resize
======
...
A BTRFS mounted filesystem can be resized after creation, grown or shrunk. On a
multi-device filesystem the space occupied on each device can be resized
independently. Data that reside in the area that would be beyond the new size
are relocated to the remaining space below the limit, so this constrains the
minimum size to which a filesystem can be shrunk.

Growing a filesystem is quick as it only needs to take note of the available
space, while shrinking a filesystem needs to relocate potentially lots of data
and this is IO intensive. It is possible to shrink a filesystem in smaller steps.
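For illustration (the mount point, device id and sizes are examples), growing
to the maximum available size and shrinking a specific device by 10GiB could
look like this:

.. code-block:: shell

   # btrfs filesystem resize max /mnt
   # btrfs filesystem resize 1:-10G /mnt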


@ -1,4 +1,4 @@
Scrub
=====
...
.. include:: ch-scrub-intro.rst


@ -1,4 +1,23 @@
Send/receive
============
...
Send and receive are complementary features that allow transferring data from
one filesystem to another in a streamable format. The send part traverses a
given read-only subvolume and either creates a full stream representation of
its data and metadata (*full mode*), or, given a set of subvolumes for
reference, it generates a difference relative to that set (*incremental mode*).

Receive on the other hand takes the stream and reconstructs a subvolume with
files and directories equivalent to the filesystem that was used to produce the
stream. The result is not exactly 1:1, eg. inode numbers can be different and
other unique identifiers can be different (like the subvolume UUIDs). The full
mode starts with an empty subvolume, creates all the files and then turns the
subvolume read-only. At this point it could be used as a starting point for a
future incremental send stream, provided it would be generated from the same
source subvolume on the other filesystem.
The stream is a sequence of encoded commands that change eg. file metadata
(owner, permissions, extended attributes), data extents (create, clone,
truncate) or whole file operations (rename, delete). The stream can be sent
over the network, piped directly to the receive command or saved to a file.
Each command in the stream is protected by a CRC32C checksum.
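A rough sketch of a full and then an incremental transfer (the snapshot and
target paths are examples; the snapshots must be read-only):

.. code-block:: bash

   # btrfs send /mnt/snap1 | btrfs receive /backup
   # btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /backup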


@ -9,102 +9,7 @@ SYNOPSIS
DESCRIPTION
-----------
.. include:: ch-convert-intro.rst
OPTIONS
-------


@ -36,202 +36,7 @@ gradually improving and issues found and fixed.
HIERARCHICAL QUOTA GROUP CONCEPTS
---------------------------------
.. include:: ch-quota-intro.rst
SUBCOMMAND
----------


@ -9,33 +9,7 @@ SYNOPSIS
DESCRIPTION
-----------
.. include:: ch-scrub-intro.rst
SUBCOMMAND
----------


@ -0,0 +1,76 @@
Data and metadata are checksummed by default. The checksum is calculated before
writing and verified after reading the blocks from the devices. There are
several checksum algorithms supported. The default and backward compatible
algorithm is *crc32c*. Since kernel 5.5 there are three more with different
characteristics and trade-offs regarding speed and strength. The following list
may help you to decide which one to select.
CRC32C (32bit digest)
   default, best backward compatibility, very fast, modern CPUs have
   instruction-level support, not collision-resistant but still good error
   detection capabilities

XXHASH (64bit digest)
   can be used as CRC32C successor, very fast, optimized for modern CPUs
   utilizing instruction pipelining, good collision resistance and error
   detection

SHA256 (256bit digest)
   a cryptographic-strength hash, relatively slow but with possible CPU
   instruction acceleration or specialized hardware cards, FIPS certified and
   in wide use

BLAKE2b (256bit digest)
   a cryptographic-strength hash, relatively fast with possible CPU acceleration
   using SIMD extensions, not standardized but based on BLAKE which was a SHA3
   finalist, in wide use, the algorithm used is BLAKE2b-256 that's optimized for
   64bit platforms
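The checksum algorithm is selected at filesystem creation time, using the
*--csum* option of **mkfs.btrfs**, eg. (the device name is an example):

.. code-block:: bash

   # mkfs.btrfs --csum xxhash /dev/sdx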
The *digest size* affects overall size of data block checksums stored in the
filesystem. The metadata blocks have a fixed area up to 256 bits (32 bytes), so
there's no increase. Each data block has a separate checksum stored, with
additional overhead of the b-tree leaves.
Approximate relative performance of the algorithms, measured against CRC32C
using reference software implementations on a 3.5GHz Intel CPU:
======== ============ ======= ================
Digest Cycles/4KiB Ratio Implementation
======== ============ ======= ================
CRC32C 1700 1.00 CPU instruction
XXHASH 2500 1.44 reference impl.
SHA256 105000 61 reference impl.
SHA256 36000 21 libgcrypt/AVX2
SHA256 63000 37 libsodium/AVX2
BLAKE2b 22000 13 reference impl.
BLAKE2b 19000 11 libgcrypt/AVX2
BLAKE2b 19000 11 libsodium/AVX2
======== ============ ======= ================
Many kernels are configured with SHA256 as built-in and not as a module.
The accelerated versions are however provided by the modules and must be loaded
explicitly (**modprobe sha256**) before mounting the filesystem to make use of
them. You can check in */sys/fs/btrfs/FSID/checksum* which one is used. If you
see *sha256-generic*, then you may want to unmount and mount the filesystem
again, changing that on a mounted filesystem is not possible.
Check the file */proc/crypto*; when the implementation is built-in, you'd find:
.. code-block:: none
name : sha256
driver : sha256-generic
module : kernel
priority : 100
...
while an accelerated implementation looks eg. like this:
.. code-block:: none
name : sha256
driver : sha256-avx2
module : sha256_ssse3
priority : 170
...


@ -0,0 +1,153 @@
Btrfs supports transparent file compression. There are three algorithms
available: ZLIB, LZO and ZSTD (since v4.14), with various levels. The
compression happens on the level of file extents and the algorithm is selected
by a file property, a mount option or by a defrag command. You can have a
single btrfs mount point that has some files that are uncompressed, some that
are compressed with LZO, some with ZLIB, for instance (though you may not want
it that way, it is supported).
Once the compression is set, all newly written data will be compressed, ie.
existing data are untouched. Data are split into smaller chunks (128KiB) before
compression to make random rewrites possible without a high performance hit. Due
to the increased number of extents the metadata consumption is higher. The
chunks are compressed in parallel.
The algorithms can be characterized as follows regarding the speed/ratio
trade-offs:
ZLIB
* slower, higher compression ratio
* levels: 1 to 9, mapped directly, default level is 3
* good backward compatibility
LZO
* faster compression and decompression than zlib, worse compression ratio, designed to be fast
* no levels
* good backward compatibility
ZSTD
* compression comparable to zlib with higher compression/decompression speeds and different ratio
* levels: 1 to 15
* since 4.14, levels since 5.1
The differences depend on the actual data set and cannot be expressed by a
single number or recommendation. Higher levels consume more CPU time and may
not bring a significant improvement, lower levels are close to real time.
How to enable compression
-------------------------
Typically the compression can be enabled on the whole filesystem, specified for
the mount point. Note that the compression mount options are shared among all
mounts of the same filesystem, either bind mounts or subvolume mounts.
Please refer to section *MOUNT OPTIONS*.
.. code-block:: shell
$ mount -o compress=zstd /dev/sdx /mnt
This will enable the ``zstd`` algorithm on the default level (which is 3).
The level can be specified manually too, like ``zstd:3``. Higher levels compress
better at the cost of time. This in turn may cause increased write latency;
lower levels are suitable for real-time compression and on a reasonably fast
CPU don't cause noticeable performance drops.
.. code-block:: shell
$ btrfs filesystem defrag -czstd file
The command above will start defragmentation of the whole *file* and apply
the compression, regardless of the mount option. (Note: specifying a level is
not yet implemented.) The compression algorithm is not persistent and applies
only to the defragmentation command; for any other writes the other compression
settings apply.
Persistent settings on a per-file basis can be set in two ways:
.. code-block:: shell
$ chattr +c file
$ btrfs property set file compression zstd
The first command uses the legacy interface of file attributes inherited from
the ext2 filesystem and is not flexible, so by default the *zlib* compression is
set. The second command sets a property on the file with the given algorithm.
(Note: setting the level that way is not yet implemented.)
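The applied setting can be verified, eg. like this (a hypothetical file; the
output format of the property command may vary):

.. code-block:: shell

   $ btrfs property get file compression
   compression=zstd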
Compression levels
------------------
The level support of ZLIB has been added in v4.14, LZO does not support levels
(the kernel implementation provides only one), ZSTD level support has been added
in v5.1.
There are 9 levels of ZLIB supported (1 to 9), mapping 1:1 from the mount option
to the algorithm defined level. The default is level 3, which provides a
reasonably good compression ratio and is still reasonably fast. The difference
in compression gain of levels 7, 8 and 9 is comparable but the higher levels
take longer.
The ZSTD support includes levels 1 to 15, a subset of full range of what ZSTD
provides. Levels 1-3 are real-time, 4-8 slower with improved compression and
9-15 try even harder though the resulting size may not be significantly improved.
Level 0 always maps to the default. The compression level does not affect
compatibility.
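For example, mounting with a specific algorithm and level (the device and
mount point are placeholders):

.. code-block:: shell

   $ mount -o compress=zstd:9 /dev/sdx /mnt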
Incompressible data
-------------------
For files with already compressed data, or with data that won't compress well
given the CPU and memory constraints of the kernel implementations, a simple
decision logic is used. If the first portion of data being compressed is not
smaller than the original, the compression of the file is disabled -- unless the
filesystem is mounted with *compress-force*. In that case compression will
always be attempted on the file, only to be later discarded. This is not optimal
and subject to optimizations and further development.
If a file is identified as incompressible, a flag is set (*NOCOMPRESS*) and it's
sticky. On that file compression won't be performed unless forced. The flag
can also be set by **chattr +m** (since e2fsprogs 1.46.2) or by properties with
value *no* or *none*. An empty value will reset it to the default that's
currently applicable on the mounted filesystem.
There are two ways to detect incompressible data:

* actual compression attempt -- data are compressed, and if the result is not
  smaller, it's discarded, so this depends on the algorithm and level
* pre-compression heuristics -- a quick statistical evaluation of the data is
  performed and based on the result compression is either performed or skipped;
  the NOCOMPRESS bit is not set just by the heuristic, only if the compression
  algorithm does not make an improvement
.. code-block:: shell
$ lsattr file
---------------------m file
Using forced compression is not recommended; the heuristics are supposed to
decide that, and the compression algorithms internally detect incompressible
data too.
Pre-compression heuristics
--------------------------
The heuristics aim to do a few quick statistical tests on the data in order to
avoid a probably costly compression that would turn out to be inefficient.
Compression algorithms could have internal detection of incompressible data too
but this leads to more overhead as the compression is done in another thread
and has to write the data anyway. The heuristic is read-only and can utilize
cached memory.

The tests performed are based on the following: data sampling, long repeated
pattern detection, byte frequency, Shannon entropy.
Compatibility
-------------
Compression is done using the COW mechanism so it's incompatible with
*nodatacow*. Direct IO works on compressed files but will fall back to buffered
writes and leads to recompression. Currently *nodatasum* and compression don't
work together.
The compression algorithms have been added over time so the version
compatibility should be also considered, together with other tools that may
access the compressed data like bootloaders.


@ -0,0 +1,97 @@
The **btrfs-convert** tool can be used to convert an existing source filesystem
image to a btrfs filesystem in-place. The original filesystem image is
accessible in a subvolume named like *ext2_saved* as file *image*.

Supported filesystems:

* ext2, ext3, ext4 -- original feature, always built in
* reiserfs -- since version 4.13, optionally built, requires libreiserfscore 3.6.27
* ntfs -- external tool https://github.com/maharmstone/ntfs2btrfs

The list of source filesystems supported by a given binary is printed at the end
of the help (option *--help*).
.. warning::
If you are going to perform rollback to the original filesystem, you
should not execute **btrfs balance** command on the converted filesystem. This
will change the extent layout and make **btrfs-convert** unable to rollback.
The conversion utilizes free space of the original filesystem. The exact amount
of required space cannot be determined in advance. The final btrfs metadata
might occupy several gigabytes on a hundreds-of-gigabytes filesystem.

If the ability to roll back is no longer important, then it is recommended to
perform a few more steps to transition the btrfs filesystem to a more compact
layout. This is because the conversion inherits the original data blocks'
fragmentation, and also because the metadata blocks are bound to the original
free space layout.
Due to different constraints, it is only possible to convert filesystems that
have a supported data block size (ie. the same that would be valid for
**mkfs.btrfs**). This is typically the system page size (4KiB on x86_64
machines).
**BEFORE YOU START**
The source filesystem must be clean, eg. no journal to replay and no repairs
needed. The respective **fsck** utility must be run on the source filesystem
prior to conversion. Please refer to its manual pages in case you encounter
problems.
For ext2/3/4:
.. code-block:: bash
# e2fsck -fvy /dev/sdx
For reiserfs:
.. code-block:: bash
# reiserfsck -fy /dev/sdx
Skipping that step could lead to incorrect results on the target filesystem,
but it may work.
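The conversion itself is then a single command; a minimal sketch (the device
name is an example):

.. code-block:: bash

   # btrfs-convert /dev/sdx

While the original filesystem metadata are still present, the conversion can be
undone with the rollback option (**btrfs-convert -r**).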
**REMOVE THE ORIGINAL FILESYSTEM METADATA**
By removing the subvolume named like *ext2_saved* or *reiserfs_saved*, all
metadata of the original filesystem will be removed:
.. code-block:: bash
# btrfs subvolume delete /mnt/ext2_saved
At this point it is not possible to do a rollback. The filesystem is usable but
may be impacted by the fragmentation inherited from the original filesystem.
**MAKE FILE DATA MORE CONTIGUOUS**
An optional but recommended step is to run defragmentation on the entire
filesystem. This will attempt to make file extents more contiguous.
.. code-block:: bash
# btrfs filesystem defrag -v -r -f -t 32M /mnt/btrfs
Verbose recursive defragmentation (*-v*, *-r*), flush data per-file (*-f*) with
target extent size 32MiB (*-t*).
**ATTEMPT TO MAKE BTRFS METADATA MORE COMPACT**
Optional but recommended step.
The metadata block groups after conversion may be smaller than the default size
(256MiB or 1GiB). Running a balance will attempt to merge the block groups.
This depends on the free space layout (and fragmentation) and may fail due to a
lack of work space. This is a soft error, leaving the filesystem usable but the
block group layout may remain unchanged.

Note that the balance operation may take a lot of time, please see also
``btrfs-balance(8)``.
.. code-block:: bash
# btrfs balance start -m /mnt/btrfs


@ -0,0 +1,198 @@
The concept of quota has a long-standing tradition in the Unix world. Ever
since computers have allowed multiple users to work simultaneously on one
filesystem, there has been the need to prevent one user from using up the
entire space. Every user should get his fair share of the available resources.
In the case of files, the solution is quite straightforward. Each file has an
*owner* recorded along with it, and it has a size. Traditional quota just
restricts the total size of all files that are owned by a user. The concept is
quite flexible: if a user hits his quota limit, the administrator can raise it
on the fly.

On the other hand, the traditional approach offers only a poor solution for
restricting directories.
At installation time, the hard disk can be partitioned so that every directory
(eg. /usr, /var, ...) that needs a limit gets its own partition. The obvious
problem is that those limits cannot be changed without a reinstallation. The
btrfs subvolume feature builds a bridge. Subvolumes correspond in many ways to
partitions, as every subvolume looks like its own filesystem. With subvolume
quota, it is now possible to restrict each subvolume like a partition, but keep
the flexibility of quota. The space for each subvolume can be expanded or
restricted on the fly.
As subvolumes are the basis for snapshots, interesting questions arise as to
how to account used space in the presence of snapshots. If you have a file
shared between a subvolume and a snapshot, to whom should the file be accounted?
The creator? Both? What if the file gets modified in the snapshot, should only
these changes be accounted to it? But wait, both the snapshot and the subvolume
belong to the same user home. I just want to limit the total space used by
both! But somebody else might not want to charge the snapshots to the users.
Btrfs subvolume quota solves these problems by introducing groups of subvolumes
and letting the user put limits on them. It is even possible to have groups of
groups. In the following, we refer to them as *qgroups*.
Each qgroup primarily tracks two numbers, the amount of total referenced
space and the amount of exclusively referenced space.

referenced
   space is the amount of data that can be reached from any of the
   subvolumes contained in the qgroup, while
exclusive
   is the amount of data where all references to this data can be reached
   from within this qgroup.
SUBVOLUME QUOTA GROUPS
^^^^^^^^^^^^^^^^^^^^^^
The basic notion of the Subvolume Quota feature is the quota group, short
qgroup. Qgroups are notated as *level/id*, eg. the qgroup 3/2 is a qgroup of
level 3. For level 0, the leading '0/' can be omitted.
Qgroups of level 0 get created automatically when a subvolume/snapshot gets
created. The ID of the qgroup corresponds to the ID of the subvolume, so 0/5
is the qgroup for the root subvolume.
For the ``btrfs qgroup`` command, the path to the subvolume can also be used
instead of *0/ID*. For all higher levels, the ID can be chosen freely.
Each qgroup can contain a set of lower level qgroups, thus creating a hierarchy
of qgroups. Figure 1 shows an example qgroup tree.
.. code-block:: none

                               +---+
                               |2/1|
                               +---+
                              /     \
                        +---+/       \+---+
                        |1/1|         |1/2|
                        +---+         +---+
                       /     \       /     \
                 +---+/       \+---+/       \+---+
   qgroups       |0/1|         |0/2|         |0/3|
                 +-+-+         +---+         +---+
                   |          /     \       /     \
                   |         /       \     /       \
                   |        /         \   /         \
   extents         1       2            3            4

Figure 1: Sample qgroup hierarchy
At the bottom, some extents are depicted, showing which qgroups reference which
extents. It is important to understand the notion of *referenced* vs
*exclusive*. In the example, qgroup 0/2 references extents 2 and 3, while 1/2
references extents 2-4 and 2/1 references all extents.
On the other hand, extent 1 is exclusive to 0/1, extent 2 is exclusive to 0/2,
while extent 3 is neither exclusive to 0/2 nor to 0/3. But because both
references can be reached from 1/2, extent 3 is exclusive to 1/2. All extents
are exclusive to 2/1.
So exclusive does not mean there is no other way to reach the extent, but it
does mean that if you delete all subvolumes contained in a qgroup, the extent
will get deleted.
The exclusive count of a qgroup conveys the useful information of how much
space will be freed in case all subvolumes of the qgroup get deleted.
All data extents are accounted this way. Metadata that belongs to a specific
subvolume (i.e. its filesystem tree) is also accounted. Checksums and extent
allocation information are not accounted.
In turn, the referenced count of a qgroup can be limited. All writes beyond
this limit will lead to a 'Quota Exceeded' error.
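The following sketch shows how qgroups are typically managed from the command
line (the mount point, the subvolume ID and the limit are examples only):

.. code-block:: bash

   # btrfs quota enable /mnt
   # btrfs qgroup create 1/1 /mnt
   # btrfs qgroup assign 0/257 1/1 /mnt
   # btrfs qgroup limit 10G 1/1 /mnt
   # btrfs qgroup show /mnt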
INHERITANCE
^^^^^^^^^^^
Things get a bit more complicated when new subvolumes or snapshots are created.
The case of (empty) subvolumes is still quite easy. If a subvolume should be
part of a qgroup, it has to be added to the qgroup at creation time. To add it
at a later time, it would be necessary to at least rescan the full subvolume
for a proper accounting.
Creation of a snapshot is the hard case. Obviously, the snapshot will
reference the exact same amount of space as its source, and both source and
destination now have an exclusive count of 0 (the filesystem nodesize to be
precise, as the roots of the trees are not shared). But what about qgroups of
higher levels? If the qgroup contains both the source and the destination,
nothing changes. If the qgroup contains only the source, it might lose some
exclusive.
But how much? The tempting answer is, subtract all exclusive of the source from
the qgroup, but that is wrong, or at least not enough. There could have been
an extent that is referenced from the source and another subvolume from that
qgroup. This extent would have been exclusive to the qgroup, but not to the
source subvolume. With the creation of the snapshot, the qgroup would also
lose this extent from its exclusive set.
So how can this problem be solved? In the instant the snapshot gets created, we
already have to know the correct exclusive count. We need to have a second
qgroup that contains the same subvolumes as the first qgroup, except the
subvolume we want to snapshot. The moment we create the snapshot, the
exclusive count from the second qgroup needs to be copied to the first qgroup,
as it represents the correct value. The second qgroup is called a tracking
qgroup. It is only there in case a snapshot is needed.
USE CASES
^^^^^^^^^
Below are some use cases; the list is not meant to be exhaustive. You can find
your own way to integrate qgroups.
SINGLE-USER MACHINE
"""""""""""""""""""
``Replacement for partitions``
The simplest use case is to use qgroups as a simple replacement for partitions.
Btrfs takes the disk as a whole, and /, /usr, /var, etc. are created as
subvolumes. As each subvolume gets its own qgroup automatically, they can
simply be restricted. No hierarchy is needed for that.
``Track usage of snapshots``
When a snapshot is taken, a qgroup for it will automatically be created with
the correct values. 'Referenced' will show how much is in it, possibly shared
with other subvolumes. 'Exclusive' will be the amount of space that gets freed
when the subvolume is deleted.
MULTI-USER MACHINE
""""""""""""""""""
``Restricting homes``
When you have several users on a machine, with home directories probably under
/home, you might want to restrict /home as a whole, while restricting every
user to an individual limit as well. This is easily accomplished by creating a
qgroup for /home, eg. 1/1, and assigning all user subvolumes to it.
Restricting this qgroup will limit /home, while every user subvolume can get
its own (lower) limit, as sketched below.
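A sketch of that setup, assuming the user subvolumes already exist under /home
(the paths, qgroup IDs and limits are illustrative only):

.. code-block:: bash

   # btrfs qgroup create 1/1 /home
   # btrfs qgroup assign 0/258 1/1 /home
   # btrfs qgroup assign 0/259 1/1 /home
   # btrfs qgroup limit 500G 1/1 /home
   # btrfs qgroup limit 100G 0/258 /home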
``Accounting snapshots to the user``
Let's say the user is allowed to create snapshots via some mechanism. It would
only be fair to account space used by the snapshots to the user. This does not
mean the user doubles his usage as soon as he takes a snapshot. Of course,
files that are present in his home and the snapshot should only be accounted
once. This can be accomplished by creating a qgroup for each user, say
'1/UID'. The user home and all snapshots are assigned to this qgroup.
Limiting it will extend the limit to all snapshots, counting files only once.
To limit /home as a whole, a higher level group 2/1 replacing 1/1 from the
previous example is needed, with all user qgroups assigned to it.
``Do not account snapshots``
On the other hand, when the snapshots get created automatically, the user has
no chance to control them, so the space used by them should not be accounted to
him. This is already the case when creating snapshots in the example from
the previous section.
``Snapshots for backup purposes``
This scenario is a mixture of the previous two. The user can create snapshots,
but some snapshots for backup purposes are being created by the system. The
user's snapshots should be accounted to the user, not the system. The solution
is similar to the one from the section 'Accounting snapshots to the user', but
do not assign the system snapshots to the user's qgroup.


@ -0,0 +1,28 @@
Scrub is a pass over all filesystem data and metadata that verifies the
checksums. If a valid copy is available (replicated block group profiles) then
the damaged one is repaired. All copies of the replicated profiles are validated.
.. note::
   Scrub is not a filesystem checker (fsck) and does not verify or repair
   structural damage in the filesystem. It really only checks checksums of data
   and tree blocks; it doesn't ensure the content of tree blocks is valid and
   consistent. There's some validation performed when metadata blocks are read
   from disk, but it's not extensive and cannot substitute for a full
   *btrfs check* run.
The user is supposed to run it manually or via a periodic system service. The
recommended period is a month but could be less. The estimated device bandwidth
utilization is about 80% on an idle filesystem. The IO priority class is by
default *idle* so background scrub should not significantly interfere with
normal filesystem operation. The IO scheduler set for the device(s) might not
support the priority classes though.
The scrubbing status is recorded in */var/lib/btrfs/* in textual files named
*scrub.status.UUID* for a filesystem identified by the given UUID. (Progress
state is communicated through a named pipe in file *scrub.progress.UUID* in the
same directory.) The status file is updated every 5 seconds. A resumed scrub
will continue from the last saved position.
Scrub can be started only on a mounted filesystem, though it's possible to
scrub only a selected device. See **btrfs scrub start** for more.
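A typical manual run on a mounted filesystem, followed by checking the
progress (the mount point is an example):

.. code-block:: shell

   # btrfs scrub start /mnt
   # btrfs scrub status /mnt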


@ -0,0 +1,78 @@
The COW mechanism and multiple devices under one hood enable an interesting
concept, called a seeding device: extending a read-only filesystem on a single
device with another device that captures all writes. For example, imagine an
immutable golden image of an operating system enhanced with another device that
allows using the data from the golden image together with normal operation.
This idea originated on CD-ROMs with a base OS, allowing them to be used for
live systems, but this became obsolete. There are technologies providing
similar functionality, like *unionmount*, *overlayfs* or *qcow2* image
snapshots.
The seeding device starts as a normal filesystem; once the contents are ready,
**btrfstune -S 1** is used to flag it as a seeding device. Mounting such a
device will not allow any writes, except adding a new device by **btrfs device
add**. Then the filesystem can be remounted as read-write.
Given that the filesystem on the seeding device is always recognized as
read-only, it can be used to seed multiple filesystems at the same time. The
UUID that is normally attached to a device is automatically changed to a random
UUID on each mount.
Once the seeding device is mounted, it needs the writable device. After adding
it, something like **mount -o remount,rw /path** makes the filesystem at
*/path* ready for use. The simplest use case is to throw away all changes by
unmounting the filesystem when convenient.
Alternatively, deleting the seeding device from the filesystem can turn it into
a normal filesystem, provided that the writable device can also contain all the
data from the seeding device.
The seeding device flag can be cleared again by **btrfstune -f -S 0**, eg.
to allow updating it with newer data, but please note that this will invalidate
all existing filesystems that use this particular seeding device. This works
for some use cases, not for others, and the forcing flag to the command is
mandatory to avoid accidental mistakes.
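A minimal sketch of clearing the flag (the device name is an example; note the
forcing option):

.. code-block:: bash

   # btrfstune -f -S 0 /dev/sda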
Example of how to create and use a seeding device:

.. code-block:: bash

   # mkfs.btrfs /dev/sda
   # mount /dev/sda /mnt/mnt1
   # ... fill mnt1 with data
   # umount /mnt/mnt1

   # btrfstune -S 1 /dev/sda
   # mount /dev/sda /mnt/mnt1
   # btrfs device add /dev/sdb /mnt/mnt1
   # mount -o remount,rw /mnt/mnt1
   # ... /mnt/mnt1 is now writable
Now */mnt/mnt1* can be used normally. The device */dev/sda* can be mounted
again with another writable device:

.. code-block:: bash

   # mount /dev/sda /mnt/mnt2
   # btrfs device add /dev/sdc /mnt/mnt2
   # mount -o remount,rw /mnt/mnt2
   # ... /mnt/mnt2 is now writable
The writable device (*/dev/sdb*) can be decoupled from the seeding device and
used independently:
.. code-block:: bash
# btrfs device delete /dev/sda /mnt/mnt1
As the contents originated in the seeding device, it's possible to turn
*/dev/sdb* into a seeding device again and repeat the whole process.
A few things to note:

* it's recommended to use only a single device for the seeding device, it works
  for multiple devices but the *single* profile must be used in order to make
  the seeding device deletion work
* block group profiles *single* and *dup* support the use cases above
* the label is copied from the seeding device and can be changed by **btrfs filesystem label**
* each new mount of the seeding device gets a new random UUID