btrfs-progs: docs: add more chapters (part 2)
The feature pages share the contents with the manual page section 5, so put the contents into separate files. Progress: 2/3.

Signed-off-by: David Sterba <dsterba@suse.com>
commit c6be84840f (parent b871bf49f3)
@ -1,4 +1,8 @@
Auto-repair on read
===================

Data or metadata that are found to be damaged (eg. because the checksum does
not match) at the time they're read from the device can be salvaged in case the
filesystem has another valid copy, ie. when using a block group profile with
redundancy (DUP, RAID1, RAID5/6). The correct data are returned to the user
application and the damaged copy is replaced by them.
@ -1,4 +1,4 @@
Convert
=======

.. include:: ch-convert-intro.rst
@ -1,4 +1,44 @@
Deduplication
=============

Going by the definition in the context of filesystems, it's a process of
looking up identical data blocks tracked separately and creating a shared
logical link while removing one of the copies of the data blocks. This leads to
data space savings while it increases metadata consumption.

There are two main deduplication types:

* **in-band** *(sometimes also called on-line)* -- all newly written data are
  considered for deduplication before writing
* **out-of-band** *(sometimes also called offline)* -- data for deduplication
  have to be actively looked for and deduplicated by the user application

Both have their pros and cons. BTRFS implements **only the out-of-band** type.

BTRFS provides the basic building blocks for deduplication, allowing other
tools to choose the strategy and scope of the deduplication. There are multiple
tools that take different approaches to deduplication, offer additional
features or make trade-offs. The following table lists tools that are known to
be up-to-date, maintained and widely used.

.. list-table::
   :header-rows: 1

   * - Name
     - File based
     - Block based
     - Incremental
   * - `BEES <https://github.com/Zygo/bees>`_
     - No
     - Yes
     - Yes
   * - `duperemove <https://github.com/markfasheh/duperemove>`_
     - Yes
     - No
     - Yes

Legend:

- *File based*: the tool takes a list of files and deduplicates blocks only from that set
- *Block based*: the tool enumerates blocks and looks for duplicates
- *Incremental*: repeated runs of the tool utilize information gathered from previous runs
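For illustration, deduplicating a whole directory tree with one of the listed
tools could look like this (a sketch, not an endorsement of particular options;
*-d* actually performs the deduplication, *-r* recurses into the directory):

.. code-block:: bash

   # duperemove -dr /path/to/dir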
@ -1,4 +1,22 @@
Defragmentation
===============

Defragmentation of files is supposed to make the layout of the file extents
more linear, or at least coalesce the file extents into larger ones that can
be stored on the device more efficiently. The reason there's a need for
defragmentation stems from the COW design that BTRFS is built on and is
inherent. The fragmentation is caused by rewrites of the same file data
in-place, which have to be handled by creating a new copy that may lie at a
distant location on the physical device. Fragmentation is the worst problem on
rotational hard disks due to the delay caused by moving the drive heads to the
distant location. With modern seek-less devices it's not a problem, though
defragmentation may still make sense because of the reduced size of the
metadata that's needed to track the scattered extents.

File data that are in use can be safely defragmented because the whole process
happens inside the page cache, which is the central point caching the file data
and takes care of synchronization. Once a filesystem sync or flush is started
(either manually or automatically) all the dirty data get written to the
devices. This however reduces the chances to find an optimal layout as the
writes happen together with other data and the result depends on the remaining
free space layout and fragmentation.
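For illustration, a recursive defragmentation of a directory with a target
extent size could be started like this (a sketch; the path and the 32MiB
target size are examples):

.. code-block:: bash

   # btrfs filesystem defragment -r -t 32M /path/to/dir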
@ -1,6 +1,18 @@
Flexibility
===========

The underlying design of the BTRFS data structures allows a lot of flexibility
and making changes after filesystem creation, like resizing, adding/removing
space or enabling some features on-the-fly.

* **dynamic inode creation** -- there's no fixed space or tables for tracking
  inodes so the number of inodes that can be created is bounded by the metadata
  space and its utilization

* **block group profile change on-the-fly** -- the block group profiles can be
  changed on a mounted filesystem by running the balance operation and
  specifying the conversion filters (see the sketch below)

* **resize** -- the space occupied by the filesystem on each device can be
  resized up (grow) or down (shrink) as long as the amount of data can still
  be contained on the device
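As a sketch of the profile change mentioned in the list above (the RAID1
target profile and the mount point are examples):

.. code-block:: bash

   # btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt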
@ -1,4 +1,4 @@
Quota groups
============

.. include:: ch-quota-intro.rst
@ -1,4 +1,29 @@
Reflink
=======

Reflink is a type of shallow copy of file data that shares the blocks but
otherwise the files are independent and any change to one file will not affect
the other. This builds on the underlying COW mechanism. A reflink effectively
creates only separate metadata pointing to the shared blocks, which is
typically much faster than a deep copy of all blocks.

A reflink is typically meant for whole files, but a partial file range can
also be copied, though there are no ready-made high-level tools for that (see
the low-level sketch below).

.. code-block:: shell

   cp --reflink=always source target
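As an illustration of a partial-range reflink, the low-level *xfs_io* utility
can issue the clone-range call directly (a sketch; the offsets and length are
examples, *-f* creates the target if missing, and the tool is not
btrfs-specific):

.. code-block:: bash

   # xfs_io -f -c "reflink source 0 0 1m" target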
There are some constraints:

- cross-filesystem reflink is not possible; there's nothing in common between
  the two filesystems so the block sharing can't work
- reflink crossing two mount points of the same filesystem does not work due
  to an artificial limitation in VFS (this may change in the future)
- reflink requires source and target files that have the same status regarding
  NOCOW and checksums, for example if the source file is NOCOW (once created
  with the chattr +C attribute) then the above command won't work unless the
  target file is pre-created with the +C attribute as well, or the NOCOW
  attribute is inherited from the parent directory (chattr +C on the
  directory), or the whole filesystem is mounted with *-o nodatacow*, which
  would create NOCOW files by default
@ -1,4 +1,12 @@
Resize
======

A mounted BTRFS filesystem can be resized after creation, grown or shrunk. On a
multi-device filesystem the space occupied on each device can be resized
independently. Data that reside in the area that would be beyond the new size
are relocated to the remaining space below the limit, so this constrains the
minimum size to which a filesystem can be shrunk.

Growing a filesystem is quick as it only needs to take note of the available
space, while shrinking a filesystem needs to relocate potentially lots of data
and this is IO intensive. It is possible to shrink a filesystem in smaller
steps.
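For illustration, typical resize invocations look like this (a sketch; the
sizes, device id and mount point are examples):

.. code-block:: bash

   # btrfs filesystem resize -2g /mnt       # shrink by 2GiB
   # btrfs filesystem resize max /mnt       # grow to the full device size
   # btrfs filesystem resize 2:+10g /mnt    # grow the device with devid 2 by 10GiB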
@ -1,4 +1,4 @@
Scrub
=====

.. include:: ch-scrub-intro.rst
@ -1,4 +1,23 @@
Send/receive
============

Send and receive are complementary features that allow transferring data from
one filesystem to another in a streamable format. The send part traverses a
given read-only subvolume and either creates a full stream representation of
its data and metadata (*full mode*), or, given a set of subvolumes for
reference, it generates a difference relative to that set (*incremental mode*).

Receive, on the other hand, takes the stream and reconstructs a subvolume with
files and directories equivalent to the filesystem that was used to produce the
stream. The result is not exactly 1:1, eg. inode numbers can be different and
other unique identifiers can be different (like the subvolume UUIDs). The full
mode starts with an empty subvolume, creates all the files and then turns the
subvolume to read-only. At this point it could be used as a starting point for
a future incremental send stream, provided it would be generated from the same
source subvolume on the other filesystem.

The stream is a sequence of encoded commands that change eg. file metadata
(owner, permissions, extended attributes), data extents (create, clone,
truncate) or whole file operations (rename, delete). The stream can be sent
over the network, piped directly to the receive command or saved to a file.
Each command in the stream is protected by a CRC32C checksum.
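For illustration, a full transfer followed by an incremental one might look
like this (a sketch; the snapshot paths are examples and both snapshots must
be read-only):

.. code-block:: bash

   # btrfs send /snapshots/base | btrfs receive /backup
   # btrfs send -p /snapshots/base /snapshots/today | btrfs receive /backup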
@ -9,102 +9,7 @@ SYNOPSIS

DESCRIPTION
-----------

.. include:: ch-convert-intro.rst

OPTIONS
-------
@ -36,202 +36,7 @@ gradually improving and issues found and fixed.

HIERARCHICAL QUOTA GROUP CONCEPTS
---------------------------------

.. include:: ch-quota-intro.rst

SUBCOMMAND
----------
@ -9,33 +9,7 @@ SYNOPSIS

DESCRIPTION
-----------

.. include:: ch-scrub-intro.rst

SUBCOMMAND
----------
Documentation/ch-checksumming.rst (new file)
@ -0,0 +1,76 @@
Data and metadata are checksummed by default. The checksum is calculated before
write and verified after reading the blocks. There are several checksum
algorithms supported. The default and backward compatible one is *crc32c*.
Since kernel 5.5 there are three more with different characteristics and
trade-offs regarding speed and strength. The following list may help you to
decide which one to select.

CRC32C (32bit digest)
   default, best backward compatibility, very fast, modern CPUs have
   instruction-level support, not collision-resistant but still good error
   detection capabilities

XXHASH (64bit digest)
   can be used as a CRC32C successor, very fast, optimized for modern CPUs
   utilizing instruction pipelining, good collision resistance and error
   detection

SHA256 (256bit digest)
   a cryptographic-strength hash, relatively slow but with possible CPU
   instruction acceleration or specialized hardware cards, FIPS certified and
   in wide use

BLAKE2b (256bit digest)
   a cryptographic-strength hash, relatively fast with possible CPU
   acceleration using SIMD extensions, not standardized but based on BLAKE
   which was a SHA3 finalist, in wide use, the algorithm used is BLAKE2b-256
   that's optimized for 64bit platforms
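The algorithm is selected when the filesystem is created. A minimal sketch,
assuming btrfs-progs with the *--csum* option (available since version 5.5)
and an example device:

.. code-block:: bash

   # mkfs.btrfs --csum xxhash /dev/sdx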
The *digest size* affects the overall size of data block checksums stored in
the filesystem. The metadata blocks have a fixed area up to 256 bits (32
bytes), so there's no increase. Each data block has a separate checksum stored,
with additional overhead of the b-tree leaves.

Approximate relative performance of the algorithms, measured against CRC32C
using reference software implementations on a 3.5GHz Intel CPU:

======== ============ ======= ================
Digest   Cycles/4KiB  Ratio   Implementation
======== ============ ======= ================
CRC32C   1700         1.00    CPU instruction
XXHASH   2500         1.44    reference impl.
SHA256   105000       61      reference impl.
SHA256   36000        21      libgcrypt/AVX2
SHA256   63000        37      libsodium/AVX2
BLAKE2b  22000        13      reference impl.
BLAKE2b  19000        11      libgcrypt/AVX2
BLAKE2b  19000        11      libsodium/AVX2
======== ============ ======= ================

Many kernels are configured with SHA256 as built-in and not as a module.
The accelerated versions are however provided by the modules and must be loaded
explicitly (**modprobe sha256**) before mounting the filesystem to make use of
them. You can check in */sys/fs/btrfs/FSID/checksum* which one is used. If you
see *sha256-generic*, then you may want to unmount and mount the filesystem
again; changing that on a mounted filesystem is not possible.

Check the file */proc/crypto*; when the implementation is built-in, you'd find:

.. code-block:: none

   name         : sha256
   driver       : sha256-generic
   module       : kernel
   priority     : 100
   ...

while an accelerated implementation is eg.

.. code-block:: none

   name         : sha256
   driver       : sha256-avx2
   module       : sha256_ssse3
   priority     : 170
   ...
Documentation/ch-compression.rst (new file)
@ -0,0 +1,153 @@
Btrfs supports transparent file compression. There are three algorithms
available: ZLIB, LZO and ZSTD (since v4.14), with various levels.
The compression happens on the level of file extents and the algorithm is
selected by a file property, mount option or by a defrag command.
You can have a single btrfs mount point that has some files that are
uncompressed, some that are compressed with LZO, some with ZLIB, for instance
(though you may not want it that way, it is supported).

Once the compression is set, all newly written data will be compressed, ie.
existing data are untouched. Data are split into smaller chunks (128KiB) before
compression to make random rewrites possible without a high performance hit.
Due to the increased number of extents the metadata consumption is higher. The
chunks are compressed in parallel.

The algorithms can be characterized as follows regarding the speed/ratio
trade-offs:

ZLIB
   * slower, higher compression ratio
   * levels: 1 to 9, mapped directly, default level is 3
   * good backward compatibility
LZO
   * faster compression and decompression than zlib, worse compression ratio, designed to be fast
   * no levels
   * good backward compatibility
ZSTD
   * compression comparable to zlib with higher compression/decompression speeds and different ratio
   * levels: 1 to 15
   * since 4.14, levels since 5.1

The differences depend on the actual data set and cannot be expressed by a
single number or recommendation. Higher levels consume more CPU time and may
not bring a significant improvement, lower levels are close to real time.

How to enable compression
-------------------------

Typically the compression can be enabled on the whole filesystem, specified for
the mount point. Note that the compression mount options are shared among all
mounts of the same filesystem, either bind mounts or subvolume mounts.
Please refer to section *MOUNT OPTIONS*.

.. code-block:: shell

   $ mount -o compress=zstd /dev/sdx /mnt

This will enable the ``zstd`` algorithm on the default level (which is 3).
The level can be specified manually too, like ``zstd:3``. Higher levels
compress better at the cost of time. This in turn may cause increased write
latency; low levels are suitable for real-time compression and on a reasonably
fast CPU don't cause noticeable performance drops.

.. code-block:: shell

   $ btrfs filesystem defrag -czstd file

The command above will start defragmentation of the whole *file* and apply
the compression, regardless of the mount option. (Note: specifying a level is
not yet implemented.) The compression algorithm is not persistent and applies
only to the defragmentation command; for any other writes the other compression
settings apply.

Persistent settings on a per-file basis can be set in two ways:

.. code-block:: shell

   $ chattr +c file
   $ btrfs property set file compression zstd

The first command is using the legacy interface of file attributes inherited
from the ext2 filesystem and is not flexible, so by default the *zlib*
compression is set. The other command sets a property on the file with the
given algorithm. (Note: setting a level that way is not yet implemented.)
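The per-file setting can be read back with the same interface (a usage
sketch; the exact output format may vary):

.. code-block:: shell

   $ btrfs property get file compression
   compression=zstd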
Compression levels
------------------

The level support of ZLIB has been added in v4.14, LZO does not support levels
(the kernel implementation provides only one), ZSTD level support has been
added in v5.1.

There are 9 levels of ZLIB supported (1 to 9), mapping 1:1 from the mount
option to the algorithm defined level. The default is level 3, which provides
a reasonably good compression ratio and is still reasonably fast. The
difference in compression gain of levels 7, 8 and 9 is comparable but the
higher levels take longer.

The ZSTD support includes levels 1 to 15, a subset of the full range of what
ZSTD provides. Levels 1-3 are real-time, 4-8 slower with improved compression
and 9-15 try even harder though the resulting size may not be significantly
improved.

Level 0 always maps to the default. The compression level does not affect
compatibility.

Incompressible data
-------------------

Files with already compressed data or with data that won't compress well with
the CPU and memory constraints of the kernel implementations are using a simple
decision logic. If the first portion of data being compressed is not smaller
than the original, the compression of the file is disabled -- unless the
filesystem is mounted with *compress-force*. In that case compression will
always be attempted on the file only to be later discarded. This is not optimal
and subject to optimizations and further development.

If a file is identified as incompressible, a flag is set (*NOCOMPRESS*) and
it's sticky. On that file compression won't be performed unless forced. The
flag can be also set by **chattr +m** (since e2fsprogs 1.46.2) or by properties
with value *no* or *none*. An empty value will reset it to the default that's
currently applicable on the mounted filesystem.

There are two ways to detect incompressible data:

* actual compression attempt -- data are compressed, if the result is not
  smaller, it's discarded, so this depends on the algorithm and level
* pre-compression heuristics -- a quick statistical evaluation on the data is
  performed and based on the result either compression is performed or skipped,
  the NOCOMPRESS bit is not set just by the heuristic, only if the compression
  algorithm does not make an improvement

.. code-block:: shell

   $ lsattr file
   ---------------------m file

Using forced compression is not recommended, the heuristics are supposed to
decide that and compression algorithms internally detect incompressible data
too.

Pre-compression heuristics
--------------------------

The heuristics aim to do a few quick statistical tests on the data to be
compressed in order to avoid probably costly compression that would turn out
to be inefficient. Compression algorithms could have internal detection of
incompressible data too but this leads to more overhead as the compression is
done in another thread and has to write the data anyway. The heuristic is
read-only and can utilize cached memory.

The tests performed are based on the following: data sampling, long repeated
pattern detection, byte frequency, Shannon entropy.

Compatibility
-------------

Compression is done using the COW mechanism so it's incompatible with
*nodatacow*. Direct IO works on compressed files but will fall back to buffered
writes and leads to recompression. Currently *nodatasum* and compression don't
work together.

The compression algorithms have been added over time so the version
compatibility should be also considered, together with other tools that may
access the compressed data like bootloaders.
Documentation/ch-convert-intro.rst (new file)
@ -0,0 +1,97 @@
The **btrfs-convert** tool can be used to convert an existing source filesystem
image to a btrfs filesystem in-place. The original filesystem image is
accessible in a subvolume named like *ext2_saved* as the file *image*.

Supported filesystems:

* ext2, ext3, ext4 -- original feature, always built in

* reiserfs -- since version 4.13, optionally built, requires libreiserfscore 3.6.27

* ntfs -- external tool https://github.com/maharmstone/ntfs2btrfs

The list of source filesystems supported by a given binary is printed at the
end of the help (option *--help*).
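For reference, the conversion itself (and the rollback, while still possible)
is a single command each; a minimal sketch, assuming the source filesystem
lives on */dev/sdx*:

.. code-block:: bash

   # btrfs-convert /dev/sdx
   # btrfs-convert -r /dev/sdx    # roll back to the original filesystem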
.. warning::
   If you are going to perform a rollback to the original filesystem, you
   should not execute the **btrfs balance** command on the converted
   filesystem. This will change the extent layout and make **btrfs-convert**
   unable to roll back.

The conversion utilizes free space of the original filesystem. The exact
estimate of the required space cannot be foretold. The final btrfs metadata
might occupy several gigabytes on a hundreds-gigabyte filesystem.

If the ability to roll back is no longer important, then it is recommended to
perform a few more steps to transition the btrfs filesystem to a more compact
layout. This is because the conversion inherits the original data blocks'
fragmentation, and also because the metadata blocks are bound to the original
free space layout.

Due to different constraints, it is only possible to convert filesystems that
have a supported data block size (ie. the same that would be valid for
**mkfs.btrfs**). This is typically the system page size (4KiB on x86_64
machines).

**BEFORE YOU START**

The source filesystem must be clean, eg. no journal to replay and no repairs
needed. The respective **fsck** utility must be run on the source filesystem
prior to conversion. Please refer to the manual pages in case you encounter
problems.

For ext2/3/4:

.. code-block:: bash

   # e2fsck -fvy /dev/sdx

For reiserfs:

.. code-block:: bash

   # reiserfsck -fy /dev/sdx

Skipping that step could lead to incorrect results on the target filesystem,
but it may work.

**REMOVE THE ORIGINAL FILESYSTEM METADATA**

By removing the subvolume named like *ext2_saved* or *reiserfs_saved*, all
metadata of the original filesystem will be removed:

.. code-block:: bash

   # btrfs subvolume delete /mnt/ext2_saved

At this point it is not possible to do a rollback. The filesystem is usable but
may be impacted by the fragmentation inherited from the original filesystem.

**MAKE FILE DATA MORE CONTIGUOUS**

An optional but recommended step is to run defragmentation on the entire
filesystem. This will attempt to make file extents more contiguous.

.. code-block:: bash

   # btrfs filesystem defrag -v -r -f -t 32M /mnt/btrfs

Verbose recursive defragmentation (*-v*, *-r*), flush data per-file (*-f*) with
target extent size 32MiB (*-t*).

**ATTEMPT TO MAKE BTRFS METADATA MORE COMPACT**

Optional but recommended step.

The metadata block groups after conversion may be smaller than the default size
(256MiB or 1GiB). Running a balance will attempt to merge the block groups.
This depends on the free space layout (and fragmentation) and may fail due to
lack of enough work space. This is a soft error leaving the filesystem usable
but the block group layout may remain unchanged.

Note that the balance operation takes a lot of time, please see also
``btrfs-balance(8)``.

.. code-block:: bash

   # btrfs balance start -m /mnt/btrfs
Documentation/ch-quota-intro.rst (new file)
@ -0,0 +1,198 @@
The concept of quota has a long-standing tradition in the Unix world. Ever
since computers have allowed multiple users to work simultaneously in one
filesystem, there has been the need to prevent one user from using up the
entire space. Every user should get his fair share of the available resources.

In case of files, the solution is quite straightforward. Each file has an
*owner* recorded along with it, and it has a size. Traditional quota just
restricts the total size of all files that are owned by a user. The concept is
quite flexible: if a user hits his quota limit, the administrator can raise it
on the fly.

On the other hand, the traditional approach has only a poor solution to
restrict directories.
At installation time, the harddisk can be partitioned so that every directory
(eg. /usr, /var, ...) that needs a limit gets its own partition. The obvious
problem is that those limits cannot be changed without a reinstallation. The
btrfs subvolume feature builds a bridge. Subvolumes correspond in many ways to
partitions, as every subvolume looks like its own filesystem. With subvolume
quota, it is now possible to restrict each subvolume like a partition, but keep
the flexibility of quota. The space for each subvolume can be expanded or
restricted on the fly.

As subvolumes are the basis for snapshots, interesting questions arise as to
how to account used space in the presence of snapshots. If you have a file
shared between a subvolume and a snapshot, whom to account the file to? The
creator? Both? What if the file gets modified in the snapshot, should only
these changes be accounted to it? But wait, both the snapshot and the subvolume
belong to the same user home. I just want to limit the total space used by
both! But somebody else might not want to charge the snapshots to the users.

Btrfs subvolume quota solves these problems by introducing groups of subvolumes
and letting the user put limits on them. It is even possible to have groups of
groups. In the following, we refer to them as *qgroups*.

Each qgroup primarily tracks two numbers, the amount of total referenced
space and the amount of exclusively referenced space.

referenced
   space is the amount of data that can be reached from any of the
   subvolumes contained in the qgroup, while
exclusive
   is the amount of data where all references to this data can be reached
   from within this qgroup.

SUBVOLUME QUOTA GROUPS
^^^^^^^^^^^^^^^^^^^^^^

The basic notion of the Subvolume Quota feature is the quota group, short
qgroup. Qgroups are notated as *level/id*, eg. the qgroup 3/2 is a qgroup of
level 3. For level 0, the leading '0/' can be omitted.
Qgroups of level 0 get created automatically when a subvolume/snapshot gets
created. The ID of the qgroup corresponds to the ID of the subvolume, so 0/5
is the qgroup for the root subvolume.
For the ``btrfs qgroup`` command, the path to the subvolume can also be used
instead of *0/ID*. For all higher levels, the ID can be chosen freely.

Each qgroup can contain a set of lower level qgroups, thus creating a hierarchy
of qgroups. Figure 1 shows an example qgroup tree.

.. code-block:: none

                                 +---+
                                 |2/1|
                                 +---+
                                /     \
                          +---+/       \+---+
                          |1/1|         |1/2|
                          +---+         +---+
                         /     \       /     \
                   +---+/       \+---+/       \+---+
          qgroups  |0/1|         |0/2|         |0/3|
                   +-+-+         +---+         +---+
                     |          /     \       /     \
                     |         /       \     /       \
                     |        /         \   /         \
          extents    1       2            3            4

Figure 1: Sample qgroup hierarchy

At the bottom, some extents are depicted showing which qgroups reference which
extents. It is important to understand the notion of *referenced* vs
*exclusive*. In the example, qgroup 0/2 references extents 2 and 3, while 1/2
references extents 2-4, and 2/1 references all extents.

On the other hand, extent 1 is exclusive to 0/1, extent 2 is exclusive to 0/2,
while extent 3 is neither exclusive to 0/2 nor to 0/3. But because both
references can be reached from 1/2, extent 3 is exclusive to 1/2. All extents
are exclusive to 2/1.

So exclusive does not mean there is no other way to reach the extent, but it
does mean that if you delete all subvolumes contained in a qgroup, the extent
will get deleted.

The exclusive count of a qgroup conveys the useful information of how much
space will be freed in case all subvolumes of the qgroup get deleted.

All data extents are accounted this way. Metadata that belongs to a specific
subvolume (i.e. its filesystem tree) is also accounted. Checksums and extent
allocation information are not accounted.

In turn, the referenced count of a qgroup can be limited. All writes beyond
this limit will lead to a 'Quota Exceeded' error.
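Putting the concepts together, enabling quotas and limiting the referenced
space of a qgroup might look like this (a sketch; the mount point, subvolume
path and size are examples):

.. code-block:: bash

   # btrfs quota enable /mnt
   # btrfs qgroup limit 1G /mnt/subvol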
INHERITANCE
^^^^^^^^^^^

Things get a bit more complicated when new subvolumes or snapshots are created.
The case of (empty) subvolumes is still quite easy. If a subvolume should be
part of a qgroup, it has to be added to the qgroup at creation time. To add it
at a later time, it would be necessary to at least rescan the full subvolume
for a proper accounting.

Creation of a snapshot is the hard case. Obviously, the snapshot will
reference the exact amount of space as its source, and both source and
destination now have an exclusive count of 0 (the filesystem nodesize to be
precise, as the roots of the trees are not shared). But what about qgroups of
higher levels? If the qgroup contains both the source and the destination,
nothing changes. If the qgroup contains only the source, it might lose some
exclusive.

But how much? The tempting answer is, subtract all exclusive of the source from
the qgroup, but that is wrong, or at least not enough. There could have been
an extent that is referenced from the source and another subvolume from that
qgroup. This extent would have been exclusive to the qgroup, but not to the
source subvolume. With the creation of the snapshot, the qgroup would also
lose this extent from its exclusive set.

So how can this problem be solved? In the instant the snapshot gets created, we
already have to know the correct exclusive count. We need to have a second
qgroup that contains all the subvolumes as the first qgroup, except the
subvolume we want to snapshot. The moment we create the snapshot, the
exclusive count from the second qgroup needs to be copied to the first qgroup,
as it represents the correct value. The second qgroup is called a tracking
qgroup. It is only there in case a snapshot is needed.

USE CASES
^^^^^^^^^

Below are some use cases that are not meant to be exhaustive. You can find
your own way to integrate qgroups.

SINGLE-USER MACHINE
"""""""""""""""""""

``Replacement for partitions``

The simplest use case is to use qgroups as a simple replacement for partitions.
Btrfs takes the disk as a whole, and /, /usr, /var, etc. are created as
subvolumes. As each subvolume gets its own qgroup automatically, they can
simply be restricted. No hierarchy is needed for that.

``Track usage of snapshots``

When a snapshot is taken, a qgroup for it will automatically be created with
the correct values. 'Referenced' will show how much is in it, possibly shared
with other subvolumes. 'Exclusive' will be the amount of space that gets freed
when the subvolume is deleted.

MULTI-USER MACHINE
""""""""""""""""""

``Restricting homes``

When you have several users on a machine, with home directories probably under
/home, you might want to restrict /home as a whole, while restricting every
user to an individual limit as well. This is easily accomplished by creating a
qgroup for /home, eg. 1/1, and assigning all user subvolumes to it.
Restricting this qgroup will limit /home, while every user subvolume can get
its own (lower) limit, as shown in the sketch below.
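A sketch of this setup (the qgroup IDs, the subvolume qgroup 0/257 and the
sizes are illustrative):

.. code-block:: bash

   # btrfs qgroup create 1/1 /home
   # btrfs qgroup assign 0/257 1/1 /home    # repeat for each user subvolume
   # btrfs qgroup limit 100G 1/1 /home      # limit /home as a whole
   # btrfs qgroup limit 10G 0/257 /home     # individual (lower) limit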
``Accounting snapshots to the user``

Let's say the user is allowed to create snapshots via some mechanism. It would
only be fair to account space used by the snapshots to the user. This does not
mean the user doubles his usage as soon as he takes a snapshot. Of course,
files that are present in his home and the snapshot should only be accounted
once. This can be accomplished by creating a qgroup for each user, say
'1/UID'. The user home and all snapshots are assigned to this qgroup.
Limiting it will extend the limit to all snapshots, counting files only once.
To limit /home as a whole, a higher level group 2/1 replacing 1/1 from the
previous example is needed, with all user qgroups assigned to it.

``Do not account snapshots``

On the other hand, when the snapshots get created automatically, the user has
no chance to control them, so the space used by them should not be accounted to
him. This is already the case when creating snapshots in the example from
the previous section.

``Snapshots for backup purposes``

This scenario is a mixture of the previous two. The user can create snapshots,
but some snapshots for backup purposes are being created by the system. The
user's snapshots should be accounted to the user, not the system. The solution
is similar to the one from section 'Accounting snapshots to the user', but do
not assign system snapshots to the user's qgroup.
Documentation/ch-scrub-intro.rst (new file)
@ -0,0 +1,28 @@
Scrub is a pass over all filesystem data and metadata that verifies the
checksums. If a valid copy is available (replicated block group profiles) then
the damaged one is repaired. All copies of the replicated profiles are
validated.

.. note::
   Scrub is not a filesystem checker (fsck) and does not verify nor repair
   structural damage in the filesystem. It really only checks checksums of data
   and tree blocks, it doesn't ensure the content of tree blocks is valid and
   consistent. There's some validation performed when metadata blocks are read
   from disk but it's not extensive and cannot substitute a full *btrfs check*
   run.

The user is supposed to run it manually or via a periodic system service. The
recommended period is a month but could be less. The estimated device bandwidth
utilization is about 80% on an idle filesystem. The IO priority class is by
default *idle* so background scrub should not significantly interfere with
normal filesystem operation. The IO scheduler set for the device(s) might not
support the priority classes though.

The scrubbing status is recorded in */var/lib/btrfs/* in textual files named
*scrub.status.UUID* for a filesystem identified by the given UUID. (Progress
state is communicated through a named pipe in file *scrub.progress.UUID* in the
same directory.) The status file is updated every 5 seconds. A resumed scrub
will continue from the last saved position.

Scrub can be started only on a mounted filesystem, though it's possible to
scrub only a selected device. See **btrfs scrub start** for more.
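For illustration, a scrub of a mounted filesystem is started and monitored
like this (a sketch; the mount point is an example):

.. code-block:: bash

   # btrfs scrub start /mnt
   # btrfs scrub status /mnt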
Documentation/ch-seeding-device.rst (new file)
@ -0,0 +1,78 @@
The COW mechanism and multiple devices under one hood enable an interesting
concept, called a seeding device: extending a read-only filesystem on a
single device with another device that captures all writes. For example,
imagine an immutable golden image of an operating system enhanced with another
device that allows using the data from the golden image while operating
normally. This idea originated on CD-ROMs with a base OS, allowing them to be
used for live systems, but this has become obsolete. There are technologies
providing similar functionality, like *unionmount*, *overlayfs* or *qcow2*
image snapshots.

The seeding device starts as a normal filesystem; once the contents are ready,
**btrfstune -S 1** is used to flag it as a seeding device. Mounting such a
device will not allow any writes, except adding a new device by **btrfs
device add**. Then the filesystem can be remounted as read-write.

Given that the filesystem on the seeding device is always recognized as
read-only, it can be used to seed multiple filesystems at the same time. The
UUID that is normally attached to a device is automatically changed to a
random UUID on each mount.

Once the seeding device is mounted, it needs the writable device. After adding
it, something like **mount -o remount,rw /path** makes the filesystem at
*/path* ready for use. The simplest usecase is to throw away all changes by
unmounting the filesystem when convenient.

Alternatively, deleting the seeding device from the filesystem can turn it into
a normal filesystem, provided that the writable device can also contain all the
data from the seeding device.

The seeding device flag can be cleared again by **btrfstune -f -S 0**, eg.
to allow updating it with newer data, but please note that this will invalidate
all existing filesystems that use this particular seeding device. This works
for some usecases, not for others, and a forcing flag to the command is
mandatory to avoid accidental mistakes.

An example of how to create and use a seeding device:

.. code-block:: bash

   # mkfs.btrfs /dev/sda
   # mount /dev/sda /mnt/mnt1
   # ... fill mnt1 with data
   # umount /mnt/mnt1

   # btrfstune -S 1 /dev/sda
   # mount /dev/sda /mnt/mnt1
   # btrfs device add /dev/sdb /mnt/mnt1
   # mount -o remount,rw /mnt/mnt1
   # ... /mnt/mnt1 is now writable

Now */mnt/mnt1* can be used normally. The device */dev/sda* can be mounted
again with another writable device:

.. code-block:: bash

   # mount /dev/sda /mnt/mnt2
   # btrfs device add /dev/sdc /mnt/mnt2
   # mount -o remount,rw /mnt/mnt2
   # ... /mnt/mnt2 is now writable

The writable device (*/dev/sdb*) can be decoupled from the seeding device and
used independently:

.. code-block:: bash

   # btrfs device delete /dev/sda /mnt/mnt1

As the contents originated in the seeding device, it's possible to turn
*/dev/sdb* into a seeding device again and repeat the whole process.

A few things to note:

* it's recommended to use only a single device for the seeding device, it
  works for multiple devices but the *single* profile must be used in order
  to make the seeding device deletion work
* block group profiles *single* and *dup* support the usecases above
* the label is copied from the seeding device and can be changed by **btrfs filesystem label**
* each new mount of the seeding device gets a new random UUID