btrfs-progs: docs: add more chapters (part 2)
The feature pages share the contents with the manual page section 5 so put
the contents to separate files. Progress: 2/3.

Signed-off-by: David Sterba <dsterba@suse.com>
parent b871bf49f3
commit c6be84840f
@ -1,4 +1,8 @@
Auto-repair on read
===================

...
Data or metadata that are found to be damaged (eg. because the checksum does
not match) at the time they're read from the device can be salvaged in case the
filesystem has another valid copy when using a block group profile with
redundancy (DUP, RAID1, RAID5/6). The correct data are returned to the user
application and the damaged copy is replaced by it.

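A minimal way to check whether such repairs have happened (assuming the
filesystem is mounted at the hypothetical path */mnt*) is to look at the
per-device error counters and the kernel log; the exact log wording may differ
between kernel versions:

.. code-block:: bash

   # btrfs device stats /mnt
   # dmesg | grep -iE 'btrfs.*corrected'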
@ -1,4 +1,4 @@
Convert
=======

...
.. include:: ch-convert-intro.rst

@ -1,4 +1,44 @@
Deduplication
=============

...
Going by the definition in the context of filesystems, it's a process of
looking up identical data blocks tracked separately and creating a shared
logical link while removing one of the copies of the data blocks. This leads to
data space savings while it increases metadata consumption.

There are two main deduplication types:

* **in-band** *(sometimes also called on-line)* -- all newly written data are
  considered for deduplication before writing
* **out-of-band** *(sometimes also called offline)* -- data for deduplication
  have to be actively looked for and deduplicated by the user application

Both have their pros and cons. BTRFS implements **only the out-of-band** type.

BTRFS provides the basic building blocks for deduplication allowing other tools
to choose the strategy and scope of the deduplication. There are multiple
tools that take different approaches to deduplication, offer additional
features or make trade-offs. The following table lists tools that are known to
be up-to-date, maintained and widely used.

.. list-table::
   :header-rows: 1

   * - Name
     - File based
     - Block based
     - Incremental
   * - `BEES <https://github.com/Zygo/bees>`_
     - No
     - Yes
     - Yes
   * - `duperemove <https://github.com/markfasheh/duperemove>`_
     - Yes
     - No
     - Yes

Legend:

- *File based*: the tool takes a list of files and deduplicates blocks only from that set
- *Block based*: the tool enumerates blocks and looks for duplicates
- *Incremental*: repeated runs of the tool utilize information gathered from previous runs

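As an illustration of the out-of-band type, a typical invocation of one of the
tools above, *duperemove*, could look like the sketch below: *-r* recurses into
the directory, *--hashfile* keeps the block hashes so repeated runs can be
incremental, and *-d* submits the found duplicates for deduplication. The paths
are hypothetical; see the tool's documentation for the authoritative options.

.. code-block:: bash

   # duperemove -dhr --hashfile=/var/tmp/dedup.hash /mnt/data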
@ -1,4 +1,22 @@
Defragmentation
===============

...
Defragmentation of files is supposed to make the layout of the file extents
more linear, or at least coalesce the file extents into larger ones that can
be stored on the device more efficiently. The need for defragmentation stems
from the COW design that BTRFS is built on and is inherent to it. The
fragmentation is caused by rewrites of the same file data in-place, which have
to be handled by creating a new copy that may lie at a distant location on the
physical device. Fragmentation is the worst problem on rotational hard disks
due to the delay caused by moving the drive heads to the distant location. With
modern seek-less devices it's not a problem, though defragmentation may still
make sense because of the reduced size of the metadata that's needed to track
the scattered extents.

File data that are in use can be safely defragmented because the whole process
happens inside the page cache, which is the central point caching the file data
and takes care of synchronization. Once a filesystem sync or flush is started
(either manually or automatically) all the dirty data get written to the
devices. This however reduces the chances to find an optimal layout as the
writes happen together with other data and the result depends on the remaining
free space layout and fragmentation.

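A minimal sketch of how defragmentation is typically started on a mounted
filesystem (the path */mnt/data* is hypothetical, see ``btrfs-filesystem(8)``
for all options); *-r* recurses into the directory and *-t* sets the target
extent size:

.. code-block:: bash

   # btrfs filesystem defragment -r -t 32M /mnt/data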
@ -1,6 +1,18 @@
Flexibility
===========

* dynamic inode creation (no preallocated space)
The underlying design of BTRFS data structures allows a lot of flexibility and
making changes after filesystem creation, like resizing, adding/removing space
or enabling some features on-the-fly.

* block group profile change on-the-fly
* **dynamic inode creation** -- there's no fixed space or tables for tracking
  inodes so the number of inodes that can be created is bounded by the metadata
  space and its utilization

* **block group profile change on-the-fly** -- the block group profiles can be
  changed on a mounted filesystem by running the balance operation and
  specifying the conversion filters

* **resize** -- the space occupied by the filesystem on each device can be
  resized up (grow) or down (shrink) as long as the amount of data can still be
  contained on the device

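As a sketch of the on-the-fly profile change mentioned above, converting data
and metadata of a mounted filesystem to RAID1 with the balance conversion
filters could look like this (the mount point */mnt* is hypothetical):

.. code-block:: bash

   # btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt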
@ -1,4 +1,4 @@
Quota groups
============

...
.. include:: ch-quota-intro.rst

@ -1,4 +1,29 @@
Reflink
=======

...
Reflink is a type of shallow copy of file data that shares the blocks but
otherwise the files are independent and any change to the file will not affect
the other. This builds on the underlying COW mechanism. A reflink will
effectively create only separate metadata pointing to the shared blocks, which
is typically much faster than a deep copy of all blocks.

The reflink is typically meant for whole files but a partial file range can
also be copied, though there are no ready-made tools for that.

.. code-block:: shell

   cp --reflink=always source target

There are some constraints:

- cross-filesystem reflink is not possible, there's nothing in common between
  the filesystems so the block sharing can't work
- reflink crossing two mount points of the same filesystem does not work due
  to an artificial limitation in VFS (this may change in the future)
- reflink requires that the source and target files have the same status regarding
  NOCOW and checksums, for example if the source file is NOCOW (once created
  with the chattr +C attribute) then the above command won't work unless the
  target file is pre-created with the +C attribute as well, or the NOCOW
  attribute is inherited from the parent directory (chattr +C on the directory)
  or the whole filesystem is mounted with *-o nodatacow* that would create
  the NOCOW files by default (see the sketch after this list)

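A minimal sketch of the NOCOW constraint above: pre-creating the target with
the +C attribute so that the reflink of a NOCOW source succeeds (the file names
are hypothetical):

.. code-block:: shell

   touch target
   chattr +C target
   cp --reflink=always source target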
@ -1,4 +1,12 @@
Resize
======

...
A mounted BTRFS filesystem can be resized after creation, grown or shrunk. On a
multi-device filesystem the space occupied on each device can be resized
independently. Data that reside in the area that would be outside of the new
size are relocated to the remaining space below the limit, so this constrains
the minimum size to which a filesystem can be shrunk.

Growing a filesystem is quick as it only needs to take note of the available
space, while shrinking a filesystem needs to relocate potentially lots of data
and this is IO intensive. It is possible to shrink a filesystem in smaller steps.

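For illustration, growing by 10GiB, shrinking by 5GiB and growing to the
maximum available size of a mounted filesystem (the mount point */mnt* is
hypothetical, see ``btrfs-filesystem(8)``):

.. code-block:: bash

   # btrfs filesystem resize +10G /mnt
   # btrfs filesystem resize -5G /mnt
   # btrfs filesystem resize max /mnt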
@ -1,4 +1,4 @@
Scrub
=====

...
.. include:: ch-scrub-intro.rst

@ -1,4 +1,23 @@
Balance
=======
Send/receive
============

...
Send and receive are complementary features that allow transferring data from
one filesystem to another in a streamable format. The send part traverses a
given read-only subvolume and either creates a full stream representation of
its data and metadata (*full mode*), or given a set of subvolumes for reference
it generates a difference relative to that set (*incremental mode*).

Receive on the other hand takes the stream and reconstructs a subvolume with
files and directories equivalent to the filesystem that was used to produce the
stream. The result is not exactly 1:1, eg. inode numbers can be different and
other unique identifiers can be different (like the subvolume UUIDs). The full
mode starts with an empty subvolume, creates all the files and then turns the
subvolume read-only. At this point it could be used as a starting point for a
future incremental send stream, provided it would be generated from the same
source subvolume on the other filesystem.

The stream is a sequence of encoded commands that change eg. file metadata
(owner, permissions, extended attributes), data extents (create, clone,
truncate), whole file operations (rename, delete). The stream can be sent over
the network, piped directly to the receive command or saved to a file. Each
command in the stream is protected by a CRC32C checksum.

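A minimal sketch of a full and a follow-up incremental transfer (the paths and
snapshot names are hypothetical): the first two commands create a read-only
snapshot and send it in full mode, the last two create a newer snapshot and
send only the difference against the previous one (*-p*), piping the stream
directly to the receive command.

.. code-block:: bash

   # btrfs subvolume snapshot -r /mnt/data /mnt/data/snap1
   # btrfs send /mnt/data/snap1 | btrfs receive /backup

   # btrfs subvolume snapshot -r /mnt/data /mnt/data/snap2
   # btrfs send -p /mnt/data/snap1 /mnt/data/snap2 | btrfs receive /backup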
@ -9,102 +9,7 @@ SYNOPSIS
DESCRIPTION
-----------

**btrfs-convert** is used to convert an existing source filesystem image to a btrfs
filesystem in-place. The original filesystem image is accessible in a subvolume
named like *ext2_saved* as file *image*.

Supported filesystems:

* ext2, ext3, ext4 -- original feature, always built in

* reiserfs -- since version 4.13, optionally built, requires libreiserfscore 3.6.27

* ntfs -- external tool https://github.com/maharmstone/ntfs2btrfs

The list of supported source filesystems by a given binary is listed at the end
of help (option *--help*).

.. warning::
   If you are going to perform rollback to the original filesystem, you
   should not execute **btrfs balance** command on the converted filesystem. This
   will change the extent layout and make **btrfs-convert** unable to roll back.

The conversion utilizes free space of the original filesystem. The exact
estimate of the required space cannot be foretold. The final btrfs metadata
might occupy several gigabytes on a hundreds-gigabyte filesystem.

If the ability to roll back is no longer important, then it is recommended to
perform a few more steps to transition the btrfs filesystem to a more compact
layout. This is because the conversion inherits the original data blocks'
fragmentation, and also because the metadata blocks are bound to the original
free space layout.

Due to different constraints, it is only possible to convert filesystems that
have a supported data block size (ie. the same that would be valid for
**mkfs.btrfs**). This is typically the system page size (4KiB on x86_64
machines).

**BEFORE YOU START**

The source filesystem must be clean, eg. no journal to replay and no repairs
needed. The respective **fsck** utility must be run on the source filesystem prior
to conversion. Please refer to the manual pages in case you encounter problems.

For ext2/3/4:

.. code-block:: bash

   # e2fsck -fvy /dev/sdx

For reiserfs:

.. code-block:: bash

   # reiserfsck -fy /dev/sdx

Skipping that step could lead to incorrect results on the target filesystem,
but it may work.

**REMOVE THE ORIGINAL FILESYSTEM METADATA**

By removing the subvolume named like *ext2_saved* or *reiserfs_saved*, all
metadata of the original filesystem will be removed:

.. code-block:: bash

   # btrfs subvolume delete /mnt/ext2_saved

At this point it is not possible to do a rollback. The filesystem is usable but
may be impacted by the fragmentation inherited from the original filesystem.

**MAKE FILE DATA MORE CONTIGUOUS**

An optional but recommended step is to run defragmentation on the entire
filesystem. This will attempt to make file extents more contiguous.

.. code-block:: bash

   # btrfs filesystem defrag -v -r -f -t 32M /mnt/btrfs

Verbose recursive defragmentation (*-v*, *-r*), flush data per-file (*-f*) with
target extent size 32MiB (*-t*).

**ATTEMPT TO MAKE BTRFS METADATA MORE COMPACT**

Optional but recommended step.

The metadata block groups after conversion may be smaller than the default size
(256MiB or 1GiB). Running a balance will attempt to merge the block groups.
This depends on the free space layout (and fragmentation) and may fail due to
lack of enough work space. This is a soft error leaving the filesystem usable
but the block group layout may remain unchanged.

Note that the balance operation takes a lot of time, please see also
``btrfs-balance(8)``.

.. code-block:: bash

   # btrfs balance start -m /mnt/btrfs
.. include:: ch-convert-intro.rst

OPTIONS
-------

@ -36,202 +36,7 @@ gradually improving and issues found and fixed.
HIERARCHICAL QUOTA GROUP CONCEPTS
---------------------------------

The concept of quota has a long-standing tradition in the Unix world. Ever
since computers allow multiple users to work simultaneously in one filesystem,
there is the need to prevent one user from using up the entire space. Every
user should get his fair share of the available resources.

In case of files, the solution is quite straightforward. Each file has an
'owner' recorded along with it, and it has a size. Traditional quota just
restricts the total size of all files that are owned by a user. The concept is
quite flexible: if a user hits his quota limit, the administrator can raise it
on the fly.

On the other hand, the traditional approach has only a poor solution to
restrict directories.
At installation time, the harddisk can be partitioned so that every directory
(eg. /usr, /var/, ...) that needs a limit gets its own partition. The obvious
problem is that those limits cannot be changed without a reinstallation. The
btrfs subvolume feature builds a bridge. Subvolumes correspond in many ways to
partitions, as every subvolume looks like its own filesystem. With subvolume
quota, it is now possible to restrict each subvolume like a partition, but keep
the flexibility of quota. The space for each subvolume can be expanded or
restricted on the fly.

As subvolumes are the basis for snapshots, interesting questions arise as to
how to account used space in the presence of snapshots. If you have a file
shared between a subvolume and a snapshot, whom to account the file to? The
creator? Both? What if the file gets modified in the snapshot, should only
these changes be accounted to it? But wait, both the snapshot and the subvolume
belong to the same user home. I just want to limit the total space used by
both! But somebody else might not want to charge the snapshots to the users.

Btrfs subvolume quota solves these problems by introducing groups of subvolumes
and lets the user put limits on them. It is even possible to have groups of
groups. In the following, we refer to them as 'qgroups'.

Each qgroup primarily tracks two numbers, the amount of total referenced
space and the amount of exclusively referenced space.

referenced
   space is the amount of data that can be reached from any of the
   subvolumes contained in the qgroup, while
exclusive
   is the amount of data where all references to this data can be reached
   from within this qgroup.

SUBVOLUME QUOTA GROUPS
^^^^^^^^^^^^^^^^^^^^^^

The basic notion of the Subvolume Quota feature is the quota group, short
qgroup. Qgroups are notated as 'level/id', eg. the qgroup 3/2 is a qgroup of
level 3. For level 0, the leading '0/' can be omitted.
Qgroups of level 0 get created automatically when a subvolume/snapshot gets
created. The ID of the qgroup corresponds to the ID of the subvolume, so 0/5
is the qgroup for the root subvolume.
For the *btrfs qgroup* command, the path to the subvolume can also be used
instead of '0/ID'. For all higher levels, the ID can be chosen freely.

Each qgroup can contain a set of lower level qgroups, thus creating a hierarchy
of qgroups. Figure 1 shows an example qgroup tree.

.. code-block:: none

                              +---+
                              |2/1|
                              +---+
                             /     \
                       +---+/       \+---+
                       |1/1|          |1/2|
                       +---+          +---+
                      /     \        /     \
                +---+/       \+---+/       \+---+
      qgroups   |0/1|          |0/2|         |0/3|
                +-+-+          +---+         +---+
                  |           /     \       /     \
                  |          /       \     /       \
                  |         /         \   /         \
      extents     1        2            3  4

Figure1: Sample qgroup hierarchy

At the bottom, some extents are depicted showing which qgroups reference which
extents. It is important to understand the notion of *referenced* vs
*exclusive*. In the example, qgroup 0/2 references extents 2 and 3, while 1/2
references extents 2-4, 2/1 references all extents.

On the other hand, extent 1 is exclusive to 0/1, extent 2 is exclusive to 0/2,
while extent 3 is neither exclusive to 0/2 nor to 0/3. But because both
references can be reached from 1/2, extent 3 is exclusive to 1/2. All extents
are exclusive to 2/1.

So exclusive does not mean there is no other way to reach the extent, but it
does mean that if you delete all subvolumes contained in a qgroup, the extent
will get deleted.

The exclusive value of a qgroup conveys the useful information of how much
space will be freed in case all subvolumes of the qgroup get deleted.

All data extents are accounted this way. Metadata that belongs to a specific
subvolume (i.e. its filesystem tree) is also accounted. Checksums and extent
allocation information are not accounted.

In turn, the referenced count of a qgroup can be limited. All writes beyond
this limit will lead to a 'Quota Exceeded' error.

INHERITANCE
^^^^^^^^^^^

Things get a bit more complicated when new subvolumes or snapshots are created.
The case of (empty) subvolumes is still quite easy. If a subvolume should be
part of a qgroup, it has to be added to the qgroup at creation time. To add it
at a later time, it would be necessary to at least rescan the full subvolume
for a proper accounting.

Creation of a snapshot is the hard case. Obviously, the snapshot will
reference exactly the same amount of space as its source, and both source and
destination now have an exclusive count of 0 (the filesystem nodesize to be
precise, as the roots of the trees are not shared). But what about qgroups of
higher levels? If the qgroup contains both the source and the destination,
nothing changes. If the qgroup contains only the source, it might lose some
exclusive.

But how much? The tempting answer is, subtract all exclusive of the source from
the qgroup, but that is wrong, or at least not enough. There could have been
an extent that is referenced from the source and another subvolume from that
qgroup. This extent would have been exclusive to the qgroup, but not to the
source subvolume. With the creation of the snapshot, the qgroup would also
lose this extent from its exclusive set.

So how can this problem be solved? At the instant the snapshot gets created, we
already have to know the correct exclusive count. We need to have a second
qgroup that contains the same subvolumes as the first qgroup, except the
subvolume we want to snapshot. The moment we create the snapshot, the
exclusive count from the second qgroup needs to be copied to the first qgroup,
as it represents the correct value. The second qgroup is called a tracking
qgroup. It is only there in case a snapshot is needed.

USE CASES
^^^^^^^^^

Below are some use cases that are not meant to be exhaustive. You can find your
own way to integrate qgroups.

SINGLE-USER MACHINE
"""""""""""""""""""

``Replacement for partitions``

The simplest use case is to use qgroups as a simple replacement for partitions.
Btrfs takes the disk as a whole, and /, /usr, /var, etc. are created as
subvolumes. As each subvolume gets its own qgroup automatically, they can
simply be restricted. No hierarchy is needed for that.

``Track usage of snapshots``

When a snapshot is taken, a qgroup for it will automatically be created with
the correct values. 'Referenced' will show how much is in it, possibly shared
with other subvolumes. 'Exclusive' will be the amount of space that gets freed
when the subvolume is deleted.

MULTI-USER MACHINE
""""""""""""""""""

``Restricting homes``

When you have several users on a machine, with home directories probably under
/home, you might want to restrict /home as a whole, while restricting every
user to an individual limit as well. This is easily accomplished by creating a
qgroup for /home, eg. 1/1, and assigning all user subvolumes to it.
Restricting this qgroup will limit /home, while every user subvolume can get
its own (lower) limit.

``Accounting snapshots to the user``

Let's say the user is allowed to create snapshots via some mechanism. It would
only be fair to account space used by the snapshots to the user. This does not
mean the user doubles his usage as soon as he takes a snapshot. Of course,
files that are present in his home and the snapshot should only be accounted
once. This can be accomplished by creating a qgroup for each user, say
'1/UID'. The user home and all snapshots are assigned to this qgroup.
Limiting it will extend the limit to all snapshots, counting files only once.
To limit /home as a whole, a higher level group 2/1 replacing 1/1 from the
previous example is needed, with all user qgroups assigned to it.

``Do not account snapshots``

On the other hand, when the snapshots get created automatically, the user has
no chance to control them, so the space used by them should not be accounted to
him. This is already the case when creating snapshots in the example from
the previous section.

``Snapshots for backup purposes``

This scenario is a mixture of the previous two. The user can create snapshots,
but some snapshots for backup purposes are being created by the system. The
user's snapshots should be accounted to the user, not the system. The solution
is similar to the one from section 'Accounting snapshots to the user', but do
not assign system snapshots to the user's qgroup.
.. include:: ch-quota-intro.rst

SUBCOMMAND
----------

@ -9,33 +9,7 @@ SYNOPSIS
DESCRIPTION
-----------

**btrfs scrub** is used to scrub a mounted btrfs filesystem, which will read all
data and metadata blocks from all devices and verify checksums. It will
automatically repair corrupted blocks if there's a correct copy available.

.. note::
   Scrub is not a filesystem checker (fsck) and does not verify nor repair
   structural damage in the filesystem. It really only checks checksums of data
   and tree blocks, it doesn't ensure the content of tree blocks is valid and
   consistent. There's some validation performed when metadata blocks are read
   from disk but it's not extensive and cannot substitute a full *btrfs check*
   run.

The user is supposed to run it manually or via a periodic system service. The
recommended period is a month but could be less. The estimated device bandwidth
utilization is about 80% on an idle filesystem. The IO priority class is by
default *idle* so background scrub should not significantly interfere with
normal filesystem operation. The IO scheduler set for the device(s) might not
support the priority classes though.

The scrubbing status is recorded in */var/lib/btrfs/* in textual files named
*scrub.status.UUID* for a filesystem identified by the given UUID. (Progress
state is communicated through a named pipe in file *scrub.progress.UUID* in the
same directory.) The status file is updated every 5 seconds. A resumed scrub
will continue from the last saved position.

Scrub can be started only on a mounted filesystem, though it's possible to
scrub only a selected device. See **scrub start** for more.
.. include:: ch-scrub-intro.rst

SUBCOMMAND
----------

@ -0,0 +1,76 @@
Data and metadata are checksummed by default, the checksum is calculated before
write and verified after reading the blocks. There are several checksum
algorithms supported. The default and backward compatible algorithm is *crc32c*.
Since kernel 5.5 there are three more with different characteristics and
trade-offs regarding speed and strength. The following list may help you to
decide which one to select.

CRC32C (32bit digest)
   default, best backward compatibility, very fast, modern CPUs have
   instruction-level support, not collision-resistant but still good error
   detection capabilities

XXHASH (64bit digest)
   can be used as CRC32C successor, very fast, optimized for modern CPUs utilizing
   instruction pipelining, good collision resistance and error detection

SHA256 (256bit digest)
   a cryptographic-strength hash, relatively slow but with possible CPU
   instruction acceleration or specialized hardware cards, FIPS certified and
   in wide use

BLAKE2b (256bit digest)
   a cryptographic-strength hash, relatively fast with possible CPU acceleration
   using SIMD extensions, not standardized but based on BLAKE which was a SHA3
   finalist, in wide use, the algorithm used is BLAKE2b-256 that's optimized for
   64bit platforms

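For illustration, the checksum algorithm is selected at filesystem creation
time; a minimal sketch using the *--csum* option of **mkfs.btrfs** (the device
name */dev/sdx* is hypothetical):

.. code-block:: bash

   # mkfs.btrfs --csum xxhash /dev/sdx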
|
The *digest size* affects the overall size of data block checksums stored in the
filesystem. The metadata blocks have a fixed area up to 256 bits (32 bytes), so
there's no increase. Each data block has a separate checksum stored, with
additional overhead of the b-tree leaves.

Approximate relative performance of the algorithms, measured against CRC32C
using reference software implementations on a 3.5GHz Intel CPU:

======== ============ ======= ================
Digest   Cycles/4KiB  Ratio   Implementation
======== ============ ======= ================
CRC32C   1700         1.00    CPU instruction
XXHASH   2500         1.44    reference impl.
SHA256   105000       61      reference impl.
SHA256   36000        21      libgcrypt/AVX2
SHA256   63000        37      libsodium/AVX2
BLAKE2b  22000        13      reference impl.
BLAKE2b  19000        11      libgcrypt/AVX2
BLAKE2b  19000        11      libsodium/AVX2
======== ============ ======= ================

Many kernels are configured with SHA256 as built-in and not as a module.
The accelerated versions are however provided by the modules and must be loaded
explicitly (**modprobe sha256**) before mounting the filesystem to make use of
them. You can check in */sys/fs/btrfs/FSID/checksum* which one is used. If you
see *sha256-generic*, then you may want to unmount and mount the filesystem
again; changing that on a mounted filesystem is not possible.
Check the file */proc/crypto*; when the implementation is built-in, you'd find

.. code-block:: none

   name         : sha256
   driver       : sha256-generic
   module       : kernel
   priority     : 100
   ...

while an accelerated implementation is e.g.

.. code-block:: none

   name         : sha256
   driver       : sha256-avx2
   module       : sha256_ssse3
   priority     : 170
   ...

@ -0,0 +1,153 @@
Btrfs supports transparent file compression. There are three algorithms
available: ZLIB, LZO and ZSTD (since v4.14), with various levels.
The compression happens on the level of file extents and the algorithm is
selected by file property, mount option or by a defrag command.
You can have a single btrfs mount point that has some files that are
uncompressed, some that are compressed with LZO, some with ZLIB, for instance
(though you may not want it that way, it is supported).

Once the compression is set, all newly written data will be compressed, ie.
existing data are untouched. Data are split into smaller chunks (128KiB) before
compression to make random rewrites possible without a high performance hit. Due
to the increased number of extents the metadata consumption is higher. The
chunks are compressed in parallel.

The algorithms can be characterized as follows regarding the speed/ratio
trade-offs:

ZLIB
   * slower, higher compression ratio
   * levels: 1 to 9, mapped directly, default level is 3
   * good backward compatibility
LZO
   * faster compression and decompression than zlib, worse compression ratio, designed to be fast
   * no levels
   * good backward compatibility
ZSTD
   * compression comparable to zlib with higher compression/decompression speeds and different ratio
   * levels: 1 to 15
   * since 4.14, levels since 5.1

The differences depend on the actual data set and cannot be expressed by a
single number or recommendation. Higher levels consume more CPU time and may
not bring a significant improvement, lower levels are close to real time.

How to enable compression
-------------------------

Typically the compression can be enabled on the whole filesystem, specified for
the mount point. Note that the compression mount options are shared among all
mounts of the same filesystem, either bind mounts or subvolume mounts.
Please refer to section *MOUNT OPTIONS*.

.. code-block:: shell

   $ mount -o compress=zstd /dev/sdx /mnt

This will enable the ``zstd`` algorithm on the default level (which is 3).
The level can be specified manually too, like ``zstd:3``. Higher levels compress
better at the cost of time. This in turn may cause increased write latency; low
levels are suitable for real-time compression and on a reasonably fast CPU don't
cause performance drops.

.. code-block:: shell

   $ btrfs filesystem defrag -czstd file

The command above will start defragmentation of the whole *file* and apply
the compression, regardless of the mount option. (Note: specifying a level is
not yet implemented). The compression algorithm is not persistent and applies
only to the defragmentation command, for any other writes other compression
settings apply.

Persistent settings on a per-file basis can be set in two ways:

.. code-block:: shell

   $ chattr +c file
   $ btrfs property set file compression zstd

The first command is using the legacy interface of file attributes inherited
from the ext2 filesystem and is not flexible, so by default the *zlib*
compression is set. The other command sets a property on the file with the
given algorithm. (Note: setting a level that way is not yet implemented.)

Compression levels
------------------

The level support of ZLIB has been added in v4.14, LZO does not support levels
(the kernel implementation provides only one), ZSTD level support has been added
in v5.1.

There are 9 levels of ZLIB supported (1 to 9), mapping 1:1 from the mount option
to the algorithm defined level. The default is level 3, which provides a
reasonably good compression ratio and is still reasonably fast. The difference
in compression gain of levels 7, 8 and 9 is comparable but the higher levels
take longer.

The ZSTD support includes levels 1 to 15, a subset of the full range of what
ZSTD provides. Levels 1-3 are real-time, 4-8 slower with improved compression
and 9-15 try even harder though the resulting size may not be significantly
improved.

Level 0 always maps to the default. The compression level does not affect
compatibility.

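For illustration, selecting a non-default level is done as part of the mount
option described above (the device and mount point are hypothetical):

.. code-block:: shell

   $ mount -o compress=zstd:8 /dev/sdx /mnt
   $ mount -o remount,compress=zlib:9 /mnt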
|
Incompressible data
-------------------

Files with already compressed data or with data that won't compress well with
the CPU and memory constraints of the kernel implementations are handled by a
simple decision logic. If the first portion of data being compressed is not
smaller than the original, the compression of the file is disabled -- unless the
filesystem is mounted with *compress-force*. In that case compression will
always be attempted on the file only to be later discarded. This is not optimal
and subject to optimizations and further development.

If a file is identified as incompressible, a flag is set (*NOCOMPRESS*) and it's
sticky. On that file compression won't be performed unless forced. The flag
can be also set by **chattr +m** (since e2fsprogs 1.46.2) or by properties with
value *no* or *none*. Empty value will reset it to the default that's currently
applicable on the mounted filesystem.

There are two ways to detect incompressible data:

* actual compression attempt - data are compressed, if the result is not smaller,
  it's discarded, so this depends on the algorithm and level
* pre-compression heuristics - a quick statistical evaluation on the data is
  performed and based on the result either compression is performed or skipped,
  the NOCOMPRESS bit is not set just by the heuristic, only if the compression
  algorithm does not make an improvement

.. code-block:: shell

   $ lsattr file
   ---------------------m file

Using forced compression is not recommended, the heuristics are
supposed to decide that and compression algorithms internally detect
incompressible data too.

Pre-compression heuristics
--------------------------

The heuristics aim to do a few quick statistical tests on the data to be
compressed in order to avoid a probably costly compression that would turn out
to be inefficient. Compression algorithms could have internal detection of
incompressible data too but this leads to more overhead as the compression is
done in another thread and has to write the data anyway. The heuristic is
read-only and can utilize cached memory.

The tests performed are based on the following: data sampling, long repeated
pattern detection, byte frequency, Shannon entropy.

Compatibility
-------------

Compression is done using the COW mechanism so it's incompatible with
*nodatacow*. Direct IO works on compressed files but will fall back to buffered
writes and leads to recompression. Currently 'nodatasum' and compression don't
work together.

The compression algorithms have been added over time so the version
compatibility should be also considered, together with other tools that may
access the compressed data like bootloaders.

@ -0,0 +1,97 @@
The **btrfs-convert** tool can be used to convert an existing source filesystem
image to a btrfs filesystem in-place. The original filesystem image is
accessible in a subvolume named like *ext2_saved* as file *image*.

Supported filesystems:

* ext2, ext3, ext4 -- original feature, always built in

* reiserfs -- since version 4.13, optionally built, requires libreiserfscore 3.6.27

* ntfs -- external tool https://github.com/maharmstone/ntfs2btrfs

The list of supported source filesystems by a given binary is listed at the end
of help (option *--help*).

||||
.. warning::
|
||||
If you are going to perform rollback to the original filesystem, you
|
||||
should not execute **btrfs balance** command on the converted filesystem. This
|
||||
will change the extent layout and make **btrfs-convert** unable to rollback.
|
||||
|
||||
The conversion utilizes free space of the original filesystem. The exact
|
||||
estimate of the required space cannot be foretold. The final btrfs metadata
|
||||
might occupy several gigabytes on a hundreds-gigabyte filesystem.
|
||||
|
||||
If the ability to rollback is no longer important, the it is recommended to
|
||||
perform a few more steps to transition the btrfs filesystem to a more compact
|
||||
layout. This is because the conversion inherits the original data blocks'
|
||||
fragmentation, and also because the metadata blocks are bound to the original
|
||||
free space layout.
|
||||
|
||||
Due to different constraints, it is only possible to convert filesystems that
|
||||
have a supported data block size (ie. the same that would be valid for
|
||||
**mkfs.btrfs**). This is typically the system page size (4KiB on x86_64
|
||||
machines).
|
||||
|
||||
**BEFORE YOU START**
|
||||
|
||||
The source filesystem must be clean, eg. no journal to replay or no repairs
|
||||
needed. The respective **fsck** utility must be run on the source filesytem prior
|
||||
to conversion. Please refer to the manual pages in case you encounter problems.
|
||||
|
||||
For ext2/3/4:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# e2fsck -fvy /dev/sdx
|
||||
|
||||
For reiserfs:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# reiserfsck -fy /dev/sdx
|
||||
|
||||
Skipping that step could lead to incorrect results on the target filesystem,
|
||||
but it may work.
|
||||
|
||||
**REMOVE THE ORIGINAL FILESYSTEM METADATA**
|
||||
|
||||
By removing the subvolume named like *ext2_saved* or *reiserfs_saved*, all
|
||||
metadata of the original filesystem will be removed:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# btrfs subvolume delete /mnt/ext2_saved
|
||||
|
||||
At this point it is not possible to do a rollback. The filesystem is usable but
|
||||
may be impacted by the fragmentation inherited from the original filesystem.
|
||||
|
||||
**MAKE FILE DATA MORE CONTIGUOUS**
|
||||
|
||||
An optional but recommended step is to run defragmentation on the entire
|
||||
filesystem. This will attempt to make file extents more contiguous.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# btrfs filesystem defrag -v -r -f -t 32M /mnt/btrfs
|
||||
|
||||
Verbose recursive defragmentation (*-v*, *-r*), flush data per-file (*-f*) with
|
||||
target extent size 32MiB (*-t*).
|
||||
|
||||
**ATTEMPT TO MAKE BTRFS METADATA MORE COMPACT**
|
||||
|
||||
Optional but recommended step.
|
||||
|
||||
The metadata block groups after conversion may be smaller than the default size
|
||||
(256MiB or 1GiB). Running a balance will attempt to merge the block groups.
|
||||
This depends on the free space layout (and fragmentation) and may fail due to
|
||||
lack of enough work space. This is a soft error leaving the filesystem usable
|
||||
but the block group layout may remain unchanged.
|
||||
|
||||
Note that balance operation takes a lot of time, please see also
|
||||
``btrfs-balance(8)``.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# btrfs balance start -m /mnt/btrfs
|
||||
|
|
@ -0,0 +1,198 @@
The concept of quota has a long-standing tradition in the Unix world. Ever
since computers allow multiple users to work simultaneously in one filesystem,
there is the need to prevent one user from using up the entire space. Every
user should get his fair share of the available resources.

In case of files, the solution is quite straightforward. Each file has an
*owner* recorded along with it, and it has a size. Traditional quota just
restricts the total size of all files that are owned by a user. The concept is
quite flexible: if a user hits his quota limit, the administrator can raise it
on the fly.

On the other hand, the traditional approach has only a poor solution to
restrict directories.
At installation time, the harddisk can be partitioned so that every directory
(eg. /usr, /var/, ...) that needs a limit gets its own partition. The obvious
problem is that those limits cannot be changed without a reinstallation. The
btrfs subvolume feature builds a bridge. Subvolumes correspond in many ways to
partitions, as every subvolume looks like its own filesystem. With subvolume
quota, it is now possible to restrict each subvolume like a partition, but keep
the flexibility of quota. The space for each subvolume can be expanded or
restricted on the fly.

As subvolumes are the basis for snapshots, interesting questions arise as to
how to account used space in the presence of snapshots. If you have a file
shared between a subvolume and a snapshot, whom to account the file to? The
creator? Both? What if the file gets modified in the snapshot, should only
these changes be accounted to it? But wait, both the snapshot and the subvolume
belong to the same user home. I just want to limit the total space used by
both! But somebody else might not want to charge the snapshots to the users.

Btrfs subvolume quota solves these problems by introducing groups of subvolumes
and lets the user put limits on them. It is even possible to have groups of
groups. In the following, we refer to them as *qgroups*.

Each qgroup primarily tracks two numbers, the amount of total referenced
space and the amount of exclusively referenced space.

referenced
   space is the amount of data that can be reached from any of the
   subvolumes contained in the qgroup, while
exclusive
   is the amount of data where all references to this data can be reached
   from within this qgroup.

SUBVOLUME QUOTA GROUPS
^^^^^^^^^^^^^^^^^^^^^^

The basic notion of the Subvolume Quota feature is the quota group, short
qgroup. Qgroups are notated as *level/id*, eg. the qgroup 3/2 is a qgroup of
level 3. For level 0, the leading '0/' can be omitted.
Qgroups of level 0 get created automatically when a subvolume/snapshot gets
created. The ID of the qgroup corresponds to the ID of the subvolume, so 0/5
is the qgroup for the root subvolume.
For the ``btrfs qgroup`` command, the path to the subvolume can also be used
instead of *0/ID*. For all higher levels, the ID can be chosen freely.

Each qgroup can contain a set of lower level qgroups, thus creating a hierarchy
of qgroups. Figure 1 shows an example qgroup tree.

.. code-block:: none

                              +---+
                              |2/1|
                              +---+
                             /     \
                       +---+/       \+---+
                       |1/1|          |1/2|
                       +---+          +---+
                      /     \        /     \
                +---+/       \+---+/       \+---+
      qgroups   |0/1|          |0/2|         |0/3|
                +-+-+          +---+         +---+
                  |           /     \       /     \
                  |          /       \     /       \
                  |         /         \   /         \
      extents     1        2            3  4

Figure1: Sample qgroup hierarchy

At the bottom, some extents are depicted showing which qgroups reference which
extents. It is important to understand the notion of *referenced* vs
*exclusive*. In the example, qgroup 0/2 references extents 2 and 3, while 1/2
references extents 2-4, 2/1 references all extents.

On the other hand, extent 1 is exclusive to 0/1, extent 2 is exclusive to 0/2,
while extent 3 is neither exclusive to 0/2 nor to 0/3. But because both
references can be reached from 1/2, extent 3 is exclusive to 1/2. All extents
are exclusive to 2/1.

So exclusive does not mean there is no other way to reach the extent, but it
does mean that if you delete all subvolumes contained in a qgroup, the extent
will get deleted.

The exclusive value of a qgroup conveys the useful information of how much
space will be freed in case all subvolumes of the qgroup get deleted.

All data extents are accounted this way. Metadata that belongs to a specific
subvolume (i.e. its filesystem tree) is also accounted. Checksums and extent
allocation information are not accounted.

In turn, the referenced count of a qgroup can be limited. All writes beyond
this limit will lead to a 'Quota Exceeded' error.

INHERITANCE
^^^^^^^^^^^

Things get a bit more complicated when new subvolumes or snapshots are created.
The case of (empty) subvolumes is still quite easy. If a subvolume should be
part of a qgroup, it has to be added to the qgroup at creation time. To add it
at a later time, it would be necessary to at least rescan the full subvolume
for a proper accounting.

Creation of a snapshot is the hard case. Obviously, the snapshot will
reference exactly the same amount of space as its source, and both source and
destination now have an exclusive count of 0 (the filesystem nodesize to be
precise, as the roots of the trees are not shared). But what about qgroups of
higher levels? If the qgroup contains both the source and the destination,
nothing changes. If the qgroup contains only the source, it might lose some
exclusive.

But how much? The tempting answer is, subtract all exclusive of the source from
the qgroup, but that is wrong, or at least not enough. There could have been
an extent that is referenced from the source and another subvolume from that
qgroup. This extent would have been exclusive to the qgroup, but not to the
source subvolume. With the creation of the snapshot, the qgroup would also
lose this extent from its exclusive set.

So how can this problem be solved? At the instant the snapshot gets created, we
already have to know the correct exclusive count. We need to have a second
qgroup that contains the same subvolumes as the first qgroup, except the
subvolume we want to snapshot. The moment we create the snapshot, the
exclusive count from the second qgroup needs to be copied to the first qgroup,
as it represents the correct value. The second qgroup is called a tracking
qgroup. It is only there in case a snapshot is needed.

USE CASES
^^^^^^^^^

Below are some use cases that are not meant to be exhaustive. You can find your
own way to integrate qgroups.

SINGLE-USER MACHINE
"""""""""""""""""""

``Replacement for partitions``

The simplest use case is to use qgroups as a simple replacement for partitions.
Btrfs takes the disk as a whole, and /, /usr, /var, etc. are created as
subvolumes. As each subvolume gets its own qgroup automatically, they can
simply be restricted. No hierarchy is needed for that.

``Track usage of snapshots``

When a snapshot is taken, a qgroup for it will automatically be created with
the correct values. 'Referenced' will show how much is in it, possibly shared
with other subvolumes. 'Exclusive' will be the amount of space that gets freed
when the subvolume is deleted.

MULTI-USER MACHINE
""""""""""""""""""

``Restricting homes``

When you have several users on a machine, with home directories probably under
/home, you might want to restrict /home as a whole, while restricting every
user to an individual limit as well. This is easily accomplished by creating a
qgroup for /home, eg. 1/1, and assigning all user subvolumes to it.
Restricting this qgroup will limit /home, while every user subvolume can get
its own (lower) limit.

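A minimal sketch of the above (the subvolume ID *0/257* and the limits are
hypothetical, see ``btrfs-quota(8)`` and ``btrfs-qgroup(8)``): enable quotas,
create the higher level qgroup 1/1, assign a user subvolume to it and set the
limits.

.. code-block:: bash

   # btrfs quota enable /home
   # btrfs qgroup create 1/1 /home
   # btrfs qgroup assign 0/257 1/1 /home
   # btrfs qgroup limit 500G 1/1 /home
   # btrfs qgroup limit 50G 0/257 /home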
|
``Accounting snapshots to the user``

Let's say the user is allowed to create snapshots via some mechanism. It would
only be fair to account space used by the snapshots to the user. This does not
mean the user doubles his usage as soon as he takes a snapshot. Of course,
files that are present in his home and the snapshot should only be accounted
once. This can be accomplished by creating a qgroup for each user, say
'1/UID'. The user home and all snapshots are assigned to this qgroup.
Limiting it will extend the limit to all snapshots, counting files only once.
To limit /home as a whole, a higher level group 2/1 replacing 1/1 from the
previous example is needed, with all user qgroups assigned to it.

``Do not account snapshots``

On the other hand, when the snapshots get created automatically, the user has
no chance to control them, so the space used by them should not be accounted to
him. This is already the case when creating snapshots in the example from
the previous section.

``Snapshots for backup purposes``

This scenario is a mixture of the previous two. The user can create snapshots,
but some snapshots for backup purposes are being created by the system. The
user's snapshots should be accounted to the user, not the system. The solution
is similar to the one from section 'Accounting snapshots to the user', but do
not assign system snapshots to the user's qgroup.

@ -0,0 +1,28 @@
Scrub is a pass over all filesystem data and metadata that verifies the
checksums. If a valid copy is available (replicated block group profiles) then
the damaged one is repaired. All copies of the replicated profiles are validated.

.. note::
   Scrub is not a filesystem checker (fsck) and does not verify nor repair
   structural damage in the filesystem. It really only checks checksums of data
   and tree blocks, it doesn't ensure the content of tree blocks is valid and
   consistent. There's some validation performed when metadata blocks are read
   from disk but it's not extensive and cannot substitute a full *btrfs check*
   run.

The user is supposed to run it manually or via a periodic system service. The
recommended period is a month but could be less. The estimated device bandwidth
utilization is about 80% on an idle filesystem. The IO priority class is by
default *idle* so background scrub should not significantly interfere with
normal filesystem operation. The IO scheduler set for the device(s) might not
support the priority classes though.

The scrubbing status is recorded in */var/lib/btrfs/* in textual files named
*scrub.status.UUID* for a filesystem identified by the given UUID. (Progress
state is communicated through a named pipe in file *scrub.progress.UUID* in the
same directory.) The status file is updated every 5 seconds. A resumed scrub
will continue from the last saved position.

Scrub can be started only on a mounted filesystem, though it's possible to
scrub only a selected device. See **btrfs scrub start** for more.

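For illustration, a scrub is typically started and monitored like this (the
mount point */mnt* is hypothetical):

.. code-block:: bash

   # btrfs scrub start /mnt
   # btrfs scrub status /mnt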
@ -0,0 +1,78 @@
The COW mechanism and multiple devices under one hood enable an interesting
concept, called a seeding device: extending a read-only filesystem on a single
device filesystem with another device that captures all writes. For example
imagine an immutable golden image of an operating system enhanced with another
device that allows using the data from the golden image during normal operation.
This idea originated on CD-ROMs with a base OS, allowing to use them for live
systems, but this became obsolete. There are technologies providing similar
functionality, like *unionmount*, *overlayfs* or *qcow2* image snapshot.

The seeding device starts as a normal filesystem; once the contents are ready,
**btrfstune -S 1** is used to flag it as a seeding device. Mounting such a device
will not allow any writes, except adding a new device by **btrfs device add**.
Then the filesystem can be remounted as read-write.

Given that the filesystem on the seeding device is always recognized as
read-only, it can be used to seed multiple filesystems at the same time. The
UUID that is normally attached to a device is automatically changed to a random
UUID on each mount.

Once the seeding device is mounted, it needs the writable device. After adding
it, something like **mount -o remount,rw /path** makes the filesystem at
*/path* ready for use. The simplest use case is to throw away all changes by
unmounting the filesystem when convenient.

Alternatively, deleting the seeding device from the filesystem can turn it into
a normal filesystem, provided that the writable device can also contain all the
data from the seeding device.

The seeding device flag can be cleared again by **btrfstune -f -S 0**, eg.
allowing it to be updated with newer data, but please note that this will
invalidate all existing filesystems that use this particular seeding device.
This works for some use cases, not for others, and a forcing flag to the command
is mandatory to avoid accidental mistakes.

Example how to create and use one seeding device:

.. code-block:: bash

   # mkfs.btrfs /dev/sda
   # mount /dev/sda /mnt/mnt1
   # ... fill mnt1 with data
   # umount /mnt/mnt1
   # btrfstune -S 1 /dev/sda
   # mount /dev/sda /mnt/mnt1
   # btrfs device add /dev/sdb /mnt/mnt1
   # mount -o remount,rw /mnt/mnt1
   # ... /mnt/mnt1 is now writable

Now */mnt/mnt1* can be used normally. The device */dev/sda* can be mounted
again with another writable device:

.. code-block:: bash

   # mount /dev/sda /mnt/mnt2
   # btrfs device add /dev/sdc /mnt/mnt2
   # mount -o remount,rw /mnt/mnt2
   # ... /mnt/mnt2 is now writable

The writable device (*/dev/sdb*) can be decoupled from the seeding device and
used independently:

.. code-block:: bash

   # btrfs device delete /dev/sda /mnt/mnt1

As the contents originated in the seeding device, it's possible to turn
*/dev/sdb* into a seeding device again and repeat the whole process.

A few things to note:

* it's recommended to use only a single device for the seeding device, it works
  for multiple devices but the *single* profile must be used in order to make
  the seeding device deletion work
* block group profiles *single* and *dup* support the use cases above
* the label is copied from the seeding device and can be changed by **btrfs filesystem label**
* each new mount of the seeding device gets a new random UUID
