btrfs-progs: docs: more docs updates

Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba 2022-01-10 16:20:34 +01:00
parent df91bfd5d5
commit 79ef78f0e4
5 changed files with 181 additions and 27 deletions


@ -1,24 +1,24 @@
Tree checker
============
Metadata blocks that have been just read from devices or are just about to be
written are verified and sanity checked by so called **tree checker**. The
b-tree nodes contain several items describing the filesystem structure and to
some degree can be verified for consistency or validity. This is additional
check to the checksums that only verify the overall block status while the tree
checker tries to validate and cross reference the logical structure. This takes
a slight performance hit but is comparable to calculating the checksum and has
no noticeable impact while it does catch all sorts of errors.
Tree checker is a feature that verifies metadata blocks before they are written
to or after they are read from the devices. The b-tree nodes contain several
items describing the filesystem structures and to some degree can be verified
for consistency or validity. This check is done in addition to the checksums:
those only verify the overall block status, while the tree checker tries to
validate and cross-reference the logical structure. The cost is comparable to
calculating the checksum and has no noticeable performance impact, while the
checks catch many kinds of errors.
There are two occasions when the checks are done:
Pre-write checks
----------------
When metadata blocks are in memory about to be written to the permanent storage,
the checks are performed, before the checksums are calculated. This can catch
random corruptions of the blocks (or pages) either caused by bugs or by other
parts of the system or hardware errors (namely faulty RAM).
When metadata blocks are in memory and about to be written to the permanent
storage, the checks are performed, before the checksums are calculated. This
can catch random corruptions of the blocks (or pages) either caused by bugs or
by other parts of the system or hardware errors (namely faulty RAM).
Once a block does not pass the checks, the filesystem refuses to write more data
and turns itself to read-only mode to prevent further damage. At this point some
@ -28,6 +28,24 @@ the filesystem gets unmounted, the most recent changes are unfortunately lost.
The filesystem that is stored on the device is still consistent and should mount
fine.
A message may look like:
.. code-block::
[ 1716.823895] BTRFS critical (device vdb): corrupt leaf: root=18446744073709551607 block=38092800 slot=0, invalid key objectid: has 1 expect 6 or [256, 18446744073709551360] or 18446744073709551604
[ 1716.829499] BTRFS info (device vdb): leaf 38092800 gen 19 total ptrs 4 free space 15851 owner 18446744073709551607
[ 1716.832891] BTRFS info (device vdb): refs 3 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 1506
[ 1716.836054] item 0 key (1 1 0) itemoff 16123 itemsize 160
[ 1716.837993] inode generation 1 size 0 mode 100600
[ 1716.839760] item 1 key (256 1 0) itemoff 15963 itemsize 160
[ 1716.841742] inode generation 4 size 0 mode 40755
[ 1716.843393] item 2 key (256 12 256) itemoff 15951 itemsize 12
[ 1716.845320] item 3 key (18446744073709551611 48 1) itemoff 15951 itemsize 0
[ 1716.847505] BTRFS error (device vdb): block=38092800 write time tree block corruption detected
The lines before the *write time tree block corruption detected* message are
specific to the error that was found.
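The very large numbers in the message above are easier to read as signed
64-bit values. The following is a minimal sketch, not part of btrfs-progs,
that reinterprets them; the helper name ``as_signed`` is made up for this
illustration. To the best of our knowledge the kernel headers define special
object ids as small negative numbers (for example -9 for the data relocation
tree), which is why they print as values close to 2^64.

```python
# Sketch: reinterpret unsigned 64-bit ids from the log as signed values.
# as_signed() is a hypothetical helper, not a btrfs-progs function.
U64 = 1 << 64


def as_signed(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as a signed one."""
    return value - U64 if value >= (1 << 63) else value


# Values copied from the example message above.
samples = [
    ("root", 18446744073709551607),
    ("upper bound of the allowed range", 18446744073709551360),
    ("additionally allowed objectid", 18446744073709551604),
    ("objectid of item 3", 18446744073709551611),
]

for name, raw in samples:
    print(f"{name}: {raw} = {as_signed(raw)}")
```

Read this way, the allowed objectid range in the message is 256 up to -256,
and the offending key has objectid 1, which falls outside of it.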
Post-read checks
----------------
@ -36,6 +54,11 @@ checksum is found to be valid. This protects against changes to the metadata
that could possibly also update the checksum, less likely to happen accidentally
but rather due to intentional corruption or fuzzing.
.. code-block::
[ 4823.612832] BTRFS critical (device vdb): corrupt leaf: root=7 block=30474240 slot=0, invalid nritems, have 0 should not be 0 for non-root leaf
[ 4823.616798] BTRFS error (device vdb): block=30474240 read time tree block corruption detected
The checks
----------


@ -1,9 +1,12 @@
Data and metadata are checksummed by default, the checksum is calculated before
write and verified after reading the blocks from devices. There are several
checksum algorithms supported. The default and backward compatible is *crc32c*.
Since kernel 5.5 there are three more with different characteristics and
trade-offs regarding speed and strength. The following list may help you to
decide which one to select.
write and verified after reading the blocks from devices. The whole metadata
block has a checksum stored inline in the b-tree node header, while each data
block has a detached checksum stored in the checksum tree.
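The principle of calculating a checksum before the write and verifying it
after the read can be sketched as follows. This is only an illustration: it
uses a plain bitwise CRC32C (Castagnoli polynomial) written for this example,
the payload is made up, and it does not reproduce the on-disk layout or the
exact way the kernel stores the digest.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC32C (reflected Castagnoli polynomial 0x82F63B78)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


# Hypothetical block payload, for illustration only.
block = b"example metadata block payload"

stored = crc32c(block)            # calculated before the write
print(crc32c(block) == stored)    # verification after an unmodified read

corrupted = bytearray(block)
corrupted[0] ^= 0x01              # a single flipped bit
print(crc32c(bytes(corrupted)) == stored)
```

The first comparison succeeds and the second one fails, because a CRC always
detects a single flipped bit.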
There are several checksum algorithms supported. The default and backward
compatible is *crc32c*. Since kernel 5.5 there are three more with different
characteristics and trade-offs regarding speed and strength. The following list
may help you to decide which one to select.
CRC32C (32bit digest)
default, best backward compatibility, very fast, modern CPUs have


@ -48,7 +48,7 @@ This will enable the ``zstd`` algorithm on the default level (which is 3).
The level can also be specified manually, like ``zstd:3``. Higher levels compress
better at the cost of time, which in turn may increase write latency. Low
levels are suitable for real-time compression and on a reasonably fast CPU don't
cause performance drops.
cause noticeable performance drops.
.. code-block:: shell
@ -145,9 +145,11 @@ Compatibility
Compression is done using the COW mechanism so it's incompatible with
*nodatacow*. Direct IO works on compressed files but will fall back to buffered
writes and leads to recompression. Currently 'nodatasum' and compression don't
writes and leads to recompression. Currently *nodatasum* and compression don't
work together.
The compression algorithms have been added over time so the version
compatibility should also be considered, together with other tools that may
access the compressed data, like bootloaders.


@ -1,6 +1,7 @@
A BTRFS subvolume is a part of the filesystem with its own independent
file/directory hierarchy. Subvolumes can share file extents. A snapshot is also
subvolume, but with a given initial content of the original subvolume.
file/directory hierarchy and inode number namespace. Subvolumes can share file
extents. A snapshot is also a subvolume, but with a given initial content of the
original subvolume.
.. note::
A subvolume in BTRFS is not like an LVM logical volume, which is block-level
@ -8,7 +9,9 @@ subvolume, but with a given initial content of the original subvolume.
A subvolume looks like a normal directory, with some additional operations
described below. Subvolumes can be renamed or moved, nesting subvolumes is not
restricted but has some implications regarding snapshotting.
restricted but has some implications regarding snapshotting. The numeric id
(called *subvolid* or *rootid*) of the subvolume is persistent and cannot be
changed.
A subvolume in BTRFS can be accessed in two ways:
@ -30,10 +33,10 @@ do not affect the files in the original subvolume.
Subvolume flags
---------------
The subvolume flag currently implemented is the *ro* property. Read-write
subvolumes have that set to *false*, snapshots as *true*. In addition to that,
a plain snapshot will also have last change generation and creation generation
equal.
The subvolume flag currently implemented is the *ro* property (read-only
status). Read-write subvolumes have that set to *false*, snapshots as *true*.
In addition to that, a plain snapshot will also have last change generation and
creation generation equal.
Read-only snapshots are building blocks of incremental send (see
``btrfs-send(8)``) and the whole use case relies on unmodified snapshots where
@ -56,3 +59,36 @@ it by **btrfs property set** requires force if that is really desired by user.
show** to identify them. Flipping the flags to read-only and back to
read-write will reset the *received_uuid* manually. There may exist a
convenience tool in the future.
Nested subvolumes
-----------------
There are no restrictions for subvolume creation, so it's up to the user how to
organize them, whether to have a flat layout (all subvolumes are direct
descendants of the toplevel one), or nested.
What should be mentioned early is that snapshotting is not recursive, so a
subvolume or a snapshot is effectively a barrier. This can be used
intentionally but could be confusing in case of nested layouts.
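The barrier can be illustrated with a toy model. This is a sketch in plain
Python and not how the filesystem is implemented: the ``Subvolume`` class, the
``snapshot`` function and the sample tree are made up for this example. It
assumes that a nested subvolume shows up in the snapshot as an empty directory.

```python
class Subvolume(dict):
    """Toy model: a directory tree that is its own subvolume."""


def snapshot(subvol: Subvolume) -> Subvolume:
    """Copy files and plain directories, stop at nested subvolumes."""

    def copy_dir(tree: dict) -> dict:
        out = {}
        for name, entry in tree.items():
            if isinstance(entry, Subvolume):
                out[name] = {}                # barrier: only an empty directory
            elif isinstance(entry, dict):
                out[name] = copy_dir(entry)   # plain directory, descend
            else:
                out[name] = entry             # file content
        return out

    return Subvolume(copy_dir(subvol))


# Hypothetical layout: /var is a nested subvolume, /etc is a plain directory.
root = Subvolume({
    "etc": {"fstab": "config"},
    "var": Subvolume({"log": {"messages": "log lines"}}),
})

snap = snapshot(root)
print(snap)
```

The snapshot contains ``etc`` with its file, while ``var`` is present only as an
empty directory. The contents of the nested subvolume would need a snapshot of
their own.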
Case study: system root layouts
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There are two ways to organize the system root directory and the subvolume
layout. The interesting use case for root is to allow rollbacks to a previous
version, as one atomic step. If the entire filesystem hierarchy starting in "/"
is in one subvolume, taking a snapshot will encompass all files. This is easy for
the snapshotting part but has undesirable consequences for rollback. For example,
log files would get rolled back too, as would any other data that are stored on
the root filesystem but are not meant to be rolled back (database files, VM
images, ...).
Here we could utilize the snapshotting barrier mentioned above: each directory
that stores data to be preserved across rollbacks is its own subvolume. This
could be e.g. ``/var``. Further, more fine-grained partitioning could be done,
e.g. adding separate subvolumes for ``/var/log``, ``/var/cache`` etc.
Having separate subvolumes requires separate actions to take the snapshots
(here it gets disconnected from the system root snapshots). This needs to be
taken care of by system tools and installers, together with the selection of
which directories are highly recommended to be separate subvolumes.


@ -5,7 +5,97 @@ Troubleshooting pages
Correctness related, permanent
- transid verify error
Error: parent transid verify error
----------------------------------
Reason: result of a failed internal consistency check of the filesystem's metadata.
Type: permanent
.. code-block::
[ 4007.489730] BTRFS error (device vdb): parent transid verify failed on 30736384 wanted 10 found 8
The b-tree nodes are linked together: a block pointer in the parent node
contains the target block offset and the generation that last changed this
block. Upon read, the block it points to is verified so that the block address
and the generation match. This check is done on all tree levels.
The number in **failed on 30736384** is the logical block number, **wanted 10**
is the expected generation number in the parent node, **found 8** is the one
found in the target block. The difference between the generations can give a
hint when the problem could have happened, in terms of transaction commits.
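The fields of the message can be extracted with a short script. This is a
minimal sketch that assumes the message format shown above; the function name
``parse_transid_error`` and the ``gap`` field are made up for this example, and
other kernel versions may word the message differently.

```python
import re

# Pattern for the message format shown above (an assumption, may vary).
PATTERN = re.compile(
    r"parent transid verify failed on (?P<block>\d+) "
    r"wanted (?P<wanted>\d+) found (?P<found>\d+)"
)


def parse_transid_error(line: str):
    """Return block, wanted, found and their generation gap, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    fields = {key: int(value) for key, value in match.groupdict().items()}
    # How many transaction commits the target block is behind its parent.
    fields["gap"] = fields["wanted"] - fields["found"]
    return fields


line = (
    "[ 4007.489730] BTRFS error (device vdb): "
    "parent transid verify failed on 30736384 wanted 10 found 8"
)
print(parse_transid_error(line))
```

For the example message the gap is 2, i.e. the target block is two generations
older than what the parent node expects.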
Once the mismatched generations are stored on the device, it's permanent and
cannot be easily recovered, because of information loss. The recovery tool
``btrfs restore`` is able to ignore the errors and attempt to restore the data
but due to the inconsistency in the metadata the data need to be verified by the
user.
The root cause of the error cannot be easily determined, possible reasons are:
* logical bug: filesystem structures haven't been properly updated and stored
* misdirected write: the underlying storage does not store the data to the exact
address as expected and overwrites some other block
* storage device (hardware or emulated) does not properly flush and persist data
between transactions so they get mixed up
* lost write without proper error handling: writing the block worked as viewed
on the filesystem layer, but there was a problem on the lower layers not
propagated upwards
Error: No space left on device (ENOSPC)
---------------------------------------
Type: transient
Space handling on a COW filesystem is tricky, namely when it's in combination
with delayed allocation, dynamic chunk allocation and parallel data updates.
There are several reasons why ENOSPC might get reported and there's not just
a single cause and solution. The space reservation algorithms try to assign the
space fairly, fall back to heuristics, or block writes until enough data are
persisted, which may make the space of old copies available again.
The most obvious way to exhaust space is to keep writing to a file until the
data chunks are full:
.. code-block::
$ df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/sda 4.0G 3.6M 2.0G 1% /mnt/
$ cat /dev/zero > file
cat: write error: No space left on device
$ df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/sdc 4.0G 2.0G 0 100% /mnt/data250
$ btrfs fi df .
Data, single: total=1.98GiB, used=1.98GiB
System, DUP: total=8.00MiB, used=16.00KiB
Metadata, DUP: total=1.00GiB, used=2.22MiB
GlobalReserve, single: total=3.25MiB, used=0.00B
The data chunks have been exhausted, so there's really no space left to write
to. The metadata chunks have space but that can't be used for data.
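The same conclusion can be reached by a script. The following is a minimal
sketch that parses the ``btrfs fi df`` output format shown above and reports
the free space inside the allocated chunks per type; ``parse_fi_df`` is a
made-up name and the pattern only covers the line format of this example.

```python
import re

UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}

# Matches lines like "Data, single: total=1.98GiB, used=1.98GiB".
LINE = re.compile(
    r"(?P<type>\w+), (?P<profile>\w+): "
    r"total=(?P<total>[\d.]+)(?P<tunit>[KMGT]iB|B), "
    r"used=(?P<used>[\d.]+)(?P<uunit>[KMGT]iB|B)"
)


def parse_fi_df(text: str) -> dict:
    """Map block group type to (total, used) in bytes."""
    usage = {}
    for match in LINE.finditer(text):
        total = float(match["total"]) * UNITS[match["tunit"]]
        used = float(match["used"]) * UNITS[match["uunit"]]
        usage[match["type"]] = (total, used)
    return usage


# Output copied from the example above.
SAMPLE = """\
Data, single: total=1.98GiB, used=1.98GiB
System, DUP: total=8.00MiB, used=16.00KiB
Metadata, DUP: total=1.00GiB, used=2.22MiB
GlobalReserve, single: total=3.25MiB, used=0.00B
"""

for kind, (total, used) in parse_fi_df(SAMPLE).items():
    free_mib = (total - used) / 1024**2
    print(f"{kind:14s} free in allocated chunks: {free_mib:10.2f} MiB")
```

For the sample, the data type has nothing free in its chunks while metadata
still has most of its 1GiB available.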
Metadata space got exhausted
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Cannot track new data extents, no inline files, no reflinks, no xattrs.
Deletion still works.
Balance does not have enough workspace
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Relocation of block groups requires a temporary work space, i.e. an area on the
device that's available for the filesystem but without any other existing block
groups. Before balance starts, a check is performed to verify that the requested
action is possible. If not, ENOSPC is returned.
TODO
----
Transient