btrfs-progs: docs: more docs updates
Signed-off-by: David Sterba <dsterba@suse.com>
parent
df91bfd5d5
commit
79ef78f0e4
|
@ -1,24 +1,24 @@
|
||||||
Tree checker
|
Tree checker
|
||||||
============
|
============
|
||||||
|
|
||||||
Metadata blocks that have been just read from devices or are just about to be
|
Tree checker is a feature that verifies metadata blocks before write or after
|
||||||
written are verified and sanity checked by so called **tree checker**. The
|
read from the devices. The b-tree nodes contain several items describing the
|
||||||
b-tree nodes contain several items describing the filesystem structure and to
|
filesystem structures and to some degree can be verified for consistency or
|
||||||
some degree can be verified for consistency or validity. This is additional
|
validity. This is an additional check to the checksums that only verify the
|
||||||
check to the checksums that only verify the overall block status while the tree
|
overall block status while the tree checker tries to validate and cross
|
||||||
checker tries to validate and cross reference the logical structure. This takes
|
reference the logical structure. This takes a slight performance hit but is
|
||||||
a slight performance hit but is comparable to calculating the checksum and has
|
comparable to calculating the checksum and has no noticeable impact while it
|
||||||
no noticeable impact while it does catch all sorts of errors.
|
does catch all sorts of errors.
|
||||||
|
|
||||||
There are two occasions when the checks are done:
|
There are two occasions when the checks are done:
|
||||||
|
|
||||||
Pre-write checks
|
Pre-write checks
|
||||||
----------------
|
----------------
|
||||||
|
|
||||||
When metadata blocks are in memory about to be written to the permanent storage,
|
When metadata blocks are in memory and about to be written to the permanent
|
||||||
the checks are performed, before the checksums are calculated. This can catch
|
storage, the checks are performed, before the checksums are calculated. This
|
||||||
random corruptions of the blocks (or pages) either caused by bugs or by other
|
can catch random corruptions of the blocks (or pages) either caused by bugs or
|
||||||
parts of the system or hardware errors (namely faulty RAM).
|
by other parts of the system or hardware errors (namely faulty RAM).
|
||||||
|
|
||||||
Once a block does not pass the checks, the filesystem refuses to write more data
|
Once a block does not pass the checks, the filesystem refuses to write more data
|
||||||
and turns itself to read-only mode to prevent further damage. At this point some
|
and turns itself to read-only mode to prevent further damage. At this point some
|
||||||
|
@ -28,6 +28,24 @@ the filesystem gets unmounted, the most recent changes are unfortunately lost.
|
||||||
The filesystem that is stored on the device is still consistent and should mount
|
The filesystem that is stored on the device is still consistent and should mount
|
||||||
fine.
|
fine.
|
||||||
|
|
||||||
|
A message may look like:
|
||||||
|
|
||||||
|
.. code-block::
|
||||||
|
|
||||||
|
[ 1716.823895] BTRFS critical (device vdb): corrupt leaf: root=18446744073709551607 block=38092800 slot=0, invalid key objectid: has 1 expect 6 or [256, 18446744073709551360] or 18446744073709551604
|
||||||
|
[ 1716.829499] BTRFS info (device vdb): leaf 38092800 gen 19 total ptrs 4 free space 15851 owner 18446744073709551607
|
||||||
|
[ 1716.832891] BTRFS info (device vdb): refs 3 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 1506
|
||||||
|
[ 1716.836054] item 0 key (1 1 0) itemoff 16123 itemsize 160
|
||||||
|
[ 1716.837993] inode generation 1 size 0 mode 100600
|
||||||
|
[ 1716.839760] item 1 key (256 1 0) itemoff 15963 itemsize 160
|
||||||
|
[ 1716.841742] inode generation 4 size 0 mode 40755
|
||||||
|
[ 1716.843393] item 2 key (256 12 256) itemoff 15951 itemsize 12
|
||||||
|
[ 1716.845320] item 3 key (18446744073709551611 48 1) itemoff 15951 itemsize 0
|
||||||
|
[ 1716.847505] BTRFS error (device vdb): block=38092800 write time tree block corruption detected
|
||||||
|
|
||||||
|
The line(s) before the *write time tree block corruption detected* message is
|
||||||
|
specific to the found error.
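
As a rough illustration of how such a report could be picked apart, here is a minimal Python sketch. The regular expression and the helper names (``parse_tree_checker_line``, ``as_signed64``) are hypothetical and derived only from the sample line above; other kernel versions may word the message differently. The very large numbers are ordinary unsigned 64-bit values, so reinterpreting them as signed can make them easier to read.

```python
import re

# Sample line copied from the log excerpt above.
SAMPLE = (
    "[ 1716.823895] BTRFS critical (device vdb): corrupt leaf: "
    "root=18446744073709551607 block=38092800 slot=0, invalid key objectid: "
    "has 1 expect 6 or [256, 18446744073709551360] or 18446744073709551604"
)

# Hypothetical pattern based only on the sample; not an official message format.
PATTERN = re.compile(
    r"BTRFS critical \(device (?P<device>[^)]+)\): corrupt (?P<kind>\w+): "
    r"root=(?P<root>\d+) block=(?P<block>\d+) slot=(?P<slot>\d+), (?P<reason>.*)"
)


def as_signed64(value):
    """Reinterpret an unsigned 64-bit integer as a signed one."""
    return value - (1 << 64) if value >= (1 << 63) else value


def parse_tree_checker_line(line):
    """Return the fields of a tree checker report, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    info = match.groupdict()
    for key in ("root", "block", "slot"):
        info[key] = int(info[key])
    # Large root ids read more naturally as small negative numbers.
    info["root_signed"] = as_signed64(info["root"])
    return info


report = parse_tree_checker_line(SAMPLE)
print(report["device"], report["block"], report["root_signed"], report["reason"])
```

The same helper should also match the read time report shown in the next section, since only the trailing summary line differs between the two cases.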
|
||||||
|
|
||||||
Post-read checks
|
Post-read checks
|
||||||
----------------
|
----------------
|
||||||
|
|
||||||
|
@ -36,6 +54,11 @@ checksum is found to be valid. This protects against changes to the metadata
|
||||||
that could possibly also update the checksum, less likely to happen accidentally
|
that could possibly also update the checksum, less likely to happen accidentally
|
||||||
but rather due to intentional corruption or fuzzing.
|
but rather due to intentional corruption or fuzzing.
|
||||||
|
|
||||||
|
.. code-block::
|
||||||
|
|
||||||
|
[ 4823.612832] BTRFS critical (device vdb): corrupt leaf: root=7 block=30474240 slot=0, invalid nritems, have 0 should not be 0 for non-root leaf
|
||||||
|
[ 4823.616798] BTRFS error (device vdb): block=30474240 read time tree block corruption detected
|
||||||
|
|
||||||
The checks
|
The checks
|
||||||
----------
|
----------
|
||||||
|
|
||||||
|
|
|
@ -1,9 +1,12 @@
|
||||||
Data and metadata are checksummed by default, the checksum is calculated before
|
Data and metadata are checksummed by default, the checksum is calculated before
|
||||||
write and verified after reading the blocks from devices. There are several
|
write and verified after reading the blocks from devices. The whole metadata
|
||||||
checksum algorithms supported. The default and backward compatible is *crc32c*.
|
block has a checksum stored inline in the b-tree node header, each data block
|
||||||
Since kernel 5.5 there are three more with different characteristics and
|
has a detached checksum stored in the checksum tree.
|
||||||
trade-offs regarding speed and strength. The following list may help you to
|
|
||||||
decide which one to select.
|
There are several checksum algorithms supported. The default and backward
|
||||||
|
compatible is *crc32c*. Since kernel 5.5 there are three more with different
|
||||||
|
characteristics and trade-offs regarding speed and strength. The following list
|
||||||
|
may help you to decide which one to select.
|
||||||
|
|
||||||
CRC32C (32bit digest)
|
CRC32C (32bit digest)
|
||||||
default, best backward compatibility, very fast, modern CPUs have
|
default, best backward compatibility, very fast, modern CPUs have
|
||||||
|
|
|
@ -48,7 +48,7 @@ This will enable the ``zstd`` algorithm on the default level (which is 3).
|
||||||
The level can be specified manually too like ``zstd:3``. Higher levels compress
|
The level can be specified manually too like ``zstd:3``. Higher levels compress
|
||||||
better at the cost of time. This in turn may cause increased write latency, low
|
better at the cost of time. This in turn may cause increased write latency, low
|
||||||
levels are suitable for real-time compression and on reasonably fast CPU don't
|
levels are suitable for real-time compression and on reasonably fast CPU don't
|
||||||
cause performance drops.
|
cause noticeable performance drops.
|
||||||
|
|
||||||
.. code-block:: shell
|
.. code-block:: shell
|
||||||
|
|
||||||
|
@ -145,9 +145,11 @@ Compatibility
|
||||||
|
|
||||||
Compression is done using the COW mechanism so it's incompatible with
|
Compression is done using the COW mechanism so it's incompatible with
|
||||||
*nodatacow*. Direct IO works on compressed files but will fall back to buffered
|
*nodatacow*. Direct IO works on compressed files but will fall back to buffered
|
||||||
writes and leads to recompression. Currently 'nodatasum' and compression don't
|
writes and leads to recompression. Currently *nodatasum* and compression don't
|
||||||
work together.
|
work together.
|
||||||
|
|
||||||
The compression algorithms have been added over time so the version
|
The compression algorithms have been added over time so the version
|
||||||
compatibility should be also considered, together with other tools that may
|
compatibility should be also considered, together with other tools that may
|
||||||
access the compressed data like bootloaders.
|
access the compressed data like bootloaders.
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -1,6 +1,7 @@
|
||||||
A BTRFS subvolume is a part of filesystem with its own independent
|
A BTRFS subvolume is a part of filesystem with its own independent
|
||||||
file/directory hierarchy. Subvolumes can share file extents. A snapshot is also
|
file/directory hierarchy and inode number namespace. Subvolumes can share file
|
||||||
subvolume, but with a given initial content of the original subvolume.
|
extents. A snapshot is also subvolume, but with a given initial content of the
|
||||||
|
original subvolume.
|
||||||
|
|
||||||
.. note::
|
.. note::
|
||||||
A subvolume in BTRFS is not like an LVM logical volume, which is block-level
|
A subvolume in BTRFS is not like an LVM logical volume, which is block-level
|
||||||
|
@ -8,7 +9,9 @@ subvolume, but with a given initial content of the original subvolume.
|
||||||
|
|
||||||
A subvolume looks like a normal directory, with some additional operations
|
A subvolume looks like a normal directory, with some additional operations
|
||||||
described below. Subvolumes can be renamed or moved, nesting subvolumes is not
|
described below. Subvolumes can be renamed or moved, nesting subvolumes is not
|
||||||
restricted but has some implications regarding snapshotting.
|
restricted but has some implications regarding snapshotting. The numeric id
|
||||||
|
(called *subvolid* or *rootid*) of the subvolume is persistent and cannot be
|
||||||
|
changed.
|
||||||
|
|
||||||
A subvolume in BTRFS can be accessed in two ways:
|
A subvolume in BTRFS can be accessed in two ways:
|
||||||
|
|
||||||
|
@ -30,10 +33,10 @@ do not affect the files in the original subvolume.
|
||||||
Subvolume flags
|
Subvolume flags
|
||||||
---------------
|
---------------
|
||||||
|
|
||||||
The subvolume flag currently implemented is the *ro* property. Read-write
|
The subvolume flag currently implemented is the *ro* property (read-only
|
||||||
subvolumes have that set to *false*, snapshots as *true*. In addition to that,
|
status). Read-write subvolumes have that set to *false*, snapshots as *true*.
|
||||||
a plain snapshot will also have last change generation and creation generation
|
In addition to that, a plain snapshot will also have last change generation and
|
||||||
equal.
|
creation generation equal.
|
||||||
|
|
||||||
Read-only snapshots are building blocks of incremental send (see
|
Read-only snapshots are building blocks of incremental send (see
|
||||||
``btrfs-send(8)``) and the whole use case relies on unmodified snapshots where
|
``btrfs-send(8)``) and the whole use case relies on unmodified snapshots where
|
||||||
|
@ -56,3 +59,36 @@ it by **btrfs property set** requires force if that is really desired by user.
|
||||||
show** to identify them. Flipping the flags to read-only and back to
|
show** to identify them. Flipping the flags to read-only and back to
|
||||||
read-write will reset the *received_uuid* manually. There may exist a
|
read-write will reset the *received_uuid* manually. There may exist a
|
||||||
convenience tool in the future.
|
convenience tool in the future.
|
||||||
|
|
||||||
|
Nested subvolumes
|
||||||
|
-----------------
|
||||||
|
|
||||||
|
There are no restrictions for subvolume creation, so it's up to the user how to
|
||||||
|
organize them, whether to have a flat layout (all subvolumes are direct
|
||||||
|
descendants of the toplevel one), or nested.
|
||||||
|
|
||||||
|
What should be mentioned early is that a snapshotting is not recursive, so a
|
||||||
|
subvolume or a snapshot is effectively a barrier. This can be used
|
||||||
|
intentionally but could be confusing in case of nested layouts.
|
||||||
|
|
||||||
|
Case study: system root layouts
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
There are two ways how the system root directory and subvolume layout could be
|
||||||
|
organized. The interesting usecase for root is to allow rollbacks to previous
|
||||||
|
version, as one atomic step. If the entire filesystem hierarchy starting in "/"
|
||||||
|
is in one subvolume, taking snapshot will encompass all files. This is easy for
|
||||||
|
the snapshotting part but has undesirable consequences for rollback. For example,
|
||||||
|
log files would get rolled back too, or any data that are stored on the root
|
||||||
|
filesystem but are not meant to be rolled back either (database files, VM
|
||||||
|
images, ...).
|
||||||
|
|
||||||
|
Here we could utilize the snapshotting barrier mentioned above, each directory
|
||||||
|
that stores data to be preserved across rollbacks is its own subvolume. This
|
||||||
|
could be eg. ``/var``. Further, more fine-grained partitioning could be done, eg.
|
||||||
|
adding separate subvolumes for ``/var/log``, ``/var/cache`` etc.
|
||||||
|
|
||||||
|
That there are separate subvolumes requires separate actions to take the
|
||||||
|
snapshots (here it gets disconnected from the system root snapshots). This needs
|
||||||
|
to be taken care of by system tools, installers together with selection of which
|
||||||
|
directories are highly recommended to be separate subvolumes.
|
||||||
|
|
|
@ -5,7 +5,97 @@ Troubleshooting pages
|
||||||
|
|
||||||
Correctness related, permanent
|
Correctness related, permanent
|
||||||
|
|
||||||
- transid verify error
|
Error: parent transid verify error
|
||||||
|
----------------------------------
|
||||||
|
|
||||||
|
Reason: result of a failed internal consistency check of the filesystem's metadata.
|
||||||
|
Type: permanent
|
||||||
|
|
||||||
|
.. code-block::
|
||||||
|
|
||||||
|
[ 4007.489730] BTRFS error (device vdb): parent transid verify failed on 30736384 wanted 10 found 8
|
||||||
|
|
||||||
|
The b-tree nodes are linked together, a block pointer in the parent node
|
||||||
|
contains target block offset and generation that last changed this block. The
|
||||||
|
block it points to then upon read verifies that the block address and the
|
||||||
|
generation matches. This check is done on all tree levels.
|
||||||
|
|
||||||
|
The number in **failed on 30736384** is the logical block number, **wanted 10**
|
||||||
|
is the expected generation number in the parent node, **found 8** is the one
|
||||||
|
found in the target block. The number difference between the generation can
|
||||||
|
give a hint when the problem could have happened, in terms of transaction
|
||||||
|
commits.
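
The fields described above can be extracted mechanically. The following is a minimal Python sketch, assuming the message keeps the wording of the sample line; the pattern and the name ``parse_transid_error`` are illustrative and not part of any btrfs tool.

```python
import re

# Sample line copied from the log excerpt above.
SAMPLE = (
    "[ 4007.489730] BTRFS error (device vdb): "
    "parent transid verify failed on 30736384 wanted 10 found 8"
)

# Hypothetical pattern based only on the sample message.
PATTERN = re.compile(
    r"parent transid verify failed on (?P<logical>\d+) "
    r"wanted (?P<wanted>\d+) found (?P<found>\d+)"
)


def parse_transid_error(line):
    """Return logical address, both generations and their difference, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    fields = {name: int(value) for name, value in match.groupdict().items()}
    # Difference between the generation in the parent and in the target block,
    # i.e. a hint of how many transaction commits apart the two are.
    fields["generation_gap"] = fields["wanted"] - fields["found"]
    return fields


error = parse_transid_error(SAMPLE)
print(error)
```

For the sample line the gap is 2, meaning the target block is two generations older than what the parent node expects.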
|
||||||
|
|
||||||
|
Once the mismatched generations are stored on the device, it's permanent and
|
||||||
|
cannot be easily recovered, because of information loss. The recovery tool
|
||||||
|
``btrfs restore`` is able to ignore the errors and attempt to restore the data
|
||||||
|
but due to the inconsistency in the metadata the data need to be verified by the
|
||||||
|
user.
|
||||||
|
|
||||||
|
The root cause of the error cannot be easily determined, possible reasons are:
|
||||||
|
|
||||||
|
* logical bug: filesystem structures haven't been properly updated and stored
|
||||||
|
* misdirected write: the underlying storage does not store the data to the exact
|
||||||
|
address as expected and overwrites some other block
|
||||||
|
* storage device (hardware or emulated) does not properly flush and persist data
|
||||||
|
between transactions so they get mixed up
|
||||||
|
* lost write without proper error handling: writing the block worked as viewed
|
||||||
|
on the filesystem layer, but there was a problem on the lower layers not
|
||||||
|
propagated upwards
|
||||||
|
|
||||||
|
Error: No space left on device (ENOSPC)
|
||||||
|
---------------------------------------
|
||||||
|
|
||||||
|
Type: transient
|
||||||
|
|
||||||
|
Space handling on a COW filesystem is tricky, namely when it's in combination
|
||||||
|
with delayed allocation, dynamic chunk allocation and parallel data updates.
|
||||||
|
There are several reasons why the ENOSPC might get reported and there's not just
|
||||||
|
a single cause and solution. The space reservation algorithms try to fairly
|
||||||
|
assign the space, fall back to heuristics or block writes until enough data are
|
||||||
|
persisted and possibly making old copies available.
|
||||||
|
|
||||||
|
The most obvious way how to exhaust space is to create a file until the data
|
||||||
|
chunks are full:
|
||||||
|
|
||||||
|
.. code-block::
|
||||||
|
|
||||||
|
$ df -h .
|
||||||
|
Filesystem Size Used Avail Use% Mounted on
|
||||||
|
/dev/sda 4.0G 3.6M 2.0G 1% /mnt/
|
||||||
|
|
||||||
|
$ cat /dev/zero > file
|
||||||
|
cat: write error: No space left on device
|
||||||
|
|
||||||
|
$ df -h .
|
||||||
|
Filesystem Size Used Avail Use% Mounted on
|
||||||
|
/dev/sdc 4.0G 2.0G 0 100% /mnt/data250
|
||||||
|
|
||||||
|
$ btrfs fi df .
|
||||||
|
Data, single: total=1.98GiB, used=1.98GiB
|
||||||
|
System, DUP: total=8.00MiB, used=16.00KiB
|
||||||
|
Metadata, DUP: total=1.00GiB, used=2.22MiB
|
||||||
|
GlobalReserve, single: total=3.25MiB, used=0.00B
|
||||||
|
|
||||||
|
The data chunks have been exhausted, so there's really no space left where to
|
||||||
|
write. The metadata chunks have space but that can't be used for that purpose.
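
To make the comparison between allocated and used space explicit, the output can be parsed per block group type. This is a minimal Python sketch that works on the text shown above; the parsing rules and the names ``to_bytes`` and ``parse_fi_df`` are assumptions based only on this sample, and the printed sizes are rounded, so the result is approximate.

```python
import re

# Output copied from the example above.
SAMPLE = """\
Data, single: total=1.98GiB, used=1.98GiB
System, DUP: total=8.00MiB, used=16.00KiB
Metadata, DUP: total=1.00GiB, used=2.22MiB
GlobalReserve, single: total=3.25MiB, used=0.00B
"""

UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}
SIZE = re.compile(r"^(?P<number>\d+(?:\.\d+)?)(?P<unit>[KMGT]iB|B)$")
ROW = re.compile(
    r"^(?P<kind>\w+), (?P<profile>\w+): "
    r"total=(?P<total>[^,\s]+), used=(?P<used>[^,\s]+)$"
)


def to_bytes(text):
    """Convert a size such as '1.98GiB' to an integer number of bytes."""
    match = SIZE.match(text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(float(match["number"]) * UNITS[match["unit"]])


def parse_fi_df(output):
    """Map each block group type to a (total, used) tuple in bytes."""
    usage = {}
    for line in output.splitlines():
        match = ROW.match(line.strip())
        if match:
            usage[match["kind"]] = (to_bytes(match["total"]), to_bytes(match["used"]))
    return usage


usage = parse_fi_df(SAMPLE)
for kind, (total, used) in usage.items():
    print(f"{kind}: {total - used} bytes unused in allocated chunks")
```

In the sample, the data chunks show no unused space while the metadata chunks do, which matches the situation described above.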
|
||||||
|
|
||||||
|
Metadata space got exhausted
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
Cannot track new data extents, no inline files, no reflinks, no xattrs.
|
||||||
|
Deletion still works.
|
||||||
|
|
||||||
|
Balance does not have enough workspace
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
Relocation of block groups requires a temporary work space, ie. area on the
|
||||||
|
device that's available for the filesystem but without any other existing block
|
||||||
|
groups. Before balance starts a check is performed to verify the requested
|
||||||
|
action is possible. If not, ENOSPC is returned.
|
||||||
|
|
||||||
|
TODO
|
||||||
|
----
|
||||||
|
|
||||||
Transient
|
Transient
|
||||||
|
|
||||||
|
|