btrfs-progs: docs: more docs updates
Signed-off-by: David Sterba <dsterba@suse.com>
parent df91bfd5d5
commit 79ef78f0e4
@ -1,24 +1,24 @@
|
|||
Tree checker
|
||||
============
|
||||
|
||||
Metadata blocks that have been just read from devices or are just about to be
|
||||
written are verified and sanity checked by so called **tree checker**. The
|
||||
b-tree nodes contain several items describing the filesystem structure and to
|
||||
some degree can be verified for consistency or validity. This is additional
|
||||
check to the checksums that only verify the overall block status while the tree
|
||||
checker tries to validate and cross reference the logical structure. This takes
|
||||
a slight performance hit but is comparable to calculating the checksum and has
|
||||
no noticeable impact while it does catch all sorts of errors.
|
||||
Tree checker is a feature that verifies metadata blocks before write or after
|
||||
read from the devices. The b-tree nodes contain several items describing the
|
||||
filesystem structures and to some degree can be verified for consistency or
|
||||
validity. This is an additional check to the checksums that only verify the
|
||||
overall block status while the tree checker tries to validate and cross
|
||||
reference the logical structure. This takes a slight performance hit but is
|
||||
comparable to calculating the checksum and has no noticeable impact while it
|
||||
does catch all sorts of errors.
|
||||
|
||||
There are two occasions when the checks are done:
|
||||
|
||||
Pre-write checks
|
||||
----------------
|
||||
|
||||
When metadata blocks are in memory about to be written to the permanent storage,
|
||||
the checks are performed, before the checksums are calculated. This can catch
|
||||
random corruptions of the blocks (or pages) either caused by bugs or by other
|
||||
parts of the system or hardware errors (namely faulty RAM).
|
||||
When metadata blocks are in memory and about to be written to the permanent
|
||||
storage, the checks are performed, before the checksums are calculated. This
|
||||
can catch random corruptions of the blocks (or pages) either caused by bugs or
|
||||
by other parts of the system or hardware errors (namely faulty RAM).
|
||||
|
||||
Once a block does not pass the checks, the filesystem refuses to write more data
|
||||
and turns itself to read-only mode to prevent further damage. At this point some
|
||||
|
@ -28,6 +28,24 @@ the filesystem gets unmounted, the most recent changes are unfortunately lost.
|
|||
The filesystem that is stored on the device is still consistent and should mount
|
||||
fine.
|
||||
|
||||
A message may look like:
|
||||
|
||||
.. code-block::
|
||||
|
||||
[ 1716.823895] BTRFS critical (device vdb): corrupt leaf: root=18446744073709551607 block=38092800 slot=0, invalid key objectid: has 1 expect 6 or [256, 18446744073709551360] or 18446744073709551604
|
||||
[ 1716.829499] BTRFS info (device vdb): leaf 38092800 gen 19 total ptrs 4 free space 15851 owner 18446744073709551607
|
||||
[ 1716.832891] BTRFS info (device vdb): refs 3 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 1506
|
||||
[ 1716.836054] item 0 key (1 1 0) itemoff 16123 itemsize 160
|
||||
[ 1716.837993] inode generation 1 size 0 mode 100600
|
||||
[ 1716.839760] item 1 key (256 1 0) itemoff 15963 itemsize 160
|
||||
[ 1716.841742] inode generation 4 size 0 mode 40755
|
||||
[ 1716.843393] item 2 key (256 12 256) itemoff 15951 itemsize 12
|
||||
[ 1716.845320] item 3 key (18446744073709551611 48 1) itemoff 15951 itemsize 0
|
||||
[ 1716.847505] BTRFS error (device vdb): block=38092800 write time tree block corruption detected
|
||||
|
||||
The lines before the *write time tree block corruption detected* message are
specific to the error found.
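The large numbers in such a message are 64-bit values printed as unsigned, which makes them hard to read. The following is a minimal sketch, not part of btrfs-progs, that extracts the fields from the first line of the sample above and reinterprets the root number as a signed value. The helper names and the regular expression are assumptions based on that single sample; the message layout may differ between kernel versions.

```python
import re

U64 = 1 << 64


def as_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit number as a signed one."""
    return value - U64 if value >= (1 << 63) else value


# Pattern derived from the sample "corrupt leaf" line only (assumption:
# other kernel versions may format the message differently).
CORRUPT_LEAF = re.compile(
    r"corrupt leaf: root=(?P<root>\d+) block=(?P<block>\d+) "
    r"slot=(?P<slot>\d+), (?P<reason>.*)"
)


def parse_corrupt_leaf(line: str):
    """Return the fields of a 'corrupt leaf' message, or None if no match."""
    m = CORRUPT_LEAF.search(line)
    if m is None:
        return None
    return {
        "root": as_signed64(int(m["root"])),
        "block": int(m["block"]),
        "slot": int(m["slot"]),
        "reason": m["reason"],
    }


line = (
    "[ 1716.823895] BTRFS critical (device vdb): corrupt leaf: "
    "root=18446744073709551607 block=38092800 slot=0, "
    "invalid key objectid: has 1 expect 6 or [256, 18446744073709551360] "
    "or 18446744073709551604"
)
info = parse_corrupt_leaf(line)
print(info["root"], info["block"], info["slot"])  # prints: -9 38092800 0
```

Read as signed numbers, the root in the sample is -9 and the two large bounds in the reason text are -256 and -12, which is easier to compare against the negative object ids reserved for internal trees.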
|
||||
|
||||
Post-read checks
|
||||
----------------
|
||||
|
||||
|
@ -36,6 +54,11 @@ checksum is found to be valid. This protects against changes to the metadata
|
|||
that could possibly also update the checksum, less likely to happen accidentally
|
||||
but rather due to intentional corruption or fuzzing.
|
||||
|
||||
.. code-block::
|
||||
|
||||
[ 4823.612832] BTRFS critical (device vdb): corrupt leaf: root=7 block=30474240 slot=0, invalid nritems, have 0 should not be 0 for non-root leaf
|
||||
[ 4823.616798] BTRFS error (device vdb): block=30474240 read time tree block corruption detected
|
||||
|
||||
The checks
|
||||
----------
|
||||
|
||||
|
|
|
@ -1,9 +1,12 @@
|
|||
Data and metadata are checksummed by default, the checksum is calculated before
|
||||
write and verified after reading the blocks from devices. There are several
|
||||
checksum algorithms supported. The default and backward compatible is *crc32c*.
|
||||
Since kernel 5.5 there are three more with different characteristics and
|
||||
trade-offs regarding speed and strength. The following list may help you to
|
||||
decide which one to select.
|
||||
write and verified after reading the blocks from devices. The whole metadata
|
||||
block has a checksum stored inline in the b-tree node header, each data block
|
||||
has a detached checksum stored in the checksum tree.
|
||||
|
||||
There are several checksum algorithms supported. The default and backward
|
||||
compatible is *crc32c*. Since kernel 5.5 there are three more with different
|
||||
characteristics and trade-offs regarding speed and strength. The following list
|
||||
may help you to decide which one to select.
|
||||
|
||||
CRC32C (32bit digest)
|
||||
default, best backward compatibility, very fast, modern CPUs have
|
||||
|
|
|
@ -48,7 +48,7 @@ This will enable the ``zstd`` algorithm on the default level (which is 3).
|
|||
The level can be specified manually too like ``zstd:3``. Higher levels compress
|
||||
better at the cost of time. This in turn may cause increased write latency, low
|
||||
levels are suitable for real-time compression and on reasonably fast CPU don't
|
||||
cause performance drops.
|
||||
cause noticeable performance drops.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
|
@ -145,9 +145,11 @@ Compatibility
|
|||
|
||||
Compression is done using the COW mechanism so it's incompatible with
|
||||
*nodatacow*. Direct IO works on compressed files but will fall back to buffered
|
||||
writes and leads to recompression. Currently 'nodatasum' and compression don't
|
||||
writes and leads to recompression. Currently *nodatasum* and compression don't
|
||||
work together.
|
||||
|
||||
The compression algorithms have been added over time so the version
|
||||
compatibility should be also considered, together with other tools that may
|
||||
access the compressed data like bootloaders.
|
||||
|
||||
|
||||
|
|
|
@ -1,6 +1,7 @@
|
|||
A BTRFS subvolume is a part of filesystem with its own independent
|
||||
file/directory hierarchy. Subvolumes can share file extents. A snapshot is also
|
||||
subvolume, but with a given initial content of the original subvolume.
|
||||
file/directory hierarchy and inode number namespace. Subvolumes can share file
|
||||
extents. A snapshot is also a subvolume, but with a given initial content of the
|
||||
original subvolume.
|
||||
|
||||
.. note::
|
||||
A subvolume in BTRFS is not like an LVM logical volume, which is block-level
|
||||
|
@ -8,7 +9,9 @@ subvolume, but with a given initial content of the original subvolume.
|
|||
|
||||
A subvolume looks like a normal directory, with some additional operations
|
||||
described below. Subvolumes can be renamed or moved, nesting subvolumes is not
|
||||
restricted but has some implications regarding snapshotting.
|
||||
restricted but has some implications regarding snapshotting. The numeric id
|
||||
(called *subvolid* or *rootid*) of the subvolume is persistent and cannot be
|
||||
changed.
|
||||
|
||||
A subvolume in BTRFS can be accessed in two ways:
|
||||
|
||||
|
@ -30,10 +33,10 @@ do not affect the files in the original subvolume.
|
|||
Subvolume flags
|
||||
---------------
|
||||
|
||||
The subvolume flag currently implemented is the *ro* property. Read-write
|
||||
subvolumes have that set to *false*, snapshots as *true*. In addition to that,
|
||||
a plain snapshot will also have last change generation and creation generation
|
||||
equal.
|
||||
The subvolume flag currently implemented is the *ro* property (read-only
|
||||
status). Read-write subvolumes have that set to *false*, snapshots as *true*.
|
||||
In addition to that, a plain snapshot will also have last change generation and
|
||||
creation generation equal.
|
||||
|
||||
Read-only snapshots are building blocks of incremental send (see
|
||||
``btrfs-send(8)``) and the whole use case relies on unmodified snapshots where
|
||||
|
@ -56,3 +59,36 @@ it by **btrfs property set** requires force if that is really desired by user.
|
|||
show** to identify them. Flipping the flags to read-only and back to
|
||||
read-write will reset the *received_uuid* manually. There may exist a
|
||||
convenience tool in the future.
|
||||
|
||||
Nested subvolumes
|
||||
-----------------
|
||||
|
||||
There are no restrictions for subvolume creation, so it's up to the user how to
|
||||
organize them, whether to have a flat layout (all subvolumes are direct
|
||||
descendants of the toplevel one), or nested.
|
||||
|
||||
What should be mentioned early is that snapshotting is not recursive, so a
|
||||
subvolume or a snapshot is effectively a barrier. This can be used
|
||||
intentionally but could be confusing in case of nested layouts.
|
||||
|
||||
Case study: system root layouts
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
There are two ways the system root directory and subvolume layout could be
organized. The interesting use case for root is to allow rollbacks to a previous
version, as one atomic step. If the entire filesystem hierarchy starting in "/"
is in one subvolume, taking a snapshot will encompass all files. This is easy for
the snapshotting part but has undesirable consequences for rollback. For example,
log files would get rolled back too, as would any data stored on the root
filesystem that are not meant to be rolled back (database files, VM
images, ...).
|
||||
|
||||
Here we could utilize the snapshotting barrier mentioned above: each directory
that stores data to be preserved across rollbacks is its own subvolume. This
could be e.g. ``/var``. Further, more fine-grained partitioning could be done, e.g.
adding separate subvolumes for ``/var/log``, ``/var/cache`` etc.
|
||||
|
||||
Having separate subvolumes requires separate actions to take the
snapshots (here it gets disconnected from the system root snapshots). This needs
to be taken care of by system tools and installers, together with the selection
of which directories are highly recommended to be separate subvolumes.
|
||||
|
|
|
@ -5,7 +5,97 @@ Troubleshooting pages
|
|||
|
||||
Correctness related, permanent
|
||||
|
||||
- transid verify error
|
||||
Error: parent transid verify error
|
||||
----------------------------------
|
||||
|
||||
Reason: result of a failed internal consistency check of the filesystem's metadata.
|
||||
Type: permanent
|
||||
|
||||
.. code-block::
|
||||
|
||||
[ 4007.489730] BTRFS error (device vdb): parent transid verify failed on 30736384 wanted 10 found 8
|
||||
|
||||
The b-tree nodes are linked together, a block pointer in the parent node
contains the target block offset and the generation that last changed this block.
When the target block is read, its block address and generation are verified
against the values stored in the parent pointer. This check is done on all tree
levels.
|
||||
|
||||
The number in **failed on 30736384** is the logical block number, **wanted 10**
is the expected generation number in the parent node, **found 8** is the one
found in the target block. The difference between the generation numbers can
give a hint when the problem could have happened, in terms of transaction
commits.
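As an illustration of reading these fields, here is a minimal sketch that pulls the three numbers out of the sample log line and computes the generation gap. The function name and the regular expression are hypothetical and only follow the message format shown above.

```python
import re

# Pattern based on the sample message above (assumption: the wording is
# stable across the kernel versions being inspected).
TRANSID = re.compile(
    r"parent transid verify failed on (?P<block>\d+) "
    r"wanted (?P<wanted>\d+) found (?P<found>\d+)"
)


def parse_transid_error(line: str):
    """Return block, wanted, found and their gap, or None if no match."""
    m = TRANSID.search(line)
    if m is None:
        return None
    wanted = int(m["wanted"])
    found = int(m["found"])
    return {
        "block": int(m["block"]),
        "wanted": wanted,
        "found": found,
        # Number of transaction commits between the two generations.
        "gap": wanted - found,
    }


line = (
    "[ 4007.489730] BTRFS error (device vdb): "
    "parent transid verify failed on 30736384 wanted 10 found 8"
)
err = parse_transid_error(line)
print(err["block"], err["gap"])  # prints: 30736384 2
```

In the sample the target block is two generations older than its parent expects, so the problem was introduced within the last two transaction commits recorded in that parent.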
|
||||
|
||||
Once the mismatched generations are stored on the device, it's permanent and
|
||||
cannot be easily recovered, because of information loss. The recovery tool
|
||||
``btrfs restore`` is able to ignore the errors and attempt to restore the data
|
||||
but due to the inconsistency in the metadata the data need to be verified by the
|
||||
user.
|
||||
|
||||
The root cause of the error cannot be easily determined, possible reasons are:
|
||||
|
||||
* logical bug: filesystem structures haven't been properly updated and stored
|
||||
* misdirected write: the underlying storage does not store the data to the exact
|
||||
address as expected and overwrites some other block
|
||||
* storage device (hardware or emulated) does not properly flush and persist data
|
||||
between transactions so they get mixed up
|
||||
* lost write without proper error handling: writing the block worked as viewed
|
||||
on the filesystem layer, but there was a problem on the lower layers not
|
||||
propagated upwards
|
||||
|
||||
Error: No space left on device (ENOSPC)
|
||||
---------------------------------------
|
||||
|
||||
Type: transient
|
||||
|
||||
Space handling on a COW filesystem is tricky, namely when it's in combination
|
||||
with delayed allocation, dynamic chunk allocation and parallel data updates.
|
||||
There are several reasons why the ENOSPC might get reported and there's not just
|
||||
a single cause and solution. The space reservation algorithms try to fairly
|
||||
assign the space, fall back to heuristics or block writes until enough data are
|
||||
persisted and possibly making old copies available.
|
||||
|
||||
The most obvious way to exhaust space is to write to a file until the data
|
||||
chunks are full:
|
||||
|
||||
.. code-block::
|
||||
|
||||
$ df -h .
|
||||
Filesystem Size Used Avail Use% Mounted on
|
||||
/dev/sda 4.0G 3.6M 2.0G 1% /mnt/
|
||||
|
||||
$ cat /dev/zero > file
|
||||
cat: write error: No space left on device
|
||||
|
||||
$ df -h .
|
||||
Filesystem Size Used Avail Use% Mounted on
|
||||
/dev/sdc 4.0G 2.0G 0 100% /mnt/data250
|
||||
|
||||
$ btrfs fi df .
|
||||
Data, single: total=1.98GiB, used=1.98GiB
|
||||
System, DUP: total=8.00MiB, used=16.00KiB
|
||||
Metadata, DUP: total=1.00GiB, used=2.22MiB
|
||||
GlobalReserve, single: total=3.25MiB, used=0.00B
|
||||
|
||||
The data chunks have been exhausted, so there's really no space left to
write to. The metadata chunks have space but that can't be used for that purpose.
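To spot which chunk type ran out, the ``btrfs fi df`` output can be compared line by line. The following is a rough sketch that parses the output shown above and lists the chunk types whose used space has reached the allocated total. The function names and the 99% threshold are assumptions for illustration, and the parsing only covers the line format seen in this example.

```python
import re

UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}

# Matches lines such as "Data, single: total=1.98GiB, used=1.98GiB".
LINE = re.compile(
    r"(?P<type>\w+), (?P<profile>\w+): "
    r"total=(?P<total>[\d.]+)(?P<tunit>[KMGT]iB|B), "
    r"used=(?P<used>[\d.]+)(?P<uunit>[KMGT]iB|B)"
)


def to_bytes(number: str, unit: str) -> int:
    """Convert a printed size like ('1.98', 'GiB') to bytes (approximate)."""
    return int(float(number) * UNITS[unit])


def parse_fi_df(text: str) -> dict:
    """Map each chunk type to its profile, total and used bytes."""
    usage = {}
    for m in LINE.finditer(text):
        usage[m["type"]] = {
            "profile": m["profile"],
            "total": to_bytes(m["total"], m["tunit"]),
            "used": to_bytes(m["used"], m["uunit"]),
        }
    return usage


def full_types(usage: dict, threshold: float = 0.99) -> list:
    """Chunk types whose used/total ratio is at or above the threshold."""
    return [
        name
        for name, u in usage.items()
        if u["total"] > 0 and u["used"] / u["total"] >= threshold
    ]


sample = """\
Data, single: total=1.98GiB, used=1.98GiB
System, DUP: total=8.00MiB, used=16.00KiB
Metadata, DUP: total=1.00GiB, used=2.22MiB
GlobalReserve, single: total=3.25MiB, used=0.00B
"""
usage = parse_fi_df(sample)
print(full_types(usage))  # prints: ['Data']
```

The printed sizes are rounded, so the byte counts are approximate; the sketch is only meant to show that the data chunks are full while the metadata chunks are nearly empty.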
|
||||
|
||||
Metadata space got exhausted
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
When the metadata space is exhausted, the filesystem cannot track new data
extents and cannot create inline files, reflinks or xattrs. Deletion still works.
|
||||
|
||||
Balance does not have enough workspace
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Relocation of block groups requires a temporary work space, i.e. an area on the
device that's available for the filesystem but without any other existing block
groups. Before balance starts, a check is performed to verify the requested
action is possible. If not, ENOSPC is returned.
|
||||
|
||||
TODO
|
||||
----
|
||||
|
||||
Transient
|
||||
|
||||
|
|