diff --git a/Documentation/Tree-checker.rst b/Documentation/Tree-checker.rst index 09597373..5f1cafd0 100644 --- a/Documentation/Tree-checker.rst +++ b/Documentation/Tree-checker.rst @@ -1,24 +1,24 @@ Tree checker ============ -Metadata blocks that have been just read from devices or are just about to be -written are verified and sanity checked by so called **tree checker**. The -b-tree nodes contain several items describing the filesystem structure and to -some degree can be verified for consistency or validity. This is additional -check to the checksums that only verify the overall block status while the tree -checker tries to validate and cross reference the logical structure. This takes -a slight performance hit but is comparable to calculating the checksum and has -no noticeable impact while it does catch all sorts of errors. +Tree checker is a feature that verifies metadata blocks before write or after +read from the devices. The b-tree nodes contain several items describing the +filesystem structures and to some degree can be verified for consistency or +validity. This is an additional check to the checksums that only verify the +overall block status while the tree checker tries to validate and cross +reference the logical structure. This takes a slight performance hit but is +comparable to calculating the checksum and has no noticeable impact while it +does catch all sorts of errors. There are two occasions when the checks are done: Pre-write checks ---------------- -When metadata blocks are in memory about to be written to the permanent storage, -the checks are performed, before the checksums are calculated. This can catch -random corruptions of the blocks (or pages) either caused by bugs or by other -parts of the system or hardware errors (namely faulty RAM). +When metadata blocks are in memory and about to be written to the permanent +storage, the checks are performed, before the checksums are calculated. 
This +can catch random corruptions of the blocks (or pages) either caused by bugs or +by other parts of the system or hardware errors (namely faulty RAM). Once a block does not pass the checks, the filesystem refuses to write more data and turns itself to read-only mode to prevent further damage. At this point some @@ -28,6 +28,24 @@ the filesystem gets unmounted, the most recent changes are unfortunately lost. The filesystem that is stored on the device is still consistent and should mount fine. +A message may look like: + +.. code-block:: + + [ 1716.823895] BTRFS critical (device vdb): corrupt leaf: root=18446744073709551607 block=38092800 slot=0, invalid key objectid: has 1 expect 6 or [256, 18446744073709551360] or 18446744073709551604 + [ 1716.829499] BTRFS info (device vdb): leaf 38092800 gen 19 total ptrs 4 free space 15851 owner 18446744073709551607 + [ 1716.832891] BTRFS info (device vdb): refs 3 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 1506 + [ 1716.836054] item 0 key (1 1 0) itemoff 16123 itemsize 160 + [ 1716.837993] inode generation 1 size 0 mode 100600 + [ 1716.839760] item 1 key (256 1 0) itemoff 15963 itemsize 160 + [ 1716.841742] inode generation 4 size 0 mode 40755 + [ 1716.843393] item 2 key (256 12 256) itemoff 15951 itemsize 12 + [ 1716.845320] item 3 key (18446744073709551611 48 1) itemoff 15951 itemsize 0 + [ 1716.847505] BTRFS error (device vdb): block=38092800 write time tree block corruption detected + +The line(s) before the *write time tree block corruption detected* message is +specific to the found error. + Post-read checks ---------------- @@ -36,6 +54,11 @@ checksum is found to be valid. This protects against changes to the metadata that could possibly also update the checksum, less likely to happen accidentally but rather due to intentional corruption or fuzzing. +.. 
code-block:: + + [ 4823.612832] BTRFS critical (device vdb): corrupt leaf: root=7 block=30474240 slot=0, invalid nritems, have 0 should not be 0 for non-root leaf + [ 4823.616798] BTRFS error (device vdb): block=30474240 read time tree block corruption detected + The checks ---------- diff --git a/Documentation/ch-checksumming.rst b/Documentation/ch-checksumming.rst index 143bb0b2..d69b42bb 100644 --- a/Documentation/ch-checksumming.rst +++ b/Documentation/ch-checksumming.rst @@ -1,9 +1,12 @@ Data and metadata are checksummed by default, the checksum is calculated before -write and verifed after reading the blocks from devices. There are several -checksum algorithms supported. The default and backward compatible is *crc32c*. -Since kernel 5.5 there are three more with different characteristics and -trade-offs regarding speed and strength. The following list may help you to -decide which one to select. +write and verified after reading the blocks from devices. The whole metadata +block has a checksum stored inline in the b-tree node header, each data block +has a detached checksum stored in the checksum tree. + +There are several checksum algorithms supported. The default and backward +compatible is *crc32c*. Since kernel 5.5 there are three more with different +characteristics and trade-offs regarding speed and strength. The following list +may help you to decide which one to select. CRC32C (32bit digest) default, best backward compatibility, very fast, modern CPUs have diff --git a/Documentation/ch-compression.rst b/Documentation/ch-compression.rst index c319d88a..b3459b3e 100644 --- a/Documentation/ch-compression.rst +++ b/Documentation/ch-compression.rst @@ -48,7 +48,7 @@ This will enable the ``zstd`` algorithm on the default level (which is 3). The level can be specified manually too like ``zstd:3``. Higher levels compress better at the cost of time. 
This in turn may cause increased write latency, low levels are suitable for real-time compression and on reasonably fast CPU don't -cause performance drops. +cause noticeable performance drops. .. code-block:: shell @@ -145,9 +145,11 @@ Compatibility Compression is done using the COW mechanism so it's incompatible with *nodatacow*. Direct IO works on compressed files but will fall back to buffered -writes and leads to recompression. Currently 'nodatasum' and compression don't +writes and leads to recompression. Currently *nodatasum* and compression don't work together. The compression algorithms have been added over time so the version compatibility should be also considered, together with other tools that may access the compressed data like bootloaders. + + diff --git a/Documentation/ch-subvolume-intro.rst b/Documentation/ch-subvolume-intro.rst index ca5f5331..a87ed66a 100644 --- a/Documentation/ch-subvolume-intro.rst +++ b/Documentation/ch-subvolume-intro.rst @@ -1,6 +1,7 @@ A BTRFS subvolume is a part of filesystem with its own independent -file/directory hierarchy. Subvolumes can share file extents. A snapshot is also -subvolume, but with a given initial content of the original subvolume. +file/directory hierarchy and inode number namespace. Subvolumes can share file +extents. A snapshot is also a subvolume, but with a given initial content of the +original subvolume. .. note:: A subvolume in BTRFS is not like an LVM logical volume, which is block-level @@ -8,7 +9,9 @@ subvolume, but with a given initial content of the original subvolume. A subvolume looks like a normal directory, with some additional operations described below. Subvolumes can be renamed or moved, nesting subvolumes is not -restricted but has some implications regarding snapshotting. +restricted but has some implications regarding snapshotting. The numeric id +(called *subvolid* or *rootid*) of the subvolume is persistent and cannot be +changed. 
A subvolume in BTRFS can be accessed in two ways: @@ -30,10 +33,10 @@ do not affect the files in the original subvolume. Subvolume flags --------------- -The subvolume flag currently implemented is the *ro* property. Read-write -subvolumes have that set to *false*, snapshots as *true*. In addition to that, -a plain snapshot will also have last change generation and creation generation -equal. +The subvolume flag currently implemented is the *ro* property (read-only +status). Read-write subvolumes have that set to *false*, snapshots as *true*. +In addition to that, a plain snapshot will also have last change generation and +creation generation equal. Read-only snapshots are building blocks of incremental send (see ``btrfs-send(8)``) and the whole use case relies on unmodified snapshots where @@ -56,3 +59,36 @@ it by **btrfs property set** requires force if that is really desired by user. show** to identify them. Flipping the flags to read-only and back to read-write will reset the *received_uuid* manually. There may exist a convenience tool in the future. + +Nested subvolumes +----------------- + +There are no restrictions for subvolume creation, so it's up to the user how to +organize them, whether to have a flat layout (all subvolumes are direct +descendants of the toplevel one), or nested. + +What should be mentioned early is that snapshotting is not recursive, so a +subvolume or a snapshot is effectively a barrier. This can be used +intentionally but could be confusing in case of nested layouts. + +Case study: system root layouts +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +There are two ways the system root directory and subvolume layout could be +organized. The interesting use case for root is to allow rollbacks to a previous +version, as one atomic step. If the entire filesystem hierarchy starting in "/" +is in one subvolume, taking a snapshot will encompass all files. This is easy for +the snapshotting part but has undesirable consequences for rollback. 
For example, +log files would get rolled back too, or any data that are stored on the root +filesystem but are not meant to be rolled back either (database files, VM +images, ...). + +Here we could utilize the snapshotting barrier mentioned above, each directory +that stores data to be preserved across rollbacks is its own subvolume. This +could be eg. ``/var``. Further, more fine-grained partitioning could be done, eg. +adding separate subvolumes for ``/var/log``, ``/var/cache`` etc. + +Having separate subvolumes requires separate actions to take the +snapshots (here it gets disconnected from the system root snapshots). This needs +to be taken care of by system tools, installers together with selection of which +directories are highly recommended to be separate subvolumes. diff --git a/Documentation/trouble-index.rst b/Documentation/trouble-index.rst index fc2f04ae..b93be469 100644 --- a/Documentation/trouble-index.rst +++ b/Documentation/trouble-index.rst @@ -5,7 +5,97 @@ Troubleshooting pages Correctness related, permanent -- transid verify error +Error: parent transid verify error +---------------------------------- + +Reason: result of a failed internal consistency check of the filesystem's metadata. +Type: permanent + +.. code-block:: + + [ 4007.489730] BTRFS error (device vdb): parent transid verify failed on 30736384 wanted 10 found 8 + +The b-tree nodes are linked together, a block pointer in the parent node +contains target block offset and generation that last changed this block. The +block it points to then upon read verifies that the block address and the +generation match. This check is done on all tree levels. + +The number in **failed on 30736384** is the logical block number, **wanted 10** +is the expected generation number in the parent node, **found 8** is the one +found in the target block. The difference between the generations can +give a hint when the problem could have happened, in terms of transaction +commits. 
+ +Once the mismatched generations are stored on the device, it's permanent and +cannot be easily recovered, because of information loss. The recovery tool +``btrfs restore`` is able to ignore the errors and attempt to restore the data +but due to the inconsistency in the metadata the data need to be verified by the +user. + +The root cause of the error cannot be easily determined, possible reasons are: + +* logical bug: filesystem structures haven't been properly updated and stored +* misdirected write: the underlying storage does not store the data at the + expected address and overwrites some other block +* storage device (hardware or emulated) does not properly flush and persist data + between transactions so they get mixed up +* lost write without proper error handling: writing the block worked as viewed + on the filesystem layer, but there was a problem on the lower layers not + propagated upwards + +Error: No space left on device (ENOSPC) +--------------------------------------- + +Type: transient + +Space handling on a COW filesystem is tricky, namely when it's in combination +with delayed allocation, dynamic chunk allocation and parallel data updates. +There are several reasons why the ENOSPC might get reported and there's not just +a single cause and solution. The space reservation algorithms try to fairly +assign the space, fall back to heuristics or block writes until enough data are +persisted, possibly making old copies available. + +The most obvious way to exhaust space is to write to a file until the data +chunks are full: + +.. code-block:: + + $ df -h . + Filesystem Size Used Avail Use% Mounted on + /dev/sda 4.0G 3.6M 2.0G 1% /mnt/ + + $ cat /dev/zero > file + cat: write error: No space left on device + + $ df -h . + Filesystem Size Used Avail Use% Mounted on + /dev/sdc 4.0G 2.0G 0 100% /mnt/data250 + + $ btrfs fi df . 
+ Data, single: total=1.98GiB, used=1.98GiB + System, DUP: total=8.00MiB, used=16.00KiB + Metadata, DUP: total=1.00GiB, used=2.22MiB + GlobalReserve, single: total=3.25MiB, used=0.00B + +The data chunks have been exhausted, so there's really no space left where to +write. The metadata chunks have space but that can't be used for that purpose. + +Metadata space got exhausted +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Cannot track new data extents, no inline files, no reflinks, no xattrs. +Deletion still works. + +Balance does not have enough workspace +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Relocation of block groups requires a temporary work space, ie. area on the +device that's available for the filesystem but without any other existing block +groups. Before balance starts a check is performed to verify the requested +action is possible. If not, ENOSPC is returned. + +TODO +---- Transient