btrfs-progs: docs: more docs updates

Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-10 16:20:34 +01:00 · 2022-01-10 16:20:34 +01:00 · 79ef78f0e4
parent df91bfd5d5
commit 79ef78f0e4
5 changed files with 181 additions and 27 deletions
--- a/Documentation/Tree-checker.rst
+++ b/Documentation/Tree-checker.rst
@ -1,24 +1,24 @@
 Tree checker
 ============
-Metadata blocks that have been just read from devices or are just about to be
+Tree checker is a feature that verifies metadata blocks before write or after
-written are verified and sanity checked by so called **tree checker**. The
+read from the devices.  The b-tree nodes contain several items describing the
-b-tree nodes contain several items describing the filesystem structure and to
+filesystem structures and to some degree can be verified for consistency or
-some degree can be verified for consistency or validity. This is additional
+validity. This is an additional check to the checksums that only verify the
-check to the checksums that only verify the overall block status while the tree
+overall block status while the tree checker tries to validate and cross
-checker tries to validate and cross reference the logical structure. This takes
+reference the logical structure. This takes a slight performance hit but is
-a slight performance hit but is comparable to calculating the checksum and has
+comparable to calculating the checksum and has no noticeable impact while it
-no noticeable impact while it does catch all sorts of errors.
+does catch all sorts of errors.
 There are two occasions when the checks are done:
 Pre-write checks
 ----------------
-When metadata blocks are in memory about to be written to the permanent storage,
+When metadata blocks are in memory and about to be written to the permanent
-the checks are performed, before the checksums are calculated. This can catch
+storage, the checks are performed, before the checksums are calculated. This
-random corruptions of the blocks (or pages) either caused by bugs or by other
+can catch random corruptions of the blocks (or pages) either caused by bugs or
-parts of the system or hardware errors (namely faulty RAM).
+by other parts of the system or hardware errors (namely faulty RAM).
 Once a block does not pass the checks, the filesystem refuses to write more data
 and turns itself to read-only mode to prevent further damage. At this point some
@ -28,6 +28,24 @@ the filesystem gets unmounted, the most recent changes are unfortunately lost.
 The filesystem that is stored on the device is still consistent and should mount
 fine.
 A message may look like:
 .. code-block::
   [ 1716.823895] BTRFS critical (device vdb): corrupt leaf: root=18446744073709551607 block=38092800 slot=0, invalid key objectid: has 1 expect 6 or [256, 18446744073709551360] or 18446744073709551604
   [ 1716.829499] BTRFS info (device vdb): leaf 38092800 gen 19 total ptrs 4 free space 15851 owner 18446744073709551607
   [ 1716.832891] BTRFS info (device vdb): refs 3 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 1506
   [ 1716.836054]  item 0 key (1 1 0) itemoff 16123 itemsize 160
   [ 1716.837993]          inode generation 1 size 0 mode 100600
   [ 1716.839760]  item 1 key (256 1 0) itemoff 15963 itemsize 160
   [ 1716.841742]          inode generation 4 size 0 mode 40755
   [ 1716.843393]  item 2 key (256 12 256) itemoff 15951 itemsize 12
   [ 1716.845320]  item 3 key (18446744073709551611 48 1) itemoff 15951 itemsize 0
   [ 1716.847505] BTRFS error (device vdb): block=38092800 write time tree block corruption detected
 The line(s) before the *write time tree block corruption detected* message are
 specific to the found error.
 Post-read checks
 ----------------
@ -36,6 +54,11 @@ checksum is found to be valid. This protects against changes to the metadata
 that could possibly also update the checksum, less likely to happen accidentally
 but rather due to intentional corruption or fuzzing.
 .. code-block::
   [ 4823.612832] BTRFS critical (device vdb): corrupt leaf: root=7 block=30474240 slot=0, invalid nritems, have 0 should not be 0 for non-root leaf
   [ 4823.616798] BTRFS error (device vdb): block=30474240 read time tree block corruption detected
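A minimal sketch of how the *corrupt leaf* lines shown above could be picked
apart when triaging logs. The regular expression is derived only from the two
sample messages in this document, so the exact field layout is an assumption
and may differ between kernel versions. The helper names (``parse_corrupt_line``,
``as_signed64``) are hypothetical. Large root numbers such as
``18446744073709551607`` are unsigned 64-bit values; reading them as signed
makes them easier to recognize.

```python
import re

# Layout assumed from the sample messages only; not a stable kernel interface.
CORRUPT_RE = re.compile(
    r"corrupt (?P<kind>\w+): root=(?P<root>\d+) "
    r"block=(?P<block>\d+) slot=(?P<slot>\d+), (?P<reason>.+)$"
)


def as_signed64(value):
    """Reinterpret an unsigned 64-bit number as a signed one."""
    return value - (1 << 64) if value >= (1 << 63) else value


def parse_corrupt_line(line):
    """Return the fields of a 'corrupt leaf' message, or None if no match."""
    m = CORRUPT_RE.search(line)
    if m is None:
        return None
    return {
        "kind": m["kind"],
        "root": as_signed64(int(m["root"])),
        "block": int(m["block"]),
        "slot": int(m["slot"]),
        "reason": m["reason"],
    }


sample = ("[ 4823.612832] BTRFS critical (device vdb): corrupt leaf: root=7 "
          "block=30474240 slot=0, invalid nritems, have 0 should not be 0 "
          "for non-root leaf")
print(parse_corrupt_line(sample))
```

The line that follows in the log (*write time* or *read time tree block
corruption detected*) tells which of the two occasions triggered the check.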
 The checks
 ----------
--- a/Documentation/ch-checksumming.rst
+++ b/Documentation/ch-checksumming.rst
@ -1,9 +1,12 @@
 Data and metadata are checksummed by default, the checksum is calculated before
-write and verifed after reading the blocks from devices.  There are several
+write and verified after reading the blocks from devices. The whole metadata
-checksum algorithms supported. The default and backward compatible is *crc32c*.
+block has a checksum stored inline in the b-tree node header, each data block
-Since kernel 5.5 there are three more with different characteristics and
+has a detached checksum stored in the checksum tree.
-trade-offs regarding speed and strength. The following list may help you to
+
-decide which one to select.
+There are several checksum algorithms supported. The default and backward
 compatible is *crc32c*.  Since kernel 5.5 there are three more with different
 characteristics and trade-offs regarding speed and strength. The following list
 may help you to decide which one to select.
 CRC32C (32bit digest)
        default, best backward compatibility, very fast, modern CPUs have
--- a/Documentation/ch-compression.rst
+++ b/Documentation/ch-compression.rst
@ -48,7 +48,7 @@ This will enable the ``zstd`` algorithm on the default level (which is 3).
 The level can be specified manually too like ``zstd:3``. Higher levels compress
 better at the cost of time. This in turn may cause increased write latency; low
 levels are suitable for real-time compression and on a reasonably fast CPU don't
-cause performance drops.
+cause noticeable performance drops.
 .. code-block:: shell
@ -145,9 +145,11 @@ Compatibility
 Compression is done using the COW mechanism so it's incompatible with
 *nodatacow*. Direct IO works on compressed files but will fall back to buffered
-writes and leads to recompression. Currently 'nodatasum' and compression don't
+writes and leads to recompression. Currently *nodatasum* and compression don't
 work together.
 The compression algorithms have been added over time so the version
 compatibility should also be considered, together with other tools that may
 access the compressed data like bootloaders.
--- a/Documentation/ch-subvolume-intro.rst
+++ b/Documentation/ch-subvolume-intro.rst
@ -1,6 +1,7 @@
 A BTRFS subvolume is a part of a filesystem with its own independent
-file/directory hierarchy. Subvolumes can share file extents. A snapshot is also
+file/directory hierarchy and inode number namespace. Subvolumes can share file
-subvolume, but with a given initial content of the original subvolume.
+extents. A snapshot is also a subvolume, but with a given initial content of the
 original subvolume.
 .. note::
   A subvolume in BTRFS is not like an LVM logical volume, which is block-level
@ -8,7 +9,9 @@ subvolume, but with a given initial content of the original subvolume.
 A subvolume looks like a normal directory, with some additional operations
 described below. Subvolumes can be renamed or moved, nesting subvolumes is not
-restricted but has some implications regarding snapshotting.
+restricted but has some implications regarding snapshotting. The numeric id
 (called *subvolid* or *rootid*) of the subvolume is persistent and cannot be
 changed.
 A subvolume in BTRFS can be accessed in two ways:
@ -30,10 +33,10 @@ do not affect the files in the original subvolume.
 Subvolume flags
 ---------------
-The subvolume flag currently implemented is the *ro* property. Read-write
+The subvolume flag currently implemented is the *ro* property (read-only
-subvolumes have that set to *false*, snapshots as *true*. In addition to that,
+status). Read-write subvolumes have that set to *false*, snapshots as *true*.
-a plain snapshot will also have last change generation and creation generation
+In addition to that, a plain snapshot will also have last change generation and
-equal.
+creation generation equal.
 Read-only snapshots are building blocks of incremental send (see
 ``btrfs-send(8)``) and the whole use case relies on unmodified snapshots where
@ -56,3 +59,36 @@ it by **btrfs property set** requires force if that is really desired by user.
   show** to identify them. Flipping the flags to read-only and back to
   read-write will reset the *received_uuid* manually.  There may exist a
   convenience tool in the future.
 Nested subvolumes
 -----------------
 There are no restrictions for subvolume creation, so it's up to the user how to
 organize them, whether to have a flat layout (all subvolumes are direct
 descendants of the toplevel one), or nested.
 What should be mentioned early is that snapshotting is not recursive, so a
 subvolume or a snapshot is effectively a barrier. This can be used
 intentionally but could be confusing in case of nested layouts.
 Case study: system root layouts
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 There are two ways the system root directory and subvolume layout could be
 organized. The interesting use case for root is to allow rollbacks to a previous
 version, as one atomic step. If the entire filesystem hierarchy starting in "/"
 is in one subvolume, taking a snapshot will encompass all files. This is easy for
 the snapshotting part but has undesirable consequences for rollback. For example,
 log files would get rolled back too, or any data that are stored on the root
 filesystem but are not meant to be rolled back either (database files, VM
 images, ...).
 Here we could utilize the snapshotting barrier mentioned above: each directory
 that stores data to be preserved across rollbacks is its own subvolume. This
 could be e.g. ``/var``. Further, more fine-grained partitioning could be done, e.g.
 adding separate subvolumes for ``/var/log``, ``/var/cache`` etc.
 Having separate subvolumes requires separate actions to take the
 snapshots (here it gets disconnected from the system root snapshots). This needs
 to be taken care of by system tools and installers, together with the selection of
 which directories are highly recommended to be separate subvolumes.
--- a/Documentation/trouble-index.rst
+++ b/Documentation/trouble-index.rst
@ -5,7 +5,97 @@ Troubleshooting pages
 Correctness related, permanent
- transid verify error
+Error: parent transid verify error
 ----------------------------------
 Reason: result of a failed internal consistency check of the filesystem's metadata.
 Type: permanent
 .. code-block::
   [ 4007.489730] BTRFS error (device vdb): parent transid verify failed on 30736384 wanted 10 found 8
 The b-tree nodes are linked together, a block pointer in the parent node
 contains target block offset and generation that last changed this block. The
 block it points to then upon read verifies that the block address and the
 generation match. This check is done on all tree levels.
 The number in **failed on 30736384** is the logical block number, **wanted 10**
 is the expected generation number in the parent node, **found 8** is the one
 found in the target block.  The difference between the generation numbers can
 give a hint when the problem could have happened, in terms of transaction
 commits.
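As an illustration of reading these fields, the following sketch extracts the
three numbers from the sample message and computes the generation gap. The
pattern is based only on the single message quoted above and the function name
``parse_transid_error`` is hypothetical, so treat it as a triage aid and not as
a description of a stable log format.

```python
import re

# Field layout assumed from the one sample line in this document.
TRANSID_RE = re.compile(
    r"parent transid verify failed on (?P<block>\d+) "
    r"wanted (?P<wanted>\d+) found (?P<found>\d+)"
)


def parse_transid_error(line):
    """Return (logical block, wanted gen, found gen, gap) or None."""
    m = TRANSID_RE.search(line)
    if m is None:
        return None
    block = int(m["block"])
    wanted = int(m["wanted"])
    found = int(m["found"])
    # The gap is the number of transaction commits between what the parent
    # node expects and what the target block actually carries.
    return block, wanted, found, wanted - found


sample = ("[ 4007.489730] BTRFS error (device vdb): parent transid verify "
          "failed on 30736384 wanted 10 found 8")
print(parse_transid_error(sample))
```

For the sample line the gap is 2, meaning the target block is two transaction
commits older than the parent node expects.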
 Once the mismatched generations are stored on the device, it's permanent and
 cannot be easily recovered, because of information loss. The recovery tool
 ``btrfs restore`` is able to ignore the errors and attempt to restore the data
 but due to the inconsistency in the metadata the data need to be verified by the
 user.
 The root cause of the error cannot be easily determined, possible reasons are:
 * logical bug: filesystem structures haven't been properly updated and stored
 * misdirected write: the underlying storage does not store the data to the exact
   address as expected and overwrites some other block
 * storage device (hardware or emulated) does not properly flush and persist data
   between transactions so they get mixed up
 * lost write without proper error handling: writing the block worked as viewed
   on the filesystem layer, but there was a problem on the lower layers not
   propagated upwards
 Error: No space left on device (ENOSPC)
 ---------------------------------------
 Type: transient
 Space handling on a COW filesystem is tricky, namely when it's in combination
 with delayed allocation, dynamic chunk allocation and parallel data updates.
 There are several reasons why ENOSPC might get reported and there's not just
 a single cause and solution. The space reservation algorithms try to fairly
 assign the space, fall back to heuristics or block writes until enough data are
 persisted, which may make the space of old copies available.
 The most obvious way to exhaust space is to create a file until the data
 chunks are full:
 .. code-block::
   $ df -h .
   Filesystem      Size  Used Avail Use% Mounted on
   /dev/sda        4.0G  3.6M  2.0G   1% /mnt/
   $ cat /dev/zero > file
   cat: write error: No space left on device
   $ df -h .
   Filesystem      Size  Used Avail Use% Mounted on
   /dev/sdc        4.0G  2.0G     0 100% /mnt/data250
   $ btrfs fi df .
   Data, single: total=1.98GiB, used=1.98GiB
   System, DUP: total=8.00MiB, used=16.00KiB
   Metadata, DUP: total=1.00GiB, used=2.22MiB
   GlobalReserve, single: total=3.25MiB, used=0.00B
 The data chunks have been exhausted, so there's really no space left to write
 to. The metadata chunks have space but that can't be used for that purpose.
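The same reading of the ``btrfs fi df`` output can be sketched in code. The
example below parses the lines shown above and reports which chunk types are
effectively full. The line format is assumed from that sample output only, the
99% threshold is an arbitrary choice, and ``full_chunk_types`` is a
hypothetical helper. Note that *total* is the space allocated to chunks of that
type, so a full type only means ENOSPC when no new chunk can be allocated.

```python
import re

UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}

# Format assumed from the sample output: "Type, PROFILE: total=X, used=Y".
LINE_RE = re.compile(
    r"^(?P<type>\w+), (?P<profile>\w+): "
    r"total=(?P<total>[\d.]+)(?P<tunit>[KMGT]iB|B), "
    r"used=(?P<used>[\d.]+)(?P<uunit>[KMGT]iB|B)$"
)


def to_bytes(value, unit):
    """Convert a number with a binary unit suffix to bytes."""
    return float(value) * UNITS[unit]


def full_chunk_types(fi_df_output, threshold=0.99):
    """Return chunk types whose used/total ratio reaches the threshold."""
    full = []
    for line in fi_df_output.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip lines that do not look like a usage line
        total = to_bytes(m["total"], m["tunit"])
        used = to_bytes(m["used"], m["uunit"])
        if total > 0 and used / total >= threshold:
            full.append(m["type"])
    return full


sample = """\
Data, single: total=1.98GiB, used=1.98GiB
System, DUP: total=8.00MiB, used=16.00KiB
Metadata, DUP: total=1.00GiB, used=2.22MiB
GlobalReserve, single: total=3.25MiB, used=0.00B
"""
print(full_chunk_types(sample))
```

For the sample output only the data chunks are reported as full, matching the
explanation above.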
 Metadata space got exhausted
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 Cannot track new data extents, no inline files, no reflinks, no xattrs.
 Deletion still works.
 Balance does not have enough workspace
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 Relocation of block groups requires a temporary work space, i.e. an area on the
 device that's available for the filesystem but without any other existing block
 groups. Before balance starts, a check is performed to verify the requested
 action is possible. If not, ENOSPC is returned.
 TODO
 ----
 Transient