diff --git a/Documentation/Balance.rst b/Documentation/Balance.rst index 29e0b4df..64f3d789 100644 --- a/Documentation/Balance.rst +++ b/Documentation/Balance.rst @@ -1,4 +1,9 @@ Balance ======= -... +.. include:: ch-balance-intro.rst + +Filters +------- + +.. include:: ch-balance-filters.rst diff --git a/Documentation/Common-features.rst b/Documentation/Common-features.rst index 81ba3c0e..2eaeed00 100644 --- a/Documentation/Common-features.rst +++ b/Documentation/Common-features.rst @@ -1,20 +1,44 @@ Common Linux features ===================== -Anything that's standard and also supported +The Linux operating system implements the POSIX standard interfaces and API with +additional interfaces. Many of them have become common in other filesystems. The +ones listed below have been added relatively recently and are considered +interesting for users: -- statx +birth/origin inode time + a timestamp recording when an inode was created, cannot be + changed and requires the *statx* syscall to be read -- fallocate modes +statx + an extended version of the *stat* syscall that provides an extensible + interface to read more information that is not available in the original + *stat* -- birth/origin inode time +fallocate modes + the *fallocate* syscall allows manipulating file extents, like punching + holes, preallocation or zeroing a range -- filesystem label +FIEMAP + an ioctl that enumerates file extents, related tool is ``filefrag`` -- xattr, acl +filesystem label + another filesystem identification, could be used for mount or for better + recognition, can be set or read by an ioctl or by command ``btrfs + filesystem label`` -- FIEMAP +O_TMPFILE + mode of open() syscall that creates a file with no associated directory + entry, which makes it impossible to be seen by other processes and is + thus safe to be used as a temporary file + (https://lwn.net/Articles/619146/) -- O_TMPFILE +xattr, acl + extended attributes (xattr) are a list of *key=value* pairs associated + with a 
file, usually storing additional metadata related to security, + access control lists (ACL) in particular, or properties (``btrfs + property``) - XFLAGS, fileattr + +- cross-rename diff --git a/Documentation/Custom-ioctls.rst b/Documentation/Custom-ioctls.rst index 11f08280..34a00050 100644 --- a/Documentation/Custom-ioctls.rst +++ b/Documentation/Custom-ioctls.rst @@ -1,16 +1,21 @@ Custom ioctls ============= -Anything that's not doing the other features and stands on it's own +Filesystems are usually extended by custom ioctls beyond the standard system +call interface to let user applications access the advanced features. They're +low level and the following list gives only an overview of the capabilities or +a command if available: -- reverse lookup, from file offset to inode +- reverse lookup, from file offset to inode, ``btrfs inspect-internal + logical-resolve`` -- resolve inode number -> name +- resolve inode number to list of names, ``btrfs inspect-internal inode-resolve`` -- file offset -> all inodes that share it +- tree search, given a key range and tree id, look up and return all b-tree items + found in that range, basically all metadata at your hand but you need to know + what to do with them -- tree search, all the metadata at your hand (if you know what to do with them) +- informative, about devices, space allocation or the whole filesystem, many of + which are also exported in ``/sys/fs/btrfs`` -- informative (device, fs, space) - -- query/set a subset of features on a mounted fs +- query/set a subset of features on a mounted filesystem diff --git a/Documentation/Defragmentation.rst b/Documentation/Defragmentation.rst index 87bed47d..8fc309a8 100644 --- a/Documentation/Defragmentation.rst +++ b/Documentation/Defragmentation.rst @@ -18,5 +18,5 @@ happens inside the page cache, that is the central point caching the file data and takes care of synchronization. 
Once a filesystem sync or flush is started (either manually or automatically) all the dirty data get written to the devices. This however reduces the chances to find optimal layout as the writes -happen together with other data and the result depens on the remaining free +happen together with other data and the result depends on the remaining free space layout and fragmentation. diff --git a/Documentation/Reflink.rst b/Documentation/Reflink.rst index 98c1e232..ae2498f9 100644 --- a/Documentation/Reflink.rst +++ b/Documentation/Reflink.rst @@ -14,7 +14,7 @@ also copied, though there are no ready-made tools for that. cp --reflink=always source target -There are some constaints: +There are some constraints: - cross-filesystem reflink is not possible, there's nothing in common between so the block sharing can't work diff --git a/Documentation/Resize.rst b/Documentation/Resize.rst index 5efca120..da19a19e 100644 --- a/Documentation/Resize.rst +++ b/Documentation/Resize.rst @@ -3,8 +3,8 @@ Resize A BTRFS mounted filesystem can be resized after creation, grown or shrunk. On a multi device filesystem the space occupied on each device can be resized -independently. Data tha reside in the are that would be out of the new size are -relocated to the remaining space below the limit, so this constrains the +independently. Data that reside in the area that would be out of the new size +are relocated to the remaining space below the limit, so this constrains the minimum size to which a filesystem can be shrunk. Growing a filesystem is quick as it only needs to take note of the available diff --git a/Documentation/Subvolumes.rst b/Documentation/Subvolumes.rst index 2475956f..2ea6034e 100644 --- a/Documentation/Subvolumes.rst +++ b/Documentation/Subvolumes.rst @@ -1,4 +1,4 @@ Subvolumes ========== -... +.. 
include:: ch-subvolume-intro.rst diff --git a/Documentation/Tree-checker.rst b/Documentation/Tree-checker.rst index 43b1fff2..09597373 100644 --- a/Documentation/Tree-checker.rst +++ b/Documentation/Tree-checker.rst @@ -1,6 +1,53 @@ Tree checker ============ +Metadata blocks that have been just read from devices or are just about to be +written are verified and sanity checked by the so-called **tree checker**. The +b-tree nodes contain several items describing the filesystem structure and to +some degree can be verified for consistency or validity. This is an additional +check to the checksums, which only verify the overall block status, while the tree +checker tries to validate and cross reference the logical structure. This takes +a slight performance hit but is comparable to calculating the checksum and has +no noticeable impact while it does catch all sorts of errors. + +There are two occasions when the checks are done: + Pre-write checks +---------------- + +When metadata blocks are in memory about to be written to the permanent storage, +the checks are performed, before the checksums are calculated. This can catch +random corruptions of the blocks (or pages) either caused by bugs or by other +parts of the system or hardware errors (namely faulty RAM). + +Once a block does not pass the checks, the filesystem refuses to write more data +and turns itself to read-only mode to prevent further damage. At this point some of +the recent metadata updates are held *only* in memory so it's best to not panic +and try to remember what files could be affected and copy them elsewhere. Once +the filesystem gets unmounted, the most recent changes are unfortunately lost. +The filesystem that is stored on the device is still consistent and should mount +fine. Post-read checks +---------------- + +Metadata blocks get verified right after they're read from devices and the +checksum is found to be valid. 
This protects against changes to the metadata +that could possibly also update the checksum, less likely to happen accidentally +but rather due to intentional corruption or fuzzing. + +The checks +---------- + +As implemented right now, the metadata consistency is limited to one b-tree node +and what items are stored there, ie. there's no extensive or broad check done +eg. against other data structures in other b-tree nodes. This still provides +enough opportunities to verify consistency of individual items, besides verifying +general validity of the items like the length or offset. The b-tree items are +also coupled with a key so proper key ordering is also part of the check and can +reveal random bitflips in the sequence (this has been the most successful +detector of faulty RAM). + +The capabilities of tree checker have been improved over time and it's possible +that a filesystem created on an older kernel may trigger warnings or fail some +checks on a new one. diff --git a/Documentation/Trim.rst b/Documentation/Trim.rst index 13e2842d..a9e52f1c 100644 --- a/Documentation/Trim.rst +++ b/Documentation/Trim.rst @@ -1,4 +1,41 @@ -Trim -==== +Trim/discard +============ -... +Trim or discard is an operation on a storage device based on flash technology +(SSD, NVMe or similar), a thin-provisioned device, or it could be emulated on top +of other block device types. On real hardware, there's a different lifetime +span of the memory cells and the driver firmware usually tries to optimize for +that. The trim operation issued by the user provides hints about what data are +unused and allows reclaiming the memory cells. On thin-provisioned or emulated +devices this could simply free the space. 
+ +There are three main uses of trim that BTRFS supports: + +synchronous + enabled by mounting filesystem with ``-o discard`` or ``-o + discard=sync``, the trim is done right after the file extents get freed, + this however could have a severe performance hit and is not recommended + as the ranges to be trimmed could be too fragmented + +asynchronous + enabled by mounting filesystem with ``-o discard=async``, which is an + improved version of the synchronous trim where the freed file extents + are first tracked in memory and after a period or enough ranges accumulate + the trim is started, expecting the ranges to be much larger and + allowing the number of IO requests to be throttled so that it does not interfere + with the rest of the filesystem activity + +manually by fstrim + the tool ``fstrim`` starts a trim operation on the whole filesystem, no + mount options need to be specified, so it's up to the filesystem to + traverse the free space and start the trim, this is suitable for running + it as a periodic service + +The trim is considered only a hint to the device, it could ignore it completely, +start it only on ranges meeting some criteria, or decide not to do it because of +other factors affecting the memory cells. The device itself could internally +relocate the data, however this leads to an unexpected performance drop. Running +trim periodically could prevent that too. + +When a filesystem is created by ``mkfs.btrfs`` and is capable of trim, then it's +by default performed on all devices. diff --git a/Documentation/Volume-management.rst b/Documentation/Volume-management.rst index 21adbfe8..c23400c3 100644 --- a/Documentation/Volume-management.rst +++ b/Documentation/Volume-management.rst @@ -1,4 +1,4 @@ Volume management ================= -... +.. 
include:: ch-volume-management-intro.rst diff --git a/Documentation/Zoned-mode.rst b/Documentation/Zoned-mode.rst index 52a747cd..8e584ce2 100644 --- a/Documentation/Zoned-mode.rst +++ b/Documentation/Zoned-mode.rst @@ -1,4 +1,4 @@ Zoned mode ========== -... +.. include:: ch-zoned-intro.rst diff --git a/Documentation/btrfs-balance.rst b/Documentation/btrfs-balance.rst index 1d3a6c74..352965bf 100644 --- a/Documentation/btrfs-balance.rst +++ b/Documentation/btrfs-balance.rst @@ -9,68 +9,7 @@ SYNOPSIS DESCRIPTION ----------- -The primary purpose of the balance feature is to spread block groups across -all devices so they match constraints defined by the respective profiles. See -``mkfs.btrfs(8)`` section *PROFILES* for more details. -The scope of the balancing process can be further tuned by use of filters that -can select the block groups to process. Balance works only on a mounted -filesystem. Extent sharing is preserved and reflinks are not broken. -Files are not defragmented nor recompressed, file extents are preserved -but the physical location on devices will change. - -The balance operation is cancellable by the user. The on-disk state of the -filesystem is always consistent so an unexpected interruption (eg. system crash, -reboot) does not corrupt the filesystem. The progress of the balance operation -is temporarily stored as an internal state and will be resumed upon mount, -unless the mount option *skip_balance* is specified. - -.. warning:: - Running balance without filters will take a lot of time as it basically move - data/metadata from the whol filesystem and needs to update all block - pointers. - -The filters can be used to perform following actions: - -- convert block group profiles (filter *convert*) -- make block group usage more compact (filter *usage*) -- perform actions only on a given device (filters *devid*, *drange*) - -The filters can be applied to a combination of block group types (data, -metadata, system). 
Note that changing only the *system* type needs the force -option. Otherwise *system* gets automatically converted whenever *metadata* -profile is converted. - -When metadata redundancy is reduced (eg. from RAID1 to single) the force option -is also required and it is noted in system log. - -.. note:: - The balance operation needs enough work space, ie. space that is completely - unused in the filesystem, otherwise this may lead to ENOSPC reports. See - the section *ENOSPC* for more details. - -COMPATIBILITY -------------- - -.. note:: - - The balance subcommand also exists under the **btrfs filesystem** namespace. - This still works for backward compatibility but is deprecated and should not - be used any more. - -.. note:: - A short syntax **btrfs balance ** works due to backward compatibility - but is deprecated and should not be used any more. Use **btrfs balance start** - command instead. - -PERFORMANCE IMPLICATIONS ------------------------- - -Balancing operations are very IO intensive and can also be quite CPU intensive, -impacting other ongoing filesystem operations. Typically large amounts of data -are copied from one location to another, with corresponding metadata updates. - -Depending upon the block group layout, it can also be seek heavy. Performance -on rotational devices is noticeably worse compared to SSDs or fast arrays. +.. include:: ch-balance-intro.rst SUBCOMMAND ---------- @@ -148,89 +87,7 @@ status [-v] FILTERS ------- -From kernel 3.3 onwards, btrfs balance can limit its action to a subset of the -whole filesystem, and can be used to change the replication configuration (e.g. -moving data from single to RAID1). This functionality is accessed through the -*-d*, *-m* or *-s* options to btrfs balance start, which filter on data, -metadata and system blocks respectively. - -A filter has the following structure: *type[=params][,type=...]* - -The available types are: - -profiles= - Balances only block groups with the given profiles. 
Parameters - are a list of profile names separated by "*|*" (pipe). - -usage=, usage= - Balances only block groups with usage under the given percentage. The - value of 0 is allowed and will clean up completely unused block groups, this - should not require any new work space allocated. You may want to use *usage=0* - in case balance is returning ENOSPC and your filesystem is not too full. - - The argument may be a single value or a range. The single value *N* means *at - most N percent used*, equivalent to *..N* range syntax. Kernels prior to 4.4 - accept only the single value format. - The minimum range boundary is inclusive, maximum is exclusive. - -devid= - Balances only block groups which have at least one chunk on the given - device. To list devices with ids use **btrfs filesystem show**. - -drange= - Balance only block groups which overlap with the given byte range on any - device. Use in conjunction with *devid* to filter on a specific device. The - parameter is a range specified as *start..end*. - -vrange= - Balance only block groups which overlap with the given byte range in the - filesystem's internal virtual address space. This is the address space that - most reports from btrfs in the kernel log use. The parameter is a range - specified as *start..end*. - -convert= - Convert each selected block group to the given profile name identified by - parameters. - - .. note:: - Starting with kernel 4.5, the *data* chunks can be converted to/from the - *DUP* profile on a single device. - - .. note:: - Starting with kernel 4.6, all profiles can be converted to/from *DUP* on - multi-device filesystems. - -limit=, limit= - Process only given number of chunks, after all filters are applied. This can be - used to specifically target a chunk in connection with other filters (*drange*, - *vrange*) or just simply limit the amount of work done by a single balance run. - - The argument may be a single value or a range. 
The single value *N* means *at - most N chunks*, equivalent to *..N* range syntax. Kernels prior to 4.4 accept - only the single value format. The range minimum and maximum are inclusive. - -stripes= - Balance only block groups which have the given number of stripes. The parameter - is a range specified as *start..end*. Makes sense for block group profiles that - utilize striping, ie. RAID0/10/5/6. The range minimum and maximum are - inclusive. - -soft - Takes no parameters. Only has meaning when converting between profiles. - When doing convert from one profile to another and soft mode is on, - chunks that already have the target profile are left untouched. - This is useful e.g. when half of the filesystem was converted earlier but got - cancelled. - - The soft mode switch is (like every other filter) per-type. - For example, this means that we can convert metadata chunks the "hard" way - while converting data chunks selectively with soft switch. - -Profile names, used in *profiles* and *convert* are one of: *raid0*, *raid1*, -*raid1c3*, *raid1c4*, *raid10*, *raid5*, *raid6*, *dup*, *single*. The mixed -data/metadata profiles can be converted in the same way, but it's conversion -between mixed and non-mixed is not implemented. For the constraints of the -profiles please refer to ``mkfs.btrfs(8)``, section *PROFILES*. +.. include:: ch-balance-filters.rst ENOSPC ------ diff --git a/Documentation/btrfs-device.rst b/Documentation/btrfs-device.rst index dda712bb..233cb713 100644 --- a/Documentation/btrfs-device.rst +++ b/Documentation/btrfs-device.rst @@ -14,36 +14,7 @@ The **btrfs device** command group is used to manage devices of the btrfs filesy DEVICE MANAGEMENT ----------------- -Btrfs filesystem can be created on top of single or multiple block devices. -Data and metadata are organized in allocation profiles with various redundancy -policies. 
There's some similarity with traditional RAID levels, but this could -be confusing to users familiar with the traditional meaning. Due to the -similarity, the RAID terminology is widely used in the documentation. See -``mkfs.btrfs(8)`` for more details and the exact profile capabilities and -constraints. - -The device management works on a mounted filesystem. Devices can be added, -removed or replaced, by commands provided by **btrfs device** and **btrfs replace**. - -The profiles can be also changed, provided there's enough workspace to do the -conversion, using the **btrfs balance** command and namely the filter *convert*. - -Type - The block group profile type is the main distinction of the information stored - on the block device. User data are called *Data*, the internal data structures - managed by filesystem are *Metadata* and *System*. - -Profile - A profile describes an allocation policy based on the redundancy/replication - constraints in connection with the number of devices. The profile applies to - data and metadata block groups separately. Eg. *single*, *RAID1*. - -RAID level - Where applicable, the level refers to a profile that matches constraints of the - standard RAID levels. At the moment the supported ones are: RAID0, RAID1, - RAID10, RAID5 and RAID6. - -See the section *TYPICAL USECASES* for some examples. +.. include:: ch-volume-management-intro.rst SUBCOMMAND ---------- @@ -76,7 +47,7 @@ remove [options] | [|...] Device removal must satisfy the profile constraints, otherwise the command fails. The filesystem must be converted to profile(s) that would allow the removal. This can typically happen when going down from 2 devices to 1 and - using the RAID1 profile. See the *TYPICAL USECASES* section below. + using the RAID1 profile. See the section *TYPICAL USECASES*. The operation can take long as it needs to move all data from the device. 
@@ -217,94 +188,6 @@ usage [options] [...]:: If conflicting options are passed, the last one takes precedence. 
-TYPICAL USECASES ----------------- - -STARTING WITH A SINGLE-DEVICE FILESYSTEM -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Assume we've created a filesystem on a block device */dev/sda* with profile -*single/single* (data/metadata), the device size is 50GiB and we've used the -whole device for the filesystem. The mount point is */mnt*. - -The amount of data stored is 16GiB, metadata have allocated 2GiB. - -ADD NEW DEVICE -"""""""""""""" - -We want to increase the total size of the filesystem and keep the profiles. The -size of the new device */dev/sdb* is 100GiB. - -.. code-block:: bash - - $ btrfs device add /dev/sdb /mnt - -The amount of free data space increases by less than 100GiB, some space is -allocated for metadata. - -CONVERT TO RAID1 -"""""""""""""""" - -Now we want to increase the redundancy level of both data and metadata, but -we'll do that in steps. Note, that the device sizes are not equal and we'll use -that to show the capabilities of split data/metadata and independent profiles. - -The constraint for RAID1 gives us at most 50GiB of usable space and exactly 2 -copies will be stored on the devices. - -First we'll convert the metadata. As the metadata occupy less than 50GiB and -there's enough workspace for the conversion process, we can do: - -.. code-block:: bash - - $ btrfs balance start -mconvert=raid1 /mnt - -This operation can take a while, because all metadata have to be moved and all -block pointers updated. Depending on the physical locations of the old and new -blocks, the disk seeking is the key factor affecting performance. - -You'll note that the system block group has been also converted to RAID1, this -normally happens as the system block group also holds metadata (the physical to -logical mappings). 
- -What changed: - -* available data space decreased by 3GiB, usable roughly (50 - 3) + (100 - 3) = 144 GiB -* metadata redundancy increased - -IOW, the unequal device sizes allow for combined space for data yet improved -redundancy for metadata. If we decide to increase redundancy of data as well, -we're going to lose 50GiB of the second device for obvious reasons. - -.. code-block:: bash - - $ btrfs balance start -dconvert=raid1 /mnt - -The balance process needs some workspace (ie. a free device space without any -data or metadata block groups) so the command could fail if there's too much -data or the block groups occupy the whole first device. - -The device size of */dev/sdb* as seen by the filesystem remains unchanged, but -the logical space from 50-100GiB will be unused. - -REMOVE DEVICE -""""""""""""" - -Device removal must satisfy the profile constraints, otherwise the command -fails. For example: - -.. code-block:: bash - - $ btrfs device remove /dev/sda /mnt - ERROR: error removing device '/dev/sda': unable to go below two devices on raid1 - -In order to remove a device, you need to convert the profile in this case: - -.. code-block:: bash - - $ btrfs balance start -mconvert=dup -dconvert=single /mnt - $ btrfs device remove /dev/sda /mnt - DEVICE STATS ------------ diff --git a/Documentation/btrfs-man5.rst b/Documentation/btrfs-man5.rst index ee3364d8..65dd3adb 100644 --- a/Documentation/btrfs-man5.rst +++ b/Documentation/btrfs-man5.rst @@ -739,7 +739,6 @@ CHECKSUM ALGORITHMS .. include:: ch-checksumming.rst - COMPRESSION ----------- @@ -915,71 +914,7 @@ d ZONED MODE ---------- -Since version 5.12 btrfs supports so called *zoned mode*. This is a special -on-disk format and allocation/write strategy that's friendly to zoned devices. -In short, a device is partitioned into fixed-size zones and each zone can be -updated by append-only manner, or reset. 
As btrfs has no fixed data structures, -except the super blocks, the zoned mode only requires block placement that -follows the device constraints. You can learn about the whole architecture at -https://zonedstorage.io . - -The devices are also called SMR/ZBC/ZNS, in *host-managed* mode. Note that -there are devices that appear as non-zoned but actually are, this is -*drive-managed* and using zoned mode won't help. - -The zone size depends on the device, typical sizes are 256MiB or 1GiB. In -general it must be a power of two. Emulated zoned devices like *null_blk* allow -to set various zone sizes. - -REQUIREMENTS, LIMITATIONS -^^^^^^^^^^^^^^^^^^^^^^^^^ - -* all devices must have the same zone size -* maximum zone size is 8GiB -* mixing zoned and non-zoned devices is possible, the zone writes are emulated, - but this is namely for testing -* the super block is handled in a special way and is at different locations - than on a non-zoned filesystem: - * primary: 0B (and the next two zones) - * secondary: 512G (and the next two zones) - * tertiary: 4TiB (4096GiB, and the next two zones) - -INCOMPATIBLE FEATURES -^^^^^^^^^^^^^^^^^^^^^ - -The main constraint of the zoned devices is lack of in-place update of the data. -This is inherently incompatbile with some features: - -* nodatacow - overwrite in-place, cannot create such files -* fallocate - preallocating space for in-place first write -* mixed-bg - unordered writes to data and metadata, fixing that means using - separate data and metadata block groups -* booting - the zone at offset 0 contains superblock, resetting the zone would - destroy the bootloader data - -Initial support lacks some features but they're planned: - -* only single profile is supported -* fstrim - due to dependency on free space cache v1 - -SUPER BLOCK -~~~~~~~~~~~ - -As said above, super block is handled in a special way. In order to be crash -safe, at least one zone in a known location must contain a valid superblock. 
-This is implemented as a ring buffer in two consecutive zones, starting from -known offsets 0, 512G and 4TiB. The values are different than on non-zoned -devices. Each new super block is appended to the end of the zone, once it's -filled, the zone is reset and writes continue to the next one. Looking up the -latest super block needs to read offsets of both zones and determine the last -written version. - -The amount of space reserved for super block depends on the zone size. The -secondary and tertiary copies are at distant offsets as the capacity of the -devices is expected to be large, tens of terabytes. Maximum zone size supported -is 8GiB, which would mean that eg. offset 0-16GiB would be reserved just for -the super block on a hypothetical device of that zone size. This is wasteful -but required to guarantee crash safety. +.. include:: ch-zoned-intro.rst CONTROL DEVICE diff --git a/Documentation/btrfs-subvolume.rst b/Documentation/btrfs-subvolume.rst index 3e381bca..4591d4bb 100644 --- a/Documentation/btrfs-subvolume.rst +++ b/Documentation/btrfs-subvolume.rst @@ -12,6 +12,8 @@ DESCRIPTION **btrfs subvolume** is used to create/delete/list/show btrfs subvolumes and snapshots. +.. include:: ch-subvolume-intro.rst + SUBVOLUME AND SNAPSHOT ---------------------- @@ -241,36 +243,6 @@ sync [subvolid...] -s sleep N seconds between checks (default: 1) -SUBVOLUME FLAGS ---------------- - -The subvolume flag currently implemented is the *ro* property. Read-write -subvolumes have that set to *false*, snapshots as *true*. In addition to that, -a plain snapshot will also have last change generation and creation generation -equal. - -Read-only snapshots are building blocks fo incremental send (see -``btrfs-send(8)``) and the whole use case relies on unmodified snapshots where the -relative changes are generated from. 
Thus, changing the subvolume flags from -read-only to read-write will break the assumptions and may lead to unexpected changes -in the resulting incremental stream. - -A snapshot that was created by send/receive will be read-only, with different -last change generation, read-only and with set *received_uuid* which identifies -the subvolume on the filesystem that produced the stream. The usecase relies -on matching data on both sides. Changing the subvolume to read-write after it -has been received requires to reset the *received_uuid*. As this is a notable -change and could potentially break the incremental send use case, performing -it by **btrfs property set** requires force if that is really desired by user. - -.. note:: - The safety checks have been implemented in 5.14.2, any subvolumes previously - received (with a valid *received_uuid*) and read-write status may exist and - could still lead to problems with send/receive. You can use **btrfs subvolume - show** to identify them. Flipping the flags to read-only and back to - read-write will reset the *received_uuid* manually. There may exist a - convenience tool in the future. - EXAMPLES -------- diff --git a/Documentation/ch-balance-filters.rst b/Documentation/ch-balance-filters.rst new file mode 100644 index 00000000..2ca141fb --- /dev/null +++ b/Documentation/ch-balance-filters.rst @@ -0,0 +1,83 @@ +From kernel 3.3 onwards, btrfs balance can limit its action to a subset of the +whole filesystem, and can be used to change the replication configuration (e.g. +moving data from single to RAID1). This functionality is accessed through the +*-d*, *-m* or *-s* options to btrfs balance start, which filter on data, +metadata and system blocks respectively. + +A filter has the following structure: *type[=params][,type=...]* + +The available types are: + +profiles= + Balances only block groups with the given profiles. Parameters + are a list of profile names separated by "*|*" (pipe). 
+ +usage=, usage= + Balances only block groups with usage under the given percentage. The + value of 0 is allowed and will clean up completely unused block groups, this + should not require any new work space allocated. You may want to use *usage=0* + in case balance is returning ENOSPC and your filesystem is not too full. + + The argument may be a single value or a range. The single value *N* means *at + most N percent used*, equivalent to *..N* range syntax. Kernels prior to 4.4 + accept only the single value format. + The minimum range boundary is inclusive, maximum is exclusive. + +devid= + Balances only block groups which have at least one chunk on the given + device. To list devices with ids use **btrfs filesystem show**. + +drange= + Balance only block groups which overlap with the given byte range on any + device. Use in conjunction with *devid* to filter on a specific device. The + parameter is a range specified as *start..end*. + +vrange= + Balance only block groups which overlap with the given byte range in the + filesystem's internal virtual address space. This is the address space that + most reports from btrfs in the kernel log use. The parameter is a range + specified as *start..end*. + +convert= + Convert each selected block group to the given profile name identified by + parameters. + + .. note:: + Starting with kernel 4.5, the *data* chunks can be converted to/from the + *DUP* profile on a single device. + + .. note:: + Starting with kernel 4.6, all profiles can be converted to/from *DUP* on + multi-device filesystems. + +limit=, limit= + Process only given number of chunks, after all filters are applied. This can be + used to specifically target a chunk in connection with other filters (*drange*, + *vrange*) or just simply limit the amount of work done by a single balance run. + + The argument may be a single value or a range. The single value *N* means *at + most N chunks*, equivalent to *..N* range syntax. 
Kernels prior to 4.4 accept + only the single value format. The range minimum and maximum are inclusive. + +stripes= + Balance only block groups which have the given number of stripes. The parameter + is a range specified as *start..end*. Makes sense for block group profiles that + utilize striping, ie. RAID0/10/5/6. The range minimum and maximum are + inclusive. + +soft + Takes no parameters. Only has meaning when converting between profiles. + When doing convert from one profile to another and soft mode is on, + chunks that already have the target profile are left untouched. + This is useful e.g. when half of the filesystem was converted earlier but got + cancelled. + + The soft mode switch is (like every other filter) per-type. + For example, this means that we can convert metadata chunks the "hard" way + while converting data chunks selectively with soft switch. + +Profile names, used in *profiles* and *convert* are one of: *raid0*, *raid1*, +*raid1c3*, *raid1c4*, *raid10*, *raid5*, *raid6*, *dup*, *single*. The mixed +data/metadata profiles can be converted in the same way, but conversion +between mixed and non-mixed is not implemented. For the constraints of the +profiles please refer to ``mkfs.btrfs(8)``, section *PROFILES*. diff --git a/Documentation/ch-balance-intro.rst b/Documentation/ch-balance-intro.rst new file mode 100644 index 00000000..f885903a --- /dev/null +++ b/Documentation/ch-balance-intro.rst @@ -0,0 +1,62 @@ +The primary purpose of the balance feature is to spread block groups across +all devices so they match constraints defined by the respective profiles. See +``mkfs.btrfs(8)`` section *PROFILES* for more details. +The scope of the balancing process can be further tuned by use of filters that +can select the block groups to process. Balance works only on a mounted +filesystem. Extent sharing is preserved and reflinks are not broken. 
+Files are not defragmented nor recompressed, file extents are preserved +but the physical location on devices will change. + +The balance operation is cancellable by the user. The on-disk state of the +filesystem is always consistent so an unexpected interruption (eg. system crash, +reboot) does not corrupt the filesystem. The progress of the balance operation +is temporarily stored as an internal state and will be resumed upon mount, +unless the mount option *skip_balance* is specified. + +.. warning:: + Running balance without filters will take a lot of time as it basically moves + data/metadata from the whole filesystem and needs to update all block + pointers. + +The filters can be used to perform the following actions: + +- convert block group profiles (filter *convert*) +- make block group usage more compact (filter *usage*) +- perform actions only on a given device (filters *devid*, *drange*) + +The filters can be applied to a combination of block group types (data, +metadata, system). Note that changing only the *system* type needs the force +option. Otherwise *system* gets automatically converted whenever *metadata* +profile is converted. + +When metadata redundancy is reduced (eg. from RAID1 to single) the force option +is also required and it is noted in system log. + +.. note:: + The balance operation needs enough work space, ie. space that is completely + unused in the filesystem, otherwise this may lead to ENOSPC reports. See + the section *ENOSPC* for more details. + +Compatibility +------------- + +.. note:: + + The balance subcommand also exists under the **btrfs filesystem** namespace. + This still works for backward compatibility but is deprecated and should not + be used any more. + +.. note:: + A short syntax **btrfs balance ** works due to backward compatibility + but is deprecated and should not be used any more. Use **btrfs balance start** + command instead. 
+ +Performance implications +------------------------ + +Balancing operations are very IO intensive and can also be quite CPU intensive, +impacting other ongoing filesystem operations. Typically large amounts of data +are copied from one location to another, with corresponding metadata updates. + +Depending upon the block group layout, it can also be seek heavy. Performance +on rotational devices is noticeably worse compared to SSDs or fast arrays. diff --git a/Documentation/ch-checksumming.rst b/Documentation/ch-checksumming.rst index 96cd27a4..f4a27e3e 100644 --- a/Documentation/ch-checksumming.rst +++ b/Documentation/ch-checksumming.rst @@ -10,7 +10,7 @@ CRC32C (32bit digest) instruction-level support, not collision-resistant but still good error detection capabilities -XXHASH* (64bit digest) +XXHASH (64bit digest) can be used as CRC32C successor, very fast, optimized for modern CPUs utilizing instruction pipelining, good collision resistance and error detection @@ -33,7 +33,6 @@ additional overhead of the b-tree leaves. Approximate relative performance of the algorithms, measured against CRC32C using reference software implementations on a 3.5GHz intel CPU: - ======== ============ ======= ================ Digest Cycles/4KiB Ratio Implementation ======== ============ ======= ================ @@ -73,4 +72,3 @@ while accelerated implementation is e.g. priority : 170 ... - diff --git a/Documentation/ch-compression.rst b/Documentation/ch-compression.rst index 10c343e4..c319d88a 100644 --- a/Documentation/ch-compression.rst +++ b/Documentation/ch-compression.rst @@ -56,7 +56,7 @@ cause performance drops. The command above will start defragmentation of the whole *file* and apply the compression, regardless of the mount option. (Note: specifying level is not -yet implemented). The compression algorithm is not persisent and applies only +yet implemented). 
The compression algorithm is not persistent and applies only to the defragmentation command, for any other writes other compression settings apply. @@ -114,9 +114,9 @@ There are two ways to detect incompressible data: * actual compression attempt - data are compressed, if the result is not smaller, it's discarded, so this depends on the algorithm and level * pre-compression heuristics - a quick statistical evaluation on the data is - peformed and based on the result either compression is performed or skipped, + performed and based on the result either compression is performed or skipped, the NOCOMPRESS bit is not set just by the heuristic, only if the compression - algorithm does not make an improvent + algorithm does not make an improvement .. code-block:: shell @@ -137,7 +137,7 @@ incompressible data too but this leads to more overhead as the compression is done in another thread and has to write the data anyway. The heuristic is read-only and can utilize cached memory. -The tests performed based on the following: data sampling, long repated +The tests performed based on the following: data sampling, long repeated pattern detection, byte frequency, Shannon entropy. Compatibility diff --git a/Documentation/ch-convert-intro.rst b/Documentation/ch-convert-intro.rst index b3fdd162..56d1c7a6 100644 --- a/Documentation/ch-convert-intro.rst +++ b/Documentation/ch-convert-intro.rst @@ -36,7 +36,7 @@ machines). **BEFORE YOU START** The source filesystem must be clean, eg. no journal to replay or no repairs -needed. The respective **fsck** utility must be run on the source filesytem prior +needed. The respective **fsck** utility must be run on the source filesystem prior to conversion. Please refer to the manual pages in case you encounter problems. 
For ext2/3/4: diff --git a/Documentation/ch-quota-intro.rst b/Documentation/ch-quota-intro.rst index abd71606..a3b9d3b2 100644 --- a/Documentation/ch-quota-intro.rst +++ b/Documentation/ch-quota-intro.rst @@ -42,7 +42,7 @@ exclusive is the amount of data where all references to this data can be reached from within this qgroup. -SUBVOLUME QUOTA GROUPS +Subvolume quota groups ^^^^^^^^^^^^^^^^^^^^^^ The basic notion of the Subvolume Quota feature is the quota group, short @@ -75,7 +75,7 @@ of qgroups. Figure 1 shows an example qgroup tree. | / \ / \ extents 1 2 3 4 - Figure1: Sample qgroup hierarchy + Figure 1: Sample qgroup hierarchy At the bottom, some extents are depicted showing which qgroups reference which extents. It is important to understand the notion of *referenced* vs @@ -101,7 +101,7 @@ allocation information are not accounted. In turn, the referenced count of a qgroup can be limited. All writes beyond this limit will lead to a 'Quota Exceeded' error. -INHERITANCE +Inheritance ^^^^^^^^^^^ Things get a bit more complicated when new subvolumes or snapshots are created. @@ -133,13 +133,13 @@ exclusive count from the second qgroup needs to be copied to the first qgroup, as it represents the correct value. The second qgroup is called a tracking qgroup. It is only there in case a snapshot is needed. -USE CASES +Use cases ^^^^^^^^^ -Below are some usecases that do not mean to be extensive. You can find your +Below are some use cases that do not mean to be extensive. You can find your own way how to integrate qgroups. -SINGLE-USER MACHINE +Single-user machine """"""""""""""""""" ``Replacement for partitions`` @@ -156,7 +156,7 @@ the correct values. 'Referenced' will show how much is in it, possibly shared with other subvolumes. 'Exclusive' will be the amount of space that gets freed when the subvolume is deleted. 
-MULTI-USER MACHINE +Multi-user machine """""""""""""""""" ``Restricting homes`` @@ -194,5 +194,3 @@ but some snapshots for backup purposes are being created by the system. The user's snapshots should be accounted to the user, not the system. The solution is similar to the one from section 'Accounting snapshots to the user', but do not assign system snapshots to user's qgroup. - - diff --git a/Documentation/ch-seeding-device.rst b/Documentation/ch-seeding-device.rst index 93136c2f..78451e58 100644 --- a/Documentation/ch-seeding-device.rst +++ b/Documentation/ch-seeding-device.rst @@ -19,7 +19,7 @@ UUID on each mount. Once the seeding device is mounted, it needs the writable device. After adding it, something like **remount -o remount,rw /path** makes the filesystem at -*/path* ready for use. The simplest usecase is to throw away all changes by +*/path* ready for use. The simplest use case is to throw away all changes by unmounting the filesystem when convenient. Alternatively, deleting the seeding device from the filesystem can turn it into @@ -29,7 +29,7 @@ data from the seeding device. The seeding device flag can be cleared again by **btrfstune -f -s 0**, eg. allowing to update with newer data but please note that this will invalidate all existing filesystems that use this particular seeding device. This works -for some usecases, not for others, and a forcing flag to the command is +for some use cases, not for others, and a forcing flag to the command is mandatory to avoid accidental mistakes. 
Example how to create and use one seeding device: @@ -71,8 +71,6 @@ A few things to note: * it's recommended to use only single device for the seeding device, it works for multiple devices but the *single* profile must be used in order to make the seeding device deletion work -* block group profiles *single* and *dup* support the usecases above +* block group profiles *single* and *dup* support the use cases above * the label is copied from the seeding device and can be changed by **btrfs filesystem label** * each new mount of the seeding device gets a new random UUID - - diff --git a/Documentation/ch-subvolume-intro.rst b/Documentation/ch-subvolume-intro.rst new file mode 100644 index 00000000..ca5f5331 --- /dev/null +++ b/Documentation/ch-subvolume-intro.rst @@ -0,0 +1,58 @@ +A BTRFS subvolume is a part of filesystem with its own independent +file/directory hierarchy. Subvolumes can share file extents. A snapshot is also +subvolume, but with a given initial content of the original subvolume. + +.. note:: + A subvolume in BTRFS is not like an LVM logical volume, which is block-level + snapshot while BTRFS subvolumes are file extent-based. + +A subvolume looks like a normal directory, with some additional operations +described below. Subvolumes can be renamed or moved, nesting subvolumes is not +restricted but has some implications regarding snapshotting. + +A subvolume in BTRFS can be accessed in two ways: + +* like any other directory that is accessible to the user +* like a separately mounted filesystem (options *subvol* or *subvolid*) + +In the latter case the parent directory is not visible and accessible. This is +similar to a bind mount, and in fact the subvolume mount does exactly that. + +A freshly created filesystem is also a subvolume, called *top-level*, +internally has an id 5. This subvolume cannot be removed or replaced by another +subvolume. 
This is also the subvolume that will be mounted by default, unless +the default subvolume has been changed (see ``btrfs subvolume set-default``). + +A snapshot is a subvolume like any other, with given initial content. By +default, snapshots are created read-write. File modifications in a snapshot +do not affect the files in the original subvolume. + +Subvolume flags +--------------- + +The subvolume flag currently implemented is the *ro* property. Read-write +subvolumes have that set to *false*, snapshots as *true*. In addition to that, +a plain snapshot will also have last change generation and creation generation +equal. + +Read-only snapshots are building blocks of incremental send (see +``btrfs-send(8)``) and the whole use case relies on unmodified snapshots where +the relative changes are generated from. Thus, changing the subvolume flags +from read-only to read-write will break the assumptions and may lead to +unexpected changes in the resulting incremental stream. + +A snapshot that was created by send/receive will be read-only, with different +last change generation, read-only and with set *received_uuid* which identifies +the subvolume on the filesystem that produced the stream. The use case relies +on matching data on both sides. Changing the subvolume to read-write after it +has been received requires to reset the *received_uuid*. As this is a notable +change and could potentially break the incremental send use case, performing +it by **btrfs property set** requires force if that is really desired by user. + +.. note:: + The safety checks have been implemented in 5.14.2, any subvolumes previously + received (with a valid *received_uuid*) and read-write status may exist and + could still lead to problems with send/receive. You can use **btrfs subvolume + show** to identify them. Flipping the flags to read-only and back to + read-write will reset the *received_uuid* manually. There may exist a + convenience tool in the future. 
diff --git a/Documentation/ch-volume-management-intro.rst b/Documentation/ch-volume-management-intro.rst new file mode 100644 index 00000000..814878b5 --- /dev/null +++ b/Documentation/ch-volume-management-intro.rst @@ -0,0 +1,116 @@ +BTRFS filesystem can be created on top of single or multiple block devices. +Devices can be then added, removed or replaced on demand. Data and metadata are +organized in allocation profiles with various redundancy policies. There's some +similarity with traditional RAID levels, but this could be confusing to users +familiar with the traditional meaning. Due to the similarity, the RAID +terminology is widely used in the documentation. See ``mkfs.btrfs(8)`` for more +details and the exact profile capabilities and constraints. + +The device management works on a mounted filesystem. Devices can be added, +removed or replaced, by commands provided by ``btrfs device`` and ``btrfs replace``. + +The profiles can be also changed, provided there's enough workspace to do the +conversion, using the ``btrfs balance`` command and namely the filter *convert*. + +Type + The block group profile type is the main distinction of the information stored + on the block device. User data are called *Data*, the internal data structures + managed by filesystem are *Metadata* and *System*. + +Profile + A profile describes an allocation policy based on the redundancy/replication + constraints in connection with the number of devices. The profile applies to + data and metadata block groups separately. Eg. *single*, *RAID1*. + +RAID level + Where applicable, the level refers to a profile that matches constraints of the + standard RAID levels. At the moment the supported ones are: RAID0, RAID1, + RAID10, RAID5 and RAID6. 
+ +Typical use cases +----------------- + +Starting with a single-device filesystem +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Assume we've created a filesystem on a block device */dev/sda* with profile +*single/single* (data/metadata), the device size is 50GiB and we've used the +whole device for the filesystem. The mount point is */mnt*. + +The amount of data stored is 16GiB, metadata have allocated 2GiB. + +Add new device +"""""""""""""" + +We want to increase the total size of the filesystem and keep the profiles. The +size of the new device */dev/sdb* is 100GiB. + +.. code-block:: bash + + $ btrfs device add /dev/sdb /mnt + +The amount of free data space increases by less than 100GiB, some space is +allocated for metadata. + +Convert to RAID1 +"""""""""""""""" + +Now we want to increase the redundancy level of both data and metadata, but +we'll do that in steps. Note, that the device sizes are not equal and we'll use +that to show the capabilities of split data/metadata and independent profiles. + +The constraint for RAID1 gives us at most 50GiB of usable space and exactly 2 +copies will be stored on the devices. + +First we'll convert the metadata. As the metadata occupy less than 50GiB and +there's enough workspace for the conversion process, we can do: + +.. code-block:: bash + + $ btrfs balance start -mconvert=raid1 /mnt + +This operation can take a while, because all metadata have to be moved and all +block pointers updated. Depending on the physical locations of the old and new +blocks, the disk seeking is the key factor affecting performance. + +You'll note that the system block group has been also converted to RAID1, this +normally happens as the system block group also holds metadata (the physical to +logical mappings). 
+ +What changed: + +* available data space decreased by 3GiB, usable roughly (50 - 3) + (100 - 3) = 144 GiB +* metadata redundancy increased + +IOW, the unequal device sizes allow for combined space for data yet improved +redundancy for metadata. If we decide to increase redundancy of data as well, +we're going to lose 50GiB of the second device for obvious reasons. + +.. code-block:: bash + + $ btrfs balance start -dconvert=raid1 /mnt + +The balance process needs some workspace (ie. a free device space without any +data or metadata block groups) so the command could fail if there's too much +data or the block groups occupy the whole first device. + +The device size of */dev/sdb* as seen by the filesystem remains unchanged, but +the logical space from 50-100GiB will be unused. + +Remove device +""""""""""""" + +Device removal must satisfy the profile constraints, otherwise the command +fails. For example: + +.. code-block:: bash + + $ btrfs device remove /dev/sda /mnt + ERROR: error removing device '/dev/sda': unable to go below two devices on raid1 + +In order to remove a device, you need to convert the profile in this case: + +.. code-block:: bash + + $ btrfs balance start -mconvert=dup -dconvert=single /mnt + $ btrfs device remove /dev/sda /mnt diff --git a/Documentation/ch-zoned-intro.rst b/Documentation/ch-zoned-intro.rst new file mode 100644 index 00000000..ea54f898 --- /dev/null +++ b/Documentation/ch-zoned-intro.rst @@ -0,0 +1,66 @@ +Since version 5.12 btrfs supports so called *zoned mode*. This is a special +on-disk format and allocation/write strategy that's friendly to zoned devices. +In short, a device is partitioned into fixed-size zones and each zone can be +updated by append-only manner, or reset. As btrfs has no fixed data structures, +except the super blocks, the zoned mode only requires block placement that +follows the device constraints. You can learn about the whole architecture at +https://zonedstorage.io . 
+ +The devices are also called SMR/ZBC/ZNS, in *host-managed* mode. Note that +there are devices that appear as non-zoned but actually are, this is +*drive-managed* and using zoned mode won't help. + +The zone size depends on the device, typical sizes are 256MiB or 1GiB. In +general it must be a power of two. Emulated zoned devices like *null_blk* allow +to set various zone sizes. + +Requirements, limitations +^^^^^^^^^^^^^^^^^^^^^^^^^ + +* all devices must have the same zone size +* maximum zone size is 8GiB +* mixing zoned and non-zoned devices is possible, the zone writes are emulated, + but this is namely for testing +* the super block is handled in a special way and is at different locations + than on a non-zoned filesystem: + * primary: 0B (and the next two zones) + * secondary: 512GiB (and the next two zones) + * tertiary: 4TiB (4096GiB, and the next two zones) + +Incompatible features +^^^^^^^^^^^^^^^^^^^^^ + +The main constraint of the zoned devices is lack of in-place update of the data. +This is inherently incompatible with some features: + +* nodatacow - overwrite in-place, cannot create such files +* fallocate - preallocating space for in-place first write +* mixed-bg - unordered writes to data and metadata, fixing that means using + separate data and metadata block groups +* booting - the zone at offset 0 contains superblock, resetting the zone would + destroy the bootloader data + +Initial support lacks some features but they're planned: + +* only single profile is supported +* fstrim - due to dependency on free space cache v1 + +Super block +^^^^^^^^^^^ + +As said above, super block is handled in a special way. In order to be crash +safe, at least one zone in a known location must contain a valid superblock. +This is implemented as a ring buffer in two consecutive zones, starting from +known offsets 0B, 512GiB and 4TiB. + +The values are different than on non-zoned devices. 
Each new super block is +appended to the end of the zone, once it's filled, the zone is reset and writes +continue to the next one. Looking up the latest super block needs to read +offsets of both zones and determine the last written version. + +The amount of space reserved for super block depends on the zone size. The +secondary and tertiary copies are at distant offsets as the capacity of the +devices is expected to be large, tens of terabytes. Maximum zone size supported +is 8GiB, which would mean that eg. offset 0-16GiB would be reserved just for +the super block on a hypothetical device of that zone size. This is wasteful +but required to guarantee crash safety. diff --git a/Documentation/index.rst b/Documentation/index.rst index 53f321a4..fcf89590 100644 --- a/Documentation/index.rst +++ b/Documentation/index.rst @@ -8,7 +8,6 @@ Welcome to BTRFS documentation! :caption: Overview Introduction - Quick-start man-index .. toctree:: @@ -41,6 +40,7 @@ Welcome to BTRFS documentation! :maxdepth: 1 :caption: TODO + Quick-start Interoperability Glossary Flexibility