btrfs-progs: docs: add more chapters (part 3)

All main pages have some content and many typos have been fixed.

Signed-off-by: David Sterba <dsterba@suse.com>

parent c6be84840f
commit 208aed2ed4
@@ -1,4 +1,9 @@
 Balance
 =======
 
-...
+.. include:: ch-balance-intro.rst
+
+Filters
+-------
+
+.. include:: ch-balance-filters.rst
@@ -1,20 +1,44 @@
 Common Linux features
 =====================
 
-Anything that's standard and also supported
+The Linux operating system implements the POSIX standard interfaces and API,
+with additional interfaces. Many of them have become common in other
+filesystems. The ones listed below have been added relatively recently and are
+considered interesting for users:
 
-- statx
+birth/origin inode time
+        a timestamp associated with an inode recording when it was created; it
+        cannot be changed and requires the *statx* syscall to be read
 
-- fallocate modes
+statx
+        an extended version of the *stat* syscall that provides an extensible
+        interface to read more information that is not available in the
+        original *stat*
 
-- birth/origin inode time
+fallocate modes
+        the *fallocate* syscall allows manipulating file extents, like punching
+        holes, preallocation or zeroing a range
 
-- filesystem label
+FIEMAP
+        an ioctl that enumerates file extents; the related tool is ``filefrag``
 
-- xattr, acl
+filesystem label
+        another filesystem identification, could be used for mount or for
+        better recognition, can be set or read by an ioctl or by the command
+        ``btrfs filesystem label``
 
-- FIEMAP
+O_TMPFILE
+        a mode of the open() syscall that creates a file with no associated
+        directory entry, which makes it impossible to be seen by other
+        processes and is thus safe to be used as a temporary file
+        (https://lwn.net/Articles/619146/)
 
-- O_TMPFILE
+xattr, acl
+        extended attributes (xattr) are a list of *key=value* pairs associated
+        with a file, usually storing additional metadata related to security,
+        access control lists in particular (ACL) or properties (``btrfs
+        property``)
 
 - XFLAGS, fileattr
 
 - cross-rename
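A few of these interfaces can be exercised with standard tools from util-linux and coreutils; a small sketch (not btrfs-specific, the file name is arbitrary):

```shell
# Preallocate a 1MiB file (plain fallocate mode)
fallocate -l 1M testfile

# Punch a 4KiB hole at the start, one of the fallocate modes
fallocate --punch-hole --offset 0 --length 4096 testfile

# Read the birth/origin time via the statx-backed stat(1); prints '-' when
# the filesystem or kernel does not provide it
stat -c 'birth: %w' testfile

# Enumerate the file extents via the FIEMAP ioctl
filefrag -v testfile
```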
@@ -1,16 +1,21 @@
 Custom ioctls
 =============
 
-Anything that's not doing the other features and stands on it's own
+Filesystems are usually extended by custom ioctls beyond the standard system
+call interface to let user applications access the advanced features. They're
+low level and the following list gives only an overview of the capabilities or
+a command if available:
 
-- reverse lookup, from file offset to inode
+- reverse lookup, from file offset to inode, ``btrfs inspect-internal
+  logical-resolve``
 
-- resolve inode number -> name
+- resolve inode number to a list of names, ``btrfs inspect-internal
+  inode-resolve``
 
-- file offset -> all inodes that share it
-
-- tree search, all the metadata at your hand (if you know what to do with them)
+- tree search, given a key range and tree id, look up and return all b-tree
+  items found in that range, basically all metadata at your hand but you need
+  to know what to do with them
 
-- informative (device, fs, space)
+- informative, about devices, space allocation or the whole filesystem, much of
+  which is also exported in ``/sys/fs/btrfs``
 
-- query/set a subset of features on a mounted fs
+- query/set a subset of features on a mounted filesystem
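For the lookup ioctls, ``btrfs inspect-internal`` provides a command-line front end. A sketch of a session on a filesystem mounted at */mnt* (the inode number and logical address are made-up values for illustration; requires root and a mounted btrfs):

```shell
# Map inode number 257 back to its path name(s)
btrfs inspect-internal inode-resolve 257 /mnt

# Map a logical (virtual) address, e.g. copied from a kernel log message,
# back to the files that reference it
btrfs inspect-internal logical-resolve 5349376 /mnt
```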
@@ -18,5 +18,5 @@ happens inside the page cache, that is the central point caching the file data
 and takes care of synchronization. Once a filesystem sync or flush is started
 (either manually or automatically) all the dirty data get written to the
 devices. This however reduces the chances to find optimal layout as the writes
-happen together with other data and the result depens on the remaining free
+happen together with other data and the result depends on the remaining free
 space layout and fragmentation.
@@ -14,7 +14,7 @@ also copied, though there are no ready-made tools for that.
 
    cp --reflink=always source target
 
-There are some constaints:
+There are some constraints:
 
 - cross-filesystem reflink is not possible, there's nothing in common between
   so the block sharing can't work
@@ -3,8 +3,8 @@ Resize
 
 A BTRFS mounted filesystem can be resized after creation, grown or shrunk. On a
 multi device filesystem the space occupied on each device can be resized
-independently. Data tha reside in the are that would be out of the new size are
-relocated to the remaining space below the limit, so this constrains the
+independently. Data that reside in the area that would be out of the new size
+are relocated to the remaining space below the limit, so this constrains the
 minimum size to which a filesystem can be shrunk.
 
 Growing a filesystem is quick as it only needs to take note of the available
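The resize operation is done with ``btrfs filesystem resize``; a sketch assuming a filesystem mounted at */mnt* (requires root and a mounted btrfs):

```shell
# Shrink the filesystem by 2GiB
btrfs filesystem resize -2g /mnt

# Grow the filesystem to the maximum available size of the underlying device
btrfs filesystem resize max /mnt

# On a multi-device filesystem, resize only the device with devid 2
btrfs filesystem resize 2:max /mnt
```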
@@ -1,4 +1,4 @@
 Subvolumes
 ==========
 
-...
+.. include:: ch-subvolume-intro.rst
@@ -1,6 +1,53 @@
 Tree checker
 ============
 
+Metadata blocks that have been just read from devices or are just about to be
+written are verified and sanity checked by the so-called **tree checker**. The
+b-tree nodes contain several items describing the filesystem structure and to
+some degree can be verified for consistency or validity. This is an additional
+check to the checksums that only verify the overall block status while the tree
+checker tries to validate and cross reference the logical structure. This takes
+a slight performance hit but is comparable to calculating the checksum and has
+no noticeable impact while it does catch all sorts of errors.
+
+There are two occasions when the checks are done:
+
+Pre-write checks
+----------------
+
+When metadata blocks are in memory, about to be written to the permanent
+storage, the checks are performed, before the checksums are calculated. This
+can catch random corruptions of the blocks (or pages) either caused by bugs or
+by other parts of the system or hardware errors (namely faulty RAM).
+
+Once a block does not pass the checks, the filesystem refuses to write more
+data and turns itself to read-only mode to prevent further damage. At this
+point some of the recent metadata updates are held *only* in memory so it's
+best to not panic and try to remember what files could be affected and copy
+them elsewhere. Once the filesystem gets unmounted, the most recent changes
+are unfortunately lost. The filesystem that is stored on the device is still
+consistent and should mount fine.
+
+Post-read checks
+----------------
+
+Metadata blocks get verified right after they're read from devices and the
+checksum is found to be valid. This protects against changes to the metadata
+that could possibly also update the checksum, less likely to happen
+accidentally but rather due to intentional corruption or fuzzing.
+
+The checks
+----------
+
+As implemented right now, the metadata consistency is limited to one b-tree
+node and what items are stored there, ie. there's no extensive or broad check
+done eg. against other data structures in other b-tree nodes. This still
+provides enough opportunities to verify consistency of individual items,
+besides verifying general validity of the items like the length or offset. The
+b-tree items are also coupled with a key so proper key ordering is also part of
+the check and can reveal random bitflips in the sequence (this has been the
+most successful detector of faulty RAM).
+
+The capabilities of the tree checker have been improved over time and it's
+possible that a filesystem created on an older kernel may trigger warnings or
+fail some checks on a new one.
@@ -1,4 +1,41 @@
-Trim
-====
+Trim/discard
+============
 
-...
+Trim or discard is an operation on a storage device based on flash technology
+(SSD, NVMe or similar), a thin-provisioned device, or could be emulated on top
+of other block device types. On real hardware, there's a different lifetime
+span of the memory cells and the drive firmware usually tries to optimize for
+that. The trim operation issued by the user provides hints about what data are
+unused, allowing the memory cells to be reclaimed. On thin-provisioned or
+emulated devices this could simply free the space.
+
+There are three main uses of trim that BTRFS supports:
+
+synchronous
+        enabled by mounting filesystem with ``-o discard`` or ``-o
+        discard=sync``, the trim is done right after the file extents get
+        freed, this however could have a severe performance hit and is not
+        recommended as the ranges to be trimmed could be too fragmented
+
+asynchronous
+        enabled by mounting filesystem with ``-o discard=async``, which is an
+        improved version of the synchronous trim where the freed file extents
+        are first tracked in memory and after a period or once enough ranges
+        accumulate the trim is started, expecting the ranges to be much larger
+        and allowing to throttle the number of IO requests so it does not
+        interfere with the rest of the filesystem activity
+
+manually by fstrim
+        the tool ``fstrim`` starts a trim operation on the whole filesystem,
+        no mount options need to be specified, so it's up to the filesystem to
+        traverse the free space and start the trim, this is suitable for
+        running it as a periodic service
+
+The trim is considered only a hint to the device; it could ignore it
+completely, start it only on ranges meeting some criteria, or decide not to do
+it because of other factors affecting the memory cells. The device itself
+could internally relocate the data, however this leads to unexpected
+performance drop. Running trim periodically could prevent that too.
+
+When a filesystem is created by ``mkfs.btrfs`` and the device is capable of
+trim, then trim is by default performed on all devices.
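The manual variant described above can be run as follows (requires root and a mounted filesystem on a trim-capable device):

```shell
# Trim the whole filesystem once, verbosely reporting the discarded amount
fstrim -v /mnt

# For periodic runs, util-linux ships a systemd timer that trims all
# capable mounted filesystems on a weekly schedule
systemctl enable --now fstrim.timer
```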
@@ -1,4 +1,4 @@
 Volume management
 =================
 
-...
+.. include:: ch-volume-management-intro.rst
@@ -1,4 +1,4 @@
 Zoned mode
 ==========
 
-...
+.. include:: ch-zoned-intro.rst
@@ -9,68 +9,7 @@ SYNOPSIS
 DESCRIPTION
 -----------
 
-The primary purpose of the balance feature is to spread block groups across
-all devices so they match constraints defined by the respective profiles. See
-``mkfs.btrfs(8)`` section *PROFILES* for more details.
-The scope of the balancing process can be further tuned by use of filters that
-can select the block groups to process. Balance works only on a mounted
-filesystem. Extent sharing is preserved and reflinks are not broken.
-Files are not defragmented nor recompressed, file extents are preserved
-but the physical location on devices will change.
-
-The balance operation is cancellable by the user. The on-disk state of the
-filesystem is always consistent so an unexpected interruption (eg. system crash,
-reboot) does not corrupt the filesystem. The progress of the balance operation
-is temporarily stored as an internal state and will be resumed upon mount,
-unless the mount option *skip_balance* is specified.
-
-.. warning::
-   Running balance without filters will take a lot of time as it basically move
-   data/metadata from the whol filesystem and needs to update all block
-   pointers.
-
-The filters can be used to perform following actions:
-
-- convert block group profiles (filter *convert*)
-- make block group usage more compact (filter *usage*)
-- perform actions only on a given device (filters *devid*, *drange*)
-
-The filters can be applied to a combination of block group types (data,
-metadata, system). Note that changing only the *system* type needs the force
-option. Otherwise *system* gets automatically converted whenever *metadata*
-profile is converted.
-
-When metadata redundancy is reduced (eg. from RAID1 to single) the force option
-is also required and it is noted in system log.
-
-.. note::
-   The balance operation needs enough work space, ie. space that is completely
-   unused in the filesystem, otherwise this may lead to ENOSPC reports. See
-   the section *ENOSPC* for more details.
-
-COMPATIBILITY
--------------
-
-.. note::
-
-   The balance subcommand also exists under the **btrfs filesystem** namespace.
-   This still works for backward compatibility but is deprecated and should not
-   be used any more.
-
-.. note::
-   A short syntax **btrfs balance <path>** works due to backward compatibility
-   but is deprecated and should not be used any more. Use **btrfs balance start**
-   command instead.
-
-PERFORMANCE IMPLICATIONS
-------------------------
-
-Balancing operations are very IO intensive and can also be quite CPU intensive,
-impacting other ongoing filesystem operations. Typically large amounts of data
-are copied from one location to another, with corresponding metadata updates.
-
-Depending upon the block group layout, it can also be seek heavy. Performance
-on rotational devices is noticeably worse compared to SSDs or fast arrays.
+.. include:: ch-balance-intro.rst
 
 SUBCOMMAND
 ----------
@@ -148,89 +87,7 @@ status [-v] <path>
 FILTERS
 -------
 
-From kernel 3.3 onwards, btrfs balance can limit its action to a subset of the
-whole filesystem, and can be used to change the replication configuration (e.g.
-moving data from single to RAID1). This functionality is accessed through the
-*-d*, *-m* or *-s* options to btrfs balance start, which filter on data,
-metadata and system blocks respectively.
-
-A filter has the following structure: *type[=params][,type=...]*
-
-The available types are:
-
-profiles=<profiles>
-        Balances only block groups with the given profiles. Parameters
-        are a list of profile names separated by "*|*" (pipe).
-
-usage=<percent>, usage=<range>
-        Balances only block groups with usage under the given percentage. The
-        value of 0 is allowed and will clean up completely unused block groups, this
-        should not require any new work space allocated. You may want to use *usage=0*
-        in case balance is returning ENOSPC and your filesystem is not too full.
-
-        The argument may be a single value or a range. The single value *N* means *at
-        most N percent used*, equivalent to *..N* range syntax. Kernels prior to 4.4
-        accept only the single value format.
-        The minimum range boundary is inclusive, maximum is exclusive.
-
-devid=<id>
-        Balances only block groups which have at least one chunk on the given
-        device. To list devices with ids use **btrfs filesystem show**.
-
-drange=<range>
-        Balance only block groups which overlap with the given byte range on any
-        device. Use in conjunction with *devid* to filter on a specific device. The
-        parameter is a range specified as *start..end*.
-
-vrange=<range>
-        Balance only block groups which overlap with the given byte range in the
-        filesystem's internal virtual address space. This is the address space that
-        most reports from btrfs in the kernel log use. The parameter is a range
-        specified as *start..end*.
-
-convert=<profile>
-        Convert each selected block group to the given profile name identified by
-        parameters.
-
-        .. note::
-                Starting with kernel 4.5, the *data* chunks can be converted to/from the
-                *DUP* profile on a single device.
-
-        .. note::
-                Starting with kernel 4.6, all profiles can be converted to/from *DUP* on
-                multi-device filesystems.
-
-limit=<number>, limit=<range>
-        Process only given number of chunks, after all filters are applied. This can be
-        used to specifically target a chunk in connection with other filters (*drange*,
-        *vrange*) or just simply limit the amount of work done by a single balance run.
-
-        The argument may be a single value or a range. The single value *N* means *at
-        most N chunks*, equivalent to *..N* range syntax. Kernels prior to 4.4 accept
-        only the single value format. The range minimum and maximum are inclusive.
-
-stripes=<range>
-        Balance only block groups which have the given number of stripes. The parameter
-        is a range specified as *start..end*. Makes sense for block group profiles that
-        utilize striping, ie. RAID0/10/5/6. The range minimum and maximum are
-        inclusive.
-
-soft
-        Takes no parameters. Only has meaning when converting between profiles.
-        When doing convert from one profile to another and soft mode is on,
-        chunks that already have the target profile are left untouched.
-        This is useful e.g. when half of the filesystem was converted earlier but got
-        cancelled.
-
-        The soft mode switch is (like every other filter) per-type.
-        For example, this means that we can convert metadata chunks the "hard" way
-        while converting data chunks selectively with soft switch.
-
-Profile names, used in *profiles* and *convert* are one of: *raid0*, *raid1*,
-*raid1c3*, *raid1c4*, *raid10*, *raid5*, *raid6*, *dup*, *single*. The mixed
-data/metadata profiles can be converted in the same way, but it's conversion
-between mixed and non-mixed is not implemented. For the constraints of the
-profiles please refer to ``mkfs.btrfs(8)``, section *PROFILES*.
+.. include:: ch-balance-filters.rst
 
 ENOSPC
 ------
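A few illustrative invocations of the filters described above, assuming a filesystem mounted at */mnt* (requires root and a mounted btrfs):

```shell
# Compact data block groups that are less than 50% used
btrfs balance start -dusage=50 /mnt

# Reclaim completely unused block groups, a cheap way out of some ENOSPC states
btrfs balance start -dusage=0 /mnt

# Convert metadata to RAID1, skipping chunks that already have that profile
btrfs balance start -mconvert=raid1,soft /mnt

# Limit the run to at most 10 data chunks
btrfs balance start -dlimit=10 /mnt
```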
@@ -14,36 +14,7 @@ The **btrfs device** command group is used to manage devices of the btrfs filesy
 DEVICE MANAGEMENT
 -----------------
 
-Btrfs filesystem can be created on top of single or multiple block devices.
-Data and metadata are organized in allocation profiles with various redundancy
-policies. There's some similarity with traditional RAID levels, but this could
-be confusing to users familiar with the traditional meaning. Due to the
-similarity, the RAID terminology is widely used in the documentation. See
-``mkfs.btrfs(8)`` for more details and the exact profile capabilities and
-constraints.
-
-The device management works on a mounted filesystem. Devices can be added,
-removed or replaced, by commands provided by **btrfs device** and **btrfs replace**.
-
-The profiles can be also changed, provided there's enough workspace to do the
-conversion, using the **btrfs balance** command and namely the filter *convert*.
-
-Type
-        The block group profile type is the main distinction of the information stored
-        on the block device. User data are called *Data*, the internal data structures
-        managed by filesystem are *Metadata* and *System*.
-
-Profile
-        A profile describes an allocation policy based on the redundancy/replication
-        constraints in connection with the number of devices. The profile applies to
-        data and metadata block groups separately. Eg. *single*, *RAID1*.
-
-RAID level
-        Where applicable, the level refers to a profile that matches constraints of the
-        standard RAID levels. At the moment the supported ones are: RAID0, RAID1,
-        RAID10, RAID5 and RAID6.
-
-See the section *TYPICAL USECASES* for some examples.
+.. include:: ch-volume-management-intro.rst
 
 SUBCOMMAND
 ----------
@@ -76,7 +47,7 @@ remove [options] <device>|<devid> [<device>|<devid>...] <path>
 Device removal must satisfy the profile constraints, otherwise the command
 fails. The filesystem must be converted to profile(s) that would allow the
 removal. This can typically happen when going down from 2 devices to 1 and
-using the RAID1 profile. See the *TYPICAL USECASES* section below.
+using the RAID1 profile. See the section *TYPICAL USECASES*.
 
 The operation can take long as it needs to move all data from the device.
@@ -217,94 +188,6 @@ usage [options] <path> [<path>...]::
 
 If conflicting options are passed, the last one takes precedence.
 
-TYPICAL USECASES
-----------------
-
-STARTING WITH A SINGLE-DEVICE FILESYSTEM
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Assume we've created a filesystem on a block device */dev/sda* with profile
-*single/single* (data/metadata), the device size is 50GiB and we've used the
-whole device for the filesystem. The mount point is */mnt*.
-
-The amount of data stored is 16GiB, metadata have allocated 2GiB.
-
-ADD NEW DEVICE
-""""""""""""""
-
-We want to increase the total size of the filesystem and keep the profiles. The
-size of the new device */dev/sdb* is 100GiB.
-
-.. code-block:: bash
-
-   $ btrfs device add /dev/sdb /mnt
-
-The amount of free data space increases by less than 100GiB, some space is
-allocated for metadata.
-
-CONVERT TO RAID1
-""""""""""""""""
-
-Now we want to increase the redundancy level of both data and metadata, but
-we'll do that in steps. Note, that the device sizes are not equal and we'll use
-that to show the capabilities of split data/metadata and independent profiles.
-
-The constraint for RAID1 gives us at most 50GiB of usable space and exactly 2
-copies will be stored on the devices.
-
-First we'll convert the metadata. As the metadata occupy less than 50GiB and
-there's enough workspace for the conversion process, we can do:
-
-.. code-block:: bash
-
-   $ btrfs balance start -mconvert=raid1 /mnt
-
-This operation can take a while, because all metadata have to be moved and all
-block pointers updated. Depending on the physical locations of the old and new
-blocks, the disk seeking is the key factor affecting performance.
-
-You'll note that the system block group has been also converted to RAID1, this
-normally happens as the system block group also holds metadata (the physical to
-logical mappings).
-
-What changed:
-
-* available data space decreased by 3GiB, usable roughly (50 - 3) + (100 - 3) = 144 GiB
-* metadata redundancy increased
-
-IOW, the unequal device sizes allow for combined space for data yet improved
-redundancy for metadata. If we decide to increase redundancy of data as well,
-we're going to lose 50GiB of the second device for obvious reasons.
-
-.. code-block:: bash
-
-   $ btrfs balance start -dconvert=raid1 /mnt
-
-The balance process needs some workspace (ie. a free device space without any
-data or metadata block groups) so the command could fail if there's too much
-data or the block groups occupy the whole first device.
-
-The device size of */dev/sdb* as seen by the filesystem remains unchanged, but
-the logical space from 50-100GiB will be unused.
-
-REMOVE DEVICE
-"""""""""""""
-
-Device removal must satisfy the profile constraints, otherwise the command
-fails. For example:
-
-.. code-block:: bash
-
-   $ btrfs device remove /dev/sda /mnt
-   ERROR: error removing device '/dev/sda': unable to go below two devices on raid1
-
-In order to remove a device, you need to convert the profile in this case:
-
-.. code-block:: bash
-
-   $ btrfs balance start -mconvert=dup -dconvert=single /mnt
-   $ btrfs device remove /dev/sda /mnt
-
 
 DEVICE STATS
 ------------
@@ -739,7 +739,6 @@ CHECKSUM ALGORITHMS
 
 .. include:: ch-checksumming.rst
 
-
 COMPRESSION
 -----------
 
|
|||
ZONED MODE
|
||||
----------
|
||||
|
||||
Since version 5.12 btrfs supports so called *zoned mode*. This is a special
|
||||
on-disk format and allocation/write strategy that's friendly to zoned devices.
|
||||
In short, a device is partitioned into fixed-size zones and each zone can be
|
||||
updated by append-only manner, or reset. As btrfs has no fixed data structures,
|
||||
except the super blocks, the zoned mode only requires block placement that
|
||||
follows the device constraints. You can learn about the whole architecture at
|
||||
https://zonedstorage.io .
|
||||
|
||||
The devices are also called SMR/ZBC/ZNS, in *host-managed* mode. Note that
|
||||
there are devices that appear as non-zoned but actually are, this is
|
||||
*drive-managed* and using zoned mode won't help.
|
||||
|
||||
The zone size depends on the device, typical sizes are 256MiB or 1GiB. In
|
||||
general it must be a power of two. Emulated zoned devices like *null_blk* allow
|
||||
to set various zone sizes.
|
||||
|
||||
REQUIREMENTS, LIMITATIONS
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
* all devices must have the same zone size
|
||||
* maximum zone size is 8GiB
|
||||
* mixing zoned and non-zoned devices is possible, the zone writes are emulated,
|
||||
but this is namely for testing
|
||||
* the super block is handled in a special way and is at different locations
|
||||
than on a non-zoned filesystem:
|
||||
* primary: 0B (and the next two zones)
|
||||
* secondary: 512G (and the next two zones)
|
||||
* tertiary: 4TiB (4096GiB, and the next two zones)
|
||||
|
||||
INCOMPATIBLE FEATURES
|
||||
^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
The main constraint of the zoned devices is lack of in-place update of the data.
|
||||
This is inherently incompatbile with some features:
|
||||
|
||||
* nodatacow - overwrite in-place, cannot create such files
|
||||
* fallocate - preallocating space for in-place first write
|
||||
* mixed-bg - unordered writes to data and metadata, fixing that means using
|
||||
separate data and metadata block groups
|
||||
* booting - the zone at offset 0 contains superblock, resetting the zone would
|
||||
destroy the bootloader data
|
||||
|
||||
Initial support lacks some features but they're planned:
|
||||
|
||||
* only single profile is supported
|
||||
* fstrim - due to dependency on free space cache v1
|
||||
|
||||
SUPER BLOCK
|
||||
~~~~~~~~~~~
|
||||
|
||||
As said above, super block is handled in a special way. In order to be crash
|
||||
safe, at least one zone in a known location must contain a valid superblock.
|
||||
This is implemented as a ring buffer in two consecutive zones, starting from
|
||||
known offsets 0, 512G and 4TiB. The values are different than on non-zoned
|
||||
devices. Each new super block is appended to the end of the zone, once it's
|
||||
filled, the zone is reset and writes continue to the next one. Looking up the
|
||||
latest super block needs to read offsets of both zones and determine the last
|
||||
written version.
|
||||
|
||||
The amount of space reserved for super block depends on the zone size. The
|
||||
secondary and tertiary copies are at distant offsets as the capacity of the
|
||||
devices is expected to be large, tens of terabytes. Maximum zone size supported
|
||||
is 8GiB, which would mean that eg. offset 0-16GiB would be reserved just for
|
||||
the super block on a hypothetical device of that zone size. This is wasteful
|
||||
but required to guarantee crash safety.
|
||||
.. include:: ch-zoned-intro.rst
|
||||
|
||||
|
||||
CONTROL DEVICE
|
||||
|
|
|
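The emulated zoned devices mentioned above can be set up with the *null_blk* module, which is convenient for experiments. A sketch (requires root; the module parameters are from recent kernels and the zone size is given in MiB):

```shell
# Create one emulated host-managed zoned device with 256MiB zones
modprobe null_blk nr_devices=1 zoned=1 zone_size=256

# mkfs.btrfs detects the zoned device and enables zoned mode automatically
mkfs.btrfs /dev/nullb0
mount /dev/nullb0 /mnt
```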
@@ -12,6 +12,8 @@ DESCRIPTION
 **btrfs subvolume** is used to create/delete/list/show btrfs subvolumes and
 snapshots.
 
+.. include:: ch-subvolume-intro.rst
+
 SUBVOLUME AND SNAPSHOT
 ----------------------
 
|
|||
-s <N>
|
||||
sleep N seconds between checks (default: 1)
|
||||
|
||||
SUBVOLUME FLAGS
|
||||
---------------
|
||||
|
||||
The subvolume flag currently implemented is the *ro* property. Read-write
|
||||
subvolumes have that set to *false*, snapshots as *true*. In addition to that,
|
||||
a plain snapshot will also have last change generation and creation generation
|
||||
equal.
|
||||
|
||||
Read-only snapshots are building blocks fo incremental send (see
|
||||
``btrfs-send(8)``) and the whole use case relies on unmodified snapshots where the
|
||||
relative changes are generated from. Thus, changing the subvolume flags from
|
||||
read-only to read-write will break the assumptions and may lead to unexpected changes
|
||||
in the resulting incremental stream.
|
||||
|
||||
A snapshot that was created by send/receive will be read-only, with different
|
||||
last change generation, read-only and with set *received_uuid* which identifies
|
||||
the subvolume on the filesystem that produced the stream. The usecase relies
|
||||
on matching data on both sides. Changing the subvolume to read-write after it
|
||||
has been received requires to reset the *received_uuid*. As this is a notable
|
||||
change and could potentially break the incremental send use case, performing
|
||||
it by **btrfs property set** requires force if that is really desired by user.
|
||||
|
||||
.. note::
|
||||
The safety checks have been implemented in 5.14.2, any subvolumes previously
|
||||
received (with a valid *received_uuid*) and read-write status may exist and
|
||||
could still lead to problems with send/receive. You can use **btrfs subvolume
|
||||
show** to identify them. Flipping the flags to read-only and back to
|
||||
read-write will reset the *received_uuid* manually. There may exist a
|
||||
convenience tool in the future.
|
||||
|
||||
EXAMPLES
|
||||
--------
|
||||
|
||||
|
|
|
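The flag manipulation goes through ``btrfs property``; for example (the subvolume path is illustrative, requires a mounted btrfs):

```shell
# Read the read-only flag of a snapshot
btrfs property get /mnt/snap ro

# Flipping a received snapshot to read-write clears received_uuid,
# hence the force flag is required
btrfs property set -f /mnt/snap ro false
```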
@ -0,0 +1,83 @@
|
|||
From kernel 3.3 onwards, btrfs balance can limit its action to a subset of the
whole filesystem, and can be used to change the replication configuration (e.g.
moving data from single to RAID1). This functionality is accessed through the
*-d*, *-m* or *-s* options to btrfs balance start, which filter on data,
metadata and system blocks respectively.

A filter has the following structure: *type[=params][,type=...]*

The available types are:

profiles=<profiles>
   Balances only block groups with the given profiles. Parameters
   are a list of profile names separated by "*|*" (pipe).

usage=<percent>, usage=<range>
   Balances only block groups with usage under the given percentage. The
   value of 0 is allowed and will clean up completely unused block groups; this
   should not require any new work space to be allocated. You may want to use
   *usage=0* in case balance is returning ENOSPC and your filesystem is not too
   full.

   The argument may be a single value or a range. The single value *N* means *at
   most N percent used*, equivalent to *..N* range syntax. Kernels prior to 4.4
   accept only the single value format.
   The minimum range boundary is inclusive, the maximum is exclusive.

devid=<id>
   Balances only block groups which have at least one chunk on the given
   device. To list devices with ids use **btrfs filesystem show**.

drange=<range>
   Balance only block groups which overlap with the given byte range on any
   device. Use in conjunction with *devid* to filter on a specific device. The
   parameter is a range specified as *start..end*.

vrange=<range>
   Balance only block groups which overlap with the given byte range in the
   filesystem's internal virtual address space. This is the address space that
   most reports from btrfs in the kernel log use. The parameter is a range
   specified as *start..end*.

convert=<profile>
   Convert each selected block group to the given profile name identified by
   parameters.

   .. note::
      Starting with kernel 4.5, the *data* chunks can be converted to/from the
      *DUP* profile on a single device.

   .. note::
      Starting with kernel 4.6, all profiles can be converted to/from *DUP* on
      multi-device filesystems.

limit=<number>, limit=<range>
   Process only the given number of chunks, after all other filters are applied.
   This can be used to specifically target a chunk in connection with other
   filters (*drange*, *vrange*) or to simply limit the amount of work done by a
   single balance run.

   The argument may be a single value or a range. The single value *N* means *at
   most N chunks*, equivalent to *..N* range syntax. Kernels prior to 4.4 accept
   only the single value format. The range minimum and maximum are inclusive.

stripes=<range>
   Balance only block groups which have the given number of stripes. The parameter
   is a range specified as *start..end*. Makes sense for block group profiles that
   utilize striping, ie. RAID0/10/5/6. The range minimum and maximum are
   inclusive.

soft
   Takes no parameters. Only has meaning when converting between profiles.
   When doing convert from one profile to another and soft mode is on,
   chunks that already have the target profile are left untouched.
   This is useful e.g. when half of the filesystem was converted earlier but the
   conversion got cancelled.

   The soft mode switch is (like every other filter) per-type.
   For example, this means that we can convert metadata chunks the "hard" way
   while converting data chunks selectively with the soft switch.

Profile names, used in *profiles* and *convert*, are one of: *raid0*, *raid1*,
*raid1c3*, *raid1c4*, *raid10*, *raid5*, *raid6*, *dup*, *single*. The mixed
data/metadata profiles can be converted in the same way, but conversion
between mixed and non-mixed profiles is not implemented. For the constraints of
the profiles please refer to ``mkfs.btrfs(8)``, section *PROFILES*.

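As an illustration of the syntax above, here is a small sketch (hypothetical helper names, not btrfs-progs code) of how a filter string decomposes and how the *usage* range boundaries apply:

```python
# Hypothetical sketch (not btrfs-progs code): how a balance filter string
# such as "usage=0..50,devid=1,soft" decomposes, and how the usage range
# boundaries apply (minimum inclusive, maximum exclusive).

def parse_filters(spec):
    """Split 'type[=params][,type=...]' into a dict; bare types map to None."""
    filters = {}
    for part in spec.split(","):
        key, _, params = part.partition("=")
        filters[key] = params if params else None
    return filters

def usage_matches(percent_used, range_spec):
    """Return True if a block group's usage falls into 'N' or 'min..max'."""
    if ".." in range_spec:
        lo, hi = range_spec.split("..")
        lo = int(lo) if lo else 0
        hi = int(hi) if hi else 100
        return lo <= percent_used < hi   # min inclusive, max exclusive
    return percent_used <= int(range_spec)  # single value: at most N percent

print(parse_filters("usage=0..50,devid=1,soft"))
print(usage_matches(50, "0..50"))  # False: the maximum boundary is exclusive
```

Note the asymmetry: a block group at exactly 50% usage matches *usage=50* but not *usage=0..50*.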
@ -0,0 +1,62 @@
The primary purpose of the balance feature is to spread block groups across
all devices so they match constraints defined by the respective profiles. See
``mkfs.btrfs(8)`` section *PROFILES* for more details.
The scope of the balancing process can be further tuned by use of filters that
can select the block groups to process. Balance works only on a mounted
filesystem. Extent sharing is preserved and reflinks are not broken.
Files are not defragmented nor recompressed; file extents are preserved
but the physical location on devices will change.

The balance operation is cancellable by the user. The on-disk state of the
filesystem is always consistent so an unexpected interruption (eg. system crash,
reboot) does not corrupt the filesystem. The progress of the balance operation
is temporarily stored as an internal state and will be resumed upon mount,
unless the mount option *skip_balance* is specified.

.. warning::
   Running balance without filters will take a lot of time as it basically moves
   data/metadata from the whole filesystem and needs to update all block
   pointers.

The filters can be used to perform the following actions:

- convert block group profiles (filter *convert*)
- make block group usage more compact (filter *usage*)
- perform actions only on a given device (filters *devid*, *drange*)

The filters can be applied to a combination of block group types (data,
metadata, system). Note that changing only the *system* type needs the force
option. Otherwise *system* gets automatically converted whenever the *metadata*
profile is converted.

When metadata redundancy is reduced (eg. from RAID1 to single) the force option
is also required and it is noted in the system log.

.. note::
   The balance operation needs enough work space, ie. space that is completely
   unused in the filesystem, otherwise this may lead to ENOSPC reports. See
   the section *ENOSPC* for more details.

Compatibility
-------------

.. note::
   The balance subcommand also exists under the **btrfs filesystem** namespace.
   This still works for backward compatibility but is deprecated and should not
   be used any more.

.. note::
   A short syntax **btrfs balance <path>** works due to backward compatibility
   but is deprecated and should not be used any more. Use the **btrfs balance
   start** command instead.

Performance implications
------------------------

Balancing operations are very IO intensive and can also be quite CPU intensive,
impacting other ongoing filesystem operations. Typically large amounts of data
are copied from one location to another, with corresponding metadata updates.

Depending upon the block group layout, it can also be seek heavy. Performance
on rotational devices is noticeably worse compared to SSDs or fast arrays.

@ -10,7 +10,7 @@ CRC32C (32bit digest)
   instruction-level support, not collision-resistant but still good error
   detection capabilities

XXHASH (64bit digest)
   can be used as a CRC32C successor, very fast, optimized for modern CPUs utilizing
   instruction pipelining, good collision resistance and error detection

@ -33,7 +33,6 @@ additional overhead of the b-tree leaves.
Approximate relative performance of the algorithms, measured against CRC32C
using reference software implementations on a 3.5GHz Intel CPU:

======== ============ ======= ================
Digest   Cycles/4KiB  Ratio   Implementation
======== ============ ======= ================

@ -73,4 +72,3 @@ while accelerated implementation is e.g.
   priority : 170
   ...

@ -56,7 +56,7 @@ cause performance drops.
The command above will start defragmentation of the whole *file* and apply
the compression, regardless of the mount option. (Note: specifying the level is
not yet implemented). The compression algorithm is not persistent and applies only
to the defragmentation command; for any other writes the other compression
settings apply.

@ -114,9 +114,9 @@ There are two ways to detect incompressible data:
* actual compression attempt - data are compressed, and if the result is not
  smaller, it's discarded, so this depends on the algorithm and level
* pre-compression heuristics - a quick statistical evaluation on the data is
  performed and based on the result either compression is performed or skipped,
  the NOCOMPRESS bit is not set just by the heuristic, only if the compression
  algorithm does not make an improvement

.. code-block:: shell

@ -137,7 +137,7 @@ incompressible data too but this leads to more overhead as the compression is
done in another thread and has to write the data anyway. The heuristic is
read-only and can utilize cached memory.

The tests performed are based on the following: data sampling, long repeated
pattern detection, byte frequency, Shannon entropy.
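A simplified illustration of the entropy part of such a heuristic follows; the helper names and the threshold are illustrative, not taken from the kernel code:

```python
# Simplified sketch of an entropy-based compressibility check; names and the
# threshold are illustrative, not the kernel's actual heuristic.
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 (uniform run) up to 8.0 (random-looking)."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def looks_compressible(sample: bytes, threshold: float = 6.0) -> bool:
    # Low entropy suggests repeated patterns that compression can exploit.
    return shannon_entropy(sample) < threshold

print(looks_compressible(b"abcabcabc" * 100))      # repetitive data
print(looks_compressible(bytes(range(256)) * 10))  # every byte equally likely
```

A real implementation samples only parts of the data to keep the check cheap, as described above.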

Compatibility

@ -36,7 +36,7 @@ machines).
**BEFORE YOU START**

The source filesystem must be clean, eg. no journal to replay and no repairs
needed. The respective **fsck** utility must be run on the source filesystem prior
to conversion. Please refer to the manual pages in case you encounter problems.

For ext2/3/4:

@ -42,7 +42,7 @@ exclusive
   is the amount of data where all references to this data can be reached
   from within this qgroup.

Subvolume quota groups
^^^^^^^^^^^^^^^^^^^^^^

The basic notion of the Subvolume Quota feature is the quota group, short
@ -75,7 +75,7 @@ of qgroups. Figure 1 shows an example qgroup tree.
      |     / \      / \
   extents  1   2   3   4

   Figure 1: Sample qgroup hierarchy

At the bottom, some extents are depicted showing which qgroups reference which
extents. It is important to understand the notion of *referenced* vs
@ -101,7 +101,7 @@ allocation information are not accounted.
In turn, the referenced count of a qgroup can be limited. All writes beyond
this limit will lead to a 'Quota Exceeded' error.

Inheritance
^^^^^^^^^^^

Things get a bit more complicated when new subvolumes or snapshots are created.
@ -133,13 +133,13 @@ exclusive count from the second qgroup needs to be copied to the first qgroup,
as it represents the correct value. The second qgroup is called a tracking
qgroup. It is only there in case a snapshot is needed.

Use cases
^^^^^^^^^

Below are some use cases that are not meant to be exhaustive. You can find your
own way to integrate qgroups.

Single-user machine
"""""""""""""""""""

``Replacement for partitions``
@ -156,7 +156,7 @@ the correct values. 'Referenced' will show how much is in it, possibly shared
with other subvolumes. 'Exclusive' will be the amount of space that gets freed
when the subvolume is deleted.

Multi-user machine
""""""""""""""""""

``Restricting homes``
@ -194,5 +194,3 @@ but some snapshots for backup purposes are being created by the system. The
user's snapshots should be accounted to the user, not the system. The solution
is similar to the one from section 'Accounting snapshots to the user', but do
not assign system snapshots to the user's qgroup.

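The referenced/exclusive distinction described above can be sketched with a toy model (not btrfs code): each extent has a size and a set of qgroups that can reach it.

```python
# Toy model (not btrfs code) of 'referenced' vs 'exclusive' qgroup accounting.
# An extent is (size, set_of_qgroups_that_can_reach_it).
def qgroup_numbers(extents, qgroup):
    referenced = sum(size for size, owners in extents if qgroup in owners)
    exclusive = sum(size for size, owners in extents if owners == {qgroup})
    return referenced, exclusive

extents = [
    (4096, {"0/257"}),           # only subvolume 257 references this extent
    (8192, {"0/257", "0/258"}),  # shared, e.g. via a snapshot
]
print(qgroup_numbers(extents, "0/257"))  # (12288, 4096)
```

The shared 8KiB extent counts toward the referenced amount of both qgroups but toward the exclusive amount of neither, which is why deleting a subvolume frees only its exclusive space.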
@ -19,7 +19,7 @@ UUID on each mount.
Once the seeding device is mounted, it needs the writable device. After adding
it, something like **mount -o remount,rw /path** makes the filesystem at
*/path* ready for use. The simplest use case is to throw away all changes by
unmounting the filesystem when convenient.

Alternatively, deleting the seeding device from the filesystem can turn it into
@ -29,7 +29,7 @@ data from the seeding device.
The seeding device flag can be cleared again by **btrfstune -f -s 0**, eg.
allowing updates with newer data, but please note that this will invalidate
all existing filesystems that use this particular seeding device. This works
for some use cases, not for others, and a forcing flag to the command is
mandatory to avoid accidental mistakes.

An example of how to create and use one seeding device:
@ -71,8 +71,6 @@ A few things to note:
* it's recommended to use only a single device for the seeding device, it works
  for multiple devices but the *single* profile must be used in order to make
  the seeding device deletion work
* block group profiles *single* and *dup* support the use cases above
* the label is copied from the seeding device and can be changed by **btrfs filesystem label**
* each new mount of the seeding device gets a new random UUID

@ -0,0 +1,58 @@
A BTRFS subvolume is a part of a filesystem with its own independent
file/directory hierarchy. Subvolumes can share file extents. A snapshot is also
a subvolume, but with a given initial content of the original subvolume.

.. note::
   A subvolume in BTRFS is not like an LVM logical volume, which is a
   block-level snapshot, while BTRFS subvolumes are file extent-based.

A subvolume looks like a normal directory, with some additional operations
described below. Subvolumes can be renamed or moved; nesting subvolumes is not
restricted but has some implications regarding snapshotting.

A subvolume in BTRFS can be accessed in two ways:

* like any other directory that is accessible to the user
* like a separately mounted filesystem (options *subvol* or *subvolid*)

In the latter case the parent directory is not visible and accessible. This is
similar to a bind mount, and in fact the subvolume mount does exactly that.

A freshly created filesystem is also a subvolume, called *top-level*,
internally with id 5. This subvolume cannot be removed or replaced by another
subvolume. This is also the subvolume that will be mounted by default, unless
the default subvolume has been changed (see ``btrfs subvolume set-default``).

A snapshot is a subvolume like any other, with given initial content. By
default, snapshots are created read-write. File modifications in a snapshot
do not affect the files in the original subvolume.

Subvolume flags
---------------

The subvolume flag currently implemented is the *ro* property. Read-write
subvolumes have that set to *false*, snapshots to *true*. In addition to that,
a plain snapshot will also have its last change generation and creation
generation equal.

Read-only snapshots are building blocks of incremental send (see
``btrfs-send(8)``) and the whole use case relies on unmodified snapshots where
the relative changes are generated from. Thus, changing the subvolume flags
from read-only to read-write will break the assumptions and may lead to
unexpected changes in the resulting incremental stream.

A snapshot that was created by send/receive will be read-only, with a different
last change generation and with the *received_uuid* set, which identifies
the subvolume on the filesystem that produced the stream. The use case relies
on matching data on both sides. Changing the subvolume to read-write after it
has been received requires resetting the *received_uuid*. As this is a notable
change and could potentially break the incremental send use case, performing
it by **btrfs property set** requires the force option if that is really
desired by the user.

.. note::
   The safety checks have been implemented in 5.14.2. Any subvolumes previously
   received (with a valid *received_uuid*) and with read-write status may exist and
   could still lead to problems with send/receive. You can use **btrfs subvolume
   show** to identify them. Flipping the flags to read-only and back to
   read-write will reset the *received_uuid* manually. There may exist a
   convenience tool in the future.

@ -0,0 +1,116 @@
A BTRFS filesystem can be created on top of single or multiple block devices.
Devices can then be added, removed or replaced on demand. Data and metadata are
organized in allocation profiles with various redundancy policies. There's some
similarity with traditional RAID levels, but this could be confusing to users
familiar with the traditional meaning. Due to the similarity, the RAID
terminology is widely used in the documentation. See ``mkfs.btrfs(8)`` for more
details and the exact profile capabilities and constraints.

The device management works on a mounted filesystem. Devices can be added,
removed or replaced by commands provided by ``btrfs device`` and ``btrfs replace``.

The profiles can also be changed, provided there's enough workspace to do the
conversion, using the ``btrfs balance`` command and namely the filter *convert*.

Type
   The block group profile type is the main distinction of the information stored
   on the block device. User data are called *Data*, the internal data structures
   managed by the filesystem are *Metadata* and *System*.

Profile
   A profile describes an allocation policy based on the redundancy/replication
   constraints in connection with the number of devices. The profile applies to
   data and metadata block groups separately. Eg. *single*, *RAID1*.

RAID level
   Where applicable, the level refers to a profile that matches constraints of the
   standard RAID levels. At the moment the supported ones are: RAID0, RAID1,
   RAID10, RAID5 and RAID6.

Typical use cases
-----------------

Starting with a single-device filesystem
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Assume we've created a filesystem on a block device */dev/sda* with profile
*single/single* (data/metadata), the device size is 50GiB and we've used the
whole device for the filesystem. The mount point is */mnt*.

The amount of data stored is 16GiB, metadata have allocated 2GiB.

Add new device
""""""""""""""

We want to increase the total size of the filesystem and keep the profiles. The
size of the new device */dev/sdb* is 100GiB.

.. code-block:: bash

   $ btrfs device add /dev/sdb /mnt

The amount of free data space increases by less than 100GiB, as some space is
allocated for metadata.

Convert to RAID1
""""""""""""""""

Now we want to increase the redundancy level of both data and metadata, but
we'll do that in steps. Note that the device sizes are not equal and we'll use
that to show the capabilities of split data/metadata and independent profiles.

The constraint for RAID1 gives us at most 50GiB of usable space and exactly 2
copies will be stored on the devices.

First we'll convert the metadata. As the metadata occupy less than 50GiB and
there's enough workspace for the conversion process, we can do:

.. code-block:: bash

   $ btrfs balance start -mconvert=raid1 /mnt

This operation can take a while, because all metadata have to be moved and all
block pointers updated. Depending on the physical locations of the old and new
blocks, the disk seeking is the key factor affecting performance.

You'll note that the system block group has also been converted to RAID1, this
normally happens as the system block group also holds metadata (the physical to
logical mappings).

What changed:

* available data space decreased by 3GiB, usable roughly (50 - 3) + (100 - 3) = 144 GiB
* metadata redundancy increased

IOW, the unequal device sizes allow for combined space for data yet improved
redundancy for metadata. If we decide to increase redundancy of data as well,
we're going to lose 50GiB of the second device for obvious reasons.

.. code-block:: bash

   $ btrfs balance start -dconvert=raid1 /mnt

The balance process needs some workspace (ie. free device space without any
data or metadata block groups), so the command could fail if there's too much
data or the block groups occupy the whole first device.

The device size of */dev/sdb* as seen by the filesystem remains unchanged, but
the logical space from 50-100GiB will be unused.

Remove device
"""""""""""""

Device removal must satisfy the profile constraints, otherwise the command
fails. For example:

.. code-block:: bash

   $ btrfs device remove /dev/sda /mnt
   ERROR: error removing device '/dev/sda': unable to go below two devices on raid1

In order to remove a device, you need to convert the profile in this case:

.. code-block:: bash

   $ btrfs balance start -mconvert=dup -dconvert=single /mnt
   $ btrfs device remove /dev/sda /mnt

@ -0,0 +1,66 @@
Since version 5.12 btrfs supports so-called *zoned mode*. This is a special
on-disk format and allocation/write strategy that's friendly to zoned devices.
In short, a device is partitioned into fixed-size zones and each zone can be
updated in an append-only manner, or reset. As btrfs has no fixed data
structures, except the super blocks, the zoned mode only requires block
placement that follows the device constraints. You can learn about the whole
architecture at https://zonedstorage.io .

The devices are also called SMR/ZBC/ZNS, in *host-managed* mode. Note that
there are devices that appear as non-zoned but actually are, this is
*drive-managed* and using zoned mode won't help.

The zone size depends on the device, typical sizes are 256MiB or 1GiB. In
general it must be a power of two. Emulated zoned devices like *null_blk* allow
setting various zone sizes.

Requirements, limitations
^^^^^^^^^^^^^^^^^^^^^^^^^

* all devices must have the same zone size
* maximum zone size is 8GiB
* mixing zoned and non-zoned devices is possible, the zone writes are emulated,
  but this is namely for testing
* the super block is handled in a special way and is at different locations
  than on a non-zoned filesystem:

  * primary: 0B (and the next two zones)
  * secondary: 512GiB (and the next two zones)
  * tertiary: 4TiB (4096GiB, and the next two zones)

Incompatible features
^^^^^^^^^^^^^^^^^^^^^

The main constraint of the zoned devices is the lack of in-place updates of the
data. This is inherently incompatible with some features:

* nodatacow - overwrite in-place, cannot create such files
* fallocate - preallocating space for in-place first write
* mixed-bg - unordered writes to data and metadata, fixing that means using
  separate data and metadata block groups
* booting - the zone at offset 0 contains the superblock, resetting the zone
  would destroy the bootloader data

Initial support lacks some features but they're planned:

* only the single profile is supported
* fstrim - due to the dependency on free space cache v1

Super block
^^^^^^^^^^^

As said above, the super block is handled in a special way. In order to be
crash safe, at least one zone in a known location must contain a valid
superblock. This is implemented as a ring buffer in two consecutive zones,
starting from known offsets 0B, 512GiB and 4TiB.

The values are different than on non-zoned devices. Each new super block is
appended to the end of the zone; once it's filled, the zone is reset and writes
continue to the next one. Looking up the latest super block needs to read
offsets of both zones and determine the last written version.

The amount of space reserved for the super block depends on the zone size. The
secondary and tertiary copies are at distant offsets as the capacity of the
devices is expected to be large, tens of terabytes. The maximum zone size
supported is 8GiB, which would mean that eg. offset 0-16GiB would be reserved
just for the super block on a hypothetical device of that zone size. This is
wasteful but required to guarantee crash safety.

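The lookup procedure from the last paragraphs can be sketched as a toy model; the structure (lists of generation numbers) is illustrative and not the btrfs on-disk format:

```python
# Toy model of the zoned super block ring buffer: two zones, super blocks
# appended in write order; the valid copy is the one with the highest
# generation. Structure and names are illustrative, not the on-disk format.

def latest_super_block(zone_a, zone_b):
    """Each zone is a list of generation numbers in append order;
    the last entry of each zone is its newest candidate."""
    candidates = [z[-1] for z in (zone_a, zone_b) if z]
    if not candidates:
        return None
    return max(candidates)

# Zone A was filled up to generation 41, then writes wrapped to zone B:
print(latest_super_block([38, 39, 40, 41], [42, 43]))  # 43
```

Only the tail of each zone has to be examined, which is why the lookup reads both zones and compares the last written versions.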
@ -8,7 +8,6 @@ Welcome to BTRFS documentation!
   :caption: Overview

   Introduction
   Quick-start
   man-index

.. toctree::

@ -41,6 +40,7 @@ Welcome to BTRFS documentation!
   :maxdepth: 1
   :caption: TODO

   Quick-start
   Interoperability
   Glossary
   Flexibility
