Btrfs design
============

Btrfs is implemented with simple and well-known constructs. It should
perform well, but the long-term goal of maintaining performance as the
filesystem ages and grows is more important than winning a short-lived
benchmark. To that end, benchmarks are being used to try to simulate
performance over the life of a filesystem.


Btree Data structures
---------------------

The Btrfs btree provides a generic facility to store a variety of data
types. Internally it only knows about three data structures: keys,
items, and a block header:

.. code-block:: none

   struct btrfs_header {
           u8 csum[32];
           u8 fsid[16];
           __le64 bytenr;
           __le64 flags;

           u8 chunk_tree_uid[16];
           __le64 generation;
           __le64 owner;
           __le32 nritems;
           u8 level;
   }

.. code-block:: none

   struct btrfs_disk_key {
          __le64 objectid;
          u8 type;
          __le64 offset;
   }

.. code-block:: none

   struct btrfs_item {
          struct btrfs_disk_key key;
          __le32 offset;
          __le32 size;
   }

Upper nodes of the trees contain only [ key, block pointer ] pairs. Tree
leaves are broken up into two sections that grow toward each other.
Leaves have an array of fixed sized items, and an area where item data
is stored. The offset and size fields in the item indicate where in the
leaf the item data can be found. Example:

.. image:: Leaf-structure.png
   :alt: Leaf structure

Item data is variably sized, and various filesystem data structures are
defined as different types of item data. The type field in struct
btrfs_disk_key indicates the type of data stored in the item.
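
The two-section leaf layout can be illustrated with a minimal in-memory
sketch. Everything named ``toy_`` is hypothetical: the 256 byte leaf, the
bare 64 bit key standing in for struct btrfs_disk_key, and the omission of
the block header are simplifications for illustration, not the on-disk
format.

.. code-block:: c

   #include <stdint.h>
   #include <string.h>

   #define TOY_LEAF_SIZE 256   /* hypothetical, far smaller than a real leaf */

   /* Simplified item header: locates the item's data inside the leaf. */
   struct toy_item {
           uint64_t key;       /* stand-in for struct btrfs_disk_key */
           uint32_t offset;    /* byte offset of the data within buf */
           uint32_t size;      /* length of the data in bytes */
   };

   struct toy_leaf {
           uint32_t nritems;
           uint8_t buf[TOY_LEAF_SIZE];  /* items at the front, data at the back */
   };

   /* Copy the item header stored in the given slot into *out. */
   void toy_read_item(const struct toy_leaf *leaf, uint32_t slot,
                      struct toy_item *out)
   {
           memcpy(out, leaf->buf + slot * sizeof(*out), sizeof(*out));
   }

   /* Start of the data area: the data of the last item sits lowest. */
   uint32_t toy_data_start(const struct toy_leaf *leaf)
   {
           struct toy_item last;

           if (leaf->nritems == 0)
                   return TOY_LEAF_SIZE;
           toy_read_item(leaf, leaf->nritems - 1, &last);
           return last.offset;
   }

   /* Append an item; returns 0, or -1 if the two sections would collide. */
   int toy_leaf_append(struct toy_leaf *leaf, uint64_t key,
                       const void *data, uint32_t size)
   {
           uint32_t items_end = leaf->nritems * (uint32_t)sizeof(struct toy_item);
           uint32_t data_start = toy_data_start(leaf);
           struct toy_item item;

           if (data_start - items_end < sizeof(item) + size)
                   return -1;
           item.key = key;
           item.size = size;
           item.offset = data_start - size;            /* data grows downward */
           memcpy(leaf->buf + item.offset, data, size);
           memcpy(leaf->buf + items_end, &item, sizeof(item)); /* items grow upward */
           leaf->nritems++;
           return 0;
   }

Appending a 3 byte item to an empty toy leaf places its data at offset 253;
a following 5 byte item lands at offset 248, while the two item headers
occupy bytes 0 to 31.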

The block header contains a checksum for the block contents, the uuid of
the filesystem that owns the block, the level of the block in the tree,
and the block number where this block is supposed to live. These fields
allow the contents of the metadata to be verified when the data is read.
Everything that points to a btree block also stores the generation field
it expects that block to have. This allows Btrfs to detect phantom or
misplaced writes on the media.

The checksum of the lower node is not stored in the node pointer to
simplify the FS writeback code. The generation number will be known at
the time the block is inserted into the btree, but the checksum is only
calculated before writing the block to disk. Using the generation will
allow Btrfs to detect phantom writes without having to find and update
the upper node each time the lower node checksum is updated.
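
The checks described above might be sketched as follows. The ``toy_``
structures and the result codes are hypothetical, fields are assumed to be
already converted to host byte order, and checksum verification is left
out; this shows the idea, not the kernel's verification path.

.. code-block:: c

   #include <stdint.h>
   #include <string.h>

   /* Subset of the header fields described above, in host byte order. */
   struct toy_header {
           uint8_t fsid[16];
           uint64_t bytenr;
           uint64_t generation;
           uint8_t level;
   };

   /* What the parent node (or super block) recorded about the child. */
   struct toy_block_ptr {
           uint64_t bytenr;
           uint64_t generation;
   };

   enum toy_verify_result {
           TOY_OK = 0,
           TOY_BAD_FSID,        /* block belongs to another filesystem */
           TOY_BAD_BYTENR,      /* misplaced write: block meant for elsewhere */
           TOY_BAD_GENERATION,  /* phantom write: stale contents still on disk */
   };

   /* Compare a block that was just read against what its parent expects. */
   enum toy_verify_result toy_verify_block(const struct toy_header *hdr,
                                           const struct toy_block_ptr *ptr,
                                           const uint8_t fsid[16])
   {
           if (memcmp(hdr->fsid, fsid, 16) != 0)
                   return TOY_BAD_FSID;
           if (hdr->bytenr != ptr->bytenr)
                   return TOY_BAD_BYTENR;
           if (hdr->generation != ptr->generation)
                   return TOY_BAD_GENERATION;
           return TOY_OK;
   }

Because the parent stores only the expected generation, rewriting the child
changes nothing in the parent beyond what copy on write already updates.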

The generation field corresponds to the transaction id that allocated
the block, which enables easy incremental backups and is used by the
copy on write transaction subsystem.


Filesystem Data Structures
--------------------------

Each object in the filesystem has an objectid, which is allocated
dynamically on creation. A free objectid is simply a hole in the key
space of the filesystem btree, that is, an objectid that does not yet
exist in the tree. The objectid makes up the most significant bits of the key,
allowing all of the items for a given filesystem object to be logically
grouped together in the btree.

The offset field of the key indicates the byte offset for a
particular item in the object. For file extents, this would be the byte
offset of the start of the extent in the file. The type field stores the
item type information, and has extra room for expanded use.
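
The resulting key order can be sketched as a three-way comparison. The
``toy_key`` structure is a hypothetical host-order copy of the key fields,
and the comparison order after the objectid (type, then offset) is assumed
here from the field order of struct btrfs_disk_key.

.. code-block:: c

   #include <stdint.h>

   /* Host-order copy of the key fields. */
   struct toy_key {
           uint64_t objectid;
           uint8_t type;
           uint64_t offset;
   };

   /* Returns <0, 0 or >0: objectid first, then type, then offset. */
   int toy_key_cmp(const struct toy_key *a, const struct toy_key *b)
   {
           if (a->objectid != b->objectid)
                   return a->objectid < b->objectid ? -1 : 1;
           if (a->type != b->type)
                   return a->type < b->type ? -1 : 1;
           if (a->offset != b->offset)
                   return a->offset < b->offset ? -1 : 1;
           return 0;
   }

With this ordering, every item of one object sorts before any item of the
next objectid, and items of the same type within an object sort by offset.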

Inodes
------

Inodes are stored in struct btrfs_inode_item at offset zero in the key,
and have a type value of one. Inode items are always the lowest valued
key for a given object, and they store the traditional stat data for
files and directories. The inode structure is relatively small, and will
not contain embedded file data or extended attribute data. These things
are stored in other item types.

Files
-----

Small files that occupy less than one leaf block may be packed into the
btree inside the extent item. In this case the key offset is the byte
offset of the data in the file, and the size field of struct btrfs_item
indicates how much data is stored. There may be more than one of these
per file.

Larger files are stored in extents. struct btrfs_file_extent_item
records a generation number for the extent and a [ disk block, disk num
blocks ] pair to record the area of disk corresponding to the file.
Extents also store the logical offset and the number of blocks used by
this extent record into the extent on disk. This allows Btrfs to satisfy
a rewrite into the middle of an extent without having to read the old
file data first. For example, writing 1MB into the middle of an existing
128MB extent may result in three extent records:

``[ old extent: bytes 0-64MB ], [ new extent 1MB ], [ old extent: bytes 65MB – 128MB]``
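
A minimal sketch of that split follows. The ``toy_file_extent`` structure
and its field names are hypothetical and use bytes rather than blocks; it
only mirrors the fields named above (disk location, disk length, offset
into the extent, length used).

.. code-block:: c

   #include <stdint.h>

   struct toy_file_extent {
           uint64_t file_offset;    /* key offset: start of the range in the file */
           uint64_t disk_bytenr;    /* start of the extent on disk */
           uint64_t disk_num_bytes; /* full length of the extent on disk */
           uint64_t offset;         /* start of this record within the disk extent */
           uint64_t num_bytes;      /* bytes of the disk extent used by this record */
   };

   /*
    * Overwrite [start, start + len) strictly inside 'old' with data stored at
    * new_disk_bytenr. Fills out[0..2] and returns 3, or returns 0 if the
    * range is not strictly inside the old record.
    */
   int toy_split_extent(const struct toy_file_extent *old, uint64_t start,
                        uint64_t len, uint64_t new_disk_bytenr,
                        struct toy_file_extent out[3])
   {
           uint64_t old_end = old->file_offset + old->num_bytes;
           uint64_t head_len, tail_skip;

           if (len == 0 || start <= old->file_offset || start >= old_end ||
               len >= old_end - start)
                   return 0;

           head_len = start - old->file_offset;
           tail_skip = head_len + len;

           /* Head: same disk extent, shortened. */
           out[0] = *old;
           out[0].num_bytes = head_len;

           /* Middle: the newly allocated extent holding the new data. */
           out[1].file_offset = start;
           out[1].disk_bytenr = new_disk_bytenr;
           out[1].disk_num_bytes = len;
           out[1].offset = 0;
           out[1].num_bytes = len;

           /* Tail: same disk extent again, starting further into it. */
           out[2] = *old;
           out[2].file_offset = start + len;
           out[2].offset = old->offset + tail_skip;
           out[2].num_bytes = old->num_bytes - tail_skip;
           return 3;
   }

The head and tail records still reference the untouched old extent on disk,
which is why the old file data does not have to be read or copied.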

File data checksums are stored in a dedicated btree in a struct
btrfs_csum_item. The offset of the key corresponds to the byte number of
the extent. The data is checksummed after any compression or encryption
is done and it reflects the bytes sent to the disk.

A single item may store a number of checksums. struct btrfs_csum_items
are only used for file extents. File data inline in the btree is covered
by the checksum at the start of the btree block.

Directories
-----------

Directories are indexed in two different ways. For filename lookup,
there is an index comprised of keys:

================== ================== ====================
Directory Objectid BTRFS_DIR_ITEM_KEY 64 bit filename hash
================== ================== ====================

The default directory hash used is crc32c, although other hashes may be
added later on. A flags field in the super block will indicate which
hash is used for a given FS.
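
For reference, a bitwise CRC-32C (Castagnoli) can be written as below. The
function names are hypothetical, and this sketch shows the standard form of
the checksum (all-ones seed, final inversion); the seed and finalization
Btrfs applies to filenames are not specified in this document and may
differ.

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>

   /* Bitwise CRC-32C update, reflected polynomial 0x82F63B78. */
   uint32_t toy_crc32c(uint32_t crc, const void *buf, size_t len)
   {
           const uint8_t *p = buf;
           size_t i;
           int bit;

           for (i = 0; i < len; i++) {
                   crc ^= p[i];
                   for (bit = 0; bit < 8; bit++)
                           crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
           }
           return crc;
   }

   /* Standard CRC-32C: all-ones seed and final inversion. */
   uint32_t toy_crc32c_standard(const void *buf, size_t len)
   {
           return ~toy_crc32c(0xFFFFFFFFu, buf, len);
   }

The standard form yields the well-known check value ``0xE3069283`` for the
ASCII string ``123456789``. Production code would normally use a table
driven or hardware accelerated variant instead of the bit loop.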

The second directory index is used by readdir to return data in inode
number order. This more closely resembles the order of blocks on disk
and generally provides better performance for reading data in bulk
(backups, copies, etc). Also, it allows fast checking that a given inode
is linked into a directory when verifying inode link counts. This index
uses an additional set of keys:

================== =================== =====================
Directory Objectid BTRFS_DIR_INDEX_KEY Inode Sequence number
================== =================== =====================

The inode sequence number comes from the directory. It is increased each
time a new file or directory is added.


Reference Counted Extents
-------------------------

Reference counting is the basis for the snapshotting subsystems. For
every extent allocated to a btree or a file, Btrfs records the number of
references in a struct btrfs_extent_item. The trees that hold these
items also serve as the allocation map for blocks that are in use on the
filesystem. Some trees are not reference counted and are only protected
by copy on write logging. However, the same type of extent item is
used for all allocated blocks on the disk.

A reasonably comprehensive description of the way that references work
can be found in `this email from Josef
Bacik <http://www.spinics.net/lists/linux-btrfs/msg33415.html>`__.


Extent Block Groups
-------------------

Extent block groups allow allocator optimizations by breaking the disk
up into chunks of 256MB or more. For each chunk, they record information
about the number of blocks available. Files and directories will have a
preferred block group which they try first for allocations.

Block groups have a flag that indicates if they are preferred for data or
metadata allocations, and at mkfs time the disk is broken up into
alternating metadata (33% of the disk) and data groups (66% of the
disk). As the disk fills, a group's preference may change back and
forth, but Btrfs always tries to avoid intermixing data and metadata
extents in the same group. This substantially improves fsck throughput,
and reduces seeks during writeback while the FS is mounted. It does
slightly increase the seeks while reading.


Extent Trees and DM integration
-------------------------------

The Btrfs extent trees are intended to divide up the available storage
into a number of flexible allocation policies. Each extent tree owns a
section of the underlying disk, and they can be assigned to a collection
of (or a single) tree roots, directories or inodes. Policies will direct
how a given allocation is spread across the extent trees available,
allowing the admin to direct which parts of the filesystem are striped,
mirrored or confined to a given device.

Btrfs will try to tie in with DM in order to easily manage large pools
of storage. The basic idea is to have at least one extent tree per
spindle, and then allow the admin to assign those extent trees to
subvolumes, directories or files.


Explicit Back References
------------------------

Back references have three main goals:

-  Differentiate between all holders of references to an extent so that
   when a reference is dropped we can make sure it was a valid reference
   before freeing the extent.
-  Provide enough information to quickly find the holders of an extent
   if we notice a given block is corrupted or bad.
-  Make it easy to migrate blocks for FS shrinking or storage pool
   maintenance. This is actually the same as #2, but with a slightly
   different use case.


File Extent Backrefs
^^^^^^^^^^^^^^^^^^^^

File extents can be referenced by:

-  Multiple snapshots, subvolumes, or different generations in one
   subvol
-  Different files inside a single subvolume
-  Different offsets inside a file

.. note::
   The remainder of this section refers to the extent_ref_v0 structure, which is not used on current btrfs filesystems.

The extent ref structure has fields for:

-  Objectid of the subvolume root
-  Generation number of the tree holding the reference
-  Objectid of the file holding the reference
-  Offset in the file corresponding to the key holding the reference

When a file extent is allocated the fields are filled in:

   (root objectid, transaction id, inode objectid, offset in file)

When a leaf is cow'd new references are added for every file extent
found in the leaf. It looks the same as the create case, but the
transaction id will be different when the block is cow'd.

   (root objectid, transaction id, inode objectid, offset in file)

When a file extent is removed either during snapshot deletion or file
truncation, the corresponding back reference is found by searching for:

   (btrfs_header_owner(leaf), btrfs_header_generation(leaf), inode
   objectid, offset in file)


Btree Extent Backrefs
^^^^^^^^^^^^^^^^^^^^^

Btree extents can be referenced by:

-  Different subvolumes
-  Different generations of the same subvolume

Storing sufficient information for a full reverse mapping of a btree
block would require storing the lowest key of the block in the backref,
and it would require updating that lowest key either before write out or
every time it changed.

Instead, the objectid of the lowest key is stored along with the level
of the tree block. This provides a hint about where in the btree the
block can be found. Searches through the btree only need to look for a
pointer to that block, and they stop one level higher than the level
recorded in the backref.

Some btrees do not do reference counting on their extents. These include
the extent tree and the tree of tree roots. Backrefs for these trees
always have a generation of zero.

When a tree block is created, back references are inserted:

   (root objectid, transaction id or zero, level, lowest objectid)

The level is stored in the objectid slot of the backref to differentiate
between Btree back references and file data back references. The highest
possible level is 255, and the lowest possible file objectid has been
raised to 256. So, if the objectid field in the back reference is less
than 256, it corresponds to a Btree block.
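
The rule can be expressed as a small predicate. The ``toy_extent_ref``
structure is a hypothetical host-order view of the four back reference
fields of the historical format described here, not a current on-disk
structure.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* Lowest possible file objectid, as stated in the text above. */
   #define TOY_FIRST_FILE_OBJECTID 256ULL

   /* The four back reference fields, in host byte order. */
   struct toy_extent_ref {
           uint64_t root;
           uint64_t generation;
           uint64_t objectid;  /* inode objectid, or tree level for btree blocks */
           uint64_t offset;    /* file offset, or lowest objectid for btree blocks */
   };

   /* True if the back reference describes a btree block rather than file data. */
   bool toy_ref_is_tree_block(const struct toy_extent_ref *ref)
   {
           return ref->objectid < TOY_FIRST_FILE_OBJECTID;
   }

   /* Tree level stored in the objectid slot; only valid for btree backrefs. */
   uint8_t toy_ref_level(const struct toy_extent_ref *ref)
   {
           return (uint8_t)ref->objectid;
   }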

When a tree block is cow'd in a reference counted root, new back
references are added for all the blocks it points to:

   (root objectid, transaction id, level, lowest objectid)

Because the lowest_key_objectid and the level are just hints they are
not used when backrefs are deleted. When a snapshot is created a new
reference is taken directly on the root block. This means the owner
field of the root block may be different from the objectid of the
snapshot. So, when dropping references on tree roots, the objectid of
the root structure is always used. When a backref is deleted:

.. code-block:: none

   if backref was for a tree root:
        root_objectid = root->root_key.objectid
   else
        root_objectid = btrfs_header_owner(parent)

   (root_objectid, btrfs_header_generation(parent) or zero, 0, 0)


Back Reference Key Construction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Back references have four fields, each 64 bits long. This is hashed into
a single 64 bit number and placed into the key offset. The key objectid
corresponds to the first byte in the extent, and the key type is set to
BTRFS_EXTENT_REF_KEY.

Hash overflows on the offset field are handled by adding one to the
calculated hash and searching forward. The searching stops when the
correct back reference structure is found or


Snapshots and Subvolumes
------------------------

A subvolume is basically a named btree that holds files and directories.
Subvolumes have inodes inside the tree of tree roots and can have non-root
owners and groups. Subvolumes can be given a quota of blocks, and once
this quota is reached no new writes are allowed. All of the blocks and
file extents inside of subvolumes are reference counted to allow
snapshotting. Up to 2\ :sup:`64` subvolumes may be created on the FS.

Snapshots are identical to subvolumes, but their root block is initially
shared with another subvolume. When the snapshot is taken, the reference
count on the root block is increased, and the copy on write transaction
system ensures changes made in either the snapshot or the source
subvolume are private to that root. Snapshots are writable, and they can
be snapshotted again any number of times. If read only snapshots are
desired, their block quota is set to one at creation time.


Btree Roots
-----------

Each Btrfs filesystem consists of a number of tree roots. A freshly
formatted filesystem will have roots for:

-  The tree of tree roots
-  The tree of allocated extents
-  The default subvolume tree

The tree of tree roots records the root block for the extent tree and
the root blocks and names for each subvolume and snapshot tree. As
transactions commit, the root block pointers are updated in this tree to
reference the new roots created by the transaction, and then the new
root block of this tree is recorded in the FS super block.

The tree of tree roots acts as a directory of all the other trees on the
filesystem, and it has directory items recording the names of all
snapshots and subvolumes in the FS. Each snapshot or subvolume has an
objectid in the tree of tree roots, and at least one corresponding
struct btrfs_root_item. Directory items in the tree map names of
snapshots and subvolumes to these root items. Because the root item key
is updated with every transaction commit, the directory items reference
a generation number of (u64)-1, which tells the lookup code to find the
most recent root available.

The extent trees are used to manage allocated space on the devices. The
space available can be divided between a number of extent trees to
reduce lock contention and give different allocation policies to
different block ranges.

The diagram below depicts a collection of tree roots. The super block
points to the root tree, and the root tree points to the extent trees
and subvolumes. The root tree also has a directory to map subvolume
names to struct btrfs_root_items in the root tree. This filesystem has
one subvolume named 'default' (created by mkfs), and one snapshot of
'default' named 'snap' (created by the admin some time later). In this
example, 'default' has not changed since the snapshot was created and so
both trees point to the same root block on disk.

.. image:: Copy-Design-r.png
   :alt: Copy-Design-r.png


Copy on Write Logging
---------------------

Data and metadata in Btrfs are protected with copy on write logging
(COW). Once the transaction that allocated the space on disk has
committed, any new writes to that logical address in the file or btree
will go to a newly allocated block, and block pointers in the btrees and
super blocks will be updated to reflect the new location.

Some of the btrfs trees do not use reference counting for their
allocated space. This includes the root tree, and the extent trees. As
blocks are replaced in these trees, the old block is freed in the extent
tree. These blocks are not reused for other purposes until the
transaction that freed them commits.

All subvolume (and snapshot) trees are reference counted. When a COW
operation is performed on a btree node, the reference count of all the
blocks it points to is increased by one. For leaves, the reference
counts of any file extents in the leaf are increased by one. When the
transaction commits, a new root pointer is inserted in the root tree for
each new subvolume root. The key used has the form:

====================== =================== ==============
Subvolume inode number BTRFS_ROOT_ITEM_KEY Transaction ID
====================== =================== ==============

The updated btree blocks are all flushed to disk, and then the super
block is updated to point to the new root tree. Once the super block has
been properly written to disk, the transaction is considered complete.
At this time the root tree has two pointers for each subvolume changed
during the transaction. One item points to the new tree and one points
to the tree that existed at the start of the last transaction.

Any time after the commit finishes, the older subvolume root items may
be removed. The reference count on the subvolume root block is lowered
by one. If the reference count reaches zero, the block is freed and the
reference count on any nodes the root points to is lowered by one. If a
tree node or leaf can be freed, it is traversed to free the nodes or
extents below it in the tree in a depth first fashion.
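
The reference counting rules for snapshot creation, COW and freeing can be
sketched on a toy block table. All ``toy_`` names are hypothetical, blocks
are addressed by array index instead of disk location, and file extents
referenced from leaves are left out.

.. code-block:: c

   #include <stdint.h>

   #define TOY_MAX_CHILDREN 4
   #define TOY_NO_CHILD (-1)

   /* One tree block; children are indices into the same table. */
   struct toy_block {
           uint32_t refs;                   /* reference count of this block */
           int children[TOY_MAX_CHILDREN];  /* TOY_NO_CHILD marks an empty slot */
   };

   /* Snapshot: the new root shares the source root block, so only that
    * block's reference count is raised. */
   void toy_snapshot(struct toy_block *blocks, int root_idx)
   {
           blocks[root_idx].refs++;
   }

   /* COW blocks[src] into the free slot dst: the copy takes its own
    * reference on every block the original points to. */
   void toy_cow(struct toy_block *blocks, int src, int dst)
   {
           int i;

           blocks[dst] = blocks[src];
           blocks[dst].refs = 1;
           for (i = 0; i < TOY_MAX_CHILDREN; i++) {
                   if (blocks[dst].children[i] != TOY_NO_CHILD)
                           blocks[blocks[dst].children[i]].refs++;
           }
   }

   /*
    * Drop one reference on blocks[idx] (its count must be non-zero). When
    * the count reaches zero the block is freed and the references it held
    * on its children are dropped depth first. Returns the number of blocks
    * freed.
    */
   int toy_drop_ref(struct toy_block *blocks, int idx)
   {
           int freed = 0;
           int i;

           if (--blocks[idx].refs > 0)
                   return 0;
           for (i = 0; i < TOY_MAX_CHILDREN; i++) {
                   if (blocks[idx].children[i] != TOY_NO_CHILD)
                           freed += toy_drop_ref(blocks, blocks[idx].children[i]);
           }
           return freed + 1;
   }

In this sketch, deleting a snapshot whose root has since been COWed by the
source subvolume frees only the old root block; the shared leaves stay
allocated until the last root referencing them is dropped.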

The traversal and freeing of the tree may be done in pieces by inserting
a progress record in the root tree. The progress record indicates the
last key and level touched by the traversal so the current transaction
can commit and the traversal can resume in the next transaction. If the
system crashes before the traversal completes, the progress record is
used to safely delete the root on the next mount.

Ohad Rodeh presented this reference counted snapshot algorithm at the
2007 Linux Filesystem and Storage Workshop:

Slides: `LinuxFS_Workshop.pdf <Media:LinuxFS_Workshop.pdf>`__

Paper: `Btree_TOS.pdf <Media:Btree_TOS.pdf>`__

The Btrfs snapshotting implementation is based on the ideas he
presented.

Btrfsck
-------

The filesystem checking utility is a crucial tool, but it can be a major
bottleneck in getting systems back online after something has gone
wrong. Btrfs aims to be tolerant of invalid metadata, and will avoid
using metadata it determines to be incorrect. The disk format allows
Btrfs to deal with most corruptions at run time, without crashing the
system and without requiring offline filesystem checking.

An offline btrfsck is being developed, in part to help verify the
filesystem during testing, and as an emergency tool to make sure the
filesystem is safe for mounting. The existing tool only verifies the
extent allocation maps, making sure that reference counts are correct
and that all extents are accounted for. If the extent maps are correct,
there is no risk of incorrectly writing over existing data or metadata
as blocks are allocated for new use.

btrfsck is able to read metadata in roughly disk order. As it scans the
btrees on disk, it collects the locations of nodes and leaves and pulls
them from the disk in large sequential batches. For the most part,
btrfsck is bound by the sequential read throughput of the storage, and
it is able to take advantage of multi-spindle arrays. The price paid for
the extra speed is more RAM. Btrfsck uses about 3x more RAM than
ext2fsck.
-												btrfs-progs: docs: add some design-related documents

Copied from wiki.

Signed-off-by: David Sterba <dsterba@suse.com>

											
										
										
											2023-03-17 21:35:30 +00:00
+								Btrfs design
 								============
 								Btrfs is implemented with simple and well known constructs. It should
 								perform well, but the long term goal of maintaining performance as the
 								FS system ages and grows is more important than winning a short lived
 								benchmark. To that end, benchmarks are being used to try to simulate
 								performance over the life of a filesystem.
 								Btree Data structures
 								---------------------
 								The Btrfs btree provides a generic facility to store a variety of data
 								types. Internally it only knows about three data structures: keys,
 								items, and a block header:
-												btrfs-progs: docs: fix sphinx code-block warnings

There are several warnings regarding the absence of an argument for the
code-block directive on some build servers using python3-sphinx 0.2.2-17.

For example:

Making all in Documentation
    [SPHINX] man
ch-subvolume-intro.rst:141: WARNING: Error in "code-block" directive:
1 argument(s) required, 0 supplied.

.. code-block::

   27 21 0:19 /subv1 /mnt rw,relatime - btrfs /dev/sda rw,space_cache

 Etc...

Add the none argument.

[ci skip]

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

											
										
										
											2024-01-10 17:25:22 +00:00
+								.. code-block:: none
-												btrfs-progs: docs: add some design-related documents

Copied from wiki.

Signed-off-by: David Sterba <dsterba@suse.com>

											
										
										
											2023-03-17 21:35:30 +00:00
 								   struct btrfs_header {
 								           u8 csum[32];
 								           u8 fsid[16];
 								           __le64 bytenr;
 								           __le64 flags;
 								           u8 chunk_tree_uid[16];
 								           __le64 generation;
 								           __le64 owner;
 								           __le32 nritems;
 								           u8 level;
 								   }
-												btrfs-progs: docs: fix sphinx code-block warnings

There are several warnings regarding the absence of an argument for the
code-block directive on some build servers using python3-sphinx 0.2.2-17.

For example:

Making all in Documentation
    [SPHINX] man
ch-subvolume-intro.rst:141: WARNING: Error in "code-block" directive:
1 argument(s) required, 0 supplied.

.. code-block::

   27 21 0:19 /subv1 /mnt rw,relatime - btrfs /dev/sda rw,space_cache

 Etc...

Add the none argument.

[ci skip]

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

											
										
										
											2024-01-10 17:25:22 +00:00
+								.. code-block:: none
-												btrfs-progs: docs: add some design-related documents

Copied from wiki.

Signed-off-by: David Sterba <dsterba@suse.com>

											
										
										
											2023-03-17 21:35:30 +00:00
 								   struct btrfs_disk_key {
 								          __le64 objectid;
 								          u8 type;
 								          __le64 offset;
 								   }
-												btrfs-progs: docs: fix sphinx code-block warnings

There are several warnings regarding the absence of an argument for the
code-block directive on some build servers using python3-sphinx 0.2.2-17.

For example:

Making all in Documentation
    [SPHINX] man
ch-subvolume-intro.rst:141: WARNING: Error in "code-block" directive:
1 argument(s) required, 0 supplied.

.. code-block::

   27 21 0:19 /subv1 /mnt rw,relatime - btrfs /dev/sda rw,space_cache

 Etc...

Add the none argument.

[ci skip]

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

											
										
										
											2024-01-10 17:25:22 +00:00
+								.. code-block:: none
-												btrfs-progs: docs: add some design-related documents

Copied from wiki.

Signed-off-by: David Sterba <dsterba@suse.com>

											
										
										
											2023-03-17 21:35:30 +00:00
 								   struct btrfs_item {
 								          struct btrfs_disk_key key;
 								          __le32 offset;
 								          __le32 size;
 								   }
 								Upper nodes of the trees contain only [ key, block pointer ] pairs. Tree
 								leaves are broken up into two sections that grow toward each other.
 								Leaves have an array of fixed sized items, and an area where item data
 								is stored. The offset and size fields in the item indicate where in the
 								leaf the item data can be found. Example:
-												btrfs-progs: docs: fix image directives in design page

Copy the images from wiki so that we don't need to jump around the web
search results.

[ci skip]

Pull-request: #771
Signed-off-by: Austin Chang <austin880625@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>

											
										
										
											2024-04-02 14:21:28 +00:00
+								.. image:: Leaf-structure.png
 								   :alt: Leaf structure
-												btrfs-progs: docs: add some design-related documents

Copied from wiki.

Signed-off-by: David Sterba <dsterba@suse.com>

											
										
										
											2023-03-17 21:35:30 +00:00
 								Item data is variably size, and various filesystem data structures are
 								defined as different types of item data. The type field in struct
 								btrfs_disk_key indicates the type of data stored in the item.
 								The block header contains a checksum for the block contents, the uuid of
 								the filesystem that owns the block, the level of the block in the tree,
 								and the block number where this block is supposed to live. These fields
 								allow the contents of the metadata to be verified when the data is read.
 								Everything that points to a btree block also stores the generation field
 								it expects that block to have. This allows Btrfs to detect phantom or
 								misplaced writes on the media.
 								The checksum of the lower node is not stored in the node pointer to
 								simplify the FS writeback code. The generation number will be known at
 								the time the block is inserted into the btree, but the checksum is only
 								calculated before writing the block to disk. Using the generation will
 								allow Btrfs to detect phantom writes without having to find and update
 								the upper node each time the lower node checksum is updated.
 								The generation field corresponds to the transaction id that allocated
 								the block, which enables easy incremental backups and is used by the
 								copy on write transaction subsystem.
 								Filesystem Data Structures
 								--------------------------
 								Each object in the filesystem has an objectid, which is allocated
 								dynamically on creation. A free objectid is simply a hole in the key
 								space of the filesystem btree; objectids that don't already exist in the
 								tree. The objectid makes up the most significant bits of the key,
 								allowing all of the items for a given filesystem object to be logically
 								grouped together in the btree.
 								The offset field of the key stores indicates the byte offset for a
 								particular item in the object. For file extents, this would be the byte
 								offset of the start of the extent in the file. The type field stores the
 								item type information, and has extra room for expanded use.
 								Inodes
 								------
 								Inodes are stored in struct btrfs_inode_item at offset zero in the key,
 								and have a type value of one. Inode items are always the lowest valued
 								key for a given object, and they store the traditional stat data for
 								files and directories. The inode structure is relatively small, and will
 								not contain embedded file data or extended attribute data. These things
 								are stored in other item types.
 								Files
 								-----
 								Small files that occupy less than one leaf block may be packed into the
 								btree inside the extent item. In this case the key offset is the byte
 								offset of the data in the file, and the size field of struct btrfs_item
 								indicates how much data is stored. There may be more than one of these
 								per file.
 								Larger files are stored in extents. struct btrfs_file_extent_item
 								records a generation number for the extent and a [ disk block, disk num
 								blocks ] pair to record the area of disk corresponding to the file.
 								Extents also store the logical offset and the number of blocks used by
 								this extent record into the extent on disk. This allows Btrfs to satisfy
 								a rewrite into the middle of an extent without having to read the old
 								file data first. For example, writing 1MB into the middle of a existing
 MB extent may result in three extent records:
 								``[ old extent: bytes 0-64MB ], [ new extent 1MB ], [ old extent: bytes 65MB – 128MB]``
 								File data checksums are stored in a dedicated btree in a struct
 								btrfs_csum_item. The offset of the key corresponds to the byte number of
 								the extent. The data is checksummed after any compression or encryption
 								is done and it reflects the bytes sent to the disk.
 								A single item may store a number of checksums. struct btrfs_csum_items
 								are only used for file extents. File data inline in the btree is covered
 								by the checksum at the start of the btree block.
 								Directories
 								-----------
 								Directories are indexed in two different ways. For filename lookup,
 								there is an index comprised of keys:
 								================== ================== ====================
 								Directory Objectid BTRFS_DIR_ITEM_KEY 64 bit filename hash
 								================== ================== ====================
 								The default directory hash used is crc32c, although other hashes may be
 								added later on. A flags field in the super block will indicate which
 								hash is used for a given FS.
 								The second directory index is used by readdir to return data in inode
 								number order. This more closely resembles the order of blocks on disk
 								and generally provides better performance for reading data in bulk
 								(backups, copies, etc). Also, it allows fast checking that a given inode
 								is linked into a directory when verifying inode link counts. This index
 								uses an additional set of keys:
 								================== =================== =====================
 								Directory Objectid BTRFS_DIR_INDEX_KEY Inode Sequence number
 								================== =================== =====================
 								The inode sequence number comes from the directory. It is increased each
 								time a new file or directory is added.
 								Reference Counted Extents
 								-------------------------
 								Reference counting is the basis for the snapshotting subsystems. For
 								every extent allocated to a btree or a file, Btrfs records the number of
 								references in a struct btrfs_extent_item. The trees that hold these
 								items also serve as the allocation map for blocks that are in use on the
 								filesystem. Some trees are not reference counted and are only protected
 								by a copy on write logging. However, the same type of extent items are
 								used for all allocated blocks on the disk.
 								A reasonably comprehensive description of the way that references work
 								can be found in `this email from Josef
 								Bacik <http://www.spinics.net/lists/linux-btrfs/msg33415.html>`__.
 								Extent Block Groups
 								-------------------
 								Extent block groups allow allocator optimizations by breaking the disk
 								up into chunks of 256MB or more. For each chunk, they record information
 								about the number of blocks available. Files and directories will have a
 								preferred block group which they try first for allocations.
Block groups have a flag that indicates whether they are preferred for data or
 								metadata allocations, and at mkfs time the disk is broken up into
 								alternating metadata (33% of the disk) and data groups (66% of the
 								disk). As the disk fills, a group's preference may change back and
 								forth, but Btrfs always tries to avoid intermixing data and metadata
 								extents in the same group. This substantially improves fsck throughput,
 								and reduces seeks during writeback while the FS is mounted. It does
 								slightly increase the seeks while reading.
 								Extent Trees and DM integration
 								-------------------------------
 								The Btrfs extent trees are intended to divide up the available storage
 								into a number of flexible allocation policies. Each extent tree owns a
 								section of the underlying disk, and they can be assigned to a collection
 								of (or a single) tree roots, directories or inodes. Policies will direct
 								how a given allocation is spread across the extent trees available,
 								allowing the admin to direct which parts of the filesystem are striped,
 								mirrored or confined to a given device.
 								Btrfs will try to tie in with DM in order to easily manage large pools
 								of storage. The basic idea is to have at least one extent tree per
 								spindle, and then allow the admin to assign those extent trees to
 								subvolumes, directories or files.
 								Explicit Back References
 								------------------------
 								Back references have three main goals:
 								-  Differentiate between all holders of references to an extent so that
 								   when a reference is dropped we can make sure it was a valid reference
 								   before freeing the extent.
 								-  Provide enough information to quickly find the holders of an extent
 								   if we notice a given block is corrupted or bad.
 								-  Make it easy to migrate blocks for FS shrinking or storage pool
 								   maintenance. This is actually the same as #2, but with a slightly
 								   different use case.
 								File Extent Backrefs
 								^^^^^^^^^^^^^^^^^^^^
 								File extents can be referenced by:
 								-  Multiple snapshots, subvolumes, or different generations in one
 								   subvol
 								-  Different files inside a single subvolume
 								-  Different offsets inside a file
 								.. note::
 								   The remainder of this section refers to the extent_ref_v0 structure, which is not used on current btrfs filesystems.
 								The extent ref structure has fields for:
-  Objectid of the subvolume root
-  Generation number of the tree holding the reference
-  Objectid of the file holding the reference
-  Offset in the file corresponding to the key holding the reference
 								When a file extent is allocated the fields are filled in:
 								   (root objectid, transaction id, inode objectid, offset in file)
 								When a leaf is cow'd new references are added for every file extent
 								found in the leaf. It looks the same as the create case, but the
 								transaction id will be different when the block is cow'd.
 								   (root objectid, transaction id, inode objectid, offset in file)
 								When a file extent is removed either during snapshot deletion or file
 								truncation, the corresponding back reference is found by searching for:
 								   (btrfs_header_owner(leaf), btrfs_header_generation(leaf), inode
 								   objectid, offset in file)
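
The three cases above all build the same four-field tuple, and only the source of the first two fields differs. A minimal sketch, assuming hypothetical in-memory structures (``sketch_file_backref`` and ``sketch_leaf`` are not btrfs types), shows why removal must take the root and generation from the leaf header:

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* Hypothetical mirror of the four back reference fields. */
   struct sketch_file_backref {
       uint64_t root_objectid;
       uint64_t generation;      /* transaction id */
       uint64_t inode_objectid;
       uint64_t file_offset;
   };

   /* Hypothetical subset of a leaf's block header. */
   struct sketch_leaf {
       uint64_t owner;           /* root objectid that owns the leaf */
       uint64_t generation;      /* transaction that last wrote the leaf */
   };

   /* Build the tuple used both when adding and when searching for a backref. */
   static struct sketch_file_backref
   sketch_backref_for_leaf(const struct sketch_leaf *leaf,
                           uint64_t inode_objectid, uint64_t file_offset)
   {
       struct sketch_file_backref ref = {
           .root_objectid = leaf->owner,
           .generation = leaf->generation,
           .inode_objectid = inode_objectid,
           .file_offset = file_offset,
       };
       return ref;
   }

   static bool sketch_backref_equal(const struct sketch_file_backref *a,
                                    const struct sketch_file_backref *b)
   {
       return a->root_objectid == b->root_objectid &&
              a->generation == b->generation &&
              a->inode_objectid == b->inode_objectid &&
              a->file_offset == b->file_offset;
   }

If a leaf written in transaction 10 is cow'd in transaction 12, the extent ends up with two back references that differ only in the generation field. Removing the extent from the newer leaf searches with generation 12 and therefore matches only the reference that leaf added.
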
 								Btree Extent Backrefs
 								^^^^^^^^^^^^^^^^^^^^^
 								Btree extents can be referenced by:
 								-  Different subvolumes
 								-  Different generations of the same subvolume
 								Storing sufficient information for a full reverse mapping of a btree
 								block would require storing the lowest key of the block in the backref,
 								and it would require updating that lowest key either before write out or
 								every time it changed.
 								Instead, the objectid of the lowest key is stored along with the level
 								of the tree block. This provides a hint about where in the btree the
 								block can be found. Searches through the btree only need to look for a
 								pointer to that block, and they stop one level higher than the level
 								recorded in the backref.
 								Some btrees do not do reference counting on their extents. These include
 								the extent tree and the tree of tree roots. Backrefs for these trees
 								always have a generation of zero.
 								When a tree block is created, back references are inserted:
 								   (root objectid, transaction id or zero, level, lowest objectid)
 								The level is stored in the objectid slot of the backref to differentiate
 								between Btree back references and file data back references. The highest
 								possible level is 255, and the lowest possible file objectid has been
 								raised to 256. So, if the objectid field in the back reference is less
 								than 256, it corresponds to a Btree block.
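
The test described above is a single comparison. The following sketch uses a hypothetical ``sketch_backref`` structure and constant name to illustrate it; only the threshold of 256 comes from the text.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* Lowest objectid a file may use, per the text above. */
   #define SKETCH_FIRST_FILE_OBJECTID 256ULL

   /* Hypothetical mirror of the four back reference slots. */
   struct sketch_backref {
       uint64_t root_objectid;
       uint64_t generation;   /* transaction id, or zero */
       uint64_t objectid;     /* tree level, or file objectid */
       uint64_t offset;       /* lowest objectid hint, or file offset */
   };

   /* Build a back reference for a tree block: the level goes in the objectid slot. */
   static struct sketch_backref sketch_tree_backref(uint64_t root_objectid,
                                                    uint64_t transid,
                                                    uint8_t level,
                                                    uint64_t lowest_objectid)
   {
       struct sketch_backref ref = {
           .root_objectid = root_objectid,
           .generation = transid,
           .objectid = level,
           .offset = lowest_objectid,
       };
       return ref;
   }

   /* A level always fits below 256, so small values mean "tree block". */
   static bool sketch_backref_is_tree_block(const struct sketch_backref *ref)
   {
       return ref->objectid < SKETCH_FIRST_FILE_OBJECTID;
   }

Since the level is a ``u8``, every value it can take is below the threshold, so the two kinds of back reference never collide.
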
 								When a tree block is cow'd in a reference counted root, new back
 								references are added for all the blocks it points to:
 								   (root objectid, transaction id, level, lowest objectid)
 								Because the lowest_key_objectid and the level are just hints they are
 								not used when backrefs are deleted. When a snapshot is created a new
 								reference is taken directly on the root block. This means the owner
 								field of the root block may be different from the objectid of the
 								snapshot. So, when dropping references on tree roots, the objectid of
 								the root structure is always used. When a backref is deleted:

.. code-block:: none

 								   if backref was for a tree root:
 								        root_objectid = root->root_key.objectid
 								   else
 								        root_objectid = btrfs_header_owner(parent)
 								(root_objectid, btrfs_header_generation(parent) or zero, 0, 0)
 								Back Reference Key Construction
 								^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 								Back references have four fields, each 64 bits long. This is hashed into
 								a single 64 bit number and placed into the key offset. The key objectid
 								corresponds to the first byte in the extent, and the key type is set to
 								BTRFS_EXTENT_REF_KEY.
 								Hash overflows on the offset field are handled by adding one to the
 								calculated hash and searching forward. The searching stops when the
 								correct back reference structure is found or
 								Snapshots and Subvolumes
 								------------------------
 								Subvolumes are basically a named btree that holds files and directories.
 								They have inodes inside the tree of tree roots and can have non-root
 								owners and groups. Subvolumes can be given a quota of blocks, and once
 								this quota is reached no new writes are allowed. All of the blocks and
 								file extents inside of subvolumes are reference counted to allow
 								snapshotting. Up to 2\ :sup:`64` subvolumes may be created on the FS.
 								Snapshots are identical to subvolumes, but their root block is initially
 								shared with another subvolume. When the snapshot is taken, the reference
 								count on the root block is increased, and the copy on write transaction
 								system ensures changes made in either the snapshot or the source
 								subvolume are private to that root. Snapshots are writable, and they can
 								be snapshotted again any number of times. If read only snapshots are
 								desired, their block quota is set to one at creation time.
 								Btree Roots
 								-----------
 								Each Btrfs filesystem consists of a number of tree roots. A freshly
 								formatted filesystem will have roots for:
 								-  The tree of tree roots
 								-  The tree of allocated extents
 								-  The default subvolume tree
 								The tree of tree roots records the root block for the extent tree and
 								the root blocks and names for each subvolume and snapshot tree. As
 								transactions commit, the root block pointers are updated in this tree to
 								reference the new roots created by the transaction, and then the new
 								root block of this tree is recorded in the FS super block.
 								The tree of tree roots acts as a directory of all the other trees on the
 								filesystem, and it has directory items recording the names of all
 								snapshots and subvolumes in the FS. Each snapshot or subvolume has an
 								objectid in the tree of tree roots, and at least one corresponding
 								struct btrfs_root_item. Directory items in the tree map names of
 								snapshots and subvolumes to these root items. Because the root item key
 								is updated with every transaction commit, the directory items reference
 								a generation number of (u64)-1, which tells the lookup code to find the
 								most recent root available.
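
Searching with a generation of (u64)-1 works because keys sort by objectid, then type, then offset, so the largest possible offset lands just past the newest root item of that objectid. The sketch below models the tree as a sorted array. The ``sketch_`` names are hypothetical and the key type value is assumed for illustration; the real lookup walks the btree instead.

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>

   #define SKETCH_ROOT_ITEM_KEY 132   /* assumed value, illustrative only */

   struct sketch_key {
       uint64_t objectid;
       uint8_t type;
       uint64_t offset;   /* transaction id for root items */
   };

   /* Order keys by objectid, then type, then offset. */
   static int sketch_key_cmp(const struct sketch_key *a,
                             const struct sketch_key *b)
   {
       if (a->objectid != b->objectid)
           return a->objectid < b->objectid ? -1 : 1;
       if (a->type != b->type)
           return a->type < b->type ? -1 : 1;
       if (a->offset != b->offset)
           return a->offset < b->offset ? -1 : 1;
       return 0;
   }

   /*
    * Return the index of the newest root item for 'objectid' in the sorted
    * array 'keys', or -1 if there is none.
    */
   static long sketch_find_latest_root(const struct sketch_key *keys, size_t nr,
                                       uint64_t objectid)
   {
       struct sketch_key want = { objectid, SKETCH_ROOT_ITEM_KEY, UINT64_MAX };
       size_t lo = 0, hi = nr;

       /* Binary search for the first key that sorts after 'want'. */
       while (lo < hi) {
           size_t mid = lo + (hi - lo) / 2;

           if (sketch_key_cmp(&keys[mid], &want) <= 0)
               lo = mid + 1;
           else
               hi = mid;
       }
       /* The key just before that position is the candidate. */
       if (lo == 0)
           return -1;
       if (keys[lo - 1].objectid != objectid ||
           keys[lo - 1].type != SKETCH_ROOT_ITEM_KEY)
           return -1;
       return (long)(lo - 1);
   }

The directory item never needs rewriting when a transaction commits, because the search key stays the same while the newest matching root item changes underneath it.
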
 								The extent trees are used to manage allocated space on the devices. The
 								space available can be divided between a number of extent trees to
 								reduce lock contention and give different allocation policies to
 								different block ranges.
 								The diagram below depicts a collection of tree roots. The super block
 								points to the root tree, and the root tree points to the extent trees
 								and subvolumes. The root tree also has a directory to map subvolume
 								names to struct btrfs_root_items in the root tree. This filesystem has
 								one subvolume named 'default' (created by mkfs), and one snapshot of
 								'default' named 'snap' (created by the admin some time later). In this
 								example, 'default' has not changed since the snapshot was created and so
both trees point to the same root block on disk.

.. image:: Copy-Design-r.png
   :alt: Copy-Design-r.png

 								Copy on Write Logging
 								---------------------
 								Data and metadata in Btrfs are protected with copy on write logging
 								(COW). Once the transaction that allocated the space on disk has
 								committed, any new writes to that logical address in the file or btree
 								will go to a newly allocated block, and block pointers in the btrees and
 								super blocks will be updated to reflect the new location.
 								Some of the btrfs trees do not use reference counting for their
 								allocated space. This includes the root tree, and the extent trees. As
 								blocks are replaced in these trees, the old block is freed in the extent
 								tree. These blocks are not reused for other purposes until the
 								transaction that freed them commits.
 								All subvolume (and snapshot) trees are reference counted. When a COW
 								operation is performed on a btree node, the reference count of all the
 								blocks it points to is increased by one. For leaves, the reference
 								counts of any file extents in the leaf are increased by one. When the
 								transaction commits, a new root pointer is inserted in the root tree for
 								each new subvolume root. The key used has the form:
 								====================== =================== ==============
 								Subvolume inode number BTRFS_ROOT_ITEM_KEY Transaction ID
 								====================== =================== ==============
 								The updated btree blocks are all flushed to disk, and then the super
 								block is updated to point to the new root tree. Once the super block has
 								been properly written to disk, the transaction is considered complete.
 								At this time the root tree has two pointers for each subvolume changed
 								during the transaction. One item points to the new tree and one points
 								to the tree that existed at the start of the last transaction.
 								Any time after the commit finishes, the older subvolume root items may
 								be removed. The reference count on the subvolume root block is lowered
 								by one. If the reference count reaches zero, the block is freed and the
 								reference count on any nodes the root points to is lowered by one. If a
 								tree node or leaf can be freed, it is traversed to free the nodes or
 								extents below it in the tree in a depth first fashion.
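
The reference counting rules above can be modelled in memory with a few small functions. This is a simplified sketch under stated assumptions: ``sketch_block`` and the helper names are hypothetical, blocks live on the heap instead of on disk, and file extents, transactions and the extent tree are left out.

.. code-block:: c

   #include <stdlib.h>

   #define SKETCH_FANOUT 4

   struct sketch_block {
       int refs;                                  /* reference count */
       int nr;                                    /* child pointers in use */
       struct sketch_block *child[SKETCH_FANOUT]; /* block pointers */
   };

   /* Blocks currently allocated, tracked only for the usage example. */
   static int sketch_live_blocks;

   static struct sketch_block *sketch_alloc(void)
   {
       struct sketch_block *b = calloc(1, sizeof(*b));

       if (!b)
           abort();
       b->refs = 1;
       sketch_live_blocks++;
       return b;
   }

   /* Snapshot: take one extra reference directly on the root block. */
   static struct sketch_block *sketch_snapshot(struct sketch_block *root)
   {
       root->refs++;
       return root;
   }

   /*
    * COW a node: the copy points at the same children as the original,
    * so every child gains one reference.
    */
   static struct sketch_block *sketch_cow(const struct sketch_block *old)
   {
       struct sketch_block *copy = sketch_alloc();

       copy->nr = old->nr;
       for (int i = 0; i < old->nr; i++) {
           copy->child[i] = old->child[i];
           copy->child[i]->refs++;
       }
       return copy;
   }

   /*
    * Drop one reference. When the count reaches zero, drop the references
    * this block held on its children (depth first), then free it.
    */
   static void sketch_drop(struct sketch_block *b)
   {
       if (--b->refs > 0)
           return;
       for (int i = 0; i < b->nr; i++)
           sketch_drop(b->child[i]);
       free(b);
       sketch_live_blocks--;
   }

A typical sequence is: snapshot a root (its count becomes two), COW the root for the writer and drop the writer's reference on the old root, then later drop the snapshot. The old root is freed at that point, while the children survive because the new root still holds a reference on each of them.
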
 								The traversal and freeing of the tree may be done in pieces by inserting
 								a progress record in the root tree. The progress record indicates the
 								last key and level touched by the traversal so the current transaction
 								can commit and the traversal can resume in the next transaction. If the
 								system crashes before the traversal completes, the progress record is
 								used to safely delete the root on the next mount.
 								Ohad Rodeh presented this reference counted snapshot algorithm at the
 Linux Filesystem and Storage Workshop:
 								Slides: `LinuxFS_Workshop.pdf <Media:LinuxFS_Workshop.pdf>`__
 								Paper: `Btree_TOS.pdf <Media:Btree_TOS.pdf>`__
 								The Btrfs snapshotting implementation is based on the ideas he
 								presented.
Btrfsck
-------
 								The filesystem checking utility is a crucial tool, but it can be a major
 								bottleneck in getting systems back online after something has gone
 								wrong. Btrfs aims to be tolerant of invalid metadata, and will avoid
 								using metadata it determines to be incorrect. The disk format allows
 								Btrfs to deal with most corruptions at run time, without crashing the
 								system and without requiring offline filesystem checking.
 								An offline btrfsck is being developed, in part to help verify the
 								filesystem during testing, and as an emergency tool to make sure the
 								filesystem is safe for mounting. The existing tool only verifies the
 								extent allocation maps, making sure that reference counts are correct
 								and that all extents are accounted for. If the extent maps are correct,
 								there is no risk of incorrectly writing over existing data or metadata
 								as blocks are allocated for new use.
 								btrfsck is able to read metadata in roughly disk order. As it scans the
 								btrees on disk, it collects the locations of nodes and leaves and pulls
 								them from the disk in large sequential batches. For the most part,
 								btrfsck is bound by the sequential read throughput of the storage, and
 								it is able to take advantage of multi-spindle arrays. The price paid for
the extra speed is more RAM. Btrfsck uses about 3x more RAM than
ext2fsck.
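
One way to picture the disk-order batching is to sort the collected block addresses and merge back-to-back blocks into larger reads. The sketch below illustrates that idea only and is not the btrfsck implementation. The ``sketch_`` names are hypothetical, and it assumes all addresses are aligned to a fixed block size.

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>
   #include <stdlib.h>

   /* One sequential read: 'len' bytes starting at byte address 'start'. */
   struct sketch_batch {
       uint64_t start;
       uint64_t len;
   };

   static int sketch_cmp_u64(const void *a, const void *b)
   {
       uint64_t x = *(const uint64_t *)a;
       uint64_t y = *(const uint64_t *)b;

       return (x > y) - (x < y);
   }

   /*
    * Sort the block addresses in place, then merge adjacent blocks into
    * batches. 'out' must have room for 'nr' entries. Returns the number
    * of batches written.
    */
   static size_t sketch_plan_reads(uint64_t *bytenrs, size_t nr,
                                   uint64_t blocksize,
                                   struct sketch_batch *out)
   {
       size_t nbatch = 0;

       qsort(bytenrs, nr, sizeof(*bytenrs), sketch_cmp_u64);
       for (size_t i = 0; i < nr; i++) {
           if (nbatch) {
               struct sketch_batch *last = &out[nbatch - 1];
               uint64_t end = last->start + last->len;

               if (bytenrs[i] < end)
                   continue;            /* duplicate address, already covered */
               if (bytenrs[i] == end) {
                   last->len += blocksize;  /* extends the current batch */
                   continue;
               }
           }
           out[nbatch].start = bytenrs[i];
           out[nbatch].len = blocksize;
           nbatch++;
       }
       return nbatch;
   }

The array of pending addresses is where the extra memory goes: the more locations are collected before reading, the longer the sequential runs can become.
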