Btrfs design
============

Btrfs is implemented with simple and well known constructs. It should
perform well, but the long term goal of maintaining performance as the
filesystem ages and grows is more important than winning a short lived
benchmark. To that end, benchmarks are being used to try to simulate
performance over the life of a filesystem.

Btree Data structures
---------------------

The Btrfs btree provides a generic facility to store a variety of data
types. Internally it only knows about three data structures: keys,
items, and a block header:

.. code-block::

   struct btrfs_header {
           u8 csum[32];            /* checksum of the block contents */
           u8 fsid[16];            /* uuid of the filesystem that owns the block */
           __le64 bytenr;          /* block number where this block is supposed to live */
           __le64 flags;

           u8 chunk_tree_uid[16];
           __le64 generation;      /* transaction id that allocated the block */
           __le64 owner;
           __le32 nritems;
           u8 level;               /* level of the block in the tree */
   };

.. code-block::

   struct btrfs_disk_key {
           __le64 objectid;
           u8 type;                /* type of data stored in the item */
           __le64 offset;
   };

.. code-block::

   struct btrfs_item {
           struct btrfs_disk_key key;
           __le32 offset;          /* where in the leaf the item data is found */
           __le32 size;            /* size of the item data */
   };

Upper nodes of the trees contain only [ key, block pointer ] pairs. Tree
leaves are broken up into two sections that grow toward each other.
Leaves have an array of fixed sized items, and an area where item data
is stored. The offset and size fields in the item indicate where in the
leaf the item data can be found. Example:

.. figure:: Leaf-structure.png
   :alt: Leaf-structure.png

   Leaf-structure.png

Item data is variably sized, and various filesystem data structures are
defined as different types of item data. The type field in struct
btrfs_disk_key indicates the type of data stored in the item.

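The two-section leaf layout can be sketched in a few lines of C. This is
only an illustration under stated assumptions: ``LEAF_SIZE``,
``HEADER_SIZE``, ``MAX_ITEMS``, ``struct sketch_item``, ``struct
sketch_leaf`` and the helper functions are hypothetical names, and the
in-memory layout is simplified compared to the on-disk format.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sizes chosen for the example, not Btrfs constants. */
#define LEAF_SIZE   4096u
#define HEADER_SIZE 128u
#define MAX_ITEMS   64u

/* Simplified item: only the two fields that locate the item data. */
struct sketch_item {
        uint32_t offset; /* start of the item data inside the leaf */
        uint32_t size;   /* length of the item data in bytes */
};

struct sketch_leaf {
        uint32_t nritems;
        struct sketch_item items[MAX_ITEMS]; /* grows forward, after the header */
        uint8_t data[LEAF_SIZE];             /* item data grows backward from the end */
};

/* Lowest byte used by item data; the whole leaf is free when it is empty. */
static uint32_t leaf_data_start(const struct sketch_leaf *leaf)
{
        return leaf->nritems ? leaf->items[leaf->nritems - 1].offset : LEAF_SIZE;
}

/* Bytes left between the end of the item array and the start of the data. */
static uint32_t leaf_free_space(const struct sketch_leaf *leaf)
{
        uint32_t items_end = HEADER_SIZE +
                leaf->nritems * (uint32_t)sizeof(struct sketch_item);

        return leaf_data_start(leaf) - items_end;
}

/* Append one item; returns 0 on success, -1 if the two sections would meet. */
static int leaf_append(struct sketch_leaf *leaf, const void *buf, uint32_t size)
{
        uint32_t off;

        if (leaf->nritems == MAX_ITEMS ||
            leaf_free_space(leaf) < size + sizeof(struct sketch_item))
                return -1;

        off = leaf_data_start(leaf) - size;
        memcpy(leaf->data + off, buf, size);
        leaf->items[leaf->nritems].offset = off;
        leaf->items[leaf->nritems].size = size;
        leaf->nritems++;
        return 0;
}

/* The offset field is all that is needed to find the data of a slot. */
static const void *leaf_item_data(const struct sketch_leaf *leaf, uint32_t slot)
{
        return leaf->data + leaf->items[slot].offset;
}
```

Each appended item consumes space from both ends: one fixed sized slot
in the item array at the front and ``size`` bytes of data at the back.
The leaf is full when the two sections would overlap.
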
The block header contains a checksum for the block contents, the uuid of
the filesystem that owns the block, the level of the block in the tree,
and the block number where this block is supposed to live. These fields
allow the contents of the metadata to be verified when the data is read.
Everything that points to a btree block also stores the generation field
it expects that block to have. This allows Btrfs to detect phantom or
misplaced writes on the media.

The checksum of the lower node is not stored in the node pointer to
simplify the FS writeback code. The generation number will be known at
the time the block is inserted into the btree, but the checksum is only
calculated before writing the block to disk. Using the generation will
allow Btrfs to detect phantom writes without having to find and update
the upper node each time the lower node checksum is updated.

The generation field corresponds to the transaction id that allocated
the block, which enables easy incremental backups and is used by the
copy on write transaction subsystem.

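The read-time checks described above can be sketched as follows. This is
a minimal illustration, not the kernel code: ``struct sketch_header``,
``struct sketch_block_ptr``, ``enum verify_result`` and
``verify_child()`` are hypothetical names, values are assumed to be
already converted to host byte order, and checksum verification is left
out.

```c
#include <stdint.h>

/* The header fields used by this check, in host byte order. */
struct sketch_header {
        uint64_t bytenr;     /* where the block believes it lives */
        uint64_t generation; /* transaction id that allocated the block */
};

/* What a parent records about the block it points to (assumed layout). */
struct sketch_block_ptr {
        uint64_t bytenr;     /* location the parent read the block from */
        uint64_t generation; /* generation the parent expects to find */
};

enum verify_result { BLOCK_OK, BLOCK_MISPLACED, BLOCK_STALE };

/* Compare what was read from disk against what the parent expected. */
static enum verify_result verify_child(const struct sketch_block_ptr *ptr,
                                       const struct sketch_header *hdr)
{
        /* A block that names another location was written to the wrong place. */
        if (hdr->bytenr != ptr->bytenr)
                return BLOCK_MISPLACED;

        /* An old generation means the newer write never reached the media. */
        if (hdr->generation != ptr->generation)
                return BLOCK_STALE;

        return BLOCK_OK;
}
```

Because the parent stores only the generation, rewriting the child
before commit changes nothing in the parent, while a write that was
dropped by the media still shows up as a generation mismatch.
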
Filesystem Data Structures
--------------------------

Each object in the filesystem has an objectid, which is allocated
dynamically on creation. A free objectid is simply a hole in the key
space of the filesystem btree, that is, an objectid that does not
already exist in the tree. The objectid makes up the most significant
bits of the key, allowing all of the items for a given filesystem object
to be logically grouped together in the btree.

The offset field of the key indicates the byte offset for a particular
item in the object. For file extents, this would be the byte offset of
the start of the extent in the file. The type field stores the item type
information, and has extra room for expanded use.

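The grouping effect of this key layout follows from the comparison
order. The sketch below assumes keys are compared by objectid first,
then type, then offset, which matches the field order of struct
btrfs_disk_key. ``struct sketch_key`` and ``sketch_key_cmp()`` are
hypothetical names that use host byte order integers instead of the
little-endian on-disk types.

```c
#include <stdint.h>

/* In-memory stand-in for struct btrfs_disk_key. */
struct sketch_key {
        uint64_t objectid;
        uint8_t type;
        uint64_t offset;
};

/*
 * Three-way comparison: the objectid is the most significant part,
 * so all items of one object sort next to each other.
 * Returns -1, 0 or 1.
 */
static int sketch_key_cmp(const struct sketch_key *a,
                          const struct sketch_key *b)
{
        if (a->objectid != b->objectid)
                return a->objectid < b->objectid ? -1 : 1;
        if (a->type != b->type)
                return a->type < b->type ? -1 : 1;
        if (a->offset != b->offset)
                return a->offset < b->offset ? -1 : 1;
        return 0;
}
```

With this ordering, an item of type one at offset zero is always the
lowest key of its object, and an item of any type belonging to a higher
objectid sorts after every item of a lower objectid.
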
Inodes
------

Inodes are stored in struct btrfs_inode_item at offset zero in the key,
and have a type value of one. Inode items are always the lowest valued
key for a given object, and they store the traditional stat data for
files and directories. The inode structure is relatively small, and will
not contain embedded file data or extended attribute data. These things
are stored in other item types.

Files
-----

Small files that occupy less than one leaf block may be packed into the
btree inside the extent item. In this case the key offset is the byte
offset of the data in the file, and the size field of struct btrfs_item
indicates how much data is stored. There may be more than one of these
per file.

Larger files are stored in extents. struct btrfs_file_extent_item
records a generation number for the extent and a [ disk block, disk num
blocks ] pair to record the area of disk corresponding to the file.
Each extent record also stores its logical offset into the extent on
disk and the number of blocks it uses. This allows Btrfs to satisfy a
rewrite into the middle of an extent without having to read the old file
data first. For example, writing 1MB into the middle of an existing
128MB extent may result in three extent records:

``[ old extent: bytes 0-64MB ], [ new extent: 1MB ], [ old extent: bytes 65MB-128MB ]``

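The three-record result can be reproduced with a small sketch. It
assumes a simplified record that works in bytes rather than blocks.
``struct sketch_extent_ref`` and ``split_extent()`` are hypothetical
names and not part of the Btrfs sources.

```c
#include <stdint.h>

/* Simplified file extent record; all values are in bytes. */
struct sketch_extent_ref {
        uint64_t file_offset;   /* key offset: where the data sits in the file */
        uint64_t disk_start;    /* start of the extent on disk */
        uint64_t extent_offset; /* offset of this record into the disk extent */
        uint64_t len;           /* bytes of the disk extent used by this record */
};

/*
 * Describe an overwrite of [start, start + len) that lies inside `old`
 * and whose new data was written at new_disk_start. The old disk extent
 * is never read or modified; only the records pointing into it change.
 * Returns the number of records stored in out (1 to 3), or -1 if the
 * range is empty or not contained in `old`.
 */
static int split_extent(const struct sketch_extent_ref *old,
                        uint64_t start, uint64_t len,
                        uint64_t new_disk_start,
                        struct sketch_extent_ref out[3])
{
        uint64_t old_end = old->file_offset + old->len;
        uint64_t end = start + len;
        int n = 0;

        if (len == 0 || start < old->file_offset || end > old_end)
                return -1;

        if (start > old->file_offset) {
                /* Head: same disk extent, shorter length. */
                out[n] = *old;
                out[n].len = start - old->file_offset;
                n++;
        }

        /* Middle: the newly written extent. */
        out[n].file_offset = start;
        out[n].disk_start = new_disk_start;
        out[n].extent_offset = 0;
        out[n].len = len;
        n++;

        if (end < old_end) {
                /* Tail: same disk extent, entered further in. */
                out[n] = *old;
                out[n].file_offset = end;
                out[n].extent_offset = old->extent_offset +
                        (end - old->file_offset);
                out[n].len = old_end - end;
                n++;
        }
        return n;
}
```

For the example above, the head and tail records both keep pointing at
the original 128MB disk extent. The tail simply starts 65MB into it.
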
File data checksums are stored in a dedicated btree in a struct
btrfs_csum_item. The offset of the key corresponds to the byte number of
the extent. The data is checksummed after any compression or encryption
is done, so the checksum reflects the bytes sent to the disk.

A single item may store a number of checksums. struct btrfs_csum_items
are only used for file extents. File data inline in the btree is covered
by the checksum at the start of the btree block.

Directories
-----------

Directories are indexed in two different ways. For filename lookup,
there is an index composed of keys of the form:

================== ================== ====================
Directory Objectid BTRFS_DIR_ITEM_KEY 64 bit filename hash
================== ================== ====================

The default directory hash used is crc32c, although other hashes may be
added later on. A flags field in the super block will indicate which
hash is used for a given FS.

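As an illustration of how a filename could be turned into the key
offset, the sketch below computes a standard CRC-32C (Castagnoli) value
bit by bit. The seed and final inversion shown are those of the standard
CRC-32C definition and are an assumption here. The values Btrfs uses
for its name hash may differ, and ``dir_item_key_offset()`` is a
hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C update using the reflected polynomial 0x82F63B78. */
static uint32_t crc32c_update(uint32_t crc, const void *buf, size_t len)
{
        const uint8_t *p = buf;

        while (len--) {
                crc ^= *p++;
                for (int bit = 0; bit < 8; bit++)
                        crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
        return crc;
}

/* Standard CRC-32C: initial value of all ones and a final inversion. */
static uint32_t crc32c(const void *buf, size_t len)
{
        return ~crc32c_update(0xFFFFFFFFu, buf, len);
}

/* Hypothetical helper: the 32 bit hash zero-extended into the 64 bit offset. */
static uint64_t dir_item_key_offset(const char *name, size_t len)
{
        return crc32c(name, len);
}
```

Since different names can hash to the same value, a lookup by key only
narrows the search. The name stored in the directory item still has to
be compared with the one being looked up.
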
The second directory index is used by readdir to return data in inode
number order. This more closely resembles the order of blocks on disk
and generally provides better performance for reading data in bulk
(backups, copies, etc). Also, it allows fast checking that a given inode
is linked into a directory when verifying inode link counts. This index
uses an additional set of keys:

================== =================== =====================
Directory Objectid BTRFS_DIR_INDEX_KEY Inode Sequence number
================== =================== =====================

The inode sequence number comes from the directory. It is increased each
time a new file or directory is added.

Reference Counted Extents
-------------------------

Reference counting is the basis for the snapshotting subsystems. For
every extent allocated to a btree or a file, Btrfs records the number of
references in a struct btrfs_extent_item. The trees that hold these
items also serve as the allocation map for blocks that are in use on the
filesystem. Some trees are not reference counted and are only protected
by copy on write logging. However, the same type of extent items are
used for all allocated blocks on the disk.

A reasonably comprehensive description of the way that references work
can be found in `this email from Josef
Bacik <http://www.spinics.net/lists/linux-btrfs/msg33415.html>`__.

Extent Block Groups
-------------------

Extent block groups allow allocator optimizations by breaking the disk
up into chunks of 256MB or more. For each chunk, they record information
about the number of blocks available. Files and directories will have a
preferred block group which they try first for allocations.

Block groups have a flag that indicates whether they are preferred for
data or metadata allocations, and at mkfs time the disk is broken up
into alternating metadata (33% of the disk) and data groups (66% of the
disk). As the disk fills, a group's preference may change back and
forth, but Btrfs always tries to avoid intermixing data and metadata
extents in the same group. This substantially improves fsck throughput,
and reduces seeks during writeback while the FS is mounted. It does
slightly increase the seeks while reading.

Extent Trees and DM integration
-------------------------------

The Btrfs extent trees are intended to divide up the available storage
into a number of flexible allocation policies. Each extent tree owns a
section of the underlying disk, and can be assigned to one or more tree
roots, directories or inodes. Policies will direct how a given
allocation is spread across the extent trees available, allowing the
admin to direct which parts of the filesystem are striped, mirrored or
confined to a given device.

Btrfs will try to tie in with DM in order to easily manage large pools
of storage. The basic idea is to have at least one extent tree per
spindle, and then allow the admin to assign those extent trees to
subvolumes, directories or files.

Explicit Back References
------------------------

Back references have three main goals:

- Differentiate between all holders of references to an extent so that
  when a reference is dropped we can make sure it was a valid reference
  before freeing the extent.
- Provide enough information to quickly find the holders of an extent
  if we notice a given block is corrupted or bad.
- Make it easy to migrate blocks for FS shrinking or storage pool
  maintenance. This is actually the same as the second goal, but with a
  slightly different use case.

File Extent Backrefs
^^^^^^^^^^^^^^^^^^^^

File extents can be referenced by:

- Multiple snapshots, subvolumes, or different generations in one
  subvol
- Different files inside a single subvolume
- Different offsets inside a file

.. note::
   The remainder of this section refers to the extent_ref_v0 structure, which is not used on current btrfs filesystems.

The extent ref structure has fields for:

- Objectid of the subvolume root
- Generation number of the tree holding the reference
- Objectid of the file holding the reference
- Offset in the file corresponding to the key holding the reference

When a file extent is allocated, the fields are filled in:

(root objectid, transaction id, inode objectid, offset in file)

When a leaf is cow'd, new references are added for every file extent
found in the leaf. It looks the same as the create case, but the
transaction id will be different when the block is cow'd.

(root objectid, transaction id, inode objectid, offset in file)

When a file extent is removed, either during snapshot deletion or file
truncation, the corresponding back reference is found by searching for:

(btrfs_header_owner(leaf), btrfs_header_generation(leaf), inode
objectid, offset in file)

Btree Extent Backrefs
^^^^^^^^^^^^^^^^^^^^^

Btree extents can be referenced by:

- Different subvolumes
- Different generations of the same subvolume

Storing sufficient information for a full reverse mapping of a btree
block would require storing the lowest key of the block in the backref,
and it would require updating that lowest key either before write out or
every time it changed.

Instead, the objectid of the lowest key is stored along with the level
of the tree block. This provides a hint about where in the btree the
block can be found. Searches through the btree only need to look for a
pointer to that block, and they stop one level higher than the level
recorded in the backref.

Some btrees do not do reference counting on their extents. These include
the extent tree and the tree of tree roots. Backrefs for these trees
always have a generation of zero.

When a tree block is created, back references are inserted:

(root objectid, transaction id or zero, level, lowest objectid)

The level is stored in the objectid slot of the backref to differentiate
between Btree back references and file data back references. The highest
possible level is 255, and the lowest possible file objectid has been
raised to 256. So, if the objectid field in the back reference is less
than 256, it corresponds to a Btree block.

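The way one slot serves both kinds of back reference can be shown with
a short sketch. The four-field structure mirrors the tuples above.
``struct sketch_backref``, ``SKETCH_FIRST_FILE_OBJECTID`` and both
helpers are hypothetical names introduced for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Lowest objectid a file may have, as stated in the text above. */
#define SKETCH_FIRST_FILE_OBJECTID 256ULL

/* Four 64 bit fields, in the order used by the tuples in the text. */
struct sketch_backref {
        uint64_t root;       /* root objectid */
        uint64_t generation; /* transaction id, or zero */
        uint64_t objectid;   /* tree level, or inode objectid for file data */
        uint64_t offset;     /* lowest objectid hint, or offset in file */
};

/* Build the back reference inserted when a tree block is created. */
static struct sketch_backref make_tree_backref(uint64_t root,
                                               uint64_t generation,
                                               uint8_t level,
                                               uint64_t lowest_objectid)
{
        struct sketch_backref ref = { root, generation, level, lowest_objectid };

        return ref;
}

/* A level never exceeds 255, so any value below 256 marks a btree block. */
static bool backref_is_tree_block(const struct sketch_backref *ref)
{
        return ref->objectid < SKETCH_FIRST_FILE_OBJECTID;
}
```

Taking the level as a ``uint8_t`` makes the range argument explicit: no
tree back reference can ever be mistaken for a file data back reference.
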
When a tree block is cow'd in a reference counted root, new back
references are added for all the blocks it points to:

(root objectid, transaction id, level, lowest objectid)

Because the lowest_key_objectid and the level are just hints they are
not used when backrefs are deleted. When a snapshot is created a new
reference is taken directly on the root block. This means the owner
field of the root block may be different from the objectid of the
snapshot. So, when dropping references on tree roots, the objectid of
the root structure is always used. When a backref is deleted:

.. code-block::

   if backref was for a tree root:
           root_objectid = root->root_key.objectid
   else:
           root_objectid = btrfs_header_owner(parent)

   (root_objectid, btrfs_header_generation(parent) or zero, 0, 0)

Back Reference Key Construction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Back references have four fields, each 64 bits long. This is hashed into
a single 64 bit number and placed into the key offset. The key objectid
corresponds to the first byte in the extent, and the key type is set to
BTRFS_EXTENT_REF_KEY.

Hash overflows on the offset field are handled by adding one to the
calculated hash and searching forward. The searching stops when the
correct back reference structure is found or

Snapshots and Subvolumes
------------------------

A subvolume is basically a named btree that holds files and directories.
Subvolumes have inodes inside the tree of tree roots and can have
non-root owners and groups. Subvolumes can be given a quota of blocks,
and once this quota is reached no new writes are allowed. All of the
blocks and file extents inside of subvolumes are reference counted to
allow snapshotting. Up to 2\ :sup:`64` subvolumes may be created on the
FS.

Snapshots are identical to subvolumes, but their root block is initially
shared with another subvolume. When the snapshot is taken, the reference
count on the root block is increased, and the copy on write transaction
system ensures changes made in either the snapshot or the source
subvolume are private to that root. Snapshots are writable, and they can
be snapshotted again any number of times. If read only snapshots are
desired, their block quota is set to one at creation time.

Btree Roots
-----------

Each Btrfs filesystem consists of a number of tree roots. A freshly
formatted filesystem will have roots for:

- The tree of tree roots
- The tree of allocated extents
- The default subvolume tree

The tree of tree roots records the root block for the extent tree and
the root blocks and names for each subvolume and snapshot tree. As
transactions commit, the root block pointers are updated in this tree to
reference the new roots created by the transaction, and then the new
root block of this tree is recorded in the FS super block.

The tree of tree roots acts as a directory of all the other trees on the
filesystem, and it has directory items recording the names of all
snapshots and subvolumes in the FS. Each snapshot or subvolume has an
objectid in the tree of tree roots, and at least one corresponding
struct btrfs_root_item. Directory items in the tree map names of
snapshots and subvolumes to these root items. Because the root item key
is updated with every transaction commit, the directory items reference
a generation number of (u64)-1, which tells the lookup code to find the
most recent root available.

The extent trees are used to manage allocated space on the devices. The
space available can be divided between a number of extent trees to
reduce lock contention and give different allocation policies to
different block ranges.

The diagram below depicts a collection of tree roots. The super block
points to the root tree, and the root tree points to the extent trees
and subvolumes. The root tree also has a directory to map subvolume
names to struct btrfs_root_items in the root tree. This filesystem has
one subvolume named 'default' (created by mkfs), and one snapshot of
'default' named 'snap' (created by the admin some time later). In this
example, 'default' has not changed since the snapshot was created and so
both trees point to the same root block on disk.

.. figure:: Copy-Design-r.png
   :alt: Copy-Design-r.png

   Copy-Design-r.png

Copy on Write Logging
---------------------

Data and metadata in Btrfs are protected with copy on write logging
(COW). Once the transaction that allocated the space on disk has
committed, any new writes to that logical address in the file or btree
will go to a newly allocated block, and block pointers in the btrees and
super blocks will be updated to reflect the new location.

Some of the btrfs trees do not use reference counting for their
allocated space. This includes the root tree and the extent trees. As
blocks are replaced in these trees, the old block is freed in the extent
tree. These blocks are not reused for other purposes until the
transaction that freed them commits.

All subvolume (and snapshot) trees are reference counted. When a COW
operation is performed on a btree node, the reference count of all the
blocks it points to is increased by one. For leaves, the reference
counts of any file extents in the leaf are increased by one. When the
transaction commits, a new root pointer is inserted in the root tree for
each new subvolume root. The key used has the form:

====================== =================== ==============
Subvolume inode number BTRFS_ROOT_ITEM_KEY Transaction ID
====================== =================== ==============

The updated btree blocks are all flushed to disk, and then the super
block is updated to point to the new root tree. Once the super block has
been properly written to disk, the transaction is considered complete.
At this time the root tree has two pointers for each subvolume changed
during the transaction. One item points to the new tree and one points
to the tree that existed at the start of the last transaction.

Any time after the commit finishes, the older subvolume root items may
be removed. The reference count on the subvolume root block is lowered
by one. If the reference count reaches zero, the block is freed and the
reference count on any nodes the root points to is lowered by one. If a
tree node or leaf can be freed, it is traversed to free the nodes or
extents below it in the tree in a depth first fashion.

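The interaction between COW and the later removal of an old root can be
sketched with an in-memory tree. This is only a model of the reference
counting rules described above: ``struct sketch_node``, ``cow_node()``
and ``drop_ref()`` are hypothetical names, leaves are modelled as nodes
without children, and file extents and progress records are left out.

```c
#include <stddef.h>

#define SKETCH_MAX_CHILDREN 4

/* In-memory stand-in for a tree block with a reference count. */
struct sketch_node {
        int refs;
        int freed; /* set once the block has been released */
        size_t nr_children;
        struct sketch_node *children[SKETCH_MAX_CHILDREN];
};

/*
 * COW of a node: the copy points at the same children as the original,
 * so the reference count of every child is increased by one.
 */
static void cow_node(const struct sketch_node *orig, struct sketch_node *copy)
{
        copy->refs = 1;
        copy->freed = 0;
        copy->nr_children = orig->nr_children;
        for (size_t i = 0; i < orig->nr_children; i++) {
                copy->children[i] = orig->children[i];
                copy->children[i]->refs++;
        }
}

/*
 * Drop one reference. Only when the count reaches zero is the block
 * freed and the walk continued, depth first, into the blocks below it.
 * Returns the number of blocks freed.
 */
static int drop_ref(struct sketch_node *node)
{
        int freed = 0;

        if (--node->refs > 0)
                return 0; /* still shared with another tree */

        for (size_t i = 0; i < node->nr_children; i++)
                freed += drop_ref(node->children[i]);

        node->freed = 1;
        return freed + 1;
}
```

Dropping the old root after a COW frees only the old root block,
because the copy still holds references on the shared children. The
children are freed once the last tree pointing to them is dropped.
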
The traversal and freeing of the tree may be done in pieces by inserting
a progress record in the root tree. The progress record indicates the
last key and level touched by the traversal so the current transaction
can commit and the traversal can resume in the next transaction. If the
system crashes before the traversal completes, the progress record is
used to safely delete the root on the next mount.

Ohad Rodeh presented this reference counted snapshot algorithm at the
2007 Linux Filesystem and Storage Workshop:

Slides: `LinuxFS_Workshop.pdf <Media:LinuxFS_Workshop.pdf>`__

Paper: `Btree_TOS.pdf <Media:Btree_TOS.pdf>`__

The Btrfs snapshotting implementation is based on the ideas he
presented.

Btrfsck
-------

The filesystem checking utility is a crucial tool, but it can be a major
bottleneck in getting systems back online after something has gone
wrong. Btrfs aims to be tolerant of invalid metadata, and will avoid
using metadata it determines to be incorrect. The disk format allows
Btrfs to deal with most corruptions at run time, without crashing the
system and without requiring offline filesystem checking.

An offline btrfsck is being developed, in part to help verify the
filesystem during testing, and as an emergency tool to make sure the
filesystem is safe for mounting. The existing tool only verifies the
extent allocation maps, making sure that reference counts are correct
and that all extents are accounted for. If the extent maps are correct,
there is no risk of incorrectly writing over existing data or metadata
as blocks are allocated for new use.

btrfsck is able to read metadata in roughly disk order. As it scans the
btrees on disk, it collects the locations of nodes and leaves and pulls
them from the disk in large sequential batches. For the most part,
btrfsck is bound by the sequential read throughput of the storage, and
it is able to take advantage of multi-spindle arrays. The price paid for
the extra speed is more RAM. Btrfsck uses about 3x more RAM than
ext2fsck.