btrfs-progs: docs: add some design-related documents

Copied from wiki.

Signed-off-by: David Sterba <dsterba@suse.com>

parent 5e4a18b4b5
commit 403ba6e6ee

@ -0,0 +1,111 @@
Btrees
======

Btrees Introduction
-------------------

Btrfs uses a single set of btree manipulation code for all metadata in
the filesystem. For performance or organizational purposes, the trees
are broken up into a few different types, and each type of tree holds a
few different types of keys. The super block holds pointers to the roots
of the tree of tree roots and of the chunk tree.

Tree of Tree Roots
------------------

This tree is used for indexing and finding the root of most of the other
trees in the filesystem. It attaches names to subvolumes and snapshots,
and stores the location of the extent allocation tree root. It also
stores pointers to all of the subvolumes or snapshots that are being
deleted by the transaction code. This allows the deletion to pick up
where it left off after a crash.

Chunk Tree
----------

The chunk tree does all of the logical to physical block address mapping
for the filesystem, and it stores information about all of the devices
in the FS. In order to bootstrap lookup in the chunk tree, the super
block also duplicates the chunk items needed to resolve blocks in the
chunk tree. Over time, the chunk tree will be split into multiple roots
to allow access to larger storage pools.

There are back references from the chunk items to the extent tree that
allocated them. Only a single extent tree can allocate extents out of a
given chunk.

Two types of key are stored in the chunk tree:

- DEV_ITEM (where the offset field is the internal devid), one for each
  underlying block device in the filesystem, holding information about
  that device
- CHUNK_ITEM (where the offset field is the start of the chunk as a
  virtual address), which maps a section of the virtual address space
  (a chunk) into physical storage.

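
The mapping role of a CHUNK_ITEM can be sketched as follows. This is a minimal illustration, not the real on-disk structure: ``struct example_chunk`` and ``example_map_logical`` are hypothetical names, and the chunk is assumed to have a single stripe on a single device (no mirroring or striping).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified chunk: one stripe on one device. */
struct example_chunk {
	uint64_t logical_start;  /* key offset of the CHUNK_ITEM */
	uint64_t length;         /* bytes of logical address space covered */
	uint64_t devid;          /* device holding the single stripe */
	uint64_t physical_start; /* byte offset of the stripe on that device */
};

/*
 * Translate a logical address into (devid, physical offset).
 * Returns false when the address falls outside the chunk.
 */
bool example_map_logical(const struct example_chunk *c, uint64_t logical,
			 uint64_t *devid, uint64_t *physical)
{
	if (logical < c->logical_start ||
	    logical - c->logical_start >= c->length)
		return false;

	*devid = c->devid;
	*physical = c->physical_start + (logical - c->logical_start);
	return true;
}
```

A lookup would first find the CHUNK_ITEM with the largest key offset that is not above the logical address, then apply a translation of this kind.
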
Device Allocation Tree
----------------------

The device allocation tree records which parts of each physical device
have been allocated into chunks. This is a relatively small tree that is
only updated as new chunks are allocated. It stores back references to
the chunk tree that allocated each physical extent on the device.

Extent Allocation Tree
----------------------

The extent allocation tree records byte ranges that are in use,
maintains reference counts on each extent and records back references to
the tree or file that is using each extent. Logical block groups are
created inside the extent allocation tree, and these reference large
logical extents from the chunk tree.

Each block group can only store a specific type of extent. This might
be metadata, mirrored metadata, striped data blocks, etc.

Currently there is only one extent allocation tree shared by all the
other trees. This will change in order to scale better under load.

Keys for the extent tree use the start of the extent as the objectid. A
BLOCK_GROUP_ITEM key will be followed by the EXTENT_ITEM keys for
extents within that block group.

FS Trees
--------

These store files and directories, and all of the normal metadata you
would expect to find in a filesystem. There is one root for each
subvolume or snapshot, but snapshots will share blocks between roots.

Keys in FS trees always use the inode number of the filesystem object as
the objectid.

Each object will have one or more of:

- Inode.
- Inode ref, indicating what name this object is known as, and in which
  directory.
- For files, a set of extent information, indicating where on the
  filesystem this file's data is.
- For directories, two sequences of dir_items, one indexed by a hash of
  the object name, and one indexed by a unique sequential index number.

Checksum Tree
-------------

The checksum tree stores block checksums. Every 4k block of data stored
on disk has a checksum associated with it. The "offset" part of the keys
in the checksum tree indicates the start of the checksummed data on
disk. The value stored with the key is a sequence of (currently 4-byte)
checksums, for the 4k blocks starting at the offset.

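
As a minimal sketch of the layout described above, the following function locates the checksum for a given disk byte inside one checksum item. The names ``example_csum_pos``, ``EXAMPLE_BLOCK_SIZE`` and ``EXAMPLE_CSUM_SIZE`` are hypothetical, and the 4k block and 4-byte checksum sizes are simply the values quoted in the text.

```c
#include <stdint.h>

#define EXAMPLE_BLOCK_SIZE 4096u /* data bytes covered by one checksum */
#define EXAMPLE_CSUM_SIZE  4u    /* bytes per stored checksum */

/*
 * Given a checksum item whose key offset is item_start and whose value is
 * item_size bytes long, return the byte position inside the value of the
 * checksum covering disk byte bytenr, or -1 if the item does not cover it.
 */
int64_t example_csum_pos(uint64_t item_start, uint32_t item_size,
			 uint64_t bytenr)
{
	uint64_t nr_csums = item_size / EXAMPLE_CSUM_SIZE;
	uint64_t index;

	if (bytenr < item_start)
		return -1;

	index = (bytenr - item_start) / EXAMPLE_BLOCK_SIZE;
	if (index >= nr_csums)
		return -1;

	return (int64_t)(index * EXAMPLE_CSUM_SIZE);
}
```

For an item that starts at 1MiB and holds four checksums, any byte in the second 4k block maps to position 4 in the value.
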
Data Relocation Tree
--------------------


Log Root Tree
-------------

@ -0,0 +1,482 @@
Btrfs design
============

Btrfs is implemented with simple and well known constructs. It should
perform well, but the long term goal of maintaining performance as the
filesystem ages and grows is more important than winning a short lived
benchmark. To that end, benchmarks are being used to try to simulate
performance over the life of a filesystem.

Btree Data Structures
---------------------

The Btrfs btree provides a generic facility to store a variety of data
types. Internally it only knows about three data structures: keys,
items, and a block header:

.. code-block::

   /* Header found at the start of every btree block */
   struct btrfs_header {
           u8 csum[32];            /* checksum of the block contents */
           u8 fsid[16];            /* uuid of the owning filesystem */
           __le64 bytenr;          /* where this block is supposed to live */
           __le64 flags;

           u8 chunk_tree_uuid[16];
           __le64 generation;      /* transaction id that allocated the block */
           __le64 owner;           /* objectid of the tree owning the block */
           __le32 nritems;
           u8 level;               /* 0 for leaves */
   } __attribute__ ((__packed__));

.. code-block::

   struct btrfs_disk_key {
           __le64 objectid;
           u8 type;
           __le64 offset;
   } __attribute__ ((__packed__));

.. code-block::

   struct btrfs_item {
           struct btrfs_disk_key key;
           __le32 offset;          /* where the item data starts in the leaf */
           __le32 size;            /* length of the item data */
   } __attribute__ ((__packed__));

Upper nodes of the trees contain only [ key, block pointer ] pairs. Tree
leaves are broken up into two sections that grow toward each other.
Leaves have an array of fixed sized items, and an area where item data
is stored. The offset and size fields in the item indicate where in the
leaf the item data can be found. Example:

[Figure: Leaf-structure.png]

Item data is variably sized, and various filesystem data structures are
defined as different types of item data. The type field in struct
btrfs_disk_key indicates the type of data stored in the item.

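
The two leaf sections growing toward each other can be sketched with a toy leaf. Everything here is hypothetical and simplified: ``struct example_leaf`` has no block header, ``struct example_item`` omits the key, the leaf is only 256 bytes, and items are appended rather than kept sorted by key.

```c
#include <stdint.h>
#include <string.h>

#define EXAMPLE_LEAF_SIZE 256u

/* Hypothetical item header; the real struct btrfs_item also embeds the key. */
struct example_item {
	uint32_t offset; /* where the item data starts inside data[] */
	uint32_t size;   /* length of the item data in bytes */
};

struct example_leaf {
	uint32_t nritems;
	/* Item headers grow from the front, item data grows from the back. */
	uint8_t data[EXAMPLE_LEAF_SIZE];
};

struct example_item example_leaf_item(const struct example_leaf *leaf,
				      uint32_t slot)
{
	struct example_item it;

	memcpy(&it, leaf->data + slot * sizeof(it), sizeof(it));
	return it;
}

/* Lowest byte currently used by item data. */
uint32_t example_leaf_data_start(const struct example_leaf *leaf)
{
	if (leaf->nritems == 0)
		return EXAMPLE_LEAF_SIZE;
	return example_leaf_item(leaf, leaf->nritems - 1).offset;
}

/* Bytes left between the end of the item array and the start of the data. */
uint32_t example_leaf_free_space(const struct example_leaf *leaf)
{
	return example_leaf_data_start(leaf) -
	       leaf->nritems * (uint32_t)sizeof(struct example_item);
}

/* Append one item; returns 0 on success, -1 if the leaf is too full. */
int example_leaf_append(struct example_leaf *leaf, const void *payload,
			uint32_t size)
{
	struct example_item it;

	if ((uint64_t)size + sizeof(it) > example_leaf_free_space(leaf))
		return -1;

	it.offset = example_leaf_data_start(leaf) - size;
	it.size = size;
	memcpy(leaf->data + it.offset, payload, size);
	memcpy(leaf->data + leaf->nritems * sizeof(it), &it, sizeof(it));
	leaf->nritems++;
	return 0;
}
```

Each insertion consumes one fixed size header at the front plus the payload at the back, so the leaf is full when the two regions meet.
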
The block header contains a checksum for the block contents, the uuid of
the filesystem that owns the block, the level of the block in the tree,
and the block number where this block is supposed to live. These fields
allow the contents of the metadata to be verified when the data is read.
Everything that points to a btree block also stores the generation field
it expects that block to have. This allows Btrfs to detect phantom or
misplaced writes on the media.

The checksum of the lower node is not stored in the node pointer to
simplify the FS writeback code. The generation number will be known at
the time the block is inserted into the btree, but the checksum is only
calculated before writing the block to disk. Using the generation will
allow Btrfs to detect phantom writes without having to find and update
the upper node each time the lower node checksum is updated.

The generation field corresponds to the transaction id that allocated
the block, which enables easy incremental backups and is used by the
copy on write transaction subsystem.

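
The read-time checks described above can be sketched as follows. This is an illustration under assumptions, not the filesystem's verification code: ``struct example_header`` keeps only the fields needed here in CPU byte order, checksum verification is left out, and all names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical subset of the block header, in CPU byte order. */
struct example_header {
	uint8_t fsid[16];
	uint64_t bytenr;
	uint64_t generation;
};

enum example_verdict {
	EXAMPLE_BLOCK_OK,
	EXAMPLE_WRONG_FS,  /* block belongs to another filesystem */
	EXAMPLE_MISPLACED, /* block was written to the wrong location */
	EXAMPLE_STALE,     /* expected write never reached the media */
};

/*
 * Check a block that was read from read_from against what the pointer to it
 * promised. expected_generation comes from the parent's block pointer.
 */
enum example_verdict example_check_block(const struct example_header *hdr,
					 const uint8_t fsid[16],
					 uint64_t read_from,
					 uint64_t expected_generation)
{
	if (memcmp(hdr->fsid, fsid, 16) != 0)
		return EXAMPLE_WRONG_FS;
	if (hdr->bytenr != read_from)
		return EXAMPLE_MISPLACED;
	if (hdr->generation != expected_generation)
		return EXAMPLE_STALE;
	return EXAMPLE_BLOCK_OK;
}
```

A phantom write leaves an older but internally valid block in place, so only the generation comparison can catch it.
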
Filesystem Data Structures
--------------------------

Each object in the filesystem has an objectid, which is allocated
dynamically on creation. A free objectid is simply a hole in the key
space of the filesystem btree, that is, an objectid that does not
already exist in the tree. The objectid makes up the most significant
bits of the key, allowing all of the items for a given filesystem object
to be logically grouped together in the btree.

The offset field of the key indicates the byte offset for a particular
item in the object. For file extents, this would be the byte offset of
the start of the extent in the file. The type field stores the item type
information, and has extra room for expanded use.

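
The key ordering implied above (objectid first, then type, then offset) can be sketched as a comparison function. ``struct example_key`` is a hypothetical CPU-order stand-in for struct btrfs_disk_key, which is little-endian and packed on disk.

```c
#include <stdint.h>

/* Hypothetical CPU-order key. */
struct example_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Returns -1, 0 or 1 when a sorts before, equal to, or after b. */
int example_key_cmp(const struct example_key *a, const struct example_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}
```

Because the objectid is compared first, every item belonging to one object sorts into a contiguous run of keys, ordered by item type and then by offset.
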
Inodes
------

Inodes are stored in struct btrfs_inode_item at offset zero in the key,
and have a type value of one. Inode items are always the lowest valued
key for a given object, and they store the traditional stat data for
files and directories. The inode structure is relatively small, and will
not contain embedded file data or extended attribute data. These things
are stored in other item types.

Files
-----

Small files that occupy less than one leaf block may be packed into the
btree inside the extent item. In this case the key offset is the byte
offset of the data in the file, and the size field of struct btrfs_item
indicates how much data is stored. There may be more than one of these
per file.

Larger files are stored in extents. struct btrfs_file_extent_item
records a generation number for the extent and a [ disk block, disk num
blocks ] pair to record the area of disk corresponding to the file.
Each extent record also stores its logical offset into the extent on
disk and the number of blocks of that extent it uses. This allows Btrfs
to satisfy a rewrite into the middle of an extent without having to read
the old file data first. For example, writing 1MB into the middle of an
existing 128MB extent may result in three extent records:

``[ old extent: bytes 0-64MB ], [ new extent 1MB ], [ old extent: bytes 65MB-128MB ]``

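
The three-record result can be sketched in code. This is a simplified model, not the real struct btrfs_file_extent_item: ``struct example_extent`` and ``example_split_extent`` are hypothetical, sizes are in bytes rather than blocks, and compression and reference counting are ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified file extent record. */
struct example_extent {
	uint64_t file_offset;   /* key offset: where the record starts in the file */
	uint64_t disk_bytenr;   /* start of the extent on disk */
	uint64_t disk_len;      /* full length of the extent on disk */
	uint64_t extent_offset; /* offset into the on-disk extent used by this record */
	uint64_t num_bytes;     /* bytes of the on-disk extent used by this record */
};

/*
 * Overwrite [write_off, write_off + write_len) inside old, with the new data
 * placed at new_disk_bytenr. Fills out[] with up to three records and returns
 * how many were produced, or 0 if the write is not fully inside old.
 */
size_t example_split_extent(const struct example_extent *old,
			    uint64_t write_off, uint64_t write_len,
			    uint64_t new_disk_bytenr,
			    struct example_extent out[3])
{
	uint64_t start = old->file_offset;
	uint64_t end = start + old->num_bytes;
	uint64_t write_end;
	size_t n = 0;

	if (write_len == 0 || write_off < start || write_off > end ||
	    write_len > end - write_off)
		return 0;
	write_end = write_off + write_len;

	if (write_off > start) {
		/* Head: same on-disk extent, shortened. */
		out[n] = *old;
		out[n].num_bytes = write_off - start;
		n++;
	}

	/* Middle: the newly written data in a newly allocated extent. */
	out[n].file_offset = write_off;
	out[n].disk_bytenr = new_disk_bytenr;
	out[n].disk_len = write_len;
	out[n].extent_offset = 0;
	out[n].num_bytes = write_len;
	n++;

	if (write_end < end) {
		/* Tail: same on-disk extent, entered part way through. */
		out[n] = *old;
		out[n].file_offset = write_end;
		out[n].extent_offset = old->extent_offset + (write_end - start);
		out[n].num_bytes = end - write_end;
		n++;
	}
	return n;
}
```

The head and tail records keep pointing at the old on-disk extent and differ only in their offset and length, which is why the old data never has to be read or copied.
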
File data checksums are stored in a dedicated btree in a struct
btrfs_csum_item. The offset of the key corresponds to the byte number of
the extent. The data is checksummed after any compression or encryption
is done, so the checksum reflects the bytes sent to the disk.

A single item may store a number of checksums. struct btrfs_csum_items
are only used for file extents. File data inline in the btree is covered
by the checksum at the start of the btree block.

Directories
-----------

Directories are indexed in two different ways. For filename lookup,
there is an index composed of keys:

================== ================== ====================
Directory Objectid BTRFS_DIR_ITEM_KEY 64-bit filename hash
================== ================== ====================

The default directory hash used is crc32c, although other hashes may be
added later on. A flags field in the super block will indicate which
hash is used for a given FS.

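
A filename lookup key of this form can be sketched as below. The CRC-32C routine uses the standard conventions (initial value of all ones, final inversion); the seed and finalization used by the filesystem itself may differ, so the hash values here are an assumption and should not be expected to match on-disk keys. ``struct example_key``, ``example_crc32c`` and ``example_dir_item_key`` are hypothetical names, and the numeric key type is quoted from the kernel headers and should be checked against them.

```c
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_DIR_ITEM_KEY 84 /* assumed value of BTRFS_DIR_ITEM_KEY */

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
uint32_t example_crc32c_update(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Standard CRC-32C: start from all ones, invert the result. */
uint32_t example_crc32c(const void *buf, size_t len)
{
	return ~example_crc32c_update(0xFFFFFFFFu, buf, len);
}

/* Hypothetical CPU-order key. */
struct example_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Build the lookup key for name inside the directory dir_objectid. */
struct example_key example_dir_item_key(uint64_t dir_objectid,
					const char *name, size_t len)
{
	struct example_key key;

	key.objectid = dir_objectid;
	key.type = EXAMPLE_DIR_ITEM_KEY;
	key.offset = example_crc32c(name, len);
	return key;
}
```

Since the hash is only 32 bits wide inside a 64-bit key field, distinct names can collide on the same key, so a lookup still has to compare the stored name.
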
The second directory index is used by readdir to return data in inode
number order. This more closely resembles the order of blocks on disk
and generally provides better performance for reading data in bulk
(backups, copies, etc). Also, it allows fast checking that a given inode
is linked into a directory when verifying inode link counts. This index
uses an additional set of keys:

================== =================== =====================
Directory Objectid BTRFS_DIR_INDEX_KEY Inode Sequence number
================== =================== =====================

The inode sequence number comes from the directory. It is increased each
time a new file or directory is added.

Reference Counted Extents
-------------------------

Reference counting is the basis for the snapshotting subsystems. For
every extent allocated to a btree or a file, Btrfs records the number of
references in a struct btrfs_extent_item. The trees that hold these
items also serve as the allocation map for blocks that are in use on the
filesystem. Some trees are not reference counted and are only protected
by copy on write logging. However, the same type of extent items are
used for all allocated blocks on the disk.

A reasonably comprehensive description of the way that references work
can be found in `this email from Josef
Bacik <http://www.spinics.net/lists/linux-btrfs/msg33415.html>`__.

Extent Block Groups
-------------------

Extent block groups allow allocator optimizations by breaking the disk
up into chunks of 256MB or more. For each chunk, they record information
about the number of blocks available. Files and directories will have a
preferred block group which they try first for allocations.

Block groups have a flag that indicates if they are preferred for data
or metadata allocations, and at mkfs time the disk is broken up into
alternating metadata (33% of the disk) and data groups (66% of the
disk). As the disk fills, a group's preference may change back and
forth, but Btrfs always tries to avoid intermixing data and metadata
extents in the same group. This substantially improves fsck throughput,
and reduces seeks during writeback while the FS is mounted. It does
slightly increase the seeks while reading.

Extent Trees and DM integration
-------------------------------

The Btrfs extent trees are intended to divide up the available storage
into a number of flexible allocation policies. Each extent tree owns a
section of the underlying disk, and they can be assigned to a collection
of (or a single) tree roots, directories or inodes. Policies will direct
how a given allocation is spread across the extent trees available,
allowing the admin to direct which parts of the filesystem are striped,
mirrored or confined to a given device.

Btrfs will try to tie in with DM in order to easily manage large pools
of storage. The basic idea is to have at least one extent tree per
spindle, and then allow the admin to assign those extent trees to
subvolumes, directories or files.

Explicit Back References
------------------------

Back references have three main goals:

- Differentiate between all holders of references to an extent so that
  when a reference is dropped we can make sure it was a valid reference
  before freeing the extent.
- Provide enough information to quickly find the holders of an extent
  if we notice a given block is corrupted or bad.
- Make it easy to migrate blocks for FS shrinking or storage pool
  maintenance. This is actually the same as the second goal, but with a
  slightly different use case.

File Extent Backrefs
^^^^^^^^^^^^^^^^^^^^

File extents can be referenced by:

- Multiple snapshots, subvolumes, or different generations in one
  subvol
- Different files inside a single subvolume
- Different offsets inside a file

.. note::
   The remainder of this section refers to the extent_ref_v0 structure,
   which is not used on current btrfs filesystems.

The extent ref structure has fields for:

- Objectid of the subvolume root
- Generation number of the tree holding the reference
- Objectid of the file holding the reference
- Offset in the file corresponding to the key holding the reference

When a file extent is allocated the fields are filled in:

(root objectid, transaction id, inode objectid, offset in file)

When a leaf is cow'd new references are added for every file extent
found in the leaf. It looks the same as the create case, but the
transaction id will be different when the block is cow'd.

(root objectid, transaction id, inode objectid, offset in file)

When a file extent is removed either during snapshot deletion or file
truncation, the corresponding back reference is found by searching for:

(btrfs_header_owner(leaf), btrfs_header_generation(leaf), inode
objectid, offset in file)

Btree Extent Backrefs
^^^^^^^^^^^^^^^^^^^^^

Btree extents can be referenced by:

- Different subvolumes
- Different generations of the same subvolume

Storing sufficient information for a full reverse mapping of a btree
block would require storing the lowest key of the block in the backref,
and it would require updating that lowest key either before write out or
every time it changed.

Instead, the objectid of the lowest key is stored along with the level
of the tree block. This provides a hint about where in the btree the
block can be found. Searches through the btree only need to look for a
pointer to that block, and they stop one level higher than the level
recorded in the backref.

Some btrees do not do reference counting on their extents. These include
the extent tree and the tree of tree roots. Backrefs for these trees
always have a generation of zero.

When a tree block is created, back references are inserted:

(root objectid, transaction id or zero, level, lowest objectid)

The level is stored in the objectid slot of the backref to differentiate
between Btree back references and file data back references. The highest
possible level is 255, and the lowest possible file objectid has been
raised to 256. So, if the objectid field in the back reference is less
than 256, it corresponds to a Btree block.

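
The 256 boundary can be sketched as a simple test on the objectid slot. This mirrors the four-field back reference layout described above; ``struct example_backref`` and the helper names are hypothetical and do not correspond to a current on-disk structure.

```c
#include <stdbool.h>
#include <stdint.h>

/* Lowest objectid a file may have, as stated in the text. */
#define EXAMPLE_FIRST_FILE_OBJECTID 256ULL

/* Hypothetical four-field back reference. */
struct example_backref {
	uint64_t root;       /* root objectid */
	uint64_t generation; /* transaction id, or zero */
	uint64_t objectid;   /* inode objectid, or tree level for btree blocks */
	uint64_t offset;     /* offset in file, or lowest objectid in the block */
};

struct example_backref example_tree_backref(uint64_t root, uint64_t transid,
					    uint8_t level,
					    uint64_t lowest_objectid)
{
	struct example_backref ref = { root, transid, level, lowest_objectid };

	return ref;
}

struct example_backref example_file_backref(uint64_t root, uint64_t transid,
					    uint64_t inode, uint64_t file_offset)
{
	struct example_backref ref = { root, transid, inode, file_offset };

	return ref;
}

/* A level always fits below 256, a file objectid never does. */
bool example_backref_is_tree_block(const struct example_backref *ref)
{
	return ref->objectid < EXAMPLE_FIRST_FILE_OBJECTID;
}
```

Reusing the objectid slot this way avoids a separate flag field, at the cost of reserving the first 256 objectids.
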
When a tree block is cow'd in a reference counted root, new back
references are added for all the blocks it points to:

(root objectid, transaction id, level, lowest objectid)

Because the lowest_key_objectid and the level are just hints they are
not used when backrefs are deleted. When a snapshot is created a new
reference is taken directly on the root block. This means the owner
field of the root block may be different from the objectid of the
snapshot. So, when dropping references on tree roots, the objectid of
the root structure is always used. When a backref is deleted:

.. code-block::

   if backref was for a tree root:
       root_objectid = root->root_key.objectid
   else
       root_objectid = btrfs_header_owner(parent)

   (root_objectid, btrfs_header_generation(parent) or zero, 0, 0)

Back Reference Key Construction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Back references have four fields, each 64 bits long. These are hashed
into a single 64 bit number and placed into the key offset. The key
objectid corresponds to the first byte in the extent, and the key type
is set to BTRFS_EXTENT_REF_KEY.

Hash overflows on the offset field are handled by adding one to the
calculated hash and searching forward. The searching stops when the
correct back reference structure is found or

Snapshots and Subvolumes
------------------------

A subvolume is basically a named btree that holds files and directories.
Subvolumes have inodes inside the tree of tree roots and can have
non-root owners and groups. Subvolumes can be given a quota of blocks,
and once this quota is reached no new writes are allowed. All of the
blocks and file extents inside of subvolumes are reference counted to
allow snapshotting. Up to 2\ :sup:`64` subvolumes may be created on the
FS.

Snapshots are identical to subvolumes, but their root block is initially
shared with another subvolume. When the snapshot is taken, the reference
count on the root block is increased, and the copy on write transaction
system ensures changes made in either the snapshot or the source
subvolume are private to that root. Snapshots are writable, and they can
be snapshotted again any number of times. If read only snapshots are
desired, their block quota is set to one at creation time.

Btree Roots
-----------

Each Btrfs filesystem consists of a number of tree roots. A freshly
formatted filesystem will have roots for:

- The tree of tree roots
- The tree of allocated extents
- The default subvolume tree

The tree of tree roots records the root block for the extent tree and
the root blocks and names for each subvolume and snapshot tree. As
transactions commit, the root block pointers are updated in this tree to
reference the new roots created by the transaction, and then the new
root block of this tree is recorded in the FS super block.

The tree of tree roots acts as a directory of all the other trees on the
filesystem, and it has directory items recording the names of all
snapshots and subvolumes in the FS. Each snapshot or subvolume has an
objectid in the tree of tree roots, and at least one corresponding
struct btrfs_root_item. Directory items in the tree map names of
snapshots and subvolumes to these root items. Because the root item key
is updated with every transaction commit, the directory items reference
a generation number of (u64)-1, which tells the lookup code to find the
most recent root available.

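
The "(u64)-1 means most recent" lookup can be sketched as a search for the greatest key not above (objectid, root item type, all ones). This is an illustration over a sorted in-memory array rather than a btree; ``struct example_root_key`` and ``example_find_latest_root`` are hypothetical, and the numeric key type is quoted from the kernel headers and should be checked against them.

```c
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_ROOT_ITEM_KEY 132 /* assumed value of BTRFS_ROOT_ITEM_KEY */

/* Hypothetical CPU-order key; offset holds the generation in this design. */
struct example_root_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

static int example_root_key_cmp(const struct example_root_key *a,
				const struct example_root_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/*
 * keys[] must be sorted. Returns the index of the newest root item for
 * objectid, or -1 if the tree holds no root item for it.
 */
long example_find_latest_root(const struct example_root_key *keys, size_t n,
			      uint64_t objectid)
{
	struct example_root_key target = {
		objectid, EXAMPLE_ROOT_ITEM_KEY, UINT64_MAX
	};
	size_t lo = 0, hi = n;

	/* Find the first index whose key sorts after the target. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (example_root_key_cmp(&keys[mid], &target) <= 0)
			lo = mid + 1;
		else
			hi = mid;
	}

	/* The key just before it is the candidate, if it matches at all. */
	if (lo == 0)
		return -1;
	if (keys[lo - 1].objectid != objectid ||
	    keys[lo - 1].type != EXAMPLE_ROOT_ITEM_KEY)
		return -1;
	return (long)(lo - 1);
}
```

Searching with an offset of all ones lets directory items stay unchanged across commits while always resolving to the newest root item.
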
The extent trees are used to manage allocated space on the devices. The
space available can be divided between a number of extent trees to
reduce lock contention and give different allocation policies to
different block ranges.

The diagram below depicts a collection of tree roots. The super block
points to the root tree, and the root tree points to the extent trees
and subvolumes. The root tree also has a directory to map subvolume
names to struct btrfs_root_items in the root tree. This filesystem has
one subvolume named 'default' (created by mkfs), and one snapshot of
'default' named 'snap' (created by the admin some time later). In this
example, 'default' has not changed since the snapshot was created and so
both point to the same root block on disk.

[Figure: Copy-Design-r.png]

Copy on Write Logging
---------------------

Data and metadata in Btrfs are protected with copy on write logging
(COW). Once the transaction that allocated the space on disk has
committed, any new writes to that logical address in the file or btree
will go to a newly allocated block, and block pointers in the btrees and
super blocks will be updated to reflect the new location.

Some of the btrfs trees do not use reference counting for their
allocated space. This includes the root tree, and the extent trees. As
blocks are replaced in these trees, the old block is freed in the extent
tree. These blocks are not reused for other purposes until the
transaction that freed them commits.

All subvolume (and snapshot) trees are reference counted. When a COW
operation is performed on a btree node, the reference count of all the
blocks it points to is increased by one. For leaves, the reference
counts of any file extents in the leaf are increased by one. When the
transaction commits, a new root pointer is inserted in the root tree for
each new subvolume root. The key used has the form:

====================== =================== ==============
Subvolume inode number BTRFS_ROOT_ITEM_KEY Transaction ID
====================== =================== ==============

The updated btree blocks are all flushed to disk, and then the super
block is updated to point to the new root tree. Once the super block has
been properly written to disk, the transaction is considered complete.
At this time the root tree has two pointers for each subvolume changed
during the transaction. One item points to the new tree and one points
to the tree that existed at the start of the last transaction.

Any time after the commit finishes, the older subvolume root items may
be removed. The reference count on the subvolume root block is lowered
by one. If the reference count reaches zero, the block is freed and the
reference count on any nodes the root points to is lowered by one. If a
tree node or leaf can be freed, it is traversed to free the nodes or
extents below it in the tree in a depth first fashion.

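
The reference count rules above can be sketched with a toy in-memory tree. This is an illustration only: ``struct example_block`` is hypothetical, blocks live in memory rather than on disk, file extents are not modelled, and the drop is plain recursion rather than a resumable traversal.

```c
#include <stddef.h>

#define EXAMPLE_MAX_CHILDREN 4

/* Hypothetical tree block with a reference count. */
struct example_block {
	unsigned refs;
	int freed;
	size_t nr_children;
	struct example_block *children[EXAMPLE_MAX_CHILDREN];
};

/*
 * COW a node: the copy points at the same children as the original, so each
 * child gains one reference.
 */
void example_cow(const struct example_block *old, struct example_block *copy)
{
	*copy = *old;
	copy->refs = 1;
	copy->freed = 0;
	for (size_t i = 0; i < copy->nr_children; i++)
		copy->children[i]->refs++;
}

/*
 * Drop one reference. When the count reaches zero the block is freed and the
 * drop continues depth first into the blocks it pointed to.
 */
void example_drop_ref(struct example_block *block)
{
	if (--block->refs > 0)
		return;

	for (size_t i = 0; i < block->nr_children; i++)
		example_drop_ref(block->children[i]);
	block->freed = 1;
}
```

Dropping the old root after a COW frees only the old root itself, because the copy still holds a reference on every shared child. The children go away once the last root pointing at them is dropped.
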
The traversal and freeing of the tree may be done in pieces by inserting
a progress record in the root tree. The progress record indicates the
last key and level touched by the traversal so the current transaction
can commit and the traversal can resume in the next transaction. If the
system crashes before the traversal completes, the progress record is
used to safely delete the root on the next mount.

Ohad Rodeh presented this reference counted snapshot algorithm at the
2007 Linux Filesystem and Storage Workshop:

Slides: `LinuxFS_Workshop.pdf <Media:LinuxFS_Workshop.pdf>`__

Paper: `Btree_TOS.pdf <Media:Btree_TOS.pdf>`__

The Btrfs snapshotting implementation is based on the ideas he
presented.

Btrfsck
~~~~~~~

The filesystem checking utility is a crucial tool, but it can be a major
bottleneck in getting systems back online after something has gone
wrong. Btrfs aims to be tolerant of invalid metadata, and will avoid
using metadata it determines to be incorrect. The disk format allows
Btrfs to deal with most corruptions at run time, without crashing the
system and without requiring offline filesystem checking.

An offline btrfsck is being developed, in part to help verify the
filesystem during testing, and as an emergency tool to make sure the
filesystem is safe for mounting. The existing tool only verifies the
extent allocation maps, making sure that reference counts are correct
and that all extents are accounted for. If the extent maps are correct,
there is no risk of incorrectly writing over existing data or metadata
as blocks are allocated for new use.

btrfsck is able to read metadata in roughly disk order. As it scans the
btrees on disk, it collects the locations of nodes and leaves and pulls
them from the disk in large sequential batches. For the most part,
btrfsck is bound by the sequential read throughput of the storage, and
it is able to take advantage of multi-spindle arrays. The price paid for
the extra speed is more RAM. Btrfsck uses about 3x more RAM than
ext2fsck.

@ -64,3 +64,5 @@ Welcome to BTRFS documentation!
btrfs-ioctl
DocConventions
dev-send-stream
dev-btrees
dev-btrfs-design