btrfs-progs: docs: add some design-related documents
Copied from wiki. Signed-off-by: David Sterba <dsterba@suse.com>
parent
5e4a18b4b5
commit
403ba6e6ee
|
@ -0,0 +1,111 @@
|
|||
Btrees
|
||||
======
|
||||
|
||||
Btrees Introduction
|
||||
-------------------
|
||||
|
||||
Btrfs uses a single set of btree manipulation code for all metadata in
|
||||
the filesystem. For performance or organizational purposes, the trees
|
||||
are broken up into a few different types, and each type of tree will
|
||||
hold a few different types of keys. The super block holds pointers to
|
||||
the tree roots of the tree of tree roots and the chunk tree.
|
||||
|
||||
|
||||
Tree of Tree roots
|
||||
------------------
|
||||
|
||||
This tree is used for indexing and finding the root of most of the other
|
||||
trees in the filesystem. It attaches names to subvolumes and snapshots,
|
||||
and stores the location of the extent allocation tree root. It also
|
||||
stores pointers to all of the subvolumes or snapshots that are being
|
||||
deleted by the transaction code. This allows the deletion to pick up
|
||||
where it left off after a crash.
|
||||
|
||||
|
||||
Chunk Tree
|
||||
----------
|
||||
|
||||
The chunk tree does all of the logical to physical block address mapping
|
||||
for the filesystem, and it stores information about all of the devices
|
||||
in the FS. In order to bootstrap lookup in the chunk tree, the super
|
||||
block also duplicates the chunk items needed to resolve blocks in the
|
||||
chunk tree. Over time, the chunk tree will be split into multiple roots
|
||||
to allow access of larger storage pools.
|
||||
|
||||
There are back references from the chunk items to the extent tree that
|
||||
allocated them. Only a single extent tree can allocate extents out of a
|
||||
given chunk.
|
||||
|
||||
Two types of key are stored in the chunk tree:
|
||||
|
||||
- DEV_ITEM (where the offset field is the internal devid), which
  contains information on all of the underlying block devices in the
  filesystem
- CHUNK_ITEM (where the offset field is the start of the chunk as a
  virtual address), which maps a section of the virtual address space
  (a chunk) into physical storage.
|
||||
|
||||
|
||||
Device Allocation Tree
|
||||
----------------------
|
||||
|
||||
The device allocation tree records which parts of each physical device
|
||||
have been allocated into chunks. This is a relatively small tree that is
|
||||
only updated as new chunks are allocated. It stores back references to
|
||||
the chunk tree that allocated each physical extent on the device.
|
||||
|
||||
|
||||
Extent Allocation Tree
|
||||
----------------------
|
||||
|
||||
The extent allocation tree records byte ranges that are in use,
|
||||
maintains reference counts on each extent and records back references to
|
||||
the tree or file that is using each extent. Logical block groups are
|
||||
created inside the extent allocation tree, and these reference large
|
||||
logical extents from the chunk tree.
|
||||
|
||||
Each block group can only store a specific type of extent. This might
|
||||
include metadata, or mirrored metadata, or striped data blocks etc.
|
||||
|
||||
Currently there is only one extent allocation tree shared by all the
|
||||
other trees. This will change in order to scale better under load.
|
||||
|
||||
Keys for the extent tree use the start of the extent as the objectid. A BLOCK_GROUP_ITEM key will be followed by the EXTENT_ITEM keys for extents within that block group.
|
||||
|
||||
|
||||
FS Trees
|
||||
--------
|
||||
|
||||
These store files and directories, and all of the normal metadata you
|
||||
would expect to find in a filesystem. There is one root for each
|
||||
subvolume or snapshot, but snapshots will share blocks between roots.
|
||||
|
||||
Keys in FS trees always use the inode number of the filesystem object as the objectid.
|
||||
|
||||
Each object will have one or more of:
|
||||
|
||||
- Inode.
|
||||
- Inode ref, indicating what name this object is known as, and in which
|
||||
directory.
|
||||
- For files, a set of extent information, indicating where on the
|
||||
filesystem this file's data is.
|
||||
- For directories, two sequences of dir_items, one indexed by a hash of
|
||||
the object name, and one indexed by a unique sequential index number.
|
||||
|
||||
|
||||
Checksum Tree
|
||||
-------------
|
||||
|
||||
The checksum tree stores block checksums. Every 4k block of data stored
|
||||
on disk has a checksum associated with it. The "offset" part of the keys
|
||||
in the checksum tree indicates the start of the checksummed data on
|
||||
disk. The value stored with the key is a sequence of (currently 4-byte)
|
||||
checksums, for the 4k blocks starting at the offset.
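
The lookup arithmetic implied above can be sketched as follows. This is a
minimal illustration, not the Btrfs implementation: the function name and
the ``TOY_`` constants are hypothetical, and the block and checksum sizes
are simply the 4k and 4-byte figures quoted in the paragraph.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_BLOCK_SIZE 4096u /* "every 4k block of data" */
#define TOY_CSUM_SIZE  4u    /* "currently 4-byte" checksums */

/*
 * Return the byte position, inside a checksum item's value, of the
 * checksum for the block containing disk address `bytenr`, or -1 if
 * the item does not cover that address. `key_offset` is the offset
 * part of the item's key; `item_size` is the length of its value.
 */
int64_t toy_csum_position(uint64_t key_offset, uint32_t item_size,
                          uint64_t bytenr)
{
    uint64_t index;

    assert(item_size % TOY_CSUM_SIZE == 0); /* whole checksums only */
    if (bytenr < key_offset)
        return -1;
    index = (bytenr - key_offset) / TOY_BLOCK_SIZE;
    if (index >= item_size / TOY_CSUM_SIZE)
        return -1;
    return (int64_t)(index * TOY_CSUM_SIZE);
}
```

An item keyed at offset 1MiB whose value is 16 bytes long would, under
these assumptions, cover four consecutive 4k blocks starting at that
address.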
|
||||
|
||||
|
||||
Data Relocation Tree
|
||||
--------------------
|
||||
|
||||
|
||||
Log Root Tree
|
||||
-------------
|
|
@ -0,0 +1,482 @@
|
|||
Btrfs design
|
||||
============
|
||||
|
||||
Btrfs is implemented with simple and well known constructs. It should
|
||||
perform well, but the long term goal of maintaining performance as the
|
||||
filesystem ages and grows is more important than winning a short-lived
|
||||
benchmark. To that end, benchmarks are being used to try to simulate
|
||||
performance over the life of a filesystem.
|
||||
|
||||
|
||||
Btree Data structures
|
||||
---------------------
|
||||
|
||||
The Btrfs btree provides a generic facility to store a variety of data
|
||||
types. Internally it only knows about three data structures: keys,
|
||||
items, and a block header:
|
||||
|
||||
.. code-block::

   struct btrfs_header {
           u8 csum[32];            /* checksum of the block contents */
           u8 fsid[16];            /* uuid of the filesystem owning the block */
           __le64 bytenr;          /* where this block is supposed to live */
           __le64 flags;

           u8 chunk_tree_uid[16];
           __le64 generation;      /* transaction id that allocated the block */
           __le64 owner;           /* objectid of the tree owning the block */
           __le32 nritems;         /* number of items in the block */
           u8 level;               /* level of the block in the tree */
   };

.. code-block::

   struct btrfs_disk_key {
           __le64 objectid;
           u8 type;                /* type of data stored in the item */
           __le64 offset;
   };

.. code-block::

   struct btrfs_item {
           struct btrfs_disk_key key;
           __le32 offset;          /* where in the leaf the item data starts */
           __le32 size;            /* length of the item data */
   };
|
||||
|
||||
Upper nodes of the trees contain only [ key, block pointer ] pairs. Tree
|
||||
leaves are broken up into two sections that grow toward each other.
|
||||
Leaves have an array of fixed sized items, and an area where item data
|
||||
is stored. The offset and size fields in the item indicate where in the
|
||||
leaf the item data can be found. Example:
|
||||
|
||||
.. figure:: Leaf-structure.png
   :alt: Leaf-structure.png

   Leaf-structure.png
|
||||
|
||||
Item data is variably sized, and various filesystem data structures are
|
||||
defined as different types of item data. The type field in struct
|
||||
btrfs_disk_key indicates the type of data stored in the item.
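
The two-sided leaf layout can be sketched with a toy model in which the
item array grows forward from the header and item data grows backward from
the end of the block. Everything here is an assumption made for
illustration: the ``toy_`` names, the 4096-byte leaf, and the header and
item sizes are not taken from the text, and the sketch only appends items
rather than inserting them in key order.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sizes only; none of these are taken from the text. */
#define TOY_LEAF_SIZE   4096u
#define TOY_HEADER_SIZE 101u
#define TOY_ITEM_SIZE   25u
#define TOY_MAX_ITEMS   ((TOY_LEAF_SIZE - TOY_HEADER_SIZE) / TOY_ITEM_SIZE)

struct toy_item {
    uint32_t offset; /* start of this item's data, from the leaf start */
    uint32_t size;   /* length of this item's data */
};

struct toy_leaf {
    uint32_t nritems;
    struct toy_item items[TOY_MAX_ITEMS];
};

/* Lowest byte used by item data; the end of the leaf when it is empty. */
uint32_t toy_leaf_data_start(const struct toy_leaf *leaf)
{
    return leaf->nritems ? leaf->items[leaf->nritems - 1].offset
                         : TOY_LEAF_SIZE;
}

/* Bytes left between the end of the item array and the item data. */
uint32_t toy_leaf_free_space(const struct toy_leaf *leaf)
{
    uint32_t items_end = TOY_HEADER_SIZE + leaf->nritems * TOY_ITEM_SIZE;

    assert(toy_leaf_data_start(leaf) >= items_end);
    return toy_leaf_data_start(leaf) - items_end;
}

/* Append an item with `size` data bytes; 0 on success, -1 if it won't fit. */
int toy_leaf_append(struct toy_leaf *leaf, uint32_t size)
{
    if (leaf->nritems >= TOY_MAX_ITEMS ||
        (uint64_t)size + TOY_ITEM_SIZE > toy_leaf_free_space(leaf))
        return -1;
    leaf->items[leaf->nritems].offset = toy_leaf_data_start(leaf) - size;
    leaf->items[leaf->nritems].size = size;
    leaf->nritems++;
    return 0;
}
```

Each append consumes one fixed-size slot at the front plus ``size`` bytes at
the back, which is why the two sections meet in the middle when the leaf
fills up.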
|
||||
|
||||
The block header contains a checksum for the block contents, the uuid of
|
||||
the filesystem that owns the block, the level of the block in the tree,
|
||||
and the block number where this block is supposed to live. These fields
|
||||
allow the contents of the metadata to be verified when the data is read.
|
||||
Everything that points to a btree block also stores the generation field
|
||||
it expects that block to have. This allows Btrfs to detect phantom or
|
||||
misplaced writes on the media.
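
A read-time check along these lines might look like the sketch below. It
is an assumption-laden illustration: ``toy_header`` keeps only the fields
the check needs, the ``verify_block`` name and result codes are
hypothetical, and checksum verification is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Cut-down header: only the fields this check looks at. */
struct toy_header {
    uint8_t  fsid[16];
    uint64_t bytenr;
    uint64_t generation;
    uint8_t  level;
};

enum verify_result {
    BLOCK_OK = 0,
    BAD_FSID,       /* block belongs to another filesystem */
    BAD_BYTENR,     /* misplaced write: block landed somewhere else */
    BAD_LEVEL,
    BAD_GENERATION  /* phantom write: stale contents at this address */
};

/*
 * Compare what was read from `read_from` with what the pointer that led
 * here expected. The checksum step is omitted from this sketch.
 */
enum verify_result verify_block(const struct toy_header *hdr,
                                const uint8_t fsid[16], uint64_t read_from,
                                uint8_t expected_level,
                                uint64_t expected_generation)
{
    assert(hdr != NULL);
    if (memcmp(hdr->fsid, fsid, 16) != 0)
        return BAD_FSID;
    if (hdr->bytenr != read_from)
        return BAD_BYTENR;
    if (hdr->level != expected_level)
        return BAD_LEVEL;
    if (hdr->generation != expected_generation)
        return BAD_GENERATION;
    return BLOCK_OK;
}
```

A write that never reached the media leaves an older block in place, which
shows up here as a generation mismatch even though the old block's own
checksum is still valid.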
|
||||
|
||||
The checksum of the lower node is not stored in the node pointer to
|
||||
simplify the FS writeback code. The generation number will be known at
|
||||
the time the block is inserted into the btree, but the checksum is only
|
||||
calculated before writing the block to disk. Using the generation will
|
||||
allow Btrfs to detect phantom writes without having to find and update
|
||||
the upper node each time the lower node checksum is updated.
|
||||
|
||||
The generation field corresponds to the transaction id that allocated
|
||||
the block, which enables easy incremental backups and is used by the
|
||||
copy on write transaction subsystem.
|
||||
|
||||
|
||||
Filesystem Data Structures
|
||||
--------------------------
|
||||
|
||||
Each object in the filesystem has an objectid, which is allocated
dynamically on creation. A free objectid is simply a hole in the key
space of the filesystem btree, that is, an objectid that does not
already exist in the tree. The objectid makes up the most significant
bits of the key, allowing all of the items for a given filesystem object
to be logically grouped together in the btree.
|
||||
|
||||
The offset field of the key indicates the byte offset for a
|
||||
particular item in the object. For file extents, this would be the byte
|
||||
offset of the start of the extent in the file. The type field stores the
|
||||
item type information, and has extra room for expanded use.
|
||||
|
||||
Inodes
|
||||
------
|
||||
|
||||
Inodes are stored in struct btrfs_inode_item at offset zero in the key,
|
||||
and have a type value of one. Inode items are always the lowest valued
|
||||
key for a given object, and they store the traditional stat data for
|
||||
files and directories. The inode structure is relatively small, and will
|
||||
not contain embedded file data or extended attribute data. These things
|
||||
are stored in other item types.
|
||||
|
||||
Files
|
||||
-----
|
||||
|
||||
Small files that occupy less than one leaf block may be packed into the
|
||||
btree inside the extent item. In this case the key offset is the byte
|
||||
offset of the data in the file, and the size field of struct btrfs_item
|
||||
indicates how much data is stored. There may be more than one of these
|
||||
per file.
|
||||
|
||||
Larger files are stored in extents. struct btrfs_file_extent_item
|
||||
records a generation number for the extent and a [ disk block, disk num
|
||||
blocks ] pair to record the area of disk corresponding to the file.
|
||||
Extents also store the logical offset and the number of blocks used by
|
||||
this extent record into the extent on disk. This allows Btrfs to satisfy
|
||||
a rewrite into the middle of an extent without having to read the old
|
||||
file data first. For example, writing 1MB into the middle of an existing
128MB extent may result in three extent records:

``[ old extent: bytes 0-64MB ], [ new extent: 1MB ], [ old extent: bytes 65MB-128MB ]``
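
The split can be sketched as below. This is a simplified model, not the
on-disk struct btrfs_file_extent_item: ``toy_extent_rec`` and
``toy_split_for_write`` are hypothetical names, sizes are in bytes rather
than blocks, and the location of the newly allocated extent is passed in
by the caller.

```c
#include <assert.h>
#include <stdint.h>

#define MB (1024ull * 1024ull)

struct toy_extent_rec {
    uint64_t file_offset;   /* key offset: byte offset in the file */
    uint64_t disk_start;    /* start of the extent on disk */
    uint64_t extent_offset; /* logical offset into the on-disk extent */
    uint64_t num_bytes;     /* bytes of the extent used by this record */
};

/*
 * Rewrite [write_off, write_off + write_len) inside `old` without reading
 * the old data. Fills `out` with up to three records and returns how many
 * were produced; the head or tail is skipped when it would be empty.
 */
int toy_split_for_write(const struct toy_extent_rec *old, uint64_t write_off,
                        uint64_t write_len, uint64_t new_disk_start,
                        struct toy_extent_rec out[3])
{
    uint64_t old_end = old->file_offset + old->num_bytes;
    uint64_t write_end = write_off + write_len;
    int n = 0;

    assert(write_len > 0);
    assert(write_off >= old->file_offset && write_end <= old_end);

    if (write_off > old->file_offset) {        /* head of the old extent */
        out[n] = *old;
        out[n].num_bytes = write_off - old->file_offset;
        n++;
    }
    out[n].file_offset = write_off;            /* newly written data */
    out[n].disk_start = new_disk_start;
    out[n].extent_offset = 0;
    out[n].num_bytes = write_len;
    n++;
    if (write_end < old_end) {                 /* tail of the old extent */
        out[n] = *old;
        out[n].file_offset = write_end;
        out[n].extent_offset = old->extent_offset +
                               (write_end - old->file_offset);
        out[n].num_bytes = old_end - write_end;
        n++;
    }
    return n;
}
```

The head and tail records keep pointing at the original on-disk extent and
differ only in their offset and length, which is what lets the old data
stay where it is.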
|
||||
|
||||
File data checksums are stored in a dedicated btree in a struct
|
||||
btrfs_csum_item. The offset of the key corresponds to the byte number of
|
||||
the extent. The data is checksummed after any compression or encryption
|
||||
is done and it reflects the bytes sent to the disk.
|
||||
|
||||
A single item may store a number of checksums. struct btrfs_csum_items
|
||||
are only used for file extents. File data inline in the btree is covered
|
||||
by the checksum at the start of the btree block.
|
||||
|
||||
Directories
|
||||
-----------
|
||||
|
||||
Directories are indexed in two different ways. For filename lookup,
|
||||
there is an index comprised of keys:
|
||||
|
||||
================== ================== ====================
|
||||
Directory Objectid BTRFS_DIR_ITEM_KEY 64 bit filename hash
|
||||
================== ================== ====================
|
||||
|
||||
The default directory hash used is crc32c, although other hashes may be
|
||||
added later on. A flags field in the super block will indicate which
|
||||
hash is used for a given FS.
|
||||
|
||||
The second directory index is used by readdir to return data in inode
|
||||
number order. This more closely resembles the order of blocks on disk
|
||||
and generally provides better performance for reading data in bulk
|
||||
(backups, copies, etc). Also, it allows fast checking that a given inode
|
||||
is linked into a directory when verifying inode link counts. This index
|
||||
uses an additional set of keys:
|
||||
|
||||
================== =================== =====================
|
||||
Directory Objectid BTRFS_DIR_INDEX_KEY Inode Sequence number
|
||||
================== =================== =====================
|
||||
|
||||
The inode sequence number comes from the directory. It is increased each
|
||||
time a new file or directory is added.
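
The two keys created for one directory entry can be sketched as follows.
Several details are assumptions: the ``toy_`` names are hypothetical, the
numeric key type values (84 and 96) should be checked against the Btrfs
headers, and the CRC-32C here uses the conventional all-ones seed and final
inversion, which may not match the seed Btrfs uses on disk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed numeric values; verify against the Btrfs headers. */
#define TOY_DIR_ITEM_KEY  84u
#define TOY_DIR_INDEX_KEY 96u

struct toy_key {
    uint64_t objectid;
    uint8_t  type;
    uint64_t offset;
};

/* Bitwise CRC-32C (Castagnoli), conventional seed and final inversion. */
uint32_t toy_crc32c(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Key used for lookup by name: offset is the hash of the filename. */
struct toy_key toy_dir_item_key(uint64_t dir_objectid, const char *name,
                                size_t name_len)
{
    struct toy_key key = { dir_objectid, TOY_DIR_ITEM_KEY,
                           toy_crc32c(name, name_len) };

    assert(name != NULL);
    return key;
}

/* Key used by readdir: offset is the directory's sequence number. */
struct toy_key toy_dir_index_key(uint64_t dir_objectid, uint64_t sequence)
{
    struct toy_key key = { dir_objectid, TOY_DIR_INDEX_KEY, sequence };

    return key;
}
```

Both keys share the directory's objectid, so the hashed entries and the
sequence-ordered entries form two separate runs within the same directory's
items.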
|
||||
|
||||
|
||||
Reference Counted Extents
|
||||
-------------------------
|
||||
|
||||
Reference counting is the basis for the snapshotting subsystems. For
|
||||
every extent allocated to a btree or a file, Btrfs records the number of
|
||||
references in a struct btrfs_extent_item. The trees that hold these
|
||||
items also serve as the allocation map for blocks that are in use on the
|
||||
filesystem. Some trees are not reference counted and are only protected
|
||||
by copy on write logging. However, the same type of extent item is
used for all allocated blocks on the disk.
|
||||
|
||||
A reasonably comprehensive description of the way that references work
|
||||
can be found in `this email from Josef
|
||||
Bacik <http://www.spinics.net/lists/linux-btrfs/msg33415.html>`__.
|
||||
|
||||
|
||||
Extent Block Groups
|
||||
-------------------
|
||||
|
||||
Extent block groups allow allocator optimizations by breaking the disk
|
||||
up into chunks of 256MB or more. For each chunk, they record information
|
||||
about the number of blocks available. Files and directories will have a
|
||||
preferred block group which they try first for allocations.
|
||||
|
||||
Block groups have a flag that indicates if they are preferred for data or
|
||||
metadata allocations, and at mkfs time the disk is broken up into
|
||||
alternating metadata (33% of the disk) and data groups (66% of the
|
||||
disk). As the disk fills, a group's preference may change back and
|
||||
forth, but Btrfs always tries to avoid intermixing data and metadata
|
||||
extents in the same group. This substantially improves fsck throughput,
|
||||
and reduces seeks during writeback while the FS is mounted. It does
|
||||
slightly increase the seeks while reading.
|
||||
|
||||
|
||||
Extent Trees and DM integration
|
||||
-------------------------------
|
||||
|
||||
The Btrfs extent trees are intended to divide up the available storage
|
||||
into a number of flexible allocation policies. Each extent tree owns a
|
||||
section of the underlying disk, and they can be assigned to a collection
|
||||
of (or a single) tree roots, directories or inodes. Policies will direct
|
||||
how a given allocation is spread across the extent trees available,
|
||||
allowing the admin to direct which parts of the filesystem are striped,
|
||||
mirrored or confined to a given device.
|
||||
|
||||
Btrfs will try to tie in with DM in order to easily manage large pools
|
||||
of storage. The basic idea is to have at least one extent tree per
|
||||
spindle, and then allow the admin to assign those extent trees to
|
||||
subvolumes, directories or files.
|
||||
|
||||
|
||||
Explicit Back References
|
||||
------------------------
|
||||
|
||||
Back references have three main goals:
|
||||
|
||||
- Differentiate between all holders of references to an extent so that
|
||||
when a reference is dropped we can make sure it was a valid reference
|
||||
before freeing the extent.
|
||||
- Provide enough information to quickly find the holders of an extent
|
||||
if we notice a given block is corrupted or bad.
|
||||
- Make it easy to migrate blocks for FS shrinking or storage pool
|
||||
maintenance. This is actually the same as the second goal, but with a slightly
|
||||
different use case.
|
||||
|
||||
|
||||
File Extent Backrefs
|
||||
^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
File extents can be referenced by:
|
||||
|
||||
- Multiple snapshots, subvolumes, or different generations in one
|
||||
subvol
|
||||
- Different files inside a single subvolume
|
||||
- Different offsets inside a file
|
||||
|
||||
.. note::
|
||||
The remainder of this section refers to the extent_ref_v0 structure, which is not used on current btrfs filesystems.
|
||||
|
||||
The extent ref structure has fields for:
|
||||
|
||||
- Objectid of the subvolume root
|
||||
- Generation number of the tree holding the reference
|
||||
- objectid of the file holding the reference
|
||||
- offset in the file corresponding to the key holding the reference
|
||||
|
||||
When a file extent is allocated the fields are filled in:
|
||||
|
||||
(root objectid, transaction id, inode objectid, offset in file)
|
||||
|
||||
When a leaf is cow'd new references are added for every file extent
|
||||
found in the leaf. It looks the same as the create case, but the
|
||||
transaction id will be different when the block is cow'd.
|
||||
|
||||
(root objectid, transaction id, inode objectid, offset in file)
|
||||
|
||||
When a file extent is removed either during snapshot deletion or file
|
||||
truncation, the corresponding back reference is found by searching for:
|
||||
|
||||
(btrfs_header_owner(leaf), btrfs_header_generation(leaf), inode
|
||||
objectid, offset in file)
|
||||
|
||||
|
||||
Btree Extent Backrefs
|
||||
^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Btree extents can be referenced by:
|
||||
|
||||
- Different subvolumes
|
||||
- Different generations of the same subvolume
|
||||
|
||||
Storing sufficient information for a full reverse mapping of a btree
|
||||
block would require storing the lowest key of the block in the backref,
|
||||
and it would require updating that lowest key either before write out or
|
||||
every time it changed.
|
||||
|
||||
Instead, the objectid of the lowest key is stored along with the level
|
||||
of the tree block. This provides a hint about where in the btree the
|
||||
block can be found. Searches through the btree only need to look for a
|
||||
pointer to that block, and they stop one level higher than the level
|
||||
recorded in the backref.
|
||||
|
||||
Some btrees do not do reference counting on their extents. These include
|
||||
the extent tree and the tree of tree roots. Backrefs for these trees
|
||||
always have a generation of zero.
|
||||
|
||||
When a tree block is created, back references are inserted:
|
||||
|
||||
(root objectid, transaction id or zero, level, lowest objectid)
|
||||
|
||||
The level is stored in the objectid slot of the backref to differentiate
|
||||
between Btree back references and file data back references. The highest
|
||||
possible level is 255, and the lowest possible file objectid has been
|
||||
raised to 256. So, if the objectid field in the back reference is less
|
||||
than 256, it corresponds to a Btree block.
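
The way the shared objectid slot separates the two kinds of back reference
can be sketched as below. The struct and function names are hypothetical,
and the four-field layout follows the extent_ref_v0 description in this
section, which is not used on current filesystems.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_FIRST_FILE_OBJECTID 256u /* "lowest possible file objectid" */

/* Four 64-bit fields, as described for the old back reference format. */
struct toy_backref {
    uint64_t root;       /* root objectid */
    uint64_t generation; /* transaction id, or zero */
    uint64_t objectid;   /* inode objectid, or tree level */
    uint64_t offset;     /* offset in file, or lowest objectid */
};

/* Back reference for a tree block: the level goes in the objectid slot. */
struct toy_backref toy_tree_backref(uint64_t root, uint64_t generation,
                                    uint8_t level, uint64_t lowest_objectid)
{
    struct toy_backref ref = { root, generation, level, lowest_objectid };

    return ref;
}

/* Back reference for file data: the inode objectid must be >= 256. */
struct toy_backref toy_file_backref(uint64_t root, uint64_t generation,
                                    uint64_t inode, uint64_t file_offset)
{
    struct toy_backref ref = { root, generation, inode, file_offset };

    assert(inode >= TOY_FIRST_FILE_OBJECTID);
    return ref;
}

/* A level always fits below 256, so the slot's value tells the two apart. */
bool toy_backref_is_tree_block(const struct toy_backref *ref)
{
    return ref->objectid < TOY_FIRST_FILE_OBJECTID;
}
```

Because a level is stored as a single byte it can never reach 256, so no
extra flag is needed to tell the two forms apart.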
|
||||
|
||||
When a tree block is cow'd in a reference counted root, new back
|
||||
references are added for all the blocks it points to:
|
||||
|
||||
(root objectid, transaction id, level, lowest objectid)
|
||||
|
||||
Because the lowest_key_objectid and the level are just hints they are
|
||||
not used when backrefs are deleted. When a snapshot is created a new
|
||||
reference is taken directly on the root block. This means the owner
|
||||
field of the root block may be different from the objectid of the
|
||||
snapshot. So, when dropping references on tree roots, the objectid of
|
||||
the root structure is always used. When a backref is deleted:
|
||||
|
||||
.. code-block::
|
||||
|
||||
if backref was for a tree root:
|
||||
root_objectid = root->root_key.objectid
|
||||
else
|
||||
root_objectid = btrfs_header_owner(parent)
|
||||
|
||||
(root_objectid, btrfs_header_generation(parent) or zero, 0, 0)
|
||||
|
||||
|
||||
Back Reference Key Construction
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Back references have four fields, each 64 bits long. This is hashed into
|
||||
a single 64 bit number and placed into the key offset. The key objectid
|
||||
corresponds to the first byte in the extent, and the key type is set to
|
||||
BTRFS_EXTENT_REF_KEY.
|
||||
|
||||
Hash overflows on the offset field are handled by adding one to the
|
||||
calculated hash and searching forward. The searching stops when the
|
||||
correct back reference structure is found or
|
||||
|
||||
|
||||
Snapshots and Subvolumes
|
||||
------------------------
|
||||
|
||||
Subvolumes are basically named btrees that hold files and directories.
|
||||
They have inodes inside the tree of tree roots and can have non-root
|
||||
owners and groups. Subvolumes can be given a quota of blocks, and once
|
||||
this quota is reached no new writes are allowed. All of the blocks and
|
||||
file extents inside of subvolumes are reference counted to allow
|
||||
snapshotting. Up to 2\ :sup:`64` subvolumes may be created on the FS.
|
||||
|
||||
Snapshots are identical to subvolumes, but their root block is initially
|
||||
shared with another subvolume. When the snapshot is taken, the reference
|
||||
count on the root block is increased, and the copy on write transaction
|
||||
system ensures changes made in either the snapshot or the source
|
||||
subvolume are private to that root. Snapshots are writable, and they can
|
||||
be snapshotted again any number of times. If read only snapshots are
|
||||
desired, their block quota is set to one at creation time.
|
||||
|
||||
|
||||
Btree Roots
|
||||
-----------
|
||||
|
||||
Each Btrfs filesystem consists of a number of tree roots. A freshly
|
||||
formatted filesystem will have roots for:
|
||||
|
||||
- The tree of tree roots
|
||||
- The tree of allocated extents
|
||||
- The default subvolume tree
|
||||
|
||||
The tree of tree roots records the root block for the extent tree and
|
||||
the root blocks and names for each subvolume and snapshot tree. As
|
||||
transactions commit, the root block pointers are updated in this tree to
|
||||
reference the new roots created by the transaction, and then the new
|
||||
root block of this tree is recorded in the FS super block.
|
||||
|
||||
The tree of tree roots acts as a directory of all the other trees on the
|
||||
filesystem, and it has directory items recording the names of all
|
||||
snapshots and subvolumes in the FS. Each snapshot or subvolume has an
|
||||
objectid in the tree of tree roots, and at least one corresponding
|
||||
struct btrfs_root_item. Directory items in the tree map names of
|
||||
snapshots and subvolumes to these root items. Because the root item key
|
||||
is updated with every transaction commit, the directory items reference
|
||||
a generation number of (u64)-1, which tells the lookup code to find the
|
||||
most recent root available.
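
The "most recent root" lookup can be sketched over a plain array of root
items. This is only a model of the rule described above: the ``toy_``
names are hypothetical, and a linear scan stands in for the btree search.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_LATEST UINT64_MAX /* the (u64)-1 generation in a directory item */

struct toy_root_item {
    uint64_t objectid;   /* subvolume or snapshot objectid */
    uint64_t generation; /* transaction id stored in the root item key */
    uint64_t root_block; /* where that version of the tree root lives */
};

/*
 * Find the root item for `objectid`. An exact generation must match; the
 * TOY_LATEST wildcard selects the highest generation present. Returns
 * NULL if nothing matches.
 */
const struct toy_root_item *toy_find_root(const struct toy_root_item *items,
                                          size_t n, uint64_t objectid,
                                          uint64_t generation)
{
    const struct toy_root_item *best = NULL;

    assert(items != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (items[i].objectid != objectid)
            continue;
        if (generation != TOY_LATEST) {
            if (items[i].generation == generation)
                return &items[i];
        } else if (!best || items[i].generation > best->generation) {
            best = &items[i];
        }
    }
    return best;
}
```

Storing the wildcard in the directory item means the name-to-root mapping
never has to be rewritten when a commit produces a new root item.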
|
||||
|
||||
The extent trees are used to manage allocated space on the devices. The
|
||||
space available can be divided between a number of extent trees to
|
||||
reduce lock contention and give different allocation policies to
|
||||
different block ranges.
|
||||
|
||||
The diagram below depicts a collection of tree roots. The super block
|
||||
points to the root tree, and the root tree points to the extent trees
|
||||
and subvolumes. The root tree also has a directory to map subvolume
|
||||
names to struct btrfs_root_items in the root tree. This filesystem has
|
||||
one subvolume named 'default' (created by mkfs), and one snapshot of
|
||||
'default' named 'snap' (created by the admin some time later). In this
|
||||
example, 'default' has not changed since the snapshot was created and so
|
||||
both trees point to the same root block on disk.
|
||||
|
||||
.. figure:: Copy-Design-r.png
   :alt: Copy-Design-r.png

   Copy-Design-r.png
|
||||
|
||||
|
||||
Copy on Write Logging
|
||||
---------------------
|
||||
|
||||
Data and metadata in Btrfs are protected with copy on write logging
|
||||
(COW). Once the transaction that allocated the space on disk has
|
||||
committed, any new writes to that logical address in the file or btree
|
||||
will go to a newly allocated block, and block pointers in the btrees and
|
||||
super blocks will be updated to reflect the new location.
|
||||
|
||||
Some of the btrfs trees do not use reference counting for their
|
||||
allocated space. This includes the root tree, and the extent trees. As
|
||||
blocks are replaced in these trees, the old block is freed in the extent
|
||||
tree. These blocks are not reused for other purposes until the
|
||||
transaction that freed them commits.
|
||||
|
||||
All subvolume (and snapshot) trees are reference counted. When a COW
|
||||
operation is performed on a btree node, the reference count of all the
|
||||
blocks it points to is increased by one. For leaves, the reference
|
||||
counts of any file extents in the leaf are increased by one. When the
|
||||
transaction commits, a new root pointer is inserted in the root tree for
|
||||
each new subvolume root. The key used has the form:
|
||||
|
||||
====================== =================== ==============
|
||||
Subvolume inode number BTRFS_ROOT_ITEM_KEY Transaction ID
|
||||
====================== =================== ==============
|
||||
|
||||
The updated btree blocks are all flushed to disk, and then the super
|
||||
block is updated to point to the new root tree. Once the super block has
|
||||
been properly written to disk, the transaction is considered complete.
|
||||
At this time the root tree has two pointers for each subvolume changed
|
||||
during the transaction. One item points to the new tree and one points
|
||||
to the tree that existed at the start of the last transaction.
|
||||
|
||||
Any time after the commit finishes, the older subvolume root items may
|
||||
be removed. The reference count on the subvolume root block is lowered
|
||||
by one. If the reference count reaches zero, the block is freed and the
|
||||
reference count on any nodes the root points to is lowered by one. If a
|
||||
tree node or leaf can be freed, it is traversed to free the nodes or
|
||||
extents below it in the tree in a depth first fashion.
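
The reference drop described above can be sketched on an in-memory tree.
This is a toy model under stated assumptions: ``toy_node``, its fixed
fanout and the recursive walk are hypothetical, and a real implementation
would also have to handle file extents and commit boundaries.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_FANOUT 4

struct toy_node {
    int refs;                           /* reference count on this block */
    int freed;                          /* set once the block is released */
    struct toy_node *child[TOY_FANOUT]; /* NULL for unused slots */
};

/*
 * Drop one reference on `node`. If the count reaches zero the block is
 * freed and one reference is dropped on each block it points to, depth
 * first. Returns the number of blocks freed.
 */
int toy_drop_ref(struct toy_node *node)
{
    int freed = 1;

    assert(node != NULL && node->refs > 0);
    if (--node->refs > 0)
        return 0; /* still shared with another root */
    for (int i = 0; i < TOY_FANOUT; i++)
        if (node->child[i] != NULL)
            freed += toy_drop_ref(node->child[i]);
    node->freed = 1;
    return freed;
}
```

A block shared by two roots survives the deletion of the first and is
released only when the second reference goes away, so the cost of deleting
a snapshot follows the blocks it does not share.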
|
||||
|
||||
The traversal and freeing of the tree may be done in pieces by inserting
|
||||
a progress record in the root tree. The progress record indicates the
|
||||
last key and level touched by the traversal so the current transaction
|
||||
can commit and the traversal can resume in the next transaction. If the
|
||||
system crashes before the traversal completes, the progress record is
|
||||
used to safely delete the root on the next mount.
|
||||
|
||||
Ohad Rodeh presented this reference counted snapshot algorithm at the
|
||||
2007 Linux Filesystem and Storage Workshop:
|
||||
|
||||
Slides: `LinuxFS_Workshop.pdf <Media:LinuxFS_Workshop.pdf>`__
|
||||
|
||||
Paper: `Btree_TOS.pdf <Media:Btree_TOS.pdf>`__
|
||||
|
||||
The Btrfs snapshotting implementation is based on the ideas he
|
||||
presented.
|
||||
|
||||
Btrfsck
|
||||
~~~~~~~
|
||||
|
||||
The filesystem checking utility is a crucial tool, but it can be a major
|
||||
bottleneck in getting systems back online after something has gone
|
||||
wrong. Btrfs aims to be tolerant of invalid metadata, and will avoid
|
||||
using metadata it determines to be incorrect. The disk format allows
|
||||
Btrfs to deal with most corruptions at run time, without crashing the
|
||||
system and without requiring offline filesystem checking.
|
||||
|
||||
An offline btrfsck is being developed, in part to help verify the
|
||||
filesystem during testing, and as an emergency tool to make sure the
|
||||
filesystem is safe for mounting. The existing tool only verifies the
|
||||
extent allocation maps, making sure that reference counts are correct
|
||||
and that all extents are accounted for. If the extent maps are correct,
|
||||
there is no risk of incorrectly writing over existing data or metadata
|
||||
as blocks are allocated for new use.
|
||||
|
||||
btrfsck is able to read metadata in roughly disk order. As it scans the
|
||||
btrees on disk, it collects the locations of nodes and leaves and pulls
|
||||
them from the disk in large sequential batches. For the most part,
|
||||
btrfsck is bound by the sequential read throughput of the storage, and
|
||||
it is able to take advantage of multi-spindle arrays. The price paid for
|
||||
the extra speed is more RAM. Btrfsck uses about 3x more RAM than
|
||||
ext2fsck.
|
|
@ -64,3 +64,5 @@ Welcome to BTRFS documentation!
|
|||
btrfs-ioctl
|
||||
DocConventions
|
||||
dev-send-stream
|
||||
dev-btrees
|
||||
dev-btrfs-design
|
||||
|
|