btrfs-progs: docs: add document for RAID stripe tree design
Add a document describing the layout and functionality of the newly
introduced RAID Stripe Tree.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
BTRFS RAID Stripe Tree Design
=============================


Problem Statement
-----------------


RAID on zoned devices
---------------------
The current implementation of RAID profiles in BTRFS is based on the implicit
assumption that data placement is deterministic in the device chunks used for
mapping block groups.
With deterministic data placement, all physical on-disk extents of one logical
file extent are positioned at the same offset relative to the starting LBA of
a device chunk. This removes the need to read any meta-data to access an
on-disk file extent. Figure 1 shows an example of this.

+------------+          +------------+
|            |          |            |
|            |          |            |
|            |          |            |
+------------+          +------------+
|            |          |            |
|    D 1     |          |    D 1     |
|            |          |            |
+------------+          +------------+
|            |          |            |
|    D 2     |          |    D 2     |
|            |          |            |
+------------+          +------------+
|            |          |            |
|            |          |            |
|            |          |            |
|            |          |            |
+------------+          +------------+
Figure 1: Deterministic data placement

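The address arithmetic implied by Figure 1 can be written down directly. The
following sketch is illustrative only and not taken from the btrfs sources;
all names in it are made up for the example, and it assumes a simple mirrored
profile where every device extent holds a full copy of the chunk.

#include <stdint.h>

/*
 * Illustrative only: with deterministic placement, the physical location
 * of a block is a fixed function of its logical address, so no additional
 * meta-data lookup is needed.  None of these names exist in btrfs.
 */
struct example_dev_extent {
        uint64_t devid;         /* device this copy of the chunk lives on */
        uint64_t physical;      /* start LBA of the device extent */
};

/* Map a logical address inside a chunk to a physical LBA on one copy. */
static uint64_t example_logical_to_physical(uint64_t logical,
                                            uint64_t chunk_logical,
                                            const struct example_dev_extent *de)
{
        /* The offset relative to the chunk start is the same on every copy. */
        return de->physical + (logical - chunk_logical);
}
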
With non-deterministic data placement, the on-disk extents of a logical file
extent can be scattered around inside the chunk. To read back the data with
non-deterministic data placement, additional meta-data describing the position
of the extents inside a chunk is needed. Figure 2 shows an example of this
style of data placement.

+------------+          +------------+
|            |          |            |
|            |          |            |
|            |          |            |
+------------+          +------------+
|            |          |            |
|    D 1     |          |    D 2     |
|            |          |            |
+------------+          +------------+
|            |          |            |
|    D 2     |          |    D 1     |
|            |          |            |
+------------+          +------------+
|            |          |            |
|            |          |            |
|            |          |            |
|            |          |            |
+------------+          +------------+
Figure 2: Non-deterministic data placement

As BTRFS support for zoned block devices uses the Zone Append operation for
writing file data extents, there is no guarantee that the written extents have
the same offset within different device chunks. This implies that to be able
to use RAID with zoned devices, non-deterministic data placement must be
supported and additional meta-data describing the location of file extents
within device chunks is needed.


Lessons learned from RAID 5
---------------------------
The BTRFS implementation of RAID levels 5 and 6 suffers from the well-known
RAID write hole problem. This problem exists because sub-stripe write
operations are not done using copy-on-write (COW) but using read-modify-write
(RMW). With out-of-place writing like COW, no blocks are overwritten and there
is no risk of exposing bad data or corrupting a data stripe's parity in case
of sudden power loss or other unexpected events that prevent the data from
being written to the device correctly.


RAID Stripe Tree Design overview
--------------------------------

To solve the problems stated above, additional meta-data is introduced: a RAID
Stripe Tree holds the logical to physical translation for the RAID stripes.
For each logical file extent (struct btrfs_file_extent_item) a stripe extent
is created (struct btrfs_stripe_extent). Each btrfs_stripe_extent entry is a
container for an array of struct btrfs_raid_stride. A struct btrfs_raid_stride
holds the device ID and the physical start location on that device of the
sub-stripe of a file extent, as well as the stride's length.
Each struct btrfs_stripe_extent is keyed by the struct btrfs_file_extent_item
disk_bytenr and disk_num_bytes, with disk_bytenr as the objectid for the
btrfs_key and disk_num_bytes as the offset. The key's type is
BTRFS_STRIPE_EXTENT_KEY.

On-disk format
--------------

struct btrfs_file_extent_item {
        /* […] */
        __le64 disk_bytenr; -------------------------------------+
        __le64 disk_num_bytes; ----------------------------------|----+
        /* […] */                                                |    |
};                                                               |    |
                                                                 |    |
struct btrfs_key key = { ----------------------------------------|----|--+
        .objectid = btrfs_file_extent_item::disk_bytenr, <-------+    |  |
        .type = BTRFS_STRIPE_EXTENT_KEY,                              |  |
        .offset = btrfs_file_extent_item::disk_num_bytes, <-----------+  |
};                                                                       |
                                                                         |
struct btrfs_raid_stride {   <-----------------+                        |
        __le64 devid;                          |                        |
        __le64 physical;                       |                        |
        __le64 length;                         |                        |
};                                             |                        |
                                               |                        |
struct btrfs_stripe_extent {  <----------------|------------------------+
        u8 encoding;                           |
        u8 reserved[7];                        |
        struct btrfs_raid_stride strides[]; ---+
};

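To illustrate how these structures fit together, the following sketch
translates an offset inside one file extent into a device and a physical
address by walking the strides of the corresponding stripe extent. It is a
simplified user-space model, not btrfs code: the struct is repeated without
the on-disk endianness handling, the stride count is passed in explicitly,
and the function name is invented. The walk shown only makes sense for
striped encodings such as RAID0, where the strides follow each other in
logical order; for mirrored encodings each copy has its own stride and a
reader would pick one mirror instead.

#include <stdbool.h>
#include <stdint.h>

/* Simplified in-memory version of struct btrfs_raid_stride above. */
struct raid_stride {
        uint64_t devid;
        uint64_t physical;
        uint64_t length;
};

/*
 * Translate an offset inside a stripe extent into (devid, physical).
 * The stripe extent itself would have been found by looking up
 * (disk_bytenr, BTRFS_STRIPE_EXTENT_KEY, disk_num_bytes) in the RAID
 * stripe tree.  Illustrative only, handles striped encodings.
 */
static bool stripe_extent_translate(const struct raid_stride *strides,
                                    int num_strides, uint64_t offset,
                                    uint64_t *devid, uint64_t *physical)
{
        for (int i = 0; i < num_strides; i++) {
                if (offset < strides[i].length) {
                        *devid = strides[i].devid;
                        *physical = strides[i].physical + offset;
                        return true;
                }
                offset -= strides[i].length;
        }
        return false;   /* offset is beyond the stripe extent */
}
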
User-space support
------------------


mkfs
----
Supporting the RAID Stripe Tree in user-space requires three additions to
mkfs. The first step is creating the root of the RAID Stripe Tree itself.
Then mkfs must set the incompat flag, so that a filesystem with a RAID Stripe
Tree cannot be mounted by a kernel version without the appropriate support.
Lastly, it must allow RAID profiles on zoned devices once the tree is present.

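As a rough sketch of the second step, setting the incompat bit could look
roughly like the following. This is illustrative and not the actual mkfs
code: it assumes the btrfs-progs internal headers, the superblock accessors
btrfs_super_incompat_flags()/btrfs_set_super_incompat_flags() and the
BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE flag.

/* Illustrative sketch only -- not the actual mkfs implementation. */
#include "kernel-shared/ctree.h"   /* btrfs-progs internal header (assumed) */

static void example_enable_raid_stripe_tree(struct btrfs_super_block *sb)
{
        u64 flags = btrfs_super_incompat_flags(sb);

        /* Kernels without RAID Stripe Tree support must refuse to mount. */
        flags |= BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE;
        btrfs_set_super_incompat_flags(sb, flags);
}
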

Check
-----
The 'btrfs check' support for the RAID Stripe Tree is not implemented yet, but
its task is to read the struct btrfs_stripe_extent entries for each struct
btrfs_file_extent_item and verify that a correct mapping between the two
exists. If data checksum verification is requested as well, the tree must also
be read to perform the logical to physical translation; otherwise the data
cannot be read and the checksums cannot be verified.

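A possible shape of that check, reusing the simplified struct raid_stride
from the earlier sketch: for every regular file extent item, look up the
matching stripe extent by (disk_bytenr, disk_num_bytes) and verify that the
stride lengths add up to the extent size. The lookup helper and the function
itself are invented for illustration and only cover striped encodings; for
mirrored encodings every stride would instead have to cover the whole extent.

/* Hypothetical fsck pass -- the lookup helper does not exist in btrfs-progs. */
static int example_check_stripe_extent(uint64_t disk_bytenr,
                                       uint64_t disk_num_bytes)
{
        struct raid_stride *strides;
        int num_strides;
        uint64_t covered = 0;

        /* Key: (disk_bytenr, BTRFS_STRIPE_EXTENT_KEY, disk_num_bytes). */
        if (example_lookup_stripe_extent(disk_bytenr, disk_num_bytes,
                                         &strides, &num_strides) < 0)
                return -1;      /* file extent without a stripe extent */

        for (int i = 0; i < num_strides; i++)
                covered += strides[i].length;

        /* For striped encodings the strides must cover the whole extent. */
        return covered == disk_num_bytes ? 0 : -1;
}
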


Example tree dumps
------------------

Example 1: Write 128k to an empty FS

RAID0
    item 0 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 56
        encoding: RAID0
        stripe 0 devid 1 physical XXXXXXXXX length 65536
        stripe 1 devid 2 physical XXXXXXXXX length 65536
RAID1
        stripe 0 devid 1 physical XXXXXXXXX length 131072
        stripe 1 devid 2 physical XXXXXXXXX length 131072
RAID10
    item 0 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 104
        encoding: RAID10
        stripe 0 devid 1 physical XXXXXXXXX length 65536
        stripe 1 devid 2 physical XXXXXXXXX length 65536
        stripe 2 devid 3 physical XXXXXXXXX length 65536
        stripe 3 devid 4 physical XXXXXXXXX length 65536

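The itemsize values in these dumps follow directly from the on-disk
structures described above: a fixed 8-byte struct btrfs_stripe_extent header
(encoding plus the reserved bytes) followed by one 24-byte struct
btrfs_raid_stride per stride:

    itemsize = sizeof(struct btrfs_stripe_extent)
             + num_strides * sizeof(struct btrfs_raid_stride)
             = 8 + num_strides * 24

    1 stride  -> 32    2 strides -> 56    3 strides -> 80
    4 strides -> 104   6 strides -> 152
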

Example 2: Pre-fill one 64k stripe, write 4k to the 2nd stripe, write 64k and
then write 4k.

RAID0
    item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 32
        encoding: RAID0
        stripe 0 devid 1 physical XXXXXXXXX length 65536
    item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 32
        encoding: RAID0
        stripe 0 devid 2 physical XXXXXXXXX length 4096
    item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
        encoding: RAID0
        stripe 0 devid 2 physical XXXXXXXXX length 61440
        stripe 1 devid 1 physical XXXXXXXXX length 4096
    item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 32
        encoding: RAID0
        stripe 0 devid 1 physical XXXXXXXXX length 4096
RAID1
    item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
        encoding: RAID1
        stripe 0 devid 1 physical XXXXXXXXX length 65536
        stripe 1 devid 2 physical XXXXXXXXX length 65536
    item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
        encoding: RAID1
        stripe 0 devid 1 physical XXXXXXXXX length 4096
        stripe 1 devid 2 physical XXXXXXXXX length 4096
    item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
        encoding: RAID1
        stripe 0 devid 1 physical XXXXXXXXX length 65536
        stripe 1 devid 2 physical XXXXXXXXX length 65536
    item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
        encoding: RAID1
        stripe 0 devid 1 physical XXXXXXXXX length 4096
        stripe 1 devid 2 physical XXXXXXXXX length 4096
RAID10
    item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
        encoding: RAID10
        stripe 0 devid 1 physical XXXXXXXXX length 65536
        stripe 1 devid 2 physical XXXXXXXXX length 65536
    item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
        encoding: RAID10
        stripe 0 devid 3 physical XXXXXXXXX length 4096
        stripe 1 devid 4 physical XXXXXXXXX length 4096
    item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 104
        encoding: RAID10
        stripe 0 devid 3 physical XXXXXXXXX length 61440
        stripe 1 devid 4 physical XXXXXXXXX length 61440
        stripe 2 devid 1 physical XXXXXXXXX length 4096
        stripe 3 devid 2 physical XXXXXXXXX length 4096
    item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
        encoding: RAID10
        stripe 0 devid 1 physical XXXXXXXXX length 4096
        stripe 1 devid 2 physical XXXXXXXXX length 4096

Example 3: Pre-fill stripe with 32K data, then write 128K of data and then
overwrite 8k in the middle.

RAID0
    item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 32
        encoding: RAID0
        stripe 0 devid 1 physical XXXXXXXXX length 32768
    item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 80
        encoding: RAID0
        stripe 0 devid 1 physical XXXXXXXXX length 32768
        stripe 1 devid 2 physical XXXXXXXXX length 65536
        stripe 2 devid 1 physical XXXXXXXXX length 32768
    item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 32
        encoding: RAID0
        stripe 0 devid 1 physical XXXXXXXXX length 8192
RAID1
    item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 56
        encoding: RAID1
        stripe 0 devid 1 physical XXXXXXXXX length 32768
        stripe 1 devid 2 physical XXXXXXXXX length 32768
    item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 56
        encoding: RAID1
        stripe 0 devid 1 physical XXXXXXXXX length 131072
        stripe 1 devid 2 physical XXXXXXXXX length 131072
    item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 56
        encoding: RAID1
        stripe 0 devid 1 physical XXXXXXXXX length 8192
        stripe 1 devid 2 physical XXXXXXXXX length 8192
RAID10
    item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 56
        encoding: RAID10
        stripe 0 devid 1 physical XXXXXXXXX length 32768
        stripe 1 devid 2 physical XXXXXXXXX length 32768
    item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 152
        encoding: RAID10
        stripe 0 devid 1 physical XXXXXXXXX length 32768
        stripe 1 devid 2 physical XXXXXXXXX length 32768
        stripe 2 devid 3 physical XXXXXXXXX length 65536
        stripe 3 devid 4 physical XXXXXXXXX length 65536
        stripe 4 devid 1 physical XXXXXXXXX length 32768
        stripe 5 devid 2 physical XXXXXXXXX length 32768
    item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 56
        encoding: RAID10
        stripe 0 devid 1 physical XXXXXXXXX length 8192
        stripe 1 devid 2 physical XXXXXXXXX length 8192


Glossary
--------

RAID
        Redundant Array of Independent Disks. A storage mechanism where data
        is not stored on a single disk alone but is either mirrored (RAID 1)
        or split across multiple disks (RAID 0). Other RAID levels, like
        RAID 5 and RAID 6, stripe the data across multiple disks and add
        parity information to enable data recovery in case of a disk failure.

LBA
        Logical Block Address. LBAs describe the address space of a block
        device as a linearly increasing address map. LBAs are internally
        mapped to different physical locations by the device firmware.

Zoned Block Device
        A special kind of block device that partitions its LBA space into
        so-called zones. These zones can impose write constraints on the
        host, e.g., allowing only sequential writes aligned to a zone's
        write pointer.

Zone Append
        A write operation where the start LBA of a zone is specified instead
        of a destination LBA for the data to be written. On completion, the
        device reports back to the host the LBA at which the data was
        actually written.

Copy-on-Write
        A write technique where data is not overwritten; instead, a new
        version of it is written out of place.

Read-Modify-Write
        A write technique where the data to be written is first read from
        the block device, modified in memory, and then written back in place
        on the block device.