318 lines
14 KiB
Plaintext
318 lines
14 KiB
Plaintext
BTRFS RAID Stripe Tree Design
|
||
=============================
|
||
|
||
|
||
Problem Statement
|
||
-----------------
|
||
|
||
|
||
RAID on zoned devices
|
||
---------------------
|
||
The current implementation of RAID profiles in BTRFS is based on the implicit
|
||
assumption that data placement is deterministic in the device chunks used for
|
||
mapping block groups.
|
||
With deterministic data placement, all physical on-disk extents of one logical
|
||
file extent are positioned at the same offset relative to the starting LBA of
|
||
a device chunk. This prevents the need for reading any meta-data to access an
|
||
on-disk file extent. Figure 1 shows an example of it.
|
||
|
||
|
||
+------------+ +------------+
|
||
| | | |
|
||
| | | |
|
||
| | | |
|
||
+------------+ +------------+
|
||
| | | |
|
||
| D 1 | | D 1 |
|
||
| | | |
|
||
+------------+ +------------+
|
||
| | | |
|
||
| D 2 | | D 2 |
|
||
| | | |
|
||
+------------+ +------------+
|
||
| | | |
|
||
| | | |
|
||
| | | |
|
||
| | | |
|
||
+------------+ +------------+
|
||
Figure 1: Deterministic data placement
|
||
|
||
|
||
|
||
With non-deterministic data placement, the on-disk extents of a logical file
|
||
extent can be scattered around inside the chunk. To read back the data with
|
||
non-deterministic data placement, additional meta-data describing the position
|
||
of the extents inside a chunk is needed. Figure 2 shows an example for this
|
||
style of data placement.
|
||
|
||
|
||
+------------+ +------------+
|
||
| | | |
|
||
| | | |
|
||
| | | |
|
||
+------------+ +------------+
|
||
| | | |
|
||
| D 1 | | D 2 |
|
||
| | | |
|
||
+------------+ +------------+
|
||
| | | |
|
||
| D 2 | | D 1 |
|
||
| | | |
|
||
+------------+ +------------+
|
||
| | | |
|
||
| | | |
|
||
| | | |
|
||
| | | |
|
||
+------------+ +------------+
|
||
Figure 2: Non-deterministic data placement
|
||
|
||
As BTRFS support for zoned block devices uses the Zone Append operation for
|
||
writing file data extents, there is no guarantee that the written extents have
|
||
the same offset within different device chunks. This implies that to be able
|
||
to use RAID with zoned devices, non-deterministic data placement must be
|
||
supported and additional meta-data describing the location of file extents
|
||
within device chunks is needed.
|
||
|
||
|
||
Lessons learned from RAID 5
|
||
---------------------------
|
||
BTRFS implementation of RAID levels 5 and 6 suffer from the well-known RAID
|
||
write hole problem. This problem exists because sub-stripe write operations
|
||
are not done using a copy-on-write (COW) but done using Read-Modify-Write
|
||
(RMW). With out-of-place writing like COW, no blocks will be overwritten and
|
||
there is no risk of exposing bad data or corrupting a data stripe parity in
|
||
case of sudden power loss or other unexpected events preventing correctly
|
||
writing to the device.
|
||
|
||
RAID Stripe Tree Design overview
|
||
--------------------------------
|
||
|
||
To solve the problems stated above, additional meta-data is introduced: a RAID
|
||
Stripe Tree holds the logical to physical translation for the RAID stripes.
|
||
For each logical file extent (struct btrfs_file_extent_item) a stripe extent
|
||
is created (struct btrfs_stripe_extent). Each btrfs_stripe_extent entry is a
|
||
container for an array of struct btrfs_raid_stride. A struct btrfs_raid_stride
|
||
holds the device ID and the physical start location on that device of the
|
||
sub-stripe of a file extent, as well as the stride's length.
|
||
Each struct btrfs_stripe_extent is keyed by the struct btrfs_file_extent_item
|
||
disk_bytenr and disk_num_bytes, with disk_bytenr as the objectid for the
|
||
btrfs_key and disk_num_bytes as the offset. The key’s type is
|
||
BTRFS_STRIPE_EXTENT_KEY.
|
||
|
||
On-disk format
|
||
--------------
|
||
|
||
struct btrfs_file_extent_item {
|
||
/* […] */
|
||
__le64 disk_bytenr; ------------------------------------+
|
||
__le64 disk_num_bytes; --------------------------------|----+
|
||
/* […] */ | |
|
||
}; | |
|
||
| |
|
||
struct btrfs_key key = { -------------------------------------|----|--+
|
||
.objectid = btrfs_file_extent_item::disk_bytenr, <------+ | |
|
||
.type = BTRFS_STRIPE_EXTENT_KEY, | |
|
||
.offset = btrfs_file_extent_item::disk_num_bytes, <----------+ |
|
||
}; |
|
||
|
|
||
struct btrfs_raid_stride { <------------------+ |
|
||
__le64 devid; | |
|
||
__le64 physical; | |
|
||
__le64 length; | |
|
||
}; | |
|
||
| |
|
||
struct btrfs_stripe_extent { <---------------|-------------------------+
|
||
u8 encoding; |
|
||
u8 reserved[7]; |
|
||
struct btrfs_raid_stride strides[]; ---+
|
||
};
|
||
|
||
|
||
User-space support
|
||
------------------
|
||
|
||
|
||
mkfs
|
||
----
|
||
Supporting the RAID Stripe Tree in user-space consists of three things to do
|
||
for mkfs. The first step is creating the root of the RAID Stripe Tree itself.
|
||
Then mkfs must set the incompat flag, so mounting a filesystem with a RAID
|
||
Stripe Tree is impossible for a kernel version without appropriate support.
|
||
Lastly it must allow RAID profiles on zoned devices once the tree is present.
|
||
|
||
|
||
Check
|
||
-----
|
||
The ‘btrfs check’ support for RAID Stripe Tree is not implemented yet, but its
|
||
task is to read the struct btrfs_stripe_extent entries for each struct
|
||
btrfs_file_extent_item and verify that a correct mapping between these two
|
||
exists. If data checksum verification is requested as well, the tree also must
|
||
be read to perform the logical to physical translation, otherwise the data
|
||
cannot be read, and the checksums cannot be verified.
|
||
|
||
|
||
Example tree dumps
|
||
------------------
|
||
|
||
Example 1: Write 128k to and empty FS
|
||
|
||
RAID0
|
||
item 0 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 56
|
||
encoding: RAID0
|
||
stripe 0 devid 1 physical XXXXXXXXX length 65536
|
||
stripe 1 devid 2 physical XXXXXXXXX length 65536
|
||
RAID1
|
||
stripe 0 devid 1 physical XXXXXXXXX length 131072
|
||
stripe 1 devid 2 physical XXXXXXXXX length 131072
|
||
RAID10
|
||
item 0 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 104
|
||
encoding: RAID10
|
||
stripe 0 devid 1 physical XXXXXXXXX length 65536
|
||
stripe 1 devid 2 physical XXXXXXXXX length 65536
|
||
stripe 2 devid 3 physical XXXXXXXXX length 65536
|
||
stripe 3 devid 4 physical XXXXXXXXX length 65536
|
||
|
||
Example 2: Pre-fill one 65k stripe, write 4k to 2nd stripe, write 64k then
|
||
write 16k.
|
||
|
||
RAID0
|
||
item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 32
|
||
encoding: RAID0
|
||
stripe 0 devid 1 physical XXXXXXXXX length 65536
|
||
item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 32
|
||
encoding: RAID0
|
||
stripe 0 devid 2 physical XXXXXXXXX length 4096
|
||
item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
|
||
encoding: RAID0
|
||
stripe 0 devid 2 physical XXXXXXXXX length 61440
|
||
stripe 1 devid 1 physical XXXXXXXXX length 4096
|
||
item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 32
|
||
encoding: RAID0
|
||
stripe 0 devid 1 physical XXXXXXXXX length 4096
|
||
RAID1
|
||
item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
|
||
encoding: RAID1
|
||
stripe 0 devid 1 physical XXXXXXXXX length 65536
|
||
stripe 1 devid 2 physical XXXXXXXXX length 65536
|
||
item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
|
||
encoding: RAID1
|
||
stripe 0 devid 1 physical XXXXXXXXX length 4096
|
||
stripe 1 devid 2 physical XXXXXXXXX length 4096
|
||
item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
|
||
encoding: RAID1
|
||
stripe 0 devid 1 physical XXXXXXXXX length 65536
|
||
stripe 1 devid 2 physical XXXXXXXXX length 65536
|
||
item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
|
||
encoding: RAID1
|
||
stripe 0 devid 1 physical XXXXXXXXX length 4096
|
||
stripe 1 devid 2 physical XXXXXXXXX length 4096
|
||
RAID10
|
||
item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
|
||
encoding: RAID10
|
||
stripe 0 devid 1 physical XXXXXXXXX length 65536
|
||
stripe 1 devid 2 physical XXXXXXXXX length 65536
|
||
item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
|
||
encoding: RAID10
|
||
stripe 0 devid 3 physical XXXXXXXXX length 4096
|
||
stripe 1 devid 4 physical XXXXXXXXX length 4096
|
||
item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 104
|
||
encoding: RAID10
|
||
stripe 0 devid 3 physical XXXXXXXXX length 61440
|
||
stripe 1 devid 4 physical XXXXXXXXX length 61440
|
||
stripe 2 devid 1 physical XXXXXXXXX length 4096
|
||
stripe 3 devid 2 physical XXXXXXXXX length 4096
|
||
item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
|
||
encoding: RAID10
|
||
stripe 0 devid 1 physical XXXXXXXXX length 4096
|
||
stripe 1 devid 2 physical XXXXXXXXX length 4096
|
||
|
||
Example 3: Pre-fill stripe with 32K data, then write 64K of data and then
|
||
overwrite 8k in the middle.
|
||
|
||
RAID0
|
||
item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 32
|
||
encoding: RAID0
|
||
stripe 0 devid 1 physical XXXXXXXXX length 32768
|
||
item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 80
|
||
encoding: RAID0
|
||
stripe 0 devid 1 physical XXXXXXXXX length 32768
|
||
stripe 1 devid 2 physical XXXXXXXXX length 65536
|
||
stripe 2 devid 1 physical XXXXXXXXX length 32768
|
||
item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 32
|
||
encoding: RAID0
|
||
stripe 0 devid 1 physical XXXXXXXXX length 8192
|
||
RAID1
|
||
item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 56
|
||
encoding: RAID1
|
||
stripe 0 devid 1 physical XXXXXXXXX length 32768
|
||
stripe 1 devid 2 physical XXXXXXXXX length 32768
|
||
item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 56
|
||
encoding: RAID1
|
||
stripe 0 devid 1 physical XXXXXXXXX length 131072
|
||
stripe 1 devid 2 physical XXXXXXXXX length 131072
|
||
item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 56
|
||
encoding: RAID1
|
||
stripe 0 devid 1 physical XXXXXXXXX length 8192
|
||
stripe 1 devid 2 physical XXXXXXXXX length 8192
|
||
RAID10
|
||
item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 56
|
||
encoding: RAID10
|
||
stripe 0 devid 1 physical XXXXXXXXX length 32768
|
||
stripe 1 devid 2 physical XXXXXXXXX length 32768
|
||
item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 152
|
||
encoding: RAID10
|
||
stripe 0 devid 1 physical XXXXXXXXX length 32768
|
||
stripe 1 devid 2 physical XXXXXXXXX length 32768
|
||
stripe 2 devid 3 physical XXXXXXXXX length 65536
|
||
stripe 3 devid 4 physical XXXXXXXXX length 65536
|
||
stripe 4 devid 1 physical XXXXXXXXX length 32768
|
||
stripe 5 devid 2 physical XXXXXXXXX length 32768
|
||
item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 56
|
||
encoding: RAID10
|
||
stripe 0 devid 1 physical XXXXXXXXX length 8192
|
||
stripe 1 devid 2 physical XXXXXXXXX length 8192
|
||
|
||
|
||
Glossary
|
||
--------
|
||
|
||
|
||
RAID
|
||
Redundant Array of Independent Disks. This is a storage mechanism
|
||
where data is not stored on a single disk alone but either mirrored
|
||
(in case of RAID 1) or split across multiple disks (RAID 0). Other
|
||
RAID levels like RAID5 and RAID6 stripe the data across multiple disks
|
||
and add parity information to enable data recovery in case of a disk
|
||
failure.
|
||
|
||
|
||
LBA
|
||
Logical Block Address. LBAs describe the address space of a block
|
||
device as a linearly increasing address map. LBAs are internally
|
||
mapped to different physical locations by the device firmware.
|
||
|
||
|
||
Zoned Block Device
|
||
Zoned Block Devices are a special kind of block devices that partition
|
||
their LBA space into so called zones. These zones can impose write
|
||
constraints on the host, e.g., allowing only sequential writes aligned
|
||
to a zone write pointer.
|
||
|
||
|
||
Zone Append
|
||
A write operation where the start LBA of a Zone is specified instead
|
||
of a destination LBA for the data to be written. On completion the
|
||
device reports the starting LBA used to write the data back to the
|
||
host.
|
||
|
||
|
||
Copy-on-Write
|
||
A write technique where the data is not overwritten, but a new version
|
||
of it is written out of place.
|
||
|
||
|
||
Read-Modify-Write
|
||
A write technique where the data to be written is first read from the
|
||
block device, modified in memory and the modified data written back in
|
||
place on the block device.
|