318 lines
14 KiB
Plaintext
318 lines
14 KiB
Plaintext
|
BTRFS RAID Stripe Tree Design
|
|||
|
=============================
|
|||
|
|
|||
|
|
|||
|
Problem Statement
|
|||
|
-----------------
|
|||
|
|
|||
|
|
|||
|
RAID on zoned devices
|
|||
|
---------------------
|
|||
|
The current implementation of RAID profiles in BTRFS is based on the implicit
|
|||
|
assumption that data placement is deterministic in the device chunks used for
|
|||
|
mapping block groups.
|
|||
|
With deterministic data placement, all physical on-disk extents of one logical
|
|||
|
file extent are positioned at the same offset relative to the starting LBA of
|
|||
|
a device chunk. This prevents the need for reading any meta-data to access an
|
|||
|
on-disk file extent. Figure 1 shows an example of it.
|
|||
|
|
|||
|
|
|||
|
+------------+ +------------+
|
|||
|
| | | |
|
|||
|
| | | |
|
|||
|
| | | |
|
|||
|
+------------+ +------------+
|
|||
|
| | | |
|
|||
|
| D 1 | | D 1 |
|
|||
|
| | | |
|
|||
|
+------------+ +------------+
|
|||
|
| | | |
|
|||
|
| D 2 | | D 2 |
|
|||
|
| | | |
|
|||
|
+------------+ +------------+
|
|||
|
| | | |
|
|||
|
| | | |
|
|||
|
| | | |
|
|||
|
| | | |
|
|||
|
+------------+ +------------+
|
|||
|
Figure 1: Deterministic data placement
|
|||
|
|
|||
|
|
|||
|
|
|||
|
With non-deterministic data placement, the on-disk extents of a logical file
|
|||
|
extent can be scattered around inside the chunk. To read back the data with
|
|||
|
non-deterministic data placement, additional meta-data describing the position
|
|||
|
of the extents inside a chunk is needed. Figure 2 shows an example for this
|
|||
|
style of data placement.
|
|||
|
|
|||
|
|
|||
|
+------------+ +------------+
|
|||
|
| | | |
|
|||
|
| | | |
|
|||
|
| | | |
|
|||
|
+------------+ +------------+
|
|||
|
| | | |
|
|||
|
| D 1 | | D 2 |
|
|||
|
| | | |
|
|||
|
+------------+ +------------+
|
|||
|
| | | |
|
|||
|
| D 2 | | D 1 |
|
|||
|
| | | |
|
|||
|
+------------+ +------------+
|
|||
|
| | | |
|
|||
|
| | | |
|
|||
|
| | | |
|
|||
|
| | | |
|
|||
|
+------------+ +------------+
|
|||
|
Figure 2: Non-deterministic data placement
|
|||
|
|
|||
|
As BTRFS support for zoned block devices uses the Zone Append operation for
|
|||
|
writing file data extents, there is no guarantee that the written extents have
|
|||
|
the same offset within different device chunks. This implies that to be able
|
|||
|
to use RAID with zoned devices, non-deterministic data placement must be
|
|||
|
supported and additional meta-data describing the location of file extents
|
|||
|
within device chunks is needed.
|
|||
|
|
|||
|
|
|||
|
Lessons learned from RAID 5
|
|||
|
---------------------------
|
|||
|
BTRFS implementation of RAID levels 5 and 6 suffer from the well-known RAID
|
|||
|
write hole problem. This problem exists because sub-stripe write operations
|
|||
|
are not done using a copy-on-write (COW) but done using Read-Modify-Write
|
|||
|
(RMW). With out-of-place writing like COW, no blocks will be overwritten and
|
|||
|
there is no risk of exposing bad data or corrupting a data stripe parity in
|
|||
|
case of sudden power loss or other unexpected events preventing correctly
|
|||
|
writing to the device.
|
|||
|
|
|||
|
RAID Stripe Tree Design overview
|
|||
|
--------------------------------
|
|||
|
|
|||
|
To solve the problems stated above, additional meta-data is introduced: a RAID
|
|||
|
Stripe Tree holds the logical to physical translation for the RAID stripes.
|
|||
|
For each logical file extent (struct btrfs_file_extent_item) a stripe extent
|
|||
|
is created (struct btrfs_stripe_extent). Each btrfs_stripe_extent entry is a
|
|||
|
container for an array of struct btrfs_raid_stride. A struct btrfs_raid_stride
|
|||
|
holds the device ID and the physical start location on that device of the
|
|||
|
sub-stripe of a file extent, as well as the stride's length.
|
|||
|
Each struct btrfs_stripe_extent is keyed by the struct btrfs_file_extent_item
|
|||
|
disk_bytenr and disk_num_bytes, with disk_bytenr as the objectid for the
|
|||
|
btrfs_key and disk_num_bytes as the offset. The key’s type is
|
|||
|
BTRFS_STRIPE_EXTENT_KEY.
|
|||
|
|
|||
|
On-disk format
|
|||
|
--------------
|
|||
|
|
|||
|
struct btrfs_file_extent_item {
|
|||
|
/* […] */
|
|||
|
__le64 disk_bytenr; ------------------------------------+
|
|||
|
__le64 disk_num_bytes; --------------------------------|----+
|
|||
|
/* […] */ | |
|
|||
|
}; | |
|
|||
|
| |
|
|||
|
struct btrfs_key key = { -------------------------------------|----|--+
|
|||
|
.objectid = btrfs_file_extent_item::disk_bytenr, <------+ | |
|
|||
|
.type = BTRFS_STRIPE_EXTENT_KEY, | |
|
|||
|
.offset = btrfs_file_extent_item::disk_num_bytes, <----------+ |
|
|||
|
}; |
|
|||
|
|
|
|||
|
struct btrfs_raid_stride { <------------------+ |
|
|||
|
__le64 devid; | |
|
|||
|
__le64 physical; | |
|
|||
|
__le64 length; | |
|
|||
|
}; | |
|
|||
|
| |
|
|||
|
struct btrfs_stripe_extent { <---------------|-------------------------+
|
|||
|
u8 encoding; |
|
|||
|
u8 reserved[7]; |
|
|||
|
struct btrfs_raid_stride strides[]; ---+
|
|||
|
};
|
|||
|
|
|||
|
|
|||
|
User-space support
|
|||
|
------------------
|
|||
|
|
|||
|
|
|||
|
mkfs
|
|||
|
----
|
|||
|
Supporting the RAID Stripe Tree in user-space consists of three things to do
|
|||
|
for mkfs. The first step is creating the root of the RAID Stripe Tree itself.
|
|||
|
Then mkfs must set the incompat flag, so mounting a filesystem with a RAID
|
|||
|
Stripe Tree is impossible for a kernel version without appropriate support.
|
|||
|
Lastly it must allow RAID profiles on zoned devices once the tree is present.
|
|||
|
|
|||
|
|
|||
|
Check
|
|||
|
-----
|
|||
|
The ‘btrfs check’ support for RAID Stripe Tree is not implemented yet, but its
|
|||
|
task is to read the struct btrfs_stripe_extent entries for each struct
|
|||
|
btrfs_file_extent_item and verify that a correct mapping between these two
|
|||
|
exists. If data checksum verification is requested as well, the tree also must
|
|||
|
be read to perform the logical to physical translation, otherwise the data
|
|||
|
cannot be read, and the checksums cannot be verified.
|
|||
|
|
|||
|
|
|||
|
Example tree dumps
|
|||
|
------------------
|
|||
|
|
|||
|
Example 1: Write 128k to and empty FS
|
|||
|
|
|||
|
RAID0
|
|||
|
item 0 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 56
|
|||
|
encoding: RAID0
|
|||
|
stripe 0 devid 1 physical XXXXXXXXX length 65536
|
|||
|
stripe 1 devid 2 physical XXXXXXXXX length 65536
|
|||
|
RAID1
|
|||
|
stripe 0 devid 1 physical XXXXXXXXX length 131072
|
|||
|
stripe 1 devid 2 physical XXXXXXXXX length 131072
|
|||
|
RAID10
|
|||
|
item 0 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 104
|
|||
|
encoding: RAID10
|
|||
|
stripe 0 devid 1 physical XXXXXXXXX length 65536
|
|||
|
stripe 1 devid 2 physical XXXXXXXXX length 65536
|
|||
|
stripe 2 devid 3 physical XXXXXXXXX length 65536
|
|||
|
stripe 3 devid 4 physical XXXXXXXXX length 65536
|
|||
|
|
|||
|
Example 2: Pre-fill one 65k stripe, write 4k to 2nd stripe, write 64k then
|
|||
|
write 16k.
|
|||
|
|
|||
|
RAID0
|
|||
|
item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 32
|
|||
|
encoding: RAID0
|
|||
|
stripe 0 devid 1 physical XXXXXXXXX length 65536
|
|||
|
item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 32
|
|||
|
encoding: RAID0
|
|||
|
stripe 0 devid 2 physical XXXXXXXXX length 4096
|
|||
|
item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
|
|||
|
encoding: RAID0
|
|||
|
stripe 0 devid 2 physical XXXXXXXXX length 61440
|
|||
|
stripe 1 devid 1 physical XXXXXXXXX length 4096
|
|||
|
item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 32
|
|||
|
encoding: RAID0
|
|||
|
stripe 0 devid 1 physical XXXXXXXXX length 4096
|
|||
|
RAID1
|
|||
|
item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
|
|||
|
encoding: RAID1
|
|||
|
stripe 0 devid 1 physical XXXXXXXXX length 65536
|
|||
|
stripe 1 devid 2 physical XXXXXXXXX length 65536
|
|||
|
item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
|
|||
|
encoding: RAID1
|
|||
|
stripe 0 devid 1 physical XXXXXXXXX length 4096
|
|||
|
stripe 1 devid 2 physical XXXXXXXXX length 4096
|
|||
|
item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
|
|||
|
encoding: RAID1
|
|||
|
stripe 0 devid 1 physical XXXXXXXXX length 65536
|
|||
|
stripe 1 devid 2 physical XXXXXXXXX length 65536
|
|||
|
item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
|
|||
|
encoding: RAID1
|
|||
|
stripe 0 devid 1 physical XXXXXXXXX length 4096
|
|||
|
stripe 1 devid 2 physical XXXXXXXXX length 4096
|
|||
|
RAID10
|
|||
|
item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
|
|||
|
encoding: RAID10
|
|||
|
stripe 0 devid 1 physical XXXXXXXXX length 65536
|
|||
|
stripe 1 devid 2 physical XXXXXXXXX length 65536
|
|||
|
item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
|
|||
|
encoding: RAID10
|
|||
|
stripe 0 devid 3 physical XXXXXXXXX length 4096
|
|||
|
stripe 1 devid 4 physical XXXXXXXXX length 4096
|
|||
|
item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 104
|
|||
|
encoding: RAID10
|
|||
|
stripe 0 devid 3 physical XXXXXXXXX length 61440
|
|||
|
stripe 1 devid 4 physical XXXXXXXXX length 61440
|
|||
|
stripe 2 devid 1 physical XXXXXXXXX length 4096
|
|||
|
stripe 3 devid 2 physical XXXXXXXXX length 4096
|
|||
|
item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
|
|||
|
encoding: RAID10
|
|||
|
stripe 0 devid 1 physical XXXXXXXXX length 4096
|
|||
|
stripe 1 devid 2 physical XXXXXXXXX length 4096
|
|||
|
|
|||
|
Example 3: Pre-fill stripe with 32K data, then write 64K of data and then
|
|||
|
overwrite 8k in the middle.
|
|||
|
|
|||
|
RAID0
|
|||
|
item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 32
|
|||
|
encoding: RAID0
|
|||
|
stripe 0 devid 1 physical XXXXXXXXX length 32768
|
|||
|
item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 80
|
|||
|
encoding: RAID0
|
|||
|
stripe 0 devid 1 physical XXXXXXXXX length 32768
|
|||
|
stripe 1 devid 2 physical XXXXXXXXX length 65536
|
|||
|
stripe 2 devid 1 physical XXXXXXXXX length 32768
|
|||
|
item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 32
|
|||
|
encoding: RAID0
|
|||
|
stripe 0 devid 1 physical XXXXXXXXX length 8192
|
|||
|
RAID1
|
|||
|
item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 56
|
|||
|
encoding: RAID1
|
|||
|
stripe 0 devid 1 physical XXXXXXXXX length 32768
|
|||
|
stripe 1 devid 2 physical XXXXXXXXX length 32768
|
|||
|
item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 56
|
|||
|
encoding: RAID1
|
|||
|
stripe 0 devid 1 physical XXXXXXXXX length 131072
|
|||
|
stripe 1 devid 2 physical XXXXXXXXX length 131072
|
|||
|
item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 56
|
|||
|
encoding: RAID1
|
|||
|
stripe 0 devid 1 physical XXXXXXXXX length 8192
|
|||
|
stripe 1 devid 2 physical XXXXXXXXX length 8192
|
|||
|
RAID10
|
|||
|
item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 56
|
|||
|
encoding: RAID10
|
|||
|
stripe 0 devid 1 physical XXXXXXXXX length 32768
|
|||
|
stripe 1 devid 2 physical XXXXXXXXX length 32768
|
|||
|
item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 152
|
|||
|
encoding: RAID10
|
|||
|
stripe 0 devid 1 physical XXXXXXXXX length 32768
|
|||
|
stripe 1 devid 2 physical XXXXXXXXX length 32768
|
|||
|
stripe 2 devid 3 physical XXXXXXXXX length 65536
|
|||
|
stripe 3 devid 4 physical XXXXXXXXX length 65536
|
|||
|
stripe 4 devid 1 physical XXXXXXXXX length 32768
|
|||
|
stripe 5 devid 2 physical XXXXXXXXX length 32768
|
|||
|
item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 56
|
|||
|
encoding: RAID10
|
|||
|
stripe 0 devid 1 physical XXXXXXXXX length 8192
|
|||
|
stripe 1 devid 2 physical XXXXXXXXX length 8192
|
|||
|
|
|||
|
|
|||
|
Glossary
|
|||
|
--------
|
|||
|
|
|||
|
|
|||
|
RAID
|
|||
|
Redundant Array of Independent Disks. This is a storage mechanism
|
|||
|
where data is not stored on a single disk alone but either mirrored
|
|||
|
(in case of RAID 1) or split across multiple disks (RAID 0). Other
|
|||
|
RAID levels like RAID5 and RAID6 stripe the data across multiple disks
|
|||
|
and add parity information to enable data recovery in case of a disk
|
|||
|
failure.
|
|||
|
|
|||
|
|
|||
|
LBA
|
|||
|
Logical Block Address. LBAs describe the address space of a block
|
|||
|
device as a linearly increasing address map. LBAs are internally
|
|||
|
mapped to different physical locations by the device firmware.
|
|||
|
|
|||
|
|
|||
|
Zoned Block Device
|
|||
|
Zoned Block Devices are a special kind of block devices that partition
|
|||
|
their LBA space into so called zones. These zones can impose write
|
|||
|
constraints on the host, e.g., allowing only sequential writes aligned
|
|||
|
to a zone write pointer.
|
|||
|
|
|||
|
|
|||
|
Zone Append
|
|||
|
A write operation where the start LBA of a Zone is specified instead
|
|||
|
of a destination LBA for the data to be written. On completion the
|
|||
|
device reports the starting LBA used to write the data back to the
|
|||
|
host.
|
|||
|
|
|||
|
|
|||
|
Copy-on-Write
|
|||
|
A write technique where the data is not overwritten, but a new version
|
|||
|
of it is written out of place.
|
|||
|
|
|||
|
|
|||
|
Read-Modify-Write
|
|||
|
A write technique where the data to be written is first read from the
|
|||
|
block device, modified in memory and the modified data written back in
|
|||
|
place on the block device.
|