diff --git a/Documentation/dev/design-raid-stripe-tree.txt b/Documentation/dev/design-raid-stripe-tree.txt
new file mode 100644
index 00000000..11fbe44d
--- /dev/null
+++ b/Documentation/dev/design-raid-stripe-tree.txt
@@ -0,0 +1,317 @@
+BTRFS RAID Stripe Tree Design
+=============================
+
+
+Problem Statement
+-----------------
+
+
+RAID on zoned devices
+---------------------
+The current implementation of RAID profiles in BTRFS is based on the implicit
+assumption that data placement is deterministic in the device chunks used for
+mapping block groups.
+With deterministic data placement, all physical on-disk extents of one logical
+file extent are positioned at the same offset relative to the starting LBA of
+a device chunk. This eliminates the need to read any meta-data to access an
+on-disk file extent. Figure 1 shows an example of this.
+
+
+        +------------+        +------------+
+        |            |        |            |
+        |            |        |            |
+        |            |        |            |
+        +------------+        +------------+
+        |            |        |            |
+        |    D 1     |        |    D 1     |
+        |            |        |            |
+        +------------+        +------------+
+        |            |        |            |
+        |    D 2     |        |    D 2     |
+        |            |        |            |
+        +------------+        +------------+
+        |            |        |            |
+        |            |        |            |
+        |            |        |            |
+        |            |        |            |
+        +------------+        +------------+
+        Figure 1: Deterministic data placement
+
+
+
+With non-deterministic data placement, the on-disk extents of a logical file
+extent can be scattered around inside the chunk. To read back the data with
+non-deterministic data placement, additional meta-data describing the position
+of the extents inside a chunk is needed. Figure 2 shows an example of this
+style of data placement.
+
+
+        +------------+        +------------+
+        |            |        |            |
+        |            |        |            |
+        |            |        |            |
+        +------------+        +------------+
+        |            |        |            |
+        |    D 1     |        |    D 2     |
+        |            |        |            |
+        +------------+        +------------+
+        |            |        |            |
+        |    D 2     |        |    D 1     |
+        |            |        |            |
+        +------------+        +------------+
+        |            |        |            |
+        |            |        |            |
+        |            |        |            |
+        |            |        |            |
+        +------------+        +------------+
+        Figure 2: Non-deterministic data placement
+
+As BTRFS support for zoned block devices uses the Zone Append operation for
+writing file data extents, there is no guarantee that the written extents have
+the same offset within different device chunks. This implies that to be able
+to use RAID with zoned devices, non-deterministic data placement must be
+supported and additional meta-data describing the location of file extents
+within device chunks is needed.
+
+
+Lessons learned from RAID 5
+---------------------------
+The BTRFS implementation of RAID levels 5 and 6 suffers from the well-known
+RAID write hole problem. This problem exists because sub-stripe write
+operations are not performed using copy-on-write (COW) but using
+Read-Modify-Write (RMW). With out-of-place writing like COW, no blocks are
+overwritten, so there is no risk of exposing bad data or corrupting a data
+stripe's parity if a sudden power loss or another unexpected event prevents
+the write from completing correctly.
+
+RAID Stripe Tree Design overview
+--------------------------------
+
+To solve the problems stated above, additional meta-data is introduced: a RAID
+Stripe Tree holds the logical to physical translation for the RAID stripes.
+For each logical file extent (struct btrfs_file_extent_item) a stripe extent
+is created (struct btrfs_stripe_extent). Each btrfs_stripe_extent entry is a
+container for an array of struct btrfs_raid_stride. A struct btrfs_raid_stride
+holds the device ID and the physical start location on that device of the
+sub-stripe of a file extent, as well as the stride's length.
+Each struct btrfs_stripe_extent is keyed by the struct btrfs_file_extent_item
+disk_bytenr and disk_num_bytes, with disk_bytenr as the objectid for the
+btrfs_key and disk_num_bytes as the offset. The key's type is
+BTRFS_STRIPE_EXTENT_KEY.
+
+On-disk format
+--------------
+
+struct btrfs_file_extent_item {
+        /* […] */
+        __le64 disk_bytenr; ---------------------------------------+
+        __le64 disk_num_bytes; ------------------------------------|----+
+        /* […] */                                                  |    |
+};                                                                 |    |
+                                                                   |    |
+struct btrfs_key key = { ------------------------------------------|----|--+
+        .objectid = btrfs_file_extent_item::disk_bytenr, <---------+    |  |
+        .type = BTRFS_STRIPE_EXTENT_KEY,                                |  |
+        .offset = btrfs_file_extent_item::disk_num_bytes, <-------------+  |
+};                                                                         |
+                                                                           |
+struct btrfs_raid_stride { <------------------+                            |
+        __le64 devid;                         |                            |
+        __le64 physical;                      |                            |
+        __le64 length;                        |                            |
+};                                            |                            |
+                                              |                            |
+struct btrfs_stripe_extent { <----------------|----------------------------+
+        u8 encoding;                          |
+        u8 reserved[7];                       |
+        struct btrfs_raid_stride strides[]; --+
+};
+
+
+User-space support
+------------------
+
+
+mkfs
+----
+Supporting the RAID Stripe Tree in user-space consists of three steps for
+mkfs. The first step is creating the root of the RAID Stripe Tree itself.
+Then mkfs must set the incompat flag, so that a filesystem with a RAID Stripe
+Tree cannot be mounted by a kernel version without the appropriate support.
+Lastly, it must allow RAID profiles on zoned devices once the tree is present.
+
+
+Check
+-----
+The 'btrfs check' support for the RAID Stripe Tree is not implemented yet. Its
+task is to read the struct btrfs_stripe_extent entries for each struct
+btrfs_file_extent_item and verify that a correct mapping between the two
+exists. If data checksum verification is requested as well, the tree must also
+be read to perform the logical to physical translation; otherwise the data
+cannot be read and the checksums cannot be verified.
+
+
+Example tree dumps
+------------------
+
+Example 1: Write 128k to an empty FS
+
+RAID0
+        item 0 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 56
+                encoding: RAID0
+                stripe 0 devid 1 physical XXXXXXXXX length 65536
+                stripe 1 devid 2 physical XXXXXXXXX length 65536
+RAID1
+        item 0 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 56
+                encoding: RAID1
+                stripe 0 devid 1 physical XXXXXXXXX length 131072
+                stripe 1 devid 2 physical XXXXXXXXX length 131072
+RAID10
+        item 0 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 104
+                encoding: RAID10
+                stripe 0 devid 1 physical XXXXXXXXX length 65536
+                stripe 1 devid 2 physical XXXXXXXXX length 65536
+                stripe 2 devid 3 physical XXXXXXXXX length 65536
+                stripe 3 devid 4 physical XXXXXXXXX length 65536
+
+Example 2: Pre-fill one 64k stripe, write 4k to the 2nd stripe, write 64k,
+then write 4k.
+
+RAID0
+        item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 32
+                encoding: RAID0
+                stripe 0 devid 1 physical XXXXXXXXX length 65536
+        item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 32
+                encoding: RAID0
+                stripe 0 devid 2 physical XXXXXXXXX length 4096
+        item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
+                encoding: RAID0
+                stripe 0 devid 2 physical XXXXXXXXX length 61440
+                stripe 1 devid 1 physical XXXXXXXXX length 4096
+        item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 32
+                encoding: RAID0
+                stripe 0 devid 1 physical XXXXXXXXX length 4096
+RAID1
+        item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
+                encoding: RAID1
+                stripe 0 devid 1 physical XXXXXXXXX length 65536
+                stripe 1 devid 2 physical XXXXXXXXX length 65536
+        item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
+                encoding: RAID1
+                stripe 0 devid 1 physical XXXXXXXXX length 4096
+                stripe 1 devid 2 physical XXXXXXXXX length 4096
+        item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
+                encoding: RAID1
+                stripe 0 devid 1 physical XXXXXXXXX length 65536
+                stripe 1 devid 2 physical XXXXXXXXX length 65536
+        item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
+                encoding: RAID1
+                stripe 0 devid 1 physical XXXXXXXXX length 4096
+                stripe 1 devid 2 physical XXXXXXXXX length 4096
+RAID10
+        item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
+                encoding: RAID10
+                stripe 0 devid 1 physical XXXXXXXXX length 65536
+                stripe 1 devid 2 physical XXXXXXXXX length 65536
+        item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
+                encoding: RAID10
+                stripe 0 devid 3 physical XXXXXXXXX length 4096
+                stripe 1 devid 4 physical XXXXXXXXX length 4096
+        item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 104
+                encoding: RAID10
+                stripe 0 devid 3 physical XXXXXXXXX length 61440
+                stripe 1 devid 4 physical XXXXXXXXX length 61440
+                stripe 2 devid 1 physical XXXXXXXXX length 4096
+                stripe 3 devid 2 physical XXXXXXXXX length 4096
+        item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
+                encoding: RAID10
+                stripe 0 devid 1 physical XXXXXXXXX length 4096
+                stripe 1 devid 2 physical XXXXXXXXX length 4096
+
+Example 3: Pre-fill a stripe with 32k of data, then write 128k of data and
+then overwrite 8k in the middle.
+
+RAID0
+        item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 32
+                encoding: RAID0
+                stripe 0 devid 1 physical XXXXXXXXX length 32768
+        item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 80
+                encoding: RAID0
+                stripe 0 devid 1 physical XXXXXXXXX length 32768
+                stripe 1 devid 2 physical XXXXXXXXX length 65536
+                stripe 2 devid 1 physical XXXXXXXXX length 32768
+        item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 32
+                encoding: RAID0
+                stripe 0 devid 1 physical XXXXXXXXX length 8192
+RAID1
+        item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 56
+                encoding: RAID1
+                stripe 0 devid 1 physical XXXXXXXXX length 32768
+                stripe 1 devid 2 physical XXXXXXXXX length 32768
+        item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 56
+                encoding: RAID1
+                stripe 0 devid 1 physical XXXXXXXXX length 131072
+                stripe 1 devid 2 physical XXXXXXXXX length 131072
+        item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 56
+                encoding: RAID1
+                stripe 0 devid 1 physical XXXXXXXXX length 8192
+                stripe 1 devid 2 physical XXXXXXXXX length 8192
+RAID10
+        item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 56
+                encoding: RAID10
+                stripe 0 devid 1 physical XXXXXXXXX length 32768
+                stripe 1 devid 2 physical XXXXXXXXX length 32768
+        item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 152
+                encoding: RAID10
+                stripe 0 devid 1 physical XXXXXXXXX length 32768
+                stripe 1 devid 2 physical XXXXXXXXX length 32768
+                stripe 2 devid 3 physical XXXXXXXXX length 65536
+                stripe 3 devid 4 physical XXXXXXXXX length 65536
+                stripe 4 devid 1 physical XXXXXXXXX length 32768
+                stripe 5 devid 2 physical XXXXXXXXX length 32768
+        item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 56
+                encoding: RAID10
+                stripe 0 devid 1 physical XXXXXXXXX length 8192
+                stripe 1 devid 2 physical XXXXXXXXX length 8192
+
+
+Glossary
+--------
+
+
+RAID
+        Redundant Array of Independent Disks. This is a storage mechanism
+        where data is not stored on a single disk alone but either mirrored
+        (in the case of RAID 1) or split across multiple disks (RAID 0).
+        Other RAID levels like RAID 5 and RAID 6 stripe the data across
+        multiple disks and add parity information to enable data recovery
+        in case of a disk failure.
+
+
+LBA
+        Logical Block Address. LBAs describe the address space of a block
+        device as a linearly increasing address map. LBAs are internally
+        mapped to different physical locations by the device firmware.
+
+
+Zoned Block Device
+        Zoned Block Devices are a special kind of block device that
+        partitions its LBA space into so-called zones. These zones can
+        impose write constraints on the host, e.g., allowing only
+        sequential writes aligned to a zone's write pointer.
+
+
+Zone Append
+        A write operation where the start LBA of a zone is specified instead
+        of a destination LBA for the data to be written. On completion, the
+        device reports back to the host the starting LBA at which the data
+        was actually written.
+
+
+Copy-on-Write
+        A write technique where data is not overwritten in place; instead, a
+        new version of it is written out of place.
+
+
+Read-Modify-Write
+        A write technique where the data to be written is first read from
+        the block device, modified in memory, and then written back in place
+        on the block device.