=========================== Erasure coding enhancements =========================== Objectives ========== Our objective is to improve the performance of erasure coding, in particular for small random accesses to make it more viable to use erasure coding pools for storing block and file data. We are looking to reduce the number of OSD read and write accesses per client I/O (sometimes referred to as I/O amplification), reduce the amount of network traffic between OSDs (network bandwidth) and reduce I/O latency (time to complete read and write I/O operations). We expect the changes will also provide modest reductions to CPU overheads. While the changes are focused on enhancing small random accesses, some enhancements will provide modest benefits for larger I/O accesses and for object storage. The following sections give a brief description of the improvements we are looking to make. Please see the later design sections for more details Current Read Implementation --------------------------- For reference this is how erasure code reads currently work .. ditaa:: RADOS Client * Current code reads all data chunks ^ * Discards unneeded data | * Returns requested data to client +----+----+ | Discard | If data cannot be read then the coding parity |unneeded | chunks are read as well and are used to reconstruct | data | the data +---------+ ^^^^ |||| |||| |||| |||+----------------------------------------------+ ||+-------------------------------------+ | |+----------------------------+ | | | | | | .-----. .-----. .-----. .-----. .-----. .-----. ( ) ( ) ( ) ( ) ( ) ( ) |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| | | | | | | | | | | | | | | | | | | | | | | | | ( ) ( ) ( ) ( ) ( ) ( ) `-----' `-----' `-----' `-----' `-----' `-----' Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q OSD Note: All the diagrams illustrate a K=4 + M=2 configuration, however the concepts and techniques can be used for all K+M configurations. Partial Reads ------------- If only a small amount of data is being read it is not necessary to read the whole stripe, for small I/Os ideally only a single OSD needs to be involved in reading the data. See also larger chunk size below. .. ditaa:: RADOS Client * Optimize by only reading required chunks ^ * For large chunk sizes and sub-chunk reads only | read a sub-chunk +----+----+ | Return | If data cannot be read then extra data and coding | data | parity chunks are read as well and are used to | | reconstruct the data +---------+ ^ | | | | | +----------------------------+ | .-----. .-----. .-----. .-----. .-----. .-----. ( ) ( ) ( ) ( ) ( ) ( ) |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| | | | | | | | | | | | | | | | | | | | | | | | | ( ) ( ) ( ) ( ) ( ) ( ) `-----' `-----' `-----' `-----' `-----' `-----' Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q OSD Pull Request https://github.com/ceph/ceph/pull/55196 is implementing most of this optimization, however it still issues full chunk reads. Current Overwrite Implementation -------------------------------- For reference here is how erasure code overwrites currently work .. 
ditaa:: RADOS Client | * Read all data chunks | * Merges new data +----v-----+ * Encodes new coding parities | Read old | * Writes data and coding parities |Merge new | | Encode |-------------------------------------------------------------+ | Write |---------------------------------------------------+ | +----------+ | | ^|^|^|^| | | |||||||+-------------------------------------------+ | | ||||||+-------------------------------------------+| | | |||||+-----------------------------------+ || | | ||||+-----------------------------------+| || | | |||+---------------------------+ || || | | ||+---------------------------+| || || | | |v |v |v |v v v .-----. .-----. .-----. .-----. .-----. .-----. ( ) ( ) ( ) ( ) ( ) ( ) |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| | | | | | | | | | | | | | | | | | | | | | | | | ( ) ( ) ( ) ( ) ( ) ( ) `-----' `-----' `-----' `-----' `-----' `-----' Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q OSD Partial Overwrites ------------------ Ideally we aim to be able to perform updates to erasure coded stripes by only updating a subset of the shards (those with modified data or coding parities). Avoiding performing unnecessary data updates on the other shards is easy, avoiding performing any metadata updates on the other shards is much harder (see design section on metadata updates). .. ditaa:: RADOS Client | * Only read chunks that are not being overwritten | * Merge new data +----v-----+ * Encode new coding parities | Read old | * Only write modified data and parity shards |Merge new | | Encode |-------------------------------------------------------------+ | Write |---------------------------------------------------+ | +----------+ | | ^ |^ ^ | | | || | | | | || +-------------------------------------------+ | | | || | | | | |+-----------------------------------+ | | | | +---------------------------+ | | | | | | | | | | | v | | v v .-----. .-----. .-----. .-----. .-----. .-----. ( ) ( ) ( ) ( ) ( ) ( ) |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| | | | | | | | | | | | | | | | | | | | | | | | | ( ) ( ) ( ) ( ) ( ) ( ) `-----' `-----' `-----' `-----' `-----' `-----' Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q OSD This diagram is overly simplistic, only showing the data flows. The simplest implementation of this optimization retains a metadata write to every OSD. With more effort it is possible to reduce the number of metadata updates as well, see design below for more details. Parity-delta-write ------------------ A common technique used by block storage controllers implementing RAID5 and RAID6 is to implement what is sometimes called a parity delta write. When a small part of the stripe is being overwritten it is possible to perform the update by reading the old data, XORing this with the new data to create a delta and then read each coding parity, apply the delta and write the new parity. The advantage of this technique is that it can involve a lot less I/O, especially for K+M encodings with larger values of K. The technique is not specific to M=1 and M=2, it can be applied with any number of coding parities. .. 
ditaa:: Parity delta writes * Read old data and XOR with new data to create a delta RADOS Client * Read old encoding parities apply the delta and write | the new encoding parities | | For K+M erasure codings where K is larger and M is small | +-----+ +-----+ this is much more efficient +->| XOR |-+->| GF |---------------------------------------------------+ +-+->| | | | |<------------------------------------------------+ | | | +-----+ | +-----+ | | | | | | | | | | +-----+ | | | | +->| XOR |-----------------------------------------+ | | | | | |<--------------------------------------+ | | | | | +-----+ | | | | | | | | | | | | | | | | | +-------------------------------+ | | | | +-------------------------------+ | | | | | | | | | | | | v | v | v .-----. .-----. .-----. .-----. .-----. .-----. ( ) ( ) ( ) ( ) ( ) ( ) |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| | | | | | | | | | | | | | | | | | | | | | | | | ( ) ( ) ( ) ( ) ( ) ( ) `-----' `-----' `-----' `-----' `-----' `-----' Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q OSD Direct Read I/O --------------- We want clients to submit small I/Os directly to the OSD that stores the data rather than directing all I/O requests to the Primary OSD and have it issue requests to the secondary OSDs. By eliminating an intermediate hop this reduces network bandwidth and improves I/O latency .. ditaa:: RADOS Client ^ | +----+----+ Client sends small read requests directly to OSD | Return | avoiding extra network hop via Primary | data | | | +---------+ ^ | | | | | | | .-----. .-----. .-----. .-----. .-----. .-----. ( ) ( ) ( ) ( ) ( ) ( ) |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| | | | | | | | | | | | | | | | | | | | | | | | | ( ) ( ) ( ) ( ) ( ) ( ) `-----' `-----' `-----' `-----' `-----' `-----' Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q OSD .. ditaa:: RADOS Client ^ ^ | | +----+----+ +--+------+ Client breaks larger read | Return | | Return | requests into separate | data | | data | requests to multiple OSDs | | | | +---------+ +---------+ Note client loses atomicity ^ ^ guarantees if this optimization | | is used as an update could occur | | between the two reads | | | | | | | | | | .-----. .-----. .-----. .-----. .-----. .-----. ( ) ( ) ( ) ( ) ( ) ( ) |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| | | | | | | | | | | | | | | | | | | | | | | | | ( ) ( ) ( ) ( ) ( ) ( ) `-----' `-----' `-----' `-----' `-----' `-----' Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q OSD Distributed processing of writes -------------------------------- The existing erasure code implementation processes write I/Os on the primary OSD, issuing both reads and writes to other OSDs to fetch and update data for other shards. This is perhaps the simplest implementation, but it uses a lot of network bandwidth. With parity-delta-writes it is possible to distribute the processing across OSDs to reduce network bandwidth. .. 
ditaa:: Performing the coding parity delta updates on the coding parity OSD instead of the primary OSD reduces network bandwidth RADOS Client | Note: A naive implementation will increase latency by serializing | the data and coding parity reads, for best performance these | reads need to happen in parallel | +-----+ +-----+ +->| XOR |-+------------------------------------------------------->| GF | +-+->| | | | | | | +-----+ | +----++ | | | +-----+ ^ | | | +--------------------------------------------->| XOR | | | | | | | | | | | +---+-+ | | | +-------------------------------+ ^ | | | +-------------------------------+ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | v | v | v .-----. .-----. .-----. .-----. .-----. .-----. ( ) ( ) ( ) ( ) ( ) ( ) |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| | | | | | | | | | | | | | | | | | | | | | | | | ( ) ( ) ( ) ( ) ( ) ( ) `-----' `-----' `-----' `-----' `-----' `-----' Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q OSD Direct Write I/O ---------------- .. ditaa:: RADOS Client | | Similarly Clients could direct small write I/Os | to the OSD that needs updating | | +-----+ +-----+ +->| XOR |-+--------------------->| GF | +-----+->| | | | | | | +-----+ | +----++ | | | +-----+ ^ | | | +----------->| XOR | | | | | | | | | | | +---+-+ | | | | ^ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | v | v | v .-----. .-----. .-----. .-----. .-----. .-----. ( ) ( ) ( ) ( ) ( ) ( ) |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| | | | | | | | | | | | | | | | | | | | | | | | | ( ) ( ) ( ) ( ) ( ) ( ) `-----' `-----' `-----' `-----' `-----' `-----' Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q OSD This diagram is overly simplistic, only showing the data flows - direct writes are much harder to implement and will need control messages to the Primary to ensure writes to the same stripe are ordered correctly Larger chunk size ----------------- The default chunk size is 4K, this is too small and means that small reads have to be split up and processed by many OSDs. It is more efficient if small I/Os can be serviced by a single OSD. Choosing a larger chunk size such as 64K or 256K and implementing partial reads and writes will fix this issue, but has the disadvantage that small sized RADOS objects get rounded up in size to a whole stripe of capacity. We would like the code to automatically choose what chunk size to use to optimize for both capacity and performance. Small objects should use a small chunk size like 4K, larger objects should use a larger chunk size. Code currently rounds up I/O sizes to multiples of the chunk size, which isn't an issue with a small chunk size. With a larger chunk size and partial reads/writes we should round up to the page size rather than the chunk size. Design ====== We will describe the changes we aim to make in three sections, the first section looks at the existing test tools for erasure coding and discusses the improvements we believe will be necessary to get good test coverage for the changes. The second section covers changes to the read and write I/O path. The third section discusses the changes to metadata to avoid the need to update metadata on all shards for each metadata update. While it is possible to implement many of the I/O path changes without reducing the number of metadata updates, there are bigger performance benefits if the number of metadata updates can be reduced as well. 
Test tools
----------

A survey of the existing test tools shows that there is insufficient coverage
of erasure coding to simply make changes to the code and expect the existing
CI pipelines to exercise them adequately. Therefore one of the first steps
will be to improve the test tools so that we can get better test coverage.

Teuthology is the main test tool used to get test coverage and it relies
heavily on the following tests for generating I/O:

1. **rados** task - qa/tasks/rados.py. This uses ceph_test_rados
   (src/test/osd/TestRados.cc) which can generate a wide mixture of different
   rados operations. There is limited support for read and write I/Os,
   typically using offset 0, although there is a chunked read command used by
   a couple of tests.

2. **radosbench** task - qa/tasks/radosbench.py. This uses **rados bench**
   (src/tools/rados/rados.cc and src/common/obj_bencher.cc). It can be used to
   generate sequential and random I/O workloads; the offset starts at 0 for
   sequential I/O. The I/O size can be set but is constant for the whole test.

3. **rbd_fio** task - qa/tasks/fio.py. This uses **fio** to generate
   read/write I/O to an rbd image volume.

4. **cbt** task - qa/tasks/cbt.py. This uses the Ceph benchmark tool **cbt**
   to run fio or radosbench to benchmark the performance of a cluster.

5. **rbd bench**. Some of the standalone tests use rbd bench
   (src/tools/rbd/action/Bench.cc) to generate small amounts of I/O workload.
   It is also used by the **rbd_pwl_cache_recovery** task.

It is hard to use these tools to get good coverage of I/Os to non-zero (and
non-stripe-aligned) offsets, or to generate a wide variety of offsets and
lengths of I/O requests including all the boundary cases for chunks and
stripes. There is scope to improve either rados, radosbench or rbd bench to
generate much more interesting I/O patterns for testing erasure coding.

For the optimizations described above it is essential that we have good tools
for checking the consistency of either selected objects or all objects in an
erasure coded pool, by checking that the data and coding parities are
coherent. There is a test tool **ceph-erasure-code-tool** which can use the
plugins to encode and decode data provided in a set of files. However there
does not seem to be any scripting in teuthology that performs consistency
checks by using ceph-objectstore-tool to read data and then using this tool to
validate consistency. We will write some teuthology helpers that use
ceph-objectstore-tool and ceph-erasure-code-tool to perform offline
validation.

We would also like an online way of performing full consistency checks, either
for specific objects or for a whole pool. Inconveniently, EC pools do not
support class methods, so these cannot be used to implement a full consistency
check. We will investigate putting a flag on a read request or on the pool, or
implementing a new request type, to perform a full consistency check on an
object, and we will look at extending the rados CLI to be able to run these
checks. See also the discussion on deep scrub below.

When there is more than one coding parity and there is an inconsistency
between the data and the coding parities, it is useful to try to analyze the
cause of the inconsistency. Because the multiple coding parities provide
redundancy, there can be multiple ways of reconstructing each chunk, and this
can be used to detect the most likely cause of the inconsistency.
For example, with a 4+2 erasure coding and a dropped write to the first data
OSD, the stripe (all 6 OSDs) will be inconsistent, as will any selection of 5
OSDs that includes the first data OSD, but data OSDs 2, 3 and 4 and the two
coding parity OSDs will still be consistent. While there are many ways a
stripe could get into this state, a tool could conclude that the most likely
cause is a missed update to OSD 1. Ceph does not have a tool to perform this
type of analysis, but it should be easy to extend ceph-erasure-code-tool to do
it.

Teuthology seems to have adequate tools for taking OSDs offline and bringing
them back online again. There are a few tools for injecting read I/O errors
(without taking an OSD offline) but there is scope to improve these (e.g. the
ability to specify a particular offset in an object that will fail a read, and
more control over setting and deleting error inject sites).

The general philosophy of teuthology seems to be to randomly inject faults and
simply through brute force get sufficient coverage of all the error paths.
This is a good approach for CI testing. However, when EC code paths become
complex and require multiple errors to occur with precise timings to cause a
particular code path to execute, it becomes hard to get coverage without
running the tests for a very long time. There are some standalone tests for EC
which do test some of the multiple failure paths, but these tests perform very
limited amounts of I/O and don't inject failures while there are I/Os in
flight, so they miss some of the interesting scenarios. To deal with these
more complex error paths we propose developing a new type of thrasher for
erasure coding that injects a sequence of errors and makes use of debug hooks
to capture and delay I/O requests at particular points, to ensure an error
inject hits a particular timing window. To do this we will extend the tell osd
command to include extra interfaces to inject errors and to capture and stall
I/Os at specific points.

Some parts of erasure coding, such as the plugins, are standalone pieces of
code which can be tested with unit tests. There are already some unit tests
and performance benchmark tools for erasure coding; we will look to extend
these to get further coverage of code that can be run standalone.

I/O path changes
----------------

Avoid unnecessary reads and writes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The current code reads too much data for read and overwrite I/Os. For
overwrites it will also rewrite unmodified data. This occurs because reads and
overwrites are rounded up to full-stripe operations. This isn't a problem when
data is mainly being accessed sequentially, but it is very wasteful for random
I/O operations. The code can be changed to only read/write the necessary
shards. To allow the code to efficiently support larger chunk sizes, I/Os
should be rounded to page-sized I/Os instead of chunk-sized I/Os (see the
sketch at the end of this subsection).

The first simple set of optimizations eliminates unnecessary reads and
unnecessary writes of data, but retains writes of metadata on all shards. This
avoids breaking the current design, which depends on all shards receiving a
metadata update for every transaction. When changes to the metadata handling
are completed (see below) it will be possible to make further optimizations to
reduce the number of metadata updates for additional savings.
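To make the shard mapping concrete, the sketch below works out, for a small
client read, which data shard is touched and which page-aligned byte range
within that shard needs to be read. It is a standalone illustration assuming
the usual striping of data chunks across shards and made-up parameter names
(``chunk_size``, ``k``, ``page_size``); it is not the actual erasure code
interface, and a read crossing a chunk boundary would be split into several
such ranges.

.. code-block:: cpp

    #include <algorithm>
    #include <cstdint>
    #include <iostream>

    // Map a small client read (offset, length) within an erasure coded object
    // to the data shard it lands on and the page-aligned byte range to read
    // from that shard. Assumes stripe s places bytes
    // [s*k*chunk_size, (s+1)*k*chunk_size) across data shards 0..k-1, and that
    // the read does not cross a chunk boundary (larger reads would be split).
    struct SubChunkRead {
      unsigned shard;      // data shard index (0..k-1)
      uint64_t shard_off;  // byte offset within the shard object
      uint64_t len;        // bytes to read from this shard
    };

    static uint64_t round_down(uint64_t v, uint64_t a) { return v - (v % a); }
    static uint64_t round_up(uint64_t v, uint64_t a) { return round_down(v + a - 1, a); }

    SubChunkRead map_small_read(uint64_t off, uint64_t len,
                                uint64_t chunk_size, unsigned k,
                                uint64_t page_size = 4096) {
      const uint64_t stripe_size = chunk_size * k;
      const uint64_t stripe = off / stripe_size;
      const uint64_t in_stripe = off % stripe_size;
      const uint64_t chunk_off = in_stripe % chunk_size;  // offset inside the chunk
      // Round to pages, not to the whole chunk, so large chunk sizes stay cheap.
      const uint64_t start = round_down(chunk_off, page_size);
      const uint64_t end = std::min(round_up(chunk_off + len, page_size), chunk_size);
      SubChunkRead r;
      r.shard = static_cast<unsigned>(in_stripe / chunk_size);
      r.shard_off = stripe * chunk_size + start;
      r.len = end - start;
      return r;
    }

    int main() {
      // 4+2 pool with 64K chunks: a 4K read at object offset 200K touches only
      // data shard 3 and a single page within it.
      SubChunkRead r = map_small_read(200 * 1024, 4096, 64 * 1024, 4);
      std::cout << "shard " << r.shard << " offset " << r.shard_off
                << " length " << r.len << "\n";
    }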
Parity-delta-write
^^^^^^^^^^^^^^^^^^

The current code implements overwrites by performing a full-stripe read,
merging the overwritten data, calculating new coding parities and performing a
full-stripe write. Reading and writing every shard is expensive, and there are
a number of optimizations that can be applied to speed this up.

For a K+M configuration where M is small, it is often less work to perform a
parity-delta-write. This is implemented by reading the old data that is about
to be overwritten and XORing it with the new data to create a delta. The
coding parities can then be read, updated by applying the delta, and
re-written. With M=2 (RAID6) this can result in just three reads and three
writes to perform an overwrite of less than one chunk.

Note that where a large fraction of the data in the stripe is being updated,
this technique can result in more work than performing a partial overwrite.
However, if both update techniques are supported it is fairly easy to
calculate, for a given I/O offset and length, which is the optimal technique
to use. Write I/Os submitted to the Primary OSD will perform this calculation
to decide whether to use a full-stripe update or a parity-delta-write. Note
that if read failures are encountered while performing a parity-delta-write
and it is necessary to reconstruct data or a coding parity, then it will be
more efficient to switch to performing a full-stripe read, merge and write.

Not all erasure codings and erasure coding libraries support the capability of
performing delta updates, however those implemented using XOR and/or GF
arithmetic should. We have checked jerasure and isa-l and confirmed that they
support this feature, although the necessary APIs are not currently exposed by
the plugins. For some erasure codes such as clay and lrc it may be possible to
apply delta updates, but the delta may need to be applied in so many places
that this makes it a worthless optimization. This proposal suggests that
parity-delta-write optimizations are initially implemented only for the most
commonly used erasure codings. Erasure code plugins will provide a new flag
indicating whether they support the new interfaces needed to perform delta
updates.
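As an illustration of the arithmetic involved, here is a minimal
byte-at-a-time sketch of a parity-delta update, assuming a classic RAID6-style
P/Q code over GF(2^8). The polynomial, the per-shard coefficient ``coef_i``
and the function names are illustrative assumptions; in Ceph this work would
sit behind the new plugin interfaces described above and use the plugins'
optimized GF routines.

.. code-block:: cpp

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // GF(2^8) multiply using the 0x11d polynomial (bitwise sketch; real
    // plugins use table or SIMD based implementations).
    static uint8_t gf_mul(uint8_t a, uint8_t b) {
      uint8_t p = 0;
      while (b) {
        if (b & 1) p ^= a;
        const bool carry = a & 0x80;
        a <<= 1;
        if (carry) a ^= 0x1d;
        b >>= 1;
      }
      return p;
    }

    // Parity-delta update of one sub-chunk:
    //   delta = old_data XOR new_data        (computed on the data shard)
    //   P'    = P XOR delta                  (P uses coefficient 1)
    //   Q'    = Q XOR (coef_i * delta)       (coef_i depends on data shard i)
    // Only the touched data shard and the M parity shards are read and written.
    void parity_delta_update(const std::vector<uint8_t>& old_data,
                             const std::vector<uint8_t>& new_data,
                             std::vector<uint8_t>& p_parity,
                             std::vector<uint8_t>& q_parity,
                             uint8_t coef_i) {
      for (size_t j = 0; j < old_data.size(); ++j) {
        const uint8_t delta = old_data[j] ^ new_data[j];
        p_parity[j] ^= delta;
        q_parity[j] ^= gf_mul(coef_i, delta);
      }
    }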
Direct reads
^^^^^^^^^^^^

Read I/Os are currently directed to the primary OSD, which then issues reads
to the other shards. To reduce I/O latency and network bandwidth it would be
better if clients could issue read requests directly to the OSD storing the
data, rather than via the primary. There are a few error scenarios where the
client may still need to fall back to submitting reads to the primary; a
secondary OSD will have the option of failing a direct read with -EAGAIN to
request that the client retries the request to the primary OSD.

Direct reads will always be for <= one chunk. For reads of more than one chunk
the client can issue direct reads to multiple OSDs, however these will no
longer be guaranteed to be atomic because an update (write) may be applied in
between the separate read requests. If a client needs atomicity guarantees it
will need to continue to send the read to the primary.

Direct reads will be failed with EAGAIN where a reconstruct and decode
operation is required to return the data. This means only reads to the primary
OSD will need to handle the reconstruct code path. When an OSD is backfilling
we don't want the client to have large quantities of I/O failed with EAGAIN,
therefore we will make the client detect this situation and avoid issuing
direct I/Os to a backfilling OSD.

For backwards compatibility, for client requests that cannot cope with the
reduced guarantees of a direct read, and for scenarios where the direct read
would be to an OSD that is absent or backfilling, reads directed to the
primary OSD will still be supported.

Direct writes
^^^^^^^^^^^^^

Write I/Os are currently directed to the primary OSD, which then updates the
other shards. To reduce latency and network bandwidth it would be better if
clients could send small overwrite requests directly to the OSD storing the
data, rather than via the primary. For larger write I/Os, and for error
scenarios and abnormal cases, clients will continue to submit write I/Os to
the primary OSD.

Direct writes will always be for <= one chunk and will use the
parity-delta-write technique to perform the update. For medium sized writes a
client may issue direct writes to multiple OSDs, but such updates will no
longer be guaranteed to be atomic. If a client requires atomicity for a larger
write it will need to continue to send it to the primary.

For backwards compatibility, and for scenarios where the direct write would be
to an OSD that is absent, writes directed to the primary OSD will still be
supported.

I/O serialization, recovery/backfill and other error scenarios
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

Direct writes look fairly simple until you start considering all the abnormal
scenarios. The current implementation of processing all writes on the Primary
OSD means that there is one central point of control for the stripe that can
manage things like the ordering of multiple inflight I/Os to the same stripe,
ensuring that recovery/backfill for an object has been completed before it is
accessed, and assigning the object version number and modification time. With
direct I/Os these become distributed problems. Our approach is to send a
control path message to the Primary OSD and let it continue to be the central
point of control. The Primary OSD will issue a reply when the OSD can start
the direct write and will be informed with another message when the I/O has
completed. See the section below on metadata updates for more details.

Stripe cache
^^^^^^^^^^^^

Erasure code pools maintain a stripe cache which stores shard data while
updates are in progress. This is required to allow writes and reads to the
same stripe to be processed in parallel. For small sequential write workloads,
and for extreme hot spots (e.g. where the same block is repeatedly re-written
for some kind of crude checkpointing mechanism), there would be a benefit in
keeping the stripe cache slightly longer than the duration of the I/O. In
particular, the coding parities are typically read and written for every
update to a stripe. There is obviously a balancing act between keeping the
cache long enough that it reduces the overheads for future I/Os and the memory
overheads of storing this data. A small cache (MiB rather than GiB sized)
should be sufficient for most workloads.

The stripe cache can also help reduce latency for direct write I/Os by
allowing prefetch I/Os to read old data and coding parities ready for later
parts of the write operation, without requiring more complex interlocks.

The stripe cache is less important when the default chunk size is small (e.g.
4K), because even with small write I/O requests there will not be many
sequential updates to fill a stripe. With a larger chunk size (e.g. 64K) the
benefits of a good stripe cache become more significant, because the stripe
size will be hundreds of KiB to a small number of MiB and hence it becomes
much more likely that a sequential workload will issue many I/Os to the same
stripe.
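A minimal sketch of the kind of bounded cache this implies is shown below:
shard extents are kept keyed by (object, stripe), with a small byte budget and
oldest-first eviction. The class and its interface are assumptions for
illustration, not the existing extent cache code.

.. code-block:: cpp

    #include <cstdint>
    #include <list>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // Bounded stripe cache sketch: recently read or written shard extents are
    // kept for a short while, keyed by (object, stripe number), so that the
    // next update to the same stripe can skip re-reading old data and coding
    // parities.
    class StripeCache {
      using Key = std::pair<std::string, uint64_t>;           // (object, stripe)
      using ShardData = std::map<int, std::vector<uint8_t>>;  // shard id -> bytes

      std::map<Key, ShardData> cache_;
      std::list<Key> lru_;      // back = most recently touched stripe
      uint64_t bytes_ = 0;
      const uint64_t budget_;   // a few MiB, not GiB

      void touch(const Key& key) {
        lru_.remove(key);       // O(n), acceptable for a sketch
        lru_.push_back(key);
      }

     public:
      explicit StripeCache(uint64_t budget_bytes) : budget_(budget_bytes) {}

      void put(const Key& key, int shard, std::vector<uint8_t> data) {
        std::vector<uint8_t>& slot = cache_[key][shard];
        bytes_ -= slot.size();
        bytes_ += data.size();
        slot = std::move(data);
        touch(key);
        while (bytes_ > budget_ && !lru_.empty()) {   // evict oldest stripes
          const Key victim = lru_.front();
          lru_.pop_front();
          for (const auto& [id, buf] : cache_[victim])
            bytes_ -= buf.size();
          cache_.erase(victim);
        }
      }

      // Returns nullptr on a miss; the caller then falls back to reading the
      // shard's object store.
      const std::vector<uint8_t>* get(const Key& key, int shard) {
        auto it = cache_.find(key);
        if (it == cache_.end())
          return nullptr;
        auto sit = it->second.find(shard);
        if (sit == it->second.end())
          return nullptr;
        touch(key);
        return &sit->second;
      }
    };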
Automatically choose chunk size
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The default chunk size of 4K is good for small objects, because the data and
coding parities are rounded up to whole chunks, and because if an object has
less than one data stripe of data then the capacity overheads for the coding
parities are higher (e.g. a 4K object in a 10+2 erasure coded pool has 4K of
data and 8K of coding parity, so there is a 200% overhead). However, the
optimizations above all provide much bigger savings if the typical random
access I/O only reads or writes a single shard. This means that, so long as
objects are big enough, a larger chunk size such as 64K would be better.

Whilst the user can try to predict what their typical object size will be and
choose an appropriate chunk size, it would be better if the code could
automatically select a small chunk size for small objects and a larger chunk
size for larger objects. There will always be scenarios where an object grows
(or is truncated) and the chosen chunk size becomes inappropriate, however
reading and re-writing the object with a new chunk size when this happens
won't have that much performance impact. This also means that the chunk size
can be deduced from the object size in object_info_t, which is read before the
object's data is read/modified. Clients already provide a hint as to the
object size when creating the object, so this could be used to select a chunk
size to reduce the likelihood of having to re-stripe an object.

The thought is to support a new chunk size of auto/variable to enable this
feature. It will only be applicable for newly created pools; there will be no
way to migrate an existing pool.
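The selection policy itself could be as simple as the sketch below, which
derives a chunk size from the client's object size hint. The thresholds and
candidate sizes are illustrative assumptions rather than values chosen by this
proposal.

.. code-block:: cpp

    #include <cstdint>

    // Illustrative policy only: derive a chunk size from the client's object
    // size hint so that small objects keep a low capacity overhead while
    // larger objects get chunks big enough for small random I/O to hit a
    // single shard. Thresholds and candidate sizes are assumptions.
    uint64_t choose_chunk_size(uint64_t object_size_hint, unsigned k) {
      const uint64_t small_chunk = 4 * 1024;     // low rounding overhead
      const uint64_t large_chunk = 64 * 1024;    // small reads stay on one shard
      const uint64_t larger_chunk = 256 * 1024;

      // An object smaller than one stripe of large chunks would be rounded up
      // to a whole stripe of capacity, so keep the small chunk size.
      if (object_size_hint < large_chunk * k)
        return small_chunk;
      // Objects spanning many large-chunk stripes can afford even bigger chunks.
      if (object_size_hint >= larger_chunk * k * 8)
        return larger_chunk;
      return large_chunk;
    }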
Deep scrub support
^^^^^^^^^^^^^^^^^^

EC pools with overwrite enabled do not check CRCs, because it is too costly to
update the CRC for the object on every overwrite; instead the code relies on
BlueStore to maintain and check CRCs. When an EC pool is operating with
overwrite disabled, a CRC is kept for each shard, because it is possible to
update CRCs as the object is appended to simply by calculating a CRC for the
new data being appended and then doing a simple (quick) calculation to combine
the old and new CRCs together.

dev/osd_internals/erasure_coding/proposals.rst discusses the possibility of
keeping CRCs at a finer granularity (for example per chunk), storing these
either as an xattr or an omap (omap is more suitable, as large objects could
end up with a lot of CRC metadata) and updating these CRCs when data is
overwritten (the update would need to perform a read-modify-write at the same
granularity as the CRC). These finer granularity CRCs can then easily be
combined to produce a CRC for the whole shard or even the whole erasure coded
object.

This proposal suggests going in the opposite direction - EC overwrite pools
have survived without CRCs and relied on BlueStore up until now, so why is
this feature needed? The current code doesn't check CRCs if overwrite is
enabled, but sadly it still calculates and updates a CRC in the hinfo xattr,
even when performing overwrites, which means that the calculated value will be
garbage. This means we pay all the overheads of calculating the CRC and get no
benefits. The code can easily be fixed so that CRCs are calculated and
maintained while objects are written sequentially, but as soon as the first
overwrite to an object occurs the hinfo xattr will be discarded and CRCs will
no longer be calculated or checked. This will improve performance when objects
are overwritten, and will improve data integrity in cases where they are not.

While the thought is to abandon storing CRCs for EC objects that are
overwritten, there is an improvement that can be made to deep scrub. Currently
a deep scrub of an EC pool with overwrite enabled just checks that every shard
can read the object; there is no checking to verify that the copies on the
shards are consistent. A full consistency check would require large data
transfers between the shards so that the coding parities could be recalculated
and compared with the stored versions; in most cases this would be
unacceptably slow. However, for many erasure codes (including the default ones
used by Ceph), if the contents of a chunk are XOR'd together to produce a
longitudinal summary value, then an encoding of the longitudinal summary
values of each data shard should produce the same longitudinal summary values
as are stored by the coding parity shards. This comparison is less expensive
than the CRC checks performed by replication pools. There is a risk that, by
XORing the contents of a chunk together, a set of corruptions will cancel each
other out, but this level of check is better than no check and will be very
successful at detecting a dropped write, which will be the most common type of
corruption.
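A minimal sketch of that check for the XOR (P) parity is shown below: each
shard folds its chunk into a short longitudinal summary locally, only the
summaries need to be transferred, and the scrub compares them. A corresponding
check for the remaining parities would feed the data-shard summaries through
the plugin's encode function. The function names and the summary length are
assumptions for illustration.

.. code-block:: cpp

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Deep scrub sketch for the XOR (P) parity: each shard folds its chunk
    // down to a short summary locally, only the summaries travel between
    // OSDs, and the scrub checks that the folded data shards XOR to the
    // folded parity shard.
    using Summary = std::vector<uint8_t>;

    Summary fold_chunk(const std::vector<uint8_t>& chunk, size_t summary_len) {
      Summary s(summary_len, 0);
      for (size_t i = 0; i < chunk.size(); ++i)
        s[i % summary_len] ^= chunk[i];   // longitudinal XOR of the chunk
      return s;
    }

    // True if the data chunks and the XOR parity chunk agree. Because
    // XOR-folding is linear, fold(P) must equal the XOR of the folded data
    // chunks when all chunks are the same length.
    bool xor_parity_consistent(const std::vector<std::vector<uint8_t>>& data_chunks,
                               const std::vector<uint8_t>& p_chunk,
                               size_t summary_len = 64) {
      Summary expected(summary_len, 0);
      for (const auto& chunk : data_chunks) {
        const Summary s = fold_chunk(chunk, summary_len);
        for (size_t i = 0; i < summary_len; ++i)
          expected[i] ^= s[i];
      }
      return expected == fold_chunk(p_chunk, summary_len);
    }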
Metadata changes
----------------

What metadata do we need to consider?

1. object_info_t. Every Ceph object has some metadata stored in the
   object_info_t data structure. Some of these fields (e.g. object length) are
   not updated frequently and we can simply avoid performing partial write
   optimizations when these fields need updating. The more problematic fields
   are the version numbers and the last modification time, which are updated
   on every write. Version numbers of objects are compared to version numbers
   in PG log entries for peering/recovery and to version numbers on other
   shards for backfill. Version numbers and modification times can be read by
   clients.

2. PG log entries. The PG log is used to track inflight transactions and to
   allow incomplete transactions to be rolled forwards/backwards after an
   outage/network glitch. The PG log is also used to detect and resolve
   duplicate requests (e.g. resent due to a network glitch) from clients.
   Peering currently assumes that every shard has a copy of the log and that
   this is updated for every transaction.

3. PG stats entries and other PG metadata. There is other PG metadata (PG
   stats is the simplest example) that gets updated on every transaction.
   Currently all OSDs retain a cached and a persistent copy of this metadata.

How many copies of metadata are required?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The current implementation keeps K+M replicated copies of metadata, one copy
on each shard. The minimum number of copies that need to be kept to support up
to M failures is M+1. In theory metadata could be erasure encoded, however
given that it is small it is probably not worth the effort.

One advantage of keeping K+M replicated copies of the metadata is that any
fully in-sync shard can read the local copy of metadata, avoiding the need for
inter-OSD messages and asynchronous code paths. Specifically, this means that
any OSD not performing backfill can become the primary and can access metadata
such as object_info_t locally.

M+1 arbitrarily distributed copies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A partial write to one data shard will always involve updates to that data
shard and all M coding parity shards, therefore for optimal performance it
would be ideal if the same M+1 shards were updated to track the associated
metadata update. This means that for small random writes a different M+1
shards would get updated for each write. The drawback of this approach is that
you might need to read K shards to find the most up to date version of the
metadata.

In this design no shard will have an up to date copy of the metadata for every
object. This means that whatever shard is picked to be the acting primary, it
may not have all the metadata available locally and may need to send messages
to other OSDs to read it. This would add significant extra complexity to the
PG code and cause divergence between erasure coded pools and replicated pools.
For these reasons we discount this design option.

M+1 copies on known shards
^^^^^^^^^^^^^^^^^^^^^^^^^^

The next best performance can be achieved by always applying metadata updates
to the same M+1 shards, for example choosing the 1st data shard and all M
coding parity shards. The coding parity shards get updated by every partial
write, so this results in zero or one extra shard being updated. With this
approach only 1 shard needs to be read to find the most up to date version of
the metadata.

We can restrict the acting primary to be one of the M+1 shards, which means
that once any incomplete updates in the log have been resolved the primary
will have an up to date local copy of all the metadata. This means that much
more of the PG code can be kept unchanged.

Partial Writes and the PG log
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Peering currently assumes that every shard has a copy of the log, however
because of inflight updates and short-term absences it is possible that some
shards are missing some of the log entries. The job of peering is to combine
the logs from the set of present shards to form a definitive log of
transactions that have been committed by all the shards. Any discrepancies
between a shard's log and the definitive log are then resolved, typically by
rolling backwards transactions (using information held in the log entry) so
that all the shards are in a consistent state.

To support partial writes the log entry needs to be modified to include the
set of shards that are being updated. Peering needs to be modified to consider
a log entry as missing from a shard only if a copy of the log entry on another
shard indicates that this shard was meant to be updated.
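The shape of that change might look like the stand-in types below: the log
entry carries the set of shards the transaction touched, and peering only
treats an entry as missing on a shard that was meant to receive it. These are
illustrative structures, not the real pg_log_entry_t or peering code.

.. code-block:: cpp

    #include <bitset>
    #include <cstdint>
    #include <vector>

    // Stand-in types sketching the extra information a partial-write log
    // entry would carry and the peering check that uses it.
    struct Version {
      uint64_t epoch = 0;
      uint64_t version = 0;
      bool operator==(const Version& o) const {
        return epoch == o.epoch && version == o.version;
      }
    };

    struct PartialWriteLogEntry {
      Version version;
      std::bitset<32> updated_shards;   // which shards this transaction touched
    };

    // A log entry is only "missing" from a shard if that shard was supposed
    // to receive it; shards deliberately skipped by a partial write are not
    // treated as divergent.
    bool entry_missing_on_shard(const PartialWriteLogEntry& authoritative,
                                const std::vector<PartialWriteLogEntry>& shard_log,
                                unsigned shard) {
      if (!authoritative.updated_shards.test(shard))
        return false;                    // shard was never meant to have it
      for (const auto& e : shard_log)
        if (e.version == authoritative.version)
          return false;                  // shard has the entry
      return true;                       // shard should have it but does not
    }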
The logs are not infinite in size, and old log entries, where it is known that
the update has been successfully committed on all affected shards, are
trimmed. Log entries are first condensed to a pg_log_dup_t entry, which can no
longer assist in rollback of a transaction but can still be used to detect
duplicated client requests, and then later completely discarded. Log trimming
is performed at the same time as adding a new log entry, typically when a
future write updates the log. With partial writes, log trimming will only
occur on shards that receive updates, which means that some shards may have
stale log entries that should have been discarded.

TBD: I think the code can already cope with discrepancies in log trimming
between the shards. Clearly an in-flight trim operation may not have completed
on every shard, so small discrepancies can be dealt with, but I think an
absent OSD can cause larger discrepancies. I believe that this is resolved
during peering, with each OSD keeping a record of what the oldest log entry
should be; this gets shared between OSDs so that they can work out stale log
entries that were trimmed in absentia. Hopefully this means that only sending
log trimming updates to shards that are creating new log entries will work
without code changes.

Backfill
^^^^^^^^

Backfill is used to correct inconsistencies between OSDs that occur when an
OSD is absent for a longer period of time and the PG log entries have been
trimmed. Backfill works by comparing object versions between shards. If some
shards have out of date versions of an object then a reconstruct is performed
by the backfill process to update the shard.

If the version numbers on objects are not updated on all shards then this will
break the backfill process and cause a huge amount of unnecessary reconstruct
work. This is unacceptable, in particular for the scenario where an OSD is
just absent for maintenance for a relatively short time with noout set. The
requirement is to be able to minimize the amount of reconstruct work needed to
complete a backfill.

dev/osd_internals/erasure_coding/proposals.rst discusses the idea of each
shard storing a vector of version numbers that records the most recent update
that each pair of shards should both have participated in. By collecting this
information from at least M shards it is possible to work out what the
expected minimum version number should be for an object on a shard, and hence
deduce whether a backfill is required to update the object. The drawback of
this approach is that backfill will need to scan M shards to collect this
information, compared with the current implementation that only scans the
primary and the shard(s) being backfilled.

With the additional constraint that a known M+1 shards will always be updated,
and that the (acting) primary will be one of these shards, it will be possible
to determine whether a backfill is required just by examining the vector on
the primary and the object version on the shard being backfilled. If the
backfill target is one of the M+1 shards the existing version number
comparison is sufficient; if it is another shard then the version in the
vector on the primary needs to be compared with the version on the backfill
target. This means that backfill does not have to scan any more shards than it
currently does, however the scan of the primary does need to read the vector,
and if there are multiple backfill targets then it may need to store multiple
entries of the vector per object, increasing memory usage during the backfill.

There is only a requirement to keep the vector on the M+1 shards, and the
vector only needs K-1 entries, because we only need to track version number
differences between any of the M+1 shards (which should have the same version)
and each of the K-1 shards (which can have a stale version number). This will
slightly reduce the amount of extra metadata required.

The vector of version numbers could be stored in the object_info_t structure
or stored as a separate attribute. Our preference is to store the vector in
the object_info_t structure because typically both are accessed together, and
because this makes it easier to cache both in the same object cache. We will
keep metadata and memory overheads low by only storing the vector when it is
needed. Care is required to ensure that existing clusters can be upgraded. The
absence of the vector of version numbers implies that an object has never had
a partial update, and therefore all shards are expected to have the same
version number for the object and the existing backfill algorithm can be used.
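The comparison described above might look roughly like the stand-in types
below; the real implementation would extend object_info_t and the backfill
code rather than introduce new types. An absent vector means the object has
never had a partial write, so every shard is expected to be at the object's
version.

.. code-block:: cpp

    #include <cstdint>
    #include <optional>
    #include <vector>

    // Stand-in types sketching the backfill comparison described above.
    struct Version {
      uint64_t epoch = 0;
      uint64_t version = 0;
    };
    inline bool operator<(const Version& a, const Version& b) {
      return a.epoch != b.epoch ? a.epoch < b.epoch : a.version < b.version;
    }

    struct ObjectInfo {
      Version version;                          // version on the M+1 shards
      // One expected version per K-1 "data only" shard; absent means the
      // object has never had a partial write, so every shard should match
      // `version`.
      std::optional<std::vector<Version>> shard_versions;
    };

    // Decide whether a backfill target shard needs this object recovered.
    // `target_is_m_plus_1` is true when the target is one of the M+1 shards
    // that receive every update; otherwise `data_shard_index` selects the
    // vector entry for the target.
    bool needs_backfill(const ObjectInfo& primary_oi,
                        const Version& target_version,
                        bool target_is_m_plus_1,
                        size_t data_shard_index) {
      Version expected = primary_oi.version;
      if (!target_is_m_plus_1 && primary_oi.shard_versions)
        expected = (*primary_oi.shard_versions)[data_shard_index];
      return target_version < expected;         // stale copy: reconstruct it
    }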
Code references
"""""""""""""""

PrimaryLogPG::scan_range - this function creates a map of objects and their
version numbers. On the primary it tries to get this information from the
object cache, otherwise it reads the OI attribute. This will need changes to
deal with the vectors. To conserve memory it will need to be provided with the
set of backfill targets so it can select which part of the vector to keep.

PrimaryLogPG::recover_backfill - this function calls scan_range for the local
(primary) shard and sends MOSDPGScan to the backfill targets to get them to
perform the same scan. Once it has collected all the version numbers it
compares the primary and backfill targets to work out which objects need to be
recovered. This will also need changes to deal with the vectors when comparing
version numbers.

PGBackend::run_recovery_op - recovers a single object. For an EC pool this
involves reconstructing the data for the shards that need backfilling (read
other shards and use decode to recover). This code shouldn't need any changes.

Version number and last modification time for clients
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Clients can read the object version number and set expectations about what the
minimum version number should be when making updates. Clients can also read
the last modification time. There are use cases where it is important that
these values can be read and give consistent results, but there is also a
large number of scenarios where this information is not required.

If the object version number is only being updated on a known M+1 shards for
partial writes, then where this information is required it will need to
involve a metadata access to one of those shards. We have arranged for the
primary to be one of the M+1 shards, so I/Os submitted to the primary will
always have access to the up to date information. Direct write I/Os need to
update the M+1 shards, so it is not difficult to also return this information
to the client when completing the I/O.

Direct read I/Os are the problem case: these will only access the local shard
and will not necessarily have access to the latest version and modification
time. For simplicity we will require clients that need this information to
send requests to the primary rather than using the direct I/O optimization.
Where a client does not need this information it can use the direct I/O
optimizations. The direct read I/O optimization will still return a
(potentially stale) object version number; this may still be of use to clients
to help understand the ordering of I/Os to a chunk.

Direct Write with Metadata updates
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Here's the full picture of what a direct write performing a parity-delta-write
looks like with all the control messages:

..
ditaa:: RADOS Client | ^ | | 1 28 +-----+ | | | |<------27-------+--+ | | | | | | +-------------|->| | | | | | | |<-|----2--------+ |<--------------------------------------------+ | Seq | | | |<----------------------------+ | | | | +----3---+ | | | | | | | +--|-----------------------5-----|---+ | | | | | +--|-------4---------+ | | | | +--|-10-|------->| | | | | | | | | | +---+ | | | | | | | | | | | | | | | | | | | | v | | | | | | +----++ | | +---+ | | | | | | ^ | | | |XOR+-|--|----------15-----|-----------|---|-----+ | | | | | |13 +-|--|-------14--------|-----+ | | | | | | | | +---+ | | | | | | | | | | | | ^ | | | v | | v | | | | | | | | | +------+ | | +------+ | 6 11 | | | | | | | XOR | | | | GF | | | | | | | | | | | 18 | | | | 21 | | | | | | 12 16 | | +----+-+ | | +----+-+ | | | | | | | | | ^ | | | ^ | | | | | | | | | | | | | | | | | | | | | | | | | 17 19 | | 20 22 | | | | | | | | | | | | | | | | | | | | | v | | | v | | | v | | | | | +-+----+ | | +-+----+ | | +-+----+ | | | | +->|Extent| | +->|Extent| | +->|Extent| | | | 23 |Cache | 24 |Cache | 25 |Cache | 26 | | | +----+-+ | +----+-+ | +----+-+ | | | | ^ | | ^ | | ^ | | | | | | | | | | | | | | | +---+ 7 +---+ 8 +---+ 9 +---+ | | | | | | | | | v | v | v | v .-----. .-----. .-----. .-----. .-----. .-----. ( ) ( ) ( ) ( ) ( ) ( ) |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| | | | | | | | | | | | | | | | | | | | | | | | | ( ) ( ) ( ) ( ) ( ) ( ) `-----' `-----' `-----' `-----' `-----' `-----' Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q OSD * Xattr * No Xattr * Xattr * Xattr * OI * Stale OI * OI * OI * PG log * Partial PG log * PG log * PG log * PG stats * No PG stats * PG stats * PG stats Note: Only the primary OSD and parity coding OSDs (the M+1 shards) have Xattr, up to date object info, PG log and PG stats. Only one of these OSDs is permitted to become the (acting) primary. The other data OSDs 2,3 and 4 (the K-1 shards) do not have Xattrs or PG stats, may have state object info and only have PG log entries for their own updates. OSDs 2,3 and 4 may have stale OI with an old version number. The other OSDs have the latest OI and a vector with the expected version numbers for OSDs 2,3 and 4. 1. Data message with Write I/O from client (MOSDOp) 2. Control message to Primary with Xattr (new msg MOSDEcSubOpSequence) Note: the primary needs to be told about any xattr update so it can update its copy, but the main purpose of this message is to allow the primary to sequence the write I/O. The reply message at step 10 is what allows the write to start and provides the PG stats and new object info including the new version number. If necessary the primary can delay this to ensure that recovery/backfill of the object is completed first and deal with overlapping writes. Data may be read (prefetched) before the reply, but obviously no transactions can start. 3. Prefetch request to local extent cache 4. Control message to P to prefetch to extent cache (new msg MOSDEcSubOpPrefetch equivalent of MOSDEcSubOpRead) 5. Control message to Q to prefetch to extent cache (new msg MOSDEcSubOpPrefetch equivalent of MOSDEcSubOpRead) 6. Primary reads object info 7. Prefetch old data 8. Prefetch old P 9. Prefetch old Q Note: The objective of these prefetches is to get the old data, P and Q reads started as quickly as possible to reduce the latency of the whole I/O. There may be error scenarios where the extent cache is not able to retain this and it will need to be re-read. 
This includes the rare/pathological scenarios where there is a mixture of
writes sent to the primary and writes sent directly to the data OSD for the
same object.

10. Control message to data OSD with new object info + PG stats (new msg
    MOSDEcSubOpSequenceReply)
11. Transaction to update object info + PG log + PG stats
12. Fetch old data (hopefully cached)

    Note: For best performance we want to pipeline writes to the same stripe.
    The primary assigns the version number to each write and consequently
    defines the order in which writes should be processed. It is important
    that the data shard and the coding parity shards apply overlapping writes
    in the same order. The primary knows what set of writes are in flight, so
    it can detect this situation and indicate in its reply message at step 10
    that an update must wait until an earlier update has been applied. This
    information needs to be forwarded to the coding parities (steps 14 and 15)
    so they can also ensure updates are applied in the same order.

13. XOR new and old data to create delta
14. Data message to P with delta + Xattr + object info + PG log + PG stats
    (new msg MOSDEcSubOpDelta equivalent of MOSDEcSubOpWrite; see the sketch
    after step 28)
15. Data message to Q with delta + Xattr + object info + PG log + PG stats
    (new msg MOSDEcSubOpDelta equivalent of MOSDEcSubOpWrite)
16. Transaction to update data + object info + PG log
17. Fetch old P (hopefully cached)
18. XOR delta and old P to create new P
19. Transaction to update P + Xattr + object info + PG log + PG stats
20. Fetch old Q (hopefully cached)
21. XOR delta and old Q to create new Q
22. Transaction to update Q + Xattr + object info + PG log + PG stats
23. Control message to data OSD for commit (new msg MOSDEcSubOpDeltaReply
    equivalent of MOSDEcSubOpWriteReply)
24. Local commit notification
25. Control message to data OSD for commit (new msg MOSDEcSubOpDeltaReply
    equivalent of MOSDEcSubOpWriteReply)
26. Control message to data OSD for commit (new msg MOSDEcSubOpDeltaReply
    equivalent of MOSDEcSubOpWriteReply)
27. Control message to Primary to signal end of write (variant of new msg
    MOSDEcSubOpSequence)
28. Control message reply to client (MOSDOpReply)
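For illustration only, the stand-in structures below sketch the kind of
payload the proposed MOSDEcSubOpDelta and MOSDEcSubOpDeltaReply messages
(steps 14, 15 and 23-26 above) might carry; the field names and types are
assumptions, not a definition of the new messages.

.. code-block:: cpp

    #include <cstdint>
    #include <string>
    #include <vector>

    // Rough sketch of what an MOSDEcSubOpDelta message might carry from the
    // data shard to a coding parity shard. Fields are illustrative only.
    struct EcSubOpDelta {
      std::string object;                  // target object id
      uint64_t shard_offset = 0;           // where in the parity chunk to apply it
      std::vector<uint8_t> delta;          // old_data XOR new_data

      // Metadata that keeps the M+1 shards authoritative:
      std::vector<uint8_t> xattr_update;   // serialized xattr changes, if any
      std::vector<uint8_t> object_info;    // new object_info_t (with version)
      std::vector<uint8_t> pg_log_entries; // log entries recording the update
      std::vector<uint8_t> pg_stats;       // updated PG stats

      uint64_t apply_after_version = 0;    // ordering hint from the primary so
                                           // overlapping writes commit in order
    };

    // The reply (MOSDEcSubOpDeltaReply) would only need to identify the
    // transaction being acknowledged back to the data shard.
    struct EcSubOpDeltaReply {
      std::string object;
      uint64_t version = 0;                // version assigned by the primary
    };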
Upgrade and backwards compatibility
-----------------------------------

A few of the optimizations can be made just by changing code on the primary
OSD, with no backwards compatibility concerns regarding clients or the other
OSDs. These optimizations will be enabled as soon as the primary OSD upgrades
and will replace the existing code paths.

The remainder of the changes will be new I/O code paths that will exist
alongside the existing code paths. Similar to EC overwrites, many of the
changes will need to ensure that all OSDs are running new code and that the EC
plugins support the new interfaces required for parity-delta-writes. A new
pool-level flag will be required to enforce this. It will be possible to
enable this flag (and hence enable the new performance optimizations) after
upgrading an existing cluster. Once set, it will not be possible to add down
level OSDs to the pool. It will not be possible to turn this flag off other
than by deleting the pool.

Downgrade is not supported because:

1. It is not trivial to quiesce all I/O to a pool to ensure that none of the
   new I/O code paths are in use when the flag is cleared.

2. The PG log format for new I/Os will not be understood by down level OSDs.
   It would be necessary to ensure the log has been trimmed of all new format
   entries before clearing the flag to ensure that down level OSDs will be
   able to interpret the log.

3. Additional xattr data will be stored by the new I/O code paths and used by
   backfill. Down level code will not understand how to backfill a pool that
   has been running the new I/O paths and will get confused by the
   inconsistent object version numbers.

While it is theoretically possible to disable partial updates and then scan
and update all the metadata to return the pool to a state where a downgrade is
possible, we have no intention of writing this code.

The direct I/O changes will additionally require clients to be running new
code. They will require that the pool has the new flag set and that a new
client is used. Old clients can use pools with the new flag set, just without
the direct I/O optimization.

Not under consideration
-----------------------

There is a list of enhancements discussed in
doc/dev/osd_internals/erasure_coding/proposals.rst; the following are not
under consideration:

1. RADOS Client Acknowledgement Generation optimization

   When updating K+M shards in an erasure coded pool, in theory you don't have
   to wait for all the updates to complete before completing the update to the
   client, because so long as K updates have completed any viable subset of
   shards should be able to roll the update forward. For partial writes where
   only M+1 shards are updated this optimization does not apply, as all M+1
   updates need to complete before the update is completed to the client. This
   optimization would require changes to the peering code to work out whether
   partially completed updates need to be rolled forwards or backwards. To
   roll an update forwards it would be simplest to mark the object as missing
   and use the recovery path to reconstruct and push the update to OSDs that
   are behind.

2. Avoid sending read requests to the local OSD via the Messenger

   The EC backend code has an optimization for writes to the local OSD which
   avoids sending a message and reply via the messenger. The equivalent
   optimization could be made for reads as well, although a bit more care is
   required because the read is synchronous and will block the thread waiting
   for the I/O to complete. Pull request
   https://github.com/ceph/ceph/pull/57237 is making this optimization.

Stories
=======

This is our high level breakdown of the work. Our intention is to deliver it
as a series of PRs. The stories are roughly in the order we plan to develop
them. Each story is at least one PR; where possible they will be broken up
further.

The earlier stories can be implemented as standalone pieces of work and will
not introduce upgrade/backwards compatibility issues. The later stories will
start breaking backwards compatibility; for these we plan to add a new flag to
the pool to enable the new features. Initially this will be an experimental
flag while the later stories are developed.
Test tools - enhanced I/O generator for testing erasure coding
---------------------------------------------------------------

* Extend rados bench to be able to generate more interesting patterns of I/O
  for erasure coding, in particular reading and writing at different offsets
  and for different lengths, making sure we get good coverage of boundary
  conditions such as the sub-chunk size, chunk size and stripe size
* Improve data integrity checking by using a seed to generate data patterns
  and remembering which seed is used for each block that is written, so that
  data can later be validated

Test tools - offline consistency checking tool
----------------------------------------------

* Test tools for performing offline consistency checks, combining use of
  ceph-objectstore-tool with ceph-erasure-code-tool
* Enhance some of the teuthology standalone erasure code checks to use this
  tool

Test tools - online consistency checking tool
---------------------------------------------

* New CLI to be able to perform online consistency checking for an object or a
  range of objects, reading all the data and coding parity shards and
  re-encoding the data to validate the coding parities

Switch from JErasure to ISA-L
-----------------------------

The JErasure library has not been updated since 2014; the ISA-L library is
maintained and exploits newer instruction sets (e.g. AVX512, AVX2), which
provide faster encoding/decoding.

* Change defaults to ISA-L in upstream Ceph
* Benchmark JErasure and ISA-L
* Refactor Ceph isa_encode region_xor() to use AVX when M=1
* Documentation updates
* Present results at performance weekly

Sub Stripe Reads
----------------

Ceph currently reads an integer number of stripes and discards unneeded data.
In particular for small random reads it will be more efficient to just read
the required data.

* Help finish Pull Request https://github.com/ceph/ceph/pull/55196 if not
  already complete
* Further changes to issue sub-chunk reads rather than full-chunk reads

Simple Optimizations to Overwrite
---------------------------------

Ceph overwrites currently read an integer number of stripes, merge the new
data and write an integer number of stripes. This story makes simple
improvements by applying the same optimizations as for sub stripe reads and,
for small (sub-chunk) updates, reducing the amount of data being read/written
on each shard.

* Only read chunks that are not being fully overwritten (the code currently
  reads the whole stripe and then merges the new data)
* Perform sub-chunk reads for sub-chunk updates
* Perform sub-chunk writes for sub-chunk updates

Eliminate unnecessary chunk writes but keep metadata transactions
-----------------------------------------------------------------

This story avoids re-writing data that has not been modified. A transaction is
still applied to every OSD to update object metadata, the PG log and PG stats.

* Continue to create transactions for all chunks but without the new write
  data
* Add sub-chunk writes to transactions where data is being modified

Avoid zero padding objects to a full stripe
-------------------------------------------

Objects are rounded up to an integer number of stripes by adding zero padding.
These buffers of zeros are then sent in messages to other OSDs and written to
the object store, consuming storage. This story makes optimizations to remove
the need for this padding.

* Modifications to reconstruct reads to avoid reading zero-padding at the end
  of an object - just fill the read buffer with zeros instead
* Avoid transfers/writes of buffers of zero padding.
  Still send transactions to all shards and create the object, just don't
  populate it with zeros
* Modifications to encode/decode functions to avoid having to pass in buffers
  of zeros when objects are padded

Erasure coding plugin changes to support distributed partial writes
--------------------------------------------------------------------

This is preparatory work for future stories; it adds new APIs to the erasure
code plugins.

* Add a new interface to create a delta by XORing old and new data together,
  and implement this for the ISA-L and JErasure plugins
* Add a new interface to apply a delta to one coding parity using XOR/GF
  arithmetic, and implement this for the ISA-L and JErasure plugins
* Add a new interface which reports which erasure codes support this feature
  (ISA-L and JErasure will support it, others will not)

Erasure coding interface to allow RADOS clients to direct I/Os to OSD storing the data
----------------------------------------------------------------------------------------

This is preparatory work for future stories; it adds a new API for clients.

* New interface to convert the pair (pg, offset) to {OSD, remaining chunk
  length}

We do not want clients to have to dynamically link to the erasure code
plugins, so this code will need to be part of librados. However, this
interface needs to understand how erasure codes distribute data and coding
chunks to be able to perform this translation. We will only support the ISA-L
and JErasure plugins, where there is a trivial striping of data chunks to
OSDs.

Changes to object_info_t
------------------------

This is preparatory work for future stories. It adds the vector of version
numbers to object_info_t which will be used for partial updates. For
replicated pools, and for erasure coded objects that are not overwritten, we
will avoid storing extra data in object_info_t.

Changes to PGLog and Peering to support updating a subset of OSDs
-----------------------------------------------------------------

This is preparatory work for future stories.

* Modify the PG log entry to store a record of which OSDs are being updated
* Modify peering to use this extra data to work out which OSDs are missing
  updates

Change to selection of (acting) primary
---------------------------------------

This is preparatory work for future stories. Constrain the choice of primary
to be the first data OSD or one of the erasure coding parities. If none of
these OSDs are available and up to date then the pool must be offline.

Implement parity-delta-write with all computation on the primary
----------------------------------------------------------------

* Calculate whether it is more efficient for an update to perform a full
  stripe overwrite or a parity-delta-write (see the sketch after this list)
* Implement new code paths to perform the parity-delta-write
* Test tool enhancements. We want to make sure that both parity-delta-write
  and full-stripe write are tested. We will add a new conf file option with a
  choice of 'parity-delta', 'full-stripe', 'mixture for testing' or
  'automatic' and update teuthology test cases to predominantly use a mixture.
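A rough sketch of the kind of calculation the first bullet describes is shown
below, counting shard reads and writes for each technique. The cost model is
an assumption; the real decision would also account for bytes transferred and
for data already held in the stripe cache.

.. code-block:: cpp

    #include <cstdint>

    struct UpdatePlan {
      bool use_parity_delta;
      unsigned shards_read;
      unsigned shards_written;
    };

    // Rough shard-count heuristic for choosing between a parity-delta-write
    // and a full-stripe read/merge/write. Assumes the update lies within one
    // stripe and len > 0.
    UpdatePlan choose_update(uint64_t off, uint64_t len,
                             uint64_t chunk_size, unsigned k, unsigned m) {
      const uint64_t stripe_size = chunk_size * k;
      const uint64_t first_chunk = (off % stripe_size) / chunk_size;
      const uint64_t last_chunk = ((off + len - 1) % stripe_size) / chunk_size;
      const unsigned touched = static_cast<unsigned>(last_chunk - first_chunk + 1);

      // Parity-delta: read and write each touched data chunk and every parity.
      const unsigned pdw_ops = 2 * (touched + m);
      // Full-stripe: read the chunks outside the touched range, write everything.
      const unsigned fsw_ops = (k - touched) + (k + m);

      UpdatePlan plan;
      plan.use_parity_delta = pdw_ops <= fsw_ops;
      plan.shards_read = plan.use_parity_delta ? touched + m : k - touched;
      plan.shards_written = plan.use_parity_delta ? touched + m : k + m;
      return plan;
    }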
Upgrades and backwards compatibility
------------------------------------

* Add a new feature flag for erasure coded pools
* All OSDs must be running the new code before the flag can be enabled on the
  pool
* Clients may only issue direct I/Os if the flag is set
* OSDs running old code may not join a pool with the flag set
* It is not possible to turn the feature flag off (other than by deleting the
  pool)

Changes to Backfill to use the vector in object_info_t
------------------------------------------------------

This is preparatory work for future stories.

* Modify the backfill process to use the vector of version numbers in
  object_info_t so that when partial updates occur we do not backfill OSDs
  which did not participate in the partial update
* When there is a single backfill target, extract the appropriate version
  number from the vector (no additional storage required)
* When there are multiple backfill targets, extract the subset of the vector
  required by the backfill targets and select the appropriate entry when
  comparing version numbers in PrimaryLogPG::recover_backfill

Test tools - offline metadata validation tool
---------------------------------------------

* Test tools for performing offline consistency checking of metadata, in
  particular checking that the vector of version numbers in object_info_t
  matches the versions on each OSD, but also for validating PG log entries

Eliminate transactions on OSDs not updating data chunks
-------------------------------------------------------

Peering, log recovery and backfill can now all cope with partial updates
using the vector of version numbers in object_info_t.

* Modify the overwrite I/O path so that it no longer issues metadata-only
  transactions (except to the Primary OSD)
* Modify the update of the version numbers in object_info_t to use the vector
  and only update entries that are receiving a transaction
* Modify the generation of the PG log entry to record which OSDs are being
  updated

Direct reads to OSDs (single chunk only)
----------------------------------------

* Modify OSDClient to route single chunk read I/Os to the OSD storing the
  data (see the sketch below)
* Modify the OSD to accept reads on a non-primary OSD (expand the existing
  changes for replicated pools to work with EC pools as well)
* If necessary, fail the read with EAGAIN if the OSD is unable to process the
  read directly
* Modify OSDClient to retry the read by submitting it to the Primary OSD if
  the read is failed with EAGAIN
* Test tool enhancements. We want to make sure that both direct reads and
  reads to the primary are tested. We will add a new conf file option with a
  choice of 'prefer direct', 'primary only' or 'mixture for testing' and
  update teuthology test cases to predominantly use a mixture.

The changes will be made to the OSDC part of the RADOS client so they will be
applicable to rbd, rgw and cephfs. We will not make changes to other code
that has its own version of the RADOS client code, such as krbd, although
this could be done in the future.
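Routing a read directly to the OSD storing the data relies on the offset
translation interface described earlier. Below is a minimal sketch of that
translation, assuming the trivial striping used by the ISA-L and JErasure
plugins; the names are illustrative, and mapping the shard index to an actual
OSD is a separate step using the PG's acting set.

.. code-block:: cpp

    #include <cstdint>

    // Trivial striping: the object is laid out across the K data shards one
    // chunk at a time, so each shard stores every K-th chunk of the object.
    struct ChunkLocation {
      unsigned shard;          // data shard index, 0..K-1
      uint64_t shard_offset;   // byte offset within that shard
      uint64_t remaining;      // bytes readable from this shard before the
                               // read crosses into the next chunk
    };

    ChunkLocation locate(uint64_t offset, uint64_t chunk_size, unsigned k)
    {
      uint64_t chunk_index = offset / chunk_size;  // which chunk of the object
      uint64_t stripe = chunk_index / k;           // which stripe it falls in
      ChunkLocation loc;
      loc.shard = static_cast<unsigned>(chunk_index % k);
      loc.shard_offset = stripe * chunk_size + (offset % chunk_size);
      loc.remaining = chunk_size - (offset % chunk_size);
      return loc;
    }

A read that fits within the remaining length can be sent to a single OSD;
anything longer spans chunks and is covered by the multiple chunks story
below.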
Direct reads to OSDs (multiple chunks)
--------------------------------------

* Add a new OSDC flag NONATOMIC which allows OSDC to split a read into
  multiple requests
* Modify OSDC to split reads spanning multiple chunks into separate requests
  to each OSD if the NONATOMIC flag is set
* Modifications to OSDC to coalesce the results (if any sub-read fails the
  whole read needs to fail)
* Changes to the librbd client to set the NONATOMIC flag for reads
* Changes to the cephfs client to set the NONATOMIC flag for reads

We are only changing a very limited set of clients, focusing on those that
issue smaller reads and are latency sensitive. Future work could look at
extending the set of clients (including krbd).

Implement distributed parity-delta-write
----------------------------------------

* Implement new messages MOSDEcSubOpDelta and MOSDEcSubOpDeltaReply
* Change the primary to calculate the delta and send the MOSDEcSubOpDelta
  message to the coding parity OSDs
* Modify the coding parity OSDs to apply the delta and send the
  MOSDEcSubOpDeltaReply message

Note: This change will increase latency because the coding parity reads start
after the old data read. Future work will fix this.

Test tools - EC error injection thrasher
----------------------------------------

* Implement a new type of thrasher that specifically injects faults to stress
  erasure coded pools
* Take one or more (up to M) OSDs down, with more focus on taking different
  subsets of OSDs down to drive all the different EC recovery paths than on
  stressing peering/recovery/backfill (the existing OSD thrasher excels at
  this)
* Inject read I/O failures to force reconstructs using decode for single and
  multiple failures
* Inject delays using an osd tell type interface to make it easier to test an
  OSD going down at all the interesting stages of EC I/Os
* Inject delays using an osd tell type interface to slow down an OSD
  transaction or message to expose the less common completion orders for
  parallel work

Implement prefetch message MOSDEcSubOpPrefetch and modify extent cache
----------------------------------------------------------------------

* Implement new message MOSDEcSubOpPrefetch
* Change the primary to issue this message to the coding parity OSDs before
  starting the read of old data
* Change the extent cache so that each OSD caches its own data rather than
  caching everything on the primary
* Change the coding parity OSDs to handle this message and read the old
  coding parity into the extent cache
* Changes to the extent cache to retain the prefetched old parity until the
  MOSDEcSubOpDelta message is received, and to discard it on error paths
  (e.g. a new OSDMap)

Implement sequencing message MOSDEcSubOpSequence
------------------------------------------------

* Implement new messages MOSDEcSubOpSequence and MOSDEcSubOpSequenceReply
* Modify the primary code to create these messages and route them locally to
  itself in preparation for direct writes

Direct writes to OSD (single chunk only)
----------------------------------------

* Modify OSDC to route single chunk write I/Os to the OSD storing the data
* Changes to issue MOSDEcSubOpSequence and MOSDEcSubOpSequenceReply between
  the data OSD and the primary OSD

Direct writes to OSD (multiple chunks)
--------------------------------------

* Modifications to OSDC to split multiple chunk writes into separate requests
  if the NONATOMIC flag is set
* Further changes to coalescing completions (in particular reporting the
  version number correctly)
* Changes to the librbd client to set the NONATOMIC flag for writes
* Changes to the cephfs client to set the NONATOMIC flag for writes

We are only changing a very limited set of clients, focusing on those that
issue smaller writes and are latency sensitive. Future work could look at
extending the set of clients.

Deep scrub / CRC
----------------

* Disable CRC generation in the EC code for overwrites and delete the hinfo
  xattr when the first overwrite occurs
* For objects in a pool with the new feature flag set that have not been
  overwritten, check the CRC even if the pool overwrite flag is set. The
  presence/absence of hinfo can be used to determine whether the object has
  been overwritten
* For deep scrub requests, XOR the contents of the shard to create a
  longitudinal check (8 bytes wide?), as sketched below
* Return the longitudinal check in the scrub reply message and have the
  primary encode the set of longitudinal replies to check for inconsistencies
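One possible shape for the longitudinal check, assuming the 8-byte width
suggested above: each OSD folds its shard into a single 64-bit value by
XORing successive words together, so only 8 bytes per shard need to be
returned to the primary. The function name is illustrative.

.. code-block:: cpp

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Fold a shard's contents into an 8-byte longitudinal check by XORing
    // successive 64-bit words together; a trailing partial word is zero
    // padded. (The 8-byte width is still an open question in the story above.)
    uint64_t longitudinal_check(const std::vector<uint8_t>& shard)
    {
      uint64_t acc = 0;
      size_t i = 0;
      for (; i + 8 <= shard.size(); i += 8) {
        uint64_t word = 0;
        for (size_t b = 0; b < 8; ++b)
          word |= static_cast<uint64_t>(shard[i + b]) << (8 * b);
        acc ^= word;
      }
      uint64_t tail = 0;                     // zero-padded final partial word
      for (size_t b = 0; i + b < shard.size(); ++b)
        tail |= static_cast<uint64_t>(shard[i + b]) << (8 * b);
      return acc ^ tail;
    }

Because both the fold and the erasure code are linear, the primary should be
able to run the normal encode over the K data-shard checks and compare the
result with the parity-shard checks, flagging an inconsistent stripe without
transferring any shard data.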
Variable chunk size erasure coding
----------------------------------

* Implement a new pool option for an automatic/variable chunk size
* When the object size is small, use a small chunk size (4K) if the pool is
  using the new option
* When the object size is large, use a large chunk size (64K or 256K?)
* Convert the chunk size by reading and re-writing the whole object when a
  small object grows (append)
* Convert the chunk size by reading and re-writing the whole object when a
  large object shrinks (truncate)
* Use the object size hint to avoid creating small objects and then almost
  immediately converting them to a larger chunk size

CLAY Erasure Codes
------------------

In theory CLAY erasure codes should be good for K+M erasure codes with larger
values of M, in particular when these erasure codes are used with multiple
OSDs in the same failure domain (e.g. an 8+6 erasure code with 5 servers each
with 4 OSDs). We would like to improve the test coverage for CLAY and perform
some more benchmarking to collect data to help substantiate when people
should consider using CLAY.

* Benchmark CLAY erasure codes - in particular the number of I/Os required
  for backfills when multiple OSDs fail
* Enhance test cases to validate the implementation
* See also https://bugzilla.redhat.com/show_bug.cgi?id=2004256