diff --git a/doc/dev/osd_internals/erasure_coding.rst b/doc/dev/osd_internals/erasure_coding.rst index 40064961bba..7495c521b81 100644 --- a/doc/dev/osd_internals/erasure_coding.rst +++ b/doc/dev/osd_internals/erasure_coding.rst @@ -85,3 +85,4 @@ Table of contents Developer notes Jerasure plugin High level design document + Erasure coding enhancements design document diff --git a/doc/dev/osd_internals/erasure_coding/enhancements.rst b/doc/dev/osd_internals/erasure_coding/enhancements.rst new file mode 100644 index 00000000000..dddf974c82b --- /dev/null +++ b/doc/dev/osd_internals/erasure_coding/enhancements.rst @@ -0,0 +1,1517 @@ +=========================== +Erasure coding enhancements +=========================== + +Objectives +========== + +Our objective is to improve the performance of erasure coding, in particular +for small random accesses to make it more viable to use erasure coding pools +for storing block and file data. + +We are looking to reduce the number of OSD read and write accesses per client +I/O (sometimes referred to as I/O amplification), reduce the amount of network +traffic between OSDs (network bandwidth) and reduce I/O latency (time to +complete read and write I/O operations). We expect the changes will also +provide modest reductions to CPU overheads. + +While the changes are focused on enhancing small random accesses, some +enhancements will provide modest benefits for larger I/O accesses and for +object storage. + +The following sections give a brief description of the improvements we are +looking to make. Please see the later design sections for more details + +Current Read Implementation +--------------------------- + +For reference this is how erasure code reads currently work + +.. ditaa:: + + RADOS Client + * Current code reads all data chunks + ^ * Discards unneeded data + | * Returns requested data to client + +----+----+ + | Discard | If data cannot be read then the coding parity + |unneeded | chunks are read as well and are used to reconstruct + | data | the data + +---------+ + ^^^^ + |||| + |||| + |||| + |||+----------------------------------------------+ + ||+-------------------------------------+ | + |+----------------------------+ | | + | | | | + .-----. .-----. .-----. .-----. .-----. .-----. + ( ) ( ) ( ) ( ) ( ) ( ) + |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| + | | | | | | | | | | | | + | | | | | | | | | | | | + ( ) ( ) ( ) ( ) ( ) ( ) + `-----' `-----' `-----' `-----' `-----' `-----' + Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q + OSD + +Note: All the diagrams illustrate a K=4 + M=2 configuration, however the +concepts and techniques can be used for all K+M configurations. + +Partial Reads +------------- + +If only a small amount of data is being read it is not necessary to read the +whole stripe, for small I/Os ideally only a single OSD needs to be involved in +reading the data. See also larger chunk size below. + +.. ditaa:: + + RADOS Client + * Optimize by only reading required chunks + ^ * For large chunk sizes and sub-chunk reads only + | read a sub-chunk + +----+----+ + | Return | If data cannot be read then extra data and coding + | data | parity chunks are read as well and are used to + | | reconstruct the data + +---------+ + ^ + | + | + | + | + | + +----------------------------+ + | + .-----. .-----. .-----. .-----. .-----. .-----. 
+ ( ) ( ) ( ) ( ) ( ) ( ) + |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| + | | | | | | | | | | | | + | | | | | | | | | | | | + ( ) ( ) ( ) ( ) ( ) ( ) + `-----' `-----' `-----' `-----' `-----' `-----' + Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q + OSD + +Pull Request https://github.com/ceph/ceph/pull/55196 is implementing most of +this optimization, however it still issues full chunk reads. + +Current Overwrite Implementation +-------------------------------- + +For reference here is how erasure code overwrites currently work + +.. ditaa:: + + RADOS Client + | * Read all data chunks + | * Merges new data + +----v-----+ * Encodes new coding parities + | Read old | * Writes data and coding parities + |Merge new | + | Encode |-------------------------------------------------------------+ + | Write |---------------------------------------------------+ | + +----------+ | | + ^|^|^|^| | | + |||||||+-------------------------------------------+ | | + ||||||+-------------------------------------------+| | | + |||||+-----------------------------------+ || | | + ||||+-----------------------------------+| || | | + |||+---------------------------+ || || | | + ||+---------------------------+| || || | | + |v |v |v |v v v + .-----. .-----. .-----. .-----. .-----. .-----. + ( ) ( ) ( ) ( ) ( ) ( ) + |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| + | | | | | | | | | | | | + | | | | | | | | | | | | + ( ) ( ) ( ) ( ) ( ) ( ) + `-----' `-----' `-----' `-----' `-----' `-----' + Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q + OSD + +Partial Overwrites +------------------ + +Ideally we aim to be able to perform updates to erasure coded stripes by only +updating a subset of the shards (those with modified data or coding +parities). Avoiding performing unnecessary data updates on the other shards is +easy, avoiding performing any metadata updates on the other shards is much +harder (see design section on metadata updates). + +.. ditaa:: + + RADOS Client + | * Only read chunks that are not being overwritten + | * Merge new data + +----v-----+ * Encode new coding parities + | Read old | * Only write modified data and parity shards + |Merge new | + | Encode |-------------------------------------------------------------+ + | Write |---------------------------------------------------+ | + +----------+ | | + ^ |^ ^ | | + | || | | | + | || +-------------------------------------------+ | | + | || | | | + | |+-----------------------------------+ | | | + | +---------------------------+ | | | | + | | | | | | + | v | | v v + .-----. .-----. .-----. .-----. .-----. .-----. + ( ) ( ) ( ) ( ) ( ) ( ) + |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| + | | | | | | | | | | | | + | | | | | | | | | | | | + ( ) ( ) ( ) ( ) ( ) ( ) + `-----' `-----' `-----' `-----' `-----' `-----' + Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q + OSD + +This diagram is overly simplistic, only showing the data flows. The simplest +implementation of this optimization retains a metadata write to every +OSD. With more effort it is possible to reduce the number of metadata updates +as well, see design below for more details. + +Parity-delta-write +------------------ + +A common technique used by block storage controllers implementing RAID5 and +RAID6 is to implement what is sometimes called a parity delta write. 
When a +small part of the stripe is being overwritten it is possible to perform the +update by reading the old data, XORing this with the new data to create a +delta and then read each coding parity, apply the delta and write the new +parity. The advantage of this technique is that it can involve a lot less I/O, +especially for K+M encodings with larger values of K. The technique is not +specific to M=1 and M=2, it can be applied with any number of coding parities. + +.. ditaa:: + + Parity delta writes + * Read old data and XOR with new data to create a delta + RADOS Client * Read old encoding parities apply the delta and write + | the new encoding parities + | + | For K+M erasure codings where K is larger and M is small + | +-----+ +-----+ this is much more efficient + +->| XOR |-+->| GF |---------------------------------------------------+ + +-+->| | | | |<------------------------------------------------+ | + | | +-----+ | +-----+ | | + | | | | | + | | | +-----+ | | + | | +->| XOR |-----------------------------------------+ | | + | | | |<--------------------------------------+ | | | + | | +-----+ | | | | + | | | | | | + | | | | | | + | +-------------------------------+ | | | | + +-------------------------------+ | | | | | + | | | | | | + | v | v | v + .-----. .-----. .-----. .-----. .-----. .-----. + ( ) ( ) ( ) ( ) ( ) ( ) + |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| + | | | | | | | | | | | | + | | | | | | | | | | | | + ( ) ( ) ( ) ( ) ( ) ( ) + `-----' `-----' `-----' `-----' `-----' `-----' + Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q + OSD + +Direct Read I/O +--------------- + +We want clients to submit small I/Os directly to the OSD that stores the data +rather than directing all I/O requests to the Primary OSD and have it issue +requests to the secondary OSDs. By eliminating an intermediate hop this +reduces network bandwidth and improves I/O latency + +.. ditaa:: + + RADOS Client + ^ + | + +----+----+ Client sends small read requests directly to OSD + | Return | avoiding extra network hop via Primary + | data | + | | + +---------+ + ^ + | + | + | + | + | + | + | + .-----. .-----. .-----. .-----. .-----. .-----. + ( ) ( ) ( ) ( ) ( ) ( ) + |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| + | | | | | | | | | | | | + | | | | | | | | | | | | + ( ) ( ) ( ) ( ) ( ) ( ) + `-----' `-----' `-----' `-----' `-----' `-----' + Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q + OSD + + +.. ditaa:: + + RADOS Client + ^ ^ + | | + +----+----+ +--+------+ Client breaks larger read + | Return | | Return | requests into separate + | data | | data | requests to multiple OSDs + | | | | + +---------+ +---------+ Note client loses atomicity + ^ ^ guarantees if this optimization + | | is used as an update could occur + | | between the two reads + | | + | | + | | + | | + | | + .-----. .-----. .-----. .-----. .-----. .-----. + ( ) ( ) ( ) ( ) ( ) ( ) + |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| + | | | | | | | | | | | | + | | | | | | | | | | | | + ( ) ( ) ( ) ( ) ( ) ( ) + `-----' `-----' `-----' `-----' `-----' `-----' + Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q + OSD + +Distributed processing of writes +-------------------------------- + +The existing erasure code implementation processes write I/Os on the primary +OSD, issuing both reads and writes to other OSDs to fetch and update data for +other shards. This is perhaps the simplest implementation, but it uses a lot +of network bandwidth. 
With parity-delta-writes it is possible to distribute +the processing across OSDs to reduce network bandwidth. + +.. ditaa:: + + Performing the coding parity delta updates on the coding parity + OSD instead of the primary OSD reduces network bandwidth + RADOS Client + | Note: A naive implementation will increase latency by serializing + | the data and coding parity reads, for best performance these + | reads need to happen in parallel + | +-----+ +-----+ + +->| XOR |-+------------------------------------------------------->| GF | + +-+->| | | | | + | | +-----+ | +----++ + | | | +-----+ ^ | + | | +--------------------------------------------->| XOR | | | + | | | | | | + | | +---+-+ | | + | +-------------------------------+ ^ | | | + +-------------------------------+ | | | | | + | | | | | | + | | | | | | + | | | | | | + | | | | | | + | v | v | v + .-----. .-----. .-----. .-----. .-----. .-----. + ( ) ( ) ( ) ( ) ( ) ( ) + |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| + | | | | | | | | | | | | + | | | | | | | | | | | | + ( ) ( ) ( ) ( ) ( ) ( ) + `-----' `-----' `-----' `-----' `-----' `-----' + Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q + OSD + +Direct Write I/O +---------------- + +.. ditaa:: + + RADOS Client + | + | Similarly Clients could direct small write I/Os + | to the OSD that needs updating + | + | +-----+ +-----+ + +->| XOR |-+--------------------->| GF | + +-----+->| | | | | + | | +-----+ | +----++ + | | | +-----+ ^ | + | | +----------->| XOR | | | + | | | | | | + | | +---+-+ | | + | | ^ | | | + | | | | | | + | | | | | | + | | | | | | + | | | | | | + | | | | | | + | v | v | v + .-----. .-----. .-----. .-----. .-----. .-----. + ( ) ( ) ( ) ( ) ( ) ( ) + |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| + | | | | | | | | | | | | + | | | | | | | | | | | | + ( ) ( ) ( ) ( ) ( ) ( ) + `-----' `-----' `-----' `-----' `-----' `-----' + Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q + OSD + +This diagram is overly simplistic, only showing the data flows - direct writes +are much harder to implement and will need control messages to the Primary to +ensure writes to the same stripe are ordered correctly + +Larger chunk size +----------------- + +The default chunk size is 4K, this is too small and means that small reads +have to be split up and processed by many OSDs. It is more efficient if small +I/Os can be serviced by a single OSD. Choosing a larger chunk size such as 64K +or 256K and implementing partial reads and writes will fix this issue, but has +the disadvantage that small sized RADOS objects get rounded up in size to a +whole stripe of capacity. + +We would like the code to automatically choose what chunk size to use to +optimize for both capacity and performance. Small objects should use a small +chunk size like 4K, larger objects should use a larger chunk size. + +Code currently rounds up I/O sizes to multiples of the chunk size, which isn't +an issue with a small chunk size. With a larger chunk size and partial +reads/writes we should round up to the page size rather than the chunk size. + +Design +====== + +We will describe the changes we aim to make in three sections, the first +section looks at the existing test tools for erasure coding and discusses the +improvements we believe will be necessary to get good test coverage for the +changes. + +The second section covers changes to the read and write I/O path. + +The third section discusses the changes to metadata to avoid the need to +update metadata on all shards for each metadata update. 
While it is possible +to implement many of the I/O path changes without reducing the number of +metadata updates, there are bigger performance benefits if the number of +metadata updates can be reduced as well. + +Test tools +---------- + +A survey of the existing test tools shows that there is insufficient coverage +of erasure coding to be able to just make changes to the code and expect the +existing CI pipelines to get sufficient coverage. Therefore one of the first +steps will be to improve the test tools to be able to get better test +coverage. + +Teuthology is the main test tool used to get test coverage and it relies +heavily on the following tests for generating I/O: + +1. **rados** task - qa/tasks/rados.py. This uses ceph_test_rados + (src/test/osd/TestRados.cc) which can generate a wide mixture of different + rados operations. There is limited support for read and write I/Os, + typically using offset 0 although there is a chunked read command used by a + couple of tests. + +2. **radosbench** task - qa/tasks/radosbench.py. This uses the **rados bench** + (src/tools/rados/rados.cc and src/common/obj_bencher.cc). Can be used to + generate sequential and random I/O workloads, offset starts at 0 for + sequential I/O. I/O size can be set but is constant for whole test. + +3. **rbd_fio** task - qa/tasks/fio.py. This uses **fio** to generate + read/write I/O to an rbd image volume + +4. **cbt** task - qa/tasks/cbt.py. This uses the Ceph benchmark tool **cbt** + to run fio or radosbench to benchmark the performance of a cluster. + +5. **rbd bench**. Some of the standalone tests use rbd bench + (src/tools/rbd/action/Bench.cc) to generate small amounts of I/O + workload. It is also used by the **rbd_pwl_cache_recovery** task. + +It is hard to use these tools to get good coverage of I/Os to non-zero (and +non-stripe aligned) offsets, or to generate a wide variety of offsets and +lengths of I/O requests including all the boundary cases for chunks and +stripes. There is scope to improve either rados, radosbench or rbd bench to +generate much more interesting I/O patterns for testing erasure coding. + +For the optimizations described above it is essential that we have good tools +for checking the consistency of either selected objects or all objects in an +erasure coded pool by checking that the data and coding parities are +coherent. There is a test tool **ceph-erasure-code-tool** which can use the +plugins to encode and decode data provided in a set of files. However there +does not seem to be any scripting in teuthology to perform consistency checks +by using objectstore tool to read data and then using this tool to validate +consistency. We will write some teuthology helpers that use +ceph-objectstore-tool and ceph-erasure-code-tool to perform offline +validation. + +We would also like an online way of performing full consistency checks, either +for specific objects or for a whole pool. Inconveniently EC pools do not +support class methods so it's not possible to use this as a way of +implementing a full consistency check. We will investigate putting a flag on a +read request, on the pool or implementing a new request type to perform a full +consistency check on an object and look at making extensions to the rados CLI +to be able to perform these tests. See also the discussion on deep scrub +below. + +When there is more than one coding parity and there is an inconsistency +between the data and the coding parities it is useful to try and analyze the +cause of the inconsistency. 
Because the multiple coding parities are providing +redundancy, there can be multiple ways of reconstructing each chunk and this +can be used to detect the most like cause of the inconsistency. For example +with a 4+2 erasure coding and a dropped write to 1st data OSD, the stripe (all +6 OSDs) will be inconsistent, as will be any selection of 5 OSDs that includes +the 1st data OSD, but data OSDs 2,3 and 4 and the two coding parity OSDs will +be still be consistent. While there are many ways a stripe could get into this +state, a tool could conclude that the most likely cause is a missed update to +OSD 1. Ceph does not have a tool to perform this type of analysis, but it +should be easy to extend ceph-erasure-code-tool. + +Teuthology seems to have adequate tools for taking OSDs offline and bringing +them back online again. There are a few tools for injecting read I/O errors +(without taking an OSD offline) but there is scope to improve these +(e.g. ability to specify a particular offset in an object that will fail a +read, more controls over setting and deleting error inject sites). + +The general philosophy of teuthology seems to be to randomly inject faults and +simply through brute force get sufficient coverage of all the error +paths. This is a good approach for CI testing, however when EC code paths +become complex and require multiple errors to occur with precise timings to +cause a particular code path to execute it becomes hard to get coverage +without running the tests for a very long time. There are some standalone +tests for EC which do test some of the multiple failure paths, but these tests +perform very limited amounts of I/O and don't inject failures while there are +I/Os in flight so miss some of the interesting scenarios. + +To deal with these more complex error paths we propose developing a new type +of thrasher for erasure coding that injects a sequence of errors and makes use +of debug hooks to capture and delay I/O requests at particular points to +ensure an error inject hits a particular timing window. To do this we will +extend the tell osd command to include extra interfaces to inject errors and +capture and stall I/Os at specific points. + +Some parts of erasure coding such as the plugins are stand alone bits of code +which can be tested with unit tests. There are already some unit tests and +performance benchmark tools for erasure coding, we will look to extend these +to get further coverage of code that can be run stand alone. + +I/O path changes +---------------- + +Avoid unnecessary reads and writes +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The current code reads too much data for read and overwrite I/Os. For +overwrites it will also rewrite unmodified data. This occurs because reads and +overwrites are rounded up to full-stripe operations. This isn’t a problem when +data is mainly being accessed sequentially but is very wasteful for random I/O +operations. The code can be changed to only read/write necessary shards. To +allow the code to efficiently support larger chunk sizes I/Os should be +rounded to page size I/Os instead of chunk sized I/Os. + +The first simple set of optimizations eliminates unnecessary reads and +unnecessary writes of data, but retains writes of metadata on all shards. This +avoids breaking the current design which depends on all shards receiving a +metadata update for every transaction. 
When changes to the metadata handling +are completed (see below) then it will be possible to make further +optimizations to reduce the number of metadata updates for additional savings. + +Parity-delta-write +^^^^^^^^^^^^^^^^^^ + +The current code implements overwrites by performing a full-stripe read, +merging the overwritten data, calculating new coding parities and performing a +full-stripe write. Reading and writing every shard is expensive, there are a +number of optimizations that can be applied to speed this up. For a K+M +configuration where M is small, it is often less work to perform a +parity-delta-write. This is implemented by reading the old data that is about +to be overwritten and XORing it with the new data to create a delta. The +coding parities can then be read, updated to apply the delta and +re-written. With M=2 (RAID6) this can result in just 3 read and 3 writes to +perform an overwrite of less than one chunk. + +Note that where a large fraction of the data in the stripe is being updated, +this technique can result in more work than performing a partial overwrite, +however if both update techniques are supported it is fairly easy to calculate +for a given I/O offset and length which is the optimal technique to use. + +Write I/Os submitted to the Primary OSD will perform this calculation to +decide whether to use a full-stripe update or a parity-delta-write. Note that +if read failures are encountered while performing a parity-delta-write and it +is necessary to reconstruct data or a coding parity then it will be more +efficient to switch to performing a full-stripe read, merge and write. + +Not all erasure codings and erasure coding libraries support the capability of +performing delta updates, however those implemented using XOR and/or GF +arithmetic should. We have checked jerasure and isa-l and confirmed that they +support this feature, although the necessary APIs are not currently exposed by +the plugins. For some erasure codes such as clay and lrc it may be possible to +apply delta updates, but the delta may need to be applied in so many places +that this makes it a worthless optimization. This proposal suggests that +parity-delta-write optimizations are initially implemented only for the most +commonly used erasure codings. Erasure code plugins will provide a new flag +indicating whether they support the new interfaces needed to perform delta +updates. + +Direct reads +^^^^^^^^^^^^ + +Read I/Os are currently directed to the primary OSD which then issues reads to +other shards. To reduce I/O latency and network bandwidth it would be better +if clients could issue direct read requests to the OSD storing the data, +rather than via the primary. There are a few error scenarios where the client +may still need to fallback to submitting reads to the primary, a secondary OSD +will have the option of failing a direct read with -EAGAIN to request the +client retries the request to the primary OSD. + +Direct reads will always be for <= one chunk. For reads of more than one chunk +the client can issue direct reads to multiple OSDs, however these will no +longer guaranteed to be atomic because an update (write) may be applied in +between the separate read requests. If a client needs atomicity guarantees +they will need to continue to send the read to the primary. + +Direct reads will be failed with EAGAIN where a reconstruct and decode +operation is required to return the data. This means only reads to primary OSD +will need to handle the reconstruct code path. 
When an OSD is backfilling we +don't want the client to have large quantities of I/O failed with EAGAIN, +therefore we will make the client detect this situation and avoid issuing +direct I/Os to a backfilling OSD. + +For backwards compatibility, for client requests that cannot cope with the +reduced guarantees of a direct read, and for scenarios where the direct read +would be to an OSD that is absent or backfilling, reads directed to the +primary OSD will still be supported. + +Direct writes +^^^^^^^^^^^^^ + +Write I/Os are currently directed to the primary OSD which then updates the +other shards. To reduce latency and network bandwidth it would be better if +clients could direct small overwrites requests directly to the OSD storing the +data, rather than via the primary. For larger write I/Os and for error +scenarios and abnormal cases clients will continue to submit write I/Os to the +primary OSD. + +Direct writes will always be for <= one chunk and will use the +parity-delta-write technique to perform the update. For medium sized writes a +client may issue direct writes to multiple OSDs, but such updates will no +longer be guaranteed to be atomic. If a client requires atomicity for a larger +write they will need to continue to send it to the primary. + +For backwards compatibility, and for scenarios where the direct write would be +to an OSD that is absent, writes directed to the primary OSD will still be +supported. + +I/O serialization, recovery/backfill and other error scenarios +"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" + +Direct writes look fairly simple until you start considering all the abnormal +scenarios. The current implementation of processing all writes on the Primary +OSD means that there is one central point of control for the stripe that can +manage things like the ordering of multiple inflight I/Os to the same stripe, +ensuring that recovery/backfill for an object has been completed before it is +accessed and assigning the object version number and modification time. + +With direct I/Os these become distributed problems. Our approach is to send a +control path message to the Primary OSD and let it continue to be the central +point of control. The Primary OSD will issue a reply when the OSD can start +the direct write and will be informed with another message when the I/O has +completed. See section below on metadata updates for more details. + +Stripe cache +^^^^^^^^^^^^ + +Erasure code pools maintain a stripe cache which stores shard data while +updates are in progress. This is required to allow writes and reads to the +same stripe to be processed in parallel. For small sequential write workloads +and for extreme hot spots (e.g. where the same block is repeatedly re-written +for some kind of crude checkpointing mechanism) there would be a benefit in +keeping the stripe cache slightly longer than the duration of the I/O. In +particularly the coding parities are typically read and written for every +update to a stripe. There is obviously a balancing act to achieve between +keeping the cache long enough that it reduces the overheads for future I/Os +versus the memory overheads of storing this data. A small (MiB as opposed to +GiB sized cache) should be sufficient for most workloads. The stripe cache can +also help reduce latency for direct write I/Os by allowing prefetch I/Os to +read old data and coding parities ready for later parts of the write operation +without requiring more complex interlocks. 
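+
+As an illustration only, the sketch below shows the sort of bounded,
+least-recently-used stripe cache described above. The class and method names
+are hypothetical and do not correspond to the existing EC extent cache or to
+any planned Ceph class; the sketch simply demonstrates holding recently
+accessed shard buffers for a short time under a fixed memory budget.
+
+.. code-block:: python
+
+   # Illustrative sketch only: a byte-budgeted LRU cache of per-stripe shard
+   # buffers.  Names are hypothetical and not part of the Ceph code base.
+   from collections import OrderedDict
+   from typing import Optional
+
+   class StripeCacheSketch:
+       def __init__(self, capacity_bytes: int = 4 * 1024 * 1024):
+           self.capacity = capacity_bytes
+           self.used = 0
+           self.entries = OrderedDict()   # (object, stripe, shard) -> bytes
+
+       def put(self, obj: str, stripe: int, shard: int, buf: bytes) -> None:
+           key = (obj, stripe, shard)
+           if key in self.entries:
+               self.used -= len(self.entries.pop(key))
+           self.entries[key] = buf
+           self.used += len(buf)
+           # Evict the least recently used buffers once over budget.
+           while self.used > self.capacity and self.entries:
+               _, evicted = self.entries.popitem(last=False)
+               self.used -= len(evicted)
+
+       def get(self, obj: str, stripe: int, shard: int) -> Optional[bytes]:
+           key = (obj, stripe, shard)
+           buf = self.entries.get(key)
+           if buf is not None:
+               self.entries.move_to_end(key)   # refresh the LRU position
+           return buf
+
+   # A writer repeatedly touching the same stripe finds the parity buffers in
+   # the cache instead of re-reading them from the shard OSDs.
+   cache = StripeCacheSketch(capacity_bytes=256 * 1024)
+   cache.put("obj1", stripe=0, shard=4, buf=bytes(64 * 1024))  # cached P chunk
+   assert cache.get("obj1", stripe=0, shard=4) is not None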
+ +The stripe cache is less important when the default chunk size is small +(e.g. 4K), because even with small write I/O requests there will not be many +sequential updates to fill a stripe. With a larger chunk size (e.g. 64K) the +benefits of a good stripe cache become more significant because the stripe +size will be 100’s KiB to small number of MiB’s and hence it becomes much more +likely that a sequential workload will issue many I/Os to the same stripe. + +Automatically choose chunk size +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The default chunk size of 4K is good for small objects because the data and +coding parities are rounded up to whole chunks and because if an object has +less than one data stripe of data then the capacity overheads for the coding +parities are higher (e.g. a 4K object in a 10+2 erasure coded pool has 4K of +data and 8K of coding parity, so there is a 200% overhead). However the +optimizations above all provide much bigger savings if the typical random +access I/O only reads or writes a single shard. This means that so long as +objects are big enough that a larger chunk size such as 64K would be better. + +Whilst the user can try and predict what their typically object size will be +and choose an appropriate chunk size, it would be better if the code could +automatically select a small chunk size for small objects and a larger chunk +size for larger objects. There will always be scenarios where an object grows +(or is truncated) and the chosen chunk size becomes inappropriate, however +reading and re-writing the object with a new chunk size when this happens +won’t have that much performance impact. This also means that the chunk size +can be deduced from the object size in object_info_t which is read before the +objects data is read/modified. Clients already provide a hint as to the object +size when creating the object so this could be used to select a chunk size to +reduce the likelihood of having to re-stripe an object + +The thought is to support a new chunk size of auto/variable to enable this +feature, it will only be applicable for newly created pools, there will be no +way to migrate an existing pool. + +Deep scrub support +^^^^^^^^^^^^^^^^^^ + +EC Pools with overwrite do not check CRCs because it is too costly to update +the CRC for the object on every overwrite, instead the code relies on +Bluestore to maintain and check CRCs. When an EC pool is operating with +overwrite disabled a CRC is kept for each shard, because it is possible to +update CRCs as the object is appended to just by calculating a CRC for the new +data being appended and then doing a simple (quick) calculation to combine the +old and new CRC together. + +In dev/osd_internals/erasure_coding/proposals.rst it discusses the possibility +of keeping CRCs at a finer granularity (for example per chunk), storing these +either as an xattr or an omap (omap is more suitable as large objects could +end up with a lot of CRC metadata) and updating these CRCs when data is +overwritten (the update would need to perform a read-modify-write at the same +granularity as the CRC). These finer granularity CRCs can then easily be +combined to produce a CRC for the whole shard or even the whole erasure coded +object. + +This proposal suggests going in the opposite direction - EC overwrite pools +have survived without CRCs and relied on Bluestore up until now, so why is +this feature needed? 
The current code doesn’t check CRCs if overwrite is +enabled, but sadly still calculates and updates a CRC in the hinfo xattr, even +if performing overwrites which mean that the calculated value will be +garbage. This means we pay all the overheads of calculating the CRC and get no +benefits. + +The code can easily be fixed so that CRCs are calculated and maintained when +objects are written sequentially, but as soon as the first overwrite to an +object occurs the hinfo xattr will be discarded and CRCs will no longer be +calculated or checked. This will improve performance when objects are +overwritten, and will improve data integrity in cases where they are not. + +While the thought is to abandon EC storing CRCs in objects being overwritten, +there is an improvement that can be made to deep scrub. Currently deep scrub +of an EC with overwrite pool just checks that every shard can read the object, +there is no checking to verify that the copies on the shards are consistent. A +full consistency check would require large data transfers between the shards +so that the coding parities could be recalculated and compared with the stored +versions, in most cases this would be unacceptably slow. However for many +erasure codes (including the default ones used by Ceph) if the contents of a +chunk are XOR’d together to produce a longitudinal summary value, then an +encoding of the longitudinal summary values of each data shard should produce +the same longitudinal summary values as are stored by the coding parity +shards. This comparison is less expensive than the CRC checks performed by +replication pools. There is a risk that by XORing the contents of a chunk +together that a set of corruptions cancel each other out, but this level of +check is better than no check and will be very successful at detecting a +dropped write which will be the most common type of corruption. + +Metadata changes +---------------- + +What metadata do we need to consider? + +1. object_info_t. Every Ceph object has some metadata stored in the + object_info_t data structure. Some of these fields (e.g. object length) are + not updated frequently and we can simply avoid performing partial writes + optimizations when these fields need updating. The more problematic fields + are the version numbers and the last modification time which are updated on + every write. Version numbers of objects are compared to version numbers in + PG log entries for peering/recovery and with version numbers on other + shards for backfill. Version numbers and modification times can be read by + clients. + +2. PG log entries. The PG log is used to track inflight transactions and to + allow incomplete transactions to be rolled forward/backwards after an + outage/network glitch. The PG log is also used to detect and resolve + duplicate requests (e.g. resent due to network glitch) from + clients. Peering currently assumes that every shard has a copy of the log + and that this is updated for every transaction. + +3. PG stats entries and other PG metadata. There is other PG metadata (PG + stats is the simplest example) that gets updated on every + transaction. Currently all OSDs retain a cached and a persistent copy of + this metadata. + +How many copies of metadata are required? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The current implementation keeps K+M replicated copies of metadata, one copy +on each shard. The minimum number of copies that need to be kept to support up +to M failures is M+1. 
In theory metadata could be erasure encoded, however +given that it is small it is probably not worth the effort. One advantage of +keeping K+M replicated copies of the metadata is that any fully in sync shard +can read the local copy of metadata, avoiding the need for inter-OSD messages +and asynchronous code paths. Specifically this means that any OSD not +performing backfill can become the primary and can access metadata such as +object_info_t locally. + +M+1 arbitrarily distributed copies +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +A partial write to one data shard will always involve updates to the data +shard and all M coding parity shards, therefore for optimal performance it +would be ideal if the same M+1 shards are updated to track the associated +metadata update. This means that for small random writes that a different M+1 +shards would get updated for each write. The drawback of this approach is that +you might need to read K shards to find the most up to date version of the +metadata. + +In this design no shard will have an up to date copy of the metadata for every +object. This means that whatever shard is picked to be the acting primary that +it may not have all the metadata available locally and may need to send +messages to other OSDs to read it. This would add significant extra complexity +to the PG code and cause divergence between Erasure coded pools and Replicated +pools. For these reasons we discount this design option. + +M+1 copies on known shards +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The next best performance can be achieved by always applying metadata updates +to the same M+1 shards, for example choosing the 1st data shard and all M +coding parity shards. Coding parity shards will get updated by every partial +write so this will result in zero or one extra shard being updated. With this +approach only 1 shard needs to be read to find the most up to date version of +the metadata. + +We can restrict the acting primary to be one of the M+1 shards, which means +that once any incomplete updates in the log have been resolved that the +primary will have an up to date local copy of all the metadata, this means +that much more of the PG code can be kept unchanged. + +Partial Writes and the PG log +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Peering currently assumes that every shard has a copy of the log, however +because of inflight updates and small term absences it is possible that some +shards are missing some of the log entries. The job of peering is to combine +the logs from the set of present shards to form a definitive log of +transactions that have been committed by all the shards. Any discrepancies +between a shards log and the definitive log are then resolved, typically by +rolling backwards transactions (using information held in the log entry) so +that all the shards are in a consistent state. + +To support partial writes the log entry needs to be modified to include the +set of shards that are being updated. Peering needs to be modified to consider +a log entry as missing from a shard only if a copy of the log entry on another +shard indicates that this shard was meant to be updated. + +The logs are not infinite in size, and old log entries where it is known that +the update has been successfully committed on all affected shards are +trimmed. Log entries are first condensed to a pg_log_dup_t entry which can no +longer assist in rollback of a transaction but can still be used to detect +duplicated client requests, and then later completely discarded. 
Log trimming is performed at the same time as adding a new log entry,
+typically when a future write updates the log. With partial writes log
+trimming will only occur on shards that receive updates, which means that
+some shards may have stale log entries that should have been discarded.
+
+TBD: I think the code can already cope with discrepancies in log trimming
+between the shards. Clearly an in-flight trim operation may not have
+completed on every shard, so small discrepancies can be dealt with, but I
+think an absent OSD can cause larger discrepancies. I believe that this is
+resolved during Peering, with each OSD keeping a record of what the oldest
+log entry should be and sharing this between OSDs so that they can work out
+stale log entries that were trimmed in absentia. Hopefully this means that
+only sending log trimming updates to shards that are creating new log entries
+will work without code changes.
+
+Backfill
+^^^^^^^^
+
+Backfill is used to correct inconsistencies between OSDs that occur when an
+OSD is absent for a longer period of time and the PG log entries have been
+trimmed. Backfill works by comparing object versions between shards. If some
+shards have out of date versions of an object then a reconstruct is performed
+by the backfill process to update the shard. If the version numbers on
+objects are not updated on all shards then this will break the backfill
+process and cause a huge amount of unnecessary reconstruct work. This is
+unacceptable, in particular for the scenario where an OSD is just absent for
+maintenance for a relatively short time with noout set. The requirement is to
+be able to minimize the amount of reconstruct work needed to complete a
+backfill.
+
+dev/osd_internals/erasure_coding/proposals.rst discusses the idea of each
+shard storing a vector of version numbers that records, for each other shard,
+the most recent update that the pair of shards both should have participated
+in. By collecting this information from at least M shards it is possible to
+work out what the expected minimum version number should be for an object on
+a shard and hence deduce whether a backfill is required to update the
+object. The drawback of this approach is that backfill will need to scan M
+shards to collect this information, compared with the current implementation
+that only scans the primary and the shard(s) being backfilled.
+
+With the additional constraint that a known M+1 shards will always be updated
+and that the (acting) primary will be one of these shards, it will be
+possible to determine whether a backfill is required just by examining the
+vector on the primary and the object version on the shard being backfilled.
+If the backfill target is one of the M+1 shards the existing version number
+comparison is sufficient; if it is another shard then the version in the
+vector on the primary needs to be compared with the version on the backfill
+target. This means that backfill does not have to scan any more shards than
+it currently does, however the scan of the primary does need to read the
+vector, and if there are multiple backfill targets it may need to store
+multiple entries of the vector per object, increasing memory usage during the
+backfill.
+
+There is only a requirement to keep the vector on the M+1 shards, and the
+vector only needs K-1 entries because we only need to track version number
+differences between any of the M+1 shards (which should have the same
+version) and each of the K-1 shards (which can have a stale version number).
This will slightly reduce the amount of extra metadata required. The vector
+of version numbers could be stored in the object_info_t structure or stored
+as a separate attribute.
+
+Our preference is to store the vector in the object_info_t structure because
+typically both are accessed together, and because this makes it easier to
+cache both in the same object cache. We will keep metadata and memory
+overheads low by only storing the vector when it is needed.
+
+Care is required to ensure that existing clusters can be upgraded. The
+absence of the vector of version numbers implies that an object has never had
+a partial update, and therefore all shards are expected to have the same
+version number for the object and the existing backfill algorithm can be
+used.
+
+Code references
+"""""""""""""""
+
+PrimaryLogPG::scan_range - this function creates a map of objects and their
+version numbers. On the primary it tries to get this information from the
+object cache, otherwise it reads the OI attribute. This will need changes to
+deal with the vectors. To conserve memory it will need to be provided with
+the set of backfill targets so it can select which part of the vector to
+keep.
+
+PrimaryLogPG::recover_backfill - this function calls scan_range for the local
+(primary) shard and sends MOSDPGScan to the backfill targets to get them to
+perform the same scan. Once it has collected all the version numbers it
+compares the primary and backfill targets to work out which objects need to
+be recovered. This will also need changes to deal with the vectors when
+comparing version numbers.
+
+PGBackend::run_recovery_op - recovers a single object. For an EC pool this
+involves reconstructing the data for the shards that need backfilling (read
+other shards and use decode to recover). This code shouldn't need any
+changes.
+
+Version number and last modification time for clients
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Clients can read the object version number and set expectations about what
+the minimum version number is when making updates. Clients can also read the
+last modification time. There are use cases where it is important that these
+values can be read and give consistent results, but there are also many
+scenarios where this information is not required.
+
+If the object version number is only being updated on the known M+1 shards
+for partial writes, then where this information is required it will need to
+involve a metadata access to one of those shards. We have arranged for the
+primary to be one of the M+1 shards, so I/Os submitted to the primary will
+always have access to the up to date information.
+
+Direct write I/Os need to update the M+1 shards, so it is not difficult to
+also return this information to the client when completing the I/O.
+
+Direct read I/Os are the problem case: these will only access the local shard
+and will not necessarily have access to the latest version and modification
+time. For simplicity we will require clients that need this information to
+send requests to the primary rather than using the direct I/O
+optimization. Where a client does not need this information it can use the
+direct I/O optimizations.
+
+The direct read I/O optimization will still return a (potentially stale)
+object version number. This may still be of use to clients to help understand
+the ordering of I/Os to a chunk.
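+
+To make the comparison described in the Backfill section above concrete, the
+sketch below shows how a primary that holds the per-object vector of version
+numbers could decide whether a backfill target needs an object recovered. The
+structure and helper names are hypothetical illustrations of the
+object_info_t extension proposed in this document, not the real Ceph data
+structures.
+
+.. code-block:: python
+
+   # Hypothetical sketch of the backfill decision described above; the real
+   # object_info_t extension will differ.
+   from dataclasses import dataclass, field
+   from typing import Dict, Optional
+
+   @dataclass
+   class ObjectInfoSketch:
+       version: int                      # version held by the M+1 shards
+       # Expected versions for data shards that may be skipped by partial
+       # writes.  A missing entry means "same as version", which also covers
+       # objects that have never had a partial write.
+       shard_versions: Dict[int, int] = field(default_factory=dict)
+
+       def expected_version(self, shard: int) -> int:
+           return self.shard_versions.get(shard, self.version)
+
+   def needs_backfill(primary_oi: ObjectInfoSketch,
+                      target_shard: int,
+                      target_version: Optional[int]) -> bool:
+       """True if the backfill target must be recovered for this object."""
+       if target_version is None:        # object missing on the target
+           return True
+       return target_version < primary_oi.expected_version(target_shard)
+
+   # Example: shard 3 was skipped by the last two partial writes, so version
+   # 7 on shard 3 is still up to date even though the primary is at 9.
+   oi = ObjectInfoSketch(version=9, shard_versions={3: 7})
+   assert not needs_backfill(oi, target_shard=3, target_version=7)
+   assert needs_backfill(oi, target_shard=2, target_version=7)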
+ +Direct Write with Metadata updates +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Here's the full picture of what a direct write performing a parity-delta-write +looks like with all the control messages: + +.. ditaa:: + + RADOS Client + + | ^ + | | + 1 28 + +-----+ | | + | |<------27-------+--+ + | | | | + | | +-------------|->| + | | | | | + | |<-|----2--------+ |<--------------------------------------------+ + | Seq | | | |<----------------------------+ | + | | | +----3---+ | | | + | | | | +--|-----------------------5-----|---+ | + | | | | +--|-------4---------+ | | | + | +--|-10-|------->| | | | | | + | | | | +---+ | | | | | + | | | | | | | | | | | + | | | | v | | | | | | + +----++ | | +---+ | | | | | | + ^ | | | |XOR+-|--|----------15-----|-----------|---|-----+ | + | | | | |13 +-|--|-------14--------|-----+ | | | | + | | | | +---+ | | | | | | | | + | | | | ^ | | | v | | v | + | | | | | | | | +------+ | | +------+ | + 6 11 | | | | | | | XOR | | | | GF | | + | | | | | | | | | 18 | | | | 21 | | + | | | | 12 16 | | +----+-+ | | +----+-+ | + | | | | | | | | ^ | | | ^ | | + | | | | | | | | | | | | | | | + | | | | | | | | 17 19 | | 20 22 | + | | | | | | | | | | | | | | | + | | | | | v | | | v | | | v | + | | | | +-+----+ | | +-+----+ | | +-+----+ | + | | | +->|Extent| | +->|Extent| | +->|Extent| | + | | 23 |Cache | 24 |Cache | 25 |Cache | 26 + | | | +----+-+ | +----+-+ | +----+-+ | + | | | ^ | | ^ | | ^ | | + | | | | | | | | | | | | + | +---+ 7 +---+ 8 +---+ 9 +---+ + | | | | | | | | + | v | v | v | v + .-----. .-----. .-----. .-----. .-----. .-----. + ( ) ( ) ( ) ( ) ( ) ( ) + |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'| + | | | | | | | | | | | | + | | | | | | | | | | | | + ( ) ( ) ( ) ( ) ( ) ( ) + `-----' `-----' `-----' `-----' `-----' `-----' + Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q + OSD + + * Xattr * No Xattr * Xattr * Xattr + * OI * Stale OI * OI * OI + * PG log * Partial PG log * PG log * PG log + * PG stats * No PG stats * PG stats * PG stats + +Note: Only the primary OSD and parity coding OSDs (the M+1 shards) have Xattr, +up to date object info, PG log and PG stats. Only one of these OSDs is +permitted to become the (acting) primary. The other data OSDs 2,3 and 4 (the +K-1 shards) do not have Xattrs or PG stats, may have state object info and +only have PG log entries for their own updates. OSDs 2,3 and 4 may have stale +OI with an old version number. The other OSDs have the latest OI and a vector +with the expected version numbers for OSDs 2,3 and 4. + +1. Data message with Write I/O from client (MOSDOp) +2. Control message to Primary with Xattr (new msg MOSDEcSubOpSequence) + +Note: the primary needs to be told about any xattr update so it can update its +copy, but the main purpose of this message is to allow the primary to sequence +the write I/O. The reply message at step 10 is what allows the write to start +and provides the PG stats and new object info including the new version +number. If necessary the primary can delay this to ensure that +recovery/backfill of the object is completed first and deal with overlapping +writes. Data may be read (prefetched) before the reply, but obviously no +transactions can start. + +3. Prefetch request to local extent cache +4. Control message to P to prefetch to extent cache (new msg + MOSDEcSubOpPrefetch equivalent of MOSDEcSubOpRead) +5. Control message to Q to prefetch to extent cache (new msg + MOSDEcSubOpPrefetch equivalent of MOSDEcSubOpRead) +6. Primary reads object info +7. Prefetch old data +8. Prefetch old P +9. 
Prefetch old Q + +Note: The objective of these prefetches is to get the old data, P and Q reads +started as quickly as possible to reduce the latency of the whole I/O. There +may be error scenarios where the extent cache is not able to retain this and +it will need to be re-read. This includes the rare/pathological scenarios +where there is a mixture of writes sent to the primary and writes sent +directly to the data OSD for the same object. + +10. Control message to data OSD with new object info + PG stats (new msg + MOSDEcSubOpSequenceReply) +11. Transaction to update object info + PG log + PG stats +12. Fetch old data (hopefully cached) + +Note: For best performance we want to pipeline writes to the same stripe. The +primary assigns the version number to each write and consequently defines the +order in which writes should be processed. It is important that the data shard +and the coding parity shards apply overlapping writes in the same order. The +primary knows what set of writes are in flight so can detect this situation +and indicate in its reply message at step 10 that an update must wait until an +earlier update has been applied. This information needs to be forwarded to the +coding parities (steps 14 and 15) so they can also ensure updates are applied +in the same order. + +13. XOR new and old data to create delta +14. Data message to P with delta + Xattr + object info + PG log + PG stats + (new msg MOSDEcSubOpDelta equivalent of MOSDEcSubOpWrite) +15. Data message to Q with delta + Xattr + object info + PG log + PG stats + (new msg MOSDEcSubOpDelta equivalent of MOSDEcSubOpWrite) +16. Transaction to update data + object info + PG log +17. Fetch old P (hopefully cached) +18. XOR delta and old P to create new P +19. Transaction to update P + Xattr + object info + PG log + PG stats +20. Fetch old Q (hopefully cached) +21. XOR delta and old Q to create new Q +22. Transaction to update Q + Xattr + object info + PG log + PG stats +23. Control message to data OSD for commit (new msg MOSDEcSubOpDeltaReply + equivalent of MOSDEcSubOpWriteReply) +24. Local commit notification +25. Control message to data OSD for commit (new msg MOSDEcSubOpDeltaReply + equivalent of MOSDEcSubOpWriteReply) +26. Control message to data OSD for commit (new msg MOSDEcSubOpDeltaReply + equivalent of MOSDEcSubOpWriteReply) +27. Control message to Primary to signal end of write (variant of new msg + MOSDEcSubOpSequence) +28. Control message reply to client (MOSDOpReply) + +Upgrade and backwards compatibility +----------------------------------- + +A few of the optimizations can be made just by changing code on the primary +OSD with no backwards compatibility concerns regarding clients or the other +OSDs. These optimizations will be enabled as soon as the primary OSD upgrades +and will replace the existing code paths. + +The remainder of the changes will be new I/O code paths that will exist +alongside the existing code paths. + +Similar to EC Overwrites many of the changes will need to ensure that all OSDs +are running new code and that the EC plugins support new interfaces required +for parity-delta-writes. A new pool level flag will be required to enforce +this. It will be possible to enable this flag (and hence enable the new +performance optimizations) after upgrading an existing cluster. Once set it +will not be possible to add down level OSDs to the pool. It will not be +possible to turn this flag off other than by deleting the pool. Downgrade is +not supported because: + +1. 
It is not trivial to quiesce all I/O to a pool to ensure that none of the + new I/O code paths are in use when the flag is cleared. + +2. The PG log format for new I/Os will not be understood by down level + OSDs. It would be necessary to ensure the log has been trimmed of all new + format entries before clearing the flag to ensure that down level OSDs will + be able to interpret the log. + +3. Additional xattr data will be stored by the new I/O code paths and used by + backfill. Down level code will not understand how to backfill a pool that + has been running the new I/O paths and will get confused by the + inconsistent object version numbers. While it is theoretically possible to + disable partial updates and then scan and update all the metadata to return + the pool to a state where a downgrade is possible, we have no intention of + writing this code. + +The direct I/O changes will additionally require clients to be running new +code. These will require that the pool has the new flag set and that a new +client is used. Old clients can use pools with the new flag set, just without +the direct I/O optimization. + +Not under consideration +----------------------- + +There is a list of enhancements discussed in +doc/dev/osd_internals/erasure_coding/proposals.rst, the following are not +under consideration: + +1. RADOS Client Acknowledgement Generation optimization + +When updating K+M shards in an erasure coded pool, in theory you don’t have to +wait for all the updates to complete before completing the update to the +client, because so long as K updates have completed any viable subset of +shards should be able to roll forward the update. + +For partial writes where only M+1 shards are updated this optimization does +not apply as all M+1 updates need to complete before the update is completed +to the client. + +This optimization would require changes to the peering code to work out +whether partially completed updates need to be rolled forwards or +backwards. To roll an update forwards it would be simplest to mark the object +as missing and use the recovery path to reconstruct and push the update to +OSDs that are behind. + +2. Avoid sending read request to local OSD via Messenger + +The EC backend code has an optimization for writes to the local OSD which +avoids sending a message and reply via messenger. The equivalent optimization +could be made for reads as well, although a bit more care is required because +the read is synchronous and will block the thread waiting for the I/O to +complete. + +Pull request https://github.com/ceph/ceph/pull/57237 is making this +optimization + +Stories +======= + +This is our high level breakdown of the work. Our intention is to deliver this +work as a series of PRs. The stories are roughly in the order we plan to +develop. Each story is at least one PR, where possible they will be broken up +further. The earlier stories can be implemented as stand alone pieces of work +and will not introduce upgrade/backwards compatibility issues. The later +stories will start breaking backwards compatibility, here we plan to add a new +flag to the pool to enable these new features. Initially this will be an +experimental flag while the later stories are developed. 
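+
+As background for the consistency checking stories that follow, the sketch
+below illustrates the longitudinal summary comparison described in the deep
+scrub section above. It is a toy example, not Ceph code: it folds each chunk
+down to a small XOR summary and shows, for the plain XOR parity only, that
+the folded data summaries must reproduce the folded parity summary; the same
+argument applies to the GF-weighted parities because the codes are linear.
+
+.. code-block:: python
+
+   # Toy sketch of the longitudinal XOR summary check (deep scrub section).
+   from functools import reduce
+
+   SUMMARY = 16   # bytes of summary exchanged per shard instead of the chunk
+
+   def xor_bytes(x: bytes, y: bytes) -> bytes:
+       return bytes(a ^ b for a, b in zip(x, y))
+
+   def longitudinal_summary(chunk: bytes) -> bytes:
+       """Fold a chunk down to SUMMARY bytes by XORing SUMMARY-sized slices."""
+       slices = [chunk[i:i + SUMMARY] for i in range(0, len(chunk), SUMMARY)]
+       return reduce(xor_bytes, slices)
+
+   # A consistent 4+2 stripe, shown here with just the XOR (P) parity.
+   chunk_size = 64
+   data = [bytes((i * 7 + j * 13) % 256 for j in range(chunk_size))
+           for i in range(4)]
+   parity = reduce(xor_bytes, data)
+
+   # Deep scrub only needs each OSD to send its summary, not the whole chunk.
+   data_summaries = [longitudinal_summary(c) for c in data]
+   assert reduce(xor_bytes, data_summaries) == longitudinal_summary(parity)
+
+   # A corrupted byte on one data shard is visible from the summaries alone.
+   corrupted = bytearray(data[0])
+   corrupted[0] ^= 0xFF
+   bad = [longitudinal_summary(bytes(corrupted))] + data_summaries[1:]
+   assert reduce(xor_bytes, bad) != longitudinal_summary(parity)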
+
+Test tools - enhanced I/O generator for testing erasure coding
+----------------------------------------------------------------
+
+* Extend rados bench to be able to generate more interesting patterns of I/O
+  for erasure coding, in particular reading and writing at different offsets
+  and for different lengths, making sure we get good coverage of boundary
+  conditions such as the sub-chunk size, chunk size and stripe size
+* Improve data integrity checking by using a seed to generate data patterns
+  and remembering which seed is used for each block that is written so that
+  data can later be validated
+
+Test tools - offline consistency checking tool
+------------------------------------------------
+
+* Test tools for performing offline consistency checks combining use of
+  objectstore_tool with ceph-erasure-code-tool
+* Enhance some of the teuthology standalone erasure code checks to use this
+  tool
+
+Test tools - online consistency checking tool
+-----------------------------------------------
+
+* New CLI to be able to perform online consistency checking for an object or
+  a range of objects that reads all the data and coding parity shards and
+  re-encodes the data to validate the coding parities
+
+Switch from JErasure to ISA-L
+-------------------------------
+
+The JErasure library has not been updated since 2014, whereas the ISA-L
+library is maintained and exploits newer instruction sets (e.g. AVX512,
+AVX2) which provide faster encoding/decoding.
+
+* Change defaults to ISA-L in upstream Ceph
+* Benchmark Jerasure and ISA-L
+* Refactor Ceph isa_encode region_xor() to use AVX when M=1
+* Documentation updates
+* Present results at performance weekly
+
+Sub Stripe Reads
+------------------
+
+Ceph currently reads an integer number of stripes and discards unneeded
+data. In particular for small random reads it will be more efficient to just
+read the required data.
+
+* Help finish Pull Request https://github.com/ceph/ceph/pull/55196 if not
+  already complete
+* Further changes to issue sub-chunk reads rather than full-chunk reads
+
+Simple Optimizations to Overwrite
+-----------------------------------
+
+Ceph overwrites currently read an integer number of stripes, merge the new
+data and write an integer number of stripes. This story makes simple
+improvements by applying the same optimizations as for sub stripe reads and,
+for small (sub-chunk) updates, reducing the amount of data being read/written
+to each shard.
+
+* Only read chunks that are not being fully overwritten (code currently reads
+  the whole stripe and then merges new data)
+* Perform sub-chunk reads for sub-chunk updates
+* Perform sub-chunk writes for sub-chunk updates
+
+Eliminate unnecessary chunk writes but keep metadata transactions
+-------------------------------------------------------------------
+
+This story avoids re-writing data that has not been modified. A transaction
+is still applied to every OSD to update object metadata, the PG log and PG
+stats.
+
+* Continue to create transactions for all chunks but without the new write
+  data
+* Add sub-chunk writes to transactions where data is being modified
+
+Avoid zero padding objects to a full stripe
+---------------------------------------------
+
+Objects are rounded up to an integer number of stripes by adding zero
+padding. These buffers of zeros are then sent in messages to other OSDs and
+written to the object store, consuming storage.
+This story makes optimizations to remove the need for this padding.
+
+* Modifications to reconstruct reads to avoid reading zero-padding at the end
+  of an object - just fill the read buffer with zeros instead
+* Avoid transfers/writes of buffers of zero padding. Still send transactions
+  to all shards and create the object, just don't populate it with zeros
+* Modifications to encode/decode functions to avoid having to pass in buffers
+  of zeros when objects are padded
+
+Erasure coding plugin changes to support distributed partial writes
+-------------------------------------------------------------------
+
+This is preparatory work for future stories; it adds new APIs to the erasure
+code plugins.
+
+* Add a new interface to create a delta by XORing old and new data together
+  and implement this for the ISA-L and JErasure plugins
+* Add a new interface to apply a delta to one coding parity by using XOR/GF
+  arithmetic and implement this for the ISA-L and JErasure plugins
+* Add a new interface which reports which erasure codes support this feature
+  (ISA-L and JErasure will support it, others will not)
+
+Erasure coding interface to allow RADOS clients to direct I/Os to the OSD storing the data
+------------------------------------------------------------------------------------------
+
+This is preparatory work for future stories; it adds a new API for clients.
+
+* New interface to convert the pair (pg, offset) to {OSD, remaining chunk
+  length}
+
+We do not want clients to have to dynamically link to the erasure code plugins
+so this code will need to be part of librados. However, this interface needs
+to understand how erasure codes distribute data and coding chunks to be able
+to perform this translation.
+
+We will only support the ISA-L and JErasure plugins, where there is a trivial
+striping of data chunks to OSDs.
+
+Changes to object_info_t
+------------------------
+
+This is preparatory work for future stories.
+
+This adds the vector of version numbers to object_info_t, which will be used
+for partial updates. For replicated pools and for erasure coded objects that
+are not overwritten, we will avoid storing extra data in object_info_t.
+
+Changes to PGLog and Peering to support updating a subset of OSDs
+-----------------------------------------------------------------
+
+This is preparatory work for future stories.
+
+* Modify the PG log entry to store a record of which OSDs are being updated
+* Modify peering to use this extra data to work out which OSDs are missing
+  updates
+
+Change to selection of (acting) primary
+---------------------------------------
+
+This is preparatory work for future stories.
+
+Constrain the choice of primary to be the first data OSD or one of the erasure
+coding parities. If none of these OSDs are available and up to date then the
+pool must be offline.
+
+Implement parity-delta-write with all computation on the primary
+----------------------------------------------------------------
+
+* Calculate whether it is more efficient for an update to perform a full
+  stripe overwrite or a parity-delta-write
+* Implement new code paths to perform the parity-delta-write (the delta
+  arithmetic is sketched below)
+* Test tool enhancements. We want to make sure that both parity-delta-write
+  and full-stripe write are tested. We will add a new conf file option with a
+  choice of 'parity-delta', 'full-stripe', 'mixture for testing' or
+  'automatic' and update teuthology test cases to predominantly use a mixture.
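+
+To illustrate the arithmetic referred to above, the following is a minimal,
+hypothetical sketch of how a parity delta could be computed on the primary
+and applied to a coding parity. The function names are illustrative and are
+not the proposed plugin API; the GF(2^8) polynomial and the coefficient
+handling are assumptions based on how Reed-Solomon style codes are commonly
+implemented.
+
+.. code-block:: cpp
+
+   #include <cstdint>
+   #include <vector>
+
+   // Multiply two bytes in GF(2^8) using the 0x11d polynomial commonly used
+   // by Reed-Solomon erasure codes (simple shift-and-add implementation).
+   static uint8_t gf_mul(uint8_t a, uint8_t b) {
+     uint8_t p = 0;
+     while (b) {
+       if (b & 1) p ^= a;
+       bool carry = a & 0x80;
+       a <<= 1;
+       if (carry) a ^= 0x1d;
+       b >>= 1;
+     }
+     return p;
+   }
+
+   // Step 1 (data shard): delta = old_data XOR new_data.
+   // Assumes equal-length buffers covering the modified extent.
+   std::vector<uint8_t> make_delta(const std::vector<uint8_t>& old_data,
+                                   const std::vector<uint8_t>& new_data) {
+     std::vector<uint8_t> delta(old_data.size());
+     for (size_t i = 0; i < delta.size(); ++i)
+       delta[i] = old_data[i] ^ new_data[i];
+     return delta;
+   }
+
+   // Step 2 (parity shard): new_parity = old_parity XOR (coeff * delta).
+   // For the XOR parity P the coefficient is 1; for further parities it is
+   // the encoding matrix entry for the data column being modified.
+   void apply_delta(std::vector<uint8_t>& parity,
+                    const std::vector<uint8_t>& delta,
+                    uint8_t coeff) {
+     for (size_t i = 0; i < parity.size(); ++i)
+       parity[i] ^= gf_mul(coeff, delta[i]);
+   }
+
+Because only the modified data chunk and the M coding parities need to be
+read and re-written, a small overwrite touches M+1 shards instead of the
+whole stripe, which is the saving the parity-delta-write stories rely on.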
+
+Upgrades and backwards compatibility
+------------------------------------
+
+* Add a new feature flag for erasure coded pools
+* All OSDs must be running new code to enable the flag on the pool
+* Clients may only issue direct I/Os if the flag is set
+* OSDs running old code may not join a pool with the flag set
+* It is not possible to turn the feature flag off (other than by deleting the
+  pool)
+
+Changes to Backfill to use the vector in object_info_t
+------------------------------------------------------
+
+This is preparatory work for future stories.
+
+* Modify the backfill process to use the vector of version numbers in
+  object_info_t so that when partial updates occur we do not backfill OSDs
+  which did not participate in the partial update
+* When there is a single backfill target, extract the appropriate version
+  number from the vector (no additional storage required)
+* When there are multiple backfill targets, extract the subset of the vector
+  required by the backfill targets and select the appropriate entry when
+  comparing version numbers in PrimaryLogPG::recover_backfill
+
+Test tools - offline metadata validation tool
+---------------------------------------------
+
+* Test tools for performing offline consistency checking of metadata, in
+  particular checking that the vector of version numbers in object_info_t
+  matches the versions on each OSD, but also for validating PG log entries
+
+Eliminate transactions on OSDs not updating data chunks
+-------------------------------------------------------
+
+Peering, log recovery and backfill can now all cope with partial updates using
+the vector of version numbers in object_info_t.
+
+* Modify the overwrite I/O path to not bother with metadata-only transactions
+  (except to the primary OSD)
+* Modify the update of the version numbers in object_info_t to use the vector
+  and only update entries that are receiving a transaction
+* Modify the generation of the PG log entry to record which OSDs are being
+  updated
+
+Direct reads to OSDs (single chunk only)
+----------------------------------------
+
+* Modify OSDClient to route single chunk read I/Os to the OSD storing the data
+* Modify the OSD to accept reads sent to a non-primary OSD (expand the
+  existing changes for replicated pools to work with EC pools as well)
+* If necessary, fail the read with EAGAIN if the OSD is unable to process the
+  read directly
+* Modify OSDClient to retry the read by submitting it to the primary OSD if
+  the read fails with EAGAIN
+* Test tool enhancements. We want to make sure that both direct reads and
+  reads to the primary are tested. We will add a new conf file option with a
+  choice of 'prefer direct', 'primary only' or 'mixture for testing' and
+  update teuthology test cases to predominantly use a mixture.
+
+The changes will be made to the OSDC part of the RADOS client so they will be
+applicable to rbd, rgw and cephfs.
+
+We will not make changes to other code that has its own version of the RADOS
+client code, such as krbd, although this could be done in the future.
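+
+As an illustration of the kind of translation the librados interface
+described earlier has to perform, the following is a hypothetical sketch of
+mapping a logical offset within an object to the data shard that holds it,
+assuming the trivial striping of data chunks used by the ISA-L and JErasure
+plugins. The struct and function names are illustrative only, not the
+proposed API.
+
+.. code-block:: cpp
+
+   #include <cstdint>
+
+   struct ShardExtent {
+     unsigned shard;         // data shard index 0..K-1
+     uint64_t shard_offset;  // byte offset within that shard
+     uint64_t remaining;     // bytes left before the read crosses a chunk
+   };
+
+   ShardExtent map_offset(uint64_t logical_offset,
+                          uint64_t chunk_size,
+                          unsigned k) {
+     const uint64_t stripe_size = chunk_size * k;
+     const uint64_t stripe = logical_offset / stripe_size;
+     const uint64_t in_stripe = logical_offset % stripe_size;
+     ShardExtent e;
+     e.shard = static_cast<unsigned>(in_stripe / chunk_size);
+     e.shard_offset = stripe * chunk_size + (in_stripe % chunk_size);
+     e.remaining = chunk_size - (in_stripe % chunk_size);
+     return e;
+   }
+
+A read that fits within the remaining bytes of the chunk touches a single
+shard and can be sent directly to the OSD holding that shard (looked up from
+the PG's acting set); anything larger has to either fall back to the primary
+or be split, which is what the NONATOMIC work in the next story covers.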
+
+Direct reads to OSDs (multiple chunks)
+--------------------------------------
+
+* Add a new OSDC flag NONATOMIC which allows OSDC to split a read into
+  multiple requests
+* Modify OSDC to split reads spanning multiple chunks into separate requests
+  to each OSD if the NONATOMIC flag is set
+* Modifications to OSDC to coalesce results (if any sub-read fails, the whole
+  read needs to fail)
+* Changes to the librbd client to set the NONATOMIC flag for reads
+* Changes to the cephfs client to set the NONATOMIC flag for reads
+
+We are only changing a very limited set of clients, focusing on those that
+issue smaller reads and are latency sensitive. Future work could look at
+extending the set of clients (including krbd).
+
+Implement distributed parity-delta-write
+----------------------------------------
+
+* Implement new messages MOSDEcSubOpDelta and MOSDEcSubOpDeltaReply
+* Change the primary to calculate the delta and send the MOSDEcSubOpDelta
+  message to the coding parity OSDs
+* Modify the coding parity OSDs to apply the delta and send the
+  MOSDEcSubOpDeltaReply message
+
+Note: This change will increase latency because the coding parity reads start
+after the old data read completes. Future work will fix this.
+
+Test tools - EC error injection thrasher
+----------------------------------------
+
+* Implement a new type of thrasher that specifically injects faults to stress
+  erasure coded pools
+* Take one or more (up to M) OSDs down, with more focus on taking different
+  subsets of OSDs down to drive all the different EC recovery paths than on
+  stressing peering/recovery/backfill (the existing OSD thrasher excels at
+  this)
+* Inject read I/O failures to force reconstructs using decode for single and
+  multiple failures
+* Inject delays using an 'osd tell' style interface to make it easier to test
+  an OSD going down at all the interesting stages of EC I/Os
+* Inject delays using an 'osd tell' style interface to slow down an OSD
+  transaction or message to expose the less common completion orders for
+  parallel work
+
+Implement prefetch message MOSDEcSubOpPrefetch and modify extent cache
+----------------------------------------------------------------------
+
+* Implement new message MOSDEcSubOpPrefetch
+* Change the primary to issue this message to the coding parity OSDs before
+  starting the read of the old data
+* Change the extent cache so that each OSD caches its own data rather than
+  caching everything on the primary
+* Change the coding parity OSDs to handle this message and read the old coding
+  parity into the extent cache
+* Changes to the extent cache to retain the prefetched old parity until the
+  MOSDEcSubOpDelta message is received, and to discard it on error paths
+  (e.g. a new OSDMap)
+
+Implement sequencing message MOSDEcSubOpSequence
+------------------------------------------------
+
+* Implement new messages MOSDEcSubOpSequence and MOSDEcSubOpSequenceReply
+* Modify the primary code to create these messages and route them locally to
+  itself in preparation for direct writes
+
+Direct writes to OSD (single chunk only)
+----------------------------------------
+
+* Modify OSDC to route single chunk write I/Os to the OSD storing the data
+* Changes to issue MOSDEcSubOpSequence and MOSDEcSubOpSequenceReply between
+  the data OSD and the primary OSD
+
+Direct writes to OSD (multiple chunks)
+--------------------------------------
+
+* Modifications to OSDC to split multi-chunk writes into separate requests if
+  the NONATOMIC flag is set
+* Further changes to coalescing completions (in particular reporting the
+  version number correctly)
+* Changes to the librbd client to set the NONATOMIC flag for writes
+* Changes to the cephfs client to set the NONATOMIC flag for writes
+
+We are only changing a very limited set of clients, focusing on those that
+issue smaller writes and are latency sensitive. Future work could look at
+extending the set of clients.
+
+Deep scrub / CRC
+----------------
+
+* Disable CRC generation in the EC code for overwrites; delete the hinfo
+  xattr when the first overwrite occurs
+* For objects in a pool with the new feature flag set that have not been
+  overwritten, check the CRC even if the pool overwrite flag is set. The
+  presence/absence of hinfo can be used to determine whether the object has
+  been overwritten
+* For deep scrub requests, XOR the contents of the shard to create a
+  longitudinal check (8 bytes wide?) - a sketch of this check is given at the
+  end of this document
+* Return the longitudinal check in the scrub reply message and have the
+  primary encode the set of longitudinal replies to check for inconsistencies
+
+Variable chunk size erasure coding
+----------------------------------
+
+* Implement a new pool option for automatic/variable chunk size
+* When the object size is small and the pool is using the new option, use a
+  small chunk size (4K)
+* When the object size is large, use a large chunk size (64K or 256K?)
+* Convert the chunk size by reading and re-writing the whole object when a
+  small object grows (append)
+* Convert the chunk size by reading and re-writing the whole object when a
+  large object shrinks (truncate)
+* Use the object size hint to avoid creating small objects and then almost
+  immediately converting them to a larger chunk size
+
+CLAY Erasure Codes
+------------------
+
+In theory CLAY erasure codes should be good for K+M erasure codes with larger
+values of M, in particular when these erasure codes are used with multiple
+OSDs in the same failure domain (e.g. an 8+6 erasure code with 5 servers each
+with 4 OSDs). We would like to improve the test coverage for CLAY and perform
+some more benchmarking to collect data to help substantiate when people should
+consider using CLAY.
+
+* Benchmark CLAY erasure codes - in particular the number of I/Os required for
+  backfills when multiple OSDs fail
+* Enhance test cases to validate the implementation
+* See also https://bugzilla.redhat.com/show_bug.cgi?id=2004256
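+
+Returning to the deep scrub story above, the following is a minimal,
+hypothetical sketch of the 8-byte longitudinal check: each shard is folded
+into a single 64-bit value by XORing it in 8-byte words. Because the erasure
+code is linear over XOR, re-encoding the K data-shard checks should reproduce
+the parity-shard checks, so the primary can detect inconsistencies without
+shipping full shard contents. The function name and the exact width are
+assumptions, not a settled design.
+
+.. code-block:: cpp
+
+   #include <cstdint>
+   #include <cstring>
+   #include <vector>
+
+   // Fold a shard into an 8-byte longitudinal check by XORing 8-byte words.
+   uint64_t longitudinal_check(const std::vector<uint8_t>& shard) {
+     uint64_t acc = 0;
+     size_t i = 0;
+     for (; i + 8 <= shard.size(); i += 8) {
+       uint64_t word;
+       std::memcpy(&word, shard.data() + i, sizeof(word));
+       acc ^= word;
+     }
+     if (i < shard.size()) {          // fold any trailing bytes, zero padded
+       uint64_t word = 0;
+       std::memcpy(&word, shard.data() + i, shard.size() - i);
+       acc ^= word;
+     }
+     return acc;
+   }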