===========================
Erasure coding enhancements
===========================

Objectives
==========

Our objective is to improve the performance of erasure coding, in particular
for small random accesses, to make it more viable to use erasure-coded pools
for storing block and file data.

We are looking to reduce the number of OSD read and write accesses per client
I/O (sometimes referred to as I/O amplification), reduce the amount of network
traffic between OSDs (network bandwidth) and reduce I/O latency (the time to
complete read and write I/O operations). We expect the changes will also
provide modest reductions to CPU overheads.

While the changes are focused on enhancing small random accesses, some
enhancements will provide modest benefits for larger I/O accesses and for
object storage.

The following sections give a brief description of the improvements we are
looking to make. Please see the later design sections for more details.

Current Read Implementation
---------------------------

For reference, this is how erasure-coded reads currently work:

.. ditaa::

        RADOS Client
                           * Current code reads all data chunks
          ^                * Discards unneeded data
          |                * Returns requested data to client
     +----+----+
     | Discard |           If data cannot be read then the coding parity
     |unneeded |           chunks are read as well and are used to reconstruct
     | data    |           the data
     +---------+
       ^^^^
       ||||
       ||||
       ||||
       |||+--------------------------------+
       ||+---------------------+           |
       |+----------+           |           |
       |           |           |           |
    .-----.     .-----.     .-----.     .-----.     .-----.     .-----.
   (       )   (       )   (       )   (       )   (       )   (       )
   |`-----'|   |`-----'|   |`-----'|   |`-----'|   |`-----'|   |`-----'|
   |       |   |       |   |       |   |       |   |       |   |       |
   |       |   |       |   |       |   |       |   |       |   |       |
   (       )   (       )   (       )   (       )   (       )   (       )
    `-----'     `-----'     `-----'     `-----'     `-----'     `-----'
    Primary      OSD 2       OSD 3       OSD 4       OSD P       OSD Q
      OSD

Note: All the diagrams illustrate a K=4 + M=2 configuration, however the
concepts and techniques can be used for all K+M configurations.

Partial Reads
-------------

If only a small amount of data is being read, it is not necessary to read the
whole stripe; for small I/Os ideally only a single OSD needs to be involved in
reading the data. See also larger chunk size below.

.. ditaa::

        RADOS Client
                           * Optimize by only reading required chunks
          ^                * For large chunk sizes and sub-chunk reads only
          |                  read a sub-chunk
     +----+----+
     | Return  |           If data cannot be read then extra data and coding
     | data    |           parity chunks are read as well and are used to
     |         |           reconstruct the data
     +---------+
          ^
          |
          |
          |
          |
          |
          +--------------------+
                               |
    .-----.     .-----.     .-----.     .-----.     .-----.     .-----.
   (       )   (       )   (       )   (       )   (       )   (       )
   |`-----'|   |`-----'|   |`-----'|   |`-----'|   |`-----'|   |`-----'|
   |       |   |       |   |       |   |       |   |       |   |       |
   |       |   |       |   |       |   |       |   |       |   |       |
   (       )   (       )   (       )   (       )   (       )   (       )
    `-----'     `-----'     `-----'     `-----'     `-----'     `-----'
    Primary      OSD 2       OSD 3       OSD 4       OSD P       OSD Q
      OSD

Pull Request https://github.com/ceph/ceph/pull/55196 is implementing most of
this optimization, however it still issues full chunk reads.
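
The reason a small read can usually be served by one OSD follows directly
from the chunk layout. The sketch below is illustrative only (the function
name and layout helper are not Ceph code): it maps a logical offset and
length onto per-shard extents for an object striped round-robin across the K
data shards, and a read that fits inside one chunk maps to exactly one shard.

.. code-block:: python

   def chunks_for_read(offset, length, k, chunk_size):
       """Map a logical byte range onto (shard, shard_offset, shard_length)
       extents, assuming data is striped round-robin across K data shards."""
       extents = []
       end = offset + length
       while offset < end:
           stripe = offset // (k * chunk_size)         # which stripe we are in
           chunk_index = (offset // chunk_size) % k    # which data shard holds this chunk
           within_chunk = offset % chunk_size          # offset inside that chunk
           run = min(chunk_size - within_chunk, end - offset)
           shard_offset = stripe * chunk_size + within_chunk
           extents.append((chunk_index, shard_offset, run))
           offset += run
       return extents

   # A 4K read at offset 8192 in a K=4, 64K-chunk layout touches a single shard:
   print(chunks_for_read(8192, 4096, k=4, chunk_size=65536))
   # [(0, 8192, 4096)]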

Current Overwrite Implementation
--------------------------------

For reference, here is how erasure-coded overwrites currently work:

.. ditaa::

        RADOS Client
             |             * Reads all data chunks
             |             * Merges new data
     +-------v--+          * Encodes new coding parities
     | Read old |          * Writes data and coding parities
     |Merge new |
     | Encode   |--------------------------------------------------+
     | Write    |------------------------------------+             |
     +----------+                                     |             |
       ^|^|^|^|                                       |             |
       |||||||+----------------------------+          |             |
       ||||||+----------------------------+|          |             |
       |||||+------------------+          ||          |             |
       ||||+------------------+|          ||          |             |
       |||+--------+          ||          ||          |             |
       ||+--------+|          ||          ||          |             |
       |v         |v          |v          |v          v             v
    .-----.     .-----.     .-----.     .-----.     .-----.     .-----.
   (       )   (       )   (       )   (       )   (       )   (       )
   |`-----'|   |`-----'|   |`-----'|   |`-----'|   |`-----'|   |`-----'|
   |       |   |       |   |       |   |       |   |       |   |       |
   |       |   |       |   |       |   |       |   |       |   |       |
   (       )   (       )   (       )   (       )   (       )   (       )
    `-----'     `-----'     `-----'     `-----'     `-----'     `-----'
    Primary      OSD 2       OSD 3       OSD 4       OSD P       OSD Q
      OSD

Partial Overwrites
------------------

Ideally we aim to be able to perform updates to erasure coded stripes by only
updating a subset of the shards (those with modified data or coding
parities). Avoiding unnecessary data updates on the other shards is easy;
avoiding any metadata updates on the other shards is much harder (see the
design section on metadata updates).

.. ditaa::

        RADOS Client
             |             * Only read chunks that are not being overwritten
             |             * Merge new data
     +-------v--+          * Encode new coding parities
     | Read old |          * Only write modified data and parity shards
     |Merge new |
     | Encode   |--------------------------------------------------+
     | Write    |------------------------------------+             |
     +----------+                                     |             |
       ^ |  ^  ^                                      |             |
       | |  |  |                                      |             |
       | |  |  +--------------------------+           |             |
       | |  |                             |           |             |
       | |  +---------------+             |           |             |
       | +--------+         |             |           |             |
       |          |         |             |           |             |
       |          v         |             |           v             v
    .-----.     .-----.     .-----.     .-----.     .-----.     .-----.
   (       )   (       )   (       )   (       )   (       )   (       )
   |`-----'|   |`-----'|   |`-----'|   |`-----'|   |`-----'|   |`-----'|
   |       |   |       |   |       |   |       |   |       |   |       |
   |       |   |       |   |       |   |       |   |       |   |       |
   (       )   (       )   (       )   (       )   (       )   (       )
    `-----'     `-----'     `-----'     `-----'     `-----'     `-----'
    Primary      OSD 2       OSD 3       OSD 4       OSD P       OSD Q
      OSD

This diagram is overly simplistic, only showing the data flows. The simplest
implementation of this optimization retains a metadata write to every
OSD. With more effort it is possible to reduce the number of metadata updates
as well; see the design section below for more details.

Parity-delta-write
------------------

A common technique used by block storage controllers implementing RAID5 and
RAID6 is to implement what is sometimes called a parity delta write. When a
small part of the stripe is being overwritten it is possible to perform the
update by reading the old data, XORing it with the new data to create a
delta, and then reading each coding parity, applying the delta and writing
the new parity. The advantage of this technique is that it can involve a lot
less I/O, especially for K+M encodings with larger values of K. The technique
is not specific to M=1 and M=2; it can be applied with any number of coding
parities.

.. ditaa::

     Parity delta writes
                          * Read old data and XOR with new data to create a delta
     RADOS Client         * Read old encoding parities, apply the delta and write
       |                    the new encoding parities
       |
       |                  For K+M erasure codings where K is larger and M is small
       |  +-----+    +-----+      this is much more efficient
       +->| XOR |-+->| GF  |---------------------------------------+
   +-+--->|     | |  |     |<--------------------------------------+|
   | |    +-----+ |  +-----+                                        ||
   | |            |                                                 ||
   | |            |  +-----+                                        ||
   | |            +->| XOR |--------------------------+             ||
   | |               |     |<-------------------------+|            ||
   | |               +-----+                           ||            ||
   | |                                                 ||            ||
   | +-------------+                                   ||            ||
   +--------------+|                                   ||            ||
                  ||                                   ||            ||
                  |v                                   |v            |v
    .-----.     .-----.     .-----.     .-----.     .-----.     .-----.
   (       )   (       )   (       )   (       )   (       )   (       )
   |`-----'|   |`-----'|   |`-----'|   |`-----'|   |`-----'|   |`-----'|
   |       |   |       |   |       |   |       |   |       |   |       |
   |       |   |       |   |       |   |       |   |       |   |       |
   (       )   (       )   (       )   (       )   (       )   (       )
    `-----'     `-----'     `-----'     `-----'     `-----'     `-----'
    Primary      OSD 2       OSD 3       OSD 4       OSD P       OSD Q
      OSD
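
The arithmetic behind a parity-delta-write is small enough to show directly.
The sketch below uses a toy RAID6-style code, with P as the XOR of the data
chunks and Q as a GF(2^8)-weighted XOR, and single bytes standing in for whole
chunks; the real plugins implement the same idea over large buffers with
optimized field arithmetic.

.. code-block:: python

   def gf_mul(a, b):
       """Multiply in GF(2^8) with the polynomial x^8+x^4+x^3+x^2+1 (0x11d)."""
       result = 0
       while b:
           if b & 1:
               result ^= a
           a <<= 1
           if a & 0x100:
               a ^= 0x11d
           b >>= 1
       return result

   def encode(data, coeffs):
       """P is the XOR of the data chunks, Q is a GF-weighted XOR."""
       p = 0
       q = 0
       for d, c in zip(data, coeffs):
           p ^= d
           q ^= gf_mul(c, d)
       return p, q

   # Full encode of a K=4 stripe with arbitrary (but fixed) Q coefficients.
   coeffs = [1, 2, 4, 8]
   data = [0x11, 0x22, 0x33, 0x44]
   p, q = encode(data, coeffs)

   # Parity-delta-write: overwrite data chunk 2 without touching chunks 0, 1, 3.
   new = 0x5a
   delta = data[2] ^ new                  # read old data, XOR with new data
   p_new = p ^ delta                      # read old P, apply delta
   q_new = q ^ gf_mul(coeffs[2], delta)   # read old Q, apply GF-weighted delta
   data[2] = new

   # The delta update matches a full re-encode of the modified stripe.
   assert (p_new, q_new) == encode(data, coeffs)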

Direct Read I/O
---------------

We want clients to submit small I/Os directly to the OSD that stores the data
rather than directing all I/O requests to the Primary OSD and having it issue
requests to the secondary OSDs. By eliminating an intermediate hop this
reduces network bandwidth and improves I/O latency.

.. ditaa::

                          RADOS Client
                               ^
                               |
                          +----+----+     Client sends small read requests directly to OSD
                          | Return  |     avoiding extra network hop via Primary
                          | data    |
                          |         |
                          +---------+
                               ^
                               |
                               |
                               |
                               |
                               |
                               |
    .-----.     .-----.     .-----.     .-----.     .-----.     .-----.
   (       )   (       )   (       )   (       )   (       )   (       )
   |`-----'|   |`-----'|   |`-----'|   |`-----'|   |`-----'|   |`-----'|
   |       |   |       |   |       |   |       |   |       |   |       |
   |       |   |       |   |       |   |       |   |       |   |       |
   (       )   (       )   (       )   (       )   (       )   (       )
    `-----'     `-----'     `-----'     `-----'     `-----'     `-----'
    Primary      OSD 2       OSD 3       OSD 4       OSD P       OSD Q
      OSD

.. ditaa::

                 RADOS Client
                   ^            ^
                   |            |
              +----+----+  +----+----+      Client breaks larger read
              | Return  |  | Return  |      requests into separate
              | data    |  | data    |      requests to multiple OSDs
              |         |  |         |
              +---------+  +---------+      Note client loses atomicity
                   ^            ^           guarantees if this optimization
                   |            |           is used as an update could occur
                   |            |           between the two reads
                   |            |
                   |            |
                   |            |
                   |            |
    .-----.     .-----.     .-----.     .-----.     .-----.     .-----.
   (       )   (       )   (       )   (       )   (       )   (       )
   |`-----'|   |`-----'|   |`-----'|   |`-----'|   |`-----'|   |`-----'|
   |       |   |       |   |       |   |       |   |       |   |       |
   |       |   |       |   |       |   |       |   |       |   |       |
   (       )   (       )   (       )   (       )   (       )   (       )
    `-----'     `-----'     `-----'     `-----'     `-----'     `-----'
    Primary      OSD 2       OSD 3       OSD 4       OSD P       OSD Q
      OSD

Distributed processing of writes
--------------------------------

The existing erasure code implementation processes write I/Os on the primary
OSD, issuing both reads and writes to other OSDs to fetch and update data for
other shards. This is perhaps the simplest implementation, but it uses a lot
of network bandwidth. With parity-delta-writes it is possible to distribute
the processing across OSDs to reduce network bandwidth.

.. ditaa::

      Performing the coding parity delta updates on the coding parity
      OSD instead of the primary OSD reduces network bandwidth

     RADOS Client
       |          Note: A naive implementation will increase latency by serializing
       |          the data and coding parity reads; for best performance these
       |          reads need to happen in parallel
       |  +-----+                                                  +-----+
       +->| XOR |-+----------------------------------------------->| GF  |
   +-+--->|     | |                                                |     |
   | |    +-----+ |                +-----+                         +-----+
   | |            |                |     |                           ^ |
   | |            +--------------->| XOR |                           | |
   | |                             |     |                           | |
   | |                             +-----+                           | |
   | |                               ^ |                             | |
   | +-------------+                 | |                             | |
   +--------------+|                 | |                             | |
                  ||                 | |                             | |
                  |v                 | v                             | v
    .-----.     .-----.     .-----.     .-----.     .-----.     .-----.
   (       )   (       )   (       )   (       )   (       )   (       )
   |`-----'|   |`-----'|   |`-----'|   |`-----'|   |`-----'|   |`-----'|
   |       |   |       |   |       |   |       |   |       |   |       |
   |       |   |       |   |       |   |       |   |       |   |       |
   (       )   (       )   (       )   (       )   (       )   (       )
    `-----'     `-----'     `-----'     `-----'     `-----'     `-----'
    Primary      OSD 2       OSD 3       OSD 4       OSD P       OSD Q
      OSD

Direct Write I/O
----------------

.. ditaa::

     RADOS Client
       |
       |          Similarly Clients could direct small write I/Os
       |          to the OSD that needs updating
       |
       |  +-----+                                                  +-----+
       +->| XOR |-+----------------------------------------------->| GF  |
   +-+--->|     | |                                                |     |
   | |    +-----+ |                +-----+                         +-----+
   | |            |                |     |                           ^ |
   | |            +--------------->| XOR |                           | |
   | |                             |     |                           | |
   | |                             +-----+                           | |
   | |                               ^ |                             | |
   | +-------------+                 | |                             | |
   +--------------+|                 | |                             | |
                  ||                 | |                             | |
                  |v                 | v                             | v
    .-----.     .-----.     .-----.     .-----.     .-----.     .-----.
   (       )   (       )   (       )   (       )   (       )   (       )
   |`-----'|   |`-----'|   |`-----'|   |`-----'|   |`-----'|   |`-----'|
   |       |   |       |   |       |   |       |   |       |   |       |
   |       |   |       |   |       |   |       |   |       |   |       |
   (       )   (       )   (       )   (       )   (       )   (       )
    `-----'     `-----'     `-----'     `-----'     `-----'     `-----'
    Primary      OSD 2       OSD 3       OSD 4       OSD P       OSD Q
      OSD

This diagram is overly simplistic, only showing the data flows; direct writes
are much harder to implement and will need control messages to the Primary to
ensure writes to the same stripe are ordered correctly.

Larger chunk size
-----------------

The default chunk size is 4K; this is too small and means that small reads
have to be split up and processed by many OSDs. It is more efficient if small
I/Os can be serviced by a single OSD. Choosing a larger chunk size such as 64K
or 256K and implementing partial reads and writes will fix this issue, but has
the disadvantage that small RADOS objects get rounded up in size to a
whole stripe of capacity.

We would like the code to automatically choose what chunk size to use to
optimize for both capacity and performance. Small objects should use a small
chunk size like 4K; larger objects should use a larger chunk size.

The code currently rounds up I/O sizes to multiples of the chunk size, which isn't
an issue with a small chunk size. With a larger chunk size and partial
reads/writes we should round up to the page size rather than the chunk size.

Design
======

We will describe the changes we aim to make in three sections. The first
section looks at the existing test tools for erasure coding and discusses the
improvements we believe will be necessary to get good test coverage for the
changes.

The second section covers changes to the read and write I/O path.

The third section discusses the changes to metadata to avoid the need to
update metadata on all shards for each metadata update. While it is possible
to implement many of the I/O path changes without reducing the number of
metadata updates, there are bigger performance benefits if the number of
metadata updates can be reduced as well.

Test tools
----------

A survey of the existing test tools shows that there is insufficient coverage
of erasure coding to be able to just make changes to the code and expect the
existing CI pipelines to get sufficient coverage. Therefore one of the first
steps will be to improve the test tools to be able to get better test
coverage.

Teuthology is the main test tool used to get test coverage and it relies
heavily on the following tests for generating I/O:

1. **rados** task - qa/tasks/rados.py. This uses ceph_test_rados
   (src/test/osd/TestRados.cc) which can generate a wide mixture of different
   rados operations. There is limited support for read and write I/Os,
   typically using offset 0, although there is a chunked read command used by a
   couple of tests.

2. **radosbench** task - qa/tasks/radosbench.py. This uses the **rados bench**
   tool (src/tools/rados/rados.cc and src/common/obj_bencher.cc). It can be used
   to generate sequential and random I/O workloads; the offset starts at 0 for
   sequential I/O. The I/O size can be set but is constant for the whole test.

3. **rbd_fio** task - qa/tasks/fio.py. This uses **fio** to generate
   read/write I/O to an rbd image volume.

4. **cbt** task - qa/tasks/cbt.py. This uses the Ceph benchmark tool **cbt**
   to run fio or radosbench to benchmark the performance of a cluster.

5. **rbd bench**. Some of the standalone tests use rbd bench
   (src/tools/rbd/action/Bench.cc) to generate small amounts of I/O
   workload. It is also used by the **rbd_pwl_cache_recovery** task.

It is hard to use these tools to get good coverage of I/Os to non-zero (and
non-stripe aligned) offsets, or to generate a wide variety of offsets and
lengths of I/O requests including all the boundary cases for chunks and
stripes. There is scope to improve either rados, radosbench or rbd bench to
generate much more interesting I/O patterns for testing erasure coding.

For the optimizations described above it is essential that we have good tools
for checking the consistency of either selected objects or all objects in an
erasure coded pool by checking that the data and coding parities are
coherent. There is a test tool **ceph-erasure-code-tool** which can use the
plugins to encode and decode data provided in a set of files. However there
does not seem to be any scripting in teuthology to perform consistency checks
by using the objectstore tool to read data and then using this tool to validate
consistency. We will write some teuthology helpers that use
ceph-objectstore-tool and ceph-erasure-code-tool to perform offline
validation.

We would also like an online way of performing full consistency checks, either
for specific objects or for a whole pool. Inconveniently EC pools do not
support class methods so it's not possible to use this as a way of
implementing a full consistency check. We will investigate putting a flag on a
read request, on the pool or implementing a new request type to perform a full
consistency check on an object and look at making extensions to the rados CLI
to be able to perform these tests. See also the discussion on deep scrub
below.

When there is more than one coding parity and there is an inconsistency
between the data and the coding parities it is useful to try and analyze the
cause of the inconsistency. Because the multiple coding parities are providing
redundancy, there can be multiple ways of reconstructing each chunk and this
can be used to detect the most likely cause of the inconsistency. For example
with a 4+2 erasure coding and a dropped write to the 1st data OSD, the stripe
(all 6 OSDs) will be inconsistent, as will be any selection of 5 OSDs that
includes the 1st data OSD, but data OSDs 2, 3 and 4 and the two coding parity
OSDs will still be consistent. While there are many ways a stripe could get
into this state, a tool could conclude that the most likely cause is a missed
update to OSD 1. Ceph does not have a tool to perform this type of analysis,
but it should be easy to extend ceph-erasure-code-tool.
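
As a sketch of the analysis such an extension could perform, the toy 4+2 code
below (the same simplified GF(2^8) arithmetic as the earlier example, not the
real plugin maths) checks which single shard can be excluded so that the
remaining five are mutually consistent; a dropped write to the 1st data OSD is
then identified as the most likely cause.

.. code-block:: python

   from functools import reduce

   def gf_mul(a, b):
       """GF(2^8) multiplication (polynomial 0x11d), as in the earlier sketch."""
       r = 0
       while b:
           if b & 1:
               r ^= a
           a <<= 1
           if a & 0x100:
               a ^= 0x11d
           b >>= 1
       return r

   COEFFS = [1, 2, 4, 8]                      # Q coefficients of the toy 4+2 code

   def expected_parities(data):
       p = reduce(lambda x, y: x ^ y, data)
       q = reduce(lambda x, y: x ^ y, (gf_mul(c, d) for c, d in zip(COEFFS, data)))
       return p, q

   def consistent_without(shards, excluded):
       """Are the five shards other than 'excluded' mutually consistent?
       shards = [d0, d1, d2, d3, P, Q]; excluded in range(6)."""
       data, p, q = shards[:4], shards[4], shards[5]
       ep, eq = expected_parities(data)
       if excluded == 4:                      # ignore P: only the Q equation remains
           return q == eq
       if excluded == 5:                      # ignore Q: only the P equation remains
           return p == ep
       # Ignoring a data shard: solve it from the P equation, then re-check Q.
       solved = p ^ ep ^ data[excluded]       # XOR of the other data shards with P
       repaired = list(data)
       repaired[excluded] = solved
       return q == expected_parities(repaired)[1]

   # Build a healthy stripe, then simulate a dropped write to the 1st data OSD
   # (shard 0 still holds the old value while everything else was updated).
   old, new = [0x10, 0x20, 0x30, 0x40], [0x1f, 0x2f, 0x3f, 0x4f]
   p, q = expected_parities(new)
   shards = [old[0]] + new[1:] + [p, q]

   suspects = [i for i in range(6) if consistent_without(shards, i)]
   print(suspects)                            # [0] -> shard 0 most likely missed an update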

Teuthology seems to have adequate tools for taking OSDs offline and bringing
them back online again. There are a few tools for injecting read I/O errors
(without taking an OSD offline) but there is scope to improve these
(e.g. the ability to specify a particular offset in an object that will fail a
read, and more controls over setting and deleting error inject sites).

The general philosophy of teuthology seems to be to randomly inject faults and
simply through brute force get sufficient coverage of all the error
paths. This is a good approach for CI testing; however, when EC code paths
become complex and require multiple errors to occur with precise timing to
cause a particular code path to execute, it becomes hard to get coverage
without running the tests for a very long time. There are some standalone
tests for EC which do test some of the multiple failure paths, but these tests
perform very limited amounts of I/O and don't inject failures while there are
I/Os in flight, so they miss some of the interesting scenarios.

To deal with these more complex error paths we propose developing a new type
of thrasher for erasure coding that injects a sequence of errors and makes use
of debug hooks to capture and delay I/O requests at particular points to
ensure an error inject hits a particular timing window. To do this we will
extend the tell osd command to include extra interfaces to inject errors and
capture and stall I/Os at specific points.

Some parts of erasure coding such as the plugins are standalone pieces of code
which can be tested with unit tests. There are already some unit tests and
performance benchmark tools for erasure coding; we will look to extend these
to get further coverage of code that can be run standalone.

I/O path changes
----------------

Avoid unnecessary reads and writes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The current code reads too much data for read and overwrite I/Os. For
overwrites it will also rewrite unmodified data. This occurs because reads and
overwrites are rounded up to full-stripe operations. This isn't a problem when
data is mainly being accessed sequentially but is very wasteful for random I/O
operations. The code can be changed to only read/write the necessary shards. To
allow the code to efficiently support larger chunk sizes, I/Os should be
rounded to page-sized I/Os instead of chunk-sized I/Os.

The first simple set of optimizations eliminates unnecessary reads and
unnecessary writes of data, but retains writes of metadata on all shards. This
avoids breaking the current design which depends on all shards receiving a
metadata update for every transaction. When changes to the metadata handling
are completed (see below) then it will be possible to make further
optimizations to reduce the number of metadata updates for additional savings.

Parity-delta-write
^^^^^^^^^^^^^^^^^^

The current code implements overwrites by performing a full-stripe read,
merging the overwritten data, calculating new coding parities and performing a
full-stripe write. Reading and writing every shard is expensive; there are a
number of optimizations that can be applied to speed this up. For a K+M
configuration where M is small, it is often less work to perform a
parity-delta-write. This is implemented by reading the old data that is about
to be overwritten and XORing it with the new data to create a delta. The
coding parities can then be read, updated to apply the delta and
re-written. With M=2 (RAID6) this can result in just 3 reads and 3 writes to
perform an overwrite of less than one chunk.

Note that where a large fraction of the data in the stripe is being updated,
this technique can result in more work than performing a partial overwrite;
however, if both update techniques are supported it is fairly easy to calculate,
for a given I/O offset and length, which is the optimal technique to use.
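
A rough sketch of that calculation, counting only shard reads and writes and
ignoring network hops and metadata (the real decision logic may weigh other
factors), is shown below.

.. code-block:: python

   def shard_ios(chunks_modified, k, m):
       """Rough count of shard reads plus writes for the two update strategies,
       for an overwrite touching 'chunks_modified' of the K data chunks."""
       # Full-stripe: read the chunks that are not being overwritten, write the
       # modified data chunks plus all M coding parities.
       full_stripe = (k - chunks_modified) + (chunks_modified + m)
       # Parity-delta: read old data for the modified chunks and all M parities,
       # then write the modified chunks and all M parities back.
       parity_delta = (chunks_modified + m) * 2
       return full_stripe, parity_delta

   # For a 10+2 pool, a single-chunk overwrite is 3 reads + 3 writes as a
   # parity-delta-write versus 9 reads + 3 writes as a full-stripe update.
   for touched in (1, 2, 5, 10):
       full, delta = shard_ios(touched, k=10, m=2)
       best = "parity-delta" if delta < full else "full-stripe"
       print(f"{touched:2d} chunk(s) modified: full-stripe={full:2d} IOs, "
             f"parity-delta={delta:2d} IOs -> {best}")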

Write I/Os submitted to the Primary OSD will perform this calculation to
decide whether to use a full-stripe update or a parity-delta-write. Note that
if read failures are encountered while performing a parity-delta-write and it
is necessary to reconstruct data or a coding parity then it will be more
efficient to switch to performing a full-stripe read, merge and write.

Not all erasure codings and erasure coding libraries support the capability of
performing delta updates; however, those implemented using XOR and/or GF
arithmetic should. We have checked jerasure and isa-l and confirmed that they
support this feature, although the necessary APIs are not currently exposed by
the plugins. For some erasure codes such as clay and lrc it may be possible to
apply delta updates, but the delta may need to be applied in so many places
that this makes it a worthless optimization. This proposal suggests that
parity-delta-write optimizations are initially implemented only for the most
commonly used erasure codings. Erasure code plugins will provide a new flag
indicating whether they support the new interfaces needed to perform delta
updates.

Direct reads
^^^^^^^^^^^^

Read I/Os are currently directed to the primary OSD which then issues reads to
other shards. To reduce I/O latency and network bandwidth it would be better
if clients could issue direct read requests to the OSD storing the data,
rather than via the primary. There are a few error scenarios where the client
may still need to fall back to submitting reads to the primary; a secondary OSD
will have the option of failing a direct read with -EAGAIN to request that the
client retries the request to the primary OSD.

Direct reads will always be for <= one chunk. For reads of more than one chunk
the client can issue direct reads to multiple OSDs, however these will no
longer be guaranteed to be atomic because an update (write) may be applied in
between the separate read requests. If a client needs atomicity guarantees
they will need to continue to send the read to the primary.

Direct reads will be failed with EAGAIN where a reconstruct and decode
operation is required to return the data. This means only reads to the primary
OSD will need to handle the reconstruct code path. When an OSD is backfilling we
don't want the client to have large quantities of I/O failed with EAGAIN,
therefore we will make the client detect this situation and avoid issuing
direct I/Os to a backfilling OSD.
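
The client-side policy could look something like the sketch below. This is
purely illustrative: the class and the injected callbacks are hypothetical and
do not correspond to the librados API.

.. code-block:: python

   EAGAIN = 11

   class DirectReadClient:
       """Illustrative policy only: read a single-chunk extent from the OSD that
       stores it, falling back to the primary when the shard's OSD is
       backfilling, absent, or asks us to retry with -EAGAIN."""

       def __init__(self, osdmap, primary_rpc, direct_rpc):
           self.osdmap = osdmap            # maps (object, shard) -> OSD and its state
           self.primary_rpc = primary_rpc  # read via the primary (always safe)
           self.direct_rpc = direct_rpc    # read directly from a secondary OSD

       def read(self, obj, shard, offset, length):
           osd, state = self.osdmap(obj, shard)
           if osd is None or state == "backfilling":
               # Avoid flooding a backfilling OSD with reads it would only fail.
               return self.primary_rpc(obj, shard, offset, length)
           status, data = self.direct_rpc(osd, obj, shard, offset, length)
           if status == -EAGAIN:
               # The secondary could not serve the read (e.g. reconstruct needed);
               # only the primary handles the decode path.
               return self.primary_rpc(obj, shard, offset, length)
           return data

   # Toy wiring: the direct path succeeds, so the primary is never contacted.
   client = DirectReadClient(
       osdmap=lambda obj, shard: ("osd.3", "active"),
       primary_rpc=lambda obj, shard, off, length: b"via-primary",
       direct_rpc=lambda osd, obj, shard, off, length: (0, b"via-osd.3"),
   )
   print(client.read("rbd_data.123", shard=2, offset=8192, length=4096))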

For backwards compatibility, for client requests that cannot cope with the
reduced guarantees of a direct read, and for scenarios where the direct read
would be to an OSD that is absent or backfilling, reads directed to the
primary OSD will still be supported.

Direct writes
^^^^^^^^^^^^^

Write I/Os are currently directed to the primary OSD which then updates the
other shards. To reduce latency and network bandwidth it would be better if
clients could direct small overwrite requests directly to the OSD storing the
data, rather than via the primary. For larger write I/Os and for error
scenarios and abnormal cases clients will continue to submit write I/Os to the
primary OSD.

Direct writes will always be for <= one chunk and will use the
parity-delta-write technique to perform the update. For medium sized writes a
client may issue direct writes to multiple OSDs, but such updates will no
longer be guaranteed to be atomic. If a client requires atomicity for a larger
write they will need to continue to send it to the primary.

For backwards compatibility, and for scenarios where the direct write would be
to an OSD that is absent, writes directed to the primary OSD will still be
supported.

I/O serialization, recovery/backfill and other error scenarios
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

Direct writes look fairly simple until you start considering all the abnormal
scenarios. The current implementation of processing all writes on the Primary
OSD means that there is one central point of control for the stripe that can
manage things like the ordering of multiple inflight I/Os to the same stripe,
ensuring that recovery/backfill for an object has been completed before it is
accessed and assigning the object version number and modification time.

With direct I/Os these become distributed problems. Our approach is to send a
control path message to the Primary OSD and let it continue to be the central
point of control. The Primary OSD will issue a reply when the OSD can start
the direct write and will be informed with another message when the I/O has
completed. See the section below on metadata updates for more details.

Stripe cache
^^^^^^^^^^^^

Erasure code pools maintain a stripe cache which stores shard data while
updates are in progress. This is required to allow writes and reads to the
same stripe to be processed in parallel. For small sequential write workloads
and for extreme hot spots (e.g. where the same block is repeatedly re-written
for some kind of crude checkpointing mechanism) there would be a benefit in
keeping the stripe cache slightly longer than the duration of the I/O. In
particular, the coding parities are typically read and written for every
update to a stripe. There is obviously a balancing act to achieve between
keeping the cache long enough that it reduces the overheads for future I/Os
versus the memory overheads of storing this data. A small cache (MiB rather
than GiB sized) should be sufficient for most workloads. The stripe cache can
also help reduce latency for direct write I/Os by allowing prefetch I/Os to
read old data and coding parities ready for later parts of the write operation
without requiring more complex interlocks.

The stripe cache is less important when the default chunk size is small
(e.g. 4K), because even with small write I/O requests there will not be many
sequential updates to fill a stripe. With a larger chunk size (e.g. 64K) the
benefits of a good stripe cache become more significant because the stripe
size will be hundreds of KiB to a small number of MiB and hence it becomes much
more likely that a sequential workload will issue many I/Os to the same stripe.
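
A minimal sketch of such a bounded cache, assuming a simple LRU policy keyed by
object and stripe (the real stripe cache tracks shard extents and in-flight
state rather than whole stripes), is shown below.

.. code-block:: python

   from collections import OrderedDict

   class StripeCache:
       """Minimal bounded stripe cache: keeps recently used stripe data for a
       short while after an update completes so that the next overwrite of the
       same stripe can skip the read of old data and coding parities."""

       def __init__(self, capacity_bytes=4 * 1024 * 1024):
           self.capacity = capacity_bytes
           self.used = 0
           self.entries = OrderedDict()          # (object, stripe) -> bytes

       def get(self, obj, stripe):
           key = (obj, stripe)
           if key in self.entries:
               self.entries.move_to_end(key)     # refresh LRU position
               return self.entries[key]
           return None

       def put(self, obj, stripe, data):
           key = (obj, stripe)
           if key in self.entries:
               self.used -= len(self.entries.pop(key))
           self.entries[key] = data
           self.used += len(data)
           while self.used > self.capacity:      # evict least recently used stripes
               _, evicted = self.entries.popitem(last=False)
               self.used -= len(evicted)

   cache = StripeCache(capacity_bytes=256 * 1024)
   cache.put("obj1", 0, bytes(128 * 1024))
   cache.put("obj1", 1, bytes(128 * 1024))
   cache.put("obj1", 2, bytes(128 * 1024))       # evicts ("obj1", 0)
   print(cache.get("obj1", 0) is None, cache.get("obj1", 2) is not None)  # True True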

Automatically choose chunk size
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The default chunk size of 4K is good for small objects because the data and
coding parities are rounded up to whole chunks and because if an object has
less than one data stripe of data then the capacity overheads for the coding
parities are higher (e.g. a 4K object in a 10+2 erasure coded pool has 4K of
data and 8K of coding parity, so there is a 200% overhead). However the
optimizations above all provide much bigger savings if the typical random
access I/O only reads or writes a single shard. This means that, so long as
objects are big enough, a larger chunk size such as 64K would be better.

Whilst the user can try and predict what their typical object size will be
and choose an appropriate chunk size, it would be better if the code could
automatically select a small chunk size for small objects and a larger chunk
size for larger objects. There will always be scenarios where an object grows
(or is truncated) and the chosen chunk size becomes inappropriate, however
reading and re-writing the object with a new chunk size when this happens
won't have that much performance impact. This also means that the chunk size
can be deduced from the object size in object_info_t, which is read before the
object's data is read/modified. Clients already provide a hint as to the object
size when creating the object, so this could be used to select a chunk size to
reduce the likelihood of having to re-stripe an object.
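
A sketch of the kind of policy this could use, selecting a chunk size from the
client's object-size hint, is shown below; the candidate sizes and threshold
are illustrative, not a tuned proposal.

.. code-block:: python

   def choose_chunk_size(object_size_hint, k,
                         chunk_sizes=(4096, 16384, 65536, 262144)):
       """Pick the largest candidate chunk size that still lets an object of the
       hinted size fill at least one whole data stripe (K chunks), so that small
       objects keep a 4K chunk and large objects get 64K/256K chunks."""
       for chunk in reversed(chunk_sizes):
           if object_size_hint >= k * chunk:
               return chunk
       return chunk_sizes[0]

   # K=4: a 16K object keeps 4K chunks, a 4M RBD-style object gets 256K chunks.
   for hint in (16 * 1024, 256 * 1024, 4 * 1024 * 1024):
       print(hint, "->", choose_chunk_size(hint, k=4))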

The thought is to support a new chunk size of auto/variable to enable this
feature. It will only be applicable to newly created pools; there will be no
way to migrate an existing pool.

Deep scrub support
^^^^^^^^^^^^^^^^^^

EC pools with overwrite do not check CRCs because it is too costly to update
the CRC for the object on every overwrite; instead the code relies on
Bluestore to maintain and check CRCs. When an EC pool is operating with
overwrite disabled a CRC is kept for each shard, because it is possible to
update CRCs as the object is appended to just by calculating a CRC for the new
data being appended and then doing a simple (quick) calculation to combine the
old and new CRC together.

dev/osd_internals/erasure_coding/proposals.rst discusses the possibility
of keeping CRCs at a finer granularity (for example per chunk), storing these
either as an xattr or an omap (omap is more suitable as large objects could
end up with a lot of CRC metadata) and updating these CRCs when data is
overwritten (the update would need to perform a read-modify-write at the same
granularity as the CRC). These finer granularity CRCs can then easily be
combined to produce a CRC for the whole shard or even the whole erasure coded
object.

This proposal suggests going in the opposite direction - EC overwrite pools
have survived without CRCs and relied on Bluestore up until now, so why is
this feature needed? The current code doesn't check CRCs if overwrite is
enabled, but sadly still calculates and updates a CRC in the hinfo xattr, even
when performing overwrites, which means that the calculated value will be
garbage. This means we pay all the overheads of calculating the CRC and get no
benefits.

The code can easily be fixed so that CRCs are calculated and maintained when
objects are written sequentially, but as soon as the first overwrite to an
object occurs the hinfo xattr will be discarded and CRCs will no longer be
calculated or checked. This will improve performance when objects are
overwritten, and will improve data integrity in cases where they are not.

While the thought is to abandon EC storing CRCs in objects being overwritten,
there is an improvement that can be made to deep scrub. Currently deep scrub
of an EC with overwrite pool just checks that every shard can read the object;
there is no checking to verify that the copies on the shards are consistent. A
full consistency check would require large data transfers between the shards
so that the coding parities could be recalculated and compared with the stored
versions; in most cases this would be unacceptably slow. However for many
erasure codes (including the default ones used by Ceph) if the contents of a
chunk are XOR'd together to produce a longitudinal summary value, then an
encoding of the longitudinal summary values of each data shard should produce
the same longitudinal summary values as are stored by the coding parity
shards. This comparison is less expensive than the CRC checks performed by
replication pools. There is a risk that, by XORing the contents of a chunk
together, a set of corruptions cancel each other out, but this level of
check is better than no check and will be very successful at detecting a
dropped write, which will be the most common type of corruption.
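
The sketch below shows the idea for the XOR parity only (the same argument
applies to the GF-weighted parities because the code is linear): fold each
chunk down to a small longitudinal summary, then check that the combined data
summaries match the parity's summary. Only the summaries would need to be
exchanged between shards.

.. code-block:: python

   from functools import reduce
   from operator import xor

   def fold(chunk):
       """Longitudinal summary of a chunk: XOR of all its bytes (a real
       implementation would fold fixed-size words across the chunk)."""
       return reduce(xor, chunk, 0)

   def scrub_check(data_chunks, p_chunk):
       """Deep-scrub style consistency check for the XOR parity of a stripe:
       combine the per-shard summaries and compare with the parity summary."""
       return reduce(xor, (fold(c) for c in data_chunks), 0) == fold(p_chunk)

   # A healthy toy stripe passes, a stripe with a dropped write does not.
   data = [bytes([seed, seed + 1, seed + 2, seed + 3, seed * 7])
           for seed in (1, 10, 20, 30)]
   parity = bytes(a ^ b ^ c ^ d for a, b, c, d in zip(*data))
   print(scrub_check(data, parity))                   # True

   stale = [data[0]] + [bytes([9] * 5)] + data[2:]    # shard 1 missed an update
   print(scrub_check(stale, parity))                  # False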

Metadata changes
----------------

What metadata do we need to consider?

1. object_info_t. Every Ceph object has some metadata stored in the
   object_info_t data structure. Some of these fields (e.g. object length) are
   not updated frequently and we can simply avoid performing partial write
   optimizations when these fields need updating. The more problematic fields
   are the version numbers and the last modification time which are updated on
   every write. Version numbers of objects are compared to version numbers in
   PG log entries for peering/recovery and with version numbers on other
   shards for backfill. Version numbers and modification times can be read by
   clients.

2. PG log entries. The PG log is used to track inflight transactions and to
   allow incomplete transactions to be rolled forward/backwards after an
   outage/network glitch. The PG log is also used to detect and resolve
   duplicate requests (e.g. resent due to network glitch) from
   clients. Peering currently assumes that every shard has a copy of the log
   and that this is updated for every transaction.

3. PG stats entries and other PG metadata. There is other PG metadata (PG
   stats is the simplest example) that gets updated on every
   transaction. Currently all OSDs retain a cached and a persistent copy of
   this metadata.
How many copies of metadata are required?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The current implementation keeps K+M replicated copies of metadata, one copy
on each shard. The minimum number of copies that need to be kept to support up
to M failures is M+1. In theory metadata could be erasure encoded, however
given that it is small it is probably not worth the effort. One advantage of
keeping K+M replicated copies of the metadata is that any fully in sync shard
can read the local copy of metadata, avoiding the need for inter-OSD messages
and asynchronous code paths. Specifically this means that any OSD not
performing backfill can become the primary and can access metadata such as
object_info_t locally.

M+1 arbitrarily distributed copies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A partial write to one data shard will always involve updates to the data
shard and all M coding parity shards, therefore for optimal performance it
would be ideal if the same M+1 shards are updated to track the associated
metadata update. This means that for small random writes a different M+1
shards would get updated for each write. The drawback of this approach is that
you might need to read K shards to find the most up to date version of the
metadata.

In this design no shard will have an up to date copy of the metadata for every
object. This means that whatever shard is picked to be the acting primary, it
may not have all the metadata available locally and may need to send
messages to other OSDs to read it. This would add significant extra complexity
to the PG code and cause divergence between erasure coded pools and replicated
pools. For these reasons we discount this design option.

M+1 copies on known shards
^^^^^^^^^^^^^^^^^^^^^^^^^^

The next best performance can be achieved by always applying metadata updates
to the same M+1 shards, for example choosing the 1st data shard and all M
coding parity shards. Coding parity shards will get updated by every partial
write so this will result in zero or one extra shard being updated. With this
approach only 1 shard needs to be read to find the most up to date version of
the metadata.

We can restrict the acting primary to be one of the M+1 shards, which means
that once any incomplete updates in the log have been resolved, the primary
will have an up to date local copy of all the metadata; this means that much
more of the PG code can be kept unchanged.

Partial Writes and the PG log
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Peering currently assumes that every shard has a copy of the log; however,
because of inflight updates and short-term absences it is possible that some
shards are missing some of the log entries. The job of peering is to combine
the logs from the set of present shards to form a definitive log of
transactions that have been committed by all the shards. Any discrepancies
between a shard's log and the definitive log are then resolved, typically by
rolling backwards transactions (using information held in the log entry) so
that all the shards are in a consistent state.

To support partial writes the log entry needs to be modified to include the
set of shards that are being updated. Peering needs to be modified to consider
a log entry as missing from a shard only if a copy of the log entry on another
shard indicates that this shard was meant to be updated.
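
A simplified model of that change (the data structures are illustrative, not
the real pg_log_entry_t) is shown below.

.. code-block:: python

   from dataclasses import dataclass

   @dataclass(frozen=True)
   class LogEntry:
       version: int
       updated_shards: frozenset      # shards that were asked to persist this write

   def missing_entries(definitive_log, shard, shard_log_versions):
       """Entries the shard should have but does not.  With partial writes an
       entry only counts as missing if the shard was in its updated_shards set;
       today's code effectively treats every entry as updating every shard."""
       return [e for e in definitive_log
               if shard in e.updated_shards and e.version not in shard_log_versions]

   # 4+2 pool, shards 0-3 data and 4-5 parity.  A partial write to shard 1 only
   # touches shards {1, 4, 5} plus, in this design, the primary (shard 0).
   log = [
       LogEntry(1, frozenset(range(6))),          # full-stripe write
       LogEntry(2, frozenset({0, 1, 4, 5})),      # partial write to shard 1
   ]

   print(missing_entries(log, shard=2, shard_log_versions={1}))   # [] - not a gap
   print(missing_entries(log, shard=4, shard_log_versions={1}))   # entry 2 is missing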

The logs are not infinite in size, and old log entries where it is known that
the update has been successfully committed on all affected shards are
trimmed. Log entries are first condensed to a pg_log_dup_t entry which can no
longer assist in rollback of a transaction but can still be used to detect
duplicated client requests, and then later completely discarded. Log trimming
is performed at the same time as adding a new log entry, typically when a
future write updates the log. With partial writes log trimming will only occur
on shards that receive updates, which means that some shards may have stale
log entries that should have been discarded.

TBD: I think the code can already cope with discrepancies in log trimming
between the shards. Clearly an in-flight trim operation may not have completed
on every shard so small discrepancies can be dealt with, but I think an absent
OSD can cause larger discrepancies. I believe that this is resolved during
peering, with each OSD keeping a record of what the oldest log entry should be
and this gets shared between OSDs so that they can work out stale log entries
that were trimmed in absentia. Hopefully this means that only sending log
trimming updates to shards that are creating new log entries will work without
code changes.

Backfill
^^^^^^^^

Backfill is used to correct inconsistencies between OSDs that occur when an
OSD is absent for a longer period of time and the PG log entries have been
trimmed. Backfill works by comparing object versions between shards. If some
shards have out of date versions of an object then a reconstruct is performed
by the backfill process to update the shard. If the version numbers on objects
are not updated on all shards then this will break the backfill process and
cause a huge amount of unnecessary reconstruct work. This is unacceptable, in
particular for the scenario where an OSD is just absent for maintenance for a
relatively short time with noout set. The requirement is to be able to
minimize the amount of reconstruct work needed to complete a backfill.

dev/osd_internals/erasure_coding/proposals.rst discusses the idea of
each shard storing a vector of version numbers that records the most recent
update that the pair <this shard, other shard> both should have participated
in. By collecting this information from at least M shards it is possible to
work out what the expected minimum version number should be for an object on a
shard and hence deduce whether a backfill is required to update the
object. The drawback of this approach is that backfill will need to scan M
shards to collect this information, compared with the current implementation
that only scans the primary and shard(s) being backfilled.

With the additional constraint that a known M+1 shards will always be updated
and that the (acting) primary will be one of these shards, it will be possible
to determine whether a backfill is required just by examining the vector on
the primary and the object version on the shard being backfilled. If the
backfill target is one of the M+1 shards the existing version number
comparison is sufficient; if it is another shard then the version in the
vector on the primary needs to be compared with the version on the backfill
target. This means that backfill does not have to scan any more shards than it
currently does; however, the scan of the primary does need to read the vector,
and if there are multiple backfill targets then it may need to store multiple
entries of the vector per object, increasing memory usage during the backfill.
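
A sketch of the comparison, with the vector represented as a per-shard map of
expected versions on the primary's object info (the real encoding may differ),
is shown below.

.. code-block:: python

   def needs_backfill(primary_oi, backfill_target, target_version, m_plus_1):
       """Decide whether an object on 'backfill_target' is stale.

       primary_oi is a simplified object_info_t: its own version plus a vector
       of the versions the K-1 'other' shards are expected to hold after partial
       writes (no vector means no partial write ever happened)."""
       version, vector = primary_oi
       if backfill_target in m_plus_1:
           # The target is one of the always-updated M+1 shards: the existing
           # object-version comparison is sufficient.
           expected = version
       else:
           # Otherwise compare against the version recorded for that shard.
           expected = vector.get(backfill_target, version) if vector else version
       return target_version < expected

   # 4+2 pool: shard 0 is the primary, shards {0, 4, 5} are always updated.
   M_PLUS_1 = {0, 4, 5}
   oi = (12, {1: 12, 2: 9, 3: 9})   # shards 2 and 3 last participated at version 9

   print(needs_backfill(oi, backfill_target=2, target_version=9, m_plus_1=M_PLUS_1))   # False
   print(needs_backfill(oi, backfill_target=1, target_version=9, m_plus_1=M_PLUS_1))   # True
   print(needs_backfill(oi, backfill_target=4, target_version=9, m_plus_1=M_PLUS_1))   # True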

There is only a requirement to keep the vector on the M+1 shards, and the
vector only needs K-1 entries because we only need to track version number
differences between any of the M+1 shards (which should have the same version)
and each of the K-1 shards (which can have a stale version number). This will
slightly reduce the amount of extra metadata required. The vector of version
numbers could be stored in the object_info_t structure or stored as a separate
attribute.

Our preference is to store the vector in the object_info_t structure because
typically both are accessed together, and because this makes it easier to
cache both in the same object cache. We will keep metadata and memory
overheads low by only storing the vector when it is needed.

Care is required to ensure that existing clusters can be upgraded. The absence
of the vector of version numbers implies that an object has never had a
partial update and therefore all shards are expected to have the same version
number for the object and the existing backfill algorithm can be used.

Code references
"""""""""""""""

PrimaryLogPG::scan_range - this function creates a map of objects and their
version numbers; on the primary it tries to get this information from the
object cache, otherwise it reads the OI attribute. This will need changes to
deal with the vectors. To conserve memory it will need to be provided with the
set of backfill targets so it can select which part of the vector to keep.

PrimaryLogPG::recover_backfill - this function calls scan_range for the local
(primary) shard and sends MOSDPGScan to the backfill targets to get them to
perform the same scan. Once it has collected all the version numbers it
compares the primary and backfill targets to work out which objects need to be
recovered. This will also need changes to deal with the vectors when comparing
version numbers.

PGBackend::run_recovery_op - recovers a single object. For an EC pool this
involves reconstructing the data for the shards that need backfilling (read
other shards and use decode to recover). This code shouldn't need any changes.

Version number and last modification time for clients
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Clients can read the object version number and set expectations about what the
minimum version number is when making updates. Clients can also read the last
modification time. There are use cases where it is important that these values
can be read and give consistent results, but there are also a large number of
scenarios where this information is not required.

If the object version number is only being updated on a known M+1 shards for
partial writes, then where this information is required it will need to
involve a metadata access to one of those shards. We have arranged for the
primary to be one of the M+1 shards so I/Os submitted to the primary will
always have access to the up to date information.

Direct write I/Os need to update the M+1 shards, so it is not difficult to
also return this information to the client when completing the I/O.

Direct read I/Os are the problem case; these will only access the local shard
and will not necessarily have access to the latest version and modification
time. For simplicity we will require clients that require this information to
send requests to the primary rather than using the direct I/O
optimization. Where a client does not need this information they can use the
direct I/O optimizations.

The direct read I/O optimization will still return a (potentially stale)
object version number. This may still be of use to clients to help understand
the ordering of I/Os to a chunk.

Direct Write with Metadata updates
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Here's the full picture of what a direct write performing a parity-delta-write
looks like with all the control messages (the numbered flows are listed below
the diagram):

.. ditaa::

      RADOS Client

      | ^
      | |
      1 28
      +-----+ | |
      | |<------27-------+--+
      | | | |
      | | +-------------|->|
      | | | | |
      | |<-|----2--------+ |<--------------------------------------------+
      | Seq | | | |<----------------------------+ |
      | | | +----3---+ | | |
      | | | | +--|-----------------------5-----|---+ |
      | | | | +--|-------4---------+ | | |
      | +--|-10-|------->| | | | | |
      | | | | +---+ | | | | |
      | | | | | | | | | | |
      | | | | v | | | | | |
      +----++ | | +---+ | | | | | |
      ^ | | | |XOR+-|--|----------15-----|-----------|---|-----+ |
      | | | | |13 +-|--|-------14--------|-----+ | | | |
      | | | | +---+ | | | | | | | |
      | | | | ^ | | | v | | v |
      | | | | | | | | +------+ | | +------+ |
      6 11 | | | | | | | XOR | | | | GF | |
      | | | | | | | | | 18 | | | | 21 | |
      | | | | 12 16 | | +----+-+ | | +----+-+ |
      | | | | | | | | ^ | | | ^ | |
      | | | | | | | | | | | | | | |
      | | | | 17 19 | | 20 22 |
      | | | | | | | | | | | | | | |
      | | | | | v | | | v | | | v |
      | | | | +-+----+ | | +-+----+ | | +-+----+ |
      | | | +->|Extent| | +->|Extent| | +->|Extent| |
      | | 23 |Cache | 24 |Cache | 25 |Cache | 26
      | | | +----+-+ | +----+-+ | +----+-+ |
      | | | ^ | | ^ | | ^ | |
      | | | | | | | | | | | |
      | +---+ 7 +---+ 8 +---+ 9 +---+
      | | | | | | | |
      | v | v | v | v

    .-----.     .-----.     .-----.     .-----.     .-----.     .-----.
   (       )   (       )   (       )   (       )   (       )   (       )
   |`-----'|   |`-----'|   |`-----'|   |`-----'|   |`-----'|   |`-----'|
   |       |   |       |   |       |   |       |   |       |   |       |
   |       |   |       |   |       |   |       |   |       |   |       |
   (       )   (       )   (       )   (       )   (       )   (       )
    `-----'     `-----'     `-----'     `-----'     `-----'     `-----'
    Primary      OSD 2       OSD 3       OSD 4       OSD P       OSD Q
      OSD

    * Xattr      * No Xattr                          * Xattr      * Xattr
    * OI         * Stale OI                          * OI         * OI
    * PG log     * Partial PG log                    * PG log     * PG log
    * PG stats   * No PG stats                       * PG stats   * PG stats

Note: Only the primary OSD and parity coding OSDs (the M+1 shards) have Xattr,
up to date object info, PG log and PG stats. Only one of these OSDs is
permitted to become the (acting) primary. The other data OSDs 2, 3 and 4 (the
K-1 shards) do not have Xattrs or PG stats, may have stale object info and
only have PG log entries for their own updates. OSDs 2, 3 and 4 may have stale
OI with an old version number. The other OSDs have the latest OI and a vector
with the expected version numbers for OSDs 2, 3 and 4.
|
|||
|
|
|||
|
1. Data message with Write I/O from client (MOSDOp)
|
|||
|
2. Control message to Primary with Xattr (new msg MOSDEcSubOpSequence)
|
|||
|
|
|||
|
Note: the primary needs to be told about any xattr update so it can update its
|
|||
|
copy, but the main purpose of this message is to allow the primary to sequence
|
|||
|
the write I/O. The reply message at step 10 is what allows the write to start
|
|||
|
and provides the PG stats and new object info including the new version
|
|||
|
number. If necessary the primary can delay this to ensure that
|
|||
|
recovery/backfill of the object is completed first and deal with overlapping
|
|||
|
writes. Data may be read (prefetched) before the reply, but obviously no
|
|||
|
transactions can start.
|
|||
|
|
|||
|
3. Prefetch request to local extent cache
|
|||
|
4. Control message to P to prefetch to extent cache (new msg
|
|||
|
MOSDEcSubOpPrefetch equivalent of MOSDEcSubOpRead)
|
|||
|
5. Control message to Q to prefetch to extent cache (new msg
|
|||
|
MOSDEcSubOpPrefetch equivalent of MOSDEcSubOpRead)
|
|||
|
6. Primary reads object info
|
|||
|
7. Prefetch old data
|
|||
|
8. Prefetch old P
|
|||
|
9. Prefetch old Q
|
|||
|
|
|||
|
Note: The objective of these prefetches is to get the old data, P and Q reads
|
|||
|
started as quickly as possible to reduce the latency of the whole I/O. There
|
|||
|
may be error scenarios where the extent cache is not able to retain this and
|
|||
|
it will need to be re-read. This includes the rare/pathological scenarios
|
|||
|
where there is a mixture of writes sent to the primary and writes sent
|
|||
|
directly to the data OSD for the same object.
|
|||
|
|
|||
|
10. Control message to data OSD with new object info + PG stats (new msg
|
|||
|
MOSDEcSubOpSequenceReply)
|
|||
|
11. Transaction to update object info + PG log + PG stats
|
|||
|
12. Fetch old data (hopefully cached)
|
|||
|
|
|||
|
Note: For best performance we want to pipeline writes to the same stripe. The
|
|||
|
primary assigns the version number to each write and consequently defines the
|
|||
|
order in which writes should be processed. It is important that the data shard
|
|||
|
and the coding parity shards apply overlapping writes in the same order. The
|
|||
|
primary knows what set of writes are in flight so can detect this situation
|
|||
|
and indicate in its reply message at step 10 that an update must wait until an
|
|||
|
earlier update has been applied. This information needs to be forwarded to the
|
|||
|
coding parities (steps 14 and 15) so they can also ensure updates are applied
|
|||
|
in the same order.
|
|||
|
|
|||
|
13. XOR new and old data to create delta (the arithmetic behind steps 13, 18
    and 21 is sketched after this list)
14. Data message to P with delta + Xattr + object info + PG log + PG stats
    (new msg MOSDEcSubOpDelta equivalent of MOSDEcSubOpWrite)
15. Data message to Q with delta + Xattr + object info + PG log + PG stats
    (new msg MOSDEcSubOpDelta equivalent of MOSDEcSubOpWrite)
16. Transaction to update data + object info + PG log
17. Fetch old P (hopefully cached)
18. XOR delta and old P to create new P
19. Transaction to update P + Xattr + object info + PG log + PG stats
20. Fetch old Q (hopefully cached)
21. XOR delta and old Q to create new Q
22. Transaction to update Q + Xattr + object info + PG log + PG stats
23. Control message to data OSD for commit (new msg MOSDEcSubOpDeltaReply
    equivalent of MOSDEcSubOpWriteReply)
24. Local commit notification
25. Control message to data OSD for commit (new msg MOSDEcSubOpDeltaReply
    equivalent of MOSDEcSubOpWriteReply)
26. Control message to data OSD for commit (new msg MOSDEcSubOpDeltaReply
    equivalent of MOSDEcSubOpWriteReply)
27. Control message to Primary to signal end of write (variant of new msg
    MOSDEcSubOpSequence)
28. Control message reply to client (MOSDOpReply)

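To make steps 13, 18 and 21 concrete, the following minimal sketch shows the
arithmetic for a single updated data shard. It assumes a Reed-Solomon style
code in which P is a plain XOR parity and Q scales the delta by a per-shard
GF(2^8) coefficient (using the 0x11d polynomial that ISA-L uses); the
coefficient value, buffer sizes and names are illustrative, not the plugin
interface.

.. code-block:: cpp

   // Minimal sketch of the arithmetic behind a parity-delta-write (steps 13,
   // 18 and 21).  Shard indices, coefficients and the GF(2^8) polynomial are
   // assumptions for illustration only.
   #include <cstdint>
   #include <cstdio>
   #include <vector>

   // GF(2^8) multiply, reducing by x^8 + x^4 + x^3 + x^2 + 1 (0x11d).
   static uint8_t gf_mul(uint8_t a, uint8_t b) {
     uint8_t p = 0;
     while (b) {
       if (b & 1) p ^= a;
       bool carry = a & 0x80;
       a <<= 1;
       if (carry) a ^= 0x1d;   // low byte of 0x11d
       b >>= 1;
     }
     return p;
   }

   int main() {
     const size_t len = 8;                       // sub-chunk being overwritten
     std::vector<uint8_t> old_data(len, 0x55), new_data(len, 0xa7);
     std::vector<uint8_t> old_p(len, 0x11), old_q(len, 0x22);
     const uint8_t q_coeff = 3;                  // assumed RS coefficient for this shard

     // Step 13: delta = old_data XOR new_data.
     std::vector<uint8_t> delta(len);
     for (size_t i = 0; i < len; ++i) delta[i] = old_data[i] ^ new_data[i];

     // Step 18: new_P = old_P XOR delta (P is a plain XOR parity).
     // Step 21: new_Q = old_Q XOR (q_coeff * delta) in GF(2^8).
     std::vector<uint8_t> new_p(len), new_q(len);
     for (size_t i = 0; i < len; ++i) {
       new_p[i] = old_p[i] ^ delta[i];
       new_q[i] = old_q[i] ^ gf_mul(q_coeff, delta[i]);
     }

     printf("delta[0]=%02x new_p[0]=%02x new_q[0]=%02x\n",
            delta[0], new_p[0], new_q[0]);
     return 0;
   }

In this model the same delta buffer is sent to each coding parity (steps 14
and 15) and only the locally applied coefficient differs.
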
Upgrade and backwards compatibility
-----------------------------------

A few of the optimizations can be made just by changing code on the primary
OSD with no backwards compatibility concerns regarding clients or the other
OSDs. These optimizations will be enabled as soon as the primary OSD upgrades
and will replace the existing code paths.

The remainder of the changes will be new I/O code paths that will exist
alongside the existing code paths.

Similar to EC Overwrites, many of the changes will need to ensure that all
OSDs are running new code and that the EC plugins support new interfaces
required for parity-delta-writes. A new pool level flag will be required to
enforce this. It will be possible to enable this flag (and hence enable the
new performance optimizations) after upgrading an existing cluster. Once set,
it will not be possible to add down level OSDs to the pool. It will not be
possible to turn this flag off other than by deleting the pool. Downgrade is
not supported because:

1. It is not trivial to quiesce all I/O to a pool to ensure that none of the
   new I/O code paths are in use when the flag is cleared.

2. The PG log format for new I/Os will not be understood by down level
   OSDs. It would be necessary to ensure the log has been trimmed of all new
   format entries before clearing the flag to ensure that down level OSDs will
   be able to interpret the log.

3. Additional xattr data will be stored by the new I/O code paths and used by
   backfill. Down level code will not understand how to backfill a pool that
   has been running the new I/O paths and will get confused by the
   inconsistent object version numbers. While it is theoretically possible to
   disable partial updates and then scan and update all the metadata to return
   the pool to a state where a downgrade is possible, we have no intention of
   writing this code.

The direct I/O changes will additionally require clients to be running new
code. These will require that the pool has the new flag set and that a new
client is used. Old clients can use pools with the new flag set, just without
the direct I/O optimization.

Not under consideration
-----------------------

There is a list of enhancements discussed in
doc/dev/osd_internals/erasure_coding/proposals.rst; the following are not
under consideration:

1. RADOS Client Acknowledgement Generation optimization

When updating K+M shards in an erasure coded pool, in theory you don't have to
wait for all the updates to complete before completing the update to the
client, because so long as K updates have completed, any viable subset of
shards should be able to roll forward the update.

For partial writes where only M+1 shards are updated this optimization does
not apply, as all M+1 updates need to complete before the update is completed
to the client.

This optimization would require changes to the peering code to work out
whether partially completed updates need to be rolled forwards or
backwards. To roll an update forwards it would be simplest to mark the object
as missing and use the recovery path to reconstruct and push the update to
OSDs that are behind.

2. Avoid sending read request to local OSD via Messenger

The EC backend code has an optimization for writes to the local OSD which
avoids sending a message and reply via the messenger. The equivalent
optimization could be made for reads as well, although a bit more care is
required because the read is synchronous and will block the thread waiting
for the I/O to complete.

Pull request https://github.com/ceph/ceph/pull/57237 is making this
optimization.

Stories
=======

This is our high level breakdown of the work. Our intention is to deliver this
work as a series of PRs. The stories are roughly in the order we plan to
develop them. Each story is at least one PR; where possible they will be
broken up further. The earlier stories can be implemented as stand alone
pieces of work and will not introduce upgrade/backwards compatibility
issues. The later stories will start breaking backwards compatibility; here we
plan to add a new flag to the pool to enable these new features. Initially
this will be an experimental flag while the later stories are developed.

Test tools - enhanced I/O generator for testing erasure coding
--------------------------------------------------------------

* Extend rados bench to be able to generate more interesting patterns of I/O
  for erasure coding, in particular reading and writing at different offsets
  and for different lengths and making sure we get good coverage of boundary
  conditions such as the sub-chunk size, chunk size and stripe size
* Improve data integrity checking by using a seed to generate data patterns
  and remembering which seed is used for each block that is written so that
  data can later be validated (a sketch of this approach follows this list)

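The following is an illustrative sketch of the seed-based data pattern idea,
not rados bench code: fill each block from a PRNG seeded per block, remember
only the seed, and regenerate the expected contents to validate a later
read-back. The container and function names are placeholders.

.. code-block:: cpp

   #include <cstdint>
   #include <cstdio>
   #include <map>
   #include <random>
   #include <vector>

   // Deterministically expand a seed into a block-sized data pattern.
   static std::vector<uint8_t> make_pattern(uint64_t seed, size_t len) {
     std::mt19937_64 rng(seed);
     std::vector<uint8_t> buf(len);
     for (auto& b : buf) b = static_cast<uint8_t>(rng());
     return buf;
   }

   int main() {
     std::map<uint64_t, uint64_t> seed_by_block;   // block number -> seed used
     const size_t block_size = 4096;

     // "Write" block 7 with a pattern derived from seed 12345 and remember it.
     seed_by_block[7] = 12345;
     std::vector<uint8_t> written = make_pattern(seed_by_block[7], block_size);

     // Later, validate a "read" of block 7 by regenerating the pattern.
     std::vector<uint8_t> expected = make_pattern(seed_by_block[7], block_size);
     bool ok = (written == expected);
     printf("block 7 %s\n", ok ? "matches" : "is corrupt");
     return 0;
   }
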
Test tools - offline consistency checking tool
----------------------------------------------

* Test tools for performing offline consistency checks combining use of
  objectstore_tool with ceph-erasure-code-tool
* Enhance some of the teuthology standalone erasure code checks to use this
  tool

Test tools - online consistency checking tool
---------------------------------------------

* New CLI to be able to perform online consistency checking for an object or a
  range of objects that reads all the data and coding parity shards and
  re-encodes the data to validate the coding parities

Switch from JErasure to ISA-L
------------------------------

The JErasure library has not been updated since 2014; the ISA-L library is
maintained and exploits newer instruction sets (e.g. AVX512, AVX2), which
gives faster encoding/decoding.

* Change defaults to ISA-L in upstream ceph
* Benchmark JErasure and ISA-L
* Refactor Ceph isa_encode region_xor() to use AVX when M=1
* Documentation updates
* Present results at performance weekly

Sub Stripe Reads
----------------

Ceph currently reads an integer number of stripes and discards unneeded
data. In particular, for small random reads it will be more efficient to just
read the required data.

* Help finish Pull Request https://github.com/ceph/ceph/pull/55196 if not
  already complete
* Further changes to issue sub-chunk reads rather than full-chunk reads

Simple Optimizations to Overwrite
---------------------------------

Ceph overwrites currently read an integer number of stripes, merge the new
data and write an integer number of stripes. This story makes simple
improvements by applying the same optimizations as for sub stripe reads and,
for small (sub-chunk) updates, reducing the amount of data being read/written
on each shard.

* Only read chunks that are not being fully overwritten (code currently reads
  the whole stripe and then merges new data)
* Perform sub-chunk reads for sub-chunk updates
* Perform sub-chunk writes for sub-chunk updates

Eliminate unnecessary chunk writes but keep metadata transactions
-----------------------------------------------------------------

This story avoids re-writing data that has not been modified. A transaction is
still applied to every OSD to update object metadata, the PG log and PG stats.

* Continue to create transactions for all chunks but without the new write data
* Add sub-chunk writes to transactions where data is being modified

Avoid zero padding objects to a full stripe
-------------------------------------------

Objects are rounded up to an integer number of stripes by adding zero
padding. These buffers of zeros are then sent in messages to other OSDs and
written to the object store, consuming storage. This story makes optimizations
to remove the need for this padding.

* Modifications to reconstruct reads to avoid reading zero-padding at the end
  of an object - just fill the read buffer with zeros instead
* Avoid transfers/writes of buffers of zero padding. Still send transactions
  to all shards and create the object, just don't populate it with zeros
* Modifications to encode/decode functions to avoid having to pass in buffers
  of zeros when objects are padded

Erasure coding plugin changes to support distributed partial writes
-------------------------------------------------------------------

This is preparatory work for future stories; it adds new APIs to the erasure
code plugins. A possible shape for these interfaces is sketched after the
list below.

* Add a new interface to create a delta by XORing old and new data together
  and implement this for the ISA-L and JErasure plugins
* Add a new interface to apply a delta to one coding parity by using XOR/GF
  and implement this for the ISA-L and JErasure plugins
* Add a new interface which reports which erasure codes support this feature
  (ISA-L and JErasure will support it, others will not)

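The exact signatures are still to be designed; the following is a minimal
sketch of the three hooks described above, with placeholder names and
buffer-in/buffer-out semantics, not the real Ceph plugin interface. A trivial
XOR-only implementation (the M=1, RAID5-like case) is included so the sketch
is self-contained.

.. code-block:: cpp

   #include <cstdint>
   #include <cstdio>
   #include <vector>

   using buffer_t = std::vector<uint8_t>;

   struct ErasureCodeDeltaOps {
     virtual ~ErasureCodeDeltaOps() = default;
     // Capability query: does this plugin support parity-delta updates?
     virtual bool supports_parity_delta() const = 0;
     // delta = old_data XOR new_data (buffers must be the same length).
     virtual void create_delta(const buffer_t& old_data, const buffer_t& new_data,
                               buffer_t* delta) const = 0;
     // Fold the delta from data shard data_shard into coding shard coding_shard.
     virtual void apply_delta(const buffer_t& delta, unsigned data_shard,
                              unsigned coding_shard, buffer_t* coding) const = 0;
   };

   // Trivial implementation for an XOR-only parity.
   struct XorDeltaOps final : ErasureCodeDeltaOps {
     bool supports_parity_delta() const override { return true; }
     void create_delta(const buffer_t& o, const buffer_t& n,
                       buffer_t* d) const override {
       d->resize(o.size());
       for (size_t i = 0; i < o.size(); ++i) (*d)[i] = o[i] ^ n[i];
     }
     void apply_delta(const buffer_t& delta, unsigned, unsigned,
                      buffer_t* coding) const override {
       for (size_t i = 0; i < delta.size(); ++i) (*coding)[i] ^= delta[i];
     }
   };

   int main() {
     XorDeltaOps ops;
     buffer_t old_data{1, 2, 3}, new_data{1, 9, 3}, parity{7, 7, 7}, delta;
     ops.create_delta(old_data, new_data, &delta);
     ops.apply_delta(delta, 0, 0, &parity);   // parity now reflects new_data
     printf("delta[1]=%d parity[1]=%d\n", delta[1], parity[1]);
     return 0;
   }
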
Erasure coding interface to allow RADOS clients to direct I/Os to OSD storing the data
---------------------------------------------------------------------------------------

This is preparatory work for future stories; it adds a new API for clients.

* New interface to convert the pair (pg, offset) to {OSD, remaining chunk
  length}

We do not want clients to have to dynamically link to the erasure code plugins
so this code will need to be part of librados. However this interface needs to
understand how erasure codes distribute data and coding chunks to be able to
perform this translation.

We will only support ISA-L and JErasure plugins where there is a trivial
striping of data chunks to OSDs. The arithmetic behind this translation is
sketched below.

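The sketch below assumes the trivial striping described above (K data chunks
of chunk_size bytes per stripe); the names are illustrative, and mapping the
resulting shard to an actual OSD still requires the PG's acting set, which is
not shown.

.. code-block:: cpp

   #include <cstdint>
   #include <cstdio>

   struct ChunkTarget {
     uint32_t shard;       // which data shard (0..K-1) holds this offset
     uint64_t remaining;   // bytes from offset to the end of that chunk
   };

   static ChunkTarget locate(uint64_t offset, uint32_t k, uint64_t chunk_size) {
     const uint64_t stripe_width = k * chunk_size;
     const uint64_t off_in_stripe = offset % stripe_width;
     ChunkTarget t;
     t.shard = static_cast<uint32_t>(off_in_stripe / chunk_size);
     t.remaining = chunk_size - (off_in_stripe % chunk_size);
     return t;
   }

   int main() {
     // K=4, 16 KiB chunks: offset 70000 lands in the second stripe, shard 0.
     ChunkTarget t = locate(70000, 4, 16384);
     printf("shard=%u remaining=%llu\n", t.shard,
            static_cast<unsigned long long>(t.remaining));
     return 0;
   }
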
Changes to object_info_t
------------------------

This is preparatory work for future stories.

This adds the vector of version numbers to object_info_t which will be used
for partial updates. For replicated pools and for erasure coded objects that
are not overwritten we will avoid storing extra data in object_info_t. A
sketch of the idea follows.

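The following is a hypothetical sketch only; the real object_info_t encoding
will differ. It illustrates the idea that shards skipped by a partial update
keep an older expected version, while the object as a whole advances.

.. code-block:: cpp

   #include <cstdint>
   #include <cstdio>
   #include <vector>

   struct eversion_sketch {
     uint64_t epoch = 0;
     uint64_t version = 0;
   };

   struct object_info_sketch {
     eversion_sketch version;                    // latest version of the object
     // One entry per shard; empty when the object has never been partially
     // updated (replicated pools, never-overwritten EC objects).
     std::vector<eversion_sketch> shard_versions;
   };

   int main() {
     object_info_sketch oi;
     oi.version = {10, 42};
     oi.shard_versions.assign(6, {10, 42});      // K=4 + M=2 shards
     // A partial update touching shard 1 and the two parities bumps only those.
     for (int shard : {1, 4, 5}) oi.shard_versions[shard] = {10, 43};
     oi.version = {10, 43};
     printf("shard0 expects %llu, shard1 expects %llu\n",
            (unsigned long long)oi.shard_versions[0].version,
            (unsigned long long)oi.shard_versions[1].version);
     return 0;
   }
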
Changes to PGLog and Peering to support updating a subset of OSDs
-----------------------------------------------------------------

This is preparatory work for future stories.

* Modify the PG log entry to store a record of which OSDs are being updated
* Modify peering to use this extra data to work out OSDs that are missing
  updates

Change to selection of (acting) primary
---------------------------------------

This is preparatory work for future stories.

Constrain the choice of primary to be the first data OSD or one of the erasure
coding parities. If none of these OSDs are available and up to date then the
pool must be offline.

Implement parity-delta-write with all computation on the primary
----------------------------------------------------------------

* Calculate whether it is more efficient for an update to perform a full
  stripe overwrite or a parity-delta-write (an illustrative cost comparison
  follows this list)
* Implement new code paths to perform the parity-delta-write
* Test tool enhancements. We want to make sure that both parity-delta-write
  and full-stripe write are tested. We will add a new conf file option with a
  choice of 'parity-delta', 'full-stripe', 'mixture for testing' or
  'automatic' and update teuthology test cases to predominately use a mixture.

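One possible way to frame the decision is to count shard I/O operations: a
parity-delta-write reads and writes only the modified data chunks and the M
parities, while a full stripe overwrite reads the untouched data chunks and
writes the new data and parities. This is illustrative only; the real
heuristic is still to be designed and may also weigh bytes moved, message
counts and extent cache hits.

.. code-block:: cpp

   #include <cstdio>

   // d = number of data chunks modified by the update, k/m = pool profile.
   static bool prefer_parity_delta(unsigned d, unsigned k, unsigned m) {
     unsigned parity_delta_ios = 2 * (d + m);       // read + write of modified data and parities
     unsigned full_stripe_ios  = (k - d) + (d + m); // read untouched data, write new data + parities
     return parity_delta_ios < full_stripe_ios;
   }

   int main() {
     printf("4+2, 1 chunk modified: %s\n",
            prefer_parity_delta(1, 4, 2) ? "parity-delta" : "full-stripe");
     printf("8+2, 1 chunk modified: %s\n",
            prefer_parity_delta(1, 8, 2) ? "parity-delta" : "full-stripe");
     return 0;
   }
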
Upgrades and backwards compatibility
------------------------------------

* Add a new feature flag for erasure coded pools
* All OSDs must be running new code to enable the flag on the pool
* Clients may only issue direct I/Os if the flag is set
* OSDs running old code may not join a pool with the flag set
* It is not possible to turn the feature flag off (other than by deleting the
  pool)

Changes to Backfill to use the vector in object_info_t
------------------------------------------------------

This is preparatory work for future stories.

* Modify the backfill process to use the vector of version numbers in
  object_info_t so that when partial updates occur we do not backfill OSDs
  which did not participate in the partial update (the per-shard comparison
  is sketched after this list)
* When there is a single backfill target extract the appropriate version
  number from the vector (no additional storage required)
* When there are multiple backfill targets extract the subset of the vector
  required by the backfill targets and select the appropriate entry when
  comparing version numbers in PrimaryLogPG::recover_backfill

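The check below is illustrative, not PrimaryLogPG code: a backfill target only
needs this object if the version it holds is older than the version the
authoritative object info expects for that particular shard. It reuses the
hypothetical per-shard version vector sketched earlier.

.. code-block:: cpp

   #include <cstdint>
   #include <cstdio>
   #include <vector>

   static bool shard_needs_backfill(const std::vector<uint64_t>& expected_per_shard,
                                    unsigned shard, uint64_t version_on_target) {
     return version_on_target < expected_per_shard.at(shard);
   }

   int main() {
     // K=4 + M=2; a partial update bumped shards 1, 4 and 5 to version 43.
     std::vector<uint64_t> expected = {42, 43, 42, 42, 43, 43};
     printf("shard 0 at 42: %s\n",
            shard_needs_backfill(expected, 0, 42) ? "backfill" : "up to date");
     printf("shard 1 at 42: %s\n",
            shard_needs_backfill(expected, 1, 42) ? "backfill" : "up to date");
     return 0;
   }
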
Test tools - offline metadata validation tool
---------------------------------------------

* Test tools for performing offline consistency checking of metadata, in
  particular checking the vector of version numbers in object_info_t matches
  the versions on each OSD, but also for validating PG log entries

Eliminate transactions on OSDs not updating data chunks
-------------------------------------------------------

Peering, log recovery and backfill can now all cope with partial updates using
the vector of version numbers in object_info_t.

* Modify the overwrite I/O path to not bother with metadata only transactions
  (except to the Primary OSD)
* Modify the update of the version numbers in object_info_t to use the vector
  and only update entries that are receiving a transaction
* Modify the generation of the PG log entry to record which OSDs are being
  updated

Direct reads to OSDs (single chunk only)
----------------------------------------

* Modify OSDClient to route single chunk read I/Os to the OSD storing the data
* Modify OSD to accept reads from non-primary OSD (expand existing changes for
  replicated pools to work with EC pools as well)
* If necessary fail the read with EAGAIN if the OSD is unable to process the
  read directly
* Modify OSDClient to retry read by submitting to Primary OSD if read is
  failed with EAGAIN (the fallback flow is sketched after this section)
* Test tool enhancements. We want to make sure that both direct reads and
  reads to the primary are tested. We will add a new conf file option with a
  choice of 'prefer direct', 'primary only' or 'mixture for testing' and
  update teuthology test cases to predominately use a mixture.

The changes will be made to the OSDC part of the RADOS client so will be
applicable to rbd, rgw and cephfs.

We will not make changes to other code that has its own version of RADOS
client code such as krbd, although this could be done in the future.

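The fallback is sketched below in illustrative form only; it is not Objecter
code and the function names are placeholders. The shard OSD is tried first and
the request is resubmitted to the PG primary if the shard OSD answers -EAGAIN
because it cannot serve the read directly (for example while degraded or mid
recovery).

.. code-block:: cpp

   #include <cerrno>
   #include <cstdio>
   #include <functional>

   using read_fn = std::function<int(bool to_primary)>;

   static int direct_read_with_fallback(const read_fn& submit) {
     int r = submit(false);          // direct read to the OSD holding the chunk
     if (r == -EAGAIN) {
       r = submit(true);             // fall back to the primary, which can reconstruct
     }
     return r;
   }

   int main() {
     // Fake submit function: pretend the shard OSD is unable to serve the read.
     read_fn fake = [](bool to_primary) { return to_primary ? 0 : -EAGAIN; };
     printf("read result: %d\n", direct_read_with_fallback(fake));
     return 0;
   }
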
Direct reads to OSDs (multiple chunks)
--------------------------------------

* Add a new OSDC flag NONATOMIC which allows OSDC to split a read into
  multiple requests
* Modify OSDC to split reads spanning multiple chunks into separate requests
  to each OSD if the NONATOMIC flag is set (the split arithmetic is sketched
  after this section)
* Modifications to OSDC to coalesce results (if any sub read fails the whole
  read needs to fail)
* Changes to librbd client to set NONATOMIC flag for reads
* Changes to cephfs client to set NONATOMIC flag for reads

We are only changing a very limited set of clients, focusing on those that
issue smaller reads and are latency sensitive. Future work could look at
extending the set of clients (including krbd).

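The split below is illustrative only (it is not OSDC code) and again assumes
the trivially striped layout of K data chunks of chunk_size bytes per stripe.
It shows the arithmetic a NONATOMIC read would use to fan out sub-reads that
never cross a chunk boundary, ready to be coalesced when they complete.

.. code-block:: cpp

   #include <algorithm>
   #include <cstdint>
   #include <cstdio>
   #include <vector>

   struct SubRead {
     uint32_t shard;     // data shard that holds this fragment
     uint64_t offset;    // logical offset of the fragment
     uint64_t length;    // fragment length, never crossing a chunk boundary
   };

   static std::vector<SubRead> split_read(uint64_t off, uint64_t len,
                                          uint32_t k, uint64_t chunk_size) {
     std::vector<SubRead> subs;
     const uint64_t stripe_width = k * chunk_size;
     while (len > 0) {
       uint64_t in_stripe = off % stripe_width;
       uint32_t shard = static_cast<uint32_t>(in_stripe / chunk_size);
       uint64_t in_chunk = in_stripe % chunk_size;
       uint64_t n = std::min(len, chunk_size - in_chunk);
       subs.push_back({shard, off, n});
       off += n;
       len -= n;
     }
     return subs;
   }

   int main() {
     // K=4, 16 KiB chunks: a 40 KiB read starting 8 KiB into a chunk spans
     // three chunks and therefore becomes three sub-reads.
     for (const auto& s : split_read(8192, 40960, 4, 16384)) {
       printf("shard=%u offset=%llu length=%llu\n", s.shard,
              (unsigned long long)s.offset, (unsigned long long)s.length);
     }
     return 0;
   }
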
Implement distributed parity-delta-write
----------------------------------------

* Implement new message MOSDEcSubOpDelta and MOSDEcSubOpDeltaReply
* Change primary to calculate delta and send MOSDEcSubOpDelta message to
  coding parity OSDs
* Modify coding parity OSDs to apply the delta and send MOSDEcSubOpDeltaReply
  message

Note: This change will increase latency because the coding parity reads start
after the old data read. Future work will fix this.

Test tools - EC error injection thrasher
----------------------------------------

* Implement a new type of thrasher that specifically injects faults to stress
  erasure coded pools
* Take one or multiple (up to M) OSDs down, with more focus on taking
  different subsets of OSDs down to drive all the different EC recovery paths
  than on stressing peering/recovery/backfill (the existing OSD thrasher
  excels at this)
* Inject read I/O failures to force reconstructs using decode for single and
  multiple failures
* Inject delays using an osd tell type interface to make it easier to test OSD
  down at all the interesting stages of EC I/Os
* Inject delays using an osd tell type interface to slow down an OSD
  transaction or message to expose the less common completion orders for
  parallel work

Implement prefetch message MOSDEcSubOpPrefetch and modify extent cache
----------------------------------------------------------------------

* Implement new message MOSDEcSubOpPrefetch
* Change primary to issue this message to the coding parity OSDs before
  starting read of old data
* Change the extent cache so that each OSD caches its own data rather than
  caching everything on the primary
* Change coding parity OSDs to handle this message and read the old coding
  parity into the extent cache
* Changes to extent cache to retain the prefetched old parity until the
  MOSDEcSubOpDelta message is received, and to discard this on error paths
  (e.g. new OSDMap)

Implement sequencing message MOSDEcSubOpSequence
------------------------------------------------

* Implement new message MOSDEcSubOpSequence and MOSDEcSubOpSequenceReply
* Modify primary code to create these messages and route them locally to
  itself in preparation for direct writes

Direct writes to OSD (single chunk only)
----------------------------------------

* Modify OSDC to route single chunk write I/Os to the OSD storing the data
* Changes to issue MOSDEcSubOpSequence and MOSDEcSubOpSequenceReply between
  data OSD and primary OSD

Direct writes to OSD (multiple chunks)
--------------------------------------

* Modifications to OSDC to split multiple chunk writes into separate requests
  if NONATOMIC flag is set
* Further changes to coalescing completions (in particular reporting version
  number correctly)
* Changes to librbd client to set NONATOMIC flag for writes
* Changes to cephfs client to set NONATOMIC flag for writes

We are only changing a very limited set of clients, focusing on those that
issue smaller writes and are latency sensitive. Future work could look at
extending the set of clients.

Deep scrub / CRC
----------------

* Disable CRC generation in the EC code for overwrites, delete hinfo Xattr
  when first overwrite occurs
* For objects in a pool with the new feature flag set that have not been
  overwritten, check the CRC even if the pool overwrite flag is set. The
  presence/absence of hinfo can be used to determine if the object has been
  overwritten
* For deep scrub requests XOR the contents of the shard to create a
  longitudinal check (8 bytes wide?) - a sketch of this check follows
* Return the longitudinal check in the scrub reply message, have the primary
  encode the set of longitudinal replies to check for inconsistencies

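The sketch below illustrates folding a shard's contents into an 8-byte XOR
accumulator; the width, the tail handling and whether a plain XOR is strong
enough are all still open questions in the story above.

.. code-block:: cpp

   #include <cstdint>
   #include <cstdio>
   #include <cstring>
   #include <vector>

   static uint64_t longitudinal_xor(const std::vector<uint8_t>& shard) {
     uint64_t acc = 0;
     size_t i = 0;
     for (; i + 8 <= shard.size(); i += 8) {
       uint64_t word;
       std::memcpy(&word, shard.data() + i, 8);
       acc ^= word;
     }
     if (i < shard.size()) {               // fold in a zero-padded tail
       uint64_t word = 0;
       std::memcpy(&word, shard.data() + i, shard.size() - i);
       acc ^= word;
     }
     return acc;
   }

   int main() {
     std::vector<uint8_t> shard(4096, 0xab);
     printf("check=%016llx\n", (unsigned long long)longitudinal_xor(shard));
     return 0;
   }
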
Variable chunk size erasure coding
----------------------------------

* Implement new pool option for automatic/variable chunk size
* When object size is small use a small chunk size (4K) when the pool is using
  the new option
* When object size is large use a large chunk size (64K or 256K?)
* Convert the chunk size by reading and re-writing the whole object when a
  small object grows (append)
* Convert the chunk size by reading and re-writing the whole object when a
  large object shrinks (truncate)
* Use the object size hint to avoid creating small objects and then almost
  immediately converting them to a larger chunk size (a sketch of the size
  selection follows this list)

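The selection policy is not yet designed; the sketch below only illustrates
the shape of it. The 4K/64K values echo the bullets above, and the crossover
threshold is an arbitrary placeholder, as is the idea of feeding the client's
object size hint in as the expected size.

.. code-block:: cpp

   #include <cstdint>
   #include <cstdio>

   static uint64_t choose_chunk_size(uint64_t expected_object_size) {
     const uint64_t small_chunk = 4 * 1024;
     const uint64_t large_chunk = 64 * 1024;
     const uint64_t crossover = 256 * 1024;     // placeholder threshold
     return expected_object_size < crossover ? small_chunk : large_chunk;
   }

   int main() {
     printf("16KiB object -> %llu byte chunks\n",
            (unsigned long long)choose_chunk_size(16 * 1024));
     printf("4MiB object  -> %llu byte chunks\n",
            (unsigned long long)choose_chunk_size(4 * 1024 * 1024));
     return 0;
   }
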
CLAY Erasure Codes
------------------

In theory CLAY erasure codes should be good for K+M erasure codes with larger
values of M, in particular when these erasure codes are used with multiple
OSDs in the same failure domain (e.g. an 8+6 erasure code with 5 servers each
with 4 OSDs). We would like to improve the test coverage for CLAY and perform
some more benchmarking to collect data to help substantiate when people should
consider using CLAY.

* Benchmark CLAY erasure codes - in particular the number of I/Os required for
  backfills when multiple OSDs fail
* Enhance test cases to validate the implementation
* See also https://bugzilla.redhat.com/show_bug.cgi?id=2004256