===========================
Erasure coding enhancements
===========================

Objectives
==========

Our objective is to improve the performance of erasure coding, in particular
for small random accesses, to make erasure-coded pools more viable for
storing block and file data.

We are looking to reduce the number of OSD read and write accesses per client
I/O (sometimes referred to as I/O amplification), reduce the amount of network
traffic between OSDs (network bandwidth) and reduce I/O latency (the time to
complete read and write I/O operations). We expect the changes will also
provide modest reductions in CPU overheads.

While the changes are focused on enhancing small random accesses, some
enhancements will provide modest benefits for larger I/O accesses and for
object storage.

The following sections give a brief description of the improvements we are
looking to make. Please see the later design sections for more details.

Current Read Implementation
---------------------------

For reference, this is how erasure-coded reads currently work:

.. ditaa::

RADOS Client
|
||
* Current code reads all data chunks
|
||
^ * Discards unneeded data
|
||
| * Returns requested data to client
|
||
+----+----+
|
||
| Discard | If data cannot be read then the coding parity
|
||
|unneeded | chunks are read as well and are used to reconstruct
|
||
| data | the data
|
||
+---------+
|
||
^^^^
|
||
||||
|
||
||||
|
||
||||
|
||
|||+----------------------------------------------+
|
||
||+-------------------------------------+ |
|
||
|+----------------------------+ | |
|
||
| | | |
|
||
.-----. .-----. .-----. .-----. .-----. .-----.
|
||
( ) ( ) ( ) ( ) ( ) ( )
|
||
|`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'|
|
||
| | | | | | | | | | | |
|
||
| | | | | | | | | | | |
|
||
( ) ( ) ( ) ( ) ( ) ( )
|
||
`-----' `-----' `-----' `-----' `-----' `-----'
|
||
Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q
|
||
OSD

Note: All the diagrams illustrate a K=4 + M=2 configuration; however, the
concepts and techniques can be used for all K+M configurations.

Partial Reads
-------------

If only a small amount of data is being read it is not necessary to read the
whole stripe; ideally, for small I/Os, only a single OSD needs to be involved
in reading the data. See also the larger chunk size discussion below.

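As an illustration of the stripe arithmetic involved, here is a minimal
Python sketch of how a small logical read maps onto a single chunk, and
therefore a single OSD (the constants and function name are illustrative,
not the actual EC backend code):

.. code-block:: python

    K = 4                # data chunks per stripe
    CHUNK = 64 * 1024    # chunk size (see "Larger chunk size" below)

    def locate(offset: int, length: int, k: int = K, chunk: int = CHUNK):
        """Return the (shard, shard_offset, length) extents touched by a read."""
        stripe = k * chunk
        extents = []
        while length > 0:
            stripe_no, within = divmod(offset, stripe)
            shard, chunk_off = divmod(within, chunk)
            run = min(length, chunk - chunk_off)   # never cross a chunk boundary
            extents.append((shard, stripe_no * chunk + chunk_off, run))
            offset += run
            length -= run
        return extents

    # A 4 KiB read at offset 200 KiB touches a single shard, so only one OSD
    # needs to service it:
    print(locate(200 * 1024, 4096))   # [(3, 8192, 4096)]
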

.. ditaa::

RADOS Client
|
||
* Optimize by only reading required chunks
|
||
^ * For large chunk sizes and sub-chunk reads only
|
||
| read a sub-chunk
|
||
+----+----+
|
||
| Return | If data cannot be read then extra data and coding
|
||
| data | parity chunks are read as well and are used to
|
||
| | reconstruct the data
|
||
+---------+
|
||
^
|
||
|
|
||
|
|
||
|
|
||
|
|
||
|
|
||
+----------------------------+
|
||
|
|
||
.-----. .-----. .-----. .-----. .-----. .-----.
|
||
( ) ( ) ( ) ( ) ( ) ( )
|
||
|`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'|
|
||
| | | | | | | | | | | |
|
||
| | | | | | | | | | | |
|
||
( ) ( ) ( ) ( ) ( ) ( )
|
||
`-----' `-----' `-----' `-----' `-----' `-----'
|
||
Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q
|
||
OSD

Pull Request https://github.com/ceph/ceph/pull/55196 is implementing most of
this optimization; however, it still issues full chunk reads.

Current Overwrite Implementation
--------------------------------

For reference, here is how erasure-coded overwrites currently work:

.. ditaa::

RADOS Client
|
||
| * Read all data chunks
|
||
| * Merges new data
|
||
+----v-----+ * Encodes new coding parities
|
||
| Read old | * Writes data and coding parities
|
||
|Merge new |
|
||
| Encode |-------------------------------------------------------------+
|
||
| Write |---------------------------------------------------+ |
|
||
+----------+ | |
|
||
^|^|^|^| | |
|
||
|||||||+-------------------------------------------+ | |
|
||
||||||+-------------------------------------------+| | |
|
||
|||||+-----------------------------------+ || | |
|
||
||||+-----------------------------------+| || | |
|
||
|||+---------------------------+ || || | |
|
||
||+---------------------------+| || || | |
|
||
|v |v |v |v v v
|
||
.-----. .-----. .-----. .-----. .-----. .-----.
|
||
( ) ( ) ( ) ( ) ( ) ( )
|
||
|`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'|
|
||
| | | | | | | | | | | |
|
||
| | | | | | | | | | | |
|
||
( ) ( ) ( ) ( ) ( ) ( )
|
||
`-----' `-----' `-----' `-----' `-----' `-----'
|
||
Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q
|
||
OSD

Partial Overwrites
------------------

Ideally we aim to be able to perform updates to erasure-coded stripes by only
updating a subset of the shards (those with modified data or coding
parities). Avoiding unnecessary data updates on the other shards is easy;
avoiding any metadata updates on the other shards is much harder (see the
design section on metadata updates).

.. ditaa::

RADOS Client
|
||
| * Only read chunks that are not being overwritten
|
||
| * Merge new data
|
||
+----v-----+ * Encode new coding parities
|
||
| Read old | * Only write modified data and parity shards
|
||
|Merge new |
|
||
| Encode |-------------------------------------------------------------+
|
||
| Write |---------------------------------------------------+ |
|
||
+----------+ | |
|
||
^ |^ ^ | |
|
||
| || | | |
|
||
| || +-------------------------------------------+ | |
|
||
| || | | |
|
||
| |+-----------------------------------+ | | |
|
||
| +---------------------------+ | | | |
|
||
| | | | | |
|
||
| v | | v v
|
||
.-----. .-----. .-----. .-----. .-----. .-----.
|
||
( ) ( ) ( ) ( ) ( ) ( )
|
||
|`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'|
|
||
| | | | | | | | | | | |
|
||
| | | | | | | | | | | |
|
||
( ) ( ) ( ) ( ) ( ) ( )
|
||
`-----' `-----' `-----' `-----' `-----' `-----'
|
||
Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q
|
||
OSD

This diagram is overly simplistic and only shows the data flows. The simplest
implementation of this optimization retains a metadata write to every
OSD. With more effort it is possible to reduce the number of metadata updates
as well; see the design section below for more details.

Parity-delta-write
------------------

A common technique used by block storage controllers implementing RAID5 and
RAID6 is what is sometimes called a parity delta write. When a small part of
the stripe is being overwritten it is possible to perform the update by
reading the old data, XORing it with the new data to create a delta, and then
reading each coding parity, applying the delta and writing the new parity.
The advantage of this technique is that it can involve a lot less I/O,
especially for K+M encodings with larger values of K. The technique is not
specific to M=1 and M=2; it can be applied with any number of coding
parities.

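The arithmetic behind a parity delta write can be sketched as follows. This
is a minimal Python sketch assuming a Reed-Solomon style code in which the P
parity is a plain XOR of the data chunks and the Q parity applies a per-shard
Galois-field coefficient; ``gf_mul`` and ``q_coef`` are illustrative and are
not the jerasure/ISA-L plugin API:

.. code-block:: python

    def gf_mul(a: int, b: int) -> int:
        """Multiply two bytes in GF(2^8) with polynomial 0x11d."""
        r = 0
        while b:
            if b & 1:
                r ^= a
            a <<= 1
            if a & 0x100:
                a ^= 0x11d
            b >>= 1
        return r

    def xor_bytes(x: bytes, y: bytes) -> bytes:
        return bytes(a ^ b for a, b in zip(x, y))

    def parity_delta_write(old_data, new_data, old_p, old_q, q_coef):
        # 1. Read the old data and XOR it with the new data to create a delta.
        delta = xor_bytes(old_data, new_data)
        # 2. Apply the delta to each coding parity that was read.
        new_p = xor_bytes(old_p, delta)                            # plain XOR
        new_q = xor_bytes(old_q, bytes(gf_mul(q_coef, d) for d in delta))
        # 3. Write new_data, new_p and new_q; other data shards are untouched.
        return new_data, new_p, new_q

With M=2 this touches only the data shard being overwritten and the two
coding parity shards, independent of K.
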

.. ditaa::

Parity delta writes
|
||
* Read old data and XOR with new data to create a delta
|
||
RADOS Client * Read old encoding parities apply the delta and write
|
||
| the new encoding parities
|
||
|
|
||
| For K+M erasure codings where K is larger and M is small
|
||
| +-----+ +-----+ this is much more efficient
|
||
+->| XOR |-+->| GF |---------------------------------------------------+
|
||
+-+->| | | | |<------------------------------------------------+ |
|
||
| | +-----+ | +-----+ | |
|
||
| | | | |
|
||
| | | +-----+ | |
|
||
| | +->| XOR |-----------------------------------------+ | |
|
||
| | | |<--------------------------------------+ | | |
|
||
| | +-----+ | | | |
|
||
| | | | | |
|
||
| | | | | |
|
||
| +-------------------------------+ | | | |
|
||
+-------------------------------+ | | | | |
|
||
| | | | | |
|
||
| v | v | v
|
||
.-----. .-----. .-----. .-----. .-----. .-----.
|
||
( ) ( ) ( ) ( ) ( ) ( )
|
||
|`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'|
|
||
| | | | | | | | | | | |
|
||
| | | | | | | | | | | |
|
||
( ) ( ) ( ) ( ) ( ) ( )
|
||
`-----' `-----' `-----' `-----' `-----' `-----'
|
||
Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q
|
||
OSD

Direct Read I/O
---------------

We want clients to submit small I/Os directly to the OSD that stores the
data, rather than directing all I/O requests to the Primary OSD and having it
issue requests to the secondary OSDs. By eliminating an intermediate hop this
reduces network bandwidth and improves I/O latency.

.. ditaa::

RADOS Client
|
||
^
|
||
|
|
||
+----+----+ Client sends small read requests directly to OSD
|
||
| Return | avoiding extra network hop via Primary
|
||
| data |
|
||
| |
|
||
+---------+
|
||
^
|
||
|
|
||
|
|
||
|
|
||
|
|
||
|
|
||
|
|
||
|
|
||
.-----. .-----. .-----. .-----. .-----. .-----.
|
||
( ) ( ) ( ) ( ) ( ) ( )
|
||
|`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'|
|
||
| | | | | | | | | | | |
|
||
| | | | | | | | | | | |
|
||
( ) ( ) ( ) ( ) ( ) ( )
|
||
`-----' `-----' `-----' `-----' `-----' `-----'
|
||
Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q
|
||
OSD

.. ditaa::

RADOS Client
|
||
^ ^
|
||
| |
|
||
+----+----+ +--+------+ Client breaks larger read
|
||
| Return | | Return | requests into separate
|
||
| data | | data | requests to multiple OSDs
|
||
| | | |
|
||
+---------+ +---------+ Note client loses atomicity
|
||
^ ^ guarantees if this optimization
|
||
| | is used as an update could occur
|
||
| | between the two reads
|
||
| |
|
||
| |
|
||
| |
|
||
| |
|
||
| |
|
||
.-----. .-----. .-----. .-----. .-----. .-----.
|
||
( ) ( ) ( ) ( ) ( ) ( )
|
||
|`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'|
|
||
| | | | | | | | | | | |
|
||
| | | | | | | | | | | |
|
||
( ) ( ) ( ) ( ) ( ) ( )
|
||
`-----' `-----' `-----' `-----' `-----' `-----'
|
||
Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q
|
||
OSD

Distributed processing of writes
--------------------------------

The existing erasure code implementation processes write I/Os on the primary
OSD, issuing both reads and writes to other OSDs to fetch and update data for
other shards. This is perhaps the simplest implementation, but it uses a lot
of network bandwidth. With parity-delta-writes it is possible to distribute
the processing across OSDs to reduce network bandwidth.

.. ditaa::

Performing the coding parity delta updates on the coding parity
|
||
OSD instead of the primary OSD reduces network bandwidth
|
||
RADOS Client
|
||
| Note: A naive implementation will increase latency by serializing
|
||
| the data and coding parity reads, for best performance these
|
||
| reads need to happen in parallel
|
||
| +-----+ +-----+
|
||
+->| XOR |-+------------------------------------------------------->| GF |
|
||
+-+->| | | | |
|
||
| | +-----+ | +----++
|
||
| | | +-----+ ^ |
|
||
| | +--------------------------------------------->| XOR | | |
|
||
| | | | | |
|
||
| | +---+-+ | |
|
||
| +-------------------------------+ ^ | | |
|
||
+-------------------------------+ | | | | |
|
||
| | | | | |
|
||
| | | | | |
|
||
| | | | | |
|
||
| | | | | |
|
||
| v | v | v
|
||
.-----. .-----. .-----. .-----. .-----. .-----.
|
||
( ) ( ) ( ) ( ) ( ) ( )
|
||
|`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'|
|
||
| | | | | | | | | | | |
|
||
| | | | | | | | | | | |
|
||
( ) ( ) ( ) ( ) ( ) ( )
|
||
`-----' `-----' `-----' `-----' `-----' `-----'
|
||
Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q
|
||
OSD

Direct Write I/O
----------------

.. ditaa::

RADOS Client
|
||
|
|
||
| Similarly Clients could direct small write I/Os
|
||
| to the OSD that needs updating
|
||
|
|
||
| +-----+ +-----+
|
||
+->| XOR |-+--------------------->| GF |
|
||
+-----+->| | | | |
|
||
| | +-----+ | +----++
|
||
| | | +-----+ ^ |
|
||
| | +----------->| XOR | | |
|
||
| | | | | |
|
||
| | +---+-+ | |
|
||
| | ^ | | |
|
||
| | | | | |
|
||
| | | | | |
|
||
| | | | | |
|
||
| | | | | |
|
||
| | | | | |
|
||
| v | v | v
|
||
.-----. .-----. .-----. .-----. .-----. .-----.
|
||
( ) ( ) ( ) ( ) ( ) ( )
|
||
|`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'|
|
||
| | | | | | | | | | | |
|
||
| | | | | | | | | | | |
|
||
( ) ( ) ( ) ( ) ( ) ( )
|
||
`-----' `-----' `-----' `-----' `-----' `-----'
|
||
Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q
|
||
OSD

This diagram is overly simplistic, only showing the data flows. Direct writes
are much harder to implement and will need control messages to the Primary to
ensure that writes to the same stripe are ordered correctly.

Larger chunk size
-----------------

The default chunk size is 4K; this is too small and means that small reads
have to be split up and processed by many OSDs. It is more efficient if small
I/Os can be serviced by a single OSD. Choosing a larger chunk size such as
64K or 256K and implementing partial reads and writes will fix this issue,
but it has the disadvantage that small RADOS objects get rounded up in size
to a whole stripe of capacity.

We would like the code to automatically choose what chunk size to use to
optimize for both capacity and performance. Small objects should use a small
chunk size like 4K; larger objects should use a larger chunk size.

The code currently rounds up I/O sizes to multiples of the chunk size, which
isn't an issue with a small chunk size. With a larger chunk size and partial
reads/writes we should round up to the page size rather than the chunk size.

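A minimal sketch of the rounding change (the page size and chunk size values
are illustrative):

.. code-block:: python

    CHUNK = 64 * 1024
    PAGE = 4096

    def round_io(offset: int, length: int, align: int = PAGE):
        """Round an I/O out to `align` boundaries."""
        start = (offset // align) * align
        end = ((offset + length + align - 1) // align) * align
        return start, end - start

    # Rounding to the page size turns a 3 KB I/O into two 4K pages; rounding
    # to a 64K chunk would inflate it to a full 64K.
    print(round_io(10_000, 3_000))          # (8192, 8192)
    print(round_io(10_000, 3_000, CHUNK))   # (0, 65536)
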

Design
======

We will describe the changes we aim to make in three sections. The first
section looks at the existing test tools for erasure coding and discusses the
improvements we believe will be necessary to get good test coverage for the
changes.

The second section covers changes to the read and write I/O path.

The third section discusses the changes to metadata to avoid the need to
update metadata on all shards for each metadata update. While it is possible
to implement many of the I/O path changes without reducing the number of
metadata updates, there are bigger performance benefits if the number of
metadata updates can be reduced as well.

Test tools
----------

A survey of the existing test tools shows that there is insufficient coverage
of erasure coding to be able to just make changes to the code and expect the
existing CI pipelines to get sufficient coverage. Therefore one of the first
steps will be to improve the test tools to be able to get better test
coverage.

Teuthology is the main test tool used to get test coverage and it relies
heavily on the following tests for generating I/O:

1. **rados** task - qa/tasks/rados.py. This uses ceph_test_rados
   (src/test/osd/TestRados.cc) which can generate a wide mixture of different
   rados operations. There is limited support for read and write I/Os,
   typically using offset 0, although there is a chunked read command used by
   a couple of tests.

2. **radosbench** task - qa/tasks/radosbench.py. This uses **rados bench**
   (src/tools/rados/rados.cc and src/common/obj_bencher.cc). It can be used
   to generate sequential and random I/O workloads; the offset starts at 0
   for sequential I/O. The I/O size can be set but is constant for the whole
   test.

3. **rbd_fio** task - qa/tasks/fio.py. This uses **fio** to generate
   read/write I/O to an rbd image volume.

4. **cbt** task - qa/tasks/cbt.py. This uses the Ceph benchmark tool **cbt**
   to run fio or radosbench to benchmark the performance of a cluster.

5. **rbd bench**. Some of the standalone tests use rbd bench
   (src/tools/rbd/action/Bench.cc) to generate small amounts of I/O
   workload. It is also used by the **rbd_pwl_cache_recovery** task.

It is hard to use these tools to get good coverage of I/Os to non-zero (and
non-stripe aligned) offsets, or to generate a wide variety of offsets and
lengths of I/O requests including all the boundary cases for chunks and
stripes. There is scope to improve either rados, radosbench or rbd bench to
generate much more interesting I/O patterns for testing erasure coding.

For the optimizations described above it is essential that we have good tools
for checking the consistency of either selected objects or all objects in an
erasure coded pool by checking that the data and coding parities are
coherent. There is a test tool **ceph-erasure-code-tool** which can use the
plugins to encode and decode data provided in a set of files. However there
does not seem to be any scripting in teuthology to perform consistency checks
by using the objectstore tool to read data and then using this tool to
validate consistency. We will write some teuthology helpers that use
ceph-objectstore-tool and ceph-erasure-code-tool to perform offline
validation.

We would also like an online way of performing full consistency checks,
either for specific objects or for a whole pool. Inconveniently, EC pools do
not support class methods, so it is not possible to use these as a way of
implementing a full consistency check. We will investigate putting a flag on
a read request or on the pool, or implementing a new request type to perform
a full consistency check on an object, and look at making extensions to the
rados CLI to be able to perform these tests. See also the discussion on deep
scrub below.

When there is more than one coding parity and there is an inconsistency
between the data and the coding parities it is useful to try and analyze the
cause of the inconsistency. Because the multiple coding parities are
providing redundancy, there can be multiple ways of reconstructing each chunk
and this can be used to detect the most likely cause of the inconsistency.
For example with a 4+2 erasure coding and a dropped write to the 1st data
OSD, the stripe (all 6 OSDs) will be inconsistent, as will be any selection
of 5 OSDs that includes the 1st data OSD, but data OSDs 2, 3 and 4 and the
two coding parity OSDs will still be consistent. While there are many ways a
stripe could get into this state, a tool could conclude that the most likely
cause is a missed update to OSD 1. Ceph does not have a tool to perform this
type of analysis, but it should be easy to extend ceph-erasure-code-tool.

Teuthology seems to have adequate tools for taking OSDs offline and bringing
them back online again. There are a few tools for injecting read I/O errors
(without taking an OSD offline) but there is scope to improve these
(e.g. the ability to specify a particular offset in an object that will fail
a read, and more controls over setting and deleting error inject sites).

The general philosophy of teuthology seems to be to randomly inject faults
and simply through brute force get sufficient coverage of all the error
paths. This is a good approach for CI testing; however, when EC code paths
become complex and require multiple errors to occur with precise timings to
cause a particular code path to execute, it becomes hard to get coverage
without running the tests for a very long time. There are some standalone
tests for EC which do test some of the multiple failure paths, but these
tests perform very limited amounts of I/O and don't inject failures while
there are I/Os in flight, so they miss some of the interesting scenarios.

To deal with these more complex error paths we propose developing a new type
of thrasher for erasure coding that injects a sequence of errors and makes
use of debug hooks to capture and delay I/O requests at particular points to
ensure an error inject hits a particular timing window. To do this we will
extend the tell osd command to include extra interfaces to inject errors and
capture and stall I/Os at specific points.

Some parts of erasure coding such as the plugins are standalone bits of code
which can be tested with unit tests. There are already some unit tests and
performance benchmark tools for erasure coding; we will look to extend these
to get further coverage of code that can be run standalone.

I/O path changes
----------------

Avoid unnecessary reads and writes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The current code reads too much data for read and overwrite I/Os. For
overwrites it will also rewrite unmodified data. This occurs because reads
and overwrites are rounded up to full-stripe operations. This isn't a problem
when data is mainly being accessed sequentially, but it is very wasteful for
random I/O operations. The code can be changed to only read/write the
necessary shards. To allow the code to efficiently support larger chunk
sizes, I/Os should be rounded to page-sized I/Os instead of chunk-sized I/Os.

The first simple set of optimizations eliminates unnecessary reads and
unnecessary writes of data, but retains writes of metadata on all shards.
This avoids breaking the current design which depends on all shards receiving
a metadata update for every transaction. When changes to the metadata
handling are completed (see below) then it will be possible to make further
optimizations to reduce the number of metadata updates for additional
savings.

Parity-delta-write
^^^^^^^^^^^^^^^^^^

The current code implements overwrites by performing a full-stripe read,
merging the overwritten data, calculating new coding parities and performing
a full-stripe write. Reading and writing every shard is expensive; there are
a number of optimizations that can be applied to speed this up. For a K+M
configuration where M is small, it is often less work to perform a
parity-delta-write. This is implemented by reading the old data that is about
to be overwritten and XORing it with the new data to create a delta. The
coding parities can then be read, updated to apply the delta and
re-written. With M=2 (RAID6) this can result in just 3 reads and 3 writes to
perform an overwrite of less than one chunk.

Note that where a large fraction of the data in the stripe is being updated,
this technique can result in more work than performing a partial overwrite;
however, if both update techniques are supported, it is fairly easy to
calculate for a given I/O offset and length which is the optimal technique to
use.

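A rough sketch of that calculation, counting whole-shard reads and writes (a
simplified cost model for illustration only; the real decision may weigh
additional factors such as message counts and cached data):

.. code-block:: python

    def shard_ops(k: int, m: int, modified: int):
        # Parity-delta-write: read and write each modified data shard plus
        # all M coding parities.
        pdw = 2 * (modified + m)
        # Partial overwrite: read the unmodified data shards, then write the
        # modified data shards and all M coding parities.
        overwrite = (k - modified) + (modified + m)
        return pdw, overwrite

    def choose_technique(k: int, m: int, modified: int) -> str:
        pdw, overwrite = shard_ops(k, m, modified)
        return "parity-delta-write" if pdw < overwrite else "partial overwrite"

    # K=10, M=2: one modified chunk favours the delta path (6 ops vs 12),
    # nine modified chunks favour the partial overwrite (22 ops vs 12).
    print(choose_technique(10, 2, 1), choose_technique(10, 2, 9))
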

Write I/Os submitted to the Primary OSD will perform this calculation to
decide whether to use a full-stripe update or a parity-delta-write. Note that
if read failures are encountered while performing a parity-delta-write and it
is necessary to reconstruct data or a coding parity then it will be more
efficient to switch to performing a full-stripe read, merge and write.

Not all erasure codings and erasure coding libraries support the capability
of performing delta updates; however, those implemented using XOR and/or GF
arithmetic should. We have checked jerasure and isa-l and confirmed that they
support this feature, although the necessary APIs are not currently exposed
by the plugins. For some erasure codes such as clay and lrc it may be
possible to apply delta updates, but the delta may need to be applied in so
many places that this makes it a worthless optimization. This proposal
suggests that parity-delta-write optimizations are initially implemented only
for the most commonly used erasure codings. Erasure code plugins will provide
a new flag indicating whether they support the new interfaces needed to
perform delta updates.

Direct reads
^^^^^^^^^^^^

Read I/Os are currently directed to the primary OSD which then issues reads
to other shards. To reduce I/O latency and network bandwidth it would be
better if clients could issue direct read requests to the OSD storing the
data, rather than via the primary. There are a few error scenarios where the
client may still need to fall back to submitting reads to the primary; a
secondary OSD will have the option of failing a direct read with -EAGAIN to
request that the client retries the request to the primary OSD.

Direct reads will always be for <= one chunk. For reads of more than one
chunk the client can issue direct reads to multiple OSDs; however, these will
no longer be guaranteed to be atomic because an update (write) may be applied
in between the separate read requests. If a client needs atomicity guarantees
they will need to continue to send the read to the primary.

Direct reads will be failed with EAGAIN where a reconstruct and decode
operation is required to return the data. This means only reads to the
primary OSD will need to handle the reconstruct code path. When an OSD is
backfilling we don't want the client to have large quantities of I/O failed
with EAGAIN, therefore we will make the client detect this situation and
avoid issuing direct I/Os to a backfilling OSD.

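The following is a minimal sketch of the client-side policy this implies; the
constants and names are illustrative and not the actual client (Objecter)
code:

.. code-block:: python

    CHUNK = 64 * 1024   # chunk size
    K = 4               # data shards

    def plan_read(offset: int, length: int, need_atomic: bool,
                  backfilling: set = frozenset()):
        """Split a read into per-chunk direct reads, falling back to the
        primary when atomicity is needed or a target shard is backfilling."""
        if need_atomic:
            # Atomicity (or an up-to-date version/mtime) => use the primary.
            return [("primary", offset, length)]
        requests = []
        first = offset // CHUNK
        last = (offset + length - 1) // CHUNK
        for chunk in range(first, last + 1):
            start = max(offset, chunk * CHUNK)
            end = min(offset + length, (chunk + 1) * CHUNK)
            shard = chunk % K
            # Avoid backfilling shards; a shard may also reply -EAGAIN (for
            # example if a reconstruct is needed), in which case the client
            # resends that read to the primary.
            target = "primary" if shard in backfilling else f"shard {shard}"
            requests.append((target, start, end - start))
        return requests

    # A 4 KiB read inside one chunk becomes a single direct read:
    print(plan_read(200 * 1024, 4096, need_atomic=False))
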

For backwards compatibility, for client requests that cannot cope with the
reduced guarantees of a direct read, and for scenarios where the direct read
would be to an OSD that is absent or backfilling, reads directed to the
primary OSD will still be supported.

Direct writes
^^^^^^^^^^^^^

Write I/Os are currently directed to the primary OSD which then updates the
other shards. To reduce latency and network bandwidth it would be better if
clients could direct small overwrite requests directly to the OSD storing the
data, rather than via the primary. For larger write I/Os and for error
scenarios and abnormal cases clients will continue to submit write I/Os to
the primary OSD.

Direct writes will always be for <= one chunk and will use the
parity-delta-write technique to perform the update. For medium sized writes a
client may issue direct writes to multiple OSDs, but such updates will no
longer be guaranteed to be atomic. If a client requires atomicity for a
larger write they will need to continue to send it to the primary.

For backwards compatibility, and for scenarios where the direct write would
be to an OSD that is absent, writes directed to the primary OSD will still be
supported.

I/O serialization, recovery/backfill and other error scenarios
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

Direct writes look fairly simple until you start considering all the abnormal
scenarios. The current implementation of processing all writes on the Primary
OSD means that there is one central point of control for the stripe that can
manage things like the ordering of multiple inflight I/Os to the same stripe,
ensuring that recovery/backfill for an object has been completed before it is
accessed, and assigning the object version number and modification time.

With direct I/Os these become distributed problems. Our approach is to send a
control path message to the Primary OSD and let it continue to be the central
point of control. The Primary OSD will issue a reply when the OSD can start
the direct write and will be informed with another message when the I/O has
completed. See the section below on metadata updates for more details.

Stripe cache
^^^^^^^^^^^^

Erasure code pools maintain a stripe cache which stores shard data while
updates are in progress. This is required to allow writes and reads to the
same stripe to be processed in parallel. For small sequential write workloads
and for extreme hot spots (e.g. where the same block is repeatedly
re-written for some kind of crude checkpointing mechanism) there would be a
benefit in keeping the stripe cache slightly longer than the duration of the
I/O. In particular, the coding parities are typically read and written for
every update to a stripe. There is obviously a balancing act to achieve
between keeping the cache long enough that it reduces the overheads for
future I/Os versus the memory overheads of storing this data. A small (MiB as
opposed to GiB sized) cache should be sufficient for most workloads. The
stripe cache can also help reduce latency for direct write I/Os by allowing
prefetch I/Os to read old data and coding parities ready for later parts of
the write operation without requiring more complex interlocks.

The stripe cache is less important when the default chunk size is small
(e.g. 4K), because even with small write I/O requests there will not be many
sequential updates to fill a stripe. With a larger chunk size (e.g. 64K) the
benefits of a good stripe cache become more significant because the stripe
size will be hundreds of KiB to a small number of MiB, and hence it becomes
much more likely that a sequential workload will issue many I/Os to the same
stripe.

Automatically choose chunk size
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The default chunk size of 4K is good for small objects because the data and
coding parities are rounded up to whole chunks and because if an object has
less than one data stripe of data then the capacity overheads for the coding
parities are higher (e.g. a 4K object in a 10+2 erasure coded pool has 4K of
data and 8K of coding parity, so there is a 200% overhead). However, the
optimizations above all provide much bigger savings if the typical random
access I/O only reads or writes a single shard. This means that, so long as
objects are big enough, a larger chunk size such as 64K would be better.

Whilst the user can try and predict what their typical object size will be
and choose an appropriate chunk size, it would be better if the code could
automatically select a small chunk size for small objects and a larger chunk
size for larger objects. There will always be scenarios where an object grows
(or is truncated) and the chosen chunk size becomes inappropriate; however,
reading and re-writing the object with a new chunk size when this happens
won't have that much performance impact. This also means that the chunk size
can be deduced from the object size in object_info_t, which is read before
the object's data is read/modified. Clients already provide a hint as to the
object size when creating the object, so this could be used to select a chunk
size to reduce the likelihood of having to re-stripe an object.

The thought is to support a new chunk size of auto/variable to enable this
feature. It will only be applicable to newly created pools; there will be no
way to migrate an existing pool.

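A minimal sketch of how a chunk size could be derived from the object size
hint (the candidate sizes and thresholds here are illustrative, not a
proposed policy):

.. code-block:: python

    K = 4
    CHUNK_SIZES = [4096, 65536, 262144]          # 4K, 64K, 256K

    def choose_chunk_size(object_size_hint: int, k: int = K) -> int:
        # Use the largest chunk size at which the object would still fill at
        # least one stripe; small objects keep the small 4K chunk.
        for chunk in reversed(CHUNK_SIZES):
            if object_size_hint >= k * chunk:
                return chunk
        return CHUNK_SIZES[0]

    print(choose_chunk_size(16 * 1024))         # 4096   - small object
    print(choose_chunk_size(4 * 1024 * 1024))   # 262144 - large object
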

Deep scrub support
^^^^^^^^^^^^^^^^^^

EC pools with overwrite do not check CRCs because it is too costly to update
the CRC for the object on every overwrite; instead the code relies on
BlueStore to maintain and check CRCs. When an EC pool is operating with
overwrite disabled a CRC is kept for each shard, because it is possible to
update CRCs as the object is appended to just by calculating a CRC for the
new data being appended and then doing a simple (quick) calculation to
combine the old and new CRC together.

dev/osd_internals/erasure_coding/proposals.rst discusses the possibility
of keeping CRCs at a finer granularity (for example per chunk), storing these
either as an xattr or an omap (omap is more suitable as large objects could
end up with a lot of CRC metadata) and updating these CRCs when data is
overwritten (the update would need to perform a read-modify-write at the same
granularity as the CRC). These finer granularity CRCs can then easily be
combined to produce a CRC for the whole shard or even the whole erasure coded
object.

This proposal suggests going in the opposite direction - EC overwrite pools
have survived without CRCs and relied on BlueStore up until now, so why is
this feature needed? The current code doesn't check CRCs if overwrite is
enabled, but sadly still calculates and updates a CRC in the hinfo xattr,
even when performing overwrites, which means that the calculated value will
be garbage. This means we pay all the overheads of calculating the CRC and
get no benefits.

The code can easily be fixed so that CRCs are calculated and maintained when
objects are written sequentially, but as soon as the first overwrite to an
object occurs the hinfo xattr will be discarded and CRCs will no longer be
calculated or checked. This will improve performance when objects are
overwritten, and will improve data integrity in cases where they are not.

While the thought is to abandon EC storing CRCs in objects being overwritten,
there is an improvement that can be made to deep scrub. Currently deep scrub
of an EC pool with overwrite just checks that every shard can read the
object; there is no check to verify that the copies on the shards are
consistent. A full consistency check would require large data transfers
between the shards so that the coding parities could be recalculated and
compared with the stored versions; in most cases this would be unacceptably
slow. However, for many erasure codes (including the default ones used by
Ceph) if the contents of a chunk are XOR'd together to produce a longitudinal
summary value, then an encoding of the longitudinal summary values of each
data shard should produce the same longitudinal summary values as are stored
by the coding parity shards. This comparison is less expensive than the CRC
checks performed by replication pools. There is a risk that, by XORing the
contents of a chunk together, a set of corruptions cancels out, but this
level of check is better than no check and will be very successful at
detecting a dropped write, which will be the most common type of corruption.

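The check works because the codes in question are linear, so XOR-ing a chunk
down to a short summary commutes with the encoding. A minimal, self-contained
sketch (``xor_encode`` stands in for the pool's erasure-code plugin; here it
is a single XOR parity purely for demonstration):

.. code-block:: python

    from functools import reduce
    from operator import xor
    import os

    def longitudinal_summary(chunk: bytes, width: int = 16) -> bytes:
        """XOR the chunk down to a `width`-byte summary."""
        acc = bytearray(width)
        for i in range(0, len(chunk), width):
            piece = chunk[i:i + width].ljust(width, b"\0")
            for j in range(width):
                acc[j] ^= piece[j]
        return bytes(acc)

    def xor_encode(chunks):
        # Stand-in for the plugin's encode(): a single XOR parity (M=1).
        return [bytes(reduce(xor, col) for col in zip(*chunks))]

    def scrub_stripe(data_chunks, parity_chunks, encode) -> bool:
        data_sums = [longitudinal_summary(c) for c in data_chunks]
        parity_sums = [longitudinal_summary(c) for c in parity_chunks]
        return encode(data_sums) == parity_sums

    data = [os.urandom(4096) for _ in range(4)]
    parity = xor_encode(data)
    print(scrub_stripe(data, parity, xor_encode))   # True  - consistent
    data[0] = os.urandom(4096)                      # simulate a dropped write
    print(scrub_stripe(data, parity, xor_encode))   # False - detected
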

Metadata changes
----------------

What metadata do we need to consider?

1. object_info_t. Every Ceph object has some metadata stored in the
   object_info_t data structure. Some of these fields (e.g. object length)
   are not updated frequently and we can simply avoid performing partial
   write optimizations when these fields need updating. The more problematic
   fields are the version numbers and the last modification time, which are
   updated on every write. Version numbers of objects are compared to version
   numbers in PG log entries for peering/recovery and with version numbers on
   other shards for backfill. Version numbers and modification times can be
   read by clients.

2. PG log entries. The PG log is used to track inflight transactions and to
   allow incomplete transactions to be rolled forward/backwards after an
   outage/network glitch. The PG log is also used to detect and resolve
   duplicate requests (e.g. resent due to network glitch) from
   clients. Peering currently assumes that every shard has a copy of the log
   and that this is updated for every transaction.

3. PG stats entries and other PG metadata. There is other PG metadata (PG
   stats is the simplest example) that gets updated on every
   transaction. Currently all OSDs retain a cached and a persistent copy of
   this metadata.

How many copies of metadata are required?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The current implementation keeps K+M replicated copies of metadata, one copy
on each shard. The minimum number of copies that need to be kept to support
up to M failures is M+1. In theory metadata could be erasure encoded;
however, given that it is small, it is probably not worth the effort. One
advantage of keeping K+M replicated copies of the metadata is that any fully
in-sync shard can read the local copy of metadata, avoiding the need for
inter-OSD messages and asynchronous code paths. Specifically this means that
any OSD not performing backfill can become the primary and can access
metadata such as object_info_t locally.

M+1 arbitrarily distributed copies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A partial write to one data shard will always involve updates to the data
shard and all M coding parity shards; therefore, for optimal performance, it
would be ideal if the same M+1 shards are updated to track the associated
metadata update. This means that for small random writes a different M+1
shards would get updated for each write. The drawback of this approach is
that you might need to read K shards to find the most up to date version of
the metadata.

In this design no shard will have an up to date copy of the metadata for
every object. This means that whatever shard is picked to be the acting
primary, it may not have all the metadata available locally and may need to
send messages to other OSDs to read it. This would add significant extra
complexity to the PG code and cause divergence between erasure coded pools
and replicated pools. For these reasons we discount this design option.

M+1 copies on known shards
^^^^^^^^^^^^^^^^^^^^^^^^^^

The next best performance can be achieved by always applying metadata updates
to the same M+1 shards, for example choosing the 1st data shard and all M
coding parity shards. Coding parity shards will get updated by every partial
write so this will result in zero or one extra shard being updated. With this
approach only 1 shard needs to be read to find the most up to date version of
the metadata.

We can restrict the acting primary to be one of the M+1 shards, which means
that once any incomplete updates in the log have been resolved the primary
will have an up to date local copy of all the metadata; this means that much
more of the PG code can be kept unchanged.

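For illustration, with the convention suggested above (the first data shard
plus the M coding parity shards, using a hypothetical shard numbering in
which the parity shards follow the data shards) the set of metadata shards is
fixed and easy to compute:

.. code-block:: python

    def metadata_shards(k: int, m: int) -> set:
        """Shards that receive a metadata update for every partial write."""
        return {0} | {k + i for i in range(m)}

    print(metadata_shards(4, 2))   # {0, 4, 5}
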

Partial Writes and the PG log
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Peering currently assumes that every shard has a copy of the log; however,
because of inflight updates and short-term absences, it is possible that some
shards are missing some of the log entries. The job of peering is to combine
the logs from the set of present shards to form a definitive log of
transactions that have been committed by all the shards. Any discrepancies
between a shard's log and the definitive log are then resolved, typically by
rolling backwards transactions (using information held in the log entry) so
that all the shards are in a consistent state.

To support partial writes the log entry needs to be modified to include the
set of shards that are being updated. Peering needs to be modified to
consider a log entry as missing from a shard only if a copy of the log entry
on another shard indicates that this shard was meant to be updated.

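A minimal sketch of that peering predicate, assuming the log entry gains a
set of the shards it was written to (the field and function names are
illustrative):

.. code-block:: python

    from dataclasses import dataclass, field

    @dataclass
    class LogEntry:
        version: tuple                                     # (epoch, version)
        written_shards: set = field(default_factory=set)   # empty == all

    def entry_missing_on(entry: LogEntry, shard: int, have_entry: bool) -> bool:
        """A shard only 'misses' an entry it was meant to receive."""
        expected = not entry.written_shards or shard in entry.written_shards
        return expected and not have_entry

    e = LogEntry(version=(45, 1203), written_shards={0, 2, 4, 5})
    print(entry_missing_on(e, shard=1, have_entry=False))  # False - skipped
    print(entry_missing_on(e, shard=2, have_entry=False))  # True  - missing
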

The logs are not infinite in size, and old log entries where it is known that
the update has been successfully committed on all affected shards are
trimmed. Log entries are first condensed to a pg_log_dup_t entry which can no
longer assist in rollback of a transaction but can still be used to detect
duplicated client requests, and then later completely discarded. Log trimming
is performed at the same time as adding a new log entry, typically when a
future write updates the log. With partial writes log trimming will only
occur on shards that receive updates, which means that some shards may have
stale log entries that should have been discarded.

TBD: I think the code can already cope with discrepancies in log trimming
between the shards. Clearly an in flight trim operation may not have
completed on every shard so small discrepancies can be dealt with, but I
think an absent OSD can cause larger discrepancies. I believe that this is
resolved during Peering, with each OSD keeping a record of what the oldest
log entry should be and this gets shared between OSDs so that they can work
out stale log entries that were trimmed in absentia. Hopefully this means
that only sending log trimming updates to shards that are creating new log
entries will work without code changes.

Backfill
^^^^^^^^

Backfill is used to correct inconsistencies between OSDs that occur when an
OSD is absent for a longer period of time and the PG log entries have been
trimmed. Backfill works by comparing object versions between shards. If some
shards have out of date versions of an object then a reconstruct is performed
by the backfill process to update the shard. If the version numbers on
objects are not updated on all shards then this will break the backfill
process and cause a huge amount of unnecessary reconstruct work. This is
unacceptable, in particular for the scenario where an OSD is just absent for
maintenance for a relatively short time with noout set. The requirement is to
be able to minimize the amount of reconstruct work needed to complete a
backfill.

dev/osd_internals/erasure_coding/proposals.rst discusses the idea of
each shard storing a vector of version numbers that records the most recent
update that the pair <this shard, other shard> both should have participated
in. By collecting this information from at least M shards it is possible to
work out what the expected minimum version number should be for an object on
a shard and hence deduce whether a backfill is required to update the
object. The drawback of this approach is that backfill will need to scan M
shards to collect this information, compared with the current implementation
that only scans the primary and the shard(s) being backfilled.

With the additional constraint that a known M+1 shards will always be updated
and that the (acting) primary will be one of these shards, it will be
possible to determine whether a backfill is required just by examining the
vector on the primary and the object version on the shard being backfilled.
If the backfill target is one of the M+1 shards the existing version number
comparison is sufficient; if it is another shard then the version in the
vector on the primary needs to be compared with the version on the backfill
target. This means that backfill does not have to scan any more shards than
it currently does; however, the scan of the primary does need to read the
vector, and if there are multiple backfill targets then it may need to store
multiple entries of the vector per object, increasing memory usage during the
backfill.

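A minimal sketch of that comparison (the field names and the object-info
representation are illustrative):

.. code-block:: python

    def needs_backfill(primary_oi: dict, target_shard: int,
                       target_version: tuple, metadata_shards: set) -> bool:
        if target_shard in metadata_shards:
            # One of the M+1 always-updated shards: plain version comparison.
            expected = primary_oi["version"]
        else:
            # One of the K-1 shards that partial writes may skip: compare
            # against the expected version recorded in the vector on the
            # primary, falling back to the object version if no partial
            # write has ever happened.
            expected = primary_oi.get("shard_versions", {}).get(
                target_shard, primary_oi["version"])
        return target_version != expected

    oi = {"version": (120, 7), "shard_versions": {2: (118, 3)}}
    print(needs_backfill(oi, 2, (118, 3), metadata_shards={0, 4, 5}))  # False
    print(needs_backfill(oi, 1, (118, 3), metadata_shards={0, 4, 5}))  # True
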

There is only a requirement to keep the vector on the M+1 shards, and the
vector only needs K-1 entries because we only need to track version number
differences between any of the M+1 shards (which should have the same
version) and each of the K-1 shards (which can have a stale version number).
This will slightly reduce the amount of extra metadata required. The vector
of version numbers could be stored in the object_info_t structure or stored
as a separate attribute.

Our preference is to store the vector in the object_info_t structure because
typically both are accessed together, and because this makes it easier to
cache both in the same object cache. We will keep metadata and memory
overheads low by only storing the vector when it is needed.

Care is required to ensure that existing clusters can be upgraded. The
absence of the vector of version numbers implies that an object has never had
a partial update and therefore all shards are expected to have the same
version number for the object and the existing backfill algorithm can be
used.

Code references
"""""""""""""""

PrimaryLogPG::scan_range - this function creates a map of objects and their
version numbers; on the primary it tries to get this information from the
object cache, otherwise it reads the OI attribute. This will need changes to
deal with the vectors. To conserve memory it will need to be provided with
the set of backfill targets so it can select which part of the vector to
keep.

PrimaryLogPG::recover_backfill - this function calls scan_range for the local
(primary) shard and sends MOSDPGScan to the backfill targets to get them to
perform the same scan. Once it has collected all the version numbers it
compares the primary and backfill targets to work out which objects need to
be recovered. This will also need changes to deal with the vectors when
comparing version numbers.

PGBackend::run_recovery_op - recovers a single object. For an EC pool this
involves reconstructing the data for the shards that need backfilling (read
other shards and use decode to recover). This code shouldn't need any
changes.

Version number and last modification time for clients
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Clients can read the object version number and set expectations about what
the minimum version number is when making updates. Clients can also read the
last modification time. There are use cases where it is important that these
values can be read and give consistent results, but there is also a large
number of scenarios where this information is not required.

If the object version number is only being updated on the known M+1 shards
for partial writes, then where this information is required it will need to
involve a metadata access to one of those shards. We have arranged for the
primary to be one of the M+1 shards, so I/Os submitted to the primary will
always have access to the up to date information.

Direct write I/Os need to update the M+1 shards, so it is not difficult to
also return this information to the client when completing the I/O.

Direct read I/Os are the problem case; these will only access the local shard
and will not necessarily have access to the latest version and modification
time. For simplicity we will require clients that require this information to
send requests to the primary rather than using the direct I/O
optimization. Where a client does not need this information they can use the
direct I/O optimizations.

The direct read I/O optimization will still return a (potentially stale)
object version number. This may still be of use to clients to help understand
the ordering of I/Os to a chunk.

Direct Write with Metadata updates
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Here's the full picture of what a direct write performing a parity-delta-write
looks like with all the control messages:

.. ditaa::

RADOS Client
|
||
|
||
| ^
|
||
| |
|
||
1 28
|
||
+-----+ | |
|
||
| |<------27-------+--+
|
||
| | | |
|
||
| | +-------------|->|
|
||
| | | | |
|
||
| |<-|----2--------+ |<--------------------------------------------+
|
||
| Seq | | | |<----------------------------+ |
|
||
| | | +----3---+ | | |
|
||
| | | | +--|-----------------------5-----|---+ |
|
||
| | | | +--|-------4---------+ | | |
|
||
| +--|-10-|------->| | | | | |
|
||
| | | | +---+ | | | | |
|
||
| | | | | | | | | | |
|
||
| | | | v | | | | | |
|
||
+----++ | | +---+ | | | | | |
|
||
^ | | | |XOR+-|--|----------15-----|-----------|---|-----+ |
|
||
| | | | |13 +-|--|-------14--------|-----+ | | | |
|
||
| | | | +---+ | | | | | | | |
|
||
| | | | ^ | | | v | | v |
|
||
| | | | | | | | +------+ | | +------+ |
|
||
6 11 | | | | | | | XOR | | | | GF | |
|
||
| | | | | | | | | 18 | | | | 21 | |
|
||
| | | | 12 16 | | +----+-+ | | +----+-+ |
|
||
| | | | | | | | ^ | | | ^ | |
|
||
| | | | | | | | | | | | | | |
|
||
| | | | | | | | 17 19 | | 20 22 |
|
||
| | | | | | | | | | | | | | |
|
||
| | | | | v | | | v | | | v |
|
||
| | | | +-+----+ | | +-+----+ | | +-+----+ |
|
||
| | | +->|Extent| | +->|Extent| | +->|Extent| |
|
||
| | 23 |Cache | 24 |Cache | 25 |Cache | 26
|
||
| | | +----+-+ | +----+-+ | +----+-+ |
|
||
| | | ^ | | ^ | | ^ | |
|
||
| | | | | | | | | | | |
|
||
| +---+ 7 +---+ 8 +---+ 9 +---+
|
||
| | | | | | | |
|
||
| v | v | v | v
|
||
.-----. .-----. .-----. .-----. .-----. .-----.
|
||
( ) ( ) ( ) ( ) ( ) ( )
|
||
|`-----'| |`-----'| |`-----'| |`-----'| |`-----'| |`-----'|
|
||
| | | | | | | | | | | |
|
||
| | | | | | | | | | | |
|
||
( ) ( ) ( ) ( ) ( ) ( )
|
||
`-----' `-----' `-----' `-----' `-----' `-----'
|
||
Primary OSD 2 OSD 3 OSD 4 OSD P OSD Q
|
||
OSD
|
||
|
||
* Xattr * No Xattr * Xattr * Xattr
|
||
* OI * Stale OI * OI * OI
|
||
* PG log * Partial PG log * PG log * PG log
|
||
* PG stats * No PG stats * PG stats * PG stats

Note: Only the primary OSD and the coding parity OSDs (the M+1 shards) have
the Xattr, up to date object info, PG log and PG stats. Only one of these
OSDs is permitted to become the (acting) primary. The other data OSDs 2, 3
and 4 (the K-1 shards) do not have Xattrs or PG stats, may have stale object
info and only have PG log entries for their own updates. OSDs 2, 3 and 4 may
have stale OI with an old version number. The other OSDs have the latest OI
and a vector with the expected version numbers for OSDs 2, 3 and 4.

1. Data message with Write I/O from client (MOSDOp)
2. Control message to Primary with Xattr (new msg MOSDEcSubOpSequence)

Note: the primary needs to be told about any xattr update so it can update
its copy, but the main purpose of this message is to allow the primary to
sequence the write I/O. The reply message at step 10 is what allows the write
to start and provides the PG stats and new object info including the new
version number. If necessary the primary can delay this to ensure that
recovery/backfill of the object is completed first and to deal with
overlapping writes. Data may be read (prefetched) before the reply, but
obviously no transactions can start.

3. Prefetch request to local extent cache
4. Control message to P to prefetch to extent cache (new msg
   MOSDEcSubOpPrefetch equivalent of MOSDEcSubOpRead)
5. Control message to Q to prefetch to extent cache (new msg
   MOSDEcSubOpPrefetch equivalent of MOSDEcSubOpRead)
6. Primary reads object info
7. Prefetch old data
8. Prefetch old P
9. Prefetch old Q

Note: The objective of these prefetches is to get the old data, P and Q reads
started as quickly as possible to reduce the latency of the whole I/O. There
may be error scenarios where the extent cache is not able to retain this data
and it will need to be re-read. This includes the rare/pathological scenarios
where there is a mixture of writes sent to the primary and writes sent
directly to the data OSD for the same object.

10. Control message to data OSD with new object info + PG stats (new msg
    MOSDEcSubOpSequenceReply)
11. Transaction to update object info + PG log + PG stats
12. Fetch old data (hopefully cached)

Note: For best performance we want to pipeline writes to the same stripe. The
primary assigns the version number to each write and consequently defines the
order in which writes should be processed. It is important that the data
shard and the coding parity shards apply overlapping writes in the same
order. The primary knows what set of writes are in flight so it can detect
this situation and indicate in its reply message at step 10 that an update
must wait until an earlier update has been applied. This information needs to
be forwarded to the coding parities (steps 14 and 15) so they can also ensure
updates are applied in the same order.

13. XOR new and old data to create delta
14. Data message to P with delta + Xattr + object info + PG log + PG stats
    (new msg MOSDEcSubOpDelta equivalent of MOSDEcSubOpWrite)
15. Data message to Q with delta + Xattr + object info + PG log + PG stats
    (new msg MOSDEcSubOpDelta equivalent of MOSDEcSubOpWrite)
16. Transaction to update data + object info + PG log
17. Fetch old P (hopefully cached)
18. XOR delta and old P to create new P
19. Transaction to update P + Xattr + object info + PG log + PG stats
20. Fetch old Q (hopefully cached)
21. XOR delta and old Q to create new Q
22. Transaction to update Q + Xattr + object info + PG log + PG stats
23. Control message to data OSD for commit (new msg MOSDEcSubOpDeltaReply
    equivalent of MOSDEcSubOpWriteReply)
24. Local commit notification
25. Control message to data OSD for commit (new msg MOSDEcSubOpDeltaReply
    equivalent of MOSDEcSubOpWriteReply)
26. Control message to data OSD for commit (new msg MOSDEcSubOpDeltaReply
    equivalent of MOSDEcSubOpWriteReply)
27. Control message to Primary to signal end of write (variant of new msg
    MOSDEcSubOpSequence)
28. Control message reply to client (MOSDOpReply)

Upgrade and backwards compatibility
-----------------------------------

A few of the optimizations can be made just by changing code on the primary
OSD with no backwards compatibility concerns regarding clients or the other
OSDs. These optimizations will be enabled as soon as the primary OSD upgrades
and will replace the existing code paths.

The remainder of the changes will be new I/O code paths that will exist
alongside the existing code paths.

Similar to EC overwrites, many of the changes will need to ensure that all
OSDs are running new code and that the EC plugins support the new interfaces
required for parity-delta-writes. A new pool level flag will be required to
enforce this. It will be possible to enable this flag (and hence enable the
new performance optimizations) after upgrading an existing cluster. Once set
it will not be possible to add down level OSDs to the pool. It will not be
possible to turn this flag off other than by deleting the pool. Downgrade is
not supported because:

1. It is not trivial to quiesce all I/O to a pool to ensure that none of the
   new I/O code paths are in use when the flag is cleared.

2. The PG log format for new I/Os will not be understood by down level
   OSDs. It would be necessary to ensure the log has been trimmed of all new
   format entries before clearing the flag to ensure that down level OSDs
   will be able to interpret the log.

3. Additional xattr data will be stored by the new I/O code paths and used by
   backfill. Down level code will not understand how to backfill a pool that
   has been running the new I/O paths and will get confused by the
   inconsistent object version numbers. While it is theoretically possible to
   disable partial updates and then scan and update all the metadata to
   return the pool to a state where a downgrade is possible, we have no
   intention of writing this code.

The direct I/O changes will additionally require clients to be running new
code. These will require that the pool has the new flag set and that a new
client is used. Old clients can use pools with the new flag set, just without
the direct I/O optimization.

Not under consideration
-----------------------

There is a list of enhancements discussed in
doc/dev/osd_internals/erasure_coding/proposals.rst; the following are not
under consideration:

1. RADOS Client Acknowledgement Generation optimization

When updating K+M shards in an erasure coded pool, in theory you don't have
to wait for all the updates to complete before completing the update to the
client, because so long as K updates have completed any viable subset of
shards should be able to roll forward the update.

For partial writes where only M+1 shards are updated this optimization does
not apply, as all M+1 updates need to complete before the update is completed
to the client.

This optimization would require changes to the peering code to work out
whether partially completed updates need to be rolled forwards or
backwards. To roll an update forwards it would be simplest to mark the object
as missing and use the recovery path to reconstruct and push the update to
OSDs that are behind.

2. Avoid sending read request to local OSD via Messenger

The EC backend code has an optimization for writes to the local OSD which
avoids sending a message and reply via the messenger. The equivalent
optimization could be made for reads as well, although a bit more care is
required because the read is synchronous and will block the thread waiting
for the I/O to complete.

Pull request https://github.com/ceph/ceph/pull/57237 is making this
optimization.

Stories
=======

This is our high level breakdown of the work. Our intention is to deliver this
work as a series of PRs. The stories are roughly in the order we plan to
develop them. Each story is at least one PR; where possible they will be
broken up further. The earlier stories can be implemented as standalone pieces
of work and will not introduce upgrade/backwards compatibility issues. The
later stories will start breaking backwards compatibility; here we plan to add
a new flag to the pool to enable these new features. Initially this will be an
experimental flag while the later stories are developed.

Test tools - enhanced I/O generator for testing erasure coding
--------------------------------------------------------------

* Extend rados bench to be able to generate more interesting patterns of I/O
  for erasure coding, in particular reading and writing at different offsets
  and for different lengths and making sure we get good coverage of boundary
  conditions such as the sub-chunk size, chunk size and stripe size
* Improve data integrity checking by using a seed to generate data patterns
  and remembering which seed is used for each block that is written so that
  data can later be validated

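The seeded data-pattern idea above could work along these lines (illustrative
only, not the rados bench implementation): each block's contents are derived
from a recorded seed so that the block can be regenerated and compared when it
is read back.

.. code-block:: cpp

    #include <cstdint>
    #include <random>
    #include <vector>

    // Generate the expected contents of a block from its seed.  If the seed
    // used for each written block is recorded, a later read can be validated
    // by regenerating the pattern and comparing it with the returned data.
    std::vector<uint8_t> make_pattern(uint64_t seed, size_t len) {
      std::mt19937_64 rng(seed);
      std::vector<uint8_t> buf(len);
      for (auto& b : buf) {
        b = static_cast<uint8_t>(rng());
      }
      return buf;
    }

    bool validate_block(uint64_t recorded_seed,
                        const std::vector<uint8_t>& read_data) {
      return read_data == make_pattern(recorded_seed, read_data.size());
    }
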
Test tools - offline consistency checking tool
----------------------------------------------

* Test tools for performing offline consistency checks combining use of
  objectstore_tool with ceph-erasure-code-tool
* Enhance some of the teuthology standalone erasure code checks to use this
  tool

Test tools - online consistency checking tool
---------------------------------------------

* New CLI to be able to perform online consistency checking for an object or a
  range of objects that reads all the data and coding parity shards and
  re-encodes the data to validate the coding parities

Switch from JErasure to ISA-L
------------------------------

The JErasure library has not been updated since 2014; the ISA-L library is
maintained and exploits newer instruction sets (e.g. AVX512, AVX2), which
provides faster encoding/decoding.

* Change defaults to ISA-L in upstream ceph
* Benchmark JErasure and ISA-L
* Refactor Ceph isa_encode region_xor() to use AVX when M=1
* Documentation updates
* Present results at performance weekly

Sub Stripe Reads
----------------

Ceph currently reads an integer number of stripes and discards unneeded
data. In particular for small random reads it will be more efficient to just
read the required data.

* Help finish Pull Request https://github.com/ceph/ceph/pull/55196 if not
  already complete
* Further changes to issue sub-chunk reads rather than full-chunk reads

Simple Optimizations to Overwrite
---------------------------------

Ceph overwrites currently read an integer number of stripes, merge the new
data and write an integer number of stripes. This story makes simple
improvements by applying the same optimizations as for sub stripe reads and,
for small (sub-chunk) updates, reducing the amount of data being read/written
on each shard.

* Only read chunks that are not being fully overwritten (code currently reads
  the whole stripe and then merges new data)
* Perform sub-chunk reads for sub-chunk updates
* Perform sub-chunk writes for sub-chunk updates

Eliminate unnecessary chunk writes but keep metadata transactions
-----------------------------------------------------------------

This story avoids re-writing data that has not been modified. A transaction is
still applied to every OSD to update object metadata, the PG log and PG stats.

* Continue to create transactions for all chunks but without the new write data
* Add sub-chunk writes to transactions where data is being modified

Avoid zero padding objects to a full stripe
-------------------------------------------

Objects are rounded up to an integer number of stripes by adding zero
padding. These buffers of zeros are then sent in messages to other OSDs and
written to the object store, consuming storage. This story makes optimizations
to remove the need for this padding.

* Modifications to reconstruct reads to avoid reading zero-padding at the end
  of an object - just fill the read buffer with zeros instead
* Avoid transfers/writes of buffers of zero padding. Still send transactions
  to all shards and create the object, just don't populate it with zeros
* Modifications to encode/decode functions to avoid having to pass in buffers
  of zeros when objects are padded

Erasure coding plugin changes to support distributed partial writes
-------------------------------------------------------------------

This is preparatory work for future stories; it adds new APIs to the erasure
code plugins.

* Add a new interface to create a delta by XORing old and new data together
  and implement this for the ISA-L and JErasure plugins
* Add a new interface to apply a delta to one coding parity by using XOR/GF
  and implement this for the ISA-L and JErasure plugins
* Add a new interface which reports which erasure codes support this feature
  (ISA-L and JErasure will support it, others will not)

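These interfaces might look roughly like the following sketch. The class and
method names are placeholders, and the real interface would use Ceph buffer
types rather than raw pointers.

.. code-block:: cpp

    #include <cstddef>
    #include <cstdint>

    // Hypothetical additions to the erasure code plugin interface; the names
    // and signatures are illustrative only.
    class ErasureCodePartialWriteInterface {
    public:
      virtual ~ErasureCodePartialWriteInterface() = default;

      // Report whether the plugin supports parity-delta-writes at all.
      virtual bool supports_parity_delta_write() const = 0;

      // delta = old_data XOR new_data, computed on the shard holding the data.
      virtual void create_delta(const uint8_t *old_data,
                                const uint8_t *new_data,
                                uint8_t *delta, size_t len) const = 0;

      // Apply a delta to a single coding parity shard.  For the XOR parity
      // this is a plain XOR; for other parities the delta is first scaled in
      // GF(2^8) by the coefficient of the updated data shard.
      virtual void apply_delta(const uint8_t *delta, unsigned data_shard,
                               unsigned parity_shard, uint8_t *parity,
                               size_t len) const = 0;
    };
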
Erasure coding interface to allow RADOS clients to direct I/Os to OSD storing the data
--------------------------------------------------------------------------------------

This is preparatory work for future stories; it adds a new API for clients.

* New interface to convert the pair (pg, offset) to {OSD, remaining chunk
  length}

We do not want clients to have to dynamically link to the erasure code plugins
so this code will need to be part of librados. However this interface needs to
understand how erasure codes distribute data and coding chunks to be able to
perform this translation.

We will only support the ISA-L and JErasure plugins, where there is a trivial
striping of data chunks to OSDs.

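A minimal sketch of the translation for such a trivially striped layout
follows. ShardExtent and locate are illustrative names, and the mapping from
data shard to OSD would come from the PG's acting set.

.. code-block:: cpp

    #include <cstdint>

    // Illustrative only: map a logical offset within an object to the data
    // shard that stores it and the number of contiguous bytes available on
    // that shard before the next shard boundary.
    struct ShardExtent {
      unsigned shard;      // data shard index in [0, k)
      uint64_t remaining;  // bytes readable from this shard at 'offset'
    };

    ShardExtent locate(uint64_t offset, uint64_t chunk_size, unsigned k) {
      const uint64_t stripe_width = chunk_size * k;
      const uint64_t in_stripe = offset % stripe_width;
      return ShardExtent{
        static_cast<unsigned>(in_stripe / chunk_size),
        chunk_size - (in_stripe % chunk_size)};
    }
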
Changes to object_info_t
------------------------

This is preparatory work for future stories.

This adds the vector of version numbers to object_info_t which will be used
for partial updates. For replicated pools and for erasure coded objects that
are not overwritten we will avoid storing extra data in object_info_t.

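Purely to illustrate the shape of the data (this is not the real object_info_t
encoding), the idea is a per-shard vector of versions alongside the existing
object version that stays empty until the first partial update.

.. code-block:: cpp

    #include <cstdint>
    #include <vector>

    // Illustrative only -- not the real object_info_t.  Objects that have
    // never been partially updated carry no extra data; objects that have
    // record the last version applied to each shard, so that peering and
    // backfill can tell which shards are behind.
    struct partial_update_versions {
      uint64_t version = 0;                  // object version, as today
      std::vector<uint64_t> shard_versions;  // empty until first partial update
    };
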
Changes to PGLog and Peering to support updating a subset of OSDs
-----------------------------------------------------------------

This is preparatory work for future stories.

* Modify the PG log entry to store a record of which OSDs are being updated
* Modify peering to use this extra data to work out OSDs that are missing
  updates

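One way the log entry could record this (an illustration rather than the
actual pg_log_entry_t change) is a small set of updated shards that peering
can consult.

.. code-block:: cpp

    #include <bitset>
    #include <cstdint>

    // Illustrative only.  Each partial-update log entry records which shards
    // were written; during peering a shard that is absent from the set did
    // not miss anything, even though its copy of the object was untouched.
    struct partial_update_record {
      uint64_t version = 0;
      std::bitset<32> updated_shards;  // bit i set => shard i got a transaction

      bool shard_missed_update(unsigned shard, uint64_t shard_version) const {
        return updated_shards.test(shard) && shard_version < version;
      }
    };
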
Change to selection of (acting) primary
---------------------------------------

This is preparatory work for future stories.

Constrain the choice of primary to be the first data OSD or one of the erasure
coding parities. If none of these OSDs are available and up to date then the
pool must be offline.

Implement parity-delta-write with all computation on the primary
----------------------------------------------------------------

* Calculate whether it is more efficient for an update to perform a full
  stripe overwrite or a parity-delta-write (see the sketch below)
* Implement new code paths to perform the parity-delta-write
* Test tool enhancements. We want to make sure that both parity-delta-write
  and full-stripe write are tested. We will add a new conf file option with a
  choice of 'parity-delta', 'full-stripe', 'mixture for testing' or
  'automatic' and update teuthology test cases to predominantly use a mixture.

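One plausible heuristic for that calculation (an assumption for illustration,
not the actual algorithm) is to compare the number of shard reads and writes
each strategy would issue and pick the cheaper one.

.. code-block:: cpp

    // Illustrative cost model only.  'modified' is the number of data shards
    // the update touches; k and m are the pool's data and coding shard
    // counts.  A parity-delta-write reads and writes each modified data shard
    // and each parity; a full-stripe overwrite reads the unmodified data
    // shards and writes every shard.
    bool prefer_parity_delta(unsigned modified, unsigned k, unsigned m) {
      const unsigned delta_cost = 2 * (modified + m);          // reads + writes
      const unsigned full_stripe_cost = (k - modified) + (k + m);
      return delta_cost < full_stripe_cost;
    }

For a 4+2 pool, for example, this chooses a parity-delta-write for a
single-chunk update and a full-stripe write when most of the stripe changes.
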
Upgrades and backwards compatibility
------------------------------------

* Add a new feature flag for erasure coded pools
* All OSDs must be running new code to enable the flag on the pool
* Clients may only issue direct I/Os if the flag is set
* OSDs running old code may not join a pool with the flag set
* It is not possible to turn the feature flag off (other than by deleting the
  pool)

Changes to Backfill to use the vector in object_info_t
------------------------------------------------------

This is preparatory work for future stories.

* Modify the backfill process to use the vector of version numbers in
  object_info_t so that when partial updates occur we do not backfill OSDs
  which did not participate in the partial update.
* When there is a single backfill target extract the appropriate version
  number from the vector (no additional storage required)
* When there are multiple backfill targets extract the subset of the vector
  required by the backfill targets and select the appropriate entry when
  comparing version numbers in PrimaryLogPG::recover_backfill

Test tools - offline metadata validation tool
---------------------------------------------

* Test tools for performing offline consistency checking of metadata, in
  particular checking the vector of version numbers in object_info_t matches
  the versions on each OSD, but also for validating PG log entries

Eliminate transactions on OSDs not updating data chunks
-------------------------------------------------------

Peering, log recovery and backfill can now all cope with partial updates using
the vector of version numbers in object_info_t.

* Modify the overwrite I/O path to not bother with metadata only transactions
  (except to the Primary OSD)
* Modify the update of the version numbers in object_info_t to use the vector
  and only update entries that are receiving a transaction
* Modify the generation of the PG log entry to record which OSDs are being
  updated

Direct reads to OSDs (single chunk only)
----------------------------------------

* Modify OSDClient to route single chunk read I/Os to the OSD storing the data
* Modify OSD to accept reads from non-primary OSDs (expand existing changes
  for replicated pools to work with EC pools as well)
* If necessary fail the read with EAGAIN if the OSD is unable to process the
  read directly
* Modify OSDClient to retry the read by submitting to the Primary OSD if the
  read is failed with EAGAIN
* Test tool enhancements. We want to make sure that both direct reads and
  reads to the primary are tested. We will add a new conf file option with a
  choice of 'prefer direct', 'primary only' or 'mixture for testing' and
  update teuthology test cases to predominantly use a mixture.

The changes will be made to the OSDC part of the RADOS client so they will be
applicable to rbd, rgw and cephfs.

We will not make changes to other code that has its own version of the RADOS
client code, such as krbd, although this could be done in the future.

Direct reads to OSDs (multiple chunks)
--------------------------------------

* Add a new OSDC flag NONATOMIC which allows OSDC to split a read into
  multiple requests
* Modify OSDC to split reads spanning multiple chunks into separate requests
  to each OSD if the NONATOMIC flag is set (see the sketch at the end of this
  section)
* Modifications to OSDC to coalesce results (if any sub read fails the whole
  read needs to fail)
* Changes to librbd client to set NONATOMIC flag for reads
* Changes to cephfs client to set NONATOMIC flag for reads

We are only changing a very limited set of clients, focusing on those that
issue smaller reads and are latency sensitive. Future work could look at
extending the set of clients (including krbd).

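The splitting itself could look something like the following sketch for a
trivially striped layout. SubRead and split_read are illustrative names; the
real work would happen inside OSDC.

.. code-block:: cpp

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Illustrative only: break a logical (offset, length) read into per data
    // shard extents.  Each entry becomes a separate sub-read; if any sub-read
    // fails the whole read must fail.
    struct SubRead {
      unsigned shard;         // data shard index in [0, k)
      uint64_t shard_offset;  // offset within that shard's chunk stream
      uint64_t length;
    };

    std::vector<SubRead> split_read(uint64_t offset, uint64_t length,
                                    uint64_t chunk_size, unsigned k) {
      std::vector<SubRead> out;
      const uint64_t stripe_width = chunk_size * k;
      while (length > 0) {
        const uint64_t stripe = offset / stripe_width;
        const uint64_t in_stripe = offset % stripe_width;
        const unsigned shard = static_cast<unsigned>(in_stripe / chunk_size);
        const uint64_t in_chunk = in_stripe % chunk_size;
        const uint64_t len = std::min(length, chunk_size - in_chunk);
        out.push_back({shard, stripe * chunk_size + in_chunk, len});
        offset += len;
        length -= len;
      }
      return out;
    }
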
Implement distributed parity-delta-write
----------------------------------------

* Implement new messages MOSDEcSubOpDelta and MOSDEcSubOpDeltaReply
* Change the primary to calculate the delta and send the MOSDEcSubOpDelta
  message to the coding parity OSDs
* Modify the coding parity OSDs to apply the delta and send the
  MOSDEcSubOpDeltaReply message

Note: This change will increase latency because the coding parity reads start
after the old data read. Future work will fix this.

Test tools - EC error injection thrasher
----------------------------------------

* Implement a new type of thrasher that specifically injects faults to stress
  erasure coded pools
* Take one or more (up to M) OSDs down, with more focus on taking different
  subsets of OSDs down to drive all the different EC recovery paths than on
  stressing peering/recovery/backfill (the existing OSD thrasher excels at
  this)
* Inject read I/O failures to force reconstructs using decode for single and
  multiple failures
* Inject delays using an 'osd tell' style interface to make it easier to test
  OSD down at all the interesting stages of EC I/Os
* Inject delays using an 'osd tell' style interface to slow down an OSD
  transaction or message to expose the less common completion orders for
  parallel work

Implement prefetch message MOSDEcSubOpPrefetch and modify extent cache
----------------------------------------------------------------------

* Implement new message MOSDEcSubOpPrefetch
* Change primary to issue this message to the coding parity OSDs before
  starting read of old data
* Change the extent cache so that each OSD caches its own data rather than
  caching everything on the primary
* Change coding parity OSDs to handle this message and read the old coding
  parity into the extent cache
* Changes to extent cache to retain the prefetched old parity until the
  MOSDEcSubOpDelta message is received, and to discard this on error paths
  (e.g. new OSDMap)

Implement sequencing message MOSDEcSubOpSequence
------------------------------------------------

* Implement new messages MOSDEcSubOpSequence and MOSDEcSubOpSequenceReply
* Modify primary code to create these messages and route them locally to
  itself in preparation for direct writes

Direct writes to OSD (single chunk only)
----------------------------------------

* Modify OSDC to route single chunk write I/Os to the OSD storing the data
* Changes to issue MOSDEcSubOpSequence and MOSDEcSubOpSequenceReply between
  data OSD and primary OSD

Direct writes to OSD (multiple chunks)
--------------------------------------

* Modifications to OSDC to split multiple chunk writes into separate requests
  if the NONATOMIC flag is set
* Further changes to coalescing completions (in particular reporting the
  version number correctly)
* Changes to librbd client to set NONATOMIC flag for writes
* Changes to cephfs client to set NONATOMIC flag for writes

We are only changing a very limited set of clients, focusing on those that
issue smaller writes and are latency sensitive. Future work could look at
extending the set of clients.

Deep scrub / CRC
----------------

* Disable CRC generation in the EC code for overwrites, delete the hinfo Xattr
  when the first overwrite occurs
* For objects in a pool with the new feature flag set that have not been
  overwritten, check the CRC even if the pool overwrite flag is set. The
  presence/absence of hinfo can be used to determine if the object has been
  overwritten
* For deep scrub requests XOR the contents of the shard to create a
  longitudinal check (8 bytes wide?)
* Return the longitudinal check in the scrub reply message, have the primary
  encode the set of longitudinal replies to check for inconsistencies

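A sketch of what the longitudinal check could look like, assuming the 8-byte
width suggested above:

.. code-block:: cpp

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // XOR the contents of a shard down to a single 8-byte value.  Each OSD
    // would return this in its scrub reply; because the erasure code is
    // linear, the primary can encode the per-shard values to check that the
    // shards are mutually consistent without shipping the data itself.
    uint64_t longitudinal_check(const std::vector<uint8_t>& shard) {
      uint64_t acc = 0;
      size_t i = 0;
      for (; i + 8 <= shard.size(); i += 8) {
        uint64_t word;
        std::memcpy(&word, shard.data() + i, sizeof(word));
        acc ^= word;
      }
      if (i < shard.size()) {
        uint64_t tail = 0;  // zero-pad the final partial word
        std::memcpy(&tail, shard.data() + i, shard.size() - i);
        acc ^= tail;
      }
      return acc;
    }
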
Variable chunk size erasure coding
----------------------------------

* Implement new pool option for automatic/variable chunk size
* When object size is small use a small chunk size (4K) when the pool is using
  the new option
* When object size is large use a large chunk size (64K or 256K?)
* Convert the chunk size by reading and re-writing the whole object when a
  small object grows (append)
* Convert the chunk size by reading and re-writing the whole object when a
  large object shrinks (truncate)
* Use the object size hint to avoid creating small objects and then almost
  immediately converting them to a larger chunk size

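The selection policy could be as simple as the following sketch; the threshold
and the 64K choice are assumptions for illustration.

.. code-block:: cpp

    #include <algorithm>
    #include <cstdint>

    // Illustrative policy only: small objects use a 4K chunk size, larger
    // objects a 64K chunk size, and the object size hint lets us pick the
    // final chunk size up front instead of converting shortly after creation.
    uint64_t choose_chunk_size(uint64_t object_size, uint64_t size_hint) {
      constexpr uint64_t small_chunk = 4 * 1024;
      constexpr uint64_t large_chunk = 64 * 1024;
      constexpr uint64_t threshold = 128 * 1024;  // assumed cut-over point
      return std::max(object_size, size_hint) < threshold ? small_chunk
                                                          : large_chunk;
    }
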
CLAY Erasure Codes
------------------

In theory CLAY erasure codes should be good for K+M erasure codes with larger
values of M, in particular when these erasure codes are used with multiple
OSDs in the same failure domain (e.g. an 8+6 erasure code with 5 servers each
with 4 OSDs). We would like to improve the test coverage for CLAY and perform
some more benchmarking to collect data to help substantiate when people should
consider using CLAY.

* Benchmark CLAY erasure codes - in particular the number of I/Os required for
  backfills when multiple OSDs fail
* Enhance test cases to validate the implementation
* See also https://bugzilla.redhat.com/show_bug.cgi?id=2004256