============================
Erasure Code developer notes
============================

Introduction
------------

Each chapter of this document explains one aspect of the erasure code
implementation in Ceph. The explanations rely mostly on worked examples
that demonstrate how things work.

Reading and writing encoded chunks from and to OSDs
---------------------------------------------------

An erasure coded pool stores each object as K+M chunks: K data chunks
and M coding chunks. The pool is configured to have a size of K+M so
that each chunk is stored on an OSD in the acting set. The rank of the
chunk is stored as an attribute of the object.

Let's say an erasure coded pool is created to use five OSDs (K+M = 5)
and sustain the loss of two of them (M = 2).

When the object *NYAN* containing *ABCDEFGHI* is written to it, the
erasure encoding function splits the content into three data chunks
simply by dividing it in three: the first contains *ABC*, the second
*DEF* and the last *GHI*. The content is padded if its length is not a
multiple of K. The function also creates two coding chunks: the fourth
with *YXY* and the fifth with *QGC*. Each chunk is stored on an OSD in
the acting set. The chunks are stored in objects that have the same
name (*NYAN*) but reside on different OSDs. The order in which the
chunks were created must be preserved and is stored as an attribute of
the object (shard_t), in addition to its name. Chunk *1* contains *ABC*
and is stored on *OSD5* while chunk *4* contains *YXY* and is stored on
*OSD3*.

::

                         +-------------------+
                    name |       NYAN        |
                         +-------------------+
                 content |     ABCDEFGHI     |
                         +--------+----------+
                                  |
                                  |
                                  v
                           +------+------+
              +------------+ encode(3,2) +------------+
              |            +--+---+---+--+            |
              |               |   |   |               |
              |         +-----+   |   +-----+         |
              |         |         |         |         |
           +--v---+  +--v---+  +--v---+  +--v---+  +--v---+
  name     | NYAN |  | NYAN |  | NYAN |  | NYAN |  | NYAN |
           +------+  +------+  +------+  +------+  +------+
  shard    |  1   |  |  2   |  |  3   |  |  4   |  |  5   |
           +------+  +------+  +------+  +------+  +------+
  content  | ABC  |  | DEF  |  | GHI  |  | YXY  |  | QGC  |
           +--+---+  +--+---+  +--+---+  +--+---+  +--+---+
              |         |         |         |         |
              |         |         |         |         |
              |         |      +--+---+     |         |
              |         |      | OSD1 |     |         |
              |         |      +------+     |         |
              |         |      +------+     |         |
              |         +----->| OSD2 |     |         |
              |                +------+     |         |
              |                +------+     |         |
              |                | OSD3 |<----+         |
              |                +------+               |
              |                +------+               |
              |                | OSD4 |<--------------+
              |                +------+
              |                +------+
              +--------------->| OSD5 |
                               +------+
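
The splitting step of the write path can be sketched in C++ as follows.
This is only an illustration under stated assumptions:
``split_into_data_chunks`` is a hypothetical helper, not a Ceph function,
and padding the tail with a caller-chosen byte is an assumption (the real
plugins decide chunk size and alignment themselves). The coding chunks
are not computed here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of Ceph): split `content` into k data
// chunks of equal size, padding the tail with `pad` when the length is
// not a multiple of k.
std::vector<std::string> split_into_data_chunks(const std::string &content,
                                                std::size_t k,
                                                char pad = '\0') {
  assert(k > 0);
  // Round the chunk size up so that k chunks always hold the whole content.
  const std::size_t chunk_size = (content.size() + k - 1) / k;
  std::string padded = content;
  padded.resize(chunk_size * k, pad);

  std::vector<std::string> chunks;
  for (std::size_t i = 0; i < k; ++i)
    chunks.push_back(padded.substr(i * chunk_size, chunk_size));
  return chunks;
}
```

With the example above, ``split_into_data_chunks("ABCDEFGHI", 3)`` yields
the three data chunks *ABC*, *DEF* and *GHI*, in that order.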

When the object *NYAN* is read from the erasure coded pool, the
decoding function reads three chunks, chunk *1* containing *ABC*,
chunk *3* containing *GHI* and chunk *4* containing *YXY*, and rebuilds
the original content of the object, *ABCDEFGHI*. The decoding function
is informed that chunks *2* and *5* are missing (they are called
*erasures*). Chunk *5* could not be read because *OSD4* is *out*.

The decoding function could be called as soon as three chunks are
read: *OSD2* was the slowest and its chunk does not need to be taken
into account. This optimization is not implemented in Firefly.

::

                         +-------------------+
                    name |       NYAN        |
                         +-------------------+
                 content |     ABCDEFGHI     |
                         +--------+----------+
                                  ^
                                  |
                                  |
                           +------+------+
                           | decode(3,2) |
                           | erasures 2,5|
              +----------->|             |
              |            +-------------+
              |                   ^   ^
              |                   |   +-----+
              |                   |         |
           +--+---+  +------+  +--+---+  +--+---+
  name     | NYAN |  | NYAN |  | NYAN |  | NYAN |
           +------+  +------+  +------+  +------+
  shard    |  1   |  |  2   |  |  3   |  |  4   |
           +------+  +------+  +------+  +------+
  content  | ABC  |  | DEF  |  | GHI  |  | YXY  |
           +--+---+  +--+---+  +--+---+  +--+---+
              ^         .         ^         ^
              |    TOO  .         |         |
              |    SLOW .      +--+---+     |
              |         ^      | OSD1 |     |
              |         |      +------+     |
              |         |      +------+     |
              |         +------| OSD2 |     |
              |                +------+     |
              |                +------+     |
              |                | OSD3 |-----+
              |                +------+
              |                +------+
              |                | OSD4 | OUT
              |                +------+
              |                +------+
              +----------------| OSD5 |
                               +------+
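
The bookkeeping behind the *erasures* list can be sketched as follows.
The helpers ``find_erasures`` and ``can_decode`` are hypothetical and
are not taken from the Ceph sources. Shards are numbered from 1 to match
the diagram, which is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical helper: list the shard ids in 1..k+m that were not read.
std::vector<int> find_erasures(int k, int m, const std::set<int> &available) {
  assert(k > 0 && m >= 0);
  std::vector<int> erasures;
  for (int shard = 1; shard <= k + m; ++shard)
    if (available.count(shard) == 0)
      erasures.push_back(shard);
  return erasures;
}

// Decoding is possible when at most m chunks are missing, that is when
// at least k of the k+m chunks are still available.
bool can_decode(int k, int m, const std::set<int> &available) {
  return find_erasures(k, m, available).size() <= static_cast<std::size_t>(m);
}
```

For the read above, ``find_erasures(3, 2, {1, 3, 4})`` returns shards 2
and 5, and ``can_decode`` reports that three chunks are enough.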

Erasure code library
--------------------

Using `Reed-Solomon <https://en.wikipedia.org/wiki/Reed_Solomon>`_
with parameters K+M, object O is encoded by dividing it into data
chunks O1, O2, ... OK and computing coding chunks P1, P2, ... PM. Any K
chunks out of the available K+M chunks can be used to obtain the
original object. If data chunk O2 or coding chunk P2 is lost, it can be
repaired using any K chunks out of the K+M chunks. If more than M
chunks are lost, it is not possible to recover the object.
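
The "any K chunks out of K+M" property can be illustrated with a toy
code that is much simpler than Reed-Solomon: a single coding chunk
(M = 1) computed as the byte-wise XOR of the K data chunks. This is a
swapped-in technique for illustration only and is not what the Ceph
plugins do. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy systematic code with M = 1 (NOT Reed-Solomon): the coding chunk
// is the byte-wise XOR of all data chunks, which must have equal size.
std::string xor_parity(const std::vector<std::string> &chunks) {
  assert(!chunks.empty());
  std::string parity(chunks[0].size(), '\0');
  for (const std::string &chunk : chunks) {
    assert(chunk.size() == parity.size());
    for (std::size_t i = 0; i < parity.size(); ++i)
      parity[i] ^= chunk[i];
  }
  return parity;
}

// Rebuild the single missing chunk (data or coding) from the K
// surviving chunks: XOR-ing them cancels everything but the lost one.
std::string rebuild_missing(const std::vector<std::string> &surviving) {
  return xor_parity(surviving);
}
```

With K = 3 data chunks *ABC*, *DEF* and *GHI*, losing *DEF* still leaves
three chunks (two data chunks and the parity), and ``rebuild_missing``
applied to them returns *DEF*. Losing two chunks exceeds M = 1 and
cannot be repaired by this toy code.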

Reading the original content of object O can be a simple
concatenation of O1, O2, ... OK when all data chunks are available,
because the plugins use
`systematic codes <https://en.wikipedia.org/wiki/Systematic_code>`_.
Otherwise the chunks must be given to the erasure code library *decode*
method to retrieve the content of the object.

Performance depends on the parameters given to the encoding functions
and is also influenced by the packet sizes used when calling them
(for Cauchy or Liberation, for instance): smaller packets mean more
calls and more overhead.

Although Reed-Solomon is provided as a default, Ceph uses it via an
`abstract API <https://github.com/ceph/ceph/blob/v0.78/src/erasure-code/ErasureCodeInterface.h>`_
designed to allow each pool to choose the plugin that implements it,
using key=value pairs stored in an `erasure code profile`_.

.. _erasure code profile: ../../../erasure-coded-pool

::

    $ ceph osd erasure-code-profile set myprofile \
        crush-failure-domain=osd
    $ ceph osd erasure-code-profile get myprofile
    directory=/usr/lib/ceph/erasure-code
    k=2
    m=1
    plugin=jerasure
    technique=reed_sol_van
    crush-failure-domain=osd
    $ ceph osd pool create ecpool erasure myprofile

The *plugin* is dynamically loaded from *directory* and is expected to
implement the ``int __erasure_code_init(char *plugin_name, char *directory)``
function, which is responsible for registering an object derived from
*ErasureCodePlugin* in the registry. The
`ErasureCodePluginExample <https://github.com/ceph/ceph/blob/v0.78/src/test/erasure-code/ErasureCodePluginExample.cc>`_
plugin reads:

::

    ErasureCodePluginRegistry &instance =
        ErasureCodePluginRegistry::instance();
    instance.add(plugin_name, new ErasureCodePluginExample());

The *ErasureCodePlugin* derived object must provide a factory method
from which the concrete implementation of the *ErasureCodeInterface*
object can be generated. The
`ErasureCodePluginExample plugin <https://github.com/ceph/ceph/blob/v0.78/src/test/erasure-code/ErasureCodePluginExample.cc>`_
reads:

::

    virtual int factory(const map<std::string,std::string> &parameters,
                        ErasureCodeInterfaceRef *erasure_code) {
      *erasure_code = ErasureCodeInterfaceRef(new ErasureCodeExample(parameters));
      return 0;
    }

The *parameters* argument is the list of *key=value* pairs that were
set in the erasure code profile, before the pool was created.

::

    ceph osd erasure-code-profile set myprofile \
       directory=<dir>         \ # mandatory
       plugin=jerasure         \ # mandatory
       m=10                    \ # optional and plugin dependent
       k=3                     \ # optional and plugin dependent
       technique=reed_sol_van  \ # optional and plugin dependent
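
The way a registry, a plugin factory and the profile parameters fit
together can be sketched with simplified stand-ins. Every ``Toy*`` name
below is hypothetical and written for this illustration. These are not
the Ceph classes, and the real signatures in the Ceph tree may differ.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// key=value pairs, as they would come from an erasure code profile.
using ToyParameters = std::map<std::string, std::string>;

// Stand-in for a concrete erasure code object; it only keeps its parameters.
struct ToyErasureCode {
  ToyParameters parameters;
  explicit ToyErasureCode(const ToyParameters &p) : parameters(p) {}
};
using ToyErasureCodeRef = std::shared_ptr<ToyErasureCode>;

// Stand-in for the plugin base class: it only exposes a factory method.
struct ToyPlugin {
  virtual ~ToyPlugin() = default;
  virtual int factory(const ToyParameters &parameters,
                      ToyErasureCodeRef *erasure_code) = 0;
};

struct ToyExamplePlugin : ToyPlugin {
  int factory(const ToyParameters &parameters,
              ToyErasureCodeRef *erasure_code) override {
    *erasure_code = std::make_shared<ToyErasureCode>(parameters);
    return 0;
  }
};

// Stand-in for the registry: a singleton mapping plugin names to plugins.
class ToyRegistry {
public:
  static ToyRegistry &instance() {
    static ToyRegistry registry;
    return registry;
  }
  // Returns 0 on success, -1 if the name is already registered.
  int add(const std::string &name, std::unique_ptr<ToyPlugin> plugin) {
    if (plugins_.count(name) != 0)
      return -1;
    plugins_[name] = std::move(plugin);
    return 0;
  }
  // Looks the plugin up by name and delegates to its factory method.
  int factory(const std::string &name, const ToyParameters &parameters,
              ToyErasureCodeRef *erasure_code) {
    auto it = plugins_.find(name);
    if (it == plugins_.end())
      return -1;
    return it->second->factory(parameters, erasure_code);
  }
private:
  std::map<std::string, std::unique_ptr<ToyPlugin>> plugins_;
};
```

In this sketch, registering ``ToyExamplePlugin`` under a name and then
calling ``ToyRegistry::instance().factory(name, parameters, &ref)`` hands
the profile's key=value pairs to the object built by the plugin.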

Notes
-----

If the objects are large, it may be impractical to encode and decode
them in memory. However, when using *RBD* a 1TB device is divided into
many individual 4MB objects, and *RGW* does the same.

Encoding and decoding are implemented in the OSD. Although they could
be implemented client side for reads and writes, the OSD must be able
to encode and decode on its own when scrubbing.