diff --git a/doc/rados/operations/erasure-code.rst b/doc/rados/operations/erasure-code.rst new file mode 100644 index 00000000000..512e1be1ea3 --- /dev/null +++ b/doc/rados/operations/erasure-code.rst @@ -0,0 +1,173 @@ +============= + Erasure code +============= + +A Ceph pool is associated with a type which sustains the loss of an OSD +(i.e. a disk since most of the time there is one OSD per disk). The +default choice when `creating a pool <../pools>`_ is *replicated*, +meaning every object is copied on multiple disks. The `Erasure Code +<https://en.wikipedia.org/wiki/Erasure_code>`_ pool type can be used +instead to save space. + +Creating a sample erasure coded pool +------------------------------------ + +The simplest erasure coded pool is equivalent to `RAID5 +<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and +requires at least three hosts:: + + $ ceph osd pool create ecpool 12 12 erasure + pool 'ecpool' created + $ echo ABCDEFGHI | rados --pool ecpool put NYAN - + $ rados --pool ecpool get NYAN - + ABCDEFGHI + +.. note:: The 12 in *pool create* stands for + `the number of placement groups <../pools>`_. + +Erasure code profiles +--------------------- + +The default erasure code profile sustains the loss of a single OSD. It +is equivalent to a replicated pool of size two but requires 1.5TB +instead of 2TB to store 1TB of data. The default profile can be +displayed with:: + + $ ceph osd erasure-code-profile get default + directory=.libs + k=2 + m=1 + plugin=jerasure + ruleset-failure-domain=host + technique=reed_sol_van + +Choosing the right profile is important because it cannot be modified +after the pool is created: a new pool with a different profile needs +to be created and all objects from the previous pool moved to the new pool. + +The most important parameters of the profile are *K*, *M* and +*ruleset-failure-domain* because they define the storage overhead and +the data durability. For instance, if the desired architecture must +sustain the loss of two racks with a storage overhead of 40%, +the following profile can be defined:: + + $ ceph osd erasure-code-profile set myprofile \ + k=3 \ + m=2 \ + ruleset-failure-domain=rack + $ ceph osd pool create ecpool 12 12 erasure *myprofile* + $ echo ABCDEFGHI | rados --pool ecpool put NYAN - + $ rados --pool ecpool get NYAN - + ABCDEFGHI
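+
+The profile just created can be listed and displayed again at any time,
+which is a convenient sanity check before writing data. The output below
+is only indicative: the exact fields and their defaults (such as the
+plugin and technique) depend on the Ceph release::
+
+    $ ceph osd erasure-code-profile ls
+    default
+    myprofile
+    $ ceph osd erasure-code-profile get myprofile
+    k=3
+    m=2
+    plugin=jerasure
+    ruleset-failure-domain=rack
+    technique=reed_sol_van
+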
+The *NYAN* object will be divided into three (*K=3*) and two additional +*chunks* will be created (*M=2*). The value of *M* defines how many +OSDs can be lost simultaneously without losing any data. The +*ruleset-failure-domain=rack* will create a CRUSH ruleset that ensures +no two *chunks* are stored in the same rack. + +.. ditaa:: +-------------------+ + name | NYAN | + +-------------------+ + content | ABCDEFGHI | + +--------+----------+ + | + | + v + +------+------+ + +---------------+ encode(3,2) +-----------+ + | +--+--+---+---+ | + | | | | | + | +-------+ | +-----+ | + | | | | | + +--v---+ +--v---+ +--v---+ +--v---+ +--v---+ + name | NYAN | | NYAN | | NYAN | | NYAN | | NYAN | + +------+ +------+ +------+ +------+ +------+ + shard | 1 | | 2 | | 3 | | 4 | | 5 | + +------+ +------+ +------+ +------+ +------+ + content | ABC | | DEF | | GHI | | YXY | | QGC | + +--+---+ +--+---+ +--+---+ +--+---+ +--+---+ + | | | | | + | | v | | + | | +--+---+ | | + | | | OSD1 | | | + | | +------+ | | + | | | | + | | +------+ | | + | +------>| OSD2 | | | + | +------+ | | + | | | + | +------+ | | + | | OSD3 |<----+ | + | +------+ | + | | + | +------+ | + | | OSD4 |<--------------+ + | +------+ + | + | +------+ + +----------------->| OSD5 | + +------+ + + +More information can be found in the `erasure code profiles +<../erasure-code-profile>`_ documentation. + +Erasure coded pool and cache tiering +------------------------------------ + +Erasure coded pools require more resources than replicated pools and +lack some functionalities such as partial writes. To overcome these +limitations, it is recommended to set up a `cache tier <../cache-tiering>`_ +in front of the erasure coded pool. + +For instance, if the pool *hot-storage* is made of fast storage:: + + $ ceph osd tier add ecpool hot-storage + $ ceph osd tier cache-mode hot-storage writeback + $ ceph osd tier set-overlay ecpool hot-storage + +will place the *hot-storage* pool as a tier of *ecpool* in *writeback* +mode so that every write and read to the *ecpool* actually uses +the *hot-storage* pool and benefits from its flexibility and speed. + +It is not possible to create an RBD image on an erasure coded pool +because RBD requires partial writes. It is however possible to create +an RBD image on an erasure coded pool when a replicated pool is set +as a cache tier:: + + $ rbd --pool ecpool create --size 10 myvolume + +More information can be found in the `cache tiering +<../cache-tiering>`_ documentation.
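+
+If the cache tier is no longer needed it can be detached again. The
+commands below are only a sketch of the procedure: on a cluster holding
+real data the objects cached in *hot-storage* must first be flushed back
+to *ecpool*, as explained in the `cache tiering <../cache-tiering>`_
+documentation::
+
+    $ ceph osd tier cache-mode hot-storage forward
+    $ rados --pool hot-storage cache-flush-evict-all
+    $ ceph osd tier remove-overlay ecpool
+    $ ceph osd tier remove ecpool hot-storage
+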
+Glossary +-------- + +*chunk* + When the encoding function is called, it returns chunks of the same + size: data chunks, which can be concatenated to reconstruct the original + object, and coding chunks, which can be used to rebuild a lost chunk. + +*K* + The number of data *chunks*, i.e. the number of *chunks* into which the + original object is divided. For instance, if *K* = 2, a 10KB object + will be divided into *K* chunks of 5KB each. + +*M* + The number of coding *chunks*, i.e. the number of additional *chunks* + computed by the encoding function. If there are 2 coding *chunks*, + 2 OSDs can be out without losing data. + + +Table of contents +----------------- + +.. toctree:: + :maxdepth: 1 + + erasure-code-profile + erasure-code-jerasure + erasure-code-isa + erasure-code-lrc diff --git a/doc/rados/operations/index.rst b/doc/rados/operations/index.rst index 609cc4a92c3..a21f12eced0 100644 --- a/doc/rados/operations/index.rst +++ b/doc/rados/operations/index.rst @@ -32,7 +32,7 @@ CRUSH algorithm. data-placement pools - erasure-code-profile + erasure-code cache-tiering placement-groups crush-map diff --git a/doc/rados/operations/pools.rst b/doc/rados/operations/pools.rst index 522bdd0a9db..009e453e443 100644 --- a/doc/rados/operations/pools.rst +++ b/doc/rados/operations/pools.rst @@ -9,7 +9,7 @@ pools for storing data. A pool provides you with: For replicated pools, it is the desired number of copies/replicas of an object. A typical configuration stores an object and one additional copy (i.e., ``size = 2``), but you can determine the number of copies/replicas. - For erasure coded pools, it is the number of coding chunks + For `erasure coded pools <../erasure-code>`_, it is the number of coding chunks (i.e. ``m=2`` in the **erasure code profile**) - **Placement Groups**: You can set the number of placement groups for the pool. @@ -31,7 +31,6 @@ pools for storing data. A pool provides you with: To organize data into pools, you can list, create, and remove pools. You can also view the utilization statistics for each pool. - List Pools ========== @@ -98,8 +97,9 @@ Where: :Description: The pool type which may either be **replicated** to recover from lost OSDs by keeping multiple copies of the - objects or **erasure** to get a kind of generalized - RAID5 capability. The **replicated** pools require more + objects or **erasure** to get a kind of + `generalized RAID5 <../erasure-code>`_ capability. + The **replicated** pools require more raw storage but implement all Ceph operations. The **erasure** pools require less raw storage but only implement a subset of the available operations.
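+
+As a quick illustration of the two pool types, both are created with the
+same command and differ only in the type argument and, for erasure coded
+pools, an optional erasure code profile. The pool names and placement
+group counts below are arbitrary examples, and *myprofile* is assumed to
+have been created as shown in the `erasure code <../erasure-code>`_
+documentation::
+
+    $ ceph osd pool create mypool 12 12 replicated
+    $ ceph osd pool create myecpool 12 12 erasure myprofile
+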