mirror of
https://github.com/ceph/ceph
synced 2025-01-02 00:52:22 +00:00
dc7a2aaf7a
1) ruleset is an obsolete term, and 2) crush-{rule,failure-domain,...} is more descriptive. Note that we are changing the names of the erasure code profile keys from ruleset-* to crush-*. We will update this on upgrade when the luminous flag is set, but that means that during mon upgrade you cannot create EC pools that use these fields. When the upgrade completes (users sets require_osd_release = luminous) existing ec profiles are updated automatically. Signed-off-by: Sage Weil <sage@redhat.com>
372 lines
12 KiB
ReStructuredText
372 lines
12 KiB
ReStructuredText
======================================
|
|
Locally repairable erasure code plugin
|
|
======================================
|
|
|
|
With the *jerasure* plugin, when an erasure coded object is stored on
|
|
multiple OSDs, recovering from the loss of one OSD requires reading
|
|
from all the others. For instance if *jerasure* is configured with
|
|
*k=8* and *m=4*, losing one OSD requires reading from the eleven
|
|
others to repair.
|
|
|
|
The *lrc* erasure code plugin creates local parity chunks to be able
|
|
to recover using less OSDs. For instance if *lrc* is configured with
|
|
*k=8*, *m=4* and *l=4*, it will create an additional parity chunk for
|
|
every four OSDs. When a single OSD is lost, it can be recovered with
|
|
only four OSDs instead of eleven.
|
|
|
|
Erasure code profile examples
|
|
=============================
|
|
|
|
Reduce recovery bandwidth between hosts
|
|
---------------------------------------
|
|
|
|
Although it is probably not an interesting use case when all hosts are
|
|
connected to the same switch, reduced bandwidth usage can actually be
|
|
observed.::
|
|
|
|
$ ceph osd erasure-code-profile set LRCprofile \
|
|
plugin=lrc \
|
|
k=4 m=2 l=3 \
|
|
crush-failure-domain=host
|
|
$ ceph osd pool create lrcpool 12 12 erasure LRCprofile
|
|
|
|
|
|
Reduce recovery bandwidth between racks
|
|
---------------------------------------
|
|
|
|
In Firefly the reduced bandwidth will only be observed if the primary
|
|
OSD is in the same rack as the lost chunk.::
|
|
|
|
$ ceph osd erasure-code-profile set LRCprofile \
|
|
plugin=lrc \
|
|
k=4 m=2 l=3 \
|
|
crush-locality=rack \
|
|
crush-failure-domain=host
|
|
$ ceph osd pool create lrcpool 12 12 erasure LRCprofile
|
|
|
|
|
|
Create an lrc profile
|
|
=====================
|
|
|
|
To create a new lrc erasure code profile::
|
|
|
|
ceph osd erasure-code-profile set {name} \
|
|
plugin=lrc \
|
|
k={data-chunks} \
|
|
m={coding-chunks} \
|
|
l={locality} \
|
|
[crush-root={root}] \
|
|
[crush-locality={bucket-type}] \
|
|
[crush-failure-domain={bucket-type}] \
|
|
[crush-device-class={device-class}] \
|
|
[directory={directory}] \
|
|
[--force]
|
|
|
|
Where:
|
|
|
|
``k={data chunks}``
|
|
|
|
:Description: Each object is split in **data-chunks** parts,
|
|
each stored on a different OSD.
|
|
|
|
:Type: Integer
|
|
:Required: Yes.
|
|
:Example: 4
|
|
|
|
``m={coding-chunks}``
|
|
|
|
:Description: Compute **coding chunks** for each object and store them
|
|
on different OSDs. The number of coding chunks is also
|
|
the number of OSDs that can be down without losing data.
|
|
|
|
:Type: Integer
|
|
:Required: Yes.
|
|
:Example: 2
|
|
|
|
``l={locality}``
|
|
|
|
:Description: Group the coding and data chunks into sets of size
|
|
**locality**. For instance, for **k=4** and **m=2**,
|
|
when **locality=3** two groups of three are created.
|
|
Each set can be recovered without reading chunks
|
|
from another set.
|
|
|
|
:Type: Integer
|
|
:Required: Yes.
|
|
:Example: 3
|
|
|
|
``crush-root={root}``
|
|
|
|
:Description: The name of the crush bucket used for the first step of
|
|
the ruleset. For intance **step take default**.
|
|
|
|
:Type: String
|
|
:Required: No.
|
|
:Default: default
|
|
|
|
``crush-locality={bucket-type}``
|
|
|
|
:Description: The type of the crush bucket in which each set of chunks
|
|
defined by **l** will be stored. For instance, if it is
|
|
set to **rack**, each group of **l** chunks will be
|
|
placed in a different rack. It is used to create a
|
|
ruleset step such as **step choose rack**. If it is not
|
|
set, no such grouping is done.
|
|
|
|
:Type: String
|
|
:Required: No.
|
|
|
|
``crush-failure-domain={bucket-type}``
|
|
|
|
:Description: Ensure that no two chunks are in a bucket with the same
|
|
failure domain. For instance, if the failure domain is
|
|
**host** no two chunks will be stored on the same
|
|
host. It is used to create a ruleset step such as **step
|
|
chooseleaf host**.
|
|
|
|
:Type: String
|
|
:Required: No.
|
|
:Default: host
|
|
|
|
``crush-device-class={device-class}``
|
|
|
|
:Description: Restrict placement to devices of a specific class (e.g.,
|
|
``ssd`` or ``hdd``), using the crush device class names
|
|
in the CRUSH map.
|
|
|
|
:Type: String
|
|
:Required: No.
|
|
:Default:
|
|
|
|
``directory={directory}``
|
|
|
|
:Description: Set the **directory** name from which the erasure code
|
|
plugin is loaded.
|
|
|
|
:Type: String
|
|
:Required: No.
|
|
:Default: /usr/lib/ceph/erasure-code
|
|
|
|
``--force``
|
|
|
|
:Description: Override an existing profile by the same name.
|
|
|
|
:Type: String
|
|
:Required: No.
|
|
|
|
Low level plugin configuration
|
|
==============================
|
|
|
|
The sum of **k** and **m** must be a multiple of the **l** parameter.
|
|
The low level configuration parameters do not impose such a
|
|
restriction and it may be more convienient to use it for specific
|
|
purposes. It is for instance possible to define two groups, one with 4
|
|
chunks and another with 3 chunks. It is also possible to recursively
|
|
define locality sets, for instance datacenters and racks into
|
|
datacenters. The **k/m/l** are implemented by generating a low level
|
|
configuration.
|
|
|
|
The *lrc* erasure code plugin recursively applies erasure code
|
|
techniques so that recovering from the loss of some chunks only
|
|
requires a subset of the available chunks, most of the time.
|
|
|
|
For instance, when three coding steps are described as::
|
|
|
|
chunk nr 01234567
|
|
step 1 _cDD_cDD
|
|
step 2 cDDD____
|
|
step 3 ____cDDD
|
|
|
|
where *c* are coding chunks calculated from the data chunks *D*, the
|
|
loss of chunk *7* can be recovered with the last four chunks. And the
|
|
loss of chunk *2* chunk can be recovered with the first four
|
|
chunks.
|
|
|
|
Erasure code profile examples using low level configuration
|
|
===========================================================
|
|
|
|
Minimal testing
|
|
---------------
|
|
|
|
It is strictly equivalent to using the default erasure code profile. The *DD*
|
|
implies *K=2*, the *c* implies *M=1* and the *jerasure* plugin is used
|
|
by default.::
|
|
|
|
$ ceph osd erasure-code-profile set LRCprofile \
|
|
plugin=lrc \
|
|
mapping=DD_ \
|
|
layers='[ [ "DDc", "" ] ]'
|
|
$ ceph osd pool create lrcpool 12 12 erasure LRCprofile
|
|
|
|
Reduce recovery bandwidth between hosts
|
|
---------------------------------------
|
|
|
|
Although it is probably not an interesting use case when all hosts are
|
|
connected to the same switch, reduced bandwidth usage can actually be
|
|
observed. It is equivalent to **k=4**, **m=2** and **l=3** although
|
|
the layout of the chunks is different::
|
|
|
|
$ ceph osd erasure-code-profile set LRCprofile \
|
|
plugin=lrc \
|
|
mapping=__DD__DD \
|
|
layers='[
|
|
[ "_cDD_cDD", "" ],
|
|
[ "cDDD____", "" ],
|
|
[ "____cDDD", "" ],
|
|
]'
|
|
$ ceph osd pool create lrcpool 12 12 erasure LRCprofile
|
|
|
|
|
|
Reduce recovery bandwidth between racks
|
|
---------------------------------------
|
|
|
|
In Firefly the reduced bandwidth will only be observed if the primary
|
|
OSD is in the same rack as the lost chunk.::
|
|
|
|
$ ceph osd erasure-code-profile set LRCprofile \
|
|
plugin=lrc \
|
|
mapping=__DD__DD \
|
|
layers='[
|
|
[ "_cDD_cDD", "" ],
|
|
[ "cDDD____", "" ],
|
|
[ "____cDDD", "" ],
|
|
]' \
|
|
crush-steps='[
|
|
[ "choose", "rack", 2 ],
|
|
[ "chooseleaf", "host", 4 ],
|
|
]'
|
|
$ ceph osd pool create lrcpool 12 12 erasure LRCprofile
|
|
|
|
Testing with different Erasure Code backends
|
|
--------------------------------------------
|
|
|
|
LRC now uses jerasure as the default EC backend. It is possible to
|
|
specify the EC backend/algorithm on a per layer basis using the low
|
|
level configuration. The second argument in layers='[ [ "DDc", "" ] ]'
|
|
is actually an erasure code profile to be used for this level. The
|
|
example below specifies the ISA backend with the cauchy technique to
|
|
be used in the lrcpool.::
|
|
|
|
$ ceph osd erasure-code-profile set LRCprofile \
|
|
plugin=lrc \
|
|
mapping=DD_ \
|
|
layers='[ [ "DDc", "plugin=isa technique=cauchy" ] ]'
|
|
$ ceph osd pool create lrcpool 12 12 erasure LRCprofile
|
|
|
|
You could also use a different erasure code profile for for each
|
|
layer.::
|
|
|
|
$ ceph osd erasure-code-profile set LRCprofile \
|
|
plugin=lrc \
|
|
mapping=__DD__DD \
|
|
layers='[
|
|
[ "_cDD_cDD", "plugin=isa technique=cauchy" ],
|
|
[ "cDDD____", "plugin=isa" ],
|
|
[ "____cDDD", "plugin=jerasure" ],
|
|
]'
|
|
$ ceph osd pool create lrcpool 12 12 erasure LRCprofile
|
|
|
|
|
|
|
|
Erasure coding and decoding algorithm
|
|
=====================================
|
|
|
|
The steps found in the layers description::
|
|
|
|
chunk nr 01234567
|
|
|
|
step 1 _cDD_cDD
|
|
step 2 cDDD____
|
|
step 3 ____cDDD
|
|
|
|
are applied in order. For instance, if a 4K object is encoded, it will
|
|
first go thru *step 1* and be divided in four 1K chunks (the four
|
|
uppercase D). They are stored in the chunks 2, 3, 6 and 7, in
|
|
order. From these, two coding chunks are calculated (the two lowercase
|
|
c). The coding chunks are stored in the chunks 1 and 5, respectively.
|
|
|
|
The *step 2* re-uses the content created by *step 1* in a similar
|
|
fashion and stores a single coding chunk *c* at position 0. The last four
|
|
chunks, marked with an underscore (*_*) for readability, are ignored.
|
|
|
|
The *step 3* stores a single coding chunk *c* at position 4. The three
|
|
chunks created by *step 1* are used to compute this coding chunk,
|
|
i.e. the coding chunk from *step 1* becomes a data chunk in *step 3*.
|
|
|
|
If chunk *2* is lost::
|
|
|
|
chunk nr 01234567
|
|
|
|
step 1 _c D_cDD
|
|
step 2 cD D____
|
|
step 3 __ _cDDD
|
|
|
|
decoding will attempt to recover it by walking the steps in reverse
|
|
order: *step 3* then *step 2* and finally *step 1*.
|
|
|
|
The *step 3* knows nothing about chunk *2* (i.e. it is an underscore)
|
|
and is skipped.
|
|
|
|
The coding chunk from *step 2*, stored in chunk *0*, allows it to
|
|
recover the content of chunk *2*. There are no more chunks to recover
|
|
and the process stops, without considering *step 1*.
|
|
|
|
Recovering chunk *2* requires reading chunks *0, 1, 3* and writing
|
|
back chunk *2*.
|
|
|
|
If chunk *2, 3, 6* are lost::
|
|
|
|
chunk nr 01234567
|
|
|
|
step 1 _c _c D
|
|
step 2 cD __ _
|
|
step 3 __ cD D
|
|
|
|
The *step 3* can recover the content of chunk *6*::
|
|
|
|
chunk nr 01234567
|
|
|
|
step 1 _c _cDD
|
|
step 2 cD ____
|
|
step 3 __ cDDD
|
|
|
|
The *step 2* fails to recover and is skipped because there are two
|
|
chunks missing (*2, 3*) and it can only recover from one missing
|
|
chunk.
|
|
|
|
The coding chunk from *step 1*, stored in chunk *1, 5*, allows it to
|
|
recover the content of chunk *2, 3*::
|
|
|
|
chunk nr 01234567
|
|
|
|
step 1 _cDD_cDD
|
|
step 2 cDDD____
|
|
step 3 ____cDDD
|
|
|
|
Controlling crush placement
|
|
===========================
|
|
|
|
The default crush ruleset provides OSDs that are on different hosts. For instance::
|
|
|
|
chunk nr 01234567
|
|
|
|
step 1 _cDD_cDD
|
|
step 2 cDDD____
|
|
step 3 ____cDDD
|
|
|
|
needs exactly *8* OSDs, one for each chunk. If the hosts are in two
|
|
adjacent racks, the first four chunks can be placed in the first rack
|
|
and the last four in the second rack. So that recovering from the loss
|
|
of a single OSD does not require using bandwidth between the two
|
|
racks.
|
|
|
|
For instance::
|
|
|
|
crush-steps='[ [ "choose", "rack", 2 ], [ "chooseleaf", "host", 4 ] ]'
|
|
|
|
will create a ruleset that will select two crush buckets of type
|
|
*rack* and for each of them choose four OSDs, each of them located in
|
|
different buckets of type *host*.
|
|
|
|
The ruleset can also be manually crafted for finer control.
|