=========================
Coupled LAYer code plugin
=========================

Coupled LAYer (CLAY) codes are erasure codes designed to save network
bandwidth and disk IO when a failed node/OSD/rack is being repaired. Let:

    d = number of OSDs contacted during repair

If *jerasure* is configured with *k=8* and *m=4*, losing one OSD requires
reading from the *d=8* others to repair. Recovery of, say, 1GiB of data then
requires a download of 8 X 1GiB = 8GiB of information.

However, in the case of the *clay* plugin, *d* is configurable such that:

    k+1 <= d <= k+m-1

By default, the clay code plugin picks *d=k+m-1*, as it gives the greatest
savings in network bandwidth and disk IO. In the case of the *clay* plugin
configured with *k=8*, *m=4* and *d=11*, when a single OSD fails, *d=11* OSDs
are contacted and 1GiB/4 = 256MiB is downloaded from each of them, resulting
in a total download of 11 X 256MiB = 2.75GiB of information. More general
parameters are given in the table below. The benefits are substantial when
the repair is done for a rack that stores information on the order of
terabytes.

+--------------+---------------------------+
| plugin       | total amount of disk IO   |
+==============+===========================+
| jerasure,isa | k*S                       |
+--------------+---------------------------+
| clay         | d*S/(d-k+1) = (k+m-1)*S/m |
+--------------+---------------------------+

where *S* is the amount of data stored on the single OSD being repaired. In
the table above, we use the maximum possible value of *d*, which minimizes
the amount of data transmitted during recovery.
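
The table can be sanity-checked with a little shell arithmetic. The sketch
below (plain bash plus an ``awk`` one-liner; the variable names are only
illustrative) reproduces the 8GiB and 2.75GiB figures from the example above::

    $ k=8 m=4 S_GiB=1
    $ echo "jerasure/isa repair traffic: $(( k * S_GiB )) GiB"
    jerasure/isa repair traffic: 8 GiB
    $ d=$(( k + m - 1 ))
    $ awk -v d=$d -v k=$k -v S=$S_GiB \
          'BEGIN { printf "clay repair traffic: %.2f GiB\n", d * S / (d - k + 1) }'
    clay repair traffic: 2.75 GiB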

Erasure code profile examples
=============================

An example configuration that can be used to observe reduced bandwidth usage::

    $ ceph osd erasure-code-profile set CLAYprofile \
         plugin=clay \
         k=4 m=2 d=5 \
         crush-failure-domain=host
    $ ceph osd pool create claypool 12 12 erasure CLAYprofile
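
The stored profile can be inspected afterwards to confirm what was recorded
(the profile name above is just an example)::

    $ ceph osd erasure-code-profile get CLAYprofile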

Create a clay profile
=====================

To create a new clay code profile::

    ceph osd erasure-code-profile set {name} \
         plugin=clay \
         k={data-chunks} \
         m={coding-chunks} \
         [d={helper-chunks}] \
         [scalar_mds={plugin-name}] \
         [technique={technique-name}] \
         [crush-failure-domain={bucket-type}] \
         [directory={directory}] \
         [--force]

Where:

``k={data-chunks}``

:Description: Each object is split into **data-chunks** parts,
              each stored on a different OSD.

:Type: Integer
:Required: Yes.
:Example: 4

``m={coding-chunks}``

:Description: Compute **coding chunks** for each object and store them
              on different OSDs. The number of coding chunks is also
              the number of OSDs that can be down without losing data.

:Type: Integer
:Required: Yes.
:Example: 2

``d={helper-chunks}``

:Description: Number of OSDs requested to send data during recovery of
              a single chunk. *d* needs to be chosen such that
              k+1 <= d <= k+m-1. The larger the *d*, the better the savings.

:Type: Integer
:Required: No.
:Default: k+m-1

``scalar_mds={jerasure|isa|shec}``

:Description: **scalar_mds** specifies the plugin that is used as a
              building block in the layered construction. It can be
              one of *jerasure*, *isa*, *shec*.

:Type: String
:Required: No.
:Default: jerasure

``technique={technique}``

:Description: **technique** specifies the technique that will be picked
              within the 'scalar_mds' plugin specified. Supported techniques
              are 'reed_sol_van', 'reed_sol_r6_op', 'cauchy_orig',
              'cauchy_good', 'liber8tion' for jerasure, 'reed_sol_van',
              'cauchy' for isa and 'single', 'multiple' for shec.

:Type: String
:Required: No.
:Default: reed_sol_van (for jerasure, isa), single (for shec)

``crush-root={root}``

:Description: The name of the CRUSH bucket used for the first step of
              the CRUSH rule. For instance **step take default**.

:Type: String
:Required: No.
:Default: default

``crush-failure-domain={bucket-type}``

:Description: Ensure that no two chunks are in a bucket with the same
              failure domain. For instance, if the failure domain is
              **host** no two chunks will be stored on the same
              host. It is used to create a CRUSH rule step such as **step
              chooseleaf host**.

:Type: String
:Required: No.
:Default: host

``crush-device-class={device-class}``

:Description: Restrict placement to devices of a specific class (e.g.,
              ``ssd`` or ``hdd``), using the crush device class names
              in the CRUSH map.

:Type: String
:Required: No.
:Default:

``directory={directory}``

:Description: Set the **directory** name from which the erasure code
              plugin is loaded.

:Type: String
:Required: No.
:Default: /usr/lib/ceph/erasure-code

``--force``

:Description: Override an existing profile by the same name.

:Type: String
:Required: No.
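
As a concrete instance of the optional parameters described above, a profile
that layers clay on top of the *isa* plugin with its *cauchy* technique might
be created as follows (the profile name is only illustrative)::

    ceph osd erasure-code-profile set clay-isa-profile \
         plugin=clay \
         k=8 m=4 d=11 \
         scalar_mds=isa \
         technique=cauchy \
         crush-failure-domain=host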

Notion of sub-chunks
====================

The clay code is able to save disk IO and network bandwidth because it is a
vector code: it operates on each chunk at a finer granularity, in units called
sub-chunks. The number of sub-chunks within a chunk for a clay code is
given by:

    sub-chunk count = q\ :sup:`(k+m)/q`, where q = d-k+1

During repair of an OSD, the helper information requested
from an available OSD is only a fraction of a chunk. In fact, the number
of sub-chunks within a chunk that are accessed during repair is given by:

    repair sub-chunk count = sub-chunk count / q

Examples
--------

#. For a configuration with *k=4*, *m=2*, *d=5*, the sub-chunk count is
   8 and the repair sub-chunk count is 4. Therefore, only half of a chunk is read
   during repair.
#. When *k=8*, *m=4*, *d=11* the sub-chunk count is 64 and the repair sub-chunk count
   is 16. A quarter of a chunk is read from an available OSD for repair of a failed
   chunk.
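
These counts are easy to recompute for any candidate configuration using the
formula above (which assumes that q divides k+m, as in both examples). A quick
check with bash arithmetic, using illustrative variable names::

    $ k=4 m=2 d=5
    $ q=$(( d - k + 1 ))
    $ echo "sub-chunk count: $(( q ** ((k + m) / q) ))"
    sub-chunk count: 8
    $ echo "repair sub-chunk count: $(( q ** ((k + m) / q) / q ))"
    repair sub-chunk count: 4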

How to choose a configuration given a workload
==============================================

During repair, only a few of the sub-chunks within a chunk are read, and these
sub-chunks are not necessarily stored consecutively within the chunk. For the
best disk IO performance, it is helpful to read contiguous data, so choose the
stripe-size such that the sub-chunk size is sufficiently large.

For a given stripe-size (fixed by the workload), choose ``k``, ``m`` and ``d`` such that::

    sub-chunk size = stripe-size / (k*sub-chunk count) = 4KB, 8KB, 12KB ...

#. For large workloads, for which the stripe-size is large, it is easy to choose k, m, d.
   For example, for a stripe-size of 64MB, choosing *k=16*, *m=4* and *d=19* will
   result in a sub-chunk count of 1024 and a sub-chunk size of 4KB (see the sketch
   after this list).
#. For small workloads, *k=4*, *m=2* is a good configuration that provides both network
   and disk IO benefits.
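
The sub-chunk size in the first example can be checked the same way (bash
arithmetic with illustrative variable names; 64MB is taken as 64 * 1024 * 1024
bytes)::

    $ k=16 m=4 d=19 stripe=$(( 64 * 1024 * 1024 ))
    $ q=$(( d - k + 1 )); subchunks=$(( q ** ((k + m) / q) ))
    $ echo "sub-chunk count: $subchunks"
    sub-chunk count: 1024
    $ echo "sub-chunk size: $(( stripe / (k * subchunks) )) bytes"
    sub-chunk size: 4096 bytes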

Comparisons with LRC
====================

Locally Recoverable Codes (LRC) are also designed to save network bandwidth and
disk IO during single-OSD recovery. However, the focus in LRC is to keep the
number of OSDs contacted during repair (d) minimal, at the cost of storage
overhead. The *clay* code has a storage overhead of m/k. The *lrc* plugin
stores (k+m)/d parities in addition to the ``m`` parities, resulting in a
storage overhead of (m+(k+m)/d)/k. Both *clay* and *lrc* can recover from the
failure of any ``m`` OSDs.

+-----------------+----------------------------------+----------------------------------+
| Parameters      | disk IO, storage overhead (LRC)  | disk IO, storage overhead (CLAY) |
+=================+==================================+==================================+
| (k=10, m=4)     | 7 * S, 0.6 (d=7)                 | 3.25 * S, 0.4 (d=13)             |
+-----------------+----------------------------------+----------------------------------+
| (k=16, m=4)     | 4 * S, 0.5 (d=5)                 | 4.75 * S, 0.25 (d=19)            |
+-----------------+----------------------------------+----------------------------------+

where ``S`` is the amount of data stored on the single OSD being recovered.
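
The CLAY column follows directly from the formulas given earlier: disk IO is
d*S/(d-k+1) and storage overhead is m/k. For instance, for (k=10, m=4, d=13),
a quick check with an ``awk`` one-liner (variable names are only illustrative)::

    $ awk -v k=10 -v m=4 -v d=13 \
          'BEGIN { printf "clay disk IO: %.2f * S, storage overhead: %.2f\n", d / (d - k + 1), m / k }'
    clay disk IO: 3.25 * S, storage overhead: 0.40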