ceph/doc/rados/operations/erasure-code-clay.rst

=========================
Coupled LAYer code plugin
=========================

Coupled LAYer (CLAY) codes are erasure codes designed to save in terms of network 
bandwidth, disk IO when a failed node/OSD/rack is being repaired. Let:

	d = number of OSDs contacted during repair

If *jerasure* is configured with *k=8* and *m=4*, losing one OSD requires 
reading from the *d=8* others to repair. And recovery of say a 1GiB needs
a download of 8 X 1GiB = 8GiB amount of information.

However, in the case of *clay* plugin *d* is configurable such that:

	k+1 <= d <= k+m-1 

By default the clay code plugin picks *d=k+m-1* as it gives most savings in terms 
of network bandwidth and disk IO. In the case of *clay* plugin configured with 
*k=8*, *m=4* and *d=11* when a single OSD fails d=11 osds are contacted and 
250MiB is downloaded from each of them resulting in download of 11 X 250MiB = 2.75GiB 
amount of information. More general parameters are shown below. The benefits are huge
when the repair is being done for a rack that stores information in the order of 
Tera bytes.

	+-------------+---------------------------+
	| plugin      | total amount of disk IO   |
	+=============+===========================+
	|jerasure,isa | k*S                       |
	+-------------+---------------------------+
	| clay        | d*S/(d-k+1) = (k+m-1)*S/m |
	+-------------+---------------------------+

where *S* is the amount of data stored of single OSD being repaired and 
in the table above, we are using the maximum possible value of d for minimal amount 
of data transmission for recovery.

Erasure code profile examples
=============================

Reduced bandwidth usage can actually be observed.::

        $ ceph osd erasure-code-profile set CLAYprofile \
             plugin=clay \
             k=4 m=2 d=5 \
             crush-failure-domain=host
        $ ceph osd pool create claypool 12 12 erasure CLAYprofile


Create a clay profile
=====================

To create a new clay code profile::

        ceph osd erasure-code-profile set {name} \
             plugin=clay \
             k={data-chunks} \
             m={coding-chunks} \
             [d={helper-chunks}] \
             [scalar_mds={plugin-name}] \
             [technique={technique-name}] \
             [crush-failure-domain={bucket-type}] \
             [directory={directory}] \
             [--force]

Where:

``k={data chunks}``

:Description: Each object is split in **data-chunks** parts,
              each stored on a different OSD.

:Type: Integer
:Required: Yes.
:Example: 4

``m={coding-chunks}``

:Description: Compute **coding chunks** for each object and store them
              on different OSDs. The number of coding chunks is also
              the number of OSDs that can be down without losing data.

:Type: Integer
:Required: Yes.
:Example: 2

``d={helper-chunks}``

:Description: Number of OSDs requested to send data during recovery of
              a single chunk. *d* needs to be chosen such that
              k+1 <= d <= k+m-1. Larger the *d*, better the savings.

:Type: Integer
:Required: No.
:Default: k+m-1

``scalar_mds={jerasure|isa|shec}``

:Description: **scalar_mds** specifies the plugin that is used as a 
             building block in the layered construction. It can be 
             one of *jerasure*, *isa*, *shec*

:Type: String
:Required: No.
:Default: jerasure

``technique={technique}``

:Description: **technique** specifies the technique that will be picked
             within the 'scalar_mds' plugin specified. Supported techniques
             are 'reed_sol_van', 'reed_sol_r6_op', 'cauchy_orig', 
             'cauchy_good', 'liber8tion' for jerasure, 'reed_sol_van',
             'cauchy' for isa and 'single', 'multiple' for shec.

:Type: String
:Required: No.
:Default: reed_sol_van (for jerasure, isa), single (for shec)


``crush-root={root}``

:Description: The name of the crush bucket used for the first step of
              the CRUSH rule. For intance **step take default**.

:Type: String
:Required: No.
:Default: default


``crush-failure-domain={bucket-type}``

:Description: Ensure that no two chunks are in a bucket with the same
              failure domain. For instance, if the failure domain is
              **host** no two chunks will be stored on the same
              host. It is used to create a CRUSH rule step such as **step
              chooseleaf host**.

:Type: String
:Required: No.
:Default: host

``crush-device-class={device-class}``

:Description: Restrict placement to devices of a specific class (e.g.,
              ``ssd`` or ``hdd``), using the crush device class names
              in the CRUSH map.

:Type: String
:Required: No.
:Default:

``directory={directory}``

:Description: Set the **directory** name from which the erasure code
              plugin is loaded.

:Type: String
:Required: No.
:Default: /usr/lib/ceph/erasure-code

``--force``

:Description: Override an existing profile by the same name.

:Type: String
:Required: No.


Notion of sub-chunks
====================

Clay code is able to save in terms of disk IO, network bandwidth as it
is a vector code and it can see a chunk at a finer granularity called 
sub-chunks. Number of sub-chunks within a chunk for a clay code is
given by:

	sub-chunk count = q\ :sup:`(k+m)/q`, where q=d-k+1


During repair of a OSD, the helper information requested
from an available OSD is only a fraction of a chunk. In fact, the number
of sub-chunks within a chunk that are accessed during repair is given by:

	repair sub-chunk count = sub-chunk count / q

Examples
--------

#. For a configuration with *k=4*, *m=2*, *d=5*, the sub-chunk count is
   8 and  the repair sub-chunk count is 4. Therefore, only half of a chunk is read 
   during repair.
#. When *k=8*, *m=4*, *d=11* the sub-chunk count is 64 and repair sub-chunk count
   is 16. A quarter of a chunk is read from an available OSD for repair of a failed 
   chunk.


How to choose configuration given a workload
============================================

Only a few sub-chunks are read of all the sub-chunks within a chunk. These sub-chunks
are not necessarily stored consecutively within a chunk. For best disk IO 
performance, it is helpful to read contiguous data. Choose stripe-size such that
sub-chunk size is sufficiently large.

For a given stripe-size (that's fixed based on a workload), choose ``k``, ``m``, ``d`` such that::

	sub-chunk size = stripe-size / (k*sub-chunk count) = 4KB, 8KB, 12KB ...

#. For large size workloads for which stripe size is large it is easy to choose k, m, d.
   For example consider stripe-size of size 64MB, choosing *k=16*, *m=4* and *d=19* will
   result in a sub-chunk count of 1024 and sub-chunk size of 4KB.
#. For small size workloads *k=4*, *m=2* is a good configuration that gives both network
   and disk IO benefits.

Comparisons with LRC
====================

Locally Recoverable Codes (LRC) are also designed in order to save in terms of network
bandwidth, disk IO during single OSD recovery. However, the focus in LRCs is to keep the
number of OSDs contacted during repair (d) to be minimal at the cost of storage overhead.
*clay* code has a storage overhead m/k. In the case of *lrc*, it stores (k+m)/d parities in
addition to the ``m`` parities resulting in a storage overhead (m+(k+m)/d)/k. Both *clay* and *lrc*
can recover from failure of any ``m`` OSDs.

	+-----------------+----------------------------------+----------------------------------+
	| Parameters      | disk IO, storage overhead (LRC)  | disk IO, storage overhead (CLAY) |
	+=================+================+=================+==================================+
	| (k=10, m=4)     | 7 * S, 0.6 (d=7)                 | 3.25 * S, 0.4 (d=13)             |
	+-----------------+----------------------------------+----------------------------------+
	| (k=16, m=4)     | 4 * S, 0.5 (d=5)                 | 4.75 * S, 0.25 (d=19)            |
	+-----------------+----------------------------------+----------------------------------+


where ``S`` is the amount of data stored of single OSD being recovered.
doc: clay code plugin Signed-off-by: Myna <mynaramana@gmail.com> 2018-10-08 13:26:01 +00:00			`=========================`
			`Coupled LAYer code plugin`
			`=========================`

			`Coupled LAYer (CLAY) codes are erasure codes designed to save in terms of network`
			`bandwidth, disk IO when a failed node/OSD/rack is being repaired. Let:`

			`d = number of OSDs contacted during repair`

			`If jerasure is configured with k=8 and m=4, losing one OSD requires`
			`reading from the d=8 others to repair. And recovery of say a 1GiB needs`
			`a download of 8 X 1GiB = 8GiB amount of information.`

			`However, in the case of clay plugin d is configurable such that:`

			`k+1 <= d <= k+m-1`

			`By default the clay code plugin picks d=k+m-1 as it gives most savings in terms`
			`of network bandwidth and disk IO. In the case of clay plugin configured with`
			`k=8, m=4 and d=11 when a single OSD fails d=11 osds are contacted and`
			`250MiB is downloaded from each of them resulting in download of 11 X 250MiB = 2.75GiB`
			`amount of information. More general parameters are shown below. The benefits are huge`
			`when the repair is being done for a rack that stores information in the order of`
			`Tera bytes.`

			`+-------------+---------------------------+`
			`\| plugin \| total amount of disk IO \|`
			`+=============+===========================+`
			`\|jerasure,isa \| k*S \|`
			`+-------------+---------------------------+`
			`\| clay \| dS/(d-k+1) = (k+m-1)S/m \|`
			`+-------------+---------------------------+`

			`where S is the amount of data stored of single OSD being repaired and`
			`in the table above, we are using the maximum possible value of d for minimal amount`
			`of data transmission for recovery.`

			`Erasure code profile examples`
			`=============================`

			`Reduced bandwidth usage can actually be observed.::`

			`$ ceph osd erasure-code-profile set CLAYprofile \`
			`plugin=clay \`
			`k=4 m=2 d=5 \`
			`crush-failure-domain=host`
			`$ ceph osd pool create claypool 12 12 erasure CLAYprofile`


			`Create a clay profile`
			`=====================`

			`To create a new clay code profile::`

			`ceph osd erasure-code-profile set {name} \`
			`plugin=clay \`
			`k={data-chunks} \`
			`m={coding-chunks} \`
			`[d={helper-chunks}] \`
			`[scalar_mds={plugin-name}] \`
			`[technique={technique-name}] \`
			`[crush-failure-domain={bucket-type}] \`
			`[directory={directory}] \`
			`[--force]`

			`Where:`

			``k={data chunks}``

			`:Description: Each object is split in data-chunks parts,`
			`each stored on a different OSD.`

			`:Type: Integer`
			`:Required: Yes.`
			`:Example: 4`

			``m={coding-chunks}``

			`:Description: Compute coding chunks for each object and store them`
			`on different OSDs. The number of coding chunks is also`
			`the number of OSDs that can be down without losing data.`

			`:Type: Integer`
			`:Required: Yes.`
			`:Example: 2`

			``d={helper-chunks}``

			`:Description: Number of OSDs requested to send data during recovery of`
			`a single chunk. d needs to be chosen such that`
			`k+1 <= d <= k+m-1. Larger the d, better the savings.`

			`:Type: Integer`
			`:Required: No.`
			`:Default: k+m-1`

			``scalar_mds={jerasure\|isa\|shec}``

			`:Description: scalar_mds specifies the plugin that is used as a`
			`building block in the layered construction. It can be`
			`one of jerasure, isa, shec`

			`:Type: String`
			`:Required: No.`
			`:Default: jerasure`

			``technique={technique}``

			`:Description: technique specifies the technique that will be picked`
			`within the 'scalar_mds' plugin specified. Supported techniques`
			`are 'reed_sol_van', 'reed_sol_r6_op', 'cauchy_orig',`
			`'cauchy_good', 'liber8tion' for jerasure, 'reed_sol_van',`
			`'cauchy' for isa and 'single', 'multiple' for shec.`

			`:Type: String`
			`:Required: No.`
			`:Default: reed_sol_van (for jerasure, isa), single (for shec)`


			``crush-root={root}``

			`:Description: The name of the crush bucket used for the first step of`
			`the CRUSH rule. For intance step take default.`

			`:Type: String`
			`:Required: No.`
			`:Default: default`


			``crush-failure-domain={bucket-type}``

			`:Description: Ensure that no two chunks are in a bucket with the same`
			`failure domain. For instance, if the failure domain is`
			`host no two chunks will be stored on the same`
			`host. It is used to create a CRUSH rule step such as **step`
			`chooseleaf host**.`

			`:Type: String`
			`:Required: No.`
			`:Default: host`

			``crush-device-class={device-class}``

			`:Description: Restrict placement to devices of a specific class (e.g.,`
			``ssd`` or ``hdd``), using the crush device class names
			`in the CRUSH map.`

			`:Type: String`
			`:Required: No.`
			`:Default:`

			``directory={directory}``

			`:Description: Set the directory name from which the erasure code`
			`plugin is loaded.`

			`:Type: String`
			`:Required: No.`
			`:Default: /usr/lib/ceph/erasure-code`

			``--force``

			`:Description: Override an existing profile by the same name.`

			`:Type: String`
			`:Required: No.`


			`Notion of sub-chunks`
			`====================`

			`Clay code is able to save in terms of disk IO, network bandwidth as it`
			`is a vector code and it can see a chunk at a finer granularity called`
			`sub-chunks. Number of sub-chunks within a chunk for a clay code is`
			`given by:`

			sub-chunk count = q\ :sup:`(k+m)/q`, where q=d-k+1


			`During repair of a OSD, the helper information requested`
			`from an available OSD is only a fraction of a chunk. In fact, the number`
			`of sub-chunks within a chunk that are accessed during repair is given by:`

			`repair sub-chunk count = sub-chunk count / q`

			`Examples`
			`--------`

			`#. For a configuration with k=4, m=2, d=5, the sub-chunk count is`
			`8 and the repair sub-chunk count is 4. Therefore, only half of a chunk is read`
			`during repair.`
			`#. When k=8, m=4, d=11 the sub-chunk count is 64 and repair sub-chunk count`
			`is 16. A quarter of a chunk is read from an available OSD for repair of a failed`
			`chunk.`



			`How to choose configuration given a workload`
			`============================================`

			`Only a few sub-chunks are read of all the sub-chunks within a chunk. These sub-chunks`
			`are not necessarily stored consecutively within a chunk. For best disk IO`
			`performance, it is helpful to read contiguous data. Choose stripe-size such that`
			`sub-chunk size is sufficiently large.`

			For a given stripe-size (that's fixed based on a workload), choose ``k``, ``m``, ``d`` such that::

			`sub-chunk size = stripe-size / (k*sub-chunk count) = 4KB, 8KB, 12KB ...`

			`#. For large size workloads for which stripe size is large it is easy to choose k, m, d.`
			`For example consider stripe-size of size 64MB, choosing k=16, m=4 and d=19 will`
			`result in a sub-chunk count of 1024 and sub-chunk size of 4KB.`
			`#. For small size workloads k=4, m=2 is a good configuration that gives both network`
			`and disk IO benefits.`

			`Comparisons with LRC`
			`====================`

			`Locally Recoverable Codes (LRC) are also designed in order to save in terms of network`
			`bandwidth, disk IO during single OSD recovery. However, the focus in LRCs is to keep the`
			`number of OSDs contacted during repair (d) to be minimal at the cost of storage overhead.`
			`clay code has a storage overhead m/k. In the case of lrc, it stores (k+m)/d parities in`
			addition to the ``m`` parities resulting in a storage overhead (m+(k+m)/d)/k. Both clay and lrc
			can recover from failure of any ``m`` OSDs.

			`+-----------------+----------------------------------+----------------------------------+`
			`\| Parameters \| disk IO, storage overhead (LRC) \| disk IO, storage overhead (CLAY) \|`
			`+=================+================+=================+==================================+`
			`\| (k=10, m=4) \| 7 * S, 0.6 (d=7) \| 3.25 * S, 0.4 (d=13) \|`
			`+-----------------+----------------------------------+----------------------------------+`
			`\| (k=16, m=4) \| 4 * S, 0.5 (d=5) \| 4.75 * S, 0.25 (d=19) \|`
			`+-----------------+----------------------------------+----------------------------------+`


			where ``S`` is the amount of data stored of single OSD being recovered.