doc/rados: edit crush-map-edits.rst (1 of x)

Edit doc/rados/operations/crush-map-edits.rst.

Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com>
Co-authored-by: Cole Mitchell <cole.mitchell.ceph@gmail.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
Zac Dover 2023-06-23 10:59:10 +10:00
parent 7917b8de66
commit f447c290a8


@@ -1,25 +1,24 @@
Manually editing the CRUSH Map
==============================
.. note:: Manually editing the CRUSH map is an advanced administrator
operation. For the majority of installations, CRUSH changes can be
implemented via the Ceph CLI and do not require manual CRUSH map edits. If
you have identified a use case where manual edits *are* necessary with a
recent Ceph release, consider contacting the Ceph developers at dev@ceph.io
so that future versions of Ceph do not have this problem.
To edit an existing CRUSH map, carry out the following procedure:
#. `Get the CRUSH map`_.
#. `Decompile`_ the CRUSH map.
#. Edit at least one of the following sections: `Devices`_, `Buckets`_, and
`Rules`_. Use a text editor for this task.
#. `Recompile`_ the CRUSH map.
#. `Set the CRUSH map`_.
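Taken together, these steps might look like the following sequence of commands. This is
only a sketch: the filenames ``crushmap.bin``, ``crushmap.txt``, and ``crushmap-new.bin``
are arbitrary examples, and any text editor can be used in place of ``vim``:

.. prompt:: bash $

   # retrieve the compiled CRUSH map from the cluster
   ceph osd getcrushmap -o crushmap.bin
   # decompile it into an editable text file
   crushtool -d crushmap.bin -o crushmap.txt
   # edit devices, buckets, and/or rules
   vim crushmap.txt
   # recompile the edited map and upload it
   crushtool -c crushmap.txt -o crushmap-new.bin
   ceph osd setcrushmap -i crushmap-new.bin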
For details on setting the CRUSH map rule for a specific pool, see `Set Pool
Values`_.
.. _Get the CRUSH map: #getcrushmap
.. _Decompile: #decompilecrushmap
@@ -32,193 +31,204 @@ Pool Values`_.
.. _getcrushmap:
Get the CRUSH Map
-----------------
To get the CRUSH map for your cluster, run a command of the following form:
.. prompt:: bash $
ceph osd getcrushmap -o {compiled-crushmap-filename}
Ceph outputs (``-o``) a compiled CRUSH map to the filename that you have
specified. Because the CRUSH map is in a compiled form, you must first
decompile it before you can edit it.
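For example, to write the compiled map to a file named ``crushmap.bin`` (an arbitrary
example filename):

.. prompt:: bash $

   ceph osd getcrushmap -o crushmap.bin

The resulting file is a binary blob and is not human-readable until it has been
decompiled.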
.. _decompilecrushmap:
Decompile the CRUSH Map
-----------------------
To decompile the CRUSH map, run a command of the following form:
.. prompt:: bash $
crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename}
.. _compilecrushmap:
Recompile the CRUSH Map
-----------------------
To compile the CRUSH map, run a command of the following form:
.. prompt:: bash $
crushtool -c {decompiled-crushmap-filename} -o {compiled-crushmap-filename}
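Before uploading a recompiled map, you may want to sanity-check it with ``crushtool``'s
test mode. The following command is a suggested (not mandatory) check that asks CRUSH to
compute mappings for a given rule; adjust the example rule ID and replica count to match
your pools:

.. prompt:: bash $

   crushtool -i {compiled-crushmap-filename} --test --show-statistics --rule 0 --num-rep 3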
.. _setcrushmap:
Set the CRUSH Map
-----------------
To set the CRUSH map for your cluster, run a command of the following form:
.. prompt:: bash $
ceph osd setcrushmap -i {compiled-crushmap-filename}
Ceph loads (``-i``) a compiled CRUSH map from the filename that you have
specified.
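Because the new map takes effect immediately and may trigger data movement, it can be
prudent to save a copy of the current map before replacing it, so that you can revert if
necessary (the filename ``crushmap-backup.bin`` is an arbitrary example):

.. prompt:: bash $

   ceph osd getcrushmap -o crushmap-backup.bin
   ceph osd setcrushmap -i {compiled-crushmap-filename}

To revert, run ``ceph osd setcrushmap -i crushmap-backup.bin``.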
Sections
--------
A CRUSH map has six main sections:
#. **tunables:** The preamble at the top of the map describes any *tunables*
that are not a part of legacy CRUSH behavior. These tunables correct for old
bugs, optimizations, or other changes that have been made over the years to
improve CRUSH's behavior.
#. **devices:** Devices are individual OSDs that store data.
#. **types**: Bucket ``types`` define the types of buckets that are used in
your CRUSH hierarchy.
#. **buckets:** Buckets consist of a hierarchical aggregation of storage
locations (for example, rows, racks, chassis, hosts) and their assigned
weights. After the bucket ``types`` have been defined, the CRUSH map defines
each node in the hierarchy, its type, and which devices or other nodes it
contains.
#. **rules:** Rules define policy about how data is distributed across
devices in the hierarchy.
#. **choose_args:** ``choose_args`` are alternative weights associated with
the hierarchy that have been adjusted in order to optimize data placement. A
single ``choose_args`` map can be used for the entire cluster, or a number
of ``choose_args`` maps can be created such that each map is crafted for a
particular pool.
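In a decompiled CRUSH map, these sections appear in the order shown in the following
abbreviated skeleton. The skeleton is illustrative only: the tunables, devices, buckets,
and rules in your own map will differ, and the optional ``choose_args`` section, when
present, appears near the end of the map::

   # begin crush map
   tunable choose_total_tries 50
   tunable chooseleaf_descend_once 1

   # devices
   device 0 osd.0 class ssd

   # types
   type 0 osd
   type 1 host

   # buckets
   host node1 {
           ...
   }

   # rules
   rule replicated_rule {
           ...
   }

   # end crush map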
.. _crushmapdevices:
CRUSH-Map Devices
-----------------
Devices are individual OSDs that store data. In this section, there is usually
one device defined for each OSD daemon in your cluster. Devices are identified
by an ``id`` (a non-negative integer) and a ``name`` (usually ``osd.N``, where
``N`` is the device's ``id``).
.. _crush-map-device-class:
A device can also have a *device class* associated with it: for example,
``hdd`` or ``ssd``. Device classes make it possible for devices to be targeted
by CRUSH rules. This means that device classes allow CRUSH rules to select only
OSDs that match certain characteristics. For example, you might want an RBD
pool associated only with SSDs and a different RBD pool associated only with
HDDs.
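For example, a replicated CRUSH rule that selects only SSD-backed OSDs can be created
from the CLI, without manually editing the map, and then assigned to a pool. The rule
name ``fast-ssd`` and the pool name ``rbd-ssd`` below are examples only:

.. prompt:: bash $

   ceph osd crush rule create-replicated fast-ssd default host ssd
   ceph osd pool set rbd-ssd crush_rule fast-ssd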
To see a list of devices, run the following command:
.. prompt:: bash #
ceph device ls
The output of this command takes the following form:
::
device {num} {osd.name} [class {class}]
For example:
.. prompt:: bash #
ceph device ls
::
device 0 osd.0 class ssd
device 1 osd.1 class hdd
device 2 osd.2
device 3 osd.3
In most cases, each device maps to a corresponding ``ceph-osd`` daemon. This
daemon might map to a single storage device, a pair of devices (for example,
one for data and one for a journal or metadata), or in some cases a small RAID
device or a partition of a larger storage device.
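To confirm which physical device or devices back a particular OSD on a running cluster,
you can inspect that OSD's metadata (OSD ``0`` is used here only as an example); the
output includes the device nodes used by the daemon:

.. prompt:: bash $

   ceph osd metadata 0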
CRUSH-Map Bucket Types
----------------------
The second list in the CRUSH map defines 'bucket' types. Buckets facilitate a
hierarchy of nodes and leaves. Node buckets (also known as non-leaf buckets)
typically represent physical locations in a hierarchy. Nodes aggregate other
nodes or leaves. Leaf buckets represent ``ceph-osd`` daemons and their
corresponding storage media.
.. tip:: In the context of CRUSH, the term "bucket" is used to refer to
a node in the hierarchy (that is, to a location or a piece of physical
hardware). In the context of RADOS Gateway APIs, however, the term
"bucket" has a different meaning.
To add a bucket type to the CRUSH map, create a new line under the list of
bucket types. Enter ``type`` followed by a unique numeric ID and a bucket name.
By convention, there is exactly one leaf bucket type and it is ``type 0``;
however, you may give the leaf bucket any name you like (for example: ``osd``,
``disk``, ``drive``, ``storage``)::
# types
type {num} {bucket-name}
For example::
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root
.. _crushmapbuckets:
CRUSH-Map Bucket Hierarchy
--------------------------
The CRUSH algorithm distributes data objects among storage devices according to
a per-device weight value, approximating a uniform probability distribution.
CRUSH distributes objects and their replicas according to the hierarchical
cluster map you define. The CRUSH map represents the available storage devices
and the logical elements that contain them.
To map placement groups (PGs) to OSDs across failure domains, a CRUSH map
defines a hierarchical list of bucket types under ``#types`` in the generated
CRUSH map. The purpose of creating a bucket hierarchy is to segregate the leaf
nodes according to their failure domains (for example: hosts, chassis, racks,
power distribution units, pods, rows, rooms, and data centers). With the
exception of the leaf nodes that represent OSDs, the hierarchy is arbitrary and
you may define it according to your own needs.
We recommend adapting your CRUSH map to your preferred hardware-naming
conventions and using bucket names that clearly reflect the physical
hardware. Clear naming practice can make it easier to administer the cluster
and easier to troubleshoot problems when OSDs malfunction (or other hardware
malfunctions) and the administrator needs access to physical hardware.
In the following example, the bucket hierarchy has a leaf bucket named ``osd``
and two node buckets named ``host`` and ``rack``:
.. ditaa::
+-----------+
@@ -240,121 +250,137 @@ and two node buckets named ``host`` and ``rack`` respectively.
| Bucket | | Bucket | | Bucket | | Bucket |
+-----------+ +-----------+ +-----------+ +-----------+
.. note:: The higher-numbered ``rack`` bucket type aggregates the
lower-numbered ``host`` bucket type.
Because leaf nodes reflect storage devices that have already been declared
under the ``#devices`` list at the beginning of the CRUSH map, there is no need
to declare them as bucket instances. The second-lowest bucket type in your
hierarchy is typically used to aggregate the devices (that is, the
second-lowest bucket type is usually the computer that contains the storage
media, and it can be described by whatever term you prefer, such as ``node``,
``computer``, ``server``, ``host``, or
``machine``). In high-density environments, it is common to have multiple hosts
or nodes in a single chassis (for example, in the cases of blades or twins). It
is important to anticipate the potential consequences of chassis failure -- for
example, during the replacement of a chassis in case of a node failure, the
chassis's hosts or nodes (and their associated OSDs) will be in a ``down``
state.
To declare a bucket instance, do the following: specify its type, give it a
unique name (an alphanumeric string), assign it a unique ID expressed as a
negative integer (this is optional), assign it a weight relative to the total
capacity and capability of the item(s) in the bucket, assign it a bucket
algorithm (usually ``straw2``), and specify the bucket algorithm's hash
(usually ``0``, a setting that reflects the hash algorithm ``rjenkins1``). A
bucket may have one or more items. The items may consist of node buckets or
leaves. Items may have a weight that reflects the relative weight of the item.
To declare a node bucket, use the following syntax::
[bucket-type] [bucket-name] {
id [a unique negative numeric ID]
weight [the relative capacity/capability of the item(s)]
alg [the bucket type: uniform | list | tree | straw | straw2 ]
hash [the hash type: 0 by default]
item [item-name] weight [weight]
}
For example, in the above diagram, two host buckets (referred to in the
declaration below as ``node1`` and ``node2``) and one rack bucket (referred to
in the declaration below as ``rack1``) are defined. The OSDs are declared as
items within the host buckets::
host node1 {
id -1
alg straw2
hash 0
item osd.0 weight 1.00
item osd.1 weight 1.00
}
host node2 {
id -2
alg straw2
hash 0
item osd.2 weight 1.00
item osd.3 weight 1.00
}
rack rack1 {
id -3
alg straw2
hash 0
item node1 weight 2.00
item node2 weight 2.00
}
.. note:: In this example, the rack bucket does not contain any OSDs. Instead,
it contains lower-level host buckets and includes the sum of their weight in
the item entry.
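After a map that contains such a hierarchy has been compiled and installed, you can
confirm on a running cluster that the hierarchy looks the way you intended by viewing
the CRUSH tree:

.. prompt:: bash $

   ceph osd tree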
.. topic:: Bucket Types
Ceph supports five bucket types. Each bucket type provides a balance between
performance and reorganization efficiency, and each is different from the
others. If you are unsure of which bucket type to use, use the ``straw2``
bucket. For a more technical discussion of bucket types than is offered
here, see **Section 3.4** of `CRUSH - Controlled, Scalable, Decentralized
Placement of Replicated Data`_.
The bucket types are as follows:
#. **uniform**: Uniform buckets aggregate devices that have **exactly**
the same weight. For example, when hardware is commissioned or
decommissioned, it is often done in sets of machines that have exactly
the same physical configuration (this can be the case, for example,
after bulk purchases). When storage devices have exactly the same
weight, you may use the ``uniform`` bucket type, which allows CRUSH to
map replicas into uniform buckets in constant time. If your devices have
non-uniform weights, you should not use the uniform bucket algorithm.
#. **list**: List buckets aggregate their content as linked lists. The
behavior of list buckets is governed by the :abbr:`RUSH (Replication
Under Scalable Hashing)`:sub:`P` algorithm. In the behavior of this
bucket type, an object is either relocated to the newest device in
accordance with an appropriate probability, or it remains on the older
devices as before. This results in optimal data migration when items are
added to the bucket. The removal of items from the middle or the tail of
the list, however, can result in a significant amount of unnecessary
data movement. This means that list buckets are most suitable for
circumstances in which they **never shrink or very rarely shrink**.
#. **tree**: Tree buckets use a binary search tree. They are more efficient
at dealing with buckets that contain many items than are list buckets.
The behavior of tree buckets is governed by the :abbr:`RUSH (Replication
Under Scalable Hashing)`:sub:`R` algorithm. Tree buckets reduce the
placement time to O(log\ :sub:`n`). This means that tree buckets are
suitable for managing large sets of devices or nested buckets.
#. **straw**: Straw buckets allow all items in the bucket to "compete"
against each other for replica placement through a process analogous to
drawing straws. This is different from the behavior of list buckets and
tree buckets, which use a divide-and-conquer strategy that either gives
certain items precedence (for example, those at the beginning of a list)
or obviates the need to consider entire subtrees of items. Such an
approach improves the performance of the replica placement process, but
can also introduce suboptimal reorganization behavior when the contents
of a bucket change due to an addition, a removal, or the re-weighting of an
item.
#. **straw2**: Straw2 buckets improve on Straw by correctly avoiding
any data movement between items when neighbor weights change. For
example, if the weight of a given item changes (including during the
operations of adding it to the cluster or removing it from the
cluster), there will be data movement to or from only that item.
Neighbor weights are not taken into account.
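If a cluster still contains legacy ``straw`` buckets, they can be converted to
``straw2`` in a single step, without manually editing the map, provided that all clients
and daemons in the cluster are recent enough to understand ``straw2``:

.. prompt:: bash $

   ceph osd crush set-all-straw-buckets-to-straw2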
.. topic:: Hash
Each bucket uses a hash algorithm. As of Reef, Ceph supports the
``rjenkins1`` algorithm. To select ``rjenkins1`` as the hash algorithm,
enter ``0`` as your hash setting.
.. _weightingbucketitems: