============
CRUSH Maps
============

The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm
computes storage locations in order to determine how to store and retrieve
data. CRUSH allows Ceph clients to communicate with OSDs directly rather than
through a centralized server or broker. By using an algorithmically-determined
method of storing and retrieving data, Ceph avoids a single point of failure, a
performance bottleneck, and a physical limit to its scalability.

CRUSH uses a map of the cluster (the CRUSH map) to map data to OSDs,
distributing the data across the cluster in accordance with configured
replication policy and failure domains. For a detailed discussion of CRUSH, see
`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_.

CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)`, a hierarchy
of "buckets" (``host``\s, ``rack``\s), and rules that govern how CRUSH
replicates data within the cluster's pools. By reflecting the underlying
physical organization of the installation, CRUSH can model (and thereby
address) the potential for correlated device failures. Some factors relevant
to the CRUSH hierarchy include chassis, racks, physical proximity, a shared
power source, shared networking, and failure domains. By encoding this
information into the CRUSH map, CRUSH placement policies distribute object
replicas across failure domains while maintaining the desired distribution. For
example, to address the possibility of concurrent failures, it might be
desirable to ensure that data replicas are on devices that reside in or rely
upon different shelves, racks, power supplies, controllers, or physical
locations.

When OSDs are deployed, they are automatically added to the CRUSH map under a
``host`` bucket that is named for the node on which the OSDs run. This
behavior, combined with the configured CRUSH failure domain, ensures that
replicas or erasure-code shards are distributed across hosts and that the
failure of a single host or other kinds of failures will not affect
availability. For larger clusters, administrators must carefully consider their
choice of failure domain. For example, distributing replicas across racks is
typical for mid- to large-sized clusters.

CRUSH Location
==============

The location of an OSD within the CRUSH map's hierarchy is referred to as its
``CRUSH location``. The specification of a CRUSH location takes the form of a
list of key-value pairs. For example, if an OSD is in a particular row, rack,
chassis, and host, and is also part of the 'default' CRUSH root (which is the
case for most clusters), its CRUSH location can be specified as follows::

   root=default row=a rack=a2 chassis=a2a host=a2a1

.. note::

   #. The order of the keys does not matter.
   #. The key name (left of ``=``) must be a valid CRUSH ``type``. By default,
      valid CRUSH types include ``root``, ``datacenter``, ``room``, ``row``,
      ``pod``, ``pdu``, ``rack``, ``chassis``, and ``host``. These defined
      types suffice for nearly all clusters, but can be customized by
      modifying the CRUSH map.
   #. Not all keys need to be specified. For example, by default, Ceph
      automatically sets an ``OSD``'s location as ``root=default
      host=HOSTNAME`` (as determined by the output of ``hostname -s``).

The CRUSH location for an OSD can be modified by adding the ``crush location``
option in ``ceph.conf``. When this option has been added, every time the OSD
starts it verifies that it is in the correct location in the CRUSH map and
moves itself if it is not. To disable this automatic CRUSH map management, add
the following to the ``ceph.conf`` configuration file in the ``[osd]``
section::

   osd crush update on start = false

Note that this action is unnecessary in most cases.

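For example, a minimal sketch of a ``ceph.conf`` fragment that pins the OSDs on
a host to a particular rack might look like the following (the ``rack`` and
``host`` names here are purely illustrative)::

   [osd]
   crush location = root=default rack=a2 host=a2a1
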
Custom location hooks
---------------------

A custom location hook can be used to generate a more complete CRUSH location
on startup. The CRUSH location is determined by, in order of preference:

#. A ``crush location`` option in ``ceph.conf``
#. A default of ``root=default host=HOSTNAME``, where the hostname is
   determined by the output of the ``hostname -s`` command

A script can be written to provide additional location fields (for example,
``rack`` or ``datacenter``) and the hook can be enabled via the following
config option::

   crush location hook = /path/to/customized-ceph-crush-location

This hook is passed several arguments (see below). The hook outputs a single
line to ``stdout`` that contains the CRUSH location description. The arguments
passed to the hook resemble the following::

   --cluster CLUSTER --id ID --type TYPE

Here the cluster name is typically ``ceph``, the ``id`` is the daemon
identifier or (in the case of OSDs) the OSD number, and the daemon type is
``osd``, ``mds``, ``mgr``, or ``mon``.

For example, a simple hook that specifies a rack location via a value in the
file ``/etc/rack`` might be as follows::

   #!/bin/sh
   echo "host=$(hostname -s) rack=$(cat /etc/rack) root=default"

CRUSH structure
===============

The CRUSH map consists of (1) a hierarchy that describes the physical topology
of the cluster and (2) a set of rules that defines data placement policy. The
hierarchy has devices (OSDs) at the leaves and internal nodes corresponding to
other physical features or groupings: hosts, racks, rows, data centers, and so
on. The rules determine how replicas are placed in terms of that hierarchy (for
example, 'three replicas in different racks').

Devices
-------

Devices are individual OSDs that store data (usually one device for each
storage drive). Devices are identified by an ``id`` (a non-negative integer)
and a ``name`` (usually ``osd.N``, where ``N`` is the device's ``id``).

In Luminous and later releases, OSDs can have a *device class* assigned (for
example, ``hdd``, ``ssd``, or ``nvme``), allowing them to be targeted by CRUSH
rules. Device classes are especially useful when mixing device types within
hosts.

.. _crush_map_default_types:

Types and Buckets
-----------------

"Bucket", in the context of CRUSH, is a term for any of the internal nodes in
the hierarchy: hosts, racks, rows, and so on. The CRUSH map defines a series of
*types* that are used to identify these nodes. Default types include:

- ``osd`` (or ``device``)
- ``host``
- ``chassis``
- ``rack``
- ``row``
- ``pdu``
- ``pod``
- ``room``
- ``datacenter``
- ``zone``
- ``region``
- ``root``

Most clusters use only a handful of these types, and other types can be defined
as needed.

The hierarchy is built with devices (normally of type ``osd``) at the leaves
and non-device types as the internal nodes. The root node is of type ``root``.
For example:

.. ditaa::

                           +-----------------+
                           |{o}root default  |
                           +--------+--------+
                                    |
                    +---------------+---------------+
                    |                               |
             +------+------+                 +------+------+
             |{o}host foo  |                 |{o}host bar  |
             +------+------+                 +------+------+
                    |                               |
            +-------+-------+               +-------+-------+
            |               |               |               |
      +-----+-----+   +-----+-----+   +-----+-----+   +-----+-----+
      |   osd.0   |   |   osd.1   |   |   osd.2   |   |   osd.3   |
      +-----------+   +-----------+   +-----------+   +-----------+

Each node (device or bucket) in the hierarchy has a *weight* that indicates the
relative proportion of the total data that should be stored by that device or
hierarchy subtree. Weights are set at the leaves, indicating the size of the
device. These weights automatically sum in an 'up the tree' direction: that is,
the weight of the ``root`` node will be the sum of the weights of all devices
contained under it. Weights are typically measured in tebibytes (TiB).

To get a simple view of the cluster's CRUSH hierarchy, including weights, run
the following command:

.. prompt:: bash $

   ceph osd tree

Rules
-----

CRUSH rules define policy governing how data is distributed across the devices
in the hierarchy. The rules define placement as well as replication strategies
or distribution policies that allow you to specify exactly how CRUSH places
data replicas. For example, you might create one rule selecting a pair of
targets for two-way mirroring, another rule for selecting three targets in two
different data centers for three-way replication, and yet another rule for
erasure coding across six storage devices. For a detailed discussion of CRUSH
rules, see **Section 3.2** of `CRUSH - Controlled, Scalable, Decentralized
Placement of Replicated Data`_.

CRUSH rules can be created via the command-line by specifying the *pool type*
that they will govern (replicated or erasure coded), the *failure domain*, and
optionally a *device class*. In rare cases, CRUSH rules must be created by
manually editing the CRUSH map.

To see the rules that are defined for the cluster, run the following command:

.. prompt:: bash $

   ceph osd crush rule ls

To view the contents of the rules, run the following command:

.. prompt:: bash $

   ceph osd crush rule dump

.. _device_classes:

Device classes
--------------

Each device can optionally have a *class* assigned. By default, OSDs
automatically set their class at startup to ``hdd``, ``ssd``, or ``nvme`` in
accordance with the type of device they are backed by.

To explicitly set the device class of one or more OSDs, run a command of the
following form:

.. prompt:: bash $

   ceph osd crush set-device-class <class> <osd-name> [...]

Once a device class has been set, it cannot be changed to another class until
the old class is unset. To remove the old class of one or more OSDs, run a
command of the following form:

.. prompt:: bash $

   ceph osd crush rm-device-class <osd-name> [...]

This restriction allows administrators to set device classes that won't be
changed on OSD restart or by a script.

To create a placement rule that targets a specific device class, run a command
of the following form:

.. prompt:: bash $

   ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>

To apply the new placement rule to a specific pool, run a command of the
following form:

.. prompt:: bash $

   ceph osd pool set <pool-name> crush_rule <rule-name>

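For example, a hypothetical rule named ``fast`` that places data only on
``ssd`` devices under the ``default`` root, with ``host`` as the failure
domain, might be created and applied to a pool named ``rbd`` as follows (the
rule and pool names here are illustrative only):

.. prompt:: bash $

   ceph osd crush rule create-replicated fast default host ssd
   ceph osd pool set rbd crush_rule fast
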
Device classes are implemented by creating one or more "shadow" CRUSH
hierarchies. For each device class in use, there will be a shadow hierarchy
that contains only devices of that class. CRUSH rules can then distribute data
across the relevant shadow hierarchy. This approach is fully backward
compatible with older Ceph clients. To view the CRUSH hierarchy with shadow
items displayed, run the following command:

.. prompt:: bash #

   ceph osd crush tree --show-shadow

Some older clusters that were created before the Luminous release rely on
manually crafted CRUSH maps to maintain per-device-type hierarchies. For these
clusters, there is a *reclassify* tool available that can help them transition
to device classes without triggering unwanted data movement (see
:ref:`crush-reclassify`).

Weight sets
-----------

A *weight set* is an alternative set of weights to use when calculating data
placement. The normal weights associated with each device in the CRUSH map are
set in accordance with the device size and indicate how much data should be
stored where. However, because CRUSH is a probabilistic pseudorandom placement
process, there is always some variation from this ideal distribution (in the
same way that rolling a die sixty times will likely not result in exactly ten
ones and ten sixes). Weight sets allow the cluster to perform numerical
optimization based on the specifics of your cluster (for example: hierarchy,
pools) to achieve a balanced distribution.

Ceph supports two types of weight sets:

#. A **compat** weight set is a single alternative set of weights for each
   device and each node in the cluster. Compat weight sets cannot be expected
   to correct all anomalies (for example, PGs for different pools might be of
   different sizes and have different load levels, but are mostly treated alike
   by the balancer). However, they have the major advantage of being *backward
   compatible* with previous versions of Ceph. This means that even though
   weight sets were first introduced in Luminous v12.2.z, older clients (for
   example, Firefly) can still connect to the cluster when a compat weight set
   is being used to balance data.

#. A **per-pool** weight set is more flexible in that it allows placement to
   be optimized for each data pool. Additionally, weights can be adjusted
   for each position of placement, allowing the optimizer to correct for a
   subtle skew of data toward devices with small weights relative to their
   peers (an effect that is usually apparent only in very large clusters
   but that can cause balancing problems).

When weight sets are in use, the weights associated with each node in the
hierarchy are visible in a separate column (labeled either as ``(compat)`` or
as the pool name) in the output of the following command:

.. prompt:: bash #

   ceph osd tree

If both *compat* and *per-pool* weight sets are in use, data placement for a
particular pool will use its own per-pool weight set if present. If only
*compat* weight sets are in use, data placement will use the compat weight set.
If neither are in use, data placement will use the normal CRUSH weights.

Although weight sets can be set up and adjusted manually, we recommend enabling
the ``ceph-mgr`` *balancer* module to perform these tasks automatically if the
cluster is running Luminous or a later release.

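For example, assuming the ``balancer`` module is available on your release, the
following commands switch it to its ``crush-compat`` mode (which maintains a
compat weight set on your behalf) and turn it on:

.. prompt:: bash $

   ceph balancer mode crush-compat
   ceph balancer on
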
Modifying the CRUSH map
=======================

.. _addosd:

Adding/Moving an OSD
--------------------

.. note:: Under normal conditions, OSDs automatically add themselves to the
   CRUSH map when they are created. The command in this section is rarely
   needed.

To add or move an OSD in the CRUSH map of a running cluster, run a command of
the following form:

.. prompt:: bash $

   ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...]

For details on this command's parameters, see the following:

``name``

:Description: The full name of the OSD.
:Type: String
:Required: Yes
:Example: ``osd.0``

``weight``

:Description: The CRUSH weight of the OSD. Normally, this is its size, as measured in terabytes (TB).
:Type: Double
:Required: Yes
:Example: ``2.0``

``root``

:Description: The root node of the CRUSH hierarchy in which the OSD resides (normally ``default``).
:Type: Key-value pair.
:Required: Yes
:Example: ``root=default``

``bucket-type``

:Description: The OSD's location in the CRUSH hierarchy.
:Type: Key-value pairs.
:Required: No
:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``

In the following example, the command adds ``osd.0`` to the hierarchy, or moves
``osd.0`` from a previous location:

.. prompt:: bash $

   ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1

Adjusting OSD weight
--------------------

.. note:: Under normal conditions, OSDs automatically add themselves to the
   CRUSH map with the correct weight when they are created. The command in this
   section is rarely needed.

To adjust an OSD's CRUSH weight in a running cluster, run a command of the
following form:

.. prompt:: bash $

   ceph osd crush reweight {name} {weight}

For details on this command's parameters, see the following:

``name``

:Description: The full name of the OSD.
:Type: String
:Required: Yes
:Example: ``osd.0``

``weight``

:Description: The CRUSH weight of the OSD.
:Type: Double
:Required: Yes
:Example: ``2.0``

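For example, using the parameter values listed above, the following command
sets the CRUSH weight of ``osd.0`` to ``2.0``:

.. prompt:: bash $

   ceph osd crush reweight osd.0 2.0
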
.. _removeosd:

Removing an OSD
---------------

.. note:: OSDs are normally removed from the CRUSH map as a result of the
   ``ceph osd purge`` command. This command is rarely needed.

To remove an OSD from the CRUSH map of a running cluster, run a command of the
following form:

.. prompt:: bash $

   ceph osd crush remove {name}

For details on the ``name`` parameter, see the following:

``name``

:Description: The full name of the OSD.
:Type: String
:Required: Yes
:Example: ``osd.0``

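For example, using the ``name`` value shown above, the following command
removes ``osd.0`` from the CRUSH map:

.. prompt:: bash $

   ceph osd crush remove osd.0
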
Adding a CRUSH Bucket
---------------------

.. note:: Buckets are implicitly created when an OSD is added and the command
   that creates it specifies a ``{bucket-type}={bucket-name}`` as part of the
   OSD's location (provided that a bucket with that name does not already
   exist). The command in this section is typically used when manually
   adjusting the structure of the hierarchy after OSDs have already been
   created. One use of this command is to move a series of hosts to a new
   rack-level bucket. Another use of this command is to add new ``host``
   buckets (OSD nodes) to a dummy ``root`` so that the buckets don't receive
   any data until they are ready to receive data. When they are ready, move the
   buckets to the ``default`` root or to any other root as described below.

To add a bucket in the CRUSH map of a running cluster, run a command of the
following form:

.. prompt:: bash $

   ceph osd crush add-bucket {bucket-name} {bucket-type}

For details on this command's parameters, see the following:

``bucket-name``

:Description: The full name of the bucket.
:Type: String
:Required: Yes
:Example: ``rack12``

``bucket-type``

:Description: The type of the bucket. This type must already exist in the CRUSH hierarchy.
:Type: String
:Required: Yes
:Example: ``rack``

In the following example, the command adds the ``rack12`` bucket to the
hierarchy:

.. prompt:: bash $

   ceph osd crush add-bucket rack12 rack

Moving a Bucket
---------------

To move a bucket to a different location or position in the CRUSH map
hierarchy, run a command of the following form:

.. prompt:: bash $

   ceph osd crush move {bucket-name} {bucket-type}={bucket-name} [...]

For details on this command's parameters, see the following:

``bucket-name``

:Description: The name of the bucket that you are moving.
:Type: String
:Required: Yes
:Example: ``foo-bar-1``

``bucket-type``

:Description: The bucket's new location in the CRUSH hierarchy.
:Type: Key-value pairs.
:Required: No
:Example: ``datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1``

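For example, a ``rack12`` bucket created as shown in the previous section could
be placed directly under the ``default`` root with the following command:

.. prompt:: bash $

   ceph osd crush move rack12 root=default
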
Removing a Bucket
-----------------

To remove a bucket from the CRUSH hierarchy, run a command of the following
form:

.. prompt:: bash $

   ceph osd crush remove {bucket-name}

.. note:: A bucket must already be empty before it is removed from the CRUSH
   hierarchy. In other words, there must not be OSDs or any other CRUSH buckets
   within it.

For details on the ``bucket-name`` parameter, see the following:

``bucket-name``

:Description: The name of the bucket that is being removed.
:Type: String
:Required: Yes
:Example: ``rack12``

In the following example, the command removes the ``rack12`` bucket from the
hierarchy:

.. prompt:: bash $

   ceph osd crush remove rack12

Creating a compat weight set
----------------------------

.. note:: Normally this action is done automatically if needed by the
   ``balancer`` module (provided that the module is enabled).

To create a *compat* weight set, run the following command:

.. prompt:: bash $

   ceph osd crush weight-set create-compat

To adjust the weights of the compat weight set, run a command of the following
form:

.. prompt:: bash $

   ceph osd crush weight-set reweight-compat {name} {weight}

To destroy the compat weight set, run the following command:

.. prompt:: bash $

   ceph osd crush weight-set rm-compat

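For example, a minimal sketch that creates the compat weight set and then
nudges the alternative weight of ``osd.0`` (an illustrative item name) down to
``0.9`` might look like this:

.. prompt:: bash $

   ceph osd crush weight-set create-compat
   ceph osd crush weight-set reweight-compat osd.0 0.9
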
Creating per-pool weight sets
-----------------------------

To create a weight set for a specific pool, run a command of the following
form:

.. prompt:: bash $

   ceph osd crush weight-set create {pool-name} {mode}

.. note:: Per-pool weight sets can be used only if all servers and daemons are
   running Luminous v12.2.z or a later release.

For details on this command's parameters, see the following:

``pool-name``

:Description: The name of a RADOS pool.
:Type: String
:Required: Yes
:Example: ``rbd``

``mode``

:Description: Either ``flat`` or ``positional``. A *flat* weight set
              assigns a single weight to all devices or buckets. A
              *positional* weight set has a potentially different
              weight for each position in the resulting placement
              mapping. For example: if a pool has a replica count of
              ``3``, then a positional weight set will have three
              weights for each device and bucket.
:Type: String
:Required: Yes
:Example: ``flat``

To adjust the weight of an item in a weight set, run a command of the following
form:

.. prompt:: bash $

   ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]}

To list existing weight sets, run the following command:

.. prompt:: bash $

   ceph osd crush weight-set ls

To remove a weight set, run a command of the following form:

.. prompt:: bash $

   ceph osd crush weight-set rm {pool-name}

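For example, assuming a pool named ``rbd`` with a replica count of ``3``, a
*positional* weight set could be created and the three positional weights of
``osd.0`` (an illustrative item name) adjusted as follows:

.. prompt:: bash $

   ceph osd crush weight-set create rbd positional
   ceph osd crush weight-set reweight rbd osd.0 1.0 0.9 0.8
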
Creating a rule for a replicated pool
-------------------------------------

When you create a CRUSH rule for a replicated pool, there is an important
decision to make: selecting a failure domain. For example, if you select a
failure domain of ``host``, then CRUSH will ensure that each replica of the
data is stored on a unique host. Alternatively, if you select a failure domain
of ``rack``, then each replica of the data will be stored in a different rack.
Your selection of failure domain should be guided by the size of the cluster
and its CRUSH topology.

The entire cluster hierarchy is typically nested beneath a root node that is
named ``default``. If you have customized your hierarchy, you might want to
create a rule nested beneath some other node in the hierarchy. In creating
this rule for the customized hierarchy, the node type doesn't matter, and in
particular the rule does not have to be nested beneath a ``root`` node.

It is possible to create a rule that restricts data placement to a specific
*class* of device. By default, Ceph OSDs automatically classify themselves as
either ``hdd`` or ``ssd`` in accordance with the underlying type of device
being used. These device classes can be customized. One might set the ``device
class`` of OSDs to ``nvme`` to distinguish them from SATA SSDs, or one might
set them to something arbitrary like ``ssd-testing`` or ``ssd-ethel`` so that
rules and pools may be flexibly constrained to use (or avoid using) specific
subsets of OSDs based on specific requirements.

To create a rule for a replicated pool, run a command of the following form:

.. prompt:: bash $

   ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}]

For details on this command's parameters, see the following:

``name``

:Description: The name of the rule.
:Type: String
:Required: Yes
:Example: ``rbd-rule``

``root``

:Description: The name of the CRUSH hierarchy node under which data is to be placed.
:Type: String
:Required: Yes
:Example: ``default``

``failure-domain-type``

:Description: The type of CRUSH nodes used for the replicas of the failure domain.
:Type: String
:Required: Yes
:Example: ``rack``

``class``

:Description: The device class on which data is to be placed.
:Type: String
:Required: No
:Example: ``ssd``

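For example, using the parameter values listed above, the following command
creates a rule named ``rbd-rule`` that places replicas on ``ssd`` devices under
the ``default`` root, with ``rack`` as the failure domain:

.. prompt:: bash $

   ceph osd crush rule create-replicated rbd-rule default rack ssd
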
Creating a rule for an erasure-coded pool
-----------------------------------------

For an erasure-coded pool, similar decisions need to be made: what the failure
domain is, which node in the hierarchy data will be placed under (usually
``default``), and whether placement is restricted to a specific device class.
However, erasure-code pools are created in a different way: there is a need to
construct them carefully with reference to the erasure code plugin in use. For
this reason, these decisions must be incorporated into the **erasure-code
profile**. A CRUSH rule will then be created from the erasure-code profile,
either explicitly or automatically when the profile is used to create a pool.

To list the erasure-code profiles, run the following command:

.. prompt:: bash $

   ceph osd erasure-code-profile ls

To view a specific existing profile, run a command of the following form:

.. prompt:: bash $

   ceph osd erasure-code-profile get {profile-name}

Under normal conditions, profiles should never be modified; instead, a new
profile should be created and used when creating either a new pool or a new
rule for an existing pool.

An erasure-code profile consists of a set of key-value pairs. Most of these
key-value pairs govern the behavior of the erasure code that encodes data in
the pool. However, key-value pairs that begin with ``crush-`` govern the CRUSH
rule that is created.

The relevant erasure-code profile properties are as follows:

* **crush-root**: the name of the CRUSH node under which to place data
  [default: ``default``].
* **crush-failure-domain**: the CRUSH bucket type used in the distribution of
  erasure-coded shards [default: ``host``].
* **crush-device-class**: the device class on which to place data [default:
  none, which means that all devices are used].
* **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the
  number of erasure-code shards, affecting the resulting CRUSH rule.

After a profile is defined, you can create a CRUSH rule by running a command
of the following form:

.. prompt:: bash $

   ceph osd crush rule create-erasure {name} {profile-name}

.. note:: When creating a new pool, it is not necessary to create the rule
   explicitly. If only the erasure-code profile is specified and the rule
   argument is omitted, then Ceph will create the CRUSH rule automatically.

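For example, a sketch that defines a hypothetical 4+2 profile named
``ec42-ssd`` (placing shards on ``ssd`` devices with ``rack`` as the failure
domain) and then creates a rule named ``ec42-ssd-rule`` from it might look like
this:

.. prompt:: bash $

   ceph osd erasure-code-profile set ec42-ssd k=4 m=2 crush-failure-domain=rack crush-device-class=ssd
   ceph osd crush rule create-erasure ec42-ssd-rule ec42-ssd
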
Deleting rules
--------------

To delete rules that are not in use by pools, run a command of the following
form:

.. prompt:: bash $

   ceph osd crush rule rm {rule-name}

.. _crush-map-tunables:

Tunables
========

The CRUSH algorithm that is used to calculate the placement of data has been
improved over time. In order to support changes in behavior, we have provided
users with sets of tunables that determine which legacy or optimal version of
CRUSH is to be used.

In order to use newer tunables, all Ceph clients and daemons must support the
new major release of CRUSH. Because of this requirement, we have created
``profiles`` that are named after the Ceph version in which they were
introduced. For example, the ``firefly`` tunables were first supported by the
Firefly release and do not work with older clients (for example, clients
running Dumpling). After a cluster's tunables profile is changed from a legacy
set to a newer or ``optimal`` set, the ``ceph-mon`` and ``ceph-osd`` options
will prevent older clients that do not support the new CRUSH features from
connecting to the cluster.

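To inspect the tunable values that are currently in effect on a cluster, the
following command can be used:

.. prompt:: bash $

   ceph osd crush show-tunables
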
argonaut (legacy)
-----------------

The legacy CRUSH behavior used by Argonaut and older releases works fine for
most clusters, provided that not many OSDs have been marked ``out``.

bobtail (CRUSH_TUNABLES2)
-------------------------

The ``bobtail`` tunable profile addresses the following problems with the
legacy CRUSH behavior:

* For hierarchies with a small number of devices in leaf buckets, some PGs
  might map to fewer than the desired number of replicas, resulting in
  ``undersized`` PGs. This is known to happen in the case of hierarchies with
  ``host`` nodes that have a small number of OSDs (1 to 3) nested beneath each
  host.

* For large clusters, a small percentage of PGs might map to fewer than the
  desired number of OSDs. This is known to happen when there are multiple
  hierarchy layers in use (for example, ``row``, ``rack``, ``host``, ``osd``).

* When one or more OSDs are marked ``out``, data tends to be redistributed
  to nearby OSDs instead of across the entire hierarchy.

The tunables introduced in the Bobtail release are as follows:

* ``choose_local_tries``: Number of local retries. The legacy value is ``2``,
  and the optimal value is ``0``.

* ``choose_local_fallback_tries``: The legacy value is ``5``, and the optimal
  value is ``0``.

* ``choose_total_tries``: Total number of attempts to choose an item. The
  legacy value is ``19``, but subsequent testing indicates that a value of
  ``50`` is more appropriate for typical clusters. For extremely large
  clusters, an even larger value might be necessary.

* ``chooseleaf_descend_once``: Whether a recursive ``chooseleaf`` attempt will
  retry, or try only once and allow the original placement to retry. The
  legacy default is ``0``, and the optimal value is ``1``.

Migration impact:

* Moving from the ``argonaut`` tunables to the ``bobtail`` tunables triggers a
  moderate amount of data movement. Use caution on a cluster that is already
  populated with data.

firefly (CRUSH_TUNABLES3)
-------------------------

chooseleaf_vary_r
~~~~~~~~~~~~~~~~~

The ``firefly`` tunable profile fixes a problem with ``chooseleaf`` CRUSH step
behavior. This problem arose when a large fraction of OSDs were marked ``out``,
which resulted in PG mappings with too few OSDs.

This profile was introduced in the Firefly release, and adds a new tunable as
follows:

* ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will start
  with a non-zero value of ``r``, as determined by the number of attempts the
  parent has already made. The legacy default value is ``0``, but with this
  value CRUSH is sometimes unable to find a mapping. The optimal value (in
  terms of computational cost and correctness) is ``1``.

Migration impact:

* For existing clusters that store a great deal of data, changing this tunable
  from ``0`` to ``1`` will trigger a large amount of data migration; a value
  of ``4`` or ``5`` will allow CRUSH to still find a valid mapping and will
  cause less data to move.

straw_calc_version tunable
~~~~~~~~~~~~~~~~~~~~~~~~~~

There were problems with the internal weights calculated and stored in the
CRUSH map for ``straw`` algorithm buckets. When there were buckets with a CRUSH
weight of ``0`` or with a mix of different and unique weights, CRUSH would
distribute data incorrectly (that is, not in proportion to the weights).

This tunable, introduced in the Firefly release, is as follows:

* ``straw_calc_version``: A value of ``0`` preserves the old, broken
  internal-weight calculation; a value of ``1`` fixes the problem.

Migration impact:

* Changing this tunable to a value of ``1`` and then adjusting a straw bucket
  (either by adding, removing, or reweighting an item or by using the
  reweight-all command) can trigger a small to moderate amount of data
  movement provided that the cluster has hit one of the problematic
  conditions.

This tunable option is notable in that it has absolutely no impact on the
kernel version that is required on the client side.

hammer (CRUSH_V4)
-----------------

The ``hammer`` tunable profile does not affect the mapping of existing CRUSH
maps simply by changing the profile. However:

* There is a new bucket algorithm supported: ``straw2``. This new algorithm
  fixes several limitations in the original ``straw``. More specifically, the
  old ``straw`` buckets would change some mappings that should not have
  changed when a weight was adjusted, while ``straw2`` achieves the original
  goal of changing mappings only to or from the bucket item whose weight has
  changed.

* The ``straw2`` type is the default type for any newly created buckets.

Migration impact:

* Changing a bucket type from ``straw`` to ``straw2`` will trigger a small
  amount of data movement, depending on how much the bucket items' weights
  vary from each other. When the weights are all the same no data will move,
  and the more variance there is in the weights the more movement there will
  be.

jewel (CRUSH_TUNABLES5)
-----------------------

The ``jewel`` tunable profile improves the overall behavior of CRUSH. As a
result, significantly fewer mappings change when an OSD is marked ``out`` of
the cluster. This improvement results in significantly less data movement.

The new tunable introduced in the Jewel release is as follows:

* ``chooseleaf_stable``: Determines whether a recursive chooseleaf attempt
  will use a better value for an inner loop that greatly reduces the number of
  mapping changes when an OSD is marked ``out``. The legacy value is ``0``,
  and the new value of ``1`` uses the new approach.

Migration impact:

* Changing this value on an existing cluster will result in a very large
  amount of data movement because nearly every PG mapping is likely to change.

Client versions that support CRUSH_TUNABLES2
--------------------------------------------

* v0.55 and later, including Bobtail (v0.56.x)
* Linux kernel version v3.9 and later (for the CephFS and RBD kernel clients)

Client versions that support CRUSH_TUNABLES3
--------------------------------------------

* v0.78 (Firefly) and later
* Linux kernel version v3.15 and later (for the CephFS and RBD kernel clients)

Client versions that support CRUSH_V4
-------------------------------------

* v0.94 (Hammer) and later
* Linux kernel version v4.1 and later (for the CephFS and RBD kernel clients)

Client versions that support CRUSH_TUNABLES5
--------------------------------------------

* v10.0.2 (Jewel) and later
* Linux kernel version v4.5 and later (for the CephFS and RBD kernel clients)

"Non-optimal tunables" warning
------------------------------

In v0.74 and later versions, Ceph will raise a health check ("HEALTH_WARN crush
map has non-optimal tunables") if any of the current CRUSH tunables have
non-optimal values: that is, if any fail to have the optimal values from the
:ref:`default profile
<rados_operations_crush_map_default_profile_definition>`. There are two
different ways to silence the alert:

1. Adjust the CRUSH tunables on the existing cluster so as to render them
   optimal. Making this adjustment will trigger some data movement
   (possibly as much as 10%). This approach is generally preferred to the
   other approach, but special care must be taken in situations where
   data movement might affect performance: for example, in production clusters.
   To enable optimal tunables, run the following command:

   .. prompt:: bash $

      ceph osd crush tunables optimal

   There are several potential problems that might make it preferable to revert
   to the previous values of the tunables. The new values might generate too
   much load for the cluster to handle, the new values might unacceptably slow
   the operation of the cluster, or there might be a client-compatibility
   problem. Such client-compatibility problems can arise when using old-kernel
   CephFS or RBD clients, or pre-Bobtail ``librados`` clients. To revert to
   the previous values of the tunables, run the following command:

   .. prompt:: bash $

      ceph osd crush tunables legacy

2. To silence the alert without making any changes to CRUSH, add the following
   option to the ``[mon]`` section of your ``ceph.conf`` file::

      mon_warn_on_legacy_crush_tunables = false

   In order for this change to take effect, you will need to either restart
   the monitors or run the following command to apply the option to the
   monitors while they are still running:

   .. prompt:: bash $

      ceph tell mon.\* config set mon_warn_on_legacy_crush_tunables false

Tuning CRUSH
------------

When making adjustments to CRUSH tunables, keep the following considerations in
mind:

* Adjusting the values of CRUSH tunables will result in the shift of one or
  more PGs from one storage node to another. If the Ceph cluster is already
  storing a great deal of data, be prepared for significant data movement.
* When the ``ceph-osd`` and ``ceph-mon`` daemons get the updated map, they
  immediately begin rejecting new connections from clients that do not support
  the new feature. However, already-connected clients are effectively
  grandfathered in, and any of these clients that do not support the new
  feature will malfunction.
* If the CRUSH tunables are set to newer (non-legacy) values and subsequently
  reverted to the legacy values, ``ceph-osd`` daemons will not be required to
  support any of the newer CRUSH features associated with the newer
  (non-legacy) values. However, the OSD peering process requires the
  examination and understanding of old maps. For this reason, **if the cluster
  has previously used non-legacy CRUSH values, do not run old versions of
  the** ``ceph-osd`` **daemon** -- even if the latest version of the map has
  been reverted so as to use the legacy defaults.

The simplest way to adjust CRUSH tunables is to apply them in matched sets
known as *profiles*. As of the Octopus release, Ceph supports the following
profiles:

* ``legacy``: The legacy behavior from argonaut and earlier.
* ``argonaut``: The legacy values supported by the argonaut release.
* ``bobtail``: The values supported by the bobtail release.
* ``firefly``: The values supported by the firefly release.
* ``hammer``: The values supported by the hammer release.
* ``jewel``: The values supported by the jewel release.
* ``optimal``: The best values for the current version of Ceph.

.. _rados_operations_crush_map_default_profile_definition:

* ``default``: The default values of a new cluster that has been installed
  from scratch. These values, which depend on the current version of Ceph, are
  hardcoded and are typically a mix of optimal and legacy values. These
  values often correspond to the ``optimal`` profile of either the previous
  LTS (long-term service) release or the most recent release for which most
  users are expected to have up-to-date clients.

To apply a profile to a running cluster, run a command of the following form:

.. prompt:: bash $

   ceph osd crush tunables {PROFILE}

This action might trigger a great deal of data movement. Consult release notes
and documentation before changing the profile on a running cluster. Consider
throttling recovery and backfill parameters in order to limit the backfill
resulting from a specific change.

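For example, on a release that supports the ``ceph config`` interface, backfill
and recovery activity can be throttled before changing the profile by lowering
the usual settings, and restored afterward (the values shown here are only
illustrative):

.. prompt:: bash $

   ceph config set osd osd_max_backfills 1
   ceph config set osd osd_recovery_max_active 1
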
.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf

Tuning Primary OSD Selection
============================

When a Ceph client reads or writes data, it first contacts the primary OSD in
each affected PG's acting set. By default, the first OSD in the acting set is
the primary OSD (also known as the "lead OSD"). For example, in the acting set
``[2, 3, 4]``, ``osd.2`` is listed first and is therefore the primary OSD.
However, sometimes it is clear that an OSD is not well suited to act as the
lead as compared with other OSDs (for example, if the OSD has a slow drive or a
slow controller). To prevent performance bottlenecks (especially on read
operations) and at the same time maximize the utilization of your hardware, you
can influence the selection of the primary OSD either by adjusting "primary
affinity" values, or by crafting a CRUSH rule that selects OSDs that are better
suited to act as the lead rather than other OSDs.

To determine whether tuning Ceph's selection of primary OSDs will improve
cluster performance, pool redundancy strategy must be taken into account. For
replicated pools, this tuning can be especially useful, because by default read
operations are served from the primary OSD of each PG. For erasure-coded pools,
however, the speed of read operations can be increased by enabling **fast
read** (see :ref:`pool-settings`).

.. _rados_ops_primary_affinity:

Primary Affinity
----------------

**Primary affinity** is a characteristic of an OSD that governs the likelihood
that a given OSD will be selected as the primary OSD (or "lead OSD") in a given
acting set. A primary affinity value can be any real number in the range ``0``
to ``1``, inclusive.

As an example of a common scenario in which it can be useful to adjust primary
affinity values, let us suppose that a cluster contains a mix of drive sizes:
for example, suppose it contains some older racks with 1.9 TB SATA SSDs and
some newer racks with 3.84 TB SATA SSDs. The latter will on average be assigned
twice the number of PGs and will thus serve twice the number of write and read
operations -- they will be busier than the former. In such a scenario, you
might make a rough assignment of primary affinity as inversely proportional to
OSD size. Such an assignment will not be 100% optimal, but it can readily
achieve a 15% improvement in overall read throughput by means of a more even
utilization of SATA interface bandwidth and CPU cycles. This example is not
merely a thought experiment meant to illustrate the theoretical benefits of
adjusting primary affinity values; this fifteen percent improvement was
achieved on an actual Ceph cluster.

By default, every Ceph OSD has a primary affinity value of ``1``. In a cluster
in which every OSD has this default value, all OSDs are equally likely to act
as a primary OSD.

By reducing the value of a Ceph OSD's primary affinity, you make CRUSH less
likely to select the OSD as primary in a PG's acting set. To change the weight
value associated with a specific OSD's primary affinity, run a command of the
following form:

.. prompt:: bash $

   ceph osd primary-affinity <osd-id> <weight>

The primary affinity of an OSD can be set to any real number in the range
``[0-1]`` inclusive, where ``0`` indicates that the OSD may not be used as
primary and ``1`` indicates that the OSD is maximally likely to be used as a
primary. When the weight is between these extremes, its value indicates roughly
how likely it is that CRUSH will select the OSD associated with it as a
primary.

The process by which CRUSH selects the lead OSD is not a mere function of a
simple probability determined by relative affinity values. Nevertheless,
measurable results can be achieved even with first-order approximations of
desirable primary affinity values.

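For example, in the mixed-drive scenario described above, the primary affinity
of one of the larger (and therefore busier) OSDs, say ``osd.4`` (an
illustrative ID), could be halved as follows:

.. prompt:: bash $

   ceph osd primary-affinity osd.4 0.5
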
Custom CRUSH Rules
------------------

Some clusters balance cost and performance by mixing SSDs and HDDs in the same
replicated pool. By setting the primary affinity of HDD OSDs to ``0``,
operations will be directed to an SSD OSD in each acting set. Alternatively,
you can define a CRUSH rule that always selects an SSD OSD as the primary OSD
and then selects HDDs for the remaining OSDs. Given this rule, each PG's acting
set will contain an SSD OSD as the primary and have the remaining OSDs on HDDs.

For example, see the following CRUSH rule::

   rule mixed_replicated_rule {
           id 11
           type replicated
           step take default class ssd
           step chooseleaf firstn 1 type host
           step emit
           step take default class hdd
           step chooseleaf firstn 0 type host
           step emit
   }

This rule chooses an SSD as the first OSD. For an ``N``-times replicated pool,
this rule selects ``N+1`` OSDs in order to guarantee that ``N`` copies are on
different hosts, because the first SSD OSD might be colocated with any of the
``N`` HDD OSDs.

To avoid this extra storage requirement, you might place SSDs and HDDs in
different hosts. However, taking this approach means that all client requests
will be received by hosts with SSDs. For this reason, it might be advisable to
have faster CPUs for SSD OSDs and more modest CPUs for HDD OSDs, since the
latter will under normal circumstances perform only recovery operations. Here
the CRUSH roots ``ssd_hosts`` and ``hdd_hosts`` are under a strict requirement
not to contain any of the same servers, as seen in the following CRUSH rule::

   rule mixed_replicated_rule_two {
           id 1
           type replicated
           step take ssd_hosts class ssd
           step chooseleaf firstn 1 type host
           step emit
           step take hdd_hosts class hdd
           step chooseleaf firstn -1 type host
           step emit
   }

.. note:: If a primary SSD OSD fails, then requests to the associated PG will
   be temporarily served from a slower HDD OSD until the PG's data has been
   replicated onto the replacement primary SSD OSD.