.. _adjusting-crush:

=========================
 Adjusting the CRUSH map
=========================

There are a few ways to adjust the CRUSH map:

* online, by issuing commands to the monitor
* offline, by extracting the current map to a file, modifying it, and
  then reinjecting a new map

Some offline changes can be made directly with ``crushtool``; others
require you to decompile the map to its text form, edit it manually,
and then recompile it.
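
For instance, a minimal decompile-edit-recompile round trip might look
like the following (the ``/tmp`` file names are arbitrary)::

    ceph osd getcrushmap -o /tmp/crush
    crushtool -d /tmp/crush -o /tmp/crush.txt
    # ... edit /tmp/crush.txt with your editor of choice ...
    crushtool -c /tmp/crush.txt -o /tmp/crush.new
    ceph osd setcrushmap -i /tmp/crush.new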


Adding a new device (OSD) to the map
====================================

.. _adjusting-crush-set:

Adding new devices or moving existing devices to new positions in the
CRUSH hierarchy can be done via the monitor. The general form is::

    $ ceph osd crush set <id> <name> <weight> [<loc> [<loc2> ...]]

where

* ``id`` is the numeric device id (the OSD id)
* ``name`` is an alphanumeric name. By convention Ceph uses
  ``osd.$id``.
* ``weight`` is a floating point weight value controlling how much
  data the device will be allocated. A decent convention is to make
  this the number of TB the device will store.
* ``loc`` is a list of ``what=where`` pairs indicating where in the
  CRUSH hierarchy the device will be stored. By default, the
  hierarchy (the ``what`` keys) includes ``pool`` (the ``default``
  pool is normally the root of the hierarchy), ``rack``, and
  ``host``. At least one of these location specifiers has to refer
  to an existing point in the hierarchy, and only the lowest (most
  specific) match counts. Beneath that point, any intervening
  branches will be created as needed. Specifying the complete
  location is always sufficient, and also safe in that existing
  branches (and devices) won't be moved around.

For example, if the OSD id is ``123``, we want a weight of ``1.0``,
and the device is on host ``hostfoo`` and rack ``rackbar``::

    $ ceph osd crush set 123 osd.123 1.0 root=default rack=rackbar host=hostfoo

will add it to the hierarchy, or move it from its previous position.
(Note: for clusters created prior to v0.52, replace ``root=default``
with ``pool=default``.)  The rack ``rackbar`` and host ``hostfoo``
will be added as needed, as long as the pool ``default`` exists (as it
does in the default Ceph CRUSH map generated during cluster creation).
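
You can verify the result with ``ceph osd tree``, which prints the
current hierarchy. The exact output depends on your cluster and Ceph
version; an abbreviated, illustrative example::

    $ ceph osd tree
    # id    weight  type name                      up/down reweight
    -1      1       root default
    -2      1               rack rackbar
    -3      1                       host hostfoo
    123     1                               osd.123 up      1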

Note that if you later add another device on the same host but specify
a different pool or rack::

    $ ceph osd crush set 124 osd.124 1.0 root=nondefault rack=weirdrack host=hostfoo

the device will still be placed in host ``hostfoo`` at its current
location (rack ``rackbar`` and pool ``default``).


Moving a bucket to a different position in the hierarchy
========================================================

To move an existing bucket to a different position in the hierarchy,
identify the bucket to move by name and specify the new location in
the same fashion as with ``osd crush set ...``::

    $ ceph osd crush move <bucket name> [<loc> [<loc2> ...]]

where

* ``name`` is the name of the bucket to move.
* ``loc`` is a list of ``what=where`` pairs indicating where the
  bucket should be moved.
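
For example, to move the rack ``rackbar`` from the example above so
that it sits directly beneath the ``default`` root::

    $ ceph osd crush move rackbar root=default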


Adjusting the CRUSH weight
==========================

You can adjust the CRUSH weight for a device with::

    $ ceph osd crush reweight osd.123 2.0
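
Setting the weight to ``0``, for example, tells CRUSH to stop
allocating data to the device; the cluster will then gradually migrate
the existing data elsewhere, effectively draining the OSD::

    $ ceph osd crush reweight osd.123 0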


Removing a device
=================

You can remove a device from the CRUSH map with::

    $ ceph osd crush remove osd.123


Tunables
========

There are several magic numbers that were used in the original CRUSH
implementation that have proven to be poor choices. To support the
transition away from them, newer versions of CRUSH (starting with the
v0.48 argonaut series) allow the values to be adjusted or tuned.

Clusters running recent Ceph releases support using the tunable values
in the CRUSH maps. However, older clients and daemons will not
interact correctly with clusters using the "tuned" CRUSH maps. To
detect this situation, there is now a feature bit ``CRUSH_TUNABLES``
(value 0x40000) that reflects support for tunables.

If the OSDMap currently used by the ``ceph-mon`` or ``ceph-osd``
daemon has non-legacy values, it will require the ``CRUSH_TUNABLES``
feature bit from clients and daemons that connect to it. This means
that old clients will not be able to connect.

At some future point in time, newly created clusters will have
improved default values for the tunables. This is a matter of waiting
until the support has been present in the Linux kernel clients long
enough to make this a painless transition for most users.

Impact of legacy values
~~~~~~~~~~~~~~~~~~~~~~~

The legacy values result in several misbehaviors:

* For hierarchies with a small number of devices in the leaf buckets,
  some PGs map to fewer than the desired number of replicas. This
  commonly happens for hierarchies with "host" nodes that have a small
  number (1-3) of OSDs nested beneath each one.

* For large clusters, some small percentage of PGs map to fewer than
  the desired number of OSDs. This is more prevalent when there are
  several layers in the hierarchy (e.g., row, rack, host, osd).

* When some OSDs are marked out, the data tends to get redistributed
  to nearby OSDs instead of across the entire hierarchy.

Which client versions support tunables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* argonaut series, v0.48.1 or later
* v0.49 or later
* Linux kernel version v3.5 or later (for the file system and RBD
  kernel clients)

A few important points
~~~~~~~~~~~~~~~~~~~~~~

* Adjusting these values will result in the shift of some PGs between
  storage nodes. If the Ceph cluster is already storing a lot of
  data, be prepared for some fraction of the data to move.

* The ``ceph-osd`` and ``ceph-mon`` daemons will start requiring the
  ``CRUSH_TUNABLES`` feature of new connections as soon as they get
  the updated map. However, already-connected clients are
  effectively grandfathered in, and will misbehave if they do not
  support the new feature.

* If the CRUSH tunables are set to non-legacy values and then later
  changed back to the default values, ``ceph-osd`` daemons will not be
  required to support the feature. However, the OSD peering process
  requires examining and understanding old maps. Therefore, you
  should not run old (pre-v0.48) versions of the ``ceph-osd`` daemon
  if the cluster has previously used non-legacy CRUSH values, even if
  the latest version of the map has been switched back to using the
  legacy defaults.

Tuning CRUSH
~~~~~~~~~~~~

If you can ensure that all clients are running recent code, you can
adjust the tunables by extracting the CRUSH map, modifying the values,
and reinjecting it into the cluster.

* Extract the latest CRUSH map::

    ceph osd getcrushmap -o /tmp/crush

* Adjust tunables. These values appear to offer the best behavior
  for both large and small clusters we tested with. You will need to
  additionally specify the ``--enable-unsafe-tunables`` argument to
  ``crushtool`` for this to work. Please use this option with extreme
  care::

    crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new

* Reinject the modified map::

    ceph osd setcrushmap -i /tmp/crush.new
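
To confirm that the new values are present, you can decompile the
modified map and inspect the tunables near the top of its text form
(the exact set of lines may vary by Ceph version)::

    crushtool -d /tmp/crush.new -o /tmp/crush.new.txt
    head -4 /tmp/crush.new.txt

with output along the lines of::

    # begin crush map
    tunable choose_local_tries 0
    tunable choose_local_fallback_tries 0
    tunable choose_total_tries 50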

Legacy values
~~~~~~~~~~~~~

For reference, the legacy values for the CRUSH tunables can be set
with::

    crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 -o /tmp/crush.legacy

Again, the special ``--enable-unsafe-tunables`` option is required.
Further, as noted above, be careful about running old versions of the
``ceph-osd`` daemon after reverting to legacy values, as the feature
bit is not perfectly enforced.