Merge pull request #38092 from anthonyeleven/anthonyeleven/docs-mixed-device-trick

doc/rados/operations: enhance primary affinity section

Reviewed-by: Zac Dover <zac.dover@gmail.com>
commit de870cac81
committed by zdover23 via GitHub on 2020-11-23 22:20:13 +10:00
2 changed files with 85 additions and 24 deletions


@@ -954,6 +954,7 @@ Client
:Type: 64-bit Integer Unsigned
:Default: ``100ul << 20``

.. _pool-settings:

Pool settings
=============


@@ -959,39 +959,99 @@ Primary Affinity
================
When a Ceph Client reads or writes data, it first contacts the primary OSD in
each affected PG's acting set. By default, the first OSD in the acting set is
the primary. For example, in the acting set ``[2, 3, 4]``, ``osd.2`` is
listed first and thus is the primary (aka lead) OSD. Sometimes we know that an
OSD is less well suited to act as the lead than are other OSDs (e.g., it has
a slow drive or a slow controller). To prevent performance bottlenecks
(especially on read operations) while maximizing utilization of your hardware,
you can influence the selection of primary OSDs by adjusting primary affinity
values, or by crafting a CRUSH rule that selects preferred OSDs first.

Tuning primary OSD selection is mainly useful for replicated pools, because
by default read operations are served from the primary OSD for each PG.
For erasure coded (EC) pools, a way to speed up read operations is to enable
**fast read** as described in :ref:`pool-settings`.
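
For example, assuming an existing erasure-coded pool named ``ecpool`` (a
hypothetical name), fast read could be enabled and then verified with::

    ceph osd pool set ecpool fast_read 1
    ceph osd pool get ecpool fast_read
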
A common scenario for primary affinity is when a cluster contains a mix of
drive sizes, for example older racks with 1.9 TB SATA SSDs and newer racks
with 3.84 TB SATA SSDs. On average the latter will be assigned double the
number of PGs and thus will serve double the number of write and read
operations, so they will be busier than the former. A rough assignment of
primary affinity inversely proportional to OSD size won't be 100% optimal,
but it can readily achieve a 15% improvement in overall read throughput by
utilizing SATA interface bandwidth and CPU cycles more evenly.
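
As a minimal sketch, assuming the larger 3.84 TB drives are (hypothetically)
``osd.8`` through ``osd.15``, their primary affinity could be roughly halved
while the smaller drives keep the default of ``1``::

    # reduce primary affinity on the larger, busier OSDs
    for id in 8 9 10 11 12 13 14 15; do
        ceph osd primary-affinity osd.$id 0.5
    done
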
By default, all Ceph OSDs have primary affinity of ``1``, which indicates that
any OSD may act as a primary with equal probability.

You can reduce a Ceph OSD's primary affinity so that CRUSH is less likely to
choose the OSD as primary in a PG's acting set::

    ceph osd primary-affinity <osd-id> <weight>
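
For example, to make a hypothetical ``osd.2`` half as likely to be chosen as
a primary, and then inspect the per-OSD primary-affinity values reported by
``ceph osd tree``::

    ceph osd primary-affinity osd.2 0.5
    ceph osd tree
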
You may set an OSD's primary affinity to a real number in the range
``[0-1]``, where ``0`` indicates that the OSD may **NOT** be used as a primary
and ``1`` indicates that an OSD may be used as a primary. When the weight is
between these extremes, it is less likely that CRUSH will select that OSD as
a primary. The process for
selecting the lead OSD is more nuanced than a simple probability based on
relative affinity values, but measurable results can be achieved even with
first-order approximations of desirable values.
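
As a rough way to gauge the results, the number of PGs for which a given OSD
(here a hypothetical ``osd.2``) is currently the primary can be counted::

    # lists one PG per line, plus a header line
    ceph pg ls-by-primary osd.2 | wc -l
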

Custom CRUSH Rules
------------------

There are occasional clusters that balance cost and performance by mixing SSDs
and HDDs in the same replicated pool. By setting the primary affinity of HDD
OSDs to ``0`` one can direct operations to the SSD in each acting set. An
alternative is to define a CRUSH rule that always selects an SSD OSD as the
first OSD, then selects HDDs for the remaining OSDs. Thus, each PG's acting
set will contain exactly one SSD OSD as the primary with the balance on HDDs.
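
Before turning to the CRUSH-rule alternative, here is a minimal sketch of the
primary-affinity approach, assuming all spinning drives carry the default
``hdd`` device class::

    # set primary affinity to 0 for every OSD in the hdd device class
    for id in $(ceph osd crush class ls-osd hdd); do
        ceph osd primary-affinity osd.$id 0
    done
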
For example, the CRUSH rule below::

    rule mixed_replicated_rule {
        id 11
        type replicated
        min_size 1
        max_size 10
        step take default class ssd
        step chooseleaf firstn 1 type host
        step emit
        step take default class hdd
        step chooseleaf firstn 0 type host
        step emit
    }

chooses an SSD as the first OSD. Note that for an ``N``-times replicated pool
this rule selects ``N+1`` OSDs to guarantee that ``N`` copies are on different
hosts, because the first SSD OSD might be co-located with any of the ``N`` HDD
OSDs.
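
One way to install and use such a rule (a sketch; the pool name ``mixedpool``
is hypothetical) is to decompile the CRUSH map, add the rule, recompile and
inject the map, then point the pool at the new rule::

    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt
    # edit crush.txt to add the rule shown above, then recompile
    crushtool -c crush.txt -o crush.new
    ceph osd setcrushmap -i crush.new
    ceph osd pool set mixedpool crush_rule mixed_replicated_rule
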
This extra storage requirement can be avoided by placing SSDs and HDDs in
different hosts with the tradeoff that hosts with SSDs will receive all client
requests. You may thus consider faster CPU(s) for SSD hosts and more modest
ones for HDD nodes, since the latter will normally only service recovery
operations. Here the CRUSH roots ``ssd_hosts`` and ``hdd_hosts`` must not
contain any of the same servers::

    rule mixed_replicated_rule_two {
        id 1
        type replicated
        min_size 1
        max_size 10
        step take ssd_hosts class ssd
        step chooseleaf firstn 1 type host
        step emit
        step take hdd_hosts class hdd
        step chooseleaf firstn -1 type host
        step emit
    }

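
The two roots can be prepared ahead of time; a sketch with hypothetical host
names ``ssdhost1`` and ``hddhost1`` that already exist in the CRUSH map might
look like::

    # create two independent CRUSH roots
    ceph osd crush add-bucket ssd_hosts root
    ceph osd crush add-bucket hdd_hosts root
    # move each host under the appropriate root
    ceph osd crush move ssdhost1 root=ssd_hosts
    ceph osd crush move hddhost1 root=hdd_hosts
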
Note also that on failure of an SSD, requests to a PG will be served temporarily
from a (slower) HDD OSD until the PG's data has been replicated onto the replacement
primary SSD OSD.