Merge pull request #38092 from anthonyeleven/anthonyeleven/docs-mixed-device-trick

doc/rados/operations: enhance primary affinity section

Reviewed-by: Zac Dover <zac.dover@gmail.com>

commit de870cac81

@@ -954,6 +954,7 @@ Client

:Type: 64-bit Integer Unsigned
:Default: ``100ul << 20``

.. _pool-settings:

Pool settings
=============

@@ -959,39 +959,99 @@ Primary Affinity

Primary Affinity
================

When a Ceph Client reads or writes data, it first contacts the primary OSD in
each affected PG's acting set. By default, the first OSD in the acting set is
the primary. For example, in the acting set ``[2, 3, 4]``, ``osd.2`` is
listed first and thus is the primary (aka lead) OSD. Sometimes we know that an
OSD is less well suited to act as the lead than are other OSDs (e.g., it has
a slow drive or a slow controller). To prevent performance bottlenecks
(especially on read operations) while maximizing utilization of your hardware,
you can influence the selection of primary OSDs by adjusting primary affinity
values, or by crafting a CRUSH rule that selects preferred OSDs first.

Tuning primary OSD selection is mainly useful for replicated pools, because
by default read operations are served from the primary OSD for each PG.
For erasure-coded (EC) pools, a way to speed up read operations is to enable
**fast read** as described in :ref:`pool-settings`.
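
For example, assuming a hypothetical erasure-coded pool named ``ecpool``, fast
read could be enabled like so (a sketch, not a required step)::

    ceph osd pool set ecpool fast_read 1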

A common scenario for primary affinity is when a cluster contains
a mix of drive sizes, for example older racks with 1.9 TB SATA SSDs and newer racks with
3.84 TB SATA SSDs. On average the latter will be assigned double the number of
PGs and thus will serve double the number of write and read operations, so
they will be busier than the former. A rough assignment of primary affinity
inversely proportional to OSD size won't be 100% optimal, but it can readily
achieve a 15% improvement in overall read throughput by utilizing SATA
interface bandwidth and CPU cycles more evenly.

By default, all Ceph OSDs have a primary affinity of ``1``, which indicates that
any OSD may act as a primary with equal probability.

You can reduce a Ceph OSD's primary affinity so that CRUSH is less likely to choose
the OSD as primary in a PG's acting set::

    ceph osd primary-affinity <osd-id> <weight>

You may set an OSD's primary affinity to a real number in the range
``[0-1]``, where ``0`` indicates that the OSD may **NOT** be used as a primary
and ``1`` indicates that an OSD may be used as a primary. When the weight is
between these extremes, it is less likely that CRUSH will select that OSD as a
primary. The process for selecting the lead OSD is more nuanced than a simple
probability based on relative affinity values, but measurable results can be
achieved even with first-order approximations of desirable values.
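
As a first-order sketch of the mixed-drive-size scenario above, suppose that
``osd.10`` and ``osd.11`` are (hypothetical) 3.84 TB OSDs and that the 1.9 TB
OSDs keep the default affinity of ``1``. Halving the larger OSDs' affinity
means that, despite carrying roughly twice as many PGs, they lead about as
often overall as the smaller OSDs::

    # assign the larger 3.84 TB OSDs an affinity inversely proportional to size
    ceph osd primary-affinity osd.10 0.5
    ceph osd primary-affinity osd.11 0.5

The resulting values can be reviewed in the ``PRI-AFF`` column of
``ceph osd tree``.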

Custom CRUSH Rules
------------------

There are occasional clusters that balance cost and performance by mixing SSDs
and HDDs in the same replicated pool. By setting the primary affinity of HDD
OSDs to ``0`` one can direct read operations to the SSD in each acting set. An
alternative is to define a CRUSH rule that always selects an SSD OSD as the
first OSD and then selects HDDs for the remaining OSDs. Thus, each PG's acting
set will contain exactly one SSD OSD as the primary, with the balance on HDDs.
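
The primary-affinity approach can be sketched with the CLI; the loop below is
illustrative only and assumes that every HDD OSD carries the ``hdd`` device
class::

    # zero the primary affinity of every OSD in the hdd device class so that
    # the SSD member of each acting set is chosen as the primary
    for id in $(ceph osd crush class ls-osd hdd); do
        ceph osd primary-affinity "$id" 0
    done

The CRUSH rule approach is shown next.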

For example, the CRUSH rule below::

    rule mixed_replicated_rule {
            id 11
            type replicated
            min_size 1
            max_size 10
            step take default class ssd
            step chooseleaf firstn 1 type host
            step emit
            step take default class hdd
            step chooseleaf firstn 0 type host
            step emit
    }

chooses an SSD as the first OSD. Note that for an ``N``-times replicated pool
this rule selects ``N+1`` OSDs to guarantee that ``N`` copies are on different
hosts, because the first SSD OSD might be co-located with any of the ``N`` HDD
OSDs.
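
As a sanity check, one can export the CRUSH map and inspect the placements that
this rule produces; the rule id ``11`` and the replica count used below are
illustrative::

    # export the current CRUSH map in binary form
    ceph osd getcrushmap -o crush.bin
    # show the OSDs that rule 11 maps for a sample of PGs when 3 replicas
    # are requested; per the note above, expect one SSD plus HDD OSDs
    crushtool -i crush.bin --test --rule 11 --num-rep 3 --show-mappings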

This extra storage requirement can be avoided by placing SSDs and HDDs in
different hosts, with the tradeoff that hosts with SSDs will receive all client
requests. You may thus consider faster CPU(s) for SSD hosts and more modest
ones for HDD nodes, since the latter will normally only service recovery
operations. Here the CRUSH roots ``ssd_hosts`` and ``hdd_hosts`` strictly
must not contain the same servers::

    rule mixed_replicated_rule_two {
            id 1
            type replicated
            min_size 1
            max_size 10
            step take ssd_hosts class ssd
            step chooseleaf firstn 1 type host
            step emit
            step take hdd_hosts class hdd
            step chooseleaf firstn -1 type host
            step emit
    }
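
Either rule can then be assigned to a replicated pool; the pool name
``hybridpool`` below is hypothetical::

    ceph osd pool set hybridpool crush_rule mixed_replicated_rule_two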

Note also that on failure of an SSD, requests to a PG will be served temporarily
from a (slower) HDD OSD until the PG's data has been replicated onto the
replacement primary SSD OSD.
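
To confirm which OSD is currently acting as the primary for a given PG, its
mapping can be inspected with::

    ceph pg map <pgid>

The first OSD in the reported acting set is the primary.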