diff --git a/doc/rados/configuration/mon-config-ref.rst b/doc/rados/configuration/mon-config-ref.rst
index 01a32524839..333b0f9b3a4 100644
--- a/doc/rados/configuration/mon-config-ref.rst
+++ b/doc/rados/configuration/mon-config-ref.rst
@@ -954,6 +954,7 @@ Client
 :Type: 64-bit Integer Unsigned
 :Default: ``100ul << 20``
 
+.. _pool-settings:
 
 Pool settings
 =============
diff --git a/doc/rados/operations/crush-map.rst b/doc/rados/operations/crush-map.rst
index 2558a42fee3..68c12d1d5f7 100644
--- a/doc/rados/operations/crush-map.rst
+++ b/doc/rados/operations/crush-map.rst
@@ -959,39 +959,99 @@ Primary Affinity
 ================
 
 When a Ceph Client reads or writes data, it first contacts the primary OSD in
-each affected PG's acting set. In the acting set ``[2, 3, 4]``, ``osd.2`` is
-listed first and thus is the primary (lead). Sometimes an
+each affected PG's acting set. By default, the first OSD in the acting set is
+the primary. For example, in the acting set ``[2, 3, 4]``, ``osd.2`` is listed
+first and thus is the primary (lead) OSD. Sometimes we know that an
 OSD is less well suited to act as the lead than are other OSDs (e.g., it has
-a slow disk or a slow controller). To prevent performance bottlenecks
+a slow drive or a slow controller). To prevent performance bottlenecks
 (especially on read operations) while maximizing utilization of your hardware,
-you can set a Ceph OSD's primary affinity so that CRUSH is less likely to use
-the OSD as a primary in an acting set. ::
+you can influence the selection of primary OSDs by adjusting primary affinity
+values, or by crafting a CRUSH rule that selects preferred OSDs first.
+
+Tuning primary OSD selection is mainly useful for replicated pools, because
+by default read operations are served from the primary OSD of each PG.
+For erasure-coded (EC) pools, a way to speed up read operations is to enable
+**fast read** as described in :ref:`pool-settings`.
+
+A common scenario for primary affinity is a cluster that contains a mix of
+drive sizes, for example older racks with 1.9 TB SATA SSDs and newer racks
+with 3.84 TB SATA SSDs. On average the latter will be assigned double the
+number of PGs and thus will serve double the number of write and read
+operations, and so will be busier than the former. A rough assignment of
+primary affinity inversely proportional to OSD size won't be 100% optimal,
+but it can readily achieve a 15% improvement in overall read throughput by
+utilizing SATA interface bandwidth and CPU cycles more evenly.
+
+By default, all Ceph OSDs have a primary affinity of ``1``, which indicates
+that any OSD may act as a primary with equal probability.
+
+You can reduce a Ceph OSD's primary affinity so that CRUSH is less likely
+to choose that OSD as primary in a PG's acting set::
 
 	ceph osd primary-affinity <osd-id> <weight>
 
-Primary affinity is ``1`` by default (*i.e.,* an OSD may act as a primary). You
-may set the OSD primary range as a real number in the range ``[0-1]``, where ``0``
-indicates that the OSD may **NOT** be used as a primary and ``1`` means that an
-OSD may be used as a primary. When the weight is ``< 1``, it is less likely that
-CRUSH will select the Ceph OSD Daemon to act as a primary. The process for
+You may set an OSD's primary affinity to a real number in the range
+``[0-1]``, where ``0`` indicates that the OSD may **NOT** be used as a primary
+and ``1`` indicates that the OSD may be used as a primary. The lower the
+affinity value, the less likely it is that CRUSH will select that OSD as a
+primary. The process for
 selecting the lead OSD is more nuanced than a simple probability based on
 relative affinity values, but measurable results can be achieved even with
 first-order approximations of desirable values.
 
-There are occasional clusters
-that balance cost and performance by mixing SSDs and HDDs in the same pool.
-Careful application of CRUSH rules can direct each PG's acting set to contain
-exactly one SSD OSD with the balance HDDs. By using primary affinity one can
-direct most or all read operations to the SSD in the acting set. This is
-a tricky setup to maintain and it is discouraged, but it's a useful example.
+Custom CRUSH Rules
+------------------
 
-Another, more common scenario for primary affinity is when a cluster contains
-a mix of drive sizes, for example older racks with 1.9 TB SATA SSDS and newer racks with
-3.84TB SATA SSDs. On average the latter will be assigned double the number of
-PGs and thus will serve double the number of write and read operations, thus
-they'll be busier than the former. A rough application of primary affinity in
-proportion to OSD size won't be 100% optimal, but it can readily achieve a 15%
-improvement in overall read throughput by utilizing SATA interface bandwidth
-and CPU cycles more evenly.
+There are occasional clusters that balance cost and performance by mixing SSDs
+and HDDs in the same replicated pool. By setting the primary affinity of HDD
+OSDs to ``0``, one can direct read operations to the SSD in each acting set. An
+alternative is to define a CRUSH rule that always selects an SSD OSD as the
+first OSD, then selects HDDs for the remaining OSDs. Thus, each PG's acting
+set will contain exactly one SSD OSD as the primary, with the balance on HDDs.
+
+For example, the CRUSH rule below::
+
+    rule mixed_replicated_rule {
+        id 11
+        type replicated
+        min_size 1
+        max_size 10
+        step take default class ssd
+        step chooseleaf firstn 1 type host
+        step emit
+        step take default class hdd
+        step chooseleaf firstn 0 type host
+        step emit
+    }
+
+chooses an SSD as the first OSD. Note that for an ``N``-times replicated pool
+this rule selects ``N+1`` OSDs to guarantee that ``N`` copies are on different
+hosts, because the first SSD OSD might be co-located with any of the ``N`` HDD
+OSDs.
+
+This extra storage requirement can be avoided by placing SSDs and HDDs in
+different hosts, with the tradeoff that the hosts with SSDs will receive all
+client requests. You may thus consider faster CPU(s) for SSD hosts and more
+modest ones for HDD hosts, since the latter will normally only service
+recovery operations. Here the CRUSH roots ``ssd_hosts`` and ``hdd_hosts``
+must not contain the same servers::
+
+    rule mixed_replicated_rule_two {
+        id 1
+        type replicated
+        min_size 1
+        max_size 10
+        step take ssd_hosts class ssd
+        step chooseleaf firstn 1 type host
+        step emit
+        step take hdd_hosts class hdd
+        step chooseleaf firstn -1 type host
+        step emit
+    }
+
+Note also that on failure of an SSD, requests to a PG will be served
+temporarily from a (slower) HDD OSD until the PG's data has been replicated
+onto the replacement primary SSD OSD.
+
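As a quick illustration of the inverse-to-size primary affinity guidance added
above, here is a minimal CLI sketch; the OSD IDs ``osd.3`` and ``osd.12`` and
the affinity values are hypothetical, not part of the patch::

    # Hypothetical mixed-capacity cluster: osd.3 is a 1.9 TB SSD and osd.12 a
    # 3.84 TB SSD. Give the larger OSD roughly half the affinity so that the
    # smaller OSDs are chosen as primaries more often; exact values need tuning.
    ceph osd primary-affinity osd.12 0.5
    ceph osd primary-affinity osd.3 1.0

The current values are visible in the primary-affinity column of
``ceph osd tree`` output.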
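Similarly, one way to install and use a rule such as ``mixed_replicated_rule``
is the usual decompile/edit/recompile cycle. This is only a sketch; the file
names ``crush.bin``/``crush.txt``/``crush.new`` and the pool name ``mypool``
are placeholders::

    # Export and decompile the cluster's current CRUSH map.
    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt

    # Add the rule to crush.txt, then recompile and inject the edited map.
    crushtool -c crush.txt -o crush.new
    ceph osd setcrushmap -i crush.new

    # Point an existing replicated pool at the new rule.
    ceph osd pool set mypool crush_rule mixed_replicated_rule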