============================
CRUSH MSR (Multi-step Retry)
============================

Motivation
----------

Conventional CRUSH has an important limitation: a rule with
multiple `choose` steps that hits an `out` OSD cannot retry
prior steps.  As an example, with a rule like
::

    rule replicated_rule_1 {
        ...
        step take default class hdd
        step chooseleaf firstn 3 type host
        step emit
    }

one might expect that if all of the OSDs on a particular host
are marked out, mappings including those OSDs would end up
on another host (provided that there are enough hosts).  Indeed,
that's what will happen.  Moreover, if 1/8 of the OSDs on a host are
marked out, roughly 1/8 of the PGs mapped to that host will end
up remapped to some other host, keeping overall per-OSD utilization
even.

Suppose, instead, the rule were written like this:
::

    rule replicated_rule_1 {
        ...
        step take default class hdd
        step choose firstn 3 type host
        step choose firstn 1 type osd
        step emit
    }

The behavior would be very similar as long as no OSDs are marked
out.  However, if an OSD is marked out, any PGs mapped to that
OSD will be remapped to other OSDs on the same host, leaving
those OSDs over-utilized relative to OSDs on other hosts.
Moreover, if all of the OSDs on a host are marked out, mappings
that happen to hit that host will fail, resulting in undersized PGs.

As long as the goal is to split N OSDs between N failure domains,
the solution is simply to use the `chooseleaf` variant above.  However,
consider a use case where we want to split an 8+6 EC encoding over 4
hosts in order to tolerate the loss of a host and an OSD on another
host with 1.75x storage overhead.  The rule would have to look
something like:
::

    rule ecpool-86 {
        ...
        step take default class hdd
        step choose indep 4 type host
        step choose indep 4 type osd
        step emit
    }

This does split up to 16 OSDs between 4 hosts (with an 8+6 code,
it would put 4 OSDs on each of the first 3 and 2 on the last) and
meets our failure requirements.  However, for the reasons outlined
above, it will behave poorly as OSDs are marked out if there are
other hosts to rebalance to.  `chooseleaf` is not a solution here
because it does not support mapping more than one leaf below the
specified type.
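
The numbers in this example can be checked with a few lines of
arithmetic.  The following is only an illustrative sketch in Python;
the variable names are hypothetical and not taken from Ceph, and it
assumes shards are packed onto hosts in order, 4 per host, as
described above.

.. code-block:: python

    # Erasure code parameters from the example: 8 data + 6 coding shards.
    k, m = 8, 6
    n_hosts, per_host = 4, 4

    shards = k + m
    overhead = shards / k  # raw bytes stored per byte of user data

    # Fill hosts in order, at most per_host shards on each.
    layout = [min(per_host, max(0, shards - i * per_host))
              for i in range(n_hosts)]

    # Worst case named in the text: a whole host plus one OSD elsewhere.
    worst_case_loss = max(layout) + 1
    survives = worst_case_loss <= m  # an 8+6 code tolerates up to 6 lost shards

    print(overhead, layout, worst_case_loss, survives)

The rule offers up to 16 slots (4 hosts times 4 OSDs), of which the
8+6 code uses 14.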

MSR
---

CRUSH MSR (Multi-step Retry) rules solve the above problem by using a
different descent algorithm which retries all of the steps upon
hitting an out OSD.  Where classic CRUSH is breadth first (for each
step, it fully populates the vector before proceeding to the next
step), MSR rules are depth first -- for each choice, we recursively
descend through all of the steps before continuing with the next
choice.  The above use case can be satisfied with the following rule:

::

    rule ecpool-86 {
        type msr_indep
        ...
        step take default class hdd
        step choosemsr 4 type host
        step choosemsr 4 type osd
        step emit
    }

As with the `chooseleaf` example at the top, as OSDs are marked out,
mappings that included those OSDs are remapped proportionately to other
hosts so long as there are extras available.  For details on how that works while
still preserving failure domain isolation, see the comments in
mapper.c:crush_msr_choose.
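
The difference between the two descent orders can be illustrated with a
toy model.  The sketch below is *not* the algorithm in mapper.c: the
hash function, the helper names, and the ranking-based selection are all
assumptions made for illustration, and it does not reproduce the
proportional rebalancing described above.  It only shows that fixing the
host choice before descending leaves holes when a chosen host has no
usable OSDs, while retrying the host choice after a failed descent does
not.

.. code-block:: python

    import hashlib


    def h(*key):
        """Deterministic stand-in for CRUSH's hash: maps a key to an integer."""
        digest = hashlib.sha256(repr(key).encode()).digest()
        return int.from_bytes(digest[:8], "big")


    def live_osds(hosts, out, pg, host, count):
        """Up to `count` OSDs on `host` that are not marked out, in hash order."""
        ranked = sorted(hosts[host], key=lambda osd: h(pg, osd))
        return [osd for osd in ranked if osd not in out][:count]


    def breadth_first(hosts, out, pg, n_hosts, per_host):
        """Fix the host list first; a host that cannot supply OSDs leaves holes."""
        chosen = sorted(hosts, key=lambda name: h(pg, name))[:n_hosts]
        result = []
        for host in chosen:
            live = live_osds(hosts, out, pg, host, per_host)
            result.extend(live + [None] * (per_host - len(live)))
        return result


    def depth_first(hosts, out, pg, n_hosts, per_host):
        """Finish each descent before moving on; a failed descent retries the host."""
        result = []
        filled = 0
        for host in sorted(hosts, key=lambda name: h(pg, name)):
            if filled == n_hosts:
                break
            live = live_osds(hosts, out, pg, host, per_host)
            if len(live) < per_host:
                continue  # descent failed: go back and try another host
            result.extend(live)
            filled += 1
        return result


    # Six hosts with four OSDs each; every OSD on host0 is marked out.
    hosts = {f"host{i}": [f"osd.{4 * i + j}" for j in range(4)] for i in range(6)}
    out = set(hosts["host0"])
    holes = sum(None in breadth_first(hosts, out, pg, 4, 4) for pg in range(100))
    full = sum(len(depth_first(hosts, out, pg, 4, 4)) == 16 for pg in range(100))
    print(f"breadth-first mappings with holes: {holes}/100")
    print(f"depth-first mappings fully populated: {full}/100")

When no OSDs are marked out, both functions in this toy return the same
mapping, which mirrors the observation that the two rule styles behave
similarly until OSDs are marked out.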

Rule Structure
--------------

CRUSH MSR rules are CRUSH rules with type CRUSH_RULE_TYPE_MSR_FIRSTN
or CRUSH_RULE_TYPE_MSR_INDEP (see mapper.c:rule_type_is_msr).  Unlike
classic CRUSH rules, individual steps do not specify firstn or
indep.  The output order is instead defined by the rule type for the
whole rule.

MSR rules have some structural differences from conventional rules:

- The rule type determines whether the mapping is FIRSTN or INDEP.
  Because the descent can retry steps, it doesn't make sense
  for steps to individually specify output order, and there are no
  known use cases that would benefit from it.
- MSR rules *must* be structured as a (possibly empty) prefix of
  config steps (CRUSH_RULE_SET_CHOOSE_MSR*) followed by a sequence of
  EMIT blocks, each consisting of a TAKE step, a sequence of CHOOSE_MSR
  steps, and a terminating EMIT step.
- MSR steps must be `choosemsr`.  `choose` and `chooseleaf` are not
  permitted.
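
The layout constraint above can be expressed as a small checker.  This
is a hedged sketch rather than the validation performed in mapper.c: the
step names are hypothetical string tags, and it assumes that every EMIT
block contains at least one `choosemsr` step and that a rule contains at
least one EMIT block, neither of which is stated explicitly above.

.. code-block:: python

    # Hypothetical tags for the step kinds; real rules use CRUSH op codes.
    CONFIG, TAKE, CHOOSE_MSR, EMIT = "config", "take", "choosemsr", "emit"


    def is_valid_msr_layout(steps):
        """Check: config prefix, then blocks of TAKE, CHOOSE_MSR..., EMIT."""
        i = 0
        # Optional prefix of config steps.
        while i < len(steps) and steps[i] == CONFIG:
            i += 1
        blocks = 0
        while i < len(steps):
            if steps[i] != TAKE:
                return False
            i += 1
            n_choose = 0
            while i < len(steps) and steps[i] == CHOOSE_MSR:
                i += 1
                n_choose += 1
            # Assumption: at least one choosemsr step, then a closing EMIT.
            if n_choose == 0 or i == len(steps) or steps[i] != EMIT:
                return False
            i += 1
            blocks += 1
        return blocks > 0


    print(is_valid_msr_layout([TAKE, CHOOSE_MSR, CHOOSE_MSR, EMIT]))
    print(is_valid_msr_layout([TAKE, "choose", EMIT]))

The first call corresponds to the `ecpool-86` MSR rule shown earlier and
is accepted; the second uses a classic `choose` step and is rejected.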

Working Space
-------------

MSR rules also have different requirements for working space.
Conventional CRUSH requires 3 vectors of size result_max to use for
working space -- two to alternate between as it processes each step and
one, additionally, for `chooseleaf`.  MSR rules need N vectors, where N is
the number of `choosemsr` steps in the longest EMIT block, since the descent
needs to retain all of the choices made along the way.

See mapper.h/c:crush_work_size, crush_msr_scan_rule for details.
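
As a rough illustration of how N is determined, the sketch below counts
the `choosemsr` steps in each EMIT block and returns the largest count.
The step tags are hypothetical strings, and this is not the code in
crush_msr_scan_rule; it only restates the rule given above.

.. code-block:: python

    def msr_vectors_needed(steps):
        """Largest number of choosemsr steps found in any single EMIT block."""
        longest = current = 0
        for op in steps:
            if op == "take":
                current = 0  # a TAKE step starts a new EMIT block
            elif op == "choosemsr":
                current += 1
                longest = max(longest, current)
        return longest


    # The ecpool-86 MSR rule has one block with two choosemsr steps.
    print(msr_vectors_needed(["take", "choosemsr", "choosemsr", "emit"]))

A rule with two EMIT blocks containing one and three `choosemsr` steps
would need three vectors under this reading, since only the longest
block matters.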

Implementation
--------------

mapper.h/c:crush_do_rule internally branches to
mapper.c:crush_msr_do_rule for rules of type CRUSH_RULE_TYPE_MSR_*
(see mapper.c:rule_type_is_msr).

MSR-related functions in mapper.c are annotated with more details
about the algorithm.

.. Commit: doc/dev/crush-msr.rst: add developer summary of crush msr
   Signed-off-by: Samuel Just <sjust@redhat.com>
   2024-02-01 19:38:13 +00:00