Merge pull request #55096 from athanatos/sjust/for-review/wip-crush-msr

crush: add multistep retry rules

Reviewed-by: Laura Flores <lflores@redhat.com>
Yuri Weinstein 2024-01-26 11:57:53 -08:00 committed by GitHub
commit 37d5d931b0
20 changed files with 2523 additions and 192 deletions

View File

@ -419,7 +419,7 @@ centers for three-way replication, and yet another rule for erasure coding acros
six storage devices. For a detailed discussion of CRUSH rules, see **Section 3.2**
of `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_.
A rule takes the following form::
A normal CRUSH rule takes the following form::
rule <rulename> {
@ -430,6 +430,18 @@ A rule takes the following form::
step emit
}
CRUSH MSR rules are a distinct type of CRUSH rule which supports retrying steps
and provides better support for configurations that require multiple OSDs within
each failure domain. MSR rules take the following form::
rule <rulename> {
id [a unique integer ID]
type [msr_indep|msr_firstn]
step take <bucket-name> [class <device-class>]
step choosemsr <N> type <bucket-type>
step emit
}
``id``
:Description: A unique integer that identifies the rule.
@ -441,12 +453,14 @@ A rule takes the following form::
``type``
:Description: Denotes the type of replication strategy to be enforced by the
rule.
rule. ``msr_firstn`` and ``msr_indep`` select a distinct descent
algorithm that supports retrying steps within the rule and
therefore supports multiple OSDs per failure domain.
:Purpose: A component of the rule mask.
:Type: String
:Required: Yes
:Default: ``replicated``
:Valid Values: ``replicated`` or ``erasure``
:Valid Values: ``replicated``, ``erasure``, ``msr_firstn``, ``msr_indep``
``step take <bucket-name> [class <device-class>]``
@ -525,6 +539,16 @@ A rule takes the following form::
final CRUSH mapping transformation is therefore 1, 2, 3, 4, 5
→ 1, 2, 6, 4, 5.
``step choosemsr {num} type {bucket-type}``
:Description: Selects ``{num}`` buckets of type ``{bucket-type}``. Rules of type
``msr_firstn`` and ``msr_indep`` must use ``choosemsr`` rather than ``choose`` or ``chooseleaf``.
- If ``{num} == 0``, choose ``pool-num-replicas`` buckets (as many buckets as are available).
- If ``pool-num-replicas > {num} > 0``, choose that many buckets.
:Purpose: The choose step required for ``msr_firstn`` and ``msr_indep`` rules.
:Prerequisite: Follows ``step take`` and precedes ``step emit``
:Example: ``step choosemsr 3 type host``
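As an illustration of the MSR rule form described above, the following minimal C++ sketch renders the text of an ``msr_indep`` rule with two ``choosemsr`` steps: one that selects failure domains and one that selects OSDs within each domain. The function name ``render_msr_rule`` and its parameters are hypothetical, and the ``osd`` type name used in the second step is an assumption based on the usual CRUSH type hierarchy. This is not code from Ceph.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: builds the textual form of an msr_indep rule.
// The "osd" type name in the second choosemsr step is an assumption.
std::string render_msr_rule(const std::string& name, int id,
                            const std::string& root,
                            int num_failure_domains,
                            const std::string& failure_domain_type,
                            int osds_per_failure_domain) {
  std::ostringstream out;
  out << "rule " << name << " {\n"
      << "\tid " << id << "\n"
      << "\ttype msr_indep\n"
      << "\tstep take " << root << "\n"
      // First pick the failure domains (for example, hosts) ...
      << "\tstep choosemsr " << num_failure_domains
      << " type " << failure_domain_type << "\n"
      // ... then pick several OSDs inside each chosen failure domain.
      << "\tstep choosemsr " << osds_per_failure_domain << " type osd\n"
      << "\tstep emit\n"
      << "}\n";
  return out.str();
}
```

Calling ``render_msr_rule("ecpool", 1, "default", 3, "host", 2)`` would describe a rule that places two OSDs in each of three hosts.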
.. _crush-reclassify:
Migrating from a legacy SSD rule to device classes

View File

@ -709,6 +709,13 @@ The relevant erasure-code profile properties are as follows:
[default: ``default``].
* **crush-failure-domain**: the CRUSH bucket type used in the distribution of
erasure-coded shards [default: ``host``].
* **crush-osds-per-failure-domain**: the maximum number of OSDs to place in each
failure domain [default: ``1``]. A value greater than 1 causes a
CRUSH MSR rule to be created (see below). Must be specified if
``crush-num-failure-domains`` is specified.
* **crush-num-failure-domains**: the number of failure domains to map. Must be
specified if ``crush-osds-per-failure-domain`` is specified. Results in
a CRUSH MSR rule being created.
* **crush-device-class**: the device class on which to place data [default:
none, which means that all devices are used].
* **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the
@ -726,6 +733,21 @@ The relevant erasure-code profile properties are as follows:
argument is omitted, then Ceph will create the CRUSH rule automatically.
CRUSH MSR Rules
---------------
Creating an erasure-code profile with a ``crush-osds-per-failure-domain``
value greater than one causes a CRUSH MSR rule to be created
instead of a normal CRUSH rule. Normal CRUSH rules cannot retry prior
steps when an ``out`` OSD is encountered; they rely on ``chooseleaf`` steps to
permit moving OSDs to new hosts. However, ``chooseleaf`` rules do not
support more than a single OSD per failure domain. MSR rules, new in
Squid, support multiple OSDs per failure domain by retrying all prior
steps when an ``out`` OSD is encountered. Using MSR rules requires that
both OSDs and clients support the ``CRUSH_MSR`` feature bit
(Squid or newer).
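The choice between a normal rule and an MSR rule can be sketched as a small decision function. The version below mirrors the branch structure of ``ErasureCode::create_rule`` in this change, but the names ``RuleKind`` and ``pick_rule_kind`` are hypothetical and the sketch leaves out everything else that rule creation does.

```cpp
#include <cassert>

// Hypothetical names, for illustration only.
enum class RuleKind { Simple, Msr, Invalid };

// osds_per_failure_domain: value of crush-osds-per-failure-domain
// num_failure_domains:     value of crush-num-failure-domains
RuleKind pick_rule_kind(int osds_per_failure_domain, int num_failure_domains) {
  // At most one OSD per failure domain: a normal (non-MSR) rule suffices.
  if (osds_per_failure_domain <= 1) {
    return RuleKind::Simple;
  }
  // More than one OSD per failure domain requires an explicit,
  // positive number of failure domains.
  if (num_failure_domains < 1) {
    return RuleKind::Invalid;
  }
  return RuleKind::Msr;
}
```

Under these assumptions, a profile with ``crush-osds-per-failure-domain=2`` and ``crush-num-failure-domains=3`` maps to an MSR rule, while the same profile without ``crush-num-failure-domains`` is rejected.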
Deleting rules
--------------

View File

@ -11,7 +11,9 @@ tasks:
k: 4
m: 2
technique: reed_sol_van
crush-failure-domain: osd
crush-failure-domain: host
crush-osds-per-failure-domain: 2
crush-num-failure-domains: 3
op_weights:
read: 100
write: 0

View File

@ -79,7 +79,7 @@ class ECPTest(DashboardTestCase):
self.assertStatus(201)
self._get('/api/erasure_code_profile/lrc')
self.assertJsonBody({
self.assertJsonSubset({
'crush-device-class': '',
'crush-failure-domain': 'host',
'crush-root': 'default',

View File

@ -321,6 +321,13 @@ int CrushCompiler::decompile(ostream &out)
if (crush.get_allowed_bucket_algs() != CRUSH_LEGACY_ALLOWED_BUCKET_ALGS)
out << "tunable allowed_bucket_algs " << crush.get_allowed_bucket_algs()
<< "\n";
if (crush.has_nondefault_tunables_msr()) {
out << "tunable msr_descents " << crush.get_msr_descents()
<< "\n";
out << "tunable msr_collision_tries "
<< crush.get_msr_collision_tries()
<< "\n";
}
out << "\n# devices\n";
for (int i=0; i<crush.get_max_devices(); i++) {
@ -363,12 +370,18 @@ int CrushCompiler::decompile(ostream &out)
out << "\tid " << i << "\n";
switch (crush.get_rule_type(i)) {
case CEPH_PG_TYPE_REPLICATED:
case CRUSH_RULE_TYPE_REPLICATED:
out << "\ttype replicated\n";
break;
case CEPH_PG_TYPE_ERASURE:
case CRUSH_RULE_TYPE_ERASURE:
out << "\ttype erasure\n";
break;
case CRUSH_RULE_TYPE_MSR_FIRSTN:
out << "\ttype msr_firstn\n";
break;
case CRUSH_RULE_TYPE_MSR_INDEP:
out << "\ttype msr_indep\n";
break;
default:
out << "\ttype " << crush.get_rule_type(i) << "\n";
}
@ -422,6 +435,15 @@ int CrushCompiler::decompile(ostream &out)
out << "\tstep set_chooseleaf_stable " << crush.get_rule_arg1(i, j)
<< "\n";
break;
case CRUSH_RULE_SET_MSR_DESCENTS:
out << "\tstep set_msr_descents " << crush.get_rule_arg1(i, j)
<< "\n";
break;
case CRUSH_RULE_SET_MSR_COLLISION_TRIES:
out << "\tstep set_msr_collision_tries "
<< crush.get_rule_arg1(i, j)
<< "\n";
break;
case CRUSH_RULE_CHOOSE_FIRSTN:
out << "\tstep choose firstn "
<< crush.get_rule_arg1(i, j)
@ -450,6 +472,13 @@ int CrushCompiler::decompile(ostream &out)
print_type_name(out, crush.get_rule_arg2(i, j), crush);
out << "\n";
break;
case CRUSH_RULE_CHOOSE_MSR:
out << "\tstep choosemsr "
<< crush.get_rule_arg1(i, j)
<< " type ";
print_type_name(out, crush.get_rule_arg2(i, j), crush);
out << "\n";
break;
}
}
out << "}\n";
@ -532,6 +561,10 @@ int CrushCompiler::parse_tunable(iter_t const& i)
crush.set_straw_calc_version(val);
else if (name == "allowed_bucket_algs")
crush.set_allowed_bucket_algs(val);
else if (name == "msr_descents")
crush.set_msr_descents(val);
else if (name == "msr_collision_tries")
crush.set_msr_collision_tries(val);
else {
err << "tunable " << name << " not recognized" << std::endl;
return -1;
@ -781,9 +814,13 @@ int CrushCompiler::parse_rule(iter_t const& i)
string tname = string_node(i->children[start+2]);
int type;
if (tname == "replicated")
type = CEPH_PG_TYPE_REPLICATED;
type = CRUSH_RULE_TYPE_REPLICATED;
else if (tname == "erasure")
type = CEPH_PG_TYPE_ERASURE;
type = CRUSH_RULE_TYPE_ERASURE;
else if (tname == "msr_firstn")
type = CRUSH_RULE_TYPE_MSR_FIRSTN;
else if (tname == "msr_indep")
type = CRUSH_RULE_TYPE_MSR_INDEP;
else
ceph_abort();
@ -905,6 +942,18 @@ int CrushCompiler::parse_rule(iter_t const& i)
crush.set_rule_step_set_chooseleaf_stable(ruleno, step++, val);
}
break;
case crush_grammar::_step_set_msr_descents:
{
int val = int_node(s->children[1]);
crush.set_rule_step_set_msr_descents(ruleno, step++, val);
}
break;
case crush_grammar::_step_set_msr_collision_tries:
{
int val = int_node(s->children[1]);
crush.set_rule_step_set_msr_collision_tries(ruleno, step++, val);
}
break;
case crush_grammar::_step_choose:
case crush_grammar::_step_chooseleaf:
@ -932,6 +981,17 @@ int CrushCompiler::parse_rule(iter_t const& i)
}
break;
case crush_grammar::_step_choose_msr:
{
string type = string_node(s->children[3]);
if (!type_id.count(type)) {
err << "in rule '" << rname << "' type '" << type << "' not defined" << std::endl;
return -1;
}
crush.set_rule_step_choose_msr(ruleno, step++, int_node(s->children[1]), type_id[type]);
}
break;
case crush_grammar::_step_emit:
crush.set_rule_step_emit(ruleno, step++);
break;
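
The ``choosemsr`` step handled above has the textual shape ``step choosemsr <N> type <bucket-type>``, with no ``firstn`` or ``indep`` keyword. As a rough standalone illustration of that shape (not the Boost.Spirit grammar that the compiler actually uses), a hypothetical ``parse_choosemsr_step`` helper could be written with ``std::istringstream``:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: parses "step choosemsr <N> type <bucket-type>".
// Returns true and fills num/type_name only when the whole line matches.
bool parse_choosemsr_step(const std::string& line, int& num,
                          std::string& type_name) {
  std::istringstream in(line);
  std::string step_kw, op, type_kw, parsed_type;
  int parsed_num = 0;
  if (!(in >> step_kw >> op >> parsed_num >> type_kw >> parsed_type)) {
    return false;  // too few tokens, or <N> is not an integer
  }
  std::string extra;
  if (in >> extra) {
    return false;  // trailing tokens are not allowed
  }
  if (step_kw != "step" || op != "choosemsr" || type_kw != "type") {
    return false;
  }
  num = parsed_num;
  type_name = parsed_type;
  return true;
}
```

Unlike the real compiler, this sketch does not check that the bucket type is defined in the map. In the change above, that check is the ``type_id.count(type)`` lookup.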

View File

@ -135,6 +135,29 @@ bool CrushWrapper::is_v5_rule(unsigned ruleid) const
return false;
}
bool CrushWrapper::has_msr_rules() const
{
for (unsigned i=0; i<crush->max_rules; i++) {
if (is_msr_rule(i)) {
return true;
}
}
return false;
}
bool CrushWrapper::is_msr_rule(unsigned ruleid) const
{
if (ruleid >= crush->max_rules)
return false;
crush_rule *r = crush->rules[ruleid];
if (!r)
return false;
return r->type == CRUSH_RULE_TYPE_MSR_INDEP ||
r->type == CRUSH_RULE_TYPE_MSR_FIRSTN;
}
bool CrushWrapper::has_choose_args() const
{
return !choose_args.empty();
@ -2238,6 +2261,7 @@ void CrushWrapper::reweight_bucket(
int CrushWrapper::add_simple_rule_at(
string name, string root_name,
string failure_domain_name,
int num_failure_domains,
string device_class,
string mode, int rule_type,
int rno,
@ -2309,17 +2333,19 @@ int CrushWrapper::add_simple_rule_at(
}
crush_rule_set_step(rule, step++, CRUSH_RULE_TAKE, root, 0);
if (type)
crush_rule_set_step(rule, step++,
mode == "firstn" ? CRUSH_RULE_CHOOSELEAF_FIRSTN :
CRUSH_RULE_CHOOSELEAF_INDEP,
CRUSH_CHOOSE_N,
type);
crush_rule_set_step(
rule, step++,
mode == "firstn" ? CRUSH_RULE_CHOOSELEAF_FIRSTN :
CRUSH_RULE_CHOOSELEAF_INDEP,
num_failure_domains <= 0 ? CRUSH_CHOOSE_N : num_failure_domains,
type);
else
crush_rule_set_step(rule, step++,
mode == "firstn" ? CRUSH_RULE_CHOOSE_FIRSTN :
CRUSH_RULE_CHOOSE_INDEP,
CRUSH_CHOOSE_N,
0);
crush_rule_set_step(
rule, step++,
mode == "firstn" ? CRUSH_RULE_CHOOSE_FIRSTN :
CRUSH_RULE_CHOOSE_INDEP,
num_failure_domains <= 0 ? CRUSH_CHOOSE_N : num_failure_domains,
0);
crush_rule_set_step(rule, step++, CRUSH_RULE_EMIT, 0, 0);
int ret = crush_add_rule(crush, rule, rno);
@ -2335,13 +2361,125 @@ int CrushWrapper::add_simple_rule_at(
int CrushWrapper::add_simple_rule(
string name, string root_name,
string failure_domain_name,
int num_failure_domains,
string device_class,
string mode, int rule_type,
ostream *err)
{
return add_simple_rule_at(name, root_name, failure_domain_name, device_class,
mode,
rule_type, -1, err);
return add_simple_rule_at(
name, root_name, failure_domain_name, num_failure_domains,
device_class,
mode,
rule_type, -1, err);
}
int CrushWrapper::add_multi_osd_per_failure_domain_rule_at(
string name, string root_name, string failure_domain_name,
int num_failure_domains,
int osds_per_failure_domain,
string device_class,
crush_rule_type rule_type,
int rno,
ostream *err)
{
if (rule_exists(name)) {
if (err)
*err << "rule " << name << " exists";
return -EEXIST;
}
if (rno >= 0) {
if (rule_exists(rno)) {
if (err)
*err << "rule with ruleno " << rno << " exists";
return -EEXIST;
}
} else {
for (rno = 0; rno < get_max_rules(); rno++) {
if (!rule_exists(rno))
break;
}
}
if (!name_exists(root_name)) {
if (err)
*err << "root item " << root_name << " does not exist";
return -ENOENT;
}
int root = get_item_id(root_name);
int type = 0;
if (failure_domain_name.length()) {
type = get_type_id(failure_domain_name);
if (type < 0) {
if (err)
*err << "unknown type " << failure_domain_name;
return -EINVAL;
}
}
if (device_class.size()) {
if (!class_exists(device_class)) {
if (err)
*err << "device class " << device_class << " does not exist";
return -EINVAL;
}
int c = get_class_id(device_class);
if (class_bucket.count(root) == 0 ||
class_bucket[root].count(c) == 0) {
if (err)
*err << "root " << root_name << " has no devices with class "
<< device_class;
return -EINVAL;
}
root = class_bucket[root][c];
}
if (rule_type != CRUSH_RULE_TYPE_MSR_INDEP &&
rule_type != CRUSH_RULE_TYPE_MSR_FIRSTN) {
if (err)
*err << "unknown rule_type " << rule_type;
return -EINVAL;
}
int steps = 4;
crush_rule *rule = crush_make_rule(steps, rule_type);
ceph_assert(rule);
int step = 0;
crush_rule_set_step(rule, step++, CRUSH_RULE_TAKE, root, 0);
crush_rule_set_step(rule, step++,
CRUSH_RULE_CHOOSE_MSR,
num_failure_domains,
type);
crush_rule_set_step(rule, step++,
CRUSH_RULE_CHOOSE_MSR,
osds_per_failure_domain,
0);
crush_rule_set_step(rule, step++, CRUSH_RULE_EMIT, 0, 0);
int ret = crush_add_rule(crush, rule, rno);
if(ret < 0) {
*err << "failed to add rule " << rno << " because " << cpp_strerror(ret);
return ret;
}
set_rule_name(rno, name);
have_rmaps = false;
return rno;
}
int CrushWrapper::add_indep_multi_osd_per_failure_domain_rule(
string name, string root_name,
string failure_domain_name,
int num_failure_domains,
int osds_per_failure_domain,
string device_class,
ostream *err)
{
return add_multi_osd_per_failure_domain_rule_at(
name, root_name,
failure_domain_name,
num_failure_domains,
osds_per_failure_domain,
device_class,
CRUSH_RULE_TYPE_MSR_INDEP,
-1,
err);
}
float CrushWrapper::_get_take_weight_osd_map(int root,
@ -3080,6 +3218,10 @@ void CrushWrapper::encode(bufferlist& bl, uint64_t features) const
}
}
}
if (HAVE_FEATURE(features, CRUSH_MSR)) {
encode(crush->msr_descents, bl);
encode(crush->msr_collision_tries, bl);
}
}
static void decode_32_or_64_string_map(map<int32_t,string>& m, bufferlist::const_iterator& blp)
@ -3230,6 +3372,12 @@ void CrushWrapper::decode(bufferlist::const_iterator& blp)
choose_args[choose_args_index] = arg_map;
}
}
if (!blp.end()) {
decode(crush->msr_descents, blp);
decode(crush->msr_collision_tries, blp);
} else {
set_default_msr_tunables();
}
update_choose_args(nullptr); // in case we decode a legacy "corrupted" map
finalize();
}
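
The encode and decode hunks above follow a trailing-field pattern: the two MSR tunables are appended only when the peer advertises ``CRUSH_MSR``, and the decoder falls back to the defaults (100 for both) when the buffer ends before them. A minimal sketch of that pattern, using a plain byte vector in place of ``bufferlist`` (the names ``MsrTunables``, ``encode_tail``, and ``decode_tail`` are hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the two new crush_map fields.
struct MsrTunables {
  std::uint8_t descents = 100;         // default from set_default_msr_tunables()
  std::uint8_t collision_tries = 100;  // default from set_default_msr_tunables()
};

// Append the tunables only if the peer supports the feature.
void encode_tail(const MsrTunables& t, bool peer_has_crush_msr,
                 std::vector<std::uint8_t>& buf) {
  if (peer_has_crush_msr) {
    buf.push_back(t.descents);
    buf.push_back(t.collision_tries);
  }
}

// Read the tunables if present at pos; otherwise keep the defaults.
MsrTunables decode_tail(const std::vector<std::uint8_t>& buf, std::size_t pos) {
  MsrTunables t;
  if (pos + 2 <= buf.size()) {
    t.descents = buf[pos];
    t.collision_tries = buf[pos + 1];
  }
  return t;
}
```

The point of the pattern is that an older encoding, which simply stops before the new fields, still decodes to a map with well-defined MSR tunables.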
@ -3485,6 +3633,8 @@ void CrushWrapper::dump_tunables(Formatter *f) const
f->dump_int("chooseleaf_descend_once", get_chooseleaf_descend_once());
f->dump_int("chooseleaf_vary_r", get_chooseleaf_vary_r());
f->dump_int("chooseleaf_stable", get_chooseleaf_stable());
f->dump_int("msr_descents", get_msr_descents());
f->dump_int("msr_collision_tries", get_msr_collision_tries());
f->dump_int("straw_calc_version", get_straw_calc_version());
f->dump_int("allowed_bucket_algs", get_allowed_bucket_algs());
@ -3515,6 +3665,7 @@ void CrushWrapper::dump_tunables(Formatter *f) const
f->dump_int("has_v4_buckets", (int)has_v4_buckets());
f->dump_int("require_feature_tunables5", (int)has_nondefault_tunables5());
f->dump_int("has_v5_rules", (int)has_v5_rules());
f->dump_int("has_msr_rules", (int)has_msr_rules());
}
void CrushWrapper::dump_choose_args(Formatter *f) const
@ -3613,6 +3764,11 @@ void CrushWrapper::dump_rule(int rule_id, Formatter *f) const
f->dump_int("num", get_rule_arg1(rule_id, j));
f->dump_string("type", get_type_name(get_rule_arg2(rule_id, j)));
break;
case CRUSH_RULE_CHOOSE_MSR:
f->dump_string("op", "choosemsr");
f->dump_int("num", get_rule_arg1(rule_id, j));
f->dump_string("type", get_type_name(get_rule_arg2(rule_id, j)));
break;
case CRUSH_RULE_SET_CHOOSE_TRIES:
f->dump_string("op", "set_choose_tries");
f->dump_int("num", get_rule_arg1(rule_id, j));
@ -3621,6 +3777,14 @@ void CrushWrapper::dump_rule(int rule_id, Formatter *f) const
f->dump_string("op", "set_chooseleaf_tries");
f->dump_int("num", get_rule_arg1(rule_id, j));
break;
case CRUSH_RULE_SET_MSR_DESCENTS:
f->dump_string("op", "set_msr_descents");
f->dump_int("num", get_rule_arg1(rule_id, j));
break;
case CRUSH_RULE_SET_MSR_COLLISION_TRIES:
f->dump_string("op", "set_msr_collision_tries");
f->dump_int("num", get_rule_arg1(rule_id, j));
break;
default:
f->dump_int("opcode", get_rule_op(rule_id, j));
f->dump_int("arg1", get_rule_arg1(rule_id, j));

View File

@ -125,6 +125,7 @@ public:
crush->chooseleaf_vary_r = 0;
crush->chooseleaf_stable = 0;
crush->allowed_bucket_algs = CRUSH_LEGACY_ALLOWED_BUCKET_ALGS;
set_default_msr_tunables();
}
void set_tunables_bobtail() {
crush->choose_local_tries = 0;
@ -134,6 +135,7 @@ public:
crush->chooseleaf_vary_r = 0;
crush->chooseleaf_stable = 0;
crush->allowed_bucket_algs = CRUSH_LEGACY_ALLOWED_BUCKET_ALGS;
set_default_msr_tunables();
}
void set_tunables_firefly() {
crush->choose_local_tries = 0;
@ -143,6 +145,7 @@ public:
crush->chooseleaf_vary_r = 1;
crush->chooseleaf_stable = 0;
crush->allowed_bucket_algs = CRUSH_LEGACY_ALLOWED_BUCKET_ALGS;
set_default_msr_tunables();
}
void set_tunables_hammer() {
crush->choose_local_tries = 0;
@ -156,6 +159,7 @@ public:
(1 << CRUSH_BUCKET_LIST) |
(1 << CRUSH_BUCKET_STRAW) |
(1 << CRUSH_BUCKET_STRAW2);
set_default_msr_tunables();
}
void set_tunables_jewel() {
crush->choose_local_tries = 0;
@ -169,6 +173,7 @@ public:
(1 << CRUSH_BUCKET_LIST) |
(1 << CRUSH_BUCKET_STRAW) |
(1 << CRUSH_BUCKET_STRAW2);
set_default_msr_tunables();
}
void set_tunables_legacy() {
@ -233,6 +238,24 @@ public:
crush->straw_calc_version = n;
}
int get_msr_descents() const {
return crush->msr_descents;
}
void set_msr_descents(int n) {
crush->msr_descents = n;
}
int get_msr_collision_tries() const {
return crush->msr_collision_tries;
}
void set_msr_collision_tries(int n) {
crush->msr_collision_tries = n;
}
void set_default_msr_tunables() {
set_msr_descents(100);
set_msr_collision_tries(100);
}
unsigned get_allowed_bucket_algs() const {
return crush->allowed_bucket_algs;
}
@ -248,7 +271,8 @@ public:
crush->chooseleaf_descend_once == 0 &&
crush->chooseleaf_vary_r == 0 &&
crush->chooseleaf_stable == 0 &&
crush->allowed_bucket_algs == CRUSH_LEGACY_ALLOWED_BUCKET_ALGS;
crush->allowed_bucket_algs == CRUSH_LEGACY_ALLOWED_BUCKET_ALGS &&
!has_nondefault_tunables_msr();
}
bool has_bobtail_tunables() const {
return
@ -258,7 +282,8 @@ public:
crush->chooseleaf_descend_once == 1 &&
crush->chooseleaf_vary_r == 0 &&
crush->chooseleaf_stable == 0 &&
crush->allowed_bucket_algs == CRUSH_LEGACY_ALLOWED_BUCKET_ALGS;
crush->allowed_bucket_algs == CRUSH_LEGACY_ALLOWED_BUCKET_ALGS &&
!has_nondefault_tunables_msr();
}
bool has_firefly_tunables() const {
return
@ -268,7 +293,8 @@ public:
crush->chooseleaf_descend_once == 1 &&
crush->chooseleaf_vary_r == 1 &&
crush->chooseleaf_stable == 0 &&
crush->allowed_bucket_algs == CRUSH_LEGACY_ALLOWED_BUCKET_ALGS;
crush->allowed_bucket_algs == CRUSH_LEGACY_ALLOWED_BUCKET_ALGS &&
!has_nondefault_tunables_msr();
}
bool has_hammer_tunables() const {
return
@ -281,7 +307,8 @@ public:
crush->allowed_bucket_algs == ((1 << CRUSH_BUCKET_UNIFORM) |
(1 << CRUSH_BUCKET_LIST) |
(1 << CRUSH_BUCKET_STRAW) |
(1 << CRUSH_BUCKET_STRAW2));
(1 << CRUSH_BUCKET_STRAW2)) &&
!has_nondefault_tunables_msr();
}
bool has_jewel_tunables() const {
return
@ -294,7 +321,8 @@ public:
crush->allowed_bucket_algs == ((1 << CRUSH_BUCKET_UNIFORM) |
(1 << CRUSH_BUCKET_LIST) |
(1 << CRUSH_BUCKET_STRAW) |
(1 << CRUSH_BUCKET_STRAW2));
(1 << CRUSH_BUCKET_STRAW2)) &&
!has_nondefault_tunables_msr();
}
bool has_optimal_tunables() const {
@ -322,6 +350,11 @@ public:
return
crush->chooseleaf_stable != 0;
}
bool has_nondefault_tunables_msr() const {
return
crush->msr_descents != 100 ||
crush->msr_collision_tries != 100;
}
bool has_v2_rules() const;
bool has_v3_rules() const;
@ -329,13 +362,17 @@ public:
bool has_v5_rules() const;
bool has_choose_args() const; // any choose_args
bool has_incompat_choose_args() const; // choose_args that can't be made compat
bool has_msr_rules() const;
bool is_v2_rule(unsigned ruleid) const;
bool is_v3_rule(unsigned ruleid) const;
bool is_v5_rule(unsigned ruleid) const;
bool is_msr_rule(unsigned ruleid) const;
std::string get_min_required_version() const {
if (has_v5_rules() || has_nondefault_tunables5())
if (has_msr_rules() || has_nondefault_tunables_msr())
return "squid";
else if (has_v5_rules() || has_nondefault_tunables5())
return "jewel";
else if (has_v4_buckets())
return "hammer";
@ -565,6 +602,21 @@ public:
if (have_rmaps)
rule_name_rmap[name] = i;
}
bool rule_valid_for_pool_type(int rule_id, int ptype) const {
auto rule_type = get_rule_type(rule_id);
switch (ptype) {
case CEPH_PG_TYPE_REPLICATED:
return rule_type == CRUSH_RULE_TYPE_REPLICATED ||
rule_type == CRUSH_RULE_TYPE_MSR_FIRSTN;
case CEPH_PG_TYPE_ERASURE:
return rule_type == CRUSH_RULE_TYPE_ERASURE ||
rule_type == CRUSH_RULE_TYPE_MSR_INDEP;
default:
ceph_assert(0 == "impossible");
return false;
}
}
bool is_shadow_item(int id) const {
const char *name = get_item_name(id);
return name && !is_valid_crush_name(name);
@ -1151,6 +1203,14 @@ public:
int set_rule_step_set_chooseleaf_stable(unsigned ruleno, unsigned step, int val) {
return set_rule_step(ruleno, step, CRUSH_RULE_SET_CHOOSELEAF_STABLE, val, 0);
}
int set_rule_step_set_msr_descents(unsigned ruleno, unsigned step, int val) {
return set_rule_step(ruleno, step, CRUSH_RULE_SET_MSR_DESCENTS, val, 0);
}
int set_rule_step_set_msr_collision_tries(unsigned ruleno, unsigned step, int val) {
return set_rule_step(ruleno, step, CRUSH_RULE_SET_MSR_COLLISION_TRIES, val, 0);
}
int set_rule_step_choose_firstn(unsigned ruleno, unsigned step, int val, int type) {
return set_rule_step(ruleno, step, CRUSH_RULE_CHOOSE_FIRSTN, val, type);
}
@ -1163,22 +1223,61 @@ public:
int set_rule_step_choose_leaf_indep(unsigned ruleno, unsigned step, int val, int type) {
return set_rule_step(ruleno, step, CRUSH_RULE_CHOOSELEAF_INDEP, val, type);
}
int set_rule_step_choose_msr(unsigned ruleno, unsigned step, int val, int type) {
return set_rule_step(ruleno, step, CRUSH_RULE_CHOOSE_MSR, val, type);
}
int set_rule_step_emit(unsigned ruleno, unsigned step) {
return set_rule_step(ruleno, step, CRUSH_RULE_EMIT, 0, 0);
}
int add_simple_rule(
std::string name, std::string root_name, std::string failure_domain_type,
int num_failure_domains,
std::string device_class, std::string mode, int rule_type,
std::ostream *err = 0);
int add_simple_rule(
std::string name, std::string root_name, std::string failure_domain_type,
std::string device_class, std::string mode, int rule_type,
std::ostream *err = 0) {
return add_simple_rule(
name, root_name, failure_domain_type, -1,
device_class, mode, rule_type, err);
}
int add_indep_multi_osd_per_failure_domain_rule(
std::string name, std::string root_name, std::string failure_domain_type,
int osds_per_failure_domain,
int num_failure_domains,
std::string device_class,
std::ostream *err = 0);
/**
* @param rno rule[set] id to use, -1 to pick the lowest available
*/
int add_simple_rule_at(
std::string name, std::string root_name,
std::string failure_domain_type, std::string device_class, std::string mode,
std::string failure_domain_type,
int num_failure_domains,
std::string device_class, std::string mode,
int rule_type, int rno, std::ostream *err = 0);
int add_simple_rule_at(
std::string name, std::string root_name,
std::string failure_domain_type,
std::string device_class, std::string mode,
int rule_type, int rno, std::ostream *err = 0) {
return add_simple_rule_at(
name, root_name, failure_domain_type, -1,
device_class, mode, rule_type, rno, err);
}
int add_multi_osd_per_failure_domain_rule_at(
std::string name, std::string root_name, std::string failure_domain_type,
int osds_per_failure_domain,
int num_failure_domains,
std::string device_class,
crush_rule_type rule_type,
int rno,
std::ostream *err = 0);
int remove_rule(int ruleno);

View File

@ -65,7 +65,15 @@ enum crush_opcodes {
CRUSH_RULE_SET_CHOOSE_LOCAL_TRIES = 10,
CRUSH_RULE_SET_CHOOSE_LOCAL_FALLBACK_TRIES = 11,
CRUSH_RULE_SET_CHOOSELEAF_VARY_R = 12,
CRUSH_RULE_SET_CHOOSELEAF_STABLE = 13
CRUSH_RULE_SET_CHOOSELEAF_STABLE = 13,
/* set choose_msr_total_tries */
CRUSH_RULE_SET_MSR_DESCENTS = 14,
/* set choose_msr_local_collision_tries */
CRUSH_RULE_SET_MSR_COLLISION_TRIES = 15,
/* choose variant without FIRSTN|INDEP */
CRUSH_RULE_CHOOSE_MSR = 16
};
/*
@ -87,7 +95,12 @@ struct crush_rule {
#define crush_rule_size(len) (sizeof(struct crush_rule) + \
(len)*sizeof(struct crush_rule_step))
enum crush_rule_type {
CRUSH_RULE_TYPE_REPLICATED = 1,
CRUSH_RULE_TYPE_ERASURE = 3,
CRUSH_RULE_TYPE_MSR_FIRSTN = 4,
CRUSH_RULE_TYPE_MSR_INDEP = 5
};
/*
* A bucket is a named container of other items (either devices or
@ -410,6 +423,12 @@ struct crush_map {
*/
__u8 chooseleaf_stable;
/*! Sets total descents for MSR rules */
__u8 msr_descents;
/*! Sets local collision retries for MSR rules */
__u8 msr_collision_tries;
/*! @cond INTERNAL */
/* This value is calculated after decode or construction by
the builder. It is exposed here (rather than having a

View File

@ -50,8 +50,11 @@ struct crush_grammar : public boost::spirit::grammar<crush_grammar>
_step_set_choose_tries,
_step_set_choose_local_tries,
_step_set_choose_local_fallback_tries,
_step_set_msr_descents,
_step_set_msr_collision_tries,
_step_choose,
_step_chooseleaf,
_step_choose_msr,
_step_emit,
_step,
_crushrule,
@ -91,8 +94,11 @@ struct crush_grammar : public boost::spirit::grammar<crush_grammar>
boost::spirit::rule<ScannerT, boost::spirit::parser_context<>, boost::spirit::parser_tag<_step_set_chooseleaf_tries> > step_set_chooseleaf_tries;
boost::spirit::rule<ScannerT, boost::spirit::parser_context<>, boost::spirit::parser_tag<_step_set_chooseleaf_vary_r> > step_set_chooseleaf_vary_r;
boost::spirit::rule<ScannerT, boost::spirit::parser_context<>, boost::spirit::parser_tag<_step_set_chooseleaf_stable> > step_set_chooseleaf_stable;
boost::spirit::rule<ScannerT, boost::spirit::parser_context<>, boost::spirit::parser_tag<_step_set_msr_descents> > step_set_msr_descents;
boost::spirit::rule<ScannerT, boost::spirit::parser_context<>, boost::spirit::parser_tag<_step_set_msr_collision_tries> > step_set_msr_collision_tries;
boost::spirit::rule<ScannerT, boost::spirit::parser_context<>, boost::spirit::parser_tag<_step_choose> > step_choose;
boost::spirit::rule<ScannerT, boost::spirit::parser_context<>, boost::spirit::parser_tag<_step_chooseleaf> > step_chooseleaf;
boost::spirit::rule<ScannerT, boost::spirit::parser_context<>, boost::spirit::parser_tag<_step_choose_msr> > step_choose_msr;
boost::spirit::rule<ScannerT, boost::spirit::parser_context<>, boost::spirit::parser_tag<_step_emit> > step_emit;
boost::spirit::rule<ScannerT, boost::spirit::parser_context<>, boost::spirit::parser_tag<_step> > step;
boost::spirit::rule<ScannerT, boost::spirit::parser_context<>, boost::spirit::parser_tag<_crushrule> > crushrule;
@ -149,6 +155,8 @@ struct crush_grammar : public boost::spirit::grammar<crush_grammar>
step_set_chooseleaf_tries = str_p("set_chooseleaf_tries") >> posint;
step_set_chooseleaf_vary_r = str_p("set_chooseleaf_vary_r") >> posint;
step_set_chooseleaf_stable = str_p("set_chooseleaf_stable") >> posint;
step_set_msr_descents = str_p("set_msr_descents") >> posint;
step_set_msr_collision_tries = str_p("set_msr_collision_tries") >> posint;
step_choose = str_p("choose")
>> ( str_p("indep") | str_p("firstn") )
>> integer
@ -157,6 +165,9 @@ struct crush_grammar : public boost::spirit::grammar<crush_grammar>
>> ( str_p("indep") | str_p("firstn") )
>> integer
>> str_p("type") >> name;
step_choose_msr = str_p("choosemsr")
>> integer
>> str_p("type") >> name;
step_emit = str_p("emit");
step = str_p("step") >> ( step_take |
step_set_choose_tries |
@ -165,12 +176,15 @@ struct crush_grammar : public boost::spirit::grammar<crush_grammar>
step_set_chooseleaf_tries |
step_set_chooseleaf_vary_r |
step_set_chooseleaf_stable |
step_set_msr_descents |
step_set_msr_collision_tries |
step_choose |
step_chooseleaf |
step_choose_msr |
step_emit );
crushrule = str_p("rule") >> !name >> '{'
>> (str_p("id") | str_p("ruleset")) >> posint
>> str_p("type") >> ( str_p("replicated") | str_p("erasure") )
>> str_p("type") >> ( str_p("replicated") | str_p("erasure") | str_p("msr_firstn") | str_p("msr_indep") )
>> !(str_p("min_size") >> posint)
>> !(str_p("max_size") >> posint)
>> +step

File diff suppressed because it is too large.

View File

@ -77,15 +77,11 @@ extern int crush_do_rule(const struct crush_map *map,
const __u32 *weights, int weight_max,
void *cwin, const struct crush_choose_arg *choose_args);
/* Returns the exact amount of workspace that will need to be used
for a given combination of crush_map and result_max. The caller can
then allocate this much on its own, either on the stack, in a
per-thread long-lived buffer, or however it likes. */
static inline size_t crush_work_size(const struct crush_map *map,
int result_max) {
return map->working_size + result_max * 3 * sizeof(__u32);
}
/* Returns enough workspace for any crush rule within map to generate
result_max outputs. The caller can then allocate this much on its own,
either on the stack, in a per-thread long-lived buffer, or however it likes.*/
extern size_t crush_work_size(const struct crush_map *map,
int result_max);
extern void crush_init_workspace(const struct crush_map *m, void *v);

View File

@ -52,6 +52,12 @@ int ErasureCode::init(
err |= to_string("crush-failure-domain", profile,
&rule_failure_domain,
DEFAULT_RULE_FAILURE_DOMAIN, ss);
err |= to_int("crush-osds-per-failure-domain", profile,
&rule_osds_per_failure_domain,
"0", ss);
err |= to_int("crush-num-failure-domains", profile,
&rule_num_failure_domains,
"0", ss);
err |= to_string("crush-device-class", profile,
&rule_device_class,
"", ss);
@ -66,19 +72,33 @@ int ErasureCode::create_rule(
CrushWrapper &crush,
std::ostream *ss) const
{
int ruleid = crush.add_simple_rule(
name,
rule_root,
rule_failure_domain,
rule_device_class,
"indep",
pg_pool_t::TYPE_ERASURE,
ss);
if (ruleid < 0)
return ruleid;
return ruleid;
if (rule_osds_per_failure_domain <= 1) {
return crush.add_simple_rule(
name,
rule_root,
rule_failure_domain,
rule_num_failure_domains,
rule_device_class,
"indep",
pg_pool_t::TYPE_ERASURE,
ss);
} else {
if (rule_num_failure_domains < 1) {
if (ss) {
*ss << "crush-num-failure-domains " << rule_num_failure_domains
<< " must be >= 1 if crush-osds-per-failure-domain specified";
}
// Reject the profile even when no error stream was supplied.
return -EINVAL;
}
return crush.add_indep_multi_osd_per_failure_domain_rule(
name,
rule_root,
rule_failure_domain,
rule_num_failure_domains,
rule_osds_per_failure_domain,
rule_device_class,
ss);
}
}
int ErasureCode::sanity_check_k_m(int k, int m, ostream *ss)

View File

@ -37,6 +37,8 @@ namespace ceph {
std::string rule_root;
std::string rule_failure_domain;
std::string rule_device_class;
int rule_osds_per_failure_domain = -1;
int rule_num_failure_domains = -1;
~ErasureCode() override {}

View File

@ -137,7 +137,7 @@ DEFINE_CEPH_FEATURE(34, 3, RANGE_BLOCKLIST)
DEFINE_CEPH_FEATURE(35, 1, OSD_CACHEPOOL) // 3.14
DEFINE_CEPH_FEATURE(36, 1, CRUSH_V2) // 3.14
DEFINE_CEPH_FEATURE(37, 1, EXPORT_PEER) // 3.14
DEFINE_CEPH_FEATURE_RETIRED(38, 1, OSD_ERASURE_CODES, MIMIC, OCTOPUS)
DEFINE_CEPH_FEATURE(38, 2, CRUSH_MSR) // X.XX TODOSAM kernel version?
// available
DEFINE_CEPH_FEATURE(39, 1, OSDMAP_ENC) // 3.15
DEFINE_CEPH_FEATURE(40, 1, MDS_INLINE_DATA) // 3.19
@ -218,6 +218,7 @@ DEFINE_CEPH_FEATURE_RETIRED(63, 1, RESERVED_BROKEN, LUMINOUS, QUINCY) // client-
CEPH_FEATURE_OSD_CACHEPOOL | \
CEPH_FEATURE_CRUSH_V2 | \
CEPH_FEATURE_EXPORT_PEER | \
CEPH_FEATURE_CRUSH_MSR | \
CEPH_FEATURE_OSDMAP_ENC | \
CEPH_FEATURE_MDS_INLINE_DATA | \
CEPH_FEATURE_CRUSH_TUNABLES3 | \
@ -265,9 +266,10 @@ DEFINE_CEPH_FEATURE_RETIRED(63, 1, RESERVED_BROKEN, LUMINOUS, QUINCY) // client-
CEPH_FEATURE_CRUSH_TUNABLES2 | \
CEPH_FEATURE_CRUSH_TUNABLES3 | \
CEPH_FEATURE_CRUSH_TUNABLES5 | \
CEPH_FEATURE_CRUSH_MSR | \
CEPH_FEATURE_CRUSH_V2 | \
CEPH_FEATURE_CRUSH_V4 | \
CEPH_FEATUREMASK_CRUSH_CHOOSE_ARGS)
CEPH_FEATUREMASK_CRUSH_MSR)
/*
* make sure we don't try to use the reserved features

View File

@ -7562,6 +7562,12 @@ bool OSDMonitor::validate_crush_against_features(const CrushWrapper *newcrush,
<< newmap.require_min_compat_client;
return false;
}
if (mv > newmap.require_osd_release) {
ss << "new crush map requires client version " << mv
<< " but require_osd_release is "
<< newmap.require_osd_release;
return false;
}
}
// osd compat
@ -8072,7 +8078,7 @@ int OSDMonitor::prepare_new_pool(string& name,
return r;
}
if (osdmap.crush->get_rule_type(crush_rule) != (int)pool_type) {
if (!osdmap.crush->rule_valid_for_pool_type(crush_rule, pool_type)) {
*ss << "crush rule " << crush_rule << " type does not match pool";
return -EINVAL;
}
@ -8344,7 +8350,7 @@ int OSDMonitor::prepare_command_pool_set(const cmdmap_t& cmdmap,
return -EPERM;
}
}
if (osdmap.crush->get_rule_type(p.get_crush_rule()) != (int)p.type) {
if (!osdmap.crush->rule_valid_for_pool_type(p.get_crush_rule(), p.type)) {
ss << "crush rule " << p.get_crush_rule() << " type does not match pool";
return -EINVAL;
}
@ -8577,7 +8583,7 @@ int OSDMonitor::prepare_command_pool_set(const cmdmap_t& cmdmap,
ss << cpp_strerror(id);
return -ENOENT;
}
if (osdmap.crush->get_rule_type(id) != (int)p.get_type()) {
if (!osdmap.crush->rule_valid_for_pool_type(id, p.get_type())) {
ss << "crush rule " << id << " type does not match pool";
return -EINVAL;
}
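
The three hunks above replace a strict equality test between rule type and pool type with a call to `rule_valid_for_pool_type`. Its definition lies in a part of the diff not shown here, so the following is only a sketch of what such a predicate might look like. The enums are hypothetical stand-ins for the real constants. The pairing of `msr_firstn` with replicated pools and `msr_indep` with erasure-coded pools is an assumption drawn from the names, not from the source.

```cpp
#include <cassert>

// Hypothetical stand-ins for the pool type and rule type constants.
enum class PoolType { Replicated, Erasure };
enum class RuleType { Replicated, Erasure, MsrFirstn, MsrIndep };

// Assumption: msr_firstn serves replicated pools and msr_indep serves
// erasure-coded pools, so each pool type accepts two rule types.
bool rule_valid_for_pool_type(RuleType rule, PoolType pool) {
  switch (pool) {
  case PoolType::Replicated:
    return rule == RuleType::Replicated || rule == RuleType::MsrFirstn;
  case PoolType::Erasure:
    return rule == RuleType::Erasure || rule == RuleType::MsrIndep;
  }
  return false;  // unknown pool type: reject
}
```

With four rule types and two pool types, an equality comparison can no longer express validity, which would explain why every call site switches to a predicate.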

View File

@ -1764,9 +1764,10 @@ uint64_t OSDMap::get_features(int entity_type, uint64_t *pmask) const
features |= CEPH_FEATURE_CRUSH_V4;
if (crush->has_nondefault_tunables5())
features |= CEPH_FEATURE_CRUSH_TUNABLES5;
if (crush->has_incompat_choose_args()) {
if (crush->has_incompat_choose_args())
features |= CEPH_FEATUREMASK_CRUSH_CHOOSE_ARGS;
}
if (crush->has_nondefault_tunables_msr())
features |= CEPH_FEATURE_CRUSH_MSR;
mask |= CEPH_FEATURES_CRUSH;
if (!pg_upmap.empty() || !pg_upmap_items.empty() || !pg_upmap_primaries.empty())
@ -1789,6 +1790,8 @@ uint64_t OSDMap::get_features(int entity_type, uint64_t *pmask) const
features |= CEPH_FEATURE_CRUSH_TUNABLES3;
if (crush->is_v5_rule(ruleid))
features |= CEPH_FEATURE_CRUSH_TUNABLES5;
if (crush->is_msr_rule(ruleid))
features |= CEPH_FEATURE_CRUSH_MSR;
}
}
mask |= CEPH_FEATURE_OSDHASHPSPOOL | CEPH_FEATURE_OSD_CACHEPOOL;
@ -1843,6 +1846,9 @@ ceph_release_t OSDMap::get_min_compat_client() const
{
uint64_t f = get_features(CEPH_ENTITY_TYPE_CLIENT, nullptr);
if (HAVE_FEATURE(f, CRUSH_MSR)) { // TODOSAM -- add version right before merge
return ceph_release_t::squid; // v19.2.0
}
if (HAVE_FEATURE(f, OSDMAP_PG_UPMAP) || // v12.0.0-1733-g27d6f43
HAVE_FEATURE(f, CRUSH_CHOOSE_ARGS)) { // v12.0.1-2172-gef1ef28
return ceph_release_t::luminous; // v12.2.0
@ -4524,7 +4530,7 @@ int OSDMap::validate_crush_rules(CrushWrapper *newcrush,
<< " but it is not present";
return -EINVAL;
}
if (newcrush->get_rule_type(ruleno) != (int)pool.get_type()) {
if (!newcrush->rule_valid_for_pool_type(ruleno, pool.get_type())) {
*ss << "pool " << i.first << " type does not match rule " << ruleno;
return -EINVAL;
}
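
In the `get_min_compat_client` hunk above, the `CRUSH_MSR` check is placed before the existing luminous checks. A minimal sketch of why the order matters follows. It uses placeholder feature bits and an illustrative subset of releases; none of these names or values come from Ceph.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative subset of releases; "Older" is a placeholder for every earlier tier.
enum class Release { Older, Luminous, Squid };

// Placeholder bit positions, not the real feature bits.
constexpr uint64_t kFeatUpmap = 1ULL << 0;
constexpr uint64_t kFeatMsr = 1ULL << 1;

// Check the feature with the newest release first, so a map that needs
// several features reports the most demanding client requirement.
Release min_compat_client(uint64_t required) {
  if (required & kFeatMsr)
    return Release::Squid;
  if (required & kFeatUpmap)
    return Release::Luminous;
  return Release::Older;
}
```

If the checks ran in the opposite order, a map using both features would report luminous and admit clients that cannot interpret MSR rules.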

View File

@ -159,6 +159,8 @@
"chooseleaf_descend_once": 0,
"chooseleaf_vary_r": 0,
"chooseleaf_stable": 0,
"msr_descents": 100,
"msr_collision_tries": 100,
"straw_calc_version": 0,
"allowed_bucket_algs": 22,
"profile": "argonaut",
@ -172,7 +174,8 @@
"has_v3_rules": 0,
"has_v4_buckets": 1,
"require_feature_tunables5": 0,
"has_v5_rules": 0
"has_v5_rules": 0,
"has_msr_rules": 0
},
"choose_args": {
"1": [],

View File

@ -6,7 +6,7 @@
osdmaptool: exported crush map to oc
$ osdmaptool --import-crush oc myosdmap
osdmaptool: osdmap file 'myosdmap'
osdmaptool: imported 497 byte crush map from oc
osdmaptool: imported 499 byte crush map from oc
osdmaptool: writing epoch 3 to myosdmap
$ osdmaptool --adjust-crush-weight 0:5 myosdmap
osdmaptool: osdmap file 'myosdmap'

File diff suppressed because it is too large

View File

@ -176,6 +176,9 @@ zoned_enabled=0
io_uring_enabled=0
with_jaeger=0
force_addr=0
osds_per_host=0
require_osd_and_client_version=""
use_crush_tunables=""
with_mgr_dashboard=true
if [[ "$(get_cmake_variable WITH_MGR_DASHBOARD_FRONTEND)" != "ON" ]] ||
@ -599,6 +602,21 @@ case $1 in
with_jaeger=1
echo "with_jaeger $with_jaeger"
;;
--osds-per-host)
osds_per_host="$2"
shift
echo "osds_per_host $osds_per_host"
;;
--require-osd-and-client-version)
require_osd_and_client_version="$2"
shift
echo "require_osd_and_client_version $require_osd_and_client_version"
;;
--use-crush-tunables)
use_crush_tunables="$2"
shift
echo "use_crush_tunables $use_crush_tunables"
;;
*)
usage_exit
esac
@ -1095,6 +1113,15 @@ EOF
if [ "$crimson" -eq 1 ]; then
$CEPH_BIN/ceph osd set-allow-crimson --yes-i-really-mean-it
fi
if [ -n "$require_osd_and_client_version" ]; then
$CEPH_BIN/ceph osd set-require-min-compat-client $require_osd_and_client_version
$CEPH_BIN/ceph osd require-osd-release $require_osd_and_client_version --yes-i-really-mean-it
fi
if [ -n "$use_crush_tunables" ]; then
$CEPH_BIN/ceph osd crush tunables $use_crush_tunables
fi
}
start_osd() {
@ -1128,6 +1155,13 @@ start_osd() {
[osd.$osd]
host = $HOSTNAME
EOF
if [ "$osds_per_host" -gt 0 ]; then
wconf <<EOF
crush location = root=default host=$HOSTNAME-$(echo "$osd / $osds_per_host" | bc)
EOF
fi
if [ "$spdk_enabled" -eq 1 ]; then
wconf <<EOF
bluestore_block_path = spdk:${bluestore_spdk_dev[$osd]}