Merge pull request #13968 from dzafman/wip-15912-followon

osd,mon: misc full fixes and cleanups

Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: John Spray <john.spray@redhat.com>
Reviewed-by: Kefu Chai <kchai@redhat.com>
This commit is contained in:
Sage Weil 2017-04-17 16:42:13 -05:00 committed by GitHub
commit 37e9a874af
37 changed files with 584 additions and 144 deletions

View File

@ -34,8 +34,8 @@ the typical process.
Once the primary has its local reservation, it requests a remote
reservation from the backfill target. This reservation CAN be rejected,
for instance if the OSD is too full (osd_backfill_full_ratio config
option). If the reservation is rejected, the primary drops its local
for instance if the OSD is too full (backfillfull_ratio osd setting).
If the reservation is rejected, the primary drops its local
reservation, waits (osd_backfill_retry_interval), and then retries. It
will retry indefinitely.
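As a rough operational sketch (assuming a running cluster; ``injectargs``
applies a non-persistent runtime override), the ratios recorded in the
OSDMap and the retry interval can be inspected and tuned::

    # show the full/backfillfull/nearfull ratios the cluster enforces
    ceph osd dump | grep ratio

    # raise the threshold at which backfill reservations are rejected
    ceph osd set-backfillfull-ratio .92

    # shorten the wait between reservation retries on all OSDs
    ceph tell osd.* injectargs '--osd-backfill-retry-interval 10'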
@ -62,9 +62,10 @@ to the monitor. The state chart can set:
- recovery_wait: waiting for local/remote reservations
- recovering: recovering
- recovery_toofull: recovery stopped, OSD(s) above full ratio
- backfill_wait: waiting for remote backfill reservations
- backfilling: backfilling
- backfill_toofull: backfill reservation rejected, OSD too full
- backfill_toofull: backfill stopped, OSD(s) above backfillfull ratio
--------

View File

@ -1166,6 +1166,12 @@ Usage::
ceph pg set_full_ratio <float[0.0-1.0]>
Subcommand ``set_backfillfull_ratio`` sets ratio at which pgs are considered too full to backfill.
Usage::
ceph pg set_backfillfull_ratio <float[0.0-1.0]>
Subcommand ``set_nearfull_ratio`` sets ratio at which pgs are considered nearly
full.

View File

@ -400,6 +400,7 @@ a reasonable number for a near full ratio.
[global]
mon osd full ratio = .80
mon osd backfillfull ratio = .75
mon osd nearfull ratio = .70
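These thresholds are expected to stay ordered (nearfull < backfillfull <
full). As a quick sketch, the values a running cluster actually applies
can be read back from the OSDMap::

    ceph osd dump | grep -E '^(full|backfillfull|nearfull)_ratio'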
@ -412,6 +413,15 @@ a reasonable number for a near full ratio.
:Default: ``.95``
``mon osd backfillfull ratio``
:Description: The percentage of disk space used before an OSD is
considered too ``full`` to backfill.
:Type: Float
:Default: ``.90``
``mon osd nearfull ratio``
:Description: The percentage of disk space used before an OSD is

View File

@ -560,15 +560,6 @@ priority than requests to read or write data.
:Default: ``512``
``osd backfill full ratio``
:Description: Refuse to accept backfill requests when the Ceph OSD Daemon's
full ratio is above this value.
:Type: Float
:Default: ``0.85``
``osd backfill retry interval``
:Description: The number of seconds to wait before retrying backfill requests.
@ -673,13 +664,6 @@ perform well in a degraded state.
:Default: ``8 << 20``
``osd recovery threads``
:Description: The number of threads for recovering data.
:Type: 32-bit Integer
:Default: ``1``
``osd recovery thread timeout``
:Description: The maximum time in seconds before timing out a recovery thread.

View File

@ -468,8 +468,7 @@ Ceph provides a number of settings to balance the resource contention between
new service requests and the need to recover data objects and restore the
placement groups to the current state. The ``osd recovery delay start`` setting
allows an OSD to restart, re-peer and even process some replay requests before
starting the recovery process. The ``osd recovery threads`` setting limits the
number of threads for the recovery process (1 thread by default). The ``osd
starting the recovery process. The ``osd
recovery thread timeout`` sets a thread timeout, because multiple OSDs may fail,
restart and re-peer at staggered rates. The ``osd recovery max active`` setting
limits the number of recovery requests an OSD will entertain simultaneously to
@ -497,8 +496,9 @@ placement group can't be backfilled, it may be considered ``incomplete``.
Ceph provides a number of settings to manage the load spike associated with
reassigning placement groups to an OSD (especially a new OSD). By default,
``osd_max_backfills`` sets the maximum number of concurrent backfills to or from
an OSD to 10. The ``osd backfill full ratio`` enables an OSD to refuse a
backfill request if the OSD is approaching its full ratio (85%, by default).
an OSD to 10. The ``backfill full ratio`` enables an OSD to refuse a
backfill request if the OSD is approaching its full ratio (90%, by default);
the ratio can be changed with the ``ceph osd set-backfillfull-ratio`` command.
If an OSD refuses a backfill request, the ``osd backfill retry interval``
enables an OSD to retry the request (after 10 seconds, by default). OSDs can
also set ``osd backfill scan min`` and ``osd backfill scan max`` to manage scan
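A hedged example of adjusting these settings at runtime (``injectargs``
overrides are not persisted across daemon restarts)::

    # allow two concurrent backfills to or from each OSD
    ceph tell osd.* injectargs '--osd-max-backfills 2'

    # move the threshold at which backfill reservations are refused
    ceph osd set-backfillfull-ratio .90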

View File

@ -206,7 +206,9 @@ Ceph prevents you from writing to a full OSD so that you don't lose data.
In an operational cluster, you should receive a warning when your cluster
is getting near its full ratio. The ``mon osd full ratio`` defaults to
``0.95``, or 95% of capacity before it stops clients from writing data.
The ``mon osd nearfull ratio`` defaults to ``0.85``, or 85% of capacity
The ``mon osd backfillfull ratio`` defaults to ``0.90``, or 90% of
capacity when it blocks backfills from starting. The
``mon osd nearfull ratio`` defaults to ``0.85``, or 85% of capacity
when it generates a health warning.
Full cluster issues usually arise when testing how Ceph handles an OSD
@ -214,20 +216,21 @@ failure on a small cluster. When one node has a high percentage of the
cluster's data, the cluster can easily eclipse its nearfull and full ratio
immediately. If you are testing how Ceph reacts to OSD failures on a small
cluster, you should leave ample free disk space and consider temporarily
lowering the ``mon osd full ratio`` and ``mon osd nearfull ratio``.
lowering the ``mon osd full ratio``, ``mon osd backfillfull ratio`` and
``mon osd nearfull ratio``.
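On clusters where the OSDMap carries the ratios (luminous onwards), the
runtime equivalents can be lowered in order, for example::

    ceph osd set-nearfull-ratio .60
    ceph osd set-backfillfull-ratio .65
    ceph osd set-full-ratio .70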
Full ``ceph-osds`` will be reported by ``ceph health``::
ceph health
HEALTH_WARN 1 nearfull osds
osd.2 is near full at 85%
HEALTH_WARN 1 nearfull osd(s)
Or::
ceph health
HEALTH_ERR 1 nearfull osds, 1 full osds
osd.2 is near full at 85%
ceph health detail
HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s)
osd.3 is full at 97%
osd.4 is backfill full at 91%
osd.2 is near full at 87%
The best way to deal with a full cluster is to add new ``ceph-osds``, allowing
the cluster to redistribute data to the newly available storage.

View File

@ -696,7 +696,7 @@ class Thrasher:
"""
Test backfills stopping when the replica fills up.
First, use osd_backfill_full_ratio to simulate a now full
First, use the injectfull admin command to simulate a now full
osd by setting it to 0 on all of the OSDs.
Second, on a random subset, set
@ -705,13 +705,14 @@ class Thrasher:
Then, verify that all backfills stop.
"""
self.log("injecting osd_backfill_full_ratio = 0")
self.log("injecting backfill full")
for i in self.live_osds:
self.ceph_manager.set_config(
i,
osd_debug_skip_full_check_in_backfill_reservation=
random.choice(['false', 'true']),
osd_backfill_full_ratio=0)
random.choice(['false', 'true']))
self.ceph_manager.osd_admin_socket(i, command=['injectfull', 'backfillfull'],
check_status=True, timeout=30, stdout=DEVNULL)
for i in range(30):
status = self.ceph_manager.compile_pg_status()
if 'backfill' not in status.keys():
@ -724,8 +725,9 @@ class Thrasher:
for i in self.live_osds:
self.ceph_manager.set_config(
i,
osd_debug_skip_full_check_in_backfill_reservation='false',
osd_backfill_full_ratio=0.85)
osd_debug_skip_full_check_in_backfill_reservation='false')
self.ceph_manager.osd_admin_socket(i, command=['injectfull', 'none'],
check_status=True, timeout=30, stdout=DEVNULL)
def test_map_discontinuity(self):
"""

View File

@ -400,6 +400,7 @@ EOF
if test -z "$(get_config mon $id mon_initial_members)" ; then
ceph osd pool delete rbd rbd --yes-i-really-really-mean-it || return 1
ceph osd pool create rbd $PG_NUM || return 1
ceph osd set-backfillfull-ratio .99
fi
}
@ -634,7 +635,6 @@ function activate_osd() {
ceph_disk_args+=" --prepend-to-path="
local ceph_args="$CEPH_ARGS"
ceph_args+=" --osd-backfill-full-ratio=.99"
ceph_args+=" --osd-failsafe-full-ratio=.99"
ceph_args+=" --osd-journal-size=100"
ceph_args+=" --osd-scrub-load-threshold=2000"

View File

@ -1419,9 +1419,44 @@ function test_mon_pg()
ceph osd set-full-ratio .962
ceph osd dump | grep '^full_ratio 0.962'
ceph osd set-backfillfull-ratio .912
ceph osd dump | grep '^backfillfull_ratio 0.912'
ceph osd set-nearfull-ratio .892
ceph osd dump | grep '^nearfull_ratio 0.892'
# Check health status
ceph osd set-nearfull-ratio .913
ceph health | grep 'HEALTH_ERR Full ratio(s) out of order'
ceph health detail | grep 'backfill_ratio (0.912) < nearfull_ratio (0.913), increased'
ceph osd set-nearfull-ratio .892
ceph osd set-backfillfull-ratio .963
ceph health detail | grep 'full_ratio (0.962) < backfillfull_ratio (0.963), increased'
ceph osd set-backfillfull-ratio .912
# Check injected full results
WAITFORFULL=10
ceph --admin-daemon $CEPH_OUT_DIR/osd.0.asok injectfull nearfull
sleep $WAITFORFULL
ceph health | grep "HEALTH_WARN.*1 nearfull osd(s)"
ceph --admin-daemon $CEPH_OUT_DIR/osd.1.asok injectfull backfillfull
sleep $WAITFORFULL
ceph health | grep "HEALTH_WARN.*1 backfillfull osd(s)"
ceph --admin-daemon $CEPH_OUT_DIR/osd.2.asok injectfull failsafe
sleep $WAITFORFULL
# failsafe and full are the same as far as the monitor is concerned
ceph health | grep "HEALTH_ERR.*1 full osd(s)"
ceph --admin-daemon $CEPH_OUT_DIR/osd.0.asok injectfull full
sleep $WAITFORFULL
ceph health | grep "HEALTH_ERR.*2 full osd(s)"
ceph health detail | grep "osd.0 is full at.*%"
ceph health detail | grep "osd.2 is full at.*%"
ceph health detail | grep "osd.1 is backfill full at.*%"
ceph --admin-daemon $CEPH_OUT_DIR/osd.0.asok injectfull none
ceph --admin-daemon $CEPH_OUT_DIR/osd.1.asok injectfull none
ceph --admin-daemon $CEPH_OUT_DIR/osd.2.asok injectfull none
sleep $WAITFORFULL
ceph health | grep HEALTH_OK
ceph pg stat | grep 'pgs:'
ceph pg 0.0 query
ceph tell 0.0 query

View File

@ -359,10 +359,14 @@ if __name__ == '__main__':
r = expect('osd/dump', 'GET', 200, 'json', JSONHDR)
assert(float(r.myjson['output']['full_ratio']) == 0.90)
expect('osd/set-full-ratio?ratio=0.95', 'PUT', 200, '')
expect('osd/set-backfillfull-ratio?ratio=0.88', 'PUT', 200, '')
r = expect('osd/dump', 'GET', 200, 'json', JSONHDR)
assert(float(r.myjson['output']['backfillfull_ratio']) == 0.88)
expect('osd/set-backfillfull-ratio?ratio=0.90', 'PUT', 200, '')
expect('osd/set-nearfull-ratio?ratio=0.90', 'PUT', 200, '')
r = expect('osd/dump', 'GET', 200, 'json', JSONHDR)
assert(float(r.myjson['output']['nearfull_ratio']) == 0.90)
expect('osd/set-full-ratio?ratio=0.85', 'PUT', 200, '')
expect('osd/set-nearfull-ratio?ratio=0.85', 'PUT', 200, '')
r = expect('pg/stat', 'GET', 200, 'json', JSONHDR)
assert('num_pgs' in r.myjson['output'])

View File

@ -42,6 +42,8 @@ const char *ceph_osd_state_name(int s)
return "full";
case CEPH_OSD_NEARFULL:
return "nearfull";
case CEPH_OSD_BACKFILLFULL:
return "backfillfull";
default:
return "???";
}

View File

@ -308,6 +308,7 @@ OPTION(mon_pg_warn_min_pool_objects, OPT_INT, 1000) // do not warn on pools bel
OPTION(mon_pg_check_down_all_threshold, OPT_FLOAT, .5) // threshold of down osds after which we check all pgs
OPTION(mon_cache_target_full_warn_ratio, OPT_FLOAT, .66) // position between pool cache_target_full and max where we start warning
OPTION(mon_osd_full_ratio, OPT_FLOAT, .95) // what % full makes an OSD "full"
OPTION(mon_osd_backfillfull_ratio, OPT_FLOAT, .90) // what % full makes an OSD backfill full (backfill halted)
OPTION(mon_osd_nearfull_ratio, OPT_FLOAT, .85) // what % full makes an OSD near full
OPTION(mon_allow_pool_delete, OPT_BOOL, false) // allow pool deletion
OPTION(mon_globalid_prealloc, OPT_U32, 10000) // how many globalids to prealloc
@ -626,11 +627,11 @@ OPTION(osd_max_backfills, OPT_U64, 1)
// Minimum recovery priority (255 = max, smaller = lower)
OPTION(osd_min_recovery_priority, OPT_INT, 0)
// Refuse backfills when OSD full ratio is above this value
OPTION(osd_backfill_full_ratio, OPT_FLOAT, 0.85)
// Seconds to wait before retrying refused backfills
OPTION(osd_backfill_retry_interval, OPT_DOUBLE, 10.0)
OPTION(osd_backfill_retry_interval, OPT_DOUBLE, 30.0)
// Seconds to wait before retrying refused recovery
OPTION(osd_recovery_retry_interval, OPT_DOUBLE, 30.0)
// max agent flush ops
OPTION(osd_agent_max_ops, OPT_INT, 4)
@ -742,7 +743,6 @@ OPTION(osd_op_pq_min_cost, OPT_U64, 65536)
OPTION(osd_disk_threads, OPT_INT, 1)
OPTION(osd_disk_thread_ioprio_class, OPT_STR, "") // rt realtime be best effort idle
OPTION(osd_disk_thread_ioprio_priority, OPT_INT, -1) // 0-7
OPTION(osd_recovery_threads, OPT_INT, 1)
OPTION(osd_recover_clone_overlap, OPT_BOOL, true) // preserve clone_overlap during recovery/migration
OPTION(osd_op_num_threads_per_shard, OPT_INT, 2)
OPTION(osd_op_num_shards, OPT_INT, 5)
@ -871,6 +871,7 @@ OPTION(osd_debug_skip_full_check_in_backfill_reservation, OPT_BOOL, false)
OPTION(osd_debug_reject_backfill_probability, OPT_DOUBLE, 0)
OPTION(osd_debug_inject_copyfrom_error, OPT_BOOL, false) // inject failure during copyfrom completion
OPTION(osd_debug_misdirected_ops, OPT_BOOL, false)
OPTION(osd_debug_skip_full_check_in_recovery, OPT_BOOL, false)
OPTION(osd_enxio_on_misdirected_op, OPT_BOOL, false)
OPTION(osd_debug_verify_cached_snaps, OPT_BOOL, false)
OPTION(osd_enable_op_tracker, OPT_BOOL, true) // enable/disable OSD op tracking

View File

@ -116,6 +116,7 @@ struct ceph_eversion {
#define CEPH_OSD_NEW (1<<3) /* osd is new, never marked in */
#define CEPH_OSD_FULL (1<<4) /* osd is at or above full threshold */
#define CEPH_OSD_NEARFULL (1<<5) /* osd is at or above nearfull threshold */
#define CEPH_OSD_BACKFILLFULL (1<<6) /* osd is at or above backfillfull threshold */
extern const char *ceph_osd_state_name(int s);

View File

@ -592,6 +592,10 @@ COMMAND("osd set-full-ratio " \
"name=ratio,type=CephFloat,range=0.0|1.0", \
"set usage ratio at which OSDs are marked full",
"osd", "rw", "cli,rest")
COMMAND("osd set-backfillfull-ratio " \
"name=ratio,type=CephFloat,range=0.0|1.0", \
"set usage ratio at which OSDs are marked too full to backfill",
"osd", "rw", "cli,rest")
COMMAND("osd set-nearfull-ratio " \
"name=ratio,type=CephFloat,range=0.0|1.0", \
"set usage ratio at which OSDs are marked near-full",

View File

@ -164,7 +164,11 @@ void OSDMonitor::create_initial()
if (!g_conf->mon_debug_no_require_luminous) {
newmap.set_flag(CEPH_OSDMAP_REQUIRE_LUMINOUS);
newmap.full_ratio = g_conf->mon_osd_full_ratio;
if (newmap.full_ratio > 1.0) newmap.full_ratio /= 100;
newmap.backfillfull_ratio = g_conf->mon_osd_backfillfull_ratio;
if (newmap.backfillfull_ratio > 1.0) newmap.backfillfull_ratio /= 100;
newmap.nearfull_ratio = g_conf->mon_osd_nearfull_ratio;
if (newmap.nearfull_ratio > 1.0) newmap.nearfull_ratio /= 100;
}
// encode into pending incremental
@ -784,8 +788,17 @@ void OSDMonitor::create_pending()
OSDMap::clean_temps(g_ceph_context, osdmap, &pending_inc);
dout(10) << "create_pending did clean_temps" << dendl;
// On upgrade OSDMap has new field set by mon_osd_backfillfull_ratio config
// instead of osd_backfill_full_ratio config
if (osdmap.backfillfull_ratio <= 0) {
pending_inc.new_backfillfull_ratio = g_conf->mon_osd_backfillfull_ratio;
if (pending_inc.new_backfillfull_ratio > 1.0)
pending_inc.new_backfillfull_ratio /= 100;
dout(1) << __func__ << " setting backfillfull_ratio = "
<< pending_inc.new_backfillfull_ratio << dendl;
}
if (!osdmap.test_flag(CEPH_OSDMAP_REQUIRE_LUMINOUS)) {
// transition nearfull ratios from PGMap to OSDMap (on upgrade)
// transition full ratios from PGMap to OSDMap (on upgrade)
PGMap *pg_map = &mon->pgmon()->pg_map;
if (osdmap.full_ratio != pg_map->full_ratio) {
dout(10) << __func__ << " full_ratio " << osdmap.full_ratio
@ -800,14 +813,18 @@ void OSDMonitor::create_pending()
} else {
// safety check (this shouldn't really happen)
if (osdmap.full_ratio <= 0) {
dout(1) << __func__ << " setting full_ratio = "
<< g_conf->mon_osd_full_ratio << dendl;
pending_inc.new_full_ratio = g_conf->mon_osd_full_ratio;
if (pending_inc.new_full_ratio > 1.0)
pending_inc.new_full_ratio /= 100;
dout(1) << __func__ << " setting full_ratio = "
<< pending_inc.new_full_ratio << dendl;
}
if (osdmap.nearfull_ratio <= 0) {
dout(1) << __func__ << " setting nearfull_ratio = "
<< g_conf->mon_osd_nearfull_ratio << dendl;
pending_inc.new_nearfull_ratio = g_conf->mon_osd_nearfull_ratio;
if (pending_inc.new_nearfull_ratio > 1.0)
pending_inc.new_nearfull_ratio /= 100;
dout(1) << __func__ << " setting nearfull_ratio = "
<< pending_inc.new_nearfull_ratio << dendl;
}
}
}
@ -1048,8 +1065,8 @@ void OSDMonitor::encode_pending(MonitorDBStore::TransactionRef t)
tmp.apply_incremental(pending_inc);
if (tmp.test_flag(CEPH_OSDMAP_REQUIRE_LUMINOUS)) {
int full, nearfull;
tmp.count_full_nearfull_osds(&full, &nearfull);
int full, backfill, nearfull;
tmp.count_full_nearfull_osds(&full, &backfill, &nearfull);
if (full > 0) {
if (!tmp.test_flag(CEPH_OSDMAP_FULL)) {
dout(10) << __func__ << " setting full flag" << dendl;
@ -2287,7 +2304,7 @@ bool OSDMonitor::preprocess_full(MonOpRequestRef op)
MOSDFull *m = static_cast<MOSDFull*>(op->get_req());
int from = m->get_orig_source().num();
set<string> state;
unsigned mask = CEPH_OSD_NEARFULL | CEPH_OSD_FULL;
unsigned mask = CEPH_OSD_NEARFULL | CEPH_OSD_BACKFILLFULL | CEPH_OSD_FULL;
// check permissions, ignore if failed
MonSession *session = m->get_session();
@ -2337,7 +2354,7 @@ bool OSDMonitor::prepare_full(MonOpRequestRef op)
const MOSDFull *m = static_cast<MOSDFull*>(op->get_req());
const int from = m->get_orig_source().num();
const unsigned mask = CEPH_OSD_NEARFULL | CEPH_OSD_FULL;
const unsigned mask = CEPH_OSD_NEARFULL | CEPH_OSD_BACKFILLFULL | CEPH_OSD_FULL;
const unsigned want_state = m->state & mask; // safety first
unsigned cur_state = osdmap.get_state(from);
@ -3342,18 +3359,83 @@ void OSDMonitor::get_health(list<pair<health_status_t,string> >& summary,
}
if (osdmap.test_flag(CEPH_OSDMAP_REQUIRE_LUMINOUS)) {
int full, nearfull;
osdmap.count_full_nearfull_osds(&full, &nearfull);
if (full > 0) {
// An OSD could configure its failsafe ratio to something different,
// but for now assume it is the same here.
float fsr = g_conf->osd_failsafe_full_ratio;
if (fsr > 1.0) fsr /= 100;
float fr = osdmap.get_full_ratio();
float br = osdmap.get_backfillfull_ratio();
float nr = osdmap.get_nearfull_ratio();
bool out_of_order = false;
// These checks correspond to how OSDService::check_full_status() in an OSD
// handles the improper setting of these values.
if (br < nr) {
out_of_order = true;
if (detail) {
ostringstream ss;
ss << "backfill_ratio (" << br << ") < nearfull_ratio (" << nr << "), increased";
detail->push_back(make_pair(HEALTH_ERR, ss.str()));
}
br = nr;
}
if (fr < br) {
out_of_order = true;
if (detail) {
ostringstream ss;
ss << "full_ratio (" << fr << ") < backfillfull_ratio (" << br << "), increased";
detail->push_back(make_pair(HEALTH_ERR, ss.str()));
}
fr = br;
}
if (fsr < fr) {
out_of_order = true;
if (detail) {
ostringstream ss;
ss << "osd_failsafe_full_ratio (" << fsr << ") < full_ratio (" << fr << "), increased";
detail->push_back(make_pair(HEALTH_ERR, ss.str()));
}
}
if (out_of_order) {
ostringstream ss;
ss << full << " full osd(s)";
ss << "Full ratio(s) out of order";
summary.push_back(make_pair(HEALTH_ERR, ss.str()));
}
if (nearfull > 0) {
map<int, float> full, backfillfull, nearfull;
osdmap.get_full_osd_util(mon->pgmon()->pg_map.osd_stat, &full, &backfillfull, &nearfull);
if (full.size()) {
ostringstream ss;
ss << nearfull << " nearfull osd(s)";
ss << full.size() << " full osd(s)";
summary.push_back(make_pair(HEALTH_ERR, ss.str()));
}
if (backfillfull.size()) {
ostringstream ss;
ss << backfillfull.size() << " backfillfull osd(s)";
summary.push_back(make_pair(HEALTH_WARN, ss.str()));
}
if (nearfull.size()) {
ostringstream ss;
ss << nearfull.size() << " nearfull osd(s)";
summary.push_back(make_pair(HEALTH_WARN, ss.str()));
}
if (detail) {
for (auto& i: full) {
ostringstream ss;
ss << "osd." << i.first << " is full at " << roundf(i.second * 100) << "%";
detail->push_back(make_pair(HEALTH_ERR, ss.str()));
}
for (auto& i: backfillfull) {
ostringstream ss;
ss << "osd." << i.first << " is backfill full at " << roundf(i.second * 100) << "%";
detail->push_back(make_pair(HEALTH_WARN, ss.str()));
}
for (auto& i: nearfull) {
ostringstream ss;
ss << "osd." << i.first << " is near full at " << roundf(i.second * 100) << "%";
detail->push_back(make_pair(HEALTH_WARN, ss.str()));
}
}
}
// note: we leave it to ceph-mgr to generate detailed health warnings
// with actual osd utilizations
@ -6929,6 +7011,7 @@ bool OSDMonitor::prepare_command_impl(MonOpRequestRef op,
return true;
} else if (prefix == "osd set-full-ratio" ||
prefix == "osd set-backfillfull-ratio" ||
prefix == "osd set-nearfull-ratio") {
if (!osdmap.test_flag(CEPH_OSDMAP_REQUIRE_LUMINOUS)) {
ss << "you must complete the upgrade and set require_luminous_osds before"
@ -6945,6 +7028,8 @@ bool OSDMonitor::prepare_command_impl(MonOpRequestRef op,
}
if (prefix == "osd set-full-ratio")
pending_inc.new_full_ratio = n;
else if (prefix == "osd set-backfillfull-ratio")
pending_inc.new_backfillfull_ratio = n;
else if (prefix == "osd set-nearfull-ratio")
pending_inc.new_nearfull_ratio = n;
ss << prefix << " " << n;

View File

@ -1878,6 +1878,17 @@ int64_t PGMap::get_rule_avail(const OSDMap& osdmap, int ruleno) const
return 0;
}
float fratio;
if (osdmap.test_flag(CEPH_OSDMAP_REQUIRE_LUMINOUS) && osdmap.get_full_ratio() > 0) {
fratio = osdmap.get_full_ratio();
} else if (full_ratio > 0) {
fratio = full_ratio;
} else {
// this shouldn't really happen
fratio = g_conf->mon_osd_full_ratio;
if (fratio > 1.0) fratio /= 100;
}
int64_t min = -1;
for (map<int,float>::iterator p = wm.begin(); p != wm.end(); ++p) {
ceph::unordered_map<int32_t,osd_stat_t>::const_iterator osd_info =
@ -1892,7 +1903,7 @@ int64_t PGMap::get_rule_avail(const OSDMap& osdmap, int ruleno) const
continue;
}
double unusable = (double)osd_info->second.kb *
(1.0 - g_conf->mon_osd_full_ratio);
(1.0 - fratio);
double avail = MAX(0.0, (double)osd_info->second.kb_avail - unusable);
avail *= 1024.0;
int64_t proj = (int64_t)(avail / (double)p->second);
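(A worked example of the math above, with assumed numbers: kb = 1000,
kb_avail = 300 and fratio = 0.95 give unusable = 1000 * (1 - 0.95) = 50 KB,
so avail = max(0, 300 - 50) = 250 KB, i.e. 256000 after the "* 1024.0",
before the division by the CRUSH weight.)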

View File

@ -1316,6 +1316,8 @@ void PGMonitor::get_health(list<pair<health_status_t,string> >& summary,
note["backfilling"] += p->second;
if (p->first & PG_STATE_BACKFILL_TOOFULL)
note["backfill_toofull"] += p->second;
if (p->first & PG_STATE_RECOVERY_TOOFULL)
note["recovery_toofull"] += p->second;
}
ceph::unordered_map<pg_t, pg_stat_t> stuck_pgs;
@ -1403,6 +1405,7 @@ void PGMonitor::get_health(list<pair<health_status_t,string> >& summary,
PG_STATE_REPAIR |
PG_STATE_RECOVERING |
PG_STATE_RECOVERY_WAIT |
PG_STATE_RECOVERY_TOOFULL |
PG_STATE_INCOMPLETE |
PG_STATE_BACKFILL_WAIT |
PG_STATE_BACKFILL |

View File

@ -3952,7 +3952,7 @@ void FileStore::sync_entry()
derr << "ioctl WAIT_SYNC got " << cpp_strerror(err) << dendl;
assert(0 == "wait_sync got error");
}
dout(20) << " done waiting for checkpoint" << cid << " to complete" << dendl;
dout(20) << " done waiting for checkpoint " << cid << " to complete" << dendl;
}
} else
{

View File

@ -282,6 +282,11 @@ void ECBackend::handle_recovery_push(
const PushOp &op,
RecoveryMessages *m)
{
ostringstream ss;
if (get_parent()->check_failsafe_full(ss)) {
dout(10) << __func__ << " Out of space (failsafe) processing push request: " << ss.str() << dendl;
ceph_abort();
}
bool oneshot = op.before_progress.first && op.after_progress.data_complete;
ghobject_t tobj;

View File

@ -255,8 +255,8 @@ OSDService::OSDService(OSD *osd) :
watch_lock("OSDService::watch_lock"),
watch_timer(osd->client_messenger->cct, watch_lock),
next_notif_id(0),
backfill_request_lock("OSDService::backfill_request_lock"),
backfill_request_timer(cct, backfill_request_lock, false),
recovery_request_lock("OSDService::recovery_request_lock"),
recovery_request_timer(cct, recovery_request_lock, false),
reserver_finisher(cct),
local_reserver(&reserver_finisher, cct->_conf->osd_max_backfills,
cct->_conf->osd_min_recovery_priority),
@ -495,8 +495,8 @@ void OSDService::shutdown()
objecter_finisher.stop();
{
Mutex::Locker l(backfill_request_lock);
backfill_request_timer.shutdown();
Mutex::Locker l(recovery_request_lock);
recovery_request_timer.shutdown();
}
{
@ -716,13 +716,7 @@ void OSDService::check_full_status(const osd_stat_t &osd_stat)
{
Mutex::Locker l(full_status_lock);
// We base ratio on kb_avail rather than kb_used because they can
// differ significantly e.g. on btrfs volumes with a large number of
// chunks reserved for metadata, and for our purposes (avoiding
// completely filling the disk) it's far more important to know how
// much space is available to use than how much we've already used.
float ratio = ((float)(osd_stat.kb - osd_stat.kb_avail)) /
((float)osd_stat.kb);
float ratio = ((float)osd_stat.kb_used) / ((float)osd_stat.kb);
cur_ratio = ratio;
// The OSDMap ratios take precedence. So if the failsafe is .95 and
@ -735,28 +729,38 @@ void OSDService::check_full_status(const osd_stat_t &osd_stat)
return;
}
float nearfull_ratio = osdmap->get_nearfull_ratio();
float full_ratio = std::max(osdmap->get_full_ratio(), nearfull_ratio);
float backfillfull_ratio = std::max(osdmap->get_backfillfull_ratio(), nearfull_ratio);
float full_ratio = std::max(osdmap->get_full_ratio(), backfillfull_ratio);
float failsafe_ratio = std::max(get_failsafe_full_ratio(), full_ratio);
if (!osdmap->test_flag(CEPH_OSDMAP_REQUIRE_LUMINOUS)) {
// use the failsafe for nearfull and full; the mon isn't using the
// flags anyway because we're mid-upgrade.
full_ratio = failsafe_ratio;
backfillfull_ratio = failsafe_ratio;
nearfull_ratio = failsafe_ratio;
} else if (full_ratio <= 0 ||
backfillfull_ratio <= 0 ||
nearfull_ratio <= 0) {
derr << __func__ << " full_ratio or nearfull_ratio is <= 0" << dendl;
derr << __func__ << " full_ratio, backfillfull_ratio or nearfull_ratio is <= 0" << dendl;
// use failsafe flag. ick. the monitor did something wrong or the user
// did something stupid.
full_ratio = failsafe_ratio;
backfillfull_ratio = failsafe_ratio;
nearfull_ratio = failsafe_ratio;
}
enum s_names new_state;
if (ratio > failsafe_ratio) {
string inject;
s_names new_state;
if (injectfull_state > NONE && injectfull) {
new_state = injectfull_state;
inject = "(Injected)";
} else if (ratio > failsafe_ratio) {
new_state = FAILSAFE;
} else if (ratio > full_ratio) {
new_state = FULL;
} else if (ratio > backfillfull_ratio) {
new_state = BACKFILLFULL;
} else if (ratio > nearfull_ratio) {
new_state = NEARFULL;
} else {
@ -764,9 +768,11 @@ void OSDService::check_full_status(const osd_stat_t &osd_stat)
}
dout(20) << __func__ << " cur ratio " << ratio
<< ". nearfull_ratio " << nearfull_ratio
<< ". backfillfull_ratio " << backfillfull_ratio
<< ", full_ratio " << full_ratio
<< ", failsafe_ratio " << failsafe_ratio
<< ", new state " << get_full_state_name(new_state)
<< " " << inject
<< dendl;
// warn
@ -791,6 +797,8 @@ bool OSDService::need_fullness_update()
if (osdmap->exists(whoami)) {
if (osdmap->get_state(whoami) & CEPH_OSD_FULL) {
cur = FULL;
} else if (osdmap->get_state(whoami) & CEPH_OSD_BACKFILLFULL) {
cur = BACKFILLFULL;
} else if (osdmap->get_state(whoami) & CEPH_OSD_NEARFULL) {
cur = NEARFULL;
}
@ -798,41 +806,80 @@ bool OSDService::need_fullness_update()
s_names want = NONE;
if (is_full())
want = FULL;
else if (is_backfillfull())
want = BACKFILLFULL;
else if (is_nearfull())
want = NEARFULL;
return want != cur;
}
bool OSDService::check_failsafe_full()
bool OSDService::_check_full(s_names type, ostream &ss) const
{
Mutex::Locker l(full_status_lock);
if (cur_state == FAILSAFE)
if (injectfull && injectfull_state >= type) {
// injectfull is either a count of the number of times to return failsafe full
// or -1 to always return full
if (injectfull > 0)
--injectfull;
ss << "Injected " << get_full_state_name(type) << " OSD ("
<< (injectfull < 0 ? "set" : std::to_string(injectfull)) << ")";
return true;
return false;
}
ss << "current usage is " << cur_ratio;
return cur_state >= type;
}
bool OSDService::is_nearfull()
bool OSDService::check_failsafe_full(ostream &ss) const
{
return _check_full(FAILSAFE, ss);
}
bool OSDService::check_full(ostream &ss) const
{
return _check_full(FULL, ss);
}
bool OSDService::check_backfill_full(ostream &ss) const
{
return _check_full(BACKFILLFULL, ss);
}
bool OSDService::check_nearfull(ostream &ss) const
{
return _check_full(NEARFULL, ss);
}
bool OSDService::is_failsafe_full() const
{
Mutex::Locker l(full_status_lock);
return cur_state == NEARFULL;
return cur_state == FAILSAFE;
}
bool OSDService::is_full()
bool OSDService::is_full() const
{
Mutex::Locker l(full_status_lock);
return cur_state >= FULL;
}
bool OSDService::too_full_for_backfill(double *_ratio, double *_max_ratio)
bool OSDService::is_backfillfull() const
{
Mutex::Locker l(full_status_lock);
double max_ratio;
max_ratio = cct->_conf->osd_backfill_full_ratio;
if (_ratio)
*_ratio = cur_ratio;
if (_max_ratio)
*_max_ratio = max_ratio;
return cur_ratio >= max_ratio;
return cur_state >= BACKFILLFULL;
}
bool OSDService::is_nearfull() const
{
Mutex::Locker l(full_status_lock);
return cur_state >= NEARFULL;
}
void OSDService::set_injectfull(s_names type, int64_t count)
{
Mutex::Locker l(full_status_lock);
injectfull_state = type;
injectfull = count;
}
void OSDService::update_osd_stat(vector<int>& hb_peers)
@ -868,6 +915,16 @@ void OSDService::update_osd_stat(vector<int>& hb_peers)
check_full_status(osd_stat);
}
bool OSDService::check_osdmap_full(const set<pg_shard_t> &missing_on)
{
OSDMapRef osdmap = get_osdmap();
for (auto shard : missing_on) {
if (osdmap->get_state(shard.osd) & CEPH_OSD_FULL)
return true;
}
return false;
}
void OSDService::send_message_osd_cluster(int peer, Message *m, epoch_t from_epoch)
{
OSDMapRef next_map = get_nextmap_reserved();
@ -2147,7 +2204,7 @@ int OSD::init()
tick_timer.init();
tick_timer_without_osd_lock.init();
service.backfill_request_timer.init();
service.recovery_request_timer.init();
// mount.
dout(2) << "mounting " << dev_path << " "
@ -2632,6 +2689,14 @@ void OSD::final_init()
test_ops_hook,
"Trigger a scheduled scrub ");
assert(r == 0);
r = admin_socket->register_command(
"injectfull",
"injectfull " \
"name=type,type=CephString,req=false " \
"name=count,type=CephInt,req=false ",
test_ops_hook,
"Inject a full disk (optional count times)");
assert(r == 0);
}
void OSD::create_logger()
@ -2839,6 +2904,7 @@ void OSD::create_recoverystate_perf()
rs_perf.add_time_avg(rs_down_latency, "down_latency", "Down recovery state latency");
rs_perf.add_time_avg(rs_getmissing_latency, "getmissing_latency", "Getmissing recovery state latency");
rs_perf.add_time_avg(rs_waitupthru_latency, "waitupthru_latency", "Waitupthru recovery state latency");
rs_perf.add_time_avg(rs_notrecovering_latency, "notrecovering_latency", "Notrecovering recovery state latency");
recoverystate_perf = rs_perf.create_perf_counters();
cct->get_perfcounters_collection()->add(recoverystate_perf);
@ -4854,6 +4920,24 @@ void TestOpsSocketHook::test_ops(OSDService *service, ObjectStore *store,
pg->unlock();
return;
}
if (command == "injectfull") {
int64_t count;
string type;
OSDService::s_names state;
cmd_getval(service->cct, cmdmap, "type", type, string("full"));
cmd_getval(service->cct, cmdmap, "count", count, (int64_t)-1);
if (type == "none" || count == 0) {
type = "none";
count = 0;
}
state = service->get_full_state(type);
if (state == OSDService::s_names::INVALID) {
ss << "Invalid type use (none, nearfull, backfillfull, full, failsafe)";
return;
}
service->set_injectfull(state, count);
return;
}
ss << "Internal error - command=" << command;
}
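A brief sketch of driving this handler through the admin socket (osd.0 and
the default socket path are assumptions; the cephtool test above does the
same via "--admin-daemon"):

    # report backfillfull for the next 5 fullness checks
    ceph daemon osd.0 injectfull backfillfull 5

    # with no count, -1 is used: stay full until cleared
    ceph daemon osd.0 injectfull failsafe

    # clear any injected state
    ceph daemon osd.0 injectfull none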
@ -5185,6 +5269,8 @@ void OSD::send_full_update()
unsigned state = 0;
if (service.is_full()) {
state = CEPH_OSD_FULL;
} else if (service.is_backfillfull()) {
state = CEPH_OSD_BACKFILLFULL;
} else if (service.is_nearfull()) {
state = CEPH_OSD_NEARFULL;
}

View File

@ -202,6 +202,7 @@ enum {
rs_down_latency,
rs_getmissing_latency,
rs_waitupthru_latency,
rs_notrecovering_latency,
rs_last,
};
@ -917,9 +918,9 @@ public:
return (((uint64_t)cur_epoch) << 32) | ((uint64_t)(next_notif_id++));
}
// -- Backfill Request Scheduling --
Mutex backfill_request_lock;
SafeTimer backfill_request_timer;
// -- Recovery/Backfill Request Scheduling --
Mutex recovery_request_lock;
SafeTimer recovery_request_timer;
// -- tids --
// for ops i issue
@ -1025,7 +1026,7 @@ public:
Mutex::Locker l(recovery_lock);
_maybe_queue_recovery();
}
void clear_queued_recovery(PG *pg, bool front = false) {
void clear_queued_recovery(PG *pg) {
Mutex::Locker l(recovery_lock);
for (list<pair<epoch_t, PGRef> >::iterator i = awaiting_throttle.begin();
i != awaiting_throttle.end();
@ -1137,26 +1138,51 @@ public:
// -- OSD Full Status --
private:
Mutex full_status_lock;
enum s_names { NONE, NEARFULL, FULL, FAILSAFE } cur_state; // ascending
const char *get_full_state_name(s_names s) {
friend TestOpsSocketHook;
mutable Mutex full_status_lock;
enum s_names { INVALID = -1, NONE, NEARFULL, BACKFILLFULL, FULL, FAILSAFE } cur_state; // ascending
const char *get_full_state_name(s_names s) const {
switch (s) {
case NONE: return "none";
case NEARFULL: return "nearfull";
case BACKFILLFULL: return "backfillfull";
case FULL: return "full";
case FAILSAFE: return "failsafe";
default: return "???";
}
}
s_names get_full_state(string type) const {
if (type == "none")
return NONE;
else if (type == "failsafe")
return FAILSAFE;
else if (type == "full")
return FULL;
else if (type == "backfillfull")
return BACKFILLFULL;
else if (type == "nearfull")
return NEARFULL;
else
return INVALID;
}
double cur_ratio; ///< current utilization
mutable int64_t injectfull = 0;
s_names injectfull_state = NONE;
float get_failsafe_full_ratio();
void check_full_status(const osd_stat_t &stat);
bool _check_full(s_names type, ostream &ss) const;
public:
bool check_failsafe_full();
bool is_nearfull();
bool is_full();
bool too_full_for_backfill(double *ratio, double *max_ratio);
bool check_failsafe_full(ostream &ss) const;
bool check_full(ostream &ss) const;
bool check_backfill_full(ostream &ss) const;
bool check_nearfull(ostream &ss) const;
bool is_failsafe_full() const;
bool is_full() const;
bool is_backfillfull() const;
bool is_nearfull() const;
bool need_fullness_update(); ///< osdmap state needs update
void set_injectfull(s_names type, int64_t count);
bool check_osdmap_full(const set<pg_shard_t> &missing_on);
// -- epochs --

View File

@ -450,7 +450,7 @@ void OSDMap::Incremental::encode(bufferlist& bl, uint64_t features) const
}
{
uint8_t target_v = 3;
uint8_t target_v = 4;
if (!HAVE_FEATURE(features, SERVER_LUMINOUS)) {
target_v = 2;
}
@ -470,6 +470,7 @@ void OSDMap::Incremental::encode(bufferlist& bl, uint64_t features) const
if (target_v >= 3) {
::encode(new_nearfull_ratio, bl);
::encode(new_full_ratio, bl);
::encode(new_backfillfull_ratio, bl);
}
ENCODE_FINISH(bl); // osd-only data
}
@ -654,7 +655,7 @@ void OSDMap::Incremental::decode(bufferlist::iterator& bl)
}
{
DECODE_START(3, bl); // extended, osd-only data
DECODE_START(4, bl); // extended, osd-only data
::decode(new_hb_back_up, bl);
::decode(new_up_thru, bl);
::decode(new_last_clean_interval, bl);
@ -677,6 +678,11 @@ void OSDMap::Incremental::decode(bufferlist::iterator& bl)
new_nearfull_ratio = -1;
new_full_ratio = -1;
}
if (struct_v >= 4) {
::decode(new_backfillfull_ratio, bl);
} else {
new_backfillfull_ratio = -1;
}
DECODE_FINISH(bl); // osd-only data
}
@ -720,6 +726,7 @@ void OSDMap::Incremental::dump(Formatter *f) const
f->dump_int("new_flags", new_flags);
f->dump_float("new_full_ratio", new_full_ratio);
f->dump_float("new_nearfull_ratio", new_nearfull_ratio);
f->dump_float("new_backfillfull_ratio", new_backfillfull_ratio);
if (fullmap.length()) {
f->open_object_section("full_map");
@ -1022,20 +1029,57 @@ int OSDMap::calc_num_osds()
return num_osd;
}
void OSDMap::count_full_nearfull_osds(int *full, int *nearfull) const
void OSDMap::count_full_nearfull_osds(int *full, int *backfill, int *nearfull) const
{
*full = 0;
*backfill = 0;
*nearfull = 0;
for (int i = 0; i < max_osd; ++i) {
if (exists(i) && is_up(i) && is_in(i)) {
if (osd_state[i] & CEPH_OSD_FULL)
++(*full);
else if (osd_state[i] & CEPH_OSD_BACKFILLFULL)
++(*backfill);
else if (osd_state[i] & CEPH_OSD_NEARFULL)
++(*nearfull);
}
}
}
static bool get_osd_utilization(const ceph::unordered_map<int32_t,osd_stat_t> &osd_stat,
int id, int64_t* kb, int64_t* kb_used, int64_t* kb_avail) {
auto p = osd_stat.find(id);
if (p == osd_stat.end())
return false;
*kb = p->second.kb;
*kb_used = p->second.kb_used;
*kb_avail = p->second.kb_avail;
return *kb > 0;
}
void OSDMap::get_full_osd_util(const ceph::unordered_map<int32_t,osd_stat_t> &osd_stat,
map<int, float> *full, map<int, float> *backfill, map<int, float> *nearfull) const
{
full->clear();
backfill->clear();
nearfull->clear();
for (int i = 0; i < max_osd; ++i) {
if (exists(i) && is_up(i) && is_in(i)) {
int64_t kb, kb_used, kb_avail;
if (osd_state[i] & CEPH_OSD_FULL) {
if (get_osd_utilization(osd_stat, i, &kb, &kb_used, &kb_avail))
full->emplace(i, (float)kb_used / (float)kb);
} else if (osd_state[i] & CEPH_OSD_BACKFILLFULL) {
if (get_osd_utilization(osd_stat, i, &kb, &kb_used, &kb_avail))
backfill->emplace(i, (float)kb_used / (float)kb);
} else if (osd_state[i] & CEPH_OSD_NEARFULL) {
if (get_osd_utilization(osd_stat, i, &kb, &kb_used, &kb_avail))
nearfull->emplace(i, (float)kb_used / (float)kb);
}
}
}
}
void OSDMap::get_all_osds(set<int32_t>& ls) const
{
for (int i=0; i<max_osd; i++)
@ -1575,6 +1619,9 @@ int OSDMap::apply_incremental(const Incremental &inc)
if (inc.new_nearfull_ratio >= 0) {
nearfull_ratio = inc.new_nearfull_ratio;
}
if (inc.new_backfillfull_ratio >= 0) {
backfillfull_ratio = inc.new_backfillfull_ratio;
}
if (inc.new_full_ratio >= 0) {
full_ratio = inc.new_full_ratio;
}
@ -2148,7 +2195,7 @@ void OSDMap::encode(bufferlist& bl, uint64_t features) const
}
{
uint8_t target_v = 2;
uint8_t target_v = 3;
if (!HAVE_FEATURE(features, SERVER_LUMINOUS)) {
target_v = 1;
}
@ -2173,6 +2220,7 @@ void OSDMap::encode(bufferlist& bl, uint64_t features) const
if (target_v >= 2) {
::encode(nearfull_ratio, bl);
::encode(full_ratio, bl);
::encode(backfillfull_ratio, bl);
}
ENCODE_FINISH(bl); // osd-only data
}
@ -2390,7 +2438,7 @@ void OSDMap::decode(bufferlist::iterator& bl)
}
{
DECODE_START(2, bl); // extended, osd-only data
DECODE_START(3, bl); // extended, osd-only data
::decode(osd_addrs->hb_back_addr, bl);
::decode(osd_info, bl);
::decode(blacklist, bl);
@ -2407,6 +2455,11 @@ void OSDMap::decode(bufferlist::iterator& bl)
nearfull_ratio = 0;
full_ratio = 0;
}
if (struct_v >= 3) {
::decode(backfillfull_ratio, bl);
} else {
backfillfull_ratio = 0;
}
DECODE_FINISH(bl); // osd-only data
}
@ -2480,6 +2533,7 @@ void OSDMap::dump(Formatter *f) const
f->dump_stream("modified") << get_modified();
f->dump_string("flags", get_flag_string());
f->dump_float("full_ratio", full_ratio);
f->dump_float("backfillfull_ratio", backfillfull_ratio);
f->dump_float("nearfull_ratio", nearfull_ratio);
f->dump_string("cluster_snapshot", get_cluster_snapshot());
f->dump_int("pool_max", get_pool_max());
@ -2701,6 +2755,7 @@ void OSDMap::print(ostream& out) const
out << "flags " << get_flag_string() << "\n";
out << "full_ratio " << full_ratio << "\n";
out << "backfillfull_ratio " << backfillfull_ratio << "\n";
out << "nearfull_ratio " << nearfull_ratio << "\n";
if (get_cluster_snapshot().length())
out << "cluster_snapshot " << get_cluster_snapshot() << "\n";

View File

@ -155,6 +155,7 @@ public:
string cluster_snapshot;
float new_nearfull_ratio = -1;
float new_backfillfull_ratio = -1;
float new_full_ratio = -1;
mutable bool have_crc; ///< crc values are defined
@ -254,7 +255,7 @@ private:
string cluster_snapshot;
bool new_blacklist_entries;
float full_ratio = 0, nearfull_ratio = 0;
float full_ratio = 0, backfillfull_ratio = 0, nearfull_ratio = 0;
mutable uint64_t cached_up_osd_features;
@ -336,10 +337,15 @@ public:
float get_full_ratio() const {
return full_ratio;
}
float get_backfillfull_ratio() const {
return backfillfull_ratio;
}
float get_nearfull_ratio() const {
return nearfull_ratio;
}
void count_full_nearfull_osds(int *full, int *nearfull) const;
void count_full_nearfull_osds(int *full, int *backfill, int *nearfull) const;
void get_full_osd_util(const ceph::unordered_map<int32_t,osd_stat_t> &osd_stat,
map<int, float> *full, map<int, float> *backfill, map<int, float> *nearfull) const;
/***** cluster state *****/
/* osds */

View File

@ -3809,14 +3809,24 @@ void PG::reject_reservation()
void PG::schedule_backfill_full_retry()
{
Mutex::Locker lock(osd->backfill_request_lock);
osd->backfill_request_timer.add_event_after(
Mutex::Locker lock(osd->recovery_request_lock);
osd->recovery_request_timer.add_event_after(
cct->_conf->osd_backfill_retry_interval,
new QueuePeeringEvt<RequestBackfill>(
this, get_osdmap()->get_epoch(),
RequestBackfill()));
}
void PG::schedule_recovery_full_retry()
{
Mutex::Locker lock(osd->recovery_request_lock);
osd->recovery_request_timer.add_event_after(
cct->_conf->osd_recovery_retry_interval,
new QueuePeeringEvt<DoRecovery>(
this, get_osdmap()->get_epoch(),
DoRecovery()));
}
void PG::clear_scrub_reserved()
{
scrubber.reserved_peers.clear();
@ -5237,6 +5247,7 @@ void PG::start_peering_interval(
state_clear(PG_STATE_PEERED);
state_clear(PG_STATE_DOWN);
state_clear(PG_STATE_RECOVERY_WAIT);
state_clear(PG_STATE_RECOVERY_TOOFULL);
state_clear(PG_STATE_RECOVERING);
peer_purged.clear();
@ -6488,6 +6499,24 @@ void PG::RecoveryState::NotBackfilling::exit()
pg->osd->recoverystate_perf->tinc(rs_notbackfilling_latency, dur);
}
/*----NotRecovering------*/
PG::RecoveryState::NotRecovering::NotRecovering(my_context ctx)
: my_base(ctx),
NamedState(context< RecoveryMachine >().pg->cct, "Started/Primary/Active/NotRecovering")
{
context< RecoveryMachine >().log_enter(state_name);
PG *pg = context< RecoveryMachine >().pg;
pg->publish_stats_to_osd();
}
void PG::RecoveryState::NotRecovering::exit()
{
context< RecoveryMachine >().log_exit(state_name, enter_time);
PG *pg = context< RecoveryMachine >().pg;
utime_t dur = ceph_clock_now() - enter_time;
pg->osd->recoverystate_perf->tinc(rs_notrecovering_latency, dur);
}
/*---RepNotRecovering----*/
PG::RecoveryState::RepNotRecovering::RepNotRecovering(my_context ctx)
: my_base(ctx),
@ -6554,18 +6583,17 @@ boost::statechart::result
PG::RecoveryState::RepNotRecovering::react(const RequestBackfillPrio &evt)
{
PG *pg = context< RecoveryMachine >().pg;
double ratio, max_ratio;
ostringstream ss;
if (pg->cct->_conf->osd_debug_reject_backfill_probability > 0 &&
(rand()%1000 < (pg->cct->_conf->osd_debug_reject_backfill_probability*1000.0))) {
ldout(pg->cct, 10) << "backfill reservation rejected: failure injection"
<< dendl;
post_event(RemoteReservationRejected());
} else if (pg->osd->too_full_for_backfill(&ratio, &max_ratio) &&
!pg->cct->_conf->osd_debug_skip_full_check_in_backfill_reservation) {
ldout(pg->cct, 10) << "backfill reservation rejected: full ratio is "
<< ratio << ", which is greater than max allowed ratio "
<< max_ratio << dendl;
} else if (!pg->cct->_conf->osd_debug_skip_full_check_in_backfill_reservation &&
pg->osd->check_backfill_full(ss)) {
ldout(pg->cct, 10) << "backfill reservation rejected: "
<< ss.str() << dendl;
post_event(RemoteReservationRejected());
} else {
pg->osd->remote_reserver.request_reservation(
@ -6590,7 +6618,7 @@ PG::RecoveryState::RepWaitBackfillReserved::react(const RemoteBackfillReserved &
{
PG *pg = context< RecoveryMachine >().pg;
double ratio, max_ratio;
ostringstream ss;
if (pg->cct->_conf->osd_debug_reject_backfill_probability > 0 &&
(rand()%1000 < (pg->cct->_conf->osd_debug_reject_backfill_probability*1000.0))) {
ldout(pg->cct, 10) << "backfill reservation rejected after reservation: "
@ -6598,11 +6626,10 @@ PG::RecoveryState::RepWaitBackfillReserved::react(const RemoteBackfillReserved &
pg->osd->remote_reserver.cancel_reservation(pg->info.pgid);
post_event(RemoteReservationRejected());
return discard_event();
} else if (pg->osd->too_full_for_backfill(&ratio, &max_ratio) &&
!pg->cct->_conf->osd_debug_skip_full_check_in_backfill_reservation) {
ldout(pg->cct, 10) << "backfill reservation rejected after reservation: full ratio is "
<< ratio << ", which is greater than max allowed ratio "
<< max_ratio << dendl;
} else if (!pg->cct->_conf->osd_debug_skip_full_check_in_backfill_reservation &&
pg->osd->check_backfill_full(ss)) {
ldout(pg->cct, 10) << "backfill reservation rejected after reservation: "
<< ss.str() << dendl;
pg->osd->remote_reserver.cancel_reservation(pg->info.pgid);
post_event(RemoteReservationRejected());
return discard_event();
@ -6673,6 +6700,15 @@ PG::RecoveryState::WaitLocalRecoveryReserved::WaitLocalRecoveryReserved(my_conte
{
context< RecoveryMachine >().log_enter(state_name);
PG *pg = context< RecoveryMachine >().pg;
// Make sure all nodes that are part of the recovery aren't full
if (!pg->cct->_conf->osd_debug_skip_full_check_in_recovery &&
pg->osd->check_osdmap_full(pg->actingbackfill)) {
post_event(RecoveryTooFull());
return;
}
pg->state_clear(PG_STATE_RECOVERY_TOOFULL);
pg->state_set(PG_STATE_RECOVERY_WAIT);
pg->osd->local_reserver.request_reservation(
pg->info.pgid,
@ -6683,6 +6719,15 @@ PG::RecoveryState::WaitLocalRecoveryReserved::WaitLocalRecoveryReserved(my_conte
pg->publish_stats_to_osd();
}
boost::statechart::result
PG::RecoveryState::WaitLocalRecoveryReserved::react(const RecoveryTooFull &evt)
{
PG *pg = context< RecoveryMachine >().pg;
pg->state_set(PG_STATE_RECOVERY_TOOFULL);
pg->schedule_recovery_full_retry();
return transit<NotRecovering>();
}
void PG::RecoveryState::WaitLocalRecoveryReserved::exit()
{
context< RecoveryMachine >().log_exit(state_name, enter_time);
@ -6739,6 +6784,7 @@ PG::RecoveryState::Recovering::Recovering(my_context ctx)
PG *pg = context< RecoveryMachine >().pg;
pg->state_clear(PG_STATE_RECOVERY_WAIT);
pg->state_clear(PG_STATE_RECOVERY_TOOFULL);
pg->state_set(PG_STATE_RECOVERING);
pg->publish_stats_to_osd();
pg->queue_recovery();
@ -7187,6 +7233,7 @@ void PG::RecoveryState::Active::exit()
pg->state_clear(PG_STATE_BACKFILL_TOOFULL);
pg->state_clear(PG_STATE_BACKFILL_WAIT);
pg->state_clear(PG_STATE_RECOVERY_WAIT);
pg->state_clear(PG_STATE_RECOVERY_TOOFULL);
utime_t dur = ceph_clock_now() - enter_time;
pg->osd->recoverystate_perf->tinc(rs_active_latency, dur);
pg->agent_stop();

View File

@ -1340,6 +1340,7 @@ public:
void reject_reservation();
void schedule_backfill_full_retry();
void schedule_recovery_full_retry();
// -- recovery state --
@ -1505,6 +1506,7 @@ public:
TrivialEvent(RequestRecovery)
TrivialEvent(RecoveryDone)
TrivialEvent(BackfillTooFull)
TrivialEvent(RecoveryTooFull)
TrivialEvent(AllReplicasRecovered)
TrivialEvent(DoRecovery)
@ -1850,6 +1852,14 @@ public:
boost::statechart::result react(const RemoteReservationRejected& evt);
};
struct NotRecovering : boost::statechart::state< NotRecovering, Active>, NamedState {
typedef boost::mpl::list<
boost::statechart::transition< DoRecovery, WaitLocalRecoveryReserved >
> reactions;
explicit NotRecovering(my_context ctx);
void exit();
};
struct RepNotRecovering;
struct ReplicaActive : boost::statechart::state< ReplicaActive, Started, RepNotRecovering >, NamedState {
explicit ReplicaActive(my_context ctx);
@ -1938,10 +1948,12 @@ public:
struct WaitLocalRecoveryReserved : boost::statechart::state< WaitLocalRecoveryReserved, Active >, NamedState {
typedef boost::mpl::list <
boost::statechart::transition< LocalRecoveryReserved, WaitRemoteRecoveryReserved >
boost::statechart::transition< LocalRecoveryReserved, WaitRemoteRecoveryReserved >,
boost::statechart::custom_reaction< RecoveryTooFull >
> reactions;
explicit WaitLocalRecoveryReserved(my_context ctx);
void exit();
boost::statechart::result react(const RecoveryTooFull &evt);
};
struct Activating : boost::statechart::state< Activating, Active >, NamedState {

View File

@ -261,6 +261,10 @@ typedef ceph::shared_ptr<const OSDMap> OSDMapRef;
virtual LogClientTemp clog_error() = 0;
virtual bool check_failsafe_full(ostream &ss) = 0;
virtual bool check_osdmap_full(const set<pg_shard_t> &missing_on) = 0;
virtual ~Listener() {}
};
Listener *parent;

View File

@ -1888,8 +1888,13 @@ void PrimaryLogPG::do_op(OpRequestRef& op)
<< *m << dendl;
return;
}
if (!(m->get_source().is_mds()) && osd->check_failsafe_full() && write_ordered) {
// mds should have stopped writing before this point.
// We can't allow the OSD to become non-startable even if the mds
// could be writing as part of file removals.
ostringstream ss;
if (write_ordered && osd->check_failsafe_full(ss)) {
dout(10) << __func__ << " fail-safe full check failed, dropping request"
<< ss.str()
<< dendl;
return;
}
@ -3328,10 +3333,9 @@ void PrimaryLogPG::do_scan(
switch (m->op) {
case MOSDPGScan::OP_SCAN_GET_DIGEST:
{
double ratio, full_ratio;
if (osd->too_full_for_backfill(&ratio, &full_ratio)) {
dout(1) << __func__ << ": Canceling backfill, current usage is "
<< ratio << ", which exceeds " << full_ratio << dendl;
ostringstream ss;
if (osd->check_backfill_full(ss)) {
dout(1) << __func__ << ": Canceling backfill, " << ss.str() << dendl;
queue_peering_event(
CephPeeringEvtRef(
std::make_shared<CephPeeringEvt>(
@ -13027,6 +13031,11 @@ void PrimaryLogPG::_scrub_finish()
}
}
bool PrimaryLogPG::check_osdmap_full(const set<pg_shard_t> &missing_on)
{
return osd->check_osdmap_full(missing_on);
}
/*---SnapTrimmer Logging---*/
#undef dout_prefix
#define dout_prefix *_dout << pg->gen_prefix()
@ -13268,6 +13277,10 @@ int PrimaryLogPG::getattrs_maybe_cache(
return r;
}
bool PrimaryLogPG::check_failsafe_full(ostream &ss) {
return osd->check_failsafe_full(ss);
}
void intrusive_ptr_add_ref(PrimaryLogPG *pg) { pg->get("intptr"); }
void intrusive_ptr_release(PrimaryLogPG *pg) { pg->put("intptr"); }

View File

@ -1731,6 +1731,8 @@ public:
void on_flushed() override;
void on_removal(ObjectStore::Transaction *t) override;
void on_shutdown() override;
bool check_failsafe_full(ostream &ss) override;
bool check_osdmap_full(const set<pg_shard_t> &missing_on) override;
// attr cache handling
void setattr_maybe_cache(

View File

@ -807,6 +807,11 @@ void ReplicatedBackend::_do_push(OpRequestRef op)
vector<PushReplyOp> replies;
ObjectStore::Transaction t;
ostringstream ss;
if (get_parent()->check_failsafe_full(ss)) {
dout(10) << __func__ << " Out of space (failsafe) processing push request: " << ss.str() << dendl;
ceph_abort();
}
for (vector<PushOp>::const_iterator i = m->pushes.begin();
i != m->pushes.end();
++i) {
@ -862,6 +867,13 @@ void ReplicatedBackend::_do_pull_response(OpRequestRef op)
op->mark_started();
vector<PullOp> replies(1);
ostringstream ss;
if (get_parent()->check_failsafe_full(ss)) {
dout(10) << __func__ << " Out of space (failsafe) processing pull response (push): " << ss.str() << dendl;
ceph_abort();
}
ObjectStore::Transaction t;
list<pull_complete_info> to_continue;
for (vector<PushOp>::const_iterator i = m->pushes.begin();

View File

@ -789,6 +789,8 @@ std::string pg_state_string(int state)
oss << "clean+";
if (state & PG_STATE_RECOVERY_WAIT)
oss << "recovery_wait+";
if (state & PG_STATE_RECOVERY_TOOFULL)
oss << "recovery_toofull+";
if (state & PG_STATE_RECOVERING)
oss << "recovering+";
if (state & PG_STATE_DOWN)
@ -869,6 +871,8 @@ int pg_string_state(const std::string& state)
type = PG_STATE_BACKFILL_TOOFULL;
else if (state == "recovery_wait")
type = PG_STATE_RECOVERY_WAIT;
else if (state == "recovery_toofull")
type = PG_STATE_RECOVERY_TOOFULL;
else if (state == "undersized")
type = PG_STATE_UNDERSIZED;
else if (state == "activating")

View File

@ -971,6 +971,7 @@ inline ostream& operator<<(ostream& out, const osd_stat_t& s) {
#define PG_STATE_PEERED (1<<25) // peered, cannot go active, can recover
#define PG_STATE_SNAPTRIM (1<<26) // trimming snaps
#define PG_STATE_SNAPTRIM_WAIT (1<<27) // queued to trim snaps
#define PG_STATE_RECOVERY_TOOFULL (1<<28) // recovery can't proceed: too full
std::string pg_state_string(int state);
std::string pg_vector_string(const vector<int32_t> &a);

View File

@ -20,6 +20,7 @@
modified \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ (re)
flags
full_ratio 0
backfillfull_ratio 0
nearfull_ratio 0
pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 192 pgp_num 192 last_change 0 flags hashpspool stripe_width 0
@ -43,6 +44,7 @@
modified \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ (re)
flags
full_ratio 0
backfillfull_ratio 0
nearfull_ratio 0
pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 0 flags hashpspool stripe_width 0

View File

@ -77,6 +77,7 @@
modified \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ (re)
flags
full_ratio 0
backfillfull_ratio 0
nearfull_ratio 0
pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 192 pgp_num 192 last_change 0 flags hashpspool stripe_width 0

View File

@ -790,6 +790,7 @@
modified \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ (re)
flags
full_ratio 0
backfillfull_ratio 0
nearfull_ratio 0
pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 15296 pgp_num 15296 last_change 0 flags hashpspool stripe_width 0

View File

@ -183,21 +183,6 @@ class TestPG(TestArgparse):
def test_force_create_pg(self):
self.one_pgid('force_create_pg')
def set_ratio(self, command):
self.assert_valid_command(['pg',
command,
'0.0'])
assert_equal({}, validate_command(sigdict, ['pg', command]))
assert_equal({}, validate_command(sigdict, ['pg',
command,
'2.0']))
def test_set_full_ratio(self):
self.set_ratio('set_full_ratio')
def test_set_nearfull_ratio(self):
self.set_ratio('set_nearfull_ratio')
class TestAuth(TestArgparse):
@ -1153,6 +1138,24 @@ class TestOSD(TestArgparse):
'poolname',
'toomany']))
def set_ratio(self, command):
self.assert_valid_command(['osd',
command,
'0.0'])
assert_equal({}, validate_command(sigdict, ['osd', command]))
assert_equal({}, validate_command(sigdict, ['osd',
command,
'2.0']))
def test_set_full_ratio(self):
self.set_ratio('set-full-ratio')
def test_set_backfillfull_ratio(self):
self.set_ratio('set-backfillfull-ratio')
def test_set_nearfull_ratio(self):
self.set_ratio('set-nearfull-ratio')
class TestConfigKey(TestArgparse):

View File

@ -654,6 +654,14 @@ static int update_pgmap_meta(MonitorDBStore& st)
::encode(full_ratio, bl);
t->put(prefix, "full_ratio", bl);
}
{
auto backfillfull_ratio = g_ceph_context->_conf->mon_osd_backfillfull_ratio;
if (backfillfull_ratio > 1.0)
backfillfull_ratio /= 100.0;
bufferlist bl;
::encode(backfillfull_ratio, bl);
t->put(prefix, "backfillfull_ratio", bl);
}
{
auto nearfull_ratio = g_ceph_context->_conf->mon_osd_nearfull_ratio;
if (nearfull_ratio > 1.0)

View File

@ -2906,7 +2906,7 @@ int main(int argc, char **argv)
throw std::runtime_error(ss.str());
}
vector<json_spirit::Value>::iterator i = array.begin();
//if (i == array.end() || i->type() != json_spirit::str_type) {
assert(i != array.end());
if (i->type() != json_spirit::str_type) {
ss << "Object '" << object
<< "' must be a JSON array with the first element a string";