diff --git a/doc/dev/osd_internals/recovery_reservation.rst b/doc/dev/osd_internals/recovery_reservation.rst
index 24db1387f50..4ab03192fe5 100644
--- a/doc/dev/osd_internals/recovery_reservation.rst
+++ b/doc/dev/osd_internals/recovery_reservation.rst
@@ -34,8 +34,8 @@ the typical process.
 
 Once the primary has its local reservation, it requests a remote
 reservation from the backfill target. This reservation CAN be rejected,
-for instance if the OSD is too full (osd_backfill_full_ratio config
-option). If the reservation is rejected, the primary drops its local
+for instance if the OSD is too full (backfillfull_ratio osd setting).
+If the reservation is rejected, the primary drops its local
 reservation, waits (osd_backfill_retry_interval), and then retries. It
 will retry indefinitely.
 
@@ -62,9 +62,10 @@ to the monitor. The state chart can set:
 
  - recovery_wait: waiting for local/remote reservations
  - recovering: recovering
+ - recovery_toofull: recovery stopped, OSD(s) above full ratio
  - backfill_wait: waiting for remote backfill reservations
  - backfilling: backfilling
- - backfill_toofull: backfill reservation rejected, OSD too full
+ - backfill_toofull: backfill stopped, OSD(s) above backfillfull ratio
 
 --------
diff --git a/doc/man/8/ceph.rst b/doc/man/8/ceph.rst
index f878f882525..b2489126848 100644
--- a/doc/man/8/ceph.rst
+++ b/doc/man/8/ceph.rst
@@ -1166,6 +1166,12 @@ Usage::
 
 	ceph pg set_full_ratio <float[0.0-1.0]>
 
+Subcommand ``set_backfillfull_ratio`` sets ratio at which pgs are considered too full to backfill.
+
+Usage::
+
+	ceph pg set_backfillfull_ratio <float[0.0-1.0]>
+
 Subcommand ``set_nearfull_ratio`` sets ratio at which pgs are considered nearly
 full.
 
diff --git a/doc/rados/configuration/mon-config-ref.rst b/doc/rados/configuration/mon-config-ref.rst
index 8c05571c6ce..b19461f7a62 100644
--- a/doc/rados/configuration/mon-config-ref.rst
+++ b/doc/rados/configuration/mon-config-ref.rst
@@ -400,6 +400,7 @@ a reasonable number for a near full ratio.
 
 	[global]
 	mon osd full ratio = .80
+	mon osd backfillfull ratio = .75
 	mon osd nearfull ratio = .70
 
 
@@ -412,6 +413,15 @@ a reasonable number for a near full ratio.
 :Default: ``.95``
 
 
+``mon osd backfillfull ratio``
+
+:Description: The percentage of disk space used before an OSD is
+              considered too ``full`` to backfill.
+
+:Type: Float
+:Default: ``.90``
+
+
 ``mon osd nearfull ratio``
 
 :Description: The percentage of disk space used before an OSD is
diff --git a/doc/rados/configuration/osd-config-ref.rst b/doc/rados/configuration/osd-config-ref.rst
index 06e46a6eab9..5679c0caeeb 100644
--- a/doc/rados/configuration/osd-config-ref.rst
+++ b/doc/rados/configuration/osd-config-ref.rst
@@ -560,15 +560,6 @@ priority than requests to read or write data.
 :Default: ``512``
 
 
-``osd backfill full ratio``
-
-:Description: Refuse to accept backfill requests when the Ceph OSD Daemon's
-              full ratio is above this value.
-
-:Type: Float
-:Default: ``0.85``
-
-
 ``osd backfill retry interval``
 
 :Description: The number of seconds to wait before retrying backfill requests.
@@ -673,13 +664,6 @@ perform well in a degraded state.
 :Default: ``8 << 20``
 
 
-``osd recovery threads``
-
-:Description: The number of threads for recovering data.
-:Type: 32-bit Integer
-:Default: ``1``
-
-
 ``osd recovery thread timeout``
 
 :Description: The maximum time in seconds before timing out a recovery thread.
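For orientation, the runtime equivalents of these configuration changes are the commands this series adds; a minimal, illustrative session (the ratio values are examples, not recommendations)::

    # ratios currently recorded in the OSDMap
    ceph osd dump | grep ratio

    # adjust the thresholds cluster-wide
    ceph osd set-backfillfull-ratio .90
    ceph osd set-nearfull-ratio .85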
diff --git a/doc/rados/operations/monitoring-osd-pg.rst b/doc/rados/operations/monitoring-osd-pg.rst
index 866ae8313ba..b390b030b71 100644
--- a/doc/rados/operations/monitoring-osd-pg.rst
+++ b/doc/rados/operations/monitoring-osd-pg.rst
@@ -468,8 +468,7 @@ Ceph provides a number of settings to balance the resource contention between
 new service requests and the need to recover data objects and restore the
 placement groups to the current state. The ``osd recovery delay start`` setting
 allows an OSD to restart, re-peer and even process some replay requests before
-starting the recovery process. The ``osd recovery threads`` setting limits the
-number of threads for the recovery process (1 thread by default). The ``osd
+starting the recovery process. The ``osd
 recovery thread timeout`` sets a thread timeout, because multiple OSDs may fail,
 restart and re-peer at staggered rates. The ``osd recovery max active`` setting
 limits the number of recovery requests an OSD will entertain simultaneously to
@@ -497,8 +496,9 @@ placement group can't be backfilled, it may be considered ``incomplete``.
 Ceph provides a number of settings to manage the load spike associated with
 reassigning placement groups to an OSD (especially a new OSD). By default,
 ``osd_max_backfills`` sets the maximum number of concurrent backfills to or from
-an OSD to 10. The ``osd backfill full ratio`` enables an OSD to refuse a
-backfill request if the OSD is approaching its full ratio (85%, by default).
+an OSD to 10. The ``backfill full ratio`` enables an OSD to refuse a
+backfill request if the OSD is approaching its full ratio (90%, by default);
+this ratio can be changed with the ``ceph osd set-backfillfull-ratio`` command.
 If an OSD refuses a backfill request, the ``osd backfill retry interval``
 enables an OSD to retry the request (after 10 seconds, by default). OSDs can
 also set ``osd backfill scan min`` and ``osd backfill scan max`` to manage scan
diff --git a/doc/rados/troubleshooting/troubleshooting-osd.rst b/doc/rados/troubleshooting/troubleshooting-osd.rst
index 3661f8af45d..fe29f4767f9 100644
--- a/doc/rados/troubleshooting/troubleshooting-osd.rst
+++ b/doc/rados/troubleshooting/troubleshooting-osd.rst
@@ -206,7 +206,9 @@ Ceph prevents you from writing to a full OSD so that you don't lose data.
 In an operational cluster, you should receive a warning when your cluster
 is getting near its full ratio. The ``mon osd full ratio`` defaults to
 ``0.95``, or 95% of capacity before it stops clients from writing data.
-The ``mon osd nearfull ratio`` defaults to ``0.85``, or 85% of capacity
+The ``mon osd backfillfull ratio`` defaults to ``0.90``, or 90% of
+capacity when it blocks backfills from starting. The
+``mon osd nearfull ratio`` defaults to ``0.85``, or 85% of capacity
 when it generates a health warning.
 
 Full cluster issues usually arise when testing how Ceph handles an OSD
@@ -214,20 +216,21 @@ failure on a small cluster. When one node has a high percentage of the
 cluster's data, the cluster can easily eclipse its nearfull and full ratio
 immediately. If you are testing how Ceph reacts to OSD failures on a small
 cluster, you should leave ample free disk space and consider temporarily
-lowering the ``mon osd full ratio`` and ``mon osd nearfull ratio``.
+lowering the ``mon osd full ratio``, ``mon osd backfillfull ratio``, and
+``mon osd nearfull ratio``.
 
 Full ``ceph-osds`` will be reported by ``ceph health``::
 
   ceph health
-  HEALTH_WARN 1 nearfull osds
-  osd.2 is near full at 85%
+  HEALTH_WARN 1 nearfull osd(s)
 
 Or::
 
-  ceph health
-  HEALTH_ERR 1 nearfull osds, 1 full osds
-  osd.2 is near full at 85%
+  ceph health detail
+  HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s)
   osd.3 is full at 97%
+  osd.4 is backfill full at 91%
+  osd.2 is near full at 87%
 
 The best way to deal with a full cluster is to add new ``ceph-osds``, allowing
 the cluster to redistribute data to the newly available storage.
diff --git a/qa/tasks/ceph_manager.py b/qa/tasks/ceph_manager.py
index 8ff2556a7a0..1a9aff93c3c 100644
--- a/qa/tasks/ceph_manager.py
+++ b/qa/tasks/ceph_manager.py
@@ -696,7 +696,7 @@ class Thrasher:
         """
         Test backfills stopping when the replica fills up.
 
-        First, use osd_backfill_full_ratio to simulate a now full
-        osd by setting it to 0 on all of the OSDs.
+        First, use injectfull admin command to simulate a now full
+        osd on all of the OSDs.
 
         Second, on a random subset, set
@@ -705,13 +705,14 @@ class Thrasher:
 
         Then, verify that all backfills stop.
         """
-        self.log("injecting osd_backfill_full_ratio = 0")
+        self.log("injecting backfill full")
         for i in self.live_osds:
             self.ceph_manager.set_config(
                 i,
                 osd_debug_skip_full_check_in_backfill_reservation=
-                random.choice(['false', 'true']),
-                osd_backfill_full_ratio=0)
+                random.choice(['false', 'true']))
+            self.ceph_manager.osd_admin_socket(i, command=['injectfull', 'backfillfull'],
+                                               check_status=True, timeout=30, stdout=DEVNULL)
         for i in range(30):
             status = self.ceph_manager.compile_pg_status()
             if 'backfill' not in status.keys():
@@ -724,8 +725,9 @@ class Thrasher:
         for i in self.live_osds:
             self.ceph_manager.set_config(
                 i,
-                osd_debug_skip_full_check_in_backfill_reservation='false',
-                osd_backfill_full_ratio=0.85)
+                osd_debug_skip_full_check_in_backfill_reservation='false')
+            self.ceph_manager.osd_admin_socket(i, command=['injectfull', 'none'],
+                                               check_status=True, timeout=30, stdout=DEVNULL)
 
     def test_map_discontinuity(self):
         """
diff --git a/qa/workunits/ceph-helpers.sh b/qa/workunits/ceph-helpers.sh
index 9863668de75..8642d376a73 100755
--- a/qa/workunits/ceph-helpers.sh
+++ b/qa/workunits/ceph-helpers.sh
@@ -400,6 +400,7 @@ EOF
     if test -z "$(get_config mon $id mon_initial_members)" ; then
         ceph osd pool delete rbd rbd --yes-i-really-really-mean-it || return 1
         ceph osd pool create rbd $PG_NUM || return 1
+        ceph osd set-backfillfull-ratio .99
     fi
 }
 
@@ -634,7 +635,6 @@ function activate_osd() {
     ceph_disk_args+=" --prepend-to-path="
 
     local ceph_args="$CEPH_ARGS"
-    ceph_args+=" --osd-backfill-full-ratio=.99"
    ceph_args+=" --osd-failsafe-full-ratio=.99"
     ceph_args+=" --osd-journal-size=100"
     ceph_args+=" --osd-scrub-load-threshold=2000"
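The thrasher above now drives the OSD through the new ``injectfull`` admin socket command instead of lowering ``osd_backfill_full_ratio``. A hand-run sketch of the same mechanism (the socket path and OSD id are illustrative)::

    # pretend osd.0 is over the backfillfull threshold
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok injectfull backfillfull

    # clear the injected state
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok injectfull none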
diff --git a/qa/workunits/cephtool/test.sh b/qa/workunits/cephtool/test.sh
index 8d0ce1baeff..353e5f74518 100755
--- a/qa/workunits/cephtool/test.sh
+++ b/qa/workunits/cephtool/test.sh
@@ -1419,9 +1419,44 @@ function test_mon_pg()
   ceph osd set-full-ratio .962
   ceph osd dump | grep '^full_ratio 0.962'
+  ceph osd set-backfillfull-ratio .912
+  ceph osd dump | grep '^backfillfull_ratio 0.912'
   ceph osd set-nearfull-ratio .892
   ceph osd dump | grep '^nearfull_ratio 0.892'
 
+  # Check health status
+  ceph osd set-nearfull-ratio .913
+  ceph health | grep 'HEALTH_ERR Full ratio(s) out of order'
+  ceph health detail | grep 'backfill_ratio (0.912) < nearfull_ratio (0.913), increased'
+  ceph osd set-nearfull-ratio .892
+  ceph osd set-backfillfull-ratio .963
+  ceph health detail | grep 'full_ratio (0.962) < backfillfull_ratio (0.963), increased'
+  ceph osd set-backfillfull-ratio .912
+
+  # Check injected full results
+  WAITFORFULL=10
+  ceph --admin-daemon $CEPH_OUT_DIR/osd.0.asok injectfull nearfull
+  sleep $WAITFORFULL
+  ceph health | grep "HEALTH_WARN.*1 nearfull osd(s)"
+  ceph --admin-daemon $CEPH_OUT_DIR/osd.1.asok injectfull backfillfull
+  sleep $WAITFORFULL
+  ceph health | grep "HEALTH_WARN.*1 backfillfull osd(s)"
+  ceph --admin-daemon $CEPH_OUT_DIR/osd.2.asok injectfull failsafe
+  sleep $WAITFORFULL
+  # failsafe and full are the same as far as the monitor is concerned
+  ceph health | grep "HEALTH_ERR.*1 full osd(s)"
+  ceph --admin-daemon $CEPH_OUT_DIR/osd.0.asok injectfull full
+  sleep $WAITFORFULL
+  ceph health | grep "HEALTH_ERR.*2 full osd(s)"
+  ceph health detail | grep "osd.0 is full at.*%"
+  ceph health detail | grep "osd.2 is full at.*%"
+  ceph health detail | grep "osd.1 is backfill full at.*%"
+  ceph --admin-daemon $CEPH_OUT_DIR/osd.0.asok injectfull none
+  ceph --admin-daemon $CEPH_OUT_DIR/osd.1.asok injectfull none
+  ceph --admin-daemon $CEPH_OUT_DIR/osd.2.asok injectfull none
+  sleep $WAITFORFULL
+  ceph health | grep HEALTH_OK
+
   ceph pg stat | grep 'pgs:'
   ceph pg 0.0 query
   ceph tell 0.0 query
diff --git a/qa/workunits/rest/test.py b/qa/workunits/rest/test.py
index 1208f85b907..7bbb6f3ccae 100755
--- a/qa/workunits/rest/test.py
+++ b/qa/workunits/rest/test.py
@@ -359,10 +359,14 @@ if __name__ == '__main__':
     r = expect('osd/dump', 'GET', 200, 'json', JSONHDR)
     assert(float(r.myjson['output']['full_ratio']) == 0.90)
     expect('osd/set-full-ratio?ratio=0.95', 'PUT', 200, '')
+    expect('osd/set-backfillfull-ratio?ratio=0.88', 'PUT', 200, '')
+    r = expect('osd/dump', 'GET', 200, 'json', JSONHDR)
+    assert(float(r.myjson['output']['backfillfull_ratio']) == 0.88)
+    expect('osd/set-backfillfull-ratio?ratio=0.90', 'PUT', 200, '')
     expect('osd/set-nearfull-ratio?ratio=0.90', 'PUT', 200, '')
     r = expect('osd/dump', 'GET', 200, 'json', JSONHDR)
     assert(float(r.myjson['output']['nearfull_ratio']) == 0.90)
-    expect('osd/set-full-ratio?ratio=0.85', 'PUT', 200, '')
+    expect('osd/set-nearfull-ratio?ratio=0.85', 'PUT', 200, '')
 
     r = expect('pg/stat', 'GET', 200, 'json', JSONHDR)
     assert('num_pgs' in r.myjson['output'])
diff --git a/src/common/ceph_strings.cc b/src/common/ceph_strings.cc
index 462dd6db249..1fec2f7b0a1 100644
--- a/src/common/ceph_strings.cc
+++ b/src/common/ceph_strings.cc
@@ -42,6 +42,8 @@ const char *ceph_osd_state_name(int s)
 		return "full";
 	case CEPH_OSD_NEARFULL:
 		return "nearfull";
+	case CEPH_OSD_BACKFILLFULL:
+		return "backfillfull";
 	default:
 		return "???";
 	}
diff --git a/src/common/config_opts.h b/src/common/config_opts.h
index ae6fdf2ccb3..c742ae6fa20 100644
--- a/src/common/config_opts.h
+++ b/src/common/config_opts.h
@@ -308,6 +308,7 @@ OPTION(mon_pg_warn_min_pool_objects, OPT_INT, 1000) // do not warn on pools below this object #
 OPTION(mon_pg_check_down_all_threshold, OPT_FLOAT, .5) // threshold of down osds after which we check all pgs
 OPTION(mon_cache_target_full_warn_ratio, OPT_FLOAT, .66) // position between pool cache_target_full and max where we start warning
 OPTION(mon_osd_full_ratio, OPT_FLOAT, .95) // what % full makes an OSD "full"
+OPTION(mon_osd_backfillfull_ratio, OPT_FLOAT, .90) // what % full makes an OSD backfill full (backfill halted)
 OPTION(mon_osd_nearfull_ratio, OPT_FLOAT, .85) // what % full makes an OSD near full
 OPTION(mon_allow_pool_delete, OPT_BOOL, false) // allow pool deletion
 OPTION(mon_globalid_prealloc, OPT_U32, 10000) // how many globalids to prealloc
@@ -626,11 +627,11 @@ OPTION(osd_max_backfills, OPT_U64, 1)
 // Minimum recovery priority (255 = max, smaller = lower)
 OPTION(osd_min_recovery_priority, OPT_INT, 0)
 
-// Refuse backfills when OSD full ratio is above this value
-OPTION(osd_backfill_full_ratio, OPT_FLOAT, 0.85)
-
 // Seconds to wait before retrying refused backfills
-OPTION(osd_backfill_retry_interval, OPT_DOUBLE, 10.0)
+OPTION(osd_backfill_retry_interval, OPT_DOUBLE, 30.0)
+
+// Seconds to wait before retrying refused recovery
+OPTION(osd_recovery_retry_interval, OPT_DOUBLE, 30.0)
 
 // max agent flush ops
 OPTION(osd_agent_max_ops, OPT_INT, 4)
@@ -742,7 +743,6 @@ OPTION(osd_op_pq_min_cost, OPT_U64, 65536)
 OPTION(osd_disk_threads, OPT_INT, 1)
 OPTION(osd_disk_thread_ioprio_class, OPT_STR, "") // rt realtime be best effort idle
 OPTION(osd_disk_thread_ioprio_priority, OPT_INT, -1) // 0-7
-OPTION(osd_recovery_threads, OPT_INT, 1)
 OPTION(osd_recover_clone_overlap, OPT_BOOL, true)   // preserve clone_overlap during recovery/migration
 OPTION(osd_op_num_threads_per_shard, OPT_INT, 2)
 OPTION(osd_op_num_shards, OPT_INT, 5)
@@ -871,6 +871,7 @@ OPTION(osd_debug_skip_full_check_in_backfill_reservation, OPT_BOOL, false)
 OPTION(osd_debug_reject_backfill_probability, OPT_DOUBLE, 0)
 OPTION(osd_debug_inject_copyfrom_error, OPT_BOOL, false) // inject failure during copyfrom completion
 OPTION(osd_debug_misdirected_ops, OPT_BOOL, false)
+OPTION(osd_debug_skip_full_check_in_recovery, OPT_BOOL, false)
 OPTION(osd_enxio_on_misdirected_op, OPT_BOOL, false)
 OPTION(osd_debug_verify_cached_snaps, OPT_BOOL, false)
 OPTION(osd_enable_op_tracker, OPT_BOOL, true) // enable/disable OSD op tracking
diff --git a/src/include/rados.h b/src/include/rados.h
index 0cc9380d824..cc4402ff0ed 100644
--- a/src/include/rados.h
+++ b/src/include/rados.h
@@ -116,6 +116,7 @@ struct ceph_eversion {
 #define CEPH_OSD_NEW      (1<<3)  /* osd is new, never marked in */
 #define CEPH_OSD_FULL     (1<<4)  /* osd is at or above full threshold */
 #define CEPH_OSD_NEARFULL (1<<5)  /* osd is at or above nearfull threshold */
+#define CEPH_OSD_BACKFILLFULL (1<<6) /* osd is at or above backfillfull threshold */
 
 extern const char *ceph_osd_state_name(int s);
 
diff --git a/src/mon/MonCommands.h b/src/mon/MonCommands.h
index 4e816890820..d1b09e66024 100644
--- a/src/mon/MonCommands.h
+++ b/src/mon/MonCommands.h
@@ -592,6 +592,10 @@ COMMAND("osd set-full-ratio " \
 	"name=ratio,type=CephFloat,range=0.0|1.0", \
 	"set usage ratio at which OSDs are marked full",
 	"osd", "rw", "cli,rest")
+COMMAND("osd set-backfillfull-ratio " \
+	"name=ratio,type=CephFloat,range=0.0|1.0", \
+	"set usage ratio at which OSDs are marked too full to backfill",
+	"osd", "rw", "cli,rest")
 COMMAND("osd set-nearfull-ratio " \
 	"name=ratio,type=CephFloat,range=0.0|1.0", \
 	"set usage ratio at which OSDs are marked near-full",
 	"osd", "rw", "cli,rest")
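The monitor changes that follow enforce an ordering invariant, nearfull_ratio <= backfillfull_ratio <= full_ratio <= osd_failsafe_full_ratio, and raise HEALTH_ERR when it is violated. Abbreviated from the cephtool test above, an operator would see something like::

    ceph osd set-nearfull-ratio .913    # above backfillfull_ratio (0.912)
    ceph health
    HEALTH_ERR Full ratio(s) out of order
    ceph health detail
    HEALTH_ERR backfill_ratio (0.912) < nearfull_ratio (0.913), increased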
diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
index 7daca9c887d..0b059271fbb 100644
--- a/src/mon/OSDMonitor.cc
+++ b/src/mon/OSDMonitor.cc
@@ -164,7 +164,11 @@ void OSDMonitor::create_initial()
   if (!g_conf->mon_debug_no_require_luminous) {
     newmap.set_flag(CEPH_OSDMAP_REQUIRE_LUMINOUS);
     newmap.full_ratio = g_conf->mon_osd_full_ratio;
+    if (newmap.full_ratio > 1.0) newmap.full_ratio /= 100;
+    newmap.backfillfull_ratio = g_conf->mon_osd_backfillfull_ratio;
+    if (newmap.backfillfull_ratio > 1.0) newmap.backfillfull_ratio /= 100;
     newmap.nearfull_ratio = g_conf->mon_osd_nearfull_ratio;
+    if (newmap.nearfull_ratio > 1.0) newmap.nearfull_ratio /= 100;
   }
 
   // encode into pending incremental
@@ -784,8 +788,17 @@ void OSDMonitor::create_pending()
   OSDMap::clean_temps(g_ceph_context, osdmap, &pending_inc);
   dout(10) << "create_pending did clean_temps" << dendl;
 
+  // On upgrade OSDMap has new field set by mon_osd_backfillfull_ratio config
+  // instead of osd_backfill_full_ratio config
+  if (osdmap.backfillfull_ratio <= 0) {
+    pending_inc.new_backfillfull_ratio = g_conf->mon_osd_backfillfull_ratio;
+    if (pending_inc.new_backfillfull_ratio > 1.0)
+      pending_inc.new_backfillfull_ratio /= 100;
+    dout(1) << __func__ << " setting backfillfull_ratio = "
+            << pending_inc.new_backfillfull_ratio << dendl;
+  }
   if (!osdmap.test_flag(CEPH_OSDMAP_REQUIRE_LUMINOUS)) {
-    // transition nearfull ratios from PGMap to OSDMap (on upgrade)
+    // transition full ratios from PGMap to OSDMap (on upgrade)
     PGMap *pg_map = &mon->pgmon()->pg_map;
     if (osdmap.full_ratio != pg_map->full_ratio) {
       dout(10) << __func__ << " full_ratio " << osdmap.full_ratio
@@ -800,14 +813,18 @@ void OSDMonitor::create_pending()
   } else {
     // safety check (this shouldn't really happen)
     if (osdmap.full_ratio <= 0) {
-      dout(1) << __func__ << " setting full_ratio = "
-              << g_conf->mon_osd_full_ratio << dendl;
       pending_inc.new_full_ratio = g_conf->mon_osd_full_ratio;
+      if (pending_inc.new_full_ratio > 1.0)
+        pending_inc.new_full_ratio /= 100;
+      dout(1) << __func__ << " setting full_ratio = "
+              << pending_inc.new_full_ratio << dendl;
     }
     if (osdmap.nearfull_ratio <= 0) {
-      dout(1) << __func__ << " setting nearfull_ratio = "
-              << g_conf->mon_osd_nearfull_ratio << dendl;
       pending_inc.new_nearfull_ratio = g_conf->mon_osd_nearfull_ratio;
+      if (pending_inc.new_nearfull_ratio > 1.0)
+        pending_inc.new_nearfull_ratio /= 100;
+      dout(1) << __func__ << " setting nearfull_ratio = "
+              << pending_inc.new_nearfull_ratio << dendl;
     }
   }
 }
@@ -1048,8 +1065,8 @@ void OSDMonitor::encode_pending(MonitorDBStore::TransactionRef t)
   tmp.apply_incremental(pending_inc);
 
   if (tmp.test_flag(CEPH_OSDMAP_REQUIRE_LUMINOUS)) {
-    int full, nearfull;
-    tmp.count_full_nearfull_osds(&full, &nearfull);
+    int full, backfill, nearfull;
+    tmp.count_full_nearfull_osds(&full, &backfill, &nearfull);
     if (full > 0) {
       if (!tmp.test_flag(CEPH_OSDMAP_FULL)) {
         dout(10) << __func__ << " setting full flag" << dendl;
@@ -2287,7 +2304,7 @@ bool OSDMonitor::preprocess_full(MonOpRequestRef op)
   MOSDFull *m = static_cast<MOSDFull*>(op->get_req());
   int from = m->get_orig_source().num();
   set<string> state;
-  unsigned mask = CEPH_OSD_NEARFULL | CEPH_OSD_FULL;
+  unsigned mask = CEPH_OSD_NEARFULL | CEPH_OSD_BACKFILLFULL | CEPH_OSD_FULL;
 
   // check permissions, ignore if failed
   MonSession *session = m->get_session();
@@ -2337,7 +2354,7 @@ bool OSDMonitor::prepare_full(MonOpRequestRef op)
   const MOSDFull *m = static_cast<MOSDFull*>(op->get_req());
   const int from = m->get_orig_source().num();
 
-  const unsigned mask = CEPH_OSD_NEARFULL | CEPH_OSD_FULL;
+  const unsigned mask = CEPH_OSD_NEARFULL | CEPH_OSD_BACKFILLFULL | CEPH_OSD_FULL;
   const unsigned want_state = m->state & mask;  // safety first
 
   unsigned cur_state = osdmap.get_state(from);
@@ -3342,18 +3359,83 @@ void OSDMonitor::get_health(list<pair<health_status_t,string> >& summary,
     }
 
     if (osdmap.test_flag(CEPH_OSDMAP_REQUIRE_LUMINOUS)) {
-      int full, nearfull;
-      osdmap.count_full_nearfull_osds(&full, &nearfull);
-      if (full > 0) {
+      // An OSD could configure the failsafe ratio to something different,
+      // but for now assume it is the same here.
+      float fsr = g_conf->osd_failsafe_full_ratio;
+      if (fsr > 1.0) fsr /= 100;
+      float fr = osdmap.get_full_ratio();
+      float br = osdmap.get_backfillfull_ratio();
+      float nr = osdmap.get_nearfull_ratio();
+
+      bool out_of_order = false;
+      // These checks correspond to how OSDService::check_full_status() in an OSD
+      // handles the improper setting of these values.
+      if (br < nr) {
+        out_of_order = true;
+        if (detail) {
+          ostringstream ss;
+          ss << "backfill_ratio (" << br << ") < nearfull_ratio (" << nr << "), increased";
+          detail->push_back(make_pair(HEALTH_ERR, ss.str()));
+        }
+        br = nr;
+      }
+      if (fr < br) {
+        out_of_order = true;
+        if (detail) {
+          ostringstream ss;
+          ss << "full_ratio (" << fr << ") < backfillfull_ratio (" << br << "), increased";
+          detail->push_back(make_pair(HEALTH_ERR, ss.str()));
+        }
+        fr = br;
+      }
+      if (fsr < fr) {
+        out_of_order = true;
+        if (detail) {
+          ostringstream ss;
+          ss << "osd_failsafe_full_ratio (" << fsr << ") < full_ratio (" << fr << "), increased";
+          detail->push_back(make_pair(HEALTH_ERR, ss.str()));
+        }
+      }
+      if (out_of_order) {
         ostringstream ss;
-        ss << full << " full osd(s)";
+        ss << "Full ratio(s) out of order";
         summary.push_back(make_pair(HEALTH_ERR, ss.str()));
       }
-      if (nearfull > 0) {
+
+      map<int, float> full, backfillfull, nearfull;
+      osdmap.get_full_osd_util(mon->pgmon()->pg_map.osd_stat, &full, &backfillfull, &nearfull);
+      if (full.size()) {
         ostringstream ss;
-        ss << nearfull << " nearfull osd(s)";
+        ss << full.size() << " full osd(s)";
+        summary.push_back(make_pair(HEALTH_ERR, ss.str()));
+      }
+      if (backfillfull.size()) {
+        ostringstream ss;
+        ss << backfillfull.size() << " backfillfull osd(s)";
         summary.push_back(make_pair(HEALTH_WARN, ss.str()));
       }
+      if (nearfull.size()) {
+        ostringstream ss;
+        ss << nearfull.size() << " nearfull osd(s)";
+        summary.push_back(make_pair(HEALTH_WARN, ss.str()));
+      }
+      if (detail) {
+        for (auto& i: full) {
+          ostringstream ss;
+          ss << "osd." << i.first << " is full at " << roundf(i.second * 100) << "%";
+          detail->push_back(make_pair(HEALTH_ERR, ss.str()));
+        }
+        for (auto& i: backfillfull) {
+          ostringstream ss;
+          ss << "osd." << i.first << " is backfill full at " << roundf(i.second * 100) << "%";
+          detail->push_back(make_pair(HEALTH_WARN, ss.str()));
+        }
+        for (auto& i: nearfull) {
+          ostringstream ss;
+          ss << "osd." << i.first << " is near full at " << roundf(i.second * 100) << "%";
+          detail->push_back(make_pair(HEALTH_WARN, ss.str()));
+        }
+      }
     }
     // note: we leave it to ceph-mgr to generate details health warnings
     // with actual osd utilizations
@@ -6929,6 +7011,7 @@ bool OSDMonitor::prepare_command_impl(MonOpRequestRef op,
       return true;
 
   } else if (prefix == "osd set-full-ratio" ||
+             prefix == "osd set-backfillfull-ratio" ||
              prefix == "osd set-nearfull-ratio") {
     if (!osdmap.test_flag(CEPH_OSDMAP_REQUIRE_LUMINOUS)) {
       ss << "you must complete the upgrade and set require_luminous_osds before"
@@ -6945,6 +7028,8 @@ bool OSDMonitor::prepare_command_impl(MonOpRequestRef op,
     }
     if (prefix == "osd set-full-ratio")
      pending_inc.new_full_ratio = n;
+    else if (prefix == "osd set-backfillfull-ratio")
+      pending_inc.new_backfillfull_ratio = n;
     else if (prefix == "osd set-nearfull-ratio")
       pending_inc.new_nearfull_ratio = n;
     ss << prefix << " " << n;
diff --git a/src/mon/PGMap.cc b/src/mon/PGMap.cc
index ee263d3f792..68f4879a538 100644
--- a/src/mon/PGMap.cc
+++ b/src/mon/PGMap.cc
@@ -1878,6 +1878,17 @@ int64_t PGMap::get_rule_avail(const OSDMap& osdmap, int ruleno) const
     return 0;
   }
 
+  float fratio;
+  if (osdmap.test_flag(CEPH_OSDMAP_REQUIRE_LUMINOUS) && osdmap.get_full_ratio() > 0) {
+    fratio = osdmap.get_full_ratio();
+  } else if (full_ratio > 0) {
+    fratio = full_ratio;
+  } else {
+    // this shouldn't really happen
+    fratio = g_conf->mon_osd_full_ratio;
+    if (fratio > 1.0) fratio /= 100;
+  }
+
   int64_t min = -1;
   for (map<int,float>::iterator p = wm.begin(); p != wm.end(); ++p) {
     ceph::unordered_map<int32_t,osd_stat_t>::const_iterator osd_info =
@@ -1892,7 +1903,7 @@ int64_t PGMap::get_rule_avail(const OSDMap& osdmap, int ruleno) const
       continue;
     }
     double unusable = (double)osd_info->second.kb *
-      (1.0 - g_conf->mon_osd_full_ratio);
+      (1.0 - fratio);
     double avail = MAX(0.0, (double)osd_info->second.kb_avail - unusable);
     avail *= 1024.0;
     int64_t proj = (int64_t)(avail / (double)p->second);
diff --git a/src/mon/PGMonitor.cc b/src/mon/PGMonitor.cc
index 6669ffcd4b3..477dabce4e0 100644
--- a/src/mon/PGMonitor.cc
+++ b/src/mon/PGMonitor.cc
@@ -1316,6 +1316,8 @@ void PGMonitor::get_health(list<pair<health_status_t,string> >& summary,
         note["backfilling"] += p->second;
       if (p->first & PG_STATE_BACKFILL_TOOFULL)
         note["backfill_toofull"] += p->second;
+      if (p->first & PG_STATE_RECOVERY_TOOFULL)
+        note["recovery_toofull"] += p->second;
     }
 
     ceph::unordered_map<pg_t, pg_stat_t> stuck_pgs;
@@ -1403,6 +1405,7 @@ void PGMonitor::get_health(list<pair<health_status_t,string> >& summary,
           PG_STATE_REPAIR |
           PG_STATE_RECOVERING |
          PG_STATE_RECOVERY_WAIT |
+          PG_STATE_RECOVERY_TOOFULL |
          PG_STATE_INCOMPLETE |
          PG_STATE_BACKFILL_WAIT |
          PG_STATE_BACKFILL |
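With the PGMonitor change above, pgs parked by a full OSD become visible in overall health. A hypothetical transcript (the pg id and percentages are invented for illustration)::

    ceph health detail
    HEALTH_WARN 1 pgs recovery_toofull; 1 backfillfull osd(s)
    pg 1.e is active+recovery_toofull, acting [1,3]
    osd.3 is backfill full at 91%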
diff --git a/src/os/filestore/FileStore.cc b/src/os/filestore/FileStore.cc
index 475d7f59481..3b0eabd12d5 100644
--- a/src/os/filestore/FileStore.cc
+++ b/src/os/filestore/FileStore.cc
@@ -3952,7 +3952,7 @@ void FileStore::sync_entry()
           derr << "ioctl WAIT_SYNC got " << cpp_strerror(err) << dendl;
           assert(0 == "wait_sync got error");
         }
-        dout(20) << " done waiting for checkpoint" << cid << " to complete" << dendl;
+        dout(20) << " done waiting for checkpoint " << cid << " to complete" << dendl;
       }
     } else {
diff --git a/src/osd/ECBackend.cc b/src/osd/ECBackend.cc
index 8e7b09fe38a..ea285cc4e6f 100644
--- a/src/osd/ECBackend.cc
+++ b/src/osd/ECBackend.cc
@@ -282,6 +282,11 @@ void ECBackend::handle_recovery_push(
   const PushOp &op,
   RecoveryMessages *m)
 {
+  ostringstream ss;
+  if (get_parent()->check_failsafe_full(ss)) {
+    dout(10) << __func__ << " Out of space (failsafe) processing push request: " << ss.str() << dendl;
+    ceph_abort();
+  }
   bool oneshot = op.before_progress.first && op.after_progress.data_complete;
   ghobject_t tobj;
diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index 7b1a021a228..c8031826f4e 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -255,8 +255,8 @@ OSDService::OSDService(OSD *osd) :
   watch_lock("OSDService::watch_lock"),
   watch_timer(osd->client_messenger->cct, watch_lock),
   next_notif_id(0),
-  backfill_request_lock("OSDService::backfill_request_lock"),
-  backfill_request_timer(cct, backfill_request_lock, false),
+  recovery_request_lock("OSDService::recovery_request_lock"),
+  recovery_request_timer(cct, recovery_request_lock, false),
   reserver_finisher(cct),
   local_reserver(&reserver_finisher, cct->_conf->osd_max_backfills,
                  cct->_conf->osd_min_recovery_priority),
@@ -495,8 +495,8 @@ void OSDService::shutdown()
   objecter_finisher.stop();
 
   {
-    Mutex::Locker l(backfill_request_lock);
-    backfill_request_timer.shutdown();
+    Mutex::Locker l(recovery_request_lock);
+    recovery_request_timer.shutdown();
   }
 
   {
@@ -716,13 +710,7 @@ void OSDService::check_full_status(const osd_stat_t &osd_stat)
 {
   Mutex::Locker l(full_status_lock);
 
-  // We base ratio on kb_avail rather than kb_used because they can
-  // differ significantly e.g. on btrfs volumes with a large number of
-  // chunks reserved for metadata, and for our purposes (avoiding
-  // completely filling the disk) it's far more important to know how
-  // much space is available to use than how much we've already used.
-  float ratio = ((float)(osd_stat.kb - osd_stat.kb_avail)) /
-    ((float)osd_stat.kb);
+  float ratio = ((float)osd_stat.kb_used) / ((float)osd_stat.kb);
   cur_ratio = ratio;
 
   // The OSDMap ratios take precendence.  So if the failsafe is .95 and
@@ -735,28 +729,38 @@ void OSDService::check_full_status(const osd_stat_t &osd_stat)
     return;
   }
   float nearfull_ratio = osdmap->get_nearfull_ratio();
-  float full_ratio = std::max(osdmap->get_full_ratio(), nearfull_ratio);
+  float backfillfull_ratio = std::max(osdmap->get_backfillfull_ratio(), nearfull_ratio);
+  float full_ratio = std::max(osdmap->get_full_ratio(), backfillfull_ratio);
   float failsafe_ratio = std::max(get_failsafe_full_ratio(), full_ratio);
 
   if (!osdmap->test_flag(CEPH_OSDMAP_REQUIRE_LUMINOUS)) {
     // use the failsafe for nearfull and full; the mon isn't using the
     // flags anyway because we're mid-upgrade.
     full_ratio = failsafe_ratio;
+    backfillfull_ratio = failsafe_ratio;
     nearfull_ratio = failsafe_ratio;
   } else if (full_ratio <= 0 ||
+             backfillfull_ratio <= 0 ||
             nearfull_ratio <= 0) {
-    derr << __func__ << " full_ratio or nearfull_ratio is <= 0" << dendl;
+    derr << __func__ << " full_ratio, backfillfull_ratio or nearfull_ratio is <= 0" << dendl;
     // use failsafe flag.  ick.  the monitor did something wrong or the user
     // did something stupid.
     full_ratio = failsafe_ratio;
+    backfillfull_ratio = failsafe_ratio;
     nearfull_ratio = failsafe_ratio;
   }
 
-  enum s_names new_state;
-  if (ratio > failsafe_ratio) {
+  string inject;
+  s_names new_state;
+  if (injectfull_state > NONE && injectfull) {
+    new_state = injectfull_state;
+    inject = "(Injected)";
+  } else if (ratio > failsafe_ratio) {
     new_state = FAILSAFE;
   } else if (ratio > full_ratio) {
     new_state = FULL;
+  } else if (ratio > backfillfull_ratio) {
+    new_state = BACKFILLFULL;
   } else if (ratio > nearfull_ratio) {
     new_state = NEARFULL;
   } else {
@@ -764,9 +768,11 @@ void OSDService::check_full_status(const osd_stat_t &osd_stat)
   }
   dout(20) << __func__ << " cur ratio " << ratio
            << ". nearfull_ratio " << nearfull_ratio
+           << ". backfillfull_ratio " << backfillfull_ratio
           << ", full_ratio " << full_ratio
           << ", failsafe_ratio " << failsafe_ratio
           << ", new state " << get_full_state_name(new_state)
+           << " " << inject
           << dendl;
 
   // warn
@@ -791,6 +797,8 @@ bool OSDService::need_fullness_update()
   if (osdmap->exists(whoami)) {
     if (osdmap->get_state(whoami) & CEPH_OSD_FULL) {
       cur = FULL;
+    } else if (osdmap->get_state(whoami) & CEPH_OSD_BACKFILLFULL) {
+      cur = BACKFILLFULL;
     } else if (osdmap->get_state(whoami) & CEPH_OSD_NEARFULL) {
       cur = NEARFULL;
     }
@@ -798,41 +806,80 @@ bool OSDService::need_fullness_update()
   s_names want = NONE;
   if (is_full())
     want = FULL;
+  else if (is_backfillfull())
+    want = BACKFILLFULL;
   else if (is_nearfull())
     want = NEARFULL;
   return want != cur;
 }
 
-bool OSDService::check_failsafe_full()
+bool OSDService::_check_full(s_names type, ostream &ss) const
 {
   Mutex::Locker l(full_status_lock);
-  if (cur_state == FAILSAFE)
+
+  if (injectfull && injectfull_state >= type) {
+    // injectfull is either a count of the number of times to return failsafe full
+    // or if -1 then always return full
+    if (injectfull > 0)
+      --injectfull;
+    ss << "Injected " << get_full_state_name(type) << " OSD ("
+       << (injectfull < 0 ? "set" : std::to_string(injectfull)) << ")";
     return true;
-  return false;
+  }
+
+  ss << "current usage is " << cur_ratio;
+  return cur_state >= type;
 }
 
-bool OSDService::is_nearfull()
+bool OSDService::check_failsafe_full(ostream &ss) const
+{
+  return _check_full(FAILSAFE, ss);
+}
+
+bool OSDService::check_full(ostream &ss) const
+{
+  return _check_full(FULL, ss);
+}
+
+bool OSDService::check_backfill_full(ostream &ss) const
+{
+  return _check_full(BACKFILLFULL, ss);
+}
+
+bool OSDService::check_nearfull(ostream &ss) const
+{
+  return _check_full(NEARFULL, ss);
+}
+
+bool OSDService::is_failsafe_full() const
 {
   Mutex::Locker l(full_status_lock);
-  return cur_state == NEARFULL;
+  return cur_state == FAILSAFE;
 }
 
-bool OSDService::is_full()
+bool OSDService::is_full() const
 {
   Mutex::Locker l(full_status_lock);
   return cur_state >= FULL;
 }
 
-bool OSDService::too_full_for_backfill(double *_ratio, double *_max_ratio)
+bool OSDService::is_backfillfull() const
 {
   Mutex::Locker l(full_status_lock);
-  double max_ratio;
-  max_ratio = cct->_conf->osd_backfill_full_ratio;
-  if (_ratio)
-    *_ratio = cur_ratio;
-  if (_max_ratio)
-    *_max_ratio = max_ratio;
-  return cur_ratio >= max_ratio;
+  return cur_state >= BACKFILLFULL;
+}
+
+bool OSDService::is_nearfull() const
+{
+  Mutex::Locker l(full_status_lock);
+  return cur_state >= NEARFULL;
+}
+
+void OSDService::set_injectfull(s_names type, int64_t count)
+{
+  Mutex::Locker l(full_status_lock);
+  injectfull_state = type;
+  injectfull = count;
 }
 
 void OSDService::update_osd_stat(vector<int>& hb_peers)
@@ -868,6 +915,16 @@ void OSDService::update_osd_stat(vector<int>& hb_peers)
   check_full_status(osd_stat);
 }
 
+bool OSDService::check_osdmap_full(const set<pg_shard_t> &missing_on)
+{
+  OSDMapRef osdmap = get_osdmap();
+  for (auto shard : missing_on) {
+    if (osdmap->get_state(shard.osd) & CEPH_OSD_FULL)
+      return true;
+  }
+  return false;
+}
+
 void OSDService::send_message_osd_cluster(int peer, Message *m, epoch_t from_epoch)
 {
   OSDMapRef next_map = get_nextmap_reserved();
@@ -2147,7 +2204,7 @@ int OSD::init()
   tick_timer.init();
   tick_timer_without_osd_lock.init();
-  service.backfill_request_timer.init();
+  service.recovery_request_timer.init();
 
   // mount.
   dout(2) << "mounting " << dev_path << " "
@@ -2632,6 +2689,14 @@ void OSD::final_init()
                                      test_ops_hook,
                                      "Trigger a scheduled scrub ");
   assert(r == 0);
+  r = admin_socket->register_command(
+   "injectfull",
+   "injectfull " \
+   "name=type,type=CephString,req=false " \
+   "name=count,type=CephInt,req=false ",
+   test_ops_hook,
+   "Inject a full disk (optional count times)");
+  assert(r == 0);
 }
 
 void OSD::create_logger()
@@ -2839,6 +2904,7 @@ void OSD::create_recoverystate_perf()
   rs_perf.add_time_avg(rs_down_latency, "down_latency", "Down recovery state latency");
   rs_perf.add_time_avg(rs_getmissing_latency, "getmissing_latency", "Getmissing recovery state latency");
   rs_perf.add_time_avg(rs_waitupthru_latency, "waitupthru_latency", "Waitupthru recovery state latency");
+  rs_perf.add_time_avg(rs_notrecovering_latency, "notrecovering_latency", "Notrecovering recovery state latency");
 
   recoverystate_perf = rs_perf.create_perf_counters();
   cct->get_perfcounters_collection()->add(recoverystate_perf);
@@ -4854,6 +4920,24 @@ void TestOpsSocketHook::test_ops(OSDService *service, ObjectStore *store,
       pg->unlock();
       return;
     }
+    if (command == "injectfull") {
+      int64_t count;
+      string type;
+      OSDService::s_names state;
+      cmd_getval(service->cct, cmdmap, "type", type, string("full"));
+      cmd_getval(service->cct, cmdmap, "count", count, (int64_t)-1);
+      if (type == "none" || count == 0) {
+        type = "none";
+        count = 0;
+      }
+      state = service->get_full_state(type);
+      if (state == OSDService::s_names::INVALID) {
+        ss << "Invalid type; use one of (none, nearfull, backfillfull, full, failsafe)";
+        return;
+      }
+      service->set_injectfull(state, count);
+      return;
+    }
     ss << "Internal error - command=" << command;
@@ -5185,6 +5269,8 @@ void OSD::send_full_update()
   unsigned state = 0;
   if (service.is_full()) {
     state = CEPH_OSD_FULL;
+  } else if (service.is_backfillfull()) {
+    state = CEPH_OSD_BACKFILLFULL;
   } else if (service.is_nearfull()) {
     state = CEPH_OSD_NEARFULL;
   }
== "none") + return NONE; + else if (type == "failsafe") + return FAILSAFE; + else if (type == "full") + return FULL; + else if (type == "backfillfull") + return BACKFILLFULL; + else if (type == "nearfull") + return NEARFULL; + else + return INVALID; + } double cur_ratio; ///< current utilization + mutable int64_t injectfull = 0; + s_names injectfull_state = NONE; float get_failsafe_full_ratio(); void check_full_status(const osd_stat_t &stat); + bool _check_full(s_names type, ostream &ss) const; public: - bool check_failsafe_full(); - bool is_nearfull(); - bool is_full(); - bool too_full_for_backfill(double *ratio, double *max_ratio); + bool check_failsafe_full(ostream &ss) const; + bool check_full(ostream &ss) const; + bool check_backfill_full(ostream &ss) const; + bool check_nearfull(ostream &ss) const; + bool is_failsafe_full() const; + bool is_full() const; + bool is_backfillfull() const; + bool is_nearfull() const; bool need_fullness_update(); ///< osdmap state needs update + void set_injectfull(s_names type, int64_t count); + bool check_osdmap_full(const set &missing_on); // -- epochs -- diff --git a/src/osd/OSDMap.cc b/src/osd/OSDMap.cc index ff0d194ae15..837ae9d8fe3 100644 --- a/src/osd/OSDMap.cc +++ b/src/osd/OSDMap.cc @@ -450,7 +450,7 @@ void OSDMap::Incremental::encode(bufferlist& bl, uint64_t features) const } { - uint8_t target_v = 3; + uint8_t target_v = 4; if (!HAVE_FEATURE(features, SERVER_LUMINOUS)) { target_v = 2; } @@ -470,6 +470,7 @@ void OSDMap::Incremental::encode(bufferlist& bl, uint64_t features) const if (target_v >= 3) { ::encode(new_nearfull_ratio, bl); ::encode(new_full_ratio, bl); + ::encode(new_backfillfull_ratio, bl); } ENCODE_FINISH(bl); // osd-only data } @@ -654,7 +655,7 @@ void OSDMap::Incremental::decode(bufferlist::iterator& bl) } { - DECODE_START(3, bl); // extended, osd-only data + DECODE_START(4, bl); // extended, osd-only data ::decode(new_hb_back_up, bl); ::decode(new_up_thru, bl); ::decode(new_last_clean_interval, bl); @@ -677,6 +678,11 @@ void OSDMap::Incremental::decode(bufferlist::iterator& bl) new_nearfull_ratio = -1; new_full_ratio = -1; } + if (struct_v >= 4) { + ::decode(new_backfillfull_ratio, bl); + } else { + new_backfillfull_ratio = -1; + } DECODE_FINISH(bl); // osd-only data } @@ -720,6 +726,7 @@ void OSDMap::Incremental::dump(Formatter *f) const f->dump_int("new_flags", new_flags); f->dump_float("new_full_ratio", new_full_ratio); f->dump_float("new_nearfull_ratio", new_nearfull_ratio); + f->dump_float("new_backfillfull_ratio", new_backfillfull_ratio); if (fullmap.length()) { f->open_object_section("full_map"); @@ -1022,20 +1029,57 @@ int OSDMap::calc_num_osds() return num_osd; } -void OSDMap::count_full_nearfull_osds(int *full, int *nearfull) const +void OSDMap::count_full_nearfull_osds(int *full, int *backfill, int *nearfull) const { *full = 0; + *backfill = 0; *nearfull = 0; for (int i = 0; i < max_osd; ++i) { if (exists(i) && is_up(i) && is_in(i)) { if (osd_state[i] & CEPH_OSD_FULL) ++(*full); + else if (osd_state[i] & CEPH_OSD_BACKFILLFULL) + ++(*backfill); else if (osd_state[i] & CEPH_OSD_NEARFULL) ++(*nearfull); } } } +static bool get_osd_utilization(const ceph::unordered_map &osd_stat, + int id, int64_t* kb, int64_t* kb_used, int64_t* kb_avail) { + auto p = osd_stat.find(id); + if (p == osd_stat.end()) + return false; + *kb = p->second.kb; + *kb_used = p->second.kb_used; + *kb_avail = p->second.kb_avail; + return *kb > 0; +} + +void OSDMap::get_full_osd_util(const ceph::unordered_map &osd_stat, + map *full, map *backfill, map 
*nearfull) const +{ + full->clear(); + backfill->clear(); + nearfull->clear(); + for (int i = 0; i < max_osd; ++i) { + if (exists(i) && is_up(i) && is_in(i)) { + int64_t kb, kb_used, kb_avail; + if (osd_state[i] & CEPH_OSD_FULL) { + if (get_osd_utilization(osd_stat, i, &kb, &kb_used, &kb_avail)) + full->emplace(i, (float)kb_used / (float)kb); + } else if (osd_state[i] & CEPH_OSD_BACKFILLFULL) { + if (get_osd_utilization(osd_stat, i, &kb, &kb_used, &kb_avail)) + backfill->emplace(i, (float)kb_used / (float)kb); + } else if (osd_state[i] & CEPH_OSD_NEARFULL) { + if (get_osd_utilization(osd_stat, i, &kb, &kb_used, &kb_avail)) + nearfull->emplace(i, (float)kb_used / (float)kb); + } + } + } +} + void OSDMap::get_all_osds(set& ls) const { for (int i=0; i= 0) { nearfull_ratio = inc.new_nearfull_ratio; } + if (inc.new_backfillfull_ratio >= 0) { + backfillfull_ratio = inc.new_backfillfull_ratio; + } if (inc.new_full_ratio >= 0) { full_ratio = inc.new_full_ratio; } @@ -2148,7 +2195,7 @@ void OSDMap::encode(bufferlist& bl, uint64_t features) const } { - uint8_t target_v = 2; + uint8_t target_v = 3; if (!HAVE_FEATURE(features, SERVER_LUMINOUS)) { target_v = 1; } @@ -2173,6 +2220,7 @@ void OSDMap::encode(bufferlist& bl, uint64_t features) const if (target_v >= 2) { ::encode(nearfull_ratio, bl); ::encode(full_ratio, bl); + ::encode(backfillfull_ratio, bl); } ENCODE_FINISH(bl); // osd-only data } @@ -2390,7 +2438,7 @@ void OSDMap::decode(bufferlist::iterator& bl) } { - DECODE_START(2, bl); // extended, osd-only data + DECODE_START(3, bl); // extended, osd-only data ::decode(osd_addrs->hb_back_addr, bl); ::decode(osd_info, bl); ::decode(blacklist, bl); @@ -2407,6 +2455,11 @@ void OSDMap::decode(bufferlist::iterator& bl) nearfull_ratio = 0; full_ratio = 0; } + if (struct_v >= 3) { + ::decode(backfillfull_ratio, bl); + } else { + backfillfull_ratio = 0; + } DECODE_FINISH(bl); // osd-only data } @@ -2480,6 +2533,7 @@ void OSDMap::dump(Formatter *f) const f->dump_stream("modified") << get_modified(); f->dump_string("flags", get_flag_string()); f->dump_float("full_ratio", full_ratio); + f->dump_float("backfillfull_ratio", backfillfull_ratio); f->dump_float("nearfull_ratio", nearfull_ratio); f->dump_string("cluster_snapshot", get_cluster_snapshot()); f->dump_int("pool_max", get_pool_max()); @@ -2701,6 +2755,7 @@ void OSDMap::print(ostream& out) const out << "flags " << get_flag_string() << "\n"; out << "full_ratio " << full_ratio << "\n"; + out << "backfillfull_ratio " << backfillfull_ratio << "\n"; out << "nearfull_ratio " << nearfull_ratio << "\n"; if (get_cluster_snapshot().length()) out << "cluster_snapshot " << get_cluster_snapshot() << "\n"; diff --git a/src/osd/OSDMap.h b/src/osd/OSDMap.h index eb0399edda6..525ba12fb0e 100644 --- a/src/osd/OSDMap.h +++ b/src/osd/OSDMap.h @@ -155,6 +155,7 @@ public: string cluster_snapshot; float new_nearfull_ratio = -1; + float new_backfillfull_ratio = -1; float new_full_ratio = -1; mutable bool have_crc; ///< crc values are defined @@ -254,7 +255,7 @@ private: string cluster_snapshot; bool new_blacklist_entries; - float full_ratio = 0, nearfull_ratio = 0; + float full_ratio = 0, backfillfull_ratio = 0, nearfull_ratio = 0; mutable uint64_t cached_up_osd_features; @@ -336,10 +337,15 @@ public: float get_full_ratio() const { return full_ratio; } + float get_backfillfull_ratio() const { + return backfillfull_ratio; + } float get_nearfull_ratio() const { return nearfull_ratio; } - void count_full_nearfull_osds(int *full, int *nearfull) const; + void 
count_full_nearfull_osds(int *full, int *backfill, int *nearfull) const; + void get_full_osd_util(const ceph::unordered_map &osd_stat, + map *full, map *backfill, map *nearfull) const; /***** cluster state *****/ /* osds */ diff --git a/src/osd/PG.cc b/src/osd/PG.cc index 56b9c19bed8..c5a35dbaa4a 100644 --- a/src/osd/PG.cc +++ b/src/osd/PG.cc @@ -3809,14 +3809,24 @@ void PG::reject_reservation() void PG::schedule_backfill_full_retry() { - Mutex::Locker lock(osd->backfill_request_lock); - osd->backfill_request_timer.add_event_after( + Mutex::Locker lock(osd->recovery_request_lock); + osd->recovery_request_timer.add_event_after( cct->_conf->osd_backfill_retry_interval, new QueuePeeringEvt( this, get_osdmap()->get_epoch(), RequestBackfill())); } +void PG::schedule_recovery_full_retry() +{ + Mutex::Locker lock(osd->recovery_request_lock); + osd->recovery_request_timer.add_event_after( + cct->_conf->osd_recovery_retry_interval, + new QueuePeeringEvt( + this, get_osdmap()->get_epoch(), + DoRecovery())); +} + void PG::clear_scrub_reserved() { scrubber.reserved_peers.clear(); @@ -5237,6 +5247,7 @@ void PG::start_peering_interval( state_clear(PG_STATE_PEERED); state_clear(PG_STATE_DOWN); state_clear(PG_STATE_RECOVERY_WAIT); + state_clear(PG_STATE_RECOVERY_TOOFULL); state_clear(PG_STATE_RECOVERING); peer_purged.clear(); @@ -6488,6 +6499,24 @@ void PG::RecoveryState::NotBackfilling::exit() pg->osd->recoverystate_perf->tinc(rs_notbackfilling_latency, dur); } +/*----NotRecovering------*/ +PG::RecoveryState::NotRecovering::NotRecovering(my_context ctx) + : my_base(ctx), + NamedState(context< RecoveryMachine >().pg->cct, "Started/Primary/Active/NotRecovering") +{ + context< RecoveryMachine >().log_enter(state_name); + PG *pg = context< RecoveryMachine >().pg; + pg->publish_stats_to_osd(); +} + +void PG::RecoveryState::NotRecovering::exit() +{ + context< RecoveryMachine >().log_exit(state_name, enter_time); + PG *pg = context< RecoveryMachine >().pg; + utime_t dur = ceph_clock_now() - enter_time; + pg->osd->recoverystate_perf->tinc(rs_notrecovering_latency, dur); +} + /*---RepNotRecovering----*/ PG::RecoveryState::RepNotRecovering::RepNotRecovering(my_context ctx) : my_base(ctx), @@ -6554,18 +6583,17 @@ boost::statechart::result PG::RecoveryState::RepNotRecovering::react(const RequestBackfillPrio &evt) { PG *pg = context< RecoveryMachine >().pg; - double ratio, max_ratio; + ostringstream ss; if (pg->cct->_conf->osd_debug_reject_backfill_probability > 0 && (rand()%1000 < (pg->cct->_conf->osd_debug_reject_backfill_probability*1000.0))) { ldout(pg->cct, 10) << "backfill reservation rejected: failure injection" << dendl; post_event(RemoteReservationRejected()); - } else if (pg->osd->too_full_for_backfill(&ratio, &max_ratio) && - !pg->cct->_conf->osd_debug_skip_full_check_in_backfill_reservation) { - ldout(pg->cct, 10) << "backfill reservation rejected: full ratio is " - << ratio << ", which is greater than max allowed ratio " - << max_ratio << dendl; + } else if (!pg->cct->_conf->osd_debug_skip_full_check_in_backfill_reservation && + pg->osd->check_backfill_full(ss)) { + ldout(pg->cct, 10) << "backfill reservation rejected: " + << ss.str() << dendl; post_event(RemoteReservationRejected()); } else { pg->osd->remote_reserver.request_reservation( @@ -6590,7 +6618,7 @@ PG::RecoveryState::RepWaitBackfillReserved::react(const RemoteBackfillReserved & { PG *pg = context< RecoveryMachine >().pg; - double ratio, max_ratio; + ostringstream ss; if (pg->cct->_conf->osd_debug_reject_backfill_probability > 0 && 
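Once the new field is encoded in the OSDMap, it appears in the map dump alongside the existing ratios; for example (the values depend on the cluster's configuration)::

    ceph osd dump | grep ratio
    full_ratio 0.95
    backfillfull_ratio 0.9
    nearfull_ratio 0.85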
diff --git a/src/osd/PG.cc b/src/osd/PG.cc
index 56b9c19bed8..c5a35dbaa4a 100644
--- a/src/osd/PG.cc
+++ b/src/osd/PG.cc
@@ -3809,14 +3809,24 @@ void PG::reject_reservation()
 
 void PG::schedule_backfill_full_retry()
 {
-  Mutex::Locker lock(osd->backfill_request_lock);
-  osd->backfill_request_timer.add_event_after(
+  Mutex::Locker lock(osd->recovery_request_lock);
+  osd->recovery_request_timer.add_event_after(
     cct->_conf->osd_backfill_retry_interval,
     new QueuePeeringEvt<RequestBackfill>(
       this, get_osdmap()->get_epoch(),
       RequestBackfill()));
 }
 
+void PG::schedule_recovery_full_retry()
+{
+  Mutex::Locker lock(osd->recovery_request_lock);
+  osd->recovery_request_timer.add_event_after(
+    cct->_conf->osd_recovery_retry_interval,
+    new QueuePeeringEvt<DoRecovery>(
+      this, get_osdmap()->get_epoch(),
+      DoRecovery()));
+}
+
 void PG::clear_scrub_reserved()
 {
   scrubber.reserved_peers.clear();
@@ -5237,6 +5247,7 @@ void PG::start_peering_interval(
   state_clear(PG_STATE_PEERED);
   state_clear(PG_STATE_DOWN);
   state_clear(PG_STATE_RECOVERY_WAIT);
+  state_clear(PG_STATE_RECOVERY_TOOFULL);
   state_clear(PG_STATE_RECOVERING);
 
   peer_purged.clear();
@@ -6488,6 +6499,24 @@ void PG::RecoveryState::NotBackfilling::exit()
   pg->osd->recoverystate_perf->tinc(rs_notbackfilling_latency, dur);
 }
 
+/*----NotRecovering------*/
+PG::RecoveryState::NotRecovering::NotRecovering(my_context ctx)
+  : my_base(ctx),
+    NamedState(context< RecoveryMachine >().pg->cct, "Started/Primary/Active/NotRecovering")
+{
+  context< RecoveryMachine >().log_enter(state_name);
+  PG *pg = context< RecoveryMachine >().pg;
+  pg->publish_stats_to_osd();
+}
+
+void PG::RecoveryState::NotRecovering::exit()
+{
+  context< RecoveryMachine >().log_exit(state_name, enter_time);
+  PG *pg = context< RecoveryMachine >().pg;
+  utime_t dur = ceph_clock_now() - enter_time;
+  pg->osd->recoverystate_perf->tinc(rs_notrecovering_latency, dur);
+}
+
 /*---RepNotRecovering----*/
 PG::RecoveryState::RepNotRecovering::RepNotRecovering(my_context ctx)
   : my_base(ctx),
@@ -6554,18 +6583,17 @@ boost::statechart::result
 PG::RecoveryState::RepNotRecovering::react(const RequestBackfillPrio &evt)
 {
   PG *pg = context< RecoveryMachine >().pg;
-  double ratio, max_ratio;
+  ostringstream ss;
 
   if (pg->cct->_conf->osd_debug_reject_backfill_probability > 0 &&
      (rand()%1000 < (pg->cct->_conf->osd_debug_reject_backfill_probability*1000.0))) {
     ldout(pg->cct, 10) << "backfill reservation rejected: failure injection"
                        << dendl;
     post_event(RemoteReservationRejected());
-  } else if (pg->osd->too_full_for_backfill(&ratio, &max_ratio) &&
-             !pg->cct->_conf->osd_debug_skip_full_check_in_backfill_reservation) {
-    ldout(pg->cct, 10) << "backfill reservation rejected: full ratio is "
-                       << ratio << ", which is greater than max allowed ratio "
-                       << max_ratio << dendl;
+  } else if (!pg->cct->_conf->osd_debug_skip_full_check_in_backfill_reservation &&
+             pg->osd->check_backfill_full(ss)) {
+    ldout(pg->cct, 10) << "backfill reservation rejected: "
+                       << ss.str() << dendl;
     post_event(RemoteReservationRejected());
   } else {
     pg->osd->remote_reserver.request_reservation(
@@ -6590,7 +6618,7 @@ PG::RecoveryState::RepWaitBackfillReserved::react(const RemoteBackfillReserved &evt)
 {
   PG *pg = context< RecoveryMachine >().pg;
 
-  double ratio, max_ratio;
+  ostringstream ss;
   if (pg->cct->_conf->osd_debug_reject_backfill_probability > 0 &&
      (rand()%1000 < (pg->cct->_conf->osd_debug_reject_backfill_probability*1000.0))) {
     ldout(pg->cct, 10) << "backfill reservation rejected after reservation: "
@@ -6598,11 +6626,10 @@ PG::RecoveryState::RepWaitBackfillReserved::react(const RemoteBackfillReserved &evt)
     pg->osd->remote_reserver.cancel_reservation(pg->info.pgid);
     post_event(RemoteReservationRejected());
     return discard_event();
-  } else if (pg->osd->too_full_for_backfill(&ratio, &max_ratio) &&
-             !pg->cct->_conf->osd_debug_skip_full_check_in_backfill_reservation) {
-    ldout(pg->cct, 10) << "backfill reservation rejected after reservation: full ratio is "
-                       << ratio << ", which is greater than max allowed ratio "
-                       << max_ratio << dendl;
+  } else if (!pg->cct->_conf->osd_debug_skip_full_check_in_backfill_reservation &&
+             pg->osd->check_backfill_full(ss)) {
+    ldout(pg->cct, 10) << "backfill reservation rejected after reservation: "
+                       << ss.str() << dendl;
     pg->osd->remote_reserver.cancel_reservation(pg->info.pgid);
     post_event(RemoteReservationRejected());
     return discard_event();
@@ -6673,6 +6700,15 @@ PG::RecoveryState::WaitLocalRecoveryReserved::WaitLocalRecoveryReserved(my_context ctx)
 {
   context< RecoveryMachine >().log_enter(state_name);
   PG *pg = context< RecoveryMachine >().pg;
+
+  // Make sure all nodes that are part of the recovery aren't full
+  if (!pg->cct->_conf->osd_debug_skip_full_check_in_recovery &&
+      pg->osd->check_osdmap_full(pg->actingbackfill)) {
+    post_event(RecoveryTooFull());
+    return;
+  }
+
+  pg->state_clear(PG_STATE_RECOVERY_TOOFULL);
   pg->state_set(PG_STATE_RECOVERY_WAIT);
   pg->osd->local_reserver.request_reservation(
     pg->info.pgid,
@@ -6683,6 +6719,15 @@ PG::RecoveryState::WaitLocalRecoveryReserved::WaitLocalRecoveryReserved(my_context ctx)
   pg->publish_stats_to_osd();
 }
 
+boost::statechart::result
+PG::RecoveryState::WaitLocalRecoveryReserved::react(const RecoveryTooFull &evt)
+{
+  PG *pg = context< RecoveryMachine >().pg;
+  pg->state_set(PG_STATE_RECOVERY_TOOFULL);
+  pg->schedule_recovery_full_retry();
+  return transit<NotRecovering>();
+}
+
 void PG::RecoveryState::WaitLocalRecoveryReserved::exit()
 {
   context< RecoveryMachine >().log_exit(state_name, enter_time);
@@ -6739,6 +6784,7 @@ PG::RecoveryState::Recovering::Recovering(my_context ctx)
 
   PG *pg = context< RecoveryMachine >().pg;
   pg->state_clear(PG_STATE_RECOVERY_WAIT);
+  pg->state_clear(PG_STATE_RECOVERY_TOOFULL);
   pg->state_set(PG_STATE_RECOVERING);
   pg->publish_stats_to_osd();
   pg->queue_recovery();
@@ -7187,6 +7233,7 @@ void PG::RecoveryState::Active::exit()
   pg->state_clear(PG_STATE_BACKFILL_TOOFULL);
   pg->state_clear(PG_STATE_BACKFILL_WAIT);
   pg->state_clear(PG_STATE_RECOVERY_WAIT);
+  pg->state_clear(PG_STATE_RECOVERY_TOOFULL);
   utime_t dur = ceph_clock_now() - enter_time;
   pg->osd->recoverystate_perf->tinc(rs_active_latency, dur);
   pg->agent_stop();
diff --git a/src/osd/PG.h b/src/osd/PG.h
index 4763859d66b..b2168b3ee64 100644
--- a/src/osd/PG.h
+++ b/src/osd/PG.h
@@ -1340,6 +1340,7 @@ public:
 
   void reject_reservation();
   void schedule_backfill_full_retry();
+  void schedule_recovery_full_retry();
 
   // -- recovery state --
 
@@ -1505,6 +1506,7 @@ public:
   TrivialEvent(RequestRecovery)
   TrivialEvent(RecoveryDone)
   TrivialEvent(BackfillTooFull)
+  TrivialEvent(RecoveryTooFull)
 
   TrivialEvent(AllReplicasRecovered)
   TrivialEvent(DoRecovery)
@@ -1850,6 +1852,14 @@ public:
       boost::statechart::result react(const RemoteReservationRejected& evt);
     };
 
+    struct NotRecovering : boost::statechart::state< NotRecovering, Active>, NamedState {
+      typedef boost::mpl::list<
+        boost::statechart::transition< DoRecovery, WaitLocalRecoveryReserved >
+        > reactions;
+      explicit NotRecovering(my_context ctx);
+      void exit();
+    };
+
     struct RepNotRecovering;
     struct ReplicaActive : boost::statechart::state< ReplicaActive, Started, RepNotRecovering >, NamedState {
       explicit ReplicaActive(my_context ctx);
@@ -1938,10 +1948,12 @@ public:
 
     struct WaitLocalRecoveryReserved : boost::statechart::state< WaitLocalRecoveryReserved, Active >, NamedState {
       typedef boost::mpl::list <
-        boost::statechart::transition< LocalRecoveryReserved, WaitRemoteRecoveryReserved >
+        boost::statechart::transition< LocalRecoveryReserved, WaitRemoteRecoveryReserved >,
+        boost::statechart::custom_reaction< RecoveryTooFull >
        > reactions;
       explicit WaitLocalRecoveryReserved(my_context ctx);
       void exit();
+      boost::statechart::result react(const RecoveryTooFull &evt);
     };
 
     struct Activating : boost::statechart::state< Activating, Active >, NamedState {
diff --git a/src/osd/PGBackend.h b/src/osd/PGBackend.h
index 763e02bc2b8..66bb890af01 100644
--- a/src/osd/PGBackend.h
+++ b/src/osd/PGBackend.h
@@ -261,6 +261,10 @@ typedef ceph::shared_ptr<const OSDMap> OSDMapRef;
 
      virtual LogClientTemp clog_error() = 0;
 
+     virtual bool check_failsafe_full(ostream &ss) = 0;
+
+     virtual bool check_osdmap_full(const set<pg_shard_t> &missing_on) = 0;
+
      virtual ~Listener() {}
    };
    Listener *parent;
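When recovery is refused this way, the primary transitions to NotRecovering and re-posts ``DoRecovery`` after ``osd_recovery_retry_interval`` (30 seconds by default, per the config_opts.h change above). For a test cluster the retry can be shortened at runtime; the value here is illustrative::

    ceph tell osd.* injectargs '--osd_recovery_retry_interval 5'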
diff --git a/src/osd/PrimaryLogPG.cc b/src/osd/PrimaryLogPG.cc
index 2f1599963c9..71846caabf8 100644
--- a/src/osd/PrimaryLogPG.cc
+++ b/src/osd/PrimaryLogPG.cc
@@ -1888,8 +1888,13 @@ void PrimaryLogPG::do_op(OpRequestRef& op)
             << *m << dendl;
     return;
   }
-  if (!(m->get_source().is_mds()) && osd->check_failsafe_full() && write_ordered) {
+  // mds should have stopped writing before this point.
+  // We can't allow OSD to become non-startable even if mds
+  // could be writing as part of file removals.
+  ostringstream ss;
+  if (write_ordered && osd->check_failsafe_full(ss)) {
     dout(10) << __func__ << " fail-safe full check failed, dropping request"
+             << ss.str()
             << dendl;
     return;
   }
@@ -3328,10 +3333,9 @@ void PrimaryLogPG::do_scan(
   switch (m->op) {
   case MOSDPGScan::OP_SCAN_GET_DIGEST:
     {
-      double ratio, full_ratio;
-      if (osd->too_full_for_backfill(&ratio, &full_ratio)) {
-        dout(1) << __func__ << ": Canceling backfill, current usage is "
-                << ratio << ", which exceeds " << full_ratio << dendl;
+      ostringstream ss;
+      if (osd->check_backfill_full(ss)) {
+        dout(1) << __func__ << ": Canceling backfill, " << ss.str() << dendl;
         queue_peering_event(
           CephPeeringEvtRef(
             std::make_shared<CephPeeringEvt>(
@@ -13027,6 +13031,11 @@ void PrimaryLogPG::_scrub_finish()
   }
 }
 
+bool PrimaryLogPG::check_osdmap_full(const set<pg_shard_t> &missing_on)
+{
+  return osd->check_osdmap_full(missing_on);
+}
+
 /*---SnapTrimmer Logging---*/
 #undef dout_prefix
 #define dout_prefix *_dout << pg->gen_prefix()
@@ -13268,6 +13277,10 @@ int PrimaryLogPG::getattrs_maybe_cache(
   return r;
 }
 
+bool PrimaryLogPG::check_failsafe_full(ostream &ss) {
+  return osd->check_failsafe_full(ss);
+}
+
 void intrusive_ptr_add_ref(PrimaryLogPG *pg) { pg->get("intptr"); }
 void intrusive_ptr_release(PrimaryLogPG *pg) { pg->put("intptr"); }
diff --git a/src/osd/PrimaryLogPG.h b/src/osd/PrimaryLogPG.h
index 3d92e3b968c..a7e812fe8da 100644
--- a/src/osd/PrimaryLogPG.h
+++ b/src/osd/PrimaryLogPG.h
@@ -1731,6 +1731,8 @@ public:
   void on_flushed() override;
   void on_removal(ObjectStore::Transaction *t) override;
   void on_shutdown() override;
+  bool check_failsafe_full(ostream &ss) override;
+  bool check_osdmap_full(const set<pg_shard_t> &missing_on) override;
 
   // attr cache handling
   void setattr_maybe_cache(
diff --git a/src/osd/ReplicatedBackend.cc b/src/osd/ReplicatedBackend.cc
index b897eddf815..1364cc62770 100644
--- a/src/osd/ReplicatedBackend.cc
+++ b/src/osd/ReplicatedBackend.cc
@@ -807,6 +807,11 @@ void ReplicatedBackend::_do_push(OpRequestRef op)
   vector<PushReplyOp> replies;
   ObjectStore::Transaction t;
 
+  ostringstream ss;
+  if (get_parent()->check_failsafe_full(ss)) {
+    dout(10) << __func__ << " Out of space (failsafe) processing push request: " << ss.str() << dendl;
+    ceph_abort();
+  }
   for (vector<PushOp>::const_iterator i = m->pushes.begin();
        i != m->pushes.end();
        ++i) {
@@ -862,6 +867,13 @@ void ReplicatedBackend::_do_pull_response(OpRequestRef op)
   op->mark_started();
 
   vector<PullOp> replies(1);
+
+  ostringstream ss;
+  if (get_parent()->check_failsafe_full(ss)) {
+    dout(10) << __func__ << " Out of space (failsafe) processing pull response (push): " << ss.str() << dendl;
+    ceph_abort();
+  }
+
   ObjectStore::Transaction t;
   list<pull_complete_info> to_continue;
   for (vector<PushOp>::const_iterator i = m->pushes.begin();
diff --git a/src/osd/osd_types.cc b/src/osd/osd_types.cc
index 2cce17e029f..145a0d76831 100644
--- a/src/osd/osd_types.cc
+++ b/src/osd/osd_types.cc
@@ -789,6 +789,8 @@ std::string pg_state_string(int state)
     oss << "clean+";
   if (state & PG_STATE_RECOVERY_WAIT)
     oss << "recovery_wait+";
+  if (state & PG_STATE_RECOVERY_TOOFULL)
+    oss << "recovery_toofull+";
   if (state & PG_STATE_RECOVERING)
     oss << "recovering+";
   if (state & PG_STATE_DOWN)
@@ -869,6 +871,8 @@ int pg_string_state(const std::string& state)
     type = PG_STATE_BACKFILL_TOOFULL;
   else if (state == "recovery_wait")
     type = PG_STATE_RECOVERY_WAIT;
+  else if (state == "recovery_toofull")
+    type = PG_STATE_RECOVERY_TOOFULL;
   else if (state == "undersized")
     type = PG_STATE_UNDERSIZED;
   else if (state == "activating")
diff --git a/src/osd/osd_types.h b/src/osd/osd_types.h
index 0d2297582f5..1c4e4c65a6c 100644
--- a/src/osd/osd_types.h
+++ b/src/osd/osd_types.h
@@ -971,6 +971,7 @@ inline ostream& operator<<(ostream& out, const osd_stat_t& s) {
 #define PG_STATE_PEERED       (1<<25) // peered, cannot go active, can recover
 #define PG_STATE_SNAPTRIM     (1<<26) // trimming snaps
 #define PG_STATE_SNAPTRIM_WAIT (1<<27) // queued to trim snaps
+#define PG_STATE_RECOVERY_TOOFULL (1<<28) // recovery can't proceed: too full
 
 std::string pg_state_string(int state);
 std::string pg_vector_string(const vector<int32_t> &a);
diff --git a/src/test/cli/osdmaptool/clobber.t b/src/test/cli/osdmaptool/clobber.t
index 275fefcc737..dd7e1756104 100644
--- a/src/test/cli/osdmaptool/clobber.t
+++ b/src/test/cli/osdmaptool/clobber.t
@@ -20,6 +20,7 @@
   modified \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ (re)
   flags 
   full_ratio 0
+  backfillfull_ratio 0
   nearfull_ratio 0
   pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 192 pgp_num 192 last_change 0 flags hashpspool stripe_width 0
 
@@ -43,6 +44,7 @@
   modified \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ (re)
   flags 
   full_ratio 0
+  backfillfull_ratio 0
   nearfull_ratio 0
   pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 0 flags hashpspool stripe_width 0
 
diff --git a/src/test/cli/osdmaptool/create-print.t b/src/test/cli/osdmaptool/create-print.t
index e619f7206e9..32468a4a6fa 100644
--- a/src/test/cli/osdmaptool/create-print.t
+++ b/src/test/cli/osdmaptool/create-print.t
@@ -77,6 +77,7 @@
   modified \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ (re)
   flags 
   full_ratio 0
+  backfillfull_ratio 0
   nearfull_ratio 0
   pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 192 pgp_num 192 last_change 0 flags hashpspool stripe_width 0
 
diff --git a/src/test/cli/osdmaptool/create-racks.t b/src/test/cli/osdmaptool/create-racks.t
index 19006986f68..0759698127d 100644
--- a/src/test/cli/osdmaptool/create-racks.t
+++ b/src/test/cli/osdmaptool/create-racks.t
@@ -790,6 +790,7 @@
   modified \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ (re)
   flags 
   full_ratio 0
+  backfillfull_ratio 0
   nearfull_ratio 0
   pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 15296 pgp_num 15296 last_change 0 flags hashpspool stripe_width 0
 
diff --git a/src/test/pybind/test_ceph_argparse.py b/src/test/pybind/test_ceph_argparse.py
index 0608655b587..0c9cc7524c5 100755
--- a/src/test/pybind/test_ceph_argparse.py
+++ b/src/test/pybind/test_ceph_argparse.py
@@ -183,21 +183,6 @@ class TestPG(TestArgparse):
     def test_force_create_pg(self):
         self.one_pgid('force_create_pg')
 
-    def set_ratio(self, command):
-        self.assert_valid_command(['pg',
-                                   command,
-                                   '0.0'])
-        assert_equal({}, validate_command(sigdict, ['pg', command]))
-        assert_equal({}, validate_command(sigdict, ['pg',
-                                                    command,
-                                                    '2.0']))
-
-    def test_set_full_ratio(self):
-        self.set_ratio('set_full_ratio')
-
-    def test_set_nearfull_ratio(self):
-        self.set_ratio('set_nearfull_ratio')
-
 
 class TestAuth(TestArgparse):
 
@@ -1153,6 +1138,24 @@ class TestOSD(TestArgparse):
                                                     'poolname',
                                                     'toomany']))
 
+    def set_ratio(self, command):
+        self.assert_valid_command(['osd',
+                                   command,
+                                   '0.0'])
+        assert_equal({}, validate_command(sigdict, ['osd', command]))
+        assert_equal({}, validate_command(sigdict, ['osd',
+                                                    command,
+                                                    '2.0']))
+
+    def test_set_full_ratio(self):
+        self.set_ratio('set-full-ratio')
+
+    def test_set_backfillfull_ratio(self):
+        self.set_ratio('set-backfillfull-ratio')
+
+    def test_set_nearfull_ratio(self):
+        self.set_ratio('set-nearfull-ratio')
+
 
 class TestConfigKey(TestArgparse):
 
diff --git a/src/tools/ceph_monstore_tool.cc b/src/tools/ceph_monstore_tool.cc
index 874a4f0583f..8c941443d81 100644
--- a/src/tools/ceph_monstore_tool.cc
+++ b/src/tools/ceph_monstore_tool.cc
@@ -654,6 +654,14 @@ static int update_pgmap_meta(MonitorDBStore& st)
     ::encode(full_ratio, bl);
     t->put(prefix, "full_ratio", bl);
   }
+  {
+    auto backfillfull_ratio = g_ceph_context->_conf->mon_osd_backfillfull_ratio;
+    if (backfillfull_ratio > 1.0)
+      backfillfull_ratio /= 100.0;
+    bufferlist bl;
+    ::encode(backfillfull_ratio, bl);
+    t->put(prefix, "backfillfull_ratio", bl);
+  }
   {
     auto nearfull_ratio = g_ceph_context->_conf->mon_osd_nearfull_ratio;
     if (nearfull_ratio > 1.0)
diff --git a/src/tools/ceph_objectstore_tool.cc b/src/tools/ceph_objectstore_tool.cc
index 6b173f204d0..f9bf85a5d63 100644
--- a/src/tools/ceph_objectstore_tool.cc
+++ b/src/tools/ceph_objectstore_tool.cc
@@ -2906,7 +2906,7 @@ int main(int argc, char **argv)
         throw std::runtime_error(ss.str());
       }
       vector<json_spirit::mValue>::iterator i = array.begin();
-      //if (i == array.end() || i->type() != json_spirit::str_type) {
+      assert(i != array.end());
       if (i->type() != json_spirit::str_type) {
         ss << "Object '" << object << "' must be a JSON array with the first element a string";