doc/rados/operations/health-checks: document DEVICE_HEALTH* messages

Signed-off-by: Sage Weil <sage@redhat.com>
Sage Weil 2018-07-31 09:22:42 -05:00
parent ccdfcc7e72
commit 7ab8675fdf


@@ -251,6 +251,72 @@ You can either raise the pool quota with::
or delete some existing data to reduce utilization.
Device health
-------------
DEVICE_HEALTH
_____________
One or more devices are expected to fail soon, where the warning
threshold is controlled by the ``mgr/devicehealth/warn_threshold``
config option.
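The threshold is compared against each device's predicted remaining
life. For example, assuming the option takes a value in seconds (the
number below is illustrative only), the warning horizon could be
adjusted with::

  ceph config set mgr mgr/devicehealth/warn_threshold 604800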
This warning only applies to OSDs that are currently marked "in", so
the expected response to this failure is to mark the device "out" so
that data is migrated off of it, and then to remove the hardware from
the system. Note that marking the device "out" is normally done
automatically if ``mgr/devicehealth/self_heal`` is enabled, based on
the ``mgr/devicehealth/mark_out_threshold``.
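If self healing is disabled, the affected OSD can be marked "out"
manually; for example, assuming the failing device backs ``osd.5``
(an illustrative ID)::

  ceph osd out osd.5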
Device health can be checked with::

  ceph device info <device-id>
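The ``<device-id>`` can be found by listing the devices known to the
cluster, either globally or for a single daemon or host::

  ceph device ls
  ceph device ls-by-daemon <daemon>
  ceph device ls-by-host <host>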
Device life expectancy is set by a prediction model run by
the mgr or by an external tool via the command::

  ceph device set-life-expectancy <device-id> <from> <to>
You can change the stored life expectancy manually, but that usually
doesn't accomplish anything: whatever tool originally set it will
probably set it again, and changing the stored value does not affect
the actual health of the hardware device.
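For example, to record that a device is expected to fail within a
given date window, or to clear a stored prediction (the device ID
below is hypothetical)::

  ceph device set-life-expectancy SEAGATE_ST1000NM_XYZ123 2018-08-01 2018-09-01
  ceph device rm-life-expectancy SEAGATE_ST1000NM_XYZ123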
DEVICE_HEALTH_IN_USE
____________________
One or more devices are expected to fail soon and have been marked
"out" of the cluster based on ``mgr/devicehealth/mark_out_threshold``,
but they are still participating in one or more PGs. This may be
because the device was only recently marked "out" and data is still
migrating, or because data cannot be migrated off for some reason
(e.g., the cluster is nearly full, or the CRUSH hierarchy is such that
there isn't another suitable OSD to migrate the data to).
This message can be silenced by disabling the self heal behavior
(setting ``mgr/devicehealth/self_heal`` to false), by adjusting the
``mgr/devicehealth/mark_out_threshold``, or by addressing what is
preventing data from being migrated off of the ailing device.
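For example, assuming the threshold takes a value in seconds (the
numbers shown are illustrative only), the relevant settings can be
changed with::

  ceph config set mgr mgr/devicehealth/self_heal false
  ceph config set mgr mgr/devicehealth/mark_out_threshold 2419200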
DEVICE_HEALTH_TOOMANY
_____________________
Too many devices are expected to fail soon, and the
``mgr/devicehealth/self_heal`` behavior is enabled, such that marking
all of the ailing devices "out" would drop the cluster below its
``mon_osd_min_in_ratio``, the ratio that prevents too many OSDs from
being automatically marked "out".
This generally indicates that too many devices in your cluster are
expected to fail soon and you should take action to add newer
(healthier) devices before too many devices fail and data is lost.
The health message can also be silenced by adjusting parameters like
``mon_osd_min_in_ratio`` or ``mgr/devicehealth/mark_out_threshold``,
but be warned that this will increase the likelihood of unrecoverable
data loss in the cluster.
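For example (the values shown are illustrative only, and lowering the
ratio weakens the safety limit)::

  ceph config set mon mon_osd_min_in_ratio 0.5
  ceph config set mgr mgr/devicehealth/mark_out_threshold 1209600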
Data health (pools & placement groups)
--------------------------------------