doc/rados/operations/health-checks: document DEVICE_HEALTH* messages

Signed-off-by: Sage Weil <sage@redhat.com>
Sage Weil 2018-07-31 09:22:42 -05:00
parent ccdfcc7e72
commit 7ab8675fdf

@@ -251,6 +251,72 @@ You can either raise the pool quota with::
or delete some existing data to reduce utilization.

Device health
-------------

DEVICE_HEALTH
_____________

One or more devices are expected to fail soon. The warning threshold
is controlled by the ``mgr/devicehealth/warn_threshold`` config
option.

This warning applies only to OSDs that are currently marked "in". The
expected response is to mark the device "out" so that data is migrated
off of it, and then to remove the hardware from the system. If
``mgr/devicehealth/self_heal`` is enabled, the device is normally
marked out automatically, based on
``mgr/devicehealth/mark_out_threshold``.

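
If self-healing is disabled, the same steps can be carried out by
hand. The following is a minimal sketch, assuming the ailing device
backs a hypothetical ``osd.12``; substitute the OSD reported by
``ceph health detail``::

  # Show which devices and OSDs triggered the warning.
  ceph health detail

  # Mark the OSD "out" so that data migrates off of the device.
  ceph osd out 12

  # Before pulling the hardware, confirm that the OSD no longer
  # holds data that the cluster needs.
  ceph osd safe-to-destroy osd.12
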
Device health can be checked with::

  ceph device info <device-id>

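
The ``<device-id>`` values known to the cluster can be listed first.
A short sketch, where ``osd.12`` is a hypothetical daemon name::

  # List every device known to the cluster.
  ceph device ls

  # Restrict the listing to the devices backing a single daemon.
  ceph device ls-by-daemon osd.12
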
Device life expectancy is set by a prediction model run by
the mgr or by an external tool via the command::

  ceph device set-life-expectancy <device-id> <from> <to>

You can change the stored life expectancy manually, but doing so
usually accomplishes nothing: whatever tool originally set the value
will probably set it again, and changing the stored value does not
affect the actual health of the hardware device.


DEVICE_HEALTH_IN_USE
____________________

One or more devices are expected to fail soon and have been marked
"out" of the cluster based on ``mgr/devicehealth/mark_out_threshold``,
but they are still participating in one or more PGs. This may be
because they were only recently marked "out" and data is still
migrating, or because data cannot be migrated off for some reason
(e.g., the cluster is nearly full, or the CRUSH hierarchy is such that
there isn't another suitable OSD to migrate the data to).

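
To see which of these cases applies, it can help to look at the PGs
still mapped to the OSD and at overall utilization. A sketch, using a
hypothetical ``osd.12``::

  # PGs that still include the OSD; the list should shrink as data
  # migrates.
  ceph pg ls-by-osd osd.12

  # Per-OSD utilization, to spot a cluster that is too full to accept
  # the migrated data.
  ceph osd df

  # CRUSH hierarchy, to check whether another suitable OSD exists.
  ceph osd tree
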
This message can be silenced by disabling the self heal behavior
(setting ``mgr/devicehealth/self_heal`` to false), by adjusting the
``mgr/devicehealth/mark_out_threshold``, or by addressing what is
preventing data from being migrated off of the ailing device.
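
A sketch of the first two options, assuming a release in which manager
module options are managed with ``ceph config`` (older releases may
keep them under ``ceph config-key`` instead). The threshold value is
illustrative and is assumed to be expressed in seconds::

  # Stop the devicehealth module from marking ailing devices out.
  ceph config set mgr mgr/devicehealth/self_heal false

  # Or only mark devices out when failure is expected within 2 weeks
  # (14 * 86400 seconds).
  ceph config set mgr mgr/devicehealth/mark_out_threshold 1209600
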

DEVICE_HEALTH_TOOMANY
_____________________

Too many devices are expected to fail soon and the
``mgr/devicehealth/self_heal`` behavior is enabled, such that marking
out all of the ailing devices would violate the cluster's
``mon_osd_min_in_ratio`` limit, which prevents too many OSDs from
being automatically marked "out".

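
As a rough worked example, assume that the in ratio is simply the
number of "in" OSDs divided by the total number of OSDs, and take an
illustrative ``mon_osd_min_in_ratio`` of 0.75 in a cluster of 20 OSDs
that are all currently "in"::

  4 ailing devices: (20 - 4) / 20 = 0.80 >= 0.75  -> all 4 can be marked out
  6 ailing devices: (20 - 6) / 20 = 0.70 <  0.75  -> DEVICE_HEALTH_TOOMANY

Under these assumptions, at most 5 OSDs could be marked out
automatically before the limit is reached.
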
This generally indicates that too many devices in your cluster are
expected to fail soon and you should take action to add newer
(healthier) devices before too many devices fail and data is lost.

The health message can also be silenced by adjusting parameters like
``mon_osd_min_in_ratio`` or ``mgr/devicehealth/mark_out_threshold``,
but be warned that this will increase the likelihood of unrecoverable
data loss in the cluster.
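
Before changing either parameter, the current values can be inspected.
A sketch, assuming a release in which these options can be read with
``ceph config get``::

  ceph config get mon mon_osd_min_in_ratio
  ceph config get mgr mgr/devicehealth/mark_out_threshold
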

Data health (pools & placement groups)
--------------------------------------