doc/rados/operations/health-checks: document DEVICE_HEALTH* messages
Signed-off-by: Sage Weil <sage@redhat.com>

parent ccdfcc7e72
commit 7ab8675fdf

@@ -251,6 +251,72 @@ You can either raise the pool quota with::

or delete some existing data to reduce utilization.

Device health
-------------

DEVICE_HEALTH
_____________

One or more devices are expected to fail soon, where the warning
threshold is controlled by the ``mgr/devicehealth/warn_threshold``
config option.
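
As a sketch of how to tune this (assuming the option is set on the mgr
via the central config store and is interpreted as a time interval in
seconds; the default and exact units may vary by release)::

  # illustrative value: warn when failure is predicted within ~4 weeks
  ceph config set mgr mgr/devicehealth/warn_threshold 2419200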

This warning only applies to OSDs that are currently marked "in", so
the expected response to this failure is to mark the device "out" so
that data is migrated off of the device, and then to remove the
hardware from the system. Note that the marking out is normally done
automatically if ``mgr/devicehealth/self_heal`` is enabled, based on
the ``mgr/devicehealth/mark_out_threshold``.
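
If self healing is disabled, the affected OSD can be marked "out" by
hand (``osd.12`` below is a hypothetical id; substitute the OSD that
sits on the failing device)::

  ceph osd out osd.12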

Device health can be checked with::

  ceph device info <device-id>
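
The device ids referenced above can be listed, together with the
daemons that are using each device, with::

  ceph device ls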

Device life expectancy is set by a prediction model run by
the mgr or by an external tool via the command::

  ceph device set-life-expectancy <device-id> <from> <to>
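
For example (the device id and dates below are purely illustrative;
``<from>`` and ``<to>`` bound the window in which the device is
expected to fail)::

  ceph device set-life-expectancy SEAGATE_ST4000NM0023_Z1Z0ABC 2019-06-01 2019-07-01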

You can change the stored life expectancy manually, but that usually
doesn't accomplish anything, as whatever tool originally set it will
probably set it again, and changing the stored value does not affect
the actual health of the hardware device.

DEVICE_HEALTH_IN_USE
____________________

One or more devices are expected to fail soon and have been marked "out"
of the cluster based on ``mgr/devicehealth/mark_out_threshold``, but they
are still participating in one or more PGs. This may be because the
device was only recently marked "out" and data is still migrating, or
because data cannot be migrated off for some reason (e.g., the cluster
is nearly full, or the CRUSH hierarchy is such that there isn't another
suitable OSD to migrate the data to).
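
To see which PGs still involve the ailing OSD (again using a
hypothetical ``osd.12``)::

  ceph pg ls-by-osd osd.12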

This message can be silenced by disabling the self-heal behavior
(setting ``mgr/devicehealth/self_heal`` to false), by adjusting the
``mgr/devicehealth/mark_out_threshold``, or by addressing whatever is
preventing data from being migrated off of the ailing device.
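
As a sketch, assuming the module option is toggled through the central
config store, self healing could be disabled with::

  ceph config set mgr mgr/devicehealth/self_heal false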

DEVICE_HEALTH_TOOMANY
_____________________

Too many devices are expected to fail soon and the
``mgr/devicehealth/self_heal`` behavior is enabled, such that marking
out all of the ailing devices would exceed the cluster's
``mon_osd_min_in_ratio`` ratio, which prevents too many OSDs from being
automatically marked "out".

This generally indicates that too many devices in your cluster are
expected to fail soon and you should take action to add newer
(healthier) devices before too many devices fail and data is lost.

The health message can also be silenced by adjusting parameters like
``mon_osd_min_in_ratio`` or ``mgr/devicehealth/mark_out_threshold``,
but be warned that this will increase the likelihood of unrecoverable
data loss in the cluster.
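
For example (0.5 is an illustrative value only; the lower the ratio,
the more OSDs may be marked "out" automatically, with the increased
risk of data loss noted above)::

  ceph config set mon mon_osd_min_in_ratio 0.5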

Data health (pools & placement groups)
--------------------------------------