doc/rados/operations/health-checks: document DEVICE_HEALTH* messages
Signed-off-by: Sage Weil <sage@redhat.com>
@@ -251,6 +251,72 @@ You can either raise the pool quota with::

or delete some existing data to reduce utilization.

Device health
-------------

DEVICE_HEALTH
_____________

One or more devices are expected to fail soon, where the warning
threshold is controlled by the ``mgr/devicehealth/warn_threshold``
config option.

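For example, on releases where mgr module options are stored in the
central config database, the threshold can be inspected and adjusted
with ``ceph config`` (the value below is purely illustrative)::

  ceph config get mgr mgr/devicehealth/warn_threshold
  ceph config set mgr mgr/devicehealth/warn_threshold 7257600
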
This warning only applies to OSDs that are currently marked "in", so
the expected response to this failure is to mark the device "out" so
that data is migrated off of the device, and then to remove the
hardware from the system. Note that marking the device "out" is
normally done automatically if ``mgr/devicehealth/self_heal`` is
enabled, based on ``mgr/devicehealth/mark_out_threshold``.

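For example, the OSD backed by a failing device can be marked "out" by
hand (the OSD id here is only illustrative)::

  ceph osd out osd.12
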
Device health can be checked with::

  ceph device info <device-id>

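The device ids known to the cluster can be listed with, for example::

  ceph device ls
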
Device life expectancy is set by a prediction model run by
the mgr or by an external tool via the command::

  ceph device set-life-expectancy <device-id> <from> <to>

You can change the stored life expectancy manually, but that usually
doesn't accomplish anything as whatever tool originally set it will
probably set it again, and changing the stored value does not affect
the actual health of the hardware device.

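As an illustration only (the device id and dates below are made up, and
the exact date format accepted may vary by release), a life expectancy
window might be recorded with::

  ceph device set-life-expectancy SEAGATE_ST4000NM0023_Z1Z0ABC1 2019-03-01 2019-06-01
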
DEVICE_HEALTH_IN_USE
____________________

One or more devices are expected to fail soon and have been marked
"out" of the cluster based on ``mgr/devicehealth/mark_out_threshold``,
but they are still participating in one or more PGs. This may be
because the devices were only recently marked "out" and data is still
migrating, or because data cannot be migrated off for some reason
(e.g., the cluster is nearly full, or the CRUSH hierarchy is such that
there isn't another suitable OSD to migrate the data to).

This message can be silenced by disabling the self heal behavior
(setting ``mgr/devicehealth/self_heal`` to false), by adjusting the
``mgr/devicehealth/mark_out_threshold``, or by addressing what is
preventing data from being migrated off of the ailing device.

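For example, assuming a release where mgr module options are stored in
the central config database, the automatic behavior could be tuned with
(the threshold value below is only illustrative)::

  ceph config set mgr mgr/devicehealth/self_heal false
  ceph config set mgr mgr/devicehealth/mark_out_threshold 2419200
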
DEVICE_HEALTH_TOOMANY
_____________________

Too many devices are expected to fail soon and the
``mgr/devicehealth/self_heal`` behavior is enabled, such that marking
out all of the ailing devices would exceed the cluster's
``mon_osd_min_in_ratio`` ratio that prevents too many OSDs from being
automatically marked "out".

This generally indicates that too many devices in your cluster are
expected to fail soon and you should take action to add newer
(healthier) devices before too many devices fail and data is lost.

The health message can also be silenced by adjusting parameters like
``mon_osd_min_in_ratio`` or ``mgr/devicehealth/mark_out_threshold``,
but be warned that this will increase the likelihood of unrecoverable
data loss in the cluster.

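For example, the ratio might be loosened with (the value below is only
illustrative, and, as noted above, doing so increases the risk of data
loss)::

  ceph config set mon mon_osd_min_in_ratio 0.5
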
Data health (pools & placement groups)
--------------------------------------