This module is written by Rick Chen <rick.chen@prophetstor.com> and provides both a built-in local predictor and a cloud mode that queries a cloud service (provided by ProphetStor) to predict device failures. Signed-off-by: Rick Chen <rick.chen@prophetstor.com> Signed-off-by: Sage Weil <sage@redhat.com>
=====================
DISKPREDICTION PLUGIN
=====================

The *diskprediction* plugin supports two modes: cloud mode and local mode. In cloud mode, disk and Ceph operating status information is collected from the Ceph cluster and sent to a cloud-based DiskPrediction server over the Internet. The DiskPrediction server analyzes the data and returns analytics and predictions of performance and disk health for the Ceph cluster.

Local mode does not require an external server for data analysis. In local mode, the *diskprediction* plugin uses an internal predictor module to predict disk failures and returns the prediction results to the Ceph system.

Enabling
========

Run the following command to enable the *diskprediction* module in the Ceph
environment:

::

    ceph mgr module enable diskprediction


Select the prediction mode:

::

    ceph device set-prediction-mode <local/cloud>

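The two steps above can be scripted. The following is a minimal sketch, not part of the plugin: the helper name ``enable_commands`` is hypothetical, and it only builds the argument lists for the two documented commands so that the mode can be validated before anything is run.

```python
import shlex

# The two modes named in this document.
VALID_MODES = ("local", "cloud")


def enable_commands(mode):
    """Return the argv lists for enabling the module and selecting a mode.

    Hypothetical helper: it only assembles the two commands shown above.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    return [
        ["ceph", "mgr", "module", "enable", "diskprediction"],
        ["ceph", "device", "set-prediction-mode", mode],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch is safe
    # to execute on a machine without a Ceph cluster.
    for argv in enable_commands("local"):
        print(shlex.join(argv))
```

To actually run the commands, each list could be passed to ``subprocess.run(argv, check=True)`` on a host that has the ``ceph`` CLI and suitable credentials.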

Connection settings
===================
The connection settings are used for the connection between Ceph and the DiskPrediction server.

Local Mode
----------

The *diskprediction* plugin uses the Ceph device health check to collect disk health metrics, passes them to the internal predictor module to produce a disk failure prediction, and returns the result to Ceph. No connection settings are therefore required in local mode. The local predictor module requires at least six datasets of device health metrics to make a prediction.

Run the following command to predict device life expectancy with the local predictor:

::

    ceph device predict-life-expectancy <device id>


Cloud Mode
----------

User registration is required in cloud mode. Users must sign up for an account at https://www.diskprophet.com/#/ to receive the following DiskPrediction server information for the connection settings.

**Certificate file path**: After user registration is confirmed, the system sends a confirmation email that includes a certificate file download link. Download the certificate file and save it to the Ceph system. The path to this file is passed to the connection setup command shown later in this section. Without certificate file verification, the connection settings cannot be completed.

**DiskPrediction server**: The DiskPrediction server name. An IP address can be used instead if required.

**Connection account**: The account name used to set up the connection between Ceph and the DiskPrediction server.

**Connection password**: The password used to set up the connection between Ceph and the DiskPrediction server.

Run the following command to complete the connection setup:

::

    ceph device set-cloud-prediction-config <diskprediction_server> <connection_account> <connection_password> <certificate file path>

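Because the connection setup depends on a certificate file that has been saved locally, it can help to check that the file exists before assembling the command. The sketch below is an illustration only: the function name ``cloud_config_argv`` is hypothetical, and the only thing taken from this document is the argument order of ``set-cloud-prediction-config``.

```python
import os


def cloud_config_argv(server, account, password, cert_path):
    """Build the argv for the cloud connection setup command.

    Hypothetical helper. The argument order follows the command shown
    above: server, account, password, certificate file path.
    """
    if not os.path.isfile(cert_path):
        # Fail early with a clear message instead of letting the
        # connection setup fail later.
        raise FileNotFoundError(f"certificate file not found: {cert_path}")
    return [
        "ceph", "device", "set-cloud-prediction-config",
        server, account, password, cert_path,
    ]
```

The returned list can be handed to ``subprocess.run``. Passing the password as a command-line argument makes it visible in the process list, so a script like this is best run only on a trusted admin host.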

You can use the following command to display the connection settings:

::

    ceph device show-prediction-config


The following optional configuration settings are also available:

:diskprediction_upload_metrics_interval: How often Ceph performance metrics are sent to the DiskPrediction server. The default is 10 minutes.
:diskprediction_upload_smart_interval: How often Ceph physical device information is sent to the DiskPrediction server. The default is 12 hours.
:diskprediction_retrieve_prediction_interval: How often Ceph retrieves physical device prediction data from the DiskPrediction server. The default is 12 hours.

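To get a feel for how much traffic the default intervals imply, the arithmetic can be worked out as follows. This is a small sketch that uses only the default values listed above. The dictionary and the function ``runs_per_day`` are illustrative names and do not correspond to anything in the plugin.

```python
from datetime import timedelta

# Default intervals as stated in this document.
DEFAULT_INTERVALS = {
    "diskprediction_upload_metrics_interval": timedelta(minutes=10),
    "diskprediction_upload_smart_interval": timedelta(hours=12),
    "diskprediction_retrieve_prediction_interval": timedelta(hours=12),
}


def runs_per_day(interval):
    """Return how many times a task with this interval runs in 24 hours."""
    return timedelta(days=1) // interval


if __name__ == "__main__":
    for name, interval in DEFAULT_INTERVALS.items():
        print(f"{name}: {runs_per_day(interval)} per day")
```

With the defaults, performance metrics are uploaded 144 times per day, while device information uploads and prediction retrievals each happen twice per day.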

Diskprediction Data
===================

The *diskprediction* plugin exchanges the following data with the DiskPrediction server.


Metrics Data
------------
- Ceph cluster status

+----------------------+-----------------------------------------+
|key                   |Description                              |
+======================+=========================================+
|cluster_health        |Ceph health check status                 |
+----------------------+-----------------------------------------+
|num_mon               |Number of monitor nodes                  |
+----------------------+-----------------------------------------+
|num_mon_quorum        |Number of monitors in quorum             |
+----------------------+-----------------------------------------+
|num_osd               |Total number of OSDs                     |
+----------------------+-----------------------------------------+
|num_osd_up            |Number of OSDs that are up               |
+----------------------+-----------------------------------------+
|num_osd_in            |Number of OSDs that are in the cluster   |
+----------------------+-----------------------------------------+
|osd_epoch             |Current epoch of the OSD map             |
+----------------------+-----------------------------------------+
|osd_bytes             |Total capacity of the cluster in bytes   |
+----------------------+-----------------------------------------+
|osd_bytes_used        |Number of used bytes on the cluster      |
+----------------------+-----------------------------------------+
|osd_bytes_avail       |Number of available bytes on the cluster |
+----------------------+-----------------------------------------+
|num_pool              |Number of pools                          |
+----------------------+-----------------------------------------+
|num_pg                |Total number of placement groups         |
+----------------------+-----------------------------------------+
|num_pg_active_clean   |Number of placement groups in            |
|                      |active+clean state                       |
+----------------------+-----------------------------------------+
|num_pg_active         |Number of placement groups in active     |
|                      |state                                    |
+----------------------+-----------------------------------------+
|num_pg_peering        |Number of placement groups in peering    |
|                      |state                                    |
+----------------------+-----------------------------------------+
|num_object            |Total number of objects on the cluster   |
+----------------------+-----------------------------------------+
|num_object_degraded   |Number of degraded (missing replicas)    |
|                      |objects                                  |
+----------------------+-----------------------------------------+
|num_object_misplaced  |Number of misplaced (wrong location in   |
|                      |the cluster) objects                     |
+----------------------+-----------------------------------------+
|num_object_unfound    |Number of unfound objects                |
+----------------------+-----------------------------------------+
|num_bytes             |Total number of bytes of all objects     |
+----------------------+-----------------------------------------+
|num_mds_up            |Number of MDSs that are up               |
+----------------------+-----------------------------------------+
|num_mds_in            |Number of MDSs that are in the cluster   |
+----------------------+-----------------------------------------+
|num_mds_failed        |Number of failed MDSs                    |
+----------------------+-----------------------------------------+
|mds_epoch             |Current epoch of the MDS map             |
+----------------------+-----------------------------------------+


- Ceph mon/osd performance counters

Mon:

+----------------------+-----------------------------------------+
|key                   |Description                              |
+======================+=========================================+
|num_sessions          |Current number of opened monitor sessions|
+----------------------+-----------------------------------------+
|session_add           |Number of created monitor sessions       |
+----------------------+-----------------------------------------+
|session_rm            |Number of remove_session calls in monitor|
+----------------------+-----------------------------------------+
|session_trim          |Number of trimmed monitor sessions       |
+----------------------+-----------------------------------------+
|num_elections         |Number of elections monitor took part in |
+----------------------+-----------------------------------------+
|election_call         |Number of elections started by monitor   |
+----------------------+-----------------------------------------+
|election_win          |Number of elections won by monitor       |
+----------------------+-----------------------------------------+
|election_lose         |Number of elections lost by monitor      |
+----------------------+-----------------------------------------+


OSD:

+----------------------+-----------------------------------------+
|key                   |Description                              |
+======================+=========================================+
|op_wip                |Replication operations currently being   |
|                      |processed (primary)                      |
+----------------------+-----------------------------------------+
|op_in_bytes           |Client operations total write size       |
+----------------------+-----------------------------------------+
|op_r                  |Client read operations                   |
+----------------------+-----------------------------------------+
|op_out_bytes          |Client operations total read size        |
+----------------------+-----------------------------------------+
|op_w                  |Client write operations                  |
+----------------------+-----------------------------------------+
|op_latency            |Latency of client operations (including  |
|                      |queue time)                              |
+----------------------+-----------------------------------------+
|op_process_latency    |Latency of client operations (excluding  |
|                      |queue time)                              |
+----------------------+-----------------------------------------+
|op_r_latency          |Latency of read operation (including     |
|                      |queue time)                              |
+----------------------+-----------------------------------------+
|op_r_process_latency  |Latency of read operation (excluding     |
|                      |queue time)                              |
+----------------------+-----------------------------------------+
|op_w_in_bytes         |Client data written                      |
+----------------------+-----------------------------------------+
|op_w_latency          |Latency of write operation (including    |
|                      |queue time)                              |
+----------------------+-----------------------------------------+
|op_w_process_latency  |Latency of write operation (excluding    |
|                      |queue time)                              |
+----------------------+-----------------------------------------+
|op_rw                 |Client read-modify-write operations      |
+----------------------+-----------------------------------------+
|op_rw_in_bytes        |Client read-modify-write operations write|
|                      |in                                       |
+----------------------+-----------------------------------------+
|op_rw_out_bytes       |Client read-modify-write operations read |
|                      |out                                      |
+----------------------+-----------------------------------------+
|op_rw_latency         |Latency of read-modify-write operation   |
|                      |(including queue time)                   |
+----------------------+-----------------------------------------+
|op_rw_process_latency |Latency of read-modify-write operation   |
|                      |(excluding queue time)                   |
+----------------------+-----------------------------------------+


- Ceph pool statistics

+----------------------+-----------------------------------------+
|key                   |Description                              |
+======================+=========================================+
|bytes_used            |Per pool bytes used                      |
+----------------------+-----------------------------------------+
|max_avail             |Max available number of bytes in the pool|
+----------------------+-----------------------------------------+
|objects               |Number of objects in the pool            |
+----------------------+-----------------------------------------+
|wr_bytes              |Number of bytes written in the pool      |
+----------------------+-----------------------------------------+
|dirty                 |Number of dirty bytes in the pool        |
+----------------------+-----------------------------------------+
|rd_bytes              |Number of bytes read in the pool         |
+----------------------+-----------------------------------------+
|raw_bytes_used        |Bytes used in pool including copies made |
+----------------------+-----------------------------------------+


- Ceph physical device metadata

+----------------------+-----------------------------------------+
|key                   |Description                              |
+======================+=========================================+
|disk_domain_id        |Physical device identifier               |
+----------------------+-----------------------------------------+
|disk_name             |Device attachment name                   |
+----------------------+-----------------------------------------+
|disk_wwn              |Device WWN                               |
+----------------------+-----------------------------------------+
|model                 |Device model name                        |
+----------------------+-----------------------------------------+
|serial_number         |Device serial number                     |
+----------------------+-----------------------------------------+
|size                  |Device size                              |
+----------------------+-----------------------------------------+
|vendor                |Device vendor name                       |
+----------------------+-----------------------------------------+


- Correlation information for each Ceph object
- The plugin agent information
- The plugin agent cluster information
- The plugin agent host information


SMART Data
----------
- Ceph physical device SMART data (provided by Ceph *devicehealth* plugin)


Prediction Data
---------------
- Ceph physical device prediction data


Receiving predicted health status from a Ceph OSD disk drive
============================================================

You can receive the predicted health status of a Ceph OSD disk drive by using the
following command:

::

    ceph device get-predicted-status <device id>


The get-predicted-status command returns:

::

    {
        "near_failure": "Good",
        "disk_wwn": "5000011111111111",
        "serial_number": "111111111",
        "predicted": "2018-05-30 18:33:12",
        "attachment": "sdb"
    }


+--------------------+-----------------------------------------------------+
|Attribute           | Description                                         |
+====================+=====================================================+
|near_failure        | The disk failure prediction state:                  |
|                    | Good/Warning/Bad/Unknown                            |
+--------------------+-----------------------------------------------------+
|disk_wwn            | Disk World Wide Name (WWN)                          |
+--------------------+-----------------------------------------------------+
|serial_number       | Disk serial number                                  |
+--------------------+-----------------------------------------------------+
|predicted           | Predicted date                                      |
+--------------------+-----------------------------------------------------+
|attachment          | Device name on the local system                     |
+--------------------+-----------------------------------------------------+

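A script that consumes this output would typically parse the JSON and convert the ``predicted`` field into a timestamp object. The following sketch assumes that the command prints exactly the JSON document shown in the sample above and that ``predicted`` always uses the ``YYYY-MM-DD HH:MM:SS`` layout seen there. The function name ``parse_predicted_status`` is hypothetical.

```python
import json
from datetime import datetime

# Sample output copied from this document.
SAMPLE = """
{
    "near_failure": "Good",
    "disk_wwn": "5000011111111111",
    "serial_number": "111111111",
    "predicted": "2018-05-30 18:33:12",
    "attachment": "sdb"
}
"""


def parse_predicted_status(text):
    """Parse get-predicted-status output into a dict with a datetime.

    Assumes the timestamp format of the sample output; this is not
    confirmed to hold for every plugin version.
    """
    data = json.loads(text)
    data["predicted"] = datetime.strptime(data["predicted"], "%Y-%m-%d %H:%M:%S")
    return data


if __name__ == "__main__":
    status = parse_predicted_status(SAMPLE)
    print(status["attachment"], status["near_failure"], status["predicted"].date())
```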

The *near_failure* attribute indicates the disk life expectancy, as shown in the following table.

+--------------------+-----------------------------------------------------+
|near_failure        | Life expectancy (weeks)                             |
+====================+=====================================================+
|Good                | > 6 weeks                                           |
+--------------------+-----------------------------------------------------+
|Warning             | 2 to 6 weeks                                        |
+--------------------+-----------------------------------------------------+
|Bad                 | < 2 weeks                                           |
+--------------------+-----------------------------------------------------+

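The mapping in the table can be expressed as a small function. This is a sketch of the table only, not the plugin's own logic. It assumes that the boundaries of exactly 2 and exactly 6 weeks fall into the Warning range, which follows from the strict ``>`` and ``<`` in the other two rows, and that a missing estimate corresponds to the Unknown state. The function name is hypothetical.

```python
def near_failure_state(weeks):
    """Map a life expectancy in weeks to a near_failure state.

    Assumption: None (no estimate available) maps to "Unknown".
    """
    if weeks is None:
        return "Unknown"
    if weeks > 6:
        return "Good"
    if weeks < 2:
        return "Bad"
    # Everything from 2 to 6 weeks inclusive.
    return "Warning"


if __name__ == "__main__":
    for weeks in (10, 6, 2, 1.5, None):
        print(weeks, near_failure_state(weeks))
```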

Debugging
=========

To debug the DiskPrediction module, raise the Ceph manager logging level with
the following configuration:

::

    [mgr]

        debug mgr = 20

With logging set to debug for the manager, the plugin prints log
messages with the prefix *mgr[diskprediction]* for easy filtering.

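Filtering on that prefix can be done with any text tool. As an illustration, the sketch below keeps only the lines that contain the prefix. The sample log lines are invented for the example and do not reproduce real Ceph log output. The prefix is matched anywhere in the line on the assumption that a timestamp or other fields may precede it.

```python
PREFIX = "mgr[diskprediction]"


def diskprediction_lines(lines):
    """Return only the log lines that mention the diskprediction plugin."""
    return [line for line in lines if PREFIX in line]


if __name__ == "__main__":
    # Hypothetical log lines, for illustration only.
    sample_log = [
        "2018-05-30 18:33:12 mgr[diskprediction] example message one",
        "2018-05-30 18:33:13 mgr[balancer] unrelated message",
        "2018-05-30 18:33:14 mgr[diskprediction] example message two",
    ]
    for line in diskprediction_lines(sample_log):
        print(line)
```

In practice the same function can be applied to the lines of the manager log file, whose location depends on the deployment.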