doc/monitoring: Improve index.rst

Signed-off-by: Anthony D'Atri <anthonyeleven@users.noreply.github.com> (cherry picked from commit 1bc67295c8)
2025-03-21 09:48:37 +00:00 · 2025-03-12 09:31:19 -04:00 · 2025-03-12 09:31:19 -04:00 · f4e5aaf4b3
commit f4e5aaf4b3
parent 1f7fb81be4
1 changed files with 241 additions and 198 deletions
--- a/doc/monitoring/index.rst
+++ b/doc/monitoring/index.rst
@ -4,45 +4,48 @@
 Monitoring overview
 ===================

-The aim of this part of the documentation is to explain the Ceph monitoring
-stack and the meaning of the main Ceph metrics.
+This document explains the Ceph monitoring
+stack and a number of important Ceph metrics.

-With a good understand of the Ceph monitoring stack and metrics users can
-create customized monitoring tools, like Prometheus queries, Grafana
-dashboards, or scripts.
+Ceph admins can explore the rich observability stack deployed by Ceph, and
+can leverage Prometheus, Alertmanager, Grafana, and scripting to create customized
+monitoring tools.


 Ceph Monitoring stack
 =====================

-Ceph provides a default monitoring stack wich is installed by cephadm and
-explained in the :ref:`Monitoring Services <mgr-cephadm-monitoring>` section of
-the cephadm documentation.
+Ceph deploys an integrated monitoring stack as described
+in the :ref:`Monitoring Services <mgr-cephadm-monitoring>` section of
+the ``cephadm`` documentation.  Deployments with external fleetwide monitoring
+and observability systems using these or other tools may choose to disable
+the stack that Ceph deploys by default.


 Ceph metrics
 ============

-The main source for Ceph metrics are the performance counters exposed by each
-Ceph daemon. The :doc:`../dev/perf_counters` are native Ceph monitoring data
+Many Ceph metrics are gathered from the performance counters exposed by each
+Ceph daemon. These :doc:`../dev/perf_counters` are native Ceph metrics.

-Performance counters are transformed into standard Prometheus metrics by the
-Ceph exporter daemon. This daemon runs on every Ceph cluster host and exposes a
-metrics end point where all the performance counters exposed by all the Ceph
-daemons running in the host are published in the form of Prometheus metrics.
+Performance counters are rendered into standard Prometheus metrics by the
+``ceph_exporter`` daemon. This daemon runs on every Ceph cluster host and exposes
+an endpoint where performance counters exposed by Ceph
+daemons running on that host are presented in the form of Prometheus metrics.

-In addition to the Ceph exporter, there is another agent to expose Ceph
-metrics. It is the Prometheus manager module, wich exposes metrics related to
-the whole cluster, basically metrics that are not produced by individual Ceph
-daemons.
+In addition to the ``ceph_exporter`` the Ceph Manager ``prometheus`` module
+exposes metrics relating to the Ceph cluster as a  whole.

-The main source for obtaining Ceph metrics is the metrics endpoint exposed by
-the Cluster Prometheus server.  Ceph can provide you with the Prometheus
-endpoint where you can obtain the complete list of metrics (coming from Ceph
-exporter daemons and Prometheus manager module) and exeute queries.
+Ceph provides a Prometheus endpoint from which one can obtain the complete list
+of available metrics, or against which admins, Grafana, and Alertmanager can exeute queries.

-Use the following command to obtain the Prometheus server endpoint in your
-cluster:
+Prometheus (and related systems) accept data queries formatted as PromQL
+expressions. Expansive documentation of PromQL can be
+viewed [here](https://prometheus.io/docs/prometheus/latest/querying/basics/) and
+several excellent books can be found at the usual sources of digital and print books.
+
+We will explore a number of PromQL queries below. Use the following command
+to obtain the Prometheus endpoint for your cluster:

 Example:

@ -54,62 +57,94 @@ Example:

 With this information you can connect to
 ``http://cephtest-node-00.cephlab.com:9095`` to access the Prometheus server
-interface.
+interface, which includes a list of targets, an expression browser, and metrics
+related to the Prometheus service itself.

-And the complete list of metrics (with help) for your cluster will be available
+The complete list of metrics (with descriptions) is available at the URL of the below form:
 in:

 ``http://cephtest-node-00.cephlab.com:9095/api/v1/targets/metadata``

+The Ceph Dashboard provides a rich set of graphs and other panels that display the
+most important cluster and service metrics.  Many of the examples in this document
+are taken from Dashboard graphics or extrapolated from metrics exposed by the
+Ceph Dashboard.

-It is good to outline that the main tool allowing users to observe and monitor a Ceph cluster is the **Ceph dashboard**. It provides graphics where the most important cluster and service metrics are represented. Most of the examples in this document are extracted from the dashboard graphics or extrapolated from the metrics exposed by the Ceph dashboard.
+Ceph daemon health metrics
+==========================

+The ``ceph_exporter`` provides a metric named ``ceph_daemon_socket_up`` that
+indicates the health status of a Ceph daemon based on its ability to respond
+via the admin socket, where a value of ``1`` means healthy, and ``0`` means
+unhealthy. Although a Ceph daemon might still be "alive" when it
+reports ``ceph_daemon_socket_up=0``, this status indicates a significant issue
+in its functionality. As such, this metric serves as an excellent means of
+detecting problems in any of the main Ceph daemons.
+
+The ``ceph_daemon_socket_up`` Prometheus metrics also have labels as described below:
+* ``ceph_daemon``: Identifier of the Ceph daemon exposing an admin socket on the host.
+* ``hostname``: Name of the host where the Ceph daemon is running.
+
+Example:
+
+.. code-block:: bash
+
+   ceph_daemon_socket_up{ceph_daemon="mds.a",hostname="testhost"} 1
+   ceph_daemon_socket_up{ceph_daemon="osd.1",hostname="testhost"} 0
+
+To identify any Ceph daemons that were not responsive at any point in the last
+12 hours, you can use the following PromQL expression:
+
+.. code-block:: bash
+
+   ceph_daemon_socket_up == 0 or min_over_time(ceph_daemon_socket_up[12h]) == 0

 Performance metrics
 ===================

-Main metrics used to measure Cluster Ceph performance:
+Below we explore a a number of metrics that indicate Ceph cluster performance.

-All metrics have the following labels:
-``ceph_daemon``: identifier of the OSD daemon generating the metric
-``instance``: the IP address of the ceph exporter instance exposing the metric.
-``job``: prometheus scrape job
+All of these metrics have the following labels:
+* ``ceph_daemon``: Identifier of the Ceph daemon from which the metric was harvested
+* ``instance``: The IP address of the exporter instance exposing the metric.
+* ``job``: Prometheus scrape job name

-Example:
+Below is an example Prometheus query result showing these labels:

 .. code-block:: bash

  ceph_osd_op_r{ceph_daemon="osd.0", instance="192.168.122.7:9283", job="ceph"} = 73981

-*Cluster I/O (throughput):*
-Use ``ceph_osd_op_r_out_bytes`` and ``ceph_osd_op_w_in_bytes`` to obtain the cluster throughput generated by clients
+*Cluster throughput:*
+Query ``ceph_osd_op_r_out_bytes`` and ``ceph_osd_op_w_in_bytes`` to obtain cluster client throughput:

 Example:

 .. code-block:: bash

-  Writes (B/s):
+  # Writes (B/s):
  sum(irate(ceph_osd_op_w_in_bytes[1m]))

-  Reads (B/s):
+  # Reads (B/s):
  sum(irate(ceph_osd_op_r_out_bytes[1m]))


 *Cluster I/O (operations):*
-Use ``ceph_osd_op_r``, ``ceph_osd_op_w`` to obtain the number of operations generated by clients
+Query ``ceph_osd_op_r``, ``ceph_osd_op_w`` to obtain the rates of client operations (IOPS):

 Example:

 .. code-block:: bash

-  Writes (ops/s):
+  # Writes (ops/s):
  sum(irate(ceph_osd_op_w[1m]))

-  Reads (ops/s):
+  # Reads (ops/s):
  sum(irate(ceph_osd_op_r[1m]))

 *Latency:*
-Use ``ceph_osd_op_latency_sum`` wich represents the delay before a OSD transfer of data begins following a client instruction for its transfer
+Query ``ceph_osd_op_latency_sum`` to measure the delay before OSD transfers of data
+begins in respose to client requests:

 Example:

@ -121,128 +156,134 @@ Example:
 OSD performance
 ===============

-The previous explained cluster performance metrics are based in OSD metrics, selecting the right label we can obtain for a single OSD the same performance information explained for the cluster:
+The cluster performance metrics described above are gathered from OSD metrics.
+By specifying an appropriate label value or regular expression we can retrieve
+performance metrics for one or a subset of the cluster's OSDs:

-Example:
+Examples:

 .. code-block:: bash

-  OSD 0 read latency
+  # OSD 0 read latency
  irate(ceph_osd_op_r_latency_sum{ceph_daemon=~"osd.0"}[1m]) / on (ceph_daemon) irate(ceph_osd_op_r_latency_count[1m])

-  OSD 0 write IOPS
+  # OSD 0 write IOPS
  irate(ceph_osd_op_w{ceph_daemon=~"osd.0"}[1m])

-  OSD 0 write thughtput (bytes)
+  # OSD 0 write thughtput (bytes)
  irate(ceph_osd_op_w_in_bytes{ceph_daemon=~"osd.0"}[1m])

-  OSD.0 total raw capacity available
+  # OSD.0 total raw capacity available
  ceph_osd_stat_bytes{ceph_daemon="osd.0", instance="cephtest-node-00.cephlab.com:9283", job="ceph"} = 536451481


-Physical disk performance:
-==========================
+Physical storage drive performance:
+===================================

-Combining Prometheus ``node_exporter`` metrics with Ceph metrics we can have
-information about the performance provided by physical disks used by OSDs.
+By combining Prometheus ``node_exporter`` metrics with Ceph cluster metrics we can
+derive performance information for physical storage media backing Ceph OSDs.

 Example:

 .. code-block:: bash

-  Read latency of device used by OSD 0:
+  # Read latency of device used by osd.0
  label_replace(irate(node_disk_read_time_seconds_total[1m]) / irate(node_disk_reads_completed_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")

-  Write latency of device used by OSD 0
+  # Write latency of device used by osd.0
  label_replace(irate(node_disk_write_time_seconds_total[1m]) / irate(node_disk_writes_completed_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")

-  IOPS (device used by OSD.0)
-  reads:
+  # IOPS of device used by osd.0
+  # reads:
  label_replace(irate(node_disk_reads_completed_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")

-  writes:
+  # writes:
  label_replace(irate(node_disk_writes_completed_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")

-  Throughput (device used by OSD.0)
-  reads:
+  # Throughput for device used by osd.0
+  # reads:
  label_replace(irate(node_disk_read_bytes_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")

-  writes:
+  # writes:
  label_replace(irate(node_disk_written_bytes_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")

-  Physical Device Utilization (%) for OSD.0 in the last 5 minutes
+  # Physical drive utilization (%) for osd.0 in the last 5 minutes. Note that this value has limited mean for SSDs
  label_replace(irate(node_disk_io_time_seconds_total[5m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")

 Pool metrics
 ============

-These metrics have the following labels:
-``instance``: the ip address of the Ceph exporter daemon producing the metric.
-``pool_id``: identifier of the pool
-``job``: prometheus scrape job
+Ceph pool metrics have the following labels:
+* ``instance``: The IP address of the exporter providing the metric
+* ``pool_id``: Numeric identifier of the Ceph pool
+* ``job``: Prometheus scrape job name


- ``ceph_pool_metadata``: Information about the pool It can be used together
-  with other metrics to provide more contextual information in queries and
-  graphs.  Apart of the three common labels this metric provide the following
-  extra labels:
+Pool-specific metrics include:
+* ``ceph_pool_metadata``: Information about the pool that can be used together
+  with other metrics to provide more information in query resultss and
+  graphs. In addition to the above three common labels this metric
+  provides the following:

-  - ``compression_mode``: compression used in the pool (lz4, snappy, zlib,
-    zstd, none). Example: compression_mode="none"
+    * ``compression_mode``: Compression type enabled for the pool. Values are ``lz4``, ``snappy``,
+       ``zlib``, ``zstd``, and ``none`). Example: ``compression_mode="none"``

-  - ``description``: brief description of the pool type (replica:number of
-    replicas or Erasure code: ec profile). Example: description="replica:3"
-  - ``name``: name of the pool. Example: name=".mgr"
-  - ``type``: type of pool (replicated/erasure code). Example: type="replicated"
+    * ``description``: Brief description of the pool data protection strategy
+      including replica number or EC profile. Example: ``description="replica:3"``

- ``ceph_pool_bytes_used``: Total raw capacity consumed by user data and associated overheads by pool (metadata + redundancy):
+    * ``name``: Name of the pool. Example: ``name=".mgr"``

- ``ceph_pool_stored``: Total of CLIENT data stored in the pool
+    * ``type``: Data protection strategy, replicated or EC. ``Example: type="replicated"``

- ``ceph_pool_compress_under_bytes``: Data eligible to be compressed in the pool
+* ``ceph_pool_bytes_used``: Total raw capacity (after replication or EC) consumed by user data and metadata

- ``ceph_pool_compress_bytes_used``:  Data compressed in the pool
+* ``ceph_pool_stored``: Total client data stored in the pool (before data protection)

- ``ceph_pool_rd``: CLIENT read operations per pool (reads per second)
+* ``ceph_pool_compress_under_bytes``: Data eligible to be compressed in the pool

- ``ceph_pool_rd_bytes``: CLIENT read operations in bytes per pool
+* ``ceph_pool_compress_bytes_used``:  Data compressed in the pool

- ``ceph_pool_wr``: CLIENT write operations per pool (writes per second)
+* ``ceph_pool_rd``: Client read operations per pool (reads per second)

- ``ceph_pool_wr_bytes``: CLIENT write operation in bytes per pool
+* ``ceph_pool_rd_bytes``: Client read operations in bytes per pool
+
+* ``ceph_pool_wr``: Client write operations per pool (writes per second)
+
+* ``ceph_pool_wr_bytes``: Client write operation in bytes per pool


 **Useful queries**:

 .. code-block:: bash

-  Total raw capacity available in the cluster:
+  # Total raw capacity available in the cluster:
  sum(ceph_osd_stat_bytes)

-  Total raw capacity consumed in the cluster (including metadata + redundancy):
+  # Total raw capacity consumed in the cluster (including metadata + redundancy):
  sum(ceph_pool_bytes_used)

-  Total of CLIENT data stored in the cluster:
+  # Total client data stored in the cluster:
  sum(ceph_pool_stored)

-  Compression savings:
+  # Compression savings:
  sum(ceph_pool_compress_under_bytes - ceph_pool_compress_bytes_used)

-  CLIENT IOPS for a pool (testrbdpool)
+  # Client IOPS for a specific pool
  reads: irate(ceph_pool_rd[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}
  writes: irate(ceph_pool_wr[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}

-  CLIENT Throughput for a pool
+  # Client throughput for a specific pool
  reads: irate(ceph_pool_rd_bytes[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}
  writes: irate(ceph_pool_wr_bytes[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}

-Object metrics
-==============
+RGW metrics
+==================

 These metrics have the following labels:
-``instance``: the ip address of the ceph exporter daemon providing the metric
-``instance_id``: identifier of the rgw daemon
-``job``: prometheus scrape job
+
+* ``instance``: The IP address of the exporter providing the metric
+* ``instance_id``: Identifier of the RGW daemon instance
+* ``job``: Orometheus scrape job name

 Example:

@ -253,93 +294,94 @@ Example:

 Generic metrics
 ---------------
- ``ceph_rgw_metadata``: Provides generic information about the RGW daemon.  It
-  can be used together with other metrics to provide more contextual
-  information in queries and graphs. Apart from the three common labels, this
-  metric provides the following extra labels:
+* ``ceph_rgw_metadata``: Provides generic information about an RGW daemon. This
+  can be used together with other metrics to provide contextual
+  information in queries and graphs. In addtion to the three common labels, this
+  metric provides the following:

-  - ``ceph_daemon``: Name of the Ceph daemon. Example:
-    ceph_daemon="rgw.rgwtest.cephtest-node-00.sxizyq",
-  - ``ceph_version``: Version of Ceph daemon. Example: ceph_version="ceph
-    version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)",
-  - ``hostname``: Name of the host where the daemon runs. Example:
-    hostname:"cephtest-node-00.cephlab.com",
+    * ``ceph_daemon``: Name of the RGW daemon instance. Example:
+    ``ceph_daemon="rgw.rgwtest.cephtest-node-00.sxizyq"``
+    * ``ceph_version``: Version of the RGW daemon. Example: ``ceph_version="ceph
+    version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)"``
+    * ``hostname``: Name of the host where the daemon runs. Example:
+    ``hostname:"cephtest-node-00.cephlab.com"``

- ``ceph_rgw_req``: Number total of requests for the daemon (GET+PUT+DELETE)
-    Useful to detect bottlenecks and optimize load distribution.
+* ``ceph_rgw_req``: Number of requests processed by the daemon (``GET``+``PUT``+``DELETE``).
+    Useful for detecting bottlenecks and optimizing load distribution.

- ``ceph_rgw_qlen``: RGW operations queue length for the daemon.
-    Useful to detect bottlenecks and optimize load distribution.
+* ``ceph_rgw_qlen``: Operations queue length for the daemon.
+    Useful for detecting bottlenecks and optimizing load distribution.

- ``ceph_rgw_failed_req``: Aborted requests.
-    Useful to detect daemon errors
+* ``ceph_rgw_failed_req``: Aborted requests.
+    Useful for detecting daemon errors.


-GET operations: related metrics
+GET operation metrics
+---------------------
+* ``ceph_rgw_op_global_get_obj_lat_count``: Number of ``GET`` requests
+
+* ``ceph_rgw_op_global_get_obj_lat_sum``: Total latency for ``GET`` requests
+
+* ``ceph_rgw_op_global_get_obj_ops``: Total number of ``GET`` requests
+
+* ``ceph_rgw_op_global_get_obj_bytes``: Total bytes transferred for ``GET`` requests
+
+
+PUT operation metrics
 -------------------------------
- ``ceph_rgw_get_initial_lat_count``: Number of get operations
+* ``ceph_rgw_op_global_put_obj_lat_count``: Number of get operations

- ``ceph_rgw_get_initial_lat_sum``: Total latency time for the GET operations
+* ``ceph_rgw_op_global_put_obj_lat_sum``: Total latency time for ``PUT`` operations

- ``ceph_rgw_get``: Number total of GET requests
+* ``ceph_rgw_op_global_put_obj_ops``: Total number of ``PUT`` operations

- ``ceph_rgw_get_b``: Total bytes transferred in GET operations
+* ``ceph_rgw_op_global_get_obj_bytes``: Total bytes transferred in ``PUT`` operations


-Put operations: related metrics
-------------------------------
- ``ceph_rgw_put_initial_lat_count``: Number of get operations
-
- ``ceph_rgw_put_initial_lat_sum``: Total latency time for the PUT operations
-
- ``ceph_rgw_put``: Total number of PUT operations
-
- ``ceph_rgw_get_b``: Total bytes transferred in PUT operations
-
-
-Useful queries
--------------
+Additional Useful queries
+-------------------------

 .. code-block:: bash

-  The average of get latencies:
-  rate(ceph_rgw_get_initial_lat_sum[30s]) / rate(ceph_rgw_get_initial_lat_count[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
+  # Average GET latency
+  rate(ceph_rgw_op_global_get_obj_lat_sum[30s]) / rate(ceph_rgw_op_global_get_obj_lat_count[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata

-  The average of put latencies:
-  rate(ceph_rgw_put_initial_lat_sum[30s]) / rate(ceph_rgw_put_initial_lat_count[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
+  # Average PUT latency
+  rate(ceph_rgw_op_global_put_obj_lat_sum[30s]) / rate(ceph_rgw_op_global_put_obj_lat_count[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata

-  Total requests per second:
+  # Requests per second
  rate(ceph_rgw_req[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata

-  Total number of "other" operations (LIST, DELETE)
-  rate(ceph_rgw_req[30s]) -  (rate(ceph_rgw_get[30s]) + rate(ceph_rgw_put[30s]))
+  # Total number of "other" operations (``LIST``, ``DELETE``, etc)
+  rate(ceph_rgw_req[30s]) -  (rate(ceph_rgw_op_global_get_obj_ops[30s]) + rate(ceph_rgw_op_global_put_obj_ops[30s]))

-  GET latencies
-  rate(ceph_rgw_get_initial_lat_sum[30s]) /  rate(ceph_rgw_get_initial_lat_count[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
+  # GET latency per RGW instance
+  rate(ceph_rgw_op_global_get_obj_lat_sum[30s]) /  rate(ceph_rgw_op_global_get_obj_lat_count[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata

-  PUT latencies
-  rate(ceph_rgw_put_initial_lat_sum[30s]) /  rate(ceph_rgw_put_initial_lat_count[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
+  # PUT latency per RGW instance
+  rate(ceph_rgw_op_global_put_obj_lat_sum[30s]) /  rate(ceph_rgw_op_global_put_obj_lat_count[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata

-  Bandwidth consumed by GET operations
-  sum(rate(ceph_rgw_get_b[30s]))
+  # Bandwidth consumed by GET operations
+  sum(rate(ceph_rgw_op_global_get_obj_bytes[30s]))

-  Bandwidth consumed by PUT operations
-  sum(rate(ceph_rgw_put_b[30s]))
+  # Bandwidth consumed by PUT operations
+  sum(rate(ceph_rgw_op_global_put_obj_bytes[30s]))

-  Bandwidth consumed by RGW instance (PUTs + GETs)
-  sum by (instance_id) (rate(ceph_rgw_get_b[30s]) + rate(ceph_rgw_put_b[30s])) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
+  # Bandwidth consumed by RGW instance (PUTs + GETs)
+  sum by (instance_id) (rate(ceph_rgw_op_global_get_obj_bytes[30s]) + rate(ceph_rgw_op_global_put_obj_bytes[30s])) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata

-  Http errors:
+  # HTTP errors and other request failures
  rate(ceph_rgw_failed_req[30s])


-Filesystem Metrics
-==================
+CephFS Metrics
+==============

 These metrics have the following labels:
-``ceph_daemon``: The name of the MDS daemon
-``instance``: the ip address (and port) of of the Ceph exporter daemon exposing the metric
-``job``: prometheus scrape job
+
+* ``ceph_daemon``: The name of the MDS daemon
+* ``instance``: The IP address and port of the exporter exposing the metric
+* ``job``: Prometheus scrape job name

 Example:

@ -348,18 +390,18 @@ Example:
  ceph_mds_request{ceph_daemon="mds.test.cephtest-node-00.hmhsoh", instance="192.168.122.7:9283", job="ceph"} = 1452


-Main metrics
------------
+Important metrics
+-----------------

- ``ceph_mds_metadata``: Provides general information about the MDS daemon.  It
-  can be used together with other metrics to provide more contextual
-  information in queries and graphs.  It provides the following extra labels:
+* ``ceph_mds_metadata``: Provides general information about the MDS daemon.  It
+  can be used together with other metrics to provide contextual
+  information in queries and graphs.  The following extra labels are populated:

-  - ``ceph_version``: MDS daemon Ceph version
-  - ``fs_id``: filesystem cluster id
-  - ``hostname``: Host name where the MDS daemon runs
-  - ``public_addr``: Public address where the MDS daemon runs
-  - ``rank``: Rank of the MDS daemon
+    * ``ceph_version``: MDS daemon version
+    * ``fs_id``: CephFS filesystem ID
+    * ``hostname``: Name of the host where the MDS daemon runs
+    * ``public_addr``: Public address of the host where the MDS daemon runs
+    * ``rank``: Rank of the MDS daemon

 Example:

@ -368,29 +410,29 @@ Example:
 ceph_mds_metadata{ceph_daemon="mds.test.cephtest-node-00.hmhsoh", ceph_version="ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)", fs_id="-1", hostname="cephtest-node-00.cephlab.com", instance="cephtest-node-00.cephlab.com:9283", job="ceph", public_addr="192.168.122.145:6801/118896446", rank="-1"}


- ``ceph_mds_request``: Total number of requests for the MDs daemon
+* ``ceph_mds_request``: Total number of requests for the MDS

- ``ceph_mds_reply_latency_sum``: Reply latency total
+* ``ceph_mds_reply_latency_sum``: Reply latency total

- ``ceph_mds_reply_latency_count``: Reply latency count
+* ``ceph_mds_reply_latency_count``: Reply latency count

- ``ceph_mds_server_handle_client_request``: Number of client requests
+* ``ceph_mds_server_handle_client_request``: Number of client requests

- ``ceph_mds_sessions_session_count``: Session count
+* ``ceph_mds_sessions_session_count``: Session count

- ``ceph_mds_sessions_total_load``: Total load
+* ``ceph_mds_sessions_total_load``: Total load

- ``ceph_mds_sessions_sessions_open``: Sessions currently open
+* ``ceph_mds_sessions_sessions_open``: Sessions currently open

- ``ceph_mds_sessions_sessions_stale``: Sessions currently stale
+* ``ceph_mds_sessions_sessions_stale``: Sessions currently stale

- ``ceph_objecter_op_r``: Number of read operations
+* ``ceph_objecter_op_r``: Number of read operations

- ``ceph_objecter_op_w``: Number of write operations
+* ``ceph_objecter_op_w``: Number of write operations

- ``ceph_mds_root_rbytes``: Total number of bytes managed by the daemon
+* ``ceph_mds_root_rbytes``: Total number of bytes managed by the daemon

- ``ceph_mds_root_rfiles``: Total number of files managed by the daemon
+* ``ceph_mds_root_rfiles``: Total number of files managed by the daemon


 Useful queries:
@ -398,41 +440,41 @@ Useful queries:

 .. code-block:: bash

-  Total MDS daemons read workload:
+  # Total MDS read workload:
  sum(rate(ceph_objecter_op_r[1m]))

-  Total MDS daemons write workload:
+  # Total MDS daemons workload:
  sum(rate(ceph_objecter_op_w[1m]))

-  MDS daemon read workload: (daemon name is "mdstest")
+  # Read workload for a specific MDS
  sum(rate(ceph_objecter_op_r{ceph_daemon=~"mdstest"}[1m]))

-  MDS daemon write workload: (daemon name is "mdstest")
+  # Write workload for a specific MDS
  sum(rate(ceph_objecter_op_r{ceph_daemon=~"mdstest"}[1m]))

-  The average of reply latencies:
+  # Average reply latency
  rate(ceph_mds_reply_latency_sum[30s]) / rate(ceph_mds_reply_latency_count[30s])

-  Total requests per second:
+  # Total requests per second
  rate(ceph_mds_request[30s]) * on (instance) group_right (ceph_daemon) ceph_mds_metadata


 Block metrics
 =============

-By default RBD metrics for images are not available in order to provide the
-best performance in the prometheus manager module.
+By default RBD metrics for images are not gathered, as their cardinality may
+be high.  This helps ensure the performance of the Manager's ``prometheus`` module.

-To produce metrics for RBD images it is needed to configure properly the
-manager option ``mgr/prometheus/rbd_stats_pools``. For more information please
+To produce metrics for RBD images, configure the
+Manager option ``mgr/prometheus/rbd_stats_pools``. For more information
 see :ref:`prometheus-rbd-io-statistics`

-
 These metrics have the following labels:
-``image``: Name of the image which produces the metric value.
-``instance``: Node where the rbd metric is produced. (It points to the Ceph exporter daemon)
-``job``: Name of the Prometheus scrape job.
-``pool``: Image pool name.
+
+* ``image``: Name of the image (volume)
+* ``instance``: Node where the exporter runs
+* ``job``: Name of the Prometheus scrape job
+* ``pool``: RBD pool name

 Example:

@ -441,24 +483,25 @@ Example:
  ceph_rbd_read_bytes{image="test2", instance="cephtest-node-00.cephlab.com:9283", job="ceph", pool="testrbdpool"}


-Main metrics
------------
+Important  metrics
+------------------

- ``ceph_rbd_read_bytes``: RBD image bytes read
+* ``ceph_rbd_read_bytes``: RBD bytes read

- ``ceph_rbd_read_latency_count``: RBD image reads latency count
+* ``ceph_rbd_write_bytes``: RBD image bytes written

- ``ceph_rbd_read_latency_sum``: RBD image reads latency total
+* ``ceph_rbd_read_latency_count``: RBD read operation latency count

- ``ceph_rbd_read_ops``: RBD image reads count
+* ``ceph_rbd_read_latency_sum``: RBD read operation latency total time

- ``ceph_rbd_write_bytes``: RBD image bytes written
+* ``ceph_rbd_read_ops``: RBD read operation count

- ``ceph_rbd_write_latency_count``: RBD image writes latency count
+* ``ceph_rbd_write_ops``: RBD write operation count

- ``ceph_rbd_write_latency_sum``: RBD image writes latency total
+* ``ceph_rbd_write_latency_count``: RBD write operation latency count
+
+* ``ceph_rbd_write_latency_sum``: RBD write operation latency total

- ``ceph_rbd_write_ops``: RBD image writes count


 Useful queries
@ -466,7 +509,7 @@ Useful queries

 .. code-block:: bash

-  The average of read latencies:
+  # Average read latency
  rate(ceph_rbd_read_latency_sum[30s]) / rate(ceph_rbd_read_latency_count[30s]) * on (instance) group_left (ceph_daemon) ceph_rgw_metadata