From 6f6df03e8c6b7ad150038509f4e45e70ae716479 Mon Sep 17 00:00:00 2001
From: Alex Marangone <amarangone@digitalocean.com>
Date: Fri, 17 Feb 2023 09:16:56 -0800
Subject: [PATCH] doc: add metrics list and desc

---
 METRICS.md | 225 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 README.md  |   6 +-
 2 files changed, 228 insertions(+), 3 deletions(-)
 create mode 100644 METRICS.md

diff --git a/METRICS.md b/METRICS.md
new file mode 100644
index 0000000..53d4092
--- /dev/null
+++ b/METRICS.md
@@ -0,0 +1,225 @@
+# Metrics Collected
+
+Ceph exporter implements multiple collectors:
+
+## Cluster usage
+
+General cluster level data usage.
+
+Labels:
+- `cluster`: cluster name
+
+Metrics:
+- `ceph_cluster_capacity_bytes`: Total capacity of the cluster
+- `ceph_cluster_used_bytes`: Capacity of the cluster currently in use
+- `ceph_cluster_available_bytes`: Available space within the cluster
+
+## Pool usage
+
+Per-pool usage data
+
+Labels:
+- `cluster`: cluster name
+- `pool`: pool name
+
+Metrics:
+- `ceph_pool_used_bytes`: Capacity of the pool that is currently in use
+- `ceph_pool_raw_used_bytes`: Raw capacity of the pool that is currently in use; this factors in the pool's `size`
+- `ceph_pool_available_bytes`: Free space for the pool
+- `ceph_pool_percent_used`: Percentage of the capacity available to this pool that is used by this pool
+- `ceph_pool_objects_total`: Total no. of objects allocated within the pool
+- `ceph_pool_dirty_objects_total`: Total no. of dirty objects in a cache-tier pool
+- `ceph_pool_unfound_objects_total`: Total no. of unfound objects for the pool
+- `ceph_pool_read_total`: Total read I/O calls for the pool
+- `ceph_pool_read_bytes_total`: Total read throughput for the pool
+- `ceph_pool_write_total`: Total write I/O calls for the pool
+- `ceph_pool_write_bytes_total`: Total write throughput for the pool
+
+## Pool info
+
+General pool information
+
+Labels:
+- `cluster`: cluster name
+- `pool`: pool name
+- `root`: CRUSH root of the pool
+- `profile`: `replicated` or EC profile being used
+
+Metrics:
+- `ceph_pool_pg_num`: The total count of PGs allotted to a pool
+- `ceph_pool_pgp_num`: The total count of PGs allotted to a pool and used for placements
+- `ceph_pool_min_size`: Minimum number of copies or chunks of an object that need to be present for active I/O
+- `ceph_pool_size`: Total copies or chunks of an object that need to be present for a healthy cluster
+- `ceph_pool_quota_max_bytes`: Maximum amount of bytes of data allowed in a pool
+- `ceph_pool_quota_max_objects`: Maximum amount of RADOS objects allowed in a pool
+- `ceph_pool_stripe_width`: Stripe width of a RADOS object in a pool
+- `ceph_pool_expansion_factor`: Data expansion multiplier for a pool
+
+## Cluster health
+
+Cluster health metrics
+
+Labels:
+- `cluster`: cluster name
+
+Metrics:
+- `ceph_health_status`: Health status of Cluster, can vary only between 3 states (err:2, warn:1, ok:0)
+- `ceph_health_status_interp`: Health status of Cluster, can vary only between 4 states (err:3, critical_warn:2, soft_warn:1, ok:0)
+- `ceph_mons_down`: Count of Mons that are in DOWN state
+- `ceph_total_pgs`: Total no. of PGs in the cluster
+- `ceph_pg_state`: State of PGs in the cluster
+- `ceph_active_pgs`: No. of active PGs in the cluster
+- `ceph_scrubbing_pgs`: No. of scrubbing PGs in the cluster
+- `ceph_deep_scrubbing_pgs`: No. of deep scrubbing PGs in the cluster
+- `ceph_recovering_pgs`: No. of recovering PGs in the cluster
+- `ceph_recovery_wait_pgs`: No. of PGs in the cluster with recovery_wait state
+- `ceph_backfilling_pgs`: No. of backfilling PGs in the cluster
+- `ceph_backfill_wait_pgs`: No. of PGs in the cluster with backfill_wait state
+- `ceph_forced_recovery_pgs`: No. of PGs in the cluster with forced_recovery state
+- `ceph_forced_backfill_pgs`: No. of PGs in the cluster with forced_backfill state
+- `ceph_down_pgs`: No. of PGs in the cluster in down state
+- `ceph_incomplete_pgs`: No. of PGs in the cluster in incomplete state
+- `ceph_inconsistent_pgs`: No. of PGs in the cluster in inconsistent state
+- `ceph_snaptrim_pgs`: No. of snaptrim PGs in the cluster
+- `ceph_snaptrim_wait_pgs`: No. of PGs in the cluster with snaptrim_wait state
+- `ceph_repairing_pgs`: No. of PGs in the cluster with repair state
+- `ceph_slow_requests`: No. of slow requests/slow ops
+- `ceph_degraded_pgs`: No. of PGs in a degraded state
+- `ceph_stuck_degraded_pgs`: No. of PGs stuck in a degraded state
+- `ceph_unclean_pgs`: No. of PGs in an unclean state
+- `ceph_stuck_unclean_pgs`: No. of PGs stuck in an unclean state
+- `ceph_undersized_pgs`: No. of undersized PGs in the cluster
+- `ceph_stuck_undersized_pgs`: No. of stuck undersized PGs in the cluster
+- `ceph_stale_pgs`: No. of stale PGs in the cluster
+- `ceph_stuck_stale_pgs`: No. of stuck stale PGs in the cluster
+- `ceph_peering_pgs`: No. of peering PGs in the cluster
+- `ceph_degraded_objects`: No. of degraded objects across all PGs, includes replicas
+- `ceph_misplaced_objects`: No. of misplaced objects across all PGs, includes replicas
+- `ceph_misplaced_ratio`: Ratio of misplaced objects to total objects
+- `ceph_new_crash_reports`: Number of new crash reports available
+- `ceph_osds_too_many_repair`: Number of OSDs with too many repaired reads
+- `ceph_cluster_objects`: No. of rados objects within the cluster
+- `ceph_osd_map_flags`: A metric for all OSDMap flags
+- `ceph_osds_down`: Count of OSDs that are in DOWN state
+- `ceph_osds_up`: Count of OSDs that are in UP state
+- `ceph_osds_in`: Count of OSDs that are in IN state and available to serve requests
+- `ceph_osds`: Count of total OSDs in the cluster
+- `ceph_pgs_remapped`: No. of PGs that are remapped and incurring cluster-wide movement
+- `ceph_recovery_io_bytes`: Rate of bytes being recovered in cluster per second
+- `ceph_recovery_io_keys`: Rate of keys being recovered in cluster per second
+- `ceph_recovery_io_objects`: Rate of objects being recovered in cluster per second
+- `ceph_client_io_read_bytes`: Rate of bytes being read by all clients per second
+- `ceph_client_io_write_bytes`: Rate of bytes being written by all clients per second
+- `ceph_client_io_ops`: Total client ops on the cluster measured per second
+- `ceph_client_io_read_ops`: Total client read I/O ops on the cluster measured per second
+- `ceph_client_io_write_ops`: Total client write I/O ops on the cluster measured per second
+- `ceph_cache_flush_io_bytes`: Rate of bytes being flushed from the cache pool per second
+- `ceph_cache_evict_io_bytes`: Rate of bytes being evicted from the cache pool per second
+- `ceph_cache_promote_io_ops`: Total cache promote operations measured per second
+- `ceph_mgrs_active`: Count of active mgrs, can be either 0 or 1
+- `ceph_mgrs`: Total number of mgrs, including standbys
+- `ceph_rbd_mirror_up`: Alive rbd-mirror daemons
+
+## Ceph monitor
+
+Ceph Monitor metrics
+
+Labels:
+- `cluster`: cluster name
+- `daemon`: daemon name. `ceph_versions` and `ceph_features` only
+- `release`, `features`: ceph feature name and feature flag. `ceph_features` only
+- `version_tag`, `sha1`, `release_name`: ceph version information. `ceph_versions` only
+
+Metrics:
+- `ceph_monitor_capacity_bytes`: Total storage capacity of the monitor node
+- `ceph_monitor_used_bytes`: Storage of the monitor node that is currently allocated for use
+- `ceph_monitor_avail_bytes`: Total unused storage capacity that the monitor node has left
+- `ceph_monitor_avail_percent`: Percentage of total unused storage capacity that the monitor node has left
+- `ceph_monitor_store_capacity_bytes`: Total capacity of the FileStore backing the monitor daemon
+- `ceph_monitor_store_sst_bytes`: Capacity of the FileStore used only for raw SSTs
+- `ceph_monitor_store_log_bytes`: Capacity of the FileStore used only for logging
+- `ceph_monitor_store_misc_bytes`: Capacity of the FileStore used only for storing miscellaneous information
+- `ceph_monitor_clock_skew_seconds`: Clock skew the monitor node is incurring
+- `ceph_monitor_latency_seconds`: Latency the monitor node is incurring
+- `ceph_monitor_quorum_count`: The total size of the monitor quorum
+- `ceph_versions`: Counts of current versioned daemons, parsed from `ceph versions`
+- `ceph_features`: Counts of current client features, parsed from `ceph features`
+
+## OSD collector
+
+OSD level metrics
+
+Labels:
+- `cluster`: cluster name
+- `osd`: OSD id
+- `device_class`: CRUSH device class
+- `host`: CRUSH host the OSD is in
+- `rack`: CRUSH rack the OSD is in
+- `root`: CRUSH root the OSD is in
+- `pgid`: PG id for recovery related metrics
+
+Metrics:
+- `ceph_osd_crush_weight`: OSD Crush Weight
+- `ceph_osd_depth`: OSD Depth
+- `ceph_osd_reweight`: OSD Reweight
+- `ceph_osd_bytes`: OSD Total Bytes
+- `ceph_osd_used_bytes`: OSD Used Storage in Bytes
+- `ceph_osd_avail_bytes`: OSD Available Storage in Bytes
+- `ceph_osd_utilization`: OSD Utilization
+- `ceph_osd_variance`: OSD Variance
+- `ceph_osd_pgs`: OSD Placement Group Count
+- `ceph_osd_pg_upmap_items_total`: OSD PG-Upmap Exception Table Entry Count
+- `ceph_osd_total_bytes`: OSD Total Storage Bytes
+- `ceph_osd_total_used_bytes`: OSD Total Used Storage Bytes
+- `ceph_osd_total_avail_bytes`: OSD Total Available Storage Bytes
+- `ceph_osd_average_utilization`: OSD Average Utilization
+- `ceph_osd_perf_commit_latency_seconds`: OSD Perf Commit Latency
+- `ceph_osd_perf_apply_latency_seconds`: OSD Perf Apply Latency
+- `ceph_osd_in`: OSD In Status
+- `ceph_osd_up`: OSD Up Status
+- `ceph_osd_full_ratio`: OSD Full Ratio Value
+- `ceph_osd_near_full_ratio`: OSD Near Full Ratio Value
+- `ceph_osd_backfill_full_ratio`: OSD Backfill Full Ratio Value
+- `ceph_osd_full`: OSD Full Status
+- `ceph_osd_near_full`: OSD Near Full Status
+- `ceph_osd_backfill_full`: OSD Backfill Full Status
+- `ceph_osd_down`: Number of OSDs down in the cluster
+- `ceph_osd_scrub_state`: State of OSDs involved in a scrub
+- `ceph_pg_objects_recovered`: Number of objects recovered in a PG
+- `ceph_osd_objects_backfilled`: Average number of objects backfilled in an OSD
+- `ceph_pg_oldest_inactive`: The amount of time in seconds that the oldest PG has been inactive for
+
+## Crash collector
+
+Ceph crash daemon related metrics
+
+Labels:
+- `cluster`: cluster name
+
+Metrics:
+- `ceph_crash_reports`: Count of crash reports per daemon, according to `ceph crash ls`
+
+## RBD Mirror collector
+
+Ceph RBD mirror health collector
+
+Labels:
+- `cluster`: cluster name
+
+Metrics:
+- `ceph_rbd_mirror_pool_status`: Health status of rbd-mirror, can vary only between 3 states (err:2, warn:1, ok:0)
+- `ceph_rbd_mirror_pool_daemon_status`: Health status of rbd-mirror daemons, can vary only between 3 states (err:2, warn:1, ok:0)
+- `ceph_rbd_mirror_pool_image_status`: Health status of rbd-mirror images, can vary only between 3 states (err:2, warn:1, ok:0)
+
+## RGW collector
+
+RGW related metrics. Only enabled if `RGW_MODE={1,2}` is set.
+
+Labels:
+- `cluster`: cluster name
+
+Metrics:
+- `ceph_rgw_gc_active_tasks`: RGW GC active task count
+- `ceph_rgw_gc_active_objects`: RGW GC active object count
+- `ceph_rgw_gc_pending_tasks`: RGW GC pending task count
+- `ceph_rgw_gc_pending_objects`: RGW GC pending object count
diff --git a/README.md b/README.md
index 8bb99e9..929261a 100644
--- a/README.md
+++ b/README.md
@@ -6,6 +6,8 @@ with the monitors using an appropriate wrapper over
 `rados_mon_command()`. Hence, no additional setup is necessary other than
 having a working Ceph cluster.
 
+A list of all the metrics collected is available in [METRICS.md](./METRICS.md).
+
 ## Dependencies
 
 You should ideally run this exporter from the client that can talk to the Ceph
@@ -129,6 +131,4 @@ can generate views like:
 
 ![](sample.png)
 
----
-
-Copyright @ 2016-2020 DigitalOcean™ Inc.
+Copyright @ 2016-2023 DigitalOcean™ Inc.