Merge pull request #233 from digitalocean/metricsDoc

doc: add metrics list and desc
Alexandre Marangone 2023-02-21 07:11:21 -08:00 committed by GitHub
commit 06e78e98ed
2 changed files with 228 additions and 3 deletions

METRICS.md (new file, 225 lines)

@@ -0,0 +1,225 @@
# Metrics Collected
The Ceph exporter implements multiple collectors:
## Cluster usage
General cluster-level usage data. An example query follows the list below.
Labels:
- `cluster`: cluster name
Metrics:
- `ceph_cluster_capacity_bytes`: Total capacity of the cluster
- `ceph_cluster_used_bytes`: Capacity of the cluster currently in use
- `ceph_cluster_available_bytes`: Available space within the cluster
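
All three metrics are gauges sharing the `cluster` label, so overall fullness is a simple ratio. A minimal sketch; the `cluster="prod"` label value is illustrative:

```promql
# Percentage of raw cluster capacity currently in use.
100 * ceph_cluster_used_bytes{cluster="prod"}
    / ceph_cluster_capacity_bytes{cluster="prod"}
```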
## Pool usage
Per-pool usage data. Example rate queries follow the list below.
Labels:
- `cluster`: cluster name
- `pool`: pool name
Metrics:
- `ceph_pool_used_bytes`: Capacity of the pool that is currently in use
- `ceph_pool_raw_used_bytes`: Raw capacity of the pool currently in use; this factors in the pool's size (replication or erasure-coding overhead)
- `ceph_pool_available_bytes`: Free space for the pool
- `ceph_pool_percent_used`: Percentage of the capacity available to this pool that is used by this pool
- `ceph_pool_objects_total`: Total no. of objects allocated within the pool
- `ceph_pool_dirty_objects_total`: Total no. of dirty objects in a cache-tier pool
- `ceph_pool_unfound_objects_total`: Total no. of unfound objects for the pool
- `ceph_pool_read_total`: Total read I/O calls for the pool
- `ceph_pool_read_bytes_total`: Total read throughput for the pool
- `ceph_pool_write_total`: Total write I/O calls for the pool
- `ceph_pool_write_bytes_total`: Total write throughput for the pool
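
The `*_total` metrics are counters, so per-second figures come from `rate()`. A minimal sketch using a typical 5-minute window:

```promql
# Per-pool client write throughput in bytes/s over the last 5 minutes.
rate(ceph_pool_write_bytes_total[5m])

# Per-pool read IOPS over the same window.
rate(ceph_pool_read_total[5m])
```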
## Pool info
General pool information. An example quota query follows the list below.
Labels:
- `cluster`: cluster name
- `pool`: pool name
- `root`: CRUSH root of the pool
- `profile`: `replicated` or the erasure-coding (EC) profile in use
Metrics:
- `ceph_pool_pg_num`: The total count of PGs allotted to a pool
- `ceph_pool_pgp_num`: The total count of PGs allotted to a pool and used for placements
- `ceph_pool_min_size`: Minimum number of copies or chunks of an object that need to be present for active I/O
- `ceph_pool_size`: Total copies or chunks of an object that need to be present for a healthy cluster
- `ceph_pool_quota_max_bytes`: Maximum amount of bytes of data allowed in a pool
- `ceph_pool_quota_max_objects`: Maximum amount of RADOS objects allowed in a pool
- `ceph_pool_stripe_width`: Stripe width of a RADOS object in a pool
- `ceph_pool_expansion_factor`: Data expansion multiplier for a pool
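
Combined with the pool usage metrics above, the quota metrics make headroom checks straightforward. A sketch that flags pools above 80% of their byte quota; the threshold is illustrative, and pools without a quota (a `ceph_pool_quota_max_bytes` of 0) are filtered out to avoid division by zero:

```promql
# Pools using more than 80% of their byte quota. Match only on
# cluster and pool, since pool-info metrics carry extra labels
# (root, profile).
ceph_pool_used_bytes
  / on(cluster, pool) (ceph_pool_quota_max_bytes > 0)
  > 0.8
```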
## Cluster health
Cluster health metrics. Example alerting expressions follow the list below.
Labels:
- `cluster`: cluster name
Metrics:
- `ceph_health_status`: Health status of the cluster; one of three states (err:2, warn:1, ok:0)
- `ceph_health_status_interp`: Health status of the cluster with warnings split into soft and critical; one of four states (err:3, critical_warn:2, soft_warn:1, ok:0)
- `ceph_mons_down`: Count of Mons that are in DOWN state
- `ceph_total_pgs`: Total no. of PGs in the cluster
- `ceph_pg_state`: State of PGs in the cluster
- `ceph_active_pgs`: No. of active PGs in the cluster
- `ceph_scrubbing_pgs`: No. of scrubbing PGs in the cluster
- `ceph_deep_scrubbing_pgs`: No. of deep scrubbing PGs in the cluster
- `ceph_recovering_pgs`: No. of recovering PGs in the cluster
- `ceph_recovery_wait_pgs`: No. of PGs in the cluster with recovery_wait state
- `ceph_backfilling_pgs`: No. of backfilling PGs in the cluster
- `ceph_backfill_wait_pgs`: No. of PGs in the cluster with backfill_wait state
- `ceph_forced_recovery_pgs`: No. of PGs in the cluster with forced_recovery state
- `ceph_forced_backfill_pgs`: No. of PGs in the cluster with forced_backfill state
- `ceph_down_pgs`: No. of PGs in the cluster in down state
- `ceph_incomplete_pgs`: No. of PGs in the cluster in incomplete state
- `ceph_inconsistent_pgs`: No. of PGs in the cluster in inconsistent state
- `ceph_snaptrim_pgs`: No. of snaptrim PGs in the cluster
- `ceph_snaptrim_wait_pgs`: No. of PGs in the cluster with snaptrim_wait state
- `ceph_repairing_pgs`: No. of PGs in the cluster with repair state
- `ceph_slow_requests`: No. of slow requests/slow ops
- `ceph_degraded_pgs`: No. of PGs in a degraded state
- `ceph_stuck_degraded_pgs`: No. of PGs stuck in a degraded state
- `ceph_unclean_pgs`: No. of PGs in an unclean state
- `ceph_stuck_unclean_pgs`: No. of PGs stuck in an unclean state
- `ceph_undersized_pgs`: No. of undersized PGs in the cluster
- `ceph_stuck_undersized_pgs`: No. of stuck undersized PGs in the cluster
- `ceph_stale_pgs`: No. of stale PGs in the cluster
- `ceph_stuck_stale_pgs`: No. of stuck stale PGs in the cluster
- `ceph_peering_pgs`: No. of peering PGs in the cluster
- `ceph_degraded_objects`: No. of degraded objects across all PGs, includes replicas
- `ceph_misplaced_objects`: No. of misplaced objects across all PGs, includes replicas
- `ceph_misplaced_ratio`: Ratio of misplaced objects to total objects
- `ceph_new_crash_reports`: Number of new crash reports available
- `ceph_osds_too_many_repair`: Number of OSDs with too many repaired reads
- `ceph_cluster_objects`: No. of RADOS objects within the cluster
- `ceph_osd_map_flags`: Gauge reflecting the OSDMap flags currently set on the cluster
- `ceph_osds_down`: Count of OSDs that are in DOWN state
- `ceph_osds_up`: Count of OSDs that are in UP state
- `ceph_osds_in`: Count of OSDs that are in IN state and available to serve requests
- `ceph_osds`: Count of total OSDs in the cluster
- `ceph_pgs_remapped`: No. of PGs that are remapped and incurring cluster-wide movement
- `ceph_recovery_io_bytes`: Rate of bytes being recovered in cluster per second
- `ceph_recovery_io_keys`: Rate of keys being recovered in cluster per second
- `ceph_recovery_io_objects`: Rate of objects being recovered in cluster per second
- `ceph_client_io_read_bytes`: Rate of bytes being read by all clients per second
- `ceph_client_io_write_bytes`: Rate of bytes being written by all clients per second
- `ceph_client_io_ops`: Total client ops on the cluster measured per second
- `ceph_client_io_read_ops`: Total client read I/O ops on the cluster measured per second
- `ceph_client_io_write_ops`: Total client write I/O ops on the cluster measured per second
- `ceph_cache_flush_io_bytes`: Rate of bytes being flushed from the cache pool per second
- `ceph_cache_evict_io_bytes`: Rate of bytes being evicted from the cache pool per second
- `ceph_cache_promote_io_ops`: Total cache promote operations measured per second
- `ceph_mgrs_active`: Count of active mgrs; either 0 or 1
- `ceph_mgrs`: Total number of mgrs, including standbys
- `ceph_rbd_mirror_up`: Alive rbd-mirror daemons
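
These are all gauges keyed only by `cluster`, so they combine directly into alerting expressions. A couple of sketches, using the state encodings listed above:

```promql
# Cluster health is warn (1) or err (2).
ceph_health_status > 0

# Rough fraction of objects currently degraded; degraded counts
# include replicas, so this slightly overstates the true ratio.
ceph_degraded_objects / ceph_cluster_objects
```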
## Ceph monitor
Ceph monitor metrics. Example queries follow the list below.
Labels:
- `cluster`: cluster name
- `daemon`: daemon name (`ceph_versions` and `ceph_features` only)
- `release`, `features`: Ceph feature name and feature flag (`ceph_features` only)
- `version_tag`, `sha1`, `release_name`: Ceph version information (`ceph_versions` only)
Metrics:
- `ceph_monitor_capacity_bytes`: Total storage capacity of the monitor node
- `ceph_monitor_used_bytes`: Storage of the monitor node that is currently allocated for use
- `ceph_monitor_avail_bytes`: Total unused storage capacity that the monitor node has left
- `ceph_monitor_avail_percent`: Percentage of total unused storage capacity that the monitor node has left
- `ceph_monitor_store_capacity_bytes`: Total capacity of the FileStore backing the monitor daemon
- `ceph_monitor_store_sst_bytes`: Capacity of the FileStore used only for raw SSTs
- `ceph_monitor_store_log_bytes`: Capacity of the FileStore used only for logging
- `ceph_monitor_store_misc_bytes`: Capacity of the FileStore used only for storing miscellaneous information
- `ceph_monitor_clock_skew_seconds`: Clock skew the monitor node is incurring
- `ceph_monitor_latency_seconds`: Latency the monitor node is incurring
- `ceph_monitor_quorum_count`: The total size of the monitor quorum
- `ceph_versions`: Counts of current versioned daemons, parsed from `ceph versions`
- `ceph_features`: Counts of current client features, parsed from `ceph features`
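
Two common checks are sketched below; the 50ms skew threshold is illustrative:

```promql
# Monitors drifting noticeably from the cluster clock.
ceph_monitor_clock_skew_seconds > 0.05

# Number of distinct Ceph versions running across daemons,
# handy for spotting half-finished upgrades.
count by (cluster) (count by (cluster, version_tag) (ceph_versions))
```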
## OSD collector
OSD-level metrics. An example query follows the list below.
Labels:
- `cluster`: cluster name
- `osd`: OSD id
- `device_class`: CRUSH device class
- `host`: CRUSH host the OSD is in
- `rack`: CRUSH rack the OSD is in
- `root`: CRUSH root the OSD is in
- `pgid`: PG id for recovery related metrics
Metrics:
- `ceph_osd_crush_weight`: OSD Crush Weight
- `ceph_osd_depth`: OSD Depth
- `ceph_osd_reweight`: OSD Reweight
- `ceph_osd_bytes`: OSD Total Bytes
- `ceph_osd_used_bytes`: OSD Used Storage in Bytes
- `ceph_osd_avail_bytes`: OSD Available Storage in Bytes
- `ceph_osd_utilization`: OSD Utilization
- `ceph_osd_variance`: OSD Variance
- `ceph_osd_pgs`: OSD Placement Group Count
- `ceph_osd_pg_upmap_items_total`: OSD PG-Upmap Exception Table Entry Count
- `ceph_osd_total_bytes`: OSD Total Storage Bytes
- `ceph_osd_total_used_bytes`: OSD Total Used Storage Bytes
- `ceph_osd_total_avail_bytes`: OSD Total Available Storage Bytes
- `ceph_osd_average_utilization`: OSD Average Utilization
- `ceph_osd_perf_commit_latency_seconds`: OSD Perf Commit Latency
- `ceph_osd_perf_apply_latency_seconds`: OSD Perf Apply Latency
- `ceph_osd_in`: OSD In Status
- `ceph_osd_up`: OSD Up Status
- `ceph_osd_full_ratio`: OSD Full Ratio Value
- `ceph_osd_near_full_ratio`: OSD Near Full Ratio Value
- `ceph_osd_backfill_full_ratio`: OSD Backfill Full Ratio Value
- `ceph_osd_full`: OSD Full Status
- `ceph_osd_near_full`: OSD Near Full Status
- `ceph_osd_backfill_full`: OSD Backfill Full Status
- `ceph_osd_down`: Number of OSDs down in the cluster
- `ceph_osd_scrub_state`: State of OSDs involved in a scrub
- `ceph_pg_objects_recovered`: Number of objects recovered in a PG
- `ceph_osd_objects_backfilled`: Average number of objects backfilled in an OSD
- `ceph_pg_oldest_inactive`: The amount of time in seconds that the oldest PG has been inactive for
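
`ceph_osd_average_utilization` is cluster-wide while `ceph_osd_utilization` is per-OSD, so comparing them needs a many-to-one match. A sketch that flags outliers, assuming both are percentages as reported by `ceph osd df`; the 10-point threshold is illustrative:

```promql
# OSDs more than 10 percentage points above the cluster average.
ceph_osd_utilization
  > on(cluster) group_left() (ceph_osd_average_utilization + 10)
```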
## Crash collector
Ceph crash-daemon related metrics. An example query follows the list below.
Labels:
- `cluster`: cluster name
Metrics:
- `ceph_crash_reports`: Count of crash reports per daemon, according to `ceph crash ls`
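
Since this gauge tracks the cumulative output of `ceph crash ls`, growth over time signals fresh crashes. A sketch:

```promql
# Crash reports that appeared within the last hour.
delta(ceph_crash_reports[1h]) > 0
```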
## RBD Mirror collector
Ceph RBD mirror health collector. An example expression follows the list below.
Labels:
- `cluster`: cluster name
Metrics:
- `ceph_rbd_mirror_pool_status`: Health status of rbd-mirror; one of three states (err:2, warn:1, ok:0)
- `ceph_rbd_mirror_pool_daemon_status`: Health status of rbd-mirror daemons; one of three states (err:2, warn:1, ok:0)
- `ceph_rbd_mirror_pool_image_status`: Health status of rbd-mirror images; one of three states (err:2, warn:1, ok:0)
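
All three status metrics share the same state encoding, so a single expression can watch them together. A sketch:

```promql
# Any rbd-mirror status out of the ok state.
ceph_rbd_mirror_pool_status > 0
  or ceph_rbd_mirror_pool_daemon_status > 0
  or ceph_rbd_mirror_pool_image_status > 0
```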
## RGW collector
RGW-related metrics, enabled only if `RGW_MODE={1,2}` is set. An example query follows the list below.
Labels:
- `cluster`: cluster name
Metrics:
- `ceph_rgw_gc_active_tasks`: RGW GC active task count
- `ceph_rgw_gc_active_objects`: RGW GC active object count
- `ceph_rgw_gc_pending_tasks`: RGW GC pending task count
- `ceph_rgw_gc_pending_objects`: RGW GC pending object count
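
A backlog that keeps growing usually means garbage collection is not keeping up with object deletion. A sketch using a 30-minute trend window (window choice illustrative):

```promql
# Pending GC objects trending upward over the last 30 minutes.
deriv(ceph_rgw_gc_pending_objects[30m]) > 0
```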

README.md

@@ -6,6 +6,8 @@ with the monitors using an appropriate wrapper over
 `rados_mon_command()`. Hence, no additional setup is necessary other than
 having a working Ceph cluster.
+A list of all the metrics collected is available on the [METRICS.md](./METRICS.md) page.

 ## Dependencies

 You should ideally run this exporter from the client that can talk to the Ceph
@@ -129,6 +131,4 @@ can generate views like:

 ![](sample.png)

----
-Copyright @ 2016-2020 DigitalOcean™ Inc.
+Copyright @ 2016-2023 DigitalOcean™ Inc.