Merge pull request #233 from digitalocean/metricsDoc

doc: add metrics list and desc
Alexandre Marangone 2023-02-21 07:11:21 -08:00 committed by GitHub
commit 06e78e98ed
2 changed files with 228 additions and 3 deletions

METRICS.md (new file, 225 lines)

@@ -0,0 +1,225 @@
# Metrics Collected
The Ceph exporter implements multiple collectors:
## Cluster usage
General cluster-level usage data. An example query follows the list below.
Labels:
- `cluster`: cluster name
Metrics:
- `ceph_cluster_capacity_bytes`: Total capacity of the cluster
- `ceph_cluster_used_bytes`: Capacity of the cluster currently in use
- `ceph_cluster_available_bytes`: Available space within the cluster
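
All three metrics are gauges sharing the `cluster` label, so overall fullness is a simple ratio. A minimal sketch; the `cluster="prod"` label value is illustrative:

```promql
# Percentage of raw cluster capacity currently in use.
100 * ceph_cluster_used_bytes{cluster="prod"}
    / ceph_cluster_capacity_bytes{cluster="prod"}
```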
## Pool usage
Per-pool usage data. Example rate queries follow the list below.
Labels:
- `cluster`: cluster name
- `pool`: pool name
Metrics:
- `ceph_pool_used_bytes`: Capacity of the pool that is currently in use
- `ceph_pool_raw_used_bytes`: Raw capacity of the pool currently in use; this factors in the pool's size (replication or erasure-coding overhead)
- `ceph_pool_available_bytes`: Free space for the pool
- `ceph_pool_percent_used`: Percentage of the capacity available to this pool that is used by this pool
- `ceph_pool_objects_total`: Total no. of objects allocated within the pool
- `ceph_pool_dirty_objects_total`: Total no. of dirty objects in a cache-tier pool
- `ceph_pool_unfound_objects_total`: Total no. of unfound objects for the pool
- `ceph_pool_read_total`: Total read I/O calls for the pool
- `ceph_pool_read_bytes_total`: Total read throughput for the pool
- `ceph_pool_write_total`: Total write I/O calls for the pool
- `ceph_pool_write_bytes_total`: Total write throughput for the pool
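
The `*_total` metrics are counters, so per-second figures come from `rate()`. A minimal sketch using a typical 5-minute window:

```promql
# Per-pool client write throughput in bytes/s over the last 5 minutes.
rate(ceph_pool_write_bytes_total[5m])

# Per-pool read IOPS over the same window.
rate(ceph_pool_read_total[5m])
```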
## Pool info
General pool information. An example quota query follows the list below.
Labels:
- `cluster`: cluster name
- `pool`: pool name
- `root`: CRUSH root of the pool
- `profile`: `replicated` or the erasure-coding (EC) profile in use
Metrics:
- `ceph_pool_pg_num`: The total count of PGs allotted to a pool
- `ceph_pool_pgp_num`: The total count of PGs allotted to a pool and used for placements
- `ceph_pool_min_size`: Minimum number of copies or chunks of an object that need to be present for active I/O
- `ceph_pool_size`: Total copies or chunks of an object that need to be present for a healthy cluster
- `ceph_pool_quota_max_bytes`: Maximum amount of bytes of data allowed in a pool
- `ceph_pool_quota_max_objects`: Maximum amount of RADOS objects allowed in a pool
- `ceph_pool_stripe_width`: Stripe width of a RADOS object in a pool
- `ceph_pool_expansion_factor`: Data expansion multiplier for a pool
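
Combined with the pool usage metrics above, the quota metrics make headroom checks straightforward. A sketch that flags pools above 80% of their byte quota; the threshold is illustrative, and pools without a quota (a `ceph_pool_quota_max_bytes` of 0) are filtered out to avoid division by zero:

```promql
# Pools using more than 80% of their byte quota. Match only on
# cluster and pool, since pool-info metrics carry extra labels
# (root, profile).
ceph_pool_used_bytes
  / on(cluster, pool) (ceph_pool_quota_max_bytes > 0)
  > 0.8
```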
## Cluster health
Cluster health metrics. Example alerting expressions follow the list below.
Labels:
- `cluster`: cluster name
Metrics:
- `ceph_health_status`: Health status of the cluster; one of three states (err:2, warn:1, ok:0)
- `ceph_health_status_interp`: Health status of the cluster with warnings split into soft and critical; one of four states (err:3, critical_warn:2, soft_warn:1, ok:0)
- `ceph_mons_down`: Count of Mons that are in DOWN state
- `ceph_total_pgs`: Total no. of PGs in the cluster
- `ceph_pg_state`: State of PGs in the cluster
- `ceph_active_pgs`: No. of active PGs in the cluster
- `ceph_scrubbing_pgs`: No. of scrubbing PGs in the cluster
- `ceph_deep_scrubbing_pgs`: No. of deep scrubbing PGs in the cluster
- `ceph_recovering_pgs`: No. of recovering PGs in the cluster
- `ceph_recovery_wait_pgs`: No. of PGs in the cluster with recovery_wait state
- `ceph_backfilling_pgs`: No. of backfilling PGs in the cluster
- `ceph_backfill_wait_pgs`: No. of PGs in the cluster with backfill_wait state
- `ceph_forced_recovery_pgs`: No. of PGs in the cluster with forced_recovery state
- `ceph_forced_backfill_pgs`: No. of PGs in the cluster with forced_backfill state
- `ceph_down_pgs`: No. of PGs in the cluster in down state
- `ceph_incomplete_pgs`: No. of PGs in the cluster in incomplete state
- `ceph_inconsistent_pgs`: No. of PGs in the cluster in inconsistent state
- `ceph_snaptrim_pgs`: No. of snaptrim PGs in the cluster
- `ceph_snaptrim_wait_pgs`: No. of PGs in the cluster with snaptrim_wait state
- `ceph_repairing_pgs`: No. of PGs in the cluster with repair state
- `ceph_slow_requests`: No. of slow requests/slow ops
- `ceph_degraded_pgs`: No. of PGs in a degraded state
- `ceph_stuck_degraded_pgs`: No. of PGs stuck in a degraded state
- `ceph_unclean_pgs`: No. of PGs in an unclean state
- `ceph_stuck_unclean_pgs`: No. of PGs stuck in an unclean state
- `ceph_undersized_pgs`: No. of undersized PGs in the cluster
- `ceph_stuck_undersized_pgs`: No. of stuck undersized PGs in the cluster
- `ceph_stale_pgs`: No. of stale PGs in the cluster
- `ceph_stuck_stale_pgs`: No. of stuck stale PGs in the cluster
- `ceph_peering_pgs`: No. of peering PGs in the cluster
- `ceph_degraded_objects`: No. of degraded objects across all PGs, includes replicas
- `ceph_misplaced_objects`: No. of misplaced objects across all PGs, includes replicas
- `ceph_misplaced_ratio`: Ratio of misplaced objects to total objects
- `ceph_new_crash_reports`: Number of new crash reports available
- `ceph_osds_too_many_repair`: Number of OSDs with too many repaired reads
- `ceph_cluster_objects`: No. of RADOS objects within the cluster
- `ceph_osd_map_flags`: Gauge reflecting the OSDMap flags currently set on the cluster
- `ceph_osds_down`: Count of OSDs that are in DOWN state
- `ceph_osds_up`: Count of OSDs that are in UP state
- `ceph_osds_in`: Count of OSDs that are in IN state and available to serve requests
- `ceph_osds`: Count of total OSDs in the cluster
- `ceph_pgs_remapped`: No. of PGs that are remapped and incurring cluster-wide movement
- `ceph_recovery_io_bytes`: Rate of bytes being recovered in cluster per second
- `ceph_recovery_io_keys`: Rate of keys being recovered in cluster per second
- `ceph_recovery_io_objects`: Rate of objects being recovered in cluster per second
- `ceph_client_io_read_bytes`: Rate of bytes being read by all clients per second
- `ceph_client_io_write_bytes`: Rate of bytes being written by all clients per second
- `ceph_client_io_ops`: Total client ops on the cluster measured per second
- `ceph_client_io_read_ops`: Total client read I/O ops on the cluster measured per second
- `ceph_client_io_write_ops`: Total client write I/O ops on the cluster measured per second
- `ceph_cache_flush_io_bytes`: Rate of bytes being flushed from the cache pool per second
- `ceph_cache_evict_io_bytes`: Rate of bytes being evicted from the cache pool per second
- `ceph_cache_promote_io_ops`: Total cache promote operations measured per second
- `ceph_mgrs_active`: Count of active mgrs; either 0 or 1
- `ceph_mgrs`: Total number of mgrs, including standbys
- `ceph_rbd_mirror_up`: Alive rbd-mirror daemons
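
These are all gauges keyed only by `cluster`, so they combine directly into alerting expressions. A couple of sketches, using the state encodings listed above:

```promql
# Cluster health is warn (1) or err (2).
ceph_health_status > 0

# Rough fraction of objects currently degraded; degraded counts
# include replicas, so this slightly overstates the true ratio.
ceph_degraded_objects / ceph_cluster_objects
```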
## Ceph monitor
Ceph monitor metrics. Example queries follow the list below.
Labels:
- `cluster`: cluster name
- `daemon`: daemon name (`ceph_versions` and `ceph_features` only)
- `release`, `features`: Ceph feature name and feature flag (`ceph_features` only)
- `version_tag`, `sha1`, `release_name`: Ceph version information (`ceph_versions` only)
Metrics:
- `ceph_monitor_capacity_bytes`: Total storage capacity of the monitor node
- `ceph_monitor_used_bytes`: Storage of the monitor node that is currently allocated for use
- `ceph_monitor_avail_bytes`: Total unused storage capacity that the monitor node has left
- `ceph_monitor_avail_percent`: Percentage of total unused storage capacity that the monitor node has left
- `ceph_monitor_store_capacity_bytes`: Total capacity of the FileStore backing the monitor daemon
- `ceph_monitor_store_sst_bytes`: Capacity of the FileStore used only for raw SSTs
- `ceph_monitor_store_log_bytes`: Capacity of the FileStore used only for logging
- `ceph_monitor_store_misc_bytes`: Capacity of the FileStore used only for storing miscellaneous information
- `ceph_monitor_clock_skew_seconds`: Clock skew the monitor node is incurring
- `ceph_monitor_latency_seconds`: Latency the monitor node is incurring
- `ceph_monitor_quorum_count`: The total size of the monitor quorum
- `ceph_versions`: Counts of current versioned daemons, parsed from `ceph versions`
- `ceph_features`: Counts of current client features, parsed from `ceph features`
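
Two common checks are sketched below; the 50ms skew threshold is illustrative:

```promql
# Monitors drifting noticeably from the cluster clock.
ceph_monitor_clock_skew_seconds > 0.05

# Number of distinct Ceph versions running across daemons,
# handy for spotting half-finished upgrades.
count by (cluster) (count by (cluster, version_tag) (ceph_versions))
```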
## OSD collector
OSD-level metrics. An example query follows the list below.
Labels:
- `cluster`: cluster name
- `osd`: OSD id
- `device_class`: CRUSH device class
- `host`: CRUSH host the OSD is in
- `rack`: CRUSH rack the OSD is in
- `root`: CRUSH root the OSD is in
- `pgid`: PG id for recovery related metrics
Metrics:
- `ceph_osd_crush_weight`: OSD Crush Weight
- `ceph_osd_depth`: OSD Depth
- `ceph_osd_reweight`: OSD Reweight
- `ceph_osd_bytes`: OSD Total Bytes
- `ceph_osd_used_bytes`: OSD Used Storage in Bytes
- `ceph_osd_avail_bytes`: OSD Available Storage in Bytes
- `ceph_osd_utilization`: OSD Utilization
- `ceph_osd_variance`: OSD Variance
- `ceph_osd_pgs`: OSD Placement Group Count
- `ceph_osd_pg_upmap_items_total`: OSD PG-Upmap Exception Table Entry Count
- `ceph_osd_total_bytes`: OSD Total Storage Bytes
- `ceph_osd_total_used_bytes`: OSD Total Used Storage Bytes
- `ceph_osd_total_avail_bytes`: OSD Total Available Storage Bytes
- `ceph_osd_average_utilization`: OSD Average Utilization
- `ceph_osd_perf_commit_latency_seconds`: OSD Perf Commit Latency
- `ceph_osd_perf_apply_latency_seconds`: OSD Perf Apply Latency
- `ceph_osd_in`: OSD In Status
- `ceph_osd_up`: OSD Up Status
- `ceph_osd_full_ratio`: OSD Full Ratio Value
- `ceph_osd_near_full_ratio`: OSD Near Full Ratio Value
- `ceph_osd_backfill_full_ratio`: OSD Backfill Full Ratio Value
- `ceph_osd_full`: OSD Full Status
- `ceph_osd_near_full`: OSD Near Full Status
- `ceph_osd_backfill_full`: OSD Backfill Full Status
- `ceph_osd_down`: Number of OSDs down in the cluster
- `ceph_osd_scrub_state`: State of OSDs involved in a scrub
- `ceph_pg_objects_recovered`: Number of objects recovered in a PG
- `ceph_osd_objects_backfilled`: Average number of objects backfilled in an OSD
- `ceph_pg_oldest_inactive`: The amount of time in seconds that the oldest PG has been inactive for
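
`ceph_osd_average_utilization` is cluster-wide while `ceph_osd_utilization` is per-OSD, so comparing them needs a many-to-one match. A sketch that flags outliers, assuming both are percentages as reported by `ceph osd df`; the 10-point threshold is illustrative:

```promql
# OSDs more than 10 percentage points above the cluster average.
ceph_osd_utilization
  > on(cluster) group_left() (ceph_osd_average_utilization + 10)
```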
## Crash collector
Ceph crash-daemon related metrics. An example query follows the list below.
Labels:
- `cluster`: cluster name
Metrics:
- `ceph_crash_reports`: Count of crash reports per daemon, according to `ceph crash ls`
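
Since this gauge tracks the cumulative output of `ceph crash ls`, growth over time signals fresh crashes. A sketch:

```promql
# Crash reports that appeared within the last hour.
delta(ceph_crash_reports[1h]) > 0
```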
## RBD Mirror collector
Ceph RBD mirror health collector. An example expression follows the list below.
Labels:
- `cluster`: cluster name
Metrics:
- `ceph_rbd_mirror_pool_status`: Health status of rbd-mirror; one of three states (err:2, warn:1, ok:0)
- `ceph_rbd_mirror_pool_daemon_status`: Health status of rbd-mirror daemons; one of three states (err:2, warn:1, ok:0)
- `ceph_rbd_mirror_pool_image_status`: Health status of rbd-mirror images; one of three states (err:2, warn:1, ok:0)
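
All three status metrics share the same state encoding, so a single expression can watch them together. A sketch:

```promql
# Any rbd-mirror status out of the ok state.
ceph_rbd_mirror_pool_status > 0
  or ceph_rbd_mirror_pool_daemon_status > 0
  or ceph_rbd_mirror_pool_image_status > 0
```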
## RGW collector
RGW-related metrics, enabled only if `RGW_MODE={1,2}` is set. An example query follows the list below.
Labels:
- `cluster`: cluster name
Metrics:
- `ceph_rgw_gc_active_tasks`: RGW GC active task count
- `ceph_rgw_gc_active_objects`: RGW GC active object count
- `ceph_rgw_gc_pending_tasks`: RGW GC pending task count
- `ceph_rgw_gc_pending_objects`: RGW GC pending object count
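
A backlog that keeps growing usually means garbage collection is not keeping up with object deletion. A sketch using a 30-minute trend window (window choice illustrative):

```promql
# Pending GC objects trending upward over the last 30 minutes.
deriv(ceph_rgw_gc_pending_objects[30m]) > 0
```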

README.md

@@ -6,6 +6,8 @@ with the monitors using an appropriate wrapper over
 `rados_mon_command()`. Hence, no additional setup is necessary other than
 having a working Ceph cluster.
+A list of all the metrics collected is available on the [METRICS.md](./METRICS.md) page.

 ## Dependencies

 You should ideally run this exporter from the client that can talk to the Ceph
@@ -129,6 +131,4 @@ can generate views like:

 ![](sample.png)

----
-Copyright @ 2016-2020 DigitalOcean™ Inc.
+Copyright @ 2016-2023 DigitalOcean™ Inc.