From 6f6df03e8c6b7ad150038509f4e45e70ae716479 Mon Sep 17 00:00:00 2001
From: Alex Marangone
Date: Fri, 17 Feb 2023 09:16:56 -0800
Subject: [PATCH] doc: add metrics list and desc

---
 METRICS.md | 225 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 README.md  |   6 +-
 2 files changed, 228 insertions(+), 3 deletions(-)
 create mode 100644 METRICS.md

diff --git a/METRICS.md b/METRICS.md
new file mode 100644
index 0000000..53d4092
--- /dev/null
+++ b/METRICS.md
@@ -0,0 +1,225 @@
+# Metrics Collected
+
+Ceph exporter implements multiple collectors:
+
+## Cluster usage
+
+General cluster-level data usage.
+
+Labels:
+- `cluster`: cluster name
+
+Metrics:
+- `ceph_cluster_capacity_bytes`: Total capacity of the cluster
+- `ceph_cluster_used_bytes`: Capacity of the cluster currently in use
+- `ceph_cluster_available_bytes`: Available space within the cluster
+
+## Pool usage
+
+Per-pool usage data.
+
+Labels:
+- `cluster`: cluster name
+- `pool`: pool name
+
+Metrics:
+- `ceph_pool_used_bytes`: Capacity of the pool that is currently in use
+- `ceph_pool_raw_used_bytes`: Raw capacity of the pool that is currently in use; this factors in the pool's replication or erasure-coding size
+- `ceph_pool_available_bytes`: Free space for the pool
+- `ceph_pool_percent_used`: Percentage of the capacity available to this pool that is used by this pool
+- `ceph_pool_objects_total`: Total no. of objects allocated within the pool
+- `ceph_pool_dirty_objects_total`: Total no. of dirty objects in a cache-tier pool
+- `ceph_pool_unfound_objects_total`: Total no. of unfound objects for the pool
+- `ceph_pool_read_total`: Total read I/O calls for the pool
+- `ceph_pool_read_bytes_total`: Total read throughput for the pool
+- `ceph_pool_write_total`: Total write I/O calls for the pool
+- `ceph_pool_write_bytes_total`: Total write throughput for the pool
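+
+As an illustration, these gauges and counters can be combined directly in PromQL. The queries below are only a sketch; they assume the exporter is being scraped by Prometheus, that `mycluster` and `mypool` are placeholder label values, and that the `*_total` series behave as Prometheus counters:
+
+```promql
+# Fraction of raw cluster capacity currently in use
+ceph_cluster_used_bytes / ceph_cluster_capacity_bytes
+
+# Per-pool write throughput in bytes/s over the last 5 minutes
+rate(ceph_pool_write_bytes_total{cluster="mycluster", pool="mypool"}[5m])
+```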
+
+## Pool info
+
+General pool information.
+
+Labels:
+- `cluster`: cluster name
+- `pool`: pool name
+- `root`: CRUSH root of the pool
+- `profile`: `replicated` or the EC profile being used
+
+Metrics:
+- `ceph_pool_pg_num`: The total count of PGs allotted to a pool
+- `ceph_pool_pgp_num`: The total count of PGs allotted to a pool and used for placements
+- `ceph_pool_min_size`: Minimum number of copies or chunks of an object that need to be present for active I/O
+- `ceph_pool_size`: Total copies or chunks of an object that need to be present for a healthy cluster
+- `ceph_pool_quota_max_bytes`: Maximum amount of bytes of data allowed in a pool
+- `ceph_pool_quota_max_objects`: Maximum number of RADOS objects allowed in a pool
+- `ceph_pool_stripe_width`: Stripe width of a RADOS object in a pool
+- `ceph_pool_expansion_factor`: Data expansion multiplier for a pool
+
+## Cluster health
+
+Cluster health metrics.
+
+Labels:
+- `cluster`: cluster name
+
+Metrics:
+- `ceph_health_status`: Health status of the cluster, can vary only between 3 states (err:2, warn:1, ok:0)
+- `ceph_health_status_interp`: Health status of the cluster, can vary only between 4 states (err:3, critical_warn:2, soft_warn:1, ok:0)
+- `ceph_mons_down`: Count of Mons that are in DOWN state
+- `ceph_total_pgs`: Total no. of PGs in the cluster
+- `ceph_pg_state`: State of PGs in the cluster
+- `ceph_active_pgs`: No. of active PGs in the cluster
+- `ceph_scrubbing_pgs`: No. of scrubbing PGs in the cluster
+- `ceph_deep_scrubbing_pgs`: No. of deep-scrubbing PGs in the cluster
+- `ceph_recovering_pgs`: No. of recovering PGs in the cluster
+- `ceph_recovery_wait_pgs`: No. of PGs in the cluster with recovery_wait state
+- `ceph_backfilling_pgs`: No. of backfilling PGs in the cluster
+- `ceph_backfill_wait_pgs`: No. of PGs in the cluster with backfill_wait state
+- `ceph_forced_recovery_pgs`: No. of PGs in the cluster with forced_recovery state
+- `ceph_forced_backfill_pgs`: No. of PGs in the cluster with forced_backfill state
+- `ceph_down_pgs`: No. of PGs in the cluster in down state
+- `ceph_incomplete_pgs`: No. of PGs in the cluster in incomplete state
+- `ceph_inconsistent_pgs`: No. of PGs in the cluster in inconsistent state
+- `ceph_snaptrim_pgs`: No. of snaptrim PGs in the cluster
+- `ceph_snaptrim_wait_pgs`: No. of PGs in the cluster with snaptrim_wait state
+- `ceph_repairing_pgs`: No. of PGs in the cluster with repair state
+- `ceph_slow_requests`: No. of slow requests/slow ops
+- `ceph_degraded_pgs`: No. of PGs in a degraded state
+- `ceph_stuck_degraded_pgs`: No. of PGs stuck in a degraded state
+- `ceph_unclean_pgs`: No. of PGs in an unclean state
+- `ceph_stuck_unclean_pgs`: No. of PGs stuck in an unclean state
+- `ceph_undersized_pgs`: No. of undersized PGs in the cluster
+- `ceph_stuck_undersized_pgs`: No. of stuck undersized PGs in the cluster
+- `ceph_stale_pgs`: No. of stale PGs in the cluster
+- `ceph_stuck_stale_pgs`: No. of stuck stale PGs in the cluster
+- `ceph_peering_pgs`: No. of peering PGs in the cluster
+- `ceph_degraded_objects`: No. of degraded objects across all PGs, includes replicas
+- `ceph_misplaced_objects`: No. of misplaced objects across all PGs, includes replicas
+- `ceph_misplaced_ratio`: Ratio of misplaced objects to total objects
+- `ceph_new_crash_reports`: Number of new crash reports available
+- `ceph_osds_too_many_repair`: Number of OSDs with too many repaired reads
+- `ceph_cluster_objects`: No. of RADOS objects within the cluster
+- `ceph_osd_map_flags`: A metric for all OSDMap flags
+- `ceph_osds_down`: Count of OSDs that are in DOWN state
+- `ceph_osds_up`: Count of OSDs that are in UP state
+- `ceph_osds_in`: Count of OSDs that are in IN state and available to serve requests
+- `ceph_osds`: Count of total OSDs in the cluster
+- `ceph_pgs_remapped`: No. of PGs that are remapped and incurring cluster-wide movement
+- `ceph_recovery_io_bytes`: Rate of bytes being recovered in the cluster per second
+- `ceph_recovery_io_keys`: Rate of keys being recovered in the cluster per second
+- `ceph_recovery_io_objects`: Rate of objects being recovered in the cluster per second
+- `ceph_client_io_read_bytes`: Rate of bytes being read by all clients per second
+- `ceph_client_io_write_bytes`: Rate of bytes being written by all clients per second
+- `ceph_client_io_ops`: Total client ops on the cluster measured per second
+- `ceph_client_io_read_ops`: Total client read I/O ops on the cluster measured per second
+- `ceph_client_io_write_ops`: Total client write I/O ops on the cluster measured per second
+- `ceph_cache_flush_io_bytes`: Rate of bytes being flushed from the cache pool per second
+- `ceph_cache_evict_io_bytes`: Rate of bytes being evicted from the cache pool per second
+- `ceph_cache_promote_io_ops`: Total cache promote operations measured per second
+- `ceph_mgrs_active`: Count of active mgrs, can be either 0 or 1
+- `ceph_mgrs`: Total number of mgrs, including standbys
+- `ceph_rbd_mirror_up`: Alive rbd-mirror daemons
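+
+Most of these health metrics are gauges that can be alerted on directly. A couple of illustrative PromQL expressions that could back alert rules (the thresholds here are arbitrary examples, not something the exporter ships):
+
+```promql
+# Cluster health has left the ok state (warn:1 or err:2)
+ceph_health_status > 0
+
+# One or more OSDs are reported down
+ceph_osds_down > 0
+
+# Fraction of PGs that are not active
+1 - (ceph_active_pgs / ceph_total_pgs)
+```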
+
+## Ceph monitor
+
+Ceph monitor metrics.
+
+Labels:
+- `cluster`: cluster name
+- `daemon`: daemon name (`ceph_versions` and `ceph_features` only)
+- `release`, `features`: Ceph feature name and feature flag (`ceph_features` only)
+- `version_tag`, `sha1`, `release_name`: Ceph version information (`ceph_versions` only)
+
+Metrics:
+- `ceph_monitor_capacity_bytes`: Total storage capacity of the monitor node
+- `ceph_monitor_used_bytes`: Storage of the monitor node that is currently allocated for use
+- `ceph_monitor_avail_bytes`: Total unused storage capacity that the monitor node has left
+- `ceph_monitor_avail_percent`: Percentage of total unused storage capacity that the monitor node has left
+- `ceph_monitor_store_capacity_bytes`: Total capacity of the FileStore backing the monitor daemon
+- `ceph_monitor_store_sst_bytes`: Capacity of the FileStore used only for raw SSTs
+- `ceph_monitor_store_log_bytes`: Capacity of the FileStore used only for logging
+- `ceph_monitor_store_misc_bytes`: Capacity of the FileStore used only for storing miscellaneous information
+- `ceph_monitor_clock_skew_seconds`: Clock skew the monitor node is incurring
+- `ceph_monitor_latency_seconds`: Latency the monitor node is incurring
+- `ceph_monitor_quorum_count`: The total size of the monitor quorum
+- `ceph_versions`: Counts of current versioned daemons, parsed from `ceph versions`
+- `ceph_features`: Counts of current client features, parsed from `ceph features`
+
+## OSD collector
+
+OSD-level metrics.
+
+Labels:
+- `cluster`: cluster name
+- `osd`: OSD id
+- `device_class`: CRUSH device class
+- `host`: CRUSH host the OSD is in
+- `rack`: CRUSH rack the OSD is in
+- `root`: CRUSH root the OSD is in
+- `pgid`: PG id for recovery-related metrics
+
+Metrics:
+- `ceph_osd_crush_weight`: OSD CRUSH Weight
+- `ceph_osd_depth`: OSD Depth
+- `ceph_osd_reweight`: OSD Reweight
+- `ceph_osd_bytes`: OSD Total Bytes
+- `ceph_osd_used_bytes`: OSD Used Storage in Bytes
+- `ceph_osd_avail_bytes`: OSD Available Storage in Bytes
+- `ceph_osd_utilization`: OSD Utilization
+- `ceph_osd_variance`: OSD Variance
+- `ceph_osd_pgs`: OSD Placement Group Count
+- `ceph_osd_pg_upmap_items_total`: OSD PG-Upmap Exception Table Entry Count
+- `ceph_osd_total_bytes`: OSD Total Storage Bytes
+- `ceph_osd_total_used_bytes`: OSD Total Used Storage Bytes
+- `ceph_osd_total_avail_bytes`: OSD Total Available Storage Bytes
+- `ceph_osd_average_utilization`: OSD Average Utilization
+- `ceph_osd_perf_commit_latency_seconds`: OSD Perf Commit Latency
+- `ceph_osd_perf_apply_latency_seconds`: OSD Perf Apply Latency
+- `ceph_osd_in`: OSD In Status
+- `ceph_osd_up`: OSD Up Status
+- `ceph_osd_full_ratio`: OSD Full Ratio Value
+- `ceph_osd_near_full_ratio`: OSD Near Full Ratio Value
+- `ceph_osd_backfill_full_ratio`: OSD Backfill Full Ratio Value
+- `ceph_osd_full`: OSD Full Status
+- `ceph_osd_near_full`: OSD Near Full Status
+- `ceph_osd_backfill_full`: OSD Backfill Full Status
+- `ceph_osd_down`: Number of OSDs down in the cluster
+- `ceph_osd_scrub_state`: State of OSDs involved in a scrub
+- `ceph_pg_objects_recovered`: Number of objects recovered in a PG
+- `ceph_osd_objects_backfilled`: Average number of objects backfilled in an OSD
+- `ceph_pg_oldest_inactive`: The amount of time in seconds that the oldest PG has been inactive
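+
+Since every OSD-level series carries the `osd` and CRUSH placement labels, outliers can be isolated or aggregated in PromQL. Two illustrative sketches (`mycluster` is a placeholder label value):
+
+```promql
+# Ten most utilized OSDs in a given cluster
+topk(10, ceph_osd_utilization{cluster="mycluster"})
+
+# Average commit latency per CRUSH host
+avg by (host) (ceph_osd_perf_commit_latency_seconds)
+```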
+
+## Crash collector
+
+Ceph crash daemon related metrics.
+
+Labels:
+- `cluster`: cluster name
+
+Metrics:
+- `ceph_crash_reports`: Count of crash reports per daemon, according to `ceph crash ls`
+
+## RBD Mirror collector
+
+Ceph RBD mirror health collector.
+
+Labels:
+- `cluster`: cluster name
+
+Metrics:
+- `ceph_rbd_mirror_pool_status`: Health status of rbd-mirror, can vary only between 3 states (err:2, warn:1, ok:0)
+- `ceph_rbd_mirror_pool_daemon_status`: Health status of rbd-mirror daemons, can vary only between 3 states (err:2, warn:1, ok:0)
+- `ceph_rbd_mirror_pool_image_status`: Health status of rbd-mirror images, can vary only between 3 states (err:2, warn:1, ok:0)
+
+## RGW collector
+
+RGW-related metrics. Only enabled if `RGW_MODE={1,2}` is set.
+
+Labels:
+- `cluster`: cluster name
+
+Metrics:
+- `ceph_rgw_gc_active_tasks`: RGW GC active task count
+- `ceph_rgw_gc_active_objects`: RGW GC active object count
+- `ceph_rgw_gc_pending_tasks`: RGW GC pending task count
+- `ceph_rgw_gc_pending_objects`: RGW GC pending object count
diff --git a/README.md b/README.md
index 8bb99e9..929261a 100644
--- a/README.md
+++ b/README.md
@@ -6,6 +6,8 @@ with the monitors using an appropriate wrapper over `rados_mon_command()`.
 Hence, no additional setup is necessary other than having a working Ceph
 cluster.
 
+A list of all the metrics collected is available on the [METRICS.md](./METRICS.md) page.
+
 ## Dependencies
 
 You should ideally run this exporter from the client that can talk to the Ceph
@@ -129,6 +131,4 @@ can generate views like:
 
 ![](sample.png)
 
----
-
-Copyright @ 2016-2020 DigitalOcean™ Inc.
+Copyright @ 2016-2023 DigitalOcean™ Inc.