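# Prometheus alerting rules for the Ceph mgr prometheus module. A minimal
# sketch of how these rules might be loaded (the file path below is an
# assumption, not part of this file):
#
#   # prometheus.yml
#   rule_files:
#     - /etc/prometheus/ceph_default_alerts.yml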
groups:
  - name: cluster health
    rules:
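      # ceph_health_status encodes the cluster health reported by the mgr
      # prometheus module: 0 = HEALTH_OK, 1 = HEALTH_WARN, 2 = HEALTH_ERR.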
      - alert: health error
        expr: ceph_health_status == 2
        for: 5m
        labels:
          severity: critical
          type: ceph_default
          oid: 1.3.6.1.4.1.50495.15.1.2.2.1
        annotations:
          description: >
            Ceph is in HEALTH_ERROR state for more than 5 minutes.
            Please check "ceph health detail" for more information.
      - alert: health warn
        expr: ceph_health_status == 1
        for: 15m
        labels:
          severity: warning
          type: ceph_default
          oid: 1.3.6.1.4.1.50495.15.1.2.2.2
        annotations:
          description: >
            Ceph has been in HEALTH_WARN for more than 15 minutes.
            Please check "ceph health detail" for more information.
  - name: mon
    rules:
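      # ceph_mon_quorum_status is 1 for every monitor currently in quorum,
      # so the sum below counts the quorum members.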
      - alert: low monitor quorum count
        expr: sum(ceph_mon_quorum_status) < 3
        labels:
          severity: critical
          type: ceph_default
          oid: 1.3.6.1.4.1.50495.15.1.2.3.1
        annotations:
          description: |
            Monitor count in quorum is below three.
            Only {{ $value }} of {{ with query "count(ceph_mon_quorum_status)" }}{{ . | first | value }}{{ end }} monitors are active.
            The following monitors are down:
            {{- range query "(ceph_mon_quorum_status == 0) + on(ceph_daemon) group_left(hostname) (ceph_mon_metadata * 0)" }}
              - {{ .Labels.ceph_daemon }} on {{ .Labels.hostname }}
            {{- end }}
  - name: osd
    rules:
      - alert: 10% OSDs down
        expr: count(ceph_osd_up == 0) / count(ceph_osd_up) * 100 >= 10
        labels:
          severity: critical
          type: ceph_default
          oid: 1.3.6.1.4.1.50495.15.1.2.4.1
        annotations:
          description: |
            {{ $value | humanize }}% or {{ with query "count(ceph_osd_up == 0)" }}{{ . | first | value }}{{ end }} of {{ with query "count(ceph_osd_up)" }}{{ . | first | value }}{{ end }} OSDs are down (≥ 10%).
            The following OSDs are down:
            {{- range query "(ceph_osd_up * on(ceph_daemon) group_left(hostname) ceph_osd_metadata) == 0" }}
              - {{ .Labels.ceph_daemon }} on {{ .Labels.hostname }}
            {{- end }}
      - alert: OSD down
        expr: count(ceph_osd_up == 0) > 0
        for: 15m
        labels:
          severity: warning
          type: ceph_default
          oid: 1.3.6.1.4.1.50495.15.1.2.4.2
        annotations:
          description: |
            {{ $s := "" }}{{ if gt $value 1.0 }}{{ $s = "s" }}{{ end }}
            {{ $value }} OSD{{ $s }} down for more than 15 minutes.
            {{ $value }} of {{ query "count(ceph_osd_up)" | first | value }} OSDs are down.
            The following OSD{{ $s }} {{ if eq $s "" }}is{{ else }}are{{ end }} down:
            {{- range query "(ceph_osd_up * on(ceph_daemon) group_left(hostname) ceph_osd_metadata) == 0" }}
              - {{ .Labels.ceph_daemon }} on {{ .Labels.hostname }}
            {{- end }}
      - alert: OSDs near full
        expr: |
          (
            ((ceph_osd_stat_bytes_used / ceph_osd_stat_bytes) and on(ceph_daemon) ceph_osd_up == 1)
            * on(ceph_daemon) group_left(hostname) ceph_osd_metadata
          ) * 100 > 90
        for: 5m
        labels:
          severity: critical
          type: ceph_default
          oid: 1.3.6.1.4.1.50495.15.1.2.4.3
        annotations:
          description: >
            OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} is
            dangerously full: {{ $value | humanize }}%
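      # rate() treats the 0/1 ceph_osd_up gauge as a counter, so each
      # down->up flip registers as an increase; scaled by 60, this fires
      # when an OSD flaps more than once per minute on average over 5m.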
      - alert: flapping OSD
        expr: |
          (
            rate(ceph_osd_up[5m])
            * on(ceph_daemon) group_left(hostname) ceph_osd_metadata
          ) * 60 > 1
        labels:
          severity: warning
          type: ceph_default
          oid: 1.3.6.1.4.1.50495.15.1.2.4.4
        annotations:
          description: >
            OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} was
            marked down and back up {{ $value | humanize }} times per
            minute over the last 5 minutes.
      # alert on high deviation from average PG count
      - alert: high pg count deviation
        expr: |
          abs(
            (
              (ceph_osd_numpg > 0) - on (job) group_left avg(ceph_osd_numpg > 0) by (job)
            ) / on (job) group_left avg(ceph_osd_numpg > 0) by (job)
          ) * on(ceph_daemon) group_left(hostname) ceph_osd_metadata > 0.30
        for: 5m
        labels:
          severity: warning
          type: ceph_default
          oid: 1.3.6.1.4.1.50495.15.1.2.4.5
        annotations:
          description: >
            OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} deviates
            by more than 30% from average PG count.
      # alert on high commit latency...but how high is too high
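      # A possible starting point, left commented out until a threshold is
      # agreed on: the 500 ms cut-off below is an illustrative assumption,
      # not a tuned value. ceph_osd_commit_latency_ms is exported by the
      # mgr prometheus module.
      # - alert: high OSD commit latency
      #   expr: |
      #     (
      #       ceph_osd_commit_latency_ms
      #       * on(ceph_daemon) group_left(hostname) ceph_osd_metadata
      #     ) > 500
      #   for: 10m
      #   labels:
      #     severity: warning
      #     type: ceph_default
      #   annotations:
      #     description: >
      #       OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} has had
      #       a commit latency above 500 ms ({{ $value }} ms) for 10 minutes.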
  - name: mds
    rules:
    # no mds metrics are exported yet
  - name: mgr
    rules:
    # no mgr metrics are exported yet
  - name: pgs
    rules:
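      # ceph_pool_metadata has a constant value of 1; multiplying it in only
      # attaches the pool's name label to the per-pool PG counts.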
      - alert: pgs inactive
        expr: ceph_pool_metadata * on(pool_id,instance) group_left() (ceph_pg_total - ceph_pg_active) > 0
        for: 5m
        labels:
          severity: critical
          type: ceph_default
          oid: 1.3.6.1.4.1.50495.15.1.2.7.1
        annotations:
          description: >
            {{ $value }} PGs have been inactive for more than 5 minutes in pool {{ $labels.name }}.
            Inactive placement groups aren't able to serve read/write
            requests.
      - alert: pgs unclean
        expr: ceph_pool_metadata * on(pool_id,instance) group_left() (ceph_pg_total - ceph_pg_clean) > 0
        for: 15m
        labels:
          severity: warning
          type: ceph_default
          oid: 1.3.6.1.4.1.50495.15.1.2.7.2
        annotations:
          description: >
            {{ $value }} PGs haven't been clean for more than 15 minutes in pool {{ $labels.name }}.
            Unclean PGs haven't been able to completely recover from a
            previous failure.
  - name: nodes
    rules:
      - alert: root volume full
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100 < 5
        for: 5m
        labels:
          severity: critical
          type: ceph_default
          oid: 1.3.6.1.4.1.50495.15.1.2.8.1
        annotations:
          description: >
            Root volume (OSD and MON store) is dangerously full: {{ $value | humanize }}% free.
      # alert on nic packet error and drop rates > 0.01% or > 10 packets/s
      - alert: network packets dropped
        expr: |
          (
            increase(node_network_receive_drop_total{device!="lo"}[1m]) +
            increase(node_network_transmit_drop_total{device!="lo"}[1m])
          ) / (
            increase(node_network_receive_packets_total{device!="lo"}[1m]) +
            increase(node_network_transmit_packets_total{device!="lo"}[1m])
          ) >= 0.0001 or (
            increase(node_network_receive_drop_total{device!="lo"}[1m]) +
            increase(node_network_transmit_drop_total{device!="lo"}[1m])
          ) >= 10
        labels:
          severity: warning
          type: ceph_default
          oid: 1.3.6.1.4.1.50495.15.1.2.8.2
        annotations:
          description: >
            Node {{ $labels.instance }} experiences packet drop > 0.01% or >
            10 packets/s on interface {{ $labels.device }}.
      - alert: network packet errors
        expr: |
          (
            increase(node_network_receive_errs_total{device!="lo"}[1m]) +
            increase(node_network_transmit_errs_total{device!="lo"}[1m])
          ) / (
            increase(node_network_receive_packets_total{device!="lo"}[1m]) +
            increase(node_network_transmit_packets_total{device!="lo"}[1m])
          ) >= 0.0001 or (
            increase(node_network_receive_errs_total{device!="lo"}[1m]) +
            increase(node_network_transmit_errs_total{device!="lo"}[1m])
          ) >= 10
        labels:
          severity: warning
          type: ceph_default
          oid: 1.3.6.1.4.1.50495.15.1.2.8.3
        annotations:
          description: >
            Node {{ $labels.instance }} experiences packet errors > 0.01% or
            > 10 packets/s on interface {{ $labels.device }}.
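      # predict_linear extrapolates the 2-day trend of free bytes 5 days
      # (3600 * 24 * 5 seconds) ahead; a negative prediction means the
      # filesystem would run out of space within that window.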
      - alert: storage filling up
        expr: |
          predict_linear(node_filesystem_free_bytes[2d], 3600 * 24 * 5) *
          on(instance) group_left(nodename) node_uname_info < 0
        labels:
          severity: warning
          type: ceph_default
          oid: 1.3.6.1.4.1.50495.15.1.2.8.4
        annotations:
          description: >
            Mountpoint {{ $labels.mountpoint }} on {{ $labels.nodename }}
            will be full in less than 5 days assuming the average fill-up
            rate of the past 48 hours.
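      # quantile(0.5, ...) is the median MTU across all monitored
      # interfaces; any up interface whose MTU differs from it fires.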
      - alert: MTU Mismatch
        expr: node_network_mtu_bytes{device!="lo"} * (node_network_up{device!="lo"} > 0) != on() group_left() (quantile(0.5, node_network_mtu_bytes{device!="lo"}))
        labels:
          severity: warning
          type: ceph_default
          oid: 1.3.6.1.4.1.50495.15.1.2.8.5
        annotations:
          description: >
            Node {{ $labels.instance }} has a different MTU size ({{ $value }})
            than the median value on device {{ $labels.device }}.
  - name: pools
    rules:
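      # ceph_pool_stored / (ceph_pool_stored + ceph_pool_max_avail) is the
      # fraction of a pool's effective capacity already in use.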
      - alert: pool full
        expr: |
          ceph_pool_stored / (ceph_pool_stored + ceph_pool_max_avail)
          * on(pool_id) group_right ceph_pool_metadata * 100 > 90
        labels:
          severity: critical
          type: ceph_default
          oid: 1.3.6.1.4.1.50495.15.1.2.9.1
        annotations:
          description: Pool {{ $labels.name }} at {{ $value | humanize }}% capacity.
      - alert: pool filling up
        expr: |
          (
            predict_linear(ceph_pool_stored[2d], 3600 * 24 * 5)
            >= ceph_pool_stored + ceph_pool_max_avail
          ) * on(pool_id) group_left(name) ceph_pool_metadata
        labels:
          severity: warning
          type: ceph_default
          oid: 1.3.6.1.4.1.50495.15.1.2.9.2
        annotations:
          description: >
            Pool {{ $labels.name }} will be full in less than 5 days
            assuming the average fill-up rate of the past 48 hours.
  - name: healthchecks
    rules:
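      # ceph_healthcheck_slow_ops mirrors the cluster's SLOW_OPS health
      # check: the number of OSD ops exceeding osd_op_complaint_time.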
      - alert: Slow OSD Ops
        expr: ceph_healthcheck_slow_ops > 0
        for: 30s
        labels:
          severity: warning
          type: ceph_default
        annotations:
          description: >
            {{ $value }} OSD requests are taking too long to process
            (osd_op_complaint_time exceeded)