mirror of https://github.com/ceph/ceph (synced 2025-02-23 02:57:21 +00:00)
monitoring: Fix "10% OSDs down" alert description
The alert was triggered when less than 90% of OSDs were _up_, but then the description took that value and described it as the percentage of OSDs being _down_.

So with 12% of OSDs down, the alert description would read:
```
88% or 88 of 100 OSDs are down (>=10%).
```
which can be panic-inducing.

This commit changes the alert expression to actually compute the ratio of OSDs being down, which makes the correct value appear in the description.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
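The arithmetic behind the two expressions can be checked with a small Python sketch. It is not part of the commit: `old_expr` and `new_expr` are hypothetical helper names, and each OSD is modelled as a plain `0`/`1` value standing in for one `ceph_osd_up` sample. A return value of `None` stands for "the expression yields no sample, so the alert does not fire".

```python
from typing import Optional, Sequence


def old_expr(osd_up: Sequence[int]) -> Optional[float]:
    """Mimic `(sum(ceph_osd_up) / count(ceph_osd_up)) * 100 <= 90`."""
    if not osd_up:
        return None
    value = sum(osd_up) / len(osd_up) * 100  # percentage of OSDs that are *up*
    return value if value <= 90 else None


def new_expr(osd_up: Sequence[int]) -> Optional[float]:
    """Mimic `count(ceph_osd_up == 0) / count(ceph_osd_up) * 100 >= 10`."""
    down = sum(1 for v in osd_up if v == 0)
    if down == 0:
        # In PromQL, count() over an empty vector yields no sample at all.
        return None
    value = down / len(osd_up) * 100  # percentage of OSDs that are *down*
    return value if value >= 10 else None


# The scenario from the commit message: 100 OSDs, 12 of them down.
osd_up = [0] * 12 + [1] * 88

print("old $value:", old_expr(osd_up))  # the "up" percentage
print("new $value:", new_expr(osd_up))  # the "down" percentage
```

Both expressions fire for this input, but only the new one produces a `$value` that matches the wording "OSDs are down".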
parent a96f9583f4
commit 653c3f6682
@@ -47,14 +47,14 @@ groups:
   - name: osd
     rules:
       - alert: 10% OSDs down
-        expr: (sum(ceph_osd_up) / count(ceph_osd_up)) * 100 <= 90
+        expr: count(ceph_osd_up == 0) / count(ceph_osd_up) * 100 >= 10
         labels:
           severity: critical
           type: ceph_default
           oid: 1.3.6.1.4.1.50495.15.1.2.4.1
         annotations:
           description: |
-            {{ $value | humanize}}% or {{with query "sum(ceph_osd_up)" }}{{ . | first | value }}{{ end }} of {{ with query "count(ceph_osd_up)"}}{{. | first | value }}{{ end }} OSDs are down (>=10%).
+            {{ $value | humanize }}% or {{ with query "count(ceph_osd_up == 0)" }}{{ . | first | value }}{{ end }} of {{ with query "count(ceph_osd_up)" }}{{ . | first | value }}{{ end }} OSDs are down (≥ 10%).
 
             The following OSDs are down:
             {{- range query "(ceph_osd_up * on(ceph_daemon) group_left(hostname) ceph_osd_metadata) == 0" }}
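To see how the two description templates render, here is a rough Python sketch of the substitution they perform. It is an illustration only: `humanize`, `render_old` and `render_new` are hypothetical stand-ins, and `humanize` merely approximates the Prometheus template function of the same name for values between 1 and 1000 (four significant digits, no SI prefix).

```python
from typing import Sequence


def humanize(value: float) -> str:
    # Assumption: approximates Prometheus' `humanize` for 1 <= value < 1000 only.
    return format(value, ".4g")


def render_old(osd_up: Sequence[int]) -> str:
    """Old template: $value and the first count both come from sum(ceph_osd_up)."""
    up = sum(osd_up)
    total = len(osd_up)
    value = up / total * 100
    return f"{humanize(value)}% or {up} of {total} OSDs are down (>=10%)."


def render_new(osd_up: Sequence[int]) -> str:
    """New template: $value and the first count both come from count(ceph_osd_up == 0)."""
    down = sum(1 for v in osd_up if v == 0)
    total = len(osd_up)
    value = down / total * 100
    return f"{humanize(value)}% or {down} of {total} OSDs are down (≥ 10%)."


# Same scenario as in the commit message: 12 of 100 OSDs are down.
sample = [0] * 12 + [1] * 88

print(render_old(sample))
print(render_new(sample))
```

The old rendering reports the number of OSDs that are up while calling them down. The new rendering uses the down count for both the percentage and the absolute number.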