14 Commits

Author SHA1 Message Date
Benoît Knecht
653c3f6682 monitoring: Fix "10% OSDs down" alert description
The alert fired when fewer than 90% of OSDs were _up_, but the description
then reported that same value as the percentage of OSDs that were _down_.
So with 12% of OSDs down, the alert description would read:

```
88% or 88 of 100 OSDs are down (>=10%).
```

which can be panic-inducing.

This commit changes the alert expression to actually compute the ratio of OSDs
being down, which makes the correct value appear in the description.
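
The arithmetic behind this fix can be sketched in a few lines of Python. This is only an illustration of the ratio described above, not the actual alert rule: the function names and the `threshold` parameter are hypothetical, and the message format simply mirrors the example description quoted earlier.

```python
def old_description(total_osds: int, up_osds: int) -> str:
    """Reproduce the misleading text: the *up* values are reported as 'down'."""
    up_pct = 100 * up_osds / total_osds
    return f"{up_pct:.0f}% or {up_osds} of {total_osds} OSDs are down (>=10%)."


def fixed_description(total_osds: int, up_osds: int, threshold: float = 10.0):
    """Compute the ratio of OSDs that are down and describe it.

    Returns None when the down ratio is below the (hypothetical) threshold,
    i.e. when the alert would not fire.
    """
    if total_osds <= 0:
        raise ValueError("total_osds must be positive")
    down_osds = total_osds - up_osds
    down_pct = 100 * down_osds / total_osds
    if down_pct < threshold:
        return None
    return (
        f"{down_pct:.0f}% or {down_osds} of {total_osds} "
        f"OSDs are down (>={threshold:.0f}%)."
    )


if __name__ == "__main__":
    # 88 of 100 OSDs up means 12% are down.
    print(old_description(100, 88))
    print(fixed_description(100, 88))
```

With 88 of 100 OSDs up, the first function reproduces the alarming "88%" text, while the second reports the 12% that are actually down.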

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
2020-05-06 18:49:26 +02:00
Volker Theile
e197e4d7f4 monitoring: alert for pool fill up broken
Fixes: https://tracker.ceph.com/issues/44991
Signed-off-by: Volker Theile <vtheile@suse.com>
2020-04-08 15:02:45 +02:00
Volker Theile
a5ade11a31 Merge pull request #34239 from p-se/wip-pse-fix-false-root-vol-full-alert
monitoring: root volume full alert fires false positives

Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Jan Fajerski <jfajerski@suse.com>
Reviewed-by: Volker Theile <vtheile@suse.com>
2020-04-06 14:17:17 +02:00
Patrick Seidensal
6935dc5592 monitoring: alert for prediction of disk and pool fill up broken
Fixes: https://tracker.ceph.com/issues/44776

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2020-03-27 13:44:28 +01:00
Patrick Seidensal
f8e347f771 monitoring: root volume full alert fires false positives
Fixes: https://tracker.ceph.com/issues/44780

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2020-03-27 11:06:08 +01:00
James Cheng
1b980ef88c monitoring: Fix pool capacity incorrect
Signed-off-by: James Cheng <james59988@gmail.com>
2020-02-18 19:19:13 +08:00
Aleksei Zakharov
4eb58f7ccc monitoring/grafana,prometheus: add per-pool pg states support
Signed-off-by: Aleksei Zakharov <zaharov@selectel.ru>
2020-01-29 17:28:36 +03:00
Patrick Seidensal
fb51c589b5 monitoring: add details to Prometheus' alerts
Fixes: https://tracker.ceph.com/issues/43764

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2020-01-24 14:21:31 +01:00
Jan Fajerski
e098536acc Merge pull request #32325 from Kriechi/fix-42982
monitoring: fix prometheus alert for full pools
2020-01-20 10:42:36 +01:00
Thomas Kriechbaumer
9abddc0dd3 monitoring: fix prometheus alert for full pools
The existing alert (introduced via
https://tracker.ceph.com/issues/24977) already triggers while 50%
of the storage space is still available.

Fixes: https://tracker.ceph.com/issues/42982
Signed-off-by: Thomas Kriechbaumer <thomas@kriechbaumer.name>
2019-12-18 15:04:51 +01:00
Patrick Seidensal
d262adeb21 monitoring: fix indentation of ceph default alerts
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2019-11-18 12:40:55 +01:00
Patrick Seidensal
e923af3430 monitoring: wait before firing osd full alert
Fixes: https://tracker.ceph.com/issues/42862

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2019-11-18 12:39:27 +01:00
Volker Theile
8e6838c740 monitoring: SNMP OID per every Prometheus alert rule
Use the Ceph enterprise OID 50495 (https://www.iana.org/assignments/enterprise-numbers/enterprise-numbers) and create OIDs for every Prometheus alert rule according to the schema at https://github.com/SUSE/prometheus-webhook-snmp/blob/master/README.md.

Example OID:
1.3.6.1.4.1.50495.15.1.2.2.1

All alert rule OIDs are located below the object identifier 15 (15 stands for "p", the first character of "prometheus"). Check out the MIB at https://github.com/SUSE/prometheus-webhook-snmp/blob/master/PROMETHEUS-ALERT-CEPH-MIB.txt for more details.
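
As a rough illustration of the layout described above, the following Python sketch splits an alert OID into the Ceph enterprise prefix, the prometheus sub-identifier 15, and the remaining rule-specific components. The function and constant names are hypothetical, and the meaning of the trailing components is deliberately not interpreted here; the linked MIB is the authority on that.

```python
# 1.3.6.1.4.1 is the IANA private enterprises arc; 50495 is the Ceph
# enterprise number mentioned in the commit message.
CEPH_ENTERPRISE_PREFIX = (1, 3, 6, 1, 4, 1, 50495)
PROMETHEUS_SUBTREE = 15  # all alert rule OIDs live below this identifier


def split_alert_oid(oid: str) -> tuple:
    """Return the components that follow <Ceph enterprise prefix>.15.

    Raises ValueError if the OID is not located below that subtree.
    """
    parts = tuple(int(p) for p in oid.split("."))
    n = len(CEPH_ENTERPRISE_PREFIX)
    if parts[:n] != CEPH_ENTERPRISE_PREFIX:
        raise ValueError(f"{oid} is not below the Ceph enterprise OID")
    if len(parts) <= n or parts[n] != PROMETHEUS_SUBTREE:
        raise ValueError(f"{oid} is not below the prometheus subtree (15)")
    return parts[n + 1:]


if __name__ == "__main__":
    # The example OID from the commit message.
    print(split_alert_oid("1.3.6.1.4.1.50495.15.1.2.2.1"))
```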

Signed-off-by: Volker Theile <vtheile@suse.com>
2019-05-28 09:59:50 +02:00
Jan Fajerski
c0e58bd8ae monitoring: add a few prometheus alerts
Alerts are from
https://github.com/SUSE/DeepSea/blob/SES5/srv/salt/ceph/monitoring/prometheus/files/ses_default_alerts.yml
but updated for the mgr module and node_exporter >= 0.15.

Signed-off-by: Jan Fajerski <jfajerski@suse.com>
2019-04-26 11:21:39 +02:00