52 Commits

Author SHA1 Message Date
Volker Theile
a5ade11a31
Merge pull request #34239 from p-se/wip-pse-fix-false-root-vol-full-alert
monitoring: root volume full alert fires false positives

Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Jan Fajerski <jfajerski@suse.com>
Reviewed-by: Volker Theile <vtheile@suse.com>
2020-04-06 14:17:17 +02:00
Lenz Grimmer
b6ad9a804b
Merge pull request #34240 from krig/grafana-dashboards-fixes
mgr/dashboard: Repair broken grafana panels

Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Stephan Müller <smueller@suse.com>
2020-04-06 10:55:20 +02:00
Patrick Seidensal
6935dc5592 monitoring: alert for prediction of disk and pool fill up broken
Fixes: https://tracker.ceph.com/issues/44776

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2020-03-27 13:44:28 +01:00
Kristoffer Grönlund
b7abaab5bd dashboard: Convert FQDN to hostname in grafana panels
The $ceph_hosts variable contained the FQDN for hosts
while the instance label created by ceph only has
the hostname.

Fixes: https://tracker.ceph.com/issues/44784
Signed-off-by: Kristoffer Grönlund <kgronlund@suse.com>
2020-03-27 12:33:15 +01:00
Kristoffer Grönlund
136d21e21d dashboard: Resolve FQDN / hostname mismatch in hosts overview panel
In the AVG Disk Utilization panel, the result is calculated
by combining the output of node_disk_io_time_seconds_total
with the output of ceph_disk_occupation. However, the
first vector encodes the instance label with the full FQDN
while the ceph label only contains the hostname:port. In
order for these to match correctly, the domain name and port
have to be stripped from the labels.

Fixes: https://tracker.ceph.com/issues/44784
Signed-off-by: Kristoffer Grönlund <kgronlund@suse.com>
2020-03-27 12:33:09 +01:00
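
The label mismatch described in commit 136d21e21d above can be illustrated with a minimal Python sketch. The function name and the sample label values are hypothetical; the actual fix lives in the panel's PromQL expression, not in Python. The sketch only shows the normalization idea: drop the port and the domain so that both label forms compare equal.

```python
def bare_hostname(instance: str) -> str:
    """Reduce an instance label to its short hostname.

    Drops a trailing ":port" and everything after the first dot.
    Assumes hostnames or FQDNs, not IP addresses.
    """
    host = instance.split(":", 1)[0]  # strip ":port" if present
    return host.split(".", 1)[0]      # strip ".domain" if present


# Hypothetical label values, for illustration only.
node_exporter_instance = "ceph-node1.example.com:9100"
ceph_instance = "ceph-node1:9283"

# Both forms reduce to the same join key.
print(bare_hostname(node_exporter_instance) == bare_hostname(ceph_instance))
```

Splitting on the first dot would mangle IP addresses, so a real implementation would need to treat those separately.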
Kristoffer Grönlund
8b61b8d3d7 dashboard: Use exported_instance to identify OSDs
When moving to LVM-based ceph-volume setups, several
grafana dashboards stopped working. The problem is that
(device, instance) no longer results in unique labels
which causes errors like:

"many-to-many matching not allowed: matching labels must be unique on one side"

Fixes: https://tracker.ceph.com/issues/44784
Signed-off-by: Kristoffer Grönlund <kgronlund@suse.com>
2020-03-27 12:33:01 +01:00
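
The "many-to-many matching" error quoted in commit 8b61b8d3d7 above arises when the labels used for matching do not identify a single series on at least one side. The following Python sketch, with invented sample series and a hypothetical helper name, checks whether a set of matching labels is unique. It does not reproduce the dashboard change itself and the sample values do not reflect real exporter output.

```python
from collections import Counter


def duplicate_keys(series, match_labels):
    """Return the label-value tuples that occur more than once."""
    counts = Counter(
        tuple(s.get(label, "") for label in match_labels) for s in series
    )
    return [key for key, n in counts.items() if n > 1]


# Invented series: two entries share the same (device, instance) pair.
series = [
    {"device": "sdb", "instance": "host1", "exported_instance": "a"},
    {"device": "sdb", "instance": "host1", "exported_instance": "b"},
]

# Matching on (device, instance) is ambiguous for these samples.
print(duplicate_keys(series, ("device", "instance")))

# Adding a further distinguishing label makes every key unique.
print(duplicate_keys(series, ("device", "instance", "exported_instance")))
```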
Kristoffer Grönlund
4444333243 dashboard: AVG RAM Utilization panel always showed "N/A"
The references to `$osd_hosts` etc. were encoded as
`[[osd_hosts]]` in the PromQL expression divisor, and
the panel always displayed N/A as the result of the
query.

Replacing the `[[...]]` with `$...` makes the expression
work again.

Fixes: https://tracker.ceph.com/issues/44784
Signed-off-by: Kristoffer Grönlund <kgronlund@suse.com>
2020-03-27 12:32:52 +01:00
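
The substitution described in commit 4444333243 above, replacing `[[...]]` with `$...`, can be sketched as a small Python helper. The function name and the sample expression are hypothetical, and the commit itself edits the dashboard definition directly rather than running a script like this.

```python
import re

# Matches the bracketed variable syntax, e.g. [[osd_hosts]].
_BRACKET_VAR = re.compile(r"\[\[(\w+)\]\]")


def use_dollar_variables(expr: str) -> str:
    """Rewrite every [[name]] reference in expr as $name."""
    return _BRACKET_VAR.sub(r"$\1", expr)


# Hypothetical expression fragment, for illustration only.
before = 'some_metric{instance=~"[[osd_hosts]]"}'
print(use_dollar_variables(before))
```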
Patrick Seidensal
f8e347f771 monitoring: root volume full alert fires false positives
Fixes: https://tracker.ceph.com/issues/44780

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2020-03-27 11:06:08 +01:00
Kefu Chai
a12f9f19e0
Merge pull request #32749 from james58899/fix-capacity
monitoring: Fix pool capacity incorrect

Reviewed-by: Jan Fajerski <jfajerski@suse.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
2020-03-27 16:13:29 +08:00
Alfonso Martínez
1f0cddfafc monitoring: fix RGW grafana chart 'Average GET/PUT Latencies'
Fixes: https://tracker.ceph.com/issues/44538
Signed-off-by: Alfonso Martínez <almartin@redhat.com>
2020-03-10 12:05:26 +01:00
Patrick Seidensal
1794b55e64 monitoring: restore lost pool full alert
Fixes: https://tracker.ceph.com/issues/44366

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2020-03-02 11:43:03 +01:00
James Cheng
1b980ef88c
monitoring: Fix pool capacity incorrect
Signed-off-by: James Cheng <james59988@gmail.com>
2020-02-18 19:19:13 +08:00
Avan Thakkar
dd8cb9d2d6 mgr/dashboard: UI fixes
Fixes: https://tracker.ceph.com/issues/42914

Signed-off-by: Avan Thakkar <athakkar@redhat.com>
2020-02-10 22:57:57 +05:30
Aleksei Zakharov
a37cf380ad mgr/grafana: sum pg states for cluster
Also, revert table formatting.

Signed-off-by: Aleksei Zakharov <zaharov@selectel.ru>
2020-01-29 17:28:36 +03:00
Aleksei Zakharov
4eb58f7ccc monitoring/grafana,prometheus: add per-pool pg states support
Signed-off-by: Aleksei Zakharov <zaharov@selectel.ru>
2020-01-29 17:28:36 +03:00
Patrick Seidensal
fb51c589b5 monitoring: add details to Prometheus' alerts
Fixes: https://tracker.ceph.com/issues/43764

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2020-01-24 14:21:31 +01:00
Jan Fajerski
e098536acc
Merge pull request #32325 from Kriechi/fix-42982
monitoring: fix prometheus alert for full pools
2020-01-20 10:42:36 +01:00
Bryan Stillwell
8eafb09acb Switch spelling of utilization
Prefer the non-British spelling of utilization since that's what the majority
of the code base seems to use.

Signed-off-by: Bryan Stillwell <bstillwell@godaddy.com>
2020-01-07 16:57:36 -07:00
Thomas Kriechbaumer
9abddc0dd3 monitoring: fix prometheus alert for full pools
The existing alert (introduced via
https://tracker.ceph.com/issues/24977) already triggers while 50%
of the storage space is still available.

Fixes: https://tracker.ceph.com/issues/42982
Signed-off-by: Thomas Kriechbaumer <thomas@kriechbaumer.name>
2019-12-18 15:04:51 +01:00
Lenz Grimmer
11a1708e19
mgr/dashboard: grafana charts match time picker selection. (#31964)
mgr/dashboard: grafana charts match time picker selection.

Reviewed-by: Jan Fajerski <jfajerski@suse.com>
Reviewed-by: Laura Paduano <lpaduano@suse.com>
Reviewed-by: Patrick Seidensal <pnawracay@suse.com>
2019-12-03 17:09:00 +00:00
Alfonso Martínez
5ba114330e mgr/dashboard: grafana charts match time picker selection.
Fixes: https://tracker.ceph.com/issues/43097
Signed-off-by: Alfonso Martínez <almartin@redhat.com>
2019-12-03 14:15:10 +01:00
Ernesto Puerta
1182073f0c
mgr/dashboard,grafana: remove shortcut menu
Remove shortcut menu (links) and add check in grafana CI script.

Fixes: https://tracker.ceph.com/issues/43091
Signed-off-by: Ernesto Puerta <epuertat@redhat.com>
2019-12-03 10:21:35 +01:00
Patrick Seidensal
d262adeb21 monitoring: fix indentation of ceph default alerts
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2019-11-18 12:40:55 +01:00
Patrick Seidensal
e923af3430 monitoring: wait before firing osd full alert
Fixes: https://tracker.ceph.com/issues/42862

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2019-11-18 12:39:27 +01:00
Radu Toader
3beaf63761
mgr/dashboard: fix grafana dashboards
Fixes: https://tracker.ceph.com/issues/42542

Sort order was wrong for some dashboards.
Fixed empty / buggy "Top 3 clients IOPS by pool / Throughput" in Pools Overall performance.
Fixed "Multiple series found" for Avg utilization in Host Overall performance.
Fixed "Invalid dimensions for plot" in OSD Overall performance.

Signed-off-by: Radu Toader <radu.m.toader@gmail.com>
2019-10-30 11:03:03 +02:00
Volker Theile
8e6838c740 monitoring: SNMP OID per every Prometheus alert rule
Use the Ceph enterprise OID 50495 (https://www.iana.org/assignments/enterprise-numbers/enterprise-numbers) and create OIDs for every Prometheus alert rule according to the schema at https://github.com/SUSE/prometheus-webhook-snmp/blob/master/README.md.

Example OID:
1.3.6.1.4.1.50495.15.1.2.2.1

All alert rule OIDs are located below the object identifier 15 (15 for p which is the first character of prometheus). Check out the MIB at https://github.com/SUSE/prometheus-webhook-snmp/blob/master/PROMETHEUS-ALERT-CEPH-MIB.txt for more details.

Signed-off-by: Volker Theile <vtheile@suse.com>
2019-05-28 09:59:50 +02:00
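
As a rough illustration of the OID layout described in commit 8e6838c740 above, the following Python sketch checks whether an OID lies below the Ceph enterprise number 50495 and the object identifier 15. The constant and function names are hypothetical. The meaning of the components after 15 is defined by the linked MIB and is not interpreted here.

```python
# 1.3.6.1.4.1 is iso.org.dod.internet.private.enterprises
ENTERPRISES_PREFIX = (1, 3, 6, 1, 4, 1)
CEPH_ENTERPRISE = 50495   # enterprise number named in the commit message
PROMETHEUS_SUBTREE = 15   # object identifier named in the commit message


def parse_oid(text: str) -> tuple:
    """Turn a dotted OID string into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def is_ceph_alert_oid(text: str) -> bool:
    """True if the OID sits strictly below 1.3.6.1.4.1.50495.15."""
    base = ENTERPRISES_PREFIX + (CEPH_ENTERPRISE, PROMETHEUS_SUBTREE)
    oid = parse_oid(text)
    return len(oid) > len(base) and oid[:len(base)] == base


# The example OID from the commit message.
print(is_ceph_alert_oid("1.3.6.1.4.1.50495.15.1.2.2.1"))
```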
Jan Fajerski
e7a4437fdc monitoring: update Grafana dashboards
Fix various panels that used outdated metric names, clunky or
unnecessary label_replace calls. Also unify the style of many panels.

Fixes: http://tracker.ceph.com/issues/39652

Signed-off-by: Jan Fajerski <jfajerski@suse.com>
2019-05-14 13:47:55 +02:00
Jan Fajerski
c0e58bd8ae monitoring: add a few prometheus alerts
Alerts are from
https://github.com/SUSE/DeepSea/blob/SES5/srv/salt/ceph/monitoring/prometheus/files/ses_default_alerts.yml
but updated for the mgr module and node_exporter >= 0.15.

Signed-off-by: Jan Fajerski <jfajerski@suse.com>
2019-04-26 11:21:39 +02:00
Jan Fajerski
287e209351 monitoring/grafana: fix typo in README
Signed-off-by: Jan Fajerski <jfajerski@suse.com>
2019-04-16 14:19:51 +02:00
Neha Gupta
739fdbad37 mgr/dashboard: Fixed performance details context for host list row selection
Fixes: http://tracker.ceph.com/issues/37854

Signed-off-by: Neha Gupta <gnehapk@gmail.com>
2019-01-18 13:36:49 +09:00
Jason Dillaman
f4ac899950 monitoring/grafana: new RBD overview dashboard page
This page pulls RBD stats from the Nautilus prometheus exporter.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
2019-01-11 16:41:46 -05:00
Boris Ranto
1ade714910 cmake: Support grafana dashboard installation
We are currently hosting the grafana dashboards in our repo but we do
not install them. This patch adds the cmake support.

Signed-off-by: Boris Ranto <branto@redhat.com>
2018-10-25 17:09:02 +02:00
Lenz Grimmer
94aefee3b0
Merge pull request #24314 from rhcs-dashboard/dashboards
mgr/dashboard: Grafana dashboard updates and additions

Reviewed-by: Boris Ranto <branto@redhat.com>
2018-10-19 12:42:23 +02:00
Paul Cuzner
a848411bd8 MGR/dashboard: make grafana datasource selectable
Grafana dashboard updated to use a templating
variable for the datasource

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
a99618ce41 MGR/dashboard: make grafana datasource selectable
Grafana dashboard updated to use a templating
variable for the datasource

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
b64289ca3d MGR/dashboard: make grafana datasource selectable
Grafana dashboard updated to use a templating
variable for the datasource

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
5432470914 MGR/dashboard: make grafana datasource selectable
Grafana dashboard updated to use a templating
variable for the datasource

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
bc5eea09c8 MGR/dashboard: make grafana datasource selectable
Grafana dashboard updated to use a templating
variable for the datasource

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
ba1a3b3a09 MGR/dashboard: make grafana datasource selectable
Grafana dashboard updated to use a templating
variable for the datasource

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
f97fee3a83 MGR/dashboard: make grafana datasource selectable
Grafana dashboard updated to use a templating
variable for the datasource

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
02b5414d19 MGR/dashboard: make grafana datasource selectable
Grafana dashboard updated to use a templating
variable for the datasource

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
7c04098e68 MGR/dashboard: make grafana datasource selectable
Grafana dashboard updated to use a templating
variable for the datasource

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
2c346efd12 Fix linewidth issue in pools overview dashboard
Linewidth was set to two, but the idea is that
a linewidth of >1 is reserved for eye-catcher
plot lines like maximums

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
b84f0ce45f Refresh of the dashboards
Fixes some minor anomalies and tested against
node_exporter 0.15 and 0.16

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
7d97bb28a8 Updated requirements information
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
0e655f8400 Added new Overview dashboards
These new dashboard definitions provide the high
level views for the hosts in the cluster and the
OSDs.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
4292a7a357 Screenshots added for all dashboards
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
3c7c32f2ed Add Host level details dashboard
The host-details.json file provides a view of host
level metrics. The panels are arranged in two
rows:
Overview: CPU/RAM/Network related stats
OSD Performance: OSD physical drive stats

The overview row is shown by default. Click on
the OSD Performance row to show the remaining
graphs.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
a0d9325c4d Document the current state of the dashboards
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:26:08 +13:00
Paul Cuzner
8ebf2ede7f Initial grafana dashboard definitions
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00