RepoMirrors/ceph

mirror of https://github.com/ceph/ceph synced 2024-12-19 09:57:05 +00:00

Author	SHA1	Message	Date
Paul Cuzner	2010432b50	mgr/prometheus: Add healthcheck metric for SLOW_OPS SLOW_OPS is triggered by op tracker, and generates a health alert but healthchecks do not create metrics for prometheus to use as alert triggers. This change adds a SLOW_OPS metric, and provides a simple means to extend to other relevant health checks in the future. If extracting the value from the health check message fails, we log an error and remove the metric from the metric set. In addition, the metric description has changed to better reflect the scenarios where SLOW_OPS can be triggered. Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2020-11-02 15:30:49 +13:00
Seena Fallah	0fd28f646c	monitoring: Use null yaxes min for OSD read latency Because seriesOverrides sets negative-Y for the read param, there shouldn't be a minimum for yaxes. Signed-off-by: Seena Fallah <seenafallah@gmail.com>	2020-10-12 19:56:18 +03:30
Patrick Seidensal	fe64b9d176	mgr/dashboard: Fix many-to-many issue in host-details dashboard The labels on one side do not match the labels of the other side, where a label_replace is used. The fix uses the same label_replace on the missing side. Fixes: https://tracker.ceph.com/issues/47334 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2020-09-07 12:37:40 +02:00
Avan Thakkar	f039e5585d	mgr/dashboard: cpu stats incorrectly displayed Fixes: https://tracker.ceph.com/issues/46683 Signed-off-by: Avan Thakkar <athakkar@redhat.com>	2020-07-23 11:57:32 +05:30
pcuzner	0021dd278b	Merge pull request #35610 from pcuzner/wip-grafana-container monitoring: add grafana container build file	2020-07-06 13:06:55 +12:00
Lenz Grimmer	399521d66b	Merge pull request #34532 from rhcs-dashboard/wip-45068-fix-parse-error mgr/dashboard: Prometheus query error in the metrics of Pools, OSDs and RBD images Reviewed-by: Alfonso Martínez <almartin@redhat.com> Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: Volker Theile <vtheile@suse.com>	2020-06-30 10:50:59 +02:00
Paul Cuzner	3c813729dc	monitoring: add grafana container build file This commit provides the Makefile to create the ceph-grafana containers for nautilus, octopus and master releases. Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2020-06-17 17:20:45 +12:00
Kiefer Chang	b963b7fbe9	monitoring: fixing some issues in RBD detail dashboard - Exchange read/write legends in the `I/O Bytes per second` panel. - Rename `I/O Bytes per second` to `Throughput`. - Rename `IOPS Count` to just `IOPS`. - Remove instance name from legends. - Fixes typos: `Averange` -> `Average`. Fixes: https://tracker.ceph.com/issues/45735 Signed-off-by: Kiefer Chang <kiefer.chang@suse.com>	2020-05-28 14:49:31 +08:00
Alfonso Martínez	cf4ff7d2f0	mgr/dashboard: grafana panels for rgw multisite sync performance * RGW sync perf. counters are now exposed through grafana panels. * Sync Performance tab is only shown if rgw realm is detected. * Prometheus module: added metrics suitable for prometheus consumption (from existing ones, not replacing for backward compatibility). Fixes: https://tracker.ceph.com/issues/45310 Signed-off-by: Alfonso Martínez <almartin@redhat.com>	2020-05-22 13:36:10 +02:00
Benoît Knecht	653c3f6682	monitoring: Fix "10% OSDs down" alert description The alert was triggered when less than 90% of OSDs were _up_, but then the description took that value and described it as the percentage of OSDs being _down_. So with 12% of OSDs down, the alert description would read: ``` 88% or 88 of 100 OSDs are down (>=10%). ``` which can be panic-inducing. This commit changes the alert expression to actually compute the ratio of OSDs being down, which makes the correct value appear in the description. Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>	2020-05-06 18:49:26 +02:00
Lenz Grimmer	9334471340	Merge pull request #33991 from SchoolGuy/monitoring/rbd-image-details mgr/dashboard/grafana: Add rbd-image details dashboard Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: Jan Fajerski <jfajerski@suse.com> Reviewed-by: Laura Paduano <lpaduano@suse.com> Reviewed-by: Patrick Seidensal <pnawracay@suse.com> Reviewed-by: Volker Theile <vtheile@suse.com>	2020-05-04 09:59:53 +02:00
Enno Gotthold	dfb1e0020e	mgr/dashboard: Remove additional unneeded steps for the metrics calculation Signed-off-by: Enno Gotthold <egotthold@suse.de>	2020-04-28 13:34:16 +02:00
Ernesto Puerta	3fd804f10b	monitoring: fix decimal precision in Grafana % Set decimal precision to 2 positions for charts using percentunits. Fixes: https://tracker.ceph.com/issues/45183 Signed-off-by: Ernesto Puerta <epuertat@redhat.com>	2020-04-22 13:39:16 +02:00
Avan Thakkar	47b515c094	mgr/dashboard: Prometheus query error in the metrics of Pools, OSDs and RBD images Fixes: https://tracker.ceph.com/issues/45068 Signed-off-by: Avan Thakkar <athakkar@redhat.com>	2020-04-21 23:03:09 +05:30
Volker Theile	e197e4d7f4	monitoring: alert for pool fill up broken Fixes: https://tracker.ceph.com/issues/44991 Signed-off-by: Volker Theile <vtheile@suse.com>	2020-04-08 15:02:45 +02:00
Volker Theile	a5ade11a31	Merge pull request #34239 from p-se/wip-pse-fix-false-root-vol-full-alert monitoring: root volume full alert fires false positives Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: Jan Fajerski <jfajerski@suse.com> Reviewed-by: Volker Theile <vtheile@suse.com>	2020-04-06 14:17:17 +02:00
Lenz Grimmer	b6ad9a804b	Merge pull request #34240 from krig/grafana-dashboards-fixes mgr/dashboard: Repair broken grafana panels Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: Stephan Müller <smueller@suse.com>	2020-04-06 10:55:20 +02:00
Patrick Seidensal	6935dc5592	monitoring: alert for prediction of disk and pool fill up broken Fixes: https://tracker.ceph.com/issues/44776 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2020-03-27 13:44:28 +01:00
Kristoffer Grönlund	b7abaab5bd	dashboard: Convert FQDN to hostname in grafana panels The $ceph_hosts variable contained the FQDN for hosts while the instance label created by ceph only has the hostname. Fixes: https://tracker.ceph.com/issues/44784 Signed-off-by: Kristoffer Grönlund <kgronlund@suse.com>	2020-03-27 12:33:15 +01:00
Kristoffer Grönlund	136d21e21d	dashboard: Resolve FQDN / hostname mismatch in hosts overview panel In the AVG Disk Utilization panel, the result is calculated by combining the output of node_disk_io_time_seconds_total with the output of ceph_disk_occupation. However, the first vector encodes the instance label with the full FQDN while the ceph label only contains the hostname:port. In order for these to match correctly, the domain name and port have to be stripped from the labels. Fixes: https://tracker.ceph.com/issues/44784 Signed-off-by: Kristoffer Grönlund <kgronlund@suse.com>	2020-03-27 12:33:09 +01:00
Kristoffer Grönlund	8b61b8d3d7	dashboard: Use exported_instance to identify OSDs When moving to LVM-based ceph-volume setups, several grafana dashboards stopped working. The problem is that (device, instance) no longer results in unique labels which causes errors like: "many-to-many matching not allowed: matching labels must be unique on one side" Fixes: https://tracker.ceph.com/issues/44784 Signed-off-by: Kristoffer Grönlund <kgronlund@suse.com>	2020-03-27 12:33:01 +01:00
Kristoffer Grönlund	4444333243	dashboard: AVG RAM Utilization panel always showed "N/A" The references to `$osd_hosts` etc. were encoded as `[[osd_hosts]]` in the PromQL expression divisor, and the panel always displayed N/A as the result of the query. Replacing the `[[...]]` with `$...` makes the expression work again. Fixes: https://tracker.ceph.com/issues/44784 Signed-off-by: Kristoffer Grönlund <kgronlund@suse.com>	2020-03-27 12:32:52 +01:00
Patrick Seidensal	f8e347f771	monitoring: root volume full alert fires false positives Fixes: https://tracker.ceph.com/issues/44780 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2020-03-27 11:06:08 +01:00
Kefu Chai	a12f9f19e0	Merge pull request #32749 from james58899/fix-capacity monitoring: Fix pool capacity incorrect Reviewed-by: Jan Fajerski <jfajerski@suse.com> Reviewed-by: Ernesto Puerta <epuertat@redhat.com>	2020-03-27 16:13:29 +08:00
Enno Gotthold	9707cb30cb	mgr/dashboard: Add grafana chart for rbd image details Fixes: https://tracker.ceph.com/issues/44623 Signed-off-by: Enno Gotthold <egotthold@suse.de> By default this dashboard will be empty, like the already existing dashboard with the summary for all rbd images.	2020-03-26 08:21:30 +01:00
Alfonso Martínez	1f0cddfafc	monitoring: fix RGW grafana chart 'Average GET/PUT Latencies' Fixes: https://tracker.ceph.com/issues/44538 Signed-off-by: Alfonso Martínez <almartin@redhat.com>	2020-03-10 12:05:26 +01:00
Patrick Seidensal	1794b55e64	monitoring: restore lost `pool full` alert Fixes: https://tracker.ceph.com/issues/44366 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2020-03-02 11:43:03 +01:00
James Cheng	1b980ef88c	monitoring: Fix pool capacity incorrect Signed-off-by: James Cheng <james59988@gmail.com>	2020-02-18 19:19:13 +08:00
Avan Thakkar	dd8cb9d2d6	mgr/dashboard: UI fixes Fixes: https://tracker.ceph.com/issues/42914 Signed-off-by: Avan Thakkar <athakkar@redhat.com>	2020-02-10 22:57:57 +05:30
Aleksei Zakharov	a37cf380ad	mgr/grafana: sum pg states for cluster Also, revert table formatting. Signed-off-by: Aleksei Zakharov <zaharov@selectel.ru>	2020-01-29 17:28:36 +03:00
Aleksei Zakharov	4eb58f7ccc	monitoring/grafana,prometheus: add per-pool pg states support Signed-off-by: Aleksei Zakharov <zaharov@selectel.ru>	2020-01-29 17:28:36 +03:00
Patrick Seidensal	fb51c589b5	monitoring: add details to Prometheus' alerts Fixes: https://tracker.ceph.com/issues/43764 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2020-01-24 14:21:31 +01:00
Jan Fajerski	e098536acc	Merge pull request #32325 from Kriechi/fix-42982 monitoring: fix prometheus alert for full pools	2020-01-20 10:42:36 +01:00
Bryan Stillwell	8eafb09acb	Switch spelling of utilization Prefer the non-British spelling of utilization since that's what the majority of the code base seems to use. Signed-off-by: Bryan Stillwell <bstillwell@godaddy.com>	2020-01-07 16:57:36 -07:00
Thomas Kriechbaumer	9abddc0dd3	monitoring: fix prometheus alert for full pools The existing alert (introduced via https://tracker.ceph.com/issues/24977) already triggers while 50% of storage space is still available. Fixes: https://tracker.ceph.com/issues/42982 Signed-off-by: Thomas Kriechbaumer <thomas@kriechbaumer.name>	2019-12-18 15:04:51 +01:00
Lenz Grimmer	11a1708e19	mgr/dashboard: grafana charts match time picker selection. (#31964 ) mgr/dashboard: grafana charts match time picker selection. Reviewed-by: Jan Fajerski <jfajerski@suse.com> Reviewed-by: Laura Paduano <lpaduano@suse.com> Reviewed-by: Patrick Seidensal <pnawracay@suse.com>	2019-12-03 17:09:00 +00:00
Alfonso Martínez	5ba114330e	mgr/dashboard: grafana charts match time picker selection. Fixes: https://tracker.ceph.com/issues/43097 Signed-off-by: Alfonso Martínez <almartin@redhat.com>	2019-12-03 14:15:10 +01:00
Ernesto Puerta	1182073f0c	mgr/dashboard,grafana: remove shortcut menu Remove shortcut menu (links) and add check in grafana CI script. Fixes: https://tracker.ceph.com/issues/43091 Signed-off-by: Ernesto Puerta <epuertat@redhat.com>	2019-12-03 10:21:35 +01:00
Patrick Seidensal	d262adeb21	monitoring: fix indentation of ceph default alerts Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2019-11-18 12:40:55 +01:00
Patrick Seidensal	e923af3430	monitoring: wait before firing osd full alert Fixes: https://tracker.ceph.com/issues/42862 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2019-11-18 12:39:27 +01:00
Radu Toader	3beaf63761	mgr/dashboard: fix grafana dashboards Fixes: https://tracker.ceph.com/issues/42542 Sort order was wrong for some dashboards. Fixed empty / buggy Top 3 clients IOPS by pool / Throughput - in Pools Overall performance. Fixed Avg utilization Multiple series found - in Host Overall performance. Fixed invalid dimensions for plot - in OSD Overall performance. Signed-off-by: Radu Toader <radu.m.toader@gmail.com>	2019-10-30 11:03:03 +02:00
Volker Theile	8e6838c740	monitoring: SNMP OID per every Prometheus alert rule Use the Ceph enterprise OID 50495 (https://www.iana.org/assignments/enterprise-numbers/enterprise-numbers) and create OIDs for every Prometheus alert rule according to the schema at https://github.com/SUSE/prometheus-webhook-snmp/blob/master/README.md. Example OID: 1.3.6.1.4.1.50495.15.1.2.2.1 All alert rule OIDs are located below the object identifier 15 (15 for p which is the first character of prometheus). Check out the MIB at https://github.com/SUSE/prometheus-webhook-snmp/blob/master/PROMETHEUS-ALERT-CEPH-MIB.txt for more details. Signed-off-by: Volker Theile <vtheile@suse.com>	2019-05-28 09:59:50 +02:00
Jan Fajerski	e7a4437fdc	monitoring: update Grafana dashboards Fix various panels that used outdated metric names, clunky or unnecessary label_replace calls. Also unify the style of many panels. Fixes: http://tracker.ceph.com/issues/39652 Signed-off-by: Jan Fajerski <jfajerski@suse.com>	2019-05-14 13:47:55 +02:00
Jan Fajerski	c0e58bd8ae	monitoring: add a few prometheus alerts Alerts are from https://github.com/SUSE/DeepSea/blob/SES5/srv/salt/ceph/monitoring/prometheus/files/ses_default_alerts.yml but updated for the mgr module and node_exporter >= 0.15. Signed-off-by: Jan Fajerski <jfajerski@suse.com>	2019-04-26 11:21:39 +02:00
Jan Fajerski	287e209351	monitoring/grafana: fix typo in README Signed-off-by: Jan Fajerski <jfajerski@suse.com>	2019-04-16 14:19:51 +02:00
Neha Gupta	739fdbad37	mgr/dashboard: Fixed performance details context for host list row selection Fixes: http://tracker.ceph.com/issues/37854 Signed-off-by: Neha Gupta <gnehapk@gmail.com>	2019-01-18 13:36:49 +09:00
Jason Dillaman	f4ac899950	monitoring/grafana: new RBD overview dashboard page This page pulls RBD stats from the Nautilus prometheus exporter. Signed-off-by: Jason Dillaman <dillaman@redhat.com>	2019-01-11 16:41:46 -05:00
Boris Ranto	1ade714910	cmake: Support grafana dashboard installation We are currently hosting the grafana dashboards in our repo but we do not install them. This patch adds the cmake support. Signed-off-by: Boris Ranto <branto@redhat.com>	2018-10-25 17:09:02 +02:00
Lenz Grimmer	94aefee3b0	Merge pull request #24314 from rhcs-dashboard/dashboards mgr/dashboard: Grafana dashboard updates and additions Reviewed-by: Boris Ranto <branto@redhat.com>	2018-10-19 12:42:23 +02:00
Paul Cuzner	a848411bd8	MGR/dashboard: make grafana datasource selectable Grafana dashboard updated to use a templating variable for the datasource Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2018-10-09 08:23:39 +13:00
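
The FQDN / hostname mismatch described in commit 136d21e21d comes down to reducing two differently shaped instance labels to the same bare hostname before they are matched. A minimal Python sketch of that normalization, for illustration only: the function name `short_hostname` and the sample label values are hypothetical, and the commit itself changes the dashboard panel's query rather than any Python code.

```python
def short_hostname(instance: str) -> str:
    """Reduce an instance label to its bare hostname.

    Assumes hostname-style labels; IP addresses and IPv6 literals
    are not handled by this sketch.
    """
    host = instance.split(":", 1)[0]  # drop a trailing ":port", if any
    return host.split(".", 1)[0]      # drop the domain part, if any


# Hypothetical label values: one FQDN-style, one hostname:port-style.
node_label = "osd-node1.example.com"
ceph_label = "osd-node1:9283"

# After normalization both labels agree, so the two series can be matched.
assert short_hostname(node_label) == short_hostname(ceph_label) == "osd-node1"
```

Normalizing both sides, rather than only one, keeps the match working regardless of which exporter reports the longer form.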

68 Commits
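
The "10% OSDs down" fix in commit 653c3f6682 is easiest to follow with numbers. A minimal Python sketch of the two ratios involved, using the 100-OSD example from the commit message; the function names are hypothetical, and the real alert is a Prometheus rule expression that this sketch does not reproduce.

```python
def percent_up(up: int, total: int) -> float:
    """Share of OSDs that are up, in percent."""
    return 100.0 * up / total


def percent_down(up: int, total: int) -> float:
    """Share of OSDs that are down, in percent."""
    return 100.0 * (total - up) / total


up, total = 88, 100

# Before the fix: the alert fired because fewer than 90% of OSDs were up,
# and the description then reported that "up" value as if it were "down".
assert percent_up(up, total) < 90

# After the fix: the expression computes the down ratio directly, so the
# value shown in the description is the one that is actually meant.
assert percent_down(up, total) >= 10
print(f"{percent_down(up, total):.0f}% or {total - up} of {total} OSDs are down (>=10%).")
```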