Commit Graph

209 Commits

Author SHA1 Message Date
Aashish Sharma
3063c8a4fb
Merge pull request #48783 from rhcs-dashboard/fix-grafana-stat-panel
mgr/dashboard: Replace vonage-status-panel with native grafana stat panel


Reviewed-by: Nizamudeen A <nia@redhat.com>
2023-02-08 18:34:30 +05:30
Pere Diaz Bou
8e07fbd2ea
Merge pull request #48843 from rhcs-dashboard/expose_slow_ops
mgr/prometheus: expose daemon health metrics

Reviewed-by: Anthony D'Atri <anthony.datri@gmail.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2022-12-20 12:25:32 +01:00
Pere Diaz Bou
5a2b7c25b6 mgr/prometheus: expose daemon health metrics
Until now daemon health metrics were stored without being used. One of
the most helpful metrics there is SLOW_OPS with respect to OSDs and MONs
which this commit tries to expose to bring fine grained metrics to find
troublesome OSDs instead of having a lone healthcheck of slow ops in the
whole cluster.

Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-12-20 09:44:49 +01:00
Aashish Sharma
3e08b81b40 mgr/dashboard: Replace vonage-status-panel with native grafana stat panel
Fixes: https://tracker.ceph.com/issues/58295
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2022-12-16 10:51:47 +05:30
Kefu Chai
34e2e33870 *: s/whitelist_externals/allowlist_externals/
as allowlist_externals was introduced in tox v4.0 (see 5e33fda1a4), but
this option was backported to 3.18 as an alias of whitelist_externals, so
we don't need to specify the minversion to 4.0 in this change.

since we started using tox 4.0 and up (v4.0.2 specifically), tox complains
and fails like:

alerts-lint: failed with promtool is not allowed, use allowlist_externals to allow it
  alerts-lint: FAIL code 1 (9.25 seconds)

see https://tox.wiki/en/latest/faq.html#tox-4-removed-tox-ini-keys
and https://tox.wiki/en/latest/config.html#allowlist_externals

it'd also be nice to use more inclusive language. so, in this change,
s/whitelist_externals/allowlist_externals/ in all tox.ini files in this
project.

Signed-off-by: Kefu Chai <tchaikov@gmail.com>
2022-12-08 15:07:00 +08:00
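
Note on 34e2e33870: the rename it describes amounts to a line-anchored substitution in each tox.ini. The following is a minimal sketch of that substitution, not the change itself. The function name and the sample tox.ini content (including the `commands` line) are hypothetical; only the key names, `alerts-lint`, and `promtool` come from the message above.

```python
import re

def rename_tox_key(ini_text: str) -> str:
    """Rename the key removed in tox 4 to its replacement, at line starts only."""
    pattern = r"(?m)^([ \t]*)whitelist_externals([ \t]*=)"
    return re.sub(pattern, r"\1allowlist_externals\2", ini_text)

# Hypothetical tox.ini fragment, for illustration only.
sample = (
    "[testenv:alerts-lint]\n"
    "whitelist_externals =\n"
    "    promtool\n"
    "commands =\n"
    "    promtool check rules prometheus_alerts.yml\n"
)

updated = rename_tox_key(sample)
print(updated)
```

Anchoring the pattern at the start of a line keeps the substitution from touching comments or values that merely mention the old key.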
Nizamudeen A
3f1c1b6376
Merge pull request #48526 from rhcs-dashboard/fix-cephPoolGrowth-alert
mgr/dashboard: Fix CephPoolGrowthWarning alert

Reviewed-by: Pegonzal <NOT@FOUND>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2022-11-29 18:29:01 +05:30
Aashish Sharma
97189b66af mgr/dashboard: Fix CephPoolGrowthWarning alert
Prometheus reports an error - many-to-many matching not allowed: matching labels must be unique on one side for CephPoolGrowthWarning if we have same pool ids on two different instances.

Fixes: https://tracker.ceph.com/issues/58017
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2022-11-22 11:55:41 +05:30
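
Note on 97189b66af: the Prometheus error quoted above appears when neither side of a vector match has unique values for the matching labels. The sketch below is a simplified model of that uniqueness check, not Prometheus' evaluator and not the actual fix. All label sets, instance names, and function names are hypothetical, and matching on `instance` as well is shown only as one possible way to make the keys unique.

```python
from collections import Counter

def duplicate_keys(series, on):
    """Return match keys (tuples of the `on` label values) seen more than once."""
    counts = Counter(tuple(labels.get(name, "") for name in on) for labels in series)
    return sorted(key for key, n in counts.items() if n > 1)

def is_many_to_many(left, right, on):
    """True when both sides have duplicate match keys, so no side is the 'one' side."""
    return bool(duplicate_keys(left, on)) and bool(duplicate_keys(right, on))

# Hypothetical label sets: the same pool_id reported by two different instances.
pool_usage = [
    {"pool_id": "1", "instance": "mgr-a:9283"},
    {"pool_id": "1", "instance": "mgr-b:9283"},
]
pool_metadata = [
    {"pool_id": "1", "name": "rbd", "instance": "mgr-a:9283"},
    {"pool_id": "1", "name": "rbd", "instance": "mgr-b:9283"},
]

for on in (["pool_id"], ["pool_id", "instance"]):
    verdict = "many-to-many" if is_many_to_many(pool_usage, pool_metadata, on) else "ok"
    print(on, verdict)
```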
Tatjana Dehler
08352b6540
ceph-mixin: fix ceph_hosts variable
Only use `instance` to query for hostnames in single-cluster-mode.
Consider the cluster matcher only in multi-cluster-mode. In this case
the query will look like:
`"label_values({cluster=~\"$cluster\"}, instance)"`.

Fixes: https://tracker.ceph.com/issues/57987
Signed-off-by: Tatjana Dehler <tdehler@suse.com>
2022-11-11 16:35:05 +01:00
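
Note on 08352b6540: the mode-dependent query it describes can be sketched as a small selector. The multi-cluster string is the one quoted in the message above. The single-cluster form `label_values(instance)` and the function name are assumptions made for illustration; the message does not spell them out.

```python
def ceph_hosts_query(multi_cluster: bool) -> str:
    """Return the Grafana template query used to list host instances."""
    if multi_cluster:
        # Query quoted in the commit message: filter by the selected cluster.
        return 'label_values({cluster=~"$cluster"}, instance)'
    # Assumed single-cluster form: no cluster matcher at all.
    return "label_values(instance)"

print(ceph_hosts_query(multi_cluster=False))
print(ceph_hosts_query(multi_cluster=True))
```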
Christian Kugler
4aecdad350
ceph-mixin: Add Prometheus Alert for Degraded Bond
Currently there is no alert for a misconfigured or failed network interface
card that is part of a network bond.

This could lead to redundancy and performance being degraded unnoticed.

To solve this, I use node exporter metrics to look at the number of total peers
of the bond and the ones that are active. If the numbers differ, something is up
and should be looked at.

Fixes: https://tracker.ceph.com/issues/57962
Signed-off-by: Christian Kugler <syphdias+git@gmail.com>
2022-11-02 14:48:57 +01:00
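
Note on 4aecdad350: the comparison it describes (total peers of a bond versus active ones) can be sketched as below. This is an illustration of the idea, not the alert rule itself. The sample values and function name are hypothetical, and it is an assumption that the inputs correspond to node exporter's `node_bonding_slaves` and `node_bonding_active` series keyed by bond name.

```python
def degraded_bonds(total_peers, active_peers):
    """Return the bonds whose active peer count differs from the configured total."""
    return sorted(
        bond
        for bond, total in total_peers.items()
        # A bond missing from the active map is treated as having no active peers.
        if active_peers.get(bond, 0) != total
    )

# Hypothetical samples keyed by bond name.
total = {"bond0": 2, "bond1": 2}
active = {"bond0": 2, "bond1": 1}

print(degraded_bonds(total, active))
```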
zdover23
23aa2be306
Merge pull request #47305 from zdover23/wip-doc-2022-07-25-pr4600-cleanup
doc/monitoring: add min vers of apps in mon stack

Reviewed-by: Anthony D'Atri <anthony.datri@gmail.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
2022-09-13 13:44:43 +10:00
Nizamudeen A
d84a03e989
Merge pull request #47700 from s0nea/wip-rgw-overview-labels
monitoring/ceph-mixin: add RGW host to label info

Reviewed-by: MrFreezeex <NOT@FOUND>
Reviewed-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-09-09 17:36:40 +05:30
Tatjana Dehler
15fa97d49d
monitoring/ceph-mixin: add RGW host to label info
Add the missing information about the RGW instance to the labels of the
"Average GET/PUT Latencies" panel on the "RGW Overview" dashboard.

Fixes: https://tracker.ceph.com/issues/57166
Signed-off-by: Tatjana Dehler <tdehler@suse.com>
2022-09-06 16:19:19 +02:00
Zac Dover
367695f5b0 doc/monitoring: add min vers of apps in mon stack
https://tracker.ceph.com/issues/45447

This PR adds recommended versions of grafana and
prometheus and alert manager.

This PR is a second attempt at getting the information
in the following PR into the docs:
https://github.com/ceph/ceph/pull/46000/files

Himadri Maheshwari deserves the credit for the work
in this commit.

Signed-off-by: Zac Dover <zac.dover@gmail.com>
Signed-off-by: Himadri Maheshwari <himadri.maheshwari7915@gmail.com>
2022-09-05 07:36:52 +10:00
Arthur Outhenin-Chalandre
f744a93ef1
Merge pull request #47707 from bosc0/fix_alert
Ceph-mixin: Fix CephNodeNetworkPacket alerts
2022-08-30 12:49:23 +02:00
Arthur Outhenin-Chalandre
4909e795c9
Merge pull request #47669 from MrFreezeex/jb-path
ceph-mixin: fix PATH issues with jsonnet-bundler
2022-08-30 08:35:04 +02:00
Aswin Toni
351e1ac639 ceph-mixin: fix CephNodeNetworkPacket alerts
Signed-off-by: Aswin Toni <aswin.toni@cern.ch>
2022-08-23 15:26:52 +02:00
Tatjana Dehler
42ff9370a0
monitoring/ceph-mixin: add entries to envlist
Add the missing entries `jsonnet-bundler-install` and
`jsonnet-bundler-update` to envlist.

Signed-off-by: Tatjana Dehler <tdehler@suse.com>
2022-08-19 12:08:56 +02:00
Aswin Toni
35183140f6 ceph-mixin: fix config inheritance
Signed-off-by: Aswin Toni <aswin.toni@cern.ch>
2022-08-18 16:21:36 +02:00
Arthur Outhenin-Chalandre
d46e14c71b
ceph-mixin: fix PATH issues with jsonnet-bundler
In 4a3afcf, the $PATH is set for the test, but we cannot set multiple
properties with a single `set_property()` cmake command. We fix that by
adding the installation path of jsonnet-bundler
(CMAKE_CURRENT_BINARY_DIR) to the $PATH used for every tox test.

Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
Co-Authored-By: Kefu Chai <tchaikov@gmail.com>
2022-08-18 13:43:34 +02:00
Aswin Toni
2e0e684fc2 ceph-mixin: Remove jsonnet building
Signed-off-by: Aswin Toni <aswin.toni@cern.ch>
2022-08-17 12:08:56 +02:00
Aswin Toni
5cdc1c62c5 prometheus: add multicluster support to alerts
Signed-off-by: Aswin Toni <aswin.toni@cern.ch>
2022-08-17 12:08:56 +02:00
Kefu Chai
4a3afcf277 cmake: set $PATH for tests using jsonnet tools
otherwise they would not be able to find executables installed into
${CMAKE_CURRENT_BINARY_DIR}.

Signed-off-by: Kefu Chai <tchaikov@gmail.com>
2022-08-16 10:53:29 +08:00
Nizamudeen A
e9d361f621
Merge pull request #47334 from s0nea/wip-osd-objectstore-types-fix
monitoring/ceph-mixin: OSD overview typo fix

Reviewed-by: MrFreezeex <NOT@FOUND>
Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2022-08-01 13:47:03 +05:30
Anthony D'Atri
9b65974468 monitoring/ceph-mixin: clean up prometheus_alerts.yml
Signed-off-by: Anthony D'Atri <anthonyeleven@users.noreply.github.com>
2022-07-28 19:17:51 -07:00
Tatjana Dehler
8faaca2082
monitoring/ceph-mixin: OSD overview typo fix
Correct a wrongly set bracket on ceph-dashboard -> OSD Overview ->
OSD Objectstore Types resulting in a parser error.

Fixes: https://tracker.ceph.com/issues/56948
Signed-off-by: Tatjana Dehler <tdehler@suse.com>
2022-07-28 15:15:32 +02:00
Arthur Outhenin-Chalandre
37add644d1
ceph-mixin: remove timepicker override in every dashboard
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-05-24 11:54:26 +02:00
Arthur Outhenin-Chalandre
5db37300fd
ceph-mixin: rationalize local helper functions to utils
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-05-24 11:50:49 +02:00
Arthur Outhenin-Chalandre
0b7cc6bc99
ceph-mixin: fix typos
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-05-18 10:02:54 +02:00
Arthur Outhenin-Chalandre
c8f086c182
ceph-mixin: fix test with rate and label changes
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-05-17 09:42:29 +02:00
Arthur Outhenin-Chalandre
3b6356c872
ceph-mixin: don't add cluster matcher if showcluster is disabled
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-05-17 09:41:21 +02:00
Arthur Outhenin-Chalandre
fd4f484d22
ceph-mixin: refactor the structure of _config and utils
Before this refactor we couldn't override the config externally. Now the
_config is correctly propagated and not only taken from the
config.libsonnet file.

Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-05-16 15:26:56 +02:00
Arthur Outhenin-Chalandre
4595e9af23
ceph-mixin: fix makefile dashboards dependency
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-05-16 15:26:55 +02:00
Arthur Outhenin-Chalandre
faeea8d165
ceph-mixin: fix linting issue and add cluster template support
Fix most of the issues reported by dashboards-linter:
- Add matcher/template for job (and also cluster)
- use $__rate_interval everywhere

This also changes all the irate functions to rate, as most uses of irate were
not actually correct. When using irate on a graph, for instance, you
can easily miss some of the metric values, as irate only takes the last two
values and the query step can be quite large if you want a graph
for a few hours, a day, or more.

Fixes: https://tracker.ceph.com/issues/55003
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>

ceph-mixin: add config with matchers and tags

Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-05-16 15:26:53 +02:00
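
Note on faeea8d165: the irate versus rate argument can be illustrated with a simplified model. The sketch below ignores counter resets and the extrapolation Prometheus applies, so it is not Prometheus' actual implementation. The samples are hypothetical `(timestamp_seconds, counter_value)` pairs from one query window.

```python
def rate(samples):
    """Average per-second increase between the first and last sample in the window."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

def irate(samples):
    """Per-second increase computed from only the last two samples in the window."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Hypothetical counter scraped every 15 s: a burst early in the window, then idle.
window = [(0, 0), (15, 900), (30, 1800), (45, 1800), (60, 1800)]

print(rate(window))   # the burst shows up in the window average
print(irate(window))  # the burst is invisible: the last two samples are equal
```

With a large query step, each plotted point covers one such window, so activity that ended before the last two samples never appears on an irate graph.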
Arthur Outhenin-Chalandre
1452311a9b
ceph-mixin: rewrite promql queries to multiline
Fixes: https://tracker.ceph.com/issues/55005
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-04-27 17:55:52 +02:00
Aashish Sharma
2877920f58 mgr/dashboard: upgrade grafana pie-chart and vonage-status-panel versions
Fixes: https://tracker.ceph.com/issues/55195
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2022-04-06 15:24:41 +05:30
Ernesto Puerta
8721bd6c5d
monitoring/grafana: fix version
Fixes: https://tracker.ceph.com/issues/55172
Signed-off-by: Ernesto Puerta <epuertat@redhat.com>
2022-04-04 13:52:43 +02:00
Ernesto Puerta
a98c2475c6
Merge pull request #45254 from travisn/prometheus-rules-typos
prometheus: Spell check the alert descriptions

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Laura Flores <lflores@redhat.com>
Reviewed-by: Michael Fritch <mfritch@suse.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
Reviewed-by: sunilangadi2 <NOT@FOUND>
Reviewed-by: Travis Nielsen <tnielsen@redhat.com>
2022-04-04 13:46:00 +02:00
David Galloway
b4910a6627
Merge pull request #45739 from rhcs-dashboard/fix-55155-master
grafana/Makefile: don't push to docker
2022-04-01 13:30:05 -04:00
Ernesto Puerta
7e6309fac3
grafana/Makefile: don't push to docker
Fixes: https://tracker.ceph.com/issues/55155
Signed-off-by: Ernesto Puerta <epuertat@redhat.com>
2022-04-01 11:44:43 +02:00
Ernesto Puerta
2d1c480f5a
Merge pull request #45583 from p-se/monitoring-alert-mtu-group-by-devices
mgr/dashboard: Compare values of MTU alert by device

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
Reviewed-by: p-se <NOT@FOUND>
2022-04-01 11:11:30 +02:00
Ernesto Puerta
87f494eda0
Merge pull request #45578 from rhcs-dashboard/fix-grafana-build
mgr/dashboard: remove transition-through-oci image workaround in grafana build

Reviewed-by: Dan Mick <dmick@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2022-03-31 19:58:29 +02:00
Travis Nielsen
9cca95b16a
prometheus: spell check the alert descriptions
Signed-off-by: Travis Nielsen <tnielsen@redhat.com>
2022-03-30 17:38:43 -06:00
Ernesto Puerta
043f7953d8
Merge pull request #45335 from rhcs-dashboard/fix-54513-master
mgr/dashboard: Pool overall performance shows multiple entries of same pool in pool overview

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Pere Diaz Bou <pdiazbou@redhat.com>
Reviewed-by: sunilangadi2 <NOT@FOUND>
2022-03-30 14:05:38 +02:00
Aashish Sharma
9719cc795e mgr/dashboard: Pool overall performance shows multiple entries of same pool in pool overview
This PR intends to fix this issue

Fixes: https://tracker.ceph.com/issues/54513
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2022-03-28 18:25:25 +05:30
Aashish Sharma
49d6068463
mgr/dashboard: fix promtool test for mtu alert
Fixes: https://tracker.ceph.com/issues/55004
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2022-03-28 13:39:38 +02:00
Patrick Seidensal
3821548a37
mgr/dashboard: Compare values of MTU alert by device
Fixes: https://tracker.ceph.com/issues/55004

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2022-03-28 13:38:15 +02:00
Aashish Sharma
64b0e5ce8a mgr/dashboard: fix transition-through-oci image workaround in grafana build
Fixes: https://tracker.ceph.com/issues/54311
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2022-03-23 13:59:28 +05:30
Aashish Sharma
c306778889 mgr/dashboard/monitoring: update grafana version
Fixes: https://tracker.ceph.com/issues/54311

Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2022-03-21 17:40:03 +05:30
Rishabh Dave
a6f5efb620 monitoring: mention PyYAML only once in requirements
Following error occurs while running "sudo install-deps.sh" -
ERROR: Double requirement given: PyYAML==6.0 (from -r requirements-lint.txt (line 5)) (already in pyyaml (from -r requirements-alerts.txt (line 1)), name='PyYAML')

PyYAML is mentioned twice as a requirement, once in each of
the following files -
monitoring/ceph-mixin/requirements-lint.txt
monitoring/ceph-mixin/requirements-alerts.txt

These requirements were added in commits
44d3e4c264 and
4750ac0d77.

Fixes: https://tracker.ceph.com/issues/54185
Signed-off-by: Rishabh Dave <ridave@redhat.com>
2022-02-08 11:19:15 +05:30
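
Note on a6f5efb620: a double requirement like the one in the error above can be detected by normalizing project names across requirement files. The sketch below is a hypothetical helper, not part of install-deps.sh or pip. The file contents are reduced to the two entries named in the error message, and the name normalization (lowercase, runs of `-`, `_`, `.` collapsed to `-`) follows the usual Python packaging convention.

```python
import re

def requirement_name(line):
    """Return the normalized project name of a requirement line, or None."""
    line = line.split("#", 1)[0].strip()
    if not line or line.startswith("-"):
        return None  # blank line, comment, or an option such as "-r other.txt"
    match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
    if not match:
        return None
    return re.sub(r"[-_.]+", "-", match.group(0)).lower()

def duplicated_requirements(files):
    """Map each project name listed more than once to the files that list it."""
    seen = {}
    for filename, text in files.items():
        for line in text.splitlines():
            name = requirement_name(line)
            if name:
                seen.setdefault(name, []).append(filename)
    return {name: names for name, names in seen.items() if len(names) > 1}

# Hypothetical, reduced file contents based on the error message.
files = {
    "requirements-lint.txt": "PyYAML==6.0\n",
    "requirements-alerts.txt": "pyyaml\n",
}

print(duplicated_requirements(files))
```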
Nizamudeen A
27592b7561 cephadm: change shared_folder directory for prometheus and grafana
After https://github.com/ceph/ceph/pull/44059 the monitoring/prometheus
and monitoring/grafana/dashboards directories are changed to
monitoring/ceph-mixins. That broke the shared_folders in the cephadm
bootstrap script.

Changed all the instances of monitoring/prometheus and
monitoring/grafana/dashboards to monitoring/ceph-mixins

Also, renaming all the instances of prometheus_alerts.yaml to
prometheus_alerts.yml.

Fixes: https://tracker.ceph.com/issues/54176
Signed-off-by: Nizamudeen A <nia@redhat.com>
2022-02-07 16:34:37 +05:30