Commit Graph

161 Commits

Author SHA1 Message Date
Rishabh Dave
a6f5efb620 monitoring: mention PyYAML only once in requirements
The following error occurs while running "sudo install-deps.sh":
ERROR: Double requirement given: PyYAML==6.0 (from -r requirements-lint.txt (line 5)) (already in pyyaml (from -r requirements-alerts.txt (line 1)), name='PyYAML')

PyYAML is listed as a requirement twice, once in each of
the following files:
monitoring/ceph-mixin/requirements-lint.txt
monitoring/ceph-mixin/requirements-alerts.txt

These requirements were added in commits
44d3e4c264 and
4750ac0d77.

Fixes: https://tracker.ceph.com/issues/54185
Signed-off-by: Rishabh Dave <ridave@redhat.com>
2022-02-08 11:19:15 +05:30
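The duplicate requirement reported in a6f5efb620 can be caught before pip sees it. The following is a minimal sketch, not part of the commit: the helper names `requirement_name` and `find_duplicates` are hypothetical, and the file contents are placeholders shaped after the error message (only the `pyyaml` and `PyYAML==6.0` lines and their line numbers come from it).

```python
import re
from collections import defaultdict


def requirement_name(line):
    """Return the normalized package name of a requirement line, or None."""
    line = line.split("#", 1)[0].strip()
    if not line or line.startswith("-"):
        return None  # blank line, comment, or an option such as "-r other.txt"
    # The name ends at the first version specifier, extra, marker or space.
    name = re.split(r"[<>=!~;\[\s]", line, maxsplit=1)[0]
    # Package names compare case-insensitively; '-', '_' and '.' are equivalent.
    return re.sub(r"[-_.]+", "-", name).lower()


def find_duplicates(files):
    """Map each package listed more than once to its (file, line) locations."""
    seen = defaultdict(list)
    for filename, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            name = requirement_name(line)
            if name:
                seen[name].append((filename, lineno))
    return {name: locs for name, locs in seen.items() if len(locs) > 1}


# Placeholder contents: "pkg-a" .. "pkg-d" are invented filler lines.
files = {
    "requirements-alerts.txt": "pyyaml\n",
    "requirements-lint.txt": "pkg-a\npkg-b\npkg-c\npkg-d\nPyYAML==6.0\n",
}
duplicates = find_duplicates(files)
print(duplicates)
```

Normalizing the name first matters here, because the two files spell the package differently (`pyyaml` and `PyYAML`).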
Nizamudeen A
27592b7561 cephadm: change shared_folder directory for prometheus and grafana
After https://github.com/ceph/ceph/pull/44059, the monitoring/prometheus
and monitoring/grafana/dashboards directories were moved to
monitoring/ceph-mixin. That broke the shared_folders in the cephadm
bootstrap script.

Changed all instances of monitoring/prometheus and
monitoring/grafana/dashboards to monitoring/ceph-mixin.

Also renamed all instances of prometheus_alerts.yaml to
prometheus_alerts.yml.

Fixes: https://tracker.ceph.com/issues/54176
Signed-off-by: Nizamudeen A <nia@redhat.com>
2022-02-07 16:34:37 +05:30
Ernesto Puerta
6a4b1e148d
Merge pull request #44796 from pcuzner/remove-old-mib
monitoring: remove old MIB

Reviewed-by: Alfonso Martínez <almartin@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
Reviewed-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-02-04 17:42:08 +01:00
Arthur Outhenin-Chalandre
8ff1e6b399
monitoring: build jsonnet/jb only for testing
Build jsonnet and jb in the tests so that we can build ceph without
internet access and still be able to run the tests needed for monitoring
using jsonnet tools.

Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-02-03 13:08:37 +01:00
Arthur Outhenin-Chalandre
ecaf9070ae
spec: debian: monitoring: build jsonnet from source to use 0.18.0
As this new version was only recently released, it is not yet in every
distro we use. We now build jsonnet from source so that we can use this
new version of jsonnet. This commit could be reverted later on, once the
new version is available everywhere.

Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-02-03 13:08:36 +01:00
Arthur Outhenin-Chalandre
98236e3a1d
mgr/dashboard: monitoring: refactor into ceph-mixin
A mixin is a way to bundle dashboards, prometheus rules and alerts into
a jsonnet package. Shifting to a mixin will allow easier integration with
the monitoring automation that some users may use.

This commit moves `/monitoring/grafana/dashboards` and
`/monitoring/prometheus` to `/monitoring/ceph-mixin`. The Prometheus alerts
were also converted to Jsonnet in an automated way (from yaml to json
to jsonnet). This commit minimises any change made to the generated files
and should change neither the dashboards nor the Prometheus alerts.

In the future, some configuration will also be added to jsonnet to add
more functionality to the dashboards or alerts (e.g. multi-cluster).

Fixes: https://tracker.ceph.com/issues/53374
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-02-03 13:08:20 +01:00
Ernesto Puerta
c47ace9215
Merge pull request #43707 from BenoitKnecht/ceph-mgr-service-id
mgr: Fix ceph_daemon label in ceph_rgw_* metrics

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-02-02 18:39:57 +01:00
Paul Cuzner
cbeab5c566 monitoring: remove old MIB
The MIB file that matches the OID definitions in the alerts is
CEPH-MIB.txt. The old MIB, from the original SUSE SNMP
gateway work, therefore needs to be removed to avoid
confusion.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2022-01-27 11:24:34 +13:00
Pere Diaz Bou
57c26311de monitoring/grafana: replace filestore osd count
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-01-18 14:14:41 +01:00
Pere Diaz Bou
a3cf5c5e9f monitoring/grafana: use Path class instead of split
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-01-18 13:24:12 +01:00
Pere Diaz Bou
1e4d85d04f monitoring/grafana: remove explicit str casting
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-01-18 13:24:12 +01:00
Pere Diaz Bou
2b4f3561d2 monitoring/grafana: add generated json files
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-01-18 13:24:12 +01:00
Pere Diaz Bou
b381a83e9b monitoring/grafana: ValueError instead of RuntimeError
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-01-18 13:24:12 +01:00
Pere Diaz Bou
4c302234ff monitoring/grafana: Replace missing legendFormat warning with error
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-01-18 13:24:10 +01:00
Patrick Seidensal
7d7488018e monitoring: Add unit tests for OSD panels in ceph-cluster dashboard
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2022-01-13 13:27:55 +01:00
Patrick Seidensal
4a6b2c1dfb monitoring: fix display ceph_osd_in in Grafana panel
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2022-01-13 13:27:55 +01:00
Patrick Seidensal
18d3a71618 mgr/prometheus: Fix regression with OSD/host details/overview dashboards
Fix issues with PromQL expressions and vector matching with the
`ceph_disk_occupation` metric.

As it turns out, `ceph_disk_occupation` cannot simply be used as
expected, as there seem to be some edge cases for users that have
several OSDs on a single disk.  This leads to issues which cannot be
solved by PromQL alone (many-to-many PromQL errors).  In some rare
cases, the data is simply different from what we expected.

I have not found a PromQL-only solution to this issue. What we basically
need is the following.

1. Match on labels `host` and `instance` to get one or more OSD names
   from a metadata metric (`ceph_disk_occupation`) to let a user know
   about which OSDs belong to which disk.

2. Match on labels `ceph_daemon` of the `ceph_disk_occupation` metric,
   in which case the value of `ceph_daemon` must not refer to more than
   a single OSD. The exact opposite to requirement 1.

As both operations are currently performed on a single metric, and there
is no way to satisfy both requirements on a single metric, the intention
of this commit is to extend the metric by providing a similar metric
that satisfies one of the requirements. This enables the queries to
differentiate between a vector matching operation to show a string to
the user (where `ceph_daemon` could possibly be `osd.1` or
`osd.1+osd.2`) and to match a vector by having a single `ceph_daemon` in
the condition for the matching.

Although the `ceph_daemon` label is used on a variety of daemons, only
OSDs seem to be affected by this issue (only if more than one OSD is run
on a single disk).  This means that only the `ceph_disk_occupation`
metadata metric seems to need to be extended and provided as two
metrics.

`ceph_disk_occupation` is supposed to be used for matching the
`ceph_daemon` label value.

    foo * on(ceph_daemon) group_left ceph_disk_occupation

`ceph_disk_occupation_human` is supposed to be used for anything where
the resulting data is displayed to be consumed by humans (graphs, alert
messages, etc).

    foo * on(device,instance)
    group_left(ceph_daemon) ceph_disk_occupation_human

Fixes: https://tracker.ceph.com/issues/52974

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2022-01-13 13:27:55 +01:00
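The two conflicting requirements in 18d3a71618 can be illustrated with a toy model of PromQL's `on(...) group_left(...)` matching. This is a simplified sketch, not Prometheus code: `join_group_left` is a hypothetical helper, and the samples (two OSDs sharing disk `sdb` on `host1`, plus the label sets of the per-OSD and per-disk metrics) are assumptions made for illustration.

```python
def join_group_left(left, right, on, copy=()):
    """Toy model of `left * on(<on>) group_left(<copy>) right`.

    Samples are (labels, value) pairs. The right-hand side is the "one"
    side, so it may hold at most one sample per match key.
    """
    index = {}
    for labels, value in right:
        key = tuple(labels.get(name) for name in on)
        if key in index:
            raise ValueError("many-to-many matching not allowed for %r" % (key,))
        index[key] = (labels, value)

    result = []
    for labels, value in left:
        key = tuple(labels.get(name) for name in on)
        if key not in index:
            continue  # samples without a partner are dropped
        right_labels, right_value = index[key]
        merged = dict(labels)
        for name in copy:
            merged[name] = right_labels[name]
        result.append((merged, value * right_value))
    return result


# Assumed shape: one sample per OSD for matching on `ceph_daemon` ...
occupation = [
    ({"ceph_daemon": "osd.1", "device": "sdb", "instance": "host1"}, 1),
    ({"ceph_daemon": "osd.2", "device": "sdb", "instance": "host1"}, 1),
]
# ... and one sample per disk, with a combined label, for display.
occupation_human = [
    ({"ceph_daemon": "osd.1+osd.2", "device": "sdb", "instance": "host1"}, 1),
]
osd_metric = [({"ceph_daemon": "osd.1"}, 5.0), ({"ceph_daemon": "osd.2"}, 7.0)]
disk_metric = [({"device": "sdb", "instance": "host1"}, 0.25)]

# Requirement 2: match on a single `ceph_daemon` per sample.
per_osd = join_group_left(osd_metric, occupation, on=("ceph_daemon",))

# Requirement 1: fetch a display string for a disk.
for_humans = join_group_left(
    disk_metric, occupation_human, on=("device", "instance"), copy=("ceph_daemon",)
)

# The combined label never equals a single OSD name, so nothing matches.
no_match = join_group_left(osd_metric, occupation_human, on=("ceph_daemon",))

# The per-OSD metric has two samples for the same disk, so this join fails.
try:
    join_group_left(disk_metric, occupation, on=("device", "instance"))
    many_to_many = False
except ValueError:
    many_to_many = True
```

In this model neither metric can serve both joins, which is the reason the commit gives for providing two metrics.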
Benoît Knecht
2daaa052ea monitoring/grafana: Add tests for radosgw panels
Some of the expressions modified in c40290390d7 were not covered by any tests,
especially those in the `radosgw-detail.json` dashboard.

This commit fills in those gaps.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
2022-01-11 13:17:48 +01:00
Benoît Knecht
adc36dea7f monitoring/grafana: Update radosgw dashboards
With the `ceph_daemon` label now replaced by `instance_id` on all `ceph_rgw_*`
metrics, we need to update the Grafana dashboards to get that label back from
`ceph_rgw_metadata` using this type of construct:

```
ceph_rgw_req * on (instance_id) group_left(ceph_daemon) ceph_rgw_metadata
```

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
2022-01-11 13:17:20 +01:00
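The construct in adc36dea7f amounts to a lookup of `ceph_daemon` by `instance_id`. The sketch below models that lookup in plain Python; `restore_ceph_daemon` is a hypothetical helper and all label values are invented. It assumes the metadata samples have the value 1, so the multiplication leaves the request counter unchanged.

```python
def restore_ceph_daemon(rgw_samples, rgw_metadata):
    """Copy `ceph_daemon` from metadata onto samples sharing an `instance_id`."""
    daemon_by_instance = {
        labels["instance_id"]: labels["ceph_daemon"] for labels in rgw_metadata
    }
    restored = []
    for labels, value in rgw_samples:
        daemon = daemon_by_instance.get(labels["instance_id"])
        if daemon is None:
            continue  # no metadata partner: the sample is dropped
        restored.append(({**labels, "ceph_daemon": daemon}, value))
    return restored


# Invented example data.
ceph_rgw_metadata = [
    {"instance_id": "a1", "ceph_daemon": "rgw.example.host1"},
    {"instance_id": "b2", "ceph_daemon": "rgw.example.host2"},
]
ceph_rgw_req = [
    ({"instance_id": "a1"}, 120.0),
    ({"instance_id": "b2"}, 80.0),
    ({"instance_id": "zz"}, 3.0),  # has no metadata sample
]

restored = restore_ceph_daemon(ceph_rgw_req, ceph_rgw_metadata)
for labels, value in restored:
    print(labels["ceph_daemon"], value)
```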
Ernesto Puerta
978d5829f2
Merge pull request #44294 from rhcs-dashboard/feature-bluestore-onode
mgr/dashboard: monitoring:Implement BlueStore onode hit/miss counters into the dashboard

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Alfonso Martínez <almartin@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Laura Flores <lflores@redhat.com>
Reviewed-by: neha-ojha <NOT@FOUND>
Reviewed-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-01-11 11:24:21 +01:00
Aashish Sharma
15aa4dffa9 mgr/dashboard: monitoring:Implement BlueStore onode hit/miss counters into the dashboard
Provide the details pulled from BlueStore stats in order to display the onode hit/miss counters.

Fixes: https://tracker.ceph.com/issues/53577
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2022-01-05 14:22:53 +05:30
Ernesto Puerta
cdc9f742df
Merge pull request #44190 from rhcs-dashboard/grafana-regex
monitoring/grafana: improve grafana unit tests variable substitution

Reviewed-by: Alfonso Martínez <almartin@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
Reviewed-by: Pere Diaz Bou <pdiazbou@redhat.com>
2021-12-21 17:58:17 +01:00
Pere Diaz Bou
bbbdf8e6a2 monitoring/grafana: doctest util regex
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2021-12-15 09:36:08 +01:00
Pere Diaz Bou
2286ddc1c2 monitoring/grafana: rename tox promql test
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2021-12-14 09:36:23 +01:00
Pere Diaz Bou
5ebdb746e8 monitoring/grafana: improve grafana unit tests variable substitution
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2021-12-14 09:36:23 +01:00
Ernesto Puerta
d10b0b7e72
mgr/dashboard: disable Promql test in ARM
Temporarily disable this test while debugging the issue (since https://github.com/ceph/ceph/pull/43669
originally passed the ARM check).

Fixes: https://tracker.ceph.com/issues/53451
Signed-off-by: Ernesto Puerta <epuertat@redhat.com>
2021-12-13 20:20:44 +01:00
Avan Thakkar
8d83126e51 mgr/dashboard: introduce HAProxy metrics for RGW
Fixes: https://tracker.ceph.com/issues/53311
Signed-off-by: Avan Thakkar <athakkar@redhat.com>
2021-12-09 20:03:03 +05:30
Pere Diaz Bou
44d3e4c264 monitoring/grafana: Grafana query tester
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2021-11-16 10:30:49 +01:00
Paul Cuzner
7ffcbd7f79 mgr/prometheus: Update rule format and enhance SNMP support
Rules now adhere to the format defined by Prometheus.io.
This changes alert naming, and each alert now includes a
summary description to provide a quick one-liner.

In addition to the reformatting, some missing alerts for MDS and
cephadm have been added, along with corresponding tests.

The MIB has also been refactored so that it now passes standard
lint tests, and a README is included for devs to understand the
OID schema.

Fixes: https://tracker.ceph.com/issues/53111

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2021-11-05 11:24:25 +13:00
Sebastian Wagner
aae2ea3897
Merge pull request #43293 from pcuzner/granular-alerts
mgr/prometheus: expose ceph healthchecks as metrics

Reviewed-by: Boris Ranto <branto@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Sebastian Wagner <sewagner@redhat.com>
2021-10-29 00:23:24 +02:00
Pere Diaz Bou
e1bc6f24ff monitoring: ethernet bonding filter in Network Load
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2021-10-27 09:08:20 +02:00
Paul Cuzner
37b82b8793 mgr/prometheus: remove cmake tests
Temporary removal of the cmake test integration

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2021-10-27 09:58:17 +13:00
Sebastian Wagner
b830c555d2 monitoring/prometheus: Add cmake integration
Signed-off-by: Sebastian Wagner <sewagner@redhat.com>
2021-10-22 13:37:31 +13:00
Paul Cuzner
4750ac0d77 mgr/prometheus: add test cases and validation using tox
Focus all tests inside a tests directory, and use pytest/tox to
perform validation of the overall content. tox tests also use
promtool, if available, to provide rule checks and unittest runs.

In addition to these checks, a validate_rules script provides
format and content checks against all rules. It is also
called via tox (but can be run independently too).

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2021-10-22 13:36:40 +13:00
Paul Cuzner
e0dfc02063 mgr/prometheus: track individual healthchecks as metrics
This patch creates a health history object maintained in
the module's kvstore.  The history and current health
checks are used to create a metric per healthcheck whilst
also providing a history feature. Two new commands are added:
ceph healthcheck history ls
ceph healthcheck history clear

In addition to the new commands, the additional metrics
have been used to update the prometheus alerts.

Fixes: https://tracker.ceph.com/issues/52638

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2021-10-22 13:32:39 +13:00
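The idea behind e0dfc02063, a persisted history that yields one metric per healthcheck, can be sketched as follows. This is an illustration under assumptions, not the module's code: the class `HealthHistory`, its field names, the storage key and the dict standing in for the kvstore are all hypothetical.

```python
import json
import time


class HealthHistory:
    """Track which healthchecks have fired, persisted in a key/value store."""

    KEY = "health_history"  # hypothetical storage key

    def __init__(self, store):
        self.store = store  # any dict-like object stands in for the kvstore
        raw = store.get(self.KEY)
        self.checks = json.loads(raw) if raw else {}

    def update(self, active_checks, now=None):
        """Record the currently active checks and persist the history."""
        now = time.time() if now is None else now
        for name in active_checks:
            entry = self.checks.setdefault(
                name, {"first_seen": now, "count": 0, "active": False}
            )
            if not entry["active"]:
                entry["count"] += 1  # count each inactive-to-active transition
            entry["active"] = True
            entry["last_seen"] = now
        for name, entry in self.checks.items():
            if name not in active_checks:
                entry["active"] = False
        self.store[self.KEY] = json.dumps(self.checks)

    def metrics(self):
        """One 0/1 value per healthcheck ever seen."""
        return {name: int(entry["active"]) for name, entry in self.checks.items()}

    def clear(self):
        """Forget the history, as a 'history clear' command might."""
        self.checks = {}
        self.store[self.KEY] = json.dumps(self.checks)


store = {}
history = HealthHistory(store)
history.update({"OSD_DOWN"}, now=1.0)
history.update(set(), now=2.0)        # the check has cleared
history.update({"OSD_DOWN"}, now=3.0)  # and fired again
reloaded = HealthHistory(store)        # survives a restart via the store
print(reloaded.metrics())
```

Keeping cleared checks in the history with a value of 0 is one way to let an alert expression see a check that is no longer active; whether the module does exactly this is not stated in the commit message.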
Aashish Sharma
ed954b0e6c mgr/dashboard: monitoring: grafonnet refactoring for cephfs dashboards
This PR intends to refactor cephfs dashboards using grafonnet

Fixes: https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-10-19 12:36:31 +05:30
Aashish Sharma
e490e2f3ab mgr/dashboard: monitoring: grafonnet refactoring for osds dashboards
This PR intends to refactor osds dashboards using grafonnet

Fixes: https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-10-19 12:13:50 +05:30
Aashish Sharma
8c48821c21 mgr/dashboard: monitoring: grafonnet refactoring for pools dashboards
This PR intends to refactor pools dashboards using grafonnet

Fixes: https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-10-19 12:10:56 +05:30
Aashish Sharma
e737aaa000 mgr/dashboard: monitoring: grafonnet refactoring for rbd dashboards
This PR intends to refactor rbd dashboards using grafonnet

Fixes: https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-10-19 12:09:04 +05:30
Aashish Sharma
eb01954cd9 mgr/dashboard: monitoring: grafonnet refactoring for radosgw dashboards
This PR intends to refactor radosgw dashboards using grafonnet

Fixes: https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-10-19 11:57:28 +05:30
Ernesto Puerta
19535b1d0e
Merge pull request #43469 from rhcs-dashboard/hosts-grafana-dashboards
mgr/dashboard: monitoring: grafonnet refactoring for hosts dashboards

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2021-10-18 17:14:03 +02:00
Ernesto Puerta
9b40c9df26
Merge pull request #43377 from rhcs-dashboard/fix-clients-connection-query
mgr/dashboard: replace "Ceph-cluster" Client connections with active-standby MGRs

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Greg Farnum <gfarnum@redhat.com>
Reviewed-by: neha-ojha <NOT@FOUND>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2021-10-13 13:37:51 +02:00
Sebastian Wagner
53382d70eb
Merge pull request #43274 from pcuzner/add-mib
monitoring:Adding the Ceph MIB

Reviewed-by: Sebastian Wagner <sewagner@redhat.com>
2021-10-12 22:29:06 +02:00
Aashish Sharma
f7714de294 mgr/dashboard: monitoring: grafonnet refactoring for hosts dashboards
This PR intends to refactor hosts dashboards using grafonnet

Fixes: https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-10-12 11:05:02 +05:30
Avan Thakkar
d388c5e958 mgr/dashboard: replace Client connections with active-stdby mgrs
Fixes: https://tracker.ceph.com/issues/52121
Signed-off-by: Avan Thakkar <athakkar@redhat.com>
2021-10-11 21:53:23 +05:30
Paul Cuzner
b96aa5d184 monitoring:Updated README
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2021-10-06 14:32:47 +13:00
Ernesto Puerta
ba9e17d2d2
Merge pull request #43132 from p-se/monitoring-grafana-piechart-update
monitoring: update grafana-piechart-panel plugin

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
Reviewed-by: p-se <NOT@FOUND>
2021-09-28 18:37:45 +02:00
Paul Cuzner
f9213ad9cf monitoring:Adding the Ceph MIB
The ceph MIB has been created and maintained in
a separate repo:
https://github.com/SUSE/prometheus-webhook-snmp

This patch brings this MIB into the main ceph repo, so
alert changes can target prometheus and potentially
SNMP environments within the same PR.

Kudos to Volker Theile for creating the MIB.

Fixes: https://tracker.ceph.com/issues/52708

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2021-09-23 11:06:19 +12:00
Patrick Seidensal
af94237621
monitoring: update grafana-piechart-panel plugin
Fixes: https://tracker.ceph.com/issues/51211

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2021-09-10 15:28:17 +02:00
Aashish Sharma
58d635455d mgr/dashboard: Incorrect MTU mismatch warning
The MTU mismatch warning was also being fired for NICs that are in the down state. This PR intends to fix this issue.

Fixes: https://tracker.ceph.com/issues/52028
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-09-02 15:34:36 +05:30
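The fix described in 58d635455d comes down to ignoring interfaces that are down before comparing MTUs. The sketch below shows that filtering step in plain Python. It is not the alert expression from the commit: `mtu_mismatches` is a hypothetical helper, the interface data is invented, and "mismatch" is assumed to mean "differs from the most common MTU among interfaces that are up".

```python
from collections import Counter


def mtu_mismatches(interfaces):
    """Return the up interfaces whose MTU differs from the most common one."""
    up = [nic for nic in interfaces if nic["up"]]  # down NICs are ignored
    if not up:
        return []
    counts = Counter(nic["mtu"] for nic in up)
    common_mtu, _ = counts.most_common(1)[0]
    return [nic for nic in up if nic["mtu"] != common_mtu]


# Invented example data.
interfaces = [
    {"host": "host1", "device": "eth0", "mtu": 1500, "up": True},
    {"host": "host2", "device": "eth0", "mtu": 1500, "up": True},
    {"host": "host3", "device": "eth0", "mtu": 1500, "up": True},
    {"host": "host3", "device": "eth1", "mtu": 9000, "up": False},  # down
]

flagged = mtu_mismatches(interfaces)
print(flagged)
```

With the `up` filter removed, `host3`/`eth1` would be reported even though it carries no traffic, which is the false warning the commit describes.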