191 Commits

Author SHA1 Message Date
Tatjana Dehler
42ff9370a0
monitoring/ceph-mixin: add entries to envlist
Add the missing entries `jsonnet-bundler-install` and
`jsonnet-bundler-update` to envlist.

Signed-off-by: Tatjana Dehler <tdehler@suse.com>
2022-08-19 12:08:56 +02:00
Aswin Toni
2e0e684fc2 ceph-mixin: Remove jsonnet building
Signed-off-by: Aswin Toni <aswin.toni@cern.ch>
2022-08-17 12:08:56 +02:00
Aswin Toni
5cdc1c62c5 prometheus: add multicluster support to alerts
Signed-off-by: Aswin Toni <aswin.toni@cern.ch>
2022-08-17 12:08:56 +02:00
Kefu Chai
4a3afcf277 cmake: set $PATH for tests using jsonnet tools
Otherwise they would not be able to find executables installed into
${CMAKE_CURRENT_BINARY_DIR}.

Signed-off-by: Kefu Chai <tchaikov@gmail.com>
2022-08-16 10:53:29 +08:00
Nizamudeen A
e9d361f621
Merge pull request #47334 from s0nea/wip-osd-objectstore-types-fix
monitoring/ceph-mixin: OSD overview typo fix

Reviewed-by: MrFreezeex <NOT@FOUND>
Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2022-08-01 13:47:03 +05:30
Anthony D'Atri
9b65974468 monitoring/ceph-mixin: clean up prometheus_alerts.yml
Signed-off-by: Anthony D'Atri <anthonyeleven@users.noreply.github.com>
2022-07-28 19:17:51 -07:00
Tatjana Dehler
8faaca2082
monitoring/ceph-mixin: OSD overview typo fix
Correct a wrongly set bracket on ceph-dashboard -> OSD Overview ->
OSD Objectstore Types resulting in a parser error.

Fixes: https://tracker.ceph.com/issues/56948
Signed-off-by: Tatjana Dehler <tdehler@suse.com>
2022-07-28 15:15:32 +02:00
Arthur Outhenin-Chalandre
37add644d1
ceph-mixin: remove timepicker override in every dashboard
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-05-24 11:54:26 +02:00
Arthur Outhenin-Chalandre
5db37300fd
ceph-mixin: rationalize local helper functions to utils
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-05-24 11:50:49 +02:00
Arthur Outhenin-Chalandre
0b7cc6bc99
ceph-mixin: fix typos
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-05-18 10:02:54 +02:00
Arthur Outhenin-Chalandre
c8f086c182
ceph-mixin: fix test with rate and label changes
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-05-17 09:42:29 +02:00
Arthur Outhenin-Chalandre
3b6356c872
ceph-mixin: don't add cluster matcher if showcluster is disabled
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-05-17 09:41:21 +02:00
Arthur Outhenin-Chalandre
fd4f484d22
ceph-mixin: refactor the structure of _config and utils
Before this refactor we couldn't override the config externally. Now the
_config is correctly propagated and not only taken from the
config.libsonnet file.

Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-05-16 15:26:56 +02:00
Arthur Outhenin-Chalandre
4595e9af23
ceph-mixin: fix makefile dashboards dependency
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-05-16 15:26:55 +02:00
Arthur Outhenin-Chalandre
faeea8d165
ceph-mixin: fix linting issue and add cluster template support
Fix most of the issues reported by dashboards-linter:
- Add matcher/template for job (and also cluster)
- use $__rate_interval everywhere

This change also converts all the irate functions to rate, as most uses of
irate were not actually correct. When using irate on a graph, for instance,
you can easily miss some of the metric values, as irate only takes the last
two values and the query step can be quite large if you want a graph
covering a few hours, a day or more.
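
The difference can be illustrated with a minimal Python sketch. It is a
simplified model, not the Prometheus implementation: it ignores counter
resets and boundary extrapolation, and the function names and sample data
are hypothetical.

```python
# Simplified model of PromQL rate() and irate() over one range window.
# Assumption: no counter resets and no extrapolation at window boundaries.

def simple_rate(samples):
    """Per-second increase between the first and last sample in the window."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def simple_irate(samples):
    """Per-second increase between the last two samples only."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical counter scraped every 15 s as (timestamp, value):
# a burst early in the window, then no activity.
window = [(0, 0), (15, 900), (30, 1800), (45, 1800), (60, 1800)]

print(simple_rate(window))   # 30.0
print(simple_irate(window))  # 0.0
```

With a large query step, each plotted point only sees the tail of its
window when irate is used, so the early burst in this example is lost.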

Fixes: https://tracker.ceph.com/issues/55003
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>

ceph-mixin: add config with matchers and tags

Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-05-16 15:26:53 +02:00
Arthur Outhenin-Chalandre
1452311a9b
ceph-mixin: rewrite promql queries to multiline
Fixes: https://tracker.ceph.com/issues/55005
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-04-27 17:55:52 +02:00
Aashish Sharma
2877920f58 mgr/dashboard: upgrade grafana pie-chart and vonage-status-panel versions
Fixes: https://tracker.ceph.com/issues/55195
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2022-04-06 15:24:41 +05:30
Ernesto Puerta
8721bd6c5d
monitoring/grafana: fix version
Fixes: https://tracker.ceph.com/issues/55172
Signed-off-by: Ernesto Puerta <epuertat@redhat.com>
2022-04-04 13:52:43 +02:00
Ernesto Puerta
a98c2475c6
Merge pull request #45254 from travisn/prometheus-rules-typos
prometheus: Spell check the alert descriptions

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Laura Flores <lflores@redhat.com>
Reviewed-by: Michael Fritch <mfritch@suse.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
Reviewed-by: sunilangadi2 <NOT@FOUND>
Reviewed-by: Travis Nielsen <tnielsen@redhat.com>
2022-04-04 13:46:00 +02:00
David Galloway
b4910a6627
Merge pull request #45739 from rhcs-dashboard/fix-55155-master
grafana/Makefile: don't push to docker
2022-04-01 13:30:05 -04:00
Ernesto Puerta
7e6309fac3
grafana/Makefile: don't push to docker
Fixes: https://tracker.ceph.com/issues/55155
Signed-off-by: Ernesto Puerta <epuertat@redhat.com>
2022-04-01 11:44:43 +02:00
Ernesto Puerta
2d1c480f5a
Merge pull request #45583 from p-se/monitoring-alert-mtu-group-by-devices
mgr/dashboard: Compare values of MTU alert by device

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
Reviewed-by: p-se <NOT@FOUND>
2022-04-01 11:11:30 +02:00
Ernesto Puerta
87f494eda0
Merge pull request #45578 from rhcs-dashboard/fix-grafana-build
mgr/dashboard: remove transition-through-oci image workaround in grafana build

Reviewed-by: Dan Mick <dmick@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2022-03-31 19:58:29 +02:00
Travis Nielsen
9cca95b16a
prometheus: spell check the alert descriptions
Signed-off-by: Travis Nielsen <tnielsen@redhat.com>
2022-03-30 17:38:43 -06:00
Ernesto Puerta
043f7953d8
Merge pull request #45335 from rhcs-dashboard/fix-54513-master
mgr/dashboard: Pool overall performance shows multiple entries of same pool in pool overview

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Pere Diaz Bou <pdiazbou@redhat.com>
Reviewed-by: sunilangadi2 <NOT@FOUND>
2022-03-30 14:05:38 +02:00
Aashish Sharma
9719cc795e mgr/dashboard: Pool overall performance shows multiple entries of same pool in pool overview
This PR intends to fix this issue

Fixes: https://tracker.ceph.com/issues/54513
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2022-03-28 18:25:25 +05:30
Aashish Sharma
49d6068463
mgr/dashboard: fix promtool test for mtu alert
Fixes: https://tracker.ceph.com/issues/55004
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2022-03-28 13:39:38 +02:00
Patrick Seidensal
3821548a37
mgr/dashboard: Compare values of MTU alert by device
Fixes: https://tracker.ceph.com/issues/55004

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2022-03-28 13:38:15 +02:00
Aashish Sharma
64b0e5ce8a mgr/dashboard: fix transition-through-oci image workaround in grafana build
Fixes: https://tracker.ceph.com/issues/54311
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2022-03-23 13:59:28 +05:30
Aashish Sharma
c306778889 mgr/dashboard/monitoring: update grafana version
Fixes: https://tracker.ceph.com/issues/54311

Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2022-03-21 17:40:03 +05:30
Rishabh Dave
a6f5efb620 monitoring: mention PyYAML only once in requirements
The following error occurs while running "sudo install-deps.sh":
ERROR: Double requirement given: PyYAML==6.0 (from -r requirements-lint.txt (line 5)) (already in pyyaml (from -r requirements-alerts.txt (line 1)), name='PyYAML')

PyYAML is mentioned twice as a requirement, once in each of
the following files:
monitoring/ceph-mixin/requirements-lint.txt
monitoring/ceph-mixin/requirements-alerts.txt

These requirements were added in commits
44d3e4c264 and
4750ac0d77.
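
A check for this kind of duplicate could look like the following sketch.
It is not part of the commit; the helper names are hypothetical, the file
contents are reduced to the two lines quoted in the error message, and
the name normalization follows the PEP 503 rule.

```python
import re
from collections import defaultdict


def requirement_name(line):
    """Return the normalized package name of a requirement line, or None."""
    line = line.split("#", 1)[0].strip()
    if not line or line.startswith("-"):  # blank, comment or pip option
        return None
    match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
    if match is None:
        return None
    # PEP 503 normalization: "PyYAML" and "pyyaml" name the same package.
    return re.sub(r"[-_.]+", "-", match.group(0)).lower()


def find_duplicates(files):
    """Map each package listed in more than one file to those files."""
    seen = defaultdict(list)
    for filename, text in files.items():
        for line in text.splitlines():
            name = requirement_name(line)
            if name and filename not in seen[name]:
                seen[name].append(filename)
    return {name: names for name, names in seen.items() if len(names) > 1}


# Hypothetical, reduced file contents based on the error message above.
files = {
    "requirements-lint.txt": "PyYAML==6.0\n",
    "requirements-alerts.txt": "pyyaml\n",
}
print(find_duplicates(files))
```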

Fixes: https://tracker.ceph.com/issues/54185
Signed-off-by: Rishabh Dave <ridave@redhat.com>
2022-02-08 11:19:15 +05:30
Nizamudeen A
27592b7561 cephadm: change shared_folder directory for prometheus and grafana
After https://github.com/ceph/ceph/pull/44059 the monitoring/prometheus
and monitoring/grafana/dashboards directories were moved to
monitoring/ceph-mixin. That broke the shared_folders in the cephadm
bootstrap script.

Change all the instances of monitoring/prometheus and
monitoring/grafana/dashboards to monitoring/ceph-mixin.

Also rename all the instances of prometheus_alerts.yaml to
prometheus_alerts.yml.

Fixes: https://tracker.ceph.com/issues/54176
Signed-off-by: Nizamudeen A <nia@redhat.com>
2022-02-07 16:34:37 +05:30
Ernesto Puerta
6a4b1e148d
Merge pull request #44796 from pcuzner/remove-old-mib
monitoring: remove old MIB

Reviewed-by: Alfonso Martínez <almartin@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
Reviewed-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-02-04 17:42:08 +01:00
Arthur Outhenin-Chalandre
8ff1e6b399
monitoring: build jsonnet/jb only for testing
Build jsonnet and jb in the tests so that we can build ceph without
internet access and still be able to run the tests needed for monitoring
using jsonnet tools.

Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-02-03 13:08:37 +01:00
Arthur Outhenin-Chalandre
ecaf9070ae
spec: debian: monitoring: build jsonnet from source to use 0.18.0
As this new version was released recently, it is not yet in every distro
we use. We now build jsonnet from source so that we can use this new
version of jsonnet. This commit could be reverted later on, once the new
version is available everywhere.

Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-02-03 13:08:36 +01:00
Arthur Outhenin-Chalandre
98236e3a1d
mgr/dashboard: monitoring: refactor into ceph-mixin
Mixin is a way to bundle dashboards, prometheus rules and alerts into
a jsonnet package. Shifting to mixin will allow easier integration with
monitoring automation that some users may use.

This commit moves `/monitoring/grafana/dashboards` and
`/monitoring/prometheus` to `/monitoring/ceph-mixin`. Prometheus alerts
were also converted to Jsonnet in an automated way (from yaml to json
to jsonnet). This commit minimises any change made to the generated files
and should change neither the dashboards nor the Prometheus alerts.

In the future some configuration will also be added to jsonnet to add
more functionalities to the dashboards or alerts (i.e.: multi cluster).

Fixes: https://tracker.ceph.com/issues/53374
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-02-03 13:08:20 +01:00
Ernesto Puerta
c47ace9215
Merge pull request #43707 from BenoitKnecht/ceph-mgr-service-id
mgr: Fix ceph_daemon label in ceph_rgw_* metrics

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-02-02 18:39:57 +01:00
Paul Cuzner
cbeab5c566 monitoring: remove old MIB
The MIB file that matches the OID definitions in the alerts is
CEPH-MIB.txt. The old MIB from the original SUSE SNMP
gateway work therefore needs to be removed to avoid
confusion.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2022-01-27 11:24:34 +13:00
Pere Diaz Bou
57c26311de monitoring/grafana: replace filestore osd count
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-01-18 14:14:41 +01:00
Pere Diaz Bou
a3cf5c5e9f monitoring/grafana: use Path class instead of split
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-01-18 13:24:12 +01:00
Pere Diaz Bou
1e4d85d04f monitoring/grafana: remove explicit str casting
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-01-18 13:24:12 +01:00
Pere Diaz Bou
2b4f3561d2 monitoring/grafana: add generated json files
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-01-18 13:24:12 +01:00
Pere Diaz Bou
b381a83e9b monitoring/grafana: ValueError instead of RuntimeError
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-01-18 13:24:12 +01:00
Pere Diaz Bou
4c302234ff monitoring/grafana: Replace missing legendFormat warning with error
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-01-18 13:24:10 +01:00
Patrick Seidensal
7d7488018e monitoring: Add unit tests for OSD panels in ceph-cluster dashboard
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2022-01-13 13:27:55 +01:00
Patrick Seidensal
4a6b2c1dfb monitoring: fix display ceph_osd_in in Grafana panel
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2022-01-13 13:27:55 +01:00
Patrick Seidensal
18d3a71618 mgr/prometheus: Fix regression with OSD/host details/overview dashboards
Fix issues with PromQL expressions and vector matching with the
`ceph_disk_occupation` metric.

As it turns out, `ceph_disk_occupation` cannot simply be used as
expected, as there seem to be some edge cases for users that have
several OSDs on a single disk.  This leads to issues which cannot be
solved by PromQL alone (many-to-many PromQL errors).  The data is
simply different from what we expected in some rare cases.

I have not found a pure PromQL solution to this issue. What we basically
need is the following.

1. Match on labels `host` and `instance` to get one or more OSD names
   from a metadata metric (`ceph_disk_occupation`) to let a user know
   about which OSDs belong to which disk.

2. Match on labels `ceph_daemon` of the `ceph_disk_occupation` metric,
   in which case the value of `ceph_daemon` must not refer to more than
   a single OSD. The exact opposite to requirement 1.

As both operations are currently performed on a single metric, and there
is no way to satisfy both requirements on a single metric, the intention
of this commit is to extend the metric by providing a similar metric
that satisfies one of the requirements. This enables the queries to
differentiate between a vector matching operation to show a string to
the user (where `ceph_daemon` could possibly be `osd.1` or
`osd.1+osd.2`) and to match a vector by having a single `ceph_daemon` in
the condition for the matching.

Although the `ceph_daemon` label is used on a variety of daemons, only
OSDs seem to be affected by this issue (only if more than one OSD is run
on a single disk).  This means that only the `ceph_disk_occupation`
metadata metric seems to need to be extended and provided as two
metrics.

`ceph_disk_occupation` is supposed to be used for matching the
`ceph_daemon` label value.

    foo * on(ceph_daemon) group_left ceph_disk_occupation

`ceph_disk_occupation_human` is supposed to be used for anything where
the resulting data is displayed to be consumed by humans (graphs, alert
messages, etc).

    foo * on(device,instance)
    group_left(ceph_daemon) ceph_disk_occupation_human
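
The two matching modes can be modelled with a small Python sketch. It is
a simplified stand-in for PromQL `group_left` matching that handles label
sets only and ignores sample values. The helper name and the label values
(two OSDs on disk `sdb` of host `node1`) are hypothetical.

```python
def join_group_left(left, right, on, include=()):
    """Many-to-one join: each match group may have one right-hand series."""
    index = {}
    for labels in right:
        key = tuple(labels[name] for name in on)
        if key in index:
            # PromQL rejects this case with a many-to-many matching error.
            raise ValueError(f"duplicate right-hand series for {key}")
        index[key] = labels
    joined = []
    for labels in left:
        key = tuple(labels[name] for name in on)
        if key in index:
            result = dict(labels)
            for name in include:  # labels copied over, as in group_left(...)
                result[name] = index[key][name]
            joined.append(result)
    return joined


# One series per OSD: safe to match on `ceph_daemon`.
occupation = [
    {"ceph_daemon": "osd.1", "device": "sdb", "instance": "node1"},
    {"ceph_daemon": "osd.2", "device": "sdb", "instance": "node1"},
]
# One series per disk, with all OSD names combined for display.
occupation_human = [
    {"ceph_daemon": "osd.1+osd.2", "device": "sdb", "instance": "node1"},
]

per_osd = [{"ceph_daemon": "osd.1"}, {"ceph_daemon": "osd.2"}]
per_disk = [{"device": "sdb", "instance": "node1"}]

print(join_group_left(per_osd, occupation, on=["ceph_daemon"]))
print(join_group_left(per_disk, occupation_human,
                      on=["device", "instance"], include=["ceph_daemon"]))
```

Joining `per_disk` against `occupation` on `device` and `instance` raises
`ValueError` in this sketch, which mirrors the many-to-many errors
described above.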

Fixes: https://tracker.ceph.com/issues/52974

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2022-01-13 13:27:55 +01:00
Benoît Knecht
2daaa052ea monitoring/grafana: Add tests for radosgw panels
Some of the expressions modified in c40290390d7 were not covered by any tests,
especially those in the `radosgw-detail.json` dashboard.

This commit fills in those gaps.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
2022-01-11 13:17:48 +01:00
Benoît Knecht
adc36dea7f monitoring/grafana: Update radosgw dashboards
With the `ceph_daemon` label now replaced by `instance_id` on all `ceph_rgw_*`
metrics, we need to update the Grafana dashboards to get that label back from
`ceph_rgw_metadata` using this type of construct:

```
ceph_rgw_req * on (instance_id) group_left(ceph_daemon) ceph_rgw_metadata
```

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
2022-01-11 13:17:20 +01:00
Ernesto Puerta
978d5829f2
Merge pull request #44294 from rhcs-dashboard/feature-bluestore-onode
mgr/dashboard: monitoring: Implement BlueStore onode hit/miss counters into the dashboard

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Alfonso Martínez <almartin@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Laura Flores <lflores@redhat.com>
Reviewed-by: neha-ojha <NOT@FOUND>
Reviewed-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-01-11 11:24:21 +01:00