The following error occurs while running "sudo install-deps.sh":
ERROR: Double requirement given: PyYAML==6.0 (from -r requirements-lint.txt (line 5)) (already in pyyaml (from -r requirements-alerts.txt (line 1)), name='PyYAML')
PyYAML is listed twice as a requirement, once in each of
the following files:
monitoring/ceph-mixin/requirements-lint.txt
monitoring/ceph-mixin/requirements-alerts.txt
These requirements were added in commits
44d3e4c264 and
4750ac0d77.
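A check along the following lines could flag such duplicates before pip
does. This is only a minimal sketch and not part of install-deps.sh; the
function name `find_duplicate_requirements` and the simplified name
parsing are assumptions made for illustration.

```python
import re
from collections import defaultdict


def find_duplicate_requirements(files):
    """Return {normalized_name: [(file, line_no), ...]} for names listed twice or more.

    `files` maps a requirements file name to its text content.
    Parsing is deliberately simplified: only plain "name[==version]" lines
    are handled; option lines (starting with "-") and comments are skipped.
    """
    seen = defaultdict(list)
    for fname, text in files.items():
        for line_no, raw in enumerate(text.splitlines(), start=1):
            line = raw.split("#", 1)[0].strip()
            if not line or line.startswith("-"):
                continue
            match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
            if match is None:
                continue
            # pip compares names case-insensitively and treats "-", "_" and
            # "." as equivalent, so "PyYAML" and "pyyaml" are the same package.
            name = re.sub(r"[-_.]+", "-", match.group(0)).lower()
            seen[name].append((fname, line_no))
    return {name: locs for name, locs in seen.items() if len(locs) > 1}
```

Called with the contents of requirements-alerts.txt and
requirements-lint.txt, it would report `pyyaml` together with both
locations.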
Fixes: https://tracker.ceph.com/issues/54185
Signed-off-by: Rishabh Dave <ridave@redhat.com>
After https://github.com/ceph/ceph/pull/44059, the monitoring/prometheus
and monitoring/grafana/dashboards directories were moved to
monitoring/ceph-mixin. That broke the shared_folders in the cephadm
bootstrap script.
Change all instances of monitoring/prometheus and
monitoring/grafana/dashboards to monitoring/ceph-mixin.
Also rename all instances of prometheus_alerts.yaml to
prometheus_alerts.yml.
Fixes: https://tracker.ceph.com/issues/54176
Signed-off-by: Nizamudeen A <nia@redhat.com>
Build jsonnet and jb in the test so that we can build ceph without
internet access and still be able to run the tests needed for monitoring
using jsonnet tools.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
As this new version was released only recently, it is not yet in every
distro we use. We now build jsonnet from source so that we can use this
new version of jsonnet. This commit can be reverted later, once the new
version is available everywhere.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
A mixin is a way to bundle dashboards, Prometheus rules and alerts into
a jsonnet package. Shifting to a mixin will allow easier integration with
the monitoring automation that some users may use.
This commit moves `/monitoring/grafana/dashboards` and
`/monitoring/prometheus` to `/monitoring/ceph-mixin`. The Prometheus alerts
were also converted to Jsonnet in an automated way (from yaml to json
to jsonnet). This commit minimises any change made to the generated files
and should change neither the dashboards nor the Prometheus alerts.
In the future some configuration will also be added to jsonnet to add
more functionality to the dashboards or alerts (e.g. multi-cluster).
Fixes: https://tracker.ceph.com/issues/53374
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
The MIB file that matches the OID definitions in the alerts is
CEPH-MIB.txt. The old MIB from the original SUSE SNMP
gateway work therefore needs to be removed to avoid
confusion.
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Fix issues with PromQL expressions and vector matching with the
`ceph_disk_occupation` metric.
As it turns out, `ceph_disk_occupation` cannot simply be used as
expected, as there seem to be some edge cases for users that have
several OSDs on a single disk. This leads to issues which cannot be
solved by PromQL alone (many-to-many PromQL errors). The data is
simply different from what we expected in some rare cases.
I have not found a PromQL-only solution to this issue. What we basically
need is the following.
1. Match on labels `host` and `instance` to get one or more OSD names
from a metadata metric (`ceph_disk_occupation`) to let a user know
about which OSDs belong to which disk.
2. Match on labels `ceph_daemon` of the `ceph_disk_occupation` metric,
in which case the value of `ceph_daemon` must not refer to more than
a single OSD. The exact opposite to requirement 1.
As both operations are currently performed on a single metric, and there
is no way to satisfy both requirements on a single metric, the intention
of this commit is to extend the metric by providing a similar metric
that satisfies one of the requirements. This enables the queries to
differentiate between a vector matching operation to show a string to
the user (where `ceph_daemon` could possibly be `osd.1` or
`osd.1+osd.2`) and to match a vector by having a single `ceph_daemon` in
the condition for the matching.
Although the `ceph_daemon` label is used on a variety of daemons, only
OSDs seem to be affected by this issue (only if more than one OSD is run
on a single disk). This means that only the `ceph_disk_occupation`
metadata metric seems to need to be extended and provided as two
metrics.
`ceph_disk_occupation` is supposed to be used for matching the
`ceph_daemon` label value.
    foo * on(ceph_daemon) group_left ceph_disk_occupation
`ceph_disk_occupation_human` is supposed to be used for anything where
the resulting data is displayed to be consumed by humans (graphs, alert
messages, etc).
    foo * on(device,instance)
    group_left(ceph_daemon) ceph_disk_occupation_human
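The difference between the two metrics can be sketched in Python. This is
an illustration only, not the mgr module code: the function name
`build_occupation_metrics`, the input tuples and the reduced label set
(`ceph_daemon`, `instance`, `device`) are assumptions.

```python
from collections import defaultdict


def build_occupation_metrics(osd_devices):
    """Build label sets for both metrics from (ceph_daemon, instance, device) tuples.

    Returns (per_daemon, human):
    - per_daemon: one series per OSD, so `ceph_daemon` is unique and safe
      to use in on(ceph_daemon) matching.
    - human: one series per (instance, device), with all OSDs on that disk
      joined by "+", suitable for display only.
    """
    per_daemon = []
    by_disk = defaultdict(list)
    for daemon, instance, device in osd_devices:
        per_daemon.append(
            {"ceph_daemon": daemon, "instance": instance, "device": device}
        )
        by_disk[(instance, device)].append(daemon)

    human = [
        {"ceph_daemon": "+".join(sorted(daemons)), "instance": instance, "device": device}
        for (instance, device), daemons in sorted(by_disk.items())
    ]
    return per_daemon, human
```

With two OSDs sharing one disk, the per-daemon form yields two series
(`osd.1`, `osd.2`) while the human form yields a single series labelled
`osd.1+osd.2`. Each side of a match therefore stays unique on the labels
it is matched on.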
Fixes: https://tracker.ceph.com/issues/52974
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
Some of the expressions modified in c40290390d7 were not covered by any tests,
especially those in the `radosgw-detail.json` dashboard.
This commit fills in those gaps.
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
With the `ceph_daemon` label now replaced by `instance_id` on all `ceph_rgw_*`
metrics, we need to update the Grafana dashboards to get that label back from
`ceph_rgw_metadata` using this type of construct:
```
ceph_rgw_req * on (instance_id) group_left(ceph_daemon) ceph_rgw_metadata
```
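What this join does can be sketched in Python. The sketch is illustrative
only: `join_group_left` is a hypothetical helper, series are modelled as
plain label dicts, and it assumes that `ceph_rgw_metadata` always has the
value 1, so the multiplication leaves the sample values unchanged.

```python
def join_group_left(samples, metadata, on="instance_id", copy_label="ceph_daemon"):
    """Mimic `samples * on(<on>) group_left(<copy_label>) metadata`.

    samples:  list of (labels_dict, value) pairs (the "many" side).
    metadata: list of label dicts (the "one" side, value assumed to be 1).
    """
    lookup = {}
    for labels in metadata:
        key = labels[on]
        if key in lookup:
            # Prometheus rejects the query if the "one" side is not unique.
            raise ValueError(f"duplicate {on}={key!r} on the metadata side")
        lookup[key] = labels[copy_label]

    joined = []
    for labels, value in samples:
        key = labels.get(on)
        if key in lookup:  # samples without a match are dropped
            joined.append(({**labels, copy_label: lookup[key]}, value))
    return joined
```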
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
Provide the details pulled from Bluestore stats in order to display the onode hit/miss counters
Fixes: https://tracker.ceph.com/issues/53577
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
Rules now adhere to the format defined by Prometheus.io.
This changes alert naming, and each alert now includes
a summary description to provide a quick one-liner.
In addition to the reformatting, some missing alerts for MDS and
cephadm have been added, along with corresponding tests.
The MIB has also been refactored so that it now passes standard
lint tests, and a README is included for devs to understand the
OID schema.
Fixes: https://tracker.ceph.com/issues/53111
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Consolidate all tests inside a tests directory, and use pytest/tox to
perform validation of the overall content. The tox tests also use
promtool, if available, to provide rule checks and unittest runs.
In addition to these checks, a validate_rules script provides
format and content checks against all rules; it is also
called via tox (but can be run independently too).
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
This patch creates a health history object maintained in
the module's kvstore. The history and current health
checks are used to create a metric per healthcheck whilst
also providing a history feature. Two new commands are added:
    ceph healthcheck history ls
    ceph healthcheck history clear
In addition to the new commands, the additional metrics
have been used to update the prometheus alerts.
Fixes: https://tracker.ceph.com/issues/52638
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
This PR intends to refactor cephfs dashboards using grafonnet
Fixes: https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
This PR intends to refactor osds dashboards using grafonnet
Fixes: https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
This PR intends to refactor pools dashboards using grafonnet
Fixes: https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
This PR intends to refactor rbd dashboards using grafonnet
Fixes: https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
This PR intends to refactor radosgw dashboards using grafonnet
Fixes: https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
This PR intends to refactor hosts dashboards using grafonnet
Fixes: https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
The ceph MIB has been created and maintained in
a separate repo:
https://github.com/SUSE/prometheus-webhook-snmp
This patch brings this MIB into the main ceph repo, so
alert changes can target prometheus and potentially
SNMP environments within the same PR.
Kudos to Volker Theile for creating the MIB.
Fixes: https://tracker.ceph.com/issues/52708
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
The MTU mismatch warning was also being fired for NICs that are in the down state. This PR intends to fix this issue.
Fixes: https://tracker.ceph.com/issues/52028
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
so we don't build this target when running "make", and hence avoid
accessing the internet in a build environment where internet
access is not allowed.
Signed-off-by: Kefu Chai <kchai@redhat.com>
when downloading/building grafonnet-lib, dpdk, spdk, liburing and fio,
they dump lots of output during the configuration and build phases,
all of which is irrelevant to us. so let's just silence it.
Signed-off-by: Kefu Chai <kchai@redhat.com>
The value we get is a per-unit value, so the limits and the max value
should be relative to 1, not 100. Note that the value being shown was
correct; it was the gauge that was not showing the correct indicators.
Signed-off-by: David Caro <david@dcaro.es>
The `instance` label is only useful if:
- the exporter returns only data about its node or instance
- the exporter provides an instance label and then may return data about
  other nodes
In this case, it's about the Prometheus mgr module, which is a single
exporter providing data about a whole cluster, so not only data related
to the node (or instance) the mgr module is running on. It is
completely irrelevant which node the exporter runs on; the data
provided doesn't change. The exporter also doesn't provide `instance`
labels (which Prometheus wouldn't change due to our configuration, see
"honor_labels" setting).
(Actually there's one exception where `instance` labels are provided by
the Ceph mgr module, but that doesn't affect the Ceph Cluster
dashboard.)
Note that keeping that instance label on this particular dashboard would
enable the user to switch between a previously failed mgr instance and
the data collected from there and the currently running mgr instance (on
which the Prometheus mgr module runs). That'd split the data, which
I don't think is a useful feature, but rather looks broken.
Fixes: https://tracker.ceph.com/issues/51212
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
The health status widget doesn't show any status because it requires its
query to return a single result. But in case a mgr instance had failed,
it would return more, provided the incident has happened in the
requested time frame.
This is simply an issue of the `instant` switch being disabled for that
widget. As only one mgr instance can ever be providing data at a time,
enabling `instant` completely solves that issue.
Fixes: https://tracker.ceph.com/issues/51212
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
Remove the hard-coded timezone from Grafana dashboards to enable the Grafana
administrator to decide which timezone should be used for dashboards.
If we hard-coded those values, changing the global settings in Grafana
wouldn't have an effect. And the administrators can't change the
automatically imported Grafana dashboards provided by us.
Fixes: https://tracker.ceph.com/issues/51212
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
Convert newline character from CRLF in `rbd-details.json` to LF, so that
it will be consistent with all the other dashboard JSON files.
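The conversion itself amounts to the following. This is a minimal sketch
of the technique, not necessarily the tool that was used for this commit;
the function name `crlf_to_lf` is made up for illustration.

```python
def crlf_to_lf(data: bytes) -> bytes:
    """Replace every CRLF pair with a single LF; existing lone LFs stay as they are."""
    return data.replace(b"\r\n", b"\n")
```

Working on bytes rather than text avoids any newline translation that
opening the file in text mode might otherwise apply.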
Fixes: https://tracker.ceph.com/issues/51212
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
Work around https://github.com/containers/buildah/issues/3253
by pushing to a local OCI-format image to clear out the erroneously left
'parent' field in buildah commit --squash output. Can be removed
when the fix for the above is available.
Signed-off-by: Dan Mick <dmick@redhat.com>
podman login caches auth tokens in auth.json; for sudo, it may be
placed in /run/containers/0 or it may be in /run/users/0/containers;
the latter directory is removed when root "logs out", and it isn't
clear what that means with sudo/su. Several builds failed because
they couldn't find the cached auth between sudo podman login and sudo
podman push. Sidestep the confusion by just using a local file for
the auth cache.
Signed-off-by: Dan Mick <dmick@redhat.com>
- allow env setting of versions of components
- add docker/quay username/password variables
- derive container version from grafana version
- make arch-specific tags
- expand clean target to remove container images
- remove release-specific targets, "all" target
- move push operations to separate "push" target
Signed-off-by: Dan Mick <dmick@redhat.com>
Some build machines don't have /etc/sub{u,g}id set up for
so-called "rootless" (non-privileged) operation. Use sudo
to avoid the need for "rootless".
Signed-off-by: Dan Mick <dmick@redhat.com>
This is a replacement dashboard configuration for the
pool overview page. It provides a cluster wide view of
capacity consumed and compression effectiveness, and
breaks this down by each pool within the configuration.
Fixes: https://tracker.ceph.com/issues/50226
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
mgr/dashboard: Fixed name clash when hostname similar to another
Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: p-se <NOT@FOUND>
The hosts-overview Grafana dashboard json file contains a repeated element, making
it invalid JSON. Some JSON parsers handle this. However, this prevents Jsonnet
from parsing the dashboard, which prevents the deployment of this dashboard via
Jsonnet.
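Repeated keys like this are easy to miss because lenient parsers silently
keep the last value. A small check such as the following can surface
them. It is a sketch using Python's standard `json` module; the function
name `find_duplicate_keys` is an assumption, not an existing tool in the
tree.

```python
import json


def find_duplicate_keys(text):
    """Return the keys that occur more than once within the same JSON object."""
    duplicates = []

    def record_duplicates(pairs):
        # `pairs` holds every (key, value) of one object, including repeats
        # that a plain dict would silently collapse.
        seen = set()
        for key, _ in pairs:
            if key in seen:
                duplicates.append(key)
            seen.add(key)
        return dict(pairs)

    json.loads(text, object_pairs_hook=record_duplicates)
    return duplicates
```

An empty result means no object in the file repeats a key.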
Fixes: https://tracker.ceph.com/issues/50410
Signed-off-by: Malcolm Holmes <mdh@odoko.co.uk>
run-promtool-unittests is failing with difference in floating point values in some complex calculations. This PR intends to simplify those calculations and fix this issue.
Fixes: https://tracker.ceph.com/issues/49952
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
This PR intends to add unit testing for prometheus rules using promtool. To run the tests, run the 'run-promtool-unittests.sh' script.
Fixes: https://tracker.ceph.com/issues/45415
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
mgr/dashboard: prometheus alerting: add some leeway for package drops and errors
Reviewed-by: Stephan Müller <smueller@suse.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
This PR intends to alert a user if a specific network is configured with a custom MTU
Fixes: https://tracker.ceph.com/issues/48748
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
SLOW_OPS is triggered by the op tracker and generates a health
alert, but healthchecks do not create metrics for prometheus to
use as alert triggers. This change adds a SLOW_OPS metric, and
provides a simple means to extend to other relevant health
checks in the future.
If extracting the value from the health check message fails,
we log an error and remove the metric from the metric set. In
addition, the metric description has changed to better reflect
the scenarios where SLOW_OPS can be triggered.
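The extract-or-remove behaviour can be sketched as follows. This is not
the module's actual code: the message format ("<N> slow ops, ..."), the
plain dict used as metric set and the name `update_slow_ops_metric` are
assumptions for illustration.

```python
import logging
import re

log = logging.getLogger(__name__)

# Assumed message shape: the count is the leading integer of the summary.
_SLOW_OPS_RE = re.compile(r"^(\d+) slow ops")


def update_slow_ops_metric(metrics, summary_message):
    """Set metrics["SLOW_OPS"] from the health check message, or drop it on failure."""
    match = _SLOW_OPS_RE.match(summary_message)
    if match is None:
        log.error("unable to extract SLOW_OPS value from %r", summary_message)
        # Remove the metric rather than export a stale or made-up value.
        metrics.pop("SLOW_OPS", None)
        return
    metrics["SLOW_OPS"] = int(match.group(1))
```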
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
The labels on one side do not match the labels of the other side, where
a label_replace is used. The fix uses the same label_replace on the
missing side.
Fixes: https://tracker.ceph.com/issues/47334
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
mgr/dashboard: Prometheus query error in the metrics of Pools, OSDs and RBD images
Reviewed-by: Alfonso Martínez <almartin@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Volker Theile <vtheile@suse.com>
This commit provides the Makefile to create the
ceph-grafana containers for nautilus, octopus and
master releases.
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>