Until now, daemon health metrics were stored without being used. One of
the most helpful of these metrics is SLOW_OPS for OSDs and MONs, which
this commit exposes in order to provide fine-grained metrics for finding
troublesome OSDs, instead of a single cluster-wide slow-ops health
check.
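As a sketch, a PromQL query over the newly exposed per-daemon data could
look like this (the metric name `ceph_daemon_health_metrics` and its
labels are assumptions here, not confirmed by this message):

    ceph_daemon_health_metrics{type="SLOW_OPS", ceph_daemon=~"osd.*"} > 0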
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
Adding logic to modify the master zonegroup endpoints
Do not call realm pull when modifying a zone
Only update the endpoints if the modified zone is the master
Adding support for setting custom endpoints when creating a realm or zone
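For comparison, the manual way to change zonegroup endpoints uses
radosgw-admin (the zonegroup name and URLs below are placeholders):

    radosgw-admin zonegroup modify --rgw-zonegroup=default \
        --endpoints=http://rgw1:8080,http://rgw2:8080
    radosgw-admin period update --commit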
Signed-off-by: Redouane Kachach <rkachach@redhat.com>
Add documentation for the option to specify the sectype (for enabling
Kerberos) when creating a new export.
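A sketch of creating such an export (cluster, filesystem and path names
are placeholders, and the exact flag spelling is an assumption):

    ceph nfs export create cephfs --cluster-id mynfs \
        --pseudo-path /cephfs --fsname myfs \
        --sectype krb5 --sectype sys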
Signed-off-by: John Mulligan <jmulligan@redhat.com>
mgr/dashboard: add option to resolve ip addr
Reviewed-by: Pegonzal <NOT@FOUND>
Reviewed-by: Anthony D Atri <anthony.datri@gmail.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Add the option `redirect_resolve_ip_addr` to the dashboard module.
If the option is set to `True`, try to resolve the IP address before
redirecting from the passive to the active mgr instance.
If the option is set to `False`, keep the existing behavior.
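Since this is a regular mgr module option, it can be set with, for
example:

    ceph config set mgr mgr/dashboard/redirect_resolve_ip_addr true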
Fixes: https://tracker.ceph.com/issues/56699
Signed-off-by: Tatjana Dehler <tdehler@suse.com>
This PR adds unselectable prompts to three files that are
transcluded in the doc/mgr/dashboard.rst file. These three
files are:
1. debug.inc.rst
2. feature_toggles.inc.rst
3. motd.inc.rst
The addition of unselectable prompts to these three files
completes the work begun in PR#47810 (d8064b4), which sought
to bring dashboard.rst into line with the unselectable prompt
standard introduced by Kefu Chai in 2020.
Signed-off-by: Zac Dover <zac.dover@gmail.com>
This commit adds prompt directives (.. prompt:: bash $) to
the commands in dashboard.rst.
There are several ".. include::" directives in the dashboard.rst file,
which means that parts of this page are sourced from files other than
dashboard.rst. Because I have not yet added prompt directives to those
files, there is an inconsistency in the rendering of this page. Most of
the commands on this page have unselectable prompts (prompts that are
not added to the buffer when you copy a command to the clipboard). But
the commands on this page that come from those ".. include::" directives
do not yet have unselectable prompts.
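For illustration, converting a plain literal block to the prompt
directive looks like this (the command shown here is a placeholder):

    Before::

        $ ceph -s

    After:

    .. prompt:: bash $

       ceph -s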
This file is over 1600 lines long. It was perhaps not optimally wise
of me to have edited all of it in one fell swoop. It took many hours,
and carefully checking it will probably take at least one hour. I
suggest that whoever reviews this should not spend much time on it,
but should instead make a quick pass over the page and make sure that
it looks passable.
The English syntax on this page (and throughout the Dashboard
documentation) will be tightened to remove ambiguity and to improve
readability in the near future, so hold all English-language-related
comments for a future pull request.
Signed-off-by: Zac Dover <zac.dover@gmail.com>
This PR improves the English language in the "Orchestrator CLI"
section of the MGR documentation. It adds a couple of section
headers in order to signpost the information in the document
a bit more than had already been done, but it makes no major
structural changes to the presentation of the information here.
This PR was motivated by feedback from the 2022 Ceph User Survey, in
which one of the respondents wrote "better ceph orch documentation".
The final section on this page, "Current Implementation Status",
must be verified by someone who is familiar with the current state
of "ceph orch" and a date stamp should be applied to the top of
the section so that the word "current" has a meaningful referent.
Signed-off-by: Zac Dover <zac.dover@gmail.com>
.. of NFS and ingress services after creating/deleting an NFS cluster.
The `nfs cluster info` command is not sufficient to show that the
NFS cluster was created/deleted as expected.
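A rough sketch of the orchestrator commands that can verify the status
of those services:

    ceph orch ls nfs
    ceph orch ls ingress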
Signed-off-by: Ramana Raja <rraja@redhat.com>
A recent change in the mgr/nfs module should enable export management
commands/API calls to function as long as the rados namespaces and
objects have already been established. Document this fact, noting
that now only the `ceph nfs cluster ...` calls *require* an
orchestration module.
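A sketch of the distinction (the cluster name is a placeholder):

    # still requires an orchestration module:
    ceph nfs cluster create mynfs
    # works without orchestration, provided the rados namespace and
    # objects already exist:
    ceph nfs export ls mynfs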
Signed-off-by: John Mulligan <jmulligan@redhat.com>
The new collection is called `basic_usage_by_class`. This info should
be kept separate from `basic_pool_usage`, since it does not involve pool
statistics.
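Assuming the telemetry collection CLI of this era, the new collection
should show up in the output of:

    ceph telemetry collection ls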
Signed-off-by: Laura Flores <lflores@redhat.com>
- Added the word "default" since we are only collecting
default pool applications
- Removed the word "data" since we are actually collecting
usage *statistics*
Signed-off-by: Laura Flores <lflores@redhat.com>
After https://github.com/ceph/ceph/pull/44059, the monitoring/prometheus
and monitoring/grafana/dashboards directories were changed to
monitoring/ceph-mixins. That broke the shared_folders in the cephadm
bootstrap script.
Changed all instances of monitoring/prometheus and
monitoring/grafana/dashboards to monitoring/ceph-mixins.
Also renamed all instances of prometheus_alerts.yaml to
prometheus_alerts.yml.
Fixes: https://tracker.ceph.com/issues/54176
Signed-off-by: Nizamudeen A <nia@redhat.com>
The progress module now disables the pg recovery event by default,
since the event is expensive and has interrupted other services when
OSDs are marked in/out of the cluster.
To turn the event on manually:
ceph config set mgr mgr/progress/allow_pg_recovery_event true
Updated qa/tasks/mgr/test_progress.py to enable
the pg recovery event when testing the progress module.
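To confirm the setting afterwards (standard mgr config query):

    ceph config get mgr mgr/progress/allow_pg_recovery_event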
Signed-off-by: Kamoltat <ksirivad@redhat.com>
Enable or disable all telemetry channels at once with:
ceph telemetry enable channel all
ceph telemetry disable channel all
Signed-off-by: Yaarit Hatuka <yaarit@redhat.com>
The STATUS column now indicates whether a collection is being reported
and, if not, the reason why (either the user has not opted in to this
collection, or its channel is off).
Also removed the ENROLLED and DEFAULT columns due to the potential
confusion they may cause.
If a user has not opted in to certain collections, a message listing
the missing collections will appear above the table:
New collections are available:
['basic_base', 'basic_mds_metadata', 'crash_base', 'device_base',
'ident_base', 'perf_perf']
Run `ceph telemetry on` to opt-in to these collections.
Signed-off-by: Yaarit Hatuka <yaarit@redhat.com>
Fix issues with PromQL expressions and vector matching with the
`ceph_disk_occupation` metric.
As it turns out, `ceph_disk_occupation` cannot simply be used as
expected, as there are some edge cases for users that have several OSDs
on a single disk. This leads to issues that cannot be addressed by
PromQL alone (many-to-many PromQL errors). In some rare cases, the data
is simply different from what we expected.
I have not found a PromQL-only solution to this issue. What we basically
need is the following.
1. Match on labels `host` and `instance` to get one or more OSD names
from a metadata metric (`ceph_disk_occupation`) to let a user know
about which OSDs belong to which disk.
2. Match on the `ceph_daemon` label of the `ceph_disk_occupation`
metric, in which case the value of `ceph_daemon` must not refer to more
than a single OSD. The exact opposite of requirement 1.
As both operations are currently performed on a single metric, and
there is no way to satisfy both requirements with a single metric, the
intention of this commit is to extend the metric by providing a similar
metric that satisfies the other requirement. This enables the queries to
differentiate between a vector-matching operation whose result is shown
to the user (where `ceph_daemon` could possibly be `osd.1` or
`osd.1+osd.2`) and one that matches a vector on a single `ceph_daemon`
in the matching condition.
Although the `ceph_daemon` label is used on a variety of daemons, only
OSDs seem to be affected by this issue (only if more than one OSD is run
on a single disk). This means that only the `ceph_disk_occupation`
metadata metric seems to need to be extended and provided as two
metrics.
`ceph_disk_occupation` is supposed to be used for matching the
`ceph_daemon` label value.
    foo * on(ceph_daemon) group_left ceph_disk_occupation
`ceph_disk_occupation_human` is supposed to be used for anything where
the resulting data is displayed to be consumed by humans (graphs, alert
messages, etc).
    foo * on(device, instance)
        group_left(ceph_daemon) ceph_disk_occupation_human
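For example, a human-facing panel query could look like this (the
node_exporter metric is used purely for illustration):

    rate(node_disk_read_bytes_total[5m])
        * on(device, instance)
        group_left(ceph_daemon) ceph_disk_occupation_human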
Fixes: https://tracker.ceph.com/issues/52974
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
This PR specifies that the data source must be set to "Dashboard1"
when you configure Grafana and Prometheus manually.
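A minimal Grafana data source provisioning sketch with that name,
assuming the standard provisioning format (path and URL below are
placeholders):

    # e.g. /etc/grafana/provisioning/datasources/ceph.yml
    apiVersion: 1
    datasources:
      - name: 'Dashboard1'
        type: 'prometheus'
        access: 'proxy'
        url: 'http://localhost:9090'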
This is a fixup of another PR, which was created by Dr Jake Grimmett
of Cambridge, to whom credit goes:
https://github.com/ceph/ceph/pull/44150/
Signed-off-by: Zac Dover <zac.dover@gmail.com>