This makes
Mar 24 12:00:32 dael conmon[3969650]: Wed Mar 24 16:00:32 2021: cant open raw socket. errno=1
go away and allows it to enter the MASTER state.
Signed-off-by: Sage Weil <sage@newdream.net>
* refs/pull/40219/head:
mon/MgrStatMonitor: ignore MMgrReport from non-active mgr
mgr: tell monc when we get new servicemap, fsmap
Reviewed-by: Neha Ojha <nojha@redhat.com>
Reviewed-by: Mykola Golub <mgolub@suse.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
* refs/pull/40103/head:
cephadm: fix a minor typo in logging message
Reviewed-by: Adam King <adking@redhat.com>
Reviewed-by: Juan Miguel Olmo <jolmomar@redhat.com>
* refs/pull/40222/head:
mgr/orchestrator: remove image name field from 'orch ps' and 'orch ls'
Reviewed-by: Michael Fritch <mfritch@suse.com>
Reviewed-by: Sage Weil <sage@redhat.com>
we install different versions of precompiled ceph-libboost packages
for different branches when building and testing them on ubuntu test
nodes. for instance,
- nautilus: v1.72
- octopus, pacific: v1.73
they share the same set of test nodes. and these ceph-libboost packages
conflict with each other, because they install files to the same places.
in order to avoid the confliction, we should uninstall existing packages
before installing a different version of ceph-libboost packages.
ceph-libboost${version}-dev is a package providing the shared headers of
boost library, so, in this change we check if it is installed before
returning or removing the existing packages.
Signed-off-by: Kefu Chai <kchai@redhat.com>
* refs/pull/40242/head:
mgr/cephadm/upgrade: do not repeat crash message
mgr/cephadm/upgrade: a little less verbose
mgr/cephadm: don't log not-ok-to-stop at ERR level
mgr/cephadm: is presumed -> appears
mgr/cephadm: don't double-log ok-to-stop results
mgr/cephadm/upgrade: include upgrade progress in ceph -s
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
- Do not list individual daemon ids as this won't scale for larger
clusters
- Do not contemplate multile daemons of the same type that register with
different "daemon_type" -- not until we actually have any that do that.
- Present counts by various groupings: distinct hosts and rgw zones to
start.
services:
mon: 1 daemons, quorum a (age 4m)
mgr: x(active, since 3m)
osd: 1 osds: 1 up (since 3m), 1 in (since 3m)
cephfs-mirror: 1 daemon active (1 hosts)
rbd-mirror: 2 daemons active (1 hosts)
rgw: 2 daemons active (1 hosts, 1 zones)
Signed-off-by: Sage Weil <sage@newdream.net>
failure_info keeps strong references of the MOSDFailure messages
sent by osd or peon monitors, whenever monitor starts to handle
an MOSDFailure message, it registers it in its OpTracker. and
the failure report messageis unregistered when monitor acks them
by either canceling them or replying the reporters with a new
osdmap marking the target osd down. but if this does not happen,
the failure reports just pile up in OpTracker. and monitor considers
them as slow ops. and they are reported as SLOW_OPS health warning.
in theory, it does not take long to mark an unresponsive osd down if
we have enough reporters. but there is chance, that a reporter fails
to cancel its report before it reboots, and the monitor also fails
to collect enough reports and mark the target osd down. so the
target osd never gets an osdmap marking it down, so it won't send
an alive message to monitor to fix this.
in this change, we check for the stale failure info in tick(), and
simply drop the stale reports. so the messages can released and
marked "done".
Fixes: https://tracker.ceph.com/issues/47380
Signed-off-by: Kefu Chai <kchai@redhat.com>
will add a trim failures call in the loop, which mutates failure_info,
while we are still iterating this map. so have to restructure the loop
a little bit.
Fixes: https://tracker.ceph.com/issues/47380
Signed-off-by: Kefu Chai <kchai@redhat.com>
we always return "no_op" message to proxy monitor in
`OSDMonitor::prepare_failure()` at the very beginning of this method. so
no need to reply the peon again when discarding the failure report.
Signed-off-by: Kefu Chai <kchai@redhat.com>
* early return if routed request is not found in routed_requests.
reduce the indent level, for better readability.
* do not look up the request twice. for better performance.
* use unique_ptr<> for holding the request, for better readability
Signed-off-by: Kefu Chai <kchai@redhat.com>
The _do_upgrade() method runs a zillion times; try to report fewer
repetitive messages on every iteration.
Signed-off-by: Sage Weil <sage@newdream.net>
avoid failures from appearing on the consle when exec'ing within the
container during the `ls` command
Signed-off-by: Michael Fritch <mfritch@suse.com>
Registering by gid allows multiple radosgw instances to share an auth
key/identity. Including the id in the metadata allows them to still be
identified by name (even if not uniquely).
Signed-off-by: Sage Weil <sage@newdream.net>
before this change, we use docker for running promtools offered by
a docker image, but this is not efficient, and quite a few developers
do not want to use docker for running "make check". this change was
introduced by #39246, the reason was that, in Ceph's CI process, we
are using Ubuntu/Bionic for running "make check" jobs, but prometheus
packaged by Bionic does not offer the "test rules" command. so, to
address problem, we are using "dnanexus/promtool:2.9.2" docker image
for verifying monitoring/prometheus/alerts/test_alerts.yml.
after this change, we use prometheus packaged by debian derivatives
instead of pulling a docker image.
* debian/control: add prometheus as a "make check" dependency
* install-deps.sh: partially revert
53a5816ded, as we don't need to
pull docker or start docker service for using promtool anymore.
* cmake: check if promtool is capable of running "test rules"
command, bail out if it is not.
see also: https://tracker.ceph.com/issues/49653
Signed-off-by: Kefu Chai <kchai@redhat.com>