* refs/pull/28855/head:
doc: document scrub summary in ceph status output
test: extend scrub control test to validate mds task status
mds: send scrub state changes to cluster log.
mds: periodically sent mds scrub status to ceph manager
mgr, mon: allow normal ceph services to register with manager
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
* refs/pull/28378/head:
qa/tasks: introduce Thrasher base class
qa/tasks: Fix typo
qa/tasks: manage thrashers
qa/tasks: start DaemonWatchdog when ceph starts
qa/tasks: make watch and bark handle more daemons
qa/tasks: move DaemonWatchdog to new file
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
* Introduced a Thrasher base class.
* Updated thrashers to inherit from Thrasher.
* Replaced the magic variable e with Thrasher.exception as per the discussion.
Now the exception variable sets by default as the thrashers are inheriting
from the Thrasher class.
Fixes: https://github.com/ceph/ceph/pull/28378#discussion_r309337928
Fixes: https://tracker.ceph.com/issues/41133
Signed-off-by: Jos Collin <jcollin@redhat.com>
* refs/pull/29493/head:
qa/tasks/mgr/mgr_test_case: get mgrmap from 'mgr dump', not status
qa/tasks/ceph_manager: no newlines in 'ceph -s' output
mon: make mon summary more concise in 'ceph -s'
mon/MgrStatMonitor: set initial service_map 'modified' to cluster mkfs
mon: remove double-nesting of "osdmap" for ceph status
mon/MgrMap: make print_summary (used by 'ceph -s') more concise
Reviewed-by: Neha Ojha <nojha@redhat.com>
Reviewed-by: Kefu Chai <kchai@redhat.com>
in fbd4836d, a regression is introduced:
self.log("failed to read erasure_code_profile. %s was likely removed",
pool)
because `self.log` is actually a lambda which just do
self.logger.info(x)
in this change
* `Thrasher.log()` is added for three reasons:
- in PEP-8,
> Always use a def statement instead of an assignment statement that
> binds a lambda expression directly to an identifier
so a better way is to define a method using `def`
- and i think it helps with the readability
* `logger` parameter is now mandatory now in the constructor of
`Thrasher` class. because the instance of this class is only created
by `qa/tasks/thrashosds.py`, like:
thrash_proc = ceph_manager.Thrasher(
cluster_manager,
config,
logger=log.getChild('thrasher')
)
and `log.getChild()` does not return `None`, so there is no need to
handle that case.
Signed-off-by: Kefu Chai <kchai@redhat.com>
Currently these can be thrown off if the cluster is creating or removing
pools at the same time. Fix by taking a single snapshot of the pg stats
and based our judgement on that.
Signed-off-by: Sage Weil <sage@redhat.com>
to be specific, ignore errors when querying erasure coded pool's
erasure-code-profile. the pool might be removed after
"test_pool_min_size" lists all pools and before queries the pools'
erasure-code-profile. in that case, we should just continue on with the
next pool.
normally, the pools are created by the "radosbench" tasks. and they
don't delete the ec profiles after removing the ec pools using them, but
i don't want to rely on this fact. so, in this change, the `try` block
guards both `ceph osd pool get <pool_name> erasure_code_profile`
and `ceph osd erasure-code-profile get <profile>` calls.
Fixes: http://tracker.ceph.com/issues/40533
Signed-off-by: Kefu Chai <kchai@redhat.com>
If there are leftover merges at the end of the run they can take a long
time to get through, blowing our timeout for (waiting for pgs to become
active and to stop splitting/merge) and scrubbing pgs. Stop all of that
at the end of the run so that we don't have to wait so long.
Signed-off-by: Sage Weil <sage@redhat.com>
Some tests have m=2,k=2 and this will break them. Sometimes even if we
have 5 up osds, we end up with 4 and CRUSH gets picky, so build in a
buffer and only do this if we have 6 up.
We don't have an easy way from here to see what the min up osds for healthy
is... basically this map discontinuity test just sucks.
Signed-off-by: Sage Weil <sage@redhat.com>
We currently import a portion of the PG if it has split. Merge is more
complicated, though, mainly because COT is operating in a mode where it
fast-forwards the PG to the latest OSDMap epoch, which means it has to
implement any transformations to the PG (split/merge) independently.
Avoid doing this for merge.
Signed-off-by: Sage Weil <sage@redhat.com>
Also:
- Do not print **offset** until specified
- Count missing objects correctly (used to be primary's local missing)
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
the default timeout is none in that case, there are cases where it can hang forever
due to error cases, since this dumps quite a lot of info the logs grow in GB's, with
default timeout of 1200 we can avoid such huge logs and fail sooner. Any tests needing
higher timeout can pass the required value.
Signed-off-by: Vasu Kulkarni <vasu@redhat.com>
It's possible for tell osd.* to race against an osd we stopped but the
cluster doesn't know is down yet. In tha case we'll get ENXIO on that
osd and the command will fail.
In this context, we don't care.
Signed-off-by: Sage Weil <sage@redhat.com>
* move Thrasher._set_config() to CephManager, and make it a public
method, and rename it to inject_args(),
* use this method instead of using 'tell ... injectargs ...' directly
Signed-off-by: Kefu Chai <kchai@redhat.com>