mimic does not report "pg_num_target" in "osd dump", so skip the check
when the field is missing, as it is when performing an upgrade test.
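A minimal sketch of such a tolerant check. The helper name `pg_num_target_settled` is hypothetical, and `pool` is assumed to be one entry of the "pools" list in the JSON form of "osd dump":

```python
def pg_num_target_settled(pool):
    """Return True when there is no pending pg_num change for this pool.

    `pool` is assumed to be a dict describing one pool from "osd dump".
    Older releases such as mimic do not report "pg_num_target", so a
    missing key is treated as "nothing to wait for".
    """
    target = pool.get('pg_num_target')
    if target is None:
        # field not reported by this release: skip the check
        return True
    return pool['pg_num'] == target
```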
Signed-off-by: Kefu Chai <kchai@redhat.com>
* refs/pull/28855/head:
doc: document scrub summary in ceph status output
test: extend scrub control test to validate mds task status
mds: send scrub state changes to cluster log.
mds: periodically send mds scrub status to ceph manager
mgr, mon: allow normal ceph services to register with manager
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
* refs/pull/28378/head:
qa/tasks: introduce Thrasher base class
qa/tasks: Fix typo
qa/tasks: manage thrashers
qa/tasks: start DaemonWatchdog when ceph starts
qa/tasks: make watch and bark handle more daemons
qa/tasks: move DaemonWatchdog to new file
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
* Introduced a Thrasher base class.
* Updated thrashers to inherit from Thrasher.
* Replaced the magic variable e with Thrasher.exception as per the discussion.
The exception variable is now set by default because the thrashers inherit
from the Thrasher class.
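The shape of such a base class can be sketched as follows. This is an illustration, not the merged code: the private `_exception` field, `set_thrasher_exception()` and `ExampleThrasher` are assumed names.

```python
class Thrasher(object):
    """Base class that gives every thrasher an `exception` attribute."""

    def __init__(self):
        super(Thrasher, self).__init__()
        # set by default, so callers can always read thrasher.exception
        self._exception = None

    @property
    def exception(self):
        return self._exception

    def set_thrasher_exception(self, e):
        # hypothetical setter used by subclasses when thrashing fails
        self._exception = e


class ExampleThrasher(Thrasher):
    """Hypothetical subclass standing in for a real thrasher."""

    def do_thrash(self):
        try:
            raise RuntimeError("simulated failure")
        except Exception as e:
            self.set_thrasher_exception(e)
```

A watchdog can then inspect `thrasher.exception` on any thrasher without knowing its concrete type.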
Fixes: https://github.com/ceph/ceph/pull/28378#discussion_r309337928
Fixes: https://tracker.ceph.com/issues/41133
Signed-off-by: Jos Collin <jcollin@redhat.com>
* refs/pull/29493/head:
qa/tasks/mgr/mgr_test_case: get mgrmap from 'mgr dump', not status
qa/tasks/ceph_manager: no newlines in 'ceph -s' output
mon: make mon summary more concise in 'ceph -s'
mon/MgrStatMonitor: set initial service_map 'modified' to cluster mkfs
mon: remove double-nesting of "osdmap" for ceph status
mon/MgrMap: make print_summary (used by 'ceph -s') more concise
Reviewed-by: Neha Ojha <nojha@redhat.com>
Reviewed-by: Kefu Chai <kchai@redhat.com>
fbd4836d introduced a regression: the call
    self.log("failed to read erasure_code_profile. %s was likely removed",
             pool)
fails because `self.log` is actually a lambda that takes a single argument and just does
    self.logger.info(x)
in this change
* `Thrasher.log()` is added for three reasons:
- in PEP-8,
> Always use a def statement instead of an assignment statement that
> binds a lambda expression directly to an identifier
so a better way is to define a method using `def`
- and i think it helps with the readability
* the `logger` parameter is now mandatory in the constructor of the
`Thrasher` class, because the only instance of this class is created
by `qa/tasks/thrashosds.py`, like:
thrash_proc = ceph_manager.Thrasher(
cluster_manager,
config,
logger=log.getChild('thrasher')
)
and `log.getChild()` does not return `None`, so there is no need to
handle that case.
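A sketch of a `def`-based `log()` that forwards format arguments to the logger. The class name `ThrasherSketch` and the list-collecting handler are assumptions made so the example runs on its own:

```python
import logging


class ThrasherSketch(object):
    def __init__(self, logger):
        # logger is mandatory; callers pass e.g. log.getChild('thrasher')
        self.logger = logger

    def log(self, msg, *args, **kwargs):
        # forward the format arguments, so log("... %s ...", pool) works
        self.logger.info(msg, *args, **kwargs)


class ListHandler(logging.Handler):
    """Collects formatted messages so the example can be inspected."""

    def __init__(self):
        super(ListHandler, self).__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


log = logging.getLogger('example')
log.setLevel(logging.INFO)
handler = ListHandler()
log.addHandler(handler)

thrasher = ThrasherSketch(logger=log.getChild('thrasher'))
thrasher.log("failed to read erasure_code_profile. %s was likely removed",
             "mypool")
```

With a single-argument lambda the same call raises a TypeError, because the pool name has nowhere to go.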
Signed-off-by: Kefu Chai <kchai@redhat.com>
Currently these can be thrown off if the cluster is creating or removing
pools at the same time. Fix by taking a single snapshot of the pg stats
and basing our judgement on that.
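A sketch of the single-snapshot idea. It assumes `pg_stats` is a list of dicts, each with a '+'-joined 'state' string, all taken from one pg dump; the function names are illustrative:

```python
def num_active_clean(pg_stats):
    """Count pgs that are both active and clean in the given snapshot."""
    count = 0
    for pg in pg_stats:
        states = pg['state'].split('+')
        if 'active' in states and 'clean' in states:
            count += 1
    return count


def is_clean(pg_stats):
    # Both numbers come from the same snapshot, so a pool created or
    # removed between two separate queries cannot skew the comparison.
    return num_active_clean(pg_stats) == len(pg_stats)
```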
Signed-off-by: Sage Weil <sage@redhat.com>
to be specific, ignore errors when querying an erasure coded pool's
erasure-code-profile. the pool might be removed after
"test_pool_min_size" lists all pools and before it queries that pool's
erasure-code-profile. in that case, we should just continue with the
next pool.
normally, the pools are created by the "radosbench" tasks. and they
don't delete the ec profiles after removing the ec pools using them, but
i don't want to rely on this fact. so, in this change, the `try` block
guards both `ceph osd pool get <pool_name> erasure_code_profile`
and `ceph osd erasure-code-profile get <profile>` calls.
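The guarded loop can be sketched like this. `CommandFailedError` and the two lookup callables are hypothetical stand-ins for the code that runs the two `ceph` commands:

```python
class CommandFailedError(Exception):
    """Hypothetical error raised when a cluster command fails."""


def collect_ec_profiles(pools, get_pool_profile_name, get_profile):
    """Map each surviving pool to its erasure code profile.

    get_pool_profile_name(pool) and get_profile(name) are assumed to raise
    CommandFailedError when the pool or the profile no longer exists.
    """
    profiles = {}
    for pool in pools:
        try:
            # both lookups sit inside the same try block
            name = get_pool_profile_name(pool)
            profiles[pool] = get_profile(name)
        except CommandFailedError:
            # the pool (or its profile) vanished after listing: move on
            continue
    return profiles
```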
Fixes: http://tracker.ceph.com/issues/40533
Signed-off-by: Kefu Chai <kchai@redhat.com>
If there are leftover merges at the end of the run they can take a long
time to get through, blowing our timeouts for waiting for pgs to become
active and to stop splitting/merging, and for scrubbing pgs. Stop all of that
at the end of the run so that we don't have to wait so long.
Signed-off-by: Sage Weil <sage@redhat.com>
Some tests have m=2,k=2 and this will break them. Sometimes even if we
have 5 up osds, we end up with 4 and CRUSH gets picky, so build in a
buffer and only do this if we have 6 up.
We don't have an easy way from here to see what the min up osds for healthy
is... basically this map discontinuity test just sucks.
Signed-off-by: Sage Weil <sage@redhat.com>
We currently import a portion of the PG if it has split. Merge is more
complicated, though, mainly because COT is operating in a mode where it
fast-forwards the PG to the latest OSDMap epoch, which means it has to
implement any transformations to the PG (split/merge) independently.
Avoid doing this for merge.
Signed-off-by: Sage Weil <sage@redhat.com>
Also:
- Do not print **offset** until specified
- Count missing objects correctly (used to be primary's local missing)
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
there is no default timeout in that case, so it can hang forever when an
error occurs. since this dumps quite a lot of info, the logs grow to gigabytes. with a
default timeout of 1200 we can avoid such huge logs and fail sooner. any test needing
a higher timeout can pass the required value.
Signed-off-by: Vasu Kulkarni <vasu@redhat.com>
It's possible for tell osd.* to race against an osd we stopped but the
cluster doesn't know is down yet. In that case we'll get ENXIO on that
osd and the command will fail.
In this context, we don't care.
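A sketch of tolerating that failure. It assumes the command's exit status carries the errno value, which is not stated here, and the helper name is hypothetical:

```python
import errno


def tell_all_osds_ok(exit_status):
    """Return True if a 'tell osd.*' exit status counts as success.

    Assumption: a stopped-but-not-yet-marked-down osd makes the command
    exit with ENXIO, which is harmless in this context.
    """
    return exit_status in (0, errno.ENXIO)
```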
Signed-off-by: Sage Weil <sage@redhat.com>
* move Thrasher._set_config() to CephManager, and make it a public
method, and rename it to inject_args(),
* use this method instead of using 'tell ... injectargs ...' directly
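A sketch of how such a helper might assemble the command. The function name and parameter list are assumptions, and it returns the argument list instead of running it so the example stays self-contained:

```python
def inject_args_command(service_type, service_id, name, value):
    """Build the arguments for 'ceph tell <daemon> injectargs --name=value'."""
    whom = '{0}.{1}'.format(service_type, service_id)
    if isinstance(value, bool):
        # render Python booleans as lowercase true/false
        value = 'true' if value else 'false'
    opt_arg = '--{name}={value}'.format(name=name, value=value)
    return ['tell', whom, 'injectargs', opt_arg]
```

Keeping the formatting in one place means callers no longer build 'tell ... injectargs ...' strings by hand.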
Signed-off-by: Kefu Chai <kchai@redhat.com>
an osd will refuse to create new pgs until its pg count is lower
than the max-pg-per-osd upper bound setting.
Signed-off-by: Kefu Chai <kchai@redhat.com>
bluestore_fsck_on_mount and bluestore_fsck_on_mount_deep are enabled by
default. and bluestore is used as the default store backend. it takes
longer to perform the deep fsck with verbose logging. so extend
revive_osd()'s timeout from 150 sec to 360 sec.
Fixes: http://tracker.ceph.com/issues/21474
Signed-off-by: Kefu Chai <kchai@redhat.com>
All PGs may already be active+clean when no recovery is going on,
so check the state again before timing out.
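A sketch of a wait loop with that final re-check. The function name, the injectable `clock` and `sleep` parameters, and the default values are assumptions for illustration:

```python
import time


def wait_for_clean(is_clean, timeout=1200, interval=3,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll is_clean() until it returns True or `timeout` seconds pass."""
    start = clock()
    while not is_clean():
        if clock() - start >= timeout:
            # the state may have settled since the last poll,
            # so look once more before declaring failure
            if is_clean():
                break
            raise AssertionError('timed out waiting for clean')
        sleep(interval)
    return True
```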
Fixes: http://tracker.ceph.com/issues/21294
Signed-off-by: huangjun <huangjun@xsky.com>
We assume below that rerrosd is up, but it may not be when we exit the
loop.
Fixes: http://tracker.ceph.com/issues/21206
Signed-off-by: Sage Weil <sage@redhat.com>
This randomly issues pg force-recovery/force-backfill and
pg cancel-force-recovery/cancel-force-backfill during QA
testing. Disabled for upgrades from hammer, jewel and kraken.
Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>
New option "random_eio" to Thrasher, sets the random read error percentage on 1 osd
New option "objectsize" to radosbench task (-o bench option)
New option "type" to radosbench to specify write, seq or rand
Signed-off-by: David Zafman <dzafman@redhat.com>
Make sure OSDs are up *and* they have flushed their PG stats before
waiting for recovery to ensure that we do not see a stale 'clean' state.
Signed-off-by: Sage Weil <sage@redhat.com>
The helper gets a sequence number from the osd (or osds), and then
polls the mon until that seq is reflected there.
This is overkill in some cases, since many tests only require that the
stats be reflected on the mgr (not the mon), but waiting for it to also
reach the mon is sufficient!
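The helper's flow can be sketched as below. The two callables are hypothetical stand-ins for asking an osd to flush its stats and for reading the sequence number the mon has seen; the wait budget is likewise an assumption:

```python
import time


def flush_pg_stats(osds, flush_osd, last_stat_seq,
                   wait_for_mon=300, sleep=time.sleep):
    """Flush stats on each osd, then wait until the mon has caught up.

    flush_osd(osd) is assumed to return the sequence number of the flush;
    last_stat_seq(osd) is assumed to return the latest one seen by the mon.
    """
    need = {osd: flush_osd(osd) for osd in osds}
    for osd, seq in need.items():
        waited = 0
        while last_stat_seq(osd) < seq:
            if waited >= wait_for_mon:
                raise RuntimeError(
                    'osd.%d: mon never reached seq %d' % (osd, seq))
            sleep(1)
            waited += 1
```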
Signed-off-by: Sage Weil <sage@redhat.com>
Pulling this out of the 'pg dump' heap is inefficient.
Also, pg dump data comes from the mgr and may be stale.
Signed-off-by: Sage Weil <sage@redhat.com>
Keep the pool flag around so we can distinguish between a pool that
should maintain hashes for each chunk, where a missing hash is a bug, and
an overwrites pool, where we rely on bluestore checksums for detecting
corruption.
Signed-off-by: Josh Durgin <jdurgin@redhat.com>
'remap' is too non-specific a name. In particular, it
sounds like it is related to the 'remapped' PG state,
but in reality it is not.
'upmap' or 'pg-upmap' is more specific: it maps a pgid
to the 'up' set value (or item)
Signed-off-by: Sage Weil <sage@redhat.com>
On slower machines (VPS, OVH) it takes time for the OSD to go down.
Fixes: http://tracker.ceph.com/issues/19556
Signed-off-by: Nathan Cutler <ncutler@suse.com>
we should not update pools_to_fix_pgp_num if the pool was not expanded or
its pg_num was not increased because pgs were still being created. the same
goes for fixing the pgp_num while thrashing: if set_pool_pgpnum() actually did
nothing but we still removed the pool from pools_to_fix_pgp_num after it
returned, we could no longer fix the pgp_num once thrashing is done.
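A sketch of the intended bookkeeping. It assumes the manager's `expand_pool()` (a hypothetical name) and `set_pool_pgpnum()` return True only when they actually changed something; `PoolTracker` is illustrative:

```python
class PoolTracker(object):
    """Hypothetical holder for the pools whose pgp_num still needs fixing."""

    def __init__(self, manager):
        self.manager = manager
        self.pools_to_fix_pgp_num = set()

    def grow_pool(self, pool):
        # remember the pool only if its pg_num was really increased
        if self.manager.expand_pool(pool):
            self.pools_to_fix_pgp_num.add(pool)

    def fix_pgp_num(self, pool):
        # forget the pool only if its pgp_num was really set
        if self.manager.set_pool_pgpnum(pool):
            self.pools_to_fix_pgp_num.discard(pool)
```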
Signed-off-by: Kefu Chai <kchai@redhat.com>