Currently these can be thrown off if the cluster is creating or removing
pools at the same time. Fix by taking a single snapshot of the pg stats
and basing our judgement on that.
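A minimal sketch of the idea, assuming a direct `ceph pg dump` call; the helper names are illustrative, not the actual task code:

```python
import json
import subprocess

def pg_stats_snapshot():
    """Take one consistent snapshot of all PG stats."""
    out = subprocess.check_output(['ceph', 'pg', 'dump', '--format=json'])
    dump = json.loads(out)
    # newer releases nest the stats under 'pg_map'; fall back if not
    return dump.get('pg_map', dump).get('pg_stats', [])

def all_active_clean(pg_stats):
    # every judgement is made against the same snapshot, so pools being
    # created or removed mid-check cannot skew the result
    return all('active+clean' in pg['state'] for pg in pg_stats)

snapshot = pg_stats_snapshot()
healthy = all_active_clean(snapshot)
```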
Signed-off-by: Sage Weil <sage@redhat.com>
to be specific, ignore errors when querying an erasure coded pool's
erasure-code-profile. the pool might be removed after
"test_pool_min_size" lists all pools and before it queries the pools'
erasure-code-profile. in that case, we should just continue on with the
next pool.
normally, the pools are created by the "radosbench" tasks, and they
don't delete the ec profiles after removing the ec pools that use them,
but i don't want to rely on this fact. so, in this change, the `try` block
guards both `ceph osd pool get <pool_name> erasure_code_profile`
and `ceph osd erasure-code-profile get <profile>` calls.
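a minimal sketch of that guard, assuming direct `ceph` CLI calls; the loop and helper are illustrative, not the actual task code:

```python
import json
import subprocess

def ec_profiles_of_pools(pools):
    """collect erasure-code profiles, skipping pools that vanish mid-run"""
    profiles = {}
    for pool in pools:
        try:
            out = subprocess.check_output(
                ['ceph', 'osd', 'pool', 'get', pool,
                 'erasure_code_profile', '--format=json'])
            name = json.loads(out)['erasure_code_profile']
            out = subprocess.check_output(
                ['ceph', 'osd', 'erasure-code-profile', 'get', name,
                 '--format=json'])
            profiles[name] = json.loads(out)
        except subprocess.CalledProcessError:
            # the pool (or its profile) was removed after we listed the
            # pools; just continue with the next pool
            continue
    return profiles
```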
Fixes: http://tracker.ceph.com/issues/40533
Signed-off-by: Kefu Chai <kchai@redhat.com>
If there are leftover merges at the end of the run they can take a long
time to get through, blowing our timeouts for waiting for PGs to become
active, for splitting/merging to stop, and for scrubbing PGs. Stop all of
that at the end of the run so that we don't have to wait so long.
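For illustration only (not necessarily the mechanism this change uses): a leftover merge shows up as a pool whose pg_num is still above its pg_num_target, so the stragglers can be spotted like this; the exact field names in the osdmap dump are my assumption:

```python
import json
import subprocess

def pools_with_pending_merges():
    """pools whose pg_num is still shrinking toward pg_num_target"""
    osdmap = json.loads(subprocess.check_output(
        ['ceph', 'osd', 'dump', '--format=json']))
    return [p['pool_name'] for p in osdmap['pools']
            if p.get('pg_num_target', p['pg_num']) < p['pg_num']]
```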
Signed-off-by: Sage Weil <sage@redhat.com>
Some tests have m=2,k=2 and this will break them. Sometimes, even if we
have 5 up OSDs, we end up with 4 and CRUSH gets picky, so build in a
buffer and only do this if we have 6 up.
We don't have an easy way from here to see the minimum number of up OSDs
needed for health... basically this map discontinuity test just sucks.
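A small sketch of the buffer check, counting up OSDs from `ceph osd dump`; the helper names are illustrative:

```python
import json
import subprocess

def up_osd_count():
    """count OSDs the cluster currently considers up"""
    osdmap = json.loads(subprocess.check_output(
        ['ceph', 'osd', 'dump', '--format=json']))
    return sum(1 for o in osdmap['osds'] if o['up'])

# k=2,m=2 pools need 4 acting OSDs; require a two-OSD buffer before
# deliberately taking any down for the discontinuity test
if up_osd_count() >= 6:
    print('enough headroom; proceed with the thrash step')
```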
Signed-off-by: Sage Weil <sage@redhat.com>
We currently import a portion of the PG if it has split. Merge is more
complicated, though, mainly because COT (ceph-objectstore-tool) is operating in a mode where it
fast-forwards the PG to the latest OSDMap epoch, which means it has to
implement any transformations to the PG (split/merge) independently.
Avoid doing this for merge.
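One way the caller could honor that restriction, sketched under the assumption that the pool's pg_num at export time is known; the helper name is illustrative:

```python
def safe_to_import(pg_num_at_export, pg_num_now):
    """ceph-objectstore-tool can fast-forward an exported PG across a
    split, but not across a merge, so skip the import when the pool has
    fewer PGs now than when the PG was exported."""
    return pg_num_now >= pg_num_at_export
```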
Signed-off-by: Sage Weil <sage@redhat.com>
Also:
- Do not print **offset** until it has been specified
- Count missing objects correctly (it used to be the primary's local missing count)
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
the default timeout is none in that case, and there are cases where it can
hang forever due to errors. Since this dumps quite a lot of info, the logs
grow to GBs; with a default timeout of 1200 seconds we avoid such huge logs
and fail sooner. Any test needing a higher timeout can pass the required value.
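A hedged sketch of the shape of this change; only the 1200-second default comes from the text, the dump helper and the commands it runs are illustrative:

```python
import subprocess

def dump_cluster_info(timeout=1200):
    """dump diagnostic state, but fail after `timeout` seconds instead of
    hanging forever and filling the logs"""
    for cmd in (['ceph', 'status'], ['ceph', 'pg', 'dump']):
        subprocess.run(cmd, check=True, timeout=timeout)

# a test that genuinely needs more time can override the default
dump_cluster_info(timeout=3600)
```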
Signed-off-by: Vasu Kulkarni <vasu@redhat.com>
It's possible for tell osd.* to race against an osd we stopped but the
cluster doesn't know is down yet. In that case we'll get ENXIO on that
osd and the command will fail.
In this context, we don't care.
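A minimal sketch of tolerating that race; the injectargs example and the helper are illustrative:

```python
import subprocess

def tell_all_osds(*args):
    """broadcast to every OSD, tolerating the race against an OSD that is
    stopped but not yet marked down (which makes the command fail)"""
    try:
        subprocess.check_call(['ceph', 'tell', 'osd.*', *args])
    except subprocess.CalledProcessError:
        pass  # in this context, we don't care

tell_all_osds('injectargs', '--osd-max-backfills=1')
```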
Signed-off-by: Sage Weil <sage@redhat.com>
* move Thrasher._set_config() to CephManager, make it a public method,
and rename it to inject_args(),
* use this method instead of using 'tell ... injectargs ...' directly
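a hedged sketch of what such a public method might look like; the exact signature and option-name handling in CephManager may differ:

```python
import subprocess

class CephManager:
    def raw_cluster_cmd(self, *args):
        """thin wrapper over the ceph CLI (stand-in for the real method)"""
        return subprocess.check_output(('ceph',) + args)

    def inject_args(self, service_type, service_id, name, value):
        """inject a runtime option via 'ceph tell <type>.<id> injectargs ...'
        so callers don't build the tell command by hand"""
        opt = '--{} {}'.format(name.replace('_', '-'), value)
        self.raw_cluster_cmd('tell',
                             '{}.{}'.format(service_type, service_id),
                             'injectargs', opt)

# usage, replacing a hand-rolled 'tell ... injectargs ...':
# CephManager().inject_args('osd', '*', 'osd_max_backfills', 1)
```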
Signed-off-by: Kefu Chai <kchai@redhat.com>
an osd will refuse to create new pgs until its pg number is lower
than the max-pg-per-osd upper bound setting.
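a tiny sketch of the bound being described; the 250 default is an assumption on my part (the actual mon_max_pg_per_osd default has changed across releases):

```python
def osd_accepts_new_pg(pgs_on_osd, max_pg_per_osd=250):
    """an OSD refuses to instantiate another PG while it already holds
    max_pg_per_osd or more; creation resumes once the count drops below"""
    return pgs_on_osd < max_pg_per_osd

print(osd_accepts_new_pg(199))  # True
print(osd_accepts_new_pg(250))  # False
```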
Signed-off-by: Kefu Chai <kchai@redhat.com>
bluestore_fsck_on_mount and bluestore_fsck_on_mount_deep are enabled by
default, and bluestore is used as the default store backend. it takes
longer to perform the deep fsck with verbose logging, so prolong
revive_osd()'s timeout from 150 sec to 360 sec.
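a minimal sketch of the kind of wait that needs the longer deadline; only the 360-second figure comes from this change, the polling helper is illustrative:

```python
import json
import subprocess
import time

def wait_for_osd_up(osd_id, timeout=360):
    """poll until the revived OSD is marked up; a deep fsck on mount with
    verbose logging can easily exceed the old 150-second budget"""
    deadline = time.time() + timeout
    while time.time() < deadline:
        osdmap = json.loads(subprocess.check_output(
            ['ceph', 'osd', 'dump', '--format=json']))
        if any(o['osd'] == osd_id and o['up'] for o in osdmap['osds']):
            return
        time.sleep(3)
    raise RuntimeError('osd.%d still down after %d seconds' % (osd_id, timeout))
```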
Fixes: http://tracker.ceph.com/issues/21474
Signed-off-by: Kefu Chai <kchai@redhat.com>
PG states may all be active+clean when no recovery is going on,
so check them again before timing out.
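A sketch of the re-check, with an illustrative `get_pg_states` callable standing in for however the task reads PG states:

```python
import time

def wait_for_recovery(get_pg_states, timeout=300):
    """wait for recovery to finish, re-checking the PG states one last
    time before declaring a timeout: with no recovery in flight they may
    already all be active+clean"""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if all(s == 'active+clean' for s in get_pg_states()):
            return
        time.sleep(5)
    # final re-check before failing, to avoid a spurious timeout
    assert all(s == 'active+clean' for s in get_pg_states()), \
        'recovery did not complete within %d seconds' % timeout
```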
Fixes: http://tracker.ceph.com/issues/21294
Signed-off-by: huangjun <huangjun@xsky.com>
We assume below that rerrosd is up, but it may not be up when we exit
the loop.
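A tiny sketch of checking that assumption explicitly; the original test is a shell script, so the Python helper and the `revive` callable here are purely illustrative:

```python
def ensure_osd_up(osd_id, osd_dump, revive):
    """check the 'OSD is up' assumption instead of relying on it.
    `osd_dump` is parsed `ceph osd dump --format=json`; `revive` is
    whatever the harness uses to restart the daemon (illustrative)."""
    if not any(o['osd'] == osd_id and o['up'] for o in osd_dump['osds']):
        revive(osd_id)
```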
Fixes: http://tracker.ceph.com/issues/21206
Signed-off-by: Sage Weil <sage@redhat.com>