We assume below that rerrosd is up, but it may not be when we exit the
loop.
Fixes: http://tracker.ceph.com/issues/21206
Signed-off-by: Sage Weil <sage@redhat.com>
This randomly issues pg force-recovery/force-backfill and
pg cancel-force-recovery/cancel-force-backfill during QA
testing. Disabled for upgrades from hammer, jewel and kraken.
Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>
New option "random_eio" to Thrasher, sets 1 osd random read percentage
New option "objectsize" to radosbench task (-o bench option)
New option "type" to radosbench specify write, seq or rand
Signed-off-by: David Zafman <dzafman@redhat.com>
Make sure OSDs are up *and* they have flushed their PG stats before
waiting for recovery to ensure that we do not see a stale 'clean' state.
Signed-off-by: Sage Weil <sage@redhat.com>
The helper gets a sequence number from the osd (or osds), and then
polls the mon until that seq is reflected there.
This is overkill in some cases, since many tests only require that the
stats be reflected on the mgr (not the mon), but waiting for it to also
reach the mon is sufficient!
Signed-off-by: Sage Weil <sage@redhat.com>
Pulling this out of the 'pg dump' heap is inefficient.
Also, pg dump data comes from the mgr and may be stale.
Signed-off-by: Sage Weil <sage@redhat.com>
Keep the pool flag around so we can distinguish between a pool that
should maintain hashes for each chunk, and a missing one is a bug, vs
an overwrites pool where we rely on bluestore checksums for detecting
corruption.
Signed-off-by: Josh Durgin <jdurgin@redhat.com>
'remap' is to non-specific a name. In particular, it
sounds like it is related to the 'remapped' PG state
but in reality it is not related.
'upmap' or 'pg-upmap' is more specific: it maps a pgid
to the 'up' set value (or item)
Signed-off-by: Sage Weil <sage@redhat.com>
On slower machines (VPS, OVH) it takes time for the OSD to go down.
Fixes: http://tracker.ceph.com/issues/19556
Signed-off-by: Nathan Cutler <ncutler@suse.com>
we should not update pools_to_fix_pgp_num if the pool is not expanded or
the pg_num is not increased due to pgs being created. this prevent us
from fixing the pgp_num after done with thrashing if we actually did
nothing when fixing the pgp_num when thrashing, but we removed the pool
from pools_to_fix_pgp_num after set_pool_pgpnum() returns.
Signed-off-by: Kefu Chai <kchai@redhat.com>
https://github.com/ceph/ceph/pull/13194 introduced a regression:
2017-02-06T16:14:23.162 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
File "/home/teuthworker/src/github.com_ceph_ceph_master/qa/tasks/ceph_manager.py", line 722, in wrapper
return func(self)
File "/home/teuthworker/src/github.com_ceph_ceph_master/qa/tasks/ceph_manager.py", line 839, in do_thrash
self.choose_action()()
File "/home/teuthworker/src/github.com_ceph_ceph_master/qa/tasks/ceph_manager.py", line 305, in kill_osd
output = proc.stderr.getvalue()
AttributeError: 'NoneType' object has no attribute 'getvalue'
This is because the original patch failed to pass "stderr=StringIO()" to run().
Fixes: http://tracker.ceph.com/issues/16263
Signed-off-by: Nathan Cutler <ncutler@suse.com>
Signed-off-by: Kefu Chai <kchai@redhat.com>
If Thrasher.__init__() spawns the do_thrash thread before initializing the
ceph_objectstore_tool property, do_thrash races with the rest
of Thrasher.__init__() and in some cases do_thrash can call kill_osd() before
Trasher.__init__() progresses much further. This can lead to an exception
("AttributeError: Thrasher instance has no attribute 'ceph_objectstore_tool'")
being thrown in kill_osd().
This commit eliminates the race by making sure the ceph_objectstore_tool
attribute is initialized before the do_thrash thread is spawned.
Fixes: http://tracker.ceph.com/issues/18799
Signed-off-by: Nathan Cutler <ncutler@suse.com>