At the end of start_rgw() we wait until establishing HTTP connections
with RadosGW becomes possible. However, if RadosGW uses FastCGI, that
condition cannot be met before the HTTP server is spawned.
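For illustration, a minimal sketch of this kind of readiness check
(hypothetical helper name and host/port, not the actual task code):

    import socket
    import time

    def wait_for_radosgw(host, port, timeout=60):
        # With FastCGI, nothing listens on this port until the HTTP
        # server (e.g. Apache) is spawned, so this must run after that.
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                socket.create_connection((host, port), timeout=5).close()
                return True
            except socket.error:
                time.sleep(1)
        return False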
Signed-off-by: Radoslaw Zarzynski <rzarzynski@mirantis.com>
If we run an upgrade test where, for example, the "jewel" branch is not
in the ceph-ci.git repo, we should fall back to ceph.git when cloning
the workunits.
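A sketch of the fallback (hypothetical helper; the URLs are the public
repos, not necessarily what teuthology is configured with):

    import subprocess

    def pick_workunit_repo(branch):
        ci_repo = 'https://github.com/ceph/ceph-ci.git'
        upstream = 'https://github.com/ceph/ceph.git'
        # ls-remote prints nothing if the branch is absent from the repo
        out = subprocess.check_output(
            ['git', 'ls-remote', '--heads', ci_repo, branch])
        return ci_repo if out.strip() else upstream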
Signed-off-by: Kefu Chai <kchai@redhat.com>
as "workunits" reside in ceph/qa/workunits, it's more intuitive to
respect suite-repo option when cloning workunits.
Signed-off-by: Kefu Chai <kchai@redhat.com>
We should not remove a pool from pools_to_fix_pgp_num if the pool was
not expanded, or if its pg_num was not increased because PGs were still
being created. Previously, the pool was dropped from
pools_to_fix_pgp_num as soon as set_pool_pgpnum() returned, even when it
had actually done nothing; this prevented us from fixing the pgp_num
after thrashing was done.
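A minimal sketch of the corrected bookkeeping (hypothetical names and
return value, not the actual ceph_manager.py code):

    def fix_pgp_nums(pools_to_fix_pgp_num, set_pool_pgpnum):
        # set_pool_pgpnum is assumed to return True only when pgp_num
        # was actually increased.
        for pool in list(pools_to_fix_pgp_num):
            if set_pool_pgpnum(pool):
                # the bump happened, so we can stop tracking this pool
                pools_to_fix_pgp_num.discard(pool)
            # otherwise keep it queued so it is fixed after thrashing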
Signed-off-by: Kefu Chai <kchai@redhat.com>
as "workunits" reside in ceph/qa/workunits, it's more intuitive to
respect suite-repo option when cloning workunits.
Signed-off-by: Kefu Chai <kchai@redhat.com>
It should live in teuthology, not in Ceph. It is also currently broken,
so there is no need to keep it around.
Fixes: http://tracker.ceph.com/issues/18846
Signed-off-by: Loic Dachary <loic@dachary.org>
There were some cases where we would leave a mountpoint behind, causing
teuthology's teardown to hang when it tried to look inside cephtest/.
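The idea, as a sketch (hypothetical helper, not the task's actual
cleanup code):

    import os
    import subprocess

    def cleanup_mountpoint(path):
        if os.path.ismount(path):
            # lazy unmount detaches the mount even if it is busy, so a
            # later listing of cephtest/ cannot hang on it
            subprocess.call(['sudo', 'umount', '-l', path])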
Signed-off-by: John Spray <john.spray@redhat.com>
Thrashing the MDS will often result in failures which do not stop the
test. Such a failure may also cause the test to stall, needlessly
keeping machines locked until a timeout is reached. This watchdog
unmounts mounts and kills daemons when a failure is detected.
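A simplified sketch of the watchdog idea (hypothetical daemon and
callback interfaces, not the actual qa/tasks code):

    import threading

    class DaemonWatchdog(threading.Thread):
        def __init__(self, daemons, unmount_all, interval=5):
            super(DaemonWatchdog, self).__init__()
            self.daemons = daemons          # objects with running()/kill()
            self.unmount_all = unmount_all  # callback to unmount clients
            self.interval = interval
            self.stopping = threading.Event()

        def run(self):
            while not self.stopping.wait(self.interval):
                if any(not d.running() for d in self.daemons):
                    self.unmount_all()      # unblock eventual teardown
                    for d in self.daemons:
                        d.kill()
                    break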
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
While the thrasher supports the behavior desired by issue 10792 [1], the
bugs uncovered by deactivating MDSs (and sometimes killing deactivating
MDSs) are presently a distraction from addressing issues during normal
failures. So thrashing max_mds is now turned off by default. I have
added a TODO to deactivate ranks in order (configurably), as random
deactivation causes a lot of other problems.
This also fixes a bug: random.randrange(0.0, 1.0) always returns 0.
Oops.
[1] http://tracker.ceph.com/issues/10792
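For reference, the randrange pitfall and the fix:

    import random

    # Under Python 2 (which this code ran on), randrange() coerces its
    # float arguments to the integer range [0, 1), whose only member is 0:
    #     random.randrange(0.0, 1.0)  # always 0; newer Pythons reject floats
    # Drawing a float in [0.0, 1.0) is what a probability check needs:
    if random.random() < 0.5:
        pass  # e.g. thrash max_mds on this iteration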
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Currently, multimds is prone to many failures when killing an active or
stopping MDS while other MDSs in the cluster have been deactivated
(stopping). Turn this off by default for now.
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
The thrasher can enter an infinite loop while waiting for an MDS to take
a certain rank when a replacement may not be possible, for example
because max_mds actives are already running.
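A bounded-wait sketch of the fix (hypothetical callback and timeout, not
the thrasher's actual code):

    import time

    def wait_for_rank_holder(get_holder, rank, timeout=60):
        # give up instead of spinning forever when no replacement can
        # ever take `rank`, e.g. max_mds actives are already running
        deadline = time.time() + timeout
        while time.time() < deadline:
            mds = get_holder(rank)  # the MDS holding rank, or None
            if mds is not None:
                return mds
            time.sleep(2)
        raise RuntimeError('no MDS took rank %d in %ds' % (rank, timeout))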
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
During the course of thrashing max_mds, the ranks assigned to MDSs may
develop holes. This causes the thrasher to wrongly try to deactivate
ranks that are not assigned.
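The fix, sketched with hypothetical inputs:

    import random

    def pick_rank_to_deactivate(assigned_ranks):
        # choose among ranks that actually exist, e.g. {0, 2, 5} after
        # holes develop, instead of assuming the range 0..max_mds-1
        return random.choice(sorted(assigned_ranks))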
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
https://github.com/ceph/ceph/pull/13194 introduced a regression:
2017-02-06T16:14:23.162 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph_master/qa/tasks/ceph_manager.py", line 722, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ceph_ceph_master/qa/tasks/ceph_manager.py", line 839, in do_thrash
    self.choose_action()()
  File "/home/teuthworker/src/github.com_ceph_ceph_master/qa/tasks/ceph_manager.py", line 305, in kill_osd
    output = proc.stderr.getvalue()
AttributeError: 'NoneType' object has no attribute 'getvalue'
This is because the original patch failed to pass "stderr=StringIO()" to run().
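The corrected call, sketched (the command is illustrative; `remote` is a
teuthology remote, and StringIO is the Python 2 module, matching
qa/tasks at the time):

    from StringIO import StringIO  # Python 2, as in qa/tasks then

    def kill_osd_sketch(remote, osd_id):
        # run() leaves proc.stderr as None unless a file-like object is
        # supplied, hence the AttributeError in the traceback above
        proc = remote.run(
            args=['sudo', 'stop', 'ceph-osd', 'id=%s' % osd_id],  # illustrative
            stderr=StringIO(),   # the piece the original patch omitted
            check_status=False,
        )
        return proc.stderr.getvalue()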
Fixes: http://tracker.ceph.com/issues/16263
Signed-off-by: Nathan Cutler <ncutler@suse.com>
Signed-off-by: Kefu Chai <kchai@redhat.com>
If Thrasher.__init__() spawns the do_thrash thread before initializing the
ceph_objectstore_tool property, do_thrash races with the rest
of Thrasher.__init__(), and in some cases do_thrash can call kill_osd() before
Thrasher.__init__() progresses much further. This can lead to an exception
("AttributeError: Thrasher instance has no attribute 'ceph_objectstore_tool'")
being thrown in kill_osd().
This commit eliminates the race by making sure the ceph_objectstore_tool
attribute is initialized before the do_thrash thread is spawned.
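A simplified illustration of the ordering (not the real Thrasher class):

    import threading

    class ThrasherSketch(object):
        def __init__(self):
            # every attribute the background thread reads is set first...
            self.ceph_objectstore_tool = False
            # ...and only then is the thread spawned, so do_thrash can
            # never observe a partially constructed instance
            self.thread = threading.Thread(target=self.do_thrash)
            self.thread.start()

        def do_thrash(self):
            if self.ceph_objectstore_tool:  # safe: always initialized
                pass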
Fixes: http://tracker.ceph.com/issues/18799
Signed-off-by: Nathan Cutler <ncutler@suse.com>
The umount process can get stuck, in which case we want to fail the test
rather than wait around for it. During teardown of the kclient task,
catch this timeout explicitly so that we can powercycle the node if
needed.
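The shape of the change, sketched (hypothetical helper; the real task
uses teuthology's own run/timeout machinery):

    import subprocess

    def umount_with_timeout(mountpoint, timeout=300):
        try:
            subprocess.run(['sudo', 'umount', mountpoint],
                           timeout=timeout, check=True)
        except subprocess.TimeoutExpired:
            # surface the hang so the caller can powercycle the node
            raise RuntimeError('umount of %s is stuck' % mountpoint)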
Signed-off-by: John Spray <john.spray@redhat.com>