Currently multimds is prone to many failures when killing an active or
stopping MDS while the cluster contains MDSs which have been
deactivated (i.e. are stopping). Turn this off by default for now.
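A minimal sketch of the intended default, assuming the thrasher reads a
config dict; the option name 'thrash_while_stopping' is an assumption
for illustration, not necessarily the real key:

    def should_thrash(config, any_rank_stopping):
        # Off by default: don't kill or deactivate MDSs while any rank
        # is stopping, unless the (hypothetical) option enables it.
        if any_rank_stopping and not config.get('thrash_while_stopping', False):
            return False
        return True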
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
The thrasher can enter an infinite loop waiting for an MDS to take a
certain rank when a replacement may not be possible, for example when
max_mds active MDSs are already running.
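A sketch of the guard that avoids the endless wait; the helper name and
signature are illustrative:

    def replacement_possible(num_actives, max_mds, num_standbys):
        # If max_mds ranks are already active and no standby exists,
        # no MDS will ever take the rank we are waiting for.
        return num_actives < max_mds or num_standbys > 0

The waiting loop can then bail out (or fail fast) when this returns
False instead of polling forever.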
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
During the course of thrashing max_mds, the ranks assigned to MDSs may
develop holes. This causes the thrasher to wrongly try to deactivate
ranks that are not assigned.
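In sketch form, the rank to deactivate should be drawn from the ranks
actually assigned rather than from range(max_mds):

    import random

    def choose_rank_to_deactivate(assigned_ranks):
        # assigned_ranks may have holes, e.g. {0, 2, 5}; pick only from
        # ranks that really exist in the fsmap.
        ranks = sorted(assigned_ranks)
        return random.choice(ranks) if ranks else None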
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
https://github.com/ceph/ceph/pull/13194 introduced a regression:
    2017-02-06T16:14:23.162 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
      File "/home/teuthworker/src/github.com_ceph_ceph_master/qa/tasks/ceph_manager.py", line 722, in wrapper
        return func(self)
      File "/home/teuthworker/src/github.com_ceph_ceph_master/qa/tasks/ceph_manager.py", line 839, in do_thrash
        self.choose_action()()
      File "/home/teuthworker/src/github.com_ceph_ceph_master/qa/tasks/ceph_manager.py", line 305, in kill_osd
        output = proc.stderr.getvalue()
    AttributeError: 'NoneType' object has no attribute 'getvalue'
This is because the original patch failed to pass "stderr=StringIO()" to run().
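The shape of the fix, sketched; remote.run() stands for teuthology's
remote command runner and the command itself is elided:

    from io import StringIO  # Python 2's StringIO module at the time

    def run_and_capture(remote, command):
        # Pass a StringIO as stderr so that proc.stderr is not None
        # and its output can be read back with getvalue().
        proc = remote.run(args=command, stderr=StringIO())
        return proc.stderr.getvalue()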
Fixes: http://tracker.ceph.com/issues/16263
Signed-off-by: Nathan Cutler <ncutler@suse.com>
Signed-off-by: Kefu Chai <kchai@redhat.com>
If Thrasher.__init__() spawns the do_thrash thread before initializing the
ceph_objectstore_tool property, do_thrash races with the rest
of Thrasher.__init__() and in some cases do_thrash can call kill_osd() before
Thrasher.__init__() progresses much further. This can lead to an exception
("AttributeError: Thrasher instance has no attribute 'ceph_objectstore_tool'")
being thrown in kill_osd().
This commit eliminates the race by making sure the ceph_objectstore_tool
attribute is initialized before the do_thrash thread is spawned.
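A minimal sketch of the ordering; the default value and surrounding
attributes are assumptions:

    import gevent

    class Thrasher(object):
        def __init__(self, manager, config):
            self.ceph_manager = manager
            self.config = config
            # Everything kill_osd() may touch must exist *before* the
            # thread starts, or do_thrash races with __init__.
            self.ceph_objectstore_tool = config.get('ceph_objectstore_tool', False)
            # Only now is it safe to spawn the background thread.
            self.thread = gevent.spawn(self.do_thrash)

        def do_thrash(self):
            pass  # thrashing loop elided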
Fixes: http://tracker.ceph.com/issues/18799
Signed-off-by: Nathan Cutler <ncutler@suse.com>
The umount process can get stuck, in which case we want to fail the
test rather than waiting around for it. During teardown of the kclient
task, catch this timeout explicitly so that we will power cycle the
node if needed.
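Roughly the intended shape, assuming the stuck umount surfaces as
teuthology's MaxWhileTries timeout:

    from teuthology.exceptions import MaxWhileTries

    def teardown(mount):
        try:
            mount.umount()
        except MaxWhileTries:
            # umount is stuck: fall back to the big hammer, which may
            # power cycle the node, rather than hanging forever.
            mount.umount_wait(force=True)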
Signed-off-by: John Spray <john.spray@redhat.com>
Do the write after opening the file, so that we get good behaviour
with respect to the change in Mount.open_background that uses file
existence to confirm that the open happened.
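That is, the remote program orders its operations like this (a sketch
of the idea, not the exact script):

    def remote_program(path, content):
        # Open first: Mount.open_background uses the file's existence
        # to confirm the open happened, so the write must come after.
        f = open(path, 'w')
        f.write(content)
        f.flush()
        return f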
Signed-off-by: John Spray <john.spray@redhat.com>
Previously we could readily end up hanging on teardown when something
had gone wrong with umount. Forcing is a big hammer (umount_wait will
power cycle the node if umount isn't working), so if we had to do
that, raise an exception to indicate that something was wrong with the
test.
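In sketch form:

    def teardown(mount):
        try:
            mount.umount()
        except Exception:
            # Big hammer: umount_wait(force=True) may power cycle the node.
            mount.umount_wait(force=True)
            # Re-raise so the test is marked failed rather than silently
            # passing after a forced cleanup.
            raise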
Fixes: http://tracker.ceph.com/issues/18663
Signed-off-by: John Spray <john.spray@redhat.com>
Previously a later remote call could end up executing
before the remote python program in open_background
had actually got as far as opening the file.
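In sketch form, open_background now blocks until the file shows up
before returning; path_exists is a hypothetical callable checking
existence on the client:

    import time

    def wait_for_file(path_exists, timeout=30):
        # Poll until the remote program has created the file.
        for _ in range(timeout):
            if path_exists():
                return
            time.sleep(1)
        raise RuntimeError("open_background file never appeared")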
Fixes: http://tracker.ceph.com/issues/18661
Signed-off-by: John Spray <john.spray@redhat.com>
Convenient when you want to create a fresh cluster
each test run: just pass --create and you'll get
a cluster with the right number of daemons for
the tests you're running.
Signed-off-by: John Spray <john.spray@redhat.com>
Previously this could get hung up if we killed one
PID and then the daemon reappeared with a different
one (perhaps because we caught it during
daemonization?).
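The fix amounts to re-reading the PID on every attempt, roughly:

    import os
    import signal
    import time

    def kill_until_gone(read_pid, tries=30):
        # read_pid is a hypothetical helper returning the daemon's
        # current PID, or None once it is really gone.
        for _ in range(tries):
            pid = read_pid()
            if pid is None:
                return
            try:
                os.kill(pid, signal.SIGKILL)
            except OSError:
                pass  # it exited between read_pid() and kill()
            time.sleep(1)
        raise RuntimeError("daemon would not stay dead")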
Signed-off-by: John Spray <john.spray@redhat.com>
If we checkout ceph-ci.git, and don't find a branch,
we'll try again from ceph.git. But the checkout will
already exist and the clone will fail, so we'll still
fail to find the branch.
The same can happen if a previous workunit task already
checked out the repo.
Fix by removing the repo before checkout (the first and
second times). Note that this may break if there are
multiple workunit tasks running in parallel on the same
role. That is already racy, so if it's happening, we'll
want to switch to using a truly unique clonedir for each
instantiation.
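A sketch of the retry, with the cleanup before each attempt; the URLs
and helper names are illustrative:

    from teuthology.orchestra.run import CommandFailedError

    def checkout(remote, clonedir, url, branch):
        # Remove any stale checkout first, so a prior clone (or a
        # previous workunit task) can't make this one fail.
        remote.run(args=['rm', '-rf', clonedir])
        remote.run(args=['git', 'clone', '-b', branch, url, clonedir])

    def checkout_with_fallback(remote, clonedir, branch):
        try:
            checkout(remote, clonedir, 'https://github.com/ceph/ceph-ci.git', branch)
        except CommandFailedError:
            # Branch not in ceph-ci.git: retry from ceph.git, again
            # starting from a clean directory.
            checkout(remote, clonedir, 'https://github.com/ceph/ceph.git', branch)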
Fixes: http://tracker.ceph.com/issues/18336
Signed-off-by: Sage Weil <sage@redhat.com>
...before sending a tell command. Otherwise osd.2 might
start without osd.1, the I/O unblocks, and the tell fails
because osd.1 is still down.
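Something like the following, where is_up is a hypothetical predicate
polling the osdmap:

    import time

    def wait_for_osd_up(manager, osd_id, timeout=60):
        # Don't send the tell until the osd actually reports up.
        for _ in range(timeout):
            if manager.is_up(osd_id):  # hypothetical predicate
                return
            time.sleep(1)
        raise RuntimeError("osd.%d never came up" % osd_id)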
Fixes: http://tracker.ceph.com/issues/18303
Signed-off-by: Sage Weil <sage@redhat.com>