* refs/pull/28378/head:
qa/tasks: introduce Thrasher base class
qa/tasks: Fix typo
qa/tasks: manage thrashers
qa/tasks: start DaemonWatchdog when ceph starts
qa/tasks: make watch and bark handle more daemons
qa/tasks: move DaemonWatchdog to new file
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
* Introduced a Thrasher base class.
* Updated thrashers to inherit from Thrasher.
* Replaced the magic variable e with Thrasher.exception as per the discussion.
The exception variable is now set by default, because the thrashers inherit
from the Thrasher class.
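
A minimal sketch of the pattern described above, not the code from this
change: a base class that owns the exception slot so that subclasses no longer
need an ad hoc variable. The names set_thrasher_exception, ExampleThrasher,
do_thrash and run are assumptions made for illustration.

```python
class Thrasher:
    """Base class holding the exception raised while thrashing (sketch)."""

    def __init__(self):
        # Every subclass starts with no recorded exception.
        self._exception = None

    @property
    def exception(self):
        return self._exception

    def set_thrasher_exception(self, e):
        # Hypothetical setter name; records the failure for later inspection.
        self._exception = e


class ExampleThrasher(Thrasher):
    """Hypothetical thrasher used only to demonstrate the base class."""

    def do_thrash(self):
        raise RuntimeError("simulated thrash failure")

    def run(self):
        try:
            self.do_thrash()
        except Exception as e:
            # Store the exception instead of keeping it in a loose variable.
            self.set_thrasher_exception(e)
```

A caller can then check thrasher.exception after the thrasher finishes,
without knowing which concrete thrasher it is holding.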
Fixes: https://github.com/ceph/ceph/pull/28378#discussion_r309337928
Fixes: https://tracker.ceph.com/issues/41133
Signed-off-by: Jos Collin <jcollin@redhat.com>
* Added daemons to thrashers
* Join the mds thrasher, as the other thrashers did
Fixes: http://tracker.ceph.com/issues/10369
Signed-off-by: Jos Collin <jcollin@redhat.com>
* Start DaemonWatchdog when ceph starts
* Drop the DaemonWatchdog starting in mds_thrash.py
* Bring the thrashers in mds_thrash.py into the context
Fixes: http://tracker.ceph.com/issues/10369
Signed-off-by: Jos Collin <jcollin@redhat.com>
The monitor currently allows deactivating only one MDS at a time. In
addition, the MDS to deactivate should be the one with the highest rank id.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Thrashing MDS often results in failures that do not stop the test. A
failure may also cause the test to stall, which needlessly keeps the
machines locked until a timeout is reached. This watchdog unmounts mounts
and kills daemons when a failure is detected.
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
While the thrasher supports the behavior desired by issue 10792 [1], the
bugs uncovered due to deactivating MDS (and sometimes killing
deactivating MDS) are presently a distraction from addressing issues
during normal failures. So now thrashing max_mds is turned off by
default. I have added a TODO to deactivate ranks in order (configurably)
as random deactivation causes a lot of other problems.
This also fixes a bug: random.randrange(0.0, 1.0) always returns 0.
Oops.
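
The randrange bug can be illustrated with a short sketch. randrange() only
produces integers, so it cannot be used for a probability draw;
random.random() returns a float in [0.0, 1.0). The helper name should_thrash
and its parameters are assumptions for illustration, not the code from this
change.

```python
import random


def should_thrash(probability, rng=random):
    """Return True with roughly the given probability (0.0 to 1.0)."""
    # rng.random() draws a float in [0.0, 1.0), so the comparison succeeds
    # about `probability` of the time. An integer draw from the range
    # [0, 1) could only ever be 0.
    return rng.random() < probability
```

Passing a seeded random.Random instance as rng makes the decision
reproducible in a test.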
[1] http://tracker.ceph.com/issues/10792
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Currently multimds is prone to many failures when an active or stopping
MDS is killed while other MDS in the cluster have been deactivated
(are stopping). Have this turned off by default for now.
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
The thrasher can enter an infinite loop waiting for an MDS to take a
certain rank when a replacement may not be possible, for example when
max_mds actives are already running.
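
One common way to avoid this kind of unbounded wait is to poll against a
deadline. The following is a sketch under that assumption, not the fix made
in this change; wait_for and its parameters are hypothetical names.

```python
import time


def wait_for(condition, timeout=5.0, interval=0.1):
    """Poll condition() until it is true or the timeout expires.

    Returns True if the condition was met, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            # Give up instead of looping forever when no replacement
            # can ever take the rank.
            return False
        time.sleep(interval)
```

The caller decides what a timeout means, for example logging it and moving
on to the next thrash iteration.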
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
During the course of thrashing max_mds, the ranks assigned to MDSs may
develop holes. This causes the thrasher to wrongly try to deactivate
ranks that are not assigned.
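
A sketch of choosing a rank only from those actually assigned, so that holes
in the rank sequence are never selected. It picks the highest assigned rank,
which is an assumption for illustration; pick_rank_to_deactivate and
assigned_ranks are hypothetical names, not the thrasher's API.

```python
def pick_rank_to_deactivate(assigned_ranks):
    """Return the highest assigned rank, or None if nothing can be stopped.

    assigned_ranks is any iterable of rank ids currently held by an MDS;
    it may contain gaps such as {0, 2, 3}.
    """
    ranks = sorted(set(assigned_ranks))
    if len(ranks) <= 1:
        # Keep at least one rank active.
        return None
    # Select from the assigned set rather than from range(max_mds), so an
    # unassigned rank can never be chosen.
    return ranks[-1]
```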
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>