qa/tasks/workunit: use the suite repo for cloning workunit
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
as "workunits" reside in ceph/qa/workunits, it's more intuitive to
respect suite-repo option when cloning workunits.
Signed-off-by: Kefu Chai <kchai@redhat.com>
copy.sh does not only test 'rbd copy'; it also tests other commands
such as 'rbd ls' and 'rbd remove'. So rename it to generic.sh.
Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
Rather than blocking the main op queue, just pause for that amount of
time between state machine cycles.
Also, add osd_snap_trim_sleep to a few of the thrasher yamls.
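Conceptually (a Python sketch for illustration; the real code is the
OSD's C++ snap trimmer), the sleep moves out of the op path and into
the trimmer's own loop:

    import time

    def trim_snaps(trim_queue, osd_snap_trim_sleep):
        # One state-machine cycle per queued snap; throttling happens
        # between cycles in this background worker, so the main op
        # queue is never blocked on the sleep.
        while trim_queue:
            snap = trim_queue.pop()
            # ... trim objects for 'snap' here ...
            time.sleep(osd_snap_trim_sleep)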
Signed-off-by: Samuel Just <sjust@redhat.com>
If caps are not found in the given keyring file, we should alert the
user and not allow the import. Because 'ceph auth list' keeps all the
keyrings with caps, importing a 'client.admin' keyring without caps
locks the cluster with error [1], since the admin caps are then missing
from 'ceph auth'.
[1] Error connecting to cluster: PermissionDeniedError
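For illustration (the path is hypothetical), the import that should now
be rejected is:

    ceph auth import -i /tmp/client.admin.keyring   # keyring without caps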
Fixes: http://tracker.ceph.com/issues/18932
Signed-off-by: Vikhyat Umrao <vumrao@redhat.com>
We should not update pools_to_fix_pgp_num if the pool has not been
expanded or pg_num was not increased because pgs were still being
created. Previously we removed the pool from pools_to_fix_pgp_num
whenever set_pool_pgpnum() returned, even if it actually did nothing;
that prevented us from fixing the pgp_num once done with thrashing.
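A minimal sketch of the intended guard (pools_to_fix_pgp_num and
set_pool_pgpnum() follow the thrasher code; the return value and body
are hypothetical simplifications):

    def fix_pgp_num(self, pool):
        # Ask the mon to raise pgp_num to match pg_num; this can be
        # refused while pgs are still being created.
        changed = self.set_pool_pgpnum(pool)
        if changed:
            # Only forget the pool once pgp_num was actually adjusted,
            # so a refused attempt is retried after thrashing is done.
            self.pools_to_fix_pgp_num.discard(pool)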
Signed-off-by: Kefu Chai <kchai@redhat.com>
This script currently has a syntax error, but still exits with
success, which is hiding that failure. Expose it by allowing
the 'sudo' exit code to be the script's exit code.
Signed-off-by: Dan Mick <dan.mick@redhat.com>
This is based on a script that I've been using for a while for basic
smoke testing. The matrix has exploded with the addition of data-pool
and now it's primarily a data-pool test fixture that takes minutes to
run, so turning it into a workunit.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
as "workunits" reside in ceph/qa/workunits, it's more intuitive to
respect suite-repo option when cloning workunits.
Signed-off-by: Kefu Chai <kchai@redhat.com>
osd: have clients resend ops on pg split
Reviewed-by: Greg Farnum <gfarnum@redhat.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Reviewed-by: Samuel Just <sjust@redhat.com>
Currently we have
2017-02-04T16:15:46.090 INFO:tasks.workunit.client.0.mira032.stdout:error in 22088
2017-02-04T16:15:46.092 INFO:tasks.workunit.client.0.mira032.stderr:bash: line 1: 22092 Alarm clock ceph_test_rados_api_aio 2>&1
2017-02-04T16:15:46.096 INFO:tasks.workunit.client.0.mira032.stderr: 22093 Done | tee ceph_test_rados_api_aio.log
2017-02-04T16:15:46.099 INFO:tasks.workunit.client.0.mira032.stderr: 22094 Done | sed "s/^/ api_aio: /"
2017-02-04T16:15:46.102 INFO:tasks.workunit.client.0.mira032.stderr:+
in teuthology.log if a unit test in rados/test.sh fails, but it would
be desirable to have the name of the failed test in the "error in
22088" line.
Signed-off-by: Kefu Chai <kchai@redhat.com>
It should live in teuthology, not in Ceph. And it is currently broken:
there is no need to keep it around.
Fixes: http://tracker.ceph.com/issues/18846
Signed-off-by: Loic Dachary <loic@dachary.org>
These were running so few ops that they weren't
giving any meaningful exercise to a multimds
system beyond what we're already covering in
the fs suite.
Signed-off-by: John Spray <john.spray@redhat.com>
There were some cases where we would leave a mountpoint
that would cause the teuthology teardown to get hung up
when it tried to look inside cephtest/
Signed-off-by: John Spray <john.spray@redhat.com>
Thrashing the MDS often results in failures which do not stop the
test. The failure may also cause the test to stall, which forces the
machines to be needlessly locked until a timeout is reached. This
watchdog will unmount mounts and kill daemons when a failure is
detected.
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
While the thrasher supports the behavior desired by issue 10792 [1], the
bugs uncovered due to deactivating MDS (and sometimes killing
deactivating MDS) are presently a distraction from addressing issues
during normal failures. So now thrashing max_mds is turned off by
default. I have added a TODO to deactivate ranks in order (configurably)
as random deactivation causes a lot of other problems.
This also fixes a bug: random.randrange(0.0, 1.0) always returns 0.
Oops.
[1] http://tracker.ceph.com/issues/10792
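For reference, random.randrange works on integer steps, so
randrange(0.0, 1.0) always returns 0 (newer Python 3 versions reject
float arguments outright). A uniform float in [0.0, 1.0) needs
random.random(). Illustrative check, with a hypothetical probability p:

    import random

    p = 0.5  # illustrative thrash probability

    # Buggy (Python 2): random.randrange(0.0, 1.0) truncates to
    # randrange(0, 1) and always returns 0, so a comparison such as
    # 'randrange(0.0, 1.0) < p' degenerates to a constant answer.
    # x = random.randrange(0.0, 1.0)

    # Fixed: uniform float in [0.0, 1.0).
    if random.random() < p:
        pass  # take the thrash action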
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
The thrasher expects in some scenarios for the cluster to stabilize with
a new MDS taking over when there are no standbys available. This can
cause the thrasher to quit because the cluster never stabilizes.
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Currently multimds is prone to many failures when killing an active or
stopping MDS while there are MDSs in the cluster which have been
deactivated (stopping). Have this turned off by default for now.
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
The thrasher can enter an infinite loop waiting for an MDS to take a
certain rank when a replacement may not be possible, for example when
max_mds active daemons are already running.
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
During the course of thrashing max_mds, the ranks assigned to MDSs may
develop holes. This causes the thrasher to try to wrongly deactivate
ranks that are not assigned.
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
https://github.com/ceph/ceph/pull/13194 introduced a regression:
2017-02-06T16:14:23.162 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph_master/qa/tasks/ceph_manager.py", line 722, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ceph_ceph_master/qa/tasks/ceph_manager.py", line 839, in do_thrash
    self.choose_action()()
  File "/home/teuthworker/src/github.com_ceph_ceph_master/qa/tasks/ceph_manager.py", line 305, in kill_osd
    output = proc.stderr.getvalue()
AttributeError: 'NoneType' object has no attribute 'getvalue'
This is because the original patch failed to pass "stderr=StringIO()" to run().
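A sketch of the fix at the kill_osd() call site ('remote' is the
teuthology remote handle and the command is illustrative; the point is
the stderr argument):

    from StringIO import StringIO  # Python 2, as used at the time

    proc = remote.run(
        args=['sudo', 'ceph', 'osd', 'down', '1'],  # illustrative
        stderr=StringIO(),   # without this, proc.stderr is None
        check_status=False,
    )
    output = proc.stderr.getvalue()  # safe now: stderr is a real buffer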
Fixes: http://tracker.ceph.com/issues/16263
Signed-off-by: Nathan Cutler <ncutler@suse.com>
Signed-off-by: Kefu Chai <kchai@redhat.com>
`set +o` prints the full option settings, and the command line itself
is echoed when "xtrace" is enabled; this increases the verbosity of
get_timeout_delays(). In this change we follow the way of
kill_daemons() to suppress the extra output. See aefcf6d.
Signed-off-by: Kefu Chai <kchai@redhat.com>
If Thrasher.__init__() spawns the do_thrash thread before initializing the
ceph_objectstore_tool property, do_thrash races with the rest
of Thrasher.__init__() and in some cases do_thrash can call kill_osd() before
Thrasher.__init__() progresses much further. This can lead to an exception
("AttributeError: Thrasher instance has no attribute 'ceph_objectstore_tool'")
being thrown in kill_osd().
This commit eliminates the race by making sure the ceph_objectstore_tool
attribute is initialized before the do_thrash thread is spawned.
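A minimal sketch of the reordering (simplified; only the ordering
matters here):

    import gevent

    class Thrasher(object):
        def __init__(self, manager, config):
            self.ceph_manager = manager
            self.config = config
            # Everything kill_osd() may read must exist before the
            # thread starts, otherwise do_thrash can observe a
            # half-built object.
            self.ceph_objectstore_tool = self.config.get(
                'ceph_objectstore_tool', True)
            # Only now is it safe to start thrashing concurrently.
            self.thread = gevent.spawn(self.do_thrash)

        def do_thrash(self):
            pass  # main thrash loop: choose_action(), kill_osd(), etc.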
Fixes: http://tracker.ceph.com/issues/18799
Signed-off-by: Nathan Cutler <ncutler@suse.com>
The umount process can get stuck, in which case
we want to fail the test rather than waiting around for it.
During teardown of the kclient task catch this
timeout explicitly so that we will powercycle the node if
needed.
Signed-off-by: John Spray <john.spray@redhat.com>
No need to mention ceph_dev_branch explicitly; it will be taken from
the ceph branch value given in the teuthology-suite command.
Signed-off-by: Tamil Muthamizhan <tmuthami@redhat.com>
This var is mostly used when running rbd_mirror test scripts on
teuthology. It can be used locally though to speed up re-running the
tests.

Set a test temp directory:

    export RBD_MIRROR_TEMDIR=/tmp/tmp.rbd_mirror

Run the tests the first time with the NOCLEANUP flag (the cluster and
daemons are not stopped on finish):

    RBD_MIRROR_NOCLEANUP=1 ../qa/workunits/rbd/rbd_mirror.sh

Now, to re-run the test without restarting the cluster, run cleanup
with the USE_EXISTING_CLUSTER flag:

    RBD_MIRROR_USE_EXISTING_CLUSTER=1 \
        ../qa/workunits/rbd/rbd_mirror_ha.sh cleanup

and then run the tests:

    RBD_MIRROR_USE_EXISTING_CLUSTER=1 \
        ../qa/workunits/rbd/rbd_mirror_ha.sh
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
by optionally specifying the daemon instance after the cluster name and
a colon, like:

    start_mirror ${cluster}:${instance}
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Currently, if the user performs an image rename and gives a pool name
as an optional parameter (--pool=<pool_name>), the optional pool name
is used for the source pool while the destination pool falls back to
the default pool name.

With this fix, a pool name given as the optional parameter is used as
both the source and the destination pool name.
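For example (pool and image names are illustrative):

    rbd rename --pool mypool src_image dst_image

now renames mypool/src_image to mypool/dst_image instead of looking for
the destination image in the default pool.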
Fixes: http://tracker.ceph.com/issues/18326
Reported-by: МАРК КОРЕНБЕРГ <socketpair@gmail.com>
Signed-off-by: Gaurav Kumar Garg <garg.gaurav52@gmail.com>
Do the write after opening the file, so that we get good
behaviour wrt the change in Mount.open_background that uses
file existence to confirm that the open happened.
Signed-off-by: John Spray <john.spray@redhat.com>
Previously we could readily end up hanging on teardown
when something had gone wrong with umount. Forcing
is a big hammer (umount_wait will power cycle the node
if umount isn't working), so if we had to do that
then raise an exception to indicate that something
was wrong with the test.
Fixes: http://tracker.ceph.com/issues/18663
Signed-off-by: John Spray <john.spray@redhat.com>
Using cephfs_[meta]data collides with the pools that teuthology
already creates if an mds is defined.
This became a (noticeable) problem with 052c3d3f68
Signed-off-by: Sage Weil <sage@redhat.com>
This mimics the OpenStack Tempest tests that OpenStack
Zuul executes as a gate test.
Fixes: http://tracker.ceph.com/issues/18594
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Previously a later remote call could end up executing
before the remote python program in open_background
had actually got as far as opening the file.
Fixes: http://tracker.ceph.com/issues/18661
Signed-off-by: John Spray <john.spray@redhat.com>
Quotas don't work with kclient, and multimds tasks
are run against kclient. We don't need to run this
against fuse here because it's a basic correctness
test that's run against fuse in the fs suite.
Fixes: http://tracker.ceph.com/issues/18600
Signed-off-by: John Spray <john.spray@redhat.com>
...so that we can selectively disable those
which are not appropriate for multimds testing, or
which are not kclient compatible (all multimds workunits
run against both kclient and fuse).
Signed-off-by: John Spray <john.spray@redhat.com>
We have an updated nfs-utils that is no longer
generating spurious selinux warnings on CentOS.
Fixes: http://tracker.ceph.com/issues/16397
Signed-off-by: John Spray <john.spray@redhat.com>
Currently it only allows you to move buckets, which is annoying and much
less useful. To move an OSD you need to use create-or-move, which is
harder to use.
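With the change, moving an OSD directly should work along these lines
(bucket names are illustrative):

    ceph osd crush move osd.0 host=newhost rack=newrack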
Fixes: http://tracker.ceph.com/issues/18587
Signed-off-by: Sage Weil <sage@redhat.com>
When a variable is not being observed we currently mark it
"unchangable". This can be misleading so try something hopefully a
little more informative.
Fixes: http://tracker.ceph.com/issues/18424
Signed-off-by: Brad Hubbard <bhubbard@redhat.com>
The rbd_cli_tests Perl script is not maintained and currently serves no
purpose. The RbdLib.pm module was only used by rbd_functional_tests.pl
(which was dropped by 276ffb4631) and by rbd_cli_tests.pl,
so drop it as well.
Fixes: http://tracker.ceph.com/issues/14825
Signed-off-by: Nathan Cutler <ncutler@suse.com>
Due to http://tracker.ceph.com/issues/18309 the pid file for fuse clients
should always be set to the empty string. (Teuthology's default ceph.conf
sets it to /var/run/ceph/$cluster-$name.pid)
This commit adds a reusable yaml facet for this purpose.
Signed-off-by: Nathan Cutler <ncutler@suse.com>
Convenient when you want to create a fresh cluster
each test run: just pass --create and you'll get
a cluster with the right number of daemons for
the tests you're running.
Signed-off-by: John Spray <john.spray@redhat.com>
Previously this could get hung up if we killed one
PID and the daemon then reappeared with a different
one (perhaps because we caught it during
daemonization?).
Signed-off-by: John Spray <john.spray@redhat.com>
* replace hard-coded pool name with $POOL
* replace hard-coded object name with $OBJ
* introduce a new variable called $POOL_EC
* clean up pool
* simplify test case
Signed-off-by: liuchang0812 <liuchang0812@gmail.com>
This means users don't have to manually translate a rule
they just created to a ruleset ID in order to map a pool
to it.
Signed-off-by: Sage Weil <sage@redhat.com>
In preparation to deglobalizing CephContext, remove the CephContext*
parameter to ceph_clock_now() and ceph::real_clock::now() that carries
a configurable offset.
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
If we checkout ceph-ci.git, and don't find a branch,
we'll try again from ceph.git. But the checkout will
already exist and the clone will fail, so we'll still
fail to find the branch.
The same can happen if a previous workunit task already
checked out the repo.
Fix by removing the repo before checkout (the first and
second times). Note that this may break if there are
multiple workunit tasks running in parallel on the same
role. That is already racy, so if it's happening, we'll
want to switch to using a truly unique clonedir for each
instantiation.
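A sketch of the idea as it plays out in the workunit task ('remote' is
the teuthology remote handle; the helper shape and paths are
illustrative):

    def clone_repo(remote, clonedir, repo_url, branch):
        # Remove any leftover checkout first: a stale clonedir from a
        # prior attempt (or another workunit task) makes 'git clone'
        # fail outright and masks the fallback from ceph-ci.git to
        # ceph.git.
        remote.run(args=['rm', '-rf', '--', clonedir])
        remote.run(args=['git', 'clone', '--branch', branch,
                         repo_url, clonedir])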
Fixes: http://tracker.ceph.com/issues/18336
Signed-off-by: Sage Weil <sage@redhat.com>
qa: fixed script to schedule rados and other suites with --subset option
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
...before sending a tell command. Otherwise osd.2 might
start without osd.1, the I/O unblocks, and the tell fails
because osd.1 is still down.
Fixes: http://tracker.ceph.com/issues/18303
Signed-off-by: Sage Weil <sage@redhat.com>
This is a dev hack to generate a bunch of bogus osdmaps. The maps are
all screwed up anyway (e.g., invalid addrs) and this is minimally useful.
Signed-off-by: Sage Weil <sage@redhat.com>
The test case is not stable due to racing console output. This
results in spurious failures.
Fixes: http://tracker.ceph.com/issues/10773
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Otherwise, it does not work as expected in statements like:

    set -e
    test_status_in_pool_dir ... && ...

(e.g. in wait_for_status_in_pool_dir)
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
This fixes a race in resync tests leading to false negative results.
Fixes: http://tracker.ceph.com/issues/18048
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
When displaying the output of a background process, do it on stderr so
that it is not buffered; otherwise the output of the background process
may be displayed after it has completed.

Prefix the output of a background process with the PID of the process
as known to the parent, instead of the PID of the awk process handling
the output. When wait_background loops, it prints the PID it is waiting
on, and it is confusing when that does not match the PID prefixing the
process output.
Refs: http://tracker.ceph.com/issues/17830
Signed-off-by: Loic Dachary <loic@dachary.org>
* Do all math using bc so we can have fractions (see the sketch below)
* Allow the caller to specify the first step (default 1)
* Add testing of a fractional first step
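A Python sketch of the delay sequence being computed (the helper itself
is shell using bc; the doubling backoff shown here is an assumption
about its behavior):

    from fractions import Fraction

    def get_timeout_delays(timeout, first_step=1):
        # Doubling backoff delays whose running total stays below the
        # timeout; Fraction plays the role bc plays in the script,
        # keeping fractional steps such as 0.1 exact.
        delays = []
        total, step = Fraction(0), Fraction(str(first_step))
        while total < Fraction(str(timeout)):
            delays.append(float(step))
            total += step
            step *= 2
        return delays

    # get_timeout_delays(10)     -> [1.0, 2.0, 4.0, 8.0]
    # get_timeout_delays(1, 0.1) -> [0.1, 0.2, 0.4, 0.8]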
Signed-off-by: David Zafman <dzafman@redhat.com>
The TENTH_TIMEOUT variable was not declared as an int and failed to be
set to the correct number; the test of the function did not catch this.

Implement the computation of the increasingly large sleep delays in a
separate function so that it can be tested more easily. Give up on
sub-second sleeps, because the function will not sleep at all if the
cluster is already clean, and if it is not already clean, it is very
unlikely to become clean in less than a second. The downside of very
short sleep times is that they needlessly stress the machine and may
spam the logs.
Refs: http://tracker.ceph.com/issues/17830
Signed-off-by: Loic Dachary <loic@dachary.org>
For vstart.sh powered tests, save 9 characters in the path name
by replacing testdir/test- with td/t-:

* 60 characters imposed by jenkins
* 9 characters for src/test
* 5 characters for td/t-
* 33 left (instead of 24) for the test to create an asok such as
  out/client.admin.25327.asok

Moving these files outside of the build directory is a bad idea because
tests should only create/use files within the builddir and not write
outside of it. Doing so would make cleanup more complicated in case the
test fails and would create other problems as a consequence (filling up
disk space, conflicting directories between runs, etc.).

For ceph-helpers.sh tests, replace testdir with td, saving 5 characters.
This is not strictly necessary but keeps the directory names consistent:
if the developer wants to get rid of all the test leftovers, it is
enough to remove a single directory: td.
Fixes: http://tracker.ceph.com/issues/16014
Signed-off-by: Loic Dachary <loic@dachary.org>
common osd: Improve scrub analysis, list-inconsistent-obj output and osd-scrub-repair test
Reviewed-by: Samuel Just <sjust@redhat.com>
Reviewed-by: Kefu Chai <kchai@redhat.com>
Tests use objectstore_tool() which stops and starts OSDs,
but may assume consistency of object locations.
Signed-off-by: David Zafman <dzafman@redhat.com>
Reduce size of log on timeout by doing a backoff so that
we don't log 3000 loops at 1/10 second sleeps.
Signed-off-by: David Zafman <dzafman@redhat.com>