Sometimes mount A would get a cap revoke when mount
B did its last IO, resulting in mount A's OSD epoch
getting updated too.
Fix by making sure mount B is the last one to have
done IO before we do the barrier, so that when
it does IO again after the barrier, mount A can't
be holding any caps that B would need.
Fixes: #11913
Signed-off-by: John Spray <john.spray@redhat.com>
To test that metadata written recently is
preserved across a client+server crash when
barriered with a directory fsync.
Signed-off-by: John Spray <john.spray@redhat.com>
Looks like Sandon's and Sage's changes raced and there are now two
sites where we fetch overrides. One should be enough.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The interval between writes was too short because
it was not taking account of the way OSDMap full
flags are set on tick rather than immediately.
Fixes: #11779
Signed-off-by: John Spray <john.spray@redhat.com>
flock only works properly on FUSE versions >= 2.9, which is newer
than what ships in, e.g., Ubuntu Precise. So check the version on our
client mounts and only test flock if it's at least that new.
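
A minimal sketch of such a version gate, for illustration only. It assumes
the version is read from a string shaped like the output of `fusermount -V`;
the helper names parse_fuse_version and supports_flock are hypothetical and
are not the names used in the test itself.

```python
import re

# Minimum FUSE version on which flock is expected to work properly.
MIN_FLOCK_VERSION = (2, 9)


def parse_fuse_version(text):
    """Extract (major, minor) from a string such as 'fusermount version: 2.9.2'.

    Hypothetical helper: the exact output format is an assumption.
    """
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no version found in %r" % text)
    return int(match.group(1)), int(match.group(2))


def supports_flock(version_text):
    """Return True if the reported FUSE version is at least 2.9."""
    return parse_fuse_version(version_text) >= MIN_FLOCK_VERSION
```

Comparing tuples of integers rather than strings keeps 2.10 ordered after
2.9, which a plain string comparison would get wrong.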
Fixes: #9995
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
Run the same procedure as TestClusterFull, but
instead of limiting OSD memstore size, use pool
quota on the data pool.
Signed-off-by: John Spray <john.spray@redhat.com>
Create divergent priors and a split and then move a pg using
ceph-objectstore-tool export/import
Add yaml file to run the reg11184 task
Fixes: #11343
Signed-off-by: David Zafman <dzafman@redhat.com>
Based on tasks/divergent_priors.py, but also do a simple export/remove/import
on the same osd.
Add yaml file to run the divergent_priors2 task
Signed-off-by: David Zafman <dzafman@redhat.com>
Flake8 fixes
Use new set_recovery_delay admin socket command
Fix bad value set for filestore_blackhole
Make sure log trims and only require 100 objects
Use kick_recovery_wq to properly set osd_recovery_delay_start to 0
Write and remove divergent and verify removal was undone
Fix to make compatible with wip-10809-11135-10290
Make sure to set_recovery_delay in a non-racy way (while osd running but down)
Leave divergent "in" so its PGs aren't treated as strays
Add yaml file to run the divergent_priors task
Signed-off-by: David Zafman <dzafman@redhat.com>
This patch also adds some convenience facilities for making
some of the ceph_manager methods into tasks usable from a
yaml file.
Signed-off-by: Samuel Just <sjust@redhat.com>
Now that service IDs are modified during run, we have
to avoid repeatedly evaluating first_mon for where
to run ceph_deploy, as the answer will change.
Fixes: #11495
Signed-off-by: John Spray <john.spray@redhat.com>
The early non-defaults caused failures due to xfstests_url: None not
being overridden by run_xfstests(). Move the defaults to xfstests() and
don't pass xfstests_branch past that point.
Signed-off-by: Josh Durgin <jdurgin@redhat.com>
This test apparently had not been touched since
"fs new" was added. In addition to calling
Filesystem.create:
* modify the get_nodes_using_role
function to modify ctx.cluster.remotes so that the
service IDs match what ceph-deploy will set
* log exceptions during ceph_deploy setup, as otherwise
they can get lost if another exception occurs during
teardown (so that it's all easier to debug).
* default to passing --dev=master during install, so
that we don't error out horribly when run without
an explicit branch set (e.g. when run outside
scheduled suite)
Fixes: #11316
Signed-off-by: John Spray <john.spray@redhat.com>
... s/mon_remote/admin_remote/ and allow caller to pass
in which remote they want to use for that. Enables use
with ceph_deploy task which does not give admin keys
to mons.
Signed-off-by: John Spray <john.spray@redhat.com>
This broke with recent Client changes that
do better caching of readdir results, such
that doing an ls twice is no longer sufficient
to see a fresh result after repair - we need
to remount instead.
Signed-off-by: John Spray <john.spray@redhat.com>
To track recent change in master where instead of
crashing on missing MDSTable object we'll go
into damaged state.
Instead of catching a crash, handle the rank's
transition to the damaged state. Leave the crash
handling code (unused for the moment) in the
Filesystem class in case it's needed elsewhere
soon.
Signed-off-by: John Spray <john.spray@redhat.com>
This tests the new purge file/ops throttling
in the MDS, via the new perf counters for
strays/purging.
Fixes: #10390
Signed-off-by: John Spray <john.spray@redhat.com>
...to avoid having boilerplate in each test module,
and gain the ability to run them all in one go
with a nice test-by-test pass/fail report.
Signed-off-by: John Spray <john.spray@redhat.com>
We were previously taking the baseline from just after the
client did a delete, which was racy: should have taken
it from before, to get a steady state.
Also update the perf dump calls to take advantage of
the new filtering syntax.
Signed-off-by: John Spray <john.spray@redhat.com>
Wherever we are subsequently waiting for daemons
to be healthy, we should be doing a fail during the restart.
Also catch some places that were doing this longhand and use
the handy fail_restart version instead.
Signed-off-by: John Spray <john.spray@redhat.com>
In python, isinstance(foo, str) will fail if
a unicode string is passed in. The correct check
is basestring.
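
A small sketch of the check. basestring exists only on Python 2, so the
fallback to str below is an addition for illustration so that the snippet
also runs on Python 3; the helper name is_string is hypothetical.

```python
# Python 2 has basestring (the parent of str and unicode); Python 3 only has str.
try:
    string_types = basestring  # noqa: F821 (defined on Python 2 only)
except NameError:
    string_types = str


def is_string(value):
    """Return True for str and unicode on Python 2, and for str on Python 3."""
    return isinstance(value, string_types)
```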
Signed-off-by: John Spray <john.spray@redhat.com>
...as long as only one is active, all the ops
that default to talking to a single MDS should
be happy to talk to the active MDS, even if there
happens to be a standby lying around too.
Signed-off-by: John Spray <john.spray@redhat.com>
The samba(8) man page contains these sentences:
To shut down a user's smbd process it is recommended that SIGKILL (-9)
NOT be used, except as a last resort, as this may leave the shared
memory area in an inconsistent state. The safe way to terminate an smbd
is to send it a SIGTERM (-15) signal and wait for it to die on its own.
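
As an illustration of the SIGTERM-and-wait pattern described there, here is
a generic sketch using the standard subprocess module. It is not the code
from this change; stop_gracefully and its timeout parameter are hypothetical.

```python
import signal
import subprocess


def stop_gracefully(proc, timeout=30):
    """Send SIGTERM to a subprocess.Popen and wait for it to exit on its own.

    Returns the exit code. SIGKILL is deliberately never sent; if the
    process does not exit within `timeout` seconds, subprocess.TimeoutExpired
    propagates to the caller.
    """
    proc.send_signal(signal.SIGTERM)
    return proc.wait(timeout=timeout)
```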
Signed-off-by: Yan, Zheng <zyan@redhat.com>
These variables are needed because ceph-qa-suite bootstraps ceph-qa-chef via
http download of solo-from/scratch/run. This adds a variable to override the
default script. It also adds variables to the rbd task to override the versions
of run_xfstests_krbd.sh and run_xfstests.sh downloaded by the default task.
Variables added
===============
tasks:
- chef:
    script_url: # override default location for solo-from-scratch for Chef
    chef_repo: # override default Chef repo used by solo-from-scratch
    chef_branch: # to choose a different git upstream branch for ceph-qa-chef
- rbd.xfstests:
    client.0:
      xfstests_branch: # to choose a different git upstream branch for xfstests
      xfstests_url: # override git base URL for run_xfstests{_krbd}.sh
Signed-off-by: Douglas Fuller <dfuller@redhat.com>
Pass -f by default to btrfs instead of first trying without and *then*
trying with.
Among other things, this avoids a confusing failure where we try mkfs.ext4
device (no -f), fail for some reason, and then try again with -f and get
a usage error (-f does not mean force for mke2fs).
Signed-off-by: Sage Weil <sage@redhat.com>
Make a DEFAULTS dict that is updated by any user params, so that
defaults are documented centrally and so config.get(key, defval) is
no longer necessary everywhere.
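
The pattern can be sketched as follows. The keys and the helper name
merge_config are hypothetical and chosen only for illustration; the real
task defines its own defaults.

```python
import copy

# Hypothetical keys for illustration; every default lives in one place.
DEFAULTS = {
    "branch": "master",
    "timeout": 300,
}


def merge_config(user_config):
    """Return a copy of DEFAULTS overlaid with any user-supplied parameters."""
    config = copy.deepcopy(DEFAULTS)
    config.update(user_config or {})
    return config
```

After the merge, callers can read config["timeout"] directly instead of
repeating config.get("timeout", 300) at every use.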
Signed-off-by: Dan Mick <dan.mick@redhat.com>
Stop trying to build test images inside this test; presume the test
image is available built externally (in a file path or an http URL).
Config vars ice_tool_dir, ice_version, iceball_location, and
ice_git_location go away in favor of 'test_image', the path to the
testable image (which can still be a tar.gz or an .iso).
Signed-off-by: Dan Mick <dan.mick@redhat.com>
Ubuntu's mount/kernel support "mount <file> <mntpnt>" directly;
apparently CentOS 6 (and presumably RHEL 6) requires specifying at
least '-o loop' (a /dev/loopN will be dynamically allocated and removed
on unmount).
Signed-off-by: Dan Mick <dan.mick@redhat.com>
Move the get_user_summary(out, user) logic to util.rgw so that it can be
shared between radosgw_admin_rest.py and radosgw_admin.py and modify
them accordingly.
http://tracker.ceph.com/issues/11180
Fixes: #11180
Signed-off-by: Loic Dachary <loic@dachary.org>
(cherry picked from commit 97e6d808f0)
Move the get_user_summary(out, user) logic to util.rgw so that it can be
shared between radosgw_admin_rest.py and radosgw_admin.py and modify
them accordingly.
http://tracker.ceph.com/issues/11180
Fixes: #11180
Signed-off-by: Loic Dachary <loic@dachary.org>
To avoid internal.coredump task synthesizing a failure
during teardown from the core we left behind.
Fixes: #10949
Signed-off-by: John Spray <john.spray@redhat.com>
This was an overly strict success condition: the
flush operation doesn't promise to leave you an empty
journal, it promises that anything in the journal
before the flush will be flushed.
Fixes: #10712
Signed-off-by: John Spray <john.spray@redhat.com>
Where multiple MDSs were on the same node, trying
to concurrently update their firewall state was
causing an exception because the iptables command
errors out if another instance is already running.
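
One generic way to avoid that kind of collision is to serialize the calls
per node with a lock. The sketch below is an assumption about how this could
be done, not a description of the actual fix; run_serialized and the node
keys are hypothetical.

```python
import threading
from collections import defaultdict

# One lock per node name, created on first use.
_node_locks = defaultdict(threading.Lock)
# Protects the lock registry itself against concurrent first use.
_registry_lock = threading.Lock()


def run_serialized(node, action):
    """Run action() while holding the lock for `node`.

    Callers targeting the same node run one at a time; callers targeting
    different nodes do not block each other.
    """
    with _registry_lock:
        lock = _node_locks[node]
    with lock:
        return action()
```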
Fixes: #10948
Signed-off-by: John Spray <john.spray@redhat.com>
teuthology helpfully escapes things for us so
the \; didn't need the backslash. The logic
was still falling over in some cases too.
Additionally, make the FUSE /sys/ abort operation
more surgical by working out the connection name
of our own mount during mount().
Signed-off-by: John Spray <john.spray@redhat.com>
* add a wrapper to log uncaught exceptions to self.logger. greenlet also
prints the backtrace and exception to stderr, but teuthology.log does
not capture stderr, so we need to catch them ourselves to reveal
more info to root-cause this issue.
* log uncaught exceptions thrown by Thrasher.do_thrash() to self.log.
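
A minimal sketch of such a wrapper, written as a decorator. The name log_exc
and the stand-in Thrasher class below are illustrative assumptions and not
the code added by this change.

```python
import functools
import logging


def log_exc(func):
    """Log any uncaught exception to self.logger, then re-raise it."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        try:
            return func(self, *args, **kwargs)
        except Exception:
            # logger.exception records the message together with the traceback.
            self.logger.exception("exception in %s", func.__name__)
            raise
    return wrapper


class Thrasher(object):
    """Stand-in class for illustration only."""

    def __init__(self):
        self.logger = logging.getLogger("thrasher")

    @log_exc
    def do_thrash(self):
        raise RuntimeError("boom")
```

Re-raising after logging keeps the original failure behaviour while making
sure the traceback also reaches the log file.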
See: #10630
Signed-off-by: Kefu Chai <kchai@redhat.com>
Specifically, I want to know *who* is running the ceph-osd that is
holding the files open.
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit a68281e147)