The cluster is expected to become degraded during reboot.
Fixes: http://tracker.ceph.com/issues/20731
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
I'm seeing sporadic single thread deadlocks on fio stat_mutex during krbd
thrash runs:
(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7f89ee730740 (LWP 15604) 0x00007f89ed9f41bd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0 0x00007f89ed9f41bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007f89ed9f17b2 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#2 0x00000000004429b9 in fio_mutex_down (mutex=0x7f89ee72d000) at mutex.c:170
#3 0x0000000000459704 in thread_main (data=<optimized out>) at backend.c:1639
#4 0x000000000045b013 in fork_main (offset=0, shmid=<optimized out>, sk_out=0x0) at backend.c:1778
#5 run_threads (sk_out=sk_out@entry=0x0) at backend.c:2195
#6 0x000000000045b47f in fio_backend (sk_out=sk_out@entry=0x0) at backend.c:2400
#7 0x000000000040cb0c in main (argc=2, argv=0x7fffad3e3888, envp=<optimized out>) at fio.c:63
(gdb) up 2
170 pthread_cond_wait(&mutex->cond, &mutex->lock);
(gdb) p mutex.lock.__data.__owner
$1 = 15604
That is, the mutex's internal lock is owned by the very thread that is blocked waiting on it.
Upgrading to 2.21 seems to make these go away.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If a device has already been bound to a class,
do not allow its class to be changed silently.
Require the user to call rm-device-class first.
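A minimal sketch of the intended behaviour, assuming the usual set-device-class/rm-device-class syntax (osd ids and class names are illustrative):
./bin/ceph osd crush set-device-class ssd osd.0    # first assignment succeeds
./bin/ceph osd crush set-device-class hdd osd.0    # now rejected instead of silently reclassifying
./bin/ceph osd crush rm-device-class osd.0         # remove the old class first
./bin/ceph osd crush set-device-class hdd osd.0    # then the new class can be set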
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
A class is considered in-use only if it is referenced by
at least one of the existing crush rules.
The patch also makes the output more human-readable. For example:
./bin/ceph osd crush rule create-replicated myrule default host ssd
./bin/ceph osd crush class rm ssd
Error EBUSY: class 'ssd' still referenced by crush_rule 'myrule'
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
This patch solves the problem below:
./bin/ceph osd crush move osd.0 root=foo rack=foo-rack host=foo-host
moved item id 0 name 'osd.0' to location {host=foo-host,rack=foo-rack,root=foo} in crush map
./bin/ceph osd crush rule create-replicated foo-rule foo host ssd
Error EINVAL: root foo has no devices with class ssd
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Review current log messages for consistency, accuracy and necessity as
part of the usability initiative. First in a series.
Signed-off-by: Brad Hubbard <bhubbard@redhat.com>
With the peering deletes change, setting luminous sets the osdmap flag
which triggers a new peering interval. That can lead to health warnings
about PG_AVAILABILITY or PG_DEGRADED. Ignore those!
Fixes: http://tracker.ceph.com/issues/20693
Signed-off-by: Sage Weil <sage@redhat.com>
The old structure of linking at the top-level folder is pretty much outdated;
the test config option needs to be specific to the cluster yaml.
Signed-off-by: Vasu Kulkarni <vasu@redhat.com>
Test a cluster with 2 osds and stop osd.0. If osd.1
reports the pg stats while the pgs are peering, the mon will
record the pg state as 'peering'; if osd.1 is then stopped,
the pg state ends up stuck in 'stale+peering',
which is unexpected.
Let's wait_for_active() after stopping osd.0.
Signed-off-by: huangjun <huangjun@xsky.com>
- stop running via make check
- add teuthology yamls to run them
- disable ceph_objectstore_tool.py for now (too slow for make check, and
we can't use vstart in teuthology via a package install)
- drop cephtool tests since those are already covered by other teuthology
tests
- leave a handful of (fast!) ceph-helpers tests for make check for minimal
integration tests.
Signed-off-by: Sage Weil <sage@redhat.com>
Many of the files in qa/qa_scripts/openstack had incorrect shebang
lines: the bang was missing. This means that those scripts would
execute using the calling user's login shell, which is doubtless not
what the author intended. Now they'll always use bash.
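For example (the exact interpreter paths in the affected scripts are illustrative):
#/bin/bash     <- missing '!', so the line is just a comment and the caller's login shell runs the script
#!/bin/bash    <- correct shebang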
Two scripts do not need shebangs, because they contain only library
functions and don't execute anything. I removed their shebangs.
Signed-off-by: Alan Somers <asomers@gmail.com>
to shorten the pathname of the unix domain socket created for the admin socket,
so it does not exceed the limit of 107 characters on GNU/Linux:
* ceph-helpers.sh: the temp directory is named ${TMPDIR:-/tmp}/ceph-asok.$$
* vstart.sh: the temp directory is named `mktemp -u -d "${TMPDIR:-/tmp}/ceph-asok.XXXXXX"`
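As an illustrative sanity check of the resulting path length (the socket file name below is an assumption, not part of the change):
asok_dir=$(mktemp -u -d "${TMPDIR:-/tmp}/ceph-asok.XXXXXX")
asok="$asok_dir/ceph-client.admin.$$.asok"
test ${#asok} -le 107 || echo "admin socket path is ${#asok} chars, over the 107 limit"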
Fixes: http://tracker.ceph.com/issues/16895
Signed-off-by: Kefu Chai <kchai@redhat.com>
We have a few open tickets regarding the mgr being down during suites
involving messenger failure injection. There are a few suspicions that
this may be related to the monclient, but we'll need more logs to
validate those suspicions and, further, to verify that we're actually
fixing the issue.
Signed-off-by: Joao Eduardo Luis <joao@suse.de>
The CRUSH rule creation is busted (rules and buckets out of order), but
after I fix that it doesn't seem to run right anyway. Remove it.
We get the mon thrasher coverage from rados/monthrash already; I don't
think this is adding meaningful coverage for the amount of effort it takes
to maintain.
Signed-off-by: Sage Weil <sage@redhat.com>
cephtool/test.sh: Only delete a test pool when no longer needed.
Reviewed-by: Willem Jan Withagen <wjw@digiware.nl>
Reviewed-by: xie xingguo <xie.xingguo@zte.com.cn>
the pool_getset pool is deleted before all tests on it are complete
4: /home/jenkins/workspace/ceph-master/qa/workunits/cephtool/test.sh:1990: test_mon_osd_pool_set: ceph osd pool delete pool_getset pool_getset --yes-i-really-really-mean-it
4: pool 'pool_getset' removed
4: /home/jenkins/workspace/ceph-master/qa/workunits/cephtool/test.sh:1992: test_mon_osd_pool_set: ceph osd pool get rbd crush_rule
4: /home/jenkins/workspace/ceph-master/qa/workunits/cephtool/test.sh:1992: test_mon_osd_pool_set: grep 'crush_rule: '
4: crush_rule: replicated_rule
4: /home/jenkins/workspace/ceph-master/qa/workunits/cephtool/test.sh:1994: test_mon_osd_pool_set: ceph -f json osd pool get pool_getset compression_mode
4: Error ENOENT: unrecognized pool 'pool_getset'
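A minimal sketch of the intended ordering, with the surrounding tests elided (illustrative, not the exact test.sh diff):
ceph -f json osd pool get pool_getset compression_mode                        # last check that still needs the pool
ceph osd pool delete pool_getset pool_getset --yes-i-really-really-mean-it    # delete only once nothing else uses it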
Signed-off-by: Willem Jan Withagen <wjw@digiware.nl>
This randomly issues pg force-recovery/force-backfill and
pg cancel-force-recovery/cancel-force-backfill during QA
testing. Disabled for upgrades from hammer, jewel and kraken.
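For reference, the commands the thrasher issues look like this (the pgid is illustrative):
ceph pg force-recovery 1.0
ceph pg cancel-force-recovery 1.0
ceph pg force-backfill 1.0
ceph pg cancel-force-backfill 1.0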
Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>
The output of ceph osd stat has changed.
It used to print:
cluster b370a29d-9287-4ca3-ab57-3d824f65e339
health HEALTH_OK
monmap e1: 1 mons at {ceph1=10.0.0.8:6789/0}, election epoch 2, quorum 0 ceph1
osdmap e63: 2 osds: 2 up, 2 in
pgmap v41338: 952 pgs, 20 pools, 17130 MB data, 2199 objects
115 GB used, 167 GB / 297 GB avail
952 active+clean
but now the osdmap line has gone and thus this no longer works:
qa/workunits/cephtool/test.sh:1944:
old_pgs=$(ceph osd pool get $TEST_POOL_GETSET pg_num | sed -e 's/pg_num: //')
new_pgs=$(($old_pgs+$(ceph osd stat | grep osdmap | awk '{print $3}')*32))
4: qa/workunits/cephtool/test.sh: line 1945: 10+*32: syntax error: operand expected (error token is "*32")
- And parse the output as json, with jq, for better reliability.
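A sketch of the json-based parsing (the exact field name in the json output is an assumption):
old_pgs=$(ceph osd pool get $TEST_POOL_GETSET pg_num | sed -e 's/pg_num: //')
new_pgs=$(($old_pgs + $(ceph osd stat --format json | jq '.num_osds') * 32))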
Signed-off-by: Willem Jan Withagen <wjw@digiware.nl>
- factor out install and ceph into ceph/ceph.yaml
- pg_num thrashing + 20 minute health timeout for thrashosds
- common thrashosds-health.yaml whitelist
- drop iozone workload
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* refs/remotes/upstream/pull/15979/head:
Ignore unmatched rstat errors from MDS during rebuild testing
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
* refs/remotes/upstream/pull/16288/head:
qa/cephfs: don't use int() to convert string of float point number
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
to avoid a possible deadlock. Quoting from the doc of Popen.wait():
> This will deadlock when using stdout=PIPE and/or stderr=PIPE and the
child process generates enough output to a pipe such that it blocks
waiting for the OS pipe buffer to accept more data. Use communicate() to
avoid that.
and print out the stdout and stderr using LOG.warn() if the command
fails.
Signed-off-by: Kefu Chai <kchai@redhat.com>
The former semantics of ceph-disk destroy are now implemented with the
--purge flag. Use that for the ceph-disk suite.
Signed-off-by: Loic Dachary <loic@dachary.org>
Add a set of new tests for the case when public_addr and public_bind_addr
are different for a mon. In order to test this properly I had to employ
port forwarding with socat. This helps simulate what would happen in an
environment like Kubernetes. socat is now a build dependency.
Also, moved jq_success to ceph-helpers.sh and refactored run_mon to enable
creating the mons without creating the rbd pool immediately.
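As a sketch of the forwarding involved (addresses and ports are illustrative): socat listens on the public_addr port and relays connections to the port the mon actually binds to:
socat TCP-LISTEN:6789,fork,reuseaddr TCP:127.0.0.1:6790 &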
Signed-off-by: Bassam Tabbara <bassam.tabbara@quantum.com>
Valgrind runs itself on forked children, and does its cleanup when they
complete, and this is slow... slow enough that it frequently makes the
test time out.
Valgrind lets you ignore child *processes* that you exec, but I can't
find a way to skip forked children in the same address space.
Work around this by skipping this validation when running under valgrind.
Fixes: http://tracker.ceph.com/issues/20602
Signed-off-by: Sage Weil <sage@redhat.com>