This reverts commit 1323bdb839e77fb27cba36ef2725bb7f163b1db4.
The tests needs to scrub while recovery is in progress, so catching
recovery from the logs after the fact isn't the proper setup.
We can use osd_recovery_sleep config.
Signed-off-by: David Zafman <dzafman@redhat.com>
Given and initial (set of) osd(s), if provide up to N OSDs that can be
stopped together without making PGs become unavailable.
This can be used to quickly identify large(r) batches of OSDs that can be
stopped together to (for example) upgrade.
Signed-off-by: Sage Weil <sage@newdream.net>
in beb62c029a, FEATURE_QUINCY was added to
ceph::features::mon::get_persistent(), so update the test accordingly.
Signed-off-by: Kefu Chai <kchai@redhat.com>
The 'recovering' state is transitory. Existing code looks for it by
polling 'pg stat', missing from time to time.
New version searches the tails of the relevant OSDs' logs.
Fixes: https://tracker.ceph.com/issues/48719
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
While the relevant comment says:
'# Execute the command and prepend the output with its pid'
the actual PID logged is the same for all background processes,
which isn't very helpful.
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
Stop waiting for a scrub to happen if the Primary for the target
PG changes.
Fixes: https://tracker.ceph.com/issues/48720
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
in my test bed, it takes 11 seconds to boot the 3 OSDs and to restart
one of them, this fails the test.
so we need to take the time into consideration. in this change, the
delay is added to the total "warn_older_version_delay", so the monitor
does not start sending warning earlier than expected.
Signed-off-by: Kefu Chai <kchai@redhat.com>
in e5b1ae5554, a new option named
"debug_version_for_testing" is introduced to override the version so
we can test version check.
in crimson, we have two families of shared functions.
- one of them is used by alien store. they are compiled with
-DWITH_SEASTAR and -DWITH_ALIEN, to enable the shim code between
seastar and POSIX thread.
- another is used by crimson in general. where no lock is allowed.
currently, we use the "crimson" and "ceph" namespace to differentiate
these two families of functions, so they can colocate in the same
executable without violating the ODR. see src/include/common_fwd.h for
more details.
the functions defined in src/common/version.cc are also shared by
alien store and crimson code. and because we have different
implementations of `CephContext` in crimson and in classic OSD (i.e.
alienstore), we have to have different implementations of this function
as well, if we follow the same approach. but since these functions are
very simple and are non-blocking, there is not much value in
differentiating them, it is better to inject the test settings using
environment variable instead of using ceph option subsystem.
in this change, "ceph_debug_version_for_testing" environment variable is
checked instead, so that crimson and alienstore can share the same
compilation unit of version.cc. and "debug_version_for_testing" option
is removed.
Signed-off-by: Kefu Chai <kchai@redhat.com>
Didn't really test extended sleep in original code:
Cause by: 3bfb5c2621cf9b5e602bc37724b20c18eb852aea
Signed-off-by: David Zafman <dzafman@redhat.com>
Add test case for permitted hours to make sure scrub doesn't start
Remove permitted hours in extended sleep test
Fixes: https://tracker.ceph.com/issues/48077
Signed-off-by: David Zafman <dzafman@redhat.com>
While creating erasure-coded profile make sure
that user is specifying valid crush-failure-domain.
Fixes: https://tracker.ceph.com/issues/47452
Signed-off-by: Prashant Dhange <pdhange@redhat.com>
This overrides what the CephContext believes to be the current quorum of
monitors (retrieved from other instances of the MonClient), introduced
by [1]. Tests need to be able to target a specific monitor for
exercising forwarding and other things.
[1] 731e2db9fb4611f767446a3c8e778a097ce70d35
Fixes: https://tracker.ceph.com/issues/47180
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
The test should mark the OSD out to check if only "in" OSDs are considered by
the osdmap trimming logic.
Fixes: https://tracker.ceph.com/issues/47309
Signed-off-by: Neha Ojha <nojha@redhat.com>
we could pass `text=True` for better readability, but that's introduced
in python3.7, or pass `error="ignore"` but it's too long.
Signed-off-by: Kefu Chai <kchai@redhat.com>
no need to check for their existence, and prepare a replacement.
because we've migrated to python3. and we only support python3.6 and up.
Signed-off-by: Kefu Chai <kchai@redhat.com>
Test that the osd doesn't crash when it gets a bad incremental osdmap.
Related-to: https://tracker.ceph.com/issues/46443
Signed-off-by: Dan van der Ster <daniel.vanderster@cern.ch>
I have absolutely no idea why it's counting features, but
apparently it is and bumping the value to 7 makes it pass.
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
Include test case
Configurable by setting mon_osd_warn_num_repaired (default 10)
Ignore new health warning with random eio injection test
Fixes: https://tracker.ceph.com/issues/41564
Signed-off-by: David Zafman <dzafman@redhat.com>
a0b453ad33 added the wait state, which can
make PGs stay in active+clean+wait for a while instead of going into
active+clean directly. As far as TEST_auto_repair_bluestore_failed is
concerned, we only care about the repair state being cleared.
Fixes: https://tracker.ceph.com/issues/45075
Signed-off-by: Neha Ojha <nojha@redhat.com>
v2 was introduced in nautilus, and we don't support mimic -> pacific
upgrades (only mimic -> octopus). This test can be removed!
Signed-off-by: Sage Weil <sage@redhat.com>
to address the test failures like
```
2020-04-07T15:44:58.693 INFO:tasks.workunit.client.0.smithi049.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-repair.sh:498: TEST_auto_repair_bluestore_failed: ceph pg dump
pgs
2020-04-07T15:44:58.694 INFO:tasks.workunit.client.0.smithi049.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-repair.sh:498: TEST_auto_repair_bluestore_failed: pgid
2020-04-07T15:44:58.694 INFO:tasks.workunit.client.0.smithi049.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-repair.sh: line 498: pgid: command not found
```
Signed-off-by: Kefu Chai <kchai@redhat.com>
It is possible for the pg dump to not be the latest when we check for newprimary
in _common_test(). This is because mgr_stats_period is 5 seconds, and we may not
have fetched the latest stats just yet. This causes the test to look at the same
stats before and after wait_for_clean.
Fixes: https://tracker.ceph.com/issues/43807 (2)
Signed-off-by: Neha Ojha <nojha@redhat.com>
Mon might fail to share the newest map with any of up osds, e.g.,
due to an injected broken pipe. Since we don't have any client
activities during the osd-markdown tests, osds might be unaware of
the map changes made through CLI. Make sure osds have pulled the
newest map down before we can test its reaction correctly.
Fixes: https://tracker.ceph.com/issues/44662
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
* refs/pull/33885/head:
Merge pull request #33848 from mchangir/octopus-tests-remove-suprious-whitespace
Merge PR #33746 into octopus
Merge PR #33830 into octopus
Merge PR #33732 into octopus
Merge PR #33620 into octopus
Merge pull request #33876 from tchaikov/octopus-cephadm-mypy
cephadm: add "assert foo is not None" for mypy check
Merge pull request #33067 from tspmelo/wip-rbd-delete-with-snapshot
cephadm: add grafana adopt
Merge PR #33771 into octopus
Merge PR #33850 into octopus
Merge PR #33853 into octopus
Merge PR #33857 into octopus
Merge PR #32990 into octopus
Merge PR #33713 into octopus
Merge PR #33838 into octopus
qa/tasks/cephadm: no default mon|mgr|crash service specs
qa/suites/rados/cephadm/upgrade: upgrade start point that supports the no-spec option
Merge PR #33832 into octopus
cephadm: bootstrap: wait for mgr to restart after enabling a module
mgr: add 'mgr_status' tell command
Merge pull request #33839 from rhcs-dashboard/44538-fix-rgw-grafana-get-put-latencies
Merge pull request #33743 from votdev/issue_43869_fix_qa_test
cephadm: create initial mon and mgr service specs too
cephadm: no need to pregenerate a crash key for the bootstrap host
mgr/cephadm: do not complain when we don't have enough hosts
mgr/cephadm: remove orphan daemons
mgr/cephadm: report size=0 for fabricated ServiceDescription
mgr/cephadm: safety check to prevent removing all mon|mgr daemons
mgr/cephadm: prevent scaling mon|mgr below count=1
mgr/cephadm: do not remove daemons from remove_service
Merge pull request #33805 from tchaikov/wip-44500
spec: Podman (temporarily) requires apparmor-abstractions on suse
mgr/cephadm: Make sure we don't co-locate the same daemon
monitoring: fix RGW grafana chart 'Average GET/PUT Latencies'
tests: remove spurious whitespace
mgr/cephadm: fix service list filtering
Merge PR #33825 into octopus
Merge PR #33811 into octopus
Revert "Merge pull request #33673 from cbodley/wip-denc-enum"
mgr/cephadm: fix upgrade order
Merge PR #33801 into octopus
Merge PR #33822 into octopus
cephadm: bootstrap: tolerate error return from -h
Merge PR #33809 into octopus
Merge PR #32678 into octopus
cephadm: use `sh` instead of `bash` during enter
ceph.in: only shut down rados on clean exit
common/ceph_timer: Pass reference to waited time on stack
common/ceph_timer: Add test
common/ceph_timer: Use unique_function, allowing noncopyable events
common/ceph_timer: Couple cleanups
common/ceph_timer: Fix namespaces
common/ceph_timer: Add missing includes
common/ceph_timer.h: Don't indent contents of a namespace
mgr/dashboard: Crush rule modal
mgr/dashboard: Preserve rule selection on pool type change
mgr/dashboard: Crush rule is only send during replicated pool creation
mgr/dashboard: Explicit returns in pool form
mgr/dashboard: Removes fork join in pool form
mgr/dashboard: Hide ECP actions during ec pool edit
mgr/dashboard: Pool form erasure/replicated boolean
mgr/dashboard: Change pool info API endpoint
mgr/dashboard: Moves ECP info endpoint to UI-API
mgr/cephadm: add _remove_osds_bg back to main loop
mgr/cephadm/osd: update removal report immediately
qa/tasks/ceph_manager: use StringIO for capturing COT output
qa/standalone/scrub/osd-scrub-repair: force osdmap prop to osds
qa/standalone/scrub/osd-scrub-test: wait longer for update
qa/tasks/ceph_manager: capture stderr for COT
qa/suites/rados/ceph: drop opensuse for now
mon/MonClient: send logs to mon on separate schedule than pings
mgr/dashboard: Fix missing ImageSpec usage
mgr/dashboard: Allow removing RBD with snapshots
mgr/dashboard: Refactor and cleanup tasks.mgr.dashboard.test_user
mgr/dashboard: support multiple DriveGroups when creating OSDs
mon/MonClient: send logs to mon even if we have no keelalive2
cephadm: flag dashboard user to change password
Reviewed-by: Sebastian Wagner <swagner@suse.com>
* refs/pull/33809/head:
qa/standalone/scrub/osd-scrub-repair: force osdmap prop to osds
qa/standalone/scrub/osd-scrub-test: wait longer for update
Reviewed-by: David Zafman <dzafman@redhat.com>
Adds option `mon_allow_pool_size_one` which will be disabled by default
to ensure pools are not configured without replicas.
If the user still wants to use pool size 1, they will have to change the
value of `mon_allow_pool_size_one` to true and then have to pass flag
`--yes-i-really-mean-it` to cli command:
Example:
`ceph osd pool test set size 1 --yes-i-really-mean-it`
Fixes: https://tracker.ceph.com/issues/44025
Signed-off-by: Deepika Upadhyay <dupadhya@redhat.com>
test: Expect being off by up to 2 and make sure all PGs are active+clean
Reviewed-by: Neha Ojha <nojha@redhat.com>
Reviewed-by: Kefu Chai <kchai@redhat.com>
The -N option to vstart.sh was removed, use -k
Old hinfo_key binary happen to be utf-8 decodable, now it
throws an exception trying to decode it. Use new
option to ceph-objectstore-tool to treat stdout as a terminal
and convert binary data to base64.
Signed-off-by: David Zafman <dzafman@redhat.com>
One of our customers wants to verify the data safety of Ceph during scaling
the cluster up, and the test case looks like:
- keep checking the status of a speficied pg, who's up is [1, 2, 3]
- add more osds: up [1, 2, 3] -> up [1, 4, 5], acting = [1, 2, 3], backfill_targets = [4, 5],
pg is remapped
- stop osd.2: up [1, 4, 5], acting = [1, 3], backfill_targets = [4, 5], pg is undersized
- restart osd.2, acting will stay unchanged as 2 belongs to neither current up nor acting set,
hence leaving the corresponding pg pinning undersized for a long time until all backfill
targets completes
It does not pose any critical problem -- we'll end up getting that pg back into active + clean,
except that the long live DEGRADED warnings keep bothering our customer who cares about data
safety more than any thing else.
The right way to achieve the above goal is for:
boost::statechart::result PeeringState::Active::react(const MNotifyRec& notevt)
to check whether the newly booted node could be validly chosen for the acting set and
request a new temp mapping. The new temp mapping would then trigger a real interval change
that will get rid of the DEGRADED warning.
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: Yan Jun <yan.jun8@zte.com.cn>
Make sure PGs peer (simply flushing state to mon isn't enough).
Fixes: https://tracker.ceph.com/issues/43721
Signed-off-by: Sage Weil <sage@redhat.com>
To avoid confusion fix function names in osd-backfill-space.sh for how
they actually work.
Fixes: https://tracker.ceph.com/issues/43592
Signed-off-by: David Zafman <dzafman@redhat.com>
There were a couple of problems found by flake8 in the qa/
directory (most of them fixed now). Enabling flake8 during the usual
check runs hopefully avoids adding new issues in the future.
Signed-off-by: Thomas Bechtold <tbechtold@suse.com>
* refs/pull/32138/head:
ceph-daemon: combine SUDO and ARGS into a single var
Reviewed-by: Sebastian Wagner <swagner@suse.com>
Reviewed-by: Sage Weil <sage@redhat.com>
- reduce the amount of typing/noise for each CEPH_DAEMON invocation
- ensure the `--image` param is passed to each test invocation
- allow passing additional args to ceph-daemon via CEPH_DAEMON_ARGS
Signed-off-by: Michael Fritch <mfritch@suse.com>
The py2 ConfigParser doesn't like whitespace before the config option
name. (The py3 version doesn't care.) Filter it out before parsing.
Signed-off-by: Sage Weil <sage@redhat.com>
I thought I took this out of the PR but somehow it got merged in... must
have repushed and old branch and not realized. :/
Signed-off-by: Sage Weil <sage@redhat.com>
* refs/pull/32039/head:
test: Improve races by using kill_daemons which waits for OSDs terminate
test: run-standalone.sh: Only run execs in the subdirectories of qa/standalone
test: Use activate_osd() when restarting OSDs
test: osd-scrub-snaps.sh: Fix race with osd restart and doing a scrub
Reviewed-by: Neha Ojha <nojha@redhat.com>
Allow passing CEPH_DAEMON via the environment or default to using the
script from the standard location.
Signed-off-by: Michael Fritch <mfritch@suse.com>
We need to pay attention to account for CRUSH_ITEM_NONE entries in the
EC PG acting set.
Fixes: https://tracker.ceph.com/issues/43151
Signed-off-by: Sage Weil <sage@redhat.com>
* refs/pull/31869/head:
ceph-daemon: bootstrap: deploy initial mon via deploy_daemon()
qa/standalone/test_ceph_daemon.sh: more $SUDO
ceph-daemon: configure firewalld for new daemon deploys
ceph-daemon: name mgr the same way mgr/ssh does
Reviewed-by: Michael Fritch <mfritch@suse.com>
* refs/pull/31913/head:
ceph-daemon: Allow env var for setting the used image
Reviewed-by: Michael Fritch <mfritch@suse.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Instead of always adding "--image my-custom-image" when calling
ceph-daemon with a non-standard image, allow to set the environment
variable called CEPH_DAEMON_IMAGE which will adjust the --image
default.
That way, the command line arguments when using ceph-daemon with a
custom image are a bit shorter.
Signed-off-by: Thomas Bechtold <tbechtold@suse.com>
The deepsea.tgz tar contains actual device nodes for the OSD block devices
(not symlinks or files). Must be root to untar.
Signed-off-by: Sage Weil <sage@redhat.com>
mktemp creates these files, so we have to pass --allow-overwrite (or
delete them after we get the unique name but before we write to them--this
is easier).
Broken by c7fe27a72a61d1345a66b8830fd17e7b922abd44
Signed-off-by: Sage Weil <sage@redhat.com>
osd/OSDMap: Show health warning if a pool is configured with size 1
Reviewed-by: Sage Weil <sweil@redhat.com>
Reviewed-by: David Zafman <dzafman@redhat.com>
Reviewed-by: Neha Ojha <nojha@redhat.com>
Introduce a config option called 'mon_warn_on_pool_no_redundancy' that is
used to show a health warning if any pool in the ceph cluster is
configured with a size of 1. The user can mute/unmute the warning using
'ceph health mute/unmute POOL_NO_REDUNDANCY'.
Add standalone test to verify warning on setting pool size=1. Set the
associated warning to 'false' in ceph.conf.template under qa/tasks so
that existing tests do not break.
Fixes: https://tracker.ceph.com/issues/41666
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
Moving ceph-daemon into src/ceph-daemon/ makes it simpler to add extra
code (eg. tox.ini, README, unittests, ...) specific to ceph-daemon.
That way related files are in a single directory.
Signed-off-by: Thomas Bechtold <tbechtold@suse.com>