Modified test cases:
1. osd-recovery-prio.sh:
Set osd_op_queue = wpq for all tests since mclock
doesn't consider recovery priority as part of its
scheduling algorithm.
2. osd-recovery-stats.sh:
a. TEST_recovery_undersized():
- Set osd_mclock_profile to high_recovery_ops profile.
- Increase wait for recovery timeout to 300 secs.
3. osd-rep-recov-eio.sh:
a. TEST_rep_backfill_unfound():
- Set osd_mclock_profile to high_recovery_ops profile.
- Increase wait for backfill_unfound to 360 secs.
4. repeer-on-acting-back.sh:
a. TEST_repeer_on_down_act():
- Set osd_mclock_profile to high_recovery_ops profile.
(to reduce the test duration; the shared pattern behind these changes is sketched below)
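The shared pattern, as a minimal sketch (osd_mclock_profile and
high_recovery_ops are real config names; the timeout loop is illustrative):
```
# favor recovery in mclock's IOPS allocation at runtime
ceph config set osd osd_mclock_profile high_recovery_ops
# then wait up to 300 secs for recovery instead of the default timeout
for ((i = 0; i < 300; i++)); do
    ceph pg stat | grep -q 'active+clean' && break
    sleep 1
done
```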
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
List of changes:
1. Remove the enforcement to use osd_op_queue=wpq when an osd is brought
up in the following functions:
- run_osd()
- run_osd_filestore() and
- activate_osd()
2. New functions:
- get_op_scheduler() - Get the current osd_op_queue for an osd (a sketch follows this list).
3. Modified test cases:
- test_run_osd() - Add a check for the osd_max_backfills count.
The mclock scheduler overrides the count to 1000.
4. New test cases:
- test_activate_osd_after_mark_down()
- test_get_op_scheduler()
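A plausible shape for get_op_scheduler(), reusing the existing get_config()
helper from ceph-helpers.sh (the exact body is an assumption):
```
function get_op_scheduler() {
    local id=$1
    # ask the running osd.$id which op queue it is actually using
    get_config osd $id osd_op_queue
}

# usage in a test:
test "$(get_op_scheduler 0)" = "mclock_scheduler" || return 1
```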
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
qa/standalone: fixing the timings when waiting for deep-scrub to start
Reviewed-by: Neha Ojha <nojha@redhat.com>
Reviewed-by: Sridhar Seshasayee <sseshasa@redhat.com>
initiate_and_fetch_state() initiates a scrub, then polls the published
PG state looking for 'scrubbing'. Calling flush_pg_stats() as part of
the polling process might cause the scrub and the following recovery to
be missed altogether.
Note: this polling mechanism is definitely not robust. Will be
redesigned in the future.
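Until then, a more conservative poll might look like the following sketch
(jq and the helper name are illustrative); the point is to sample the
published state without flushing stats mid-loop:
```
function wait_for_scrubbing() {
    local pgid=$1
    for ((i = 0; i < 1500; i++)); do
        # sample the published state; no flush_pg_stats inside the loop
        ceph pg $pgid query | jq -r '.state' | grep -q scrubbing && return 0
        sleep 0.2
    done
    return 1
}
```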
Fixes: https://tracker.ceph.com/issues/51581
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
* refs/pull/42041/head:
mgr/restful: ignore min/max_size
test/crush: drop min/max_size refs
qa/workunits/mon/pool_ops: remove test for min/max_size check
qa: scrub a few remaining mentions of ruleset
qa/standalone/mon/osd-*: fix tests
PendingReleaseNotes: note min/max_size removal
mgr/dashboard: remove max/min_size and ruleset
mon/OSDMonitor: fix calls to CrushTester
crush: eliminate min_size and max_size
test/cli/crushtool: renumber rulesets in test maps
crushtool: require min/max or num-rep for --test
crush: remove last traces of 'ruleset'
test/cli/crushtool: use 'id' instead of 'ruleset' in crush inputs
crushtool: take --min-rep and --max-rep explicitly
crush/CrushTester: drop --ruleset
doc: scrub 'ruleset' from docs
src/erasure-code: rule, not ruleset
mon/OSDMonitor: remove check_crush_rule() callers
mon/OSDMonitor: rule, not ruleset
crushtool: remove check for overlapped rules
crush/CrushWrapper: get_osd_pool_default_crush_replicated_ruleset -> rule
crush: remove find_rule()
mon/OSDMonitor: use pool's crush rule directly
osd/OSDMap: drop checks for ruleset == ruleid
osd/OSDMap: use pool's crush rule_id directly
mon/PGMap: use pool's crush_rule directly
mon/OSDMonitor: remove crush ruleset->rule rewrite
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Tests identified with missing teardown within osd-scrub-repair.sh:
1. TEST_periodic_scrub_replicated()
2. TEST_scrub_warning()
3. TEST_request_scrub_priority()
Centralize setup and teardown within the run() function for all the tests.
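The centralized pattern follows the usual ceph-helpers.sh convention (sketch):
```
function run() {
    local dir=$1
    shift
    local funcs=${@:-$(set | sed -n -e 's/^\(TEST_[0-9a-z_]*\) .*/\1/p')}
    for func in $funcs ; do
        setup $dir || return 1      # fresh test dir and cluster for each test
        $func $dir || return 1
        teardown $dir || return 1   # kill daemons and remove $dir
    done
}
```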
Fixes: https://tracker.ceph.com/issues/51580
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
The tests in the following files did not tear down the
cluster after a test completed.
1. osd/osd-force-create.sh
2. osd/osd-reuse-id.sh
3. osd/pg-split-merge.sh
This wouldn't cause issues if the tests are run individually. But when
running all the tests in the files mentioned above, it could introduce
unexpected test failures down the line. For example, multiple tests may
create pools with the same name, and if they are not cleaned up properly, this
could result in unexpected failures in a subsequent test.
Fixes: https://tracker.ceph.com/issues/51580
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
This is mostly for testing: a lot of tests assume that there are no
existing pools. These tests relied on a config to turn off creating the
"device_health_metrics" pool which generally exists for any new Ceph
cluster. It would be better to make these tests tolerant of the new .mgr
pool, but clearly there are a lot of them. So just convert the config to
make it work.
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
This change is a follow-up to commit
b6e9c0903d that set the scheduler to wpq in
run_osd() and run_osd_filestore(). In addition, activate_osd() too has to
set the scheduler type to 'wpq' in order to be consistent and avoid test
failures.
The above is a temporary measure until all the standalone tests are
modified to run well with the mclock_scheduler.
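The shape of the change is a one-line addition to activate_osd(), mirroring
run_osd() (sketch with surrounding lines abridged):
```
function activate_osd() {
    # ...existing argument and directory handling...
    local ceph_args="$CEPH_ARGS"
    # keep parity with run_osd()/run_osd_filestore() until the standalone
    # tests are adapted to mclock_scheduler
    ceph_args+=" --osd-op-queue=wpq"
    # ...start ceph-osd with $ceph_args...
}
```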
Fixes: https://tracker.ceph.com/issues/51074
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
A new test: auto_repair_bluestore_tag.
Based on auto_repair_bluestore_basic. Sets auto-repair, starts a periodic
deep-scrub, then verifies that the PG state, while scrubbing, is 'scrubbing+deep'
and not 'scrubbing+deep+repair'.
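The key assertion might look like this sketch (pgid lookup elided; jq assumed):
```
state=$(ceph pg $pgid query | jq -r '.state')
# a periodic deep scrub under auto-repair must still be tagged as a plain
# deep scrub rather than as a repair
echo "$state" | grep -q 'scrubbing+deep' || return 1
if echo "$state" | grep -q 'repair' ; then
    return 1
fi
```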
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
mclock_scheduler is now the default and some of these tests need to be modified
to run well with it. Continue using wpq until
https://tracker.ceph.com/issues/50574 is addressed.
Signed-off-by: Neha Ojha <nojha@redhat.com>
There already is a test to verify the mempool sharding works, in the sense that
it uses at least half of the variables available to count the number of
allocated objects and their total size. This new test verifies that, with
sharding, object counting is at least twice as fast as without sharding. It
also collects cacheline contention data with the perf c2c tool. The manual
analysis of this data shows the optimization gain is indeed related to cacheline
contention.
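For reference, the contention data can be collected with perf(1); the binary
name assumes the usual ceph build layout:
```
# record cacheline-contention events while the mempool unit test runs
perf c2c record -- ./bin/unittest_mempool
# then inspect the HITM (modified-cacheline hit) statistics
perf c2c report --stdio
```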
Fixes: https://tracker.ceph.com/issues/49896
Signed-off-by: Loïc Dachary <loic@dachary.org>
Sync up with master up to commit 3d8e73b266 ("Merge pull request
#40731 from tchaikov/wip-yamlize-options"). Specifically, bring in
src/common/options.cc yamlization and move new auth-related options
into src/common/options/global.yaml.in.
Conflicts:
src/common/options.cc
src/common/options/global.yaml.in
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Change TEST_recovery_scrub_2 to create more objects and use
osd_recovery_sleep to prevent recovery from finishing before
we start to scrub. Verify that at least 1 scrub was started
while the PG was recovering.
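A sketch of the idea (values illustrative; osd_recovery_sleep injects a
delay between recovery operations):
```
# slow recovery down so the scrub request lands mid-recovery
ceph config set osd osd_recovery_sleep 1
for i in $(seq 1 500); do
    rados -p test put obj$i /etc/group
done
# ...restart an osd to trigger recovery, then scrub while it is ongoing:
ceph pg scrub 1.0
```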
Fixes: https://tracker.ceph.com/issues/49779
Signed-off-by: David Zafman <dzafman@redhat.com>
This reverts commit 1323bdb839.
The test needs to scrub while recovery is in progress, so catching
recovery from the logs after the fact isn't the proper setup.
We can use the osd_recovery_sleep config instead.
Signed-off-by: David Zafman <dzafman@redhat.com>
Given an initial (set of) OSD(s), provide up to N OSDs that can be
stopped together without making PGs unavailable.
This can be used to quickly identify large(r) batches of OSDs that can be
stopped together to (for example) upgrade.
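A hypothetical invocation (the --max flag name is an assumption):
```
# starting from osd.0, ask for up to 8 osds that can be stopped together
ceph osd ok-to-stop 0 --max 8
```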
Signed-off-by: Sage Weil <sage@newdream.net>
in beb62c029a, FEATURE_QUINCY was added to
ceph::features::mon::get_persistent(), so update the test accordingly.
Signed-off-by: Kefu Chai <kchai@redhat.com>
The 'recovering' state is transitory. Existing code looks for it by
polling 'pg stat', missing it from time to time.
New version searches the tails of the relevant OSDs' logs.
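A sketch of the log-based detection (the exact state-machine token and the
required debug_osd level are assumptions):
```
# inside the test function: scan each relevant osd's log tail
for osd in $(seq 0 2); do
    tail -n 2000 $dir/osd.$osd.log | \
        grep -q 'state<Started/Primary/Active/Recovering>' && return 0
done
return 1
```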
Fixes: https://tracker.ceph.com/issues/48719
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
While the relevant comment says:
'# Execute the command and prepend the output with its pid'
the actual PID logged is the same for all background processes,
which isn't very helpful.
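One way to get a distinct prefix per job is $BASHPID, which, unlike $$,
expands inside each background subshell (sketch; helper name illustrative):
```
function run_in_background() {
    # $$ is identical for every job; $BASHPID differs per subshell, so each
    # job's output gets its own prefix
    ( "$@" 2>&1 | sed "s/^/${BASHPID}: /" ) &
}
```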
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
Stop waiting for a scrub to happen if the Primary for the target
PG changes.
Fixes: https://tracker.ceph.com/issues/48720
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
in my test bed, it takes 11 seconds to boot the 3 OSDs and to restart
one of them, and this fails the test.
so we need to take that time into consideration. in this change, the
delay is added to the total "warn_older_version_delay", so the monitor
does not start sending warnings earlier than expected.
Signed-off-by: Kefu Chai <kchai@redhat.com>
in e5b1ae5554c4d8a20f9f0ff562b231ad0b0ba0ab, a new option named
"debug_version_for_testing" is introduced to override the version so
we can test version check.
in crimson, we have two families of shared functions.
- one of them is used by alien store. they are compiled with
-DWITH_SEASTAR and -DWITH_ALIEN, to enable the shim code between
seastar and POSIX thread.
- another is used by crimson in general, where no lock is allowed.
currently, we use the "crimson" and "ceph" namespace to differentiate
these two families of functions, so they can colocate in the same
executable without violating the ODR. see src/include/common_fwd.h for
more details.
the functions defined in src/common/version.cc are also shared by
alien store and crimson code. and because we have different
implementations of `CephContext` in crimson and in classic OSD (i.e.
alienstore), we have to have different implementations of this function
as well, if we follow the same approach. but since these functions are
very simple and are non-blocking, there is not much value in
differentiating them; it is better to inject the test settings using an
environment variable instead of the ceph option subsystem.
in this change, "ceph_debug_version_for_testing" environment variable is
checked instead, so that crimson and alienstore can share the same
compilation unit of version.cc. and "debug_version_for_testing" option
is removed.
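Hypothetical usage from a standalone test (the accepted value format is an
assumption):
```
# make a restarted osd report a fake version so the mon's
# version-mismatch warning can be exercised
export ceph_debug_version_for_testing=01.00.00-gversion-test
run_osd $dir 0 || return 1
```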
Signed-off-by: Kefu Chai <kchai@redhat.com>
Add a test case for permitted hours to make sure scrub doesn't start.
Remove permitted hours in the extended sleep test.
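The permitted-hours setup might look like this sketch
(osd_scrub_begin_hour/osd_scrub_end_hour are real options; the window
arithmetic is illustrative):
```
# pin the permitted scrub window to hours that exclude "now"
hour=$(date +%-H)
ceph config set osd osd_scrub_begin_hour $(( (hour + 2) % 24 ))
ceph config set osd osd_scrub_end_hour   $(( (hour + 3) % 24 ))
# a periodic scrub must now stay pending
ceph pg dump pgs | grep scrubbing && echo "ERROR: scrub started anyway"
```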
Fixes: https://tracker.ceph.com/issues/48077
Signed-off-by: David Zafman <dzafman@redhat.com>
While creating an erasure-coded profile, make sure
that the user specifies a valid crush-failure-domain.
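The check can be exercised from the CLI, e.g. (sketch):
```
# a known failure domain is accepted
ceph osd erasure-code-profile set goodprofile k=2 m=1 crush-failure-domain=osd
# an unknown one must now be rejected
ceph osd erasure-code-profile set badprofile k=2 m=1 crush-failure-domain=bogus \
    && echo "ERROR: bogus failure domain accepted"
```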
Fixes: https://tracker.ceph.com/issues/47452
Signed-off-by: Prashant Dhange <pdhange@redhat.com>
This overrides what the CephContext believes to be the current quorum of
monitors (retrieved from other instances of the MonClient), introduced
by [1]. Tests need to be able to target a specific monitor for
exercising forwarding and other things.
[1] 731e2db9fb4611f767446a3c8e778a097ce70d35
Fixes: https://tracker.ceph.com/issues/47180
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
The test should mark the OSD out to check if only "in" OSDs are considered by
the osdmap trimming logic.
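The missing step amounts to the following (sketch; reading the trim boundary
back via ceph report is an assumption):
```
# mark the osd out, not just down: trimming only considers "in" osds
ceph osd out osd.0
# the trim boundary should then advance
ceph report | jq .osdmap_first_committed
```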
Fixes: https://tracker.ceph.com/issues/47309
Signed-off-by: Neha Ojha <nojha@redhat.com>
we could pass `text=True` for better readability, but that was only introduced
in python3.7; or pass `errors="ignore"`, but that's too long.
Signed-off-by: Kefu Chai <kchai@redhat.com>
no need to check for their existence and prepare a replacement,
because we've migrated to python3, and we only support python3.6 and up.
Signed-off-by: Kefu Chai <kchai@redhat.com>
Test that the osd doesn't crash when it gets a bad incremental osdmap.
Related-to: https://tracker.ceph.com/issues/46443
Signed-off-by: Dan van der Ster <daniel.vanderster@cern.ch>
I have absolutely no idea why it's counting features, but
apparently it is and bumping the value to 7 makes it pass.
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
Include a test case.
Configurable by setting mon_osd_warn_num_repaired (default 10).
Ignore the new health warning in the random eio injection test.
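A sketch of how a test can make the warning easy to trigger (the
OSD_TOO_MANY_REPAIRS health code name is an assumption):
```
# lower the threshold so a couple of injected-read-error repairs trip it
ceph config set mon mon_osd_warn_num_repaired 2
# ...inject eio errors and read the objects back to force repairs, then:
ceph health detail | grep OSD_TOO_MANY_REPAIRS
```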
Fixes: https://tracker.ceph.com/issues/41564
Signed-off-by: David Zafman <dzafman@redhat.com>
a0b453ad335671bd92f165115d6ee984d2412448 added the wait state, which can
make PGs stay in active+clean+wait for a while instead of going into
active+clean directly. As far as TEST_auto_repair_bluestore_failed is
concerned, we only care about the repair state being cleared.
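In other words, the wait condition should be the repair tag itself rather
than active+clean (sketch):
```
# wait for the repair tag to clear; the pg may sit in active+clean+wait
while ceph pg $pgid query | jq -r '.state' | grep -q repair ; do
    sleep 1
done
```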
Fixes: https://tracker.ceph.com/issues/45075
Signed-off-by: Neha Ojha <nojha@redhat.com>
v2 was introduced in nautilus, and we don't support mimic -> pacific
upgrades (only mimic -> octopus). This test can be removed!
Signed-off-by: Sage Weil <sage@redhat.com>
to address the test failures like
```
2020-04-07T15:44:58.693 INFO:tasks.workunit.client.0.smithi049.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-repair.sh:498: TEST_auto_repair_bluestore_failed: ceph pg dump
pgs
2020-04-07T15:44:58.694 INFO:tasks.workunit.client.0.smithi049.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-repair.sh:498: TEST_auto_repair_bluestore_failed: pgid
2020-04-07T15:44:58.694 INFO:tasks.workunit.client.0.smithi049.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-repair.sh: line 498: pgid: command not found
```
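One common cause of failures with this shape is whitespace around '=', which
makes bash execute the would-be variable as a command (illustrative):
```
pgid = $(ceph pg dump pgs)   # wrong: executes 'pgid' -> "command not found"
pgid=$(ceph pg dump pgs)     # correct: assigns the command's output
```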
Signed-off-by: Kefu Chai <kchai@redhat.com>
It is possible for the pg dump to not be the latest when we check for newprimary
in _common_test(). This is because mgr_stats_period is 5 seconds, and we may not
have fetched the latest stats just yet. This causes the test to look at the same
stats before and after wait_for_clean.
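The usual remedy in these tests is to flush the mgr's stats right before
sampling (flush_pg_stats is an existing ceph-helpers.sh helper):
```
# force fresh pg stats instead of racing the 5-second mgr_stats_period
flush_pg_stats || return 1
ceph pg dump pgs
```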
Fixes: https://tracker.ceph.com/issues/43807 (2)
Signed-off-by: Neha Ojha <nojha@redhat.com>
Mon might fail to share the newest map with any of up osds, e.g.,
due to an injected broken pipe. Since we don't have any client
activities during the osd-markdown tests, osds might be unaware of
the map changes made through CLI. Make sure osds have pulled the
newest map down before we can test its reaction correctly.
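Such a barrier might look like this sketch (the newest_map field comes from
the osd admin socket's status output):
```
# block until every up osd has caught up with the cluster's osdmap epoch
epoch=$(ceph osd dump --format=json | jq .epoch)
for id in $(ceph osd ls); do
    while (( $(ceph daemon osd.$id status | jq .newest_map) < epoch )); do
        sleep 1
    done
done
```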
Fixes: https://tracker.ceph.com/issues/44662
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
* refs/pull/33885/head:
Merge pull request #33848 from mchangir/octopus-tests-remove-suprious-whitespace
Merge PR #33746 into octopus
Merge PR #33830 into octopus
Merge PR #33732 into octopus
Merge PR #33620 into octopus
Merge pull request #33876 from tchaikov/octopus-cephadm-mypy
cephadm: add "assert foo is not None" for mypy check
Merge pull request #33067 from tspmelo/wip-rbd-delete-with-snapshot
cephadm: add grafana adopt
Merge PR #33771 into octopus
Merge PR #33850 into octopus
Merge PR #33853 into octopus
Merge PR #33857 into octopus
Merge PR #32990 into octopus
Merge PR #33713 into octopus
Merge PR #33838 into octopus
qa/tasks/cephadm: no default mon|mgr|crash service specs
qa/suites/rados/cephadm/upgrade: upgrade start point that supports the no-spec option
Merge PR #33832 into octopus
cephadm: bootstrap: wait for mgr to restart after enabling a module
mgr: add 'mgr_status' tell command
Merge pull request #33839 from rhcs-dashboard/44538-fix-rgw-grafana-get-put-latencies
Merge pull request #33743 from votdev/issue_43869_fix_qa_test
cephadm: create initial mon and mgr service specs too
cephadm: no need to pregenerate a crash key for the bootstrap host
mgr/cephadm: do not complain when we don't have enough hosts
mgr/cephadm: remove orphan daemons
mgr/cephadm: report size=0 for fabricated ServiceDescription
mgr/cephadm: safety check to prevent removing all mon|mgr daemons
mgr/cephadm: prevent scaling mon|mgr below count=1
mgr/cephadm: do not remove daemons from remove_service
Merge pull request #33805 from tchaikov/wip-44500
spec: Podman (temporarily) requires apparmor-abstractions on suse
mgr/cephadm: Make sure we don't co-locate the same daemon
monitoring: fix RGW grafana chart 'Average GET/PUT Latencies'
tests: remove spurious whitespace
mgr/cephadm: fix service list filtering
Merge PR #33825 into octopus
Merge PR #33811 into octopus
Revert "Merge pull request #33673 from cbodley/wip-denc-enum"
mgr/cephadm: fix upgrade order
Merge PR #33801 into octopus
Merge PR #33822 into octopus
cephadm: bootstrap: tolerate error return from -h
Merge PR #33809 into octopus
Merge PR #32678 into octopus
cephadm: use `sh` instead of `bash` during enter
ceph.in: only shut down rados on clean exit
common/ceph_timer: Pass reference to waited time on stack
common/ceph_timer: Add test
common/ceph_timer: Use unique_function, allowing noncopyable events
common/ceph_timer: Couple cleanups
common/ceph_timer: Fix namespaces
common/ceph_timer: Add missing includes
common/ceph_timer.h: Don't indent contents of a namespace
mgr/dashboard: Crush rule modal
mgr/dashboard: Preserve rule selection on pool type change
mgr/dashboard: Crush rule is only send during replicated pool creation
mgr/dashboard: Explicit returns in pool form
mgr/dashboard: Removes fork join in pool form
mgr/dashboard: Hide ECP actions during ec pool edit
mgr/dashboard: Pool form erasure/replicated boolean
mgr/dashboard: Change pool info API endpoint
mgr/dashboard: Moves ECP info endpoint to UI-API
mgr/cephadm: add _remove_osds_bg back to main loop
mgr/cephadm/osd: update removal report immediately
qa/tasks/ceph_manager: use StringIO for capturing COT output
qa/standalone/scrub/osd-scrub-repair: force osdmap prop to osds
qa/standalone/scrub/osd-scrub-test: wait longer for update
qa/tasks/ceph_manager: capture stderr for COT
qa/suites/rados/ceph: drop opensuse for now
mon/MonClient: send logs to mon on separate schedule than pings
mgr/dashboard: Fix missing ImageSpec usage
mgr/dashboard: Allow removing RBD with snapshots
mgr/dashboard: Refactor and cleanup tasks.mgr.dashboard.test_user
mgr/dashboard: support multiple DriveGroups when creating OSDs
mon/MonClient: send logs to mon even if we have no keepalive2
cephadm: flag dashboard user to change password
Reviewed-by: Sebastian Wagner <swagner@suse.com>