Commit Graph

53 Commits

Author SHA1 Message Date
Kefu Chai
4af71e7c00
Merge pull request #23103 from ifed01/wip-ifed-bluefs-migrate
os/bluestore: allow ceph-bluestore-tool to coalesce, add and migrate BlueFS backing volumes

Reviewed-by:  Sage Weil <sage@redhat.com>
2018-10-22 22:33:08 +08:00
Igor Fedotov
02b5768a4f tests: add qa test case for bluefs volume coalescence
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
2018-10-17 22:39:27 +03:00
huanwen ren
f1219d716d qa/osd: fixup osd-rep-recov-eio.sh fails to parse pg dump
Fixes: http://tracker.ceph.com/issues/36418
Signed-off-by: huanwen ren <ren.huanwen@zte.com.cn>
2018-10-16 02:18:22 +08:00
Sage Weil
9bf7c810a7 Merge PR #23985 into master
* refs/pull/23985/head:
	ceph-objectstore-tool: add back pool dne check
	qa/suites/rados/singleton/reg11184: remove old test
	ceph-objectstore-tool: import pg at original epoch
	osd: handle null pg slot on startup
	ceph-objectstore-tool: drop support for ancient export files
	osd: avoid dropping osd_lock when pg osdmaps are not laggy
	qa/standalone/osd/pg-merge.sh: add merge vs pg import test

Reviewed-by: xie xingguo <xie.xingguo@zte.com.cn>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
2018-09-21 08:21:53 -05:00
Kefu Chai
4b0e2c8ed4 qa: fix typos
Signed-off-by: Kefu Chai <kchai@redhat.com>
2018-09-21 12:41:42 +08:00
Sage Weil
26cb966cab ceph-objectstore-tool: import pg at original epoch
- In the jewel era, we fast-forwarded the PG to the OSD's latest epoch
and cleared past_intervals.

- In mimic, as of 2347ecb961, we brought the
PG up to date while updating past_intervals.  (At the same time we removed
the OSD's parallel past_intervals regeneration.)

The problem is that the tool then has to reimplement the past_intervals
update logic, and *also* has to cope with splits and merges.  Splits are
somewhat easier (until now we enable partial import of a PG into a split
child), but merges are not so easy.

This patch changes it so we import the PG and leave the pg_epoch matching
the import file.  The OSD is then responsible for bringing it up to date
with the latest map, and dealing with any intervening splits or merges.

We also adjust the safety check to ensure that we don't collide with
any existing PG, either a child we eventually split into, or a parent
we eventually merge into.

Fixes: http://tracker.ceph.com/issues/35955
Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-20 12:58:00 -05:00
Sage Weil
da887c82ce qa/standalone/osd/pg-merge.sh: add merge vs pg import test
- You can't import the source half a PG that's since merged.  Sorry!  We
could implement this later.
- You can import the target half, but the result will then be incomplete,
and you rely on backfill to clean it up.
- Map gaps don't affect this behavior.

Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-17 12:52:46 -05:00
David Zafman
ef6940fbb6 test: osd-backfill-stats.sh: Fix subtests to get primary which can change
Fixes: http://tracker.ceph.com/issues/35982

Signed-off-by: David Zafman <dzafman@redhat.com>
2018-09-13 13:19:23 -07:00
Kefu Chai
510d9e1345
Merge pull request #23723 from xiexingguo/wip-list-missing
osd/PrimaryLogPG: rename list_missing -> list_unfound command

Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2018-09-11 20:25:21 +08:00
Sage Weil
f47921f293 qa/standalone/osd/osd-backfill-stats: fixes
Grep from the primary's log, not every osd's log.

For the backfill_remapped task in particular, after the pg_temp change it
just so happens that the primary changes across the pool size change and
thus two different primaries do (some) backfill.  Fix that test to pass
the correct primary.

Other tests are unaffected as they do not (happen to) trigger a primary
change and already satisfied the (removed) check that only one OSD does
backfill.

Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 17:11:18 -05:00
xie xingguo
85ba2f0a82 osd/PrimaryLogPG: s/list_missing/list_unfound/
Also:
- Do not print **offset** until specified
- Count missing objects correctly (used to be primary's local missing)

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
2018-09-06 09:52:20 +08:00
Xie Xingguo
0857124d23
Merge pull request #23663 from xiexingguo/wip-incompat-async-fixes
osd: some recovery improvements and cleanups


Reviewed-by: Sage Weil <sage@redhat.com>
2018-09-01 14:27:27 +08:00
xie xingguo
22786cffa8 osd/PG: force auth_log_shard to be primary when appropriate
So if there are a lot fo missing objects on primary, we can
make use of auth_log_shard to restore client I/O quickly.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
2018-08-31 16:29:25 +08:00
Sage Weil
85083f39b5 Merge PR #23572 into master
* refs/pull/23572/head:
	qa/standalone/osd/osd-force-create-pg: add force-create-pg test
	mon/MonCommands: fix 'osd force-create-pg'

Reviewed-by: Kefu Chai <kchai@redhat.com>
2018-08-30 08:52:44 -05:00
Josh Durgin
cc41b51c6a
Merge pull request #23518 from dzafman/wip-25084
osd: When possible check CRC in build_push_op() so repair can eventually stop

Reviewed-by: Josh Durgin <jdurgin@redhat.com>
2018-08-23 11:39:05 -07:00
David Zafman
bc33170310 test: Use pids instead of jobspecs which were wrong
Fixes: http://tracker.ceph.com/issues/27056

Signed-off-by: David Zafman <dzafman@redhat.com>
2018-08-22 10:57:04 -07:00
David Zafman
c1b2bd7f16 test: Fix test to detect a test setup failure
Signed-off-by: David Zafman <dzafman@redhat.com>
2018-08-15 15:45:44 -07:00
David Zafman
72c34949fc test: Add test for filestore bad CRC in primary pull request
Signed-off-by: David Zafman <dzafman@redhat.com>
2018-08-15 15:45:44 -07:00
Sage Weil
ed9ec42c42 qa/standalone/osd/osd-force-create-pg: add force-create-pg test
Signed-off-by: Sage Weil <sage@redhat.com>
2018-08-15 06:47:47 -05:00
Sage Weil
4108ebc0ab qa/standalone/osd/ec-error-rollforward: reproduce bug 24597
This reproduces http://tracker.ceph.com/issues/24597

Signed-off-by: Sage Weil <sage@redhat.com>
2018-07-11 16:15:49 -05:00
Sage Weil
4f9fdd98e2 qa/standalone/osd/repro_long_log.sh: fix test
The log trimming case wasn't quite right.  Before HEAD^ we were
rolling forward too aggressively and miscalculating the can_rollforward_to,
which affected the trim_to calculation.

Signed-off-by: Sage Weil <sage@redhat.com>
2018-07-11 16:15:49 -05:00
David Zafman
33538aca35 test: Fix standalone main usage
Signed-off-by: David Zafman <dzafman@redhat.com>
2018-06-18 14:09:14 -07:00
David Zafman
39fc43556f test: Put files in private test directory
Signed-off-by: David Zafman <dzafman@redhat.com>
2018-06-18 14:08:23 -07:00
Erwan Velu
e6e10246c6 tests: Protecting rados bench against endless loop
If the cluster dies during the rados bench, the maximum running time is
no more considered and all emitted aios are pending.

rados bench never quits and the global testing timeout (3600 sec : 1
hour) have to be reach to get a failure.

This situation is dramatic for a background test or a CI run as it locks
the whole job for too long for an event that will never occurs.

This ideal solution would be having 'rados bench' considering a failure
once the timeout is reached when aios are pending.

A possible workaround here is to put use the system command 'timeout'
before calling rados bench and fail if rados didn't completed on time.

To avoid side effects, this patch is doubling rados timeout. If rados
didn't completed after twice the expected time, it have to fail to avoid
locking the whole testing job.

Please find below the way it worked on a real test case.
We can see no IO after t>2 but despite timeout=4 the bench continue.
Thanks to this patch, the bench is stopped at t=8 and return 1.

5: /home/erwan/ceph/src/test/smoke.sh:55: TEST_multimon:  timeout 8 rados -p foo bench 4 write -b 4096 --no-cleanup
5: hints = 1
5: Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for up to 4 seconds or 0 objects
5: Object prefix: benchmark_data_mr-meeseeks_184960
5:   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
5:     0       0         0         0         0         0           -           0
5:     1      16      1144      1128   4.40538   4.40625  0.00412965   0.0141116
5:     2      16      2147      2131   4.16134   3.91797  0.00985654   0.0109079
5:     3      16      2147      2131   2.77424         0           -   0.0109079
5:     4      16      2147      2131    2.0807         0           -   0.0109079
5:     5      16      2147      2131   1.66456         0           -   0.0109079
5:     6      16      2147      2131   1.38714         0           -   0.0109079
5:     7      16      2147      2131   1.18897         0           -   0.0109079
5: /home/erwan/ceph/src/test/smoke.sh:55: TEST_multimon:  return 1
5: /home/erwan/ceph/src/test/smoke.sh:18: run:  return 1

Signed-off-by: Erwan Velu <erwan@redhat.com>
2018-06-14 11:06:52 +02:00
Neha Ojha
7f6f4f90fe qa: modify TEST_recovery_sizeup() to handle async recovery
Signed-off-by: Neha Ojha <nojha@redhat.com>
2018-03-15 11:13:34 -07:00
David Zafman
8a7e6c2349
Merge pull request #20220 from dzafman/wip-calc-stats3
osd: Improve recovery stat handling by using peer_missing and missing_loc info

Reviewed-by: Sage Weil <sage@redhat.com>
2018-03-14 11:07:44 -07:00
David Zafman
af85f3cc48 test: osd-backfill-stats.sh parallel osd-recovery-stats.sh check() changes
Signed-off-by: David Zafman <dzafman@redhat.com>
2018-03-14 10:07:11 -07:00
David Zafman
acc1f80684 test: Use "(est)" in log message when an osd doesn't have peer_missing
Consolidate check() code and common script code
TEST_recovery_multi() wasn't reliable due to delayed peer_missing

Signed-off-by: David Zafman <dzafman@redhat.com>
2018-03-14 10:07:11 -07:00
David Zafman
12e331b742 test: osd-recovery-stats.sh: New test with different missing objs on multiple OSDs
Signed-off-by: David Zafman <dzafman@redhat.com>
2018-03-14 10:07:11 -07:00
David Zafman
09b5697ba2 test: Correction for better degraded/misplaced handling
Signed-off-by: David Zafman <dzafman@redhat.com>
2018-03-14 10:07:11 -07:00
David Zafman
d7fd9174b9 osd: Fix for handling more than 1 missing target
Fix test case to test more than 1 target

Signed-off-by: David Zafman <dzafman@redhat.com>
2018-03-14 10:07:03 -07:00
Josh Durgin
1c15458a00 PrimaryLogPG: only trim up to osd_pg_log_trim_max entries at once
This prevents the fix for http://tracker.ceph.com/issues/22050 or
potential future bugs from causing too much latency by trimming too
many log entries at once.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
2018-03-09 19:14:28 -05:00
Josh Durgin
b50186bfe6 PG, PrimaryLogPG: trim log and rollback info for error log entries
Regular updates piggyback some osd state for this purpose with
MOSDRepOp[Reply]. Do the same thing for pure log entry updates (write
errors and lost/revert additions) via MOSDPGUpdateLogMissing[Reply].

Fixes: http://tracker.ceph.com/issues/22050
Signed-off-by: Josh Durgin <jdurgin@redhat.com>
2018-03-09 17:54:08 -05:00
Josh Durgin
2067f7c679
Merge pull request #20786 from dzafman/wip-zafman-log-trim
tools/ceph-objectstore-tool: command to trim the pg log

Reviewed-by: Josh Durgin <jdurgin@redhat.com>
2018-03-08 16:42:31 -08:00
Josh Durgin
b01e4ea5e2 tools: Add pg log trim command to ceph-objectstore-tool
Add test script that verifies the command in qa/standalone/osd

Fixes: http://tracker.ceph.com/issues/23242

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
Signed-off-by: David Zafman <dzafman@redhat.com>
2018-03-08 15:58:55 -08:00
Sage Weil
c9e974800f qa: --no-mon-config for ceph-objectstore-tool --op mkfs ..
Signed-off-by: Sage Weil <sage@redhat.com>
2018-03-06 14:44:50 -06:00
Kefu Chai
ac56a202fd qa/standalone: extract delete_pool()
some tests, like osd-backfill-stats.sh are using delete_pool(), but
they don't have this function defined. and this function is defined
in standalone tests separately, so would be simpler if we can
consolidate them in ceph-helper.sh.

Signed-off-by: Kefu Chai <kchai@redhat.com>
2018-02-28 15:40:28 +08:00
David Zafman
7ccb7b7023
Merge pull request #19850 from dzafman/wip-calc-stats
osd/PG: re-write of _update_calc_stats and improve pg degraded state

Fixes: http://tracker.ceph.com/issues/20059

Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
2018-01-16 11:58:49 -08:00
Kefu Chai
7aba57b9b4
Merge pull request #18191 from hjwsm1989/osd-mark-down
qa/standalone/osd/osd-mark-down: create pool to get updated osdmap faster

Reviewed-by: Kefu Chai <kchai@redhat.com>
2018-01-15 11:09:02 +08:00
David Zafman
88ce0c1a91 test: Verify stat calculations during backfill
Signed-off-by: David Zafman <dzafman@redhat.com>
2018-01-14 18:17:23 -08:00
David Zafman
f5af1af6d3 test: Verify stat calculations during recovery
Signed-off-by: David Zafman <dzafman@redhat.com>
2018-01-14 18:17:23 -08:00
Kefu Chai
e7097593a7 qa/standalone: remove osd-map-max-advance related tests
this setting was removed in 8967b73

Fixes: http://tracker.ceph.com/issues/22596
Signed-off-by: Kefu Chai <kchai@redhat.com>
2018-01-06 19:40:15 +08:00
Kefu Chai
2ceff9eb4e qa/stanalone: pass options using --<option-name>=<value>
not "--<option-name> <value>', otherwise `ceph-authtool` would error
out:

$ CEPH_ARGS='--osd-map-max-advance 1000' bin/ceph-authtool --gen-print-key
bin/ceph-authtool: unexpected '1000'
usage: ceph-authtool keyringfile [OPTIONS]...
....

but using the syntax of `--<option-name>=<value>', it works:

$ CEPH_ARGS='--osd-map-max-advance=1000' bin/ceph-authtool --gen-print-key
AQBAhTNamf5+ABAASkAp/6IGq7LkUTEOMp/fgw==

Signed-off-by: Kefu Chai <kchai@redhat.com>
2017-12-15 16:19:15 +08:00
David Zafman
c2572bee3c test: Add replicated recovery/backfill test
Signed-off-by: David Zafman <dzafman@redhat.com>
2017-10-18 11:12:14 -07:00
huangjun
ee618a38a9 qa/standalone/osd/osd-mark-down: create pool to get updated osdmap faster
Mon send osdmap to random osds after we mark osd down, the down osd
may use more than $sleep time to get updated osdmap if there is no
osd ping between osds. So create pool after setup cluster.

Signed-off-by: huangjun <huangjun@xsky.com>
2017-10-09 22:19:29 +08:00
Kefu Chai
30b5b4627c Merge pull request #16494 from asomers/bin_bash
misc: Fix bash path in shebangs

Reviewed-by: Willem Jan Withagen <wjw@digiware.nl>
Reviewed-by: Kefu Chai <kchai@redhat.com>
2017-08-27 10:14:14 +08:00
David Zafman
4db5124e1a qa: For FreeBSD skip osd-dup.sh because there is no bluestore
Signed-off-by: David Zafman <dzafman@redhat.com>
2017-08-10 08:30:47 -07:00
David Zafman
99ad4bbd91 qa: Add create_pool() which sleeps 1 second like python variant
wait_for_clean() can miss the new pool if it races with pool create.

Fixes: http://tracker.ceph.com/issues/20465

Signed-off-by: David Zafman <dzafman@redhat.com>
2017-08-04 06:38:09 -07:00
xie xingguo
734b5f2c60 test/osd-fast-mark-down: enable 'osd-class-update-on-start' by default
116cf759c8
will now hide all shadow trees(roots), so this is not applicable anymore
(actually it is misleading).

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
2017-08-03 17:26:26 -04:00
Alan Somers
3aae5ca6fd scripts: fix bash path in shebangs
/bin/bash is a Linuxism.  Other operating systems install bash to
different paths.  Use /usr/bin/env in shebangs to find bash.

Signed-off-by: Alan Somers <asomers@gmail.com>
2017-07-27 13:24:26 -06:00