RepoMirrors/ceph - ceph

Commit Graph

Author	SHA1	Message	Date
David Zafman	fbc8bcfe05	test: test_get_timeout_delays() fix Caused by: `7b0d1c8b8a` Signed-off-by: David Zafman <dzafman@redhat.com>	2018-07-03 14:01:36 -07:00
Josh Durgin	9106dc56c2	Merge pull request #22761 from fullerdj/wip-djf-24686 osd/filestore: Change default filestore_merge_threshold to -10 Reviewed-by: Josh Durgin <jdurgin@redhat.com>	2018-07-02 17:36:00 -07:00
Douglas Fuller	75f55f2dfc	osd/filestore: Change default filestore_merge_threshold to -1 Performance evaluations of medium to large size Ceph clusters have demonstrated negligible performance impact from unnecessarily deep directory hierarchies but significant performance impact from filestore split and merge activity. Disable merges by default. Fixes: http://tracker.ceph.com/issues/24686 Signed-off-by: Douglas Fuller <dfuller@redhat.com>	2018-06-29 11:45:12 -04:00
David Zafman	663d96e934	Merge pull request #22727 from dzafman/wip-21664 qa/standalone/scrub: When possible show side-by-side diff in addition to regular diff Reviewed-by: Kefu Chai <kchai@redhat.com>	2018-06-28 19:59:21 -04:00
David Zafman	3ff56a82a4	Merge pull request #22763 from dzafman/wip-remove-sudo qa: Don't use sudo when moving logs Reviewed-by: Neha Ojha <nojha@redhat.com>	2018-06-28 18:37:24 -04:00
David Zafman	23ed63e15f	Merge pull request #22441 from ErwanAliasr1/evelu-makecheck Improving make check reliability Reviewed-by: Kefu Chai <kchai@redhat.com> Reviewed-by: David Zafman <dzafman@redhat.com>	2018-06-28 14:55:12 -04:00
David Zafman	808c628304	qa: Don't use sudo when moving logs Caused by: `f0964beac5` Signed-off-by: David Zafman <dzafman@redhat.com>	2018-06-28 09:17:06 -07:00
David Zafman	ebb05b2542	test: When possible show side-by-side diff in addition to regular diff Fixes: https://tracker.ceph.com/issues/21664 Signed-off-by: David Zafman <dzafman@redhat.com>	2018-06-26 18:23:07 -07:00
David Zafman	f0964beac5	qa: For teuthology copy logs to teuthology expected location Signed-off-by: David Zafman <dzafman@redhat.com>	2018-06-25 18:06:01 -07:00
Erwan Velu	57df91380b	qa/standalone/ceph-helpers.sh: Setup ulimit in setup() If ulimit is set to a 1024 value, ceph-osd will segfault with the following error : filestore(td/smoke/0) error (24) Too many open files not handled on operation 0x55565d1fd004 (2182.1.0, or op 0, counting from 0) This patch is about to insure that before setting up ceph daemons in tests, a valid ulimit value is setup. Signed-off-by: Erwan Velu <erwan@redhat.com>	2018-06-25 22:09:14 +02:00
Erwan Velu	7b0d1c8b8a	qa/standalone/ceph-helpers.sh: Thinner resolution in get_timeout_delays() get_timeout_delays() is a generic function to compute delays for a long period of time without saturating the CPU is busy loops. It works pretty fine when the delay is short like having the following series when requesting a 20seconds timeout : "0.1 0.2 0.4 0.8 1.6 3.2 6.4 7.3 ". Here the maximum between two loops is 7.3 which is perfectly fine. When the timeout reaches 300sec, the same code produces the following series : "0.1 0.2 0.4 0.8 1.6 3.2 6.4 12.8 25.6 51.2 102.4 95.3 " In such example there is delays which are nearly 2 minutes ! That is not efficient as the expected event, between two loops, could arrive just after this long sleep occurs making a minute+ sleep for nothing. On a local system that could be ok while on a CI, if all jobs run like CI the overall is pretty unefficient by generating useless CPU waits. This patch is about adding a maximum acceptable delay time between two loops while keeping the same rampup behavior. On the same 300 seconds delay example, with MAX_TIMEOUT set to 10, we now have the following series: "0.1 0.2 0.4 0.8 1.6 3.2 6.4 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 7.3" We can see that the long 12/25/51/102/95 values vanished and being replaced by a series of 10 seconds. It's up to every test defining the probability of having a soonish event to complete. The MAX_TIMEOUT is set to 15seconds. Signed-off-by: Erwan Velu <erwan@redhat.com>	2018-06-25 22:09:14 +02:00
Sage Weil	3cd7d5eb22	Merge PR #22343 into master * refs/pull/22343/head: qa/standalone remove ceph-disk from activate_osd helper cmake: remove subman.sh tests test remove ceph-disk directory debian: remove ceph_detect_init python files from base qa/standalone remove virtualenv paths for ceph-disk and ceph-detect-init debian: remove ceph-disk ceph-detect-init python files rpm: remove ceph-disk ceph-detect-init python files alpine: remove ceph-disk ceph-detect-init python files alpine: remove ceph-osd and parttypeuuid udev rules debian: remove ceph-osd and parttypeuuid udev rules rpm: remove ceph-osd and parttypeuuid udev rules ceph-helpers.sh: remove ceph-disk, set up osds directly CMakeLists.txt: add back CEPH_BUILD_VIRTUALENV alpine: remove ceph-disk, add ceph-volume in APKBUILD.in upstart: remove ceph-disk activation call doc/install add anchor for manual osd deployment in freebsd guide doc/dev remove ceph-disk from freebsd guide, link to manual reference doc/dev/config-key remove ceph-disk references doc/dev remove ceph-disk.rst doc/dev: change ceph-disk suite examples for ceph-deploy doc/man_index: remove ceph-disk, ceph-detect-init refs doc/install: remove ceph-disk from freebsd examples doc/rados remove ceph-disk from man references doc/man remove ceph-disk ref from ceph-volume-systemd doc/man: update reference from ceph-disk to ceph-volume doc/man: remove ceph-disk, ceph-detect-init from cmake doc/man/ceph-volume remove doc reference to ceph-disk doc/man: remove ceph-disk, ceph-detect-init qa/suites: remove ceph-disk qa/run-standalone.sh: remove requirement for ceph-detect-init virtualenv qa/workunits: remove ceph-detect-init from rbdmapfile test qa/workunits: remove ceph-detect-init from ceph-helpers-root.sh qa/workunits: remove ceph-disk build: remove ceph-disk from freebsd script cmake: remove ceph-disk, ceph-detect-init tox tests init-ceph: remove ceph-disk cmake: remove top-level entries for ceph-disk, ceph-detect-init debian: remove ceph-detect-init references debian: remove ceph-disk references src: remove ceph-detect-init tool rpm: remove ceph-disk, ceph-detect-init from spec file test: remove subman script script: remove subman script udev: remove parttypeuuid rules for ceph-disk tool remove ceph-disk from ps-ceph.pl upstart: remove ceph-disk conf file systemd: remove ceph-disk from CMakeLists systemd: remove ceph-disk service udev: remove ceph-disk rules src: remove ceph-disk tool	2018-06-19 07:07:55 -05:00
David Zafman	fe09fc5e9d	test: Fail immediately if some operations fail Signed-off-by: David Zafman <dzafman@redhat.com>	2018-06-18 14:09:14 -07:00
David Zafman	33538aca35	test: Fix standalone main usage Signed-off-by: David Zafman <dzafman@redhat.com>	2018-06-18 14:09:14 -07:00
David Zafman	f886ebba08	test: Fix some function desciptions Signed-off-by: David Zafman <dzafman@redhat.com>	2018-06-18 14:09:14 -07:00
David Zafman	39fc43556f	test: Put files in private test directory Signed-off-by: David Zafman <dzafman@redhat.com>	2018-06-18 14:08:23 -07:00
Erwan Velu	2ce480b8fd	qa/standalone/ceph-helpers.sh: Fixing comment for wait_for_health() wait_for_health doesn't check if the cluster is making progress. So let's adjust the comment accordingly. Signed-off-by: Erwan Velu <erwan@redhat.com>	2018-06-14 11:06:52 +02:00
Erwan Velu	e6e10246c6	tests: Protecting rados bench against endless loop If the cluster dies during the rados bench, the maximum running time is no more considered and all emitted aios are pending. rados bench never quits and the global testing timeout (3600 sec : 1 hour) have to be reach to get a failure. This situation is dramatic for a background test or a CI run as it locks the whole job for too long for an event that will never occurs. This ideal solution would be having 'rados bench' considering a failure once the timeout is reached when aios are pending. A possible workaround here is to put use the system command 'timeout' before calling rados bench and fail if rados didn't completed on time. To avoid side effects, this patch is doubling rados timeout. If rados didn't completed after twice the expected time, it have to fail to avoid locking the whole testing job. Please find below the way it worked on a real test case. We can see no IO after t>2 but despite timeout=4 the bench continue. Thanks to this patch, the bench is stopped at t=8 and return 1. 5: /home/erwan/ceph/src/test/smoke.sh:55: TEST_multimon: timeout 8 rados -p foo bench 4 write -b 4096 --no-cleanup 5: hints = 1 5: Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for up to 4 seconds or 0 objects 5: Object prefix: benchmark_data_mr-meeseeks_184960 5: sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s) 5: 0 0 0 0 0 0 - 0 5: 1 16 1144 1128 4.40538 4.40625 0.00412965 0.0141116 5: 2 16 2147 2131 4.16134 3.91797 0.00985654 0.0109079 5: 3 16 2147 2131 2.77424 0 - 0.0109079 5: 4 16 2147 2131 2.0807 0 - 0.0109079 5: 5 16 2147 2131 1.66456 0 - 0.0109079 5: 6 16 2147 2131 1.38714 0 - 0.0109079 5: 7 16 2147 2131 1.18897 0 - 0.0109079 5: /home/erwan/ceph/src/test/smoke.sh:55: TEST_multimon: return 1 5: /home/erwan/ceph/src/test/smoke.sh:18: run: return 1 Signed-off-by: Erwan Velu <erwan@redhat.com>	2018-06-14 11:06:52 +02:00
Erwan Velu	62d2646c30	qa/standalone/ceph-helpers.sh: Defining custom timeout for wait_for_clean() The wait_for_clean() is using the default timeout aka 300sec = 5mn. wait_for_clean() is trying to find a clean status within that timeout _or_ reset its counter if any progress got made in between loops. In a case where the cluster is sane, the recovery should be made in shorter than 5mn but it the cluster died, waiting for 5mn for nothing is unefficient. This patch is about defining a custom timeout for a wait_for_clean() not to wait much more that 1m30 (90sec). If no progress is made in that period, there is very few chance this will read the a valid state anyhow. Signed-off-by: Erwan Velu <erwan@redhat.com>	2018-06-14 11:06:52 +02:00
Alfredo Deza	5b3a540045	qa/standalone remove ceph-disk from activate_osd helper Signed-off-by: Alfredo Deza <adeza@redhat.com>	2018-06-13 15:16:27 -04:00
Alfredo Deza	aa4f5569c3	qa/standalone remove virtualenv paths for ceph-disk and ceph-detect-init Signed-off-by: Alfredo Deza <adeza@redhat.com>	2018-06-13 15:16:27 -04:00
Dan Mick	50f2b72f2f	ceph-helpers.sh: remove ceph-disk, set up osds directly Signed-off-by: Dan Mick <dan.mick@redhat.com>	2018-06-13 15:16:26 -04:00
David Zafman	c1e96ae7cb	test: Use a file that should be on all OSes Also, create temporary files in test specific dir and remove Caused by: `154330fd68` Signed-off-by: David Zafman <dzafman@redhat.com>	2018-06-05 11:27:12 -07:00
Sage Weil	154330fd68	osd/PrimaryLogPG: fix on_local_recover crash on stray clone If there is a stray clone (one that does not appear in the SnapSet) and we do any sort of recovery on it the OSD will crash. Log an error instead but continue. This addresses a problem where a cluster has both (1) an unexpected clone and (2) the clone is not present on all replicas. Doing repair on that PG will both not fix the unexpected clone and also cause the remaining OSDs to crash trying to recover it. Include a test. Fixes: https://tracker.ceph.com/issues/24396 Signed-off-by: Sage Weil <sage@redhat.com>	2018-06-05 11:09:01 -05:00
Kefu Chai	333068b208	Merge pull request #22346 from dzafman/wip-scrub-omap osd: Handle omap and data digests independently Reviewed-by: Kefu Chai <kchai@redhat.com>	2018-06-04 19:53:18 +08:00
Kefu Chai	0829e83fde	Merge pull request #22196 from thinkercui/bugfix osd: read object attrs failed at EC recovery Reviewed-by: Josh Durgin <jdurgin@redhat.com>	2018-06-03 01:52:24 +08:00
cuixf	3eb1679b1f	osd: retry to read object attrs at EC recovery In EC recovery read, if the object's attrs read failed or with errors, we erase the attrs we have read and try to read it again from left shards. This will make the primary osd get the object's attrs correct and avoid assert. Signed-off-by: xiaofei cui <cuixiaofei@sangfor.com>	2018-06-01 06:26:56 -04:00
David Zafman	843598b69b	Revert "qa/standalone/scrub/osd-scrub-repair.sh: drop omap_digest flag" This reverts commit `886606bfd7`. Signed-off-by: David Zafman <dzafman@redhat.com> Conflicts: qa/standalone/scrub/osd-scrub-repair.sh (manually made equivalent changes)	2018-05-31 12:01:53 -07:00
Sage Weil	c3164df959	qa/standalone/mon/misc: fix features test Signed-off-by: Sage Weil <sage@redhat.com>	2018-05-25 17:02:49 -05:00
David Zafman	1a7fa9a62a	test: Add test cases for multiple copy pool and snapshot errors Signed-off-by: David Zafman <dzafman@redhat.com>	2018-04-28 16:42:19 -07:00
David Zafman	2fa596dc0c	test: Prepare for second test and minor improvements Check list-inconsistent-obj output Check how many _scan_snap groupings Use more general check for crashed osd(s) Signed-off-by: David Zafman <dzafman@redhat.com>	2018-04-28 16:42:19 -07:00
David Zafman	bae4940574	test: Fix comment at end of scrub test scripts Signed-off-by: David Zafman <dzafman@redhat.com>	2018-04-28 16:42:19 -07:00
Sage Weil	27e91a99f5	Merge pull request #21273 from jdurgin/wip-23195 osd/ECBackend: only check required shards when finishing recovery reads Reviewed-by: David Zafman <dzafman@redhat.com> Reviewed-by: Greg Farnum <gfarnum@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>	2018-04-24 17:20:25 -05:00
Josh Durgin	d4808256d2	osd/ECBackend: preserve requests for other objects when sending extra reads When multiple objects are in flight for the same ReadOp, swap() on the map<hobject_t, read_request_t> would remove requests for all objects. We just want to replace the requests for the single object we're dealing with in send_all_remaining_reads(). This prevents crashing trying to look up rop.to_read[hoid] when another object in the same ReadOp gets an EIO and tries to send more requests. Test this by using osd-recovery-max-single-start to bundle multiple reads into one ReadOp. Save and restore CEPH_ARGS so custom settings are reset for each test. Fixes: http://tracker.ceph.com/issues/23195 (the 2nd crash there) Signed-off-by: Josh Durgin <jdurgin@redhat.com>	2018-04-20 19:42:15 -04:00
Josh Durgin	b162a5478d	osd/ECBackend: recover from EIO based on the minimum data necessary Discount shards that already returned EIO, and use minimum_to_decode() to request just what is necessary to recover or read the originally requested extents of the object. Signed-off-by: Josh Durgin <jdurgin@redhat.com>	2018-04-20 19:42:14 -04:00
Josh Durgin	468ad4b410	osd/ECBackend: only check required shards when finishing recovery reads `1235810c2a` allowed recovery to use multiple passes of reads to handle EIO, but the end condition for checking whether we finished reading requires the full data to be decodable (this is what get_want_to_read_shards returns). This is just a loss of efficiency normally, since when there is only one object the subsequent read works, and grabs all the data necessary. The crash comes from having multiple objects in the same ReadOp - in this case the sequence of events is: - start recovery of two objects (osd_recovery_max_single_start > 1) - read object a shard 3 - read object b shard 3 - fail minimum_to_decode because shard 3 can't reconstruct all of object a - re-read all of object a, marking more reads in progress - fail minimum_to_decode because shard 3 can't reconstruct all of object b - skip re-reading object because there are now reads in progress - finish reading k shards of object a - still fail minimum_to_decode for object b, so no extra data was read - send_all_remaining_reads tries to lookup object b in ReadOp object - crash dereferencing to_read[object b], since this was cleared after handling the original object b read reply This patch fixes the immediate inefficiency and crash by only checking for the missing shards that were requested, rather than the entire object, for recovery reads. Fixes: http://tracker.ceph.com/issues/23195 (first crash) Signed-off-by: Josh Durgin <jdurgin@redhat.com>	2018-04-20 19:42:14 -04:00
Nathan Cutler	f03b9028f5	qa/standalone/ceph-helpers.sh: provide argument to dirname Fixes: http://tracker.ceph.com/issues/23805 Signed-off-by: Nathan Cutler <ncutler@suse.com>	2018-04-20 10:10:15 +02:00
David Zafman	458babe7ee	test: Use jq in a compatible way and for easier diff analysis Signed-off-by: David Zafman <dzafman@redhat.com>	2018-04-16 08:11:24 -07:00
David Zafman	c6207d21a8	Merge pull request #21362 from dzafman/wip-hex-digest osd: Change shard digests to hex like object info digests Reviewed-by: Kefu Chai <kchai@redhat.com>	2018-04-12 16:07:36 -07:00
David Zafman	22ddc6da5f	osd: Change shard digests to hex like object info digests Signed-off-by: David Zafman <dzafman@redhat.com>	2018-04-12 07:59:21 -07:00
Kefu Chai	4cc3dab070	Merge pull request #21318 from badone/wip-qa-mon-misc-add-osdmap-prune-tests qa/standalone/mon/misc.sh: Add osdmap-prune tests Reviewed-by: Joao Eduardo Luis <joao@suse.de> Reviewed-by: Kefu Chai <kchai@redhat.com>	2018-04-11 23:08:33 +08:00
David Zafman	9c5ef19f93	test: Be smarter about when jsonschema can be used Signed-off-by: David Zafman <dzafman@redhat.com>	2018-04-10 13:52:10 -07:00
David Zafman	60ae2b8eb3	osd rados command: Show snapset in list-inconsistent-snapset Add SnapSet bufferlist to inconsistent_snapset_t Partial fix for http://tracker.ceph.com/issues/23428 Signed-off-by: David Zafman <dzafman@redhat.com>	2018-04-10 13:51:48 -07:00
David Zafman	1b1d45bf51	test: Add getjson variable to save output Signed-off-by: David Zafman <dzafman@redhat.com>	2018-04-10 13:26:08 -07:00
David Zafman	007cb45fe5	osd rados command: Change error name snapset_mismatch to snapset_error Signed-off-by: David Zafman <dzafman@redhat.com>	2018-04-10 13:26:08 -07:00
David Zafman	0c7ac9db3b	test: Clean-up test and use local values for number of objects and osds Signed-off-by: David Zafman <dzafman@redhat.com>	2018-04-10 13:26:08 -07:00
David Zafman	982509514c	osd rados command: list-inconsistent-obj attribute improvements System attributes shown as "object_info", "snapset" and "hashinfo" Only output user attributes as "attrs" Drop leading undescore "_" for user attribute keys Improve logic as to when to show user attributes or specific system attributes Signed-off-by: David Zafman <dzafman@redhat.com>	2018-04-10 13:26:08 -07:00
David Zafman	01687b052f	osd rados command: Change "oi" to "info" in scrub handling errors data_digest_mismatch_oi -> data_digest_mismatch_info omap_digest_mismatch_oi -> omap_digest_mismatch_info size_mismatch_oi -> size_mismatch_info obj_size_oi_mismatch -> obj_size_info_mismatch Signed-off-by: David Zafman <dzafman@redhat.com>	2018-04-10 13:26:08 -07:00
David Zafman	273f6213ea	osd rados command: Change "oi_attr" to "info" in scrub handling errors oi_attr_missing -> info_missing oi_attr_corrupted -> info_corrupted Signed-off-by: David Zafman <dzafman@redhat.com>	2018-04-10 13:26:08 -07:00
David Zafman	bec67e3d40	osd rados command: Rename ss_attr_missing/ss_attr_corrupted to snapset_missing/snapset_corrupted Signed-off-by: David Zafman <dzafman@redhat.com>	2018-04-10 13:26:08 -07:00

1 2 3 4

192 Commits