ceph/qa/standalone/osd
Erwan Velu e6e10246c6 tests: Protecting rados bench against endless loop
If the cluster dies during the rados bench, the maximum running time is
no longer honored and all emitted aios stay pending.

rados bench never quits, and the global testing timeout (3600 sec: 1
hour) has to be reached to get a failure.

This situation is dramatic for a background test or a CI run, as it locks
the whole job for too long, waiting for an event that will never occur.

The ideal solution would be for 'rados bench' to report a failure
once the timeout is reached while aios are still pending.

A possible workaround here is to use the system command 'timeout'
before calling rados bench and fail if rados did not complete in time.

To avoid side effects, this patch doubles the rados timeout. If rados
has not completed after twice the expected time, it has to fail to avoid
locking the whole testing job.
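
The doubling workaround described above can be sketched as a small shell helper. This is only an illustration, not the code from the patch: the function name run_with_double_timeout is hypothetical, and it assumes the coreutils 'timeout' command, which exits with status 124 when it has to kill the command it wraps.

```shell
# Hypothetical helper (not part of the patch): run a command that is
# expected to finish within $1 seconds, but let timeout(1) kill it
# once twice that time has elapsed.
run_with_double_timeout() {
    expected=$1
    shift
    # coreutils timeout exits with 124 if the deadline was hit,
    # otherwise it passes through the command's own exit status.
    timeout "$((expected * 2))" "$@"
}

# Against a live cluster, the call from the trace would look like:
#   run_with_double_timeout 4 rados -p foo bench 4 write -b 4096 --no-cleanup || return 1
```

Deriving the deadline from the expected duration keeps the two values in sync, so a bench that is asked to run for 4 seconds always gets 8 seconds before the test gives up.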

Please find below how it worked on a real test case.
We can see no IO after t>2, but despite timeout=4 the bench continues.
Thanks to this patch, the bench is stopped at t=8 and returns 1.

5: /home/erwan/ceph/src/test/smoke.sh:55: TEST_multimon:  timeout 8 rados -p foo bench 4 write -b 4096 --no-cleanup
5: hints = 1
5: Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for up to 4 seconds or 0 objects
5: Object prefix: benchmark_data_mr-meeseeks_184960
5:   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
5:     0       0         0         0         0         0           -           0
5:     1      16      1144      1128   4.40538   4.40625  0.00412965   0.0141116
5:     2      16      2147      2131   4.16134   3.91797  0.00985654   0.0109079
5:     3      16      2147      2131   2.77424         0           -   0.0109079
5:     4      16      2147      2131    2.0807         0           -   0.0109079
5:     5      16      2147      2131   1.66456         0           -   0.0109079
5:     6      16      2147      2131   1.38714         0           -   0.0109079
5:     7      16      2147      2131   1.18897         0           -   0.0109079
5: /home/erwan/ceph/src/test/smoke.sh:55: TEST_multimon:  return 1
5: /home/erwan/ceph/src/test/smoke.sh:18: run:  return 1

Signed-off-by: Erwan Velu <erwan@redhat.com>
2018-06-14 11:06:52 +02:00
osd-backfill-stats.sh test: osd-backfill-stats.sh parallel osd-recovery-stats.sh check() changes 2018-03-14 10:07:11 -07:00
osd-bench.sh scripts: fix bash path in shebangs 2017-07-27 13:24:26 -06:00
osd-config.sh qa/standalone: remove osd-map-max-advance related tests 2018-01-06 19:40:15 +08:00
osd-copy-from.sh scripts: fix bash path in shebangs 2017-07-27 13:24:26 -06:00
osd-dup.sh tests: Protecting rados bench against endless loop 2018-06-14 11:06:52 +02:00
osd-fast-mark-down.sh tests: Protecting rados bench against endless loop 2018-06-14 11:06:52 +02:00
osd-markdown.sh qa/standalone/osd/osd-mark-down: create pool to get updated osdmap faster 2017-10-09 22:19:29 +08:00
osd-reactivate.sh scripts: fix bash path in shebangs 2017-07-27 13:24:26 -06:00
osd-recovery-stats.sh qa: modify TEST_recovery_sizeup() to handle async recovery 2018-03-15 11:13:34 -07:00
osd-rep-recov-eio.sh qa/standalone: extract delete_pool() 2018-02-28 15:40:28 +08:00
osd-reuse-id.sh scripts: fix bash path in shebangs 2017-07-27 13:24:26 -06:00
repro_long_log.sh PrimaryLogPG: only trim up to osd_pg_log_trim_max entries at once 2018-03-09 19:14:28 -05:00