ceph/qa/tasks/peer.py

"""
Peer test (Single test, not much configurable here)
"""
import logging
import json
import time

import ceph_manager
from teuthology import misc as teuthology
from util.rados import rados

log = logging.getLogger(__name__)

def task(ctx, config):
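    """
    Test peering.

    Force a pg to get stuck in the 'down' state, verify that the peering
    state can be queried, then restart the OSD so the pg can recover.
    """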
    if config is None:
        config = {}
    assert isinstance(config, dict), \
        'peer task only accepts a dict for configuration'
    first_mon = teuthology.get_first_mon(ctx, config)
    (mon,) = ctx.cluster.only(first_mon).remotes.keys()

    manager = ceph_manager.CephManager(
        mon,
        ctx=ctx,
        logger=log.getChild('ceph_manager'),
        )

    while len(manager.get_osd_status()['up']) < 3:
        time.sleep(10)
    manager.flush_pg_stats([0, 1, 2])
    manager.wait_for_clean()

    for i in range(3):
        manager.set_config(
            i,
            osd_recovery_delay_start=120)

    # take one osd down
    manager.kill_osd(2)
    manager.mark_down_osd(2)

    # kludge to make sure they get a map
    rados(ctx, mon, ['-p', 'data', 'get', 'dummy', '-'])

    manager.flush_pg_stats([0, 1])
    manager.wait_for_recovery()

    # kill another and revive 2, so that some pgs can't peer.
    manager.kill_osd(1)
    manager.mark_down_osd(1)
    manager.revive_osd(2)
    manager.wait_till_osd_is_up(2)

    manager.flush_pg_stats([0, 2])

    manager.wait_for_active_or_down()

    manager.flush_pg_stats([0, 2])

    # look for down pgs
    num_down_pgs = 0
    pgs = manager.get_pg_stats()
    for pg in pgs:
        out = manager.raw_cluster_cmd('pg', pg['pgid'], 'query')
        log.debug("out string %s", out)
        j = json.loads(out)
        log.info("pg is %s, query json is %s", pg, j)

        if 'down' in pg['state']:
            num_down_pgs += 1
            # verify that it is blocked on osd.1
            rs = j['recovery_state']
            assert len(rs) >= 2
            assert rs[0]['name'] == 'Started/Primary/Peering/Down'
            assert rs[1]['name'] == 'Started/Primary/Peering'
            assert rs[1]['blocked']
            assert rs[1]['down_osds_we_would_probe'] == [1]
            assert len(rs[1]['peering_blocked_by']) == 1
            assert rs[1]['peering_blocked_by'][0]['osd'] == 1

    assert num_down_pgs > 0

    # bring it all back
    manager.revive_osd(1)
    manager.wait_till_osd_is_up(1)
    manager.flush_pg_stats([0, 1, 2])
    manager.wait_for_clean()
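
The recovery_state checks in the loop above can be tried outside a cluster. The following is a minimal sketch, not part of the task itself: `blocked_on_osd` and `SAMPLE_QUERY` are hypothetical names, and the sample JSON is a hand-written, trimmed-down stand-in for real `pg query` output (the text of the `blocked` field in particular is an assumption).

```python
import json

# Hypothetical, trimmed-down stand-in for `ceph pg <pgid> query` output.
# Real output has many more fields; the 'blocked' text is an assumption.
SAMPLE_QUERY = json.dumps({
    "state": "down+peering",
    "recovery_state": [
        {"name": "Started/Primary/Peering/Down"},
        {
            "name": "Started/Primary/Peering",
            "blocked": "peering is blocked due to down osds",
            "down_osds_we_would_probe": [1],
            "peering_blocked_by": [{"osd": 1}],
        },
    ],
})


def blocked_on_osd(query_json, osd_id):
    """Return True if the query output shows peering blocked only on osd_id.

    Mirrors the assertions in the task, but returns a bool instead of
    raising, so it can be used on arbitrary input.
    """
    rs = json.loads(query_json).get("recovery_state", [])
    if len(rs) < 2:
        return False
    down, peering = rs[0], rs[1]
    blockers = [b.get("osd") for b in peering.get("peering_blocked_by", [])]
    return (
        down.get("name") == "Started/Primary/Peering/Down"
        and peering.get("name") == "Started/Primary/Peering"
        and bool(peering.get("blocked"))
        and peering.get("down_osds_we_would_probe") == [osd_id]
        and blockers == [osd_id]
    )
```

Calling `blocked_on_osd(SAMPLE_QUERY, 1)` checks the same conditions the task asserts for osd.1; any other OSD id, or output without a `recovery_state` list, yields False.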