ceph/qa/tasks/peer.py

"""
Peer test (Single test, not much configurable here)
"""
import logging
import json
import time

from tasks import ceph_manager
from tasks.util.rados import rados
from teuthology import misc as teuthology

log = logging.getLogger(__name__)

def task(ctx, config):
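    """
    Test peering.

    Force a pg to get stuck in the 'down' state, verify that we can query
    the peering state, then restart the OSD so the pg can recover.
    """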
    if config is None:
        config = {}
    assert isinstance(config, dict), \
        'peer task only accepts a dict for configuration'
    first_mon = teuthology.get_first_mon(ctx, config)
    (mon,) = ctx.cluster.only(first_mon).remotes.keys()

    manager = ceph_manager.CephManager(
        mon,
        ctx=ctx,
        logger=log.getChild('ceph_manager'),
        )

    # wait until all three osds are up
    while len(manager.get_osd_status()['up']) < 3:
        time.sleep(10)
    manager.flush_pg_stats([0, 1, 2])
    manager.wait_for_clean()

    # delay the start of recovery; without this the test is (very) racy
    for i in range(3):
        manager.set_config(
            i,
            osd_recovery_delay_start=120)

    # take one osd down
    manager.kill_osd(2)
    manager.mark_down_osd(2)

    # kludge to make sure they get a map
    rados(ctx, mon, ['-p', 'data', 'get', 'dummy', '-'])

    manager.flush_pg_stats([0, 1])
    manager.wait_for_recovery()

    # kill another and revive 2, so that some pgs can't peer.
    manager.kill_osd(1)
    manager.mark_down_osd(1)
    manager.revive_osd(2)
    manager.wait_till_osd_is_up(2)

    manager.flush_pg_stats([0, 2])

    manager.wait_for_active_or_down()

    manager.flush_pg_stats([0, 2])

    # look for down pgs
    num_down_pgs = 0
    pgs = manager.get_pg_stats()
    for pg in pgs:
        out = manager.raw_cluster_cmd('pg', pg['pgid'], 'query')
        log.debug("out string %s", out)
        j = json.loads(out)
        log.info("pg is %s, query json is %s", pg, j)

        if 'down' in pg['state']:
            num_down_pgs += 1
            # verify that it is blocked on osd.1
            rs = j['recovery_state']
            assert len(rs) >= 2
            assert rs[0]['name'] == 'Started/Primary/Peering/Down'
            assert rs[1]['name'] == 'Started/Primary/Peering'
            assert rs[1]['blocked']
            assert rs[1]['down_osds_we_would_probe'] == [1]
            assert len(rs[1]['peering_blocked_by']) == 1
            assert rs[1]['peering_blocked_by'][0]['osd'] == 1

    assert num_down_pgs > 0

    # bring it all back
    manager.revive_osd(1)
    manager.wait_till_osd_is_up(1)
    manager.flush_pg_stats([0, 1, 2])
    manager.wait_for_clean()
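

# A minimal sketch, not part of the original task: the per-pg assertions in
# task() could be collected into one pure function so that the expected
# shape of 'recovery_state' can be checked without a running cluster.
# The name is_blocked_only_on is hypothetical; the keys it reads are the
# ones the assertions in task() already rely on.
def is_blocked_only_on(recovery_state, osd_id):
    """
    Return True if the first two recovery_state entries show peering
    blocked by exactly one down osd, osd_id.
    """
    if len(recovery_state) < 2:
        return False
    down, peering = recovery_state[0], recovery_state[1]
    blocked_by = peering.get('peering_blocked_by', [])
    return (
        down.get('name') == 'Started/Primary/Peering/Down'
        and peering.get('name') == 'Started/Primary/Peering'
        and bool(peering.get('blocked'))
        and peering.get('down_osds_we_would_probe') == [osd_id]
        and len(blocked_by) == 1
        and blocked_by[0].get('osd') == osd_id
    )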