# ceph/qa/tasks/dump_stuck.py

"""
Dump_stuck command
"""
import logging
import re
import time

import ceph_manager
from teuthology import misc as teuthology


log = logging.getLogger(__name__)

def check_stuck(manager, num_inactive, num_unclean, num_stale, timeout=10):
    """
    Check the stuck PG counts.  Query get_stuck_pgs for the inactive, unclean,
    and stale states, log the number of PGs returned for each state, and
    compare those counts with the expected values passed in.  This passes if
    all asserts pass.

    :param manager: Ceph manager
    :param num_inactive: expected number of stuck inactive PGs
    :param num_unclean: expected number of stuck unclean PGs
    :param num_stale: expected number of stuck stale PGs
    :param timeout: timeout value for get_stuck_pgs calls
    """
    inactive = manager.get_stuck_pgs('inactive', timeout)
    unclean = manager.get_stuck_pgs('unclean', timeout)
    stale = manager.get_stuck_pgs('stale', timeout)
    log.info('inactive %s / %d,  unclean %s / %d,  stale %s / %d',
             len(inactive), num_inactive,
             len(unclean), num_unclean,
             len(stale), num_stale)
    assert len(inactive) == num_inactive
    assert len(unclean) == num_unclean
    assert len(stale) == num_stale
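
# Illustrative sketch only, not part of the original task: task() repeats the
# same "retry check_stuck until a deadline" loop for the unclean and stale
# phases.  A helper along these lines could factor that pattern out.  The name
# poll_until_pass, the pause between attempts (interval), and the injectable
# clock/sleep parameters are all assumptions made for this example.
def poll_until_pass(check, deadline=900, interval=1,
                    clock=time.time, sleep=time.sleep):
    """
    Call check() until it stops raising AssertionError.

    Hypothetical helper.  Returns whatever check() returns.  Re-raises the
    AssertionError from the latest attempt once more than ``deadline``
    seconds have passed since the first attempt.

    :param check: zero-argument callable that asserts the expected state
    :param deadline: maximum number of seconds to keep retrying
    :param interval: seconds to pause between failed attempts
    :param clock: callable returning the current time in seconds
    :param sleep: callable used to pause between attempts
    """
    starttime = clock()
    while True:
        try:
            return check()
        except AssertionError:
            if clock() - starttime > deadline:
                raise
            sleep(interval)

# Possible usage, assuming manager and num_pgs as defined in task():
#
#     poll_until_pass(lambda: check_stuck(manager, num_inactive=0,
#                                         num_unclean=num_pgs, num_stale=0))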

def task(ctx, config):
    """
    Test the dump_stuck command.

    :param ctx: Context
    :param config: Configuration
    """
    assert config is None, \
        'dump_stuck requires no configuration'
    assert teuthology.num_instances_of_type(ctx.cluster, 'osd') == 2, \
        'dump_stuck requires exactly 2 osds'

    timeout = 60
    first_mon = teuthology.get_first_mon(ctx, config)
    (mon,) = ctx.cluster.only(first_mon).remotes.keys()

    manager = ceph_manager.CephManager(
        mon,
        ctx=ctx,
        logger=log.getChild('ceph_manager'),
        )

    manager.flush_pg_stats([0, 1])
    manager.wait_for_clean(timeout)

    # Only mon-pg-stuck-threshold is injected here; mon-osd-report-timeout
    # is deliberately not set by this test.
    manager.raw_cluster_cmd('tell', 'mon.0', 'injectargs', '--',
                            '--mon-pg-stuck-threshold 10')

    # all active+clean
    check_stuck(
        manager,
        num_inactive=0,
        num_unclean=0,
        num_stale=0,
        )
    num_pgs = manager.get_num_pgs()

    manager.mark_out_osd(0)
    time.sleep(timeout)
    manager.flush_pg_stats([1])
    manager.wait_for_recovery(timeout)

    # all active+clean+remapped
    check_stuck(
        manager,
        num_inactive=0,
        num_unclean=0,
        num_stale=0,
        )

    manager.mark_in_osd(0)
    manager.flush_pg_stats([0, 1])
    manager.wait_for_clean(timeout)

    # all active+clean
    check_stuck(
        manager,
        num_inactive=0,
        num_unclean=0,
        num_stale=0,
        )

    log.info('stopping first osd')
    manager.kill_osd(0)
    manager.mark_down_osd(0)
    manager.wait_for_active(timeout)

    log.info('waiting for all to be unclean')
    starttime = time.time()
    done = False
    while not done:
        try:
            check_stuck(
                manager,
                num_inactive=0,
                num_unclean=num_pgs,
                num_stale=0,
                )
            done = True
        except AssertionError:
            # wait up to 15 minutes to become unclean
            if time.time() - starttime > 900:
                raise


    log.info('stopping second osd')
    manager.kill_osd(1)
    manager.mark_down_osd(1)

    log.info('waiting for all to be stale')
    starttime = time.time()
    done = False
    while not done:
        try:
            check_stuck(
                manager,
                num_inactive=0,
                num_unclean=num_pgs,
                num_stale=num_pgs,
                )
            done = True
        except AssertionError:
            # wait up to 15 minutes to become stale
            if time.time() - starttime > 900:
                raise

    log.info('reviving')
    for id_ in teuthology.all_roles_of_type(ctx.cluster, 'osd'):
        manager.revive_osd(id_)
        manager.mark_in_osd(id_)
    while True:
        try:
            manager.flush_pg_stats([0, 1])
            break
        except Exception:
            log.exception('osds must not be started yet, waiting...')
            time.sleep(1)
    manager.wait_for_clean(timeout)

    check_stuck(
        manager,
        num_inactive=0,
        num_unclean=0,
        num_stale=0,
        )


# Change history, condensed from the blame annotations (oldest first):
#
# 2012-02-21 21:11:05  Add a task for testing stuck pg visibility.
# 2012-02-28 21:55:46  dump_stuck: verify that 'ceph health' mentions the
#                      right number of inactive/unclean/stale pgs
# 2012-02-29 23:47:17  dump_stuck: note required ceph configuration
# 2013-06-23 23:21:45  dump_stuck: fix race with osd start.  Occasionally we
#                      don't wait long enough for the osd to start and mark
#                      itself up.  Keep trying until flush succeeds.
#                      Fixes: #5431
#                      Signed-off-by: Sage Weil <sage@inktank.com>
# 2013-06-25 19:45:22  dump_stuck: fix test.  The mon-osd-report-timeout
#                      setting shouldn't be there!  We will set the other
#                      item explicitly, and remove both from the suite yaml.
#                      Fixes: #5440
# 2013-07-28 00:41:51  ceph_manager, dump_stuck: fix injectargs args
#                      Signed-off-by: Sage Weil <sage@inktank.com>
# 2013-08-30 15:58:10  Never use 'except:' without specifying an Exception.
# 2013-10-12 08:28:27  Added docstrings, and improved some of the comments on
#                      several tasks.
# 2014-03-27 16:35:28  Revert "Lines formerly of the form
#                      '(remote,) = ctx.cluster.only(role).remotes.keys()'"
#                      This reverts commit
#                      d693b3f8950ffd1f2492a4db0f8234fee31f00f0.
# 2016-10-05 15:33:37  tasks/dump_stuck: more verbose
#                      Signed-off-by: Sage Weil <sage@redhat.com>
# 2016-10-05 15:41:59  tasks/dump_stuck: fix unclean count
#                      Signed-off-by: Sage Weil <sage@redhat.com>
# 2016-10-05 19:30:23  tasks/dump_suck: mark down osds one at a time.  This
#                      forces them to be unclean, then stale.  This ensures
#                      that after they are both down they are both always
#                      unclean, whereas previously it would be possible for
#                      them to be only stale and not unclean.
#                      Signed-off-by: Sage Weil <sage@redhat.com>
# 2017-05-18 22:16:55  qa/tasks: use new reliable flush_pg_stats helper.  The
#                      helper gets a sequence number from the osd (or osds),
#                      and then polls the mon until that seq is reflected
#                      there.  This is overkill in some cases, since many
#                      tests only require that the stats be reflected on the
#                      mgr (not the mon), but waiting for it to also reach
#                      the mon is sufficient!
#                      Signed-off-by: Sage Weil <sage@redhat.com>
# 2017-06-27 16:01:07  qa/tasks/dump_stuck: fix for active+clean+remapped.
#                      In d24a8886658c2d8882275d69c6409717a62701be we made
#                      remapped a clean state but didn't fix this test.
#                      Fixes: http://tracker.ceph.com/issues/20431
#                      Signed-off-by: Sage Weil <sage@redhat.com>
# 2017-07-25 11:14:07  qa/tasks/dump_stuck: fix dump_stuck test bug.  Test
#                      cluster with 2 osds: stop osd.0; if osd.1 reports the
#                      pg stats during pg peering, the mon records the pg
#                      state as 'peering'; then stop osd.1, and the pg state
#                      ends up stuck in 'stale+peering', which is unexpected.
#                      Let's wait_for_active() after stopping osd.0.
#                      Signed-off-by: huangjun <huangjun@xsky.com>