Commit Graph

66 Commits

Author SHA1 Message Date
Sage Weil
fe9fb49e27 ceph_manager: use get() for self.config powercycle checks
I think this is what is going on...

Traceback (most recent call last):
  File "/var/lib/teuthworker/teuthology-master/teuthology/contextutil.py", line 27, in nested
    yield vars
  File "/var/lib/teuthworker/teuthology-master/teuthology/task/ceph.py", line 1158, in task
    yield
  File "/var/lib/teuthworker/teuthology-master/teuthology/run_tasks.py", line 25, in run_tasks
    manager = _run_one_task(taskname, ctx=ctx, config=config)
  File "/var/lib/teuthworker/teuthology-master/teuthology/run_tasks.py", line 14, in _run_one_task
    return fn(**kwargs)
  File "/var/lib/teuthworker/teuthology-master/teuthology/task/dump_stuck.py", line 93, in task
    manager.kill_osd(id_)
  File "/var/lib/teuthworker/teuthology-master/teuthology/task/ceph_manager.py", line 665, in kill_osd
    if 'powercycle' in self.config and self.config['powercycle']:
TypeError: argument of type 'NoneType' is not iterable
2013-02-02 21:01:08 -08:00
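
For illustration, the failure in the traceback above can be reproduced outside teuthology. This is a minimal sketch, assuming the config can arrive as None. The helper names and the `config or {}` fallback are hypothetical and not necessarily what the commit itself does.

```python
def membership_check(config):
    # Pattern from the traceback: `in` raises TypeError when config is None.
    return 'powercycle' in config and config['powercycle']


def powercycle_enabled(config):
    """Return True only when config is a mapping with a truthy 'powercycle'."""
    # Hypothetical guard: substitute an empty dict for None, then use get()
    # so that a missing key yields None instead of raising.
    return bool((config or {}).get('powercycle'))
```

With a guard like this, a task that passes no config at all is treated the same as one that leaves powercycle unset.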
Samuel Just
fadc22c0b9 ceph_manager: wait for admin socket on restart, use for set_config
Fixes: #3966
Signed-off-by: Samuel Just <sam.just@inktank.com>
2013-01-31 12:59:00 -08:00
Sam Lang
8f720454cb Assign devices to osds using the device wwn
Linux doesn't guarantee device names (/dev/sdb, etc.)
are always mapped to the same disk.  Instead of assigning
nominal devices to osds, we map devices by their wwn
(/dev/disk/by-id/wwn-*) to an osd (both data and journal).

Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-31 08:23:39 -06:00
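
A rough sketch of the idea in the commit above: enumerate the stable /dev/disk/by-id/wwn-* links and pair them with osd ids. The function names are hypothetical, and skipping `-part` links is an assumption about wanting whole disks only. The commit's actual handling of data and journal devices is not shown here.

```python
import glob
import os


def wwn_devices(by_id_dir='/dev/disk/by-id'):
    """Return sorted wwn-* paths, skipping partition links such as -part1."""
    paths = glob.glob(os.path.join(by_id_dir, 'wwn-*'))
    return sorted(p for p in paths if '-part' not in os.path.basename(p))


def assign_devices(osd_ids, devices):
    """Pair each osd id with one stable device path (hypothetical helper)."""
    if len(devices) < len(osd_ids):
        raise ValueError('not enough devices for the requested osds')
    return dict(zip(osd_ids, devices))
```

Because the wwn links are derived from the disk's own identifier, the mapping does not depend on the order in which the kernel names /dev/sdX devices.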
Sam Lang
58111595d4 Support power cycling osds/nodes through ipmi
This patch defines a RemoteConsole class associated
with each Remote class instance, allowing
power cycling a target through ipmi.

Fixes/Implements #3782.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-31 08:23:37 -06:00
Sam Lang
ace4cb07b2 Replace /tmp/cephtest/ with configurable path
Teuthology uses /tmp/cephtest/ as the scratch test directory for
a run.  This patch replaces /tmp/cephtest/ everywhere with a
per-run directory: {basedir}/{rundir} where {basedir} is a directory
configured in .teuthology.yaml (/tmp/cephtest if not specified),
and {rundir} is the name of the run, as given in --name.  If no name
is specified, {user}-{timestamp} is used.

To get the old behavior (/tmp/cephtest), set test_path: /tmp/cephtest
in .teuthology.yaml.

This change was motivated by #3782, which requires a test dir that
survives across reboots, but also resolves #3767.

Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-31 08:23:31 -06:00
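
The path rules described in the commit above could be sketched as follows. Only `test_path` and the /tmp/cephtest default come from the commit message. The function name, the `base_test_dir` key, and the way user and timestamp are passed in are assumptions made for illustration.

```python
import os

DEFAULT_BASEDIR = '/tmp/cephtest'


def get_testdir(config, name=None, user='user', timestamp='timestamp'):
    """Compute the per-run scratch directory from a parsed .teuthology.yaml."""
    # test_path, when set, restores a fixed directory (the old behavior).
    if config.get('test_path'):
        return config['test_path']
    # 'base_test_dir' is a hypothetical key name for {basedir}.
    basedir = config.get('base_test_dir', DEFAULT_BASEDIR)
    # {rundir} is the run name, or {user}-{timestamp} when no name is given.
    rundir = name if name else '{0}-{1}'.format(user, timestamp)
    return os.path.join(basedir, rundir)
```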
Sam Lang
14730276b9 Fixes for syntax errors found by pyflakes.
This patch includes minor fixes to the teuthology
python code for syntax errors found by running
check-syntax.sh (which runs pyflakes on each file).

Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-31 07:58:57 -06:00
Samuel Just
1c31194920 osd_recovery: inject a recovery delay
Signed-off-by: Samuel Just <sam.just@inktank.com>
2013-01-28 20:22:33 -08:00
Sage Weil
b5f81636a2 osdthrasher: inject pause on a live (on in) osd 2013-01-26 13:13:08 -08:00
Samuel Just
3a5c70b89b ceph_manager: turn long stall injection off by default
Signed-off-by: Samuel Just <sam.just@inktank.com>
2013-01-24 17:31:38 -08:00
Sage Weil
20af01f23b ceph_manager: fix get_num_active_recovered()
The states now have 'backfill' *or* 'recover' in them.
2013-01-24 16:23:33 -08:00
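
A sketch of the check implied by the commit above: a PG counts as active and recovered only if its state mentions neither 'recover' nor 'backfill'. The function name, the list-of-strings input, and the example state strings are assumptions. The real method reads PG stats from the cluster.

```python
def num_active_recovered(pg_states):
    """Count PGs that are active and neither recovering nor backfilling."""
    count = 0
    for state in pg_states:
        flags = state.split('+')
        # Substring match so variants such as 'recovering' are also caught.
        busy = any('recover' in f or 'backfill' in f for f in flags)
        if 'active' in flags and not busy:
            count += 1
    return count
```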
Samuel Just
6a859bcd56 ceph_manager: use 80/70 as pause_long, pause_check_after defaults
OSD::op_tp suicides after 150.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2013-01-24 12:50:26 -08:00
Samuel Just
0f24dca2d7 ceph_manager: use do_rados for rmpool
Signed-off-by: Samuel Just <sam.just@inktank.com>
2013-01-24 10:08:44 -08:00
Samuel Just
ec5a14553f ceph_manager: default chance_down to 0.4
Signed-off-by: Samuel Just <sam.just@inktank.com>
2013-01-23 17:44:05 -08:00
Samuel Just
566ae5332e ceph_manager: add filestore and heartbeat stalls
Signed-off-by: Samuel Just <sam.just@inktank.com>
2013-01-23 17:40:40 -08:00
David Zafman
e714c77812 osd: Testing of deep-scrub omap changes
Fix scrub_test.py and add omap corruption test

Signed-off-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
2013-01-22 15:48:45 -08:00
Sam Lang
53f22d9493 task/mds_thrasher: New task for thrashing the mds
Signed-off-by: Sam Lang <sam.lang@inktank.com>
2013-01-18 15:48:52 -06:00
Joao Eduardo Luis
e88b909a1d task: ceph_manager: add 'get_mon_health' function
Signed-off-by: Joao Eduardo Luis <jecluis@gmail.com>
2013-01-04 17:03:55 +00:00
Samuel Just
f2dbe5edd7 CephManager: add ability to test split
Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-12-11 15:11:06 -08:00
Samuel Just
f309c33d2d Clean up string interpolation operator spacing ceph_manager.py
Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-11-09 10:52:16 -08:00
Samuel Just
f82d4a7b86 Add divergent_priors test
Tests scenario where merge_old_entry encounters a divergent
entry where the prior_version is prior to log_tail.  This
is a problem since it will go into the missing set, but won't
be re-added to the missing set during read_log() if the node
restarts prior to recovering the object.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-11-09 10:52:15 -08:00
Samuel Just
bd83ed70dc ceph_manager: add test_min_size action
Thrasher can now with configurable frequency test min_size by
taking down all but one osd, waiting, killing that osd and bringing
back the others, and verifying that the cluster goes clean.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-11-07 12:56:31 -08:00
Mike Ryan
3b85b2311b task: verify scrub detects files whose contents changed
Signed-off-by: Mike Ryan <mike.ryan@inktank.com>
2012-08-02 11:14:51 -07:00
Sage Weil
a9f2bf622f ceph_manager: wait_for_active 2012-07-28 10:23:18 -07:00
Sage Weil
731d520900 ceph_manager: count 'incomplete' as 'down' 2012-07-28 10:23:18 -07:00
Josh Durgin
ddb98f7773 ceph_manager: don't try to start greenlet twice
spawn already scheduled it. Trying to start it again hits an assert.
2012-04-10 16:23:58 -07:00
Samuel Just
b4aa098f47 make Thrasher not inherit from Greenlet 2012-03-29 18:08:19 -07:00
Sage Weil
84cd4ed6c3 peer: wait for peering to complete, or block
We need to wait for peering to either complete, or block because it is
waiting for another PG.  _Then_ look at all the PG states and compare the
mon values with what we get from querying the OSDs directly.
2012-02-25 21:05:00 -08:00
Sage Weil
c43e87d118 ceph_manager: list_pg_missing
List missing objects for the given pgid.
2012-02-24 12:42:39 -08:00
Josh Durgin
995dc1f751 Add a task for testing stuck pg visibility. 2012-02-21 15:12:48 -08:00
Sage Weil
45b6189b7d ceph_manager: ignore stale states when counting
also remove assumptions about ordering of states
2012-02-18 14:44:53 -08:00
Sage Weil
196d4a1f16 wait_till_clean -> wait_for_clean and wait_for_recovery
Clean now also means the correct number of replicas, whereas recovered
means we have done all the work we can do given the replicas/osds we have.
For example, degraded and clean are now mutually exclusive.

Also move away from 'till'.
2012-02-17 21:53:25 -08:00
Sage Weil
6f3abc6ced ceph_manager: mark in a bit more often than out
Otherwise we can get into cases where many/most nodes are out, and things
don't work as well.  e.g., crush may start to fail.
2012-02-13 15:28:24 -08:00
Sage Weil
e337c4727c ceph_manager: add manager.blackhole_kill_osd()
This will suspend disk writes for a couple seconds and then kill the
daemon.  It helps us simulate a hardware failure.
2012-01-31 16:13:59 -08:00
Samuel Just
4aa9ca4551 CephManager: base timeout on time since last change in active+clean
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
2012-01-24 11:28:38 -08:00
Sage Weil
45e4c924fa thrashosds: maxdead default to 0
This avoids any possibility of blocking peering.
2012-01-17 09:24:54 -08:00
Sage Weil
71390f9784 thrashosds: fix action selection
I'm not sure what the old code was trying to do, but I'm pretty sure it
wasn't doing it correctly.. a .1 chance_down was killing an OSD for me
virtually every time.
2012-01-16 15:05:43 -08:00
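
One way to make a value like chance_down act as a proportional weight is to draw once over the summed weights. This is a sketch under stated assumptions: the function name, the action labels, and the injectable `draw` parameter are hypothetical, and it is not claimed to be the code this commit introduced.

```python
import random


def choose_action(weighted_actions, draw=None):
    """Pick one action with probability proportional to its weight.

    weighted_actions: list of (action, weight) pairs with weights >= 0.
    draw: optional number in [0, total) used in place of a random draw,
          which makes the selection checkable deterministically.
    """
    total = sum(weight for _, weight in weighted_actions)
    if total <= 0:
        raise ValueError('at least one weight must be positive')
    if draw is None:
        draw = random.uniform(0, total)
    for action, weight in weighted_actions:
        if draw < weight:
            return action
        draw -= weight
    # Floating-point edge case: fall back to the last positively weighted action.
    return [a for a, w in weighted_actions if w > 0][-1]
```

With weights of 0.1, 0.3 and 0.6, the first action is selected for roughly one draw in ten rather than on nearly every pass.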
Sage Weil
8fc6086986 thrashosds: make actions less nonsensical
Make marking OSD up/down and in/out totally orthogonal.

Signed-off-by: Sage Weil <sage@newdream.net>
2012-01-16 15:05:43 -08:00
Sage Weil
59369237c9 thrasher: don't mark down osds out; tell monitor same
Stopping ceph-osd doesn't make it out (immediately).  Prevent monitor
from doing this after a delay too so we can keep our notion of what is
up/down/in/out accurate.
2012-01-11 12:54:09 -08:00
Sage Weil
6dae2f8ae3 thrasher: adjust min_dead default
Make this 1, not 2.  That's a bit more friendly.  It doesn't strictly
matter, tho, since we revive osds before waiting for clean.
2012-01-11 12:54:09 -08:00
Sage Weil
fb74b90152 thrasher: add max_dead
Add max_dead, and revive osds prior to waiting for clean.  Otherwise we
can leave too many OSDs down and the cluster will never go clean.
2012-01-11 12:54:08 -08:00
Sage Weil
13445d237b ceph_manager: a booting osd is no longer automatically marked in
as of ceph.git commit 96b7b0d83e
2012-01-06 17:21:38 -08:00
Sage Weil
4b53288b0c ceph_manager: % 2011-11-19 20:56:49 -08:00
Sage Weil
89f80412c2 ceph_manager: fix logging 2011-11-17 13:46:02 -08:00
Josh Durgin
f4d527e743 thrashosds: timeout for every clean check, not just the last one 2011-11-17 11:11:33 -08:00
Josh Durgin
9d12b720e8 ceph_manager: add a default timeout of 5 minutes for mon quorum 2011-11-17 11:05:12 -08:00
Josh Durgin
cb9ac0897b ceph_manager: log mon quorum status so the logs show progress (or lack thereof) 2011-11-17 10:45:19 -08:00
Sage Weil
60863f70eb ceph_manager: manipulate monitors 2011-11-08 22:17:00 -08:00
Josh Durgin
006a0dd423 Remove unused imports and variable. 2011-11-08 16:09:21 -08:00
Josh Durgin
4f3b113832 ceph_manager: log ceph -s output so progress is visible in the logs 2011-11-03 13:27:44 -07:00
Sage Weil
b8beff3dd5 ceph_manager: count active+clean+<something else> as active+clean
In my case, one pg was active+clean+scrubbing.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-10-21 10:54:05 -07:00
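
A minimal sketch of the counting rule in the commit above, assuming PG states are '+'-joined strings as in the active+clean+scrubbing example. The function name and the list input are hypothetical.

```python
def num_active_clean(pg_states):
    """Count PGs whose state includes both 'active' and 'clean'."""
    count = 0
    for state in pg_states:
        flags = set(state.split('+'))
        # Extra flags such as 'scrubbing' do not disqualify the PG.
        if {'active', 'clean'} <= flags:
            count += 1
    return count
```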