480 Commits

Author SHA1 Message Date
Sage Weil
af4ce44233 ceph: use any fs, not just btrfs, on scratch devices
The

  btrfs: true

syntax is replaced with

  fs: btrfs

or ext4, xfs.
2012-02-13 15:28:24 -08:00
Sage Weil
975d73a2bb nuke: nuke testrados and rados processes, too
So that -r is needed slightly less often.
2012-02-13 15:28:24 -08:00
Sage Weil
46b612efa4 misc: make get_scratch_devices look for (almost) any disk that's not mounted 2012-02-13 15:28:24 -08:00
Sage Weil
2adad559bd hammer.sh: assume path is set 2012-02-11 14:19:49 -08:00
Josh Durgin
0cd16cf03d ceph: always add logger for daemons
The extra log function added redundant info and didn't allow different
levels.
2012-02-02 09:36:04 -08:00
Josh Durgin
7af7c66bd0 ceph: rename type parameter to type_
type is a built-in and shouldn't be aliased.
2012-02-02 09:35:58 -08:00
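A minimal sketch of why the rename matters. The helper `describe_daemon` and its parameters are hypothetical, not taken from the teuthology source; the point is only that a parameter named `type_` leaves the built-in `type()` usable inside the function.

```python
def describe_daemon(type_, id_):
    """Hypothetical helper: format a daemon name such as 'osd.0'."""
    # Because the parameter is called type_ rather than type, the
    # built-in type() is still reachable here.
    return "{kind}.{id} ({cls})".format(
        kind=type_, id=id_, cls=type(id_).__name__
    )


label = describe_daemon("osd", 0)  # 'osd.0 (int)'
```

Had the parameter been named `type`, the call `type(id_)` would have tried to call the string `"osd"` and raised a `TypeError`.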
Josh Durgin
7146db9215 ceph: use the correct comparison operator
'is' compares identity (i.e. address in CPython), not value.
2012-02-02 09:27:04 -08:00
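The difference between the two operators can be shown with two lists that hold the same values but are distinct objects. The variable names are illustrative only and do not come from the commit.

```python
first = ["osd", "mon"]
second = list(first)  # a copy: equal contents, different object

values_match = first == second  # True: compares contents
same_object = first is second   # False: compares object identity
```

Value comparisons should therefore use `==`; `is` is best reserved for singletons such as `None`.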
Josh Durgin
e7672b6433 ceph: sync before unmounting btrfs devices
There may still be writes in flight, since the osds may not have
shut down cleanly. This should prevent EBUSY when unmounting.

Fixes: #1997
2012-02-02 09:26:45 -08:00
Josh Durgin
1364b8826f ceph: delay raising exceptions until all daemons are stopped
If a daemon crashes, the exception is raised when we stop it. This
caused some daemons to continue running during cleanup, since the rest
of the daemons of the same type would not be shut down. Also log each
daemon that crashed, for easier debugging.

Fixes: #1744
2012-02-02 09:26:25 -08:00
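The pattern described in this commit (stop everything first, raise afterwards) can be sketched roughly as follows. This is not the teuthology implementation: `FakeDaemon`, `stop_all`, and the `log` callback are hypothetical names used to illustrate the idea.

```python
class FakeDaemon:
    """Hypothetical stand-in for a daemon handle; stop() raises if it crashed."""

    def __init__(self, name, crashed=False):
        self.name = name
        self.crashed = crashed
        self.stopped = False

    def stop(self):
        self.stopped = True
        if self.crashed:
            raise RuntimeError("%s crashed" % self.name)


def stop_all(daemons, log=print):
    """Stop every daemon, log each failure, then raise the first one."""
    failures = []
    for daemon in daemons:
        try:
            daemon.stop()
        except Exception as exc:
            # Record and keep going so the remaining daemons are stopped too.
            log("daemon %s failed: %s" % (daemon.name, exc))
            failures.append(exc)
    if failures:
        raise failures[0]
```

Collecting the exceptions rather than raising inside the loop is what keeps a single crashed daemon from leaving its siblings running during cleanup.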
Sage Weil
0236dc0f5e add backfill task
This does a basic test of backfill functionality, including a divergent
log on a backfill target (#1983).
2012-01-31 16:25:53 -08:00
Sage Weil
e337c4727c ceph_manager: add manager.blackhole_kill_osd()
This will suspend disk writes for a couple seconds and then kill the
daemon.  It helps us simulate a hardware failure.
2012-01-31 16:13:59 -08:00
Tommi Virtanen
d7be77628c Allow user to disable lock checking.
The new plana hardware isn't in the old sepia lock database,
and the machine pools are risky to merge as nothing in the
software guarantees allocation from just one pool. This allows
us to hand-allocate machines temporarily.
2012-01-31 08:05:36 -08:00
Tommi Virtanen
09bed16408 Allow user to provide flavor to use.
With this, you can use Ubuntu 11.10 machines with teuthology by saying::

  tasks:
  - ceph:
      flavor: oneiric
  ...
2012-01-31 07:59:43 -08:00
Josh Durgin
f84b4aa5e3 Add admin socket task.
This simply gets the output of an admin socket command, makes sure
it's json, and runs a user-provided test script on it.
2012-01-27 17:13:36 -08:00
Samuel Just
4aa9ca4551 CephManager: base timeout on time since last change in active+clean
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
2012-01-24 11:28:38 -08:00
Josh Durgin
29885f3e42 kernel: ignore connection problems while waiting for reboot 2012-01-18 17:49:05 -08:00
Sage Weil
45e4c924fa thrashosds: maxdead default to 0
This avoids any possibility of blocking peering.
2012-01-17 09:24:54 -08:00
Sage Weil
bf22a4fb92 task/rados: use new usage for radosmodel tool 2012-01-16 16:53:55 -08:00
Sage Weil
71390f9784 thrashosds: fix action selection
I'm not sure what the old code was trying to do, but I'm pretty sure it
wasn't doing it correctly.. a .1 chance_down was killing an OSD for me
virtually every time.
2012-01-16 15:05:43 -08:00
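The commit does not show how the selection was fixed. As an illustration only, one common way to pick an action so that a weight of .1 is chosen about 10% of the time is to normalize the weights and compare against a single random roll. The function `choose_action` and the action names below are assumptions, not the thrasher's actual code.

```python
import random


def choose_action(actions, roll=None):
    """Pick a name from (name, weight) pairs in proportion to the weights.

    `roll` is a number in [0, 1); when omitted, one is drawn at random.
    """
    if roll is None:
        roll = random.random()
    total = sum(weight for _, weight in actions)
    threshold = roll * total
    for name, weight in actions:
        if threshold < weight:
            return name
        threshold -= weight
    return actions[-1][0]  # guard against floating-point rounding
```

With `[("kill_osd", 0.1), ("revive_osd", 0.9)]`, a roll below 0.1 selects `kill_osd` and any other roll selects `revive_osd`.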
Sage Weil
8fc6086986 thrashosds: make actions less nonsensical
Make marking OSD up/down and in/out totally orthogonal.

Signed-off-by: Sage Weil <sage@newdream.net>
2012-01-16 15:05:43 -08:00
Sage Weil
9419f583c6 ls: include duration, less noise 2012-01-16 13:18:49 -08:00
Sage Weil
c5bbfffa05 hammer.sh: new -nuke syntax 2012-01-16 13:18:31 -08:00
Sage Weil
8fb115fe2c include run duration in summary.yaml 2012-01-16 12:39:20 -08:00
Sage Weil
7b47e49fa8 ls: fix extraneous newline 2012-01-16 10:47:44 -08:00
Sage Weil
b58f9560ea ceph: ignore all leaks
unless/until we figure out where the DefinitelyLost records are coming
from.. at first glance they look bogus.
2012-01-16 09:55:47 -08:00
Sage Weil
40fb86ff81 ceph: take single arg or list for valgrind args 2012-01-16 09:22:45 -08:00
Sage Weil
c88ec5719e combined mon, osd, mds starter functions 2012-01-15 22:54:09 -08:00
Sage Weil
f8ec23e79d rbd: default to all: 2012-01-15 22:53:39 -08:00
Sage Weil
72057a9cd8 use local mirrors for (most) github urls
A cronjob on ceph.newdream.net updates these every 15 minutes.  Sigh.
2012-01-15 22:52:58 -08:00
Sage Weil
fbfa94bb09 teuthology-ls: show pid, last line of output for running jobs 2012-01-15 22:52:58 -08:00
Sage Weil
f70b158cd1 show host -> roles mapping on startup
Less guessing when manually inspecting an in-progress or hung run.
2012-01-15 22:52:58 -08:00
Sage Weil
f795261454 lost_unfound: make test work with backfill
If we backfill, we fail to peer instead of having every object show up as
'unfound'.  Avoid that by preventing log trimming, so that we always do
log recovery for this test.
2012-01-15 22:52:58 -08:00
Tommi Virtanen
3bfa41cf6a Use yaml.safe_dump so unicode doesn't mess up the yaml files.
In general, yaml.dump is comparable to pickle, and my personal
coding standard says *never* use it. yaml.safe_dump is much nicer.
yaml.dump should have been named yaml.unsafe_dump, yaml.safe_dump
should have been named yaml.dump :(
2012-01-13 11:26:36 -08:00
Josh Durgin
0da44591a9 nuke: take config files from -t argument
teuthology-lock and teuthology-updatekeys both use -t for this already
2012-01-12 14:48:36 -08:00
Josh Durgin
96e89d30ec kernel: loop reconnecting in case we race with shutdown
Previously, if we reconnected before shutdown completed we asserted
that the kernel did not boot into the new version, when we just needed
to wait for the machine to reboot.
2012-01-12 13:02:22 -08:00
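A rough sketch of the retry idea, under stated assumptions: `get_version` is a hypothetical callable that reconnects to the machine and returns its running kernel version, raising `OSError` while the machine is unreachable. None of these names come from the kernel task itself.

```python
import time


def wait_for_new_kernel(get_version, wanted, attempts=30, delay=1.0,
                        sleep=time.sleep):
    """Poll until the machine reports the wanted kernel, or give up."""
    for _ in range(attempts):
        try:
            if get_version() == wanted:
                return True
            # Still the old kernel: we probably reconnected before the
            # shutdown finished, so keep waiting rather than failing.
        except OSError:
            # Connection refused or dropped: the machine is mid-reboot.
            pass
        sleep(delay)
    return False
```

Seeing the old version is treated the same as a failed connection: both mean the reboot has not completed yet, so neither is grounds for an assertion failure on its own.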
Sage Weil
59369237c9 thrasher: don't mark down osds out; tell monitor same
Stopping ceph-osd doesn't make it out (immediately).  Prevent monitor
from doing this after a delay too so we can keep our notion of what is
up/down/in/out accurate.
2012-01-11 12:54:09 -08:00
Sage Weil
3c0346b4cb lost_unfound: typo 2012-01-11 12:54:09 -08:00
Sage Weil
6dae2f8ae3 thrasher: adjust min_dead default
Make this 1, not 2.  That's a bit more friendly.  It doesn't strictly
matter, tho, since we revive osds before waiting for clean.
2012-01-11 12:54:09 -08:00
Sage Weil
fb74b90152 thrasher: add max_dead
Add max_dead, and revive osds prior to waiting for clean.  Otherwise we
can leave too many OSDs down and the cluster will never go clean.
2012-01-11 12:54:08 -08:00
Sage Weil
50463ffddd verify all osds start before checking health
Just checking health isn't good enough, since it races with OSD startup:
we can have a healthy cluster with 0 (or something else < total) OSDs.
2012-01-11 12:54:08 -08:00
Josh Durgin
f4883ebf09 ceph: let the user running ceph-osd remove subvolumes
This will prevent EPERM when using the SNAP_DESTROY ioctl,
so the filestore will use btrfs snaps.
2012-01-10 16:07:04 -08:00
Josh Durgin
d2fadf9fe2 syslog: ignore lockdep non-static key warning
It looks like this warning was made default in linux 3.2.
This will keep happening until #1922 is done.
2012-01-10 15:28:42 -08:00
Sage Weil
b354ce4e91 run: put pid in archive dir
This will make it easy for teuthology-ls to show you the running process's
pid (if it's still running).  Or for other utilities to kill + clean up
a hung teuthology run.
2012-01-08 14:39:30 -08:00
Sage Weil
13445d237b ceph_manager: a booting osd is no longer automatically marked in
as of ceph.git commit 96b7b0d83e
2012-01-06 17:21:38 -08:00
Sage Weil
001701a0f7 mon_recovery: need n/2 + 1 monitors for quorum 2012-01-06 15:12:15 -08:00
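The quorum rule in this commit title is a strict majority. A small worked example, with `quorum_size` as a hypothetical helper name and the division taken as integer division:

```python
def quorum_size(num_monitors):
    """Smallest number of monitors that forms a strict majority."""
    return num_monitors // 2 + 1


sizes = {n: quorum_size(n) for n in (1, 3, 4, 5)}
# {1: 1, 3: 2, 4: 3, 5: 3}
```

Note that four monitors need three for quorum, the same as five, so an even monitor count tolerates no more failures than the odd count below it.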
Sage Weil
da9210779e ceph: don't skip monitor ports
We can use the same port multiple times if they are on different hosts.
2012-01-06 13:36:54 -08:00
Josh Durgin
561f06cf94 suite: make email-on-success the default behavior
This way you can tell when a run is complete, instead of wondering if
it's stuck in the queue.
2012-01-05 17:27:31 -08:00
Josh Durgin
ec3a3a9654 rados: fix example config 2012-01-03 14:07:45 -08:00
Josh Durgin
cdd5c456a0 nuke-on-error: only unlock if this run locked the machines 2012-01-03 13:02:31 -08:00
Josh Durgin
0176c9ab0f Remove unused mon.0 variables. 2012-01-03 13:02:31 -08:00