tempnam() is considered a security risk because the filename it
generates is easy to guess and can be symlinked in advance. Use
mkstemp() instead.
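A minimal Python sketch of the safer pattern, assuming the caller only needs a scratch file; write_scratch_file and its prefix are hypothetical names used for illustration, not code from this patch:

```python
import os
import tempfile


def write_scratch_file(data, prefix='scratch-'):
    """Create a temp file safely and return its path (hypothetical helper)."""
    # mkstemp() creates the file exclusively and hands back an open
    # descriptor, so nobody can plant a symlink at the name beforehand.
    fd, path = tempfile.mkstemp(prefix=prefix)
    with os.fdopen(fd, 'w') as f:
        f.write(data)
    return path
```

The caller is responsible for removing the file when done.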
Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Joe Buck <jbbuck@gmail.com>
This reverts commit 67a616a979.
Sigh. As it turns out, /etc/default/grub being hacked also
causes the same problem. I think there's a way to fix that cleanly
as well, but until then, replacing the "accept installed version"
hack here so jobs can run.
This reverts commit 5995ae7e78.
With the changes to ceph-qa-chef and the teuthology kernel task,
we're no longer touching packaged file /etc/grub.d/10_linux, which
was the reason for this apt forcing. Remove so that we find other
package problems that might be masked by this; we can always
put it back if there are such problems until we can fix those as well.
Signed-off-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit c2b0828b19)
We had been writing 01_ceph_kernel with the kernel title, and
relying on the fact that grub.cfg would never have submenus in it
(implemented by a hack to /etc/grub.d/10_linux which neutered its
submenu creation). However, that hack was modifying a package file,
and got in the way of later apt commands. Rather than doing it
that way, this divines the title of the submenu and sets the
default variable to "submenu>kernel", which works to select the
desired kernel.
It depends on there being only one level of submenu, and on the
format of the menuentry and submenu commands, dictated by grub2.
None of this is likely to work at all outside Ubuntu.
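A rough sketch of the idea, assuming a single level of submenu and quoted titles on the menuentry and submenu lines; grub_default_for is a hypothetical name and the brace counting is deliberately naive, so this is an illustration rather than the code in the kernel task:

```python
import re

# Matches lines such as:  submenu 'Title' {   or   menuentry "Title" --class x {
ENTRY_RE = re.compile(r'''^\s*(submenu|menuentry)\s+(['"])(.*?)\2''')


def grub_default_for(cfg_text, kernel_title):
    """Return a value for grub's default variable, or None if not found."""
    submenu = None        # title of the submenu we are currently inside
    submenu_depth = None  # brace depth at which that submenu was opened
    depth = 0
    for line in cfg_text.splitlines():
        m = ENTRY_RE.match(line)
        if m:
            kind, title = m.group(1), m.group(3)
            if kind == 'submenu':
                submenu, submenu_depth = title, depth
            elif title == kernel_title:
                # Entries inside a submenu are addressed as "submenu>entry".
                return '%s>%s' % (submenu, title) if submenu else title
        depth += line.count('{') - line.count('}')
        if submenu is not None and depth <= submenu_depth:
            submenu = None  # the submenu block has been closed
    return None
```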
Fixes: #4496
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit 52aec32a7d)
The pg state could easily have changed in the meantime,
for example, from recovery_wait to recovering.
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
fa2049f caused an import cycle between lock.py and misc.py. Move the
needed functions from lock.py to lockstatus.py so that we can avoid the
import cycle.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
Nightlies run on teuthology currently use a testdir of
/home/ubuntu/cephtest, but this causes stale job errors occasionally
from the previous tests not getting properly cleaned up, which prevents
the nightlies from running successfully.
The misc.py get_testdir() function can specify a testdir that is
specific to the job, but previously the path was too long and would
cause separate job failures.
This patch does two things to resolve that. First, it uses the job id
from the teuthology run if one exists. This should be a relatively
short number that will identify the job run effectively. Second,
if the job id isn't available, it creates a shortened form of the
job's name, for example the job name:
teuthology-2013-04-09_23:51:49-rgw-next-testing-basic
becomes:
te1304092351rntb
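One way to reproduce the shortening shown above, reconstructed from the example alone (two letters of the tool name, a two-digit year, then month, day, hour, and minute, then the initial of each remaining dash-separated part); shorten_job_name is a hypothetical name and the real get_testdir() logic may differ:

```python
import re

JOB_NAME_RE = re.compile(
    r'^(?P<tool>[a-z]+)-(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
    r'_(?P<hour>\d{2}):(?P<minute>\d{2}):(?P<second>\d{2})-(?P<rest>.+)$')


def shorten_job_name(name):
    """Shorten a nightly job name; return it unchanged if it does not match."""
    m = JOB_NAME_RE.match(name)
    if m is None:
        return name
    # First letter of each remaining part, e.g. rgw-next-testing-basic -> rntb
    initials = ''.join(part[0] for part in m.group('rest').split('-') if part)
    return (m.group('tool')[:2] + m.group('year')[2:] + m.group('month') +
            m.group('day') + m.group('hour') + m.group('minute') + initials)


print(shorten_job_name('teuthology-2013-04-09_23:51:49-rgw-next-testing-basic'))
```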
Signed-off-by: Sam Lang <sam.lang@inktank.com>
This fixes issue #4677, which was caused by kdb output being
hard-coded to ttyS1. That is fine for all our hardware except mira
machines. This change checks whether 'mira' is in the host's
name and uses ttyS2 if so (simple fix).
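The check amounts to something like the following sketch; kdb_serial_console and the hostnames in the usage lines are made up for illustration:

```python
def kdb_serial_console(hostname):
    """Pick the serial console for kdb output based on the host's name."""
    # mira machines use ttyS2; everything else uses ttyS1.
    return 'ttyS2' if 'mira' in hostname else 'ttyS1'


print(kdb_serial_console('mira001.example.com'))
print(kdb_serial_console('plana001.example.com'))
```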
Resolves an issue where we
were not properly escaping the generated
public key when doing matches against it.
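A small sketch of why the escaping matters, assuming the match is done with Python's re module; key_in_text is a hypothetical helper. Base64 key data can contain '+', which a regex treats as a quantifier unless escaped:

```python
import re


def key_in_text(pubkey, text):
    """Return True if the literal public key string appears in text."""
    # re.escape() makes '+', '/', '.' and friends match themselves.
    return re.search(re.escape(pubkey), text) is not None
```

Without re.escape(), a key containing '+' would be interpreted as a pattern and could fail to match its own literal text.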
Signed-off-by: Joe Buck <jbbuck@gmail.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
Change apt commands to prevent prompts from coming up (forcing
non-interactive mode) so that packages like grub don't
break teuthology runs.
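For illustration, a non-interactive apt-get invocation can be assembled as below. The exact flags used by this patch are not shown here, so the ones in the sketch (DEBIAN_FRONTEND=noninteractive plus dpkg's --force-confdef and --force-confold) are an assumption, and apt_get_install_cmd is a hypothetical helper:

```python
def apt_get_install_cmd(packages):
    """Build an argv list that installs packages without prompting."""
    return [
        'sudo', 'env', 'DEBIAN_FRONTEND=noninteractive',
        'apt-get', '-y',
        # Keep existing config files instead of asking what to do with them.
        '-o', 'Dpkg::Options::=--force-confdef',
        '-o', 'Dpkg::Options::=--force-confold',
        'install',
    ] + list(packages)
```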
Signed-off-by: Sandon Van Ness <sandon@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Modify the Hadoop task to support branches
being specified for both the Apache and Inktank
Hadoop branches.
Signed-off-by: Joe Buck <jbbuck@gmail.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
Updated the ssh-keys task to clean up
any left-over keys from previous tasks
(indicated by the user being 'ssh-keys-user').
Also, some of the functions in the ssh_keys task seem
like they could be useful in general.
This patch refactors them into misc.py.
Signed-off-by: Joe Buck <jbbuck@gmail.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
Downburst create is used to reinstall a VM when it is locked.
Downburst destroy is used to remove a VM when it is unlocked.
Host keys are regenerated on each vm instantiation, so the keys
need to be checked prior to use.
If needed, ceph-qa-chef is run on newly installed systems to ensure that
they are fully functional.
Signed-off-by: Warren Usui <warren.usui@inktank.com>
If the lock request succeeds in updating the db, but the client gets a
timeout from apache, they can now try again and get back the machines
they just locked.
Only automatic runs have a description set when locking several
machines, so this does not affect users of teuthology-lock
--lock-many, where no description can be set in the same request.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Optional flag makes us skip pulling down the archive (mostly the logs, which
might be huge for some debugging tests) unless the test has failed.
Signed-off-by: Sage Weil <sage@inktank.com>
In cases where the mds thrasher continuously loops
waiting for an mds to be removed from the map, or
for a new mds to become active, we want to start logging
the mds state for debugging.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
Debug the osd op ordering by default. Most of the runs have a small number
of clients, which makes the STL maps cheap.
Signed-off-by: Sage Weil <sage@inktank.com>
Pass the desc to the lock operation.
The unlock operation now clears desc for us; no need to do it ourselves.
Signed-off-by: Sage Weil <sage@inktank.com>
Note whenever locks are acquired/released, or a machine's description is updated.
Under apache, these will go to error.log.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Verify there is no /var/lib/ceph, just like we do with the cephtest
directory. We will need to change this (or make it optional) when we
allow runs against an existing cluster, but a whole bunch of other
things will need to change then as well.
Signed-off-by: Sage Weil <sage@inktank.com>
This patch corrects an issue where a workunit task is
not cleaning up generated directories
if the 'all' key is used to specify clients.
Signed-off-by: Joe Buck <jbbuck@gmail.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
Don't use exit status info to track daemon state. We need to find
a better way to do this for the restart task.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
Do yum install rather than yum reinstall for CentOS.
When exiting CentOS, yum erase the ceph-release rpm.
Signed-off-by: Warren Usui <warren.usui@inktank.com>
The exitstatus on the process is a gevent.AsyncResult
(not an int). Use the try/except pattern for handling
errors instead.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
Test for the existence of /sys/fs/fuse/connections/*/abort
before clobbering it. The problem showed up when all
the machines were virtual CentOS machines.
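A local sketch of the guarded version, assuming the cleanup writes to each abort file; abort_fuse_connections is a hypothetical name and the base directory is a parameter only so the function can be exercised outside /sys:

```python
import glob
import os


def abort_fuse_connections(base='/sys/fs/fuse/connections'):
    """Abort every fuse connection that exists; return the paths written."""
    aborted = []
    # glob() returns an empty list when nothing matches, so hosts without
    # any fuse connections are simply skipped instead of raising an error.
    for path in sorted(glob.glob(os.path.join(base, '*', 'abort'))):
        with open(path, 'w') as f:
            f.write('1')
        aborted.append(path)
    return aborted
```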
Signed-off-by: Warren Usui <warren.usui@inktank.com>
The last command a restart script outputs is 'done'
indicating the script does not require being restarted
further. Handle this case properly.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
The ceph daemons support being killed at a specific code point
with a config option. In some cases, we want to test a kill point
only once for a given daemon run (such as replay that only occurs
during daemon startup). This task allows running a script or executable
and (when the script sends a command to the task) restarting it with
a temporary config that has the appropriate kill point set. Once
the daemon asserts and gets restarted, the original config is used.
Adds a specific restart_with_args() method to the DaemonState in the
ceph task.
Right now this task follows the workunit task closely, but uses stdout/stdin
to specify when to restart a daemon.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
install_packages, remove_packages and remove_sources are now the
installation and removal functions used by teuthology. Debian
references have been removed outside of tasks/install.py. CentOS
functionality parallel to Debian has been added to tasks/install.py,
and el6 references have been added to nuke.py, task/ceph-fuse.y and
task/install.py.
Some files created by CentOS are removed with rm -fr. This should
be changed once the installation/removal rpm procedure is implemented.
Signed-off-by: Warren Usui <warren.usui@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
We don't need to set up the ipmi console on runs that
don't use powercycling, so delay setup of the RemoteConsole
with ipmi to the thrashosd task and only then if the powercycle
config is set. This avoids spurious test failures from flaky
ipmi.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
If powercycling was requested for the osd thrasher
we should ensure that we are able to reach the
ipmi console. This helps us avoid weird errors.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
Email.py was added so that the emailto attribute could be passed,
and to prevent 'module object has no attribute: email' errors from
happening. Run.py actually performs the email operation and calls
suite.email_results to do the actual send mail operation. The
information passed right now is the summary and config information.
Signed-off-by: Warren Usui <warren.usui@inktank.com>
If the yaml has
wait-for-package: true
then block and poll for the packages to appear if they are not already
there. This is only useful for new branches or explicit sha1's, obviously.
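The block-and-poll behaviour can be sketched generically as follows; wait_for, its parameters, and the default interval are assumptions for illustration, and the check callable stands in for whatever test decides that the packages exist:

```python
import time


def wait_for(check, interval=60, timeout=None, sleep=time.sleep):
    """Poll check() until it returns True; return False if timeout passes."""
    waited = 0
    while not check():
        if timeout is not None and waited >= timeout:
            return False
        sleep(interval)
        waited += interval
    return True
```

With timeout left as None the call blocks until the packages show up, which matches the behaviour described above.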
Signed-off-by: Sage Weil <sage@inktank.com>
Since teuthology now uses debian packages,
we do not need to set this in workunit.
The one test that uses this now tests for
it locally.
Signed-off-by: Joe Buck <jbbuck@gmail.com>
Some tests require additional packages
(e.g., java bindings, hadoop bindings).
Extend the install task to allow for those
packages to be specified in the yaml files.
Signed-off-by: Joe Buck <jbbuck@gmail.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
The new monitor store does not create the data directory on --mkfs. We
must create it instead, much like what happens with the osds.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
We now add a new option 'thrash-many' that by being set to true will break
the default behaviour of killing only one monitor at a time. Instead,
this option will select up to the maximum number of killable monitors to
kill in each round.
We also add a new 'maintain-quorum' option that will limit the number of
monitors that can be killed in each thrashing round. If set to true, this
option will limit the number of killable monitors to (n/2-1). This
means that if we are running a configuration that only has up to two
configured monitors, if 'maintain-quorum' is set to true, this task won't
run as there are no killable monitors -- in such a scenario, this option
should be set to false.
Furthermore, if 'store-thrash' is set to true, then 'maintain-quorum' must
also be set to true, as we cannot let the task thrash all the monitor
stores, or we wouldn't be able to sync from other monitors, nor can we
let quorum be dropped, or we won't be able to resync our way into quorum.
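A sketch of how these options interact. It assumes n/2 is rounded up for odd n (so that a strict majority of monitors always stays alive), which is an interpretation of the (n/2-1) figure above; max_killable and check_config are hypothetical names:

```python
import math


def max_killable(num_mons, maintain_quorum):
    """Upper bound on monitors that may be killed in one thrashing round."""
    if not maintain_quorum:
        return num_mons
    # A quorum needs floor(n/2) + 1 monitors, leaving ceil(n/2) - 1 to kill.
    return max(0, int(math.ceil(num_mons / 2.0)) - 1)


def check_config(config):
    """Reject 'store-thrash' without 'maintain-quorum'."""
    if config.get('store-thrash') and not config.get('maintain-quorum'):
        raise ValueError("'store-thrash' requires 'maintain-quorum: true'")


for n in (1, 2, 3, 5):
    print(n, max_killable(n, maintain_quorum=True))
```

With one or two monitors the bound is zero, which is the case described above where the task has nothing it can kill.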
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
This patch introduces an option to thrash a monitor store when we thrash
the monitors, as well as a 'store-thrash-probability' option (defaulting
to 50%).
We also took this opportunity to introduce a new 'seed' option, that ought
to allow a given run of this task to be reproducible. This might come in
handy when attempting to reproduce a given behavior that would otherwise
be randomly triggered.
You should note that while the 'seed' option will indeed mimic past
behaviors, this only applies to a past behavior of this task: other tasks
are not affected by this value, nor are any workunits or even ceph daemons.
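The reproducibility comes from driving every random decision from one seeded generator. A minimal sketch, assuming a dedicated random.Random instance; make_thrash_plan and the monitor names are hypothetical:

```python
import random


def make_thrash_plan(mons, rounds, seed, store_thrash_probability=0.5):
    """Return a list of (monitor, thrash_store) decisions for each round."""
    rng = random.Random(seed)  # private generator, unaffected by other code
    plan = []
    for _ in range(rounds):
        mon = rng.choice(mons)
        thrash_store = rng.random() < store_thrash_probability
        plan.append((mon, thrash_store))
    return plan
```

Two calls with the same seed and arguments yield the same plan, while anything outside this generator (other tasks, workunits, daemons) remains as random as before.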
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
- call apt separately for each package; it will error out annoyingly if
there is one in the list not in the APT sources.
- use dpkg with appropriate force to clean up broken half-installs.
Signed-off-by: Sage Weil <sage@inktank.com>
Otherwise we get stuck in a loop if an osd crashes unexpectedly, the
task never fails, and we don't collect all the evidence.
Signed-off-by: Sage Weil <sage@inktank.com>
Some command-line tools need to reference the path
to the test directory, which is created at run-time.
We export this as TESTDIR.
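As a local illustration of the mechanism (teuthology itself runs commands on remote hosts, so this only shows the environment-variable part), with run_with_testdir as a hypothetical helper:

```python
import os
import subprocess


def run_with_testdir(args, testdir):
    """Run a command with TESTDIR set in its environment; return its stdout."""
    env = dict(os.environ, TESTDIR=testdir)
    return subprocess.check_output(args, env=env).decode()
```

A tool started this way can read the test directory from $TESTDIR instead of having the path hard-coded.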
Signed-off-by: Joe Buck <jbbuck@gmail.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
We need to switch around how these are compressed and pulled, since they
aren't in the regular archive dir anymore.
Signed-off-by: Sage Weil <sage@inktank.com>
This required reordering the cluster setup so that we do the ceph-osd
--mkfs --mkkey prior to gathering keys and initializing the monitors.
Also, run daemons as root.
Signed-off-by: Sage Weil <sage@inktank.com>
Installing debs means we are more likely to hit a case where we interrupt
apt/dpkg. Try to mop up as best we can in nuke.
Signed-off-by: Sage Weil <sage@inktank.com>
apt-get doesn't have a nice way to tell if the package is not installed and
we don't need to purge it. Well, not one I found in 5 minutes. Just
do a big purge and assume it works, or failed because there was nothing to
be done.
Signed-off-by: Sage Weil <sage@inktank.com>
The ceph task installs ceph using the debian
packages now, and all invocations of binaries installed
in {tmpdir}/binary/usr/local/bin/ are replaced with
the use of the binaries installed in standard locations
by the debs.
Author: Sander Pool <sander.pool@inktank.com>
Signed-off-by: Sam Lang <sam.lang@inktank.com>
Added the ability to support multiple types of machines with
--machine-type added to teuthology-lock when used with --lock-many
or --machine-type with teuthology --lock (automated tests). It
defaults to 'plana' and the 'vps' type is currently unused but
should be in the future.
Also updated teuthology-lock --summary to be machine type aware
and sort things in a nice output.
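The defaulting behaviour can be pictured with a small argparse sketch. This is not the real teuthology-lock parser; build_parser, the program name, and treating --lock-many as an integer count are assumptions:

```python
import argparse


def build_parser():
    """Build a toy parser that mimics the --machine-type default."""
    parser = argparse.ArgumentParser(prog='lock-sketch')
    parser.add_argument('--lock-many', type=int, metavar='N',
                        help='number of machines to lock')
    parser.add_argument('--machine-type', default='plana',
                        help="type of machine to lock (default: 'plana')")
    return parser


args = build_parser().parse_args(['--lock-many', '2'])
print(args.machine_type)
```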
Signed-off-by: Sandon Van Ness <sandon@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Added the ability to support multiple types of machines with
--machine-type added to teuthology-lock when used with --lock-many
or --machine-type with teuthology --lock (automated tests). It
defaults to 'plana' and the 'vps' type is currently unused but
should be in the future.
Signed-off-by: Sandon Van Ness <sandon@van-ness.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Also fix up the template to use {{field}} for stuff we don't want to parse.
There is probably a better way...
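Assuming the template is filled in with Python's str.format(), doubled braces are how a literal brace pair survives formatting. The template text and field names below are invented for illustration:

```python
# {name} is substituted; {{field}} is left for a later consumer as {field}.
template = 'name: {name}\nliteral: {{field}}\n'
rendered = template.format(name='example')
print(rendered)
```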
Signed-off-by: Sage Weil <sage@inktank.com>
This can cause issues when unmount hangs. Our automatic runs reboot
everything unconditionally, so this caused a bunch of unnecessary hangs
when an fs was accidentally rendered un-unmountable.
When nodes are rebooted, the connections remain open
even after calling reconnect and setting up new ssh
sessions to the rebooted nodes. This causes ECONNRESET
errors to show up in the teuthology output.
Close the existing connections before trying to reconnect.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
kill_mon is getting a config set to None, which blows
up now due to the check for powercycle. Initialize
the config to an empty dict if we don't get anything
on init. This is the error showing up in teuthology:
2013-02-04T15:04:16.595 ERROR:teuthology.run_tasks:Manager failed: <contextlib.GeneratorContextManager object at 0x1fcafd0>
Traceback (most recent call last):
  File "/var/lib/teuthworker/teuthology-master/teuthology/run_tasks.py", line 45, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/var/lib/teuthworker/teuthology-master/teuthology/task/mon_thrash.py", line 142, in task
    thrash_proc.do_join()
  File "/var/lib/teuthworker/teuthology-master/teuthology/task/mon_thrash.py", line 69, in do_join
    self.thread.get()
  File "/var/lib/teuthworker/teuthology-master/virtualenv/local/lib/python2.7/site-packages/gevent/greenlet.py", line 308, in get
    raise self._exception
AttributeError: 'NoneType' object has no attribute 'get'
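The fix boils down to the usual default-to-empty-dict pattern. A stripped-down sketch, where Manager and wants_powercycle are placeholder names rather than the real classes:

```python
class Manager(object):
    """Toy stand-in for a manager object that receives an optional config."""

    def __init__(self, config=None):
        # Fall back to an empty dict so later lookups never hit None.
        self.config = config if config is not None else {}

    def wants_powercycle(self):
        return bool(self.config.get('powercycle'))


print(Manager(None).wants_powercycle())
```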
Signed-off-by: Sam Lang <sam.lang@inktank.com>
Nuke will clean up the base test directory by default, but can
clean up the test directory for a given run if specified.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
I think this is what is going on...
Traceback (most recent call last):
  File "/var/lib/teuthworker/teuthology-master/teuthology/contextutil.py", line 27, in nested
    yield vars
  File "/var/lib/teuthworker/teuthology-master/teuthology/task/ceph.py", line 1158, in task
    yield
  File "/var/lib/teuthworker/teuthology-master/teuthology/run_tasks.py", line 25, in run_tasks
    manager = _run_one_task(taskname, ctx=ctx, config=config)
  File "/var/lib/teuthworker/teuthology-master/teuthology/run_tasks.py", line 14, in _run_one_task
    return fn(**kwargs)
  File "/var/lib/teuthworker/teuthology-master/teuthology/task/dump_stuck.py", line 93, in task
    manager.kill_osd(id_)
  File "/var/lib/teuthworker/teuthology-master/teuthology/task/ceph_manager.py", line 665, in kill_osd
    if 'powercycle' in self.config and self.config['powercycle']:
TypeError: argument of type 'NoneType' is not iterable