Make minor adjustments to ceph_manager.CephManager so that the methods
run_ceph_w(), run_cluster_cmd(), raw_cluster_cmd() and
raw_cluster_cmd_result() can be reused in subclasses instead of being
duplicated. The adjustments are:
* Having variables contain arguments that'll be prepended to every
command received by the methods above.
* Grouping the variables that need to be overridden together so that
they are easy for users to spot and override.
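A minimal sketch of the two adjustments, using stand-in classes rather
than the real CephManager and LocalCephManager. The attribute names
`CMD_PREFIX` and `CEPH_CMD` and the helper `build_cmd()` are hypothetical
and chosen only for illustration:

```python
class BaseManager:
    """Stand-in for ceph_manager.CephManager (names are illustrative)."""

    # Variables meant to be overridden are grouped together here.
    CMD_PREFIX = ['sudo', 'adjust-ulimits']  # prepended to every command
    CEPH_CMD = ['ceph']

    def build_cmd(self, args):
        # Every command received gets the same prefix prepended.
        return self.CMD_PREFIX + self.CEPH_CMD + list(args)


class LocalManager(BaseManager):
    """Stand-in for a subclass such as LocalCephManager."""

    # Only the grouped variables change; build_cmd() is inherited as-is.
    CMD_PREFIX = []
    CEPH_CMD = ['./bin/ceph']
```

With this layout the subclass reuses the command-building logic and
overrides nothing but data.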
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Instead prepend "exec sudo" to the command arguments of
LocalCephManager.run_ceph_w(). This makes the default parameter
"shell=False" redundant in case of
ceph_manager.CephManager.run_ceph_w(), so get rid of it too and update
calls to run_ceph_w() accordingly.
The reason behind using any of these workarounds is that running "ceph
-w" with "shell" set to True leads to a crash in the Ceph API CI job. See
this ticket for more details: https://tracker.ceph.com/issues/49644.
The reason behind switching the workaround is that in the following
commits, which reduce duplication, LocalCephManager.run_ceph_w() will be
deleted and CephManager.run_ceph_w() will be used by LocalCephManager
via inheritance. However, due to the issue described above, the Ceph API
test would fail since "shell" is set to "True" for the command issued by
CephManager.run_ceph_w(). Prepending "exec sudo" to the command when it
is used in LocalCephManager makes this duplication unnecessary and also
prevents the Ceph API test from failing.
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Save the return value of method "teuthology.get_testdir()" instead of
calling it repeatedly in the same class.
Signed-off-by: Rishabh Dave <ridave@redhat.com>
mon_tick_interval is 5 seconds by default. monitors update their
rotating keys every mon_tick_interval. before monitors form a
quorum, the auth requests from clients are put into the wait list.
these requests are re-enqueued once the monitors form a quorum. but
there is a small window of mon_tick_interval in which they are not
yet able to serve the auth requests, even after they claim to be able
to serve requests. if these re-enqueued requests happen to be served
in this window, and if authx is enabled, they will be greeted with
errors like
handle_auth_bad_method server allowed_methods [2] but i only support [2]
in the case of ceph cli, the error would look like:
[errno 13] RADOS permission denied (error connecting to the cluster)
so, to address this issue, the EACCES error is ignored when waiting
for a quorum.
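a minimal sketch of the retry logic, not the actual change: the
exception class `CommandFailed`, the helper `wait_for_quorum()` and its
parameters are hypothetical stand-ins. it assumes the failing command
reports errno 13 (EACCES) as its exit status, as in the cli error above:

```python
import errno
import time


class CommandFailed(Exception):
    """Hypothetical stand-in for a failed-command exception."""

    def __init__(self, exitstatus):
        super().__init__("exit status %d" % exitstatus)
        self.exitstatus = exitstatus


def wait_for_quorum(run_quorum_status, attempts=5, delay=0.0):
    """Call run_quorum_status() until it succeeds, ignoring EACCES."""
    for _ in range(attempts):
        try:
            return run_quorum_status()
        except CommandFailed as e:
            # EACCES may just mean the rotating keys are not updated yet,
            # so retry; any other failure is propagated to the caller.
            if e.exitstatus != errno.EACCES:
                raise
            time.sleep(delay)
    raise RuntimeError("no quorum after %d attempts" % attempts)
```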
Signed-off-by: Kefu Chai <kchai@redhat.com>
This assumes that k8s is installed and kubectl works.
The ceph container to use is selected the same way the cephadm
task does it.
All scratch devices are consumed as OSDs.
A ceph.conf and client.admin keyring are deployed on all test
nodes, so normal tasks should work (if/when packages are installed).
Fixes: https://tracker.ceph.com/issues/47507
Signed-off-by: Sage Weil <sage@newdream.net>
* refs/pull/38443/head:
qa: set "shell" to False for run_ceph_w()
vstart_runner: make "shell" a default argument
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Setting shell to True in call to run() in LocalCephManager.run_ceph_w()
leads to a crash when self.subproc.communicate() is executed for the
process created by running "ceph -w".
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Modify CephManager.run_cluster_cmd() to accept command arguments as a
string as well, since typing commands as strings takes much less effort
than typing them as lists. This brings the interface a step closer to
teuthology.orchestra.remote.run()'s interface, since it too can accept
command arguments as a string.
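One way to accept both forms, shown as a sketch only. The helper name
`normalize_args()` is hypothetical, and splitting with shlex is an
assumption; the actual change may instead pass the string through
unchanged:

```python
import shlex


def normalize_args(args):
    """Return command arguments as a list, whether given as str or list."""
    if isinstance(args, str):
        # shlex.split() honours shell-style quoting inside the string.
        return shlex.split(args)
    return list(args)
```

Both `normalize_args('osd pool ls')` and
`normalize_args(['osd', 'pool', 'ls'])` yield the same list.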
The change in cephfs_test_case.py is just to allow testing this PR
locally and on teuthology.
Signed-off-by: Rishabh Dave <ridave@redhat.com>
The use of chdir will muck up the use of nsenter with valgrind:
2021-03-03T02:13:49.897 DEBUG:teuthology.orchestra.run.smithi144:> sudo nsenter --net=/var/run/netns/ceph-ns--home-ubuntu-cephtest-mnt.0 cd /home/ubuntu/cephtest && sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper term env 'OPENSSL_ia32cap=~0x1000000000000000' valgrind --trace-children=no --child-silent-after-fork=yes '--soname-synonyms=somalloc=*tcmalloc*' --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/client.0.log --time-stamp=yes --vgdb=yes --exit-on-first-error=yes --error-exitcode=42 --tool=memcheck --leak-check=full --show-reachable=yes ceph-fuse -f --admin-socket '/var/run/ceph/$cluster-$name.$pid.asok' --id 0 /home/ubuntu/cephtest/mnt.0
2021-03-03T02:13:49.899 DEBUG:teuthology.orchestra.run.smithi144:> sudo modprobe fuse
2021-03-03T02:13:49.914 INFO:teuthology.orchestra.run:Running command with timeout 30
2021-03-03T02:13:49.914 DEBUG:teuthology.orchestra.run.smithi144:> sudo mount -t fusectl /sys/fs/fuse/connections /sys/fs/fuse/connections
2021-03-03T02:13:49.919 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi144.stderr:nsenter: failed to execute cd: No such file or directory
It's not necessary to chdir at all to do the mount, so don't.
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
This method is unused in the teuthology repo. The helper method better
belongs here where it is more easily modified.
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
In the methods raw_cluster_cmd_result() of CephManager and
LocalCephManager, and raw_cluster_cmd() of LocalCephManager, when the
command is passed as a keyword argument instead of as positional
arguments, the methods run the ceph command with no arguments. This is
because the methods do
"kwargs['args'] = args" unconditionally.
Fixes: https://tracker.ceph.com/issues/49486
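A simplified reproduction of the bug described above, using free
functions as stand-ins for the real methods. `run_cluster_cmd()` here is
a stub that only returns the command it would run, and the guarded
assignment is one plausible fix, not necessarily the exact one applied:

```python
def run_cluster_cmd(**kwargs):
    # Stub: return the command line that would be executed.
    return ['ceph'] + list(kwargs['args'])


def raw_cluster_cmd_buggy(*args, **kwargs):
    # Unconditional assignment clobbers args passed by keyword
    # with the empty positional tuple.
    kwargs['args'] = args
    return run_cluster_cmd(**kwargs)


def raw_cluster_cmd_fixed(*args, **kwargs):
    # Only overwrite when positional arguments were actually given.
    if args:
        kwargs['args'] = args
    return run_cluster_cmd(**kwargs)
```

Calling `raw_cluster_cmd_buggy(args=['status'])` drops the `status`
argument, while the guarded version keeps it.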
Signed-off-by: Rishabh Dave <ridave@redhat.com>
In CephManager.raw_cluster_cmd(), pass only kwargs to run_cluster_cmd()
instead of both args and kwargs since passing both will lead to
"TypeError: got multiple values".
Fixes: https://tracker.ceph.com/issues/49495
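A sketch of how the TypeError above can arise. The functions are
simplified stand-ins, and the signature of `run_cluster_cmd()` is an
assumption made so that the example runs on its own:

```python
def run_cluster_cmd(args, check_status=True):
    # Stub: return the arguments that would be passed to ceph.
    return list(args)


def raw_cluster_cmd_buggy(*args, **kwargs):
    kwargs['args'] = args
    # 'args' is now supplied twice: positionally and by keyword.
    return run_cluster_cmd(*args, **kwargs)


def raw_cluster_cmd_fixed(*args, **kwargs):
    kwargs['args'] = args
    # Pass only kwargs, so 'args' is supplied exactly once.
    return run_cluster_cmd(**kwargs)
```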
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Since the tracking log is now always enabled in the early phase, there's
no need to keep the "log_early" option here, so remove it.
Suggested-by: Kefu Chai <kefu@redhat.com>
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
no need to check for their existence or prepare a replacement,
because we've migrated to python3 and only support python3.6 and up.
Signed-off-by: Kefu Chai <kchai@redhat.com>
This new method should allow better control on the process launched by
the passed command. This is achieved by allowing arguments provided by
teuthology.orchestra.run.run().
Signed-off-by: Rishabh Dave <ridave@redhat.com>
* refs/pull/35522/head:
vstart_runner: set default values of stdout and stderr to None
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Not doing so leads to tests that run successfully with vstart_runner.py
but crash when triggered with teuthology, since the default value of
these variables there is None.
Fixes: https://tracker.ceph.com/issues/45815
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Add helpers that dump information only about PGs that haven't reached
the desired state when we fail. Previously we dumped the output of
"ceph pg dump" before failing, which prints a lot of unnecessary information
about PGs that are not responsible for the failure, making debugging harder.
Also, try to make the failure messages distinct.
Signed-off-by: Neha Ojha <nojha@redhat.com>
as the caller might want to `len(manager.get_osd_status()['raw'])`, and
`len()` does not accept a `filter` object.
also, the filtered osd statuses are printed out using `self.log()`, so
we should materialize the `filter` object before sending it to the
logging facility. otherwise we will have something like:
```
2020-04-08T02:58:37.001 INFO:tasks.ceph.ceph_manager.ceph:<filter object at 0x7f5a080e1518>
```
in the logging message.
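a small python3 illustration of both problems. the osd records below are
made-up sample data, not the real structure returned by
`get_osd_status()`:

```python
# made-up sample data standing in for parsed osd statuses
osds = [
    {'osd': 0, 'in': 1},
    {'osd': 1, 'in': 0},
    {'osd': 2, 'in': 1},
]

# in python3, filter() returns a lazy iterator, not a list
lazy = filter(lambda o: o['in'] == 1, osds)
try:
    len(lazy)
except TypeError as e:
    len_error = str(e)  # a filter object has no len()

# materializing it gives something len() and logging can handle
in_osds = list(filter(lambda o: o['in'] == 1, osds))
print(len(in_osds), in_osds)
```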
Signed-off-by: Kefu Chai <kchai@redhat.com>
in python2, dict.values() and dict.keys() return lists. but in python3,
they return views, which cannot be indexed directly using an integer index.
there are four use cases when we access these views in python3:
1. get the first element
2. get all the elements and then *might* want to access them by index
3. get the first element assuming there is only a single element in
the view
4. iterate thru the view
in the 1st case, we cannot assume the number of elements, so to be
python3 compatible, we should use `next(iter(a_dict))` instead.
in the 2nd case, in this change, the view is materialized using
`list(a_dict)`.
in the 3rd case, we can just continue using the short hand of
```py
(first_element,) = a_dict.keys()
```
to unpack the view. this works in both python2 and python3.
in the 4th case, the existing code works in both python2 and python3, as
both list and view can be iterated using `iter`, and `len` works as
well.
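the 1st, 2nd and 4th cases can be sketched as follows. the dict and its
keys are made-up sample data:

```python
a_dict = {'mon.a': 1, 'mon.b': 2}  # made-up sample data

# 1st case: first element, without assuming how many there are
first_key = next(iter(a_dict))
first_value = next(iter(a_dict.values()))

# 2nd case: materialize the view so it can be indexed
keys = list(a_dict)
second_key = keys[1]

# 4th case: iterating and len() work on views directly
total = sum(v for v in a_dict.values())
count = len(a_dict.keys())
```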
Signed-off-by: Kefu Chai <kchai@redhat.com>
there are a couple of factors we should consider when choosing between
BytesIO and StringIO:
- if the producer is producing binary
- if we are expecting binary
- if the layers in between them are doing the decoding/encoding
automatically.
in our case, the producer is either the ChannelFile instances returned
by paramiko.SSHClient or subprocess.CompletedProcess instances returned
by subprocess.run(). the former are file-like objects opened in "r" mode,
but their contents are decoded with utf-8 when reading if
ChannelFile.FLAG_BINARY is not specified. that's why we always try to
add this flag in orchestra/run.py when collecting the stdout and stderr
from paramiko.SSHClient after executing a command.
back in python2, this works just fine, as we didn't differentiate bytes
from str back then.
but in python3, we have to make a decision. in the case of
ceph-objectstore-tool (COT for short), it does not produce binary and
we don't check its output with binary, so, if neither Remote.run() nor
LocalRemote.run() decodes/encodes for us, it's fine.
so it boils down to `copy_to_log()`:
i think we should respect the consumer's expectation, and only decode
the output if a StringIO is passed in as stdout or stderr.
as we always log the output with logging, we could either set
`ChannelFile.FLAG_BINARY` depending on the type of `capture` or not.
if it's not set, paramiko will return str (bytes) on python2, and str on
python3. if it's set, paramiko will return str (bytes) on python2,
and bytes on python3.
if there is non-ASCII in the output, logging will fail with a
`UnicodeDecodeError` exception. and paramiko throws the same exception
when trying to decode for us if `ChannelFile.FLAG_BINARY` is not
specified.
so to ensure that we always have logging messages no matter if the
producer follows the rule of "use StringIO if you only emit text" or
not, we have to use `ChannelFile.FLAG_BINARY`, and force paramiko
to send us the bytes. but we still have the luxury to use StringIO
and do the decode when the caller asks for str explicitly. that'd save
the pain of using `str.decode()` or `six.ensure_str()` everywhere
even if we can assure that the program does not write binary.
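a rough sketch of that policy, not the actual `copy_to_log()` code. the
helper name `copy_to_capture()` is hypothetical, the producer is assumed
to always yield bytes, and replacing undecodable bytes is an assumption
made here so the example never raises:

```python
import io


def copy_to_capture(source, capture):
    """Copy bytes lines from source into capture, decoding on demand."""
    for line in source:
        if isinstance(capture, io.StringIO):
            # the caller asked for text explicitly, so decode for them
            capture.write(line.decode('utf-8', errors='replace'))
        else:
            # BytesIO (or similar): hand over the raw bytes untouched
            capture.write(line)
    return capture
```

a caller that passes `io.StringIO()` gets str back, while one that
passes `io.BytesIO()` gets the bytes exactly as produced.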
Signed-off-by: Kefu Chai <kchai@redhat.com>
as we are expecting the error message written to stderr, and we need to
check for the error messages in it.
this change addresses the regression introduced by
204ceee156cbb8a20bdf56efb0cd0610ee4c107e
Fixes: https://tracker.ceph.com/issues/44500
Signed-off-by: Kefu Chai <kchai@redhat.com>
This puts the conf and keyring in /etc/ceph earlier rather than later,
making them useful for debugging a live system *during* bootstrap. It's
also less code.
Signed-off-by: Sage Weil <sage@redhat.com>
A first step to do more automatic code checks on the qa/
directory. This is useful while transitioning to python3.
Also move log_exc to the top level to avoid running into:
error: Argument 1 to "log_exc" has incompatible type
"Callable[[OSDThrasher], Any]"; expected "OSDThrasher"
Signed-off-by: Thomas Bechtold <tbechtold@suse.com>
This is harmless if logging is low, but adds useful info when it is turned
up.
Hunting bug https://tracker.ceph.com/issues/43914
Signed-off-by: Sage Weil <sage@redhat.com>