doc/dev: Teuthology guide PR#37949 grammar edit

This PR improves the wording of the technical
information added to the documentation in PR#37949.
This is the second is a series of two PRs, which series
is dedicated to testing a workflow wherein developers
add technical information to the documentation and then
technical writers improve its presentation.

Signed-off-by: Zac Dover <zac.dover@gmail.com>
This commit is contained in:
Zac Dover 2021-01-30 02:41:22 +10:00
parent 6baa74d8e7
commit 07a9bf4d55
3 changed files with 114 additions and 100 deletions

View File

@ -3,59 +3,66 @@
Analyzing and Debugging A Teuthology Job
========================================
To learn more about how to schedule an integration test, refer to `Scheduling
Test Run`_.
When a teuthology run has been completed successfully, use the `pulpito`_
dashboard to view the results::
http://pulpito.front.sepia.ceph.com/<job-name>/<job-id>/
.. _pulpito: https://pulpito.ceph.com
or ssh into the teuthology server::
ssh <username>@teuthology.front.sepia.ceph.com
and access `teuthology archives`_, as in this example:
.. prompt:: bash $
nano /a/teuthology-2021-01-06_07:01:02-rados-master-distro-basic-smithi/
.. note:: This requires you to have access to the Sepia lab. To learn how to
request access to the Sepia lab, see:
https://ceph.github.io/sepia/adding_users/
On pulpito, jobs in red specify either a failed job or a dead job.
A job is a combination of daemons and configurations that are formed using
`qa/suites`_ yaml fragments.
Teuthology uses these configurations and runs the tasks that are present in
`qa/tasks`_, which are commands used for setting up the test environment and
testing Ceph's components.
These tasks cover a large subset of use cases and help to
expose the bugs that aren't caught by `make check`_ testing.
.. _make check: ../tests-integration-testing-teuthology-intro/#make-check
A job failure might be caused by one or more of the following reasons:
* environment setup (`testing on varied
systems <https://github.com/ceph/ceph/tree/master/qa/distros/supported>`_):
testing compatibility with stable releases for supported versions.
* permutation of config values: for instance, `qa/suites/rados/thrash
<https://github.com/ceph/ceph/tree/master/qa/suites/rados/thrash>`_ ensures
running thrashing tests against Ceph under stressful workloads, so that we
are able to catch corner-case bugs. The final setup config yaml used for
testing can be accessed at::
/a/<job-name>/<job-id>/orig.config.yaml
More details about config.yaml can be found at `detailed test config`_.
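A job's final config is plain yaml assembled from `qa/suites`_ fragments. The
sketch below is purely illustrative: the keys shown (``roles``, ``tasks``,
``os_type``) are typical of teuthology configs, but the values are invented
for illustration and are not taken from any real run:

.. code-block:: yaml

   # Hypothetical sketch of a teuthology job config (values invented)
   os_type: ubuntu
   roles:
   - [mon.a, mgr.x, osd.0, osd.1, client.0]
   tasks:
   - install:
   - ceph:
   - rados:
       clients: [client.0]
       ops: 4000
       objects: 500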
Triaging the cause of failure
------------------------------
To triage a job failure, open the teuthology log for it using either the job
name or the job id (from the pulpito page):
http://qa-proxy.ceph.com/<job-name>/<job-id>/teuthology.log
Open the log file::
/a/<job-name>/<job-id>/teuthology.log
@ -63,50 +70,55 @@ for example in our case::
nano /a/teuthology-2021-01-06_07:01:02-rados-master-distro-basic-smithi/5759282/teuthology.log
A job failure is recorded in the teuthology log as a Traceback and is
added to the job summary.
To analyze a job failure, locate the ``Traceback`` keyword and examine the call
stack and logs for issues that caused the failure. Usually the traceback
will include the command that failed.
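Since these logs can be very large, it can be convenient to jump straight to
the traceback from the command line. For example, using the example job from
above:

.. prompt:: bash $

   grep -n -A 10 "Traceback" /a/teuthology-2021-01-06_07:01:02-rados-master-distro-basic-smithi/5759282/teuthology.log

Here ``-n`` prints line numbers and ``-A 10`` prints the ten lines that follow
each match, which usually include the command that failed.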
.. note:: Teuthology logs are deleted from time to time. If you are unable to
   access the example link, refer to any other case from
   http://pulpito.front.sepia.ceph.com/
Reporting the Issue
-------------------
After you have triaged the cause of the failure and you have determined that the
failure was not caused by the developer's code change, this might indicate a
known failure for the upstream branch (in our case, the upstream branch is
octopus). If the failure was not caused by a developer's code change, go to
https://tracker.ceph.com and look for tracker issues related to the failure by
using keywords spotted in the failure under investigation.
If a similar issue has been reported via a tracker.ceph.com ticket, add to it a
link to the new test run and any relevant feedback. If you don't find a ticket
referring to an issue similar to the one that you have discovered, create a new
tracker ticket for it. If you are not familiar with the cause of failure, ask
one of the team members for help.
Debugging an issue using interactive-on-error
---------------------------------------------
It is important to be able to reproduce an issue when investigating its cause.
Run a job similar to the failed job, using the `interactive-on-error`_ mode in
teuthology::
ideepika@teuthology:~/teuthology$ ./virtualenv/bin/teuthology -v --lock --block $<your-config-yaml> --interactive-on-error
For this job, use either `custom config.yaml`_ or the yaml file from
the failed job. If you intend to use the yaml file from the failed job, copy
``orig.config.yaml`` to your local dir and change the `testing priority`_
accordingly, like so::
ideepika@teuthology:~/teuthology$ cp /a/teuthology-2021-01-06_07:01:02-rados-master-distro-basic-smithi/5759282/orig.config.yaml test.yaml
ideepika@teuthology:~/teuthology$ ./virtualenv/bin/teuthology -v --lock --block test.yaml --interactive-on-error
In the event of job failure, teuthology will lock the machines required by
``config.yaml``. Teuthology will halt at an interactive python session.
By sshing into the targets, we can investigate their ctx values. After we have
investigated the system, we can manually terminate the session and let
teuthology clean the session up.
Suggested Resources
--------------------
@ -121,3 +133,4 @@ Suggested Resources
.. _interactive-on-error: https://docs.ceph.com/projects/teuthology/en/latest/detailed_test_config.html#troubleshooting
.. _custom config.yaml: https://docs.ceph.com/projects/teuthology/en/latest/detailed_test_config.html#test-configuration
.. _testing priority: ../tests-integration-testing-teuthology-intro/#testing-priority
.. _thrash: https://github.com/ceph/ceph/tree/master/qa/suites/rados/thrash

View File

@ -4,8 +4,8 @@ Testing - Integration Tests - Introduction
==========================================
Ceph has two types of tests: :ref:`make check <make-check>` tests and
integration tests. When a test requires multiple machines, root access, or lasts
for a long time (for example, to simulate a realistic Ceph workload), it is
deemed to be an integration test. Integration tests are organized into "suites",
which are defined in the `ceph/qa sub-directory`_ and run with the
``teuthology-suite`` command.
@ -39,10 +39,10 @@ branch and the stable branches). Traditionally, these tests are called "the
nightlies" because the Ceph core developers used to live and work in
the same time zone and from their perspective the tests were run overnight.
The results of nightly test runs are published at http://pulpito.ceph.com/
under the user ``teuthology``. The developer nick appears in the URL of the
test results and in the first column of the Pulpito dashboard. The results are
also reported on the `ceph-qa mailing list <https://ceph.com/irc/>`_.
Testing Priority
----------------
@ -79,10 +79,9 @@ Job priority should be selected based on the following recommendations:
* **200 <= Priority < 1000:** Use this priority for large test runs that can
be done over the course of a week.
To learn how many jobs the ``teuthology-suite`` command will trigger, use the
``--dry-run`` flag. If you are happy with the number of jobs, issue the
``teuthology-suite`` command again without ``--dry-run`` and with ``-p`` and
an appropriate number as an argument.
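As a sketch, that two-step workflow might look like this (the machine type,
suite name, and priority are placeholders to be filled in for your run):

.. prompt:: bash $

   teuthology-suite -m <machine-type> -s <suite> --dry-run
   teuthology-suite -m <machine-type> -s <suite> -p 150

The first command only prints what would be scheduled, so you can count the
jobs; the second schedules the run, here at priority 150.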
To skip the priority check, use ``--force-priority``. In order to be sensitive
to the runs of other developers who also need to do testing, please use it in
@ -92,14 +91,14 @@ Suites Inventory
----------------
The ``suites`` directory of the `ceph/qa sub-directory`_ contains all the
integration tests for all the Ceph components.
`ceph-deploy <https://github.com/ceph/ceph/tree/master/qa/suites/ceph-deploy>`_
install a Ceph cluster with ``ceph-deploy`` (`ceph-deploy man page`_)
`dummy <https://github.com/ceph/ceph/tree/master/qa/suites/dummy>`_
get a machine, do nothing and return success (commonly used to
verify that the integration testing infrastructure works as expected)
`fs <https://github.com/ceph/ceph/tree/master/qa/suites/fs>`_
test CephFS mounted using FUSE
@ -145,20 +144,20 @@ teuthology-describe-tests
``teuthology-describe`` was added to the `teuthology framework`_ to facilitate
documentation and better understanding of integration tests.
Tests can be documented by embedding ``meta:`` annotations in the yaml files
used to define the tests. The results can be seen in the `teuthology-desribe
usecases`_
Since this is a new feature, many yaml files have yet to be annotated.
Developers are encouraged to improve the coverage and the quality of the
documentation.
How integration tests are run
-----------------------------
As a new Ceph developer you will probably not have access to the `Sepia lab`_.
You might however be able to run some integration tests in your own
environment. Ask members from the relevant team how to do this.
One option is to set up a teuthology cluster on bare metal. Though this is a
non-trivial task, it *is* possible. Here are `some notes

View File

@ -53,13 +53,8 @@ teuthology. This procedure explains how to run tests using teuthology.
ssh <username>@teuthology.front.sepia.ceph.com
#. Install teuthology in a virtual environment and activate that virtual
environment. Follow the relevant instructions in `Running Your First Test`_.
This requires Sepia lab access. To request access to the Sepia lab, see:
https://ceph.github.io/sepia/adding_users/
#. Run the ``teuthology-suite`` command:
@ -112,6 +107,11 @@ teuthology. This procedure explains how to run tests using teuthology.
Other frequently used/useful options are ``-d`` (or ``--distro``),
``--distroversion``, ``--filter-out``, ``--timeout``, ``--flavor``,
``--rerun``, ``-l`` (for limiting the number of jobs), and ``-n`` (for
specifying how many times the job will run). Run ``teuthology-suite --help``
to read descriptions of these and other options.
.. _teuthology_testing_qa_changes:
@ -124,12 +124,14 @@ rebuild the binaries before you re-run tests. If you make changes only in
You just have to make sure to tell the ``teuthology-suite`` command to use a
separate branch for running the tests.
If you made changes only in ``qa/``
(https://github.com/ceph/ceph/tree/master/qa), you do not need to rebuild the
binaries. You can use the existing binaries that are built periodically for
master and other stable branches, and run your test changes against them.
Your branch with the qa changes can be tested by passing two extra arguments
to the ``teuthology-suite`` command: (1) ``--suite-repo``, specifying your
ceph repo, and (2) ``--suite-branch``, specifying your branch name.
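A hypothetical invocation (the fork URL and branch names are placeholders,
not real values):

.. prompt:: bash $

   teuthology-suite -m smithi -s rados \
      --ceph wip-username-branch-x \
      --suite-repo https://github.com/<username>/ceph.git \
      --suite-branch <your-qa-branch>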
For example, if you want to make changes in ``qa/`` after testing ``branch-x``
(for which the ceph-ci branch is ``wip-username-branch-x``), run the following
command::
.. prompt:: bash $
@ -191,28 +193,26 @@ Viewing Tests Results
Pulpito Dashboard
*****************
After the teuthology job is scheduled, the status and results of the test run
can be checked at https://pulpito.ceph.com/.
Teuthology Archives
*******************
After the tests have finished running, the log for the job can be obtained by
clicking on the job ID at the Pulpito page associated with your tests. It's
more convenient to download the log and then view it rather than viewing it in
an internet browser since these logs can easily be up to 1 GB in size. It is
easier to ssh into the teuthology machine (``teuthology.front.sepia.ceph.com``)
and access the following path::
/ceph/teuthology-archive/<test-id>/<job-id>/teuthology.log
For example: for the above test ID, the path is::
/ceph/teuthology-archive/teuthology-2019-12-10_05:00:03-smoke-master-testing-basic-smithi/4588482/teuthology.log
This method can be used to view the log more quickly than would be possible
through a browser.
.. note:: To access archives more conveniently, ``/a/`` has been symbolically
linked to ``/ceph/teuthology-archive/``. For instance, to access the previous
@ -222,13 +222,15 @@ much.
Killing Tests
-------------
``teuthology-kill`` can be used to kill jobs that have been running
unexpectedly for several hours, or when developers want to terminate tests
before they complete.
Here is the command that terminates jobs:
.. prompt:: bash $
teuthology-kill -r teuthology-2019-12-10_05:00:03-smoke-master-testing-basic-smithi
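The run name passed to ``-r`` is also the directory name under ``/a/``; given
an archive path, it can be recovered with standard shell tools, for example:

.. prompt:: bash $

   p=/a/teuthology-2019-12-10_05:00:03-smoke-master-testing-basic-smithi/4588482/teuthology.log
   basename "$(dirname "$(dirname "$p")")"

This prints ``teuthology-2019-12-10_05:00:03-smoke-master-testing-basic-smithi``,
which is exactly the value to pass to ``teuthology-kill -r``.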
Let's call the argument passed to ``-r`` the test ID. It can be found easily
in the link to the Pulpito page for the tests you triggered. For