ceph/doc/dev/developer_guide/testing_integration_tests/tests-integration-testing-teuthology-debugging-tips.rst
Zac Dover 5c4a1a317d doc/dev: rewrite t8y "triaging" section
This commit simplifes and clarifies the "Triaging
the Cause of Failure" section in the Teuthology
Guide in the Developer Guide.

Signed-off-by: Zac Dover <zac.dover@gmail.com>
2021-02-26 00:05:38 +10:00

140 lines
6.0 KiB
ReStructuredText

.. _tests-integration-testing-teuthology-debugging-tips:
Analyzing and Debugging A Teuthology Job
========================================
To learn more about how to schedule an integration test, refer to `Scheduling
Test Run`_.
When a teuthology run has been completed successfully, use `pulpito`_ dasboard
to view the results::
http://pulpito.front.sepia.ceph.com/<job-name>/<job-id>/
.. _pulpito: https://pulpito.ceph.com
or ssh into the teuthology server::
ssh <username>@teuthology.front.sepia.ceph.com
and access `teuthology archives`_, like this for example:
.. prompt:: bash $
nano /a/teuthology-2021-01-06_07:01:02-rados-master-distro-basic-smithi/
.. note:: This requires you to have access to the Sepia lab. To learn how to
request access to the Speia lab, see:
https://ceph.github.io/sepia/adding_users/
On pulpito, jobs in red specify either a failed job or a dead job.
A job is combination of daemons and configurations that are formed using
`qa/suites`_ yaml fragments.
Teuthology uses these configurations and runs the tasks that are present in
`qa/tasks`_, which are commands used for setting up the test environment and
testing Ceph's components.
These tasks cover a large subset of use cases and help to
expose the bugs that aren't caught by `make check`_ testing.
.. _make check: ../tests-integration-testing-teuthology-intro/#make-check
A job failure might be caused by one or more of the following reasons:
* environment setup (`testing on varied
systems <https://github.com/ceph/ceph/tree/master/qa/distros/supported>`_):
testing compatibility with stable realeases for supported versions.
* permutation of config values: for instance, `qa/suites/rados/thrash
<https://github.com/ceph/ceph/tree/master/qa/suites/rados/thrash>`_ ensures
running thrashing tests against Ceph under stressful workloads, so that we
are able to catch corner-case bugs. The final setup config yaml used for
testing can be accessed at::
/a/<job-name>/<job-id>/orig.config.yaml
More details about config.yaml can be found at `detailed test config`_
Triaging the cause of failure
------------------------------
When a job fails, you will need to read its teuthology log in order to triage
the cause of its failure. Use the job's name and id from pulpito to locate your
failed job's teuthology log::
http://qa-proxy.ceph.com/<job-name>/<job-id>/teuthology.log
Open the log file::
/a/<job-name>/<job-id>/teuthology.log
For example:
.. prompt:: bash $
nano /a/teuthology-2021-01-06_07:01:02-rados-master-distro-basic-smithi/5759282/teuthology.log
Every job failure is recorded in the teuthology log as a Traceback and is
added to the job summary.
Find the ``Traceback`` keyword and search the call stack and the logs for
issues that caused the failure. Usually the traceback will include the command
that failed.
.. note:: The teuthology logs are deleted from time to time. If you are unable
to access the link in this example, just use any other case from
http://pulpito.front.sepia.ceph.com/
Reporting the Issue
-------------------
After you have triaged the cause of the failure and you have determined that the
failure was not caused by the developer's code change, this might indicate a
known failure for the upstream branch (in our case, the upstream branch is
octopus). If the failure was not caused by a developer's code change, go to
https://tracker.ceph.com and look for tracker issues related to the failure by using keywords spotted in the failure under investigation.
If a similar issue has been reported via a tracker.ceph.com ticket, add to it a
link to the new test run and any relevant feedback. If you don't find a ticket
referring to an issue similar to the one that you have discovered, create a new
tracker ticket for it. If you are not familiar with the cause of failure, ask
one of the team members for help.
Debugging an issue using interactive-on-error
---------------------------------------------
It is important to be able to reproduce an issue when investigating its cause.
Run a job similar to the failed job, using the `interactive-on-error`_ mode in
teuthology::
ideepika@teuthology:~/teuthology$ ./virtualenv/bin/teuthology -v --lock --block $<your-config-yaml> --interactive-on-error
For this job, use either `custom config.yaml`_ or the yaml file from
the failed job. If you intend to use the yaml file from the failed job, copy
``orig.config.yaml`` to your local dir and change the `testing priority`_
accordingly, like so::
ideepika@teuthology:~/teuthology$ cp /a/teuthology-2021-01-06_07:01:02-rados-master-distro-basic-smithi/5759282/orig.config.yaml test.yaml
ideepika@teuthology:~/teuthology$ ./virtualenv/bin/teuthology -v --lock --block test.yaml --interactive-on-error
In the event of job failure, teuthology will lock the machines required by
``config.yaml``. Teuthology will halt at an interactive python session.
By sshing into the targets, we can investigate their ctx values. After we have
investigated the system, we can manually terminate the session and let
teuthology clean the session up.
Suggested Resources
--------------------
* `Testing Ceph: Pains & Pleasures <https://www.youtube.com/watch?v=gj1OXrKdSrs>`_
.. _Scheduling Test Run: ../tests-integration-testing-teuthology-workflow/#scheduling-test-run
.. _detailed test config: https://docs.ceph.com/projects/teuthology/en/latest/detailed_test_config.html
.. _teuthology archives: ../tests-integration-testing-teuthology-workflow/#teuthology-archives
.. _qa/suites: https://github.com/ceph/ceph/tree/master/qa/suites
.. _qa/tasks: https://github.com/ceph/ceph/tree/master/qa/tasks
.. _interactive-on-error: https://docs.ceph.com/projects/teuthology/en/latest/detailed_test_config.html#troubleshooting
.. _custom config.yaml: https://docs.ceph.com/projects/teuthology/en/latest/detailed_test_config.html#test-configuration
.. _testing priority: ../tests-integration-testing-teuthology-intro/#testing-priority
.. _thrash: https://github.com/ceph/ceph/tree/master/qa/suites/rados/thrash