Merge pull request #53976 from phlogistonjohn/jjm-cephadm-gdb-extd-doc

cephadm: document extended gdb debugging techniques

Reviewed-by: Adam King <adking@redhat.com>
Reviewed-by: Zac Dover <zac.dover@proton.me>
zdover23 committed 2023-10-27 17:16:16 +10:00 (via GitHub)
commit 2706ecac4a
2 changed files with 122 additions and 14 deletions


@@ -335,16 +335,22 @@ Deploy the daemon::
cephadm --image <container-image> deploy --fsid <fsid> --name mgr.hostname.smfvfd --config-json config-json.json
Capturing Core Dumps
---------------------
A Ceph cluster that uses cephadm can be configured to capture core dumps.
Initial capture and processing of the coredump is performed by
`systemd-coredump <https://www.man7.org/linux/man-pages/man8/systemd-coredump.8.html>`_.
To enable coredump handling, run:
.. prompt:: bash #

   ulimit -c unlimited
Core dumps will be written to ``/var/lib/systemd/coredump``.
This will persist until the system is rebooted.
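As noted above, this setting does not survive a reboot. As a sketch of one
possible way to persist a core file size limit for cephadm-managed daemon
units, a systemd drop-in could be used (the template unit name
``ceph-<fsid>@.service`` must be adjusted to your cluster's fsid, and the
drop-in file name is illustrative):

.. prompt:: bash #

   mkdir -p /etc/systemd/system/ceph-<fsid>@.service.d
   printf '[Service]\nLimitCORE=infinity\n' > /etc/systemd/system/ceph-<fsid>@.service.d/coredump.conf
   # takes effect the next time each daemon unit is (re)started
   systemctl daemon-reload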
.. note::
@@ -352,16 +358,110 @@ Core dumps will now be written to ``/var/lib/systemd/coredump``.
they will be written to ``/var/lib/systemd/coredump`` on
the container host.
Now, wait for the crash to happen again. To simulate the crash of a daemon, run
e.g. ``killall -3 ceph-mon``.
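Before starting a debugging session, it can be useful to confirm that a core
dump was actually captured. For example, using ``coredumpctl`` (part of
systemd, though some distributions package it separately):

.. prompt:: bash #

   # list the core dumps captured by systemd-coredump
   coredumpctl list
   # the compressed core files are kept on the container host
   ls /var/lib/systemd/coredump/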
Running the Debugger with cephadm
----------------------------------
Running a single debugging session
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
One can initiate a debugging session using the ``cephadm shell`` command.
From within the shell container, install the debugger and debuginfo
packages. To debug a core file captured by systemd, run the following:
.. prompt:: bash #

   # start the shell session
   cephadm shell --mount /var/lib/systemd/coredump
   # within the shell:
   dnf install ceph-debuginfo gdb zstd
   unzstd /mnt/coredump/core.ceph-*.zst
   # replace "..." with the name of the decompressed core file
   gdb /usr/bin/ceph-mon /mnt/coredump/core.ceph-...
You can then run debugger commands at gdb's prompt::

   (gdb) bt
   #0  0x00007fa9117383fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
   #1  0x00007fa910d7f8f0 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
   #2  0x00007fa913d3f48f in AsyncMessenger::wait() () from /usr/lib64/ceph/libceph-common.so.2
   #3  0x0000563085ca3d7e in main ()
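Beyond ``bt``, a few other standard gdb commands are often useful when
inspecting a core file, for example::

   (gdb) thread apply all bt    # backtrace of every thread
   (gdb) info threads           # list the threads in the core
   (gdb) frame 2                # select stack frame 2
   (gdb) info locals            # local variables of the selected frame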
Running repeated debugging sessions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When using ``cephadm shell``, as in the example above, the changes made to
the container spawned by the shell command are ephemeral. Once the shell
session exits, all of the files that were downloaded and installed are no
longer available. One can simply re-run the same commands every time
``cephadm shell`` is invoked, but to save time and resources one can instead
create a new container image and use it for repeated debugging sessions.
In the following example we create a simple file for constructing the
container image. The command below uses ``podman``, but it should work
correctly if ``podman`` is replaced with ``docker``.
.. prompt:: bash

   cat >Containerfile <<EOF
   ARG BASE_IMG=quay.io/ceph/ceph:v18
   FROM \${BASE_IMG}
   # install ceph debuginfo packages, gdb and other potentially useful packages
   RUN dnf install --enablerepo='*debug*' -y ceph-debuginfo gdb zstd strace python3-debuginfo
   EOF
   podman build -t ceph:debugging -f Containerfile .
   # pass --build-arg=BASE_IMG=<your image> to customize the base image
The result should be a new local image named ``ceph:debugging``. This image
can be used on the same machine that built it. Later, the image could be
pushed to a container repository, or saved and copied to a node running
other Ceph containers. Please consult the documentation for ``podman`` or
``docker`` for more details on the general container workflow.
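For example, one way to move the image to another node without going through
a registry is ``podman save`` and ``podman load`` (the host name ``node2`` is
illustrative):

.. prompt:: bash #

   podman save -o ceph-debugging.tar ceph:debugging
   scp ceph-debugging.tar node2:/tmp/
   ssh node2 podman load -i /tmp/ceph-debugging.tar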
Once the image has been built, it can be used to initiate repeated debugging
sessions without having to re-install the debug tools and debuginfo
packages. To debug a core file using this image, in the same way as
previously described, run:
.. prompt:: bash #

   cephadm --image ceph:debugging shell --mount /var/lib/systemd/coredump
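Because the debug tools and debuginfo packages are already baked into the
image, the ``dnf install`` step from the single-session example can be
skipped; within the shell, proceed directly to the debugger:

.. prompt:: bash #

   unzstd /mnt/coredump/core.ceph-*.zst
   # replace "..." with the name of the decompressed core file
   gdb /usr/bin/ceph-mon /mnt/coredump/core.ceph-...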
Debugging live processes
~~~~~~~~~~~~~~~~~~~~~~~~
The gdb debugger can attach to running processes to debug them. For a
containerized process, this can be accomplished by using the debug image and
attaching it to the same PID namespace as the process to be debugged. This
requires running a container command with some custom arguments. We can
generate a script that can be used to debug a process in a running
container:
.. prompt:: bash #

   cephadm --image ceph:debugging shell --dry-run > /tmp/debug.sh
This creates a script that includes the container command cephadm would use
to create a shell. Now, modify the script by removing the ``--init`` argument
and replacing it with the argument that joins the namespace used by a running
container. For example, let's assume we want to debug the MGR, and have
determined that the MGR is running in a container named
``ceph-bc615290-685b-11ee-84a6-525400220000-mgr-ceph0-sluwsk``. In this case,
the argument
``--pid=container:ceph-bc615290-685b-11ee-84a6-525400220000-mgr-ceph0-sluwsk``
should be used.
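As a sketch, this edit can be made with ``sed`` (the container name below is
taken from the example above and will differ on your system):

.. prompt:: bash #

   sed -i 's/--init/--pid=container:ceph-bc615290-685b-11ee-84a6-525400220000-mgr-ceph0-sluwsk/' /tmp/debug.sh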
Now, we can run our debugging container with ``sh /tmp/debug.sh``. Within the shell
we can run commands such as ``ps`` to get the PID of the MGR process. In the following
example this will be ``2``. Running gdb, we can now attach to the running process:
.. prompt:: bash (gdb)

   attach 2
   info threads
   bt
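When the investigation is complete, detach from the process so that the
daemon resumes normal execution:

.. prompt:: bash (gdb)

   detach
   quit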


@@ -5417,6 +5417,10 @@ def command_shell(ctx):
        privileged=True)
    command = c.shell_cmd(command)
    if ctx.dry_run:
        print(' '.join(shlex.quote(arg) for arg in command))
        return 0
    return call_timeout(ctx, command, ctx.timeout)
##################################
@@ -7165,6 +7169,10 @@ def _get_parser():
        '--no-hosts',
        action='store_true',
        help='dont pass /etc/hosts through to the container')
    parser_shell.add_argument(
        '--dry-run',
        action='store_true',
        help='print, but do not execute, the container command to start the shell')
    parser_enter = subparsers.add_parser(
        'enter', help='run an interactive shell inside a running daemon container')