Merge pull request #53976 from phlogistonjohn/jjm-cephadm-gdb-extd-doc

cephadm: document extended gdb debugging techniques

Reviewed-by: Adam King <adking@redhat.com>
Reviewed-by: Zac Dover <zac.dover@proton.me>

commit 2706ecac4a
@@ -335,16 +335,22 @@ Deploy the daemon::

    cephadm --image <container-image> deploy --fsid <fsid> --name mgr.hostname.smfvfd --config-json config-json.json

Capturing Core Dumps
--------------------

A Ceph cluster that uses cephadm can be configured to capture core dumps.
Initial capture and processing of the coredump is performed by
`systemd-coredump <https://www.man7.org/linux/man-pages/man8/systemd-coredump.8.html>`_.

To enable coredump handling, run:

.. prompt:: bash #

   ulimit -c unlimited

Core dumps will be written to ``/var/lib/systemd/coredump``.
This setting persists only until the system is rebooted.
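
If the setting must survive a reboot, one common approach (not specific to
cephadm; verify against your distribution's documentation) is to raise the
default core limit in systemd's configuration. A minimal sketch:

.. prompt:: bash #

   # assumption: systemd reads DefaultLimitCORE from /etc/systemd/system.conf
   echo 'DefaultLimitCORE=infinity' >> /etc/systemd/system.conf
   # re-execute the systemd manager so the new default takes effect
   systemctl daemon-reexec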

@@ -352,16 +358,110 @@ Core dumps will now be written to ``/var/lib/systemd/coredump``.

.. note::

   Core dumps are not namespaced by the kernel, which means
   they will be written to ``/var/lib/systemd/coredump`` on
   the container host.

Now, wait for the crash to happen again. To simulate the crash of a daemon, run
e.g. ``killall -3 ceph-mon``.
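
To confirm that a dump was captured, the ``coredumpctl`` tool that ships with
``systemd-coredump`` can list and inspect recent dumps on the container host:

.. prompt:: bash #

   # list captured core dumps matching the ceph-mon executable
   coredumpctl list ceph-mon
   # show details (signal, storage path) for the most recent matching dump
   coredumpctl info ceph-mon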

Running the Debugger with cephadm
---------------------------------

Running a single debugging session
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One can initiate a debugging session using the ``cephadm shell`` command.
From within the shell container we need to install the debugger and debuginfo
packages. To debug a core file captured by systemd, run the following:

.. prompt:: bash #

   # start the shell session with the host's coredump directory mounted
   cephadm shell --mount /var/lib/systemd/coredump
   # within the shell (the mount appears at /mnt/coredump):
   dnf install ceph-debuginfo gdb zstd
   unzstd /mnt/coredump/core.ceph-*.zst
   # give gdb the decompressed core file (the name without the .zst suffix)
   gdb /usr/bin/ceph-mon /mnt/coredump/core.ceph-...

You can then run debugger commands at gdb's prompt.

.. prompt::

   (gdb) bt
   #0  0x00007fa9117383fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
   #1  0x00007fa910d7f8f0 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
   #2  0x00007fa913d3f48f in AsyncMessenger::wait() () from /usr/lib64/ceph/libceph-common.so.2
   #3  0x0000563085ca3d7e in main ()
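
Beyond ``bt``, a few other standard gdb commands are often useful on a first
pass over a core file: ``thread apply all bt`` prints a backtrace for every
thread, ``bt full`` includes local variables, and ``info threads`` lists the
threads so you can switch between them:

.. prompt:: bash (gdb)

   thread apply all bt
   bt full
   info threads
   thread 2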

Running repeated debugging sessions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When using ``cephadm shell``, as in the example above, the changes made to the
container spawned by the shell command are ephemeral. Once the shell session
exits, all of the files that were downloaded and installed are no longer
available. One can simply re-run the same commands every time ``cephadm shell``
is invoked, but to save time and resources one can instead create a new
container image and use it for repeated debugging sessions.

In the following example we create a simple file for constructing the
container image. The commands below use ``podman`` but should work correctly
if ``podman`` is replaced with ``docker``.

.. prompt:: bash

   cat >Containerfile <<EOF
   ARG BASE_IMG=quay.io/ceph/ceph:v18
   FROM \${BASE_IMG}
   # install ceph debuginfo packages, gdb and other potentially useful packages
   RUN dnf install --enablerepo='*debug*' -y ceph-debuginfo gdb zstd strace python3-debuginfo
   EOF
   podman build -t ceph:debugging -f Containerfile .
   # pass --build-arg=BASE_IMG=<your image> to customize the base image

The result should be a new local image named ``ceph:debugging``. This image can
be used on the same machine that built it. Later, the image could be pushed to
a container repository, or saved and copied to a node running other Ceph
containers. Please consult the documentation for ``podman`` or ``docker`` for
more details on the general container workflow.
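
For example, one way to copy the image to another node is to stream it over
ssh. This is a sketch; the hostname ``ceph-node2`` is a placeholder:

.. prompt:: bash #

   # podman save writes the image to stdout; podman load reads it from stdin
   podman save ceph:debugging | ssh ceph-node2 podman load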

Once the image has been built, it can be used to initiate repeated debugging
sessions without having to re-install the debug tools and debuginfo packages.
To debug a core file using this image, in the same way as previously described,
run:

.. prompt:: bash #

   cephadm --image ceph:debugging shell --mount /var/lib/systemd/coredump

Debugging live processes
~~~~~~~~~~~~~~~~~~~~~~~~

The gdb debugger has the ability to attach to running processes to debug them.
For a containerized process, this can be accomplished by using the debug image
and attaching it to the same PID namespace as the process to be debugged.
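
Under the hood this uses the container engine's ``--pid`` flag. As a minimal
standalone illustration (bypassing the generated script described below, with
``<target-container>`` as a placeholder):

.. prompt:: bash #

   podman run --rm -it --pid=container:<target-container> ceph:debugging bash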

This requires running a container command with some custom arguments. We can
generate a script that can debug a process in a running container.

.. prompt:: bash #

   cephadm --image ceph:debugging shell --dry-run > /tmp/debug.sh

This creates a script containing the container command that cephadm would use
to create a shell. Now, modify the script by removing the ``--init`` argument
and replacing it with the argument that joins the PID namespace of a running
container. For example, assume we want to debug the MGR, and have determined
that the MGR is running in a container named
``ceph-bc615290-685b-11ee-84a6-525400220000-mgr-ceph0-sluwsk``. In this case,
the argument
``--pid=container:ceph-bc615290-685b-11ee-84a6-525400220000-mgr-ceph0-sluwsk``
should be used.
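
As an illustration, this edit can be scripted. The ``sed`` one-liner below is a
sketch that assumes ``--init`` appears exactly once in the generated script:

.. prompt:: bash #

   # swap --init for the argument joining the MGR container's PID namespace
   sed -i 's/--init/--pid=container:ceph-bc615290-685b-11ee-84a6-525400220000-mgr-ceph0-sluwsk/' /tmp/debug.sh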

Now, we can run our debugging container with ``sh /tmp/debug.sh``. Within the
shell, we can run commands such as ``ps`` to get the PID of the MGR process. In
the following example this will be ``2``. Running gdb, we can now attach to the
running process:

.. prompt:: bash (gdb)

   attach 2
   info threads
   bt
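
When you are done, detach so that the MGR process resumes running normally;
``detach`` and ``quit`` are standard gdb commands:

.. prompt:: bash (gdb)

   detach
   quit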

@@ -5417,6 +5417,10 @@ def command_shell(ctx):

            privileged=True)
    command = c.shell_cmd(command)

    if ctx.dry_run:
        print(' '.join(shlex.quote(arg) for arg in command))
        return 0

    return call_timeout(ctx, command, ctx.timeout)

##################################

@@ -7165,6 +7169,10 @@ def _get_parser():

        '--no-hosts',
        action='store_true',
        help='dont pass /etc/hosts through to the container')
    parser_shell.add_argument(
        '--dry-run',
        action='store_true',
        help='print, but do not execute, the container command to start the shell')

    parser_enter = subparsers.add_parser(
        'enter', help='run an interactive shell inside a running daemon container')