diff --git a/doc/cephadm/troubleshooting.rst b/doc/cephadm/troubleshooting.rst
index 5ec69288166..3d4965deda0 100644
--- a/doc/cephadm/troubleshooting.rst
+++ b/doc/cephadm/troubleshooting.rst
@@ -335,16 +335,22 @@ Deploy the daemon::

     cephadm --image <container-image> deploy --fsid <fsid> --name mgr.hostname.smfvfd --config-json config-json.json

-Analyzing Core Dumps
+Capturing Core Dumps
 ---------------------

-When a Ceph daemon crashes, cephadm supports analyzing core dumps. To enable core dumps, run
+A Ceph cluster that uses cephadm can be configured to capture core dumps.
+Initial capture and processing of the coredump is performed by
+`systemd-coredump <https://man7.org/linux/man-pages/man8/systemd-coredump.8.html>`_.
+
+To enable coredump handling, run:

 .. prompt:: bash #

    ulimit -c unlimited

-Core dumps will now be written to ``/var/lib/systemd/coredump``.
+Core dumps will be written to ``/var/lib/systemd/coredump``.
+The ``ulimit -c unlimited`` setting will persist only until the system is
+rebooted.

 .. note::

@@ -352,16 +358,110 @@ Core dumps will now be written to ``/var/lib/systemd/coredump``.
   they will be written to ``/var/lib/systemd/coredump`` on the container host.

-Now, wait for the crash to happen again. To simulate the crash of a daemon, run e.g. ``killall -3 ceph-mon``.
+Now, wait for the crash to happen again. To simulate the crash of a daemon,
+run e.g. ``killall -3 ceph-mon``.

-Install debug packages including ``ceph-debuginfo`` by entering the cephadm shelll::
-
-    # cephadm shell --mount /var/lib/systemd/coredump
-    [ceph: root@host1 /]# dnf install ceph-debuginfo gdb zstd
-    [ceph: root@host1 /]# unzstd /mnt/coredump/core.ceph-*.zst
-    [ceph: root@host1 /]# gdb /usr/bin/ceph-mon /mnt/coredump/core.ceph-...
-    (gdb) bt
-    #0  0x00007fa9117383fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
-    #1  0x00007fa910d7f8f0 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
-    #2  0x00007fa913d3f48f in AsyncMessenger::wait() () from /usr/lib64/ceph/libceph-common.so.2
-    #3  0x0000563085ca3d7e in main ()
+Running the Debugger with cephadm
+----------------------------------
+
+Running a single debugging session
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Initiate a debugging session using the ``cephadm shell`` command. From within
+the shell container, install the debugger and debuginfo packages. To debug a
+core file captured by systemd, run the following:
+
+.. prompt:: bash #
+
+   # start the shell session
+   cephadm shell --mount /var/lib/systemd/coredump
+   # within the shell:
+   dnf install ceph-debuginfo gdb zstd
+   unzstd /mnt/coredump/core.ceph-*.zst
+   # run gdb on the decompressed core file (the name without the .zst suffix)
+   gdb /usr/bin/ceph-mon /mnt/coredump/core.ceph-...
+
+You can then run debugger commands at gdb's prompt::
+
+   (gdb) bt
+   #0  0x00007fa9117383fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
+   #1  0x00007fa910d7f8f0 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
+   #2  0x00007fa913d3f48f in AsyncMessenger::wait() () from /usr/lib64/ceph/libceph-common.so.2
+   #3  0x0000563085ca3d7e in main ()
+
+Running repeated debugging sessions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When using ``cephadm shell``, as in the example above, any changes made to the
+container spawned by the shell command are ephemeral. Once the shell session
+exits, all of the files that were downloaded and installed are no longer
+available. One can simply re-run the same commands every time ``cephadm shell``
+is invoked, but to save time and resources one can instead create a new
+container image and use it for repeated debugging sessions.
+
+In the following example we create a simple file for constructing the
+container image. The commands below use ``podman`` but they should work
+correctly if ``podman`` is replaced with ``docker``:
+
+.. prompt:: bash
+
+   cat >Containerfile <<EOF
+   ARG BASE_IMG=quay.io/ceph/ceph:v18
+   FROM \${BASE_IMG}
+   # install the debugger, debuginfo packages, and other debugging tools
+   RUN dnf install --enablerepo='*debug*' -y ceph-debuginfo gdb zstd
+   EOF
+   podman build -t ceph:debugging -f Containerfile .
+   # pass --build-arg=BASE_IMG=<base-image> to customize the base image
+
+The result should be a new local image named ``ceph:debugging``. This image
+can be used on the same machine that built it. Later, the image could be
+pushed to a container repository, or saved and copied to a node running other
+Ceph containers. Please consult the documentation for ``podman`` or ``docker``
+for more details on the general container workflow.
+
+Once the image has been built, it can be used to initiate repeated debugging
+sessions without having to re-install the debug tools and debuginfo packages.
+To debug a core file using this image, in the same way as described above, run:
+
+.. prompt:: bash #
+
+   cephadm --image ceph:debugging shell --mount /var/lib/systemd/coredump
+
+
+Debugging live processes
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+The gdb debugger can attach to running processes to debug them. For a
+containerized process this can be accomplished by using the debug image and
+attaching it to the same PID namespace as the process to be debugged.
+
+This requires running a container command with some custom arguments. We can
+generate a script that can be used to debug a process in a running container:
+
+.. prompt:: bash #
+
+   cephadm --image ceph:debugging shell --dry-run > /tmp/debug.sh
+
+This creates a script that includes the container command cephadm would use
+to create a shell. Modify the script by removing the ``--init`` argument and
+replacing it with the argument that joins the PID namespace of a running
+container. For example, assume we want to debug the MGR and have determined
+that the MGR is running in a container named
+``ceph-bc615290-685b-11ee-84a6-525400220000-mgr-ceph0-sluwsk``. In this case
+the argument
+``--pid=container:ceph-bc615290-685b-11ee-84a6-525400220000-mgr-ceph0-sluwsk``
+should be used instead.
+
+Now we can run our debugging container with ``sh /tmp/debug.sh``. Within the
+shell, we can run commands such as ``ps`` to get the PID of the MGR process.
+In the following example this is PID ``2``. Running gdb, we can attach to the
+running process:
+
+.. prompt:: bash (gdb)
+
+   attach 2
+   info threads
+   bt
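+
+Note that while gdb is attached, the target process is stopped, and a stopped
+Ceph daemon may be flagged as unresponsive by the rest of the cluster. To
+resume the process, detach from it before exiting gdb:
+
+.. prompt:: bash (gdb)
+
+   detach
+   quit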
diff --git a/src/cephadm/cephadm.py b/src/cephadm/cephadm.py
index a64b0b49404..04104c846b8 100755
--- a/src/cephadm/cephadm.py
+++ b/src/cephadm/cephadm.py
@@ -5417,6 +5417,10 @@ def command_shell(ctx):
         privileged=True)
     command = c.shell_cmd(command)

+    if ctx.dry_run:
+        print(' '.join(shlex.quote(arg) for arg in command))
+        return 0
+
     return call_timeout(ctx, command, ctx.timeout)

 ##################################
@@ -7165,6 +7169,10 @@ def _get_parser():
         '--no-hosts',
         action='store_true',
         help='dont pass /etc/hosts through to the container')
+    parser_shell.add_argument(
+        '--dry-run',
+        action='store_true',
+        help='print, but do not execute, the container command to start the shell')

     parser_enter = subparsers.add_parser(
         'enter', help='run an interactive shell inside a running daemon container')
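
Because the printed command is quoted with ``shlex.quote``, the output of the
new ``--dry-run`` flag can be saved and re-executed as a shell script. A
minimal sketch of the resulting live-debugging workflow, using the
hypothetical MGR container name from the documentation above (substitute the
real name shown by ``podman ps``):

.. prompt:: bash #

   cephadm --image ceph:debugging shell --dry-run > /tmp/debug.sh
   # swap --init for the argument that joins the MGR container's PID namespace
   sed -i 's/--init/--pid=container:ceph-bc615290-685b-11ee-84a6-525400220000-mgr-ceph0-sluwsk/' /tmp/debug.sh
   sh /tmp/debug.sh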