ceph/doc/cephadm/troubleshooting.rst


Troubleshooting
===============

Sometimes you need to investigate why a cephadm command failed or why
a specific service no longer runs properly.

Because cephadm deploys daemons as containers, troubleshooting those daemons
is slightly different. Here are a few tools and commands to help investigate
issues.

Pausing or disabling cephadm
----------------------------

If something goes wrong and cephadm is behaving in a way you do
not like, you can pause most background activity with::

  ceph orch pause

This will stop any changes, but cephadm will still periodically check hosts to
refresh its inventory of daemons and devices.  You can disable cephadm
completely with::

  ceph orch set backend ''
  ceph mgr module disable cephadm

This disables all of the ``ceph orch ...`` CLI commands, but the previously
deployed daemon containers will continue to exist and start as they
did before.

Checking cephadm logs
---------------------

You can monitor the cephadm log in real time with::

  ceph -W cephadm

You can see the last few messages with::

  ceph log last cephadm

If you have enabled logging to files, you can see a cephadm log file called
``ceph.cephadm.log`` on monitor hosts (see :ref:`cephadm-logs`).

Gathering log files
-------------------

Use ``journalctl`` to gather the log files of all daemons.

.. note:: By default cephadm now stores logs in journald. This means
   that you will no longer find daemon logs in ``/var/log/ceph/``.

To read the log file of one specific daemon, run::

    cephadm logs --name <name-of-daemon>

Note: this only works when run on the same host where the daemon is running. To
get logs of a daemon running on a different host, give the ``--fsid`` option::

    cephadm logs --fsid <fsid> --name <name-of-daemon>

where the ``<fsid>`` corresponds to the cluster ID printed by ``ceph status``.

To fetch all log files of all daemons on a given host, run::

    for name in $(cephadm ls | jq -r '.[].name') ; do
      cephadm logs --fsid <fsid> --name "$name" > "$name";
    done
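
The loop above writes each log into the current directory under the bare
daemon name. As a minimal sketch, the same loop can collect the logs into a
dedicated directory. The function names ``daemon_log_path`` and
``collect_daemon_logs`` and the ``.log`` suffix are illustrative assumptions
and are not part of cephadm; only ``cephadm ls``, ``cephadm logs`` and ``jq``
come from the commands above::

    # Hypothetical helper: build the output path for one daemon's log.
    # $1 = output directory, $2 = daemon name as printed by "cephadm ls".
    daemon_log_path() {
      printf '%s/%s.log\n' "$1" "$2"
    }

    # Hypothetical helper: save the log of every daemon on this host.
    # $1 = cluster fsid, $2 = output directory (created if missing).
    collect_daemon_logs() {
      mkdir -p "$2" || return 1
      for name in $(cephadm ls | jq -r '.[].name'); do
        cephadm logs --fsid "$1" --name "$name" > "$(daemon_log_path "$2" "$name")"
      done
    }

For example, ``collect_daemon_logs <fsid> ./ceph-logs`` would leave one
``<daemon-name>.log`` file per daemon in ``./ceph-logs``.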

Collecting systemd status
-------------------------

To print the state of a systemd unit, run::

    systemctl status "ceph-$(cephadm shell ceph fsid)@<service name>.service"

To fetch the state of all daemons on a given host, run::

    fsid="$(cephadm shell ceph fsid)"
    for name in $(cephadm ls | jq -r '.[].name') ; do
      systemctl status "ceph-$fsid@$name.service" > "$name";
    done
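
When only a quick overview is needed, a one-line summary per daemon can be
easier to scan than the full ``systemctl status`` output. The following is a
sketch under stated assumptions: the helper names ``ceph_unit_name`` and
``summarize_daemon_units`` are hypothetical, and ``systemctl is-active`` is
used here in place of ``systemctl status``::

    # Hypothetical helper: systemd unit name of a cephadm-managed daemon.
    # $1 = cluster fsid, $2 = daemon name as printed by "cephadm ls".
    ceph_unit_name() {
      printf 'ceph-%s@%s.service\n' "$1" "$2"
    }

    # Hypothetical helper: print "<unit> <state>" for every daemon on this host.
    # $1 = cluster fsid.
    summarize_daemon_units() {
      for name in $(cephadm ls | jq -r '.[].name'); do
        unit="$(ceph_unit_name "$1" "$name")"
        # is-active prints the state (active, inactive, failed, ...).
        printf '%s\t%s\n' "$unit" "$(systemctl is-active "$unit")"
      done
    }

For example, ``summarize_daemon_units "$(cephadm shell ceph fsid)"`` prints one
line per daemon.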


List all downloaded container images
------------------------------------

To list all container images that are downloaded on a host:

.. note:: ``Image`` might also be called ``ImageID``.

::

    podman ps -a --format json | jq '.[].Image'
    "docker.io/library/centos:8"
    "registry.opensuse.org/opensuse/leap:15.2"

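Because the field may be named either ``Image`` or ``ImageID``, a small
wrapper can try both. The following is a minimal sketch and not part of
cephadm: the function name ``list_container_images`` is hypothetical, and it
assumes that ``podman`` and ``jq`` are installed::

    # Hypothetical helper: print each image referenced by a container once.
    # Falls back to the ImageID field when the Image field is absent.
    list_container_images() {
      podman ps -a --format json | jq -r '.[] | .Image // .ImageID' | sort -u
    }

The ``//`` operator in ``jq`` yields its right-hand side when the left-hand
side is ``null`` or ``false``, which is what makes the fallback work.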

Manually running containers
---------------------------

Cephadm writes small wrappers that run containers. Refer to
``/var/lib/ceph/<cluster-fsid>/<service-name>/unit.run`` for the
container execution command.
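
To inspect that command for one daemon, the path can be assembled from the
cluster fsid and the name of the daemon directory. This is a minimal sketch;
the helper names ``unit_run_path`` and ``show_unit_run`` are hypothetical::

    # Hypothetical helper: path of the unit.run wrapper for one daemon.
    # $1 = cluster fsid, $2 = name of the daemon directory.
    unit_run_path() {
      printf '/var/lib/ceph/%s/%s/unit.run\n' "$1" "$2"
    }

    # Hypothetical helper: print the wrapper that cephadm wrote for the daemon.
    show_unit_run() {
      cat "$(unit_run_path "$1" "$2")"
    }

Reading the file first, before running anything from it, shows exactly which
container command would be run.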

.. _cephadm-ssh-errors:

SSH errors
----------

Error message::

  xxxxxx.gateway_bootstrap.HostNotFound: -F /tmp/cephadm-conf-kbqvkrkw root@10.10.1.2
  raise OrchestratorError('Failed to connect to %s (%s).  Check that the host is reachable and accepts  connections using the cephadm SSH key' % (host, addr)) from
  orchestrator._interface.OrchestratorError: Failed to connect to 10.10.1.2 (10.10.1.2).  Check that the host is reachable and accepts connections using the cephadm SSH key

To resolve this, work through the following steps:

1. Ensure cephadm has an SSH identity key::

     [root@mon1 ~]# cephadm shell -- ceph config-key get mgr/cephadm/ssh_identity_key > key
     INFO:cephadm:Inferring fsid f8edc08a-7f17-11ea-8707-000c2915dd98
     INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15 obtained 'mgr/cephadm/ssh_identity_key'
     [root@mon1 ~]# chmod 0600 key

   If this fails, cephadm doesn't have a key. Fix this by running the following command::

     [root@mon1 ~]# cephadm shell -- ceph cephadm generate-ssh-key

   or::

     [root@mon1 ~]# cat key | cephadm shell -- ceph cephadm set-ssk-key -i -

2. Ensure that the ssh config is correct::

     [root@mon1 ~]# cephadm shell -- ceph cephadm get-ssh-config > config

3. Verify that we can connect to the host::

     [root@mon1 ~]# ssh -F config -i key root@mon1
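
The key, config, and connection steps above can be chained into one check.
The following is a sketch under stated assumptions, not a cephadm feature: the
function name ``check_cephadm_ssh`` is hypothetical, the files are written to a
scratch directory instead of the current directory, and the remote command is
simply ``true``::

    # Hypothetical helper: repeat the manual SSH check for one host.
    # $1 = host to test (for example "mon1"); connects as root, as above.
    check_cephadm_ssh() {
      workdir="$(mktemp -d)" || return 1
      status=1
      if cephadm shell -- ceph config-key get mgr/cephadm/ssh_identity_key > "$workdir/key" &&
         chmod 0600 "$workdir/key" &&
         cephadm shell -- ceph cephadm get-ssh-config > "$workdir/config"; then
        status=0
        # Run a no-op remotely; keep ssh's exit status if it fails.
        ssh -F "$workdir/config" -i "$workdir/key" "root@$1" true || status=$?
      fi
      rm -rf "$workdir"   # do not leave a copy of the private key behind
      return "$status"
    }

A zero exit status from ``check_cephadm_ssh mon1`` means that the key and
config retrieved from the cluster were accepted by the host.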


Verifying that the Public Key is Listed in the authorized_keys file
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To verify that the public key is in the ``authorized_keys`` file, run the following commands::

     [root@mon1 ~]# cephadm shell -- ceph config-key get mgr/cephadm/ssh_identity_pub > key.pub
     [root@mon1 ~]# grep -F "$(cat key.pub)" /root/.ssh/authorized_keys

Failed to infer CIDR network error
----------------------------------

If you see this error::

   ERROR: Failed to infer CIDR network for mon ip ***; pass --skip-mon-network to configure it later

Or this error::

   Must set public_network config option or specify a CIDR network, ceph addrvec, or plain IP

This means that you must run a command of this form::

  ceph config set mon public_network <mon_network>

For more detail on operations of this kind, see :ref:`deploy_additional_monitors`.
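
A mistyped network value is easy to miss, so it can help to check the syntax
before calling ``ceph``. The following is a minimal sketch that handles only a
single IPv4 network such as ``10.10.1.0/24`` (an example value); the function
names ``is_ipv4_cidr`` and ``set_mon_public_network`` are hypothetical::

    # Hypothetical helper: loose syntax check for an IPv4 CIDR network.
    # It does not verify that each octet is at most 255.
    is_ipv4_cidr() {
      printf '%s\n' "$1" |
        grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}/([0-9]|[12][0-9]|3[0-2])$'
    }

    # Hypothetical wrapper: reject malformed values before calling ceph.
    # $1 = network in CIDR notation.
    set_mon_public_network() {
      if is_ipv4_cidr "$1"; then
        ceph config set mon public_network "$1"
      else
        echo "not an IPv4 CIDR network: $1" >&2
        return 1
      fi
    }

With this wrapper, a plain address such as ``10.10.1.2`` is rejected because
it lacks the ``/<prefix>`` part.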

Accessing the admin socket
--------------------------

Each Ceph daemon provides an admin socket that bypasses the
MONs (see :ref:`rados-monitoring-using-admin-socket`).

To access the admin socket, first enter the daemon container on the host::

    [root@mon1 ~]# cephadm enter --name <daemon-name>
    [ceph: root@mon1 /]# ceph --admin-daemon /var/run/ceph/ceph-<daemon-name>.asok config show