Troubleshooting
===============

Sometimes you need to investigate why a cephadm command failed or why
a specific service no longer runs properly.

Because cephadm deploys daemons as containers, troubleshooting those daemons
is slightly different from troubleshooting daemons that run directly on the
host. Here are a few tools and commands to help investigate issues.

Pausing or disabling cephadm
----------------------------

If something goes wrong and cephadm is behaving in a way you do
not like, you can pause most background activity with::

  ceph orch pause

This stops any changes, but cephadm will still periodically check hosts to
refresh its inventory of daemons and devices. You can disable cephadm
completely with::

  ceph orch set backend ''
  ceph mgr module disable cephadm

This disables all of the ``ceph orch ...`` CLI commands, but the previously
deployed daemon containers will continue to exist and start as they
did before.
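
The pause and disable steps described here can be reversed. The following is a
minimal sketch, assuming cephadm was disabled with the two commands shown
earlier in this section. The ``reenable_cephadm`` function and the ``CEPH``
variable are illustrative names and are not part of Ceph; setting
``CEPH=echo`` prints the commands instead of running them.

.. code-block:: bash

  # Sketch only: reenable_cephadm and CEPH are illustrative names.
  # Set CEPH=echo to preview the commands without running them.
  CEPH="${CEPH:-ceph}"

  reenable_cephadm() {
      # Load the cephadm mgr module again.
      $CEPH mgr module enable cephadm &&
      # Make cephadm the orchestrator backend again.
      $CEPH orch set backend cephadm &&
      # Resume background activity in case it was paused.
      $CEPH orch resume
  }

If cephadm was only paused and not disabled, ``ceph orch resume`` on its own
should be enough.
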

Per-service and per-daemon events
---------------------------------

To aid in debugging failed daemon deployments, cephadm stores events per
service and per daemon. These events often contain information that is
relevant to troubleshooting. To see the events of a specific service, run::

  ceph orch ls --service_name=<service-name> --format yaml

For example:

.. code-block:: yaml

  service_type: alertmanager
  service_name: alertmanager
  placement:
    hosts:
    - unknown_host
  status:
    ...
    running: 1
    size: 1
  events:
  - 2021-02-01T08:58:02.741162 service:alertmanager [INFO] "service was created"
  - '2021-02-01T12:09:25.264584 service:alertmanager [ERROR] "Failed to apply: Cannot
    place <AlertManagerSpec for service_name=alertmanager> on unknown_host: Unknown hosts"'

To see the events of a specific daemon, run::

  ceph orch ps --service-name <service-name> --daemon-id <daemon-id> --format yaml

For example:

.. code-block:: yaml

  daemon_type: mds
  daemon_id: cephfs.hostname.ppdhsz
  hostname: hostname
  status_desc: running
  ...
  events:
  - 2021-02-01T08:59:43.845866 daemon:mds.cephfs.hostname.ppdhsz [INFO] "Reconfigured
    mds.cephfs.hostname.ppdhsz on host 'hostname'"

Checking cephadm logs
---------------------

You can monitor the cephadm log in real time with::

  ceph -W cephadm

You can see the last few messages with::

  ceph log last cephadm

If you have enabled logging to files, you can see a cephadm log file called
``ceph.cephadm.log`` on monitor hosts (see :ref:`cephadm-logs`).

Gathering log files
-------------------

Use journalctl to gather the log files of all daemons.

.. note:: By default cephadm now stores logs in journald. This means
   that you will no longer find daemon logs in ``/var/log/ceph/``.

To read the log file of one specific daemon, run::

  cephadm logs --name <name-of-daemon>

Note: this only works when run on the same host where the daemon is running. To
get logs of a daemon running on a different host, give the ``--fsid`` option::

  cephadm logs --fsid <fsid> --name <name-of-daemon>

where ``<fsid>`` corresponds to the cluster ID printed by ``ceph status``.

To fetch the log files of all daemons on a given host, run::

  for name in $(cephadm ls | jq -r '.[].name') ; do
    cephadm logs --fsid <fsid> --name "$name" > "$name";
  done

Collecting systemd status
-------------------------

To print the state of a systemd unit, run::

  systemctl status "ceph-$(cephadm shell ceph fsid)@<service name>.service";

To fetch the state of all daemons on a given host, run::

  fsid="$(cephadm shell ceph fsid)"
  for name in $(cephadm ls | jq -r '.[].name') ; do
    systemctl status "ceph-$fsid@$name.service" > "$name";
  done
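
Both commands in this section rely on the unit naming pattern
``ceph-<fsid>@<daemon-name>.service``. The following is a minimal sketch of a
helper that builds this name so that it can be reused with ``systemctl`` or
``journalctl``. The ``ceph_unit`` function is a hypothetical helper and not a
cephadm command.

.. code-block:: bash

  # Sketch only: ceph_unit is an illustrative helper name.
  # Prints the systemd unit name for a cluster fsid ($1) and daemon name ($2).
  ceph_unit() {
      printf 'ceph-%s@%s.service\n' "$1" "$2"
  }

  # Possible usage (the daemon name mon.host1 is a placeholder):
  #   fsid="$(cephadm shell ceph fsid)"
  #   systemctl status "$(ceph_unit "$fsid" mon.host1)"
  #   journalctl -u "$(ceph_unit "$fsid" mon.host1)" --since "1 hour ago"
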

List all downloaded container images
------------------------------------

To list all container images that are downloaded on a host:

.. note:: ``Image`` might also be called ``ImageID``

::

  podman ps -a --format json | jq '.[].Image'
  "docker.io/library/centos:8"
  "registry.opensuse.org/opensuse/leap:15.2"

Manually running containers
---------------------------

Cephadm writes small wrappers that run a container. Refer to
``/var/lib/ceph/<cluster-fsid>/<service-name>/unit.run`` for the
container execution command.

.. _cephadm-ssh-errors:

SSH errors
----------

Error message::

  execnet.gateway_bootstrap.HostNotFound: -F /tmp/cephadm-conf-73z09u6g -i /tmp/cephadm-identity-ky7ahp_5 root@10.10.1.2
  ...
  raise OrchestratorError(msg) from e
  orchestrator._interface.OrchestratorError: Failed to connect to 10.10.1.2 (10.10.1.2).
  Please make sure that the host is reachable and accepts connections using the cephadm SSH key
  ...

Things users can do:

1. Ensure cephadm has an SSH identity key::

     [root@mon1 ~]# cephadm shell -- ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
     INFO:cephadm:Inferring fsid f8edc08a-7f17-11ea-8707-000c2915dd98
     INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15 obtained 'mgr/cephadm/ssh_identity_key'
     [root@mon1 ~]# chmod 0600 ~/cephadm_private_key

   If this fails, cephadm doesn't have a key. Fix this by running the following command::

     [root@mon1 ~]# cephadm shell -- ceph cephadm generate-ssh-key

   or::

     [root@mon1 ~]# cat ~/cephadm_private_key | cephadm shell -- ceph cephadm set-priv-key -i -

2. Ensure that the SSH config is correct::

     [root@mon1 ~]# cephadm shell -- ceph cephadm get-ssh-config > config

3. Verify that we can connect to the host::

     [root@mon1 ~]# ssh -F config -i ~/cephadm_private_key root@mon1

Verifying that the Public Key is Listed in the authorized_keys file
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To verify that the public key is in the ``authorized_keys`` file, run the following commands::

  [root@mon1 ~]# cephadm shell -- ceph cephadm get-pub-key > ~/ceph.pub
  [root@mon1 ~]# grep "`cat ~/ceph.pub`" /root/.ssh/authorized_keys
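
The ``grep`` call in this check interprets the key as a regular expression. As
a hedged alternative, the same check can use fixed-string matching
(``grep -F``) and be wrapped in a function so that its exit status is easy to
act on. This is a minimal sketch; the ``key_is_authorized`` function name is
illustrative and not part of cephadm.

.. code-block:: bash

  # Sketch only: key_is_authorized is an illustrative helper name.
  # $1: file holding the public key, $2: authorized_keys file to search.
  # Returns 0 if a line of $2 contains the key as a literal string.
  key_is_authorized() {
      grep -qF -f "$1" "$2"
  }

  # Possible usage:
  #   key_is_authorized ~/ceph.pub /root/.ssh/authorized_keys \
  #     && echo "key found" || echo "key missing"
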

Failed to infer CIDR network error
----------------------------------

If you see this error::

  ERROR: Failed to infer CIDR network for mon ip ***; pass --skip-mon-network to configure it later

Or this error::

  Must set public_network config option or specify a CIDR network, ceph addrvec, or plain IP

This means that you must run a command of this form::

  ceph config set mon public_network <mon_network>

For more detail on operations of this kind, see :ref:`deploy_additional_monitors`.

Accessing the admin socket
--------------------------

Each Ceph daemon provides an admin socket that bypasses the
MONs (see :ref:`rados-monitoring-using-admin-socket`).

To access the admin socket, first enter the daemon container on the host::

  [root@mon1 ~]# cephadm enter --name <daemon-name>
  [ceph: root@mon1 /]# ceph --admin-daemon /var/run/ceph/ceph-<daemon-name>.asok config show