Mirror of https://github.com/ceph/ceph (synced 2025-01-04 10:12:30 +00:00)

Merge pull request #37610 from anthonyeleven/doc-rados-troubleshooting

doc/rados/troubleshooting: clarity and modernization

Reviewed-by: Zac Dover <zac.dover@gmail.com>

Commit: 9822fb49a7

@@ -5,11 +5,11 @@

.. index:: monitor, high availability

When a cluster encounters monitor-related troubles there's a tendency to
panic, and sometimes with good reason. Losing one or more monitors doesn't
necessarily mean that your cluster is down, so long as a majority are up,
running, and form a quorum.

Regardless of how bad the situation is, the first thing you should do is to
calm down, take a breath, and step through the troubleshooting steps below.


Initial Troubleshooting
@@ -18,32 +18,37 @@ Initial Troubleshooting

**Are the monitors running?**

First of all, we need to make sure the monitor (*mon*) daemon processes
(``ceph-mon``) are running. You would be amazed by how often Ceph admins
forget to start the mons, or to restart them after an upgrade. There's no
shame in that, but try not to lose a couple of hours looking for a deeper
problem. When running Kraken or later releases, also ensure that the manager
daemons (``ceph-mgr``) are running, usually alongside each ``ceph-mon``.
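
For example, on a systemd-based package deployment you can check the daemons on
each node. This is a minimal sketch: it assumes the mon and mgr ids match the
short hostname, which may not be true in your environment::

    systemctl status ceph-mon@$(hostname -s)
    systemctl status ceph-mgr@$(hostname -s)
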
**Are you able to reach the mon nodes?**

Doesn't happen often, but sometimes there are ``iptables`` rules that
block access to mon nodes or TCP ports. These may be leftovers from
prior stress-testing or rule development. Try SSHing into
the server and, if that succeeds, try connecting to the monitor's ports
(``tcp/3300`` and ``tcp/6789``) using ``telnet``, ``nc``, or a similar tool.
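
For example, with ``nc`` from another node (``mymon1`` is a placeholder for one
of your mon hostnames; flags may vary slightly between ``nc`` variants)::

    nc -vz mymon1 3300
    nc -vz mymon1 6789
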
**Does ceph -s run and obtain a reply from the cluster?**

If the answer is yes then your cluster is up and running. One thing you
can take for granted is that the monitors will answer a ``status``
request only if there is a formed quorum. Also check that at least one ``mgr``
daemon is reported as running, ideally all of them.

If ``ceph -s`` hangs without obtaining a reply from the cluster
or shows ``fault`` messages, then it is likely that your monitors
are either down completely or only a fraction are up -- a fraction
insufficient to form a majority quorum. This check will connect to an
arbitrary mon; in rare cases it may be illuminating to bind to specific
mons in sequence by adding e.g. ``-m mymon1`` to the command.
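
For example, reusing the placeholder mon name from above::

    ceph -s
    ceph -s -m mymon1
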
**What if ceph -s doesn't come back?**

If you haven't gone through all the steps so far, please go back and do so.

@@ -53,11 +58,7 @@ Initial Troubleshooting

perform this for each monitor in the cluster. In section `Understanding
mon_status`_ we will explain how to interpret the output of this command.

You may instead SSH into each mon node and query the daemon's admin socket.


Using the monitor's admin socket
@@ -66,15 +67,16 @@ Using the monitor's admin socket

The admin socket allows you to interact with a given daemon directly using a
Unix socket file. This file can be found in your monitor's ``run`` directory.
By default, the admin socket will be kept in ``/var/run/ceph/ceph-mon.ID.asok``,
but it may be elsewhere if you have overridden the default directory. If you
don't find it there, check your ``ceph.conf`` for an alternative path or
run::

    ceph-conf --name mon.ID --show-config-value admin_socket

Bear in mind that the admin socket will be available only while the monitor
daemon is running. When the monitor is properly shut down, the admin socket
will be removed. If, however, the monitor is not running and the admin socket
persists, it is likely that the monitor was improperly shut down.
Regardless, if the monitor is not running, you will not be able to use the
admin socket, with ``ceph`` likely returning ``Error 111: Connection Refused``.
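
For example, to query a monitor through its admin socket (replace ``ID`` with
your mon's identifier; the socket path below assumes the default location)::

    ceph daemon mon.ID mon_status
    # or point at the socket file directly:
    ceph --admin-daemon /var/run/ceph/ceph-mon.ID.asok mon_status
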
@@ -170,10 +172,10 @@ How to troubleshoot this?

First, make sure ``mon.a`` is running.

Second, make sure you are able to connect to ``mon.a``'s node from the
other mon nodes. Check the TCP ports as well. Check ``iptables`` and
``nf_conntrack`` on all nodes and ensure that you are not
dropping/rejecting connections.
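
For example, on each mon node you might confirm that the mon is listening and
that no firewall rule is in the way (a minimal sketch; rule sets and tooling
vary by distribution)::

    ss -tlnp | grep ceph-mon
    iptables -L -nv | grep -E '3300|6789'
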
If this initial troubleshooting doesn't solve your problems, then it's
time to go deeper.

@@ -182,7 +184,7 @@ How to troubleshoot this?

socket as explained in `Using the monitor's admin socket`_ and
`Understanding mon_status`_.

If the monitor is out of the quorum, its state should be one of
``probing``, ``electing`` or ``synchronizing``. If it happens to be either
``leader`` or ``peon``, then the monitor believes itself to be in quorum, while
the remaining cluster is sure it is not; or maybe it got into the quorum
@@ -193,17 +195,17 @@ What if the state is ``probing``?

This means the monitor is still looking for the other monitors. Every time
you start a monitor, the monitor will stay in this state for some time
while trying to connect to the rest of the monitors specified in the ``monmap``.
The time a monitor will spend in this state can vary. For instance, on
a single-monitor cluster (never do this in production),
the monitor will pass through the probing state almost instantaneously.
In a multi-monitor cluster, the monitors will stay in this state until they
find enough monitors to form a quorum -- this means that if you have 2 out
of 3 monitors down, the one remaining monitor will stay in this state
indefinitely until you bring one of the other monitors up.

If you have a quorum, the starting daemon should be able to find the
other monitors quickly, as long as they can be reached. If your
monitor is stuck probing and you have gone through all the communication
troubleshooting, then there is a fair chance that the monitor is trying
to reach the other monitors at a wrong address. ``mon_status`` outputs the
@@ -218,43 +220,45 @@ What if the state is ``probing``?

What if state is ``electing``?

This means the monitor is in the middle of an election. With recent Ceph
releases these typically complete quickly, but at times the monitors can
get stuck in what is known as an *election storm*. This can indicate
clock skew among the monitor nodes; jump to
`Clock Skews`_ for more information. If all your clocks are properly
synchronized, you should search the mailing lists and tracker.
This is not a state that is likely to persist, and aside from
(*really*) old bugs there is no obvious reason besides clock skew
why this would happen. Worst case, if there are enough surviving mons,
stop the problematic one while you investigate.

What if state is ``synchronizing``?

This means the monitor is catching up with the rest of the cluster in
order to join the quorum. Time to synchronize is a function of the size
of your monitor store, and thus of cluster size and state, so if you have a
large or degraded cluster this may take a while.

If you notice that the monitor jumps from ``synchronizing`` to
``electing`` and then back to ``synchronizing``, then you do have a
problem: the cluster state may be advancing (i.e., generating new maps)
too fast for the synchronization process to keep up. This was more common
in Ceph's early days (Cuttlefish), but since then the synchronization process
has been refactored and enhanced to avoid this dynamic. If you experience
this in later versions, please let us know via the bug tracker, and bring some
logs (see `Preparing your logs`_).

What if state is ``leader`` or ``peon``?

This should not happen: famous last words. If it does, however, it likely
has a lot to do with clock skew -- see `Clock Skews`_. If you are not
suffering from clock skew, then please prepare your logs (see
`Preparing your logs`_) and reach out to the community.

Recovering a Monitor's Broken ``monmap``
----------------------------------------

This is how a ``monmap`` usually looks, depending on the number of
monitors::

@@ -267,19 +271,20 @@ monitors::

    2: 127.0.0.1:6795/0 mon.c

This may not be what you have, however. For instance, in some versions of
early Cuttlefish there was a bug that could cause your ``monmap``
to be nullified: completely filled with zeros. This means that not even
``monmaptool`` would be able to make sense of cold, hard, inscrutable zeros.
It's also possible to end up with a monitor with a severely outdated monmap,
notably if the node has been down for months while you fight with your vendor's
TAC. The subject ``ceph-mon`` daemon might be unable to find the surviving
monitors (e.g., say ``mon.c`` is down; you add a new monitor ``mon.d``,
then remove ``mon.a``, then add a new monitor ``mon.e`` and remove
``mon.b``; you will end up with a totally different monmap from the one
``mon.c`` knows).

In this situation you have two possible solutions:

Scrap the monitor and redeploy

You should only take this route if you are positive that you won't
lose the information kept by that monitor; that you have other monitors
@@ -321,38 +326,60 @@ Inject a monmap into the monitor

Clock Skews
-----------

Monitor operation can be severely affected by clock skew among the quorum's
mons, as the Paxos consensus algorithm requires tight time alignment.
Skew can result in weird behavior with no obvious
cause. To avoid such issues, you must run a clock synchronization tool
on your monitor nodes: ``chrony`` or the legacy ``ntpd``. Be sure to
configure the mon nodes with the ``iburst`` option and multiple peers:

* Each other
* Internal ``NTP`` servers
* Multiple external, public pool servers
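
A minimal ``chrony`` configuration sketch along these lines (all hostnames are
placeholders; adapt to your environment)::

    # /etc/chrony.conf (excerpt)
    server ntp1.example.internal iburst
    server ntp2.example.internal iburst
    pool 2.pool.ntp.org iburst
    # peer with the other mon nodes
    peer mymon2
    peer mymon3
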
For good measure, *all* nodes in your cluster should also sync against
internal and external servers, and perhaps even your mons. ``NTP`` servers
should run on bare metal; VM virtualized clocks are not suitable for steady
timekeeping. Visit `https://www.ntp.org <https://www.ntp.org>`_ for more info. Your
organization may already have quality internal ``NTP`` servers you can use.
Sources for ``NTP`` server appliances include:

* Microsemi (formerly Symmetricom) `https://microsemi.com <https://www.microsemi.com/product-directory/3425-timing-synchronization>`_
* EndRun `https://endruntechnologies.com <https://endruntechnologies.com/products/ntp-time-servers>`_
* Netburner `https://www.netburner.com <https://www.netburner.com/products/network-time-server/pk70-ex-ntp-network-time-server>`_


What's the maximum tolerated clock skew?

By default the monitors will allow clocks to drift up to 0.05 seconds (50 ms).


Can I increase the maximum tolerated clock skew?

The maximum tolerated clock skew is configurable via the
``mon-clock-drift-allowed`` option, and
although you *CAN* you almost certainly *SHOULDN'T*. The clock skew mechanism
is in place because clock-skewed monitors are likely to misbehave. We, as
developers and QA aficionados, are comfortable with the current default
value, as it will alert the user before the monitors get out of hand. Changing
this value may cause unforeseen effects on the
stability of the monitors and overall cluster health.
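
On releases with the centralized configuration database (Mimic and later) you
can at least inspect the current value before considering any change, for
example::

    ceph config get mon mon_clock_drift_allowed
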
How do I know there's a clock skew?

The monitors will warn you via the cluster status ``HEALTH_WARN``. ``ceph health
detail`` or ``ceph status`` should show something like::

    mon.c addr 10.10.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s)

That means that ``mon.c`` has been flagged as suffering from a clock skew.

On releases beginning with Luminous you can issue the
``ceph time-sync-status`` command to check status. Note that the lead mon
is typically the one with the numerically lowest IP address. It will always
show ``0``: the reported offsets of other mons are relative to
the lead mon, not to any external reference source.
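
For example, to compare Ceph's view with the local time daemon's view
(``chronyc`` assumes you run chrony; use ``ntpq -p`` with ntpd)::

    ceph time-sync-status
    chronyc sources -v
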

What should I do if there's a clock skew?

@@ -2,21 +2,22 @@

Troubleshooting OSDs
====================

Before troubleshooting your OSDs, first check your monitors and network. If
you execute ``ceph health`` or ``ceph -s`` on the command line and Ceph shows
``HEALTH_OK``, it means that the monitors have a quorum.
If you don't have a monitor quorum or if there are errors with the monitor
status, `address the monitor issues first <../troubleshooting-mon>`_.
Check your networks to ensure they
are running properly, because networks may have a significant impact on OSD
operation and performance. Look for dropped packets on the host side
and CRC errors on the switch side.


Obtaining Data About OSDs
=========================

A good first step in troubleshooting your OSDs is to obtain topology information in
addition to the information you collected while `monitoring your OSDs`_
(e.g., ``ceph osd tree``).

@@ -29,7 +30,7 @@ If you haven't changed the default path, you can find Ceph log files at

    ls /var/log/ceph

If you don't see enough log detail, you can change your logging level. See
`Logging and Debugging`_ for details to ensure that Ceph performs adequately
under high logging volume.

Admin Socket
------------

Use the admin socket tool to retrieve runtime information. For details, list
the sockets for your Ceph daemons::

    ls /var/run/ceph

@@ -51,7 +52,6 @@ Alternatively, you can specify a ``{socket-file}`` (e.g., something in ``/var/ru

    ceph daemon {socket-file} help
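
For example, against a hypothetical ``osd.0`` on the local node::

    ceph daemon osd.0 config show | head
    ceph daemon osd.0 perf dump
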
The admin socket, among other things, allows you to:

- List your configuration at runtime
@@ -83,7 +83,7 @@ Use `iostat`_ to identify I/O-related issues. ::

Diagnostic Messages
-------------------

To retrieve diagnostic messages from the kernel, use ``dmesg`` with ``less``, ``more``, ``grep``
or ``tail``. For example::

    dmesg | grep scsi

@@ -99,7 +99,18 @@ maintenance, set the cluster to ``noout`` first::

    ceph osd set noout

On Luminous or newer releases it is safer to set the flag only on affected OSDs.
You can do this individually ::

    ceph osd add-noout osd.0
    ceph osd rm-noout osd.0

Or an entire CRUSH bucket at a time. Say you're going to take down
``prod-ceph-data1701`` to add RAM ::

    ceph osd set-group noout prod-ceph-data1701

Once the flag is set you can begin stopping the OSDs within the
failure domain that requires maintenance work. ::

    stop ceph-osd id={num}

@@ -114,6 +125,7 @@ Once you have completed your maintenance, restart the OSDs. ::

Finally, you must unset the cluster from ``noout``. ::

    ceph osd unset noout
    ceph osd unset-group noout prod-ceph-data1701


@@ -135,11 +147,11 @@ If you start your cluster and an OSD won't start, check the following:

  (e.g., ``host`` not ``hostname``, etc.).

- **Check Paths:** Check the paths in your configuration, and the actual
  paths themselves for data and metadata (journals, WAL, DB). If you separate the OSD data from
  the metadata and there are errors in your configuration file or in the
  actual mounts, you may have trouble starting OSDs. If you want to store the
  metadata on a separate block device, you should partition or LVM your
  drive and assign one partition per OSD.

- **Check Max Threadcount:** If you have a node with a lot of OSDs, you may be
  hitting the default maximum number of threads (e.g., usually 32k), especially
@@ -150,81 +162,118 @@ If you start your cluster and an OSD won't start, check the following:

    sysctl -w kernel.pid_max=4194303

  If increasing the maximum thread count resolves the issue, you can make it
  permanent by including a ``kernel.pid_max`` setting in a file under ``/etc/sysctl.d`` or
  within the master ``/etc/sysctl.conf`` file. For example::

    kernel.pid_max = 4194303

- **Check ``nf_conntrack``:** This connection tracking and limiting system
  is the bane of many production Ceph clusters, and can be insidious in that
  everything is fine at first. As cluster topology and client workload
  grow, mysterious and intermittent connection failures and performance
  glitches manifest, becoming worse over time and at certain times of day.
  Check ``syslog`` history for table fillage events. You can mitigate this
  bother by raising ``nf_conntrack_max`` to a much higher value via ``sysctl``.
  Be sure to raise ``nf_conntrack_buckets`` correspondingly to
  ``nf_conntrack_max / 4``, which may require action outside of ``sysctl``, e.g.
  ``echo 131072 > /sys/module/nf_conntrack/parameters/hashsize``
  (see the example after this list).
  More interdictive but fussier is to blacklist the associated kernel modules
  to disable processing altogether. This is fragile in that the modules
  vary among kernel versions, as does the order in which they must be listed.
  Even when blacklisted there are situations in which ``iptables`` or ``docker``
  may activate connection tracking anyway, so a "set and forget" strategy for
  the tunables is advised. On modern systems this will not consume appreciable
  resources.

- **Kernel Version:** Identify the kernel version and distribution you
  are using. Ceph uses some third party tools by default, which may be
  buggy or may conflict with certain distributions and/or kernel
  versions (e.g., Google ``gperftools`` and ``TCMalloc``). Check the
  `OS recommendations`_ and the release notes for each Ceph version
  to ensure you have addressed any issues related to your kernel.

- **Segment Fault:** If there is a segment fault, increase log levels
  and start the problematic daemon(s) again. If segment faults recur,
  search the Ceph bug tracker `https://tracker.ceph.com/projects/ceph <https://tracker.ceph.com/projects/ceph/>`_
  and the ``dev`` and ``ceph-users`` mailing list archives `https://ceph.io/resources <https://ceph.io/resources>`_.
  If this is truly a new and unique
  failure, post to the ``dev`` email list and provide the specific Ceph
  release being run, ``ceph.conf`` (with secrets XXX'd out),
  your monitor status output and excerpts from your log file(s).
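
The ``nf_conntrack`` tuning referenced above, as a minimal sketch (the values
are illustrative, not recommendations; size them for your workload)::

    sysctl -w net.netfilter.nf_conntrack_max=524288
    echo 131072 > /sys/module/nf_conntrack/parameters/hashsize
    # persist across reboots, e.g. in /etc/sysctl.d/90-conntrack.conf:
    #   net.netfilter.nf_conntrack_max = 524288
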

An OSD Failed
-------------

When a ``ceph-osd`` process dies, surviving ``ceph-osd`` daemons will report
to the mons that it appears down, which will in turn surface the new status
via the ``ceph health`` command::

    ceph health
    HEALTH_WARN 1/3 in osds are down

Specifically, you will get a warning whenever there are OSDs marked ``in``
and ``down``. You can identify which are ``down`` with::

    ceph health detail
    HEALTH_WARN 1/3 in osds are down
    osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080

or ::

    ceph osd tree down

If there is a drive
failure or other fault preventing ``ceph-osd`` from functioning or
restarting, an error message should be present in its log file under
``/var/log/ceph``.

If the daemon stopped because of a heartbeat failure or ``suicide timeout``,
the underlying drive or filesystem may be unresponsive. Check ``dmesg``
and ``syslog`` output for drive or other kernel errors. You may need to
specify something like ``dmesg -T`` to get timestamps, otherwise it's
easy to mistake old errors for new.

If the problem is a software error (failed assertion or other
unexpected error), search the archives and tracker as above, and
report it to the `ceph-devel`_ email list if there's no clear fix or
existing bug.

No Free Drive Space
-------------------

Ceph prevents you from writing to a full OSD so that you don't lose data.
In an operational cluster, you should receive a warning when your cluster's OSDs
and pools approach the full ratio. The ``mon osd full ratio`` defaults to
``0.95``, or 95% of capacity before it stops clients from writing data.
The ``mon osd backfillfull ratio`` defaults to ``0.90``, or 90% of
capacity above which backfills will not start. The
OSD nearfull ratio defaults to ``0.85``, or 85% of capacity
when it generates a health warning.

Note that individual OSDs within a cluster will vary in how much data Ceph
allocates to them. This utilization can be displayed for each OSD with ::

    ceph osd df

Overall cluster / pool fullness can be checked with ::

    ceph df

Pay close attention to the **most full** OSDs, not the percentage of raw space
used as reported by ``ceph df``. It only takes one outlier OSD filling up to
fail writes to its pool. The space available to each pool as reported by
``ceph df`` considers the ratio settings relative to the *most full* OSD that
is part of a given pool. The distribution can be flattened by progressively
moving data from overfull to underfull OSDs using the ``reweight-by-utilization``
command. With Ceph releases beginning with later revisions of Luminous, one can also
exploit the ``ceph-mgr`` ``balancer`` module to perform this task automatically
and rather effectively.
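
A hedged sketch of both approaches (test the reweight first; the balancer
commands assume a Luminous or later release with the ``balancer`` ``ceph-mgr``
module available)::

    ceph osd test-reweight-by-utilization   # dry run
    ceph osd reweight-by-utilization

    # you may first need: ceph mgr module enable balancer
    ceph balancer mode upmap
    ceph balancer on
    ceph balancer status
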
The ratios can be adjusted:

::

@@ -232,6 +281,15 @@ OSD ``nearfull ratio`` using these commands:

    ceph osd set-full-ratio <float[0.0-1.0]>
    ceph osd set-backfillfull-ratio <float[0.0-1.0]>

Full cluster issues can arise when an OSD fails, either as a test or organically
within a small and/or very full or unbalanced cluster. When an OSD or node
holds an outsize percentage of the cluster's data, the ``nearfull`` and ``full``
ratios may be exceeded as a result of component failures or even natural growth.
If you are testing how Ceph reacts to OSD failures on a small
cluster, you should leave ample free disk space and consider temporarily
lowering the OSD ``full ratio``, OSD ``backfillfull ratio``, and
OSD ``nearfull ratio``.

Full ``ceph-osds`` will be reported by ``ceph health``::

    ceph health
@@ -245,16 +303,17 @@ Or::

    osd.4 is backfill full at 91%
    osd.2 is near full at 87%

The best way to deal with a full cluster is to add capacity via new OSDs, enabling
the cluster to redistribute data to newly available storage.

If you cannot start a legacy Filestore OSD because it is full, you may reclaim
some space by deleting a few placement group directories in the full OSD.

.. important:: If you choose to delete a placement group directory on a full OSD,
   **DO NOT** delete the same placement group directory on another full OSD, or
   **YOU WILL LOSE DATA**. You **MUST** maintain at least one copy of your data on
   at least one OSD. This is a rare and extreme intervention, and is not to be
   undertaken lightly.

See `Monitor Config Reference`_ for additional details.

@@ -275,8 +334,8 @@ and your OSDs are running. Check to see if OSDs are throttling recovery traffic.

Networking Issues
-----------------

Ceph is a distributed storage system, so it relies upon networks for OSD peering
and replication, recovery from faults, and periodic heartbeats. Networking
issues can cause OSD latency and flapping OSDs. See `Flapping OSDs`_ for
details.

@@ -295,15 +354,17 @@ Check network statistics. ::

Drive Configuration
-------------------

A SAS or SATA storage drive should only house one OSD; NVMe drives readily
handle two or more. Read and write throughput can bottleneck if other processes
share the drive, including journals / metadata, operating systems, Ceph monitors,
``syslog`` logs, other OSDs, and non-Ceph processes.

Ceph acknowledges writes *after* journaling, so fast SSDs are an
attractive option to accelerate the response time--particularly when
using the ``XFS`` or ``ext4`` file systems for legacy Filestore OSDs.
By contrast, the ``Btrfs``
file system can write and journal simultaneously. (Note, however, that
we recommend against using ``Btrfs`` for production deployments.)

.. note:: Partitioning a drive does not change its total throughput or
   sequential read/write limits. Running a journal in a separate partition
@@ -313,20 +374,22 @@ we recommend against using ``btrfs`` for production deployments.)

Bad Sectors / Fragmented Disk
-----------------------------

Check your drives for bad blocks, fragmentation, and other errors that can cause
performance to drop substantially. Invaluable tools include ``dmesg``, ``syslog``
logs, and ``smartctl`` (from the ``smartmontools`` package).
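
For example (``/dev/sda`` is a placeholder device)::

    dmesg -T | grep -iE 'error|sd[a-z]'
    smartctl -a /dev/sda
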

Co-resident Monitors/OSDs
-------------------------

Monitors are relatively lightweight processes, but they issue lots of
``fsync()`` calls,
which can interfere with other workloads, particularly if monitors run on the
same drive as an OSD. Additionally, if you run monitors on the same host as
OSDs, you may incur performance issues related to:

- Running an older kernel (pre-3.0)
- Running a kernel with no ``syncfs(2)`` syscall.

In these cases, multiple OSDs running on the same host can drag each other down
by doing lots of commits. That often leads to bursty writes.

@@ -335,10 +398,10 @@ by doing lots of commits. That often leads to the bursty writes.

Co-resident Processes
---------------------

Spinning up co-resident processes (convergence) such as a cloud-based solution, virtual
machines and other applications that write data to Ceph while operating on the
same hardware as OSDs can introduce significant OSD latency. Generally, we
recommend optimizing hosts for use with Ceph and using other hosts for other
processes. The practice of separating Ceph operations from other applications
may help improve performance and may streamline troubleshooting and maintenance.

@@ -377,13 +440,15 @@ might not have a recent enough version of ``glibc`` to support ``syncfs(2)``.

Filesystem Issues
-----------------

Currently, we recommend deploying clusters with the BlueStore back end.
When running a pre-Luminous release, or if you have a specific reason to deploy
OSDs with the previous Filestore back end, we recommend ``XFS``.

We recommend against using ``Btrfs`` or ``ext4``. The ``Btrfs`` filesystem has
many attractive features, but bugs may lead to
performance issues and spurious ENOSPC errors. We do not recommend
``ext4`` for Filestore OSDs because ``xattr`` limitations break support for long
object names, which are needed for RGW.

For more information, see `Filesystem Recommendations`_.

@@ -393,21 +458,23 @@ For more information, see `Filesystem Recommendations`_.

Insufficient RAM
----------------

We recommend a *minimum* of 4GB of RAM per OSD daemon and suggest rounding up
from 6-8GB. You may notice that during normal operations, ``ceph-osd``
processes only use a fraction of that amount.
Unused RAM makes it tempting to use the excess RAM for co-resident
applications or to skimp on each node's memory capacity. However,
when OSDs experience recovery their memory utilization spikes. If
there is insufficient RAM available, OSD performance will slow considerably
and the daemons may even crash or be killed by the Linux ``OOM Killer``.
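
With BlueStore OSDs you can check the memory budget each OSD tries to stay
within (shown here via the central config database; the default is roughly
4GB)::

    ceph config get osd osd_memory_target
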

Blocked Requests or Slow Requests
---------------------------------

If a ``ceph-osd`` daemon is slow to respond to a request, messages will be logged
noting ops that are taking too long. The warning threshold
defaults to 30 seconds and is configurable via the ``osd op complaint time``
setting. When this happens, the cluster log will receive messages.

Legacy versions of Ceph complain about ``old requests``::

@@ -421,7 +488,7 @@ New versions of Ceph complain about ``slow requests``::

Possible causes include:

- A failing drive (check ``dmesg`` output)
- A bug in the kernel file system (check ``dmesg`` output)
- An overloaded cluster (check system load, iostat, etc.)
- A bug in the ``ceph-osd`` daemon.

@@ -432,6 +499,7 @@ Possible solutions:

- Upgrade kernel
- Upgrade Ceph
- Restart OSDs
- Replace failed or failing components
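
To see what a slow OSD is actually doing, you can also query its admin socket
(``osd.0`` is a placeholder)::

    ceph daemon osd.0 dump_ops_in_flight
    ceph daemon osd.0 dump_historic_ops
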

Debugging Slow Requests
-----------------------

@@ -450,7 +518,7 @@ Events from the Messenger layer:

- ``initiated``: This is identical to ``header_read``. The existence of both is a
  historical oddity.

Events from the OSD as it processes ops:

- ``queued_for_pg``: The op has been put into the queue for processing by its PG.
- ``reached_pg``: The PG has started doing the op.
@@ -461,7 +529,7 @@ Events from the OSD as it prepares operations:

  is now being performed.
- ``waiting for subops from``: The op has been sent to replica OSDs.

Events from ``Filestore``:

- ``commit_queued_for_journal_write``: The op has been given to the FileStore.
- ``write_thread_in_journal_buffer``: The op is in the journal's buffer and waiting
@@ -469,7 +537,7 @@ Events from the FileStore:

- ``journaled_completion_queued``: The op was journaled to disk and its callback
  queued for invocation.

Events from the OSD after data has been given to underlying storage:

- ``op_commit``: The op has been committed (i.e. written to journal) by the
  primary OSD.
@@ -486,26 +554,47 @@ the internal code (such as passing data across locks into new threads).

Flapping OSDs
=============

When OSDs peer and check heartbeats, they use the cluster (back-end)
network when it's available. See `Monitor/OSD Interaction`_ for details.

We have traditionally recommended separate *public* (front-end) and *private*
(cluster / back-end / replication) networks:

#. Segregation of heartbeat and replication / recovery traffic (private)
   from client and OSD <-> mon traffic (public). This helps keep one
   from DoS-ing the other, which could in turn result in a cascading failure.

#. Additional throughput for both public and private traffic.

When common networking technologies were 100Mb/s and 1Gb/s, this separation
was often critical. With today's 10Gb/s, 40Gb/s, and 25/50/100Gb/s
networks, the above capacity concerns are often diminished or even obviated.
For example, if your OSD nodes have two network ports, dedicating one to
the public and the other to the private network means no path redundancy.
This degrades your ability to weather network maintenance and failures without
significant cluster or client impact. Consider instead using both links
for just a public network: with bonding (LACP) or equal-cost routing (e.g. FRR)
you reap the benefits of increased throughput headroom, fault tolerance, and
reduced OSD flapping.

When a private network (or even a single host link) fails or degrades while the
public network operates normally, OSDs may not handle this situation well. What
happens is that OSDs use the public network to report each other ``down`` to
the monitors, while marking themselves ``up``. The monitors then send out,
again on the public network, an updated cluster map with affected OSDs marked
``down``. These OSDs reply to the monitors "I'm not dead yet!", and the cycle
repeats. We call this scenario 'flapping', and it can be difficult to isolate
and remediate. With no private network, this irksome dynamic is avoided:
OSDs are generally either ``up`` or ``down`` without flapping.

If something does cause OSDs to 'flap' (repeatedly getting marked ``down`` and
then ``up`` again), you can force the monitors to halt the flapping by
temporarily freezing their states::

    ceph osd set noup      # prevent OSDs from getting marked up
    ceph osd set nodown    # prevent OSDs from getting marked down

These flags are recorded in the osdmap::

    ceph osd dump | grep flags
    flags no-up,no-down

@@ -526,9 +615,12 @@ from eventually being marked ``out`` (regardless of what the current value for

prevents OSDs from being marked ``in`` on boot, and any daemons that
started while the flag was set will remain that way.
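
Once the underlying network issue has been found and fixed, remember to clear
the flags, for example::

    ceph osd unset noup
    ceph osd unset nodown
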
.. note:: The causes and effects of flapping can be somewhat mitigated through
   careful adjustments to the ``mon_osd_down_out_subtree_limit``,
   ``mon_osd_reporter_subtree_level``, and ``mon_osd_min_down_reporters``.
   Derivation of optimal settings depends on cluster size, topology, and the
   Ceph release in use. Their interactions are subtle and beyond the scope of
   this document.

.. _iostat: https://en.wikipedia.org/wiki/Iostat