Merge pull request #37610 from anthonyeleven/doc-rados-troubleshooting

doc/rados/troubleshooting: clarity and modernization

Reviewed-by: Zac Dover <zac.dover@gmail.com>
zdover23 2020-10-14 02:32:24 +10:00 committed by GitHub
commit 9822fb49a7
2 changed files with 310 additions and 191 deletions


@ -5,11 +5,11 @@
.. index:: monitor, high availability
When a cluster encounters monitor-related troubles there's a tendency to
panic, and sometimes with good reason. Losing one or more monitors doesn't
necessarily mean that your cluster is down, so long as a majority are up,
running, and form a quorum.
Regardless of how bad the situation is, the first thing you should do is to
calm down, take a breath, and step through the troubleshooting steps below.
Initial Troubleshooting
@ -18,32 +18,37 @@ Initial Troubleshooting
**Are the monitors running?**
First of all, we need to make sure the monitor (*mon*) daemon processes
(``ceph-mon``) are running. You would be amazed by how often Ceph admins
forget to start the mons, or to restart them after an upgrade. There's no
shame in that, but try not to lose a couple of hours looking for a deeper problem.
When running Kraken or later releases, also ensure that the manager
daemons (``ceph-mgr``) are running, usually alongside each ``ceph-mon``.
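For example, on systemd-based hosts something like the following can confirm
that the daemons are active (unit instance names vary with how the cluster was
deployed, so treat these as placeholders)::
systemctl status ceph-mon@$(hostname -s)
systemctl status ceph-mgr@$(hostname -s)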
**Are you able to reach the mon nodes?**
It doesn't happen often, but sometimes there are ``iptables`` rules that
block access to mon nodes or TCP ports. These may be leftovers from
prior stress-testing or rule development. Try SSHing into
the server and, if that succeeds, try connecting to the monitor's ports
(``tcp/3300`` and ``tcp/6789``) using ``telnet``, ``nc``, or a similar tool.
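For example, assuming a mon host named ``mon1`` (substitute your own host
names or addresses)::
nc -vz mon1 3300
nc -vz mon1 6789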
**Does ceph -s run and obtain a reply from the cluster?**
If the answer is yes then your cluster is up and running. One thing you
can take for granted is that the monitors will only answer to a ``status``
request if there is a formed quorum. Also check that at least one ``mgr``
daemon is reported as running, ideally all of them.
If ``ceph -s`` hangs without obtaining a reply from the cluster
or showing ``fault`` messages, then it is likely that your monitors
are either down completely or just a fraction are up -- a fraction
insufficient to form a majority quorum. This check will connect to an
arbitrary mon; in rare cases it may be illuminating to bind to specific
mons in sequence by adding e.g. ``-m mymon1`` to the command.
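For example, with placeholder host names, binding to each mon in turn::
ceph -s -m mon1.example.com
ceph -s -m mon2.example.com
ceph -s -m mon3.example.com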
**What if ceph -s doesn't come back?**
If you haven't gone through all the steps so far, please go back and do so.
@ -53,11 +58,7 @@ Initial Troubleshooting
perform this for each monitor in the cluster. In section `Understanding
mon_status`_ we will explain how to interpret the output of this command.
You may instead SSH into each mon node and query the daemon's admin socket.
Using the monitor's admin socket
@ -66,15 +67,16 @@ Using the monitor's admin socket
The admin socket allows you to interact with a given daemon directly using a
Unix socket file. This file can be found in your monitor's ``run`` directory.
By default, the admin socket will be kept in ``/var/run/ceph/ceph-mon.ID.asok``
but this may be elsewhere if you have overridden the default directory. If you
don't find it there, check your ``ceph.conf`` for an alternative path or
run::
ceph-conf --name mon.ID --show-config-value admin_socket
Bear in mind that the admin socket will be available only while the monitor
daemon is running. When the monitor is properly shut down, the admin socket
will be removed. If, however, the monitor is not running and the admin socket
persists, it is likely that the monitor was improperly shut down.
Regardless, if the monitor is not running, you will not be able to use the
admin socket, with ``ceph`` likely returning ``Error 111: Connection Refused``.
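If the monitor *is* running, a quick sanity check through the socket looks
something like this (``mon.a`` is a placeholder for your own monitor ID)::
ceph daemon mon.a mon_status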
@ -170,10 +172,10 @@ How to troubleshoot this?
First, make sure ``mon.a`` is running.
Second, make sure you are able to connect to ``mon.a``'s node from the
other mon nodes. Check the TCP ports as well. Check ``iptables`` and
``nf_conntrack`` on all nodes and ensure that you are not
dropping/rejecting connections.
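For example (exact firewall tooling varies by distribution)::
iptables -L -n -v | grep -E '3300|6789'
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max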
If this initial troubleshooting doesn't solve your problems, then it's
time to go deeper.
@ -182,7 +184,7 @@ How to troubleshoot this?
socket as explained in `Using the monitor's admin socket`_ and
`Understanding mon_status`_.
If the monitor is out of the quorum, its state should be one of
``probing``, ``electing`` or ``synchronizing``. If it happens to be either
``leader`` or ``peon``, then the monitor believes itself to be in the quorum, while
the remaining cluster is sure it is not; or maybe it got into the quorum
@ -193,17 +195,17 @@ What if the state is ``probing``?
This means the monitor is still looking for the other monitors. Every time
you start a monitor, the monitor will stay in this state for some time
while trying to connect to the rest of the monitors specified in the ``monmap``.
The time a monitor will spend in this state can vary. For instance, when on
a single-monitor cluster (never do this in production),
the monitor will pass through the probing state almost instantaneously.
In a multi-monitor cluster, the monitors will stay in this state until they
find enough monitors to form a quorum -- this means that if you have 2 out
of 3 monitors down, the one remaining monitor will stay in this state
indefinitely until you bring one of the other monitors up.
If you have a quorum, the starting daemon should be able to find the
other monitors quickly, as long as they can be reached. If your
monitor is stuck probing and you have gone through all the communication
troubleshooting, then there is a fair chance that the monitor is trying
to reach the other monitors on a wrong address. ``mon_status`` outputs the
@ -218,43 +220,45 @@ What if the state is ``probing``?
What if state is ``electing``?
This means the monitor is in the middle of an election. With recent Ceph
releases these typically complete quickly, but at times the monitors can
get stuck in what is known as an *election storm*. This can indicate
clock skew among the monitor nodes; jump to
`Clock Skews`_ for more information. If all your clocks are properly
synchronized, you should search the mailing lists and tracker.
This is not a state that is likely to persist, and aside from
(*really*) old bugs there is no obvious reason besides clock skew for
this to happen. Worst case, if there are enough surviving mons,
down the problematic one while you investigate.
What if state is ``synchronizing``?
This means the monitor is catching up with the rest of the cluster in
order to join the quorum. Time to synchronize is a function of the size
of your monitor store and thus of cluster size and state, so if you have a
large or degraded cluster, this may take a while.
If you notice that the monitor jumps from ``synchronizing`` to
``electing`` and then back to ``synchronizing``, then you do have a
problem: the cluster state may be advancing (i.e., generating new maps)
too fast for the synchronization process to keep up. This was a more common
occurrence in the early days (Cuttlefish), but since then the synchronization process
has been refactored and enhanced to avoid this dynamic. If you experience
this in later versions, please let us know via the bug tracker. And bring some logs
(see `Preparing your logs`_).
What if state is ``leader`` or ``peon``?
This should not happen: famous last words. If it does, however, it likely
has a lot to do with clock skew -- see `Clock Skews`_. If you are not
suffering from clock skew, then please prepare your logs (see
`Preparing your logs`_) and reach out to the community.
Recovering a Monitor's Broken ``monmap``
----------------------------------------
This is how a ``monmap`` usually looks, depending on the number of
monitors::
@ -267,19 +271,20 @@ monitors::
2: 127.0.0.1:6795/0 mon.c
This may not be what you have, however. For instance, in some versions of
early Cuttlefish there was a bug that could cause your ``monmap``
to be nullified. Completely filled with zeros. This means that not even
``monmaptool`` would be able to make sense of cold, hard, inscrutable zeros.
It's also possible to end up with a monitor with a severely outdated monmap,
notably if the node has been down for months while you fight with your vendor's
TAC. The subject ``ceph-mon`` daemon might be unable to find the surviving
monitors (e.g., say ``mon.c`` is down; you add a new monitor ``mon.d``,
then remove ``mon.a``, then add a new monitor ``mon.e`` and remove
``mon.b``; you will end up with a totally different monmap from the one
``mon.c`` knows).
In this situation, you have two possible solutions:
Scrap the monitor and redeploy
You should only take this route if you are positive that you won't
lose the information kept by that monitor; that you have other monitors
@ -321,38 +326,60 @@ Inject a monmap into the monitor
Clock Skews
------------
Monitor operation can be severely affected by clock skew among the quorum's
mons, as the PAXOS consensus algorithm requires tight time alignment.
Skew can result in weird behavior with no obvious
cause. To avoid such issues, you must run a clock synchronization tool
on your monitor nodes: ``Chrony`` or the legacy ``ntpd``. Be sure to
configure the mon nodes with the ``iburst`` option and multiple peers:
* Each other
* Internal ``NTP`` servers
* Multiple external, public pool servers
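For example, a minimal ``chrony.conf`` sketch along these lines (the host
names are placeholders for your own mons and time sources)::
# other mons in this cluster
peer mon2.example.com iburst
peer mon3.example.com iburst
# internal NTP servers
server ntp1.example.com iburst
server ntp2.example.com iburst
# external public pool
pool pool.ntp.org iburst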
For good measure, *all* nodes in your cluster should also sync against
internal and external servers, and perhaps even your mons. ``NTP`` servers
should run on bare metal; VM virtualized clocks are not suitable for steady
timekeeping. Visit `https://www.ntp.org <https://www.ntp.org>`_ for more info. Your
organization may already have quality internal ``NTP`` servers you can use.
Sources for ``NTP`` server appliances include:
* Microsemi (formerly Symmetricom) `https://microsemi.com <https://www.microsemi.com/product-directory/3425-timing-synchronization>`_
* EndRun `https://endruntechnologies.com <https://endruntechnologies.com/products/ntp-time-servers>`_
* Netburner `https://www.netburner.com <https://www.netburner.com/products/network-time-server/pk70-ex-ntp-network-time-server>`_
What's the maximum tolerated clock skew?
By default, the monitors will allow clocks to drift up to 0.05 seconds (50 ms).
Can I increase the maximum tolerated clock skew?
The maximum tolerated clock skew is configurable via the
``mon-clock-drift-allowed`` option, and
although you *CAN*, you almost certainly *SHOULDN'T*. The clock skew mechanism
is in place because clock-skewed monitors are likely to misbehave. We, as
developers and QA aficionados, are comfortable with the current default
value, as it will alert the user before the monitors get out of hand. Changing
this value may cause unforeseen effects on the
stability of the monitors and overall cluster health.
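If you nonetheless decide to experiment, for example on a throwaway test
cluster, a hedged sketch using the centralized config database available on
recent releases (the underscore form of the option name is assumed)::
ceph config set mon mon_clock_drift_allowed 0.1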
How do I know there's a clock skew?
The monitors will warn you via a ``HEALTH_WARN`` cluster status. ``ceph health
detail`` or ``ceph status`` should show something like::
mon.c addr 10.10.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s)
That means that ``mon.c`` has been flagged as suffering from a clock skew.
On releases beginning with Luminous you can issue the
``ceph time-sync-status`` command to check status. Note that the lead mon
is typically the one with the numerically lowest IP address. It will always
show ``0``: the reported offsets of other mons are relative to
the lead mon, not to any external reference source.
What should I do if there's a clock skew?


@ -2,21 +2,22 @@
Troubleshooting OSDs
======================
Before troubleshooting your OSDs, first check your monitors and network. If
you execute ``ceph health`` or ``ceph -s`` on the command line and Ceph shows
``HEALTH_OK``, it means that the monitors have a quorum.
If you don't have a monitor quorum or if there are errors with the monitor
status, `address the monitor issues first <../troubleshooting-mon>`_.
Check your networks to ensure they
are running properly, because networks may have a significant impact on OSD
operation and performance. Look for dropped packets on the host side
and CRC errors on the switch side.
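For example (the interface name is a placeholder)::
ip -s link show dev eth0
ethtool -S eth0 | grep -iE 'drop|error'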
Obtaining Data About OSDs
=========================
A good first step in troubleshooting your OSDs is to obtain topology information in
addition to the information you collected while `monitoring your OSDs`_
(e.g., ``ceph osd tree``).
@ -29,7 +30,7 @@ If you haven't changed the default path, you can find Ceph log files at
ls /var/log/ceph
If you don't see enough log detail, you can change your logging level. See
`Logging and Debugging`_ for details to ensure that Ceph performs adequately
under high logging volume.
@ -38,7 +39,7 @@ Admin Socket
------------
Use the admin socket tool to retrieve runtime information. For details, list
the sockets for your Ceph daemons::
ls /var/run/ceph
@ -51,7 +52,6 @@ Alternatively, you can specify a ``{socket-file}`` (e.g., something in ``/var/ru
ceph daemon {socket-file} help
The admin socket, among other things, allows you to:
- List your configuration at runtime
@ -83,7 +83,7 @@ Use `iostat`_ to identify I/O-related issues. ::
Diagnostic Messages
-------------------
To retrieve diagnostic messages from the kernel, use ``dmesg`` with ``less``, ``more``, ``grep``
or ``tail``. For example::
dmesg | grep scsi
@ -99,7 +99,18 @@ maintenance, set the cluster to ``noout`` first::
ceph osd set noout
On Luminous or newer releases, it is safer to set the flag only on affected OSDs.
You can do this individually ::
ceph osd add-noout osd.0
ceph osd rm-noout osd.0
Or you can set it for an entire CRUSH bucket at a time. Say you're going to take down
``prod-ceph-data1701`` to add RAM ::
ceph osd set-group noout prod-ceph-data1701
Once the flag is set, you can begin stopping the OSDs within the
failure domain that requires maintenance work. ::
stop ceph-osd id={num}
@ -114,6 +125,7 @@ Once you have completed your maintenance, restart the OSDs. ::
Finally, you must unset the cluster from ``noout``. ::
ceph osd unset noout
ceph osd unset-group noout prod-ceph-data1701
@ -135,11 +147,11 @@ If you start your cluster and an OSD won't start, check the following:
(e.g., ``host`` not ``hostname``, etc.).
- **Check Paths:** Check the paths in your configuration, and the actual
paths themselves for data and metadata (journals, WAL, DB). If you separate the OSD data from
the metadata and there are errors in your configuration file or in the
actual mounts, you may have trouble starting OSDs. If you want to store the
metadata on a separate block device, you should partition the drive or use
LVM, and assign one partition or logical volume per OSD.
- **Check Max Threadcount:** If you have a node with a lot of OSDs, you may be
hitting the default maximum number of threads (e.g., usually 32k), especially
@ -150,81 +162,118 @@ If you start your cluster and an OSD won't start, check the following:
sysctl -w kernel.pid_max=4194303
If increasing the maximum thread count resolves the issue, you can make it
permanent by including a ``kernel.pid_max`` setting in a file under ``/etc/sysctl.d`` or
within the master ``/etc/sysctl.conf`` file. For example::
kernel.pid_max = 4194303
- **Check nf_conntrack:** This connection tracking and limiting system
is the bane of many production Ceph clusters, and can be insidious in that
everything is fine at first. As cluster topology and client workload
grow, mysterious and intermittent connection failures and performance
glitches manifest, becoming worse over time and at certain times of day.
Check ``syslog`` history for "table full" events. You can mitigate this
bother by raising ``nf_conntrack_max`` to a much higher value via ``sysctl``.
Be sure to raise ``nf_conntrack_buckets`` accordingly to
``nf_conntrack_max / 4``, which may require action outside of ``sysctl``, e.g.
``echo 131072 > /sys/module/nf_conntrack/parameters/hashsize`` (see the
sketch after this list).
More interdictive but fussier is to blacklist the associated kernel modules
to disable processing altogether. This is fragile in that the modules
vary among kernel versions, as does the order in which they must be listed.
Even when blacklisted there are situations in which ``iptables`` or ``docker``
may activate connection tracking anyway, so a "set and forget" strategy for
the tunables is advised. On modern systems this will not consume appreciable
resources.
- **Kernel Version:** Identify the kernel version and distribution you
are using. Ceph uses some third party tools by default, which may be
buggy or may conflict with certain distributions and/or kernel
versions (e.g., Google ``gperftools`` and ``TCMalloc``). Check the
`OS recommendations`_ and the release notes for each Ceph version
to ensure you have addressed any issues related to your kernel.
- **Segment Fault:** If there is a segment fault, increase log levels
and start the problematic daemon(s) again. If segment faults recur,
search the Ceph bug tracker `https://tracker.ceph.com/projects/ceph <https://tracker.ceph.com/projects/ceph/>`_
and the ``dev`` and ``ceph-users`` mailing list archives `https://ceph.io/resources <https://ceph.io/resources>`_.
If this is truly a new and unique
failure, post to the ``dev`` email list and provide the specific Ceph
release being run, ``ceph.conf`` (with secrets XXX'd out),
your monitor status output, and excerpts from your log file(s).
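Regarding the ``nf_conntrack`` check above, a hedged sketch of the sort of
persistent settings involved (values are illustrative only; size them for your
environment)::
# /etc/sysctl.d/90-conntrack.conf
net.netfilter.nf_conntrack_max = 524288
# nf_conntrack_buckets (the hashsize module parameter) is typically sized to
# nf_conntrack_max / 4 and on many kernels must be set separately, e.g.:
#   echo 131072 > /sys/module/nf_conntrack/parameters/hashsize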
An OSD Failed
-------------
When a ``ceph-osd`` process dies, surviving ``ceph-osd`` daemons will report
to the mons that it appears down, which will in turn surface the new status
via the ``ceph health`` command::
ceph health
HEALTH_WARN 1/3 in osds are down
Specifically, you will get a warning whenever there are OSDs marked ``in``
and ``down``. You can identify which are ``down`` with::
ceph health detail
HEALTH_WARN 1/3 in osds are down
osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080
or ::
ceph osd tree down
If there is a drive
failure or other fault preventing ``ceph-osd`` from functioning or
restarting, an error message should be present in its log file under
``/var/log/ceph``.
If the daemon stopped because of a heartbeat failure or ``suicide timeout``,
the underlying drive or filesystem may be unresponsive. Check ``dmesg``
and ``syslog`` output for drive or other kernel errors. You may need to
specify something like ``dmesg -T`` to get timestamps; otherwise it's
easy to mistake old errors for new.
If the problem is a software error (failed assertion or other
unexpected error), search the archives and tracker as above, and
report it to the `ceph-devel`_ email list if there's no clear fix or
existing bug.
No Free Drive Space
-------------------
Ceph prevents you from writing to a full OSD so that you don't lose data.
In an operational cluster, you should receive a warning when your cluster's OSDs
and pools approach the full ratio. The ``mon osd full ratio`` defaults to
``0.95``, or 95% of capacity before it stops clients from writing data.
The ``mon osd backfillfull ratio`` defaults to ``0.90``, or 90% of
capacity above which backfills will not start. The
OSD nearfull ratio defaults to ``0.85``, or 85% of capacity, at which
point it generates a health warning.
Note that individual OSDs within a cluster will vary in how much data Ceph
allocates to them. This utilization can be displayed for each OSD with ::
ceph osd df
Overall cluster / pool fullness can be checked with ::
ceph df
Pay close attention to the **most full** OSDs, not the percentage of raw space
used as reported by ``ceph df``. It only takes one outlier OSD filling up to
fail writes to its pool. The space available to each pool as reported by
``ceph df`` considers the ratio settings relative to the *most full* OSD that
is part of a given pool. The distribution can be flattened by progressively
moving data from overfull to underfull OSDs using the ``reweight-by-utilization``
command. With Ceph releases beginning with later revisions of Luminous one can also
exploit the ``ceph-mgr`` ``balancer`` module to perform this task automatically
and rather effectively.
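As a hedged sketch of the two approaches (command availability varies by
release)::
ceph osd test-reweight-by-utilization   # dry run of the legacy approach
ceph osd reweight-by-utilization
ceph balancer mode upmap                # mgr balancer module, Luminous and later
ceph balancer on
ceph balancer status
Note that ``upmap`` mode additionally requires that all clients be Luminous or newer.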
The ratios can be adjusted:
::
@ -232,6 +281,15 @@ OSD ``nearfull ratio`` using these commands:
ceph osd set-full-ratio <float[0.0-1.0]>
ceph osd set-backfillfull-ratio <float[0.0-1.0]>
Full cluster issues can arise when an OSD fails either as a test or organically
within a small and/or very full or unbalanced cluster. When an OSD or node
holds an outsize percentage of the cluster's data, the ``nearfull`` and ``full``
ratios may be exceeded as a result of component failures or even natural growth.
If you are testing how Ceph reacts to OSD failures on a small
cluster, you should leave ample free disk space and consider temporarily
lowering the OSD ``full ratio``, OSD ``backfillfull ratio`` and
OSD ``nearfull ratio``.
Full ``ceph-osds`` will be reported by ``ceph health``::
ceph health
@ -245,16 +303,17 @@ Or::
osd.4 is backfill full at 91%
osd.2 is near full at 87%
The best way to deal with a full cluster is to add capacity via new OSDs, enabling
the cluster to redistribute data to newly available storage.
If you cannot start a legacy Filestore OSD because it is full, you may reclaim
some space by deleting a few placement group directories in the full OSD.
.. important:: If you choose to delete a placement group directory on a full OSD,
**DO NOT** delete the same placement group directory on another full OSD, or
**YOU WILL LOSE DATA**. You **MUST** maintain at least one copy of your data on
at least one OSD. This is a rare and extreme intervention, and is not to be
undertaken lightly.
See `Monitor Config Reference`_ for additional details.
@ -275,8 +334,8 @@ and your OSDs are running. Check to see if OSDs are throttling recovery traffic.
Networking Issues
-----------------
Ceph is a distributed storage system, so it relies upon networks for OSD peering
and replication, recovery from faults, and periodic heartbeats. Networking
issues can cause OSD latency and flapping OSDs. See `Flapping OSDs`_ for
details.
@ -295,15 +354,17 @@ Check network statistics. ::
Drive Configuration
-------------------
A SAS or SATA storage drive should only house one OSD; NVMe drives readily
handle two or more. Read and write throughput can bottleneck if other processes
share the drive, including journals / metadata, operating systems, Ceph monitors,
``syslog`` logs, other OSDs, and non-Ceph processes.
Ceph acknowledges writes *after* journaling, so fast SSDs are an
attractive option to accelerate the response time--particularly when
using the ``XFS`` or ``ext4`` file systems for legacy Filestore OSDs.
By contrast, the ``Btrfs``
file system can write and journal simultaneously. (Note, however, that
we recommend against using ``Btrfs`` for production deployments.)
.. note:: Partitioning a drive does not change its total throughput or
sequential read/write limits. Running a journal in a separate partition
@ -313,20 +374,22 @@ we recommend against using ``btrfs`` for production deployments.)
Bad Sectors / Fragmented Disk
-----------------------------
Check your drives for bad blocks, fragmentation, and other errors that can cause
performance to drop substantially. Invaluable tools include ``dmesg``, ``syslog``
logs, and ``smartctl`` (from the ``smartmontools`` package).
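For example (the device path is a placeholder)::
smartctl -a /dev/sdX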
Co-resident Monitors/OSDs
-------------------------
Monitors are relatively lightweight processes, but they issue lots of
``fsync()`` calls,
which can interfere with other workloads, particularly if monitors run on the
same drive as an OSD. Additionally, if you run monitors on the same host as
OSDs, you may incur performance issues related to:
- Running an older kernel (pre-3.0)
- Running a kernel with no ``syncfs(2)`` syscall.
In these cases, multiple OSDs running on the same host can drag each other down
by doing lots of commits. That often leads to bursty writes.
@ -335,10 +398,10 @@ by doing lots of commits. That often leads to the bursty writes.
Co-resident Processes
---------------------
Spinning up co-resident processes (convergence) such as a cloud-based solution, virtual
machines and other applications that write data to Ceph while operating on the
same hardware as OSDs can introduce significant OSD latency. Generally, we
recommend optimizing hosts for use with Ceph and using other hosts for other
processes. The practice of separating Ceph operations from other applications
may help improve performance and may streamline troubleshooting and maintenance.
@ -377,13 +440,15 @@ might not have a recent enough version of ``glibc`` to support ``syncfs(2)``.
Filesystem Issues
-----------------
Currently, we recommend deploying clusters with the BlueStore back end.
When running a pre-Luminous release, or if you have a specific reason to deploy
OSDs with the legacy Filestore back end, we recommend ``XFS``.
We recommend against using ``Btrfs`` or ``ext4``. The ``Btrfs`` filesystem has
many attractive features, but bugs may lead to
performance issues and spurious ENOSPC errors. We do not recommend
``ext4`` for Filestore OSDs because ``xattr`` limitations break support for long
object names, which are needed for RGW.
For more information, see `Filesystem Recommendations`_.
@ -393,21 +458,23 @@ For more information, see `Filesystem Recommendations`_.
Insufficient RAM
----------------
We recommend a *minimum* of 4GB of RAM per OSD daemon, and suggest rounding up
to 6-8GB. You may notice that during normal operations, ``ceph-osd``
processes only use a fraction of that amount.
Unused RAM makes it tempting to use the excess RAM for co-resident
applications or to skimp on each node's memory capacity. However,
when OSDs experience recovery, their memory utilization spikes. If
there is insufficient RAM available, OSD performance will slow considerably
and the daemons may even crash or be killed by the Linux ``OOM Killer``.
Blocked Requests or Slow Requests
---------------------------------
If a ``ceph-osd`` daemon is slow to respond to a request, messages will be logged
noting ops that are taking too long. The warning threshold
defaults to 30 seconds and is configurable via the ``osd op complaint time``
setting. When this happens, the cluster log will receive messages.
Legacy versions of Ceph complain about ``old requests``::
@ -421,7 +488,7 @@ New versions of Ceph complain about ``slow requests``::
Possible causes include:
- A failing drive (check ``dmesg`` output)
- A bug in the kernel file system (check ``dmesg`` output)
- An overloaded cluster (check system load, iostat, etc.)
- A bug in the ``ceph-osd`` daemon.
@ -432,6 +499,7 @@ Possible solutions:
- Upgrade kernel
- Upgrade Ceph
- Restart OSDs
- Replace failed or failing components
Debugging Slow Requests
-----------------------
@ -450,7 +518,7 @@ Events from the Messenger layer:
- ``initiated``: This is identical to ``header_read``. The existence of both is a
historical oddity.
Events from the OSD as it processes ops:
- ``queued_for_pg``: The op has been put into the queue for processing by its PG.
- ``reached_pg``: The PG has started doing the op.
@ -461,7 +529,7 @@ Events from the OSD as it prepares operations:
is now being performed.
- ``waiting for subops from``: The op has been sent to replica OSDs.
Events from ``Filestore``:
- ``commit_queued_for_journal_write``: The op has been given to the FileStore.
- ``write_thread_in_journal_buffer``: The op is in the journal's buffer and waiting
@ -469,7 +537,7 @@ Events from the FileStore:
- ``journaled_completion_queued``: The op was journaled to disk and its callback
queued for invocation.
Events from the OSD after data has been given to underlying storage:
- ``op_commit``: The op has been committed (i.e. written to journal) by the
primary OSD.
@ -486,26 +554,47 @@ the internal code (such as passing data across locks into new threads).
Flapping OSDs
=============
When OSDs peer and check heartbeats, they use the cluster (back-end)
network when it's available. See `Monitor/OSD Interaction`_ for details.
We have traditionally recommended separate *public* (front-end) and *private*
(cluster / back-end / replication) networks:
#. Segregation of heartbeat and replication / recovery traffic (private)
from client and OSD <-> mon traffic (public). This helps keep one
from DoS-ing the other, which could in turn result in a cascading failure.
#. Additional throughput for both public and private traffic.
When common networking technologies were 100Mb/s and 1Gb/s, this separation
was often critical. With today's 10Gb/s, 40Gb/s, and 25/50/100Gb/s
networks, the above capacity concerns are often diminished or even obviated.
For example, if your OSD nodes have two network ports, dedicating one to
the public and the other to the private network means no path redundancy.
This degrades your ability to weather network maintenance and failures without
significant cluster or client impact. Consider instead using both links
for just a public network: with bonding (LACP) or equal-cost routing (e.g. FRR)
you reap the benefits of increased throughput headroom, fault tolerance, and
reduced OSD flapping.
When a private network (or even a single host link) fails or degrades while the
public network operates normally, OSDs may not handle this situation well. What
happens is that OSDs use the public network to report each other ``down`` to
the monitors, while marking themselves ``up``. The monitors then send out,
again on the public network, an updated cluster map with affected OSDs marked
``down``. These OSDs reply to the monitors "I'm not dead yet!", and the cycle
repeats. We call this scenario 'flapping', and it can be difficult to isolate
and remediate. With no private network, this irksome dynamic is avoided:
OSDs are generally either ``up`` or ``down`` without flapping.
If something does cause OSDs to 'flap' (repeatedly getting marked ``down`` and
then ``up`` again), you can force the monitors to halt the flapping by
temporarily freezing their states::
ceph osd set noup # prevent OSDs from getting marked up
ceph osd set nodown # prevent OSDs from getting marked down
These flags are recorded in the osdmap::
ceph osd dump | grep flags
flags no-up,no-down
@ -526,9 +615,12 @@ from eventually being marked ``out`` (regardless of what the current value for
prevents OSDs from being marked ``in`` on boot, and any daemons that
started while the flag was set will remain that way.
.. note:: The causes and effects of flapping can be somewhat mitigated through
careful adjustment of the ``mon_osd_down_out_subtree_limit``,
``mon_osd_reporter_subtree_level``, and ``mon_osd_min_down_reporters`` options.
Derivation of optimal settings depends on cluster size, topology, and the
Ceph release in use. Their interactions are subtle and beyond the scope of
this document.
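If you do explore these, a hedged example of merely inspecting the current
values on releases with the centralized config database::
ceph config get mon mon_osd_reporter_subtree_level
ceph config get mon mon_osd_min_down_reporters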
.. _iostat: https://en.wikipedia.org/wiki/Iostat