Mirror of https://github.com/ceph/ceph (synced 2025-01-04 10:12:30 +00:00)

Merge pull request #37610 from anthonyeleven/doc-rados-troubleshooting

doc/rados/troubleshooting: clarity and modernization

Reviewed-by: Zac Dover <zac.dover@gmail.com>

Commit: 9822fb49a7

@@ -5,11 +5,11 @@

.. index:: monitor, high availability

When a cluster encounters monitor-related troubles there's a tendency to
panic, and sometimes with good reason. Losing one or more monitors doesn't
necessarily mean that your cluster is down, so long as a majority are up,
running, and form a quorum.

Regardless of how bad the situation is, the first thing you should do is to
calm down, take a breath, and step through the troubleshooting steps below.


Initial Troubleshooting
@@ -18,32 +18,37 @@ Initial Troubleshooting

**Are the monitors running?**

First of all, we need to make sure the monitor (*mon*) daemon processes
(``ceph-mon``) are running. You would be amazed by how often Ceph admins
forget to start the mons, or to restart them after an upgrade. There's no
shame in that, but try not to lose a couple of hours looking for a deeper
problem. When running Kraken or later releases, also ensure that the manager
daemons (``ceph-mgr``) are running, usually alongside each ``ceph-mon``.
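
For example, on a systemd-based package deployment you can check the daemons on
each node. This is a minimal sketch: it assumes the mon and mgr ids match the
short hostname, which may not be true in your environment::

    systemctl status ceph-mon@$(hostname -s)
    systemctl status ceph-mgr@$(hostname -s)
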
**Are you able to reach the mon nodes?**

Doesn't happen often, but sometimes there are ``iptables`` rules that
block access to mon nodes or TCP ports. These may be leftovers from
prior stress-testing or rule development. Try SSHing into
the server and, if that succeeds, try connecting to the monitor's ports
(``tcp/3300`` and ``tcp/6789``) using ``telnet``, ``nc``, or a similar tool.
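
For example, with ``nc`` from another node (``mymon1`` is a placeholder for one
of your mon hostnames; flags may vary slightly between ``nc`` variants)::

    nc -vz mymon1 3300
    nc -vz mymon1 6789
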
**Does ceph -s run and obtain a reply from the cluster?**

If the answer is yes then your cluster is up and running. One thing you
can take for granted is that the monitors will answer a ``status``
request only if there is a formed quorum. Also check that at least one ``mgr``
daemon is reported as running, ideally all of them.

If ``ceph -s`` hangs without obtaining a reply from the cluster
or shows ``fault`` messages, then it is likely that your monitors
are either down completely or only a fraction are up -- a fraction
insufficient to form a majority quorum. This check will connect to an
arbitrary mon; in rare cases it may be illuminating to bind to specific
mons in sequence by adding e.g. ``-m mymon1`` to the command.
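
For example, reusing the placeholder mon name from above::

    ceph -s
    ceph -s -m mymon1
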
**What if ceph -s doesn't come back?**

If you haven't gone through all the steps so far, please go back and do so.

@@ -53,11 +58,7 @@ Initial Troubleshooting

perform this for each monitor in the cluster. In section `Understanding
mon_status`_ we will explain how to interpret the output of this command.

You may instead SSH into each mon node and query the daemon's admin socket.


Using the monitor's admin socket
@@ -66,15 +67,16 @@ Using the monitor's admin socket

The admin socket allows you to interact with a given daemon directly using a
Unix socket file. This file can be found in your monitor's ``run`` directory.
By default, the admin socket will be kept in ``/var/run/ceph/ceph-mon.ID.asok``,
but it may be elsewhere if you have overridden the default directory. If you
don't find it there, check your ``ceph.conf`` for an alternative path or
run::

    ceph-conf --name mon.ID --show-config-value admin_socket

Bear in mind that the admin socket will be available only while the monitor
daemon is running. When the monitor is properly shut down, the admin socket
will be removed. If, however, the monitor is not running and the admin socket
persists, it is likely that the monitor was improperly shut down.
Regardless, if the monitor is not running, you will not be able to use the
admin socket, with ``ceph`` likely returning ``Error 111: Connection Refused``.
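
For example, to query a monitor through its admin socket (replace ``ID`` with
your mon's identifier; the socket path below assumes the default location)::

    ceph daemon mon.ID mon_status
    # or point at the socket file directly:
    ceph --admin-daemon /var/run/ceph/ceph-mon.ID.asok mon_status
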
@@ -170,10 +172,10 @@ How to troubleshoot this?

First, make sure ``mon.a`` is running.

Second, make sure you are able to connect to ``mon.a``'s node from the
other mon nodes. Check the TCP ports as well. Check ``iptables`` and
``nf_conntrack`` on all nodes and ensure that you are not
dropping/rejecting connections.
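
For example, on each mon node you might confirm that the mon is listening and
that no firewall rule is in the way (a minimal sketch; rule sets and tooling
vary by distribution)::

    ss -tlnp | grep ceph-mon
    iptables -L -nv | grep -E '3300|6789'
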
If this initial troubleshooting doesn't solve your problems, then it's
time to go deeper.

@@ -182,7 +184,7 @@ How to troubleshoot this?

socket as explained in `Using the monitor's admin socket`_ and
`Understanding mon_status`_.

If the monitor is out of the quorum, its state should be one of
``probing``, ``electing`` or ``synchronizing``. If it happens to be either
``leader`` or ``peon``, then the monitor believes itself to be in quorum, while
the remaining cluster is sure it is not; or maybe it got into the quorum
@@ -193,17 +195,17 @@ What if the state is ``probing``?

This means the monitor is still looking for the other monitors. Every time
you start a monitor, the monitor will stay in this state for some time
while trying to connect to the rest of the monitors specified in the ``monmap``.
The time a monitor will spend in this state can vary. For instance, on
a single-monitor cluster (never do this in production),
the monitor will pass through the probing state almost instantaneously.
In a multi-monitor cluster, the monitors will stay in this state until they
find enough monitors to form a quorum -- this means that if you have 2 out
of 3 monitors down, the one remaining monitor will stay in this state
indefinitely until you bring one of the other monitors up.

If you have a quorum, the starting daemon should be able to find the
other monitors quickly, as long as they can be reached. If your
monitor is stuck probing and you have gone through all the communication
troubleshooting, then there is a fair chance that the monitor is trying
to reach the other monitors at a wrong address. ``mon_status`` outputs the
@@ -218,43 +220,45 @@ What if the state is ``probing``?

What if state is ``electing``?

This means the monitor is in the middle of an election. With recent Ceph
releases these typically complete quickly, but at times the monitors can
get stuck in what is known as an *election storm*. This can indicate
clock skew among the monitor nodes; jump to
`Clock Skews`_ for more information. If all your clocks are properly
synchronized, you should search the mailing lists and tracker.
This is not a state that is likely to persist, and aside from
(*really*) old bugs there is no obvious reason besides clock skew
why this would happen. Worst case, if there are enough surviving mons,
stop the problematic one while you investigate.

What if state is ``synchronizing``?

This means the monitor is catching up with the rest of the cluster in
order to join the quorum. Time to synchronize is a function of the size
of your monitor store, and thus of cluster size and state, so if you have a
large or degraded cluster this may take a while.

If you notice that the monitor jumps from ``synchronizing`` to
``electing`` and then back to ``synchronizing``, then you do have a
problem: the cluster state may be advancing (i.e., generating new maps)
too fast for the synchronization process to keep up. This was more common
in Ceph's early days (Cuttlefish), but since then the synchronization process
has been refactored and enhanced to avoid this dynamic. If you experience
this in later versions, please let us know via the bug tracker, and bring some
logs (see `Preparing your logs`_).

What if state is ``leader`` or ``peon``?

This should not happen: famous last words. If it does, however, it likely
has a lot to do with clock skew -- see `Clock Skews`_. If you are not
suffering from clock skew, then please prepare your logs (see
`Preparing your logs`_) and reach out to the community.

Recovering a Monitor's Broken ``monmap``
----------------------------------------

This is how a ``monmap`` usually looks, depending on the number of
monitors::

@@ -267,19 +271,20 @@ monitors::

    2: 127.0.0.1:6795/0 mon.c

This may not be what you have, however. For instance, in some versions of
early Cuttlefish there was a bug that could cause your ``monmap``
to be nullified: completely filled with zeros. This means that not even
``monmaptool`` would be able to make sense of cold, hard, inscrutable zeros.
It's also possible to end up with a monitor with a severely outdated monmap,
notably if the node has been down for months while you fight with your vendor's
TAC. The subject ``ceph-mon`` daemon might be unable to find the surviving
monitors (e.g., say ``mon.c`` is down; you add a new monitor ``mon.d``,
then remove ``mon.a``, then add a new monitor ``mon.e`` and remove
``mon.b``; you will end up with a totally different monmap from the one
``mon.c`` knows).

In this situation you have two possible solutions:

Scrap the monitor and redeploy

You should only take this route if you are positive that you won't
lose the information kept by that monitor; that you have other monitors
@@ -321,38 +326,60 @@ Inject a monmap into the monitor

Clock Skews
-----------

Monitor operation can be severely affected by clock skew among the quorum's
mons, as the Paxos consensus algorithm requires tight time alignment.
Skew can result in weird behavior with no obvious
cause. To avoid such issues, you must run a clock synchronization tool
on your monitor nodes: ``chrony`` or the legacy ``ntpd``. Be sure to
configure the mon nodes with the ``iburst`` option and multiple peers:

* Each other
* Internal ``NTP`` servers
* Multiple external, public pool servers
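
A minimal ``chrony`` configuration sketch along these lines (all hostnames are
placeholders; adapt to your environment)::

    # /etc/chrony.conf (excerpt)
    server ntp1.example.internal iburst
    server ntp2.example.internal iburst
    pool 2.pool.ntp.org iburst
    # peer with the other mon nodes
    peer mymon2
    peer mymon3
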
For good measure, *all* nodes in your cluster should also sync against
internal and external servers, and perhaps even your mons. ``NTP`` servers
should run on bare metal; VM virtualized clocks are not suitable for steady
timekeeping. Visit `https://www.ntp.org <https://www.ntp.org>`_ for more info. Your
organization may already have quality internal ``NTP`` servers you can use.
Sources for ``NTP`` server appliances include:

* Microsemi (formerly Symmetricom) `https://microsemi.com <https://www.microsemi.com/product-directory/3425-timing-synchronization>`_
* EndRun `https://endruntechnologies.com <https://endruntechnologies.com/products/ntp-time-servers>`_
* Netburner `https://www.netburner.com <https://www.netburner.com/products/network-time-server/pk70-ex-ntp-network-time-server>`_


What's the maximum tolerated clock skew?

By default the monitors will allow clocks to drift up to 0.05 seconds (50 ms).


Can I increase the maximum tolerated clock skew?

The maximum tolerated clock skew is configurable via the
``mon-clock-drift-allowed`` option, and
although you *CAN* you almost certainly *SHOULDN'T*. The clock skew mechanism
is in place because clock-skewed monitors are likely to misbehave. We, as
developers and QA aficionados, are comfortable with the current default
value, as it will alert the user before the monitors get out of hand. Changing
this value may cause unforeseen effects on the
stability of the monitors and overall cluster health.
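
On releases with the centralized configuration database (Mimic and later) you
can at least inspect the current value before considering any change, for
example::

    ceph config get mon mon_clock_drift_allowed
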
How do I know there's a clock skew?

The monitors will warn you via the cluster status ``HEALTH_WARN``. ``ceph health
detail`` or ``ceph status`` should show something like::

    mon.c addr 10.10.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s)

That means that ``mon.c`` has been flagged as suffering from a clock skew.

On releases beginning with Luminous you can issue the
``ceph time-sync-status`` command to check status. Note that the lead mon
is typically the one with the numerically lowest IP address. It will always
show ``0``: the reported offsets of other mons are relative to
the lead mon, not to any external reference source.
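
For example, to compare Ceph's view with the local time daemon's view
(``chronyc`` assumes you run chrony; use ``ntpq -p`` with ntpd)::

    ceph time-sync-status
    chronyc sources -v
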

What should I do if there's a clock skew?

@@ -2,21 +2,22 @@

Troubleshooting OSDs
====================

Before troubleshooting your OSDs, first check your monitors and network. If
you execute ``ceph health`` or ``ceph -s`` on the command line and Ceph shows
``HEALTH_OK``, it means that the monitors have a quorum.
If you don't have a monitor quorum or if there are errors with the monitor
status, `address the monitor issues first <../troubleshooting-mon>`_.
Check your networks to ensure they
are running properly, because networks may have a significant impact on OSD
operation and performance. Look for dropped packets on the host side
and CRC errors on the switch side.


Obtaining Data About OSDs
=========================

A good first step in troubleshooting your OSDs is to obtain topology information in
addition to the information you collected while `monitoring your OSDs`_
(e.g., ``ceph osd tree``).

@@ -29,7 +30,7 @@ If you haven't changed the default path, you can find Ceph log files at

    ls /var/log/ceph

If you don't see enough log detail, you can change your logging level. See
`Logging and Debugging`_ for details to ensure that Ceph performs adequately
under high logging volume.

Admin Socket
------------

Use the admin socket tool to retrieve runtime information. For details, list
the sockets for your Ceph daemons::

    ls /var/run/ceph

@@ -51,7 +52,6 @@ Alternatively, you can specify a ``{socket-file}`` (e.g., something in ``/var/ru

    ceph daemon {socket-file} help
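
For example, against a hypothetical ``osd.0`` on the local node::

    ceph daemon osd.0 config show | head
    ceph daemon osd.0 perf dump
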
The admin socket, among other things, allows you to:

- List your configuration at runtime
@@ -83,7 +83,7 @@ Use `iostat`_ to identify I/O-related issues. ::

Diagnostic Messages
-------------------

To retrieve diagnostic messages from the kernel, use ``dmesg`` with ``less``, ``more``, ``grep``
or ``tail``. For example::

    dmesg | grep scsi

@@ -99,7 +99,18 @@ maintenance, set the cluster to ``noout`` first::

    ceph osd set noout

On Luminous or newer releases it is safer to set the flag only on affected OSDs.
You can do this individually ::

    ceph osd add-noout osd.0
    ceph osd rm-noout osd.0

Or an entire CRUSH bucket at a time. Say you're going to take down
``prod-ceph-data1701`` to add RAM ::

    ceph osd set-group noout prod-ceph-data1701

Once the flag is set you can begin stopping the OSDs within the
failure domain that requires maintenance work. ::

    stop ceph-osd id={num}

@@ -114,6 +125,7 @@ Once you have completed your maintenance, restart the OSDs. ::

Finally, you must unset the cluster from ``noout``. ::

    ceph osd unset noout
    ceph osd unset-group noout prod-ceph-data1701


@@ -135,11 +147,11 @@ If you start your cluster and an OSD won't start, check the following:

  (e.g., ``host`` not ``hostname``, etc.).

- **Check Paths:** Check the paths in your configuration, and the actual
  paths themselves for data and metadata (journals, WAL, DB). If you separate the OSD data from
  the metadata and there are errors in your configuration file or in the
  actual mounts, you may have trouble starting OSDs. If you want to store the
  metadata on a separate block device, you should partition or LVM your
  drive and assign one partition per OSD.

- **Check Max Threadcount:** If you have a node with a lot of OSDs, you may be
  hitting the default maximum number of threads (e.g., usually 32k), especially
@@ -150,81 +162,118 @@ If you start your cluster and an OSD won't start, check the following:

    sysctl -w kernel.pid_max=4194303

  If increasing the maximum thread count resolves the issue, you can make it
  permanent by including a ``kernel.pid_max`` setting in a file under ``/etc/sysctl.d`` or
  within the master ``/etc/sysctl.conf`` file. For example::

    kernel.pid_max = 4194303

- **Check ``nf_conntrack``:** This connection tracking and limiting system
  is the bane of many production Ceph clusters, and can be insidious in that
  everything is fine at first. As cluster topology and client workload
  grow, mysterious and intermittent connection failures and performance
  glitches manifest, becoming worse over time and at certain times of day.
  Check ``syslog`` history for table fillage events. You can mitigate this
  bother by raising ``nf_conntrack_max`` to a much higher value via ``sysctl``.
  Be sure to raise ``nf_conntrack_buckets`` correspondingly to
  ``nf_conntrack_max / 4``, which may require action outside of ``sysctl``, e.g.
  ``echo 131072 > /sys/module/nf_conntrack/parameters/hashsize``
  (see the example after this list).
  More interdictive but fussier is to blacklist the associated kernel modules
  to disable processing altogether. This is fragile in that the modules
  vary among kernel versions, as does the order in which they must be listed.
  Even when blacklisted there are situations in which ``iptables`` or ``docker``
  may activate connection tracking anyway, so a "set and forget" strategy for
  the tunables is advised. On modern systems this will not consume appreciable
  resources.

- **Kernel Version:** Identify the kernel version and distribution you
  are using. Ceph uses some third party tools by default, which may be
  buggy or may conflict with certain distributions and/or kernel
  versions (e.g., Google ``gperftools`` and ``TCMalloc``). Check the
  `OS recommendations`_ and the release notes for each Ceph version
  to ensure you have addressed any issues related to your kernel.

- **Segment Fault:** If there is a segment fault, increase log levels
  and start the problematic daemon(s) again. If segment faults recur,
  search the Ceph bug tracker `https://tracker.ceph.com/projects/ceph <https://tracker.ceph.com/projects/ceph/>`_
  and the ``dev`` and ``ceph-users`` mailing list archives `https://ceph.io/resources <https://ceph.io/resources>`_.
  If this is truly a new and unique
  failure, post to the ``dev`` email list and provide the specific Ceph
  release being run, ``ceph.conf`` (with secrets XXX'd out),
  your monitor status output and excerpts from your log file(s).
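
The ``nf_conntrack`` tuning referenced above, as a minimal sketch (the values
are illustrative, not recommendations; size them for your workload)::

    sysctl -w net.netfilter.nf_conntrack_max=524288
    echo 131072 > /sys/module/nf_conntrack/parameters/hashsize
    # persist across reboots, e.g. in /etc/sysctl.d/90-conntrack.conf:
    #   net.netfilter.nf_conntrack_max = 524288
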

An OSD Failed
-------------

When a ``ceph-osd`` process dies, surviving ``ceph-osd`` daemons will report
to the mons that it appears down, which will in turn surface the new status
via the ``ceph health`` command::

    ceph health
    HEALTH_WARN 1/3 in osds are down

Specifically, you will get a warning whenever there are OSDs marked ``in``
and ``down``. You can identify which are ``down`` with::

    ceph health detail
    HEALTH_WARN 1/3 in osds are down
    osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080

or ::

    ceph osd tree down

If there is a drive
failure or other fault preventing ``ceph-osd`` from functioning or
restarting, an error message should be present in its log file under
``/var/log/ceph``.

If the daemon stopped because of a heartbeat failure or ``suicide timeout``,
the underlying drive or filesystem may be unresponsive. Check ``dmesg``
and ``syslog`` output for drive or other kernel errors. You may need to
specify something like ``dmesg -T`` to get timestamps, otherwise it's
easy to mistake old errors for new.

If the problem is a software error (failed assertion or other
unexpected error), search the archives and tracker as above, and
report it to the `ceph-devel`_ email list if there's no clear fix or
existing bug.

No Free Drive Space
-------------------

Ceph prevents you from writing to a full OSD so that you don't lose data.
In an operational cluster, you should receive a warning when your cluster's OSDs
and pools approach the full ratio. The ``mon osd full ratio`` defaults to
``0.95``, or 95% of capacity before it stops clients from writing data.
The ``mon osd backfillfull ratio`` defaults to ``0.90``, or 90% of
capacity above which backfills will not start. The
OSD nearfull ratio defaults to ``0.85``, or 85% of capacity
when it generates a health warning.

Note that individual OSDs within a cluster will vary in how much data Ceph
allocates to them. This utilization can be displayed for each OSD with ::

    ceph osd df

Overall cluster / pool fullness can be checked with ::

    ceph df

Pay close attention to the **most full** OSDs, not the percentage of raw space
used as reported by ``ceph df``. It only takes one outlier OSD filling up to
fail writes to its pool. The space available to each pool as reported by
``ceph df`` considers the ratio settings relative to the *most full* OSD that
is part of a given pool. The distribution can be flattened by progressively
moving data from overfull to underfull OSDs using the ``reweight-by-utilization``
command. With Ceph releases beginning with later revisions of Luminous, one can also
exploit the ``ceph-mgr`` ``balancer`` module to perform this task automatically
and rather effectively.
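
A hedged sketch of both approaches (test the reweight first; the balancer
commands assume a Luminous or later release with the ``balancer`` ``ceph-mgr``
module available)::

    ceph osd test-reweight-by-utilization   # dry run
    ceph osd reweight-by-utilization

    # you may first need: ceph mgr module enable balancer
    ceph balancer mode upmap
    ceph balancer on
    ceph balancer status
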
The ratios can be adjusted:

::

@@ -232,6 +281,15 @@ OSD ``nearfull ratio`` using these commands:

    ceph osd set-full-ratio <float[0.0-1.0]>
    ceph osd set-backfillfull-ratio <float[0.0-1.0]>

Full cluster issues can arise when an OSD fails, either as a test or organically
within a small and/or very full or unbalanced cluster. When an OSD or node
holds an outsize percentage of the cluster's data, the ``nearfull`` and ``full``
ratios may be exceeded as a result of component failures or even natural growth.
If you are testing how Ceph reacts to OSD failures on a small
cluster, you should leave ample free disk space and consider temporarily
lowering the OSD ``full ratio``, OSD ``backfillfull ratio``, and
OSD ``nearfull ratio``.

Full ``ceph-osds`` will be reported by ``ceph health``::

    ceph health
@@ -245,16 +303,17 @@ Or::

    osd.4 is backfill full at 91%
    osd.2 is near full at 87%

The best way to deal with a full cluster is to add capacity via new OSDs, enabling
the cluster to redistribute data to newly available storage.

If you cannot start a legacy Filestore OSD because it is full, you may reclaim
some space by deleting a few placement group directories in the full OSD.

.. important:: If you choose to delete a placement group directory on a full OSD,
   **DO NOT** delete the same placement group directory on another full OSD, or
   **YOU WILL LOSE DATA**. You **MUST** maintain at least one copy of your data on
   at least one OSD. This is a rare and extreme intervention, and is not to be
   undertaken lightly.

See `Monitor Config Reference`_ for additional details.

@@ -275,8 +334,8 @@ and your OSDs are running. Check to see if OSDs are throttling recovery traffic.

Networking Issues
-----------------

Ceph is a distributed storage system, so it relies upon networks for OSD peering
and replication, recovery from faults, and periodic heartbeats. Networking
issues can cause OSD latency and flapping OSDs. See `Flapping OSDs`_ for
details.

@@ -295,15 +354,17 @@ Check network statistics. ::

Drive Configuration
-------------------

A SAS or SATA storage drive should only house one OSD; NVMe drives readily
handle two or more. Read and write throughput can bottleneck if other processes
share the drive, including journals / metadata, operating systems, Ceph monitors,
``syslog`` logs, other OSDs, and non-Ceph processes.

Ceph acknowledges writes *after* journaling, so fast SSDs are an
attractive option to accelerate the response time--particularly when
using the ``XFS`` or ``ext4`` file systems for legacy Filestore OSDs.
By contrast, the ``Btrfs``
file system can write and journal simultaneously. (Note, however, that
we recommend against using ``Btrfs`` for production deployments.)

.. note:: Partitioning a drive does not change its total throughput or
   sequential read/write limits. Running a journal in a separate partition
@@ -313,20 +374,22 @@ we recommend against using ``btrfs`` for production deployments.)

Bad Sectors / Fragmented Disk
-----------------------------

Check your drives for bad blocks, fragmentation, and other errors that can cause
performance to drop substantially. Invaluable tools include ``dmesg``, ``syslog``
logs, and ``smartctl`` (from the ``smartmontools`` package).
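
For example (``/dev/sda`` is a placeholder device)::

    dmesg -T | grep -iE 'error|sd[a-z]'
    smartctl -a /dev/sda
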

Co-resident Monitors/OSDs
-------------------------

Monitors are relatively lightweight processes, but they issue lots of
``fsync()`` calls,
which can interfere with other workloads, particularly if monitors run on the
same drive as an OSD. Additionally, if you run monitors on the same host as
OSDs, you may incur performance issues related to:

- Running an older kernel (pre-3.0)
- Running a kernel with no ``syncfs(2)`` syscall.

In these cases, multiple OSDs running on the same host can drag each other down
by doing lots of commits. That often leads to bursty writes.

@@ -335,10 +398,10 @@ by doing lots of commits. That often leads to the bursty writes.

Co-resident Processes
---------------------

Spinning up co-resident processes (convergence) such as a cloud-based solution, virtual
machines and other applications that write data to Ceph while operating on the
same hardware as OSDs can introduce significant OSD latency. Generally, we
recommend optimizing hosts for use with Ceph and using other hosts for other
processes. The practice of separating Ceph operations from other applications
may help improve performance and may streamline troubleshooting and maintenance.

@@ -377,13 +440,15 @@ might not have a recent enough version of ``glibc`` to support ``syncfs(2)``.

Filesystem Issues
-----------------

Currently, we recommend deploying clusters with the BlueStore back end.
When running a pre-Luminous release, or if you have a specific reason to deploy
OSDs with the previous Filestore back end, we recommend ``XFS``.

We recommend against using ``Btrfs`` or ``ext4``. The ``Btrfs`` filesystem has
many attractive features, but bugs may lead to
performance issues and spurious ENOSPC errors. We do not recommend
``ext4`` for Filestore OSDs because ``xattr`` limitations break support for long
object names, which are needed for RGW.

For more information, see `Filesystem Recommendations`_.

@@ -393,21 +458,23 @@ For more information, see `Filesystem Recommendations`_.

Insufficient RAM
----------------

We recommend a *minimum* of 4GB of RAM per OSD daemon and suggest rounding up
from 6-8GB. You may notice that during normal operations, ``ceph-osd``
processes only use a fraction of that amount.
Unused RAM makes it tempting to use the excess RAM for co-resident
applications or to skimp on each node's memory capacity. However,
when OSDs experience recovery their memory utilization spikes. If
there is insufficient RAM available, OSD performance will slow considerably
and the daemons may even crash or be killed by the Linux ``OOM Killer``.
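
With BlueStore OSDs you can check the memory budget each OSD tries to stay
within (shown here via the central config database; the default is roughly
4GB)::

    ceph config get osd osd_memory_target
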

Blocked Requests or Slow Requests
---------------------------------

If a ``ceph-osd`` daemon is slow to respond to a request, messages will be logged
noting ops that are taking too long. The warning threshold
defaults to 30 seconds and is configurable via the ``osd op complaint time``
setting. When this happens, the cluster log will receive messages.

Legacy versions of Ceph complain about ``old requests``::

@@ -421,7 +488,7 @@ New versions of Ceph complain about ``slow requests``::

Possible causes include:

- A failing drive (check ``dmesg`` output)
- A bug in the kernel file system (check ``dmesg`` output)
- An overloaded cluster (check system load, iostat, etc.)
- A bug in the ``ceph-osd`` daemon.

@@ -432,6 +499,7 @@ Possible solutions:

- Upgrade kernel
- Upgrade Ceph
- Restart OSDs
- Replace failed or failing components
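
To see what a slow OSD is actually doing, you can also query its admin socket
(``osd.0`` is a placeholder)::

    ceph daemon osd.0 dump_ops_in_flight
    ceph daemon osd.0 dump_historic_ops
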

Debugging Slow Requests
-----------------------

@@ -450,7 +518,7 @@ Events from the Messenger layer:

- ``initiated``: This is identical to ``header_read``. The existence of both is a
  historical oddity.

Events from the OSD as it processes ops:

- ``queued_for_pg``: The op has been put into the queue for processing by its PG.
- ``reached_pg``: The PG has started doing the op.
@@ -461,7 +529,7 @@ Events from the OSD as it prepares operations:

  is now being performed.
- ``waiting for subops from``: The op has been sent to replica OSDs.

Events from ``Filestore``:

- ``commit_queued_for_journal_write``: The op has been given to the FileStore.
- ``write_thread_in_journal_buffer``: The op is in the journal's buffer and waiting
@@ -469,7 +537,7 @@ Events from the FileStore:

- ``journaled_completion_queued``: The op was journaled to disk and its callback
  queued for invocation.

Events from the OSD after data has been given to underlying storage:

- ``op_commit``: The op has been committed (i.e. written to journal) by the
  primary OSD.
@@ -486,26 +554,47 @@ the internal code (such as passing data across locks into new threads).

Flapping OSDs
=============

When OSDs peer and check heartbeats, they use the cluster (back-end)
network when it's available. See `Monitor/OSD Interaction`_ for details.

We have traditionally recommended separate *public* (front-end) and *private*
(cluster / back-end / replication) networks:

#. Segregation of heartbeat and replication / recovery traffic (private)
   from client and OSD <-> mon traffic (public). This helps keep one
   from DoS-ing the other, which could in turn result in a cascading failure.

#. Additional throughput for both public and private traffic.

When common networking technologies were 100Mb/s and 1Gb/s, this separation
was often critical. With today's 10Gb/s, 40Gb/s, and 25/50/100Gb/s
networks, the above capacity concerns are often diminished or even obviated.
For example, if your OSD nodes have two network ports, dedicating one to
the public and the other to the private network means no path redundancy.
This degrades your ability to weather network maintenance and failures without
significant cluster or client impact. Consider instead using both links
for just a public network: with bonding (LACP) or equal-cost routing (e.g. FRR)
you reap the benefits of increased throughput headroom, fault tolerance, and
reduced OSD flapping.

When a private network (or even a single host link) fails or degrades while the
public network operates normally, OSDs may not handle this situation well. What
happens is that OSDs use the public network to report each other ``down`` to
the monitors, while marking themselves ``up``. The monitors then send out,
again on the public network, an updated cluster map with affected OSDs marked
``down``. These OSDs reply to the monitors "I'm not dead yet!", and the cycle
repeats. We call this scenario 'flapping', and it can be difficult to isolate
and remediate. With no private network, this irksome dynamic is avoided:
OSDs are generally either ``up`` or ``down`` without flapping.

If something does cause OSDs to 'flap' (repeatedly getting marked ``down`` and
then ``up`` again), you can force the monitors to halt the flapping by
temporarily freezing their states::

    ceph osd set noup      # prevent OSDs from getting marked up
    ceph osd set nodown    # prevent OSDs from getting marked down

These flags are recorded in the osdmap::

    ceph osd dump | grep flags
    flags no-up,no-down

@@ -526,9 +615,12 @@ from eventually being marked ``out`` (regardless of what the current value for

prevents OSDs from being marked ``in`` on boot, and any daemons that
started while the flag was set will remain that way.
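
Once the underlying network issue has been found and fixed, remember to clear
the flags, for example::

    ceph osd unset noup
    ceph osd unset nodown
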
.. note:: The causes and effects of flapping can be somewhat mitigated through
   careful adjustments to the ``mon_osd_down_out_subtree_limit``,
   ``mon_osd_reporter_subtree_level``, and ``mon_osd_min_down_reporters``.
   Derivation of optimal settings depends on cluster size, topology, and the
   Ceph release in use. Their interactions are subtle and beyond the scope of
   this document.

.. _iostat: https://en.wikipedia.org/wiki/Iostat