==============================
 Troubleshooting OSDs and PGs
==============================
|
|
|
|
Before troubleshooting your OSDs, check your monitors and network first. If
you execute ``ceph health`` or ``ceph -s`` on the command line and Ceph returns
a health status, the monitors have a quorum. If you don't have a monitor quorum
or if there are errors with the monitor status, address the monitor issues
first. Check your networks to ensure they are running properly, because
networks may have a significant impact on OSD operation and performance.
|
|
|
|
|
|
The Ceph Community
|
|
==================
|
|
|
|
The Ceph community is an excellent source of information and help. For
|
|
operational issues with Ceph releases we recommend you `subscribe to the
|
|
ceph-users email list`_. When you no longer want to receive emails, you can
|
|
`unsubscribe from the ceph-users email list`_.
|
|
|
|
If you have read through this guide and you have contacted ``ceph-users``,
|
|
but you haven't resolved your issue, you may contact `Inktank`_ for support.
|
|
|
|
You may also `subscribe to the ceph-devel email list`_. You should do so if
|
|
your issue is:
|
|
|
|
- Likely related to a bug
|
|
- Related to a development release package
|
|
- Related to a development testing package
|
|
- Related to your own builds
|
|
|
|
If you no longer want to receive emails from the ``ceph-devel`` email list, you
|
|
may `unsubscribe from the ceph-devel email list`_.
|
|
|
|
.. tip:: The Ceph community is growing rapidly, and community members can help
|
|
you if you provide them with detailed information about your problem. See
|
|
`Obtaining Data About OSDs`_ before you post questions to ensure that
|
|
community members have sufficient data to help you.
|
|
|
|
|
|
Obtaining Data About OSDs
|
|
=========================
|
|
|
|
A good first step in troubleshooting your OSDs is to obtain information in
|
|
addition to the information you collected while `monitoring your OSDs`_
|
|
(e.g., ``ceph osd tree``).
|
|
|
|
|
|
Ceph Logs
|
|
---------
|
|
|
|
If you haven't changed the default path, you can find Ceph log files at
|
|
``/var/log/ceph``::
|
|
|
|
ls /var/log/ceph
|
|
|
|
If you don't get enough log detail, you can change your logging level. See
|
|
`Ceph Logging and Debugging`_ and `Logging and Debugging Config Reference`_ in
|
|
the Ceph Configuration documentation for details. Also, see `Debugging and
|
|
Logging`_ in the Ceph Operations documentation to ensure that Ceph performs
|
|
adequately under high logging volume.
|
|
|
|
|
|
Admin Socket
|
|
------------
|
|
|
|
Use the admin socket tool to retrieve runtime information. For details, list
|
|
the sockets for your Ceph processes::
|
|
|
|
ls /var/run/ceph
|
|
|
|
Then, execute the following, replacing ``{socket-name}`` with an actual
|
|
socket name to show the list of available options::
|
|
|
|
ceph --admin-daemon /var/run/ceph/{socket-name} help
|
|
|
|
The admin socket, among other things, allows you to:
|
|
|
|
- List your configuration at runtime
|
|
- Dump historic operations
|
|
- Dump the operation priority queue state
|
|
- Dump operations in flight
|
|
- Dump perfcounters
|
|
|
|
|
|
Display Freespace
|
|
-----------------
|
|
|
|
Filesystem issues may arise. To display your filesystem's free space, execute
|
|
``df``. ::
|
|
|
|
df -h
|
|
|
|
Execute ``df --help`` for additional usage.
|
|
|
|
|
|
I/O Statistics
|
|
--------------
|
|
|
|
Use `iostat`_ to identify I/O-related issues. ::
|
|
|
|
iostat -x
|
|
|
|
|
|
Diagnostic Messages
|
|
-------------------
|
|
|
|
To retrieve diagnostic messages, use ``dmesg`` with ``less``, ``more``, ``grep``
|
|
or ``tail``. For example::
|
|
|
|
dmesg | grep scsi
|
|
|
|
|
|
Stopping w/out Rebalancing
|
|
==========================
|
|
|
|
Periodically, you may need to perform maintenance on a subset of your cluster,
|
|
or resolve a problem that affects a failure domain (e.g., a rack). If you do not
|
|
want CRUSH to automatically rebalance the cluster as you stop OSDs for
|
|
maintenance, set the cluster to ``noout`` first::
|
|
|
|
ceph osd set noout
|
|
|
|
Once the cluster is set to ``noout``, you can begin stopping the OSDs within the
|
|
failure domain that requires maintenance work. ::
|
|
|
|
ceph osd stop osd.{num}
|
|
|
|
.. note:: Placement groups within the OSDs you stop will become ``degraded``
   while you are addressing issues within the failure domain.
|
|
|
|
Once you have completed your maintenance, restart the OSDs. ::
|
|
|
|
ceph osd start osd.{num}
|
|
|
|
Finally, you must unset the cluster from ``noout``. ::
|
|
|
|
ceph osd unset noout
|
|
|
|
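The maintenance sequence above can be sketched as a small shell helper. This is
only an illustration: ``with_noout`` is a hypothetical function name, and the
maintenance step itself is whatever command you pass to it. The only ``ceph``
calls it makes are the ``set noout`` and ``unset noout`` commands shown above.

```shell
# Hypothetical helper: run a maintenance command while the noout flag is set,
# and clear the flag afterwards even if the maintenance command fails.
with_noout() {
    # Refuse to start maintenance if the flag could not be set.
    ceph osd set noout || return 1

    # Run the caller-supplied maintenance step and remember its exit status.
    "$@"
    rc=$?

    # Always try to clear the flag so the cluster can rebalance again.
    ceph osd unset noout || return 1
    return "$rc"
}

# Example (not executed here; ./replace_disk.sh is a placeholder script):
#   with_noout ./replace_disk.sh
```

Clearing the flag in the same function that sets it makes it harder to leave the
cluster in ``noout`` by accident after a failed maintenance step.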
|
|
|
|
.. _osd-not-running:
|
|
|
|
OSD Not Running
|
|
===============
|
|
|
|
Under normal circumstances, simply restarting the ``ceph-osd`` daemon will
|
|
allow it to rejoin the cluster and recover.
|
|
|
|
An OSD Won't Start
|
|
------------------
|
|
|
|
If you start your cluster and an OSD won't start, check the following:
|
|
|
|
- **Configuration File:** If you were not able to get OSDs running from
|
|
a new installation, check your configuration file to ensure it conforms
|
|
(e.g., ``host`` not ``hostname``, etc.).
|
|
|
|
- **Check Paths:** Check the paths in your configuration, and the actual
|
|
paths themselves for data and journals. If you separate the OSD data from
|
|
the journal data and there are errors in your configuration file or in the
|
|
actual mounts, you may have trouble starting OSDs. If you want to store the
|
|
journal on a block device, you should partition your journal disk and assign
|
|
one partition per OSD.
|
|
|
|
- **Kernel Version:** Identify the kernel version and distribution you
|
|
are using. Ceph uses some third party tools by default, which may be
|
|
buggy or may conflict with certain distributions and/or kernel
|
|
versions (e.g., Google perftools). Check the `OS recommendations`_
|
|
to ensure you have addressed any issues related to your kernel.
|
|
|
|
- **Segmentation Fault:** If there is a segmentation fault, turn your logging
  up (if it isn't already), and try again. If it segfaults again, contact the
  ceph-devel email list and provide your Ceph configuration file, your
  monitor output and the contents of your log file(s).
|
|
|
|
If you cannot resolve the issue and the email list isn't helpful, you may
|
|
contact `Inktank`_ for support.
|
|
|
|
|
|
An OSD Failed
|
|
-------------
|
|
|
|
When a ``ceph-osd`` process dies, the monitor will learn about the failure
|
|
from surviving ``ceph-osd`` daemons and report it via the ``ceph health``
|
|
command::
|
|
|
|
ceph health
|
|
HEALTH_WARN 1/3 in osds are down
|
|
|
|
Specifically, you will get a warning whenever there are ``ceph-osd``
|
|
processes that are marked ``in`` and ``down``. You can identify which
|
|
``ceph-osds`` are ``down`` with::
|
|
|
|
ceph health detail
|
|
HEALTH_WARN 1/3 in osds are down
|
|
osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080
|
|
|
|
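If you want to act on this output in a script, a minimal sketch along the
following lines may help. It assumes the ``osd.N is down since ...`` line format
shown in the example above. ``list_down_osds`` is a hypothetical function name,
not a Ceph command.

```shell
# Print the name of every OSD that "ceph health detail" reports as down.
# Reads the command's output on stdin; assumes lines of the form
# "osd.N is down since epoch ...", as in the example output.
list_down_osds() {
    awk '$1 ~ /^osd\.[0-9]+$/ && $2 == "is" && $3 == "down" { print $1 }'
}

# Example (not executed here):
#   ceph health detail | list_down_osds
```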
If there is a disk
|
|
failure or other fault preventing ``ceph-osd`` from functioning or
|
|
restarting, an error message should be present in its log file in
|
|
``/var/log/ceph``.
|
|
|
|
If the daemon stopped because of a heartbeat failure, the underlying
|
|
kernel file system may be unresponsive. Check ``dmesg`` output for disk
|
|
or other kernel errors.
|
|
|
|
If the problem is a software error (failed assertion or other
|
|
unexpected error), it should be reported to the `ceph-devel`_ email list.
|
|
|
|
|
|
No Free Drive Space
|
|
-------------------
|
|
|
|
Ceph prevents you from writing to a full OSD so that you don't lose data.
|
|
In an operational cluster, you should receive a warning when your cluster
|
|
is getting near its full ratio. The ``mon osd full ratio`` defaults to
|
|
``0.95``, or 95% of capacity before it stops clients from writing data.
|
|
The ``mon osd nearfull ratio`` defaults to ``0.85``, or 85% of capacity
|
|
when it generates a health warning.
|
|
|
|
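The relationship between the two ratios can be illustrated with a short shell
sketch. It is not how Ceph evaluates the ratios internally: the function name
``classify_osd_usage`` is hypothetical, utilization is passed as a whole-number
percentage because plain shell has no floating point, and treating a value equal
to a threshold as having reached it is an assumption.

```shell
# Classify an OSD's utilization (integer percent) against the nearfull and
# full thresholds. Defaults mirror the documented 0.85 and 0.95 ratios.
classify_osd_usage() {
    used_pct=$1
    nearfull_pct=${2:-85}
    full_pct=${3:-95}

    if [ "$used_pct" -ge "$full_pct" ]; then
        echo full          # clients would be stopped from writing
    elif [ "$used_pct" -ge "$nearfull_pct" ]; then
        echo nearfull      # a health warning would be generated
    else
        echo ok
    fi
}

# Example: classify_osd_usage 90        (default thresholds)
#          classify_osd_usage 90 70 80  (temporarily lowered thresholds)
```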
Full cluster issues usually arise when testing how Ceph handles an OSD
|
|
failure on a small cluster. When one node has a high percentage of the
|
|
cluster's data, the cluster can easily eclipse its nearfull and full ratio
|
|
immediately. If you are testing how Ceph reacts to OSD failures on a small
|
|
cluster, you should leave ample free disk space and consider temporarily
|
|
lowering the ``mon osd full ratio`` and ``mon osd nearfull ratio``.
|
|
|
|
Full ``ceph-osds`` will be reported by ``ceph health``::
|
|
|
|
ceph health
|
|
HEALTH_WARN 1 nearfull osds
|
|
osd.2 is near full at 85%
|
|
|
|
Or::
|
|
|
|
ceph health
|
|
HEALTH_ERR 1 nearfull osds, 1 full osds
|
|
osd.2 is near full at 85%
|
|
osd.3 is full at 97%
|
|
|
|
The best way to deal with a full cluster is to add new ``ceph-osds``, allowing
|
|
the cluster to redistribute data to the newly available storage.
|
|
|
|
If you cannot start an OSD because it is full, you may delete some data by deleting
|
|
some placement group directories in the full OSD.
|
|
|
|
.. important:: If you choose to delete a placement group directory on a full OSD,
|
|
**DO NOT** delete the same placement group directory on another full OSD, or
|
|
**YOU MAY LOSE DATA**. You **MUST** maintain at least one copy of your data on
|
|
at least one OSD.
|
|
|
|
|
|
OSDs are Slow/Unresponsive
|
|
==========================
|
|
|
|
A commonly recurring issue involves slow or unresponsive OSDs. Ensure that you
|
|
have eliminated other troubleshooting possibilities before delving into OSD
|
|
performance issues. For example, ensure that your network(s) is working properly
|
|
and your OSDs are running. Check to see if OSDs are throttling recovery traffic.
|
|
|
|
.. tip:: Newer versions of Ceph provide better recovery handling by preventing
   recovering OSDs from using up so many system resources that ``up`` and
   ``in`` OSDs become unavailable or slow.
|
|
|
|
|
|
Networking Issues
|
|
-----------------
|
|
|
|
Ceph is a distributed storage system, so it depends upon networks to peer with
|
|
OSDs, replicate objects, recover from faults and check heartbeats. Networking
|
|
issues can cause OSD latency and flapping OSDs. See `Flapping OSDs`_ for
|
|
details.
|
|
|
|
Ensure that Ceph processes and Ceph-dependent processes are connected and/or
|
|
listening. ::
|
|
|
|
netstat -a | grep ceph
|
|
netstat -l | grep ceph
|
|
sudo netstat -p | grep ceph
|
|
|
|
Check network statistics. ::
|
|
|
|
netstat -s
|
|
|
|
|
|
Drive Configuration
|
|
-------------------
|
|
|
|
A storage drive should only support one OSD. Sequential read and sequential
|
|
write throughput can bottleneck if other processes share the drive, including
|
|
journals, operating systems, monitors, other OSDs and non-Ceph processes.
|
|
|
|
Ceph acknowledges writes *after* journaling, so fast SSDs are an attractive
|
|
option to accelerate the response time--particularly when using the ``ext4`` or
|
|
XFS filesystems. By contrast, the ``btrfs`` filesystem can write and journal
|
|
simultaneously.
|
|
|
|
.. note:: Partitioning a drive does not change its total throughput or
|
|
sequential read/write limits. Running a journal in a separate partition
|
|
may help, but you should prefer a separate physical drive.
|
|
|
|
|
|
Bad Sectors / Fragmented Disk
|
|
-----------------------------
|
|
|
|
Check your disks for bad sectors and fragmentation. Either can cause total
throughput to drop substantially.
|
|
|
|
|
|
Co-resident Monitors/OSDs
|
|
-------------------------
|
|
|
|
Monitors are generally light-weight processes, but they do lots of ``fsync()``,
|
|
which can interfere with other workloads, particularly if monitors run on the
|
|
same drive as your OSDs. Additionally, if you run monitors on the same host as
|
|
the OSDs, you may incur performance issues related to:
|
|
|
|
- Running an older kernel (pre-3.0)
|
|
- Running Argonaut with an old ``glibc``
|
|
- Running a kernel with no syncfs(2) syscall.
|
|
|
|
In these cases, multiple OSDs running on the same host can drag each other down
|
|
by doing lots of commits. That often leads to bursty writes.
|
|
|
|
|
|
Co-resident Processes
|
|
---------------------
|
|
|
|
Spinning up co-resident processes such as a cloud-based solution, virtual
|
|
machines and other applications that write data to Ceph while operating on the
|
|
same hardware as OSDs can introduce significant OSD latency. Generally, we
|
|
recommend optimizing a host for use with Ceph and using other hosts for other
|
|
processes. The practice of separating Ceph operations from other applications
|
|
may help improve performance and may streamline troubleshooting and maintenance.
|
|
|
|
|
|
Logging Levels
|
|
--------------
|
|
|
|
If you turned logging levels up to track an issue and then forgot to turn
|
|
logging levels back down, the OSD may be putting a lot of logs onto the disk. If
|
|
you intend to keep logging levels high, you may consider mounting a drive to the
|
|
default path for logging (i.e., ``/var/log/ceph/$cluster-$name.log``).
|
|
|
|
|
|
Recovery Throttling
|
|
-------------------
|
|
|
|
Depending upon your configuration, Ceph may reduce recovery rates to maintain
|
|
performance or it may increase recovery rates to the point that recovery
|
|
impacts OSD performance. Check to see if the OSD is recovering.
|
|
|
|
|
|
Kernel Version
|
|
--------------
|
|
|
|
Check the kernel version you are running. Older kernels may not receive
|
|
new backports that Ceph depends upon for better performance.
|
|
|
|
|
|
Kernel Issues with SyncFS
|
|
-------------------------
|
|
|
|
Try running one OSD per host to see if performance improves. Old kernels
|
|
might not have a recent enough version of ``glibc`` to support ``syncfs(2)``.
|
|
|
|
|
|
Filesystem Issues
|
|
-----------------
|
|
|
|
Currently, we recommend deploying clusters with XFS or ext4. The btrfs
|
|
filesystem has many attractive features, but bugs in the filesystem may
|
|
lead to performance issues.
|
|
|
|
|
|
Insufficient RAM
|
|
----------------
|
|
|
|
We recommend 1GB of RAM per OSD daemon. You may notice that during normal
|
|
operations, the OSD only uses a fraction of that amount (e.g., 100-200MB).
|
|
Unused RAM makes it tempting to use the excess RAM for co-resident applications,
|
|
VMs and so forth. However, when OSDs go into recovery mode, their memory
|
|
utilization spikes. If there is no RAM available, the OSD performance will slow
|
|
considerably.
|
|
|
|
|
|
Old Requests or Slow Requests
|
|
-----------------------------
|
|
|
|
If a ``ceph-osd`` daemon is slow to respond to a request, it will generate log messages
|
|
complaining about requests that are taking too long. The warning threshold
|
|
defaults to 30 seconds, and is configurable via the ``osd op complaint time``
|
|
option. When this happens, the cluster log will receive messages.
|
|
|
|
Legacy versions of Ceph complain about 'old requests'::
|
|
|
|
osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops
|
|
|
|
New versions of Ceph complain about 'slow requests'::
|
|
|
|
{date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs
|
|
{date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]
|
|
|
|
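To see which OSDs complain most often, you could count the per-request warning
lines in a log. The following is a rough sketch that assumes the 'slow request'
line format shown above, with the OSD name appearing as its own
whitespace-separated field. ``count_slow_requests`` is a hypothetical helper
name.

```shell
# Count "slow request" warnings per OSD from log text on stdin.
# Summary lines ("N slow requests, N included below") are not counted,
# only the per-request lines that follow them.
count_slow_requests() {
    awk '/\[WRN\] slow request / {
             # The OSD name is the first field that looks like "osd.N".
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^osd\.[0-9]+$/) { n[$i]++; break }
         }
         END { for (o in n) print o, n[o] }' | sort
}

# Example (not executed here; pass whichever log file holds the warnings):
#   count_slow_requests < {log-file}
```

An OSD that dominates the counts is a reasonable first candidate for the
``dmesg`` and ``iostat`` checks described in this section.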
|
|
Possible causes include:
|
|
|
|
- A bad drive (check ``dmesg`` output)
|
|
- A bug in the kernel file system (check ``dmesg`` output)
|
|
- An overloaded cluster (check system load, iostat, etc.)
|
|
- A bug in the ``ceph-osd`` daemon.
|
|
|
|
Possible solutions:
|
|
|
|
- Remove VMs and cloud solutions from Ceph hosts
|
|
- Upgrade Kernel
|
|
- Upgrade Ceph
|
|
- Restart OSDs
|
|
|
|
|
|
|
|
Flapping OSDs
|
|
=============
|
|
|
|
We recommend using both a public (front-end) network and a cluster (back-end)
|
|
network so that you can better meet the capacity requirements of object replication. Another
|
|
advantage is that you can run a cluster network such that it isn't connected to
|
|
the internet, thereby preventing some denial of service attacks. When OSDs peer
|
|
and check heartbeats, they use the cluster (back-end) network when it's available.
|
|
See `Monitor/OSD Interaction`_ for details.
|
|
|
|
However, if the cluster (back-end) network fails or develops significant latency
|
|
while the public (front-end) network operates optimally, OSDs currently do not
|
|
handle this situation well. What happens is that OSDs mark each other ``down``
|
|
on the monitor, while marking themselves ``up``. We call this scenario 'flapping'.
|
|
|
|
If something is causing OSDs to 'flap' (repeatedly getting marked ``down`` and then
|
|
``up`` again), you can force the monitors to stop the flapping with::
|
|
|
|
ceph osd set noup # prevent osds from getting marked up
|
|
ceph osd set nodown # prevent osds from getting marked down
|
|
|
|
These flags are recorded in the osdmap structure::
|
|
|
|
ceph osd dump | grep flags
|
|
flags no-up,no-down
|
|
|
|
You can clear the flags with::
|
|
|
|
ceph osd unset noup
|
|
ceph osd unset nodown
|
|
|
|
Two other flags are supported, ``noin`` and ``noout``, which prevent
booting OSDs from being marked ``in`` (allocated data) or protect ``down``
OSDs from eventually being marked ``out`` (regardless of what the
current value for ``mon osd down out interval`` is).
|
|
|
|
.. note:: ``noup``, ``noout``, and ``nodown`` are temporary in the
|
|
sense that once the flags are cleared, the action they were blocking
|
|
should occur shortly after. The ``noin`` flag, on the other hand,
|
|
prevents OSDs from being marked ``in`` on boot, and any daemons that
|
|
started while the flag was set will remain that way.
|
|
|
|
|
|
|
|
Troubleshooting PG Errors
|
|
=========================
|
|
|
|
|
|
Placement Groups Never Get Clean
|
|
--------------------------------
|
|
|
|
There are a few cases where Ceph placement groups never get clean:
|
|
|
|
#. **One OSD:** If you deviate from the quick start and use only one OSD, you
   will likely run into problems. OSDs report other OSDs to the monitor, and
   also interact with other OSDs when replicating data. If you have only one
   OSD, a second OSD cannot check its heartbeat. Also, if you remove an OSD
   and have only one OSD remaining, you may encounter problems. A secondary
   or tertiary OSD expects another OSD to tell it which placement groups it
   should have. The lack of another OSD prevents this from occurring, so a
   placement group can remain stuck ``stale`` forever.
|
|
|
|
#. **Pool Size = 1**: If you have only one copy of an object, no other OSD will
|
|
tell the OSD which objects it should have. For each placement group mapped
|
|
to the remaining OSD (see ``ceph pg dump``), you can force the OSD to notice
|
|
the placement groups it needs by running::
|
|
|
|
ceph pg force_create_pg <pgid>
|
|
|
|
As a general rule, you should run your cluster with more than one OSD and a
|
|
pool size greater than 1 object replica.
|
|
|
|
|
|
Stuck Placement Groups
|
|
----------------------
|
|
|
|
It is normal for placement groups to enter states like "degraded" or "peering"
|
|
following a failure. These states usually indicate the expected progression
|
|
through the failure recovery process. However, if a placement group stays in one
|
|
of these states for a long time this may be an indication of a larger problem.
|
|
For this reason, the monitor will warn when placement groups get "stuck" in a
|
|
non-optimal state. Specifically, we check for:
|
|
|
|
* ``inactive`` - The placement group has not been ``active`` for too long
|
|
(i.e., it hasn't been able to service read/write requests).
|
|
|
|
* ``unclean`` - The placement group has not been ``clean`` for too long
|
|
(i.e., it hasn't been able to completely recover from a previous failure).
|
|
|
|
* ``stale`` - The placement group status has not been updated by a ``ceph-osd``,
|
|
indicating that all nodes storing this placement group may be ``down``.
|
|
|
|
You can explicitly list stuck placement groups with one of::
|
|
|
|
ceph pg dump_stuck stale
|
|
ceph pg dump_stuck inactive
|
|
ceph pg dump_stuck unclean
|
|
|
|
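To check all three stuck states in one pass, the commands above can be wrapped
in a loop. This is a minimal sketch: ``report_stuck_pgs`` is a hypothetical
function name, and the only ``ceph`` call it makes is the ``ceph pg dump_stuck``
command shown above.

```shell
# Run "ceph pg dump_stuck" for each stuck state, with a heading per state.
report_stuck_pgs() {
    for state in stale inactive unclean; do
        echo "== stuck $state =="
        ceph pg dump_stuck "$state"
    done
}

# Example (not executed here):
#   report_stuck_pgs
```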
For stuck ``stale`` placement groups, it is normally a matter of getting the
|
|
right ``ceph-osd`` daemons running again. For stuck ``inactive`` placement
|
|
groups, it is usually a peering problem (see :ref:`failures-osd-peering`). For
|
|
stuck ``unclean`` placement groups, there is usually something preventing
|
|
recovery from completing, like unfound objects (see
|
|
:ref:`failures-osd-unfound`).
|
|
|
|
|
|
|
|
.. _failures-osd-peering:
|
|
|
|
Placement Group Down - Peering Failure
|
|
--------------------------------------
|
|
|
|
In certain cases, the ``ceph-osd`` `Peering` process can run into
|
|
problems, preventing a PG from becoming active and usable. For
|
|
example, ``ceph health detail`` might report::
|
|
|
|
ceph health detail
|
|
HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
|
|
...
|
|
pg 0.5 is down+peering
|
|
pg 1.4 is down+peering
|
|
...
|
|
osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651
|
|
|
|
We can query the cluster to determine exactly why the PG is marked ``down`` with::
|
|
|
|
ceph pg 0.5 query
|
|
|
|
.. code-block:: javascript

   { "state": "down+peering",
     ...
     "recovery_state": [
          { "name": "Started\/Primary\/Peering\/GetInfo",
            "enter_time": "2012-03-06 14:40:16.169679",
            "requested_info_from": []},
          { "name": "Started\/Primary\/Peering",
            "enter_time": "2012-03-06 14:40:16.169659",
            "probing_osds": [
                  0,
                  1],
            "blocked": "peering is blocked due to down osds",
            "down_osds_we_would_probe": [
                  1],
            "peering_blocked_by": [
                  { "osd": 1,
                    "current_lost_at": 0,
                    "comment": "starting or marking this osd lost may let us proceed"}]},
          { "name": "Started",
            "enter_time": "2012-03-06 14:40:16.169513"}
      ]
   }
|
|
|
|
The ``recovery_state`` section tells us that peering is blocked due to
|
|
down ``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start that ``ceph-osd``
|
|
and things will recover.
|
|
|
|
Alternatively, if there is a catastrophic failure of ``osd.1`` (e.g., disk
|
|
failure), we can tell the cluster that it is ``lost`` and to cope as
|
|
best it can.
|
|
|
|
.. important:: This is dangerous in that the cluster cannot
|
|
guarantee that the other copies of the data are consistent
|
|
and up to date.
|
|
|
|
To instruct Ceph to continue anyway::
|
|
|
|
ceph osd lost 1
|
|
|
|
Recovery will proceed.
|
|
|
|
|
|
.. _failures-osd-unfound:
|
|
|
|
Unfound Objects
|
|
---------------
|
|
|
|
Under certain combinations of failures Ceph may complain about
|
|
``unfound`` objects::
|
|
|
|
ceph health detail
|
|
HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
|
|
pg 2.4 is active+degraded, 78 unfound
|
|
|
|
This means that the storage cluster knows that some objects (or newer
|
|
copies of existing objects) exist, but it hasn't found copies of them.
|
|
One example of how this might come about for a PG whose data is on ceph-osds
|
|
1 and 2:
|
|
|
|
* 1 goes down
|
|
* 2 handles some writes, alone
|
|
* 1 comes up
|
|
* 1 and 2 repeer, and the objects missing on 1 are queued for recovery.
|
|
* Before the new objects are copied, 2 goes down.
|
|
|
|
Now 1 knows that these objects exist, but there is no live ``ceph-osd`` that
has a copy. In this case, IO to those objects will block, and the
cluster will hope that the failed node comes back soon; this is
assumed to be preferable to returning an IO error to the user.
|
|
|
|
First, you can identify which objects are unfound with::
|
|
|
|
ceph pg 2.4 list_missing [starting offset, in json]
|
|
|
|
.. code-block:: javascript

   { "offset": { "oid": "",
        "key": "",
        "snapid": 0,
        "hash": 0,
        "max": 0},
     "num_missing": 0,
     "num_unfound": 0,
     "objects": [
        { "oid": "object 1",
          "key": "",
          "hash": 0,
          "max": 0 },
        ...
     ],
     "more": 0}
|
|
|
|
If there are too many objects to list in a single result, the ``more``
|
|
field will be true and you can query for more. (Eventually the
|
|
command line tool will hide this from you, but not yet.)
|
|
|
|
Second, you can identify which OSDs have been probed or might contain
|
|
data::
|
|
|
|
ceph pg 2.4 query
|
|
|
|
.. code-block:: javascript

   "recovery_state": [
        { "name": "Started\/Primary\/Active",
          "enter_time": "2012-03-06 15:15:46.713212",
          "might_have_unfound": [
                { "osd": 1,
                  "status": "osd is down"}]},
|
|
|
|
In this case, for example, the cluster knows that ``osd.1`` might have
|
|
data, but it is ``down``. The full range of possible states includes::
|
|
|
|
* already probed
|
|
* querying
|
|
* osd is down
|
|
* not queried (yet)
|
|
|
|
Sometimes it simply takes some time for the cluster to query possible
|
|
locations.
|
|
|
|
It is possible that there are other locations where the object can
exist that are not listed. For example, if a ceph-osd is stopped and
taken out of the cluster, the cluster fully recovers, and due to some
future set of failures ends up with an unfound object, it won't
consider the long-departed ceph-osd as a potential location.
(This scenario, however, is unlikely.)
|
|
|
|
If all possible locations have been queried and objects are still
|
|
lost, you may have to give up on the lost objects. This, again, is
|
|
possible given unusual combinations of failures that allow the cluster
|
|
to learn about writes that were performed before the writes themselves
|
|
are recovered. To mark the "unfound" objects as "lost"::
|
|
|
|
ceph pg 2.4 mark_unfound_lost revert
|
|
|
|
The final argument specifies how the cluster should deal with
lost objects. Currently the only supported option is "revert", which
will either roll back to a previous version of the object or (if it
was a new object) forget about it entirely. Use this with caution, as
it may confuse applications that expected the object to exist.
|
|
|
|
|
|
Homeless Placement Groups
|
|
-------------------------
|
|
|
|
It is possible for all OSDs that had copies of a given placement group to fail.
|
|
If that's the case, that subset of the object store is unavailable, and the
|
|
monitor will receive no status updates for those placement groups. To detect
|
|
this situation, the monitor marks any placement group whose primary OSD has
|
|
failed as ``stale``. For example::
|
|
|
|
ceph health
|
|
HEALTH_WARN 24 pgs stale; 3/300 in osds are down
|
|
|
|
You can identify which placement groups are ``stale``, and what the last OSDs to
|
|
store them were, with::
|
|
|
|
ceph health detail
|
|
HEALTH_WARN 24 pgs stale; 3/300 in osds are down
|
|
...
|
|
pg 2.5 is stuck stale+active+remapped, last acting [2,0]
|
|
...
|
|
osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
|
|
osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
|
|
osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861
|
|
|
|
If we want to get placement group 2.5 back online, for example, this tells us that
|
|
it was last managed by ``osd.0`` and ``osd.2``. Restarting those ``ceph-osd``
|
|
daemons will allow the cluster to recover that placement group (and, presumably,
|
|
many others).
|
|
|
|
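When many placement groups are stale, it can help to pull the PG IDs and their
last acting OSDs out of the ``ceph health detail`` output. The sketch below
assumes the ``pg ... is stuck stale..., last acting [...]`` line format shown
above. ``stale_pg_osds`` is a hypothetical helper name.

```shell
# For each stuck stale PG in "ceph health detail" output (read from stdin),
# print the PG ID followed by the comma-separated list of last acting OSD IDs.
stale_pg_osds() {
    sed -n 's/^pg \([0-9a-f.]*\) is stuck stale.*last acting \[\([0-9,]*\)].*$/\1 \2/p'
}

# Example (not executed here):
#   ceph health detail | stale_pg_osds
```

Each output line pairs a placement group with the OSD IDs whose ``ceph-osd``
daemons you would try to restart first.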
|
|
|
|
.. _iostat: http://en.wikipedia.org/wiki/Iostat
|
|
.. _Ceph Logging and Debugging: ../../configuration/ceph-conf#ceph-logging-and-debugging
|
|
.. _Logging and Debugging Config Reference: ../../configuration/log-and-debug-ref
|
|
.. _Debugging and Logging: ../debug
|
|
.. _Monitor/OSD Interaction: ../../configuration/mon-osd-interaction
|
|
.. _monitoring your OSDs: ../monitoring-osd-pg
|
|
.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel
|
|
.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel
|
|
.. _subscribe to the ceph-users email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-users
|
|
.. _unsubscribe from the ceph-users email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-users
|
|
.. _Inktank: http://inktank.com
|
|
.. _OS recommendations: ../../../install/os-recommendations
|
|
.. _ceph-devel: ceph-devel@vger.kernel.org