====================
Troubleshooting PGs
====================

Placement Groups Never Get Clean
================================

Placement Groups (PGs) that remain in the ``active`` status, the
``active+remapped`` status or the ``active+degraded`` status and never achieve
an ``active+clean`` status might indicate a problem with the configuration of
the Ceph cluster.

In such a situation, review the settings in the `Pool, PG and CRUSH Config
Reference`_ and make appropriate adjustments.

As a general rule, run your cluster with more than one OSD and a pool size
greater than two object replicas.

.. _one-node-cluster:

One Node Cluster
----------------

Ceph no longer provides documentation for operating on a single node. Systems
designed for distributed computing by definition do not run on a single node.
The mounting of client kernel modules on a single node that contains a Ceph
daemon may cause a deadlock due to issues with the Linux kernel itself (unless
VMs are used as clients). You can experiment with Ceph in a one-node
configuration, in spite of the limitations described herein.

To create a cluster on a single node, you must change the
``osd_crush_chooseleaf_type`` setting from the default of ``1`` (meaning
``host`` or ``node``) to ``0`` (meaning ``osd``) in your Ceph configuration
file before you create your monitors and OSDs. This tells Ceph that an OSD is
permitted to place another OSD on the same host. If you are trying to set up a
single-node cluster and ``osd_crush_chooseleaf_type`` is greater than ``0``,
Ceph will attempt to place the PGs of one OSD with the PGs of another OSD on
another node, chassis, rack, row, or datacenter depending on the setting.
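
For example, a minimal sketch of the relevant line in ``ceph.conf`` (the
``[global]`` section shown here is only the conventional place to put it;
adjust to your own configuration layout) might look like this::

    [global]
    # allow CRUSH to place replicas on OSDs that share a host
    osd_crush_chooseleaf_type = 0
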

.. tip:: DO NOT mount kernel clients directly on the same node as your Ceph
   Storage Cluster. Kernel conflicts can arise. However, you can mount kernel
   clients within virtual machines (VMs) on a single node.

If you are creating OSDs using a single disk, you must manually create
directories for the data first.


Fewer OSDs than Replicas
------------------------

If two OSDs are in an ``up`` and ``in`` state, but the placement groups are not
in an ``active + clean`` state, you may have an ``osd_pool_default_size`` set
to greater than ``2``.

There are a few ways to address this situation. If you want to operate your
cluster in an ``active + degraded`` state with two replicas, you can set the
``osd_pool_default_min_size`` to ``2`` so that you can write objects in an
``active + degraded`` state. You may also set the ``osd_pool_default_size``
setting to ``2`` so that you have only two stored replicas (the original and
one replica). In such a case, the cluster should achieve an ``active + clean``
state.

.. note:: You can make the changes while the cluster is running. If you make
   the changes in your Ceph configuration file, you might need to restart your
   cluster.
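
For example, assuming a release with the centralized configuration database
(``ceph config set``) and a hypothetical pool named ``mypool``, the runtime
changes might look like this:

.. prompt:: bash

   # defaults applied to pools created from now on
   ceph config set global osd_pool_default_size 2
   ceph config set global osd_pool_default_min_size 2
   # an existing pool must be changed explicitly
   ceph osd pool set mypool size 2
   ceph osd pool set mypool min_size 2
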

Pool Size = 1
-------------

If you have ``osd_pool_default_size`` set to ``1``, you will have only one copy
of the object. OSDs rely on other OSDs to tell them which objects they should
have. If one OSD has a copy of an object and there is no second copy, then
there is no second OSD to tell the first OSD that it should have that copy. For
each placement group mapped to the first OSD (see ``ceph pg dump``), you can
force the first OSD to notice the placement groups it needs by running a
command of the following form:

.. prompt:: bash

   ceph osd force-create-pg <pgid>
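
For example, a rough sketch of how this might be done for ``osd.0`` (the OSD
id, the ``grep`` pattern, and the PG ID ``2.5`` below are illustrative
assumptions, not output from a real cluster):

.. prompt:: bash

   # list PGs whose up/acting set is just [0] (pattern is illustrative)
   ceph pg dump pgs_brief | grep '\[0\]'
   # then force-create each affected PG, for example:
   ceph osd force-create-pg 2.5
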

CRUSH Map Errors
----------------

If any placement groups in your cluster are unclean, then there might be errors
in your CRUSH map.
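
To inspect the CRUSH map, you can extract and decompile it, as sketched below
(the file names are arbitrary); the resulting text file can then be reviewed
for obvious mistakes such as wrong failure domains or missing devices:

.. prompt:: bash

   ceph osd getcrushmap > crush.map
   crushtool --decompile crush.map > crush.txt
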


Stuck Placement Groups
======================

It is normal for placement groups to enter "degraded" or "peering" states after
a component failure. Normally, these states reflect the expected progression
through the failure recovery process. However, a placement group that stays in
one of these states for a long time might be an indication of a larger problem.
For this reason, the Ceph Monitors will warn when placement groups get "stuck"
in a non-optimal state. Specifically, we check for:

* ``inactive`` - The placement group has not been ``active`` for too long (that
  is, it hasn't been able to service read/write requests).

* ``unclean`` - The placement group has not been ``clean`` for too long (that
  is, it hasn't been able to completely recover from a previous failure).

* ``stale`` - The placement group status has not been updated by a
  ``ceph-osd``. This indicates that all nodes storing this placement group may
  be ``down``.

List stuck placement groups by running one of the following commands:

.. prompt:: bash

   ceph pg dump_stuck stale
   ceph pg dump_stuck inactive
   ceph pg dump_stuck unclean

- Stuck ``stale`` placement groups usually indicate that key ``ceph-osd``
  daemons are not running.
- Stuck ``inactive`` placement groups usually indicate a peering problem (see
  :ref:`failures-osd-peering`).
- Stuck ``unclean`` placement groups usually indicate that something is
  preventing recovery from completing, possibly unfound objects (see
  :ref:`failures-osd-unfound`).


.. _failures-osd-peering:

Placement Group Down - Peering Failure
======================================

In certain cases, the ``ceph-osd`` `peering` process can run into problems,
which can prevent a PG from becoming active and usable. In such a case, running
the command ``ceph health detail`` will report something similar to the following:

.. prompt:: bash

   ceph health detail

::

   HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
   ...
   pg 0.5 is down+peering
   pg 1.4 is down+peering
   ...
   osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651

Query the cluster to determine exactly why the PG is marked ``down`` by running
a command of the following form:

.. prompt:: bash

   ceph pg 0.5 query

.. code-block:: javascript

   { "state": "down+peering",
     ...
     "recovery_state": [
          { "name": "Started\/Primary\/Peering\/GetInfo",
            "enter_time": "2012-03-06 14:40:16.169679",
            "requested_info_from": []},
          { "name": "Started\/Primary\/Peering",
            "enter_time": "2012-03-06 14:40:16.169659",
            "probing_osds": [
                  0,
                  1],
            "blocked": "peering is blocked due to down osds",
            "down_osds_we_would_probe": [
                  1],
            "peering_blocked_by": [
                  { "osd": 1,
                    "current_lost_at": 0,
                    "comment": "starting or marking this osd lost may let us proceed"}]},
          { "name": "Started",
            "enter_time": "2012-03-06 14:40:16.169513"}
      ]
   }

The ``recovery_state`` section tells us that peering is blocked due to down
``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start that
particular ``ceph-osd`` and recovery will proceed.
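
How you start the daemon depends on how the cluster is deployed; for example
(both commands below assume ``osd.1`` is the daemon in question), one of the
following might apply:

.. prompt:: bash

   systemctl start ceph-osd@1      # traditional package-based deployments
   ceph orch daemon start osd.1    # cephadm-managed clusters
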

Alternatively, if there is a catastrophic failure of ``osd.1`` (for example, if
there has been a disk failure), the cluster can be informed that the OSD is
``lost`` and the cluster can be instructed that it must cope as best it can.

.. important:: Informing the cluster that an OSD has been lost is dangerous
   because the cluster cannot guarantee that the other copies of the data are
   consistent and up to date.

To report an OSD ``lost`` and to instruct Ceph to continue to attempt recovery
anyway, run a command of the following form:

.. prompt:: bash

   ceph osd lost 1

Recovery will proceed.


.. _failures-osd-unfound:

Unfound Objects
===============

Under certain combinations of failures, Ceph may complain about ``unfound``
objects, as in this example:

.. prompt:: bash

   ceph health detail

::

   HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
   pg 2.4 is active+degraded, 78 unfound

This means that the storage cluster knows that some objects (or newer copies of
existing objects) exist, but it hasn't found copies of them. Here is an
example of how this might come about for a PG whose data is on two OSDs, which
we will call "1" and "2":

* 1 goes down
* 2 handles some writes, alone
* 1 comes up
* 1 and 2 re-peer, and the objects missing on 1 are queued for recovery.
* Before the new objects are copied, 2 goes down.

At this point, 1 knows that these objects exist, but there is no live
``ceph-osd`` that has a copy of the objects. In this case, IO to those objects
will block, and the cluster will hope that the failed node comes back soon.
This is assumed to be preferable to returning an IO error to the user.

.. note:: The situation described immediately above is one reason that setting
   ``size=2`` on a replicated pool and ``m=1`` on an erasure coded pool risks
   data loss.

Identify which objects are unfound by running a command of the following form:

.. prompt:: bash

   ceph pg 2.4 list_unfound [starting offset, in json]

.. code-block:: javascript

   {
       "num_missing": 1,
       "num_unfound": 1,
       "objects": [
           {
               "oid": {
                   "oid": "object",
                   "key": "",
                   "snapid": -2,
                   "hash": 2249616407,
                   "max": 0,
                   "pool": 2,
                   "namespace": ""
               },
               "need": "43'251",
               "have": "0'0",
               "flags": "none",
               "clean_regions": "clean_offsets: [], clean_omap: 0, new_object: 1",
               "locations": [
                   "0(3)",
                   "4(2)"
               ]
           }
       ],
       "state": "NotRecovering",
       "available_might_have_unfound": true,
       "might_have_unfound": [
           {
               "osd": "2(4)",
               "status": "osd is down"
           }
       ],
       "more": false
   }

If there are too many objects to list in a single result, the ``more`` field
will be true and you can query for more. (Eventually the command line tool
will hide this from you, but not yet.)

Now you can identify which OSDs have been probed or might contain data.

At the end of the listing (before ``more: false``), ``might_have_unfound`` is
provided when ``available_might_have_unfound`` is true. This is equivalent to
the output of ``ceph pg #.# query`` and eliminates the need to run ``query``
directly. The ``might_have_unfound`` information provided here behaves the same
way as the ``might_have_unfound`` section of ``query``, which is described
below. The only difference is that OSDs that have the status of ``already
probed`` are ignored.

Use of ``query``:

.. prompt:: bash

   ceph pg 2.4 query

.. code-block:: javascript

   "recovery_state": [
        { "name": "Started\/Primary\/Active",
          "enter_time": "2012-03-06 15:15:46.713212",
          "might_have_unfound": [
                { "osd": 1,
                  "status": "osd is down"}]},

In this case, the cluster knows that ``osd.1`` might have data, but it is
``down``. Here is the full range of possible states:

* already probed
* querying
* OSD is down
* not queried (yet)

Sometimes it simply takes some time for the cluster to query possible
locations.

It is possible that there are other locations where the object might exist that
are not listed. For example: if an OSD is stopped and taken out of the cluster
and then the cluster fully recovers, and then through a subsequent set of
failures the cluster ends up with an unfound object, the cluster will ignore
the removed OSD. (This scenario, however, is unlikely.)

If all possible locations have been queried and objects are still lost, you may
have to give up on the lost objects. This, again, is possible only when unusual
combinations of failures have occurred that allow the cluster to learn about
writes that were performed before the writes themselves have been recovered. To
mark the "unfound" objects as "lost", run a command of the following form:

.. prompt:: bash

   ceph pg 2.5 mark_unfound_lost revert|delete

Here the final argument (``revert|delete``) specifies how the cluster should
deal with lost objects.

The ``delete`` option will cause the cluster to forget about them entirely.

The ``revert`` option (which is not available for erasure coded pools) will
either roll back to a previous version of the object or (if it was a new
object) forget about the object entirely. Use ``revert`` with caution, as it
may confuse applications that expect the object to exist.


Homeless Placement Groups
=========================

It is possible that every OSD that has copies of a given placement group fails.
If this happens, then the subset of the object store that contains those
placement groups becomes unavailable and the monitor will receive no status
updates for those placement groups. The monitor marks as ``stale`` any
placement group whose primary OSD has failed. For example:

.. prompt:: bash

   ceph health

::

   HEALTH_WARN 24 pgs stale; 3/300 in osds are down

Identify which placement groups are ``stale`` and which were the last OSDs to
store the ``stale`` placement groups by running the following command:

.. prompt:: bash

   ceph health detail

::

   HEALTH_WARN 24 pgs stale; 3/300 in osds are down
   ...
   pg 2.5 is stuck stale+active+remapped, last acting [2,0]
   ...
   osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
   osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
   osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861

This output indicates that placement group 2.5 (``pg 2.5``) was last managed by
``osd.0`` and ``osd.2``. Restart those OSDs to allow the cluster to recover
that placement group.
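
For example, on a cephadm-managed cluster the restart might look like this (the
daemon names below match the example output above and are illustrative):

.. prompt:: bash

   ceph orch daemon restart osd.0
   ceph orch daemon restart osd.2
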


Only a Few OSDs Receive Data
============================

If only a few of the nodes in the cluster are receiving data, check the number
of placement groups in the pool as instructed in the :ref:`Placement Groups
<rados_ops_pgs_get_pg_num>` documentation. Because placement groups are mapped
to OSDs by dividing the number of placement groups in the cluster by the number
of OSDs in the cluster, a small number of placement groups (the remainder of
that division) are sometimes not distributed across the cluster. In situations
like this, create a pool with a placement group count that is a multiple of the
number of OSDs. See `Placement Groups`_ for details. See the :ref:`Pool, PG,
and CRUSH Config Reference <rados_config_pool_pg_crush_ref>` for instructions
on changing the default values used to determine how many placement groups are
assigned to each pool.
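
For example, to compare the current placement-group count of a pool with the
number of OSDs in the cluster (the pool name ``mypool`` is a placeholder):

.. prompt:: bash

   ceph osd pool get mypool pg_num
   ceph osd stat
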


Can't Write Data
================

If the cluster is up, but some OSDs are down and you cannot write data, make
sure that you have the minimum number of OSDs running in the pool. If you don't
have the minimum number of OSDs running in the pool, Ceph will not allow you to
write data to it because there is no guarantee that Ceph can replicate your
data. See ``osd_pool_default_min_size`` in the :ref:`Pool, PG, and CRUSH
Config Reference <rados_config_pool_pg_crush_ref>` for details.
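
For example, to check how many replicas a pool requires before it accepts
writes and, if appropriate, to lower that requirement temporarily (the pool
name ``mypool`` is a placeholder; lowering ``min_size`` reduces data safety):

.. prompt:: bash

   ceph osd pool get mypool min_size
   ceph osd pool set mypool min_size 1
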


PGs Inconsistent
================

If the command ``ceph health detail`` returns an ``active + clean +
inconsistent`` state, this might indicate an error during scrubbing. Identify
the inconsistent placement group or placement groups by running the following
command:

.. prompt:: bash

   ceph health detail

::

   HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
   pg 0.6 is active+clean+inconsistent, acting [0,1,2]
   2 scrub errors

Alternatively, run this command if you prefer to inspect the output in a
programmatic way:

.. prompt:: bash

   rados list-inconsistent-pg rbd

::

   ["0.6"]

There is only one consistent state, but in the worst case we could find
different inconsistencies in more than one object. If an object named ``foo``
in PG ``0.6`` is truncated, the output of ``rados list-inconsistent-obj`` will
look something like this:

.. prompt:: bash

   rados list-inconsistent-obj 0.6 --format=json-pretty

.. code-block:: javascript

   {
       "epoch": 14,
       "inconsistents": [
           {
               "object": {
                   "name": "foo",
                   "nspace": "",
                   "locator": "",
                   "snap": "head",
                   "version": 1
               },
               "errors": [
                   "data_digest_mismatch",
                   "size_mismatch"
               ],
               "union_shard_errors": [
                   "data_digest_mismatch_info",
                   "size_mismatch_info"
               ],
               "selected_object_info": "0:602f83fe:::foo:head(16'1 client.4110.0:1 dirty|data_digest|omap_digest s 968 uv 1 dd e978e67f od ffffffff alloc_hint [0 0 0])",
               "shards": [
                   {
                       "osd": 0,
                       "errors": [],
                       "size": 968,
                       "omap_digest": "0xffffffff",
                       "data_digest": "0xe978e67f"
                   },
                   {
                       "osd": 1,
                       "errors": [],
                       "size": 968,
                       "omap_digest": "0xffffffff",
                       "data_digest": "0xe978e67f"
                   },
                   {
                       "osd": 2,
                       "errors": [
                           "data_digest_mismatch_info",
                           "size_mismatch_info"
                       ],
                       "size": 0,
                       "omap_digest": "0xffffffff",
                       "data_digest": "0xffffffff"
                   }
               ]
           }
       ]
   }

In this case, the output indicates the following:

* The only inconsistent object is named ``foo``, and its head has
  inconsistencies.
* The inconsistencies fall into two categories:

  #. ``errors``: these errors indicate inconsistencies between shards, without
     an indication of which shard(s) are bad. Check for the ``errors`` in the
     ``shards`` array, if available, to pinpoint the problem.

     * ``data_digest_mismatch``: the digest of the replica read from ``OSD.2``
       is different from the digests of the replica reads of ``OSD.0`` and
       ``OSD.1``.
     * ``size_mismatch``: the size of the replica read from ``OSD.2`` is ``0``,
       but the size reported by ``OSD.0`` and ``OSD.1`` is ``968``.

  #. ``union_shard_errors``: the union of all shard-specific ``errors`` in the
     ``shards`` array. The ``errors`` are set for the shard with the problem.
     These errors include ``read_error`` and other similar errors. The
     ``errors`` ending in ``_info`` (for example, ``data_digest_mismatch_info``)
     indicate a comparison with ``selected_object_info``. Examine the
     ``shards`` array to determine which shard has which error or errors.

     * ``data_digest_mismatch_info``: the digest stored in the ``object-info``
       does not match the digest ``0xffffffff`` calculated from the shard read
       from ``OSD.2``.
     * ``size_mismatch_info``: the size stored in the ``object-info`` is
       different from the size read from ``OSD.2``. The latter is ``0``.

.. warning:: If ``read_error`` is listed in a shard's ``errors`` attribute, the
   inconsistency is likely due to physical storage errors. In cases like this,
   check the storage used by that OSD.

Examine the output of ``dmesg`` and ``smartctl`` before attempting a drive
repair.

To repair the inconsistent placement group, run a command of the following
form:

.. prompt:: bash

   ceph pg repair {placement-group-ID}

For example:

.. prompt:: bash #

   ceph pg repair 1.4

.. warning:: This command overwrites the "bad" copies with "authoritative"
   copies. In most cases, Ceph is able to choose authoritative copies from all
   the available replicas by using some predefined criteria. This, however,
   does not work in every case. For example, it might be the case that the
   stored data digest is missing, which means that the calculated digest is
   ignored when Ceph chooses the authoritative copies. Be aware of this, and
   use the above command with caution.

.. note:: PG IDs have the form ``N.xxxxx``, where ``N`` is the number of the
   pool that contains the PG. The command ``ceph osd lspools`` and the
   command ``ceph osd dump | grep pool`` return a list of pool numbers.

If you receive ``active + clean + inconsistent`` states periodically due to
clock skew, consider configuring the `NTP
<https://en.wikipedia.org/wiki/Network_Time_Protocol>`_ daemons on your monitor
hosts to act as peers. See `The Network Time Protocol <http://www.ntp.org>`_
and Ceph :ref:`Clock Settings <mon-config-ref-clock>` for more information.
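
To see whether the monitors currently agree on the time, you can check the
cluster's view of monitor clock skew; on reasonably recent releases a command
like the following reports per-monitor skew:

.. prompt:: bash

   ceph time-sync-status
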

More Information on PG Repair
-----------------------------

Ceph stores and updates the checksums of objects stored in the cluster. When a
scrub is performed on a PG, the lead OSD attempts to choose an authoritative
copy from among its replicas. Only one of the possible cases is consistent.
After performing a deep scrub, Ceph calculates the checksum of each object that
is read from disk and compares it to the checksum that was previously recorded.
If the current checksum and the previously recorded checksum do not match, that
mismatch is considered to be an inconsistency. In the case of replicated pools,
any mismatch between the checksum of any replica of an object and the checksum
of the authoritative copy means that there is an inconsistency. The discovery
of these inconsistencies causes a PG's state to be set to ``inconsistent``.

The ``pg repair`` command attempts to fix inconsistencies of various kinds. When
``pg repair`` finds an inconsistent PG, it attempts to overwrite the digest of
the inconsistent copy with the digest of the authoritative copy. When ``pg
repair`` finds an inconsistent copy in a replicated pool, it marks the
inconsistent copy as missing. In the case of replicated pools, recovery is
beyond the scope of ``pg repair``.

In the case of erasure-coded and BlueStore pools, Ceph will automatically
perform repairs if ``osd_scrub_auto_repair`` (default ``false``) is set to
``true`` and if no more than ``osd_scrub_auto_repair_num_errors`` (default
``5``) errors are found.
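
For example, a sketch of how these options could be enabled cluster-wide
through the centralized configuration database (weigh the trade-offs for your
cluster before doing this):

.. prompt:: bash

   ceph config set osd osd_scrub_auto_repair true
   ceph config set osd osd_scrub_auto_repair_num_errors 5
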

The ``pg repair`` command will not solve every problem. Ceph does not
automatically repair PGs when they are found to contain inconsistencies.

The checksum of a RADOS object or an omap is not always available. Checksums
are calculated incrementally. If a replicated object is updated
non-sequentially, the write operation involved in the update changes the object
and invalidates its checksum. The whole object is not read while the checksum
is recalculated. The ``pg repair`` command is able to make repairs even when
checksums are not available to it, as in the case of Filestore. Users working
with replicated Filestore pools might prefer manual repair to ``ceph pg
repair``.

This material is relevant for Filestore, but not for BlueStore, which has its
own internal checksums. The matched-record checksum and the calculated checksum
cannot prove that any specific copy is in fact authoritative. If there is no
checksum available, ``pg repair`` favors the data on the primary, but this
might not be the uncorrupted replica. Because of this uncertainty, human
intervention is necessary when an inconsistency is discovered. This
intervention sometimes involves use of ``ceph-objectstore-tool``.

PG Repair Walkthrough
---------------------

https://ceph.io/geen-categorie/ceph-manually-repair-object/ - This page
contains a walkthrough of the repair of a PG. It is recommended reading if you
want to repair a PG but have never done so.

Erasure Coded PGs are not active+clean
======================================

If CRUSH fails to find enough OSDs to map to a PG, it will show as
``2147483647``, which is ``ITEM_NONE`` or ``no OSD found``. For example::

   [2,1,6,0,5,8,2147483647,7,4]

Not enough OSDs
---------------

If the Ceph cluster has only eight OSDs and an erasure coded pool needs nine
OSDs, the cluster will show "Not enough OSDs". In this case, you can either
create another erasure coded pool that requires fewer OSDs, by running commands
of the following form:

.. prompt:: bash

   ceph osd erasure-code-profile set myprofile k=5 m=3
   ceph osd pool create erasurepool erasure myprofile

or you can add new OSDs, and the PG will automatically use them.

CRUSH constraints cannot be satisfied
-------------------------------------

If the cluster has enough OSDs, it is possible that the CRUSH rule is imposing
constraints that cannot be satisfied. If there are ten OSDs on two hosts and
the CRUSH rule requires that no two OSDs from the same host are used in the
same PG, the mapping may fail because only two OSDs will be found. Check the
constraint by displaying ("dumping") the rule, as shown here:

.. prompt:: bash

   ceph osd crush rule ls

::

   [
       "replicated_rule",
       "erasurepool"]
   $ ceph osd crush rule dump erasurepool
   { "rule_id": 1,
     "rule_name": "erasurepool",
     "type": 3,
     "steps": [
           { "op": "take",
             "item": -1,
             "item_name": "default"},
           { "op": "chooseleaf_indep",
             "num": 0,
             "type": "host"},
           { "op": "emit"}]}

Resolve this problem by creating a new pool in which PGs are allowed to have
OSDs residing on the same host by running the following commands:

.. prompt:: bash

   ceph osd erasure-code-profile set myprofile crush-failure-domain=osd
   ceph osd pool create erasurepool erasure myprofile

CRUSH gives up too soon
-----------------------

If the Ceph cluster has just enough OSDs to map the PG (for instance a cluster
with a total of nine OSDs and an erasure coded pool that requires nine OSDs per
PG), it is possible that CRUSH gives up before finding a mapping. This problem
can be resolved by:

* lowering the erasure coded pool requirements to use fewer OSDs per PG (this
  requires the creation of another pool, because erasure code profiles cannot
  be modified dynamically).

* adding more OSDs to the cluster (this does not require the erasure coded pool
  to be modified, because it will become clean automatically).

* using a handmade CRUSH rule that tries more times to find a good mapping. An
  existing CRUSH rule can be modified by setting ``set_choose_tries`` to a
  value greater than the default.

First, verify the problem by using ``crushtool`` after extracting the crushmap
from the cluster. This ensures that your experiments do not modify the Ceph
cluster and that they operate only on local files:

.. prompt:: bash

   ceph osd crush rule dump erasurepool

::

   { "rule_id": 1,
     "rule_name": "erasurepool",
     "type": 3,
     "steps": [
           { "op": "take",
             "item": -1,
             "item_name": "default"},
           { "op": "chooseleaf_indep",
             "num": 0,
             "type": "host"},
           { "op": "emit"}]}
   $ ceph osd getcrushmap > crush.map
   got crush map from osdmap epoch 13
   $ crushtool -i crush.map --test --show-bad-mappings \
      --rule 1 \
      --num-rep 9 \
      --min-x 1 --max-x $((1024 * 1024))
   bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0]
   bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8]
   bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647]

Here, ``--num-rep`` is the number of OSDs that the erasure code CRUSH rule
needs, and ``--rule`` is the value of the ``rule_id`` field that was displayed
by ``ceph osd crush rule dump``. This test will attempt to map one million
values (in this example, the range defined by ``[--min-x,--max-x]``) and must
display at least one bad mapping. If this test outputs nothing, all mappings
have been successful and you can be assured that the problem with your cluster
is not caused by bad mappings.

Changing the value of set_choose_tries
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#. Decompile the CRUSH map to edit the CRUSH rule by running the following
   command:

   .. prompt:: bash

      crushtool --decompile crush.map > crush.txt

#. Add the following line to the rule::

      step set_choose_tries 100

   The relevant part of the ``crush.txt`` file will resemble this::

      rule erasurepool {
              id 1
              type erasure
              step set_chooseleaf_tries 5
              step set_choose_tries 100
              step take default
              step chooseleaf indep 0 type host
              step emit
      }

#. Recompile and retest the CRUSH rule:

   .. prompt:: bash

      crushtool --compile crush.txt -o better-crush.map

#. When all mappings succeed, display a histogram of the number of tries that
   were necessary to find all of the mappings by using the
   ``--show-choose-tries`` option of the ``crushtool`` command, as in the
   following example:

   .. prompt:: bash

      crushtool -i better-crush.map --test --show-bad-mappings \
         --show-choose-tries \
         --rule 1 \
         --num-rep 9 \
         --min-x 1 --max-x $((1024 * 1024))

   ::

      ...
      11: 42
      12: 44
      13: 54
      14: 45
      15: 35
      16: 34
      17: 30
      18: 25
      19: 19
      20: 22
      21: 20
      22: 17
      23: 13
      24: 16
      25: 13
      26: 11
      27: 11
      28: 13
      29: 11
      30: 10
      31: 6
      32: 5
      33: 10
      34: 3
      35: 7
      36: 5
      37: 2
      38: 5
      39: 5
      40: 2
      41: 5
      42: 4
      43: 1
      44: 2
      45: 2
      46: 3
      47: 1
      48: 0
      ...
      102: 0
      103: 1
      104: 0
      ...

   This output indicates that it took eleven tries to map forty-two PGs, twelve
   tries to map forty-four PGs, and so on. The highest number of tries is the
   minimum value of ``set_choose_tries`` that prevents bad mappings (for
   example, ``103`` in the above output, because it did not take more than 103
   tries for any PG to be mapped).

.. _check: ../../operations/placement-groups#get-the-number-of-placement-groups
.. _Placement Groups: ../../operations/placement-groups
.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref