=========================
 Monitoring OSDs and PGs
=========================

High availability and high reliability require a fault-tolerant approach to
managing hardware and software issues. Ceph has no single point of failure and
it can service requests for data even when in a "degraded" mode. Ceph's `data
placement`_ introduces a layer of indirection to ensure that data doesn't bind
directly to specific OSDs. For this reason, tracking system faults
requires finding the `placement group`_ (PG) and the underlying OSDs at the
root of the problem.

.. tip:: A fault in one part of the cluster might prevent you from accessing a
   particular object, but that doesn't mean that you are prevented from
   accessing other objects. When you run into a fault, don't panic. Just
   follow the steps for monitoring your OSDs and placement groups, and then
   begin troubleshooting.

Ceph is self-repairing. However, when problems persist, monitoring OSDs and
placement groups will help you identify the problem.


Monitoring OSDs
===============

An OSD is either *in* service (``in``) or *out* of service (``out``). An OSD is
either running and reachable (``up``), or it is not running and not
reachable (``down``).

If an OSD is ``up``, it may be either ``in`` service (clients can read and
write data) or ``out`` of service. If the OSD was ``in`` but then, due to a
failure or a manual action, was set to the ``out`` state, Ceph will migrate
placement groups to other OSDs to maintain the configured redundancy.
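
Marking an OSD ``out`` and then back ``in`` by hand is the most common such
manual action. The following is a minimal sketch (assuming an OSD with the ID
``1``); ``ceph osd out`` and ``ceph osd in`` are the usual commands for this:

.. prompt:: bash $

   ceph osd out osd.1   # CRUSH stops assigning PGs to this OSD; data migrates away
   ceph osd in osd.1    # return the OSD to service; data migrates back
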

If an OSD is ``out`` of service, CRUSH will not assign placement groups to it.
If an OSD is ``down``, it will also be ``out``.

.. note:: If an OSD is ``down`` and ``in``, there is a problem and this
   indicates that the cluster is not in a healthy state.

.. ditaa::

           +----------------+        +----------------+
           |                |        |                |
           |   OSD #n In    |        |   OSD #n Up    |
           |                |        |                |
           +----------------+        +----------------+
                   ^                         ^
                   |                         |
                   |                         |
                   v                         v
           +----------------+        +----------------+
           |                |        |                |
           |   OSD #n Out   |        |   OSD #n Down  |
           |                |        |                |
           +----------------+        +----------------+

If you run the commands ``ceph health``, ``ceph -s``, or ``ceph -w``,
you might notice that the cluster does not always show ``HEALTH OK``. Don't
panic. There are certain circumstances in which it is expected and normal that
the cluster will **NOT** show ``HEALTH OK``:

#. You haven't started the cluster yet.
#. You have just started or restarted the cluster and it's not ready to show
   health statuses yet, because the PGs are in the process of being created and
   the OSDs are in the process of peering.
#. You have just added or removed an OSD.
#. You have just modified your cluster map.

Checking to see if OSDs are ``up`` and running is an important aspect of
monitoring them: whenever the cluster is up and running, every OSD that is
``in`` the cluster should also be ``up`` and running. To see if all of the
cluster's OSDs are running, run the following command:

.. prompt:: bash $

   ceph osd stat

The output provides the following information: the total number of OSDs (x),
how many OSDs are ``up`` (y), how many OSDs are ``in`` (z), and the map epoch
(eNNNN). ::

   x osds: y up, z in; epoch: eNNNN
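
If you prefer machine-readable output (for scripting or monitoring
integrations), the standard ``--format`` option of the ``ceph`` CLI works here
as well; this is a minimal sketch:

.. prompt:: bash $

   ceph osd stat --format=json-pretty
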

If the number of OSDs that are ``in`` the cluster is greater than the number of
OSDs that are ``up``, run the following command to identify the ``ceph-osd``
daemons that are not running:

.. prompt:: bash $

   ceph osd tree

::

   #ID CLASS WEIGHT  TYPE NAME             STATUS REWEIGHT PRI-AFF
   -1        2.00000 pool openstack
   -3        2.00000 rack dell-2950-rack-A
   -2        2.00000 host dell-2950-A1
    0  ssd   1.00000      osd.0            up     1.00000  1.00000
    1  ssd   1.00000      osd.1            down   1.00000  1.00000

.. tip:: Searching through a well-designed CRUSH hierarchy to identify the
   physical locations of particular OSDs might help you troubleshoot your
   cluster.
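
To map an OSD ID directly to its host and CRUSH location, the ``ceph osd
find`` command can be useful. The following is a minimal sketch (assuming OSD
``1`` is the one reported ``down``); the JSON output typically includes the
daemon's host name and CRUSH location, which tells you which machine to
inspect:

.. prompt:: bash $

   ceph osd find 1
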

If an OSD is ``down``, start it by running the following command:

.. prompt:: bash $

   sudo systemctl start ceph-osd@1

For problems associated with OSDs that have stopped or won't restart, see `OSD Not Running`_.
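
Before restarting, it can help to check why the daemon stopped. The following
is a minimal sketch using standard systemd tooling (assuming the OSD ID is
``1``); it does not use any Ceph-specific commands:

.. prompt:: bash $

   sudo systemctl status ceph-osd@1
   sudo journalctl -u ceph-osd@1 --since "1 hour ago"
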


PG Sets
=======

When CRUSH assigns a PG to OSDs, it takes note of how many replicas of the PG
are required by the pool and then assigns each replica to a different OSD.
For example, if the pool requires three replicas of a PG, CRUSH might assign
them individually to ``osd.1``, ``osd.2`` and ``osd.3``. CRUSH seeks a
pseudo-random placement that takes into account the failure domains that you
have set in your `CRUSH map`_; for this reason, PGs are rarely assigned to
immediately adjacent OSDs in a large cluster.

Ceph processes client requests with the **Acting Set** of OSDs: this is the set
of OSDs that currently have a full and working version of a PG shard and that
are therefore responsible for handling requests. By contrast, the **Up Set** is
the set of OSDs that contain a shard of a specific PG: data is moved or copied
to the **Up Set**, or is planned to be moved or copied to it. See
:ref:`Placement Group Concepts <rados_operations_pg_concepts>`.

Sometimes an OSD in the Acting Set is ``down`` or otherwise unable to
service requests for objects in the PG. When this kind of situation
arises, don't panic. Common examples of such a situation include:

- You added or removed an OSD, CRUSH reassigned the PG to other OSDs, and this
  reassignment changed the composition of the Acting Set and triggered the
  migration of data by means of a "backfill" process.
- An OSD was ``down``, was restarted, and is now ``recovering``.
- An OSD in the Acting Set is ``down`` or unable to service requests,
  and another OSD has temporarily assumed its duties.

Typically, the Up Set and the Acting Set are identical. When they are not, it
might indicate that Ceph is migrating the PG (in other words, that the PG has
been remapped), that an OSD is recovering, or that there is a problem with the
cluster (in such scenarios, Ceph usually shows a "HEALTH WARN" state with a
"stuck stale" message).

To retrieve a list of PGs, run the following command:

.. prompt:: bash $

   ceph pg dump

To see which OSDs are within the Acting Set and the Up Set for a specific PG,
run the following command:

.. prompt:: bash $

   ceph pg map {pg-num}

The output provides the following information: the osdmap epoch (eNNN), the PG
number ({pg-num}), the OSDs in the Up Set (up[]), and the OSDs in the Acting
Set (acting[])::

   osdmap eNNN pg {raw-pg-num} ({pg-num}) -> up [0,1,2] acting [0,1,2]

.. note:: If the Up Set and the Acting Set do not match, this might indicate
   that the cluster is rebalancing itself or that there is a problem with
   the cluster.
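
To compare the Up Set and the Acting Set across all PGs at once, a condensed
dump can be handy. This is a minimal sketch; ``pgs_brief`` is one of the
``ceph pg dump`` output selectors and limits the output to each PG's ID,
state, and up/acting sets with their primaries:

.. prompt:: bash $

   ceph pg dump pgs_brief
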


Peering
=======

Before you can write data to a PG, it must be in an ``active`` state and it
will preferably be in a ``clean`` state. For Ceph to determine the current
state of a PG, peering must take place. That is, the primary OSD of the PG
(that is, the first OSD in the Acting Set) must peer with the secondary and
tertiary OSDs so that consensus on the current state of the PG can be
established. In the following diagram, we assume a pool with three replicas
of the PG:

.. ditaa::

           +---------+     +---------+     +-------+
           |  OSD 1  |     |  OSD 2  |     | OSD 3 |
           +---------+     +---------+     +-------+
                |               |              |
                |  Request To   |              |
                |     Peer      |              |
                |-------------->|              |
                |<--------------|              |
                |    Peering                   |
                |                              |
                |         Request To           |
                |            Peer              |
                |----------------------------->|
                |<-----------------------------|
                |         Peering              |

The OSDs also report their status to the monitor. For details, see
`Configuring Monitor/OSD Interaction`_. To troubleshoot peering issues, see
`Peering Failure`_.
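
To see where a particular PG is in the peering process, you can inspect its
query output. This is a minimal sketch (assuming a PG ID of ``1.1701b``); the
``recovery_state`` section of the JSON output records the progress of peering
and recovery, with timestamps:

.. prompt:: bash $

   ceph pg 1.1701b query
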


Monitoring PG States
====================

If you run the commands ``ceph health``, ``ceph -s``, or ``ceph -w``,
you might notice that the cluster does not always show ``HEALTH OK``. After
first checking to see if the OSDs are running, you should also check PG
states. There are certain PG-peering-related circumstances in which it is
expected and normal that the cluster will **NOT** show ``HEALTH OK``:

#. You have just created a pool and the PGs haven't peered yet.
#. The PGs are recovering.
#. You have just added an OSD to or removed an OSD from the cluster.
#. You have just modified your CRUSH map and your PGs are migrating.
#. There is inconsistent data in different replicas of a PG.
#. Ceph is scrubbing a PG's replicas.
#. Ceph doesn't have enough storage capacity to complete backfilling operations.

If one of these circumstances causes Ceph to show ``HEALTH WARN``, don't
panic. In many cases, the cluster will recover on its own. In some cases,
however, you might need to take action. An important aspect of monitoring PGs
is to check their status as ``active`` and ``clean``: that is, it is important
to ensure that, when the cluster is up and running, all PGs are ``active`` and
(preferably) ``clean``. To see the status of every PG, run the following
command:

.. prompt:: bash $

   ceph pg stat

The output provides the following information: the total number of PGs (x), how
many PGs are in a particular state such as ``active+clean`` (y), and the
amount of data stored (z). ::

   x pgs: y active+clean; z bytes data, aa MB used, bb GB / cc GB avail

.. note:: It is common for Ceph to report multiple states for PGs (for example,
   ``active+clean``, ``active+clean+remapped``, and ``active+clean+scrubbing``).

Here Ceph shows not only the PG states, but also the amount of storage capacity
used (aa), the amount of storage capacity remaining (bb), and the total storage
capacity of the cluster (cc). These values can be important in a few cases:

- The cluster is reaching its ``near full ratio`` or ``full ratio``.
- Data is not being distributed across the cluster due to an error in the
  CRUSH configuration.
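
To see how close the cluster is to those thresholds, check overall utilization
and the configured ratios. This is a minimal sketch using standard commands;
``ceph df`` summarizes raw and per-pool usage, and the full ratios are recorded
in the OSD map:

.. prompt:: bash $

   ceph df
   ceph osd dump | grep -E 'full_ratio|backfillfull_ratio|nearfull_ratio'
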


.. topic:: Placement Group IDs

   PG IDs consist of the pool number (not the pool name) followed by a period
   (.) and a hexadecimal number. You can view pool numbers and their names in
   the output of ``ceph osd lspools``. For example, the first pool that was
   created corresponds to pool number ``1``. A fully qualified PG ID has the
   following form::

      {pool-num}.{pg-id}

   It typically resembles the following::

      1.1701b
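
   To confirm which pool number corresponds to which pool name, list the
   pools. This is a minimal sketch, and the pools shown are hypothetical
   examples:

   .. prompt:: bash $

      ceph osd lspools

   ::

      1 rbd
      2 cephfs_data
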

To retrieve a list of PGs, run the following command:

.. prompt:: bash $

   ceph pg dump

To format the output in JSON format and save it to a file, run the following
command:

.. prompt:: bash $

   ceph pg dump -o {filename} --format=json

To query a specific PG, run the following command:

.. prompt:: bash $

   ceph pg {poolnum}.{pg-id} query

Ceph will output the query in JSON format.

The following subsections describe the most common PG states in detail.


Creating
--------

PGs are created when you create a pool: the command that creates a pool
specifies the total number of PGs for that pool, and when the pool is created
all of those PGs are created as well. Ceph will echo ``creating`` while it is
creating PGs. After the PG(s) are created, the OSDs that are part of a PG's
Acting Set will peer. Once peering is complete, the PG status should be
``active+clean``. This status means that Ceph clients can begin writing to the
PG.

.. ditaa::

       /-----------\       /-----------\       /-----------\
       |  Creating |------>|  Peering  |------>|  Active   |
       \-----------/       \-----------/       \-----------/

Peering
-------

When a PG peers, the OSDs that store the replicas of its data converge on an
agreed state of the data and metadata within that PG. When peering is complete,
those OSDs agree about the state of that PG. However, completion of the peering
process does **NOT** mean that each replica has the latest contents.

.. topic:: Authoritative History

   Ceph will **NOT** acknowledge a write operation to a client until that write
   operation is persisted by every OSD in the Acting Set. This practice ensures
   that at least one member of the Acting Set will have a record of every
   acknowledged write operation since the last successful peering operation.

   Given an accurate record of each acknowledged write operation, Ceph can
   construct a new authoritative history of the PG--that is, a complete and
   fully ordered set of operations that, if performed, would bring an OSD's
   copy of the PG up to date.


Active
------

After Ceph has completed the peering process, a PG should become ``active``.
The ``active`` state means that the data in the PG is generally available for
read and write operations in the primary and replica OSDs.


Clean
-----

When a PG is in the ``clean`` state, all OSDs holding its data and metadata
have successfully peered and there are no stray replicas. Ceph has replicated
all objects in the PG the correct number of times.


Degraded
--------

When a client writes an object to the primary OSD, the primary OSD is
responsible for writing the replicas to the replica OSDs. After the primary OSD
writes the object to storage, the PG will remain in a ``degraded``
state until the primary OSD has received an acknowledgement from the replica
OSDs that Ceph created the replica objects successfully.

The reason that a PG can be ``active+degraded`` is that an OSD can be
``active`` even if it doesn't yet hold all of the PG's objects. If an OSD goes
``down``, Ceph marks each PG assigned to the OSD as ``degraded``. The PGs must
peer again when the OSD comes back online. However, a client can still write a
new object to a ``degraded`` PG if it is ``active``.

If an OSD is ``down`` and the ``degraded`` condition persists, Ceph might mark
the ``down`` OSD as ``out`` of the cluster and remap the data from the ``down``
OSD to another OSD. The time between being marked ``down`` and being marked
``out`` is determined by ``mon_osd_down_out_interval``, which is set to ``600``
seconds by default.
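
To check or change this interval on a running cluster, the centralized
configuration commands can be used; this is a minimal sketch (the value shown
is the default):

.. prompt:: bash $

   ceph config get mon mon_osd_down_out_interval
   ceph config set mon mon_osd_down_out_interval 600
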

A PG can also be in the ``degraded`` state because there are one or more
objects that Ceph expects to find in the PG but that Ceph cannot find. Although
you cannot read or write to unfound objects, you can still access all of the
other objects in the ``degraded`` PG.


Recovering
----------

Ceph was designed for fault-tolerance, because hardware and other server
problems are expected or even routine. When an OSD goes ``down``, its contents
might fall behind the current state of other replicas in the PGs. When the OSD
has returned to the ``up`` state, the contents of the PGs must be updated to
reflect that current state. During that time period, the OSD might be in a
``recovering`` state.

Recovery is not always trivial, because a hardware failure might cause a
cascading failure of multiple OSDs. For example, a network switch for a rack or
cabinet might fail, which can cause the OSDs of a number of host machines to
fall behind the current state of the cluster. In such a scenario, general
recovery is possible only if each of the OSDs recovers after the fault has been
resolved.

Ceph provides a number of settings that determine how the cluster balances the
resource contention between the need to process new service requests and the
need to recover data objects and restore the PGs to the current state. The
``osd_recovery_delay_start`` setting allows an OSD to restart, re-peer, and
even process some replay requests before starting the recovery process. The
``osd_recovery_thread_timeout`` setting determines the duration of a thread
timeout, because multiple OSDs might fail, restart, and re-peer at staggered
rates. The ``osd_recovery_max_active`` setting limits the number of recovery
requests an OSD can entertain simultaneously, in order to prevent the OSD from
failing to serve. The ``osd_recovery_max_chunk`` setting limits the size of
the recovered data chunks, in order to prevent network congestion.
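
These settings can be inspected or adjusted at runtime through the centralized
configuration database. The following is a minimal sketch (assuming ``osd.0``
is up); the value ``3`` is only an illustration:

.. prompt:: bash $

   ceph config set osd osd_recovery_max_active 3
   ceph config show osd.0 | grep osd_recovery
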


Back Filling
------------

When a new OSD joins the cluster, CRUSH will reassign PGs from OSDs that are
already in the cluster to the newly added OSD. It can put excessive load on the
new OSD to force it to immediately accept the reassigned PGs. Back filling the
OSD with the PGs allows this process to begin in the background. After the
backfill operations have completed, the new OSD will begin serving requests as
soon as it is ready.

During the backfill operations, you might see one of several states:
``backfill_wait`` indicates that a backfill operation is pending, but is not
yet underway; ``backfilling`` indicates that a backfill operation is currently
underway; and ``backfill_toofull`` indicates that a backfill operation was
requested but couldn't be completed due to insufficient storage capacity. When
a PG cannot be backfilled, it might be considered ``incomplete``.

The ``backfill_toofull`` state might be transient. It might happen that, as PGs
are moved around, space becomes available. The ``backfill_toofull`` state is
similar to ``backfill_wait`` in that backfill operations can proceed as soon as
conditions change.

Ceph provides a number of settings to manage the load spike associated with the
reassignment of PGs to an OSD (especially a new OSD). The ``osd_max_backfills``
setting specifies the maximum number of concurrent backfills to and from an OSD
(default: 1). The ``backfill_full_ratio`` setting allows an OSD to refuse a
backfill request if the OSD is approaching its full ratio (default: 90%). This
setting can be changed with the ``ceph osd set-backfillfull-ratio`` command. If
an OSD refuses a backfill request, the ``osd_backfill_retry_interval`` setting
allows an OSD to retry the request after a certain interval (default: 30
seconds). OSDs can also set ``osd_backfill_scan_min`` and
``osd_backfill_scan_max`` in order to manage scan intervals (default: 64 and
512, respectively).
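
As with the recovery settings, these can be tuned at runtime. The following is
a minimal sketch; the values are illustrative only, and
``ceph osd set-backfillfull-ratio`` is the command mentioned above:

.. prompt:: bash $

   ceph config set osd osd_max_backfills 2
   ceph osd set-backfillfull-ratio 0.92
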


Remapped
--------

When the Acting Set that services a PG changes, the data migrates from the old
Acting Set to the new Acting Set. Because it might take time for the new
primary OSD to begin servicing requests, the old primary OSD might be required
to continue servicing requests until the PG data migration is complete. After
data migration has completed, the mapping uses the primary OSD of the new
Acting Set.


Stale
-----

Although Ceph uses heartbeats in order to ensure that hosts and daemons are
running, the ``ceph-osd`` daemons might enter a ``stuck`` state where they are
not reporting statistics in a timely manner (for example, there might be a
temporary network fault). By default, OSD daemons report their PG, ``up_thru``,
boot, and failure statistics every half second (that is, in accordance with a
value of ``0.5``), which is more frequent than the reports defined by the
heartbeat thresholds. If the primary OSD of a PG's Acting Set fails to report
to the monitor or if other OSDs have reported the primary OSD ``down``, the
monitors will mark the PG ``stale``.

When you start your cluster, it is common to see the ``stale`` state until the
peering process completes. After your cluster has been running for a while,
however, seeing PGs in the ``stale`` state indicates that the primary OSD for
those PGs is ``down`` or not reporting PG statistics to the monitor.
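
To see which PGs are currently flagged as ``stale`` (and which OSDs stopped
reporting), the health detail output is usually the quickest place to look;
this is a minimal sketch:

.. prompt:: bash $

   ceph health detail
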


Identifying Troubled PGs
========================

As previously noted, a PG is not necessarily having problems just because its
state is not ``active+clean``. When PGs are stuck, this might indicate that
Ceph cannot perform self-repairs. The stuck states include:

- **Unclean**: PGs contain objects that have not been replicated the desired
  number of times. Under normal conditions, it can be assumed that these PGs
  are recovering.
- **Inactive**: PGs cannot process reads or writes because they are waiting for
  an OSD that has the most up-to-date data to come back ``up``.
- **Stale**: PGs are in an unknown state, because the OSDs that host them have
  not reported to the monitor cluster for a certain period of time (determined
  by ``mon_osd_report_timeout``).

To identify stuck PGs, run the following command:

.. prompt:: bash $

   ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded]

For more detail, see `Placement Group Subsystem`_. To troubleshoot stuck PGs,
see `Troubleshooting PG Errors`_.


Finding an Object Location
==========================

To store object data in the Ceph Object Store, a Ceph client must:

#. Set an object name
#. Specify a `pool`_

The Ceph client retrieves the latest cluster map, the CRUSH algorithm
calculates how to map the object to a PG, and then the algorithm calculates how
to dynamically assign the PG to an OSD. To find the object location given only
the object name and the pool name, run a command of the following form:

.. prompt:: bash $

   ceph osd map {poolname} {object-name} [namespace]

.. topic:: Exercise: Locate an Object

   As an exercise, let's create an object. We can specify an object name, a
   path to a test file that contains some object data, and a pool name by
   using the ``rados put`` command on the command line. For example:

   .. prompt:: bash $

      rados put {object-name} {file-path} --pool=data
      rados put test-object-1 testfile.txt --pool=data

   To verify that the Ceph Object Store stored the object, run the following
   command:

   .. prompt:: bash $

      rados -p data ls

   To identify the object location, run the following commands:

   .. prompt:: bash $

      ceph osd map {pool-name} {object-name}
      ceph osd map data test-object-1

   Ceph should output the object's location. For example::

      osdmap e537 pool 'data' (1) object 'test-object-1' -> pg 1.d1743484 (1.4) -> up ([0,1], p0) acting ([0,1], p0)

   To remove the test object, simply delete it by running the ``rados rm``
   command. For example:

   .. prompt:: bash $

      rados rm test-object-1 --pool=data

As the cluster evolves, the object location may change dynamically. One benefit
of Ceph's dynamic rebalancing is that Ceph spares you the burden of manually
performing the migration. For details, see the `Architecture`_ section.

.. _data placement: ../data-placement
.. _pool: ../pools
.. _placement group: ../placement-groups
.. _Architecture: ../../../architecture
.. _OSD Not Running: ../../troubleshooting/troubleshooting-osd#osd-not-running
.. _Troubleshooting PG Errors: ../../troubleshooting/troubleshooting-pg#troubleshooting-pg-errors
.. _Peering Failure: ../../troubleshooting/troubleshooting-pg#failures-osd-peering
.. _CRUSH map: ../crush-map
.. _Configuring Monitor/OSD Interaction: ../../configuration/mon-osd-interaction/
.. _Placement Group Subsystem: ../control#placement-group-subsystem