Merge pull request #50863 from zdover23/wip-doc-2023-04-05-rados-operations-monitoring-osd-pg-2-of-x

doc/rados/ops: edit monitoring-osd-pg.rst (2 of x)

Reviewed-by: Anthony D'Atri <anthony.datri@gmail.com>
zdover23 2023-04-08 13:53:31 +10:00 committed by GitHub
commit 94da3f270e


@@ -15,8 +15,8 @@ root of the problem.
When you run into a fault, don't panic. Just follow the steps for monitoring
your OSDs and placement groups, and then begin troubleshooting.
Ceph is generally self-repairing. However, when problems persist and you want to find out
what exactly is going wrong, it can be helpful to monitor OSDs and PGs.
Ceph is self-repairing. However, when problems persist, monitoring OSDs and
placement groups will help you identify the problem.
Monitoring OSDs
@@ -284,12 +284,13 @@ The following subsections describe the most common PG states in detail.
Creating
--------
When you create a pool, it will create the number of placement groups you
specified. Ceph will echo ``creating`` when it is creating one or more
placement groups. Once they are created, the OSDs that are part of a placement
group's Acting Set will peer. Once peering is complete, the placement group
status should be ``active+clean``, which means a Ceph client can begin writing
to the placement group.
PGs are created when you create a pool: the command that creates a pool
specifies the total number of PGs for that pool, and when the pool is created
all of those PGs are created as well. Ceph will echo ``creating`` while it is
creating PGs. After the PG(s) are created, the OSDs that are part of a PG's
Acting Set will peer. Once peering is complete, the PG status should be
``active+clean``. This status means that Ceph clients can begin writing to the
PG.
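For example, a new pool can be created with an explicit number of PGs, and ``ceph pg stat`` can then be used to watch those PGs move from ``creating`` to ``active+clean`` (the pool name and PG count below are placeholders to be replaced with real values):

.. prompt:: bash $

   ceph osd pool create {pool-name} {pg-num}
   ceph pg stat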
.. ditaa::
@@ -300,43 +301,38 @@ to the placement group.
Peering
-------
When Ceph is Peering a placement group, Ceph is bringing the OSDs that
store the replicas of the placement group into **agreement about the state**
of the objects and metadata in the placement group. When Ceph completes peering,
this means that the OSDs that store the placement group agree about the current
state of the placement group. However, completion of the peering process does
**NOT** mean that each replica has the latest contents.
When a PG peers, the OSDs that store the replicas of its data converge on an
agreed state of the data and metadata within that PG. When peering is complete,
those OSDs agree about the state of that PG. However, completion of the peering
process does **NOT** mean that each replica has the latest contents.
.. topic:: Authoritative History
Ceph will **NOT** acknowledge a write operation to a client, until
all OSDs of the acting set persist the write operation. This practice
ensures that at least one member of the acting set will have a record
of every acknowledged write operation since the last successful
peering operation.
Ceph will **NOT** acknowledge a write operation to a client until that write
operation is persisted by every OSD in the Acting Set. This practice ensures
that at least one member of the Acting Set will have a record of every
acknowledged write operation since the last successful peering operation.
With an accurate record of each acknowledged write operation, Ceph can
construct and disseminate a new authoritative history of the placement
group--a complete, and fully ordered set of operations that, if performed,
would bring an OSDs copy of a placement group up to date.
Given an accurate record of each acknowledged write operation, Ceph can
construct a new authoritative history of the PG--that is, a complete and
fully ordered set of operations that, if performed, would bring an OSD's
copy of the PG up to date.
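To inspect the peering status of a specific PG, including its Acting Set and recovery state, the PG can be queried directly (here ``{pgid}`` is a placeholder for a PG ID such as those reported by ``ceph pg stat``):

.. prompt:: bash $

   ceph pg {pgid} query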
Active
------
Once Ceph completes the peering process, a placement group may become
``active``. The ``active`` state means that the data in the placement group is
generally available in the primary placement group and the replicas for read
and write operations.
After Ceph has completed the peering process, a PG should become ``active``.
The ``active`` state means that the data in the PG is generally available for
read and write operations in the primary and replica OSDs.
Clean
-----
When a placement group is in the ``clean`` state, the primary OSD and the
replica OSDs have successfully peered and there are no stray replicas for the
placement group. Ceph replicated all objects in the placement group the correct
number of times.
When a PG is in the ``clean`` state, all OSDs holding its data and metadata
have successfully peered and there are no stray replicas. Ceph has replicated
all objects in the PG the correct number of times.
Degraded
@@ -344,143 +340,147 @@ Degraded
When a client writes an object to the primary OSD, the primary OSD is
responsible for writing the replicas to the replica OSDs. After the primary OSD
writes the object to storage, the placement group will remain in a ``degraded``
writes the object to storage, the PG will remain in a ``degraded``
state until the primary OSD has received an acknowledgement from the replica
OSDs that Ceph created the replica objects successfully.
The reason a placement group can be ``active+degraded`` is that an OSD may be
``active`` even though it doesn't hold all of the objects yet. If an OSD goes
``down``, Ceph marks each placement group assigned to the OSD as ``degraded``.
The OSDs must peer again when the OSD comes back online. However, a client can
still write a new object to a ``degraded`` placement group if it is ``active``.
The reason that a PG can be ``active+degraded`` is that an OSD can be
``active`` even if it doesn't yet hold all of the PG's objects. If an OSD goes
``down``, Ceph marks each PG assigned to the OSD as ``degraded``. The PGs must
peer again when the OSD comes back online. However, a client can still write a
new object to a ``degraded`` PG if it is ``active``.
If an OSD is ``down`` and the ``degraded`` condition persists, Ceph may mark the
If an OSD is ``down`` and the ``degraded`` condition persists, Ceph might mark the
``down`` OSD as ``out`` of the cluster and remap the data from the ``down`` OSD
to another OSD. The time between being marked ``down`` and being marked ``out``
is controlled by ``mon_osd_down_out_interval``, which is set to ``600`` seconds
is determined by ``mon_osd_down_out_interval``, which is set to ``600`` seconds
by default.
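The current value of ``mon_osd_down_out_interval`` can be checked, and if necessary changed, with the ``ceph config`` commands. The interval shown here is only an example value, expressed in seconds:

.. prompt:: bash $

   ceph config get mon mon_osd_down_out_interval
   ceph config set mon mon_osd_down_out_interval 900   # example value only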
A placement group can also be ``degraded``, because Ceph cannot find one or more
objects that Ceph thinks should be in the placement group. While you cannot
read or write to unfound objects, you can still access all of the other objects
in the ``degraded`` placement group.
A PG can also be in the ``degraded`` state because there are one or more
objects that Ceph expects to find in the PG but that Ceph cannot find. Although
you cannot read or write to unfound objects, you can still access all of the other
objects in the ``degraded`` PG.
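To see which PGs are ``degraded`` and to list any unfound objects in a particular PG, commands like the following can be used (``{pgid}`` is a placeholder for a real PG ID):

.. prompt:: bash $

   ceph health detail
   ceph pg {pgid} list_unfound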
Recovering
----------
Ceph was designed for fault-tolerance at a scale where hardware and software
problems are ongoing. When an OSD goes ``down``, its contents may fall behind
the current state of other replicas in the placement groups. When the OSD is
back ``up``, the contents of the placement groups must be updated to reflect the
current state. During that time period, the OSD may reflect a ``recovering``
state.
Ceph was designed for fault-tolerance, because hardware and other server
problems are expected or even routine. When an OSD goes ``down``, its contents
might fall behind the current state of other replicas in the PGs. When the OSD
has returned to the ``up`` state, the contents of the PGs must be updated to
reflect that current state. During that time period, the OSD might be in a
``recovering`` state.
Recovery is not always trivial, because a hardware failure might cause a
cascading failure of multiple OSDs. For example, a network switch for a rack or
cabinet may fail, which can cause the OSDs of a number of host machines to fall
behind the current state of the cluster. Each one of the OSDs must recover once
the fault is resolved.
cabinet might fail, which can cause the OSDs of a number of host machines to
fall behind the current state of the cluster. In such a scenario, general
recovery is possible only if each of the OSDs recovers after the fault has been
resolved.
Ceph provides a number of settings to balance the resource contention between
new service requests and the need to recover data objects and restore the
placement groups to the current state. The ``osd_recovery_delay_start`` setting
allows an OSD to restart, re-peer and even process some replay requests before
starting the recovery process. The ``osd_recovery_thread_timeout`` sets a thread
timeout, because multiple OSDs may fail, restart and re-peer at staggered rates.
The ``osd_recovery_max_active`` setting limits the number of recovery requests
an OSD will entertain simultaneously to prevent the OSD from failing to serve.
The ``osd_recovery_max_chunk`` setting limits the size of the recovered data
chunks to prevent network congestion.
Ceph provides a number of settings that determine how the cluster balances the
resource contention between the need to process new service requests and the
need to recover data objects and restore the PGs to the current state. The
``osd_recovery_delay_start`` setting allows an OSD to restart, re-peer, and
even process some replay requests before starting the recovery process. The
``osd_recovery_thread_timeout`` setting determines the duration of a thread
timeout, because multiple OSDs might fail, restart, and re-peer at staggered
rates. The ``osd_recovery_max_active`` setting limits the number of recovery
requests an OSD can entertain simultaneously, in order to prevent the OSD from
failing to serve. The ``osd_recovery_max_chunk`` setting limits the size of
the recovered data chunks, in order to prevent network congestion.
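These recovery settings can be inspected and tuned at runtime with the ``ceph config`` commands. The following sketch shows how the limit on concurrent recovery requests might be checked and raised; the value ``3`` is only an example:

.. prompt:: bash $

   ceph config get osd osd_recovery_max_active
   ceph config set osd osd_recovery_max_active 3   # example value only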
Back Filling
------------
When a new OSD joins the cluster, CRUSH will reassign placement groups from OSDs
in the cluster to the newly added OSD. Forcing the new OSD to accept the
reassigned placement groups immediately can put excessive load on the new OSD.
Back filling the OSD with the placement groups allows this process to begin in
the background. Once backfilling is complete, the new OSD will begin serving
requests when it is ready.
When a new OSD joins the cluster, CRUSH will reassign PGs from OSDs that are
already in the cluster to the newly added OSD. It can put excessive load on the
new OSD to force it to immediately accept the reassigned PGs. Back filling the
OSD with the PGs allows this process to begin in the background. After the
backfill operations have completed, the new OSD will begin serving requests as
soon as it is ready.
During the backfill operations, you may see one of several states:
During the backfill operations, you might see one of several states:
``backfill_wait`` indicates that a backfill operation is pending, but is not
underway yet; ``backfilling`` indicates that a backfill operation is underway;
and, ``backfill_toofull`` indicates that a backfill operation was requested,
but couldn't be completed due to insufficient storage capacity. When a
placement group cannot be backfilled, it may be considered ``incomplete``.
yet underway; ``backfilling`` indicates that a backfill operation is currently
underway; and ``backfill_toofull`` indicates that a backfill operation was
requested but couldn't be completed due to insufficient storage capacity. When
a PG cannot be backfilled, it might be considered ``incomplete``.
The ``backfill_toofull`` state may be transient. It is possible that as PGs
are moved around, space may become available. The ``backfill_toofull`` is
similar to ``backfill_wait`` in that as soon as conditions change
backfill can proceed.
The ``backfill_toofull`` state might be transient. It might happen that, as PGs
are moved around, space becomes available. The ``backfill_toofull`` state is
similar to ``backfill_wait`` in that backfill operations can proceed as soon as
conditions change.
Ceph provides a number of settings to manage the load spike associated with
reassigning placement groups to an OSD (especially a new OSD). By default,
``osd_max_backfills`` sets the maximum number of concurrent backfills to and from
an OSD to 1. The ``backfill_full_ratio`` enables an OSD to refuse a
backfill request if the OSD is approaching its full ratio (90%, by default) and
change with ``ceph osd set-backfillfull-ratio`` command.
If an OSD refuses a backfill request, the ``osd_backfill_retry_interval``
enables an OSD to retry the request (after 30 seconds, by default). OSDs can
also set ``osd_backfill_scan_min`` and ``osd_backfill_scan_max`` to manage scan
intervals (64 and 512, by default).
Ceph provides a number of settings to manage the load spike associated with the
reassignment of PGs to an OSD (especially a new OSD). The ``osd_max_backfills``
setting specifies the maximum number of concurrent backfills to and from an OSD
(default: 1). The ``backfill_full_ratio`` setting allows an OSD to refuse a
backfill request if the OSD is approaching its full ratio (default: 90%). This
setting can be changed with the ``ceph osd set-backfillfull-ratio`` command. If
an OSD refuses a backfill request, the ``osd_backfill_retry_interval`` setting
allows an OSD to retry the request after a certain interval (default: 30
seconds). OSDs can also set ``osd_backfill_scan_min`` and
``osd_backfill_scan_max`` in order to manage scan intervals (default: 64 and
512, respectively).
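As an illustration, the backfill full ratio and the maximum number of concurrent backfills might be adjusted as follows (the values shown are examples, not recommendations):

.. prompt:: bash $

   ceph osd set-backfillfull-ratio 0.92   # example ratio only
   ceph config set osd osd_max_backfills 2   # example value only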
Remapped
--------
When the Acting Set that services a placement group changes, the data migrates
from the old acting set to the new acting set. It may take some time for a new
primary OSD to service requests. So it may ask the old primary to continue to
service requests until the placement group migration is complete. Once data
migration completes, the mapping uses the primary OSD of the new acting set.
When the Acting Set that services a PG changes, the data migrates from the old
Acting Set to the new Acting Set. Because it might take time for the new
primary OSD to begin servicing requests, the old primary OSD might be required
to continue servicing requests until the PG data migration is complete. After
data migration has completed, the mapping uses the primary OSD of the new
Acting Set.
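To see the current ``up`` set and ``acting`` set of a PG while its data is migrating, map the PG (``{pgid}`` is a placeholder for a real PG ID):

.. prompt:: bash $

   ceph pg map {pgid}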
Stale
-----
While Ceph uses heartbeats to ensure that hosts and daemons are running, the
``ceph-osd`` daemons may also get into a ``stuck`` state where they are not
reporting statistics in a timely manner (e.g., a temporary network fault). By
default, OSD daemons report their placement group, up through, boot and failure
statistics every half second (i.e., ``0.5``), which is more frequent than the
heartbeat thresholds. If the **Primary OSD** of a placement group's acting set
fails to report to the monitor or if other OSDs have reported the primary OSD
``down``, the monitors will mark the placement group ``stale``.
Although Ceph uses heartbeats in order to ensure that hosts and daemons are
running, the ``ceph-osd`` daemons might enter a ``stuck`` state where they are
not reporting statistics in a timely manner (for example, there might be a
temporary network fault). By default, OSD daemons report their PG, ``up_thru``,
boot, and failure statistics every half second (that is, in accordance with a
value of ``0.5``), which is more frequent than the reports defined by the
heartbeat thresholds. If the primary OSD of a PG's Acting Set fails to report
to the monitor or if other OSDs have reported the primary OSD ``down``, the
monitors will mark the PG ``stale``.
When you start your cluster, it is common to see the ``stale`` state until
the peering process completes. After your cluster has been running for awhile,
seeing placement groups in the ``stale`` state indicates that the primary OSD
for those placement groups is ``down`` or not reporting placement group statistics
to the monitor.
When you start your cluster, it is common to see the ``stale`` state until the
peering process completes. After your cluster has been running for a while,
however, seeing PGs in the ``stale`` state indicates that the primary OSD for
those PGs is ``down`` or not reporting PG statistics to the monitor.
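To list the PGs that are currently ``stale``, run the following command:

.. prompt:: bash $

   ceph pg dump_stuck stale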
Identifying Troubled PGs
========================
As previously noted, a placement group is not necessarily problematic just
because its state is not ``active+clean``. Generally, Ceph's ability to self
repair may not be working when placement groups get stuck. The stuck states
include:
As previously noted, a PG is not necessarily having problems just because its
state is not ``active+clean``. When PGs are stuck, this might indicate that
Ceph cannot perform self-repairs. The stuck states include:
- **Unclean**: Placement groups contain objects that are not replicated the
desired number of times. They should be recovering.
- **Inactive**: Placement groups cannot process reads or writes because they
are waiting for an OSD with the most up-to-date data to come back ``up``.
- **Stale**: Placement groups are in an unknown state, because the OSDs that
host them have not reported to the monitor cluster in a while (configured
- **Unclean**: PGs contain objects that have not been replicated the desired
number of times. Under normal conditions, it can be assumed that these PGs
are recovering.
- **Inactive**: PGs cannot process reads or writes because they are waiting for
an OSD that has the most up-to-date data to come back ``up``.
- **Stale**: PGs are in an unknown state, because the OSDs that host them have
not reported to the monitor cluster for a certain period of time (determined
by ``mon_osd_report_timeout``).
To identify stuck placement groups, execute the following:
To identify stuck PGs, run the following command:
.. prompt:: bash $
ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded]
ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded]
See `Placement Group Subsystem`_ for additional details. To troubleshoot
stuck placement groups, see `Troubleshooting PG Errors`_.
For more detail, see `Placement Group Subsystem`_. To troubleshoot stuck PGs,
see `Troubleshooting PG Errors`_.
Finding an Object Location
@@ -491,55 +491,54 @@ To store object data in the Ceph Object Store, a Ceph client must:
#. Set an object name
#. Specify a `pool`_
The Ceph client retrieves the latest cluster map and the CRUSH algorithm
calculates how to map the object to a `placement group`_, and then calculates
how to assign the placement group to an OSD dynamically. To find the object
location, all you need is the object name and the pool name. For example:
The Ceph client retrieves the latest cluster map, the CRUSH algorithm
calculates how to map the object to a PG, and then the algorithm calculates how
to dynamically assign the PG to an OSD. To find the object location given only
the object name and the pool name, run a command of the following form:
.. prompt:: bash $
ceph osd map {poolname} {object-name} [namespace]
ceph osd map {poolname} {object-name} [namespace]
.. topic:: Exercise: Locate an Object
As an exercise, let's create an object. Specify an object name, a path
to a test file containing some object data and a pool name using the
As an exercise, let's create an object. We can specify an object name, a path
to a test file that contains some object data, and a pool name by using the
``rados put`` command on the command line. For example:
.. prompt:: bash $
rados put {object-name} {file-path} --pool=data
rados put test-object-1 testfile.txt --pool=data
rados put {object-name} {file-path} --pool=data
rados put test-object-1 testfile.txt --pool=data
To verify that the Ceph Object Store stored the object, execute the
following:
To verify that the Ceph Object Store stored the object, run the
following command:
.. prompt:: bash $
rados -p data ls
Now, identify the object location:
To identify the object location, run the following commands:
.. prompt:: bash $
ceph osd map {pool-name} {object-name}
ceph osd map data test-object-1
Ceph should output the object's location. For example::
osdmap e537 pool 'data' (1) object 'test-object-1' -> pg 1.d1743484 (1.4) -> up ([0,1], p0) acting ([0,1], p0)
To remove the test object, simply delete it using the ``rados rm``
command. For example:
Ceph should output the object's location. For example::
osdmap e537 pool 'data' (1) object 'test-object-1' -> pg 1.d1743484 (1.4) -> up ([0,1], p0) acting ([0,1], p0)
To remove the test object, simply delete it by running the ``rados rm``
command. For example:
.. prompt:: bash $
rados rm test-object-1 --pool=data
As the cluster evolves, the object location may change dynamically. One benefit
of Ceph's dynamic rebalancing is that Ceph relieves you from having to perform
the migration manually. See the `Architecture`_ section for details.
of Ceph's dynamic rebalancing is that Ceph spares you the burden of manually
performing the migration. For details, see the `Architecture`_ section.
.. _data placement: ../data-placement
.. _pool: ../pools