diff --git a/doc/rados/operations/monitoring-osd-pg.rst b/doc/rados/operations/monitoring-osd-pg.rst
index 11a84ebea1d..3479f34b9e7 100644
--- a/doc/rados/operations/monitoring-osd-pg.rst
+++ b/doc/rados/operations/monitoring-osd-pg.rst
@@ -15,8 +15,8 @@ root of the problem.
 When you run into a fault, don't panic. Just follow the steps for monitoring
 your OSDs and placement groups, and then begin troubleshooting.

-Ceph is generally self-repairing. However, when problems persist and you want to find out
-what exactly is going wrong, it can be helpful to monitor OSDs and PGs.
+Ceph is self-repairing. However, when problems persist, monitoring OSDs and
+placement groups will help you identify the problem.


 Monitoring OSDs
@@ -284,12 +284,13 @@ The following subsections describe the most common PG states in detail.

 Creating
 --------

-When you create a pool, it will create the number of placement groups you
-specified. Ceph will echo ``creating`` when it is creating one or more
-placement groups. Once they are created, the OSDs that are part of a placement
-group's Acting Set will peer. Once peering is complete, the placement group
-status should be ``active+clean``, which means a Ceph client can begin writing
-to the placement group.
+PGs are created when you create a pool: the command that creates a pool
+specifies the total number of PGs for that pool, and when the pool is created
+all of those PGs are created as well. Ceph will echo ``creating`` while it is
+creating PGs. After the PG(s) are created, the OSDs that are part of a PG's
+Acting Set will peer. Once peering is complete, the PG status should be
+``active+clean``. This status means that Ceph clients can begin writing to the
+PG.

 .. ditaa::
@@ -300,43 +301,38 @@ to the placement group.
 Peering
 -------

-When Ceph is Peering a placement group, Ceph is bringing the OSDs that
-store the replicas of the placement group into **agreement about the state**
-of the objects and metadata in the placement group. When Ceph completes peering,
-this means that the OSDs that store the placement group agree about the current
-state of the placement group. However, completion of the peering process does
-**NOT** mean that each replica has the latest contents.
+When a PG peers, the OSDs that store the replicas of its data converge on an
+agreed state of the data and metadata within that PG. When peering is complete,
+those OSDs agree about the state of that PG. However, completion of the peering
+process does **NOT** mean that each replica has the latest contents.

 .. topic:: Authoritative History

-   Ceph will **NOT** acknowledge a write operation to a client, until
-   all OSDs of the acting set persist the write operation. This practice
-   ensures that at least one member of the acting set will have a record
-   of every acknowledged write operation since the last successful
-   peering operation.
+   Ceph will **NOT** acknowledge a write operation to a client until that write
+   operation is persisted by every OSD in the Acting Set. This practice ensures
+   that at least one member of the Acting Set will have a record of every
+   acknowledged write operation since the last successful peering operation.

-   With an accurate record of each acknowledged write operation, Ceph can
-   construct and disseminate a new authoritative history of the placement
-   group--a complete, and fully ordered set of operations that, if performed,
-   would bring an OSD’s copy of a placement group up to date.
+   Given an accurate record of each acknowledged write operation, Ceph can
+   construct a new authoritative history of the PG--that is, a complete and
+   fully ordered set of operations that, if performed, would bring an OSD’s
+   copy of the PG up to date.
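+
+If you want to see where a given PG is in the peering process, you can query
+it directly. The PG ID ``1.4`` below is only an illustrative placeholder;
+substitute a PG ID from your own cluster (for example, one reported by
+``ceph pg stat``):
+
+.. prompt:: bash $
+
+   ceph pg 1.4 query
+
+The output includes a ``recovery_state`` section that describes the peering
+and recovery history of the PG.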


 Active
 ------

-Once Ceph completes the peering process, a placement group may become
-``active``. The ``active`` state means that the data in the placement group is
-generally available in the primary placement group and the replicas for read
-and write operations.
+After Ceph has completed the peering process, a PG should become ``active``.
+The ``active`` state means that the data in the PG is generally available for
+read and write operations in the primary and replica OSDs.


 Clean
 -----

-When a placement group is in the ``clean`` state, the primary OSD and the
-replica OSDs have successfully peered and there are no stray replicas for the
-placement group. Ceph replicated all objects in the placement group the correct
-number of times.
+When a PG is in the ``clean`` state, all OSDs holding its data and metadata
+have successfully peered and there are no stray replicas. Ceph has replicated
+all objects in the PG the correct number of times.


 Degraded
@@ -344,143 +340,147 @@ Degraded

 When a client writes an object to the primary OSD, the primary OSD is
 responsible for writing the replicas to the replica OSDs. After the primary OSD
-writes the object to storage, the placement group will remain in a ``degraded``
+writes the object to storage, the PG will remain in a ``degraded``
 state until the primary OSD has received an acknowledgement from the replica
 OSDs that Ceph created the replica objects successfully.

-The reason a placement group can be ``active+degraded`` is that an OSD may be
-``active`` even though it doesn't hold all of the objects yet. If an OSD goes
-``down``, Ceph marks each placement group assigned to the OSD as ``degraded``.
-The OSDs must peer again when the OSD comes back online. However, a client can
-still write a new object to a ``degraded`` placement group if it is ``active``.
+The reason that a PG can be ``active+degraded`` is that an OSD can be
+``active`` even if it doesn't yet hold all of the PG's objects. If an OSD goes
+``down``, Ceph marks each PG assigned to the OSD as ``degraded``. The PGs must
+peer again when the OSD comes back online. However, a client can still write a
+new object to a ``degraded`` PG if it is ``active``.

-If an OSD is ``down`` and the ``degraded`` condition persists, Ceph may mark the
+If an OSD is ``down`` and the ``degraded`` condition persists, Ceph might mark the
 ``down`` OSD as ``out`` of the cluster and remap the data from the ``down`` OSD
 to another OSD. The time between being marked ``down`` and being marked ``out``
-is controlled by ``mon_osd_down_out_interval``, which is set to ``600`` seconds
+is determined by ``mon_osd_down_out_interval``, which is set to ``600`` seconds
 by default.

-A placement group can also be ``degraded``, because Ceph cannot find one or more
-objects that Ceph thinks should be in the placement group. While you cannot
-read or write to unfound objects, you can still access all of the other objects
-in the ``degraded`` placement group.
+A PG can also be in the ``degraded`` state because there are one or more
+objects that Ceph expects to find in the PG but that Ceph cannot find. Although
+you cannot read or write to unfound objects, you can still access all of the other
+objects in the ``degraded`` PG.
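+
+To see which PGs are currently ``degraded`` and which objects (if any) are
+unfound, you can run the following read-only commands (the output and the PG
+IDs will of course vary from cluster to cluster):
+
+.. prompt:: bash $
+
+   ceph health detail
+   ceph pg ls degraded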


 Recovering
 ----------

-Ceph was designed for fault-tolerance at a scale where hardware and software
-problems are ongoing. When an OSD goes ``down``, its contents may fall behind
-the current state of other replicas in the placement groups. When the OSD is
-back ``up``, the contents of the placement groups must be updated to reflect the
-current state. During that time period, the OSD may reflect a ``recovering``
-state.
+Ceph was designed for fault-tolerance, because hardware and other server
+problems are expected or even routine. When an OSD goes ``down``, its contents
+might fall behind the current state of other replicas in the PGs. When the OSD
+has returned to the ``up`` state, the contents of the PGs must be updated to
+reflect that current state. During that time period, the OSD might be in a
+``recovering`` state.

 Recovery is not always trivial, because a hardware failure might cause a
 cascading failure of multiple OSDs. For example, a network switch for a rack or
-cabinet may fail, which can cause the OSDs of a number of host machines to fall
-behind the current state of the cluster. Each one of the OSDs must recover once
-the fault is resolved.
+cabinet might fail, which can cause the OSDs of a number of host machines to
+fall behind the current state of the cluster. In such a scenario, general
+recovery is possible only if each of the OSDs recovers after the fault has been
+resolved.

-Ceph provides a number of settings to balance the resource contention between
-new service requests and the need to recover data objects and restore the
-placement groups to the current state. The ``osd_recovery_delay_start`` setting
-allows an OSD to restart, re-peer and even process some replay requests before
-starting the recovery process. The ``osd_recovery_thread_timeout`` sets a thread
-timeout, because multiple OSDs may fail, restart and re-peer at staggered rates.
-The ``osd_recovery_max_active`` setting limits the number of recovery requests
-an OSD will entertain simultaneously to prevent the OSD from failing to serve.
-The ``osd_recovery_max_chunk`` setting limits the size of the recovered data
-chunks to prevent network congestion.
+Ceph provides a number of settings that determine how the cluster balances the
+resource contention between the need to process new service requests and the
+need to recover data objects and restore the PGs to the current state. The
+``osd_recovery_delay_start`` setting allows an OSD to restart, re-peer, and
+even process some replay requests before starting the recovery process. The
+``osd_recovery_thread_timeout`` setting determines the duration of a thread
+timeout, because multiple OSDs might fail, restart, and re-peer at staggered
+rates. The ``osd_recovery_max_active`` setting limits the number of recovery
+requests an OSD can entertain simultaneously, in order to prevent the OSD from
+failing to serve. The ``osd_recovery_max_chunk`` setting limits the size of
+the recovered data chunks, in order to prevent network congestion.
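+
+These settings live in the usual Ceph configuration store, so they can be
+inspected and, if necessary, adjusted at runtime. For example (the value ``3``
+below is only an illustration, not a recommendation):
+
+.. prompt:: bash $
+
+   ceph config get osd osd_recovery_max_active
+   ceph config set osd osd_recovery_max_active 3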


 Back Filling
 ------------

-When a new OSD joins the cluster, CRUSH will reassign placement groups from OSDs
-in the cluster to the newly added OSD. Forcing the new OSD to accept the
-reassigned placement groups immediately can put excessive load on the new OSD.
-Back filling the OSD with the placement groups allows this process to begin in
-the background. Once backfilling is complete, the new OSD will begin serving
-requests when it is ready.
+When a new OSD joins the cluster, CRUSH will reassign PGs from OSDs that are
+already in the cluster to the newly added OSD. It can put excessive load on the
+new OSD to force it to immediately accept the reassigned PGs. Back filling the
+OSD with the PGs allows this process to begin in the background. After the
+backfill operations have completed, the new OSD will begin serving requests as
+soon as it is ready.

-During the backfill operations, you may see one of several states:
+During the backfill operations, you might see one of several states:
 ``backfill_wait`` indicates that a backfill operation is pending, but is not
-underway yet; ``backfilling`` indicates that a backfill operation is underway;
-and, ``backfill_toofull`` indicates that a backfill operation was requested,
-but couldn't be completed due to insufficient storage capacity. When a
-placement group cannot be backfilled, it may be considered ``incomplete``.
+yet underway; ``backfilling`` indicates that a backfill operation is currently
+underway; and ``backfill_toofull`` indicates that a backfill operation was
+requested but couldn't be completed due to insufficient storage capacity. When
+a PG cannot be backfilled, it might be considered ``incomplete``.

-The ``backfill_toofull`` state may be transient. It is possible that as PGs
-are moved around, space may become available. The ``backfill_toofull`` is
-similar to ``backfill_wait`` in that as soon as conditions change
-backfill can proceed.
+The ``backfill_toofull`` state might be transient. It might happen that, as PGs
+are moved around, space becomes available. The ``backfill_toofull`` state is
+similar to ``backfill_wait`` in that backfill operations can proceed as soon as
+conditions change.

-Ceph provides a number of settings to manage the load spike associated with
-reassigning placement groups to an OSD (especially a new OSD). By default,
-``osd_max_backfills`` sets the maximum number of concurrent backfills to and from
-an OSD to 1. The ``backfill_full_ratio`` enables an OSD to refuse a
-backfill request if the OSD is approaching its full ratio (90%, by default) and
-change with ``ceph osd set-backfillfull-ratio`` command.
-If an OSD refuses a backfill request, the ``osd_backfill_retry_interval``
-enables an OSD to retry the request (after 30 seconds, by default). OSDs can
-also set ``osd_backfill_scan_min`` and ``osd_backfill_scan_max`` to manage scan
-intervals (64 and 512, by default).
+Ceph provides a number of settings to manage the load spike associated with the
+reassignment of PGs to an OSD (especially a new OSD). The ``osd_max_backfills``
+setting specifies the maximum number of concurrent backfills to and from an OSD
+(default: 1). The ``backfill_full_ratio`` setting allows an OSD to refuse a
+backfill request if the OSD is approaching its full ratio (default: 90%). This
+setting can be changed with the ``ceph osd set-backfillfull-ratio`` command. If
+an OSD refuses a backfill request, the ``osd_backfill_retry_interval`` setting
+allows an OSD to retry the request after a certain interval (default: 30
+seconds). OSDs can also set ``osd_backfill_scan_min`` and
+``osd_backfill_scan_max`` in order to manage scan intervals (default: 64 and
+512, respectively).
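+
+For example, you could check the current backfill concurrency limit and raise
+the backfillfull threshold slightly (``0.91`` is only an example value; choose
+thresholds with care on a nearly full cluster):
+
+.. prompt:: bash $
+
+   ceph config get osd osd_max_backfills
+   ceph osd set-backfillfull-ratio 0.91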


 Remapped
 --------

-When the Acting Set that services a placement group changes, the data migrates
-from the old acting set to the new acting set. It may take some time for a new
-primary OSD to service requests. So it may ask the old primary to continue to
-service requests until the placement group migration is complete. Once data
-migration completes, the mapping uses the primary OSD of the new acting set.
+When the Acting Set that services a PG changes, the data migrates from the old
+Acting Set to the new Acting Set. Because it might take time for the new
+primary OSD to begin servicing requests, the old primary OSD might be required
+to continue servicing requests until the PG data migration is complete. After
+data migration has completed, the mapping uses the primary OSD of the new
+Acting Set.


 Stale
 -----

-While Ceph uses heartbeats to ensure that hosts and daemons are running, the
-``ceph-osd`` daemons may also get into a ``stuck`` state where they are not
-reporting statistics in a timely manner (e.g., a temporary network fault). By
-default, OSD daemons report their placement group, up through, boot and failure
-statistics every half second (i.e., ``0.5``), which is more frequent than the
-heartbeat thresholds. If the **Primary OSD** of a placement group's acting set
-fails to report to the monitor or if other OSDs have reported the primary OSD
-``down``, the monitors will mark the placement group ``stale``.
+Although Ceph uses heartbeats in order to ensure that hosts and daemons are
+running, the ``ceph-osd`` daemons might enter a ``stuck`` state where they are
+not reporting statistics in a timely manner (for example, there might be a
+temporary network fault). By default, OSD daemons report their PG, up through,
+boot, and failure statistics every half second (that is, in accordance with a
+value of ``0.5``), which is more frequent than the reports defined by the
+heartbeat thresholds. If the primary OSD of a PG's Acting Set fails to report
+to the monitor or if other OSDs have reported the primary OSD ``down``, the
+monitors will mark the PG ``stale``.

-When you start your cluster, it is common to see the ``stale`` state until
-the peering process completes. After your cluster has been running for awhile,
-seeing placement groups in the ``stale`` state indicates that the primary OSD
-for those placement groups is ``down`` or not reporting placement group statistics
-to the monitor.
+When you start your cluster, it is common to see the ``stale`` state until the
+peering process completes. After your cluster has been running for a while,
+however, seeing PGs in the ``stale`` state indicates that the primary OSD for
+those PGs is ``down`` or not reporting PG statistics to the monitor.
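+
+If ``stale`` PGs appear on a cluster that has been running for a while, it is
+usually worth checking how many OSDs are up and which OSDs are marked ``down``.
+For example, both of the following commands are read-only:
+
+.. prompt:: bash $
+
+   ceph osd stat
+   ceph osd tree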


 Identifying Troubled PGs
 ========================

-As previously noted, a placement group is not necessarily problematic just
-because its state is not ``active+clean``. Generally, Ceph's ability to self
-repair may not be working when placement groups get stuck. The stuck states
-include:
+As previously noted, a PG is not necessarily having problems just because its
+state is not ``active+clean``. When PGs are stuck, this might indicate that
+Ceph cannot perform self-repairs. The stuck states include:

-- **Unclean**: Placement groups contain objects that are not replicated the
-  desired number of times. They should be recovering.
-- **Inactive**: Placement groups cannot process reads or writes because they
-  are waiting for an OSD with the most up-to-date data to come back ``up``.
-- **Stale**: Placement groups are in an unknown state, because the OSDs that
-  host them have not reported to the monitor cluster in a while (configured
+- **Unclean**: PGs contain objects that have not been replicated the desired
+  number of times. Under normal conditions, it can be assumed that these PGs
+  are recovering.
+- **Inactive**: PGs cannot process reads or writes because they are waiting for
+  an OSD that has the most up-to-date data to come back ``up``.
+- **Stale**: PGs are in an unknown state, because the OSDs that host them have
+  not reported to the monitor cluster for a certain period of time (determined
   by ``mon_osd_report_timeout``).

-To identify stuck placement groups, execute the following:
+To identify stuck PGs, run the following command:

 .. prompt:: bash $

-   ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded]
+   ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded]

-See `Placement Group Subsystem`_ for additional details. To troubleshoot
-stuck placement groups, see `Troubleshooting PG Errors`_.
+For more detail, see `Placement Group Subsystem`_. To troubleshoot stuck PGs,
+see `Troubleshooting PG Errors`_.


 Finding an Object Location
 ==========================
@@ -491,55 +491,54 @@ To store object data in the Ceph Object Store, a Ceph client must:
 #. Set an object name
 #. Specify a `pool`_

-The Ceph client retrieves the latest cluster map and the CRUSH algorithm
-calculates how to map the object to a `placement group`_, and then calculates
-how to assign the placement group to an OSD dynamically. To find the object
-location, all you need is the object name and the pool name. For example:
+The Ceph client retrieves the latest cluster map, the CRUSH algorithm
+calculates how to map the object to a PG, and then the algorithm calculates how
+to dynamically assign the PG to an OSD. To find the object location given only
+the object name and the pool name, run a command of the following form:

 .. prompt:: bash $

-   ceph osd map {poolname} {object-name} [namespace]
+   ceph osd map {poolname} {object-name} [namespace]

 .. topic:: Exercise: Locate an Object

-   As an exercise, let's create an object. Specify an object name, a path
-   to a test file containing some object data and a pool name using the
+   As an exercise, let's create an object. We can specify an object name, a path
+   to a test file that contains some object data, and a pool name by using the
    ``rados put`` command on the command line. For example:

    .. prompt:: bash $

-      rados put {object-name} {file-path} --pool=data
-      rados put test-object-1 testfile.txt --pool=data
+      rados put {object-name} {file-path} --pool=data
+      rados put test-object-1 testfile.txt --pool=data

-   To verify that the Ceph Object Store stored the object, execute the
-   following:
+   To verify that the Ceph Object Store stored the object, run the
+   following command:

    .. prompt:: bash $

      rados -p data ls

-   Now, identify the object location:
+   To identify the object location, run the following commands:

    .. prompt:: bash $

      ceph osd map {pool-name} {object-name}
      ceph osd map data test-object-1
-
-   Ceph should output the object's location. For example::
-
-      osdmap e537 pool 'data' (1) object 'test-object-1' -> pg 1.d1743484 (1.4) -> up ([0,1], p0) acting ([0,1], p0)
-
-   To remove the test object, simply delete it using the ``rados rm``
-   command. For example:
+
+   Ceph should output the object's location. For example::
+
+      osdmap e537 pool 'data' (1) object 'test-object-1' -> pg 1.d1743484 (1.4) -> up ([0,1], p0) acting ([0,1], p0)
+
+   To remove the test object, simply delete it by running the ``rados rm``
+   command. For example:

    .. prompt:: bash $
+
      rados rm test-object-1 --pool=data
-

 As the cluster evolves, the object location may change dynamically. One benefit
-of Ceph's dynamic rebalancing is that Ceph relieves you from having to perform
-the migration manually. See the `Architecture`_ section for details.
+of Ceph's dynamic rebalancing is that Ceph spares you the burden of manually
+performing the migration. For details, see the `Architecture`_ section.

 .. _data placement: ../data-placement
 .. _pool: ../pools