diff --git a/doc/rados/operations/bluestore-migration.rst b/doc/rados/operations/bluestore-migration.rst index 6404341897e..24026ac9325 100644 --- a/doc/rados/operations/bluestore-migration.rst +++ b/doc/rados/operations/bluestore-migration.rst @@ -2,28 +2,29 @@ BlueStore Migration ===================== -Each OSD can run either BlueStore or FileStore, and a single Ceph +Each OSD can run either BlueStore or Filestore, and a single Ceph cluster can contain a mix of both. Users who have previously deployed -FileStore are likely to want to transition to BlueStore in order to -take advantage of the improved performance and robustness. There are +Filestore OSDs should transition to BlueStore in order to +take advantage of the improved performance and robustness. Moreover, +Ceph releases beginning with Reef do not support Filestore. There are several strategies for making such a transition. -An individual OSD cannot be converted in place in isolation, however: -BlueStore and FileStore are simply too different for that to be -practical. "Conversion" will rely either on the cluster's normal +An individual OSD cannot be converted in place; +BlueStore and Filestore are simply too different for that to be +feasible. The conversion process uses either the cluster's normal replication and healing support or tools and strategies that copy OSD -content from an old (FileStore) device to a new (BlueStore) one. +content from an old (Filestore) device to a new (BlueStore) one. Deploy new OSDs with BlueStore ============================== -Any new OSDs (e.g., when the cluster is expanded) can be deployed +New OSDs (e.g., when the cluster is expanded) should be deployed using BlueStore. This is the default behavior so no specific change is needed. Similarly, any OSDs that are reprovisioned after replacing a failed drive -can use BlueStore. +should use BlueStore. Convert existing OSDs ===================== @@ -31,29 +32,32 @@ Convert existing OSDs Mark out and replace -------------------- -The simplest approach is to mark out each device in turn, wait for the +The simplest approach is to ensure that the cluster is healthy, +then mark ``out`` each device in turn, wait for data to replicate across the cluster, reprovision the OSD, and mark -it back in again. It is simple and easy to automate. However, it requires -more data migration than should be necessary, so it is not optimal. +it back ``in`` again. Proceed to the next OSD when recovery is complete. +This is easy to automate but results in more data migration than +is strictly necessary, which in turn presents additional wear to SSDs and takes +longer to complete. -#. Identify a FileStore OSD to replace:: +#. Identify a Filestore OSD to replace:: ID= DEVICE= - You can tell whether a given OSD is FileStore or BlueStore with: + You can tell whether a given OSD is Filestore or BlueStore with: .. prompt:: bash $ ceph osd metadata $ID | grep osd_objectstore - You can get a current count of filestore vs bluestore with: + You can get a current count of Filestore and BlueStore OSDs with: .. prompt:: bash $ ceph osd count-metadata osd_objectstore -#. Mark the filestore OSD out: +#. Mark the Filestore OSD ``out``: .. prompt:: bash $ @@ -71,7 +75,7 @@ more data migration than should be necessary, so it is not optimal. systemctl kill ceph-osd@$ID -#. Make note of which device this OSD is using: +#. Note which device this OSD is using: .. prompt:: bash $ @@ -98,9 +102,10 @@ more data migration than should be necessary, so it is not optimal. 
ceph osd destroy $ID --yes-i-really-mean-it -#. Reprovision a BlueStore OSD in its place with the same OSD ID. +#. Provision a BlueStore OSD in its place with the same OSD ID. This requires you do identify which device to wipe based on what you saw - mounted above. BE CAREFUL! : + mounted above. BE CAREFUL! Also note that hybrid OSDs may require + adjustments to these commands: .. prompt:: bash $ @@ -108,12 +113,15 @@ more data migration than should be necessary, so it is not optimal. #. Repeat. -You can allow the refilling of the replacement OSD to happen +You can allow balancing of the replacement OSD to happen concurrently with the draining of the next OSD, or follow the same procedure for multiple OSDs in parallel, as long as you ensure the cluster is fully clean (all data has all replicas) before destroying -any OSDs. Failure to do so will reduce the redundancy of your data -and increase the risk of (or potentially even cause) data loss. +any OSDs. If you reprovision multiple OSDs in parallel, be **very** careful to +only zap / destroy OSDs within a single CRUSH failure domain, e.g. ``host`` or +``rack``. Failure to do so will reduce the redundancy and availability of +your data and increase the risk of (or even cause) data loss. + Advantages: @@ -136,37 +144,36 @@ to evacuate an entire host in order to use it as a spare, then the conversion can be done on a host-by-host basis with each stored copy of the data migrating only once. -First, you need have empty host that has no data. There are two ways to do this: either by starting with a new, empty host that isn't yet part of the cluster, or by offloading data from an existing host that in the cluster. +First, you need an empty host that has no OSDs provisioned. There are two +ways to do this: either by starting with a new, empty host that isn't yet +part of the cluster, or by offloading data from an existing host in the cluster. Use a new, empty host ^^^^^^^^^^^^^^^^^^^^^ Ideally the host should have roughly the -same capacity as other hosts you will be converting (although it -doesn't strictly matter). :: - - NEWHOST= - +same capacity as other hosts you will be converting. Add the host to the CRUSH hierarchy, but do not attach it to the root: .. prompt:: bash $ + NEWHOST= ceph osd crush add-bucket $NEWHOST host -Make sure the ceph packages are installed. +Make sure that Ceph packages are installed on the new host. Use an existing host ^^^^^^^^^^^^^^^^^^^^ If you would like to use an existing host that is already part of the cluster, and there is sufficient free -space on that host so that all of its data can be migrated off, -then you can instead do:: +space on that host so that all of its data can be migrated off to +other cluster hosts, you can instead do:: - OLDHOST= .. prompt:: bash $ + OLDHOST= ceph osd crush unlink $OLDHOST default where "default" is the immediate ancestor in the CRUSH map. (For @@ -261,8 +268,8 @@ jump to step #5 below. .. prompt:: bash $ ssh $OLDHOST - systemctl kill ceph-osd.target - umount /var/lib/ceph/osd/ceph-* + systemctl kill ceph-osd.target + umount /var/lib/ceph/osd/ceph-* #. Destroy and purge the old OSDs: @@ -270,7 +277,7 @@ jump to step #5 below. for osd in `ceph osd ls-tree $OLDHOST`; do ceph osd purge $osd --yes-i-really-mean-it - done + done #. Wipe the old OSD devices. This requires you do identify which devices are to be wiped manually (BE CAREFUL!). For each device: @@ -281,7 +288,9 @@ jump to step #5 below. #. Use the now-empty host as the new host, and repeat:: - NEWHOST=$OLDHOST + .. 
prompt:: bash $ + + NEWHOST=$OLDHOST Advantages: @@ -294,7 +303,7 @@ Disadvantages: * A spare host is required. * An entire host's worth of OSDs will be migrating data at a time. This - is like likely to impact overall cluster performance. + is likely to impact overall cluster performance. * All migrated data still makes one full hop over the network. @@ -304,13 +313,13 @@ Per-OSD device copy A single logical OSD can be converted by using the ``copy`` function of ``ceph-objectstore-tool``. This requires that the host have a free device (or devices) to provision a new, empty BlueStore OSD. For -example, if each host in your cluster has 12 OSDs, then you'd need a -13th available device so that each OSD can be converted in turn before the +example, if each host in your cluster has twelve OSDs, then you'd need a +thirteenth unused device so that each OSD can be converted in turn before the old device is reclaimed to convert the next OSD. Caveats: -* This strategy requires that a blank BlueStore OSD be prepared +* This strategy requires that an empty BlueStore OSD be prepared without allocating a new OSD ID, something that the ``ceph-volume`` tool doesn't support. More importantly, the setup of *dmcrypt* is closely tied to the OSD identity, which means that this approach diff --git a/doc/rados/operations/erasure-code.rst b/doc/rados/operations/erasure-code.rst index 1dea23c3516..e4e404574f6 100644 --- a/doc/rados/operations/erasure-code.rst +++ b/doc/rados/operations/erasure-code.rst @@ -173,13 +173,13 @@ data in an erasure coded pool: ceph osd pool set ec_pool allow_ec_overwrites true -This can only be enabled on a pool residing on bluestore OSDs, since -bluestore's checksumming is used to detect bitrot or other corruption -during deep-scrub. In addition to being unsafe, using filestore with -ec overwrites yields low performance compared to bluestore. +This can be enabled only on a pool residing on BlueStore OSDs, since +BlueStore's checksumming is used during deep scrubs to detect bitrot +or other corruption. In addition to being unsafe, using Filestore with +EC overwrites results in lower performance compared to BlueStore. Erasure coded pools do not support omap, so to use them with RBD and -CephFS you must instruct them to store their data in an ec pool, and +CephFS you must instruct them to store their data in an EC pool, and their metadata in a replicated pool. For RBD, this means using the erasure coded pool as the ``--data-pool`` during image creation: @@ -195,7 +195,7 @@ Erasure coded pool and cache tiering ------------------------------------ Erasure coded pools require more resources than replicated pools and -lack some functionalities such as omap. To overcome these +lack some functionality such as omap. To overcome these limitations, one can set up a `cache tier <../cache-tiering>`_ before the erasure coded pool. @@ -212,22 +212,24 @@ mode so that every write and read to the *ecpool* are actually using the *hot-storage* and benefit from its flexibility and speed. More information can be found in the `cache tiering -<../cache-tiering>`_ documentation. +<../cache-tiering>`_ documentation. Note however that cache tiering +is deprecated and may be removed completely in a future release. Erasure coded pool recovery --------------------------- -If an erasure coded pool loses some shards, it must recover them from the others. -This generally involves reading from the remaining shards, reconstructing the data, and -writing it to the new peer. 
-In Octopus, erasure coded pools can recover as long as there are at least *K* shards
+If an erasure coded pool loses some data shards, it must recover them from the others.
+This involves reading from the remaining shards, reconstructing the data, and
+writing new shards.
+In Octopus and later releases, erasure coded pools can recover as long as there are at least *K* shards
 available. (With fewer than *K* shards, you have actually lost data!)
-Prior to Octopus, erasure coded pools required at least *min_size* shards to be
-available, even if *min_size* is greater than *K*. (We generally recommend min_size
-be *K+2* or more to prevent loss of writes and data.)
-This conservative decision was made out of an abundance of caution when designing the new pool
-mode but also meant pools with lost OSDs but no data loss were unable to recover and go active
-without manual intervention to change the *min_size*.
+Prior to Octopus, erasure coded pools required at least ``min_size`` shards to be
+available, even if ``min_size`` is greater than ``K``. We recommend that ``min_size``
+be ``K+2`` or more to prevent loss of writes and data.
+This conservative decision was made out of an abundance of caution when
+designing the new pool mode. As a result, pools with lost OSDs but without
+complete loss of any data were unable to recover and go active
+without manual intervention to temporarily change the ``min_size`` setting.

 Glossary
 --------
diff --git a/doc/rados/operations/health-checks.rst b/doc/rados/operations/health-checks.rst
index 14cd1e9d496..6da03666a0c 100644
--- a/doc/rados/operations/health-checks.rst
+++ b/doc/rados/operations/health-checks.rst
@@ -473,11 +473,11 @@ Hit sets can be configured on the cache pool with:
 OSD_NO_SORTBITWISE
 __________________

-No pre-luminous v12.y.z OSDs are running but the ``sortbitwise`` flag has not
+No pre-Luminous v12.y.z OSDs are running but the ``sortbitwise`` flag has not
 been set.

-The ``sortbitwise`` flag must be set before luminous v12.y.z or newer
-OSDs can start. You can safely set the flag with:
+The ``sortbitwise`` flag must be set before OSDs running Luminous v12.y.z or newer
+can start. You can safely set the flag with:

 .. prompt:: bash $

@@ -486,11 +486,11 @@ OSDs can start. You can safely set the flag with:
 OSD_FILESTORE
 __________________

-Filestore has been deprecated, considering that Bluestore has been the default
-objectstore for quite some time. Warn if OSDs are running Filestore.
+The Filestore OSD back end has been deprecated; the BlueStore back end has been
+the default objectstore for quite some time. This warning is raised if any OSDs are running Filestore.

-The 'mclock_scheduler' is not supported for filestore OSDs. Therefore, the
-default 'osd_op_queue' is set to 'wpq' for filestore OSDs and is enforced
+The ``mclock_scheduler`` is not supported for Filestore OSDs. Therefore, the
+default ``osd_op_queue`` is set to ``wpq`` for Filestore OSDs and is enforced
 even if the user attempts to change it.

 Filestore OSDs can be listed with:

@@ -499,13 +499,18 @@ Filestore OSDs can be listed with:

 .. prompt:: bash $

    ceph report | jq -c '."osd_metadata" | .[] | select(.osd_objectstore | contains("filestore")) | {id, osd_objectstore}'

-If it is not feasible to migrate Filestore OSDs to Bluestore immediately, you
-can silence this warning temporarily with:
+In order to upgrade to Reef or later releases, any Filestore OSDs must first be
+migrated to BlueStore.
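+
+To gauge how much migration work remains, you can use the same
+``ceph osd count-metadata`` query shown in the BlueStore migration guide; this
+is a quick check based only on the ``osd_objectstore`` metadata key already
+used above:
+
+.. prompt:: bash $
+
+   ceph osd count-metadata osd_objectstore
+
+Once no OSDs report ``filestore``, the warning should clear on its own.
+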
+When upgrading from a release prior to Reef to Reef or a later release, if it
+is not feasible to migrate Filestore OSDs to BlueStore immediately, you can
+silence this warning temporarily with:

 .. prompt:: bash $

    ceph health mute OSD_FILESTORE

+Since this migration can take considerable time to complete, we recommend that you
+begin the process well in advance of an upgrade to Reef or later releases.
+
 POOL_FULL
 _________
diff --git a/doc/rados/operations/pg-repair.rst b/doc/rados/operations/pg-repair.rst
index f495530cc88..e318c1d503a 100644
--- a/doc/rados/operations/pg-repair.rst
+++ b/doc/rados/operations/pg-repair.rst
@@ -65,15 +65,17 @@ More Information on Placement Group Repair
 ==========================================
 Ceph stores and updates the checksums of objects stored in the cluster. When a scrub is performed on a placement group, the OSD attempts to choose an authoritative copy from among its replicas. Among all of the possible cases, only one case is consistent. After a deep scrub, Ceph calculates the checksum of an object read from the disk and compares it to the checksum previously recorded. If the current checksum and the previously recorded checksums do not match, that is an inconsistency. In the case of replicated pools, any mismatch between the checksum of any replica of an object and the checksum of the authoritative copy means that there is an inconsistency.

-The "pg repair" command attempts to fix inconsistencies of various kinds. If "pg repair" finds an inconsistent placement group, it attempts to overwrite the digest of the inconsistent copy with the digest of the authoritative copy. If "pg repair" finds an inconsistent replicated pool, it marks the inconsistent copy as missing. Recovery, in the case of replicated pools, is beyond the scope of "pg repair".
+The ``pg repair`` command attempts to fix inconsistencies of various kinds. If ``pg repair`` finds an inconsistent placement group, it attempts to overwrite the digest of the inconsistent copy with the digest of the authoritative copy. If ``pg repair`` finds an inconsistent replicated pool, it marks the inconsistent copy as missing. Recovery, in the case of replicated pools, is beyond the scope of ``pg repair``.

-For erasure coded and bluestore pools, Ceph will automatically repair if osd_scrub_auto_repair (configuration default "false") is set to true and at most osd_scrub_auto_repair_num_errors (configuration default 5) errors are found.
+For erasure coded and BlueStore pools, Ceph will automatically repair
+if ``osd_scrub_auto_repair`` (default ``false``) is set to ``true`` and
+at most ``osd_scrub_auto_repair_num_errors`` (default ``5``) errors are found.

-"pg repair" will not solve every problem. Ceph does not automatically repair placement groups when inconsistencies are found in them.
+``pg repair`` will not solve every problem. Ceph does not automatically repair placement groups when inconsistencies are found in them.

-The checksum of an object or an omap is not always available. Checksums are calculated incrementally. If a replicated object is updated non-sequentially, the write operation involved in the update changes the object and invalidates its checksum. The whole object is not read while recalculating the checksum. "ceph pg repair" is able to repair things even when checksums are not available to it, as in the case of filestore. When replicated filestore pools are in question, users might prefer manual repair to "ceph pg repair".
+The checksum of a RADOS object or an omap is not always available.
+Checksums are calculated incrementally. If a replicated object is updated non-sequentially, the write operation involved in the update changes the object and invalidates its checksum. The whole object is not read while recalculating the checksum. ``ceph pg repair`` is able to make repairs even when checksums are not available to it, as in the case of Filestore. When replicated Filestore pools are involved, users might prefer manual repair over ``ceph pg repair``.

-The material in this paragraph is relevant for filestore, and bluestore has its own internal checksums. The matched-record checksum and the calculated checksum cannot prove that the authoritative copy is in fact authoritative. In the case that there is no checksum available, "pg repair" favors the data on the primary. this might or might not be the uncorrupted replica. This is why human intervention is necessary when an inconsistency is discovered. Human intervention sometimes means using the "ceph-objectstore-tool".
+The material in this paragraph is relevant to Filestore; BlueStore has its own internal checksums. The matched-record checksum and the calculated checksum cannot prove that the authoritative copy is in fact authoritative. In the case that there is no checksum available, ``pg repair`` favors the data on the primary. This might or might not be the uncorrupted replica. This is why human intervention is necessary when an inconsistency is discovered. Human intervention sometimes means using the ``ceph-objectstore-tool``.

 External Links
 ==============
diff --git a/doc/rados/operations/upmap.rst b/doc/rados/operations/upmap.rst
index 243a9d9aae0..366356d16e7 100644
--- a/doc/rados/operations/upmap.rst
+++ b/doc/rados/operations/upmap.rst
@@ -14,21 +14,21 @@ clients understand the new *pg-upmap* structure in the OSDMap.
 Enabling
 --------

-New clusters will have this module on by default. The cluster must only
-have luminous (and newer) clients. You can turn the balancer off with:
+New clusters enable the ``balancer`` module by default. The cluster must
+have only Luminous (and newer) clients. You can turn the balancer off with:

 .. prompt:: bash $

    ceph balancer off

 To allow use of the feature on existing clusters, you must tell the
-cluster that it only needs to support luminous (and newer) clients with:
+cluster that it only needs to support Luminous (and newer) clients with:

 .. prompt:: bash $

    ceph osd set-require-min-compat-client luminous

-This command will fail if any pre-luminous clients or daemons are
+This command will fail if any pre-Luminous clients or daemons are
 connected to the monitors. You can see what client versions are in
 use with:
diff --git a/doc/rados/operations/user-management.rst b/doc/rados/operations/user-management.rst
index bf76afd746d..b1eed3c1e51 100644
--- a/doc/rados/operations/user-management.rst
+++ b/doc/rados/operations/user-management.rst
@@ -353,7 +353,8 @@ by users who have access to the namespace.

 .. note:: Namespaces are primarily useful for applications written on top of
    ``librados`` where the logical grouping can alleviate the need to create
-   different pools. Ceph Object Gateway (from ``luminous``) uses namespaces for various
+   different pools. Ceph Object Gateway (in releases beginning with
+   Luminous) uses namespaces for various
    metadata objects.

 The rationale for namespaces is that pools can be a computationally expensive