doc/rados/operations: Improve wording, capitalization, formatting

Signed-off-by: Anthony D'Atri <anthonyeleven@users.noreply.github.com>
This commit is contained in:
Anthony D'Atri 2023-03-08 07:29:55 -05:00
parent f021748e45
commit 5f2060c082
6 changed files with 95 additions and 76 deletions

View File

@ -2,28 +2,29 @@
BlueStore Migration
=====================
Each OSD can run either BlueStore or FileStore, and a single Ceph
Each OSD can run either BlueStore or Filestore, and a single Ceph
cluster can contain a mix of both. Users who have previously deployed
FileStore are likely to want to transition to BlueStore in order to
take advantage of the improved performance and robustness. There are
Filestore OSDs should transition to BlueStore in order to
take advantage of the improved performance and robustness. Moreover,
Ceph releases beginning with Reef do not support Filestore. There are
several strategies for making such a transition.
An individual OSD cannot be converted in place in isolation, however:
BlueStore and FileStore are simply too different for that to be
practical. "Conversion" will rely either on the cluster's normal
An individual OSD cannot be converted in place;
BlueStore and Filestore are simply too different for that to be
feasible. The conversion process uses either the cluster's normal
replication and healing support or tools and strategies that copy OSD
content from an old (FileStore) device to a new (BlueStore) one.
content from an old (Filestore) device to a new (BlueStore) one.
Deploy new OSDs with BlueStore
==============================
Any new OSDs (e.g., when the cluster is expanded) can be deployed
New OSDs (e.g., when the cluster is expanded) should be deployed
using BlueStore. This is the default behavior so no specific change
is needed.
Similarly, any OSDs that are reprovisioned after replacing a failed drive
can use BlueStore.
should use BlueStore.
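For example, when an OSD is provisioned manually with ``ceph-volume``, BlueStore is the default back end and can also be requested explicitly. The invocation below is only an illustrative sketch; the device path is a placeholder:

.. prompt:: bash $

   ceph-volume lvm create --bluestore --data /dev/sdX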
Convert existing OSDs
=====================
@ -31,29 +32,32 @@ Convert existing OSDs
Mark out and replace
--------------------
The simplest approach is to mark out each device in turn, wait for the
The simplest approach is to ensure that the cluster is healthy,
then mark ``out`` each device in turn, wait for
data to replicate across the cluster, reprovision the OSD, and mark
it back in again. It is simple and easy to automate. However, it requires
more data migration than should be necessary, so it is not optimal.
it back ``in`` again. Proceed to the next OSD when recovery is complete.
This is easy to automate but results in more data migration than
is strictly necessary, which in turn presents additional wear to SSDs and takes
longer to complete.
#. Identify a FileStore OSD to replace::
#. Identify a Filestore OSD to replace::
ID=<osd-id-number>
DEVICE=<disk-device>
You can tell whether a given OSD is FileStore or BlueStore with:
You can tell whether a given OSD is Filestore or BlueStore with:
.. prompt:: bash $
ceph osd metadata $ID | grep osd_objectstore
You can get a current count of filestore vs bluestore with:
You can get a current count of Filestore and BlueStore OSDs with:
.. prompt:: bash $
ceph osd count-metadata osd_objectstore
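If you want the specific OSD IDs rather than just a count, one possible approach (assuming ``jq`` is installed) is to filter the full metadata dump:

.. prompt:: bash $

   ceph osd metadata | jq -r '.[] | select(.osd_objectstore == "filestore") | .id'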
#. Mark the filestore OSD out:
#. Mark the Filestore OSD ``out``:
.. prompt:: bash $
@ -71,7 +75,7 @@ more data migration than should be necessary, so it is not optimal.
systemctl kill ceph-osd@$ID
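Optionally, before proceeding you can confirm that the daemon has stopped and that the cluster has marked it ``down``. This check is not part of the original procedure and may need adapting to your environment:

.. prompt:: bash $

   systemctl is-active ceph-osd@$ID   # expect "inactive" (or "failed") after the kill
   ceph osd tree down                 # the OSD should now be listed as down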
#. Make note of which device this OSD is using:
#. Note which device this OSD is using:
.. prompt:: bash $
@ -98,9 +102,10 @@ more data migration than should be necessary, so it is not optimal.
ceph osd destroy $ID --yes-i-really-mean-it
#. Reprovision a BlueStore OSD in its place with the same OSD ID.
#. Provision a BlueStore OSD in its place with the same OSD ID.
This requires you to identify which device to wipe based on what you saw
mounted above. BE CAREFUL! :
mounted above. BE CAREFUL! Also note that hybrid OSDs may require
adjustments to these commands:
.. prompt:: bash $
@ -108,12 +113,15 @@ more data migration than should be necessary, so it is not optimal.
#. Repeat.
You can allow the refilling of the replacement OSD to happen
You can allow balancing of the replacement OSD to happen
concurrently with the draining of the next OSD, or follow the same
procedure for multiple OSDs in parallel, as long as you ensure the
cluster is fully clean (all data has all replicas) before destroying
any OSDs. Failure to do so will reduce the redundancy of your data
and increase the risk of (or potentially even cause) data loss.
any OSDs. If you reprovision multiple OSDs in parallel, be **very** careful to
only zap / destroy OSDs within a single CRUSH failure domain, e.g. ``host`` or
``rack``. Failure to do so will reduce the redundancy and availability of
your data and increase the risk of (or even cause) data loss.
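One way to enforce the "fully clean before destroying" rule in automation is to gate each destroy step on ``ceph osd safe-to-destroy``, which succeeds only once removing the given OSD can no longer reduce data redundancy. A minimal sketch, with ``$ID`` holding the OSD being replaced:

.. prompt:: bash $

   # proceed only when osd.$ID can be removed without risking data
   while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done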
Advantages:
@ -136,37 +144,36 @@ to evacuate an entire host in order to use it as a spare, then the
conversion can be done on a host-by-host basis with each stored copy of
the data migrating only once.
First, you need have empty host that has no data. There are two ways to do this: either by starting with a new, empty host that isn't yet part of the cluster, or by offloading data from an existing host that in the cluster.
First, you need an empty host that has no OSDs provisioned. There are two
ways to do this: either by starting with a new, empty host that isn't yet
part of the cluster, or by offloading data from an existing host in the cluster.
Use a new, empty host
^^^^^^^^^^^^^^^^^^^^^
Ideally the host should have roughly the
same capacity as other hosts you will be converting (although it
doesn't strictly matter). ::
NEWHOST=<empty-host-name>
same capacity as other hosts you will be converting.
Add the host to the CRUSH hierarchy, but do not attach it to the root:
.. prompt:: bash $
NEWHOST=<empty-host-name>
ceph osd crush add-bucket $NEWHOST host
Make sure the ceph packages are installed.
Make sure that Ceph packages are installed on the new host.
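To confirm that the new bucket exists and is not yet attached beneath the ``default`` root, you can inspect the CRUSH hierarchy (an optional check, not part of the original steps):

.. prompt:: bash $

   ceph osd tree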
Use an existing host
^^^^^^^^^^^^^^^^^^^^
If you would like to use an existing host
that is already part of the cluster, and there is sufficient free
space on that host so that all of its data can be migrated off,
then you can instead do::
space on that host so that all of its data can be migrated off to
other cluster hosts, you can instead do:
OLDHOST=<existing-cluster-host-to-offload>
.. prompt:: bash $
OLDHOST=<existing-cluster-host-to-offload>
ceph osd crush unlink $OLDHOST default
where "default" is the immediate ancestor in the CRUSH map. (For
@ -261,8 +268,8 @@ jump to step #5 below.
.. prompt:: bash $
ssh $OLDHOST
systemctl kill ceph-osd.target
umount /var/lib/ceph/osd/ceph-*
systemctl kill ceph-osd.target
umount /var/lib/ceph/osd/ceph-*
#. Destroy and purge the old OSDs:
@ -270,7 +277,7 @@ jump to step #5 below.
for osd in `ceph osd ls-tree $OLDHOST`; do
ceph osd purge $osd --yes-i-really-mean-it
done
done
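As a quick sanity check (not part of the original procedure), the same ``ceph osd ls-tree`` query used in the loop should now return no OSD IDs for the old host:

.. prompt:: bash $

   ceph osd ls-tree $OLDHOST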
#. Wipe the old OSD devices. This requires you to identify which
devices are to be wiped manually (BE CAREFUL!). For each device:
@ -281,7 +288,9 @@ jump to step #5 below.
#. Use the now-empty host as the new host, and repeat:
NEWHOST=$OLDHOST
.. prompt:: bash $
NEWHOST=$OLDHOST
Advantages:
@ -294,7 +303,7 @@ Disadvantages:
* A spare host is required.
* An entire host's worth of OSDs will be migrating data at a time. This
is like likely to impact overall cluster performance.
is likely to impact overall cluster performance.
* All migrated data still makes one full hop over the network.
@ -304,13 +313,13 @@ Per-OSD device copy
A single logical OSD can be converted by using the ``copy`` function
of ``ceph-objectstore-tool``. This requires that the host have a free
device (or devices) to provision a new, empty BlueStore OSD. For
example, if each host in your cluster has 12 OSDs, then you'd need a
13th available device so that each OSD can be converted in turn before the
example, if each host in your cluster has twelve OSDs, then you'd need a
thirteenth unused device so that each OSD can be converted in turn before the
old device is reclaimed to convert the next OSD.
Caveats:
* This strategy requires that a blank BlueStore OSD be prepared
* This strategy requires that an empty BlueStore OSD be prepared
without allocating a new OSD ID, something that the ``ceph-volume``
tool doesn't support. More importantly, the setup of *dmcrypt* is
closely tied to the OSD identity, which means that this approach

View File

@ -173,13 +173,13 @@ data in an erasure coded pool:
ceph osd pool set ec_pool allow_ec_overwrites true
This can only be enabled on a pool residing on bluestore OSDs, since
bluestore's checksumming is used to detect bitrot or other corruption
during deep-scrub. In addition to being unsafe, using filestore with
ec overwrites yields low performance compared to bluestore.
This can be enabled only on a pool residing on BlueStore OSDs, since
BlueStore's checksumming is used during deep scrubs to detect bitrot
or other corruption. In addition to being unsafe, using Filestore with
EC overwrites results in lower performance compared to BlueStore.
Erasure coded pools do not support omap, so to use them with RBD and
CephFS you must instruct them to store their data in an ec pool, and
CephFS you must instruct them to store their data in an EC pool, and
their metadata in a replicated pool. For RBD, this means using the
erasure coded pool as the ``--data-pool`` during image creation:
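For illustration, such an image creation might look like the following; the pool and image names are placeholders:

.. prompt:: bash $

   rbd create --size 1G --data-pool ec_pool replicated_pool/image_name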
@ -195,7 +195,7 @@ Erasure coded pool and cache tiering
------------------------------------
Erasure coded pools require more resources than replicated pools and
lack some functionalities such as omap. To overcome these
lack some functionality such as omap. To overcome these
limitations, one can set up a `cache tier <../cache-tiering>`_
before the erasure coded pool.
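For orientation, a writeback cache tier in front of the erasure coded pool is typically assembled with commands along these lines (a sketch using the *hot-storage* and *ecpool* names from the surrounding text):

.. prompt:: bash $

   ceph osd tier add ecpool hot-storage            # attach the cache pool
   ceph osd tier cache-mode hot-storage writeback  # cache both reads and writes
   ceph osd tier set-overlay ecpool hot-storage    # route client I/O through the cache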
@ -212,22 +212,24 @@ mode so that every write and read to the *ecpool* are actually using
the *hot-storage* and benefit from its flexibility and speed.
More information can be found in the `cache tiering
<../cache-tiering>`_ documentation.
<../cache-tiering>`_ documentation. Note however that cache tiering
is deprecated and may be removed completely in a future release.
Erasure coded pool recovery
---------------------------
If an erasure coded pool loses some shards, it must recover them from the others.
This generally involves reading from the remaining shards, reconstructing the data, and
writing it to the new peer.
In Octopus, erasure coded pools can recover as long as there are at least *K* shards
If an erasure coded pool loses some data shards, it must recover them from others.
This involves reading from the remaining shards, reconstructing the data, and
writing new shards.
In Octopus and later releases, erasure-coded pools can recover as long as there are at least *K* shards
available. (With fewer than *K* shards, you have actually lost data!)
Prior to Octopus, erasure coded pools required at least *min_size* shards to be
available, even if *min_size* is greater than *K*. (We generally recommend min_size
be *K+2* or more to prevent loss of writes and data.)
This conservative decision was made out of an abundance of caution when designing the new pool
mode but also meant pools with lost OSDs but no data loss were unable to recover and go active
without manual intervention to change the *min_size*.
Prior to Octopus, erasure coded pools required at least ``min_size`` shards to be
available, even if ``min_size`` is greater than ``K``. We recommend ``min_size``
be ``K+2`` or more to prevent loss of writes and data.
This conservative decision was made out of an abundance of caution when
designing the new pool mode. As a result pools with lost OSDs but without
complete loss of any data were unable to recover and go active
without manual intervention to temporarily change the ``min_size`` setting.
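On those older releases, the manual intervention mentioned above typically amounted to temporarily lowering the pool's ``min_size`` and restoring it once recovery completed; a sketch, with the pool name and value as placeholders:

.. prompt:: bash $

   ceph osd pool get ecpool min_size
   ceph osd pool set ecpool min_size 4   # e.g. K for a k=4, m=2 profile; restore the original value afterwards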
Glossary
--------

View File

@ -473,11 +473,11 @@ Hit sets can be configured on the cache pool with:
OSD_NO_SORTBITWISE
__________________
No pre-luminous v12.y.z OSDs are running but the ``sortbitwise`` flag has not
No pre-Luminous v12.y.z OSDs are running but the ``sortbitwise`` flag has not
been set.
The ``sortbitwise`` flag must be set before luminous v12.y.z or newer
OSDs can start. You can safely set the flag with:
The ``sortbitwise`` flag must be set before OSDs running Luminous v12.y.z or newer
can start. You can safely set the flag with:
.. prompt:: bash $
@ -486,11 +486,11 @@ OSDs can start. You can safely set the flag with:
OSD_FILESTORE
__________________
Filestore has been deprecated, considering that Bluestore has been the default
objectstore for quite some time. Warn if OSDs are running Filestore.
The Filestore OSD back end has been deprecated; the BlueStore back end has been
the default objectstore for quite some time. Warn if OSDs are running Filestore.
The 'mclock_scheduler' is not supported for filestore OSDs. Therefore, the
default 'osd_op_queue' is set to 'wpq' for filestore OSDs and is enforced
The ``mclock_scheduler`` is not supported for Filestore OSDs. Therefore, the
default ``osd_op_queue`` is set to ``wpq`` for Filestore OSDs and is enforced
even if the user attempts to change it.
Filestore OSDs can be listed with:
@ -499,13 +499,18 @@ Filestore OSDs can be listed with:
ceph report | jq -c '."osd_metadata" | .[] | select(.osd_objectstore | contains("filestore")) | {id, osd_objectstore}'
If it is not feasible to migrate Filestore OSDs to Bluestore immediately, you
can silence this warning temporarily with:
In order to upgrade to Reef or later releases, any Filestore OSDs must first be
migrated to BlueStore.
When upgrading from a release prior to Reef to Reef or later: if it is not feasible to migrate Filestore OSDs to
BlueStore immediately, you can silence this warning temporarily with:
.. prompt:: bash $
ceph health mute OSD_FILESTORE
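If you prefer the mute to expire on its own rather than persist until explicitly cleared, a time-to-live can be supplied; the duration below is only an example:

.. prompt:: bash $

   ceph health mute OSD_FILESTORE 1w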
Since this migration can take considerable time to complete, we recommend that you
begin the process well in advance of an update to Reef or later releases.
POOL_FULL
_________

View File

@ -65,15 +65,17 @@ More Information on Placement Group Repair
==========================================
Ceph stores and updates the checksums of objects stored in the cluster. When a scrub is performed on a placement group, the OSD attempts to choose an authoritative copy from among its replicas. Among all of the possible cases, only one case is consistent. After a deep scrub, Ceph calculates the checksum of an object read from the disk and compares it to the checksum previously recorded. If the current checksum and the previously recorded checksums do not match, that is an inconsistency. In the case of replicated pools, any mismatch between the checksum of any replica of an object and the checksum of the authoritative copy means that there is an inconsistency.
The "pg repair" command attempts to fix inconsistencies of various kinds. If "pg repair" finds an inconsistent placement group, it attempts to overwrite the digest of the inconsistent copy with the digest of the authoritative copy. If "pg repair" finds an inconsistent replicated pool, it marks the inconsistent copy as missing. Recovery, in the case of replicated pools, is beyond the scope of "pg repair".
The ``pg repair`` command attempts to fix inconsistencies of various kinds. If ``pg repair`` finds an inconsistent placement group, it attempts to overwrite the digest of the inconsistent copy with the digest of the authoritative copy. If ``pg repair`` finds an inconsistent replicated pool, it marks the inconsistent copy as missing. Recovery, in the case of replicated pools, is beyond the scope of ``pg repair``.
For erasure coded and bluestore pools, Ceph will automatically repair if osd_scrub_auto_repair (configuration default "false") is set to true and at most osd_scrub_auto_repair_num_errors (configuration default 5) errors are found.
For erasure coded and BlueStore pools, Ceph will automatically repair
if ``osd_scrub_auto_repair`` (default ``false``) is set to ``true`` and
at most ``osd_scrub_auto_repair_num_errors`` (default ``5``) errors are found.
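For example, scrub auto-repair could be enabled cluster-wide as shown below; this is only a sketch, and the implications should be weighed before enabling it on a production cluster:

.. prompt:: bash $

   ceph config set osd osd_scrub_auto_repair true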
"pg repair" will not solve every problem. Ceph does not automatically repair placement groups when inconsistencies are found in them.
``pg repair`` will not solve every problem. Ceph does not automatically repair placement groups when inconsistencies are found in them.
The checksum of an object or an omap is not always available. Checksums are calculated incrementally. If a replicated object is updated non-sequentially, the write operation involved in the update changes the object and invalidates its checksum. The whole object is not read while recalculating the checksum. "ceph pg repair" is able to repair things even when checksums are not available to it, as in the case of filestore. When replicated filestore pools are in question, users might prefer manual repair to "ceph pg repair".
The checksum of a RADOS object or an omap is not always available. Checksums are calculated incrementally. If a replicated object is updated non-sequentially, the write operation involved in the update changes the object and invalidates its checksum. The whole object is not read while recalculating the checksum. ``ceph pg repair`` is able to repair things even when checksums are not available to it, as in the case of Filestore. When replicated Filestore pools are in play, users might prefer manual repair over ``ceph pg repair``.
The material in this paragraph is relevant for filestore, and bluestore has its own internal checksums. The matched-record checksum and the calculated checksum cannot prove that the authoritative copy is in fact authoritative. In the case that there is no checksum available, "pg repair" favors the data on the primary. this might or might not be the uncorrupted replica. This is why human intervention is necessary when an inconsistency is discovered. Human intervention sometimes means using the "ceph-objectstore-tool".
The material in this paragraph is relevant to Filestore; BlueStore has its own internal checksums. The matched-record checksum and the calculated checksum cannot prove that the authoritative copy is in fact authoritative. In the case that there is no checksum available, ``pg repair`` favors the data on the primary. This might or might not be the uncorrupted replica. This is why human intervention is necessary when an inconsistency is discovered. Human intervention sometimes means using the ``ceph-objectstore-tool``.
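For reference, a typical workflow for locating and then repairing an inconsistent placement group looks roughly like the following; the PG ID ``2.5`` is purely illustrative:

.. prompt:: bash $

   ceph health detail                                     # lists any PGs flagged inconsistent
   rados list-inconsistent-obj 2.5 --format=json-pretty   # inspect the inconsistent objects
   ceph pg repair 2.5                                     # attempt the repair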
External Links
==============

View File

@ -14,21 +14,21 @@ clients understand the new *pg-upmap* structure in the OSDMap.
Enabling
--------
New clusters will have this module on by default. The cluster must only
have luminous (and newer) clients. You can turn the balancer off with:
New clusters will by default enable the `balancer module`. The cluster must only
have Luminous (and newer) clients. You can turn the balancer off with:
.. prompt:: bash $
ceph balancer off
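The module's current state (whether it is active and which mode it is using) can be checked at any time with:

.. prompt:: bash $

   ceph balancer status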
To allow use of the feature on existing clusters, you must tell the
cluster that it only needs to support luminous (and newer) clients with:
cluster that it only needs to support Luminous (and newer) clients with:
.. prompt:: bash $
ceph osd set-require-min-compat-client luminous
This command will fail if any pre-luminous clients or daemons are
This command will fail if any pre-Luminous clients or daemons are
connected to the monitors. You can see what client versions are in
use with:
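One such command is ``ceph features``, which summarizes the release and feature bits of connected clients and daemons:

.. prompt:: bash $

   ceph features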

View File

@ -353,7 +353,8 @@ by users who have access to the namespace.
.. note:: Namespaces are primarily useful for applications written on top of
``librados`` where the logical grouping can alleviate the need to create
different pools. Ceph Object Gateway (from ``luminous``) uses namespaces for various
different pools. Ceph Object Gateway (in releases beginning with
Luminous) uses namespaces for various
metadata objects.
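As a quick command-line illustration of the concept (using the ``rados`` tool rather than ``librados`` directly; the pool, namespace, and object names are placeholders):

.. prompt:: bash $

   rados -p mypool --namespace ns1 put obj1 ./somefile   # write an object into namespace ns1
   rados -p mypool --namespace ns1 ls                    # list only the objects in ns1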
The rationale for namespaces is that pools can be a computationally expensive