doc/rados/operations: Improve wording, capitalization, formatting
Signed-off-by: Anthony D'Atri <anthonyeleven@users.noreply.github.com>
parent f021748e45
commit 5f2060c082
@@ -2,28 +2,29 @@
 BlueStore Migration
 =====================
 
-Each OSD can run either BlueStore or FileStore, and a single Ceph
+Each OSD can run either BlueStore or Filestore, and a single Ceph
 cluster can contain a mix of both. Users who have previously deployed
-FileStore are likely to want to transition to BlueStore in order to
-take advantage of the improved performance and robustness. There are
+Filestore OSDs should transition to BlueStore in order to
+take advantage of the improved performance and robustness. Moreover,
+Ceph releases beginning with Reef do not support Filestore. There are
 several strategies for making such a transition.
 
-An individual OSD cannot be converted in place in isolation, however:
-BlueStore and FileStore are simply too different for that to be
-practical. "Conversion" will rely either on the cluster's normal
+An individual OSD cannot be converted in place;
+BlueStore and Filestore are simply too different for that to be
+feasible. The conversion process uses either the cluster's normal
 replication and healing support or tools and strategies that copy OSD
-content from an old (FileStore) device to a new (BlueStore) one.
+content from an old (Filestore) device to a new (BlueStore) one.
 
 
 Deploy new OSDs with BlueStore
 ==============================
 
-Any new OSDs (e.g., when the cluster is expanded) can be deployed
+New OSDs (e.g., when the cluster is expanded) should be deployed
 using BlueStore. This is the default behavior so no specific change
 is needed.
 
 Similarly, any OSDs that are reprovisioned after replacing a failed drive
-can use BlueStore.
+should use BlueStore.
 
 Convert existing OSDs
 =====================
@@ -31,29 +32,32 @@ Convert existing OSDs
 Mark out and replace
 --------------------
 
-The simplest approach is to mark out each device in turn, wait for the
+The simplest approach is to ensure that the cluster is healthy,
+then mark ``out`` each device in turn, wait for
 data to replicate across the cluster, reprovision the OSD, and mark
-it back in again. It is simple and easy to automate. However, it requires
-more data migration than should be necessary, so it is not optimal.
+it back ``in`` again. Proceed to the next OSD when recovery is complete.
+This is easy to automate but results in more data migration than
+is strictly necessary, which in turn presents additional wear to SSDs and takes
+longer to complete.
 
-#. Identify a FileStore OSD to replace::
+#. Identify a Filestore OSD to replace::
 
      ID=<osd-id-number>
     DEVICE=<disk-device>
 
-   You can tell whether a given OSD is FileStore or BlueStore with:
+   You can tell whether a given OSD is Filestore or BlueStore with:
 
    .. prompt:: bash $
 
      ceph osd metadata $ID | grep osd_objectstore
 
-   You can get a current count of filestore vs bluestore with:
+   You can get a current count of Filestore and BlueStore OSDs with:
 
    .. prompt:: bash $
 
     ceph osd count-metadata osd_objectstore
 
-#. Mark the filestore OSD out:
+#. Mark the Filestore OSD ``out``:
 
    .. prompt:: bash $
 
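For context, marking the OSD ``out`` and waiting until its data has been fully re-replicated can be scripted; a minimal sketch, assuming the ``$ID`` variable set in the step above, is::

   ceph osd out $ID
   # Block until every PG that lived on this OSD has a full set of replicas elsewhere.
   while ! ceph osd safe-to-destroy osd.$ID ; do sleep 60 ; done
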
@@ -71,7 +75,7 @@ more data migration than should be necessary, so it is not optimal.
 
       systemctl kill ceph-osd@$ID
 
-#. Make note of which device this OSD is using:
+#. Note which device this OSD is using:
 
    .. prompt:: bash $
 
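For context, one way to see which device backs a Filestore OSD is to look at its mount point under ``/var/lib/ceph/osd``; a sketch, again assuming ``$ID`` from above::

   mount | grep /var/lib/ceph/osd/ceph-$ID
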
@@ -98,9 +102,10 @@ more data migration than should be necessary, so it is not optimal.
 
       ceph osd destroy $ID --yes-i-really-mean-it
 
-#. Reprovision a BlueStore OSD in its place with the same OSD ID.
+#. Provision a BlueStore OSD in its place with the same OSD ID.
    This requires you do identify which device to wipe based on what you saw
-   mounted above. BE CAREFUL! :
+   mounted above. BE CAREFUL! Also note that hybrid OSDs may require
+   adjustments to these commands:
 
    .. prompt:: bash $
 
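For context, a minimal sketch of this reprovisioning step for a simple single-device OSD (hybrid OSDs with separate DB/WAL devices need additional ``ceph-volume`` arguments) might be::

   ceph-volume lvm zap $DEVICE
   # --osd-id reuses the ID freed by "ceph osd destroy" above
   ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID
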
@@ -108,12 +113,15 @@ more data migration than should be necessary, so it is not optimal.
 
 #. Repeat.
 
-You can allow the refilling of the replacement OSD to happen
+You can allow balancing of the replacement OSD to happen
 concurrently with the draining of the next OSD, or follow the same
 procedure for multiple OSDs in parallel, as long as you ensure the
 cluster is fully clean (all data has all replicas) before destroying
-any OSDs. Failure to do so will reduce the redundancy of your data
-and increase the risk of (or potentially even cause) data loss.
+any OSDs. If you reprovision multiple OSDs in parallel, be **very** careful to
+only zap / destroy OSDs within a single CRUSH failure domain, e.g. ``host`` or
+``rack``. Failure to do so will reduce the redundancy and availability of
+your data and increase the risk of (or even cause) data loss.
 
 
 Advantages:
 
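As a rough sketch of the gate described above (and assuming no unrelated health warnings are outstanding), one might simply wait for the cluster to report ``HEALTH_OK`` before touching the next OSD::

   until ceph health | grep -q HEALTH_OK ; do sleep 60 ; done
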
@@ -136,37 +144,36 @@ to evacuate an entire host in order to use it as a spare, then the
 conversion can be done on a host-by-host basis with each stored copy of
 the data migrating only once.
 
-First, you need have empty host that has no data. There are two ways to do this: either by starting with a new, empty host that isn't yet part of the cluster, or by offloading data from an existing host that in the cluster.
+First, you need an empty host that has no OSDs provisioned. There are two
+ways to do this: either by starting with a new, empty host that isn't yet
+part of the cluster, or by offloading data from an existing host in the cluster.
 
 Use a new, empty host
 ^^^^^^^^^^^^^^^^^^^^^
 
 Ideally the host should have roughly the
-same capacity as other hosts you will be converting (although it
-doesn't strictly matter). ::
-
-  NEWHOST=<empty-host-name>
-
+same capacity as other hosts you will be converting.
 Add the host to the CRUSH hierarchy, but do not attach it to the root:
 
 .. prompt:: bash $
 
+   NEWHOST=<empty-host-name>
    ceph osd crush add-bucket $NEWHOST host
 
-Make sure the ceph packages are installed.
+Make sure that Ceph packages are installed on the new host.
 
 Use an existing host
 ^^^^^^^^^^^^^^^^^^^^
 
 If you would like to use an existing host
 that is already part of the cluster, and there is sufficient free
-space on that host so that all of its data can be migrated off,
-then you can instead do::
-
-  OLDHOST=<existing-cluster-host-to-offload>
+space on that host so that all of its data can be migrated off to
+other cluster hosts, you can instead do::
 
 .. prompt:: bash $
 
+   OLDHOST=<existing-cluster-host-to-offload>
    ceph osd crush unlink $OLDHOST default
 
 where "default" is the immediate ancestor in the CRUSH map. (For
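For context, after unlinking the old host you can confirm its new position in the CRUSH hierarchy and watch data drain off it; for example::

   ceph osd tree   # $OLDHOST should no longer appear under the "default" root
   ceph -s         # watch recovery/backfill progress as the host empties
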
@@ -261,8 +268,8 @@ jump to step #5 below.
   .. prompt:: bash $
 
      ssh $OLDHOST
-   systemctl kill ceph-osd.target
-   umount /var/lib/ceph/osd/ceph-*
+     systemctl kill ceph-osd.target
+     umount /var/lib/ceph/osd/ceph-*
 
 #. Destroy and purge the old OSDs:
 
@@ -270,7 +277,7 @@ jump to step #5 below.
 
      for osd in `ceph osd ls-tree $OLDHOST`; do
          ceph osd purge $osd --yes-i-really-mean-it
-    done
+     done
 
 #. Wipe the old OSD devices. This requires you do identify which
    devices are to be wiped manually (BE CAREFUL!). For each device:
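For context, a common way to wipe each old device is ``ceph-volume``'s ``zap`` subcommand; a sketch, with ``$DEVICE`` standing in for each device to be wiped::

   ceph-volume lvm zap --destroy $DEVICE
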
@@ -281,7 +288,9 @@ jump to step #5 below.
 
 #. Use the now-empty host as the new host, and repeat::
 
-     NEWHOST=$OLDHOST
+   .. prompt:: bash $
+
+      NEWHOST=$OLDHOST
 
 Advantages:
 
@@ -294,7 +303,7 @@ Disadvantages:
 
 * A spare host is required.
 * An entire host's worth of OSDs will be migrating data at a time. This
-  is like likely to impact overall cluster performance.
+  is likely to impact overall cluster performance.
 * All migrated data still makes one full hop over the network.
 
 
@@ -304,13 +313,13 @@ Per-OSD device copy
 A single logical OSD can be converted by using the ``copy`` function
 of ``ceph-objectstore-tool``. This requires that the host have a free
 device (or devices) to provision a new, empty BlueStore OSD. For
-example, if each host in your cluster has 12 OSDs, then you'd need a
-13th available device so that each OSD can be converted in turn before the
+example, if each host in your cluster has twelve OSDs, then you'd need a
+thirteenth unused device so that each OSD can be converted in turn before the
 old device is reclaimed to convert the next OSD.
 
 Caveats:
 
-* This strategy requires that a blank BlueStore OSD be prepared
+* This strategy requires that an empty BlueStore OSD be prepared
   without allocating a new OSD ID, something that the ``ceph-volume``
   tool doesn't support. More importantly, the setup of *dmcrypt* is
   closely tied to the OSD identity, which means that this approach
@@ -173,13 +173,13 @@ data in an erasure coded pool:
 
    ceph osd pool set ec_pool allow_ec_overwrites true
 
-This can only be enabled on a pool residing on bluestore OSDs, since
-bluestore's checksumming is used to detect bitrot or other corruption
-during deep-scrub. In addition to being unsafe, using filestore with
-ec overwrites yields low performance compared to bluestore.
+This can be enabled only on a pool residing on BlueStore OSDs, since
+BlueStore's checksumming is used during deep scrubs to detect bitrot
+or other corruption. In addition to being unsafe, using Filestore with
+EC overwrites results in lower performance compared to BlueStore.
 
 Erasure coded pools do not support omap, so to use them with RBD and
-CephFS you must instruct them to store their data in an ec pool, and
+CephFS you must instruct them to store their data in an EC pool, and
 their metadata in a replicated pool. For RBD, this means using the
 erasure coded pool as the ``--data-pool`` during image creation:
 
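For example, an RBD image whose data lives in the EC pool shown above could be created as follows (a sketch; the ``rbd`` metadata pool and the image name here are placeholders)::

   rbd create --size 1G --data-pool ec_pool rbd/myimage
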
@@ -195,7 +195,7 @@ Erasure coded pool and cache tiering
 ------------------------------------
 
 Erasure coded pools require more resources than replicated pools and
-lack some functionalities such as omap. To overcome these
+lack some functionality such as omap. To overcome these
 limitations, one can set up a `cache tier <../cache-tiering>`_
 before the erasure coded pool.
 
@@ -212,22 +212,24 @@ mode so that every write and read to the *ecpool* are actually using
 the *hot-storage* and benefit from its flexibility and speed.
 
 More information can be found in the `cache tiering
-<../cache-tiering>`_ documentation.
+<../cache-tiering>`_ documentation. Note however that cache tiering
+is deprecated and may be removed completely in a future release.
 
 Erasure coded pool recovery
 ---------------------------
-If an erasure coded pool loses some shards, it must recover them from the others.
-This generally involves reading from the remaining shards, reconstructing the data, and
-writing it to the new peer.
-In Octopus, erasure coded pools can recover as long as there are at least *K* shards
+If an erasure coded pool loses some data shards, it must recover them from others.
+This involves reading from the remaining shards, reconstructing the data, and
+writing new shards.
+In Octopus and later releases, erasure-coded pools can recover as long as there are at least *K* shards
 available. (With fewer than *K* shards, you have actually lost data!)
 
-Prior to Octopus, erasure coded pools required at least *min_size* shards to be
-available, even if *min_size* is greater than *K*. (We generally recommend min_size
-be *K+2* or more to prevent loss of writes and data.)
-This conservative decision was made out of an abundance of caution when designing the new pool
-mode but also meant pools with lost OSDs but no data loss were unable to recover and go active
-without manual intervention to change the *min_size*.
+Prior to Octopus, erasure coded pools required at least ``min_size`` shards to be
+available, even if ``min_size`` is greater than ``K``. We recommend ``min_size``
+be ``K+2`` or more to prevent loss of writes and data.
+This conservative decision was made out of an abundance of caution when
+designing the new pool mode. As a result pools with lost OSDs but without
+complete loss of any data were unable to recover and go active
+without manual intervention to temporarily change the ``min_size`` setting.
 
 Glossary
 --------
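For context, the relevant values can be inspected and adjusted per pool; a sketch, using the *ecpool* name from above and a placeholder profile name and value::

   ceph osd pool get ecpool min_size
   ceph osd erasure-code-profile get <profile-name>   # shows k and m for the pool's profile
   ceph osd pool set ecpool min_size <k+2>
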
@@ -473,11 +473,11 @@ Hit sets can be configured on the cache pool with:
 OSD_NO_SORTBITWISE
 __________________
 
-No pre-luminous v12.y.z OSDs are running but the ``sortbitwise`` flag has not
+No pre-Luminous v12.y.z OSDs are running but the ``sortbitwise`` flag has not
 been set.
 
-The ``sortbitwise`` flag must be set before luminous v12.y.z or newer
-OSDs can start. You can safely set the flag with:
+The ``sortbitwise`` flag must be set before OSDs running Luminous v12.y.z or newer
+can start. You can safely set the flag with:
 
 .. prompt:: bash $
 
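For reference, the flag itself is set with::

   ceph osd set sortbitwise
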
@@ -486,11 +486,11 @@ OSDs can start. You can safely set the flag with:
 OSD_FILESTORE
 __________________
 
-Filestore has been deprecated, considering that Bluestore has been the default
-objectstore for quite some time. Warn if OSDs are running Filestore.
+The Filestore OSD back end has been deprecated; the BlueStore back end has been
+the default objectstore for quite some time. Warn if OSDs are running Filestore.
 
-The 'mclock_scheduler' is not supported for filestore OSDs. Therefore, the
-default 'osd_op_queue' is set to 'wpq' for filestore OSDs and is enforced
+The 'mclock_scheduler' is not supported for Filestore OSDs. Therefore, the
+default 'osd_op_queue' is set to 'wpq' for Filestore OSDs and is enforced
 even if the user attempts to change it.
 
 Filestore OSDs can be listed with:
@@ -499,13 +499,18 @@ Filestore OSDs can be listed with:
 
    ceph report | jq -c '."osd_metadata" | .[] | select(.osd_objectstore | contains("filestore")) | {id, osd_objectstore}'
 
-If it is not feasible to migrate Filestore OSDs to Bluestore immediately, you
-can silence this warning temporarily with:
+In order to upgrade to Reef or later releases, any Filestore OSDs must first be
+migrated to BlueStore.
+
+When upgrading a release prior to Reef to Reef or later: if it is not feasible to migrate Filestore OSDs to
+BlueStore immediately, you can silence this warning temporarily with:
 
 .. prompt:: bash $
 
    ceph health mute OSD_FILESTORE
 
+Since this migration can take considerable time to complete, we recommend that you
+begin the process well in advance of an update to Reef or later releases.
+
 POOL_FULL
 _________
 
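For context, a mute can also be given a time-to-live so that it expires on its own; a sketch (the four-week duration is only an example)::

   ceph health mute OSD_FILESTORE 4w
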
@@ -65,15 +65,17 @@ More Information on Placement Group Repair
 ==========================================
 Ceph stores and updates the checksums of objects stored in the cluster. When a scrub is performed on a placement group, the OSD attempts to choose an authoritative copy from among its replicas. Among all of the possible cases, only one case is consistent. After a deep scrub, Ceph calculates the checksum of an object read from the disk and compares it to the checksum previously recorded. If the current checksum and the previously recorded checksums do not match, that is an inconsistency. In the case of replicated pools, any mismatch between the checksum of any replica of an object and the checksum of the authoritative copy means that there is an inconsistency.
 
-The "pg repair" command attempts to fix inconsistencies of various kinds. If "pg repair" finds an inconsistent placement group, it attempts to overwrite the digest of the inconsistent copy with the digest of the authoritative copy. If "pg repair" finds an inconsistent replicated pool, it marks the inconsistent copy as missing. Recovery, in the case of replicated pools, is beyond the scope of "pg repair".
+The ``pg repair`` command attempts to fix inconsistencies of various kinds. If ``pg repair`` finds an inconsistent placement group, it attempts to overwrite the digest of the inconsistent copy with the digest of the authoritative copy. If ``pg repair`` finds an inconsistent replicated pool, it marks the inconsistent copy as missing. Recovery, in the case of replicated pools, is beyond the scope of ``pg repair``.
 
-For erasure coded and bluestore pools, Ceph will automatically repair if osd_scrub_auto_repair (configuration default "false") is set to true and at most osd_scrub_auto_repair_num_errors (configuration default 5) errors are found.
+For erasure coded and BlueStore pools, Ceph will automatically repair
+if ``osd_scrub_auto_repair`` (default ``false``) is set to ``true`` and
+at most ``osd_scrub_auto_repair_num_errors`` (default ``5``) errors are found.
 
-"pg repair" will not solve every problem. Ceph does not automatically repair placement groups when inconsistencies are found in them.
+``pg repair`` will not solve every problem. Ceph does not automatically repair placement groups when inconsistencies are found in them.
 
-The checksum of an object or an omap is not always available. Checksums are calculated incrementally. If a replicated object is updated non-sequentially, the write operation involved in the update changes the object and invalidates its checksum. The whole object is not read while recalculating the checksum. "ceph pg repair" is able to repair things even when checksums are not available to it, as in the case of filestore. When replicated filestore pools are in question, users might prefer manual repair to "ceph pg repair".
+The checksum of a RADOS object or an omap is not always available. Checksums are calculated incrementally. If a replicated object is updated non-sequentially, the write operation involved in the update changes the object and invalidates its checksum. The whole object is not read while recalculating the checksum. ``ceph pg repair`` is able to repair things even when checksums are not available to it, as in the case of Filestore. When replicated Filestore pools are in play, users might prefer manual repair over ``ceph pg repair``.
 
-The material in this paragraph is relevant for filestore, and bluestore has its own internal checksums. The matched-record checksum and the calculated checksum cannot prove that the authoritative copy is in fact authoritative. In the case that there is no checksum available, "pg repair" favors the data on the primary. this might or might not be the uncorrupted replica. This is why human intervention is necessary when an inconsistency is discovered. Human intervention sometimes means using the "ceph-objectstore-tool".
+The material in this paragraph is relevant for Filestore, and BlueStore has its own internal checksums. The matched-record checksum and the calculated checksum cannot prove that the authoritative copy is in fact authoritative. In the case that there is no checksum available, ``pg repair`` favors the data on the primary. This might or might not be the uncorrupted replica. This is why human intervention is necessary when an inconsistency is discovered. Human intervention sometimes means using the ``ceph-objectstore-tool``.
 
 External Links
 ==============
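For context, these options can be inspected and changed at runtime with ``ceph config``; a sketch::

   ceph config set osd osd_scrub_auto_repair true
   ceph config get osd osd_scrub_auto_repair_num_errors
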
@@ -14,21 +14,21 @@ clients understand the new *pg-upmap* structure in the OSDMap.
 Enabling
 --------
 
-New clusters will have this module on by default. The cluster must only
-have luminous (and newer) clients. You can turn the balancer off with:
+New clusters will by default enable the `balancer module`. The cluster must only
+have Luminous (and newer) clients. You can turn the balancer off with:
 
 .. prompt:: bash $
 
    ceph balancer off
 
 To allow use of the feature on existing clusters, you must tell the
-cluster that it only needs to support luminous (and newer) clients with:
+cluster that it only needs to support Luminous (and newer) clients with:
 
 .. prompt:: bash $
 
    ceph osd set-require-min-compat-client luminous
 
-This command will fail if any pre-luminous clients or daemons are
+This command will fail if any pre-Luminous clients or daemons are
 connected to the monitors. You can see what client versions are in
 use with:
 
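For context, the balancer's state and the cluster's current client-compatibility floor can be checked with, for example::

   ceph balancer status
   ceph osd dump | grep require_min_compat_client
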
@@ -353,7 +353,8 @@ by users who have access to the namespace.
 
 .. note:: Namespaces are primarily useful for applications written on top of
    ``librados`` where the logical grouping can alleviate the need to create
-   different pools. Ceph Object Gateway (from ``luminous``) uses namespaces for various
+   different pools. Ceph Object Gateway (in releases beginning with
+   Luminous) uses namespaces for various
    metadata objects.
 
 The rationale for namespaces is that pools can be a computationally expensive
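For context, namespaces can also be exercised directly from the command line; a sketch with placeholder pool, namespace, and object names::

   rados -p mypool --namespace myns put obj1 ./payload.bin
   rados -p mypool --namespace myns ls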