Merge pull request #16918 from liewegas/wip-doc-bluestore-migration

doc/rados/operations/bluestore-migration: document bluestore migration process

Reviewed-by: Vasu Kulkarni <vasu@redhat.com>

commit 1e606708bc

@@ -60,7 +60,7 @@ last ten years. Key BlueStore features include:
   and for erasure coded pools (which rely on cloning to implement
   efficient two-phase commits).

-For more information, see :doc:`bluestore-config-ref`.
+For more information, see :doc:`bluestore-config-ref` and :doc:`/rados/operations/bluestore-migration`.

 FileStore
 ---------

246 doc/rados/operations/bluestore-migration.rst Normal file
@@ -0,0 +1,246 @@

=====================
 BlueStore Migration
=====================

Each OSD can run either BlueStore or FileStore, and a single Ceph
cluster can contain a mix of both.  Users who have previously deployed
FileStore are likely to want to transition to BlueStore in order to
take advantage of the improved performance and robustness.  There are
several strategies for making such a transition.

An individual OSD cannot be converted in place in isolation, however:
BlueStore and FileStore are simply too different for that to be
practical.  "Conversion" will rely either on the cluster's normal
replication and healing support, or on tools and strategies that copy
OSD content from an old (FileStore) device to a new (BlueStore) one.

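To gauge the scale of the transition before you start, you can count
how many OSDs currently use each backend with ``ceph osd
count-metadata`` (the same command appears in the procedure below).
The output shown here is only illustrative; your counts will differ::

  $ ceph osd count-metadata osd_objectstore
  {
      "filestore": 12,
      "bluestore": 3
  }
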
Deploy new OSDs with BlueStore
==============================

Any new OSDs (e.g., when the cluster is expanded) can be deployed
using BlueStore.  This is the default behavior so no specific change
is needed.

Similarly, any OSDs that are reprovisioned after replacing a failed drive
can use BlueStore.

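For example, a minimal sketch of reprovisioning a replacement drive as
a BlueStore OSD with ``ceph-disk``, where ``/dev/sdb`` is a
hypothetical device name (substitute your own, and note that ``zap``
destroys all data on the device)::

  ceph-disk zap /dev/sdb                  # only if the device held old data
  ceph-disk prepare --bluestore /dev/sdb
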
Convert existing OSDs
=====================

Mark out and replace
--------------------

The simplest approach is to mark out each device in turn, wait for the
data to re-replicate across the cluster, reprovision the OSD, and mark
it back in again.  It is simple and easy to automate.  However, it requires
more data migration than should be necessary, so it is not optimal.

#. Identify a FileStore OSD to replace::

     ID=<osd-id-number>
     DEVICE=<disk-device>

   You can tell whether a given OSD is FileStore or BlueStore with::

     ceph osd metadata $ID | grep osd_objectstore

   You can get a current count of filestore vs bluestore with::

     ceph osd count-metadata osd_objectstore

#. Mark the filestore OSD out::

     ceph osd out $ID

#. Wait for the data to migrate off the OSD in question::

     while ! ceph health | grep HEALTH_OK ; do sleep 60 ; done

#. Stop the OSD::

     systemctl kill ceph-osd@$ID

#. Make note of which device this OSD is using::

     mount | grep /var/lib/ceph/osd/ceph-$ID

#. Unmount the OSD::

     umount /var/lib/ceph/osd/ceph-$ID

#. Destroy the OSD data.  Be *EXTREMELY CAREFUL* as this will destroy
   the contents of the device; be certain the data on the device is
   not needed (i.e., that the cluster is healthy) before proceeding. ::

     ceph-disk zap $DEVICE

#. Tell the cluster the OSD has been destroyed (and a new OSD can be
   reprovisioned with the same ID)::

     ceph osd destroy $ID --yes-i-really-mean-it

#. Reprovision a BlueStore OSD in its place with the same OSD ID.
   This requires you to identify which device to wipe based on what you saw
   mounted above.  BE CAREFUL! ::

     ceph-disk prepare --bluestore $DEVICE --osd-id $ID

#. Repeat.

You can allow the refilling of the replacement OSD to happen
concurrently with the draining of the next OSD, or follow the same
procedure for multiple OSDs in parallel, as long as you ensure the
cluster is fully clean (all data has all replicas) before destroying
any OSDs.  Failure to do so will reduce the redundancy of your data
and increase the risk of (or potentially even cause) data loss.

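Because the procedure is easy to automate, one full cycle can be
scripted.  A minimal bash sketch (not an official tool), assuming
``ID`` and ``DEVICE`` are set as in the first step and that you have
verified the mounted device by hand before running anything
destructive::

  ceph osd out $ID
  while ! ceph health | grep -q HEALTH_OK ; do sleep 60 ; done
  systemctl kill ceph-osd@$ID
  umount /var/lib/ceph/osd/ceph-$ID
  ceph-disk zap $DEVICE                     # destroys all data on $DEVICE
  ceph osd destroy $ID --yes-i-really-mean-it
  ceph-disk prepare --bluestore $DEVICE --osd-id $ID
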
Advantages:

* Simple.
* Can be done on a device-by-device basis.
* No spare devices or hosts are required.

Disadvantages:

* Data is copied over the network twice: once to some other OSD in the
  cluster (to maintain the desired number of replicas), and then again
  back to the reprovisioned BlueStore OSD.

Whole host replacement
----------------------

If you have a spare host in the cluster, or have sufficient free space
to evacuate an entire host in order to use it as a spare, then the
conversion can be done on a host-by-host basis with each stored copy of
the data migrating only once.

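To judge whether the cluster has enough free space to evacuate an
entire host, you can inspect per-host and per-OSD utilization with,
e.g.::

  ceph osd df tree
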
#. Identify an empty host.  Ideally the host should have roughly the
   same capacity as other hosts you will be converting (although it
   doesn't strictly matter). ::

     NEWHOST=<empty-host-name>

#. Add the host to the CRUSH hierarchy, but do not attach it to the root::

     ceph osd crush add-bucket $NEWHOST host

#. Provision new BlueStore OSDs for all devices::

     ceph-disk prepare --bluestore /dev/$DEVICE

#. Verify OSDs join the cluster with::

     ceph osd tree

   You should see the new host ``$NEWHOST`` with all of the OSDs beneath
   it, but the host should *not* be nested beneath any other node in the
   hierarchy (like ``root default``).  For example, if ``newhost`` is
   the empty host, you might see something like::

     $ bin/ceph osd tree
     ID CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF
     -5             0 host newhost
     10   ssd 1.00000     osd.10           up  1.00000 1.00000
     11   ssd 1.00000     osd.11           up  1.00000 1.00000
     12   ssd 1.00000     osd.12           up  1.00000 1.00000
     -1       3.00000 root default
     -2       3.00000     host oldhost1
      0   ssd 1.00000         osd.0        up  1.00000 1.00000
      1   ssd 1.00000         osd.1        up  1.00000 1.00000
      2   ssd 1.00000         osd.2        up  1.00000 1.00000
     ...

#. Identify the first target host to convert ::

     OLDHOST=<old-host-name>

#. Swap the new host into the old host's position in the cluster::

     ceph osd crush swap-bucket $NEWHOST $OLDHOST

   At this point all data on ``$OLDHOST`` will start migrating to OSDs
   on ``$NEWHOST``.  If there is a difference in the total capacity of
   the old and new hosts you may also see some data migrate to or from
   other nodes in the cluster, but as long as the hosts are similarly
   sized this will be a relatively small amount of data.

#. Wait for data migration to complete::

     while ! ceph health | grep HEALTH_OK ; do sleep 60 ; done

#. Stop all old OSDs on the now-empty ``$OLDHOST``::

     ssh $OLDHOST
     systemctl kill ceph-osd.target
     umount /var/lib/ceph/osd/ceph-*

#. Destroy and purge the old OSDs::

     for osd in `ceph osd crush ls $OLDHOST`; do
         ceph osd purge $osd --yes-i-really-mean-it
     done

#. Wipe the old OSD devices.  This requires you to identify which
   devices are to be wiped manually (BE CAREFUL!).  For each device::

     ceph-disk zap $DEVICE

#. Use the now-empty host as the new host, and repeat::

     NEWHOST=$OLDHOST

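The swap-and-drain cycle above also lends itself to scripting.  A
minimal bash sketch of one iteration (not an official tool), assuming
``NEWHOST`` and ``OLDHOST`` are set and that ``DEVICES`` is a
hypothetical, hand-verified list of the old host's data devices::

  ceph osd crush swap-bucket $NEWHOST $OLDHOST
  while ! ceph health | grep -q HEALTH_OK ; do sleep 60 ; done
  ssh $OLDHOST 'systemctl kill ceph-osd.target ; umount /var/lib/ceph/osd/ceph-*'
  for osd in `ceph osd crush ls $OLDHOST`; do
      ceph osd purge $osd --yes-i-really-mean-it
  done
  for dev in $DEVICES ; do
      ssh $OLDHOST "ceph-disk zap $dev"     # destroys all data on $dev
  done
  NEWHOST=$OLDHOST                          # the drained host is the next spare
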
Advantages:

* Data is copied over the network only once.
* Converts an entire host's OSDs at once.
* Can parallelize to converting multiple hosts at a time.
* No spare devices are required on each host.

Disadvantages:

* A spare host is required.
* An entire host's worth of OSDs will be migrating data at a time.  This
  is likely to impact overall cluster performance.
* All migrated data still makes one full hop over the network.

Per-OSD device copy
-------------------

A single logical OSD can be converted by using the ``copy`` function
of ``ceph-objectstore-tool``.  This requires that the host have a free
device (or devices) to provision a new, empty BlueStore OSD.  For
example, if each host in your cluster has 12 OSDs, then you'd need a
13th available device so that each OSD can be converted in turn before the
old device is reclaimed to convert the next OSD.

Caveats:

* This strategy requires that a blank BlueStore OSD be prepared
  without allocating a new OSD ID, something that the ``ceph-disk``
  tool doesn't support.  More importantly, the setup of *dmcrypt* is
  closely tied to the OSD identity, which means that this approach
  does not work with encrypted OSDs.

* The device must be manually partitioned.

* Tooling not implemented!

* Not documented!

Advantages:

* Little or no data migrates over the network during the conversion.

Disadvantages:

* Tooling not fully implemented.
* Process not documented.
* Each host must have a spare or empty device.
* The OSD is offline during the conversion, which means new writes will
  be written to only a subset of the OSDs.  This increases the risk of data
  loss due to a subsequent failure.  (However, if there is a failure before
  conversion is complete, the original FileStore OSD can be started to provide
  access to its original data.)

@@ -58,6 +58,7 @@ with new hardware.

    add-or-rm-osds
    add-or-rm-mons
+   bluestore-migration
    Command Reference <control>