From f7398ddd23321be85e791ea1663c8052632c54f2 Mon Sep 17 00:00:00 2001
From: Sage Weil
Date: Tue, 8 Aug 2017 15:09:50 -0400
Subject: [PATCH] doc/rados/operations/bluestore-migration: document bluestore
 migration process

Signed-off-by: Sage Weil
---
 doc/rados/configuration/storage-devices.rst  |   2 +-
 doc/rados/operations/bluestore-migration.rst | 246 +++++++++++++++++++
 doc/rados/operations/index.rst               |   1 +
 3 files changed, 248 insertions(+), 1 deletion(-)
 create mode 100644 doc/rados/operations/bluestore-migration.rst

diff --git a/doc/rados/configuration/storage-devices.rst b/doc/rados/configuration/storage-devices.rst
index 83c0c9b9fad..262778d0dcc 100644
--- a/doc/rados/configuration/storage-devices.rst
+++ b/doc/rados/configuration/storage-devices.rst
@@ -60,7 +60,7 @@ last ten years. Key BlueStore features include:
   and for erasure coded pools (which rely on cloning to implement
   efficient two-phase commits).
 
-For more information, see :doc:`bluestore-config-ref`.
+For more information, see :doc:`bluestore-config-ref` and :doc:`/rados/operations/bluestore-migration`.
 
 FileStore
 ---------
diff --git a/doc/rados/operations/bluestore-migration.rst b/doc/rados/operations/bluestore-migration.rst
new file mode 100644
index 00000000000..d444e04e58f
--- /dev/null
+++ b/doc/rados/operations/bluestore-migration.rst
@@ -0,0 +1,246 @@
+=====================
+ BlueStore Migration
+=====================
+
+Each OSD can run either BlueStore or FileStore, and a single Ceph
+cluster can contain a mix of both.  Users who have previously deployed
+FileStore are likely to want to transition to BlueStore in order to
+take advantage of the improved performance and robustness.  There are
+several strategies for making such a transition.
+
+An individual OSD cannot be converted in place in isolation, however:
+BlueStore and FileStore are simply too different for that to be
+practical.  "Conversion" will rely either on the cluster's normal
+replication and healing support, or on tools and strategies that copy
+OSD content from an old (FileStore) device to a new (BlueStore) one.
+
+
+Deploy new OSDs with BlueStore
+==============================
+
+Any new OSDs (e.g., when the cluster is expanded) can be deployed
+using BlueStore.  This is the default behavior, so no specific change
+is needed.
+
+Similarly, any OSDs that are reprovisioned after replacing a failed
+drive can use BlueStore.
+
+Convert existing OSDs
+=====================
+
+Mark out and replace
+--------------------
+
+The simplest approach is to mark out each device in turn, wait for the
+data to re-replicate across the cluster, reprovision the OSD, and mark
+it back in again.  It is simple and easy to automate, but it requires
+more data migration than is strictly necessary, so it is not optimal.
+
+#. Identify a FileStore OSD to replace::
+
+     ID=<osd-id-number>
+     DEVICE=<disk-device>
+
+   You can tell whether a given OSD is FileStore or BlueStore with::
+
+     ceph osd metadata $ID | grep osd_objectstore
+
+   You can get a current count of FileStore vs BlueStore OSDs with::
+
+     ceph osd count-metadata osd_objectstore
+
+#. Mark the FileStore OSD out::
+
+     ceph osd out $ID
+
+#. Wait for the data to migrate off the OSD in question::
+
+     while ! ceph health | grep HEALTH_OK ; do sleep 60 ; done
+
+#. Stop the OSD::
+
+     systemctl kill ceph-osd@$ID
+
+#. Make note of which device this OSD is using::
+
+     mount | grep /var/lib/ceph/osd/ceph-$ID
+
+#. Unmount the OSD::
+
+     umount /var/lib/ceph/osd/ceph-$ID
+
+#. Destroy the OSD data.  Be *EXTREMELY CAREFUL*, as this will destroy
+   the contents of the device; be certain the data on the device is
+   not needed (i.e., that the cluster is healthy) before proceeding. ::
+
+     ceph-disk zap $DEVICE
+
+#. Tell the cluster the OSD has been destroyed (and a new OSD can be
+   reprovisioned with the same ID)::
+
+     ceph osd destroy $ID --yes-i-really-mean-it
+
+#. Reprovision a BlueStore OSD in its place with the same OSD ID.
+   This requires you to identify which device to wipe based on what you
+   saw mounted above.  BE CAREFUL! ::
+
+     ceph-disk prepare --bluestore $DEVICE --osd-id $ID
+
+#. Repeat.
+
+You can allow the refilling of the replacement OSD to happen
+concurrently with the draining of the next OSD, or follow the same
+procedure for multiple OSDs in parallel, as long as you ensure the
+cluster is fully clean (all data has all replicas) before destroying
+any OSDs.  Failure to do so will reduce the redundancy of your data
+and increase the risk of (or potentially even cause) data loss.
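+
+As noted above, this procedure lends itself to automation.  The sketch
+below simply strings the per-OSD steps together for one OSD; it is an
+illustration only, not a supported tool.  It assumes ``$ID`` and
+``$DEVICE`` have already been identified and double-checked by hand (as
+in the first step above) and it does no error handling::
+
+   # Convert one FileStore OSD ($ID on $DEVICE) to BlueStore by
+   # replaying the steps above.  Do not run this unattended.
+   ceph osd out $ID
+
+   # Wait until all data has re-replicated and the cluster is clean.
+   while ! ceph health | grep HEALTH_OK ; do sleep 60 ; done
+
+   systemctl kill ceph-osd@$ID
+   umount /var/lib/ceph/osd/ceph-$ID
+
+   # Destroys everything on $DEVICE, then recycles the same OSD ID.
+   ceph-disk zap $DEVICE
+   ceph osd destroy $ID --yes-i-really-mean-it
+   ceph-disk prepare --bluestore $DEVICE --osd-id $ID
+
+If you run several such conversions in parallel, remember the warning
+above: the cluster must be fully clean before any OSD is destroyed.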
+
+Advantages:
+
+* Simple.
+* Can be done on a device-by-device basis.
+* No spare devices or hosts are required.
+
+Disadvantages:
+
+* Data is copied over the network twice: once to some other OSD in the
+  cluster (to maintain the desired number of replicas), and then again
+  back to the reprovisioned BlueStore OSD.
+
+
+Whole host replacement
+----------------------
+
+If you have a spare host in the cluster, or have sufficient free space
+to evacuate an entire host in order to use it as a spare, then the
+conversion can be done on a host-by-host basis with each stored copy of
+the data migrating only once.
+
+#. Identify an empty host.  Ideally the host should have roughly the
+   same capacity as the other hosts you will be converting (although it
+   doesn't strictly matter). ::
+
+     NEWHOST=<empty-host-name>
+
+#. Add the host to the CRUSH hierarchy, but do not attach it to the root::
+
+     ceph osd crush add-bucket $NEWHOST host
+
+#. Provision new BlueStore OSDs for all devices::
+
+     ceph-disk prepare --bluestore /dev/$DEVICE
+
+#. Verify that the new OSDs have joined the cluster::
+
+     ceph osd tree
+
+   You should see the new host ``$NEWHOST`` with all of the OSDs beneath
+   it, but the host should *not* be nested beneath any other node in the
+   hierarchy (like ``root default``).  For example, if ``newhost`` is
+   the empty host, you might see something like::
+
+     $ ceph osd tree
+     ID CLASS WEIGHT  TYPE NAME     STATUS REWEIGHT PRI-AFF
+     -5             0 host newhost
+     10   ssd 1.00000     osd.10        up  1.00000 1.00000
+     11   ssd 1.00000     osd.11        up  1.00000 1.00000
+     12   ssd 1.00000     osd.12        up  1.00000 1.00000
+     -1       3.00000 root default
+     -2       3.00000     host oldhost1
+      0   ssd 1.00000         osd.0     up  1.00000 1.00000
+      1   ssd 1.00000         osd.1     up  1.00000 1.00000
+      2   ssd 1.00000         osd.2     up  1.00000 1.00000
+     ...
+
+#. Identify the first target host to convert::
+
+     OLDHOST=<existing-filestore-host>
+
+#. Swap the new host into the old host's position in the cluster::
+
+     ceph osd crush swap-bucket $NEWHOST $OLDHOST
+
+   At this point all data on ``$OLDHOST`` will start migrating to the
+   OSDs on ``$NEWHOST``.  If there is a difference in the total capacity
+   of the old and new hosts you may also see some data migrate to or
+   from other nodes in the cluster, but as long as the hosts are
+   similarly sized this will be a relatively small amount of data.
+
+#. Wait for the data migration to complete::
+
+     while ! ceph health | grep HEALTH_OK ; do sleep 60 ; done
+
+#. Stop all of the old OSDs on the now-empty ``$OLDHOST``::
+
+     ssh $OLDHOST
+     systemctl kill ceph-osd.target
+     umount /var/lib/ceph/osd/ceph-*
+
+#. Destroy and purge the old OSDs::
+
+     for osd in `ceph osd crush ls $OLDHOST`; do
+         ceph osd purge $osd --yes-i-really-mean-it
+     done
+
+#. Wipe the old OSD devices.  This requires you to identify which
+   devices are to be wiped manually (BE CAREFUL!).  For each device::
+
+     ceph-disk zap $DEVICE
+
+#. Use the now-empty host as the new host, and repeat::
+
+     NEWHOST=$OLDHOST
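+
+Before repeating the procedure with the next host, it can be useful to
+confirm which of a host's OSDs are still FileStore.  The loop below is
+only a sketch: it assumes ``ceph osd crush ls`` reports the host's OSDs
+as names of the form ``osd.<id>``, and it strips that prefix because
+``ceph osd metadata`` takes the bare numeric ID::
+
+   HOST=<host-name>
+   for osd in `ceph osd crush ls $HOST`; do
+       echo -n "$osd: "
+       # ${osd#osd.} drops the "osd." prefix, leaving the numeric ID.
+       ceph osd metadata ${osd#osd.} | grep osd_objectstore
+   done
+
+For a cluster-wide view of how many FileStore OSDs remain, the
+``ceph osd count-metadata osd_objectstore`` command shown in the first
+procedure is quicker.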
+
+Advantages:
+
+* Data is copied over the network only once.
+* Converts an entire host's OSDs at once.
+* Can be parallelized to convert multiple hosts at a time.
+* No spare devices are required on each host.
+
+Disadvantages:
+
+* A spare host is required.
+* An entire host's worth of OSDs will be migrating data at a time.  This
+  is likely to impact overall cluster performance.
+* All migrated data still makes one full hop over the network.
+
+
+Per-OSD device copy
+-------------------
+
+A single logical OSD can be converted by using the ``copy`` function
+of ``ceph-objectstore-tool``.  This requires that the host have a free
+device (or devices) on which to provision a new, empty BlueStore OSD.
+For example, if each host in your cluster has 12 OSDs, then you'd need
+a 13th available device so that each OSD can be converted in turn
+before the old device is reclaimed to convert the next OSD.
+
+Caveats:
+
+* This strategy requires that a blank BlueStore OSD be prepared
+  without allocating a new OSD ID, something that the ``ceph-disk``
+  tool doesn't support.  More importantly, the setup of *dmcrypt* is
+  closely tied to the OSD identity, which means that this approach
+  does not work with encrypted OSDs.
+
+* The device must be manually partitioned.
+
+* Tooling not implemented!
+
+* Not documented!
+
+Advantages:
+
+* Little or no data migrates over the network during the conversion.
+
+Disadvantages:
+
+* Tooling not fully implemented.
+* Process not documented.
+* Each host must have a spare or empty device.
+* The OSD is offline during the conversion, which means new writes will
+  be written to only a subset of the OSDs.  This increases the risk of
+  data loss due to a subsequent failure.  (However, if there is a failure
+  before the conversion is complete, the original FileStore OSD can be
+  started to provide access to its original data.)
diff --git a/doc/rados/operations/index.rst b/doc/rados/operations/index.rst
index aacf7648d77..253fc2d9d0a 100644
--- a/doc/rados/operations/index.rst
+++ b/doc/rados/operations/index.rst
@@ -58,6 +58,7 @@ with new hardware.
 
    add-or-rm-osds
    add-or-rm-mons
+   bluestore-migration
 
 Command Reference