From f7398ddd23321be85e791ea1663c8052632c54f2 Mon Sep 17 00:00:00 2001
From: Sage Weil
Date: Tue, 8 Aug 2017 15:09:50 -0400
Subject: [PATCH] doc/rados/operations/bluestore-migration: document bluestore
 migration process

Signed-off-by: Sage Weil
---
 doc/rados/configuration/storage-devices.rst  |   2 +-
 doc/rados/operations/bluestore-migration.rst | 246 +++++++++++++++++++
 doc/rados/operations/index.rst               |   1 +
 3 files changed, 248 insertions(+), 1 deletion(-)
 create mode 100644 doc/rados/operations/bluestore-migration.rst

diff --git a/doc/rados/configuration/storage-devices.rst b/doc/rados/configuration/storage-devices.rst
index 83c0c9b9fad..262778d0dcc 100644
--- a/doc/rados/configuration/storage-devices.rst
+++ b/doc/rados/configuration/storage-devices.rst
@@ -60,7 +60,7 @@ last ten years. Key BlueStore features include:
   and for erasure coded pools (which rely on cloning to implement
   efficient two-phase commits).
 
-For more information, see :doc:`bluestore-config-ref`.
+For more information, see :doc:`bluestore-config-ref` and :doc:`/rados/operations/bluestore-migration`.
 
 FileStore
 ---------
diff --git a/doc/rados/operations/bluestore-migration.rst b/doc/rados/operations/bluestore-migration.rst
new file mode 100644
index 00000000000..d444e04e58f
--- /dev/null
+++ b/doc/rados/operations/bluestore-migration.rst
@@ -0,0 +1,246 @@
+=====================
+ BlueStore Migration
+=====================
+
+Each OSD can run either BlueStore or FileStore, and a single Ceph
+cluster can contain a mix of both.  Users who have previously deployed
+FileStore are likely to want to transition to BlueStore in order to
+take advantage of the improved performance and robustness.  There are
+several strategies for making such a transition.
+
+An individual OSD cannot be converted in place in isolation, however:
+BlueStore and FileStore are simply too different for that to be
+practical.  "Conversion" will rely either on the cluster's normal
+replication and healing support, or on tools and strategies that copy
+OSD content from an old (FileStore) device to a new (BlueStore) one.
+
+
+Deploy new OSDs with BlueStore
+==============================
+
+Any new OSDs (e.g., when the cluster is expanded) can be deployed
+using BlueStore.  This is the default behavior, so no specific change
+is needed.
+
+Similarly, any OSDs that are reprovisioned after replacing a failed
+drive can use BlueStore.
+
+Convert existing OSDs
+=====================
+
+Mark out and replace
+--------------------
+
+The simplest approach is to mark out each device in turn, wait for the
+data to re-replicate across the cluster, reprovision the OSD, and mark
+it back in again.  It is simple and easy to automate, but it requires
+more data migration than is strictly necessary, so it is not optimal.
+
+#. Identify a FileStore OSD to replace::
+
+     ID=<osd-id-number>
+     DEVICE=<disk-device>
+
+   You can tell whether a given OSD is FileStore or BlueStore with::
+
+     ceph osd metadata $ID | grep osd_objectstore
+
+   You can get a current count of FileStore vs BlueStore OSDs with::
+
+     ceph osd count-metadata osd_objectstore
+
+#. Mark the FileStore OSD out::
+
+     ceph osd out $ID
+
+#. Wait for the data to migrate off the OSD in question::
+
+     while ! ceph health | grep HEALTH_OK ; do sleep 60 ; done
+
+#. Stop the OSD::
+
+     systemctl kill ceph-osd@$ID
+
+#. Make note of which device this OSD is using::
+
+     mount | grep /var/lib/ceph/osd/ceph-$ID
+
+#. Unmount the OSD::
+
+     umount /var/lib/ceph/osd/ceph-$ID
+
+#. Destroy the OSD data.  Be *EXTREMELY CAREFUL*, as this will destroy
+   the contents of the device; be certain the data on the device is
+   not needed (i.e., that the cluster is healthy) before proceeding. ::
+
+     ceph-disk zap $DEVICE
+
+#. Tell the cluster the OSD has been destroyed (and a new OSD can be
+   reprovisioned with the same ID)::
+
+     ceph osd destroy $ID --yes-i-really-mean-it
+
+#. Reprovision a BlueStore OSD in its place with the same OSD ID.
+   This requires you to identify which device to wipe based on what you
+   saw mounted above.  BE CAREFUL! ::
+
+     ceph-disk prepare --bluestore $DEVICE --osd-id $ID
+
+#. Repeat.
+
+You can allow the refilling of the replacement OSD to happen
+concurrently with the draining of the next OSD, or follow the same
+procedure for multiple OSDs in parallel, as long as you ensure the
+cluster is fully clean (all data has all replicas) before destroying
+any OSDs.  Failure to do so will reduce the redundancy of your data
+and increase the risk of (or potentially even cause) data loss.
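+
+As noted above, this procedure lends itself to automation.  The sketch
+below simply strings the per-OSD steps together for one OSD; it is an
+illustration only, not a supported tool.  It assumes ``$ID`` and
+``$DEVICE`` have already been identified and double-checked by hand (as
+in the first step above) and it does no error handling::
+
+   # Convert one FileStore OSD ($ID on $DEVICE) to BlueStore by
+   # replaying the steps above.  Do not run this unattended.
+   ceph osd out $ID
+
+   # Wait until all data has re-replicated and the cluster is clean.
+   while ! ceph health | grep HEALTH_OK ; do sleep 60 ; done
+
+   systemctl kill ceph-osd@$ID
+   umount /var/lib/ceph/osd/ceph-$ID
+
+   # Destroys everything on $DEVICE, then recycles the same OSD ID.
+   ceph-disk zap $DEVICE
+   ceph osd destroy $ID --yes-i-really-mean-it
+   ceph-disk prepare --bluestore $DEVICE --osd-id $ID
+
+If you run several such conversions in parallel, remember the warning
+above: the cluster must be fully clean before any OSD is destroyed.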
+
+Advantages:
+
+* Simple.
+* Can be done on a device-by-device basis.
+* No spare devices or hosts are required.
+
+Disadvantages:
+
+* Data is copied over the network twice: once to some other OSD in the
+  cluster (to maintain the desired number of replicas), and then again
+  back to the reprovisioned BlueStore OSD.
+
+
+Whole host replacement
+----------------------
+
+If you have a spare host in the cluster, or have sufficient free space
+to evacuate an entire host in order to use it as a spare, then the
+conversion can be done on a host-by-host basis with each stored copy of
+the data migrating only once.
+
+#. Identify an empty host.  Ideally the host should have roughly the
+   same capacity as the other hosts you will be converting (although it
+   doesn't strictly matter). ::
+
+     NEWHOST=<empty-host-name>
+
+#. Add the host to the CRUSH hierarchy, but do not attach it to the root::
+
+     ceph osd crush add-bucket $NEWHOST host
+
+#. Provision new BlueStore OSDs for all devices::
+
+     ceph-disk prepare --bluestore /dev/$DEVICE
+
+#. Verify that the new OSDs have joined the cluster::
+
+     ceph osd tree
+
+   You should see the new host ``$NEWHOST`` with all of the OSDs beneath
+   it, but the host should *not* be nested beneath any other node in the
+   hierarchy (like ``root default``).  For example, if ``newhost`` is
+   the empty host, you might see something like::
+
+     $ ceph osd tree
+     ID CLASS WEIGHT  TYPE NAME     STATUS REWEIGHT PRI-AFF
+     -5             0 host newhost
+     10   ssd 1.00000     osd.10        up  1.00000 1.00000
+     11   ssd 1.00000     osd.11        up  1.00000 1.00000
+     12   ssd 1.00000     osd.12        up  1.00000 1.00000
+     -1       3.00000 root default
+     -2       3.00000     host oldhost1
+      0   ssd 1.00000         osd.0     up  1.00000 1.00000
+      1   ssd 1.00000         osd.1     up  1.00000 1.00000
+      2   ssd 1.00000         osd.2     up  1.00000 1.00000
+     ...
+
+#. Identify the first target host to convert::
+
+     OLDHOST=<existing-filestore-host>
+
+#. Swap the new host into the old host's position in the cluster::
+
+     ceph osd crush swap-bucket $NEWHOST $OLDHOST
+
+   At this point all data on ``$OLDHOST`` will start migrating to the
+   OSDs on ``$NEWHOST``.  If there is a difference in the total capacity
+   of the old and new hosts you may also see some data migrate to or
+   from other nodes in the cluster, but as long as the hosts are
+   similarly sized this will be a relatively small amount of data.
+
+#. Wait for the data migration to complete::
+
+     while ! ceph health | grep HEALTH_OK ; do sleep 60 ; done
+
+#. Stop all of the old OSDs on the now-empty ``$OLDHOST``::
+
+     ssh $OLDHOST
+     systemctl kill ceph-osd.target
+     umount /var/lib/ceph/osd/ceph-*
+
+#. Destroy and purge the old OSDs::
+
+     for osd in `ceph osd crush ls $OLDHOST`; do
+         ceph osd purge $osd --yes-i-really-mean-it
+     done
+
+#. Wipe the old OSD devices.  This requires you to identify which
+   devices are to be wiped manually (BE CAREFUL!).  For each device::
+
+     ceph-disk zap $DEVICE
+
+#. Use the now-empty host as the new host, and repeat::
+
+     NEWHOST=$OLDHOST
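+
+Before repeating the procedure with the next host, it can be useful to
+confirm which of a host's OSDs are still FileStore.  The loop below is
+only a sketch: it assumes ``ceph osd crush ls`` reports the host's OSDs
+as names of the form ``osd.<id>``, and it strips that prefix because
+``ceph osd metadata`` takes the bare numeric ID::
+
+   HOST=<host-name>
+   for osd in `ceph osd crush ls $HOST`; do
+       echo -n "$osd: "
+       # ${osd#osd.} drops the "osd." prefix, leaving the numeric ID.
+       ceph osd metadata ${osd#osd.} | grep osd_objectstore
+   done
+
+For a cluster-wide view of how many FileStore OSDs remain, the
+``ceph osd count-metadata osd_objectstore`` command shown in the first
+procedure is quicker.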
+
+Advantages:
+
+* Data is copied over the network only once.
+* Converts an entire host's OSDs at once.
+* Can be parallelized to convert multiple hosts at a time.
+* No spare devices are required on each host.
+
+Disadvantages:
+
+* A spare host is required.
+* An entire host's worth of OSDs will be migrating data at a time.  This
+  is likely to impact overall cluster performance.
+* All migrated data still makes one full hop over the network.
+
+
+Per-OSD device copy
+-------------------
+
+A single logical OSD can be converted by using the ``copy`` function
+of ``ceph-objectstore-tool``.  This requires that the host have a free
+device (or devices) on which to provision a new, empty BlueStore OSD.
+For example, if each host in your cluster has 12 OSDs, then you'd need
+a 13th available device so that each OSD can be converted in turn
+before the old device is reclaimed to convert the next OSD.
+
+Caveats:
+
+* This strategy requires that a blank BlueStore OSD be prepared
+  without allocating a new OSD ID, something that the ``ceph-disk``
+  tool doesn't support.  More importantly, the setup of *dmcrypt* is
+  closely tied to the OSD identity, which means that this approach
+  does not work with encrypted OSDs.
+
+* The device must be manually partitioned.
+
+* Tooling not implemented!
+
+* Not documented!
+
+Advantages:
+
+* Little or no data migrates over the network during the conversion.
+
+Disadvantages:
+
+* Tooling not fully implemented.
+* Process not documented.
+* Each host must have a spare or empty device.
+* The OSD is offline during the conversion, which means new writes will
+  be written to only a subset of the OSDs.  This increases the risk of
+  data loss due to a subsequent failure.  (However, if there is a failure
+  before the conversion is complete, the original FileStore OSD can be
+  started to provide access to its original data.)
diff --git a/doc/rados/operations/index.rst b/doc/rados/operations/index.rst
index aacf7648d77..253fc2d9d0a 100644
--- a/doc/rados/operations/index.rst
+++ b/doc/rados/operations/index.rst
@@ -58,6 +58,7 @@ with new hardware.
 
    add-or-rm-osds
    add-or-rm-mons
+   bluestore-migration
 
 Command Reference