From c0e7e8254e9222beee7bc401996c503a1a75078f Mon Sep 17 00:00:00 2001
From: Alfredo Deza
Date: Mon, 23 Jul 2018 16:22:37 -0400
Subject: [PATCH] doc/ceph-volume expand on the ceph-disk replacement reasons

Signed-off-by: Alfredo Deza
---
 doc/ceph-volume/intro.rst | 66 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/doc/ceph-volume/intro.rst b/doc/ceph-volume/intro.rst
index efb4ef2c5cd..c49531d55b3 100644
--- a/doc/ceph-volume/intro.rst
+++ b/doc/ceph-volume/intro.rst
@@ -11,6 +11,64 @@ that come installed for Ceph.
 These rules allow automatic detection of previously setup devices that are in
 turn fed into ``ceph-disk`` to activate them.
 
+.. _ceph-disk-replaced:
+
+Replacing ``ceph-disk``
+-----------------------
+The ``ceph-disk`` tool was created at a time when the project was required to
+support many different types of init systems (upstart, sysvinit, etc.) while
+still being able to discover devices. This caused the tool to concentrate
+initially (and exclusively afterwards) on GPT partitions, specifically on GPT
+GUIDs, which were used to label devices in a unique way to answer questions
+like:
+
+* is this device a journal?
+* is it an encrypted data partition?
+* was the device left partially prepared?
+
+To answer these questions, ``UDEV`` rules were used to match the GUIDs and
+call ``ceph-disk``, ending up in a back and forth between the ``ceph-disk``
+systemd unit and the ``ceph-disk`` executable. The process was very unreliable
+and time consuming (a timeout of close to three hours **per OSD** had to be
+put in place), and could cause OSDs to not come up at all during the boot
+process of a node.
+
+These problems were hard to debug, or even to replicate, given the
+asynchronous behavior of ``UDEV``.
+
+Since the world view of ``ceph-disk`` was restricted exclusively to GPT
+partitions, it could not work with other technologies like LVM, or similar
+device mapper devices.
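The GUID labelling described above amounts to a lookup table keyed on a
partition's GPT typecode. Here is a minimal sketch of the idea; the two GUIDs
are the typecodes used in the ``sgdisk`` examples later in this document,
while the mapping name and role strings are illustrative assumptions, not
``ceph-disk``'s actual code:

```python
# Hypothetical mapping from GPT typecode GUID to the role ceph-disk assigned.
# The GUIDs below come from the sgdisk examples in this document; the role
# names are assumptions for illustration only.
PTYPE_ROLES = {
    "89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be": "data",     # data partition example
    "fb3aabf9-d25f-47cc-bf5e-721d181642be": "lockbox",  # lockbox partition example
}

def partition_role(typecode_guid: str) -> str:
    """Answer "what is this device?" from its GPT typecode GUID."""
    return PTYPE_ROLES.get(typecode_guid.lower(), "unknown")

print(partition_role("89C57F98-2FE5-4DC0-89C1-F3AD0CEFF2BE"))  # data
print(partition_role("00000000-0000-0000-0000-000000000000"))  # unknown
```

In the real tool this classification was driven asynchronously by ``UDEV``
events rather than a direct function call, which is precisely what made it
hard to debug.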
+It was ultimately decided to create something modular, starting with LVM
+support, and with the ability to expand to other technologies as needed.
+
+
+GPT partitions are simple?
+--------------------------
+Although partitions in general are simple to reason about, ``ceph-disk``
+partitions were not simple by any means. They required a tremendous number of
+special flags in order to work correctly with the device discovery workflow.
+Here is an example call to create a data partition::
+
+    /sbin/sgdisk --largest-new=1 --change-name=1:ceph data --partition-guid=1:f0fc39fd-eeb2-49f1-b922-a11939cf8a0f --typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be --mbrtogpt -- /dev/sdb
+
+Not only was creating these partitions hard, but they also required devices to
+be exclusively owned by Ceph. For example, in some cases a special partition
+would be created when devices were encrypted, which would contain unencrypted
+keys. This was ``ceph-disk`` domain knowledge, which did not translate to
+a "GPT partitions are simple" understanding. Here is an example of that
+special partition being created::
+
+    /sbin/sgdisk --new=5:0:+10M --change-name=5:ceph lockbox --partition-guid=5:None --typecode=5:fb3aabf9-d25f-47cc-bf5e-721d181642be --mbrtogpt -- /dev/sdad
+
+
+Modularity
+----------
+``ceph-volume`` was designed to be a modular tool because we anticipate that
+there are going to be lots of ways that people provision the hardware devices
+that we need to consider. There are already two: legacy ceph-disk devices that
+are still in use and have GPT partitions (handled by
+:ref:`ceph-volume-simple`), and LVM. SPDK devices, where NVMe devices are
+managed directly from userspace, are on the immediate horizon; LVM won't work
+there since the kernel isn't involved at all.
 
 ``ceph-volume lvm``
 -------------------
@@ -21,3 +79,11 @@ like dm-cache as well.
 For ``ceph-volume``, the use of dm-cache is transparent, there is no
 difference for the tool, and it treats dm-cache like a plain logical volume.
+
+LVM performance penalty
+-----------------------
+In short: we have not been able to notice any significant performance
+penalties associated with the change to LVM. Working closely with LVM meant
+that the ability to work with other device mapper technologies (for example
+``dm-cache``) was a given: there is no technical difficulty in working with
+anything that can sit below a Logical Volume.
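As an illustration of that transparency, the calls below sketch how a logical
volume would be prepared as an OSD whether or not a dm-cache pool sits below
it. The volume group, device names, and sizes are made-up placeholders for the
sketch, not a recommended setup::

    # carve out a data LV and a cache pool (names and sizes are illustrative)
    lvcreate -n osd-data -L 100G ceph-vg
    lvcreate --type cache-pool -n cachepool -L 10G ceph-vg /dev/nvme0n1

    # optionally attach the cache below the data LV
    lvconvert --type cache --cachepool ceph-vg/cachepool ceph-vg/osd-data

    # ceph-volume is handed a plain logical volume either way
    ceph-volume lvm prepare --data ceph-vg/osd-data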