doc/ceph-volume expand on the ceph-disk replacement reasons

Signed-off-by: Alfredo Deza <adeza@redhat.com>
Alfredo Deza 2018-07-23 16:22:37 -04:00
parent 0b37258035
commit c0e7e8254e


@@ -11,6 +11,64 @@ that come installed for Ceph. These rules allow automatic detection of
previously setup devices that are in turn fed into ``ceph-disk`` to activate
them.
.. _ceph-disk-replaced:
Replacing ``ceph-disk``
-----------------------
The ``ceph-disk`` tool was created at a time when the project was required to
support many different types of init systems (upstart, sysvinit, etc.) while
being able to discover devices. This caused the tool to concentrate initially
(and exclusively afterwards) on GPT partitions. Specifically on GPT GUIDs,
which were used to label devices in a unique way to answer questions like:
* is this device a journal?
* is it an encrypted data partition?
* was the device left partially prepared?
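As a rough illustration of the GUID-based labelling described above, the sketch
below maps partition type GUIDs to roles. It is not ``ceph-disk`` code: the
function name and role strings are hypothetical, and the table holds only the
two type codes that appear in the ``sgdisk`` commands in this document,
labelled with the partition names those commands use.

```python
# Illustrative only: this is NOT the actual ceph-disk lookup table.
# The two GUIDs are the --typecode values from the sgdisk examples in this
# document; the role strings are the partition names used in those commands.
KNOWN_TYPECODES = {
    "89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be": "ceph data",
    "fb3aabf9-d25f-47cc-bf5e-721d181642be": "ceph lockbox",
}


def classify_partition(typecode):
    """Return the role for a GPT type GUID, or None if it is not recognised."""
    # GUIDs are case-insensitive, so normalise before the lookup.
    return KNOWN_TYPECODES.get(typecode.strip().lower())
```

A tool built this way can only answer questions about devices that carry one of
its own type codes, which is why the approach was tied so closely to GPT.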
To answer these, it used ``UDEV`` rules that matched the GUIDs and called
``ceph-disk``, which ended up in a back and forth between the ``ceph-disk`` systemd
unit and the ``ceph-disk`` executable. The process was very unreliable and time
consuming (a timeout of close to three hours **per OSD** had to be put in
place), and would cause OSDs to not come up at all during the boot process of
a node.
These problems were hard to debug, or even to replicate, given the asynchronous
behavior of ``UDEV``.
Because the world-view of ``ceph-disk`` had to be GPT partitions exclusively, it
couldn't work with other technologies like LVM or similar device
mapper devices. It was ultimately decided to create something modular, starting
with LVM support, and with the ability to expand to other technologies as needed.
GPT partitions are simple?
--------------------------
Although partitions in general are simple to reason about, ``ceph-disk``
partitions were not simple by any means. They required a tremendous number of
special flags in order to work correctly with the device discovery
workflow. Here is an example call to create a data partition::
/sbin/sgdisk --largest-new=1 --change-name=1:ceph data --partition-guid=1:f0fc39fd-eeb2-49f1-b922-a11939cf8a0f --typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be --mbrtogpt -- /dev/sdb
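To make the number of moving parts in that call easier to see, here is a
minimal sketch that assembles the same argument list from its individual
pieces. The helper name and its parameters are hypothetical and are not part of
``ceph-disk``; the flags themselves are the ones shown in the command above.
The sketch only builds the arguments and does not run ``sgdisk``.

```python
import shlex


def sgdisk_data_partition_args(device, number, name, partition_guid, typecode):
    """Build (but do not run) an sgdisk call like the one shown above.

    Hypothetical helper for illustration; every flag is repeated per
    partition number, which is what made these calls hard to get right.
    """
    return [
        "/sbin/sgdisk",
        f"--largest-new={number}",                      # use the largest free block
        f"--change-name={number}:{name}",               # human-readable partition name
        f"--partition-guid={number}:{partition_guid}",  # unique GUID for this partition
        f"--typecode={number}:{typecode}",              # type GUID matched during discovery
        "--mbrtogpt",
        "--",
        device,
    ]


args = sgdisk_data_partition_args(
    device="/dev/sdb",
    number=1,
    name="ceph data",
    partition_guid="f0fc39fd-eeb2-49f1-b922-a11939cf8a0f",
    typecode="89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be",
)
# shlex.join quotes the argument containing a space so the line is copy-pasteable.
print(shlex.join(args))
```

Keeping the arguments as a list rather than a single string avoids quoting
mistakes around names such as ``ceph data`` that contain a space.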
Not only was creating these hard, but the partitions also required devices to be
exclusively owned by Ceph. For example, in some cases a special partition would
be created when devices were encrypted, which would contain unencrypted keys.
This was ``ceph-disk`` domain knowledge, which would not translate to a "GPT
partitions are simple" understanding. Here is an example of that special
partition being created::
/sbin/sgdisk --new=5:0:+10M --change-name=5:ceph lockbox --partition-guid=5:None --typecode=5:fb3aabf9-d25f-47cc-bf5e-721d181642be --mbrtogpt -- /dev/sdad
Modularity
----------
``ceph-volume`` was designed to be a modular tool because we anticipate that
people will provision hardware devices in many different ways
that we need to consider. There are already two: legacy ceph-disk devices that
are still in use and have GPT partitions (handled by :ref:`ceph-volume-simple`),
and lvm. SPDK devices, where NVMe devices are managed directly from userspace, are
on the immediate horizon; LVM won't work there since the kernel isn't
involved at all.
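One way to picture this kind of modularity is a small registry that routes a
subcommand name to the handler for that device technology. The sketch below is
an assumption for illustration only and is not how ``ceph-volume`` is
implemented: ``HANDLERS``, ``register`` and ``dispatch`` are hypothetical
names, and the handlers simply echo their arguments.

```python
from typing import Callable, Dict, List

# Hypothetical registry: subcommand name -> handler for that technology.
HANDLERS: Dict[str, Callable[[List[str]], str]] = {}


def register(name):
    """Decorator that adds a handler to the registry under `name`."""
    def decorator(func):
        HANDLERS[name] = func
        return func
    return decorator


@register("lvm")
def handle_lvm(argv):
    # Placeholder for logical-volume based devices.
    return f"lvm handler received {argv}"


@register("simple")
def handle_simple(argv):
    # Placeholder for legacy GPT-partitioned devices.
    return f"simple handler received {argv}"


def dispatch(argv):
    """Route argv to the handler named by its first element."""
    if not argv or argv[0] not in HANDLERS:
        raise ValueError(f"unknown subcommand; choose from {sorted(HANDLERS)}")
    return HANDLERS[argv[0]](argv[1:])
```

With this shape, supporting a new technology means registering one more
handler, without touching the existing ones.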
``ceph-volume lvm``
-------------------
@@ -21,3 +79,11 @@ like dm-cache as well.
For ``ceph-volume``, the use of dm-cache is transparent: there is no difference
for the tool, and it treats dm-cache like a plain logical volume.
LVM performance penalty
-----------------------
In short: we haven't noticed any significant performance penalties
associated with the change to LVM. Working closely with LVM also made the
ability to work with other device mapper technologies (for example ``dm-cache``)
a given: there is no technical difficulty in working with anything that can
sit below a Logical Volume.