From 8e166fa1b70a75b62dd89f156e4653fb150dccaf Mon Sep 17 00:00:00 2001
From: Sage Weil
Date: Wed, 2 Aug 2017 15:48:36 -0400
Subject: [PATCH 1/2] doc/rados/configuration: document bluestore

Initial pass here.  Not yet complete.

Signed-off-by: Sage Weil
---
 .../filesystem-recommendations.rst          | 62 --------------
 doc/rados/configuration/index.rst           |  2 +-
 doc/rados/configuration/storage-devices.rst | 83 +++++++++++++++++++
 3 files changed, 84 insertions(+), 63 deletions(-)
 delete mode 100644 doc/rados/configuration/filesystem-recommendations.rst
 create mode 100644 doc/rados/configuration/storage-devices.rst

diff --git a/doc/rados/configuration/filesystem-recommendations.rst b/doc/rados/configuration/filesystem-recommendations.rst
deleted file mode 100644
index c967d60ce07..00000000000
--- a/doc/rados/configuration/filesystem-recommendations.rst
+++ /dev/null
@@ -1,62 +0,0 @@
-===========================================
- Hard Disk and File System Recommendations
-===========================================
-
-.. index:: hard drive preparation
-
-Hard Drive Prep
-===============
-
-Ceph aims for data safety, which means that when the :term:`Ceph Client`
-receives notice that data was written to a storage drive, that data was actually
-written to the storage drive. For old kernels (<2.6.33), disable the write cache
-if the journal is on a raw drive. Newer kernels should work fine.
-
-Use ``hdparm`` to disable write caching on the hard disk::
-
-  sudo hdparm -W 0 /dev/hda 0
-
-In production environments, we recommend running a :term:`Ceph OSD Daemon` with
-separate drives for the operating system and the data. If you run data and an
-operating system on a single disk, we recommend creating a separate partition
-for your data.
-
-.. index:: filesystems
-
-Filesystems
-===========
-
-Ceph OSD Daemons rely heavily upon the stability and performance of the
-underlying filesystem.
-
-Recommended
------------
-
-We currently recommend ``XFS`` for production deployments.
-
-Not recommended
----------------
-
-We recommand *against* using ``btrfs`` due to the lack of a stable
-version to test against and frequent bugs in the ENOSPC handling.
-
-We recommend *against* using ``ext4`` due to limitations in the size
-of xattrs it can store, and the problems this causes with the way Ceph
-handles long RADOS object names.  Although these issues will generally
-not surface with Ceph clusters using only short object names (e.g., an
-RBD workload that does not include long RBD image names), other users
-like RGW make extensive use of long object names and can break.
-
-Starting with the Jewel release, the ``ceph-osd`` daemon will refuse
-to start if the configured max object name cannot be safely stored on
-``ext4``.  If the cluster is only being used with short object names
-(e.g., RBD only), you can continue using ``ext4`` by setting the
-following configuration option::
-
-  osd max object name len = 256
-  osd max object namespace len = 64
-
-.. note:: This may result in difficult-to-diagnose errors if you try
-          to use RGW or other librados clients that do not properly
-          handle or politely surface any resulting ENAMETOOLONG
-          errors.
diff --git a/doc/rados/configuration/index.rst b/doc/rados/configuration/index.rst
index 264141c1047..48b58efb707 100644
--- a/doc/rados/configuration/index.rst
+++ b/doc/rados/configuration/index.rst
@@ -32,7 +32,7 @@ For general object store configuration, refer to the following:
 
 .. toctree::
    :maxdepth: 1
 
-   Disks and Filesystems <filesystem-recommendations>
+   Storage devices <storage-devices>
    ceph-conf
 
diff --git a/doc/rados/configuration/storage-devices.rst b/doc/rados/configuration/storage-devices.rst
new file mode 100644
index 00000000000..83c0c9b9fad
--- /dev/null
+++ b/doc/rados/configuration/storage-devices.rst
@@ -0,0 +1,83 @@
+=================
+ Storage Devices
+=================
+
+There are two Ceph daemons that store data on disk:
+
+* **Ceph OSDs** (or Object Storage Daemons) are where most of the
+  data is stored in Ceph.  Generally speaking, each OSD is backed by
+  a single storage device, like a traditional hard disk (HDD) or
+  solid state disk (SSD).  OSDs can also be backed by a combination
+  of devices, like an HDD for most data and an SSD (or partition of
+  an SSD) for some metadata.  The number of OSDs in a cluster is
+  generally a function of how much data will be stored, how big each
+  storage device will be, and the level and type of redundancy
+  (replication or erasure coding).
+* **Ceph Monitor** daemons manage critical cluster state like cluster
+  membership and authentication information.  For smaller clusters a
+  few gigabytes is all that is needed, although for larger clusters
+  the monitor database can reach tens or possibly hundreds of
+  gigabytes.
+
+
+OSD Backends
+============
+
+There are two ways that OSDs can manage the data they store.  Starting
+with the Luminous 12.2.z release, the new default (and recommended) backend is
+*BlueStore*.  Prior to Luminous, the default (and only option) was
+*FileStore*.
+
+BlueStore
+---------
+
+BlueStore is a special-purpose storage backend designed specifically
+for managing data on disk for Ceph OSD workloads.  It is motivated by
+experience supporting and managing OSDs using FileStore over the
+last ten years.  Key BlueStore features include:
+
+* Direct management of storage devices.  BlueStore consumes raw block
+  devices or partitions.  This avoids any intervening layers of
+  abstraction (such as local file systems like XFS) that may limit
+  performance or add complexity.
+* Metadata management with RocksDB.  We embed RocksDB's key/value database
+  in order to manage internal metadata, such as the mapping from object
+  names to block locations on disk.
+* Full data and metadata checksumming.  By default all data and
+  metadata written to BlueStore is protected by one or more
+  checksums.  No data or metadata will be read from disk or returned
+  to the user without being verified.
+* Inline compression.  Data written may be optionally compressed
+  before being written to disk.
+* Multi-device metadata tiering.  BlueStore allows its internal
+  journal (write-ahead log) to be written to a separate, high-speed
+  device (like an SSD, NVMe, or NVDIMM) to increase performance.  If
+  a significant amount of faster storage is available, internal
+  metadata can also be stored on the faster device.
+* Efficient copy-on-write.  RBD and CephFS snapshots rely on a
+  copy-on-write *clone* mechanism that is implemented efficiently in
+  BlueStore.  This results in efficient IO both for regular snapshots
+  and for erasure coded pools (which rely on cloning to implement
+  efficient two-phase commits).
+
+For more information, see :doc:`bluestore-config-ref`.
+
+FileStore
+---------
+
+FileStore is the legacy approach to storing objects in Ceph.  It
+relies on a standard file system (normally XFS) in combination with a
+key/value database (traditionally LevelDB, now RocksDB) for some
+metadata.

From f2bcd0250bf2751f4f739ba788f68d1bb6cf297e Mon Sep 17 00:00:00 2001
From: Sage Weil
Date: Thu, 3 Aug 2017 09:21:18 -0400
Subject: [PATCH 2/2] doc/rados/configuration/bluestore-config-ref: devices,
 checksumming, cache

Signed-off-by: Sage Weil
---
 .../configuration/bluestore-config-ref.rst | 206 +++++++++++++++++-
 1 file changed, 204 insertions(+), 2 deletions(-)

diff --git a/doc/rados/configuration/bluestore-config-ref.rst b/doc/rados/configuration/bluestore-config-ref.rst
index 254f99a1332..86c44ce00e7 100644
--- a/doc/rados/configuration/bluestore-config-ref.rst
+++ b/doc/rados/configuration/bluestore-config-ref.rst
@@ -2,11 +2,202 @@
 BlueStore Config Reference
 ==========================
 
+Devices
+=======
+
+BlueStore manages either one, two, or (in certain cases) three storage
+devices.
+
+In the simplest case, BlueStore consumes a single (primary) storage
+device.  The storage device is normally partitioned into two parts:
+
+#. A small partition is formatted with XFS and contains basic metadata
+   for the OSD.  This *data directory* includes information about the
+   OSD (its identifier, which cluster it belongs to, and its private
+   keyring).
+#. The rest of the device is normally a large partition that is managed
+   directly by BlueStore and contains all of the actual data.  This
+   *main device* is normally identified by a ``block`` symlink in the
+   data directory.
+
+It is also possible to deploy BlueStore across one or two additional
+devices:
+
+* A *WAL device* can be used for BlueStore's internal journal or
+  write-ahead log.  It is identified by the ``block.wal`` symlink in
+  the data directory.  A WAL device is only useful if it is faster
+  than the primary device (e.g., when it is on an SSD and the primary
+  device is an HDD).
+* A *DB device* can be used for storing BlueStore's internal metadata.
+  BlueStore (or rather, the embedded RocksDB) will put as much
+  metadata as it can on the DB device to improve performance.  If the
+  DB device fills up, metadata will spill back onto the primary device
+  (where it would have been otherwise).  Again, it is only helpful to
+  provision a DB device if it is faster than the primary device.
+
+If there is only a small amount of fast storage available (e.g., less
+than a gigabyte), we recommend using it as a WAL device.  If there is
+more, provisioning a DB device makes more sense.  The BlueStore
+journal will always be placed on the fastest device available, so
+using a DB device will provide the same benefit that the WAL device
+would while *also* allowing additional metadata to be stored there (if
+it will fit).
+
+A single-device BlueStore OSD can be provisioned with::
+
+  ceph-disk prepare --bluestore <device>
+
+To specify a WAL device and/or DB device, ::
+
+  ceph-disk prepare --bluestore <device> --block.wal <wal-device> --block.db <db-device>
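+
+For example, to provision an OSD that uses an HDD as its main device
+and a small partition on a faster SSD for the DB, a sketch might look
+like the following (the device names are hypothetical; substitute your
+own)::
+
+  # /dev/sdb is the HDD used as the main (primary) device
+  # /dev/nvme0n1p1 is a small partition on a faster NVMe SSD for block.db
+  ceph-disk prepare --bluestore /dev/sdb --block.db /dev/nvme0n1p1
+
+Because the journal is always placed on the fastest available device,
+a separate ``--block.wal`` is generally only worth specifying if a
+device even faster than the DB device is available.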
+
+Cache size
+==========
+
+The amount of memory consumed by each OSD for BlueStore's cache is
+determined by the ``bluestore_cache_size`` configuration option.  If
+that config option is not set (i.e., it remains at 0), a different
+default value is used depending on whether an HDD or SSD is used for
+the primary device (set by the ``bluestore_cache_size_ssd`` and
+``bluestore_cache_size_hdd`` config options).
+
+BlueStore and the rest of the Ceph OSD do their best to stick to the
+budgeted memory.  Note that on top of the configured cache size,
+there is also memory consumed by the OSD itself, and generally some
+additional overhead due to memory fragmentation and other allocator
+overhead.
+
+The configured cache memory budget can be used in a few different ways:
+
+* Key/Value metadata (i.e., RocksDB's internal cache)
+* BlueStore metadata
+* BlueStore data (i.e., recently read or written object data)
+
+Cache memory usage is governed by the following options:
+``bluestore_cache_meta_ratio``, ``bluestore_cache_kv_ratio``, and
+``bluestore_cache_kv_max``.  The fraction of the cache devoted to data
+is 1.0 minus the meta and kv ratios.  The memory devoted to kv
+metadata (the RocksDB cache) is capped by ``bluestore_cache_kv_max``
+since our testing indicates there are diminishing returns beyond a
+certain point.
+
+``bluestore_cache_size``
+
+:Description: The amount of memory BlueStore will use for its cache.  If zero, ``bluestore_cache_size_hdd`` or ``bluestore_cache_size_ssd`` will be used instead.
+:Type: Integer
+:Required: Yes
+:Default: ``0``
+
+``bluestore_cache_size_hdd``
+
+:Description: The default amount of memory BlueStore will use for its cache when backed by an HDD.
+:Type: Integer
+:Required: Yes
+:Default: ``1 * 1024 * 1024 * 1024`` (1 GB)
+
+``bluestore_cache_size_ssd``
+
+:Description: The default amount of memory BlueStore will use for its cache when backed by an SSD.
+:Type: Integer
+:Required: Yes
+:Default: ``3 * 1024 * 1024 * 1024`` (3 GB)
+
+``bluestore_cache_meta_ratio``
+
+:Description: The ratio of cache devoted to metadata.
+:Type: Floating point
+:Required: Yes
+:Default: ``.01``
+
+``bluestore_cache_kv_ratio``
+
+:Description: The ratio of cache devoted to key/value data (RocksDB).
+:Type: Floating point
+:Required: Yes
+:Default: ``.99``
+
+``bluestore_cache_kv_max``
+
+:Description: The maximum amount of cache devoted to key/value data (RocksDB).
+:Type: Integer
+:Required: Yes
+:Default: ``512 * 1024 * 1024`` (512 MB)
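+
+For example, a plausible ``ceph.conf`` snippet for SSD-backed OSDs on
+hosts with plenty of RAM might look like the following (the 4 GB and
+512 MB figures are illustrative assumptions, not recommendations)::
+
+  [osd]
+  # override the HDD/SSD defaults with an explicit 4 GB cache per OSD
+  bluestore cache size = 4294967296
+  # cap the RocksDB (key/value) portion of the cache at 512 MB
+  bluestore cache kv max = 536870912
+
+The data portion of the cache is then whatever remains after the meta
+and kv ratios are applied.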
+
+
+Checksums
+=========
+
+BlueStore checksums all metadata and data written to disk.  Metadata
+checksumming is handled by RocksDB and uses `crc32c`.  Data
+checksumming is done by BlueStore and can make use of `crc32c`,
+`xxhash32`, or `xxhash64`.  The default is `crc32c` and should be
+suitable for most purposes.
+
+Full data checksumming does increase the amount of metadata that
+BlueStore must store and manage.  When possible, e.g., when clients
+hint that data is written and read sequentially, BlueStore will
+checksum larger blocks, but in many cases it must store a checksum
+value (usually 4 bytes) for every 4 kilobyte block of data.
+
+It is possible to use a smaller checksum value by truncating the
+checksum to two or one byte, reducing the metadata overhead.  The
+trade-off is that the probability that a random error will not be
+detected is higher with a smaller checksum, going from about one in
+four billion with a 32-bit (4 byte) checksum to one in 65,536 for a
+16-bit (2 byte) checksum or one in 256 for an 8-bit (1 byte) checksum.
+The smaller checksum values can be used by selecting `crc32c_16` or
+`crc32c_8` as the checksum algorithm.
+
+The *checksum algorithm* can be set either via a per-pool
+``csum_type`` property or the global config option.  For example, ::
+
+  ceph osd pool set <pool-name> csum_type <algorithm>
+
+``bluestore_csum_type``
+
+:Description: The default checksum algorithm to use.
+:Type: String
+:Required: Yes
+:Valid Settings: ``none``, ``crc32c``, ``crc32c_16``, ``crc32c_8``, ``xxhash32``, ``xxhash64``
+:Default: ``crc32c``
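+
+For example, to use the truncated 8-bit checksum on a pool named
+``rbd`` (a hypothetical pool name) while leaving the cluster-wide
+default (``bluestore csum type``) at ``crc32c``::
+
+  ceph osd pool set rbd csum_type crc32c_8
+
+Per the probabilities above, this trades a roughly one-in-256 chance
+of missing a random error for one quarter of the per-block checksum
+metadata.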
+
+
 Inline Compression
 ==================
 
-BlueStore supports inline compression using snappy, zlib, or LZ4. Please note,
-the lz4 compression plugin is not distributed in the official release.
+BlueStore supports inline compression using `snappy`, `zlib`, or
+`lz4`.  Please note that the `lz4` compression plugin is not
+distributed in the official release.
+
+Whether data in BlueStore is compressed is determined by a combination
+of the *compression mode* and any hints associated with a write
+operation.  The modes are:
+
+* **none**: Never compress data.
+* **passive**: Do not compress data unless the write operation has a
+  *compressible* hint set.
+* **aggressive**: Compress data unless the write operation has an
+  *incompressible* hint set.
+* **force**: Try to compress data no matter what.
+
+For more information about the *compressible* and *incompressible* IO
+hints, see :doc:`/api/librados/#rados_set_alloc_hint`.
+
+Note that regardless of the mode, if the size of the data chunk is not
+reduced sufficiently the compressed version will not be used and the
+original (uncompressed) data will be stored.  For example, if the
+``bluestore compression required ratio`` is set to ``.7`` then the
+compressed data must be 70% of the size of the original (or smaller).
+
+The *compression mode*, *compression algorithm*, *compression required
+ratio*, *min blob size*, and *max blob size* can be set either via a
+per-pool property or a global config option.  Pool properties can be
+set with::
+
+  ceph osd pool set <pool-name> compression_algorithm <algorithm>
+  ceph osd pool set <pool-name> compression_mode <mode>
+  ceph osd pool set <pool-name> compression_required_ratio <ratio>
+  ceph osd pool set <pool-name> compression_min_blob_size <size>
+  ceph osd pool set <pool-name> compression_max_blob_size <size>
 
 ``bluestore compression algorithm``
 
@@ -33,6 +224,17 @@ the lz4 compression plugin is not distributed in the official release.
 :Valid Settings: ``none``, ``passive``, ``aggressive``, ``force``
 :Default: ``none``
 
+``bluestore compression required ratio``
+
+:Description: The ratio of the size of the data chunk after
+              compression relative to the original size must be at
+              most this value in order for the compressed version to
+              be stored.
+
+:Type: Floating point
+:Required: No
+:Default: .875
+
 ``bluestore compression min blob size``
 
 :Description: Chunks smaller than this are never compressed.