From 8e166fa1b70a75b62dd89f156e4653fb150dccaf Mon Sep 17 00:00:00 2001
From: Sage Weil
Date: Wed, 2 Aug 2017 15:48:36 -0400
Subject: [PATCH 1/2] doc/rados/configuration: document bluestore

Initial pass here.  Not yet complete.

Signed-off-by: Sage Weil
---
 .../filesystem-recommendations.rst          | 62 --------------
 doc/rados/configuration/index.rst           |  2 +-
 doc/rados/configuration/storage-devices.rst | 83 +++++++++++++++++++
 3 files changed, 84 insertions(+), 63 deletions(-)
 delete mode 100644 doc/rados/configuration/filesystem-recommendations.rst
 create mode 100644 doc/rados/configuration/storage-devices.rst

diff --git a/doc/rados/configuration/filesystem-recommendations.rst b/doc/rados/configuration/filesystem-recommendations.rst
deleted file mode 100644
index c967d60ce07..00000000000
--- a/doc/rados/configuration/filesystem-recommendations.rst
+++ /dev/null
@@ -1,62 +0,0 @@
-===========================================
- Hard Disk and File System Recommendations
-===========================================
-
-.. index:: hard drive preparation
-
-Hard Drive Prep
-===============
-
-Ceph aims for data safety, which means that when the :term:`Ceph Client`
-receives notice that data was written to a storage drive, that data was actually
-written to the storage drive. For old kernels (<2.6.33), disable the write cache
-if the journal is on a raw drive. Newer kernels should work fine.
-
-Use ``hdparm`` to disable write caching on the hard disk::
-
-  sudo hdparm -W 0 /dev/hda 0
-
-In production environments, we recommend running a :term:`Ceph OSD Daemon` with
-separate drives for the operating system and the data. If you run data and an
-operating system on a single disk, we recommend creating a separate partition
-for your data.
-
-.. index:: filesystems
-
-Filesystems
-===========
-
-Ceph OSD Daemons rely heavily upon the stability and performance of the
-underlying filesystem.
-
-Recommended
------------
-
-We currently recommend ``XFS`` for production deployments.
-
-Not recommended
----------------
-
-We recommand *against* using ``btrfs`` due to the lack of a stable
-version to test against and frequent bugs in the ENOSPC handling.
-
-We recommend *against* using ``ext4`` due to limitations in the size
-of xattrs it can store, and the problems this causes with the way Ceph
-handles long RADOS object names.  Although these issues will generally
-not surface with Ceph clusters using only short object names (e.g., an
-RBD workload that does not include long RBD image names), other users
-like RGW make extensive use of long object names and can break.
-
-Starting with the Jewel release, the ``ceph-osd`` daemon will refuse
-to start if the configured max object name cannot be safely stored on
-``ext4``.  If the cluster is only being used with short object names
-(e.g., RBD only), you can continue using ``ext4`` by setting the
-following configuration option::
-
-  osd max object name len = 256
-  osd max object namespace len = 64
-
-.. note:: This may result in difficult-to-diagnose errors if you try
-          to use RGW or other librados clients that do not properly
-          handle or politely surface any resulting ENAMETOOLONG
-          errors.
diff --git a/doc/rados/configuration/index.rst b/doc/rados/configuration/index.rst
index 264141c1047..48b58efb707 100644
--- a/doc/rados/configuration/index.rst
+++ b/doc/rados/configuration/index.rst
@@ -32,7 +32,7 @@ For general object store configuration, refer to the following:
 
 .. toctree::
    :maxdepth: 1
 
-   Disks and Filesystems <filesystem-recommendations>
+   Storage devices <storage-devices>
    ceph-conf
 
diff --git a/doc/rados/configuration/storage-devices.rst b/doc/rados/configuration/storage-devices.rst
new file mode 100644
index 00000000000..83c0c9b9fad
--- /dev/null
+++ b/doc/rados/configuration/storage-devices.rst
@@ -0,0 +1,83 @@
+=================
+ Storage Devices
+=================
+
+There are two Ceph daemons that store data on disk:
+
+* **Ceph OSDs** (or Object Storage Daemons) are where most of the
+  data is stored in Ceph.  Generally speaking, each OSD is backed by
+  a single storage device, like a traditional hard disk (HDD) or
+  solid state disk (SSD).  OSDs can also be backed by a combination
+  of devices, like an HDD for most data and an SSD (or partition of
+  an SSD) for some metadata.  The number of OSDs in a cluster is
+  generally a function of how much data will be stored, how big each
+  storage device will be, and the level and type of redundancy
+  (replication or erasure coding).
+* **Ceph Monitor** daemons manage critical cluster state like cluster
+  membership and authentication information.  For smaller clusters a
+  few gigabytes is all that is needed, although for larger clusters
+  the monitor database can reach tens or possibly hundreds of
+  gigabytes.
+
+
+OSD Backends
+============
+
+There are two ways that OSDs can manage the data they store.  Starting
+with the Luminous 12.2.z release, the new default (and recommended) backend is
+*BlueStore*.  Prior to Luminous, the default (and only option) was
+*FileStore*.
+
+BlueStore
+---------
+
+BlueStore is a special-purpose storage backend designed specifically
+for managing data on disk for Ceph OSD workloads.  It is motivated by
+experience supporting and managing OSDs using FileStore over the
+last ten years.  Key BlueStore features include:
+
+* Direct management of storage devices.  BlueStore consumes raw block
+  devices or partitions.  This avoids any intervening layers of
+  abstraction (such as local file systems like XFS) that may limit
+  performance or add complexity.
+* Metadata management with RocksDB.  We embed RocksDB's key/value database
+  in order to manage internal metadata, such as the mapping from object
+  names to block locations on disk.
+* Full data and metadata checksumming.  By default all data and
+  metadata written to BlueStore is protected by one or more
+  checksums.  No data or metadata will be read from disk or returned
+  to the user without being verified.
+* Inline compression.  Data written may be optionally compressed
+  before being written to disk.
+* Multi-device metadata tiering.  BlueStore allows its internal
+  journal (write-ahead log) to be written to a separate, high-speed
+  device (like an SSD, NVMe, or NVDIMM) to increase performance.  If
+  a significant amount of faster storage is available, internal
+  metadata can also be stored on the faster device.
+* Efficient copy-on-write.  RBD and CephFS snapshots rely on a
+  copy-on-write *clone* mechanism that is implemented efficiently in
+  BlueStore.  This results in efficient IO both for regular snapshots
+  and for erasure coded pools (which rely on cloning to implement
+  efficient two-phase commits).
+
+For more information, see :doc:`bluestore-config-ref`.
+
+FileStore
+---------
+
+FileStore is the legacy approach to storing objects in Ceph.  It
+relies on a standard file system (normally XFS) in combination with a
+key/value database (traditionally LevelDB, now RocksDB) for some
+metadata.

From f2bcd0250bf2751f4f739ba788f68d1bb6cf297e Mon Sep 17 00:00:00 2001
From: Sage Weil
Date: Thu, 3 Aug 2017 09:21:18 -0400
Subject: [PATCH 2/2] doc/rados/configuration/bluestore-config-ref: devices,
 checksumming, cache

Signed-off-by: Sage Weil
---
 .../configuration/bluestore-config-ref.rst | 206 +++++++++++++++++-
 1 file changed, 204 insertions(+), 2 deletions(-)

diff --git a/doc/rados/configuration/bluestore-config-ref.rst b/doc/rados/configuration/bluestore-config-ref.rst
index 254f99a1332..86c44ce00e7 100644
--- a/doc/rados/configuration/bluestore-config-ref.rst
+++ b/doc/rados/configuration/bluestore-config-ref.rst
@@ -2,11 +2,202 @@
 BlueStore Config Reference
 ==========================
 
+Devices
+=======
+
+BlueStore manages either one, two, or (in certain cases) three storage
+devices.
+
+In the simplest case, BlueStore consumes a single (primary) storage
+device.  The storage device is normally partitioned into two parts:
+
+#. A small partition is formatted with XFS and contains basic metadata
+   for the OSD.  This *data directory* includes information about the
+   OSD (its identifier, which cluster it belongs to, and its private
+   keyring).
+#. The rest of the device is normally a large partition that is managed
+   directly by BlueStore and contains all of the actual data.  This
+   *main device* is normally identified by a ``block`` symlink in the
+   data directory.
+
+It is also possible to deploy BlueStore across one or two additional
+devices:
+
+* A *WAL device* can be used for BlueStore's internal journal or
+  write-ahead log.  It is identified by the ``block.wal`` symlink in
+  the data directory.  A WAL device is only useful if it is faster
+  than the primary device (e.g., when it is on an SSD and the primary
+  device is an HDD).
+* A *DB device* can be used for storing BlueStore's internal metadata.
+  BlueStore (or rather, the embedded RocksDB) will put as much
+  metadata as it can on the DB device to improve performance.  If the
+  DB device fills up, metadata will spill back onto the primary device
+  (where it would have been otherwise).  Again, it is only helpful to
+  provision a DB device if it is faster than the primary device.
+
+If there is only a small amount of fast storage available (e.g., less
+than a gigabyte), we recommend using it as a WAL device.  If there is
+more, provisioning a DB device makes more sense.  The BlueStore
+journal will always be placed on the fastest device available, so
+using a DB device will provide the same benefit that the WAL device
+would while *also* allowing additional metadata to be stored there (if
+it will fit).
+
+A single-device BlueStore OSD can be provisioned with::
+
+  ceph-disk prepare --bluestore <device>
+
+To specify a WAL device and/or DB device, ::
+
+  ceph-disk prepare --bluestore <device> --block.wal <wal-device> --block.db <db-device>
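+
+For example, to provision an OSD that uses an HDD as its main device
+and a small partition on a faster SSD for the DB, a sketch might look
+like the following (the device names are hypothetical; substitute your
+own)::
+
+  # /dev/sdb is the HDD used as the main (primary) device
+  # /dev/nvme0n1p1 is a small partition on a faster NVMe SSD for block.db
+  ceph-disk prepare --bluestore /dev/sdb --block.db /dev/nvme0n1p1
+
+Because the journal is always placed on the fastest available device,
+a separate ``--block.wal`` is generally only worth specifying if a
+device even faster than the DB device is available.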
+
+Cache size
+==========
+
+The amount of memory consumed by each OSD for BlueStore's cache is
+determined by the ``bluestore_cache_size`` configuration option.  If
+that config option is not set (i.e., it remains at 0), a different
+default value is used depending on whether an HDD or SSD is used for
+the primary device (set by the ``bluestore_cache_size_ssd`` and
+``bluestore_cache_size_hdd`` config options).
+
+BlueStore and the rest of the Ceph OSD do their best to stick to the
+budgeted memory.  Note that on top of the configured cache size,
+there is also memory consumed by the OSD itself, and generally some
+additional overhead due to memory fragmentation and other allocator
+overhead.
+
+The configured cache memory budget can be used in a few different ways:
+
+* Key/Value metadata (i.e., RocksDB's internal cache)
+* BlueStore metadata
+* BlueStore data (i.e., recently read or written object data)
+
+Cache memory usage is governed by the following options:
+``bluestore_cache_meta_ratio``, ``bluestore_cache_kv_ratio``, and
+``bluestore_cache_kv_max``.  The fraction of the cache devoted to data
+is 1.0 minus the meta and kv ratios.  The memory devoted to kv
+metadata (the RocksDB cache) is capped by ``bluestore_cache_kv_max``
+since our testing indicates there are diminishing returns beyond a
+certain point.
+
+``bluestore_cache_size``
+
+:Description: The amount of memory BlueStore will use for its cache.  If zero, ``bluestore_cache_size_hdd`` or ``bluestore_cache_size_ssd`` will be used instead.
+:Type: Integer
+:Required: Yes
+:Default: ``0``
+
+``bluestore_cache_size_hdd``
+
+:Description: The default amount of memory BlueStore will use for its cache when backed by an HDD.
+:Type: Integer
+:Required: Yes
+:Default: ``1 * 1024 * 1024 * 1024`` (1 GB)
+
+``bluestore_cache_size_ssd``
+
+:Description: The default amount of memory BlueStore will use for its cache when backed by an SSD.
+:Type: Integer
+:Required: Yes
+:Default: ``3 * 1024 * 1024 * 1024`` (3 GB)
+
+``bluestore_cache_meta_ratio``
+
+:Description: The ratio of cache devoted to metadata.
+:Type: Floating point
+:Required: Yes
+:Default: ``.01``
+
+``bluestore_cache_kv_ratio``
+
+:Description: The ratio of cache devoted to key/value data (RocksDB).
+:Type: Floating point
+:Required: Yes
+:Default: ``.99``
+
+``bluestore_cache_kv_max``
+
+:Description: The maximum amount of cache devoted to key/value data (RocksDB).
+:Type: Integer
+:Required: Yes
+:Default: ``512 * 1024 * 1024`` (512 MB)
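+
+For example, a plausible ``ceph.conf`` snippet for SSD-backed OSDs on
+hosts with plenty of RAM might look like the following (the 4 GB and
+512 MB figures are illustrative assumptions, not recommendations)::
+
+  [osd]
+  # override the HDD/SSD defaults with an explicit 4 GB cache per OSD
+  bluestore cache size = 4294967296
+  # cap the RocksDB (key/value) portion of the cache at 512 MB
+  bluestore cache kv max = 536870912
+
+The data portion of the cache is then whatever remains after the meta
+and kv ratios are applied.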
+
+
+Checksums
+=========
+
+BlueStore checksums all metadata and data written to disk.  Metadata
+checksumming is handled by RocksDB and uses `crc32c`.  Data
+checksumming is done by BlueStore and can make use of `crc32c`,
+`xxhash32`, or `xxhash64`.  The default is `crc32c` and should be
+suitable for most purposes.
+
+Full data checksumming does increase the amount of metadata that
+BlueStore must store and manage.  When possible, e.g., when clients
+hint that data is written and read sequentially, BlueStore will
+checksum larger blocks, but in many cases it must store a checksum
+value (usually 4 bytes) for every 4 kilobyte block of data.
+
+It is possible to use a smaller checksum value by truncating the
+checksum to two or one byte, reducing the metadata overhead.  The
+trade-off is that the probability that a random error will not be
+detected is higher with a smaller checksum, going from about one in
+four billion with a 32-bit (4 byte) checksum to one in 65,536 for a
+16-bit (2 byte) checksum or one in 256 for an 8-bit (1 byte) checksum.
+The smaller checksum values can be used by selecting `crc32c_16` or
+`crc32c_8` as the checksum algorithm.
+
+The *checksum algorithm* can be set either via a per-pool
+``csum_type`` property or the global config option.  For example, ::
+
+  ceph osd pool set <pool-name> csum_type <algorithm>
+
+``bluestore_csum_type``
+
+:Description: The default checksum algorithm to use.
+:Type: String
+:Required: Yes
+:Valid Settings: ``none``, ``crc32c``, ``crc32c_16``, ``crc32c_8``, ``xxhash32``, ``xxhash64``
+:Default: ``crc32c``
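+
+For example, to use the truncated 8-bit checksum on a pool named
+``rbd`` (a hypothetical pool name) while leaving the cluster-wide
+default (``bluestore csum type``) at ``crc32c``::
+
+  ceph osd pool set rbd csum_type crc32c_8
+
+Per the probabilities above, this trades a roughly one-in-256 chance
+of missing a random error for one quarter of the per-block checksum
+metadata.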
+
+
 Inline Compression
 ==================
 
-BlueStore supports inline compression using snappy, zlib, or LZ4. Please note,
-the lz4 compression plugin is not distributed in the official release.
+BlueStore supports inline compression using `snappy`, `zlib`, or
+`lz4`.  Please note that the `lz4` compression plugin is not
+distributed in the official release.
+
+Whether data in BlueStore is compressed is determined by a combination
+of the *compression mode* and any hints associated with a write
+operation.  The modes are:
+
+* **none**: Never compress data.
+* **passive**: Do not compress data unless the write operation has a
+  *compressible* hint set.
+* **aggressive**: Compress data unless the write operation has an
+  *incompressible* hint set.
+* **force**: Try to compress data no matter what.
+
+For more information about the *compressible* and *incompressible* IO
+hints, see :doc:`/api/librados/#rados_set_alloc_hint`.
+
+Note that regardless of the mode, if the size of the data chunk is not
+reduced sufficiently the compressed version will not be used and the
+original (uncompressed) data will be stored.  For example, if the
+``bluestore compression required ratio`` is set to ``.7`` then the
+compressed data must be 70% of the size of the original (or smaller).
+
+The *compression mode*, *compression algorithm*, *compression required
+ratio*, *min blob size*, and *max blob size* can be set either via a
+per-pool property or a global config option.  Pool properties can be
+set with::
+
+  ceph osd pool set <pool-name> compression_algorithm <algorithm>
+  ceph osd pool set <pool-name> compression_mode <mode>
+  ceph osd pool set <pool-name> compression_required_ratio <ratio>
+  ceph osd pool set <pool-name> compression_min_blob_size <size>
+  ceph osd pool set <pool-name> compression_max_blob_size <size>
 
 ``bluestore compression algorithm``
 
@@ -33,6 +224,17 @@ the lz4 compression plugin is not distributed in the official release.
 :Valid Settings: ``none``, ``passive``, ``aggressive``, ``force``
 :Default: ``none``
 
+``bluestore compression required ratio``
+
+:Description: The ratio of the size of the data chunk after
+              compression relative to the original size must be at
+              most this value in order for the compressed version to
+              be stored.
+
+:Type: Floating point
+:Required: No
+:Default: .875
+
 ``bluestore compression min blob size``
 
 :Description: Chunks smaller than this are never compressed.