Merge pull request #16765 from liewegas/wip-bluestore-docs
doc/rados/configuration: document bluestore
commit ea96265ed1
@@ -2,11 +2,202 @@

BlueStore Config Reference
==========================

Devices
=======

BlueStore manages either one, two, or (in certain cases) three storage
devices.

In the simplest case, BlueStore consumes a single (primary) storage
device. The storage device is normally partitioned into two parts:

#. A small partition is formatted with XFS and contains basic metadata
   for the OSD. This *data directory* includes information about the
   OSD (its identifier, which cluster it belongs to, and its private
   keyring).
#. The rest of the device is normally a large partition that is
   managed directly by BlueStore and contains all of the actual data.
   This *main device* is normally identified by a ``block`` symlink in
   the data directory.

It is also possible to deploy BlueStore across two additional devices:

* A *WAL device* can be used for BlueStore's internal journal or
  write-ahead log. It is identified by the ``block.wal`` symlink in
  the data directory. It is only useful to use a WAL device if the
  device is faster than the primary device (e.g., when it is on an SSD
  and the primary device is an HDD).
* A *DB device* can be used for storing BlueStore's internal metadata.
  BlueStore (or rather, the embedded RocksDB) will put as much
  metadata as it can on the DB device to improve performance. If the
  DB device fills up, metadata will spill back onto the primary device
  (where it would have been otherwise). Again, it is only helpful to
  provision a DB device if it is faster than the primary device.

If there is only a small amount of fast storage available (e.g., less
than a gigabyte), we recommend using it as a WAL device. If there is
more, provisioning a DB device makes more sense. The BlueStore
journal will always be placed on the fastest device available, so
using a DB device will provide the same benefit that the WAL device
would while *also* allowing additional metadata to be stored there (if
it will fit).

A single-device BlueStore OSD can be provisioned with::

  ceph-disk prepare --bluestore <device>

To specify a WAL device and/or DB device::

  ceph-disk prepare --bluestore <device> --block.wal <wal-device> --block.db <db-device>

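As a concrete sketch (the device paths below are only illustrative;
substitute your own), an OSD whose data lives on a hard drive while
the DB and WAL sit on partitions of a faster NVMe device might be
prepared with::

  ceph-disk prepare --bluestore /dev/sdb --block.wal /dev/nvme0n1p1 --block.db /dev/nvme0n1p2
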
Cache size
==========

The amount of memory consumed by each OSD for BlueStore's cache is
determined by the ``bluestore_cache_size`` configuration option. If
that config option is not set (i.e., remains at 0), a different
default value is used depending on whether an HDD or SSD is used for
the primary device (set by the ``bluestore_cache_size_ssd`` and
``bluestore_cache_size_hdd`` config options).

BlueStore and the rest of the Ceph OSD currently do the best they can
to stick to the budgeted memory. Note that on top of the configured
cache size, there is also memory consumed by the OSD itself, and
generally some overhead due to memory fragmentation and other
allocator overhead.

The configured cache memory budget can be used in a few different ways:

* Key/Value metadata (i.e., RocksDB's internal cache)
* BlueStore metadata
* BlueStore data (i.e., recently read or written object data)

Cache memory usage is governed by the following options:
``bluestore_cache_meta_ratio``, ``bluestore_cache_kv_ratio``, and
``bluestore_cache_kv_max``. The fraction of the cache devoted to data
is 1.0 minus the meta and kv ratios. The memory devoted to kv
metadata (the RocksDB cache) is capped by ``bluestore_cache_kv_max``,
since our testing indicates there are diminishing returns beyond a
certain point.

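For illustration only, a cluster that wanted a fixed 2 GB cache per
OSD regardless of device type could set the size explicitly in the
``[osd]`` section of ``ceph.conf`` (the values below are examples,
not recommendations)::

  [osd]
  bluestore_cache_size = 2147483648
  bluestore_cache_meta_ratio = .01
  bluestore_cache_kv_ratio = .99
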
``bluestore_cache_size``

:Description: The amount of memory BlueStore will use for its cache. If zero, ``bluestore_cache_size_hdd`` or ``bluestore_cache_size_ssd`` will be used instead.
:Type: Integer
:Required: Yes
:Default: ``0``

``bluestore_cache_size_hdd``

:Description: The default amount of memory BlueStore will use for its cache when backed by an HDD.
:Type: Integer
:Required: Yes
:Default: ``1 * 1024 * 1024 * 1024`` (1 GB)

``bluestore_cache_size_ssd``

:Description: The default amount of memory BlueStore will use for its cache when backed by an SSD.
:Type: Integer
:Required: Yes
:Default: ``3 * 1024 * 1024 * 1024`` (3 GB)

``bluestore_cache_meta_ratio``

:Description: The ratio of cache devoted to metadata.
:Type: Floating point
:Required: Yes
:Default: ``.01``

``bluestore_cache_kv_ratio``

:Description: The ratio of cache devoted to key/value data (RocksDB).
:Type: Floating point
:Required: Yes
:Default: ``.99``

``bluestore_cache_kv_max``

:Description: The maximum amount of cache devoted to key/value data (RocksDB).
:Type: Floating point
:Required: Yes
:Default: ``512 * 1024 * 1024`` (512 MB)

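The cache settings an OSD is actually running with can be checked at
runtime through its admin socket; for example (assuming an OSD with
id ``0`` on the local host)::

  ceph daemon osd.0 config get bluestore_cache_size
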
Checksums
=========

BlueStore checksums all metadata and data written to disk. Metadata
checksumming is handled by RocksDB and uses `crc32c`. Data
checksumming is done by BlueStore and can make use of `crc32c`,
`xxhash32`, or `xxhash64`. The default is `crc32c` and should be
suitable for most purposes.

Full data checksumming does increase the amount of metadata that
BlueStore must store and manage. When possible, e.g., when clients
hint that data is written and read sequentially, BlueStore will
checksum larger blocks, but in many cases it must store a checksum
value (usually 4 bytes) for every 4 kilobyte block of data.

It is possible to use a smaller checksum value by truncating the
checksum to two or one byte, reducing the metadata overhead. The
trade-off is that the probability that a random error will not be
detected is higher with a smaller checksum, going from about one in
four billion with a 32-bit (4 byte) checksum to one in 65,536 for a
16-bit (2 byte) checksum or one in 256 for an 8-bit (1 byte) checksum.
The smaller checksum values can be used by selecting `crc32c_16` or
`crc32c_8` as the checksum algorithm.

The *checksum algorithm* can be set either via a per-pool
``csum_type`` property or the global config option. For example::

  ceph osd pool set <pool-name> csum_type <algorithm>

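As a hypothetical example, to store only 16-bit checksums for a pool
named ``rbd`` (trading detection strength for less metadata)::

  ceph osd pool set rbd csum_type crc32c_16
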
``bluestore_csum_type``

:Description: The default checksum algorithm to use.
:Type: String
:Required: Yes
:Valid Settings: ``none``, ``crc32c``, ``crc32c_16``, ``crc32c_8``, ``xxhash32``, ``xxhash64``
:Default: ``crc32c``

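If the default should apply cluster-wide rather than per-pool, the
same choice can be expressed in ``ceph.conf`` (``xxhash32`` is used
here purely as an illustration)::

  [osd]
  bluestore_csum_type = xxhash32
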
Inline Compression
==================

BlueStore supports inline compression using `snappy`, `zlib`, or
`lz4`. Please note that the `lz4` compression plugin is not
distributed in the official release.

Whether data in BlueStore is compressed is determined by a combination
of the *compression mode* and any hints associated with a write
operation. The modes are:

* **none**: Never compress data.
* **passive**: Do not compress data unless the write operation has a
  *compressible* hint set.
* **aggressive**: Compress data unless the write operation has an
  *incompressible* hint set.
* **force**: Try to compress data no matter what.

For more information about the *compressible* and *incompressible* IO
hints, see :doc:`/api/librados/#rados_set_alloc_hint`.

Note that regardless of the mode, if the size of the data chunk is not
reduced sufficiently it will not be used and the original
(uncompressed) data will be stored. For example, if the ``bluestore
compression required ratio`` is set to ``.7`` then the compressed data
must be 70% of the size of the original (or smaller).

The *compression mode*, *compression algorithm*, *compression required
ratio*, *min blob size*, and *max blob size* can be set either via a
per-pool property or a global config option. Pool properties can be
set with::

  ceph osd pool set <pool-name> compression_algorithm <algorithm>
  ceph osd pool set <pool-name> compression_mode <mode>
  ceph osd pool set <pool-name> compression_required_ratio <ratio>
  ceph osd pool set <pool-name> compression_min_blob_size <size>
  ceph osd pool set <pool-name> compression_max_blob_size <size>

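For instance, a pool named ``mypool`` could be made to compress
everything it reasonably can with `snappy` (the pool name and the
choices shown are only an example)::

  ceph osd pool set mypool compression_algorithm snappy
  ceph osd pool set mypool compression_mode aggressive
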
``bluestore compression algorithm``

@@ -33,6 +224,17 @@ the lz4 compression plugin is not distributed in the official release.

:Valid Settings: ``none``, ``passive``, ``aggressive``, ``force``
:Default: ``none``

``bluestore compression required ratio``

:Description: The size of the data chunk after compression, relative
              to the original size, must be at most this ratio in
              order for the compressed version to be stored.

:Type: Floating point
:Required: No
:Default: ``.875``

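As a worked example with the default value: a required ratio of
``.875`` means a 64 KB chunk is stored in compressed form only if it
compresses to at most 64 KB * 0.875 = 56 KB; otherwise the original
(uncompressed) data is written.
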
``bluestore compression min blob size``

:Description: Chunks smaller than this are never compressed.

@@ -1,62 +0,0 @@

===========================================
 Hard Disk and File System Recommendations
===========================================

.. index:: hard drive preparation

Hard Drive Prep
===============

Ceph aims for data safety, which means that when the :term:`Ceph Client`
receives notice that data was written to a storage drive, that data was actually
written to the storage drive. For old kernels (<2.6.33), disable the write cache
if the journal is on a raw drive. Newer kernels should work fine.

Use ``hdparm`` to disable write caching on the hard disk::

  sudo hdparm -W 0 /dev/hda 0

In production environments, we recommend running a :term:`Ceph OSD Daemon` with
separate drives for the operating system and the data. If you run data and an
operating system on a single disk, we recommend creating a separate partition
for your data.

.. index:: filesystems

Filesystems
===========

Ceph OSD Daemons rely heavily upon the stability and performance of the
underlying filesystem.

Recommended
-----------

We currently recommend ``XFS`` for production deployments.

Not recommended
---------------

We recommend *against* using ``btrfs`` due to the lack of a stable
version to test against and frequent bugs in the ENOSPC handling.

We recommend *against* using ``ext4`` due to limitations in the size
of xattrs it can store, and the problems this causes with the way Ceph
handles long RADOS object names. Although these issues will generally
not surface with Ceph clusters using only short object names (e.g., an
RBD workload that does not include long RBD image names), other users
like RGW make extensive use of long object names and can break.

Starting with the Jewel release, the ``ceph-osd`` daemon will refuse
to start if the configured max object name cannot be safely stored on
``ext4``. If the cluster is only being used with short object names
(e.g., RBD only), you can continue using ``ext4`` by setting the
following configuration options::

  osd max object name len = 256
  osd max object namespace len = 64

.. note:: This may result in difficult-to-diagnose errors if you try
          to use RGW or other librados clients that do not properly
          handle or politely surface any resulting ENAMETOOLONG
          errors.

@@ -32,7 +32,7 @@ For general object store configuration, refer to the following:

.. toctree::
   :maxdepth: 1

   Disks and Filesystems <filesystem-recommendations>
   Storage devices <storage-devices>
   ceph-conf

doc/rados/configuration/storage-devices.rst (new file, 83 lines)
@@ -0,0 +1,83 @@

|
||||
=================
|
||||
Storage Devices
|
||||
=================
|
||||
|
||||
There are two Ceph daemons that store data on disk:
|
||||
|
||||
* **Ceph OSDs** (or Object Storage Daemons) are where most of the
|
||||
data is stored in Ceph. Generally speaking, each OSD is backed by
|
||||
a single storage device, like a traditional hard disk (HDD) or
|
||||
solid state disk (SSD). OSDs can also be backed by a combination
|
||||
of devices, like a HDD for most data and an SSD (or partition of an
|
||||
SSD) for some metadata. The number of OSDs in a cluster is
|
||||
generally a function of how much data will be stored, how big each
|
||||
storage device will be, and the level and type of redundancy
|
||||
(replication or erasure coding).
|
||||
* **Ceph Monitor** daemons manage critical cluster state like cluster
|
||||
membership and authentication information. For smaller clusters a
|
||||
few gigabytes is all that is needed, although for larger clusters
|
||||
the monitor database can reach tens or possibly hundreds of
|
||||
gigabytes.
|
||||
|
||||
|
||||
OSD Backends
|
||||
============
|
||||
|
||||
There are two ways that OSDs can manage the data they store. Starting
|
||||
with the Luminous 12.2.z release, the new default (and recommended) backend is
|
||||
*BlueStore*. Prior to Luminous, the default (and only option) was
|
||||
*FileStore*.
|
||||
|
||||
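One way to check which backend an existing OSD is using (assuming an
OSD with id ``0`` and a Luminous-or-later cluster) is to look at its
reported metadata::

  ceph osd metadata 0 | grep osd_objectstore
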
BlueStore
---------

BlueStore is a special-purpose storage backend designed specifically
for managing data on disk for Ceph OSD workloads. It is motivated by
experience supporting and managing OSDs using FileStore over the
last ten years. Key BlueStore features include:

* Direct management of storage devices. BlueStore consumes raw block
  devices or partitions. This avoids any intervening layers of
  abstraction (such as local file systems like XFS) that may limit
  performance or add complexity.
* Metadata management with RocksDB. We embed RocksDB's key/value
  database in order to manage internal metadata, such as the mapping
  from object names to block locations on disk.
* Full data and metadata checksumming. By default all data and
  metadata written to BlueStore is protected by one or more
  checksums. No data or metadata will be read from disk or returned
  to the user without being verified.
* Inline compression. Data written may be optionally compressed
  before being written to disk.
* Multi-device metadata tiering. BlueStore allows its internal
  journal (write-ahead log) to be written to a separate, high-speed
  device (like an SSD, NVMe, or NVDIMM) to increase performance. If
  a significant amount of faster storage is available, internal
  metadata can also be stored on the faster device.
* Efficient copy-on-write. RBD and CephFS snapshots rely on a
  copy-on-write *clone* mechanism that is implemented efficiently in
  BlueStore. This results in efficient IO both for regular snapshots
  and for erasure coded pools (which rely on cloning to implement
  efficient two-phase commits).

For more information, see :doc:`bluestore-config-ref`.

FileStore
---------

FileStore is the legacy approach to storing objects in Ceph. It
relies on a standard file system (normally XFS) in combination with a
key/value database (traditionally LevelDB, now RocksDB) for some
metadata.

FileStore is well-tested and widely used in production but suffers
from many performance deficiencies due to its overall design and
reliance on a traditional file system for storing object data.

Although FileStore is generally capable of functioning on most
POSIX-compatible file systems (including btrfs and ext4), we only
recommend that XFS be used. Both btrfs and ext4 have known bugs and
deficiencies and their use may lead to data loss. By default all Ceph
provisioning tools will use XFS.

For more information, see :doc:`filestore-config-ref`.