==================================
 BlueStore Configuration Reference
==================================

Devices
=======

BlueStore manages either one, two, or in certain cases three storage devices.
These *devices* are "devices" in the Linux/Unix sense. This means that they are
assets listed under ``/dev`` or ``/devices``. Each of these devices may be an
entire storage drive, or a partition of a storage drive, or a logical volume.
BlueStore does not create or mount a conventional file system on devices that
it uses; BlueStore reads and writes to the devices directly in a "raw" fashion.

In the simplest case, BlueStore consumes all of a single storage device. This
device is known as the *primary device*. The primary device is identified by
the ``block`` symlink in the data directory.

The data directory is a ``tmpfs`` mount. When this data directory is booted or
activated by ``ceph-volume``, it is populated with metadata files and links
that hold information about the OSD: for example, the OSD's identifier, the
name of the cluster that the OSD belongs to, and the OSD's private keyring.

In more complicated cases, BlueStore is deployed across one or two additional
devices:

* A *write-ahead log (WAL) device* (identified as ``block.wal`` in the data
  directory) can be used to separate out BlueStore's internal journal or
  write-ahead log. Using a WAL device is advantageous only if the WAL device
  is faster than the primary device (for example, if the WAL device is an SSD
  and the primary device is an HDD).
* A *DB device* (identified as ``block.db`` in the data directory) can be used
  to store BlueStore's internal metadata. BlueStore (or more precisely, the
  embedded RocksDB) will put as much metadata as it can on the DB device in
  order to improve performance. If the DB device becomes full, metadata will
  spill back onto the primary device (where it would have been located in the
  absence of the DB device). Again, it is advantageous to provision a DB device
  only if it is faster than the primary device.

If there is only a small amount of fast storage available (for example, less
than a gigabyte), we recommend using the available space as a WAL device. But
if more fast storage is available, it makes more sense to provision a DB
device. Because the BlueStore journal is always placed on the fastest device
available, using a DB device provides the same benefit that using a WAL device
would, while *also* allowing additional metadata to be stored off the primary
device (provided that it fits). DB devices make this possible because whenever
a DB device is specified but an explicit WAL device is not, the WAL will be
implicitly colocated with the DB on the faster device.

To provision a single-device (colocated) BlueStore OSD, run the following
command:

.. prompt:: bash $

   ceph-volume lvm prepare --bluestore --data <device>

To specify a WAL device, a DB device, or both, run the following command:

.. prompt:: bash $

   ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device>

.. note:: The option ``--data`` can take as its argument any of the
   following devices: logical volumes specified using *vg/lv* notation,
   existing logical volumes, and GPT partitions.


Provisioning strategies
-----------------------

BlueStore differs from Filestore in that there are several ways to deploy a
BlueStore OSD. However, the overall deployment strategy for BlueStore can be
clarified by examining just these two common arrangements:

.. _bluestore-single-type-device-config:

**block (data) only**
^^^^^^^^^^^^^^^^^^^^^
If all devices are of the same type (for example, they are all HDDs), and if
there are no fast devices available for the storage of metadata, then it makes
sense to specify the block device only and to leave ``block.db`` and
``block.wal`` unseparated. The :ref:`ceph-volume-lvm` command for a single
``/dev/sda`` device is as follows:

.. prompt:: bash $

   ceph-volume lvm create --bluestore --data /dev/sda

If the devices to be used for a BlueStore OSD are pre-created logical volumes,
then the :ref:`ceph-volume-lvm` call for a logical volume named
``ceph-vg/block-lv`` is as follows:

.. prompt:: bash $

   ceph-volume lvm create --bluestore --data ceph-vg/block-lv

.. _bluestore-mixed-device-config:

**block and block.db**
^^^^^^^^^^^^^^^^^^^^^^

If you have a mix of fast and slow devices (for example, SSDs and HDDs), then
we recommend placing ``block.db`` on the faster device while ``block`` (that
is, the data) is stored on the slower device (that is, the rotational drive).

You must create these volume groups and logical volumes manually, because the
``ceph-volume`` tool is currently unable to create them automatically.

The following procedure illustrates the manual creation of volume groups and
logical volumes. For this example, we shall assume four rotational drives
(``sda``, ``sdb``, ``sdc``, and ``sdd``) and one (fast) SSD (``sdx``). First,
to create the volume groups, run the following commands:

.. prompt:: bash $

   vgcreate ceph-block-0 /dev/sda
   vgcreate ceph-block-1 /dev/sdb
   vgcreate ceph-block-2 /dev/sdc
   vgcreate ceph-block-3 /dev/sdd

Next, to create the logical volumes for ``block``, run the following commands:

.. prompt:: bash $

   lvcreate -l 100%FREE -n block-0 ceph-block-0
   lvcreate -l 100%FREE -n block-1 ceph-block-1
   lvcreate -l 100%FREE -n block-2 ceph-block-2
   lvcreate -l 100%FREE -n block-3 ceph-block-3

Because there are four HDDs, there will be four OSDs. Supposing that there is a
200GB SSD in ``/dev/sdx``, we can create four 50GB logical volumes by running
the following commands:

.. prompt:: bash $

   vgcreate ceph-db-0 /dev/sdx
   lvcreate -L 50GB -n db-0 ceph-db-0
   lvcreate -L 50GB -n db-1 ceph-db-0
   lvcreate -L 50GB -n db-2 ceph-db-0
   lvcreate -L 50GB -n db-3 ceph-db-0

Finally, to create the four OSDs, run the following commands:

.. prompt:: bash $

   ceph-volume lvm create --bluestore --data ceph-block-0/block-0 --block.db ceph-db-0/db-0
   ceph-volume lvm create --bluestore --data ceph-block-1/block-1 --block.db ceph-db-0/db-1
   ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2
   ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3

After this procedure is finished, there should be four OSDs, ``block`` should
be on the four HDDs, and each HDD should have a 50GB logical volume
(specifically, a DB device) on the shared SSD.

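The per-drive commands in the mixed-device procedure follow a regular pattern,
so they can be generated by a loop. The following is a minimal sketch, not an
official tool: the function name ``print_mixed_osd_commands`` is hypothetical,
the device names and the 50GB size are the ones assumed in the example, and the
function only prints the commands so that they can be reviewed before they are
run.

.. code-block:: bash

   #!/usr/bin/env bash
   # Hypothetical helper: print (do not run) the LVM and ceph-volume commands
   # for one shared DB device and any number of data devices.
   print_mixed_osd_commands() {
       local db_dev=$1 db_size=$2
       shift 2
       local i=0 dev
       # One volume group on the fast device holds every DB logical volume.
       echo "vgcreate ceph-db-0 ${db_dev}"
       for dev in "$@"; do
           echo "vgcreate ceph-block-${i} ${dev}"
           echo "lvcreate -l 100%FREE -n block-${i} ceph-block-${i}"
           echo "lvcreate -L ${db_size} -n db-${i} ceph-db-0"
           echo "ceph-volume lvm create --bluestore --data ceph-block-${i}/block-${i} --block.db ceph-db-0/db-${i}"
           i=$((i + 1))
       done
   }

   print_mixed_osd_commands /dev/sdx 50GB /dev/sda /dev/sdb /dev/sdc /dev/sdd

The printed commands are grouped per drive rather than per step, but each
volume group is still created before the logical volumes that depend on it.
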
Sizing
======
When using a :ref:`mixed spinning-and-solid-drive setup
<bluestore-mixed-device-config>`, it is important to make a large enough
``block.db`` logical volume for BlueStore. The logical volumes associated with
``block.db`` should be *as large as possible*.

It is generally recommended that the size of ``block.db`` be somewhere between
1% and 4% of the size of ``block``. For RGW workloads, it is recommended that
the ``block.db`` be at least 4% of the ``block`` size, because RGW makes heavy
use of ``block.db`` to store metadata (in particular, omap keys). For example,
if the ``block`` size is 1TB, then ``block.db`` should have a size of at least
40GB. For RBD workloads, however, ``block.db`` usually needs no more than 1% to
2% of the ``block`` size.

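The percentage guidelines can be turned into a quick calculation. The function
name ``db_size_gb`` below is hypothetical and the sketch assumes decimal units
(1TB = 1000GB), which matches the 1TB / 40GB example; the result is rounded
down to a whole number of GB.

.. code-block:: bash

   # Hypothetical helper: size of block.db in GB for a given block size
   # (in GB) and a percentage of that size. Integer arithmetic, rounded down.
   db_size_gb() {
       local block_gb=$1 percent=$2
       echo $(( block_gb * percent / 100 ))
   }

   db_size_gb 1000 4   # 1TB block at the 4% guideline for RGW
   db_size_gb 1000 1   # 1TB block at the 1% lower bound
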
In older releases, internal level sizes are such that the DB can fully utilize
only those specific partition / logical volume sizes that correspond to sums of
L0, L0+L1, L1+L2, and so on: that is, given default settings, sizes of roughly
3GB, 30GB, 300GB, and so on. Most deployments do not substantially benefit from
sizing that accommodates L3 and higher, though DB compaction can be facilitated
by doubling these figures to 6GB, 60GB, and 600GB.

Improvements in Nautilus 14.2.12, Octopus 15.2.6, and subsequent releases allow
for better utilization of arbitrarily sized DB devices. Moreover, the Pacific
release brings experimental dynamic-level support. Because of these advances,
users of older releases might want to plan ahead by provisioning larger DB
devices today so that the benefits of scale can be realized when upgrades are
made in the future.

When *not* using a mix of fast and slow devices, there is no requirement to
create separate logical volumes for ``block.db`` or ``block.wal``. BlueStore
will automatically colocate these devices within the space of ``block``.

Automatic Cache Sizing
======================

BlueStore can be configured to automatically resize its caches, provided that
certain conditions are met: TCMalloc must be configured as the memory allocator
and the ``bluestore_cache_autotune`` configuration option must be enabled (note
that it is currently enabled by default). When automatic cache sizing is in
effect, BlueStore attempts to keep OSD heap-memory usage under a certain target
size (as determined by ``osd_memory_target``). This approach makes use of a
best-effort algorithm, and caches do not shrink below the size defined by the
value of ``osd_memory_cache_min``. Cache ratios are selected in accordance
with a hierarchy of priorities. But if priority information is not available,
the values specified in the ``bluestore_cache_meta_ratio`` and
``bluestore_cache_kv_ratio`` options are used as fallback cache ratios.

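Because ``osd_memory_target`` is expressed in bytes, it can help to convert
from GiB before setting it. The sketch below only prints a ``ceph config set``
command for review; the helper name ``gib_to_bytes`` is hypothetical and the
6 GiB figure is an illustration, not a recommendation.

.. code-block:: bash

   # Hypothetical helper: convert a whole number of GiB to bytes.
   gib_to_bytes() {
       echo $(( $1 * 1024 * 1024 * 1024 ))
   }

   # Print (do not run) the command that would set a 6 GiB target for all
   # OSDs through the central configuration database.
   echo "ceph config set osd osd_memory_target $(gib_to_bytes 6)"
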
.. confval:: bluestore_cache_autotune
.. confval:: osd_memory_target
.. confval:: bluestore_cache_autotune_interval
.. confval:: osd_memory_base
.. confval:: osd_memory_expected_fragmentation
.. confval:: osd_memory_cache_min
.. confval:: osd_memory_cache_resize_interval


Manual Cache Sizing
===================

The amount of memory that each OSD uses for its BlueStore cache is determined
by the ``bluestore_cache_size`` configuration option. If that option has not
been specified (that is, if it remains at 0), then Ceph uses a different
configuration option to determine the default memory budget:
``bluestore_cache_size_hdd`` if the primary device is an HDD, or
``bluestore_cache_size_ssd`` if the primary device is an SSD.

BlueStore and the rest of the Ceph OSD daemon make every effort to work within
this memory budget. Note that in addition to the configured cache size, there
is also memory consumed by the OSD itself. There is additional utilization due
to memory fragmentation and other allocator overhead.

The configured cache-memory budget can be used to store the following types of
data:

* Key/Value metadata (that is, RocksDB's internal cache)
* BlueStore metadata
* BlueStore data (that is, recently read or recently written object data)

Cache memory usage is governed by the configuration options
``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio``. The amount of
cache that is reserved for data is governed by both the effective BlueStore
cache size (which depends on the relevant ``bluestore_cache_size[_ssd|_hdd]``
option and the device class of the primary device) and the "meta" and "kv"
ratios. This amount can be calculated with the following formula:
``<effective_cache_size> * (1 - bluestore_cache_meta_ratio -
bluestore_cache_kv_ratio)``.

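As a worked instance of the data-cache formula, the sketch below uses whole
percentages so that plain shell integer arithmetic is enough. The function name
``data_cache_bytes`` is hypothetical, and the cache size and ratios are
illustrative values rather than statements about the defaults of any release.

.. code-block:: bash

   # Bytes of cache left for object data, given the effective cache size in
   # bytes and the meta and kv ratios expressed as whole percentages.
   data_cache_bytes() {
       local cache_bytes=$1 meta_pct=$2 kv_pct=$3
       echo $(( cache_bytes * (100 - meta_pct - kv_pct) / 100 ))
   }

   # Illustrative values only: a 3 GiB cache with 45% meta and 45% kv
   # leaves 10% of the cache for data.
   data_cache_bytes 3221225472 45 45
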
.. confval:: bluestore_cache_size
.. confval:: bluestore_cache_size_hdd
.. confval:: bluestore_cache_size_ssd
.. confval:: bluestore_cache_meta_ratio
.. confval:: bluestore_cache_kv_ratio

Checksums
=========

BlueStore checksums all metadata and all data written to disk. Metadata
checksumming is handled by RocksDB and uses the ``crc32c`` algorithm. By
contrast, data checksumming is handled by BlueStore and can use any of
``crc32c``, ``xxhash32``, or ``xxhash64``. ``crc32c`` is the default checksum
algorithm and it is suitable for most purposes.

Full data checksumming increases the amount of metadata that BlueStore must
store and manage. Whenever possible (for example, when clients hint that data
is written and read sequentially), BlueStore will checksum larger blocks. In
many cases, however, it must store a checksum value (usually 4 bytes) for every
4 KB block of data.

It is possible to obtain a smaller checksum value by truncating the checksum to
one or two bytes, thereby reducing the metadata overhead. A drawback of this
approach is that it increases the probability of a random error going
undetected: about one in four billion given a 32-bit (4 byte) checksum, 1 in
65,536 given a 16-bit (2 byte) checksum, and 1 in 256 given an 8-bit (1 byte)
checksum. To use the smaller checksum values, select ``crc32c_16`` or
``crc32c_8`` as the checksum algorithm.

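The trade-off between checksum size and metadata overhead can be made concrete
with a little arithmetic. The sketch below assumes the worst case described
above (one checksum per 4 KB block); the function names are hypothetical and
the 1 TiB data size is only an example.

.. code-block:: bash

   # Checksum metadata (in bytes) needed for data_bytes of data when one
   # checksum of csum_bytes is stored per block of block_bytes.
   csum_overhead_bytes() {
       local data_bytes=$1 csum_bytes=$2 block_bytes=$3
       echo $(( data_bytes / block_bytes * csum_bytes ))
   }

   # Number of distinct values of a checksum that is csum_bytes wide. A
   # random error goes undetected with a probability of about 1 in this
   # number.
   csum_values() {
       echo $(( 1 << ($1 * 8) ))
   }

   one_tib=$(( 1024 * 1024 * 1024 * 1024 ))
   csum_overhead_bytes "$one_tib" 4 4096   # 4-byte checksums
   csum_overhead_bytes "$one_tib" 1 4096   # 1-byte checksums (crc32c_8)
   csum_values 4
   csum_values 2
   csum_values 1

Under these assumptions, 1 TiB of data carries 1 GiB of 4-byte checksums, and
truncating to one byte reduces that to 256 MiB.
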
The *checksum algorithm* can be specified either via a per-pool ``csum_type``
configuration option or via the global configuration option. For example:

.. prompt:: bash $

   ceph osd pool set <pool-name> csum_type <algorithm>

.. confval:: bluestore_csum_type

Inline Compression
==================

BlueStore supports inline compression using ``snappy``, ``zlib``, ``lz4``, or
``zstd``.

Whether data in BlueStore is compressed is determined by two factors: (1) the
*compression mode* and (2) any client hints associated with a write operation.
The compression modes are as follows:

* **none**: Never compress data.
* **passive**: Do not compress data unless the write operation has a
  *compressible* hint set.
* **aggressive**: Do compress data unless the write operation has an
  *incompressible* hint set.
* **force**: Try to compress data no matter what.

For more information about the *compressible* and *incompressible* I/O hints,
see :c:func:`rados_set_alloc_hint`.

Note that data in BlueStore will be compressed only if the data chunk will be
sufficiently reduced in size (as determined by the
``bluestore_compression_required_ratio`` setting). No matter which compression
mode is in use, if the compressed chunk is too big, then it will be discarded
and the original (uncompressed) data will be stored instead. For example, if
``bluestore_compression_required_ratio`` is set to ``.7``, then data
compression will take place only if the size of the compressed data is no more
than 70% of the size of the original data.

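The required-ratio rule amounts to a single comparison. The following sketch
is a simplified model of that rule, not BlueStore's actual code path (which may
differ in details such as rounding to allocation units); the function name
``keep_compressed`` is hypothetical and the ratio is passed as a whole
percentage.

.. code-block:: bash

   # Succeed (exit status 0) if the compressed chunk is small enough to
   # keep. Sizes are in bytes; the required ratio is a whole percentage.
   keep_compressed() {
       local orig_bytes=$1 comp_bytes=$2 ratio_pct=$3
       [ $(( comp_bytes * 100 )) -le $(( orig_bytes * ratio_pct )) ]
   }

   # A 64 KiB chunk with a required ratio of 70%.
   keep_compressed 65536 45875 70 && echo "store compressed"
   keep_compressed 65536 45876 70 || echo "store uncompressed"
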
The *compression mode*, *compression algorithm*, *compression required ratio*,
*min blob size*, and *max blob size* settings can be specified either via a
per-pool property or via a global config option. To specify pool properties,
run the following commands:

.. prompt:: bash $

   ceph osd pool set <pool-name> compression_algorithm <algorithm>
   ceph osd pool set <pool-name> compression_mode <mode>
   ceph osd pool set <pool-name> compression_required_ratio <ratio>
   ceph osd pool set <pool-name> compression_min_blob_size <size>
   ceph osd pool set <pool-name> compression_max_blob_size <size>

.. confval:: bluestore_compression_algorithm
.. confval:: bluestore_compression_mode
.. confval:: bluestore_compression_required_ratio
.. confval:: bluestore_compression_min_blob_size
.. confval:: bluestore_compression_min_blob_size_hdd
.. confval:: bluestore_compression_min_blob_size_ssd
.. confval:: bluestore_compression_max_blob_size
.. confval:: bluestore_compression_max_blob_size_hdd
.. confval:: bluestore_compression_max_blob_size_ssd

.. _bluestore-rocksdb-sharding:

RocksDB Sharding
================

BlueStore maintains several types of internal key-value data, all of which are
stored in RocksDB. Each data type in BlueStore is assigned a unique prefix.
Prior to the Pacific release, all key-value data was stored in a single RocksDB
column family: 'default'. In Pacific and later releases, however, BlueStore can
divide key-value data into several RocksDB column families. BlueStore achieves
better caching and more precise compaction when keys are similar: specifically,
when keys have similar access frequency, similar modification frequency, and a
similar lifetime. Under such conditions, performance is improved and less disk
space is required during compaction (because each column family is smaller and
is able to compact independently of the others).

OSDs deployed in Pacific or later releases use RocksDB sharding by default.
However, if Ceph has been upgraded to Pacific or a later version from a
previous version, sharding is disabled on any OSDs that were created before
Pacific.

To enable sharding and apply the Pacific defaults to a specific OSD, stop the
OSD and run the following command:

.. prompt:: bash #

   ceph-bluestore-tool \
    --path <data path> \
    --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \
    reshard

.. confval:: bluestore_rocksdb_cf
.. confval:: bluestore_rocksdb_cfs

Throttling
==========

.. confval:: bluestore_throttle_bytes
.. confval:: bluestore_throttle_deferred_bytes
.. confval:: bluestore_throttle_cost_per_io
.. confval:: bluestore_throttle_cost_per_io_hdd
.. confval:: bluestore_throttle_cost_per_io_ssd

SPDK Usage
==========

To use the SPDK driver for NVMe devices, you must first prepare your system.
See `SPDK document`__.

.. __: http://www.spdk.io/doc/getting_started.html#getting_started_examples

SPDK offers a script that will configure the device automatically. Run this
script with root permissions:

.. prompt:: bash $

   sudo src/spdk/scripts/setup.sh

You will need to specify the subject NVMe device's device selector with the
"spdk:" prefix for ``bluestore_block_path``.

In the following example, you first find the device selector of an Intel NVMe
SSD by running the following command:

.. prompt:: bash $

   lspci -mm -n -D -d 8086:0953

The form of the device selector is either ``DDDD:BB:DD.FF`` or
``DDDD.BB.DD.FF``.

Next, supposing that ``0000:01:00.0`` is the device selector found in the
output of the ``lspci`` command, you can specify the device selector by adding
the following setting::

  bluestore_block_path = "spdk:trtype:pcie traddr:0000:01:00.0"

You may also specify a remote NVMeoF target over the TCP transport, as in the
following example::

  bluestore_block_path = "spdk:trtype:tcp traddr:10.67.110.197 trsvcid:4420 subnqn:nqn.2019-02.io.spdk:cnode1"

To run multiple SPDK instances per node, you must make sure each instance uses
its own DPDK memory by specifying for each instance the amount of DPDK memory
(in MB) that the instance will use.

In most cases, a single device can be used for data, DB, and WAL. We describe
this strategy as *colocating* these components. Be sure to enter the settings
below to ensure that all I/Os are issued through SPDK::

  bluestore_block_db_path = ""
  bluestore_block_db_size = 0
  bluestore_block_wal_path = ""
  bluestore_block_wal_size = 0

If these settings are not entered, then the current implementation will
populate the SPDK map files with kernel file system symbols and will use the
kernel driver to issue DB/WAL I/Os.

Minimum Allocation Size
=======================

There is a configured minimum amount of storage that BlueStore allocates on an
underlying storage device. In practice, this is the least amount of capacity
that even a tiny RADOS object can consume on each OSD's primary device. The
configuration option in question, :confval:`bluestore_min_alloc_size`, derives
its value from the value of either :confval:`bluestore_min_alloc_size_hdd` or
:confval:`bluestore_min_alloc_size_ssd`, depending on the OSD's ``rotational``
attribute. Thus if an OSD is created on an HDD, BlueStore is initialized with
the current value of :confval:`bluestore_min_alloc_size_hdd`; but with SSD OSDs
(including NVMe devices), BlueStore is initialized with the current value of
:confval:`bluestore_min_alloc_size_ssd`.

In Mimic and earlier releases, the default values were 64KB for rotational
media (HDD) and 16KB for non-rotational media (SSD). The Octopus release
changed the default value for non-rotational media (SSD) to 4KB, and the
Pacific release changed the default value for rotational media (HDD) to 4KB.

These changes were driven by space amplification that was experienced by Ceph
RADOS Gateway (RGW) deployments that hosted large numbers of small files
(S3/Swift objects).

For example, when an RGW client stores a 1 KB S3 object, that object is written
to a single RADOS object. In accordance with the default
:confval:`bluestore_min_alloc_size` value, 4 KB of underlying drive space is
allocated. This means that roughly 3 KB (that is, 4 KB minus 1 KB) is allocated
but never used: this corresponds to 300% overhead or 25% efficiency. Similarly,
a 5 KB user object occupies two 4 KB allocation units (8 KB in total), with the
result that roughly 3 KB of device capacity is stranded. In this case, however,
the overhead percentage is much smaller. Think of this in terms of the
remainder from a modulus operation. The overhead *percentage* thus decreases
rapidly as object size increases.

There is an additional subtlety that is easily missed: the amplification
phenomenon just described takes place for *each* replica. For example, when
using the default of three copies of data (3R), a 1 KB S3 object actually
strands roughly 9 KB of storage device capacity. If erasure coding (EC) is used
instead of replication, the amplification might be even higher: for a ``k=4,
m=2`` pool, our 1 KB S3 object allocates 24 KB (that is, 4 KB multiplied by 6)
of device capacity.

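The figures in the replication and erasure-coding examples can be reproduced
with a few lines of shell arithmetic. This is a simplified model: the function
name ``allocated_bytes`` is hypothetical, and the erasure-coding line assumes
only that each of the ``k+m`` shards occupies at least one allocation unit,
ignoring any stripe-unit padding.

.. code-block:: bash

   # Bytes allocated on one device for an object of size_bytes, rounded up
   # to a whole number of allocation units.
   allocated_bytes() {
       local size_bytes=$1 min_alloc=$2
       echo $(( (size_bytes + min_alloc - 1) / min_alloc * min_alloc ))
   }

   min_alloc=4096
   obj=1024

   one_copy=$(allocated_bytes "$obj" "$min_alloc")
   echo "stranded per copy:       $(( one_copy - obj ))"
   echo "stranded with 3 copies:  $(( 3 * (one_copy - obj) ))"

   # k=4, m=2: the object is split into 4 data shards plus 2 coding shards.
   shard=$(allocated_bytes $(( obj / 4 )) "$min_alloc")
   echo "allocated with k=4, m=2: $(( (4 + 2) * shard ))"

The three results are 3072, 9216, and 24576 bytes, which correspond to the
3 KB, 9 KB, and 24 KB figures in the text.
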
When an RGW bucket pool contains many relatively large user objects, the effect
of this phenomenon is often negligible. However, with deployments that can
expect a significant fraction of relatively small user objects, the effect
should be taken into consideration.

The 4KB default value aligns well with conventional HDD and SSD devices.
However, certain novel coarse-IU (Indirection Unit) QLC SSDs perform and wear
best when :confval:`bluestore_min_alloc_size_ssd` is specified at OSD creation
to match the device's IU: this might be 8KB, 16KB, or even 64KB. These novel
storage drives can achieve read performance that is competitive with that of
conventional TLC SSDs and write performance that is faster than that of HDDs,
with higher density and lower cost than TLC SSDs.

Note that when creating OSDs on these novel devices, one must be careful to
apply the non-default value only to appropriate devices, and not to
conventional HDD and SSD devices. Errors can be avoided through careful
ordering of OSD creation, with custom OSD device classes, and especially by the
use of central configuration *masks*.

In Quincy and later releases, you can use the
:confval:`bluestore_use_optimal_io_size_for_min_alloc_size` option to allow
automatic discovery of the correct value as each OSD is created. Note that the
use of ``bcache``, ``OpenCAS``, ``dmcrypt``, ``ATA over Ethernet``, ``iSCSI``,
or other device-layering and abstraction technologies might confound the
determination of correct values. Moreover, OSDs deployed on top of VMware
storage have sometimes been found to report a ``rotational`` attribute that
does not match the underlying hardware.

We suggest inspecting such OSDs at startup via logs and admin sockets in order
to ensure that their behavior is correct. Be aware that this kind of inspection
might not work as expected with older kernels. To check for this issue,
examine the presence and value of ``/sys/block/<drive>/queue/optimal_io_size``.

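One way to examine these kernel-reported values for every drive at once is a
short loop over sysfs. This is a sketch rather than a Ceph tool: the function
name ``show_io_hints`` is hypothetical, and the sysfs root is a parameter only
so that the function can be exercised against a test directory.

.. code-block:: bash

   # Print the rotational flag and the optimal I/O size that the kernel
   # reports for each block device. Missing files yield empty values.
   show_io_hints() {
       local sysfs_root=${1:-/sys/block} dev q
       for q in "$sysfs_root"/*/queue; do
           [ -d "$q" ] || continue
           dev=$(basename "$(dirname "$q")")
           echo "$dev rotational=$(cat "$q/rotational" 2>/dev/null) optimal_io_size=$(cat "$q/optimal_io_size" 2>/dev/null)"
       done
   }

   show_io_hints

An empty ``optimal_io_size`` value in the output means that the file is absent
for that drive.
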
.. note:: When running Reef or a later Ceph release, the ``min_alloc_size``
   baked into each OSD is conveniently reported by ``ceph osd metadata``.

To inspect a specific OSD, run the following command:

.. prompt:: bash #

   ceph osd metadata osd.1701 | grep -E 'rotational|alloc'

This space amplification might manifest as an unusually high ratio of raw to
stored data as reported by ``ceph df``. There might also be ``%USE`` / ``VAR``
values reported by ``ceph osd df`` that are unusually high in comparison to
other, ostensibly identical, OSDs. Finally, there might be unexpected balancer
behavior in pools that use OSDs that have mismatched ``min_alloc_size`` values.

This BlueStore attribute takes effect *only* at OSD creation; if the attribute
is changed later, a specific OSD's behavior will not change unless and until
the OSD is destroyed and redeployed with the appropriate option value(s).
Upgrading to a later Ceph release will *not* change the value used by OSDs that
were deployed under older releases or with other settings.

.. confval:: bluestore_min_alloc_size
.. confval:: bluestore_min_alloc_size_hdd
.. confval:: bluestore_min_alloc_size_ssd
.. confval:: bluestore_use_optimal_io_size_for_min_alloc_size

DSA (Data Streaming Accelerator) Usage
======================================

If you want to use the DML library to drive the DSA device for offloading
read/write operations on persistent memory (PMEM) in BlueStore, you need to
install `DML`_ and the `idxd-config`_ library. This will work only on machines
that have an SPR (Sapphire Rapids) CPU.

.. _dml: https://github.com/intel/dml
.. _idxd-config: https://github.com/intel/idxd-config

After installing the DML software, configure the shared work queues (WQs) with
reference to the following WQ configuration example:

.. prompt:: bash $

   accel-config config-wq --group-id=1 --mode=shared --wq-size=16 --threshold=15 --type=user --name="myapp1" --priority=10 --block-on-fault=1 dsa0/wq0.1
   accel-config config-engine dsa0/engine0.1 --group-id=1
   accel-config enable-device dsa0
   accel-config enable-wq dsa0/wq0.1