==========================
BlueStore Config Reference
==========================

Devices
=======

BlueStore manages either one, two, or (in certain cases) three storage
devices.

In the simplest case, BlueStore consumes a single (primary) storage device.
The storage device is normally used as a whole, occupying the full device that
is managed directly by BlueStore. This *primary device* is normally identified
by a ``block`` symlink in the data directory.

The data directory is a ``tmpfs`` mount which gets populated (at boot time, or
when ``ceph-volume`` activates it) with all the common OSD files that hold
information about the OSD, such as its identifier, which cluster it belongs to,
and its private keyring.

It is also possible to deploy BlueStore across one or two additional devices:

* A *write-ahead log (WAL) device* (identified as ``block.wal`` in the data
  directory) can be used for BlueStore's internal journal or write-ahead log.
  A WAL device is only useful if it is faster than the primary device (e.g.,
  when it is on an SSD and the primary device is an HDD).
* A *DB device* (identified as ``block.db`` in the data directory) can be used
  for storing BlueStore's internal metadata. BlueStore (or rather, the
  embedded RocksDB) will put as much metadata as it can on the DB device to
  improve performance. If the DB device fills up, metadata will spill back
  onto the primary device (where it would have been otherwise). Again, it is
  only helpful to provision a DB device if it is faster than the primary
  device.

If there is only a small amount of fast storage available (e.g., less
than a gigabyte), we recommend using it as a WAL device. If there is
more, provisioning a DB device makes more sense. The BlueStore
journal will always be placed on the fastest device available, so
using a DB device will provide the same benefit that the WAL device
would while *also* allowing additional metadata to be stored there (if
it will fit). This means that if a DB device is specified but an explicit
WAL device is not, the WAL will be implicitly colocated with the DB on the faster
device.

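
The rule of thumb above can be written down as a small decision helper. This is only a sketch: the function name and the 1 GiB cut-off are assumptions taken from the "less than a gigabyte" guidance, not values that Ceph enforces.

```python
GIB = 1024 ** 3


def choose_fast_device_role(fast_bytes, wal_only_threshold=GIB):
    """Suggest how to use the fast storage available to one OSD.

    The threshold is a rule of thumb from the text, not a Ceph setting.
    """
    if fast_bytes <= 0:
        # No fast storage: WAL and DB stay colocated on the primary device.
        return "colocated"
    if fast_bytes < wal_only_threshold:
        # Too small to hold much metadata, but it can still hold the WAL.
        return "block.wal"
    # Large enough for metadata; the WAL is implicitly colocated with the DB.
    return "block.db"


role_small = choose_fast_device_role(512 * 1024 ** 2)
role_large = choose_fast_device_role(50 * GIB)
```

With these inputs, 512 MiB of fast storage maps to a WAL device and 50 GiB maps to a DB device.
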
A single-device (colocated) BlueStore OSD can be provisioned with::

  ceph-volume lvm prepare --bluestore --data <device>

To specify a WAL device and/or a DB device::

  ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device>

.. note:: ``--data`` can be a Logical Volume using *vg/lv* notation. Other
          devices can be existing logical volumes or GPT partitions.

Provisioning strategies
-----------------------

Although there are multiple ways to deploy a BlueStore OSD (unlike Filestore
which had just one), there are two common arrangements that should help clarify
the deployment strategy:

.. _bluestore-single-type-device-config:

**block (data) only**
^^^^^^^^^^^^^^^^^^^^^

If all devices are the same type, for example all rotational drives, and
there are no fast devices to use for metadata, it makes sense to specify the
block device only and to not separate ``block.db`` or ``block.wal``. The
:ref:`ceph-volume-lvm` command for a single ``/dev/sda`` device looks like::

  ceph-volume lvm create --bluestore --data /dev/sda

If logical volumes have already been created for each device (a single LV
using 100% of the device), then the :ref:`ceph-volume-lvm` call for an LV named
``ceph-vg/block-lv`` would look like::

  ceph-volume lvm create --bluestore --data ceph-vg/block-lv

.. _bluestore-mixed-device-config:

**block and block.db**
^^^^^^^^^^^^^^^^^^^^^^

If you have a mix of fast and slow devices (SSD / NVMe and rotational),
it is recommended to place ``block.db`` on the faster device while ``block``
(data) lives on the slower (spinning) drive.

You must create these volume groups and logical volumes manually as
the ``ceph-volume`` tool is currently not able to do so automatically.

For the example below, let us assume four rotational drives (``sda``, ``sdb``,
``sdc``, and ``sdd``) and one fast solid state drive (``sdx``). First create
the volume groups::

  $ vgcreate ceph-block-0 /dev/sda
  $ vgcreate ceph-block-1 /dev/sdb
  $ vgcreate ceph-block-2 /dev/sdc
  $ vgcreate ceph-block-3 /dev/sdd

Now create the logical volumes for ``block``::

  $ lvcreate -l 100%FREE -n block-0 ceph-block-0
  $ lvcreate -l 100%FREE -n block-1 ceph-block-1
  $ lvcreate -l 100%FREE -n block-2 ceph-block-2
  $ lvcreate -l 100%FREE -n block-3 ceph-block-3

We are creating 4 OSDs for the four slow spinning devices, so assuming a 200GB
SSD in ``/dev/sdx`` we will create 4 logical volumes, each of 50GB::

  $ vgcreate ceph-db-0 /dev/sdx
  $ lvcreate -L 50GB -n db-0 ceph-db-0
  $ lvcreate -L 50GB -n db-1 ceph-db-0
  $ lvcreate -L 50GB -n db-2 ceph-db-0
  $ lvcreate -L 50GB -n db-3 ceph-db-0

Finally, create the 4 OSDs with ``ceph-volume``::

  $ ceph-volume lvm create --bluestore --data ceph-block-0/block-0 --block.db ceph-db-0/db-0
  $ ceph-volume lvm create --bluestore --data ceph-block-1/block-1 --block.db ceph-db-0/db-1
  $ ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2
  $ ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3

These operations should end up creating four OSDs, each with ``block`` on one
of the slower rotational drives and a 50 GB ``block.db`` logical volume on the
solid state drive.

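
For hosts with a different number of drives, the same layout can be generated rather than typed by hand. The helper below is a hypothetical sketch, not part of ``ceph-volume``. It reuses the volume group and LV names from the example above, only builds the command strings, and assumes the SSD is shared equally. It does not reserve room for LVM metadata, so a real volume group may need slightly smaller LVs.

```python
def plan_osds(ssd_gb, data_devices):
    """Return (db_gb, commands) for one DB logical volume per data device."""
    db_gb = ssd_gb // len(data_devices)  # equal share of the SSD per OSD
    commands = []
    # One DB logical volume per OSD, all carved out of the ceph-db-0 VG.
    for i, _dev in enumerate(data_devices):
        commands.append(f"lvcreate -L {db_gb}GB -n db-{i} ceph-db-0")
    # One OSD per data LV, each paired with its DB LV.
    for i, _dev in enumerate(data_devices):
        commands.append(
            "ceph-volume lvm create --bluestore "
            f"--data ceph-block-{i}/block-{i} --block.db ceph-db-0/db-{i}"
        )
    return db_gb, commands


db_gb, commands = plan_osds(200, ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"])
```

For the 200GB SSD and four drives used above, this yields 50GB per DB volume and the same eight commands shown in the walkthrough.
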
Sizing
======

When using a :ref:`mixed spinning and solid drive setup
<bluestore-mixed-device-config>` it is important to make a large enough
``block.db`` logical volume for BlueStore. Generally, the ``block.db`` logical
volumes should be *as large as possible*.

The general recommendation is to have a ``block.db`` size between 1% and 4%
of the ``block`` size. For RGW workloads, it is recommended that the ``block.db``
size isn't smaller than 4% of ``block``, because RGW heavily uses it to store
metadata (omap keys). For example, if the ``block`` size is 1TB, then ``block.db`` shouldn't
be less than 40GB. For RBD workloads, 1% to 2% of ``block`` size is usually enough.

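
The percentages above translate into sizes as follows. This is a minimal sketch; the function name and the workload labels are illustrative, and the ratios are simply the ones quoted in the text.

```python
# Fractions of the ``block`` size quoted in the text above.
DB_RATIOS = {
    "general": (0.01, 0.04),
    "rgw": (0.04, None),  # at least 4%; the text gives no upper bound
    "rbd": (0.01, 0.02),
}


def block_db_range_gb(block_gb, workload="general"):
    """Return (low, high) block.db sizes in GB; high is None if unbounded."""
    low, high = DB_RATIOS[workload]
    return (block_gb * low, None if high is None else block_gb * high)


# A 1TB (1000 GB) block device, as in the example above.
rgw_min_gb, _ = block_db_range_gb(1000, "rgw")
rbd_low_gb, rbd_high_gb = block_db_range_gb(1000, "rbd")
```

For a 1TB ``block`` device this gives a 40GB minimum for RGW and a range of about 10GB to 20GB for RBD.
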
In older releases, internal level sizes mean that the DB can fully utilize only
specific partition / LV sizes that correspond to sums of L0, L0+L1, L1+L2,
etc. sizes, which with default settings means roughly 3 GB, 30 GB, 300 GB, and
so forth. Most deployments will not substantially benefit from sizing to
accommodate L3 and higher, though DB compaction can be facilitated by doubling
these figures to 6GB, 60GB, and 600GB.

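
To see what this means for a given device, the sketch below picks the largest of the quoted level sums that fits. It is a simplification for older releases: the thresholds are the approximate figures from the paragraph above, and the helper name is made up for illustration.

```python
# Approximate usable sizes (GB) with default settings, as quoted above.
LEVEL_SUMS_GB = [3, 30, 300]


def usable_db_gb(device_gb, thresholds=LEVEL_SUMS_GB):
    """Largest threshold that fits on the device, or 0 if none fits."""
    fitting = [t for t in thresholds if t <= device_gb]
    return max(fitting, default=0)


usable_on_100 = usable_db_gb(100)
usable_on_300 = usable_db_gb(300)
```

Under this model a 100 GB DB device on an older release is fully used only up to about 30 GB, which is why sizes just below a threshold are poor value there.
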
Improvements in releases beginning with Nautilus 14.2.12 and Octopus 15.2.6
enable better utilization of arbitrary DB device sizes, and the Pacific
release brings experimental dynamic level support. Users of older releases may
thus wish to plan ahead by provisioning larger DB devices today so that their
benefits may be realized with future upgrades.

When *not* using a mix of fast and slow devices, it isn't required to create
separate logical volumes for ``block.db`` (or ``block.wal``). BlueStore will
automatically colocate these within the space of ``block``.


Automatic Cache Sizing
======================

BlueStore can be configured to automatically resize its caches when TCMalloc
is configured as the memory allocator and the ``bluestore_cache_autotune``
setting is enabled. This option is currently enabled by default. BlueStore
will attempt to keep OSD heap memory usage under a designated target size via
the ``osd_memory_target`` configuration option. This is a best effort
algorithm and caches will not shrink smaller than the amount specified by
``osd_memory_cache_min``. Cache ratios will be chosen based on a hierarchy
of priorities. If priority information is not available, the
``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio`` options are
used as fallbacks.

``bluestore_cache_autotune``

:Description: Automatically tune the space ratios assigned to various BlueStore
              caches while respecting minimum values.
:Type: Boolean
:Required: Yes
:Default: ``True``

``osd_memory_target``

:Description: When TCMalloc is available and cache autotuning is enabled, try to
              keep this many bytes mapped in memory. Note: This may not exactly
              match the RSS memory usage of the process. While the total amount
              of heap memory mapped by the process should usually be close
              to this target, there is no guarantee that the kernel will actually
              reclaim memory that has been unmapped. During initial development,
              it was found that some kernels result in the OSD's RSS memory
              exceeding the mapped memory by up to 20%. It is hypothesised,
              however, that the kernel generally may be more aggressive about
              reclaiming unmapped memory when there is a high amount of memory
              pressure. Your mileage may vary.
:Type: Unsigned Integer
:Required: Yes
:Default: ``4294967296``

``bluestore_cache_autotune_chunk_size``

:Description: The chunk size in bytes to allocate to caches when cache autotune
              is enabled. When the autotuner assigns memory to various caches,
              it will allocate memory in chunks. This is done to avoid
              evictions when there are minor fluctuations in the heap size or
              autotuned cache ratios.
:Type: Unsigned Integer
:Required: No
:Default: ``33554432``

``bluestore_cache_autotune_interval``

:Description: The number of seconds to wait between rebalances when cache autotune
              is enabled. This setting changes how quickly the allocation ratios of
              various caches are recomputed. Note: Setting this interval too small
              can result in high CPU usage and lower performance.
:Type: Float
:Required: No
:Default: ``5``

``osd_memory_base``

:Description: When TCMalloc and cache autotuning are enabled, estimate the minimum
              amount of memory in bytes the OSD will need. This is used to help
              the autotuner estimate the expected aggregate memory consumption of
              the caches.
:Type: Unsigned Integer
:Required: No
:Default: ``805306368``

``osd_memory_expected_fragmentation``

:Description: When TCMalloc and cache autotuning are enabled, estimate the
              percentage of memory fragmentation. This is used to help the
              autotuner estimate the expected aggregate memory consumption
              of the caches.
:Type: Float
:Required: No
:Default: ``0.15``

``osd_memory_cache_min``

:Description: When TCMalloc and cache autotuning are enabled, set the minimum
              amount of memory used for caches. Note: Setting this value too
              low can result in significant cache thrashing.
:Type: Unsigned Integer
:Required: No
:Default: ``134217728``

``osd_memory_cache_resize_interval``

:Description: When TCMalloc and cache autotuning are enabled, wait this many
              seconds between resizing caches. This setting changes the total
              amount of memory available for BlueStore to use for caching. Note
              that setting this interval too small can result in memory allocator
              thrashing and lower performance.
:Type: Float
:Required: No
:Default: ``1``

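
To get a feel for how the options above interact, the following back-of-envelope model estimates how much memory is left for caches. It is an assumption about the general shape of the calculation (target reduced by expected fragmentation, minus the base, floored at the cache minimum), not the autotuner's actual code, and the function name is illustrative.

```python
def estimate_cache_budget(
    target=4294967296,      # osd_memory_target default
    base=805306368,         # osd_memory_base default
    fragmentation=0.15,     # osd_memory_expected_fragmentation default
    cache_min=134217728,    # osd_memory_cache_min default
):
    """Rough upper bound, in bytes, on the memory available to the caches."""
    usable = int((1.0 - fragmentation) * target)
    if usable > base + cache_min:
        return usable - base
    # Caches never shrink below the configured minimum.
    return cache_min


default_budget = estimate_cache_budget()
small_budget = estimate_cache_budget(target=1024 ** 3)
```

With the defaults this model leaves roughly 2.6 GiB for caches, while a 1 GiB target falls back to the 128 MiB minimum.
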
Manual Cache Sizing
===================

The amount of memory consumed by each OSD for BlueStore caches is
determined by the ``bluestore_cache_size`` configuration option. If
that config option is not set (i.e., remains at 0), there is a
different default value that is used depending on whether an HDD or
SSD is used for the primary device (set by the
``bluestore_cache_size_ssd`` and ``bluestore_cache_size_hdd`` config
options).

BlueStore and the rest of the Ceph OSD daemon do the best they can
to work within this memory budget. Note that on top of the configured
cache size, there is also memory consumed by the OSD itself, and
some additional utilization due to memory fragmentation and other
allocator overhead.

The configured cache memory budget can be used in a few different ways:

* Key/Value metadata (i.e., RocksDB's internal cache)
* BlueStore metadata
* BlueStore data (i.e., recently read or written object data)

Cache memory usage is governed by the following options:
``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio``.
The share of the cache devoted to data
is governed by the effective BlueStore cache size (depending on
``bluestore_cache_size[_ssd|_hdd]`` settings and the device class of the primary
device) as well as the meta and kv ratios.
The amount of cache left for data can be calculated as
``<effective_cache_size> * (1 - bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)``.

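
As a worked instance of the formula above, the sketch below splits an effective cache size into its three parts. The helper name is illustrative, the ratios default to the documented ``.4`` values, and limits such as ``bluestore_cache_kv_max`` are deliberately ignored.

```python
GIB = 1024 ** 3


def cache_split(effective_cache_size, meta_ratio=0.4, kv_ratio=0.4):
    """Return the bytes assigned to metadata, key/value data, and object data."""
    data_ratio = 1.0 - meta_ratio - kv_ratio
    if data_ratio < 0:
        raise ValueError("meta_ratio + kv_ratio must not exceed 1")
    return {
        "meta": int(effective_cache_size * meta_ratio),
        "kv": int(effective_cache_size * kv_ratio),
        "data": int(effective_cache_size * data_ratio),
    }


# HDD default: bluestore_cache_size_hdd = 1 GiB.
hdd_split = cache_split(1 * GIB)
```

With the default ratios, about 20% of the effective cache size (roughly 205 MiB of the 1 GiB HDD default) remains for object data.
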
``bluestore_cache_size``

:Description: The amount of memory BlueStore will use for its cache. If zero,
              ``bluestore_cache_size_hdd`` or ``bluestore_cache_size_ssd`` will
              be used instead.
:Type: Unsigned Integer
:Required: Yes
:Default: ``0``

``bluestore_cache_size_hdd``

:Description: The default amount of memory BlueStore will use for its cache when
              backed by an HDD.
:Type: Unsigned Integer
:Required: Yes
:Default: ``1 * 1024 * 1024 * 1024`` (1 GB)

``bluestore_cache_size_ssd``

:Description: The default amount of memory BlueStore will use for its cache when
              backed by an SSD.
:Type: Unsigned Integer
:Required: Yes
:Default: ``3 * 1024 * 1024 * 1024`` (3 GB)

``bluestore_cache_meta_ratio``

:Description: The ratio of cache devoted to metadata.
:Type: Floating point
:Required: Yes
:Default: ``.4``

``bluestore_cache_kv_ratio``

:Description: The ratio of cache devoted to key/value data (RocksDB).
:Type: Floating point
:Required: Yes
:Default: ``.4``

``bluestore_cache_kv_max``

:Description: The maximum amount of cache devoted to key/value data (RocksDB).
:Type: Unsigned Integer
:Required: Yes
:Default: ``512 * 1024*1024`` (512 MB)


Checksums
=========

BlueStore checksums all metadata and data written to disk. Metadata
checksumming is handled by RocksDB and uses `crc32c`. Data
checksumming is done by BlueStore and can make use of `crc32c`,
`xxhash32`, or `xxhash64`. The default is `crc32c` and should be
suitable for most purposes.

Full data checksumming does increase the amount of metadata that
BlueStore must store and manage. When possible, e.g., when clients
hint that data is written and read sequentially, BlueStore will
checksum larger blocks, but in many cases it must store a checksum
value (usually 4 bytes) for every 4 kilobyte block of data.

It is possible to use a smaller checksum value by truncating the
checksum to two or one byte, reducing the metadata overhead. The
trade-off is that the probability that a random error will not be
detected is higher with a smaller checksum, going from about one in
four billion with a 32-bit (4 byte) checksum to one in 65,536 for a
16-bit (2 byte) checksum or one in 256 for an 8-bit (1 byte) checksum.
The smaller checksum values can be used by selecting `crc32c_16` or
`crc32c_8` as the checksum algorithm.

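
The trade-off described above is easy to quantify. The sketch below assumes the worst case from the text, one checksum per 4 kilobyte block, and the helper names are illustrative.

```python
def checksum_overhead_bytes(data_bytes, csum_bytes=4, block_bytes=4096):
    """Checksum metadata needed if every block carries its own checksum."""
    blocks = -(-data_bytes // block_bytes)  # ceiling division
    return blocks * csum_bytes


def undetected_error_probability(csum_bits):
    """Chance that a random corruption still matches the stored checksum."""
    return 1 / 2 ** csum_bits


TIB = 1024 ** 4
overhead_crc32c = checksum_overhead_bytes(TIB)                 # 4-byte checksum
overhead_crc32c_8 = checksum_overhead_bytes(TIB, csum_bytes=1) # 1-byte checksum
```

For 1 TiB of data, a full 4-byte checksum costs 1 GiB of metadata in this worst case, while `crc32c_8` costs a quarter of that at a one in 256 chance of missing a random error.
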
The *checksum algorithm* can be set either via a per-pool
``csum_type`` property or the global config option. For example::

  ceph osd pool set <pool-name> csum_type <algorithm>

``bluestore_csum_type``

:Description: The default checksum algorithm to use.
:Type: String
:Required: Yes
:Valid Settings: ``none``, ``crc32c``, ``crc32c_16``, ``crc32c_8``, ``xxhash32``, ``xxhash64``
:Default: ``crc32c``


Inline Compression
==================

BlueStore supports inline compression using `snappy`, `zlib`, or
`lz4`. Please note that the `lz4` compression plugin is not
distributed in the official release.

Whether data in BlueStore is compressed is determined by a combination
of the *compression mode* and any hints associated with a write
operation. The modes are:

* **none**: Never compress data.
* **passive**: Do not compress data unless the write operation has a
  *compressible* hint set.
* **aggressive**: Compress data unless the write operation has an
  *incompressible* hint set.
* **force**: Try to compress data no matter what.

For more information about the *compressible* and *incompressible* IO
hints, see :c:func:`rados_set_alloc_hint`.

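
The four modes and the two hints combine as shown in this sketch. It models only the decision table described above. The function name and the string values used for hints are illustrative, not part of any Ceph API.

```python
from typing import Optional


def should_try_compression(mode: str, hint: Optional[str] = None) -> bool:
    """Decide whether a write is a candidate for compression.

    ``hint`` is "compressible", "incompressible", or None when the client
    supplied no hint.
    """
    if mode == "none":
        return False
    if mode == "passive":
        return hint == "compressible"
    if mode == "aggressive":
        return hint != "incompressible"
    if mode == "force":
        return True
    raise ValueError(f"unknown compression mode: {mode}")


unhinted_passive = should_try_compression("passive")
unhinted_aggressive = should_try_compression("aggressive")
```

The difference between ``passive`` and ``aggressive`` shows up for writes that carry no hint: the former skips them, the latter attempts compression.
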
Note that regardless of the mode, if the size of the data chunk is not
reduced sufficiently, the compressed version will not be used and the original
(uncompressed) data will be stored. For example, if
``bluestore_compression_required_ratio`` is set to ``.7`` then the compressed data
must be 70% of the size of the original (or smaller).

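
The required-ratio check can be expressed in one comparison. This is a simplified sketch that compares raw byte lengths. The real implementation may round sizes to allocation units before comparing, which is not modelled here.

```python
def keep_compressed(original_len, compressed_len, required_ratio=0.875):
    """True if the compressed chunk is small enough to be stored."""
    return compressed_len <= original_len * required_ratio


# A 4096-byte chunk with a required ratio of .7 must shrink to about 2867 bytes.
stored_compressed = keep_compressed(4096, 2867, required_ratio=0.7)
stored_original = not keep_compressed(4096, 3000, required_ratio=0.7)
```

In the second case the chunk only shrinks to about 73% of its size, so the uncompressed data is stored.
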
The *compression mode*, *compression algorithm*, *compression required
ratio*, *min blob size*, and *max blob size* can be set either via a
per-pool property or a global config option. Pool properties can be
set with::

  ceph osd pool set <pool-name> compression_algorithm <algorithm>
  ceph osd pool set <pool-name> compression_mode <mode>
  ceph osd pool set <pool-name> compression_required_ratio <ratio>
  ceph osd pool set <pool-name> compression_min_blob_size <size>
  ceph osd pool set <pool-name> compression_max_blob_size <size>

``bluestore_compression_algorithm``

:Description: The default compressor to use (if any) if the per-pool property
              ``compression_algorithm`` is not set. Note that ``zstd`` is *not*
              recommended for BlueStore due to high CPU overhead when
              compressing small amounts of data.
:Type: String
:Required: No
:Valid Settings: ``lz4``, ``snappy``, ``zlib``, ``zstd``
:Default: ``snappy``

``bluestore_compression_mode``

:Description: The default policy for using compression if the per-pool property
              ``compression_mode`` is not set. ``none`` means never use
              compression. ``passive`` means use compression when
              :c:func:`clients hint <rados_set_alloc_hint>` that data is
              compressible. ``aggressive`` means use compression unless
              clients hint that data is not compressible. ``force`` means use
              compression under all circumstances even if the clients hint that
              the data is not compressible.
:Type: String
:Required: No
:Valid Settings: ``none``, ``passive``, ``aggressive``, ``force``
:Default: ``none``

``bluestore_compression_required_ratio``

:Description: The ratio of the size of the data chunk after
              compression relative to the original size must be at
              least this small in order to store the compressed
              version.
:Type: Floating point
:Required: No
:Default: .875

``bluestore_compression_min_blob_size``

:Description: Chunks smaller than this are never compressed.
              The per-pool property ``compression_min_blob_size`` overrides
              this setting.
:Type: Unsigned Integer
:Required: No
:Default: 0

``bluestore_compression_min_blob_size_hdd``

:Description: Default value of ``bluestore_compression_min_blob_size``
              for rotational media.
:Type: Unsigned Integer
:Required: No
:Default: 128K

``bluestore_compression_min_blob_size_ssd``

:Description: Default value of ``bluestore_compression_min_blob_size``
              for non-rotational (solid state) media.
:Type: Unsigned Integer
:Required: No
:Default: 8K

``bluestore_compression_max_blob_size``

:Description: Chunks larger than this value are broken into smaller blobs of at most
              ``bluestore_compression_max_blob_size`` bytes before being compressed.
              The per-pool property ``compression_max_blob_size`` overrides
              this setting.
:Type: Unsigned Integer
:Required: No
:Default: 0

``bluestore_compression_max_blob_size_hdd``

:Description: Default value of ``bluestore_compression_max_blob_size``
              for rotational media.
:Type: Unsigned Integer
:Required: No
:Default: 512K

``bluestore_compression_max_blob_size_ssd``

:Description: Default value of ``bluestore_compression_max_blob_size``
              for non-rotational (SSD, NVMe) media.
:Type: Unsigned Integer
:Required: No
:Default: 64K

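
The effect of the min and max blob sizes can be sketched as follows. This is a simplified model with illustrative helper names. It splits on plain byte offsets and applies the minimum to each resulting piece, which is an assumption, since the real write path also takes allocation units and alignment into account.

```python
KIB = 1024


def split_into_blobs(length, max_blob_size):
    """Split a chunk of ``length`` bytes into blobs of at most max_blob_size."""
    if max_blob_size <= 0:
        raise ValueError("max_blob_size must be positive")
    return [
        min(max_blob_size, length - offset)
        for offset in range(0, length, max_blob_size)
    ]


def compression_candidates(length, min_blob_size, max_blob_size):
    """Blob lengths that are large enough to be considered for compression."""
    return [b for b in split_into_blobs(length, max_blob_size) if b >= min_blob_size]


# HDD defaults: min blob 128K, max blob 512K.
blobs = split_into_blobs(1300 * KIB, 512 * KIB)
candidates = compression_candidates(64 * KIB, 128 * KIB, 512 * KIB)
```

With the HDD defaults, a 1300 KiB write becomes two 512 KiB blobs plus a 276 KiB remainder, and a 64 KiB write is below the minimum and is never compressed.
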
SPDK Usage
==========

If you want to use the SPDK driver for NVMe devices, you must prepare your system.
Refer to `SPDK document`__ for more details.

.. __: http://www.spdk.io/doc/getting_started.html#getting_started_examples

SPDK offers a script to configure the device automatically. Users can run the
script as root::

  $ sudo src/spdk/scripts/setup.sh

You will need to specify the subject NVMe device's device selector with
the "spdk:" prefix for ``bluestore_block_path``.

For example, you can find the device selector of an Intel PCIe SSD with::

  $ lspci -mm -n -D -d 8086:0953

The device selector always has the form of ``DDDD:BB:DD.FF`` or ``DDDD.BB.DD.FF``.

Then set::

  bluestore_block_path = spdk:0000:01:00.0

Where ``0000:01:00.0`` is the device selector found in the output of the ``lspci``
command above.

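
Turning the ``lspci`` output into the config value can be scripted. The sketch below assumes that the device selector is the first whitespace-separated field of each output line. The sample line is made up for illustration and should be checked against real output on the target host.

```python
def spdk_block_path(lspci_line):
    """Build a bluestore_block_path value from one line of lspci output."""
    selector = lspci_line.split()[0]
    # Accept both DDDD:BB:DD.FF and DDDD.BB.DD.FF by normalising separators.
    parts = selector.replace(".", ":").split(":")
    if len(parts) != 4:
        raise ValueError(f"unexpected device selector: {selector}")
    return f"spdk:{selector}"


# Illustrative sample only; real output depends on the installed hardware.
sample_line = '0000:01:00.0 "0108" "8086" "0953" -r01 "8086" "370d"'
block_path = spdk_block_path(sample_line)
```

For the sample line this produces ``spdk:0000:01:00.0``, the value used in the setting above.
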
To run multiple SPDK instances per node, you must specify the
amount of DPDK memory in MB that each instance will use, to make sure each
instance uses its own DPDK memory.

In most cases, a single device can be used for data, DB, and WAL. We describe
this strategy as *colocating* these components. Be sure to enter the below
settings to ensure that all IOs are issued through SPDK::

  bluestore_block_db_path = ""
  bluestore_block_db_size = 0
  bluestore_block_wal_path = ""
  bluestore_block_wal_size = 0

Otherwise, the current implementation will populate the SPDK map files with
kernel file system symbols and will use the kernel driver to issue DB/WAL IO.