doc/rados/configuration: clean up BlueStore config and enhance DB sizing guidance
This commit is contained in:
parent c6121e48ec
commit 7673793945
@@ -18,7 +18,7 @@ when ``ceph-volume`` activates it) with all the common OSD files that hold
 information about the OSD, like: its identifier, which cluster it belongs to,
 and its private keyring.
 
-It is also possible to deploy BlueStore across two additional devices:
+It is also possible to deploy BlueStore across one or two additional devices:
 
 * A *write-ahead log (WAL) device* (identified as ``block.wal`` in the data directory) can be
   used for BlueStore's internal journal or write-ahead log. It is only useful
@@ -38,9 +38,11 @@ more, provisioning a DB device makes more sense. The BlueStore
 journal will always be placed on the fastest device available, so
 using a DB device will provide the same benefit that the WAL device
 would while *also* allowing additional metadata to be stored there (if
-it will fit).
+it will fit). This means that if a DB device is specified but an explicit
+WAL device is not, the WAL will be implicitly colocated with the DB on the faster
+device.
 
-A single-device BlueStore OSD can be provisioned with::
+A single-device (colocated) BlueStore OSD can be provisioned with::
 
 ceph-volume lvm prepare --bluestore --data <device>
 
@@ -48,28 +50,28 @@ To specify a WAL device and/or DB device, ::
 
 ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device>
 
-.. note:: --data can be a Logical Volume using the vg/lv notation. Other
-          devices can be existing logical volumes or GPT partitions
+.. note:: ``--data`` can be a Logical Volume using *vg/lv* notation. Other
+          devices can be existing logical volumes or GPT partitions.
 
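Following the colocation behaviour described above, a DB device can be given without an explicit WAL device and the WAL will land on it implicitly; a minimal sketch reusing the placeholder notation from the patch::

    # no --block.wal given: the WAL is implicitly colocated with the DB
    ceph-volume lvm prepare --bluestore --data <device> --block.db <db-device>
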
 Provisioning strategies
 -----------------------
-Although there are multiple ways to deploy a Bluestore OSD (unlike Filestore
-which had 1) here are two common use cases that should help clarify the
-initial deployment strategy:
+Although there are multiple ways to deploy a BlueStore OSD (unlike Filestore
+which had just one), there are two common arrangements that should help clarify
+the deployment strategy:
 
 .. _bluestore-single-type-device-config:
 
 **block (data) only**
 ^^^^^^^^^^^^^^^^^^^^^
-If all the devices are the same type, for example all are spinning drives, and
-there are no fast devices to combine these, it makes sense to just deploy with
-block only and not try to separate ``block.db`` or ``block.wal``. The
-:ref:`ceph-volume-lvm` call for a single ``/dev/sda`` device would look like::
+If all devices are the same type, for example all rotational drives, and
+there are no fast devices to use for metadata, it makes sense to specify the
+block device only and to not separate ``block.db`` or ``block.wal``. The
+:ref:`ceph-volume-lvm` command for a single ``/dev/sda`` device looks like::
 
 ceph-volume lvm create --bluestore --data /dev/sda
 
-If logical volumes have already been created for each device (1 LV using 100%
-of the device), then the :ref:`ceph-volume-lvm` call for an lv named
+If logical volumes have already been created for each device (a single LV
+using 100% of the device), then the :ref:`ceph-volume-lvm` call for an LV named
 ``ceph-vg/block-lv`` would look like::
 
 ceph-volume lvm create --bluestore --data ceph-vg/block-lv
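If such a logical volume does not exist yet, it can be created first with stock LVM commands; a minimal sketch, assuming the ``ceph-vg`` volume group lives on ``/dev/sda`` (the names are only illustrative)::

    $ vgcreate ceph-vg /dev/sda
    $ lvcreate -l 100%FREE -n block-lv ceph-vg
    $ ceph-volume lvm create --bluestore --data ceph-vg/block-lv
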
@@ -78,15 +80,15 @@ of the device), then the :ref:`ceph-volume-lvm` call for an lv named
 
 **block and block.db**
 ^^^^^^^^^^^^^^^^^^^^^^
-If there is a mix of fast and slow devices (spinning and solid state),
+If you have a mix of fast and slow devices (SSD / NVMe and rotational),
 it is recommended to place ``block.db`` on the faster device while ``block``
-(data) lives on the slower (spinning drive). Sizing for ``block.db`` should be
-as large as possible to avoid performance penalties otherwise. The
-``ceph-volume`` tool is currently not able to create these automatically, so
-the volume groups and logical volumes need to be created manually.
+(data) lives on the slower (spinning drive).
 
-For the below example, lets assume 4 spinning drives (sda, sdb, sdc, and sdd)
-and 1 solid state drive (sdx). First create the volume groups::
+You must create these volume groups and logical volumes manually as
+the ``ceph-volume`` tool is currently not able to do so automatically.
+
+For the below example, let us assume four rotational drives (``sda``, ``sdb``, ``sdc``, and ``sdd``)
+and one (fast) solid state drive (``sdx``). First create the volume groups::
 
 $ vgcreate ceph-block-0 /dev/sda
 $ vgcreate ceph-block-1 /dev/sdb
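The rest of the preparation falls between this hunk and the next one; a sketch of what those intermediate steps might look like, assuming one full-device ``block`` LV per rotational drive and a 50 GB DB LV per OSD (consistent with the 50 GB summary that follows)::

    $ vgcreate ceph-block-2 /dev/sdc
    $ vgcreate ceph-block-3 /dev/sdd
    $ vgcreate ceph-db-0 /dev/sdx

    $ lvcreate -l 100%FREE -n block-0 ceph-block-0
    $ lvcreate -l 100%FREE -n block-1 ceph-block-1
    $ lvcreate -l 100%FREE -n block-2 ceph-block-2
    $ lvcreate -l 100%FREE -n block-3 ceph-block-3

    $ lvcreate -L 50GB -n db-0 ceph-db-0
    $ lvcreate -L 50GB -n db-1 ceph-db-0
    $ lvcreate -L 50GB -n db-2 ceph-db-0
    $ lvcreate -L 50GB -n db-3 ceph-db-0
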
@@ -116,34 +118,47 @@ Finally, create the 4 OSDs with ``ceph-volume``::
 $ ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2
 $ ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3
 
-These operations should end up creating 4 OSDs, with ``block`` on the slower
-spinning drives and a 50GB logical volume for each coming from the solid state
+These operations should end up creating four OSDs, with ``block`` on the slower
+rotational drives and a 50 GB logical volume (DB) for each on the solid state
 drive.
 
 Sizing
 ======
 When using a :ref:`mixed spinning and solid drive setup
-<bluestore-mixed-device-config>` it is important to make a large-enough
-``block.db`` logical volume for Bluestore. Generally, ``block.db`` should have
+<bluestore-mixed-device-config>` it is important to make a large enough
+``block.db`` logical volume for BlueStore. Generally, ``block.db`` should have
 *as large as possible* logical volumes.
 
 The general recommendation is to have ``block.db`` size in between 1% to 4%
 of ``block`` size. For RGW workloads, it is recommended that the ``block.db``
-size isn't smaller than 4% of ``block``, because RGW heavily uses it to store its
-metadata. For example, if the ``block`` size is 1TB, then ``block.db`` shouldn't
+size isn't smaller than 4% of ``block``, because RGW heavily uses it to store
+metadata (omap keys). For example, if the ``block`` size is 1TB, then ``block.db`` shouldn't
 be less than 40GB. For RBD workloads, 1% to 2% of ``block`` size is usually enough.
 
-If *not* using a mix of fast and slow devices, it isn't required to create
-separate logical volumes for ``block.db`` (or ``block.wal``). Bluestore will
-automatically manage these within the space of ``block``.
+In older releases, internal level sizes mean that the DB can fully utilize only
+specific partition / LV sizes that correspond to sums of L0, L0+L1, L1+L2,
+etc. sizes, which with default settings means roughly 3 GB, 30 GB, 300 GB, and
+so forth. Most deployments will not substantially benefit from sizing to
+accommodate L3 and higher, though DB compaction can be facilitated by doubling
+these figures to 6 GB, 60 GB, and 600 GB.
+
+Improvements in releases beginning with Nautilus 14.2.12 and Octopus 15.2.6
+enable better utilization of arbitrary DB device sizes, and the Pacific
+release brings experimental dynamic level support. Users of older releases may
+thus wish to plan ahead by provisioning larger DB devices today so that their
+benefits may be realized with future upgrades.
+
+When *not* using a mix of fast and slow devices, it isn't required to create
+separate logical volumes for ``block.db`` (or ``block.wal``). BlueStore will
+automatically colocate these within the space of ``block``.
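As a quick illustration of those percentages, a back-of-the-envelope shell calculation, assuming a 4096 GB ``block`` device (the size is an arbitrary example)::

    $ BLOCK_GB=4096
    $ echo "RGW minimum block.db: $(( BLOCK_GB * 4 / 100 )) GB"
    RGW minimum block.db: 163 GB
    $ echo "RBD typical block.db: $(( BLOCK_GB * 1 / 100 ))-$(( BLOCK_GB * 2 / 100 )) GB"
    RBD typical block.db: 40-81 GB

On an older release one might then round such a figure up to the nearest useful level sum noted above, for example from 163 GB toward the 300 GB step, or 600 GB if the doubled figures are preferred to ease compaction.
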
 
 
 Automatic Cache Sizing
 ======================
 
-Bluestore can be configured to automatically resize its caches when TCMalloc
+BlueStore can be configured to automatically resize its caches when TCMalloc
 is configured as the memory allocator and the ``bluestore_cache_autotune``
-setting is enabled. This option is currently enabled by default. Bluestore
+setting is enabled. This option is currently enabled by default. BlueStore
 will attempt to keep OSD heap memory usage under a designated target size via
 the ``osd_memory_target`` configuration option. This is a best effort
 algorithm and caches will not shrink smaller than the amount specified by
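The target described here can be changed through the normal configuration machinery; a minimal sketch, assuming an 8 GiB budget is appropriate for the hardware (the value and the ``osd.12`` ID are only examples)::

    # raise the autotuner's target for all OSDs, then override a single OSD
    $ ceph config set osd osd_memory_target 8589934592
    $ ceph config set osd.12 osd_memory_target 8589934592
    # confirm the value a specific daemon will use
    $ ceph config get osd.12 osd_memory_target
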
@@ -154,56 +169,86 @@ used as fallbacks.
 
 ``bluestore_cache_autotune``
 
-:Description: Automatically tune the ratios assigned to different bluestore caches while respecting minimum values.
+:Description: Automatically tune the space ratios assigned to various BlueStore
+              caches while respecting minimum values.
 :Type: Boolean
 :Required: Yes
 :Default: ``True``
 
 ``osd_memory_target``
 
-:Description: When tcmalloc is available and cache autotuning is enabled, try to keep this many bytes mapped in memory. Note: This may not exactly match the RSS memory usage of the process. While the total amount of heap memory mapped by the process should generally stay close to this target, there is no guarantee that the kernel will actually reclaim memory that has been unmapped. During initial development, it was found that some kernels result in the OSD's RSS Memory exceeding the mapped memory by up to 20%. It is hypothesised however, that the kernel generally may be more aggressive about reclaiming unmapped memory when there is a high amount of memory pressure. Your mileage may vary.
+:Description: When TCMalloc is available and cache autotuning is enabled, try to
+              keep this many bytes mapped in memory. Note: This may not exactly
+              match the RSS memory usage of the process. While the total amount
+              of heap memory mapped by the process should usually be close
+              to this target, there is no guarantee that the kernel will actually
+              reclaim memory that has been unmapped. During initial development,
+              it was found that some kernels result in the OSD's RSS memory
+              exceeding the mapped memory by up to 20%. It is hypothesised,
+              however, that the kernel generally may be more aggressive about
+              reclaiming unmapped memory when there is a high amount of memory
+              pressure. Your mileage may vary.
 :Type: Unsigned Integer
 :Required: Yes
 :Default: ``4294967296``
 
 ``bluestore_cache_autotune_chunk_size``
 
-:Description: The chunk size in bytes to allocate to caches when cache autotune is enabled. When the autotuner assigns memory to different caches, it will allocate memory in chunks. This is done to avoid evictions when there are minor fluctuations in the heap size or autotuned cache ratios.
+:Description: The chunk size in bytes to allocate to caches when cache autotune
+              is enabled. When the autotuner assigns memory to various caches,
+              it will allocate memory in chunks. This is done to avoid
+              evictions when there are minor fluctuations in the heap size or
+              autotuned cache ratios.
 :Type: Unsigned Integer
 :Required: No
 :Default: ``33554432``
 
 ``bluestore_cache_autotune_interval``
 
-:Description: The number of seconds to wait between rebalances when cache autotune is enabled. This setting changes how quickly the ratios of the difference caches are recomputed. Note: Setting the interval too small can result in high CPU usage and lower performance.
+:Description: The number of seconds to wait between rebalances when cache autotune
+              is enabled. This setting changes how quickly the allocation ratios of
+              various caches are recomputed. Note: Setting this interval too small
+              can result in high CPU usage and lower performance.
 :Type: Float
 :Required: No
 :Default: ``5``
 
 ``osd_memory_base``
 
-:Description: When tcmalloc and cache autotuning is enabled, estimate the minimum amount of memory in bytes the OSD will need. This is used to help the autotuner estimate the expected aggregate memory consumption of the caches.
+:Description: When TCMalloc and cache autotuning are enabled, estimate the minimum
+              amount of memory in bytes the OSD will need. This is used to help
+              the autotuner estimate the expected aggregate memory consumption of
+              the caches.
 :Type: Unsigned Integer
 :Required: No
 :Default: ``805306368``
 
 ``osd_memory_expected_fragmentation``
 
-:Description: When tcmalloc and cache autotuning is enabled, estimate the percent of memory fragmentation. This is used to help the autotuner estimate the expected aggregate memory consumption of the caches.
+:Description: When TCMalloc and cache autotuning are enabled, estimate the
+              percentage of memory fragmentation. This is used to help the
+              autotuner estimate the expected aggregate memory consumption
+              of the caches.
 :Type: Float
 :Required: No
 :Default: ``0.15``
 
 ``osd_memory_cache_min``
 
-:Description: When tcmalloc and cache autotuning is enabled, set the minimum amount of memory used for caches. Note: Setting this value too low can result in significant cache thrashing.
+:Description: When TCMalloc and cache autotuning are enabled, set the minimum
+              amount of memory used for caches. Note: Setting this value too
+              low can result in significant cache thrashing.
 :Type: Unsigned Integer
 :Required: No
 :Default: ``134217728``
 
 ``osd_memory_cache_resize_interval``
 
-:Description: When tcmalloc and cache autotuning is enabled, wait this many seconds between resizing caches. This setting changes the total amount of memory available for bluestore to use for caching. Note: Setting the interval too small can result in memory allocator thrashing and lower performance.
+:Description: When TCMalloc and cache autotuning are enabled, wait this many
+              seconds between resizing caches. This setting changes the total
+              amount of memory available for BlueStore to use for caching. Note
+              that setting this interval too small can result in memory allocator
+              thrashing and lower performance.
 :Type: Float
 :Required: No
 :Default: ``1``
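To get a feel for how these knobs interact, a rough shell sketch using the defaults above; this is only an illustration of the idea that the base footprint and expected fragmentation eat into the target before cache memory is assigned, not the autotuner's actual internal formula::

    $ TARGET=4294967296       # osd_memory_target (4 GiB)
    $ BASE=805306368          # osd_memory_base (768 MiB)
    $ FRAG_PCT=15             # osd_memory_expected_fragmentation (0.15)
    $ CACHE_MIN=134217728     # osd_memory_cache_min (128 MiB)
    $ AVAIL=$(( TARGET - BASE - TARGET * FRAG_PCT / 100 ))
    $ echo $(( AVAIL > CACHE_MIN ? AVAIL : CACHE_MIN ))
    2845415834
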
@@ -212,7 +257,7 @@ used as fallbacks.
 Manual Cache Sizing
 ===================
 
-The amount of memory consumed by each OSD for BlueStore's cache is
+The amount of memory consumed by each OSD for BlueStore caches is
 determined by the ``bluestore_cache_size`` configuration option. If
 that config option is not set (i.e., remains at 0), there is a
 different default value that is used depending on whether an HDD or
@@ -220,10 +265,10 @@ SSD is used for the primary device (set by the
 ``bluestore_cache_size_ssd`` and ``bluestore_cache_size_hdd`` config
 options).
 
-BlueStore and the rest of the Ceph OSD does the best it can currently
-to stick to the budgeted memory. Note that on top of the configured
+BlueStore and the rest of the Ceph OSD daemon do the best they can
+to work within this memory budget. Note that on top of the configured
 cache size, there is also memory consumed by the OSD itself, and
-generally some overhead due to memory fragmentation and other
+some additional utilization due to memory fragmentation and other
 allocator overhead.
 
 The configured cache memory budget can be used in a few different ways:
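To size the cache manually as described here, the autotuner is typically disabled and a fixed budget supplied; a minimal sketch, with an arbitrary 2 GiB example value::

    $ ceph config set osd bluestore_cache_autotune false
    $ ceph config set osd bluestore_cache_size 2147483648
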
@@ -243,21 +288,25 @@ The data fraction can be calculated by
 
 ``bluestore_cache_size``
 
-:Description: The amount of memory BlueStore will use for its cache. If zero, ``bluestore_cache_size_hdd`` or ``bluestore_cache_size_ssd`` will be used instead.
+:Description: The amount of memory BlueStore will use for its cache. If zero,
+              ``bluestore_cache_size_hdd`` or ``bluestore_cache_size_ssd`` will
+              be used instead.
 :Type: Unsigned Integer
 :Required: Yes
 :Default: ``0``
 
 ``bluestore_cache_size_hdd``
 
-:Description: The default amount of memory BlueStore will use for its cache when backed by an HDD.
+:Description: The default amount of memory BlueStore will use for its cache when
+              backed by an HDD.
 :Type: Unsigned Integer
 :Required: Yes
 :Default: ``1 * 1024 * 1024 * 1024`` (1 GB)
 
 ``bluestore_cache_size_ssd``
 
-:Description: The default amount of memory BlueStore will use for its cache when backed by an SSD.
+:Description: The default amount of memory BlueStore will use for its cache when
+              backed by an SSD.
 :Type: Unsigned Integer
 :Required: Yes
 :Default: ``3 * 1024 * 1024 * 1024`` (3 GB)
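Which of the two defaults applies is decided by whether the OSD's primary device is an HDD or an SSD; one way to inspect the values a given daemon resolves (``osd.0`` is just an example ID)::

    $ ceph config get osd.0 bluestore_cache_size
    $ ceph config get osd.0 bluestore_cache_size_hdd
    $ ceph config get osd.0 bluestore_cache_size_ssd
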
@@ -271,14 +320,14 @@ The data fraction can be calculated by
 
 ``bluestore_cache_kv_ratio``
 
-:Description: The ratio of cache devoted to key/value data (rocksdb).
+:Description: The ratio of cache devoted to key/value data (RocksDB).
 :Type: Floating point
 :Required: Yes
 :Default: ``.4``
 
 ``bluestore_cache_kv_max``
 
-:Description: The maximum amount of cache devoted to key/value data (rocksdb).
+:Description: The maximum amount of cache devoted to key/value data (RocksDB).
 :Type: Unsigned Integer
 :Required: Yes
 :Default: ``512 * 1024*1024`` (512 MB)
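Taken together with the 3 GB SSD default above, these two options bound the key/value (RocksDB) share of a manually sized cache; a quick sketch, assuming the share is simply the ratio of the cache size capped at the maximum (a simplification of the real allocation logic)::

    $ CACHE=$(( 3 * 1024 * 1024 * 1024 ))   # bluestore_cache_size_ssd default
    $ KV=$(( CACHE * 4 / 10 ))              # bluestore_cache_kv_ratio = .4
    $ KV_MAX=$(( 512 * 1024 * 1024 ))       # bluestore_cache_kv_max
    $ echo $(( KV < KV_MAX ? KV : KV_MAX ))
    536870912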