doc/start: Modernize and clarify hardware-recommendations.rst
Signed-off-by: Anthony D'Atri <anthonyeleven@users.noreply.github.com>
This commit is contained in: parent 9b943423e5, commit 83bd3a8dfb

.. _hardware-recommendations:

==========================
 Hardware Recommendations
==========================

Ceph is designed to run on commodity hardware, which makes building and
maintaining petabyte-scale data clusters flexible and economically feasible.
When planning your cluster's hardware, you will need to balance a number
of considerations, including failure domains, cost, and performance.
Hardware planning should include distributing Ceph daemons and
other processes that use Ceph across many hosts. Generally, we recommend
running Ceph daemons of a specific type on a host configured for that type
of daemon. We recommend using separate hosts for processes that utilize your
data cluster (e.g., OpenStack, CloudStack, Kubernetes, etc.).

The requirements of one Ceph cluster are not the same as the requirements of
another, but below are some general guidelines.

.. tip:: Check out the `Ceph blog`_ too.

CPU
===

CephFS Metadata Servers (MDS) are CPU-intensive. They are single-threaded
and perform best with CPUs with a high clock rate (GHz). MDS servers do not
need a large number of CPU cores unless they are also hosting other services,
such as SSD OSDs for the CephFS metadata pool.
OSD nodes need enough processing power to run the RADOS service, to calculate
data placement with CRUSH, to replicate data, and to maintain their own copies
of the cluster map.

With earlier releases of Ceph, we would make hardware recommendations based on
the number of cores per OSD, but the cores-per-OSD metric is no longer as
useful a metric as the number of cycles per IOP and the number of IOPS per OSD.
For example, with NVMe OSD drives, Ceph can easily utilize five or six cores on
real clusters and up to about fourteen cores on single OSDs in isolation. So
cores per OSD are no longer as pressing a concern as they were. When selecting
hardware, select for IOPS per core.

.. tip:: When we speak of CPU *cores*, we mean *threads* when hyperthreading
   is enabled. Hyperthreading is usually beneficial for Ceph servers.

Monitor nodes and Manager nodes do not have heavy CPU demands and require only
modest processors. If your hosts will run CPU-intensive processes in
addition to Ceph daemons, make sure that you have enough processing power to
run both the CPU-intensive processes and the Ceph daemons. (OpenStack Nova is
one example of a CPU-intensive process.) We recommend that you run
non-Ceph CPU-intensive processes on separate hosts (that is, on hosts that are
not your Monitor and Manager nodes) in order to avoid resource contention.
If your cluster deploys the Ceph Object Gateway, RGW daemons may co-reside
with your Mon and Manager services if the nodes have sufficient resources.

RAM
===

Generally, more RAM is better. Monitor / Manager nodes for a modest cluster
might do fine with 64GB; for a larger cluster with hundreds of OSDs 128GB
is a reasonable target.

.. tip:: When we speak of RAM and storage requirements, we often describe
   the needs of a single daemon of a given type. A given server as
   a whole will thus need at least the sum of the needs of the
   daemons that it hosts as well as resources for logs and other operating
   system components. Keep in mind that a server's need for RAM
   and storage will be greater at startup and when components
   fail or are added and the cluster rebalances. In other words,
   allow headroom past what you might see used during a calm period
   on a small initial cluster footprint.

There is an :confval:`osd_memory_target` setting for BlueStore OSDs that
defaults to 4GB. Factor in a prudent margin for the operating system and
administrative tasks (like monitoring and metrics) as well as increased
consumption during recovery: provisioning ~8GB *per BlueStore OSD* is thus
advised.
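
The target can be adjusted at runtime with ``ceph config``. A minimal
illustration (the 6 GiB figure and the ``osd.12`` id are examples only, not
recommendations for your cluster):

.. code-block:: console

   # Raise the per-OSD memory target for all OSDs (value in bytes)
   ceph config set osd osd_memory_target 6442450944

   # Or override a single OSD
   ceph config set osd.12 osd_memory_target 6442450944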

Monitors and managers (ceph-mon and ceph-mgr)
---------------------------------------------

Monitor and manager daemon memory usage scales with the size of the
cluster. Note that at boot-time and during topology changes and recovery these
daemons will need more RAM than they do during steady-state operation, so plan
for peak usage. For very small clusters, 32 GB suffices. For clusters of up to,

Metadata servers (ceph-mds)
---------------------------

CephFS metadata daemon memory utilization depends on the configured size of
its cache. We recommend 1 GB as a minimum for most systems. See
:confval:`mds_cache_memory_limit`.
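
If your MDS hosts have RAM to spare, the cache limit can be raised with
``ceph config``. A minimal illustration (the 8 GiB value is an example only):

.. code-block:: console

   # Allow the MDS cache to grow to 8 GiB (value in bytes)
   ceph config set mds mds_cache_memory_limit 8589934592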

In BlueStore you can adjust the amount of memory
that the OSD attempts to consume by changing the :confval:`osd_memory_target`
configuration option.

- Setting the :confval:`osd_memory_target` below 2GB is not
  recommended. Ceph may fail to keep the memory consumption under 2GB and
  extremely slow performance is likely.

- Setting the memory target between 2GB and 4GB typically works but may result
  in degraded performance: metadata may need to be read from disk during IO
  unless the active data set is relatively small.

- 4GB is the current default value for :confval:`osd_memory_target`. This
  default was chosen for typical use cases, and is intended to balance RAM
  cost and OSD performance.

- Setting the :confval:`osd_memory_target` higher than 4GB can improve
  performance when there are many (small) objects or when large (256GB/OSD
  or more) data sets are processed. This is especially true with fast
  NVMe OSDs.

.. important:: OSD memory management is "best effort". Although the OSD may
   unmap memory to allow the kernel to reclaim it, there is no guarantee that
   the kernel will actually reclaim freed memory within a specific time
   frame. This applies especially in older versions of Ceph, where transparent
   huge pages made it difficult for the kernel to reclaim memory; newer
   releases of Ceph disable transparent huge
   pages at the application level to avoid this, but that does not
   guarantee that the kernel will immediately reclaim unmapped memory. The OSD
   may still at times exceed its memory target. We recommend budgeting
   at least 20% extra memory on your system to prevent OSDs from going OOM
   (**O**\ut **O**\f **M**\emory) during temporary spikes or due to delay in
   the kernel reclaiming freed pages. That 20% value might be more or less than
   needed, depending on the exact configuration of the system.

.. tip:: Configuring the operating system with swap to provide additional
   virtual memory for daemons is not advised for modern systems. Doing so
   may result in lower performance, and your Ceph cluster may well be
   happier with a daemon that crashes vs one that slows to a crawl.

When using the legacy FileStore back end, the OS page cache was used for caching
data, so tuning was not normally needed. When using the legacy FileStore backend,
the OSD memory consumption was related to the number of PGs per daemon in the
system.

Data Storage
------------

Plan your data storage configuration carefully. There are significant cost and
performance tradeoffs to consider when planning for data storage. Simultaneous
OS operations and simultaneous requests from multiple daemons for read and
write operations against a single drive can impact performance.

OSDs require substantial storage drive space for RADOS data. We recommend a
minimum drive size of 1 terabyte. OSD drives much smaller than one terabyte
use a significant fraction of their capacity for metadata, and drives smaller
than 100 gigabytes will not be effective at all.

It is *strongly* suggested that (enterprise-class) SSDs are provisioned for, at a
minimum, Ceph Monitor and Ceph Manager hosts, as well as CephFS Metadata Server
metadata pools and Ceph Object Gateway (RGW) index pools, even if HDDs are to
be provisioned for bulk OSD data.

To get the best performance out of Ceph, provision the following on separate
drives:

* The operating system
* OSD data
* BlueStore WAL+DB

For more information on how to effectively use a mix of fast drives and slow
drives in your Ceph cluster, see the `block and block.db`_ section of the
Bluestore Configuration Reference.
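
One way to express this split when creating OSDs by hand is with
``ceph-volume``. This is a minimal sketch that assumes a hypothetical HDD
``/dev/sdb`` for data and an NVMe partition ``/dev/nvme0n1p1`` for the DB/WAL;
orchestrator-based deployments express the same layout with service specs:

.. code-block:: console

   # Create a BlueStore OSD with data on an HDD and its DB (and WAL) on NVMe
   ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1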

Hard Disk Drives
----------------

Consider carefully the cost-per-gigabyte advantage
of larger disks. We recommend dividing the price of the disk drive by the
number of gigabytes to arrive at a cost per gigabyte, because larger drives may
have a significant impact on the cost-per-gigabyte. For example, a 1 terabyte

per gigabyte (i.e., $150 / 3072 = 0.0488). In the foregoing example, using the
1 terabyte disks would generally increase the cost per gigabyte by
40%--rendering your cluster substantially less cost efficient.

.. tip:: Hosting multiple OSDs on a single SAS / SATA HDD
   is **NOT** a good idea.

.. tip:: Hosting an OSD with monitor, manager, or MDS data on a single
   drive is also **NOT** a good idea.

.. tip:: With spinning disks, the SATA and SAS interface increasingly

Storage drives are subject to limitations on seek time, access time, read and
write times, as well as total throughput. These physical limitations affect
overall system performance--especially during recovery. We recommend using a
dedicated (ideally mirrored) drive for the operating system and software, and
one drive for each Ceph OSD Daemon you run on the host.
Many "slow OSD" issues (when they are not attributable to hardware failure)
arise from running an operating system and multiple OSDs on the same drive.
Also be aware that today's 22TB HDD uses the same SATA interface as a
3TB HDD from ten years ago: more than seven times the data to squeeze
through the same interface. For this reason, when using HDDs for
OSDs, drives larger than 8TB may be best suited for storage of large
files / objects that are not at all performance-sensitive.

Solid State Drives
------------------

Ceph performance is much improved when using solid-state drives (SSDs). This
reduces random access time and reduces latency while increasing throughput.

SSDs cost more per gigabyte than do HDDs but SSDs often offer
access times that are, at a minimum, 100 times faster than HDDs.
SSDs avoid hotspot issues and bottleneck issues within busy clusters, and
they may offer better economics when TCO is evaluated holistically. Notably,
the amortized drive cost for a given number of IOPS is much lower with SSDs
than with HDDs. SSDs do not suffer rotational or seek latency and in addition
to improved client performance, they substantially improve the speed and
client impact of cluster changes including rebalancing when OSDs or Monitors
are added, removed, or fail.

SSDs do not have moving mechanical parts, so they are not subject
to many of the limitations of HDDs. SSDs do have significant
limitations though. When evaluating SSDs, it is important to consider the
performance of sequential and random reads and writes.

.. important:: We recommend exploring the use of SSDs to improve performance.
   However, before making a significant investment in SSDs, we **strongly

   SSD in a test configuration in order to gauge performance.

Relatively inexpensive SSDs may appeal to your sense of economy. Use caution.
Acceptable IOPS are not the only factor to consider when selecting SSDs for
use with Ceph. Bargain SSDs are often a false economy: they may experience
"cliffing", which means that after an initial burst, sustained performance
once a limited cache is filled declines considerably. Consider also durability:
a drive rated for 0.3 Drive Writes Per Day (DWPD or equivalent) may be fine for
OSDs dedicated to certain types of sequentially-written read-mostly data, but
is not a good choice for Ceph Monitor duty. Enterprise-class SSDs are best
for Ceph: they almost always feature power loss protection (PLP) and do
not suffer the dramatic cliffing that client (desktop) models may experience.

When using a single (or mirrored pair) SSD for both operating system boot
and Ceph Monitor / Manager purposes, a minimum capacity of 256GB is advised
and at least 480GB is recommended. A drive model rated at 1+ DWPD (or the
equivalent in TBW (TeraBytes Written)) is suggested. However, for a given write
workload, a larger drive than technically required will provide more endurance
because it effectively has greater overprovisioning. We stress that
enterprise-class drives are best for production use, as they feature power
loss protection and increased durability compared to client (desktop) SKUs
that are intended for much lighter and intermittent duty cycles.

SSDs have historically been cost prohibitive for object storage, but
QLC SSDs are closing the gap, offering greater density with lower power
consumption and less power spent on cooling. Also, HDD OSDs may see a
significant write latency improvement by offloading WAL+DB onto an SSD.
Many Ceph OSD deployments do not require an SSD with greater endurance than
1 DWPD (aka "read-optimized"). "Mixed-use" SSDs in the 3 DWPD class are
often overkill for this purpose and cost significantly more.

To get a better sense of the factors that determine the total cost of storage,
you might use the `Storage Networking Industry Association's Total Cost of
Ownership calculator`_

Partition Alignment
~~~~~~~~~~~~~~~~~~~

CephFS Metadata Segregation
~~~~~~~~~~~~~~~~~~~~~~~~~~~

One way that Ceph accelerates CephFS file system performance is by separating
the storage of CephFS metadata from the storage of the CephFS file contents.
Ceph provides a default ``metadata`` pool for CephFS metadata. You will never
have to manually create a pool for CephFS metadata, but you can create a CRUSH map
hierarchy for your CephFS metadata pool that includes only SSD storage media.
See :ref:`CRUSH Device Class<crush-map-device-class>` for details.
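
A minimal sketch of how such a hierarchy can be applied with a
device-class-aware CRUSH rule (the rule name ``ssd-only`` and the pool name
``cephfs_metadata`` are placeholders; your metadata pool may be named
differently):

.. code-block:: console

   # Create a replicated CRUSH rule that selects only OSDs backed by SSDs
   ceph osd crush rule create-replicated ssd-only default host ssd

   # Point the CephFS metadata pool at that rule
   ceph osd pool set cephfs_metadata crush_rule ssd-only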

Disk controllers (HBAs) can have a significant impact on write throughput.
Carefully consider your selection of HBAs to ensure that they do not create a
performance bottleneck. Notably, RAID-mode (IR) HBAs may exhibit higher latency
than simpler "JBOD" (IT) mode HBAs. The RAID SoC, write cache, and battery
backup can substantially increase hardware and maintenance costs. Many RAID
HBAs can be configured with an IT-mode "personality" or "JBOD mode" for
streamlined operation.

You do not need an RoC (RAID-capable) HBA. ZFS or Linux MD software mirroring
serve well for boot volume durability. When using SAS or SATA data drives,
forgoing HBA RAID capabilities can reduce the gap between HDD and SSD
media cost. Moreover, when using NVMe SSDs, you do not need *any* HBA. This
additionally reduces the HDD vs SSD cost gap when the system as a whole is
considered. The initial cost of a fancy RAID HBA plus onboard cache plus
battery backup (BBU or supercapacitor) can easily exceed 1000 US dollars
even after discounts - a sum that goes a long way toward SSD cost parity.
An HBA-free system may also cost hundreds of US dollars less every year if one
purchases an annual maintenance contract or extended warranty.

.. tip:: The `Ceph blog`_ is often an excellent source of information on Ceph
   performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write

Benchmarking
------------

BlueStore opens storage devices with ``O_DIRECT`` and issues ``fsync()``
frequently to ensure that data is safely persisted to media. You can evaluate a
drive's low-level write performance using ``fio``. For example, 4kB random write
performance is measured as follows:

.. code-block:: console

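   # Illustrative invocation only: the exact command was elided from this diff.
   # /dev/sdX is a placeholder; the test writes to the raw device and is destructive.
   fio --filename=/dev/sdX --name=4k-randwrite --ioengine=libaio --direct=1 \
       --fsync=1 --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
       --runtime=60 --time_based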

Write Caches
------------

Enterprise SSDs and HDDs normally include power loss protection features which
ensure data durability when power is lost while operating, and
use multi-level caches to speed up direct or synchronous writes. These devices
can be toggled between two caching modes -- a volatile cache flushed to
persistent media with fsync, or a non-volatile cache written synchronously.

These two modes are selected by either "enabling" or "disabling" the write
(volatile) cache. When the volatile cache is enabled, Linux uses a device in
"write back" mode, and when disabled, it uses "write through".

The default configuration (usually: caching is enabled) may not be optimal, and
OSD performance may be dramatically increased in terms of increased IOPS and
decreased commit latency by disabling this write cache.

Users are therefore encouraged to benchmark their devices with ``fio`` as
described earlier and persist the optimal cache configuration for their

The write cache can be disabled with those same tools:

Write cache disabled

In most cases, disabling this cache using ``hdparm``, ``sdparm``, or ``smartctl``
results in the cache_type changing automatically to "write through". If this is
not the case, you can try setting it directly as follows. (Users should ensure
that setting cache_type also correctly persists the caching mode of the device
until the next reboot as some drives require this to be repeated at every boot):

.. code-block:: console

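   # Illustrative example only: the original commands were elided from this diff.
   # 0:0:0:0 is a placeholder SCSI device id; confirm the correct path for your drive.
   echo "write through" | sudo tee /sys/class/scsi_disk/0:0:0:0/cache_type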

Additional Considerations
-------------------------

Ceph operators typically provision multiple OSDs per host, but you should
ensure that the aggregate throughput of your OSD drives doesn't exceed the
network bandwidth required to service a client's read and write operations.
You should also consider each host's percentage of the cluster's overall
capacity. If the percentage located on a particular host is large and the host
fails, it can lead to problems such as recovery causing OSDs to exceed the
``full ratio``, which in turn causes Ceph to halt operations to prevent data
loss.

When you run multiple OSDs per host, you also need to ensure that the kernel
is up to date. See `OS Recommendations`_ for notes on ``glibc`` and

Networks
========

Provision at least 10 Gb/s networking in your datacenter, both among Ceph
hosts and between clients and your Ceph cluster. Network link active/active
bonding across separate network switches is strongly recommended both for
increased throughput and for tolerance of network failures and maintenance.
Take care that your bonding hash policy distributes traffic across links.
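
As an illustration of the hash-policy point, a non-persistent LACP bond can be
sketched with ``iproute2`` as follows (``bond0``, ``eth0``, and ``eth1`` are
placeholder names, and ``layer3+4`` is only one reasonable policy; use the
network stack and policy appropriate for your environment):

.. code-block:: console

   # Create an LACP (802.3ad) bond that hashes on layer3+4 so flows spread across links
   ip link add bond0 type bond mode 802.3ad xmit_hash_policy layer3+4
   ip link set eth0 down
   ip link set eth0 master bond0
   ip link set eth1 down
   ip link set eth1 master bond0
   ip link set bond0 up

   # Verify the negotiated mode and the hash policy
   cat /proc/net/bonding/bond0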

Speed
-----

It takes three hours to replicate 1 TB of data across a 1 Gb/s network and it
takes thirty hours to replicate 10 TB across a 1 Gb/s network. But it takes only
twenty minutes to replicate 1 TB across a 10 Gb/s network, and it takes
only one hour to replicate 10 TB across a 10 Gb/s network.
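
As a rough back-of-the-envelope check (assuming roughly 100 MB/s of usable
payload throughput on a 1 Gb/s link after protocol overhead): 10 TB / 100 MB/s
is about 100,000 seconds, a little under thirty hours, in line with the figure
above.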

Note that a 40 Gb/s network link is effectively four 10 Gb/s channels in
parallel, and that a 100Gb/s network link is effectively four 25 Gb/s channels
in parallel. Thus, and perhaps somewhat counterintuitively, an individual
packet on a 25 Gb/s network has slightly lower latency compared to a 40 Gb/s
network.

Cost
----

The larger the Ceph cluster, the more common OSD failures will be.
The faster that a placement group (PG) can recover from a degraded state to
an ``active + clean`` state, the better. Notably, fast recovery minimizes
the likelihood of multiple, overlapping failures that can cause data to become
temporarily unavailable or even lost. Of course, when provisioning your

switches. The added expense of this hardware may be offset by the operational
cost savings on network setup and maintenance. When using VLANs to handle VM
traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack,
etc.), there is additional value in using 10 Gb/s Ethernet or better; 40 Gb/s or
increasingly 25/50/100 Gb/s networking as of 2022 is common for production clusters.

Top-of-rack (TOR) switches also need fast and redundant uplinks to
core / spine network switches or routers, often at least 40 Gb/s.

Baseboard Management Controller (BMC)
-------------------------------------

Administration and deployment tools may also use BMCs extensively, especially
via IPMI or Redfish, so consider the cost/benefit tradeoff of an out-of-band
network for security and administration. Hypervisor SSH access, VM image uploads,
OS image installs, management sockets, etc. can impose significant loads on a network.
Running multiple networks may seem like overkill, but each traffic path represents
a potential capacity, throughput and/or performance bottleneck that you should
carefully consider before deploying a large scale data cluster.

Additionally, BMCs as of 2023 rarely sport network connections faster than 1 Gb/s,
so dedicated and inexpensive 1 Gb/s switches for BMC administrative traffic
may reduce costs by wasting fewer expensive ports on faster host switches.

Failure Domains
===============

A failure domain can be thought of as any component loss that prevents access to
one or more OSDs or other Ceph daemons. These could be a stopped daemon on a host;
a storage drive failure, an OS crash, a malfunctioning NIC, a failed power supply,
a network outage, a power outage, and so forth. When planning your hardware
deployment, you must balance the risk of reducing costs by placing too many
responsibilities into too few failure domains against the added costs of
isolating every potential failure domain.

Minimum Hardware Recommendations
================================

Ceph can run on inexpensive commodity hardware. Small production clusters
and development clusters can run successfully with modest hardware. As
we noted above: when we speak of CPU *cores*, we mean *threads* when
hyperthreading (HT) is enabled. Each modern physical x64 CPU core typically
provides two logical CPU threads; other CPU architectures may vary.

Take care that there are many factors that influence resource choices. The
minimum resources that suffice for one purpose will not necessarily suffice for
another. A sandbox cluster with one OSD built on a laptop with VirtualBox or on
a trio of Raspberry Pis will get by with fewer resources than a production
deployment with a thousand OSDs serving five thousand RBD clients. The
classic Fisher Price PXL 2000 captures video, as does an IMAX or RED camera.
One would not expect the former to do the job of the latter. We especially
cannot stress enough the criticality of using enterprise-quality storage
media for production workloads.

Additional insights into resource planning for production clusters are
found above and elsewhere within this documentation.

+--------------+----------------+-----------------------------------------+
| Process      | Criteria       | Bare Minimum and Recommended            |
+==============+================+=========================================+
| ``ceph-osd`` | Processor      | - 1 core minimum, 2 recommended         |
|              |                | - 1 core per 200-500 MB/s throughput    |
|              |                | - 1 core per 1000-3000 IOPS             |
|              |                |                                         |
|              |                | * Results are before replication.       |
|              |                | * Results may vary across CPU and drive |
|              |                |   models and Ceph configuration:        |
|              |                |   (erasure coding, compression, etc)    |
|              |                | * ARM processors specifically may       |
|              |                |   require more cores for performance.   |
|              |                | * SSD OSDs, especially NVMe, will       |
|              |                |   benefit from additional cores per OSD.|
|              |                | * Actual performance depends on many    |
|              |                |   factors including drives, net, and    |
|              |                |   client throughput and latency.        |
|              |                |   Benchmarking is highly recommended.   |
|              +----------------+-----------------------------------------+
|              | RAM            | - 4GB+ per daemon (more is better)      |
|              |                | - 2-4GB may function but may be slow    |
|              |                | - Less than 2GB is not recommended      |
|              +----------------+-----------------------------------------+
|              | Storage Drives | 1x storage drive per OSD                |
|              +----------------+-----------------------------------------+
|              | DB/WAL         | 1x SSD partition per HDD OSD            |
|              | (optional)     | 4-5x HDD OSDs per DB/WAL SATA SSD       |
|              |                | <= 10 HDD OSDs per DB/WAL NVMe SSD      |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1Gb/s (bonded 10+ Gb/s recommended)  |
+--------------+----------------+-----------------------------------------+
| ``ceph-mon`` | Processor      | - 2 cores minimum                       |
|              +----------------+-----------------------------------------+
|              | RAM            | 5GB+ per daemon (large / production     |
|              |                | clusters need more)                     |
|              +----------------+-----------------------------------------+
|              | Storage        | 100 GB per daemon, SSD is recommended   |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1Gb/s (10+ Gb/s recommended)         |
+--------------+----------------+-----------------------------------------+
| ``ceph-mds`` | Processor      | - 2 cores minimum                       |
|              +----------------+-----------------------------------------+
|              | RAM            | 2GB+ per daemon (more for production)   |
|              +----------------+-----------------------------------------+
|              | Disk Space     | 1 GB per daemon                         |
|              +----------------+-----------------------------------------+
|              | Network        | 1x 1Gb/s (10+ Gb/s recommended)         |
+--------------+----------------+-----------------------------------------+

.. tip:: If you are running an OSD node with a single storage drive, create a
   partition for your OSD that is separate from the partition
   containing the OS. We recommend separate drives for the
   OS and for OSD storage.