doc/start: Modernize and clarify hardware-recommendations.rst

Signed-off-by: Anthony D'Atri <anthonyeleven@users.noreply.github.com>
This commit is contained in:
Anthony D'Atri 2023-10-15 21:27:44 -04:00
parent 9b943423e5
commit 83bd3a8dfb


@ -1,66 +1,83 @@
.. _hardware-recommendations:
==========================
Hardware Recommendations
==========================
Ceph was designed to run on commodity hardware, which makes building and
maintaining petabyte-scale data clusters economically feasible.
When planning out your cluster hardware, you will need to balance a number
of considerations, including failure domains and potential performance
issues. Hardware planning should include distributing Ceph daemons and
Ceph is designed to run on commodity hardware, which makes building and
maintaining petabyte-scale data clusters flexible and economically feasible.
When planning your cluster's hardware, you will need to balance a number
of considerations, including failure domains, cost, and performance.
Hardware planning should include distributing Ceph daemons and
other processes that use Ceph across many hosts. Generally, we recommend
running Ceph daemons of a specific type on a host configured for that type
of daemon. We recommend using other hosts for processes that utilize your
data cluster (e.g., OpenStack, CloudStack, etc).
of daemon. We recommend using separate hosts for processes that utilize your
data cluster (e.g., OpenStack, CloudStack, Kubernetes, etc).
The requirements of one Ceph cluster are not the same as the requirements of
another, but below are some general guidelines.
.. tip:: Check out the `Ceph blog`_ too.
CPU
===
CephFS metadata servers (MDS) are CPU-intensive. CephFS metadata servers (MDS)
should therefore have quad-core (or better) CPUs and high clock rates (GHz). OSD
nodes need enough processing power to run the RADOS service, to calculate data
CephFS Metadata Servers (MDS) are CPU-intensive. They are
single-threaded and perform best with CPUs with a high clock rate (GHz). MDS
servers do not need a large number of CPU cores unless they are also hosting other
services, such as SSD OSDs for the CephFS metadata pool.
OSD nodes need enough processing power to run the RADOS service, to calculate data
placement with CRUSH, to replicate data, and to maintain their own copies of the
cluster map.
The requirements of one Ceph cluster are not the same as the requirements of
another, but here are some general guidelines.
In earlier versions of Ceph, we would make hardware recommendations based on
the number of cores per OSD, but this cores-per-OSD metric is no longer as
useful a metric as the number of cycles per IOP and the number of IOPs per OSD.
For example, for NVMe drives, Ceph can easily utilize five or six cores on real
With earlier releases of Ceph, we would make hardware recommendations based on
the number of cores per OSD, but this cores-per-OSD metric is no longer as
useful a metric as the number of cycles per IOP and the number of IOPS per OSD.
For example, with NVMe OSD drives, Ceph can easily utilize five or six cores on real
clusters and up to about fourteen cores on single OSDs in isolation. So cores
per OSD are no longer as pressing a concern as they were. When selecting
hardware, select for IOPs per core.
hardware, select for IOPS per core.
Monitor nodes and manager nodes have no heavy CPU demands and require only
modest processors. If your host machines will run CPU-intensive processes in
.. tip:: When we speak of CPU *cores*, we mean *threads* when hyperthreading
is enabled. Hyperthreading is usually beneficial for Ceph servers.
Monitor nodes and Manager nodes do not have heavy CPU demands and require only
modest processors. If your hosts will run CPU-intensive processes in
addition to Ceph daemons, make sure that you have enough processing power to
run both the CPU-intensive processes and the Ceph daemons. (OpenStack Nova is
one such example of a CPU-intensive process.) We recommend that you run
one example of a CPU-intensive process.) We recommend that you run
non-Ceph CPU-intensive processes on separate hosts (that is, on hosts that are
not your monitor and manager nodes) in order to avoid resource contention.
not your Monitor and Manager nodes) in order to avoid resource contention.
If your cluster deploys the Ceph Object Gateway, RGW daemons may co-reside
with your Mon and Manager services if the nodes have sufficient resources.
RAM
===
Generally, more RAM is better. Monitor / manager nodes for a modest cluster
Generally, more RAM is better. Monitor / Manager nodes for a modest cluster
might do fine with 64GB; for a larger cluster with hundreds of OSDs 128GB
is a reasonable target. There is a memory target for BlueStore OSDs that
is advised.
.. tip:: When we speak of RAM and storage requirements, we often describe
the needs of a single daemon of a given type. A given server as
a whole will thus need at least the sum of the needs of the
daemons that it hosts as well as resources for logs and other operating
system components. Keep in mind that a server's need for RAM
and storage will be greater at startup and when components
fail or are added and the cluster rebalances. In other words,
allow headroom past what you might see used during a calm period
on a small initial cluster footprint.
There is an :confval:`osd_memory_target` setting for BlueStore OSDs that
defaults to 4GB. Factor in a prudent margin for the operating system and
administrative tasks (like monitoring and metrics) as well as increased
consumption during recovery: provisioning ~8GB per BlueStore OSD
is advised.
consumption during recovery: provisioning ~8GB *per BlueStore OSD* is thus
advised.
Monitors and managers (ceph-mon and ceph-mgr)
---------------------------------------------
Monitor and manager daemon memory usage generally scales with the size of the
Monitor and manager daemon memory usage scales with the size of the
cluster. Note that at boot-time and during topology changes and recovery these
daemons will need more RAM than they do during steady-state operation, so plan
for peak usage. For very small clusters, 32 GB suffices. For clusters of up to,
@ -75,8 +92,8 @@ tuning the following settings:
Metadata servers (ceph-mds)
---------------------------
The metadata daemon memory utilization depends on how much memory its cache is
configured to consume. We recommend 1 GB as a minimum for most systems. See
CephFS metadata daemon memory utilization depends on the configured size of
its cache. We recommend 1 GB as a minimum for most systems. See
:confval:`mds_cache_memory_limit`.
@ -88,23 +105,24 @@ operating system's page cache. In Bluestore you can adjust the amount of memory
that the OSD attempts to consume by changing the :confval:`osd_memory_target`
configuration option.
- Setting the :confval:`osd_memory_target` below 2GB is typically not
recommended (Ceph may fail to keep the memory consumption under 2GB and
this may cause extremely slow performance).
- Setting the :confval:`osd_memory_target` below 2GB is not
recommended. Ceph may fail to keep the memory consumption under 2GB and
extremely slow performance is likely.
- Setting the memory target between 2GB and 4GB typically works but may result
in degraded performance: metadata may be read from disk during IO unless the
active data set is relatively small.
in degraded performance: metadata may need to be read from disk during IO
unless the active data set is relatively small.
- 4GB is the current default :confval:`osd_memory_target` size. This default
was chosen for typical use cases, and is intended to balance memory
requirements and OSD performance.
- 4GB is the current default value for :confval:`osd_memory_target`. This default
was chosen for typical use cases, and is intended to balance RAM cost and
OSD performance.
- Setting the :confval:`osd_memory_target` higher than 4GB can improve
performance when there are many (small) objects or when large (256GB/OSD
or more) data sets are processed.
or more) data sets are processed. This is especially true with fast
NVMe OSDs.
.. important:: OSD memory autotuning is "best effort". Although the OSD may
.. important:: OSD memory management is "best effort". Although the OSD may
unmap memory to allow the kernel to reclaim it, there is no guarantee that
the kernel will actually reclaim freed memory within a specific time
frame. This applies especially in older versions of Ceph, where transparent
@ -113,14 +131,19 @@ configuration option.
pages at the application level to avoid this, but that does not
guarantee that the kernel will immediately reclaim unmapped memory. The OSD
may still at times exceed its memory target. We recommend budgeting
approximately 20% extra memory on your system to prevent OSDs from going OOM
at least 20% extra memory on your system to prevent OSDs from going OOM
(**O**\ut **O**\f **M**\emory) during temporary spikes or due to delay in
the kernel reclaiming freed pages. That 20% value might be more or less than
needed, depending on the exact configuration of the system.
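The memory guidance above can be turned into a rough per-host estimate. The
following is a minimal sketch, not an official sizing tool: it assumes a host
that runs only OSDs, uses the 4GB default target and the 20% margin from the
text, and adds a hypothetical ``os_reserve_gb`` allowance for the operating
system that is not taken from this document. The function names are
illustrative only.

```python
def osd_host_ram_gb(num_osds, osd_memory_target_gb=4.0,
                    headroom=0.20, os_reserve_gb=16.0):
    """Rough RAM estimate (GB) for a host that runs only OSDs.

    osd_memory_target_gb: the 4GB default osd_memory_target.
    headroom: the ~20% margin for temporary spikes and slow page reclaim.
    os_reserve_gb: hypothetical allowance for the OS and tooling (assumption).
    """
    if num_osds < 0:
        raise ValueError("num_osds must be non-negative")
    osd_total = num_osds * osd_memory_target_gb
    return osd_total * (1.0 + headroom) + os_reserve_gb


def rule_of_thumb_ram_gb(num_osds, per_osd_gb=8.0):
    """The simpler ~8GB-per-BlueStore-OSD provisioning guideline."""
    return num_osds * per_osd_gb


if __name__ == "__main__":
    for n in (4, 12, 24):
        print(n, "OSDs:",
              round(osd_host_ram_gb(n), 1), "GB (target + headroom),",
              rule_of_thumb_ram_gb(n), "GB (8GB per OSD)")
```

The two estimates differ on purpose: the first starts from the configured
target, while the second is the coarser provisioning guideline. Taking the
larger of the two is the more cautious choice.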
When using the legacy FileStore back end, the page cache is used for caching
data, so no tuning is normally needed. When using the legacy FileStore backend,
the OSD memory consumption is related to the number of PGs per daemon in the
.. tip:: Configuring the operating system with swap to provide additional
virtual memory for daemons is not advised for modern systems. Doing so
may result in lower performance, and your Ceph cluster may well be
happier with a daemon that crashes vs one that slows to a crawl.
When using the legacy FileStore back end, the OS page cache was used for caching
data, so tuning was not normally needed. When using the legacy FileStore backend,
the OSD memory consumption was related to the number of PGs per daemon in the
system.
@ -130,13 +153,34 @@ Data Storage
Plan your data storage configuration carefully. There are significant cost and
performance tradeoffs to consider when planning for data storage. Simultaneous
OS operations and simultaneous requests from multiple daemons for read and
write operations against a single drive can slow performance.
write operations against a single drive can impact performance.
OSDs require substantial storage drive space for RADOS data. We recommend a
minimum drive size of 1 terabyte. OSD drives much smaller than one terabyte
use a significant fraction of their capacity for metadata, and drives smaller
than 100 gigabytes will not be effective at all.
It is *strongly* suggested that (enterprise-class) SSDs are provisioned for, at a
minimum, Ceph Monitor and Ceph Manager hosts, as well as CephFS Metadata Server
metadata pools and Ceph Object Gateway (RGW) index pools, even if HDDs are to
be provisioned for bulk OSD data.
To get the best performance out of Ceph, provision the following on separate
drives:
* The operating systems
* OSD data
* BlueStore WAL+DB
For more
information on how to effectively use a mix of fast drives and slow drives in
your Ceph cluster, see the `block and block.db`_ section of the Bluestore
Configuration Reference.
Hard Disk Drives
----------------
OSDs should have plenty of storage drive space for object data. We recommend a
minimum disk drive size of 1 terabyte. Consider the cost-per-gigabyte advantage
Consider carefully the cost-per-gigabyte advantage
of larger disks. We recommend dividing the price of the disk drive by the
number of gigabytes to arrive at a cost per gigabyte, because larger drives may
have a significant impact on the cost-per-gigabyte. For example, a 1 terabyte
@ -146,11 +190,10 @@ per gigabyte (i.e., $150 / 3072 = 0.0488). In the foregoing example, using the
1 terabyte disks would generally increase the cost per gigabyte by
40%--rendering your cluster substantially less cost efficient.
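The cost-per-gigabyte comparison is simple division, sketched below. The 3 TB
figures ($150, 3072 GB) come from the example above; the 1 TB price is a
hypothetical value chosen only for illustration and is not taken from the
text. Function names are illustrative.

```python
def cost_per_gb(price_usd, capacity_gb):
    """Price of a drive divided by its capacity in gigabytes."""
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    return price_usd / capacity_gb


def percent_increase(small_drive_cpg, large_drive_cpg):
    """How much more (in percent) the smaller drive costs per gigabyte."""
    return (small_drive_cpg / large_drive_cpg - 1.0) * 100.0


if __name__ == "__main__":
    large = cost_per_gb(150.0, 3072)   # 3 TB drive from the example above
    small = cost_per_gb(70.0, 1024)    # hypothetical 1 TB drive price
    print(f"3 TB: ${large:.4f}/GB, 1 TB: ${small:.4f}/GB")
    print(f"increase: {percent_increase(small, large):.0f}%")
```

Drive prices change quickly, so rerun the comparison with current quotes
rather than relying on these sample numbers.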
.. tip:: Running multiple OSDs on a single SAS / SATA drive
is **NOT** a good idea. NVMe drives, however, can achieve
improved performance by being split into two or more OSDs.
.. tip:: Hosting multiple OSDs on a single SAS / SATA HDD
is **NOT** a good idea.
.. tip:: Running an OSD and a monitor or a metadata server on a single
.. tip:: Hosting an OSD with monitor, manager, or MDS data on a single
drive is also **NOT** a good idea.
.. tip:: With spinning disks, the SATA and SAS interface increasingly
@ -162,35 +205,36 @@ Storage drives are subject to limitations on seek time, access time, read and
write times, as well as total throughput. These physical limitations affect
overall system performance--especially during recovery. We recommend using a
dedicated (ideally mirrored) drive for the operating system and software, and
one drive for each Ceph OSD Daemon you run on the host (modulo NVMe above).
one drive for each Ceph OSD Daemon you run on the host.
Many "slow OSD" issues (when they are not attributable to hardware failure)
arise from running an operating system and multiple OSDs on the same drive.
Also be aware that today's 22TB HDD uses the same SATA interface as a
3TB HDD from ten years ago: more than seven times the data to squeeze
through the same interface. For this reason, when using HDDs for
OSDs, drives larger than 8TB may be best suited for storage of large
files / objects that are not at all performance-sensitive.
It is technically possible to run multiple Ceph OSD Daemons per SAS / SATA
drive, but this will lead to resource contention and diminish overall
throughput.
To get the best performance out of Ceph, run the following on separate drives:
(1) operating systems, (2) OSD data, and (3) BlueStore db. For more
information on how to effectively use a mix of fast drives and slow drives in
your Ceph cluster, see the `block and block.db`_ section of the Bluestore
Configuration Reference.
Solid State Drives
------------------
Ceph performance can be improved by using solid-state drives (SSDs). This
reduces random access time and reduces latency while accelerating throughput.
Ceph performance is much improved when using solid-state drives (SSDs). This
reduces random access time and reduces latency while increasing throughput.
SSDs cost more per gigabyte than do hard disk drives, but SSDs often offer
access times that are, at a minimum, 100 times faster than hard disk drives.
SSDs cost more per gigabyte than do HDDs but SSDs often offer
access times that are, at a minimum, 100 times faster than HDDs.
SSDs avoid hotspot issues and bottleneck issues within busy clusters, and
they may offer better economics when TCO is evaluated holistically.
they may offer better economics when TCO is evaluated holistically. Notably,
the amortized drive cost for a given number of IOPS is much lower with SSDs
than with HDDs. SSDs do not suffer rotational or seek latency and in addition
to improved client performance, they substantially improve the speed and
client impact of cluster changes including rebalancing when OSDs or Monitors
are added, removed, or fail.
SSDs do not have moving mechanical parts, so they are not necessarily subject
to the same types of limitations as hard disk drives. SSDs do have significant
SSDs do not have moving mechanical parts, so they are not subject
to many of the limitations of HDDs. SSDs do have significant
limitations though. When evaluating SSDs, it is important to consider the
performance of sequential reads and writes.
performance of sequential and random reads and writes.
.. important:: We recommend exploring the use of SSDs to improve performance.
However, before making a significant investment in SSDs, we **strongly
@ -198,16 +242,36 @@ performance of sequential reads and writes.
SSD in a test configuration in order to gauge performance.
Relatively inexpensive SSDs may appeal to your sense of economy. Use caution.
Acceptable IOPS are not the only factor to consider when selecting an SSD for
use with Ceph.
Acceptable IOPS are not the only factor to consider when selecting SSDs for
use with Ceph. Bargain SSDs are often a false economy: they may experience
"cliffing", which means that after an initial burst, sustained performance
once a limited cache is filled declines considerably. Consider also durability:
a drive rated for 0.3 Drive Writes Per Day (DWPD or equivalent) may be fine for
OSDs dedicated to certain types of sequentially-written read-mostly data, but
is not a good choice for Ceph Monitor duty. Enterprise-class SSDs are best
for Ceph: they almost always feature power loss protection (PLP) and do
not suffer the dramatic cliffing that client (desktop) models may experience.
SSDs have historically been cost prohibitive for object storage, but emerging
QLC drives are closing the gap, offering greater density with lower power
consumption and less power spent on cooling. HDD OSDs may see a significant
performance improvement by offloading WAL+DB onto an SSD.
When using a single (or mirrored pair) SSD for both operating system boot
and Ceph Monitor / Manager purposes, a minimum capacity of 256GB is advised
and at least 480GB is recommended. A drive model rated at 1+ DWPD (or the
equivalent in TBW (TeraBytes Written)) is suggested. However, for a given write
workload, a larger drive than technically required will provide more endurance
because it effectively has greater overprovisioning. We stress that
enterprise-class drives are best for production use, as they feature power
loss protection and increased durability compared to client (desktop) SKUs
that are intended for much lighter and intermittent duty cycles.
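Endurance ratings in DWPD and TBW can be converted into one another, which
helps when comparing data sheets. The sketch below uses the commonly quoted
relationship TBW = DWPD x capacity x 365 x warranty years. The five-year
warranty period is an assumption (vendors differ), and the function names are
illustrative.

```python
def dwpd_to_tbw(dwpd, capacity_tb, warranty_years=5.0):
    """Total terabytes written over the warranty period for a DWPD rating."""
    return dwpd * capacity_tb * 365.0 * warranty_years


def tbw_to_dwpd(tbw, capacity_tb, warranty_years=5.0):
    """Drive writes per day implied by a TBW rating."""
    if capacity_tb <= 0 or warranty_years <= 0:
        raise ValueError("capacity and warranty period must be positive")
    return tbw / (capacity_tb * 365.0 * warranty_years)


if __name__ == "__main__":
    # A 480GB (0.48 TB) drive rated at 1 DWPD, assuming a 5-year warranty.
    print(round(dwpd_to_tbw(1.0, 0.48), 1), "TBW")
    # The same write budget on a drive twice the size needs half the DWPD,
    # which is why a larger drive offers more endurance for a given workload.
    print(round(tbw_to_dwpd(876.0, 0.96), 2), "DWPD")
```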
To get a better sense of the factors that determine the cost of storage, you
might use the `Storage Networking Industry Association's Total Cost of
SSDs have historically been cost prohibitive for object storage, but
QLC SSDs are closing the gap, offering greater density with lower power
consumption and less power spent on cooling. Also, HDD OSDs may see a
significant write latency improvement by offloading WAL+DB onto an SSD.
Many Ceph OSD deployments do not require an SSD with greater endurance than
1 DWPD (aka "read-optimized"). "Mixed-use" SSDs in the 3 DWPD class are
often overkill for this purpose and cost significantly more.
To get a better sense of the factors that determine the total cost of storage,
you might use the `Storage Networking Industry Association's Total Cost of
Ownership calculator`_
Partition Alignment
@ -222,11 +286,11 @@ alignment and example commands that show how to align partitions properly, see
CephFS Metadata Segregation
~~~~~~~~~~~~~~~~~~~~~~~~~~~
One way that Ceph accelerates CephFS file system performance is by segregating
One way that Ceph accelerates CephFS file system performance is by separating
the storage of CephFS metadata from the storage of the CephFS file contents.
Ceph provides a default ``metadata`` pool for CephFS metadata. You will never
have to create a pool for CephFS metadata, but you can create a CRUSH map
hierarchy for your CephFS metadata pool that points only to SSD storage media.
have to manually create a pool for CephFS metadata, but you can create a CRUSH map
hierarchy for your CephFS metadata pool that includes only SSD storage media.
See :ref:`CRUSH Device Class<crush-map-device-class>` for details.
@ -237,8 +301,20 @@ Disk controllers (HBAs) can have a significant impact on write throughput.
Carefully consider your selection of HBAs to ensure that they do not create a
performance bottleneck. Notably, RAID-mode (IR) HBAs may exhibit higher latency
than simpler "JBOD" (IT) mode HBAs. The RAID SoC, write cache, and battery
backup can substantially increase hardware and maintenance costs. Some RAID
HBAs can be configured with an IT-mode "personality".
backup can substantially increase hardware and maintenance costs. Many RAID
HBAs can be configured with an IT-mode "personality" or "JBOD mode" for
streamlined operation.
You do not need an RoC (RAID-capable) HBA. ZFS or Linux MD software mirroring
serve well for boot volume durability. When using SAS or SATA data drives,
forgoing HBA RAID capabilities can reduce the gap between HDD and SSD
media cost. Moreover, when using NVMe SSDs, you do not need *any* HBA. This
additionally reduces the HDD vs SSD cost gap when the system as a whole is
considered. The initial cost of a fancy RAID HBA plus onboard cache plus
battery backup (BBU or supercapacitor) can easily exceed 1000 US
dollars even after discounts - a sum that goes a long way toward SSD cost parity.
An HBA-free system may also cost hundreds of US dollars less every year if one
purchases an annual maintenance contract or extended warranty.
.. tip:: The `Ceph blog`_ is often an excellent source of information on Ceph
performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write
@ -248,10 +324,10 @@ HBAs can be configured with an IT-mode "personality".
Benchmarking
------------
BlueStore opens block devices in O_DIRECT and uses fsync frequently to ensure
that data is safely persisted to media. You can evaluate a drive's low-level
write performance using ``fio``. For example, 4kB random write performance is
measured as follows:
BlueStore opens storage devices with ``O_DIRECT`` and issues ``fsync()``
frequently to ensure that data is safely persisted to media. You can evaluate a
drive's low-level write performance using ``fio``. For example, 4kB random write
performance is measured as follows:
.. code-block:: console
@ -261,6 +337,7 @@ Write Caches
------------
Enterprise SSDs and HDDs normally include power loss protection features which
ensure data durability when power is lost while operating, and
use multi-level caches to speed up direct or synchronous writes. These devices
can be toggled between two caching modes -- a volatile cache flushed to
persistent media with fsync, or a non-volatile cache written synchronously.
@ -269,9 +346,9 @@ These two modes are selected by either "enabling" or "disabling" the write
(volatile) cache. When the volatile cache is enabled, Linux uses a device in
"write back" mode, and when disabled, it uses "write through".
The default configuration (normally caching enabled) may not be optimal, and
The default configuration (usually: caching is enabled) may not be optimal, and
OSD performance may be dramatically increased in terms of increased IOPS and
decreased commit_latency by disabling the write cache.
decreased commit latency by disabling this write cache.
Users are therefore encouraged to benchmark their devices with ``fio`` as
described earlier and persist the optimal cache configuration for their
@ -319,11 +396,11 @@ The write cache can be disabled with those same tools:
=== START OF ENABLE/DISABLE COMMANDS SECTION ===
Write cache disabled
Normally, disabling the cache using ``hdparm``, ``sdparm``, or ``smartctl``
In most cases, disabling this cache using ``hdparm``, ``sdparm``, or ``smartctl``
results in the cache_type changing automatically to "write through". If this is
not the case, you can try setting it directly as follows. (Users should note
not the case, you can try setting it directly as follows. (Users should ensure
that setting cache_type also correctly persists the caching mode of the device
until the next reboot):
until the next reboot as some drives require this to be repeated at every boot):
.. code-block:: console
@ -367,13 +444,13 @@ until the next reboot):
Additional Considerations
-------------------------
You typically will run multiple OSDs per host, but you should ensure that the
aggregate throughput of your OSD drives doesn't exceed the network bandwidth
required to service a client's need to read or write data. You should also
consider what percentage of the overall data the cluster stores on each host. If
the percentage on a particular host is large and the host fails, it can lead to
problems such as exceeding the ``full ratio``, which causes Ceph to halt
operations as a safety precaution that prevents data loss.
Ceph operators typically provision multiple OSDs per host, but you should
ensure that the aggregate throughput of your OSD drives doesn't exceed the
network bandwidth required to service a client's read and write operations.
You should also consider each host's percentage of the cluster's overall capacity. If
the percentage located on a particular host is large and the host fails, it
can lead to problems such as recovery causing OSDs to exceed the ``full ratio``,
which in turn causes Ceph to halt operations to prevent data loss.
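The host-failure concern above can be checked with simple arithmetic. This is
a simplified sketch under stated assumptions: all hosts have equal capacity,
data is spread evenly, the failure domain is the host, and the full ratio is
0.95. That value is assumed here as a commonly seen default; check your
cluster's actual setting. The function names are illustrative.

```python
def utilization_after_host_loss(used_fraction, num_hosts):
    """Average utilization once one host's data is recovered onto the rest.

    Assumes equally sized hosts and evenly distributed data.
    """
    if num_hosts < 2:
        raise ValueError("need at least two hosts to recover onto")
    return used_fraction * num_hosts / (num_hosts - 1)


def survives_host_loss(used_fraction, num_hosts, full_ratio=0.95):
    """True if the surviving hosts stay below the (assumed) full ratio."""
    return utilization_after_host_loss(used_fraction, num_hosts) < full_ratio


if __name__ == "__main__":
    for hosts in (3, 4, 10):
        after = utilization_after_host_loss(0.70, hosts)
        print(hosts, "hosts at 70% full ->", round(after * 100, 1), "% after loss,",
              "ok" if survives_host_loss(0.70, hosts) else "exceeds full ratio")
```

Real clusters are rarely perfectly balanced, so individual OSDs can fill up
before the average does. Leave more margin than this estimate suggests.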
When you run multiple OSDs per host, you also need to ensure that the kernel
is up to date. See `OS Recommendations`_ for notes on ``glibc`` and
@ -384,7 +461,11 @@ multiple OSDs per host.
Networks
========
Provision at least 10 Gb/s networking in your racks.
Provision at least 10 Gb/s networking in your datacenter, both among Ceph
hosts and between clients and your Ceph cluster. Network link active/active
bonding across separate network switches is strongly recommended both for
increased throughput and for tolerance of network failures and maintenance.
Take care that your bonding hash policy distributes traffic across links.
Speed
-----
@ -392,13 +473,20 @@ Speed
It takes three hours to replicate 1 TB of data across a 1 Gb/s network and it
takes thirty hours to replicate 10 TB across a 1 Gb/s network. But it takes only
twenty minutes to replicate 1 TB across a 10 Gb/s network, and it takes
only about three hours to replicate 10 TB across a 10 Gb/s network.
Note that a 40 Gb/s network link is effectively four 10 Gb/s channels in
parallel, and that a 100Gb/s network link is effectively four 25 Gb/s channels
in parallel. Thus, and perhaps somewhat counterintuitively, an individual
packet on a 25 Gb/s network has slightly lower latency compared to a 40 Gb/s
network.
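The arithmetic behind replication-time estimates can be sketched as follows.
This computes only the ideal line-rate transfer time, assuming decimal units
(1 TB = 10^12 bytes, 1 Gb/s = 10^9 bits per second) and ignoring protocol
overhead, replication fan-out, and drive limits. Treat the results as lower
bounds rather than predictions. The function name is illustrative.

```python
def transfer_seconds(size_tb, link_gbps):
    """Ideal time in seconds to move size_tb terabytes over one link."""
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    bits = size_tb * 1e12 * 8           # terabytes -> bits
    return bits / (link_gbps * 1e9)     # bits / (bits per second)


if __name__ == "__main__":
    for size_tb in (1, 10):
        for link_gbps in (1, 10, 25):
            hours = transfer_seconds(size_tb, link_gbps) / 3600.0
            print(f"{size_tb:>2} TB over {link_gbps:>2} Gb/s: {hours:.2f} h minimum")
```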
Cost
----
The larger the Ceph cluster, the more common OSD failures will be.
The faster that a placement group (PG) can recover from a ``degraded`` state to
The faster that a placement group (PG) can recover from a degraded state to
an ``active + clean`` state, the better. Notably, fast recovery minimizes
the likelihood of multiple, overlapping failures that can cause data to become
temporarily unavailable or even lost. Of course, when provisioning your
@ -410,10 +498,10 @@ switches. The added expense of this hardware may be offset by the operational
cost savings on network setup and maintenance. When using VLANs to handle VM
traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack,
etc.), there is additional value in using 10 Gb/s Ethernet or better; 40 Gb/s or
25/50/100 Gb/s networking as of 2022 is common for production clusters.
increasingly 25/50/100 Gb/s networking as of 2022 is common for production clusters.
Top-of-rack (TOR) switches also need fast and redundant uplinks to
spine switches / routers, often at least 40 Gb/s.
Top-of-rack (TOR) switches also need fast and redundant uplinks to
core / spine network switches or routers, often at least 40 Gb/s.
Baseboard Management Controller (BMC)
@ -425,78 +513,103 @@ Administration and deployment tools may also use BMCs extensively, especially
via IPMI or Redfish, so consider the cost/benefit tradeoff of an out-of-band
network for security and administration. Hypervisor SSH access, VM image uploads,
OS image installs, management sockets, etc. can impose significant loads on a network.
Running three networks may seem like overkill, but each traffic path represents
Running multiple networks may seem like overkill, but each traffic path represents
a potential capacity, throughput and/or performance bottleneck that you should
carefully consider before deploying a large scale data cluster.
Additionally BMCs as of 2023 rarely sport network connections faster than 1 Gb/s,
so dedicated and inexpensive 1 Gb/s switches for BMC administrative traffic
may reduce costs by wasting fewer expensive ports on faster host switches.
Failure Domains
===============
A failure domain is any failure that prevents access to one or more OSDs. That
could be a stopped daemon on a host; a disk failure, an OS crash, a
malfunctioning NIC, a failed power supply, a network outage, a power outage,
and so forth. When planning out your hardware needs, you must balance the
temptation to reduce costs by placing too many responsibilities into too few
failure domains, and the added costs of isolating every potential failure
domain.
A failure domain can be thought of as any component loss that prevents access to
one or more OSDs or other Ceph daemons. These could be a stopped daemon on a host,
a storage drive failure, an OS crash, a malfunctioning NIC, a failed power supply,
a network outage, a power outage, and so forth. When planning your hardware
deployment, you must balance the risk of reducing costs by placing too many
responsibilities into too few failure domains against the added costs of
isolating every potential failure domain.
Minimum Hardware Recommendations
================================
Ceph can run on inexpensive commodity hardware. Small production clusters
and development clusters can run successfully with modest hardware.
and development clusters can run successfully with modest hardware. As
we noted above: when we speak of CPU *cores*, we mean *threads* when
hyperthreading (HT) is enabled. Each modern physical x64 CPU core typically
provides two logical CPU threads; other CPU architectures may vary.
Keep in mind that many factors influence resource choices. The
minimum resources that suffice for one purpose will not necessarily suffice for
another. A sandbox cluster with one OSD built on a laptop with VirtualBox or on
a trio of Raspberry Pis will get by with fewer resources than a production
deployment with a thousand OSDs serving five thousand RBD clients. The
classic Fisher Price PXL 2000 captures video, as does an IMAX or RED camera.
One would not expect the former to do the job of the latter. We especially
cannot stress enough the criticality of using enterprise-quality storage
media for production workloads.
Additional insights into resource planning for production clusters are
found above and elsewhere within this documentation.
+--------------+----------------+-----------------------------------------+
| Process | Criteria | Minimum Recommended |
| Process | Criteria | Bare Minimum and Recommended |
+==============+================+=========================================+
| ``ceph-osd`` | Processor | - 1 core minimum |
| | | - 1 core per 200-500 MB/s |
| ``ceph-osd`` | Processor | - 1 core minimum, 2 recommended |
| | | - 1 core per 200-500 MB/s throughput |
| | | - 1 core per 1000-3000 IOPS |
| | | |
| | | * Results are before replication. |
| | | * Results may vary with different |
| | | CPU models and Ceph features. |
| | | * Results may vary across CPU and drive |
| | | models and Ceph configuration: |
| | | (erasure coding, compression, etc) |
| | | * ARM processors specifically may |
| | | require additional cores. |
| | | require more cores for performance. |
| | | * SSD OSDs, especially NVMe, will |
| | | benefit from additional cores per OSD.|
| | | * Actual performance depends on many |
| | | factors including drives, net, and |
| | | client throughput and latency. |
| | | Benchmarking is highly recommended. |
| +----------------+-----------------------------------------+
| | RAM | - 4GB+ per daemon (more is better) |
| | | - 2-4GB often functions (may be slow) |
| | | - Less than 2GB not recommended |
| | | - 2-4GB may function but may be slow |
| | | - Less than 2GB is not recommended |
| +----------------+-----------------------------------------+
| | Volume Storage | 1x storage drive per daemon |
| | Storage Drives | 1x storage drive per OSD |
| +----------------+-----------------------------------------+
| | DB/WAL | 1x SSD partition per daemon (optional) |
| | DB/WAL | 1x SSD partition per HDD OSD |
| | (optional) | 4-5x HDD OSDs per DB/WAL SATA SSD |
| | | <= 10 HDD OSDs per DB/WAL NVMe SSD |
| +----------------+-----------------------------------------+
| | Network | 1x 1GbE+ NICs (10GbE+ recommended) |
| | Network | 1x 1Gb/s (bonded 10+ Gb/s recommended) |
+--------------+----------------+-----------------------------------------+
| ``ceph-mon`` | Processor | - 2 cores minimum |
| +----------------+-----------------------------------------+
| | RAM | 2-4GB+ per daemon |
| | RAM | 5GB+ per daemon (large / production |
| | | clusters need more) |
| +----------------+-----------------------------------------+
| | Disk Space | 60 GB per daemon |
| | Storage | 100 GB per daemon, SSD is recommended |
| +----------------+-----------------------------------------+
| | Network | 1x 1GbE+ NICs |
| | Network | 1x 1Gb/s (10+ Gb/s recommended) |
+--------------+----------------+-----------------------------------------+
| ``ceph-mds`` | Processor | - 2 cores minimum |
| +----------------+-----------------------------------------+
| | RAM | 2GB+ per daemon |
| | RAM | 2GB+ per daemon (more for production) |
| +----------------+-----------------------------------------+
| | Disk Space | 1 MB per daemon |
| | Disk Space | 1 GB per daemon |
| +----------------+-----------------------------------------+
| | Network | 1x 1GbE+ NICs |
| | Network | 1x 1Gb/s (10+ Gb/s recommended) |
+--------------+----------------+-----------------------------------------+
.. tip:: If you are running an OSD with a single disk, create a
partition for your volume storage that is separate from the partition
containing the OS. Generally, we recommend separate disks for the
OS and the volume storage.
.. tip:: If you are running an OSD node with a single storage drive, create a
partition for your OSD that is separate from the partition
containing the OS. We recommend separate drives for the
OS and for OSD storage.
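The per-core guidelines in the table can be combined into a first-pass core
estimate for one OSD. This is a sketch only: the table does not say how to
combine the IOPS and throughput figures, so taking the larger of the two is an
assumption made here. The defaults use the conservative ends of the ranges
(1000 IOPS and 200 MB/s per core), and the function name is illustrative.
Benchmarking remains the reliable way to size CPUs.

```python
import math


def estimate_osd_cores(target_iops, target_mb_per_s,
                       iops_per_core=1000, mb_per_s_per_core=200):
    """First-pass core count for one OSD from the table's per-core ranges.

    Uses whichever of the IOPS or throughput demand needs more cores
    (an assumption), and never returns less than the 1-core minimum.
    """
    if iops_per_core <= 0 or mb_per_s_per_core <= 0:
        raise ValueError("per-core rates must be positive")
    by_iops = target_iops / iops_per_core
    by_throughput = target_mb_per_s / mb_per_s_per_core
    return max(1, math.ceil(max(by_iops, by_throughput)))


if __name__ == "__main__":
    # Conservative and optimistic ends of the table's ranges.
    print(estimate_osd_cores(10000, 400))              # 1000 IOPS, 200 MB/s per core
    print(estimate_osd_cores(10000, 400, 3000, 500))   # 3000 IOPS, 500 MB/s per core
```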