doc: misc clarity and capitalization

Signed-off-by: Anthony D'Atri <anthony.datri@gmail.com>
This commit is contained in:
Anthony D'Atri 2020-10-07 15:21:28 -07:00
parent 8e530674ff
commit 32375cb789
12 changed files with 202 additions and 161 deletions

View File

@ -66,7 +66,7 @@ then you just add a line saying ::
Signed-off-by: Random J Developer <random@developer.example.org>
using your real name (sorry, no pseudonyms or anonymous contributions.)
using your real name (sorry, no pseudonyms or anonymous contributions).
Git can sign off on your behalf
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@ -201,12 +201,17 @@ PR title
If your PR has only one commit, the PR title can be the same as the commit title
(and GitHub will suggest this). If the PR has multiple commits, do not accept
the title GitHub suggest. Either use the title of the most relevant commit, or
the title GitHub suggests. Either use the title of the most relevant commit, or
write your own title. In the latter case, use the same "subsystem: short
description" convention described in `Commit title`_ for the PR title, with
the following difference: the PR title describes the entire set of changes,
while the `Commit title`_ describes only the changes in a particular commit.
If GitHub suggests a PR title based on a very long commit message, it will split
the result with an ellipsis (...) and fold the remainder into the PR description.
In such a case, please edit the title to be more concise and edit the description
to remove the ellipsis.
Keep in mind that PR titles feed directly into the script that generates
release notes, and it is tedious to clean up non-conformant PR titles at release
time. This document places no limit on the length of PR titles, but be aware

View File

@ -14,17 +14,17 @@ online bucket resharding.
Each bucket index shard can handle its entries efficiently up until
reaching a certain threshold number of entries. If this threshold is
exceeded the system can encounter performance issues. The dynamic
exceeded the system can suffer from performance issues. The dynamic
resharding feature detects this situation and automatically increases
the number of shards used by the bucket index, resulting in the
the number of shards used by the bucket index, resulting in a
reduction of the number of entries in each bucket index shard. This
process is transparent to the user.
By default dynamic bucket index resharding can only increase the
number of bucket index shards to 1999, although the upper-bound is a
configuration parameter (see Configuration below). Furthermore, when
number of bucket index shards to 1999, although this upper bound is a
configuration parameter (see Configuration below). When
possible, the process chooses a prime number of bucket index shards to
help spread the number of bucket index entries across the bucket index
spread the number of bucket index entries across the bucket index
shards more evenly.
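As a rough sketch (option names and defaults are assumptions to verify against
your release's configuration reference; see Configuration below), the feature
and its shard ceiling are governed by settings along these lines::

[global]
# option names assumed; confirm in the radosgw configuration reference
rgw_dynamic_resharding = true
rgw_max_dynamic_shards = 1999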
The detection process runs in a background process that periodically

View File

@ -8,8 +8,8 @@ new developers to get up to speed with the implementation details.
Introduction
------------
Swift offers something called a container, that we use interchangeably with
the term bucket. One may say that RGW's buckets implement Swift containers.
Swift offers something called a *container*, which we use interchangeably with
the term *bucket*, so we say that RGW's buckets implement Swift containers.
This document does not consider how RGW operates on these structures,
e.g. the use of encode() and decode() methods for serialization and so on.
@ -42,18 +42,18 @@ Some variables have been used in above commands, they are:
- bucket: Holds a mapping between bucket name and bucket instance id
- bucket.instance: Holds bucket instance information[2]
Every metadata entry is kept on a single rados object. See below for implementation details.
Every metadata entry is kept on a single RADOS object. See below for implementation details.
Note that the metadata is not indexed. When listing a metadata section we do a
rados pgls operation on the containing pool.
RADOS ``pgls`` operation on the containing pool.
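As an illustration, these metadata sections can be inspected with
``radosgw-admin`` (the bucket name below is a placeholder)::

# "mybucket" is a placeholder bucket name
radosgw-admin metadata list bucket
radosgw-admin metadata list bucket.instance
radosgw-admin metadata get bucket:mybucket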
Bucket Index
^^^^^^^^^^^^
It's a different kind of metadata, and kept separately. The bucket index holds
a key-value map in rados objects. By default it is a single rados object per
bucket, but it is possible since Hammer to shard that map over multiple rados
objects. The map itself is kept in omap, associated with each rados object.
a key-value map in RADOS objects. By default it is a single RADOS object per
bucket, but it is possible since Hammer to shard that map over multiple RADOS
objects. The map itself is kept in omap, associated with each RADOS object.
The key of each omap is the name of the objects, and the value holds some basic
metadata of that object -- metadata that shows up when listing the bucket.
Also, each omap holds a header, and we keep some bucket accounting metadata
@ -66,7 +66,7 @@ objects there is more information that we keep on other keys.
Data
^^^^
Objects data is kept in one or more rados objects for each rgw object.
Object data is kept in one or more RADOS objects for each RGW object.
Object Lookup Path
------------------
@ -96,7 +96,7 @@ causes no ambiguity. For the same reason, slashes are permitted in object
names (keys).
It is also possible to create multiple data pools and make it so that
different users buckets will be created in different rados pools by default,
different users' buckets will be created in different RADOS pools by default,
thus providing the necessary scaling. The layout and naming of these pools
is controlled by a 'policy' setting.[3]
@ -187,7 +187,7 @@ Known pools:
namespace: users.keys
47UA98JSTJZ9YAN3OS3O
This allows radosgw to look up users by their access keys during authentication.
This allows ``radosgw`` to look up users by their access keys during authentication.
namespace: users.swift
test:tester

View File

@ -7,16 +7,16 @@
RBD images can be live-migrated between different pools within the same cluster
or between different image formats and layouts. When started, the source image
will be deep-copied to the destination image, pulling all snapshot history and
optionally keeping any link to the source image's parent to help preserve
optionally keeping any link to the source image's parent in order to preserve
sparseness.
This copy process can safely run in the background while the new target image is
in-use. There is currently a requirement to temporarily stop using the source
in use. There is currently a requirement to temporarily stop using the source
image before preparing a migration. This helps to ensure that the client using
the image is updated to point to the new target image.
.. note::
Image live-migration requires the Ceph Nautilus release or later. The krbd
Image live-migration requires the Ceph Nautilus release or later. The ``krbd``
kernel module does not support live-migration at this time.
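As a sketch of the overall workflow (pool and image names are illustrative;
consult ``rbd help migration`` for the exact syntax in your release)::

# pool and image names are placeholders
rbd migration prepare sourcepool/image1 targetpool/image1
rbd migration execute targetpool/image1
rbd migration commit targetpool/image1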

View File

@ -13,14 +13,14 @@ capability is available in two modes:
actual image. The remote cluster will read from this associated journal and
replay the updates to its local copy of the image. Since each write to the
RBD image will result in two writes to the Ceph cluster, expect write
latencies to nearly double when using the RBD journaling image feature.
latencies to nearly double while using the RBD journaling image feature.
* **Snapshot-based**: This mode uses periodically scheduled or manually
created RBD image mirror-snapshots to replicate crash-consistent RBD images
between clusters. The remote cluster will determine any data or metadata
updates between two mirror-snapshots and copy the deltas to its local copy of
the image. With the help of the RBD fast-diff image feature, updated data
blocks can be quickly computed without the need to scan the full RBD image.
the image. With the help of the RBD ``fast-diff`` image feature, updated data
blocks can be quickly determined without the need to scan the full RBD image.
Since this mode is not as fine-grained as journaling, the complete delta
between two snapshots will need to be synced prior to use during a failover
scenario. Any partially applied set of deltas will be rolled back at moment
@ -30,10 +30,10 @@ capability is available in two modes:
snapshot-based mirroring requires the Ceph Octopus release or later.
Mirroring is configured on a per-pool basis within peer clusters and can be
configured on a specific subset of images within the pool or configured to
automatically mirror all images within a pool when using journal-based
mirroring only. Mirroring is configured using the ``rbd`` command. The
``rbd-mirror`` daemon is responsible for pulling image updates from the remote,
configured on a specific subset of images within the pool. You can also mirror
all images within a given pool when using journal-based
mirroring. Mirroring is configured using the ``rbd`` command. The
``rbd-mirror`` daemon is responsible for pulling image updates from the remote
peer cluster and applying them to the image within the local cluster.
Depending on the desired needs for replication, RBD mirroring can be configured
@ -57,30 +57,35 @@ Pool Configuration
The following procedures demonstrate how to perform the basic administrative
tasks to configure mirroring using the ``rbd`` command. Mirroring is
configured on a per-pool basis within the Ceph clusters.
configured on a per-pool basis.
The pool configuration steps should be performed on both peer clusters. These
procedures assume two clusters, named "site-a" and "site-b", are accessible from
a single host for clarity.
These pool configuration steps should be performed on both peer clusters. For
clarity, these procedures assume that both clusters, named "site-a" and
"site-b", are accessible from a single host.
See the `rbd`_ manpage for additional details of how to connect to different
Ceph clusters.
.. note:: The cluster name in the following examples corresponds to a Ceph
configuration file of the same name (e.g. /etc/ceph/site-b.conf). See the
`ceph-conf`_ documentation for how to configure multiple clusters.
`ceph-conf`_ documentation for how to configure multiple clusters. Note
that ``rbd-mirror`` does **not** require the source and destination clusters
to have unique internal names; both can and should call themselves ``ceph``.
The config `files` that ``rbd-mirror`` needs for local and remote clusters
can be named arbitrarily, and containerizing the daemon is one strategy
for maintaining them outside of ``/etc/ceph`` to avoid confusion.
Enable Mirroring
----------------
To enable mirroring on a pool with ``rbd``, specify the ``mirror pool enable``
command, the pool name, and the mirroring mode::
To enable mirroring on a pool with ``rbd``, issue the ``mirror pool enable``
subcommand with the pool name and the mirroring mode::
rbd mirror pool enable {pool-name} {mode}
The mirroring mode can either be ``image`` or ``pool``:
* **image**: When configured in ``image`` mode, mirroring needs to be
* **image**: When configured in ``image`` mode, mirroring must
`explicitly enabled`_ on each image.
* **pool** (default): When configured in ``pool`` mode, all images in the pool
with the journaling feature enabled are mirrored.
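For example, assuming a pool named ``image-pool`` (the name is illustrative),
per-image mirroring could be enabled on both sites like this::

# "image-pool" is a placeholder pool name
rbd --cluster site-a mirror pool enable image-pool image
rbd --cluster site-b mirror pool enable image-pool image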
@ -111,13 +116,13 @@ Bootstrap Peers
---------------
In order for the ``rbd-mirror`` daemon to discover its peer cluster, the peer
needs to be registered to the pool and a user account needs to be created.
must be registered and a user account must be created.
This process can be automated with ``rbd`` and the
``mirror pool peer bootstrap create`` and ``mirror pool peer bootstrap import``
commands.
To manually create a new bootstrap token with ``rbd``, specify the
``mirror pool peer bootstrap create`` command, a pool name, along with an
To manually create a new bootstrap token with ``rbd``, issue the
``mirror pool peer bootstrap create`` subcommand, a pool name, and an
optional friendly site name to describe the local cluster::
rbd mirror pool peer bootstrap create [--site-name {local-site-name}] {pool-name}
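A sketch of the end-to-end exchange (site and pool names are illustrative):
create the token on one cluster, then import it on the peer::

# "token" is an arbitrary local file name
rbd --cluster site-a mirror pool peer bootstrap create --site-name site-a image-pool > token
rbd --cluster site-b mirror pool peer bootstrap import --site-name site-b image-pool token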
@ -289,6 +294,16 @@ For example::
.. tip:: You can enable journaling on all new images by default by adding
``rbd default features = 125`` to your Ceph configuration file.
.. tip:: ``rbd-mirror`` tunables are set by default to values suitable for
mirroring an entire pool. When using ``rbd-mirror`` to migrate single
volumes between clusters, you may achieve substantial performance gains
by setting ``rbd_mirror_journal_max_fetch_bytes=33554432`` and
``rbd_journal_max_payload_bytes=8388608`` within the ``[client]`` config
section of the local or centralized configuration. Note that these
settings may allow ``rbd-mirror`` to present a substantial write workload
to the destination cluster: monitor cluster performance closely during
migrations and test carefully before running multiple migrations in parallel.
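A minimal sketch of such an override in the local or centralized
configuration, using the values from the tip above::

[client]
# values taken from the tip above; tune for your environment
rbd_mirror_journal_max_fetch_bytes = 33554432
rbd_journal_max_payload_bytes = 8388608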
Create Image Mirror-Snapshots
-----------------------------

View File

@ -4,10 +4,10 @@
.. index:: Ceph Block Device; OpenStack
You may use Ceph Block Device images with OpenStack through ``libvirt``, which
configures the QEMU interface to ``librbd``. Ceph stripes block device images as
objects across the cluster, which means that large Ceph Block Device images have
better performance than a standalone server!
You can attach Ceph Block Device images to OpenStack instances through ``libvirt``,
which configures the QEMU interface to ``librbd``. Ceph stripes block volumes
across multiple OSDs within the cluster, which means that large volumes can
realize better performance than local drives on a standalone server!
To use Ceph Block Devices with OpenStack, you must install QEMU, ``libvirt``,
and OpenStack first. We recommend using a separate physical node for your
@ -56,13 +56,13 @@ Three parts of OpenStack integrate with Ceph's block devices:
every virtual machine inside Ceph directly without using Cinder, which is
advantageous because it allows you to perform maintenance operations easily
with the live-migration process. Additionally, if your hypervisor dies it is
also convenient to trigger ``nova evacuate`` and run the virtual machine
also convenient to trigger ``nova evacuate`` and reinstate the virtual machine
elsewhere almost seamlessly. In doing so,
:ref:`exclusive locks <rbd-exclusive-locks>` prevent multiple
compute nodes from concurrently accessing the guest disk.
You can use OpenStack Glance to store images in a Ceph Block Device, and you
You can use OpenStack Glance to store images as Ceph Block Devices, and you
can use Cinder to boot a VM using a copy-on-write clone of an image.
The instructions below detail the setup for Glance, Cinder and Nova, although
@ -78,9 +78,9 @@ while running VMs using a local disk, or vice versa.
Create a Pool
=============
By default, Ceph block devices use the ``rbd`` pool. You may use any available
pool. We recommend creating a pool for Cinder and a pool for Glance. Ensure
your Ceph cluster is running, then create the pools. ::
By default, Ceph block devices live within the ``rbd`` pool. You may use any
suitable pool by specifying it explicitly. We recommend creating a pool for
Cinder and a pool for Glance. Ensure your Ceph cluster is running, then create the pools. ::
ceph osd pool create volumes
ceph osd pool create images
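Newly created pools typically also need to be initialized for RBD use before
clients write to them; a sketch assuming the pool names above::

# initialize the example pools created above
rbd pool init volumes
rbd pool init images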
@ -309,25 +309,26 @@ authenticating with the Ceph cluster. ::
rbd_user = cinder
rbd_secret_uuid = 457eb676-33da-42ec-9a8c-9293d545c337
These two flags are also used by the Nova ephemeral backend.
These two flags are also used by the Nova ephemeral back end.
Configuring Nova
----------------
In order to boot all the virtual machines directly into Ceph, you must
In order to boot virtual machines directly from Ceph volumes, you must
configure the ephemeral backend for Nova.
It is recommended to enable the RBD cache in your Ceph configuration file
(enabled by default since Giant). Moreover, enabling the admin socket
brings a lot of benefits while troubleshooting. Having one socket
per virtual machine using a Ceph block device will help investigating performance and/or wrong behaviors.
It is recommended to enable the RBD cache in your Ceph configuration file; this
has been enabled by default since the Giant release. Moreover, enabling the
client admin socket allows the collection of metrics and can be invaluable
for troubleshooting.
This socket can be accessed like this::
This socket can be accessed on the hypervisor (Nova compute) node::
ceph daemon /var/run/ceph/ceph-client.cinder.19195.32310016.asok help
Now on every compute nodes edit your Ceph configuration file::
To enable RBD cache and admin sockets, ensure that each hypervisor's
``ceph.conf`` contains::
[client]
rbd cache = true
@ -336,7 +337,7 @@ Now on every compute nodes edit your Ceph configuration file::
log file = /var/log/qemu/qemu-guest-$pid.log
rbd concurrent management ops = 20
Configure the permissions of these paths::
Configure permissions for these directories::
mkdir -p /var/run/ceph/guests/ /var/log/qemu/
chown qemu:libvirtd /var/run/ceph/guests /var/log/qemu/
@ -344,15 +345,15 @@ Configure the permissions of these paths::
Note that user ``qemu`` and group ``libvirtd`` can vary depending on your system.
The provided example works for RedHat based systems.
.. tip:: If your virtual machine is already running you can simply restart it to get the socket
.. tip:: If your virtual machine is already running, you can simply restart it to enable the admin socket
Restart OpenStack
=================
To activate the Ceph block device driver and load the block device pool name
into the configuration, you must restart OpenStack. Thus, for Debian based
systems execute these commands on the appropriate nodes::
into the configuration, you must restart the related OpenStack services.
For Debian-based systems, execute these commands on the appropriate nodes::
sudo glance-control api restart
sudo service nova-compute restart
@ -383,7 +384,7 @@ You can use `qemu-img`_ to convert from one format to another. For example::
qemu-img convert -f qcow2 -O raw precise-cloudimg.img precise-cloudimg.raw
When Glance and Cinder are both using Ceph block devices, the image is a
copy-on-write clone, so it can create a new volume quickly. In the OpenStack
copy-on-write clone, so new volumes are created quickly. In the OpenStack
dashboard, you can boot from that volume by performing the following steps:
#. Launch a new instance.

View File

@ -7,14 +7,14 @@
Shared, Read-only Parent Image Cache
====================================
`Cloned RBD images`_ from a parent usually only modify a small portion of
the image. For example, in a VDI workload, the VMs are cloned from the same
base image and initially only differ by hostname and IP address. During the
booting stage, all of these VMs would re-read portions of duplicate parent
image data from the RADOS cluster. If we have a local cache of the parent
image, this will help to speed up the read process on one host, as well as
to save the client to cluster network traffic.
RBD shared read-only parent image cache requires explicitly enabling in
`Cloned RBD images`_ usually modify only a small fraction of the parent
image. For example, in a VDI use-case, VMs are cloned from the same
base image and initially differ only by hostname and IP address. During
booting, all of these VMs read portions of the same parent
image data. If we have a local cache of the parent
image, this speeds up reads on the caching host and reduces
client-to-cluster network traffic.
The RBD shared read-only parent image cache must be explicitly enabled in
``ceph.conf``. The ``ceph-immutable-object-cache`` daemon is responsible for
caching the parent content on the local disk, and future reads on that data
will be serviced from the local cache.
@ -64,14 +64,14 @@ The key components of the daemon are:
RADOS cluster and stored in the local caching directory.
On opening each cloned rbd image, ``librbd`` will try to connect to the
cache daemon over its domain socket. If it's successfully connected,
``librbd`` will automatically check with the daemon on the subsequent reads.
cache daemon through its Unix domain socket. Once successfully connected,
``librbd`` will coordinate with the daemon on the subsequent reads.
If there's a read that's not cached, the daemon will promote the RADOS object
to local caching directory, so the next read on that object will be serviced
from local file. The daemon also maintains simple LRU statistics so if there's
not enough capacity it will delete some cold cache files.
from cache. The daemon also maintains simple LRU statistics so that under
capacity pressure it will evict cold cache files as needed.
Here are some important cache options correspond to the following settings:
Here are some important cache configuration settings:
- ``immutable_object_cache_sock`` The path to the domain socket used for
communication between librbd clients and the ceph-immutable-object-cache
@ -81,9 +81,9 @@ Here are some important cache options correspond to the following settings:
- ``immutable_object_cache_max_size`` The max size for immutable cache.
- ``immutable_object_cache_watermark`` The watermark for the cache. If the
capacity reaches to this watermark, the daemon will delete cold cache based
on the LRU statistics.
- ``immutable_object_cache_watermark`` The high-water mark for the cache. If the
capacity reaches this threshold the daemon will delete cold cache based
on LRU statistics.
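A combined ``ceph.conf`` sketch of these settings (the path and sizes are
placeholders, not recommendations)::

[client]
# path and sizes are placeholders; adjust for your host
immutable_object_cache_path = /mnt/ceph-immutable-object-cache
immutable_object_cache_max_size = 16G
immutable_object_cache_watermark = 0.9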
The ``ceph-immutable-object-cache`` daemon is available within the optional
``ceph-immutable-object-cache`` distribution package.

View File

@ -4,7 +4,7 @@
.. index:: Ceph Block Device; RBD Replay
RBD Replay is a set of tools for capturing and replaying Rados Block Device
RBD Replay is a set of tools for capturing and replaying RADOS Block Device
(RBD) workloads. To capture an RBD workload, ``lttng-tools`` must be installed
on the client, and ``librbd`` on the client must be the v0.87 (Giant) release
or later. To replay an RBD workload, ``librbd`` on the client must be the Giant

View File

@ -4,22 +4,24 @@
.. index:: Ceph Block Device; snapshots
A snapshot is a read-only copy of the state of an image at a particular point in
time. One of the advanced features of Ceph block devices is that you can create
snapshots of the images to retain a history of an image's state. Ceph also
supports snapshot layering, which allows you to clone images (e.g., a VM image)
quickly and easily. Ceph supports block device snapshots using the ``rbd``
command and many higher level interfaces, including `QEMU`_, `libvirt`_,
`OpenStack`_ and `CloudStack`_.
A snapshot is a read-only logical copy of an image at a particular point in
time: a checkpoint. One of the advanced features of Ceph block devices is
that you can create snapshots of images to retain point-in-time state history.
Ceph also supports snapshot layering, which allows you to clone images (e.g., a
VM image) quickly and easily. Ceph block device snapshots are managed using the
``rbd`` command and multiple higher level interfaces, including `QEMU`_,
`libvirt`_, `OpenStack`_ and `CloudStack`_.
.. important:: To use RBD snapshots, you must have a running Ceph cluster.
.. note:: Because RBD does not know about the file system, snapshots are
`crash-consistent` if they are not coordinated with the mounting
computer. So, we recommend you stop `I/O` before taking a snapshot of
an image. If the image contains a file system, the file system must be
in a consistent state before taking a snapshot or you may have to run
`fsck`. To stop `I/O` you can use `fsfreeze` command. See
.. note:: Because RBD is not aware of any filesystem within an image
(volume), snapshots are merely `crash-consistent` unless they are
coordinated with the mounting (attaching) operating system.
We therefore recommend that you pause or stop I/O before taking a snapshot.
If the volume contains a filesystem, it must be in an internally
consistent state before taking a snapshot. Snapshots taken at
inconsistent points may need an `fsck` pass before subsequent
mounting. To stop `I/O` you can use the `fsfreeze` command. See
`fsfreeze(8)` man page for more details.
For virtual machines, `qemu-guest-agent` can be used to automatically
freeze file systems when creating a snapshot.
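For example, a minimal sketch of quiescing a mounted filesystem around a
snapshot (the mount point, pool, image, and snapshot names are illustrative)::

# /mnt/data and rbd/vm-disk@checkpoint-1 are placeholders
fsfreeze --freeze /mnt/data
rbd snap create rbd/vm-disk@checkpoint-1
fsfreeze --unfreeze /mnt/data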
@ -37,10 +39,10 @@ command and many higher level interfaces, including `QEMU`_, `libvirt`_,
Cephx Notes
===========
When `cephx`_ is enabled (it is by default), you must specify a user name or ID
and a path to the keyring containing the corresponding key for the user. See
:ref:`User Management <user-management>` for details. You may also add the ``CEPH_ARGS`` environment
variable to avoid re-entry of the following parameters. ::
When `cephx`_ authentication is enabled (it is by default), you must specify a
user name or ID and a path to the keyring containing the corresponding key. See
:ref:`User Management <user-management>` for details. You may also set the
``CEPH_ARGS`` environment variable to avoid re-entry of these parameters. ::
rbd --id {user-ID} --keyring=/path/to/secret [commands]
rbd --name {username} --keyring=/path/to/secret [commands]
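For example, setting ``CEPH_ARGS`` once in the shell lets subsequent ``rbd``
commands omit these parameters (the user ID and keyring path are illustrative)::

# user ID and keyring path are placeholders
export CEPH_ARGS="--id rbd-user --keyring=/etc/ceph/ceph.client.rbd-user.keyring"
rbd snap ls mypool/myimage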
@ -58,12 +60,12 @@ Snapshot Basics
===============
The following procedures demonstrate how to create, list, and remove
snapshots using the ``rbd`` command on the command line.
snapshots using the ``rbd`` command.
Create Snapshot
---------------
To create a snapshot with ``rbd``, specify the ``snap create`` option, the pool
To create a snapshot with ``rbd``, specify the ``snap create`` option, the pool
name and the image name. ::
rbd snap create {pool-name}/{image-name}@{snap-name}
@ -102,14 +104,14 @@ For example::
the current version of the image with data from a snapshot. The
time it takes to execute a rollback increases with the size of the
image. It is **faster to clone** from a snapshot **than to rollback**
an image to a snapshot, and it is the preferred method of returning
an image to a snapshot, and is the preferred method of returning
to a pre-existing state.
Delete a Snapshot
-----------------
To delete a snapshot with ``rbd``, specify the ``snap rm`` option, the pool
To delete a snapshot with ``rbd``, specify the ``snap rm`` subcommand, the pool
name, the image name and the snap name. ::
rbd snap rm {pool-name}/{image-name}@{snap-name}
@ -120,13 +122,13 @@ For example::
.. note:: Ceph OSDs delete data asynchronously, so deleting a snapshot
doesn't free up the disk space immediately.
doesn't immediately free up the underlying OSDs' capacity.
Purge Snapshots
---------------
To delete all snapshots for an image with ``rbd``, specify the ``snap purge``
option and the image name. ::
subcommand and the image name. ::
rbd snap purge {pool-name}/{image-name}
@ -161,7 +163,7 @@ clones rapidly.
Parent Child
.. note:: The terms "parent" and "child" mean a Ceph block device snapshot (parent),
.. note:: The terms "parent" and "child" refer to a Ceph block device snapshot (parent),
and the corresponding image cloned from the snapshot (child). These terms are
important for the command line usage below.
@ -171,12 +173,12 @@ the cloned image to open the parent snapshot and read it.
A COW clone of a snapshot behaves exactly like any other Ceph block device
image. You can read to, write from, clone, and resize cloned images. There are
no special restrictions with cloned images. However, the copy-on-write clone of
a snapshot refers to the snapshot, so you **MUST** protect the snapshot before
a snapshot depends on the snapshot, so you **MUST** protect the snapshot before
you clone it. The following diagram depicts the process.
.. note:: Ceph only supports cloning for format 2 images (i.e., created with
.. note:: Ceph only supports cloning of RBD format 2 images (i.e., created with
``rbd create --image-format 2``). The kernel client supports cloned images
since kernel 3.10.
beginning with the 3.10 release.
Getting Started with Layering
-----------------------------

View File

@ -151,13 +151,13 @@ If it doesn't exist, create your branch::
Make a Change
-------------
Modifying a document involves opening a restructuredText file, changing
Modifying a document involves opening a reStructuredText file, changing
its contents, and saving the changes. See `Documentation Style Guide`_ for
details on syntax requirements.
Adding a document involves creating a new restructuredText file under the
``doc`` directory or its subdirectories and saving the file with a ``*.rst``
file extension. You must also include a reference to the document: a hyperlink
Adding a document involves creating a new reStructuredText file within the
``doc`` directory tree with a ``*.rst``
extension. You must also include a reference to the document: a hyperlink
or a table of contents entry. The ``index.rst`` file of a top-level directory
usually contains a TOC, where you can add the new file name. All documents must
have a title. See `Headings`_ for details.
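For instance, a minimal sketch of referencing a hypothetical new file
``my_new_feature.rst`` from a directory's ``index.rst`` TOC::

.. my_new_feature is a hypothetical file name

.. toctree::
   :maxdepth: 1

   my_new_feature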

View File

@ -21,33 +21,44 @@ data cluster (e.g., OpenStack, CloudStack, etc).
CPU
===
Ceph metadata servers dynamically redistribute their load, which is CPU
intensive. So your metadata servers should have significant processing power
(e.g., quad core or better CPUs). Ceph OSDs run the :term:`RADOS` service, calculate
CephFS metadata servers are CPU intensive, so they should have significant
processing power (e.g., quad-core or better CPUs) and benefit from higher clock
rates (frequency in GHz). Ceph OSDs run the :term:`RADOS` service, calculate
data placement with :term:`CRUSH`, replicate data, and maintain their own copy of the
cluster map. Therefore, OSDs should have a reasonable amount of processing power
(e.g., dual core processors). Monitors simply maintain a master copy of the
cluster map, so they are not CPU intensive. You must also consider whether the
cluster map. Therefore, OSD nodes should have a reasonable amount of processing
power. Requirements vary by use-case; a starting point might be one core per
OSD for light / archival usage, and two cores per OSD for heavy workloads such
as RBD volumes attached to VMs. Monitor / manager nodes do not have heavy CPU
demands, so a modest processor can be chosen for them. Also consider whether the
host machine will run CPU-intensive processes in addition to Ceph daemons. For
example, if your hosts will run computing VMs (e.g., OpenStack Nova), you will
need to ensure that these other processes leave sufficient processing power for
Ceph daemons. We recommend running additional CPU-intensive processes on
separate hosts.
separate hosts to avoid resource contention.
RAM
===
Generally, more RAM is better.
Generally, more RAM is better. Monitor / manager nodes for a modest cluster
might do fine with 64GB; for a larger cluster with hundreds of OSDs, 128GB
is a reasonable target. There is a memory target for BlueStore OSDs that
defaults to 4GB. Factor in a prudent margin for the operating system and
administrative tasks (like monitoring and metrics) as well as increased
consumption during recovery: provisioning ~8GB per BlueStore OSD
is advised.
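As a sketch, the BlueStore memory target mentioned above is a per-OSD setting
that can be adjusted in the OSD section of the configuration (the value shown
is the default; adjust with care for your hardware)::

[osd]
# 4294967296 bytes = 4 GiB, the default value
osd_memory_target = 4294967296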
Monitors and managers (ceph-mon and ceph-mgr)
---------------------------------------------
Monitor and manager daemon memory usage generally scales with the size of the
cluster. For small clusters, 1-2 GB is generally sufficient. For
large clusters, you should provide more (5-10 GB). You may also want
to consider tuning settings like ``mon_osd_cache_size`` or
``rocksdb_cache_size``.
cluster. Note that at boot-time and during topology changes and recovery these
daemons will need more RAM than they do during steady-state operation, so plan
for peak usage. For very small clusters, 32 GB suffices. For
clusters of up to, say, 300 OSDs, go with 64GB. For clusters built with (or
which will grow to) even more OSDs you should provision
128GB. You may also want to consider tuning settings like ``mon_osd_cache_size``
or ``rocksdb_cache_size`` after careful research.
Metadata servers (ceph-mds)
---------------------------
@ -108,8 +119,8 @@ performance tradeoffs to consider when planning for data storage. Simultaneous
OS operations, and simultaneous request for read and write operations from
multiple daemons against a single drive can slow performance considerably.
.. important:: Since Ceph has to write all data to the journal before it can
send an ACK (for XFS at least), having the journal and OSD
.. important:: Since Ceph has to write all data to the journal (or WAL+DB)
before it can ACK writes, having this metadata and OSD
performance in balance is really important!
@ -127,23 +138,25 @@ at $150.00 has a cost of $0.05 per gigabyte (i.e., $150 / 3072 = 0.0488). In the
foregoing example, using the 1 terabyte disks would generally increase the cost
per gigabyte by 40%--rendering your cluster substantially less cost efficient.
.. tip:: Running multiple OSDs on a single disk--irrespective of partitions--is
**NOT** a good idea.
.. tip:: Running multiple OSDs on a single SAS / SATA drive
is **NOT** a good idea. NVMe drives, however, can achieve
improved performance by being split into two or more OSDs.
.. tip:: Running an OSD and a monitor or a metadata server on a single
disk--irrespective of partitions--is **NOT** a good idea either.
drive is also **NOT** a good idea.
Storage drives are subject to limitations on seek time, access time, read and
write times, as well as total throughput. These physical limitations affect
overall system performance--especially during recovery. We recommend using a
dedicated drive for the operating system and software, and one drive for each
Ceph OSD Daemon you run on the host. Most "slow OSD" issues arise due to running
dedicated (ideally mirrored) drive for the operating system and software, and
one drive for each Ceph OSD Daemon you run on the host (modulo NVMe above).
Many "slow OSD" issues not attributable to hardware failure arise from running
an operating system, multiple OSDs, and/or multiple journals on the same drive.
Since the cost of troubleshooting performance issues on a small cluster likely
exceeds the cost of the extra disk drives, you can optimize your cluster
design planning by avoiding the temptation to overtax the OSD storage drives.
You may run multiple Ceph OSD Daemons per hard disk drive, but this will likely
You may run multiple Ceph OSD Daemons per SAS / SATA drive, but this will likely
lead to resource contention and diminish the overall throughput. You may store a
journal and object data on the same drive, but this may increase the time it
takes to journal a write and ACK to the client. Ceph must write to the journal
@ -196,12 +209,9 @@ are a few important performance considerations for journals and SSDs:
proper partition alignment with SSDs, which can cause SSDs to transfer data
much more slowly. Ensure that SSD partitions are properly aligned.
While SSDs are cost prohibitive for object storage, OSDs may see a significant
performance improvement by storing an OSD's journal on an SSD and the OSD's
object data on a separate hard disk drive. The ``osd journal`` configuration
setting defaults to ``/var/lib/ceph/osd/$cluster-$id/journal``. You can mount
this path to an SSD or to an SSD partition so that it is not merely a file on
the same disk as the object data.
SSDs have historically been cost prohibitive for object storage, though
emerging QLC drives are closing the gap. HDD OSDs may see a significant
performance improvement by offloading WAL+DB onto an SSD.
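A sketch of provisioning an HDD OSD with its WAL+DB offloaded to a faster
device (device paths are illustrative)::

# /dev/sdb and /dev/nvme0n1p1 are placeholder devices
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1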
One way Ceph accelerates CephFS file system performance is to segregate the
storage of CephFS metadata from the storage of the CephFS file contents. Ceph
@ -214,9 +224,12 @@ your CephFS metadata pool that points only to a host's SSD storage media. See
Controllers
-----------
Disk controllers also have a significant impact on write throughput. Carefully,
consider your selection of disk controllers to ensure that they do not create
a performance bottleneck.
Disk controllers (HBAs) can have a significant impact on write throughput.
Carefully consider your selection to ensure that they do not create
a performance bottleneck. Notably, RAID-mode (IR) HBAs may exhibit higher
latency than simpler "JBOD" (IT) mode HBAs, and the RAID SoC, write cache,
and battery backup can substantially increase hardware and maintenance
costs. Some RAID HBAs can be configured with an IT-mode "personality".
.. tip:: The `Ceph blog`_ is often an excellent source of information on Ceph
performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write
@ -226,8 +239,8 @@ a performance bottleneck.
Additional Considerations
-------------------------
You may run multiple OSDs per host, but you should ensure that the sum of the
total throughput of your OSD hard disks doesn't exceed the network bandwidth
You typically will run multiple OSDs per host, but you should ensure that the
aggregate throughput of your OSD drives doesn't exceed the network bandwidth
required to service a client's need to read or write data. You should also
consider what percentage of the overall data the cluster stores on each host. If
the percentage on a particular host is large and the host fails, it can lead to
@ -243,10 +256,10 @@ multiple OSDs per host.
Networks
========
Consider starting with a 10Gbps+ network in your racks. Replicating 1TB of data
Provision at least 10Gbps networking in your racks. Replicating 1TB of data
across a 1Gbps network takes 3 hours, and 10TBs takes 30 hours! By contrast,
with a 10Gbps network, the replication times would be 20 minutes and 1 hour
respectively. In a petabyte-scale cluster, failure of an OSD disk should be an
with a 10Gbps network, the replication times would be 20 minutes and 1 hour
respectively. In a petabyte-scale cluster, failure of an OSD drive is an
expectation, not an exception. System administrators will appreciate PGs
recovering from a ``degraded`` state to an ``active + clean`` state as rapidly
as possible, with price / performance tradeoffs taken into consideration.
@ -255,12 +268,16 @@ cabling more manageable. VLANs using 802.1q protocol require VLAN-capable NICs
and Switches. The added hardware expense may be offset by the operational cost
savings for network setup and maintenance. When using VLANs to handle VM
traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack,
etc.), it is also worth considering using 10G Ethernet. Top-of-rack routers for
each network also need to be able to communicate with spine routers that have
even faster throughput--e.g., 40Gbps to 100Gbps.
etc.), there is additional value in using 10Gb Ethernet or better; as of 2020,
25/40/50/100 Gb networking is common for production clusters.
Top-of-rack routers for each network also need to be able to communicate with
spine routers that have even faster throughput, often 40Gb/s or more.
Your server hardware should have a Baseboard Management Controller (BMC).
Administration and deployment tools may also use BMCs extensively, so consider
Administration and deployment tools may also use BMCs extensively, especially
via IPMI or Redfish, so consider
the cost/benefit tradeoff of an out-of-band network for administration.
Hypervisor SSH access, VM image uploads, OS image installs, management sockets,
etc. can impose significant loads on a network. Running three networks may seem
@ -273,7 +290,7 @@ Failure Domains
===============
A failure domain is any failure that prevents access to one or more OSDs. That
could be a stopped daemon on a host; a hard disk failure, an OS crash, a
could be a stopped daemon on a host; a hard disk failure, an OS crash, a
malfunctioning NIC, a failed power supply, a network outage, a power outage, and
so forth. When planning out your hardware needs, you must balance the
temptation to reduce costs by placing too many responsibilities into too few
@ -301,7 +318,7 @@ and development clusters can run successfully with modest hardware.
| | | * ARM processors specifically may |
| | | require additional cores. |
| | | * Actual performance depends on many |
| | | factors including disk, network, and |
| | | factors including drives, net, and |
| | | client throughput and latency. |
| | | Benchmarking is highly recommended. |
| +----------------+-----------------------------------------+
@ -315,15 +332,15 @@ and development clusters can run successfully with modest hardware.
| +----------------+-----------------------------------------+
| | Network | 1x 1GbE+ NICs (10GbE+ recommended) |
+--------------+----------------+-----------------------------------------+
| ``ceph-mon`` | Processor | - 1 core minimum |
| ``ceph-mon`` | Processor | - 2 cores minimum |
| +----------------+-----------------------------------------+
| | RAM | 2GB+ per daemon |
| | RAM | 24GB+ per daemon |
| +----------------+-----------------------------------------+
| | Disk Space | 10 GB per daemon |
| | Disk Space | 60 GB per daemon |
| +----------------+-----------------------------------------+
| | Network | 1x 1GbE+ NICs |
+--------------+----------------+-----------------------------------------+
| ``ceph-mds`` | Processor | - 1 core minimum |
| ``ceph-mds`` | Processor | - 2 cores minimum |
| +----------------+-----------------------------------------+
| | RAM | 2GB+ per daemon |
| +----------------+-----------------------------------------+

View File

@ -19,10 +19,11 @@ Linux Kernel
your Linux distribution on any client hosts.
For RBD, if you choose to *track* long-term kernels, we currently recommend
4.x-based "longterm maintenance" kernel series:
4.x-based "longterm maintenance" kernel series or later:
- 4.19.z
- 4.14.z
- 5.x
For CephFS, see the section about `Mounting CephFS using Kernel Driver`_
for kernel version guidance.
@ -111,30 +112,30 @@ Luminous (12.2.z)
Notes
-----
- **1**: The default kernel has an older version of ``btrfs`` that we do not
recommend for ``ceph-osd`` storage nodes. We recommend using ``bluestore``
starting from Mimic, and ``XFS`` for previous releases with ``filestore``.
- **1**: The default kernel has an older version of ``Btrfs`` that we do not
recommend for ``ceph-osd`` storage nodes. We recommend using ``BlueStore``
starting with Luminous, and ``XFS`` for previous releases with ``Filestore``.
- **2**: The default kernel has an old Ceph client that we do not recommend
for kernel client (kernel RBD or the Ceph file system). Upgrade to a
recommended kernel.
- **3**: The default kernel regularly fails in QA when the ``btrfs``
file system is used. We recommend using ``bluestore`` starting from
Mimic, and ``XFS`` for previous releases with ``filestore``.
- **3**: The default kernel regularly fails in QA when the ``Btrfs``
file system is used. We recommend using ``BlueStore`` starting from
Luminous, and ``XFS`` for previous releases with ``Filestore``.
- **4**: ``btrfs`` is no longer tested on this release. We recommend
using ``bluestore``.
- **5**: Some additional features related to dashboard are not available.
- **6**: Building packages are built regularly, but not distributed by Ceph.
- **6**: Packages are built regularly, but not distributed by upstream Ceph.
Testing
-------
- **B**: We build release packages for this platform. For some of these
platforms, we may also continuously build all ceph branches and exercise
platforms, we may also continuously build all Ceph branches and perform
basic unit tests.
- **I**: We do basic installation and functionality tests of releases on this