doc: misc clarity and capitalization
Signed-off-by: Anthony D'Atri <anthony.datri@gmail.com>
This commit is contained in:
parent 8e530674ff
commit 32375cb789
@@ -66,7 +66,7 @@ then you just add a line saying ::

Signed-off-by: Random J Developer <random@developer.example.org>

using your real name (sorry, no pseudonyms or anonymous contributions.)
using your real name (sorry, no pseudonyms or anonymous contributions).

Git can sign off on your behalf
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -201,12 +201,17 @@ PR title

If your PR has only one commit, the PR title can be the same as the commit title
(and GitHub will suggest this). If the PR has multiple commits, do not accept
the title GitHub suggest. Either use the title of the most relevant commit, or
the title GitHub suggests. Either use the title of the most relevant commit, or
write your own title. In the latter case, use the same "subsystem: short
description" convention described in `Commit title`_ for the PR title, with
the following difference: the PR title describes the entire set of changes,
while the `Commit title`_ describes only the changes in a particular commit.

If GitHub suggests a PR title based on a very long commit message, it will split
the result with an ellipsis (...) and fold the remainder into the PR description.
In such a case, please edit the title to be more concise and the description to
remove the ellipsis.

Keep in mind that the PR titles feed directly into the script that generates
release notes and it is tedious to clean up non-conformant PR titles at release
time. This document places no limit on the length of PR titles, but be aware
@@ -14,17 +14,17 @@ online bucket resharding.

Each bucket index shard can handle its entries efficiently up until
reaching a certain threshold number of entries. If this threshold is
exceeded the system can encounter performance issues. The dynamic
exceeded the system can suffer from performance issues. The dynamic
resharding feature detects this situation and automatically increases
the number of shards used by the bucket index, resulting in the
the number of shards used by the bucket index, resulting in a
reduction of the number of entries in each bucket index shard. This
process is transparent to the user.

By default dynamic bucket index resharding can only increase the
number of bucket index shards to 1999, although the upper-bound is a
configuration parameter (see Configuration below). Furthermore, when
number of bucket index shards to 1999, although this upper-bound is a
configuration parameter (see Configuration below). When
possible, the process chooses a prime number of bucket index shards to
help spread the number of bucket index entries across the bucket index
spread the number of bucket index entries across the bucket index
shards more evenly.
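
The resharding ceiling and per-bucket activity described above can be inspected
with ``radosgw-admin``; a rough sketch (the bucket name is a placeholder and the
prime shard count is only illustrative)::

    # buckets currently queued for dynamic resharding
    radosgw-admin reshard list

    # resharding status for a single bucket
    radosgw-admin reshard status --bucket=mybucket

    # manually reshard a bucket to a chosen (ideally prime) number of shards
    radosgw-admin bucket reshard --bucket=mybucket --num-shards=211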

The detection process runs in a background process that periodically
@@ -8,8 +8,8 @@ new developers to get up to speed with the implementation details.
Introduction
------------

Swift offers something called a container, that we use interchangeably with
the term bucket. One may say that RGW's buckets implement Swift containers.
Swift offers something called a *container*, which we use interchangeably with
the term *bucket*, so we say that RGW's buckets implement Swift containers.

This document does not consider how RGW operates on these structures,
e.g. the use of encode() and decode() methods for serialization and so on.
@@ -42,18 +42,18 @@ Some variables have been used in above commands, they are:
- bucket: Holds a mapping between bucket name and bucket instance id
- bucket.instance: Holds bucket instance information[2]

Every metadata entry is kept on a single rados object. See below for implementation details.
Every metadata entry is kept on a single RADOS object. See below for implementation details.

Note that the metadata is not indexed. When listing a metadata section we do a
rados pgls operation on the containing pool.
RADOS ``pgls`` operation on the containing pool.
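
As a quick illustration of the metadata sections listed above (the bucket name
is a placeholder), entries can be listed and fetched with ``radosgw-admin``::

    radosgw-admin metadata list bucket
    radosgw-admin metadata list bucket.instance
    radosgw-admin metadata get bucket:mybucket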

Bucket Index
^^^^^^^^^^^^

It's a different kind of metadata, and kept separately. The bucket index holds
a key-value map in rados objects. By default it is a single rados object per
bucket, but it is possible since Hammer to shard that map over multiple rados
objects. The map itself is kept in omap, associated with each rados object.
a key-value map in RADOS objects. By default it is a single RADOS object per
bucket, but it is possible since Hammer to shard that map over multiple RADOS
objects. The map itself is kept in omap, associated with each RADOS object.
The key of each omap is the name of the objects, and the value holds some basic
metadata of that object -- metadata that shows up when listing the bucket.
Also, each omap holds a header, and we keep some bucket accounting metadata
@@ -66,7 +66,7 @@ objects there is more information that we keep on other keys.
Data
^^^^

Objects data is kept in one or more rados objects for each rgw object.
Objects data is kept in one or more RADOS objects for each rgw object.

Object Lookup Path
------------------
@@ -96,7 +96,7 @@ causes no ambiguity. For the same reason, slashes are permitted in object
names (keys).

It is also possible to create multiple data pools and make it so that
different users buckets will be created in different rados pools by default,
different users buckets will be created in different RADOS pools by default,
thus providing the necessary scaling. The layout and naming of these pools
is controlled by a 'policy' setting.[3]
@@ -187,7 +187,7 @@ Known pools:
namespace: users.keys
47UA98JSTJZ9YAN3OS3O

This allows radosgw to look up users by their access keys during authentication.
This allows ``radosgw`` to look up users by their access keys during authentication.

namespace: users.swift
test:tester
@@ -7,16 +7,16 @@
RBD images can be live-migrated between different pools within the same cluster
or between different image formats and layouts. When started, the source image
will be deep-copied to the destination image, pulling all snapshot history and
optionally keeping any link to the source image's parent to help preserve
optionally preserving any link to the source image's parent to preserve
sparseness.

This copy process can safely run in the background while the new target image is
in-use. There is currently a requirement to temporarily stop using the source
in use. There is currently a requirement to temporarily stop using the source
image before preparing a migration. This helps to ensure that the client using
the image is updated to point to the new target image.

.. note::
   Image live-migration requires the Ceph Nautilus release or later. The krbd
   Image live-migration requires the Ceph Nautilus release or later. The ``krbd``
   kernel module does not support live-migration at this time.
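
A rough sketch of the migration workflow described above (pool and image names
are placeholders): the source is prepared, copied, and finally committed once
clients have been repointed at the target::

    rbd migration prepare sourcepool/sourceimage targetpool/targetimage
    rbd migration execute targetpool/targetimage
    rbd migration commit targetpool/targetimage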
@@ -13,14 +13,14 @@ capability is available in two modes:
  actual image. The remote cluster will read from this associated journal and
  replay the updates to its local copy of the image. Since each write to the
  RBD image will result in two writes to the Ceph cluster, expect write
  latencies to nearly double when using the RBD journaling image feature.
  latencies to nearly double while using the RBD journaling image feature.

* **Snapshot-based**: This mode uses periodically scheduled or manually
  created RBD image mirror-snapshots to replicate crash-consistent RBD images
  between clusters. The remote cluster will determine any data or metadata
  updates between two mirror-snapshots and copy the deltas to its local copy of
  the image. With the help of the RBD fast-diff image feature, updated data
  blocks can be quickly computed without the need to scan the full RBD image.
  the image. With the help of the RBD ``fast-diff`` image feature, updated data
  blocks can be quickly determined without the need to scan the full RBD image.
  Since this mode is not as fine-grained as journaling, the complete delta
  between two snapshots will need to be synced prior to use during a failover
  scenario. Any partially applied set of deltas will be rolled back at moment
@@ -30,10 +30,10 @@ capability is available in two modes:
snapshot-based mirroring requires the Ceph Octopus release or later.

Mirroring is configured on a per-pool basis within peer clusters and can be
configured on a specific subset of images within the pool or configured to
automatically mirror all images within a pool when using journal-based
mirroring only. Mirroring is configured using the ``rbd`` command. The
``rbd-mirror`` daemon is responsible for pulling image updates from the remote,
configured on a specific subset of images within the pool. You can also mirror
all images within a given pool when using journal-based
mirroring. Mirroring is configured using the ``rbd`` command. The
``rbd-mirror`` daemon is responsible for pulling image updates from the remote
peer cluster and applying them to the image within the local cluster.

Depending on the desired needs for replication, RBD mirroring can be configured
@@ -57,30 +57,35 @@ Pool Configuration

The following procedures demonstrate how to perform the basic administrative
tasks to configure mirroring using the ``rbd`` command. Mirroring is
configured on a per-pool basis within the Ceph clusters.
configured on a per-pool basis.

The pool configuration steps should be performed on both peer clusters. These
procedures assume two clusters, named "site-a" and "site-b", are accessible from
a single host for clarity.
These pool configuration steps should be performed on both peer clusters. These
procedures assume that both clusters, named "site-a" and "site-b", are accessible
from a single host for clarity.

See the `rbd`_ manpage for additional details of how to connect to different
Ceph clusters.

.. note:: The cluster name in the following examples corresponds to a Ceph
   configuration file of the same name (e.g. /etc/ceph/site-b.conf). See the
   `ceph-conf`_ documentation for how to configure multiple clusters.
   `ceph-conf`_ documentation for how to configure multiple clusters. Note
   that ``rbd-mirror`` does **not** require the source and destination clusters
   to have unique internal names; both can and should call themselves ``ceph``.
   The config `files` that ``rbd-mirror`` needs for local and remote clusters
   can be named arbitrarily, and containerizing the daemon is one strategy
   for maintaining them outside of ``/etc/ceph`` to avoid confusion.

Enable Mirroring
----------------

To enable mirroring on a pool with ``rbd``, specify the ``mirror pool enable``
command, the pool name, and the mirroring mode::
To enable mirroring on a pool with ``rbd``, issue the ``mirror pool enable``
subcommand with the pool name, and the mirroring mode::

    rbd mirror pool enable {pool-name} {mode}

The mirroring mode can either be ``image`` or ``pool``:

* **image**: When configured in ``image`` mode, mirroring needs to be
* **image**: When configured in ``image`` mode, mirroring must
  `explicitly enabled`_ on each image.
* **pool** (default): When configured in ``pool`` mode, all images in the pool
  with the journaling feature enabled are mirrored.
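
As a hedged example of the command above (the pool name ``image-pool`` is a
placeholder; the ``--cluster`` option selects the per-cluster configuration
files discussed in the note earlier)::

    rbd --cluster site-a mirror pool enable image-pool image
    rbd --cluster site-b mirror pool enable image-pool image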
@@ -111,13 +116,13 @@ Bootstrap Peers
---------------

In order for the ``rbd-mirror`` daemon to discover its peer cluster, the peer
needs to be registered to the pool and a user account needs to be created.
must be registered and a user account must be created.
This process can be automated with ``rbd`` and the
``mirror pool peer bootstrap create`` and ``mirror pool peer bootstrap import``
commands.

To manually create a new bootstrap token with ``rbd``, specify the
``mirror pool peer bootstrap create`` command, a pool name, along with an
To manually create a new bootstrap token with ``rbd``, issue the
``mirror pool peer bootstrap create`` subcommand, a pool name, and an
optional friendly site name to describe the local cluster::

    rbd mirror pool peer bootstrap create [--site-name {local-site-name}] {pool-name}
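
The resulting token is then registered on the peer cluster with the matching
``import`` subcommand; a rough sketch (site names and the token file path are
placeholders)::

    rbd --cluster site-a mirror pool peer bootstrap create --site-name site-a image-pool > token
    rbd --cluster site-b mirror pool peer bootstrap import --site-name site-b image-pool token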
@@ -289,6 +294,16 @@ For example::
.. tip:: You can enable journaling on all new images by default by adding
   ``rbd default features = 125`` to your Ceph configuration file.

.. tip:: ``rbd-mirror`` tunables are set by default to values suitable for
   mirroring an entire pool. When using ``rbd-mirror`` to migrate single
   volumes between clusters you may achieve substantial performance gains
   by setting ``rbd_mirror_journal_max_fetch_bytes=33554432`` and
   ``rbd_journal_max_payload_bytes=8388608`` within the ``[client]`` config
   section of the local or centralized configuration. Note that these
   settings may allow ``rbd-mirror`` to present a substantial write workload
   to the destination cluster: monitor cluster performance closely during
   migrations and test carefully before running multiple migrations in parallel.
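
A minimal sketch of the ``[client]`` section described in this tip, using the
values quoted above (test in your own environment before relying on them)::

    [client]
        rbd_mirror_journal_max_fetch_bytes = 33554432
        rbd_journal_max_payload_bytes = 8388608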

Create Image Mirror-Snapshots
-----------------------------
@@ -4,10 +4,10 @@

.. index:: Ceph Block Device; OpenStack

You may use Ceph Block Device images with OpenStack through ``libvirt``, which
configures the QEMU interface to ``librbd``. Ceph stripes block device images as
objects across the cluster, which means that large Ceph Block Device images have
better performance than a standalone server!
You can attach Ceph Block Device images to OpenStack instances through ``libvirt``,
which configures the QEMU interface to ``librbd``. Ceph stripes block volumes
across multiple OSDs within the cluster, which means that large volumes can
realize better performance than local drives on a standalone server!

To use Ceph Block Devices with OpenStack, you must install QEMU, ``libvirt``,
and OpenStack first. We recommend using a separate physical node for your
@@ -56,13 +56,13 @@ Three parts of OpenStack integrate with Ceph's block devices:
  every virtual machine inside Ceph directly without using Cinder, which is
  advantageous because it allows you to perform maintenance operations easily
  with the live-migration process. Additionally, if your hypervisor dies it is
  also convenient to trigger ``nova evacuate`` and run the virtual machine
  also convenient to trigger ``nova evacuate`` and reinstate the virtual machine
  elsewhere almost seamlessly. In doing so,
  :ref:`exclusive locks <rbd-exclusive-locks>` prevent multiple
  compute nodes from concurrently accessing the guest disk.

You can use OpenStack Glance to store images in a Ceph Block Device, and you
You can use OpenStack Glance to store images as Ceph Block Devices, and you
can use Cinder to boot a VM using a copy-on-write clone of an image.

The instructions below detail the setup for Glance, Cinder and Nova, although
@@ -78,9 +78,9 @@ while running VMs using a local disk, or vice versa.
Create a Pool
=============

By default, Ceph block devices use the ``rbd`` pool. You may use any available
pool. We recommend creating a pool for Cinder and a pool for Glance. Ensure
your Ceph cluster is running, then create the pools. ::
By default, Ceph block devices live within the ``rbd`` pool. You may use any
suitable pool by specifying it explicitly. We recommend creating a pool for
Cinder and a pool for Glance. Ensure your Ceph cluster is running, then create the pools. ::

    ceph osd pool create volumes
    ceph osd pool create images
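
Pools intended for RBD are typically initialized before first use; a brief
follow-up sketch assuming the pool names created above::

    rbd pool init volumes
    rbd pool init images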
@@ -309,25 +309,26 @@ authenticating with the Ceph cluster. ::
    rbd_user = cinder
    rbd_secret_uuid = 457eb676-33da-42ec-9a8c-9293d545c337

These two flags are also used by the Nova ephemeral backend.
These two flags are also used by the Nova ephemeral back end.

Configuring Nova
----------------

In order to boot all the virtual machines directly into Ceph, you must
In order to boot virtual machines directly from Ceph volumes, you must
configure the ephemeral backend for Nova.

It is recommended to enable the RBD cache in your Ceph configuration file
(enabled by default since Giant). Moreover, enabling the admin socket
brings a lot of benefits while troubleshooting. Having one socket
per virtual machine using a Ceph block device will help investigating performance and/or wrong behaviors.
It is recommended to enable the RBD cache in your Ceph configuration file; this
has been enabled by default since the Giant release. Moreover, enabling the
client admin socket allows the collection of metrics and can be invaluable
for troubleshooting.

This socket can be accessed like this::
This socket can be accessed on the hypervisor (Nova compute) node::

    ceph daemon /var/run/ceph/ceph-client.cinder.19195.32310016.asok help

Now on every compute nodes edit your Ceph configuration file::
To enable RBD cache and admin sockets, ensure that each hypervisor's
``ceph.conf`` contains::

    [client]
        rbd cache = true
@@ -336,7 +337,7 @@ Now on every compute nodes edit your Ceph configuration file::
        log file = /var/log/qemu/qemu-guest-$pid.log
        rbd concurrent management ops = 20

Configure the permissions of these paths::
Configure permissions for these directories::

    mkdir -p /var/run/ceph/guests/ /var/log/qemu/
    chown qemu:libvirtd /var/run/ceph/guests /var/log/qemu/
@@ -344,15 +345,15 @@ Configure the permissions of these paths::
Note that user ``qemu`` and group ``libvirtd`` can vary depending on your system.
The provided example works for RedHat based systems.

.. tip:: If your virtual machine is already running you can simply restart it to get the socket
.. tip:: If your virtual machine is already running you can simply restart it to enable the admin socket

Restart OpenStack
=================

To activate the Ceph block device driver and load the block device pool name
into the configuration, you must restart OpenStack. Thus, for Debian based
systems execute these commands on the appropriate nodes::
into the configuration, you must restart the related OpenStack services.
For Debian based systems execute these commands on the appropriate nodes::

    sudo glance-control api restart
    sudo service nova-compute restart
@@ -383,7 +384,7 @@ You can use `qemu-img`_ to convert from one format to another. For example::
    qemu-img convert -f qcow2 -O raw precise-cloudimg.img precise-cloudimg.raw

When Glance and Cinder are both using Ceph block devices, the image is a
copy-on-write clone, so it can create a new volume quickly. In the OpenStack
copy-on-write clone, so new volumes are created quickly. In the OpenStack
dashboard, you can boot from that volume by performing the following steps:

#. Launch a new instance.
@@ -7,14 +7,14 @@
Shared, Read-only Parent Image Cache
====================================

`Cloned RBD images`_ from a parent usually only modify a small portion of
the image. For example, in a VDI workload, the VMs are cloned from the same
base image and initially only differ by hostname and IP address. During the
booting stage, all of these VMs would re-read portions of duplicate parent
image data from the RADOS cluster. If we have a local cache of the parent
image, this will help to speed up the read process on one host, as well as
to save the client to cluster network traffic.
RBD shared read-only parent image cache requires explicitly enabling in
`Cloned RBD images`_ usually modify only a small fraction of the parent
image. For example, in a VDI use-case, VMs are cloned from the same
base image and initially differ only by hostname and IP address. During
booting, all of these VMs read portions of the same parent
image data. If we have a local cache of the parent
image, this speeds up reads on the caching host. We also achieve
reduction of client-to-cluster network traffic.
RBD cache must be explicitly enabled in
``ceph.conf``. The ``ceph-immutable-object-cache`` daemon is responsible for
caching the parent content on the local disk, and future reads on that data
will be serviced from the local cache.
@@ -64,14 +64,14 @@ The key components of the daemon are:
RADOS cluster and stored in the local caching directory.

On opening each cloned rbd image, ``librbd`` will try to connect to the
cache daemon over its domain socket. If it's successfully connected,
``librbd`` will automatically check with the daemon on the subsequent reads.
cache daemon through its Unix domain socket. Once successfully connected,
``librbd`` will coordinate with the daemon on the subsequent reads.
If there's a read that's not cached, the daemon will promote the RADOS object
to local caching directory, so the next read on that object will be serviced
from local file. The daemon also maintains simple LRU statistics so if there's
not enough capacity it will delete some cold cache files.
from cache. The daemon also maintains simple LRU statistics so that under
capacity pressure it will evict cold cache files as needed.

Here are some important cache options correspond to the following settings:
Here are some important cache configuration settings:

- ``immutable_object_cache_sock`` The path to the domain socket used for
  communication between librbd clients and the ceph-immutable-object-cache
@@ -81,9 +81,9 @@ Here are some important cache options correspond to the following settings:

- ``immutable_object_cache_max_size`` The max size for immutable cache.

- ``immutable_object_cache_watermark`` The watermark for the cache. If the
  capacity reaches to this watermark, the daemon will delete cold cache based
  on the LRU statistics.
- ``immutable_object_cache_watermark`` The high-water mark for the cache. If the
  capacity reaches this threshold the daemon will delete cold cache based
  on LRU statistics.

The ``ceph-immutable-object-cache`` daemon is available within the optional
``ceph-immutable-object-cache`` distribution package.
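
A rough sketch of how these options might appear in ``ceph.conf`` (option names
assume a recent release and the path and sizes are purely illustrative)::

    [client]
        rbd_parent_cache_enabled = true
        immutable_object_cache_path = /mnt/immutable-object-cache
        immutable_object_cache_max_size = 10G
        immutable_object_cache_watermark = 0.9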
@@ -4,7 +4,7 @@

.. index:: Ceph Block Device; RBD Replay

RBD Replay is a set of tools for capturing and replaying Rados Block Device
RBD Replay is a set of tools for capturing and replaying RADOS Block Device
(RBD) workloads. To capture an RBD workload, ``lttng-tools`` must be installed
on the client, and ``librbd`` on the client must be the v0.87 (Giant) release
or later. To replay an RBD workload, ``librbd`` on the client must be the Giant
@@ -4,22 +4,24 @@

.. index:: Ceph Block Device; snapshots

A snapshot is a read-only copy of the state of an image at a particular point in
time. One of the advanced features of Ceph block devices is that you can create
snapshots of the images to retain a history of an image's state. Ceph also
supports snapshot layering, which allows you to clone images (e.g., a VM image)
quickly and easily. Ceph supports block device snapshots using the ``rbd``
command and many higher level interfaces, including `QEMU`_, `libvirt`_,
`OpenStack`_ and `CloudStack`_.
A snapshot is a read-only logical copy of an image at a particular point in
time: a checkpoint. One of the advanced features of Ceph block devices is
that you can create snapshots of images to retain point-in-time state history.
Ceph also supports snapshot layering, which allows you to clone images (e.g., a
VM image) quickly and easily. Ceph block device snapshots are managed using the
``rbd`` command and multiple higher level interfaces, including `QEMU`_,
`libvirt`_, `OpenStack`_ and `CloudStack`_.

.. important:: To use RBD snapshots, you must have a running Ceph cluster.

.. note:: Because RBD does not know about the file system, snapshots are
   `crash-consistent` if they are not coordinated with the mounting
   computer. So, we recommend you stop `I/O` before taking a snapshot of
   an image. If the image contains a file system, the file system must be
   in a consistent state before taking a snapshot or you may have to run
   `fsck`. To stop `I/O` you can use `fsfreeze` command. See
.. note:: Because RBD does not know about any filesystem within an image
   (volume), snapshots are not `crash-consistent` unless they are
   coordinated within the mounting (attaching) operating system.
   We therefore recommend that you pause or stop I/O before taking a snapshot.
   If the volume contains a filesystem, it must be in an internally
   consistent state before taking a snapshot. Snapshots taken at
   inconsistent points may need a `fsck` pass before subsequent
   mounting. To stop `I/O` you can use the `fsfreeze` command. See
   `fsfreeze(8)` man page for more details.
   For virtual machines, `qemu-guest-agent` can be used to automatically
   freeze file systems when creating a snapshot.
@@ -37,10 +39,10 @@ command and many higher level interfaces, including `QEMU`_, `libvirt`_,
Cephx Notes
===========

When `cephx`_ is enabled (it is by default), you must specify a user name or ID
and a path to the keyring containing the corresponding key for the user. See
:ref:`User Management <user-management>` for details. You may also add the ``CEPH_ARGS`` environment
variable to avoid re-entry of the following parameters. ::
When `cephx`_ authentication is enabled (it is by default), you must specify a
user name or ID and a path to the keyring containing the corresponding key. See
:ref:`User Management <user-management>` for details. You may also set the
``CEPH_ARGS`` environment variable to avoid re-entry of these parameters. ::

    rbd --id {user-ID} --keyring=/path/to/secret [commands]
    rbd --name {username} --keyring=/path/to/secret [commands]
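
As a hedged illustration of the ``CEPH_ARGS`` shortcut mentioned above (the
user ID, keyring path, and pool name are placeholders)::

    export CEPH_ARGS="--id rbd-user --keyring=/etc/ceph/ceph.client.rbd-user.keyring"
    rbd ls mypool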
@@ -58,12 +60,12 @@ Snapshot Basics
===============

The following procedures demonstrate how to create, list, and remove
snapshots using the ``rbd`` command on the command line.
snapshots using the ``rbd`` command.

Create Snapshot
---------------

To create a snapshot with ``rbd``, specify the ``snap create`` option, the pool
name and the image name. ::

    rbd snap create {pool-name}/{image-name}@{snap-name}
@@ -102,14 +104,14 @@ For example::
the current version of the image with data from a snapshot. The
time it takes to execute a rollback increases with the size of the
image. It is **faster to clone** from a snapshot **than to rollback**
an image to a snapshot, and it is the preferred method of returning
an image to a snapshot, and is the preferred method of returning
to a pre-existing state.

Delete a Snapshot
-----------------

To delete a snapshot with ``rbd``, specify the ``snap rm`` option, the pool
To delete a snapshot with ``rbd``, specify the ``snap rm`` subcommand, the pool
name, the image name and the snap name. ::

    rbd snap rm {pool-name}/{image-name}@{snap-name}
@@ -120,13 +122,13 @@ For example::

.. note:: Ceph OSDs delete data asynchronously, so deleting a snapshot
   doesn't free up the disk space immediately.
   doesn't immediately free up the underlying OSDs' capacity.

Purge Snapshots
---------------

To delete all snapshots for an image with ``rbd``, specify the ``snap purge``
option and the image name. ::
subcommand and the image name. ::

    rbd snap purge {pool-name}/{image-name}
@@ -161,7 +163,7 @@ clones rapidly.

Parent Child

.. note:: The terms "parent" and "child" mean a Ceph block device snapshot (parent),
.. note:: The terms "parent" and "child" refer to a Ceph block device snapshot (parent),
   and the corresponding image cloned from the snapshot (child). These terms are
   important for the command line usage below.
@@ -171,12 +173,12 @@ the cloned image to open the parent snapshot and read it.
A COW clone of a snapshot behaves exactly like any other Ceph block device
image. You can read to, write from, clone, and resize cloned images. There are
no special restrictions with cloned images. However, the copy-on-write clone of
a snapshot refers to the snapshot, so you **MUST** protect the snapshot before
a snapshot depends on the snapshot, so you **MUST** protect the snapshot before
you clone it. The following diagram depicts the process.

.. note:: Ceph only supports cloning for format 2 images (i.e., created with
.. note:: Ceph only supports cloning of RBD format 2 images (i.e., created with
   ``rbd create --image-format 2``). The kernel client supports cloned images
   since kernel 3.10.
   beginning with the 3.10 release.
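
To make the protect-then-clone requirement concrete, a rough sketch using the
same placeholder naming convention as the other commands in this document::

    rbd snap protect {pool-name}/{parent-image}@{snap-name}
    rbd clone {pool-name}/{parent-image}@{snap-name} {pool-name}/{child-image-name}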

Getting Started with Layering
-----------------------------
@@ -151,13 +151,13 @@ If it doesn't exist, create your branch::
Make a Change
-------------

Modifying a document involves opening a restructuredText file, changing
Modifying a document involves opening a reStructuredText file, changing
its contents, and saving the changes. See `Documentation Style Guide`_ for
details on syntax requirements.

Adding a document involves creating a new restructuredText file under the
``doc`` directory or its subdirectories and saving the file with a ``*.rst``
file extension. You must also include a reference to the document: a hyperlink
Adding a document involves creating a new reStructuredText file within the
``doc`` directory tree with a ``*.rst``
extension. You must also include a reference to the document: a hyperlink
or a table of contents entry. The ``index.rst`` file of a top-level directory
usually contains a TOC, where you can add the new file name. All documents must
have a title. See `Headings`_ for details.
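
A minimal sketch of such a table-of-contents entry in an ``index.rst`` (the
document name ``my_new_document`` is a hypothetical placeholder)::

    .. toctree::
       :maxdepth: 1

       My New Document <my_new_document>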
@@ -21,33 +21,44 @@ data cluster (e.g., OpenStack, CloudStack, etc).
CPU
===

Ceph metadata servers dynamically redistribute their load, which is CPU
intensive. So your metadata servers should have significant processing power
(e.g., quad core or better CPUs). Ceph OSDs run the :term:`RADOS` service, calculate
CephFS metadata servers are CPU intensive, so they should have significant
processing power (e.g., quad core or better CPUs) and benefit from higher clock
rate (frequency in GHz). Ceph OSDs run the :term:`RADOS` service, calculate
data placement with :term:`CRUSH`, replicate data, and maintain their own copy of the
cluster map. Therefore, OSDs should have a reasonable amount of processing power
(e.g., dual core processors). Monitors simply maintain a master copy of the
cluster map, so they are not CPU intensive. You must also consider whether the
cluster map. Therefore, OSD nodes should have a reasonable amount of processing
power. Requirements vary by use-case; a starting point might be one core per
OSD for light / archival usage, and two cores per OSD for heavy workloads such
as RBD volumes attached to VMs. Monitor / manager nodes do not have heavy CPU
demands so a modest processor can be chosen for them. Also consider whether the
host machine will run CPU-intensive processes in addition to Ceph daemons. For
example, if your hosts will run computing VMs (e.g., OpenStack Nova), you will
need to ensure that these other processes leave sufficient processing power for
Ceph daemons. We recommend running additional CPU-intensive processes on
separate hosts.
separate hosts to avoid resource contention.

RAM
===

Generally, more RAM is better.
Generally, more RAM is better. Monitor / manager nodes for a modest cluster
might do fine with 64GB; for a larger cluster with hundreds of OSDs, 128GB
is a reasonable target. There is a memory target for BlueStore OSDs that
defaults to 4GB. Factor in a prudent margin for the operating system and
administrative tasks (like monitoring and metrics) as well as increased
consumption during recovery: provisioning ~8GB per BlueStore OSD
is advised.
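
The BlueStore memory target mentioned above can be raised centrally when
provisioning allows; a hedged sketch (the value is 8GB expressed in bytes,
mirroring the guidance above)::

    ceph config set osd osd_memory_target 8589934592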

Monitors and managers (ceph-mon and ceph-mgr)
---------------------------------------------

Monitor and manager daemon memory usage generally scales with the size of the
cluster. For small clusters, 1-2 GB is generally sufficient. For
large clusters, you should provide more (5-10 GB). You may also want
to consider tuning settings like ``mon_osd_cache_size`` or
``rocksdb_cache_size``.
cluster. Note that at boot-time and during topology changes and recovery these
daemons will need more RAM than they do during steady-state operation, so plan
for peak usage. For very small clusters, 32 GB suffices. For
clusters of up to, say, 300 OSDs go with 64GB. For clusters built with (or
which will grow to) even more OSDs you should provision
128GB. You may also want to consider tuning settings like ``mon_osd_cache_size``
or ``rocksdb_cache_size`` after careful research.

Metadata servers (ceph-mds)
---------------------------
@@ -108,8 +119,8 @@ performance tradeoffs to consider when planning for data storage. Simultaneous
OS operations, and simultaneous request for read and write operations from
multiple daemons against a single drive can slow performance considerably.

.. important:: Since Ceph has to write all data to the journal before it can
   send an ACK (for XFS at least), having the journal and OSD
.. important:: Since Ceph has to write all data to the journal (or WAL+DB)
   before it can ACK writes, having this metadata and OSD
   performance in balance is really important!
@@ -127,23 +138,25 @@ at $150.00 has a cost of $0.05 per gigabyte (i.e., $150 / 3072 = 0.0488). In the
foregoing example, using the 1 terabyte disks would generally increase the cost
per gigabyte by 40%--rendering your cluster substantially less cost efficient.

.. tip:: Running multiple OSDs on a single disk--irrespective of partitions--is
   **NOT** a good idea.
.. tip:: Running multiple OSDs on a single SAS / SATA drive
   is **NOT** a good idea. NVMe drives, however, can achieve
   improved performance by being split into two or more OSDs.

.. tip:: Running an OSD and a monitor or a metadata server on a single
   disk--irrespective of partitions--is **NOT** a good idea either.
   drive is also **NOT** a good idea.

Storage drives are subject to limitations on seek time, access time, read and
write times, as well as total throughput. These physical limitations affect
overall system performance--especially during recovery. We recommend using a
dedicated drive for the operating system and software, and one drive for each
Ceph OSD Daemon you run on the host. Most "slow OSD" issues arise due to running
dedicated (ideally mirrored) drive for the operating system and software, and
one drive for each Ceph OSD Daemon you run on the host (modulo NVMe above).
Many "slow OSD" issues not attributable to hardware failure arise from running
an operating system, multiple OSDs, and/or multiple journals on the same drive.
Since the cost of troubleshooting performance issues on a small cluster likely
exceeds the cost of the extra disk drives, you can optimize your cluster
design planning by avoiding the temptation to overtax the OSD storage drives.

You may run multiple Ceph OSD Daemons per hard disk drive, but this will likely
You may run multiple Ceph OSD Daemons per SAS / SATA drive, but this will likely
lead to resource contention and diminish the overall throughput. You may store a
journal and object data on the same drive, but this may increase the time it
takes to journal a write and ACK to the client. Ceph must write to the journal
@@ -196,12 +209,9 @@ are a few important performance considerations for journals and SSDs:
proper partition alignment with SSDs, which can cause SSDs to transfer data
much more slowly. Ensure that SSD partitions are properly aligned.

While SSDs are cost prohibitive for object storage, OSDs may see a significant
performance improvement by storing an OSD's journal on an SSD and the OSD's
object data on a separate hard disk drive. The ``osd journal`` configuration
setting defaults to ``/var/lib/ceph/osd/$cluster-$id/journal``. You can mount
this path to an SSD or to an SSD partition so that it is not merely a file on
the same disk as the object data.
SSDs have historically been cost prohibitive for object storage, though
emerging QLC drives are closing the gap. HDD OSDs may see a significant
performance improvement by offloading WAL+DB onto an SSD.
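
A hedged sketch of such an offload at OSD-creation time with ``ceph-volume``
(the device paths stand in for an HDD and an SSD/NVMe partition)::

    ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1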

One way Ceph accelerates CephFS file system performance is to segregate the
storage of CephFS metadata from the storage of the CephFS file contents. Ceph
@@ -214,9 +224,12 @@ your CephFS metadata pool that points only to a host's SSD storage media. See
Controllers
-----------

Disk controllers also have a significant impact on write throughput. Carefully,
consider your selection of disk controllers to ensure that they do not create
a performance bottleneck.
Disk controllers (HBAs) can have a significant impact on write throughput.
Carefully consider your selection to ensure that they do not create
a performance bottleneck. Notably, RAID-mode (IR) HBAs may exhibit higher
latency than simpler "JBOD" (IT) mode HBAs, and the RAID SoC, write cache,
and battery backup can substantially increase hardware and maintenance
costs. Some RAID HBAs can be configured with an IT-mode "personality".

.. tip:: The `Ceph blog`_ is often an excellent source of information on Ceph
   performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write
@@ -226,8 +239,8 @@ a performance bottleneck.
Additional Considerations
-------------------------

You may run multiple OSDs per host, but you should ensure that the sum of the
total throughput of your OSD hard disks doesn't exceed the network bandwidth
You typically will run multiple OSDs per host, but you should ensure that the
aggregate throughput of your OSD drives doesn't exceed the network bandwidth
required to service a client's need to read or write data. You should also
consider what percentage of the overall data the cluster stores on each host. If
the percentage on a particular host is large and the host fails, it can lead to
@@ -243,10 +256,10 @@ multiple OSDs per host.
Networks
========

Consider starting with a 10Gbps+ network in your racks. Replicating 1TB of data
Provision at least 10Gbps+ networking in your racks. Replicating 1TB of data
across a 1Gbps network takes 3 hours, and 10TBs takes 30 hours! By contrast,
with a 10Gbps network, the replication times would be 20 minutes and 1 hour
respectively. In a petabyte-scale cluster, failure of an OSD disk should be an
with a 10Gbps network, the replication times would be 20 minutes and 1 hour
respectively. In a petabyte-scale cluster, failure of an OSD drive is an
expectation, not an exception. System administrators will appreciate PGs
recovering from a ``degraded`` state to an ``active + clean`` state as rapidly
as possible, with price / performance tradeoffs taken into consideration.
@@ -255,12 +268,16 @@ cabling more manageable. VLANs using 802.1q protocol require VLAN-capable NICs
and Switches. The added hardware expense may be offset by the operational cost
savings for network setup and maintenance. When using VLANs to handle VM
traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack,
etc.), it is also worth considering using 10G Ethernet. Top-of-rack routers for
each network also need to be able to communicate with spine routers that have
even faster throughput--e.g., 40Gbps to 100Gbps.
etc.), there is additional value in using 10G Ethernet or better; 40Gb or
25/50/100 Gb networking as of 2020 is common for production clusters.

Top-of-rack routers for each network also need to be able to communicate with
spine routers that have even faster throughput, often 40Gb/s or more.

Your server hardware should have a Baseboard Management Controller (BMC).
Administration and deployment tools may also use BMCs extensively, so consider
Administration and deployment tools may also use BMCs extensively, especially
via IPMI or Redfish, so consider
the cost/benefit tradeoff of an out-of-band network for administration.
Hypervisor SSH access, VM image uploads, OS image installs, management sockets,
etc. can impose significant loads on a network. Running three networks may seem
@@ -273,7 +290,7 @@ Failure Domains
===============

A failure domain is any failure that prevents access to one or more OSDs. That
could be a stopped daemon on a host; a hard disk failure, an OS crash, a
malfunctioning NIC, a failed power supply, a network outage, a power outage, and
so forth. When planning out your hardware needs, you must balance the
temptation to reduce costs by placing too many responsibilities into too few
@@ -301,7 +318,7 @@ and development clusters can run successfully with modest hardware.
| | | * ARM processors specifically may |
| | | require additional cores. |
| | | * Actual performance depends on many |
| | | factors including disk, network, and |
| | | factors including drives, net, and |
| | | client throughput and latency. |
| | | Benchmarking is highly recommended. |
| +----------------+-----------------------------------------+
@@ -315,15 +332,15 @@ and development clusters can run successfully with modest hardware.
| +----------------+-----------------------------------------+
| | Network | 1x 1GbE+ NICs (10GbE+ recommended) |
+--------------+----------------+-----------------------------------------+
| ``ceph-mon`` | Processor | - 1 core minimum |
| ``ceph-mon`` | Processor | - 2 cores minimum |
| +----------------+-----------------------------------------+
| | RAM | 2GB+ per daemon |
| | RAM | 24GB+ per daemon |
| +----------------+-----------------------------------------+
| | Disk Space | 10 GB per daemon |
| | Disk Space | 60 GB per daemon |
| +----------------+-----------------------------------------+
| | Network | 1x 1GbE+ NICs |
+--------------+----------------+-----------------------------------------+
| ``ceph-mds`` | Processor | - 1 core minimum |
| ``ceph-mds`` | Processor | - 2 cores minimum |
| +----------------+-----------------------------------------+
| | RAM | 2GB+ per daemon |
| +----------------+-----------------------------------------+
@@ -19,10 +19,11 @@ Linux Kernel
your Linux distribution on any client hosts.

For RBD, if you choose to *track* long-term kernels, we currently recommend
4.x-based "longterm maintenance" kernel series:
4.x-based "longterm maintenance" kernel series or later:

- 4.19.z
- 4.14.z
- 5.x

For CephFS, see the section about `Mounting CephFS using Kernel Driver`_
for kernel version guidance.
@@ -111,30 +112,30 @@ Luminous (12.2.z)
Notes
-----

- **1**: The default kernel has an older version of ``btrfs`` that we do not
  recommend for ``ceph-osd`` storage nodes. We recommend using ``bluestore``
  starting from Mimic, and ``XFS`` for previous releases with ``filestore``.
- **1**: The default kernel has an older version of ``Btrfs`` that we do not
  recommend for ``ceph-osd`` storage nodes. We recommend using ``BlueStore``
  starting with Luminous, and ``XFS`` for previous releases with ``Filestore``.

- **2**: The default kernel has an old Ceph client that we do not recommend
  for kernel client (kernel RBD or the Ceph file system). Upgrade to a
  recommended kernel.

- **3**: The default kernel regularly fails in QA when the ``btrfs``
  file system is used. We recommend using ``bluestore`` starting from
  Mimic, and ``XFS`` for previous releases with ``filestore``.
- **3**: The default kernel regularly fails in QA when the ``Btrfs``
  file system is used. We recommend using ``BlueStore`` starting from
  Luminous, and ``XFS`` for previous releases with ``Filestore``.

- **4**: ``btrfs`` is no longer tested on this release. We recommend
  using ``bluestore``.

- **5**: Some additional features related to dashboard are not available.

- **6**: Building packages are built regularly, but not distributed by Ceph.
- **6**: Packages are built regularly, but not distributed by upstream Ceph.

Testing
-------

- **B**: We build release packages for this platform. For some of these
  platforms, we may also continuously build all ceph branches and exercise
  platforms, we may also continuously build all Ceph branches and perform
  basic unit tests.

- **I**: We do basic installation and functionality tests of releases on this