2012-09-18 18:08:23 +00:00
|
|
|
|
==============
|
2013-05-15 00:05:43 +00:00
|
|
|
|
Architecture
|
2012-09-18 18:08:23 +00:00
|
|
|
|
==============
|
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
:term:`Ceph` uniquely delivers **object, block, and file storage** in one
|
|
|
|
|
unified system. Ceph is highly reliable, easy to manage, and free. The power of
|
|
|
|
|
Ceph can transform your company's IT infrastructure and your ability to manage
|
|
|
|
|
vast amounts of data. Ceph delivers extraordinary scalability–thousands of
|
|
|
|
|
clients accessing petabytes to exabytes of data. A :term:`Ceph Node` leverages
|
|
|
|
|
commodity hardware and intelligent daemons, and a :term:`Ceph Storage Cluster`
|
|
|
|
|
accommodates large numbers of nodes, which communicate with each other to
|
2014-05-06 10:53:03 +00:00
|
|
|
|
replicate and redistribute data dynamically.
|
2012-09-18 18:08:23 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
.. image:: images/stack.png
|
2012-12-04 04:48:02 +00:00
|
|
|
|
|
2022-11-22 04:02:34 +00:00
|
|
|
|
.. _arch-ceph-storage-cluster:
|
2012-09-18 18:08:23 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
The Ceph Storage Cluster
|
|
|
|
|
========================
|
|
|
|
|
|
|
|
|
|
Ceph provides an infinitely scalable :term:`Ceph Storage Cluster` based upon
|
|
|
|
|
:abbr:`RADOS (Reliable Autonomic Distributed Object Store)`, which you can read
|
|
|
|
|
about in `RADOS - A Scalable, Reliable Storage Service for Petabyte-scale
|
2014-05-06 10:53:03 +00:00
|
|
|
|
Storage Clusters`_.
|
|
|
|
|
|
2020-11-19 06:57:54 +00:00
|
|
|
|
A Ceph Storage Cluster consists of multiple types of daemons:
|
2014-05-06 10:53:03 +00:00
|
|
|
|
|
2014-08-14 03:27:31 +00:00
|
|
|
|
- :term:`Ceph Monitor`
|
2014-09-11 17:50:42 +00:00
|
|
|
|
- :term:`Ceph OSD Daemon`
|
2020-11-19 06:57:54 +00:00
|
|
|
|
- :term:`Ceph Manager`
|
|
|
|
|
- :term:`Ceph Metadata Server`
|
2013-05-15 00:05:43 +00:00
|
|
|
|
|
2023-09-10 03:10:09 +00:00
|
|
|
|
Ceph Monitors maintain the master copy of the cluster map, which they provide
|
|
|
|
|
to Ceph clients. Provisioning multiple monitors within the Ceph cluster ensures
|
|
|
|
|
availability in the event that one of the monitor daemons or its host fails.
|
|
|
|
|
The Ceph monitor provides copies of the cluster map to storage cluster clients.
|
2014-08-14 03:27:31 +00:00
|
|
|
|
|
|
|
|
|
A Ceph OSD Daemon checks its own state and the state of other OSDs and reports
|
|
|
|
|
back to monitors.
|
|
|
|
|
|
2023-09-10 03:10:09 +00:00
|
|
|
|
A Ceph Manager serves as an endpoint for monitoring, orchestration, and plug-in
|
2020-11-19 06:57:54 +00:00
|
|
|
|
modules.
|
|
|
|
|
|
|
|
|
|
A Ceph Metadata Server (MDS) manages file metadata when CephFS is used to
|
|
|
|
|
provide file services.
|
|
|
|
|
|
2023-09-10 03:10:09 +00:00
|
|
|
|
Storage cluster clients and :term:`Ceph OSD Daemon`\s use the CRUSH algorithm
|
|
|
|
|
to compute information about data location. This means that clients and OSDs
|
|
|
|
|
are not bottlenecked by a central lookup table. Ceph's high-level features
|
|
|
|
|
include a native interface to the Ceph Storage Cluster via ``librados``, and a
|
|
|
|
|
number of service interfaces built on top of ``librados``.
|
2012-09-18 18:08:23 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
Storing Data
|
|
|
|
|
------------
|
|
|
|
|
|
2022-10-03 12:51:35 +00:00
|
|
|
|
The Ceph Storage Cluster receives data from :term:`Ceph Client`\s--whether it
|
2013-05-15 00:05:43 +00:00
|
|
|
|
comes through a :term:`Ceph Block Device`, :term:`Ceph Object Storage`, the
|
2023-09-10 03:10:09 +00:00
|
|
|
|
:term:`Ceph File System`, or a custom implementation that you create by using
|
|
|
|
|
``librados``. The data received by the Ceph Storage Cluster is stored as RADOS
|
|
|
|
|
objects. Each object is stored on an :term:`Object Storage Device` (this is
|
|
|
|
|
also called an "OSD"). Ceph OSDs control read, write, and replication
|
|
|
|
|
operations on storage drives. The default BlueStore back end stores objects
|
|
|
|
|
in a monolithic, database-like fashion.
|
2012-09-18 18:08:23 +00:00
|
|
|
|
|
2020-04-09 13:25:39 +00:00
|
|
|
|
.. ditaa::
|
|
|
|
|
|
2023-09-10 03:10:09 +00:00
|
|
|
|
/------\ +-----+ +-----+
|
|
|
|
|
| obj |------>| {d} |------>| {s} |
|
|
|
|
|
\------/ +-----+ +-----+
|
2012-12-04 04:48:02 +00:00
|
|
|
|
|
2020-11-19 06:57:54 +00:00
|
|
|
|
Object OSD Drive
|
2012-09-18 18:08:23 +00:00
|
|
|
|
|
2023-09-10 03:10:09 +00:00
|
|
|
|
Ceph OSD Daemons store data as objects in a flat namespace. This means that
|
|
|
|
|
objects are not stored in a hierarchy of directories. An object has an
|
|
|
|
|
identifier, binary data, and metadata consisting of name/value pairs.
|
|
|
|
|
:term:`Ceph Client`\s determine the semantics of the object data. For example,
|
|
|
|
|
CephFS uses metadata to store file attributes such as the file owner, the
|
|
|
|
|
created date, and the last modified date.
|
2012-09-18 18:08:23 +00:00
|
|
|
|
|
|
|
|
|
|
2020-04-09 13:25:39 +00:00
|
|
|
|
.. ditaa::
|
|
|
|
|
|
|
|
|
|
/------+------------------------------+----------------\
|
2012-12-04 04:48:02 +00:00
|
|
|
|
| ID | Binary Data | Metadata |
|
|
|
|
|
+------+------------------------------+----------------+
|
|
|
|
|
| 1234 | 0101010101010100110101010010 | name1 = value1 |
|
|
|
|
|
| | 0101100001010100110101010010 | name2 = value2 |
|
|
|
|
|
| | 0101100001010100110101010010 | nameN = valueN |
|
|
|
|
|
\------+------------------------------+----------------/
|
2012-09-18 18:08:23 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
.. note:: An object ID is unique across the entire cluster, not just the local
|
|
|
|
|
filesystem.
|
2012-09-18 18:08:23 +00:00
|
|
|
|
|
2012-12-04 04:48:02 +00:00
|
|
|
|
|
2013-06-14 23:52:25 +00:00
|
|
|
|
.. index:: architecture; high availability, scalability
|
|
|
|
|
|
2023-09-17 08:56:40 +00:00
|
|
|
|
.. _arch_scalability_and_high_availability:
|
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
Scalability and High Availability
|
|
|
|
|
---------------------------------
|
2011-12-01 23:22:15 +00:00
|
|
|
|
|
2023-09-10 03:10:09 +00:00
|
|
|
|
In traditional architectures, clients talk to a centralized component. This
|
|
|
|
|
centralized component might be a gateway, a broker, an API, or a facade. A
|
|
|
|
|
centralized component of this kind acts as a single point of entry to a complex
|
2023-09-10 16:31:30 +00:00
|
|
|
|
subsystem. Architectures that rely upon such a centralized component have a
|
|
|
|
|
single point of failure and incur limits to performance and scalability. If
|
|
|
|
|
the centralized component goes down, the whole system becomes unavailable.
|
2013-05-15 00:05:43 +00:00
|
|
|
|
|
2023-09-10 03:10:09 +00:00
|
|
|
|
Ceph eliminates this centralized component. This enables clients to interact
|
|
|
|
|
with Ceph OSDs directly. Ceph OSDs create object replicas on other Ceph Nodes
|
|
|
|
|
to ensure data safety and high availability. Ceph also uses a cluster of
|
|
|
|
|
monitors to ensure high availability. To eliminate centralization, Ceph uses an
|
|
|
|
|
algorithm called :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)`.
|
2012-09-18 18:08:23 +00:00
|
|
|
|
|
2013-04-23 04:02:45 +00:00
|
|
|
|
|
2013-06-14 23:52:25 +00:00
|
|
|
|
.. index:: CRUSH; architecture
|
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
CRUSH Introduction
|
|
|
|
|
~~~~~~~~~~~~~~~~~~
|
|
|
|
|
|
|
|
|
|
Ceph Clients and Ceph OSD Daemons both use the :abbr:`CRUSH (Controlled
|
2023-09-10 03:10:09 +00:00
|
|
|
|
Replication Under Scalable Hashing)` algorithm to compute information about
|
|
|
|
|
object location instead of relying upon a central lookup table. CRUSH provides
|
|
|
|
|
a better data management mechanism than do older approaches, and CRUSH enables
|
|
|
|
|
massive scale by distributing the work to all the OSD daemons in the cluster
|
|
|
|
|
and all the clients that communicate with them. CRUSH uses intelligent data
|
|
|
|
|
replication to ensure resiliency, which is better suited to hyper-scale
|
|
|
|
|
storage. The following sections provide additional details on how CRUSH works.
|
|
|
|
|
For a detailed discussion of CRUSH, see `CRUSH - Controlled, Scalable,
|
|
|
|
|
Decentralized Placement of Replicated Data`_.
|
2013-04-23 04:02:45 +00:00
|
|
|
|
|
2013-06-14 23:52:25 +00:00
|
|
|
|
.. index:: architecture; cluster map
|
2013-04-23 04:02:45 +00:00
|
|
|
|
|
2022-11-09 13:12:54 +00:00
|
|
|
|
.. _architecture_cluster_map:
|
|
|
|
|
|
2013-04-23 04:02:45 +00:00
|
|
|
|
Cluster Map
|
2013-05-15 00:05:43 +00:00
|
|
|
|
~~~~~~~~~~~
|
2013-04-23 04:02:45 +00:00
|
|
|
|
|
2023-09-12 11:17:40 +00:00
|
|
|
|
In order for a Ceph cluster to function properly, Ceph Clients and Ceph OSDs
|
|
|
|
|
must have current information about the cluster's topology. Current information
|
|
|
|
|
is stored in the "Cluster Map", which is in fact a collection of five maps. The
|
|
|
|
|
five maps that constitute the cluster map are:
|
|
|
|
|
|
|
|
|
|
#. **The Monitor Map:** Contains the cluster ``fsid``, the position, the name,
|
|
|
|
|
the address, and the TCP port of each monitor. The monitor map specifies the
|
|
|
|
|
current epoch, the time of the monitor map's creation, and the time of the
|
|
|
|
|
monitor map's last modification. To view a monitor map, run ``ceph mon
|
|
|
|
|
dump``.
|
2013-04-23 04:02:45 +00:00
|
|
|
|
|
2023-09-12 11:17:40 +00:00
|
|
|
|
#. **The OSD Map:** Contains the cluster ``fsid``, the time of the OSD map's
|
|
|
|
|
creation, the time of the OSD map's last modification, a list of pools, a
|
|
|
|
|
list of replica sizes, a list of PG numbers, and a list of OSDs and their
|
|
|
|
|
statuses (for example, ``up``, ``in``). To view an OSD map, run ``ceph
|
|
|
|
|
osd dump``.
|
2013-04-23 04:02:45 +00:00
|
|
|
|
|
2023-09-12 11:17:40 +00:00
|
|
|
|
#. **The PG Map:** Contains the PG version, its time stamp, the last OSD map
|
|
|
|
|
epoch, the full ratios, and the details of each placement group. This
|
|
|
|
|
includes the PG ID, the `Up Set`, the `Acting Set`, the state of the PG (for
|
|
|
|
|
example, ``active + clean``), and data usage statistics for each pool.
|
2013-04-23 04:02:45 +00:00
|
|
|
|
|
|
|
|
|
#. **The CRUSH Map:** Contains a list of storage devices, the failure domain
|
2023-09-12 11:17:40 +00:00
|
|
|
|
hierarchy (for example, ``device``, ``host``, ``rack``, ``row``, ``room``),
|
|
|
|
|
and rules for traversing the hierarchy when storing data. To view a CRUSH
|
|
|
|
|
map, run ``ceph osd getcrushmap -o {filename}`` and then decompile it by
|
|
|
|
|
running ``crushtool -d {comp-crushmap-filename} -o
|
|
|
|
|
{decomp-crushmap-filename}``. Use a text editor or ``cat`` to view the
|
|
|
|
|
decompiled map.
|
2013-04-23 04:02:45 +00:00
|
|
|
|
|
|
|
|
|
#. **The MDS Map:** Contains the current MDS map epoch, when the map was
|
|
|
|
|
created, and the last time it changed. It also contains the pool for
|
|
|
|
|
storing metadata, a list of metadata servers, and which metadata servers
|
2016-10-12 13:44:57 +00:00
|
|
|
|
are ``up`` and ``in``. To view an MDS map, execute ``ceph fs dump``.
|
2013-04-23 04:02:45 +00:00
|
|
|
|
|
2023-09-12 11:17:40 +00:00
|
|
|
|
Each map maintains a history of changes to its operating state. Ceph Monitors
|
|
|
|
|
maintain a master copy of the cluster map. This master copy includes the
|
|
|
|
|
cluster members, the state of the cluster, changes to the cluster, and
|
|
|
|
|
information recording the overall health of the Ceph Storage Cluster.
|
2013-04-23 04:02:45 +00:00
|
|
|
|
|
2013-06-14 23:52:25 +00:00
|
|
|
|
.. index:: high availability; monitor architecture
|
2013-04-23 04:02:45 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
High Availability Monitors
|
|
|
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
2013-04-23 04:02:45 +00:00
|
|
|
|
|
2023-09-13 14:09:45 +00:00
|
|
|
|
A Ceph Client must contact a Ceph Monitor and obtain a current copy of the
|
|
|
|
|
cluster map in order to read data from or to write data to the Ceph cluster.
|
|
|
|
|
|
|
|
|
|
It is possible for a Ceph cluster to function properly with only a single
|
|
|
|
|
monitor, but a Ceph cluster that has only a single monitor has a single point
|
|
|
|
|
of failure: if the monitor goes down, Ceph clients will be unable to read data
|
|
|
|
|
from or write data to the cluster.
|
|
|
|
|
|
|
|
|
|
Ceph leverages a cluster of monitors in order to increase reliability and fault
|
|
|
|
|
tolerance. When a cluster of monitors is used, however, one or more of the
|
|
|
|
|
monitors in the cluster can fall behind due to latency or other faults. Ceph
|
|
|
|
|
mitigates these negative effects by requiring multiple monitor instances to
|
|
|
|
|
agree about the state of the cluster. To establish consensus among the monitors
|
|
|
|
|
regarding the state of the cluster, Ceph uses the `Paxos`_ algorithm and a
|
|
|
|
|
majority of monitors (for example, one in a cluster that contains only one
|
|
|
|
|
monitor, two in a cluster that contains three monitors, three in a cluster that
|
|
|
|
|
contains five monitors, four in a cluster that contains six monitors, and so
|
|
|
|
|
on).
|
|
|
|
|
|
|
|
|
|
See the `Monitor Config Reference`_ for more detail on configuring monitors.
|
2013-04-23 04:02:45 +00:00
|
|
|
|
|
2013-06-14 23:52:25 +00:00
|
|
|
|
.. index:: architecture; high availability authentication
|
2013-04-23 04:02:45 +00:00
|
|
|
|
|
2023-03-28 08:42:11 +00:00
|
|
|
|
.. _arch_high_availability_authentication:
|
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
High Availability Authentication
|
|
|
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
|
|
2023-09-16 12:27:29 +00:00
|
|
|
|
The ``cephx`` authentication system is used by Ceph to authenticate users and
|
|
|
|
|
daemons and to protect against man-in-the-middle attacks.
|
2014-08-25 18:02:27 +00:00
|
|
|
|
|
|
|
|
|
.. note:: The ``cephx`` protocol does not address data encryption in transport
|
2023-09-17 08:56:40 +00:00
|
|
|
|
(for example, SSL/TLS) or encryption at rest.
|
|
|
|
|
|
|
|
|
|
``cephx`` uses shared secret keys for authentication. This means that both the
|
|
|
|
|
client and the monitor cluster keep a copy of the client's secret key.
|
|
|
|
|
|
|
|
|
|
The ``cephx`` protocol makes it possible for each party to prove to the other
|
|
|
|
|
that it has a copy of the key without revealing it. This provides mutual
|
|
|
|
|
authentication and allows the cluster to confirm (1) that the user has the
|
|
|
|
|
secret key and (2) that the user can be confident that the cluster has a copy
|
|
|
|
|
of the secret key.
|
2014-08-25 18:02:27 +00:00
|
|
|
|
|
2023-09-17 08:56:40 +00:00
|
|
|
|
As stated in :ref:`Scalability and High Availability
|
|
|
|
|
<arch_scalability_and_high_availability>`, Ceph does not have any centralized
|
|
|
|
|
interface between clients and the Ceph object store. By avoiding such a
|
|
|
|
|
centralized interface, Ceph avoids the bottlenecks that attend such centralized
|
|
|
|
|
interfaces. However, this means that clients must interact directly with OSDs.
|
|
|
|
|
Direct interactions between Ceph clients and OSDs require authenticated
|
|
|
|
|
connections. The ``cephx`` authentication system establishes and sustains these
|
|
|
|
|
authenticated connections.
|
2014-08-14 03:27:31 +00:00
|
|
|
|
|
2023-09-17 08:56:40 +00:00
|
|
|
|
The ``cephx`` protocol operates in a manner similar to `Kerberos`_.
|
2014-08-14 03:27:31 +00:00
|
|
|
|
|
2023-09-17 20:41:28 +00:00
|
|
|
|
A user invokes a Ceph client to contact a monitor. Unlike Kerberos, each
|
|
|
|
|
monitor can authenticate users and distribute keys, which means that there is
|
|
|
|
|
no single point of failure and no bottleneck when using ``cephx``. The monitor
|
|
|
|
|
returns an authentication data structure that is similar to a Kerberos ticket.
|
|
|
|
|
This authentication data structure contains a session key for use in obtaining
|
|
|
|
|
Ceph services. The session key is itself encrypted with the user's permanent
|
|
|
|
|
secret key, which means that only the user can request services from the Ceph
|
|
|
|
|
Monitors. The client then uses the session key to request services from the
|
|
|
|
|
monitors, and the monitors provide the client with a ticket that authenticates
|
|
|
|
|
the client against the OSDs that actually handle data. Ceph Monitors and OSDs
|
|
|
|
|
share a secret, which means that the clients can use the ticket provided by the
|
|
|
|
|
monitors to authenticate against any OSD or metadata server in the cluster.
|
|
|
|
|
|
|
|
|
|
Like Kerberos tickets, ``cephx`` tickets expire. An attacker cannot use an
|
|
|
|
|
expired ticket or session key that has been obtained surreptitiously. This form
|
|
|
|
|
of authentication prevents attackers who have access to the communications
|
|
|
|
|
medium from creating bogus messages under another user's identity and prevents
|
|
|
|
|
attackers from altering another user's legitimate messages, as long as the
|
|
|
|
|
user's secret key is not divulged before it expires.
|
|
|
|
|
|
|
|
|
|
An administrator must set up users before using ``cephx``. In the following
|
|
|
|
|
diagram, the ``client.admin`` user invokes ``ceph auth get-or-create-key`` from
|
2014-08-14 03:27:31 +00:00
|
|
|
|
the command line to generate a username and secret key. Ceph's ``auth``
|
2023-09-17 20:41:28 +00:00
|
|
|
|
subsystem generates the username and key, stores a copy on the monitor(s), and
|
|
|
|
|
transmits the user's secret back to the ``client.admin`` user. This means that
|
2014-08-14 03:27:31 +00:00
|
|
|
|
the client and the monitor share a secret key.
|
|
|
|
|
|
|
|
|
|
.. note:: The ``client.admin`` user must provide the user ID and
|
|
|
|
|
secret key to the user in a secure manner.
|
|
|
|
|
|
2020-04-09 13:25:39 +00:00
|
|
|
|
.. ditaa::
|
|
|
|
|
|
|
|
|
|
+---------+ +---------+
|
2014-08-14 03:27:31 +00:00
|
|
|
|
| Client | | Monitor |
|
|
|
|
|
+---------+ +---------+
|
|
|
|
|
| request to |
|
|
|
|
|
| create a user |
|
|
|
|
|
|-------------->|----------+ create user
|
|
|
|
|
| | | and
|
|
|
|
|
|<--------------|<---------+ store key
|
|
|
|
|
| transmit key |
|
|
|
|
|
| |
|
|
|
|
|
|
2023-09-23 02:26:18 +00:00
|
|
|
|
Here is how a client authenticates with a monitor. The client passes the user
|
|
|
|
|
name to the monitor. The monitor generates a session key that is encrypted with
|
|
|
|
|
the secret key associated with the ``username``. The monitor transmits the
|
|
|
|
|
encrypted ticket to the client. The client uses the shared secret key to
|
|
|
|
|
decrypt the payload. The session key identifies the user, and this act of
|
|
|
|
|
identification will last for the duration of the session. The client requests
|
|
|
|
|
a ticket for the user, and the ticket is signed with the session key. The
|
|
|
|
|
monitor generates a ticket and uses the user's secret key to encrypt it. The
|
|
|
|
|
encrypted ticket is transmitted to the client. The client decrypts the ticket
|
|
|
|
|
and uses it to sign requests to OSDs and to metadata servers in the cluster.
|
2014-08-14 03:27:31 +00:00
|
|
|
|
|
2020-04-09 13:25:39 +00:00
|
|
|
|
.. ditaa::
|
|
|
|
|
|
|
|
|
|
+---------+ +---------+
|
2014-08-14 03:27:31 +00:00
|
|
|
|
| Client | | Monitor |
|
|
|
|
|
+---------+ +---------+
|
|
|
|
|
| authenticate |
|
|
|
|
|
|-------------->|----------+ generate and
|
|
|
|
|
| | | encrypt
|
|
|
|
|
|<--------------|<---------+ session key
|
|
|
|
|
| transmit |
|
|
|
|
|
| encrypted |
|
|
|
|
|
| session key |
|
|
|
|
|
| |
|
|
|
|
|
|-----+ decrypt |
|
|
|
|
|
| | session |
|
|
|
|
|
|<----+ key |
|
|
|
|
|
| |
|
|
|
|
|
| req. ticket |
|
|
|
|
|
|-------------->|----------+ generate and
|
|
|
|
|
| | | encrypt
|
|
|
|
|
|<--------------|<---------+ ticket
|
|
|
|
|
| recv. ticket |
|
|
|
|
|
| |
|
|
|
|
|
|-----+ decrypt |
|
|
|
|
|
| | ticket |
|
|
|
|
|
|<----+ |
|
|
|
|
|
|
|
|
|
|
|
2023-09-25 04:34:06 +00:00
|
|
|
|
The ``cephx`` protocol authenticates ongoing communications between the clients
|
|
|
|
|
and Ceph daemons. After initial authentication, each message sent between a
|
|
|
|
|
client and a daemon is signed using a ticket that can be verified by monitors,
|
|
|
|
|
OSDs, and metadata daemons. This ticket is verified by using the secret shared
|
|
|
|
|
between the client and the daemon.
|
2014-08-14 03:27:31 +00:00
|
|
|
|
|
2020-04-09 13:25:39 +00:00
|
|
|
|
.. ditaa::
|
|
|
|
|
|
|
|
|
|
+---------+ +---------+ +-------+ +-------+
|
2014-08-14 03:27:31 +00:00
|
|
|
|
| Client | | Monitor | | MDS | | OSD |
|
|
|
|
|
+---------+ +---------+ +-------+ +-------+
|
|
|
|
|
| request to | | |
|
|
|
|
|
| create a user | | |
|
|
|
|
|
|-------------->| mon and | |
|
|
|
|
|
|<--------------| client share | |
|
|
|
|
|
| receive | a secret. | |
|
|
|
|
|
| shared secret | | |
|
|
|
|
|
| |<------------>| |
|
|
|
|
|
| |<-------------+------------>|
|
|
|
|
|
| | mon, mds, | |
|
|
|
|
|
| authenticate | and osd | |
|
|
|
|
|
|-------------->| share | |
|
|
|
|
|
|<--------------| a secret | |
|
|
|
|
|
| session key | | |
|
|
|
|
|
| | | |
|
|
|
|
|
| req. ticket | | |
|
|
|
|
|
|-------------->| | |
|
|
|
|
|
|<--------------| | |
|
|
|
|
|
| recv. ticket | | |
|
|
|
|
|
| | | |
|
|
|
|
|
| make request (CephFS only) | |
|
|
|
|
|
|----------------------------->| |
|
|
|
|
|
|<-----------------------------| |
|
|
|
|
|
| receive response (CephFS only) |
|
|
|
|
|
| |
|
|
|
|
|
| make request |
|
|
|
|
|
|------------------------------------------->|
|
|
|
|
|
|<-------------------------------------------|
|
|
|
|
|
receive response
|
|
|
|
|
|
2023-09-25 04:34:06 +00:00
|
|
|
|
This authentication protects only the connections between Ceph clients and Ceph
|
|
|
|
|
daemons. The authentication is not extended beyond the Ceph client. If a user
|
|
|
|
|
accesses the Ceph client from a remote host, cephx authentication will not be
|
2023-09-17 20:41:28 +00:00
|
|
|
|
applied to the connection between the user's host and the client host.
|
|
|
|
|
|
2023-09-17 08:56:40 +00:00
|
|
|
|
See `Cephx Config Guide`_ for more on configuration details.
|
2013-05-15 00:05:43 +00:00
|
|
|
|
|
2023-09-17 08:56:40 +00:00
|
|
|
|
See `User Management`_ for more on user management.
|
2014-08-25 18:02:27 +00:00
|
|
|
|
|
2023-09-17 08:56:40 +00:00
|
|
|
|
See :ref:`A Detailed Description of the Cephx Authentication Protocol
|
|
|
|
|
<cephx_2012_peter>` for more on the distinction between authorization and
|
|
|
|
|
authentication and for a step-by-step explanation of the setup of ``cephx``
|
|
|
|
|
tickets and session keys.
|
2014-08-25 18:02:27 +00:00
|
|
|
|
|
2013-06-14 23:52:25 +00:00
|
|
|
|
.. index:: architecture; smart daemons and scalability
|
2013-05-15 00:05:43 +00:00
|
|
|
|
|
|
|
|
|
Smart Daemons Enable Hyperscale
|
|
|
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
2023-09-25 21:40:42 +00:00
|
|
|
|
A feature of many storage clusters is a centralized interface that keeps track
|
|
|
|
|
of the nodes that clients are permitted to access. Such centralized
|
|
|
|
|
architectures provide services to clients by means of a double dispatch. At the
|
|
|
|
|
petabyte-to-exabyte scale, such double dispatches are a significant
|
|
|
|
|
bottleneck.
|
|
|
|
|
|
|
|
|
|
Ceph obviates this bottleneck: Ceph's OSD Daemons AND Ceph clients are
|
|
|
|
|
cluster-aware. Like Ceph clients, each Ceph OSD Daemon is aware of other Ceph
|
|
|
|
|
OSD Daemons in the cluster. This enables Ceph OSD Daemons to interact directly
|
|
|
|
|
with other Ceph OSD Daemons and to interact directly with Ceph Monitors. Being
|
|
|
|
|
cluster-aware makes it possible for Ceph clients to interact directly with Ceph
|
|
|
|
|
OSD Daemons.
|
|
|
|
|
|
|
|
|
|
Because Ceph clients, Ceph monitors, and Ceph OSD daemons interact with one
|
|
|
|
|
another directly, Ceph OSD daemons can make use of the aggregate CPU and RAM
|
|
|
|
|
resources of the nodes in the Ceph cluster. This means that a Ceph cluster can
|
|
|
|
|
easily perform tasks that a cluster with a centralized interface would struggle
|
2023-09-26 20:00:34 +00:00
|
|
|
|
to perform. The ability of Ceph nodes to make use of the computing power of
|
2023-09-25 21:40:42 +00:00
|
|
|
|
the greater cluster provides several benefits:
|
2013-05-15 00:05:43 +00:00
|
|
|
|
|
2023-09-26 20:00:34 +00:00
|
|
|
|
#. **OSDs Service Clients Directly:** Network devices can support only a
|
|
|
|
|
limited number of concurrent connections. Because Ceph clients contact
|
|
|
|
|
Ceph OSD daemons directly without first connecting to a central interface,
|
|
|
|
|
Ceph enjoys improved perfomance and increased system capacity relative to
|
|
|
|
|
storage redundancy strategies that include a central interface. Ceph clients
|
|
|
|
|
maintain sessions only when needed, and maintain those sessions with only
|
|
|
|
|
particular Ceph OSD daemons, not with a centralized interface.
|
2013-05-15 00:05:43 +00:00
|
|
|
|
|
|
|
|
|
#. **OSD Membership and Status**: Ceph OSD Daemons join a cluster and report
|
|
|
|
|
on their status. At the lowest level, the Ceph OSD Daemon status is ``up``
|
|
|
|
|
or ``down`` reflecting whether or not it is running and able to service
|
|
|
|
|
Ceph Client requests. If a Ceph OSD Daemon is ``down`` and ``in`` the Ceph
|
|
|
|
|
Storage Cluster, this status may indicate the failure of the Ceph OSD
|
|
|
|
|
Daemon. If a Ceph OSD Daemon is not running (e.g., it crashes), the Ceph OSD
|
2017-05-23 10:27:32 +00:00
|
|
|
|
Daemon cannot notify the Ceph Monitor that it is ``down``. The OSDs
|
|
|
|
|
periodically send messages to the Ceph Monitor (``MPGStats`` pre-luminous,
|
|
|
|
|
and a new ``MOSDBeacon`` in luminous). If the Ceph Monitor doesn't see that
|
|
|
|
|
message after a configurable period of time then it marks the OSD down.
|
|
|
|
|
This mechanism is a failsafe, however. Normally, Ceph OSD Daemons will
|
2023-09-17 20:41:28 +00:00
|
|
|
|
determine if a neighboring OSD is down and report it to the Ceph Monitors.
|
2017-05-23 10:27:32 +00:00
|
|
|
|
This assures that Ceph Monitors are lightweight processes. See `Monitoring
|
|
|
|
|
OSDs`_ and `Heartbeats`_ for additional details.
|
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
#. **Data Scrubbing:** As part of maintaining data consistency and cleanliness,
|
2020-01-16 15:33:19 +00:00
|
|
|
|
Ceph OSD Daemons can scrub objects. That is, Ceph OSD Daemons can compare
|
|
|
|
|
their local objects metadata with its replicas stored on other OSDs. Scrubbing
|
|
|
|
|
happens on a per-Placement Group base. Scrubbing (usually performed daily)
|
|
|
|
|
catches mismatches in size and other metadata. Ceph OSD Daemons also perform deeper
|
|
|
|
|
scrubbing by comparing data in objects bit-for-bit with their checksums.
|
|
|
|
|
Deep scrubbing (usually performed weekly) finds bad sectors on a drive that
|
|
|
|
|
weren't apparent in a light scrub. See `Data Scrubbing`_ for details on
|
2013-05-15 00:05:43 +00:00
|
|
|
|
configuring scrubbing.
|
|
|
|
|
|
|
|
|
|
#. **Replication:** Like Ceph Clients, Ceph OSD Daemons use the CRUSH
|
|
|
|
|
algorithm, but the Ceph OSD Daemon uses it to compute where replicas of
|
|
|
|
|
objects should be stored (and for rebalancing). In a typical write scenario,
|
|
|
|
|
a client uses the CRUSH algorithm to compute where to store an object, maps
|
|
|
|
|
the object to a pool and placement group, then looks at the CRUSH map to
|
|
|
|
|
identify the primary OSD for the placement group.
|
|
|
|
|
|
|
|
|
|
The client writes the object to the identified placement group in the
|
|
|
|
|
primary OSD. Then, the primary OSD with its own copy of the CRUSH map
|
|
|
|
|
identifies the secondary and tertiary OSDs for replication purposes, and
|
|
|
|
|
replicates the object to the appropriate placement groups in the secondary
|
|
|
|
|
and tertiary OSDs (as many OSDs as additional replicas), and responds to the
|
|
|
|
|
client once it has confirmed the object was stored successfully.
|
2013-04-23 04:02:45 +00:00
|
|
|
|
|
2020-04-09 13:25:39 +00:00
|
|
|
|
.. ditaa::
|
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
+----------+
|
|
|
|
|
| Client |
|
|
|
|
|
| |
|
|
|
|
|
+----------+
|
|
|
|
|
* ^
|
|
|
|
|
Write (1) | | Ack (6)
|
|
|
|
|
| |
|
|
|
|
|
v *
|
|
|
|
|
+-------------+
|
|
|
|
|
| Primary OSD |
|
|
|
|
|
| |
|
|
|
|
|
+-------------+
|
|
|
|
|
* ^ ^ *
|
|
|
|
|
Write (2) | | | | Write (3)
|
|
|
|
|
+------+ | | +------+
|
|
|
|
|
| +------+ +------+ |
|
|
|
|
|
| | Ack (4) Ack (5)| |
|
|
|
|
|
v * * v
|
|
|
|
|
+---------------+ +---------------+
|
|
|
|
|
| Secondary OSD | | Tertiary OSD |
|
|
|
|
|
| | | |
|
|
|
|
|
+---------------+ +---------------+
|
2013-04-23 04:02:45 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
With the ability to perform data replication, Ceph OSD Daemons relieve Ceph
|
|
|
|
|
clients from that duty, while ensuring high data availability and data safety.
|
2013-04-23 04:02:45 +00:00
|
|
|
|
|
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
Dynamic Cluster Management
|
|
|
|
|
--------------------------
|
2013-04-23 04:02:45 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
In the `Scalability and High Availability`_ section, we explained how Ceph uses
|
|
|
|
|
CRUSH, cluster awareness and intelligent daemons to scale and maintain high
|
|
|
|
|
availability. Key to Ceph's design is the autonomous, self-healing, and
|
|
|
|
|
intelligent Ceph OSD Daemon. Let's take a deeper look at how CRUSH works to
|
|
|
|
|
enable modern cloud storage infrastructures to place data, rebalance the cluster
|
|
|
|
|
and recover from faults dynamically.
|
2013-04-23 04:02:45 +00:00
|
|
|
|
|
2013-06-14 23:52:25 +00:00
|
|
|
|
.. index:: architecture; pools
|
2013-04-23 04:02:45 +00:00
|
|
|
|
|
|
|
|
|
About Pools
|
2013-05-15 00:05:43 +00:00
|
|
|
|
~~~~~~~~~~~
|
2012-09-18 18:08:23 +00:00
|
|
|
|
|
|
|
|
|
The Ceph storage system supports the notion of 'Pools', which are logical
|
2014-05-06 10:53:03 +00:00
|
|
|
|
partitions for storing objects.
|
2013-04-23 04:02:45 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
Ceph Clients retrieve a `Cluster Map`_ from a Ceph Monitor, and write objects to
|
2017-12-11 16:09:25 +00:00
|
|
|
|
pools. The pool's ``size`` or number of replicas, the CRUSH rule and the
|
2013-05-15 00:05:43 +00:00
|
|
|
|
number of placement groups determine how Ceph will place the data.
|
2013-04-23 04:02:45 +00:00
|
|
|
|
|
2020-04-09 13:25:39 +00:00
|
|
|
|
.. ditaa::
|
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
+--------+ Retrieves +---------------+
|
|
|
|
|
| Client |------------>| Cluster Map |
|
|
|
|
|
+--------+ +---------------+
|
|
|
|
|
|
|
|
|
|
|
v Writes
|
|
|
|
|
/-----\
|
|
|
|
|
| obj |
|
|
|
|
|
\-----/
|
|
|
|
|
| To
|
|
|
|
|
v
|
|
|
|
|
+--------+ +---------------+
|
2017-12-11 16:09:25 +00:00
|
|
|
|
| Pool |---------->| CRUSH Rule |
|
2013-05-15 00:05:43 +00:00
|
|
|
|
+--------+ Selects +---------------+
|
|
|
|
|
|
|
|
|
|
|
2014-05-06 10:53:03 +00:00
|
|
|
|
Pools set at least the following parameters:
|
|
|
|
|
|
|
|
|
|
- Ownership/Access to Objects
|
|
|
|
|
- The Number of Placement Groups, and
|
2017-12-11 16:09:25 +00:00
|
|
|
|
- The CRUSH Rule to Use.
|
2014-05-06 10:53:03 +00:00
|
|
|
|
|
|
|
|
|
See `Set Pool Values`_ for details.
|
|
|
|
|
|
|
|
|
|
|
2013-06-14 23:52:25 +00:00
|
|
|
|
.. index: architecture; placement group mapping
|
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
Mapping PGs to OSDs
|
|
|
|
|
~~~~~~~~~~~~~~~~~~~
|
|
|
|
|
|
|
|
|
|
Each pool has a number of placement groups. CRUSH maps PGs to OSDs dynamically.
|
|
|
|
|
When a Ceph Client stores objects, CRUSH will map each object to a placement
|
|
|
|
|
group.
|
|
|
|
|
|
|
|
|
|
Mapping objects to placement groups creates a layer of indirection between the
|
|
|
|
|
Ceph OSD Daemon and the Ceph Client. The Ceph Storage Cluster must be able to
|
|
|
|
|
grow (or shrink) and rebalance where it stores objects dynamically. If the Ceph
|
|
|
|
|
Client "knew" which Ceph OSD Daemon had which object, that would create a tight
|
|
|
|
|
coupling between the Ceph Client and the Ceph OSD Daemon. Instead, the CRUSH
|
|
|
|
|
algorithm maps each object to a placement group and then maps each placement
|
|
|
|
|
group to one or more Ceph OSD Daemons. This layer of indirection allows Ceph to
|
|
|
|
|
rebalance dynamically when new Ceph OSD Daemons and the underlying OSD devices
|
|
|
|
|
come online. The following diagram depicts how CRUSH maps objects to placement
|
2013-04-23 04:02:45 +00:00
|
|
|
|
groups, and placement groups to OSDs.
|
2012-09-18 18:08:23 +00:00
|
|
|
|
|
2020-04-09 13:25:39 +00:00
|
|
|
|
.. ditaa::
|
|
|
|
|
|
2012-09-18 18:08:23 +00:00
|
|
|
|
/-----\ /-----\ /-----\ /-----\ /-----\
|
|
|
|
|
| obj | | obj | | obj | | obj | | obj |
|
|
|
|
|
\-----/ \-----/ \-----/ \-----/ \-----/
|
|
|
|
|
| | | | |
|
|
|
|
|
+--------+--------+ +---+----+
|
|
|
|
|
| |
|
|
|
|
|
v v
|
|
|
|
|
+-----------------------+ +-----------------------+
|
|
|
|
|
| Placement Group #1 | | Placement Group #2 |
|
|
|
|
|
| | | |
|
|
|
|
|
+-----------------------+ +-----------------------+
|
|
|
|
|
| |
|
|
|
|
|
| +-----------------------+---+
|
|
|
|
|
+------+------+-------------+ |
|
|
|
|
|
| | | |
|
|
|
|
|
v v v v
|
|
|
|
|
/----------\ /----------\ /----------\ /----------\
|
|
|
|
|
| | | | | | | |
|
|
|
|
|
| OSD #1 | | OSD #2 | | OSD #3 | | OSD #4 |
|
|
|
|
|
| | | | | | | |
|
|
|
|
|
\----------/ \----------/ \----------/ \----------/
|
|
|
|
|
|
|
|
|
|
With a copy of the cluster map and the CRUSH algorithm, the client can compute
|
2012-12-04 04:48:02 +00:00
|
|
|
|
exactly which OSD to use when reading or writing a particular object.
|
2012-09-18 18:08:23 +00:00
|
|
|
|
|
2013-06-14 23:52:25 +00:00
|
|
|
|
.. index:: architecture; calculating PG IDs
|
2013-05-15 00:05:43 +00:00
|
|
|
|
|
|
|
|
|
Calculating PG IDs
|
|
|
|
|
~~~~~~~~~~~~~~~~~~
|
2013-04-23 04:02:45 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
When a Ceph Client binds to a Ceph Monitor, it retrieves the latest copy of the
|
|
|
|
|
`Cluster Map`_. With the cluster map, the client knows about all of the monitors,
|
|
|
|
|
OSDs, and metadata servers in the cluster. **However, it doesn't know anything
|
|
|
|
|
about object locations.**
|
2013-04-23 04:02:45 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
.. epigraph::
|
2013-04-23 04:02:45 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
Object locations get computed.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The only input required by the client is the object ID and the pool.
|
|
|
|
|
It's simple: Ceph stores data in named pools (e.g., "liverpool"). When a client
|
|
|
|
|
wants to store a named object (e.g., "john," "paul," "george," "ringo", etc.)
|
|
|
|
|
it calculates a placement group using the object name, a hash code, the
|
2014-05-23 00:37:16 +00:00
|
|
|
|
number of PGs in the pool and the pool name. Ceph clients use the following
|
2013-05-15 00:05:43 +00:00
|
|
|
|
steps to compute PG IDs.
|
|
|
|
|
|
2017-09-12 15:04:14 +00:00
|
|
|
|
#. The client inputs the pool name and the object ID. (e.g., pool = "liverpool"
|
2013-05-15 00:05:43 +00:00
|
|
|
|
and object-id = "john")
|
2014-05-29 00:49:21 +00:00
|
|
|
|
#. Ceph takes the object ID and hashes it.
|
|
|
|
|
#. Ceph calculates the hash modulo the number of PGs. (e.g., ``58``) to get
|
2013-05-15 00:05:43 +00:00
|
|
|
|
a PG ID.
|
2014-05-29 00:49:21 +00:00
|
|
|
|
#. Ceph gets the pool ID given the pool name (e.g., "liverpool" = ``4``)
|
|
|
|
|
#. Ceph prepends the pool ID to the PG ID (e.g., ``4.58``).
|
2013-05-15 00:05:43 +00:00
|
|
|
|
|
|
|
|
|
Computing object locations is much faster than performing object location query
|
|
|
|
|
over a chatty session. The :abbr:`CRUSH (Controlled Replication Under Scalable
|
|
|
|
|
Hashing)` algorithm allows a client to compute where objects *should* be stored,
|
|
|
|
|
and enables the client to contact the primary OSD to store or retrieve the
|
|
|
|
|
objects.
|
|
|
|
|
|
2013-06-14 23:52:25 +00:00
|
|
|
|
.. index:: architecture; PG Peering
|
2013-05-15 00:05:43 +00:00
|
|
|
|
|
|
|
|
|
Peering and Sets
|
|
|
|
|
~~~~~~~~~~~~~~~~
|
|
|
|
|
|
2023-01-07 03:54:45 +00:00
|
|
|
|
In previous sections, we noted that Ceph OSD Daemons check each other's
|
2013-05-15 00:05:43 +00:00
|
|
|
|
heartbeats and report back to the Ceph Monitor. Another thing Ceph OSD daemons
|
|
|
|
|
do is called 'peering', which is the process of bringing all of the OSDs that
|
|
|
|
|
store a Placement Group (PG) into agreement about the state of all of the
|
|
|
|
|
objects (and their metadata) in that PG. In fact, Ceph OSD Daemons `Report
|
|
|
|
|
Peering Failure`_ to the Ceph Monitors. Peering issues usually resolve
|
|
|
|
|
themselves; however, if the problem persists, you may need to refer to the
|
|
|
|
|
`Troubleshooting Peering Failure`_ section.
|
|
|
|
|
|
|
|
|
|
.. Note:: Agreeing on the state does not mean that the PGs have the latest contents.
|
|
|
|
|
|
|
|
|
|
The Ceph Storage Cluster was designed to store at least two copies of an object
|
|
|
|
|
(i.e., ``size = 2``), which is the minimum requirement for data safety. For high
|
|
|
|
|
availability, a Ceph Storage Cluster should store more than two copies of an object
|
|
|
|
|
(e.g., ``size = 3`` and ``min size = 2``) so that it can continue to run in a
|
|
|
|
|
``degraded`` state while maintaining data safety.
|
|
|
|
|
|
|
|
|
|
Referring back to the diagram in `Smart Daemons Enable Hyperscale`_, we do not
|
|
|
|
|
name the Ceph OSD Daemons specifically (e.g., ``osd.0``, ``osd.1``, etc.), but
|
|
|
|
|
rather refer to them as *Primary*, *Secondary*, and so forth. By convention,
|
|
|
|
|
the *Primary* is the first OSD in the *Acting Set*, and is responsible for
|
|
|
|
|
coordinating the peering process for each placement group where it acts as
|
2022-06-26 03:09:51 +00:00
|
|
|
|
the *Primary*, and is the **ONLY** OSD that will accept client-initiated
|
2013-05-15 00:05:43 +00:00
|
|
|
|
writes to objects for a given placement group where it acts as the *Primary*.
|
|
|
|
|
|
|
|
|
|
When a series of OSDs are responsible for a placement group, that series of
|
|
|
|
|
OSDs, we refer to them as an *Acting Set*. An *Acting Set* may refer to the Ceph
|
|
|
|
|
OSD Daemons that are currently responsible for the placement group, or the Ceph
|
|
|
|
|
OSD Daemons that were responsible for a particular placement group as of some
|
|
|
|
|
epoch.
|
|
|
|
|
|
|
|
|
|
The Ceph OSD daemons that are part of an *Acting Set* may not always be ``up``.
|
|
|
|
|
When an OSD in the *Acting Set* is ``up``, it is part of the *Up Set*. The *Up
|
|
|
|
|
Set* is an important distinction, because Ceph can remap PGs to other Ceph OSD
|
|
|
|
|
Daemons when an OSD fails.
|
|
|
|
|
|
|
|
|
|
.. note:: In an *Acting Set* for a PG containing ``osd.25``, ``osd.32`` and
|
|
|
|
|
``osd.61``, the first OSD, ``osd.25``, is the *Primary*. If that OSD fails,
|
|
|
|
|
the Secondary, ``osd.32``, becomes the *Primary*, and ``osd.25`` will be
|
|
|
|
|
removed from the *Up Set*.
|
|
|
|
|
|
|
|
|
|
|
2013-06-14 23:52:25 +00:00
|
|
|
|
.. index:: architecture; Rebalancing
|
2013-05-15 00:05:43 +00:00
|
|
|
|
|
|
|
|
|
Rebalancing
|
|
|
|
|
~~~~~~~~~~~
|
|
|
|
|
|
|
|
|
|
When you add a Ceph OSD Daemon to a Ceph Storage Cluster, the cluster map gets
|
|
|
|
|
updated with the new OSD. Referring back to `Calculating PG IDs`_, this changes
|
|
|
|
|
the cluster map. Consequently, it changes object placement, because it changes
|
|
|
|
|
an input for the calculations. The following diagram depicts the rebalancing
|
|
|
|
|
process (albeit rather crudely, since it is substantially less impactful with
|
|
|
|
|
large clusters) where some, but not all of the PGs migrate from existing OSDs
|
|
|
|
|
(OSD 1, and OSD 2) to the new OSD (OSD 3). Even when rebalancing, CRUSH is
|
|
|
|
|
stable. Many of the placement groups remain in their original configuration,
|
|
|
|
|
and each OSD gets some added capacity, so there are no load spikes on the
|
|
|
|
|
new OSD after rebalancing is complete.
|
2012-09-18 18:08:23 +00:00
|
|
|
|
|
2012-12-04 18:58:02 +00:00
|
|
|
|
|
2020-04-09 13:25:39 +00:00
|
|
|
|
.. ditaa::
|
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
+--------+ +--------+
|
|
|
|
|
Before | OSD 1 | | OSD 2 |
|
|
|
|
|
+--------+ +--------+
|
|
|
|
|
| PG #1 | | PG #6 |
|
|
|
|
|
| PG #2 | | PG #7 |
|
|
|
|
|
| PG #3 | | PG #8 |
|
|
|
|
|
| PG #4 | | PG #9 |
|
|
|
|
|
| PG #5 | | PG #10 |
|
|
|
|
|
+--------+ +--------+
|
|
|
|
|
|
|
|
|
|
+--------+ +--------+ +--------+
|
|
|
|
|
After | OSD 1 | | OSD 2 | | OSD 3 |
|
|
|
|
|
+--------+ +--------+ +--------+
|
|
|
|
|
| PG #1 | | PG #7 | | PG #3 |
|
|
|
|
|
| PG #2 | | PG #8 | | PG #6 |
|
|
|
|
|
| PG #4 | | PG #10 | | PG #9 |
|
|
|
|
|
| PG #5 | | | | |
|
|
|
|
|
| | | | | |
|
|
|
|
|
+--------+ +--------+ +--------+
|
|
|
|
|
|
|
|
|
|
|
2013-06-14 23:52:25 +00:00
|
|
|
|
.. index:: architecture; Data Scrubbing
|
2012-09-18 18:08:23 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
Data Consistency
|
|
|
|
|
~~~~~~~~~~~~~~~~
|
|
|
|
|
|
2020-11-19 06:57:54 +00:00
|
|
|
|
As part of maintaining data consistency and cleanliness, Ceph OSDs also scrub
|
|
|
|
|
objects within placement groups. That is, Ceph OSDs compare object metadata in
|
|
|
|
|
one placement group with its replicas in placement groups stored in other
|
|
|
|
|
OSDs. Scrubbing (usually performed daily) catches OSD bugs or filesystem
|
|
|
|
|
errors, often as a result of hardware issues. OSDs also perform deeper
|
|
|
|
|
scrubbing by comparing data in objects bit-for-bit. Deep scrubbing (by default
|
|
|
|
|
performed weekly) finds bad blocks on a drive that weren't apparent in a light
|
|
|
|
|
scrub.
|
2013-05-15 00:05:43 +00:00
|
|
|
|
|
|
|
|
|
See `Data Scrubbing`_ for details on configuring scrubbing.
|
2012-09-18 18:08:23 +00:00
|
|
|
|
|
|
|
|
|
|
2014-05-06 10:53:03 +00:00
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
.. index:: erasure coding
|
|
|
|
|
|
|
|
|
|
Erasure Coding
|
|
|
|
|
--------------
|
|
|
|
|
|
|
|
|
|
An erasure coded pool stores each object as ``K+M`` chunks. It is divided into
|
|
|
|
|
``K`` data chunks and ``M`` coding chunks. The pool is configured to have a size
|
|
|
|
|
of ``K+M`` so that each chunk is stored in an OSD in the acting set. The rank of
|
|
|
|
|
the chunk is stored as an attribute of the object.
|
|
|
|
|
|
2020-11-19 06:57:54 +00:00
|
|
|
|
For instance an erasure coded pool can be created to use five OSDs (``K+M = 5``) and
|
2014-05-06 10:53:03 +00:00
|
|
|
|
sustain the loss of two of them (``M = 2``).
|
|
|
|
|
|
|
|
|
|
Reading and Writing Encoded Chunks
|
|
|
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
|
|
|
|
|
|
When the object **NYAN** containing ``ABCDEFGHI`` is written to the pool, the erasure
|
|
|
|
|
encoding function splits the content into three data chunks simply by dividing
|
|
|
|
|
the content in three: the first contains ``ABC``, the second ``DEF`` and the
|
|
|
|
|
last ``GHI``. The content will be padded if the content length is not a multiple
|
|
|
|
|
of ``K``. The function also creates two coding chunks: the fourth with ``YXY``
|
2019-01-01 22:48:03 +00:00
|
|
|
|
and the fifth with ``QGC``. Each chunk is stored in an OSD in the acting set.
|
2014-05-06 10:53:03 +00:00
|
|
|
|
The chunks are stored in objects that have the same name (**NYAN**) but reside
|
|
|
|
|
on different OSDs. The order in which the chunks were created must be preserved
|
|
|
|
|
and is stored as an attribute of the object (``shard_t``), in addition to its
|
|
|
|
|
name. Chunk 1 contains ``ABC`` and is stored on **OSD5** while chunk 4 contains
|
|
|
|
|
``YXY`` and is stored on **OSD3**.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
.. ditaa::
|
2020-04-09 13:25:39 +00:00
|
|
|
|
|
2014-05-06 10:53:03 +00:00
|
|
|
|
+-------------------+
|
|
|
|
|
name | NYAN |
|
|
|
|
|
+-------------------+
|
|
|
|
|
content | ABCDEFGHI |
|
|
|
|
|
+--------+----------+
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
v
|
|
|
|
|
+------+------+
|
|
|
|
|
+---------------+ encode(3,2) +-----------+
|
|
|
|
|
| +--+--+---+---+ |
|
|
|
|
|
| | | | |
|
|
|
|
|
| +-------+ | +-----+ |
|
|
|
|
|
| | | | |
|
|
|
|
|
+--v---+ +--v---+ +--v---+ +--v---+ +--v---+
|
|
|
|
|
name | NYAN | | NYAN | | NYAN | | NYAN | | NYAN |
|
|
|
|
|
+------+ +------+ +------+ +------+ +------+
|
|
|
|
|
shard | 1 | | 2 | | 3 | | 4 | | 5 |
|
|
|
|
|
+------+ +------+ +------+ +------+ +------+
|
|
|
|
|
content | ABC | | DEF | | GHI | | YXY | | QGC |
|
|
|
|
|
+--+---+ +--+---+ +--+---+ +--+---+ +--+---+
|
|
|
|
|
| | | | |
|
|
|
|
|
| | v | |
|
|
|
|
|
| | +--+---+ | |
|
|
|
|
|
| | | OSD1 | | |
|
|
|
|
|
| | +------+ | |
|
|
|
|
|
| | | |
|
|
|
|
|
| | +------+ | |
|
|
|
|
|
| +------>| OSD2 | | |
|
|
|
|
|
| +------+ | |
|
|
|
|
|
| | |
|
|
|
|
|
| +------+ | |
|
|
|
|
|
| | OSD3 |<----+ |
|
|
|
|
|
| +------+ |
|
|
|
|
|
| |
|
|
|
|
|
| +------+ |
|
|
|
|
|
| | OSD4 |<--------------+
|
|
|
|
|
| +------+
|
|
|
|
|
|
|
|
|
|
|
| +------+
|
|
|
|
|
+----------------->| OSD5 |
|
|
|
|
|
+------+
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
When the object **NYAN** is read from the erasure coded pool, the decoding
|
|
|
|
|
function reads three chunks: chunk 1 containing ``ABC``, chunk 3 containing
|
|
|
|
|
``GHI`` and chunk 4 containing ``YXY``. Then, it rebuilds the original content
|
|
|
|
|
of the object ``ABCDEFGHI``. The decoding function is informed that the chunks 2
|
|
|
|
|
and 5 are missing (they are called 'erasures'). The chunk 5 could not be read
|
|
|
|
|
because the **OSD4** is out. The decoding function can be called as soon as
|
|
|
|
|
three chunks are read: **OSD2** was the slowest and its chunk was not taken into
|
|
|
|
|
account.
|
|
|
|
|
|
|
|
|
|
.. ditaa::
|
2020-04-09 13:25:39 +00:00
|
|
|
|
|
2014-05-06 10:53:03 +00:00
|
|
|
|
+-------------------+
|
|
|
|
|
name | NYAN |
|
|
|
|
|
+-------------------+
|
|
|
|
|
content | ABCDEFGHI |
|
|
|
|
|
+---------+---------+
|
|
|
|
|
^
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
+-------+-------+
|
|
|
|
|
| decode(3,2) |
|
|
|
|
|
+------------->+ erasures 2,5 +<-+
|
|
|
|
|
| | | |
|
|
|
|
|
| +-------+-------+ |
|
|
|
|
|
| ^ |
|
|
|
|
|
| | |
|
|
|
|
|
| | |
|
|
|
|
|
+--+---+ +------+ +---+--+ +---+--+
|
|
|
|
|
name | NYAN | | NYAN | | NYAN | | NYAN |
|
|
|
|
|
+------+ +------+ +------+ +------+
|
|
|
|
|
shard | 1 | | 2 | | 3 | | 4 |
|
|
|
|
|
+------+ +------+ +------+ +------+
|
|
|
|
|
content | ABC | | DEF | | GHI | | YXY |
|
|
|
|
|
+--+---+ +--+---+ +--+---+ +--+---+
|
|
|
|
|
^ . ^ ^
|
|
|
|
|
| TOO . | |
|
|
|
|
|
| SLOW . +--+---+ |
|
|
|
|
|
| ^ | OSD1 | |
|
|
|
|
|
| | +------+ |
|
|
|
|
|
| | |
|
|
|
|
|
| | +------+ |
|
|
|
|
|
| +-------| OSD2 | |
|
|
|
|
|
| +------+ |
|
|
|
|
|
| |
|
|
|
|
|
| +------+ |
|
|
|
|
|
| | OSD3 |------+
|
|
|
|
|
| +------+
|
|
|
|
|
|
|
|
|
|
|
| +------+
|
|
|
|
|
| | OSD4 | OUT
|
|
|
|
|
| +------+
|
|
|
|
|
|
|
|
|
|
|
| +------+
|
|
|
|
|
+------------------| OSD5 |
|
|
|
|
|
+------+
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Interrupted Full Writes
|
|
|
|
|
~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
|
|
|
|
|
|
In an erasure coded pool, the primary OSD in the up set receives all write
|
|
|
|
|
operations. It is responsible for encoding the payload into ``K+M`` chunks and
|
|
|
|
|
sends them to the other OSDs. It is also responsible for maintaining an
|
|
|
|
|
authoritative version of the placement group logs.
|
|
|
|
|
|
|
|
|
|
In the following diagram, an erasure coded placement group has been created with
|
2020-02-12 12:13:51 +00:00
|
|
|
|
``K = 2, M = 1`` and is supported by three OSDs, two for ``K`` and one for
|
2014-05-06 10:53:03 +00:00
|
|
|
|
``M``. The acting set of the placement group is made of **OSD 1**, **OSD 2** and
|
|
|
|
|
**OSD 3**. An object has been encoded and stored in the OSDs : the chunk
|
|
|
|
|
``D1v1`` (i.e. Data chunk number 1, version 1) is on **OSD 1**, ``D2v1`` on
|
|
|
|
|
**OSD 2** and ``C1v1`` (i.e. Coding chunk number 1, version 1) on **OSD 3**. The
|
|
|
|
|
placement group logs on each OSD are identical (i.e. ``1,1`` for epoch 1,
|
|
|
|
|
version 1).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
.. ditaa::
|
2020-04-09 13:25:39 +00:00
|
|
|
|
|
2014-05-06 10:53:03 +00:00
|
|
|
|
Primary OSD
|
|
|
|
|
|
|
|
|
|
+-------------+
|
|
|
|
|
| OSD 1 | +-------------+
|
|
|
|
|
| log | Write Full | |
|
|
|
|
|
| +----+ |<------------+ Ceph Client |
|
|
|
|
|
| |D1v1| 1,1 | v1 | |
|
|
|
|
|
| +----+ | +-------------+
|
|
|
|
|
+------+------+
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| +-------------+
|
|
|
|
|
| | OSD 2 |
|
|
|
|
|
| | log |
|
|
|
|
|
+--------->+ +----+ |
|
|
|
|
|
| | |D2v1| 1,1 |
|
|
|
|
|
| | +----+ |
|
|
|
|
|
| +-------------+
|
|
|
|
|
|
|
|
|
|
|
| +-------------+
|
|
|
|
|
| | OSD 3 |
|
|
|
|
|
| | log |
|
|
|
|
|
+--------->| +----+ |
|
|
|
|
|
| |C1v1| 1,1 |
|
|
|
|
|
| +----+ |
|
|
|
|
|
+-------------+
|
|
|
|
|
|
|
|
|
|
**OSD 1** is the primary and receives a **WRITE FULL** from a client, which
|
|
|
|
|
means the payload is to replace the object entirely instead of overwriting a
|
|
|
|
|
portion of it. Version 2 (v2) of the object is created to override version 1
|
|
|
|
|
(v1). **OSD 1** encodes the payload into three chunks: ``D1v2`` (i.e. Data
|
|
|
|
|
chunk number 1 version 2) will be on **OSD 1**, ``D2v2`` on **OSD 2** and
|
|
|
|
|
``C1v2`` (i.e. Coding chunk number 1 version 2) on **OSD 3**. Each chunk is sent
|
|
|
|
|
to the target OSD, including the primary OSD which is responsible for storing
|
|
|
|
|
chunks in addition to handling write operations and maintaining an authoritative
|
|
|
|
|
version of the placement group logs. When an OSD receives the message
|
|
|
|
|
instructing it to write the chunk, it also creates a new entry in the placement
|
|
|
|
|
group logs to reflect the change. For instance, as soon as **OSD 3** stores
|
|
|
|
|
``C1v2``, it adds the entry ``1,2`` ( i.e. epoch 1, version 2 ) to its logs.
|
|
|
|
|
Because the OSDs work asynchronously, some chunks may still be in flight ( such
|
2020-11-19 06:57:54 +00:00
|
|
|
|
as ``D2v2`` ) while others are acknowledged and persisted to storage drives
|
|
|
|
|
(such as ``C1v1`` and ``D1v1``).
|
2014-05-06 10:53:03 +00:00
|
|
|
|
|
|
|
|
|
.. ditaa::
|
|
|
|
|
|
|
|
|
|
Primary OSD
|
|
|
|
|
|
|
|
|
|
+-------------+
|
|
|
|
|
| OSD 1 |
|
|
|
|
|
| log |
|
|
|
|
|
| +----+ | +-------------+
|
|
|
|
|
| |D1v2| 1,2 | Write Full | |
|
|
|
|
|
| +----+ +<------------+ Ceph Client |
|
|
|
|
|
| | v2 | |
|
|
|
|
|
| +----+ | +-------------+
|
|
|
|
|
| |D1v1| 1,1 |
|
|
|
|
|
| +----+ |
|
|
|
|
|
+------+------+
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| +------+------+
|
|
|
|
|
| | OSD 2 |
|
|
|
|
|
| +------+ | log |
|
2014-06-09 09:27:59 +00:00
|
|
|
|
+->| D2v2 | | +----+ |
|
2014-05-06 10:53:03 +00:00
|
|
|
|
| +------+ | |D2v1| 1,1 |
|
|
|
|
|
| | +----+ |
|
|
|
|
|
| +-------------+
|
|
|
|
|
|
|
|
|
|
|
| +-------------+
|
|
|
|
|
| | OSD 3 |
|
|
|
|
|
| | log |
|
|
|
|
|
| | +----+ |
|
|
|
|
|
| | |C1v2| 1,2 |
|
|
|
|
|
+---------->+ +----+ |
|
|
|
|
|
| |
|
|
|
|
|
| +----+ |
|
|
|
|
|
| |C1v1| 1,1 |
|
|
|
|
|
| +----+ |
|
|
|
|
|
+-------------+
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
If all goes well, the chunks are acknowledged on each OSD in the acting set and
|
|
|
|
|
the logs' ``last_complete`` pointer can move from ``1,1`` to ``1,2``.
|
|
|
|
|
|
|
|
|
|
.. ditaa::
|
|
|
|
|
|
|
|
|
|
Primary OSD
|
|
|
|
|
|
|
|
|
|
+-------------+
|
|
|
|
|
| OSD 1 |
|
|
|
|
|
| log |
|
|
|
|
|
| +----+ | +-------------+
|
|
|
|
|
| |D1v2| 1,2 | Write Full | |
|
|
|
|
|
| +----+ +<------------+ Ceph Client |
|
|
|
|
|
| | v2 | |
|
|
|
|
|
| +----+ | +-------------+
|
|
|
|
|
| |D1v1| 1,1 |
|
|
|
|
|
| +----+ |
|
|
|
|
|
+------+------+
|
|
|
|
|
|
|
|
|
|
|
| +-------------+
|
|
|
|
|
| | OSD 2 |
|
|
|
|
|
| | log |
|
|
|
|
|
| | +----+ |
|
|
|
|
|
| | |D2v2| 1,2 |
|
|
|
|
|
+---------->+ +----+ |
|
|
|
|
|
| | |
|
|
|
|
|
| | +----+ |
|
|
|
|
|
| | |D2v1| 1,1 |
|
|
|
|
|
| | +----+ |
|
|
|
|
|
| +-------------+
|
|
|
|
|
|
|
|
|
|
|
| +-------------+
|
|
|
|
|
| | OSD 3 |
|
|
|
|
|
| | log |
|
|
|
|
|
| | +----+ |
|
|
|
|
|
| | |C1v2| 1,2 |
|
|
|
|
|
+---------->+ +----+ |
|
|
|
|
|
| |
|
|
|
|
|
| +----+ |
|
|
|
|
|
| |C1v1| 1,1 |
|
|
|
|
|
| +----+ |
|
|
|
|
|
+-------------+
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Finally, the files used to store the chunks of the previous version of the
|
|
|
|
|
object can be removed: ``D1v1`` on **OSD 1**, ``D2v1`` on **OSD 2** and ``C1v1``
|
|
|
|
|
on **OSD 3**.
|
|
|
|
|
|
|
|
|
|
.. ditaa::
|
2020-04-09 13:25:39 +00:00
|
|
|
|
|
2014-05-06 10:53:03 +00:00
|
|
|
|
Primary OSD
|
|
|
|
|
|
|
|
|
|
+-------------+
|
2016-01-20 18:38:34 +00:00
|
|
|
|
| OSD 1 |
|
|
|
|
|
| log |
|
|
|
|
|
| +----+ |
|
|
|
|
|
| |D1v2| 1,2 |
|
|
|
|
|
| +----+ |
|
2014-05-06 10:53:03 +00:00
|
|
|
|
+------+------+
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| +-------------+
|
|
|
|
|
| | OSD 2 |
|
|
|
|
|
| | log |
|
|
|
|
|
+--------->+ +----+ |
|
2016-01-20 18:38:34 +00:00
|
|
|
|
| | |D2v2| 1,2 |
|
2014-05-06 10:53:03 +00:00
|
|
|
|
| | +----+ |
|
|
|
|
|
| +-------------+
|
|
|
|
|
|
|
|
|
|
|
| +-------------+
|
|
|
|
|
| | OSD 3 |
|
|
|
|
|
| | log |
|
|
|
|
|
+--------->| +----+ |
|
2016-01-20 18:38:34 +00:00
|
|
|
|
| |C1v2| 1,2 |
|
2014-05-06 10:53:03 +00:00
|
|
|
|
| +----+ |
|
|
|
|
|
+-------------+
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
But accidents happen. If **OSD 1** goes down while ``D2v2`` is still in flight,
|
|
|
|
|
the object's version 2 is partially written: **OSD 3** has one chunk but that is
|
2014-11-16 20:49:03 +00:00
|
|
|
|
not enough to recover. It lost two chunks: ``D1v2`` and ``D2v2`` and the
|
2014-11-16 20:51:16 +00:00
|
|
|
|
erasure coding parameters ``K = 2``, ``M = 1`` require that at least two chunks are
|
2014-05-06 10:53:03 +00:00
|
|
|
|
available to rebuild the third. **OSD 4** becomes the new primary and finds that
|
|
|
|
|
the ``last_complete`` log entry (i.e., all objects before this entry were known
|
|
|
|
|
to be available on all OSDs in the previous acting set ) is ``1,1`` and that
|
|
|
|
|
will be the head of the new authoritative log.
|
|
|
|
|
|
|
|
|
|
.. ditaa::
|
2020-04-09 13:25:39 +00:00
|
|
|
|
|
2014-05-06 10:53:03 +00:00
|
|
|
|
+-------------+
|
|
|
|
|
| OSD 1 |
|
|
|
|
|
| (down) |
|
|
|
|
|
| c333 |
|
|
|
|
|
+------+------+
|
|
|
|
|
|
|
|
|
|
|
| +-------------+
|
|
|
|
|
| | OSD 2 |
|
|
|
|
|
| | log |
|
|
|
|
|
| | +----+ |
|
2014-11-16 20:52:36 +00:00
|
|
|
|
+---------->+ |D2v1| 1,1 |
|
2014-05-06 10:53:03 +00:00
|
|
|
|
| | +----+ |
|
|
|
|
|
| | |
|
|
|
|
|
| +-------------+
|
|
|
|
|
|
|
|
|
|
|
| +-------------+
|
|
|
|
|
| | OSD 3 |
|
|
|
|
|
| | log |
|
|
|
|
|
| | +----+ |
|
|
|
|
|
| | |C1v2| 1,2 |
|
|
|
|
|
+---------->+ +----+ |
|
|
|
|
|
| |
|
|
|
|
|
| +----+ |
|
|
|
|
|
| |C1v1| 1,1 |
|
|
|
|
|
| +----+ |
|
|
|
|
|
+-------------+
|
|
|
|
|
Primary OSD
|
|
|
|
|
+-------------+
|
|
|
|
|
| OSD 4 |
|
|
|
|
|
| log |
|
|
|
|
|
| |
|
|
|
|
|
| 1,1 |
|
|
|
|
|
| |
|
|
|
|
|
+------+------+
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The log entry 1,2 found on **OSD 3** is divergent from the new authoritative log
|
|
|
|
|
provided by **OSD 4**: it is discarded and the file containing the ``C1v2``
|
|
|
|
|
chunk is removed. The ``D1v1`` chunk is rebuilt with the ``decode`` function of
|
|
|
|
|
the erasure coding library during scrubbing and stored on the new primary
|
|
|
|
|
**OSD 4**.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
.. ditaa::
|
2020-04-09 13:25:39 +00:00
|
|
|
|
|
2014-05-06 10:53:03 +00:00
|
|
|
|
Primary OSD
|
|
|
|
|
|
|
|
|
|
+-------------+
|
|
|
|
|
| OSD 4 |
|
|
|
|
|
| log |
|
|
|
|
|
| +----+ |
|
|
|
|
|
| |D1v1| 1,1 |
|
|
|
|
|
| +----+ |
|
|
|
|
|
+------+------+
|
|
|
|
|
^
|
|
|
|
|
|
|
|
|
|
|
| +-------------+
|
|
|
|
|
| | OSD 2 |
|
|
|
|
|
| | log |
|
|
|
|
|
+----------+ +----+ |
|
|
|
|
|
| | |D2v1| 1,1 |
|
|
|
|
|
| | +----+ |
|
|
|
|
|
| +-------------+
|
|
|
|
|
|
|
|
|
|
|
| +-------------+
|
|
|
|
|
| | OSD 3 |
|
|
|
|
|
| | log |
|
|
|
|
|
+----------| +----+ |
|
|
|
|
|
| |C1v1| 1,1 |
|
|
|
|
|
| +----+ |
|
|
|
|
|
+-------------+
|
|
|
|
|
|
|
|
|
|
+-------------+
|
|
|
|
|
| OSD 1 |
|
|
|
|
|
| (down) |
|
|
|
|
|
| c333 |
|
|
|
|
|
+-------------+
|
|
|
|
|
|
|
|
|
|
See `Erasure Code Notes`_ for additional details.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Cache Tiering
|
|
|
|
|
-------------
|
|
|
|
|
|
2023-05-22 01:35:26 +00:00
|
|
|
|
.. note:: Cache tiering is deprecated in Reef.
|
|
|
|
|
|
2014-05-06 10:53:03 +00:00
|
|
|
|
A cache tier provides Ceph Clients with better I/O performance for a subset of
|
|
|
|
|
the data stored in a backing storage tier. Cache tiering involves creating a
|
|
|
|
|
pool of relatively fast/expensive storage devices (e.g., solid state drives)
|
|
|
|
|
configured to act as a cache tier, and a backing pool of either erasure-coded
|
|
|
|
|
or relatively slower/cheaper devices configured to act as an economical storage
|
|
|
|
|
tier. The Ceph objecter handles where to place the objects and the tiering
|
|
|
|
|
agent determines when to flush objects from the cache to the backing storage
|
|
|
|
|
tier. So the cache tier and the backing storage tier are completely transparent
|
|
|
|
|
to Ceph clients.
|
|
|
|
|
|
|
|
|
|
|
2020-04-09 13:25:39 +00:00
|
|
|
|
.. ditaa::
|
|
|
|
|
|
2014-05-06 10:53:03 +00:00
|
|
|
|
+-------------+
|
|
|
|
|
| Ceph Client |
|
|
|
|
|
+------+------+
|
|
|
|
|
^
|
|
|
|
|
Tiering is |
|
|
|
|
|
Transparent | Faster I/O
|
|
|
|
|
to Ceph | +---------------+
|
|
|
|
|
Client Ops | | |
|
|
|
|
|
| +----->+ Cache Tier |
|
|
|
|
|
| | | |
|
|
|
|
|
| | +-----+---+-----+
|
|
|
|
|
| | | ^
|
|
|
|
|
v v | | Active Data in Cache Tier
|
|
|
|
|
+------+----+--+ | |
|
|
|
|
|
| Objecter | | |
|
|
|
|
|
+-----------+--+ | |
|
|
|
|
|
^ | | Inactive Data in Storage Tier
|
|
|
|
|
| v |
|
|
|
|
|
| +-----+---+-----+
|
|
|
|
|
| | |
|
|
|
|
|
+----->| Storage Tier |
|
|
|
|
|
| |
|
|
|
|
|
+---------------+
|
|
|
|
|
Slower I/O
|
|
|
|
|
|
2020-11-19 06:57:54 +00:00
|
|
|
|
See `Cache Tiering`_ for additional details. Note that Cache Tiers can be
|
|
|
|
|
tricky and their use is now discouraged.
|
2014-05-06 10:53:03 +00:00
|
|
|
|
|
|
|
|
|
|
2013-06-14 23:52:25 +00:00
|
|
|
|
.. index:: Extensibility, Ceph Classes
|
2013-04-23 04:02:45 +00:00
|
|
|
|
|
|
|
|
|
Extending Ceph
|
|
|
|
|
--------------
|
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
You can extend Ceph by creating shared object classes called 'Ceph Classes'.
|
|
|
|
|
Ceph loads ``.so`` classes stored in the ``osd class dir`` directory dynamically
|
|
|
|
|
(i.e., ``$libdir/rados-classes`` by default). When you implement a class, you
|
|
|
|
|
can create new object methods that have the ability to call the native methods
|
|
|
|
|
in the Ceph Object Store, or other class methods you incorporate via libraries
|
|
|
|
|
or create yourself.
|
|
|
|
|
|
|
|
|
|
On writes, Ceph Classes can call native or class methods, perform any series of
|
|
|
|
|
operations on the inbound data and generate a resulting write transaction that
|
|
|
|
|
Ceph will apply atomically.
|
|
|
|
|
|
|
|
|
|
On reads, Ceph Classes can call native or class methods, perform any series of
|
|
|
|
|
operations on the outbound data and return the data to the client.
|
|
|
|
|
|
|
|
|
|
.. topic:: Ceph Class Example
|
|
|
|
|
|
|
|
|
|
A Ceph class for a content management system that presents pictures of a
|
|
|
|
|
particular size and aspect ratio could take an inbound bitmap image, crop it
|
|
|
|
|
to a particular aspect ratio, resize it and embed an invisible copyright or
|
|
|
|
|
watermark to help protect the intellectual property; then, save the
|
|
|
|
|
resulting bitmap image to the object store.
|
|
|
|
|
|
|
|
|
|
See ``src/objclass/objclass.h``, ``src/fooclass.cc`` and ``src/barclass`` for
|
|
|
|
|
exemplary implementations.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Summary
|
|
|
|
|
-------
|
|
|
|
|
|
|
|
|
|
Ceph Storage Clusters are dynamic--like a living organism. Whereas, many storage
|
|
|
|
|
appliances do not fully utilize the CPU and RAM of a typical commodity server,
|
|
|
|
|
Ceph does. From heartbeats, to peering, to rebalancing the cluster or
|
|
|
|
|
recovering from faults, Ceph offloads work from clients (and from a centralized
|
|
|
|
|
gateway which doesn't exist in the Ceph architecture) and uses the computing
|
|
|
|
|
power of the OSDs to perform the work. When referring to `Hardware
|
|
|
|
|
Recommendations`_ and the `Network Config Reference`_, be cognizant of the
|
|
|
|
|
foregoing concepts to understand how Ceph utilizes computing resources.
|
|
|
|
|
|
2013-06-14 23:52:25 +00:00
|
|
|
|
.. index:: Ceph Protocol, librados
|
2013-05-15 00:05:43 +00:00
|
|
|
|
|
|
|
|
|
Ceph Protocol
|
|
|
|
|
=============
|
|
|
|
|
|
|
|
|
|
Ceph Clients use the native protocol for interacting with the Ceph Storage
|
|
|
|
|
Cluster. Ceph packages this functionality into the ``librados`` library so that
|
|
|
|
|
you can create your own custom Ceph Clients. The following diagram depicts the
|
|
|
|
|
basic architecture.
|
|
|
|
|
|
2020-04-09 13:25:39 +00:00
|
|
|
|
.. ditaa::
|
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
+---------------------------------+
|
|
|
|
|
| Ceph Storage Cluster Protocol |
|
|
|
|
|
| (librados) |
|
|
|
|
|
+---------------------------------+
|
|
|
|
|
+---------------+ +---------------+
|
|
|
|
|
| OSDs | | Monitors |
|
|
|
|
|
+---------------+ +---------------+
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Native Protocol and ``librados``
|
|
|
|
|
--------------------------------
|
|
|
|
|
|
|
|
|
|
Modern applications need a simple object storage interface with asynchronous
|
|
|
|
|
communication capability. The Ceph Storage Cluster provides a simple object
|
|
|
|
|
storage interface with asynchronous communication capability. The interface
|
|
|
|
|
provides direct, parallel access to objects throughout the cluster.
|
|
|
|
|
|
2013-04-23 04:02:45 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
- Pool Operations
|
|
|
|
|
- Snapshots and Copy-on-write Cloning
|
|
|
|
|
- Read/Write Objects
|
|
|
|
|
- Create or Remove
|
|
|
|
|
- Entire Object or Byte Range
|
|
|
|
|
- Append or Truncate
|
|
|
|
|
- Create/Set/Get/Remove XATTRs
|
|
|
|
|
- Create/Set/Get/Remove Key/Value Pairs
|
|
|
|
|
- Compound operations and dual-ack semantics
|
|
|
|
|
- Object Classes
|
2013-04-23 04:02:45 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
|
2013-06-14 23:52:25 +00:00
|
|
|
|
.. index:: architecture; watch/notify
|
2013-05-15 00:05:43 +00:00
|
|
|
|
|
|
|
|
|
Object Watch/Notify
|
|
|
|
|
-------------------
|
|
|
|
|
|
|
|
|
|
A client can register a persistent interest with an object and keep a session to
|
2015-06-08 18:51:58 +00:00
|
|
|
|
the primary OSD open. The client can send a notification message and a payload to
|
2013-05-15 00:05:43 +00:00
|
|
|
|
all watchers and receive notification when the watchers receive the
|
2015-06-08 18:51:58 +00:00
|
|
|
|
notification. This enables a client to use any object as a
|
2013-05-15 00:05:43 +00:00
|
|
|
|
synchronization/communication channel.
|
|
|
|
|
|
|
|
|
|
|
2020-04-09 13:25:39 +00:00
|
|
|
|
.. ditaa::
|
|
|
|
|
|
|
|
|
|
+----------+ +----------+ +----------+ +---------------+
|
2013-05-15 00:05:43 +00:00
|
|
|
|
| Client 1 | | Client 2 | | Client 3 | | OSD:Object ID |
|
|
|
|
|
+----------+ +----------+ +----------+ +---------------+
|
|
|
|
|
| | | |
|
|
|
|
|
| | | |
|
|
|
|
|
| | Watch Object | |
|
|
|
|
|
|--------------------------------------------------->|
|
|
|
|
|
| | | |
|
|
|
|
|
|<---------------------------------------------------|
|
|
|
|
|
| | Ack/Commit | |
|
|
|
|
|
| | | |
|
|
|
|
|
| | Watch Object | |
|
|
|
|
|
| |---------------------------------->|
|
|
|
|
|
| | | |
|
|
|
|
|
| |<----------------------------------|
|
|
|
|
|
| | Ack/Commit | |
|
|
|
|
|
| | | Watch Object |
|
|
|
|
|
| | |----------------->|
|
|
|
|
|
| | | |
|
|
|
|
|
| | |<-----------------|
|
|
|
|
|
| | | Ack/Commit |
|
|
|
|
|
| | Notify | |
|
|
|
|
|
|--------------------------------------------------->|
|
|
|
|
|
| | | |
|
|
|
|
|
|<---------------------------------------------------|
|
|
|
|
|
| | Notify | |
|
|
|
|
|
| | | |
|
|
|
|
|
| |<----------------------------------|
|
|
|
|
|
| | Notify | |
|
|
|
|
|
| | |<-----------------|
|
|
|
|
|
| | | Notify |
|
|
|
|
|
| | Ack | |
|
|
|
|
|
|----------------+---------------------------------->|
|
|
|
|
|
| | | |
|
|
|
|
|
| | Ack | |
|
|
|
|
|
| +---------------------------------->|
|
|
|
|
|
| | | |
|
|
|
|
|
| | | Ack |
|
|
|
|
|
| | |----------------->|
|
|
|
|
|
| | | |
|
|
|
|
|
|<---------------+----------------+------------------|
|
|
|
|
|
| Complete
|
|
|
|
|
|
2013-06-14 23:52:25 +00:00
|
|
|
|
.. index:: architecture; Striping
|
2013-05-15 00:05:43 +00:00
|
|
|
|
|
|
|
|
|
Data Striping
|
|
|
|
|
-------------
|
2012-12-04 04:48:02 +00:00
|
|
|
|
|
|
|
|
|
Storage devices have throughput limitations, which impact performance and
|
|
|
|
|
scalability. So storage systems often support `striping`_--storing sequential
|
2016-03-17 09:34:51 +00:00
|
|
|
|
pieces of information across multiple storage devices--to increase throughput
|
|
|
|
|
and performance. The most common form of data striping comes from `RAID`_.
|
|
|
|
|
The RAID type most similar to Ceph's striping is `RAID 0`_, or a 'striped
|
|
|
|
|
volume'. Ceph's striping offers the throughput of RAID 0 striping, the
|
|
|
|
|
reliability of n-way RAID mirroring and faster recovery.
|
2012-12-04 04:48:02 +00:00
|
|
|
|
|
2019-09-09 19:36:04 +00:00
|
|
|
|
Ceph provides three types of clients: Ceph Block Device, Ceph File System, and
|
2013-05-15 00:05:43 +00:00
|
|
|
|
Ceph Object Storage. A Ceph Client converts its data from the representation
|
|
|
|
|
format it provides to its users (a block device image, RESTful objects, CephFS
|
|
|
|
|
filesystem directories) into objects for storage in the Ceph Storage Cluster.
|
|
|
|
|
|
|
|
|
|
.. tip:: The objects Ceph stores in the Ceph Storage Cluster are not striped.
|
2019-09-09 19:36:04 +00:00
|
|
|
|
Ceph Object Storage, Ceph Block Device, and the Ceph File System stripe their
|
2013-05-15 00:05:43 +00:00
|
|
|
|
data over multiple Ceph Storage Cluster objects. Ceph Clients that write
|
2014-07-09 04:50:54 +00:00
|
|
|
|
directly to the Ceph Storage Cluster via ``librados`` must perform the
|
2013-05-15 00:05:43 +00:00
|
|
|
|
striping (and parallel I/O) for themselves to obtain these benefits.
|
|
|
|
|
|
|
|
|
|
The simplest Ceph striping format involves a stripe count of 1 object. Ceph
|
|
|
|
|
Clients write stripe units to a Ceph Storage Cluster object until the object is
|
|
|
|
|
at its maximum capacity, and then create another object for additional stripes
|
|
|
|
|
of data. The simplest form of striping may be sufficient for small block device
|
|
|
|
|
images, S3 or Swift objects and CephFS files. However, this simple form doesn't
|
|
|
|
|
take maximum advantage of Ceph's ability to distribute data across placement
|
|
|
|
|
groups, and consequently doesn't improve performance very much. The following
|
|
|
|
|
diagram depicts the simplest form of striping:
|
2012-12-04 04:48:02 +00:00
|
|
|
|
|
2020-04-09 13:25:39 +00:00
|
|
|
|
.. ditaa::
|
|
|
|
|
|
2012-12-04 04:48:02 +00:00
|
|
|
|
+---------------+
|
|
|
|
|
| Client Data |
|
|
|
|
|
| Format |
|
|
|
|
|
| cCCC |
|
|
|
|
|
+---------------+
|
|
|
|
|
|
|
|
|
|
|
+--------+-------+
|
|
|
|
|
| |
|
|
|
|
|
v v
|
|
|
|
|
/-----------\ /-----------\
|
|
|
|
|
| Begin cCCC| | Begin cCCC|
|
|
|
|
|
| Object 0 | | Object 1 |
|
|
|
|
|
+-----------+ +-----------+
|
|
|
|
|
| stripe | | stripe |
|
|
|
|
|
| unit 1 | | unit 5 |
|
|
|
|
|
+-----------+ +-----------+
|
|
|
|
|
| stripe | | stripe |
|
|
|
|
|
| unit 2 | | unit 6 |
|
|
|
|
|
+-----------+ +-----------+
|
|
|
|
|
| stripe | | stripe |
|
|
|
|
|
| unit 3 | | unit 7 |
|
|
|
|
|
+-----------+ +-----------+
|
|
|
|
|
| stripe | | stripe |
|
|
|
|
|
| unit 4 | | unit 8 |
|
|
|
|
|
+-----------+ +-----------+
|
|
|
|
|
| End cCCC | | End cCCC |
|
|
|
|
|
| Object 0 | | Object 1 |
|
|
|
|
|
\-----------/ \-----------/
|
|
|
|
|
|
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
If you anticipate large images sizes, large S3 or Swift objects (e.g., video),
|
|
|
|
|
or large CephFS directories, you may see considerable read/write performance
|
|
|
|
|
improvements by striping client data over multiple objects within an object set.
|
2012-12-04 18:58:02 +00:00
|
|
|
|
Significant write performance occurs when the client writes the stripe units to
|
|
|
|
|
their corresponding objects in parallel. Since objects get mapped to different
|
|
|
|
|
placement groups and further mapped to different OSDs, each write occurs in
|
2020-11-19 06:57:54 +00:00
|
|
|
|
parallel at the maximum write speed. A write to a single drive would be limited
|
2012-12-04 18:58:02 +00:00
|
|
|
|
by the head movement (e.g. 6ms per seek) and bandwidth of that one device (e.g.
|
|
|
|
|
100MB/s). By spreading that write over multiple objects (which map to different
|
|
|
|
|
placement groups and OSDs) Ceph can reduce the number of seeks per drive and
|
|
|
|
|
combine the throughput of multiple drives to achieve much faster write (or read)
|
|
|
|
|
speeds.
|
2012-12-04 04:48:02 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
.. note:: Striping is independent of object replicas. Since CRUSH
|
|
|
|
|
replicates objects across OSDs, stripes get replicated automatically.
|
|
|
|
|
|
2012-12-04 04:48:02 +00:00
|
|
|
|
In the following diagram, client data gets striped across an object set
|
|
|
|
|
(``object set 1`` in the following diagram) consisting of 4 objects, where the
|
2012-12-04 18:58:02 +00:00
|
|
|
|
first stripe unit is ``stripe unit 0`` in ``object 0``, and the fourth stripe
|
|
|
|
|
unit is ``stripe unit 3`` in ``object 3``. After writing the fourth stripe, the
|
|
|
|
|
client determines if the object set is full. If the object set is not full, the
|
|
|
|
|
client begins writing a stripe to the first object again (``object 0`` in the
|
|
|
|
|
following diagram). If the object set is full, the client creates a new object
|
|
|
|
|
set (``object set 2`` in the following diagram), and begins writing to the first
|
|
|
|
|
stripe (``stripe unit 16``) in the first object in the new object set (``object
|
|
|
|
|
4`` in the diagram below).
|
2012-12-04 04:48:02 +00:00
|
|
|
|
|
2020-04-09 13:25:39 +00:00
|
|
|
|
.. ditaa::
|
|
|
|
|
|
2012-12-04 18:58:02 +00:00
|
|
|
|
+---------------+
|
|
|
|
|
| Client Data |
|
|
|
|
|
| Format |
|
|
|
|
|
| cCCC |
|
|
|
|
|
+---------------+
|
|
|
|
|
|
|
|
|
|
|
+-----------------+--------+--------+-----------------+
|
|
|
|
|
| | | | +--\
|
|
|
|
|
v v v v |
|
|
|
|
|
/-----------\ /-----------\ /-----------\ /-----------\ |
|
|
|
|
|
| Begin cCCC| | Begin cCCC| | Begin cCCC| | Begin cCCC| |
|
|
|
|
|
| Object 0 | | Object 1 | | Object 2 | | Object 3 | |
|
|
|
|
|
+-----------+ +-----------+ +-----------+ +-----------+ |
|
|
|
|
|
| stripe | | stripe | | stripe | | stripe | |
|
|
|
|
|
| unit 0 | | unit 1 | | unit 2 | | unit 3 | |
|
|
|
|
|
+-----------+ +-----------+ +-----------+ +-----------+ |
|
|
|
|
|
| stripe | | stripe | | stripe | | stripe | +-\
|
|
|
|
|
| unit 4 | | unit 5 | | unit 6 | | unit 7 | | Object
|
|
|
|
|
+-----------+ +-----------+ +-----------+ +-----------+ +- Set
|
|
|
|
|
| stripe | | stripe | | stripe | | stripe | | 1
|
|
|
|
|
| unit 8 | | unit 9 | | unit 10 | | unit 11 | +-/
|
|
|
|
|
+-----------+ +-----------+ +-----------+ +-----------+ |
|
|
|
|
|
| stripe | | stripe | | stripe | | stripe | |
|
|
|
|
|
| unit 12 | | unit 13 | | unit 14 | | unit 15 | |
|
|
|
|
|
+-----------+ +-----------+ +-----------+ +-----------+ |
|
|
|
|
|
| End cCCC | | End cCCC | | End cCCC | | End cCCC | |
|
|
|
|
|
| Object 0 | | Object 1 | | Object 2 | | Object 3 | |
|
|
|
|
|
\-----------/ \-----------/ \-----------/ \-----------/ |
|
|
|
|
|
|
|
|
|
|
|
+--/
|
2012-12-04 04:48:02 +00:00
|
|
|
|
|
2012-12-04 18:58:02 +00:00
|
|
|
|
+--\
|
|
|
|
|
|
|
|
|
|
|
/-----------\ /-----------\ /-----------\ /-----------\ |
|
|
|
|
|
| Begin cCCC| | Begin cCCC| | Begin cCCC| | Begin cCCC| |
|
|
|
|
|
| Object 4 | | Object 5 | | Object 6 | | Object 7 | |
|
|
|
|
|
+-----------+ +-----------+ +-----------+ +-----------+ |
|
|
|
|
|
| stripe | | stripe | | stripe | | stripe | |
|
|
|
|
|
| unit 16 | | unit 17 | | unit 18 | | unit 19 | |
|
|
|
|
|
+-----------+ +-----------+ +-----------+ +-----------+ |
|
|
|
|
|
| stripe | | stripe | | stripe | | stripe | +-\
|
|
|
|
|
| unit 20 | | unit 21 | | unit 22 | | unit 23 | | Object
|
|
|
|
|
+-----------+ +-----------+ +-----------+ +-----------+ +- Set
|
|
|
|
|
| stripe | | stripe | | stripe | | stripe | | 2
|
|
|
|
|
| unit 24 | | unit 25 | | unit 26 | | unit 27 | +-/
|
|
|
|
|
+-----------+ +-----------+ +-----------+ +-----------+ |
|
|
|
|
|
| stripe | | stripe | | stripe | | stripe | |
|
|
|
|
|
| unit 28 | | unit 29 | | unit 30 | | unit 31 | |
|
|
|
|
|
+-----------+ +-----------+ +-----------+ +-----------+ |
|
|
|
|
|
| End cCCC | | End cCCC | | End cCCC | | End cCCC | |
|
|
|
|
|
| Object 4 | | Object 5 | | Object 6 | | Object 7 | |
|
|
|
|
|
\-----------/ \-----------/ \-----------/ \-----------/ |
|
|
|
|
|
|
|
|
|
|
|
+--/
|
2012-12-04 04:48:02 +00:00
|
|
|
|
|
|
|
|
|
Three important variables determine how Ceph stripes data:
|
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
- **Object Size:** Objects in the Ceph Storage Cluster have a maximum
|
2012-12-04 04:48:02 +00:00
|
|
|
|
configurable size (e.g., 2MB, 4MB, etc.). The object size should be large
|
2013-05-15 00:05:43 +00:00
|
|
|
|
enough to accommodate many stripe units, and should be a multiple of
|
2012-12-04 04:48:02 +00:00
|
|
|
|
the stripe unit.
|
|
|
|
|
|
2012-12-04 18:58:02 +00:00
|
|
|
|
- **Stripe Width:** Stripes have a configurable unit size (e.g., 64kb).
|
2013-05-15 00:05:43 +00:00
|
|
|
|
The Ceph Client divides the data it will write to objects into equally
|
2012-12-04 18:58:02 +00:00
|
|
|
|
sized stripe units, except for the last stripe unit. A stripe width,
|
2012-12-04 04:48:02 +00:00
|
|
|
|
should be a fraction of the Object Size so that an object may contain
|
|
|
|
|
many stripe units.
|
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
- **Stripe Count:** The Ceph Client writes a sequence of stripe units
|
2012-12-04 04:48:02 +00:00
|
|
|
|
over a series of objects determined by the stripe count. The series
|
2013-05-15 00:05:43 +00:00
|
|
|
|
of objects is called an object set. After the Ceph Client writes to
|
2012-12-04 04:48:02 +00:00
|
|
|
|
the last object in the object set, it returns to the first object in
|
|
|
|
|
the object set.
|
|
|
|
|
|
|
|
|
|
.. important:: Test the performance of your striping configuration before
|
|
|
|
|
putting your cluster into production. You CANNOT change these striping
|
|
|
|
|
parameters after you stripe the data and write it to objects.
|
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
Once the Ceph Client has striped data to stripe units and mapped the stripe
|
2012-12-04 04:48:02 +00:00
|
|
|
|
units to objects, Ceph's CRUSH algorithm maps the objects to placement groups,
|
2013-05-15 00:05:43 +00:00
|
|
|
|
and the placement groups to Ceph OSD Daemons before the objects are stored as
|
2020-11-19 06:57:54 +00:00
|
|
|
|
files on a storage drive.
|
2012-12-04 04:48:02 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
.. note:: Since a client writes to a single pool, all data striped into objects
|
|
|
|
|
get mapped to placement groups in the same pool. So they use the same CRUSH
|
|
|
|
|
map and the same access controls.
|
2012-12-04 04:48:02 +00:00
|
|
|
|
|
|
|
|
|
|
2013-06-14 23:52:25 +00:00
|
|
|
|
.. index:: architecture; Ceph Clients
|
2012-12-04 04:48:02 +00:00
|
|
|
|
|
2023-09-11 05:18:40 +00:00
|
|
|
|
.. _architecture_ceph_clients:
|
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
Ceph Clients
|
|
|
|
|
============
|
2012-12-04 04:48:02 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
Ceph Clients include a number of service interfaces. These include:
|
2012-12-04 04:48:02 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
- **Block Devices:** The :term:`Ceph Block Device` (a.k.a., RBD) service
|
|
|
|
|
provides resizable, thin-provisioned block devices with snapshotting and
|
|
|
|
|
cloning. Ceph stripes a block device across the cluster for high
|
|
|
|
|
performance. Ceph supports both kernel objects (KO) and a QEMU hypervisor
|
|
|
|
|
that uses ``librbd`` directly--avoiding the kernel object overhead for
|
|
|
|
|
virtualized systems.
|
2012-12-04 04:48:02 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
- **Object Storage:** The :term:`Ceph Object Storage` (a.k.a., RGW) service
|
|
|
|
|
provides RESTful APIs with interfaces that are compatible with Amazon S3
|
|
|
|
|
and OpenStack Swift.
|
|
|
|
|
|
2019-09-09 19:36:04 +00:00
|
|
|
|
- **Filesystem**: The :term:`Ceph File System` (CephFS) service provides
|
2013-05-15 00:05:43 +00:00
|
|
|
|
a POSIX compliant filesystem usable with ``mount`` or as
|
2018-09-18 03:19:18 +00:00
|
|
|
|
a filesystem in user space (FUSE).
|
2012-12-04 04:48:02 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
Ceph can run additional instances of OSDs, MDSs, and monitors for scalability
|
|
|
|
|
and high availability. The following diagram depicts the high-level
|
|
|
|
|
architecture.
|
2012-12-04 04:48:02 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
.. ditaa::
|
2020-04-09 13:25:39 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
+--------------+ +----------------+ +-------------+
|
2018-07-28 12:17:33 +00:00
|
|
|
|
| Block Device | | Object Storage | | CephFS |
|
2013-05-15 00:05:43 +00:00
|
|
|
|
+--------------+ +----------------+ +-------------+
|
2012-09-18 18:08:23 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
+--------------+ +----------------+ +-------------+
|
|
|
|
|
| librbd | | librgw | | libcephfs |
|
|
|
|
|
+--------------+ +----------------+ +-------------+
|
2012-09-18 18:08:23 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
+---------------------------------------------------+
|
|
|
|
|
| Ceph Storage Cluster Protocol (librados) |
|
|
|
|
|
+---------------------------------------------------+
|
2012-09-18 18:08:23 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
+---------------+ +---------------+ +---------------+
|
|
|
|
|
| OSDs | | MDSs | | Monitors |
|
|
|
|
|
+---------------+ +---------------+ +---------------+
|
2012-09-18 18:08:23 +00:00
|
|
|
|
|
|
|
|
|
|
2013-06-14 23:52:25 +00:00
|
|
|
|
.. index:: architecture; Ceph Object Storage
|
2012-09-18 18:08:23 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
Ceph Object Storage
|
|
|
|
|
-------------------
|
2012-09-18 18:08:23 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
The Ceph Object Storage daemon, ``radosgw``, is a FastCGI service that provides
|
|
|
|
|
a RESTful_ HTTP API to store objects and metadata. It layers on top of the Ceph
|
|
|
|
|
Storage Cluster with its own data formats, and maintains its own user database,
|
|
|
|
|
authentication, and access control. The RADOS Gateway uses a unified namespace,
|
|
|
|
|
which means you can use either the OpenStack Swift-compatible API or the Amazon
|
2014-03-08 10:27:15 +00:00
|
|
|
|
S3-compatible API. For example, you can write data using the S3-compatible API
|
2013-05-15 00:05:43 +00:00
|
|
|
|
with one application and then read data using the Swift-compatible API with
|
|
|
|
|
another application.
|
2012-09-18 18:08:23 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
.. topic:: S3/Swift Objects and Store Cluster Objects Compared
|
2011-12-01 23:22:15 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
Ceph's Object Storage uses the term *object* to describe the data it stores.
|
|
|
|
|
S3 and Swift objects are not the same as the objects that Ceph writes to the
|
|
|
|
|
Ceph Storage Cluster. Ceph Object Storage objects are mapped to Ceph Storage
|
|
|
|
|
Cluster objects. The S3 and Swift objects do not necessarily
|
|
|
|
|
correspond in a 1:1 manner with an object stored in the storage cluster. It
|
|
|
|
|
is possible for an S3 or Swift object to map to multiple Ceph objects.
|
2011-12-01 23:22:15 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
See `Ceph Object Storage`_ for details.
|
2011-12-01 23:22:15 +00:00
|
|
|
|
|
|
|
|
|
|
2013-05-16 20:57:23 +00:00
|
|
|
|
.. index:: Ceph Block Device; block device; RBD; Rados Block Device
|
2012-11-05 19:02:55 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
Ceph Block Device
|
|
|
|
|
-----------------
|
2012-11-05 19:02:55 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
A Ceph Block Device stripes a block device image over multiple objects in the
|
|
|
|
|
Ceph Storage Cluster, where each object gets mapped to a placement group and
|
|
|
|
|
distributed, and the placement groups are spread across separate ``ceph-osd``
|
|
|
|
|
daemons throughout the cluster.
|
2012-11-05 19:02:55 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
.. important:: Striping allows RBD block devices to perform better than a single
|
|
|
|
|
server could!
|
2011-12-01 23:22:15 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
Thin-provisioned snapshottable Ceph Block Devices are an attractive option for
|
|
|
|
|
virtualization and cloud computing. In virtual machine scenarios, people
|
|
|
|
|
typically deploy a Ceph Block Device with the ``rbd`` network storage driver in
|
2016-04-16 03:42:19 +00:00
|
|
|
|
QEMU/KVM, where the host machine uses ``librbd`` to provide a block device
|
2013-05-15 00:05:43 +00:00
|
|
|
|
service to the guest. Many cloud computing stacks use ``libvirt`` to integrate
|
2016-04-16 03:42:19 +00:00
|
|
|
|
with hypervisors. You can use thin-provisioned Ceph Block Devices with QEMU and
|
2013-05-15 00:05:43 +00:00
|
|
|
|
``libvirt`` to support OpenStack and CloudStack among other solutions.
|
2011-12-01 23:22:15 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
While we do not provide ``librbd`` support with other hypervisors at this time,
|
|
|
|
|
you may also use Ceph Block Device kernel objects to provide a block device to a
|
|
|
|
|
client. Other virtualization technologies such as Xen can access the Ceph Block
|
|
|
|
|
Device kernel object(s). This is done with the command-line tool ``rbd``.
|
2011-12-01 23:22:15 +00:00
|
|
|
|
|
|
|
|
|
|
2019-09-09 19:36:04 +00:00
|
|
|
|
.. index:: CephFS; Ceph File System; libcephfs; MDS; metadata server; ceph-mds
|
2011-12-01 23:22:15 +00:00
|
|
|
|
|
2018-07-01 13:46:14 +00:00
|
|
|
|
.. _arch-cephfs:
|
|
|
|
|
|
2019-09-09 19:36:04 +00:00
|
|
|
|
Ceph File System
|
|
|
|
|
----------------
|
2011-12-01 23:22:15 +00:00
|
|
|
|
|
2019-09-09 19:36:04 +00:00
|
|
|
|
The Ceph File System (CephFS) provides a POSIX-compliant filesystem as a
|
2013-05-15 00:05:43 +00:00
|
|
|
|
service that is layered on top of the object-based Ceph Storage Cluster.
|
2018-07-28 12:17:33 +00:00
|
|
|
|
CephFS files get mapped to objects that Ceph stores in the Ceph Storage
|
2013-05-15 00:05:43 +00:00
|
|
|
|
Cluster. Ceph Clients mount a CephFS filesystem as a kernel object or as
|
|
|
|
|
a Filesystem in User Space (FUSE).
|
2011-12-01 23:22:15 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
.. ditaa::
|
2020-04-09 13:25:39 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
+-----------------------+ +------------------------+
|
|
|
|
|
| CephFS Kernel Object | | CephFS FUSE |
|
|
|
|
|
+-----------------------+ +------------------------+
|
2011-12-01 23:22:15 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
+---------------------------------------------------+
|
2018-07-28 12:17:33 +00:00
|
|
|
|
| CephFS Library (libcephfs) |
|
2013-05-15 00:05:43 +00:00
|
|
|
|
+---------------------------------------------------+
|
2011-12-01 23:22:15 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
+---------------------------------------------------+
|
|
|
|
|
| Ceph Storage Cluster Protocol (librados) |
|
|
|
|
|
+---------------------------------------------------+
|
2011-12-01 23:22:15 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
+---------------+ +---------------+ +---------------+
|
|
|
|
|
| OSDs | | MDSs | | Monitors |
|
|
|
|
|
+---------------+ +---------------+ +---------------+
|
2011-12-01 23:22:15 +00:00
|
|
|
|
|
|
|
|
|
|
2019-09-09 19:36:04 +00:00
|
|
|
|
The Ceph File System service includes the Ceph Metadata Server (MDS) deployed
|
2014-07-09 04:50:54 +00:00
|
|
|
|
with the Ceph Storage cluster. The purpose of the MDS is to store all the
|
2013-05-15 00:05:43 +00:00
|
|
|
|
filesystem metadata (directories, file ownership, access modes, etc) in
|
|
|
|
|
high-availability Ceph Metadata Servers where the metadata resides in memory.
|
|
|
|
|
The reason for the MDS (a daemon called ``ceph-mds``) is that simple filesystem
|
|
|
|
|
operations like listing a directory or changing a directory (``ls``, ``cd``)
|
|
|
|
|
would tax the Ceph OSD Daemons unnecessarily. So separating the metadata from
|
2019-09-09 19:36:04 +00:00
|
|
|
|
the data means that the Ceph File System can provide high performance services
|
2013-05-15 00:05:43 +00:00
|
|
|
|
without taxing the Ceph Storage Cluster.
|
2011-12-01 23:22:15 +00:00
|
|
|
|
|
2018-07-28 12:17:33 +00:00
|
|
|
|
CephFS separates the metadata from the data, storing the metadata in the MDS,
|
2013-05-15 00:05:43 +00:00
|
|
|
|
and storing the file data in one or more objects in the Ceph Storage Cluster.
|
|
|
|
|
The Ceph filesystem aims for POSIX compatibility. ``ceph-mds`` can run as a
|
|
|
|
|
single process, or it can be distributed out to multiple physical machines,
|
|
|
|
|
either for high availability or for scalability.
|
2011-12-01 23:22:15 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
- **High Availability**: The extra ``ceph-mds`` instances can be `standby`,
|
|
|
|
|
ready to take over the duties of any failed ``ceph-mds`` that was
|
|
|
|
|
`active`. This is easy because all the data, including the journal, is
|
|
|
|
|
stored on RADOS. The transition is triggered automatically by ``ceph-mon``.
|
2011-12-01 23:22:15 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
- **Scalability**: Multiple ``ceph-mds`` instances can be `active`, and they
|
|
|
|
|
will split the directory tree into subtrees (and shards of a single
|
|
|
|
|
busy directory), effectively balancing the load amongst all `active`
|
|
|
|
|
servers.
|
2011-12-01 23:22:15 +00:00
|
|
|
|
|
2013-05-15 00:05:43 +00:00
|
|
|
|
Combinations of `standby` and `active` etc are possible, for example
|
|
|
|
|
running 3 `active` ``ceph-mds`` instances for scaling, and one `standby`
|
|
|
|
|
instance for high availability.
|
2011-12-01 23:22:15 +00:00
|
|
|
|
|
2012-12-04 04:48:02 +00:00
|
|
|
|
|
|
|
|
|
|
2022-11-08 08:24:29 +00:00
|
|
|
|
.. _RADOS - A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters: https://ceph.io/assets/pdfs/weil-rados-pdsw07.pdf
|
2017-10-23 11:26:28 +00:00
|
|
|
|
.. _Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science)
|
2013-05-15 00:05:43 +00:00
|
|
|
|
.. _Monitor Config Reference: ../rados/configuration/mon-config-ref
|
|
|
|
|
.. _Monitoring OSDs and PGs: ../rados/operations/monitoring-osd-pg
|
|
|
|
|
.. _Heartbeats: ../rados/configuration/mon-osd-interaction
|
|
|
|
|
.. _Monitoring OSDs: ../rados/operations/monitoring-osd-pg/#monitoring-osds
|
2022-10-09 07:09:30 +00:00
|
|
|
|
.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf
|
2013-05-15 00:05:43 +00:00
|
|
|
|
.. _Data Scrubbing: ../rados/configuration/osd-config-ref#scrubbing
|
|
|
|
|
.. _Report Peering Failure: ../rados/configuration/mon-osd-interaction#osds-report-peering-failure
|
|
|
|
|
.. _Troubleshooting Peering Failure: ../rados/troubleshooting/troubleshooting-pg#placement-group-down-peering-failure
|
|
|
|
|
.. _Ceph Authentication and Authorization: ../rados/operations/auth-intro/
|
2016-10-08 14:41:59 +00:00
|
|
|
|
.. _Hardware Recommendations: ../start/hardware-recommendations
|
2013-05-15 00:05:43 +00:00
|
|
|
|
.. _Network Config Reference: ../rados/configuration/network-config-ref
|
|
|
|
|
.. _Data Scrubbing: ../rados/configuration/osd-config-ref#scrubbing
|
2017-10-23 11:26:28 +00:00
|
|
|
|
.. _striping: https://en.wikipedia.org/wiki/Data_striping
|
|
|
|
|
.. _RAID: https://en.wikipedia.org/wiki/RAID
|
|
|
|
|
.. _RAID 0: https://en.wikipedia.org/wiki/RAID_0#RAID_0
|
2013-05-15 00:05:43 +00:00
|
|
|
|
.. _Ceph Object Storage: ../radosgw/
|
2017-10-23 11:26:28 +00:00
|
|
|
|
.. _RESTful: https://en.wikipedia.org/wiki/RESTful
|
2014-05-06 10:53:03 +00:00
|
|
|
|
.. _Erasure Code Notes: https://github.com/ceph/ceph/blob/40059e12af88267d0da67d8fd8d9cd81244d8f93/doc/dev/osd_internals/erasure_coding/developer_notes.rst
|
|
|
|
|
.. _Cache Tiering: ../rados/operations/cache-tiering
|
2014-05-23 00:37:16 +00:00
|
|
|
|
.. _Set Pool Values: ../rados/operations/pools#set-pool-values
|
2017-10-23 11:26:28 +00:00
|
|
|
|
.. _Kerberos: https://en.wikipedia.org/wiki/Kerberos_(protocol)
|
2014-08-25 18:02:27 +00:00
|
|
|
|
.. _Cephx Config Guide: ../rados/configuration/auth-config-ref
|
2015-06-08 18:51:58 +00:00
|
|
|
|
.. _User Management: ../rados/operations/user-management
|