
==============
 Architecture
==============

:term:`Ceph` uniquely delivers **object, block, and file storage** in one
unified system. Ceph is highly reliable, easy to manage, and free. The power of
Ceph can transform your company's IT infrastructure and your ability to manage
vast amounts of data. Ceph delivers extraordinary scalability--thousands of
clients accessing petabytes to exabytes of data. A :term:`Ceph Node` leverages
commodity hardware and intelligent daemons, and a :term:`Ceph Storage Cluster`
accommodates large numbers of nodes, which communicate with each other to
replicate and redistribute data dynamically. A :term:`Ceph Monitor` can also be
placed into a cluster of Ceph monitors to oversee the Ceph nodes in the Ceph
Storage Cluster (a monitor cluster ensures high availability). 

.. image:: images/stack.png


The Ceph Storage Cluster
========================

Ceph provides an infinitely scalable :term:`Ceph Storage Cluster` based upon
:abbr:`RADOS (Reliable Autonomic Distributed Object Store)`, which you can read
about in `RADOS - A Scalable, Reliable Storage Service for Petabyte-scale
Storage Clusters`_. Storage cluster clients and each :term:`Ceph OSD Daemon` use
the CRUSH algorithm to efficiently compute information about data location,
instead of having to depend on a central lookup table. Ceph's high-level
features include providing a native interface to the Ceph Storage Cluster via
``librados``, and a number of service interfaces built on top of ``librados``.

.. ditaa::  +---------------+ +---------------+
            |      OSDs     | |    Monitors   |
            +---------------+ +---------------+


Storing Data
------------

The Ceph Storage Cluster receives data from :term:`Ceph Clients`--whether it
comes through a :term:`Ceph Block Device`, :term:`Ceph Object Storage`, the
:term:`Ceph Filesystem` or a custom implementation you create using
``librados``--and it stores the data as objects. Each object corresponds to a
file in a filesystem, which is stored on an :term:`Object Storage Device`. Ceph
OSD Daemons handle the read/write operations on the storage disks.

.. ditaa:: /-----\       +-----+       +-----+
           | obj |------>| {d} |------>| {s} |
           \-----/       +-----+       +-----+
   
            Object         File         Disk

Ceph OSD Daemons store all data as objects in a flat namespace (i.e., no
hierarchy of directories). An object has an identifier, binary data, and
metadata consisting of a set of name/value pairs. The semantics are completely
up to :term:`Ceph Clients`. For example, CephFS uses metadata to store file
attributes such as the file owner, created date, last modified date, and so
forth.


.. ditaa:: /------+------------------------------+----------------\
           | ID   | Binary Data                  | Metadata       |
           +------+------------------------------+----------------+
           | 1234 | 0101010101010100110101010010 | name1 = value1 | 
           |      | 0101100001010100110101010010 | name2 = value2 |
           |      | 0101100001010100110101010010 | nameN = valueN |
           \------+------------------------------+----------------/    

.. note:: An object ID is unique across the entire cluster, not just the local
   filesystem.
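
As a rough illustration of the object model described above, the following
sketch represents an object as an identifier, opaque binary data, and a set of
name/value metadata pairs held in a flat namespace. The ``StoredObject`` class
and the ``namespace`` dictionary are hypothetical names chosen for this
example; they are not part of Ceph.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class StoredObject:
    """Hypothetical in-memory model of an object (not a Ceph type)."""

    object_id: str                                           # unique across the cluster
    data: bytes = b""                                        # opaque binary payload
    metadata: Dict[str, bytes] = field(default_factory=dict)  # name/value pairs


# A flat namespace: a single mapping keyed by object ID, with no directories.
namespace: Dict[str, StoredObject] = {}

obj = StoredObject("1234", b"\x01\x02\x03", {"owner": b"alice"})
namespace[obj.object_id] = obj
```

The meaning of each metadata entry is left to the client; the store itself only
keeps the pairs alongside the data.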


.. index:: architecture; high availability, scalability

Scalability and High Availability
---------------------------------

In traditional architectures, clients talk to a centralized component (e.g., a
gateway, broker, API, facade, etc.), which acts as a single point of entry to a
complex subsystem. This imposes a limit to both performance and scalability,
while introducing a single point of failure (i.e., if the centralized component
goes down, the whole system goes down, too).

Ceph eliminates the centralized gateway to enable clients to interact with 
Ceph OSD Daemons directly. Ceph OSD Daemons create object replicas on other
Ceph Nodes to ensure data safety and high availability. Ceph also uses a cluster 
of monitors to ensure high availability. To eliminate centralization, Ceph 
uses an algorithm called CRUSH.


.. index:: CRUSH; architecture

CRUSH Introduction
~~~~~~~~~~~~~~~~~~

Ceph Clients and Ceph OSD Daemons both use the :abbr:`CRUSH (Controlled
Replication Under Scalable Hashing)` algorithm to efficiently compute
information about data containers on demand, instead of having to depend on a
central lookup table. CRUSH provides a better data management mechanism compared
to older approaches, and enables massive scale by cleanly distributing the work
to all the clients and OSD daemons in the cluster. CRUSH uses intelligent data
replication to ensure resiliency, which is better suited to hyper-scale storage.
The following sections provide additional details on how CRUSH works. For a
detailed discussion of CRUSH, see `CRUSH - Controlled, Scalable, Decentralized
Placement of Replicated Data`_.

.. index:: architecture; cluster map

Cluster Map
~~~~~~~~~~~

Ceph depends upon Ceph Clients and Ceph OSD Daemons having knowledge of the
cluster topology, which is inclusive of 5 maps collectively referred to as the
"Cluster Map":

#. **The Monitor Map:** Contains the cluster ``fsid``, the position, name,
   address, and port of each monitor. It also indicates the current epoch,
   when the map was created, and the last time it changed. To view a monitor
   map, execute ``ceph mon dump``.   
   
#. **The OSD Map:** Contains the cluster ``fsid``, when the map was created and
   last modified, a list of pools, replica sizes, PG numbers, a list of OSDs
   and their status (e.g., ``up``, ``in``). To view an OSD map, execute
   ``ceph osd dump``. 
   
#. **The PG Map:** Contains the PG version, its time stamp, the last OSD
   map epoch, the full ratios, and details on each placement group such as
   the PG ID, the `Up Set`, the `Acting Set`, the state of the PG (e.g., 
   ``active + clean``), and data usage statistics for each pool.

#. **The CRUSH Map:** Contains a list of storage devices, the failure domain
   hierarchy (e.g., device, host, rack, row, room, etc.), and rules for 
   traversing the hierarchy when storing data. To view a CRUSH map, execute
   ``ceph osd getcrushmap -o {filename}``; then, decompile it by executing
   ``crushtool -d {comp-crushmap-filename} -o {decomp-crushmap-filename}``.
   You can view the decompiled map in a text editor or with ``cat``. 

#. **The MDS Map:** Contains the current MDS map epoch, when the map was 
   created, and the last time it changed. It also contains the pool for 
   storing metadata, a list of metadata servers, and which metadata servers
   are ``up`` and ``in``. To view an MDS map, execute ``ceph mds dump``.

Each map maintains an iterative history of its operating state changes. Ceph
Monitors maintain a master copy of the cluster map including the cluster
members, state, changes, and the overall health of the Ceph Storage Cluster.

.. index:: high availability; monitor architecture

High Availability Monitors
~~~~~~~~~~~~~~~~~~~~~~~~~~

Before Ceph Clients can read or write data, they must contact a Ceph Monitor
to obtain the most recent copy of the cluster map. A Ceph Storage Cluster
can operate with a single monitor; however, this introduces a single 
point of failure (i.e., if the monitor goes down, Ceph Clients cannot
read or write data).

For added reliability and fault tolerance, Ceph supports a cluster of monitors.
In a cluster of monitors, latency and other faults can cause one or more
monitors to fall behind the current state of the cluster. For this reason, Ceph
must have agreement among various monitor instances regarding the state of the
cluster. Ceph always uses a majority of monitors (e.g., 1 of 1, 2 of 3, 3 of 5,
4 of 6, etc.) and the `Paxos`_ algorithm to establish a consensus among the
monitors about the current state of the cluster.
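
The majority rule can be sketched in a few lines. This is only the arithmetic
behind the example ratios above, not the Paxos protocol itself, and the
function names ``quorum_size`` and ``has_quorum`` are made up for this
illustration.

```python
def quorum_size(monitor_count: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    if monitor_count < 1:
        raise ValueError("a cluster needs at least one monitor")
    return monitor_count // 2 + 1


def has_quorum(reachable: int, monitor_count: int) -> bool:
    """True when enough monitors are reachable to agree on cluster state."""
    return reachable >= quorum_size(monitor_count)
```

Under this rule a six-monitor cluster tolerates no more failures than a
five-monitor cluster, which is one reason odd monitor counts are common.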

For details on configuring monitors, see the `Monitor Config Reference`_.

.. index:: architecture; high availability authentication

High Availability Authentication
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ceph clients can authenticate users with Ceph Monitors, Ceph OSD Daemons and
Ceph Metadata Servers, using Ceph's Kerberos-like ``cephx`` protocol.
Authenticated users gain authorization to read, write and execute Ceph commands.
The Cephx authentication system avoids a single point of failure to ensure
scalability and high availability.  For details on Cephx and how it differs
from Kerberos, see `Ceph Authentication and Authorization`_.

.. index:: architecture; smart daemons and scalability

Smart Daemons Enable Hyperscale
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In many clustered architectures, the primary purpose of cluster membership is 
so that a centralized interface knows which nodes it can access. Then the
centralized interface provides services to the client through a double
dispatch--which is a **huge** bottleneck at the petabyte-to-exabyte scale.

Ceph eliminates the bottleneck: Ceph's OSD Daemons AND Ceph Clients are cluster
aware. Like Ceph clients, each Ceph OSD Daemon knows about other Ceph OSD
Daemons in the cluster.  This enables Ceph OSD Daemons to interact directly with
other Ceph OSD Daemons and Ceph monitors. Additionally, it enables Ceph Clients
to interact directly with Ceph OSD Daemons.

The ability of Ceph Clients, Ceph Monitors and Ceph OSD Daemons to interact with
each other means that Ceph OSD Daemons can utilize the CPU and RAM of the Ceph
nodes to easily perform tasks that would bog down a centralized server. The
ability to leverage this computing power leads to several major benefits:

#. **OSDs Service Clients Directly:** Since any network device has a limit to 
   the number of concurrent connections it can support, a centralized system 
   has a low physical limit at high scales. By enabling Ceph Clients to contact 
   Ceph OSD Daemons directly, Ceph increases both performance and total system 
   capacity simultaneously, while removing a single point of failure. Ceph 
   Clients can maintain a session when they need to, and with a particular Ceph 
   OSD Daemon instead of a centralized server.

#. **OSD Membership and Status**: Ceph OSD Daemons join a cluster and report 
   on their status. At the lowest level, the Ceph OSD Daemon status is ``up`` 
   or ``down`` reflecting whether or not it is running and able to service 
   Ceph Client requests. If a Ceph OSD Daemon is ``down`` and ``in`` the Ceph 
   Storage Cluster, this status may indicate the failure of the Ceph OSD 
   Daemon. If a Ceph OSD Daemon is not running (e.g., it crashes), the Ceph OSD 
   Daemon cannot notify the Ceph Monitor that it is ``down``. The Ceph Monitor 
   can ping a Ceph OSD Daemon periodically to ensure that it is running. 
   However, Ceph also empowers Ceph OSD Daemons to determine if a neighboring 
   OSD is ``down``, to update the cluster map and to report it to the Ceph 
   monitor(s). This means that Ceph monitors can remain lightweight processes. 
   See `Monitoring OSDs`_ and `Heartbeats`_ for additional details.
   
#. **Data Scrubbing:** As part of maintaining data consistency and cleanliness, 
   Ceph OSD Daemons can scrub objects within placement groups. That is, Ceph 
   OSD Daemons can compare object metadata in one placement group with its 
   replicas in placement groups stored on other OSDs. Scrubbing (usually 
   performed daily) catches bugs or filesystem errors. Ceph OSD Daemons also 
   perform deeper scrubbing by comparing data in objects bit-for-bit. Deep 
   scrubbing (usually performed weekly) finds bad sectors on a drive that 
   weren't apparent in a light scrub. See `Data Scrubbing`_ for details on 
   configuring scrubbing.

#. **Replication:** Like Ceph Clients, Ceph OSD Daemons use the CRUSH 
   algorithm, but the Ceph OSD Daemon uses it to compute where replicas of 
   objects should be stored (and for rebalancing). In a typical write scenario, 
   a client uses the CRUSH algorithm to compute where to store an object, maps 
   the object to a pool and placement group, then looks at the CRUSH map to 
   identify the primary OSD for the placement group.
   
   The client writes the object to the identified placement group in the 
   primary OSD. Then, the primary OSD with its own copy of the CRUSH map 
   identifies the secondary and tertiary OSDs for replication purposes, and 
   replicates the object to the appropriate placement groups in the secondary 
   and tertiary OSDs (as many OSDs as additional replicas), and responds to the
   client once it has confirmed the object was stored successfully.

.. ditaa:: 
             +----------+
             |  Client  |
             |          |
             +----------+
                 *  ^
      Write (1)  |  |  Ack (6)
                 |  |
                 v  *
            +-------------+
            | Primary OSD |
            |             |
            +-------------+
              *  ^   ^  *
    Write (2) |  |   |  |  Write (3)
       +------+  |   |  +------+
       |  +------+   +------+  |
       |  | Ack (4)  Ack (5)|  | 
       v  *                 *  v
 +---------------+   +---------------+
 | Secondary OSD |   | Tertiary OSD  |
 |               |   |               |
 +---------------+   +---------------+

With the ability to perform data replication, Ceph OSD Daemons relieve Ceph
clients from that duty, while ensuring high data availability and data safety.
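
The write path in the diagram can be sketched as a small simulation. The
``SimulatedOSD`` class and ``primary_write`` function are hypothetical
stand-ins that keep objects in dictionaries; they only mirror the ordering of
writes and acknowledgements, and leave out networking, journaling, and failure
handling.

```python
from typing import Dict, List


class SimulatedOSD:
    """Hypothetical stand-in for a Ceph OSD Daemon; stores objects in a dict."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.store: Dict[str, bytes] = {}

    def write_replica(self, object_id: str, data: bytes) -> bool:
        """Store a replica and acknowledge it to the caller."""
        self.store[object_id] = data
        return True


def primary_write(acting_set: List[SimulatedOSD], object_id: str, data: bytes) -> bool:
    """Write through the primary (the first OSD) and ack once all replicas ack."""
    primary, replicas = acting_set[0], acting_set[1:]
    primary.store[object_id] = data  # Write (1): client to primary
    # Writes (2) and (3) go to the replicas; their return values are Acks (4) and (5).
    acks = [osd.write_replica(object_id, data) for osd in replicas]
    return all(acks)  # Ack (6): primary to client
```

The client only ever talks to the first OSD in the list; the fan-out to the
other replicas is the primary's responsibility.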


Dynamic Cluster Management
--------------------------

In the `Scalability and High Availability`_ section, we explained how Ceph uses
CRUSH, cluster awareness and intelligent daemons to scale and maintain high
availability. Key to Ceph's design is the autonomous, self-healing, and
intelligent Ceph OSD Daemon. Let's take a deeper look at how CRUSH works to
enable modern cloud storage infrastructures to place data, rebalance the cluster
and recover from faults dynamically.

.. index:: architecture; pools

About Pools
~~~~~~~~~~~

The Ceph storage system supports the notion of 'Pools', which are logical
partitions for storing objects. Pools set the following parameters:

- Ownership/Access to Objects
- The Number of Object Replicas
- The Number of Placement Groups, and 
- The CRUSH Ruleset to Use.

Ceph Clients retrieve a `Cluster Map`_ from a Ceph Monitor, and write objects to
pools. The pool's ``size`` or number of replicas, the CRUSH ruleset and the
number of placement groups determine how Ceph will place the data.

.. ditaa:: 
            +--------+  Retrieves  +---------------+
            | Client |------------>|  Cluster Map  |
            +--------+             +---------------+
                 |
                 v      Writes
              /-----\
              | obj |
              \-----/
                 |      To
                 v
            +--------+           +---------------+
            |  Pool  |---------->| CRUSH Ruleset |
            +--------+  Selects  +---------------+
                 

.. index:: architecture; placement group mapping

Mapping PGs to OSDs
~~~~~~~~~~~~~~~~~~~

Each pool has a number of placement groups. CRUSH maps PGs to OSDs dynamically.
When a Ceph Client stores objects, CRUSH will map each object to a placement
group.

Mapping objects to placement groups creates a layer of indirection between the
Ceph OSD Daemon and the Ceph Client. The Ceph Storage Cluster must be able to
grow (or shrink) and rebalance where it stores objects dynamically. If the Ceph
Client "knew" which Ceph OSD Daemon had which object, that would create a tight
coupling between the Ceph Client and the Ceph OSD Daemon. Instead, the CRUSH
algorithm maps each object to a placement group and then maps each placement
group to one or more Ceph OSD Daemons. This layer of indirection allows Ceph to
rebalance dynamically when new Ceph OSD Daemons and the underlying OSD devices
come online. The following diagram depicts how CRUSH maps objects to placement
groups, and placement groups to OSDs.

.. ditaa:: 
           /-----\  /-----\  /-----\  /-----\  /-----\
           | obj |  | obj |  | obj |  | obj |  | obj |
           \-----/  \-----/  \-----/  \-----/  \-----/
              |        |        |        |        |
              +--------+--------+        +---+----+
              |                              |
              v                              v
   +-----------------------+      +-----------------------+
   |  Placement Group #1   |      |  Placement Group #2   |
   |                       |      |                       |
   +-----------------------+      +-----------------------+
               |                              |
               |      +-----------------------+---+
        +------+------+-------------+             |
        |             |             |             |
        v             v             v             v
   /----------\  /----------\  /----------\  /----------\ 
   |          |  |          |  |          |  |          |
   |  OSD #1  |  |  OSD #2  |  |  OSD #3  |  |  OSD #4  |
   |          |  |          |  |          |  |          |
   \----------/  \----------/  \----------/  \----------/  

With a copy of the cluster map and the CRUSH algorithm, the client can compute
exactly which OSD to use when reading or writing a particular object.

.. index:: architecture; calculating PG IDs

Calculating PG IDs
~~~~~~~~~~~~~~~~~~

When a Ceph Client binds to a Ceph Monitor, it retrieves the latest copy of the
`Cluster Map`_. With the cluster map, the client knows about all of the monitors,
OSDs, and metadata servers in the cluster. **However, it doesn't know anything
about object locations.** 

.. epigraph:: 

	Object locations get computed.


The only input required by the client is the object ID and the pool.
It's simple: Ceph stores data in named pools (e.g., "liverpool"). When a client
wants to store a named object (e.g., "john," "paul," "george," "ringo", etc.)
it calculates a placement group using the object name, a hash code, the
number of PGs in the pool and the pool name. Ceph clients use the following
steps to compute PG IDs.

#. The client inputs the pool name and the object ID (e.g., pool = "liverpool"
   and object-id = "john").
#. CRUSH takes the object ID and hashes it.
#. CRUSH calculates the hash modulo the number of PGs (e.g., ``0x58``) to get
   a PG ID.
#. CRUSH gets the pool ID given the pool name (e.g., "liverpool" = ``4``).
#. CRUSH prepends the pool ID to the PG ID (e.g., ``4.0x58``).

Computing object locations is much faster than performing an object location query
over a chatty session. The :abbr:`CRUSH (Controlled Replication Under Scalable
Hashing)` algorithm allows a client to compute where objects *should* be stored,
and enables the client to contact the primary OSD to store or retrieve the
objects.
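
The steps above can be sketched as follows. This is a simplified illustration,
not Ceph's implementation: it uses ``zlib.crc32`` as a stand-in for the hash
function Ceph actually uses, takes a plain modulo, and assumes the modulus is
the pool's placement group count (``pg_num``). The function name
``compute_pg_id`` is hypothetical.

```python
import zlib


def compute_pg_id(pool_id: int, object_name: str, pg_num: int) -> str:
    """Return a '<pool id>.<pg id>' string for an object (illustrative only)."""
    if pg_num < 1:
        raise ValueError("a pool needs at least one placement group")
    object_hash = zlib.crc32(object_name.encode("utf-8"))  # stand-in hash
    pg = object_hash % pg_num                               # hash modulo PG count
    return f"{pool_id}.{pg:#x}"                             # prepend the pool ID


# Every client that knows the pool ID and pg_num computes the same answer.
pg_for_john = compute_pg_id(4, "john", 128)
```

Because the result depends only on the object name and the pool's parameters,
no lookup service is consulted; any client holding the cluster map can repeat
the calculation.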

.. index:: architecture; PG Peering

Peering and Sets
~~~~~~~~~~~~~~~~

In previous sections, we noted that Ceph OSD Daemons check each other's
heartbeats and report back to the Ceph Monitor. Another thing Ceph OSD daemons
do is called 'peering', which is the process of bringing all of the OSDs that
store a Placement Group (PG) into agreement about the state of all of the
objects (and their metadata) in that PG. In fact, Ceph OSD Daemons `Report
Peering Failure`_ to the Ceph Monitors. Peering issues  usually resolve
themselves; however, if the problem persists, you may need to refer to the
`Troubleshooting Peering Failure`_ section.

.. note:: Agreeing on the state does not mean that the PGs have the latest contents.

The Ceph Storage Cluster was designed to store at least two copies of an object
(i.e., ``size = 2``), which is the minimum requirement for data safety. For high
availability, a Ceph Storage Cluster should store more than two copies of an object
(e.g., ``size = 3`` and ``min size = 2``) so that it can continue to run in a 
``degraded`` state while maintaining data safety.

Referring back to the diagram in `Smart Daemons Enable Hyperscale`_, we do not 
name the Ceph OSD Daemons specifically (e.g., ``osd.0``, ``osd.1``, etc.), but 
rather refer to them as *Primary*, *Secondary*, and so forth. By convention, 
the *Primary* is the first OSD in the *Acting Set*, and is responsible for 
coordinating the peering process for each placement group where it acts as 
the *Primary*, and is the **ONLY** OSD that will accept client-initiated 
writes to objects for a given placement group where it acts as the *Primary*.

When a series of OSDs is responsible for a placement group, we refer to that
series of OSDs as an *Acting Set*. An *Acting Set* may refer to the Ceph
OSD Daemons that are currently responsible for the placement group, or the Ceph
OSD Daemons that were responsible  for a particular placement group as of some
epoch.

The Ceph OSD daemons that are part of an *Acting Set* may not always be  ``up``.
When an OSD in the *Acting Set* is ``up``, it is part of the  *Up Set*. The *Up
Set* is an important distinction, because Ceph can remap PGs to other Ceph OSD
Daemons when an OSD fails. 

.. note:: In an *Acting Set* for a PG containing ``osd.25``, ``osd.32`` and 
   ``osd.61``, the first OSD, ``osd.25``, is the *Primary*. If that OSD fails,
   the Secondary, ``osd.32``, becomes the *Primary*, and ``osd.25`` will be 
   removed from the *Up Set*.
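
The relationship between the two sets, as described in this section, can be
sketched like this. The helper names ``up_set`` and ``current_primary`` are
hypothetical, and the sketch follows the simplified description given here
rather than the full peering logic.

```python
from typing import List, Optional, Set


def up_set(acting_set: List[str], up_osds: Set[str]) -> List[str]:
    """Members of the acting set that are currently up, in their original order."""
    return [osd for osd in acting_set if osd in up_osds]


def current_primary(acting_set: List[str], up_osds: Set[str]) -> Optional[str]:
    """The first OSD that is still up, or None if every member is down."""
    members = up_set(acting_set, up_osds)
    return members[0] if members else None


acting = ["osd.25", "osd.32", "osd.61"]
healthy = current_primary(acting, {"osd.25", "osd.32", "osd.61"})
after_failure = current_primary(acting, {"osd.32", "osd.61"})
```

With all three daemons up, ``osd.25`` is the primary; once it fails, the same
calculation yields ``osd.32``, matching the note above.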


.. index:: architecture; Rebalancing

Rebalancing
~~~~~~~~~~~

When you add a Ceph OSD Daemon to a Ceph Storage Cluster, the cluster map gets
updated with the new OSD. Referring back to `Calculating PG IDs`_, this changes
the cluster map. Consequently, it changes object placement, because it changes
an input for the calculations. The following diagram depicts the rebalancing
process (albeit rather crudely, since it is substantially less impactful with
large clusters) where some, but not all of the PGs migrate from existing OSDs
(OSD 1, and OSD 2) to the new OSD (OSD 3). Even when rebalancing, CRUSH is
stable. Many of the placement groups remain in their original configuration,
and each OSD gets some added capacity, so there are no load spikes on the 
new OSD after rebalancing is complete.


.. ditaa:: 
           +--------+     +--------+
   Before  |  OSD 1 |     |  OSD 2 |
           +--------+     +--------+
           |  PG #1 |     | PG #6  |
           |  PG #2 |     | PG #7  |
           |  PG #3 |     | PG #8  |
           |  PG #4 |     | PG #9  |
           |  PG #5 |     | PG #10 |
           +--------+     +--------+

           +--------+     +--------+     +--------+
    After  |  OSD 1 |     |  OSD 2 |     |  OSD 3 |
           +--------+     +--------+     +--------+
           |  PG #1 |     | PG #7  |     |  PG #3 |
           |  PG #2 |     | PG #8  |     |  PG #6 |
           |  PG #4 |     | PG #10 |     |  PG #9 |
           |  PG #5 |     |        |     |        |
           |        |     |        |     |        |
           +--------+     +--------+     +--------+
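
The stability property can be demonstrated with a toy placement function. The
sketch below uses rendezvous (highest-random-weight) hashing as a stand-in;
it is **not** the CRUSH algorithm, and it ignores replicas, weights, and
failure domains. It only shows the behavior described above: when an OSD is
added, some PGs move to it and the rest stay where they were. The names
``place_pgs`` and ``_score`` are hypothetical.

```python
import hashlib
from typing import Dict, List


def _score(pg_id: int, osd: str) -> int:
    """Deterministic pseudo-random score for a (PG, OSD) pair."""
    digest = hashlib.sha256(f"{pg_id}:{osd}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def place_pgs(pg_count: int, osds: List[str]) -> Dict[int, str]:
    """Map each PG to the OSD with the highest score (rendezvous hashing)."""
    return {pg: max(osds, key=lambda osd: _score(pg, osd)) for pg in range(pg_count)}


before = place_pgs(128, ["osd.1", "osd.2"])
after = place_pgs(128, ["osd.1", "osd.2", "osd.3"])

# PGs whose placement changed when osd.3 joined the cluster.
moved = [pg for pg in before if before[pg] != after[pg]]
```

In this scheme a PG moves only if the new OSD outscores its current owner, so
every entry in ``moved`` lands on ``osd.3`` and no PG shuffles between the
existing OSDs.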


.. index:: architecture; Data Scrubbing

Data Consistency
~~~~~~~~~~~~~~~~

As part of maintaining data consistency and cleanliness, Ceph OSDs can also
scrub objects within placement groups. That is, Ceph OSDs can compare object
metadata in one placement group with its replicas in placement groups stored in
other OSDs. Scrubbing (usually performed daily) catches OSD bugs or filesystem
errors.  OSDs can also perform deeper scrubbing by comparing data in objects
bit-for-bit.  Deep scrubbing (usually performed weekly) finds bad sectors on a
disk that weren't apparent in a light scrub.

See `Data Scrubbing`_ for details on configuring scrubbing.
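
The difference between the two scrub levels can be sketched as follows. This
is an assumption-laden illustration: replicas are plain dictionaries, a light
scrub compares object sizes and metadata, and a deep scrub compares a SHA-256
digest of the data. The names ``light_scrub``, ``deep_scrub``, and
``_mismatches`` are hypothetical and do not reflect how Ceph implements
scrubbing internally.

```python
import hashlib
from typing import Callable, Dict, Hashable, List, Tuple

# One replica of a placement group: object ID -> (metadata, data).
StoredObject = Tuple[Dict[str, str], bytes]
Replica = Dict[str, StoredObject]


def _mismatches(replicas: List[Replica],
                fingerprint: Callable[[StoredObject], Hashable]) -> List[str]:
    """Object IDs whose fingerprints differ (or are missing) across replicas."""
    object_ids = set().union(*(replica.keys() for replica in replicas))
    bad = []
    for object_id in sorted(object_ids):
        seen = {fingerprint(r[object_id]) if object_id in r else None
                for r in replicas}
        if len(seen) > 1:
            bad.append(object_id)
    return bad


def light_scrub(replicas: List[Replica]) -> List[str]:
    """Compare object size and metadata only."""
    return _mismatches(
        replicas, lambda obj: (len(obj[1]), tuple(sorted(obj[0].items()))))


def deep_scrub(replicas: List[Replica]) -> List[str]:
    """Compare a digest of the full object data."""
    return _mismatches(replicas, lambda obj: hashlib.sha256(obj[1]).digest())
```

A replica whose data has a flipped byte but an unchanged size passes the light
scrub and is caught only by the deep scrub, which is why the deeper pass is
still needed even though it reads every byte.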


.. index:: Extensibility, Ceph Classes

Extending Ceph
--------------

You can extend Ceph by creating shared object classes called 'Ceph Classes'.
Ceph loads ``.so`` classes stored in the ``osd class dir`` directory dynamically
(i.e., ``$libdir/rados-classes`` by default). When you implement a class, you
can create new object methods that have the ability to call the native methods
in the Ceph Object Store, or other class methods you incorporate via libraries
or create yourself.

On writes, Ceph Classes can call native or class methods, perform any series of
operations on the inbound data and generate a resulting write transaction  that
Ceph will apply atomically.

On reads, Ceph Classes can call native or class methods, perform any series of
operations on the outbound data and return the data to the client.

.. topic:: Ceph Class Example

   A Ceph class for a content management system that presents pictures of a
   particular size and aspect ratio could take an inbound bitmap image, crop it
   to a particular aspect ratio, resize it and embed an invisible copyright or 
   watermark to help protect the intellectual property; then, save the 
   resulting bitmap image to the object store.

See ``src/objclass/objclass.h``, ``src/fooclass.cc`` and ``src/barclass`` for 
exemplary implementations.


Summary
-------

Ceph Storage Clusters are dynamic--like a living organism. Whereas many storage
appliances do not fully utilize the CPU and RAM of a typical commodity server,
Ceph does. From heartbeats, to  peering, to rebalancing the cluster or
recovering from faults,  Ceph offloads work from clients (and from a centralized
gateway which doesn't exist in the Ceph architecture) and uses the computing
power of the OSDs to perform the work. When referring to `Hardware
Recommendations`_ and the `Network Config Reference`_,  be cognizant of the
foregoing concepts to understand how Ceph utilizes computing resources.

.. index:: Ceph Protocol, librados

Ceph Protocol
=============

Ceph Clients use the native protocol for interacting with the Ceph Storage
Cluster. Ceph packages this functionality into the ``librados`` library so that
you can create your own custom Ceph Clients. The following diagram depicts the
basic architecture.

.. ditaa::  
            +---------------------------------+
            |  Ceph Storage Cluster Protocol  |
            |           (librados)            |
            +---------------------------------+
            +---------------+ +---------------+
            |      OSDs     | |    Monitors   |
            +---------------+ +---------------+


Native Protocol and ``librados``
--------------------------------

Modern applications need a simple object storage interface with asynchronous
communication capability, and the Ceph Storage Cluster provides one. The
interface provides direct, parallel access to objects throughout the cluster,
and supports the following operations:


- Pool Operations
- Snapshots and Copy-on-write Cloning
- Read/Write Objects

  - Create or Remove
  - Entire Object or Byte Range
  - Append or Truncate

- Create/Set/Get/Remove XATTRs
- Create/Set/Get/Remove Key/Value Pairs
- Compound operations and dual-ack semantics
- Object Classes
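
A few of these operations can be exercised through the Python ``rados``
bindings. The sketch below is a minimal example, assuming the bindings are
installed, a cluster is reachable, and the named pool already exists. The
function name ``roundtrip``, the default configuration path, and the ``owner``
attribute are choices made for this illustration.

```python
def roundtrip(pool_name: str, object_name: str, payload: bytes,
              conffile: str = "/etc/ceph/ceph.conf") -> bytes:
    """Write an object, attach an extended attribute, and read the object back."""
    import rados  # imported lazily: requires the Ceph Python bindings

    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool_name)  # I/O context bound to one pool
        try:
            ioctx.write_full(object_name, payload)           # replace the whole object
            ioctx.set_xattr(object_name, "owner", b"docs")   # name/value metadata
            return ioctx.read(object_name)                   # reads up to the default length
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

The nested ``try``/``finally`` blocks release the I/O context and the cluster
handle even if a write or read fails.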


.. index:: architecture; watch/notify

Object Watch/Notify
-------------------

A client can register a persistent interest with an object and keep a session to
the primary OSD open. The client can send a notification message and payload to
all watchers and receive notification when the watchers receive the
notification. This enables a client to use any object as a
synchronization/communication channel.


.. ditaa:: +----------+     +----------+     +----------+     +---------------+
           | Client 1 |     | Client 2 |     | Client 3 |     | OSD:Object ID |
           +----------+     +----------+     +----------+     +---------------+
                 |                |                |                  |
                 |                |                |                  |
                 |                |  Watch Object  |                  |               
                 |--------------------------------------------------->|
                 |                |                |                  |
                 |<---------------------------------------------------|
                 |                |   Ack/Commit   |                  |
                 |                |                |                  |
                 |                |  Watch Object  |                  |
                 |                |---------------------------------->|
                 |                |                |                  |
                 |                |<----------------------------------|
                 |                |   Ack/Commit   |                  |
                 |                |                |   Watch Object   |
                 |                |                |----------------->|
                 |                |                |                  |
                 |                |                |<-----------------|
                 |                |                |    Ack/Commit    |
                 |                |     Notify     |                  |               
                 |--------------------------------------------------->|
                 |                |                |                  |
                 |<---------------------------------------------------|
                 |                |     Notify     |                  |
                 |                |                |                  |
                 |                |<----------------------------------|
                 |                |     Notify     |                  |
                 |                |                |<-----------------|
                 |                |                |      Notify      |
                 |                |       Ack      |                  |               
                 |----------------+---------------------------------->|
                 |                |                |                  |
                 |                |       Ack      |                  |
                 |                +---------------------------------->|
                 |                |                |                  |
                 |                |                |        Ack       |
                 |                |                |----------------->|
                 |                |                |                  | 
                 |<---------------+----------------+------------------|
                 |                     Complete
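
The message flow in the diagram can be mimicked with an in-memory sketch. The
``WatchedObject`` class is hypothetical and runs in a single process; it only
shows the pattern of registering watchers, fanning out a notification, and
collecting acknowledgements before reporting completion.

```python
from typing import Callable, Dict


class WatchedObject:
    """Hypothetical in-memory stand-in for an object that clients can watch."""

    def __init__(self, object_id: str) -> None:
        self.object_id = object_id
        self._watchers: Dict[str, Callable[[bytes], str]] = {}

    def watch(self, client_name: str, callback: Callable[[bytes], str]) -> None:
        """Register a client's persistent interest in this object."""
        self._watchers[client_name] = callback

    def notify(self, payload: bytes) -> Dict[str, str]:
        """Deliver the payload to every watcher and collect each watcher's ack."""
        return {name: callback(payload) for name, callback in self._watchers.items()}
```

The dictionary returned by ``notify`` plays the role of the final "Complete"
message: the notifier learns which watchers received the payload.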

.. index:: architecture; Striping

Data Striping
-------------

Storage devices have throughput limitations, which impact performance and
scalability. So storage systems often support `striping`_--storing sequential
pieces of information across multiple storage devices--to increase
throughput and performance. The most common form of data striping comes from
`RAID`_. The RAID type most similar to Ceph's striping is `RAID 0`_, or a
'striped volume.' Ceph's striping offers the throughput of RAID 0 striping,
the reliability of n-way RAID mirroring and faster recovery.

Ceph provides three types of clients: Ceph Block Device, Ceph Filesystem, and
Ceph Object Storage. A Ceph Client converts its data from the representation 
format it provides to its users (a block device image, RESTful objects, CephFS
filesystem directories) into objects for storage in the Ceph Storage Cluster. 

.. tip:: The objects Ceph stores in the Ceph Storage Cluster are not striped. 
   Ceph Object Storage, Ceph Block Device, and the Ceph Filesystem stripe their 
   data over multiple Ceph Storage Cluster objects. Ceph Clients that write 
   directly to the Ceph Storage Cluster via ``librados`` must perform the
   striping (and parallel I/O) for themselves to obtain these benefits.

The simplest Ceph striping format involves a stripe count of 1 object. Ceph
Clients write stripe units to a Ceph Storage Cluster object until the object is
at its maximum capacity, and then create another object for additional stripes
of data. The simplest form of striping may be sufficient for small block device
images, S3 or Swift objects and CephFS files. However, this simple form doesn't
take maximum advantage of Ceph's ability to distribute data across placement
groups, and consequently doesn't improve performance very much. The following
diagram depicts the simplest form of striping:

.. ditaa::              
                        +---------------+
                        |  Client Data  |
                        |     Format    |
                        | cCCC          |
                        +---------------+
                                |
                       +--------+-------+
                       |                |
                       v                v
                 /-----------\    /-----------\
                 | Begin cCCC|    | Begin cCCC|
                 | Object  0 |    | Object  1 |
                 +-----------+    +-----------+
                 |  stripe   |    |  stripe   |
                 |  unit 1   |    |  unit 5   |
                 +-----------+    +-----------+
                 |  stripe   |    |  stripe   |
                 |  unit 2   |    |  unit 6   |
                 +-----------+    +-----------+
                 |  stripe   |    |  stripe   |
                 |  unit 3   |    |  unit 7   |
                 +-----------+    +-----------+
                 |  stripe   |    |  stripe   |
                 |  unit 4   |    |  unit 8   |
                 +-----------+    +-----------+
                 | End cCCC  |    | End cCCC  |
                 | Object 0  |    | Object 1  |
                 \-----------/    \-----------/
   

If you anticipate large image sizes, large S3 or Swift objects (e.g., video),
or large CephFS directories, you may see considerable read/write performance
improvements by striping client data over multiple objects within an object set.
The most significant write performance gains occur when the client writes the
stripe units to their corresponding objects in parallel. Since objects get
mapped to different placement groups and further mapped to different OSDs, each
write can proceed in parallel at the maximum write speed of its device. A write
to a single disk would be limited by the head movement (e.g., 6ms per seek) and
bandwidth of that one device (e.g., 100MB/s). By spreading that write over
multiple objects (which map to different placement groups and OSDs), Ceph can
reduce the number of seeks per drive and combine the throughput of multiple
drives to achieve much faster write (or read) speeds.

.. note:: Striping is independent of object replicas. Since CRUSH
   replicates objects across OSDs, stripes get replicated automatically.

In the following diagram, client data gets striped across an object set
(``object set 1`` in the following diagram) consisting of 4 objects, where the
first stripe unit is ``stripe unit 0`` in ``object 0``, and the fourth stripe
unit is ``stripe unit 3`` in ``object 3``. After writing the fourth stripe
unit, the client determines if the object set is full. If the object set is not
full, the client begins writing the next stripe unit to the first object again
(``object 0`` in the following diagram). If the object set is full, the client
creates a new object set (``object set 2`` in the following diagram), and
begins writing the first stripe unit (``stripe unit 16``) to the first object
in the new object set (``object 4`` in the diagram below).

.. ditaa::                 
                          +---------------+
                          |  Client Data  |
                          |     Format    |
                          | cCCC          |
                          +---------------+
                                  |
       +-----------------+--------+--------+-----------------+
       |                 |                 |                 |     +--\
       v                 v                 v                 v        |
 /-----------\     /-----------\     /-----------\     /-----------\  |   
 | Begin cCCC|     | Begin cCCC|     | Begin cCCC|     | Begin cCCC|  |
 | Object 0  |     | Object  1 |     | Object  2 |     | Object  3 |  |
 +-----------+     +-----------+     +-----------+     +-----------+  |
 |  stripe   |     |  stripe   |     |  stripe   |     |  stripe   |  |
 |  unit 0   |     |  unit 1   |     |  unit 2   |     |  unit 3   |  |
 +-----------+     +-----------+     +-----------+     +-----------+  |
 |  stripe   |     |  stripe   |     |  stripe   |     |  stripe   |  +-\ 
 |  unit 4   |     |  unit 5   |     |  unit 6   |     |  unit 7   |    | Object
 +-----------+     +-----------+     +-----------+     +-----------+    +- Set 
 |  stripe   |     |  stripe   |     |  stripe   |     |  stripe   |    |   1
 |  unit 8   |     |  unit 9   |     |  unit 10  |     |  unit 11  |  +-/
 +-----------+     +-----------+     +-----------+     +-----------+  |
 |  stripe   |     |  stripe   |     |  stripe   |     |  stripe   |  |
 |  unit 12  |     |  unit 13  |     |  unit 14  |     |  unit 15  |  |
 +-----------+     +-----------+     +-----------+     +-----------+  |
 | End cCCC  |     | End cCCC  |     | End cCCC  |     | End cCCC  |  |
 | Object 0  |     | Object 1  |     | Object 2  |     | Object 3  |  |  
 \-----------/     \-----------/     \-----------/     \-----------/  |
                                                                      |
                                                                   +--/
  
                                                                   +--\
                                                                      |
 /-----------\     /-----------\     /-----------\     /-----------\  |   
 | Begin cCCC|     | Begin cCCC|     | Begin cCCC|     | Begin cCCC|  |
 | Object  4 |     | Object  5 |     | Object  6 |     | Object  7 |  |  
 +-----------+     +-----------+     +-----------+     +-----------+  |
 |  stripe   |     |  stripe   |     |  stripe   |     |  stripe   |  |
 |  unit 16  |     |  unit 17  |     |  unit 18  |     |  unit 19  |  |
 +-----------+     +-----------+     +-----------+     +-----------+  |
 |  stripe   |     |  stripe   |     |  stripe   |     |  stripe   |  +-\ 
 |  unit 20  |     |  unit 21  |     |  unit 22  |     |  unit 23  |    | Object
 +-----------+     +-----------+     +-----------+     +-----------+    +- Set
 |  stripe   |     |  stripe   |     |  stripe   |     |  stripe   |    |   2 
 |  unit 24  |     |  unit 25  |     |  unit 26  |     |  unit 27  |  +-/
 +-----------+     +-----------+     +-----------+     +-----------+  |
 |  stripe   |     |  stripe   |     |  stripe   |     |  stripe   |  |
 |  unit 28  |     |  unit 29  |     |  unit 30  |     |  unit 31  |  |
 +-----------+     +-----------+     +-----------+     +-----------+  |
 | End cCCC  |     | End cCCC  |     | End cCCC  |     | End cCCC  |  |
 | Object 4  |     | Object 5  |     | Object 6  |     | Object 7  |  |  
 \-----------/     \-----------/     \-----------/     \-----------/  |
                                                                      |
                                                                   +--/

Three important variables determine how Ceph stripes data: 

- **Object Size:** Objects in the Ceph Storage Cluster have a maximum
  configurable size (e.g., 2MB, 4MB, etc.). The object size should be large
  enough to accommodate many stripe units, and should be a multiple of
  the stripe unit.

- **Stripe Width:** Stripes have a configurable unit size (e.g., 64KB).
  The Ceph Client divides the data it will write to objects into equally
  sized stripe units, except for the last stripe unit. A stripe width
  should be a fraction of the Object Size so that an object may contain
  many stripe units.

- **Stripe Count:** The Ceph Client writes a sequence of stripe units
  over a series of objects determined by the stripe count. The series 
  of objects is called an object set. After the Ceph Client writes to 
  the last object in the object set, it returns to the first object in
  the object set.
  
.. important:: Test the performance of your striping configuration before
   putting your cluster into production. You CANNOT change these striping
   parameters after you stripe the data and write it to objects.
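
The way the three variables interact can be sketched in a few lines of Python.
This is a minimal illustration that follows the layout shown in the diagrams
above, not a copy of Ceph's client code. The names ``Layout``, ``Location``,
and ``locate`` are hypothetical and exist only for this example, and the sizes
used at the bottom are chosen to match the 4-object diagram rather than to
recommend a configuration.

```python
from dataclasses import dataclass
from typing import NamedTuple


@dataclass(frozen=True)
class Layout:
    """Striping parameters (hypothetical helper, sizes in bytes)."""
    object_size: int
    stripe_unit: int
    stripe_count: int

    def __post_init__(self):
        if min(self.object_size, self.stripe_unit, self.stripe_count) <= 0:
            raise ValueError("all striping parameters must be positive")
        if self.object_size % self.stripe_unit != 0:
            raise ValueError("object_size must be a multiple of stripe_unit")


class Location(NamedTuple):
    stripe_unit_no: int   # index of the stripe unit within the client data
    object_set: int       # zero-based object set index
    object_no: int        # zero-based object index across all object sets
    object_offset: int    # byte offset inside that object


def locate(layout: Layout, offset: int) -> Location:
    """Map a byte offset in the client data to an object and an offset in it."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    units_per_object = layout.object_size // layout.stripe_unit
    unit_no = offset // layout.stripe_unit
    # Each stripe is one row of stripe units across the object set.
    stripe_no, stripe_pos = divmod(unit_no, layout.stripe_count)
    object_set = stripe_no // units_per_object
    object_no = object_set * layout.stripe_count + stripe_pos
    row_in_object = stripe_no % units_per_object
    object_offset = row_in_object * layout.stripe_unit + offset % layout.stripe_unit
    return Location(unit_no, object_set, object_no, object_offset)


if __name__ == "__main__":
    KIB = 1024
    # Four objects per set, four stripe units per object, as in the diagram.
    layout = Layout(object_size=256 * KIB, stripe_unit=64 * KIB, stripe_count=4)
    for unit in (0, 3, 4, 15, 16):
        print(unit, locate(layout, unit * layout.stripe_unit))
```

Setting ``stripe_count=1`` in the same sketch reproduces the simplest striping
format, where one object is filled completely before the next one is created.
Note that the sketch numbers object sets from zero, while the diagram labels
them starting at 1.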

Once the Ceph Client has striped data to stripe units and mapped the stripe
units to objects, Ceph's CRUSH algorithm maps the objects to placement groups,
and the placement groups to Ceph OSD Daemons before the objects are stored as 
files on a storage disk.

.. note:: Since a client writes to a single pool, all data striped into objects
   gets mapped to placement groups in the same pool, so the objects use the
   same CRUSH map and the same access controls.
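
As a rough illustration of the first mapping step (object to placement group
within one pool), the following Python sketch hashes an object name and reduces
it modulo the number of placement groups. It is a simplification under stated
assumptions: ``zlib.crc32`` is only a stand-in for the hash Ceph actually uses,
the function name ``placement_group`` and the object names are hypothetical,
and the second step (placement group to Ceph OSD Daemons via CRUSH) is omitted
entirely.

```python
import zlib


def placement_group(pool_id: int, object_name: str, pg_num: int) -> str:
    """Return an illustrative PG identifier of the form '<pool>.<pg in hex>'.

    Assumption: crc32 stands in for Ceph's real object-name hash, so the
    PG numbers produced here will not match a live cluster.
    """
    if pg_num <= 0:
        raise ValueError("pg_num must be positive")
    digest = zlib.crc32(object_name.encode("utf-8"))
    pg = digest % pg_num
    return f"{pool_id}.{pg:x}"


if __name__ == "__main__":
    # Every object striped from the same client data lands in the same pool,
    # so all identifiers share the pool prefix while the PG part varies.
    for object_no in range(8):
        name = f"example-image.{object_no:016x}"  # hypothetical object name
        print(name, "->", placement_group(4, name, pg_num=128))
```

Because the mapping is computed from the object name alone, any client with
the same pool parameters arrives at the same placement group without
consulting a central lookup table.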


.. index:: architecture; Ceph Clients

Ceph Clients
============

Ceph Clients include a number of service interfaces. These include:

- **Block Devices:** The :term:`Ceph Block Device` (a.k.a., RBD) service 
  provides resizable, thin-provisioned block devices with snapshotting and
  cloning. Ceph stripes a block device across the cluster for high
  performance. Ceph supports both kernel objects (KO) and a QEMU hypervisor 
  that uses ``librbd`` directly--avoiding the kernel object overhead for 
  virtualized systems.

- **Object Storage:** The :term:`Ceph Object Storage` (a.k.a., RGW) service 
  provides RESTful APIs with interfaces that are compatible with Amazon S3
  and OpenStack Swift. 
  
- **Filesystem:** The :term:`Ceph Filesystem` (CephFS) service provides
  a POSIX-compliant filesystem usable with ``mount`` or as
  a filesystem in user space (FUSE).

Ceph can run additional instances of OSDs, MDSs, and monitors for scalability
and high availability. The following diagram depicts the high-level
architecture. 

.. ditaa::
            +--------------+  +----------------+  +-------------+
            | Block Device |  | Object Storage |  |   Ceph FS   |
            +--------------+  +----------------+  +-------------+            

            +--------------+  +----------------+  +-------------+
            |    librbd    |  |     librgw     |  |  libcephfs  |
            +--------------+  +----------------+  +-------------+

            +---------------------------------------------------+
            |      Ceph Storage Cluster Protocol (librados)     |
            +---------------------------------------------------+

            +---------------+ +---------------+ +---------------+
            |      OSDs     | |      MDSs     | |    Monitors   |
            +---------------+ +---------------+ +---------------+


.. index:: architecture; Ceph Object Storage

Ceph Object Storage
-------------------

The Ceph Object Storage daemon, ``radosgw``, is a FastCGI service that provides
a RESTful_ HTTP API to store objects and metadata. It layers on top of the Ceph
Storage Cluster with its own data formats, and maintains its own user database,
authentication, and access control. The RADOS Gateway uses a unified namespace,
which means you can use either the OpenStack Swift-compatible API or the Amazon
S3-compatible API. For example, you can write data using the S3-compatible API
with one application and then read data using the Swift-compatible API with
another application.

.. topic:: S3/Swift Objects and Storage Cluster Objects Compared

   Ceph's Object Storage uses the term *object* to describe the data it stores.
   S3 and Swift objects are not the same as the objects that Ceph writes to the 
   Ceph Storage Cluster. Ceph Object Storage objects are mapped to Ceph Storage
   Cluster objects. The S3 and Swift objects do not necessarily 
   correspond in a 1:1 manner with an object stored in the storage cluster. It 
   is possible for an S3 or Swift object to map to multiple Ceph objects.

See `Ceph Object Storage`_ for details.


.. index:: Ceph Block Device; block device; RBD; Rados Block Device

Ceph Block Device
-----------------

A Ceph Block Device stripes a block device image over multiple objects in the
Ceph Storage Cluster, where each object gets mapped to a placement group and
distributed, and the placement groups are spread across separate ``ceph-osd``
daemons throughout the cluster.

.. important:: Striping allows RBD block devices to perform better than a single 
   server could!
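
To make the image-to-object relationship concrete, the sketch below splits a
byte range of a block device image into per-object extents. It assumes the
simplest striping format (a stripe count of 1) and an illustrative 4 MiB
object size; the function name ``image_extents`` is hypothetical and is not
part of ``librbd``.

```python
from typing import List, Tuple

MIB = 1024 * 1024


def image_extents(offset: int, length: int,
                  object_size: int = 4 * MIB) -> List[Tuple[int, int, int]]:
    """Split an image byte range into (object_no, object_offset, length) extents.

    Assumes a stripe count of 1, so consecutive image bytes fill one object
    completely before continuing in the next object.
    """
    if offset < 0 or length < 0 or object_size <= 0:
        raise ValueError("offset/length must be non-negative, object_size positive")
    extents = []
    end = offset + length
    while offset < end:
        object_no, object_offset = divmod(offset, object_size)
        # Stop at the object boundary or at the end of the request.
        chunk = min(object_size - object_offset, end - offset)
        extents.append((object_no, object_offset, chunk))
        offset += chunk
    return extents


if __name__ == "__main__":
    # A 10 MiB write starting 3 MiB into the image touches several objects,
    # each of which may live in a different placement group.
    for extent in image_extents(3 * MIB, 10 * MIB):
        print(extent)
```

Each extent can be sent to a different ``ceph-osd`` daemon, which is what
allows a single large request to be serviced by several servers at once.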

Thin-provisioned snapshottable Ceph Block Devices are an attractive option for
virtualization and cloud computing. In virtual machine scenarios, people
typically deploy a Ceph Block Device with the ``rbd`` network storage driver in
Qemu/KVM, where the host machine uses ``librbd`` to provide a block device
service to the guest. Many cloud computing stacks use ``libvirt`` to integrate
with hypervisors. You can use thin-provisioned Ceph Block Devices with Qemu and
``libvirt`` to support OpenStack and CloudStack among other solutions.

While we do not provide ``librbd`` support with other hypervisors at this time,
you may also use Ceph Block Device kernel objects to provide a block device to a
client. Other virtualization technologies such as Xen can access the Ceph Block
Device kernel object(s). This is done with the command-line tool ``rbd``.


.. index:: Ceph FS; Ceph Filesystem; libcephfs; MDS; metadata server; ceph-mds

Ceph Filesystem
---------------

The Ceph Filesystem (Ceph FS) provides a POSIX-compliant filesystem as a 
service that is layered on top of the object-based Ceph Storage Cluster.
Ceph FS files get mapped to objects that Ceph stores in the Ceph Storage
Cluster. Ceph Clients mount a CephFS filesystem as a kernel object or as
a Filesystem in User Space (FUSE).

.. ditaa::
            +-----------------------+  +------------------------+
            | CephFS Kernel Object  |  |      CephFS FUSE       |
            +-----------------------+  +------------------------+            

            +---------------------------------------------------+
            |            Ceph FS Library (libcephfs)            |
            +---------------------------------------------------+

            +---------------------------------------------------+
            |      Ceph Storage Cluster Protocol (librados)     |
            +---------------------------------------------------+

            +---------------+ +---------------+ +---------------+
            |      OSDs     | |      MDSs     | |    Monitors   |
            +---------------+ +---------------+ +---------------+


The Ceph Filesystem service includes the Ceph Metadata Server (MDS) deployed
with the Ceph Storage Cluster. The purpose of the MDS is to store all the
filesystem metadata (directories, file ownership, access modes, etc.) in
high-availability Ceph Metadata Servers where the metadata resides in memory.
The reason for the MDS (a daemon called ``ceph-mds``) is that simple filesystem
operations like listing a directory or changing a directory (``ls``, ``cd``)
would tax the Ceph OSD Daemons unnecessarily. So separating the metadata from
the data means that the Ceph Filesystem can provide high performance services
without taxing the Ceph Storage Cluster.

Ceph FS separates the metadata from the data, storing the metadata in the MDS, 
and storing the file data in one or more objects in the Ceph Storage Cluster.
The Ceph filesystem aims for POSIX compatibility. ``ceph-mds`` can run as a
single process, or it can be distributed out to multiple physical machines,
either for high availability or for scalability. 

- **High Availability**: The extra ``ceph-mds`` instances can be `standby`, 
  ready to take over the duties of any failed ``ceph-mds`` that was
  `active`. This is easy because all the data, including the journal, is
  stored on RADOS. The transition is triggered automatically by ``ceph-mon``.

- **Scalability**: Multiple ``ceph-mds`` instances can be `active`, and they
  will split the directory tree into subtrees (and shards of a single
  busy directory), effectively balancing the load amongst all `active`
  servers.

Combinations of `standby` and `active` instances are possible; for example,
running 3 `active` ``ceph-mds`` instances for scaling, and one `standby`
instance for high availability.
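
The interplay between `active` and `standby` instances can be pictured with a
toy model. The Python sketch below is only an illustration of the takeover
idea described above: the classes ``MdsDaemon`` and ``MdsCluster``, the use of
a numeric ``rank`` to mark an active slot, and the promotion order are all
assumptions made for this example, and none of it reflects how ``ceph-mon``
actually selects a replacement.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MdsDaemon:
    name: str
    rank: Optional[int] = None  # None means the daemon is a standby


class MdsCluster:
    """Toy model: a fixed number of active slots plus standby daemons."""

    def __init__(self, names: List[str], max_active: int):
        self.daemons: Dict[str, MdsDaemon] = {}
        for index, name in enumerate(names):
            rank = index if index < max_active else None
            self.daemons[name] = MdsDaemon(name, rank)

    def active(self) -> List[str]:
        return sorted(d.name for d in self.daemons.values() if d.rank is not None)

    def standby(self) -> List[str]:
        return sorted(d.name for d in self.daemons.values() if d.rank is None)

    def fail(self, name: str) -> Optional[str]:
        """Remove a daemon; if it was active, promote a standby into its slot.

        Returns the name of the promoted daemon, or None if no promotion
        happened (the failed daemon was a standby, or no standby was left).
        """
        failed = self.daemons.pop(name)
        if failed.rank is None:
            return None
        for daemon in self.daemons.values():
            if daemon.rank is None:
                daemon.rank = failed.rank
                return daemon.name
        return None


if __name__ == "__main__":
    # Three active instances for scaling and one standby for availability.
    cluster = MdsCluster(["a", "b", "c", "d"], max_active=3)
    print("active:", cluster.active(), "standby:", cluster.standby())
    print("promoted:", cluster.fail("b"))
    print("active:", cluster.active(), "standby:", cluster.standby())
```

In this model a second failure would leave a slot empty, which is one reason
to weigh the number of `standby` instances against the number of `active`
ones.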


.. _RADOS - A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters: http://ceph.com/papers/weil-rados-pdsw07.pdf
.. _Paxos: http://en.wikipedia.org/wiki/Paxos_(computer_science)
.. _Monitor Config Reference: ../rados/configuration/mon-config-ref
.. _Monitoring OSDs and PGs: ../rados/operations/monitoring-osd-pg
.. _Heartbeats: ../rados/configuration/mon-osd-interaction
.. _Monitoring OSDs: ../rados/operations/monitoring-osd-pg/#monitoring-osds
.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: http://ceph.com/papers/weil-crush-sc06.pdf
.. _Data Scrubbing: ../rados/configuration/osd-config-ref#scrubbing
.. _Report Peering Failure: ../rados/configuration/mon-osd-interaction#osds-report-peering-failure
.. _Troubleshooting Peering Failure: ../rados/troubleshooting/troubleshooting-pg#placement-group-down-peering-failure
.. _Ceph Authentication and Authorization: ../rados/operations/auth-intro/
.. _Hardware Recommendations: ../install/hardware-recommendations
.. _Network Config Reference: ../rados/configuration/network-config-ref
.. _striping: http://en.wikipedia.org/wiki/Data_striping
.. _RAID: http://en.wikipedia.org/wiki/RAID 
.. _RAID 0: http://en.wikipedia.org/wiki/RAID_0#RAID_0
.. _Ceph Object Storage: ../radosgw/
.. _RESTful: http://en.wikipedia.org/wiki/RESTful
-												:doc: Rewrote architecture paper. Still needs some work.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2012-09-18 18:08:23 +00:00
+								==============
-												doc: Updated architecture document.

fixes: #2968

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2013-05-15 00:05:43 +00:00
+								 Architecture
-												:doc: Rewrote architecture paper. Still needs some work.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2012-09-18 18:08:23 +00:00
+								==============
-												doc: Updated architecture document.

fixes: #2968

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2013-05-15 00:05:43 +00:00
+								:term:`Ceph` uniquely delivers **object, block, and file storage** in one
 								unified system. Ceph is highly reliable, easy to manage, and free. The power of
 								Ceph can transform your company's IT infrastructure and your ability to manage
 								vast amounts of data. Ceph delivers extraordinary scalability–thousands of
 								clients accessing petabytes to exabytes of data. A :term:`Ceph Node` leverages
 								commodity hardware and intelligent daemons, and a :term:`Ceph Storage Cluster`
 								accommodates large numbers of nodes, which communicate with each other to
 								replicate and redistribute data dynamically. A :term:`Ceph Monitor` can also be
 								placed into a cluster of Ceph monitors to oversee the Ceph nodes in the Ceph
 								Storage Cluster (a monitor cluster ensures high availability).
-												:doc: Rewrote architecture paper. Still needs some work.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2012-09-18 18:08:23 +00:00
-												doc: Updated architecture document.

fixes: #2968

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2013-05-15 00:05:43 +00:00
+								.. image:: images/stack.png
-												doc: Added a striping section for Architecture.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2012-12-04 04:48:02 +00:00
-												:doc: Rewrote architecture paper. Still needs some work.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2012-09-18 18:08:23 +00:00
-												doc: Updated architecture document.

fixes: #2968

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2013-05-15 00:05:43 +00:00
+								The Ceph Storage Cluster
 								========================
 								Ceph provides an infinitely scalable :term:`Ceph Storage Cluster` based upon
 								:abbr:`RADOS (Reliable Autonomic Distributed Object Store)`, which you can read
 								about in `RADOS - A Scalable, Reliable Storage Service for Petabyte-scale
 								Storage Clusters`_. Storage cluster clients and each :term:`Ceph OSD Daemon` use
 								the CRUSH algorithm to efficiently compute information about data location,
 								instead of having to depend on a central lookup table. Ceph's high-level
 								features include providing a native interface to the Ceph Storage Cluster via
 								``librados``, and a number of service interfaces built on top of ``librados``.
 								.. ditaa::  +---------------+ +---------------+
 								            |      OSDs     | |    Monitors   |
 								            +---------------+ +---------------+
-												:doc: Rewrote architecture paper. Still needs some work.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2012-09-18 18:08:23 +00:00
-												doc: Updated architecture document.

fixes: #2968

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2013-05-15 00:05:43 +00:00
+								Storing Data
 								------------
 								The Ceph Storage Cluster receives data from :term:`Ceph Clients`--whether it
 								comes through a :term:`Ceph Block Device`, :term:`Ceph Object Storage`, the
 								:term:`Ceph Filesystem` or a custom implementation you create using
 								``librados``--and it stores the data as objects. Each object corresponds to a
 								file in a filesystem, which is stored on an :term:`Object Storage Device`. Ceph
 								OSD Daemons handle the read/write operations on the storage disks.
-												:doc: Rewrote architecture paper. Still needs some work.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2012-09-18 18:08:23 +00:00
-												doc: Added a striping section for Architecture.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2012-12-04 04:48:02 +00:00
+								.. ditaa:: /-----\       +-----+       +-----+
 								           | obj |------>| {d} |------>| {s} |
 								           \-----/       +-----+       +-----+
 								            Object         File         Disk
-												:doc: Rewrote architecture paper. Still needs some work.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2012-09-18 18:08:23 +00:00
-												doc: Updated architecture document.

fixes: #2968

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2013-05-15 00:05:43 +00:00
+								Ceph OSD Daemons store all data as objects in a flat namespace (e.g., no
 								hierarchy of directories). An object has an identifier, binary data, and
 								metadata consisting of a set of name/value pairs. The semantics are completely
 								up to :term:`Ceph Clients`. For example, CephFS uses metadata to store file
 								attributes such as the file owner, created date, last modified date, and so
 								forth.
-												:doc: Rewrote architecture paper. Still needs some work.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2012-09-18 18:08:23 +00:00
-												doc: Added a striping section for Architecture.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2012-12-04 04:48:02 +00:00
+								.. ditaa:: /------+------------------------------+----------------\
 								           | ID   | Binary Data                  | Metadata       |
 								           +------+------------------------------+----------------+
 								           | 1234 | 0101010101010100110101010010 | name1 = value1 |
 								           |      | 0101100001010100110101010010 | name2 = value2 |
 								           |      | 0101100001010100110101010010 | nameN = valueN |
 								           \------+------------------------------+----------------/
-												:doc: Rewrote architecture paper. Still needs some work.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2012-09-18 18:08:23 +00:00
-												doc: Updated architecture document.

fixes: #2968

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2013-05-15 00:05:43 +00:00
+								.. note:: An object ID is unique across the entire cluster, not just the local
 								   filesystem.
-												:doc: Rewrote architecture paper. Still needs some work.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2012-09-18 18:08:23 +00:00
-												doc: Added a striping section for Architecture.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2012-12-04 04:48:02 +00:00
-												doc: Updated index tags.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2013-06-14 23:52:25 +00:00
+								.. index:: architecture; high availability, scalability
-												doc: Updated architecture document.

fixes: #2968

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2013-05-15 00:05:43 +00:00
+								Scalability and High Availability
 								---------------------------------
-												Doc: Restore the previous version of architecture.rst

it was accidentally overwritten with a version of the product
had a somewhat different audience/focus and a few sphinx
formatting errors.

I will cherry-pick the corrections in a subsequent commit.

Signed-off-by: Mark Kampe <mark.kampe@dreamhost.com>

											
										
										
											2011-12-01 23:22:15 +00:00
-												doc: Added a striping section for Architecture.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2012-12-04 04:48:02 +00:00
+								In traditional architectures, clients talk to a centralized component (e.g., a
-												doc: Added some detail. Calculating PGs, maps; reorganized a bit.

fixes: #2968

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2013-04-23 04:02:45 +00:00
+								gateway, broker, API, facade, etc.), which acts as a single point of entry to a
-												doc: Added a striping section for Architecture.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2012-12-04 04:48:02 +00:00
+								complex subsystem. This imposes a limit to both performance and scalability,
 								while introducing a single point of failure (i.e., if the centralized component
-												doc: Updated architecture document.

fixes: #2968

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2013-05-15 00:05:43 +00:00
+								goes down, the whole system goes down, too).
 								Ceph eliminates the centralized gateway to enable clients to interact with
 								Ceph OSD Daemons directly. Ceph OSD Daemons create object replicas on other
 								Ceph Nodes to ensure data safety and high availabilty. Ceph also uses a cluster
 								of monitors to ensure high availability. To eliminate centralization, Ceph
 								uses an algorithm called CRUSH.
-												:doc: Rewrote architecture paper. Still needs some work.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2012-09-18 18:08:23 +00:00
-												doc: Added some detail. Calculating PGs, maps; reorganized a bit.

fixes: #2968

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2013-04-23 04:02:45 +00:00
-												doc: Updated index tags.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2013-06-14 23:52:25 +00:00
+								.. index:: CRUSH; architecture
-												doc: Updated architecture document.

fixes: #2968

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2013-05-15 00:05:43 +00:00
+								CRUSH Introduction
 								~~~~~~~~~~~~~~~~~~
 								Ceph Clients and Ceph OSD Daemons both use the :abbr:`CRUSH (Controlled
 								Replication Under Scalable Hashing)` algorithm to efficiently compute
 								information about data containers on demand, instead of having to depend on a
 								central lookup table. CRUSH provides a better data management mechanism compared
 								to older approaches, and enables massive scale by cleanly distributing the work
 								to all the clients and OSD daemons in the cluster. CRUSH uses intelligent data
 								replication to ensure resiliency, which is better suited to hyper-scale storage.
 								The following sections provide additional details on how CRUSH works. For a
 								detailed discussion of CRUSH, see `CRUSH - Controlled, Scalable, Decentralized
 								Placement of Replicated Data`_.
-												doc: Added some detail. Calculating PGs, maps; reorganized a bit.

fixes: #2968

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2013-04-23 04:02:45 +00:00
-												doc: Updated index tags.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2013-06-14 23:52:25 +00:00
+								.. index:: architecture; cluster map
-												doc: Added some detail. Calculating PGs, maps; reorganized a bit.

fixes: #2968

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2013-04-23 04:02:45 +00:00
 								Cluster Map
-												doc: Updated architecture document.

fixes: #2968

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2013-05-15 00:05:43 +00:00
+								~~~~~~~~~~~
-												doc: Added some detail. Calculating PGs, maps; reorganized a bit.

fixes: #2968

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2013-04-23 04:02:45 +00:00
-												doc: Updated architecture document.

fixes: #2968

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2013-05-15 00:05:43 +00:00
+								Ceph depends upon Ceph Clients and Ceph OSD Daemons having knowledge of the
 								cluster topology, which is inclusive of 5 maps collectively referred to as the
 								"Cluster Map":
#. **The Monitor Map:** Contains the cluster ``fsid``, and the position, name,
   address and port of each monitor. It also indicates the current epoch,
   when the map was created, and the last time it changed. To view a monitor
   map, execute ``ceph mon dump``.

#. **The OSD Map:** Contains the cluster ``fsid``, when the map was created and
   last modified, a list of pools, replica sizes, PG numbers, and a list of
   OSDs and their status (e.g., ``up``, ``in``). To view an OSD map, execute
   ``ceph osd dump``.

#. **The PG Map:** Contains the PG version, its time stamp, the last OSD
   map epoch, the full ratios, and details on each placement group such as
   the PG ID, the `Up Set`, the `Acting Set`, the state of the PG (e.g.,
   ``active + clean``), and data usage statistics for each pool.

#. **The CRUSH Map:** Contains a list of storage devices, the failure domain
   hierarchy (e.g., device, host, rack, row, room, etc.), and rules for
   traversing the hierarchy when storing data. To view a CRUSH map, execute
   ``ceph osd getcrushmap -o {filename}``; then, decompile it by executing
   ``crushtool -d {comp-crushmap-filename} -o {decomp-crushmap-filename}``.
   You can view the decompiled map in a text editor or with ``cat``.

#. **The MDS Map:** Contains the current MDS map epoch, when the map was
   created, and the last time it changed. It also contains the pool for
   storing metadata, a list of metadata servers, and which metadata servers
   are ``up`` and ``in``. To view an MDS map, execute ``ceph mds dump``.

Each map maintains an iterative history of its operating state changes. Ceph
Monitors maintain a master copy of the cluster map including the cluster
members, state, changes, and the overall health of the Ceph Storage Cluster.

.. index:: high availability; monitor architecture

-												doc: Added some detail. Calculating PGs, maps; reorganized a bit.

fixes: #2968

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2013-04-23 04:02:45 +00:00
-												doc: Updated architecture document.

fixes: #2968

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2013-05-15 00:05:43 +00:00
+								High Availability Monitors
 								~~~~~~~~~~~~~~~~~~~~~~~~~~
Before Ceph Clients can read or write data, they must contact a Ceph Monitor
to obtain the most recent copy of the cluster map. A Ceph Storage Cluster
can operate with a single monitor; however, this introduces a single
point of failure (i.e., if the monitor goes down, Ceph Clients cannot
read or write data).

For added reliability and fault tolerance, Ceph supports a cluster of monitors.
In a cluster of monitors, latency and other faults can cause one or more
monitors to fall behind the current state of the cluster. For this reason, Ceph
must have agreement among various monitor instances regarding the state of the
cluster. Ceph always uses a majority of monitors (e.g., 1 of 1, 2 of 3, 3 of 5,
4 of 6, etc.) and the `Paxos`_ algorithm to establish a consensus among the
monitors about the current state of the cluster.

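The majority rule can be illustrated with a little arithmetic. The following
is a minimal sketch, not Ceph code: the function names ``quorum_size`` and
``tolerated_failures`` are hypothetical and only show how the majority counts
quoted above follow from the number of monitors.

```python
def quorum_size(monitor_count):
    """Smallest number of monitors that forms a strict majority."""
    if monitor_count < 1:
        raise ValueError("a cluster needs at least one monitor")
    return monitor_count // 2 + 1


def tolerated_failures(monitor_count):
    """Monitors that can be lost while a majority can still be formed."""
    return monitor_count - quorum_size(monitor_count)


for n in (1, 3, 5, 6):
    print(f"{n} monitors: majority={quorum_size(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

Note that, under this rule, going from five to six monitors raises the
majority to four without increasing the number of failures tolerated, which is
one reason odd monitor counts are a common choice.
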
For details on configuring monitors, see the `Monitor Config Reference`_.

.. index:: architecture; high availability authentication

High Availability Authentication
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ceph clients can authenticate users with Ceph Monitors, Ceph OSD Daemons and
Ceph Metadata Servers, using Ceph's Kerberos-like ``cephx`` protocol.
Authenticated users gain authorization to read, write and execute Ceph commands.
The Cephx authentication system avoids a single point of failure to ensure
scalability and high availability. For details on Cephx and how it differs
from Kerberos, see `Ceph Authentication and Authorization`_.

.. index:: architecture; smart daemons and scalability

Smart Daemons Enable Hyperscale
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In many clustered architectures, the primary purpose of cluster membership is
so that a centralized interface knows which nodes it can access. Then the
centralized interface provides services to the client through a double
dispatch--which is a **huge** bottleneck at the petabyte-to-exabyte scale.

Ceph eliminates the bottleneck: Ceph's OSD Daemons AND Ceph Clients are cluster
aware. Like Ceph Clients, each Ceph OSD Daemon knows about other Ceph OSD
Daemons in the cluster. This enables Ceph OSD Daemons to interact directly with
other Ceph OSD Daemons and Ceph Monitors. Additionally, it enables Ceph Clients
to interact directly with Ceph OSD Daemons.

The ability of Ceph Clients, Ceph Monitors and Ceph OSD Daemons to interact with
each other means that Ceph OSD Daemons can utilize the CPU and RAM of the Ceph
nodes to easily perform tasks that would bog down a centralized server. The
ability to leverage this computing power leads to several major benefits:

#. **OSDs Service Clients Directly:** Since any network device has a limit to
   the number of concurrent connections it can support, a centralized system
   has a low physical limit at high scales. By enabling Ceph Clients to contact
   Ceph OSD Daemons directly, Ceph increases both performance and total system
   capacity simultaneously, while removing a single point of failure. Ceph
   Clients can maintain a session when they need to, and with a particular Ceph
   OSD Daemon instead of a centralized server.

#. **OSD Membership and Status**: Ceph OSD Daemons join a cluster and report
   on their status. At the lowest level, the Ceph OSD Daemon status is ``up``
   or ``down``, reflecting whether or not it is running and able to service
   Ceph Client requests. If a Ceph OSD Daemon is ``down`` and ``in`` the Ceph
   Storage Cluster, this status may indicate the failure of the Ceph OSD
   Daemon. If a Ceph OSD Daemon is not running (e.g., it crashes), the Ceph OSD
   Daemon cannot notify the Ceph Monitor that it is ``down``. The Ceph Monitor
   can ping a Ceph OSD Daemon periodically to ensure that it is running.
   However, Ceph also empowers Ceph OSD Daemons to determine if a neighboring
   OSD is ``down``, to update the cluster map and to report it to the Ceph
   Monitor(s). This means that Ceph Monitors can remain lightweight processes.
   See `Monitoring OSDs`_ and `Heartbeats`_ for additional details.

#. **Data Scrubbing:** As part of maintaining data consistency and cleanliness,
   Ceph OSD Daemons can scrub objects within placement groups. That is, Ceph
   OSD Daemons can compare object metadata in one placement group with its
   replicas in placement groups stored on other OSDs. Scrubbing (usually
   performed daily) catches bugs or filesystem errors. Ceph OSD Daemons also
   perform deeper scrubbing by comparing data in objects bit-for-bit. Deep
   scrubbing (usually performed weekly) finds bad sectors on a drive that
   weren't apparent in a light scrub. See `Data Scrubbing`_ for details on
   configuring scrubbing.

#. **Replication:** Like Ceph Clients, Ceph OSD Daemons use the CRUSH
   algorithm, but the Ceph OSD Daemon uses it to compute where replicas of
   objects should be stored (and for rebalancing). In a typical write scenario,
   a client uses the CRUSH algorithm to compute where to store an object, maps
   the object to a pool and placement group, then looks at the CRUSH map to
   identify the primary OSD for the placement group.

   The client writes the object to the identified placement group in the
   primary OSD. Then, the primary OSD, with its own copy of the CRUSH map,
   identifies the secondary and tertiary OSDs for replication purposes,
   replicates the object to the appropriate placement groups in the secondary
   and tertiary OSDs (as many OSDs as additional replicas), and responds to the
   client once it has confirmed the object was stored successfully.

.. ditaa::

 								             +----------+
 								             |  Client  |
 								             |          |
 								             +----------+
 								                 *  ^
 								      Write (1)  |  |  Ack (6)
 								                 |  |
 								                 v  *
 								            +-------------+
 								            | Primary OSD |
 								            |             |
 								            +-------------+
 								              *  ^   ^  *
 								    Write (2) |  |   |  |  Write (3)
 								       +------+  |   |  +------+
 								       |  +------+   +------+  |
 								       |  | Ack (4)  Ack (5)|  |
 								       v  *                 *  v
 								 +---------------+   +---------------+
 								 | Secondary OSD |   | Tertiary OSD  |
 								 |               |   |               |
 								 +---------------+   +---------------+
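
The write path in the diagram can be mimicked with a small simulation. This is
a sketch under simplifying assumptions, not Ceph's implementation: the ``OSD``
class and ``replicated_write`` function are hypothetical, storage is an
in-memory dictionary, and failures are not modeled. It only shows the ordering:
the primary acknowledges the client after every replica has acknowledged.

```python
class OSD:
    """Toy stand-in for a Ceph OSD Daemon; objects live in a dict."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def write(self, obj_name, data):
        self.store[obj_name] = data
        return True  # acknowledgement


def replicated_write(acting_set, obj_name, data, log):
    """Write through the primary, which replicates before acking the client."""
    primary, replicas = acting_set[0], acting_set[1:]
    log.append(f"client -> {primary.name}: write")
    primary.write(obj_name, data)

    # The primary forwards the write to every other OSD in the set.
    acks = {}
    for replica in replicas:
        log.append(f"{primary.name} -> {replica.name}: write")
        acks[replica.name] = replica.write(obj_name, data)

    # Replicas acknowledge back to the primary.
    for name, ok in acks.items():
        if ok:
            log.append(f"{name} -> {primary.name}: ack")

    # Only after all replicas have acknowledged does the client get its ack.
    if all(acks.values()):
        log.append(f"{primary.name} -> client: ack")
        return True
    return False


osds = [OSD("primary"), OSD("secondary"), OSD("tertiary")]
events = []
replicated_write(osds, "john", b"payload", events)
for step, event in enumerate(events, start=1):
    print(step, event)
```
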

With the ability to perform data replication, Ceph OSD Daemons relieve Ceph
clients from that duty, while ensuring high data availability and data safety.

Dynamic Cluster Management
--------------------------

In the `Scalability and High Availability`_ section, we explained how Ceph uses
CRUSH, cluster awareness and intelligent daemons to scale and maintain high
availability. Key to Ceph's design is the autonomous, self-healing, and
intelligent Ceph OSD Daemon. Let's take a deeper look at how CRUSH works to
enable modern cloud storage infrastructures to place data, rebalance the cluster
and recover from faults dynamically.

.. index:: architecture; pools

About Pools
~~~~~~~~~~~

The Ceph storage system supports the notion of 'Pools', which are logical
partitions for storing objects. Pools set the following parameters:

- Ownership/Access to Objects
- The Number of Object Replicas
- The Number of Placement Groups, and
- The CRUSH Ruleset to Use.

Ceph Clients retrieve a `Cluster Map`_ from a Ceph Monitor, and write objects to
pools. The pool's ``size`` or number of replicas, the CRUSH ruleset and the
number of placement groups determine how Ceph will place the data.

.. ditaa::

 								            +--------+  Retrieves  +---------------+
 								            | Client |------------>|  Cluster Map  |
 								            +--------+             +---------------+
 								                 |
 								                 v      Writes
 								              /-----\
 								              | obj |
 								              \-----/
 								                 |      To
 								                 v
 								            +--------+           +---------------+
 								            |  Pool  |---------->| CRUSH Ruleset |
 								            +--------+  Selects  +---------------+

.. index:: architecture; placement group mapping

Mapping PGs to OSDs
~~~~~~~~~~~~~~~~~~~

Each pool has a number of placement groups. CRUSH maps PGs to OSDs dynamically.
When a Ceph Client stores objects, CRUSH will map each object to a placement
group.

Mapping objects to placement groups creates a layer of indirection between the
Ceph OSD Daemon and the Ceph Client. The Ceph Storage Cluster must be able to
grow (or shrink) and rebalance where it stores objects dynamically. If the Ceph
Client "knew" which Ceph OSD Daemon had which object, that would create a tight
coupling between the Ceph Client and the Ceph OSD Daemon. Instead, the CRUSH
algorithm maps each object to a placement group and then maps each placement
group to one or more Ceph OSD Daemons. This layer of indirection allows Ceph to
rebalance dynamically when new Ceph OSD Daemons and the underlying OSD devices
come online. The following diagram depicts how CRUSH maps objects to placement
groups, and placement groups to OSDs.

.. ditaa::

 								           /-----\  /-----\  /-----\  /-----\  /-----\
 								           | obj |  | obj |  | obj |  | obj |  | obj |
 								           \-----/  \-----/  \-----/  \-----/  \-----/
 								              |        |        |        |        |
 								              +--------+--------+        +---+----+
 								              |                              |
 								              v                              v
 								   +-----------------------+      +-----------------------+
 								   |  Placement Group #1   |      |  Placement Group #2   |
 								   |                       |      |                       |
 								   +-----------------------+      +-----------------------+
 								               |                              |
 								               |      +-----------------------+---+
 								        +------+------+-------------+             |
 								        |             |             |             |
 								        v             v             v             v
 								   /----------\  /----------\  /----------\  /----------\
 								   |          |  |          |  |          |  |          |
 								   |  OSD #1  |  |  OSD #2  |  |  OSD #3  |  |  OSD #4  |
 								   |          |  |          |  |          |  |          |
 								   \----------/  \----------/  \----------/  \----------/

With a copy of the cluster map and the CRUSH algorithm, the client can compute
exactly which OSD to use when reading or writing a particular object.

.. index:: architecture; calculating PG IDs

Calculating PG IDs
~~~~~~~~~~~~~~~~~~

When a Ceph Client binds to a Ceph Monitor, it retrieves the latest copy of the
`Cluster Map`_. With the cluster map, the client knows about all of the monitors,
OSDs, and metadata servers in the cluster. **However, it doesn't know anything
about object locations.**

.. epigraph::

   Object locations get computed.

The only input required by the client is the object ID and the pool.
It's simple: Ceph stores data in named pools (e.g., "liverpool"). When a client
wants to store a named object (e.g., "john", "paul", "george", "ringo", etc.)
it calculates a placement group using the object name, a hash code, the
number of placement groups in the pool and the pool name. Ceph clients use the
following steps to compute PG IDs.

#. The client inputs the pool name and the object ID (e.g., pool = "liverpool"
   and object-id = "john").
#. CRUSH takes the object ID and hashes it.
#. CRUSH calculates the hash modulo the number of placement groups (e.g.,
   ``0x58``) to get a PG ID.
#. CRUSH gets the pool ID given the pool name (e.g., "liverpool" = ``4``).
#. CRUSH prepends the pool ID to the PG ID (e.g., ``4.0x58``).

Computing object locations is much faster than performing an object location
query over a chatty session. The :abbr:`CRUSH (Controlled Replication Under
Scalable Hashing)` algorithm allows a client to compute where objects *should*
be stored, and enables the client to contact the primary OSD to store or
retrieve the objects.

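The steps for computing a PG ID can be sketched in a few lines. This is an
illustration under stated assumptions, not Ceph's code: ``zlib.crc32`` is only
a stand-in for the hash Ceph actually uses, the plain modulo over the pool's
placement-group count (called ``pg_num`` here) is a simplification, and the
``pools`` table with pool ID ``4`` is hypothetical.

```python
import zlib


def pg_id(pool_id, object_name, pg_num):
    """Return a '<pool id>.<pg>' identifier for an object.

    crc32 is a stand-in hash; Ceph uses its own hash function.
    """
    object_hash = zlib.crc32(object_name.encode("utf-8"))
    pg = object_hash % pg_num          # hash modulo the number of PGs
    return f"{pool_id}.{pg:x}"         # prepend the pool ID, PG shown in hex


pools = {"liverpool": 4}  # hypothetical pool-name -> pool-ID lookup

for name in ("john", "paul", "george", "ringo"):
    print(name, "->", pg_id(pools["liverpool"], name, pg_num=128))
```

Because the result depends only on the object name, the pool, and the
placement-group count, every client holding the same cluster map arrives at
the same PG ID without asking a lookup service.
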
.. index:: architecture; PG Peering

Peering and Sets
~~~~~~~~~~~~~~~~

In previous sections, we noted that Ceph OSD Daemons check each other's
heartbeats and report back to the Ceph Monitor. Another thing Ceph OSD Daemons
do is called 'peering', which is the process of bringing all of the OSDs that
store a Placement Group (PG) into agreement about the state of all of the
objects (and their metadata) in that PG. In fact, Ceph OSD Daemons `Report
Peering Failure`_ to the Ceph Monitors. Peering issues usually resolve
themselves; however, if the problem persists, you may need to refer to the
`Troubleshooting Peering Failure`_ section.

.. Note:: Agreeing on the state does not mean that the PGs have the latest contents.

The Ceph Storage Cluster was designed to store at least two copies of an object
(i.e., ``size = 2``), which is the minimum requirement for data safety. For high
availability, a Ceph Storage Cluster should store more than two copies of an object
(e.g., ``size = 3`` and ``min size = 2``) so that it can continue to run in a
``degraded`` state while maintaining data safety.

Referring back to the diagram in `Smart Daemons Enable Hyperscale`_, we do not
name the Ceph OSD Daemons specifically (e.g., ``osd.0``, ``osd.1``, etc.), but
rather refer to them as *Primary*, *Secondary*, and so forth. By convention,
the *Primary* is the first OSD in the *Acting Set*. It is responsible for
coordinating the peering process for each placement group where it acts as
the *Primary*, and it is the **ONLY** OSD that will accept client-initiated
writes to objects for a given placement group where it acts as the *Primary*.

When a series of OSDs are responsible for a placement group, we refer to that
series of OSDs as an *Acting Set*. An *Acting Set* may refer to the Ceph
OSD Daemons that are currently responsible for the placement group, or the Ceph
OSD Daemons that were responsible for a particular placement group as of some
epoch.

The Ceph OSD Daemons that are part of an *Acting Set* may not always be ``up``.
When an OSD in the *Acting Set* is ``up``, it is part of the *Up Set*. The *Up
Set* is an important distinction, because Ceph can remap PGs to other Ceph OSD
Daemons when an OSD fails.

.. note:: In an *Acting Set* for a PG containing ``osd.25``, ``osd.32`` and
   ``osd.61``, the first OSD, ``osd.25``, is the *Primary*. If that OSD fails,
   the Secondary, ``osd.32``, becomes the *Primary*, and ``osd.25`` will be
   removed from the *Up Set*.

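
The relationship between the *Acting Set*, the *Up Set* and the *Primary*
described above can be expressed as a short sketch. It follows the wording of
this section only and is a simplification of what Ceph does internally; the
helper names ``up_set`` and ``primary`` and the ``status`` dictionary are
hypothetical.

```python
def up_set(acting_set, osd_up):
    """OSDs of the acting set that are currently up, in their original order."""
    return [osd for osd in acting_set if osd_up.get(osd, False)]


def primary(acting_set, osd_up):
    """First up OSD of the acting set, or None when no member is up."""
    members = up_set(acting_set, osd_up)
    return members[0] if members else None


acting = ["osd.25", "osd.32", "osd.61"]
status = {"osd.25": True, "osd.32": True, "osd.61": True}
print(primary(acting, status))   # osd.25

status["osd.25"] = False         # the first OSD fails
print(up_set(acting, status))    # ['osd.32', 'osd.61']
print(primary(acting, status))   # osd.32
```
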
.. index:: architecture; Rebalancing

Rebalancing
~~~~~~~~~~~

When you add a Ceph OSD Daemon to a Ceph Storage Cluster, the cluster map gets
updated with the new OSD. Referring back to `Calculating PG IDs`_, this changes
the cluster map. Consequently, it changes object placement, because it changes
an input for the calculations. The following diagram depicts the rebalancing
process (albeit rather crudely, since it is substantially less impactful with
large clusters) where some, but not all, of the PGs migrate from existing OSDs
(OSD 1 and OSD 2) to the new OSD (OSD 3). Even when rebalancing, CRUSH is
stable. Many of the placement groups remain in their original configuration,
and each OSD gets some added capacity, so there are no load spikes on the
new OSD after rebalancing is complete.

.. ditaa::

           +--------+     +--------+
   Before  |  OSD 1 |     |  OSD 2 |
           +--------+     +--------+
           |  PG #1 |     | PG #6  |
           |  PG #2 |     | PG #7  |
           |  PG #3 |     | PG #8  |
           |  PG #4 |     | PG #9  |
           |  PG #5 |     | PG #10 |
           +--------+     +--------+

           +--------+     +--------+     +--------+
    After  |  OSD 1 |     |  OSD 2 |     |  OSD 3 |
           +--------+     +--------+     +--------+
           |  PG #1 |     | PG #7  |     |  PG #3 |
           |  PG #2 |     | PG #8  |     |  PG #6 |
           |  PG #4 |     | PG #10 |     |  PG #9 |
           |  PG #5 |     |        |     |        |
           |        |     |        |     |        |
           +--------+     +--------+     +--------+

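
The stability property, where only some PGs move when an OSD is added, can be
demonstrated with a stand-in placement function. The sketch below uses
rendezvous (highest-random-weight) hashing, which is **not** the CRUSH
algorithm; it is chosen only because it is short and shares the property that
adding a device relocates a fraction of the PGs, all of them onto the new
device. The names ``score`` and ``place`` are hypothetical.

```python
import hashlib


def score(pg, osd):
    """Deterministic pseudo-random weight for a (PG, OSD) pair."""
    digest = hashlib.sha256(f"{pg}:{osd}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def place(pg, osds):
    """Rendezvous hashing: the OSD with the highest score holds the PG."""
    return max(osds, key=lambda osd: score(pg, osd))


pgs = range(1, 1025)
before = {pg: place(pg, ["osd.1", "osd.2"]) for pg in pgs}
after = {pg: place(pg, ["osd.1", "osd.2", "osd.3"]) for pg in pgs}

moved = [pg for pg in pgs if before[pg] != after[pg]]
print(f"{len(moved)} of {len(before)} PGs moved")
print("all moved PGs landed on osd.3:",
      all(after[pg] == "osd.3" for pg in moved))
```

With three equally weighted devices, roughly a third of the PGs should move;
the rest keep their original placement.
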
.. index:: architecture; Data Scrubbing

-												:doc: Rewrote architecture paper. Still needs some work.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2012-09-18 18:08:23 +00:00
-												doc: Updated architecture document.

fixes: #2968

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2013-05-15 00:05:43 +00:00
+								Data Consistency
 								~~~~~~~~~~~~~~~~
 								As part of maintaining data consistency and cleanliness, Ceph OSDs can also
 								scrub objects within placement groups. That is, Ceph OSDs can compare object
 								metadata in one placement group with its replicas in placement groups stored in
 								other OSDs. Scrubbing (usually performed daily) catches OSD bugs or filesystem
 								errors.  OSDs can also perform deeper scrubbing by comparing data in objects
 								bit-for-bit.  Deep scrubbing (usually performed weekly) finds bad sectors on a
 								disk that weren't apparent in a light scrub.
 								See `Data Scrubbing`_ for details on configuring scrubbing.
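
The difference between a light scrub and a deep scrub can be illustrated with a
toy model. The following Python sketch is not Ceph code: it assumes each replica
of a placement group is a plain dictionary mapping object names to object data,
uses object size as a stand-in for object metadata, and uses a SHA-256 digest as
a stand-in for a bit-for-bit comparison. The names ``light_scrub`` and
``deep_scrub`` are hypothetical.

```python
import hashlib
from typing import Dict, List

# Assumption: a replica of a placement group is modeled as {object name: data}.
Replica = Dict[str, bytes]


def light_scrub(replicas: List[Replica]) -> List[str]:
    """Return names of objects whose metadata (here: size) differs between replicas."""
    names = set().union(*replicas)
    inconsistent = []
    for name in sorted(names):
        # A missing object is recorded as None so it also counts as a mismatch.
        sizes = {len(r[name]) if name in r else None for r in replicas}
        if len(sizes) > 1:
            inconsistent.append(name)
    return inconsistent


def deep_scrub(replicas: List[Replica]) -> List[str]:
    """Return names of objects whose content digest differs between replicas."""
    names = set().union(*replicas)
    inconsistent = []
    for name in sorted(names):
        digests = {
            hashlib.sha256(r[name]).hexdigest() if name in r else None
            for r in replicas
        }
        if len(digests) > 1:
            inconsistent.append(name)
    return inconsistent


primary = {"a": b"hello", "b": b"world"}
replica = {"a": b"hello", "b": b"w0rld"}  # same size, different content

light_result = light_scrub([primary, replica])
deep_result = deep_scrub([primary, replica])
```

In this example the second replica holds a corrupted copy of ``b`` whose size is
unchanged, so the size-based check passes while the digest-based check reports
the object.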

 								.. index:: Extensibility, Ceph Classes

 								Extending Ceph
 								--------------
 								You can extend Ceph by creating shared object classes called 'Ceph Classes'.
 								Ceph loads ``.so`` classes stored in the ``osd class dir`` directory dynamically
 								(i.e., ``$libdir/rados-classes`` by default). When you implement a class, you
 								can create new object methods that have the ability to call the native methods
 								in the Ceph Object Store, or other class methods you incorporate via libraries
 								or create yourself.
 								On writes, Ceph Classes can call native or class methods, perform any series of
 								operations on the inbound data and generate a resulting write transaction  that
 								Ceph will apply atomically.
 								On reads, Ceph Classes can call native or class methods, perform any series of
 								operations on the outbound data and return the data to the client.
 								.. topic:: Ceph Class Example
 								   A Ceph class for a content management system that presents pictures of a
 								   particular size and aspect ratio could take an inbound bitmap image, crop it
 								   to a particular aspect ratio, resize it and embed an invisible copyright or
 								   watermark to help protect the intellectual property; then, save the
 								   resulting bitmap image to the object store.
 								See ``src/objclass/objclass.h``, ``src/fooclass.cc`` and ``src/barclass`` for
 								exemplary implementations.
 								Summary
 								-------
 								Ceph Storage Clusters are dynamic--like a living organism. Whereas many storage
 								appliances do not fully utilize the CPU and RAM of a typical commodity server,
 								Ceph does. From heartbeats, to peering, to rebalancing the cluster or
 								recovering from faults, Ceph offloads work from clients (and from a centralized
 								gateway, which doesn't exist in the Ceph architecture) and uses the computing
 								power of the OSDs to perform the work. When referring to `Hardware
 								Recommendations`_ and the `Network Config Reference`_, be cognizant of the
 								foregoing concepts to understand how Ceph utilizes computing resources.

 								.. index:: Ceph Protocol, librados

 								Ceph Protocol
 								=============
 								Ceph Clients use the native protocol for interacting with the Ceph Storage
 								Cluster. Ceph packages this functionality into the ``librados`` library so that
 								you can create your own custom Ceph Clients. The following diagram depicts the
 								basic architecture.
 								.. ditaa::
 								            +---------------------------------+
 								            |  Ceph Storage Cluster Protocol  |
 								            |           (librados)            |
 								            +---------------------------------+
 								            +---------------+ +---------------+
 								            |      OSDs     | |    Monitors   |
 								            +---------------+ +---------------+
 								Native Protocol and ``librados``
 								--------------------------------
 								Modern applications need a simple object storage interface with asynchronous
 								communication capability, and the Ceph Storage Cluster provides one. The
 								interface provides direct, parallel access to objects throughout the cluster,
 								including the following operations:

 								- Pool Operations
 								- Snapshots and Copy-on-write Cloning
 								- Read/Write Objects

 								  - Create or Remove
 								  - Entire Object or Byte Range
 								  - Append or Truncate

 								- Create/Set/Get/Remove XATTRs
 								- Create/Set/Get/Remove Key/Value Pairs
 								- Compound operations and dual-ack semantics
 								- Object Classes


 								.. index:: architecture; watch/notify

 								Object Watch/Notify
 								-------------------
 								A client can register a persistent interest with an object and keep a session to
 								the primary OSD open. The client can send a notification message and payload to
 								all watchers and receive notification when the watchers receive the
 								notification. This enables a client to use any object as a
 								synchronization/communication channel.
 								.. ditaa:: +----------+     +----------+     +----------+     +---------------+
 								           | Client 1 |     | Client 2 |     | Client 3 |     | OSD:Object ID |
 								           +----------+     +----------+     +----------+     +---------------+
 								                 |                |                |                  |
 								                 |                |                |                  |
 								                 |                |  Watch Object  |                  |
 								                 |--------------------------------------------------->|
 								                 |                |                |                  |
 								                 |<---------------------------------------------------|
 								                 |                |   Ack/Commit   |                  |
 								                 |                |                |                  |
 								                 |                |  Watch Object  |                  |
 								                 |                |---------------------------------->|
 								                 |                |                |                  |
 								                 |                |<----------------------------------|
 								                 |                |   Ack/Commit   |                  |
 								                 |                |                |   Watch Object   |
 								                 |                |                |----------------->|
 								                 |                |                |                  |
 								                 |                |                |<-----------------|
 								                 |                |                |    Ack/Commit    |
 								                 |                |     Notify     |                  |
 								                 |--------------------------------------------------->|
 								                 |                |                |                  |
 								                 |<---------------------------------------------------|
 								                 |                |     Notify     |                  |
 								                 |                |                |                  |
 								                 |                |<----------------------------------|
 								                 |                |     Notify     |                  |
 								                 |                |                |<-----------------|
 								                 |                |                |      Notify      |
 								                 |                |       Ack      |                  |
 								                 |----------------+---------------------------------->|
 								                 |                |                |                  |
 								                 |                |       Ack      |                  |
 								                 |                +---------------------------------->|
 								                 |                |                |                  |
 								                 |                |                |        Ack       |
 								                 |                |                |----------------->|
 								                 |                |                |                  |
 								                 |<---------------+----------------+------------------|
 								                 |                     Complete
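
The message flow in the diagram can be mimicked with a small in-memory model.
The sketch below is an illustration only and does not use ``librados``: the
``WatchedObject`` class, its ``watch`` and ``notify`` methods, and the client
names are all hypothetical. It assumes that a notification completes once every
registered watcher has received the payload.

```python
from typing import Callable, Dict, List


class WatchedObject:
    """Toy stand-in for an object on its primary OSD (not a librados type)."""

    def __init__(self, object_id: str) -> None:
        self.object_id = object_id
        self._watchers: Dict[str, Callable[[bytes], None]] = {}

    def watch(self, client: str, callback: Callable[[bytes], None]) -> str:
        """Register a persistent interest; the return value models Ack/Commit."""
        self._watchers[client] = callback
        return "ack"

    def notify(self, payload: bytes) -> List[str]:
        """Deliver the payload to every watcher and collect their acks."""
        acked = []
        for client, callback in self._watchers.items():
            callback(payload)      # Notify
            acked.append(client)   # Ack
        return acked               # Complete: all watchers have responded


received: Dict[str, List[bytes]] = {}
obj = WatchedObject("object-id")

for client in ("client 1", "client 2", "client 3"):
    # Bind the client name as a default argument so each callback keeps its own.
    obj.watch(client, lambda payload, c=client: received.setdefault(c, []).append(payload))

acked = obj.notify(b"config changed")
```

After ``notify`` returns, every watcher (including the notifying client, which
is also a watcher in the diagram) has seen the payload, which is what lets an
object act as a simple synchronization channel.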

 								.. index:: architecture; Striping

 								Data Striping
 								-------------
 								Storage devices have throughput limitations, which impact performance and
 								scalability. So storage systems often support `striping`_--storing sequential
 								pieces of information across multiple storage devices--to increase
 								throughput and performance. The most common form of data striping comes from
 								`RAID`_. The RAID type most similar to Ceph's striping is `RAID 0`_, or a
 								'striped volume.' Ceph's striping offers the throughput of RAID 0 striping,
 								the reliability of n-way RAID mirroring and faster recovery.

 								Ceph provides three types of clients: Ceph Block Device, Ceph Filesystem, and
 								Ceph Object Storage. A Ceph Client converts its data from the representation
 								format it provides to its users (a block device image, RESTful objects, CephFS
 								filesystem directories) into objects for storage in the Ceph Storage Cluster.

 								.. tip:: The objects Ceph stores in the Ceph Storage Cluster are not striped.
 								   Ceph Object Storage, Ceph Block Device, and the Ceph Filesystem stripe their
 								   data over multiple Ceph Storage Cluster objects. Ceph Clients that write
 								   directly to the Ceph Storage Cluster via ``librados`` must perform the
 								   striping (and parallel I/O) for themselves to obtain these benefits.

 								The simplest Ceph striping format involves a stripe count of 1 object. Ceph
 								Clients write stripe units to a Ceph Storage Cluster object until the object is
 								at its maximum capacity, and then create another object for additional stripes
 								of data. The simplest form of striping may be sufficient for small block device
 								images, S3 or Swift objects and CephFS files. However, this simple form doesn't
 								take maximum advantage of Ceph's ability to distribute data across placement
 								groups, and consequently doesn't improve performance very much. The following
 								diagram depicts the simplest form of striping:
 								.. ditaa::
 								                        +---------------+
 								                        |  Client Data  |
 								                        |     Format    |
 								                        | cCCC          |
 								                        +---------------+
 								                                |
 								                       +--------+-------+
 								                       |                |
 								                       v                v
 								                 /-----------\    /-----------\
 								                 | Begin cCCC|    | Begin cCCC|
 								                 | Object  0 |    | Object  1 |
 								                 +-----------+    +-----------+
 								                 |  stripe   |    |  stripe   |
 								                 |  unit 1   |    |  unit 5   |
 								                 +-----------+    +-----------+
 								                 |  stripe   |    |  stripe   |
 								                 |  unit 2   |    |  unit 6   |
 								                 +-----------+    +-----------+
 								                 |  stripe   |    |  stripe   |
 								                 |  unit 3   |    |  unit 7   |
 								                 +-----------+    +-----------+
 								                 |  stripe   |    |  stripe   |
 								                 |  unit 4   |    |  unit 8   |
 								                 +-----------+    +-----------+
 								                 | End cCCC  |    | End cCCC  |
 								                 | Object 0  |    | Object 1  |
 								                 \-----------/    \-----------/
 								If you anticipate large image sizes, large S3 or Swift objects (e.g., video),
 								or large CephFS directories, you may see considerable read/write performance
 								improvements by striping client data over multiple objects within an object set.
 								The most significant write performance gain occurs when the client writes the
 								stripe units to their corresponding objects in parallel. Since objects get
 								mapped to different placement groups and further mapped to different OSDs, each
 								write occurs in parallel at the maximum write speed. A write to a single disk
 								would be limited by the head movement (e.g. 6ms per seek) and the bandwidth of
 								that one device (measured in MB/s). By spreading that write over multiple
 								objects (which map to different placement groups and OSDs) Ceph can reduce the
 								number of seeks per drive and combine the throughput of multiple drives to
 								achieve much faster write (or read) speeds.

 								.. note:: Striping is independent of object replicas. Since CRUSH
 								   replicates objects across OSDs, stripes get replicated automatically.

 								In the following diagram, client data gets striped across an object set
 								(``object set 1`` in the following diagram) consisting of 4 objects, where the
 								first stripe unit is ``stripe unit 0`` in ``object 0``, and the fourth stripe
 								unit is ``stripe unit 3`` in ``object 3``. After writing the fourth stripe unit,
 								the client determines if the object set is full. If the object set is not full,
 								the client begins writing a stripe to the first object again (``object 0`` in
 								the following diagram). If the object set is full, the client creates a new
 								object set (``object set 2`` in the following diagram), and begins writing to
 								the first stripe (``stripe unit 16``) in the first object in the new object set
 								(``object 4`` in the diagram below).

 								.. ditaa::
 								                          +---------------+
 								                          |  Client Data  |
 								                          |     Format    |
 								                          | cCCC          |
 								                          +---------------+
 								                                  |
 								       +-----------------+--------+--------+-----------------+
 								       |                 |                 |                 |     +--\
 								       v                 v                 v                 v        |
 								 /-----------\     /-----------\     /-----------\     /-----------\  |
 								 | Begin cCCC|     | Begin cCCC|     | Begin cCCC|     | Begin cCCC|  |
 								 | Object 0  |     | Object  1 |     | Object  2 |     | Object  3 |  |
 								 +-----------+     +-----------+     +-----------+     +-----------+  |
 								 |  stripe   |     |  stripe   |     |  stripe   |     |  stripe   |  |
 								 |  unit 0   |     |  unit 1   |     |  unit 2   |     |  unit 3   |  |
 								 +-----------+     +-----------+     +-----------+     +-----------+  |
 								 |  stripe   |     |  stripe   |     |  stripe   |     |  stripe   |  +-\
 								 |  unit 4   |     |  unit 5   |     |  unit 6   |     |  unit 7   |    | Object
 								 +-----------+     +-----------+     +-----------+     +-----------+    +- Set
 								 |  stripe   |     |  stripe   |     |  stripe   |     |  stripe   |    |   1
 								 |  unit 8   |     |  unit 9   |     |  unit 10  |     |  unit 11  |  +-/
 								 +-----------+     +-----------+     +-----------+     +-----------+  |
 								 |  stripe   |     |  stripe   |     |  stripe   |     |  stripe   |  |
 								 |  unit 12  |     |  unit 13  |     |  unit 14  |     |  unit 15  |  |
 								 +-----------+     +-----------+     +-----------+     +-----------+  |
 								 | End cCCC  |     | End cCCC  |     | End cCCC  |     | End cCCC  |  |
 								 | Object 0  |     | Object 1  |     | Object 2  |     | Object 3  |  |
 								 \-----------/     \-----------/     \-----------/     \-----------/  |
 								                                                                      |
 								                                                                   +--/
 								                                                                   +--\
 								                                                                      |
 								 /-----------\     /-----------\     /-----------\     /-----------\  |
 								 | Begin cCCC|     | Begin cCCC|     | Begin cCCC|     | Begin cCCC|  |
 								 | Object  4 |     | Object  5 |     | Object  6 |     | Object  7 |  |
 								 +-----------+     +-----------+     +-----------+     +-----------+  |
 								 |  stripe   |     |  stripe   |     |  stripe   |     |  stripe   |  |
 								 |  unit 16  |     |  unit 17  |     |  unit 18  |     |  unit 19  |  |
 								 +-----------+     +-----------+     +-----------+     +-----------+  |
 								 |  stripe   |     |  stripe   |     |  stripe   |     |  stripe   |  +-\
 								 |  unit 20  |     |  unit 21  |     |  unit 22  |     |  unit 23  |    | Object
 								 +-----------+     +-----------+     +-----------+     +-----------+    +- Set
 								 |  stripe   |     |  stripe   |     |  stripe   |     |  stripe   |    |   2
 								 |  unit 24  |     |  unit 25  |     |  unit 26  |     |  unit 27  |  +-/
 								 +-----------+     +-----------+     +-----------+     +-----------+  |
 								 |  stripe   |     |  stripe   |     |  stripe   |     |  stripe   |  |
 								 |  unit 28  |     |  unit 29  |     |  unit 30  |     |  unit 31  |  |
 								 +-----------+     +-----------+     +-----------+     +-----------+  |
 								 | End cCCC  |     | End cCCC  |     | End cCCC  |     | End cCCC  |  |
 								 | Object 4  |     | Object 5  |     | Object 6  |     | Object 7  |  |
 								 \-----------/     \-----------/     \-----------/     \-----------/  |
 								                                                                      |
 								                                                                   +--/
 								Three important variables determine how Ceph stripes data:

 								- **Object Size:** Objects in the Ceph Storage Cluster have a maximum
 								  configurable size (e.g., 2MB, 4MB, etc.). The object size should be large
 								  enough to accommodate many stripe units, and should be a multiple of
 								  the stripe unit.

 								- **Stripe Width:** Stripes have a configurable unit size (e.g., 64KB).
 								  The Ceph Client divides the data it will write to objects into equally
 								  sized stripe units, except for the last stripe unit. A stripe width
 								  should be a fraction of the Object Size so that an object may contain
 								  many stripe units.

 								- **Stripe Count:** The Ceph Client writes a sequence of stripe units
 								  over a series of objects determined by the stripe count. The series
 								  of objects is called an object set. After the Ceph Client writes to
 								  the last object in the object set, it returns to the first object in
 								  the object set.

 								.. important:: Test the performance of your striping configuration before
 								   putting your cluster into production. You CANNOT change these striping
 								   parameters after you stripe the data and write it to objects.
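
The way the three variables interact can be worked through in a few lines of
arithmetic. The following Python sketch is an illustration of the layout drawn
in the diagrams above, not code taken from Ceph: the function name ``locate``
and the ``StripeLocation`` tuple are hypothetical, all indices are 0-based (so
``object set 1`` in the diagram is index ``0`` here), and the example sizes are
chosen only so that each object holds four stripe units as in the diagram.

```python
from typing import NamedTuple


class StripeLocation(NamedTuple):
    object_set: int     # 0-based index of the object set
    object_number: int  # 0-based object index, numbered as in the diagram
    stripe_unit: int    # 0-based stripe unit index, numbered as in the diagram
    object_offset: int  # byte offset inside the object


def locate(offset: int, object_size: int, stripe_unit: int, stripe_count: int) -> StripeLocation:
    """Map a byte offset in the client data to its place in the striped layout."""
    if object_size % stripe_unit != 0:
        raise ValueError("object size should be a multiple of the stripe unit")

    units_per_object = object_size // stripe_unit
    unit_index = offset // stripe_unit              # which stripe unit overall
    stripe_index = unit_index // stripe_count       # which row across the object set
    position_in_set = unit_index % stripe_count     # which object within the set
    object_set = stripe_index // units_per_object   # set is full after this many rows
    row_in_object = stripe_index % units_per_object

    object_number = object_set * stripe_count + position_in_set
    object_offset = row_in_object * stripe_unit + offset % stripe_unit
    return StripeLocation(object_set, object_number, unit_index, object_offset)


# Example sizes (assumptions): 64 KiB stripe units, 4 units per object, 4 objects per set.
STRIPE_UNIT = 64 * 1024
OBJECT_SIZE = 4 * STRIPE_UNIT
STRIPE_COUNT = 4

first_unit_of_second_set = locate(16 * STRIPE_UNIT, OBJECT_SIZE, STRIPE_UNIT, STRIPE_COUNT)
inside_unit_seven = locate(7 * STRIPE_UNIT + 10, OBJECT_SIZE, STRIPE_UNIT, STRIPE_COUNT)
```

With these values, stripe unit 16 lands at the start of object 4 in the second
object set, and a byte inside stripe unit 7 lands in the second row of object 3,
matching the diagram. Setting ``stripe_count`` to 1 reproduces the simplest
striping form, in which one object is filled before the next is created.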

 								Once the Ceph Client has striped data to stripe units and mapped the stripe
 								units to objects, Ceph's CRUSH algorithm maps the objects to placement groups,
 								and the placement groups to Ceph OSD Daemons before the objects are stored as
 								files on a storage disk.

 								.. note:: Since a client writes to a single pool, all data striped into objects
 								   gets mapped to placement groups in the same pool. So the objects use the same
 								   CRUSH map and the same access controls.


 								.. index:: architecture; Ceph Clients

 								Ceph Clients
 								============
-												doc: Added a striping section for Architecture.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2012-12-04 04:48:02 +00:00
-												doc: Updated architecture document.

fixes: #2968

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2013-05-15 00:05:43 +00:00
+								Ceph Clients include a number of service interfaces. These include:
-												doc: Added a striping section for Architecture.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2012-12-04 04:48:02 +00:00
-												doc: Updated architecture document.

fixes: #2968

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2013-05-15 00:05:43 +00:00
+								- **Block Devices:** The :term:`Ceph Block Device` (a.k.a., RBD) service
 								  provides resizable, thin-provisioned block devices with snapshotting and
 								  cloning. Ceph stripes a block device across the cluster for high
 								  performance. Ceph supports both kernel objects (KO) and a QEMU hypervisor
 								  that uses ``librbd`` directly--avoiding the kernel object overhead for
 								  virtualized systems.
-												doc: Added a striping section for Architecture.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2012-12-04 04:48:02 +00:00
-												doc: Updated architecture document.

fixes: #2968

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

											
										
										
											2013-05-15 00:05:43 +00:00
+								- **Object Storage:** The :term:`Ceph Object Storage` (a.k.a., RGW) service
 								  provides RESTful APIs with interfaces that are compatible with Amazon S3
 								  and OpenStack Swift.
 								- **Filesystem**: The :term:`Ceph Filesystem` (CephFS) service provides
 								  a POSIX compliant filesystem usable with ``mount`` or as
 								  a filesytem in user space (FUSE).
-												doc: Added a striping section for Architecture.

Ceph can run additional instances of OSDs, MDSs, and monitors for scalability
and high availability. The following diagram depicts the high-level
architecture.

.. ditaa::
            +--------------+  +----------------+  +-------------+
            | Block Device |  | Object Storage |  |   Ceph FS   |
            +--------------+  +----------------+  +-------------+

            +--------------+  +----------------+  +-------------+
            |    librbd    |  |     librgw     |  |  libcephfs  |
            +--------------+  +----------------+  +-------------+

            +---------------------------------------------------+
            |      Ceph Storage Cluster Protocol (librados)     |
            +---------------------------------------------------+

            +---------------+ +---------------+ +---------------+
            |      OSDs     | |      MDSs     | |    Monitors   |
            +---------------+ +---------------+ +---------------+

.. index:: architecture; Ceph Object Storage

Ceph Object Storage
-------------------

The Ceph Object Storage daemon, ``radosgw``, is a FastCGI service that provides
a RESTful_ HTTP API to store objects and metadata. It layers on top of the Ceph
Storage Cluster with its own data formats, and maintains its own user database,
authentication, and access control. The RADOS Gateway uses a unified namespace,
which means you can use either the OpenStack Swift-compatible API or the Amazon
S3-compatible API. For example, you can write data using the S3-compatible API
with one application and then read data using the Swift-compatible API with
another application.

.. topic:: S3/Swift Objects and Storage Cluster Objects Compared

   Ceph's Object Storage uses the term *object* to describe the data it stores.
   S3 and Swift objects are not the same as the objects that Ceph writes to the
   Ceph Storage Cluster. Ceph Object Storage objects are mapped to Ceph Storage
   Cluster objects. The S3 and Swift objects do not necessarily
   correspond in a 1:1 manner with an object stored in the storage cluster. It
   is possible for an S3 or Swift object to map to multiple Ceph objects.

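The one-to-many mapping between an S3 or Swift object and storage cluster
objects can be pictured with a small Python sketch. This is an illustration
only: the function name, the ``name.index`` naming scheme, and the tiny chunk
size are hypothetical and do not reflect how ``radosgw`` actually names or
sizes the objects it writes.

```python
def split_into_cluster_objects(name, payload, chunk_size=4):
    """Map one gateway-level object to one or more cluster-level objects.

    The naming scheme and chunk size are hypothetical; they only show that
    the mapping need not be 1:1.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Cut the payload into fixed-size pieces; an empty payload still
    # yields a single (empty) cluster object.
    chunks = [payload[i:i + chunk_size]
              for i in range(0, len(payload), chunk_size)] or [b""]
    return {f"{name}.{index}": chunk for index, chunk in enumerate(chunks)}


# A small object fits in one cluster object; a larger one spans several.
small = split_into_cluster_objects("note", b"abc")
large = split_into_cluster_objects("photo", b"abcdefghij")
```

In this toy model ``small`` holds one entry and ``large`` holds three, which
mirrors the statement that an S3 or Swift object may map to multiple Ceph
objects.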

See `Ceph Object Storage`_ for details.

.. index:: Ceph Block Device; block device; RBD; Rados Block Device

Ceph Block Device
-----------------

A Ceph Block Device stripes a block device image over multiple objects in the
Ceph Storage Cluster, where each object gets mapped to a placement group and
distributed, and the placement groups are spread across separate ``ceph-osd``
daemons throughout the cluster.

.. important:: Striping allows RBD block devices to perform better than a single
   server could!

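As a rough illustration of how a block device image spreads over objects, the
following Python sketch maps a byte offset in an image to an object index and
an offset within that object. It assumes the simplest layout, in which the
image is cut into consecutive fixed-size objects; the 4 MiB object size and
the function names are assumptions made for this example, not values taken
from this document.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size in bytes (4 MiB)


def locate(offset, object_size=OBJECT_SIZE):
    """Return (object index, offset inside that object) for an image offset."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, object_size)


def objects_for_range(offset, length, object_size=OBJECT_SIZE):
    """List the object indexes touched by an I/O of `length` bytes."""
    if length <= 0:
        return []
    first = offset // object_size
    last = (offset + length - 1) // object_size
    return list(range(first, last + 1))


# A 10 MiB write starting at offset 0 touches three 4 MiB objects.
touched = objects_for_range(0, 10 * 1024 * 1024)
```

Because each object lands in a placement group, and placement groups are
spread over different ``ceph-osd`` daemons, a large I/O such as the one above
can be served by several daemons rather than one.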

Thin-provisioned, snapshottable Ceph Block Devices are an attractive option for
virtualization and cloud computing. In virtual machine scenarios, people
typically deploy a Ceph Block Device with the ``rbd`` network storage driver in
QEMU/KVM, where the host machine uses ``librbd`` to provide a block device
service to the guest. Many cloud computing stacks use ``libvirt`` to integrate
with hypervisors. You can use thin-provisioned Ceph Block Devices with QEMU and
``libvirt`` to support OpenStack and CloudStack among other solutions.

While we do not provide ``librbd`` support with other hypervisors at this time,
you may also use Ceph Block Device kernel objects to provide a block device to a
client. Other virtualization technologies such as Xen can access the Ceph Block
Device kernel object(s). This is done with the command-line tool ``rbd``.

.. index:: Ceph FS; Ceph Filesystem; libcephfs; MDS; metadata server; ceph-mds

Ceph Filesystem
---------------

The Ceph Filesystem (Ceph FS) provides a POSIX-compliant filesystem as a
service that is layered on top of the object-based Ceph Storage Cluster.
Ceph FS files get mapped to objects that Ceph stores in the Ceph Storage
Cluster. Ceph Clients mount a CephFS filesystem as a kernel object or as
a Filesystem in User Space (FUSE).

.. ditaa::
            +-----------------------+  +------------------------+
            | CephFS Kernel Object  |  |      CephFS FUSE       |
            +-----------------------+  +------------------------+

            +---------------------------------------------------+
            |            Ceph FS Library (libcephfs)            |
            +---------------------------------------------------+

            +---------------------------------------------------+
            |      Ceph Storage Cluster Protocol (librados)     |
            +---------------------------------------------------+

            +---------------+ +---------------+ +---------------+
            |      OSDs     | |      MDSs     | |    Monitors   |
            +---------------+ +---------------+ +---------------+

The Ceph Filesystem service includes the Ceph Metadata Server (MDS) deployed
with the Ceph Storage Cluster. The purpose of the MDS is to store all the
filesystem metadata (directories, file ownership, access modes, etc.) in
high-availability Ceph Metadata Servers where the metadata resides in memory.
The reason for the MDS (a daemon called ``ceph-mds``) is that simple filesystem
operations like listing a directory or changing a directory (``ls``, ``cd``)
would tax the Ceph OSD Daemons unnecessarily. So separating the metadata from
the data means that the Ceph Filesystem can provide high performance services
without taxing the Ceph Storage Cluster.

Ceph FS separates the metadata from the data, storing the metadata in the MDS,
and storing the file data in one or more objects in the Ceph Storage Cluster.
The Ceph Filesystem aims for POSIX compatibility. ``ceph-mds`` can run as a
single process, or it can be distributed out to multiple physical machines,
either for high availability or for scalability.

- **High Availability**: The extra ``ceph-mds`` instances can be `standby`,
  ready to take over the duties of any failed ``ceph-mds`` that was
  `active`. This is easy because all the data, including the journal, is
  stored on RADOS. The transition is triggered automatically by ``ceph-mon``.

- **Scalability**: Multiple ``ceph-mds`` instances can be `active`, and they
  will split the directory tree into subtrees (and shards of a single
  busy directory), effectively balancing the load amongst all `active`
  servers.

Combinations of `standby` and `active` instances are possible; for example,
you can run three `active` ``ceph-mds`` instances for scaling and one `standby`
instance for high availability.

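The interplay of `active` and `standby` instances can be modelled with a toy
Python sketch. It is not how ``ceph-mon`` implements the transition; the
function name and the daemon names (``mds.a`` and so on) are hypothetical and
serve only to show a standby instance taking the place of a failed active one.

```python
def fail_over(active, standby, failed):
    """Return new (active, standby) lists after `failed` stops responding.

    Toy model: the first standby daemon takes the failed daemon's place.
    If no standby is available, the active set simply shrinks.
    """
    if failed not in active:
        raise ValueError(f"{failed} is not an active daemon")
    if not standby:
        return [d for d in active if d != failed], []
    replacement, rest = standby[0], standby[1:]
    # Keep the position of the failed daemon so the others are undisturbed.
    return [replacement if d == failed else d for d in active], rest


# Three active instances for scaling, one standby for high availability.
active, standby = fail_over(["mds.a", "mds.b", "mds.c"], ["mds.d"], "mds.b")
```

After the call, ``mds.d`` serves in place of ``mds.b`` and no standby remains,
so a second failure in this model would reduce the number of active
instances.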

.. _RADOS - A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters: http://ceph.com/papers/weil-rados-pdsw07.pdf
.. _Paxos: http://en.wikipedia.org/wiki/Paxos_(computer_science)
.. _Monitor Config Reference: ../rados/configuration/mon-config-ref
.. _Monitoring OSDs and PGs: ../rados/operations/monitoring-osd-pg
.. _Heartbeats: ../rados/configuration/mon-osd-interaction
.. _Monitoring OSDs: ../rados/operations/monitoring-osd-pg/#monitoring-osds
.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: http://ceph.com/papers/weil-crush-sc06.pdf
.. _Data Scrubbing: ../rados/configuration/osd-config-ref#scrubbing
.. _Report Peering Failure: ../rados/configuration/mon-osd-interaction#osds-report-peering-failure
.. _Troubleshooting Peering Failure: ../rados/troubleshooting/troubleshooting-pg#placement-group-down-peering-failure
.. _Ceph Authentication and Authorization: ../rados/operations/auth-intro/
.. _Hardware Recommendations: ../install/hardware-recommendations
.. _Network Config Reference: ../rados/configuration/network-config-ref
.. _striping: http://en.wikipedia.org/wiki/Data_striping
.. _RAID: http://en.wikipedia.org/wiki/RAID
.. _RAID 0: http://en.wikipedia.org/wiki/RAID_0#RAID_0
.. _Ceph Object Storage: ../radosgw/
.. _RESTful: http://en.wikipedia.org/wiki/RESTful