mirror of
https://github.com/ceph/ceph
synced 2024-12-13 15:08:33 +00:00
1bfa322a59
git-svn-id: https://ceph.svn.sf.net/svnroot/ceph@1025 29311d96-e01e-0410-9327-a35deaab8ce9
52 lines
6.7 KiB
Plaintext
52 lines
6.7 KiB
Plaintext
|
|
<div class="mainsegment">
|
|
<h3>Ceph Overview -- What is it?</h3>
|
|
<div>
|
|
Ceph is a scalable distributed network file system that provides both excellent performance and reliability. Like other network file systems like NFS and CIFS, clients require only a network connection to mount and use the file system. Unlike NFS and CIFS, however, Ceph clients can communicate directly with storage nodes (which we call OSDs) instead of a single "server" (something that limits the scalability of NFS and CIFS installations). In that sense, Ceph resembles "cluster" file systems based on SANs (storage area networks) and FC (fibre-channel) or iSCSI. The main difference is that FC and iSCSI is a block-level protocols to communicate with dumb, passive disks; Ceph OSDs are intelligent storage nodes, and unlike FC, all communication is over TCP.
|
|
<p>
|
|
Ceph's intelligent storage nodes (basically, storage servers running software to serve "objects" instead of files) facilitate improved scalability and parallelism. NFS servers (i.e. NAS devices) and cluster file systems funnel all I/O through a single (or limited set of) servers, limiting scalability. Ceph clients interact with one of a limited set of (perhaps dozens or hundreds of) metadata servers (MDSs) for high-level operations like open(), but communicate directly with storage (termed OSDs) for I/O, of which there may be thousands.
|
|
<p>
|
|
There are a handful of new file systems and enterprise storage products adopting a similar object- or brick-based architecture, including Lustre (also open-source, but with restricted access to development) and the Panasas file system (a commercial storage product). Ceph is different:
|
|
<ul>
|
|
<li><b>Open source, open development.</b> We're hosted on SourceForge, and are actively looking for interested users and developers.
|
|
<li><b>Scalability.</b> Ceph sheds legacy file system design principles like explicit allocation tables that are still found in almost all other file systems (including Lustre and the Panasas file system) that ultimately limit scalability.
|
|
<li><b>Commodity hardware.</b> Ceph is designed to run on commodity hardware running Linux (or any other POSIX-ish Unix variant). (Lustre relies on a SAN or other shared storage failover to make storage nodes reliable, while Panasas is based on custom hardware using integrated UPSs.)
|
|
</ul>
|
|
In additional to promising greater scalability than existing solutions, Ceph also promises to fill the huge gap between open-source filesystems and commercial enterprise systems. If you want network-attached storage without shelling out the big bucks, your are usually stuck with NFS and a direct-attached RAID. Technologies like ATA-over-ethernet and iSCSI help scale raw volume sizes, but the lack of "cluster-aware" open-source file systems still limit one to a single NFS "server" that limits scalability.
|
|
<p>
|
|
Ceph fills this gap by providing a scalable, reliable file system that can seamlessly grom from gigabytes to petabytes. Moreover, Ceph will provide efficient snapshots, which almost no freely available file system (besides ZFS on Solaris) provides, despite snapshots having become almost ubiquitous in enterprise systems.
|
|
</div>
|
|
|
|
<h3>Ceph Architecture</h3>
|
|
<div>
|
|
<center><img src="images/ceph-architecture.png"></center>
|
|
<p>
|
|
A Ceph installation consists of three main elements: clients, metadata servers (MDSs), and object storage devices (OSDs). Ceph clients can either be individual processes linking directly to a user-space client library, or a host mounting the Ceph file system natively (ala NFS). OSDs are servers with attached disks and are responsible for storing data.
|
|
<p>
|
|
The Ceph architecture is based on three key design principles that set it apart from traditional file systems.
|
|
|
|
<ol>
|
|
<li><b>Separation of metadata and data management.</b><br>
|
|
A small set of metadata servers (MDSs) manage the file system hierarchy (namespace). Clients communicate with an MDS to open/close files, get directory listings, remove files, or any other operations that involve file names. Once a file is opened, clients communicate directly with OSDs (object-storage devices) to read and write data. A large Ceph system may involve anywhere from one to many dozens (or possibly hundreds) of MDSs, and anywhere from four to hundreds or thousands of OSDs.
|
|
<p>
|
|
Both file data and file system metadata are striped over multiple <i>objects</i>, each of which is replicated on multiple OSDs for reliability. A special-purpose mapping function called <b>CRUSH</b> is used to determine which OSDs store which objects. CRUSH resembles a hash function in that this mapping is pseudo-random (it appears random, but is actually deterministic). This provides load balancing across all devices that is relatively invulnerable to "hot spots," while Ceph's policy of redistributing data ensures that workload remains balanced and all devices are equally utilized even when the storage cluster is expanded or OSDs are removed.
|
|
|
|
<p>
|
|
|
|
<li><b>Intelligent storage devices</b><br>
|
|
Each Ceph OSD stores variably-sized, named <i>objects</i>. (In contract, conventional file systems are built directly on top of raw hard disks that store small fixed-sized numbered <i>blocks</i>.) In contract, Ceph OSDs can be built with conventional server hardware with attached storage (either raw disks or a small RAID). Each Ceph OSD runs a special-purpose "object file system" called <b>EBOFS</b> designed to efficiently and reliable store variably sized objects.
|
|
<p>
|
|
More importantly, Ceph OSDs are intelligent. Collectively, the OSD cluster manages data replication, data migration (when the cluster composition changes due to expansion, failures, etc.), failure detection (OSDs actively monitor their peers), and failure recovery. We call the collective cluster <b>RADOS</b>--Reliable Autonomic Distributed Object Store--because it provides the illusion of a single logical object store while hiding the details of data distribution, replication, and failure recovery.
|
|
<p>
|
|
|
|
<li><b>Dynamic distributed metadata</b><br>
|
|
Ceph dynamically distributes responsibility for managing the file system directory hierarchy over tens or even hundreds of MDSs. Because Ceph embeds most inodes directly within the directory that contains them, the hierarchical partitional allows each MDS to operation independently and efficiently. This distribution is entirely adaptive, based on the current workload, allowing the cluster to redistribute the hierarchy to balance load as client access patterns change over time. Ceph also copes with metadata hot spots: popular metadata is replicated across multiple MDS nodes, and extremely large directories can be distributed across the entire cluster when necessary.
|
|
|
|
</ol>
|
|
|
|
|
|
</div>
|
|
|
|
</div>
|
|
|