Ceph Overview -- What is it?
Ceph is a scalable distributed network file system that provides both excellent performance and reliability. Like network file protocols such as NFS and CIFS, clients require only a network connection to mount and use the file system. Unlike NFS and CIFS, however, Ceph clients can communicate directly with storage nodes (which we call OSDs) instead of with a single "server" (something that limits the scalability of installations using NFS and CIFS). In that sense, Ceph resembles "cluster" file systems based on SANs (storage area networks) and FC (Fibre Channel) or iSCSI. The main difference is that FC and iSCSI are block-level protocols that talk to dumb, passive disks; Ceph OSDs are intelligent storage nodes, and all communication takes place over TCP on commodity IP networks.
Ceph's intelligent storage nodes (basically, storage servers running software that serves "objects" instead of files) facilitate improved scalability and parallelism. NFS servers (i.e. NAS devices) and cluster file systems funnel all I/O through a single server (or a limited set of servers), which limits scalability. Ceph clients interact with a set of (perhaps dozens or hundreds of) metadata servers (MDSs) for high-level operations like open() and rename(), but perform I/O directly with the storage nodes (OSDs), of which there may be thousands.
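To make that control/data split concrete, here is a minimal sketch. The classes and method names below (MDS, OSD, Client, write_file, and so on) are hypothetical stand-ins rather than the real Ceph interfaces; the only point is that name-related operations go to a metadata server, while reads and writes go directly to whichever storage node holds the object.

```python
# Minimal sketch of Ceph's control/data split. All classes and names here are
# hypothetical stand-ins; real clients speak network protocols to real daemons.

class MDS:
    """Stand-in metadata server: resolves file names, never touches file data."""
    def __init__(self):
        self.inodes = {}      # path -> inode number
        self.next_ino = 1

    def open(self, path):
        if path not in self.inodes:
            self.inodes[path] = self.next_ino
            self.next_ino += 1
        return self.inodes[path]   # the client uses the inode to address objects

class OSD:
    """Stand-in object storage device: stores named objects."""
    def __init__(self):
        self.objects = {}     # object name -> bytes

    def write(self, name, data):
        self.objects[name] = data

    def read(self, name):
        return self.objects.get(name, b"")

class Client:
    def __init__(self, mds, osds):
        self.mds, self.osds = mds, osds

    def _osd_for(self, ino):
        return self.osds[ino % len(self.osds)]       # deterministic placement (see CRUSH below)

    def write_file(self, path, data):
        ino = self.mds.open(path)                    # metadata operation: goes to the MDS
        self._osd_for(ino).write(f"{ino}.0", data)   # data operation: goes straight to an OSD

    def read_file(self, path):
        ino = self.mds.open(path)
        return self._osd_for(ino).read(f"{ino}.0")

client = Client(MDS(), [OSD() for _ in range(4)])
client.write_file("/home/alice/notes.txt", b"hello ceph")
print(client.read_file("/home/alice/notes.txt"))     # b'hello ceph'
```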
There are a handful of new file systems and enterprise storage products adopting a similar object- or brick-based architecture, including Lustre (also open-source, but with restricted access to source code) and the Panasas file system (a commercial storage product). Ceph is different:
- Open source, open development. We're hosted on SourceForge, and are actively looking for interested users and developers.
- Scalability. Ceph sheds legacy file system design principles like explicit allocation tables that are still found in almost all other file systems (including Lustre and the Panasas file system) and ultimately limit scalability.
- Commodity hardware. Ceph is designed to run on commodity hardware running Linux (or any other POSIX-ish Unix variant). (Lustre relies on a SAN or other shared storage failover to make storage nodes reliable, while Panasas is based on custom hardware using integrated UPSs.)
In addition to promising greater scalability than existing solutions, Ceph also promises to fill the huge gap between open-source file systems and commercial enterprise systems. If you want network-attached storage without shelling out the big bucks, you are usually stuck with NFS and a direct-attached RAID. Technologies like ATA-over-Ethernet and iSCSI help scale raw volume sizes, but the relative lack of "cluster-aware" open-source file systems (particularly those with snapshot-like functionality) still ties you to a single NFS "server," which limits scalability.
Ceph fills this gap by providing a scalable, reliable file system that can seamlessly grow from gigabytes to petabytes. Moreover, Ceph will eventually provide efficient snapshots, which almost no freely available file system (besides ZFS on Solaris) provides, despite snapshots having become almost ubiquitous in enterprise systems.
Ceph Architecture
A thorough overview of the system architecture can be found in this paper that appeared at OSDI '06.
A Ceph installation consists of three main elements: clients, metadata servers (MDSs), and object storage devices (OSDs). Ceph clients can either be individual processes linking directly to a user-space client library, or hosts mounting the Ceph file system natively (a la NFS). OSDs are servers with attached disks and are responsible for storing data.
The Ceph architecture is based on three key design principles that set it apart from traditional file systems.
- Separation of metadata and data management.
A small set of metadata servers (MDSs) manages the file system hierarchy (namespace). Clients communicate with an MDS to open or close files, get directory listings, remove files, or perform any other operation that involves file names. Once a file is opened, clients communicate directly with OSDs (object storage devices) to read and write data. A large Ceph system may involve anywhere from one to many dozens (or possibly hundreds) of MDSs, and anywhere from four to hundreds or thousands of OSDs.
Both file data and file system metadata are striped over multiple objects, each of which is replicated on multiple OSDs for reliability. A special-purpose mapping function called CRUSH determines which OSDs store which objects. CRUSH resembles a hash function in that the mapping is pseudo-random (it appears random, but is actually deterministic). This provides load balancing across all devices that is relatively invulnerable to "hot spots," while Ceph's policy of redistributing data ensures that the workload remains balanced and all devices are equally utilized even when the storage cluster is expanded or OSDs are removed. (A simplified placement sketch appears after this list.)
- Intelligent storage devices
Each Ceph OSD stores variably sized, named objects. (In contrast, conventional file systems are built directly on top of raw hard disks that store small, fixed-size, numbered blocks.) Ceph OSDs can be built from conventional server hardware with attached storage (either raw disks or a small RAID). Each Ceph OSD runs a special-purpose "object file system" called EBOFS, designed to store variably sized objects efficiently and reliably.
More importantly, Ceph OSDs are intelligent. Collectively, the OSD cluster manages data replication, data migration (when the cluster composition changes due to expansion, failures, etc.), failure detection (OSDs actively monitor their peers), and failure recovery. We call the collective cluster RADOS--Reliable Autonomic Distributed Object Store--because it provides the illusion of a single logical object store while hiding the details of data distribution, replication, and failure recovery. (A toy replication sketch appears after this list.)
- Dynamic distributed metadata
Ceph dynamically distributes responsibility for managing the file system directory hierarchy over tens or even hundreds of MDSs. Because Ceph embeds most inodes directly within the directory that contains them, this hierarchical partitioning allows each MDS to operate independently and efficiently. The distribution is entirely adaptive, based on the current workload, allowing the cluster to redistribute the hierarchy to balance load as client access patterns change over time. Ceph also copes with metadata hot spots: popular metadata is replicated across multiple MDS nodes, and extremely large directories can be distributed across the entire cluster when necessary. (A toy sketch of this subtree partitioning appears below.)
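First, the placement sketch promised under "Separation of metadata and data management." This is not the real CRUSH algorithm (CRUSH also takes device weights and a hierarchical map of the cluster into account); it is a minimal hash-based stand-in that only shows the property that matters here: a deterministic but pseudo-random mapping from an object name to an ordered list of OSDs, with no allocation table to look up.

```python
# Hash-based stand-in for CRUSH: deterministic, pseudo-random placement of an
# object onto an ordered list of OSDs. The real CRUSH algorithm is weighted and
# hierarchy-aware; this sketch is not.
import hashlib

def place(object_name: str, osd_ids: list, replicas: int = 3) -> list:
    """Return the OSD ids that should hold object_name (the first is the primary)."""
    chosen = []
    for rank in range(replicas):
        # Hash the object name together with the replica rank, then probe
        # forward until we hit an OSD we have not already chosen.
        h = int(hashlib.sha1(f"{object_name}:{rank}".encode()).hexdigest(), 16)
        for offset in range(len(osd_ids)):
            candidate = osd_ids[(h + offset) % len(osd_ids)]
            if candidate not in chosen:
                chosen.append(candidate)
                break
    return chosen

osds = list(range(8))                      # eight OSDs, ids 0-7
print(place("obj-1000023-0", osds))        # same input always yields the same OSDs
print(place("obj-1000023-1", osds))        # a different object usually lands elsewhere
```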
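Second, the replication sketch promised under "Intelligent storage devices." Real RADOS also performs failure detection via peer monitoring and re-replicates data after failures; none of that is modeled here. The sketch only shows primary-copy replication over a deterministic placement: the write goes to the first OSD chosen for the object, which forwards it to the replicas before the client is acknowledged. All names are hypothetical.

```python
# Toy primary-copy replication over a deterministic placement function.
# Real RADOS adds failure detection, versioning, and recovery; this does not.
import hashlib

def _place(name: str, n_osds: int, replicas: int) -> list:
    # Simplified placement: hash once, take consecutive OSD ids (all distinct
    # as long as replicas <= n_osds).
    h = int(hashlib.sha1(name.encode()).hexdigest(), 16)
    return [(h + i) % n_osds for i in range(replicas)]

class ReplicatedObjectStore:
    def __init__(self, n_osds: int, replicas: int = 3):
        self.osds = [dict() for _ in range(n_osds)]   # each OSD: object name -> bytes
        self.replicas = replicas

    def write(self, name: str, data: bytes) -> None:
        targets = _place(name, len(self.osds), self.replicas)
        primary, others = targets[0], targets[1:]
        self.osds[primary][name] = data       # the client writes to the primary OSD
        for osd_id in others:                 # the primary forwards to each replica
            self.osds[osd_id][name] = data
        # only after every replica has applied the write would the client be acked

    def read(self, name: str) -> bytes:
        primary = _place(name, len(self.osds), self.replicas)[0]
        return self.osds[primary].get(name, b"")

store = ReplicatedObjectStore(n_osds=8)
store.write("obj-1000023-0", b"some file data")
print(store.read("obj-1000023-0"))            # b'some file data'
```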
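Finally, the metadata sketch promised above. The real MDS load balancer uses decaying load counters and also replicates hot metadata across nodes; the toy below only illustrates the basic idea of dynamic subtree partitioning: each MDS is authoritative for whole subtrees of the namespace, and a busy subtree can be handed to a less-loaded MDS as the workload shifts. All names are hypothetical.

```python
# Toy sketch of dynamic subtree partitioning across metadata servers.
# Names are hypothetical; the real balancer is far more sophisticated.
from collections import defaultdict

class MDSCluster:
    def __init__(self, num_mds: int):
        self.num_mds = num_mds
        self.authority = {"/": 0}            # subtree root -> MDS id in charge of it
        self.load = defaultdict(int)         # MDS id -> recent request count

    def mds_for(self, path: str) -> int:
        # Deepest enclosing subtree wins (prefix matching is simplified here;
        # a real implementation would match whole path components).
        root = max((r for r in self.authority if path.startswith(r)), key=len)
        mds = self.authority[root]
        self.load[mds] += 1
        return mds

    def rebalance(self, hot_subtree: str) -> None:
        # Hand a busy subtree over to the least-loaded MDS.
        target = min(range(self.num_mds), key=lambda m: self.load[m])
        self.authority[hot_subtree] = target

cluster = MDSCluster(num_mds=4)
for _ in range(1000):
    cluster.mds_for("/home/alice/project/src/main.c")    # one subtree gets hot
cluster.rebalance("/home/alice/project")                  # shed it to another MDS
print(cluster.mds_for("/home/alice/project/src/main.c")) # now answered by a different MDS
```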