Current Roadmap
Here is a brief summary of what we're currently working on, and what state we expect Ceph to take in the foreseeable future. This is a rough estimate, and highly dependent on what kind of interest Ceph generates in the larger community.
- Q1 2008 -- Basic in-kernel client (Linux)
- Q2 2008 -- Snapshots
- Q2 2008 -- User/group quotas
Tasks
Although Ceph is currently a working prototype that demonstrates the key features of the architecture, a variety of features need to be implemented in order to make Ceph a stable file system that can be used in production environments. Some of these tasks are outlined below. If you are a kernel or file system developer and are interested in contributing to Ceph, please join the email list and drop us a line.
Snapshots
The distributed object storage fabric (RADOS) includes a simple mechanism of versioning objects, and performing copy-on-write when old objects are updated. In order to utilize this mechanism for implementing flexible snapshots, the MDS needs to be extended to manage versioned directory entries and maintain some additional directory links. For more information, see this tech report.
Content-addressable Storage
The underlying problems of reliable, scalable and distributed object storage are solved by the RADOS object storage system. This mechanism can be leveraged to implement a content-addressible storage system (i.e. one that stores duplicated data only once) by selecting a suitable chunking strategy and naming objects by the hash of their contents. Ideally, we'd like to incorporate this into the overall Ceph file system, so that different parts of the file system can be selectively stored normally or by their hash. Ideally, the system could (perhaps lazily) detect duplicated data when it is written and adjust the underlying storage strategy accordingly in order to optimize for space efficiency or performance.
Ebofs
Each Ceph OSD (storage node) runs a custom object "file system" called EBOFS to store objects on locally attached disks. Although the current implementation of EBOFS is fully functional and already demonstrates promising performance (outperforming ext2/3, XFS, and ReiserFS under the workloads we anticipate), a range of improvements will be needed before it is ready for prime-time. These include:
- NVRAM for data journaling. Actually, this has been implemented, but is untested. EBOFS can utilize NVRAM to journal uncommitted requests much like WAFL does, significantly lowering write latency while facilitating more efficient disk scheduling, delayed allocation, and so forth.
- RAID-aware allocation. Although we conceptually think of each OSD as a disk with an attached CPU, memory, and network interface, it is more likely that the actual OSDs deployed in production systems will be small to medium sized storage servers: a standard server with a locally attached array of SAS or SATA disks. In order to properly take advantage of the parallelism inherent in the use of multiple disks, the EBOFS allocator and disk scheduling algorithms have to be aware of the underlying structure of the array (be it RAID0, 1, 5, 10, etc.) in order to reap the performance and reliability rewards.
Native kernel client
The prototype Ceph client is implemented as a user-space library. Although it can be mounted under Linux via the FUSE (file system in userspace) library, this incurs a significant performance penalty and limits Ceph's ability to provide strong POSIX semantics and consistency. A native Linux kernel implementation of the client in needed in order to properly take advantage of the performance and consistency features of Ceph. We are actively looking for experienced kernel programmers to help guide development in this area!
CRUSH tools
Ceph utilizes a novel data distribution function called CRUSH to distribute data (in the form of objects) to storage nodes (OSDs). CRUSH is designed to generate a balanced distribution will allowing the storage cluster to be dynamically expanded or contracted, and to separate object replicas across failure domains to enhance data safety. There is a certain amount of finesse involved in properly managing the OSD hierarchy from which CRUSH generates its distribution in order to minimize the amount of data migration that results from changes. An administrator tool would be useful for helping to manage the CRUSH mapping function in order to best exploit the available storage and network infrastructure. For more information, please refer to the technical paper describing CRUSH.
The Ceph project is always looking for more participants. If any of these projects sound interesting to you, please join our mailing list.