ceph/doc/config-cluster/file-system-recommendations.rst
John Wilkins d90fea6cad :doc: Consolidated file system recommendations.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2012-09-05 17:21:04 -07:00


===========================================
Hard Disk and File System Recommendations
===========================================

Hard Disk Prep
==============

Ceph aims for data safety: when the application receives notice that data was
written to the disk, that data has actually been written to the disk. On old
kernels (earlier than 2.6.33), disable the write cache if the journal is on a
raw disk. Newer kernels should work fine.

Use ``hdparm`` to disable write caching on the hard disk::

    sudo hdparm -W 0 /dev/hda

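
The kernel check described above can be scripted. The sketch below is
illustrative only: the helper names ``version_lt`` and
``needs_write_cache_disabled`` are hypothetical, and it assumes a ``sort`` that
supports ``-V`` (as in GNU coreutils). It only prints advice and never changes
the disk settings itself::

    #!/bin/sh
    # Succeeds (exit status 0) when version $1 is strictly older than $2.
    # Assumption: `sort -V` (version-aware sort) is available.
    version_lt() {
        [ "$1" != "$2" ] &&
            [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
    }

    # Succeeds when the given kernel release is older than 2.6.33.
    needs_write_cache_disabled() {
        # Drop any distribution suffix such as "-generic" before comparing.
        kernel=${1%%-*}
        version_lt "$kernel" "2.6.33"
    }

    if needs_write_cache_disabled "$(uname -r)"; then
        echo "Kernel older than 2.6.33: consider 'sudo hdparm -W 0 <device>'"
    else
        echo "Kernel 2.6.33 or newer: the write cache can stay enabled"
    fi

Keeping the comparison in a small function makes it easy to check against
release strings from other hosts before touching any disk.
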

In production environments, we recommend running OSDs with an operating system
disk and one or more separate disks for data. If you run data and an operating
system on a single disk, create a separate partition for your data before
configuring your OSD cluster.

File Systems
============

Ceph OSDs depend on the Extended Attributes (XATTRs) of the underlying file
system for:

- Internal object state
- Snapshot metadata
- RADOS Gateway Access Control Lists (ACLs)

Ceph OSDs rely heavily upon the stability and performance of the underlying file
system. The underlying file system must provide sufficient capacity for XATTRs.
File system candidates for Ceph include B tree and B+ tree file systems such as:

- ``btrfs``
- ``XFS``


If you are using ``ext4``, mount your file system with the ``user_xattr`` option
to enable XATTRs. You must also add the following line to the ``[osd]`` section
of your ``ceph.conf`` file::

    filestore xattr use omap = true

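
As a minimal sketch of the mount step, the helper below builds the ``mount``
command for an ``ext4`` OSD data partition with XATTRs enabled. The helper name
``ext4_osd_mount_cmd``, the device ``/dev/sdb1``, and the mount point
``/var/lib/ceph/osd/ceph-0`` are placeholders chosen for illustration, not
values taken from this guide. The function only prints the command so that you
can review it before running it as root::

    #!/bin/sh
    # Print the mount command for an ext4 data partition with user XATTRs
    # enabled. $1 is the block device, $2 is the mount point.
    # Nothing is mounted here; the command is only written to stdout.
    ext4_osd_mount_cmd() {
        printf 'mount -t ext4 -o user_xattr %s %s\n' "$1" "$2"
    }

    # Placeholder device and mount point; substitute your own.
    ext4_osd_mount_cmd /dev/sdb1 /var/lib/ceph/osd/ceph-0

Once the output looks right, run the printed command with ``sudo``.
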

.. warning:: XATTR limits.

   The RADOS Gateway's ACLs and Ceph snapshots easily surpass the 4-kilobyte
   limit for XATTRs in ``ext4``, causing the ``ceph-osd`` process to crash.
   Version 0.45 or newer uses ``leveldb`` to bypass this limitation. ``ext4``
   is a poor file system choice if you intend to deploy the RADOS Gateway or
   use snapshots on versions earlier than 0.45.


.. tip:: Use ``xfs`` initially and ``btrfs`` when it is ready for production.

   The Ceph team believes that the best performance and stability will come
   from ``btrfs``. The ``btrfs`` file system has internal transactions that keep
   the local data set in a consistent state. This makes OSDs based on ``btrfs``
   simple to deploy, while providing scalability not currently available from
   block-based file systems.

   The 64 KB limit for ``xfs`` XATTRs is enough to accommodate RBD snapshot
   metadata and RADOS Gateway ACLs. In the long run, ``xfs`` is therefore the
   Ceph team's second-choice file system, but it is currently more stable than
   ``btrfs``. If you only plan to use RADOS and ``rbd`` without snapshots and
   without ``radosgw``, the ``ext4`` file system should work just fine.


FS Background Info
==================

Before ``ext3``, ``ReiserFS`` was the only journaling file system available for
Linux. However, ``ext3`` doesn't provide Extended Attribute (XATTR) support.
While ``ext4`` provides XATTR support, it only allows XATTRs up to 4 KB. The
4 KB limit is not enough for RADOS Gateway ACLs, snapshots, and other features.
As of version 0.45, Ceph provides a ``leveldb`` feature for ``ext4`` file
systems that stores XATTRs in excess of 4 KB in a ``leveldb`` database.

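
To see how a mounted file system handles large XATTR values, you can try to
store a value of a given size on a scratch file. The following is a rough
sketch and not part of Ceph: the function name ``probe_xattr_size`` and the
attribute name ``user.ceph_probe`` are made up for illustration, and it assumes
the ``setfattr`` utility (from the ``attr`` package) is installed::

    #!/bin/sh
    # probe_xattr_size DIR SIZE
    # Try to store a SIZE-byte XATTR on a scratch file inside DIR.
    # Returns 0 if the value was stored, 1 if it was rejected (or if
    # setfattr is unavailable), and 2 if no scratch file could be created.
    probe_xattr_size() {
        probe_file=$(mktemp "$1/xattr-probe.XXXXXX" 2>/dev/null) || return 2
        # Build a value of exactly SIZE bytes made of the letter "a".
        probe_value=$(head -c "$2" /dev/zero | tr '\0' 'a')
        if setfattr -n user.ceph_probe -v "$probe_value" "$probe_file" 2>/dev/null
        then
            probe_rc=0
        else
            probe_rc=1
        fi
        rm -f "$probe_file"
        return "$probe_rc"
    }

For example, given the limits described above, a call such as
``probe_xattr_size /var/lib/ceph/osd/ceph-0 8192`` (with a placeholder OSD
path) would be expected to fail on ``ext4`` and succeed on ``xfs``.
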

The ``XFS`` and ``btrfs`` file systems provide numerous advantages in highly
scaled data storage environments when `compared`_ to ``ext3`` and ``ext4``.
Both ``XFS`` and ``btrfs`` are `journaling file systems`_, which means that
they are more robust when recovering from crashes, power outages, etc. These
file systems journal all of the changes they will make before performing writes.

``XFS`` was developed for Silicon Graphics, and is a mature and stable
file system. By contrast, ``btrfs`` is a relatively new file system that aims
to address the long-standing wishes of system administrators working with
large-scale data storage environments. ``btrfs`` has some unique features
and advantages compared to other Linux file systems.

``btrfs`` is a `copy-on-write`_ file system. It supports file creation
timestamps and checksums that verify metadata integrity, so it can detect
bad copies of data and fix them with the good copies. The copy-on-write
capability means that ``btrfs`` can support writable snapshots. ``btrfs``
also supports transparent compression and other features.

``btrfs`` also incorporates multi-device management into the file system,
which enables you to support heterogeneous disk storage infrastructure and
data allocation policies. The community also aims to provide ``fsck``,
deduplication, and data encryption support in the future. This compelling
list of features makes ``btrfs`` the ideal choice for Ceph clusters.


.. _copy-on-write: http://en.wikipedia.org/wiki/Copy-on-write
.. _compared: http://en.wikipedia.org/wiki/Comparison_of_file_systems
.. _journaling file systems: http://en.wikipedia.org/wiki/Journaling_file_system