Application best practices for distributed filesystems
======================================================

CephFS is POSIX compatible, and therefore should work with any existing
applications that expect a POSIX filesystem. However, because it is a
network filesystem (unlike e.g. XFS) and it is highly consistent (unlike
e.g. NFS), there are some consequences that application authors may
benefit from knowing about.

The following sections describe some areas where distributed filesystems
may have noticeably different performance behaviours compared with
local filesystems.

ls -l
-----

When you run ``ls -l``, the ``ls`` program first performs a directory
listing and then calls ``stat`` on every file in the directory.

This is usually far in excess of what an application really needs, and
it can be slow for large directories. If you don't really need all
this metadata for each file, then use a plain ``ls``.
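
To see the difference from application code, compare a bare listing with
one that stats every entry. A minimal Python sketch; the mount path is a
placeholder:

.. code-block:: python

   import os

   path = "/mnt/cephfs/mydir"  # hypothetical CephFS mount point

   # Cheap: a single directory listing, no per-file metadata.
   names = os.listdir(path)

   # Expensive: one stat() per entry, which is roughly what "ls -l" does.
   for name in names:
       info = os.stat(os.path.join(path, name))
       print(name, info.st_size)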

ls/stat on files being extended
-------------------------------

If another client is currently extending files in the listed directory,
then an ``ls -l`` may take an exceptionally long time to complete, as
the lister must wait for the writer to flush data in order to do a valid
read of every file's size. So unless you *really* need to know the
exact size of every file in the directory, just don't do it!

This would also apply to any application code that was directly
issuing ``stat`` system calls on files being appended from
another node.
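
If the file names alone are enough, a listing that never calls ``stat``
also never waits for writers to flush. A minimal Python sketch, again with
a placeholder path:

.. code-block:: python

   import os

   # Names only: no stat() calls, so no waiting on clients that are
   # currently appending to files in this directory.
   with os.scandir("/mnt/cephfs/mydir") as entries:
       for entry in entries:
           print(entry.name)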

Very large directories
----------------------

Do you really need that 10,000,000-file directory? While directory
fragmentation enables CephFS to handle it, it is always going to be
less efficient than splitting your files into more modest-sized directories.

Even standard userspace tools can become quite slow when operating on very
large directories. For example, the default behaviour of ``ls``
is to give an alphabetically ordered result, but ``readdir`` system
calls do not give an ordered result (this is true in general, not just
with CephFS). So when you ``ls`` on a million-file directory, it is
loading a list of a million names into memory, sorting the list, then writing
it out to the display.
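
When you do have to work through a huge directory, iterating lazily in
``readdir`` order avoids building the sorted million-name list described
above. A minimal Python sketch, with a placeholder path:

.. code-block:: python

   import os

   count = 0
   # Stream entries lazily in readdir order instead of building the full
   # sorted list of names that a bare "ls" would hold in memory.
   with os.scandir("/mnt/cephfs/bigdir") as entries:
       for entry in entries:
           count += 1  # stand-in for your own per-entry work
   print(count, "entries")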

Hard links
----------

Hard links have an intrinsic cost in terms of the internal housekeeping
that a filesystem has to do to keep two references to the same data. In
CephFS there is a particular performance cost, because with normal files
the inode is embedded in the directory (i.e. there is no extra fetch of
the inode after looking up the path).

Working set size
----------------

The MDS acts as a cache for the metadata stored in RADOS. Metadata
performance is very different for workloads whose metadata fits within
that cache.

If your workload has more files than fit in your cache (configured using
the ``mds_cache_memory_limit`` or ``mds_cache_size`` settings), then
make sure you test it appropriately: don't test your system with a small
number of files and then expect equivalent performance when you move
to a much larger number of files.
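
For example, the cache limit can be raised in ``ceph.conf``; the value
below (8 GiB, in bytes) is purely illustrative and should be sized to your
own metadata working set:

.. code-block:: ini

   [mds]
   mds_cache_memory_limit = 8589934592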

Do you need a filesystem?
-------------------------

Remember that Ceph also includes an object storage interface. If your
application needs to store huge flat collections of files where you just
read and write whole files at once, then you might well be better off
using the :ref:`Object Gateway <object-gateway>`.
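
The Object Gateway exposes an S3-compatible API, so whole-object reads and
writes can be issued from any S3 client. A minimal Python sketch using
``boto3``, with placeholder endpoint and credentials:

.. code-block:: python

   import boto3

   s3 = boto3.client(
       "s3",
       endpoint_url="http://rgw.example.com:8080",  # hypothetical RGW endpoint
       aws_access_key_id="ACCESS_KEY",              # placeholder credentials
       aws_secret_access_key="SECRET_KEY",
   )

   # Write and read back a whole object in single operations.
   s3.put_object(Bucket="mybucket", Key="hello.txt", Body=b"hello world")
   data = s3.get_object(Bucket="mybucket", Key="hello.txt")["Body"].read()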