mirror of
https://github.com/ceph/ceph
synced 2025-01-19 17:41:39 +00:00
doc: explain directory fragmenation settings
...and update the config ref. Includes the new mds_bal_fragment_fast_factor setting. Signed-off-by: John Spray <john.spray@redhat.com>
This commit is contained in:
parent
3210ee7781
commit
15c6147f79
99
doc/cephfs/dirfrags.rst
Normal file
@@ -0,0 +1,99 @@
===================================
Configuring Directory fragmentation
===================================

In CephFS, directories are *fragmented* when they become very large
or very busy. This splits up the metadata so that it can be shared
between multiple MDS daemons, and between multiple objects in the
metadata pool.

In normal operation, directory fragmentation is invisible to
users and administrators, and all the configuration settings mentioned
here should be left at their default values.

While directory fragmentation enables CephFS to handle very large
numbers of entries in a single directory, application programmers should
remain cautious about creating very large directories, as they still
have a resource cost in situations such as a CephFS client listing
the directory, where all the fragments must be loaded at once.

All directories are initially created as a single fragment. This fragment
may be *split* to divide up the directory into more fragments, and these
fragments may be *merged* to reduce the number of fragments in the directory.

Splitting and merging
=====================

An MDS will only consider doing splits and merges if the ``mds_bal_frag``
setting is true in the MDS's configuration file, and the ``allow_dirfrags``
setting is true in the filesystem map (set on the mons).

When an MDS identifies a directory fragment to be split, it does not
do the split immediately. Because splitting interrupts metadata IO,
a short delay is used to allow short bursts of client IO to complete
before the split begins. This delay is configured with
``mds_bal_fragment_interval``, which defaults to 5 seconds.

When the split is done, the directory fragment is broken up into
a power-of-two number of new fragments. The number of new
fragments is given by two to the power ``mds_bal_split_bits``, e.g.
if ``mds_bal_split_bits`` is 2, then four new fragments will be
created. The default setting is 3, i.e. splits create 8 new fragments.
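
The arithmetic above can be checked with a short Python sketch. The
function name ``fragments_after_split`` is hypothetical and is not part of
Ceph; it only evaluates two to the power ``mds_bal_split_bits`` as described
in the text.

```python
def fragments_after_split(split_bits: int) -> int:
    """Return how many new fragments a single split produces.

    Illustrative helper only (not a Ceph API): the text states that a
    split creates 2 ** mds_bal_split_bits new fragments.
    """
    if split_bits < 1:
        raise ValueError("split_bits must be a positive integer")
    return 1 << split_bits  # same as 2 ** split_bits


if __name__ == "__main__":
    for bits in (1, 2, 3):
        count = fragments_after_split(bits)
        print(f"mds_bal_split_bits={bits}: {count} new fragments")
```

With the default of 3 quoted above, the helper returns 8.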

The criteria for initiating a split or a merge are described in the
following sections.

Size thresholds
===============

A directory fragment is eligible for splitting when its size exceeds
``mds_bal_split_size`` (default 10000). Ordinarily this split is
delayed by ``mds_bal_fragment_interval``, but if the fragment size
exceeds ``mds_bal_fragment_fast_factor`` times the split size,
the split will happen immediately (holding up any client metadata
IO on the directory).

``mds_bal_fragment_size_max`` is the hard limit on the size of
directory fragments. If it is reached, clients will receive
ENOSPC errors if they try to create files in the fragment. On
a properly configured system, this limit should never be reached on
ordinary directories, as they will have split long before. By default,
this is set to 10 times the split size, giving a dirfrag size limit of
100000. Increasing this limit may lead to oversized directory fragment
objects in the metadata pool, which the OSDs may not be able to handle.

A directory fragment is eligible for merging when its size is less
than ``mds_bal_merge_size``. There is no merge equivalent of the
"fast splitting" explained above: fast splitting exists to avoid
creating oversized directory fragments, and there is no equivalent issue
to avoid when merging. The default merge size is 50.
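
The size rules in this section can be summarised in a small Python sketch.
It is a simplified model, not the MDS implementation: the class
``DirfragSizeSettings`` and the function ``classify_fragment`` are
hypothetical names, the defaults are the values quoted in this document, and
the exact boundary comparisons (``>`` versus ``>=``) are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirfragSizeSettings:
    """Defaults as quoted in the text; the class itself is illustrative."""
    split_size: int = 10000      # mds_bal_split_size
    merge_size: int = 50         # mds_bal_merge_size
    fast_factor: float = 1.5     # mds_bal_fragment_fast_factor
    size_max: int = 100000       # mds_bal_fragment_size_max


def classify_fragment(size: int,
                      cfg: DirfragSizeSettings = DirfragSizeSettings()) -> str:
    """Map a fragment's entry count to the action the text describes."""
    if size >= cfg.size_max:
        # Hard limit reached: new entries would be rejected with ENOSPC.
        return "full"
    if size > cfg.split_size * cfg.fast_factor:
        # "Fast" split: skips the mds_bal_fragment_interval delay.
        return "split-immediately"
    if size > cfg.split_size:
        # Ordinary split, delayed by mds_bal_fragment_interval.
        return "split-after-delay"
    if size < cfg.merge_size:
        # Small enough to be considered for merging.
        return "merge-candidate"
    return "no-change"


if __name__ == "__main__":
    for size in (10, 5000, 12000, 20000, 100000):
        print(size, classify_fragment(size))
```

With these defaults, a fragment of 12000 entries falls in the delayed-split
range, while one of 20000 entries exceeds 1.5 times the split size and would
be split without waiting.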

Activity thresholds
===================

In addition to splitting fragments based on their size, the MDS may
split directory fragments if their activity exceeds a threshold.

The MDS maintains separate time-decaying load counters for read and write
operations on directory fragments. The decaying load counters have an
exponential decay based on the ``mds_decay_halflife`` setting.

On writes, the write counter is incremented, and compared with
``mds_bal_split_wr``, triggering a split if the threshold is exceeded.
Write operations include metadata IO such as renames, unlinks and
creations.

The ``mds_bal_split_rd`` threshold is applied based on the read operation
load counter, which tracks readdir operations.

By default, the read threshold is 25000 and the write threshold is
10000, i.e. 2.5x as many reads as writes would be required to trigger
a split.

After fragments are split due to the activity thresholds, they are only
merged based on the size threshold (``mds_bal_merge_size``), so
a spike in activity may cause a directory to stay fragmented
forever unless some entries are unlinked.
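
The idea of a time-decaying load counter compared against a threshold can be
sketched in Python as follows. This is a simplified model under stated
assumptions, not the MDS code: ``DecayingCounter`` and
``record_op_and_check`` are hypothetical names, timestamps are passed in
explicitly as seconds, and the half-life and threshold used in the demo are
arbitrary illustrative values rather than Ceph defaults.

```python
class DecayingCounter:
    """Counter whose value halves every `halflife` seconds (illustrative)."""

    def __init__(self, halflife: float) -> None:
        if halflife <= 0:
            raise ValueError("halflife must be positive")
        self.halflife = halflife
        self.value = 0.0
        self.last_update = 0.0

    def _decay_to(self, now: float) -> None:
        # Apply exponential decay for the time elapsed since the last update.
        elapsed = now - self.last_update
        if elapsed > 0:
            self.value *= 0.5 ** (elapsed / self.halflife)
            self.last_update = now

    def hit(self, now: float, amount: float = 1.0) -> float:
        """Record an operation at time `now` and return the decayed total."""
        self._decay_to(now)
        self.value += amount
        return self.value

    def get(self, now: float) -> float:
        """Return the decayed value at time `now` without recording a hit."""
        self._decay_to(now)
        return self.value


def record_op_and_check(counter: DecayingCounter, now: float,
                        threshold: float) -> bool:
    """Increment the counter, then report whether it exceeds the threshold."""
    return counter.hit(now) > threshold


if __name__ == "__main__":
    writes = DecayingCounter(halflife=5.0)  # illustrative half-life
    for t in (0.0, 0.1, 0.2):
        triggered = record_op_and_check(writes, now=t, threshold=2.0)
        print(f"t={t}: split triggered = {triggered}")
```

Because the value decays between operations, only a sustained rate of
operations keeps the counter above a threshold; an isolated burst fades
after a few half-lives.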

@@ -97,6 +97,7 @@ authentication keyring.
    Disaster recovery <disaster-recovery>
    Client authentication <client-auth>
    Upgrading old filesystems <upgrading>
    Configuring directory fragmentation <dirfrags>

.. raw:: html

@@ -291,24 +291,6 @@
:Default: ``50``


``mds bal merge rd``

:Description: The minimum read temperature before Ceph merges
              adjacent directory fragments.

:Type: Float
:Default: ``1000``


``mds bal merge wr``

:Description: The minimum write temperature before Ceph merges
              adjacent directory fragments.

:Type: Float
:Default: ``1000``


``mds bal interval``

:Description: The frequency (in seconds) of workload exchanges between MDSs.

@@ -318,11 +300,26 @@

``mds bal fragment interval``

:Description: The frequency (in seconds) of adjusting directory fragmentation.
:Description: The delay (in seconds) between a fragment being eligible for split
              or merge and executing the fragmentation change.
:Type: 32-bit Integer
:Default: ``5``


``mds bal fragment fast factor``

:Description: The ratio by which frags may exceed the split size before
              a split is executed immediately (skipping the fragment interval).
:Type: Float
:Default: ``1.5``

``mds bal fragment size max``

:Description: The maximum size of a fragment before any new entries
              are rejected with ENOSPC.
:Type: 32-bit Integer
:Default: ``100000``

``mds bal idle threshold``

:Description: The minimum temperature before Ceph migrates a subtree