From 15c6147f79c1065a8fd212909d2d8a2598774bf4 Mon Sep 17 00:00:00 2001
From: John Spray
Date: Thu, 17 Nov 2016 17:27:37 +0000
Subject: [PATCH] doc: explain directory fragmentation settings

...and update the config ref.

Includes the new mds_bal_fragment_fast_factor setting.

Signed-off-by: John Spray
---
 doc/cephfs/dirfrags.rst       | 99 +++++++++++++++++++++++++++++++++++
 doc/cephfs/index.rst          |  1 +
 doc/cephfs/mds-config-ref.rst | 35 ++++++-------
 3 files changed, 116 insertions(+), 19 deletions(-)
 create mode 100644 doc/cephfs/dirfrags.rst

diff --git a/doc/cephfs/dirfrags.rst b/doc/cephfs/dirfrags.rst
new file mode 100644
index 00000000000..e6bd045b225
--- /dev/null
+++ b/doc/cephfs/dirfrags.rst
@@ -0,0 +1,99 @@
+
+===================================
+Configuring directory fragmentation
+===================================
+
+In CephFS, directories are *fragmented* when they become very large
+or very busy. This splits up the metadata so that it can be shared
+between multiple MDS daemons, and between multiple objects in the
+metadata pool.
+
+In normal operation, directory fragmentation is invisible to
+users and administrators, and all the configuration settings mentioned
+here should be left at their default values.
+
+While directory fragmentation enables CephFS to handle very large
+numbers of entries in a single directory, application programmers should
+remain cautious about creating very large directories, as they still
+have a resource cost in situations such as a CephFS client listing
+the directory, where all the fragments must be loaded at once.
+
+All directories are initially created as a single fragment. This fragment
+may be *split* to divide up the directory into more fragments, and these
+fragments may be *merged* to reduce the number of fragments in the directory.
+
+Splitting and merging
+=====================
+
+An MDS will only consider doing splits and merges if the ``mds_bal_frag``
+setting is true in the MDS's configuration file, and the ``allow_dirfrags``
+setting is true in the filesystem map (set on the mons).
+
+When an MDS identifies a directory fragment to be split, it does not
+do the split immediately. Because splitting interrupts metadata IO,
+a short delay is used to allow short bursts of client IO to complete
+before the split begins. This delay is configured with
+``mds_bal_fragment_interval``, which defaults to 5 seconds.
+
+When the split is done, the directory fragment is broken up into
+a power-of-two number of new fragments. The number of new
+fragments is given by two to the power ``mds_bal_split_bits``, e.g.
+if ``mds_bal_split_bits`` is 2, then four new fragments will be
+created. The default setting is 3, i.e. splits create 8 new fragments.
+
+The criteria for initiating a split or a merge are described in the
+following sections.
+
+Size thresholds
+===============
+
+A directory fragment is eligible for splitting when its size exceeds
+``mds_bal_split_size`` (default 10000 entries). Ordinarily this split is
+delayed by ``mds_bal_fragment_interval``, but if the fragment size
+exceeds ``mds_bal_fragment_fast_factor`` times the split size,
+the split will happen immediately (holding up any client metadata
+IO on the directory).
+
+``mds_bal_fragment_size_max`` is the hard limit on the size of
+directory fragments. If it is reached, clients will receive
+ENOSPC errors if they try to create files in the fragment. On
+a properly configured system, this limit should never be reached on
+ordinary directories, as they will have split long before. By default,
+this is set to 10 times the split size, giving a dirfrag size limit of
+100000. Increasing this limit may lead to oversized directory fragment
+objects in the metadata pool, which the OSDs may not be able to handle.
+
+A directory fragment is eligible for merging when its size is less
+than ``mds_bal_merge_size``. There is no merge equivalent of the
+"fast splitting" explained above: fast splitting exists to avoid
+creating oversized directory fragments, and there is no equivalent issue
+to avoid when merging. The default merge size is 50.
+
+Activity thresholds
+===================
+
+In addition to splitting fragments based
+on their size, the MDS may split directory fragments if their
+activity exceeds a threshold.
+
+The MDS maintains separate time-decaying load counters for read and write
+operations on directory fragments. The decaying load counters have an
+exponential decay based on the ``mds_decay_halflife`` setting.
+
+On writes, the write counter is
+incremented and compared with ``mds_bal_split_wr``, triggering a
+split if the threshold is exceeded. Write operations include metadata IO
+such as renames, unlinks and creations.
+
+The ``mds_bal_split_rd`` threshold is applied based on the read operation
+load counter, which tracks readdir operations.
+
+By default, the read threshold is 25000 and the write threshold is
+10000, i.e. 2.5x as many reads as writes would be required to trigger
+a split.
+
+After fragments are split due to the activity thresholds, they are only
+merged based on the size threshold (``mds_bal_merge_size``), so
+a spike in activity may cause a directory to stay fragmented
+forever unless some entries are unlinked.
+
diff --git a/doc/cephfs/index.rst b/doc/cephfs/index.rst
index 2e8be06d443..5e9b1f70f8c 100644
--- a/doc/cephfs/index.rst
+++ b/doc/cephfs/index.rst
@@ -97,6 +97,7 @@ authentication keyring.
     Disaster recovery
     Client authentication
     Upgrading old filesystems
+    Configuring directory fragmentation
 
 ..
raw:: html
 
diff --git a/doc/cephfs/mds-config-ref.rst b/doc/cephfs/mds-config-ref.rst
index 99db8b1f83f..f877e83491b 100644
--- a/doc/cephfs/mds-config-ref.rst
+++ b/doc/cephfs/mds-config-ref.rst
@@ -291,24 +291,6 @@
 :Default: ``50``
 
 
-``mds bal merge rd``
-
-:Description: The minimum read temperature before Ceph merges
-              adjacent directory fragments.
-
-:Type: Float
-:Default: ``1000``
-
-
-``mds bal merge wr``
-
-:Description: The minimum write temperature before Ceph merges
-              adjacent directory fragments.
-
-:Type: Float
-:Default: ``1000``
-
-
 ``mds bal interval``
 
 :Description: The frequency (in seconds) of workload exchanges between MDSs.
@@ -318,11 +300,26 @@
 
 ``mds bal fragment interval``
 
-:Description: The frequency (in seconds) of adjusting directory fragmentation.
+:Description: The delay (in seconds) between a fragment being eligible for split
+              or merge and executing the fragmentation change.
 :Type: 32-bit Integer
 :Default: ``5``
 
 
+``mds bal fragment fast factor``
+
+:Description: The ratio by which frags may exceed the split size before
+              a split is executed immediately (skipping the fragment interval).
+:Type: Float
+:Default: ``1.5``
+
+``mds bal fragment size max``
+
+:Description: The maximum size of a fragment before any new entries
+              are rejected with ENOSPC.
+:Type: 32-bit Integer
+:Default: ``100000``
+
 ``mds bal idle threshold``
 
 :Description: The minimum temperature before Ceph migrates a subtree
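
Reviewer note, not part of the patch: the size-threshold rules documented in dirfrags.rst can be sketched in Python as below. This is only an illustration built from the defaults quoted in the patch, not the MDS implementation. The names `SplitSettings`, `size_action`, `accepts_new_entries` and `fragments_after_split` are hypothetical, and the sketch ignores the activity thresholds and the `mds_bal_frag` / `allow_dirfrags` switches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitSettings:
    """Hypothetical container; defaults are the values quoted in the docs."""
    split_size: int = 10000     # mds_bal_split_size
    merge_size: int = 50        # mds_bal_merge_size
    split_bits: int = 3         # mds_bal_split_bits
    fast_factor: float = 1.5    # mds_bal_fragment_fast_factor
    size_max: int = 100000      # mds_bal_fragment_size_max


def size_action(entries: int, cfg: SplitSettings = SplitSettings()) -> str:
    """Classify a fragment by size alone (assumed reading of the docs)."""
    if entries > cfg.split_size * cfg.fast_factor:
        return "split-now"              # skips mds_bal_fragment_interval
    if entries > cfg.split_size:
        return "split-after-interval"   # waits mds_bal_fragment_interval
    if entries < cfg.merge_size:
        return "merge-after-interval"   # merges have no fast path
    return "leave"


def accepts_new_entries(entries: int, cfg: SplitSettings = SplitSettings()) -> bool:
    """Once the hard limit is reached, creates fail with ENOSPC."""
    return entries < cfg.size_max


def fragments_after_split(cfg: SplitSettings = SplitSettings()) -> int:
    """A split produces two to the power split_bits new fragments."""
    return 2 ** cfg.split_bits


if __name__ == "__main__":
    for n in (10, 5000, 12000, 20000):
        print(n, size_action(n), accepts_new_entries(n))
    print("fragments per split:", fragments_after_split())
```

With the default settings, a fragment holding between 10001 and 15000 entries waits out the fragment interval before splitting, while anything larger splits immediately.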