Merge pull request #40580 from anthonyeleven/anthonyeleven/clean-up-cephfs-standby

doc/cephfs: refine cephfs/standby

Reviewed-by: Zac Dover <zac.dover@gmail.com>
Kefu Chai 2021-04-22 14:53:09 +08:00 committed by GitHub
commit f3344192e3

@ -3,53 +3,50 @@
Terminology
-----------
A Ceph cluster may have zero or more CephFS *file systems*. CephFS
file systems have a human readable name (set in ``fs new``)
and an integer ID. The ID is called the file system cluster ID,
or *FSCID*.
A Ceph cluster may have zero or more CephFS *file systems*. Each CephFS has
a human-readable name (set at creation time with ``fs new``) and an integer
ID. The ID is called the file system cluster ID, or *FSCID*.
Each CephFS file system has a number of *ranks*, one by default,
which start at zero. A rank may be thought of as a metadata shard.
Controlling the number of ranks in a file system is described
in :doc:`/cephfs/multimds`
Each CephFS file system has a number of *ranks*, numbered beginning with zero.
By default there is one rank per file system. A rank may be thought of as a
metadata shard. Management of ranks is described in :doc:`/cephfs/multimds`.
Each CephFS ceph-mds process (a *daemon*) initially starts up
without a rank. It may be assigned one by the monitor cluster.
A daemon may only hold one rank at a time. Daemons only give up
a rank when the ceph-mds process stops.
Each CephFS ``ceph-mds`` daemon starts without a rank. It may be assigned one
by the cluster's monitors. A daemon may only hold one rank at a time, and gives
up a rank only when the ``ceph-mds`` process stops.
If a rank is not associated with a daemon, the rank is
considered *failed*. Once a rank is assigned to a daemon,
the rank is considered *up*.
If a rank is not associated with any daemon, that rank is considered ``failed``.
Once a rank is assigned to a daemon, the rank is considered ``up``.
A daemon has a *name* that is set statically by the administrator
when the daemon is first configured. Typical configurations
use the hostname where the daemon runs as the daemon name.
Each ``ceph-mds`` daemon has a *name* that is assigned statically by the
administrator when the daemon is first configured. The *name* is typically the
hostname of the machine where the process runs.
A ceph-mds daemons can be assigned to a particular file system by
setting the `mds_join_fs` configuration option to the file system
name.
A ``ceph-mds`` daemon may be assigned to a specific file system by
setting its ``mds_join_fs`` configuration option to the file system's
``name``.
Each time a daemon starts up, it is also assigned a *GID*, which
is unique to this particular process lifetime of the daemon. The
GID is an integer.
When a ``ceph-mds`` daemon starts, it is also assigned an integer ``GID``,
which is unique to the daemon's current process. In other words, when a
``ceph-mds`` daemon is restarted, it runs as a new process and is assigned a
*new* ``GID`` that is different from that of the previous process.
Referring to MDS daemons
------------------------
Most of the administrative commands that refer to an MDS daemon
accept a flexible argument format that may contain a rank, a GID
or a name.
Most administrative commands that refer to a ``ceph-mds`` daemon (MDS)
accept a flexible argument format that may specify a ``rank``, a ``GID``
or a ``name``.
Where a rank is used, this may optionally be qualified with
a leading file system name or ID. If a daemon is a standby (i.e.
it is not currently assigned a rank), then it may only be
referred to by GID or name.
Where a ``rank`` is used, it may optionally be qualified by
a leading file system ``name`` or ``FSCID``. If a daemon is a standby (i.e.
it is not currently assigned a ``rank``), then it may only be
referred to by ``GID`` or ``name``.
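
The flexible argument format can be illustrated with a small, self-contained
Python sketch. It is not Ceph code: the ``Daemon`` record, the ``resolve``
helper, and the order in which the forms are tried (qualified rank, then
unqualified rank, then GID, then name) are assumptions made for illustration
only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Daemon:
    """Hypothetical record of one ceph-mds daemon."""
    name: str
    gid: int
    fs_name: Optional[str] = None  # None for a standby
    fscid: Optional[int] = None    # None for a standby
    rank: Optional[int] = None     # None for a standby


def resolve(spec: str, daemons: List[Daemon]) -> Optional[Daemon]:
    """Map a spec such as 'myfs:0', '3:0', '0', '5446' or 'myhost' to a daemon."""
    if ":" in spec:
        # Rank qualified by a leading file system name or FSCID.
        fs, _, rank_text = spec.partition(":")
        if not rank_text.isdigit():
            return None
        rank = int(rank_text)
        for d in daemons:
            fs_matches = d.fs_name == fs or (fs.isdigit() and d.fscid == int(fs))
            if d.rank == rank and fs_matches:
                return d
        return None
    if spec.isdigit():
        number = int(spec)
        # Unqualified rank: only usable when exactly one daemon holds it.
        holders = [d for d in daemons if d.rank == number]
        if len(holders) == 1:
            return holders[0]
        # Otherwise treat the number as a GID.
        for d in daemons:
            if d.gid == number:
                return d
    # Finally, treat the spec as a daemon name.
    for d in daemons:
        if d.name == spec:
            return d
    return None
```

In this sketch a standby has no rank, so it can only be reached through the
GID or name branches, which matches the restriction described above.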
For example, if we had an MDS daemon which was called 'myhost',
had GID 5446, and was assigned rank 0 in the file system 'myfs'
which had FSCID 3, then any of the following would be suitable
forms of the 'fail' command:
For example, say we have an MDS daemon with ``name`` 'myhost' and
``GID`` 5446, and which is assigned ``rank`` 0 for the file system 'myfs'
with ``FSCID`` 3. Any of the following are suitable forms of the ``fail``
command:
::
@ -62,17 +59,18 @@ forms of the 'fail' command:
Managing failover
-----------------
If an MDS daemon stops communicating with the monitor, the monitor will wait
``mds_beacon_grace`` seconds (default 15 seconds) before marking the daemon as
*laggy*. If a standby is available, the monitor will immediately replace the
If an MDS daemon stops communicating with the cluster's monitors, the monitors
will wait ``mds_beacon_grace`` seconds (default 15) before marking the daemon as
*laggy*. If a standby MDS is available, the monitors will immediately replace the
laggy daemon.
Each file system may specify a number of standby daemons to be considered
healthy. This number includes daemons in standby-replay waiting for a rank to
fail (remember that a standby-replay daemon will not be assigned to take over a
failure for another rank or a failure in a another CephFS file system). The
pool of standby daemons not in replay count towards any file system count.
Each file system may set the number of standby daemons wanted using:
Each file system may specify a minimum number of standby daemons in order to be
considered healthy. This number includes daemons in the ``standby-replay`` state
waiting for a ``rank`` to fail. Note that a ``standby-replay`` daemon will not
be assigned to take over a failure for another ``rank`` or a failure in a
different CephFS file system. Standby daemons that are not in the
``standby-replay`` state count towards the standby count of any file system.
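
The counting rule can be sketched in Python as follows. This is a simplified
model rather than the monitors' actual implementation: the function name and
its parameters are hypothetical, and it assumes that a wanted count of 0
disables the check.

```python
def standby_shortfall(standby_count_wanted: int,
                      standby_replay_for_fs: int,
                      shared_standbys: int) -> int:
    """Return how many standbys are missing for one file system.

    A result of 0 means the health check passes.
    """
    if standby_count_wanted == 0:
        # Assumption: a wanted count of 0 disables the health check.
        return 0
    # standby-replay daemons count only for the file system they follow;
    # ordinary standbys count towards every file system.
    available = standby_replay_for_fs + shared_standbys
    return max(0, standby_count_wanted - available)
```

For example, a file system that wants two standbys and has one
``standby-replay`` daemon but no ordinary standbys would be one short in this
model.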
Each file system may set the desired number of standby daemons by:
::
@ -86,12 +84,12 @@ Setting ``count`` to 0 will disable the health check.
Configuring standby-replay
--------------------------
Each CephFS file system may be configured to add standby-replay daemons. These
standby daemons follow the active MDS's metadata journal to reduce failover
time in the event the active MDS becomes unavailable. Each active MDS may have
only one standby-replay daemon following it.
Each CephFS file system may be configured to add ``standby-replay`` daemons.
These standby daemons follow the active MDS's metadata journal in order to
reduce failover time in the event that the active MDS becomes unavailable. Each
active MDS may have only one ``standby-replay`` daemon following it.
Configuring standby-replay on a file system is done using:
Configure ``standby-replay`` on a file system as follows:
::
@ -100,35 +98,35 @@ Configuring standby-replay on a file system is done using:
Once set, the monitors will assign available standby daemons to follow the
active MDSs in that file system.
Once an MDS has entered the standby-replay state, it will only be used as a
standby for the rank that it is following. If another rank fails, this
standby-replay daemon will not be used as a replacement, even if no other
standbys are available. For this reason, it is advised that if standby-replay
is used then every active MDS should have a standby-replay daemon.
Once an MDS has entered the ``standby-replay`` state, it will only be used as a
standby for the ``rank`` that it is following. If another ``rank`` fails, this
``standby-replay`` daemon will not be used as a replacement, even if no other
standbys are available. For this reason, it is advised that if ``standby-replay``
is used then *every* active MDS should have a ``standby-replay`` daemon.
.. _mds-join-fs:
Configuring MDS file system affinity
------------------------------------
You may want to have an MDS used for a particular file system. Or, perhaps you
have larger MDSs on better hardware that should be preferred over a last-resort
standby on lesser or over-provisioned hardware. To express this preference,
You might elect to dedicate an MDS to a particular file system. Or, perhaps you
have MDSs that run on better hardware that should be preferred over a last-resort
standby on modest or over-provisioned systems. To configure this preference,
CephFS provides a configuration option for MDS called ``mds_join_fs`` which
enforces this `affinity`.
enforces this affinity.
As part of any failover, the Ceph monitors will prefer standby daemons with
``mds_join_fs`` equal to the file system name with the failed rank. If no
standby exists with ``mds_join_fs`` equal to the file system name, it will
choose a `vanilla` standby (no setting for ``mds_join_fs``) for the replacement
or any other available standby as a last resort. Note, this does not change the
behavior that ``standby-replay`` daemons are always selected before looking at
When failing over MDS daemons, a cluster's monitors will prefer standby daemons with
``mds_join_fs`` equal to the file system ``name`` with the failed ``rank``. If no
standby exists with ``mds_join_fs`` equal to the file system ``name``, they will
choose an unqualified standby (no setting for ``mds_join_fs``) for the replacement,
or any other available standby as a last resort. Note that this does not change the
behavior that ``standby-replay`` daemons are always selected before
other standbys.
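
The preference order described above can be sketched in Python. This is an
illustrative model only, not the monitors' code: the ``Standby`` record and
the ``pick_replacement`` helper are hypothetical names, and ties are broken
simply by list order.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Standby:
    """Hypothetical record of one standby ceph-mds daemon."""
    name: str
    mds_join_fs: Optional[str] = None            # None: option not set
    following: Optional[Tuple[str, int]] = None  # (fs name, rank) if standby-replay


def pick_replacement(fs_name: str, rank: int,
                     standbys: List[Standby]) -> Optional[Standby]:
    """Choose a replacement for a failed rank; lower preference value wins."""
    def preference(s: Standby) -> int:
        if s.following == (fs_name, rank):
            return 0  # standby-replay daemon already following this rank
        if s.mds_join_fs == fs_name:
            return 1  # affinity matches the file system
        if s.mds_join_fs is None:
            return 2  # unqualified standby
        return 3      # affinity for another file system: last resort

    # A standby-replay daemon is only eligible for the rank it follows.
    candidates = [s for s in standbys
                  if s.following is None or s.following == (fs_name, rank)]
    return min(candidates, key=preference, default=None)
```

The function returns ``None`` when no eligible standby exists, for example
when the only remaining daemon is a ``standby-replay`` daemon that follows a
different rank.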
Even further, the monitors will regularly examine the CephFS file systems when
Furthermore, the monitors will regularly examine the CephFS file systems, even when
they are stable, to check if a standby with stronger affinity is available to replace an
MDS with lower affinity. This process is also done for standby-replay daemons:
if a regular standby has stronger affinity than the standby-replay MDS, it will
MDS with lower affinity. This process is also done for ``standby-replay`` daemons:
if a regular standby has stronger affinity than the ``standby-replay`` MDS, it will
replace the standby-replay MDS.
For example, given this stable and healthy file system: