doc/dev: update quiesce developer document

To include changes relating to it now being a local lock that prevents mutable
caps.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

MDS Quiesce Protocol
====================

The MDS quiesce protocol is a mechanism for "quiescing" (quieting) a tree in a
file system, stopping all write (and sometimes incidentally read) I/O.

The purpose of this API is to prevent multiple clients from interleaving reads
and writes across an eventually consistent snapshot barrier where out-of-band
communication exists between clients. This communication can lead to clients
wrongly believing they've reached a checkpoint that is mutually recoverable
via a snapshot.

.. note:: This is documentation for the low-level mechanism in the MDS for
          quiescing a tree of files. The higher-level QuiesceDb is the
          intended API for clients to effect a quiesce.

Mechanism
---------

A quiesce is initiated via a ``quiesce_path`` internal request, which obtains
appropriate locks on the root of a tree and then launches a series of
sub-requests for locking other inodes in the tree. The locks obtained will
force clients to release caps and in-progress client/MDS requests to complete.

The sub-requests launched are ``quiesce_inode`` internal requests. These will
obtain "cap-related" locks which control capability state, including the
``filelock``, ``authlock``, ``linklock``, and ``xattrlock``. Additionally, the
new local lock ``quiescelock`` is acquired. More information on that lock in
the next section.

Locks that are not cap-related are skipped because they do not control typical
and durable metadata state. Additionally, only Capabilities can give a client
local control of a file's metadata or data.

Once all locks have been acquired, the cap-related locks are released and the
``quiescelock`` is relied on to prevent issuing Capabilities to clients for
the cap-related locks. This is controlled primarily by the
``CInode::get_caps_*`` methods. Releasing these locks is necessary to allow
other ranks with the replicated inode to quiesce without lock state
transitions resulting in deadlock. For example, a client wanting ``Xx`` on an
inode will trigger an ``xattrlock`` in ``LOCK_SYNC`` state to transition to
``LOCK_SYNC_EXCL``. That state would not allow another rank to acquire
``xattrlock`` for reading, thereby creating deadlock, subject to quiesce
timeout/expiration. (Quiesce cannot complete until all ranks quiesce the
tree.)

Finally, if the inode is a directory, the ``quiesce_inode`` operation
traverses all directory fragments and issues new ``quiesce_inode`` requests
for any child inodes.
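
The sketch below is a simplified, self-contained model of this fan-out (the
``Inode`` struct and its boolean lock fields are illustrative stand-ins, not
the actual MDS types): each ``quiesce_inode`` takes the cap-related locks and
the ``quiescelock``, drops the cap-related locks while keeping the
``quiescelock``, and recurses into any children.

.. code-block:: cpp

   // Simplified model of the quiesce_inode fan-out described above.
   // All types and names here are hypothetical, not the MDS implementation.
   #include <iostream>
   #include <memory>
   #include <string>
   #include <vector>

   struct Inode {
     std::string name;
     bool is_dir = false;
     std::vector<std::shared_ptr<Inode>> children;  // stand-in for dirfrags
     // Cap-related lock state (filelock, authlock, linklock, xattrlock) and
     // the quiescelock, modeled as simple booleans.
     bool cap_locks_held = false;
     bool quiescelock_xlocked = false;
   };

   void quiesce_inode(const std::shared_ptr<Inode>& in) {
     in->cap_locks_held = true;       // filelock/authlock/linklock/xattrlock
     in->quiescelock_xlocked = true;  // the new local lock, held exclusively
     in->cap_locks_held = false;      // released so other ranks can quiesce
     std::cout << "quiesced " << in->name << "\n";
     if (in->is_dir)
       for (auto& child : in->children)
         quiesce_inode(child);        // one sub-request per child inode
   }

   int main() {
     auto root = std::make_shared<Inode>();
     root->name = "/";
     root->is_dir = true;
     root->children = {std::make_shared<Inode>(), std::make_shared<Inode>()};
     root->children[0]->name = "/a";
     root->children[1]->name = "/b";
     quiesce_inode(root);  // models the quiesce_path request fanning out
   }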

Inode Quiescelock
-----------------

The ``quiescelock`` is a new local lock for inodes which supports quiescing
I/O. It is a type of superlock where every client or MDS operation which
requires a wrlock or xlock on a "cap-related" inode lock will also implicitly
acquire a wrlock on the ``quiescelock``.

.. note:: A local lock supports multiple writers and only one exclusive
          locker. No read locks.

During normal operation in the MDS, the ``quiescelock`` is never held except
for writing. However, when a subtree is quiesced, the ``quiesce_inode``
internal operation will hold ``quiescelock`` exclusively for the entire
lifetime of the ``quiesce_inode`` operation. This will deny the **new**
acquisition of any other cap-related inode lock. The ``quiescelock`` must be
ordered before all other locks (see ``src/include/ceph_fs.h`` for ordering)
in order to act as this superlock.
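
To illustrate those semantics, here is a toy model of a local lock (an
illustrative sketch only, not Ceph's actual lock classes): any number of
wrlockers may hold it concurrently, a single xlocker excludes everyone, and
there are no read locks.

.. code-block:: cpp

   // Minimal model of the "local lock" semantics noted above: many
   // concurrent wrlockers, or exactly one exclusive (xlock) holder.
   #include <cassert>

   class LocalLockModel {
     int num_wrlocks = 0;
     bool xlocked = false;
   public:
     bool try_wrlock() {            // many writers may hold the lock at once
       if (xlocked) return false;   // ...unless someone holds it exclusively
       ++num_wrlocks; return true;
     }
     void put_wrlock() { assert(num_wrlocks > 0); --num_wrlocks; }
     bool try_xlock() {             // one exclusive holder, no writers
       if (xlocked || num_wrlocks > 0) return false;
       xlocked = true; return true;
     }
     void put_xlock() { assert(xlocked); xlocked = false; }
   };

   int main() {
     LocalLockModel quiescelock;
     assert(quiescelock.try_wrlock());   // normal operation: wrlocks succeed
     assert(quiescelock.try_wrlock());   // ...even concurrently
     assert(!quiescelock.try_xlock());   // quiesce must wait for wrlockers
     quiescelock.put_wrlock();
     quiescelock.put_wrlock();
     assert(quiescelock.try_xlock());    // quiesce_inode xlocks it...
     assert(!quiescelock.try_wrlock());  // ...denying new cap-related locks
   }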

One primary reason for this ``quiescelock`` is to prevent a client request
from blocking on acquiring locks held by ``quiesce_inode`` (e.g. ``filelock``
or ``quiescelock``) while still holding locks obtained during normal path
traversal. Notably, the important locks are the ``snaplock`` and
``policylock`` obtained via ``Locker::try_rdlock_snap_layout`` on all parents
of the root inode of the request (the ``ino`` in the ``filepath`` struct). If
that operation waits with those locks held, then a future ``mksnap`` on the
root inode will be impossible.

.. note:: The ``mksnap`` RPC only acquires a wrlock (write lock) on the
          ``snaplock`` for the inode to be snapshotted.

The way ``quiescelock`` helps prevent this is by being the first **mandatory**
lock acquired when acquiring a wrlock or xlock on a cap-related lock.
Additionally, there is special handling when it cannot be acquired: all locks
held by the operation are dropped and the operation waits for the
``quiescelock`` to be available. The lock is mandatory in that a call to
``Locker::acquire_locks`` with a wrlock/xlock on a cap-related lock will
automatically include (add) the ``quiescelock``.

So, the expected flow under quiesce is that an operation like ``mkdir`` will
perform its path traversal, acquiring parent and dentry locks, then attempt
to acquire locks on the parent inode necessary for the creation of a dentry.
The operation will fail to acquire a wrlock on the automatically included
``quiescelock``, add itself to the ``quiescelock`` wait list, and then drop
all held locks.
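
The following sketch models that control flow with hypothetical types standing
in for the MDS Locker: a wrlock/xlock on any cap-related lock implicitly adds
the ``quiescelock``, and when it is unavailable the operation drops everything
and waits.

.. code-block:: cpp

   // Sketch of the mandatory-inclusion and drop-and-wait behavior described
   // above. The acquire_locks() helper is a hypothetical stand-in, purely to
   // illustrate the control flow.
   #include <algorithm>
   #include <iostream>
   #include <vector>

   enum class LockType { quiescelock, filelock, authlock, linklock, xattrlock };

   struct LockRequest { LockType type; bool wr_or_x; };

   static bool is_cap_related(LockType t) { return t != LockType::quiescelock; }

   // Returns true when all locks were obtained; false means the operation
   // dropped everything and parked itself on the quiescelock wait list.
   bool acquire_locks(std::vector<LockRequest> wanted, bool tree_quiesced) {
     // Mandatory: a wrlock/xlock on any cap-related lock implicitly adds a
     // wrlock on the quiescelock, ordered before all other locks.
     bool needs_quiescelock = std::any_of(wanted.begin(), wanted.end(),
         [](const LockRequest& l) { return l.wr_or_x && is_cap_related(l.type); });
     if (needs_quiescelock && tree_quiesced) {
       // quiesce_inode holds the quiescelock exclusively: drop all held
       // locks and wait, rather than blocking with snaplock/policylock held.
       std::cout << "quiescelock unavailable: dropping locks, waiting\n";
       return false;
     }
     std::cout << "acquired " << wanted.size()
               << (needs_quiescelock ? " locks (+quiescelock)\n" : " locks\n");
     return true;
   }

   int main() {
     std::vector<LockRequest> mkdir_locks = {
         {LockType::filelock, true},   // wrlock on the parent's filelock
         {LockType::linklock, true}};  // wrlock for the new dentry
     acquire_locks(mkdir_locks, /*tree_quiesced=*/false);  // normal: succeeds
     acquire_locks(mkdir_locks, /*tree_quiesced=*/true);   // quiesced: waits
   }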

Lookups and Exports
-------------------

Quiescing a tree results in a ``quiesce_inode`` operation for each inode under
the tree. Those operations have a shared lifetime tied to the parent
``quiesce_path`` operation. So, once the operations complete the quiesce (but
do not finish and release locks), they sit with locks held and do not monitor
the state of the tree. This means we need to handle cases where new metadata
is imported.

If an inode is fetched via a directory ``lookup`` or ``readdir``, the MDS will
check if its parent is quiesced (i.e. is the parent directory's
``quiescelock`` xlocked?). If so, the MDS will immediately issue and dispatch
a ``quiesce_inode`` operation for that inode. Because it's a fresh inode, the
operation will immediately succeed and prevent the client from being issued
inappropriate capabilities.
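
A minimal sketch of this check, with hypothetical names, might look like:

.. code-block:: cpp

   // Sketch of the lookup/readdir handling described above: when a fresh
   // inode is fetched under a quiesced parent, immediately dispatch a
   // quiesce_inode for it so no inappropriate capabilities are issued.
   #include <iostream>
   #include <string>

   struct Dir { bool quiescelock_xlocked; };

   void dispatch_quiesce_inode(const std::string& ino) {
     // A fresh inode has no caps outstanding, so this succeeds immediately.
     std::cout << "quiesce_inode(" << ino << ") dispatched\n";
   }

   void on_inode_fetched(const Dir& parent, const std::string& ino) {
     if (parent.quiescelock_xlocked)  // is the parent directory quiesced?
       dispatch_quiesce_inode(ino);   // then quiesce the new inode right away
   }

   int main() {
     on_inode_fetched(Dir{true}, "0x10000000001");   // quiesced parent
     on_inode_fetched(Dir{false}, "0x10000000002");  // unquiesced: no-op
   }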

The second case is handling subtree imports from another rank. This is
problematic since the subtree import may have inodes with inappropriate state
that would invalidate the guarantees of the reportedly "quiesced" tree. To
avoid this, the importer MDS will skip discovery of the root inode for an
import if it encounters a directory inode that is quiesced. If skipped, the
rank will send a NAK message back to the exporter, which will abort the
export.
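
As a sketch of that decision, with hypothetical types standing in for the
importer's discovery handling:

.. code-block:: cpp

   // Sketch of the import-abort handling described above: an importer that
   // encounters a quiesced directory inode while discovering an import's
   // root skips the discovery and NAKs the exporter, aborting the export.
   #include <iostream>

   struct DirInode { bool quiesced; };

   enum class ImportReply { Ack, Nak };

   ImportReply handle_export_discover(const DirInode& dir) {
     if (dir.quiesced) {
       // Importing under a quiesced tree could introduce inodes with state
       // that violates the quiesce guarantees, so refuse the import.
       std::cout << "quiesced: NAK to exporter, export aborted\n";
       return ImportReply::Nak;
     }
     std::cout << "ACK: proceed with import\n";
     return ImportReply::Ack;
   }

   int main() {
     handle_export_discover(DirInode{false});  // normal import proceeds
     handle_export_discover(DirInode{true});   // quiesced: exporter aborts
   }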