mirror of https://github.com/ceph/ceph (synced 2025-01-02 09:02:34 +00:00)
doc/dev: update quiesce developer document
To include changes relating to it now being a local lock that prevents mutable caps.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
parent 48766b336d
commit 719d30d277
@@ -1,8 +1,8 @@
 MDS Quiesce Protocol
 ====================

-The MDS quiesce protocol is a mechanism for "quiescing" (quieting) a tree
-in a file system, stopping all write (and most read) I/O.
+The MDS quiesce protocol is a mechanism for "quiescing" (quieting) a tree in a
+file system, stopping all write (and sometimes incidentally read) I/O.

 The purpose of this API is to prevent multiple clients from interleaving reads
 and writes across an eventually consistent snapshot barrier where out-of-band
@@ -10,6 +10,11 @@ communication exists between clients. This communication can lead to clients
 wrongly believing they've reached a checkpoint that is mutually recoverable
 via a snapshot.

+.. note:: This is documentation for the low-level mechanism in the MDS for
+          quiescing a tree of files. The higher-level QuiesceDb is the
+          intended API for clients to effect a quiesce.
+
+
 Mechanism
 ---------

@@ -18,76 +23,97 @@ appropriate locks on the root of a tree and then launches a series of
 sub-requests for locking other inodes in the tree. The locks obtained will
 force clients to release caps and in-progress client/MDS requests to complete.

-The sub-requests launched are ``quiesce_inode`` internal requests which simply
-lock the inode, if the MDS is authoritative for the inode. Generally, these
-are rdlocks (read locks) on each inode metadata lock but the ``filelock`` is
-xlocked (exclusively locked) because its type allows for multiple readers and
-writers. Additionally, a new ``quiescelock`` is exclusively locked (more on
-that next).
+The sub-requests launched are ``quiesce_inode`` internal requests. These will
+obtain "cap-related" locks which control capability state, including the
+``filelock``, ``authlock``, ``linklock``, and ``xattrlock``. Additionally, the
+new local lock ``quiescelock`` is acquired. More information on that lock is in
+the next section.

-Because the ``quiesce_inode`` request will xlock the ``filelock`` and
-``quiescelock``, it only does so if run on the authoritative MDS. It is
-expected that the glue layer on top of the quiesce protocol will execute the
-same ``quiesce_path`` operation on each MDS rank. This allows each rank which
-may be authoritative for part of the tree to lock all inodes it is
-authoritative for.
+Locks that are not cap-related are skipped because they do not control typical
+and durable metadata state. Additionally, only Capabilities can give a client
+local control of a file's metadata or data.
+
+Once all locks have been acquired, the cap-related locks are released and the
+``quiescelock`` is relied on to prevent issuing Capabilities to clients for the
+cap-related locks. This is controlled primarily by ``CInode::get_caps_*``
+methods. Releasing these locks is necessary to allow other ranks with the
+replicated inode to quiesce without lock state transitions resulting in
+deadlock. For example, a client wanting ``Xx`` on an inode will trigger a
+``xattrlock`` in ``LOCK_SYNC`` state to transition to ``LOCK_SYNC_EXCL``. That
+state would not allow another rank to acquire ``xattrlock`` for reading,
+thereby creating deadlock, subject to quiesce timeout/expiration. (Quiesce
+cannot complete until all ranks quiesce the tree.)
+
+Finally, if the inode is a directory, the ``quiesce_inode`` operation traverses
+all directory fragments and issues new ``quiesce_inode`` requests for any child
+inodes.


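The sequence described in the new Mechanism text can be illustrated with a toy model. This is not MDS code: the ``Inode`` class, its ``held`` set, and the Python ``quiesce_inode`` function are hypothetical stand-ins, and the sketch assumes a single rank that is authoritative for every inode and that every lock is granted immediately.

```python
from dataclasses import dataclass, field

# Lock names come from the text above; everything else is a toy assumption.
CAP_RELATED = ("filelock", "authlock", "linklock", "xattrlock")


@dataclass
class Inode:
    name: str
    children: list = field(default_factory=list)  # non-empty only for directories
    held: set = field(default_factory=set)        # locks held by the quiesce op


def quiesce_inode(inode):
    """Toy version of the three steps described for quiesce_inode."""
    # 1. Take the cap-related locks plus the quiescelock.
    inode.held.update(CAP_RELATED)
    inode.held.add("quiescelock")
    # 2. Drop the cap-related locks again; only the quiescelock stays held.
    inode.held.difference_update(CAP_RELATED)
    # 3. For a directory, launch a sub-request for every child inode.
    for child in inode.children:
        quiesce_inode(child)


def walk(inode):
    """Yield the inode and all of its descendants, depth first."""
    yield inode
    for child in inode.children:
        yield from walk(child)


root = Inode("root", [Inode("a", [Inode("a/f")]), Inode("b")])
quiesce_inode(root)
```

After the call, every inode in the toy tree holds only ``quiescelock``, which mirrors the statement that the cap-related locks are released once acquired.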
 Inode Quiescelock
 -----------------

-The ``quiescelock`` is a new lock for inodes which supports quiescing I/O. It
-is a type of superlock where every client or MDS operation which accesses an
-inode lock will also implicitly acquire the ``quiescelock`` (readonly). In
-general, this lock is never held except for reading. When a subtree is
-quiesced, the ``quiesce_inode`` internal operation will hold ``quiescelock``
-exclusively, thereby denying the **new** acquisition of any other inode lock.
-The ``quiescelock`` must be ordered before all other locks (see
-``src/include/ceph_fs.h`` for ordering) in order to act as this superlock.
+The ``quiescelock`` is a new local lock for inodes which supports quiescing
+I/O. It is a type of superlock where every client or MDS operation which
+requires a wrlock or xlock on a "cap-related" inode lock will also implicitly
+acquire a wrlock on the ``quiescelock``.

-The reason for this lock is to prevent an operation from blocking on acquiring
-locks held by ``quiesce_inode`` while still holding locks obtained
-during path traversal. Notably, the important locks are the ``snaplock`` and
-``policylock`` obtained via ``Locker::try_rdlock_snap_layout`` on all parents
-of the root inode of the request (the ``ino`` in the ``filepath`` struct). If
-that operation waits with those locks held, then a future ``mksnap`` on the
-root inode will be impossible.
+.. note:: A local lock supports multiple writers but only one exclusive locker.
+          It does not support read locks.
+
+During normal operation in the MDS, the ``quiescelock`` is never held except
+for writing. However, when a subtree is quiesced, the ``quiesce_inode``
+internal operation will hold ``quiescelock`` exclusively for the entire
+lifetime of the ``quiesce_inode`` operation. This will deny the **new**
+acquisition of any other cap-related inode lock. The ``quiescelock`` must be
+ordered before all other locks (see ``src/include/ceph_fs.h`` for ordering) in
+order to act as this superlock.
+
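The local-lock semantics in the note (many writers, one exclusive locker, no readers) can be sketched as follows. The ``LocalLock`` class here is a hypothetical toy written for this illustration, not the MDS implementation, and it assumes a failed attempt simply returns ``False`` instead of queueing a waiter.

```python
class LocalLock:
    """Toy local lock: many wrlock holders or one xlock holder, no rdlocks."""

    def __init__(self):
        self.wrlockers = set()
        self.xlocker = None

    def try_wrlock(self, who):
        # Writers share the lock, but not with an exclusive holder.
        if self.xlocker is not None:
            return False
        self.wrlockers.add(who)
        return True

    def try_xlock(self, who):
        # The exclusive holder needs the lock entirely to itself.
        if self.xlocker is not None or self.wrlockers:
            return False
        self.xlocker = who
        return True

    def unlock(self, who):
        # Release whatever `who` holds, if anything.
        self.wrlockers.discard(who)
        if self.xlocker == who:
            self.xlocker = None


lock = LocalLock()
shared = [lock.try_wrlock("mkdir"), lock.try_wrlock("setattr")]
blocked_quiesce = lock.try_xlock("quiesce_inode")   # writers still hold it
lock.unlock("mkdir")
lock.unlock("setattr")
granted_quiesce = lock.try_xlock("quiesce_inode")   # now exclusive
blocked_writer = lock.try_wrlock("create")          # denied while quiesced
```

In this model, normal operations share the lock as writers, and once the quiesce operation holds it exclusively no new writer gets in, which is the behavior the paragraph above relies on.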
+One primary reason for this ``quiescelock`` is to prevent a client request from
+blocking on acquiring locks held by ``quiesce_inode`` (e.g. ``filelock`` or
+``quiescelock``) while still holding locks obtained during normal path
+traversal. Notably, the important locks are the ``snaplock`` and ``policylock``
+obtained via ``Locker::try_rdlock_snap_layout`` on all parents of the root
+inode of the request (the ``ino`` in the ``filepath`` struct). If that
+operation waits with those locks held, then a future ``mksnap`` on the root
+inode will be impossible.

 .. note:: The ``mksnap`` RPC only acquires a wrlock (write lock) on the
           ``snaplock`` for the inode to be snapshotted.

 The way ``quiescelock`` helps prevent this is by being the first **mandatory**
-lock acquired and the special handling when it cannot be acquired: all locks
-held by the operation are dropped and the operation waits for the
-``quiescelock`` to be available. The lock is mandatory in that all inode locks
-automatically include (add) the ``quiescelock`` when calling
-``Locker::acquire_locks``. So the expected normal flow is that an operation
-like ``getattr`` will perform its path traversal, acquiring parent and dentry
-locks, then attempt to acquire locks on the inode necessary for the requested
-client caps. The operation will fail to acquire the automatically included
-``quiescelock``, add itself to the ``quiescelock`` wait list, and then drop all
-held locks.
+lock acquired when acquiring a wrlock or xlock on a cap-related lock.
+There is also special handling when it cannot be acquired: all
+locks held by the operation are dropped and the operation waits for the
+``quiescelock`` to be available. The lock is mandatory in that a call to
+``Locker::acquire_locks`` with a wrlock/xlock on a cap-related lock will
+automatically include (add) the ``quiescelock``.

-There is a divergence in locking behavior for the root of the subvolume. The
-``quiescelock`` is only locked read-only. This allows the inode to be accessed
-by operations like ``mksnap`` which will implicitly acquire the ``quiescelock``
-read-only when locking the ``snaplock`` for writing. Additionally, if
-``Locker::acquire_locks`` will only acquire read locks without waiting, then it
-will skip the read-only lock on ``quiescelock``. This is to allow some forms of
-``lookup`` necessary for snapshot management (e.g. volumes plugin) at higher
-layers.
+So, the expected normal flow is that an operation like ``mkdir`` will perform
+its path traversal, acquiring parent and dentry locks, then attempt to acquire
+locks on the parent inode necessary for the creation of a dentry. The operation
+will fail to acquire a wrlock on the automatically included ``quiescelock``,
+add itself to the ``quiescelock`` wait list, and then drop all held locks.


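The "mandatory lock plus drop-everything-and-wait" flow can be sketched as a toy model. None of this is the real ``Locker::acquire_locks``: the ``Request`` class, the Python ``acquire_locks`` signature, the ``quiesced`` set, and the ``waiters`` dict are hypothetical names invented for this illustration, and the choice of ``filelock`` for ``mkdir`` is only an example.

```python
CAP_RELATED = {"filelock", "authlock", "linklock", "xattrlock"}


class Request:
    def __init__(self, name):
        self.name = name
        self.held = []  # (inode, lock, mode) tuples, e.g. from path traversal


def acquire_locks(req, inode, wanted, quiesced, waiters):
    """Toy model. `wanted` is a list of (lock, mode) pairs on `inode`;
    `quiesced` is the set of inodes whose quiescelock is xlocked."""
    needs_quiescelock = any(
        lock in CAP_RELATED and mode in ("wrlock", "xlock")
        for lock, mode in wanted
    )
    if needs_quiescelock:
        # The quiescelock is added automatically and is taken first.
        wanted = [("quiescelock", "wrlock")] + list(wanted)
    for lock, mode in wanted:
        if lock == "quiescelock" and inode in quiesced:
            # Special handling: join the wait list and hold nothing meanwhile.
            waiters.setdefault(inode, []).append(req.name)
            req.held.clear()
            return False
        req.held.append((inode, lock, mode))
    return True


quiesced = {"/vol/dir"}
waiters = {}

mkdir = Request("mkdir")
mkdir.held.append(("/vol", "snaplock", "rdlock"))  # taken during traversal
ok_mkdir = acquire_locks(
    mkdir, "/vol/dir", [("filelock", "wrlock")], quiesced, waiters)

reader = Request("reader")
ok_reader = acquire_locks(
    reader, "/vol/dir", [("filelock", "rdlock")], quiesced, waiters)
```

In the toy run, the ``mkdir`` request ends up on the wait list with no locks held, so the ``snaplock`` it took during traversal no longer stands in the way of a later ``mksnap``. The read-only request never has the ``quiescelock`` added, matching the wrlock/xlock condition in the new text.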
-Readable quiesced tree
-----------------------
+Lookups and Exports
+-------------------

-It may be desirable to allow readers to continue accessing a quiesced
-subvolume. One way to do that is to have a separate superlock (yuck) for read
-access, say ``quiescerlock``. If a "readable" quiesce is performed, then
-``quiescerlock`` is not xlocked by ``quiesce_inode``. Read locks on
-other (non-quiesce) locks will acquire a read lock only on ``quiescerlock`` and
-no longer on ``quiescelock``. Write locks would try to acquire both
-``quiescelock`` and ``quiescerlock`` (since writes may also read).
+Quiescing a tree results in a number of ``quiesce_inode`` operations for each
+inode under the tree. Those operations have a shared lifetime tied to the
+parent ``quiesce_path`` operation. So, once operations complete quiesce (but do
+not finish and release locks), the operations sit with locks held and do not
+monitor the state of the tree. This means we need to handle cases where new
+metadata is imported.

-Ideally, a new lock type could be used to handle both cases but no
-such lock type yet exists.
+If an inode is fetched via a directory ``lookup`` or ``readdir``, the MDS will
+check if its parent is quiesced (i.e. is the parent directory ``quiescelock``
+xlocked?). If so, the MDS will immediately dispatch a
+``quiesce_inode`` operation for that inode. Because it's a fresh inode, the
+operation will immediately succeed and prevent the client from being issued
+inappropriate capabilities.
+
+The second case is handling subtree imports from another rank. This is
+problematic since the subtree import may have inodes with inappropriate state
+that would invalidate the guarantees of the reportedly "quiesced" tree. To
+avoid this, the importer MDS will skip discovery of the root inode for an
+import if it encounters a directory inode that is quiesced. If skipped, the rank
+will send a NAK message back to the exporter which will abort the export.
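The two cases in "Lookups and Exports" can be sketched with a toy model. All names here (``Inode``, ``on_lookup``, ``on_import_discover``, the ``"ACK"`` return value) are hypothetical and chosen for illustration only; representing discovery as a walk over a list of inodes is likewise an assumption, not a description of the MDS migrator.

```python
class Inode:
    def __init__(self, name, parent=None, is_dir=False):
        self.name = name
        self.parent = parent
        self.is_dir = is_dir
        self.quiescelock_xlocked = False


def quiesce_inode(inode):
    # Stand-in for dispatching the internal quiesce_inode operation.
    inode.quiescelock_xlocked = True


def on_lookup(inode):
    """Called when an inode is fetched via lookup or readdir."""
    parent = inode.parent
    if parent is not None and parent.quiescelock_xlocked:
        # The parent is quiesced, so quiesce the fresh inode right away.
        quiesce_inode(inode)
    return inode


def on_import_discover(path):
    """Importer side: refuse the import if a quiesced directory is met."""
    for inode in path:
        if inode.is_dir and inode.quiescelock_xlocked:
            return "NAK"  # the exporter aborts the export
    return "ACK"          # hypothetical "go ahead" reply


root = Inode("root", is_dir=True)
quiesce_inode(root)
fresh = on_lookup(Inode("root/file", parent=root))
other = Inode("other", is_dir=True)
```

In this sketch a freshly loaded child of a quiesced directory is quiesced before any caps could be issued, and an import that touches a quiesced directory is refused rather than merged into the supposedly quiesced tree.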