mirror of
https://github.com/ceph/ceph
synced 2025-01-20 01:51:34 +00:00
doc/dev: update quiesce developer document
To include changes relating to it now being a local lock that prevents mutable caps.
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
parent
48766b336d
commit
719d30d277
@ -1,8 +1,8 @@
|
|||||||
MDS Quiesce Protocol
|
MDS Quiesce Protocol
|
||||||
====================
|
====================
|
||||||
|
|
||||||
The MDS quiesce protocol is a mechanism for "quiescing" (quieting) a tree
|
The MDS quiesce protocol is a mechanism for "quiescing" (quieting) a tree in a
|
||||||
in a file system, stopping all write (and most read) I/O.
|
file system, stopping all write (and sometimes incidentally read) I/O.
|
||||||
|
|
||||||
The purpose of this API is to prevent multiple clients from interleaving reads
|
The purpose of this API is to prevent multiple clients from interleaving reads
|
||||||
and writes across an eventually consistent snapshot barrier where out-of-band
|
and writes across an eventually consistent snapshot barrier where out-of-band
|
||||||
@ -10,6 +10,11 @@ communication exists between clients. This communication can lead to clients
|
|||||||
wrongly believing they've reached a checkpoint that is mutually recoverable to
|
wrongly believing they've reached a checkpoint that is mutually recoverable to
|
||||||
via a snapshot.
|
via a snapshot.
|
||||||
|
|
||||||
|
.. note:: This is documentation for the low-level mechanism in the MDS for
|
||||||
|
quiescing a tree of files. The higher-level QuiesceDb is the
|
||||||
|
intended API for clients to effect a quiesce.
|
||||||
|
|
||||||
|
|
||||||
Mechanism
|
Mechanism
|
||||||
---------
|
---------
|
||||||
|
|
||||||
@ -18,76 +23,97 @@ appropriate locks on the root of a tree and then launches a series of
|
|||||||
sub-requests for locking other inodes in the tree. The locks obtained will
|
sub-requests for locking other inodes in the tree. The locks obtained will
|
||||||
force clients to release caps and in-progress client/MDS requests to complete.
|
force clients to release caps and in-progress client/MDS requests to complete.
|
||||||
|
|
||||||
The sub-requests launched are ``quiesce_inode`` internal requests which simply
|
The sub-requests launched are ``quiesce_inode`` internal requests. These will
|
||||||
lock the inode, if the MDS is authoritative for the inode. Generally, these
|
obtain "cap-related" locks which control capability state, including the
|
||||||
are rdlocks (read locks) on each inode metadata lock but the ``filelock`` is
|
``filelock``, ``authlock``, ``linklock``, and ``xattrlock``. Additionally, the
|
||||||
xlocked (exclusively locked) because its type allows for multiple readers and
|
new local lock ``quiescelock`` is acquired. More information on that lock is in
|
||||||
writers. Additionally, a new ``quiescelock`` is exclusively locked (more on
|
the next section.
|
||||||
that next).
|
|
||||||
|
|
||||||
Because the ``quiesce_inode`` request will xlock the ``filelock`` and
|
Locks that are not cap-related are skipped because they do not control typical
|
||||||
``quiescelock``, it only does so if run on the authoritative MDS. It is
|
and durable metadata state. Additionally, only Capabilities can give a client
|
||||||
expected that the glue layer on top of the quiesce protocol will execute the
|
local control of a file's metadata or data.
|
||||||
same ``quiesce_path`` operation on each MDS rank. This allows each rank which
|
|
||||||
may be authoritative for part of the tree to lock all inodes it is
|
Once all locks have been acquired, the cap-related locks are released and the
|
||||||
authoritative for.
|
``quiescelock`` is relied on to prevent issuing Capabilities to clients for the
|
||||||
|
cap-related locks. This is controlled primarily by ``CInode::get_caps_*``
|
||||||
|
methods. Releasing these locks is necessary to allow other ranks with the
|
||||||
|
replicated inode to quiesce without lock state transitions resulting in
|
||||||
|
deadlock. For example, a client wanting ``Xx`` on an inode will trigger a
|
||||||
|
``xattrlock`` in ``LOCK_SYNC`` state to transition to ``LOCK_SYNC_EXCL``. That
|
||||||
|
state would not allow another rank to acquire ``xattrlock`` for reading,
|
||||||
|
thereby creating deadlock, subject to quiesce timeout/expiration. (Quiesce
|
||||||
|
cannot complete until all ranks quiesce the tree.)
|
||||||
|
|
||||||
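The sequence described above can be sketched as a toy model. This is only an illustration under stated assumptions: ``ToyInode``, ``toy_quiesce_inode`` and ``may_issue_mutable_caps`` are hypothetical names, not MDS code, and the real logic lives in the C++ ``quiesce_inode`` request and the ``CInode`` cap methods.

```python
# Toy model only: names below are illustrative, not Ceph APIs.
CAP_RELATED_LOCKS = ("filelock", "authlock", "linklock", "xattrlock")


class ToyInode:
    def __init__(self, name):
        self.name = name
        self.quiescelock_xlocked = False
        self.held_cap_locks = set()  # cap-related locks held by the quiesce op

    def may_issue_mutable_caps(self):
        # Stand-in for the checks made by the CInode get_caps_* methods.
        return not self.quiescelock_xlocked


def toy_quiesce_inode(inode):
    # 1. Take the quiescelock exclusively (it is ordered before other locks).
    inode.quiescelock_xlocked = True
    # 2. Take the cap-related locks, which forces clients to release caps.
    inode.held_cap_locks.update(CAP_RELATED_LOCKS)
    # 3. Release the cap-related locks so that other ranks holding a replica
    #    can quiesce too; the quiescelock stays held.
    inode.held_cap_locks.clear()


inode = ToyInode("dir/file")
toy_quiesce_inode(inode)
print(inode.may_issue_mutable_caps(), sorted(inode.held_cap_locks))
```

After the call, no cap-related lock is held, yet the toy inode still refuses to issue caps because the quiescelock remains exclusively locked.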
|
Finally, if the inode is a directory, the ``quiesce_inode`` operation traverses
|
||||||
|
all directory fragments and issues new ``quiesce_inode`` requests for any child
|
||||||
|
inodes.
|
||||||
|
|
||||||
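The directory traversal can be pictured with a small recursive sketch. The dictionary layout and ``toy_quiesce_tree`` are assumptions made for illustration; the MDS issues separate internal requests rather than recursing in one call.

```python
# Toy model only: the dict layout and function name are hypothetical.
def toy_quiesce_tree(inode, quiesced):
    """Quiesce `inode`, then every child reachable through its fragments."""
    quiesced.append(inode["name"])
    # Only directories carry fragments; each fragment lists child inodes.
    for frag in inode.get("dirfrags", []):
        for child in frag:
            toy_quiesce_tree(child, quiesced)
    return quiesced


tree = {
    "name": "/vol",
    "dirfrags": [
        [
            {"name": "/vol/a"},
            {"name": "/vol/b", "dirfrags": [[{"name": "/vol/b/c"}]]},
        ],
        [{"name": "/vol/z"}],
    ],
}
print(toy_quiesce_tree(tree, []))
```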
|
|
||||||
Inode Quiescelock
|
Inode Quiescelock
|
||||||
-----------------
|
-----------------
|
||||||
|
|
||||||
The ``quiescelock`` is a new lock for inodes which supports quiescing I/O. It
|
The ``quiescelock`` is a new local lock for inodes which supports quiescing
|
||||||
is a type of superlock where every client or MDS operation which accesses an
|
I/O. It is a type of superlock where every client or MDS operation which
|
||||||
inode lock will also implicitly acquire the ``quiescelock`` (readonly). In
|
requires a wrlock or xlock on a "cap-related" inode lock will also implicitly
|
||||||
general, this lock is never held except for reading. When a subtree is
|
acquire a wrlock on the ``quiescelock``.
|
||||||
quiesced, the ``quiesce_inode`` internal operation will hold ``quiescelock``
|
|
||||||
exclusively, thereby denying the **new** acquisition of any other inode lock.
|
|
||||||
The ``quiescelock`` must be ordered before all other locks (see
|
|
||||||
``src/include/ceph_fs.h`` for ordering) in order to act as this superlock.
|
|
||||||
|
|
||||||
The reason for this lock is to prevent an operation from blocking on acquiring
|
.. note:: A local lock supports multiple writers and only one exclusive locker. It does not support read locks.
|
||||||
locks held by ``quiesce_inode`` while still holding locks obtained
|
|
||||||
during path traversal. Notably, the important locks are the ``snaplock`` and
|
During normal operation in the MDS, the ``quiescelock`` is never held except
|
||||||
``policylock`` obtained via ``Locker::try_rdlock_snap_layout`` on all parents
|
for writing. However, when a subtree is quiesced, the ``quiesce_inode``
|
||||||
of the root inode of the request (the ``ino`` in the ``filepath`` struct). If
|
internal operation will hold ``quiescelock`` exclusively for the entire
|
||||||
that operation waits with those locks held, then a future ``mksnap`` on the
|
lifetime of the ``quiesce_inode`` operation. This will deny the **new**
|
||||||
root inode will be impossible.
|
acquisition of any other cap-related inode lock. The ``quiescelock`` must be ordered
|
||||||
|
before all other locks (see ``src/include/ceph_fs.h`` for ordering) in order to
|
||||||
|
act as this superlock.
|
||||||
|
|
||||||
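The local lock semantics in the note can be modelled in a few lines. ``ToyLocalLock`` is a hypothetical stand-in, not the MDS lock class, and it assumes that an exclusive locker has to wait until current writers have released the lock.

```python
# Toy model only: a stand-in for a local lock, not the MDS implementation.
class ToyLocalLock:
    def __init__(self):
        self.writers = 0      # number of wrlock holders
        self.xlocked = False  # True while one exclusive locker holds it

    def try_wrlock(self):
        # Many writers may share the lock unless it is exclusively held.
        if self.xlocked:
            return False
        self.writers += 1
        return True

    def put_wrlock(self):
        assert self.writers > 0
        self.writers -= 1

    def try_xlock(self):
        # Assumption: the exclusive locker needs the lock free of writers.
        if self.xlocked or self.writers:
            return False
        self.xlocked = True
        return True

    def put_xlock(self):
        self.xlocked = False


lock = ToyLocalLock()
assert lock.try_wrlock() and lock.try_wrlock()  # normal operation
assert not lock.try_xlock()                     # must wait for writers
lock.put_wrlock()
lock.put_wrlock()
assert lock.try_xlock()                         # quiesce_inode holds it
assert not lock.try_wrlock()                    # new writers are denied
```

There is deliberately no read-lock method, mirroring the note above.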
|
One primary reason for this ``quiescelock`` is to prevent a client request from
|
||||||
|
blocking on acquiring locks held by ``quiesce_inode`` (e.g. ``filelock`` or
|
||||||
|
``quiescelock``) while still holding locks obtained during normal path
|
||||||
|
traversal. Notably, the important locks are the ``snaplock`` and ``policylock``
|
||||||
|
obtained via ``Locker::try_rdlock_snap_layout`` on all parents of the root
|
||||||
|
inode of the request (the ``ino`` in the ``filepath`` struct). If that
|
||||||
|
operation waits with those locks held, then a future ``mksnap`` on the root
|
||||||
|
inode will be impossible.
|
||||||
|
|
||||||
.. note:: The ``mksnap`` RPC only acquires a wrlock (write lock) on the
|
.. note:: The ``mksnap`` RPC only acquires a wrlock (write lock) on the
|
||||||
``snaplock`` for the inode to be snapshotted.
|
``snaplock`` for the inode to be snapshotted.
|
||||||
|
|
||||||
The way ``quiescelock`` helps prevent this is by being the first **mandatory**
|
The way ``quiescelock`` helps prevent this is by being the first **mandatory**
|
||||||
lock acquired and the special handling when it cannot be acquired: all locks
|
lock acquired when acquiring a wrlock or xlock on a cap-related lock.
|
||||||
held by the operation are dropped and the operation waits for the
|
Additionally, there is also special handling when it cannot be acquired: all
|
||||||
``quiescelock`` to be available. The lock is mandatory in that all inode locks
|
locks held by the operation are dropped and the operation waits for the
|
||||||
automatically include (add) the ``quiescelock`` when calling
|
``quiescelock`` to be available. The lock is mandatory in that a call to
|
||||||
``Locker::acquire_locks``. So the expected normal flow is that an operation
|
``Locker::acquire_locks`` with a wrlock/xlock on a cap-related lock will
|
||||||
like ``getattr`` will perform its path traversal, acquiring parent and dentry
|
automatically include (add) the ``quiescelock``.
|
||||||
locks, then attempt to acquire locks on the inode necessary for the requested
|
|
||||||
client caps. The operation will fail to acquire the automatically included
|
|
||||||
``quiescelock``, add itself to the ``quiescelock`` wait list, and then drop all
|
|
||||||
held locks.
|
|
||||||
|
|
||||||
There is a divergence in locking behavior for the root of the subvolume. The
|
So, the expected normal flow is that an operation like ``mkdir`` will perform
|
||||||
``quiescelock`` is only locked read-only. This allows the inode to be accessed
|
its path traversal, acquiring parent and dentry locks, then attempt to acquire
|
||||||
by operations like ``mksnap`` which will implicitly acquire the ``quiescelock``
|
locks on the parent inode necessary for the creation of a dentry. The operation
|
||||||
read-only when locking the ``snaplock`` for writing. Additionally, if
|
will fail to acquire a wrlock on the automatically included ``quiescelock``,
|
||||||
``Locker::acquire_locks`` will only acquire read locks without waiting, then it
|
add itself to the ``quiescelock`` wait list, and then drop all held locks.
|
||||||
will skip the read-only lock on ``quiescelock``. This is to allow some forms of
|
|
||||||
``lookup`` necessary for snapshot management (e.g. volumes plugin) at higher
|
|
||||||
layers.
|
|
||||||
|
|
||||||
|
|
||||||
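The flow above (automatic inclusion of the ``quiescelock``, then dropping every held lock when it cannot be acquired) can be sketched as follows. ``toy_acquire_locks`` and its data layout are hypothetical and only loosely mirror ``Locker::acquire_locks``; the wrlock on ``filelock`` stands in for whatever the real operation requests.

```python
# Toy model only: data layout and names are hypothetical, not Locker's API.
CAP_RELATED = {"filelock", "authlock", "linklock", "xattrlock"}


def toy_acquire_locks(op, inode, wanted):
    """Try to take `wanted`, a list of (lock_name, mode) pairs.

    mode is "rd", "wr" or "x". Returns True if the locks were taken.
    """
    needs_quiescelock = any(
        name in CAP_RELATED and mode in ("wr", "x") for name, mode in wanted
    )
    if needs_quiescelock:
        if inode["quiescelock_xlocked"]:
            # Special handling: drop everything and wait for the quiescelock.
            op["held"].clear()
            inode["quiescelock_waiters"].append(op["name"])
            return False
        # The quiescelock is the first mandatory lock acquired.
        op["held"].append(("quiescelock", "wr"))
    op["held"].extend(wanted)
    return True


parent = {"quiescelock_xlocked": True, "quiescelock_waiters": []}
mkdir = {"name": "mkdir", "held": [("snaplock", "rd"), ("policylock", "rd")]}
ok = toy_acquire_locks(mkdir, parent, [("filelock", "wr")])
print(ok, mkdir["held"], parent["quiescelock_waiters"])
```

Because the toy operation gives up its ``snaplock`` and ``policylock`` read locks before waiting, a later ``mksnap`` on an ancestor is not blocked by it.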
Readable quiesced tree
|
Lookups and Exports
|
||||||
----------------------
|
-------------------
|
||||||
|
|
||||||
It may be desirable to allow readers to continue accessing a quiesced
|
Quiescing a tree results in a number of ``quiesce_inode`` operations for each
|
||||||
subvolume. One way to do that is to have a separate superlock (yuck) for read
|
inode under the tree. Those operations have a shared lifetime tied to the
|
||||||
access, say ``quiescerlock``. If a "readable" quiesce is performed, then
|
parent ``quiesce_path`` operation. So, once operations complete quiesce (but do
|
||||||
``quiescerlock`` is not xlocked by ``quiesce_inode``. Read locks on
|
not finish and release locks), the operations sit with locks held and do not
|
||||||
other (non-quiesce) locks will acquire a read lock only on ``quiescerlock`` and
|
monitor the state of the tree. This means we need to handle cases where new
|
||||||
no longer on ``quiescelock``. Write locks would try to acquire both
|
metadata is imported.
|
||||||
``quiescelock`` and ``quiescerlock`` (since writes may also read).
|
|
||||||
|
|
||||||
Ideally, a new lock type could be used to handle both cases but no
|
If an inode is fetched via a directory ``lookup`` or ``readdir``, the MDS will
|
||||||
such lock type yet exists.
|
check if its parent is quiesced (i.e. is the parent directory ``quiescelock``
|
||||||
|
xlocked?). If so, the MDS will immediately issue and dispatch a
|
||||||
|
``quiesce_inode`` operation for that inode. Because it's a fresh inode, the
|
||||||
|
operation will immediately succeed and prevent the client from being issued
|
||||||
|
inappropriate capabilities.
|
||||||
|
|
||||||
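The lookup case can be sketched like this. ``toy_lookup`` and the dictionary fields are hypothetical; the sketch only shows the check on the parent's ``quiescelock`` and the immediate quiesce of the freshly fetched inode.

```python
# Toy model only: names and fields are hypothetical, not MDS code.
def toy_lookup(parent, child_name, dispatched):
    """Fetch a child inode and quiesce it if its parent is quiesced."""
    child = {"name": child_name, "quiescelock_xlocked": False}
    if parent["quiescelock_xlocked"]:
        # Stand-in for dispatching a quiesce_inode operation. The inode is
        # fresh, so nothing can be holding conflicting locks on it yet.
        dispatched.append(child_name)
        child["quiescelock_xlocked"] = True
    return child


dispatched = []
quiesced_dir = {"name": "/vol/d", "quiescelock_xlocked": True}
child = toy_lookup(quiesced_dir, "/vol/d/f", dispatched)
print(child["quiescelock_xlocked"], dispatched)
```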
|
The second case is handling subtree imports from another rank. This is
|
||||||
|
problematic since the subtree import may have inodes with inappropriate state
|
||||||
|
that would invalidate the guarantees of the reportedly "quiesced" tree. To
|
||||||
|
avoid this, the importer MDS will skip discovery of the root inode for an import if
|
||||||
|
it encounters a directory inode that is quiesced. If skipped, the rank
|
||||||
|
will send a NAK message back to the exporter which will abort the export.
|
||||||
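The import handling can be summarized with a toy decision function. ``toy_import_discover`` and ``toy_exporter_handle`` are hypothetical helpers and the string replies are placeholders for the real MDS messages.

```python
# Toy model only: function names and reply strings are hypothetical.
def toy_import_discover(path_inodes):
    """Importer side: walk toward the import root, NAK on a quiesced dir."""
    for inode in path_inodes:
        if inode["is_dir"] and inode["quiesced"]:
            return "NAK"  # discovery of the import root is skipped
    return "ACK"


def toy_exporter_handle(reply):
    """Exporter side: a NAK aborts the export."""
    return "export aborted" if reply == "NAK" else "export continues"


path = [
    {"name": "/vol", "is_dir": True, "quiesced": True},
    {"name": "/vol/sub", "is_dir": True, "quiesced": False},
]
print(toy_exporter_handle(toy_import_discover(path)))
```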