Locking infrastructure in MDS is (obviously) to protect the state of various metadata. MDS has different locks covering different portions of inode and dentry. Moreover, MDS uses different kinds of locks since different metadata (in inode and dentry) have different behaviour in different situations. The MDS cache is distributed across multiple MDS ranks and across all clients. The locking infrastructure serves to ensure that all ranks and clients are consistent in their view of the file system.
Data managed by the MDS can be very large to practically have the entire data set in the memory of a single metadata server. This also results in a single point of failure. The MDS therefore has a concept of Distributed Subtree Partition. A directory tree can be divided into smaller sub-trees. This is done by recording heat (access frequency) of each node in the directory tree. When a sub-tree heat reaches a configured threshold, the MDS divides the sub-tree by splitting the directory fragment. Each fragment is responsible for a part of the original directory, however, there will be a single authority node for these fragments. Each MDS can bear the read and write requests after fragmentation. If a file is very frequently accessed, the MDS will generate multiple copies distributed across active MDSs to satisfy concurrent I/Os. Typically, there are multiple clients reading and writing to files. The MDS defines locking rules for the associated metadata, e.g., metadata which is rarely modified concurrently such as UID/GID for an inode, a shared read and exclusive write access rule would suffice. However, statistics of a directory may need to be updated by multiple clients at the same time. This large directory may have been divided (fragmented) into multiple shards and different clients could write to different shards. These shards can share the read and also support simultaneous writes.
Therefore, in addition to different lock types that cover different metadata pieces for an inode, the MDS has lock classes that define access rules for a particular lock type. Lock types and classes are explained further in this document.
Lock Types
----------
MDS defines a handful of lock types associated with different metadata for an inode or dentry. Lock type protecting metadata for an inode and dentry are as follows::
..note:: Locking rules when modifying `ctime` is a bit different - either under `versionlock` or under no specific lock at all (i.e., it can be modified with other locks held, e.g., when modifying (say) uid/gid under `CEPH_LOCK_IAUTH`).
Lock Classes
------------
Lock classes define locking behaviour for the associated lock type necessary for handling distributed locks. The MDS defines 3 lock classes::
LocalLock - Used for data that does not require distributed locking such as inode or dentry version information. Local locks are versioned locks.
SimpleLock - Used for data that requires shared read and mutually exclusive write. This lock class is also the base class for other lock classes and specifies most of the locking behaviour for implementing distributed locks.
ScatterLock - Used for data that requires shared read and shared write. Typical use is where an MDS can delegate some authority to other MDS replicas, e.g., replica MDSs can satisfy read capabilities for clients.
..note:: In addition, MDS defines FileLock which is a special case of ScatterLock used for data that requires shared read and shared write, but also for protecting other pieces of metadata that require shared read and mutually exclusive write.
There are 3 modes in which a lock can be acquired::
rdlock - shared read lock
wrlock - shared write lock
xlock - exclusive lock
`rdlock` and `xlock` are self explanatory.
`wrlock` is special since it allows concurrent writers and is valid for `ScatterLock` and `FileLock` class. From the earlier section it can be seen that `INEST` and `IDFT` are of `ScatterLock` class. `wrlock` allows multiple writers at the same time, .e.g., when a (large) directory is split into multiple shards (after fragmentation) and each shard is "assigned" to an active MDS. When new files are created under these directories, the recursive stats are independently updated on the active MDSs. Later, to fetch the updated stats, the "scattered" data is aggregated ("gathered") on the auth MDS (of the inode); which typically happens when a `rdlock` is requested on this lock type.
..note:: MDS also defines `remote_wrlock` which is primarily used during rename operations when the destination dentry is on another (active) MDS than the source MDS.
MDS defines various lock states (defined in `src/mds/locks.h` source). Not all lock states are valid for a given lock class. Each lock class defines its own lock transition rules and are organized as Lock State Machines. The lock states (`LOCK_*`) are not locks themselves, but control if a lock is allowed to be taken. Each state follows `LOCK_<STATE>` or `LOCK_<FROM_STATE>_<TO_STATE>` naming terminology and can be summed up as::
LOCK_SYNC - anybody (ANY) can read lock, no one can write lock and exclusive lock
LOCK_LOCK - no one can read lock, only primary (AUTH) mds can write lock or exclusive lock
LOCK_MIX - anybody (ANY) can write lock, no one can read lock or exclusive lock
LOCK_XLOCK - someone (client) is holding a exclusive lock
The Lock Transition table (section) use the following notions::
ANY - Auth or Replica MDS
AUTH - Auth MDS
XCL - Auth MDS or Exclusive client
Other lock states (such as `LOCK_XSYN`, `LOCK_TSYN`, etc..) are additional states that are defined as an optimization for certain client behaviour (`LOCK_XSYN` allows clients to keep the buffered writes and not flush it to the OSDs and temporarily pausing writes).
Intermediate lock states (`LOCK_<FROM_STATE>_<TO_STATE>`) denote transition of a lock from one state (`<FROM_STATE>`) to another (`<TO_STATE>`).
Each lock class defines its own Lock State Machine and can be found in `src/mds/locks.c` source. The state machines are explained when discussing Lock Transition in the section below.
Lock Transition
---------------
Transition of lock from one state to another is mostly prompted by a (client) request or a change that the MDS is undergoing, such as tree migration. Let's consider a simple case of two clients: One client does a `stat()` (`getattr()` or `lookup()`) to fetch UID/GID of a inode, and the other client does a `setattr()` to change the UID/GID of the same inode. The first client (most likely) has `As` (iauth shared) caps issued to it by the MDS. Now, when the other client does a `setattr()` call to the MDS, the MDS adds a `xlock` to the inodes' `authlock` (`CEPH_LOCK_IAUTH`)::
Server::handle_client_setattr()
if (mask & (CEPH_SETATTR_MODE|CEPH_SETATTR_UID|CEPH_SETATTR_GID|CEPH_SETATTR_BTIME|CEPH_SETATTR_KILL_SGUID))
lov.add_xlock(&cur->authlock);
Note that the MDS adds a bunch of other locks for this inode, but for now let's only work on IAUTH. Now, `CEPH_LOCK_IAUTH` is a `SimpleLock` class, and its lock transition state machine is::
// stable loner rep state r rp rd wr fwr l x caps,other