mirror of https://github.com/ceph/ceph
78 lines
3.7 KiB
ReStructuredText
78 lines
3.7 KiB
ReStructuredText
=================================
|
|
CephFS Distributed Metadata Cache
|
|
=================================
|
|
While the data for inodes in a Ceph file system is stored in RADOS and
|
|
accessed by the clients directly, inode metadata and directory
|
|
information is managed by the Ceph metadata server (MDS). The MDS's
|
|
act as mediator for all metadata related activity, storing the resulting
|
|
information in a separate RADOS pool from the file data.
|
|
|
|
CephFS clients can request that the MDS fetch or change inode metadata
|
|
on its behalf, but an MDS can also grant the client **capabilities**
|
|
(aka **caps**) for each inode (see :doc:`/cephfs/capabilities`).
|
|
|
|
A capability grants the client the ability to cache and possibly
|
|
manipulate some portion of the data or metadata associated with the
|
|
inode. When another client needs access to the same information, the MDS
|
|
will revoke the capability and the client will eventually return it,
|
|
along with an updated version of the inode's metadata (in the event that
|
|
it made changes to it while it held the capability).
|
|
|
|
Clients can request capabilities and will generally get them, but when
|
|
there is competing access or memory pressure on the MDS, they may be
|
|
**revoked**. When a capability is revoked, the client is responsible for
|
|
returning it as soon as it is able. Clients that fail to do so in a
|
|
timely fashion may end up **blocklisted** and unable to communicate with
|
|
the cluster.
|
|
|
|
Since the cache is distributed, the MDS must take great care to ensure
|
|
that no client holds capabilities that may conflict with other clients'
|
|
capabilities, or operations that it does itself. This allows cephfs
|
|
clients to rely on much greater cache coherence than a filesystem like
|
|
NFS, where the client may cache data and metadata beyond the point where
|
|
it has changed on the server.
|
|
|
|
Client Metadata Requests
|
|
------------------------
|
|
When a client needs to query/change inode metadata or perform an
|
|
operation on a directory, it has two options. It can make a request to
|
|
the MDS directly, or serve the information out of its cache. With
|
|
CephFS, the latter is only possible if the client has the necessary
|
|
caps.
|
|
|
|
Clients can send simple requests to the MDS to query or request changes
|
|
to certain metadata. The replies to these requests may also grant the
|
|
client a certain set of caps for the inode, allowing it to perform
|
|
subsequent requests without consulting the MDS.
|
|
|
|
Clients can also request caps directly from the MDS, which is necessary
|
|
in order to read or write file data.
|
|
|
|
Distributed Locks in an MDS Cluster
|
|
-----------------------------------
|
|
When an MDS wants to read or change information about an inode, it must
|
|
gather the appropriate locks for it. The MDS cluster may have a series
|
|
of different types of locks on the given inode and each MDS may have
|
|
disjoint sets of locks.
|
|
|
|
If there are outstanding caps that would conflict with these locks, then
|
|
they must be revoked before the lock can be acquired. Once the competing
|
|
caps are returned to the MDS, then it can get the locks and do the
|
|
operation.
|
|
|
|
On a filesystem served by multiple MDS', the metadata cache is also
|
|
distributed among the MDS' in the cluster. For every inode, at any given
|
|
time, only one MDS in the cluster is considered **authoritative**. Any
|
|
requests to change that inode must be done by the authoritative MDS,
|
|
though non-authoritative MDS can forward requests to the authoritative
|
|
one.
|
|
|
|
Non-auth MDS' can also obtain read locks that prevent the auth MDS from
|
|
changing the data until the lock is dropped, so that they can serve
|
|
inode info to the clients.
|
|
|
|
The auth MDS for an inode can change over time as well. The MDS' will
|
|
actively balance responsibility for the inode cache amongst
|
|
themselves, but this can be overridden by **pinning** certain subtrees
|
|
to a single MDS.
|