======================
Capabilities in CephFS
======================
When a client wants to operate on an inode, it queries the MDS in various
ways, and the MDS then grants the client a set of **capabilities**. These
grant the client permission to operate on the inode in various ways. One
of the major differences from other network file systems (e.g., NFS or SMB) is
that the capabilities granted are quite granular, and multiple clients can
hold different capabilities on the same inode.

Types of Capabilities
---------------------
There are several "generic" capability bits. These denote what sort of ability
the capability grants.

::

    /* generic cap bits */
    #define CEPH_CAP_GSHARED     1  /* (metadata) client can read (s) */
    #define CEPH_CAP_GEXCL       2  /* (metadata) client can read and update (x) */
    #define CEPH_CAP_GCACHE      4  /* (file) client can cache reads (c) */
    #define CEPH_CAP_GRD         8  /* (file) client can read (r) */
    #define CEPH_CAP_GWR        16  /* (file) client can write (w) */
    #define CEPH_CAP_GBUFFER    32  /* (file) client can buffer writes (b) */
    #define CEPH_CAP_GWREXTEND  64  /* (file) client can extend EOF (a) */
    #define CEPH_CAP_GLAZYIO   128  /* (file) client can perform lazy io (l) */

These are then shifted by a particular number of bits. The shift denotes the
part of the inode's data or metadata on which the capability is being granted:

::

    /* per-lock shift */
    #define CEPH_CAP_SAUTH   2 /* A */
    #define CEPH_CAP_SLINK   4 /* L */
    #define CEPH_CAP_SXATTR  6 /* X */
    #define CEPH_CAP_SFILE   8 /* F */

Only certain generic cap types are ever granted for some of those "shifts",
however. In particular, only the FILE shift ever has more than the first two
bits.

::

    | AUTH | LINK | XATTR | FILE
    2      4      6       8

From the above, we get a number of constants that are generated by taking
each bit value and shifting it to the correct position in the word:

::

    #define CEPH_CAP_AUTH_SHARED  (CEPH_CAP_GSHARED << CEPH_CAP_SAUTH)

These bits can then be or'ed together to make a bitmask denoting a set of
capabilities.

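As a minimal sketch of that shifting and or'ing, the following repeats the values from the listings above and builds one mask. The helper names ``cap_shift`` and ``example_read_mask`` are hypothetical and exist only for illustration; they are not part of the Ceph sources.

```c
#include <assert.h>
#include <stdint.h>

/* generic cap bits (values copied from the listing above) */
#define CEPH_CAP_GSHARED   1
#define CEPH_CAP_GCACHE    4
#define CEPH_CAP_GRD       8
#define CEPH_CAP_GWR      16

/* per-lock shifts (values copied from the listing above) */
#define CEPH_CAP_SAUTH   2
#define CEPH_CAP_SLINK   4
#define CEPH_CAP_SXATTR  6
#define CEPH_CAP_SFILE   8

/* Hypothetical helper: move generic bits into the field of one lock. */
static inline uint32_t cap_shift(uint32_t generic_bits, unsigned shift)
{
    assert(shift <= CEPH_CAP_SFILE);  /* FILE is the highest shift listed */
    return generic_bits << shift;
}

/* Hypothetical example mask: As, Ls, Xs and Fscr or'ed together. */
static inline uint32_t example_read_mask(void)
{
    return cap_shift(CEPH_CAP_GSHARED, CEPH_CAP_SAUTH)
         | cap_shift(CEPH_CAP_GSHARED, CEPH_CAP_SLINK)
         | cap_shift(CEPH_CAP_GSHARED, CEPH_CAP_SXATTR)
         | cap_shift(CEPH_CAP_GSHARED | CEPH_CAP_GCACHE | CEPH_CAP_GRD,
                     CEPH_CAP_SFILE);
}
```

With these values, ``CEPH_CAP_AUTH_SHARED`` works out to ``1 << 2``, that is 4, and the file write bit is absent from the example mask.
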
There is one exception:

::

    #define CEPH_CAP_PIN  1  /* no specific capabilities beyond the pin */

The "pin" just pins the inode into memory, without granting any other caps.

Graphically:

::

    +---+---+---+---+---+---+---+---+
    | p | _ |As   x |Ls   x |Xs   x |
    +---+---+---+---+---+---+---+---+
    |Fs   x   c   r   w   b   a   l |
    +---+---+---+---+---+---+---+---+

The second bit is currently unused.

Abilities granted by each cap
-----------------------------
While that is how capabilities are granted (and communicated), the important
part is what they actually allow the client to do:

* **PIN**: This just pins the inode into memory. This is sufficient to allow
  the client to get to the inode number, as well as other immutable things like
  major or minor numbers in a device inode, or symlink contents.

* **AUTH**: This grants the ability to get to the authentication-related
  metadata, in particular the owner, group and mode. Note that doing a full
  permission check may require getting at ACLs as well, which are stored in
  xattrs.

* **LINK**: This covers the link count of the inode.

* **XATTR**: This grants the ability to access or manipulate xattrs. Note that
  since ACLs are stored in xattrs, it's also sometimes necessary to access them
  when checking permissions.

* **FILE**: This is the big one. This allows the client to access and manipulate
  file data. It also covers certain metadata relating to file data, in
  particular the size, mtime, atime and ctime.

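The note about permission checks can be sketched as a mask test. This is an illustration only: ``caps_cover`` and ``caps_for_permission_check`` are hypothetical helpers, ``CEPH_CAP_XATTR_SHARED`` is assumed to follow the same pattern as ``CEPH_CAP_AUTH_SHARED``, and whether a real client decides exactly this way is also an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CEPH_CAP_PIN       1
#define CEPH_CAP_GSHARED   1
#define CEPH_CAP_SAUTH     2
#define CEPH_CAP_SXATTR    6

#define CEPH_CAP_AUTH_SHARED  (CEPH_CAP_GSHARED << CEPH_CAP_SAUTH)
/* Assumed to follow the same pattern as CEPH_CAP_AUTH_SHARED. */
#define CEPH_CAP_XATTR_SHARED (CEPH_CAP_GSHARED << CEPH_CAP_SXATTR)

/* Hypothetical helper: true if every bit in `needed` is present in `held`. */
static inline bool caps_cover(uint32_t held, uint32_t needed)
{
    return (held & needed) == needed;
}

/* Hypothetical helper: caps to hold before checking permissions locally. */
static inline uint32_t caps_for_permission_check(bool inode_has_acls)
{
    uint32_t needed = CEPH_CAP_AUTH_SHARED;   /* owner, group and mode */
    if (inode_has_acls)
        needed |= CEPH_CAP_XATTR_SHARED;      /* ACLs are stored in xattrs */
    return needed;
}

/* Usage example: a client holding only pAs cannot evaluate ACLs locally. */
void permission_check_example(void)
{
    uint32_t held = CEPH_CAP_PIN | CEPH_CAP_AUTH_SHARED;
    assert(caps_cover(held, caps_for_permission_check(false)));
    assert(!caps_cover(held, caps_for_permission_check(true)));
}
```
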
Shorthand
---------
Note that the client logging can also present a compact representation of the
capabilities. For example:

::

    pAsLsXsFs

The 'p' represents the pin. Each capital letter corresponds to one of the shift
values, and the lowercase letters after each capital letter are the actual
capabilities granted in that shift.

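A minimal sketch of how such a string could be produced from a bitmask, using the bit layout described earlier. The function names ``append_field`` and ``cap_mask_to_string`` are hypothetical; this is not the formatter used by the Ceph client.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CEPH_CAP_PIN     1
#define CEPH_CAP_SAUTH   2
#define CEPH_CAP_SLINK   4
#define CEPH_CAP_SXATTR  6
#define CEPH_CAP_SFILE   8

/* Letters for the eight generic bits, lowest bit first. */
static const char GENERIC_LETTERS[] = "sxcrwbal";

/* Append one lock's tag and letters if any of its bits are set.
 * Returns the new write position; always leaves room for the NUL. */
static size_t append_field(char *buf, size_t pos, size_t len,
                           char tag, uint32_t bits)
{
    if (bits == 0)
        return pos;
    if (pos + 1 < len)
        buf[pos++] = tag;
    for (int i = 0; i < 8; i++)
        if ((bits & (1u << i)) && pos + 1 < len)
            buf[pos++] = GENERIC_LETTERS[i];
    return pos;
}

/* Hypothetical formatter: write the shorthand for `caps` into `buf`. */
static char *cap_mask_to_string(uint32_t caps, char *buf, size_t len)
{
    size_t pos = 0;

    assert(buf != NULL && len > 0);
    if ((caps & CEPH_CAP_PIN) && pos + 1 < len)
        buf[pos++] = 'p';
    /* AUTH, LINK and XATTR are two bits wide; FILE has all eight. */
    pos = append_field(buf, pos, len, 'A', (caps >> CEPH_CAP_SAUTH) & 0x3);
    pos = append_field(buf, pos, len, 'L', (caps >> CEPH_CAP_SLINK) & 0x3);
    pos = append_field(buf, pos, len, 'X', (caps >> CEPH_CAP_SXATTR) & 0x3);
    pos = append_field(buf, pos, len, 'F', (caps >> CEPH_CAP_SFILE) & 0xff);
    buf[pos] = '\0';
    return buf;
}
```

A 20-byte buffer is enough for the longest possible string here: one ``p``, three fields of up to three characters, one field of up to nine, and the terminator.
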
The relation between the lock states and the capabilities
---------------------------------------------------------
In the MDS there are four different lock types for each inode: simplelock,
scatterlock, filelock and locallock. Each lock has several different lock
states, and the MDS issues capabilities to clients based on the lock
state.

In each state the MDS Locker will always try to issue all the capabilities that
the clients are allowed, even if some capabilities are not needed or wanted by
the clients, as pre-issuing capabilities could reduce latency in some cases.

If there is only one client, it will usually be the loner client for all the
inodes. With multiple clients, the MDS will try to select a loner client for
each inode based on the capabilities the clients need or want
(needed | wanted), but this usually fails. The loner client will always get all
the capabilities.

The filelock controls access to part of a file's metadata and to the file
contents. That metadata includes **mtime**, **atime**, **size**, etc.

* **Fs**: Once a client has it, all other clients are denied **Fw**.

* **Fx**: Only the loner client is allowed this capability. Once the lock state
  transitions to LOCK_EXCL, the loner client is granted this along with all other
  file capabilities except **Fl**.

* **Fr**: Once a client has it, the **Fb** capability will already have been
  revoked from all the other clients.

  If clients only request to read the file, the lock state will transition
  directly to the LOCK_SYNC stable state. All the clients can be granted the
  **Fscrl** capabilities from the auth MDS and the **Fscr** capabilities from
  the replica MDSes.

  If multiple clients read from and write to the same file, then the lock state
  will eventually transition to the LOCK_MIX stable state and all the clients
  can have the **Frwl** capabilities from the auth MDS, and **Fr** from the
  replica MDSes. The **Fcb** capabilities won't be granted to any of the
  clients, and the clients will do sync read/write.

* **Fw**: If there is no loner client, then once a client has this capability
  the **Fsxcb** capabilities won't be granted to other clients.

  If multiple clients read from and write to the same file, then the lock state
  will eventually transition to the LOCK_MIX stable state and all the clients
  can have the **Frwl** capabilities from the auth MDS, and **Fr** from the
  replica MDSes. The **Fcb** capabilities won't be granted to any of the
  clients, and the clients will do sync read/write.

* **Fc**: This capability means the clients can cache file reads. It should be
  issued together with the **Fr** capability, as it only makes sense in that
  case.

  In practice, some stable or interim transitional states tend to keep **Fc**
  allowed even when the **Fr** capability isn't granted, as this avoids forcing
  clients to drop their full caches, for example on a simple file size
  extension or truncate.

* **Fb**: This capability means the clients can buffer file writes. It should be
  issued together with the **Fw** capability, as it only makes sense in that
  case.

  In practice, some stable or interim transitional states tend to keep **Fb**
  allowed even when the **Fw** capability isn't granted, as this avoids forcing
  clients to drop dirty buffers, for example on a simple file size extension or
  truncate.

* **Fl**: This capability means the clients can perform lazy io. LazyIO relaxes
  POSIX semantics. Buffered reads/writes are allowed even when a file is opened
  by multiple applications on multiple clients. Applications are responsible
  for managing cache coherency themselves.

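The filelock cases spelled out above can be collected into a small lookup. This is a sketch under stated assumptions: the enum, its ``SKETCH_`` names and the function ``sketch_file_caps`` are hypothetical, only the three states named in the text are modelled, and cases the text does not describe (such as a non-loner client in LOCK_EXCL) return no file caps as a placeholder. The real MDS Locker tracks more states and conditions than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* generic cap bits and the FILE shift (values from the listings above) */
#define CEPH_CAP_GSHARED     1
#define CEPH_CAP_GCACHE      4
#define CEPH_CAP_GRD         8
#define CEPH_CAP_GWR        16
#define CEPH_CAP_GLAZYIO   128
#define CEPH_CAP_SFILE       8

/* Hypothetical names for the three filelock states discussed above. */
enum sketch_lock_state {
    SKETCH_LOCK_SYNC,
    SKETCH_LOCK_MIX,
    SKETCH_LOCK_EXCL
};

/* Hypothetical lookup: file caps a client can hold in a given state. */
static inline uint32_t sketch_file_caps(enum sketch_lock_state state,
                                        bool from_auth_mds, bool is_loner)
{
    uint32_t g = 0;

    assert(state <= SKETCH_LOCK_EXCL);
    switch (state) {
    case SKETCH_LOCK_SYNC:   /* Fscrl from the auth MDS, Fscr from replicas */
        g = CEPH_CAP_GSHARED | CEPH_CAP_GCACHE | CEPH_CAP_GRD;
        if (from_auth_mds)
            g |= CEPH_CAP_GLAZYIO;
        break;
    case SKETCH_LOCK_MIX:    /* Frwl from the auth MDS, Fr from replicas */
        g = CEPH_CAP_GRD;
        if (from_auth_mds)
            g |= CEPH_CAP_GWR | CEPH_CAP_GLAZYIO;
        break;
    case SKETCH_LOCK_EXCL:   /* loner: every file cap except Fl */
        if (is_loner)
            g = 0xffu & ~(uint32_t)CEPH_CAP_GLAZYIO;
        /* other clients are not described in the text: placeholder 0 */
        break;
    }
    return g << CEPH_CAP_SFILE;
}
```

For example, ``sketch_file_caps(SKETCH_LOCK_MIX, true, false)`` yields the **Frwl** bits shifted into the FILE field, with neither **Fc** nor **Fb** set.
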