ceph/doc/dev/cephfs-reclaim.rst

CephFS Reclaim Interface
========================

Introduction
------------
NFS servers typically do not track ephemeral state on stable storage. If
the NFS server is restarted, then it will be resurrected with no
ephemeral state, and the NFS clients are expected to send requests to
reclaim what state they held during a grace period.

In order to support this use-case, libcephfs has grown several functions
that allow a client that has been stopped and restarted to destroy or
reclaim state held by a previous incarnation of itself. This allows the
client to reacquire state held by its previous incarnation, and to avoid
the long wait for the old session to time out before releasing the state
previously held.

As soon as an NFS server running over cephfs goes down, it's racing
against its MDS session timeout. If the Ceph session times out before
the NFS grace period is started, then conflicting state could be
acquired by another client. This mechanism also allows us to increase
the timeout for these clients, to ensure that the server has a long
window of time to be restarted.

Setting the UUID
----------------
In order to properly reset or reclaim against the old session, we need a
way to identify the old session. This done by setting a unique opaque
value on the session using **ceph_set_uuid()**. The uuid value can be
any string and is treated as opaque by the client.

Setting the uuid directly can only be done on a new session, prior to
mounting. When reclaim is performed the current session will inherit the
old session's uuid.

Starting Reclaim
----------------
After calling ceph_create and ceph_init on the resulting struct
ceph_mount_info, the client should then issue ceph_start_reclaim,
passing in the uuid of the previous incarnation of the client with any
flags.

CEPH_RECLAIM_RESET
   This flag indicates that we do not intend to do any sort of reclaim
   against the old session indicated by the given uuid, and that it
   should just be discarded. Any state held by the previous client
   should be released immediately.

Finishing Reclaim
-----------------
After the Ceph client has completed all of its reclaim operations, the
client should issue ceph_finish_reclaim to indicate that the reclaim is
now complete.

Setting Session Timeout (Optional)
----------------------------------
When a client dies and is restarted, and we need to preserve its state,
we are effectively racing against the session expiration clock. In this
situation we generally want a longer timeout since we expect to
eventually kill off the old session manually.

Example 1: Reset Old Session
----------------------------
This example just kills off the MDS session held by a previous instance
of itself. An NFS server can start a grace period and then ask the MDS
to tear down the old session. This allows clients to start reclaim
immediately.

(Note: error handling omitted for clarity)

.. code-block:: c

	struct ceph_mount_info *cmount;
	const char *uuid = "foobarbaz";

	/* Set up a new cephfs session, but don't mount it yet. */
	rc = ceph_create(&cmount);
	rc = ceph_init(&cmount);

	/*
	 * Set the timeout to 5 minutes to lengthen the window of time for
	 * the server to restart, should it crash.
	 */
	ceph_set_session_timeout(cmount, 300);

	/*
	 * Start reclaim vs. session with old uuid. Before calling this,
	 * all NFS servers that could acquire conflicting state _must_ be
	 * enforcing their grace period locally.
	 */
	rc = ceph_start_reclaim(cmount, uuid, CEPH_RECLAIM_RESET);

	/* Declare reclaim complete */
	rc = ceph_finish_reclaim(cmount);

	/* Set uuid held by new session */
	ceph_set_uuid(cmount, nodeid);

	/*
	 * Now mount up the file system and do normal open/lock operations to
	 * satisfy reclaim requests.
	 */
	ceph_mount(cmount, rootpath);
	...