diff --git a/doc/dev/cephfs-reclaim.rst b/doc/dev/cephfs-reclaim.rst
new file mode 100644
index 00000000000..94edd39ae16
--- /dev/null
+++ b/doc/dev/cephfs-reclaim.rst
@@ -0,0 +1,104 @@
+CephFS Reclaim Interface
+========================
+
+Introduction
+------------
+NFS servers typically do not track ephemeral state on stable storage. If
+the NFS server is restarted, then it will be resurrected with no
+ephemeral state, and the NFS clients are expected to send requests
+during a grace period to reclaim the state they previously held.
+
+In order to support this use-case, libcephfs has grown several functions
+that allow a client that has been stopped and restarted to destroy or
+reclaim state held by a previous incarnation of itself. This allows the
+client to reacquire state held by its previous incarnation, and to avoid
+the long wait for the old session to time out before releasing the state
+previously held.
+
+As soon as an NFS server running over CephFS goes down, it's racing
+against its MDS session timeout. If the Ceph session times out before
+the NFS grace period is started, then conflicting state could be
+acquired by another client. This mechanism also allows us to increase
+the timeout for these clients, to ensure that the server has a long
+window of time to be restarted.
+
+Setting the UUID
+----------------
+In order to properly reset or reclaim against the old session, we need a
+way to identify the old session. This is done by setting a unique opaque
+value on the session using **ceph_set_uuid()**. The uuid value can be
+any string and is treated as opaque by the client.
+
+Setting the uuid directly can only be done on a new session, prior to
+mounting. When reclaim is performed, the current session will inherit the
+old session's uuid.
+
+Starting Reclaim
+----------------
+After calling ceph_create and ceph_init on the resulting struct
+ceph_mount_info, the client should then issue ceph_start_reclaim,
+passing in the uuid of the previous incarnation of the client along
+with any flags.
+
+CEPH_RECLAIM_RESET
+    This flag indicates that we do not intend to do any sort of reclaim
+    against the old session indicated by the given uuid, and that it
+    should just be discarded. Any state held by the previous client
+    should be released immediately.
+
+Finishing Reclaim
+-----------------
+After the Ceph client has completed all of its reclaim operations, the
+client should issue ceph_finish_reclaim to indicate that the reclaim is
+now complete.
+
+Setting Session Timeout (Optional)
+----------------------------------
+When a client dies and is restarted, and we need to preserve its state,
+we are effectively racing against the session expiration clock. A longer
+timeout, set with ceph_set_session_timeout, is generally wanted here
+since we expect to eventually kill off the old session manually.
+
+Example 1: Reset Old Session
+----------------------------
+This example just kills off the MDS session held by a previous instance
+of the client. An NFS server can start a grace period and then ask the
+MDS to tear down the old session. This allows clients to start reclaim
+immediately.
+
+(Note: error handling omitted for clarity)
+
+.. code-block:: c
+
+    struct ceph_mount_info *cmount;
+    const char *uuid = "foobarbaz";
+
+    /* Set up a new cephfs session, but don't mount it yet. */
+    rc = ceph_create(&cmount, NULL);
+    rc = ceph_init(cmount);
+
+    /*
+     * Set the timeout to 5 minutes to lengthen the window of time for
+     * the server to restart, should it crash.
+     */
+    ceph_set_session_timeout(cmount, 300);
+
+    /*
+     * Start reclaim vs. session with old uuid. Before calling this,
+     * all NFS servers that could acquire conflicting state _must_ be
+     * enforcing their grace period locally.
+     */
+    rc = ceph_start_reclaim(cmount, uuid, CEPH_RECLAIM_RESET);
+
+    /* Declare reclaim complete */
+    ceph_finish_reclaim(cmount);
+
+    /* Set uuid held by new session; nodeid is supplied by the caller */
+    ceph_set_uuid(cmount, nodeid);
+
+    /*
+     * Now mount up the filesystem and do normal open/lock operations to
+     * satisfy reclaim requests. rootpath is supplied by the caller.
+     */
+    ceph_mount(cmount, rootpath);
+    ...