Merge pull request #10436 from gregsfortytwo/wip-snapshot-doc

Add documentation about snapshots

Reviewed-by: John Spray <john.spray@redhat.com>

commit d9a0222197
@@ -60,6 +60,13 @@ writing, but there is insufficient testing to provide stability guarantees and
every expansion of testing has generally revealed new issues. If you do enable
snapshots and experience failure, manual intervention will be needed.

Snapshots are known not to work properly with multiple filesystems (below) in
some cases. Specifically, if you share a pool for multiple FSes and delete
a snapshot in one FS, expect to lose snapshotted file data in any other FS using
snapshots. See the :doc:`/dev/cephfs-snapshots` page for more information.

Snapshots are known not to work with multi-MDS filesystems.

Snapshotting was blocked off with the "allow_new_snaps" flag prior to Firefly.

Multiple filesystems within a Ceph cluster
@@ -77,5 +84,8 @@ will not have been experienced by anybody else -- knowledgeable help will be
extremely limited. You also probably do not have the security or isolation
guarantees you want or think you have upon doing so.

Note that snapshots and multiple filesystems are *not* tested in combination
and may not work together; see above.

Multiple filesystems were available starting in the Jewel release candidates
but were protected behind the "enable_multiple" flag before the final release.
doc/dev/cephfs-snapshots.rst (new normal file, 114 lines added)
@@ -0,0 +1,114 @@
CephFS Snapshots
================

CephFS supports snapshots, generally created by invoking mkdir against the
(hidden, special) .snap directory.

Overview
--------

Generally, snapshots do what they sound like: they create an immutable view
of the filesystem at the point in time they're taken. There are some headline
features that make CephFS snapshots different from what you might expect:

* Arbitrary subtrees. Snapshots are created within any directory you choose,
  and cover all data in the filesystem under that directory.
* Asynchronous. If you create a snapshot, buffered data is flushed out lazily,
  including from other clients. As a result, "creating" the snapshot is
  very fast.

Important Data Structures
-------------------------

* SnapRealm: A `SnapRealm` is created whenever you create a snapshot at a new
  point in the hierarchy (or, when a snapshotted inode is moved outside of its
  parent snapshot). SnapRealms contain an `sr_t srnode`, links to `past_parents`
  and `past_children`, and all `inodes_with_caps` that are part of the snapshot.
  Clients also have a SnapRealm concept that maintains less data but is used to
  associate a `SnapContext` with each open file for writing.
* sr_t: An `sr_t` is the on-disk snapshot metadata. It is part of the containing
  directory and contains sequence counters, timestamps, the list of associated
  snapshot IDs, and `past_parents`.
* snaplink_t: `past_parents` et al are stored on-disk as a `snaplink_t`, holding
  the inode number and first `snapid` of the inode/snapshot referenced.

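For orientation, here is a deliberately simplified, hypothetical C++ sketch of
how these pieces relate. The type and field names are illustrative only and do
not match the actual definitions in the Ceph source tree:

.. code-block:: cpp

    #include <cstdint>
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    using snapid_t  = uint64_t;  // snapshot sequence ID allocated by the SnapServer
    using inodeno_t = uint64_t;  // inode number

    // On-disk link to another inode's snapshot information (cf. snaplink_t).
    struct SnapLink {
      inodeno_t ino;    // inode that roots the referenced SnapRealm
      snapid_t  first;  // first snapid for which the link is valid
    };

    // On-disk snapshot metadata stored with the containing directory (cf. sr_t).
    struct SnapNode {
      snapid_t seq;                           // sequence counter, bumped on change
      snapid_t created;                       // when this realm was created
      std::map<snapid_t, std::string> snaps;  // snapshot IDs -> snapshot names
      std::vector<SnapLink> past_parents;     // realms this subtree used to live under
    };

    // In-memory realm tracking a snapshotted subtree (cf. SnapRealm).
    struct Realm {
      SnapNode srnode;                       // the on-disk part
      std::vector<SnapLink> past_children;
      std::set<inodeno_t> inodes_with_caps;  // inodes whose caps belong to this realm
    };
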
Creating a snapshot
-------------------

To make a snapshot on directory "/1/2/3/foo", the client invokes "mkdir" on
the "/1/2/3/foo/.snap" directory. This is transmitted to the MDS as a
CEPH_MDS_OP_MKSNAP-tagged `MClientRequest`, and initially handled in
Server::handle_client_mksnap(). It allocates a `snapid` from the `SnapServer`,
projects a new inode with the new SnapRealm, and commits it to the MDLog as
usual. When committed, it invokes
`MDCache::do_realm_invalidate_and_update_notify()`, which triggers most of the
real work of the snapshot.

If there were already snapshots above directory "foo" (rooted at "/1", say),
the new SnapRealm adds its most immediate ancestor as a `past_parent` on
creation. After committing to the MDLog, all clients with caps on files in
"/1/2/3/foo/" are notified (MDCache::send_snaps()) of the new SnapRealm, and
update the `SnapContext` they are using with that data. Note that this
*is not* a synchronous part of the snapshot creation!

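From the client's point of view this is nothing more than a mkdir inside the
.snap directory; on a kernel or FUSE mount you can simply run mkdir there. As
a minimal sketch using libcephfs (the snapshot name "my_snapshot" and the path
are made up, and a default ceph.conf is assumed to be reachable):

.. code-block:: cpp

    #include <cephfs/libcephfs.h>
    #include <cstdio>

    int main() {
      struct ceph_mount_info *cmount = nullptr;

      // Connect with the default client id and the default config search path.
      if (ceph_create(&cmount, nullptr) < 0) return 1;
      ceph_conf_read_file(cmount, nullptr);
      if (ceph_mount(cmount, "/") < 0) {   // mount the filesystem root
        ceph_release(cmount);
        return 1;
      }

      // Creating a directory under .snap asks the MDS to snapshot everything
      // under /1/2/3/foo (a CEPH_MDS_OP_MKSNAP request).
      int r = ceph_mkdir(cmount, "/1/2/3/foo/.snap/my_snapshot", 0755);
      printf("mksnap returned %d\n", r);

      ceph_unmount(cmount);
      ceph_release(cmount);
      return 0;
    }
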
Updating a snapshot
-------------------

If you delete a snapshot, or move data out of the parent snapshot's hierarchy,
a similar process is followed. Extra code paths check to see if we can break
the `past_parent` links between SnapRealms, or eliminate them entirely.

Generating a SnapContext
------------------------

A RADOS `SnapContext` consists of a snapshot sequence ID (`snapid`) and all
the snapshot IDs that an object is already part of. To generate that list, we
gather all `snapids` associated with the SnapRealm and all of its
`past_parents`.

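Continuing the hypothetical sketch from the data structures section above (it
reuses the made-up `Realm`, `SnapLink`, `snapid_t` and `inodeno_t` types),
building a `SnapContext` amounts to walking the realm and its `past_parents`,
collecting every `snapid`, and pairing the result with the realm's current
sequence number:

.. code-block:: cpp

    #include <algorithm>
    #include <map>
    #include <vector>

    // Minimal stand-in for the RADOS SnapContext: the newest snap sequence
    // plus every snapshot ID the object participates in, newest first.
    struct SimpleSnapContext {
      snapid_t seq;
      std::vector<snapid_t> snaps;
    };

    // Collect snapids from a realm and, recursively, from its past_parents.
    // `realms` is an assumed inode-number -> Realm lookup table.
    void collect_snaps(const Realm& realm,
                       const std::map<inodeno_t, Realm>& realms,
                       std::vector<snapid_t>& out) {
      for (const auto& entry : realm.srnode.snaps)
        out.push_back(entry.first);
      for (const SnapLink& link : realm.srnode.past_parents)
        collect_snaps(realms.at(link.ino), realms, out);
    }

    SimpleSnapContext make_snap_context(const Realm& realm,
                                        const std::map<inodeno_t, Realm>& realms) {
      SimpleSnapContext ctx{realm.srnode.seq, {}};
      collect_snaps(realm, realms, ctx.snaps);
      std::sort(ctx.snaps.rbegin(), ctx.snaps.rend());  // descending order
      return ctx;
    }
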
Storing snapshot data
---------------------

File data is stored in RADOS "self-managed" snapshots. Clients are careful to
use the correct `SnapContext` when writing file data to the OSDs.

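The underlying RADOS mechanism can be illustrated directly with librados. The
following is only a standalone sketch of that mechanism (the pool name
"cephfs_data" and object name "example_object" are made up, error handling is
minimal, and it should not be run against a pool a live CephFS is using, since
CephFS manages these snapshots itself):

.. code-block:: cpp

    #include <rados/librados.hpp>
    #include <vector>

    int main() {
      librados::Rados cluster;
      cluster.init(nullptr);            // default client id
      cluster.conf_read_file(nullptr);  // default ceph.conf search path
      if (cluster.connect() < 0) return 1;

      librados::IoCtx io;
      if (cluster.ioctx_create("cephfs_data", io) < 0) {
        cluster.shutdown();
        return 1;
      }

      // Write some initial object data.
      librados::bufferlist bl;
      bl.append("before snapshot");
      io.write_full("example_object", bl);

      // Take a self-managed snapshot; RADOS hands back a new snapid.
      uint64_t snapid = 0;
      io.selfmanaged_snap_create(&snapid);

      // Subsequent writes carry a SnapContext listing that snapid, so the
      // OSDs know to preserve (clone) the pre-snapshot object contents.
      std::vector<librados::snap_t> snaps = {snapid};
      io.selfmanaged_snap_set_write_ctx(snapid, snaps);

      librados::bufferlist bl2;
      bl2.append("after snapshot");
      io.write_full("example_object", bl2);  // old data stays readable at snapid

      cluster.shutdown();
      return 0;
    }
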
Storing snapshot metadata
-------------------------

Snapshotted dentries (and their inodes) are stored in-line as part of the
directory they were in at the time of the snapshot. *All dentries* include a
`first` and `last` snapid for which they are valid. (Non-snapshotted dentries
will have their `last` set to CEPH_NOSNAP.)

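The effect of the `first`/`last` range can be sketched as follows; the names
here are hypothetical and `NOSNAP` is only a stand-in for Ceph's real
CEPH_NOSNAP sentinel:

.. code-block:: cpp

    #include <cstdint>
    #include <limits>

    using snapid_t = uint64_t;

    // Stand-in sentinel meaning "this dentry is still the live (head) version".
    constexpr snapid_t NOSNAP = std::numeric_limits<snapid_t>::max();

    struct DentryVersion {
      snapid_t first;  // first snapid this version is valid for
      snapid_t last;   // last snapid, or NOSNAP for the live version
    };

    // A lookup at snapshot `snap` sees this version only if the snapid falls
    // inside the dentry's validity range.
    bool visible_at(const DentryVersion& d, snapid_t snap) {
      return d.first <= snap && snap <= d.last;
    }
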
Snapshot writeback
------------------

There is a great deal of code to handle writeback efficiently. When a Client
receives an `MClientSnap` message, it updates the local `SnapRealm`
representation and its links to specific `Inodes`, and generates a `CapSnap`
for the `Inode`. The `CapSnap` is flushed out as part of capability writeback,
and if there is dirty data the `CapSnap` is used to block fresh data writes
until the snapshot is completely flushed to the OSDs.

In the MDS, we generate snapshot-representing dentries as part of the regular
process for flushing them. Dentries with outstanding `CapSnap` data are kept
pinned and in the journal.

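The client-side ordering can be summarized with a deliberately simplified,
hypothetical sketch (none of these types match the real Client code; it only
shows the idea: capture the dirty state into a CapSnap when the snapshot
arrives, then hold fresh writes on that inode until the CapSnap is flushed):

.. code-block:: cpp

    #include <cstdint>
    #include <vector>

    using snapid_t = uint64_t;

    // Frozen view of an inode's dirty state at snapshot time (cf. CapSnap).
    struct PendingCapSnap {
      snapid_t follows;          // the snapshot this data belongs to
      bool     dirty_data;       // was there unflushed file data?
      bool     flushed = false;  // set once writeback to the OSDs/MDS completes
    };

    struct OpenInode {
      std::vector<char> dirty_buffer;         // buffered, not-yet-written data
      std::vector<PendingCapSnap> cap_snaps;  // snapshots awaiting writeback
    };

    // Called when the client learns of a new snapshot covering this inode
    // (in the real code, on receipt of an MClientSnap message).
    void queue_cap_snap(OpenInode& in, snapid_t snap) {
      in.cap_snaps.push_back({snap, !in.dirty_buffer.empty()});
    }

    // New writes must wait while a dirty CapSnap is still being flushed, so
    // the snapshot captures exactly the data that existed when it was taken.
    bool writes_blocked(const OpenInode& in) {
      for (const auto& cs : in.cap_snaps)
        if (cs.dirty_data && !cs.flushed)
          return true;
      return false;
    }
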
Deleting snapshots
------------------

Snapshots are deleted by invoking "rmdir" on the ".snap" directory they are
rooted in. (Attempts to delete a directory which roots snapshots *will fail*;
you must delete the snapshots first.) Once deleted, they are entered into the
`OSDMap` list of deleted snapshots and the file data is removed by the OSDs.
Metadata is cleaned up as the directory objects are read in and written back
out again.

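Client-side this mirrors creation. A short libcephfs sketch, reusing the
hypothetical mounted `cmount` handle and snapshot name from the creation
example above:

.. code-block:: cpp

    // Removing the snapshot's directory entry deletes the snapshot
    // (a CEPH_MDS_OP_RMSNAP request on the MDS side).
    int r = ceph_rmdir(cmount, "/1/2/3/foo/.snap/my_snapshot");

    // Deleting the snapshotted directory itself would fail while snapshots
    // remain; remove them all first, then rmdir("/1/2/3/foo") can succeed.
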
Hard links
----------

Hard links do not interact well with snapshots. A file is snapshotted when its
primary link is part of a SnapRealm; other links *will not* preserve data.
Generally the location where a file was first created will be its primary link,
but if the original link has been deleted it is not easy (nor always
deterministic) to find which link is now the primary.

Multi-FS
--------

Snapshots and multiple filesystems don't interact well. Specifically, each
MDS cluster allocates `snapids` independently; if you have multiple filesystems
sharing a single pool (via namespaces), their snapshots *will* collide and
deleting one will result in missing file data for others. (This may even be
invisible, not throwing errors to the user.) If each FS gets its own
pool things probably work, but this isn't tested and may not be true.