ceph/doc/dev/osd_internals/snaps.rst

======
Snaps
======

Overview
--------
Rados supports two related snapshotting mechanisms:

  1. *pool snaps*: snapshots are implicitly applied to all objects
     in a pool
  2. *self managed snaps*: the user must provide the current *SnapContext*
     on each write.

These two are mutually exclusive: only one or the other can be used on
a particular pool.

The *SnapContext* is the set of snapshots currently defined for an object
as well as the most recent snapshot (the *seq*) requested from the mon for
sequencing purposes (a *SnapContext* with a newer *seq* is considered to
be more recent).

The difference between *pool snaps* and *self managed snaps* from the
OSD's point of view lies in whether the *SnapContext* comes to the OSD
via the client's MOSDOp or via the most recent OSDMap.
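
The following is a minimal sketch of these ideas, not the actual Ceph
classes.  The struct layout, the newest-first ordering of ``snaps``, and the
helper names ``is_valid``, ``is_newer`` and ``effective_snapc`` are
assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified stand-in for a SnapContext (illustrative, not the Ceph type).
struct SnapContext {
  uint64_t seq;                 // most recent snapid requested from the mon
  std::vector<uint64_t> snaps;  // snaps currently defined, newest first
};

// Assumed well-formedness rule: snaps strictly descending, none above seq.
bool is_valid(const SnapContext& c) {
  uint64_t prev = 0;
  bool first = true;
  for (uint64_t s : c.snaps) {
    if (s > c.seq) return false;
    if (!first && s >= prev) return false;
    prev = s;
    first = false;
  }
  return true;
}

// A context with a newer seq is considered more recent.
bool is_newer(const SnapContext& a, const SnapContext& b) {
  return a.seq > b.seq;
}

// Pool snaps: the context is derived from the most recent OSDMap.
// Self managed snaps: the context arrives with the client's MOSDOp.
const SnapContext& effective_snapc(bool pool_snaps_mode,
                                   const SnapContext& from_osdmap,
                                   const SnapContext& from_client_op) {
  const SnapContext& c = pool_snaps_mode ? from_osdmap : from_client_op;
  // Production code would more likely return an error than assert.
  assert(is_valid(c));
  return c;
}
```

The only branch that differs between the two mechanisms is the source of
the context; everything after that point can treat both cases the same way.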

See PrimaryLogPG::make_writeable

Ondisk Structures
-----------------
Each object has, in the PG collection, a *head* object (or a *snapdir*,
described below) and possibly a set of *clone* objects.
Each hobject_t has a snap field.  For the *head* (the only writeable version
of an object), the snap field is set to CEPH_NOSNAP.  For the *clones*, the
snap field is set to the *seq* of the *SnapContext* at their creation.
When the OSD services a write, it first checks whether the most recent
*clone* is tagged with a snapid prior to the most recent snap represented
in the *SnapContext*.  If so, at least one snapshot has occurred between
the time of the write and the time of the last clone.  Therefore, prior
to performing the mutation, the OSD creates a new clone for servicing
reads on snaps between the snapid of the last clone and the most recent
snapid.
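
The check described above can be sketched roughly as follows.  This is a
simplified model, not the OSD's code: ``CloneDecision`` and
``check_before_write`` are hypothetical names, ``snapc_snaps`` is assumed to
be sorted newest first, and ``last_clone_snapid`` is assumed to be 0 when the
object has no clones yet.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Outcome of the pre-write check (illustrative type).
struct CloneDecision {
  bool need_clone;                    // a clone must be made before the write
  uint64_t clone_snapid;              // snap field to give the new clone
  std::vector<uint64_t> clone_snaps;  // snaps the new clone serves reads for
};

CloneDecision check_before_write(uint64_t snapc_seq,
                                 const std::vector<uint64_t>& snapc_snaps,
                                 uint64_t last_clone_snapid,
                                 bool head_exists) {
  // Assumed precondition: no snap in the context is newer than its seq.
  assert(snapc_snaps.empty() || snapc_snaps.front() <= snapc_seq);

  CloneDecision d;
  d.need_clone = false;
  d.clone_snapid = 0;

  // Nothing to preserve if there is no head, no snaps, or the last clone
  // already covers the most recent snap in the context.
  if (!head_exists || snapc_snaps.empty() ||
      snapc_snaps.front() <= last_clone_snapid) {
    return d;
  }

  d.need_clone = true;
  d.clone_snapid = snapc_seq;  // clones are tagged with the context's seq
  // The new clone covers every snap taken since the last clone.
  for (uint64_t s : snapc_snaps) {
    if (s <= last_clone_snapid) break;
    d.clone_snaps.push_back(s);
  }
  return d;
}
```

For example, with snaps ``{8, 5, 3}`` and a last clone tagged 3, the sketch
asks for one new clone tagged 8 that serves snaps 8 and 5.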

The *head* object contains a *SnapSet* encoded in an attribute, which tracks:

  1. The full set of snaps defined for the object
  2. The full set of clones which currently exist
  3. Overlapping intervals between clones for tracking space usage
  4. Clone size
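
As a rough illustration of how these four pieces fit together, the sketch
below models them as plain containers and uses the overlap information for
space accounting.  ``SnapSetSketch``, ``Extent`` and ``clone_bytes`` are
names invented here, and the sketch assumes that the overlap recorded for a
clone is a list of non-overlapping extents shared with the next newer
version of the object.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <utility>
#include <vector>

typedef std::pair<uint64_t, uint64_t> Extent;  // (offset, length) in bytes

// Illustrative model of the information a SnapSet tracks.
struct SnapSetSketch {
  std::set<uint64_t> snaps;                 // 1. snaps defined for the object
  std::vector<uint64_t> clones;             // 2. clones that currently exist
  std::map<uint64_t, std::vector<Extent> >  // 3. extents shared with the
      clone_overlap;                        //    next newer version
  std::map<uint64_t, uint64_t> clone_size;  // 4. size of each clone
};

// Bytes attributable to one clone alone: its size minus the shared extents.
uint64_t clone_bytes(const SnapSetSketch& ss, uint64_t clone) {
  const uint64_t size = ss.clone_size.at(clone);
  uint64_t shared = 0;
  std::map<uint64_t, std::vector<Extent> >::const_iterator it =
      ss.clone_overlap.find(clone);
  if (it != ss.clone_overlap.end()) {
    for (const Extent& e : it->second) shared += e.second;
  }
  assert(shared <= size);  // overlap can never exceed the clone itself
  return size - shared;
}
```

A 4096-byte clone that shares two 1024-byte extents with its successor
would be charged 2048 bytes under this model.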

If the *head* is deleted while there are still clones, a *snapdir* object
is created instead to house the *SnapSet*.

Additionally, the *object_info_t* on each clone includes a vector of the
snaps for which the clone is defined.

Snap Removal
------------
To remove a snapshot, a request is made to the *Monitor* cluster to
add the snapshot id to the list of purged snaps (or to remove it from
the set of pool snaps in the case of *pool snaps*).  In either case,
the *PG* adds the snap to its *snap_trimq* for trimming.

A clone can be removed when all of its snaps have been removed.  In
order to determine which clones might need to be removed upon snap
removal, we maintain a mapping from snap to *hobject_t* using the
*SnapMapper*.

See PrimaryLogPG::SnapTrimmer, SnapMapper

This trimming is performed asynchronously by the snap_trim_wq while the
PG is clean and not scrubbing.

  #. The next snap in PG::snap_trimq is selected for trimming.
  #. We determine the next object for trimming from PG::snap_mapper.
     For each object, we create a log entry and repop updating the
     object info and the snap set (including adjusting the overlaps).
     If the object is a clone which no longer belongs to any live snapshots,
     it is removed here. (See PrimaryLogPG::trim_object() when new_snaps
     is empty.)
  #. We also locally update our *SnapMapper* instance with the object's
     new snaps.
  #. The log entry containing the modification of the object also
     contains the new set of snaps, which the replica uses to update
     its own *SnapMapper* instance.
  #. The primary shares the info with the replica, which persists
     the new set of purged_snaps along with the rest of the info.
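
The per-object part of the steps above can be modelled with two in-memory
maps.  This is a toy sketch only: ``ToyMapper``, ``trim_clone`` and
``trim_snap`` are hypothetical names, objects are identified by plain
strings, and log entries, repops and replicas are left out entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Toy stand-in for a SnapMapper: both directions of the mapping.
struct ToyMapper {
  std::map<std::string, std::set<uint64_t> > obj_to_snaps;
  std::map<uint64_t, std::set<std::string> > snap_to_objs;
};

// Remove one snap from one clone.  Returns true if the clone itself was
// removed because it no longer belongs to any live snapshot.
bool trim_clone(ToyMapper& m, const std::string& clone, uint64_t snap) {
  // Drop the snap -> object entry first so the caller always makes progress.
  std::map<uint64_t, std::set<std::string> >::iterator rit =
      m.snap_to_objs.find(snap);
  if (rit != m.snap_to_objs.end()) {
    rit->second.erase(clone);
    if (rit->second.empty()) m.snap_to_objs.erase(rit);
  }

  std::map<std::string, std::set<uint64_t> >::iterator it =
      m.obj_to_snaps.find(clone);
  if (it == m.obj_to_snaps.end()) return false;

  it->second.erase(snap);      // the clone's new set of snaps
  if (it->second.empty()) {    // new_snaps is empty: remove the clone
    m.obj_to_snaps.erase(it);
    return true;
  }
  return false;                // clone survives with fewer snaps
}

// Trim a whole snap by repeatedly asking the mapper for the next object.
// Returns the number of clones removed.
std::size_t trim_snap(ToyMapper& m, uint64_t snap) {
  std::size_t removed = 0;
  for (;;) {
    std::map<uint64_t, std::set<std::string> >::iterator rit =
        m.snap_to_objs.find(snap);
    if (rit == m.snap_to_objs.end()) break;
    const std::string next = *rit->second.begin();
    if (trim_clone(m, next, snap)) ++removed;
  }
  assert(m.snap_to_objs.count(snap) == 0);
  return removed;
}
```

Calling ``trim_snap`` a second time for the same snap finds nothing to do,
which is why repeating a trim is harmless in this model.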


Recovery
--------
Because the trim operations are implemented using repops and log entries,
normal PG peering and recovery maintain the snap trimmer operations with
the caveat that push and removal operations need to update the local
*SnapMapper* instance.  If the purged_snaps update is lost, we merely
retrim a now-empty snap.

SnapMapper
----------
*SnapMapper* is implemented on top of map_cacher<string, bufferlist>,
which provides an interface over a backing store such as the file system
with async transactions.  While transactions are incomplete, the map_cacher
instance buffers unstable keys allowing consistent access without having
to flush the filestore.  *SnapMapper* provides two mappings:

  1. hobject_t -> set<snapid_t>: stores the set of snaps for each clone
     object
  2. snapid_t -> hobject_t: stores the set of hobjects that have the
     snapshot as one of their snaps

Assumption: there are lots of hobjects and relatively few snaps.  The
first mapping uses a stringification of the object as the key and an
encoding of the set of snaps as the value.  The second mapping, because there
might be many hobjects for a single snap, is stored as a collection of keys
of the form stringify(snap)_stringify(object) such that stringify(snap)
is constant length.  These keys have a bufferlist encoding
pair<snapid, hobject_t> as a value.  Thus, creating or trimming a single
object does not involve reading all objects for any snap.  Additionally,
upon construction, the *SnapMapper* is provided with a mask for filtering
the objects in the single SnapMapper keyspace belonging to that PG.
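
The benefit of a constant-length ``stringify(snap)`` can be seen in a small
sketch over an ordered key-value map.  The exact key format used here (16
hex digits, an underscore, then the object name) and the function names are
assumptions for illustration and are not the on-disk format.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Fixed-width encoding: every snap yields a prefix of the same length, so
// all keys for one snap are contiguous in an ordered store.
std::string snap_prefix(uint64_t snap) {
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%016llX_",
                static_cast<unsigned long long>(snap));
  return std::string(buf);
}

// One key per (snap, object) pair.
std::string snap_key(uint64_t snap, const std::string& object) {
  assert(!object.empty());
  return snap_prefix(snap) + object;
}

// List the objects in a snap with a range scan starting at the prefix.
// No per-snap value holding all objects is ever read or rewritten.
std::vector<std::string> objects_in_snap(
    const std::map<std::string, std::string>& kv, uint64_t snap) {
  const std::string p = snap_prefix(snap);
  std::vector<std::string> out;
  for (std::map<std::string, std::string>::const_iterator it =
           kv.lower_bound(p);
       it != kv.end() && it->first.compare(0, p.size(), p) == 0; ++it) {
    out.push_back(it->first.substr(p.size()));
  }
  return out;
}
```

Adding or removing a single object touches exactly one key in this layout,
which matches the property described above.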

Split
-----
The snapid_t -> hobject_t key entries are arranged such that for any PG,
up to 8 prefixes need to be checked to determine all hobjects in a particular
snap for a particular PG.  Upon split, the prefixes to check on the parent
are adjusted such that only the objects remaining in the PG will be visible.
The children will immediately have the correct mapping.
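
One way to see where a bound of 8 prefixes can come from is the sketch
below.  It assumes that a PG owns the objects whose low ``bits`` bits of a
32-bit hash equal ``match``, and that keys embed the hash as hex digits with
the nibbles reversed so that the low bits come first.  Both the key layout
and the names ``reverse_nibbles`` and ``hash_prefixes`` are assumptions for
illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <set>
#include <string>

// Reverse the order of the eight 4-bit nibbles of a 32-bit value.
uint32_t reverse_nibbles(uint32_t v) {
  uint32_t r = 0;
  for (int i = 0; i < 8; ++i) {
    r = (r << 4) | (v & 0xF);
    v >>= 4;
  }
  return r;
}

// Hex prefixes that together cover every hash whose low `bits` bits equal
// `match`.  A hex digit holds 4 bits, so up to 3 bits of the last digit are
// unconstrained, giving at most 2^3 = 8 prefixes.
std::set<std::string> hash_prefixes(unsigned bits, uint32_t match) {
  assert(bits <= 32);
  const unsigned len = (bits + 3) / 4 * 4;  // round up to whole nibbles
  const unsigned free_bits = len - bits;    // 0 to 3
  const uint32_t base =
      bits == 32 ? match : (match & ((1u << bits) - 1));

  std::set<std::string> out;
  for (uint32_t fill = 0; fill < (1u << free_bits); ++fill) {
    // Enumerate every value of the unconstrained bits in the last nibble.
    const uint32_t h = free_bits ? (base | (fill << bits)) : base;
    char buf[9];
    std::snprintf(buf, sizeof(buf), "%08X",
                  static_cast<unsigned>(reverse_nibbles(h)));
    out.insert(std::string(buf, len / 4));
  }
  return out;
}
```

With ``bits = 4`` and ``match = 0xA`` a single prefix ``A`` suffices.  After
a split to ``bits = 5``, the parent (``match = 0x0A``) and the child
(``match = 0x1A``) each get 8 two-digit prefixes, and the two sets are
disjoint, so each PG sees only its own objects.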