ceph/doc/dev/versions.rst
Greg Farnum b5ea74cec4 docs: document how the current OSD PG/object versions work
Signed-off-by: Greg Farnum <greg@inktank.com>
2013-08-27 15:08:28 -07:00

==================
Public OSD Version
==================

At present, there is one main version, maintained on disk as
pg_log.head and in memory as OpContext::at_version.
Clients see this version in one of two ways:

1) the long-standing MOSDOpReply::reassert_version,
2) the much newer objclass API function get_current_version().

The semantics of both are not quite what you'd expect.

reassert_version is usually set from OpContext::reply_version, which
is left at zero on successful read operations and set equal to
object_info_t::user_version on successful write operations. On any
operation returning ENOENT, reassert_version is instead set from the
pg_info_t::last_update value. (On replays, reassert_version is set
directly from the PG log entry's version.)
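
The selection rules above can be summarised as a small decision
function. This is only a Python sketch of the behaviour described here,
not the OSD's C++ code. The function name ``pick_reassert_version`` and
its parameters are hypothetical, versions are modelled as plain
integers, and checking for a replay before checking for ENOENT is an
assumption.

```python
import errno


def pick_reassert_version(result, is_replay, reply_version,
                          last_update, log_entry_version=None):
    """Return the version a reply would carry under the rules above.

    All names are illustrative; versions are modelled as plain ints.
    """
    if is_replay:
        # Replays take the version straight from the PG log entry.
        return log_entry_version
    if result == -errno.ENOENT:
        # ENOENT replies report the PG's last_update instead.
        return last_update
    # Otherwise use reply_version: zero for successful reads,
    # the object's user_version for successful writes.
    return reply_version
```

Under this model a successful read yields 0, a successful write yields
the object's user_version, and an ENOENT yields the PG's last_update.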

The user_version semantics are as follows. For a non-watch write,
user_version is updated to the value of OpContext::at_version once the
Op has been prepared, just before the new state is written to disk, so
this version already reflects anything needed to make the object
writeable. For a watch write, user_version is left unchanged, which
means it differs from object_info_t::version. A read, of course, does
not change it either.
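
A toy model may make the difference between the two per-object
versions easier to see. The ``ObjectInfo`` class and ``apply_op``
helper below are hypothetical stand-ins for object_info_t and the Op
preparation path. That a watch write advances the internal version is
inferred from the statement that the two versions then differ.

```python
from dataclasses import dataclass


@dataclass
class ObjectInfo:
    version: int = 0       # models object_info_t::version
    user_version: int = 0  # models object_info_t::user_version


def apply_op(info, kind, at_version):
    """Update a toy ObjectInfo; kind is 'read', 'write' or 'watch'."""
    if kind == "read":
        return info  # reads change nothing
    # Any write, watch or not, advances the object's internal version.
    info.version = at_version
    if kind == "write":
        # Only non-watch writes move the client-visible user_version.
        info.user_version = at_version
    return info
```

After a plain write at version 5 followed by a watch write at version
6, this model leaves ``version`` at 6 and ``user_version`` at 5.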

This means that reassert_version is *normally* the value needed to
replay the Op if necessary, but not for Watch operations. (It appears
this has caused problems in the past, so the new LingerOp framework
never replays them; it just generates new ones.) The practical upshot
is that clients can look at the reassert_version, compare it to
previous versions, and see whether there has been a write they care
about (for instance, when watching an rbd head object in order to
refresh it on version changes). These versions are often shared with
other clients via Notify mechanisms, and could be shared via other
channels as well.
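
The client-side comparison described above might look roughly like the
following. ``VersionWatcher`` is a hypothetical helper, not part of any
Ceph client library, and versions are again modelled as plain integers.

```python
class VersionWatcher:
    """Remember the highest version seen and report when it advances."""

    def __init__(self):
        self.last_seen = 0

    def observe(self, reassert_version):
        """Return True if this version is newer than any seen so far."""
        if reassert_version > self.last_seen:
            self.last_seen = reassert_version
            return True
        return False
```

Because successful reads report a version of zero, a reply to a read
never looks like a change in this sketch. Only a version learned from a
write, or passed along by another client, advances the watcher.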

The newer get_current_version() function returns the current contents
of OpContext::at_version. On read operations that is 0; on write
operations it is whatever at_version holds for that Op. It will
*normally* be equal to the reassert_version that gets returned, but in
unusual circumstances it might be different. So far, though, no users
expect that version to have any relationship to the reassert_version;
they just want get_current_version() to be monotonically increasing.
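
The only property current users rely on can be stated as a short check.
Both functions below are hypothetical illustrations rather than the
objclass API itself, and reading "monotonically increasing" as
strictly increasing across successive writes is an assumption.

```python
def model_get_current_version(is_write, at_version):
    """Model of the described behaviour: 0 on reads, at_version on writes."""
    return at_version if is_write else 0


def strictly_increasing(versions):
    """True if each version is greater than the one before it."""
    return all(a < b for a, b in zip(versions, versions[1:]))
```

A caller that records the value seen on each write could pass that
list to ``strictly_increasing`` to confirm the expectation holds.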