======================
Peering
======================

Concepts
--------

*Peering*
   the process of bringing all of the OSDs that store
   a Placement Group (PG) into agreement about the state
   of all of the objects (and their metadata) in that PG.
   Note that agreeing on the state does not mean that
   they all have the latest contents.

*Acting set*
   the set of OSDs who are (or were as of some epoch)
   in the list of nodes responsible for storing a particular PG.

*primary*
   the (by convention first) member of the *acting set*,
   who is the only OSD that will accept client-initiated
   writes to objects in a placement group.

*replica*
   a non-primary OSD in the *acting set* for a placement group
   (and who has been recognized as such and *activated* by the primary).

*stray*
   an OSD who is not a member of the current *acting set*, but
   has not yet been told that it can delete its copies of a
   particular placement group.

*recovery*
   ensuring that copies of all of the objects in a PG
   are on all of the OSDs in the *acting set*. Once
   *peering* has been performed, the primary can start
   accepting write operations, and *recovery* can proceed
   in the background.

*PG log*
   a list of recent updates made to objects in a PG.
   Note that these logs can be truncated after all OSDs
   in the *acting set* have acknowledged up to a certain
   point.

*back-log*
   If the failure of an OSD makes it necessary to replicate
   operations that have been truncated from the most recent
   PG logs, it will be necessary to reconstruct the missing
   information by walking the object space and generating
   log entries for operations to create the existing objects
   in their existing states. While a back-log may be different
   from the actual set of operations that brought the PG to its
   current state, it is equivalent ... and that is good enough.

*missing set*
   Each OSD notes update log entries and, if they imply updates to
   the contents of an object, adds that object to a list of needed
   updates. This list is called the *missing set* for that <OSD,PG>.
   (A sketch of the *PG log* and *missing set* structures follows at
   the end of this section.)

*Authoritative History*
   a complete and fully ordered set of operations that, if
   performed, would bring an OSD's copy of a Placement Group
   up to date.

*epoch*
   a (monotonically increasing) OSD map version number.

*last epoch start*
   the last epoch at which all nodes in the *acting set*
   for a particular placement group agreed on an
   *authoritative history*. At this point, *peering* is
   deemed to have been successful.

*up through*
   when a primary successfully completes the *peering* process,
   it informs a monitor that an *authoritative history* has
   been established (for that PG) **up through** the current
   epoch, and the primary is now going active.

*last epoch clean*
   the last epoch at which all nodes in the *acting set*
   for a particular placement group were completely
   up to date (both PG logs and object contents).
   At this point, *recovery* is deemed to have been
   completed.

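The *PG log* and *missing set* concepts above map onto small, concrete
data structures. The following Python sketch is illustrative only (the
class and field names are invented for this document, not Ceph's actual
types): it shows one plausible shape for a *PG log* entry and how an OSD
might derive its *missing set* from log entries it has not yet applied::

    from dataclasses import dataclass, field

    @dataclass
    class LogEntry:
        """One update to an object in a PG (a *PG log* entry)."""
        epoch: int     # OSD map epoch in which the update was made
        version: int   # monotonically increasing per-PG sequence number
        oid: str       # object being modified
        op: str        # e.g. "modify" or "delete"

    @dataclass
    class PGLog:
        """A list of recent updates made to objects in a PG."""
        entries: list = field(default_factory=list)  # oldest first

        def head(self):
            """Most recent entry (or None if the log is empty)."""
            return self.entries[-1] if self.entries else None

    def missing_set(log: PGLog, last_applied: int) -> dict:
        """Objects whose logged updates this OSD has not yet applied;
        maps oid -> newest needed version. This is the *missing set*
        for that <OSD,PG>."""
        missing = {}
        for e in log.entries:
            if e.version > last_applied:
                if e.op == "delete":
                    missing.pop(e.oid, None)  # no longer needed
                else:
                    missing[e.oid] = max(e.version, missing.get(e.oid, 0))
        return missing
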
Description of the Peering Process
----------------------------------

The *Golden Rule* is that no write operation to any PG
is acknowledged to a client until it has been persisted
by all members of the *acting set* for that PG. This means
that if we can communicate with at least one member of
each *acting set* since the last successful *peering*, someone
will have a record of every (acknowledged) operation
since the last successful *peering*. It follows that it should
be possible for the current primary to construct and disseminate
a new *authoritative history*.

It is also important to appreciate the role of the OSD map
(list of all known OSDs and their states, as well as some
information about the placement groups) in the *peering*
process:

   When OSDs go up or down (or get added or removed)
   this has the potential to affect the *acting sets*
   of many placement groups.

   When a primary successfully completes the *peering*
   process, this too is noted in the OSD map (*last epoch start*).

   Changes can only be made after successful *peering*
   (recorded in the PAXOS stream as an "up through").

Thus, if a new primary has a copy of the latest OSD map,
it can infer which *acting sets* may have accepted updates,
and thus which OSDs must be consulted before we can successfully
*peer*.

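To make that inference concrete, here is a hedged Python sketch. The
map representation is invented for this example (it is not the actual
OSDMap interface): it walks the OSD map epochs since *last epoch clean*,
collects each distinct *acting set* as an interval, and checks that at
least one member of every interval that achieved *last epoch start*
(and so may have accepted writes) is still contactable::

    def past_intervals(maps, pgid, last_epoch_clean, current_epoch):
        """Distinct acting sets for pgid since last_epoch_clean.
        `maps` is assumed to be: epoch -> {"acting": {pgid: [osd ids]},
        "last_epoch_start": {pgid: epoch}}."""
        intervals, prev = [], None
        for e in range(last_epoch_clean, current_epoch + 1):
            acting = maps[e]["acting"][pgid]
            if prev is None or acting != prev["acting"]:
                prev = {"first": e, "last": e, "acting": acting}
                intervals.append(prev)
            else:
                prev["last"] = e
        return intervals

    def maybe_accepted_writes(iv, maps, pgid):
        """An interval could only accept writes if *peering* completed
        during it, i.e. *last epoch start* falls inside the interval."""
        les = maps[iv["last"]]["last_epoch_start"][pgid]
        return iv["first"] <= les <= iv["last"]

    def can_peer(intervals, maps, pgid, up_osds):
        """Successful peering needs a surviving witness from every
        interval that may have accepted writes."""
        return all(any(osd in up_osds for osd in iv["acting"])
                   for iv in intervals
                   if maybe_accepted_writes(iv, maps, pgid))
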
The high-level process is for the current PG primary to:

1. get the latest OSD map (to identify the members of all
   interesting *acting sets*, and confirm that we are still the
   primary).

2. generate a list of all of the acting sets (that achieved
   *last epoch start*) since the *last epoch clean*. We can
   ignore acting sets that did not achieve *last epoch start*
   because they could not have accepted any updates.

   Successful *peering* will require that we be able to contact at
   least one OSD from each of these *acting sets*.

3. ask every node in that list what its first and last PG log entries are
   (which gives us a complete list of all known operations, and enables
   us to make a list of what log entries each member of the current
   *acting set* does not have).

4. if anyone else has (in his PG log) operations that I do not have,
   instruct him to send me the missing log entries
   (constructing a *back-log* if necessary).

5. for each member of the current *acting set*:

   a) ask him for copies of all PG log entries since *last epoch start*
      so that I can verify that they agree with mine (or know what
      objects I will be telling him to delete).

      If the cluster failed before an operation was persisted by all
      members of the *acting set*, and the subsequent *peering* did not
      remember that operation, and a node that did remember that
      operation later rejoined, his logs would record a different
      (divergent) history than the *authoritative history* that was
      reconstructed in the *peering* after the failure.

      Since the *divergent* events were not recorded in other logs
      from that *acting set*, they were not acknowledged to the client,
      and there is no harm in discarding them (so that all OSDs agree
      on the *authoritative history*). But we will have to instruct
      any OSD that stores data from a divergent update to delete the
      affected (and now deemed to be apocryphal) objects.

   b) ask him for his *missing set* (object updates recorded
      in his PG log, but for which he does not have the new data).
      This is the list of objects that must be fully replicated
      before we can accept writes.

6. at this point, my PG log contains an *authoritative history* of
   the placement group (which may have involved generating a *back-log*),
   and I now have sufficient information to bring any other OSD in the
   *acting set* up to date. I can now inform a monitor that I am
   "up through" the end of my *authoritative history* (see the first
   sketch following this list).

   The monitor will persist this through PAXOS, so that any future
   *peering* of this PG will note that the *acting set* for this
   interval may have made updates to the PG and that a member of
   this *acting set* must be included in the next *peering*.

   This makes me active as the *primary* and establishes a new
   *last epoch start*.

7. for each member of the current *acting set*:

   a) send him log updates to bring his PG log into agreement with
      my own (*authoritative history*) ... which may involve deciding
      to delete divergent objects.

   b) await acknowledgement that he has persisted the PG log entries.

8. at this point all OSDs in the *acting set* agree on all of the metadata,
   and would (in any future *peering*) return identical accounts of all
   updates.

   a) start accepting client write operations (because we have unanimous
      agreement on the state of the objects into which those updates are
      being accepted). Note, however, that we will delay any attempts to
      write to objects that are not yet fully replicated throughout the
      current *acting set*.

   b) start pulling object data updates that other OSDs have, but I do not.

   c) start pushing object data updates to other OSDs that do not yet have
      them.

      We push these updates from the primary (rather than having the replicas
      pull them) because this allows the primary to ensure that a replica has
      the current contents before sending it an update write. It also makes
      it possible for a single read (from the primary) to be used to write
      the data to multiple replicas. If each replica did its own pulls,
      the data might have to be read multiple times. (A sketch of this
      push/pull flow is the second one following this list.)

9. once all replicas store copies of all objects (that existed
   prior to the start of this epoch) we can dismiss all of the *stray*
   replicas, allowing them to delete their copies of objects for which
   they are no longer in the *acting set*.

   We could not dismiss the *strays* prior to this because it was possible
   that one of those *strays* might hold the sole surviving copy of an
   old object (all of whose copies disappeared before they could be
   replicated on members of the current *acting set*).

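The log-comparison core of steps 3 through 7 can be sketched in Python.
This is a simplified, hypothetical model (real peering also weighs
*last epoch start*, builds *back-logs*, and handles epoch boundaries);
logs here are lists of (version, oid) pairs, oldest first, as in the
earlier sketches. The primary picks an authoritative log and then works
out, for each peer, which entries to send and which divergent entries
imply objects to delete::

    def choose_authoritative(logs):
        """Pick the authoritative log. Simplification: take the log
        whose newest entry has the highest version. (Real peering is
        more careful and may need to construct a *back-log*.)"""
        return max(logs.values(), key=lambda log: log[-1][0] if log else 0)

    def reconcile(authoritative, peer_logs):
        """For each peer, compute the entries it lacks (to be sent,
        step 7a) and the divergent entries it holds (entries absent
        from the *authoritative history*, whose objects must be
        reverted or deleted, step 5a)."""
        auth_versions = {v for v, _ in authoritative}
        plans = {}
        for osd, log in peer_logs.items():
            have = {v for v, _ in log}
            plans[osd] = {
                "send": [e for e in authoritative if e[0] not in have],
                "divergent": [e for e in log if e[0] not in auth_versions],
            }
        return plans
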
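Step 8's background *recovery* can be sketched the same way. The
``OSD`` class and its ``read``/``write`` helpers are invented stand-ins
for real OSD I/O; the point is that one read on the primary serves every
replica that is missing an object, whereas independent replica pulls
might read the same object several times::

    class OSD:
        """Minimal stand-in for an OSD's object store."""
        def __init__(self, osd_id, objects=None):
            self.id = osd_id
            self.objects = dict(objects or {})  # oid -> data

        def read(self, oid):
            return self.objects[oid]

        def write(self, oid, data):
            self.objects[oid] = data

    def recover(primary, replicas, missing):
        """Simplified model of steps 8b/8c. `missing[osd id]` is the
        set of object ids that OSD still needs (its *missing set*)."""
        # 8b: pull each object the primary itself is missing from some
        # peer that has it.
        for oid in sorted(missing[primary.id]):
            source = next(r for r in replicas if oid not in missing[r.id])
            primary.write(oid, source.read(oid))
        missing[primary.id].clear()

        # 8c: push each missing object to every replica that needs it;
        # a single read on the primary feeds all of the pushes.
        for oid in sorted(set().union(*(missing[r.id] for r in replicas))):
            data = primary.read(oid)
            for r in replicas:
                if oid in missing[r.id]:
                    r.write(oid, data)
                    missing[r.id].discard(oid)
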
State Model
-----------

.. graphviz:: peering_graph.generated.dot