.. _log-based-pg:

============
Log Based PG
============

Background
==========

Why PrimaryLogPG?
-----------------

Currently, consistency for all Ceph pool types is ensured by primary
log-based replication. This goes for both erasure-coded (EC) and
replicated pools.

Primary log-based replication
-----------------------------

Reads must return data written by any write which completed (where the
client could possibly have received a commit message). There are lots
of ways to handle this, but Ceph's architecture makes it easy for
everyone at any map epoch to know who the primary is. Thus, the easy
answer is to route all writes for a particular PG through a single
ordering primary and then out to the replicas. Though we only
actually need to serialize writes on a single RADOS object (and even then,
the partial ordering only really needs to provide an ordering between
writes on overlapping regions), we might as well serialize writes on
the whole PG since it lets us represent the current state of the PG
using two numbers: the epoch of the map on the primary in which the
most recent write started (this is a bit stranger than it might seem
since map distribution itself is asynchronous -- see Peering and the
concept of interval changes) and an increasing per-PG version number
-- this is referred to in the code with type ``eversion_t`` and stored as
``pg_info_t::last_update``. Furthermore, we maintain a log of "recent"
operations extending back at least far enough to include any
*unstable* writes (writes which have been started but not committed)
and objects which aren't up-to-date locally (see recovery and
backfill). In practice, the log will extend much further
(``osd_min_pg_log_entries`` when clean and ``osd_max_pg_log_entries`` when not
clean) because it's handy for quickly performing recovery.
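
For illustration, a minimal standalone sketch of the two numbers that name
the current PG state and of the log that accompanies them. The real
definitions live in ``src/osd/osd_types.h`` and carry far more state; the
fields below are a simplified subset, not the actual layout:

.. code-block:: cpp

   // Simplified sketch only: the real eversion_t/pg_log_entry_t/pg_info_t
   // in src/osd/osd_types.h carry much more state than shown here.
   #include <cstdint>
   #include <deque>
   #include <string>
   #include <tuple>

   using epoch_t = uint32_t;
   using version_t = uint64_t;

   struct eversion_t {
     epoch_t epoch = 0;      // map epoch on the primary when the write started
     version_t version = 0;  // increasing per-PG version number
     bool operator<(const eversion_t &rhs) const {
       return std::tie(epoch, version) < std::tie(rhs.epoch, rhs.version);
     }
   };

   struct pg_log_entry_t {   // one "recent" operation in the PG log
     eversion_t version;     // position of this write in the log
     std::string soid;       // object touched (simplified to a name)
   };

   struct pg_info_t {
     eversion_t last_update;          // newest write this OSD knows about
     std::deque<pg_log_entry_t> log;  // kept separately (pg_log_t) in the real code
   };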

Using this log, as long as we talk to a non-empty subset of the OSDs
which must have accepted any completed writes from the most recent
interval in which we accepted writes, we can determine a conservative
log which must contain any write which has been reported to a client
as committed. There is some freedom here: we can choose any log entry
between the oldest head remembered by an element of that set (any
newer write cannot have completed without that log containing it) and the
newest head remembered (clearly, all writes in the log were started,
so it's fine for us to remember them) as the new head. This is the
main point of divergence between replicated pools and EC pools in
``PG/PrimaryLogPG``: replicated pools try to choose the newest valid
option to avoid the client needing to replay those operations and
instead recover the other copies; EC pools instead try to choose
the *oldest* option available to them.
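
As a sketch of that freedom (reusing the ``eversion_t`` sketch above): any
head between the oldest and the newest remembered by the probed set is
safe, and the pool type picks one end of that range. This is illustrative
only; the real selection is part of the peering machinery in ``PGLog``,
not a free function like this:

.. code-block:: cpp

   // Illustrative only: pick a new head from the heads remembered by the
   // OSDs probed during peering. Replicated pools take the newest safe
   // option, EC pools the oldest; the real logic lives in PGLog/peering.
   #include <algorithm>
   #include <cassert>
   #include <vector>

   eversion_t choose_new_head(const std::vector<eversion_t> &peer_heads,
                              bool ec_pool) {
     assert(!peer_heads.empty());
     auto [oldest, newest] =
         std::minmax_element(peer_heads.begin(), peer_heads.end());
     // EC: oldest option, so newer unstable entries can be rolled back.
     // Replicated: newest option; other copies are recovered instead.
     return ec_pool ? *oldest : *newest;
   }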

The reason for this gets to the heart of the rest of the differences
in implementation: one copy will not generally be enough to
reconstruct an EC object. Indeed, there are encodings where some log
combinations would leave unrecoverable objects (as with a ``k=4,m=2`` encoding
where 3 of the shards remember a write, but the other 3 do not -- we
don't have 3 copies of either version). For this reason, log entries
representing *unstable* writes (writes not yet committed to the
client) must be rollbackable using only local information on EC pools.
Log entries in general may therefore be rollbackable (either via a
delayed application or via a set of instructions for rolling back an
in-place update) or not. Replicated pool log entries are never
rollbackable.
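
A sketch of that distinction in code (again reusing ``eversion_t`` from
above): conceptually, an entry is rollbackable when it carries enough
local undo information. In the real code this role is played by
``ObjectModDesc`` in ``src/osd/osd_types.h``; the types below are
hypothetical stand-ins:

.. code-block:: cpp

   // Hypothetical stand-in for ObjectModDesc: a recipe for undoing an
   // in-place update using only local information.
   #include <optional>
   #include <string>
   #include <vector>

   struct rollback_desc_t {
     // e.g. "truncate back to offset X", "restore attr Y", ...
     std::vector<std::string> steps;
   };

   struct log_entry_sketch_t {
     eversion_t version;
     // Present only when the write can be undone locally, as EC pools
     // require for unstable (not yet committed) writes.
     std::optional<rollback_desc_t> rollback_info;
     bool can_rollback() const { return rollback_info.has_value(); }
   };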

For more details, see ``PGLog.h/cc``, ``osd_types.h:pg_log_t``,
``osd_types.h:pg_log_entry_t``, and peering in general.

ReplicatedBackend/ECBackend unification strategy
================================================

PGBackend
---------

The fundamental difference between replication and erasure coding
is that replication can do destructive updates while erasure coding
cannot. It would be really annoying if we needed two entire
implementations of ``PrimaryLogPG``, since there
are really only a few fundamental differences:

#. How reads work -- async only, requires remote reads for EC
#. How writes work -- either restricted to append, or must write aside
   and do a two-phase commit (TPC)
#. Whether we choose the oldest or newest possible head entry during peering
#. A bit of extra information in the log entry to enable rollback

and so many similarities:

#. All of the stats and metadata for objects
#. The high level locking rules for mixing client IO with recovery and scrub
#. The high level locking rules for mixing reads and writes without exposing
   uncommitted state (which might be rolled back or forgotten later)
#. The process, metadata, and protocol needed to determine the set of OSDs
   which participated in the most recent interval in which we accepted writes
#. etc.

Instead, we choose a few abstractions (and a few kludges) to paper over
the differences:

#. ``PGBackend``
#. ``PGTransaction``
#. ``PG::choose_acting`` chooses between ``calc_replicated_acting`` and
   ``calc_ec_acting``
#. Various bits of the write pipeline disallow some operations based on pool
   type -- like omap operations, class operation reads, and writes which are
   not aligned appends (officially, so far) for EC
#. Misc other kludges here and there

``PGBackend`` and ``PGTransaction`` enable abstraction of differences 1 and 2
above and the addition of 4 as needed to the log entries.
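
A rough sketch of that boundary, with hypothetical names (the real
declarations are in ``src/osd/PGBackend.h`` and ``src/osd/PGTransaction.h``
and are considerably richer):

.. code-block:: cpp

   // Hypothetical sketch of the abstraction boundary: PrimaryLogPG builds
   // a transaction and hands it to whichever backend the pool type needs.
   #include <memory>

   struct PGTransactionSketch {
     // Describes writes abstractly (append, clone, omap ops, ...) so each
     // backend can map them onto in-place updates or write-aside + TPC.
   };

   class PGBackendSketch {
    public:
     virtual ~PGBackendSketch() = default;
     // Difference 1: how reads work (EC must read remote shards).
     virtual void objects_read_async(/* hoid, extents, on_complete */) = 0;
     // Difference 2: how writes reach disk; EC also records rollback
     // information (difference 4) in the log entries it emits.
     virtual void submit_transaction(std::unique_ptr<PGTransactionSketch> t) = 0;
   };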

The replicated implementation is in ``ReplicatedBackend.h/cc`` and doesn't
require much additional explanation. More detail on the ``ECBackend`` can be
found in ``doc/dev/osd_internals/erasure_coding/ecbackend.rst``.

PGBackend Interface Explanation
===============================

Note: this is from a design document that predated the Firefly release
and is probably out of date w.r.t. some of the method names.

Readable vs Degraded
--------------------

For a replicated pool, an object is readable if and only if it is present on
the primary (at the right version). For an EC pool, we need at least
`k` shards present to perform a read, and we need it on the primary. For
this reason, ``PGBackend`` needs to include some interfaces for determining
when recovery is required to serve a read vs a write. This also
changes the rules for when peering has enough logs to prove that it
has seen every write reported to a client as committed.

Core changes:

- | ``PGBackend`` needs to be able to return ``IsPG(Recoverable|Readable)Predicate``
  | objects to allow the user to make these determinations.
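
As a sketch of the predicate idea (the actual classes are
``IsPGRecoverablePredicate`` and ``IsPGReadablePredicate`` in
``src/osd/PGBackend.h``; the body below is illustrative, not the real EC
logic, which consults the erasure code plugin):

.. code-block:: cpp

   // Illustrative predicate: given the shards we have, can the object be
   // reconstructed? For replication min_shards == 1; for a k+m EC pool it
   // is (roughly) k. The real check goes through the erasure code plugin.
   #include <cstddef>
   #include <set>

   using shard_id_t = int;

   struct is_recoverable_sketch {
     std::size_t min_shards;  // 1 for replication, k for EC (roughly)
     bool operator()(const std::set<shard_id_t> &have) const {
       return have.size() >= min_shards;
     }
   };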

Client Reads
------------

Reads from a replicated pool can always be satisfied
synchronously by the primary OSD. Within an erasure coded pool,
the primary will need to request data from some number of replicas in
order to satisfy a read. ``PGBackend`` will therefore need to provide
separate ``objects_read_sync`` and ``objects_read_async`` interfaces where
the former won't be implemented by the ``ECBackend``.

``PGBackend`` interfaces:

- ``objects_read_sync``
- ``objects_read_async``
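
A sketch of why the split exists, with simplified signatures (the real
interfaces in ``src/osd/PGBackend.h`` use ``hobject_t``/``bufferlist``; the
error return on the EC sync path reflects that ``ECBackend`` doesn't
implement it):

.. code-block:: cpp

   // Simplified signatures; see PGBackend.h for the real ones.
   #include <cerrno>
   #include <cstdint>
   #include <functional>
   #include <string>

   class ReplicatedBackendSketch {
    public:
     // All object data is local to the primary, so a synchronous read is
     // always possible.
     int objects_read_sync(const std::string &oid, uint64_t off, uint64_t len,
                           std::string *out);
   };

   class ECBackendSketch {
    public:
     // The needed shards live on other OSDs; a synchronous read cannot be
     // implemented, so only the async variant exists.
     int objects_read_sync(const std::string &, uint64_t, uint64_t,
                           std::string *) {
       return -EOPNOTSUPP;
     }
     void objects_read_async(const std::string &oid, uint64_t off, uint64_t len,
                             std::function<void(int ret, std::string data)> done);
   };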

Scrubs
------

We currently have two scrub modes with different default frequencies:

#. [shallow] scrub: compares the set of objects and metadata, but not
   the contents
#. deep scrub: compares the set of objects, metadata, and a CRC32 of
   the object contents (including omap)

The primary requests a scrubmap from each replica for a particular
range of objects. The replica fills out this scrubmap for the range
of objects including, if the scrub is deep, a CRC32 of the contents of
each object. The primary gathers these scrubmaps from each replica
and performs a comparison identifying inconsistent objects.

Most of this can work essentially unchanged with erasure coded PGs, with
the caveat that the ``PGBackend`` implementation must be in charge of
actually doing the scan.

``PGBackend`` interfaces:

- ``be_*``
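
A compressed sketch of the comparison step described above (the real
``ScrubMap`` types and the scrubber's comparison carry far more detail;
everything here is illustrative):

.. code-block:: cpp

   // Illustrative scrubmap: object name -> (size, optional deep-scrub CRC).
   #include <cstdint>
   #include <map>
   #include <optional>
   #include <string>
   #include <vector>

   struct scrub_object_sketch {
     uint64_t size = 0;
     std::optional<uint32_t> content_crc32;  // only filled by deep scrub
   };
   using scrubmap_sketch = std::map<std::string, scrub_object_sketch>;

   // Primary-side comparison: flag any object whose replicas disagree on
   // metadata or (for deep scrub) content CRC. One-directional for brevity:
   // objects absent from the first map are not checked here.
   std::vector<std::string>
   find_inconsistent(const std::vector<scrubmap_sketch> &maps) {
     std::vector<std::string> bad;
     if (maps.empty())
       return bad;
     for (const auto &[oid, ref] : maps.front()) {
       for (const auto &m : maps) {
         auto it = m.find(oid);
         if (it == m.end() || it->second.size != ref.size ||
             it->second.content_crc32 != ref.content_crc32) {
           bad.push_back(oid);
           break;
         }
       }
     }
     return bad;
   }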

Recovery
--------

The logic for recovering an object depends on the backend. With
the current replicated strategy, we first pull the object replica
to the primary and then concurrently push it out to the replicas.
With the erasure coded strategy, we probably want to read the
minimum number of replica chunks required to reconstruct the object
and push out the replacement chunks concurrently.

Another difference is that objects in an erasure coded PG may be
unrecoverable without being unfound. The ``unfound`` state
should probably be renamed to ``unrecoverable``. Also, the
``PGBackend`` implementation will have to be able to direct the search
for PG replicas with unrecoverable object chunks and to be able
to determine whether a particular object is recoverable.
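
A sketch of the erasure coded recovery path under those assumptions;
``reconstruct_shard`` stands in for the erasure code plugin's
decode-and-reencode step and is hypothetical, as is the rest of the
naming:

.. code-block:: cpp

   // Illustrative EC recovery: read k surviving chunks, reconstruct the
   // missing shards, and push the replacements out concurrently.
   #include <cstddef>
   #include <map>
   #include <set>
   #include <string>

   using chunk_t = std::string;

   // Hypothetical stand-in for the erasure code plugin's decode+encode.
   chunk_t reconstruct_shard(const std::map<int, chunk_t> &have, int shard);

   void recover_object_sketch(const std::map<int, chunk_t> &surviving,
                              const std::set<int> &missing_shards,
                              std::size_t k) {
     if (surviving.size() < k) {
       // Unrecoverable: fewer than k chunks exist anywhere, even though
       // every surviving chunk is "found".
       return;
     }
     for (int shard : missing_shards) {
       chunk_t replacement = reconstruct_shard(surviving, shard);
       // push `replacement` to the OSD that should hold `shard` (not shown)
       (void)replacement;
     }
   }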

Core changes:

- ``s/unfound/unrecoverable``

``PGBackend`` interfaces:

- `on_local_recover_start <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L60>`_
- `on_local_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L66>`_
- `on_global_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L78>`_
- `on_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L83>`_
- `begin_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L90>`_