===================
PG Backend Proposal
===================

NOTE: this page was last updated in 2013, before the Firefly
release. Details of the implementation may differ.

Motivation
----------

The purpose of the `PG Backend interface
<https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h>`_
is to abstract over the differences between replication and erasure
coding as failure recovery mechanisms.

Much of the existing PG logic, particularly that for dealing with
peering, will be common to each. With both schemes, a log of recent
operations will be used to direct recovery in the event that an OSD is
down or disconnected for a brief period of time. Similarly, in both
cases it will be necessary to scan a recovered copy of the PG in order
to recover an empty OSD. The PGBackend abstraction must be
sufficiently expressive for Replicated and ErasureCoded backends to be
treated uniformly in these areas.

However, there are also crucial differences between using replication
and erasure coding which PGBackend must abstract over:

1. The current write strategy would not ensure that a particular
   object could be reconstructed after a failure.
2. Reads on an erasure coded PG require chunks to be read from the
   replicas as well.
3. Object recovery probably involves recovering the missing copies on
   the primary and the replicas at the same time to avoid performing
   extra reads of replica shards.
4. Erasure coded PG chunks created for different acting set
   positions are not interchangeable. In particular, it might make
   sense for a single OSD to hold more than 1 PG copy for different
   acting set positions.
5. Selection of a pgtemp for backfill may differ between replicated
   and erasure coded backends.
6. The set of necessary OSDs from a particular interval required to
   continue peering may differ between replicated and erasure
   coded backends.
7. The selection of the authoritative log may differ between replicated
   and erasure coded backends.

Client Writes
-------------

The current PG implementation performs a write by performing the write
locally while concurrently directing replicas to perform the same
operation. Once all of these writes are durable, the operation as a
whole is considered durable. Because these writes may be destructive
overwrites, during peering, a log entry on a replica (or the primary)
may be found to be divergent if that replica remembers a log event
which the authoritative log does not contain. This can happen if only
1 out of 3 replicas persisted an operation, but that replica was not
available in the next interval to provide an authoritative log. With
replication, we can repair the divergent object as long as at least 1
replica has a current copy of the divergent object. With erasure
coding, however, it might be the case that neither the new version of
the object nor the old version of the object has enough available
chunks to be reconstructed. This problem is much simpler if we arrange
for all supported operations to be locally roll-back-able.

- CEPH_OSD_OP_APPEND: We can roll back an append locally by
  including the previous object size as part of the PG log event.
- CEPH_OSD_OP_DELETE: The possibility of rolling back a delete
  requires that we retain the deleted object until all replicas have
  persisted the deletion event. The ErasureCoded backend will
  therefore need to store objects with the version at which they were
  created included in the key provided to the filestore. Old versions
  of an object can be pruned when all replicas have committed up to
  the log event deleting the object.
- CEPH_OSD_OP_(SET|RM)ATTR: If we include the prior value of the attr
  to be set or removed, we can roll back these operations locally.

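The rollback information listed above can be illustrated with a small,
self-contained sketch. It is not Ceph code: ``Object``, ``AppendUndo``,
``AttrUndo`` and the ``do_*`` helpers are hypothetical names, an object is
modelled as an in-memory string plus an attribute map, and C++17 is assumed.
Each operation returns the data a log entry would need to carry so that the
operation can be undone locally.

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <variant>

// Hypothetical in-memory stand-in for an object: data plus attributes.
struct Object {
  std::string data;
  std::map<std::string, std::string> attrs;
};

// Rollback information a (hypothetical) log entry would carry.
struct AppendUndo {
  std::size_t prior_size;  // object size before the append
};
struct AttrUndo {
  std::string name;
  std::optional<std::string> prior_value;  // nullopt: attr did not exist
};
using Undo = std::variant<AppendUndo, AttrUndo>;

// Append bytes and record the previous size.
inline Undo do_append(Object& o, const std::string& bytes) {
  AppendUndo u{o.data.size()};
  o.data += bytes;
  return u;
}

// Set an attribute and record its prior value, if any.
inline Undo do_setattr(Object& o, const std::string& name,
                       const std::string& value) {
  AttrUndo u{name, std::nullopt};
  auto it = o.attrs.find(name);
  if (it != o.attrs.end()) u.prior_value = it->second;
  o.attrs[name] = value;
  return u;
}

// Undo one operation using only locally stored information.
inline void rollback(Object& o, const Undo& u) {
  if (const auto* a = std::get_if<AppendUndo>(&u)) {
    o.data.resize(a->prior_size);
  } else if (const auto* s = std::get_if<AttrUndo>(&u)) {
    if (s->prior_value) {
      o.attrs[s->name] = *s->prior_value;
    } else {
      o.attrs.erase(s->name);
    }
  }
}
```

Rolling back in reverse log order restores the original object. A delete is
not modelled here because, as described above, it relies on keeping the old
version of the object around rather than on data stored in the log entry.
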
Core Changes:

- The current code should be adapted to use APPEND, DELETE, and
  (SET|RM)ATTR log entries and to roll them back as appropriate.
- The filestore needs to be able to deal with multiply versioned
  hobjects. This means adapting the filestore internally to
  use a `ghobject <https://github.com/ceph/ceph/blob/firefly/src/common/hobject.h#L238>`_,
  which is basically a tuple<hobject_t, gen_t, shard_t>. The gen_t
  and shard_t need to be included in the on-disk filename. gen_t is
  a unique object identifier that ensures there are no name
  collisions when object N is created, deleted, and then created
  again. An interface needs to be added to get all versions of a
  particular hobject_t or the most recently versioned instance of a
  particular hobject_t.

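As a rough illustration of the versioned key described above, the sketch
below models the filestore as an ordered map keyed by a toy
(hobject, gen, shard) triple and adds the two lookups the text asks for.
All names (``GhObject``, ``Store``, ``all_versions``, ``latest_version``) are
hypothetical, the hobject is reduced to a string, and the sort order is an
assumption chosen to make the lookups simple. C++17 is assumed.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <tuple>
#include <vector>

using hobject_name = std::string;  // stand-in for hobject_t
using gen_t = std::uint64_t;
using shard_t = std::uint8_t;

// Toy version of the <hobject_t, gen_t, shard_t> tuple.
struct GhObject {
  hobject_name hobj;
  gen_t gen;
  shard_t shard;
  // Sort by object, then shard, then generation, so that all
  // generations of one (object, shard) pair are adjacent.
  bool operator<(const GhObject& o) const {
    return std::tie(hobj, shard, gen) < std::tie(o.hobj, o.shard, o.gen);
  }
};

using Store = std::map<GhObject, std::string>;

// All generations stored for (hobj, shard), oldest first.
inline std::vector<gen_t> all_versions(const Store& s, const hobject_name& h,
                                       shard_t shard) {
  std::vector<gen_t> out;
  for (auto it = s.lower_bound(GhObject{h, 0, shard});
       it != s.end() && it->first.hobj == h && it->first.shard == shard;
       ++it) {
    out.push_back(it->first.gen);
  }
  return out;
}

// Most recent generation for (hobj, shard), if any exists.
inline std::optional<gen_t> latest_version(const Store& s,
                                           const hobject_name& h,
                                           shard_t shard) {
  std::vector<gen_t> v = all_versions(s, h, shard);
  if (v.empty()) return std::nullopt;
  return v.back();
}
```

Because the generation is part of the key, an object that is created,
deleted and created again occupies distinct entries and never collides with
its earlier incarnation.
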
PGBackend Interfaces:

- PGBackend::perform_write() : It seems simplest to pass the actual
  ops vector. The reason for providing an async, callback based
  interface rather than having the PGBackend respond directly is that
  we might want to use this interface for internal operations like
  watch/notify expiration or snap trimming which might not necessarily
  have an external client.
- PGBackend::try_rollback() : Some log entries (all of the ones valid
  for the Erasure coded backend) will support local rollback. In
  those cases, PGLog can avoid adding objects to the missing set when
  identifying divergent objects.

Peering and PG Logs
-------------------

Currently, we select the log with the newest last_update and the
longest tail to be the authoritative log. This is fine because we
aren't generally able to roll operations on the other replicas forward
or backwards, instead relying on our ability to re-replicate divergent
objects. With the write approach discussed in the previous section,
however, the erasure coded backend will rely on being able to roll
back divergent operations since we may not be able to re-replicate
divergent objects. Thus, we must choose the *oldest* last_update from
the last interval which went active in order to minimize the number of
divergent objects.

The difficulty is that the current code assumes that as long as it has
an info from at least 1 OSD from the prior interval, it can complete
peering. In order to ensure that we do not end up with an
unrecoverably divergent object, a K+M erasure coded PG must hear from at
least K of the replicas of the last interval to serve writes. This ensures
that we will select a last_update old enough to roll back at least K
replicas. If a replica with an older last_update comes along later,
we will be able to provide at least K chunks of any divergent object.

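The rule above can be written down as a short sketch. It is only an
illustration under simplifying assumptions: ``version_t`` is a plain integer
standing in for the real version type, the input is the list of last_update
values reported by shards of the last active interval, and
``choose_rollback_point`` is a hypothetical helper name. C++17 is assumed.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

using version_t = std::uint64_t;  // stand-in for the real version type

// A K+M erasure coded PG needs infos from at least K shards of the
// last interval that went active before it may continue.
inline bool have_enough_infos(const std::vector<version_t>& last_updates,
                              std::size_t k) {
  return last_updates.size() >= k;
}

// Pick the oldest last_update among the shards heard from, so that
// every one of them can be rolled back to a common point.
inline std::optional<version_t> choose_rollback_point(
    const std::vector<version_t>& last_updates, std::size_t k) {
  if (last_updates.empty() || !have_enough_infos(last_updates, k)) {
    return std::nullopt;
  }
  return *std::min_element(last_updates.begin(), last_updates.end());
}
```

For a 2+1 PG that has heard from shards at versions 10, 12 and 11, the
sketch selects 10; with only one info it declines to choose at all.
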
Core Changes:

- PG::choose_acting(), etc. need to be generalized to use PGBackend to
  determine the authoritative log.
- PG::RecoveryState::GetInfo needs to use PGBackend to determine
  whether it has enough infos to continue with authoritative log
  selection.

PGBackend interfaces:

- have_enough_infos()
- choose_acting()

PGTemp
------

Currently, an OSD is able to request a temp acting set mapping in
order to allow an up-to-date OSD to serve requests while a new primary
is backfilled (and for other reasons). An erasure coded pg needs to
be able to designate a primary for these reasons without putting it
in the first position of the acting set. It also needs to be able
to leave holes in the requested acting set.

Core Changes:

- OSDMap::pg_to_*_osds needs to separately return a primary. For most
  cases, this can continue to be acting[0].
- MOSDPGTemp (and related OSD structures) needs to be able to specify
  a primary as well as an acting set.
- Much of the existing code base assumes that acting[0] is the primary
  and that all elements of acting are valid. This needs to be cleaned
  up since the acting set may contain holes.

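A minimal sketch of what "separately return a primary" could look like is
given below. ``ActingMapping``, ``NO_OSD``, ``default_primary`` and
``make_mapping`` are hypothetical names, OSD ids are plain integers, and the
value used to mark a hole is an assumption made for the example. C++17 is
assumed.

```cpp
#include <optional>
#include <utility>
#include <vector>

constexpr int NO_OSD = -1;  // hypothetical marker for a hole

// The acting set and the primary are reported separately, so the
// primary does not have to sit in acting[0].
struct ActingMapping {
  std::vector<int> acting;  // index == acting set position
  int primary;
};

// Default rule: the first valid entry of the acting set.
inline int default_primary(const std::vector<int>& acting) {
  for (int osd : acting) {
    if (osd != NO_OSD) return osd;
  }
  return NO_OSD;
}

// Build a mapping; a temp mapping request may name its own primary.
inline ActingMapping make_mapping(
    std::vector<int> acting,
    std::optional<int> requested_primary = std::nullopt) {
  int primary = requested_primary.value_or(default_primary(acting));
  return ActingMapping{std::move(acting), primary};
}
```

Code written against such a structure has to read ``primary`` instead of
``acting[0]`` and has to skip entries equal to the hole marker.
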
Client Reads
------------

Reads with the replicated strategy can always be satisfied
synchronously out of the primary OSD. With an erasure coded strategy,
the primary will need to request data from some number of replicas in
order to satisfy a read. The perform_read() interface for PGBackend
therefore will be async.

PGBackend interfaces:

- perform_read(): as with perform_write() it seems simplest to pass
  the ops vector. The call to oncomplete will occur once the out_bls
  have been appropriately filled in.

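The callback shape described above might look roughly like the following
toy backend. It is a sketch, not the real interface: ``ToyBackend``,
``ReadOp`` and ``deliver_replies`` are hypothetical, buffers are plain
strings, and the arrival of replies from other shards is simulated by an
explicit call. C++14 or later is assumed.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

struct ReadOp {
  std::size_t offset;
  std::size_t length;
};

class ToyBackend {
 public:
  explicit ToyBackend(std::string object) : object_(std::move(object)) {}

  // Queue the read. out_bls is filled in later, and on_complete is
  // called only after every buffer has been filled.
  void perform_read(std::vector<ReadOp> ops,
                    std::vector<std::string>* out_bls,
                    std::function<void()> on_complete) {
    pending_.push_back(
        [this, ops = std::move(ops), out_bls, cb = std::move(on_complete)] {
          out_bls->clear();
          for (const ReadOp& op : ops) {
            std::size_t pos = std::min(op.offset, object_.size());
            out_bls->push_back(object_.substr(pos, op.length));
          }
          cb();
        });
  }

  // Stands in for the replies from the other shards arriving.
  void deliver_replies() {
    while (!pending_.empty()) {
      std::function<void()> fn = std::move(pending_.front());
      pending_.pop_front();
      fn();
    }
  }

 private:
  std::string object_;
  std::deque<std::function<void()>> pending_;
};
```

The caller must keep ``out_bls`` alive until the callback has run, which is
the usual cost of a completion-based interface.
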
Distinguished acting set positions
----------------------------------

With the replicated strategy, all replicas of a PG are
interchangeable. With erasure coding, different positions in the
acting set have different pieces of the erasure coding scheme and are
not interchangeable. Worse, crush might cause chunk 2 to be written
to an OSD which happens already to contain an (old) copy of chunk 4.
This means that the OSD and PG messages need to work in terms of a
type like pair<shard_t, pg_t> in order to distinguish different pg
chunks on a single OSD.

Because the mapping of object name to object in the filestore must
be 1-to-1, we must ensure that the objects in chunk 2 and the objects
in chunk 4 have different names. To that end, the filestore must
include the chunk id in the object key.

Core changes:

- The filestore `ghobject_t
  <https://github.com/ceph/ceph/blob/firefly/src/common/hobject.h#L241>`_
  also needs to include a chunk id, making it more like
  tuple<hobject_t, gen_t, shard_t>.
- coll_t needs to include a shard_t.
- The OSD pg_map and similar pg mappings need to work in terms of a
  spg_t (essentially pair<pg_t, shard_t>). Similarly, pg->pg messages
  need to include a shard_t.
- For client->PG messages, the OSD will need a way to know which PG
  chunk should get the message since the OSD may contain both a
  primary and non-primary chunk for the same pg.

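To make the keying change concrete, here is a small sketch of a pg map
indexed by a (pg, shard) pair. The ``toy_*`` types and ``OsdPgMap`` are
hypothetical simplifications built on ``std::pair`` and are not the real
pg_t or spg_t definitions.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

using toy_pg_t = std::pair<std::uint64_t, std::uint32_t>;  // (pool, seed)
using toy_shard_t = std::int8_t;
using toy_spg_t = std::pair<toy_pg_t, toy_shard_t>;  // pg plus shard

// Per-OSD map of PG chunks. Keying on pg alone could not hold two
// chunks of the same PG on one OSD.
class OsdPgMap {
 public:
  void add(const toy_spg_t& id, std::string description) {
    pgs_[id] = std::move(description);
  }

  // Returns nullptr when this OSD holds no such chunk.
  const std::string* lookup(const toy_spg_t& id) const {
    auto it = pgs_.find(id);
    return it == pgs_.end() ? nullptr : &it->second;
  }

  std::size_t size() const { return pgs_.size(); }

 private:
  std::map<toy_spg_t, std::string> pgs_;
};
```

With this key, an OSD that receives chunk 2 of a PG while still holding an
old chunk 4 of the same PG keeps two separate entries, and a message that
names its shard can be routed to the right one.
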
Object Classes
--------------

We probably won't support object classes at first on Erasure coded
backends.

Scrub
-----

We currently have two scrub modes with different default frequencies:

1. [shallow] scrub: compares the set of objects and metadata, but not
   the contents
2. deep scrub: compares the set of objects, metadata, and a crc32 of
   the object contents (including omap)

The primary requests a scrubmap from each replica for a particular
range of objects. The replica fills out this scrubmap for the range
of objects including, if the scrub is deep, a crc32 of the contents of
each object. The primary gathers these scrubmaps from each replica
and performs a comparison identifying inconsistent objects.

Most of this can work essentially unchanged with erasure coded PGs,
with the caveat that the PGBackend implementation must be in charge of
actually doing the scan, and that the PGBackend implementation should
be able to attach arbitrary information to allow PGBackend on the
primary to scrub PGBackend specific metadata.

The main catch, however, for erasure coded PGs is that sending a crc32
of the stored chunk on a replica isn't particularly helpful since the
chunks on different replicas presumably store different data. Because
we don't support overwrites except via DELETE, however, we have the
option of maintaining a crc32 on each chunk through each append.
Thus, each replica instead simply computes a crc32 of its own stored
chunk and compares it with the locally stored checksum. The replica
then reports to the primary whether the checksums match.

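The append-only checksum idea can be sketched as follows. The example uses
the common reflected CRC-32 (polynomial 0xEDB88320) computed bit by bit as a
stand-in; which checksum the implementation actually uses is not specified
here, and ``Chunk``, ``crc32_update`` and ``scrub_ok`` are hypothetical
names.

```cpp
#include <cstdint>
#include <string>

// Continue a CRC-32 over more bytes. Passing 0 starts a new checksum,
// and feeding the result back in continues it, so the checksum of a
// chunk can be maintained across appends without rereading old data.
inline std::uint32_t crc32_update(std::uint32_t crc,
                                  const std::string& bytes) {
  crc = ~crc;
  for (unsigned char b : bytes) {
    crc ^= b;
    for (int i = 0; i < 8; ++i) {
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

// One shard's chunk of an object plus the checksum kept next to it.
struct Chunk {
  std::string data;
  std::uint32_t stored_crc = 0;

  // Appends are the only writes, so the checksum is simply extended.
  void append(const std::string& bytes) {
    data += bytes;
    stored_crc = crc32_update(stored_crc, bytes);
  }

  // Deep scrub on the replica: recompute from the stored bytes and
  // compare with the stored checksum; only the verdict is reported.
  bool scrub_ok() const { return crc32_update(0, data) == stored_crc; }
};
```

The replica never needs another shard's data to detect local corruption,
which is what makes the per-chunk checksum useful when chunks differ.
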
PGBackend interfaces:

- scan()
- scrub()
- compare_scrub_maps()

Crush
-----

If crush is unable to generate a replacement for a down member of an
acting set, the acting set should have a hole at that position rather
than shifting the other elements of the acting set out of position.

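The difference between shifting and leaving a hole is easy to show on a
plain vector of OSD ids. This is only an illustration of the desired
outcome, not of how crush computes mappings; ``NO_OSD``, ``drop_shifting``
and ``drop_with_hole`` are hypothetical names.

```cpp
#include <algorithm>
#include <vector>

constexpr int NO_OSD = -1;  // hypothetical marker for a hole

// Shifting behaviour: later members move up one position, which
// changes the chunk each of them is expected to hold.
inline std::vector<int> drop_shifting(std::vector<int> acting, int down) {
  acting.erase(std::remove(acting.begin(), acting.end(), down),
               acting.end());
  return acting;
}

// Desired behaviour for erasure coding: the position stays reserved
// and every other member keeps its position.
inline std::vector<int> drop_with_hole(std::vector<int> acting, int down) {
  std::replace(acting.begin(), acting.end(), down, NO_OSD);
  return acting;
}
```

Starting from positions (3, 7, 9, 4) with OSD 7 down, shifting yields
(3, 9, 4), so OSDs 9 and 4 would suddenly be responsible for different
chunks, while the hole variant yields (3, hole, 9, 4).
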
Core changes:

- Ensure that crush behaves as above for INDEP.

Recovery
--------

The logic for recovering an object depends on the backend. With
the current replicated strategy, we first pull the object replica
to the primary and then concurrently push it out to the replicas.
With the erasure coded strategy, we probably want to read the
minimum number of replica chunks required to reconstruct the object
and push out the replacement chunks concurrently.

Another difference is that objects in an erasure coded PG may be
unrecoverable without being unfound. The "unfound" concept
should probably then be renamed to unrecoverable. Also, the
PGBackend implementation will have to be able to direct the search
for pg replicas with unrecoverable object chunks and to be able
to determine whether a particular object is recoverable.

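A minimal sketch of the recoverability check, assuming a K+M code in which
any K distinct chunks are enough to reconstruct an object: the function
names are hypothetical, shards are identified by plain integers, and the
choice of which K shards to read is arbitrary here. C++17 is assumed.

```cpp
#include <cstddef>
#include <optional>
#include <set>
#include <vector>

// An object is recoverable when at least K shards hold a usable
// chunk, even if every existing chunk has been located.
inline bool is_recoverable(const std::set<int>& available_shards,
                           std::size_t k) {
  return available_shards.size() >= k;
}

// Choose the minimum number of shards to read for reconstruction,
// or nothing when the object cannot be rebuilt.
inline std::optional<std::vector<int>> shards_to_read(
    const std::set<int>& available_shards, std::size_t k) {
  if (!is_recoverable(available_shards, k)) return std::nullopt;
  std::vector<int> picked;
  for (int shard : available_shards) {
    if (picked.size() == k) break;
    picked.push_back(shard);
  }
  return picked;
}
```

An object with chunks known on two shards of a 3+2 PG is found but not
recoverable, which is the case the renaming from unfound to unrecoverable
is meant to cover.
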
Core changes:

- s/unfound/unrecoverable

PGBackend interfaces:

- `on_local_recover_start <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L60>`_
- `on_local_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L66>`_
- `on_global_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L78>`_
- `on_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L83>`_
- `begin_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L90>`_

Backfill
--------

For the most part, backfill itself should behave similarly between
replicated and erasure coded pools with a few exceptions:

1. We probably want to be able to backfill multiple OSDs concurrently
   with an erasure coded pool in order to cut down on the read
   overhead.
2. We probably want to avoid having to place the backfill peers in the
   acting set for an erasure coded pg because we might have a good
   temporary pg chunk for that acting set slot.

For 2, we don't really need to place the backfill peer in the acting
set for replicated PGs anyway.
For 1, PGBackend::choose_backfill() should determine which OSDs are
backfilled in a particular interval.

Core changes:

- Backfill should be capable of handling multiple backfill peers
  concurrently even for replicated pgs (easier to test for now)
- Backfill peers should not be placed in the acting set.

PGBackend interfaces:

- choose_backfill(): allows the implementation to determine which OSDs
  should be backfilled in a particular interval.