=================== PG Backend Proposal =================== NOTE: the last update of this page is dated 2013, before the Firefly release. The details of the implementation may be different. Motivation ---------- The purpose of the `PG Backend interface `_ is to abstract over the differences between replication and erasure coding as failure recovery mechanisms. Much of the existing PG logic, particularly that for dealing with peering, will be common to each. With both schemes, a log of recent operations will be used to direct recovery in the event that an OSD is down or disconnected for a brief period of time. Similarly, in both cases it will be necessary to scan a recovered copy of the PG in order to recover an empty OSD. The PGBackend abstraction must be sufficiently expressive for Replicated and ErasureCoded backends to be treated uniformly in these areas. However, there are also crucial differences between using replication and erasure coding which PGBackend must abstract over: 1. The current write strategy would not ensure that a particular object could be reconstructed after a failure. 2. Reads on an erasure coded PG require chunks to be read from the replicas as well. 3. Object recovery probably involves recovering the primary and replica missing copies at the same time to avoid performing extra reads of replica shards. 4. Erasure coded PG chunks created for different acting set positions are not interchangeable. In particular, it might make sense for a single OSD to hold more than 1 PG copy for different acting set positions. 5. Selection of a pgtemp for backfill may differ between replicated and erasure coded backends. 6. The set of necessary OSDs from a particular interval required to continue peering may differ between replicated and erasure coded backends. 7. The selection of the authoritative log may differ between replicated and erasure coded backends. Client Writes ------------- The current PG implementation performs a write by performing the write locally while concurrently directing replicas to perform the same operation. Once all operations are durable, the operation is considered durable. Because these writes may be destructive overwrites, during peering, a log entry on a replica (or the primary) may be found to be divergent if that replica remembers a log event which the authoritative log does not contain. This can happen if only 1 out of 3 replicas persisted an operation, but was not available in the next interval to provide an authoritative log. With replication, we can repair the divergent object as long as at least 1 replica has a current copy of the divergent object. With erasure coding, however, it might be the case that neither the new version of the object nor the old version of the object has enough available chunks to be reconstructed. This problem is much simpler if we arrange for all supported operations to be locally roll-back-able. - CEPH_OSD_OP_APPEND: We can roll back an append locally by including the previous object size as part of the PG log event. - CEPH_OSD_OP_DELETE: The possibility of rolling back a delete requires that we retain the deleted object until all replicas have persisted the deletion event. ErasureCoded backend will therefore need to store objects with the version at which they were created included in the key provided to the filestore. Old versions of an object can be pruned when all replicas have committed up to the log event deleting the object. - CEPH_OSD_OP_(SET|RM)ATTR: If we include the prior value of the attr to be set or removed, we can roll back these operations locally. Core Changes: - Current code should be adapted to use and rollback as appropriate APPEND, DELETE, (SET|RM)ATTR log entries. - The filestore needs to be able to deal with multiply versioned hobjects. This means adapting the filestore internally to use a `ghobject `_ which is basically a tuple. The gen_t + shard_t need to be included in the on-disk filename. gen_t is a unique object identifier to make sure there are no name collisions when object N is created + deleted + created again. An interface needs to be added to get all versions of a particular hobject_t or the most recently versioned instance of a particular hobject_t. PGBackend Interfaces: - PGBackend::perform_write() : It seems simplest to pass the actual ops vector. The reason for providing an async, callback based interface rather than having the PGBackend respond directly is that we might want to use this interface for internal operations like watch/notify expiration or snap trimming which might not necessarily have an external client. - PGBackend::try_rollback() : Some log entries (all of the ones valid for the Erasure coded backend) will support local rollback. In those cases, PGLog can avoid adding objects to the missing set when identifying divergent objects. Peering and PG Logs ------------------- Currently, we select the log with the newest last_update and the longest tail to be the authoritative log. This is fine because we aren't generally able to roll operations on the other replicas forward or backwards, instead relying on our ability to re-replicate divergent objects. With the write approach discussed in the previous section, however, the erasure coded backend will rely on being able to roll back divergent operations since we may not be able to re-replicate divergent objects. Thus, we must choose the *oldest* last_update from the last interval which went active in order to minimize the number of divergent objects. The difficulty is that the current code assumes that as long as it has an info from at least 1 OSD from the prior interval, it can complete peering. In order to ensure that we do not end up with an unrecoverably divergent object, a K+M erasure coded PG must hear from at least K of the replicas of the last interval to serve writes. This ensures that we will select a last_update old enough to roll back at least K replicas. If a replica with an older last_update comes along later, we will be able to provide at least K chunks of any divergent object. Core Changes: - PG::choose_acting(), etc. need to be generalized to use PGBackend to determine the authoritative log. - PG::RecoveryState::GetInfo needs to use PGBackend to determine whether it has enough infos to continue with authoritative log selection. PGBackend interfaces: - have_enough_infos() - choose_acting() PGTemp ------ Currently, an OSD is able to request a temp acting set mapping in order to allow an up-to-date OSD to serve requests while a new primary is backfilled (and for other reasons). An erasure coded pg needs to be able to designate a primary for these reasons without putting it in the first position of the acting set. It also needs to be able to leave holes in the requested acting set. Core Changes: - OSDMap::pg_to_*_osds needs to separately return a primary. For most cases, this can continue to be acting[0]. - MOSDPGTemp (and related OSD structures) needs to be able to specify a primary as well as an acting set. - Much of the existing code base assumes that acting[0] is the primary and that all elements of acting are valid. This needs to be cleaned up since the acting set may contain holes. Client Reads ------------ Reads with the replicated strategy can always be satisfied synchronously out of the primary OSD. With an erasure coded strategy, the primary will need to request data from some number of replicas in order to satisfy a read. The perform_read() interface for PGBackend therefore will be async. PGBackend interfaces: - perform_read(): as with perform_write() it seems simplest to pass the ops vector. The call to oncomplete will occur once the out_bls have been appropriately filled in. Distinguished acting set positions ---------------------------------- With the replicated strategy, all replicas of a PG are interchangeable. With erasure coding, different positions in the acting set have different pieces of the erasure coding scheme and are not interchangeable. Worse, crush might cause chunk 2 to be written to an OSD which happens already to contain an (old) copy of chunk 4. This means that the OSD and PG messages need to work in terms of a type like pair in order to distinguish different pg chunks on a single OSD. Because the mapping of object name to object in the filestore must be 1-to-1, we must ensure that the objects in chunk 2 and the objects in chunk 4 have different names. To that end, the filestore must include the chunk id in the object key. Core changes: - The filestore `ghobject_t needs to also include a chunk id `_ making it more like tuple. - coll_t needs to include a shard_t. - The OSD pg_map and similar pg mappings need to work in terms of a spg_t (essentially pair). Similarly, pg->pg messages need to include a shard_t - For client->PG messages, the OSD will need a way to know which PG chunk should get the message since the OSD may contain both a primary and non-primary chunk for the same pg Object Classes -------------- We probably won't support object classes at first on Erasure coded backends. Scrub ----- We currently have two scrub modes with different default frequencies: 1. [shallow] scrub: compares the set of objects and metadata, but not the contents 2. deep scrub: compares the set of objects, metadata, and a crc32 of the object contents (including omap) The primary requests a scrubmap from each replica for a particular range of objects. The replica fills out this scrubmap for the range of objects including, if the scrub is deep, a crc32 of the contents of each object. The primary gathers these scrubmaps from each replica and performs a comparison identifying inconsistent objects. Most of this can work essentially unchanged with erasure coded PG with the caveat that the PGBackend implementation must be in charge of actually doing the scan, and that the PGBackend implementation should be able to attach arbitrary information to allow PGBackend on the primary to scrub PGBackend specific metadata. The main catch, however, for erasure coded PG is that sending a crc32 of the stored chunk on a replica isn't particularly helpful since the chunks on different replicas presumably store different data. Because we don't support overwrites except via DELETE, however, we have the option of maintaining a crc32 on each chunk through each append. Thus, each replica instead simply computes a crc32 of its own stored chunk and compares it with the locally stored checksum. The replica then reports to the primary whether the checksums match. PGBackend interfaces: - scan() - scrub() - compare_scrub_maps() Crush ----- If crush is unable to generate a replacement for a down member of an acting set, the acting set should have a hole at that position rather than shifting the other elements of the acting set out of position. Core changes: - Ensure that crush behaves as above for INDEP. Recovery -------- The logic for recovering an object depends on the backend. With the current replicated strategy, we first pull the object replica to the primary and then concurrently push it out to the replicas. With the erasure coded strategy, we probably want to read the minimum number of replica chunks required to reconstruct the object and push out the replacement chunks concurrently. Another difference is that objects in erasure coded pg may be unrecoverable without being unfound. The "unfound" concept should probably then be renamed to unrecoverable. Also, the PGBackend implementation will have to be able to direct the search for pg replicas with unrecoverable object chunks and to be able to determine whether a particular object is recoverable. Core changes: - s/unfound/unrecoverable PGBackend interfaces: - `on_local_recover_start `_ - `on_local_recover `_ - `on_global_recover `_ - `on_peer_recover `_ - `begin_peer_recover `_ Backfill -------- For the most part, backfill itself should behave similarly between replicated and erasure coded pools with a few exceptions: 1. We probably want to be able to backfill multiple OSDs concurrently with an erasure coded pool in order to cut down on the read overhead. 2. We probably want to avoid having to place the backfill peers in the acting set for an erasure coded pg because we might have a good temporary pg chunk for that acting set slot. For 2, we don't really need to place the backfill peer in the acting set for replicated PGs anyway. For 1, PGBackend::choose_backfill() should determine which OSDs are backfilled in a particular interval. Core changes: - Backfill should be capable of handling multiple backfill peers concurrently even for replicated pgs (easier to test for now) - Backfill peers should not be placed in the acting set. PGBackend interfaces: - choose_backfill(): allows the implementation to determine which OSDs should be backfilled in a particular interval.