mirror of https://github.com/ceph/ceph
synced 2025-02-21 18:17:42 +00:00

doc: fix typo and grammar

Signed-off-by: Pulkit Mittal <2pulkit2@gmail.com>

This commit is contained in: parent 84be9581dc, commit 8f84f121e1
@@ -2,18 +2,18 @@
 ECBackend Implementation Strategy
 =================================
 
-Misc initial design notes
-=========================
+Miscellaneous initial design notes
+==================================
 
-The initial (and still true for ec pools without the hacky ec
-overwrites debug flag enabled) design for ec pools restricted
-EC pools to operations which can be easily rolled back:
+The initial (and still true for EC pools without the hacky EC
+overwrites debug flag enabled) design for EC pools restricted
+EC pools to operations that can be easily rolled back:
 
 - CEPH_OSD_OP_APPEND: We can roll back an append locally by
   including the previous object size as part of the PG log event.
 - CEPH_OSD_OP_DELETE: The possibility of rolling back a delete
   requires that we retain the deleted object until all replicas have
-  persisted the deletion event. ErasureCoded backend will therefore
+  persisted in the deletion event. Erasure Coded backend will therefore
   need to store objects with the version at which they were created
   included in the key provided to the filestore. Old versions of an
   object can be pruned when all replicas have committed up to the log
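The CEPH_OSD_OP_APPEND rollback described in this hunk can be sketched as follows. This is an illustrative model, not the actual ECBackend code; the `Object`, `AppendEvent`, and function names are hypothetical:

```cpp
#include <cstdint>
#include <string>

// Hypothetical model of the rollback scheme: each append records the
// previous object size in its PG log event, so rolling back an append
// is just a local truncate to that recorded size.
struct Object {
    std::string data;
};

struct AppendEvent {
    uint64_t prev_size;  // object size before the append was applied
};

AppendEvent do_append(Object& obj, const std::string& payload) {
    AppendEvent e{obj.data.size()};
    obj.data += payload;
    return e;
}

void rollback_append(Object& obj, const AppendEvent& e) {
    // Undo the append locally by truncating to the recorded size.
    obj.data.resize(e.prev_size);
}
```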
@@ -30,7 +30,7 @@ PGTemp and Crush
 
 Primaries are able to request a temp acting set mapping in order to
 allow an up-to-date OSD to serve requests while a new primary is
-backfilled (and for other reasons). An erasure coded pg needs to be
+backfilled (and for other reasons). An erasure coded PG needs to be
 able to designate a primary for these reasons without putting it in
 the first position of the acting set. It also needs to be able to
 leave holes in the requested acting set.
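The acting-set requirements in this hunk (a primary that need not sit in position 0, plus holes in the set) can be sketched as below. The struct is illustrative; in real Ceph a hole is expressed with CRUSH_ITEM_NONE rather than this `NO_OSD` stand-in:

```cpp
#include <vector>

// Illustrative EC acting set: position i holds the OSD serving shard i,
// NO_OSD marks a hole (no OSD mapped for that shard), and the primary
// is designated explicitly rather than implied by position 0.
constexpr int NO_OSD = -1;  // stand-in for CRUSH_ITEM_NONE

struct ActingSet {
    std::vector<int> osds;  // indexed by shard id
    int primary_shard;      // may be any position, not just 0

    int primary_osd() const { return osds[primary_shard]; }

    bool has_hole() const {
        for (int osd : osds)
            if (osd == NO_OSD) return true;
        return false;
    }
};
```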
@@ -54,38 +54,38 @@ acting set have different pieces of the erasure coding scheme and are
 not interchangeable. Worse, crush might cause chunk 2 to be written
 to an OSD which happens already to contain an (old) copy of chunk 4.
 This means that the OSD and PG messages need to work in terms of a
-type like pair<shard_t, pg_t> in order to distinguish different pg
+type like pair<shard_t, pg_t> in order to distinguish different PG
 chunks on a single OSD.
 
-Because the mapping of object name to object in the filestore must
+Because the mapping of an object name to object in the filestore must
 be 1-to-1, we must ensure that the objects in chunk 2 and the objects
-in chunk 4 have different names. To that end, the objectstore must
+in chunk 4 have different names. To that end, the object store must
 include the chunk id in the object key.
 
 Core changes:
 
-- The objectstore `ghobject_t needs to also include a chunk id
+- The object store `ghobject_t needs to also include a chunk id
   <https://github.com/ceph/ceph/blob/firefly/src/common/hobject.h#L241>`_ making it more like
   tuple<hobject_t, gen_t, shard_t>.
 - coll_t needs to include a shard_t.
-- The OSD pg_map and similar pg mappings need to work in terms of a
+- The OSD pg_map and similar PG mappings need to work in terms of a
   spg_t (essentially
   pair<pg_t, shard_t>). Similarly, pg->pg messages need to include
   a shard_t
 - For client->PG messages, the OSD will need a way to know which PG
   chunk should get the message since the OSD may contain both a
-  primary and non-primary chunk for the same pg
+  primary and non-primary chunk for the same PG
 
 Object Classes
 --------------
 
-Reads from object classes will return ENOTSUP on ec pools by invoking
+Reads from object classes will return ENOTSUP on EC pools by invoking
 a special SYNC read.
 
 Scrub
 -----
 
-The main catch, however, for ec pools is that sending a crc32 of the
+The main catch, however, for EC pools is that sending a crc32 of the
 stored chunk on a replica isn't particularly helpful since the chunks
 on different replicas presumably store different data. Because we
 don't support overwrites except via DELETE, however, we have the
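The spg_t keying this hunk motivates can be sketched as a map keyed by (pg, shard), which lets one OSD hold several chunks of the same PG under distinct keys. The type aliases below are simplified stand-ins for Ceph's pg_t/shard_t/spg_t, not the real definitions:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Simplified stand-ins for Ceph's pg_t and shard_t.
using pg_t = uint64_t;
using shard_t = int8_t;

// spg_t: identifies one chunk of one PG, essentially pair<pg_t, shard_t>.
using spg_t = std::pair<pg_t, shard_t>;

// One OSD may hold both chunk 2 and an (old) chunk 4 of the same PG;
// keying the OSD's pg map by spg_t keeps the two distinct.
using PgMap = std::map<spg_t, std::string>;
```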
@@ -116,7 +116,7 @@ multiple stripes of a single object. There must be code that
 tessellates the application level write into a set of per-stripe write
 operations -- some whole-stripes and up to two partial
 stripes. Without loss of generality, for the remainder of this
-document we will focus exclusively on writing a single stripe (whole
+document, we will focus exclusively on writing a single stripe (whole
 or partial). We will use the symbol "W" to represent the number of
 blocks within a stripe that are being written, i.e., W <= K.
 
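The tessellation step this hunk describes can be sketched as below: an application-level write is split into at most two partial stripes plus whole stripes in between. The `StripeWrite` type and `tessellate` name are illustrative, not the actual ECBackend interfaces:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One per-stripe write produced from an application-level write.
struct StripeWrite {
    uint64_t stripe_no;  // which stripe of the object
    uint64_t offset;     // offset within that stripe
    uint64_t len;        // bytes written within that stripe
};

// Split [off, off + len) into per-stripe writes: up to two partial
// stripes (first and last) and whole stripes in between.
std::vector<StripeWrite> tessellate(uint64_t off, uint64_t len,
                                    uint64_t stripe_width) {
    std::vector<StripeWrite> out;
    uint64_t end = off + len;
    while (off < end) {
        uint64_t stripe_no = off / stripe_width;
        uint64_t in_stripe = off % stripe_width;
        uint64_t n = std::min(stripe_width - in_stripe, end - off);
        out.push_back({stripe_no, in_stripe, n});
        off += n;
    }
    return out;
}
```

A write of 300 bytes at offset 100 with a 128-byte stripe width yields one partial first stripe, two whole stripes, and one partial last stripe.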
@@ -125,13 +125,13 @@ choice of which of the three data flows to choose is based on the size
 of the write operation and the arithmetic properties of the selected
 parity-generation algorithm.
 
-(1) whole stripe is written/overwritten
-(2) a read-modify-write operation is performed.
+(1) Whole stripe is written/overwritten
+(2) A read-modify-write operation is performed.
 
 WHOLE STRIPE WRITE
 ------------------
 
-This is the simple case, and is already performed in the existing code
+This is a simple case, and is already performed in the existing code
 (for appends, that is). The primary receives all of the data for the
 stripe in the RADOS request, computes the appropriate parity blocks
 and send the data and parity blocks to their destination shards which
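The parity computation in the whole-stripe path can be sketched as below. A real EC pool uses the configured plugin (e.g. jerasure Reed-Solomon) and produces M coding blocks; a single XOR parity block here is only a stand-in to show the primary computing parity from the K data blocks before fanning everything out to the shards:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Block = std::vector<uint8_t>;

// Stand-in for the plugin's encode step: XOR the K data blocks of a
// stripe into one parity block (real plugins compute M coding blocks
// with a proper erasure code).
Block compute_xor_parity(const std::vector<Block>& data_blocks) {
    Block parity(data_blocks.at(0).size(), 0);
    for (const Block& b : data_blocks)
        for (size_t i = 0; i < parity.size(); ++i)
            parity[i] ^= b[i];
    return parity;
}
```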
@@ -149,23 +149,23 @@ written. The RADOS operation is acknowledged.
 OSD Object Write and Consistency
 --------------------------------
 
-Regardless of the algorithm chosen above, writing of the data is a two
+Regardless of the algorithm chosen above, writing of the data is a two-
 phase process: commit and rollforward. The primary sends the log
 entries with the operation described (see
 osd_types.h:TransactionInfo::(LocalRollForward|LocalRollBack).
 In all cases, the "commit" is performed in place, possibly leaving some
 information required for a rollback in a write-aside object. The
 rollforward phase occurs once all acting set replicas have committed
-the commit (sorry, overloaded term) and removes the rollback information.
+the commit, it then removes the rollback information.
 
-In the case of overwrites of exsting stripes, the rollback information
+In the case of overwrites of existing stripes, the rollback information
 has the form of a sparse object containing the old values of the
 overwritten extents populated using clone_range. This is essentially
 a place-holder implementation, in real life, bluestore will have an
 efficient primitive for this.
 
 The rollforward part can be delayed since we report the operation as
-committed once all replicas have committed. Currently, whenever we
+committed once all replicas have been committed. Currently, whenever we
 send a write, we also indicate that all previously committed
 operations should be rolled forward (see
 ECBackend::try_reads_to_commit). If there aren't any in the pipeline
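The two-phase commit/rollforward flow in this hunk can be modeled as below: a hedged sketch, not the real ECBackend machinery, in which the write-aside object simply keeps the old value (as the clone_range placeholder does), and rollforward discards it once every acting-set replica has committed:

```cpp
#include <map>
#include <optional>
#include <string>

// Model of one in-flight EC write on a replica. Phase 1 ("commit")
// applies the write in place but keeps the old bytes in a write-aside
// object; phase 2 ("rollforward") removes that rollback information.
struct PendingWrite {
    std::string new_value;
    std::optional<std::string> rollback_info;  // old bytes, kept until rollforward
};

struct Replica {
    std::map<std::string, std::string> store;
    std::map<std::string, PendingWrite> pending;

    void commit(const std::string& oid, const std::string& val) {
        pending[oid] = {val, store[oid]};  // save the old value aside
        store[oid] = val;                  // commit is performed in place
    }
    void rollforward(const std::string& oid) {
        pending.erase(oid);                // drop the rollback information
    }
    void rollback(const std::string& oid) {
        store[oid] = *pending[oid].rollback_info;
        pending.erase(oid);
    }
};
```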
@@ -185,7 +185,7 @@ from the pipeline.
 
 See ExtentCache.h for a detailed explanation of how the cache
 states correspond to the higher level invariants about the conditions
-under which cuncurrent operations can refer to the same object.
+under which concurrent operations can refer to the same object.
 
 Pipeline
 --------
@@ -193,7 +193,7 @@ Pipeline
 Reading src/osd/ExtentCache.h should have given a good idea of how
 operations might overlap. There are several states involved in
 processing a write operation and an important invariant which
-isn't enforced by PrimaryLogPG at a higher level which need to be
+isn't enforced by PrimaryLogPG at a higher level which needs to be
 managed by ECBackend. The important invariant is that we can't
 have uncacheable and rmw operations running at the same time
 on the same object. For simplicity, we simply enforce that any
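The invariant this hunk states (uncacheable and rmw operations never run at the same time on the same object) can be sketched as a per-object gate. The names here are illustrative, not the actual ECBackend::waiting_* machinery:

```cpp
#include <map>
#include <string>
#include <utility>

enum class OpKind { RMW, UNCACHEABLE };

// Per-object gate enforcing the invariant: operations of the same kind
// may overlap on an object, but a conflicting kind must wait until the
// object drains.
struct ObjectGate {
    std::map<std::string, std::pair<OpKind, int>> running;  // oid -> (kind, count)

    bool try_start(const std::string& oid, OpKind k) {
        auto it = running.find(oid);
        if (it == running.end()) {
            running.emplace(oid, std::make_pair(k, 1));
            return true;
        }
        if (it->second.first != k)
            return false;  // conflicting kind: must wait
        ++it->second.second;
        return true;
    }

    void finish(const std::string& oid) {
        auto it = running.find(oid);
        if (--it->second.second == 0)
            running.erase(it);
    }
};
```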
@@ -204,4 +204,3 @@ There are improvements to be made here in the future.
 
 For more details, see ECBackend::waiting_* and
 ECBackend::try_<from>_to_<to>.
-