ceph/branches/sage/mds/doc/osd_replication.txt



SOME GENERAL REQUIREMENTS

- cluster expansion:
  - any or all of the replicas may move to new OSDs.

- cluster map may change frequently
  - map change should translate into pending replication/migration
    state quickly (or better yet, instantly), so that we could push
    through a series of (say, botched) maps quickly and be fine, so long
    as the final map is correct.

- ideally, unordered osd<->osd, client<->osd communication
  (mds<->mds, client<->mds communication is ordered, but file i/o
  would be too slow that way?)


PRIMARY ONLY PICTURE

let's completely ignore replication for a while, and see how
complicated the picture needs to be to reliably support cluster expansion.

typedef __uint64_t version_t;


per-Object metadata:
- version #.  incremented when an object is modified.
   e.g. version_t version;
- on primary, keep list of stray replicas
   e.g. map<int,version_t> stray_replicas;  // osds w/ stray replicas
  includes old primary osd(s), until deletion is confirmed.  used while rg
  is importing.


per-RG metadata
- object list.  well, a method to fetch it by querying a collection or whatever.
- negative <object,version> list
   e.g. map<object_t, version_t> deleted_objects;
  - used to enumerate deleted objects, when in "importing" state.
- a RG "state" (enum/int)


Normal RG state:
- role=primary
    clean        - i am primary, all is well.  no stray copies.  i can
                   discard my negative object list, since my local
                   object store tells me everything.


After a map change:
- new primary
    undef        - initially; i don't know RG exists.
- old primary
    homeless     - i was primary, still have unmolested data.  new primary is not yet migrating
                   (presumably it's state=undef.)  i need to contact new primary and tell them
                   this RG exists.

- new primary
    importing    - i am migrating data from old primary.  keep negative dir entries for deletions.
                   write locally.  proxy reads (force immediate migration).  do whole objects
                   initially (on write, block until i migrate the object).  later we can do
                   sub-object state (where "live" object data is spread across new/old primaries).
- old primary
    exporting    - primary is migrating my data.
    undef        - when it finishes.  (i will forget this RG existed.)
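
a minimal sketch of the single-map-change transitions above, written as
a tiny state machine.  the ST_* and EV_* names (and the events
themselves) are hypothetical; the text only names the states.
scenarios 2 and 3 are not covered.

```cpp
#include <cassert>

// states from the text above.  names and values are hypothetical.
enum rg_state_t { ST_UNDEF, ST_CLEAN, ST_IMPORTING, ST_HOMELESS, ST_EXPORTING };

// events are assumptions of this sketch; the text does not name them.
enum rg_event_t {
  EV_LOST_PRIMARY,   // map change: i am no longer primary
  EV_NOTIFIED,       // old primary told me this RG exists
  EV_QUERIED,        // new primary contacted me and began migrating
  EV_IMPORT_DONE,    // i fetched everything; sources are cleaned up
  EV_EXPORT_DONE     // new primary told me to forget the RG
};

// return the next state; unknown (state, event) pairs leave it unchanged.
rg_state_t rg_next_state(rg_state_t s, rg_event_t ev) {
  if (s == ST_CLEAN     && ev == EV_LOST_PRIMARY) return ST_HOMELESS;
  if (s == ST_UNDEF     && ev == EV_NOTIFIED)     return ST_IMPORTING;
  if (s == ST_HOMELESS  && ev == EV_QUERIED)      return ST_EXPORTING;
  if (s == ST_IMPORTING && ev == EV_IMPORT_DONE)  return ST_CLEAN;
  if (s == ST_EXPORTING && ev == EV_EXPORT_DONE)  return ST_UNDEF;
  return s;
}

// walk both participants through one map change, in the order described.
void rg_walk_single_map_change() {
  rg_state_t old_primary = ST_CLEAN;
  rg_state_t new_primary = ST_UNDEF;

  old_primary = rg_next_state(old_primary, EV_LOST_PRIMARY);
  assert(old_primary == ST_HOMELESS);

  new_primary = rg_next_state(new_primary, EV_NOTIFIED);
  assert(new_primary == ST_IMPORTING);

  old_primary = rg_next_state(old_primary, EV_QUERIED);
  assert(old_primary == ST_EXPORTING);

  new_primary = rg_next_state(new_primary, EV_IMPORT_DONE);
  old_primary = rg_next_state(old_primary, EV_EXPORT_DONE);
  assert(new_primary == ST_CLEAN && old_primary == ST_UNDEF);
}
```

keeping the transition table in one pure function makes it easy to see
which (state, event) pairs are still undecided, e.g. the "??" case in
scenario 2.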


After a second map change (scenario 1):
 as above, if we were clean again.

After a second map change (scenario 2):
 we weren't clean yet.
- new primary
    undef        - initially (until i learn RG exists)
- old primary
    importing    - i'm still migrating from old old primary
- old old primary
    exporting    - ...
- old primary
??  importing+exporting - proxy reads as before.  continue migrating from old old primary.


After a second map change (scenario 3):
 we weren't clean yet, and old old primary is also new primary
- new primary (and old old primary)
    exporting    - change state to importing.  be sure to compare object versions, and neg dir
                   entries (as we always should do, really!).
- old primary
    importing    - note that the old import source matches new primary, and change
                   state to exporting, and stop importing. (unlike scenario 2)

-> this approach could mean that a series of fast map changes could
   force data to migrate down a "chain" of old primaries to reach the
   new one.  maybe old primary should go from importing -> exporting,
   and pass along old old primary id to new primary such that the
   import is a many-to-one thing, instead of one-to-one.  version
   numbers and neg entries will make it easy to pick out correct versions.
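
a rough sketch of how a many-to-one import might pick the correct
version of one object, using only version numbers and negative entries.
SourceView, Resolved and resolve_object are hypothetical names, and
object_t is stood in for by a string.  one assumption is labeled in the
code: when a live version and a deleted version tie, the deletion wins
(the text only says what happens when the object is strictly older).

```cpp
#include <cassert>
#include <map>
#include <stdint.h>
#include <string>

typedef uint64_t    version_t;
typedef std::string object_t;   // assumption: stand-in for the real id type

// hypothetical: what one import source (old primary) reports for an RG.
struct SourceView {
  std::map<object_t, version_t> objects;          // live objects
  std::map<object_t, version_t> deleted_objects;  // negative entries
};

struct Resolved {
  bool      exists;   // false if the newest knowledge is a deletion (or nothing)
  int       source;   // osd holding the winning copy, -1 if none
  version_t version;  // winning live or deleted version, 0 if unknown
};

Resolved resolve_object(const std::map<int, SourceView>& sources,
                        const object_t& oid) {
  bool have_live = false, have_del = false;
  version_t live_v = 0, del_v = 0;
  int live_src = -1;

  typedef std::map<object_t, version_t>::const_iterator ov_iter;
  for (std::map<int, SourceView>::const_iterator s = sources.begin();
       s != sources.end(); ++s) {
    // newest live copy across all sources
    ov_iter o = s->second.objects.find(oid);
    if (o != s->second.objects.end() && (!have_live || o->second > live_v)) {
      have_live = true; live_v = o->second; live_src = s->first;
    }
    // newest negative entry across all sources
    ov_iter d = s->second.deleted_objects.find(oid);
    if (d != s->second.deleted_objects.end() && (!have_del || d->second > del_v)) {
      have_del = true; del_v = d->second;
    }
  }

  Resolved r;
  // assumption: on a tie the deletion wins; a strictly newer object
  // supersedes the negative entry.
  r.exists  = have_live && (!have_del || live_v > del_v);
  r.source  = r.exists ? live_src : -1;
  r.version = r.exists ? live_v : (have_del ? del_v : 0);
  assert(!r.exists || r.source >= 0);
  return r;
}
```

with this shape the new primary never needs to know the order of the
chain of old primaries; it only compares versions.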


For the importing process on a given RG:

- metadata for each source
  - each source has a state:
    'starting'  - don't know anything about source yet.  query source!
                  this probably induces the source to change from
                  'homeless' or something similar to 'exporting'.
    'importing' - i've fetched the source's object list (and neg
                  object list).  i'm busy reading them!  these lists
                  will shrink as the process continues.  after i fetch
                  an object, i will erase it from the source.
                  (object metadata will include stray copy info
                  until i confirm that it's removed.)
    'finishing' - i've read all my data, and i'm telling the old person
                  to discard any remaining RG metadata (RG contents
                  should already be gone)
  - unmigrated object list
  - migrated but not deleted object list
    - stray osd is also listed in per-object MD during this stage
  - negative object list
    - i can remove these items if i see a newer object version (say,
      from a different import source or something).
    - i can remove any local objects or ignore imported ones if they
      are older than the deleted version

- the lists should be sets or otherwise queryable so that while i'm
  importing and a real op comes through I can quickly determine if a
  given object_id is pending migration etc or if my local store is to
  be trusted.
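
a sketch of the quick check described above: given an object id, decide
whether the local store can be trusted, whether the object is still
pending migration from some source, or whether it is known deleted.
lookup_t and classify_object are hypothetical names, object_t is stood
in for by a string, and the sketch assumes the negative list has already
been pruned of entries superseded by newer object versions.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <stdint.h>
#include <string>

typedef uint64_t    version_t;
typedef std::string object_t;   // assumption: stand-in for the real id type

// per-source lists, as described above.
struct RGExporter_t {
  std::set<object_t> remaining_objects;  // unmigrated
  std::set<object_t> stray_objects;      // migrated but not yet deleted on source
};

enum lookup_t {
  LOOKUP_LOCAL,    // local object store is authoritative
  LOOKUP_PENDING,  // still on a source; migrate it before serving the op
  LOOKUP_DELETED   // a negative entry says the object no longer exists
};

// on LOOKUP_PENDING, *source_osd is set to the osd to fetch from;
// otherwise it is set to -1.
lookup_t classify_object(const std::map<int, RGExporter_t>& exporters,
                         const std::map<object_t, version_t>& deleted_objects,
                         const object_t& oid,
                         int* source_osd) {
  assert(source_osd);  // caller must supply an out parameter
  *source_osd = -1;

  // negative entries first: a local deletion hides any unmigrated copy.
  if (deleted_objects.count(oid))
    return LOOKUP_DELETED;

  for (std::map<int, RGExporter_t>::const_iterator p = exporters.begin();
       p != exporters.end(); ++p) {
    if (p->second.remaining_objects.count(oid)) {
      *source_osd = p->first;
      return LOOKUP_PENDING;
    }
  }

  // not deleted, not pending (possibly a stray elsewhere): trust local store.
  return LOOKUP_LOCAL;
}
```

each lookup is a set/map query per source, so the cost stays small as
long as the number of import sources is small.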


SOME CODE BITS


#include <list>
#include <map>
#include <set>
using namespace std;

typedef __uint64_t version_t;
// object_t (the object id type) is assumed to be defined elsewhere.

class Object {
public:
  version_t            version;         // incremented when object is modified
  map<int, version_t>  stray_replicas;  // osds w/ stray replicas
};


// declared before ReplicaGroup, which holds these by value.
struct RGExporter_t {
  int                      import_state;       // starting, importing, finishing

  set<object_t>            remaining_objects;  // remote object list
  set<object_t>            stray_objects;      // imported but not deleted.

};


class ReplicaGroup {
public:
  int enumerate_objects(list<object_t>& ls);

  int                       state;

  // for unstable states,
  map<object_t, version_t>  deleted_objects;  // locally
  map<int, RGExporter_t>    exporters;        // importing from these guys.
};

// primary
#define RG_STATE_CLEAN      1
#define RG_STATE_IMPORTING  2  // pulling data

// non-primary
#define RG_STATE_HOMELESS   5  // old primary; new primary not yet
                               // notified; not yet exporting.
#define RG_STATE_EXPORTING  6  // a newer primary is extracting my
                               // data.
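
a sketch of how import_state might step through the per-source states
('starting', 'importing', 'finishing') described earlier.  the IMPORT_*
values and the got_* handler names are hypothetical, and object_t is
stood in for by a string; the real message handling is not specified in
these notes.

```cpp
#include <cassert>
#include <set>
#include <string>

typedef std::string object_t;   // assumption: stand-in for the real id type

// hypothetical values for RGExporter_t::import_state.
enum { IMPORT_STARTING = 1, IMPORT_IMPORTING = 2, IMPORT_FINISHING = 3 };

struct RGExporter_t {
  int                import_state;
  std::set<object_t> remaining_objects;  // remote object list
  std::set<object_t> stray_objects;      // imported but not deleted
  RGExporter_t() : import_state(IMPORT_STARTING) {}
};

// nothing left to fetch and no stray copies left: tell the source to
// discard its remaining RG metadata.
static void maybe_finish(RGExporter_t& ex) {
  if (ex.import_state == IMPORT_IMPORTING &&
      ex.remaining_objects.empty() && ex.stray_objects.empty())
    ex.import_state = IMPORT_FINISHING;
}

// the source answered our query with its object list.
void got_object_list(RGExporter_t& ex, const std::set<object_t>& ls) {
  assert(ex.import_state == IMPORT_STARTING);
  ex.remaining_objects = ls;
  ex.import_state = IMPORT_IMPORTING;
  maybe_finish(ex);  // an empty list finishes immediately
}

// we fetched oid; it is a stray copy on the source until deletion is confirmed.
void got_object(RGExporter_t& ex, const object_t& oid) {
  assert(ex.remaining_objects.count(oid));
  ex.remaining_objects.erase(oid);
  ex.stray_objects.insert(oid);
}

// the source confirmed it erased its copy of oid.
void got_delete_ack(RGExporter_t& ex, const object_t& oid) {
  ex.stray_objects.erase(oid);
  maybe_finish(ex);
}
```

the lists shrink as the import proceeds, so "both sets empty" is the
only condition needed to move on to 'finishing'.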


----
all crap from here on down


REPLICAS
-


OSD STATES
- primary, up to date.
- replica, up to date.

- primary, proxy to old primary (primaries?)

- replica, not up to date.


REPLICATION STUFF

Per-RG metadata
- primary
  - per-replica state: clean, catching up?
- replica

Per-object metadata
- primary and replica
  - version number/mtime
  - rg (reverse indexed)
- primary
  - replication level and state.
    - committed to memory and/or disk, on which replicas (#1, #2, etc.)
- replica


->