From 5d82a7706047bf4c5486dfd8d6872f81c4c505c2 Mon Sep 17 00:00:00 2001
From: Samuel Just
Date: Tue, 10 Jul 2012 17:52:21 -0700
Subject: [PATCH] doc/dev/osd_internals: OSD overview, pg removal, map/message
 handling

This is a start on some osd internals documentation for new developers.

Signed-off-by: Samuel Just
---
 .../osd_internals/map_message_handling.rst    | 112 ++++++++++++++++++
 doc/internals/osd_internals/osd_overview.rst  | 103 ++++++++++++++++
 doc/internals/osd_internals/pg.rst            |  31 +++++
 doc/internals/osd_internals/pg_removal.rst    |  40 +++++++
 4 files changed, 286 insertions(+)
 create mode 100644 doc/internals/osd_internals/map_message_handling.rst
 create mode 100644 doc/internals/osd_internals/osd_overview.rst
 create mode 100644 doc/internals/osd_internals/pg.rst
 create mode 100644 doc/internals/osd_internals/pg_removal.rst

diff --git a/doc/internals/osd_internals/map_message_handling.rst b/doc/internals/osd_internals/map_message_handling.rst
new file mode 100644
index 00000000000..60ffd0df38b
--- /dev/null
+++ b/doc/internals/osd_internals/map_message_handling.rst
@@ -0,0 +1,112 @@
+===========================
+Map and PG Message handling
+===========================
+
+Overview
+--------
+The OSD handles routing incoming messages to PGs, creating the PG if necessary
+in some cases.
+
+PG messages generally come in two varieties:
+ 1. Peering Messages
+ 2. Ops/SubOps
+
+There are several ways in which a message might be dropped or delayed. It is
+important that delaying a message does not violate certain message ordering
+requirements on the way to the relevant PG handling logic:
+ 1. Ops referring to the same object must not be reordered.
+ 2. Peering messages must not be reordered.
+ 3. Subops must not be reordered.
+
+MOSDMap
+-------
+MOSDMap messages may come from either monitors or other OSDs. Upon receipt, the
+OSD must perform several tasks:
+ 1. Persist the new maps to the filestore.
+    Several PG operations rely on having access to maps dating back to the last
+    time the PG was clean.
+ 2. Update and persist the superblock.
+ 3. Update OSD state related to the current map.
+ 4. Expose new maps to PG processes via *OSDService*.
+ 5. Remove PGs due to pool removal.
+ 6. Queue dummy events to trigger PG map catchup.
+
+Each PG asynchronously catches up to the currently published map during
+process_peering_events before processing the event. As a result, different
+PGs may have different views as to the "current" map.
+
+One consequence of this design is that messages containing submessages from
+multiple PGs (MOSDPGInfo, MOSDPGQuery, MOSDPGNotify) must tag each submessage
+with the PG's epoch as well as tagging the message as a whole with the OSD's
+current published epoch.
+
+MOSDPGOp/MOSDPGSubOp
+--------------------
+See OSD::dispatch_op, OSD::handle_op, OSD::handle_sub_op
+
+MOSDPGOps are used by clients to initiate rados operations. MOSDSubOps are used
+between OSDs to coordinate most non-peering activities, including replicating
+MOSDPGOp operations.
+
+OSD::require_same_or_newer_map checks that the current OSDMap is at least
+as new as the map epoch indicated on the message. If not, the message is
+queued in OSD::waiting_for_osdmap via OSD::wait_for_new_map. Note, this
+cannot violate the above conditions since any two messages will be queued
+in order of receipt, and if a message is received with epoch e0, a later
+message from the same source must be at an epoch of at least e0. Note that
+two PGs from the same OSD count for these purposes as different sources for
+single PG messages. That is, messages from different PGs may be reordered.
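+
+The following is a minimal, self-contained sketch of this epoch gating
+(hypothetical names such as FakeOSD and FakeMessage, not the actual OSD
+implementation)::
+
+  // Sketch only: queue a message whose epoch is newer than our map, and
+  // drain the queue in receipt order once the map catches up.
+  #include <cstdint>
+  #include <deque>
+  #include <iostream>
+  #include <string>
+
+  using epoch_t = uint32_t;
+
+  struct FakeMessage {
+    epoch_t epoch;        // map epoch at which the sender sent the message
+    std::string payload;
+  };
+
+  struct FakeOSD {
+    epoch_t current_epoch = 0;
+    std::deque<FakeMessage> waiting_for_osdmap;  // FIFO preserves receipt order
+
+    void process(const FakeMessage& m) {
+      std::cout << "processing '" << m.payload << "' (epoch " << m.epoch << ")\n";
+    }
+
+    // Roughly analogous to the require_same_or_newer_map check: handle the
+    // message only if our map is at least as new, otherwise queue it.
+    void handle(FakeMessage m) {
+      if (m.epoch > current_epoch) {
+        waiting_for_osdmap.push_back(std::move(m));
+        return;
+      }
+      process(m);
+    }
+
+    // When a new map is published, drain waiters in the order they arrived,
+    // so messages from a single source are never reordered.
+    void handle_new_map(epoch_t e) {
+      current_epoch = e;
+      while (!waiting_for_osdmap.empty() &&
+             waiting_for_osdmap.front().epoch <= current_epoch) {
+        FakeMessage m = std::move(waiting_for_osdmap.front());
+        waiting_for_osdmap.pop_front();
+        process(m);
+      }
+    }
+  };
+
+  int main() {
+    FakeOSD osd;
+    osd.current_epoch = 5;
+    osd.handle({5, "op A"});  // processed immediately
+    osd.handle({7, "op B"});  // queued: our map is too old
+    osd.handle({7, "op C"});  // queued behind B, preserving order
+    osd.handle_new_map(7);    // drains B, then C
+  }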
+
+MOSDPGOps are processed as follows:
+ 1. OSD::handle_op: validates permissions and crush mapping.
+    See OSDService::handle_misdirected_op
+    See OSD::op_has_sufficient_caps
+    See OSD::require_same_or_newer_map
+ 2. OSD::enqueue_op
+
+MOSDSubOps are processed as follows:
+ 1. OSD::handle_sub_op checks that the sender is an OSD.
+ 2. OSD::enqueue_op
+
+OSD::enqueue_op calls PG::queue_op, which checks can_discard_request before
+queueing the op in the op_queue and the PG in the OpWQ. Note, a single PG
+may be in the op queue multiple times for multiple ops.
+
+dequeue_op is then eventually called on the PG. At this time, the op is popped
+off of op_queue and passed to PG::do_request, which checks that the PG map is
+new enough (must_delay_op) and then processes the request.
+
+In summary, the possible ways in which an op may wait or be discarded are:
+ 1. Wait in waiting_for_osdmap due to OSD::require_same_or_newer_map from
+    OSD::handle_*.
+ 2. Discarded in OSD::can_discard_op at enqueue_op.
+ 3. Wait in PG::op_waiters due to PG::must_delay_request in PG::do_request.
+ 4. Wait in PG::waiting_for_active in do_request due to !flushed.
+ 5. Wait in PG::waiting_for_active due to !active() in do_op/do_sub_op.
+ 6. Wait in PG::waiting_for_(degraded|missing) in do_op.
+ 7. Wait in PG::waiting_for_active due to scrub_block_writes in do_op.
+
+TODO: The above is not a complete list.
+
+Peering Messages
+----------------
+See OSD::handle_pg_(notify|info|log|query)
+
+Peering messages are tagged with two epochs:
+ 1. epoch_sent: map epoch at which the message was sent
+ 2. query_epoch: map epoch at which the message that triggered this message
+    was sent
+
+These are the same in cases where there was no triggering message. We discard
+a peering message if the PG in question has entered a new epoch since the
+message's query_epoch (See PG::old_peering_event, PG::queue_peering_event).
+Notifies, infos, and logs are all handled as PG::RecoveryMachine events and
+are wrapped by PG::queue_* into PG::CephPeeringEvts, which include the created
+state machine event along with epoch_sent and query_epoch in order to
+generically check PG::old_peering_message upon insertion into and removal from
+the queue.
+
+Note, notifies, logs, and infos can trigger the creation of a PG. See
+OSD::get_or_create_pg.
+
+
diff --git a/doc/internals/osd_internals/osd_overview.rst b/doc/internals/osd_internals/osd_overview.rst
new file mode 100644
index 00000000000..5eb5fd71b78
--- /dev/null
+++ b/doc/internals/osd_internals/osd_overview.rst
@@ -0,0 +1,103 @@
+===
+OSD
+===
+
+Concepts
+--------
+
+*Messenger*
+  See src/msg/Messenger.h
+
+  Handles sending and receipt of messages on behalf of the OSD. The OSD uses
+  two messengers:
+   1. cluster_messenger - handles traffic to other OSDs and monitors
+   2. client_messenger - handles client traffic
+
+  This division allows the OSD to be configured with different interfaces for
+  client and cluster traffic.
+
+*Dispatcher*
+  See src/msg/Dispatcher.h
+
+  OSD implements the Dispatcher interface. Of particular note is ms_dispatch,
+  which serves as the entry point for messages received via either the client
+  or cluster messenger. Because there are two messengers, ms_dispatch may be
+  called from at least two threads. The osd_lock is always held during
+  ms_dispatch.
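+
+  A rough, self-contained sketch of this pattern (hypothetical FakeDispatcher
+  and FakeOSD classes, not the actual Messenger/Dispatcher interfaces)::
+
+    // Sketch only: one dispatcher fed by two messenger threads, with a lock
+    // held for the duration of each dispatch, as described above.
+    #include <functional>
+    #include <iostream>
+    #include <mutex>
+    #include <string>
+    #include <thread>
+
+    struct FakeDispatcher {
+      virtual ~FakeDispatcher() = default;
+      virtual bool ms_dispatch(const std::string& msg) = 0;
+    };
+
+    struct FakeOSD : FakeDispatcher {
+      std::mutex osd_lock;  // held across ms_dispatch, like the OSD's osd_lock
+
+      bool ms_dispatch(const std::string& msg) override {
+        std::lock_guard<std::mutex> l(osd_lock);
+        std::cout << "dispatching: " << msg << "\n";
+        return true;
+      }
+    };
+
+    // Stand-in for a messenger delivering received messages to its dispatcher.
+    void fake_messenger(FakeDispatcher& d, const std::string& who) {
+      for (int i = 0; i < 3; ++i)
+        d.ms_dispatch(who + " message " + std::to_string(i));
+    }
+
+    int main() {
+      FakeOSD osd;
+      // Both the cluster and client messengers deliver into the same
+      // dispatcher, so ms_dispatch runs in at least two threads.
+      std::thread cluster(fake_messenger, std::ref(osd), std::string("cluster"));
+      std::thread client(fake_messenger, std::ref(osd), std::string("client"));
+      cluster.join();
+      client.join();
+    }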
+
+*WorkQueue*
+  See src/common/WorkQueue.h
+
+  The WorkQueue class abstracts the process of queueing independent tasks
+  for asynchronous execution. Each OSD process contains workqueues for
+  distinct tasks:
+   1. OpWQ: handles ops (from clients) and subops (from other OSDs).
+      Runs in the op_tp threadpool.
+   2. PeeringWQ: handles peering tasks and pg map advancement.
+      Runs in the op_tp threadpool.
+      See Peering.
+   3. CommandWQ: handles commands (pg query, etc).
+      Runs in the command_tp threadpool.
+   4. RecoveryWQ: handles recovery tasks.
+      Runs in the recovery_tp threadpool.
+   5. SnapTrimWQ: handles snap trimming.
+      Runs in the disk_tp threadpool.
+      See SnapTrimmer.
+   6. ScrubWQ: handles the primary scrub path.
+      Runs in the disk_tp threadpool.
+      See Scrub.
+   7. ScrubFinalizeWQ: handles primary scrub finalization.
+      Runs in the disk_tp threadpool.
+      See Scrub.
+   8. RepScrubWQ: handles the replica scrub path.
+      Runs in the disk_tp threadpool.
+      See Scrub.
+   9. RemoveWQ: asynchronously removes old pg directories.
+      Runs in the disk_tp threadpool.
+      See PGRemoval.
+
+*ThreadPool*
+  See src/common/WorkQueue.h
+  See also above.
+
+  There are 4 OSD threadpools:
+   1. op_tp: handles ops and subops
+   2. recovery_tp: handles recovery tasks
+   3. disk_tp: handles disk-intensive tasks
+   4. command_tp: handles commands
+
+*OSDMap*
+  See src/osd/OSDMap.h
+
+  The crush algorithm takes two inputs: a picture of the cluster
+  with status information about which nodes are up/down and in/out,
+  and the pgid to place. The former is encapsulated by the OSDMap.
+  Maps are numbered by *epoch* (epoch_t). These maps are passed around
+  within the OSD as std::tr1::shared_ptr.
+
+  See MapHandling.
+
+*PG*
+  See src/osd/PG.* src/osd/ReplicatedPG.*
+
+  Objects in rados are hashed into *PGs*, and *PGs* are placed via crush onto
+  OSDs. The PG structure is responsible for handling requests pertaining to
+  a particular *PG*, as well as for maintaining relevant metadata and
+  controlling recovery.
+
+*OSDService*
+  See src/osd/OSD.cc OSDService
+
+  The OSDService acts as a broker between PG threads and OSD state, allowing
+  PGs to perform actions using OSD services such as workqueues and messengers.
+  This is still a work in progress. Future cleanups will focus on moving such
+  state entirely from the OSD into the OSDService.
+
+Overview
+--------
+  See src/ceph_osd.cc
+
+  The OSD process represents one leaf device in the crush hierarchy. There
+  might be one OSD process per physical machine, or more than one if, for
+  example, the user configures one OSD instance per disk.
+
diff --git a/doc/internals/osd_internals/pg.rst b/doc/internals/osd_internals/pg.rst
new file mode 100644
index 00000000000..2c2c572fa51
--- /dev/null
+++ b/doc/internals/osd_internals/pg.rst
@@ -0,0 +1,31 @@
+====
+PG
+====
+
+Concepts
+--------
+
+*Peering Interval*
+  See PG::start_peering_interval.
+  See PG::up_acting_affected.
+  See PG::RecoveryState::Reset.
+
+  A peering interval is a maximal set of contiguous map epochs in which the
+  up and acting sets did not change. PG::RecoveryMachine represents a
+  transition from one interval to another as passing through
+  RecoveryState::Reset. On a PG::RecoveryState::AdvMap event,
+  PG::up_acting_affected can cause the pg to transition to Reset.
+
+
+Peering Details and Gotchas
+---------------------------
+For an overview of peering, see Peering.
+
+ * PG::flushed defaults to false and is set to false in
+   PG::start_peering_interval. Upon transitioning to PG::RecoveryState::Started
+   we send a transaction through the pg op sequencer which, upon completion,
+   sends a FlushedEvt that sets flushed to true. The primary cannot go
+   active until this happens (see PG::RecoveryState::WaitFlushedPeering).
+   Replicas can go active but cannot serve ops (writes or reads).
+   This is necessary because we cannot read our on-disk state until unstable
+   transactions from the previous interval have cleared. This flush barrier
+   is sketched below.
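+
+The following is a minimal sketch of that flush barrier (hypothetical
+FakeSequencer/FakePG types, not the actual PG or ObjectStore interfaces)::
+
+  // Sketch only: a barrier transaction queued behind everything from the
+  // previous interval; its completion stands in for the FlushedEvt.
+  #include <functional>
+  #include <iostream>
+  #include <queue>
+
+  struct FakeSequencer {
+    // Completions fire strictly in the order transactions were queued.
+    std::queue<std::function<void()>> on_complete;
+
+    void queue_transaction(std::function<void()> cb) {
+      on_complete.push(std::move(cb));
+    }
+
+    void complete_next() {
+      on_complete.front()();
+      on_complete.pop();
+    }
+  };
+
+  struct FakePG {
+    bool flushed = false;  // stands in for PG::flushed
+
+    bool primary_may_go_active() const { return flushed; }
+
+    // Queue the barrier on entering Started: once it completes, every
+    // transaction queued before it (i.e. from the previous interval) has
+    // also completed, so reading on-disk state is safe.
+    void start_flush(FakeSequencer& osr) {
+      flushed = false;
+      osr.queue_transaction([this] {
+        flushed = true;  // stands in for delivering a FlushedEvt
+        std::cout << "flush barrier completed\n";
+      });
+    }
+  };
+
+  int main() {
+    FakeSequencer osr;
+    osr.queue_transaction([] { std::cout << "unstable txn from old interval\n"; });
+
+    FakePG pg;
+    pg.start_flush(osr);
+    std::cout << "primary may go active? " << pg.primary_may_go_active() << "\n";
+    osr.complete_next();  // old-interval transaction clears first
+    osr.complete_next();  // then the barrier fires the FlushedEvt stand-in
+    std::cout << "primary may go active? " << pg.primary_may_go_active() << "\n";
+  }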
diff --git a/doc/internals/osd_internals/pg_removal.rst b/doc/internals/osd_internals/pg_removal.rst
new file mode 100644
index 00000000000..5246853cfef
--- /dev/null
+++ b/doc/internals/osd_internals/pg_removal.rst
@@ -0,0 +1,40 @@
+==========
+PG Removal
+==========
+
+See OSD::_remove_pg, OSD::RemoveWQ
+
+There are two ways for a pg to be removed from an OSD:
+ 1. MOSDPGRemove from the primary
+ 2. OSD::advance_map finds that the pool has been removed
+
+In either case, our general strategy for removing the pg is to atomically
+remove the metadata objects (pg->log_oid, pg->biginfo_oid) and rename the pg
+collections (temp, HEAD, and snap collections) into removal collections
+(see OSD::get_next_removal_coll). Those collections are then asynchronously
+removed. We do not do this inline because scanning the collections to remove
+the objects is an expensive operation. Atomically moving the directories out
+of the way allows us to proceed as if the pg were fully removed, except that we
+cannot rewrite any of the objects contained in the removal directories until
+they have been fully removed. PGs partition the object space, so the only case
+we need to worry about is the same pg being recreated before we have finished
+removing the objects from the old one.
+
+OSDService::deleting_pgs tracks all pgs in the process of being deleted. Each
+DeletingState object in deleting_pgs lives while at least one reference to it
+remains. Each item in RemoveWQ carries a reference to the DeletingState for
+the relevant pg, such that deleting_pgs.lookup(pgid) will return a null ref
+only if there are no collections currently being deleted for that pg.
+DeletingState allows you to register a callback to be called when the deletion
+is finally complete. See PG::start_flush. We use this mechanism to prevent
+the pg from being "flushed" until any pending deletes are complete. Metadata
+operations are safe since we did remove the old metadata objects and we
+inherit the osr from the previous copy of the pg.
+
+Similarly, OSD::osr_registry ensures that the OpSequencers for those pgs can
+be reused by a new pg created before the old one is fully removed, ensuring
+that operations on the new pg are sequenced properly with respect to operations
+on the old one.
+
+OSD::load_pgs() rebuilds deleting_pgs and osr_registry as it scans the
+collections and finds old removal collections that have not yet been removed.
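+
+As a rough illustration of this reference-counted tracking (hypothetical
+FakeDeletingState/FakeDeletingPGs types, not the actual OSDService code)::
+
+  // Sketch only: work items hold the only strong references; a lookup
+  // returns a null ref once nothing is still being removed, and completion
+  // callbacks fire when the last reference is dropped.
+  #include <functional>
+  #include <iostream>
+  #include <map>
+  #include <memory>
+  #include <vector>
+
+  using pg_t = int;  // stand-in for the real pg id type
+
+  struct FakeDeletingState {
+    pg_t pgid;
+    std::vector<std::function<void()>> on_complete;
+
+    explicit FakeDeletingState(pg_t p) : pgid(p) {}
+
+    // Callers (PG::start_flush in the real code) can ask to be notified
+    // once the old collections are fully gone.
+    void register_on_delete(std::function<void()> cb) {
+      on_complete.push_back(std::move(cb));
+    }
+
+    ~FakeDeletingState() {  // last reference dropped => deletion finished
+      for (auto& cb : on_complete) cb();
+    }
+  };
+
+  struct FakeDeletingPGs {
+    std::map<pg_t, std::weak_ptr<FakeDeletingState>> in_progress;
+
+    std::shared_ptr<FakeDeletingState> start(pg_t pgid) {
+      auto ds = std::make_shared<FakeDeletingState>(pgid);
+      in_progress[pgid] = ds;
+      return ds;  // handed to the RemoveWQ work item
+    }
+
+    std::shared_ptr<FakeDeletingState> lookup(pg_t pgid) {
+      auto it = in_progress.find(pgid);
+      if (it == in_progress.end())
+        return nullptr;
+      return it->second.lock();  // null once the last reference is gone
+    }
+  };
+
+  int main() {
+    FakeDeletingPGs deleting_pgs;
+    {
+      // The RemoveWQ item holds the only strong reference while it works.
+      auto work_item_ref = deleting_pgs.start(1);
+      deleting_pgs.lookup(1)->register_on_delete(
+          [] { std::cout << "pg 1 removal complete; flush may proceed\n"; });
+      std::cout << "deleting? " << (deleting_pgs.lookup(1) != nullptr) << "\n";
+    }  // work item finishes: reference dropped, callback fires
+    std::cout << "deleting? " << (deleting_pgs.lookup(1) != nullptr) << "\n";
+  }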