From 5d82a7706047bf4c5486dfd8d6872f81c4c505c2 Mon Sep 17 00:00:00 2001
From: Samuel Just
Date: Tue, 10 Jul 2012 17:52:21 -0700
Subject: [PATCH] doc/dev/osd_internals: OSD overview, pg removal, map/message
 handling

This is a start on some osd internals documentation for new developers.

Signed-off-by: Samuel Just
---
 .../osd_internals/map_message_handling.rst    | 112 ++++++++++++++++++
 doc/internals/osd_internals/osd_overview.rst  | 103 ++++++++++++++++
 doc/internals/osd_internals/pg.rst            |  31 +++++
 doc/internals/osd_internals/pg_removal.rst    |  40 +++++++
 4 files changed, 286 insertions(+)
 create mode 100644 doc/internals/osd_internals/map_message_handling.rst
 create mode 100644 doc/internals/osd_internals/osd_overview.rst
 create mode 100644 doc/internals/osd_internals/pg.rst
 create mode 100644 doc/internals/osd_internals/pg_removal.rst

diff --git a/doc/internals/osd_internals/map_message_handling.rst b/doc/internals/osd_internals/map_message_handling.rst
new file mode 100644
index 00000000000..60ffd0df38b
--- /dev/null
+++ b/doc/internals/osd_internals/map_message_handling.rst
@@ -0,0 +1,112 @@
+===========================
+Map and PG Message handling
+===========================
+
+Overview
+--------
+The OSD handles routing incoming messages to PGs, creating the PG if necessary
+in some cases.
+
+PG messages generally come in two varieties:
+ 1. Peering Messages
+ 2. Ops/SubOps
+
+There are several ways in which a message might be dropped or delayed. It is
+important that delaying a message does not violate certain message ordering
+requirements on the way to the relevant PG handling logic:
+ 1. Ops referring to the same object must not be reordered.
+ 2. Peering messages must not be reordered.
+ 3. Subops must not be reordered.
+
+MOSDMap
+-------
+MOSDMap messages may come from either monitors or other OSDs. Upon receipt, the
+OSD must perform several tasks:
+ 1. Persist the new maps to the filestore.
+    Several PG operations rely on having access to maps dating back to the last
+    time the PG was clean.
+ 2. Update and persist the superblock.
+ 3. Update OSD state related to the current map.
+ 4. Expose new maps to PG processes via *OSDService*.
+ 5. Remove PGs due to pool removal.
+ 6. Queue dummy events to trigger PG map catchup.
+
+Each PG asynchronously catches up to the currently published map during
+process_peering_events before processing the event. As a result, different
+PGs may have different views as to the "current" map.
+
+One consequence of this design is that messages containing submessages from
+multiple PGs (MOSDPGInfo, MOSDPGQuery, MOSDPGNotify) must tag each submessage
+with the PG's epoch as well as tagging the message as a whole with the OSD's
+current published epoch.
+
+MOSDPGOp/MOSDPGSubOp
+--------------------
+See OSD::dispatch_op, OSD::handle_op, OSD::handle_sub_op
+
+MOSDPGOps are used by clients to initiate rados operations. MOSDSubOps are used
+between OSDs to coordinate most non-peering activities, including replicating
+MOSDPGOp operations.
+
+OSD::require_same_or_newer_map checks that the current OSDMap is at least
+as new as the map epoch indicated on the message. If not, the message is
+queued in OSD::waiting_for_osdmap via OSD::wait_for_new_map. Note, this
+cannot violate the above conditions since any two messages will be queued
+in order of receipt, and if a message is received with epoch e0, a later
+message from the same source must be at an epoch of at least e0. Note that
+two PGs from the same OSD count for these purposes as different sources for
+single PG messages. That is, messages from different PGs may be reordered.
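+
+The following is a minimal, self-contained sketch of this epoch gating
+(hypothetical names such as FakeOSD and FakeMessage, not the actual OSD
+implementation)::
+
+  // Sketch only: queue a message whose epoch is newer than our map, and
+  // drain the queue in receipt order once the map catches up.
+  #include <cstdint>
+  #include <deque>
+  #include <iostream>
+  #include <string>
+
+  using epoch_t = uint32_t;
+
+  struct FakeMessage {
+    epoch_t epoch;        // map epoch at which the sender sent the message
+    std::string payload;
+  };
+
+  struct FakeOSD {
+    epoch_t current_epoch = 0;
+    std::deque<FakeMessage> waiting_for_osdmap;  // FIFO preserves receipt order
+
+    void process(const FakeMessage& m) {
+      std::cout << "processing '" << m.payload << "' (epoch " << m.epoch << ")\n";
+    }
+
+    // Roughly analogous to the require_same_or_newer_map check: handle the
+    // message only if our map is at least as new, otherwise queue it.
+    void handle(FakeMessage m) {
+      if (m.epoch > current_epoch) {
+        waiting_for_osdmap.push_back(std::move(m));
+        return;
+      }
+      process(m);
+    }
+
+    // When a new map is published, drain waiters in the order they arrived,
+    // so messages from a single source are never reordered.
+    void handle_new_map(epoch_t e) {
+      current_epoch = e;
+      while (!waiting_for_osdmap.empty() &&
+             waiting_for_osdmap.front().epoch <= current_epoch) {
+        FakeMessage m = std::move(waiting_for_osdmap.front());
+        waiting_for_osdmap.pop_front();
+        process(m);
+      }
+    }
+  };
+
+  int main() {
+    FakeOSD osd;
+    osd.current_epoch = 5;
+    osd.handle({5, "op A"});  // processed immediately
+    osd.handle({7, "op B"});  // queued: our map is too old
+    osd.handle({7, "op C"});  // queued behind B, preserving order
+    osd.handle_new_map(7);    // drains B, then C
+  }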
+
+MOSDPGOps are processed as follows:
+ 1. OSD::handle_op: validates permissions and crush mapping.
+    See OSDService::handle_misdirected_op
+    See OSD::op_has_sufficient_caps
+    See OSD::require_same_or_newer_map
+ 2. OSD::enqueue_op
+
+MOSDSubOps are processed as follows:
+ 1. OSD::handle_sub_op checks that the sender is an OSD.
+ 2. OSD::enqueue_op
+
+OSD::enqueue_op calls PG::queue_op, which checks can_discard_request before
+queueing the op in the op_queue and the PG in the OpWQ. Note, a single PG
+may be in the op queue multiple times for multiple ops.
+
+dequeue_op is then eventually called on the PG. At this time, the op is popped
+off of op_queue and passed to PG::do_request, which checks that the PG map is
+new enough (must_delay_op) and then processes the request.
+
+In summary, the possible ways in which an op may wait or be discarded are:
+ 1. Wait in waiting_for_osdmap due to OSD::require_same_or_newer_map from
+    OSD::handle_*.
+ 2. Discarded in OSD::can_discard_op at enqueue_op.
+ 3. Wait in PG::op_waiters due to PG::must_delay_request in PG::do_request.
+ 4. Wait in PG::waiting_for_active in do_request due to !flushed.
+ 5. Wait in PG::waiting_for_active due to !active() in do_op/do_sub_op.
+ 6. Wait in PG::waiting_for_(degraded|missing) in do_op.
+ 7. Wait in PG::waiting_for_active due to scrub_block_writes in do_op.
+
+TODO: The above is not a complete list.
+
+Peering Messages
+----------------
+See OSD::handle_pg_(notify|info|log|query)
+
+Peering messages are tagged with two epochs:
+ 1. epoch_sent: map epoch at which the message was sent
+ 2. query_epoch: map epoch at which the message that triggered this message
+    was sent
+
+These are the same in cases where there was no triggering message. We discard
+a peering message if the PG in question has entered a new epoch since the
+message's query_epoch (See PG::old_peering_event, PG::queue_peering_event).
+Notifies, infos, and logs are all handled as PG::RecoveryMachine events and
+are wrapped by PG::queue_* into PG::CephPeeringEvts, which include the created
+state machine event along with epoch_sent and query_epoch in order to
+generically check PG::old_peering_message upon insertion into and removal from
+the queue.
+
+Note, notifies, logs, and infos can trigger the creation of a PG. See
+OSD::get_or_create_pg.
+
+
diff --git a/doc/internals/osd_internals/osd_overview.rst b/doc/internals/osd_internals/osd_overview.rst
new file mode 100644
index 00000000000..5eb5fd71b78
--- /dev/null
+++ b/doc/internals/osd_internals/osd_overview.rst
@@ -0,0 +1,103 @@
+===
+OSD
+===
+
+Concepts
+--------
+
+*Messenger*
+  See src/msg/Messenger.h
+
+  Handles sending and receipt of messages on behalf of the OSD. The OSD uses
+  two messengers:
+   1. cluster_messenger - handles traffic to other OSDs and monitors
+   2. client_messenger - handles client traffic
+
+  This division allows the OSD to be configured with different interfaces for
+  client and cluster traffic.
+
+*Dispatcher*
+  See src/msg/Dispatcher.h
+
+  OSD implements the Dispatcher interface. Of particular note is ms_dispatch,
+  which serves as the entry point for messages received via either the client
+  or cluster messenger. Because there are two messengers, ms_dispatch may be
+  called from at least two threads. The osd_lock is always held during
+  ms_dispatch.
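+
+  A rough, self-contained sketch of this pattern (hypothetical FakeDispatcher
+  and FakeOSD classes, not the actual Messenger/Dispatcher interfaces)::
+
+    // Sketch only: one dispatcher fed by two messenger threads, with a lock
+    // held for the duration of each dispatch, as described above.
+    #include <functional>
+    #include <iostream>
+    #include <mutex>
+    #include <string>
+    #include <thread>
+
+    struct FakeDispatcher {
+      virtual ~FakeDispatcher() = default;
+      virtual bool ms_dispatch(const std::string& msg) = 0;
+    };
+
+    struct FakeOSD : FakeDispatcher {
+      std::mutex osd_lock;  // held across ms_dispatch, like the OSD's osd_lock
+
+      bool ms_dispatch(const std::string& msg) override {
+        std::lock_guard<std::mutex> l(osd_lock);
+        std::cout << "dispatching: " << msg << "\n";
+        return true;
+      }
+    };
+
+    // Stand-in for a messenger delivering received messages to its dispatcher.
+    void fake_messenger(FakeDispatcher& d, const std::string& who) {
+      for (int i = 0; i < 3; ++i)
+        d.ms_dispatch(who + " message " + std::to_string(i));
+    }
+
+    int main() {
+      FakeOSD osd;
+      // Both the cluster and client messengers deliver into the same
+      // dispatcher, so ms_dispatch runs in at least two threads.
+      std::thread cluster(fake_messenger, std::ref(osd), std::string("cluster"));
+      std::thread client(fake_messenger, std::ref(osd), std::string("client"));
+      cluster.join();
+      client.join();
+    }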
+
+*WorkQueue*
+  See src/common/WorkQueue.h
+
+  The WorkQueue class abstracts the process of queueing independent tasks
+  for asynchronous execution. Each OSD process contains workqueues for
+  distinct tasks:
+   1. OpWQ: handles ops (from clients) and subops (from other OSDs).
+      Runs in the op_tp threadpool.
+   2. PeeringWQ: handles peering tasks and pg map advancement.
+      Runs in the op_tp threadpool.
+      See Peering.
+   3. CommandWQ: handles commands (pg query, etc).
+      Runs in the command_tp threadpool.
+   4. RecoveryWQ: handles recovery tasks.
+      Runs in the recovery_tp threadpool.
+   5. SnapTrimWQ: handles snap trimming.
+      Runs in the disk_tp threadpool.
+      See SnapTrimmer.
+   6. ScrubWQ: handles the primary scrub path.
+      Runs in the disk_tp threadpool.
+      See Scrub.
+   7. ScrubFinalizeWQ: handles primary scrub finalization.
+      Runs in the disk_tp threadpool.
+      See Scrub.
+   8. RepScrubWQ: handles the replica scrub path.
+      Runs in the disk_tp threadpool.
+      See Scrub.
+   9. RemoveWQ: asynchronously removes old pg directories.
+      Runs in the disk_tp threadpool.
+      See PGRemoval.
+
+*ThreadPool*
+  See src/common/WorkQueue.h
+  See also above.
+
+  There are 4 OSD threadpools:
+   1. op_tp: handles ops and subops
+   2. recovery_tp: handles recovery tasks
+   3. disk_tp: handles disk-intensive tasks
+   4. command_tp: handles commands
+
+*OSDMap*
+  See src/osd/OSDMap.h
+
+  The crush algorithm takes two inputs: a picture of the cluster
+  with status information about which nodes are up/down and in/out,
+  and the pgid to place. The former is encapsulated by the OSDMap.
+  Maps are numbered by *epoch* (epoch_t). These maps are passed around
+  within the OSD as std::tr1::shared_ptr.
+
+  See MapHandling.
+
+*PG*
+  See src/osd/PG.* src/osd/ReplicatedPG.*
+
+  Objects in rados are hashed into *PGs*, and *PGs* are placed via crush onto
+  OSDs. The PG structure is responsible for handling requests pertaining to
+  a particular *PG*, as well as for maintaining relevant metadata and
+  controlling recovery.
+
+*OSDService*
+  See src/osd/OSD.cc OSDService
+
+  The OSDService acts as a broker between PG threads and OSD state, allowing
+  PGs to perform actions using OSD services such as workqueues and messengers.
+  This is still a work in progress. Future cleanups will focus on moving such
+  state entirely from the OSD into the OSDService.
+
+Overview
+--------
+  See src/ceph_osd.cc
+
+  The OSD process represents one leaf device in the crush hierarchy. There
+  might be one OSD process per physical machine, or more than one if, for
+  example, the user configures one OSD instance per disk.
+
diff --git a/doc/internals/osd_internals/pg.rst b/doc/internals/osd_internals/pg.rst
new file mode 100644
index 00000000000..2c2c572fa51
--- /dev/null
+++ b/doc/internals/osd_internals/pg.rst
@@ -0,0 +1,31 @@
+====
+PG
+====
+
+Concepts
+--------
+
+*Peering Interval*
+  See PG::start_peering_interval.
+  See PG::up_acting_affected.
+  See PG::RecoveryState::Reset.
+
+  A peering interval is a maximal set of contiguous map epochs in which the
+  up and acting sets did not change. PG::RecoveryMachine represents a
+  transition from one interval to another as passing through
+  RecoveryState::Reset. On a PG::RecoveryState::AdvMap event,
+  PG::up_acting_affected can cause the pg to transition to Reset.
+
+
+Peering Details and Gotchas
+---------------------------
+For an overview of peering, see Peering.
+
+ * PG::flushed defaults to false and is set to false in
+   PG::start_peering_interval. Upon transitioning to PG::RecoveryState::Started
+   we send a transaction through the pg op sequencer which, upon completion,
+   sends a FlushedEvt that sets flushed to true. The primary cannot go
+   active until this happens (see PG::RecoveryState::WaitFlushedPeering).
+   Replicas can go active but cannot serve ops (writes or reads).
+   This is necessary because we cannot read our on-disk state until unstable
+   transactions from the previous interval have cleared. This flush barrier
+   is sketched below.
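+
+The following is a minimal sketch of that flush barrier (hypothetical
+FakeSequencer/FakePG types, not the actual PG or ObjectStore interfaces)::
+
+  // Sketch only: a barrier transaction queued behind everything from the
+  // previous interval; its completion stands in for the FlushedEvt.
+  #include <functional>
+  #include <iostream>
+  #include <queue>
+
+  struct FakeSequencer {
+    // Completions fire strictly in the order transactions were queued.
+    std::queue<std::function<void()>> on_complete;
+
+    void queue_transaction(std::function<void()> cb) {
+      on_complete.push(std::move(cb));
+    }
+
+    void complete_next() {
+      on_complete.front()();
+      on_complete.pop();
+    }
+  };
+
+  struct FakePG {
+    bool flushed = false;  // stands in for PG::flushed
+
+    bool primary_may_go_active() const { return flushed; }
+
+    // Queue the barrier on entering Started: once it completes, every
+    // transaction queued before it (i.e. from the previous interval) has
+    // also completed, so reading on-disk state is safe.
+    void start_flush(FakeSequencer& osr) {
+      flushed = false;
+      osr.queue_transaction([this] {
+        flushed = true;  // stands in for delivering a FlushedEvt
+        std::cout << "flush barrier completed\n";
+      });
+    }
+  };
+
+  int main() {
+    FakeSequencer osr;
+    osr.queue_transaction([] { std::cout << "unstable txn from old interval\n"; });
+
+    FakePG pg;
+    pg.start_flush(osr);
+    std::cout << "primary may go active? " << pg.primary_may_go_active() << "\n";
+    osr.complete_next();  // old-interval transaction clears first
+    osr.complete_next();  // then the barrier fires the FlushedEvt stand-in
+    std::cout << "primary may go active? " << pg.primary_may_go_active() << "\n";
+  }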
diff --git a/doc/internals/osd_internals/pg_removal.rst b/doc/internals/osd_internals/pg_removal.rst
new file mode 100644
index 00000000000..5246853cfef
--- /dev/null
+++ b/doc/internals/osd_internals/pg_removal.rst
@@ -0,0 +1,40 @@
+==========
+PG Removal
+==========
+
+See OSD::_remove_pg, OSD::RemoveWQ
+
+There are two ways for a pg to be removed from an OSD:
+ 1. MOSDPGRemove from the primary
+ 2. OSD::advance_map finds that the pool has been removed
+
+In either case, our general strategy for removing the pg is to atomically
+remove the metadata objects (pg->log_oid, pg->biginfo_oid) and rename the pg
+collections (temp, HEAD, and snap collections) into removal collections
+(see OSD::get_next_removal_coll). Those collections are then asynchronously
+removed. We do not do this inline because scanning the collections to remove
+the objects is an expensive operation. Atomically moving the directories out
+of the way allows us to proceed as if the pg were fully removed, except that we
+cannot rewrite any of the objects contained in the removal directories until
+they have been fully removed. PGs partition the object space, so the only case
+we need to worry about is the same pg being recreated before we have finished
+removing the objects from the old one.
+
+OSDService::deleting_pgs tracks all pgs in the process of being deleted. Each
+DeletingState object in deleting_pgs lives while at least one reference to it
+remains. Each item in RemoveWQ carries a reference to the DeletingState for
+the relevant pg, such that deleting_pgs.lookup(pgid) will return a null ref
+only if there are no collections currently being deleted for that pg.
+DeletingState allows you to register a callback to be called when the deletion
+is finally complete. See PG::start_flush. We use this mechanism to prevent
+the pg from being "flushed" until any pending deletes are complete. Metadata
+operations are safe since we did remove the old metadata objects and we
+inherit the osr from the previous copy of the pg.
+
+Similarly, OSD::osr_registry ensures that the OpSequencers for those pgs can
+be reused by a new pg created before the old one is fully removed, ensuring
+that operations on the new pg are sequenced properly with respect to operations
+on the old one.
+
+OSD::load_pgs() rebuilds deleting_pgs and osr_registry as it scans the
+collections and finds old removal collections that have not yet been removed.
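+
+As a rough illustration of this reference-counted tracking (hypothetical
+FakeDeletingState/FakeDeletingPGs types, not the actual OSDService code)::
+
+  // Sketch only: work items hold the only strong references; a lookup
+  // returns a null ref once nothing is still being removed, and completion
+  // callbacks fire when the last reference is dropped.
+  #include <functional>
+  #include <iostream>
+  #include <map>
+  #include <memory>
+  #include <vector>
+
+  using pg_t = int;  // stand-in for the real pg id type
+
+  struct FakeDeletingState {
+    pg_t pgid;
+    std::vector<std::function<void()>> on_complete;
+
+    explicit FakeDeletingState(pg_t p) : pgid(p) {}
+
+    // Callers (PG::start_flush in the real code) can ask to be notified
+    // once the old collections are fully gone.
+    void register_on_delete(std::function<void()> cb) {
+      on_complete.push_back(std::move(cb));
+    }
+
+    ~FakeDeletingState() {  // last reference dropped => deletion finished
+      for (auto& cb : on_complete) cb();
+    }
+  };
+
+  struct FakeDeletingPGs {
+    std::map<pg_t, std::weak_ptr<FakeDeletingState>> in_progress;
+
+    std::shared_ptr<FakeDeletingState> start(pg_t pgid) {
+      auto ds = std::make_shared<FakeDeletingState>(pgid);
+      in_progress[pgid] = ds;
+      return ds;  // handed to the RemoveWQ work item
+    }
+
+    std::shared_ptr<FakeDeletingState> lookup(pg_t pgid) {
+      auto it = in_progress.find(pgid);
+      if (it == in_progress.end())
+        return nullptr;
+      return it->second.lock();  // null once the last reference is gone
+    }
+  };
+
+  int main() {
+    FakeDeletingPGs deleting_pgs;
+    {
+      // The RemoveWQ item holds the only strong reference while it works.
+      auto work_item_ref = deleting_pgs.start(1);
+      deleting_pgs.lookup(1)->register_on_delete(
+          [] { std::cout << "pg 1 removal complete; flush may proceed\n"; });
+      std::cout << "deleting? " << (deleting_pgs.lookup(1) != nullptr) << "\n";
+    }  // work item finishes: reference dropped, callback fires
+    std::cout << "deleting? " << (deleting_pgs.lookup(1) != nullptr) << "\n";
+  }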