ceph/doc/dev/osd_internals/map_message_handling.rst

===========================
Map and PG Message handling
===========================

Overview
--------
The OSD handles routing incoming messages to PGs, creating the PG if necessary
in come cases.

PG messages generally come in two varieties:

  1. Peering Messages
  2. Ops/SubOps

There are several ways in which a message might be dropped or delayed.  It is
important that the message delaying does not result in a violation of certain
message ordering requirements on the way to the relevant PG handling logic:

  1. Ops referring to the same object must not be reordered.
  2. Peering messages must not be reordered.
  3. Subops must not be reordered.

MOSDMap
-------
MOSDMap messages may come from either monitors or other OSDs.  Upon receipt, the
OSD must perform several tasks:

  1. Persist the new maps to the filestore.
     Several PG operations rely on having access to maps dating back to the last
     time the PG was clean.
  2. Update and persist the superblock.
  3. Update OSD state related to the current map.
  4. Expose new maps to PG processes via *OSDService*.
  5. Remove PGs due to pool removal.
  6. Queue dummy events to trigger PG map catchup.

Each PG asynchronously catches up to the currently published map during
process_peering_events before processing the event.  As a result, different
PGs may have different views as to the "current" map.

One consequence of this design is that messages containing submessages from
multiple PGs (MOSDPGInfo, MOSDPGQuery, MOSDPGNotify) must tag each submessage
with the PG's epoch as well as tagging the message as a whole with the OSD's
current published epoch.

MOSDPGOp/MOSDPGSubOp
--------------------
See OSD::dispatch_op, OSD::handle_op, OSD::handle_sub_op

MOSDPGOps are used by clients to initiate rados operations. MOSDSubOps are used
between OSDs to coordinate most non peering activities including replicating
MOSDPGOp operations.

OSD::require_same_or_newer map checks that the current OSDMap is at least
as new as the map epoch indicated on the message.  If not, the message is
queued in OSD::waiting_for_osdmap via OSD::wait_for_new_map.  Note, this
cannot violate the above conditions since any two messages will be queued
in order of receipt and if a message is recieved with epoch e0, a later message
from the same source must be at epoch at least e0.  Note that two PGs from
the same OSD count for these purposes as different sources for single PG
messages.  That is, messages from different PGs may be reordered.


MOSDPGOps follow the following process:

  1. OSD::handle_op: validates permissions and crush mapping.
     See OSDService::handle_misdirected_op
     See OSD::op_has_sufficient_caps
     See OSD::require_same_or_newer_map
  2. OSD::enqueue_op

MOSDSubOps follow the following process:

  1. OSD::handle_sub_op checks that sender is an OSD
  2. OSD::enqueue_op

OSD::enqueue_op calls PG::queue_op which checks can_discard_request before
queueing the op in the op_queue and the PG in the OpWQ.  Note, a single PG
may be in the op queue multiple times for multiple ops.

dequeue_op is then eventually called on the PG.  At this time, the op is popped
off of op_queue and passed to PG::do_request, which checks that the PG map is
new enough (must_delay_op) and then processes the request.

In summary, the possible ways that an op may wait or be discarded in are:

  1. Wait in waiting_for_osdmap due to OSD::require_same_or_newer_map from
     OSD::handle_*.
  2. Discarded in OSD::can_discard_op at enqueue_op.
  3. Wait in PG::op_waiters due to PG::must_delay_request in PG::do_request.
  4. Wait in PG::waiting_for_active in due_request due to !flushed.
  5. Wait in PG::waiting_for_active due to !active() in do_op/do_sub_op.
  6. Wait in PG::waiting_for_(degraded|missing) in do_op.
  7. Wait in PG::waiting_for_active due to scrub_block_writes in do_op

TODO: The above is not a complete list.

Peering Messages
----------------
See OSD::handle_pg_(notify|info|log|query)

Peering messages are tagged with two epochs:

  1. epoch_sent: map epoch at which the message was sent
  2. query_epoch: map epoch at which the message triggering the message was sent

These are the same in cases where there was no triggering message.  We discard
a peering message if the message's query_epoch if the PG in question has entered
a new epoch (See PG::old_peering_event, PG::queue_peering_event).  Notifies,
infos, notifies, and logs are all handled as PG::RecoveryMachine events and
are wrapped by PG::queue_* by PG::CephPeeringEvts, which include the created
state machine event along with epoch_sent and query_epoch in order to
generically check PG::old_peering_message upon insertion and removal from the
queue.

Note, notifies, logs, and infos can trigger the creation of a PG.  See
OSD::get_or_create_pg.