mirror of https://github.com/ceph/ceph
120 lines
4.7 KiB
ReStructuredText
120 lines
4.7 KiB
ReStructuredText
===========================
|
|
Map and PG Message handling
|
|
===========================
|
|
|
|
Overview
|
|
--------
|
|
The OSD handles routing incoming messages to PGs, creating the PG if necessary
|
|
in come cases.
|
|
|
|
PG messages generally come in two varieties:
|
|
|
|
1. Peering Messages
|
|
2. Ops/SubOps
|
|
|
|
There are several ways in which a message might be dropped or delayed. It is
|
|
important that the message delaying does not result in a violation of certain
|
|
message ordering requirements on the way to the relevant PG handling logic:
|
|
|
|
1. Ops referring to the same object must not be reordered.
|
|
2. Peering messages must not be reordered.
|
|
3. Subops must not be reordered.
|
|
|
|
MOSDMap
|
|
-------
|
|
MOSDMap messages may come from either monitors or other OSDs. Upon receipt, the
|
|
OSD must perform several tasks:
|
|
|
|
1. Persist the new maps to the filestore.
|
|
Several PG operations rely on having access to maps dating back to the last
|
|
time the PG was clean.
|
|
2. Update and persist the superblock.
|
|
3. Update OSD state related to the current map.
|
|
4. Expose new maps to PG processes via *OSDService*.
|
|
5. Remove PGs due to pool removal.
|
|
6. Queue dummy events to trigger PG map catchup.
|
|
|
|
Each PG asynchronously catches up to the currently published map during
|
|
process_peering_events before processing the event. As a result, different
|
|
PGs may have different views as to the "current" map.
|
|
|
|
One consequence of this design is that messages containing submessages from
|
|
multiple PGs (MOSDPGInfo, MOSDPGQuery, MOSDPGNotify) must tag each submessage
|
|
with the PG's epoch as well as tagging the message as a whole with the OSD's
|
|
current published epoch.
|
|
|
|
MOSDPGOp/MOSDPGSubOp
|
|
--------------------
|
|
See OSD::dispatch_op, OSD::handle_op, OSD::handle_sub_op
|
|
|
|
MOSDPGOps are used by clients to initiate rados operations. MOSDSubOps are used
|
|
between OSDs to coordinate most non peering activities including replicating
|
|
MOSDPGOp operations.
|
|
|
|
OSD::require_same_or_newer map checks that the current OSDMap is at least
|
|
as new as the map epoch indicated on the message. If not, the message is
|
|
queued in OSD::waiting_for_osdmap via OSD::wait_for_new_map. Note, this
|
|
cannot violate the above conditions since any two messages will be queued
|
|
in order of receipt and if a message is recieved with epoch e0, a later message
|
|
from the same source must be at epoch at least e0. Note that two PGs from
|
|
the same OSD count for these purposes as different sources for single PG
|
|
messages. That is, messages from different PGs may be reordered.
|
|
|
|
|
|
MOSDPGOps follow the following process:
|
|
|
|
1. OSD::handle_op: validates permissions and crush mapping.
|
|
See OSDService::handle_misdirected_op
|
|
See OSD::op_has_sufficient_caps
|
|
See OSD::require_same_or_newer_map
|
|
2. OSD::enqueue_op
|
|
|
|
MOSDSubOps follow the following process:
|
|
|
|
1. OSD::handle_sub_op checks that sender is an OSD
|
|
2. OSD::enqueue_op
|
|
|
|
OSD::enqueue_op calls PG::queue_op which checks can_discard_request before
|
|
queueing the op in the op_queue and the PG in the OpWQ. Note, a single PG
|
|
may be in the op queue multiple times for multiple ops.
|
|
|
|
dequeue_op is then eventually called on the PG. At this time, the op is popped
|
|
off of op_queue and passed to PG::do_request, which checks that the PG map is
|
|
new enough (must_delay_op) and then processes the request.
|
|
|
|
In summary, the possible ways that an op may wait or be discarded in are:
|
|
|
|
1. Wait in waiting_for_osdmap due to OSD::require_same_or_newer_map from
|
|
OSD::handle_*.
|
|
2. Discarded in OSD::can_discard_op at enqueue_op.
|
|
3. Wait in PG::op_waiters due to PG::must_delay_request in PG::do_request.
|
|
4. Wait in PG::waiting_for_active in due_request due to !flushed.
|
|
5. Wait in PG::waiting_for_active due to !active() in do_op/do_sub_op.
|
|
6. Wait in PG::waiting_for_(degraded|missing) in do_op.
|
|
7. Wait in PG::waiting_for_active due to scrub_block_writes in do_op
|
|
|
|
TODO: The above is not a complete list.
|
|
|
|
Peering Messages
|
|
----------------
|
|
See OSD::handle_pg_(notify|info|log|query)
|
|
|
|
Peering messages are tagged with two epochs:
|
|
|
|
1. epoch_sent: map epoch at which the message was sent
|
|
2. query_epoch: map epoch at which the message triggering the message was sent
|
|
|
|
These are the same in cases where there was no triggering message. We discard
|
|
a peering message if the message's query_epoch if the PG in question has entered
|
|
a new epoch (See PG::old_peering_event, PG::queue_peering_event). Notifies,
|
|
infos, notifies, and logs are all handled as PG::RecoveryMachine events and
|
|
are wrapped by PG::queue_* by PG::CephPeeringEvts, which include the created
|
|
state machine event along with epoch_sent and query_epoch in order to
|
|
generically check PG::old_peering_message upon insertion and removal from the
|
|
queue.
|
|
|
|
Note, notifies, logs, and infos can trigger the creation of a PG. See
|
|
OSD::get_or_create_pg.
|
|
|
|
|