the strategy for stop relies on the fact that process_request() is
completely synchronous, so that io_context.stop() would still complete
each request and clean up properly
to tolerate an asynchronous process_request(), we instead need to drain
all outstanding work on the io_context so that io_context.run() can
return control naturally to all of the worker threads. that would allow
us to suspend our coroutine in the middle of process_request(), and
still guarantee that process_request() will resume and run to completion
before the worker threads exit
each connected socket also counts as outstanding work, and needs to be
closed in order to drain the io_context. each connection now adds itself
to a connection list so that stop() can close its socket
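a minimal sketch of the connection-list idea (hypothetical names; the real code tracks boost::asio sockets, and close() stands in for cancelling the socket's pending i/o): stop() closes every registered connection, which drains the outstanding work so run() can return

```cpp
#include <cassert>
#include <list>
#include <mutex>

struct Connection {
  bool open = true;
  void close() { open = false; }  // stands in for socket.close()
};

class ConnectionList {
  std::mutex mtx;
  std::list<Connection*> conns;
public:
  // called by each accepted connection before handling requests
  std::list<Connection*>::iterator add(Connection* c) {
    std::lock_guard<std::mutex> l(mtx);
    return conns.insert(conns.end(), c);
  }
  // called when a connection finishes on its own
  void remove(std::list<Connection*>::iterator i) {
    std::lock_guard<std::mutex> l(mtx);
    conns.erase(i);
  }
  // called by stop(): closing every socket cancels its pending i/o,
  // so the io_context runs out of work and run() returns naturally
  void close_all() {
    std::lock_guard<std::mutex> l(mtx);
    for (auto* c : conns)
      c->close();
  }
};
```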
Signed-off-by: Casey Bodley <cbodley@redhat.com>
the strategy for pause relied on stopping the io_context and waiting for
io_context.run() to return control to all of the worker threads. this
relied on the fact that process_request() is completely synchronous (so
it was considered a single unit of work in the io_context) - otherwise, pause
could complete in the middle of a call to process_request(), and destroy
the RGWRados instance while it's still in use
calling io_context.stop() to pause the worker threads also assumes that
no other work will be scheduled on these threads
to decouple pause from worker threads, handle_connection() now uses an
async shared mutex to synchronize with pause/unpause
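a simplified, synchronous sketch of the idea using std::shared_mutex (hypothetical names; the actual change uses an asio-based async shared mutex so coroutines suspend instead of blocking a thread): each request runs under a shared lock, and pause() takes the lock exclusively, so it waits for in-flight requests and holds off new ones until unpause()

```cpp
#include <cassert>
#include <shared_mutex>

struct PauseGuard {
  std::shared_mutex mtx;

  // each request runs under a shared lock, so any number of
  // requests can proceed concurrently while unpaused
  template <typename Fn>
  void run_request(Fn&& fn) {
    std::shared_lock lock(mtx);
    fn();  // stands in for process_request()
  }
  // pause() takes the lock exclusively: it waits for in-flight
  // requests to finish, and new requests wait until unpause()
  void pause() { mtx.lock(); }
  void unpause() { mtx.unlock(); }
};
```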
Signed-off-by: Casey Bodley <cbodley@redhat.com>
This was there just to confirm that this path was exercised by the
rados suite (it is, several hits per rados run of 1/666).
Signed-off-by: Sage Weil <sage@redhat.com>
The class's osdmap may be updated while we are in our loop. Pass it in
explicitly instead.
Fixes: http://tracker.ceph.com/issues/26970
Signed-off-by: Sage Weil <sage@redhat.com>
Deletion involves an awkward dance between the pg lock and shard locks,
while the merge prep and tracking is "shard down". If the delete has
finished its work we may find that a merge has since been prepped.
Unwinding the merge tracking is nontrivial, especially because it might
involve a second PG, possibly even a fabricated placeholder one. Instead,
if we delete and find that a merge is coming, undo our deletion and let
things play out in the future map epoch.
Signed-off-by: Sage Weil <sage@redhat.com>
The point of premerge is to ensure that the constituent parts of the
target PG are fully clean. If there is an intervening PG migration and
one of the halves finishes migrating before the other, one half could
get removed and the final merge could result in an incomplete PG. In the
worst case, the two halves (let's call them A and B) could have started
out together on say [0,1,2], A moves to [3,4,5] and gets deleted from
[0,1,2], and then the final merge happens such that *all* copies of the PG
are incomplete.
We could construct a clever check that does allow removal of strays when
the sibling PG is also ready to go, but it would be complicated. Do the
simple thing. In reality, this would be an extremely hard case to hit
because the premerge window is generally very short.
Signed-off-by: Sage Weil <sage@redhat.com>
We can't hold osd_lock while blocking because other objectstore completions
need to take osd_lock (e.g., _committed_osd_maps), and those objectstore
completions need to complete in order to finish_splits. Move the blocking
to the top before we establish any local state in this stack frame since
both the public and cluster dispatchers may race in handle_osd_map and
we are dropping and retaking osd_lock.
Signed-off-by: Sage Weil <sage@redhat.com>
We can't safely block in _committed_osd_maps because we are being run
by the store's finisher threads, and we may have to wait for a PG to split
and then merge via that same queue and deadlock.
Do not hold osd_lock while waiting as this can interfere with *other*
objectstore completions that take osd_lock.
Signed-off-by: Sage Weil <sage@redhat.com>
Ensure that we bring split children up to date to the latest map even in
the absence of new OSDMaps feeding in NullEvts. This is important when
the handle_osd_map (or boot) thread is blocked waiting for pgs to catch
up, but we also need a newly-split child to catch up (perhaps so that it
can merge).
Signed-off-by: Sage Weil <sage@redhat.com>
This probably used to protect the pg registration; there is no need for
it now.
More importantly, having it here can cause a deadlock when we are holding
osd_lock and blocking on wait_min_pg_epoch(), because a PG may need to
finish splitting to advance and then merge with a peer. (The wait won't
block on *this* PG since it isn't registered in the shard yet, but it
will block on the merge peer.)
Signed-off-by: Sage Weil <sage@redhat.com>
The problem is:
- osd is at epoch 80
- import pg 1.a as of e57
- 1.a and 1.1a merged in epoch 60something
- we set up a merge now, but in should_restart_peering via advance_pg we
hit the is_split assert that the ps is < old_pg_num
We can meaningfully return false (this is not a split) for a pg that is
beyond pg_num.
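A simplified model of the fix (hypothetical standalone function; the real check lives on pg_t and handles non-power-of-two pg_num): a pg whose seed is already at or beyond the old pg_num cannot be the parent of a split, so return false instead of asserting.

```cpp
#include <cassert>
#include <cstdint>

// Simplified split check: a parent pg at seed `ps` splits going from
// old_pg_num to new_pg_num if it gains at least one child.
bool is_split(uint32_t ps, uint32_t old_pg_num, uint32_t new_pg_num) {
  if (ps >= old_pg_num)
    return false;  // pg is beyond pg_num (e.g. imported pre-merge): not a split
  // in this simplified model, the first child of ps would be ps + old_pg_num
  return ps + old_pg_num < new_pg_num;
}
```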
Signed-off-by: Sage Weil <sage@redhat.com>
We currently import a portion of the PG if it has split. Merge is more
complicated, though, mainly because COT is operating in a mode where it
fast-forwards the PG to the latest OSDMap epoch, which means it has to
implement any transformations to the PG (split/merge) independently.
Avoid doing this for merge.
Signed-off-by: Sage Weil <sage@redhat.com>
Optionally bounce pg_num back up right after we decrease it. This triggers
conditions in the OSD where the merge and split logic may conflict.
Signed-off-by: Sage Weil <sage@redhat.com>
- Revamps the split tracking infrastructure, and adds new tracking for
upcoming merges in consume_map. These are now unified into the same
identify_ method; they consume the new pg_num change tracking
infrastructure we just added in the prior commit.
- PGs that are about to merge have a new wait infrastructure, since all
sources and the target have to reach the target epoch before the merge
can happen.
- If one of the sources for a merge does not exist, we create an empty
dummy PG to merge with. This implies that the resulting merged PG will
be incomplete (and mostly useless), but it unifies the code paths.
- The actual merge (PG::merge_from) happens in advance_pg().
Fixes: http://tracker.ceph.com/issues/85
Signed-off-by: Sage Weil <sage@redhat.com>
This is the building block that smooshes multiple PGs back into one. The
resulting combination PG will have no PG log. That means the sources
need to be clean and quiesced or else the result will end up being
marked incomplete.
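A sketch of the source-to-target mapping behind this (hypothetical helper, assuming pg_num is halved from a power of two; the general mapping also covers non-power-of-two pg_num): sources s and s + new_pg_num combine into target s, e.g. 1.a and 1.1a become 1.a when pg_num drops from 32 to 16.

```cpp
#include <cassert>
#include <cstdint>

// For a power-of-two halving, a pg seed at or beyond the new pg_num
// folds back onto its sibling; seeds below it are merge targets.
uint32_t merge_target(uint32_t ps, uint32_t new_pg_num) {
  return ps >= new_pg_num ? ps - new_pg_num : ps;
}
```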
Signed-off-by: Sage Weil <sage@redhat.com>
If the user asks to reduce pg_num, reduce pg_num_target too at the same
time.
Don't completely hide pgp_num yet (by increasing it when pg_num_target
increases).
Signed-off-by: Sage Weil <sage@redhat.com>
Configure how many initial PGs we create a pool with. If the user wants
more than this then we do subsequent splits.
Default to 1024, so that pool creation works in the usual way for most users,
but does some splitting for very large pools/clusters.
Signed-off-by: Sage Weil <sage@redhat.com>
In order to recreate a lost PG, we need to set the CREATING flag for the
pool. This prevents pg_num from changing in future OSDMap epochs until
*after* the PG has successfully been instantiated.
Note that a pg_num change in *this* epoch is fine; the recreated PG will
instantiate in *this* epoch, which is /after/ the split that a pg_num
change in this epoch would describe.
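A minimal sketch of the guard this flag provides (hypothetical names; the real flag lives on pg_pool_t and is checked by the mon): while CREATING is set, proposed pg_num changes are rejected, so the recreated PG cannot race with a split in a later epoch.

```cpp
#include <cassert>
#include <cstdint>

struct Pool {
  uint32_t pg_num;
  bool creating;  // stands in for the pool's CREATING flag
};

// Reject pg_num changes until the recreated PG has instantiated.
bool apply_pg_num_change(Pool& p, uint32_t new_pg_num) {
  if (p.creating)
    return false;  // deferred to a future epoch
  p.pg_num = new_pg_num;
  return true;
}
```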
Signed-off-by: Sage Weil <sage@redhat.com>
The new sharded wq implementation cannot handle a resent mon create
message and a split child already existing. This is a side effect of the
new pg create path instantiating the PG at the pool create epoch osdmap
and letting it roll forward through splits; the mon may be resending a
create for a pg that was already created elsewhere and split elsewhere,
such that one of those split children has peered back onto this same OSD.
When we roll forward our re-created empty parent it may split and find the
child already exists, crashing.
This is no longer a concern because the mgr-based controller for pg_num
will not split PGs until after the initial PGs are all created. (We
know this because the pool has the CREATED flag set.)
The old-style path had its own problem,
http://tracker.ceph.com/issues/22165. We would build the history and
instantiate the pg in the latest osdmap epoch, ignoring any split children
that should have been created between the pool create epoch and the
current epoch. Since we're now taking the new path, that is no longer
a problem.
Fixes: http://tracker.ceph.com/issues/22165
Signed-off-by: Sage Weil <sage@redhat.com>
This will force pre-nautilus clients to resend ops when we are adjusting
pg_num_pending. This is a big hammer: for nautilus+ clients, we only have
an interval change for the affected PGs (the two PGs that are about to
merge), whereas this compat hack will do an op resend for the whole pool.
However, it is better than requiring all clients be upgraded to nautilus in
order to do PG merges.
Note that we already do the same thing for pre-luminous clients for
splits, so we've already inflicted similar pain in the past (and, to my
knowledge, have not seen any negative feedback or fallout from that).
Signed-off-by: Sage Weil <sage@redhat.com>
We only process mon-initiated PG creates while the pool is in CREATING
mode. This ensures that we will not have any racing split or merge
operations.
Signed-off-by: Sage Weil <sage@redhat.com>
This is more reliable than looking at PG states because the PG may have
gone active and sent a notification to the mon (pg created!) and mgr
(new state!) but the mon may not have persisted that information yet.
Signed-off-by: Sage Weil <sage@redhat.com>
Set the flag when the pool is created, and clear it when the initial set
of PGs have been created by the mon. Move the update_creating_pgs()
block so that we can process the pgid removal from the creating list and
the pool flag removal in the same epoch; otherwise we might remove the
pgid but have no cluster activity to roll over another osdmap epoch to
allow the pool flag to be removed.
Signed-off-by: Sage Weil <sage@redhat.com>
Previously, we renamed the old last_force_resend to
last_force_resend_preluminous and created a new last_force_resend for
luminous+. This allowed us to force preluminous clients to resend ops
(because they didn't understand the new pg split => new interval rule)
without affecting luminous clients.
Do the same rename again, adding a last_force_resend_prenautilus (luminous
or mimic).
Adjust the OSD code accordingly so it matches the behavior we'll see from
a luminous client.
Signed-off-by: Sage Weil <sage@redhat.com>