With retries, it's possible for notifies to be received more than once
when they are resent to different OSDs, since the OSDs only track them
in memory.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Watches update the on-disk state in the OSD, and aren't idempotent,
so refreshing them must be treated as a separate transaction by the OSD.
Notifies are just in-memory state, and resending them will result in
acceptable behavior:
- if it's the same osd, the resent op will be recognized as a duplicate
- if it's a different osd, a new notify will be triggered since the new osd
can't tell whether the original notify was received by any watchers
Using a new tid for each resend can cause some unnecessary extra work,
as the first case turns into the second.
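
The two resend cases above can be illustrated with a minimal sketch. SketchOsd
and handle_notify are hypothetical names used only for illustration, not the
actual OSD code; the only assumption carried over from the text is that each
OSD remembers notify tids in memory and shares nothing with other OSDs.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical stand-in for an OSD that tracks notify tids only in memory.
struct SketchOsd {
  std::set<uint64_t> seen_tids;  // in-memory only, not shared between OSDs
  int notifies_triggered = 0;

  // Returns true if the op triggered a new notify, false if it was
  // recognized as a duplicate of an op this OSD has already seen.
  bool handle_notify(uint64_t tid) {
    if (!seen_tids.insert(tid).second)
      return false;  // same osd, same tid: duplicate
    ++notifies_triggered;
    return true;
  }
};
```

Resending with the same tid to the same SketchOsd is a no-op, while a fresh
tid (or a different SketchOsd) always triggers another notify, which is the
extra work the message refers to.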
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
osd max backfills: 5 was too low for a default; 10
seems to work better in testing. The message
priority system should minimize disruption of
push and pull operations anyway.
osd recovery max chunk: 1MB was too small for a
default. 8MB is reasonable for a single push
and will allow us to recover an rbd block in
one push rather than 4, reducing client io
latency during log-based recovery.
osd recovery op priority: 10 rather than 30 will
further reduce the client io latency impact of
push and pull operations.
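
The "one push rather than 4" figure follows from simple ceiling division. The
sketch below assumes a 4 MB rbd block, which is inferred from the numbers
above rather than stated; pushes_needed and the constants are hypothetical
names for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Number of pushes needed to recover one object when each push carries
// at most chunk_bytes of data (ceiling division).
constexpr uint64_t pushes_needed(uint64_t object_bytes, uint64_t chunk_bytes) {
  return (object_bytes + chunk_bytes - 1) / chunk_bytes;
}

constexpr uint64_t MiB = 1024 * 1024;
// Assumption: an rbd block is 4 MB, as implied by "one push rather than 4".
constexpr uint64_t kAssumedRbdBlock = 4 * MiB;
```

Under that assumption, pushes_needed(kAssumedRbdBlock, 1 * MiB) gives 4 and
pushes_needed(kAssumedRbdBlock, 8 * MiB) gives 1.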
Signed-off-by: Samuel Just <sam.just@inktank.com>
We pass a pointer because it is an optional argument, but we shouldn't
put the bufferlist on the heap or else we have to manage its life
cycle, and that's fragile (and previously broken).
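
A minimal sketch of the pattern, using std::string as a stand-in for a
bufferlist. read_value and read_into_stack_buffer are hypothetical names and
do not correspond to functions in the tree.

```cpp
#include <cassert>
#include <string>

// Hypothetical reader: 'out' is optional, so it is a pointer that may be
// null. The callee never owns or frees it.
inline int read_value(const std::string& src, std::string* out = nullptr) {
  if (out)
    out->append(src);
  return static_cast<int>(src.size());
}

// The caller keeps the buffer on the stack and passes its address, so the
// buffer is destroyed automatically when the scope exits.
inline std::string read_into_stack_buffer(const std::string& src) {
  std::string buf;  // stands in for a bufferlist
  read_value(src, &buf);
  return buf;
}
```

No new/delete pair is involved, so there is no life cycle to get wrong.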
Signed-off-by: Sage Weil <sage@inktank.com>
A list is overkill; just use a seq and make sure it increments to ensure
the op_submit_finish calls are in order.
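
The idea can be sketched as follows. SubmitOrderCheck is a hypothetical type
for illustration; where the real code would presumably assert, this sketch
returns false so the behavior is easy to observe.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical order checker: one counter replaces a list of pending ops.
struct SubmitOrderCheck {
  uint64_t next_seq = 0;       // last seq handed out at submit start
  uint64_t last_finished = 0;  // highest seq whose finish has run

  uint64_t start() { return ++next_seq; }

  // Returns false if 'seq' is not exactly the next one expected, i.e. the
  // finish calls are arriving out of order.
  bool finish(uint64_t seq) {
    if (seq != last_finished + 1)
      return false;
    last_finished = seq;
    return true;
  }
};
```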
Signed-off-by: Sage Weil <sage@inktank.com>
The delicate balancing with op_apply_start() and the fact that it can
block was making it very hard to determine how long commit_start() should
wait, since requests in the workqueue threads could op_apply_start() in
any order. For example,
threadA: gets osr1 from wq
threadB: gets osr2 from wq
threadA: dequeue seq 11 from osr1, op_apply_start
threadC: commit_start on 11
threadA: op_apply_finish on seq 11
threadC: commit_started, commit_finish
threadB: dequeue seq 10 from osr2
<failed assert, badness>
Instead, rip out all this code, and use the ThreadPool pause() method to
quiesce operations. Keep some of the (now unnecessary) fields around
for sanity checks (blocked, open_ops, max_applying_seq, etc.).
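
For illustration, quiescing by pausing can be sketched with a small gate.
PauseGate is a hypothetical class and is not the ThreadPool implementation;
it only shows the pattern of blocking new ops and waiting for running ones
to drain before a commit proceeds.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical gate illustrating pause-to-quiesce.
class PauseGate {
  std::mutex m;
  std::condition_variable cv;
  bool paused = false;
  int in_flight = 0;

public:
  // Worker side: blocks while paused, then marks one op as running.
  void begin_op() {
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [this] { return !paused; });
    ++in_flight;
  }
  void end_op() {
    std::lock_guard<std::mutex> l(m);
    --in_flight;
    cv.notify_all();
  }
  // Committer side: stop new ops and wait for running ones to finish.
  void pause() {
    std::unique_lock<std::mutex> l(m);
    paused = true;
    cv.wait(l, [this] { return in_flight == 0; });
  }
  void unpause() {
    std::lock_guard<std::mutex> l(m);
    paused = false;
    cv.notify_all();
  }
  int running() {
    std::lock_guard<std::mutex> l(m);
    return in_flight;
  }
};
```

Once pause() returns, no op is between begin_op() and end_op(), so the order
in which workers dequeue no longer matters to the committer.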
Signed-off-by: Sage Weil <sage@inktank.com>
These asserts are valid for a uniform cluster, but they won't hold
for a replica running a version without the info.last_epoch_started
patch.
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 0756052cff)
We were moving to MIX even if nobody wanted to write; that is not
useful, since if we only want to read, SYNC will let us cache those reads.
SYNC is also a more friendly place (all things being equal) to be.
Signed-off-by: Sage Weil <sage@inktank.com>
In order to properly validate the client capabilities,
we need to be able to access them from libcephfs.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
Split format 1 and 2 image creation into separate functions for better
readability. Format 2 requires more error handling.
Fixes: #2677
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
After adding the gv patches, during Monitor::recovered_leader() we started
waking up contexts following the order of the 'paxos' vector. However,
given that the mdsmon has a forgotten dependency on the osdmon paxos
machine, we ended up in a situation in which we proposed a value
through the osdmon before creating a new pending value (but by being
active, the mdsmon would go through with it nonetheless).
This is easily fixed by making sure that the mdsmon callbacks are only
woken *after* the osdmon has been taken care of.
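
One way to picture the ordering constraint, as a sketch only: wake_order is a
hypothetical helper, and the actual fix may be structured differently. It
keeps the existing order of all services but moves mdsmon to the end, so it
always follows osdmon.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: returns the service names in wake-up order, with
// "mdsmon" moved last and the relative order of the others preserved.
inline std::vector<std::string> wake_order(std::vector<std::string> services) {
  std::stable_partition(services.begin(), services.end(),
                        [](const std::string& s) { return s != "mdsmon"; });
  return services;
}
```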
Fixes: #3495
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Prior to split, this did not matter. With split, however, it's
crucial that a pg go through advance_pg() for the map causing
the split. During operation, a PG lags the OSD superblock
epoch. If the OSD dies after the OSD epoch passes the split
but before the pg epoch passes the split, the PG will be
reloaded at the OSD epoch and won't see the split operation.
The PG collection might after that point contain incorrect
objects which should have been split into a child.
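
The failure mode can be sketched with plain epoch arithmetic. sees_split is a
hypothetical helper, not OSD code; it models a PG that is loaded at some epoch
and then advanced one map at a time up to the OSD epoch.

```cpp
#include <cassert>
#include <cstdint>

using epoch_t = uint32_t;

// Hypothetical model: a PG loaded at load_epoch processes every map in
// (load_epoch, osd_epoch]. Returns true if the map that causes the split
// is among those processed.
inline bool sees_split(epoch_t load_epoch, epoch_t osd_epoch,
                       epoch_t split_epoch) {
  for (epoch_t e = load_epoch + 1; e <= osd_epoch; ++e) {
    if (e == split_epoch)
      return true;
  }
  return false;
}
```

With the PG at epoch 8, the OSD at 12 and the split at 10, loading at the PG
epoch (sees_split(8, 12, 10)) processes the split, while loading at the OSD
epoch (sees_split(12, 12, 10)) skips it.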
Signed-off-by: Samuel Just <sam.just@inktank.com>
PGs are split after updating to the map on which they split.
OSD::activate_map populates the set of currently "splitting"
pgs. Messages for those pgs are delayed until the split
is complete. We add the newly split children to pg_map
once the transaction populating their on-disk state completes.
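
A minimal sketch of the delay-until-split-completes flow. SplitTracker and its
members are hypothetical and use plain ints and strings in place of the real
pg ids and messages; finish_split stands for the point where the transaction
populating the child's on-disk state has completed.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical tracker for pgs that are currently splitting.
struct SplitTracker {
  std::set<int> splitting;  // children whose split is in progress
  std::set<int> pg_map;     // pgs ready to receive messages
  std::map<int, std::vector<std::string>> delayed;
  std::vector<std::string> delivered;

  void start_split(int child) { splitting.insert(child); }

  // Messages for splitting pgs are held back; messages for known pgs are
  // delivered. Messages for unknown pgs are ignored in this sketch.
  void handle(int pg, const std::string& msg) {
    if (splitting.count(pg))
      delayed[pg].push_back(msg);
    else if (pg_map.count(pg))
      delivered.push_back(msg);
  }

  // Called once the child's on-disk state is in place: register the child
  // and release its delayed messages in arrival order.
  void finish_split(int child) {
    splitting.erase(child);
    pg_map.insert(child);
    for (const std::string& m : delayed[child])
      delivered.push_back(m);
    delayed.erase(child);
  }
};
```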
Signed-off-by: Samuel Just <sam.just@inktank.com>
This caused a bug where the watch operation bypassed the is_degraded()
check in the write path and the repop got sent to the replica where the
replica crashed due to the is_missing() assert in sub_op_modify.
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>