We cannot trust the Message bufferlists or other structures to be
stable without pipe_lock, as another Pipe may claim and modify the sent
list items while we are writing to the socket.
Related to #3678.
Signed-off-by: Sage Weil <sage@inktank.com>
Fill out the Message header, footer, and calculate CRCs during
encoding, not write_message(). This removes most modifications from
Pipe::write_message().
Signed-off-by: Sage Weil <sage@inktank.com>
The call to check_linger_pool_dne() may unregister the linger request,
invalidating the iterator. To avoid this, increment the iterator at
the top of the loop.
This mirror the fix in 4bf9078286 for
regular non-linger ops.
Fixes: #3734
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
To not conflict with future linuxbox pull for nfs-ganesha.
Signed-off-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
This modifies bufferlists in the Message struct, and it is possible
for multiple instances of the Pipe to get references on the Message;
make sure they don't modify those bufferlists concurrently.
Signed-off-by: Sage Weil <sage@inktank.com>
Associate a sending message with the connection inside the pipe_lock.
This way if a racing thread tries to steal these messages it will
be sure to reset the con point *after* we do such that it the con
pointer is valid in encode_payload() (and later).
This may be part of #3678.
Signed-off-by: Sage Weil <sage@inktank.com>
In commit 20496b8d2b we treat a CALL as
different from a normal "read", but we did not adjust the behavior
determined by the RD bit in the op. We tried to fix that in
91e941aef9, but changing the op code breaks
compatibility, so that was reverted.
Instead, special-case CALL in the helper--the only point in the code that
actually checks for the RD bit. (And fix one lingering user to use that
helper appropriately.)
Fixes: #3731
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
This reverts commit 91e941aef9.
We cannot change this op code without breaking compatibility
with old code (client and server). We'll have to special case
this op code instead.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Prevents race between messages being dispatched to the client after the
client has been free'd.
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Reviewed-by: Sage Weil <sage@inktank.com>
do_copy was different from the others; call pc.fail() on error and
do not call pc.finish().
Fixes: #3729
Signed-off-by: Dan Mick <dan.mick@inktank.com>
This optimization allowed the primary to push a clone as a single push in the
case that the head object on the replica is old and happens to be at the same
version as the clone. In general, using head in clone_subsets is tricky since
we might be writing to head during the push. calc_clone_subsets does not
consider head (probably for this reason). Handling the clone from head case
properly would require blocking writes on head in the interim which is probably
a bad trade off anyway.
Because the old-head optimization only comes into play if the replica's state
happens to fall on the last write to head prior to the snap that caused the
clone in question, it's not worth the complexity.
Fixes: #3698
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
The op_seq file is the starting point for journal replay. For stable btrfs
commit mode, which is using a snapshot as a reference, we should write this
file before we take the snap. We normally ignore current/ contents anyway.
On non-btrfs file systems, however, we should only write this file *after*
we do a full sync, and we should then fsync(2) it before we continue
(and potentially trim anything from the journal).
This fixes a serious bug that could cause data loss and corruption after
a power loss event. For a 'kill -9' or crash, however, there was little
risk, since the writes were still captured by the host's cache.
Fixes: #3721
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Normally, we batch up peering messages until the end of
process_peering_events to allow us to combine many notifies, etc
to the same osd into the same message. However, old osds assume
that the actiavtion message (log or info) will be _dispatched
before the first sub_op_modify of the interval. Thus, for those
peers, we need to send the peering messages before we drop the
pg lock, lest we issue a client repop from another thread before
activation message is sent.
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
A multi-client dbench run doesn't work over NFS,
see bug #3718. Make single client dbench available.
Signed-off-by: David Zafman <david.zafman@inktank.com>
Push osdmaps to PGs in separate method from activate_map() (whose name
is becoming less and less accurate).
Signed-off-by: Sage Weil <sage@inktank.com>
The OSD deliberate consumes and processes most OSDMaps from while it
was down before it marks itself up, as this is can be slow. The new
threading code does this asynchronously in peering_wq, though, and
does not let it drain before booting the OSD. The OSD can get into
a situation where it marks itself up but is not responsive or useful
because of the backlog, and only makes the situation works by
generating more osdmaps as result.
Fix this by calling activate_map() even when booting, and when booting
draining the peering_wq on each call. This is harmless since we are
not yet processing actual ops; we only need to be async when active.
Fixes: #3714
Signed-off-by: Sage Weil <sage@inktank.com>
We weren't locking m_flush_mutex properly, which in turn was leading to
racing threads calling dump_recent() and garbling the crash dump output.
Backport: bobtail, argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
The pjd script now uses the latest version of pjd
with an additional test for opening a non-existent
file.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
With the changes from 856f32ab, the cfuse.init call returns
a _positive_ errno, which was getting ignored. Also, if an
error occurs during cfuse.init(), we need to teardown the client
mount.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
This eliminates a window in which a race could occur when we have an
image open but no watch established. The previous fix (using
assert_version) did not work well with resend operations.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Limit size of each aio submission to IOV_MAX-1 (to be safe). Take care to
only mark the last aio with the seq to signal completion.
Signed-off-by: Sage Weil <sage@inktank.com>