The op_seq file is the starting point for journal replay. For stable btrfs
commit mode, which uses a snapshot as a reference, we should write this
file before we take the snap. We normally ignore current/ contents anyway.
On non-btrfs file systems, however, we should only write this file *after*
we do a full sync, and we should then fsync(2) it before we continue
(and potentially trim anything from the journal).
This fixes a serious bug that could cause data loss and corruption after
a power loss event. For a 'kill -9' or crash, however, there was little
risk, since the writes were still captured by the host's cache.
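The non-btrfs ordering described above (full sync, then write op_seq, then
fsync it) can be sketched roughly as follows. This is a minimal Python
illustration under stated assumptions, not the FileStore code: the helper
names `write_op_seq` and `read_op_seq` and the one-number-per-file layout
are hypothetical.

```python
import os
import tempfile


def write_op_seq(path, seq):
    """Persist the journal replay starting point (hypothetical helper)."""
    # 1. Flush all previously applied writes to stable storage first.
    os.sync()
    # 2. Only then record the committed sequence number...
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, f"{seq}\n".encode())
        # 3. ...and fsync(2) it before anything is trimmed from the journal.
        os.fsync(fd)
    finally:
        os.close(fd)


def read_op_seq(path):
    """Read back the sequence number that replay would start from."""
    with open(path) as f:
        return int(f.read().strip())
```

If the sequence number reached disk before the data it describes, a power
loss could leave replay starting past writes that were never persisted,
which is the failure mode this ordering avoids.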
Fixes: #3721
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Normally, we batch up peering messages until the end of
process_peering_events to allow us to combine many notifies, etc
to the same osd into the same message. However, old osds assume
that the activation message (log or info) will be dispatched
before the first sub_op_modify of the interval. Thus, for those
peers, we need to send the peering messages before we drop the
pg lock, lest we issue a client repop from another thread before
the activation message is sent.
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
A multi-client dbench run doesn't work over NFS,
see bug #3718. Make single client dbench available.
Signed-off-by: David Zafman <david.zafman@inktank.com>
Push osdmaps to PGs in separate method from activate_map() (whose name
is becoming less and less accurate).
Signed-off-by: Sage Weil <sage@inktank.com>
The OSD deliberately consumes and processes most OSDMaps from while it
was down before it marks itself up, as this can be slow. The new
threading code does this asynchronously in peering_wq, though, and
does not let it drain before booting the OSD. The OSD can get into
a situation where it marks itself up but is not responsive or useful
because of the backlog, and only makes the situation worse by
generating more osdmaps as a result.
Fix this by calling activate_map() even when booting and, when booting,
draining the peering_wq on each call. This is harmless since we are
not yet processing actual ops; we only need to be async when active.
Fixes: #3714
Signed-off-by: Sage Weil <sage@inktank.com>
We weren't locking m_flush_mutex properly, which in turn was leading to
racing threads calling dump_recent() and garbling the crash dump output.
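The effect of holding the flush lock for the whole dump can be sketched as
follows. This is a minimal Python illustration, not the logging code
itself: `LogSketch`, its `output` list, and the begin/end marker lines are
hypothetical, and `_flush_lock` merely stands in for m_flush_mutex.

```python
import threading


class LogSketch:
    """Toy log that dumps recent entries under a single lock."""

    def __init__(self, recent):
        self._flush_lock = threading.Lock()  # stands in for m_flush_mutex
        self._recent = list(recent)
        self.output = []

    def dump_recent(self, tag):
        # Hold the lock for the entire dump so that two threads dumping
        # at the same time cannot interleave their lines.
        with self._flush_lock:
            self.output.append(f"--- begin dump {tag} ---")
            for entry in self._recent:
                self.output.append(f"{tag}: {entry}")
            self.output.append(f"--- end dump {tag} ---")
```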
Backport: bobtail, argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
The pjd script now uses the latest version of pjd
with an additional test for opening a non-existent
file.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
With the changes from 856f32ab, the cfuse.init call returns
a _positive_ errno, which was getting ignored. Also, if an
error occurs during cfuse.init(), we need to tear down the client
mount.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
This eliminates a window in which a race could occur when we have an
image open but no watch established. The previous fix (using
assert_version) did not work well with resend operations.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Limit size of each aio submission to IOV_MAX-1 (to be safe). Take care to
only mark the last aio with the seq to signal completion.
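The batching rule can be sketched as follows. This is a minimal Python
illustration, not the journal code: `split_aio_batches` is a hypothetical
helper, and the default of 1024 for IOV_MAX is an assumption (the real
limit is platform-defined).

```python
def split_aio_batches(iovecs, seq, iov_max=1024):
    """Split iovecs into submissions of at most iov_max - 1 entries.

    Returns a list of (chunk, seq_or_None) pairs. Only the final chunk
    carries the sequence number, so completion of the whole write is
    signalled exactly once, when the last aio finishes.
    """
    limit = iov_max - 1
    chunks = [iovecs[i:i + limit] for i in range(0, len(iovecs), limit)]
    last = len(chunks) - 1
    return [(chunk, seq if i == last else None)
            for i, chunk in enumerate(chunks)]
```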
Signed-off-by: Sage Weil <sage@inktank.com>
Using assert version for linger ops doesn't work with retries,
since the version will change after the first send.
This reverts commit e177680903.
Conflicts:
qa/workunits/rbd/watch_correct_version.sh
Using "srcdn->is_auth() && destdnl->is_primary()" to check whether the MDS
is the inode exporter of a rename operation is not reliable, because an
OP_FINISH slave request may race with a subtree import. The fix is to use
a variable in MDRequest to indicate whether the MDS is the inode exporter.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
MDCache::handle_cache_expire() processes dentries after inodes, so the
MDCache::maybe_eval_stray() in MDCache::inode_remove_replica() always
fails to remove the stray inode because MDCache::eval_stray() checks if
the stray inode's dentry is replicated.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
We should not defer processing caps if the inode is auth pinned by an
MDRequest, because the MDRequest may change the lock state of the inode
later and wait for the deferred caps.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Locker::simple_excl() and Locker::scatter_mix() are missing the
is_rdlocked check; Locker::file_excl() is missing both the is_rdlocked
check and the is_wrlocked check.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
In some rare cases, Locker::acquire_locks() drops all acquired locks
in order to auth pin new objects. But Locker::drop_locks only drops
explicitly acquired remote locks; it does not drop objects' version
locks that were implicitly acquired on the remote MDS. These leftover
locks break the locking order when re-acquiring locks and may cause
a deadlock.
The fix is to introduce a DROPLOCKS slave request, which drops all
acquired locks on the remote MDS.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
The slaves for two-phase commit should be mdr->more()->witnessed
instead of mdr->more()->slaves. mdr->more()->slaves includes MDSes
used for remote auth pins and locks.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Anchor table updates for a given inode are fully serialized on the client
side. But due to network latency, two commit requests from different
clients can arrive at the anchor server out of order. The anchor table
gets corrupted if updates are committed in the wrong order.
The fix is to track ongoing anchor updates for each individual inode and
delay processing commit requests that arrive out of order.
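The idea of delaying out-of-order commits can be sketched as follows. This
is a minimal Python illustration under stated assumptions, not the anchor
server code: `AnchorServerSketch`, its `prepare`/`commit` methods, and the
update ids are hypothetical, and it assumes the server sees updates for an
inode prepared in the order they must be applied.

```python
from collections import defaultdict, deque


class AnchorServerSketch:
    """Toy server that applies commits per inode in prepare order."""

    def __init__(self):
        self.pending = defaultdict(deque)  # ino -> update ids, prepare order
        self.early = defaultdict(set)      # ino -> commits received so far
        self.applied = []                  # (ino, update_id) in apply order

    def prepare(self, ino, update_id):
        # Track the ongoing update for this inode.
        self.pending[ino].append(update_id)

    def commit(self, ino, update_id):
        # Remember the commit, then apply only from the head of the queue;
        # a commit that arrives ahead of an earlier update simply waits.
        self.early[ino].add(update_id)
        queue = self.pending[ino]
        while queue and queue[0] in self.early[ino]:
            head = queue.popleft()
            self.early[ino].discard(head)
            self.applied.append((ino, head))
```

A commit for the second update that arrives first is held back until the
commit for the first update shows up, at which point both are applied in
order.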
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
CDir::assimilate_dirty_rstat_inodes() may encounter frozen inodes that
are being renamed. Skip these frozen inodes because assimilating an
inode's rstat requires auth pinning the inode.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
When handling a cross-authority rename, the master first sends
OP_RENAMEPREP slave requests to the witness MDSes, then sends an
OP_RENAMEPREP slave request to the rename inode's auth MDS after getting
all of the witness MDSes' acknowledgments. Before receiving the
OP_RENAMEPREP slave request, the rename inode's auth MDS may change the
lock state of the rename inode and send lock messages to a witness MDS.
But the witness MDS may have already received the OP_RENAMEPREP slave
request and changed the source inode's authority. So the witness MDS
sends the lock acknowledgment message to the wrong MDS and triggers an
assertion.
The fix is as follows: first, the master marks the rename inode as
ambiguous and sends a message asking the rename inode's auth MDS to mark
the inode as ambiguous; then it sends OP_RENAMEPREP slave requests to the
witness MDSes; finally, it sends the OP_RENAMEPREP slave request to the
rename inode's auth MDS after getting all of the witness MDSes'
acknowledgments.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
When replaying a directory rename operation, the MDS needs to find the old
parent of the renamed directory to adjust the auth subtree. The current
code searches the cache to find the old parent, which does not work if the
renamed directory inode is not in the cache. The EMetaBlob for a directory
rename contains at most one null dentry, so the MDS can use the null dentry
to find the old parent of the renamed directory. If there is no null dentry
in the EMetaBlob, the MDS was a witness of the rename operation and there
is no auth subtree underneath the renamed directory.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Server::_rename_prepare() adds a null dest dentry to the EMetaBlob if
the rename operation overwrites a remote linkage. This is incorrect
because null dentries are processed after primary and remote dentries
during journal replay. The erroneous null dentry makes the dentry of
the rename destination disappear.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
A discover reply that adds a replica dentry and inode can race with a
rename if the slave request for the rename sends a discover and waits,
but is woken up by the reply to a different discover.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Locker::simple_eval() checks if the loner wants CEPH_CAP_GEXCL to
decide if it should change the lock to EXCL state, but it checks
if CEPH_CAP_GEXCL is issued to the loner to decide if it should
change the lock to SYNC state. So if the loner wants CEPH_CAP_GEXCL,
but doesn't have CEPH_CAP_GEXCL, Locker::simple_eval() will keep
switching the lock state.
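The flapping described above can be reproduced with a toy state machine.
This is a minimal Python illustration, not Locker code: both function
names are hypothetical, and the "consistent" variant, which uses the
wanted caps for both transitions, is only one possible way to make the
two checks agree; it is an assumption, not necessarily the change made
here.

```python
def next_state_buggy(state, wanted_excl, issued_excl):
    """Mismatched predicates: 'wanted' to enter EXCL, 'issued' to leave."""
    if state == "SYNC" and wanted_excl:
        return "EXCL"
    if state == "EXCL" and not issued_excl:
        return "SYNC"
    return state


def next_state_consistent(state, wanted_excl, issued_excl):
    """Assumed alternative: use 'wanted' for both transitions."""
    if state == "SYNC" and wanted_excl:
        return "EXCL"
    if state == "EXCL" and not wanted_excl:
        return "SYNC"
    return state


def run(step, state, wanted_excl, issued_excl, n):
    """Apply step n times and return the sequence of states visited."""
    seen = []
    for _ in range(n):
        state = step(state, wanted_excl, issued_excl)
        seen.append(state)
    return seen
```

With the loner wanting but not holding the cap, the mismatched version
alternates between EXCL and SYNC on every evaluation, while the consistent
version settles after one step.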
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>