
23342 Commits

Author SHA1 Message Date
Sage Weil
988a521735 osd: special case CALL op to not have RD bit effects
In commit 20496b8d2b we treat a CALL as
different from a normal "read", but we did not adjust the behavior
determined by the RD bit in the op.  We tried to fix that in
91e941aef9, but changing the op code breaks
compatibility, so that was reverted.

Instead, special-case CALL in the helper--the only point in the code that
actually checks for the RD bit.  (And fix one lingering user to use that
helper appropriately.)

Fixes: #3731
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2013-01-04 20:46:56 -08:00
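The special-casing described in the commit above can be sketched roughly as follows. This is a minimal Python illustration, not the actual Ceph C++ code: the flag values, the op names, and the `op_mode_read` helper are all hypothetical stand-ins chosen for the example.

```python
# Hypothetical op-code layout: a mode bit plus an opcode number.
# These values are illustrative only and are NOT Ceph's real constants.
MODE_RD = 0x1000
MODE_WR = 0x2000

OP_READ = MODE_RD | 1
OP_WRITE = MODE_WR | 1
OP_CALL = MODE_RD | 2   # keeps the RD bit so the op code stays compatible


def op_mode_read(op: int) -> bool:
    """Single helper that decides whether an op has 'read' effects.

    CALL still carries the RD bit in its op code, but it is special-cased
    here so that callers do not treat it like a normal read.
    """
    return bool(op & MODE_RD) and op != OP_CALL
```

Because the op code itself is unchanged, old clients and servers would still agree on its value; only code that goes through the helper sees the different behavior.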
Sage Weil
d3abd0fe0b Revert "OSD: remove RD flag from CALL ops"
This reverts commit 91e941aef9.

We cannot change this op code without breaking compatibility
with old code (client and server).  We'll have to special case
this op code instead.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2013-01-04 20:46:48 -08:00
Noah Watkins
3a9408742a libcephfs: delete client after messenger shutdown
Prevents a race in which messages are dispatched to the client after the
client has been freed.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-01-04 19:51:52 -08:00
Dan Mick
0978dc4963 rbd: Don't call ProgressContext's finish() if there's an error.
do_copy was different from the others; call pc.fail() on error and
do not call pc.finish().

Fixes: #3729
Signed-off-by: Dan Mick <dan.mick@inktank.com>
2013-01-04 18:02:55 -08:00
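The error-handling pattern this commit describes (call fail() on error, finish() only on success) might look like the sketch below. It is written in Python for brevity; the `ProgressContext` class, the `do_copy` signature, and the `write` callback are hypothetical and do not mirror the real rbd code.

```python
class ProgressContext:
    """Hypothetical progress reporter with separate success/failure endings."""

    def __init__(self):
        self.events = []

    def update(self, done, total):
        self.events.append(("update", done, total))

    def finish(self):
        self.events.append(("finish",))

    def fail(self):
        self.events.append(("fail",))


def do_copy(chunks, pc, write):
    """Copy chunks via write(); return 0 or a negative errno-style code."""
    for i, chunk in enumerate(chunks, start=1):
        r = write(chunk)
        if r < 0:
            pc.fail()       # error path: report failure, do NOT call finish()
            return r
        pc.update(i, len(chunks))
    pc.finish()             # only reached when every write succeeded
    return 0
```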
Samuel Just
e89b6ade63 ReplicatedPG: remove old-head optimization from push_to_replica
This optimization allowed the primary to push a clone as a single push in the
case that the head object on the replica is old and happens to be at the same
version as the clone.  In general, using head in clone_subsets is tricky since
we might be writing to head during the push.  calc_clone_subsets does not
consider head (probably for this reason).  Handling the clone from head case
properly would require blocking writes on head in the interim which is probably
a bad trade off anyway.

Because the old-head optimization only comes into play if the replica's state
happens to fall on the last write to head prior to the snap that caused the
clone in question, it's not worth the complexity.

Fixes: #3698
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-01-04 13:44:18 -08:00
Josh Durgin
6a3d475cf0 Merge remote branch 'origin/wip-rbd-watch'
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2013-01-04 13:37:36 -08:00
Yan, Zheng
acfa0c9a4a mds: optimize C_MDC_RetryOpenRemoteIno
When opening a remote inode, C_MDC_RetryOpenRemoteIno is used as the onfinish
context for discovering the remote inode. By the time it is called, the MDS
may already have the inode.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-04 11:00:05 +08:00
Yan, Zheng
acbe6d97fd mds: don't issue caps while inode is exporting caps
If we issue caps while the inode is exporting caps, the client will drop the
caps once it receives the CAP_OP_EXPORT message, but it will not receive a
corresponding CAP_OP_IMPORT message.

Except for open file requests, it's OK not to issue caps for client requests.
If a non-auth MDS receives an open file request but can't issue caps, it
forwards the request to the auth MDS.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-04 10:45:40 +08:00
Yan, Zheng
ca4dc4dbc6 mds: check if stray dentry is needed
The necessity of stray dentry can change before the request acquires
all locks.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-04 10:45:40 +08:00
Yan, Zheng
3705c7ca9c mds: drop locks when opening remote dentry
Opening a remote dentry while holding locks may cause a deadlock. For example,
'discover' is blocked by an xlocked dentry, and the request holding the xlock
is blocked by the locks held by the readdir request.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-04 10:45:40 +08:00
Yan, Zheng
ea2fd1276b mds: check null context in CDir::fetch()
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-04 10:45:40 +08:00
Yan, Zheng
420f335566 mds: rdlock prepended dest trace when handling rename
rdlock prepended dest trace to prevent them from being xlocked by
someone else.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-04 10:45:40 +08:00
Yan, Zheng
248e4ab8e8 mds: fix cap mask for ifile lock
The ifile lock has 8 cap bits, so its cap mask should be 0xff.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-04 10:45:40 +08:00
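The arithmetic behind the 0xff value in the commit above can be checked with a small sketch. The helper and constant names here are hypothetical; only the figure of 8 cap bits comes from the commit message.

```python
def cap_mask(num_bits: int) -> int:
    """Return a mask with the low num_bits bits set."""
    return (1 << num_bits) - 1


IFILE_CAP_BITS = 8  # from the commit message: the ifile lock has 8 cap bits
IFILE_CAP_MASK = cap_mask(IFILE_CAP_BITS)
```

With 8 bits, `(1 << 8) - 1` is 255, which is 0xff in hexadecimal.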
Yan, Zheng
f9280cb694 mds: fix replica state for LOCK_MIX_LOCK
LOCK_MIX_LOCK state is for gathering local locks and caps, so replica state
should be LOCK_MIX.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-04 10:45:40 +08:00
Yan, Zheng
59953257e2 mds: keep dentry lock in sync state as much as possible
Unlike locks of other types, a dentry lock in an unreadable state can block
path traversal, so it should be in sync state as much as possible. There are
two rare cases in which the dentry lock is not set to sync state: the dentry
becomes replicated, or an xlock finishes while the dentry is freezing.

In commit efbca31d, I tried fixing the issue that unreadable replica dentry
blocks path traverse by modifying MDCache::path_traverse(), but it does not
work. This patch also reverts that change.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-04 10:45:40 +08:00
Yan, Zheng
b03eab22e4 mds: forbid creating file in deleted directory
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-04 10:45:40 +08:00
Yan, Zheng
d379ac8e0b mds: disable concurrent remote locking
The current code allows multiple MDRequests to concurrently acquire a
remote lock. But a lock ACK message wakes all requests because they
were all put on the same waiting queue. One request gets the lock, and
the remaining requests re-send the OP_WRLOCK/OPWRLOCK slave requests
and trigger an assertion on the remote MDS. The fix is to disable
concurrent acquisition of a remote lock: send the OP_WRLOCK/OPWRLOCK
slave request only if there is no ongoing slave request.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-04 10:45:08 +08:00
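The "send only if there is no ongoing slave request" rule from the commit above can be modelled roughly as below. This is a simplified Python sketch under stated assumptions: the `RemoteLockRequester` class, its fields, and the `send` callback are hypothetical and are not the MDS data structures.

```python
class RemoteLockRequester:
    """Hypothetical model: at most one outstanding slave request per lock."""

    def __init__(self, send):
        self.send = send            # callable that transmits a slave request
        self.in_flight = set()      # locks with an ongoing slave request
        self.waiters = {}           # lock -> requests waiting on that lock

    def acquire(self, lock, request):
        """Queue the request; send a slave request only if none is ongoing."""
        self.waiters.setdefault(lock, []).append(request)
        if lock in self.in_flight:
            return False            # another request already asked; just wait
        self.in_flight.add(lock)
        self.send(lock, request)
        return True

    def handle_ack(self, lock):
        """ACK arrived: clear the in-flight marker and wake all waiters."""
        self.in_flight.discard(lock)
        return self.waiters.pop(lock, [])
```

In this model, waiters woken by an ACK that did not get the lock call `acquire` again, and only the first of them triggers a new slave request.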
Sage Weil
28d59d374b os/FileStore: fix non-btrfs op_seq commit order
The op_seq file is the starting point for journal replay.  For stable btrfs
commit mode, which is using a snapshot as a reference, we should write this
file before we take the snap.  We normally ignore current/ contents anyway.

On non-btrfs file systems, however, we should only write this file *after*
we do a full sync, and we should then fsync(2) it before we continue
(and potentially trim anything from the journal).

This fixes a serious bug that could cause data loss and corruption after
a power loss event.  For a 'kill -9' or crash, however, there was little
risk, since the writes were still captured by the host's cache.

Fixes: #3721
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
2013-01-03 17:15:07 -08:00
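The ordering this commit describes for non-btrfs file systems (full sync first, then write the file, then fsync it) can be sketched as follows. This is an illustrative Python version, not the FileStore implementation; the function name, the directory argument, and the `commit_op_seq` file name are assumptions made for the example.

```python
import os


def commit_op_seq(current_dir: str, seq: int) -> None:
    """Record seq as the journal replay starting point (non-btrfs ordering)."""
    # 1. Flush all previously applied writes to stable storage first.
    os.sync()

    # 2. Only then write the new sequence number (file name is hypothetical)...
    path = os.path.join(current_dir, "commit_op_seq")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, f"{seq}\n".encode())
        # 3. ...and fsync it before the caller trims anything from the journal.
        os.fsync(fd)
    finally:
        os.close(fd)
```

If the sequence number reached disk before the data it covers, a power loss could leave replay starting after writes that were never persisted, which is the failure mode the commit message warns about.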
John Wilkins
f1e0305f0d doc: Removed the --without-tcmalloc flag until further advised.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-01-03 16:13:13 -08:00
Sage Weil
19df20867d Merge pull request #30 from rca/master
Minor clarification in docs.
2013-01-03 16:07:59 -08:00
John Wilkins
88af7d182a doc: Added defaults for PGs, links to recommended settings, and updated note on splitting.
Fixes: #3555

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-01-03 14:51:33 -08:00
Samuel Just
4ae4dce5c5 OSD: for old osds, dispatch peering messages immediately
Normally, we batch up peering messages until the end of
process_peering_events to allow us to combine many notifies, etc
to the same osd into the same message.  However, old osds assume
that the activation message (log or info) will be _dispatched
before the first sub_op_modify of the interval.  Thus, for those
peers, we need to send the peering messages before we drop the
pg lock, lest we issue a client repop from another thread before
the activation message is sent.

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-01-03 14:18:00 -08:00
John Wilkins
73bc8ffc90 doc: Added comments on --without-tcmalloc option when building Ceph.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-01-03 13:30:14 -08:00
rca
37b57cdf0f Update doc/rados/configuration/filesystem-recommendations.rst
Clarified when it's necessary to use the setting:

filestore xattr use omap = true
2013-01-03 13:30:01 -08:00
John Wilkins
43ef6772eb doc: Added some packages to the copyable line.
Fixes: #3686

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-01-03 13:29:20 -08:00
John Wilkins
333ae82c61 doc: Fixed syntax error.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-01-03 13:28:06 -08:00
Sage Weil
7e94f6f1a7 Merge remote-tracking branch 'gh/wip-3714-b' into next
Signed-off-by: Samuel Just <sam.just@inktank.com>
2013-01-03 12:53:07 -08:00
David Zafman
224a33bb3b qa/workunit: Add dbench-short.sh for nfs suite
A multi-client dbench run doesn't work over NFS;
see bug #3718.  Make single-client dbench available.

Signed-off-by: David Zafman <david.zafman@inktank.com>
2013-01-03 12:44:19 -08:00
Sage Weil
a32d6c5dca osd: move common active vs booting code into consume_map
Push osdmaps to PGs in separate method from activate_map() (whose name
is becoming less and less accurate).

Signed-off-by: Sage Weil <sage@inktank.com>
2013-01-02 22:39:10 -08:00
Sage Weil
0bfad8ef20 osd: let pgs process map advances before booting
The OSD deliberately consumes and processes most OSDMaps from while it
was down before it marks itself up, as this can be slow.  The new
threading code does this asynchronously in peering_wq, though, and
does not let it drain before booting the OSD.  The OSD can get into
a situation where it marks itself up but is not responsive or useful
because of the backlog, and only makes the situation worse by
generating more osdmaps as a result.

Fix this by calling activate_map() even when booting and, when booting,
draining the peering_wq on each call.  This is harmless since we are
not yet processing actual ops; we only need to be async when active.

Fixes: #3714
Signed-off-by: Sage Weil <sage@inktank.com>
2013-01-02 22:20:06 -08:00
Sage Weil
5fc94e89a9 osd: drop oldest_last_clean from activate_map
Signed-off-by: Sage Weil <sage@inktank.com>
2013-01-02 22:04:34 -08:00
Sage Weil
67f7ee6799 osd: drop unused variables from activate_map
Signed-off-by: Sage Weil <sage@inktank.com>
2013-01-02 22:04:08 -08:00
Sage Weil
a14a36ed78 OSDMap: fix modifed -> modified typo
Signed-off-by: Sage Weil <sage@inktank.com>
2013-01-02 21:09:07 -08:00
Sage Weil
6b5a89d237 Merge remote-tracking branch 'gh/next'
2013-01-02 18:13:25 -08:00
Sage Weil
43cba617aa log: fix locking typo/stupid for dump_recent()
We weren't locking m_flush_mutex properly, which in turn was leading to
racing threads calling dump_recent() and garbling the crash dump output.

Backport: bobtail, argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2013-01-02 17:01:32 -08:00
John Wilkins
29ff87a573 Merge branch 'master' of https://github.com/ceph/ceph
2013-01-02 15:59:59 -08:00
John Wilkins
64d2760a49 doc: Added a memory profiling section. Ported from the wiki.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-01-02 15:58:03 -08:00
John Wilkins
5066abf189 doc: Added memory profiling to the index.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-01-02 15:57:22 -08:00
Sam Lang
0e9a0cd7b8 qa/workunit: Update pjd script to use new tarball
The pjd script now uses the latest version of pjd
with an additional test for opening a non-existent
file.

Signed-off-by: Sam Lang <sam.lang@inktank.com>
2013-01-02 17:08:37 -06:00
Sam Lang
d8940d15c3 fuse: Fix cleanup code path on init failure
With the changes from 856f32ab, the cfuse.init call returns
a _positive_ errno, which was getting ignored.  Also, if an
error occurs during cfuse.init(), we need to tear down the client
mount.

Signed-off-by: Sam Lang <sam.lang@inktank.com>
2013-01-02 16:38:28 -06:00
Josh Durgin
c4370ff03f librbd: establish watch before reading header
This eliminates a window in which a race could occur when we have an
image open but no watch established. The previous fix (using
assert_version) did not work well with resend operations.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-02 14:15:34 -08:00
Sage Weil
9a1cf51888 Merge branch 'wip-journal-aio' into next
Reviewed-by: Samuel Just <sam.just@inktank.com>
Backport: bobtail
2013-01-02 13:42:22 -08:00
Sage Weil
483c6f76ad test_filejournal: optionally specify journal filename as an argument
Signed-off-by: Sage Weil <sage@inktank.com>
2013-01-02 13:39:05 -08:00
Sage Weil
c461e7fc1e test_filejournal: test journaling bl with >IOV_MAX segments
Signed-off-by: Sage Weil <sage@inktank.com>
2013-01-02 13:39:05 -08:00
Sage Weil
dda7b65189 os/FileJournal: limit size of aio submission
Limit size of each aio submission to IOV_MAX-1 (to be safe).  Take care to
only mark the last aio with the seq to signal completion.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-01-02 13:39:05 -08:00
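The chunking rule from the commit above (at most IOV_MAX-1 segments per aio submission, with only the last one carrying the seq) can be illustrated with the sketch below. It is a simplified Python model rather than the FileJournal code; the function name and the default `iov_max` of 1024 are assumptions for the example.

```python
def split_aio_submissions(segments, seq, iov_max=1024):
    """Split segments into submissions of at most iov_max - 1 entries.

    Returns a list of (chunk, seq_or_None) pairs. Only the final submission
    carries seq, so completion is signalled once the whole write is done.
    """
    limit = iov_max - 1
    chunks = [segments[i:i + limit] for i in range(0, len(segments), limit)]
    return [
        (chunk, seq if idx == len(chunks) - 1 else None)
        for idx, chunk in enumerate(chunks)
    ]
```

For example, 2500 segments with `iov_max=1024` would be split into three submissions, and only the third would be tagged with the seq.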
Josh Durgin
e0858fa899 Revert "librbd: ensure header is up to date after initial read"
Using assert version for linger ops doesn't work with retries,
since the version will change after the first send.
This reverts commit e177680903.

Conflicts:

	qa/workunits/rbd/watch_correct_version.sh
2013-01-02 12:32:33 -08:00
John Wilkins
82297706da doc: Minor edits.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-01-02 11:24:39 -08:00
John Wilkins
d3b9803eab doc: Fixed typo, clarified usage.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-01-02 11:15:16 -08:00
Yan, Zheng
8422474320 mds: fix rename inode exporter check
Using "srcdn->is_auth() && destdnl->is_primary()" to check whether the MDS
is the inode exporter of a rename operation is not reliable, because the
OP_FINISH slave request may race with subtree import. The fix is to use
a variable in MDRequest to indicate whether the MDS is the inode exporter.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:40 +08:00
Yan, Zheng
5e8642a82e mds: call maybe_eval_stray after removing a replica dentry
MDCache::handle_cache_expire() processes dentries after inodes, so the
MDCache::maybe_eval_stray() in MDCache::inode_remove_replica() always
fails to remove stray inode because MDCache::eval_stray() checks if the
stray inode's dentry is replicated.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:40 +08:00