Commit Graph

23127 Commits

Author SHA1 Message Date
Yan, Zheng
b03eab22e4 mds: forbid creating file in deleted directory
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-04 10:45:40 +08:00
Yan, Zheng
d379ac8e0b mds: disable concurrent remote locking
Current code allows multiple MDRequests to concurrently acquire a
remote lock. But a lock ACK message wakes all requests because they
were all put on the same waiting queue. One request gets the lock;
the remaining requests re-send the OP_WRLOCK/OPWRLOCK slave requests
and trigger an assertion on the remote MDS. The fix is to disable
concurrent acquisition of a remote lock: send an OP_WRLOCK/OPWRLOCK
slave request only if there is no ongoing slave request.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-04 10:45:08 +08:00
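The gating described in d379ac8e0b can be illustrated with a minimal
sketch. RemoteLock, try_remote_lock() and on_ack() are hypothetical
names invented for this illustration, not the MDS code; the sketch only
shows "send a slave request only if none is outstanding".

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a remote lock; not a type from the MDS source.
struct RemoteLock {
    bool slave_request_in_flight = false;  // a slave request is outstanding
    std::vector<int> waiters;              // ids of requests parked on the lock
    int sent = 0;                          // number of slave requests sent
};

// Park the request on the lock. Send a slave request only if none is
// outstanding. Returns true if this call sent one.
bool try_remote_lock(RemoteLock& lock, int request_id) {
    lock.waiters.push_back(request_id);
    if (lock.slave_request_in_flight)
        return false;                      // wait behind the in-flight request
    lock.slave_request_in_flight = true;
    ++lock.sent;
    return true;
}

// The ACK arrived: clear the in-flight marker and hand back every waiter,
// mirroring the "one ACK wakes all requests" behaviour in the message.
std::vector<int> on_ack(RemoteLock& lock) {
    lock.slave_request_in_flight = false;
    std::vector<int> woken;
    woken.swap(lock.waiters);
    return woken;
}
```

With this shape, a woken request that retries sends at most one new
slave request, however many requests were parked on the queue.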
Yan, Zheng
8422474320 mds: fix rename inode exporter check
Using "srcdn->is_auth() && destdnl->is_primary()" to check whether the
MDS is the inode exporter of a rename operation is not reliable, because
the OP_FINISH slave request may race with subtree import. The fix is to
use a variable in MDRequest to indicate whether the MDS is the inode
exporter.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:40 +08:00
Yan, Zheng
5e8642a82e mds: call maybe_eval_stray after removing a replica dentry
MDCache::handle_cache_expire() processes dentries after inodes, so the
MDCache::maybe_eval_stray() call in MDCache::inode_remove_replica() always
fails to remove the stray inode, because MDCache::eval_stray() checks
whether the stray inode's dentry is replicated.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:40 +08:00
Yan, Zheng
f5ea5c36a4 mds: don't defer processing caps if inode is auth pinned
We should not defer processing caps if the inode is auth pinned by MDRequest,
because the MDRequest may change lock state of the inode later and wait for
the deferred caps.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:40 +08:00
Yan, Zheng
fe5936b158 mds: remove unnecessary is_xlocked check
Locker::foo_eval() is always called for stable locks, so no need to
check if the lock is xlocked.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:40 +08:00
Yan, Zheng
b2d5005aa0 mds: fix lock state transition check
Locker::simple_excl() and Locker::scatter_mix() are missing the
is_rdlocked check; Locker::file_excl() is missing both the is_rdlocked
check and the is_wrlocked check.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:40 +08:00
Yan, Zheng
b3796f46a4 mds: introduce DROPLOCKS slave request
In some rare cases, Locker::acquire_locks() drops all acquired locks
in order to auth pin new objects. But Locker::drop_locks only drops
explicitly acquired remote locks; it does not drop objects' version
locks that were implicitly acquired on the remote MDS. These leftover
locks break the locking order when re-acquiring locks and may cause
deadlock.

The fix is to introduce a DROPLOCKS slave request which drops all
acquired locks on the remote MDS.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:40 +08:00
Yan, Zheng
7e04504d3e mds: fix on-going two-phase commit tracking
The slaves for a two-phase commit should be mdr->more()->witnessed
instead of mdr->more()->slaves. mdr->more()->slaves includes MDS
for remote auth pin and lock.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:40 +08:00
Yan, Zheng
2f96b472ef mds: fix anchor table commit race
Anchor table updates for a given inode are fully serialized on the client
side. But due to network latency, two commit requests from different
clients can arrive at the anchor server out of order. The anchor table gets
corrupted if updates are committed in the wrong order.

The fix is to track ongoing anchor updates for each individual inode and
delay processing commit requests that arrive out of order.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:40 +08:00
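The ordering rule in 2f96b472ef can be sketched as a small per-inode
tracker. This is an illustration under stated assumptions, not the
anchor server code: AnchorUpdateTracker, prepare() and commit() are
hypothetical names, and updates are assumed to carry a per-inode
version that is registered in the intended order.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <map>
#include <set>
#include <vector>

// Hypothetical tracker: remembers the intended order of updates per inode
// and holds back commit requests that arrive ahead of their turn.
class AnchorUpdateTracker {
public:
    // Register an update in its intended order (assumed to be in order).
    void prepare(uint64_t ino, uint64_t version) {
        pending_[ino].push_back(version);
    }

    // A commit request arrived. Returns the versions that may be applied
    // now, in the intended order; empty if this commit must be delayed.
    std::vector<uint64_t> commit(uint64_t ino, uint64_t version) {
        std::vector<uint64_t> ready;
        auto it = pending_.find(ino);
        if (it == pending_.end())
            return ready;                  // no tracked update for this inode
        std::set<uint64_t>& seen = arrived_[ino];
        seen.insert(version);
        std::deque<uint64_t>& queue = it->second;
        // Release the longest prefix of updates whose commits have arrived.
        while (!queue.empty() && seen.count(queue.front())) {
            ready.push_back(queue.front());
            seen.erase(queue.front());
            queue.pop_front();
        }
        if (queue.empty()) {
            pending_.erase(it);
            arrived_.erase(ino);
        }
        return ready;
    }

private:
    std::map<uint64_t, std::deque<uint64_t>> pending_;  // intended order
    std::map<uint64_t, std::set<uint64_t>> arrived_;    // delayed commits
};
```

A commit for version 2 that arrives before the commit for version 1 is
held back, and both are released together, in order, once version 1
arrives.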
Yan, Zheng
a79493da34 mds: skip frozen inode when assimilating dirty inodes' rstat
CDir::assimilate_dirty_rstat_inodes() may encounter frozen inodes that
are being renamed. Skip these frozen inodes, because assimilating an
inode's rstat requires auth pinning the inode.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:40 +08:00
Yan, Zheng
61da9b1845 mds: mark rename inode as ambiguous auth on all involved MDS
When handling a cross-authority rename, the master first sends OP_RENAMEPREP
slave requests to the witness MDS, then sends an OP_RENAMEPREP slave request
to the rename inode's auth MDS after getting all witness MDS'
acknowledgments. Before receiving the OP_RENAMEPREP slave request, the
rename inode's auth MDS may change the lock state of the rename inode and
send lock messages to the witness MDS. But the witness MDS may have already
received the OP_RENAMEPREP slave request and changed the source inode's
authority. So the witness MDS sends the lock acknowledgment message to the
wrong MDS and triggers an assertion.

The fix is as follows: first, the master marks the rename inode as ambiguous
and sends a message asking the rename inode's auth MDS to mark the inode as
ambiguous; then it sends OP_RENAMEPREP slave requests to the witness MDS;
finally, it sends an OP_RENAMEPREP slave request to the rename inode's auth
MDS after getting all witness MDS' acknowledgments.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:39 +08:00
Yan, Zheng
3b13d3dcbc mds: only export directory fragments in stray to their auth MDS
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:39 +08:00
Yan, Zheng
d9d7147339 mds: don't trim ambiguous imports in MDCache::trim_non_auth_subtree
Trimming ambiguous imports in MDCache::trim_non_auth_subtree() confuses
MDCache::disambiguate_imports() and causes infinite loop.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:39 +08:00
Yan, Zheng
fcb9f98887 mds: use null dentry to find old parent of renamed directory
When replaying a directory rename operation, the MDS needs to find the old
parent of the renamed directory to adjust the auth subtree. Current code
searches the cache to find the old parent, which does not work if the
renamed directory inode is not in the cache. The EMetaBlob for a directory
rename contains at most one null dentry, so the MDS can use the null dentry
to find the old parent of the renamed directory. If there is no null dentry
in the EMetaBlob, the MDS was a witness of the rename operation and there is
no auth subtree underneath the renamed directory.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:39 +08:00
Yan, Zheng
7a52016864 mds: don't journal null dentry for overwritten remote linkage
Server::_rename_prepare() adds a null dest dentry to the EMetaBlob if
the rename operation overwrites a remote linkage. This is incorrect
because null dentries are processed after primary and remote dentries
during journal replay. The erroneous null dentry makes the dentry of
the rename destination disappear.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:39 +08:00
Yan, Zheng
5ae715be5c mds: xlock stray dentry when handling rename or unlink
This prevents the MDS from reintegrating the stray before rename/unlink
finishes.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:39 +08:00
Yan, Zheng
262795744b mds: don't trigger assertion when discover races with rename
A discover reply that adds a replica dentry and inode can race with rename
if the slave request for the rename sends a discover and waits, but is
woken up by the reply to a different discover.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:39 +08:00
Yan, Zheng
e10267b531 mds: fix Locker::simple_eval()
Locker::simple_eval() checks if the loner wants CEPH_CAP_GEXCL to
decide if it should change the lock to EXCL state, but it checks
if CEPH_CAP_GEXCL is issued to the loner to decide if it should
change the lock to SYNC state. So if the loner wants CEPH_CAP_GEXCL,
but doesn't have CEPH_CAP_GEXCL, Locker::simple_eval() will keep
switching the lock state.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 13:55:43 +08:00
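The oscillation described in e10267b531 can be reproduced with a toy
model. Everything here is hypothetical (LockState, LonerCaps and both
functions are invented for illustration); the commit message does not
say how Locker::simple_eval() was changed, so eval_consistent() is
only one possible repair, shown as an assumption.

```cpp
#include <cassert>

enum class LockState { SYNC, EXCL };

// Hypothetical view of the loner client's caps.
struct LonerCaps {
    bool wants_excl;   // the loner wants CEPH_CAP_GEXCL
    bool issued_excl;  // CEPH_CAP_GEXCL is currently issued to the loner
};

// Behaviour as described in the message: "wanted" decides the move to
// EXCL, but "issued" decides the move back to SYNC.
LockState eval_mismatched(LockState cur, const LonerCaps& caps) {
    if (cur != LockState::EXCL && caps.wants_excl)
        return LockState::EXCL;
    if (cur == LockState::EXCL && !caps.issued_excl)
        return LockState::SYNC;
    return cur;
}

// One possible repair (an assumption, not the actual patch): use the
// same predicate in both directions so the result is a fixed point.
LockState eval_consistent(LockState /*cur*/, const LonerCaps& caps) {
    return caps.wants_excl ? LockState::EXCL : LockState::SYNC;
}
```

With wants_excl set and issued_excl clear, repeated calls to
eval_mismatched() alternate between EXCL and SYNC, while
eval_consistent() settles on EXCL after one call.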
Yan, Zheng
7e23321b72 mds: don't renew revoking lease
The MDS may receive a lease renew request while the lease is being
revoked; just ignore the renew request.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 13:54:51 +08:00
Gary Lowell
f1196c7e93 Merge branch 'master' of https://github.com/ceph/ceph
2012-12-31 21:35:03 -08:00
Gary Lowell
5dd6b19918 Merge branch 'next'
2012-12-31 21:31:17 -08:00
Sage Weil
8f77ec7d81 Merge branch 'next'
2012-12-31 18:37:12 -08:00
Sage Weil
94a5dd6b76 Merge remote-tracking branch 'gh/wip-3675'
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-12-31 18:36:39 -08:00
Gary Lowell
1a32f0a0b4 v0.56
2012-12-31 17:10:11 -08:00
Sage Weil
49ebe1ee3a client: fix _create created ino condition
We get 8 bytes back for the created ino.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-31 15:28:25 -08:00
Sage Weil
a10054bc52 libcephfs: choose more unique nonce
We were using a per-process counter combined with the pid.  A short
running process can easily loop through and reuse the same pid later.
Instead, go for 48 bits of randomness and the pid.  This way if we get
a dup pid we'll only get a dup nonce once out of 2^48 tries.

Avoids #3630 when running a libcephfs test in a loop (so that the pid
is eventually reused).  This is a better fix than the broken
8b59908370.  The real solution on the MDS
side involves cleaning up the msgr/MDS interaction with session
shutdown.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-31 15:26:54 -08:00
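A minimal sketch of the nonce scheme from a10054bc52: 48 random bits
combined with the pid. The bit layout (random bits in the high 48 bits,
low 16 bits of the pid below them) and the names make_nonce() and
random_u64() are assumptions made for illustration; the commit message
does not say how the two values are packed.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Pack 48 bits of randomness with the low 16 bits of the pid.
// The layout is an assumption, not taken from libcephfs.
uint64_t make_nonce(uint64_t random_bits, uint32_t pid) {
    const uint64_t mask48 = (uint64_t{1} << 48) - 1;
    return ((random_bits & mask48) << 16) | (pid & 0xffffu);
}

// Draw 64 random bits from the platform's entropy source.
uint64_t random_u64() {
    std::random_device rd;
    const uint64_t hi = rd();
    const uint64_t lo = rd();
    return (hi << 32) | lo;
}
```

A caller would pass random_u64() and its own process id. Under this
layout, two processes that happen to share a pid collide only if their
48 random bits also match.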
Sage Weil
e2fef38dfd client: fix _create
make_request() clears out req->reply and frees req; we can't inspect
it here.

Instead, just assume that extra_bl is the create flag/ino if it is
present.  Old code does not include an extra_bl on CREATE, and new code
will have the same first bytes for compatibility.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-31 15:26:53 -08:00
Sage Weil
b4d3bd06d4 Merge remote-tracking branch 'gh/wip-3625'
2012-12-31 10:16:31 -08:00
Sage Weil
ec5288a312 Merge remote-tracking branch 'gh/wip-rbd-unprotect' into next
Reviewed-by: Sage Weil <sage@inktank.com>
2012-12-30 15:29:37 -08:00
Joao Eduardo Luis
82cec48e9f doc: add-or-rm-mons.rst: Add 'Changing Monitor's IPs' section
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2012-12-30 19:18:09 +00:00
Joao Eduardo Luis
379f07923c doc: add-or-rm-mons.rst: Clarify what the monitor name/id is.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2012-12-30 19:17:03 +00:00
Josh Durgin
8bbb4a364d doc: fix rbd permissions for unprotect
Unprotect examines all pools, so use blanket x before 0.54. After
that, use class-read restricted by object_prefix to rbd_children.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2012-12-30 00:06:11 -08:00
Josh Durgin
d0a14d110d librbd: fix race between unprotect and clone
Clone needs to actually re-read the header to make sure the image is
still protected before returning. Additionally, it needs to consider
the image protected *only* if the protection status is protected -
unprotecting does not count. I thought I'd already fixed this, but
can't find the commit.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2012-12-30 00:06:11 -08:00
Josh Durgin
958addc0c9 rbd: open (source) image as read-only
This allows users without write access to copy, export and list
information about an image.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2012-12-30 00:06:11 -08:00
Josh Durgin
47bf519584 librbd: open parent as read-only during clone
We never write to the parent, and don't need to watch it during this process.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2012-12-30 00:06:11 -08:00
Josh Durgin
c67c789de6 librbd: add {rbd_}open_read_only()
Since 58890cfad5, regular {rbd_}open()
would fail with -EPERM if the user did not have write access to the
pool, since a watch on the header was requested.

For many uses of read-only access, establishing a watch is not
necessary, since changes to the header do not matter. For example,
getting metadata about an image via 'rbd info' does not care if a new
snapshot is created while it is in progress.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2012-12-30 00:06:11 -08:00
Josh Durgin
91e941aef9 OSD: remove RD flag from CALL ops
20496b8d2b forgot to do this. Without
this change, all class methods required regular read permission in
addition to class-read or class-write.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2012-12-30 00:06:11 -08:00
Josh Durgin
85e9d4f000 cls_rbd: get_children does not need write permission
This prevented a read-only user from being able to unprotect a
snapshot without write permission on all pools. This was masked before
by the CLS_METHOD_PUBLIC flag.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2012-12-30 00:06:11 -08:00
Sage Weil
6711a4c403 Revert "mds: replace closed sessions on connect"
This reverts commit 8b59908370.

This fix is not correct.  See #3696.
2012-12-29 08:38:52 -08:00
Sage Weil
82f8bcddb5 msg/Pipe: use state_closed atomic_t for _lookup_pipe
We shouldn't look at Pipe::state in SimpleMessenger::_lookup_pipe() without
holding pipe_lock.  Instead, use an atomic that we set to non-zero only
when transitioning to the terminal STATE_CLOSED state.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-28 17:21:01 -08:00
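The idea in 82f8bcddb5 can be sketched with std::atomic standing in for
Ceph's atomic_t. The Pipe and Messenger structs below are simplified,
hypothetical stand-ins that borrow the names from the message; they are
not the SimpleMessenger code.

```cpp
#include <atomic>
#include <cassert>
#include <map>
#include <memory>
#include <mutex>

// Simplified stand-in for a messenger pipe.
struct Pipe {
    std::mutex pipe_lock;                   // guards `state`
    int state = 0;                          // only touched under pipe_lock
    std::atomic<bool> state_closed{false};  // set once, on the terminal state

    void close() {
        std::lock_guard<std::mutex> guard(pipe_lock);
        state = -1;                         // hypothetical terminal value
        state_closed.store(true);
    }
};

// Simplified stand-in for the messenger's pipe registry.
struct Messenger {
    std::mutex lock;                        // guards rank_pipe
    std::map<int, std::shared_ptr<Pipe>> rank_pipe;

    // Look up a pipe without taking pipe_lock: consult only the atomic
    // flag and treat closed entries as if they were absent.
    std::shared_ptr<Pipe> lookup_pipe(int peer) {
        std::lock_guard<std::mutex> guard(lock);
        auto it = rank_pipe.find(peer);
        if (it == rank_pipe.end())
            return nullptr;
        if (it->second->state_closed.load())
            return nullptr;
        return it->second;
    }
};
```

Because the flag only ever changes from false to true, a reader needs
no lock to interpret it: a true value means the pipe has reached its
terminal state and the entry can be ignored.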
Sage Weil
a5d692a7b9 msgr: inject delays at inconvenient times
Exercise some rare races by injecting delays before taking locks
via the 'ms inject internal delays' option.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-28 17:21:01 -08:00
Sage Weil
e99b4a307b msgr: fix race on Pipe removal from hash
When a pipe is faulting and shutting down, we have to drop pipe_lock to
take msgr lock and then remove the entry.  The Pipe in this case will
have STATE_CLOSED.  Handle this case in all places we do a lookup on
the rank_pipe hash so that we effectively ignore entries that are
CLOSED.

This fixes a race introduced by the previous commit where we won't use
the CLOSED pipe and try to register a new one, but the old one is still
registered.

See bug #3675.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-28 17:21:00 -08:00
Sage Weil
6339c5d439 msgr: don't queue message on closed pipe
If we have a con that refs a pipe but it is closed, don't use it.  If
the ref is still there, it is only because we are racing with fault()
and it is about to (or just was) be detached.  Either way,

Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-28 17:21:00 -08:00
Sage Weil
7bf0b0854d msgr: atomically queue first message with connect_rank
Atomically queue the first message on the new pipe, without dropping
and retaking pipe_lock.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-28 17:21:00 -08:00
Sage Weil
83c8025d12 Merge remote-tracking branch 'gh/next'
2012-12-28 17:19:46 -08:00
Joao Eduardo Luis
c2a75253e5 test: mon: workloadgen: debug when message fsid != monmap fsid
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2012-12-28 17:19:38 -08:00
Joao Eduardo Luis
b30ab51792 test: mon: workloadgen: assert if monmap's fsid is zero after authenticate
Fixes: #3629

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2012-12-28 17:19:35 -08:00
Noah Watkins
3583684776 doc: update Hadoop documentation
Updates configuration option names, and adds object.size,
localize.reads, and root.dir control options.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
2012-12-28 17:19:31 -08:00
Sage Weil
942c71454b init-ceph: ok, 8K files
16K might be a bit many.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-28 17:12:06 -08:00