CDir::assimilate_dirty_rstat_inodes() may encounter frozen inodes that
are being renamed. Skip these frozen inodes because assimilating inode's
rstat require auth pinning the inode.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
When handling cross authority rename, the master first sends OP_RENAMEPREP
slave requests to witness MDS, then sends OP_RENAMEPREP slave request to
the rename inode's auth MDS after getting all witness MDS' acknowledgments.
Before receiving the OP_RENAMEPREP slave request, the rename inode's auth
MDS may change lock state of the rename inode and send lock messages to
witness MDS. But the witness MDS may already received the OP_RENAMEPREP
slave request and changed the source inode's authority. So the witness MDS
send lock acknowledgment message to wrong MDS and trigger assertion.
The fix is, firstly the master marks rename inode as ambiguous and send a
message to ask the rename inode's auth MDS to mark the inode as ambiguous,
then send OP_RENAMEPREP slave requests to the witness MDS, finally send
OP_RENAMEPREP slave request to the rename inode's auth MDS after getting
all witness MDS' acknowledgments.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
When replaying an directory rename operation, MDS need to find old parent of
the renamed directory to adjust auth subtree. Current code searchs the cache
to find the old parent, it does not work if the renamed directory inode is not
in the cache. EMetaBlob for directory rename contains at most one null dentry,
so MDS can use null dentry to find old parent of the renamed directory. If
there is no null dentry in the EMetaBlob, the MDS was witness of the rename
operation and there is not auth subtree underneath the renamed directory.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Server::_rename_prepare() adds null dest dentry to the EMetaBlob if
the rename operation overwrites a remote linkage. This is incorrect
because null dentry are processed after primary and remote dentries
during journal replay. The erroneous null dentry makes the dentry of
rename destination disappear.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Discover reply that adds replica dentry and inode can race with rename
if slave request for rename sends discover and waits, but waked up by
reply for different discover.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Locker::simple_eval() checks if the loner wants CEPH_CAP_GEXCL to
decide if it should change the lock to EXCL state, but it checks
if CEPH_CAP_GEXCL is issued to the loner to decide if it should
change the lock to SYNC state. So if the loner wants CEPH_CAP_GEXCL,
but doesn't have CEPH_CAP_GEXCL, Locker::simple_eval() will keep
switching the lock state.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
We were using a per-process counter combined with the pid. A short
running process can easily loop through and reuse the same pid later.
Instead, go for 48 bits of randomness and the pid. This way if we get
a dup pid we'll only get a dup nonce once out of 2^48 tries.
Avoids #3630 when running a libcephfs test in a loop (so that the pid
is eventually reused). This is a better fix than the broken
8b59908370. The real solution on the MDS
side involves cleaning up the msgr/MDS interaction with session
shutdown.
Signed-off-by: Sage Weil <sage@inktank.com>
make_request() clear out req->reply and frees req; we can't inspect
it here.
Instead, just assume that extra_bl is the create flag/ino if it is
present. Old code does not include an extra_bl on CREATE, and new code
will have the same first bytes for compatibility.
Signed-off-by: Sage Weil <sage@inktank.com>
Unprotect examines all pools, so use blanket x before 0.54. After
that, use class-read restricted by object_prefix to rbd_children.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Clone needs to actually re-read the header to make sure the image is
still protected before returning. Additionally, it needs to consider
the image protected *only* if the protection status is protected -
unprotecting does not count. I thought I'd already fixed this, but
can't find the commit.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Since 58890cfad5, regular {rbd_}open()
would fail with -EPERM if the user did not have write access to the
pool, since a watch on the header was requested.
For many uses of read-only access, establishing a watch is not
necessary, since changes to the header do not matter. For example,
getting metadata about an image via 'rbd info' does not care if a new
snapshot is created while it is in progress.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
20496b8d2b forgot to do this. Without
this change, all class methods required regular read permission in
addition to class-read or class-write.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
This prevented a read-only user from being able to unprotect a
snapshot without write permission on all pools. This was masked before
by the CLS_METHOD_PUBLIC flag.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
We shouldn't look at Pipe::state in SimpleMessenger::_lookup_pipe() without
holding pipe_lock. Instead, use an atomic that we set to non-zero only
when transitioning to the terminal STATE_CLOSED state.
Signed-off-by: Sage Weil <sage@inktank.com>
Exercise some rare races by injecting delays before taking locks
via the 'ms inject internal delays' option.
Signed-off-by: Sage Weil <sage@inktank.com>
When a pipe is faulting and shutting down, we have to drop pipe_lock to
take msgr lock and then remove the entry. The Pipe in this case will
have STATE_CLOSED. Handle this case in all places we do a lookup on
the rank_pipe hash so that we effectively ignore entries that are
CLOSED.
This fixes a race introduced by the previous commit where we won't use
the CLOSED pipe and try to register a new one, but the old one is still
registered.
See bug #3675.
Signed-off-by: Sage Weil <sage@inktank.com>
If we have a con that refs a pipe but it is closed, don't use it. If
the ref is still there, it is only because we are racing with fault()
and it is about to (or just was) be detached. Either way,
Signed-off-by: Sage Weil <sage@inktank.com>
Remove the special-case check, which does not inform the peer what
protocol features are missing. It also enforces this requirement even
when we negotiate auth none.
Reported as part of bug #3657.
Signed-off-by: Sage Weil <sage@inktank.com>
We may take the O_CREAT path and get an fh from _create, but created can
still be false. In that case, skip the final _open call.
Signed-off-by: Sage Weil <sage@inktank.com>
If multiple clients race to create a file, multiple clients will send a
create request and get back a valid dentry+inode, but only one client
will actually win the race to create the file. All other clients should
treat the reply as an open of an existing file and check permissions.
This patch adds the created inode number to the mds create reply if that
request actually created the inode/file (and the feature is supported),
so the client can properly check permissions if the inode number isn't
returned. Fixes#3625.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
This is a fix for bug #3625, where multiple clients race to create a
file, and the loser returns EEXIST instead of a valid file handle.
The patch modifies ll_create in the Client class to use _create(),
which sends the request to the MDS (where an atomic create/open is
performed).
Signed-off-by: Sam Lang <sam.lang@inktank.com>
We were using a single cond, and only signalling one waiter. That means
that if the flusher and several logging threads are waiting, and we hit
a limit, we the logger could signal another logger instead of the flusher,
and we could deadlock.
Similarly, if the flusher empties the queue, it might signal only a single
logger, and that logger could re-signal the flusher, and the other logger
could wait forever.
Intead, break the single cond into two: one for loggers, and one for the
flusher. Always signal the (one) flusher, and always broadcast to all
loggers.
Backport: bobtail, argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
In a mixed cluster where some OSDs support the recovery reservations and
some don't, the replica may be new code in RepNotRecoverying and will
complete a backfill. In that case, we want to just stayin
RepNotRecovering.
It may also be possible to make it infer what the primary is doing even
thought it is not sending recovery reservation messages, but this is much
more complicated and doesn't accomplish much.
Fixes: #3689
Signed-off-by: Sage Weil <sage@inktank.com>