Way back in fc869dee1e (v0.42), when we redid
the osd type encoding, we forgot to make this conditionally encode the old
format for old clients. In particular, this means that kernel clients
will fail to decode the osdmap if there is a rados pool with a pool-level
snapshot defined.
Fixes: #3290
Signed-off-by: Sage Weil <sage@inktank.com>
Performing a hard link through the libcephfs interface causes
a double free on shutdown, due to the Client::link call decrementing
the parent (of the target) directory's inode. This fix removes the
put_inode(dir) call, to match the behavior of Client::ll_link.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
This is unused, and mostly broken in that there is no cleanup when there
is a failure. Also, the support in the OSD has been largely removed.
Signed-off-by: Sage Weil <sage@inktank.com>
Rename check_io to clip_io, which can modify the passed-in length
to clamp it to the device size. This is expected behavior for
block-device emulation.
Call clip_io in rbd_write(); it needs to return the clipped length there,
even though aio_write() is calling clip_io() as well (for the
direct path).
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
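The clamping described in the clip_io change above can be illustrated with
a minimal sketch. The name clip_io_sketch, its signature, and the choice of
-EINVAL for an out-of-range offset are assumptions made for illustration;
this is not the librbd implementation.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>

// Hypothetical stand-in for the helper named in the commit message.
// Returns 0 on success, possibly shrinking *len so that the request ends
// at the end of the image. Returns -EINVAL (an assumed error code) when a
// non-empty request starts at or past the end of the image.
int clip_io_sketch(uint64_t image_size, uint64_t off, uint64_t *len) {
  assert(len != nullptr);
  if (*len == 0)
    return 0;                    // empty requests are treated as valid
  if (off >= image_size)
    return -EINVAL;              // cannot start at or beyond the end
  if (*len > image_size - off)   // written this way to avoid overflow
    *len = image_size - off;     // clamp to the device size
  return 0;
}
```

A caller such as a synchronous write path would then report the clipped
length, not the requested one, as the number of bytes written.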
When checking if inode's SnapRealm is different from readdir
SnapRealm, we should use find_snaprealm() to get inode's SnapRealm.
Without this fix, I got lots of "ceph_add_cap: couldn't find snap
realm 100" messages from the kernel client.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Allow try_eval(MDSCacheObject*, int mask) to eval locks on replica objects
so that they don't get stuck in an unstable state. eval(CInode*, mask)
already handles the non-auth case. For the dentry case, call eval_any(),
which handles the non-auth case, instead of directly calling simple_eval(),
which does not.
Reported-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Commit f8110c (Allow export subtrees in other MDS' stray directory)
makes the "directory in stray" check always return false. This is
because the directory in question is a grandchild of mdsdir.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
The stray migration/reintegration generates a source path that will
be rooted in a (possibly remote) MDS's MDSDIR; adjust the check in
handle_client_rename().
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Under a sustained cephfs write load where the offered load is higher
than the storage cluster write throughput, a backlog of replication ops
that arrive via the cluster messenger builds up. The client message
policy throttler, which should be limiting the total write workload
accepted by the storage cluster, is unable to prevent it, for any
value of osd_client_message_size_cap, under such an overload condition.
The root cause is that op data is released too early, in op_applied().
If instead the op data is released at op deletion, then the limit
imposed by the client policy throttler applies over the entire
lifetime of the op, including commits of replication ops. That
makes the policy throttler an effective means for an OSD to
protect itself from a sustained high offered load, because it can
effectively limit the total, cluster-wide resources needed to process
in-progress write ops.
Signed-off-by: Jim Schutt <jaschut@sandia.gov>
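The lifetime argument in the throttler change above can be sketched with a
small RAII example. ByteThrottle, Op, admit() and on_applied() are
hypothetical names chosen for illustration and are not the OSD's classes;
the only point shown is that the budget is returned in the destructor
rather than when the op is applied.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical byte budget, standing in for a message-size throttle.
class ByteThrottle {
  uint64_t max_;
  uint64_t cur_;
public:
  explicit ByteThrottle(uint64_t max) : max_(max), cur_(0) {}
  // Take n bytes of budget if available; never blocks in this sketch.
  bool try_get(uint64_t n) {
    if (n > max_ - cur_)
      return false;
    cur_ += n;
    return true;
  }
  void put(uint64_t n) {
    assert(n <= cur_);
    cur_ -= n;
  }
  uint64_t in_use() const { return cur_; }
};

// Hypothetical op that holds its budget for its whole lifetime.
class Op {
  ByteThrottle &throttle_;
  uint64_t bytes_;
  Op(ByteThrottle &t, uint64_t bytes) : throttle_(t), bytes_(bytes) {}
public:
  Op(const Op &) = delete;
  Op &operator=(const Op &) = delete;
  // Returns nullptr when the throttle has no room for this op.
  static std::unique_ptr<Op> admit(ByteThrottle &t, uint64_t bytes) {
    if (!t.try_get(bytes))
      return nullptr;
    return std::unique_ptr<Op>(new Op(t, bytes));
  }
  // Deliberately does not release the budget: the op may still be
  // waiting on commits from replicas.
  void on_applied() {}
  // Budget is returned only when the op itself goes away.
  ~Op() { throttle_.put(bytes_); }
};
```

With this shape, a second op that would exceed the cap is refused until
the first op is destroyed, even if the first op has already been applied.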
For example, CephFileAlreadyExistsException may be returned if mkdirs is
called to create a directory already present.
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
When we add a bufferhead with zeros to the Object data map, use the new
zero type instead of allocating actual zeros.
Signed-off-by: Sage Weil <sage@inktank.com>
If we observe an ENOENT on a read, set the complete flag. Any dirty
buffers we have will still be in memory, even if the writes are in flight,
because the TX state remains pinned until the writes commit. Writes cannot
proceed faster than reads, even though reads may proceed faster than
writes.
Signed-off-by: Sage Weil <sage@inktank.com>
The p iterator points to the next bh, but try_merge_bh() at the end of the
loop might merge that into our result and invalidate the iterator. Fix
this by repeating the lookup on each pass through the loop.
Signed-off-by: Sage Weil <sage@inktank.com>
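The iterator hazard fixed in the try_merge_bh() change above can be
reproduced in miniature with a std::map of extents. ExtentMap,
try_merge_right() and walk_and_merge() are hypothetical names for
illustration; the sketch only shows the pattern of repeating the lookup on
each pass instead of carrying an iterator to the next element across a
merge that may erase it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>

using ExtentMap = std::map<uint64_t, uint64_t>;  // start offset -> length

// Absorb every extent that directly follows the one at `start`.
// Each absorbed neighbour is erased, which invalidates any iterator
// a caller might still hold to it.
void try_merge_right(ExtentMap &m, uint64_t start) {
  auto it = m.find(start);
  assert(it != m.end());
  auto next = std::next(it);
  while (next != m.end() && it->first + it->second == next->first) {
    it->second += next->second;
    next = m.erase(next);
  }
}

// Visit each extent at or after `from`, merging as we go.
// Returns the number of extents visited.
std::size_t walk_and_merge(ExtentMap &m, uint64_t from) {
  std::size_t visited = 0;
  uint64_t cur = from;
  for (;;) {
    // Fresh lookup on every pass: nothing saved from the previous
    // iteration can point at an erased element.
    auto p = m.lower_bound(cur);
    if (p == m.end())
      break;
    try_merge_right(m, p->first);
    // p itself was not erased, so it is still valid here.
    cur = p->first + p->second;
    ++visited;
  }
  return visited;
}
```

Starting from extents {0:10, 10:5, 15:5, 30:5}, the walk visits two
extents and leaves {0:20, 30:5}.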
Wait until we have applied the entire read result to the cache before we
trigger any read completion events. This is a cleaner and safer approach
since we can be sure that the callback won't get blocked again on data we
have but haven't applied yet. It also fixes a crash I just observed where
the completion did a read, called trim(), and invalidated/destroyed the
iterator/bh p was referencing.
Signed-off-by: Sage Weil <sage@inktank.com>
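The ordering described in the read-completion change above (apply the
whole result first, fire completions afterwards) can be sketched as
follows. MiniCache and its members are hypothetical and much simpler than
the real object cacher; the sketch assumes completions are plain
std::function callbacks keyed by offset.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical cache keyed by offset, with per-offset read waiters.
class MiniCache {
  std::map<uint64_t, std::string> data_;
  std::map<uint64_t, std::vector<std::function<void()>>> waiters_;
public:
  void wait_for(uint64_t off, std::function<void()> cb) {
    assert(cb);  // an empty callback would throw when invoked
    waiters_[off].push_back(std::move(cb));
  }
  bool has(uint64_t off) const { return data_.count(off) != 0; }

  void apply_read_result(const std::map<uint64_t, std::string> &chunks) {
    std::vector<std::function<void()>> ready;
    // Phase 1: apply every chunk and collect, but do not run, waiters.
    for (const auto &c : chunks) {
      data_[c.first] = c.second;
      auto w = waiters_.find(c.first);
      if (w != waiters_.end()) {
        for (auto &cb : w->second)
          ready.push_back(std::move(cb));
        waiters_.erase(w);
      }
    }
    // Phase 2: no iterators into the cache are live here, so callbacks
    // may read or modify the cache freely.
    for (auto &cb : ready)
      cb();
  }
};
```

A waiter registered on the first chunk then observes the later chunks as
already present when it runs.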
Pull unpinned objects off the LRU in trim(). This never happens currently
due to all the explicit calls to close_object()...
Signed-off-by: Sage Weil <sage@inktank.com>
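As a rough illustration of the trim() change above, the sketch below
evicts unpinned entries from the cold end of an LRU list until the list is
within its limit. CachedObject, ObjectLru and the integer pin count are
assumptions for illustration, not the cacher's actual types.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>

// Hypothetical cached object with a simple pin count.
struct CachedObject {
  std::string name;
  int pins;
};

class ObjectLru {
  std::list<CachedObject> lru_;  // front = most recently used
  std::size_t max_;
public:
  explicit ObjectLru(std::size_t max) : max_(max) {}

  // References stay valid until the object is evicted (std::list).
  CachedObject &insert(const std::string &name) {
    lru_.push_front(CachedObject{name, 0});
    return lru_.front();
  }
  void pin(CachedObject &o) { ++o.pins; }
  void unpin(CachedObject &o) {
    assert(o.pins > 0);
    --o.pins;
  }
  std::size_t size() const { return lru_.size(); }

  // Walk from the least recently used end and drop unpinned objects
  // until we are within the limit. Pinned objects are skipped.
  // Returns the number of objects evicted.
  std::size_t trim() {
    std::size_t evicted = 0;
    auto it = lru_.end();
    while (lru_.size() > max_ && it != lru_.begin()) {
      --it;
      if (it->pins == 0) {
        it = lru_.erase(it);
        ++evicted;
      }
    }
    return evicted;
  }
};
```

If the oldest object is pinned, trim() leaves it in place and evicts the
next unpinned candidate instead.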
We assert that if can_close(), the Object isn't pinned in the LRU. This
assumes we did our get/put refcounting properly, such that the pins are
at least as restrictive as can_close().
Signed-off-by: Sage Weil <sage@inktank.com>