Make sure the Inode does not go away while a readahead is in progress. In
particular:
- read_async
- start a readahead
- get actual read from cache, return
- close/release
- call ObjectCacher::release_set() and get unclean > 0, assert
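Roughly the pattern the fix relies on, as a minimal sketch with toy types
(not the real Client/ObjectCacher code): take an extra reference on the
inode when the readahead is issued, and drop it only from the readahead's
completion, so the inode outlives any read that is still in flight.
    #include <atomic>
    #include <iostream>

    // Toy stand-ins for the client-side inode and its readahead
    // completion; the names are illustrative, not the libcephfs types.
    struct Inode {
      std::atomic<int> ref{1};
      void get() { ref.fetch_add(1); }
      void put() {
        if (ref.fetch_sub(1) == 1)
          delete this;
      }
    };

    // Completion that pins the inode for as long as the readahead is
    // outstanding.
    struct C_Readahead {
      Inode *in;
      explicit C_Readahead(Inode *i) : in(i) { in->get(); }   // pin
      void finish(int bytes) {
        std::cout << "readahead finished: " << bytes << " bytes\n";
        in->put();                                            // unpin
      }
    };

    int main() {
      Inode *in = new Inode;
      C_Readahead *ra = new C_Readahead(in); // readahead issued, inode pinned
      in->put();         // caller closes/releases; the inode stays alive
      ra->finish(4096);  // readahead completes later; last ref dropped here
      delete ra;
      return 0;
    }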
Fixes: #7867
Backport: emperor, dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
If we do not assemble a target bl, we still want to return a valid return
code with the number of bytes read ahead so that the C_RetryRead completion
will see this as a finish and call the caller's provided Context.
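Schematically (made-up names, not the actual C_RetryRead code), the
contract is that only a non-negative return counts as a finish, so the
readahead path reports the byte count even when no destination buffer was
filled:
    #include <functional>
    #include <iostream>

    // Sketch of a retry-style completion: a negative rc means "retry",
    // a non-negative rc is the final byte count and the caller's
    // context is invoked.  Names are illustrative only.
    struct RetryRead {
      std::function<void(int)> on_finish;   // caller-provided Context
      void complete(int rc) {
        if (rc < 0) {
          std::cout << "would re-issue the read\n";
          return;
        }
        on_finish(rc);                // finish: hand back the byte count
      }
    };

    int main() {
      RetryRead r{[](int bytes) { std::cout << "read " << bytes << " bytes\n"; }};
      // Even without an assembled target buffer, return the number of
      // bytes read ahead so this is treated as a finish, not a retry.
      r.complete(8192);
      return 0;
    }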
Signed-off-by: Sage Weil <sage@inktank.com>
RGWRados::initialize() is not called when doing
RGWRados::get_raw_storage_provider(). This was the culprit for the issue.
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Fixes: #7903
Since we didn't prefetch the data, we can't rely on it actually being
there. In that case just move on and read the object.
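The fallback in sketch form (toy types, not the RGW code paths): if the
prefetch buffer doesn't have the data, don't treat that as an error; issue
the normal object read instead.
    #include <iostream>
    #include <map>
    #include <string>

    // Toy store with a prefetch cache in front of a backing read;
    // purely illustrative, not the RGWRados interfaces.
    struct Store {
      std::map<std::string, std::string> prefetched;  // may be empty
      std::map<std::string, std::string> backing;

      std::string read(const std::string &oid) {
        auto it = prefetched.find(oid);
        if (it != prefetched.end())
          return it->second;        // fast path: data was prefetched
        // No prefetch happened, so don't assume the data is here;
        // just move on and read the object from the backing store.
        return backing.at(oid);
      }
    };

    int main() {
      Store s;
      s.backing["obj1"] = "payload";
      std::cout << s.read("obj1") << "\n";  // falls back to the real read
      return 0;
    }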
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
If we decide to revert to up, we need to
1- return false, so that we go into the NeedActingChange state, and
2- actually ask for that change.
It's too fugly to try to jump down to the existing queue_want_pg_temp
call 100+ lines down in this function, so just do it here. We already
know that we are requesting to clear the pg_temp.
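The two steps, sketched with placeholder names (the return-value
convention and queue_want_pg_temp here mirror the description above, not
the real PG code):
    #include <iostream>
    #include <vector>

    struct PG {
      std::vector<int> up;      // the "up" set we want to revert to
      std::vector<int> acting;  // current acting set (via pg_temp)

      void queue_want_pg_temp(const std::vector<int> &want) {
        if (want.empty())
          std::cout << "requesting that pg_temp be cleared\n";
      }

      // Returns false when an acting-set change is still needed.
      bool choose_acting() {
        if (acting != up) {
          // Reverting to up: ask for the pg_temp mapping to be cleared
          // right here, instead of falling through to the request made
          // much further down in the real function.
          queue_want_pg_temp({});
          return false;          // go into the NeedActingChange state
        }
        return true;
      }
    };

    int main() {
      PG pg{{0, 1, 2}, {3, 1, 2}};
      std::cout << (pg.choose_acting() ? "acting ok" : "acting change needed")
                << "\n";
      return 0;
    }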
Fixes: #7902
Backport: emperor, dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
When splitting a dirfrag, the delta dirstat is always added to the first
new dirfrag. Before the delta dirstat is propagated to the inode, unlinking
files from the rest of the dirfrags can cause a negative inode dirstat.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Commit bc3325b37 fixes a stack overflow bug that happens when replaying
client requests. A similar stack overflow can happen when processing
finished contexts.
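One generic way to avoid that kind of recursion, sketched here with
placeholder types (not necessarily the exact MDS change): drain finished
contexts from an explicit queue in a loop instead of letting each
completion invoke the next one on the stack.
    #include <deque>
    #include <functional>
    #include <iostream>

    // Generic sketch of avoiding deep recursion when completing
    // contexts: push follow-up work onto a queue and drain it in a
    // loop instead of invoking each finished context from inside the
    // previous one.  This mirrors the idea, not the exact MDS change.
    struct Finisher {
      std::deque<std::function<void()>> queue;

      void queue_context(std::function<void()> fn) {
        queue.push_back(std::move(fn));
      }

      void drain() {
        while (!queue.empty()) {        // iterative: bounded stack depth
          auto fn = std::move(queue.front());
          queue.pop_front();
          fn();                         // may queue more contexts
        }
      }
    };

    int main() {
      Finisher f;
      int remaining = 100000;  // deep enough to overflow if done recursively
      std::function<void()> step;
      step = [&]() {
        if (--remaining > 0)
          f.queue_context(step);        // requeue instead of recursing
      };
      f.queue_context(step);
      f.drain();
      std::cout << "done, remaining=" << remaining << "\n";
      return 0;
    }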
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
The auth MDS has received dirty scatterlock state, but it hasn't
journaled the dirty state yet. The log segment that marked the
scatterlock dirty needs to be preserved. Therefore, we can't clear
the scatterlock's dirty flag.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Fragmenting a non-auth dirfrag results in several smaller dirfrags. Some
of the resulting dirfrags can be empty, and these are not connected to
the auth subtree.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
If dirfrags are subtree roots, mark the dirfragtreelock as scattered
dirty, otherwise journal the dirfragtree change.
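As a rough sketch of that branching (placeholder names, not the real
MDCache/ScatterLock interfaces):
    #include <iostream>

    struct ScatterLock { bool scatter_dirty = false; };

    void record_fragtree_change(bool dirfrags_are_subtree_roots,
                                ScatterLock &dirfragtreelock) {
      if (dirfrags_are_subtree_roots) {
        // Subtree roots: the change travels via the scatterlock,
        // so just mark it dirty here.
        dirfragtreelock.scatter_dirty = true;
      } else {
        // Otherwise journal the dirfragtree change directly.
        std::cout << "journaling dirfragtree change\n";
      }
    }

    int main() {
      ScatterLock l;
      record_fragtree_change(true, l);
      record_fragtree_change(false, l);
      std::cout << "scatter_dirty=" << l.scatter_dirty << "\n";
      return 0;
    }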
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
MDCache::handle_cache_expire() ignores mismatched dirfrags. This is
OK during normal operation because the MDS doesn't trim a replica inode
whose dirfrags are likely being fragmented (see commit 22535340).
During recovery, the recovering MDS can receive a survivor MDS's cache
expire message before it sends cache rejoin acks. In this case,
there can still be mismatched dirfrags, and nothing prevents the
survivor MDS from trimming the inode of these mismatched dirfrags. So
there can be unconnected dirfrags when the recovering MDS sends cache
rejoin acks.
The fix is, when a mismatched dirfrag is encountered during recovery,
check if the inode of the dirfrag is still replicated to the sender MDS.
If the inode is not replicated, remove the sender MDS from the replica
maps of all child dirfrags.
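In sketch form (toy containers rather than the real CInode/CDir
structures), the fix amounts to:
    #include <iostream>
    #include <set>
    #include <vector>

    // Toy model of an inode with per-dirfrag replica maps; illustrative
    // only, not the real CInode/CDir classes.
    struct Dirfrag {
      std::set<int> replicas;  // MDS ranks holding a replica of this dirfrag
    };

    struct Inode {
      std::set<int> replicas;  // MDS ranks holding a replica of the inode
      std::vector<Dirfrag> dirfrags;
    };

    // On a mismatched dirfrag during recovery: if the inode is no longer
    // replicated to the sender, drop the sender from every child dirfrag's
    // replica map so no unconnected replicas survive into the rejoin ack.
    void handle_mismatched_dirfrag(Inode &in, int sender) {
      if (in.replicas.count(sender))
        return;                       // inode still replicated; nothing to do
      for (auto &df : in.dirfrags)
        df.replicas.erase(sender);
    }

    int main() {
      Inode in;
      in.replicas = {1};
      in.dirfrags = {{{1, 2}}, {{2}}};
      handle_mismatched_dirfrag(in, 2);  // rank 2 no longer has the inode
      std::cout << "frag0 replicas: " << in.dirfrags[0].replicas.size()
                << ", frag1 replicas: " << in.dirfrags[1].replicas.size()
                << "\n";
      return 0;
    }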
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
For slave rename and rmdir events, the MDS needs to preserve the non-auth
dirfrag where the renamed inode originally lived until the slave commit
event is encountered. The current method to handle this is to use
MDCache::uncommitted_slave_rename_olddir to track any non-auth dirfrag
that needs to be preserved. This method does not work well if any
preserved dirfrag gets fragmented by a log event (such as ESubtreeMap)
between the slave prepare event and the slave commit event.
The fix is to track the inode of the dirfrag instead of tracking the
dirfrag that needs to be preserved directly.
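A schematic of the bookkeeping change (hypothetical identifiers, not the
actual MDCache members): key the "preserve until slave commit" state by
inode rather than by dirfrag, so refragmentation in between does not
invalidate the entry.
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <string>

    // Illustrative bookkeeping only; the identifiers do not match the
    // real MDCache members.
    using inode_t = uint64_t;        // stable identity of the old directory
    using dirfrag_t = std::string;   // "inode.frag"; changes when refragmented

    struct SlaveRenameTracking {
      // Old approach: remember the exact dirfrag.  If an ESubtreeMap (or
      // similar) refragments the directory between prepare and commit,
      // this key no longer names anything in the cache.
      std::map<dirfrag_t, int> uncommitted_by_dirfrag;

      // Fixed approach: remember the inode of the old directory, which is
      // unaffected by how its dirfrags are split or merged.
      std::map<inode_t, int> uncommitted_by_inode;
    };

    int main() {
      SlaveRenameTracking t;
      t.uncommitted_by_inode[0x10000000123ull] = 1;  // prepare: preserve dir
      // ... dirfrags of that directory may be split here ...
      t.uncommitted_by_inode.erase(0x10000000123ull); // commit: release it
      std::cout << "tracked dirs: " << t.uncommitted_by_inode.size() << "\n";
      return 0;
    }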
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Propagate the dirty dirstat to a freezing auth inode if the inode is
already auth pinned by the Mutation. Otherwise the dirstat can be
propagated to the inode after the client changes the inode's mtime.
Fixes: #7880
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
The OSDs need to support this feature before we allow users to turn it
on. This is similar to what the erasure pool support does.
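The gating pattern in rough form (generic sketch with a placeholder
feature bit, not the actual monitor code): refuse to set the flag until
every OSD reports the required feature, much like the erasure-pool checks.
    #include <cstdint>
    #include <iostream>
    #include <vector>

    // The feature-bit value and the function names are placeholders,
    // not Ceph's real constants.
    constexpr uint64_t FEATURE_NEW_TIERING = 1ull << 7;

    bool all_osds_support(const std::vector<uint64_t> &osd_features,
                          uint64_t required) {
      for (uint64_t f : osd_features)
        if ((f & required) != required)
          return false;
      return true;
    }

    int main() {
      std::vector<uint64_t> osds = {0xff, 0xff, 0x7f};  // last one lacks the bit
      if (!all_osds_support(osds, FEATURE_NEW_TIERING)) {
        std::cout << "EPERM: not all OSDs support this feature; refusing to "
                     "enable it\n";
        return 1;
      }
      std::cout << "feature enabled\n";
      return 0;
    }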
Signed-off-by: Sage Weil <sage@inktank.com>
A few cases:
- As we are working through the list, if we see a clone that is lower than
the next one we were expecting, we should be able to skip the expected
one(s) that are missing.
- If we see a head, we can skip all of the rest of the clones.
- If we get to the end and next_clone was set, we can ignore it.
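Sketched as a loop over scrub-map entries (toy data; the real code walks
hobject_t entries and SnapSets), the cases look roughly like this:
    #include <cstdint>
    #include <deque>
    #include <iostream>
    #include <vector>

    // Loose sketch of the case analysis above; illustrative only.
    struct Entry {
      bool is_head;
      uint64_t snap;                  // clone's snap id (ignored for heads)
      std::deque<uint64_t> expected;  // for heads: clone snaps, descending
    };

    void check_clones(const std::vector<Entry> &scrubmap) {
      std::deque<uint64_t> next_clones;  // clones we still expect to see
      for (const auto &e : scrubmap) {
        if (e.is_head) {
          // A head: whatever clones we were still expecting are missing;
          // skip all of the rest of them and start over from this head.
          for (uint64_t c : next_clones)
            std::cout << "clone " << c << " missing, skipping\n";
          next_clones = e.expected;
          continue;
        }
        // A clone lower than the one we expected means the expected ones
        // are missing; skip over them rather than asserting.
        while (!next_clones.empty() && e.snap < next_clones.front()) {
          std::cout << "clone " << next_clones.front() << " missing, skipping\n";
          next_clones.pop_front();
        }
        if (!next_clones.empty())
          next_clones.pop_front();     // matched the expected clone
      }
      // Reached the end with clones still expected: just ignore them.
      for (uint64_t c : next_clones)
        std::cout << "clone " << c << " missing at end, ignoring\n";
    }

    int main() {
      // head expecting clones 4 and 2, but only clone 2 is present,
      // followed by another head expecting clone 7 that never appears.
      check_clones({{true, 0, {4, 2}}, {false, 2, {}}, {true, 0, {7}}});
      return 0;
    }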
Signed-off-by: Sage Weil <sage@inktank.com>
- notice when we are missing a clone (that isn't at the end of the list)
- notice when we are missing a clone on the last object in the scrub map
- do not assert when we are missing a clone
There is still more we could do to improve this (like noticing one missing
clone but still checking the others), but we'll leave that aside for just
a moment...
Signed-off-by: Sage Weil <sage@inktank.com>
Trigger a scrub to verify that we can handle a cache tier that is missing
some clones. We rely on the test harness to notice the error and do not
explicitly confirm that the scrub happened; in practice there is plenty
of time for it to run, however.
Signed-off-by: Sage Weil <sage@inktank.com>
We don't need to worry about the pidfile because that is handled by the
fork functions, which ceph-conf doesn't call.
Signed-off-by: Sage Weil <sage@inktank.com>
If you are querying the conf for an osd and it has a log configured, we
should not generate any log activity.
This isn't super pretty, but it is much less intrusive than wiring a 'do
not log' flag down into CephContext and a zillion other places.
Fixes: #7849
Signed-off-by: Sage Weil <sage@inktank.com>