These are OpRequest properties, calculated/enforced at the OSD. They don't
belong in the MOSDOp or MOSDOpReply messages.
Signed-off-by: Sage Weil <sage@inktank.com>
It was very sloppy to put server-side processing state inside the
message. Move it to the OpRequestRef instead.
Note that the client was filling in bogus data that was then lost during
encoding/decoding; clean that up.
Signed-off-by: Sage Weil <sage@inktank.com>
This file mostly duplicated the existing release documentation. Differences
have been merged into the primary file.
Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
Having this too large means that queues get too deep on the OSDs during
backfill and latency is very high. In my tests, it also meant we generated
a lot of slow recovery messages just from the recovery ops themselves (no
client io).
Keeping this at the old default means we are no worse in this respect than
argonaut, which is a safe position to start from.
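For reference, a ceph.conf sketch of the kind of cap this describes; the
option name is the usual recovery throttle this commit appears to adjust,
and the value is only an assumed illustration of the old default:

    [osd]
    ; keep the number of in-flight recovery ops per OSD small so backfill
    ; does not drive OSD queue depth and latency up
    osd recovery max active = 5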
Signed-off-by: Sage Weil <sage@inktank.com>
Keep the journal queue size smaller than the filestore queue size.
Keeping this small also means that we can lower the latency for new
high priority ops that come into the op queue.
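A ceph.conf sketch of the relationship described above; the option names
are the usual queue knobs and the values are placeholders, not the
defaults set by this commit:

    [osd]
    ; keep the journal queue shallower than the filestore queue so new
    ; high priority ops spend less time waiting behind already-queued work
    journal queue max ops   = 300
    filestore queue max ops = 500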
Signed-off-by: Sage Weil <sage@inktank.com>
The client may have a newer map than we do; make sure we wait for it lest
we inadvertently reply because we think the pool doesn't exist.
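A minimal C++ sketch of the guard this implies, using made-up names and
stub hooks rather than the actual OSD code:

    #include <cstdint>
    #include <deque>

    struct Op { uint32_t map_epoch = 0; int64_t pool_id = -1; };

    struct OsdSketch {
      uint32_t my_epoch = 0;
      std::deque<Op> waiting_for_map;          // ops parked until we catch up

      bool pool_exists(int64_t pool) const { return pool >= 0; }  // stand-in
      void reply_enoent(const Op&) {}                             // stand-in
      void enqueue(const Op&) {}                                  // stand-in
      void request_map(uint32_t) {}                               // stand-in

      void handle_op(const Op& op) {
        if (op.map_epoch > my_epoch) {
          // The client has a newer map; the pool may exist in it.  Park the
          // op and fetch the newer map instead of replying ENOENT too early.
          waiting_for_map.push_back(op);
          request_map(op.map_epoch);
          return;
        }
        if (!pool_exists(op.pool_id)) {
          reply_enoent(op);     // safe: our map is at least as new as the client's
          return;
        }
        enqueue(op);
      }
    };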
Signed-off-by: Sage Weil <sage@inktank.com>
If a connection comes in and there is a closed session attached, remove it.
This is probably a failure of an old session to get cleaned up properly,
and in certain cases it may even be from a different client (if the addr
nonce is reused). In that case this prevents further damage, although
a complete solution would also clean up the closed connection state if
there is a fault. See #3630.
This fixes a hang that is reproduced by running the libcephfs
Caps.ReadZero test in a loop; eventually the client addr is reused and
we are linked to an ancient Session with a different client id.
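A small sketch of the idea with made-up Session/Connection types (the real
fix is in the MDS connection handling; none of these names are the actual
API):

    #include <memory>

    struct Session {
      bool closed = false;
      long client_id = -1;
    };

    struct Connection {
      std::shared_ptr<Session> session;   // session attached to the link, if any
    };

    // On an incoming connection, never reuse a leftover closed session; it
    // may even belong to a different client whose addr nonce was reused.
    std::shared_ptr<Session> session_for(Connection& con, long client_id) {
      if (con.session && con.session->closed)
        con.session.reset();                        // drop the stale session
      if (!con.session) {
        con.session = std::make_shared<Session>();
        con.session->client_id = client_id;
      }
      return con.session;
    }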
Backport: bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
The fullbit sets it now. For multiversion inodes, its "first" can be in
the future, since this dentry may not have changed when the inode was
cowed in place. (OTOH, the dentry cannot have changed without the inode
also having changed.)
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Otherwise we may wrongly increase mds->sessionmap.version, which
will confuse future journal replays that involve the sessionmap.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
The MDentryLink message can race with cache expire. When it arrives at
the target MDS, it's possible there is no corresponding dentry in the
cache. If this race happens, we should expire the replica inode encoded
in the MDentryLink message. But to expire an inode, the MDS needs to
know which subtree the inode belongs to, so modify the MDentryLink
message to include this information.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Creating a new file needs to be handled by the directory fragment's auth
MDS, while opening an existing file in write mode needs to be handled by
the corresponding inode's auth MDS. If a file is a remote link, its
parent directory fragment's auth MDS can be different from the
corresponding inode's auth MDS. So which MDS should handle a create
request depends on whether the file already exists.
handle_client_openc() calls rdlock_path_xlock_dentry() at the very
beginning, which always assumes the request needs to be handled by the
directory fragment's auth MDS. When handling a create request, if the
file already exists and is remotely linked to a non-auth inode,
handle_client_openc() falls back to handle_client_open().
handle_client_open() forwards the request because this MDS is not the
inode's auth MDS. Then, when the request arrives at the inode's auth
MDS, rdlock_path_xlock_dentry() is called and forwards the request
back.
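A tiny sketch of the routing rule this describes, with made-up helper
names (not the MDS API):

    // Which MDS should take an open-for-create request for a path?
    //  - the file does not exist          -> auth MDS of the parent dirfrag
    //  - the file exists as a remote link -> auth MDS of the linked inode
    // Routing on the dirfrag alone in the second case is what produces the
    // forward ping-pong described above.
    int target_mds_for_openc(bool dentry_exists,
                             int dirfrag_auth_mds,
                             int inode_auth_mds) {
      if (!dentry_exists)
        return dirfrag_auth_mds;   // creation is a dirfrag operation
      return inode_auth_mds;       // opening an existing file follows the inode
    }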
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
If remote linkage without an inode is encountered after some caps have
been issued, Server::handle_client_readdir() should send the reply to
the client immediately instead of retrying the request after opening
the remote dentry. This is because the MDS may want to revoke these
caps before it succeeds in opening the remote dentry.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Migrator::export_dir() only checks whether it can lock the export lock
set; it does not actually take the locks. So someone else can change the
path to the exporting dir and confuse Migrator::handle_export_discover().
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
The imported caps may prevent unstable locks from entering stable
states. So we should call Locker::eval_gather() with parameter
"first" set to true after caps are imported.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
If want_xlocked is true, we cannot rely on a previously sent discover
because it's likely that the previous discover is blocked on the xlocked
dentry.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
The error handling code in MDCache::handle_discover_reply() has two main
issues. First, it does not wake waiters if the dir_auth_hint in the
reply message is equal to the MDS's own nodeid; this can happen if a
discover races with subtree importing. Second, it checks the existence
of the cached directory fragment to decide whether it should take
waiters from the inode or from the directory fragment. This check is
unreliable because subtree importing can add directory fragments to the
cache.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
When a frozen inode is encountered, MDCache::handle_discover() sends the
reply immediately if the reply message is not empty. When handling
"discover ino" requests, the reply message always contains the base
directory fragment. But the requestor already has the base directory
fragment, so the only effect of the reply is to wake the requestor and
make it send the same "discover ino" request again. The requestor
therefore keeps sending "discover ino" requests but can't make any
progress. The fix is to set want_base_dir to false for
MDCache::discover_ino(). After setting want_base_dir to false, we also
need to update the code that handles "discover ino" errors.
This patch also removes unused error handling code for flag_error_dn.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
We should delete a dir fragment's bloom filter after exporting the dir
fragment to another MDS. Otherwise the residual bloom filter may cause
problems if the MDS imports the dir fragment later.
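A sketch of the cleanup with made-up types (the real change is in the
CDir/Migrator code; these names are illustrative):

    #include <memory>
    #include <vector>

    struct BloomFilter { std::vector<unsigned char> bits; };

    struct DirFragmentSketch {
      std::unique_ptr<BloomFilter> bloom;   // answers "is this dentry surely absent?"
    };

    // After the fragment is exported, drop the filter: it describes dentries
    // we no longer own, and a stale filter would give wrong answers if this
    // MDS imports the fragment again later.
    void finish_export(DirFragmentSketch& dir) {
      dir.bloom.reset();
    }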
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
If predirty_journal_parents() does not propagate changes in a dir's
fragstat into the corresponding inode's dirstat, it should mark the
inode as dirfrag dirty. This happens when we modify dir fragments
that are auth subtree roots.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
At that point, the request has already auth-pinned and locked some
objects. So CDir::fetch() should ignore the can_auth_pin check and
continue to fetch the freezing dir.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
The functional tests for the create operations should add and specify non-default
pools, but we don't have a set of library methods to do that yet (to interact with
the monitor).
Reuse the old preferred_pg field. Only use it if the new CREATEPOOLID
feature is present and the value is >= 0.
Verify that the data pool is allowed, or return EINVAL to the client.
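A sketch of the feature-gated reuse (the feature bit and field handling
here are assumptions for illustration, not the real wire format):

    #include <cstdint>

    constexpr uint64_t FEATURE_CREATEPOOLID = 1ull << 20;   // assumed bit value

    // The old preferred_pg slot carries a data pool id only when the peer
    // speaks the new feature and sent a non-negative value; otherwise fall
    // back to the default pool.  The server must still verify the chosen
    // pool is allowed and return EINVAL if it is not (not shown here).
    int64_t data_pool_from_request(uint64_t peer_features,
                                   int64_t preferred_field,
                                   int64_t default_pool) {
      if ((peer_features & FEATURE_CREATEPOOLID) && preferred_field >= 0)
        return preferred_field;
      return default_pool;
    }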
Signed-off-by: Sage Weil <sage@inktank.com>
This is a poor interface. The Hadoop stuff is shifting to specifying this
information at file creation instead.
Signed-off-by: Sage Weil <sage@inktank.com>
If we have a pending failure report and send a cancellation, take it out
of our pending list so that we don't keep resending cancellations.
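A sketch of the bookkeeping with a plain std::map and made-up types (the
real OSD structure differs):

    #include <cstdint>
    #include <iostream>
    #include <map>

    struct FailureReport { uint32_t failed_since = 0; };

    std::map<int, FailureReport> pending_failures;   // keyed by reported osd id

    void send_cancellation(int osd) {                // stand-in for the real send
      std::cout << "cancel failure report for osd." << osd << "\n";
    }

    // Once the cancellation has been sent, forget the report; otherwise every
    // later pass over pending_failures would resend the cancellation.
    void cancel_failure(int osd) {
      auto it = pending_failures.find(osd);
      if (it == pending_failures.end())
        return;
      send_cancellation(osd);
      pending_failures.erase(it);
    }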
Signed-off-by: Sage Weil <sage@inktank.com>