Commit Graph

23215 Commits

Author SHA1 Message Date
Sage Weil
635673928a osd: fix recovery assert for pg repair case
In the case of PG repair, this assert is not valid.  Disable it for now.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-27 13:26:09 -08:00
Sage Weil
1fa8c83d2d Merge branch 'wip-osd-flags' 2012-12-27 13:09:24 -08:00
Sage Weil
207e93abef Merge remote-tracking branch 'gh/wip-mds-pool'
Reviewed-by: Sam Lang <sam.lang@inktank.com>
2012-12-27 13:07:57 -08:00
Sage Weil
f230603873 osd: only calculate OpRequest rmw flags once
Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-27 12:12:40 -08:00
Sage Weil
f1dfd64f72 messages/MOSDOpReply: remove misleading may_read/may_write
These are OpRequest properties, calculated/enforced at the OSD.  They don't
belong in the MOSDOp or MOSDOpReply messages.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-27 12:12:40 -08:00
Sage Weil
03f6dfa46e osd: move rmw_flags to OpRequest, out of MOSDOp
It was very sloppy to put a server-side processing state inside the
message.  Move it to the OpRequestRef instead.

Note that the client was filling in bogus data that was then lost during
encoding/decoding; clean that up.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-27 12:12:40 -08:00
tamil
998f71945d dropping xfs test 186 due to bug: 3685
Signed-off-by: tamil <tamil.muthamizhan@inktank.com>
2012-12-27 11:27:31 -08:00
Gary Lowell
98e7b59807 docs: remove extra release-process2 file.
This file mostly duplicated the existing release documentation.  Differences
have been merged into the primary file.

Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
2012-12-27 11:14:19 -08:00
Sage Weil
82c71716f7 osd: drop 'osd recovery max active' back to previous default (5)
Having this too large means that queues get too deep on the OSDs during
backfill and latency is very high.  In my tests, it also meant we generated
a lot of slow recovery messages just from the recovery ops themselves (no
client io).

Keeping this at the old default means we are no worse in this respect than
argonaut, which is a safe position to start from.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-27 11:12:33 -08:00
Sage Weil
6f1f03c7d3 journal: reduce journal max queue size
Keep the journal queue size smaller than the filestore queue size.

Keeping this small also means that we can lower the latency for new
high priority ops that come into the op queue.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-27 11:11:08 -08:00
Sage Weil
0d2ad2f24b mds: use set to store MDSMap data pools
Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-27 11:09:00 -08:00
Sage Weil
2137d5cde0 mds: wait for client's mdsmap when specifying data pool
The client may have a newer map than we do; make sure we wait for it lest
we inadvertently reply because we think the pool doesn't exist.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-27 09:36:55 -08:00
Sage Weil
9da6d88291 doc: document mds config options
Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-27 09:33:27 -08:00
Sage Weil
916d1cf607 doc: journaler config options
Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-26 17:34:12 -08:00
Gary Lowell
cedea1391c docs: Merge changes from release-process2 document. 2012-12-26 12:54:27 -08:00
Sage Weil
850a056bec mds: add waiting_for_mdsmap queue
Defer events until we get a specific MDSMap epoch.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-26 11:58:51 -08:00
Sage Weil
c764935d0e mds: do not check for pool existence in osdmap
We don't have a wait mechanism to ensure the MDSMap has the latest osdmap
here.  Just trust the MDSMap.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-26 11:58:41 -08:00
Josh Durgin
4929fc7dd9 qa: remove xfstests 172 and 173 from qemu testing
These seem to require newer xfs.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2012-12-26 10:55:47 -08:00
Sage Weil
f5403f9493 doc/man/8/mkcephfs: update --mkfs a bit
Document that 'devs' and 'osd mkfs type' must be defined.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-26 09:42:13 -08:00
Sage Weil
8b59908370 mds: replace closed sessions on connect
If a connection comes and there is a closed session attached, remove it.
This is probably a failure of an old session to get cleaned up properly,
and in certain cases it may even be from a different client (if the addr
nonce is reused).  In that case this prevents further damage, although
a complete solution would also clean up the closed connection state if
there is a fault.  See #3630.

This fixes a hang that is reproduced by running the libcephfs
Caps.ReadZero test in a loop; eventually the client addr is reused and
we are linked to an ancient Session with a different client id.

Backport: bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-23 20:01:13 -08:00
Sage Weil
d18f3c2dd2 mds: don't force in->first == dn->first
The fullbit sets it now.  For multiversion inodes, its "first" can be in
the future, since this dentry may not have changed when the inode was
cowed in place.  (OTOH, the dentry cannot have changed without the inode
also having changed.)

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
2012-12-23 20:01:13 -08:00
Yan, Zheng
a1485f959d mds: compare sessionmap version before replaying imported sessions
Otherwise we may wrongly increase mds->sessionmap.version, which
will confuse future journal replays that involve the sessionmap.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2012-12-23 20:01:12 -08:00
Yan, Zheng
0002546205 mds: fix race between send_dentry_link() and cache expire
An MDentryLink message can race with cache expire. When it arrives at
the target MDS, it's possible there is no corresponding dentry in
the cache. If this race happens, we should expire the replica inode
encoded in the MDentryLink message. But to expire an inode, the MDS
needs to know which subtree the inode belongs to, so modify the
MDentryLink message to include this information.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2012-12-23 20:01:12 -08:00
Yan, Zheng
efbca31d3b mds: fix file existing check in Server::handle_client_openc()
Creating a new file needs to be handled by the directory fragment's auth
MDS; opening an existing file in write mode needs to be handled by the
corresponding inode's auth MDS. If a file is a remote link, its parent
directory fragment's auth MDS can differ from the corresponding
inode's auth MDS. So which MDS handles a create-file request can
depend on whether the corresponding file already exists.

handle_client_openc() calls rdlock_path_xlock_dentry() at the very
beginning. It always assumes the request needs to be handled by the
directory fragment's auth MDS. When handling a create-file request,
if the file already exists and is remotely linked to a non-auth inode,
handle_client_openc() falls back to handle_client_open(), and
handle_client_open() forwards the request because this MDS is not the
inode's auth MDS. Then, when the request arrives at the inode's auth MDS,
rdlock_path_xlock_dentry() is called and forwards the request
back.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2012-12-23 20:01:12 -08:00
Yan, Zheng
f5e86ecbd2 mds: delay processing cache expire when state >= EXPORT_EXPORTING
It's possible that the MDS receives a cache expire in the
EXPORT_LOGGINGFINISH and EXPORT_NOTIFYING states.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2012-12-23 20:01:12 -08:00
Yan, Zheng
1174dd3188 mds: don't retry readdir request after issuing caps
If a remote linkage without an inode is encountered after some caps are
issued, Server::handle_client_readdir() should send the reply to the
client immediately instead of retrying the request after opening
the remote dentry. This is because the MDS may want to revoke these
caps before the MDS succeeds in opening the remote dentry.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2012-12-23 20:01:12 -08:00
Yan, Zheng
dd4415768d mds: take export lock set before sending MExportDirDiscover
Migrator::export_dir() only checks whether it can lock the export lock set
but does not take the lock set. So someone else can change the path to
the exporting dir and confuse Migrator::handle_export_discover().

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2012-12-23 20:01:12 -08:00
Yan, Zheng
96f48aa056 mds: re-issue caps after importing caps
The imported caps may prevent unstable locks from entering stable
states. So we should call Locker::eval_gather() with parameter
"first" set to true after caps are imported.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2012-12-23 20:01:12 -08:00
Yan, Zheng
a3e70aede8 mds: always send discover if want_xlocked is true
If want_xlocked is true, we can not rely on previously sent discover
because it's likely the previous discover is blocking on the xlocked
dentry.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2012-12-23 20:01:12 -08:00
Yan, Zheng
69f9f024e8 mds: fix error handling in MDCache::handle_discover_reply()
The error handling code in MDCache::handle_discover_reply() has two
main issues. First, MDCache::handle_discover_reply() does not wake waiters
if dir_auth_hint in the reply message is equal to its own nodeid. This
can happen if a discover races with subtree importing. Second, it
checks for the existence of a cached directory fragment to decide
whether to take waiters from the inode or from the directory fragment. The
check is unreliable because subtree importing can add directory
fragments to the cache.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2012-12-23 20:01:11 -08:00
Yan, Zheng
e6b8f0a659 mds: set want_base_dir to false for MDCache::discover_ino()
When a frozen inode is encountered, MDCache::handle_discover() sends a
reply immediately if the reply message is not empty. When handling
"discover ino" requests, the reply message always contains the base
directory fragment. But the requestor already has the base directory
fragment, so the only effect of the reply message is to wake the requestor
and make it send the same "discover ino" request again. So the requestor
keeps sending "discover ino" requests but can't make any progress.

The fix is to set want_base_dir to false for MDCache::discover_ino().
After setting want_base_dir to false, we also need to update the code that
handles "discover ino" errors.

This patch also removes unused error handling code for flag_error_dn.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2012-12-23 20:01:11 -08:00
Yan, Zheng
b7e698a52b mds: no bloom filter for replica dir
We should delete a dir fragment's bloom filter after exporting the dir
fragment to another MDS. Otherwise the residual bloom filter may cause
problems if the MDS imports the dir fragment later.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2012-12-23 20:01:11 -08:00
Yan, Zheng
0ab0744e6f mds: properly mark dirfrag dirty
If predirty_journal_parents() does not propagate changes in dir's
fragstat into corresponding inode's dirstat, it should mark the
inode as dirfrag dirty. This happens when we modify dir fragments
that are auth subtree roots.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2012-12-23 20:01:11 -08:00
Yan, Zheng
48d8ae58ef mds: allow handle_client_readdir() fetching freezing dir.
At that point, the request has already auth pinned and locked some objects.
So CDir::fetch() should ignore the can_auth_pin check and continue
to fetch the freezing dir.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2012-12-23 20:01:11 -08:00
Sage Weil
d9673ca324 Merge branch 'wip-create-layout'
Reviewed-by: Greg Farnum <greg@inktank.com>

The functional tests for the create operations should add and specify non-default
pools, but we don't have a set of library methods to do that yet (to interact with
the monitor).
2012-12-23 19:59:04 -08:00
Sage Weil
8efcf54dc1 mds: *_pg_pool -> *_pool
Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-23 19:39:23 -08:00
Sage Weil
d2f5890f84 client, libcephfs: add method to get the pool name for an open file
Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-23 19:39:23 -08:00
Sage Weil
32ab274a4f client: specify data pool on create operations
Fill in the data pool field if specified by the client, or set to -1.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-23 19:39:22 -08:00
Sage Weil
3f4582176a mds: verify that the pool id is valid on SET[DIR]LAYOUT
Make sure the data pool exists and is part of the MDSMap data pools list.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-23 19:39:22 -08:00
Sage Weil
99d9e1daa5 mds: allow data pool to be specified on create
Reuse old preferred_pg field.  Only use if the new CREATEPOOLID feature
is present, and the value is >= 0.

Verify that the data pool is allowed, or return EINVAL to the client.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-23 19:39:22 -08:00
Sage Weil
697ed23cb9 client: remove set_default_*() methods
This is a poor interface.  The hadoop stuff is shifting to specify this
information on file creation instead.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-23 19:39:22 -08:00
Sage Weil
850d1d544b osd: fix dup failure cancellations
If we had a pending failure report, and send a cancellation, take it
out of our pending list so that we don't keep resending cancellations.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-23 15:21:18 -08:00
Sage Weil
61d43af747 osd: make MOSDFailure output more sensible
Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-23 15:21:18 -08:00
Sage Weil
9df522e9ec mon: make osd failure report log msgs sensible
Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-23 15:11:39 -08:00
Sage Weil
1290671f15 Merge branch 'wip-scrub' into next
Reviewed-by: Sage Weil <sage@inktank.com>
Conflicts:
	src/osd/PG.cc
2012-12-23 14:42:51 -08:00
Sage Weil
8362e6403e monclient: fix get_monmap_privately retry interval
Use mon_client_hunt_interval (default 3) instead of hardcoding 1 second.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-23 13:53:21 -08:00
Sage Weil
d843a64a3a Makefile: fix 'base' rule
Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-23 13:53:18 -08:00
Sage Weil
00b89c3f7b Merge branch 'next' 2012-12-23 11:19:39 -08:00
Sage Weil
a09f5b1b46 init-ceph,mkcephfs: default inode64 for mounting xfs
According to hch this is now the default for new kernels.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-23 11:18:45 -08:00
Sage Weil
5f25f9f8cf init-ceph: default osd_data path
Signed-off-by: Sage Weil <sage@inktank.com>
2012-12-22 11:10:03 -08:00