Commit Graph

30285 Commits

Author SHA1 Message Date
Yan, Zheng
ff8b9ac358 mds: send info of imported caps back to the exporter (export dir)
Introduce a new class Capability::Import and use it to send information
of imported caps back to the exporter. This is preparation for including
counterpart's information in cap import/export message.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:24 +08:00
Yan, Zheng
d00ec7915c mds: flush session messages before exporting caps
The following sequence of events can happen when exporting inodes:

- client sends open file request to mds.0
- mds.0 handles the request and sends inode stat back to the client
- mds.0 exports the inode to mds.1
- mds.1 sends cap import message to the client
- mds.0 sends cap export message to the client
- client receives the cap import message from mds.1, but the client
  still doesn't have the corresponding inode in the cache. So the client
  releases the imported caps.
- client receives the open file reply from mds.0
- client receives the cap export message from mds.0.

After the end of these events, the client doesn't have any cap for
the opened file.

To fix the message ordering issue, this patch introduces a new session
operation FLUSHMSG. Before exporting caps, we send a FLUSHMSG session
message to the client and wait for the acknowledgment. When we receive the
FLUSHMSG_ACK message from the client, we are sure that the client has
received all messages sent previously.
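
The ordering argument can be sketched with a toy model. This is not the
MDS or client code; ClientSession and the string message names are
hypothetical, and the only assumption is that a session delivers messages
in the order they were sent:

```python
from collections import deque


class ClientSession:
    """Toy client end of an ordered MDS session (hypothetical model)."""

    def __init__(self):
        self.inbox = deque()   # messages arrive in the order the MDS sent them
        self.handled = []      # messages the client has fully processed

    def deliver(self, msg):
        self.inbox.append(msg)

    def process_all(self):
        """Handle queued messages in order; return the acks to send back."""
        acks = []
        while self.inbox:
            msg = self.inbox.popleft()
            if msg == "FLUSHMSG":
                # Everything queued before the flush has been handled by now.
                acks.append("FLUSHMSG_ACK")
            else:
                self.handled.append(msg)
        return acks


session = ClientSession()
session.deliver("open_reply")   # reply to the client's open request
session.deliver("FLUSHMSG")     # sent by the exporter before exporting caps
acks = session.process_all()
```

Once the exporter sees the ack, the open reply is known to have been
handled, so the inode is in the client's cache before any cap import
arrives.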

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:24 +08:00
Yan, Zheng
77515b7a3c mds: increase cap sequence when sharing max size
Consider the case:
 - client voluntarily releases some caps through a cap update message
 - mds shares the new max size by sending a cap grant message
 - mds receives the cap update message

If the mds doesn't increase the cap sequence when sharing the max size,
it can't determine whether the cap update message was sent before or after
the client received the cap grant message that updates the max size.
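
A simplified sketch of why the sequence number helps. CapState and its
methods are hypothetical names, and the model assumes the client echoes the
last sequence it saw in every cap update:

```python
class CapState:
    """Toy MDS-side view of one client's cap (hypothetical model)."""

    def __init__(self):
        self.seq = 0  # sequence of the most recent grant sent to the client

    def share_max_size(self):
        """Send a grant carrying the new max size; bump the sequence."""
        self.seq += 1
        return self.seq

    def update_predates_grant(self, update_seq):
        """True if the client sent its update before seeing the latest grant."""
        return update_seq < self.seq


cap = CapState()
seen_by_client = cap.seq          # client sends an update tagged with seq 0
cap.share_max_size()              # mds shares max size, seq becomes 1
stale = cap.update_predates_grant(seen_by_client)
```

Without the increment in share_max_size(), both sequences would be equal
and the two orderings would be indistinguishable.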

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:24 +08:00
Yan, Zheng
65259796ae mds: include inode version in auth mds' lock messages
Encode the inode version in auth mds' lock messages, so that the version
of replica inodes gets updated. This is important because the client
uses the inode version in the mds reply to check whether the cached inode
is already up-to-date. It skips updating the inode if it thinks the
inode is already up-to-date.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:24 +08:00
Yan, Zheng
f134c77267 mds: avoid allocating MDRequest::More when cleanup request
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:24 +08:00
Yan, Zheng
e6c4d32e64 mds: waiting for slave request to finish
If the MDS receives a client request but finds that there is an existing
slave request, it's possible that another MDS forwarded the request
to us, but the MMDSSlaveRequest::OP_FINISH message arrives after
the client request.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:24 +08:00
Yan, Zheng
1536e814da mds: check lock state before eval_gather
Locker::eval_gather() can dispatch requests, which may change other
locks' states.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:24 +08:00
Yan, Zheng
e1818692d1 mds: don't request CEPH_CAP_PIN from auth mds
Avoid triggering assert(in->get_loner() >= 0 && in->mds_caps_wanted.empty())
in Locker::file_xsyn().

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:24 +08:00
Yan, Zheng
87ca260488 mds: fix sending resolve message
We need to send the resolve message when the mds is in the reconnect state.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:24 +08:00
Yan, Zheng
b7d78918de mds: keep dentry lock in sync state
Unlike locks of other types, a dentry lock in an unreadable state can
block path traversal, so it should be in the sync state as much as
possible.

This patch makes Locker::try_eval() change the dentry lock's state to
sync even when the dentry is freezing. It also makes the migrator check
imported dentries' lock states and change the locks' states to sync if
necessary.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:23 +08:00
Yan, Zheng
d8440c4cae mds: avoid leaving bare-bone dirfrags in the cache
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:23 +08:00
Yan, Zheng
b2a137007f mds: re-issue caps after importing inode
After importing an inode, the issued caps can be fewer than the caps
the client wants. So always re-issue caps after importing an inode.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:23 +08:00
Yan, Zheng
3ac08860d4 mds: avoid issuing caps when inode is frozen
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:23 +08:00
Yan, Zheng
31f5b0275e mds: fix rename notify
Commit 1d86f77edf (mds: fix cross-authorty rename race) introduced
rename notify, but it put the code in the wrong bracket.

This patch also fixes a rename notify related bug in
MDCache::handle_mds_failure().

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:23 +08:00
Yan, Zheng
bd561772ba mds: re-send discover if want_xlocked becomes true
If want_xlocked becomes true, we cannot rely on the previously sent discover,
because the previous discover is likely blocked on the xlocked dentry.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:23 +08:00
Yan, Zheng
913f7fd8db mds: fix empty directory check
Since commit 310032ee81 (fix mds scatter_writebehind starvation), rdlocking
a scatter lock does not always propagate dirty fragstats to the corresponding
inode. So Server::_dir_is_nonempty() needs to check each dirfrag's stat
instead of checking the inode's dirstat.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:23 +08:00
Yan, Zheng
2fea08b59c mds: merge delayed cache expire
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:23 +08:00
Yan, Zheng
498d5c4998 mds: process delayed expire if exporting dir cancelled in warning state
We may add a delayed expire when the exporting dir is in the warning state.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:23 +08:00
Yan, Zheng
0aed0d48c7 mds: handle cache rejoin corner case
A recovering MDS may receive a strong cache rejoin from a survivor;
then the survivor restarts, and the recovering MDS receives a weak cache
rejoin from the same MDS. Before processing the weak cache rejoin,
we should scour replicas added by the obsolete strong cache rejoin.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:23 +08:00
Yan, Zheng
5a902a0e5d mds: unify nonce type
MDSCacheObject::replica_nonce is defined as __s16, but nonce type
in MDSCacheObject::replica_map is int. This mismatch may confuse
MDCache::handle_cache_expire().

This patch unifies the nonce type as uint32.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:23 +08:00
Yan, Zheng
0344d9af74 mds: rework stale import/export message detection
The current code uses the import state to detect obsolete import/export
messages. It does not work for this case: cancel a subtree export, export
the same subtree again, and the messages for the first export get dispatched.

This patch introduces a "transaction ID" (TID) for subtree exports. Each
subtree export has a unique TID, and the ID is recorded in all import/export
related messages. By comparing the TID, we can reliably detect stale messages.
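
A minimal sketch of TID-based stale detection. ExportTracker and its
methods are hypothetical and only model the comparison described above,
not the Migrator code:

```python
import itertools


class ExportTracker:
    """Tracks the TID of the current export for each subtree (toy model)."""

    def __init__(self):
        self._next_tid = itertools.count(1)
        self._active = {}  # subtree path -> TID of the in-flight export

    def start_export(self, subtree):
        tid = next(self._next_tid)
        self._active[subtree] = tid
        return tid

    def cancel_export(self, subtree):
        self._active.pop(subtree, None)

    def is_stale(self, subtree, msg_tid):
        """A message is stale unless it carries the current export's TID."""
        return self._active.get(subtree) != msg_tid


tracker = ExportTracker()
first = tracker.start_export("/a")
tracker.cancel_export("/a")
second = tracker.start_export("/a")   # same subtree, new TID
```

A late message from the first export carries `first`, which no longer
matches the active TID, so it is dropped even though the subtree is being
exported again.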

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:22 +08:00
Yan, Zheng
9471fdc613 mds: put import/export related states together
The current code uses several STL maps to record import/export related
states. A map lookup is required for each state access, which is not
efficient. It's better to put import/export related states together.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:22 +08:00
Yan, Zheng
ab93aa59bf mds: freeze tree deadlock detection.
There are two situations that result in a freeze tree deadlock.

 - mds.0 authpins an item in subtree A
 - mds.0 sends request to mds.1 to authpin an item in subtree B
 - mds.0 freezes subtree A
 - mds.1 authpins an item in subtree B
 - mds.1 sends request to mds.0 to authpin an item in subtree A
 - mds.1 freezes subtree B
 - mds.1 receives the remote authpin request from mds.0
   (wait because subtree B is freezing)
 - mds.0 receives the remote authpin request from mds.1
   (wait because subtree A is freezing)

 - client request authpins items in subtree B
 - freeze subtree B
 - import subtree A which is parent of subtree B
   (authpins parent inode of subtree B, see CDir::set_dir_auth())
 - freeze subtree A
 - client request tries authpinning items in subtree A
   (wait because subtree A is freezing)

Enforcing an authpinning order can avoid the deadlock, but it's very
expensive. The deadlock is rare, so I think deadlock detection is
more suitable for the case.

This patch introduces freeze tree deadlock detection. We record the
start time of freezing a tree. If we fail to freeze the tree within a
given duration, we cancel the process of freezing the tree.
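
The timeout check can be sketched as follows. FreezingTree and
find_stuck_freezes are hypothetical names, and times are passed in
explicitly to keep the example deterministic:

```python
class FreezingTree:
    """A subtree that started freezing at `started_at` (seconds)."""

    def __init__(self, name, started_at):
        self.name = name
        self.started_at = started_at
        self.frozen = False


def find_stuck_freezes(trees, now, timeout):
    """Return names of trees still freezing after more than `timeout` seconds."""
    return [
        t.name
        for t in trees
        if not t.frozen and now - t.started_at > timeout
    ]


trees = [FreezingTree("A", started_at=0), FreezingTree("B", started_at=25)]
stuck = find_stuck_freezes(trees, now=40, timeout=30)
```

The caller would then cancel the freeze for every tree in `stuck`, which
releases the waiters and breaks the cycle.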

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:22 +08:00
Sage Weil
edc4224de4 Merge remote-tracking branch 'gh/wip-hitset'
Reviewed-by: Greg Farnum <greg@inktank.com>

Conflicts:
	src/common/config_opts.h
	src/osd/ReplicatedPG.cc
	src/osdc/Objecter.cc
	src/vstart.sh
2013-12-15 16:57:23 -08:00
Sage Weil
f192a600c5 Revert "common/Formatter: add newline to flushed output if m_pretty"
This reverts commit d6146b0d91.

As Yehuda points out, this does not properly handle cases where we flush
the same output stream multiple times.
2013-12-15 16:23:09 -08:00
Sage Weil
c7b44d6675 Revert "common: fix perf_counters unittests for trailing newline in m_pretty"
This reverts commit ba5572397c.
2013-12-15 16:22:59 -08:00
Loic Dachary
31507c90f0 qa: test for error when ceph osd rm is EBUSY
http://tracker.ceph.com/issues/6824 fixes #6824

Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-15 23:06:26 +01:00
Loic Dachary
4b9a41aa17 qa: make cephtool test immune to pool size
Instead of assuming the pool size is 2, query it and increment it to
test pool set data size. This allows running the test from vstart.sh
without knowing the required pool size in advance:

    rm -fr dev out ;  mkdir -p dev ; \
     MON=1 OSD=3 ./vstart.sh -n -X -l mon osd

    LC_ALL=C PATH=:$PATH CEPH_CONF=ceph.conf \
      ../qa/workunits/cephtool/test.sh

Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-15 21:45:31 +01:00
Loic Dachary
f9cfa24adc qa: add function name and line number to cephtool output
Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-15 21:45:31 +01:00
Loic Dachary
cb352484f1 qa: silence cephtool tests cleanup
The file removal installed to be triggered when the script stops must
not fail if the file does not exist.

Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-15 21:45:31 +01:00
Loic Dachary
15b8616b13 mon: set ceph osd (down|out|in|rm) error code on failure
Instead of always returning true, the error code is set if at least one
operation fails.

EINVAL if the OSD id is invalid (osd.foobar for instance).
EBUSY if trying to remove an OSD that is up.
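
A rough sketch of the error handling. osd_rm and its arguments are
hypothetical, and it assumes the last failure determines the returned
code, which the commit message does not specify:

```python
import errno


def osd_rm(names, known, up):
    """Try to remove each named OSD; return 0 or a negative errno.

    known: set of existing OSD names; up: set of OSDs that are still up.
    """
    err = 0
    for name in names:
        if name not in known:
            err = -errno.EINVAL      # invalid id, e.g. "osd.foobar"
        elif name in up:
            err = -errno.EBUSY       # still up, must be down before removal
        else:
            known.discard(name)      # removal succeeds
    return err
```

Valid removals still take effect when another id in the same call fails.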

When used with the ceph command line, it looks like this:

    ceph -c ceph.conf osd rm osd.0
    Error EBUSY: osd.0 is still up; must be down before removal.
    kill PID_OF_osd.0
    ceph -c ceph.conf osd down osd.0
    marked down osd.0.
    ceph -c ceph.conf osd rm osd.0 osd.1
    Error EBUSY: removed osd.0, osd.1 is still up; must be down before removal.

http://tracker.ceph.com/issues/6824 fixes #6824

Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-15 21:45:31 +01:00
Sage Weil
80c6c54d93 Merge pull request #716 from ceph/wip-formatter-newlines
common/Formatter: add newline to flushed output if m_pretty
2013-12-15 10:24:03 -08:00
Sage Weil
3862ad8f8f Merge pull request #943 from dachary/wip-formatter-newlines
common: fix perf_counters unittests for trailing newline in m_pretty
2013-12-15 10:23:33 -08:00
Sage Weil
550adb824f Merge pull request #942 from sstock/master
Add -n option to mount.ceph, feature 7006

Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-15 10:18:49 -08:00
Steve Stock
e37467b7bf Add -n option to mount.ceph. Required by autofs when /etc/mtab is a link to /proc/mounts (e.g. Debian Wheezy), otherwise automounting a ceph file system fails. Also useful when /etc is read-only. feature 7006
Signed-off-by: Steve Stock <steve@technolope.org>
2013-12-15 12:49:42 -05:00
Sage Weil
11065b5a76 Merge pull request #937 from christian-marie/master
Document librados's rados_write's behaviour in regard to return value.
2013-12-15 08:41:16 -08:00
Sage Weil
25838f3b0d Merge pull request #924 from dachary/wip-erasure-doc
doc: update erasure code development doc
2013-12-15 08:40:52 -08:00
Sage Weil
62a7d9c7bd Merge pull request #946 from dachary/wip-80-column
osd: format test_osd_types.cc to 80 columns
2013-12-15 08:40:32 -08:00
Sage Weil
caf5963565 Merge pull request #945 from dachary/wip-6981
ceph-disk: zap needs at least one device

Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-15 08:40:16 -08:00
Sage Weil
89dd0206ee Merge pull request #944 from dachary/wip-6679
common: fix rare race condition in Throttle unit tests

Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-15 08:39:55 -08:00
Sage Weil
9c71d97b2c Merge pull request #948 from dachary/wip-6736-1
mon: typo s/degrated/degraded/

Backport: emperor, dumpling
2013-12-15 08:32:41 -08:00
Loic Dachary
aa365e4b1a mon: typo s/degrated/degraded/
http://tracker.ceph.com/issues/6736 refs #6736

Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-15 17:15:46 +01:00
Loic Dachary
5741bfe9fc osd: format test_osd_types.cc to 80 columns
Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-15 16:23:53 +01:00
Loic Dachary
07888ef3fd ceph-disk: zap needs at least one device
If given no argument, ceph-disk zap should display the usage instead of
silently doing nothing. Silence can be confused with "I zapped all the
disks".

http://tracker.ceph.com/issues/6981 fixes #6981

Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-15 15:34:17 +01:00
Loic Dachary
e57239e920 common: fix rare race condition in Throttle unit tests
The thread created to test Throttle race conditions updates a value
(throttle.get_current()) that is tested by the main gtest thread but is
not protected by a lock. Instead of adding a lock, the main thread tests
the value after pthread_join() on the child thread.
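
The same join-before-read pattern, sketched in Python rather than the
gtest code. Counter and worker are hypothetical stand-ins for the Throttle
object and the test thread:

```python
import threading


class Counter:
    """Stand-in for the shared value the child thread updates."""

    def __init__(self):
        self.current = 0


def worker(counter, n):
    for _ in range(n):
        counter.current += 1


counter = Counter()
child = threading.Thread(target=worker, args=(counter, 1000))
child.start()
child.join()                 # wait for the child to finish
observed = counter.current   # safe to read: no other thread is writing now
```

Reading after join() needs no lock because the child has finished all of
its writes by the time join() returns.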

http://tracker.ceph.com/issues/6679 fixes #6679

Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-15 14:31:27 +01:00
Loic Dachary
938f22cae2 common: format Throttle test to 80 columns
Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-15 14:30:38 +01:00
Loic Dachary
ba5572397c common: fix perf_counters unittests for trailing newline in m_pretty
Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-15 13:24:14 +01:00
Loic Dachary
c744aec660 Merge pull request #929 from kazhang/add-pkg-config
add apt-get install pkg-config for ubuntu server

Reviewed-by: Loic Dachary <loic@dachary.org>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-15 03:26:21 -08:00
John Wilkins
b7946ff4b3 doc: Added additional comments on placement targets and default placement.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-12-13 16:09:35 -08:00
John Wilkins
902f19c23a doc: Updates to federated config.
Reverted Emperor versionadded to Dumpling as it gets backported.
Added default index and bucket pools to pool creation.
Added default default_placement setting.
Added placement_pools key-value pair examples.
Added comments for re-running the procedure for the secondary region.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-12-13 16:08:37 -08:00