Commit Graph

30211 Commits

Author SHA1 Message Date
Loic Dachary
b1530679a8 erasure-code: tests must use aligned buffers
The underlying code assumes the memory buffer is aligned on a long
boundary, which is not always the case. Using buffer::create_page_aligned,
which calls posix_memalign, ensures the allocated buffer starts at an
address that is properly aligned.

Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-17 20:26:01 +01:00
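A minimal sketch of the alignment point made in b1530679a8, assuming a POSIX system. The helpers alloc_aligned() and is_aligned() are hypothetical names for illustration and are not part of Ceph's buffer API; only posix_memalign() itself comes from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>   // posix_memalign, free

// Hypothetical helper: allocate `len` bytes starting on an `alignment`
// boundary. posix_memalign() requires a power of two that is also a
// multiple of sizeof(void*). Returns nullptr on failure; release with free().
inline void* alloc_aligned(std::size_t alignment, std::size_t len) {
  assert(alignment % sizeof(void*) == 0 &&
         (alignment & (alignment - 1)) == 0);
  void* p = nullptr;
  if (posix_memalign(&p, alignment, len) != 0)
    return nullptr;
  return p;
}

// True when `p` starts on a boundary of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

A page-aligned buffer (4096 bytes is assumed here as the page size) is also aligned on a long boundary, which is the property the tests rely on.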
Sage Weil
6f431200e3 ceph_test_rados_api_tier: fix HitSetTrim vs split, too
Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-16 17:10:48 -08:00
Sage Weil
00f436c144 Merge pull request #904 from ceph/wip-mds-cluster2
Wip mds cluster2

Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-16 17:03:27 -08:00
Sage Weil
c5bccfef88 ceph_test_rados_api_tier: fix HitSetRead test race with split
Recalculate the hash on each iteration in case we are racing with split.

Fixes: #7013
Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-16 16:52:35 -08:00
Sage Weil
94da54ff95 Merge pull request #954 from ceph/wip-7009
mon: move supported_commands fields, methods into Monitor, and fix leak

Reviewed-by: Greg Farnum <greg@inktank.com>
2013-12-16 16:31:39 -08:00
Sage Weil
7e618c937b mon: move supported_commands fields, methods into Monitor, and fix leak
We were leaking the static leader_supported_mon_commands.  Move this into
the class so that we can clean up in the destructor.

Rename get_command_descriptions -> format_command_descriptions.

Fixes: #7009
Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-16 16:09:44 -08:00
Sage Weil
1597d4e9f5 Merge pull request #951 from ceph/wip-linux-version
common: introduce get_linux_version()

Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-16 09:27:43 -08:00
Ilya Dryomov
824b3d8e84 FileJournal: use pclose() to close a popen() stream
In FileJournal::_check_disk_write_cache(), use pclose() instead of
fclose() to close a stream, created by popen().

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
2013-12-16 18:57:22 +02:00
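The popen()/pclose() pairing from 824b3d8e84 can be sketched as follows, assuming a POSIX system with a shell available. read_command_output() is a hypothetical helper, not the code in FileJournal::_check_disk_write_cache().

```cpp
#include <cassert>
#include <stdio.h>    // popen, pclose, fgets
#include <string>

// Hypothetical helper: run `cmd` through the shell and return its standard
// output, or an empty string if popen() fails.
inline std::string read_command_output(const char* cmd) {
  assert(cmd != nullptr);
  FILE* fp = popen(cmd, "r");
  if (!fp)
    return std::string();
  std::string out;
  char buf[256];
  while (fgets(buf, sizeof(buf), fp))
    out += buf;
  // A stream created by popen() is closed with pclose(), not fclose():
  // pclose() also waits for the child process to terminate.
  pclose(fp);
  return out;
}
```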
Ilya Dryomov
6696ab6479 FileJournal: switch to get_linux_version()
For the purposes of FileJournal::_check_disk_write_cache(), use
get_linux_version(), which is based on uname(2), instead of parsing the
contents of /proc/version.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
2013-12-16 18:57:22 +02:00
Ilya Dryomov
fcf6e9878b common: introduce get_linux_version()
get_linux_version() returns the version of the currently running kernel,
encoded as an int, and is contained in common/linux_version.[ch].

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
2013-12-16 18:57:21 +02:00
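A rough sketch of the idea behind fcf6e9878b and 6696ab6479: read the kernel release with uname(2) and pack it into an int. The encoding below is modeled on the kernel's KERNEL_VERSION macro and is an assumption, as are the names encode_version(), parse_release() and current_linux_version(); the actual contents of common/linux_version.[ch] are not shown in this log.

```cpp
#include <cassert>
#include <cstdio>
#include <sys/utsname.h>

// Assumed encoding: one int, with major, minor and patch in separate bytes.
constexpr int encode_version(int a, int b, int c) {
  return (a << 16) + (b << 8) + c;
}

// Parse a release string such as "3.11.0-14-generic".
// Returns 0 when fewer than two numeric components are found.
inline int parse_release(const char* release) {
  assert(release != nullptr);
  int a = 0, b = 0, c = 0;
  if (std::sscanf(release, "%d.%d.%d", &a, &b, &c) < 2)
    return 0;
  return encode_version(a, b, c);
}

// Version of the running kernel, or 0 if uname() fails.
inline int current_linux_version() {
  struct utsname u;
  if (uname(&u) != 0)
    return 0;
  return parse_release(u.release);
}
```

With an integer encoding, a caller can compare against a threshold directly, for example current_linux_version() < encode_version(2, 6, 33), instead of parsing text from /proc/version.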
Ilya Dryomov
a2babe27e8 configure: break up AC_CHECK_HEADERS into one header-file per line
Break up AC_CHECK_HEADERS macro into one header-file per line so it's
easier to read and make changes.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
2013-12-16 18:57:21 +02:00
Yan, Zheng
4526d13a9d mds: fix stale session handling for multiple mds
Don't add new caps to stale session when importing inodes. Don't
touch session when importing caps because it confuses the stale
session detection.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 14:24:52 +08:00
Yan, Zheng
43f7268f5d mds: properly set dirty flag when journalling import
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 14:24:52 +08:00
Yan, Zheng
802df76f68 mds: properly update mdsdir's authority during recovery
dirfrag of mdsdir doesn't inherit its parent inode's authority.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 14:24:52 +08:00
Yan, Zheng
b6d1d8f186 mds: finish opening sessions even if import aborted
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 14:24:52 +08:00
Yan, Zheng
80005f1ece mds: fix discover path race
When C_MDC_RetryDiscoverPath is executed, we may have already become
the auth mds of base.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 14:24:50 +08:00
Sage Weil
58d68995c4 Merge pull request #947 from dachary/wip-6824
mon: set ceph osd (down|out|in|rm) error code on failure

Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-15 21:16:48 -08:00
Yan, Zheng
5fdcc568c6 mds: fix bug in MDCache::open_ino_finish
It's wrong to erase open_ino_info_t after finishing contexts, because
MDCache::open_ino() can be called again when finishing contexts.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:25 +08:00
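The ordering problem described in 5fdcc568c6 can be illustrated with a simplified table of pending lookups. PendingTable is a hypothetical stand-in and does not reproduce MDCache or open_ino_info_t; it only shows why the entry should be erased before the callbacks run when a callback may register the same key again.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Hypothetical table of pending lookups keyed by inode number.
struct PendingTable {
  using Callback = std::function<void()>;
  std::map<int, std::vector<Callback>> pending;

  void open(int ino, Callback cb) {
    assert(cb != nullptr);
    pending[ino].push_back(std::move(cb));
  }

  // Detach the waiters and erase the entry first, then run the callbacks.
  // A callback that calls open() for the same ino creates a fresh entry,
  // and that entry is not erased afterwards.
  void finish(int ino) {
    auto it = pending.find(ino);
    if (it == pending.end())
      return;
    std::vector<Callback> waiters = std::move(it->second);
    pending.erase(it);
    for (auto& cb : waiters)
      cb();
  }
};
```

Erasing after the loop instead would discard whatever the callbacks had just registered under the same key.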
Yan, Zheng
71d1eb374a mds: add CEPH_FEATURE_EXPORT_PEER and bump the protocol version
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:25 +08:00
Yan, Zheng
d0b744a1d6 client: handle session flush message
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:25 +08:00
Yan, Zheng
05b192faab mds: simplify how to export non-auth caps
Introduce a new flag in cap import message. If client finds the flag
is set, it releases exporter's caps (send release to the exporter).
This saves the cap export message and a "mds to mds" message.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:25 +08:00
Yan, Zheng
9dc52ff04b mds: send cap import messages to clients after importing subtree succeeds
When importing subtree, the importer sends cap import messages to clients
before the import subtree operation is considered successful. If the
exporter crashes before EExport event is journalled, the importer needs to
re-export client caps. This confuses clients, and makes them lose track of
auth caps.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:25 +08:00
Yan, Zheng
6a565881f6 mds: re-send cap exports in resolve message.
For a rename operation that changes an inode's authority, if the master
mds of the operation crashes, the inode's original auth mds sends export
messages to clients when it receives the master mds' resolve ack
message. The client can't rely on the export message to add caps for
the master mds and then reconnect the caps when the master mds enters
the reconnect stage, because the client may receive the export message
after receiving the mdsmap that claims the master mds is in the
reconnect stage.

The fix is to include cap exports in the resolve message, so the master
mds can send import messages to clients when it enters the rejoin stage.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:25 +08:00
Yan, Zheng
4fdeb00df2 mds: include counterpart's information in cap import/export messages
When exporting inodes with client caps, the importer sends cap import
messages to clients and the exporter sends cap export messages to
clients. A client can receive these two messages in any order. If a
client first receives the cap import message, it adds the imported caps,
but the caps from the exporter are still considered valid. This can
compromise consistency. If an MDS crashes while importing caps, clients
may receive only cap export messages and no cap import messages.
These clients don't know which MDS is the cap importer, so they can't
send cap reconnect when the MDS recovers.

We can handle the above issues by including the counterpart's
information in cap import/export messages. If a client first receives
the cap import message, it adds the imported caps, then removes the
exporter's caps. If a client first receives the cap export message, it
removes the exported caps, then adds caps for the importer.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:25 +08:00
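The order-independent handling described in 4fdeb00df2 can be sketched on the client side as below. ClientCaps, its cap bitmask, and the handler signatures are hypothetical simplifications; the real messages carry more state (sequence numbers, cap ids) that is left out here.

```cpp
#include <cassert>
#include <map>

// Hypothetical client-side cap table: mds rank -> issued caps bitmask.
struct ClientCaps {
  std::map<int, unsigned> caps;

  // The import message names the exporter, so the exporter's caps can be
  // removed as soon as the imported caps are added.
  void handle_import(int importer, int exporter, unsigned issued) {
    assert(importer != exporter);
    caps[importer] |= issued;
    caps.erase(exporter);
  }

  // The export message names the importer, so the exported caps can be
  // re-added for the importer even if its import message has not arrived.
  void handle_export(int exporter, int importer) {
    assert(importer != exporter);
    auto it = caps.find(exporter);
    if (it == caps.end())
      return;                       // the import was processed first
    unsigned moved = it->second;
    caps.erase(it);
    caps[importer] |= moved;
  }
};
```

Under these assumptions both delivery orders leave the client with caps only for the importer.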
Yan, Zheng
ef902ee0b9 mds: send info of imported caps back to the exporter (rename)
Use MMDSSlaveRequest::OP_FINISH slave request to send information
of rename imported caps back to the exporter. This is preparation
for including counterpart's information in cap import/export message.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:25 +08:00
Yan, Zheng
85171fd6c2 mds: send info of imported caps back to the exporter (cache rejoin)
Use cache rejoin ack message to send information of rejoin imported
caps back to the exporter. Also move the code that exports reconnect
caps to MDCache::handle_cache_rejoin_ack()

This is preparation for including counterpart's information in cap
import/export message.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:24 +08:00
Yan, Zheng
ff8b9ac358 mds: send info of imported caps back to the exporter (export dir)
Introduce a new class Capability::Import and use it to send information
of imported caps back to the exporter. This is preparation for including
counterpart's information in cap import/export message.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:24 +08:00
Yan, Zheng
d00ec7915c mds: flush session messages before exporting caps
The following sequence of events can happen when exporting inodes:

- client sends open file request to mds.0
- mds.0 handles the request and sends inode stat back to the client
- mds.0 exports the inode to mds.1
- mds.1 sends cap import message to the client
- mds.0 sends cap export message to the client
- client receives the cap import message from mds.1, but the client
  still doesn't have corresponding inode in the cache. So the client
  releases the imported caps.
- client receives the open file reply from mds.0
- client receives the cap export message from mds.0.

After the end of these events, the client doesn't have any cap for
the opened file.

To fix the message ordering issue, this patch introduces a new session
operation, FLUSHMSG. Before exporting caps, we send a FLUSHMSG session
message to the client and wait for the acknowledgment. When we receive
the FLUSHMSG_ACK message from the client, we are sure that the client
has received all messages sent previously.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:24 +08:00
Yan, Zheng
77515b7a3c mds: increase cap sequence when sharing max size
Consider this case:
 - client voluntarily releases some caps through cap update message
 - mds shares the new max by sending cap grant message
 - mds receives the cap update message

If the mds doesn't increase the cap sequence when sharing the max size,
it can't determine whether the cap update message was sent before or
after the client received the cap grant message that updates max size.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:24 +08:00
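A simplified sketch of why the sequence bump in 77515b7a3c helps. CapState and its methods are hypothetical; the sketch assumes each client update echoes the last sequence number the client has seen and ignores wraparound.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified MDS-side view of one client capability.
struct CapState {
  std::uint32_t seq = 0;        // sequence of the last message sent to the client
  std::uint64_t max_size = 0;

  // Share a new max size with the client. Bumping seq here is the
  // behaviour the commit introduces.
  std::uint32_t share_max_size(std::uint64_t new_max) {
    max_size = new_max;
    return ++seq;
  }

  // True when the client sent its update before it saw the latest grant.
  bool update_predates_last_grant(std::uint32_t seq_seen_by_client) const {
    assert(seq_seen_by_client <= seq);
    return seq_seen_by_client < seq;
  }
};
```

Without the increment in share_max_size(), an update sent before the grant and one sent after it would carry the same sequence number and be indistinguishable.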
Yan, Zheng
65259796ae mds: include inode version in auth mds' lock messages
Encode the inode version in the auth mds' lock messages, so that the
version of replica inodes gets updated. This is important because
clients use the inode version in the mds reply to check if the cached
inode is already up-to-date. A client skips updating the inode if it
thinks the inode is already up-to-date.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:24 +08:00
Yan, Zheng
f134c77267 mds: avoid allocating MDRequest::More when cleanup request
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:24 +08:00
Yan, Zheng
e6c4d32e64 mds: waiting for slave request to finish
If the MDS receives a client request but finds there is an existing
slave request, it's possible that another MDS forwarded the request
to us, but the MMDSSlaveRequest::OP_FINISH message arrives after
the client request.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:24 +08:00
Yan, Zheng
1536e814da mds: check lock state before eval_gather
Locker::eval_gather() can dispatch requests, which may change other
locks' states.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:24 +08:00
Yan, Zheng
e1818692d1 mds: don't request CEPH_CAP_PIN from auth mds
avoid triggering assert(in->get_loner() >= 0 && in->mds_caps_wanted.empty())
in Locker::file_xsyn()

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:24 +08:00
Yan, Zheng
87ca260488 mds: fix sending resolve message
need to send resolve message when mds is in reconnect state

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:24 +08:00
Yan, Zheng
b7d78918de mds: keep dentry lock in sync state
Unlike locks of other types, a dentry lock in an unreadable state can
block path traversal, so it should be in sync state as much as
possible.

This patch makes Locker::try_eval() change the dentry lock's state to
sync even when the dentry is freezing. It also makes the migrator check
imported dentries' lock states and change the locks' states to sync if
necessary.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:23 +08:00
Yan, Zheng
d8440c4cae mds: avoid leaving bare-bone dirfrags in the cache
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:23 +08:00
Yan, Zheng
b2a137007f mds: re-issue caps after importing inode
After importing inode, the issued caps can be less than the caps
client wants. So always re-issue caps after importing inode.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:23 +08:00
Yan, Zheng
3ac08860d4 mds: avoid issuing caps when inode is frozen
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:23 +08:00
Yan, Zheng
31f5b0275e mds: fix rename notify
commit 1d86f77edf (mds: fix cross-authorty rename race) introduced
rename notify, but it puts the code in the wrong bracket.

This patch also fixes a rename notify related bug in
MDCache::handle_mds_failure()

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:23 +08:00
Yan, Zheng
bd561772ba mds: re-send discover if want_xlocked becomes true
If want_xlocked becomes true, we can not rely on previously sent discover
because it's likely the previous discover is blocked on the xlocked dentry.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:23 +08:00
Yan, Zheng
913f7fd8db mds: fix empty directory check
Since commit 310032ee81 (fix mds scatter_writebehind starvation),
rdlocking a scatter lock does not always propagate dirty fragstats to
the corresponding inode. So Server::_dir_is_nonempty() needs to check
each dirfrag's stat instead of checking the inode's dirstat.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:23 +08:00
Yan, Zheng
2fea08b59c mds: merge delayed cache expire
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:23 +08:00
Yan, Zheng
498d5c4998 mds: process delayed expire if exporting dir cancelled in warning state
We may add a delayed expire when exporting dir is in warning state.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:23 +08:00
Yan, Zheng
0aed0d48c7 mds: handle cache rejoin corner case
A recovering MDS may receive a strong cache rejoin from a survivor;
then the survivor restarts and the recovering MDS receives a weak cache
rejoin from the same MDS. Before processing the weak cache rejoin,
we should scour replicas added by the obsolete strong cache rejoin.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:23 +08:00
Yan, Zheng
5a902a0e5d mds: unify nonce type
MDSCacheObject::replica_nonce is defined as __s16, but nonce type
in MDSCacheObject::replica_map is int. This mismatch may confuse
MDCache::handle_cache_expire().

This patch unifies the nonce type as uint32.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:23 +08:00
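One way a width mismatch like the one in 5a902a0e5d could cause confusion is shown below. This is an illustration under assumptions: the helper names are hypothetical, and the commit does not say which nonce values actually triggered a problem.

```cpp
#include <cassert>
#include <cstdint>

// Store a nonce in a 16-bit signed field, the width replica_nonce (__s16)
// had before the change.
inline std::int16_t store_narrow(int nonce) {
  return static_cast<std::int16_t>(nonce);
}

// Compare the narrowed copy against the full-width int kept in a map.
inline bool nonces_match(int in_map, std::int16_t in_object) {
  return in_map == in_object;
}
```

Any nonce above 32767 cannot survive the narrowing, so the two copies stop comparing equal even though they started as the same value. Using one type on both sides removes that possibility.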
Yan, Zheng
0344d9af74 mds: rework stale import/export message detection
Current code uses import state to detect obsolete import/export messages.
It does not work for the following case: cancel a subtree export, export
the same subtree again, and the messages for the first export get
dispatched.

This patch introduces a "transaction ID" for subtree exports. Each subtree
export has a unique TID, and the ID is recorded in all import/export
related messages. By comparing the TID, we can reliably detect stale
messages.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:22 +08:00
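The TID comparison from 0344d9af74 can be sketched as follows. ExportTracker and the integer subtree ids are hypothetical; the sketch only shows how a per-export ID separates messages of a cancelled export from those of a retry on the same subtree.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical tracker: one transaction ID per in-flight subtree export.
class ExportTracker {
  std::uint64_t last_tid_ = 0;
  std::map<int, std::uint64_t> active_;   // subtree id -> tid of current export

 public:
  // Begin an export and return the TID to stamp on its messages.
  std::uint64_t start(int subtree) {
    assert(active_.count(subtree) == 0);
    return active_[subtree] = ++last_tid_;
  }

  void cancel(int subtree) { active_.erase(subtree); }

  // A message is current only if it carries the TID of the active export.
  bool is_stale(int subtree, std::uint64_t msg_tid) const {
    auto it = active_.find(subtree);
    return it == active_.end() || it->second != msg_tid;
  }
};
```

A check based on state alone would see the same state for both attempts; the TID differs, so late messages from the first attempt are rejected.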
Yan, Zheng
9471fdc613 mds: put import/export related states together
Current code uses several STL maps to record import/export related
states. A map lookup is required for each state access, which is
inefficient. It's better to put import/export related states together.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:22 +08:00
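The restructuring in 9471fdc613 can be pictured as replacing several parallel maps with one map of records. ExportState, its fields, and ExportTable are illustrative names only and do not mirror the Migrator's actual members.

```cpp
#include <cassert>
#include <map>
#include <set>

// Hypothetical per-export record; the fields are placeholders.
struct ExportState {
  int state = 0;
  int peer = -1;
  std::set<int> warning_ack_waiting;
};

// One map keyed by directory, instead of one map per field.
class ExportTable {
  std::map<int, ExportState> by_dir_;

 public:
  ExportState& get_or_create(int dir) { return by_dir_[dir]; }

  const ExportState* find(int dir) const {
    auto it = by_dir_.find(dir);
    return it == by_dir_.end() ? nullptr : &it->second;
  }

  void erase(int dir) { by_dir_.erase(dir); }
};
```

After a single lookup the caller holds a reference to every field of the export, and erasing one entry cannot leave stale values behind in a sibling map.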
Yan, Zheng
ab93aa59bf mds: freeze tree deadlock detection.
There are two situations that result in a freeze tree deadlock.

 - mds.0 authpins an item in subtree A
 - mds.0 sends request to mds.1 to authpin an item in subtree B
 - mds.0 freezes subtree A
 - mds.1 authpins an item in subtree B
 - mds.1 sends request to mds.0 to authpin an item in subtree A
 - mds.1 freezes subtree B
 - mds.1 receives the remote authpin request from mds.0
   (wait because subtree B is freezing)
 - mds.0 receives the remote authpin request from mds.1
   (wait because subtree A is freezing)

 - client request authpins items in subtree B
 - freeze subtree B
 - import subtree A which is parent of subtree B
   (authpins parent inode of subtree B, see CDir::set_dir_auth())
 - freeze subtree A
 - client request tries authpinning items in subtree A
   (wait because subtree A is freezing)

Enforcing an authpinning order can avoid the deadlock, but it's very
expensive. The deadlock is rare, so I think deadlock detection is
more suitable for the case.

This patch introduces freeze tree deadlock detection. We record the
start time of freezing tree. If we fail to freeze the tree within a
given duration, cancel the process of freezing tree.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-12-16 12:15:22 +08:00
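The timeout-based detection described in ab93aa59bf can be sketched as below. FreezeAttempt is a hypothetical type; the clock and the length of the limit are assumptions, since the commit message only speaks of "a given duration".

```cpp
#include <cassert>
#include <chrono>

// Hypothetical record of one attempt to freeze a subtree.
struct FreezeAttempt {
  using Clock = std::chrono::steady_clock;
  Clock::time_point started;   // when freezing began
  bool frozen = false;         // set once the freeze completes

  explicit FreezeAttempt(Clock::time_point now) : started(now) {}

  // True when the freeze has not completed within `limit`, in which case
  // the caller would cancel it and let the blocked requests proceed.
  bool should_cancel(Clock::time_point now, Clock::duration limit) const {
    assert(now >= started);
    return !frozen && now - started > limit;
  }
};
```

This treats any freeze that takes too long as a suspected deadlock rather than proving a cycle exists, which matches the commit's trade-off of cheap detection over enforcing an authpin order.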
Sage Weil
edc4224de4 Merge remote-tracking branch 'gh/wip-hitset'
Reviewed-by: Greg Farnum <greg@inktank.com>

Conflicts:
	src/common/config_opts.h
	src/osd/ReplicatedPG.cc
	src/osdc/Objecter.cc
	src/vstart.sh
2013-12-15 16:57:23 -08:00