Commit Graph

25144 Commits

Author SHA1 Message Date
Yan, Zheng
ce0b74e55e mds: encode dirfrag base in cache rejoin ack
Cache rejoin ack message already encodes inode base, make it also encode
dirfrag base. This allows the message to replicate stray dentries like
the MDentryUnlink message. The function will be used by a later patch.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-04-01 09:24:41 -07:00
Yan, Zheng
9f66d0454f mds: include replica nonce in MMDSCacheRejoin::inode_strong
So the recovering MDS can properly handle cache expire messages.
Also increase the nonce value when sending the cache rejoin acks.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>

Also update the MMDSCacheRejoin encoding to the new format.
Signed-off-by: Greg Farnum <greg@inktank.com>
2013-04-01 09:22:38 -07:00
Yan, Zheng
01fd55a64c mds: remove MDCache::rejoin_fetch_dirfrags()
In commit 77946dcdae (mds: fetch missing inodes from disk), I introduced
MDCache::rejoin_fetch_dirfrags(). But it basically duplicates the function
of MDCache::open_undef_dirfrags(), so just remove rejoin_fetch_dirfrags()
and make open_undef_dirfrags() also handle undefined inodes.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-04-01 09:17:19 -07:00
Yan, Zheng
e62e48bb32 mds: fix MDS recovery involving cross authority rename
In an MDS cluster, a rename operation may involve multiple MDS. Suppose
the rename source's auth MDS crashes after some witness MDS have prepared
the rename but before the rename is committing. When that MDS later
recovers, its subtree map and linkages differ from those of the prepared
MDS. This causes problems for both subtree resolve and cache rejoin.
The solution: if the rename source's auth MDS fails, the prepared
witness MDS ask the master MDS whether the operation is committing. If
it is not, they roll back the rename, then send a resolve message to the
recovering MDS.

A similar case is when a prepared witness MDS crashes while the
rename source's auth MDS has prepared or is preparing the operation.
When the witness recovers, the master simply delays sending the resolve
ack message until it commits the operation.

This patch also updates Server::handle_client_rename() so that preparing
the rename source's auth MDS is the final step before committing the
rename.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-04-01 09:17:19 -07:00
Yan, Zheng
3ab86637b3 mds: send resolve acks after master updates are safely logged
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-04-01 09:17:19 -07:00
Yan, Zheng
75346d8f3d mds: send cache rejoin messages after gathering all resolves
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-04-01 09:17:19 -07:00
Yan, Zheng
97bc0d26e6 mds: don't send MDentry{Link,Unlink} before receiving cache rejoin
The active MDS calls MDCache::rejoin_scour_survivor_replicas() when it
receives the cache rejoin message. The function will remove the objects
replicated by MDentry{Link,Unlink} from replica map.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-04-01 09:17:19 -07:00
Yan, Zheng
e381bb3930 mds: set resolve/rejoin gather MDS set in advance
An active MDS may receive a resolve/rejoin message before receiving
the mdsmap message that claims the MDS cluster is in resolving/rejoining
state. So instead of setting the gather MDS set when receiving the mdsmap,
set it in advance when detecting the MDS failure.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-04-01 09:17:19 -07:00
Yan, Zheng
ed85dd61a5 mds: don't send resolve message between active MDS
When the MDS cluster is resolving, the current behavior is to send a subtree
resolve message to all other MDS and wait for all other MDS' resolve messages.
The problem is that active MDS can have different subtree maps due to rename.
Besides, gathering active MDS' resolve messages is also racy. The only
function of these messages is to disambiguate other MDS' imports. We can
replace them with an import finish notification.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-04-01 09:17:19 -07:00
Yan, Zheng
30dbb1d4e5 mds: compose and send resolve messages in batch
Resolve messages for all MDS are the same, so we can compose and
send them in batch.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-04-01 09:17:19 -07:00
Yan, Zheng
a6d9eb8c58 mds: don't delay processing replica buffer in slave request
Replicated objects need to be added into the cache immediately

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-04-01 09:17:19 -07:00
Yan, Zheng
131271655f mds: unify slave request waiting
When requesting remote xlock or remote wrlock, the master request is
put into lock object's REMOTEXLOCK waiting queue. The problem is that
remote wrlock's target can be different from lock's auth MDS. When
the lock's auth MDS recovers, MDCache::handle_mds_recovery() may wake
the wrong request. So just unify slave request waiting: dispatch the
master request when receiving the slave request reply.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-04-01 09:17:19 -07:00
Yan, Zheng
ef9a4f6605 mds: defer eval gather locks when removing replica
Locks' states should not change between composing the cache rejoin ack
messages and sending the message. If Locker::eval_gather() is called
in MDCache::{inode,dentry}_remove_replica(), it may wake requests and
change locks' states.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-04-01 09:17:09 -07:00
Yan, Zheng
12e7c3d171 mds: avoid sending duplicated table prepare/commit
This patch makes the table client defer sending table prepare/commit messages
until it receives the table server's 'ready' message. This avoids duplicated
table prepare/commit messages.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-04-01 09:16:59 -07:00
Yan, Zheng
a5dce808b5 mds: make sure table request id unique
When an MDS becomes active, the table server re-sends 'agree' messages
for old prepared requests. If the recovered MDS starts a new table request
at the same time, the new request's ID can happen to be the same as an old
prepared request's ID, because the current table client code assigns request
IDs from zero after the MDS restarts.

This patch makes the table server send 'ready' messages when table clients
become active or when it becomes active itself. The 'ready' message updates
the table client's last_reqid to avoid request ID collisions. The message
also replaces the roles of the finish_recovery() and handle_mds_recovery()
callbacks for the table client.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-04-01 09:16:48 -07:00
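
Note on a5dce808b5: the request ID collision described above can be illustrated with a minimal sketch. TableClientSketch, handle_ready() and new_reqid() are hypothetical names, not Ceph's table client API, and the sketch assumes the 'ready' message carries the highest request ID the server has already seen from this client.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a table client; not Ceph's actual class.
struct TableClientSketch {
  uint64_t last_reqid = 0;    // starts from zero again after a restart
  bool server_ready = false;  // set once the server's 'ready' arrives

  // Assumed shape of the 'ready' handler: highest_used_reqid is the
  // largest ID the server still holds a prepared request for.
  void handle_ready(uint64_t highest_used_reqid) {
    if (last_reqid < highest_used_reqid)
      last_reqid = highest_used_reqid;
    server_ready = true;
  }

  // Allocate an ID for a new request. Requests are deferred until
  // 'ready', so allocating earlier is treated as a programming error.
  uint64_t new_reqid() {
    assert(server_ready);
    return ++last_reqid;
  }
};
```

With this shape, a client that restarted and then learns the server still holds request 41 allocates 42 next instead of reusing a low ID.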
Yan, Zheng
bb83a5d63c mds: consider MDS as recovered when it reaches clientreplay state.
An MDS in clientreplay state has already started serving requests. This
also makes MDS::handle_mds_recovery() and MDS::recovery_done() match.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-04-01 09:16:36 -07:00
Yan, Zheng
4ad35b2a83 mds: mark connection down when MDS fails
So if the MDS restarts and uses the same address, it does not get
old messages.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-03-31 16:57:14 +08:00
Yan, Zheng
fbcc64dffd mds: fix MDCache::adjust_bounded_subtree_auth()
There are cases that need to both create a new bound and swallow an
intervening subtree. For example: an MDS exports subtree A with bound B and
imports subtree B with bound C at the same time. The MDS crashes, exporting
subtree A fails, but importing subtree B succeeds. During recovery, the
MDS may create new bound C and swallow subtree B.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-03-31 16:57:14 +08:00
Yan, Zheng
573a4ae1a2 mds: process finished contexts in batch
If there are several unstable locks in an inode, the current Locker::eval(CInode*,)
processes each lock's finished contexts separately. This may cause a very deep
call stack if finished contexts also call Locker::eval() on the same inode.
An extreme example is:

Locker::eval() wakes an open request. Server::handle_client_open() starts
a log entry, then calls Locker::issue_new_caps(). Locker::issue_new_caps()
calls Locker::eval() and wakes another request. The latter request also tries
to start a log entry.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-03-31 16:57:14 +08:00
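
Note on 573a4ae1a2: the batching idea can be sketched as follows. This is only an illustration under assumed names (LockSketch, take_waiting(), eval_locks_batched() are hypothetical, not the Locker API): every lock is evaluated first and its woken contexts are collected into one list, and the contexts run only after the loop has finished.

```cpp
#include <cassert>
#include <functional>
#include <list>
#include <utility>
#include <vector>

using Context = std::function<void()>;

// Hypothetical lock holding contexts that become runnable on evaluation.
struct LockSketch {
  std::vector<Context> waiters;

  // Move all waiting contexts into the caller's list without running them.
  void take_waiting(std::list<Context>& out) {
    for (auto& c : waiters)
      out.push_back(std::move(c));
    waiters.clear();
  }
};

// Collect finished contexts from every lock, then run them in one batch.
// No context runs while the locks are still being walked.
void eval_locks_batched(std::vector<LockSketch>& locks) {
  std::list<Context> finished;
  for (auto& l : locks)
    l.take_waiting(finished);

  while (!finished.empty()) {
    Context c = std::move(finished.front());
    finished.pop_front();
    c();
  }
}
```

Running the contexts from a single flat loop keeps them out of the per-lock evaluation path, which is where the nested calls described above would otherwise pile up.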
Yan, Zheng
5cbaae6648 mds: preserve subtree bounds until slave commit
When replaying an operation that renames a directory inode to a non-auth subtree,
if the inode has subtree bounds, we should prevent them from being trimmed
until slave commit.

This patch also fixes a bug in ESlaveUpdate::replay(). EMetaBlob::replay()
should be called before MDCache::finish_uncommitted_slave_update().

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-03-31 16:57:14 +08:00
Sage Weil
ce8793ce3b Merge pull request #175 from dachary/wip-4594
fix null character in object name triggering segfault

Reviewed-by: Sage Weil <sage@inktank.com>
2013-03-30 18:22:01 -07:00
Loic Dachary
c344ff170d fix null character in object name triggering segfault
Parsing \n in lfn_parse_object_name is implemented with

  out->append('\0');

which segfaults when using libstdc++ and g++ version 4.6.3 on Debian
GNU/Linux. It is replaced with

  (*out) += '\0';

to avoid the buggy implicit conversion. There is no append(charT)
method in C++98 or C++11, which means the call relies on an implicit
conversion that is buggy. It is better to rely on the
basic_string& operator+=(charT c); method as defined in ISO 14882-1998
(page 385) through ISO 14882-2012 (page 640).

A set of tests is added to generate and parse object names. They need
access to the private function lfn_parse_object_name because there is
no convenient protected method to exercise it. The tests contain a
LFNIndex derived class, TestWrapLFNIndex which is made a friend of
LFNIndex to gain access to the private methods.

http://tracker.ceph.com/issues/4594 refs #4594

Signed-off-by: Loic Dachary <loic@dachary.org>
2013-03-30 14:28:34 +01:00
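
Note on c344ff170d: a minimal sketch of the two techniques the message describes, appending a NUL with operator+=(char) and exposing a private parser to tests through a friend wrapper class. IndexSketch, TestWrapIndexSketch and parse_name() are hypothetical stand-ins, not the LFNIndex code, and the escape rule (backslash followed by 'n' decodes to a NUL byte) is an assumption made for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the index class; not Ceph's LFNIndex.
class IndexSketch {
  friend class TestWrapIndexSketch;  // lets the test wrapper reach privates

 private:
  // Assumed escape rule: the two characters '\' 'n' decode to one NUL byte.
  static void parse_name(const std::string& in, std::string* out) {
    for (std::string::size_type i = 0; i < in.size(); ++i) {
      if (in[i] == '\\' && i + 1 < in.size() && in[i + 1] == 'n') {
        (*out) += '\0';  // operator+=(char): no implicit conversion involved
        ++i;             // skip the 'n' that was just consumed
      } else {
        (*out) += in[i];
      }
    }
  }
};

// Test-only wrapper: derives from the class and forwards to the private
// function, which it may call because it is declared a friend.
class TestWrapIndexSketch : public IndexSketch {
 public:
  static std::string parse(const std::string& in) {
    std::string out;
    parse_name(in, &out);
    return out;
  }
};
```

A test can then call TestWrapIndexSketch::parse() directly and check both the length of the result and the embedded NUL byte, without making the parser itself public.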
Sage Weil
2b8eb31b85 Merge branch 'wip-4490'
2013-03-29 18:02:15 -07:00
Sage Weil
e611937f3e mon: OSDMonitor: add 'osd pool set-quota' command
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-03-29 17:59:35 -07:00
John Wilkins
95328089b8 doc: Added entries for Pool, PG, & CRUSH. Moved heartbeat link.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-03-29 17:38:48 -07:00
John Wilkins
bcc5c65305 doc: Added heartbeat configuration settings.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-03-29 17:38:02 -07:00
John Wilkins
6157d68369 doc: Moved PG info to separate page. Moved heartbeat to mon-osd doc.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-03-29 17:36:23 -07:00
John Wilkins
ca77aabbf1 doc: Rewrote monitor configuration section.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-03-29 17:34:45 -07:00
John Wilkins
ea3c833d0f doc: Moved to separate section for parallelism.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-03-29 17:32:47 -07:00
John Wilkins
ba73b8301a doc: Cleanup.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-03-29 17:32:00 -07:00
Sage Weil
e9b3f2e6e9 ceph-disk list: say 'unknown cluster $UUID' when cluster is unknown
This makes it clearer that an old osd is in fact old.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-03-29 17:30:28 -07:00
Greg Farnum
9e7ddf677f config_opts: fix rgw_port comments to be plaintext
Signed-off-by: Greg Farnum <greg@inktank.com>
2013-03-29 17:05:41 -07:00
Samuel Just
3da3129e07 ReplicatedPG: check for full if delta_stats.num_bytes > 0
Signed-off-by: Samuel Just <sam.just@inktank.com>
2013-03-29 16:47:29 -07:00
Joao Eduardo Luis
9b09073259 mon: Monitor: check if 'pss' arg is !NULL on parse_pos_long()
We already do it all throughout the function, but this one place didn't.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-03-29 16:47:29 -07:00
Joao Eduardo Luis
e2a936d2ae common: util: add 'unit_to_bytesize()' function
Converts a numerical value that may or may not contain a unit
modifier ('1024', '1K', '2M', ..., '1E') and returns the parsed size
in bytes.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-03-29 16:47:28 -07:00
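
Note on e2a936d2ae: a rough sketch of this kind of conversion, not the function added by the commit. parse_size_sketch() is a hypothetical name. The sketch assumes binary multiples (1K = 1024 bytes), a single uppercase suffix from K, M, G, T, P, E, and a return value of -1 for malformed or overflowing input.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <cstdlib>
#include <string>

// Hypothetical parser: "1024" -> 1024, "1K" -> 1024, "2M" -> 2097152.
// Returns -1 on malformed input or if the result does not fit in int64_t.
int64_t parse_size_sketch(const std::string& s) {
  if (s.empty() || !std::isdigit(static_cast<unsigned char>(s[0])))
    return -1;

  char* end = nullptr;
  unsigned long long n = std::strtoull(s.c_str(), &end, 10);

  int shift = 0;  // power of two implied by the suffix, if any
  if (*end != '\0') {
    switch (*end) {
      case 'K': shift = 10; break;
      case 'M': shift = 20; break;
      case 'G': shift = 30; break;
      case 'T': shift = 40; break;
      case 'P': shift = 50; break;
      case 'E': shift = 60; break;
      default: return -1;  // unknown unit modifier
    }
    if (*(end + 1) != '\0')
      return -1;  // trailing characters after the modifier
  }

  // Reject values whose shifted result would exceed INT64_MAX.
  if (n > (static_cast<unsigned long long>(INT64_MAX) >> shift))
    return -1;
  return static_cast<int64_t>(n << shift);
}
```

Checking the bound before shifting avoids relying on wraparound, so an input such as '8E' is rejected rather than silently turned into a negative size.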
Joao Eduardo Luis
23c2fa7fc2 osd: osd_types: add pool quota related fields
2013-03-29 16:03:21 -07:00
Sage Weil
562e1716bd ceph-disk: handle missing journal_uuid field gracefully
Only lower if we know it's not None.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-03-29 13:59:04 -07:00
Josh Durgin
b504e444fc Merge remote branch 'origin/next'
2013-03-29 12:58:01 -07:00
Josh Durgin
95c4a81be1 Merge pull request #170 from ceph/wip-rbd-aio-flush
Reviewed-by: Sage Weil <sage.weil@inktank.com>
2013-03-29 13:20:32 -07:00
Josh Durgin
4c4d5591bd librados: move snapc creation to caller for aio_operate
The common case already has a snapshot context, so avoid duplicating
it (copying a potentially large vector) in IoCtxImpl::aio_operate().

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-03-29 12:47:17 -07:00
Sage Weil
43e451f6ee Merge pull request #166 from ceph/wip-disk-list
Wip disk list

Reviewed-by: Dan Mick <dan.mick@inktank.com>
2013-03-29 12:24:47 -07:00
Yan, Zheng
3cbd0366b7 client: update cap->implemented when handling revoke
Fixes #4578

Tested-by: Noah Watkins <noahwatkins@gmail.com>
2013-03-29 11:26:01 -07:00
athanatos
f9c3bba374 Merge pull request #161 from dachary/wip-4560
unit tests for LFNIndex
2013-03-29 10:50:55 -07:00
Greg Farnum
4f8ba0e775 msgr: allow users to mark_down a NULL Connection*
Signed-off-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sam Just <sam.just@inktank.com>
2013-03-29 10:42:04 -07:00
Sage Weil
f8682cb8a7 Merge pull request #150 from ceph/wip-4313
mon: ConfigKeyService: stash config keys on the monitor

Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2013-03-29 10:24:53 -07:00
Sage Weil
2fa16422f3 Merge pull request #171 from Elbandi/master
Run wrap-and-sort and add git to build deps

Reviewed-by: Sage Weil <sage@inktank.com>
2013-03-29 08:38:22 -07:00
Sage Weil
999b307af5 Merge pull request #172 from ceph/wip-ceph-json
Wip ceph json

Reviewed-by: Sage Weil <sage@inktank.com>
2013-03-29 08:37:04 -07:00
Andras Elso
2da57d7675 debian: Add git to Build-Depends (needed by check_version script)
Signed-off-by: Andras Elso <elso.andras@gmail.com>
2013-03-29 13:34:54 +01:00
Andras Elso
8f5c665744 debian: Run wrap-and-sort from devscripts
Signed-off-by: Andras Elso <elso.andras@gmail.com>
2013-03-29 13:34:48 +01:00
Loic Dachary
972f0eb0ac unit test LFNIndex::remove_object and LFNIndex::lfn_unlink
When the object name is short, check that the corresponding file is
::unlink()ed. When the object name is long, there may be multiple files
with the same name, modulo the anti-collision number showing just before
the FILENAME_COOKIE. The following scenarios are tested:

 * there is only one file

 * there are multiple files and the last one is removed

 * there are multiple files and the last one is moved in place of the
   file that is to be removed

lfn_unlink and remove_object are tested together because
lfn_unlink is a private function and remove_object is a protected function
that does very little besides calling lfn_unlink.

http://tracker.ceph.com/issues/4560 refs #4560

Signed-off-by: Loic Dachary <loic@dachary.org>
2013-03-29 09:46:14 +01:00