Commit Graph

26319 Commits

Author SHA1 Message Date
Yan, Zheng
eeb68eb33d mds: open inode by ino
This patch adds "open-by-ino" helper. It utilizes backtrace to find
inode's path and open the inode. The algorithm looks like:

1. Check MDS peers. If any MDS has the inode in its cache, goto step 6.
2. Fetch the backtrace. If a backtrace was fetched previously and the
   same backtrace is returned again, return -EIO.
3. Traverse the path in the backtrace. If the inode is found, goto step 6;
   if a non-auth dirfrag is encountered, goto the next step. If the inode
   cannot be found in its parent dir, goto step 1.
4. Request MDS peers to traverse the path in backtrace. If the inode
   is found, goto step 6. If MDS peer encounters non-auth dirfrag, it
   stops traversing. If any MDS peer fails to find the inode in its
   parent dir, goto step 1.
5. Use the same algorithm to open the inode's parent. Goto step 3 if
   that succeeds; goto step 1 if it fails.
6. Return the inode's auth MDS ID.

The algorithm has two main assumptions:
1. If an inode is in its auth MDS's cache, its on-disk backtrace
   can be out of date.
2. If an inode is not in any MDS's cache, its on-disk backtrace
   must be up to date.
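
The retry loop in the steps above can be sketched as follows. This is a minimal illustration, not the code from the patch: OpenInoOps, its three hooks, and open_ino() are hypothetical names, and steps 3-5 (local traversal, peer traversal, opening the parent) are collapsed into a single traverse hook. It only shows how the "same backtrace fetched twice" check in step 2 turns an otherwise endless loop into -EIO.

```cpp
#include <cerrno>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// One hop per ancestor: (parent directory ino, dentry name), nearest first.
using Backtrace = std::vector<std::pair<unsigned long, std::string>>;

// Hypothetical hooks standing in for the real MDS machinery.
struct OpenInoOps {
  // Step 1: rank of an MDS that has the inode in its cache, if any.
  std::function<std::optional<int>(unsigned long)> find_in_peer_caches;
  // Step 2: read the on-disk backtrace of the inode.
  std::function<Backtrace(unsigned long)> fetch_backtrace;
  // Steps 3-5 collapsed: auth MDS rank if traversal finds the inode.
  std::function<std::optional<int>(const Backtrace&)> traverse;
};

// Returns the auth MDS rank (step 6), or -EIO when the same backtrace
// is fetched twice in a row and no further progress is possible.
int open_ino(unsigned long ino, const OpenInoOps& ops) {
  std::optional<Backtrace> last_fetched;
  for (;;) {
    if (auto mds = ops.find_in_peer_caches(ino))  // step 1
      return *mds;                                // step 6
    Backtrace bt = ops.fetch_backtrace(ino);      // step 2
    if (last_fetched && *last_fetched == bt)
      return -EIO;
    last_fetched = bt;
    if (auto mds = ops.traverse(bt))              // steps 3-5
      return *mds;                                // step 6
    // Traversal failed: start over at step 1.
  }
}
```

Under the two assumptions listed above, a failed traversal means either that some MDS now has the inode cached (caught by step 1 on the next pass) or that the on-disk backtrace has changed (step 2 fetches a different one), so seeing the same backtrace again is treated as an error.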

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-28 13:57:22 +08:00
Yan, Zheng
617f70d216 mds: move fetch_backtrace() to class MDCache
We may want to fetch a backtrace while the corresponding inode isn't
instantiated. MDCache::fetch_backtrace() will be used by a later
patch.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-28 13:57:22 +08:00
Yan, Zheng
05a7588d37 mds: remove old backtrace handling
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-28 13:57:22 +08:00
Yan, Zheng
39b5e76ca4 mds: update backtraces when unlinking inodes
Unlink moves inodes to the stray dir; it is a special form of rename.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-28 13:57:22 +08:00
Yan, Zheng
b88c49b751 mds: bring back old style backtrace handling
To queue a backtrace update, the current code allocates a BacktraceInfo
structure and adds it to the log segment's update_backtraces list. The
main issue with this approach is that BacktraceInfo is independent
of the inode, so it is very inconvenient to find pending backtrace
updates for given inodes. When exporting inodes from one MDS to another
MDS, we need to find and cancel all pending backtrace updates on the
source MDS.

This patch brings back the old backtrace handling code and adapts it
for the current backtrace format. The basic idea behind the old
code is: when an inode's backtrace becomes dirty, add the inode to
the log segment's dirty_parent_inodes list.
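
A minimal sketch of why tying the pending update to the inode helps, under stated assumptions: the Inode and LogSegment types below are stripped-down stand-ins, std::set replaces whatever list type the MDS actually uses, and mark_dirty_parent()/clear_dirty_parent() are hypothetical helper names. Because the inode remembers which segment holds it, cancelling a pending update (for example before export) needs no search.

```cpp
#include <cstdint>
#include <set>

struct LogSegment;

// Hypothetical, stripped-down inode: it remembers which log segment
// (if any) currently holds its pending backtrace update.
struct Inode {
  std::uint64_t ino = 0;
  LogSegment* dirty_parent_segment = nullptr;
};

// Hypothetical log segment holding the inodes whose backtrace is dirty.
struct LogSegment {
  std::set<Inode*> dirty_parent_inodes;
};

// Queue a backtrace update: move the inode onto the given segment's list.
void mark_dirty_parent(Inode& in, LogSegment& ls) {
  if (in.dirty_parent_segment)
    in.dirty_parent_segment->dirty_parent_inodes.erase(&in);
  ls.dirty_parent_inodes.insert(&in);
  in.dirty_parent_segment = &ls;
}

// Cancel the pending update without scanning any segment.
void clear_dirty_parent(Inode& in) {
  if (!in.dirty_parent_segment)
    return;
  in.dirty_parent_segment->dirty_parent_inodes.erase(&in);
  in.dirty_parent_segment = nullptr;
}
```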

Compared to the current backtrace handling, another difference is
that the backtrace update is journalled in EMetaBlob::fullbit.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-28 13:57:22 +08:00
Yan, Zheng
c9d2e25641 mds: rename last_renamed_version to backtrace_version
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-28 13:57:22 +08:00
Yan, Zheng
6c721116fc mds: journal backtrace update in EMetaBlob::fullbit
The current way to journal a backtrace update is to set
EMetaBlob::update_bt to true. The problem is that an EMetaBlob can
include several inodes. If an EMetaBlob's update_bt is true, the journal
replay code has to queue backtrace updates for all inodes in the
EMetaBlob.

This patch adds two new flags to class EMetaBlob::fullbit, making it
able to journal backtrace updates.
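
The difference between a blob-wide boolean and a per-inode flag can be illustrated roughly as below. This is a sketch under assumptions: FullBit is a simplified stand-in for EMetaBlob::fullbit, the flag names and bit values are hypothetical, and only one of the two new flags is modelled.

```cpp
#include <cstdint>
#include <vector>

// Simplified stand-in for one journalled inode entry.
struct FullBit {
  static constexpr std::uint8_t STATE_DIRTY = 1 << 0;
  static constexpr std::uint8_t STATE_DIRTYPARENT = 1 << 1;  // hypothetical name
  std::uint64_t ino;
  std::uint8_t state;
};

// On replay, queue a backtrace update only for entries that carry the
// per-inode flag, instead of for every inode in the blob.
std::vector<std::uint64_t> backtrace_updates_on_replay(
    const std::vector<FullBit>& blob) {
  std::vector<std::uint64_t> queued;
  for (const FullBit& fb : blob) {
    if (fb.state & FullBit::STATE_DIRTYPARENT)
      queued.push_back(fb.ino);
  }
  return queued;
}
```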

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-28 13:57:22 +08:00
Yan, Zheng
03c0fe937d mds: reorder EMetaBlob::add_primary_dentry's parameters
Prepare for adding new state parameters such as 'dirty_parent'.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-28 13:57:22 +08:00
Yan, Zheng
26effc0e58 mds: warn on unconnected snap realms
When there is more than one active MDS, restarting an MDS triggers the
assertion "reconnected_snaprealms.empty()" quite often. If there
is no snapshot in the FS, the items left in reconnected_snaprealms
should be other MDS' mdsdir. I think it's harmless.

If there are snapshots in the FS, the assertion can probably catch
real bugs. But at present the snapshot feature is broken, and fixing it
is non-trivial. So replace the assertion with a warning.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-28 13:57:22 +08:00
Yan, Zheng
f3a9f4746d mds: silence MDCache::trim_non_auth()
No need to output the function's debug message to console.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-28 13:57:21 +08:00
Yan, Zheng
9424298f27 mds: fix check for base inode discovery
If a MDiscover message is for discovering base inode, want_base_dir
should be false, path should be empty.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-28 13:57:21 +08:00
Yan, Zheng
c9707f636c mds: Fix replica's allowed caps for filelock in SYNC_LOCK state
For replica, filelock in LOCK_LOCK state doesn't allow Fc cap. So
filelock in LOCK_SYNC_LOCK/LOCK_EXCL_LOCK state shouldn't allow Fc
cap either.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-28 13:57:21 +08:00
Yan, Zheng
3962a7510f mds: defer releasing cap if necessary
When an inode is freezing or frozen, we defer processing MClientCaps
messages and cap releases embedded in requests. The same deferral
logic should also cover MClientCapRelease messages.
2013-05-28 13:57:21 +08:00
Yan, Zheng
a918e611e2 mds: fix Locker::request_inode_file_caps()
After sending a cache rejoin message, the replica needs to notify the
auth MDS when cap_wanted changes. But it can send the MInodeFileCaps
message only after receiving the auth MDS' rejoin ack.
Locker::request_inode_file_caps() has the correct wait logic, but it
skips sending the MInodeFileCaps message if the auth MDS is still in
rejoin state.

The fix is to defer sending the MInodeFileCaps message until the auth MDS
is active. It makes the function's wait logic less tricky.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-28 13:57:21 +08:00
Yan, Zheng
2b1b6cae2d mds: notify auth MDS when cap_wanted changes
So the auth MDS can choose locks' states based on our cap_wanted.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-28 13:57:21 +08:00
Yan, Zheng
fc94f47b8b mds: export CInode::mds_caps_wanted
CInode::mds_caps_wanted is used to keep track of caps wanted by non-auth
MDS. The auth MDS checks it when choosing locks' states.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-28 13:57:21 +08:00
Yan, Zheng
e21f328f1a mds: export CInode::STATE_NEEDSRECOVER
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-28 13:57:21 +08:00
Yan, Zheng
882be6b1d7 mds: send slave request after target MDS is active
When the failure of a peer is detected, MDCache::handle_mds_failure()
checks if there are requests waiting for slave replies from the
failed peer, and adds them to the "wait for active peer" list.
The "retry request" logic only covers slave requests sent before
MDCache::handle_mds_failure() is called. If a slave request was
sent while the peer isn't up, we wait for its reply forever.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-28 13:57:21 +08:00
Yan, Zheng
38fb2ec78b mds: unfreeze inode after rename rollback finishes
We should not wake up the unfreeze waiter while the inode is still
linked to a non-auth dirfrag.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-28 13:57:21 +08:00
Yan, Zheng
8a1114cead mds: remove buggy cache rejoin code
I previously added code to handle a corner case of cache rejoin:
an entire subtree, together with the inode the subtree root belongs to,
was trimmed between sending cache rejoin and receiving rejoin ack.
In this case, we should send a cache expire message to the subtree's
auth MDS. But the code is completely broken, so remove it temporarily.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-28 13:57:21 +08:00
Yan, Zheng
30c68218f7 mds: fix typo in Server::do_rename_rollback
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-28 13:57:21 +08:00
Yan, Zheng
e8497f8087 mds: fix import cancel race
The current code uses import state to detect obsolete import
discover/prep messages. It does not work for this case: cancel a subtree
import, import the same subtree again, and then the discover/prep
message for the first import gets dispatched.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-28 13:57:21 +08:00
Yan, Zheng
0708d44f12 mds: fix straydn race
For unlink/rename requests, the target dentry's linkage may change
before all locks are acquired. So we need to check if the existing stray
dentry is valid.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-28 13:57:21 +08:00
Yan, Zheng
7a6ec35367 mds: fix slave commit tracking
MDS may crash after journalling a slave commit, but before sending
commit ack to the master. Later when the MDS restarts, it will not
send commit ack to the master. So the master waits for the commit
ack forever. The fix is to remove the failed MDS from requests'
uncommitted slave list. When the failed MDS recovers, its resolve
message will tell
the master which slave requests are not committed. The master will
re-add the recovering MDS to requests' uncommitted slave list if
necessary.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-28 13:57:21 +08:00
Yan, Zheng
0c1ca8edda mds: fix uncommitted master wait
We may add new waiters while the master is committing, so we should
take the waiters and wake them up when the master is committed.
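
The "take the waiters, then wake them" idea can be sketched as follows. This is an illustration only: UncommittedMaster is a simplified stand-in, waiters are modelled as plain callbacks, and finish_commit() is a hypothetical name for the point at which the master commit has been logged.

```cpp
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Simplified record of a master request whose commit is not yet logged.
struct UncommittedMaster {
  std::vector<std::function<void()>> waiters;
};

// Called once the master commit is logged: take ownership of every waiter
// registered so far (including those added while committing), drop the
// bookkeeping entry, and only then wake the waiters.
void finish_commit(std::map<unsigned long, UncommittedMaster>& masters,
                   unsigned long reqid) {
  auto it = masters.find(reqid);
  if (it == masters.end())
    return;
  std::vector<std::function<void()>> waiters = std::move(it->second.waiters);
  masters.erase(it);
  for (auto& wake : waiters)
    wake();
}
```

Moving the list out before erasing the entry means that no waiter is dropped along with the entry, and that a waiter which touches the map when woken does not invalidate the list being walked.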

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-28 13:57:21 +08:00
Yan, Zheng
5426c75d7b mds: adjust subtree auth if import aborts in PREPPED state
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-28 13:57:21 +08:00
Yan, Zheng
d7b999be1b mds: don't stop at export bounds when journaling dir context
We only journal the completion of a subtree export, so we shouldn't
treat export bounds as subtree roots.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-28 13:57:21 +08:00
Yan, Zheng
81d073fecb mds: fix underwater dentry cleanup
If the underwater dentry is a remote link, we shouldn't mark the
inode clean.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-28 13:57:20 +08:00
Yan, Zheng
8b4e9911a4 mds: journal new subtrees created by rename
This avoids creating bare dirfrags during journal replay.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-28 13:57:20 +08:00
Sage Weil
a6df7644b6 PendingReleaseNotes: notes about enabling HASHPSPOOL
Signed-off-by: Sage Weil <sage@inktank.com>
2013-05-27 21:17:06 -07:00
Sage Weil
aa0649c66b osdmaptool: fix cli tests
Now that the default pool flags have changed.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-05-27 21:17:04 -07:00
Sage Weil
f0958c36fd Merge pull request #321 from dalgaaf/wip-da-CID-727981
kv_flat_btree_async.cc: fix AioCompletion resource leak
2013-05-27 13:55:54 -07:00
Sage Weil
35a8c6160c Merge pull request #320 from dalgaaf/wip-da-CID-727983
kv_flat_btree_async.cc: fix resource leak
2013-05-27 13:55:24 -07:00
John Wilkins
615b54c6e4 doc: Updated rgw.conf example.
fixes: #4608

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-05-25 15:13:01 -07:00
John Wilkins
6f935419e6 doc: Updated RGW Quickstart.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-05-25 15:11:49 -07:00
John Wilkins
e59897c8b2 doc: Updated index for newer terms.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-05-25 15:11:06 -07:00
Samuel Just
6d1e14e045 pg_pool_t: enable FLAG_HASHPSPOOL by default
Fixes: #5160
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-05-24 16:21:48 -07:00
Danny Al-Gaaf
0f5474834a kv_flat_btree_async.cc: fix AioCompletion resource leak
Call AioCompletion::release() if the completion is no longer
needed to free the resources.

CID 727981 (#3 of 3): Resource leak (RESOURCE_LEAK)
  leaked_storage: Variable "top_aioc" going out of scope leaks the
  storage it points to.

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
2013-05-24 14:50:34 +02:00
Danny Al-Gaaf
7b438e131b kv_flat_btree_async.cc: fix resource leak
Call AioCompletion::release() if the completion is no longer
needed to free the resources.

CID 727983 : Resource leak (RESOURCE_LEAK)
  leaked_storage: Variable "aioc" going out of scope leaks the
  storage it points to.

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
2013-05-24 14:43:17 +02:00
Danny Al-Gaaf
9785478a2a ceph-disk: remove unnecessary semicolons
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
2013-05-24 12:46:15 +02:00
Danny Al-Gaaf
16ecae153d ceph-disk: cast output of _check_output()
Cast output of _check_output() to str() to be able to use
str.split().

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
2013-05-24 12:41:11 +02:00
Danny Al-Gaaf
9429ff90a0 ceph-disk: fix undefined variable
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
2013-05-24 12:33:16 +02:00
Danny Al-Gaaf
c127745cc0 ceph-disk: add missing spaces around operator
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
2013-05-24 12:29:07 +02:00
Samuel Just
8c1c2d98c6 Merge branch 'wip_scrub_tphandle' into next
Fixes: #5159
Reviewed-by: Sage Weil <sage@inktank.com>
2013-05-23 20:08:54 -07:00
Samuel Just
86822485e5 PG: ping tphandle during omap loop as well
Signed-off-by: Samuel Just <sam.just@inktank.com>
2013-05-23 19:42:32 -07:00
Samuel Just
d62716dd4c PG: reset timeout in _scan_list for each object, read chunk
Signed-off-by: Samuel Just <sam.just@inktank.com>
2013-05-23 19:42:32 -07:00
Samuel Just
b8a25e08a6 OSD,PG: pass tphandle down to _scan_list
Signed-off-by: Samuel Just <sam.just@inktank.com>
2013-05-23 19:42:32 -07:00
John Wilkins
bb407bfd10 doc: Updated Ceph FS Quick Start.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-05-23 17:02:17 -07:00
John Wilkins
7c497d95db doc: Added troubleshooting to Ceph FS index.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-05-23 17:01:51 -07:00
John Wilkins
3dda794a66 doc: Added separate troubleshooting for MDS and Ceph FS.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-05-23 17:01:29 -07:00