Commit Graph

23375 Commits

Author SHA1 Message Date
Sage Weil
28d59d374b os/FileStore: fix non-btrfs op_seq commit order
The op_seq file is the starting point for journal replay.  For stable btrfs
commit mode, which is using a snapshot as a reference, we should write this
file before we take the snap.  We normally ignore current/ contents anyway.

On non-btrfs file systems, however, we should only write this file *after*
we do a full sync, and we should then fsync(2) it before we continue
(and potentially trim anything from the journal).

This fixes a serious bug that could cause data loss and corruption after
a power loss event.  For a 'kill -9' or crash, however, there was little
risk, since the writes were still captured by the host's cache.

Fixes: #3721
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
2013-01-03 17:15:07 -08:00
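The ordering described for non-btrfs file systems in the commit above can be sketched roughly as follows. This is a minimal illustration, not the FileStore code: the helper name `commit_op_seq`, the on-disk file name, and the injectable `full_sync` / `trim_journal` callbacks are all assumptions made for the example.

```python
import os
import tempfile


def commit_op_seq(current_dir, seq, full_sync=os.sync, trim_journal=lambda seq: None):
    """Hypothetical helper: persist `seq` as the journal replay starting point."""
    # 1. Flush all outstanding data writes first (full sync).
    full_sync()

    # 2. Only after the sync, record the sequence number.
    #    The file name here is illustrative only.
    path = os.path.join(current_dir, "commit_op_seq")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, f"{seq}\n".encode())
        # 3. Force the file itself to stable storage before continuing.
        os.fsync(fd)
    finally:
        os.close(fd)

    # 4. Journal entries up to `seq` may now be trimmed.
    trim_journal(seq)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(commit_op_seq(d, 42))
```

The point of the sketch is the order of the steps: if the sequence file were written before the sync, a power loss could leave a replay starting point that is newer than the data actually on disk.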
John Wilkins
f1e0305f0d doc: Removed the --without-tcmalloc flag until further advised.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-01-03 16:13:13 -08:00
Sage Weil
19df20867d Merge pull request #30 from rca/master
Minor clarification in docs.
2013-01-03 16:07:59 -08:00
John Wilkins
88af7d182a doc: Added defaults for PGs, links to recommended settings, and updated note on splitting.
Fixes: #3555

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-01-03 14:51:33 -08:00
Samuel Just
4ae4dce5c5 OSD: for old osds, dispatch peering messages immediately
Normally, we batch up peering messages until the end of
process_peering_events to allow us to combine many notifies, etc
to the same osd into the same message.  However, old osds assume
that the activation message (log or info) will be _dispatched_
before the first sub_op_modify of the interval.  Thus, for those
peers, we need to send the peering messages before we drop the
pg lock, lest we issue a client repop from another thread before
the activation message is sent.

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-01-03 14:18:00 -08:00
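The batching rule described in the commit above might look something like the following. It is a simplified sketch under stated assumptions: `PeeringSender`, its `transport` callback, and the `old_osds` set are hypothetical names invented for illustration and do not correspond to the OSD's real classes.

```python
from collections import defaultdict


class PeeringSender:
    """Hypothetical sketch: batch peering messages per OSD, except for old peers."""

    def __init__(self, transport, old_osds):
        self.transport = transport      # callable(osd_id, list_of_messages)
        self.old_osds = set(old_osds)   # peers that need immediate dispatch
        self.pending = defaultdict(list)

    def queue(self, osd_id, message):
        if osd_id in self.old_osds:
            # Old peers expect the activation message before any sub-op,
            # so send it right away, while the caller still holds the PG lock.
            self.transport(osd_id, [message])
        else:
            # Newer peers tolerate batching; combine messages per OSD.
            self.pending[osd_id].append(message)

    def flush(self):
        # Called once at the end of event processing: one send per OSD.
        for osd_id, messages in self.pending.items():
            self.transport(osd_id, messages)
        self.pending.clear()
```

Messages for newer peers are still combined into a single send per OSD, so the batching benefit is kept wherever it is safe.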
John Wilkins
73bc8ffc90 doc: Added comments on --without-tcmalloc option when building Ceph.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-01-03 13:30:14 -08:00
rca
37b57cdf0f Update doc/rados/configuration/filesystem-recommendations.rst
Clarified when it's necessary to use the setting:

filestore xattr use omap = true
2013-01-03 13:30:01 -08:00
John Wilkins
43ef6772eb doc: Added some packages to the copyable line.
Fixes: #3686

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-01-03 13:29:20 -08:00
John Wilkins
333ae82c61 doc: Fixed syntax error.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-01-03 13:28:06 -08:00
Sage Weil
7e94f6f1a7 Merge remote-tracking branch 'gh/wip-3714-b' into next
Signed-off-by: Samuel Just <sam.just@inktank.com>
2013-01-03 12:53:07 -08:00
David Zafman
224a33bb3b qa/workunit: Add dbench-short.sh for nfs suite
A multi-client dbench run doesn't work over NFS,
see bug #3718.  Make single client dbench available.

Signed-off-by: David Zafman <david.zafman@inktank.com>
2013-01-03 12:44:19 -08:00
Sage Weil
a32d6c5dca osd: move common active vs booting code into consume_map
Push osdmaps to PGs in separate method from activate_map() (whose name
is becoming less and less accurate).

Signed-off-by: Sage Weil <sage@inktank.com>
2013-01-02 22:39:10 -08:00
Sage Weil
0bfad8ef20 osd: let pgs process map advances before booting
The OSD deliberately consumes and processes most OSDMaps from while it
was down before it marks itself up, as this can be slow.  The new
threading code does this asynchronously in peering_wq, though, and
does not let it drain before booting the OSD.  The OSD can get into
a situation where it marks itself up but is not responsive or useful
because of the backlog, and only makes the situation worse by
generating more osdmaps as a result.

Fix this by calling activate_map() even when booting, and when booting
draining the peering_wq on each call.  This is harmless since we are
not yet processing actual ops; we only need to be async when active.

Fixes: #3714
Signed-off-by: Sage Weil <sage@inktank.com>
2013-01-02 22:20:06 -08:00
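The "drain the work queue on each call while booting" idea from the commit above can be illustrated with a small sketch. The `PeeringQueue` class and the `activate_map` signature shown here are assumptions for the example; only the names `activate_map` and `peering_wq` come from the commit message.

```python
import queue
import threading


class PeeringQueue:
    """Hypothetical stand-in for peering_wq: a worker thread consuming map epochs."""

    def __init__(self):
        self._q = queue.Queue()
        self.processed = []
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            epoch = self._q.get()
            self.processed.append(epoch)   # stand-in for real per-PG map processing
            self._q.task_done()

    def submit(self, epoch):
        self._q.put(epoch)

    def drain(self):
        # Block until every submitted item has been fully processed.
        self._q.join()


def activate_map(wq, epoch, booting):
    wq.submit(epoch)
    if booting:
        # No client ops are being served yet, so waiting here is harmless
        # and prevents a backlog from building up before the OSD marks itself up.
        wq.drain()
```

Once the OSD is active, the `booting` flag would be false and processing stays asynchronous.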
Sage Weil
5fc94e89a9 osd: drop oldest_last_clean from activate_map
Signed-off-by: Sage Weil <sage@inktank.com>
2013-01-02 22:04:34 -08:00
Sage Weil
67f7ee6799 osd: drop unused variables from activate_map
Signed-off-by: Sage Weil <sage@inktank.com>
2013-01-02 22:04:08 -08:00
Sage Weil
a14a36ed78 OSDMap: fix modifed -> modified typo
Signed-off-by: Sage Weil <sage@inktank.com>
2013-01-02 21:09:07 -08:00
Sage Weil
6b5a89d237 Merge remote-tracking branch 'gh/next'
2013-01-02 18:13:25 -08:00
Sage Weil
43cba617aa log: fix locking typo/stupid for dump_recent()
We weren't locking m_flush_mutex properly, which in turn was leading to
racing threads calling dump_recent() and garbling the crash dump output.

Backport: bobtail, argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2013-01-02 17:01:32 -08:00
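To illustrate why the lock matters in the commit above, here is a minimal sketch in which a single lock is held for the whole dump so that concurrent callers cannot interleave their output. The `RecentLog` class and its fields are hypothetical; only `dump_recent` and the idea of a flush mutex are taken from the commit message.

```python
import threading


class RecentLog:
    """Hypothetical sketch of a log that keeps recent entries for crash dumps."""

    def __init__(self):
        self._flush_lock = threading.Lock()   # plays the role of m_flush_mutex
        self._recent = []
        self.output = []                      # stand-in for the dump destination

    def add(self, entry):
        with self._flush_lock:
            self._recent.append(entry)

    def dump_recent(self, tag):
        # Hold the lock for the entire dump; without it, two threads
        # dumping at once would interleave their lines.
        with self._flush_lock:
            self.output.append(f"--- begin dump {tag} ---")
            for entry in self._recent:
                self.output.append(f"{tag}: {entry}")
            self.output.append(f"--- end dump {tag} ---")
```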
John Wilkins
29ff87a573 Merge branch 'master' of https://github.com/ceph/ceph
2013-01-02 15:59:59 -08:00
John Wilkins
64d2760a49 doc: Added a memory profiling section. Ported from the wiki.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-01-02 15:58:03 -08:00
John Wilkins
5066abf189 doc: Added memory profiling to the index.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-01-02 15:57:22 -08:00
Sam Lang
0e9a0cd7b8 qa/workunit: Update pjd script to use new tarball
The pjd script now uses the latest version of pjd
with an additional test for opening a non-existent
file.

Signed-off-by: Sam Lang <sam.lang@inktank.com>
2013-01-02 17:08:37 -06:00
Sam Lang
d8940d15c3 fuse: Fix cleanup code path on init failure
With the changes from 856f32ab, the cfuse.init call returns
a _positive_ errno, which was getting ignored.  Also, if an
error occurs during cfuse.init(), we need to tear down the client
mount.

Signed-off-by: Sam Lang <sam.lang@inktank.com>
2013-01-02 16:38:28 -06:00
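The error-handling pattern described in the commit above can be sketched as follows. The function `start_fuse` and the `client` / `fuse` objects with `mount()`, `unmount()` and `init()` methods are hypothetical stand-ins, not the real ceph-fuse interfaces.

```python
def start_fuse(client, fuse):
    """Hypothetical sketch: returns 0 on success or a positive errno on failure."""
    client.mount()
    r = fuse.init()
    if r != 0:
        # init() reports failure as a *positive* errno, so a check such as
        # `r < 0` would silently treat the failure as success.
        client.unmount()   # tear down the client mount on the error path
        return r
    return 0
```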
Josh Durgin
c4370ff03f librbd: establish watch before reading header
This eliminates a window in which a race could occur when we have an
image open but no watch established. The previous fix (using
assert_version) did not work well with resend operations.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-02 14:15:34 -08:00
Sage Weil
9a1cf51888 Merge branch 'wip-journal-aio' into next
Reviewed-by: Samuel Just <sam.just@inktank.com>
Backport: bobtail
2013-01-02 13:42:22 -08:00
Sage Weil
483c6f76ad test_filejournal: optionally specify journal filename as an argument
Signed-off-by: Sage Weil <sage@inktank.com>
2013-01-02 13:39:05 -08:00
Sage Weil
c461e7fc1e test_filejournal: test journaling bl with >IOV_MAX segments
Signed-off-by: Sage Weil <sage@inktank.com>
2013-01-02 13:39:05 -08:00
Sage Weil
dda7b65189 os/FileJournal: limit size of aio submission
Limit size of each aio submission to IOV_MAX-1 (to be safe).  Take care to
only mark the last aio with the seq to signal completion.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-01-02 13:39:05 -08:00
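As a rough illustration of the commit above, the following sketch splits a list of buffer segments into submissions of at most IOV_MAX-1 segments and attaches the sequence number only to the last one. The function name `split_aio`, the default `iov_max` of 1024, and the use of `None` for "no seq" are assumptions for the example.

```python
def split_aio(segments, seq, iov_max=1024):
    """Hypothetical sketch: return a list of (batch, seq_or_None) submissions."""
    limit = iov_max - 1   # stay one below the limit, to be safe
    batches = [segments[i:i + limit] for i in range(0, len(segments), limit)]

    submissions = []
    for idx, batch in enumerate(batches):
        is_last = idx == len(batches) - 1
        # Only the final submission carries the seq, so completion of the
        # whole write is signalled exactly once, after all parts are done.
        submissions.append((batch, seq if is_last else None))
    return submissions
```

With 2500 segments and the assumed limit of 1023, this yields three submissions of 1023, 1023 and 454 segments, and only the third is tagged with the seq.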
Josh Durgin
e0858fa899 Revert "librbd: ensure header is up to date after initial read"
Using assert version for linger ops doesn't work with retries,
since the version will change after the first send.
This reverts commit e177680903.

Conflicts:

	qa/workunits/rbd/watch_correct_version.sh
2013-01-02 12:32:33 -08:00
John Wilkins
82297706da doc: Minor edits.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-01-02 11:24:39 -08:00
John Wilkins
d3b9803eab doc: Fixed typo, clarified usage.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-01-02 11:15:16 -08:00
Yan, Zheng
8422474320 mds: fix rename inode exporter check
Using "srcdn->is_auth() && destdnl->is_primary()" to check whether the
MDS is the inode exporter of a rename operation is not reliable, because
an OP_FINISH slave request may race with subtree import. The fix is to
use a variable in MDRequest to indicate whether the MDS is the inode
exporter.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:40 +08:00
Yan, Zheng
5e8642a82e mds: call maybe_eval_stray after removing a replica dentry
MDCache::handle_cache_expire() processes dentries after inodes, so the
MDCache::maybe_eval_stray() in MDCache::inode_remove_replica() always
fails to remove stray inode because MDCache::eval_stray() checks if the
stray inode's dentry is replicated.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:40 +08:00
Yan, Zheng
f5ea5c36a4 mds: don't defer processing caps if inode is auth pinned
We should not defer processing caps if the inode is auth pinned by MDRequest,
because the MDRequest may change lock state of the inode later and wait for
the deferred caps.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:40 +08:00
Yan, Zheng
fe5936b158 mds: remove unnecessary is_xlocked check
Locker::foo_eval() is always called for stable locks, so no need to
check if the lock is xlocked.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:40 +08:00
Yan, Zheng
b2d5005aa0 mds: fix lock state transition check
Locker::simple_excl() and Locker::scatter_mix() are missing an
is_rdlocked check; Locker::file_excl() is missing both the is_rdlocked
and is_wrlocked checks.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:40 +08:00
Yan, Zheng
b3796f46a4 mds: introduce DROPLOCKS slave request
In some rare cases, Locker::acquire_locks() drops all acquired locks
in order to auth pin new objects. But Locker::drop_locks only drops
explicitly acquired remote locks; it does not drop objects' version
locks that were implicitly acquired on the remote MDS. These leftover
locks break the locking order when re-acquiring locks and may cause
a deadlock.

The fix is to introduce a DROPLOCKS slave request which drops all
acquired locks on the remote MDS.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:40 +08:00
Yan, Zheng
7e04504d3e mds: fix on-going two-phase commits tracking
The slaves for a two-phase commit should be mdr->more()->witnessed
instead of mdr->more()->slaves. mdr->more()->slaves includes MDS
for remote auth pin and lock.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:40 +08:00
Yan, Zheng
2f96b472ef mds: fix anchor table commit race
Anchor table updates for a given inode are fully serialized on the client
side. But due to network latency, two commit requests from different clients
can arrive at the anchor server out of order. The anchor table gets corrupted
if updates are committed in the wrong order.

The fix is to track on-going anchor updates for each individual inode and
delay processing commit requests that arrive out of order.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:40 +08:00
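The "track per-inode updates and delay out-of-order commits" approach in the commit above can be sketched like this. The commit message does not say how ordering is detected; the per-inode sequence number used here is purely an assumption for the example, as are the `AnchorServer` class and its method names.

```python
from collections import defaultdict


class AnchorServer:
    """Hypothetical sketch: apply commits for each inode strictly in order."""

    def __init__(self):
        self.next_seq = defaultdict(lambda: 1)   # ino -> next expected update (assumed numbering)
        self.delayed = defaultdict(dict)         # ino -> {seq: update} parked commits
        self.applied = []                        # stand-in for the anchor table

    def handle_commit(self, ino, seq, update):
        if seq != self.next_seq[ino]:
            # Arrived ahead of an earlier update for the same inode: park it.
            self.delayed[ino][seq] = update
            return
        self._apply(ino, seq, update)
        # Applying this commit may unblock commits that arrived early.
        while self.next_seq[ino] in self.delayed[ino]:
            nxt = self.next_seq[ino]
            self._apply(ino, nxt, self.delayed[ino].pop(nxt))

    def _apply(self, ino, seq, update):
        self.applied.append((ino, seq, update))
        self.next_seq[ino] = seq + 1
```

Commits for different inodes do not block one another in this sketch; only updates to the same inode are ordered.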
Yan, Zheng
a79493da34 mds: skip frozen inode when assimilating dirty inodes' rstat
CDir::assimilate_dirty_rstat_inodes() may encounter frozen inodes that
are being renamed. Skip these frozen inodes because assimilating an inode's
rstat requires auth pinning the inode.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:40 +08:00
Yan, Zheng
61da9b1845 mds: mark rename inode as ambiguous auth on all involved MDS
When handling cross authority rename, the master first sends OP_RENAMEPREP
slave requests to witness MDS, then sends OP_RENAMEPREP slave request to
the rename inode's auth MDS after getting all witness MDS' acknowledgments.
Before receiving the OP_RENAMEPREP slave request, the rename inode's auth
MDS may change lock state of the rename inode and send lock messages to
witness MDS. But the witness MDS may have already received the OP_RENAMEPREP
slave request and changed the source inode's authority. So the witness MDS
sends the lock acknowledgment message to the wrong MDS and triggers an assertion.

The fix is as follows: first, the master marks the rename inode as ambiguous
and sends a message asking the rename inode's auth MDS to mark the inode as
ambiguous; then it sends OP_RENAMEPREP slave requests to the witness MDS;
finally, it sends the OP_RENAMEPREP slave request to the rename inode's auth
MDS after getting all witness MDS' acknowledgments.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:39 +08:00
Yan, Zheng
3b13d3dcbc mds: only export directory fragments in stray to their auth MDS
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:39 +08:00
Yan, Zheng
d9d7147339 mds: don't trim ambiguous imports in MDCache::trim_non_auth_subtree
Trimming ambiguous imports in MDCache::trim_non_auth_subtree() confuses
MDCache::disambiguate_imports() and causes an infinite loop.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:39 +08:00
Yan, Zheng
fcb9f98887 mds: use null dentry to find old parent of renamed directory
When replaying a directory rename operation, the MDS needs to find the old
parent of the renamed directory to adjust the auth subtree. The current code
searches the cache to find the old parent, which does not work if the renamed
directory inode is not in the cache. The EMetaBlob for a directory rename
contains at most one null dentry, so the MDS can use the null dentry to find
the old parent of the renamed directory. If there is no null dentry in the
EMetaBlob, the MDS was a witness of the rename operation and there is no auth
subtree underneath the renamed directory.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:39 +08:00
Yan, Zheng
7a52016864 mds: don't journal null dentry for overwritten remote linkage
Server::_rename_prepare() adds a null dest dentry to the EMetaBlob if
the rename operation overwrites a remote linkage. This is incorrect
because null dentries are processed after primary and remote dentries
during journal replay. The erroneous null dentry makes the dentry of
the rename destination disappear.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:39 +08:00
Yan, Zheng
5ae715be5c mds: xlock stray dentry when handling rename or unlink
This prevents MDS from reintegrating stray before rename/unlink finishes

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:39 +08:00
Yan, Zheng
262795744b mds: don't trigger assertion when discover races with rename
A discover reply that adds a replica dentry and inode can race with rename
if the slave request for the rename sends a discover and waits, but is woken
up by the reply to a different discover.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 19:25:39 +08:00
Yan, Zheng
e10267b531 mds: fix Locker::simple_eval()
Locker::simple_eval() checks if the loner wants CEPH_CAP_GEXCL to
decide if it should change the lock to EXCL state, but it checks
if CEPH_CAP_GEXCL is issued to the loner to decide if it should
change the lock to SYNC state. So if the loner wants CEPH_CAP_GEXCL,
but doesn't have CEPH_CAP_GEXCL, Locker::simple_eval() will keep
switching the lock state.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 13:55:43 +08:00
Yan, Zheng
7e23321b72 mds: don't renew revoking lease
The MDS may receive a lease renew request while the lease is being revoked;
just ignore the renew request.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-01-02 13:54:51 +08:00
Sage Weil
eb02eaede5 Merge remote-tracking branch 'gh/wip-bobtail-docs'
2013-01-01 10:36:57 -08:00