Commit Graph

12207 Commits

Author SHA1 Message Date
Greg Farnum
a93b970ab1 C_Gather: Set debug #ifdefs to remove set.
This way when we're confident it works right, we can
remove the set<Context*> and just rely on ref counting.

Further optimizations would include using a spinlock
rather than a mutex, or possibly even just switching
sub_[created|existing]_count to be atomics.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
2011-01-14 16:12:32 -08:00
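
The atomic-counter idea mentioned in the commit above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Ceph's C_Gather: the class GatherSketch and its methods new_sub, activate, and sub_finished are hypothetical names, and C++11 std::atomic is assumed to be available.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical stand-in for a gather context; not the Ceph class.
class GatherSketch {
public:
  explicit GatherSketch(std::function<void()> on_finish)
    : on_finish_(std::move(on_finish)) {}

  // Register one more sub-operation. Call only before activate().
  void new_sub() { outstanding_.fetch_add(1, std::memory_order_relaxed); }

  // Drop the creator's own reference once all subs have been created.
  void activate() { sub_finished(); }

  // Called when one sub-operation completes; the last caller fires the finisher.
  void sub_finished() {
    int prev = outstanding_.fetch_sub(1, std::memory_order_acq_rel);
    assert(prev >= 1);  // more completions than references would be a bug
    if (prev == 1)
      on_finish_();
  }

private:
  // Starts at 1: the creator holds a reference, so subs that finish while
  // other subs are still being created cannot trigger the finisher early.
  std::atomic<int> outstanding_{1};
  std::function<void()> on_finish_;
};
```

Holding one reference for the creator is one way to avoid the "creating subs while some subs were being finished" problem without a lock; whether it fits the real class depends on details not shown in this log.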
Greg Farnum
55cf6bad2f C_Gather: Rewrite for thread safety.
Previously, C_Gather wasn't thread safe at all,
and there was an issue with creating subs while some
subs were being finished.
These issues are now fixed.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
2011-01-14 16:11:01 -08:00
Greg Farnum
29825c75e7 mds: call MonClient::shutdown when doing a journal dump.
Previously we got a failed assert since nothing was calling this.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
2011-01-14 15:08:06 -08:00
Colin Patrick McCabe
1bae352ed2 os: don't crash on no-journal case
JournalingObjectStore::commit_start should handle the case where journal is
null. This will occur if the user doesn't configure a journal.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2011-01-14 10:08:10 -08:00
Sage Weil
6d0dc4bf64 mds: tolerate (with warning) replayed op with bad prealloc_inos
This comes up when an ESession close is followed by an EMetaBlob that
uses a prealloc_ino.  That isn't supposed to happen (it's probably a corner
case with session timeout vs a request waiting on locks that didn't
get killed/canceled?).  But tolerate it during replay just the same.

Works around #708.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-13 22:08:40 -08:00
Sage Weil
86337127c0 mds: improve debug output on ESession journal replay
Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-13 21:51:05 -08:00
Samuel Just
b60ef3a7ad OSD,ReplicatedPG: Do not run snap_trimmer while the pg is degraded
snap_trimmer causes replica crashes if the replica is missing
objects.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
2011-01-13 16:20:46 -08:00
Sage Weil
e060d7a115 filejournal: rewrite completion handling, fix ordering on full->notfull
Rewrite the completion handling to be simpler and clearer, so that it is
easier to maintain a strict completion ordering invariant.

This also fixes an ordering bug: When restarting journal, we defer
initially until we get a committed_thru from the previous commit and then
do all those completions.  That same logic needs to also apply to new items
submitted during that commit interval.  This was broken before, but the
simpler structure fixes it.  Fixes #666.

Tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-13 13:14:40 -08:00
Samuel Just
f2755a5337 PG: activate should not enqueue snap_trimmer on a replica
Previously, activate would queue_snap_trim() for replicas if snap_trimq
ended up non-empty, guaranteeing a crash for any replica starting up
while purged_snaps lagged behind pool->cached_removed_snaps.

This should fix #702.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
2011-01-13 13:16:40 -08:00
Samuel Just
1cdb01b47b ReplicatedPG: Fix oi.size bug in _rollback_to
_rollback_to calls _delete_head before cloning the clone into place.
_delete_head sets the object info size to 0.  _rollback_to now resets
the size to match the rolled back object.  Previously, this bug
manifested as a failed assert in scrub when checking the object sizes.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
2011-01-12 15:13:16 -08:00
Samuel Just
9c80239b6a ReplicatedPG: register_object_context and register_snapset_context cleanup
Previously, get_object_context and get_snapset_context did not register
the resulting objects.  In some cases, these objects would not get
registered and multiple copies would end up created.  This caused a bug
in find_object_context where get_snapset_context could return an object
distinct from the one referenced by the object returned from
get_object_context.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
2011-01-12 13:51:55 -08:00
Samuel Just
8f327d11ca ReplicatedPG: snap_trimmer work around
Currently, an OSD bug is causing snap_trimq to contain some snaps
already in purged_snaps.  This workaround should let kvmtest
come back up.  A real fix is still needed.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
2011-01-12 12:07:44 -08:00
Colin Patrick McCabe
61bd155f4a osd: OSD::queue_pg_for_deletion: avoid double del
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2011-01-11 10:29:10 -08:00
Sage Weil
6e6c22ea23 mds: avoid double-pinning stray inodes
We make multiple iterations through populate_mydir().  Only pin each stray
once.  Fixes #689 and crashes like

mds/CInode.h: In function 'virtual void CInode::bad_get(int)':
mds/CInode.h:1088: FAILED assert(ref_set.count(by) == 0)
ceph version 0.24 (180a417603)
1: (CInode::bad_put(int)+0) [0x827b090]
2: (MDSCacheObject::get(int)+0x153) [0x813e463]
3: (MDCache::populate_mydir()+0x8a) [0x81a7e5a]
4: (MDCache::_create_system_file_finish(Mutation*, CDentry*,
Context*)+0x181) [0x819f501]
5: (C_MDC_CreateSystemFile::finish(int)+0x29) [0x81d6c29]
6: (finish_contexts(std::list<Context*, std::allocator<Context*> >&,
int)+0x6b) [0x81d663b]
7: (Journaler::_finish_flush(int, long long, utime_t, bool)+0x983) [0x82f2f53]
8: (Journaler::C_Flush::finish(int)+0x3f) [0x82fb24f]
9: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x801) [0x82d8e31]
10: (MDS::_dispatch(Message*)+0x2ae5) [0x80eaa15]
11: (MDS::ms_dispatch(Message*)+0x62) [0x80eb142]
12: (SimpleMessenger::dispatch_entry()+0x899) [0x80b8649]
13: (SimpleMessenger::DispatchThread::entry()+0x22) [0x80b30f2]

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-11 09:50:20 -08:00
Samuel Just
e189222f06 ReplicatedPG: Fix bug in rollback
Previously, _rollback_to assumed that the rollback was a noop if
ctx->clone_obc was set and its prior version matches head's version.
However, this broke in sequences like:

Write "snap1 contents" to oid "blah"
create snapshot "snap1"
Write "snap2 contents" to oid "blah"
create snapshot "snap2"
rollback oid "blah" to snapshot "snap1"

In this case, make_writeable would have just cloned head to the snap2
clone, but the relevant clone is actually "snap1".  _rollback_to now
verifies that the most recent clone is the correct one before assuming
that head is already correct.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
2011-01-10 15:41:09 -08:00
Sage Weil
630565f3ac v0.24.1 2011-01-07 16:50:15 -08:00
Samuel Just
a64ddbb686 ReplicatedPG: get_object_context ssc refcount leak
If obc->obs.ssc is non-null, the second get_snapset_context ends up
leaking a snapset reference.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
2011-01-07 14:25:03 -08:00
Samuel Just
8665370030 ReplicatedPG: clone_overlap should contain one entry per clone
Previously, writefull and _delete_head would remove the last
entry from snapset.clone_overlap.  Now, the last entry becomes
an empty interval_set.  clone_overlap should contain one entry
per clone.

The missing entries previously caused a bug in _rollback_to where
iter would be clone_overlap.end().

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
2011-01-06 15:59:15 -08:00
Samuel Just
fab61391b7 PG: Fixes bug in _scrub with checking clones
I introduced this bug in 4a4a1e53c7.
It should be curclone++, not curclone--.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
2011-01-04 14:38:53 -08:00
Samuel Just
4a4a1e53c7 PG: Fix bug in scrub when checking clone sizes
Previously, _scrub checked:
assert(p->second.size == snapset.clone_size[curclone])

curclone was, however, an index into snapset.clones rather than a
snapid_t.  For clarity, curclone is now an iterator.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
2011-01-04 10:27:59 -08:00
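
The index-versus-key confusion described in the commit above can be illustrated with a small sketch. The types here are hypothetical stand-ins (SnapSetSketch, snapid, clone_size_at), not the Ceph definitions; the only assumption taken from the commit message is that clone sizes are keyed by snap id while clones is a sequence of ids.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using snapid = std::uint64_t;  // hypothetical stand-in for snapid_t

struct SnapSetSketch {
  std::vector<snapid> clones;                  // clone ids, in order
  std::map<snapid, std::uint64_t> clone_size;  // size keyed by clone id
};

// Look up the size of the clone the iterator points at. The map key is the
// clone id (*it), not the iterator's position in the vector.
std::uint64_t clone_size_at(const SnapSetSketch& ss,
                            std::vector<snapid>::const_iterator it) {
  assert(it != ss.clones.end());
  return ss.clone_size.at(*it);
}
```

With clones {4, 9}, the first clone sits at position 0, but its size is stored under key 4. Using the position as the key would read the wrong entry, or none at all.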
Sage Weil
6c73da0a99 mds: assert no submit_entry during replay state
We should never submit items to the journal during replay.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-03 21:24:49 -08:00
Sage Weil
88c445b15f mds: start new log segment at resolve start, not replay finish
Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-03 21:24:49 -08:00
Sage Weil
462cb8410d osd: clean up backlog generation checks a bit
Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-03 21:24:49 -08:00
Sage Weil
ff035ab31c osd: generate backlog if needed to get last_complete >= log.tail || backlog
If primary or a replica has a mistrimmed pg log, we need to generate the
backlog during peering.  This sucks, because the PG won't go active for
a long time, but it's what happens when there's a bug in the code that
mis-trims the PG log!

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-03 21:24:49 -08:00
Sage Weil
78f35a6450 osd: send sufficient log to compensate for replicas with last_complete < log.tail
If a replica has last_complete < log.tail and no backlog, send enough log
for them to get back into a consistent state.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-03 21:24:49 -08:00
Sage Weil
b40e7dc0f7 mds: load root inode on replay if auth
If we are auth for the root inode, load its initial value off of disk. We
may not see it in the log if it has not been modified.  If it has, this
is useless but fast/harmless.  This only occurs for brand-new filesystems
where the mds is immediately restarted.

Fixes #671.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-03 14:33:03 -08:00
Greg Farnum
20593b0d38 msgr: Unlock dispatch_queue.lock when short-circuiting queue_received.
Previously we left the mutex locked, which is obviously bad bad bad!
I believe this was the cause of #673.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
2011-01-03 14:15:24 -08:00
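
The class of bug fixed in the commit above (an early return that leaves a mutex locked) can be avoided structurally with a scoped lock. The following is a minimal sketch, not the SimpleMessenger code: DispatchQueueSketch, its stop flag, and the int message type are hypothetical, and std::mutex with std::lock_guard (C++11) is assumed in place of Ceph's own Mutex class.

```cpp
#include <cassert>
#include <deque>
#include <mutex>

// Hypothetical queue; names and fields are illustrative only.
struct DispatchQueueSketch {
  std::mutex lock;
  std::deque<int> queue;
  bool stop = false;  // hypothetical short-circuit condition

  // Enqueue a message unless the queue is stopped.
  bool queue_received(int msg) {
    std::lock_guard<std::mutex> guard(lock);
    if (stop)
      return false;  // guard's destructor unlocks on this path too
    queue.push_back(msg);
    return true;
  }

  // Pop the oldest message; the caller must ensure the queue is non-empty.
  int dequeue() {
    std::lock_guard<std::mutex> guard(lock);
    assert(!queue.empty());
    int msg = queue.front();
    queue.pop_front();
    return msg;
  }
};
```

Because the guard releases the mutex in its destructor, every return path unlocks, including the short-circuit one.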
Sage Weil
4efa300601 filestore: assert on out of order journal pipeline submissions
Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-03 13:14:49 -08:00
Sage Weil
259c509a89 filestore: fix wake condition when journal submission blocks
We only want to wake up if we are at the front of the line, in order to
preserve journal submission pipeline ordering.

This fixes, among other things, messages in the log like

2010-12-21 10:38:42.515974 7f0861486700 journal op_submit_finish 5364 expected 5370, OUT OF ORDER

and bug #666.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-03 13:14:13 -08:00
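
The "only wake up if we are at the front of the line" rule from the commit above can be sketched as a ticket-ordered gate. This is an illustration under assumptions, not the FileStore implementation: SubmitGateSketch and its methods are hypothetical names, and std::condition_variable (C++11) is assumed.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical ticket-ordered gate; not the FileStore code.
class SubmitGateSketch {
public:
  // Reserve a position in the submission pipeline.
  std::uint64_t take_ticket() {
    std::lock_guard<std::mutex> g(lock_);
    return next_ticket_++;
  }

  // True if this ticket is currently at the front of the line.
  bool at_front(std::uint64_t ticket) {
    std::lock_guard<std::mutex> g(lock_);
    return ticket == now_serving_;
  }

  // Block until this ticket is at the front. The predicate is re-checked on
  // every wakeup, so a waiter that is not at the front goes back to sleep.
  void wait_turn(std::uint64_t ticket) {
    std::unique_lock<std::mutex> l(lock_);
    assert(ticket >= now_serving_);  // an already-served ticket would never wake
    cond_.wait(l, [&] { return ticket == now_serving_; });
  }

  // Finish the front submission and wake waiters so they re-check their turn.
  void finish_front() {
    {
      std::lock_guard<std::mutex> g(lock_);
      ++now_serving_;
    }
    cond_.notify_all();
  }

private:
  std::mutex lock_;
  std::condition_variable cond_;
  std::uint64_t next_ticket_ = 0;
  std::uint64_t now_serving_ = 0;
};
```

A submitter would call take_ticket() when it enters the pipeline, wait_turn() before submitting, and finish_front() afterwards, which keeps submissions in ticket order regardless of which thread wakes first.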
Sage Weil
15dcc65199 mds: fix purge_stray for directories, zeroed layouts
- We don't want to purge file content on directories
- Don't fall over if a file has a zero period

Reported-by: Paul Komkoff <i@stingr.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-03 11:50:53 -08:00
Colin Patrick McCabe
6cdfa30455 osd: PG::Info::History: init last_epoch_clean
It seems that we have not been zeroing
PG::Info::History::last_epoch_clean when the History structure is
created. This led to some very interesting log output (and bugs!).

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2011-01-03 10:30:56 -08:00
Samuel Just
9ad05cf7ff SimpleMessenger.cc: Fixes a dispatch_throttler leak in queue_received
when the pipe has been halted.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
2011-01-03 10:14:52 -08:00
Sage Weil
180a417603 v0.24 2010-12-20 15:58:09 -08:00
Sage Weil
69940e2717 osd: compensate for replicas with tail > last_complete
Normally we shouldn't ever have a last_complete < log.tail (&& !backlog).
But maybe we do (old bugs, whatever; see #590).  In that case, the primary
can compensate by sending more log info to the replica.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-20 13:22:49 -08:00
Sage Weil
b04b6f4823 mds: make nested scatterlock state change check more robust
The predirty_journal_parents() calls wrlock_start() with nowait=true
because it has a journal entry open and we don't want to trigger a nested
scatterlock change that needs to journal something again (either
via scatter_writebehind or scatter_start).  (MDLog can only handle a single
log entry open at once because building multiple at once would require very
very very careful ordering of predirty() calls and versions.)

We were already checking for the simple_lock() case (which may call
writebehind); fix up the check to also cover the scatter_mix() (which may
call scatter_start) case.

Fixes this crash:

mds/MDLog.h: In function 'void MDLog::start_entry(LogEvent*)':
mds/MDLog.h:191: FAILED assert(cur_event == __null)
 ceph version 0.24~rc (commit:fe10300317383ec29948d7dbe3cb31b3aa277e3c)
 1: (CInode::finish_scatter_update(ScatterLock*, CDir*, unsigned long, unsigned long)+0x804) [0x606e14]
 2: (CInode::start_scatter(ScatterLock*)+0xaa) [0x60dc1a]
 3: (Locker::scatter_mix(ScatterLock*, bool*)+0x1ca) [0x589a9a]
 4: (Locker::wrlock_start(SimpleLock*, MDRequest*, bool)+0x165) [0x597d65]
 5: (MDCache::predirty_journal_parents(Mutation*, EMetaBlob*, CInode*, CDir*, int, int, snapid_t)+0x153e) [0x55a70e]
 6: (Locker::scatter_writebehind(ScatterLock*)+0x42d) [0x58553d]
 7: (Locker::simple_lock(SimpleLock*, bool*)+0x7ab) [0x58beeb]
 8: (Locker::scatter_nudge(ScatterLock*, Context*, bool)+0x3ad) [0x58c49d]
 9: (Locker::scatter_tick()+0x28a) [0x58c98a]
 10: (MDS::tick()+0x4e4) [0x4b26a4]
 11: (SafeTimer::timer_thread()+0x22c) [0x6d164c]
 12: (SafeTimerThread::entry()+0xd) [0x6d34bd]
 13: (Thread::_entry_func(void*)+0xa) [0x4943da]
 14: /lib/libpthread.so.0 [0x7fc87810b73a]
 15: (clone()+0x6d) [0x7fc876dad69d]

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 21:02:58 -08:00
Sage Weil
3a235b0f21 filestore: make OpSequencer::flush() work for writeahead journaling items
It was only waiting for items in the op_queue to complete.  The goal is
to wait for anything we've called queue_transactions(&osr,...) on. If we
do writeahead journaling, though, there might be new ops that are still
journaling but not yet submitted to the fs that are missed.

This adds a journal queue to the OpSequencer, and uses it in the writeahead
case only.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 15:30:39 -08:00
Colin Patrick McCabe
285f351b72 mon: build_initial_monmap: fix mismatched alloc
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-12-17 15:31:41 -08:00
Colin Patrick McCabe
caa4609387 common: cleanups
common_init: avoid (mismatched) heap allocation

ConfFile::_parse: avoid memory leak on error path

ConfFile: NULL filename if not set, rather than leaving it undefined

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-12-17 15:26:37 -08:00
Colin Patrick McCabe
28bcf0bc98 osd: PG::choose_acting: fix major iterator mistake
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-12-17 15:14:53 -08:00
Colin Patrick McCabe
f7dc1a9239 rgw: fix fd leak on error path
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-12-17 15:14:53 -08:00
Colin Patrick McCabe
795811d66a hadoop: fix a bunch of mismatched allocations
Using array new means you need array delete.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-12-17 15:14:53 -08:00
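
The rule stated in the commit above (array new must be paired with array delete) can be enforced by the type system in modern C++. The sketch below is an assumption-laden illustration, not the hadoop client code: copy_bytes is a hypothetical helper, and std::unique_ptr<char[]> (C++11) is assumed to be available.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>

// Copy len bytes into a freshly allocated, NUL-terminated buffer.
// unique_ptr<char[]> calls delete[] (not delete) when it goes out of scope,
// so the allocation and deallocation forms cannot get out of sync.
std::unique_ptr<char[]> copy_bytes(const char* src, std::size_t len) {
  assert(src != nullptr);
  std::unique_ptr<char[]> buf(new char[len + 1]);
  std::memcpy(buf.get(), src, len);
  buf[len] = '\0';
  return buf;
}
```

With raw pointers the equivalent pairing is new char[n] with delete[] p; using plain delete on an array allocation is undefined behavior.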
Colin Patrick McCabe
2f916086a6 auth: avoid mismatched allocation
Can't pair strdup and free.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-12-17 15:14:53 -08:00
Sage Weil
3c7d30f1ac osd: flush pg writes to disk before starting scrub scan
This avoids two races:
 - we just completed recovery by pushing objects to the replica, and the
   replica starts scanning before those writes reach the fs.
 - we just trimmed to something after last_update_applied.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 14:15:35 -08:00
Sage Weil
5184db4424 filestore: add per-sequencer flush operation
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 14:15:35 -08:00
Sage Weil
2fb60daf68 osd: debug scan_list and scrub a bit better
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 12:51:03 -08:00
Sage Weil
1cfad2ea77 osd: clear INCONSISTENT if scrub detects no errors
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 10:59:45 -08:00
Sage Weil
b190875548 osd: add assert that we're replica
Fred saw a crash where we got into merge_log as a stray, which really
shouldn't ever happen!  See #590.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 10:36:34 -08:00
Laszlo Boszormenyi
1e291fc9ef debian: don't strip rados classes
Signed-off-by: Laszlo Boszormenyi <gcs@debian.hu>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 08:31:00 -08:00
Laszlo Boszormenyi
9c173bb400 debian: rename ceph.lintian -> ceph.lintian-overrides
Signed-off-by: Laszlo Boszormenyi <gcs@debian.hu>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 08:30:43 -08:00
Samuel Just
73669d87e6 PG.cc:
sub_op_scrub must set finalizing_scrub on the replica
before waiting for last_update_applied to catch up to
info.last_update.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
2010-12-16 13:06:43 -08:00