Commit Graph

12180 Commits

Author SHA1 Message Date
Sage Weil
4efa300601 filestore: assert on out of order journal pipeline submissions
Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-03 13:14:49 -08:00
Sage Weil
259c509a89 filestore: fix wake condition when journal submission blocks
We only want to wake up if we are at the front of the line, in order to
preserve journal submission pipeline ordering.

This fixes, among other things, messages in the log like

2010-12-21 10:38:42.515974 7f0861486700 journal op_submit_finish 5364 expected 5370, OUT OF ORDER

and bug #666.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-03 13:14:13 -08:00
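
The front-of-the-line rule described in the commit above can be sketched as a small single-threaded model. This is not the FileStore code: the class name OrderedSubmitter and its methods are hypothetical, and real code would block on a condition variable instead of parking sequence numbers in a set. It only illustrates that an op is released when, and only when, its sequence number is the next one expected.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical model of an ordered submission pipeline (not Ceph's API).
class OrderedSubmitter {
public:
    // Called when op `seq` is ready to submit. It is released only once
    // it is at the front of the line; otherwise it stays blocked.
    void ready(uint64_t seq) {
        blocked_.insert(seq);
        // Wake condition: release only the op whose seq equals next_.
        while (!blocked_.empty() && *blocked_.begin() == next_) {
            // Mirrors the idea of asserting on out-of-order submissions.
            assert(submitted_.size() == next_);
            submitted_.push_back(next_);
            blocked_.erase(blocked_.begin());
            ++next_;
        }
    }

    const std::vector<uint64_t>& submitted() const { return submitted_; }

private:
    uint64_t next_ = 0;               // next sequence number allowed through
    std::set<uint64_t> blocked_;      // ready ops that are not yet at the front
    std::vector<uint64_t> submitted_; // order in which ops were released
};
```

Feeding the model ops in the order 2, 0, 3, 1 releases them as 0, 1, 2, 3: op 2 stays blocked until 0 and 1 have gone through.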
Sage Weil
15dcc65199 mds: fix purge_stray for directories, zeroed layouts
- We don't want to purge file content on directories
- Don't fall over if a file has a zero period

Reported-by: Paul Komkoff <i@stingr.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-03 11:50:53 -08:00
Colin Patrick McCabe
6cdfa30455 osd: PG::Info::History: init last_epoch_clean
It seems that we have not been zeroing
PG::Info::History::last_epoch_clean when the History structure is
created. This led to some very interesting log output (and bugs!)

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2011-01-03 10:30:56 -08:00
Samuel Just
9ad05cf7ff SimpleMessenger.cc: Fixes a dispatch_throttler leak in queue_received
when the pipe has been halted.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
2011-01-03 10:14:52 -08:00
Sage Weil
180a417603 v0.24
2010-12-20 15:58:09 -08:00
Sage Weil
69940e2717 osd: compensate for replicas with tail > last_complete
Normally we shouldn't ever have a last_complete < log.tail (&& !backlog).
But maybe we do (old bugs, whatever; see #590).  In that case, the primary
can compensate by sending more log info to the replica.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-20 13:22:49 -08:00
Sage Weil
b04b6f4823 mds: make nested scatterlock state change check more robust
The predirty_journal_parents() calls wrlock_start() with nowait=true
because it has a journal entry open and we don't want to trigger a nested
scatterlock change that needs to journal something again (either
via scatter_writebehind or scatter_start).  (MDLog can only handle a single
log entry open at once because building multiple at once would require very
very very careful ordering of predirty() calls and versions.)

We were already checking for the simple_lock() case (which may call
writebehind); fix up the check to also cover the scatter_mix() (which may
call scatter_start) case.

Fixes this crash:

mds/MDLog.h: In function 'void MDLog::start_entry(LogEvent*)':
mds/MDLog.h:191: FAILED assert(cur_event == __null)
 ceph version 0.24~rc (commit:fe10300317383ec29948d7dbe3cb31b3aa277e3c)
 1: (CInode::finish_scatter_update(ScatterLock*, CDir*, unsigned long, unsigned long)+0x804) [0x606e14]
 2: (CInode::start_scatter(ScatterLock*)+0xaa) [0x60dc1a]
 3: (Locker::scatter_mix(ScatterLock*, bool*)+0x1ca) [0x589a9a]
 4: (Locker::wrlock_start(SimpleLock*, MDRequest*, bool)+0x165) [0x597d65]
 5: (MDCache::predirty_journal_parents(Mutation*, EMetaBlob*, CInode*, CDir*, int, int, snapid_t)+0x153e) [0x55a70e]
 6: (Locker::scatter_writebehind(ScatterLock*)+0x42d) [0x58553d]
 7: (Locker::simple_lock(SimpleLock*, bool*)+0x7ab) [0x58beeb]
 8: (Locker::scatter_nudge(ScatterLock*, Context*, bool)+0x3ad) [0x58c49d]
 9: (Locker::scatter_tick()+0x28a) [0x58c98a]
 10: (MDS::tick()+0x4e4) [0x4b26a4]
 11: (SafeTimer::timer_thread()+0x22c) [0x6d164c]
 12: (SafeTimerThread::entry()+0xd) [0x6d34bd]
 13: (Thread::_entry_func(void*)+0xa) [0x4943da]
 14: /lib/libpthread.so.0 [0x7fc87810b73a]
 15: (clone()+0x6d) [0x7fc876dad69d]

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 21:02:58 -08:00
Sage Weil
3a235b0f21 filestore: make OpSequencer::flush() work for writeahead journaling items
It was only waiting for items in the op_queue to complete.  The goal is
to wait for anything we've called queue_transactions(&osr,...) on. If we
do writeahead journaling, though, there might be new ops that are still
journaling but not yet submitted to the fs that are missed.

This adds a journal queue to the OpSequencer, and uses it in the writeahead
case only.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 15:30:39 -08:00
Colin Patrick McCabe
285f351b72 mon: build_initial_monmap: fix mismatched alloc
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-12-17 15:31:41 -08:00
Colin Patrick McCabe
caa4609387 common: cleanups
common_init: avoid (mismatched) heap allocation

ConfFile::_parse: avoid memory leak on error path

ConfFile: NULL filename if not set, rather than leaving it undefined

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-12-17 15:26:37 -08:00
Colin Patrick McCabe
28bcf0bc98 osd: PG::choose_acting: fix major iterator mistake
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-12-17 15:14:53 -08:00
Colin Patrick McCabe
f7dc1a9239 rgw: fix fd leak on error path
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-12-17 15:14:53 -08:00
Colin Patrick McCabe
795811d66a hadoop: fix a bunch of mismatched allocations
Using array new means you need array delete.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-12-17 15:14:53 -08:00
Colin Patrick McCabe
2f916086a6 auth: avoid mismatched allocation
Can't pair strdup and free.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-12-17 15:14:53 -08:00
Sage Weil
3c7d30f1ac osd: flush pg writes to disk before starting scrub scan
This avoids two races:
 - we just completed recovery by pushing objects to the replica, and the
   replica starts scanning before those writes reach the fs.
 - we just trimmed to something after last_update_applied.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 14:15:35 -08:00
Sage Weil
5184db4424 filestore: add per-sequencer flush operation
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 14:15:35 -08:00
Sage Weil
2fb60daf68 osd: debug scan_list and scrub a bit better
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 12:51:03 -08:00
Sage Weil
1cfad2ea77 osd: clear INCONSISTENT if scrub detects no errors
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 10:59:45 -08:00
Sage Weil
b190875548 osd: add assert that we're replica
Fred saw a crash where we got into merge_log as a stray, which really
shouldn't ever happen!  See #590.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 10:36:34 -08:00
Laszlo Boszormenyi
1e291fc9ef debian: don't strip rados classes
Signed-off-by: Laszlo Boszormenyi <gcs@debian.hu>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 08:31:00 -08:00
Laszlo Boszormenyi
9c173bb400 debian: rename ceph.lintian -> ceph.lintian-overrides
Signed-off-by: Laszlo Boszormenyi <gcs@debian.hu>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 08:30:43 -08:00
Samuel Just
73669d87e6 PG.cc:
sub_op_scrub must set finalizing_scrub on the replica
before waiting for last_update_applied to catch up to
info.last_update.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
2010-12-16 13:06:43 -08:00
Samuel Just
29480f42be ReplicatedPG.cc:
- _scrub must set head when it encounters a head snap
- curclone counts down, not up

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
2010-12-15 17:23:59 -08:00
Sage Weil
914f6ddebd filestore: detect final version of async ioctl SNAP_CREATE_V2
Li's revised interface for the async snap ioctl is more flexible.  Update
the ioctl call sites and detection code accordingly.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-15 13:39:57 -08:00
Greg Farnum
06a2d7a269 mds: Save straydn in mdr so it's consistent across retry attempts.
Otherwise, we could choose new stray dirs and fail to get all
the locks we needed (while leaving old strays locked forever!).

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
2010-12-15 13:07:25 -08:00
Sage Weil
89d5c91e7d mon: trim pgmap less aggressively
This will make observer crashes due to missed states (#648) much harder to
hit.  Eventually the pgmap state trim problem will go away when the
monitor/paxos code is restructured (#647).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-14 11:50:03 -08:00
Yehuda Sadeh
b989087ddf crypto: catch cryptopp decrypt/encrypt exceptions
2010-12-14 10:51:46 -08:00
Colin Patrick McCabe
3932f084f7 osd: PG::prior_set_affected: const cleanup
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-12-14 01:53:37 -08:00
Sage Weil
9add26be76 mds: fix replay/resent vs completed request check
If it is a _replayed_ request, we should always send a simple ack if it is
completed, because the client does not care about any additional caps.

If it is a _resent_ request, then we want to return useful caps on open or
create requests, even if any modification side-effects have already been
committed.  The additional checks for completed already exist in the
create and open handlers.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-12 14:40:05 -08:00
Colin Patrick McCabe
346a2aac42 rpm: update changelog
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-12-09 14:38:08 -08:00
Colin Patrick McCabe
e23d620068 rpm: fix ceph.spec to work with gcephtool
Don't try to package gui_resources unless we are building the GUI.
Get GUI dependencies correct.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-12-09 14:35:48 -08:00
Vangelis Koukis
83612ef736 Fix overflow in FileJournal::_open_file()
Running the unstable branch, mkcephfs fails when trying to create
a 3GB journal file on the OSDs.

Relevant messages from the osd logfile:

2010-12-09 19:03:54.419737 7fdde4d51720 journal _open_file: unable to extend journal to 18446744072560312320 bytes
2010-12-09 19:03:54.419789 7fdde4d51720 filestore(/osd) mkjournal error creating journal on /osd/journal

The problem is that the calculation of the journal size in bytes
overflows, in FileJournal::_open_file().

Signed-off-by: Vangelis Koukis <vkoukis@cslab.ece.ntua.gr>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-09 13:45:30 -08:00
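
The size in the log message above is consistent with a megabyte count being converted to bytes in 32-bit signed arithmetic and then widened to 64 bits. The sketch below reproduces that number under the assumption that the configured size was 3000 MB; the function names are hypothetical and the original FileJournal::_open_file() code is not shown here. The wraparound is emulated with unsigned arithmetic so the example itself has no undefined behavior.

```cpp
#include <cstdint>

// Hypothetical helper: byte count computed in 64-bit arithmetic.
uint64_t journal_bytes(uint32_t size_mb) {
    return static_cast<uint64_t>(size_mb) << 20;
}

// Hypothetical helper: emulates a 32-bit signed computation that is
// sign-extended when it is later stored in a 64-bit unsigned value.
uint64_t journal_bytes_truncated(uint32_t size_mb) {
    uint32_t wrapped = size_mb << 20;                   // wraps modulo 2^32
    int32_t as_signed = static_cast<int32_t>(wrapped);  // reinterpreted as negative
    return static_cast<uint64_t>(static_cast<int64_t>(as_signed));
}
```

With size_mb = 3000, journal_bytes returns 3145728000, while journal_bytes_truncated returns 18446744072560312320, the value reported in the osd log.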
Samuel Just
d0fbc30a0a ReplicatedPG.cc: Fixes a bug in snap_trimmer where a pointer to a stack
Cond is left in the mode.waiting_cond list.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
2010-12-09 13:09:20 -08:00
Samuel Just
329ae1bc3b ReplicatedPG: snap_trimmer now acquires a read lock on the osd map
before calling share_pg_info.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
2010-12-09 13:09:20 -08:00
Colin Patrick McCabe
f68e6e7d38 rpm: don't try to package radosacl
radosacl is just a test binary, so unless we build with --with-debug, we
won't get it.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-12-09 11:18:33 -08:00
Colin Patrick McCabe
6722b0c85d rpm: add pkgconfig to BuildRequires
You can't build without pkgconfig.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-12-09 11:18:32 -08:00
Colin Patrick McCabe
9df18d1984 rpm: set files-attr for radosgw
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-12-09 10:28:39 -08:00
Sage Weil
b4264fbbdc filejournal: reset last_committed_seq if we find journal to be invalid
If we read an event that's later than our expected entry, we set read_pos
to -1 and discard the journal.  If that happens we also need to reset
last_committed_seq to avoid a crash like

2010-12-08 17:04:39.246950 7f269d138910 journal commit_finish thru 16904
2010-12-08 17:04:39.246961 7f269d138910 journal committed_thru 16904 < last_committed_seq 37778589
os/FileJournal.cc: In function 'virtual void FileJournal::committed_thru(uint64_t)':
os/FileJournal.cc:854: FAILED assert(seq >= last_committed_seq)
 ceph version 0.24~rc (commit:fe10300317383ec29948d7dbe3cb31b3aa277e3c)
 1: (FileJournal::committed_thru(unsigned long)+0xad) [0x588e7d]
 2: (JournalingObjectStore::commit_finish()+0x8c) [0x57f2ec]
 3: (FileStore::sync_entry()+0xcff) [0x5764cf]
 4: (FileStore::SyncThread::entry()+0xd) [0x506d9d]
 5: (Thread::_entry_func(void*)+0xa) [0x4790ba]
 6: /lib/libpthread.so.0 [0x7f26a2f8373a]
 7: (clone()+0x6d) [0x7f26a1c2569d]

Fixes #631

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-08 18:10:49 -08:00
Sage Weil
a9c098df47 mon: use helper for clock drift check; log relative instead of absolute time
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-08 11:12:51 -08:00
Sage Weil
fe10300317 mds: sync->mix replica state is sync->mix(2)
When auth first moves to sync->mix,
 - auth sends AC_MIX to replicas
 - replicas go to sync->mix
 - replicas finish gather, send AC_SYNCACK, move to sync->mix(2)
 - auth gets all acks, sends AC_MIX again
 - replica moves to MIX

So any new replica should just get sync->mix(2), so that it is not confused
by the second AC_MIX.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-07 16:44:19 -08:00
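
The replica-side sequence listed in the commit above can be modeled as a small state machine. This is a sketch only: the enum and function names are hypothetical, and ignoring unexpected events is an assumption of the model, not a statement about how the MDS handles them.

```cpp
// Hypothetical model of the replica lock states named in the commit message.
enum class LockState { SYNC, SYNC_MIX, SYNC_MIX2, MIX };
enum class Event { AC_MIX, GATHER_DONE };

// Returns the next replica state; events that do not apply leave the
// state unchanged (a modeling assumption).
LockState replica_step(LockState s, Event e) {
    switch (s) {
    case LockState::SYNC:
        if (e == Event::AC_MIX) return LockState::SYNC_MIX;
        break;
    case LockState::SYNC_MIX:
        // Gather finished: the replica sends AC_SYNCACK at this point.
        if (e == Event::GATHER_DONE) return LockState::SYNC_MIX2;
        break;
    case LockState::SYNC_MIX2:
        // The second AC_MIX from the auth completes the transition.
        if (e == Event::AC_MIX) return LockState::MIX;
        break;
    case LockState::MIX:
        break;
    }
    return s;
}
```

In this model a replica created in SYNC_MIX2 reaches MIX on the auth's second AC_MIX, whereas one created in SYNC_MIX would not, which is the confusion the commit avoids.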
Sage Weil
2000f69e99 mds: do not choose lock state on replicas
The lock state has already been set during rejoin.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-07 16:44:19 -08:00
Sage Weil
3825c4b87b mds: small rejoin cleanup
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-07 16:44:18 -08:00
Sage Weil
9b9b86935e mds: rev mds cluster internal protocol
The lock encoding changed with the dirty bit on scatterlocks.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-07 16:44:18 -08:00
Sage Weil
2ea9b2d7db mds: fix replay of already-journaled requests
Check for already-completed tids for both retried and replayed requests.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-07 16:44:18 -08:00
Sage Weil
b5fd2e4d4e mds: open undef dirfrags during rejoin
Any invented dirfrags have a version of 0.  This will cause problems later
if we pre_dirty() anything in that dir because the dir version won't be
in sync (it'll be way too small).  Also, we can do that at any point,
e.g. when flushing dirty caps, and aren't allowed to delay, so we need to
load those dirfrags now.

In theory we could read only the fnode and not all the dentries, but we
may as well.  We should be more careful about memory than this patch is,
though.

Fixes #15.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-07 16:44:18 -08:00
Sage Weil
39c5933db0 mds: add missing try_clear_more() to scatterlock
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-07 16:44:18 -08:00
Sage Weil
c681ed752f mds: explicitly pass scatterlock dirty flag to auth on gather
This ensures that if the replica thinks it is flushing something the
auth will always do a scatter_writebehind.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-07 16:44:18 -08:00
Sage Weil
9bbb33b436 mds: send LOCKFLUSHED to trigger finish_flush on replicas
Since f741766a we have triggered start_flush and finish_flush on replicas.
The problem is that the finish_flush didn't always happen for the mix->lock
case: we would start_flush when we sent the AC_LOCKACK, but could only
finish_flush if/when we got another SYNC or MIX.  If the primary stayed in
the LOCK state, we would keep our flushing flag.  That in turn causes
problems later when we try to eval_gather() (esp if we are auth at that
point?).

Fix this by sending an explicit AC_LOCKFLUSHED message to replicas after
we do a scatter_writebehind.  The replica will only set flushing if it
flushed dirty data, which forces scatter_writebehind, so we will always
get the LOCKFLUSHED to match.  Replicas that didn't flush will also get
it, but oh well.  We'd need to keep track which ones sent dirty data to
do that properly, though.

TODO: still need to verify that this is correct for rejoin.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-07 16:44:18 -08:00
Sage Weil
681b010fdb mds: clear EXPORTINGCAPS on export_reverse
We need to reverse the effects of encode_export_inode_caps(), which is just
the pin and state bit.

The original problem can be reproduced with
 - ceph tell mds 0 injectargs '--mds-kill-import-at 5'
 - restart mds
 - recovery completes successfully
 - wait for the subtree to be reexported
 - fail with bad EXPORTINGCAPS get in encode_export_inode_caps

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-07 16:44:18 -08:00