Commit Graph

12144 Commits

Author SHA1 Message Date
Colin Patrick McCabe
6722b0c85d rpm: add pkgconfig to BuildRequires
You can't build without pkgconfig.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-12-09 11:18:32 -08:00
Colin Patrick McCabe
9df18d1984 rpm: set files-attr for radosgw
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-12-09 10:28:39 -08:00
Sage Weil
b4264fbbdc filejournal: reset last_committed_seq if we find journal to be invalid
If we read an event that's later than our expected entry, we set read_pos
to -1 and discard the journal.  If that happens we also need to reset
last_committed_seq to avoid a crash like

2010-12-08 17:04:39.246950 7f269d138910 journal commit_finish thru 16904
2010-12-08 17:04:39.246961 7f269d138910 journal committed_thru 16904 < last_committed_seq 37778589
os/FileJournal.cc: In function 'virtual void FileJournal::committed_thru(uint64_t)':
os/FileJournal.cc:854: FAILED assert(seq >= last_committed_seq)
 ceph version 0.24~rc (commit:fe10300317383ec29948d7dbe3cb31b3aa277e3c)
 1: (FileJournal::committed_thru(unsigned long)+0xad) [0x588e7d]
 2: (JournalingObjectStore::commit_finish()+0x8c) [0x57f2ec]
 3: (FileStore::sync_entry()+0xcff) [0x5764cf]
 4: (FileStore::SyncThread::entry()+0xd) [0x506d9d]
 5: (Thread::_entry_func(void*)+0xa) [0x4790ba]
 6: /lib/libpthread.so.0 [0x7f26a2f8373a]
 7: (clone()+0x6d) [0x7f26a1c2569d]

Fixes #631

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-08 18:10:49 -08:00
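The failure fixed in b4264fbbdc can be illustrated with a toy model. This is a sketch under stated assumptions, not the FileJournal code: `ToyJournal`, `discard_invalid()`, and `commit_after_discard()` are hypothetical names, and resetting the counter to 0 is an assumption. Only `read_pos`, `last_committed_seq`, and the `seq >= last_committed_seq` check come from the commit message.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Toy model of a journal that tracks the highest committed sequence number.
// Hypothetical type; not the actual FileJournal class.
struct ToyJournal {
    int64_t  read_pos = 0;            // -1 means "journal discarded"
    uint64_t last_committed_seq = 0;

    // Mirrors the invariant behind the failed assert in the trace above,
    // but throws instead of aborting so the caller can observe the failure.
    void committed_thru(uint64_t seq) {
        if (seq < last_committed_seq)
            throw std::logic_error("seq < last_committed_seq");
        last_committed_seq = seq;
    }

    // Called when replay finds an entry later than expected.
    // reset_seq == false models the old behaviour, true the fixed one
    // (resetting to 0 is an assumption made for this sketch).
    void discard_invalid(bool reset_seq) {
        read_pos = -1;
        if (reset_seq)
            last_committed_seq = 0;
    }
};

// Returns true if a commit at `seq` succeeds after discarding a journal
// whose stale counter was `stale`.
bool commit_after_discard(uint64_t stale, uint64_t seq, bool reset_seq) {
    ToyJournal j;
    j.last_committed_seq = stale;
    j.discard_invalid(reset_seq);
    assert(j.read_pos == -1);         // the journal is marked discarded
    try {
        j.committed_thru(seq);
    } catch (const std::logic_error&) {
        return false;                 // stale counter rejected the commit
    }
    return true;
}
```

With the values from the log above, `commit_after_discard(37778589, 16904, false)` fails the check, while the same call with `reset_seq` set to `true` succeeds.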
Sage Weil
a9c098df47 mon: use helper for clock drift check; log relative instead of absolute time
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-08 11:12:51 -08:00
Sage Weil
fe10300317 mds: sync->mix replica state is sync->mix(2)
When auth first moves to sync->mix,
 - auth sends AC_MIX to replicas
 - replicas go to sync->mix
 - replicas finish gather, send AC_SYNCACK, move to sync->mix(2)
 - auth gets all acks, sends AC_MIX again
 - replica moves to MIX

So any new replica should just get sync->mix(2), so that it is not confused
by the second AC_MIX.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-07 16:44:19 -08:00
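The replica-side sequence described in fe10300317 can be sketched as a small state machine. This is a simplified illustration, not the MDS lock code: `ReplicaState`, `on_ac_mix()`, and `on_gather_done()` are hypothetical names, and ignoring AC_MIX in any other state is an assumption of the sketch.

```cpp
#include <cassert>

// Replica-side lock states named in the commit message (hypothetical enum).
enum class ReplicaState { Sync, SyncMix, SyncMix2, Mix };

// Replica receives AC_MIX from the auth.
ReplicaState on_ac_mix(ReplicaState s) {
    switch (s) {
    case ReplicaState::Sync:     return ReplicaState::SyncMix;  // first AC_MIX
    case ReplicaState::SyncMix2: return ReplicaState::Mix;      // second AC_MIX
    default:                     return s;  // assumption: ignored otherwise
    }
}

// Replica finishes its gather and sends AC_SYNCACK.
ReplicaState on_gather_done(ReplicaState s) {
    assert(s == ReplicaState::SyncMix);
    return ReplicaState::SyncMix2;
}
```

In this model a replica created while the auth is already in sync->mix only ever sees the second AC_MIX. Starting it in `SyncMix2` lets that message move it to `Mix`; starting it in `SyncMix` would leave it stuck.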
Sage Weil
2000f69e99 mds: do not choose lock state on replicas
The lock state has already been set during rejoin.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-07 16:44:19 -08:00
Sage Weil
3825c4b87b mds: small rejoin cleanup
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-07 16:44:18 -08:00
Sage Weil
9b9b86935e mds: rev mds cluster internal protocol
The lock encoding changed with the dirty bit on scatterlocks.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-07 16:44:18 -08:00
Sage Weil
2ea9b2d7db mds: fix replay of already-journaled requests
Check for already-completed tids for both retried and replayed requests.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-07 16:44:18 -08:00
Sage Weil
b5fd2e4d4e mds: open undef dirfrags during rejoin
Any invented dirfrags have a version of 0.  This will cause problems later
if we pre_dirty() anything in that dir because the dir version won't be
in sync (it'll be way too small).  Also, we can do that at any point,
e.g. when flushing dirty caps, and aren't allowed to delay, so we need to
load those dirfrags now.

In theory we could read only the fnode and not all the dentries, but we
may as well.  We should be more careful about memory than this patch is,
though.

Fixes #15.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-07 16:44:18 -08:00
Sage Weil
39c5933db0 mds: add missing try_clear_more() to scatterlock
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-07 16:44:18 -08:00
Sage Weil
c681ed752f mds: explicitly pass scatterlock dirty flag to auth on gather
This ensures that if the replica thinks it is flushing something the
auth will always do a scatter_writebehind.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-07 16:44:18 -08:00
Sage Weil
9bbb33b436 mds: send LOCKFLUSHED to trigger finish_flush on replicas
Since f741766a we have triggered start_flush and finish_flush on replicas.
The problem is that the finish_flush didn't always happen for the mix->lock
case: we would start_flush when we sent the AC_LOCKACK, but could only
finish_flush if/when we got another SYNC or MIX.  If the primary stayed in
the LOCK state, we would keep our flushing flag.  That in turn causes
problems later when we try to eval_gather() (esp if we are auth at that
point?).

Fix this by sending an explicit AC_LOCKFLUSHED message to replicas after
we do a scatter_writebehind.  The replica will only set flushing if it
flushed dirty data, which forces scatter_writebehind, so we will always
get the LOCKFLUSHED to match.  Replicas that didn't flush will also get
it, but oh well.  We'd need to keep track of which ones sent dirty data to
do that properly, though.

TODO: still need to verify that this is correct for rejoin.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-07 16:44:18 -08:00
Sage Weil
681b010fdb mds: clear EXPORTINGCAPS on export_reverse
We need to reverse the effects of encode_export_inode_caps(), which is just
the pin and state bit.

The original problem can be reproduced with
 - ceph tell mds 0 injectargs '--mds-kill-import-at 5'
 - restart mds
 - recovery completes successfully
 - wait for the subtree to be reexported
 - fail with bad EXPORTINGCAPS get in encode_export_inode_caps

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-07 16:44:18 -08:00
Sage Weil
f97660ff40 mds: fix LOOKUPHASH to avoid creating bogus replica CDir
We can't create the CDir if we are non-auth.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-07 16:44:18 -08:00
Sage Weil
4f6439945b mds: introduce rejoin_invent_dirfrag() helper
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-07 16:44:18 -08:00
Colin Patrick McCabe
1e2e4aa0f4 automake: in scripts, use sysconfdir as-is
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-12-07 10:56:05 -08:00
Colin Patrick McCabe
10b6887eae automake: in deb pkg, use --sysconfdir=/etc
When building the debian packages, use --sysconfdir=/etc.

Also, don't fudge sysconfdir in the init-ceph script.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-12-07 10:48:19 -08:00
Sage Weil
57bcdc54d5 mkcephfs: require -k; update man page
Force users to specify keyring location; update man page accordingly.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-06 22:18:16 -08:00
Yehuda Sadeh
87545d0620 configure: detect crypto++ library
2010-12-06 15:25:34 -08:00
Sage Weil
ebcc9395b0 osd: drop not-quite-copy constructor for object_info_t
Making a copy-like constructor that doesn't actually copy is confusing
and error prone.  In this case, we initialized a clone's object_info with
the head's snapid, causing problems with what info was encoded and crashing
later in the snap_trimmer.  Here the one caller already called
copy_user_bits(); let's move the lost copy there.

This backs out one of the changes in 0cc8d34e.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-06 14:01:51 -08:00
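The design point in ebcc9395b0 (prefer an explicitly named partial copy over a constructor that only looks like a copy) can be sketched as follows. `ToyInfo`, its fields, and `make_clone()` are hypothetical; only the name `copy_user_bits()` and the fact that a clone must not inherit the head's snapid come from the commit message. Which fields count as "user bits" is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Toy stand-in for an object-info record; field names are hypothetical.
struct ToyInfo {
    uint64_t    snapid;       // identity: must not be inherited from the head
    uint64_t    size = 0;     // user-visible data
    std::string user_attr;    // user-visible data

    explicit ToyInfo(uint64_t snap) : snapid(snap) {}

    // Explicitly named partial copy: the call site makes it obvious that
    // only user-visible fields are taken and the identity is left alone.
    void copy_user_bits(const ToyInfo& other) {
        size = other.size;
        user_attr = other.user_attr;
    }
};

// Build a clone's info from the head without touching the clone's snapid.
ToyInfo make_clone(const ToyInfo& head, uint64_t clone_snap) {
    assert(clone_snap != head.snapid);  // toy precondition for this sketch
    ToyInfo clone(clone_snap);
    clone.copy_user_bits(head);
    return clone;
}
```

A reader of `make_clone()` can see in two lines which parts of the head are carried over, which is harder to tell from a constructor call that takes another `ToyInfo`.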
Colin Patrick McCabe
b1afea515f librados: fix error path in rados_deinitialize
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-12-06 11:17:10 -08:00
Yehuda Sadeh
aa3dda61dd librados: fix the C++ interface init
2010-12-06 11:16:35 -08:00
Yehuda Sadeh
9a60481681 librados: fix C interface error handling in init code
2010-12-06 10:31:06 -08:00
Greg Farnum
bf030ca267 client: resync ioctl header from ceph-client.
Previous change to the CEPH_IOCTL_MAGIC in fbbf448 was incorrect!

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
2010-12-06 09:59:50 -08:00
Laszlo Boszormenyi
4e3c201132 Tune Debian packaging for the upcoming v0.24 release.
This includes switching the OpenSSL dependency to Crypto++, since the latter
is now used instead of the former; removing radosacl, since it is no longer
compiled; and pristine-cleaning the source. Explicitly note that this is a
1.0 package format.
2010-12-05 22:20:48 -08:00
Sage Weil
27b70eb57b osd: search for unfound on osds in might_have_unfound
We were looking at 'up', which is just the set of OSDs we should be on in
the current epoch; nothing to do with where the objects might be found.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-04 21:29:00 -08:00
Sage Weil
8aa7b39138 Makefile: make radosacl build WITH_DEBUG only
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-04 20:45:50 -08:00
Yehuda Sadeh
23f370436e ceph.spec.in: update dependency
2010-12-03 19:34:37 -08:00
Yehuda Sadeh
e005925988 rgw: null terminate armor result
2010-12-03 19:34:37 -08:00
Yehuda Sadeh
f2424dfbd5 rgw: get rid of openssl altogether
2010-12-03 19:34:37 -08:00
Yehuda Sadeh
a28b449439 configure: check for the presence of libcrypto++ header files
2010-12-03 19:34:37 -08:00
Yehuda Sadeh
8821377030 crypto: change include
2010-12-03 19:34:37 -08:00
Yehuda Sadeh
76e02c71dc common: remove base64.c
2010-12-03 19:34:37 -08:00
Yehuda Sadeh
e135e9245e crypto: remove old openssl implementation
2010-12-03 19:34:37 -08:00
Yehuda Sadeh
7fa9426c6b makefile.am: most binaries (except rgw_*) don't link with openssl
2010-12-03 19:34:37 -08:00
Yehuda Sadeh
6ec622c0cf common: use ceph_armor instead of openssl based functions
also modify ceph_[un]armor to get dest buffer length
2010-12-03 19:34:37 -08:00
Yehuda Sadeh
58f3ce4a34 crypto: test for allocation failure, cleanup
2010-12-03 19:34:37 -08:00
Yehuda Sadeh
15d8bdf3bf crypto: use crypto++ for aes instead of openssl
need to implement it more efficiently, currently going through a string object
2010-12-03 19:34:37 -08:00
Sage Weil
378d13df95 osd: remove poid/soid from ScrubMap::object; clean up callers
The soid is in the key in the map; no need to store it in the value.
Update the scrub code appropriately.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-03 10:02:30 -08:00
Sage Weil
a457cbb9c2 mon: fix typo
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-03 10:02:30 -08:00
Colin Patrick McCabe
a4cc929ced make: create log directories and tmp directories
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-12-03 09:35:55 -08:00
Jim Schutt
a5297388a7 msgr: Correctly handle half-open connections.
If poll() says a socket is ready for reading, but zero bytes
are read, that means that the peer has sent a FIN.  Handle that.

One way the incorrect handling was manifesting is as follows:

Under a heavy write load, clients log many messages like this:

[19021.523192] libceph:  tid 876 timed out on osd6, will reset osd
[19021.523328] libceph:  tid 866 timed out on osd10, will reset osd
[19081.616032] libceph:  tid 841 timed out on osd0, will reset osd
[19081.616121] libceph:  tid 826 timed out on osd2, will reset osd
[19081.616176] libceph:  tid 806 timed out on osd3, will reset osd
[19081.616226] libceph:  tid 875 timed out on osd9, will reset osd
[19081.616275] libceph:  tid 834 timed out on osd12, will reset osd
[19081.616326] libceph:  tid 874 timed out on osd10, will reset osd

After the clients are done writing and the file system should
be quiet, osd hosts have a high load with many active threads:

$ ps u -C cosd
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      1383  162 11.5 1456248 943224 ?      Ssl  11:31 406:59 /usr/bin/cosd -i 7 -c /etc/ceph/ceph.conf

$ for p in `ps -C cosd -o pid --no-headers`; do grep -nH State /proc/$p/task/*/status | grep -v sleep; done
/proc/1383/task/10702/status:2:State:   R (running)
/proc/1383/task/10710/status:2:State:   R (running)
/proc/1383/task/10717/status:2:State:   R (running)
/proc/1383/task/11396/status:2:State:   R (running)
/proc/1383/task/27111/status:2:State:   R (running)
/proc/1383/task/27117/status:2:State:   R (running)
/proc/1383/task/27162/status:2:State:   R (running)
/proc/1383/task/27694/status:2:State:   R (running)
/proc/1383/task/27704/status:2:State:   R (running)
/proc/1383/task/27728/status:2:State:   R (running)

With this fix applied, a heavy load still causes many client
resets of osds, but no runaway threads result.

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-03 09:10:58 -08:00
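The half-open case handled in a5297388a7 (poll() reports the socket readable, but the read returns zero bytes) can be sketched with plain POSIX calls. This is an illustration, not the messenger code: `ReadResult`, `read_once()`, and `demo_peer_shutdown()` are hypothetical names, and a Unix-domain `socketpair()` stands in for a TCP connection whose peer has sent a FIN.

```cpp
#include <cassert>
#include <cerrno>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

enum class ReadResult { Data, PeerClosed, NoData, Error };

// Wait until fd is readable (or the timeout expires), then read once.
ReadResult read_once(int fd, char* buf, size_t len, int timeout_ms) {
    assert(len > 0);  // a zero-length read also returns 0, which would be ambiguous

    struct pollfd pfd;
    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    int rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0)
        return ReadResult::Error;
    if (rc == 0)
        return ReadResult::NoData;        // timed out, nothing to read

    ssize_t n = read(fd, buf, len);
    if (n > 0)
        return ReadResult::Data;
    if (n == 0)
        return ReadResult::PeerClosed;    // readable + 0 bytes: peer closed its side
    return (errno == EAGAIN || errno == EINTR) ? ReadResult::NoData
                                               : ReadResult::Error;
}

// Demo: the peer shuts down its write side without sending any data.
ReadResult demo_peer_shutdown() {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return ReadResult::Error;
    shutdown(sv[1], SHUT_WR);             // stand-in for the peer's FIN
    char buf[16];
    ReadResult r = read_once(sv[0], buf, sizeof(buf), 1000);
    close(sv[0]);
    close(sv[1]);
    return r;
}
```

A caller that treats `PeerClosed` like `NoData` and polls again gets woken immediately every time, which matches the runaway-thread symptom described in the commit message.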
Colin Patrick McCabe
39b42b21e9 make: create /etc/ceph if it doesn't exist
make: create /etc/ceph if it doesn't exist. On uninstall, remove the
directory if it's empty. (Never remove a user's config file, though.)

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-12-02 17:35:32 -08:00
Colin Patrick McCabe
da5ab7c9a4 osd: object_info_t: decode old versions correctly
object_info_t has one constructor that initializes everything from a
bufferlist. This means that the decode function needs to give default
values to fields in object_info_t that aren't found in the bufferlist.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-12-02 16:56:48 -08:00
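The decoding rule described in da5ab7c9a4 can be sketched with a toy versioned format. Everything here is hypothetical (`ToyObjectInfo`, the `flags` field, the one-byte version prefix, `put_raw`/`get_raw`, `encode_toy`), and the sketch uses host byte order and a `std::vector` rather than Ceph's bufferlist encoding. It only shows the idea that a type whose sole constructor decodes from a buffer must assign defaults for fields an old encoding lacks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Append the raw bytes of v (host byte order; a simplification).
template <typename T>
void put_raw(std::vector<uint8_t>& out, T v) {
    const uint8_t* p = reinterpret_cast<const uint8_t*>(&v);
    out.insert(out.end(), p, p + sizeof(T));
}

// Read a T at offset off and advance off; throws on a short buffer.
template <typename T>
T get_raw(const std::vector<uint8_t>& in, std::size_t& off) {
    if (off + sizeof(T) > in.size())
        throw std::out_of_range("short buffer");
    T v;
    std::memcpy(&v, in.data() + off, sizeof(T));
    off += sizeof(T);
    return v;
}

struct ToyObjectInfo {
    uint64_t size;
    uint32_t flags;   // hypothetical field, present only from version 2 on

    // The only constructor: all state comes from the buffer, so decode()
    // must assign every member, including ones the buffer does not carry.
    explicit ToyObjectInfo(const std::vector<uint8_t>& buf) { decode(buf); }

    void decode(const std::vector<uint8_t>& buf) {
        std::size_t off = 0;
        const uint8_t version = get_raw<uint8_t>(buf, off);
        size = get_raw<uint64_t>(buf, off);
        if (version >= 2)
            flags = get_raw<uint32_t>(buf, off);
        else
            flags = 0;   // explicit default for old encodings
    }
};

// Test helper: produce a version-1 or version-2 encoding.
std::vector<uint8_t> encode_toy(uint8_t version, uint64_t size, uint32_t flags) {
    assert(version == 1 || version == 2);
    std::vector<uint8_t> out;
    put_raw(out, version);
    put_raw(out, size);
    if (version >= 2)
        put_raw(out, flags);
    return out;
}
```

Without the `else` branch, decoding a version-1 buffer would leave `flags` uninitialized, since no other constructor gives it a value.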
Greg Farnum
03eb4e7a07 man: add man page for cephfs
Add to Makefile, debian, and ceph.spec.in bits
2010-12-02 16:18:38 -08:00
Sage Weil
78a1462243 osd: fix log tail vs last_complete assert on replica activation
The last_complete may be below the log tail IFF we have a backlog.

Fixes 756918be3b.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-01 15:40:28 -08:00
Sage Weil
a3d8c52794 filestore: call lower-level do_transactions() during journal replay
We used to call apply_transactions, which avoided rejournaling anything
because the journal wasn't writeable yet, but that uses all kinds of other
machinery that relies on threads and finishers and such that aren't
appropriate or necessary when we're just replaying journaled events.

Instead, call the lower-level do_transactions() directly.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-01 13:48:56 -08:00
Sage Weil
9ecbc300cb filestore: do journal mode autodetect and sanity check _before_ replay
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-01 13:46:30 -08:00
Sage Weil
f9fa855a71 filestore: fix journal locking on trailing mode
We're already holding journal_lock due to the surrounding
op_submit_{start,finish}.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-01 11:05:11 -08:00