25725 Commits

Author SHA1 Message Date
Sage Weil
60603d01e5 ceph-disk: use separate lock files for prepare, activate
Use a separate lock file for prepare and activate to avoid deadlock.  This
didn't seem to trigger on all machines, but in many cases, the prepare
process would take the file lock and later trigger a udev event and the
activate would then block on the same lock, either when we explicitly call
'udevadm settle --timeout=10' or when partprobe does it on our behalf
(without a timeout!).   Avoid this by using separate locks for prepare
and activate.  We only care if multiple activates race; it is
okay for a prepare to be in progress and for an activate to be kicked
off.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-05-06 12:12:04 -07:00
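
The locking change described in 60603d01e5 can be sketched roughly as follows. This is a minimal illustration, not the actual ceph-disk code: the lock file names, the temporary directory, and the helpers `file_lock` and `try_lock` are hypothetical, and it assumes a Linux host where `fcntl.flock` is available.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def file_lock(path, blocking=True):
    """Hold an exclusive flock on `path` for the duration of the with-block."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        flags = fcntl.LOCK_EX if blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
        fcntl.flock(fd, flags)  # raises BlockingIOError if non-blocking and already held
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock


def try_lock(path):
    """Return True if the lock on `path` could be taken without waiting."""
    try:
        with file_lock(path, blocking=False):
            return True
    except BlockingIOError:
        return False


# Hypothetical lock locations, used for illustration only.
lock_dir = tempfile.mkdtemp()
PREPARE_LOCK = os.path.join(lock_dir, "prepare.lock")
ACTIVATE_LOCK = os.path.join(lock_dir, "activate.lock")

with file_lock(PREPARE_LOCK):
    # An activate triggered while prepare is running uses a different file,
    # so it is not blocked by the prepare lock.
    activate_ok = try_lock(ACTIVATE_LOCK)
    # With a single shared lock file, the same attempt would have to wait.
    shared_lock_ok = try_lock(PREPARE_LOCK)
```

With one shared lock file, an activate kicked off by udev during prepare would wait on the lock that prepare holds, which is the deadlock the commit message describes. With two files, only concurrent activates serialize against each other.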
Danny Al-Gaaf
e662b6140b ceph-test.install: add ceph-monstore-tool and ceph-osdomap-tool
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
2013-05-06 11:30:37 -07:00
Danny Al-Gaaf
eae02fd34c ceph.spec.in: remove twice listed ceph-coverage
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
2013-05-06 11:30:18 -07:00
Danny Al-Gaaf
71cef0867b ceph.spec: add some files to ceph
Add installed, but not packaged files to ceph-test (ceph-monstore-tool,
ceph-osdomap-tool) rpm file section.

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
2013-05-06 11:30:02 -07:00
Sage Weil
1a67f7b3ac mon: fix init sequence when not daemonizing
We made the common_init_finish and chdir conditional on daemonize in commit
2e0dd5ae6c, breaking init (asok at least)
when -f is specified (as with upstart).

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-03 16:20:26 -07:00
Sage Weil
3f0b8ec2d4 mon: avoid null deref in Monitor::_mon_status()
mikedawson reports:

*** Caught signal (Segmentation fault) **
 in thread 7f40ce270700

 ceph version 0.60-801-g7ec0151 (7ec0151397)
 1: /usr/bin/ceph-mon() [0x59d550]
 2: (()+0xfbd0) [0x7f40d3e38bd0]
 3: (operator<<(std::ostream&, entity_name_t const&)+0x16) [0x4d7c46]
 4: (operator<<(std::ostream&, entity_inst_t const&)+0x1b) [0x4d837b]
 5: (Monitor::_mon_status(std::ostream&)+0x2ce) [0x4d284e]
 6: (Monitor::do_admin_command(std::string, std::string, std::ostream&)+0x4f) [0x4d652f]
 7: (AdminHook::call(std::string, std::string, ceph::buffer::list&)+0x68) [0x4efa38]
 8: (AdminSocket::do_accept()+0x451) [0x64ab81]
 9: (AdminSocket::entry()+0x398) [0x64c528]
 10: (()+0x7f8e) [0x7f40d3e30f8e]
 11: (clone()+0x6d) [0x7f40d237ae1d]

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-03 16:04:31 -07:00
Sage Weil
b2501e91bb ceph.spec: require xfsprogs
This is needed when creating new OSDs (via ceph-disk).  At least for most
people.  Eventually we'll want to include btrfs here.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-05-03 13:28:24 -07:00
Sage Weil
c189d855e6 init-ceph: update osd crush map position on start
This is what the upstart ceph-osd.conf does; we need to do the same so that
new OSDs (e.g., that ceph-deploy creates) get added to the crush map.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-05-03 11:33:20 -07:00
Sage Weil
2e0dd5ae6c mon: fork early to avoid leveldb static env state
leveldb has static state that prevents it from recreating its worker thread
after our fork(), even when we close and reopen the database (tsk tsk!).
Avoid this by forking early, before we touch leveldb.

Hide the details in a Preforker class.  This is modeled after what
ceph-fuse already does; we should convert it later.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-03 11:29:24 -07:00
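
The fork-early idea in 2e0dd5ae6c can be sketched as below. This is only an illustration of the general pattern and not the Preforker class itself: the function name `run_with_early_fork` and the one-byte status protocol are assumptions. In this sketch the child exits right after reporting, whereas a real daemon would keep running.

```python
import os


def run_with_early_fork(init):
    """Fork first, then run `init` in the child; return True if init succeeded.

    Forking before `init` means any library state that init creates (worker
    threads, static environments) only ever exists in the child process.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: perform the initialization that must happen after fork().
        os.close(read_fd)
        try:
            init()
            status = b"\x00"
        except Exception:
            status = b"\x01"
        os.write(write_fd, status)  # report the outcome to the waiting parent
        os._exit(status[0])         # demo only; a daemon would continue here
    # Parent: wait for the child's report, then return (a daemon would exit).
    os.close(write_fd)
    report = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return report == b"\x00"
```

The parent can still surface an initialization failure to whoever started it because it waits for the child's report before returning.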
Sage Weil
4f49565b40 Merge remote-tracking branch 'gh/wip-mon-rank' into next
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-05-02 13:32:41 -07:00
Samuel Just
039a3a97ce tools/: add paranoid option to ceph-osdomap-tool
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-02 12:54:28 -07:00
Sage Weil
26105280d0 osd: default 'osd leveldb paranoid = false'
Signed-off-by: Sage Weil <sage@inktank.com>
2013-05-02 12:47:24 -07:00
Sage Weil
444660ed77 librados,client: bump mount timeout to 5 min
30 seconds is pretty short.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-02 12:32:06 -07:00
Samuel Just
6a61268768 OSD: also walk maps individually for start_split in consume_map()
We need to go map-by-map to get the parents right in consume_map()
just as we must in load_pgs().

Fixes: 4884
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-02 12:21:18 -07:00
Sage Weil
c659dd764e rgw: increase startup timeout to 5 min
30s is too short.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-05-02 11:06:22 -07:00
Sage Weil
65d61f7a83 Merge branch 'wip-paranoid' into next 2013-05-02 10:18:39 -07:00
Sage Weil
17c14b251d Merge remote-tracking branch 'gh/wip-doc-cuttlefish' into next 2013-05-01 17:24:40 -07:00
Samuel Just
c194151a85 Merge remote-tracking branch 'upstream/wip_4884' into next
Fixes: #4884
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-01 16:11:47 -07:00
Samuel Just
615b84b1fd Makefile,gitignore: ceph-monstore-tool, not ceph_monstore_tool
Signed-off-by: Samuel Just <sam.just@inktank.com>
2013-05-01 15:43:22 -07:00
Samuel Just
628e232060 Makefile: put ceph_monstore_tool in bin_DEBUGPROGRAMS
Signed-off-by: Samuel Just <sam.just@inktank.com>
2013-05-01 15:43:22 -07:00
Samuel Just
d0d93a743e tools: ceph-osdomap-tool.cc
Add tool for dumping info from osd omap.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2013-05-01 15:43:22 -07:00
Samuel Just
f4982268f7 OSD: load_pgs() should fill in start_split honestly
In load_pgs(), we previously assigned all children created between the
loaded pg's stored epoch and the current osdmap to that pg as their
parent.  This is not correct: some of the children may have been split
in subsequent epochs from children split in earlier epochs.  Instead,
do each map individually.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2013-05-01 14:59:08 -07:00
Samuel Just
3e0ca62b0f OSD: cancel_pending_splits needs to cancel all descendants
expand_pg_num() and load_pgs() may result in a pg with children
in pending_splits which also have children in pending_splits (etc).

Signed-off-by: Samuel Just <sam.just@inktank.com>
2013-05-01 14:56:25 -07:00
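
The requirement in 3e0ca62b0f, that cancelling must reach children of children, can be sketched as a simple transitive walk. The dictionary layout (child id mapped to parent id) and the function signature are assumptions made for illustration and do not reflect the actual OSD data structures.

```python
def cancel_pending_splits(pending_splits, parent):
    """Remove every pg that descends from `parent`; return the removed ids.

    `pending_splits` maps child pg id -> parent pg id (hypothetical layout).
    """
    removed = []
    stack = [parent]
    while stack:
        current = stack.pop()
        # Collect first, then delete, so the dict is not mutated mid-iteration.
        children = [c for c, p in pending_splits.items() if p == current]
        for child in children:
            del pending_splits[child]
            removed.append(child)
            stack.append(child)  # the child's own children must go too
    return removed
```

For example, with `{"b": "a", "c": "b", "d": "c", "x": "y"}`, cancelling `"a"` removes `b`, `c` and `d` and leaves only the unrelated entry `x`.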
Sage Weil
d944180899 osd: add --osd-leveldb-paranoid flag
Signed-off-by: Sage Weil <sage@inktank.com>
2013-05-01 14:40:48 -07:00
Sage Weil
7cc0a35222 mon: add --mon-leveldb-paranoid flag
This is sort of equivalent to an fsck.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-05-01 14:40:47 -07:00
Greg Farnum
dfacd1bd80 dumper: fix Objecter locking
Locking expectations changed at some point, and the Dumper wasn't
updated to comply:
1) We need to take the lock for Objecter, as it
doesn't do so on its own any more.
2) We need to drop the lock in several places so that Objecter
can take delivery of messages

Signed-off-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-05-01 14:10:31 -07:00
Sage Weil
a21ea0186d Revert "PaxosService: use get and put for version_t"
This reverts commit e725c3e210.

These inadvertently got rid of the prefix portion of the key, which
led to overwriting the wrong keys.

Fixes: #4872
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
2013-05-01 11:05:02 -07:00
Sage Weil
88c030fc05 mon/Paxos: update first_committed when we trim
The Paxos::trim() -> ::trim_to() path trims old states but does not
update first_committed.  This misinforms later paxos rounds such that
peers think they can participate and end up with COMMIT messages
following the COLLECT/LAST exchange that are for future commits they
can't do anything with and then crash out when they get the BEGIN:

mon/Paxos.cc: 557: FAILED assert(begin->last_committed == last_committed)

Fixes: #4879
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-01 10:57:58 -07:00
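
The invariant behind 88c030fc05 is that trimming must move first_committed forward together with the states it deletes. The toy model below illustrates this. The class `PaxosLog` and the helper `can_share_from` are hypothetical, and the half-open trim range is an assumption rather than the monitor's actual behavior.

```python
class PaxosLog:
    """Toy model of a committed-state log with a trimmable lower bound."""

    def __init__(self):
        self.states = {}          # version -> value
        self.first_committed = 0
        self.last_committed = 0

    def commit(self, value):
        self.last_committed += 1
        self.states[self.last_committed] = value
        if self.first_committed == 0:
            self.first_committed = 1

    def trim_to(self, first):
        """Drop every state older than `first` and record the new lower bound."""
        for version in range(self.first_committed, first):
            self.states.pop(version, None)
        # Without this update, later rounds would advertise states we no longer hold.
        self.first_committed = max(self.first_committed, first)

    def can_share_from(self, peer_last_committed):
        """True if we still hold every state a peer needs to catch up."""
        return peer_last_committed + 1 >= self.first_committed
```

If `trim_to` deleted states but left `first_committed` unchanged, `can_share_from` would return True for peers that are too far behind, which corresponds to the misinformation described in the commit message.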
Sage Weil
3a6138b25e mon/Paxos: don't ignore peer first_committed
We go to the effort of keeping a map of the peer's first/last committed
so that we can send the right commits during the first phase of paxos,
but we forgot to record the first value.  This appears to simply be an
oversight.  It is mostly harmless; it just means we send extra states
that the peer already has.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-01 10:57:47 -07:00
Joao Eduardo Luis
bb270f86a4 mon: Monitor: fix bug on _pick_random_mon() that would choose an invalid rank
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-05-01 10:45:59 -07:00
Joao Eduardo Luis
7f48fd0643 mon: Monitor: use rank instead of name when randomly picking monitors
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-05-01 10:45:51 -07:00
Samuel Just
8a8ae159f5 OSD: clean up in progress split state on pg removal
There are two cases: 1) The parent pg has not yet initiated the split 2) The
parent pg has initiated the split.

Previously in case 1), _remove_pg left the entry for its children in the
in_progress_splits map blocking subsequent peering attempts.

In case 1), we need to unblock requests on the child pgs for the parent on
parent removal.  We don't need to bother waking requests since any requests
received prior to the remove_pg request are necessarily obsolete.

In case 2), we don't need to do anything: the child will complete the split on
its own anyway.

Thus, we now track pending_splits vs in_progress_splits.  Children in
pending_splits are in state 1), in_progress_splits in state 2).  split_pgs
bumps pgs from pending_splits to in_progress_splits atomically with respect to
_remove_pg since the parent pg lock is held in both places.

Fixes: #4813
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-01 10:43:39 -07:00
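
The two-map bookkeeping described in 8a8ae159f5 can be sketched as follows. The class `SplitTracker`, its method names, and the single `threading.Lock` standing in for the parent pg lock are assumptions for illustration, not the OSD implementation.

```python
import threading


class SplitTracker:
    """Tracks children whose split is pending (case 1) or in progress (case 2)."""

    def __init__(self):
        self.lock = threading.Lock()      # stands in for the parent pg lock
        self.pending_splits = {}          # child -> parent, split not yet initiated
        self.in_progress_splits = set()   # children whose split has been initiated

    def add_pending(self, parent, children):
        with self.lock:
            for child in children:
                self.pending_splits[child] = parent

    def start_split(self, parent):
        """Move the parent's children from pending to in progress."""
        with self.lock:
            started = [c for c, p in self.pending_splits.items() if p == parent]
            for child in started:
                del self.pending_splits[child]
                self.in_progress_splits.add(child)
            return started

    def remove_pg(self, parent):
        """On parent removal, unblock children whose split never started."""
        with self.lock:
            dropped = [c for c, p in self.pending_splits.items() if p == parent]
            for child in dropped:
                del self.pending_splits[child]
            # Children already in progress are left alone; they finish on their own.
            return dropped
```

Because `start_split` and `remove_pg` take the same lock, a child is always in exactly one of the two states when the parent is removed.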
Greg Farnum
fe68afe9d1 mon: communicate the quorum_features properly when declaring victory.
Fixes #4747.

Signed-off-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-04-30 18:12:10 -07:00
John Wilkins
b17e8424e8 doc: Incorporating Tamil's feedback.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-04-30 18:04:46 -07:00
John Wilkins
bd6ea8d02c doc: Reordered header levels for visual clarity.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-04-30 17:48:05 -07:00
John Wilkins
bb93ebaaf8 doc: Fixed a few typos.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-04-30 17:39:50 -07:00
John Wilkins
14ce0ad177 doc: Updated the upgrade guide for Argonaut and Bobtail to Cuttlefish.
fixes: #4874

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-04-30 17:32:15 -07:00
Greg Farnum
3cf5824f60 Merge branch 'wip-4837-election-syncing' into next
Reviewed-by: Sage Weil <sage@inktank.com>
2013-04-30 15:39:21 -07:00
Sage Weil
cd1d6fb3f9 ceph-disk: tolerate /sbin/service or /usr/sbin/service
CentOS/RH has it in /sbin, others in /usr/sbin.

Backport: bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
2013-04-30 14:16:04 -07:00
Joao Eduardo Luis
a97eccadf7 mon: Monitor: disregard paxos_max_join_drift when deciding whether to sync
We should only rely on whether our paxos version overlaps with whatever
they have -- we'll catch up later with them.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-04-30 13:50:40 -07:00
Greg Farnum
a39bbdf32e mon: if we get our own sync_start back, drop it on the floor.
We have timeouts that will clean everything up, and this can happen
in some cases that we've decided are legitimate. Hopefully we'll
be able to do something else later.

Signed-off-by: Greg Farnum <greg@inktank.com>
2013-04-30 13:50:40 -07:00
Greg Farnum
d00b4cd783 Revert "mon: update assert for looser requirements"
We reverted the gating by paxos sequences, so now we don't
need to look at them at all.

This reverts commit 1e6f02b337.
Signed-off-by: Greg Farnum <greg@inktank.com>
2013-04-30 13:50:40 -07:00
Greg Farnum
cedcb1934f Revert "mon: when electing, be sure acked leaders have new enough stores to lead"
This was somehow broken -- out-of-date leaders were being elected -- and
we've decided smaller band-aids are more appropriate. We don't completely
revert the MMonElection changes, though -- there have been user clusters
running the code which includes these messages so we can't pretend it
never happened. We can make them clearly unused in the code, though.

This reverts commit fcaabf1a22.

Signed-off-by: Greg Farnum <greg@inktank.com>
2013-04-30 13:50:40 -07:00
Josh Durgin
c2bcc2a60c ObjectCacher: wait for all reads when stopping flusher
Stopping the flusher is essentially the shutdown step for the
ObjectCacher - the next thing is actually destroying it.

If we leave any reads outstanding, when they complete they will
attempt to use the now-destroyed ObjectCacher. This is particularly a
problem with rbd images, since an -ENOENT can instantly complete many
readers, so the upper layers don't wait for the other rados-level
reads of that object to finish before trying to shutdown the cache.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-04-30 13:47:47 -07:00
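
The shutdown rule in c2bcc2a60c, that teardown must not proceed while reads are outstanding, can be sketched with a counter and a condition variable. The class `ReadTracker` and its method names are hypothetical; the real ObjectCacher is C++ and structured differently.

```python
import threading


class ReadTracker:
    """Counts outstanding reads so shutdown can wait until they have all completed."""

    def __init__(self):
        self._cond = threading.Condition()
        self._outstanding = 0

    def read_started(self):
        with self._cond:
            self._outstanding += 1

    def read_finished(self):
        with self._cond:
            self._outstanding -= 1
            if self._outstanding == 0:
                self._cond.notify_all()  # wake anyone blocked in wait_for_reads()

    def wait_for_reads(self, timeout=None):
        """Block until no reads are outstanding; return False on timeout."""
        with self._cond:
            return self._cond.wait_for(lambda: self._outstanding == 0, timeout)
```

A shutdown path would call `wait_for_reads()` before destroying the cache, so that a late read completion never touches freed state.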
Sage Weil
17612a407a Merge branch 'wip-mon-compact' into next
Reviewed-by: Samuel Just <sam.just@inktank.com>
2013-04-30 11:49:31 -07:00
Greg Farnum
6ae9bbb5d0 elector: trigger a mon reset whenever we bump the epoch
We need to call reset during every election cycle; luckily we
can call it more than once. bump_epoch is (by definition!) only called
once per cycle, and it's called at the beginning, so we put it there.

Fixes #4858.

Signed-off-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-04-30 11:01:54 -07:00
David Zafman
53a2c64ff1 Merge branch 'wip-2209' into next
Reviewed-by: Samuel Just <sam.just@inktank.com>
2013-04-30 10:55:12 -07:00
Sage Weil
0acede3bff mon: change leveldb block size to 64K
#leveldb on freenode says > 2MB is nonsense (it might explain the weird
behavior we saw).  Riak tuning guide suggests 256KB for large data block
environments.  Default is 8KB.  64KB seems sane for us.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-04-30 10:26:24 -07:00
John Wilkins
6f2a7df4b0 doc: Fix typo.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-04-29 18:57:05 -07:00
John Wilkins
35a9823449 doc: Added reference to transition from mkcephfs to ceph-deploy.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-04-29 18:54:04 -07:00