27647 Commits

Author SHA1 Message Date
Sage Weil
35d3f2d848 ceph-disk: refactor list_[all_]partitions
Make these methods work in terms of device *names*, not paths, and fix up
the only direct list_partitions() caller to do the same.
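
As a rough illustration of the name/path distinction, the sketch below converts between the two forms. The function bodies are assumptions, since the commit text does not show them; get_dev_name is named in the neighbouring ceph-disk commit, while get_dev_path is a hypothetical name for the path helper.

```python
def get_dev_name(path):
    """Return the bare device name for a /dev path, e.g. '/dev/sda1' -> 'sda1'."""
    prefix = '/dev/'
    if not path.startswith(prefix):
        raise ValueError('not a /dev path: %r' % (path,))
    return path[len(prefix):]


def get_dev_path(name):
    """Return the /dev path for a bare device name (hypothetical helper name)."""
    return '/dev/' + name
```

Under these assumptions, a caller that holds a path would convert it with get_dev_name() before asking for that device's partitions.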

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-16 15:51:44 -07:00
Sage Weil
e0401591e3 ceph-disk: add get_dev_name, path helpers
Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-16 15:51:43 -07:00
Sage Weil
d656aed599 mon/OSDMonitor: fix typo
From 5eac38797d

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-16 15:36:53 -07:00
Sage Weil
d90683fded osd/OSDMonitor: make 'osd pool rmsnap ...' not racy/crashy
Ensure that the snap does in fact exist before we try to remove it.  This
avoids a crash where we get two dup rmsnap requests (due to thrashing, or
a reconnect, or something), the committed (p) value does have the snap, but
the uncommitted (pp) does not.  This fails the old test such that we try
to remove it from pp again, and assert.

Restructure the flow so that it is easier to distinguish the committed
short return from the uncommitted return (which must still wait for the
commit).
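
The restructured flow can be sketched roughly as follows. This is not the OSDMonitor code: the function name, the set-based snap containers and the return values are all assumptions made for illustration. Only the distinction between the committed (p) and uncommitted (pp) state comes from the message above.

```python
def handle_rmsnap(committed_snaps, pending_snaps, snap):
    """Return (status, must_wait_for_commit) for an rmsnap request.

    committed_snaps stands in for the committed pool (p) and
    pending_snaps for the uncommitted pool (pp); both are plain sets here.
    """
    if snap not in committed_snaps:
        # Committed short return: the snap is already gone (or never
        # existed), so there is nothing to wait for.
        return ('absent', False)
    if snap in pending_snaps:
        # First request: stage the removal in the uncommitted state.
        pending_snaps.remove(snap)
    # Either we just staged the removal, or a duplicate request arrived
    # while the removal was still uncommitted. Do not remove it again;
    # just wait for the commit.
    return ('removed', True)
```

A duplicate request then takes the same uncommitted return path as the first one instead of attempting a second removal.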

     0> 2013-07-16 14:21:27.189060 7fdf301e9700 -1 osd/osd_types.cc: In function 'void pg_pool_t::remove_snap(snapid_t)' thread 7fdf301e9700 time 2013-07-16 14:21:27.187095
osd/osd_types.cc: 662: FAILED assert(snaps.count(s))

 ceph version 0.66-602-gcd39d8a (cd39d8a6727d81b889869e98f5869e4227b50720)
 1: (pg_pool_t::remove_snap(snapid_t)+0x6d) [0x7ad6dd]
 2: (OSDMonitor::prepare_command(MMonCommand*)+0x6407) [0x5c1517]
 3: (OSDMonitor::prepare_update(PaxosServiceMessage*)+0x1fb) [0x5c41ab]
 4: (PaxosService::dispatch(PaxosServiceMessage*)+0x937) [0x598c87]
 5: (Monitor::handle_command(MMonCommand*)+0xe56) [0x56ec36]
 6: (Monitor::_ms_dispatch(Message*)+0xd1d) [0x5719ad]
 7: (Monitor::handle_forward(MForward*)+0x821) [0x572831]
 8: (Monitor::_ms_dispatch(Message*)+0xe44) [0x571ad4]
 9: (Monitor::ms_dispatch(Message*)+0x32) [0x588c52]
 10: (DispatchQueue::entry()+0x549) [0x7cf1d9]
 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7060fd]
 12: (()+0x7e9a) [0x7fdf35165e9a]
 13: (clone()+0x6d) [0x7fdf334fcccd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-07-16 15:35:57 -07:00
Samuel Just
1999fa2c6c ObjectStore: add omap_rmkeyrange to dump
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-16 15:30:11 -07:00
Samuel Just
44c3917753 OSD: add perfcounter tracking messages delayed pending a map
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-16 15:30:04 -07:00
Samuel Just
d9e0e789bc FileStore: add a perf counter for time spent acquiring op queue throttle
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-16 15:29:52 -07:00
Sage Weil
62d9983bce Merge branch 'wip-4779' into next
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-16 15:24:03 -07:00
Gregory Farnum
c449a8b325 Merge pull request #439 from yehudasa/wip-rgw-next
rgw: quiet down ECANCELED on put_obj_meta()
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-07-16 15:17:25 -07:00
Sage Weil
4d9d0ffb89 mon/OSDMonitor: return error if we can't set the new bucket's name
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2013-07-16 15:14:01 -07:00
Sage Weil
466d0f5fc8 crush: return EINVAL on invalid name from {insert,update,create_or_move}_item, set_item_name
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2013-07-16 15:13:55 -07:00
Sage Weil
93fc07c184 crush: add is_valid_crush_name() helper
[A-Za-z0-9-_.]+
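
For illustration, the same character class can be checked in Python as below. This is a sketch only: the real helper lives in the crush code, and the Python function here merely borrows its name. The hyphen is escaped so that it is read as a literal rather than as a range.

```python
import re

# One or more characters from the set given in the commit message.
_CRUSH_NAME_RE = re.compile(r'[A-Za-z0-9\-_.]+')


def is_valid_crush_name(name):
    """Return True if the whole name consists only of allowed characters."""
    return _CRUSH_NAME_RE.fullmatch(name) is not None
```

Using fullmatch() rather than match() matters here, because a name with a valid prefix and an invalid tail must still be rejected.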

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2013-07-16 15:13:30 -07:00
Joao Eduardo Luis
5eac38797d mon: OSDMonitor: only thrash and propose if we are the leader
'thrash_map' is only set if we are the leader, so we would thrash and
propose the pending value if we are the leader.  However, we should keep
the 'is_leader()' check not only for clarity's sake (an unfamiliar reader
may cry OMGBUG, prompting a patch much like this), but also because
we may lose a subsequent election and become a peon instead, while still
holding a 'thrash_map' value > 0 -- and we really don't want to propose
while being a peon.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-16 15:08:10 -07:00
Sage Weil
b19ec576e6 mon/MDSMonitor: make 'ceph mds remove_data_pool ...' idempotent
Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-16 14:52:16 -07:00
Sage Weil
ba28c7cc2a mon/OSDMonitor: clean up waiting_for_map messages on shutdown
Do not leak these.

Fixes: #5643
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-07-16 14:49:59 -07:00
Sage Weil
f06a124a7f mon/OSDMonitor: send_to_waiting() in on_active()
The send_latest() helper may put a message in the waiting_for_map list
if we are not readable, but currently send_to_waiting() is only called
from update_from_paxos(), and it is possible that we may be unreadable
but not get a map update.

Instead, share the map when we are active.  Do the same for check_subs(),
which is also about sharing the *new* map.  Leave
share_map_with_random_osd() and process_failures() which are not
concerned with whether this is the latest map or not.

This problem surfaced when we changed the timing of refresh relative to
paxos commit, since update_from_paxos() is now not normally called while
readable; see f1ce8d7c95 and
c711203c0d.

Fixes: #5643
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-07-16 14:49:59 -07:00
Yehuda Sadeh
72d4351ea5 rgw: quiet down ECANCELED on put_obj_meta()
Fixes: #5439

ECANCELED there means that we lost in a race to write the object. We
should treat it as a successful write. This is reviving an old behavior
that was changed inadvertently.
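
A minimal sketch of that policy, assuming the usual negative-errno return convention. The function name is hypothetical and this is not the rgw implementation.

```python
import errno


def put_obj_meta_result(ret):
    """Map a raw return code to the value reported to the caller."""
    if ret == -errno.ECANCELED:
        # We lost a race to write the object; another writer got there
        # first, so report the write as successful.
        return 0
    return ret
```

Any other error code is passed through unchanged.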

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
2013-07-16 14:08:20 -07:00
Sage Weil
acbc2f0bc0 osd: do not enable HASHPSPOOL pool feature by default
This was added in kernel 3.9 and should not yet be enabled by default.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-16 13:41:08 -07:00
Sage Weil
64379e701b ceph-disk: rely on /dev/disk/by-partuuid instead of special-casing journal symlinks
This was necessary when ceph-disk-udev didn't create the by-partuuid (and
other) symlinks for us, but now it is fragile and error-prone.  (It also
appears to be broken on a certain customer RHEL VM.)  See
d7f7d61351.

Instead, just use the by-partuuid symlinks that we spent all that ugly
effort generating.

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2013-07-16 13:15:07 -07:00
Dan Mick
3706dbbf9f PendingReleaseNotes: formatted ceph CLI output and ceph-rest-api
Signed-off-by: Dan Mick <dan.mick@inktank.com>
2013-07-16 13:09:21 -07:00
Joao Eduardo Luis
ad1392f681 mon: Monitor: StoreConverter: clearer debug message on 'needs_conversion()'
The previous debug message outputted the function's name, as often our
functions do.  This was however a source of bewilderment, as users would
see those in logs and think their stores would need conversion.  Changing
this message is trivial enough and it will make ceph users happier log
readers.

Backport: cuttlefish
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-16 10:33:03 -07:00
Joao Eduardo Luis
e752c40c23 mon: Monitor: StoreConverter: sanitize 'store' pointer on init
We are supposed to have umount'ed the store and set the pointer to NULL.
We should not tolerate any other case on init().

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-16 10:32:14 -07:00
Joao Eduardo Luis
036e6739a4 mon: Monitor: do not reopen MonitorDBStore during conversion
We already open the store on ceph_mon.cc, before we start the conversion.
Given we are unable to reproduce this every time a conversion is triggered,
we are led to believe that this causes a race in leveldb that will lead
to 'store.db/LOCK' being locked upon the open this patch removes.

Regardless, reopening the db here is pointless as we already did it when
we reach Monitor::StoreConverter::convert().

Fixes: #5640
Backport: cuttlefish

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-16 10:31:46 -07:00
Gregory Farnum
38691e7f95 Merge pull request #438 from yehudasa/wip-rgw-next
Fix an issue with bucket placements and with listing on new installations.

Reviewed-by: Greg Farnum <greg@inktank.com>
2013-07-16 09:33:52 -07:00
Yehuda Sadeh
408014ee46 rgw: handle ENOENT when listing bucket metadata entries
Just return success (with an empty list)
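
The idea can be sketched as follows. The fetch callable and the function name are hypothetical stand-ins, and the sketch uses Python exceptions rather than the return codes the rgw code would see.

```python
import errno


def list_bucket_metadata_entries(fetch):
    """Return the entries produced by fetch(), or [] if nothing exists yet."""
    try:
        return list(fetch())
    except OSError as e:
        if e.errno == errno.ENOENT:
            # Nothing has been written yet (e.g. a new installation):
            # report success with an empty list.
            return []
        raise
```

Errors other than ENOENT still propagate to the caller.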

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
2013-07-15 18:43:56 -07:00
Yehuda Sadeh
eef4458279 rgw: fix bucket placement assignment
When we set bucket.instance meta, we need to set
the correct bucket placement to the bucket (according to
the specific placement rule). However, it might be that
bucket placement was never configured and we just go by
the defaults, using the old legacy pools selection.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
2013-07-15 15:49:42 -07:00
Samuel Just
39e5a2a406 OSD: add config option for peering_wq batch size
Large peering_wq batch sizes may excessively delay
peering messages resulting in unreasonably long
peering.  Making the batch size configurable may speed up peering.

Backport: cuttlefish
Related: #5084
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-15 15:15:17 -07:00
Sage Weil
b46930c96c mon: make report pure json
Put the crc in the status string and drop the header and footer.  If users
want to capture it,

ceph report 2>&1 > foo.txt

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-15 14:29:14 -07:00
Sage Weil
40a8bbdc53 Merge remote-tracking branch 'gh/wip-mon-report' into next
2013-07-15 14:23:40 -07:00
Sage Weil
daf7672309 ceph: drop --threshold hack for 'pg dump_stuck'
We can live with the incompatibility here; the hack is currently
not working anyway (see #5623).

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2013-07-15 14:05:21 -07:00
Sage Weil
4282971d47 msg/Pipe: be a bit more explicit about encoding outgoing messages
Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-15 13:48:07 -07:00
Sage Weil
314cf046b0 messages/MClientReconnect: clear data when encoding
The MClientReconnect puts everything in the data payload portion of
the message and nothing in the front portion.  That means that if the
message is resent (socket failure or something), the messenger thinks it
hasn't been encoded yet (front empty) and reencodes, which means
everything gets added (again) to the data portion.

Decoding continues until it runs out of data, so the second copy
means we decode garbage snap realms, leading to the crash in bug

Clearing data each time around resolves the problem, although it does
mean we do the encoding work multiple times.  We could alternatively
(or also) stick some data in the front portion of the payload
(ignored), but that changes the wire protocol and I would rather not
do that.
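
The failure mode and the fix can be modelled with a toy message class. Everything here (class name, byte layout, the send() helper) is an assumption for illustration. Only the behaviour follows the description above: the front stays empty, the messenger re-encodes on resend, and the data is cleared each time.

```python
class ReconnectMessage:
    """Toy stand-in for a message that encodes only into its data portion."""

    def __init__(self, realms):
        self.realms = list(realms)
        self.front = b''  # never filled in by this message type
        self.data = b''

    def encode_payload(self):
        # The fix: clear the data first, so that encoding again on a
        # resend does not append a second copy of everything.
        self.data = b''
        for realm in self.realms:
            self.data += realm.to_bytes(4, 'little')


def send(msg):
    """Model of the messenger: an empty front is read as 'not encoded yet'."""
    if not msg.front:
        msg.encode_payload()
    return msg.data
```

Without the reset at the top of encode_payload(), a second send() would return the realms twice, which is the duplicated payload the decoder then misreads.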

Fixes: #4565
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-07-15 13:47:39 -07:00
Sage Weil
6a524c358c Merge pull request #436 from ceph/wip-mon-fixes
Wip mon fixes

Reviewed-by: Greg Farnum <greg@inktank.com>
2013-07-15 13:46:43 -07:00
Sage Weil
34f76bd915 mon: set forwarded message recv stamp
Set it to the stamp of the MForward that carried us.  One could argue
we really want the original receive stamp on the origin, but that is
not available to us, and this is better than nothing.

In particular, this gives 'ceph log ...' commands a timestamp when they
are forwarded via a peon.  The stamp is still between when the request
is sent and when it is committed/acked, so all is well from the
client's perspective.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-15 13:42:54 -07:00
Sage Weil
eac559f474 mon: drop win_election() _reset() kludge and strengthen assertions
This is only there for the benefit of win_standalone_election(), but it
doesn't need it, it clutters the code, and weakens our assertions.

Now the only win_election() callers are win_standalone_election() (which
is a single path that just did _reset()) and the elector.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-15 13:42:54 -07:00
Sage Weil
c67d50b86b mon: set peon state to electing if other mons call an election
Previously we would call mon->reset() and set various flags (like
exited_quorum timestamp), but the state would remain PEON.  Make an
explicit join_election() callback and set the state there, and add
asserts in reset() (renamed to be private) so that we ensure all
callers are well-behaved.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-15 13:42:53 -07:00
Sage Weil
3ea984cd6a mon: once sync full is chosen, make sure we don't change our mind
It is possible for a sequence like:

 - probe
 - first probe reply has paxos trim that indicates a full sync is
   needed
 - start sync
 - clear store
 - something happens that makes us abort and bootstrap (e.g., the
   provider mon restarts)
 - probe
 - first probe reply has older paxos trim bound and we call an election
 - on election completion, we crash because we have no data.

Non-determinism of the probe decision aside, we need to ensure that
the info we share during probe (fc, lc) is accurate, and that once we
clear the store we know we *must* do a full sync.
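
One way to make the decision sticky is to record it in the store itself, sketched below with a dict standing in for the monitor store. The marker key, the function names and the simplified probe comparison are all hypothetical. The commit text only states the requirement, not how it is met.

```python
FORCE_FULL_KEY = 'force_full_sync'  # hypothetical marker name


def needs_full_sync(store, peer_first_committed):
    """Decide on a probe reply whether a full sync is required."""
    if store.get(FORCE_FULL_KEY):
        # We already cleared the store once: never change our mind.
        return True
    # Simplified check: the peer has trimmed past what we have.
    return store.get('last_committed', 0) < peer_first_committed


def start_full_sync(store):
    # A real store would do both steps in one transaction; with a dict
    # we clear first and then set the marker.
    store.clear()
    store[FORCE_FULL_KEY] = True


def finish_full_sync(store, last_committed):
    store['last_committed'] = last_committed
    del store[FORCE_FULL_KEY]
```

With the marker set, a later probe reply carrying an older trim bound still leads to a full sync rather than an election on an empty store.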

Fixes: #5621
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-15 13:42:53 -07:00
Sage Weil
1fc85acdb8 mon/PaxosService: consolidate resetting in restart()
We had duplicated code in election_finished() and restart(), and it was
incomplete.  Put it all in restart() only (the mon should have called
restart() long before the election finishes).  Note that we cannot
assert as much in election_finished() because another service may have
just cross-proposed.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-15 13:42:53 -07:00
Sage Weil
7666c33a1d mon/PaxosService: assert not proposing in propose_pending
Drop the useless active check after the assert, too.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-15 13:42:53 -07:00
Sage Weil
c711203c0d mon/Paxos: separate proposal commit from the end of the round
Each commit should match with exactly one proposal; finish it when we
actually commit it and make sensible asserts.

The old finish_proposal() turns into finish_round(), and performs
generic checks and cleanup associated with the transition from
updating -> active.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-15 13:42:53 -07:00
Sage Weil
5c31010795 mon/Paxos: make all handle_accept paths go via out label
Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-15 13:42:53 -07:00
Sage Weil
2bf95e5a1c Merge branch 'next'
2013-07-15 13:20:32 -07:00
Sage Weil
f1ce8d7c95 mon: fix scrub vs paxos race: refresh on commit, not round completion
Consider:

 - paxos starts a commit N+1
 - a majority of the peers ack it
  - paxos::commit() writes N+1 to disk
  - tells peers to commit
 - peers commit N+1, *and* refresh_from_paxos(), and generate N+1 full map
 - leader does _scrub on N+1, without latest full osdmap
 - peers do _scrub on N+1, with latest full osdmap
 - leader finishes paxos gather, does refresh_from_paxos()
 -> scrub fails.

Fix this by doing the refresh_from_paxos() at commit time and not when
the paxos round finishes.  We move the refresh out of finish_proposal
and into its own helper, and update all callers accordingly.  This
keeps on-disk state more tightly in sync with in-memory state and
avoids the need for, e.g., a kludgey workaround in the scrub code.

We also simplify the bootstrap checks a bit by doing so immediately
and relying on the normal bootstrap paxos reset paths to clean up
any waiters.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-15 12:54:56 -07:00
Yehuda Sadeh
e996e9bee2 Merge pull request #437 from kri5/wip-fix-typo-rgw
rgw: Fix typo in rgw_user.cc

Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2013-07-15 12:20:15 -07:00
Yehuda Sadeh
1b5258155a Merge remote-tracking branch 'origin/wip-rgw-warnings' into next
Conflicts:
	src/test/test_rgw_admin_log.cc
	src/test/test_rgw_admin_meta.cc
	src/test/test_rgw_admin_opstate.cc

Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2013-07-15 11:17:03 -07:00
Yehuda Sadeh
a722fb713e rgw: fix bucket instance json encoding
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
2013-07-15 10:37:28 -07:00
Yehuda Sadeh
346d9f42bc rgw_admin: fix gc list encoding
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
2013-07-15 10:37:28 -07:00
Yehuda Sadeh
791d51eb36 Merge pull request #434 from gregsfortytwo/next
test_rgw: fix a number of unsigned/signed comparison warnings

Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2013-07-15 10:24:18 -07:00
John Wilkins
55ff523ef2 doc: Fixed link in Calxeda repo instruction.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2013-07-15 10:05:28 -07:00
Sage Weil
aa60f940ec mon: once sync full is chosen, make sure we don't change our mind
It is possible for a sequence like:

 - probe
 - first probe reply has paxos trim that indicates a full sync is
   needed
 - start sync
 - clear store
 - something happens that makes us abort and bootstrap (e.g., the
   provider mon restarts)
 - probe
 - first probe reply has older paxos trim bound and we call an election
 - on election completion, we crash because we have no data.

Non-determinism of the probe decision aside, we need to ensure that
the info we share during probe (fc, lc) is accurate, and that once we
clear the store we know we *must* do a full sync.

Fixes: #5621
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-07-15 10:02:47 -07:00