Commit Graph

27672 Commits

Author SHA1 Message Date
Sage Weil
ea1c623406 Merge pull request #441 from ceph/wip-5626
msgr fixes for lossless peer sessions

Reviewed-by: Greg Farnum <greg@inktank.com>
2013-07-17 14:50:41 -07:00
Sage Weil
57bd6fd51b osd: make 'from dead osd' message more informative
I thought I saw some weirdness here.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-17 14:39:04 -07:00
Sage Weil
16568d9e1f msg/Pipe: a bit of additional debug output
Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-17 14:39:04 -07:00
Sage Weil
ecab4bb951 msg/Pipe: hold pipe_lock during important parts of accept()
Previously we did not bother with locking for accept() because we were
not visible to any other threads.  However, we need to close accepting
Pipes from mark_down_all(), which means we need to handle interference.

Fix up the locking so that we hold pipe_lock when looking at Pipe state
and verify that we are still in the ACCEPTING state any time we retake
the lock.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-17 14:39:04 -07:00
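The pattern described in ecab4bb951 (release the lock around blocking work, then recheck the state every time the lock is retaken) can be sketched roughly as below. This is a Python model of the idea, not the C++ in msg/Pipe: the `Pipe` class, the `OPEN` and `CLOSED` state names, and the `read_peer` callback are assumptions made for illustration.

```python
import threading


class Pipe:
    """Toy model of a pipe whose state is guarded by pipe_lock."""

    def __init__(self):
        self.pipe_lock = threading.Lock()
        self.state = "ACCEPTING"

    def stop(self):
        # What a concurrent mark_down_all() might do to an accepting pipe.
        with self.pipe_lock:
            self.state = "CLOSED"

    def accept(self, read_peer):
        with self.pipe_lock:
            if self.state != "ACCEPTING":
                return False
        # Blocking I/O happens without the lock, so another thread may
        # change the state in the meantime.
        read_peer()
        with self.pipe_lock:
            # Recheck after retaking the lock before touching Pipe state.
            if self.state != "ACCEPTING":
                return False
            self.state = "OPEN"
            return True
```

If another thread closes the pipe while `read_peer` is running, `accept()` notices on the recheck and bails out instead of promoting a dead pipe.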
Sage Weil
687fe888b3 msgr: close accepting_pipes from mark_down_all()
We need to catch these pipes too, particularly when doing a rebind(),
to avoid them leaking through.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-17 14:39:04 -07:00
Sage Weil
dd4addef2d msgr: maintain list of accepting pipes
New pipes exist in a sort of limbo before we know who the peer is and
add them to rank_pipe.  Keep a list of them in accepting_pipes for that
period.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-17 14:39:04 -07:00
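A minimal sketch of the bookkeeping that dd4addef2d and 687fe888b3 describe: pipes sit in `accepting_pipes` until the peer is known, then move to `rank_pipe`, and `mark_down_all()` has to sweep both. The Python class and the `add_accept_pipe` and `register_pipe` method names are hypothetical; only `accepting_pipes`, `rank_pipe`, and `mark_down_all` come from the commit messages.

```python
class Messenger:
    def __init__(self):
        self.accepting_pipes = set()  # pipes whose peer is not yet known
        self.rank_pipe = {}           # peer address -> pipe

    def add_accept_pipe(self, pipe):
        self.accepting_pipes.add(pipe)

    def register_pipe(self, pipe, peer_addr):
        # Peer identified: leave limbo and become a normal pipe.
        self.accepting_pipes.discard(pipe)
        self.rank_pipe[peer_addr] = pipe

    def mark_down_all(self):
        # Collect pipes from both containers so none leaks through.
        closed = list(self.accepting_pipes) + list(self.rank_pipe.values())
        self.accepting_pipes.clear()
        self.rank_pipe.clear()
        return closed
```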
Sage Weil
994e2bf224 msgr: adjust nonce on rebind()
We can have a situation where:

 - we have a pipe to a peer
 - pipe goes to standby (on peer)
 - we rebind to a new port
 - ....
 - we rebind again to the same old port
 - we connect to peer

and get reattached to the ancient pipe from two instances back.  Avoid that
by picking a new nonce each time we rebind.

Add 1,000,000 each time so that the port is still legible in the printed
output.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-17 14:38:57 -07:00
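The arithmetic behind the nonce adjustment in 994e2bf224 is small enough to show directly. This is a sketch only; the function name and the starting value are made up, and the real nonce lives in the C++ messenger.

```python
NONCE_STEP = 1_000_000  # increment taken from the commit message


def nonce_after_rebind(nonce):
    """Return a fresh nonce that keeps the original low-order digits."""
    return nonce + NONCE_STEP
```

Because the step is a power of ten, two rebinds turn a hypothetical starting value of 6801 into 2006801: the nonce differs from every earlier instance, while the trailing digits still read as the original value.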
Sage Weil
07a0860a18 msgr: mark_down_all() after, not before, rebind
If we are shutting down all old connections and binding to new ports,
we want to avoid a sequence like:

 - close all previous connections
 - new connection comes in on old port
 - rebind to new ports
 -> connection from old port leaks through

As a first step, close all connections after we shut down the old
accepter and before we start the new one.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-17 14:36:37 -07:00
Sage Weil
ad548e72fd msg/Pipe: unlock msgr->lock earlier in accept()
Small cleanup.  Nothing needs msgr->lock for the previously larger
window.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-17 14:34:40 -07:00
Sage Weil
9f1c272618 msg/Pipe: avoid creating empty out_q entry
We need to maintain the invariant that all sub queues in out_q are never
empty.  Fix discard_requeued_up_to() to avoid creating an entry unless we
know it is already present.

This bug leads to an incorrect reconnect attempt when

 - we accept a pipe (lossless peer)
 - they send some stuff, maybe
 - fault
 - we initiate reconnect, even though we have nothing queued

In particular, we shouldn't reconnect because we aren't checking for
resets, and the fact that our out_seq is 0 while the peer's might be
something else entirely will trigger asserts later.

This fixes at least one source of #5626, and possibly #5517.

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-17 14:34:40 -07:00
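The invariant from 9f1c272618 (no sub-queue in out_q is ever empty) can be modelled in a few lines. This is a hedged Python sketch, not the C++ from msg/Pipe: messages are represented by bare sequence numbers, `PRIO_HIGHEST` is a hypothetical key, and the drop condition is simplified.

```python
PRIO_HIGHEST = 255  # hypothetical priority key for requeued messages


def discard_requeued_up_to(out_q, seq):
    """Drop requeued messages with sequence number <= seq.

    out_q maps priority -> list of sequence numbers.  The lookup must not
    create an entry as a side effect, and an emptied list must be removed.
    """
    rq = out_q.get(PRIO_HIGHEST)  # plain lookup, never inserts
    if rq is None:
        return 0
    dropped = 0
    while rq and rq[0] <= seq:
        rq.pop(0)
        dropped += 1
    if not rq:
        del out_q[PRIO_HIGHEST]  # keep the "never empty" invariant
    return dropped
```

An inserting lookup (for example a `defaultdict` here, or `operator[]` on a C++ map) would leave an empty entry behind, which is the kind of state the commit message says made the pipe look like it had something queued.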
Sage Weil
579d858aab msg/Pipe: assert lock is held in various helpers
These all require that we hold pipe_lock.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-17 14:34:39 -07:00
Joao Eduardo Luis
0ebf23cee8 ceph_mon: obtain backup monmap if store is marked with 'force_sync'
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-17 14:12:13 -07:00
Sage Weil
d1501938f5 mon/OSDMonitor: make 'osd pool mksnap ...' not expose uncommitted state
We were returning success without waiting if the pending pool state had
the snap.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-07-17 09:44:50 -07:00
Sage Weil
56c5b83589 qa/workunits/cephtest/test.sh: put 'osd ls' before any 'osd create' tests
A monc/mon connection fault or the dup command test flag may mean an extra
osd id is created that isn't actually up; reorder so that doesn't screw
up 'osd ls'.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-17 09:36:36 -07:00
Joao Eduardo Luis
ad9a1044db mon: MonCommands: remove obsolete 'sync status' command
Obsoleted by the sync refactor from
da0aff28ab

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-17 09:26:20 -07:00
Samuel Just
884fa2fcb6 OSD::_try_resurrect_pg: fix cur/pgid confusion
This bug prevented resurrection of ancestor pgs where
necessary.

Fixes: #5269
This may result in pg A being created just before pg B
is resurrected and split into A and B, resulting in one
or the other operation getting an EEXIST.

Backport: cuttlefish
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-16 17:33:27 -07:00
Sage Weil
7e16b72dc3 mon/AuthMonitor: make 'auth del ...' idempotent
Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-16 17:21:33 -07:00
Sage Weil
f129d17414 qa/workunits/cephtool/test.sh: mds cluster_down/up are idempotent
As of d45429b81ab9817284d6dca98077cb77b5e8280f; fix the test.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-16 17:16:36 -07:00
Sage Weil
f2fa01e22d ceph: send successful commands twice with CEPH_CLI_TEST_DUP_COMMAND
Monitor commands need to be idempotent.  This helps us test this by
simply issuing any successful command a second time so that we notice
when a dup submission fails.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-16 16:58:13 -07:00
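The duplicate-submission check from f2fa01e22d might look roughly like the sketch below. Only the `CEPH_CLI_TEST_DUP_COMMAND` variable name comes from the commit; `run_command`, the `send` callback, and the `(ret, out)` return shape are assumptions for illustration.

```python
import os


def run_command(send, cmd, env=None):
    """Send cmd; if it succeeds and the test flag is set, send it again."""
    env = os.environ if env is None else env
    ret, out = send(cmd)
    if ret == 0 and env.get("CEPH_CLI_TEST_DUP_COMMAND"):
        # A non-idempotent command fails here, which surfaces the bug.
        ret, out = send(cmd)
    return ret, out
```

With the flag set, a command that succeeds once but errors on the repeat reports the error, so the test suite notices commands that are not yet idempotent.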
Sage Weil
d45429b81a mon/MDSMonitor: make 'mds cluster_{up,down}' idempotent
Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-16 16:26:57 -07:00
Sage Weil
9c4a0307db osdmaptool: fix cli tests
From the HASHPSPOOL change in acbc2f0bc0.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-16 16:10:08 -07:00
Sage Weil
1ec26b8e7b Merge branch 'wip-ceph-disk' into next
Reviewed-by: Gary Lowell <gary.lowell@inktank.com>
Tested-by: Jing Yuan Luke <jyluke@gmail.com>
2013-07-16 15:52:37 -07:00
Sage Weil
2ea8fac441 ceph-disk: use /sys/block to determine partition device names
Not all devices are basename + number; some have intervening character(s),
like /dev/cciss/c0d1p2.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-16 15:51:44 -07:00
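To illustrate why 2ea8fac441 reads /sys/block instead of assuming "basename + number", here is a small sketch. It is not the ceph-disk code: the function signatures and the `sys_block` parameter (added so the example can run against a scratch directory) are assumptions. It relies on sysfs writing the `/` in names such as `cciss/c0d1` as `!`.

```python
import os


def list_partitions(dev_name, sys_block="/sys/block"):
    """Return the partition names sysfs lists under a whole device."""
    entries = os.listdir(os.path.join(sys_block, dev_name))
    # Partition directories are named after the parent device.
    return sorted(e for e in entries if e.startswith(dev_name))


def dev_path(name):
    """Map a sysfs device name to its /dev path ('!' stands for '/')."""
    return "/dev/" + name.replace("!", "/")
```

Under these assumptions, a device listed as `cciss!c0d1` with a partition `cciss!c0d1p2` maps to `/dev/cciss/c0d1p2` without any guessing about the intervening `p`.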
Sage Weil
5b031e100b ceph-disk: reimplement is_partition() using /sys/block
Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-16 15:51:44 -07:00
Sage Weil
3359aaedde ceph-disk: use get_dev_name() helper throughout
This is more robust than the broken split trick.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-16 15:51:44 -07:00
Sage Weil
35d3f2d848 ceph-disk: refactor list_[all_]partitions
Make these methods work in terms of device *names*, not paths, and fix up
the only direct list_partitions() caller to do the same.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-16 15:51:44 -07:00
Sage Weil
e0401591e3 ceph-disk: add get_dev_name, path helpers
Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-16 15:51:43 -07:00
Sage Weil
d656aed599 mon/OSDMonitor: fix typo
From 5eac38797d

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-16 15:36:53 -07:00
Sage Weil
d90683fded osd/OSDMonitor: make 'osd pool rmsnap ...' not racy/crashy
Ensure that the snap does in fact exist before we try to remove it.  This
avoids a crash where we get two dup rmsnap requests (due to thrashing, or
a reconnect, or something), the committed (p) value does have the snap, but
the uncommitted (pp) does not.  This fails the old test such that we try
to remove it from pp again, and assert.

Restructure the flow so that it is easier to distinguish the committed
short return from the uncommitted return (which must still wait for the
commit).

     0> 2013-07-16 14:21:27.189060 7fdf301e9700 -1 osd/osd_types.cc: In function 'void pg_pool_t::remove_snap(snapid_t)' thread 7fdf301e9700 time 2013-07-16 14:21:27.187095
osd/osd_types.cc: 662: FAILED assert(snaps.count(s))

 ceph version 0.66-602-gcd39d8a (cd39d8a6727d81b889869e98f5869e4227b50720)
 1: (pg_pool_t::remove_snap(snapid_t)+0x6d) [0x7ad6dd]
 2: (OSDMonitor::prepare_command(MMonCommand*)+0x6407) [0x5c1517]
 3: (OSDMonitor::prepare_update(PaxosServiceMessage*)+0x1fb) [0x5c41ab]
 4: (PaxosService::dispatch(PaxosServiceMessage*)+0x937) [0x598c87]
 5: (Monitor::handle_command(MMonCommand*)+0xe56) [0x56ec36]
 6: (Monitor::_ms_dispatch(Message*)+0xd1d) [0x5719ad]
 7: (Monitor::handle_forward(MForward*)+0x821) [0x572831]
 8: (Monitor::_ms_dispatch(Message*)+0xe44) [0x571ad4]
 9: (Monitor::ms_dispatch(Message*)+0x32) [0x588c52]
 10: (DispatchQueue::entry()+0x549) [0x7cf1d9]
 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7060fd]
 12: (()+0x7e9a) [0x7fdf35165e9a]
 13: (clone()+0x6d) [0x7fdf334fcccd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-07-16 15:35:57 -07:00
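The restructured rmsnap flow from d90683fded can be modelled with two sets standing in for the committed (p) and uncommitted (pp) pool state. This is a sketch under that assumption; the function name and the string return values are hypothetical and the real logic is C++ in OSDMonitor.

```python
def prepare_rmsnap(committed, pending, snap):
    """Decide how to answer an 'osd pool rmsnap' request.

    committed: snaps in the committed pool state (p)
    pending:   snaps in the uncommitted pool state (pp)
    """
    if snap not in committed:
        # Already gone and committed: short return, nothing to wait for.
        return "reply_now"
    if snap in pending:
        # Only remove it if it is really there; a dup request finds it
        # already removed from pending and must not remove it again.
        pending.remove(snap)
    # Either way the removal is not committed yet, so wait for the commit.
    return "wait_for_commit"
```

A duplicate request that arrives before the commit takes the second path with the snap already missing from `pending`, and simply waits instead of hitting the assert shown in the backtrace.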
Samuel Just
1999fa2c6c ObjectStore: add omap_rmkeyrange to dump
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-16 15:30:11 -07:00
Samuel Just
44c3917753 OSD: add perfcounter tracking messages delayed pending a map
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-16 15:30:04 -07:00
Samuel Just
d9e0e789bc FileStore: add a perf counter for time spent acquiring op queue throttle
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-16 15:29:52 -07:00
Sage Weil
62d9983bce Merge branch 'wip-4779' into next
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-16 15:24:03 -07:00
Gregory Farnum
c449a8b325 Merge pull request #439 from yehudasa/wip-rgw-next
rgw: quiet down ECANCELED on put_obj_meta()
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-07-16 15:17:25 -07:00
Sage Weil
4d9d0ffb89 mon/OSDMonitor: return error if we can't set the new bucket's name
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2013-07-16 15:14:01 -07:00
Sage Weil
466d0f5fc8 crush: return EINVAL on invalid name from {insert,update,create_or_move}_item, set_item_name
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2013-07-16 15:13:55 -07:00
Sage Weil
93fc07c184 crush: add is_valid_crush_name() helper
[A-Za-z0-9-_.]+

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2013-07-16 15:13:30 -07:00
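The character class quoted in 93fc07c184 translates directly into a validator. The sketch below is Python rather than the C++ helper, and it escapes the hyphen so the class cannot be misread as a range; the function name mirrors the commit subject but the implementation is an assumption.

```python
import re

# Letters, digits, hyphen, underscore, and dot; at least one character.
_CRUSH_NAME_RE = re.compile(r"[A-Za-z0-9\-_.]+")


def is_valid_crush_name(name):
    """Return True if name consists only of the allowed characters."""
    return _CRUSH_NAME_RE.fullmatch(name) is not None
```

Under this pattern a name such as `host-1.rack_a` passes, while an empty string or a name containing a space or slash is rejected.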
Joao Eduardo Luis
5eac38797d mon: OSDMonitor: only thrash and propose if we are the leader
'thrash_map' is only set if we are the leader, so we would thrash and
propose the pending value if we are the leader.  However, we should keep
the 'is_leader()' check not only for clarity's sake (an unfamiliar reader
may cry OMGBUG, prompting a patch much like this), but also because
we may lose a subsequent election and become a peon instead, while still
holding a 'thrash_map' value > 0 -- and we really don't want to propose
while being a peon.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-16 15:08:10 -07:00
Sage Weil
b19ec576e6 mon/MDSMonitor: make 'ceph mds remove_data_pool ...' idempotent
Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-16 14:52:16 -07:00
Sage Weil
ba28c7cc2a mon/OSDMonitor: clean up waiting_for_map messages on shutdown
Do not leak these.

Fixes: #5643
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-07-16 14:49:59 -07:00
Sage Weil
f06a124a7f mon/OSDMonitor: send_to_waiting() in on_active()
The send_latest() helper may put a message in the waiting_for_map list
if we are not readable, but currently send_to_waiting() is only called
from update_from_paxos(), and it is possible that we may be unreadable
but not get a map update.

Instead, share the map when we are active.  Do the same for check_subs(),
which is also about sharing the *new* map.  Leave
share_map_with_random_osd() and process_failures() which are not
concerned with whether this is the latest map or not.

This problem surfaced when we changed the timing of refresh relative to
paxos commit, since update_from_paxos() is now not normally called while
readable; see f1ce8d7c95 and
c711203c0d.

Fixes: #5643
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-07-16 14:49:59 -07:00
Yehuda Sadeh
72d4351ea5 rgw: quiet down ECANCELED on put_obj_meta()
Fixes: #5439

ECANCELED there means that we lost in a race to write the object. We
should treat it as a successful write. This is reviving an old behavior
that was changed inadvertently.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
2013-07-16 14:08:20 -07:00
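The behaviour 72d4351ea5 restores, treating a lost write race as success, amounts to a one-line mapping. The sketch assumes the negative-errno return convention and uses a hypothetical function name; it is not the rgw code.

```python
import errno


def put_obj_meta_status(ret):
    """Map the raw return code of a metadata write to the caller's view."""
    if ret == -errno.ECANCELED:
        # Someone else won the race to write the object; treat as success.
        return 0
    return ret
```

All other error codes pass through unchanged, so only the race outcome is quieted.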
Sage Weil
acbc2f0bc0 osd: do not enable HASHPSPOOL pool feature by default
This was added in kernel 3.9 and should not yet be enabled by default.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-07-16 13:41:08 -07:00
Sage Weil
64379e701b ceph-disk: rely on /dev/disk/by-partuuid instead of special-casing journal symlinks
This was necessary when ceph-disk-udev didn't create the by-partuuid (and
other) symlinks for us, but now it is fragile and error-prone.  (It also
appears to be broken on a certain customer RHEL VM.)  See
d7f7d61351.

Instead, just use the by-partuuid symlinks that we spent all that ugly
effort generating.

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2013-07-16 13:15:07 -07:00
Dan Mick
3706dbbf9f PendingReleaseNotes: formatted ceph CLI output and ceph-rest-api
Signed-off-by: Dan Mick <dan.mick@inktank.com>
2013-07-16 13:09:21 -07:00
Joao Eduardo Luis
ad1392f681 mon: Monitor: StoreConverter: clearer debug message on 'needs_conversion()'
The previous debug message outputted the function's name, as often our
functions do.  This was however a source of bewilderment, as users would
see those in logs and think their stores would need conversion.  Changing
this message is trivial enough and it will make ceph users happier log
readers.

Backport: cuttlefish
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-16 10:33:03 -07:00
Joao Eduardo Luis
e752c40c23 mon: Monitor: StoreConverter: sanitize 'store' pointer on init
We are supposed to have umount'ed the store and set the pointer to NULL.
We should not tolerate any other case on init().

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-16 10:32:14 -07:00
Joao Eduardo Luis
036e6739a4 mon: Monitor: do not reopen MonitorDBStore during conversion
We already open the store on ceph_mon.cc, before we start the conversion.
Given we are unable to reproduce this every time a conversion is triggered,
we are led to believe that this causes a race in leveldb that will lead
to 'store.db/LOCK' being locked upon the open this patch removes.

Regardless, reopening the db here is pointless as we already did it when
we reach Monitor::StoreConverter::convert().

Fixes: #5640
Backport: cuttlefish

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-07-16 10:31:46 -07:00
Gregory Farnum
38691e7f95 Merge pull request #438 from yehudasa/wip-rgw-next
Fix an issue with bucket placements and with listing on new installations.

Reviewed-by: Greg Farnum <greg@inktank.com>
2013-07-16 09:33:52 -07:00
Yehuda Sadeh
408014ee46 rgw: handle ENOENT when listing bucket metadata entries
Just return success (with an empty list)

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
2013-07-15 18:43:56 -07:00