Commit Graph

20541 Commits

Author SHA1 Message Date
Samuel Just
adc9b91f37 os/HashIndex: use set<pair<string, hobject_t>> rather than multimap
Multimap does not make any guarantees about ordering of different
values with the same key.  list_by_hash, however, assumes that
the iterator order matches hobject_t order.  Thus, we use
set<pair<string, hobject_t> > to get the proper ordering.

Backport: stable

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-20 12:29:03 -07:00
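Note on adc9b91f37: a multimap orders its entries by key only, so values that share a key are not sorted by value. A set of (key, value) pairs compares the key first and the value second, which gives the ordering the message asks for. The sketch below is a minimal illustration under stated assumptions: Obj is a stand-in for hobject_t, and values_in_order and make_index are hypothetical helpers, not code from os/HashIndex.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Stand-in for hobject_t: only the ordering matters for this sketch.
using Obj = int;

// Collect the values in iteration order. std::pair compares .first and then
// .second, so entries that share a key come out sorted by value.
std::vector<Obj> values_in_order(const std::set<std::pair<std::string, Obj>>& s) {
    std::vector<Obj> out;
    for (const auto& kv : s) out.push_back(kv.second);
    return out;
}

// Insert values under one key out of order, plus one entry under a smaller key.
std::set<std::pair<std::string, Obj>> make_index() {
    std::set<std::pair<std::string, Obj>> s;
    s.insert({"abc", 3});
    s.insert({"abc", 1});
    s.insert({"abc", 2});
    s.insert({"abb", 9});
    return s;
}
```

Iterating make_index() visits the "abb" entry first and then the three "abc" entries in value order, regardless of insertion order.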
Sage Weil
0b84384fd4 mon: shut up about sessionless MPGStats messages
If the mon gets a reset on the client connection, it clears the session
on the connection.  This is perfectly normal to see.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-19 22:14:11 -07:00
Sage Weil
6580450fbc osd: clean up boot method names
Prefix subsequent steps with _.  Better names.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-19 21:27:40 -07:00
Sage Weil
369fbf6110 osd: defer boot if heartbeatmap indicates we are unhealthy
If the OSD is bogged down or unresponsive, we should not try to join
the cluster.  This was observed on congress (slow/clogged op_tp combined
with osdmap thrashing).

Fixes: #2502
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-19 21:27:37 -07:00
Sage Weil
d76df212c8 Merge branch 'next'
Conflicts:
	src/include/ceph_features.h
2012-07-19 20:22:35 -07:00
Sage Weil
dec936923f osd/mon: subscribe (onetime) to pg creations on connect
Ask the monitor for pending pg creations each time we connect.

Normally, this is a freebie check.  If there are pending creations, though,
it ensures that the OSD finds out about them even if the original lame
broadcast didn't reach it.  Specifically:

 - osd is hunting for a monitor, but isn't yet connected
 - new pgs are created
 - send_pg_creates() sends out create messages, but osd does not get them
 - osd finally connects to a mon

Fixes: #2151 (tho the bug description is bad)
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
2012-07-19 17:13:09 -07:00
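Note on dec936923f and 7f58b9beee: the idea of tracking pending pg creations per osd and handing them out when an osd connects can be modelled roughly as follows. This is a toy model, not monitor code. The class PendingCreates and its methods queue, on_connect and created are hypothetical names, and pg ids are plain strings here.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy model, not Ceph code: all names here are hypothetical.
class PendingCreates {
public:
    // Record that a pg still has to be created on the given osd.
    void queue(int osd, const std::string& pgid) { by_osd_[osd].insert(pgid); }

    // Called when an osd connects and asks once for its pending creations.
    // Returns an empty list in the common case, so the check is cheap.
    std::vector<std::string> on_connect(int osd) const {
        auto it = by_osd_.find(osd);
        if (it == by_osd_.end()) return {};
        return std::vector<std::string>(it->second.begin(), it->second.end());
    }

    // Drop the entry once the pg has been created on that osd.
    void created(int osd, const std::string& pgid) {
        auto it = by_osd_.find(osd);
        if (it == by_osd_.end()) return;
        it->second.erase(pgid);
        if (it->second.empty()) by_osd_.erase(it);
    }

private:
    std::map<int, std::set<std::string>> by_osd_;
};
```

Because the lookup is keyed by osd, an osd that missed the original broadcast still receives its list the next time it connects.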
Sage Weil
7f58b9beee mon: track pg creations by osd
Track the pending pg creations by osd, and use a helper to send out those
messages.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-19 17:13:09 -07:00
Sage Weil
4c6c927b27 Revert "rbd: fix usage for snap commands"
This reverts commit 42de6873f9.

Actually, these are fine!  Dan made them all kinds of fancy.
2012-07-19 16:45:07 -07:00
Sage Weil
42de6873f9 rbd: fix usage for snap commands
Snap commands take '--snap <snapname> <imagename>'.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-19 16:48:18 -07:00
Mike Ryan
58cd27fd29 doc: add missing dependencies to README
Signed-off-by: Mike Ryan <mike.ryan@inktank.com>
2012-07-19 11:29:40 -07:00
Sage Weil
6f381affdc add CRUSH_TUNABLES feature bit
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-18 19:49:58 -07:00
Samuel Just
e3349a2a3d OSD::handle_osd_map: don't lock pgs while advancing maps
We no longer do anything with the pgs here.  PG map
advancing is now handled in OSD::advance_pg asynchronously.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-18 15:37:28 -07:00
Sage Weil
c8ee30160d osd: add osd_debug_drop_pg_create_{probability,duration} options
This will let us exercise more of the pg creation code.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-18 14:26:16 -07:00
Samuel Just
8f5562ffe6 OSD: write_if_dirty during get_or_create_pg after handle_create
In the case that the pg is newly created, we will activate during
that call, so the info and log will be dirty.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-18 14:26:16 -07:00
Samuel Just
ca9f713004 OSD: actually send queries during handle_pg_create
During the osd threading refactor, we lost the do_queries
call in favor of dispatch_context.  However, this did not
include the queries triggered prior to pg instantiation.
Instead, use the rctx to send the queries.

Part of #2771.  Without the queries being sent,
can_create_pg will never become true.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-18 14:26:16 -07:00
Josh Durgin
0d0b468914 Merge branch 'next' 2012-07-18 12:58:47 -07:00
Sage Weil
5dd68b95b1 objecter: always resend linger registrations
If a linger op (watch) is sent to the OSD and updates the object, and then
the client loses the reply, it will resend the request.  The OSD will see
that it is a dup, however, and not set up the in-memory session state for
the watch.  This in turn will break the watch (i.e., notifies won't
get delivered).

Instead, always resend linger registration ops, so that we always have a
unique reqid and do the correct session registration for each session.

 * track the tid of the registration op for each LingerOp
 * mark registration ops as should_resend=false; cancel as needed
 * when we send a new registration op, cancel the old one to ensure we
   ignore the reply.  This is needed because we resend linger ops on any
   pg change, not just a primary change.
 * drop the first_send arg to send_linger(), as we can now infer that
   from register_tid == 0.

The bug was easily reproduced with ms inject socket failures = 500 and the
test_stress_watch utility.

Fixes: #2796
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-18 12:55:35 -07:00
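Note on 5dd68b95b1: the bookkeeping described in the bullet list (track the tid of the registration op, cancel the old one on resend, infer the first send from register_tid == 0) can be sketched as below. This is a toy model of the tid handling only. LingerOp and register_tid are named in the message; the Client class, its in-flight set and handle_reply are assumptions made for illustration and do not reflect the real Objecter interface.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Toy model of the bookkeeping only; names other than LingerOp and
// register_tid are hypothetical.
struct LingerOp {
    uint64_t register_tid = 0;  // 0 means no registration has been sent yet
};

class Client {
public:
    // (Re)send the registration for a linger op under a fresh tid.
    // Returns true if this was the first send, inferred from register_tid == 0.
    bool send_linger(LingerOp& op) {
        bool first_send = (op.register_tid == 0);
        if (!first_send) in_flight_.erase(op.register_tid);  // cancel the old op
        op.register_tid = ++last_tid_;
        in_flight_.insert(op.register_tid);
        return first_send;
    }

    // A reply is accepted only if its tid is still in flight, so a late
    // reply to a cancelled registration is ignored.
    bool handle_reply(uint64_t tid) { return in_flight_.erase(tid) == 1; }

private:
    uint64_t last_tid_ = 0;
    std::set<uint64_t> in_flight_;
};
```

Each resend gets a new tid, so the receiving side never sees it as a duplicate of the earlier request.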
Samuel Just
76efd9772c OSD: publish_map in init to initialize OSDService map
Other areas rely on OSDService::get_map() to function, possibly before
activate_map is first called.  In particular, with handle_osd_ping,
not initializing the map member results in:

ceph version 0.48argonaut-413-g90ddc5a (commit:90ddc5ae51627e7656459085d7e15105c8b8316d)
 1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x71ba9a]
 2: (()+0xfcb0) [0x7fcd8243dcb0]
 3: (OSD::handle_osd_ping(MOSDPing*)+0x74d) [0x5dbdfd]
 4: (OSD::heartbeat_dispatch(Message*)+0x22b) [0x5dc70b]
 5: (SimpleMessenger::DispatchQueue::entry()+0x92b) [0x7b5b3b]
 6: (SimpleMessenger::dispatch_entry()+0x24) [0x7b6914]
 7: (SimpleMessenger::DispatchThread::entry()+0xd) [0x7762fd]
 8: (()+0x7e9a) [0x7fcd82435e9a]
 9: (clone()+0x6d) [0x7fcd809ea4bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-18 10:44:36 -07:00
Sage Weil
7586cde9de qa/workunits/suites/pjd.sh: bash -x
This will let us see what test is failing, exactly, and what its inputs
were.  Hoping to help find #2187.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-18 10:52:44 -07:00
Josh Durgin
675d630203 ObjectCacher: fix cache_bytes_hit accounting
Misses are not hits!

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-18 10:25:13 -07:00
John Wilkins
4e1d973e46 doc: Fixed heading text.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2012-07-18 07:35:35 -07:00
John Wilkins
ebc577361c doc: favicon.ico should be new Ceph icon.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2012-07-18 07:35:00 -07:00
John Wilkins
3a377c44e1 doc: Overhauled Swift API documentation.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2012-07-17 21:28:59 -07:00
Sage Weil
aecf0031c8 Merge branch 'next' 2012-07-17 19:20:06 -07:00
Sage Weil
d78235be1b client: fix readdir locking
Several of the readdir-related methods were not taking client_lock.

Fixes: #1737
Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-17 19:19:39 -07:00
Sage Weil
82a575c9a5 client: fix leak of client_lock when not initialized
Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-17 19:18:55 -07:00
Samuel Just
90ddc5ae51 OSD: use service.get_osdmap() in heartbeat(), don't grab map_lock
service.get_osdmap() gives us sufficiently consistent
access to the map state.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-17 16:58:21 -07:00
Samuel Just
58e81c82e0 OSD: handle_osd_ping: use service->get_osdmap()
This way, we avoid grabbing the map_lock.  Furthermore,
get curmap at the beginning of the method to ensure that
we send the message using the same map used to check
is_up.

This should also fix #2798, which was caused by
an osd being marked up between service.get_osdmap()
and OSD::osdmap.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-17 16:58:21 -07:00
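Note on 58e81c82e0: taking one map snapshot at the start of the handler and using it for both the is_up check and the reply avoids the two steps seeing different maps. The sketch below shows that pattern under stated assumptions. Map, MapRef, Service and handle_ping are toy stand-ins, not the real OSDMap, OSDService or OSD::handle_osd_ping. The null check also covers the case where no map has been published yet.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <set>
#include <utility>

// Toy stand-ins; none of this is the real OSDMap or OSDService interface.
struct Map {
    unsigned epoch = 0;
    std::set<int> up;
    bool is_up(int osd) const { return up.count(osd) != 0; }
};
using MapRef = std::shared_ptr<const Map>;

class Service {
public:
    // Swap in a new immutable map under a short internal lock.
    void publish_map(MapRef m) {
        std::lock_guard<std::mutex> l(lock_);
        map_ = std::move(m);
    }
    // Hand out a reference-counted snapshot; callers need no further locking.
    MapRef get_osdmap() const {
        std::lock_guard<std::mutex> l(lock_);
        return map_;
    }

private:
    mutable std::mutex lock_;
    MapRef map_;
};

// Take one snapshot up front and use it for both the check and the reply,
// so a map published in between cannot make the two disagree.
// Returns the epoch the reply would carry, or 0 if no reply is sent.
unsigned handle_ping(const Service& service, int from) {
    MapRef curmap = service.get_osdmap();
    if (!curmap || !curmap->is_up(from)) return 0;
    return curmap->epoch;
}
```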
Samuel Just
32892c1edd doc/dev/osd_internals: add newlines before numbered lists
Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-17 16:51:57 -07:00
Sage Weil
fe4c658bd3 librados: simplify locking slightly
No reason to hold mylock_all here.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-17 16:02:18 -07:00
Sage Weil
199397dc96 osd: default 'osd_preserve_trimmed_log = false'
This option makes the osd skip zeroing old trimmed regions of the log.  The
data is never read, since the xattrs indicate which part of the log is
valid.  We've never actually used this to debug a problem, and it consumes
space, so let's disable it.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-17 12:40:33 -07:00
Samuel Just
24df8b1d82 doc/dev: add osd_internals to toc
Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-17 09:54:47 -07:00
Samuel Just
5a27f07160 doc/internals/osd_internals: fix indentation errors
Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-17 09:31:22 -07:00
Sage Weil
6490c84ff9 doc: discuss choice of pg_num
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-17 08:36:54 -07:00
Sage Weil
36d0a3555f log: simplify log logic a bit
Whether an entry is eligible to log/dump is independent of the channel it
is sent to.  Some channels impose additional restrictions.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-17 08:36:54 -07:00
Josh Durgin
abe05a3fbb Merge branch 'next' 2012-07-16 17:36:06 -07:00
Pascal de Bruijn | Unilogic Networks B.V
96587f39e3 Robustify ceph-rbdnamer and adapt udev rules
Below is a patch which makes the ceph-rbdnamer script more robust and
fixes a problem with the rbd udev rules.

On our setup we encountered a symlink which was linked to the wrong rbd:

  /dev/rbd/mypool/myrbd -> /dev/rbd1

While that link should have gone to /dev/rbd3 (on which a
partition /dev/rbd3p1 was present).

The old udev rule passes %n to the ceph-rbdnamer script. The problem
with %n is that it results in a value of 3 (for rbd3), but in a value of
1 (for rbd3p1), so it seems it can't be depended upon for rbd naming.

In the patch below the ceph-rbdnamer script is made more robust and it
can now be called in various ways:

  /usr/bin/ceph-rbdnamer /dev/rbd3
  /usr/bin/ceph-rbdnamer /dev/rbd3p1
  /usr/bin/ceph-rbdnamer rbd3
  /usr/bin/ceph-rbdnamer rbd3p1
  /usr/bin/ceph-rbdnamer 3

Even with all these different styles of calling the modified script, it
should now return the same rbdname. This change "has" to be combined
with calling it from udev with %k though.

With that fixed, we hit the second problem. We ended up with:

  /dev/rbd/mypool/myrbd -> /dev/rbd3p1

So the rbdname was symlinked to the partition on the rbd instead of the
rbd itself. So what probably went wrong is udev discovering the disk and
running ceph-rbdnamer which resolved it to myrbd so the following
symlink was created:

  /dev/rbd/mypool/myrbd -> /dev/rbd3

However partitions would be discovered next and ceph-rbdnamer would be
run with rbd3p1 (%k) as parameter, resulting in the name myrbd too, with
the previous correct symlink being overwritten with a faulty one:

  /dev/rbd/mypool/myrbd -> /dev/rbd3p1

The solution to the problem is in differentiating between disks and
partitions in udev and handling them slightly differently. So with the
patch below partitions now get their own symlinks in the following style
(which is fairly consistent with other udev rules):

  /dev/rbd/mypool/myrbd-part1 -> /dev/rbd3p1

Please let me know any feedback you have on this patch or the approach
used.

Regards,
Pascal de Bruijn
Unilogic B.V.

Signed-off-by: Pascal de Bruijn <pascal@unilogicnetworks.net>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-16 17:34:22 -07:00
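Note on 96587f39e3: the argument handling described in the message (every calling style resolves to the same rbd, and partitions get a -partN suffix) can be illustrated as below. This is a sketch in C++, not the shell script from the patch. rbd_device_number and symlink_name are hypothetical helpers, and in the actual change the disk/partition distinction is made in the udev rules rather than in the script.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper, not the real script: reduce any of the accepted
// spellings (/dev/rbd3, /dev/rbd3p1, rbd3, rbd3p1, 3) to the device number.
// Returns -1 if the argument matches none of them.
int rbd_device_number(const std::string& arg) {
    static const std::regex re(R"(^(?:/dev/)?(?:rbd)?([0-9]+)(?:p[0-9]+)?$)");
    std::smatch m;
    if (!std::regex_match(arg, m, re)) return -1;
    return std::stoi(m[1].str());
}

// Symlink name for a disk or a partition, following the
// myrbd / myrbd-part1 scheme from the message.
std::string symlink_name(const std::string& image, const std::string& arg) {
    static const std::regex part(R"(p([0-9]+)$)");
    std::smatch m;
    if (std::regex_search(arg, m, part)) return image + "-part" + m[1].str();
    return image;
}
```

With these helpers, rbd3 and rbd3p1 both resolve to device 3, while the partition gets its own name and can no longer overwrite the symlink of the disk.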
caleb miles
b0465496d2 doc/radosgw/config.rst: mended small typo
Signed-off-by: caleb miles <caleb.miles@inktank.com>
2012-07-16 16:30:36 -07:00
Sage Weil
f9c1a6fb0a Merge branch 'next' 2012-07-16 16:13:55 -07:00
Sage Weil
2a8c4db72f Merge branch 'wip-mon-mkfs'
Reviewed-by: Tommi Virtanen <tv@inktank.com>
2012-07-16 16:15:33 -07:00
Sage Weil
4eec4fc57d mkcephfs: nicer empty directory check
From TV.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-16 16:14:39 -07:00
Sage Weil
4e66a3b98d mkcephfs: error out if mon data directory is not empty
The ceph-mon --mkfs function no longer wipes out the directory; it is in
fact mostly a no-op that just verifies the dir exists.

So, ensure that the directory is empty at mkfs time.  This could
alternatively do an 'rm -r' in that directory (that is in fact what
ceph-mon used to do), but this is safer.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-16 16:14:39 -07:00
Sage Weil
6b1835a92c vstart.sh: blow away mon directory on creation/start
Now that ceph-mon doesn't blow away the mon data content, we need to.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-16 16:14:39 -07:00
Sage Weil
54be9d0917 mon: stop doing rm -rf on mon mkfs
Simply verify that the directory exists, or if it doesn't, create it.
Do nothing about its content.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-16 16:14:39 -07:00
Sage Weil
52f96b9fd1 log: apply log_level to stderr/syslog logic
In non-crash situations, we want to make sure the message is both below the
syslog/stderr threshold and also below the normal log threshold.  Otherwise
we get anything we gather on those channels, even when the log level is
low.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-16 16:02:14 -07:00
Sage Weil
de524abdb1 log: dump logging levels in crash dump
So you know what you are/are not seeing.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-16 15:53:59 -07:00
Sage Weil
d3c76f754f Merge branch 'next' 2012-07-16 15:53:54 -07:00
Samuel Just
3821f6c4bf PG: grab reference to pg in C_OSD_AppliedRecoveredObject
Otherwise, accessing the pg via _applied_recovered_object
isn't safe.  Using intrusive_ptr clarifies the reference
ownership.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-16 15:43:52 -07:00
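Note on 3821f6c4bf: a completion callback that owns a counted reference keeps its target alive until the callback itself is destroyed. The sketch below shows the ownership pattern only. It uses std::shared_ptr as a stand-in for the intrusive_ptr mentioned in the message, and the PG struct and make_applied_callback are toy, hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Toy stand-in for a pg; only the lifetime matters here.
struct PG {
    int applied = 0;
    void _applied_recovered_object() { ++applied; }
};
using PGRef = std::shared_ptr<PG>;  // stand-in for an intrusive_ptr

// Build a completion that owns a reference to the pg it will touch, so the
// pg cannot be destroyed while the callback is still pending.
std::function<void()> make_applied_callback(PGRef pg) {
    return [pg]() { pg->_applied_recovered_object(); };
}
```

The caller may drop its own reference right after queueing the callback; the pg is released only once the callback has been destroyed.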
Sage Weil
64f745008b log: fix event gather condition
We should gather an event if it is below the log or gather threshold.

Previously we were only gathering if we were going to print it, which makes
the dump no more useful than what was already logged.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-16 15:36:44 -07:00
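Note on 64f745008b and 52f96b9fd1: the two threshold rules described in these messages can be written down as small predicates. This is a sketch under stated assumptions, not the code in the log subsystem. The Levels struct and both function names are hypothetical, "below" is read as less than or equal, and the behaviour during a crash dump (only the stderr/syslog threshold applies) is an assumption drawn from the phrase "in non-crash situations".

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical thresholds; a lower priority number means a more important message.
struct Levels {
    int log;     // normal log threshold
    int gather;  // threshold for keeping events in memory for a crash dump
    int err;     // stderr/syslog threshold
};

// Keep the event if either the log or the gather threshold admits it.
bool should_gather(const Levels& l, int prio) {
    return prio <= std::max(l.log, l.gather);
}

// Outside a crash dump, stderr/syslog output also has to pass the normal
// log threshold (assumption: during a crash dump only l.err applies).
bool should_send_to_err(const Levels& l, int prio, bool crash) {
    return prio <= l.err && (crash || prio <= l.log);
}
```

With log = 1, gather = 20 and err = 10, a priority-5 event is gathered for a later dump but does not reach stderr during normal operation.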
Samuel Just
d4410e4ad5 PG::RecoveryState::Stray::react(LogEvt&): set dirty_info/log
We adjust the info and the log, so we must set dirty_info and
dirty_log to force writes.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-16 14:18:22 -07:00