Multimap does not make any guarantees about ordering of different
values with the same key. list_by_hash, however, assumes that
the iterator order matches hobject_t order. Thus, we use
set<pair<string, hobject_t> > to get the proper ordering.
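
A minimal sketch of the ordering argument, assuming a simplified stand-in
for hobject_t (the Obj struct, its comparison, and list_key are
hypothetical, not the real types): a set of (key, object) pairs sorts
entries with equal keys by the object's own operator<, so iterating one
key yields objects in object order.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for hobject_t; only its ordering matters here.
struct Obj {
  unsigned hash;
  std::string name;
};

inline bool operator<(const Obj& a, const Obj& b) {
  if (a.hash != b.hash) return a.hash < b.hash;
  return a.name < b.name;
}

using Index = std::set<std::pair<std::string, Obj>>;

// Return every object stored under `key`, in Obj order.
std::vector<Obj> list_key(const Index& idx, const std::string& key) {
  std::vector<Obj> out;
  // Obj{0, ""} is the smallest Obj, so this is the first entry for `key`.
  Index::value_type first(key, Obj{0, ""});
  for (auto it = idx.lower_bound(first);
       it != idx.end() && it->first == key; ++it)
    out.push_back(it->second);
  return out;
}
```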
Backport: stable
Signed-off-by: Samuel Just <sam.just@inktank.com>
If the mon gets a reset on the client connection, it clears the session
on the connection. This is perfectly normal to see.
Signed-off-by: Sage Weil <sage@inktank.com>
If the OSD is bogged down or unresponsive, we should not try to join
the cluster. This was observed on congress (slow/clogged op_tp combined
with osdmap thrashing).
Fixes: #2502
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Ask the monitor for pending pg creations each time we connect.
Normally, this is a freebie check. If there are pending creations, though,
it ensures that the OSD finds out about them even if the original lame
broadcast didn't reach it. Specifically:
- osd is hunting for a monitor, but isn't yet connected
- new pgs are created
- send_pg_creates() sends out create messages, but osd does not get them
- osd finally connects to a mon
Fixes: #2151 (though the bug description is bad)
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
We no longer do anything with the pgs here. PG map
advancing is now handled in OSD::advance_pg asynchronously.
Signed-off-by: Samuel Just <sam.just@inktank.com>
In the case that the pg is newly created, we will activate during
that call, so the info and log will be dirty.
Signed-off-by: Samuel Just <sam.just@inktank.com>
During the osd threading refactor, we lost the do_queries
call in favor of dispatch_context. However, this did not
include the queries triggered prior to pg instantiation.
Instead, use the rctx to send the queries.
Part of #2771. Without the queries being sent,
can_create_pg will never become true.
Signed-off-by: Samuel Just <sam.just@inktank.com>
If a linger op (watch) is sent to the OSD and updates the object, and then
the client loses the reply, it will resend the request. The OSD will see
that it is a dup, however, and not set up the in-memory session state for
the watch. This in turn will break the watch (i.e., notifies won't
get delivered).
Instead, always resend linger registration ops, so that we always have a
unique reqid and do the correct session registration for each session.
* track the tid of the registration op for each LingerOp
* mark registration ops as should_resend=false; cancel as needed
* when we send a new registration op, cancel the old one to ensure we
ignore the reply. This is needed because we resend linger ops on any
pg change, not just a primary change.
* drop the first_send arg to send_linger(), as we can now infer that
from register_tid == 0.
The bug was easily reproduced with ms inject socket failures = 500 and the
test_stress_watch utility.
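
The bookkeeping in the list above can be sketched roughly as follows.
This is a simplified model, not the Objecter code: only register_tid and
send_linger come from the description; Client, in_flight, handle_reply
and op_tid_t are hypothetical names.

```cpp
#include <cstdint>
#include <set>

using op_tid_t = std::uint64_t;  // hypothetical alias for a transaction id

struct LingerOp {
  op_tid_t register_tid = 0;  // tid of the current registration op, 0 if none
};

struct Client {
  op_tid_t last_tid = 0;
  std::set<op_tid_t> in_flight;  // ops whose replies we still care about

  // Send (or resend) the registration for `lo` under a fresh tid.
  // Returns true if this was the first send for this LingerOp.
  bool send_linger(LingerOp& lo) {
    bool first_send = (lo.register_tid == 0);  // inferred, no extra argument
    if (!first_send)
      in_flight.erase(lo.register_tid);  // cancel old op so its reply is ignored
    lo.register_tid = ++last_tid;
    in_flight.insert(lo.register_tid);
    return first_send;
  }

  // Returns true if the reply belongs to a live op and was consumed.
  bool handle_reply(op_tid_t tid) {
    return in_flight.erase(tid) > 0;
  }
};
```

Because every resend gets a new tid, a late reply to a cancelled
registration is dropped rather than mistaken for the current one.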
Fixes: #2796
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Other areas rely on OSDService::get_map() to function, possibly before
activate_map is first called. In particular, with handle_osd_ping,
not initializing the map member results in:
ceph version 0.48argonaut-413-g90ddc5a (commit:90ddc5ae51627e7656459085d7e15105c8b8316d)
1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x71ba9a]
2: (()+0xfcb0) [0x7fcd8243dcb0]
3: (OSD::handle_osd_ping(MOSDPing*)+0x74d) [0x5dbdfd]
4: (OSD::heartbeat_dispatch(Message*)+0x22b) [0x5dc70b]
5: (SimpleMessenger::DispatchQueue::entry()+0x92b) [0x7b5b3b]
6: (SimpleMessenger::dispatch_entry()+0x24) [0x7b6914]
7: (SimpleMessenger::DispatchThread::entry()+0xd) [0x7762fd]
8: (()+0x7e9a) [0x7fcd82435e9a]
9: (clone()+0x6d) [0x7fcd809ea4bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Signed-off-by: Samuel Just <sam.just@inktank.com>
This way, we avoid grabbing the map_lock. Furthermore,
get curmap at the beginning of the method to ensure that
we send the message using the same map used to check
is_up.
This should also fix #2798, which was caused by
an osd being marked up between service.get_osdmap()
and OSD::osdmap.
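
A rough sketch of the pattern, assuming a simplified map type (OSDMap,
Service and maybe_send below are hypothetical stand-ins, and locking is
omitted): take one reference to the map at the top of the method and use
it for both the is_up check and the send.

```cpp
#include <memory>
#include <set>

// Hypothetical, minimal map: an epoch plus the set of osds that are up.
struct OSDMap {
  unsigned epoch = 0;
  std::set<int> up;
  bool is_up(int osd) const { return up.count(osd) > 0; }
};
using OSDMapRef = std::shared_ptr<const OSDMap>;

struct Service {
  OSDMapRef map = std::make_shared<OSDMap>();
  OSDMapRef get_osdmap() const { return map; }
};

// Returns the epoch the message would be sent with, or 0 if the
// target is not up in the snapshot and nothing is sent.
unsigned maybe_send(const Service& svc, int target) {
  OSDMapRef curmap = svc.get_osdmap();  // single snapshot for the whole method
  if (!curmap->is_up(target))
    return 0;
  return curmap->epoch;  // same map that passed the is_up check
}
```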
Signed-off-by: Samuel Just <sam.just@inktank.com>
This option makes the osd skip zeroing old trimmed regions of the log. The
data is never read, since the xattrs indicate which part of the log is
valid. We've never actually used this to debug a problem, and it consumes
space, so let's disable it.
Signed-off-by: Sage Weil <sage@inktank.com>
Whether an entry is eligible to log/dump is independent of the channel it
is sent to. Some channels impose additional restrictions.
Signed-off-by: Sage Weil <sage@inktank.com>
Below is a patch which makes the ceph-rbdnamer script more robust and
fixes a problem with the rbd udev rules.
On our setup we encountered a symlink which was linked to the wrong rbd:
/dev/rbd/mypool/myrbd -> /dev/rbd1
While that link should have gone to /dev/rbd3 (on which a
partition /dev/rbd3p1 was present).
The old udev rule passes %n to the ceph-rbdnamer script. The problem
is that %n yields 3 for rbd3 but 1 for rbd3p1, so it can't be
depended upon for rbd naming.
In the patch below the ceph-rbdnamer script is made more robust and
can now be called in various ways:
/usr/bin/ceph-rbdnamer /dev/rbd3
/usr/bin/ceph-rbdnamer /dev/rbd3p1
/usr/bin/ceph-rbdnamer rbd3
/usr/bin/ceph-rbdnamer rbd3p1
/usr/bin/ceph-rbdnamer 3
Even with all these different styles of calling the modified script, it
should now return the same rbdname. This change "has" to be combined
with calling it from udev with %k though.
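
To illustrate the normalization (ceph-rbdnamer itself is a shell script;
this is only a sketch of the idea, and rbd_devnum is a hypothetical
name), every calling style above can be reduced to the same device
number by stripping the directory and the "rbd" prefix and then reading
the leading digits:

```cpp
#include <cctype>
#include <string>

// Reduce "/dev/rbd3", "/dev/rbd3p1", "rbd3", "rbd3p1" or "3" to the rbd
// device number. Returns -1 if no device number is found.
int rbd_devnum(const std::string& arg) {
  std::string s = arg;
  // Drop any leading directory such as "/dev/".
  std::string::size_type slash = s.rfind('/');
  if (slash != std::string::npos)
    s = s.substr(slash + 1);
  // Drop the "rbd" prefix if present.
  if (s.compare(0, 3, "rbd") == 0)
    s = s.substr(3);
  // Read leading digits; a partition suffix like "p1" ends the number.
  std::string::size_type i = 0;
  int n = 0;
  while (i < s.size() && std::isdigit(static_cast<unsigned char>(s[i]))) {
    n = n * 10 + (s[i] - '0');
    ++i;
  }
  return i == 0 ? -1 : n;
}
```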
With that fixed, we hit the second problem. We ended up with:
/dev/rbd/mypool/myrbd -> /dev/rbd3p1
So the rbdname was symlinked to the partition on the rbd instead of the
rbd itself. So what probably went wrong is udev discovering the disk and
running ceph-rbdnamer which resolved it to myrbd so the following
symlink was created:
/dev/rbd/mypool/myrbd -> /dev/rbd3
However partitions would be discovered next and ceph-rbdnamer would be
run with rbd3p1 (%k) as parameter, resulting in the name myrbd too, with
the previous correct symlink being overwritten with a faulty one:
/dev/rbd/mypool/myrbd -> /dev/rbd3p1
The solution to the problem is in differentiating between disks and
partitions in udev and handling them slightly differently. So with the
patch below partitions now get their own symlinks in the following style
(which is fairly consistent with other udev rules):
/dev/rbd/mypool/myrbd-part1 -> /dev/rbd3p1
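
The resulting naming scheme can be sketched as below. The real logic
lives in the udev rules; rbd_symlink is a hypothetical helper, and using
partition 0 to mean "whole disk" is an assumption made for the example.

```cpp
#include <string>

// Build the symlink path for an rbd image. partition == 0 means the
// whole disk; any other value gets a "-partN" suffix.
std::string rbd_symlink(const std::string& pool, const std::string& image,
                        int partition) {
  std::string path = "/dev/rbd/" + pool + "/" + image;
  if (partition > 0)
    path += "-part" + std::to_string(partition);
  return path;
}
```

With distinct names for disk and partition, the later partition event no
longer overwrites the disk's symlink.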
Please let me know any feedback you have on this patch or the approach
used.
Regards,
Pascal de Bruijn
Unilogic B.V.
Signed-off-by: Pascal de Bruijn <pascal@unilogicnetworks.net>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
The ceph-mon --mkfs function no longer wipes out the directory; it is in
fact mostly a no-op that just verifies the dir exists.
So, ensure that the directory is empty at mkfs time. This could
alternatively do an 'rm -r' in that directory (that is in fact what
ceph-mon used to do), but this is safer.
Signed-off-by: Sage Weil <sage@inktank.com>
In non-crash situations, we want to make sure the message is both below the
syslog/stderr threshold and also below the normal log threshold. Otherwise
we get anything we gather on those channels, even when the log level is
low.
Signed-off-by: Sage Weil <sage@inktank.com>
Otherwise, accessing the pg via _applied_recovered_object
isn't safe. Using intrusive_ptr clarifies the reference
ownership.
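
A small sketch of why holding a reference makes the callback safe. It
uses std::shared_ptr as a stand-in for intrusive_ptr to stay
self-contained, and PG, PGRef and make_applied_cb are hypothetical
names: the callback owns a reference, so the pg cannot be destroyed
while the callback is still pending.

```cpp
#include <functional>
#include <memory>

// Hypothetical, minimal pg: just counts applied recovery callbacks.
struct PG {
  int applied = 0;
};
using PGRef = std::shared_ptr<PG>;  // stand-in for an intrusive_ptr typedef

// Build a completion callback that owns a reference to the pg.
std::function<void()> make_applied_cb(PGRef pg) {
  // Capturing by value copies the smart pointer, not the pg itself.
  return [pg]() { ++pg->applied; };
}
```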
Signed-off-by: Samuel Just <sam.just@inktank.com>
We should gather an event if it is below the log or gather threshold.
Previously we were only gathering if we were going to print it, which makes
the dump no more useful than what was already logged.
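
The rule can be sketched as two predicates. This is an illustration
only: Thresholds, should_log and should_gather are hypothetical names,
and reading "below" as "at or below" is an assumption.

```cpp
// Hypothetical per-subsystem thresholds; lower priority numbers are
// more important.
struct Thresholds {
  int log_level;     // entries at or below this are written to the log
  int gather_level;  // entries at or below this are kept for a later dump
};

// Entry is written to the log right away.
bool should_log(const Thresholds& t, int prio) {
  return prio <= t.log_level;
}

// Entry is collected at all: either threshold is enough.
bool should_gather(const Thresholds& t, int prio) {
  return prio <= t.log_level || prio <= t.gather_level;
}
```

With thresholds of 1 and 5, a priority-3 entry is gathered but not
logged, so a dump contains more than the log already showed.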
Signed-off-by: Sage Weil <sage@inktank.com>