Zero the right number of bytes. Fixes a bug where we clobber legit log
data. Fortunately this is only triggered with osd preserve pg log = false,
which was not the default until recently in master.
Fixes: #2799
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Mike Ryan <mike.ryan@inktank.com>
If we were to use solely the key() function, whenever we had a key with,
say, prefix 'Foo' and key 'Bar', the key() function would return something
similar to 'Foo<separator>Bar'. Therefore, obtaining the prefix and the key
would require one to be aware of the separator used, and, since that is
implementation specific, we can't rely on such prior knowledge.
This new function must then be implemented by any derivative class of
KeyValueDB, and is expected to return a pair (prefix,key) for the
current iterator's position -- the key() function should behave as
previously, returning only the 'key' component of the pair.
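The split between the two accessors can be sketched as follows. This is only an illustration: the class name MemIterator, the method name raw_key(), the combine() helper, the flat std::map storage and the '\0' separator are all assumptions, not the actual KeyValueDB code.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical in-memory iterator; storage keys look like "prefix<sep>key"
// and are assumed to have been built with combine().
class MemIterator {
public:
  explicit MemIterator(const std::map<std::string, std::string>& store)
    : store_(store), it_(store.begin()) {}

  bool valid() const { return it_ != store_.end(); }
  void next() { ++it_; }

  // Returns only the 'key' component, as before.
  std::string key() const { return split().second; }

  // Returns (prefix, key), so callers never need to know the separator.
  std::pair<std::string, std::string> raw_key() const { return split(); }

  // Helper for writers: build a storage key from its two components.
  static std::string combine(const std::string& prefix,
                             const std::string& key) {
    return prefix + kSep + key;
  }

private:
  static constexpr char kSep = '\0';  // implementation detail (assumed)

  std::pair<std::string, std::string> split() const {
    const std::string& full = it_->first;
    std::string::size_type pos = full.find(kSep);
    return {full.substr(0, pos), full.substr(pos + 1)};
  }

  const std::map<std::string, std::string>& store_;
  std::map<std::string, std::string>::const_iterator it_;
};
```

The separator stays private to the backend, which is the point of returning the pair instead of a combined string.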
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
This patch adds single key/value modification operations to the
transaction interface.
Until now, any 'set' or 'rmkeys' operations required a map of keys to be
provided to the function, which made the task of removing or setting a
bunch of keys easier. Doing these same operations for a single key,
however, would entail creating a map with a single key.
Instead, this patch adds two new pure virtual functions, to be
implemented by derivative classes, which set or remove one single
key/value, and we then implement the map-based, existing functions in
terms of these new functions.
We also update the derivative classes of KeyValueDB in order to reflect
these changes (i.e., LevelDBStore and KeyValueDBMemory).
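A rough sketch of the layering described above. The class names Transaction and MemTransaction and the exact signatures are assumptions made for illustration; they are simplified and not copied from the real interface.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical, simplified transaction interface.
class Transaction {
public:
  virtual ~Transaction() {}

  // New single-key primitives, implemented by each backend.
  virtual void set(const std::string& prefix, const std::string& key,
                   const std::string& value) = 0;
  virtual void rmkey(const std::string& prefix, const std::string& key) = 0;

  // Existing map-based operations, now expressed via the primitives.
  void set(const std::string& prefix,
           const std::map<std::string, std::string>& to_set) {
    for (const auto& kv : to_set)
      set(prefix, kv.first, kv.second);
  }
  void rmkeys(const std::string& prefix, const std::set<std::string>& keys) {
    for (const auto& k : keys)
      rmkey(prefix, k);
  }
};

// Toy backend standing in for an in-memory store.
class MemTransaction : public Transaction {
public:
  using Transaction::set;  // keep the map-based overload visible
  void set(const std::string& prefix, const std::string& key,
           const std::string& value) override {
    data[prefix][key] = value;
  }
  void rmkey(const std::string& prefix, const std::string& key) override {
    data[prefix].erase(key);
  }
  std::map<std::string, std::map<std::string, std::string>> data;
};
```

With this shape a backend only has to implement the two single-key functions, and callers with one key no longer need to build a temporary map.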
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
The assign_bid method has issues with replay because it is a write
that also returns data. This means that the replayed operation would
return success, but no data, and cause a create to fail. Instead, let
the client set the bid based on its global id and a random number.
This only affects the creation of new images, since the bid is put
into an opaque string as part of the object prefix.
Keep the server side assign_bid around in case there are old clients
still using it.
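One way the client-side bid could be built, as a sketch only. The helper name make_bid and the hex formatting are assumptions; the message only says the bid is derived from the global id and a random number.

```cpp
#include <cstdint>
#include <random>
#include <sstream>
#include <string>

// Hypothetical helper: combine the client's global id with a random value
// into an opaque hex string. No server round trip is needed, so there is
// no returned data to lose on replay.
std::string make_bid(uint64_t global_id, uint32_t rnd) {
  std::ostringstream oss;
  oss << std::hex << global_id << rnd;
  return oss.str();
}

// Convenience overload that draws the random part itself.
std::string make_bid(uint64_t global_id) {
  std::random_device rd;
  return make_bid(global_id, static_cast<uint32_t>(rd()));
}
```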
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
The wait_for_ondisk handling fixed COMMIT ordering, but the ACKs need to
go back in the same order too. For example:
- op A is queued
- client disconnects, both ACK and COMMIT replies are lost
- client reconnects
- op A and B are sent
- op A is queued
- op B is applied, ACK is sent
- op A and B COMMITs are sent
-> client's ack callbacks will see B and then A.
Fix this by creating a waiting_for_ack queue as well, and sending ACK
responses as needed. Also handle the case where the ACK should be sent
immediately when the retry event is received.
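The queueing idea can be modelled roughly as below. This is a toy sketch, not the OSD code: versions are plain integers, ops are strings, and the names AckTracker, on_retry and on_applied are made up for illustration.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical model of ACK ordering for resent ops.
struct AckTracker {
  uint64_t last_applied = 0;
  // Resent ops waiting for the update at a given version to be applied.
  std::map<uint64_t, std::vector<std::string>> waiting_for_ack;
  std::vector<std::string> sent;  // ACKs in the order they went out

  // A resent (duplicate) op for the update at 'version' arrives.
  void on_retry(const std::string& op, uint64_t version) {
    if (version <= last_applied)
      sent.push_back(op);                      // already applied: ACK now
    else
      waiting_for_ack[version].push_back(op);  // ACK once it is applied
  }

  // The update at 'version' has been applied.
  void on_applied(uint64_t version) {
    if (version > last_applied)
      last_applied = version;
    auto end = waiting_for_ack.upper_bound(version);
    // std::map iterates in version order, so ACKs leave in op order.
    for (auto it = waiting_for_ack.begin(); it != end; ++it)
      for (const auto& op : it->second)
        sent.push_back(op);
    waiting_for_ack.erase(waiting_for_ack.begin(), end);
  }
};
```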
Fixes: #2823
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Mike Ryan <mike.ryan@inktank.com>
By popular demand, moved public api into namespace. This
required some changes to ceph_dencoder to get some template
annoyance working.
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
This appears to fix problems with mount failing for at least one user.
Reported-by: Paul Pettigrew <Paul.Pettigrew@mach.com.au>
Signed-off-by: Sage Weil <sage@inktank.com>
We screwed up and encoded using the plain 'int' type instead of int32_t.
That means people have systems encoding this as both 32 and 64 bit,
depending on their architecture. This could be worse: x86_64 still has a
32-bit int (at least in my environment).
In any case, mixing both word sizes in their clusters is broken as a
result, with the exception of the kernel code, which doesn't decode this
part of the map and will tolerate differently-sized servers.
Fix this by:
* encoding using int32_t now
* decoding either 32-bit or 64-bit values, by assuming that the strings
will always be non-empty. This appears to be the case.
However:
* any cluster with 64-bit ints must upgrade all at once, or else the new
code will start encoding 32-bit values and the old code will be
confused.
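A sketch of how the dual decode can work. It assumes a little-endian layout, a 32-bit length prefix on each string, and non-negative keys (so the upper half of a 64-bit key is zero); the function names and the byte-vector buffer are hypothetical. Because strings are never empty, a zero where the length should be means those four bytes were really the upper half of a 64-bit key.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Read a little-endian 32-bit value and advance the offset.
uint32_t read_u32(const std::vector<uint8_t>& buf, size_t& off) {
  if (buf.size() < 4 || off > buf.size() - 4)
    throw std::out_of_range("short buffer");
  uint32_t v = 0;
  for (int i = 0; i < 4; ++i)
    v |= static_cast<uint32_t>(buf[off + i]) << (8 * i);
  off += 4;
  return v;
}

void write_u32(std::vector<uint8_t>& buf, uint32_t v) {
  for (int i = 0; i < 4; ++i)
    buf.push_back(static_cast<uint8_t>(v >> (8 * i)));
}

// Encode one entry. key_bytes is 4 (new format) or 8 (legacy 64-bit int).
// Assumes key >= 0, so the legacy upper half is all zero bytes.
void encode_entry(std::vector<uint8_t>& buf, int32_t key,
                  const std::string& name, int key_bytes) {
  write_u32(buf, static_cast<uint32_t>(key));
  if (key_bytes == 8)
    write_u32(buf, 0);
  write_u32(buf, static_cast<uint32_t>(name.size()));
  buf.insert(buf.end(), name.begin(), name.end());
}

// Decode a map whose keys were written as either 32-bit or 64-bit ints.
std::map<int32_t, std::string> decode_map(const std::vector<uint8_t>& buf) {
  std::map<int32_t, std::string> m;
  size_t off = 0;
  uint32_t n = read_u32(buf, off);
  while (n--) {
    int32_t key = static_cast<int32_t>(read_u32(buf, off));
    uint32_t len = read_u32(buf, off);
    if (len == 0)                // not a length: upper half of a 64-bit key
      len = read_u32(buf, off);
    if (len > buf.size() - off)
      throw std::out_of_range("short buffer");
    m[key] = std::string(reinterpret_cast<const char*>(buf.data() + off), len);
    off += len;
  }
  return m;
}
```

The same reasoning shows why mixed clusters break in the other direction: old 64-bit code has no such check and would misread a 32-bit key followed by a length.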
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
This should be helpful while investigating slow performance.
OpRequests now track events with timestamp in addition
to dumping them to the log. OpHistory keeps up to a
configurable number of the slowest ops over a configurable
recent time interval. The admin socket interface for the OSD
now has a dump_historic_ops command which dumps the stored
slow ops.
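The retention policy can be sketched like this. It is a simplified model, not the OSD implementation: times are plain seconds, and the names Entry, insert and dump as well as the two constructor parameters are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct Entry {
  double finished;   // completion time, in seconds
  double duration;   // how long the op took, in seconds
  std::string desc;
};

// Hypothetical history: keeps at most max_ops of the slowest ops that
// completed within the last max_age seconds.
class OpHistory {
public:
  OpHistory(size_t max_ops, double max_age)
    : max_ops_(max_ops), max_age_(max_age) {}

  void insert(double now, double duration, const std::string& desc) {
    ops_.push_back({now, duration, desc});
    // Drop ops that completed outside the recent window.
    ops_.erase(std::remove_if(ops_.begin(), ops_.end(),
                              [&](const Entry& e) {
                                return now - e.finished > max_age_;
                              }),
               ops_.end());
    // Drop the fastest ops until we are within the size limit.
    while (ops_.size() > max_ops_) {
      auto fastest = std::min_element(
          ops_.begin(), ops_.end(),
          [](const Entry& a, const Entry& b) {
            return a.duration < b.duration;
          });
      ops_.erase(fastest);
    }
  }

  // Descriptions of the stored ops, slowest first.
  std::vector<std::string> dump() const {
    std::vector<Entry> sorted(ops_);
    std::sort(sorted.begin(), sorted.end(),
              [](const Entry& a, const Entry& b) {
                return a.duration > b.duration;
              });
    std::vector<std::string> out;
    for (const auto& e : sorted)
      out.push_back(e.desc);
    return out;
  }

private:
  size_t max_ops_;
  double max_age_;
  std::vector<Entry> ops_;
};
```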
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Samuel Just <sam.just@inktank.com>
Provide an objclass to create and manipulate advisory
locks, along with a client api to control it. A lock
may either be exclusively locked or shared among multiple
lockers. A locker is identified by the rados client name, and
by a cookie-string.
A lock may be assigned with a tag that every operation on that
lock should use. A lock can be unlocked by the client that locked
it, or may be broken by other clients.
When a non-zero lock duration is assigned to a lock by a locker,
that locker expires after that time duration.
A lock may have a description.
Locks on a specific object can be listed. Lockers of a specific
lock can be enumerated (by get_info).
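The locking rules above can be modelled in memory as follows. This is a toy sketch of the semantics only: the Lock struct and its method names are hypothetical, expiration is left out, and none of this is the objclass or client api itself.

```cpp
#include <map>
#include <string>
#include <utility>

enum class LockType { None, Exclusive, Shared };

// Hypothetical model of one named lock on one object.
struct Lock {
  LockType type = LockType::None;
  std::string tag;
  // A locker is identified by (client name, cookie); value is a description.
  std::map<std::pair<std::string, std::string>, std::string> lockers;

  bool lock(LockType want, const std::string& client,
            const std::string& cookie, const std::string& want_tag,
            const std::string& desc = "") {
    if (!lockers.empty()) {
      if (want_tag != tag)
        return false;  // every operation on a lock must use its tag
      if (type == LockType::Exclusive || want == LockType::Exclusive)
        return false;  // only shared locks admit multiple lockers
    }
    type = want;
    tag = want_tag;
    lockers[std::make_pair(client, cookie)] = desc;
    return true;
  }

  // Release by the locker itself.
  bool unlock(const std::string& client, const std::string& cookie) {
    bool removed = lockers.erase(std::make_pair(client, cookie)) > 0;
    if (lockers.empty())
      type = LockType::None;
    return removed;
  }

  // Another client forcibly removes a named locker.
  bool break_lock(const std::string& client, const std::string& cookie) {
    return unlock(client, cookie);
  }
};
```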
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Multimap does not make any guarantees about ordering of different
values with the same key. list_by_hash, however, assumes that
the iterator order matches hobject_t order. Thus, we use
set<pair<string, hobject_t> > to get the proper ordering.
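A small illustration of why the set of pairs gives the right order. Obj is a hypothetical stand-in for hobject_t with a made-up ordering (hash, then name), and names_for_key is an invented helper; the point is only that std::pair compares the second member whenever the first members are equal.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for hobject_t: ordered by hash, then by name.
struct Obj {
  unsigned hash;
  std::string name;
  bool operator<(const Obj& o) const {
    return hash != o.hash ? hash < o.hash : name < o.name;
  }
};

// Values that share a key come back in Obj order, regardless of the
// order in which they were inserted.
std::vector<std::string> names_for_key(
    const std::set<std::pair<std::string, Obj>>& s, const std::string& key) {
  std::vector<std::string> out;
  for (const auto& p : s)
    if (p.first == key)
      out.push_back(p.second.name);
  return out;
}
```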
Backport: stable
Signed-off-by: Samuel Just <sam.just@inktank.com>
If the mon gets a reset on the client connection, it clears the session
on the connection. This is perfectly normal to see.
Signed-off-by: Sage Weil <sage@inktank.com>
If the OSD is bogged down or unresponsive, we should not try to join
the cluster. This was observed on congress (slow/clogged op_tp combined
with osdmap thrashing).
Fixes: #2502
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Ask the monitor for pending pg creations each time we connect.
Normally, this is a freebie check. If there are pending creations, though,
it ensures that the OSD finds out about them even if the original lame
broadcast didn't reach it. Specifically:
- osd is hunting for a monitor, but isn't yet connected
- new pgs are created
- send_pg_creates() sends out create messages, but osd does not get it
- osd finally connects to a mon
Fixes: #2151 (tho the bug description is bad)
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
We no longer do anything with the pgs here. PG map
advancing is now handled in OSD::advance_pg asynchronously.
Signed-off-by: Samuel Just <sam.just@inktank.com>
In the case that the pg is newly created, we will activate during
that call, so the info and log will be dirty.
Signed-off-by: Samuel Just <sam.just@inktank.com>
During the osd threading refactor, we lost the do_queries
call in favor of dispatch_context. However, this did not
include the queries triggered prior to pg instantiation.
Instead, use the rctx to send the queries.
Part of #2771. Without the queries being sent,
can_create_pg will never become true.
Signed-off-by: Samuel Just <sam.just@inktank.com>
If a linger op (watch) is sent to the OSD and updates the object, and then
the client loses the reply, it will resend the request. The OSD will see
that it is a dup, however, and not set up the in-memory session state for
the watch. This in turn will break the watch (i.e., notifies won't
get delivered).
Instead, always resend linger registration ops, so that we always have a
unique reqid and do the correct session registration for each session.
* track the tid of the registration op for each LingerOp
* mark registration ops as should_resend=false; cancel as needed
* when we send a new registration op, cancel the old one to ensure we
ignore the reply. This is needed because we resend linger ops on any
pg change, not just a primary change.
* drop the first_send arg to send_linger(), as we can now infer that
from register_tid == 0.
The bug was easily reproduced with ms inject socket failures = 500 and the
test_stress_watch utility.
Fixes: #2796
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Other areas rely on OSDService::get_map() to function, possibly before
activate_map is first called. In particular, with handle_osd_ping,
not initializing the map member results in:
ceph version 0.48argonaut-413-g90ddc5a (commit:90ddc5ae51627e7656459085d7e15105c8b8316d)
1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x71ba9a]
2: (()+0xfcb0) [0x7fcd8243dcb0]
3: (OSD::handle_osd_ping(MOSDPing*)+0x74d) [0x5dbdfd]
4: (OSD::heartbeat_dispatch(Message*)+0x22b) [0x5dc70b]
5: (SimpleMessenger::DispatchQueue::entry()+0x92b) [0x7b5b3b]
6: (SimpleMessenger::dispatch_entry()+0x24) [0x7b6914]
7: (SimpleMessenger::DispatchThread::entry()+0xd) [0x7762fd]
8: (()+0x7e9a) [0x7fcd82435e9a]
9: (clone()+0x6d) [0x7fcd809ea4bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Signed-off-by: Samuel Just <sam.just@inktank.com>