The C++ AioCompletion::get_version() method only returns 32-bits. Sigh.
Add a get_version64() method that returns all 64-bits. Do not touch the
32-bit version to avoid breaking the ABI.
Backport: dumpling, cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
This way, we can avoid omap_rmkeyrange in the common append
and trim cases.
Fixes: #6040
Backport: Dumpling
Signed-off-by: Samuel Just <sam.just@inktank.com>
Currently we only detect new mon addrs and names during the probing phase.
For non-trivial clusters, this means we can get into a sticky spot when
we discover enough peers to form an quorum, but not all of them, and the
undiscovered ones are enough to break the mon ranks and prevent an
election.
One way to work around this is to continue addr and name discovery during
the election. We should also consider making the ranks less sensitive to
the undefined addrs; that is a separate change.
Fixes: #4924
Backport: dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
Tested-by: Bernhard Glomm <bernhard.glomm@ecologic.eu>
There really are stl implementations (like the one on my ubuntu 12.04
machine) which have a list::size() which is linear in the size of the
list. That assert, therefore, is quite expensive!
Fixes: #6040
Backport: Dumpling
Signed-off-by: Samuel Just <sam.just@inktank.com>
Set this up with the existing at_version member, but only increase
it for user_modify ops. Use this when logging the PG's user_version. In
order to maintain compatibility with old clients on classic pools, we
force user_version to follow at_version whenever it's updated.
Now that we have and are maintaining this PG user version, use it
for the user version on ops that get ENOENT back, when short-circuiting
replies as part of reply_op_error()[1], or when replying to repops
in eval_repop; further use it for the cls_current_version() function. This
is a small semantic change for that function, as previously it would
generally return the same value as the user would get sent back via
MOSDOpReply -- but I don't think it was something you could count on.
We now define it as being the user version of the PG at the start of the
op, and as a bonus it is defined even for read ops (the at_version is
only filled in on write operations).
[1]: We tweak PGLog to make it easier to retrieve both user and PG versions.
Signed-off-by: Greg Farnum <greg@inktank.com>
There's little point to updating versions individually when we can
do so en masse and avoid mistakes in duplication.
Signed-off-by: Greg Farnum <greg@inktank.com>
The system we've been building up works out very nicely for new clients,
but they could not have interoperated with old clients that were only
referring to our replay_version. In order to deal with this, we add
a bad_replay_version to MOSDOpReply which is encoded where we used
to encode replay_version. bad_replay_version will follow the same semantics
as reassert_version used to (except that it is filled in on reads), but
is not accessible to new clients, who can see only our properly-controlled
replay_version and user_version. This will let old and new clients
interoperate correctly when communicating about watches, etc.
Signed-off-by: Greg Farnum <greg@inktank.com>
We now require it when creating a pg_log_entry_t. The user_version
is the version which info.last_user_version should be set to
after the transaction is applied, which for everything except for
a user-modify op is going to be the version it was already at.
For now we are filling in the user-modify op's changing user_version
to be ctx->at_version.version
Signed-off-by: Greg Farnum <greg@inktank.com>
We add a corresponding user_version to pg_log_entry_t, and the logic
to assign from one to the other and to recover last_user_version from
a master's log. We aren't yet setting it to anything, though.
Signed-off-by: Greg Farnum <greg@inktank.com>
ctx->new_obs.oi.user_version is initialized to ctx->obs.oi.user_version,
and for read ops it won't be changed. That means
reply_user_version == ctx->new_obs.oi.user_version in all cases, which
means we don't want it.
Signed-off-by: Greg Farnum <greg@inktank.com>
We never expose the full eversion_t data to users, and do not want to.
However, we pull some tricks in the encode/decode functions to avoid
having to change the object_info_t disk format for this change.
When we can break compatibility, we should simplify this.
Signed-off-by: Greg Farnum <greg@inktank.com>
As part of this, rename OpContext::reply_version->reply_user_version.
The semantics that necessitate the reply_version are only for user versions,
so rename it for clarity. Then use the reply_user_version in
set_user_version() (if the op succeeded).
For now we use the PG version for ENOENT (preserving the previous
semantics), but that will get changed to the pg's user_version soon
as well.
Signed-off-by: Greg Farnum <greg@inktank.com>
There are a lot of pointers throughout our request infrastructure used solely
for exporting the version to users. The interfaces we actually expose only
provide a uint64_t (leaving off eversion_t's epoch), and that's all we're
going to maintain in our new user_version scheme, so don't pretend we'll
have more in our internal interfaces.
I audited this pretty carefully; in particular:
Op::objver is only used for passing data back to users via the calling
functions IoCtxImpl::last_objver, etc
IoCtxImpl::last_objver is used only for the set_sync_op_version() call, which
provides data only for the uint64_t get_last_version() and
rados_get_last_version() calls.
AioCompletionImpl::objver is used only for the uint64_t get_version() call.
LingerOp::pobjver is used only for referencing things that are now version_t.
Signed-off-by: Greg Farnum <greg@inktank.com>
We set this in the if below for writes, and for reads it doesn't need to
be updated (and isn't). Remove the confusing double-set so future code
inspectors don't get concerned there's a bug like I did.
Signed-off-by: Greg Farnum <greg@inktank.com>
This was confusing the heck out of me when trying to figure out
why I was hitting an assert. So replace the if-else block with
a more appropriate assert and don't include any misleading calls
to prepare_transaction() from sub_op_modify().
Signed-off-by: Greg Farnum <greg@inktank.com>
We have been returning the object's "user version" and using that
for replay, but that is in fact incorrect. In preparation for fixing
up the user version semantics, rename get_version to get_replay_version
and set_version to set_replay_version.
Signed-off-by: Greg Farnum <greg@inktank.com>
Trivial cleanup. There were still 3 references to g_conf in CephxKeyServer.
Replaced them in favor of cct->_conf.
Signed-off-by: Roald J. van Loon <roaldvanloon@gmail.com>
bool get_next(const K &key, pair<K, VPtr> *next)
may indirectly delete the object pointed by next->second when
doing :
*next = make_pair(i->first, next_val);
and it will deadlock (EDEADLK) when
void operator()(V *to_remove) {
{
Mutex::Locker l(parent->lock);
tries to acquire the lock because it is already held. The
Mutex::Locker is isolated in a block and the *next* parameter is set
outside of the block.
A test case demonstrating the problem is added to test_sharedptr_registry.cc
http://tracker.ceph.com/issues/6117fixes#6117
Signed-off-by: Loic Dachary <loic@dachary.org>
* unify conventions to match those used by jerasure ( data chunk = K,
coding chunk = M, use coding instead of parity, use erasures instead
of erased )
* make lines 80 characters long
* modify the descriptions to take into account that the chunk rank
will encoded in the pool name and not on a per object basis
* remove the doxygen link to ErasureCodeInterface because it fails
doc: asphyxiate does not support class
http://tracker.ceph.com/issues/6115
* only systematic codes are considered at this point ( all jerasure
techniques are systematic). Although the API could be extended to
include non systematic codes, it is probably a case of over
engineering at this point.
* add link to
http://tracker.ceph.com/issues/6113
add ceph osd pool create [name] [key=value]
* update the plugin system description to match the proposed
implementation http://tracker.ceph.com/issues/5877http://tracker.ceph.com/issues/4929 refs #4929
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
Signed-off-by: Loic Dachary <loic@dachary.org>
The thread created to test SharedPtrRegistry race conditions updates a
value ( ptr ) that is tested by the main gtest thread but is not
protected by a lock. Instead of adding a lock, the main thread tests
the value after pthread_join() on the child thread.
http://tracker.ceph.com/issues/6130fixes#6130
Signed-off-by: Loic Dachary <loic@dachary.org>
This lets us tell by the presence of the admin socket commands whether
a signal will make us shut down cleanly. See #5924.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>