We overwrite target_oloc.pool with the appropriate [read|write]_tier.
If an op is both a read and a write, write_tier wins.
We don't handle any sort of redirect yet.
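
As a rough illustration of the selection rule above, here is a minimal
sketch. PoolInfo and pick_target_pool() are hypothetical stand-ins, not the
actual Objecter code, and the use of -1 to mean "no tier configured" is an
assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for the per-pool tiering info.
struct PoolInfo {
  int64_t read_tier = -1;   // assumed: -1 means no read tier configured
  int64_t write_tier = -1;  // assumed: -1 means no write tier configured
};

// Return the pool an op should be sent to, starting from the base pool.
inline int64_t pick_target_pool(int64_t base_pool, const PoolInfo& pi,
                                bool is_read, bool is_write) {
  assert(is_read || is_write);
  int64_t target = base_pool;
  if (is_read && pi.read_tier >= 0)
    target = pi.read_tier;
  if (is_write && pi.write_tier >= 0)
    target = pi.write_tier;  // applied last, so it wins for read+write ops
  return target;
}
```

No redirect handling is modelled here, matching the note above.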
Signed-off-by: Greg Farnum <greg@inktank.com>
The only current user of the precalc_pgid field is list_objects. That's
fine, but we don't want new users to inadvertently appear and somehow
break the caching/tiering stuff by forcing us to go to the base pool
when we should be talking to somebody else. Add an assert to catch
these cases.
Signed-off-by: Greg Farnum <greg@inktank.com>
For now we simply set target_oloc = base_oloc in recalc_op_target(), but
we will shortly be doing more interesting things with it there.
Signed-off-by: Greg Farnum <greg@inktank.com>
We want to be able to target other pools for caching and tiering, so
we need to take an oloc from the client and translate it into an
actual target. Rename oloc to base_oloc to make clear which one it is.
Signed-off-by: Greg Farnum <greg@inktank.com>
While iterating over the store files we race against leveldb, which may
be shuffling data around thus removing some files.
If we ignore files that have gone missing by the time we stat them, they
are simply left out of the total, but that's okay -- this is just an estimate.
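
A minimal sketch of the idea, assuming a plain list of file paths and POSIX
stat(); estimate_store_size() is a hypothetical helper, not the function
changed by this commit.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <string>
#include <vector>
#include <sys/stat.h>

// Sum the sizes of the given files, skipping any that vanished between
// listing and stat. Returns 0 on success, or a negative errno for any
// failure other than ENOENT.
inline int estimate_store_size(const std::vector<std::string>& paths,
                               uint64_t* total) {
  assert(total != nullptr);
  *total = 0;
  for (const auto& p : paths) {
    struct stat st;
    if (::stat(p.c_str(), &st) < 0) {
      if (errno == ENOENT)
        continue;  // file was removed under us; fine for an estimate
      return -errno;
    }
    *total += static_cast<uint64_t>(st.st_size);
  }
  return 0;
}
```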
Fixes: #6178
Signed-off-by: Joao Eduardo Luis <jecluis@gmail.com>
Sigh. This doesn't make much intuitive sense to me, but this is how it
currently works.
Switch to using the async api while we are at it.
Signed-off-by: Sage Weil <sage@inktank.com>
Fixes: #6151
Backport: dumpling
Signed-off-by: Samuel Just <sam.just@inktank.com>
Introduced: f808c205c503f7d32518c91619f249466f84c4cf
Reviewed-by: Sage Weil <sage@inktank.com>
We add fields sufficient to specify
* many pools have a tiering relationship with pool foo
* pool foo is a tier pool for pool bar
* the tiering relationship between foo and bar is specified
by cache_mode
* client reads and writes for pool foo should be directed to
pools bar and baz, respectively (where probably, but not
necessarily, baz == bar or baz == foo).
This lets us specify very sophisticated caching policies on
the server side that all clients going forward can handle
simply by directing their messages as specified by the read_tier
and write_tier flags and by the (not-yet-implemented) redirect
replies from OSDs.
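
The relationships listed above could be modelled roughly as follows. This is
only a sketch: TierFields, its CacheMode values, and the -1 "unset" convention
are assumptions for illustration, not the actual pool encoding.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical, simplified view of the per-pool tiering fields.
struct TierFields {
  // Placeholder modes; the real set of cache modes is not given here.
  enum CacheMode { MODE_NONE, MODE_SOME_POLICY };

  std::set<uint64_t> tiers;         // pools that are tiers of this pool
  int64_t tier_of = -1;             // pool this one is a tier for, -1 if none
  int64_t read_tier = -1;           // where client reads should go, -1 if unset
  int64_t write_tier = -1;          // where client writes should go, -1 if unset
  CacheMode cache_mode = MODE_NONE; // how this tier relates to its base pool

  bool is_tier() const { return tier_of >= 0; }
  bool has_tiers() const { return !tiers.empty(); }
};
```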
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Farnum <greg@inktank.com>
The dout() prefix does get_osdmap(), which requires (and asserts) that we
hold the pg lock, but in some cases we do not, notably
ReplicatedPG::object_context_destructor_callback.
Signed-off-by: Sage Weil <sage@inktank.com>
make ceph_test_rados / RadosModel validate the versions exposed by librados
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
If the redhat-rpm-config package is installed, the debuginfo rpms will
be built by default. The build will fail when the package is installed
and the specfile also invokes the macro.
Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
Set the user version to the *current* object version, not the version
we would use if we were to modify it. We move the assignments inside
the reply (read or error) block to make it more obvious which paths
are possible.
Signed-off-by: Sage Weil <sage@inktank.com>
The C++ AioCompletion::get_version() method only returns 32 bits. Sigh.
Add a get_version64() method that returns all 64 bits. Do not touch the
32-bit version to avoid breaking the ABI.
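
To see why the narrower accessor is a problem, consider this sketch.
FakeCompletion is a hypothetical stand-in, not the librados class, and it
uses uint32_t for the legacy accessor so the truncation is well defined.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a completion that carries a 64-bit version.
struct FakeCompletion {
  uint64_t objver = 0;

  // Legacy accessor: signature left alone, so the high bits are dropped.
  uint32_t get_version() const { return static_cast<uint32_t>(objver); }

  // New accessor: returns the full 64-bit value.
  uint64_t get_version64() const { return objver; }
};
```

With objver set to 2^32 + 7, the legacy accessor yields 7 while the 64-bit
one yields the full value.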
Backport: dumpling, cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
There were a number of situations in which we had a proper error to
propagate to user space but always returned '1' (EXIT_FAILURE).
Signed-off-by: Joao Eduardo Luis <jecluis@gmail.com>
Except in very special cases, we should let PaxosService take its course
and trigger the proposals itself. In this case, we were proposing right
before returning to PaxosService, and we were returning false on top of it
(most likely to guarantee that PaxosService wouldn't try to propose).
This doesn't make much sense, so let's do it like all the other cool kids
are doing and let PaxosService decide what's best for us.
Signed-off-by: Joao Eduardo Luis <jecluis@gmail.com>
This way, we can avoid omap_rmkeyrange in the common append
and trim cases.
Fixes: #6040
Backport: Dumpling
Signed-off-by: Samuel Just <sam.just@inktank.com>
Currently we only detect new mon addrs and names during the probing phase.
For non-trivial clusters, this means we can get into a sticky spot when
we discover enough peers to form a quorum, but not all of them, and the
undiscovered ones are enough to break the mon ranks and prevent an
election.
One way to work around this is to continue addr and name discovery during
the election. We should also consider making the ranks less sensitive to
the undefined addrs; that is a separate change.
Fixes: #4924
Backport: dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
Tested-by: Bernhard Glomm <bernhard.glomm@ecologic.eu>
There really are STL implementations (like the one on my Ubuntu 12.04
machine) which have a list::size() which is linear in the size of the
list. That assert, therefore, is quite expensive!
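
If a size check on such a list were ever needed on a hot path, one option is
to track the count alongside the list. CountedList below is a hypothetical
sketch of that idea, not what this commit does.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <utility>

// Wraps std::list and keeps an explicit element count, so size() is O(1)
// even where std::list::size() walks the whole list.
template <typename T>
class CountedList {
  std::list<T> items;
  std::size_t count = 0;

public:
  void push_back(T v) {
    items.push_back(std::move(v));
    ++count;
  }
  void pop_front() {
    assert(!items.empty());  // empty() is constant time
    items.pop_front();
    --count;
  }
  std::size_t size() const { return count; }
  bool empty() const { return items.empty(); }
};
```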
Fixes: #6040
Backport: Dumpling
Signed-off-by: Samuel Just <sam.just@inktank.com>
Set this up with the existing at_version member, but only increase
it for user_modify ops. Use this when logging the PG's user_version. In
order to maintain compatibility with old clients on classic pools, we
force user_version to follow at_version whenever it's updated.
Now that we have and are maintaining this PG user version, use it
for the user version on ops that get ENOENT back, when short-circuiting
replies as part of reply_op_error()[1], or when replying to repops
in eval_repop; further use it for the cls_current_version() function. This
is a small semantic change for that function, as previously it would
generally return the same value as the user would get sent back via
MOSDOpReply -- but I don't think it was something you could count on.
We now define it as being the user version of the PG at the start of the
op, and as a bonus it is defined even for read ops (the at_version is
only filled in on write operations).
[1]: We tweak PGLog to make it easier to retrieve both user and PG versions.
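
The versioning rule described above can be sketched as follows. PGVersions
and log_op() are hypothetical names for illustration only; the real state
lives in the PG info and log entries.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified per-PG version counters.
struct PGVersions {
  uint64_t at_version = 0;    // advances on every logged op
  uint64_t user_version = 0;  // advances only on user_modify ops
};

// Record one logged op. When user_version is updated it is forced to
// follow at_version, mirroring the compatibility rule described above.
inline void log_op(PGVersions& v, bool user_modify) {
  ++v.at_version;
  if (user_modify)
    v.user_version = v.at_version;
}
```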
Signed-off-by: Greg Farnum <greg@inktank.com>
There's little point to updating versions individually when we can
do so en masse and avoid mistakes in duplication.
Signed-off-by: Greg Farnum <greg@inktank.com>
The system we've been building up works out very nicely for new clients,
but they could not have interoperated with old clients that were only
referring to our replay_version. In order to deal with this, we add
a bad_replay_version to MOSDOpReply which is encoded where we used
to encode replay_version. bad_replay_version will follow the same semantics
as reassert_version used to (except that it is filled in on reads), but
is not accessible to new clients, who can see only our properly-controlled
replay_version and user_version. This will let old and new clients
interoperate correctly when communicating about watches, etc.
Signed-off-by: Greg Farnum <greg@inktank.com>
We now require it when creating a pg_log_entry_t. The user_version
is the version which info.last_user_version should be set to
after the transaction is applied, which for everything except for
a user-modify op is going to be the version it was already at.
For now we fill in the user-modify op's new user_version with
ctx->at_version.version.
Signed-off-by: Greg Farnum <greg@inktank.com>
We add a corresponding user_version to pg_log_entry_t, and the logic
to assign from one to the other and to recover last_user_version from
a master's log. We aren't yet setting it to anything, though.
Signed-off-by: Greg Farnum <greg@inktank.com>