Commit Graph

30244 Commits

Author SHA1 Message Date
Greg Farnum
9776e97af2 osd/PG: factor out get_next_version()
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Farnum <greg@inktank.com>
2013-12-06 14:37:26 -08:00
Greg Farnum
0b0d1e8e42 librados: add wait_for_latest_osdmap()
There are times when users may need to make sure the client has the
latest osdmap, for example after sending a mon command modifying
pool properties.

Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Farnum <greg@inktank.com>

squash "librados: add wait_for_latest_osdmap()"
2013-12-06 14:37:26 -08:00
Sage Weil
828590688f librados: expose methods for calculating object hash position
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-12-06 14:37:26 -08:00
Sage Weil
4b5ab3f106 osdc/Objecter: expose methods for getting object hash position and pg
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-12-06 14:37:26 -08:00
Sage Weil
92879f7787 osd: capture hashing of objects to hash positions/pgs in pg_pool_t
The hashing is dependent on pool properties; capture (more of) it in a
method instead of having it in OSDMap.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-12-06 14:37:25 -08:00
Sage Weil
76e0b88f56 osd/OSDMap: use new object_locator_t::hash to place object in a pg
The hash value, if provided, becomes the ps (placement seed) portion of the
pg_t, skipping any hashing of the object name (or locator key).

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-12-06 14:37:25 -08:00
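
The placement rule described in this commit and in d692da34ab can be sketched roughly as follows. This is only an illustration, not the Ceph code: `Locator` and `placement_seed` are hypothetical names, `std::hash` stands in for the pool's real hash function, and treating a negative `hash` as "not provided" is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>

// Hypothetical stand-in for object_locator_t: only the fields needed here.
struct Locator {
  std::string key;    // optional locator key; empty means "use the object name"
  int64_t hash = -1;  // explicit hash position; negative means "not provided" (assumption)
};

// Returns the placement seed (ps) for an object.
// std::hash is a placeholder for the pool's real hash function.
uint32_t placement_seed(const std::string& object_name, const Locator& loc) {
  if (loc.hash >= 0) {
    // Explicit hash supplied: use it directly and skip name/key hashing.
    assert(loc.hash <= static_cast<int64_t>(UINT32_MAX));
    return static_cast<uint32_t>(loc.hash);
  }
  const std::string& input = loc.key.empty() ? object_name : loc.key;
  return static_cast<uint32_t>(std::hash<std::string>{}(input));
}
```

With an explicit hash set, two differently named objects map to the same seed; without one, the locator key (if any) takes precedence over the object name.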
Greg Farnum
d692da34ab osd/osd_types: add explicit hash to object_locator_t
Instead of hashing the object name or key, we allow the hash position to be
provided explicitly.

Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Farnum <greg@inktank.com>
2013-12-06 14:37:25 -08:00
Greg Farnum
0d4ea9f746 encoding: allow users to specify a different compatv after encoding
This way we can set the compatv preferentially depending on whether
we've actually encoded new information or not.

Signed-off-by: Greg Farnum <greg@inktank.com>
2013-12-06 14:37:25 -08:00
Sage Weil
d2963c0a3d librados: add mon_command to C++ API
This way librados users can execute monitor commands.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-12-06 14:37:25 -08:00
Sage Weil
468fffa529 librados: document aio_flush()
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-12-06 14:37:25 -08:00
Sage Weil
bc7ace2eef librados: constify inbl command args
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-12-06 14:37:25 -08:00
Sage Weil
a29d4fc3fd osdc/Objecter: constify inbl command args
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-12-06 14:37:24 -08:00
Sage Weil
fb49065fe7 mon/MonClient: constify inbl command args
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2013-12-06 14:37:24 -08:00
Sage Weil
ef0f255a4a osdc/Objecter: reimplement list_objects
Return to caller at the end of each PG.  This allows the caller to look at
the [pg_]hash_position and get something meaningful.

If there are no objects in the PG, we skip it so that every callback has
*some* data (unless the pool is totally empty!).  So the real difference
here is that we don't move on to the next PG just to reach max_entries.

This gives the client some data sooner, but may mean more callbacks into
client code.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 14:36:52 -08:00
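
The listing behaviour described above (one batch per PG, empty PGs skipped, no spilling into the next PG just to reach max_entries) might look like the sketch below. `Pool`, `ListCursor`, and `list_next` are hypothetical names; the real Objecter code is asynchronous and talks to OSDs, which this model leaves out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model: each inner vector holds the object names of one PG.
using Pool = std::vector<std::vector<std::string>>;

struct ListCursor {
  std::size_t pg = 0;      // PG currently being listed
  std::size_t offset = 0;  // next object within that PG
};

// Returns the next batch. A batch never spans two PGs, so cur.pg still
// names the PG the batch came from when the call returns.
std::vector<std::string> list_next(const Pool& pool, ListCursor& cur,
                                   std::size_t max_entries) {
  assert(max_entries > 0);
  // Skip PGs that are empty or fully consumed, so each batch has data.
  while (cur.pg < pool.size() && cur.offset >= pool[cur.pg].size()) {
    ++cur.pg;
    cur.offset = 0;
  }
  std::vector<std::string> batch;
  if (cur.pg == pool.size()) return batch;  // end of pool: empty batch
  const std::vector<std::string>& objs = pool[cur.pg];
  while (cur.offset < objs.size() && batch.size() < max_entries) {
    batch.push_back(objs[cur.offset++]);
  }
  return batch;
}
```

A caller loops until an empty batch comes back; a batch may be shorter than `max_entries` when a PG boundary is reached, which is the trade-off the commit message mentions.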
Sage Weil
d2e6cc635f librados: add get_pg_hash_position to determine pg while listing objects
Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 14:36:49 -08:00
Sage Weil
eff932c60a osdc/Objecter: stick bl inside ListContext
This is simpler and less error-prone.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 14:36:45 -08:00
Sage Weil
8e5803abf7 osdc/Objecter: factor pg_read out of list_objects code
This will get used later for other ops against PGs (instead of objects).

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 14:36:41 -08:00
Sage Weil
dd8c939841 osdc/Objecter: separate explicit pg target from current target
The pgid field is used to store the pg the op mapped to.  We were just
setting it directly for PGLS.  Instead, fill in a new base_pgid, and copy that
to pgid in recalc_op_target(), the same way we do when we map an object
name to a PG.

In particular, we take this opportunity to map a raw pgid to an actual
pgid.  This means the base_pg could come from a raw hash value (although
it doesn't, yet).

Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Farnum <greg@inktank.com>
2013-12-06 14:36:37 -08:00
Sage Weil
9381b69378 osdc/Objecter: drop redundant condition
We are inside an if (response_size) block.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 14:36:34 -08:00
Sage Weil
bffcca6a0a osd/osd_types: make pref optional in pg_t constructor
We don't use preferred placements any more, so this will
make it easier to start dropping references to it in new code.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 14:36:31 -08:00
Josh Durgin
3caf3effcb rbd: check write return code during bench-write
This allows rbd-bench to detect http://tracker.ceph.com/issues/6938
when combined with rapidly changing the mon osd full ratio.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-12-06 14:33:41 -08:00
Josh Durgin
e32874fc5a objecter: resend all writes after osdmap loses the full flag
Now that the osd does not respond if it gets a map with the full flag
set first, clients need to resend all writes.

Clients talking to old osds are still subject to the race condition,
so both sides must be upgraded to avoid it.

Refs: #6938
Backport: dumpling, emperor
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-12-06 14:33:35 -08:00
Josh Durgin
4111729dda osd: drop writes when full instead of returning an error
There's a race between the client and osd with a newly marked full
osdmap.  If the client gets the new map first, it blocks writes and
everything works as expected, with no errors from the osd.

If the osd gets the map first, however, it will respond to any writes
with -ENOSPC. Clients will pass this up the stack, and not retry these
writes later.  -ENOSPC isn't handled well by all clients. RBD, for
example, may pass it on to qemu or kernel rbd which will both
interpret it as EIO.  Filesystems on top of rbd will not behave well
when they receive EIOs like this, especially if the cluster oscillates
between full and not full, so some writes succeed.

To fix this, never return ENOSPC from the osd because of a map marked
full, and rely on the client to retry all writes when the map is no
longer marked full.

Old clients talking to osds with this fix will hang instead of
propagating an error, but only if they run into this race
condition. ceph-fuse and rbd with caching enabled are not affected,
since the ObjectCacher will retry writes that return errors.

Refs: #6938
Backport: dumpling, emperor
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-12-06 14:33:26 -08:00
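
The interaction between this commit and e32874fc5a can be modelled with a toy client and OSD. Everything here is hypothetical (`Osd`, `Client`, `handle_write`, `on_map`); the sketch only shows the idea that the OSD stays silent while its map is full and the client resends once it sees a map without the full flag.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of an OSD that drops writes while its map is marked full.
struct Osd {
  bool map_full = false;
  std::vector<std::string> applied;

  // Returns true if the write was applied and acknowledged.
  // Returns false to model "no reply"; it never reports ENOSPC.
  bool handle_write(const std::string& data) {
    if (map_full) return false;
    applied.push_back(data);
    return true;
  }
};

// Hypothetical client that keeps unacknowledged writes and resends them
// once it sees a map without the full flag.
struct Client {
  std::vector<std::string> unacked;

  void write(Osd& osd, const std::string& data) {
    if (!osd.handle_write(data)) unacked.push_back(data);
  }

  // Called when the client receives a new map.
  void on_map(Osd& osd, bool full) {
    if (full) return;  // still full: keep waiting
    std::vector<std::string> retry;
    retry.swap(unacked);
    for (const std::string& data : retry) write(osd, data);
  }
};
```

A client without the resend step would simply wait forever in this model, which mirrors the hang the commit message describes for old clients.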
Sage Weil
1d5427a790 Merge pull request #907 from ceph/wip-3x
osd: default to 3x replication
2013-12-06 14:25:38 -08:00
Sage Weil
384f01dfd3 crush/mapper: dump indep partial progression for debugging
...if DEBUG_INDEP is #defined.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 14:24:03 -08:00
Sage Weil
e632a79b3c PendingReleaseNotes: note change of CRUSH indep mode in release notes
Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 14:24:03 -08:00
Sage Weil
c853019475 crush: add feature CRUSH_V2 for new indep mode and SET_*_TRIES rule steps
Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 14:24:03 -08:00
Sage Weil
caa0e22e15 crush: CHOOSE_LEAF -> CHOOSELEAF throughout
This aligns the internal identifier names with the user-visible names in
the decompiled crush map language.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 14:24:03 -08:00
Sage Weil
431a13eb37 osd/OSDMap: fix feature calculation for CACHEPOOL
We need to include the feature in the mask.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 14:24:02 -08:00
Sage Weil
03911b07e0 crush/CrushCompiler: [de]compile set_choose[leaf]_tries rule step
Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 14:24:02 -08:00
Sage Weil
09ce7a2bd3 crush/CrushWrapper: set chooseleaf_tries to 5 for 'simple' indep rules
When making a generic indep rule, set the recursive retry to 5.  This gives
better overall results.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 14:24:02 -08:00
Sage Weil
d1b97462cf crush/mapper: add SET_CHOOSE_TRIES rule step
Since we can specify the recursive retries in a rule, we may as well also
specify the non-recursive tries too for completeness.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 14:24:02 -08:00
Sage Weil
64aeded50d crush/mapper: apply chooseleaf_tries to firstn mode too
Parameterize the attempts for the _firstn choose method, and apply the
rule-specified tries count to firstn mode as well.  Note that we have
slightly different behavior here than with indep:

 If the tries value is not specified for firstn, we pass through the
 normal attempt count.  This maintains compatibility with legacy behavior.
 Note that this is usually *not* actually N^2 work, though, because of the
 descend_once tunable.  However, descend_once is unfortunately *not* the
 same thing as 1 chooseleaf try because it is only checked on a reject but
 not on a collision.  Sigh.

 In contrast, for indep, if tries is not specified we default to 1
 recursive attempt, because that is simply more sane, and we have the
 option to do so.  The descend_once tunable has no effect for indep.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 14:24:02 -08:00
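
The defaulting rule in this message can be written down as a small function. This is a sketch under assumptions: `ChooseMode` and `recursive_tries` are hypothetical names, and a `rule_tries` of 0 is used here to mean "the rule did not specify a value".

```cpp
#include <cassert>

enum class ChooseMode { FirstN, Indep };

// Returns the number of recursive (chooseleaf) attempts to use.
// rule_tries == 0 means the rule did not set a value (assumption of this sketch).
unsigned recursive_tries(ChooseMode mode, unsigned rule_tries,
                         unsigned normal_tries) {
  assert(normal_tries > 0);
  if (rule_tries > 0) return rule_tries;  // a rule-specified count applies to both modes
  // Unspecified: firstn keeps the legacy attempt count, indep defaults to 1.
  return mode == ChooseMode::FirstN ? normal_tries : 1u;
}
```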
Sage Weil
cb88763ccb crush/mapper: fix up the indep tests
Fix indentation.
Simplify+fix the changed vs moved calculation.
Use the new SET_CHOOSE_LEAF_TRIES command.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 14:24:02 -08:00
Sage Weil
580cf5f68c Merge pull request #886 from ceph/wip-6922
Fix some pg_num change return codes and make them more resistant to mis-use

Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-06 14:15:56 -08:00
Sage Weil
63755c42f9 Merge pull request #909 from dachary/wip-crush-unittest
more CrushWrapper unittest
2013-12-06 12:35:52 -08:00
Loic Dachary
4e26cc0dac crush: unittest CrushWrapper::get_immediate_parent
Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-06 20:40:48 +01:00
Loic Dachary
09938e6455 crush: unittest CrushWrapper::update_item
Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-06 20:40:48 +01:00
Loic Dachary
16ac59042e crush: unittest s/std::string/string/
Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-06 20:40:48 +01:00
Loic Dachary
b8190180c3 crush: unittest use const instead of define
And reduce the depth of the hierarchy because three levels of buckets
capture the same cases as four levels.

Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-06 20:40:48 +01:00
Loic Dachary
dc095214d3 crush: unittest CrushWrapper::check_item_loc
Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-06 20:40:48 +01:00
Loic Dachary
000c59a9a2 crush: unittest remove useless c->create()
Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-06 20:40:48 +01:00
Yehuda Sadeh
516788d15b Merge remote-tracking branch 'origin/next'
2013-12-06 11:24:06 -08:00
Sage Weil
cb26fbde52 osd: default to 3x replication
3x is the recommendation; it should be the default too.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 10:35:45 -08:00
Sage Weil
f4c16236b7 Merge pull request #901 from dachary/wip-crush-unittest
crush: check for invalid names in loc[]

Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-06 08:29:01 -08:00
Loic Dachary
aedbc99ffc crush: check for invalid names in loc[]
Add the is_valid_crush_loc helper to test for invalid crush names in
insert_item and update_item, before performing any side
effect. Implement the associated unit tests.

Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-06 09:43:47 +01:00
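
The "validate before any side effect" pattern this commit describes could be sketched as below. The naming rule (non-empty; letters, digits, '-', '_' or '.') is an assumption of the sketch, and `is_valid_name`, `is_valid_loc`, and this `insert_item` are simplified stand-ins for the CrushWrapper methods, operating on a plain map instead of a CRUSH tree.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>

// Assumed naming rule for this sketch: non-empty, and only letters, digits,
// '-', '_' or '.'. The real rule lives in CrushWrapper and may differ.
bool is_valid_name(const std::string& s) {
  if (s.empty()) return false;
  for (unsigned char c : s) {
    if (!std::isalnum(c) && c != '-' && c != '_' && c != '.') return false;
  }
  return true;
}

// loc maps a bucket type (for example "host") to a bucket name.
bool is_valid_loc(const std::map<std::string, std::string>& loc) {
  for (const auto& entry : loc) {
    if (!is_valid_name(entry.second)) return false;
  }
  return true;
}

// Hypothetical insert: validates the whole location first, so a bad name
// cannot leave the tree partially modified.
bool insert_item(std::map<std::string, std::string>& tree,
                 const std::map<std::string, std::string>& loc) {
  if (!is_valid_loc(loc)) return false;  // reject before any side effect
  for (const auto& entry : loc) tree[entry.first] = entry.second;
  return true;
}
```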
Sage Weil
fe03ad2801 osd: queue pg deletion after on_removal txn
The removal is normally so slow that these don't really race, but they
could.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-05 23:13:28 -08:00
Sage Weil
aa63d6730a os/MemStore: implement reference 'memstore' backend
This is (as near to) a trivial ObjectStore backend for the OSD as we can
get at the moment.  Everything is stored in memory.  We are slightly
tricky with the locking, but not overly so.

On umount we dump everything out to disk, and on mount we load it all in
again, so we have some very coarse persistence/durability... just enough
to make this usable in a non-failure environment.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-05 23:13:28 -08:00
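
The coarse persistence described above (everything in memory, dumped on umount, reloaded on mount) can be illustrated with a toy store. `TinyMemStore` and its length-prefixed format are inventions of this sketch, not the MemStore on-disk format, and streams stand in for the dump file.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <map>
#include <ostream>
#include <sstream>  // lets the sketch be tried in memory with std::stringstream
#include <string>

class TinyMemStore {
 public:
  void write(const std::string& name, const std::string& data) {
    objects_[name] = data;
  }

  bool read(const std::string& name, std::string& out) const {
    auto it = objects_.find(name);
    if (it == objects_.end()) return false;
    out = it->second;
    return true;
  }

  // "umount": write every object out; nothing is durable before this point.
  void dump(std::ostream& out) const {
    for (const auto& kv : objects_) {
      put(out, kv.first);
      put(out, kv.second);
    }
  }

  // "mount": replace the in-memory contents with a previous dump.
  void load(std::istream& in) {
    objects_.clear();
    std::string name, data;
    while (get(in, name) && get(in, data)) objects_[name] = data;
  }

 private:
  // Strings are stored as a 64-bit length followed by the raw bytes.
  static void put(std::ostream& out, const std::string& s) {
    uint64_t n = s.size();
    out.write(reinterpret_cast<const char*>(&n), sizeof n);
    out.write(s.data(), static_cast<std::streamsize>(n));
  }

  static bool get(std::istream& in, std::string& s) {
    uint64_t n = 0;
    if (!in.read(reinterpret_cast<char*>(&n), sizeof n)) return false;
    s.resize(n);
    if (n == 0) return true;
    return static_cast<bool>(in.read(&s[0], static_cast<std::streamsize>(n)));
  }

  std::map<std::string, std::string> objects_;
};
```

Anything written after the last dump is lost if the process dies, which is why the commit message limits this to non-failure environments.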
João Eduardo Luís
80fb336b4c Merge pull request #900 from ceph/wip-mon-mds-trim
mon: MDSMonitor: trim versions and let PaxosService decide whether to propose

We were not trimming mdsmap versions and were generating a new map every time
we modified the pending value.

Now MDSMonitor trims old maps (a configurable option sets the maximum number
of maps to keep, defaulting to 500, much like other services do), and we also
delegate to PaxosService the decision on whether to propose our pending value.

We also make several modifications to 'ceph-kvstore-tool': the contents of a
given prefix:key can now be written to a file instead of stdout, and the tool
can report the size of a given prefix:key's value.

'ceph report' was also modified so that we always output the first and last
committed versions for all services; up until this point, we would only output the
first committed version on all services, and only a few were also outputting the
last committed version.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-05 18:15:21 -08:00
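
As a rough illustration of the trimming described in this merge, the sketch below keeps at most a fixed number of map versions. `trim_old_maps` is a hypothetical helper over a plain `std::map`; the real MDSMonitor/PaxosService trimming has more conditions than this.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Removes the oldest versions so that at most max_keep remain.
// Returns the number of versions trimmed.
std::size_t trim_old_maps(std::map<uint64_t, std::string>& maps,
                          std::size_t max_keep) {
  std::size_t trimmed = 0;
  while (maps.size() > max_keep) {
    maps.erase(maps.begin());  // std::map is ordered, so begin() is the oldest version
    ++trimmed;
  }
  return trimmed;
}
```

With 510 stored versions and a limit of 500, the ten oldest versions are dropped and version 11 becomes the first one kept.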
Joao Eduardo Luis
47ee79704f mon: ceph-kvstore-tool: get size of value for prefix/key
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-12-06 01:06:17 +00:00