30072 Commits

Author SHA1 Message Date
Josh Durgin
e32874fc5a objecter: resend all writes after osdmap loses the full flag
Now that the osd does not respond if it gets a map with the full flag
set first, clients need to resend all writes.

Clients talking to old osds are still subject to the race condition,
so both sides must be upgraded to avoid it.

Refs: #6938
Backport: dumpling, emperor
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-12-06 14:33:35 -08:00
Josh Durgin
4111729dda osd: drop writes when full instead of returning an error
There's a race between the client and osd with a newly marked full
osdmap.  If the client gets the new map first, it blocks writes and
everything works as expected, with no errors from the osd.

If the osd gets the map first, however, it will respond to any writes
with -ENOSPC. Clients will pass this up the stack, and not retry these
writes later.  -ENOSPC isn't handled well by all clients. RBD, for
example, may pass it on to qemu or kernel rbd which will both
interpret it as EIO.  Filesystems on top of rbd will not behave well
when they receive EIOs like this, especially if the cluster oscillates
between full and not full, so some writes succeed.

To fix this, never return ENOSPC from the osd because of a map marked
full, and rely on the client to retry all writes when the map is no
longer marked full.

Old clients talking to osds with this fix will hang instead of
propagating an error, but only if they run into this race
condition. ceph-fuse and rbd with caching enabled are not affected,
since the ObjectCacher will retry writes that return errors.

Refs: #6938
Backport: dumpling, emperor
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-12-06 14:33:26 -08:00
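
A rough illustration of the client-side behavior the two commits above rely on: writes that have not been acknowledged are kept, and all of them are resent once a map without the full flag arrives. This is a toy model, not the Objecter code; the `Client` class and its method names are hypothetical.

```python
class Client:
    """Toy client that tracks unacknowledged writes across map changes."""

    def __init__(self):
        self.map_full = False
        self.pending = []  # writes issued but not yet acknowledged
        self.sent = []     # log of writes handed to the (imaginary) OSD

    def write(self, op):
        self.pending.append(op)
        # While our own map is marked full we hold the write back.
        if not self.map_full:
            self.sent.append(op)

    def ack(self, op):
        self.pending.remove(op)

    def handle_map(self, full):
        was_full = self.map_full
        self.map_full = full
        if was_full and not full:
            # The OSD may have silently dropped any unacknowledged write
            # while it considered the cluster full, so resend all of them.
            self.sent.extend(self.pending)


client = Client()
client.write("a")
client.ack("a")
client.write("b")         # sent, but the OSD saw the full map first and drops it
client.handle_map(True)   # the client now learns the map is full
client.write("c")         # held back locally
client.handle_map(False)  # full flag cleared: "b" and "c" are (re)sent
```

Resending everything that is still pending, rather than only the writes that were held back locally, is what covers the case where the OSD saw the full map before the client did.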
Sage Weil
1d5427a790 Merge pull request #907 from ceph/wip-3x
osd: default to 3x replication
2013-12-06 14:25:38 -08:00
Sage Weil
384f01dfd3 crush/mapper: dump indep partial progression for debugging
...if DEBUG_INDEP is #defined.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 14:24:03 -08:00
Sage Weil
e632a79b3c PendingReleaseNotes: note change of CRUSH indep mode in release notes
Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 14:24:03 -08:00
Sage Weil
c853019475 crush: add feature CRUSH_V2 for new indep mode and SET_*_TRIES rule steps
Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 14:24:03 -08:00
Sage Weil
caa0e22e15 crush: CHOOSE_LEAF -> CHOOSELEAF throughout
This aligns the internal identifier names with the user-visible names in
the decompiled crush map language.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 14:24:03 -08:00
Sage Weil
431a13eb37 osd/OSDMap: fix feature calculation for CACHEPOOL
We need to include the feature in the mask.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 14:24:02 -08:00
Sage Weil
03911b07e0 crush/CrushCompiler: [de]compile set_choose[leaf]_tries rule step
Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 14:24:02 -08:00
Sage Weil
09ce7a2bd3 crush/CrushWrapper: set chooseleaf_tries to 5 for 'simple' indep rules
When making a generic indep rule, set the recursive retry to 5.  This gives
better overall results.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 14:24:02 -08:00
Sage Weil
d1b97462cf crush/mapper: add SET_CHOOSE_TRIES rule step
Since we can specify the recursive retries in a rule, we may as well also
specify the non-recursive tries too for completeness.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 14:24:02 -08:00
Sage Weil
64aeded50d crush/mapper: apply chooseleaf_tries to firstn mode too
Parameterize the attempts for the _firstn choose method, and apply the
rule-specified tries count to firstn mode as well.  Note that we have
slightly different behavior here than with indep:

 If the tries value is not specified for firstn, we pass through the
 normal attempt count.  This maintains compatibility with legacy behavior.
 Note that this is usually *not* actually N^2 work, though, because of the
 descend_once tunable.  However, descend_once is unfortunately *not* the
 same thing as 1 chooseleaf try because it is only checked on a reject but
 not on a collision.  Sigh.

 In contrast, for indep, if tries is not specified we default to 1
 recursive attempt, because that is simply more sane, and we have the
 option to do so.  The descend_once tunable has no effect for indep.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 14:24:02 -08:00
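
The defaulting rule described in this message can be summarized in a few lines. This is only a sketch of the stated policy, not the mapper code: the function name, the use of `None` for "not specified", and the example attempt count of 19 are all assumptions made for illustration.

```python
def effective_leaf_tries(mode, rule_tries, normal_tries):
    """Return the recursive (chooseleaf) attempt count for a rule.

    rule_tries is the value set by the rule, or None if the rule
    does not specify one (a hypothetical encoding).
    """
    if rule_tries is not None:
        return rule_tries
    if mode == "firstn":
        # Unspecified: pass through the normal attempt count (legacy behavior).
        return normal_tries
    if mode == "indep":
        # Unspecified: a single recursive attempt.
        return 1
    raise ValueError("unknown mode: %r" % (mode,))


firstn_default = effective_leaf_tries("firstn", None, 19)
indep_default = effective_leaf_tries("indep", None, 19)
indep_explicit = effective_leaf_tries("indep", 5, 19)
```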
Sage Weil
cb88763ccb crush/mapper: fix up the indep tests
Fix indentation.
Simplify+fix the changed vs moved calculation.
Use the new SET_CHOOSE_LEAF_TRIES command.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 14:24:02 -08:00
Sage Weil
580cf5f68c Merge pull request #886 from ceph/wip-6922
Fix some pg_num change return codes and make them more resistant to mis-use

Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-06 14:15:56 -08:00
Sage Weil
63755c42f9 Merge pull request #909 from dachary/wip-crush-unittest
more CrushWrapper unittest
2013-12-06 12:35:52 -08:00
Loic Dachary
4e26cc0dac crush: unittest CrushWrapper::get_immediate_parent
Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-06 20:40:48 +01:00
Loic Dachary
09938e6455 crush: unittest CrushWrapper::update_item
Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-06 20:40:48 +01:00
Loic Dachary
16ac59042e crush: unittest s/std::string/string/
Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-06 20:40:48 +01:00
Loic Dachary
b8190180c3 crush: unittest use const instead of define
And reduce the depth of the hierarchy because three levels of buckets
capture the same cases as four levels.

Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-06 20:40:48 +01:00
Loic Dachary
dc095214d3 crush: unittest CrushWrapper::check_item_loc
Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-06 20:40:48 +01:00
Loic Dachary
000c59a9a2 crush: unittest remove useless c->create()
Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-06 20:40:48 +01:00
Yehuda Sadeh
516788d15b Merge remote-tracking branch 'origin/next'
2013-12-06 11:24:06 -08:00
Sage Weil
cb26fbde52 osd: default to 3x replication
3x is the recommendation; it should be the default too.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 10:35:45 -08:00
Sage Weil
f4c16236b7 Merge pull request #901 from dachary/wip-crush-unittest
crush: check for invalid names in loc[]

Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-06 08:29:01 -08:00
Loic Dachary
aedbc99ffc crush: check for invalid names in loc[]
Add the is_valid_crush_loc helper to test for invalid crush names in
insert_item and update_item, before performing any side
effect. Implement the associated unit tests.

Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-06 09:43:47 +01:00
Sage Weil
fe03ad2801 osd: queue pg deletion after on_removal txn
The removal is normally so slow that these don't really race, but they
could.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-05 23:13:28 -08:00
Sage Weil
aa63d6730a os/MemStore: implement reference 'memstore' backend
This is (as near to) a trivial ObjectStore backend for the OSD as we can
get at the moment.  Everything is stored in memory.  We are slightly
tricky with the locking, but not overly so.

On umount we dump everything out to disk, and on mount we load it all in
again, so we have some very coarse persistence/durability... just enough
to make this usable in a non-failure environment.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-05 23:13:28 -08:00
João Eduardo Luís
80fb336b4c Merge pull request #900 from ceph/wip-mon-mds-trim
mon: MDSMonitor: trim versions and let PaxosService decide whether to propose

We were not trimming mdsmap versions and were generating a new map every time
we modified the pending value.

Now we not only make sure that MDSMonitor will trim old maps (configurable
option allowing us to set the maximum number of maps to keep, defaulting to 500,
much like other services do) but we also delegate to PaxosService the decision on
whether to propose our pending value.

We also modify 'ceph-kvstore-tool' so that the contents of a given prefix:key can be
written to a file instead of stdout, and add support for getting the size of a given
prefix:key's value.

'ceph report' was also modified so that we always output the first and last
committed versions for all services; up until this point, we would only output the
first committed version on all services, and only a few were also outputting the
last committed version.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-05 18:15:21 -08:00
Joao Eduardo Luis
47ee79704f mon: ceph-kvstore-tool: get size of value for prefix/key
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-12-06 01:06:17 +00:00
Joao Eduardo Luis
c98c1043e3 tools: ceph-kvstore-tool: output value contents to file on 'get'
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-12-06 01:06:17 +00:00
Joao Eduardo Luis
00048fe33f mon: Have 'ceph report' print last committed versions
Only for those services that weren't doing it.

Backport: dumpling
Backport: emperor

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-12-06 01:06:16 +00:00
Joao Eduardo Luis
cc64382822 mon: MDSMonitor: let PaxosService decide on whether to propose
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-12-06 01:06:11 +00:00
Sage Weil
5823146077 os/ObjectStore: make getattrs() pure virtual
It is required.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-05 15:33:20 -08:00
tamil
11e26ee424 s/true/1 and s/false/0
Signed-off-by: tamil <tamil.muthamizhan@inktank.com>
2013-12-05 13:05:12 -08:00
Joao Eduardo Luis
cf099415ad mon: MDSMonitor: implement 'get_trim_to()' to let the mon trim mdsmaps
This commit also adds two options to the MDSMonitor:

  - mon_max_mdsmap_epochs: the maximum amount of maps we'll keep (def: 500)
  - mon_mds_force_trim: the version we want to trim to

This results in 'get_trim_to()' returning one of the following values:

  - if we have set mon_mds_force_trim, and this value is greater than the
    last committed version, trim to mon_mds_force_trim
  - if we hold more than the max number of maps, trim to last - max
  - if we have set mon_mds_force_trim and if we hold more than the max
    number of maps, and mon_mds_force_trim is lower than last - max,
    then trim to last - max

Backport: dumpling
Backport: emperor

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-12-05 17:47:37 +00:00
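
One way to read the trimming rules above, as a hedged sketch rather than the MDSMonitor code. The parameter names are hypothetical, and returning 0 to mean "nothing to trim" is an assumption. The first rule in the message says the forced value must be "greater than the last committed version"; this sketch instead assumes the forced value is honored only when it lies below the last committed version, since versions past that point do not exist yet. The source does not confirm that reading.

```python
def get_trim_to(first, last, max_epochs, force_trim=0):
    """Return the version to trim to, or 0 if nothing should be trimmed.

    first/last: first and last committed versions.
    max_epochs: stands in for mon_max_mdsmap_epochs.
    force_trim: stands in for mon_mds_force_trim (0 means unset).
    """
    floor = 0
    # Assumption: only honor a forced value that is below the last commit.
    if 0 < force_trim < last:
        floor = force_trim
    # Holding more than max_epochs maps: trim to last - max_epochs,
    # unless the forced value already asks for more than that.
    if last - first > max_epochs and floor < last - max_epochs:
        return last - max_epochs
    return floor


over_limit = get_trim_to(first=1, last=1000, max_epochs=500)
forced_high = get_trim_to(first=1, last=1000, max_epochs=500, force_trim=800)
forced_low = get_trim_to(first=1, last=1000, max_epochs=500, force_trim=100)
under_limit = get_trim_to(first=1, last=100, max_epochs=500)
```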
Joao Eduardo Luis
3e845b56a3 mon: MDSMonitor: print map on encode_pending() iff debug mon = 30+
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-12-05 17:41:37 +00:00
Joao Eduardo Luis
62fb47509b mon: MDSMonitor: consider 'debug level' parameter on 'print_map()'
The parameter was there, just not used.  It does default to 7, so
existing callers are okay.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-12-05 17:41:37 +00:00
Joao Eduardo Luis
032a00bb35 mon: MDSMonitor: remove reference to no-longer-used encode_trim()
We weren't using it and it's no longer used by anyone anyway.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-12-05 17:41:37 +00:00
Sage Weil
39ddd213da Merge pull request #899 from dachary/wip-crush-unittest
CrushWrapper::insert_item unittest and minor fixes

Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-05 09:18:50 -08:00
Loic Dachary
ccc6014512 crush: CrushWrapper unit tests
Covers all cases for the following methods. All but insert_item are trivial.

* insert_item
* set_item_name
* name_exists
* item_exists
* get_item_id
* get_item_name
* get_num_type_names
* get_type_id
* get_type_name
* is_valid_crush_name

Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-05 18:07:03 +01:00
Loic Dachary
b9bff8e8cb crush: remove redundant test in insert_item
A year after the last modification of the test that checks whether an item
was added twice to the same bucket, the subtree_contains test was added a
few lines above it, making it redundant.

Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-05 18:07:03 +01:00
Loic Dachary
8af75968ac crush: insert_item returns on error if bucket name is invalid
A bucket name may be created as a side effect of insert_item. All names
in the loc argument are checked for validity at the beginning of the
method and an error is returned immediately if an invalid one is found. This
makes it unnecessary to check for errors when setting the name of an item later on.

Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-05 18:06:55 +01:00
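
The validate-before-mutate pattern described here can be sketched as follows. This is a toy model, not CrushWrapper: buckets are a plain dict and the hierarchy is not modeled. The accepted character set (letters, digits, `_`, `.`, `-`) and the `-EINVAL` return value are assumptions for illustration, and the Python function names only mirror the helpers named in the commit messages.

```python
import errno
import re

# Assumed set of characters allowed in a name.
_VALID_NAME = re.compile(r"[A-Za-z0-9_.-]+")


def is_valid_crush_name(name):
    return _VALID_NAME.fullmatch(name) is not None


def is_valid_crush_loc(loc):
    """True if every bucket name in loc (type -> name) is valid."""
    return all(is_valid_crush_name(name) for name in loc.values())


def insert_item(buckets, item, loc):
    # Check every name up front, before any bucket is created,
    # so a failure leaves `buckets` untouched.
    if not loc or not is_valid_crush_loc(loc):
        return -errno.EINVAL
    for bucket_name in loc.values():
        buckets.setdefault(bucket_name, [])  # side effect: create bucket
    first_bucket = next(iter(loc.values()))
    buckets[first_bucket].append(item)
    return 0


buckets = {}
bad = insert_item(buckets, "osd.0", {"host": "node 1", "rack": "r1"})
after_bad = dict(buckets)
good = insert_item(buckets, "osd.0", {"host": "node1", "rack": "r1"})
```

Because the check runs before the loop, the failed call does not leave a half-created `r1` bucket behind.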
Sage Weil
3b8371a4bf os/ObjectStore: prevent copying
Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-04 14:46:49 -08:00
Sage Weil
a70200e329 os/ObjectStore: pass cct to ctor
Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-04 14:46:40 -08:00
Loic Dachary
0dd7d2985a Merge pull request #892 from jpds/ceph-disk-journal-mbrtogpt
Call --mbrtogpt on journal run of sgdisk should the drive require a GPT ...

Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Loic Dachary <loic@dachary.org>
2013-12-04 11:42:30 -08:00
Sage Weil
ea600d0e0b Merge pull request #782 from danchai/master
ObjBencher: add rand_read_bench to support rand test in rados-bench
2013-12-04 07:42:48 -08:00
Jonathan Davies
35011e0b01 Call --mbrtogpt on journal run of sgdisk should the drive require a GPT table.
Signed-off-by: Jonathan Davies <jonathan.davies@canonical.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Loic Dachary <loic@dachary.org>
2013-12-04 13:57:13 +00:00
danchai
cae10830c7 ObjBencher: add rand_read_bench functions to support rand test in rados-bench
Signed-off-by: Tengwei Cai <tengweicai@gmail.com>
2013-12-04 15:41:42 +08:00
Sage Weil
e829859291 doc/rados/operations/crush: fix more
Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-03 22:46:37 -08:00
Sage Weil
7709a10f52 doc/rados/operations/crush: fix rst
Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-03 22:18:50 -08:00