Commit Graph

30070 Commits

Author SHA1 Message Date
Greg Farnum
f1ccdb418b Elector: share local command set when deferring
We're about to use this at a basic level, to identify when we have
"classic" monitors in-quorum, but could also do something more
sophisticated like a set intersection on the commands.

Signed-off-by: Greg Farnum <greg@inktank.com>
2013-12-09 11:26:04 -08:00
Greg Farnum
ba673be3e6 Monitor: import MonCommands.h from original Dumpling and expose it
If the Elector doesn't receive a set of commands from the elected leader, it
assumes the monitor is "classic" and uses the Dumpling command set as
the leader set.

Signed-off-by: Greg Farnum <greg@inktank.com>
2013-12-09 11:26:04 -08:00
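The fallback described in ba673be3e6 can be sketched roughly as follows. This is only an illustration under stated assumptions: `CommandSet`, `pick_leader_commands` and the use of plain strings for commands are hypothetical names, not the types used in the Monitor or Elector code.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in: a command set is just a list of command strings.
using CommandSet = std::vector<std::string>;

// Choose the set to treat as the leader's command set.
// `received` is null when the elected leader sent no commands, in which case
// the leader is assumed to be a "classic" monitor and the fixed
// Dumpling-era set is used instead.
inline CommandSet pick_leader_commands(const CommandSet* received,
                                       const CommandSet& dumpling_commands) {
  return received ? *received : dumpling_commands;
}
```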
Greg Farnum
3cb58f7406 Monitor: validate incoming commands against the leader's set too
Then check against our own, and forward the command if we don't
recognize it or if for some reason it doesn't match.

Signed-off-by: Greg Farnum <greg@inktank.com>
2013-12-09 11:26:04 -08:00
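A minimal sketch of the two-stage check from 3cb58f7406: validate against the leader's set first, then against the local set, forwarding when the local monitor does not recognize the command. The names `Disposition` and `classify` are hypothetical, commands are reduced to plain strings, and rejecting a command the leader does not know is an assumption made for the example, not something the commit message states.

```cpp
#include <algorithm>
#include <string>
#include <vector>

enum class Disposition { Reject, Forward, HandleLocally };

// True if `cmd` appears in `set` (commands are plain strings in this sketch).
inline bool contains(const std::vector<std::string>& set,
                     const std::string& cmd) {
  return std::find(set.begin(), set.end(), cmd) != set.end();
}

inline Disposition classify(const std::string& cmd,
                            const std::vector<std::string>& leader_set,
                            const std::vector<std::string>& local_set) {
  // Assumption: a command unknown to the leader is rejected outright.
  if (!contains(leader_set, cmd)) return Disposition::Reject;
  // The leader knows it but we do not: forward it.
  if (!contains(local_set, cmd)) return Disposition::Forward;
  return Disposition::HandleLocally;
}
```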
Greg Farnum
cb51b1ed1a Monitor: disseminate leader's command set instead of our own
Signed-off-by: Greg Farnum <greg@inktank.com>
2013-12-09 11:26:04 -08:00
Greg Farnum
d33df28c2b Elector: transmit local api on election win, accept leader's on loss
If we're the leader, just point to our local set. Disseminating these
will let peons advertise the full command set supported by the leader.
INCOMPLETE: does not yet handle winning Electors who do not send a command set.

Signed-off-by: Greg Farnum <greg@inktank.com>
2013-12-09 11:26:04 -08:00
Greg Farnum
8025fb33ad messages: make room for passing supported monitor commands in MMonElection
We're going to use this space to let the leader tell everybody what
commands it supports.

Signed-off-by: Greg Farnum <greg@inktank.com>
2013-12-09 11:26:03 -08:00
Greg Farnum
f932903646 Monitor: pull command mapping out of _allowed_command()
We want to be able to validate commands against both the leader and
local command sets, so make that functionality generic.

Signed-off-by: Greg Farnum <greg@inktank.com>
2013-12-09 11:26:03 -08:00
Sage Weil
7d000e3411 Merge pull request #918 from ceph/port/misc
Misc portability patches

Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-09 11:16:49 -08:00
Sage Weil
4c5f7ba8ba Merge pull request #922 from dachary/wip-crush-choose-tries
crush: fix map->choose_tries boundary test

Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-09 08:28:43 -08:00
Loic Dachary
41152a6317 crush: --show-utilization* implies --show-statistics
--show-utilization* outputs only if --show-statistics is set, which is
confusing. Instead of failing, set --show-statistics to avoid the
confusion.

Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-09 10:57:17 +01:00
Greg Farnum
dcb0a4f3bb Monitor: add a separate leader_supported_commands
This isn't used yet, but will be shortly.

Signed-off-by: Greg Farnum <greg@inktank.com>
2013-12-08 22:21:41 -08:00
Greg Farnum
4cd5c3bf3f Monitor: expose local monitor commands to other compilation units
Signed-off-by: Greg Farnum <greg@inktank.com>
2013-12-08 22:21:41 -08:00
Greg Farnum
dca5383f2e MonCommand: add operator== and operator!=
Signed-off-by: Greg Farnum <greg@inktank.com>
2013-12-08 22:21:41 -08:00
Greg Farnum
ac69a0122b MonCommand: support encode/decode
Signed-off-by: Greg Farnum <greg@inktank.com>
2013-12-08 22:21:41 -08:00
Greg Farnum
3dcbf460d1 encoding: fix [encode|decode]_array_nohead
We want to actually encode each element and keep it, rather than
writing each one at the position after the array end!

Signed-off-by: Greg Farnum <greg@inktank.com>
2013-12-08 22:21:41 -08:00
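The class of bug fixed in 3dcbf460d1 can be illustrated with a headerless array encoder. The sketch below is not the Ceph encoding code: `Buffer`, `encode_u32` and the fixed `uint32_t` element type are assumptions chosen to keep the example short. The point is only that the loop must index with the loop variable, so every element is encoded, rather than with the array length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

using Buffer = std::vector<unsigned char>;

// Append one value in host byte order (a real encoder would fix endianness).
inline void encode_u32(std::uint32_t v, Buffer& out) {
  unsigned char raw[sizeof v];
  std::memcpy(raw, &v, sizeof v);
  out.insert(out.end(), raw, raw + sizeof v);
}

// Encode n elements with no length header.
inline void encode_array_nohead(const std::uint32_t* a, std::size_t n,
                                Buffer& out) {
  for (std::size_t i = 0; i < n; ++i)
    encode_u32(a[i], out);  // a[i], not a[n]: a[n] is one past the end
}

// Decode n elements starting at `pos`, advancing `pos` past them.
inline void decode_array_nohead(std::uint32_t* a, std::size_t n,
                                const Buffer& in, std::size_t& pos) {
  for (std::size_t i = 0; i < n; ++i) {
    assert(pos + sizeof(std::uint32_t) <= in.size());
    std::memcpy(&a[i], in.data() + pos, sizeof(std::uint32_t));
    pos += sizeof(std::uint32_t);
  }
}
```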
Loic Dachary
7482d62f24 crush: add CrushTester accessors
Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-08 22:17:26 +01:00
Loic Dachary
c928f077f7 crush: output --show-bad-mappings on err
Write to stderr instead of stdout so that the output displays well
when used in conjunction with --show-statistics

Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-08 22:17:26 +01:00
Loic Dachary
5e0722fab5 crush: fix map->choose_tries boundary test
CrushWrapper::start_choose_profile allocates map->choose_tries with
choose_total_tries elements. When crush_choose_firstn sets a value, it
tests against map->choose_local_tries which could lead to memory
corruption if map->choose_total_tries is smaller than
map->choose_local_tries.

Another undesirable but non-fatal side effect is that the output of crushtool
--show-choose-tries will be truncated to choose_local_tries, which is
set to a lower value than choose_total_tries by the default tunables.

Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-08 17:00:54 +01:00
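The boundary problem in 5e0722fab5 comes down to checking an index against the wrong limit. A minimal sketch, assuming a hypothetical `ChooseProfile` type that stands in for the choose_tries histogram: the array is sized by the total-tries value, so the guard has to use that same size rather than some other tunable.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the choose_tries histogram:
// tries[k] counts placements that needed k tries.
struct ChooseProfile {
  std::vector<unsigned> tries;

  // Allocate one slot per possible try count, as sized by total_tries.
  explicit ChooseProfile(std::size_t total_tries) : tries(total_tries, 0) {}

  // Record a sample only if it fits in the allocated histogram.
  // The bound is the allocation size itself, never a separate limit.
  bool record(std::size_t ftotal) {
    if (ftotal >= tries.size()) return false;
    ++tries[ftotal];
    return true;
  }
};
```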
Sage Weil
94da2153d1 Merge pull request #869 from ceph/wip-crush
crush changes for erasure coding

Reviewed-by: Loic Dachary <loic@dachary.org>
Reviewed-by: Samuel Just <sam.just@inktank.com>
2013-12-07 20:59:22 -08:00
Noah Watkins
ef4061f0ad librbd: remove unused private variable
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
2013-12-07 18:07:03 -08:00
Noah Watkins
ad3825c608 TrackedOp: remove unused private variable
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
2013-12-07 18:07:03 -08:00
Noah Watkins
3b39a8a9f1 librbd: rename howmany to avoid conflict
A howmany macro exists on some platforms in standard headers, but there
really isn't any sort of standard that I've found. We just avoid the
conflict entirely this way.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
2013-12-07 18:07:03 -08:00
Sage Weil
096f9b3268 Merge pull request #917 from ceph/port/compat
compat: define replacement TEMP_FAILURE_RETRY

Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-07 14:01:14 -08:00
Sage Weil
96068bfad6 Merge pull request #919 from ceph/port/fdatasync
wbthrottle: use feature check for fdatasync

Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-07 14:00:40 -08:00
Noah Watkins
539fe26109 wbthrottle: use feature check for fdatasync
Checking for fdatasync uses the same approach as the qemu configure
script. The relevant commit is d1722a27f552a22561104210e0afad4577878e53.
Here is a copy of the commit message which explains the check:

Under Darwin, a symbol exists for the fdatasync() function, so that our
link test succeeds. However _POSIX_SYNCHRONIZED_IO is set to '-1'.

According to POSIX:2008, a value of -1 means the feature is not
supported.
A value of 0 means supported at compilation time, and a value greater than 0
means supported at both compilation and run time.

Enable fdatasync() only if _POSIX_SYNCHRONIZED_IO is '>0'.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
2013-12-07 10:37:00 -08:00
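The condition quoted in 539fe26109 can also be expressed directly in the preprocessor. The commit itself adds a configure-time feature check; the sketch below only shows the same `_POSIX_SYNCHRONIZED_IO > 0` test at compile time. The helper name `sync_data` and the fallback to `fsync()` are assumptions for the example.

```cpp
#include <unistd.h>

// Flush a file's data to stable storage.
// Use fdatasync() only when the platform advertises support with
// _POSIX_SYNCHRONIZED_IO > 0; otherwise fall back to fsync() (assumed
// fallback for this sketch).
inline int sync_data(int fd) {
#if defined(_POSIX_SYNCHRONIZED_IO) && _POSIX_SYNCHRONIZED_IO > 0
  return ::fdatasync(fd);
#else
  return ::fsync(fd);
#endif
}
```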
Noah Watkins
663da61c02 rados_sync: fix mismatched tag warning
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
2013-12-07 10:24:46 -08:00
Noah Watkins
60a25093a4 rados_sync: remove unused private variable
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
2013-12-07 10:24:46 -08:00
Noah Watkins
43c1676778 mon: check for sys/vfs.h existence
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
2013-12-07 10:24:20 -08:00
Noah Watkins
c99cf265fd make: increase maximum template recursion depth
With clang on OS X, spirit blows up without this.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
2013-12-07 10:22:54 -08:00
Noah Watkins
e2be099118 compat: define replacement TEMP_FAILURE_RETRY
Not all platforms have it.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
2013-12-07 10:18:51 -08:00
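For readers unfamiliar with TEMP_FAILURE_RETRY (e2be099118): the glibc macro re-evaluates an expression for as long as it returns -1 with errno set to EINTR. The commit defines a replacement macro; the sketch below shows the same retry loop as a function template instead, and `retry_on_eintr` is a hypothetical name.

```cpp
#include <cerrno>

// Call `call` repeatedly while it fails with EINTR; return its last result.
// `call` is expected to return an integer type and to set errno on failure,
// as POSIX system calls do.
template <typename Fn>
auto retry_on_eintr(Fn call) -> decltype(call()) {
  decltype(call()) result;
  do {
    result = call();
  } while (result == -1 && errno == EINTR);
  return result;
}
```

A typical use would wrap a system call in a lambda, for example `retry_on_eintr([&] { return ::read(fd, buf, len); })`.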
Sage Weil
a52ef1df49 Merge remote-tracking branch 'gh/wip-fix-3x'
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-12-06 16:56:10 -08:00
Sage Weil
0386095ea0 Merge remote-tracking branch 'gh/wip-fix-tunables'
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-12-06 16:55:54 -08:00
Sage Weil
3b3cbf52fb crush/CrushCompiler: make current set of tunables 'safe'
We can reenable this error the next time we add new tunables.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 16:24:16 -08:00
Sage Weil
8535ceda03 crushtool: remove scary tunables messages
Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 16:24:15 -08:00
Sage Weil
4eb8891d8d crush/CrushCompiler: start with legacy tunables when compiling
Ensure that a crush file is always compiled deterministically, even though
the default values for *new* maps have changed.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 16:24:15 -08:00
Sage Weil
e8fdef217f crush: add indep data set to cli tests
This will help us catch things if we break the mapping.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 16:22:59 -08:00
Sage Weil
564de6ea05 osdmaptool: fix cli tests for 3x
Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 16:22:26 -08:00
Sage Weil
6704be68d4 osd: default to 3x replication
3x is the recommendation; it should be the default too.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-06 16:21:37 -08:00
Sage Weil
308e4f9def Merge pull request #913 from dachary/wip-crush-unittest
CrushWrapper::move_bucket unittest and minor fixes

Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-06 16:10:00 -08:00
Josh Durgin
8d0180b1b7 objecter: don't take extra throttle budget for resent ops
These ops have already taken their budget in the original op_submit().
It will be returned via put_op_budget() when they complete.
If there were many localized reads of missing objects from replicas,
or cache pool redirects, this would cause the objecter to use up all
of its op throttle budget and hang.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-12-06 16:03:20 -08:00
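The accounting rule behind 8d0180b1b7 is that an op takes its throttle budget once, on first submission, and returns it once, on completion; a resend must not take it again. A simplified sketch, assuming hypothetical `Op`, `Throttle`, `submit` and `complete` names (the real objecter blocks when the throttle is exhausted, whereas this version just returns false):

```cpp
#include <cassert>

struct Op {
  int budget = 0;         // cost of this op against the throttle
  bool budgeted = false;  // true once the budget has been taken
};

class Throttle {
  int available_;
public:
  explicit Throttle(int max) : available_(max) {}
  int available() const { return available_; }
  bool take(int n) {
    if (n > available_) return false;
    available_ -= n;
    return true;
  }
  void put(int n) { available_ += n; }
};

// Submit or resend an op. Budget is taken only on the first submission.
inline bool submit(Throttle& t, Op& op) {
  if (!op.budgeted) {
    if (!t.take(op.budget)) return false;
    op.budgeted = true;
  }
  return true;  // resends fall through without touching the throttle
}

// Return the budget exactly once, when the op completes.
inline void complete(Throttle& t, Op& op) {
  assert(op.budgeted);
  t.put(op.budget);
  op.budgeted = false;
}
```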
Sage Weil
38647f7627 Revert "osd: default to 3x replication"
This reverts commit cb26fbde52.

Fix unit tests and do integration tests first; this may have unexpected
consequences.
2013-12-06 15:48:39 -08:00
Loic Dachary
cbeb1f4510 crush: detach_bucket must test item >= 0 not > 0
Since detach_bucket is a private helper solely used by move_bucket which
contains another (correct) safeguard, the code cannot be reached and
the problem can never happen. If another function uses detach_bucket,
it may happen.

Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-07 00:31:54 +01:00
Loic Dachary
2cd73f9d3e crush: remove obsolete comments from link_bucket
Probably copy/pasted from move_bucket.

Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-07 00:27:09 +01:00
Loic Dachary
e00324b2bc crush: remove redundant code from move_bucket
The following was introduced in 2012 by a2d0cff1b0

  // un-set the device name so we can use add_item later
  build_rmap(name_map, name_rmap);
  name_map.erase(id);
  name_rmap.erase(id_name);

when insert_item refused to move a bucket for which a name already
exists. It was changed in 2013 by 4e2557a038 and now supports it. The
TestCrushWrapper unittest for move_bucket passes.

Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-07 00:21:16 +01:00
Loic Dachary
8ef80a4c67 crush: unittest CrushWrapper::move_bucket
Signed-off-by: Loic Dachary <loic@dachary.org>
2013-12-07 00:20:31 +01:00
Sage Weil
865880b5b1 Merge pull request #888 from ceph/wip-crush-tunables
default to bobtail-era crush tunables.

Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
2013-12-06 14:45:57 -08:00
Sage Weil
650f896c4d Merge pull request #903 from ceph/wip-memstore
memstore: reference ObjectStore backend

Reviewed-by: Samuel Just <sam.just@inktank.com>
2013-12-06 14:38:15 -08:00
Josh Durgin
3caf3effcb rbd: check write return code during bench-write
This allows rbd-bench to detect http://tracker.ceph.com/issues/6938
when combined with rapidly changing the mon osd full ratio.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-12-06 14:33:41 -08:00
Josh Durgin
e32874fc5a objecter: resend all writes after osdmap loses the full flag
Now that the osd does not respond if it gets a map with the full flag
set first, clients need to resend all writes.

Clients talking to old osds are still subject to the race condition,
so both sides must be upgraded to avoid it.

Refs: #6938
Backport: dumpling, emperor
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-12-06 14:33:35 -08:00
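The client-side behaviour described in e32874fc5a can be sketched as follows: because a full OSD may have silently dropped a write, every unacknowledged write is sent again once the map loses the full flag. This is a toy model, not the objecter code; `Client`, `sent_log` and the string object names are hypothetical.

```cpp
#include <string>
#include <vector>

struct Client {
  bool map_full = false;
  std::vector<std::string> in_flight;  // writes not yet acknowledged
  std::vector<std::string> sent_log;   // every send, kept for illustration

  // Queue a write; send it right away unless the map is marked full.
  void write(const std::string& oid) {
    in_flight.push_back(oid);
    if (!map_full) sent_log.push_back(oid);
  }

  // Process a new map. When the full flag goes from set to cleared,
  // resend every unacknowledged write, since the OSD may have dropped it.
  void handle_map(bool full) {
    const bool was_full = map_full;
    map_full = full;
    if (was_full && !full)
      for (const std::string& oid : in_flight) sent_log.push_back(oid);
  }
};
```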
Josh Durgin
4111729dda osd: drop writes when full instead of returning an error
There's a race between the client and osd with a newly marked full
osdmap.  If the client gets the new map first, it blocks writes and
everything works as expected, with no errors from the osd.

If the osd gets the map first, however, it will respond to any writes
with -ENOSPC. Clients will pass this up the stack, and not retry these
writes later.  -ENOSPC isn't handled well by all clients. RBD, for
example, may pass it on to qemu or kernel rbd which will both
interpret it as EIO.  Filesystems on top of rbd will not behave well
when they receive EIOs like this, especially if the cluster oscillates
between full and not full, so some writes succeed.

To fix this, never return ENOSPC from the osd because of a map marked
full, and rely on the client to retry all writes when the map is no
longer marked full.

Old clients talking to osds with this fix will hang instead of
propagating an error, but only if they run into this race
condition. ceph-fuse and rbd with caching enabled are not affected,
since the ObjectCacher will retry writes that return errors.

Refs: #6938
Backport: dumpling, emperor
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-12-06 14:33:26 -08:00