Commit Graph

20380 Commits

Author SHA1 Message Date
Samuel Just
90381dc9a1 OSD: set superblock compat_features on boot and mkfs
Previously, we did not actually persist the osd compatibility
mask.  Without persisting the current compat mask, a previous,
incompatible version of the OSD would not be prevented from
starting on the same store.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-16 10:59:55 -07:00
Samuel Just
470796b545 CompatSet: users pass bit indices rather than masks
CompatSet users number the Feature objects rather than
providing masks.  Thus, we should do

mask |= (1 << f.id) rather than mask |= f.id.

In order to detect old, broken encodings, the lowest
bit will be set in memory but not set in the encoding.
We can reconstruct the correct mask from the names map.

This bug can cause an incompat bit to not be detected
since 1|2 == 1|2|3.

fixes: #2748

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-16 10:59:55 -07:00
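
The mask bug described in 470796b545 can be reproduced in a few lines. This is a minimal Python sketch, not Ceph's C++ CompatSet; the `Feature` tuple and the two helper functions are hypothetical names used only for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a CompatSet feature: a small integer id plus a name.
Feature = namedtuple("Feature", ["id", "name"])

def buggy_mask(features):
    """OR the raw ids together (the behaviour the commit describes as broken)."""
    mask = 0
    for f in features:
        mask |= f.id
    return mask

def fixed_mask(features):
    """Treat each id as a bit index, as the commit message prescribes."""
    mask = 0
    for f in features:
        mask |= 1 << f.id
    return mask

old = [Feature(1, "a"), Feature(2, "b")]
new = old + [Feature(3, "c")]

# With raw ids, feature 3 is invisible because 1|2 == 1|2|3.
print(buggy_mask(old), buggy_mask(new))
# With bit indices, the extra feature changes the mask.
print(fixed_mask(old), fixed_mask(new))
```

With raw ids the two feature sets produce the same mask, so an added incompat feature goes unnoticed; with bit indices the masks differ.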
Sage Weil
e429da34c9 Merge remote-tracking branch 'gh/bugfix-2022'
Reviewed-by: Samuel Just <sam.just@inktank.com>
2012-07-16 10:48:25 -07:00
Sage Weil
47b38dd0ea Merge remote-tracking branch 'gh/bugfix-2779'
Reviewed-by: Greg Farnum <greg@inktank.com>
2012-07-16 09:12:09 -07:00
Sage Weil
f94c764638 mon: remove osds from [near]full sets when their stats are removed from pgmap
Greg points out that we could have a situation like:

 - mon recovers..
 - goes through osdmaps, notes an osd was removed and removes from
   full/nearfull
 - goes through pgmaps, and re-adds it when it encounters some osd_stat_ts.

Fix this by removing the osd from the full/nearfull set when we remove
the osd_stat_t from the pgmap.  Any osd removal is always followed by
an osd_stat_rm[] record when the primary processes the new osdmap and
proposes the appropriate pgmap updates.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-15 22:03:31 -07:00
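
The idea in f94c764638 (drop an osd from the full/nearfull sets at the same point its stats leave the pgmap) might look roughly like the toy model below. The class, method, and attribute names are assumptions made for this sketch and do not mirror the monitor's real data structures.

```python
class PGMapSketch:
    """Toy model: per-OSD stats plus the volatile full/nearfull sets."""

    def __init__(self):
        self.osd_stat = {}          # osd id -> stat record (opaque here)
        self.full_osds = set()
        self.nearfull_osds = set()

    @staticmethod
    def _set_membership(osd_set, osd, present):
        # Single helper that adds or removes an osd from one of the sets.
        if present:
            osd_set.add(osd)
        else:
            osd_set.discard(osd)

    def update_osd_stat(self, osd, stat, full=False, nearfull=False):
        self.osd_stat[osd] = stat
        self._set_membership(self.full_osds, osd, full)
        self._set_membership(self.nearfull_osds, osd, nearfull)

    def remove_osd_stat(self, osd):
        # Removing the stat also clears set membership, so replaying
        # older stat records cannot leave a removed osd marked full.
        self.osd_stat.pop(osd, None)
        self._set_membership(self.full_osds, osd, False)
        self._set_membership(self.nearfull_osds, osd, False)


pg = PGMapSketch()
pg.update_osd_stat(3, {"kb_used": 90}, nearfull=True)
pg.remove_osd_stat(3)
print(pg.nearfull_osds)
```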
Sage Weil
fe57681892 mon/MonitorStore: always O_TRUNC when writing states
It is possible for a .new file to already exist, potentially with a
larger size.  This would happen if:

 - we were proposing a different value
 - we crashed (or were stopped) before it got renamed into place
 - after restarting, a different value was proposed and accepted.

This isn't so unlikely for the log state machine, where we're
aggregating random messages.  O_TRUNC ensures we avoid getting the tail
end of some previous junk.

I observed #2593 and found that a logm state value had a larger size on
one mon (after slurping) than the others, pointing to put_bl_sn_map().

While we are at it, O_TRUNC put_int() too; the same type of bug is
possible there, too.

Fixes: #2593
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-15 21:38:29 -07:00
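
The failure mode in fe57681892 is easy to reproduce outside Ceph. The sketch below is a Python illustration under the assumption of a simple write-to-`.new`-then-rename scheme; `write_state` and `demo` are hypothetical helpers, not MonitorStore functions.

```python
import os
import tempfile

def write_state(path, data, truncate):
    """Write data to path + '.new', then rename it into place."""
    flags = os.O_WRONLY | os.O_CREAT
    if truncate:
        flags |= os.O_TRUNC
    tmp = path + ".new"
    fd = os.open(tmp, flags, 0o644)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)
    os.rename(tmp, path)

def demo(truncate):
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "state")
        # Simulate a crash that left a larger .new file behind.
        with open(path + ".new", "wb") as f:
            f.write(b"OLD-LONGER-VALUE")
        write_state(path, b"new", truncate)
        with open(path, "rb") as f:
            return f.read()

print(demo(truncate=False))  # stale tail survives after the new bytes
print(demo(truncate=True))   # only the new value remains
```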
Sage Weil
bf9a85ade6 filestore: dump open fds when we hit EMFILE
Use a helper to dump /proc/self/fd when we hit EMFILE in the filestore.
Ideally, we should trigger this in other appropriate places, but it is
not immediately clear that there is a sane way to do that.

Fixes: #2330
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-15 16:31:05 -07:00
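
A helper of the kind bf9a85ade6 describes could be sketched as follows. This is a Linux-only Python illustration; `dump_open_fds` and `open_or_dump` are assumed names, and the real filestore helper is C++ and may differ in every detail.

```python
import errno
import os

def dump_open_fds(proc_fd_dir="/proc/self/fd"):
    """Return sorted (fd, target) pairs for this process; empty without procfs."""
    entries = []
    try:
        names = os.listdir(proc_fd_dir)
    except OSError:
        return entries
    for name in names:
        try:
            target = os.readlink(os.path.join(proc_fd_dir, name))
        except OSError:
            target = "?"  # fd was closed between listdir and readlink
        entries.append((int(name), target))
    return sorted(entries)

def open_or_dump(path):
    """Open path read-only; on EMFILE, print the fd table before re-raising."""
    try:
        return os.open(path, os.O_RDONLY)
    except OSError as e:
        if e.errno == errno.EMFILE:
            for fd, target in dump_open_fds():
                print(f"fd {fd} -> {target}")
        raise
```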
Sage Weil
a278ea1316 osdmap: drop useless and unused get_pg_role() method
Users probably want get_pg_acting_rank().  If they don't, they probably
have the mapping and can calculate the rank themselves.  Having this here
is asking for bugs like #2022.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-14 17:39:34 -07:00
Sage Weil
38962abd5b osd: base misdirected op role calc on acting set
We want to look at the acting set here, nothing else.  This was causing us
to erroneously queue ops for later (wasting memory) and to erroneously
print out a 'misdirected op' message in the cluster log (confusion and
incorrect [but ignored] -ENXIO reply).

Fixes: #2022
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-14 17:39:33 -07:00
Sage Weil
6faeedacfb osd: simplify helper usage for misdirected ops
Make the helper exclusively for the PG != NULL cases, and open-code the
one PG == NULL caller.  This is simpler, and lets us include more useful
information in the log message.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-14 17:39:33 -07:00
Noah Watkins
ed4f80f960 vstart: use absolute path for keyring
Stores absolute path to the generated keyring so that tests running in
other directories (e.g. src/java/test) can simply reference the
generated ceph.conf.

Signed-off-by: Noah Watkins <jawhawk@cs.ucsc.edu>
2012-07-14 17:39:11 -07:00
Samuel Just
117b28680e OSD: add config options to fake missed pings
In order to test monitor and osd failure detection and false
positive correction, this patch adds the following options:

 1. osd_debug_drop_ping_probability: probability of dropping
    a string of pings from a client upon ping receipt.
 2. osd_debug_drop_ping_duration: number of pings to drop in
    a row.

This should help with replicating some wrongly-marked-down
thrashing cases.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-13 16:09:53 -07:00
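
One way to read the two options from 117b28680e is as a burst dropper: each incoming ping may start a run of dropped pings. The Python sketch below models that reading for a single peer; the `PingDropper` class is hypothetical, and the real OSD may track this state differently.

```python
import random

class PingDropper:
    """Drop runs of pings: 'probability' starts a run, 'duration' is its length."""

    def __init__(self, probability, duration, rng=None):
        self.probability = probability   # cf. osd_debug_drop_ping_probability
        self.duration = duration         # cf. osd_debug_drop_ping_duration
        self.remaining = 0               # pings still to drop in the current run
        self.rng = rng or random.Random()

    def should_drop(self):
        if self.remaining > 0:
            self.remaining -= 1
            return True
        if self.duration > 0 and self.rng.random() < self.probability:
            # This ping is the first of the run; duration - 1 more follow.
            self.remaining = self.duration - 1
            return True
        return False


dropper = PingDropper(probability=0.0, duration=3)
print([dropper.should_drop() for _ in range(5)])  # never drops at probability 0
```

Injecting `rng` keeps the sketch deterministic under test while defaulting to a normal random source.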
caleb miles
ce20e02021 crushtool: allow information generated during testing to be dumped
to a set of CSV files for off-line analysis.

Signed-off-by: caleb miles <caleb.miles@inktank.com>
2012-07-13 15:14:15 -07:00
John Wilkins
8a89d40e6b doc: remove last reference to ceph-cookbooks.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2012-07-13 14:16:08 -07:00
John Wilkins
2011956745 doc: cookbooks issue resolved, so changed 'ceph-cookbooks' back to 'ceph.'
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2012-07-13 14:08:41 -07:00
Samuel Just
53600798f7 OSD: send_still_alive when we get a reply if we reported failure
When we get a ping reply, remove the peer from the failure_queue
and send a still alive message if the peer is in the failure_pending
map.

Otherwise, the monitor could slowly accumulate sporadic failure reports
leading to an osd being incorrectly marked out.

This bug may have been contributing to the wrongly-marked-down
thrashing observed on some systems.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-13 12:18:46 -07:00
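
The bookkeeping described in 53600798f7 can be sketched as below. Only the names `failure_queue` and `failure_pending` come from the commit message; the class, its methods, and the use of plain sets and a `sent` list are assumptions for illustration.

```python
class HeartbeatSketch:
    """Toy model of failure reporting between an OSD and the monitor."""

    def __init__(self):
        self.failure_queue = set()    # peers we intend to report as failed
        self.failure_pending = set()  # peers already reported to the monitor
        self.sent = []                # messages "sent" to the monitor

    def note_missed(self, peer):
        self.failure_queue.add(peer)

    def send_failures(self):
        for peer in sorted(self.failure_queue):
            self.sent.append(("failure", peer))
            self.failure_pending.add(peer)
        self.failure_queue.clear()

    def handle_ping_reply(self, peer):
        # A reply cancels any queued report for this peer ...
        self.failure_queue.discard(peer)
        # ... and retracts a report that was already sent.
        if peer in self.failure_pending:
            self.sent.append(("still_alive", peer))
            self.failure_pending.discard(peer)


hb = HeartbeatSketch()
hb.note_missed(4)
hb.send_failures()
hb.handle_ping_reply(4)
print(hb.sent)
```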
Samuel Just
5924f8e4a8 PG: merge_log always use stats from authoritative replica
If the osd receiving the log has divergent entries, it will
also have a "divergent" stat structure.  In general, it suffices
to simply trust the stat structure shipped with the authoritative
log and info since merge_log is only used to merge an authoritative
log.

Probably fixes #2769.

In cases like #2769, this bug can result in a primary with a stat
structure which double counts an operation: once for the
divergent operation, and once for the replay.  It turned up
in a regression suite run as a scrub stat mismatch.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-13 10:19:24 -07:00
Josh Durgin
3dd65a897b qa: download tests from specified branch
These python tests aren't installed, so they need to be downloaded.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-13 09:42:27 -07:00
Sage Weil
ce7e0be100 mon: use single helper for [near]full sets
Use a single helper to add/remove osds from the [near]full sets.  This
keeps the logic in a single place, and simplifies the code somewhat.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-13 07:28:08 -07:00
Sage Weil
30b3dd1d34 mon: purge removed osds from [near]full sets
The [near]full sets are volatile state.  Remove removed (or created)
osds from the set when we process a map.

Fixes: #2779
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-13 07:28:05 -07:00
Samuel Just
bcfa573f5f ReplicatedPG: don't mark repop done until apply completes
Consider the following sequence:
1. issue, apply repop
2. replicas and primary commit
  Here, repop->waitfor_(ack|disk) are empty, so we mark
  repop->done and remove_repop.
3. interval change, repops still in queue are marked aborted
4. activate, last_update_applied = last_update
5. the repop from step 1 enters apply_repop, is not aborted,
   and finds that last_update_applied has passed it by.

Fixes #2749

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-12 16:52:37 -07:00
Sage Weil
10ec5926c3 test_librbd: fix warnings
test/test_librbd.cc: In member function ‘virtual void LibRBD_TestClone_Test::TestBody()’:
warning: test/test_librbd.cc:1040:111: format ‘%ld’ expects argument of type ‘long int’, but argument 2 has type ‘uint64_t {aka long long unsigned int}’ [-Wformat]
warning: test/test_librbd.cc:1040:111: format ‘%ld’ expects argument of type ‘long int’, but argument 3 has type ‘uint64_t {aka long long unsigned int}’ [-Wformat]
warning: test/test_librbd.cc:1040:111: format ‘%ld’ expects argument of type ‘long int’, but argument 4 has type ‘int64_t {aka long long int}’ [-Wformat]

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-12 16:14:41 -07:00
Samuel Just
5450567a67 ReplicatedPG,PG: dump recovery/backfill state on pg query
Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-12 14:06:03 -07:00
Sage Weil
b133c49017 Merge remote-tracking branch 'gh/wip-2101'
2012-07-12 13:11:33 -07:00
Josh Durgin
508bf3fb96 rbd: enable layering when using the new format
We'll add options for different features later.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-12 11:27:32 -07:00
John Wilkins
dfe29aff7f doc: reverted file and role names.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2012-07-12 11:46:43 -07:00
Tommi Virtanen
f8478d4c56 upstart: Make ceph-osd always set the crush location.
This used to be conditional on config having osd_crush_location set,
but with that, minimal configuration left the OSD completely out of
the crush map, and prevented the OSD from starting properly.

Note: Ceph does not currently let this mechanism automatically move
hosts to another location in the CRUSH hierarchy. This means if you
let this run with defaults, setting osd_crush_location later will not
take effect. Set up your config file (or Chef environment) fully
before starting the OSDs the first time.

Signed-off-by: Tommi Virtanen <tv@inktank.com>
2012-07-12 10:47:29 -07:00
Sage Weil
5ceb7c734a doc: fix config metavariables discussion
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-12 10:00:41 -07:00
Sage Weil
d1054df6be doc: perf counters
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-12 10:00:41 -07:00
John Wilkins
09c60b4399 doc: added :: to code example.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2012-07-12 09:00:19 -07:00
John Wilkins
ad8beeb407 doc: minor edits.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2012-07-12 08:55:15 -07:00
John Wilkins
63a1799853 doc: cookbook name change broke some things in doc. Fixed.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2012-07-12 08:47:47 -07:00
Yehuda Sadeh
cc8df29e19 rados tool: bulk objects removal
Issue #2776. Allow the removal of multiple objects in a single
rados tool command:

  # rados -p pool rm obj1 [obj2 [...]]

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-11 20:06:32 -07:00
Sage Weil
762a5b6383 Merge remote-tracking branch 'gh/wip-cct'
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2012-07-11 19:59:32 -07:00
Sage Weil
f20b602296 Merge branch 'next'
Conflicts:
	src/rados.cc
2012-07-11 18:56:00 -07:00
Sage Weil
99a048d882 rados: more usage cleanup
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-11 18:54:30 -07:00
Dan Mick
0081c8e420 rados: usage message
Bad linebreaks, wrapping, stringification, missing doc for bench args

Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
2012-07-11 18:53:35 -07:00
John Wilkins
0782db3694 doc: changed role file names as part of update to roles.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2012-07-11 17:35:38 -07:00
John Wilkins
e5997f4e11 doc: added DHO config.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2012-07-11 17:35:01 -07:00
Yehuda Sadeh
173d592a4e rados tool: remove -t param option for target pool
Bug #2772. This fixes an issue that was introduced when we
added the 'rados cp' command. The -t param was already used
for rados bench. With this change the only way to specify
a target pool is using --target-pool.
Though this problem is post argonaut, the 'rados cp' command
has been backported, so we need this fix there too.

Backport: argonaut

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-11 17:11:15 -07:00
Sage Weil
31c8dcc11c crush: sum and check quantized weights for bucket
Sum the quantized weights for each bucket, and check that for overflow.

This could change the results of a compile marginally if the map is using
non-divisible weight values that quantize funny.  The old code might
calculate a bucket sum that is not the actual sum of the quantized weights.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-11 16:36:47 -07:00
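
The difference 31c8dcc11c points at, summing quantized weights versus quantizing a sum, shows up with ordinary decimal weights. The Python sketch below assumes a 16.16 fixed-point scale and a 32-bit limit on the bucket sum; both constants and the function names are assumptions for this illustration, not values taken from the commit.

```python
FIXED_ONE = 0x10000      # assumed 16.16 fixed-point scale
MAX_U32 = 0xFFFFFFFF     # assumed 32-bit storage for a bucket's weight

def quantize(weight):
    """Convert a floating-point weight to fixed point, truncating."""
    return int(weight * FIXED_ONE)

def bucket_weight(item_weights):
    """Sum the quantized item weights and check the total for overflow."""
    total = sum(quantize(w) for w in item_weights)
    if total > MAX_U32:
        raise OverflowError("bucket weight exceeds 32 bits")
    return total

weights = [0.1, 0.1, 0.1]
print(bucket_weight(weights))    # sum of the quantized items
print(quantize(sum(weights)))    # quantizing the float sum instead
```

The two printed values differ by one unit for these inputs, which is the kind of marginal change in compile results the commit message warns about.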
caleb miles
675a1b7bbc crush: Set maximum device/bucket weights.
Signed-off-by: caleb miles <caleb.miles@inktank.com>
2012-07-11 16:03:44 -07:00
caleb miles
c9fc5a2477 crush: prevent integer overflow on reweight
Disallow setting OSD weights to a value over 10,000 and cap bucket weight
at 10,000,000 in a CRUSH map. Addresses issue #2101.

Signed-off-by: caleb miles <caleb.miles@inktank.com>
2012-07-11 16:03:13 -07:00
Dan Mick
d29ec1e24d rados: usage message
Bad linebreaks, wrapping, stringification, missing doc for bench args

Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
2012-07-11 15:32:04 -07:00
Sage Weil
2c001b28fb Makefile: don't install crush headers
This is leftover from when we built a libcrush.so.  We can re-add when we
start doing that again.

Reported-by: Laszlo Boszormenyi <gcs@debian.hu>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-11 09:19:16 -07:00
Sage Weil
22d0648db2 librados: simplify cct refcounting
get() in ctor, put() in dtor.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-11 09:04:50 -07:00
Sage Weil
c5bcb04b9a lockdep: stop lockdep when its cct goes away
When a cct is destroyed, tell lockdep so that it can shut down if it needed
it.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-11 08:58:22 -07:00
Sage Weil
7adc6c08f1 mon: simplify logmonitor check_subs; less noise
 * simple helper to translate name to id
 * verify sub type is valid in caller
 * assert sub type is valid in method
 * simplify iterator usage

Among other things, this gets rid of this noise in the logs:

2012-07-10 20:51:42.617152 7facb23f1700  1 mon.a@1(peon).log v310 check_sub sub monmap not log type

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-10 21:27:50 -07:00
Sage Weil
fa96e19f4d Merge branch 'stable' into next
2012-07-10 18:21:29 -07:00
Sage Weil
0f917c2f14 osd: guard class call decoding
Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-10 18:21:06 -07:00