Commit Graph

20518 Commits

Author SHA1 Message Date
Yehuda Sadeh
bb6e0d0e58 wireshark: update patch
Update to latest source tree (svn 43768).

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-31 15:49:25 -07:00
Samuel Just
deec81b4e9 ReplicatedPG: clear waiting_for_ack when we send the commit
Otherwise, we might send the ack anyway later, after a subsequent
commit is sent resulting in an out-of-order op.

This resulted in a crash when the client encountered out-of-order ops.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-31 14:26:19 -07:00
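
The ordering issue described above can be sketched roughly as follows (a simplified illustration, not the actual ReplicatedPG code; RepOp, the version-keyed maps, and send_commit are made-up stand-ins):

    #include <cassert>
    #include <map>

    // Hypothetical per-op bookkeeping, keyed by op version.
    struct RepOp {
      bool acked = false;
      bool committed = false;
    };

    std::map<unsigned, RepOp> waiting_for_ack;     // ops still owed an ack
    std::map<unsigned, RepOp> waiting_for_commit;  // ops still owed a commit

    void send_commit(unsigned version) {
      // A commit implies an ack, so drop the op from waiting_for_ack here.
      // If it were left in place, a later ack for this version could be sent
      // after the commit, which the client would see as an out-of-order op.
      waiting_for_ack.erase(version);
      auto it = waiting_for_commit.find(version);
      assert(it != waiting_for_commit.end());
      // ... send the commit reply to the client here ...
      waiting_for_commit.erase(it);
    }
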
Samuel Just
e0e72d78b7 Merge remote-tracking branch 'upstream/wip-leveldb-iterators' 2012-07-31 13:51:49 -07:00
Samuel Just
cda5e8e0c3 PG,ReplicatedPG: clarify scrub state clearing
scrub_clear_state takes care of clearing the SCRUB and REPAIR
flags.  Thus, PG::scrub() needn't clear them again, since
any change that would have caused that if-block to be taken
would have triggered ReplicatedPG::on_change(), which also
clears the scrub reservations.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-30 13:43:51 -07:00
Samuel Just
6d464a21fc PG::mark_clean(): queue_snap_trim if snap_trimq is not empty
Currently, we won't queue for snap trim until the next map
update.

Noticed while reviewing another patch; this would result in
snaps not being trimmed until the next map update.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-30 13:39:00 -07:00
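
A rough sketch of the change described above (names are illustrative; the real PG::mark_clean() does considerably more):

    #include <set>

    struct PG {
      std::set<unsigned> snap_trimq;    // snapshots still waiting to be trimmed

      void queue_snap_trim() { /* schedule the snap trimmer for this PG */ }

      void mark_clean() {
        // ... existing bookkeeping to mark the PG clean ...

        // Without this, trimming would not resume until the next map update.
        if (!snap_trimq.empty())
          queue_snap_trim();
      }
    };
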
Samuel Just
1041b92ca5 ReplicatedPG::snap_trimmer: requeue if scrub_block_writes
Otherwise, we do not continue snap_trimming once scrub is
complete.

Noticed while reviewing another patch.  This would result
in snaps not being trimmed again until the next map
update.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-30 13:38:25 -07:00
Sage Weil
480380025b Merge branch 'wip-osd'
Reviewed-by: Samuel Just <sam.just@inktank.com>
2012-07-30 10:49:44 -07:00
Sage Weil
9e5d4e61a7 osd: initialize send_notify on pg load
When the PG is loaded, we need to set send_notify if we are not the
primary.  Otherwise, if the PG does not go through
start_peering_interval() or experience a role change, we will not set
the flag, and thus never tell the primary that we exist.  This can cause problems
for example if we have unfound objects that the primary needs, although
I'm sure there are other bad implications as well.

Fixes: #2866
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30 10:49:16 -07:00
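
A minimal sketch of the idea (hypothetical names; the real code lives in the OSD's PG load path):

    struct PG {
      int role = -1;            // 0 == primary, anything else == replica/stray
      bool send_notify = false;

      void on_load() {
        // If we are not the primary, make sure we will notify the primary
        // that we exist; otherwise a PG that never goes through
        // start_peering_interval() or a role change would stay silent.
        send_notify = (role != 0);
      }
    };
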
Sage Weil
f9ff8dd3b2 osd: replace STRAY bit with bool
We were setting a bit in pg->state that is private to the non-primary
PG.  The other bits get shared with the mon etc, but this one didn't.

Replace it with a simple bool.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-30 10:49:15 -07:00
Joao Eduardo Luis
8e40477225 test: test_keyvaluedb_iterators: Test KeyValueDB implementations iterators
This set of tests focus on testing the expected behavior of LevelDBStore's
and KeyValueDBMemory's iterators.

We test a grand total of six use cases, each one with several test
units, tested against both the LevelDBStore and the in-memory mock
(totalling 48 test units, plus two disabled by default):

 * Removing keys:
  - Using both the whole-space iterator and the whole-space snapshot
    iterator
  - Tests key removal while iterating the store, either by prefix or by
    removing specific (prefix,key) pairs

 * Setting keys:
  - Using both the whole-space iterator and the whole-space snapshot
    iterator
  - Tests key insertion while iterating the store
  - Tests value update while iterating the store
  - This use case has two disabled tests: one for setting keys and one
    for updating values, both on LevelDBStore using the whole-space
    iterator. These are disabled because they will fail, unlike with the
    in-memory mock implementation: leveldb implicitly creates an
    iterator that reads from a snapshot instead of directly from the
    underlying store.

 * Using Upper/Lower Bounds:
  - Using the whole-space iterator (we don't modify the store's state,
    so there is no need to also test the whole-space snapshot iterator)
  - Tests upper/lower bounds when the key, the prefix or both are empty
  - Tests upper/lower bounds when both the key and the prefix are set

 * Seeking:
  - Using the whole-space iterator (we don't modify the store's state,
    so there is no need to also test the whole-space snapshot iterator)
  - Tests seeking to first and to last
  - Tests seeking to first and to last using a prefix

 * Key-Space Iteration:
  - Using the whole-space iterator (we don't modify the store's state,
    so there is no need to also test the whole-space snapshot iterator)
  - Tests forward and backward iteration over the key-space

 * Empty Store:
  - Using the whole-space iterator (we don't modify the store's state,
    so there is no need to also test the whole-space snapshot iterator)
  - Tests seeking and using bounds functions when the store is empty

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2012-07-30 18:47:41 +01:00
Joao Eduardo Luis
9dd8a333d4 os: KeyValueDB: implement snapshot iterators
Create a set of functions, to be implemented by derivative classes of
KeyValueDB, responsible for returning an iterator with strong
read-consistency guarantees. How this iterator is implemented, or what
backs it, is implementation-specific, but it must guarantee that all
reads made through it behave as if there had been no subsequent writes
to the store since the iterator was created.

For instance, LevelDBStore will back this iterator with a leveldb Snapshot,
while KeyValueDBMemory will perform a copy of its in-memory map.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2012-07-30 18:47:41 +01:00
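
A minimal sketch of the read-consistency guarantee described above, using the in-memory approach (copy the map when the iterator is created); the names are illustrative, not the actual KeyValueDB interface:

    #include <map>
    #include <string>
    #include <utility>

    using Key   = std::pair<std::string, std::string>;  // (prefix, key)
    using Store = std::map<Key, std::string>;

    // Snapshot iterator backed by a full copy of the store's map: writes
    // made to 'store' after construction are invisible to this iterator.
    class SnapshotIterator {
      Store copy;
      Store::const_iterator it;
    public:
      explicit SnapshotIterator(const Store &store)
        : copy(store), it(copy.begin()) {}
      bool valid() const { return it != copy.end(); }
      void next() { ++it; }
      const Key &key() const { return it->first; }
      const std::string &value() const { return it->second; }
    };
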
Joao Eduardo Luis
fb1d549582 os: KeyValueDB: re-implement (prefix) iter in terms of whole-space iter
In-a-nutshell version: create a whole-space iterator interface, and
implement the already existing, prefix-based iterator in terms of the
new whole-space iterator.

This patch introduces a significant change on the architecture of
KeyValueDB's iterator, although its interface remains the same.

Before this patch, KeyValueDB simply defined a prefix-based iterator
interface, to be implemented by derivative classes. Being constrained to a
prefix-based approach only makes sense when we know which prefixes we want
to iterate over, which requires knowing the prefixes beforehand. This
approach didn't work when one wanted to iterate over the whole key space
without any prior knowledge of the keys and their prefixes.

This patch introduces a new interface for a whole-space iterator, to be
implemented by derivative classes, which is prefix-independent. We also
define an abstract function to obtain this iterator, which must also be
implemented by the derivative class. With this interface in place, we are
then able to implement a prefix-dependent iterator in terms of the
whole-space iterator, which will be offered by the KeyValueDB class itself.

Furthermore, we implement these changes on LevelDBStore and KeyValueDBMemory,
the in-memory mock store, which leads to significant changes on both:

  * LevelDBStore:
    - Replace the previously existing LevelDBIteratorImpl, which
      followed a prefix-based iteration, with
      LevelDBWholeSpaceIteratorImpl, which now iterates over the whole
      key space of the store;

  * KeyValueDBMemory:
    - Replace the previously existing MemIterator, which followed a
      prefix-based iteration, with WholeSpaceMemIterator, which now
      iterates over the whole key space of the in-memory mock store;
    - Change the in-memory mock store data structure. Previously, we
      used a map-of-maps, mapping prefixes to a key/value map; now we
      keep a single map, mapping (prefix,key) pairs to values.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2012-07-30 18:47:41 +01:00
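
The layering described above can be sketched as follows (a simplified illustration with made-up names, not the actual LevelDBStore/KeyValueDBMemory code): the store exposes only a whole-space iterator over (prefix,key) pairs, and the prefix-bounded iterator is a thin wrapper around it.

    #include <map>
    #include <string>
    #include <utility>

    using Key   = std::pair<std::string, std::string>;  // (prefix, key)
    using Store = std::map<Key, std::string>;

    // Whole-space iterator: walks every (prefix,key) pair in the store.
    class WholeSpaceIterator {
      const Store &store;
      Store::const_iterator it;
    public:
      explicit WholeSpaceIterator(const Store &s) : store(s), it(s.begin()) {}
      void lower_bound(const std::string &prefix, const std::string &key) {
        it = store.lower_bound({prefix, key});
      }
      bool valid() const { return it != store.end(); }
      void next() { ++it; }
      const std::string &prefix() const { return it->first.first; }
      const std::string &key() const { return it->first.second; }
      const std::string &value() const { return it->second; }
    };

    // Prefix-bounded iterator, implemented purely in terms of the
    // whole-space iterator: seek to the first key of the prefix and stop
    // reporting entries once the prefix changes.
    class PrefixIterator {
      WholeSpaceIterator ws;
      std::string prefix;
    public:
      PrefixIterator(const Store &s, std::string p)
        : ws(s), prefix(std::move(p)) {
        ws.lower_bound(prefix, "");
      }
      bool valid() const { return ws.valid() && ws.prefix() == prefix; }
      void next() { ws.next(); }
      const std::string &key() const { return ws.key(); }
      const std::string &value() const { return ws.value(); }
    };
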
Joao Eduardo Luis
9d43c8a406 test: workloadgen: Don't linearly iterate over a map to obtain a collection
We were iterating over the collections map a given number of times in
order to obtain the collection at that position. To avoid this kind of
linear scan in a function that may be called a large number of times, and
that may iterate over a rather large map, we now keep the collection ids
in a vector. To obtain the collection at position X, we simply look up
the collection id at position X of the vector and then fetch the
collection from the map using that id.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2012-07-28 13:53:29 -07:00
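
A minimal sketch of the data-structure change described above (illustrative names only, not the actual workload generator code):

    #include <map>
    #include <vector>

    struct Collection { /* ... */ };

    std::map<int, Collection> collections;  // coll_id -> collection
    std::vector<int> coll_ids;              // coll_ids[x] == id of the x-th collection

    // Indexed lookup instead of advancing a map iterator x times.
    Collection &get_collection_at(size_t x) {
      return collections.at(coll_ids.at(x));
    }
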
Sage Weil
bae837010b osd: peering: make Incomplete a Peering substate
This allows us to still catch changes in the prior set that would affect
our conclusions (that we are incomplete) and, when they happen, restart
peering.

Consider:
 - calc prior set, osd A is down
 - query everyone else, no good info
 - set down, go to Incomplete (previously WaitActingChange) state.
 - osd A comes back up (we do nothing)
 - osd A sends notify message with good info (we ignore)

By making this a Peering substate, we catch the Peering AdvMap reaction,
which will notice a prior set down osd is now up and move to Reset.

Fixes: #2860
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-28 09:04:32 -07:00
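
Ceph's recovery state machine is built with boost::statechart; as a loose, simplified illustration of why making Incomplete a substate helps, plain inheritance shows the same idea: the substate automatically picks up the parent state's AdvMap reaction (names and structure below are illustrative only):

    struct AdvMap { /* details of the new osdmap */ };

    struct Peering {
      // Shared reaction: if a prior-set osd that was down is now up,
      // restart peering (transition to Reset).
      virtual bool react(const AdvMap &) {
        // if (prior_set_affected(...)) { transition_to_reset(); return true; }
        return false;
      }
      virtual ~Peering() = default;
    };

    // As a Peering substate, Incomplete inherits the AdvMap reaction above,
    // so the notify from the recovered osd is no longer ignored.
    struct Incomplete : Peering {};
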
Sage Weil
d1602ee2c2 osd: peering: move to Incomplete when.. incomplete
PG::choose_acting() may return false and *not* request an acting set change
if it can't find any suitable peers with enough info to recover.  In that
case, we should move to Incomplete, not WaitActingChange, just like we do
a bit lower in GetLog() if we have non-contiguous logs.  The state name is
more accurate, and this is also needed to fix bug #2860.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-28 09:04:29 -07:00
Sage Weil
1fc19df873 Merge remote-tracking branch 'gh/wip-msgr-masterbits'
Reviewed-by: Greg Farnum <greg@inktank.com>
2012-07-28 07:21:05 -07:00
Sage Weil
d61269402d config: send warnings to a ostream* argument
We shouldn't always send these to stderr.  (Among other things, the
warning: prefix breaks the gitbuilder error detection.)

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-28 07:39:27 -07:00
Sage Weil
de4474acbd vstart.sh: apply extra conf after the defaults
This lets you do, e.g., -o 'debug ms = 100' and have it apply after
the default logging levels.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-27 14:28:04 -07:00
Sage Weil
f69d025b3c conf: make dup lines override previous value
If you put

[some section]
 foo = 1
 ...
 foo = 2

in a .conf file, make the second key override the first.

Generate a warning when a value is overridden, to sidestep some user head-banging.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-27 10:43:27 -07:00
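
A rough sketch of the override-with-warning behavior (hypothetical parser helper; the warning goes to a caller-supplied ostream*, in the spirit of the 'config: send warnings to a ostream* argument' change also in this log):

    #include <map>
    #include <ostream>
    #include <string>

    // Later duplicate keys in a section override earlier ones; emit a
    // warning so the user notices the shadowed value.
    void set_val(std::map<std::string, std::string> &section,
                 const std::string &key, const std::string &val,
                 std::ostream *warnings) {
      auto it = section.find(key);
      if (it != section.end() && warnings)
        *warnings << "warning: '" << key << "' redefined, overriding '"
                  << it->second << "' with '" << val << "'\n";
      section[key] = val;
    }
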
Sage Weil
4dfc14c404 mon: remove superfluous "can't delete except on master" comments
That's what 'return false' means for preprocess_*().

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-27 10:43:26 -07:00
Sage Weil
5f3ef77df4 mon: make pool snap creation ops idempotent
Return 0 if the snap already exists, or is already deleted.

Also, avoid updating the pg_pool if we are just waiting for the current
round to commit.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-27 10:43:26 -07:00
Sage Weil
53aa959302 objecter: return ENOENT/EEXIST on pool snap delete/create
Do these checks on the client to mask monitor idempotency from the user.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-27 10:43:26 -07:00
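
The split between an idempotent monitor and client-side error reporting might look roughly like this for snap creation (an illustrative sketch, not the actual Objecter/OSDMonitor code):

    #include <cerrno>
    #include <set>
    #include <string>

    // Client side (objecter): report EEXIST locally so the user still sees
    // an error, even though the monitor treats the op as a no-op.
    int client_create_snap(const std::set<std::string> &pool_snaps,
                           const std::string &snap) {
      if (pool_snaps.count(snap))
        return -EEXIST;          // surfaced to the caller
      // ... send the create op to the monitor ...
      return 0;
    }

    // Monitor side: the same request is idempotent and always succeeds.
    int mon_create_snap(std::set<std::string> &pool_snaps,
                        const std::string &snap) {
      pool_snaps.insert(snap);   // inserting an existing snap is a no-op
      return 0;
    }
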
Sage Weil
507f99e9a0 librados: make snap create/destroy handle client-side errors
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-27 10:43:26 -07:00
Sage Weil
3715d2052d mon: check for invalid pool snap creates in preprocess_op, too
This avoids waiting for a paxos commit just to return an error.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-27 10:43:26 -07:00
Sage Weil
640e5fdebc qa: simple tests for 'ceph osd create|rm' commands
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-27 10:43:26 -07:00
Sage Weil
6f7837a96d mon: make 'osd rm ...' idempotent
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-27 10:43:26 -07:00
Sage Weil
4788567e0f qa: simple test for pool create/delete commands
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-27 10:43:26 -07:00
Sage Weil
a01e22d259 mon: make pool creation idempotent
Return success if the pool already exists.  Part of #2638.

Also, fix this so we wait until a creating pool is created before we reply.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-27 10:43:25 -07:00
Sage Weil
5503376f4c mon: make pool removal idempotent
Return success if pool does not exist.  Part of #2638.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-27 10:43:25 -07:00
Sage Weil
597f14ab20 objecter: make pool create/delete return EEXIST/ENOENT
Do these checks on the client side to mask monitor idempotency from
the user.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-27 10:43:25 -07:00
Sage Weil
358d6b619f librados: make pool create/destroy handle client-side errors
Add tests!

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-27 10:43:25 -07:00
Sage Weil
46e819ede8 objecter: fix mon command resends
The monitor session is lossy.  Send these when the op is initiated, or
when we reconnect.  The timeout/cutoff was preventing ops from getting
resent if there was an ill-timed mon reset.

Backport: testing, stable/argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-27 10:43:25 -07:00
Sage Weil
c2e1c6298b mutex: assert we are unlocked by the same thread that locked
This only works for non-recursive locks.  (Which is probably all of them?)

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-27 10:43:25 -07:00
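
A minimal sketch of the assertion, assuming a pthread-based, non-recursive mutex wrapper (illustrative only; the real Mutex class carries more state):

    #include <cassert>
    #include <pthread.h>

    class Mutex {
      pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
      pthread_t locked_by{};
      bool locked = false;
    public:
      void Lock() {
        pthread_mutex_lock(&m);
        locked = true;
        locked_by = pthread_self();
      }
      void Unlock() {
        // Only valid for non-recursive locks: the unlocking thread must be
        // the same thread that took the lock.
        assert(locked && pthread_equal(locked_by, pthread_self()));
        locked = false;
        locked_by = pthread_t{};
        pthread_mutex_unlock(&m);
      }
    };
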
Sage Weil
6ec9555bcf cond: reorder asserts
Make the more specific checks assert before the less specific ones, so we
are more likely to crash with useful information.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-27 10:43:25 -07:00
Sage Weil
9553c6edfb osd: fix sharing of past_intervals on backfill restart
We need to share past_intervals whenever we instantiate the PG on a peer.
In the PG activation case, this is based on whether our peer_info[] value
for that peer is dne().  However, the backfill code was updating the
peer info (history) in the block preceding the dne() check, which meant
we never shared past_intervals in this case and the peer would have to
chew through a potentially large number of maps if the PG has not been
clean recently.

Fix by checking dne() prior to the backfill block.  We still need to fill
in the message later because it isn't yet instantiated.

Fixes: #2849
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-26 21:55:03 -07:00
Sage Weil
29aa1cf440 filestore: check for EIO in read path
Check for EIO in read methods and helpers.  Try to do checks in low-level
methods (e.g., lfn_*()) to avoid duplication in higher-level methods.

The transaction apply function already checks for EIO on writes, and will
generate a nicer error message, so we can largely ignore the write path,
as long as errors get passed up correctly.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-26 21:55:00 -07:00
Sage Weil
0891948e7f filestore: add 'filestore fail eio' option, default true
By default we will assert/fail/crash on EIO from the underlying fs.  We
already do this in the write path, but not the read path, or in various
internal infrastructure.

Signed-off-by: Sage Weil <sage@inktank.com>

Conflicts:

	src/os/FileStore.cc
2012-07-26 21:29:13 -07:00
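
Taken together, the two filestore changes above amount to a pattern like the following (a hedged sketch with made-up helper names, not the actual FileStore code):

    #include <cassert>
    #include <cerrno>
    #include <unistd.h>

    bool filestore_fail_eio = true;   // stand-in for the new config option

    // Low-level read helper: translate errno and, if configured, crash on
    // EIO so disk errors are not silently propagated upward.
    ssize_t checked_read(int fd, void *buf, size_t len) {
      ssize_t r = ::read(fd, buf, len);
      if (r < 0) {
        r = -errno;
        if (r == -EIO && filestore_fail_eio)
          assert(0 == "unhandled EIO from underlying filesystem");
      }
      return r;
    }
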
Josh Durgin
17bb78a29b librbd: fix id initialization in new format
48bd839b1e should have included this.
I misread it due to the use of bid instead of id when generating
the object prefix.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-26 15:29:29 -07:00
Yehuda Sadeh
5601ae27d6 mon: set a configurable max osd cap
Don't allow setting a higher max osd value through the
ceph control utility.

Fixes: #2752
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-26 15:26:32 -07:00
John Wilkins
bcb9ab8b3b doc: updates to fix problem with ceph-cookbooks appearing in chef-server.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2012-07-25 15:57:58 -07:00
Sage Weil
9767146f8b osd: generate past intervals in parallel on boot
Even though we aggressively share past_intervals with notifies etc, it is
still possible for an osd to get buried behind a pile of old maps and need
to generate these if it has been out of the cluster for a while.  This has
happened to us in the past but, sadly, we did not merge the work then.
On the bright side, this implementation is much much much cleaner than the
old one because of the pg_interval_t helper we've since switched to.

On bootup, we look at the intervals each pg needs and calculate the union,
and then iterate over that map range.  The inner bit of the loop is
functionally identical to PG::build_past_intervals(), keeping the per-pg
state in the pistate struct.

Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-25 13:28:55 -07:00
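
A simplified sketch of the union-of-ranges idea (hypothetical types; the real code keeps per-pg state in a pistate struct and walks real OSDMaps):

    #include <algorithm>
    #include <map>

    struct pistate {              // per-pg progress while rebuilding intervals
      unsigned start, end;        // epochs this pg still needs
      // ... same_interval_since, last primary/acting, etc. ...
    };

    void build_past_intervals_parallel(std::map<int, pistate> &pgs) {
      if (pgs.empty())
        return;

      // Union of all per-pg ranges: one pass over [lo, hi] serves every pg.
      unsigned lo = pgs.begin()->second.start, hi = pgs.begin()->second.end;
      for (const auto &p : pgs) {
        lo = std::min(lo, p.second.start);
        hi = std::max(hi, p.second.end);
      }

      for (unsigned e = lo; e <= hi; ++e) {
        // OSDMapRef map = get_map(e);   // fetch each old map only once
        for (auto &p : pgs) {
          if (e < p.second.start || e > p.second.end)
            continue;
          // ... same per-pg logic as PG::build_past_intervals(), using
          //     p.second to carry state from one epoch to the next ...
        }
      }
    }
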
Sage Weil
d45929f4d0 osd: move calculation of past_interval range into helper
PG::generate_past_intervals() first calculates the range over which it
needs to generate past intervals.  Do this in a helper function.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

Conflicts:

	src/osd/PG.cc
2012-07-25 13:28:40 -07:00
Sage Weil
18d5fc41c9 osd: fix map epoch boot condition
We only want to join the cluster if we can catch up to the latest
osdmap with a small number of maps, in this case a single map message.

Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>

Conflicts:

	src/osd/OSD.cc
2012-07-25 13:27:34 -07:00
Sage Weil
11b275a086 osd: avoid misc work before we're active
If we're booting, we shouldn't scrub, or send reports to the monitor,
or send heartbeats, or any of that.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-24 20:54:11 -07:00
Sage Weil
278b5f5800 mon: ignore pgtemp messages from down osds
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-24 20:51:45 -07:00
Sage Weil
08e2ecac97 mon: ignore osd_alive messages from down osds
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-24 20:51:38 -07:00
Sage Weil
404a7f526b admin_socket: json output, always
If the perfcounters stuff were refactored to use the Formatter, we could
put the JSONFormatter in the admin_socket code and make this a bit less
annoying.  Later.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-24 17:23:07 -07:00
Sage Weil
0133392bdb admin_socket: dump config in json; add test
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-24 17:23:03 -07:00
Sage Weil
8c3b49072f Merge branch 'next' 2012-07-24 17:22:50 -07:00
Sage Weil
0ef8cd3c6c config: fix 'config set' admin socket command
Fixes: #2832
Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-24 13:53:03 -07:00