Commit Graph

20578 Commits

Author SHA1 Message Date
Sage Weil
9767146f8b osd: generate past intervals in parallel on boot
Even though we aggressively share past_intervals with notifies etc, it is
still possible for an osd to get buried behind a pile of old maps and need
to generate these if it has been out of the cluster for a while.  This has
happened to us in the past but, sadly, we did not merge the work then.
On the bright side, this implementation is much much much cleaner than the
old one because of the pg_interval_t helper we've since switched to.

On bootup, we look at the intervals each pg needs and calculate the union,
and then iterate over that map range.  The inner bit of the loop is
functionally identical to PG::build_past_intervals(), keeping the per-pg
state in the pistate struct.

Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-25 13:28:55 -07:00
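
The boot-time pass described in 9767146f8b can be sketched as follows. This is a minimal illustration, not the OSD code: `PiState`, `union_range` and `generate_all` are hypothetical names, and the per-epoch work is reduced to recording which epochs each PG sees.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef uint32_t epoch_t;

// Hypothetical per-PG bookkeeping; the commit keeps similar state in "pistate".
struct PiState {
  epoch_t start;                 // first epoch this PG still needs
  epoch_t end;                   // one past the last epoch this PG needs
  std::vector<epoch_t> visited;  // epochs processed for this PG
  PiState() : start(0), end(0) {}
};

typedef std::map<std::string, PiState> PgMap;

// Union of all per-PG ranges, returned as a single [first, second) span.
std::pair<epoch_t, epoch_t> union_range(const PgMap& pgs) {
  assert(!pgs.empty());
  epoch_t lo = pgs.begin()->second.start;
  epoch_t hi = pgs.begin()->second.end;
  for (PgMap::const_iterator p = pgs.begin(); p != pgs.end(); ++p) {
    lo = std::min(lo, p->second.start);
    hi = std::max(hi, p->second.end);
  }
  return std::make_pair(lo, hi);
}

// Walk the map range once; each PG only does work for epochs it needs.
void generate_all(PgMap& pgs) {
  const std::pair<epoch_t, epoch_t> range = union_range(pgs);
  for (epoch_t e = range.first; e < range.second; ++e) {
    for (PgMap::iterator p = pgs.begin(); p != pgs.end(); ++p) {
      PiState& st = p->second;
      if (e >= st.start && e < st.end)
        st.visited.push_back(e);  // stand-in for the real interval bookkeeping
    }
  }
}
```

The point of the single outer loop is that each map epoch is visited once for all PGs, rather than once per PG.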
Sage Weil
d45929f4d0 osd: move calculation of past_interval range into helper
PG::generate_past_intervals() first calculates the range over which it
needs to generate past intervals.  Do this in a helper function.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

Conflicts:

	src/osd/PG.cc
2012-07-25 13:28:40 -07:00
Sage Weil
18d5fc41c9 osd: fix map epoch boot condition
We only want to join the cluster if we can catch up to the latest
osdmap with a small number of maps, in this case a single map message.

Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>

Conflicts:

	src/osd/OSD.cc
2012-07-25 13:27:34 -07:00
Sage Weil
11b275a086 osd: avoid misc work before we're active
If we're booting, we shouldn't scrub, or send reports to the monitor,
or send heartbeats, or any of that.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-24 20:54:11 -07:00
Sage Weil
278b5f5800 mon: ignore pgtemp messages from down osds
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-24 20:51:45 -07:00
Sage Weil
08e2ecac97 mon: ignore osd_alive messages from down osds
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-24 20:51:38 -07:00
Sage Weil
404a7f526b admin_socket: json output, always
If the perfcounters stuff were refactored to use the Formatter, we could
put the JSONFormatter in the admin_socket code and make this a bit less
annoying.  Later.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-24 17:23:07 -07:00
Sage Weil
0133392bdb admin_socket: dump config in json; add test
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-24 17:23:03 -07:00
Sage Weil
8c3b49072f Merge branch 'next' 2012-07-24 17:22:50 -07:00
Sage Weil
0ef8cd3c6c config: fix 'config set' admin socket command
Fixes: #2832
Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-24 13:53:03 -07:00
Sage Weil
186a595ca0 Merge branch 'next' 2012-07-24 11:49:41 -07:00
Sage Weil
f565ace62a osd: fix pg log zeroing
Zero the right number of bytes.  Fixes a bug where we clobber legit log
data.  Fortunately this is only triggered with osd preserve pg log = false,
which was not the default until recently in master.

Fixes: #2799
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Mike Ryan <mike.ryan@inktank.com>
2012-07-24 11:02:37 -07:00
Yehuda Sadeh
3e886799d9 Merge branch 'wip-2763' 2012-07-24 10:10:22 -07:00
Pierre Rognant
d67ad0db64 Wireshark dissector updated to work with the current development tree of Wireshark. The way I patched it is not really clean, but it can be useful if people quickly need to inspect Ceph network flows. 2012-07-24 10:09:27 -07:00
Yehuda Sadeh
52f51a24e2 wireshark/ceph/packet-ceph.c: fix eol
Removing extra char from dos eol format.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-24 10:09:27 -07:00
Joao Eduardo Luis
a3d57a6e43 os: KeyValueDB: Add virtual raw_key() function to return (prefix,key) pair
If we were to use solely the key() function, whenever we had a key with,
say, prefix 'Foo' and key 'Bar', the key() function would return something
similar to 'Foo<separator>Bar'. Therefore, obtaining the prefix and the key
would require one to be aware of the separator used, and, since that is
implementation specific, we can't rely on such prior knowledge.

This new function must then be implemented by any derivative class of
KeyValueDB, and is expected to return a pair (prefix,key) for the
current iterator's position -- the key() function should behave as
previously, returning only the 'key' component of the pair.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2012-07-24 02:30:14 +01:00
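
To illustrate the motivation behind a3d57a6e43, here is a reduced sketch of an iterator that exposes both `key()` and `raw_key()`. The `Iterator` and `FlatStoreIterator` classes and the separator byte are assumptions made for this example; they are not the KeyValueDB classes themselves.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical, much-reduced iterator interface; not Ceph's actual class.
class Iterator {
public:
  virtual ~Iterator() {}
  virtual bool valid() const = 0;
  virtual void next() = 0;
  virtual std::string key() const = 0;                              // key only
  virtual std::pair<std::string, std::string> raw_key() const = 0;  // (prefix, key)
};

// Toy backend that stores "prefix<separator>key" in one ordered map.
class FlatStoreIterator : public Iterator {
  typedef std::map<std::string, std::string> Store;
  const Store& store_;
  Store::const_iterator it_;
  static char sep() { return '\x01'; }  // backend detail callers never see

public:
  explicit FlatStoreIterator(const Store& s) : store_(s), it_(s.begin()) {}

  static std::string combine(const std::string& prefix, const std::string& key) {
    return prefix + sep() + key;
  }

  bool valid() const override { return it_ != store_.end(); }
  void next() override { assert(valid()); ++it_; }

  std::pair<std::string, std::string> raw_key() const override {
    assert(valid());
    const std::string::size_type pos = it_->first.find(sep());
    assert(pos != std::string::npos);
    return std::make_pair(it_->first.substr(0, pos), it_->first.substr(pos + 1));
  }

  std::string key() const override { return raw_key().second; }
};
```

Callers obtain the prefix through `raw_key()` and never need to know which separator the backend chose.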
Joao Eduardo Luis
a16d9c64da os: KeyValueDB: allow finer-grained control of transaction operations
This patch adds single key/value modification operations to the
transaction interface.

Until now, any 'set' or 'rmkeys' operations required a map of keys to be
provided to the function, which made the task of removing or setting a
bunch of keys easier. Doing these same operations for a single key,
however, would entail creating a map with a single key.

Instead, this patch adds two new virtual abstract functions, to be
implemented by derivative classes, which set or remove one single
key/value, and we then implement the map-based, existing functions in
terms of these new functions.

We also update the derivative classes of KeyValueDB in order to reflect
these changes (i.e., LevelDBStore and KeyValueDBMemory).

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2012-07-24 02:30:14 +01:00
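
The layering described in a16d9c64da, with map-based calls built on single-key primitives, might look roughly like this. `Transaction` and `MemoryTransaction` are simplified, hypothetical stand-ins, and the signatures are assumptions rather than the real interface.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical transaction interface, loosely modelled on the description.
class Transaction {
public:
  virtual ~Transaction() {}

  // Single-key primitives that each backend must implement.
  virtual void set(const std::string& prefix, const std::string& key,
                   const std::string& value) = 0;
  virtual void rmkey(const std::string& prefix, const std::string& key) = 0;

  // Bulk calls, expressed in terms of the primitives above.
  void set(const std::string& prefix,
           const std::map<std::string, std::string>& to_set) {
    for (std::map<std::string, std::string>::const_iterator p = to_set.begin();
         p != to_set.end(); ++p)
      set(prefix, p->first, p->second);
  }
  void rmkeys(const std::string& prefix, const std::set<std::string>& keys) {
    for (std::set<std::string>::const_iterator p = keys.begin();
         p != keys.end(); ++p)
      rmkey(prefix, *p);
  }
};

// Toy in-memory backend: only the two primitives need to be written.
class MemoryTransaction : public Transaction {
public:
  std::map<std::string, std::map<std::string, std::string> > data;

  using Transaction::set;  // keep the map-based overload visible

  void set(const std::string& prefix, const std::string& key,
           const std::string& value) override {
    data[prefix][key] = value;
  }
  void rmkey(const std::string& prefix, const std::string& key) override {
    data[prefix].erase(key);
  }
};
```

With this arrangement a backend implements two small functions and gets the bulk operations for free.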
Sage Weil
6c0fa50944 doc: update information about stable vs development releases
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-23 17:39:12 -07:00
Josh Durgin
48bd839b1e librbd: replace assign_bid with client id and random number
The assign_bid method has issues with replay because it is a write
that also returns data. This means that the replayed operation would
return success, but no data, and cause a create to fail. Instead, let
the client set the bid based on its global id and a random number.

This only affects the creation of new images, since the bid is put
into an opaque string as part of the object prefix.

Keep the server side assign_bid around in case there are old clients
still using it.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-07-23 17:16:01 -07:00
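
As a rough illustration of 48bd839b1e, a client could derive the bid locally from its global id and a random number. The function name `make_block_id` and the hex concatenation format are assumptions for this sketch; the commit does not spell out the exact string layout.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: combine the client's global id with a random value
// into an opaque string. No server round trip is needed, so a replayed
// create cannot lose the id.
std::string make_block_id(uint64_t client_global_id, uint32_t random_part) {
  std::ostringstream out;
  out << std::hex << client_global_id << random_part;
  return out.str();
}
```

The random part is passed in here to keep the sketch deterministic; a real client would draw it from a random number generator.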
Sage Weil
67832c34a2 osd: fix ACK ordering on resent ops
The wait_for_ondisk handling fixed COMMIT ordering, but the ACKs need to
go back in the same order too.  For example:

 - op A is queued
 - client disconnects, both ACK and COMMIT replies are lost
 - client reconnects
 - op A and B are sent
 - op A is queued
 - op B is applied, ACK is sent
 - op A and B COMMITs are sent
 -> client's ack callbacks will see B and then A.

Fix this by creating a waiting_for_ack queue as well, and sending ACK
responses as needed.  Also handle the case where the ACK should be sent
immediately when the retry event is received.

Fixes: #2823
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Mike Ryan <mike.ryan@inktank.com>
2012-07-23 16:51:03 -07:00
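
The ordering requirement in 67832c34a2 can be modelled with a small queue that holds back a reply until every earlier op has been answered. `OrderedAcker` is a hypothetical, simplified model of that idea and does not reproduce the OSD's waiting_for_ack handling.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// Simplified model: ACKs may become ready in any order, but they are
// released strictly in the order the ops were queued.
class OrderedAcker {
  std::deque<std::pair<uint64_t, bool> > pending_;  // (op id, ack ready?)
  std::vector<uint64_t> sent_;                      // ACKs in the order sent

public:
  void queue(uint64_t op) { pending_.push_back(std::make_pair(op, false)); }

  void mark_ready(uint64_t op) {
    for (std::size_t i = 0; i < pending_.size(); ++i)
      if (pending_[i].first == op)
        pending_[i].second = true;
    // Release every ready ACK at the head of the queue.
    while (!pending_.empty() && pending_.front().second) {
      sent_.push_back(pending_.front().first);
      pending_.pop_front();
    }
  }

  const std::vector<uint64_t>& sent() const { return sent_; }
};
```

If op B becomes ready before op A, its ACK waits in the queue and goes out only after A's.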
Yehuda Sadeh
96dbc412df rados::cls::lock: move api types into namespace
By popular demand, moved public api into namespace. This
required some changes to ceph_dencoder to get some template
annoyance working.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-23 16:01:32 -07:00
Sage Weil
d9bfe9547d v0.49
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJQDZfmAAoJEH6/3V0X7TFtra0P/iXVIF+hcSpjZZApNe90Pa21
 ZrmC7Nu+0skrWtkfFyN1GuDsngDllZh+D7O6bUVozQVxKoz9bahsLDmlfwj1vi7N
 AyV1sWIGU1wBUmuYqXHOT3Kl7R3SuJjML4bDVi4YCb3HGERUo0O1PBnowSltoE5J
 Q0etTZWxuAjD5iOZTC2U5RIn0YOa0pCdrjHzPelkwrkJvNtvB9Voo4VFGKevMxUR
 RrDV85oBovj8XqTZsjO91vX5LFy0RG+Mb3sCoTk6A2T1gp3EOoMOAx2kNls5tgW1
 JivrrPVddgI10u+6DnVBZOJPnhcO3yCVmwSPjUK0xPOQ0YyEjOMWovS/ZzD5Lr6K
 FQpmuwkPIQ2+XVMMmta9TByy+r7h3ddGc7BcNB7Tfy9/AtxhPRARKsXzCfMQn4mD
 kvLXViL5uLzR+ZmCU40LfHQSpWXzHyxVV60LKqg4yUp//LE9Q6HgStw2nNklHggi
 ihY2SDAQf8WYhbbBbxuANI4TdxLeK1iEKLzqZikqUBXkU2q6fP+tYVV8niGhGi7l
 QzmLZmotr0kAhutaMTRf74NrFoZqLbW5grf+5JHPQyB6Q0KhykSQ5KbCB6AOzQyG
 Aff5Vu1QVkbmE81DbxogHdpUdPn7t5L6qitKNAQCGu8LSIxFJomub5Z/9Z5J7/f0
 ZNRyGNHs1c6qWkTk5kP0
 =6eMd
 -----END PGP SIGNATURE-----

Merge tag 'v0.49'

v0.49
2012-07-23 12:43:19 -07:00
Sage Weil
ca6265d0f4 v0.49 2012-07-23 11:28:08 -07:00
Sage Weil
c8f1311988 mon: make 'ceph osd rm ...' wipe out all state bits, not just EXISTS
This ensures that when a new osd reclaims that id it behaves as if it were
really new.

Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-23 10:47:10 -07:00
Sage Weil
5fcb22f03c mkcephfs: add sync between btrfs scan and mount
This appears to fix problems with mount failing for at least one user.

Reported-by: Paul Pettigrew <Paul.Pettigrew@mach.com.au>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-23 09:21:09 -07:00
Sage Weil
2d7e2cbf26 crush: fix name map encoding
We screwed up and encoded using the native 'int' type instead of int32_t.
That means people have systems encoding this as both 32 and 64 bit,
depending on their architecture.  This could be worse: x86_64 still has a
32-bit int (at least in my environment).

In any case, mixing both word sizes in their clusters is broken as a
result, with the exception of the kernel code, which doesn't decode this
part of the map and will tolerate differently-sized servers.

Fix this by:

 * encoding using int32_t now
 * decoding either 32-bit or 64-bit values, by assuming that the strings
   will always be non-empty.  This appears to be the case.

However:

 * any cluster with 64-bit ints must upgrade all at once, or else the new
   code will start encoding 32-bit values and the old code will be
   confused.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2012-07-21 09:15:06 -07:00
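
The decoding trick in 2d7e2cbf26 relies on names never being empty: if the word after the 32-bit key is zero, it cannot be a string length, so it is taken to be the upper half of a 64-bit key. The sketch below assumes a little-endian byte layout and non-negative keys; `encode_names` and `decode_names` are hypothetical helpers, not the CRUSH encoding functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

typedef std::vector<unsigned char> Bytes;
typedef std::map<int32_t, std::string> NameMap;

// Append / read one little-endian 32-bit word.
void put_u32(Bytes& b, uint32_t v) {
  for (int i = 0; i < 4; ++i) b.push_back((v >> (8 * i)) & 0xff);
}
uint32_t get_u32(const Bytes& b, std::size_t& pos) {
  assert(pos + 4 <= b.size());
  uint32_t v = 0;
  for (int i = 0; i < 4; ++i) v |= uint32_t(b[pos + i]) << (8 * i);
  pos += 4;
  return v;
}

// key_bytes == 8 mimics hosts that wrote a 64-bit int (non-negative keys only).
Bytes encode_names(const NameMap& m, int key_bytes) {
  assert(key_bytes == 4 || key_bytes == 8);
  Bytes b;
  put_u32(b, static_cast<uint32_t>(m.size()));
  for (NameMap::const_iterator p = m.begin(); p != m.end(); ++p) {
    assert(p->first >= 0 && !p->second.empty());
    put_u32(b, static_cast<uint32_t>(p->first));
    if (key_bytes == 8) put_u32(b, 0);  // upper half of the 64-bit key
    put_u32(b, static_cast<uint32_t>(p->second.size()));
    b.insert(b.end(), p->second.begin(), p->second.end());
  }
  return b;
}

// Accepts either key width.
NameMap decode_names(const Bytes& b) {
  NameMap m;
  std::size_t pos = 0;
  uint32_t n = get_u32(b, pos);
  while (n--) {
    int32_t key = static_cast<int32_t>(get_u32(b, pos));
    uint32_t len = get_u32(b, pos);
    if (len == 0) len = get_u32(b, pos);  // zero word was the key's upper half
    assert(pos + len <= b.size());
    m[key] = std::string(b.begin() + pos, b.begin() + pos + len);
    pos += len;
  }
  return m;
}
```

An empty name would break the heuristic, which is why the commit message notes that strings are assumed to be non-empty.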
Sage Weil
b497bdacf5 osd/OpTracker: fix use-after-free
And formatting.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-21 08:50:47 -07:00
Samuel Just
a6735ab009 OpRequest,OSD: track recent slow ops
This should be helpful while investigating slow performance.

OpRequests now track events with timestamp in addition
to dumping them to the log.  OpHistory keeps up to a
configurable number of the slowest ops over a configurable
recent time interval.  The admin socket interface for the OSD
now has a dump_historic_ops command which dumps the stored
slow ops.

Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-20 17:20:16 -07:00
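
A bounded history like the one a6735ab009 describes could be kept along these lines. `SlowOpHistory` and `OpRecord` are hypothetical names, and the pruning policy shown (drop anything outside the time window, then keep the slowest N) is one reading of the description rather than the actual OpHistory implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct OpRecord {
  std::string desc;    // human-readable description of the op
  double finished_at;  // completion time, in seconds
  double duration;     // how long the op took, in seconds
};

class SlowOpHistory {
  std::size_t max_ops_;  // configurable: how many ops to keep
  double window_;        // configurable: how far back to look, in seconds
  std::vector<OpRecord> ops_;

public:
  SlowOpHistory(std::size_t max_ops, double window_secs)
    : max_ops_(max_ops), window_(window_secs) {
    assert(window_secs > 0);
  }

  void insert(const OpRecord& op, double now) {
    ops_.push_back(op);
    // Forget ops that finished before the start of the window.
    const double cutoff = now - window_;
    ops_.erase(std::remove_if(ops_.begin(), ops_.end(),
                              [cutoff](const OpRecord& o) {
                                return o.finished_at < cutoff;
                              }),
               ops_.end());
    // Keep only the slowest max_ops_ entries, slowest first.
    std::sort(ops_.begin(), ops_.end(),
              [](const OpRecord& a, const OpRecord& b) {
                return a.duration > b.duration;
              });
    if (ops_.size() > max_ops_)
      ops_.erase(ops_.begin() + max_ops_, ops_.end());
  }

  const std::vector<OpRecord>& dump() const { return ops_; }
};
```

A command such as dump_historic_ops would then only need to format the result of `dump()`.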
Samuel Just
d624f3435f Merge branch 'next' 2012-07-20 14:32:44 -07:00
Samuel Just
9e207aa881 test/store_test.cc: verify collection_list_partial results are sorted
Synthetic test now also varies snapshots and uses a small variety of
hashes.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-20 13:59:25 -07:00
Yehuda Sadeh
49877cdeda cls_lock: cls_lock_id_t -> cls_lock_locker_id_t
Renamed type to make more sense.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-20 13:41:51 -07:00
Yehuda Sadeh
315bbea511 cls_lock: document lock properties
Added some comments about different lock properties.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-20 13:28:19 -07:00
Yehuda Sadeh
056d42cf91 cls_log: update a comment
Was missing output param description.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-20 13:16:05 -07:00
Yehuda Sadeh
2c7d782177 rados: lock info keeps expiration, not duration
We pass duration in the request, but internally we keep
the expiration.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-20 13:11:54 -07:00
Yehuda Sadeh
d16844c890 rados tool: add advisory lock control commands
Can now lock, break lock, list locks and show lock
info.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-20 13:00:43 -07:00
Yehuda Sadeh
2f8de8943e cls_lock: objclass for advisory locking
Providing an objclass to create and manipulate advisory
locking. Also providing a client api to control it. A lock
may either be exclusively locked or shared among multiple
lockers. A locker is identified by the rados client name, and
by a cookie-string.
A lock may be assigned with a tag that every operation on that
lock should use. A lock can be unlocked by the client that locked
it, or may be broken by other clients.
When a non-zero lock duration is assigned to a lock by a locker,
that locker expires after that time duration.
A lock may have a description.
Locks on a specific object can be listed. Lockers of a specific
lock can be enumerated (by get_info).

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-20 12:59:07 -07:00
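
The lock semantics listed in 2f8de8943e (exclusive or shared, lockers identified by client name plus cookie, optional expiry) can be modelled in a few lines. `AdvisoryLock` is a hypothetical in-memory model, not the cls_lock API; tags, descriptions and breaking another client's lock are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

enum LockType { LOCK_EXCLUSIVE, LOCK_SHARED };

// A locker is a (client name, cookie) pair, as in the commit description.
struct LockerId {
  std::string client;
  std::string cookie;
  bool operator<(const LockerId& o) const {
    return client != o.client ? client < o.client : cookie < o.cookie;
  }
};

class AdvisoryLock {
  LockType type_;
  std::map<LockerId, double> lockers_;  // value: expiration time, 0 = never

  // Drop lockers whose expiration time has passed.
  void expire(double now) {
    std::map<LockerId, double>::iterator it = lockers_.begin();
    while (it != lockers_.end()) {
      if (it->second != 0 && it->second <= now)
        lockers_.erase(it++);
      else
        ++it;
    }
  }

public:
  AdvisoryLock() : type_(LOCK_SHARED) {}

  // duration == 0 means the locker never expires.
  bool lock(const LockerId& who, LockType type, double now, double duration) {
    assert(duration >= 0);
    expire(now);
    // Sharing is only possible when both sides ask for a shared lock.
    if (!lockers_.empty() &&
        (type_ == LOCK_EXCLUSIVE || type == LOCK_EXCLUSIVE))
      return false;
    type_ = type;
    lockers_[who] = duration > 0 ? now + duration : 0;  // store expiration
    return true;
  }

  bool unlock(const LockerId& who) { return lockers_.erase(who) > 0; }

  std::size_t num_lockers(double now) {
    expire(now);
    return lockers_.size();
  }
};
```

The model takes a duration in the request but stores an absolute expiration time, so expiry checks reduce to a comparison against the current time.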
Yehuda Sadeh
9c5c3edfcc objclass: add api calls to get/set xattrs
added the following functions:
  cls_cxx_getxattr
  cls_cxx_getxattrs
  cls_cxx_setxattr

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-20 12:55:55 -07:00
Samuel Just
adc9b91f37 os/HashIndex: use set<pair<string, hobject_t>> rather than multimap
Multimap does not make any guarantees about ordering of different
values with the same key.  list_by_hash, however, assumes that
the iterator order matches hobject_t order.  Thus, we use
set<pair<string, hobject_t> > to get the proper ordering.

Backport: stable

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-20 12:29:03 -07:00
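
The ordering property that adc9b91f37 relies on is that a set of pairs sorts by the first element and then by the second. In the sketch below `ObjectId` is a hypothetical integer stand-in for hobject_t, and `list_in_order` is an illustrative helper, not list_by_hash.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

typedef int ObjectId;  // hypothetical stand-in for hobject_t
typedef std::set<std::pair<std::string, ObjectId> > HashSet;

// Iteration order of a set of pairs is (hash key, object) lexicographic,
// so objects sharing a key come out sorted by object, whatever the
// insertion order was.
std::vector<ObjectId> list_in_order(const HashSet& entries) {
  std::vector<ObjectId> out;
  for (HashSet::const_iterator p = entries.begin(); p != entries.end(); ++p)
    out.push_back(p->second);
  return out;
}
```

Entries that share a hash key are therefore listed in object order, which is the guarantee the commit message says list_by_hash assumes.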
Sage Weil
0b84384fd4 mon: shut up about sessionless MPGStats messages
If the mon gets a reset on the client connection, it clears the session
on the connection.  This is perfectly normal to see.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-19 22:14:11 -07:00
Sage Weil
6580450fbc osd: clean up boot method names
Prefix subsequent steps with _.  Better names.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-19 21:27:40 -07:00
Sage Weil
369fbf6110 osd: defer boot if heartbeatmap indicates we are unhealthy
If the OSD is bogged down or unresponsive, we should not try to join
the cluster.  This was observed on congress (slow/clogged op_tp combined
with osdmap thrashing).

Fixes: #2502
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-07-19 21:27:37 -07:00
Sage Weil
d76df212c8 Merge branch 'next'
Conflicts:
	src/include/ceph_features.h
2012-07-19 20:22:35 -07:00
Sage Weil
dec936923f osd/mon: subscribe (onetime) to pg creations on connect
Ask the monitor for pending pg creations each time we connect.

Normally, this is a freebie check.  If there are pending creations, though,
it ensures that the OSD finds out about them even if the original lame
broadcast didn't reach it.  Specifically:

 - osd is hunting for a monitor, but isn't yet connected
 - new pgs are created
 - send_pg_creates() sends out create messages, but osd does not get it
 - osd finally connects to a mon

Fixes: #2151 (tho the bug description is bad)
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
2012-07-19 17:13:09 -07:00
Sage Weil
7f58b9beee mon: track pg creations by osd
Track the pending pg creations by osd, and use a helper to send out those
messages.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-19 17:13:09 -07:00
Sage Weil
4c6c927b27 Revert "rbd: fix usage for snap commands"
This reverts commit 42de6873f9.

Actually, these are fine!  Dan made them all kinds of fancy.
2012-07-19 16:45:07 -07:00
Sage Weil
42de6873f9 rbd: fix usage for snap commands
Snap commands take '--snap <snapname> <imagename>'.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-19 16:48:18 -07:00
Mike Ryan
58cd27fd29 doc: add missing dependencies to README
Signed-off-by: Mike Ryan <mike.ryan@inktank.com>
2012-07-19 11:29:40 -07:00
Sage Weil
6f381affdc add CRUSH_TUNABLES feature bit
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-18 19:49:58 -07:00
Samuel Just
e3349a2a3d OSD::handle_osd_map: don't lock pgs while advancing maps
We no longer do anything with the pgs here.  PG map
advancing is now handled in OSD::advance_pg asynchronously.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-18 15:37:28 -07:00
Sage Weil
c8ee30160d osd: add osd_debug_drop_pg_create_{probability,duration} options
This will let us exercise more of the pg creation code.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-18 14:26:16 -07:00