Commit Graph

19412 Commits

Author SHA1 Message Date
Sage Weil
c24c9e3a55 Merge remote branch 'gh/wip-filestore-misc'
Conflicts:
	src/test/filestore/run_seed_to.sh
2012-04-28 16:25:31 -07:00
Sage Weil
6bb3e84190 Merge remote branch 'gh/wip-2353'
Reviewed-by: Samuel Just <samuel.just@dreamhost.com>
2012-04-28 15:53:35 -07:00
Sage Weil
254644a4f0 osd: always share past_intervals
Share past intervals when starting up new replicas.  This can happen via
an MOSDPGInfo or an MOSDPGLog message.

Fix up get_or_create_pg() so the past_intervals arg is required (and a ref,
like the other args). Fix doxygen comment.

Now the only time generate_past_intervals() should do any work is when
upgrading old clusters, during pg creation, and (possibly) during pg
split (when that is fully implemented).

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
2012-04-28 15:49:40 -07:00
Sage Weil
5047cac0cd Merge branch 'wip-osdmap'
Conflicts:
	src/mon/PGMonitor.cc
	src/osd/OSDMap.h
2012-04-28 15:25:20 -07:00
Sage Weil
352247e1b9 fix file_layout.sh layouts test
preferred_osd is now gone.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
2012-04-28 14:52:56 -07:00
Sage Weil
c97c20de0e Merge branch 'wip-mon'
Reviewed-by: Gregory Farnum <gregory.farnum@dreamhost.com>
2012-04-28 14:48:51 -07:00
Sage Weil
e205e11c5a mon: 'osd [un]set noin'
Missed this one.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
2012-04-28 14:48:26 -07:00
Sage Weil
c661e66cc2 Merge branch 'next'
2012-04-28 14:47:53 -07:00
Sage Weil
c971545a15 osd: set dirty_info in generate_past_intervals
This ensures that we save our work.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
2012-04-28 07:46:42 -07:00
Sage Weil
944a431177 osd: fill in past intervals during advance_map
If ceph-osd is way behind, we will advance through past maps before we
mark ourselves up.  This avoids the slow recalculation once we are up, and
the ensuing badness.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
2012-04-28 07:46:42 -07:00
Sage Weil
0c65ac6f4e osd: drop useless PG::fulfill_info()
There is a nice symmetry there with fulfill_log(), but it is a short
function with a single caller that mostly just forces us to copy a bunch
of data structures around unnecessarily.  Drop it.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
2012-04-28 07:46:42 -07:00
Sage Weil
7e8ab0f29b osd: share past intervals with notifies
Send past_intervals along with pg_info_t on every notify.  The reasoning
here is as follows:

 - we already have the state in memory
 - if we don't send it, and the primary doesn't have it, it will
   recalculate it by reading/decoding many previous maps from disk
 - for a highly-tortured cluster, I see past_intervals on the order of
   ~6 KB, times 600 pgs means ~2.5 MB sent for every activate_map(). For
   comparison, the same cluster would need to read and decode ~1 GB of
   maps to recalculate the same info.
 - for healthy clusters, the data is small, and costs little.
 - for unhealthy clusters, the data is large, but most useful.

In theory we could set a threshold so that we don't send it if it is
large, but allow the primary to query it explicitly.  I doubt it's worth
the complexity.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
2012-04-28 07:46:42 -07:00
Sage Weil
0c6914039c osd: only generate missing intervals in generate_past_intervals
We can (currently) get into a situation where we don't have the full
history back to last_epoch_clean because non-primaries record past
intervals but don't initially have the full history, resulting in a partial
recent history.

If this happens, only fill in what's missing; no need to rebuild the recent
parts too.

Signed-off-by: Sage Weil <sage@newdream.net>
2012-04-28 07:46:42 -07:00
Sage Weil
db8e20b211 osd: include past_intervals in pg debug printout
Signed-off-by: Sage Weil <sage@newdream.net>
2012-04-28 07:46:42 -07:00
Sage Weil
12d1675ca0 osd: fix check for whether to recalculate past_intervals
We may not recalculate all the way back to last_interval_clean due to
the oldest_map floor.  Figure out what we want and could calculate before
deciding whether what we have is insufficient.

Also, print something if we discard and recalculate so it is clear what is
happening and why.

Signed-off-by: Sage Weil <sage@newdream.net>
2012-04-28 07:46:42 -07:00
Sage Weil
90dae62b9c osd: PG::Interval -> pg_interval_t
Signed-off-by: Sage Weil <sage@newdream.net>
2012-04-28 07:46:41 -07:00
Sage Weil
924a12516c Merge branch 'next' into t
2012-04-28 07:46:23 -07:00
Dan Mick
f922dc4355 Stop rebuild of libcommon.la on "make dist"
Fixes: 2356
Reviewed-by: Josh Durgin <josh.durgin@dreamhost.com>
2012-04-28 07:45:34 -07:00
Sage Weil
e44b126c40 mon: limit size of MOSDMap message sent as reply
We may send an MOSDMap as a reply to various requests, including

 - a failure report
 - a boot message
 - a pg_temp message
 - an up_thru message

In these cases, send a single MOSDMap message, but limit how big it gets.
All recipients here are osds, which are smart enough to request more maps
based on the MOSDMap::newest_map field.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
2012-04-28 07:45:29 -07:00
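A minimal sketch of the bounded-reply idea described in e44b126c40, not the monitor's actual code. It assumes the limit is a count of epochs (the commit only says the message size is limited), and `MapReplySketch`, `build_map_reply`, `needs_more`, and the local `epoch_t` alias are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for an epoch number; an assumption for this sketch.
using epoch_t = std::uint32_t;

// Hypothetical reply descriptor: which epochs are carried, and what the
// sender's newest known epoch is.
struct MapReplySketch {
  epoch_t first;       // first epoch included in this reply
  epoch_t last;        // last epoch included in this reply
  epoch_t newest_map;  // newest epoch the sender knows about
};

// Build a single reply covering [first, newest], capped at max_maps epochs.
MapReplySketch build_map_reply(epoch_t first, epoch_t newest, epoch_t max_maps) {
  assert(max_maps > 0 && first <= newest);
  // Written to avoid overflow in first + max_maps.
  const epoch_t last =
      (newest - first < max_maps) ? newest : first + max_maps - 1;
  return MapReplySketch{first, last, newest};
}

// The recipient can tell from newest_map that it should ask for more.
bool needs_more(const MapReplySketch& r) { return r.last < r.newest_map; }
```

With a cap of 20 epochs, a request starting at epoch 10 when the newest is 100 would carry epochs 10 through 29, and the recipient would see that `newest_map` is ahead of what it received.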
Sage Weil
d1df320b2d ceph-object-corpus: revert rewind
From 92becb696b

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
2012-04-28 07:45:24 -07:00
Sage Weil
4274fd05d4 osdmap: fix addr dedup check
Compare *every* address for a match, or else note that it is (or might be)
different.  Previously, we falsely took diff==0 to mean that all addrs
were definitely equal, which was not necessarily the case.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
2012-04-27 21:52:15 -07:00
Sage Weil
06d1bc22d6 osd: fix bad map debug messages
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
2012-04-27 21:48:31 -07:00
Dan Mick
a477d6be7e Stop rebuild of libcommon.la on "make dist"
Fixes: 2356
Reviewed-by: Josh Durgin <josh.durgin@dreamhost.com>
2012-04-27 18:32:09 -07:00
Yehuda Sadeh
510eed0fcd filestore: fix error message
error message was misleading, fixing it.

Signed-off-by: Yehuda Sadeh <yehuda.sadeh@dreamhost.com>
2012-04-27 16:05:36 -07:00
Yehuda Sadeh
f03dc34f7e filestore: first lock osd mount point, next detect fs type
Fixes #2353. Problem was that there were (at least) two osd processes
that were racing for the fs detection, which triggered some errors
in the btrfs create/remove snapshot.

Signed-off-by: Yehuda Sadeh <yehuda.sadeh@dreamhost.com>
2012-04-27 15:46:49 -07:00
Samuel Just
10c616a50a OSD: use map bl cache pinning during handle_osd_map
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
2012-04-27 14:29:39 -07:00
Samuel Just
d0d6912527 simple_cache.hpp: add pinning
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
2012-04-27 14:28:08 -07:00
Samuel Just
8ce5155137 Merge branch 'next'
2012-04-27 14:00:09 -07:00
Samuel Just
92becb696b FileJournal: simply flush by waiting for completions to empty
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
2012-04-27 13:58:58 -07:00
Samuel Just
155700d67e PG: in GetInfo Notify handler, fix peer_info_requested filter
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
2012-04-27 11:46:34 -07:00
Sage Weil
05b4fb33a1 Merge branch 'wip-lpg'
Conflicts:
	src/osd/OSDMap.h
2012-04-26 21:57:23 -07:00
Sage Weil
cee218f0da Merge branch 'next'
2012-04-26 21:53:36 -07:00
Sage Weil
dbd99129ce librados: test get/set of debug levels
Also do some sanity checks on the subsystem log level settings.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
2012-04-26 21:52:29 -07:00
Sage Weil
4e2e87941b config: allow {get,set}_val on subsystem debug levels
This allows you to get and set subsystem debug levels via the
normal config access methods.  Among other things, this allows librados
users to set debug levels.

Fixes: #2350
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
2012-04-26 21:52:23 -07:00
Samuel Just
7f3790a9ed OSD.cc: track osdmap refs using an LRU
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
2012-04-26 19:41:25 -07:00
Samuel Just
ec1ea6a8fd common/: added templated simple lru implementations
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
2012-04-26 19:41:25 -07:00
Sage Weil
873e9beedf osdmap: dedup pg_temp
We only deal with the case where the entire map is identical, since the
individual items are too small to make the pointer overhead worthwhile.
Too bad.  An in-memory btree-like structure would work better for this.

Signed-off-by: Sage Weil <sage@newdream.net>
2012-04-26 18:49:19 -07:00
Sage Weil
ed1024fb15 osdmap: use shared_ptr<> for pg_temp
This will let us dedup later.

Signed-off-by: Sage Weil <sage@newdream.net>
2012-04-26 18:49:19 -07:00
Sage Weil
207eec65d5 osd: make map dedup optional
On by default.  This trades CPU for memory.  Some might have unlimited RAM
and not care.

Signed-off-by: Sage Weil <sage@newdream.net>
2012-04-26 18:49:19 -07:00
Sage Weil
0188d9bb32 osd: dedup osdmaps when added to the in-memory cache
When we add an OSDMap to our in-memory cache, dedup against an existing map
at a nearby epoch.

Signed-off-by: Sage Weil <sage@newdream.net>
2012-04-26 18:49:19 -07:00
Sage Weil
4cfbd81c13 osdmap: drop obsolete PG_ROLE_* constants
These are cruft from the old primary/chain/splay replication code.  All
current code says <0 is stray, 0 is primary, and >0 is replica.  That is,
the role is the acting vector position, or -1 if not in the vector.

Signed-off-by: Sage Weil <sage@newdream.net>
2012-04-26 18:49:19 -07:00
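The role convention stated in 4cfbd81c13 (0 is primary, >0 is replica, <0 is stray) can be illustrated with a small sketch. `role_in_acting` is a hypothetical helper written for this example, not a function taken from the source tree.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: the role of `osd` is its position in the acting
// vector, or -1 if it does not appear there.
int role_in_acting(const std::vector<int>& acting, int osd) {
  for (std::size_t i = 0; i < acting.size(); ++i) {
    if (acting[i] == osd)
      return static_cast<int>(i);  // 0 = primary, >0 = replica
  }
  return -1;  // stray: not in the acting vector
}
```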
Sage Weil
2a46564158 buffer: make contents_equal() more efficient
Iterate both lists in parallel in terms of buffers, and use memcmp() to
do the comparison.

Signed-off-by: Sage Weil <sage@newdream.net>
2012-04-26 18:49:19 -07:00
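A rough sketch of the approach described in 2a46564158: walk both lists segment by segment and compare the largest contiguous run with memcmp(). It assumes a buffer list can be modeled as a vector of contiguous segments; `SegmentList` and `contents_equal_sketch` are hypothetical stand-ins, not the real buffer API.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for a buffer list: a sequence of contiguous segments.
using SegmentList = std::vector<std::string>;

// Compare two segmented byte sequences without flattening either one.
bool contents_equal_sketch(const SegmentList& a, const SegmentList& b) {
  std::size_t ai = 0, bi = 0;      // current segment in each list
  std::size_t aoff = 0, boff = 0;  // offset within the current segment
  for (;;) {
    // Step past exhausted (or empty) segments.
    while (ai < a.size() && aoff == a[ai].size()) { ++ai; aoff = 0; }
    while (bi < b.size() && boff == b[bi].size()) { ++bi; boff = 0; }
    const bool a_done = (ai == a.size());
    const bool b_done = (bi == b.size());
    if (a_done || b_done)
      return a_done && b_done;  // equal only if both ran out together
    // Compare the longest run that is contiguous in both lists.
    const std::size_t n =
        std::min(a[ai].size() - aoff, b[bi].size() - boff);
    if (std::memcmp(a[ai].data() + aoff, b[bi].data() + boff, n) != 0)
      return false;
    aoff += n;
    boff += n;
  }
}
```

Segment boundaries do not need to line up: two lists holding the same bytes split at different points still compare equal.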
Sage Weil
36d43825c7 osdmap: dedup crush map
If the encoded crush map is identical between two versions, share the
reference.

Signed-off-by: Sage Weil <sage@newdream.net>
2012-04-26 18:49:19 -07:00
Sage Weil
98b1d8f36c osdmap: use shared_ptr for CrushWrapper
Signed-off-by: Sage Weil <sage@newdream.net>
2012-04-26 18:49:19 -07:00
Sage Weil
e0436cb900 osdmaptool: kludge to load a range of maps into memory
Signed-off-by: Sage Weil <sage@newdream.net>
2012-04-26 18:49:19 -07:00
Sage Weil
d6359d4465 osdmap: dedup addrs and addr vectors between maps
Compare two maps.  If an addr matches, share the reference.  If all
addrs match, share the entire vector.

This leads to roughly 70% drop in memory utilization for the set of
thrashed maps I'm working with.

Signed-off-by: Sage Weil <sage@newdream.net>
2012-04-26 18:49:19 -07:00
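The sharing scheme from d6359d4465 might look roughly like the sketch below. It assumes addresses are held through shared pointers inside a shared vector; `Addr`, `AddrVec`, `MapSketch`, `make_addrs`, and `dedup_addrs` are hypothetical names for illustration. Every entry is compared, and the whole vector is shared only if none of them differed.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

using Addr = std::string;  // stand-in for an address type (assumption)
using AddrVec = std::vector<std::shared_ptr<Addr>>;

// Hypothetical, heavily reduced map holding only an address vector.
struct MapSketch {
  std::shared_ptr<AddrVec> addrs;
};

// Helper for building test data.
std::shared_ptr<AddrVec> make_addrs(const std::vector<Addr>& values) {
  auto vec = std::make_shared<AddrVec>();
  for (const Addr& v : values)
    vec->push_back(std::make_shared<Addr>(v));
  return vec;
}

// Make `newer` reuse storage from `older` wherever the contents match.
void dedup_addrs(const MapSketch& older, MapSketch& newer) {
  if (!older.addrs || !newer.addrs)
    return;
  const AddrVec& o = *older.addrs;
  AddrVec& n = *newer.addrs;
  bool all_equal = (o.size() == n.size());
  for (std::size_t i = 0; i < n.size(); ++i) {
    if (i < o.size() && o[i] && n[i] && *o[i] == *n[i]) {
      n[i] = o[i];        // share the single entry
    } else {
      all_equal = false;  // differs, or cannot be shown to be equal
    }
  }
  if (all_equal)
    newer.addrs = older.addrs;  // share the entire vector
}
```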
Josh Durgin
8a3fd4495c Merge branch 'next'
2012-04-26 17:54:56 -07:00
Sage Weil
ee541c0f8d osdmap: filter out nonexistent osds from map
It is possible that the crush map contains device ids that do not exist as
osds.  Filter them out of the CRUSH result.

Drop the max devices assert, as that is trivially violated by adding a new
item to the crush map beyond max_osd (via 'ceph osd crush add ...').

Signed-off-by: Sage Weil <sage@newdream.net>
2012-04-26 17:42:29 -07:00
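The filtering step described in ee541c0f8d can be sketched as follows, assuming the placement result is a list of device ids and that existence is tracked per id. `filter_existing` and the `exists` vector are hypothetical, chosen only for this example.

```cpp
#include <cstddef>
#include <vector>

// Drop device ids from a raw placement result that do not correspond to
// an existing osd. Ids beyond the end of `exists` are treated as
// nonexistent rather than triggering an assertion.
std::vector<int> filter_existing(const std::vector<int>& raw,
                                 const std::vector<bool>& exists) {
  std::vector<int> out;
  out.reserve(raw.size());
  for (int id : raw) {
    const bool known = id >= 0 &&
                       static_cast<std::size_t>(id) < exists.size() &&
                       exists[static_cast<std::size_t>(id)];
    if (known)
      out.push_back(id);
  }
  return out;
}
```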
Josh Durgin
8f4dba62f8 librbd: the length argument of aio_discard should be uint64_t
size_t was accidentally copy-pasted.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
2012-04-26 17:41:27 -07:00
Sage Weil
fe76c5ba77 filestore: interpret any fiemap error as EOPNOTSUPP
On 2.6.32-5-amd64 (debian) and XFS I'm getting EINVAL.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
2012-04-26 17:40:52 -07:00