Commit Graph

12050 Commits

Author SHA1 Message Date
Samuel Just
ac6b018acb Causes the MDSes to switch among a set of stray directories when
switching to a new journal segment.

MDSCache:
	The stray member has been replaced with strays, an array of inodes
	representing the set of available stray directories, as well as
	stray_index indicating the index of the current stray directory.

	get_stray() now returns a pointer to the current stray directory
	inode.

	advance_stray() advances stray_index to the next stray directory.

	migrate_stray no longer takes a source argument; the source mds
	is inferred from the parent of the dir entry.

	stray dir entries are now stray<index> rather than stray.

	scan_stray_dir now scans all stray directories.

MDSLog:
	start_new_segment now calls advance_stray() on MDSCache to force a new
	stray directory.

mdstypes:
	NUM_STRAY indicates the number of stray directories to use per MDS

	MDS_INO_STRAY now takes an index argument as well as the mds number

	MDS_INO_STRAY_OWNER(i) returns the mds owner of the stray directory i

	MDS_INO_STRAY_INDEX(i) returns the index of the stray directory i

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
2010-11-22 13:25:14 -08:00
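
The numbering scheme described in ac6b018acb can be illustrated with a small sketch. Only the ideas (a fixed number of stray directories per MDS, an inode number derived from the MDS number and an index, and helpers that recover the owner and the index) come from the commit message. The constant values, the arithmetic, the wrap-around behaviour, and the names kStrayOffset, kNumStray, stray_ino, stray_owner, stray_index and next_stray_index are assumptions made for illustration; they are not copied from mdstypes.h.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative values only; the real constants are defined by Ceph.
constexpr std::uint64_t kStrayOffset = 0x600;  // assumed start of the stray inode range
constexpr unsigned kNumStray = 10;             // assumed stray directories per MDS

// Inode number of stray directory `idx` owned by MDS number `mds`.
constexpr std::uint64_t stray_ino(unsigned mds, unsigned idx) {
  return kStrayOffset + static_cast<std::uint64_t>(mds) * kNumStray + idx;
}

// Recover the owning MDS from a stray inode number.
constexpr unsigned stray_owner(std::uint64_t ino) {
  return static_cast<unsigned>((ino - kStrayOffset) / kNumStray);
}

// Recover the stray directory index from a stray inode number.
constexpr unsigned stray_index(std::uint64_t ino) {
  return static_cast<unsigned>((ino - kStrayOffset) % kNumStray);
}

// One possible reading of advance_stray(): step to the next index and
// wrap around at the end (the wrap-around is an assumption).
constexpr unsigned next_stray_index(unsigned current) {
  return (current + 1) % kNumStray;
}
```

With this layout, stray_owner(stray_ino(m, i)) gives back m and stray_index(stray_ino(m, i)) gives back i for any index below kNumStray, which is the property the owner and index helpers need.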
Samuel Just
3f8f59059a Timer must be initialized in Client::init and shutdown in
Client::shutdown.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
2010-11-22 13:16:28 -08:00
Colin Patrick McCabe
8eb4de9e6e generate_past_intervals:generate back to lastclean
PG::generate_past_intervals needs to generate all the intervals back to
history.last_epoch_clean, rather than just to
history.last_epoch_started. This is required by
PG::build_might_have_unfound, which needs to examine these intervals
when building the might_have_unfound set.

Move the check for whether past_intervals is up-to-date into
generate_past_intervals itself. Fix the check.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-11-22 10:47:53 -08:00
Sage Weil
80f2823571 vstart.sh: 'init-ceph stop' instead of 'stop.sh'
This just makes it easier to run multiple vstart sessions as the same user
on the same host.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-22 10:07:40 -08:00
Sage Weil
53d0650a42 Merge branch 'osd_msgr' into unstable 2010-11-22 09:55:37 -08:00
Sage Weil
27c6f217ca mds: remove bogus assert
Causes problems during resolve finish.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-22 09:55:01 -08:00
Sage Weil
9e15ade88d mds: do not eval subtree root when replay|resolve
This is nonsensical.  And can lead to scatter_writebehind, which breaks
horribly.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-22 09:55:01 -08:00
Sage Weil
c0c81d53b4 mds: trim exported subtree _after_ adjusting auth
We need to set the subtree bounds before trimming it away, or else we may
throw out things we're still auth for.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-22 09:55:01 -08:00
Sage Weil
cd53719f3c mds: resolve cleanup
Only track ambiguous imports and such if we get a resolve message while in
the resolve state.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-22 09:55:01 -08:00
Sage Weil
924b1fcbf7 osd: bind to new cluster address when wrongly marked down
If we come back up on the same address, there is a possible race.  Other
nodes will mark_down when they see us go down.  If we go up first, queue
some messages, and _then_ they see that we're down and mark_down, the
messages we queued will get lost.  Since it's stateful on the cluster
backend, we need to introduce an ordering so that closing out the _old_
session doesn't break the new session.  We do this by binding to a new
address (just a different port, actually) before marking ourselves back
up.

Fixes #592.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-22 09:49:43 -08:00
Sage Weil
1940976339 msgr: implement rebind() to pick a new port
Closes out all old connections and binds to a _different_ port.  This
ensures that someone doing mark_down on our old address won't get us.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-22 09:45:29 -08:00
Greg Farnum
f7170f95f0 client: only encode_cap_releases once per request.
Accomplish this by making a list of cap releases in the (permanent)
MetaRequest, and then copying that into the (potentially-temporary)
MClientRequest.
2010-11-22 09:09:01 -08:00
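
The approach in f7170f95f0 (build the release list once on the long-lived request, then copy it into each outgoing message) might look roughly like the sketch below. The types Request and Message stand in for MetaRequest and MClientRequest; their fields, the `encoded` flag, and the function build_message are hypothetical and only illustrate the pattern.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct CapRelease {
  std::uint64_t ino;  // inode the caps belong to
  unsigned caps;      // capability bits being released
};

// Stand-in for the (potentially temporary) wire message.
struct Message {
  std::vector<CapRelease> releases;
};

// Stand-in for the permanent per-request state.
struct Request {
  bool encoded = false;              // assumed guard flag
  int encode_calls = 0;              // counts how often the list was built
  std::vector<CapRelease> releases;  // built once, reused on every resend
};

// Build a message for `req`. The release list is computed from `held`
// only on the first call; later calls copy the stored list.
Message build_message(Request& req, const std::vector<CapRelease>& held) {
  if (!req.encoded) {
    req.releases = held;
    req.encoded = true;
    ++req.encode_calls;
  }
  Message m;
  m.releases = req.releases;
  return m;
}
```

A resend of the same request then carries the same releases as the first send, even if the client's cap state has changed in between.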
Sage Weil
51abcaa2c0 mon: clean up cluster_addr code a bit, better debug output
Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-21 20:52:41 -08:00
Sage Weil
28498a00cf osd: send correct ip addrs to monitor for cluster_, hb_addr
Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-21 20:52:40 -08:00
Sage Weil
2031364451 osdmap: fix cluster_addr encoding; printing
The cluster addrs were getting lost because we were checking v instead of
ev.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-21 20:52:40 -08:00
Sage Weil
ec434eda6a osd: unconditionally set up separate msgr instance for osd<->osd msgs
Always set up cluster_messenger (previously we would only do so if there was
an explicit address configured for it).  The overhead to do so is minimal,
it simplifies the code, and will allow us to fix down->up transitions
(later).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-21 19:59:43 -08:00
Sage Weil
0dddf4537e filestore: only warn about disk write cache on kernels <2.6.33
Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-21 16:16:43 -08:00
Sage Weil
0856f57e25 osd: fix search_for_missing: old last_update implies object not present
For example, if an osd sends an empty PG::Info (last_update = 0'0) and
empty missing, we should not conclude that the object is there.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-21 16:15:25 -08:00
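
The rule stated in 0856f57e25 can be sketched as a small predicate: a peer whose last_update is older than the version we need cannot hold the object, even when its missing set is empty. Modelling a version as an (epoch, version) pair, and the names Eversion, PeerState and peer_may_have, are assumptions made for this illustration and do not reflect the actual PG code.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <utility>

// (epoch, version), compared lexicographically; {0, 0} plays the role of 0'0.
using Eversion = std::pair<std::uint32_t, std::uint64_t>;

// What a peer reported about itself (hypothetical stand-in for PG::Info
// plus the peer's missing set).
struct PeerState {
  Eversion last_update;
  std::set<std::string> missing;
};

// True only if the peer could hold `oid` at version `need`.
bool peer_may_have(const PeerState& peer, const std::string& oid,
                   const Eversion& need) {
  // A log that ends before `need` cannot include the object at that
  // version, so an empty missing set proves nothing here.
  if (peer.last_update < need) {
    return false;
  }
  // Otherwise the peer has it unless it lists the object as missing.
  return peer.missing.count(oid) == 0;
}
```

A peer that reports last_update {0, 0} and an empty missing set is therefore rejected for any real version, which is the case the commit message calls out.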
Sage Weil
6ef5c2f3ad init-ceph: fix cleanlogs for no log_sym_dir case
Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-21 16:09:13 -08:00
Colin Patrick McCabe
fc9b09760b OSDMap: const cleanup
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-11-19 19:15:11 -08:00
Colin Patrick McCabe
2a5c38939b mds-dumper: Define Dumper::~Dumper()
To fix compile error.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-11-19 19:14:29 -08:00
Colin Patrick McCabe
8566c5cd71 ReplicatedPG::pull: fix test for unfound
The test for unfound objects was reversed, leading us to try to pull
unfound objects and refrain from pulling objects that we knew how to
get. Should fix bug #585.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-11-19 14:21:00 -08:00
Sage Weil
2f5502fabd osdmap: fix printing, again
Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-19 13:41:58 -08:00
Sage Weil
4303820b43 Merge remote branch 'origin/mds' into unstable 2010-11-19 10:17:58 -08:00
Colin Patrick McCabe
b91e14e122 multi-dump.sh: add diff mode
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-11-18 21:13:02 -08:00
Colin Patrick McCabe
9cab522e71 Add multi-dump.sh
This is a debug tool that can dump out Ceph information at various
epochs. For instance, it can show how the OSDmap changed over time.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-11-18 20:57:15 -08:00
Colin Patrick McCabe
6e2b594b33 ReplicatedPG::get_object_context: fix broken calls
ReplicatedPG::get_object_context takes three parameters.  The last two
are "const object_locator_t& oloc" and "bool can_create".
Unfortunately, booleans can degrade to ints, and ints can be used to
initialize objects of type object_locator_t.

So when you make a call like:
> ctx->snapset_obc = get_object_context(snapoid, true);

What happens is that you actually call:
> get_object_context(snapoid, object_locator_t(1), false);

So you pass an invalid and *not* blank object_locator_t, and pass false
for can_create. This is not what the caller wanted. This change gets rid
of the default parameters and fixes the callers.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-11-18 15:05:29 -08:00
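
The pitfall described in 6e2b594b33 is easy to reproduce in isolation. In the sketch below, Locator is a hypothetical stand-in for object_locator_t with a non-explicit constructor from int, and lookup_with_defaults / lookup_explicit are illustrative functions that mirror the shape of the signature before and after the change; none of these names come from the Ceph source.

```cpp
#include <cassert>
#include <string>

// Stand-in for object_locator_t. The non-explicit int constructor is what
// allows a bool argument to be converted silently (bool -> int -> Locator).
struct Locator {
  int pool;
  Locator(int p = -1) : pool(p) {}
};

// Records what the callee actually received.
struct CallRecord {
  int pool;
  bool can_create;
};

// Old shape: both trailing parameters have defaults.
CallRecord lookup_with_defaults(const std::string& oid,
                                const Locator& oloc = Locator(),
                                bool can_create = false) {
  (void)oid;
  return CallRecord{oloc.pool, can_create};
}

// New shape: no defaults, so a two-argument call no longer compiles.
CallRecord lookup_explicit(const std::string& oid,
                           const Locator& oloc,
                           bool can_create) {
  (void)oid;
  return CallRecord{oloc.pool, can_create};
}
```

Calling lookup_with_defaults("snapoid", true) binds `true` to the locator parameter, giving pool 1 and can_create false, which matches the behaviour the commit message describes. Removing the defaults turns that call into a compile error; declaring the converting constructor `explicit` would be a further, separate safeguard in general C++.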
Colin Patrick McCabe
43e0b2670b ReplicatedPG: call finish_recovery when needed
Don't loop in ReplicatedPG::start_recovery_ops. There is already a loop
in both recover_replicas and recover_primary that will try to do as many
recovery ops as it can, so there's no need to repeat it. Also, the former
loop provably would never execute more than once because of the way
the code was structured.

If there are no more recovery operations to do, and PG::is_all_uptodate
is true at the end of ReplicatedPG::start_recovery_ops, call
finish_recovery.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-11-18 12:50:23 -08:00
Colin Patrick McCabe
ea5d1d6693 osd_resurrection_1_impl: turn on recovery at end
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
2010-11-18 10:09:14 -08:00
Jim Schutt
4adfdee7f1 Makefile: fix builddir weirdness
Signed-off-by: Jim Schutt <jaschut@sandia.gov>
2010-11-17 16:52:19 -08:00
Sage Weil
7e9812b4a9 osd: rev PG::Info encoding for last_epoch_clean change
This was missed by 184fbf582b, so any fs
created between now and then won't decode properly.  It's more important
to make an fs prior to that work, though, so that the upgrade path from
the last stable version works.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-17 14:37:38 -08:00
Sage Weil
c17e7da4d1 Merge branch 'mds_frags' into unstable 2010-11-17 13:06:14 -08:00
Sage Weil
f6823a79a6 mds: adjust dir_auth_pins on steal_dentry
dir_auth_pins is a counter of dentry auth_pins in the current dir; those
need to be added in when stealing.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-17 13:06:03 -08:00
Sage Weil
b705be1187 mds: wrlock scatterlocks to prevent a gather racing with split/merge logging
We have the dirs split in our cache for some time while journaling it to
disk, before the fragment_notify goes out.  Make sure we don't do a
scatterlock gather during that time that will confuse the inode auth (who
has their dirfrags fragmented differently).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-17 13:06:03 -08:00
Sage Weil
66d43ac867 mds: fix subtree map update on dirfrag merge
Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-17 13:06:03 -08:00
Sage Weil
7f6a256146 mds: clear PIN_SUBTREE on split/merge in purge_strays
This makes the helper work for merge as well as split.  Remove the special
fixups in the caller that were making split work before.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-17 13:06:03 -08:00
Sage Weil
669b55440f mds: don't complete freeze while parent inode is frozen
This makes maybe_finish_freeze() conditions match those of is_freezeable()
and avoids an assert.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-17 13:06:02 -08:00
Sage Weil
3777ff8a9a mds: move dirty rstat inodes to new dir on refragment
Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-17 13:06:02 -08:00
Sage Weil
d538817f62 mds: flush log on fragment
This makes request lock auth_pins expire, so the fragment moves along.
Otherwise we can end up waiting for the log flush timer to go off.

This isn't a complete solution; in-progress requests won't know to flush.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-17 13:06:02 -08:00
Sage Weil
cd5ee00602 mds: initialize PIN_SUBTREE on split
Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-17 13:06:02 -08:00
Sage Weil
b58b8d098e mds: fix discover requests, tracking wrt fragments
Track discover requests by tid.  The old system of tracking outstanding
discovers was kludgey and somewhat broken.  Also there is a possibility
of getting dup replies if someone does kick_requests().

There is still room for improvement with the logic determining when a
discover is sent: we may want to discover multiple dirfrags in parallel,
but the current code will only do one at a time.

Signed-off-by: Sage Weil <sage@newdream.net>

comment
2010-11-17 13:04:17 -08:00
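
Tracking discovers by tid, as b58b8d098e describes, makes duplicate replies easy to recognise: a reply whose tid is no longer outstanding is simply dropped. The sketch below shows one way to do that bookkeeping. The class DiscoverTracker, the DiscoverInfo fields, and the method names are hypothetical and are not taken from MDCache.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// What we remember about one outstanding discover (illustrative fields).
struct DiscoverInfo {
  int target_mds;
  std::string path;
};

class DiscoverTracker {
 public:
  // Register a new discover and return the tid assigned to it.
  std::uint64_t send(int target_mds, const std::string& path) {
    const std::uint64_t tid = ++last_tid_;
    pending_[tid] = DiscoverInfo{target_mds, path};
    return tid;
  }

  // Returns true if the reply matched an outstanding discover.
  // A duplicate or stale reply finds nothing to erase and returns false.
  bool handle_reply(std::uint64_t tid) {
    return pending_.erase(tid) == 1;
  }

  std::size_t outstanding() const { return pending_.size(); }

 private:
  std::uint64_t last_tid_ = 0;
  std::map<std::uint64_t, DiscoverInfo> pending_;
};
```

If a request is re-sent and two replies arrive for the same tid, the first one completes the discover and the second is ignored.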
Sage Weil
a63c06c89f mds: fix EFragment replay
If the inode already exists in our cache, adjust our (existing) fragments.
But it might not.  In that case, we just replay the metablob.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-17 13:02:39 -08:00
Sage Weil
a961049b71 mds: don't fragment mdsdir or .ceph
Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-17 13:02:38 -08:00
Jim Schutt
b54880e0f9 Detect broken system linux/fiemap.h
RedHat 5.5 has a /usr/include/linux/fiemap.h, but it is
broken because it does not itself include linux/types.h.
As a result, __u64 and friends are not defined.

We have a Ceph-local copy of fiemap.h, so use it
if the system version is broken.

While we're at it, fix up the configure message to
note we're using a local copy.

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-17 12:48:00 -08:00
Sage Weil
29a9e66841 osdmap: don't include blacklist info in summary
It's confusing users and isn't that important.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-17 10:24:21 -08:00
Greg Farnum
c43455cee4 client: Remove the I_COMPLETE flag from the parent directory in relink_inode.
This papers over issues arising from the client's lack of proper support
for hard links, and lets it pass the snaptest-upchildrealms test.
2010-11-17 09:58:38 -08:00
Samuel Just
d57181d3d5 config: added max_mds
MDSMonitor: create_new_fs adapted to use the max_mds parameter

max_mds is now a configurable value and create_new_fs will initialize
max_mds to the specified value.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
2010-11-16 16:09:47 -08:00
Sage Weil
d1dcc03566 mds: allow frag merge on subtree root
Fix purge_stolen and adjust_dir_fragments.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-16 12:09:00 -08:00
Sage Weil
c49312659a mds: make dirfrag thrashing join and split
Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-16 12:09:00 -08:00
Sage Weil
8f24919d39 mds: add timestamp to LogEvents
This just gives us a bit of useful info when debugging problems.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-16 12:08:12 -08:00