Commit Graph

20415 Commits

Author SHA1 Message Date
Samuel Just
90ddc5ae51 OSD: use service.get_osdmap() in heartbeat(), don't grab map_lock
service.get_osdmap() gives us sufficiently consistent
access to the map state.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-17 16:58:21 -07:00
Samuel Just
58e81c82e0 OSD: handle_osd_ping: use service->get_osdmap()
This way, we avoid grabbing the map_lock.  Furthermore,
get curmap at the beginning of the method to ensure that
we send the message using the same map used to check
is_up.

This should also fix #2798, which was caused by
an osd being marked up between service.get_osdmap()
and OSD::osdmap.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-17 16:58:21 -07:00
Samuel Just
32892c1edd doc/dev/osd_internals: add newlines before numbered lists
Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-17 16:51:57 -07:00
Sage Weil
fe4c658bd3 librados: simplify locking slightly
No reason to hold mylock_all here.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-17 16:02:18 -07:00
Sage Weil
199397dc96 osd: default 'osd_preserve_trimmed_log = false'
This option makes the osd skip zeroing old trimmed regions of the log.  The
data is never read, since the xattrs indicate which part of the log is
valid.  We've never actually used this to debug a problem, and it consumes
space, so let's disable it.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-17 12:40:33 -07:00
Samuel Just
24df8b1d82 doc/dev: add osd_internals to toc
Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-17 09:54:47 -07:00
Samuel Just
5a27f07160 doc/internals/osd_internals: fix indentation errors
Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-17 09:31:22 -07:00
Sage Weil
6490c84ff9 doc: discuss choice of pg_num
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-17 08:36:54 -07:00
Sage Weil
36d0a3555f log: simplify log logic a bit
Whether an entry is eligible to log/dump is independent of the channel it
is sent to.  Some channels impose additional restrictions.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-17 08:36:54 -07:00
Josh Durgin
abe05a3fbb Merge branch 'next' 2012-07-16 17:36:06 -07:00
Pascal de Bruijn | Unilogic Networks B.V
96587f39e3 Robustify ceph-rbdnamer and adapt udev rules
Below is a patch which makes the ceph-rbdnamer script more robust and
fixes a problem with the rbd udev rules.

On our setup we encountered a symlink which was linked to the wrong rbd:

  /dev/rbd/mypool/myrbd -> /dev/rbd1

While that link should have gone to /dev/rbd3 (on which a
partition /dev/rbd3p1 was present).

The old udev rule passes %n to the ceph-rbdnamer script. The problem
with %n is that it yields a value of 3 for rbd3 but a value of 1 for
rbd3p1, so it can't be depended upon for rbd naming.

In the patch below the ceph-rbdnamer script is made more robust, and it
can now be called in various ways:

  /usr/bin/ceph-rbdnamer /dev/rbd3
  /usr/bin/ceph-rbdnamer /dev/rbd3p1
  /usr/bin/ceph-rbdnamer rbd3
  /usr/bin/ceph-rbdnamer rbd3p1
  /usr/bin/ceph-rbdnamer 3

Even with all these different styles of calling the modified script, it
should now return the same rbdname. This change "has" to be combined
with calling it from udev with %k though.
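
As an illustration of the normalization described above, here is a minimal
Python sketch. The real ceph-rbdnamer is a shell script; the function name
rbd_device_number and the regular expression are hypothetical and only show
how all five calling styles can resolve to the same device number.

```python
import re

# Hypothetical pattern: optional "/dev/" prefix, optional "rbd" prefix,
# the device number, and an optional partition suffix such as "p1".
_RBD_NAME = re.compile(r"^(?:/dev/)?(?:rbd)?(\d+)(?:p\d+)?$")


def rbd_device_number(arg):
    """Return the rbd device number for any of the accepted spellings."""
    match = _RBD_NAME.match(arg)
    if match is None:
        raise ValueError("not an rbd device name: %r" % (arg,))
    return int(match.group(1))
```

With this sketch, "/dev/rbd3", "/dev/rbd3p1", "rbd3", "rbd3p1" and "3" all
map to device number 3, which is the property the patch relies on when udev
passes %k.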

With that fixed, we hit the second problem. We ended up with:

  /dev/rbd/mypool/myrbd -> /dev/rbd3p1

So the rbdname was symlinked to the partition on the rbd instead of the
rbd itself. So what probably went wrong is udev discovering the disk and
running ceph-rbdnamer which resolved it to myrbd so the following
symlink was created:

  /dev/rbd/mypool/myrbd -> /dev/rbd3

However partitions would be discovered next and ceph-rbdnamer would be
run with rbd3p1 (%k) as parameter, resulting in the name myrbd too, with
the previous correct symlink being overwritten with a faulty one:

  /dev/rbd/mypool/myrbd -> /dev/rbd3p1

The solution to the problem is in differentiating between disks and
partitions in udev and handling them slightly differently. So with the
patch below partitions now get their own symlinks in the following style
(which is fairly consistent with other udev rules):

  /dev/rbd/mypool/myrbd-part1 -> /dev/rbd3p1
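
The disk/partition distinction can be sketched as follows. This is Python
for illustration only, not the udev rule itself; the function name
rbd_symlink_path is hypothetical, and the pool and image names are assumed
to have been resolved already.

```python
import re


def rbd_symlink_path(pool, image, kernel_name):
    """Build the symlink path for a kernel name such as rbd3 or rbd3p1.

    Whole disks get /dev/rbd/<pool>/<image>; partitions get a
    "-part<N>" suffix so they cannot overwrite the disk's symlink.
    """
    match = re.fullmatch(r"rbd(\d+)(?:p(\d+))?", kernel_name)
    if match is None:
        raise ValueError("unexpected kernel name: %r" % (kernel_name,))
    base = "/dev/rbd/%s/%s" % (pool, image)
    partition = match.group(2)
    if partition is None:
        return base
    return "%s-part%s" % (base, partition)
```

Because the two cases produce different paths, discovery order no longer
matters: the partition event cannot replace the link created for the disk.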

Please let me know any feedback you have on this patch or the approach
used.

Regards,
Pascal de Bruijn
Unilogic B.V.

Signed-off-by: Pascal de Bruijn <pascal@unilogicnetworks.net>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-16 17:34:22 -07:00
caleb miles
b0465496d2 doc/radosgw/config.rst: mended small typo
Signed-off-by: caleb miles <caleb.miles@inktank.com>
2012-07-16 16:30:36 -07:00
Sage Weil
f9c1a6fb0a Merge branch 'next' 2012-07-16 16:13:55 -07:00
Sage Weil
2a8c4db72f Merge branch 'wip-mon-mkfs'
Reviewed-by: Tommi Virtanen <tv@inktank.com>
2012-07-16 16:15:33 -07:00
Sage Weil
4eec4fc57d mkcephfs: nicer empty directory check
From TV.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-16 16:14:39 -07:00
Sage Weil
4e66a3b98d mkcephfs: error out if mon data directory is not empty
The ceph-mon --mkfs function no longer wipes out the directory; it is in
fact mostly a no-op that just verifies the dir exists.

So, ensure that the directory is empty at mkfs time.  This could
alternatively do an 'rm -r' in that directory (that is in fact what
ceph-mon used to do), but this is safer.
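
A minimal sketch of the "refuse if not empty" check, written in Python for
illustration (mkcephfs itself is a shell script). The function name
check_mon_data_dir_empty is hypothetical.

```python
import os


def check_mon_data_dir_empty(path):
    """Raise if the mon data directory contains anything.

    Nothing is deleted here; the caller has to clean up by hand,
    which is the safer choice described above. A missing directory
    makes os.listdir raise an OSError, which is left to propagate.
    """
    entries = os.listdir(path)
    if entries:
        raise RuntimeError(
            "%s is not empty (%d entries); refusing to mkfs"
            % (path, len(entries))
        )
```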

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-16 16:14:39 -07:00
Sage Weil
6b1835a92c vstart.sh: blow away mon directory on creation/start
Now that ceph-mon doesn't blow away the mon data content, we need to.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-16 16:14:39 -07:00
Sage Weil
54be9d0917 mon: stop doing rm -rf on mon mkfs
Simply verify that the directory exists, or if it doesn't, create it.
Do nothing about its content.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-16 16:14:39 -07:00
Sage Weil
52f96b9fd1 log: apply log_level to stderr/syslog logic
In non-crash situations, we want to make sure the message is both below the
syslog/stderr threshold and also below the normal log threshold.  Otherwise
we get anything we gather on those channels, even when the log level is
low.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-16 16:02:14 -07:00
Sage Weil
de524abdb1 log: dump logging levels in crash dump
So you know what you are/are not seeing.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-16 15:53:59 -07:00
Sage Weil
d3c76f754f Merge branch 'next' 2012-07-16 15:53:54 -07:00
Samuel Just
3821f6c4bf PG: grab reference to pg in C_OSD_AppliedRecoveredObject
Otherwise, accessing the pg via _applied_recovered_object
isn't safe.  Using intrusive_ptr clarifies the reference
ownership.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-16 15:43:52 -07:00
Sage Weil
64f745008b log: fix event gather condition
We should gather an event if it is below the log or gather threshold.

Previously we were only gathering if we were going to print it, which makes
the dump no more useful than what was already logged.
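
The difference between the old and new condition can be sketched like this.
It is Python for illustration and assumes that a lower number means a more
important message and that "below" includes equality; the function names
are hypothetical.

```python
def would_log(entry_level, log_level):
    """An entry is printed when it is at or below the log threshold."""
    return entry_level <= log_level


def should_gather_old(entry_level, log_level, gather_level):
    """Previous behaviour as described: gather only what gets printed."""
    return would_log(entry_level, log_level)


def should_gather(entry_level, log_level, gather_level):
    """Fixed behaviour: gather if below the log or the gather threshold."""
    return entry_level <= log_level or entry_level <= gather_level
```

With a log level of 1 and a gather level of 20, a level-10 entry is not
printed but is now kept in memory, so a crash dump contains more than what
was already logged.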

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-16 15:36:44 -07:00
Samuel Just
d4410e4ad5 PG::RecoveryState::Stray::react(LogEvt&): set dirty_info/log
We adjust the info and the log, so we must set dirty_info and
dirty_log to force writes.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-16 14:18:22 -07:00
Samuel Just
4afa892591 PG: use stats from primary after rewinding divergent entries
If the osd receiving the info has divergent entries, it will
also have a "divergent" stat structure.

Probably fixes #2769.

In cases like #2769, this bug can result in a primary with a stat
structure which double counts an operation: once for the
divergent operation, and once for the replay.

This is another way for the bug addressed in
5924f8e4a8 to happen.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-16 14:18:22 -07:00
Samuel Just
5f60236610 Merge remote-tracking branch 'upstream/next' 2012-07-16 14:18:04 -07:00
Samuel Just
c7fb964c07 PG::RecoveryState::Stray::react(LogEvt&): reset last_pg_scrub
We need to reset the last_pg_scrub data in the osd since we
are replacing the info.

Probably fixes #2453

In cases like #2453, we hit the following backtrace:

     0> 2012-05-19 17:24:09.113684 7fe66be3d700 -1 osd/OSD.h: In function 'void OSD::unreg_last_pg_scrub(pg_t, utime_t)' thread 7fe66be3d700 time 2012-05-19 17:24:09.095719
osd/OSD.h: 840: FAILED assert(last_scrub_pg.count(p))

 ceph version 0.46-313-g4277d4d (commit:4277d4d3378dde4264e2b8d211371569219c6e4b)
 1: (OSD::unreg_last_pg_scrub(pg_t, utime_t)+0x149) [0x641f49]
 2: (PG::proc_primary_info(ObjectStore::Transaction&, pg_info_t const&)+0x5e) [0x63383e]
 3: (PG::RecoveryState::ReplicaActive::react(PG::RecoveryState::MInfoRec const&)+0x4a) [0x633eda]
 4: (boost::statechart::detail::reaction_result boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list3<boost::statechart::custom_reaction<PG::RecoveryState::MQuery>, boost::statechart::custom_reaction<PG::RecoveryState::MInfoRec>, boost::statechart::custom_reaction<PG::RecoveryState::MLogRec> >, boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> >(boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*)+0x130) [0x6466a0]
 5: (boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x81) [0x646791]
 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x63dfcb]
 7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x11) [0x63e0f1]
 8: (PG::RecoveryState::handle_info(int, pg_info_t&, PG::RecoveryCtx*)+0x177) [0x616987]
 9: (OSD::handle_pg_info(std::tr1::shared_ptr<OpRequest>)+0x665) [0x5d3d15]
 10: (OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0x2a0) [0x5d7370]
 11: (OSD::_dispatch(Message*)+0x191) [0x5dd4a1]
 12: (OSD::ms_dispatch(Message*)+0x153) [0x5ddda3]
 13: (SimpleMessenger::dispatch_entry()+0x863) [0x77fbc3]
 14: (SimpleMessenger::DispatchThread::entry()+0xd) [0x746c5d]
 15: (()+0x7efc) [0x7fe679b1fefc]
 16: (clone()+0x6d) [0x7fe67815089d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Because we don't clear the scrub state before resetting info,
the last_scrub_stamp state in the info.history structure
changes without updating the osd state, resulting in the
above assert failure.

Backport: stable

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-16 14:07:49 -07:00
Samuel Just
5d82a77060 doc/dev/osd_internals: OSD overview, pg removal, map/message handling
This is a start on some osd internals documentation for new
developers.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-16 11:11:58 -07:00
Samuel Just
1b8819bbac PG: Place info in biginfo object
The purged_snaps set can grow without bound as snaps are
created and removed.  Because the filestore doesn't
provide unlimited size collection attributes, it's better
to place the full info on the biginfo object, since we
need to write it during write_info anyway.

Added CEPH_OSD_FEATURE_INCOMPAT_BIGINFO to prevent downgrade.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-16 10:59:55 -07:00
Samuel Just
12d70738d1 PG: use write_info to set snap_collections in make_snap_collections
At one point, snap_collections were written to a pg collection
attribute.  Subsequently, they were moved to the biginfo object
since the structure can grow too large for limited size xattrs.
make_snap_collection, however, was not updated.

Using write_info here should prevent this from happening in
the future.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-16 10:59:55 -07:00
Samuel Just
90381dc9a1 OSD: set superblock compat_features on boot and mkfs
Previously, we did not actually persist the osd compatibility
mask.  Without persisting the current compat mask, a previous,
incompatible version of the OSD would not be prevented from
starting on the same store.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-16 10:59:55 -07:00
Samuel Just
470796b545 CompatSet: users pass bit indices rather than masks
CompatSet users number the Feature objects rather than
providing masks.  Thus, we should do

mask |= (1 << f.id) rather than mask |= f.id.

In order to detect old, broken encodings, the lowest
bit will be set in memory but not set in the encoding.
We can reconstruct the correct mask from the names map.

This bug can cause an incompat bit to not be detected
since 1|2 == 1|2|3.
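
The collision can be checked with a few lines of Python. This is only an
illustration of the arithmetic, not the CompatSet code; the function names
are hypothetical, and the special handling of the lowest bit for old
encodings is not modelled.

```python
def mask_from_ids_buggy(feature_ids):
    """OR the ids in directly, as the broken code did."""
    mask = 0
    for feature_id in feature_ids:
        mask |= feature_id
    return mask


def mask_from_ids_fixed(feature_ids):
    """Treat each id as a bit index."""
    mask = 0
    for feature_id in feature_ids:
        mask |= 1 << feature_id
    return mask
```

With the buggy version, the feature sets {1, 2} and {1, 2, 3} both produce
the mask 3, so feature 3 goes undetected. With the fixed version they
produce 6 and 14.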

fixes: #2748

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-16 10:59:55 -07:00
Sage Weil
b7814dbefb osd: base misdirected op role calc on acting set
We want to look at the acting set here, nothing else.  This was causing us
to erroneously queue ops for later (wasting memory) and to erroneously
print out a 'misdirected op' message in the cluster log (confusion and
incorrect [but ignored] -ENXIO reply).

Fixes: #2022
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-16 10:57:33 -07:00
Sage Weil
14d2efc438 mon/MonitorStore: always O_TRUNC when writing states
It is possible for a .new file to already exist, potentially with a
larger size.  This would happen if:

 - we were proposing a different value
 - we crashed (or were stopped) before it got renamed into place
 - after restarting, a different value was proposed and accepted.

This isn't so unlikely for the log state machine, where we're
aggregating random messages.  O_TRUNC ensures we avoid getting the tail
end of some previous junk.

I observed #2593 and found that a logm state value had a larger size on
one mon (after slurping) than the others, pointing to put_bl_sn_map().

While we are at it, O_TRUNC put_int() too; the same type of bug is
possible there, too.
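
The stale-tail effect can be reproduced with a short Python sketch. It is
an illustration of the open(2) flag behaviour only; the function name
write_new_file and the file contents are hypothetical and have nothing to
do with the MonitorStore code itself.

```python
import os


def write_new_file(path, data, truncate):
    """Write data to path, optionally truncating an existing file first."""
    flags = os.O_WRONLY | os.O_CREAT
    if truncate:
        flags |= os.O_TRUNC
    fd = os.open(path, flags, 0o644)
    # Wrapping an existing descriptor does not truncate by itself;
    # only the O_TRUNC flag passed to os.open above does.
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
```

If a leftover 10-byte file is overwritten with 3 bytes and truncate is
False, the file still holds 10 bytes: the new value followed by the tail of
the old one. With truncate set to True it holds exactly the 3 new bytes.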

Fixes: #2593
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-16 10:57:08 -07:00
Sage Weil
e429da34c9 Merge remote-tracking branch 'gh/bugfix-2022'
Reviewed-by: Samuel Just <sam.just@inktank.com>
2012-07-16 10:48:25 -07:00
Sage Weil
47b38dd0ea Merge remote-tracking branch 'gh/bugfix-2779'
Reviewed-by: Greg Farnum <greg@inktank.com>
2012-07-16 09:12:09 -07:00
Sage Weil
f94c764638 mon: remove osds from [near]full sets when their stats are removed from pgmap
Greg points out that we could have a situation like:

 - mon recovers..
 - goes through osdmaps, notes an osd was removed and removes from
   full/nearfull
 - goes through pgmaps, and re-adds it when it encounters some osd_stat_ts.

Fix this by removing the osd from the full/nearfull set when we remove
the osd_stat_t from the pgmap.  Any osd removal is always followed by
an osd_stat_rm[] record when the primary processes the new osdmap and
proposes the appropriate pgmap updates.
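
A rough Python sketch of the idea, assuming the stats live in a dict and
the full/nearfull sets are plain sets. The names apply_osd_stat_removals,
osd_stats, full_osds and nearfull_osds are hypothetical stand-ins, not the
monitor's actual data structures.

```python
def apply_osd_stat_removals(osd_stats, full_osds, nearfull_osds, removed):
    """Drop stats for removed osds and forget their full/nearfull state.

    Doing both in the same step means replaying pgmaps after osdmaps
    cannot leave a removed osd in either set.
    """
    for osd in removed:
        osd_stats.pop(osd, None)
        full_osds.discard(osd)
        nearfull_osds.discard(osd)
```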

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-15 22:03:31 -07:00
Sage Weil
fe57681892 mon/MonitorStore: always O_TRUNC when writing states
It is possible for a .new file to already exist, potentially with a
larger size.  This would happen if:

 - we were proposing a different value
 - we crashed (or were stopped) before it got renamed into place
 - after restarting, a different value was proposed and accepted.

This isn't so unlikely for the log state machine, where we're
aggregating random messages.  O_TRUNC ensures we avoid getting the tail
end of some previous junk.

I observed #2593 and found that a logm state value had a larger size on
one mon (after slurping) than the others, pointing to put_bl_sn_map().

While we are at it, O_TRUNC put_int() too; the same type of bug is
possible there, too.

Fixes: #2593
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-15 21:38:29 -07:00
Sage Weil
bf9a85ade6 filestore: dump open fds when we hit EMFILE
Use a helper to dump /proc/self/fd when we hit EMFILE in the filestore.
Ideally, we should trigger this in other appropriate places, but it is
not immediately clear that there is a sane way to do that.
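
For illustration, a helper of this kind might look like the following
Python sketch. It assumes a Linux-style /proc filesystem; the function name
list_open_fds is hypothetical and the real helper in the filestore is C++.

```python
import os


def list_open_fds(fd_dir="/proc/self/fd"):
    """Return (fd, target) pairs for the descriptors listed in fd_dir.

    Returns an empty list if the directory cannot be read, for example
    on systems without /proc.
    """
    try:
        names = os.listdir(fd_dir)
    except OSError:
        return []
    entries = []
    for name in names:
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor may have been closed since the listing.
            target = "?"
        entries.append((name, target))
    return entries
```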

Fixes: #2330
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-15 16:31:05 -07:00
Sage Weil
a278ea1316 osdmap: drop useless and unused get_pg_role() method
Users probably want get_pg_acting_rank().  If they don't, they probably
have the mapping and can calculate the rank themselves.  Having this here
is asking for bugs like #2022.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-14 17:39:34 -07:00
Sage Weil
38962abd5b osd: base misdirected op role calc on acting set
We want to look at the acting set here, nothing else.  This was causing us
to erroneously queue ops for later (wasting memory) and to erroneously
print out a 'misdirected op' message in the cluster log (confusion and
incorrect [but ignored] -ENXIO reply).

Fixes: #2022
Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-14 17:39:33 -07:00
Sage Weil
6faeedacfb osd: simplify helper usage for misdirected ops
Make the helper exclusively for the PG != NULL cases, and open-code the
one PG == NULL caller.  This is simpler, and lets us include more useful
information in the log message.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-07-14 17:39:33 -07:00
Noah Watkins
ed4f80f960 vstart: use absolute path for keyring
Stores absolute path to the generated keyring so that tests running in
other directories (e.g. src/java/test) can simply reference the
generated ceph.conf.

Signed-off-by: Noah Watkins <jawhawk@cs.ucsc.edu>
2012-07-14 17:39:11 -07:00
Samuel Just
117b28680e OSD: add config options to fake missed pings
In order to test monitor and osd failure detection and false
positive correction, this patch adds the following options:

 1. osd_debug_drop_ping_probability: probability of dropping
    a string of pings from a client upon ping receipt.
 2. osd_debug_drop_ping_duration: number of pings to drop in
    a row.

This should help with replicating some wrongly-marked-down
thrashing cases.
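
How the two options interact can be sketched in Python. This is one
plausible reading of the description above, not the OSD code: the class
name PingDropper is hypothetical, one instance is assumed per peer, and the
ping that triggers a run is assumed to count toward the duration.

```python
import random


class PingDropper:
    """Decide, per received ping, whether to pretend it was lost."""

    def __init__(self, probability, duration, rng=None):
        self.probability = probability  # chance of starting a run of drops
        self.duration = duration        # number of pings dropped per run
        self.remaining = 0              # drops left in the current run
        self.rng = rng if rng is not None else random.Random()

    def should_drop(self):
        if self.remaining > 0:
            self.remaining -= 1
            return True
        if self.duration > 0 and self.rng.random() < self.probability:
            # This ping is the first of the run.
            self.remaining = self.duration - 1
            return True
        return False
```

A probability of 0 never drops anything, so the default behaviour is
unchanged when the options are left unset.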

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-13 16:09:53 -07:00
caleb miles
ce20e02021 crushtool: allow information generated during testing to be dumped
to a set of CSV files for off-line analysis.

Signed-off-by: caleb miles <caleb.miles@inktank.com>
2012-07-13 15:14:15 -07:00
John Wilkins
8a89d40e6b doc: remove last reference to ceph-cookbooks.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2012-07-13 14:16:08 -07:00
John Wilkins
2011956745 doc: cookbooks issue resolved, so changed 'ceph-cookbooks' back to 'ceph.'
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2012-07-13 14:08:41 -07:00
Josh Durgin
5a5597f6c5 qa: download tests from specified branch
These Python tests aren't installed, so they need to be downloaded.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-13 13:35:07 -07:00
Samuel Just
53600798f7 OSD: send_still_alive when we get a reply if we reported failure
When we get a ping reply, remove the peer from the failure_queue
and send a still alive message if the peer is in the failure_pending
map.

Otherwise, the monitor could slowly accumulate sporadic failure reports
leading to an osd being incorrectly marked out.

This bug may have been contributing to the wrongly-marked-down
thrashing observed on some systems.
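
The reply handling described above can be sketched in Python. The names
handle_ping_reply and send_still_alive are stand-ins, failure_queue is
modelled as a set and failure_pending as a dict. Removing the peer from
failure_pending after sending is an assumption of this sketch; the message
above does not state it.

```python
def handle_ping_reply(peer, failure_queue, failure_pending, send_still_alive):
    """Process a ping reply from peer.

    The peer is no longer a failure candidate, and if a failure was
    already reported for it, that report is retracted.
    """
    failure_queue.discard(peer)
    if peer in failure_pending:
        send_still_alive(peer)
        # Assumption: forget the pending report so it is retracted once.
        del failure_pending[peer]
```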

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-13 12:18:46 -07:00
Samuel Just
5924f8e4a8 PG: merge_log always use stats from authoritative replica
If the osd receiving the log has divergent entries, it will
also have a "divergent" stat structure.  In general, it suffices
to simply trust the stat structure shipped with the authoritative
log and info since merge_log is only used to merge an authoritative
log.

Probably fixes #2769.

In cases like #2769, this bug can result in a primary with a stat
structure which double counts an operation: once for the
divergent operation, and once for the replay.  It turned up
in a regression suite run as a scrub stat mismatch.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2012-07-13 10:19:24 -07:00