RepoMirrors/ceph

mirror of https://github.com/ceph/ceph synced 2024-12-16 16:39:21 +00:00

Author	SHA1	Message	Date
Samuel Just	90ddc5ae51	OSD: use service.get_osdmap() in heartbeat(), don't grab map_lock service.get_osdmap() gives us sufficiently consistent access to the map state. Signed-off-by: Samuel Just <sam.just@inktank.com>	2012-07-17 16:58:21 -07:00
Samuel Just	58e81c82e0	OSD: handle_osd_ping: use service->get_osdmap() This way, we avoid grabbing the map_lock. Furthermore, get curmap at the beginning of the method to ensure that we send the message using the same map used to check is_up. This should also fix #2798, which was caused by an osd being marked up between service.get_osdmap() and OSD::osdmap. Signed-off-by: Samuel Just <sam.just@inktank.com>	2012-07-17 16:58:21 -07:00
Samuel Just	32892c1edd	doc/dev/osd_internals: add newlines before numbered lists Signed-off-by: Samuel Just <sam.just@inktank.com>	2012-07-17 16:51:57 -07:00
Sage Weil	fe4c658bd3	librados: simplify locking slightly No reason to hold mylock_all here. Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-17 16:02:18 -07:00
Sage Weil	199397dc96	osd: default 'osd_preserve_trimmed_log = false' This option makes the osd skip zeroing old trimmed regions of the log. The data is never read, since the xattrs indicate which part of the log is valid. We've never actually used this to debug a problem, and it consumes space, so let's disable it. Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-17 12:40:33 -07:00
Samuel Just	24df8b1d82	doc/dev: add osd_internals to toc Signed-off-by: Samuel Just <sam.just@inktank.com>	2012-07-17 09:54:47 -07:00
Samuel Just	5a27f07160	doc/internals/osd_internals: fix indentation errors Signed-off-by: Samuel Just <sam.just@inktank.com>	2012-07-17 09:31:22 -07:00
Sage Weil	6490c84ff9	doc: discuss choice of pg_num Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-17 08:36:54 -07:00
Sage Weil	36d0a3555f	log: simplify log logic a bit Whether an entry is eligible to log/dump is independent of the channel it is sent to. Some channels impose additional restrictions. Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-17 08:36:54 -07:00
Josh Durgin	abe05a3fbb	Merge branch 'next'	2012-07-16 17:36:06 -07:00
Pascal de Bruijn | Unilogic Networks B.V	96587f39e3	Robustify ceph-rbdnamer and adapt udev rules Below is a patch which makes the ceph-rbdnamer script more robust and fixes a problem with the rbd udev rules. On our setup we encountered a symlink which was linked to the wrong rbd: /dev/rbd/mypool/myrbd -> /dev/rbd1 While that link should have gone to /dev/rbd3 (on which a partition /dev/rbd3p1 was present). Now the old udev rule passes %n to the ceph-rbdnamer script, the problem with %n is that %n results in a value of 3 (for rbd3), but in a value of 1 (for rbd3p1), so it seems it can't be depended upon for rbdnaming. In the patch below the ceph-rbdnamer script is made more robust and it can now be called in various ways: /usr/bin/ceph-rbdnamer /dev/rbd3 /usr/bin/ceph-rbdnamer /dev/rbd3p1 /usr/bin/ceph-rbdnamer rbd3 /usr/bin/ceph-rbdnamer rbd3p1 /usr/bin/ceph-rbdnamer 3 Even with all these different styles of calling the modified script, it should now return the same rbdname. This change "has" to be combined with calling it from udev with %k though. With that fixed, we hit the second problem. We ended up with: /dev/rbd/mypool/myrbd -> /dev/rbd3p1 So the rbdname was symlinked to the partition on the rbd instead of the rbd itself. So what probably went wrong is udev discovering the disk and running ceph-rbdnamer which resolved it to myrbd so the following symlink was created: /dev/rbd/mypool/myrbd -> /dev/rbd3 However partitions would be discovered next and ceph-rbdnamer would be run with rbd3p1 (%k) as parameter, resulting in the name myrbd too, with the previous correct symlink being overwritten with a faulty one: /dev/rbd/mypool/myrbd -> /dev/rbd3p1 The solution to the problem is in differentiating between disks and partitions in udev and handling them slightly differently. 
So with the patch below partitions now get their own symlinks in the following style (which is fairly consistent with other udev rules): /dev/rbd/mypool/myrbd-part1 -> /dev/rbd3p1 Please let me know any feedback you have on this patch or the approach used. Regards, Pascal de Bruijn Unilogic B.V. Signed-off-by: Pascal de Bruijn <pascal@unilogicnetworks.net> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>	2012-07-16 17:34:22 -07:00
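The argument styles and symlink naming described in the commit above can be sketched as follows. This is a minimal illustration, not the actual ceph-rbdnamer script or udev rule: the helper names `parse_rbd_arg` and `symlink_path` are hypothetical, and the pool and image names are passed in directly rather than looked up from the system.

```python
import re

# Hypothetical pattern: accepts /dev/rbd3, /dev/rbd3p1, rbd3, rbd3p1 and 3.
_RBD_ARG = re.compile(r"^(?:/dev/)?(?:rbd)?(\d+)(?:p(\d+))?$")


def parse_rbd_arg(arg):
    """Return (device_number, partition_number or None) for any accepted style."""
    match = _RBD_ARG.match(arg)
    if match is None:
        raise ValueError("not an rbd device name: %r" % arg)
    device = int(match.group(1))
    partition = int(match.group(2)) if match.group(2) else None
    return device, partition


def symlink_path(pool, image, partition=None):
    """Build the symlink path; partitions get a '-partN' suffix (assumed layout)."""
    base = "/dev/rbd/%s/%s" % (pool, image)
    if partition is None:
        return base
    return "%s-part%d" % (base, partition)


if __name__ == "__main__":
    for arg in ("/dev/rbd3", "/dev/rbd3p1", "rbd3", "rbd3p1", "3"):
        device, partition = parse_rbd_arg(arg)
        print(arg, "->", device, symlink_path("mypool", "myrbd", partition))
```

Every style resolves to the same device number, while the disk and its partition map to different symlink paths, so the partition no longer overwrites the disk's link.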
caleb miles	b0465496d2	doc/radosgw/config.rst: mended small typo Signed-off-by: caleb miles <caleb.miles@inktank.com>	2012-07-16 16:30:36 -07:00
Sage Weil	f9c1a6fb0a	Merge branch 'next'	2012-07-16 16:13:55 -07:00
Sage Weil	2a8c4db72f	Merge branch 'wip-mon-mkfs' Reviewed-by: Tommi Virtanen <tv@inktank.com>	2012-07-16 16:15:33 -07:00
Sage Weil	4eec4fc57d	mkcephfs: nicer empty directory check From TV. Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-16 16:14:39 -07:00
Sage Weil	4e66a3b98d	mkcephfs: error out if mon data directory is not empty The ceph-mon --mkfs function no longer wipes out the directory; it is in fact mostly a no-op that just verifies the dir exists. So, ensure that the directory is empty at mkfs time. This could alternatively do an 'rm -r' in that directory (that is in fact what ceph-mon used to do), but this is safer. Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-16 16:14:39 -07:00
Sage Weil	6b1835a92c	vstart.sh: blow away mon directory on creation/start Now that ceph-mon doesn't blow away the mon data content, we need to. Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-16 16:14:39 -07:00
Sage Weil	54be9d0917	mon: stop doing rm -rf on mon mkfs Simply verify that the directory exists, or if it doesn't, create it. Do nothing about its content. Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-16 16:14:39 -07:00
Sage Weil	52f96b9fd1	log: apply log_level to stderr/syslog logic In non-crash situations, we want to make sure the message is both below the syslog/stderr threshold and also below the normal log threshold. Otherwise we get anything we gather on those channels, even when the log level is low. Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-16 16:02:14 -07:00
Sage Weil	de524abdb1	log: dump logging levels in crash dump So you know what you are/are not seeing. Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-16 15:53:59 -07:00
Sage Weil	d3c76f754f	Merge branch 'next'	2012-07-16 15:53:54 -07:00
Samuel Just	3821f6c4bf	PG: grab reference to pg in C_OSD_AppliedRecoveredObject Otherwise, accessing the pg via _applied_recovered_object isn't safe. Using intrusive_ptr clarifies the reference ownership. Signed-off-by: Samuel Just <sam.just@inktank.com>	2012-07-16 15:43:52 -07:00
Sage Weil	64f745008b	log: fix event gather condition We should gather an event if it is below the log or gather threshold. Previously we were only gathering if we were going to print it, which makes the dump no more useful than what was already logged. Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-16 15:36:44 -07:00
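The gather condition described in the commit above can be sketched roughly as follows. The function names are hypothetical, and treating "below" as an inclusive comparison is an assumption; this is not the actual logging code.

```python
def should_gather_old(level, log_level):
    """Earlier behaviour as described: gather only what would be printed."""
    return level <= log_level


def should_gather(level, log_level, gather_level):
    """Described fix: gather if the entry passes either threshold."""
    return level <= log_level or level <= gather_level


if __name__ == "__main__":
    # An entry at level 10 with log_level 1 and gather_level 20 is now kept
    # in memory for a later dump even though it is never printed.
    print(should_gather_old(10, 1), should_gather(10, 1, 20))
```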
Samuel Just	d4410e4ad5	PG::RecoveryState::Stray::react(LogEvt&): set dirty_info/log We adjust the info and the log, so we must set dirty_info and dirty_log to force writes. Signed-off-by: Samuel Just <sam.just@inktank.com>	2012-07-16 14:18:22 -07:00
Samuel Just	4afa892591	PG: use stats from primary after rewinding divergent entries If the osd receiving the info has divergent entries, it will also have a "divergent" stat structure. Probably fixes #2769. In cases like #2769, this bug can result in a primary with a stat structure which double counts an operation: once for the divergent operation, and once for the replay. This is another way for the bug addressed in `5924f8e4a8` to happen. Signed-off-by: Samuel Just <sam.just@inktank.com>	2012-07-16 14:18:22 -07:00
Samuel Just	5f60236610	Merge remote-tracking branch 'upstream/next'	2012-07-16 14:18:04 -07:00
Samuel Just	c7fb964c07	PG::RecoveryState::Stray::react(LogEvt&): reset last_pg_scrub We need to reset the last_pg_scrub data in the osd since we are replacing the info. Probably fixes #2453 In cases like 2453, we hit the following backtrace: 0> 2012-05-19 17:24:09.113684 7fe66be3d700 -1 osd/OSD.h: In function 'void OSD::unreg_last_pg_scrub(pg_t, utime_t)' thread 7fe66be3d700 time 2012-05-19 17:24:09.095719 osd/OSD.h: 840: FAILED assert(last_scrub_pg.count(p)) ceph version 0.46-313-g4277d4d (commit:4277d4d3378dde4264e2b8d211371569219c6e4b) 1: (OSD::unreg_last_pg_scrub(pg_t, utime_t)+0x149) [0x641f49] 2: (PG::proc_primary_info(ObjectStore::Transaction&, pg_info_t const&)+0x5e) [0x63383e] 3: (PG::RecoveryState::ReplicaActive::react(PG::RecoveryState::MInfoRec const&)+0x4a) [0x633eda] 4: (boost::statechart::detail::reaction_result boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list3<boost::statechart::custom_reaction<PG::RecoveryState::MQuery>, boost::statechart::custom_reaction<PG::RecoveryState::MInfoRec>, boost::statechart::custom_reaction<PG::RecoveryState::MLogRec> >, boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> >(boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const)+0x130) [0x6466a0] 5: (boost::statechart::simple_state<PG::RecoveryState::ReplicaActive, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const)+0x81) [0x646791] 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x63dfcb] 7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x11) [0x63e0f1] 8: (PG::RecoveryState::handle_info(int, pg_info_t&, PG::RecoveryCtx)+0x177) [0x616987] 9: (OSD::handle_pg_info(std::tr1::shared_ptr<OpRequest>)+0x665) [0x5d3d15] 10: (OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0x2a0) [0x5d7370] 11: (OSD::_dispatch(Message)+0x191) [0x5dd4a1] 12: (OSD::ms_dispatch(Message*)+0x153) [0x5ddda3] 13: (SimpleMessenger::dispatch_entry()+0x863) [0x77fbc3] 14: (SimpleMessenger::DispatchThread::entry()+0xd) [0x746c5d] 15: (()+0x7efc) [0x7fe679b1fefc] 16: (clone()+0x6d) [0x7fe67815089d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. Because we don't clear the scrub state before resetting info, the last_scrub_stamp state in the info.history structure changes without updating the osd state resulting in the above assert failure. 
Backport: stable Signed-off-by: Samuel Just <sam.just@inktank.com>	2012-07-16 14:07:49 -07:00
Samuel Just	5d82a77060	doc/dev/osd_internals: OSD overview, pg removal, map/message handling This is a start on some osd internals documentation for new developers. Signed-off-by: Samuel Just <sam.just@inktank.com>	2012-07-16 11:11:58 -07:00
Samuel Just	1b8819bbac	PG: Place info in biginfo object The purged_snaps set can grow without bound as snaps are created and removed. Because the filestore doesn't provide unlimited size collection attributes, it's better to place the full info on the biginfo object, since we need to write it during write_info anyway. Added CEPH_OSD_FEATURE_INCOMPAT_BIGINFO to prevent downgrade. Signed-off-by: Samuel Just <sam.just@inktank.com>	2012-07-16 10:59:55 -07:00
Samuel Just	12d70738d1	PG: use write_info to set snap_collections in make_snap_collections At one point, snap_collections were written to a pg collection attribute. Subsequently, they were moved to the biginfo object since the structure can grow too large for limited size xattrs. make_snap_collection, however, was not updated. Using write_info here should prevent this from happening in the future. Signed-off-by: Samuel Just <sam.just@inktank.com>	2012-07-16 10:59:55 -07:00
Samuel Just	90381dc9a1	OSD: set superblock compat_features on boot and mkfs Previously, we did not actually persist the osd compatibility mask. Without persisting the current compat mask, a previous, incompatible version of the OSD would not be prevented from starting on the same store. Signed-off-by: Samuel Just <sam.just@inktank.com>	2012-07-16 10:59:55 -07:00
Samuel Just	470796b545	CompatSet: users pass bit indices rather than masks CompatSet users number the Feature objects rather than providing masks. Thus, we should do mask |= (1 << f.id) rather than mask |= f.id. In order to detect old, broken encodings, the lowest bit will be set in memory but not set in the encoding. We can reconstruct the correct mask from the names map. This bug can cause an incompat bit to not be detected since 1|2 == 1|2|3. fixes: #2748 Signed-off-by: Samuel Just <sam.just@inktank.com>	2012-07-16 10:59:55 -07:00
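The mask arithmetic in the commit above can be checked with a short sketch. The two functions are hypothetical stand-ins, not the CompatSet code, and they leave out the lowest-bit marker used to detect old encodings.

```python
def buggy_mask(feature_ids):
    """OR the ids in directly, as the broken code did (mask |= f.id)."""
    mask = 0
    for feature_id in feature_ids:
        mask |= feature_id
    return mask


def fixed_mask(feature_ids):
    """Treat each id as a bit index (mask |= 1 << f.id)."""
    mask = 0
    for feature_id in feature_ids:
        mask |= 1 << feature_id
    return mask


if __name__ == "__main__":
    # With the bug, adding feature 3 does not change the mask: 1|2 == 1|2|3.
    print(buggy_mask([1, 2]), buggy_mask([1, 2, 3]))
    # With bit indices, the two feature sets are distinguishable.
    print(fixed_mask([1, 2]), fixed_mask([1, 2, 3]))
```

Because the buggy masks for features {1, 2} and {1, 2, 3} are equal, a store written with feature 3 would look compatible to code that only knows features 1 and 2.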
Sage Weil	b7814dbefb	osd: base misdirected op role calc on acting set We want to look at the acting set here, nothing else. This was causing us to erroneously queue ops for later (wasting memory) and to erroneously print out a 'misdirected op' message in the cluster log (confusion and incorrect [but ignored] -ENXIO reply). Fixes: #2022 Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-16 10:57:33 -07:00
Sage Weil	14d2efc438	mon/MonitorStore: always O_TRUNC when writing states It is possible for a .new file to already exist, potentially with a larger size. This would happen if: - we were proposing a different value - we crashed (or were stopped) before it got renamed into place - after restarting, a different value was proposed and accepted. This isn't so unlikely for the log state machine, where we're aggregating random messages. O_TRUNC ensures we avoid getting the tail end of some previous junk. I observed #2593 and found that a logm state value had a larger size on one mon (after slurping) than the others, pointing to put_bl_sn_map(). While we are at it, O_TRUNC put_int() too; the same type of bug is possible there, too. Fixes: #2593 Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-16 10:57:08 -07:00
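The failure mode described in the commit above can be reproduced with a small sketch. The helper `write_state` and the file name `state.new` are illustrative assumptions; this is not the MonitorStore code, only a demonstration of what a leftover, larger .new file does when it is rewritten without O_TRUNC.

```python
import os
import tempfile


def write_state(path, data, truncate):
    """Write data at offset 0, optionally truncating an existing file first."""
    flags = os.O_WRONLY | os.O_CREAT
    if truncate:
        flags |= os.O_TRUNC
    fd = os.open(path, flags, 0o644)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)


def demo():
    """Return the file contents after rewriting without and with O_TRUNC."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "state.new")
        # A stale, larger value left behind by an interrupted proposal.
        write_state(path, b"OLD-LONGER-VALUE", truncate=True)
        write_state(path, b"new", truncate=False)
        with open(path, "rb") as f:
            without_trunc = f.read()
        write_state(path, b"OLD-LONGER-VALUE", truncate=True)
        write_state(path, b"new", truncate=True)
        with open(path, "rb") as f:
            with_trunc = f.read()
    return without_trunc, with_trunc


if __name__ == "__main__":
    print(demo())
```

Without truncation the new value is followed by the tail of the old one, which matches the "larger size on one mon" symptom the commit describes.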
Sage Weil	e429da34c9	Merge remote-tracking branch 'gh/bugfix-2022' Reviewed-by: Samuel Just <sam.just@inktank.com>	2012-07-16 10:48:25 -07:00
Sage Weil	47b38dd0ea	Merge remote-tracking branch 'gh/bugfix-2779' Reviewed-by: Greg Farnum <greg@inktank.com>	2012-07-16 09:12:09 -07:00
Sage Weil	f94c764638	mon: remove osds from [near]full sets when their stats are removed from pgmap Greg points out that we could have a situation like: - mon recovers.. - goes through osdmaps, notes an osd was removed and removes from full/nearfull - goes through pgmaps, and re-adds it when it encounters some osd_stat_ts. Fix this by removing the osd from the full/nearfull set when we remove the osd_stat_t from the pgmap. Any osd removal is always followed by an osd_stat_rm[] record when the primary processes the new osdmap and proposes the appropriate pgmap updates. Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-15 22:03:31 -07:00
Sage Weil	fe57681892	mon/MonitorStore: always O_TRUNC when writing states It is possible for a .new file to already exist, potentially with a larger size. This would happen if: - we were proposing a different value - we crashed (or were stopped) before it got renamed into place - after restarting, a different value was proposed and accepted. This isn't so unlikely for the log state machine, where we're aggregating random messages. O_TRUNC ensures we avoid getting the tail end of some previous junk. I observed #2593 and found that a logm state value had a larger size on one mon (after slurping) than the others, pointing to put_bl_sn_map(). While we are at it, O_TRUNC put_int() too; the same type of bug is possible there, too. Fixes: #2593 Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-15 21:38:29 -07:00
Sage Weil	bf9a85ade6	filestore: dump open fds when we hit EMFILE Use a helper to dump /proc/self/fd when we hit EMFILE in the filestore. Ideally, we should trigger this in other appropriate places, but it is not immediately clear that there is a sane way to do that. Fixes: #2330 Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-15 16:31:05 -07:00
Sage Weil	a278ea1316	osdmap: drop useless and unused get_pg_role() method Users probably want get_pg_acting_rank(). If they don't, they probably have the mapping and can calculate the rank themselves. Having this here is asking for bugs like #2022. Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-14 17:39:34 -07:00
Sage Weil	38962abd5b	osd: base misdirected op role calc on acting set We want to look at the acting set here, nothing else. This was causing us to erroneously queue ops for later (wasting memory) and to erroneously print out a 'misdirected op' message in the cluster log (confusion and incorrect [but ignored] -ENXIO reply). Fixes: #2022 Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-14 17:39:33 -07:00
Sage Weil	6faeedacfb	osd: simplify helper usage for misdirected ops Make the helper exclusively for the PG != NULL cases, and open-code the one PG == NULL caller. This is simpler, and lets us include more useful information in the log message. Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-14 17:39:33 -07:00
Noah Watkins	ed4f80f960	vstart: use absolute path for keyring Stores absolute path to the generated keyring so that tests running in other directories (e.g. src/java/test) can simply reference the generated ceph.conf. Signed-off-by: Noah Watkins <jawhawk@cs.ucsc.edu>	2012-07-14 17:39:11 -07:00
Samuel Just	117b28680e	OSD: add config options to fake missed pings In order to test monitor and osd failure detection and false positive correction, this patch adds the following options: 1. osd_debug_drop_ping_probability: probability of dropping a string of pings from a client upon ping receipt. 2. osd_debug_drop_ping_duration: number of pings to drop in a row. This should help with replicating some wrongly-marked-down thrashing cases. Signed-off-by: Samuel Just <sam.just@inktank.com>	2012-07-13 16:09:53 -07:00
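One way the two options in the commit above could interact is sketched below. The `PingDropper` class is hypothetical, and details such as counting the triggering ping as the first dropped one are assumptions; the real OSD code may track this differently (for example, per peer).

```python
import random


class PingDropper:
    """Decide per ping whether to drop it, in runs of `duration` pings."""

    def __init__(self, probability, duration, rng=None):
        self.probability = probability  # chance of starting a run of drops
        self.duration = duration        # number of pings dropped per run
        self.remaining = 0              # drops still owed in the current run
        self.rng = rng if rng is not None else random.Random()

    def should_drop(self):
        if self.remaining > 0:
            self.remaining -= 1
            return True
        if self.duration > 0 and self.rng.random() < self.probability:
            # Assumption: the ping that triggers the run is the first drop.
            self.remaining = self.duration - 1
            return True
        return False


if __name__ == "__main__":
    dropper = PingDropper(probability=0.1, duration=3, rng=random.Random(0))
    print([dropper.should_drop() for _ in range(20)])
```

Dropping pings in runs rather than one at a time is what lets a test exceed a heartbeat grace period and provoke a false failure report.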
caleb miles	ce20e02021	crushtool: allow information generated during testing to be dumped to a set of CSV files for off-line analysis. Signed-off-by: caleb miles <caleb.miles@inktank.com>	2012-07-13 15:14:15 -07:00
John Wilkins	8a89d40e6b	doc: remove last reference to ceph-cookbooks. Signed-off-by: John Wilkins <john.wilkins@inktank.com>	2012-07-13 14:16:08 -07:00
John Wilkins	2011956745	doc: cookbooks issue resolved, so changed 'ceph-cookbooks' back to 'ceph.' Signed-off-by: John Wilkins <john.wilkins@inktank.com>	2012-07-13 14:08:41 -07:00
Josh Durgin	5a5597f6c5	qa: download tests from specified branch These python tests aren't installed, so they need to be downloaded Signed-off-by: Josh Durgin <josh.durgin@inktank.com>	2012-07-13 13:35:07 -07:00
Samuel Just	53600798f7	OSD: send_still_alive when we get a reply if we reported failure When we get a ping reply, remove the peer from the failure_queue and send a still alive message if the peer is in the failure_pending map. Otherwise, the monitor could slowly accumulate sporadic failure reports leading to an osd being incorrectly marked out. This bug may have been contributing to the wrongly-marked-down thrashing observed on some systems. Signed-off-by: Samuel Just <sam.just@inktank.com>	2012-07-13 12:18:46 -07:00
Samuel Just	5924f8e4a8	PG: merge_log always use stats from authoritative replica If the osd receiving the log has divergent entries, it will also have a "divergent" stat structure. In general, it suffices to simply trust the stat structure shipped with the authoritative log and info since merge_log is only used to merge an authoritative log. Probably fixes #2769. In cases like #2769, this bug can result in a primary with a stat structure which double counts an operation: once for the divergent operation, and once for the replay. It turned up in a regression suite run as a scrub stat mismatch. Signed-off-by: Samuel Just <sam.just@inktank.com>	2012-07-13 10:19:24 -07:00

20415 Commits