RepoMirrors/mars - mars

Commit Graph

Author	SHA1	Message	Date
Thomas Schoebel-Theuer	dd4748bb52	light: clarify code	2016-03-01 11:58:23 +01:00
Thomas Schoebel-Theuer	8fa728a0c9	light: fix annoying unnecessary error message	2016-03-01 11:58:23 +01:00
Thomas Schoebel-Theuer	8abcbf196d	light: safeguard sync vs replay	2016-03-01 11:58:23 +01:00
Thomas Schoebel-Theuer	e70ac4df8c	light: safeguard position update	2016-03-01 11:58:23 +01:00
Thomas Schoebel-Theuer	fafad9512a	light: always update position symlinks at logger switchoff	2016-03-01 11:58:23 +01:00
Thomas Schoebel-Theuer	42c2dc98da	light: fix typo in replay link comparison	2016-03-01 11:58:23 +01:00
Thomas Schoebel-Theuer	a312e3d93b	light: fix memory leak regression from `f235b76900`	2016-03-01 11:58:09 +01:00
Thomas Schoebel-Theuer	8bc1e80488	light: safeguard skipping of logfiles in disconnected state. Found by code inspection, neither in practice nor by testing. Should not occur in practice, because it could only occur after marsadm pause-fetch, which is an exceptional state only to be entered for maintenance or for emergency failover. Skipping over an incorrect logfile at a secondary may produce an unnecessary split brain. Fix the potential problem by doing it only after "primary --force", and by never creating a new logfile, always re-using existing logfiles instead.	2016-02-10 06:44:00 +01:00
Thomas Schoebel-Theuer	f235b76900	light: fix potential deadlock on restart after inconsistent symlinks. This has been found by testing. In extremely rare cases, such as after crashes at the "wrong moment" or after defective /mars filesystems, the replay link could show a different length than the corresponding versionlink. The versionlink wouldn't be updated anymore when additionally the logfile has the same length as the replay link. The incorrect versionlink will then lead to a deadlock. Fix the problem by using the _minimum_ of all length indicators. For safety, or when in doubt, replay more data, which will in turn update the versionlink again to its correct value.	2016-02-10 06:24:27 +01:00
Thomas Schoebel-Theuer	8e2de8288d	light: fix missing versionlink upon slow or defective IO. A primary appeared to have died, and was rebooted. In the meantime, the old secondary was forcefully switched to primary. Afterwards, the old primary = new secondary got stuck because 2 versionlinks, which had been _produced_ by _itself_, were missing, but they were present at the new primary = old secondary! How could this happen? All transaction logfiles were fully present and correct everywhere. However, the old primary's kern.log showed that a problem with the RAID system must have existed. In addition, the RAID controller errorlog also reported some problems which appeared to have healed. Problem analysis shows the following possibility: The transaction logger can continue to write data, even via fsync(), while the _writeback_ of other parts of the /mars filesystem (e.g. symlink updates) got stuck for a long time due to an IO problem. Usually, slow or even missing symlink updates are no problem because upon recovery after a reboot, everything is healed by transaction replay (possibly replaying much more data than really necessary, but this does not affect semantics, and it is even advantageous when RAID disks might contain defective data). There is one exception: after a logrotate, the corresponding new versionlink should appear after a short time. Otherwise, the above-mentioned scenario could emerge. We use sync_filesystem() to ensure that any versionlink update to a _new_ versionlink is either guaranteed to become persistent, or (in case of IO problems) the mars_light thread will hang, which will (hopefully) be noticed soon by monitoring.	2016-02-03 22:01:48 +01:00
Thomas Schoebel-Theuer	ea48664a14	light: disallow primary from rotating over damaged logfiles. Only a secondary is allowed to do this, because we assume that logfile replay has the property of "anytime consistency" only there. When a primary cannot recover after a crash due to a defective logfile, this is not true. The primary is simply lost in such a (rare) case. Observed twice in almost 8 million operating hours. In such a case, hardware is truly defective, and you have only the following options: 1) switch over to a secondary via "primary --force", OR 2) deconstruct the resource everywhere, run fsck or similar on whatever replica seems to be the best version, and reconstruct the resource from scratch, OR 3) restore your backup.	2016-01-21 08:09:47 +01:00
Thomas Schoebel-Theuer	acdb9d7a42	light: fix reset of replay-code. Reset was forgotten in secondary role. Do it always whenever a logfile is actually rotated.	2016-01-20 14:48:43 +01:00
Thomas Schoebel-Theuer	496e57e1e1	logger: add new indicator for damaged logfiles	2016-01-15 17:10:58 +01:00
Thomas Schoebel-Theuer	d67336420d	light: fix becoming primary when logfiles are damaged. When logfile replay aborts with an error, becoming primary would be impossible. Without this, repair would only be possible by complete destruction of the resource. A previous version of this patch introduced /proc/sys/mars/allow_primary_when_damaged which would complicate the sysadmin interface. People would be unsure what to do.	2016-01-13 14:12:02 +01:00
Thomas Schoebel-Theuer	3eedff125d	infra: fix comparison. Under weird circumstances, when the new symlink content was just a shortened version (prefix) of the old one, the symlink was not updated.	2016-01-02 10:18:33 +01:00
Thomas Schoebel-Theuer	54d8433b21	light: fix spelling	2015-10-07 10:46:04 +02:00
Thomas Schoebel-Theuer	c39a2988b7	light: fix long-lasting switchoff at end of sync	2015-06-17 11:33:27 +02:00
Thomas Schoebel-Theuer	4ecd6937c7	light: don't try fetching from (none)	2015-06-17 11:33:27 +02:00
Thomas Schoebel-Theuer	876625d66a	light: disallow modprobe when UUID is missing	2015-03-23 13:48:11 +01:00
Thomas Schoebel-Theuer	7f565f77b6	light: prohibit communication with wrong UUID	2015-03-06 11:49:54 +01:00
Thomas Schoebel-Theuer	7ced30b24c	infra: report peak IO latencies	2015-02-27 11:32:57 +01:00
Thomas Schoebel-Theuer	c35065fe97	infra: report global IO hangs	2015-02-27 11:32:57 +01:00
Thomas Schoebel-Theuer	c1823bbfab	light: report actually running buildtag	2015-02-27 11:32:56 +01:00
Thomas Schoebel-Theuer	736489eccd	light: suppress irrelevant warning	2015-02-24 15:51:28 +01:00
Thomas Schoebel-Theuer	036953fa54	light: provisionally allow fetch during detach	2015-02-24 15:51:28 +01:00
Thomas Schoebel-Theuer	0453fbae9b	light: fix race on rmmod	2015-02-24 15:51:27 +01:00
Thomas Schoebel-Theuer	f10e7358ad	light: stop syncing upon logfile holes	2015-02-24 15:51:26 +01:00
Thomas Schoebel-Theuer	827b5b5192	light: fix syncpos indication of inconsistency	2015-02-24 12:08:41 +01:00
Thomas Schoebel-Theuer	c03fc47539	light: fix start of sync	2015-02-24 12:08:41 +01:00
Thomas Schoebel-Theuer	0c38493e13	light: add hysteresis to emergency recovery	2015-02-24 12:08:39 +01:00
Thomas Schoebel-Theuer	092201decc	light: less side effects by emergency mode	2015-02-24 11:15:29 +01:00
Thomas Schoebel-Theuer	5d81381664	all: disallow sync IO during emergency mode	2015-02-11 15:20:26 +01:00
Thomas Schoebel-Theuer	e7464b3c02	all: correct error code EIO. The error code -EIO should always refer to a problem of lower storage layers. Thus MARS should not generate that code itself, but other ones.	2015-01-20 15:20:10 +01:00
Thomas Schoebel-Theuer	802cc73b49	infra: additionally safeguard race on brick resource deallocation	2015-01-19 18:01:04 +01:00
Thomas Schoebel-Theuer	fa49247b8e	infra: fix stale dents	2015-01-19 18:01:04 +01:00
Thomas Schoebel-Theuer	ce48d7031c	all: fix hang of NotYetPrimary in lower emergency modes	2014-12-07 09:24:16 +01:00
Thomas Schoebel-Theuer	7366cb9dad	light: fix leave-cluster communication	2014-12-07 09:24:16 +01:00
Thomas Schoebel-Theuer	28c8575cc0	light: fix becoming primary during split brain. Always prefer the own logfile if one exists. This should improve becoming primary in most split brain situations.	2014-12-07 09:24:16 +01:00
Thomas Schoebel-Theuer	aa09d7df30	all: clarify license GPLv2+	2014-11-25 18:09:17 +01:00
Thomas Schoebel-Theuer	917d5ae2d2	light: fix client shutdown on slow network. On slow networks, the generic net_io_timeout is too long if you are impatiently waiting for disconnect. Change the io_timeout of the individual client brick to a short value.	2014-11-12 09:01:35 +01:00
Thomas Schoebel-Theuer	1295c43a7a	infra: move io_timeout to generic interface. This is needed for the next commit.	2014-11-12 09:01:34 +01:00
Thomas Schoebel-Theuer	843a931cae	light: fix zero progress of rate display	2014-11-12 09:01:33 +01:00
Thomas Schoebel-Theuer	547cc60a72	light: fix long-lasting pause-fetch effect	2014-11-12 09:01:33 +01:00
Thomas Schoebel-Theuer	f6cca5ca72	light: fix copy switch off	2014-11-12 09:01:33 +01:00
Thomas Schoebel-Theuer	ed57478ace	light: fix versionlink in emergency mode	2014-08-25 09:43:06 +02:00
Thomas Schoebel-Theuer	6a176c26c7	light: fix propagation of maxnr	2014-08-14 10:01:21 +02:00
Thomas Schoebel-Theuer	3a6ff3d2c8	infra: quickfix Redhat/openvz builds	2014-07-14 17:27:11 +02:00
Thomas Schoebel-Theuer	4a2ee37b98	light: treat double logfiles directly as split brain	2014-07-11 08:19:10 +02:00
Thomas Schoebel-Theuer	16f5a5dd77	light: fix becoming primary in multiple logrotated situations	2014-07-11 07:55:33 +02:00
Thomas Schoebel-Theuer	1439d30ffb	all: port to newer kernels (up to 3.15)	2014-06-18 12:10:55 +02:00
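
Commit 3eedff125d above describes a symlink that was not updated when its new content was a prefix of the old one. The following is a minimal sketch of how such a comparison bug can arise and how a full-length comparison avoids it. It is not the MARS code: both function names are hypothetical, and the length-limited strncmp() is only an assumed cause chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical buggy check (assumed for illustration only):
 * compares just the first strlen(new_val) bytes, so a new value
 * that is a strict prefix of the old one looks "unchanged".
 */
static bool symlink_needs_update_buggy(const char *old_val, const char *new_val)
{
	return strncmp(old_val, new_val, strlen(new_val)) != 0;
}

/*
 * Full comparison: strcmp() also compares the terminating NUL,
 * so strings of different length are never reported as equal.
 */
static bool symlink_needs_update(const char *old_val, const char *new_val)
{
	return strcmp(old_val, new_val) != 0;
}
```

With an old value of "12345" and a new value of "123", the length-limited variant reports that no update is needed, while the full comparison correctly requests one.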

195 Commits