Feature request by Tilmann Steinberg.
It greatly eases debugging when searching for a source of wrong
permissions.
Some admin tools like Puppet seem to have their own default notion
of "secure permissions" and try to "fix" them ;)
Found by code inspection; observed neither in practice nor in testing.
It should not occur in practice, because it can only happen after
marsadm pause-fetch, an exceptional state to be entered only
for maintenance or for emergency failover.
Skipping over an incorrect logfile at a secondary may produce an
unnecessary split brain.
Fix the potential problem by allowing the skip only after "primary --force",
and by never creating a new logfile, always re-using existing
logfiles.
This has been found by testing.
In extremely rare cases, such as after a crash at the "wrong moment"
or a defective /mars filesystem, the replay link could show a
different length than the corresponding versionlink.
The versionlink would not be updated anymore when, in addition, the
logfile has the same length as the replay link.
The incorrect versionlink will then lead to a lock-up.
Fix the problem by using the _minimum_ of all length indicators.
For safety, or when in doubt, replay more data, which will in turn
update the versionlink again to its correct value.
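
To illustrate the idea, here is a minimal C sketch of the length
reconciliation; the function and parameter names are illustrative
only and not taken from the actual MARS sources:

    #include <linux/types.h>	/* loff_t */

    /* Sketch: treat the minimum of all length indicators as the
     * confirmed replay position. Everything beyond it is replayed
     * again, which also refreshes the versionlink to its correct
     * value (in doubt, more data is replayed, never less).
     */
    static inline loff_t confirmed_replay_pos(loff_t replay_link_len,
                                              loff_t version_link_len,
                                              loff_t logfile_len)
    {
            loff_t pos = replay_link_len;

            if (version_link_len < pos)
                    pos = version_link_len;
            if (logfile_len < pos)
                    pos = logfile_len;
            return pos;
    }
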
This is important for the namespace systematics of primitive macros.
First name the object, then name its property, like in OO
(for instance, a hypothetical %disk-size{} would name the object
"disk" before its property "size").
Exception: when _finding_ the object itself needs an operation, or
additional information, e.g. %get-disk{} (this is the "lookup operation"
for the object itself, at least conceptually).
For compatibility, the old forms are also accepted
(silently, undocumented).
A primary appeared to have died, and was rebooted.
In the meantime, the old secondary was forcefully switched
to primary.
Afterwards, the old primary = new secondary got stuck because two
versionlinks, which had been _produced_ by that host _itself_, were
missing, but they were present at the new primary = old secondary!
How could this happen?
All transaction logfiles were fully present and correct everywhere.
However, the old primary's kern.log showed that a problem with the
RAID system must have existed. In addition, the RAID controller
error log also reported some problems which appeared to have healed.
Problem analysis shows the following possibility:
The transaction logger can continue to write data, even via
fsync(), while the _writeback_ of other parts of the /mars filesystem
(e.g. symlink updates) is stuck for a long time due to an IO problem.
Usually, slow or even missing symlink updates are no problem because
upon recovery after a reboot, everything is healed by transaction
replay (possibly replaying much more data than really necessary,
but this does not affect semantics, and it is even advantageous
when RAID disks might contain defective data).
There is one exception: after a logrotate, the corresponding new
versionlink should appear within a short time. Otherwise, the
above-mentioned scenario could emerge.
We use sync_filesystem() to ensure that any update to a _new_
versionlink is guaranteed to become persistent; otherwise (in case
of IO problems) the mars_light thread will hang, which will
hopefully be noticed soon by monitoring.
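
To illustrate the intended pattern, here is a minimal C sketch;
create_versionlink(), persist_new_versionlink() and mars_sb are
hypothetical stand-ins rather than the actual mars_light.c
identifiers, and the real code must also obey the locking rules
which sync_filesystem() expects from its caller:

    #include <linux/fs.h>

    /* Hypothetical stand-in for the symlink update done at logrotate. */
    static int create_versionlink(const char *target, const char *name);

    static int persist_new_versionlink(struct super_block *mars_sb,
                                       const char *target,
                                       const char *name)
    {
            int status = create_versionlink(target, name);

            if (status >= 0) {
                    /* Flush the /mars superblock: either the new
                     * versionlink becomes persistent, or (on IO
                     * problems) this call hangs, making the problem
                     * visible to monitoring.
                     */
                    status = sync_filesystem(mars_sb);
            }
            return status;
    }
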
Originally a trivial silly bug (a boolean value was wrong), leading to an
endless loop when a local versionlink was missing, which can happen
only after a primary crash at the wrong moment shortly after a logrotate
(not even during ordinary operations), followed by a hard reboot.
As documented in mars-manual.pdf, you simply need "modprobe mars"
to recover after such a crash reboot. MARS remembers the primary state
persistently for you and restores everything _automatically_.
Using "marsadm primary" in such a case to switch the current primary
to primary again (after an unnecessary "marsadm secondary" which is
strongly discouraged by mars-manual.pdf), although the host is / was
already in primary state after the reboot, is at least as silly as
the mentioned bug. Doing this in an /etc/init.d/ startup script
where it really doesn't belong into, is even more silly.
The latter is even an OPERATIONAL RISK, because "marsadm secondary"
works _globally_ in the whole cluster (as documented in mars-manual.pdf).
Such an improper startup script _can_ (potentially) disturb another
cluster member which had become primary in the _meantime_, during the
reboot.
Global cluster operations don't belong in startup scripts, because
reboots may happen unintentionally at any time.
Only a secondary is allowed to do this, because we assume that
logfile replay has the property of "anytime consistency"
only there.
When a primary cannot recover after a crash due to a defective
logfile, this is not true. The primary is simply lost in such a
(rare) case. Observed twice during almost 8 million operating hours.
In such a case, the hardware is truly defective, and you have only
the following options:
1) switch over to a secondary via "primary --force", OR
2) deconstruct the resource everywhere, run fsck or similar on
whatever replica seems to be the best version,
and reconstruct the resource from scratch, OR
3) restore your backup.