Meaning of stable tagnames
--------------------------

Example: mars0.1stable01:
  0      = version of on-disk data structures
           (only incremented when downgrades are impossible)
           (not incremented on backwards-compatible upgrades)
  1      = version of feature set
  stable = feature set is frozen during this series
  01     = bugfix revision

Example: mars0.2beta2.3:
  The general idea is as before.
  "beta" means that new features are roughly tested in the lab,
  but not in production, so there may be some bugs.
  New features may be added during the beta phase.

Example: mars0.3alpha*:
  Never use this for production.
  Only for historic code inspection.

Release Conventions / Branches / Tagnames
-----------------------------------------

mars0.1 series (now EOL):
  - Unstable tagnames: light0.1beta%d.%d (obsolete)
  - Stable branch: mars0.1.y (obsolete)
  - Stable tagnames: mars0.1stable%02d (obsolete)

mars0.1a series (stable):
  New master branch. Now stable.
  This branch has been operational for several years on several
  thousands of servers, and several petabytes of data.
  - Stable branch: mars0.1a.y
  - Stable tagnames: mars0.1astable%02d
  - Historic tagnames: light0.1abeta%d (obsolete)

mars1.0 series (planned):
  - Replace symlink tree by transactional status files (future-proof).
    This is required for upstream merging to the kernel.
    It has further advantages, such as better scalability.
  - Trying to additionally address public needs.
  - Potentially for Linux kernel upstream.
  - Unstable tagnames: mars1.0beta%d.%d (planned)
  - Stable branch: mars1.0.y (planned)
  - Stable tagnames: mars1.0stable%02d (planned)

WIP-* branches are for development and may be rebased onto anything
at any time without notice. They will disappear eventually.
Never use them for production!

*stable* branches mean the following:
  - Heavily tested. Has to obey an HA SLA of 99.98% end-to-end,
    including network outages and HumanError(tm) at
    1&1 Ionos ShaHoLin. Thus the _component_ SLA of MARS must be
    much better.
  - New features may be introduced, but are off by default.
  - There is always an upgrade path. Simply install the new version,
    obeying the below compatibility rules.
  - Rolling upgrades (temporarily different MARS kernel module
    versions at primary vs secondary side) are supported.
    Typically, do "rmmod mars; modprobe mars" at the secondary side
    first, then handover, then do the same at the former primary
    side. Or, of course, you may combine it with (typically
    security-triggered) rolling kernel reboots.
    I am putting high effort into maintaining rolling upgrades of
    kernel modules. The network protocols are designed to support
    this.
  - COMPATIBILITY RULES:
    Ensure that $marsadm_version >= $module_version.
    This is the safe side of your update strategy.
    Update marsadm first, before updating the kernel module.
    This way, the controls for newer features are already in place
    when the new kernel module is activated (no blind flight).
    Since marsadm is a plain Perl script with _no_ dependencies on
    anything else, this is something I can reasonably expect from
    users.
    REASON: ensuring eternal backwards compatibility with stone-age
    marsadm versions would make me ill. I cannot change old versions
    anymore, I can only provide new versions. I cannot ensure and
    test that all possible O(n^2) combinations of marsadm versions
    with kernel module versions work for all time if marsadm were
    frozen, or even all O(n^3) combinations of frozen marsadm with
    mixed-operations kernel modules.
    The development of MARS would be hindered by too old marsadm
    versions, since my effort would grow quadratically or even worse.
    Hint: nevertheless, many combinations of old marsadm with a newer
    kernel module version are working anyway, in particular when the
    gap is a small $epsilon. But I cannot guarantee this in general.
    If you want to violate the above rule, you must test the
    combination yourself.
  - Best practice in bigger installations: test your upgrade or
    downgrade at some test clusters first.
    If you have a separate pre-live stage, it definitely is your
    friend.
  - As long as $marsadm_version + $epsilon >= $module_version
    remains true (at least "approximately") and has been tested in
    pre-live, marsadm may be upgraded and downgraded independently
    from the kernel, and during operations (best via your favorite
    package manager).
    Of course, no magic will happen: newer features are only
    available when newer versions of _both_ the userspace tool and
    the kernel modules are installed.
  - Please check this ChangeLog for any upgrade / downgrade
    incompatibility bugs. In case they are detected, they will be
    fixed. But I cannot retrospectively change already released
    versions and their bugs. Fixes are only possible in newer
    versions.
  - Downgrade is possible *inside* of the same stable branch series,
    at least over about 20 minor releases.
    I cannot test all possible O(n^2) downgrade combinations.
    Thus be careful when downgrading over high version distances
    (which should not be done anyway unless you have some very
    well-considered reasons).
  - Downgrade to _prior_ *stable* branches, or over very big
    differences in minor version number, may be restricted, or may
    require some extraordinary actions.
    Please read this ChangeLog for details.
    Example: a new future-proof internal deletion format has been
    introduced in mars0.1astable88. It is off by default.
    If you never activate it, you can downgrade inside of
    mars0.1astable* as you like.
    Only if you actually activate it, and if you really need to
    downgrade beyond that old version, you have to obey the
    downgrade instructions documented below.
    However, don't rely on this. At some point after many releases,
    and after its stability has been proven at 1&1 Ionos ShaHoLin
    over a long time, the new feature may be auto-activated, so that
    downgrade should remain possible, but may then require some
    manual actions.

-----------------------------------
Changelog for series 0.1a:

This is the new master branch, starting January 2019.
The old stable branch mars0.1.y is EOL, now fully superseded by this
branch.

mars0.1astable132
  * Major doc update of architecture guide.
    More improvements like splits for various interests are
    certainly needed, but currently not in focus due to limited
    time.

mars0.1astable131
  * Major fix: long-standing race condition on aspect
    {de,}allocation.
    Was triggering extremely rarely and was very hard to reproduce,
    but it could lead to rare stacktraces and to rare kernel hangs.
    I am not 100% sure that it is fully fixed, but massive testing
    over a long time tells me that it has _at least_ improved.
    Further stresstest improvements after safeguarding potentially
    misleading callbacks from aio & co over shared kernel files,
    where their shared pagecache cannot distinguish different
    callers.
    Possibly more improvements might appear in future releases.
    This will take a lot of time for extreme stress-testing.
  * Minor safeguard: better safeguard of indirect calls via mb().
    Theoretically, this should be unnecessary. But I saw some
    extremely rare effects (only at a certain hardware class), so
    I prefer stability over maximum performance.

mars0.1astable130
  * Minor improvement: marsadm now compensates a race between
    emergency mode removal and invalidate.

mars0.1astable129
  * Painful regression from mars0.1astable128: rmmod mars (and
    possibly some other situations) could hang forever, requiring
    reboot.
    Please upgrade ASAP in case you have already deployed exactly
    mars0.1astable128. Do not use it anymore.
  * Minor fix: when the new prepatch series v2 was used (LTS kernels
    4.19 and 5.4), "modinfo mars" was incorrectly reporting
    "no_prepatch". Now the report tells you the prepatch series
    (v1 versus v2).

mars0.1astable128 (do not use anymore)
  * Critical fix: inherent races between automatic network
    re-submission and completion are compensated now.
    Likely, these races could have been the reason for very rare
    stack traces.
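Illustration (not part of MARS): the update order described in the release conventions at the top of this file (marsadm first, then "rmmod mars; modprobe mars" at the secondary side, then handover, then the same at the former primary side) can be sketched as a small planning helper. The function name, the host names and the wording of the step labels are assumptions made for this sketch. It only produces an ordered list and does not execute anything.

```python
def rolling_upgrade_plan(primary, secondaries):
    """Return ordered (host, action) steps for a rolling upgrade.

    Hypothetical helper: it only encodes the order given in the
    release conventions; it does not run any command.
    """
    secondaries = list(secondaries)
    if not secondaries:
        raise ValueError("a rolling upgrade needs at least one secondary")

    plan = []
    # Compatibility rule: marsadm is updated everywhere before any
    # kernel module is replaced.
    for host in [primary] + secondaries:
        plan.append((host, "install new marsadm"))
    # The secondary side reloads the kernel module first.
    for host in secondaries:
        plan.append((host, "rmmod mars; modprobe mars"))
    # Handover to an already upgraded secondary (here: the first one).
    plan.append((secondaries[0], "handover: become primary"))
    # Finally the former primary reloads the module.
    plan.append((primary, "rmmod mars; modprobe mars"))
    return plan


if __name__ == "__main__":
    for host, action in rolling_upgrade_plan("hostA", ["hostB"]):
        print(f"{host}: {action}")
```

Such a list can be reviewed in pre-live before any real rollout. How the new marsadm and the new module are installed depends on your package manager and is deliberately left out.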

mars0.1astable127
  * Major fixes / safeguards: very rare stacktraces in copy_endio()
    and client_io() are hopefully better addressed now.
    I have no reproducer at the moment.
  * Various minor fixes and improvements.

mars0.1astable126
  * Critical regression from mars0.1astable125: an uninitialized
    pointer could lead to a (rare) kernel oops. Not observed in
    practice until now. Please upgrade for safety.
  * Major fix, only relevant when logfile compression was enabled:
    In certain corner cases, compressed logfiles were not always
    decompressible, as indicated by DefectiveLog error messages.
    Workaround was possible by switching compression off, cron, and
    invalidate.
  * Minor compat fix: very old kernels missing WRITE_ONCE.
  * Minor cosmetics: benchmark results now have priority KERN_INFO.

mars0.1astable125
  * Critical fixes: over a very long time, internal int counters
    could wrap around into negative numbers and cause a kernel Oops.
  * Critical safeguard: about once per 1 million operation hours,
    a stacktrace was observed in copy_endio().
    At the moment, I have no reproducer for this very sporadic bug.
    Hopefully it is fixed now.
  * Minor improvement: alternation between sync and replay now
    avoids unnecessary waiting.

mars0.1astable124
  * Major improvement: support for LTS kernels 4.19 and 5.4.
    A new pre-patch generation obeying the new ksys_* conventions
    has been added. This will help further porting in the future.
    IMPORTANT: do NOT OMIT the fix for upstream bug
    0001-sched-wait-fix-endless-kthread-loop-at-timeout.patch
    from directory pre-patches/vanilla-*/ .
    Leaving out this fix may SERIOUSLY HARM your experience, due to
    kernel soft lockups happening when the network is interrupted,
    i.e. exactly during certain types of incidents.
    Anyway, please also use
    0001-mars-v2-minimum-pre-patch-for-mars.patch
    because IO performance is MUCH WORSE without this pre-patch.

mars0.1astable123
  * Major fix: under very unlikely conditions, deadlock of the
    logger IO scheduling was possible.
    Never observed during millions of total operation hours.
    Please update for maximum HA safety.
  * Minor fix: a rare divide by zero in IO or network throughput
    limiters (normally not used) was possible under very special
    circumstances.

mars0.1astable122
  * Minor improvement (possibly a regression from 0.1astable116),
    observed in a very tricky situation where _both_ the primary
    and secondary RAIDs were heavily degraded at the same time:
    The replay could take too much preference over sync, leading
    to quasi-starvation of sync.
    Workaround was possible by temporarily setting
    /proc/sys/mars/sync_flip_interval_sec to 0, and manually
    switching between replay and sync.
    Now awful RAID degradation should be handled more gracefully.
  * Minor improvement, stimulated by Gabriel Francisco:
    Before marsadm asks DNS, first /etc/hosts is consulted via
    /usr/bin/getent.

mars0.1astable121
  * Fix rare use-after-free, only observed at rmmod operations
    under KASAN test kernel 4.14.
    For maximum safety, please update to this version.

mars0.1astable120
  * Fix build with LTS kernel 4.14.
    IMPORTANT: you need to patch your 4.14 kernel sources with
    pre-patches/vanilla-4.14/0001-sched-wait-fix-endless-kthread-loop-at-timeout.patch
    Otherwise you will encounter _massive_ problems!

mars0.1astable119
  * Major systemd update, only relevant when you are using the
    systemd template generator:
    - Now works completely lockless. This should improve the
      parallelism degree and reduce the risk of deadlocks.
    - Now uses per-resource triggers for parallel incremental
      updates after {create,join,leave}-resource etc.
      Attention: check out the new templates in systemd-testing/
      (the old ones will no longer work).
    - New pseudo unit type .script : in place of being interpreted
      by systemd, you may now write some (wrapper) scripts for more
      complex operations, and/or for achieving idempotence.
      Although I would prefer native systemd units, I added this
      feature after I became desperate when trying to achieve true
      idempotence via native systemd units. After several months of
      fruitless attempts, I gave up and added .script .
      For details, please read the new docs.
      An example of .script can be found in the new systemd-icpu/ .
      Notice that "nodeagent" is a third-party tool which in turn
      may call systemctl for startup of LXC containers (after
      contacting a database, and a plethora of other things).
      Any attempts to call nodeagent via ExecStart= and co did not
      really work as they should.
    - Some new template engine features, like markers DEFAULT_START
      and much more.
  * Updated docs on systemd.
  * No other changes outside of the systemd area.

mars0.1astable118
  * Critical fix, only relevant for trial builds without pre-patch:
    kernel NULL deref, fixed by Gabriel Francisco.
    I am releasing this alone because the next release will take
    some more time.

mars0.1astable117
  * Minor fix, only relevant for k > 2 replicas: invalidate and
    log-purge-all could abort unnecessarily due to races.
    Workaround was possible by retrying the command.
  * Minor usability: new commands {de,}activate-guest for enabling
    or getting rid of temporary guests. Only relevant for clusters
    with > 2 members.
  * Some minor fixes and improvements.
  * Minor doc update (new commands).
  * Further dkms improvements / tuning from Gabriel Francisco.

mars0.1astable116
  * Critical fix, only relevant for cluster naming schemes where
    hostname A may be a _prefix_ of hostname B: such naming schemes
    could have led to a multitude of bizarre and unexplainable
    confusions and problems.
    Example: hostnames icpu-bs6 and icpu-bs60 .
    Please UPDATE when such hostnames may occur in the _same_
    cluster.
    I was unable to find the bug via the test suite because such
    hostnames were not deployed at the test machines.
    Thanks to Stephan Christiany who pointed me at the problem.
  * Fix annoying bug: during long-lasting sync (several TB), the
    automatic flipping between replay and sync could sometimes get
    stuck in sync mode, and then /mars could fill up because replay
    was starving unnecessarily (waiting for sync to finish, which
    could take a long time).
    Workaround was possible by "pause-sync"; wait until replay has
    caught up; "resume-sync".

mars0.1astable115
  * Critical regression from mars0.1astable113 / 114, only relevant
    when the new ssh-less peer operations are actually used:
    Race on peer thread creation could lead to kernel memory
    corruption.
    For maximum safety, please avoid the affected kernel module
    versions.
  * dkms improvements from Gabriel Francisco.

mars0.1astable114
  * Major usability: ssh-less {merge,split}-cluster.
    Now all cluster operations should work without ssh and its
    agent forwarding.
    Of course, you will need to update mars.ko and marsadm on all
    of your machines first.
  * Doc update: describe new options and behaviour.
  * Some smaller fixes / safeguards / improvements.

mars0.1astable113
  * Critical fix: deadlock was possible after receiving _corrupted_
    data over the network. Very unlikely to trigger, since there is
    a lot of other magic checking, but anyway.
    Now treated like any other communication error.
  * Major improvement: "marsadm primary" (also with --force) now
    does the equivalent of "up", after the operation has succeeded.
    This should be useful for people who forget to do the "up"
    manually after a manual unplanned failover.
  * Minor fix: race at {wait,update}-cluster leading to unnecessary
    abort.
  * Minor fix: update-cluster did not always transfer directories.
  * Minor fix: new join-cluster method could sometimes fail at the
    first try. Workaround by repetition.
  * Minor fix: primitive macros wait-todo-primary-{on,off} were
    documented, but not implemented.
  * Minor improvement: by the way, all missing combinations from
    {is,nr,todo}-secondary and
    wait-{is,todo}-{primary,secondary}-{on,off}
    are also implemented.
  * Minor improvement: try to automatically fetch any unknown peer
    info. May help after failed join-cluster & co.
  * Minor improvement: speedup new join-cluster method.
  * Minor doc update: describe new primitives.
  * Some smaller fixes and improvements.

mars0.1astable112
  * Critical fix: generic mars_readlink() did not work with an
    extremely low probability, so it slipped through years of
    testing.
    My reproducer indicates that it "fixed" itself after a while,
    just leading to some unnecessary delays.
    Nevertheless, I mark it "critical" from an HA viewpoint,
    although most people likely might have never noticed it.
    Recommendation: please update.
  * Major fix, only relevant for k > 2 replicas: fetch could get
    stuck in cyclic dependencies for some time, making only slow
    progress.
  * Major fix: join-resource could loop when the old method was
    selected and ssh was not working.
  * Minor fix: do not produce alive-timestamp & co on a fresh /mars,
    before {create,join}-cluster has been executed.
  * Several smaller fixes and improvements.

mars0.1astable111
  * Minor fix, only relevant for new deletions:
    Split-brain cleanup was sometimes stumbling over deleted
    logfiles. Workaround by cron.

mars0.1astable110
  * Minor improvement: new disk-error for better diagnosing any
    problems with disk setup / LVM etc.
  * Doc update (new macros etc).

mars0.1astable109
  * Regression from mars0.1astable106: when the old deletions were
    active, logfiles could be unlinked unnecessarily (displayed as
    Orphan).
    It did not really harm due to automatic re-fetching, but caused
    unnecessary network traffic.

mars0.1astable108
  * Improved metadata scalability.
  * Some smaller fixes and improvements.

mars0.1astable107
  * Critical regression from mars0.1astable106: use-after-free.
  * Fix use-after-free at rmmod.

mars0.1astable106
  * Major regression from mars0.1astable97: marsadm primitive
    disk-present erroneously reported the disk name in place of
    the boolean value 0 or 1.
  * Minor fix for new deletions (beta): invalidate /
    re-join-resource were sometimes hanging in Orphan due to
    a conflict with the new deletions.
  * Minor improvements: somewhat more improved scalability both in
    #resources and in #hosts.

mars0.1astable105
  * Minor marsadm regression from mars0.1astable104: race on _old_
    deletions could lead to lost deletions. Workaround by repeating
    any affected commands, e.g. leave-resource.

mars0.1astable104
  * Major fix: marsadm did not obey an abort of certain phased
    commands when a single resource argument was given.
    As a result, a wrong exit code could be returned in such a case.
  * Minor fix: when beta feature logfile digests were disabled
    _during_ operations, already existing old logfiles were not
    always checked correctly at the secondary, reporting
    DefectiveLog (although they were healthy).
    Workaround by just enabling again and invalidate.
    With the fix, you may now replay the old logfiles :)
  * Minor fix: inherent race between join-resource and log-rotate
    (unavoidable in the Distributed System) could lead to split
    brain, or to hanging replay. Now compensated.
  * Minor fix: join-cluster without ssh was sometimes not updating
    the local link tree immediately.
  * Usability (BETA feature): improved scalability in #hosts.
    The below BETA feature warnings apply.
    Do not exceed the "officially documented" limits too much.
  * Usability: join-resource avoids unnecessary fallback to
    ssh / rsync.
    IMPORTANT: please update marsadm first, before updating the
    kernel module. See the above compatibility rules.
    This time the compatibility rules are important. I know that
    marsadm < 0.1astable85 no longer does a reliable join-resource,
    while combinations with old 0.1astable95 appear to work.
    There is no merit in bisecting old marsadm releases, instead of
    simply updating the old userspace script in a controlled manner.
  * Usability: more accurate IOPS and friends.
  * Several smaller fixes and improvements.
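Illustration (not part of MARS): the rule "$marsadm_version + $epsilon >= $module_version" referenced in mars0.1astable104 can be checked mechanically for stable tagnames. This is a minimal sketch under two assumptions that are not stated in this ChangeLog: that within one stable series the bugfix revision is the number being compared, and that tags from different series are not comparable at all. The function names are hypothetical.

```python
import re

# Stable tagname layout as described under "Meaning of stable tagnames":
# mars<on-disk version>.<feature version>[series letter]stable<revision>
TAG_RE = re.compile(r"^mars(\d+)\.(\d+)([a-z]?)stable(\d+)$")


def parse_stable_tag(tag):
    """Split e.g. 'mars0.1astable104' into (0, 1, 'a', 104)."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a stable tagname: {tag!r}")
    disk, feature, series, revision = match.groups()
    return int(disk), int(feature), series, int(revision)


def marsadm_is_new_enough(marsadm_tag, module_tag, epsilon=0):
    """Check $marsadm_version + $epsilon >= $module_version.

    Assumption of this sketch: only the bugfix revision is compared,
    and only for tags of the same series.
    """
    adm = parse_stable_tag(marsadm_tag)
    mod = parse_stable_tag(module_tag)
    if adm[:3] != mod[:3]:
        raise ValueError("tags belong to different series")
    return adm[3] + epsilon >= mod[3]


if __name__ == "__main__":
    print(marsadm_is_new_enough("mars0.1astable104", "mars0.1astable104"))
    print(marsadm_is_new_enough("mars0.1astable85", "mars0.1astable104"))
```

With epsilon=0 this is the strict rule. Any epsilon > 0 is only as trustworthy as your own pre-live tests of that exact combination.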

mars0.1astable103
  * Major regression from mars0.1astable99: secondary replay could
    hang unnecessarily due to a cascade of race conditions.
    AFAICS consistency was not affected (thanks to md5
    checksumming).
    Observed with a specific load pattern at less than 1% of
    resources, or on average after ~ 120 operation hours when
    logrotate was 12 times per hour.
    Unfortunately, it slipped through all my release tests due to
    relatively low trigger probability.
    Workaround by "invalidate", which is however not a good
    solution.
    Please avoid kernel module versions between *99 and *102 for
    production.

mars0.1astable102
  * Major usability (BETA): scalability in number of hosts.
    It should have no visible side effect in functionality, but
    better non-functional properties.
    Tested in the _lab_ with 1000 additional dummy hosts and
    additionally 8000 dummy resources in total.
    BETA WARNING: at the moment, there are no practical
    experiences. There might be problems which might not show up
    during lab tests.
    Do not blindly rollout or merge-cluster big masses in
    production.
    I will tell you when practical experiences allow for raising
    the "official" limits as documented in the user manual.

mars0.1astable101
  * Major usability: join-cluster now works without ssh.
    Of course, you need to rollout the new marsadm and the new
    mars.ko first, and to modprobe it at any pre-existing cluster.
    The new feature is automatically activated when you modprobe
    _before_ doing join-cluster.
    By running join-cluster first (without modprobe), you can fall
    back to the old ssh + rsync based method.
    Important: now you can modprobe before /mars/uuid is created
    or retrieved.
    Previously, you could accidentally try the wrong sequence
    "modprobe mars; mount /mars" without harm because it was denied
    by the missing uuid, but now such illegal attempts would result
    in a big mess.
    Such a mess is now prevented by always insisting on /mars being
    a mountpoint.
    This might break old ill-behaved scripts or buggy /etc/fstab or
    racy systemd dependencies, which need to be fixed.
    Always ensure that no modprobe is attempted before /mars has
    been mounted in a race-free and reboot-safe manner.
    Notice: merge-cluster and split-cluster are not yet ssh-free
    zones. This will be addressed in a later release.
  * Minor usability: show age of any hanging /dev/mars/ IO requests.
    This is useful for diagnosing faulty RAID controllers etc.
  * Lots of further minor fixes and improvements.

mars0.1astable100
  * Minor fix: UpToDate was not reported in a very weird corner
    case.
  * Minor fix, only relevant when the new deletion method is
    enabled: leave-resource did sometimes not delete all
    superfluous logfiles at the other peers, sometimes not clearing
    a split brain situation immediately.
    Workaround by cron which did the cleanup later.
  * Minor usability: reduced verbosity of "marsadm view all" with
    respect to the new compression / digest features.
    Full info can be obtained with --verbose.
  * Minor fix, only observed at join-cluster without ssh:
    Not all symlink infos were transferred in a corner case.
  * Further minor fixes and improvements.

mars0.1astable99
  * Minor fixes: some more corner cases of unnecessary split brain
    rarely occurring after fatal primary crashes.

mars0.1astable98
  * Minor regression from mars0.1astable97: when old kernel modules
    < mars0.1astable97 were combined with exactly that marsadm
    version, the presence of /dev/mars/$resource was detected
    incorrectly.
    Do not use exactly that combination. Simply skip the marsadm
    version mars0.1astable97.
    Other version combinations are still possible for independent
    and rolling updates of kernel and marsadm.
    Best practice: first update marsadm to mars0.1astable98 or
    newer, so this bug is fixed, and then your rolling kernel
    updates will work again for updating or even downgrading old
    kernels.
  * Minor fix: in a hardly reachable corner case, detach was
    hanging. Workaround by rmmod was possible.
  * Minor fix: spurious races at join-resource without ssh could
    occur, so it sometimes did not notice that a new resource was
    added in the meantime. Usage of ssh, or just retrying, was
    helpful. Thus hardly relevant in practice.
  * Various minor fixes and improvements.
    Some masked bugs, not visible, only triggerable by a future
    version of MARS.

mars0.1astable97
  * Critical fix: when a logfile is damaged (e.g. after a primary
    crash), some corner cases of primary recovery could hang.
    Workaround by "detach ; attach" seemed possible (as far as
    observed during testing).
  * Critical fix for BETA feature network compression only:
    Memory deallocation could fail under certain circumstances,
    resulting in a memory leak, or potentially memory corruption.
    Only relevant when network transport compression is enabled.
  * Major fix: when a primary crash was occurring exactly during
    a very short log-rotate time window, a race condition could
    sometimes lead to unnecessary split brain (secondaries could
    bypass the primary).
  * Several minor fixes and improvements.

mars0.1astable96
  * Minor improvement: auto-correct defective symlink timestamps
    which are too far in the future. This can happen when running
    with a defective CMOS hardware clock, e.g. after a fatal
    hardware failure, and before ntpd has corrected the local clock.
  * Minor usability: more pretty formatting of compression and
    digest flags in "marsadm view".

mars0.1astable95
  * Minor fix: sometimes, in a hardly relevant corner case,
    join-resource could abort unnecessarily.
  * Minor improvement: marsadm view now distinguishes role
    ForcedPrimary from plain Primary.
    This could help a larger team of sysadmins to notice a
    potentially upcoming SplitBrain earlier, even while the network
    is interrupted and an actual SplitBrain therefore cannot be
    detected, although it may be suspected.
  * Reduce footprint of some deprecated marsadm functions and
    macros.

mars0.1astable94
  * Major regression from mars0.1astable86:
    Memory leak in remote communication.
    This could accumulate over a longer time.
    Please update when affected.

mars0.1astable93
  * Minor improvement: in some special cases, secondaries may now
    follow primaries having a damaged logfile.

mars0.1astable92
  * Major improvement from an operational perspective:
    "marsadm view all" now reports the current status of
    /dev/mars/mydata in human-readable form, including the Open
    status, the current IOPS, the number of currently in-flight IO
    requests = IO queue length = indicator for IO problems or
    overload, and any error information.

mars0.1astable91
  * Major features, disabled by default:
    - Network transport compression.
      May improve network bottlenecks.
    - Transaction logfile payload compression.
      May improve the filling speed of /mars.
  * Major feature, enabled by default:
    - More logfile checksumming digests, some consuming less CPU.
  * Rough benchmarks, to support your activation decisions.
    Please read mars-user-manual.pdf for instructions.
    Rolling updates with mixed versions are supported.

mars0.1astable90
  * Minor improvement: better responsiveness.
    This release is meant as an anchor point in case you would need
    a downgrade.

mars0.1astable89
  * Minor improvement: better kernel module responsiveness.
    More on scalability is in the dev pipeline. For now, use
    marsadm --timeout=300 or similar when stretching the official
    limits (but don't stretch too much until I have improved all
    relevant parts).

mars0.1astable88
  * New experimental scalability feature, deactivated by default:
    New deletion method, uses the special symlink value ".deleted"
    as a marker for logically deleted symlinks.
    This leads to a _massive_ simplification of code, and improves
    scalability for future masses of resources and/or cluster hosts.
    After updating both mars.ko and marsadm, you may activate it
    via marsadm option --delete-method=0 but ONLY FOR TESTING.
    I will tell you when it will be stable enough for production.
    At some point in the future, it will hopefully become the
    default, and eventually the old complex code can hopefully be
    purged after the whole world uses the new method.
    Note: when never activated, it should not have any influence on
    old-style production. Both methods can be used in parallel on
    different clusters. So you can activate it on some test
    clusters first.
    Do not _directly_ roll back to old mars.ko and/or marsadm
    versions after activation.
    First deactivate the feature via --delete-method=1, then wait
    for a few hours until marsadm cron has done purging.
    "find /mars -type l -ls" must no longer report any
    "-> .deleted" values anywhere in the entire cluster.
    Then you can roll back to old releases.
  * Doc: small update on new marsadm command link-purge-all.

mars0.1astable87
  * Minor fix: unnecessary split brain could result from a race
    between handover and log-rotate / cron.

mars0.1astable86
  * Minor improvement: speedup metadata traffic avoiding some O(n^2)
    internal algorithms.

mars0.1astable85
  * Minor improvement: avoid ssh / rsync at join-resource.
    Only when ordinary communication over port 7777 (default)
    fails, fall back to ssh connections.
  * Minor marsadm speedup by avoidance of unnecessary sleep times.
  * Minor fix: ensure that primary --force works even when
    a logfile was truncated forcefully.
  * Minor fix: use-after-free reported by KASAN, only triggerable
    with a future development version, not observed with the
    current stable version. I include it here for safeguarding.
  * Minor doc updates. Explain fundamental requirements for
    geo-redundancy, and some background on cost comparisons.

mars0.1astable84
  * Major improvement: try to automatically self-repair any
    defective logfile at secondaries, by fetching again from the
    primary.
    This can only work when the version at the primary is healthy.
    When successful, "invalidate" is no longer necessary.
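Illustration (not part of MARS): the rollback precondition from mars0.1astable88, namely that "find /mars -type l -ls" no longer reports any "-> .deleted" values, can also be checked by a small script. This is a minimal sketch. The function name is hypothetical, and it only inspects the local filesystem, so it has to be run on every host of the cluster.

```python
import os


def find_deleted_markers(root):
    """Return sorted paths of symlinks below root that point to '.deleted'."""
    hits = []
    # os.walk() does not follow symlinks by default. Symlinks show up
    # in dirnames or filenames depending on their target, so both
    # lists are inspected.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) and os.readlink(path) == ".deleted":
                hits.append(path)
    return sorted(hits)


if __name__ == "__main__":
    leftovers = find_deleted_markers("/mars")
    for path in leftovers:
        print(f"{path} -> .deleted")
    print("rollback precondition met on this host" if not leftovers
          else f"{len(leftovers)} marker(s) still present")
```

An empty result on one host says nothing about the other hosts. The ChangeLog requires the markers to be gone in the entire cluster.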

mars0.1astable83
  * Major improvement: new marsadm option --parallel can
    drastically speed up handover, provided that the rest of your
    infrastructure can deal with parallelism.
    Several cluster managers are known to have problems with that.
    So be careful, do not blindly use this feature!
    Future releases will try to improve the systemd interface such
    that parallelism is possible without problems.
  * Doc updates: describe dimensioning of storage networks and its
    realtime behaviour, against the background of Kirchhoff's law.
    Neglecting this may lead to much higher cost than necessary,
    and may lead to a variety of operational problems, up to
    failures of projects. Also, working with wrong definitions of
    Cloud Storage can lead to a similar effect.
    Recommended reading!

mars0.1astable82
  * Major improvement: the mars_main kernel thread is now working
    non-blocking in practically all relevant cases. Some more cases
    will be addressed in future.
    Testing with 32 resources in parallel is now working, and even
    64 resources appear to work in the lab, although somewhat
    slower (on typical server iron).
    "marsadm primary all" is now much faster. More future
    improvements to come.
    Currently, "marsadm primary all" uses an internal barrier
    synchronisation model, which may lead to unnecessary waiting
    time for faster resources. There are plans to address this in
    future releases.
    ATTENTION! You will need NEW VERSIONS of your pre-patch.
    This will automatically adjust /proc/sys/fs/aio-max-nr to
    higher values when needed.
    If you don't use the new pre-patch, you will need to tune
    /proc/sys/fs/aio-max-nr yourself. Otherwise you will get
    serious operational deadlocks due to virtual resource
    limitations, even with only 32 resources, but a higher number
    of replicas.
    Since there is no practical experience yet (the biggest known
    productive installation uses only 24 resources), I do not yet
    increase the official limits as documented in the appendix of
    mars-user-manual.pdf.
    Although very slow due to some O(n^2) algorithms, 128 resources
    are just surviving now, without bombing or deadlocking, but are
    not yet really usable.
    Therefore, do not try to stretch the official limits too much.
    Please report any success stories (or problems) in case you are
    using some more resources _productively_.
  * Minor doc improvements. New slides from LCA2020 added.

mars0.1astable81
  * Minor doc improvement: explain why running MARS inside of VMs
    is a bad idea. Explain fully managed geo-location transparency
    of VMs.

mars0.1astable80
  * Compatibility up to kernels <= 4.14.
    Attention! There is a bug in upstream kernels >= 4.11, leading
    to an endless loop in kernel mode under certain preconditions.
    The fix is in pre-patches/vanilla-4.14/0001-sched-wait-fix-*
    If you _forget_ to apply this fix for _affected_ kernels, you
    may get "operational fun" at the wrong moment: ordinary
    operations will likely be unaffected, but a _silent_ network
    outage at the wrong moment (race condition) may hang up your
    kernel at the secondary site, just in the moment when you
    probably want to do a failover.
    LTS kernels 4.9 and earlier are not affected by the bug,
    although potentially present also there, but it is a _masked_
    (sleeping) bug there.
    I already submitted the fix to LKML, but unfortunately it has
    been ignored up to now.

mars0.1astable79
  * Critical fix: in a multiple-failure scenario which is hard to
    reach, and then acting badly by disregarding heavy warnings
    from marsadm and from mars-user-manual.pdf, data consistency
    could be violated.
    Detected by testing (the situation has not been observed in
    practice up to now).
    When unsure, better update to this fixed version.
  * Minor fix: in a rare corner case plus an additional rare race,
    primary handover could hang.
  * Major systemd interface fixes and improvements:
    - When handover fails due to failed systemd stopping at the old
      primary (e.g.
      hanging umount etc), the application stack will be
      automatically restarted before the handover operation reports
      timeout. The idea is to keep your applications running
      whenever possible.
    - New commands marsadm set-systemd-want and get-systemd-want for
      a temporary shutdown of the systemd unit stack. This is
      useful e.g. for performing an fsck.
    - Implemented transitive closure of indirectly referenced
      further systemd units.
    - Attach / detach now automatically starts / stops the systemd
      unit stack.
    - Improved reliability of systemd handover.
    - Fixed many bugs in the systemd template macro processor.
    - Updated doc accordingly.

mars0.1astable78
  * Major or minor fix: memory leak, triggered under rare
    conditions. Observed cases were a few kilobytes. However,
    it could accumulate over a very long time.
    When unsure, better update to this version.
  * Minor usability: report each resource size.

mars0.1astable77
  * Major doc update: the old mars-manual.pdf has been split into
    - mars-user-manual.pdf (for sysadmins)
    - mars-architecture-guide.pdf (for managers and architects)
    - mars-for-kernel-developers.lyx (unfinished)
    - football-user-manual.lyx
    The first two manuals have been heavily rewritten and extended!
  * Minor fix: after primary crash without failover, the secondaries
    could get stuck because a version symlink was not updated under
    rare preconditions.
  * Minor improvement: emergency space calculation is now more
    accurate.
  * Minor usability: hint when marsadm resize would be possible.
  * Several minor cosmetic improvements.

mars0.1astable76
  * Major fix: when the primary was dead and the secondary had an
    incomplete logfile which was not recognized as being damaged,
    "primary --force" did not always work.
  * Minor fix: some config information was not replicated throughout
    the cluster. Ordinary users were typically not affected.
  * Minor improvement: marsadm view now shows the replication degree
    [$x/$y] at each individual resource.
  * Added slides from FrOSCon2019.

mars0.1astable75
  * Major fix, only relevant for a rare corner case:
    When overflowing the kernel fscache with gigabytes of data,
    and when a few more weird preconditions were met, it was
    possible to eat up the whole kernel memory and to trigger OOM.
    Notice: depending on kernel version, and depending on various
    overload scenarios, you may trigger OOM anyway, independently
    from MARS.
  * Minor fix: marsadm now is reporting the amount of Writeback data
    (as necessary for the Recovery phase after a crash) more
    precisely.
  * Minor improvement: speedup IOPS by better internal hash
    dimensioning.

mars0.1astable74
  * Full merge of mars0.1stable74, which was the last stable release
    in EOL branch mars0.1.y.
  * Major fix, only relevant for a corner case:
    Writeback made no human-visible progress under multiple weird
    preconditions.
  * Minor fix: ssh connections should be more robust when clumsy
    firewalls are leading to ssh hangs.
  * Minor usability improvement: marsadm view shows more fancy
    details on logfile numbers.
  * Minor speedups in internal infrastructure.
  * Football subproject: update to Football-2.0

mars0.1astable73 (merged from mars0.1stable73)
  * Critical fix, only relevant for kernels >= 4.2.x:
    NULL deref occurs systematically when more than 64 file handles
    are being allocated. There is already an upstream bugfix in
    linux-next (missing initializer for resize_wait in fs/file.c).
    Since this fix is missing in many LTS and distro kernels
    (at the moment), I added a workaround in MARS.
    Recommendation: anyone operating MARS on newer kernels should
    update to mars0.1astable73 for safe operations.
    Don't leave this unfixed. It can explode at the worst moment,
    and restoring operations may only be possible by completely
    giving up a secondary host, or with a fix.

mars0.1astable72 (merged from mars0.1stable72)
  * Minor fix: writeback improved in a corner case.
  * Minor improvement: display WriteBack data amount in marsadm view.
  * Major doc improvement: describe IO performance tuning.

mars0.1astable71 (merged from mars0.1stable71)
  * Major fix: writeback at the primary was unnecessarily slow in
    certain situations.

mars0.1astable70 (merged from mars0.1stable70)
  * Critical fix: a few upper-layer kernel components are allocating
    struct bio on the stack. This led to stack memory corruption.
    If you ever had this problem, you certainly have noticed it ;)
    Thus it should not have affected your data.
    Unfortunately, I got no bug reports about this for several years.
    Discovered when testing compatibility to very new kernels,
    and now hopefully fixed.
  * Major fixes: the systemd interface was not in a mature state.
    Now improved a lot. More improvements are likely to follow
    in the next months.
  * Minor clarification: build for ancient kernel 2.6.32 was broken.
    Fixing the build was no problem, but then the resulting kernel
    deadlocked in certain situations (sb_mount mutex and sisters).
    The reason is that stacking of filesystem instances
    (like /vol/mydata relying on IO to /mars) is a pain in the very
    old kernel architecture.
    Any upstream kernel before 3.16 is EOL right now.
    Nevertheless, I am officially supporting 3.2 at the moment,
    and have tested it.
    Anyway, productive use of ancient kernels is not recommended,
    for various reasons. Notice that you also need old gcc versions
    for building such EOL kernels.
    Thus I decided to remove support for 2.6.32 officially.
    If somebody _really_ needs it, please contact me.

mars0.1astable69 (merged from mars0.1stable69)
  * Major improvement: compatibility to upstream kernel 4.9.x.

mars0.1astable68 (merged from mars0.1stable68)
  * Minor fix: sometimes sync was advancing only slowly.
  * Minor fix: in extremely rare cases and under further conditions,
    detach could hang due to a race. Workaround was possible by
    re-attaching.
  * Minor improvement: /dev/mars/mydata now disappears only after
    writeback has finished.
    Although the old behaviour was correct, certain userspace tools
    could have erroneously concluded that the primary has finished
    working. The new behaviour is hopefully closer to user
    expectations.
  * Minor improvement: propagate physical and logical sector sizes
    from the underlying disk to /dev/mars/mydata. This can help
    mkfs and other tools make better decisions about their
    internal parameters.
  * Minor safeguard: disallow manual --ignore-sync override when the
    target primary is inconsistent, only relevant for (non-existent)
    sysadmins who absolutely don't know what they are doing when
    they are combining this with --force. Sysadmins who really know
    what they are doing can use fake-sync in front of it, and then
    they are explicitly stating once again that they really want to
    force a defective system, and that they really know the fact
    that it is defective.
  * Minor improvement: additional warning when network connections
    are interrupted (asymmetrically), such as by mis-configuration
    of network interfaces / routing / firewall rules / etc.

mars0.1astable67 (merged from mars0.1stable67)
  * Minor fix: don't unnecessarily alert sysadmins when no systemd
    unit files are installed.
  * Minor doc update: new slides from LCA2019, updated old slides
    from FrOSCon2018.
  * Minor doc update: describe some more use cases, add some advice
    for managers.

mars0.1astable66
  * Merge mars0.1stable66. In detail:
  * Critical fix, only relevant for kernels 4.3 to 4.4:
    Due to a forgotten adaptation to newer kernels, some userspace
    tools like xfs_repair could read/write wrong data upon _large_
    IO requests, and/or kernel memory corruption could occur.
    Kernel-level filesystems are typically _not_ affected because
    they typically use 4k pages at maximum.
    If you are operating such a kernel, please upgrade to minimize
    any risks.
    You probably want userspace tools like xfs_repair to not crash
    your kernel ;)
    The problem was reproducibly detected at lab regression testing,
    _before_ updating a big installation from kernel 3.16 to 4.4.
    It did not show up with the old kernel.
    Notice: kernels >4.6 are not yet supported at the moment, but
    work on them is likely being continued during the next months.
    Stay tuned.
  * Minor doc updates.

mars0.1abeta18
  * Merge mars0.1stable65.

mars0.1abeta17
  * Merge mars0.1stable64.
  * Fix compiler warning at certain kernel versions.

mars0.1abeta16
  * Merge mars0.1stable63.

mars0.1abeta15
  * Merge mars0.1stable62.

mars0.1abeta14
  * Merge mars0.1stable61.

mars0.1abeta13
  * Minor feature: marsadm takes comma-separated list of resource
    names in place of "all".
  * Merge mars0.1stable60.

mars0.1abeta12
  * Merge mars0.1stable59.

mars0.1abeta11
  * Merge mars0.1stable58.

mars0.1abeta10
  * Make IP_TOS compile-time configurable.
  * Update doc on IP_TOS.

mars0.1abeta9
  * Major feature: lowlevel TCP tuning, separately for traffic types
    MARS_TRAFFIC_META (default port 7777), and
    MARS_TRAFFIC_REPLICATION (default port 7778), and
    MARS_TRAFFIC_SYNC (default port 7779).
  * Merge mars0.1stable57.

mars0.1abeta8
  * Merge mars0.1stable56.

mars0.1abeta7
  * Merge mars0.1stable55.

mars0.1abeta6
  * Merge mars0.1stable54.

mars0.1abeta5
  * Merge mars0.1stable53.

mars0.1abeta4
  * Merge mars0.1stable52.

mars0.1abeta3
  * Merge mars0.1stable51.

mars0.1abeta2
  * Merge mars0.1stable50.
  * Silence annoying false-positive network interruption messages.

mars0.1abeta1
  * Merge mars0.1stable49.
  * Several smaller fixes.

mars0.1abeta0
    Forked off from 0.1balpha4.
    Merge 0.1stable48 (in several intermediate steps).
    Some infrastructure for version detection.
    Backport of selected fixes from branch 0.1b.y.
    Add marsadm split-cluster.
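Note on comparing tag names: the tags in this changelog follow a regular shape (prefix "mars" or the historic "light", a series such as 0.1a, a phase alpha / beta / stable, and a revision which may be dotted). The following is only a minimal sketch for scripts which need to order such tags. It is not part of MARS: the helper name parse_tag and the regular expression are assumptions derived from the tag names visible here.

```python
import re

# Hypothetical helper, not part of MARS. The pattern is inferred from tag
# names appearing in this changelog, e.g. mars0.1astable83, mars0.1abeta18,
# mars0.1balpha3.4, light0.1stable26.
_TAG_RE = re.compile(
    r"^(?:mars|light)(?P<series>\d+\.\d+[a-z]?)"
    r"(?P<phase>alpha|beta|stable)"
    r"(?P<rev>\d+(?:\.\d+)*)$"
)

# alpha sorts before beta, beta before stable
_PHASE_ORDER = {"alpha": 0, "beta": 1, "stable": 2}


def parse_tag(tag):
    """Return (series, phase_rank, revision_tuple) for a release tag."""
    m = _TAG_RE.match(tag)
    if m is None:
        raise ValueError("not a recognised tag name: %r" % tag)
    # A dotted revision like "3.4" becomes (3, 4) so that it compares
    # numerically instead of lexically.
    rev = tuple(int(part) for part in m.group("rev").split("."))
    return (m.group("series"), _PHASE_ORDER[m.group("phase")], rev)
```

Used as a sort key, e.g. sorted(tags, key=parse_tag), this orders mars0.1abeta9 before mars0.1abeta18, which a plain string sort would not do. Comparing tags across different series is only a lexical comparison of the series string and should not be relied upon.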
-----------------------------------
Changelog for the deprecated series 0.1b:
(only the part which has been merged with branch mars0.1a)
(notice that there were a few more historic branches which
were not really usable, and never went into production)

mars0.1balpha4
--------
  * First improvements for scalability to thousands of nodes.
    Not yet tested with really huge masses of nodes, only with
    relatively small clusters.
  * Merge fixes from mars0.1stable41 (see there)
  * Doc update on socket bundling.

mars0.1balpha3.4
--------
  * Merge fix from mars0.1stable40 (see there)

mars0.1balpha3.3
--------
  * Merge fixes from mars0.1stable39
  * Major fix: copy was sometimes hanging.
  * Minor fix: unnecessary delay of metadata propagation.
  * Performance improvements / bottleneck enhancements:
    - Lamport clock
    - Network
    - md5 checksumming
  * Userspace: faster logfile deletion via cron job.

mars0.1balpha3.2
--------
  * Merge mars0.1stable38: now compiles without pre-patch on certain
    kernel versions. Please read ChangeLog there.

mars0.1balpha3.1
--------
  * Minor fix: deadlock on termination of copy thread.

mars0.1balpha3
--------
  * Some tuning (more to come later):
  * Speedup network by better corking.
  * New scalable Lamport clock implementation.

mars0.1balpha2
--------
  * Socket bundling (cherry-picked from mars0.2.y).
  * Speedup copy processes (sync, logfile transfer).
  * Speedup bio and md5 checksumming.

mars0.1balpha1
--------
  * First improvements for scalability to more than 10 resources
    per node. Already tested with 128 resources on a pair of nodes.
    More improvements to come later.
    No functional changes otherwise (from a sysadmin perspective).
    Rollback to stable series 0.1 should be possible at any time.
  * Include fix from 0.1stable37.

mars0.1balpha0
--------
  * Minor fix: the 1&1 specific feature set-sync-pref-list was not
    used at all.
    Without it, the limitation feature for the sync parallelism
    degree did not work correctly (without causing harm, other than
    suboptimal sync throughput / performance).
    Removed the old _obsolete_ feature (for formal reasons, this
    cannot be done in the 0.1stable branch).
    Re-implemented the feature in a very simple form, which is
    hopefully "obviously correct" now.
  * Minor feature: please use "marsadm cron" as a fool-proof short
    form, in particular at cron jobs.

-----------------------------------
Changelog for series 0.1:

Attention! This branch is now EOL.
Everything has been merged into branch mars0.1a.y which is also
the master branch.
PLEASE UPGRADE to the new branch.
Upgrade is easy: just roll out the new marsadm version, install the
new kernel modules, and load them where possible. Mixed operation of
different versions is no problem, but is of course not the desired
state, so keep this period as short as possible.
Rollback is also easy.
Motivation: branch 0.1a is productive for several years at 1&1.
Experiences: now runs provably better than 0.1.y with better
performance, smoother, etc.

mars0.1stable74 (last stable release in branch mars0.1.y)
  * Major fix, only relevant for a corner case:
    Writeback made no human-visible progress under multiple weird
    preconditions.
  * Minor usability improvement: marsadm view shows more fancy
    details on logfile numbers.

mars0.1stable73
  * Critical fix, only relevant for kernels >= 4.2.x:
    NULL deref occurs systematically when more than 64 file handles
    are being allocated. There is already an upstream bugfix in
    linux-next (missing initializer for resize_wait in fs/file.c).
    Since this fix is missing in many LTS and distro kernels
    (at the moment), I added a workaround in MARS.
    Recommendation: anyone operating MARS on newer kernels should
    update to mars0.1astable73 for safe operations.
    Don't leave this unfixed.
    It can explode at the worst moment, and restoring operations
    may only be possible by completely giving up a secondary host,
    or with a fix.

mars0.1stable72
  * Minor fix: writeback improved in a corner case.
  * Minor improvement: display WriteBack data amount in marsadm view.
  * Major doc improvement: describe IO performance tuning.

mars0.1stable71
  * Major fix: writeback at the primary was unnecessarily slow in
    certain situations.

mars0.1stable70
  * Critical fix: a few upper-layer kernel components are allocating
    struct bio on the stack. This led to stack memory corruption.
    If you ever had this problem, you certainly have noticed it ;)
    Thus it should not have affected your data.
    Unfortunately, I got no bug reports about this for several years.
    Discovered when testing compatibility to very new kernels,
    and now hopefully fixed.
  * Major fixes: the systemd interface was not in a mature state.
    Now improved a lot. More improvements are likely to follow
    in the next months.
  * Minor clarification: build for ancient kernel 2.6.32 was broken.
    Fixing the build was no problem, but then the resulting kernel
    deadlocked in certain situations (sb_mount mutex and sisters).
    The reason is that stacking of filesystem instances
    (like /vol/mydata relying on IO to /mars) is a pain in the very
    old kernel architecture.
    Any upstream kernel before 3.16 is EOL right now.
    Nevertheless, I am officially supporting 3.2 at the moment,
    and have tested it.
    Anyway, productive use of ancient kernels is not recommended,
    for various reasons. Notice that you also need old gcc versions
    for building such EOL kernels.
    Thus I decided to remove support for 2.6.32 officially.
    If somebody _really_ needs it, please contact me.

mars0.1stable69
  * Major improvement: compatibility to upstream kernel 4.9.x.

mars0.1stable68
  * Minor fix: in extremely rare cases and under further conditions,
    detach could hang due to a race. Workaround was possible by
    re-attaching.
  * Minor improvement: /dev/mars/mydata now disappears only after
    writeback has finished. Although the old behaviour was correct,
    certain userspace tools could have erroneously concluded that
    the primary has finished working. The new behaviour is hopefully
    closer to user expectations.
  * Minor improvement: propagate physical and logical sector sizes
    from the underlying disk to /dev/mars/mydata. This can help
    mkfs and other tools make better decisions about their
    internal parameters.
  * Minor safeguard: disallow manual --ignore-sync override when the
    target primary is inconsistent, only relevant for (non-existent)
    sysadmins who absolutely don't know what they are doing when
    they are combining this with --force. Sysadmins who really know
    what they are doing can use fake-sync in front of it, and then
    they are explicitly stating once again that they really want to
    force a defective system, and that they really know the fact
    that it is defective.
  * Minor improvement: additional warning when network connections
    are interrupted (asymmetrically), such as by mis-configuration
    of network interfaces / routing / firewall rules / etc.

mars0.1stable67
  * Minor fix: don't unnecessarily alert sysadmins when no systemd
    unit files are installed.
  * Minor doc update: new slides from LCA2019, updated old slides
    from FrOSCon2018.
  * Minor doc update: describe some more use cases, add some advice
    for managers.

mars0.1stable66
  * Critical fix, only relevant for kernels 4.3 to 4.4:
    Due to a forgotten adaptation to newer kernels, some userspace
    tools like xfs_repair could read/write wrong data upon _large_
    IO requests, and/or kernel memory corruption could occur.
    Kernel-level filesystems are typically _not_ affected because
    they typically use 4k pages at maximum.
    If you are operating such a kernel, please upgrade to minimize
    any risks.
    You probably want userspace tools like xfs_repair to not crash
    your kernel ;)
    The problem was reproducibly detected at lab regression testing,
    _before_ updating a big installation from kernel 3.16 to 4.4.
    It did not show up with the old kernel.
    Notice: kernels >4.6 are not yet supported at the moment, but
    work on them is likely being continued during the next months.
    Stay tuned.
  * Minor doc updates.

mars0.1stable65
  * Major fix, only observed during KASAN debugging:
    Use-after-free which appears to splat only at Football during
    final deletion of resources. Never observed at production.
    Update if you are very cautious.
  * A few minor fixes, not relevant for production.
  * Minor doc improvements.

mars0.1stable64
  * Major regression: split-brain detection did not display
    correctly.
  * Minor fix: rare race condition on O_NONBLOCK networking.
    Only observed during testing with kernel 4.9 (sorry, _all_ the
    adaptations are not yet ready for release, but it is making
    progress now). I am not sure whether this bug could also trigger
    with kernel 4.4 or earlier, therefore I am releasing the fix
    beforehand.
  * Minor doc: architectural explanations.

mars0.1stable63
  * Minor fix: when compiling for some newer kernels (only there),
    schedule() could be called during wait for some condition,
    worsening performance unnecessarily.
  * Minor improvement: starting join-resource in batches was slow
    because each was waiting for cluster communication.
    Use a manual "marsadm wait-cluster" before starting batches of
    join-resource operations.
  * Doc: some clarifications on BigCluster scalability behaviour.

mars0.1stable62
  * Minor fix: race between join-resource and log-rotate.
  * Minor fix: report split brain logfile amount only when actually
    detectable.
  * Minor improvement: shift annoying error message over to Orphan
    state detection.
  * Football: update to Football-2.0-RC12
  * doc: some updates.
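Note on batched join-resource: the advice given under mars0.1stable63 (one manual "marsadm wait-cluster", then the batch of join-resource operations) can be scripted. The following is only a sketch which builds the command sequence and does not execute anything. The function name plan_batch_join, the device_for callback, and the argument layout "join-resource <resource> <device>" are assumptions for illustration; please check the exact syntax in mars-user-manual.pdf.

```python
def plan_batch_join(resources, device_for):
    """Build the command sequence for joining several resources in a batch.

    A single 'marsadm wait-cluster' is placed up front, followed by one
    join-resource command per resource.

    resources  -- iterable of resource names
    device_for -- hypothetical callback mapping a resource name to the
                  local underlying device path
    """
    commands = [["marsadm", "wait-cluster"]]
    for res in resources:
        # Assumed argument layout: join-resource <resource> <device>
        commands.append(["marsadm", "join-resource", res, device_for(res)])
    return commands
```

Each returned list can be handed to subprocess.run(cmd, check=True) one after the other. Keeping the planning step separate from the execution makes it easy to print and review the batch before anything is run.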
mars0.1stable61
  * Minor fix: in very rare cases where some symlinks are missing,
    don't abort in try_to_avoid_splitbrain().
  * Minor improvement: better human-readable numbers.
  * Minor doc: more on asynchronous background operations.

mars0.1stable60
  * Major improvement: new option --ignore-sync allows primary
    Handover without --force even when some sync is running
    somewhere. Any running syncs will restart from scratch (which
    might take some time, depending on LV size and many more
    factors like the network).
  * Minor fix: split-cluster did not work correctly when no
    resources existed anymore at all.
  * Doc: major update. More explanation on CAP theorem, and on
    differences / commonalities with DRBD.

mars0.1stable59
  * Major fix: "marsadm up" did not work when sync could not be
    started. Now does "best effort".
  * Minor fix: marsadm system interface was active when not
    activated.
  * Minor usability improvement: new replication state "Orphaned"
    indicates that logfiles are missing, and thus replication is
    stuck.

mars0.1stable58
  * Major fix for Football / split-cluster: for safety, cron deletes
    some blocking left-overs.
  * Major fix at _asymmetric_ split-cluster: ignore hindering abort
    condition.
  * Minor fix: not all internal systemd links were removed upon
    marsadm set-systemd-unit mydata "".
  * Doc: Football.
  * Doc: architectural treatment of centralized storage.

mars0.1stable57
  * Minor fix: silly deadlock upon rare race at logging.
    Without debug logging, probability should be extremely low
    (only observed at rmmod).
  * Added initial version of systemd templates (for future backward
    compatibility with branch 0.1a).
  * Doc: systemd templates.

mars0.1stable56
  * Minor fix: split-cluster could unnecessarily abort in some cases.
  * Added initial version of submodule "football". More updates will
    follow.

mars0.1stable55
  * Major fix: unnecessary / false positive split brain could occur
    after the primary logfile was truncated, e.g. at crashes or
    disk damages.
    Systematic triggering in masses was possible by keeping
    /dev/mars/mydata mounted while _forcing_ a reboot _during_ (!)
    its umount (e.g. by patching the "reboot" command and/or
    patching systemd dependencies or similar to provoke this
    regularly).

mars0.1stable54
  * Major fix, only relevant for massive execution of
    leave-resource, e.g. when playing Football (Tetris) games:
    When non-versioned symlinks were eventually deleted, later
    re-creation did not always succeed. Fixed by a new generic
    timestamp ordering approach.
  * Stability client-side fixes (could lead to stacktraces),
    backported from branch 0.1a (were forgotten long ago).
  * Major doc update: new section on reliability of storage
    architectures. This explains why many BigCluster systems don't
    work as expected. Backed up by graphs and by mathematical
    formulas. A must-read for anyone working in the storage area!

mars0.1stable53
  * Major fix: rare corner case of split brain was not displayed
    correctly.
  * Major usability: show amount of data during split brain.
    This gives sysadmins a hint about the size of future data loss
    at later split brain resolution.
  * Minor workaround: crashed /mars filesystems may contain
    completely damaged symlinks with timestamps in the far distant
    future, e.g. year >3000 etc. Safeguard unusual Lamport time
    slips by ignoring implausible values.
  * Major improvement: internal locking overhead reduced.
  * Minor improvement: reduce message trigger overhead.
  * Several minor improvements.
  * Doc updates.

mars0.1stable52
  * Major contrib: new example scripts for MARS background data
    migration during production. 1&1-specific code in a separate
    plugin. You can write your own plugins for adaptation to your
    needs.
  * Minor fix: limit the size of the writeback buffer by the rest
    space in /mars. This is only relevant when /mars is dimensioned
    smaller than RAM (which should never be the case in production
    systems, but might happen accidentally or for testing).
    Analogously, limit the maximum logfile size.
  * Minor fix: prevent creation of many tiny logfiles over time when
    secondaries are not catching up. The default threshold is a
    minimum of 5 GB size when more than 10 logfiles are already
    present.
  * Minor fix: clean up old internal .tmp-* symlinks which might
    remain as leftovers when marsadm is dying at the wrong moment.
  * Minor improvement: don't run O(n) mapfree under spinlock.
    More speed improvements under preparation; will result in O(k).
  * Some more minor improvements.

mars0.1stable51
  * Minor fix: don't abort log-delete-all too early when there are
    holes in the deletion sequence numbers.
  * Backport of marsadm cron from branch 0.1a, in order to
    systematically support mixed operation of different MARS
    versions in bigger installations (avoid confusion among junior
    sysadmins and monitoring staff).
  * Rectified the semantics of log-delete, which now does the same
    as log-delete-all. Single deletion is only needed for testing,
    and has been renamed to log-delete-one.
    Leaving the old semantics would have been an operational risk
    when junior sysadmins or 24/7 surveillance people are not
    carefully looking at the details of semantics. Now everything
    is hopefully as everybody not familiar with MARS would naively
    assume.
  * Doc update.

mars0.1stable50
  * Major usability improvement (backport from 0.1a): marsadm shows
    number of replicas of each resource, out of total number of
    cluster members. Example: [2/4]
  * Minor fix: automatically clean up internal backups produced by
    the new merge-cluster / split-cluster after 1 week.
  * Minor fix: also clean up some new symlink types replicated
    through the network when running asymmetric clusters with mixed
    branches 0.1 and 0.1a.
  * Minor annoyance: silence split-cluster error message when no
    resources are present.

mars0.1stable49
  * Backports of new marsadm commands merge-cluster and
    split-cluster.
    The new functionality is needed for background migration of
    resources.
    Please be aware that this branch has not been constructed for
    scalability in the dimension of #nodes, so don't merge too many
    nodes and use split-cluster after each background migration.
    Better scalability is / will be addressed at the 0.1a and 0.1b
    branches. However, currently they are not yet stable.
    No changes at the kernel module (besides some bug fixes); this
    is solely done at userspace level.
    The new userspace-level commands should have almost no
    intersection with (and therefore no impact onto) other parts of
    this well-proven stable branch.
  * Backports of new wait-cluster implementation.
    This avoids irritating messages after split-cluster.

mars0.1stable48
  * Critical fix: DDOS-like attacks at the MARS ports (or similar
    caused by bugs / misbehaviour) are prevented by configurable
    limits /proc/sys/mars/handler_dent_limit and
    /proc/sys/mars/handler_limit .
  * Critical safeguard: when the network is interrupted for a long
    time while the log-rotate frequency is very high and a lot of
    resources (exceeding the official limits as documented) had
    been used, masses of deletion links may accumulate in
    /mars/todo/. First, already existing deletions to the same
    targets are reused now. Second, a maximum limit (of currently
    512 entries) is enforced, and a warning is issued when too many
    deletions are accumulated over time.
  * Minor fix: earlier detection of socket hangups.

mars0.1stable47
  * Critical fix: leave-cluster could lead to deadlocks, also on
    remote nodes.
  * Contrib: mass automation script (unmaintained).

mars0.1stable46
  * Major fix: bugfix from 0.1stable44 (state "Detached" was
    reported too early) was incorrect, now fixed.
  * Minor fix: display of host lists in special case of
    create-resource was misleading.

mars0.1stable45
  * Major fix: on secondaries, orphan files and symlinks were
    sometimes created in /mars and could accumulate over a long
    time.
    After several months or years of operation, the /mars directory
    could appear to be full via "df /mars", but "du -s /mars" was
    not reporting the hidden space allocation. Also, upon remount
    or reboot the cleanup of orphan files could take a rather long
    time.
    Workaround was possible by
    "rmmod mars; umount /mars; mount /mars; modprobe mars".
    Fixed by regularly pruning the dentry cache of the /mars
    filesystem.

mars0.1stable44
--------
  * Major fix: state "Detached" was reported too early, before the
    underlying disk was really closed.
  * Doc: new updated slides from FrOSCon 2017. New architectural
    comparison with Big Storage Clusters in terms of scalability,
    reliability and costs.

mars0.1stable43
--------
  * Major fix, only relevant for k >= 3 replicas:
    Logfile fetch did not switch over to another alive peer upon
    _specific_ network problems with the _current_ peer.
    As a consequence, an unaffected replica could hang.
    Workaround was possible by pause-fetch / resume-fetch or by
    fixing the network :)

mars0.1stable42
--------
  * Minor fix: ssh IPs and port numbers are automatically probed on
    join-cluster.
  * Minor compatibility to branch mars0.1b.y: join-resource does
    additional rsync for safety.
  * Minor fix: rate display was not going down to 0 on switchoff or
    long pauses.
  * Minor improvement: show peers in internal debugging info.

mars0.1stable41
--------
  * Minor fix: a rare race could lead to an unnecessary split brain
    when umounting _after_ role transition from primary to
    secondary.

mars0.1stable40
--------
  * Potentially critical fix: on very fast machines, and with
    extremely low probability, a race in AIO could lead to a kernel
    page fault. For maximum safety, update to this version is
    recommended.

mars0.1stable39
--------
  * Minor fix: hangs of logfile updates.
    Found by stress-testing on fast hardware over 10GBit network
    links. Might explain some extremely rare (1 per several million
    operating hours) production hangs on secondaries.
    Workaround possible by "pause-fetch; resume-fetch".
  * Minor fixes of rare kthread retarding under very high load.
  * Minor improvement: add version number to "marsadm version"
    which can be used for future compatibility checking with
    respect to new features.

mars0.1stable38
--------
  * Compile without pre-patch on some kernel versions!
    Whether the pre-patch is applied will be detected
    automatically. However, there is some (hopefully minor)
    performance penalty when the pre-patch is missing.
    This will be addressed in a future release (but might go to
    branch 0.1b instead, not yet decided).
    Tested with vanilla kernels 3.10.105, 3.14.79, 3.16.43, 4.1.39,
    4.4.67.
    Vanilla kernels 4.8.x and later are _not_ yet working
    (independently from pre-patches). This will be addressed in
    a future release.
  * No functional changes otherwise. Rollback to prior versions
    should be easy. Please report any issues.
  * Updated docs describing build methods.

mars0.1stable37
--------
  * Minor fix: secondary logfile replication could hang in the
    extremely unusual case that the expected primary logfile size
    gets shortened after a crash followed by reboot.
    Workaround was possible via "pause-fetch; resume-fetch".

mars0.1stable36
--------
  * Doc: new slides from GUUG2017, both in English and in German.
    Some very important hints for cost savings. May easily save you
    a few millions when operating some petabytes of data.
  * Doc: new chapter on cost savings in mars-manual.pdf.
    Some parts of German oral explanations from the GUUG conference
    translated to English for my English-speaking audience.
    More to come later (hopefully; I need to get the time).

mars0.1stable35
--------
  * Minor fix: when syncing a big resource (e.g. 40TiB) over a
    1GBit uplink, the sync may take longer than 1 day. This
    increases the probability for triggering an unintended restart
    of that sync from scratch.
    Among further obscure preconditions, more than 5 logfiles must
    exist such that the wrong assumption of an emergency mode can
    happen at the secondary. In order to trigger the bug more
    likely, it is therefore helpful to misconfigure
    /etc/cron.d/mars by log-rotate'ing every 10 minutes, but doing
    log-delete-all only once an hour (which contradicts my upstream
    documentation and unnecessarily wastes valuable storage space
    in /mars).
    Fixed by correction of a typo-like error.

mars0.1stable34
--------
  * Minor fix: in some rare cases, when lots of gigabytes had to be
    replayed in one big slurp, the replay position wasn't updated
    during a longer time. Some admins were complaining that it
    appeared "stuck" although it worked in reality.
    Improved by increasing the update frequency of the replay link.
  * Minor fix: after network errors, sometimes the sync restarted
    from scratch, unnecessarily.
  * Minor fix: under rare conditions, rmmod could hang forever.
    A known reason has been fixed. Other theoretical reasons
    hopefully improved by some further safeguards.

mars0.1stable33
--------
  * Minor regression from stable29:
    After a primary crash, without switchover, and when the primary
    recovery phase involves a logrotate to an empty new logfile
    which had been created in the meantime shortly before the crash
    but had not yet been used before the crash (race condition),
    a kernel NULL pointer deref may stop the main thread.
    Workaround: either remove the empty logfile by hand, or just do
    a failover to the other side.

mars0.1stable32
--------
  * Critical regression between stable30 and stable31 (can be
    avoided by simply using stable30 for affected kernels):
    on _old_ kernels (before 4.3.x) the removal of merge_bvec_fn()
    (see upstream commit 8ae126660fddbeebb9251a174e6fa45b6ad8f932)
    can lead to fatal crashes at the primary side.
    Fixed by using (hopefully) proper #ifdef's according to the
    kernel version.
    Notice: between stable30 and stable31 no true MARS fixes were
    made (since no bugs were found).
  This strategy is likely to continue for a while, for newer
  adaptations to even newer kernels. In case of problems, go back.
  And, please, report it to me :)

mars0.1stable31
--------
* New _minimum_ pre-patches for vanilla LTS kernels 3.2.x to 4.7.x.
  For security reasons, please prefer them over the old _generic_
  pre-patch versions which expose many unnecessary EXPORT_SYMBOL
  to potential attackers.
* Adaptations to vanilla kernels up to 4.7.x.
  Note: 4.8rc-* does not yet work.
* Regression testing with many kernel versions: looks fine.

mars0.1stable30
--------
* Minor fix: in very rare cases of a primary crash, a missing
  versionlink could lead to a hang.
* Minor fix: improved error reporting of replay code.
* Minor fix: improved switchback to former primary side.
* Minor fix: systematically add some missing macros.
* Minor improvements: add some example systemd unit and other
  contrib stuff like a cronjob example.
* Doc: minor additions and improvements.

mars0.1stable29
--------
* Minor fix: on very fast hardware and networks, sync could take
  a while to terminate.
* Minor fix: external module build.
* Major usability improvement: new expert commands
  marsadm lowlevel-ls-host-ips, lowlevel-set-host-ip,
  lowlevel-delete-host.
  Necessary for moves between networks, dedicated replication
  IPs, etc.
* Minor doc update.

mars0.1stable28
--------
* Doc: describe new naming conventions.
  MARS Light is now simply called MARS.
  No distinction between "Light" and the future "Full" anymore.
  Please note that the git branches light0.1.y and light0.2.y
  have been renamed to mars0.1.y and mars0.2.y respectively.
* Minor sourcecode cleanup: s/light//g or s/light/main/g
  where appropriate.
  No other changes in the sourcecode, deliberately.
  In case anyone encounters any build problems compiling MARS,
  this release is separated just for the sake of build testing,
  or Debian packaging testing, etc.
* Doc: minor clarifications.
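
For the stable32 note above (stable31 is unsafe on kernels before 4.3.x), a rollout script could check the running kernel first. The following is only a minimal sketch: the function name kernel_older_than_4_3 is hypothetical and not part of MARS, and it only looks at the leading "major.minor" of the release string.

```shell
# Hypothetical helper, not part of MARS.
# Returns 0 (true) if the kernel release in $1, e.g. "3.16.43" or the
# output of "uname -r", is older than 4.3; returns 1 otherwise.
kernel_older_than_4_3() {
    major=${1%%.*}              # text before the first dot
    rest=${1#*.}                # text after the first dot
    minor=${rest%%[!0-9]*}      # leading digits of the remainder
    [ "$major" -lt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -lt 3 ]; }
}

if kernel_older_than_4_3 "$(uname -r)"; then
    echo "kernel is older than 4.3: do not install stable31 here"
fi
```

On affected kernels, either stay at stable30 or go directly to stable32 or later.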

mars0.1stable27
light0.1stable27
--------
* Critical fix: typo in sync progress comparison code could lead
  to data version mismatches during sync when alternating with
  replay. Only observed at a certain new hardware class, and only
  while testing with an extremely high load (9 loaded resources
  in parallel to 9 concurrent syncs).
  As a workaround,
      echo 0 > /proc/sys/mars/sync_flip_interval_sec
  can be used. Nevertheless, update is highly recommended!
* Major fix: slow memory leak (regression from light0.1stable26).
  Only when starting the transaction logger (i.e. primary is
  typically not affected). But don't let it run for a longer time.
  Monitoring is possible via /proc/slabinfo (size-64 or siblings).
* Minor fix: join-cluster did not check for duplicate IP addresses.
* Minor fixes: some unnecessary annoying error messages.
* Doc: new slides from GUUG 2016 in Köln.

light0.1stable26
--------
* Minor fixes: some primitive macros were reporting misleading
  or even wrong values at split brain, or during/after emergency
  mode. Some high-level macros as well as try_to_avoid_split_brain
  should work better / more reliably now.
* Minor fix: potential deadlock after crash reboot, or after
  defective /mars filesystem. Never observed in practice.
* Minor safeguard: unnecessary split brain could emerge at
  secondaries under extremely rare and strange conditions.
  Unsure whether it ever occurred in practice.
* Minor usability improvement: show incorrect permissions on /mars.
  Some other sysadmin tools like Puppet seem to have their own
  default notion of "secure permissions" ;)
* Minor doc reorg, better chapter structure.

light0.1stable25
--------
* Major fix: in rare cases "marsadm primary" (without --force)
  could go into an endless loop, even if --timeout= was specified.
* Minor fix: in rare cases of hanging or defective IO, crashes of
  the primary could replicate versionlinks to the secondary, but
  after reboot they were missing at the primary because of
  hanging IO or other IO / RAID controller problems.
  Now using sync_filesystem() for either ensuring actuality,
  or for letting the mars_light main control thread hang
  (which will hopefully be noticed soon by monitoring).
* Minor fix: join-cluster uses rsync, which could abort due to
  vanished filesystem objects while the primary is actively
  running. Now it should tolerate such "errors".
* Minor fixes / additions at primitive macros.
* Tiny doc update.

light0.1stable24
--------
* Skip this release due to a regression.

light0.1stable23
--------
* Minor fix: the new replay-code error message was not reset
  at secondaries. Now the annoying old error message disappears
  after the next successful logrotate.
* Minor fixes of internal marsadm code (not in use until now).
* Minor doc update.

light0.1stable22
--------
* Critical fix for non-storage servers: the /mars directory was
  readable by ordinary non-root users, opening a potential
  security hole.
  Originally MARS was designed for standalone storage servers
  solely, but now it is increasingly deployed to machines where
  ordinary users can log in.
  Update recommended, but only urgent for potentially affected
  installations.
* Minor fix: when a logfile was damaged (observed at defective
  hardware), this was often (but not always) detected by the md5
  data checksums in the transaction logfiles. So far so good.
  The replay / recovery process stopped for a very good reason.
  But it was not easily possible to _force_ any of the resource
  members into primary role when the defect was already present
  at the _primary_ (which happened once during 7 million
  operating hours, and at a primary site which proved defective
  afterwards), and the defect had been replicated to all
  secondaries.
  As a workaround, the resource could be destroyed via
  leave-resource everywhere, and resurrected from scratch. Clumsy.
  Now an md5 checksum error in the middle of a logfile is treated
  similarly to an EOF. "primary --force" will succeed now, without
  applying the defective data (as before).
  Split brain will result for sure in such a case.
* Minor improvement: md5 logfile checksum errors are now displayed
  directly in the diskstate macro (and therefore also at plain
  "view").
* Minor improvement: when "marsadm view all" told you
  "InConsistent" as the disk state, this was _formally correct_
  because it related to the state of the _disk_, not to the state
  of the replication. The former message could appear regularly
  during ordinary out-of-order writeback at the primary side,
  without violating the consistency of /dev/mars/mydata.
  However, many people were confused and alarmed by the irritating
  message.
  Now a better wording is used: "WriteBack" and "Recovery" describe
  more intuitively what is really happening :)
* Minor doc improvements.

light0.1stable21
--------
* Hint: now MARS has been rolled out to more than 1600 servers,
  including some MySQL database servers, and has collected more
  than 6 million operation hours.
* Minor fixes, none of them observed in practice, only found
  by testing while working on new features:
  - potential read page fault
  - potential deadlock
  - incorrect remote symlink update under untypical circumstances

light0.1stable20
--------
* Hint: MARS is now running on more than 850 storage servers,
  and has collected more than 4.5 million operation hours.
  There were no new incidents with customer impact since the last
  major bugfix (more than 3 million operation hours since then).
  It is difficult to deduce a reliability from that, but it appears
  that at least 99.999%, if not 99.9999% are now real for the MARS
  component as a standalone component (not to be confused with
  overall system reliability).
  Our storage hardware is clearly much less reliable.
  MARS compensates for these defects all the time.
* Minor fix: memory leak in networking code, does not occur at
  light0.1 operations (but maybe future versions of MARS).
* Doc: add presentation slides from Froscon2015.

light0.1stable19
--------
* Minor safeguard: warn when somebody tries leave-resource --host=
  for a damaged host, and later the dead host resurrects in an
  unreasonable way.
* Doc update: describe use cases for DRBD vs MARS more clearly.
* Minor spelling fixes.

light0.1stable18
--------
* Minor safeguard: prevent join-resource when previous
  log-purge-all has been forgotten. Prevent create-resource also
  when previous delete-resource has been forgotten.
  Anyway, this happens only in very exotic repair scenarios
  after very heavy failures.
* Doc updates: simplify descriptions of split-brain resolution and
  emergency mode resolution. Nowadays 'invalidate' will do
  everything in all tested cases; the more complex alternative
  methods have been moved to the appendix.

light0.1stable17
--------
* Minor fix: stacktrace / oops in aio callback path due to a subtle
  race, observed once during 2.5 million operation hours.
  In the observed case, the secondary was hanging, without customer
  impact. However, the error class could potentially occur also at
  the primary side.
  Probably the bug was triggered by a hardware problem from the
  RAID controller.

light0.1stable16
--------
* Minor fix: sync could take a long time to complete under high
  application load, similarly to a live-lock.
* Some smaller minor fixes for annoying messages.
* Contrib: added configurable Nagios check.
* Contrib: added some example scripts which could be used by
  clustermanagers etc.
* Doc: important new section on pitfalls when using existing
  clustermanagers UNMODIFIED for long distance replication.
  PLEASE READ!

light0.1stable15
--------
* NOTICE: MARS passed its baptism of fire on 04/22/2015 when a
  whole co-location had a partial power blackout, followed by
  breakdown of air conditioning, followed by mass hardware defects
  due to overheating.
  MARS showed exactly 0 errors when (emergency) switching to
  another datacenter was started en masse.
* Major fix of race in transaction logger: the primary could hang
  when using very fast hardware, typically after ~24000 operation
  hours. The problem was noticed 6 times during a grand total of
  more than 1,000,000 operation hours on a mixed hardware park,
  showing up only on specific hardware classes. Together with 3
  other incidents during early beta phase which also had customer
  impact, this means that we have reached a reliability of about
      ===> 99.999%
  After this fix, the reliability should grow even higher.
  A workaround for this bug exists:
      # echo 2 > /proc/sys/mars/logger_completion_semantics
  Update is only mandatory when you cannot use the workaround.
* Minor improvement in marsadm: re-allow --force combined with
  "all". This is highly appreciated for speeding up operations /
  handling during emergency datacenter switchover.
* Various smaller improvements.
* Contrib (unsupported): example rollout script for mass rollout.

light0.1stable14
--------
* Minor safeguard: modprobe mars will refuse to start when the
  cluster UUID is missing.
* Minor fix: external race in marsadm resize, only relevant for
  scripting.
* Minor fix: potential race on plugged IO requests.
* Clarify output of marsadm view. Many systematic improvements
  and hints.
* Add some indispensable macros for scripting / automation.
* Various tiny improvements.

light0.1stable13
--------
* Critical safeguard for accidental join-cluster with wrong
  argument: make UUID mandatory, disallow completely unrelated
  hosts to communicate symlink tree updates when their UUIDs
  mismatch.
* Minor fix: leave-resource --host=other did not work when disks
  were named differently throughout the cluster.
* Minor fix: detach --host=other --force (which is needed as a
  precondition) did not work.
* Various minor fixes and clarifications. "marsadm view all" now
  reports the communication status in the cluster.

light0.1stable12
--------
* Critical (but usually not extremely relevant) fix:
  When emergency mode occurs just during a sync, the target
  could remain inconsistent without notice. Now noticed.
  You always could/should manually invalidate whenever an
  emergency mode appeared. Now this is automatically fixed by
  restarting any sync from scratch (if one was actually running
  before; otherwise consistency was never violated).
* Major documentation update / corrections.
* Major (but less relevant) fix: leave-cluster did not really work.
* Minor fix (regression): rmmod could hang when sync was running.
* Various minor fixes and clarifications.

light0.1stable11
--------
* Major documentation update. mars-manual.pdf increased from
  66 to 80 pages. Please read! You probably should know this.
* Minor fixes: better cleanup on invalidate / leave-resource.
* Minor clarifications: more precise EIO error codes,
  more verbose error reporting via "marsadm cat".

light0.1stable10
--------
* Major fixes of internal network protocol errors, leading
  to internal shutdown of sockets, which were transparently
  re-opened. It could affect network performance.
  Not sure whether stability was also affected (probably under
  extremely high load); for better safety you should upgrade.
* Major fix from Manuel Lausch: regex parsing sometimes went
  completely wrong when hostnames followed a naming scheme
  similar to that of internal symlinks.
* Major, only relevant for k>2 replicas: fix wrong internal
  sharing of data structures resulting from parallel data
  connections.
* Minor fix: race in fake-sync.
* Minor fix: race in invalidate.
* Minor, only for k>2 replicas: fix direct primary handover
  when some non-involved hosts are currently unreachable.
* Minor: improve becoming primary during split brain.
* Minor: improve becoming primary when emergency mode starts.
* Minor: silence some annoying stderr messages.
* Several internal minor fixes and clarifications.

light0.1stable09
--------
* Major fix of rare race (potentially critical): the bio response
  thread could terminate too early, leading to a premature dealloc
  of kernel memory. This has only been observed on slow virtual
  machines with slow virtual devices, and very high load on k=4
  replicas. This could potentially affect the stability of the
  system. Although not observed at production machines at 1&1,
  I recommend updating production machines to this release ASAP.
* Major usability fix: incorrect commandline options of marsadm
  were just ignored if they appeared after the resource argument.
  Misspellings could cause undesired effects. For instance,
  "marsadm delete-resource vital --force --MISSPELLhost=banana"
  was accidentally destroying the primary during operation
  (which is _possible_ when using --force, and this was even a
  _required_ sort of "STONITH"-like feature -- however from a
  human point of view it was intended to destroy _another_ host,
  so this was an unexpected behaviour from a sysadmin point of
  view).
* Major workaround: the concept "actual primary" is wrong, because
  during split brain there may exist several primaries.
  Do not use the macro view-actual-primary any longer.
  It is deprecated now. Use view-is-primary instead, on each
  host you are interested in.
* Minor fix: "marsadm invalidate" did not work in some weird
  split brain situations / was not equivalent to
  "marsadm leave-resource $res; marsadm join-resource $res".
  The latter was the old workaround to fix the situation.
  Now it shouldn't be necessary anymore.
* Minor fix: pause-fetch could take very long to terminate.
* Minor fix: marsadm wait-cluster did not wait for all hosts
  participating in the resource, but only for one of them.
  This is only relevant for k>2 replicas.
* Minor fix: the rates displayed by "marsadm view" did not drop
  down to 0 when no progress was made.
* Minor fix: logging to syslog was incomplete.
* Minor usability fix: decrease boring verbosity of "log-rotate"
  and "log-delete" for cron jobs.
* Minor fixes: several internal awkwardnesses, potentially
  affecting performance and/or stability in weird situations.

light0.1stable08
--------
* Minor fix: after emergency mode, a versionlink was not created.
  This could lead to unnecessary reports of split brain and/or
  need for additional re-invalidate.
* Minor fix: the predicate 'view-is-consistent' reported 'false'
  in some situations on secondaries when all was ok.
* Minor fix: it was impossible to determine the 'is-consistent'
  from 'marsadm view' (without -1and1 suffix). Added a new [Cc-]
  flag. This is absolutely needed to determine whether the
  underlying disks must have the same checksum (provided that
  both disks are detached and the network works and fetch+replay
  had completed before the detach).
* Updated docs to reflect this.
* Minor fix: 'invalidate' did not work when the resource was not
  completely detached. Now it implicitly does a detach before
  starting invalidation.
* Minor fix: wait-umount was waiting for umount of _all_ primaries
  during split brain. Now it waits only for umount of the local
  node. Notice that having multiple primaries in parallel is an
  erroneous state anyway.
* Minor fix: leave-cluster did not work without --force.

light0.1stable07
--------
* Minor fix: re-creation of a completely destroyed resource did
  not always work correctly.

light0.1stable06
--------
* Major fix: becoming primary was hanging in rare situations.
* Minor fix: some split brains were not always detected correctly.
* Minor fix for Redhat openvz kernel builds.
* Several fixes for 1&1 internal Debian builds.
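
The stable09 advice above (ask each host via view-is-primary instead of relying on view-actual-primary) could be scripted roughly as follows. This is only a sketch under assumptions: password-less ssh to every host, and "marsadm view-is-primary RES" printing 1 on a primary (the exact output format is not stated in this ChangeLog). The names count_primaries and RUN_ON_HOST are hypothetical helpers, not part of marsadm.

```shell
# Hypothetical hook: command used to run something on a remote host.
# Defaults to ssh; can be overridden, e.g. for testing.
RUN_ON_HOST=${RUN_ON_HOST:-ssh}

# Usage: count_primaries RESOURCE HOST...
# Prints the number of hosts which report themselves as primary.
# ASSUMPTION: "marsadm view-is-primary RES" prints 1 on a primary.
count_primaries() {
    res=$1
    shift
    count=0
    for host in "$@"; do
        # Unreachable hosts or errors are counted as "not primary".
        answer=$($RUN_ON_HOST "$host" marsadm view-is-primary "$res" \
                 </dev/null 2>/dev/null) || answer=""
        if [ "$answer" = "1" ]; then
            count=$((count + 1))
        fi
    done
    printf '%s\n' "$count"
}
```

A result greater than 1 means that several primaries exist at the same time, which is the split brain situation described above. A result of 0 may also mean that some hosts were simply unreachable.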

light0.1stable05
--------
* Major fix: incomplete calls to vfs_readdir() which could lead to
  incomplete symlink updates / replication hangs.
* Minor fix: rare race on replay EOF.
* Separated kernel from userspace build environment.
* Removed some potentially dangerous Kconfig options if they would
  be set to wrong values (robustness against accidentally producing
  bad kernel modules).
* Ditto: some additional checks against bad main Kconfig options
  (mainly for out-of-tree builds).
* Separated contrib code from maintained code.
* Added some pre-patches for newer kernels
  (WIP - not yet fully tested at all combinations)
* Minor doc addition: LinuxTag 2014 presentation.

light0.1stable04
--------
* Quiet annoying error message.
* Minor readability improvements.
* Minor doc updates.

light0.1stable03
--------
* Major: fix internal aio race (could lead to memory corruption).
* Fix refcounting in trans_logger.
* Some minor fixes in module code.
* Fix 1&1-internal out-of-tree builds.
* Various minor fixes.
* Update monitoring tools / docs (German, contributed by Jörg Mann).

light0.1stable02
--------
* Fix sorting of internal data structure.
* Fix IO error propagation at replay.

light0.1stable01
--------
* Fix parallelism of logfile propagation: sometimes a secondary
  could get a more recent version than the primary had on stable
  storage after its crash, eventually leading to an (annoying)
  split brain. Some people might take this as a feature instead of
  a bug, but now the logfile transfer starts only after the primary
  _knows_ that the data is successfully committed to stable storage.
* Fix memory leaks in error path.
* Fix error propagation between client and server.
* Make string allocation fully dynamic (remove limitation).
* Fix some annoying messages.
* Fix usage output of marsadm.
* Userspace: contributed bugfix for Debian udev rules by Jörg Mann.
* Improved debugging (only for testing).

light0.1beta0.18 (feature release)
--------
* New commands marsadm view-$macroname
* New customizable macro processor
* New err/warn/inf reporting via symlinks
* Per-resource emergency mode
* Allow limiting the sync parallelism
* New flood-protected syslogging
* Some smaller improvements
* Update docs
* Update test suite

light0.1beta0.17
--------
* Major bugfix: race in logfile switchover could sometimes lead
  to the wrong logfile (extremely rare to hit, but potentially
  harmful).
* Disallow primary switching when some secondaries are syncing.
* Fix logfile fetch from multiple peers.
* Fix computation of transitive closure (affected log-purge-all,
  split brain detection, and many others).
* Fix incorrect emergency mode detection.
* Primaries no longer fetch logfiles (unnecessarily, only makes a
  difference at concurrent split brain operations).
* Detached resources no longer fetch logfiles (unexpectedly).
* Myriads of smaller fixes.

light0.1beta0.16
--------
* Critical bugfix: "marsadm primary --force" was assumed to be
  given by sysadmins only in case of emergency, when the network is
  down. When given in non-emergency cases where the old primary
  continues to run (/dev/mars/* being actively used and written),
  the old primary could suddenly do a "logrotate" to the new
  split-brain logfile produced by the new (second) primary.
  Now two primaries should be able to run concurrently in
  split-brain mode without mutually trashing their logfiles.
* primary --force now only works in disconnected mode, in order to
  hinder unintended forceful creation of split brain during normal
  operation.
* Stop fetching of logfiles behind split brain points (save space
  at the target hosts - usually the data will be discarded later).
* Fixed split brain detection in userspace.
* leave-resource now waits for local actions to take place
  (remote actions stay asynchronous).
* invalidate / join-resource now work only if a designated primary
  exists (otherwise they would not know uniquely from whom to start
  initial sync).
* Update docs, clarify scenarios intended <-> emergency switching.
* Fixed mutual overwrite of deletion symlinks in case of racing
  log-deletes spawned in parallel by cron jobs (resilience).
* Fixed races between deletion and re-creation (e.g. fresh
  join-resource after leave-resource during network partitions).
* Fixed duration of network timeouts in case the network is down
  (replaced non-working TCP_KEEPALIVE by explicit timeouts).
* New option --dry-run which does not really create symlinks.
* New command "delete-resource" (VERY DANGEROUS) for forcefully
  destroying a resource, even when it is in use.
  Intended only for _emergency_ cases when sysadmins are desperate.
  Use only by hand, first run with --dry-run in order to check
  what will happen!
* New command "log-purge-all" (potentially DANGEROUS) for resolving
  split brain in desperate situations (cleanup of leftovers).
  Only use by hand, first run with --dry-run!
* Lots of smaller improvements / usability / readability etc.
* Update test suite.

light0.1beta0.15
--------
* Introduce write throttling of bulk writers.
* Update test suite.

light0.1beta0.14
--------
* Fix logfile transfer in case of "holes" created by emergency mode.
* Fix "marsadm invalidate" after emergency mode had been entered.
* Fix "marsadm resize" capacity propagation from underlying LVM.
* Update test suite.

light0.1beta0.13
--------
* Fix shutdown during operation (flying requests).
* Fix unnecessary Lamport clock propagation storms.
* Improve unnecessary page cache utilisation (mapfree).
* Update test suite.

light0.1beta0.12 and earlier
--------
There was no dedicated ChangeLog. For details, look at the commit
history.

Release Policy / Software Lifecycle
-----------------------------------

New source releases are simply announced by appearance of git tags.
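
Since releases are announced only via git tags, scripts may want to split a stable tagname into the parts explained under "Meaning of stable tagnames". The following is a minimal sketch: the function name parse_stable_tag is hypothetical, and it deliberately handles only the plain form <prefix><N>.<M>stable<NN> (not the beta names and not the mars0.1astable%02d variant).

```shell
# Hypothetical helper, not part of MARS.
# Usage: parse_stable_tag TAGNAME
# Prints "<prefix> <on-disk version> <feature set> <bugfix revision>"
# and returns 0, or prints nothing and returns 1 if TAGNAME does not
# look like e.g. "mars0.1stable38" or "light0.1stable27".
parse_stable_tag() {
    parsed=$(printf '%s\n' "$1" |
        sed -nE 's/^(mars|light)([0-9]+)\.([0-9]+)stable([0-9]+)$/\1 \2 \3 \4/p')
    [ -n "$parsed" ] || return 1
    printf '%s\n' "$parsed"
}

parse_stable_tag mars0.1stable38    # prints: mars 0 1 38
```

The candidate tagnames of a series can be listed in a checkout with "git tag --list 'mars0.1stable*'".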

--------------

HISTORIC FOOTNOTE: the historic distinction between MARS Light and
the future MARS Full has been dropped. All versions are simply called
"mars". Old tagnames light* will remain valid, but newer names will
follow the convention s/light/mars/g (this means that the old version
number counting will be continued, only the "light" is substituted).
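
As a small illustration of the naming convention, a script which still carries historic names could translate them like this. The function name to_mars_name is hypothetical, and anchoring the substitution at the start of the name (instead of a global s/light/mars/g) is a choice of this sketch.

```shell
# Hypothetical helper, not part of MARS.
# Translates a historic "light" tag or branch name to the "mars" naming.
# Names which do not start with "light" are printed unchanged.
to_mars_name() {
    printf '%s\n' "$1" | sed 's/^light/mars/'
}

to_mars_name light0.1stable27    # prints: mars0.1stable27
to_mars_name light0.1.y          # prints: mars0.1.y
```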