mars/ChangeLog

Meaning of stable tagnames
--------------------------

Example: light0.1stable01:
              0            = version of on-disk data structures
                             (only incremented when downgrades are impossible)
                             (not incremented on backwards-compatible upgrades)
                1          = version of feature set
                 stable    = feature set is frozen during this series
                       01  = bugfix revision
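
The field layout above can be sketched as a small parser. This is only an
illustration, not part of MARS: the names StableTag and parse_stable_tag are
hypothetical, and the regex assumes that every stable tagname looks exactly
like the light0.1stable01 example (lowercase prefix, two dot-separated
numbers, the word "stable", then the bugfix revision).

```python
import re
from typing import NamedTuple


class StableTag(NamedTuple):
    """Fields of a stable tagname (hypothetical helper type)."""
    prefix: str        # e.g. "light"
    disk_format: int   # version of on-disk data structures
    feature_set: int   # version of feature set
    bugfix: int        # bugfix revision


# Assumed pattern, derived only from the example "light0.1stable01".
_STABLE_TAG = re.compile(r"([a-z]+)(\d+)\.(\d+)stable(\d+)")


def parse_stable_tag(tag: str) -> StableTag:
    """Split a stable tagname into its fields; reject anything else."""
    m = _STABLE_TAG.fullmatch(tag)
    if m is None:
        raise ValueError(f"not a stable tagname: {tag!r}")
    return StableTag(m.group(1), int(m.group(2)), int(m.group(3)), int(m.group(4)))
```

Because the result is a tuple of integers after the prefix, a list of parsed
tags can be sorted directly to obtain release order within one prefix.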

Release Conventions / Branches / Tagnames
-----------------------------------------

	light0.1 series (stable):
		 - Asynchronous replication.
		   Currently operational at more than 700 storage servers at
		   1&1, more than 2,000,000 operating hours (end of June 2015)
		 - Unstable tagnames: light0.1beta%d.%d (obsolete)
		 - Stable branch: light0.1.y
		 - Stable tagnames: light0.1stable%02d

	light0.2 series (currently in beta stage):
		   Mostly for internal needs of 1&1 (but not limited to that).
		   More detailed ChangeLog descriptions will follow when it
		   becomes stable.
		 - (NOT YET FULLY WORKING - currently only for kernel 3.2.x)
		   Getting rid of the kernel prepatch! MARS may be built
		   as an external kernel module for any supported
		   kernel version. First prototype is only tested for
		   unaltered 3.2.x vanilla kernel, but compatibility to
		   further vanilla kernel versions (maybe even
		   Redhat-specific ones) will follow during the course of
		   the MARS light0.2 stable series. The problem is not
		   compatibility as such, but _testing_ that it really
		   works. These tests need a lot of time.
		   => further arguments for getting to kernel upstream ASAP.
		 - Improved network throughput by parallel TCP connections
		   (in particular under packet loss).
		   Also called "socket bundling".
		   First benchmarks show an impressive speedup over
		   highly congested long-distance lines.
		 - Future-proof updates in the network protocol:
		   Mixed operation of 32/64bit and/or {big,little}endian
		 - Support for multi-homed network interfaces.
		 - Transparent data compression over low bandwidth lines.
		   Consumes a lot of CPU, therefore only recommended for
		   low write loads or for desperate network situations.
		 - Various smaller features and improvements.
		 - Unstable tagnames: light0.2beta%d.%d (current)
		 - Stable branch: light0.2.y (planned)
		 - Stable tagnames: light0.2stable%02d (planned)

	light0.3 series (planned):
		   (some might possibly go to 1.0 series instead)
		 - Remote device: bypassing iSCSI. In essence,
		   /dev/mars/mydata can appear at any other cluster member
		   which doesn't necessarily need any local disks.
		 - Improve replication latency.
		 - New pseudo-synchronous replication modes.
		   For the internal needs of database folks at 1&1.
		 - (Maybe) old test suite could be retired, a new
		   one is at github.com/schoebel/test-suite
		 - Unstable tagnames: light0.3beta%d.%d (planned)
		 - Stable branch: light0.3.y (planned)
		 - Stable tagnames: light0.3stable%02d (planned)

	light1.0 series (planned):
		 - Replace symlink tree by transactional status files
		   (future-proof)
		   This is required for upstream merging to the kernel.
		   It has further advantages, such as better scalability.
		 - Trying to additionally address public needs.
		 - Potentially for Linux kernel upstream.
		 - Unstable tagnames: light1.0beta%d.%d (planned)
		 - Stable branch: light1.0.y (planned)
		 - Stable tagnames: light1.0stable%02d (planned)

	full* (sometime in the future)

	WIP-* branches are for development and may be rebased onto anything
	at any time without notice. They will disappear eventually.

	*stable* branches mean that only bug fixes and documentation
	updates / clarifications will be applied. Updates to the test suite /
	new test cases potentially uncovering bugs, and other minor additions
	of debugging code / paranoia code which may lead to discovery
	of bugs are also possible. Error messages / warnings and their
	error class may also be changed.
	NO NEW FEATURES, not even minor ones, except when absolutely
	necessary for a bugfix.

-----------------------------------
Changelog for series 0.2:

light0.2beta3.0
--------
	* Include all fixes from light0.1stable20.
	* Minor fixes: memory leak; possible (but not really observed)
	  deadlock in new socket bundling code.

light0.2beta2.0
--------
	* New feature: LZO compression of network bulk data.
	* Various Fixes.
	* Minor doc updates.

light0.2beta1.0
--------
	* New feature: socket bundling. Measurements show major network
	  performance improvements when running over highly congested
	  shared network lines.
	* New network protocol: future-proof for mixed operations between
	  different processor architectures. Self-adaptive. Allows mixed
	  operations of different MARS versions even when new features
	  have been incorporated into the network protocol.
	* New option: backward compatibility to the old protocol.
	  For transitional purposes (may be removed some day).
	* Some minor code cleanups. Much more to come later.
	* Doc: describe the bundling feature, performance results for
	  various settings.
	* Sorry, this release was buggy because some WIP patches went
	  into it by accident. Fixed later by factoring out these patches
	  into branch WIP-compatibility which will be included into the
	  light0.2 series later once they are more mature.

-----------------------------------
Changelog for series 0.1:

light0.1stable20
--------
	* Hint: MARS is now running on more than 850 storage servers,
	  and has collected more than 4.5 million operation hours.
	  There were no new incidents with customer impact since the last
	  major bugfix (more than 3 million operation hours since then).
	  It is difficult to deduce a reliability figure from that, but it
	  appears that at least 99.999%, if not 99.9999%, is now realistic
	  for the MARS component as a standalone component (not to be
	  confused with overall system reliability). Our storage hardware
	  is clearly much less reliable. MARS compensates for these defects
	  all the time.

	* Minor fix: memory leak in networking code, does not occur
	  during light0.1 operation (but may affect future versions of MARS).
	* Doc: add presentation slides from Froscon2015.

light0.1stable19
--------
	* Minor safeguard: warn when somebody tries leave-resource --host=
	  for a damaged host, and later the dead host resurrects in an
	  unreasonable way.
	* Doc update: describe use cases for DRBD vs MARS more clearly.
	* Minor spelling fixes.

light0.1stable18
--------
	* Minor safeguard: prevent join-resource when previous log-purge-all
	  has been forgotten. Prevent create-resource also when previous
	  delete-resource has been forgotten. Anyway, this happens only in
	  very exotic repair scenarios after very heavy failures.
	* Doc updates: simplify descriptions of split-brain resolution and
	  emergency mode resolution. Nowadays 'invalidate' will do everything
	  in all tested cases; the more complex alternative methods have
	  been moved to the appendix.

light0.1stable17
--------
	* Minor fix: stacktrace / oops in aio callback path due to a
	  subtle race, observed once during 2.5 million operation hours.
	  In the observed case, the secondary was hanging, without
	  customer impact. However, the error class could potentially
	  occur also at the primary side. Probably the bug was triggered
	  by a hardware problem from the RAID controller.

light0.1stable16
--------
	* Minor fix: sync could take a long time to complete under high
	  application load, similarly to a live-lock.
	* Some smaller minor fixes for annoying messages.
	* Contrib: added configurable Nagios check.
	* Contrib: added some example scripts which could be used by
	  clustermanagers etc.
	* Doc: important new section on pitfalls when using existing
	  clustermanagers UNMODIFIED for long distance replication.
	  PLEASE READ!

light0.1stable15
--------
	* NOTICE: MARS passed its baptism of fire on 04/22/2015 when a whole
	  co-location had a partial power blackout, followed by breakdown
	  of air conditioning, followed by mass hardware defects due to
	  overheating. MARS showed exactly 0 errors when (emergency)
	  switching to another datacenter was started en masse.
	* Major fix of race in transaction logger: the primary could hang
	  when using very fast hardware, typically after ~24000 operation
	  hours. The problem was noticed 6 times during a grand total of
	  more than 1,000,000 operation hours on a mixed hardware park,
	  showing up only on specific hardware classes. Together with 3
	  other incidents during early beta phase which also had customer
	  impact, this means that we have reached a reliability of about
				  ===> 99.999%
	  After this fix, the reliability should grow even higher.
	  A workaround for this bug exists:
	  # echo 2 > /proc/sys/mars/logger_completion_semantics
	  Update is only mandatory when you cannot use the workaround.
	* Minor improvement in marsadm: re-allow --force combined with "all".
	  This is highly appreciated for speeding up operations / handling
	  during emergency datacenter switchover.
	* Various smaller improvements.
	* Contrib (unsupported): example rollout script for mass rollout.
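
	The reliability estimate in the transaction logger entry above can be
	roughly reproduced by simple arithmetic. The sketch below is not the
	author's method: the ChangeLog does not say how long each incident
	lasted, so the value of one lost hour per incident is purely an
	assumption for illustration, and the function name is hypothetical.

```python
def availability_percent(incidents: int,
                         hours_lost_per_incident: float,
                         total_hours: float) -> float:
    """Fraction of operation hours without impact, as a percentage."""
    lost = incidents * hours_lost_per_incident
    return 100.0 * (1.0 - lost / total_hours)


# 6 logger hangs + 3 early beta incidents, as stated in the entry above.
# ASSUMPTION: about 1 hour of impact per incident (not stated in the source).
estimate = availability_percent(incidents=9,
                                hours_lost_per_incident=1.0,
                                total_hours=1_000_000)
print(f"{estimate:.4f}%")
```

	Under that assumption the result stays in the "five nines" range;
	longer incidents would lower it proportionally.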

light0.1stable14
--------
	* Minor safeguard: modprobe mars will refuse to start when the
	  cluster UUID is missing.
	* Minor fix: external race in marsadm resize, only relevant
	  for scripting.
	* Minor fix: potential race on plugged IO requests.
	* Clarify output of marsadm view. Many systematic improvements
	  and hints.
	* Add some indispensable macros for scripting / automation.
	* Various tiny improvements.

light0.1stable13
--------
	* Critical safeguard for accidental join-cluster with wrong argument:
	  make UUID mandatory, disallow completely unrelated hosts to
	  communicate symlink tree updates when their UUIDs mismatch.
	* Minor fix: leave-resource --host=other did not work when disks
	  were named differently throughout the cluster.
	* Minor fix: detach --host=other --force (which is needed as a
	  precondition) did not work.
	* Various minor fixes and clarifications. "marsadm view all"
	  now reports the communication status in the cluster.

light0.1stable12
--------
	* Critical (but usually not extremely relevant) fix:
	  When emergency mode occurs just during a sync, the target could
	  remain inconsistent without notice. Now noticed.
	  You always could/should manually invalidate whenever an
	  emergency mode appeared.
	  Now this is automatically fixed by restarting any sync from
	  scratch (if one was actually running before; otherwise consistency
	  was never violated).
	* Major documentation update / corrections.
	* Major (but less relevant) fix: leave-cluster did not really work.
	* Minor fix (regression): rmmod could hang when sync was running.
	* Various minor fixes and clarifications.

light0.1stable11
--------
	* Major documentation update. mars-manual.pdf increased from
	  66 to 80 pages. Please read! You probably should know this.
	* Minor fixes: better cleanup on invalidate / leave-resource.
	* Minor clarifications: more precise EIO error codes, more verbose
	  error reporting via "marsadm cat".

light0.1stable10
--------
	* Major fixes of internal network protocol errors, leading to
	  internal shutdown of sockets, which were transparently re-opened.
	  It could affect network performance. Not sure whether
	  stability was also affected (probably under extremely high load);
	  for better safety you should upgrade.
	* Major fix from Manuel Lausch: regex parsing sometimes went
	  completely wrong when hostnames followed a naming scheme
	  similar to that of internal symlinks.
	* Major, only relevant for k>2 replicas: fix wrong internal sharing
	  of data structures resulting from parallel data connections.
	* Minor fix: race in fake-sync.
	* Minor fix: race in invalidate.
	* Minor, only for k>2 replicas: fix direct primary handover when
	  some non-involved hosts are currently unreachable.
	* Minor: improve becoming primary during split brain.
	* Minor: improve becoming primary when emergency mode starts.
	* Minor: silence some annoying stderr messages.
	* Several internal minor fixes and clarifications.

light0.1stable09
--------
	* Major fix of rare race (potentially critical): the bio response
	  thread could terminate too early, leading to a premature dealloc
	  of kernel memory. This has only been observed on slow virtual
	  machines with slow virtual devices, and very high load on k=4
	  replicas. This could potentially affect the stability of the system.
	  Although not observed at production machines at 1&1, I recommend
	  updating production machines to this release ASAP.
	* Major usability fix: incorrect commandline options of marsadm
	  were just ignored if they appeared after the resource argument.
	  Misspellings could cause undesired effects. For instance,
	  "marsadm delete-resource vital --force --MISSPELLhost=banana"
	  was accidentally destroying the primary during operation (which
	  is _possible_ when using --force, and this was even a _required_
	  sort of "STONITH"-like feature -- however from a human point
	  of view it was intended to destroy _another_ host, so this was
	  an unexpected behaviour from a sysadmin point of view).
	* Major workaround: the concept "actual primary" is wrong, because
	  during split brain there may exist several primaries. Do not
	  use the macro view-actual-primary any longer. It is deprecated now.
	  Use view-is-primary instead, on each host you are interested in.
	* Minor fix: "marsadm invalidate" did not work in some weird
	  split brain situations / was not equivalent to
	  "marsadm leave-resource $res; marsadm join-resource $res".
	  The latter was the old workaround to fix the situation.
	  Now it shouldn't be necessary anymore.
	* Minor fix: pause-fetch could take very long to terminate.
	* Minor fix: marsadm wait-cluster did not wait for all hosts
	  participating in the resource, but only for one of them.
	  This is only relevant for k>2 replicas.
	* Minor fix: the rates displayed by "marsadm view" did not drop down
	  to 0 when no progress was made.
	* Minor fix: logging to syslog was incomplete.
	* Minor usability fix: reduce the verbosity of "log-rotate"
	  and "log-delete" for cron jobs.
	* Minor fixes: several internal awkwardnesses, potentially affecting
	  performance and/or stability in weird situations.

light0.1stable08
--------
	 * Minor fix: after emergency mode, the creation of a versionlink
	   was forgotten. This could lead to unnecessary reports of split
	   brain and/or the need for an additional re-invalidate.
	 * Minor fix: the predicate 'view-is-consistent' reported 'false'
	   in some situations on secondaries when all was ok.
	 * Minor fix: it was impossible to determine the 'is-consistent'
	   from 'marsadm view' (without -1and1 suffix). Added a new [Cc-]
	   flag. This is absolutely needed to determine whether the
	   underlying disks must have the same checksum (provided that
	   both disks are detached and the network works and fetch+replay
	   had completed before the detach).
	 * Updated docs to reflect this.
	 * Minor fix: 'invalidate' did not work when the resource was not
	   completely detached. Now it implicitly does a detach before
	   starting invalidation.
	 * Minor fix: wait-umount was waiting for umount of _all_ primaries
	   during split brain. Now it waits only for umount of the local node.
	   Notice that having multiple primaries in parallel is an
	   erroneous state anyway.
	 * Minor fix: leave-cluster did not work without --force.

light0.1stable07
--------
	 * Minor fix: re-creation of a completely destroyed resource
	   did not always work correctly.

light0.1stable06
--------
	 * Major fix: becoming primary was hanging in rare situations.
	 * Minor fix: some split brains were not always detected correctly.
	 * Minor fix for Redhat openvz kernel builds.
	 * Several fixes for 1&1 internal Debian builds.

light0.1stable05
--------
	 * Major fix: incomplete calls to vfs_readdir()
	   which could lead to incomplete symlink updates /
	   replication hangs.
	 * Minor fix: rare race on replay EOF.
	 * Separated kernel from userspace build environment.
	 * Removed some Kconfig options which are potentially dangerous
	   when set to wrong values (robustness against
	   accidentally producing bad kernel modules).
	 * Ditto: some additional checks against bad main Kconfig options
	   (mainly for out-of-tree builds).
	 * Separated contrib code from maintained code.
	 * Added some pre-patches for newer kernels
	   (WIP - not yet fully tested at all combinations)
	 * Minor doc addition: LinuxTag 2014 presentation.

light0.1stable04
--------
	 * Silence an annoying error message.
	 * Minor readability improvements.
	 * Minor doc updates.

light0.1stable03
--------
	 * Major: fix internal aio race (could lead to memory corruption).
	 * Fix refcounting in trans_logger.
	 * Some minor fixes in module code.
	 * Fix 1&1-internal out-of-tree builds.
	 * Various minor fixes.
	 * Update monitoring tools / docs (German, contributed by Jörg Mann).

light0.1stable02
--------
	 * Fix sorting of internal data structure.
	 * Fix IO error propagation at replay.

light0.1stable01
--------
	 * Fix parallelism of logfile propagation: sometimes a secondary
	   could get a more recent version than the primary had on stable
	   storage after its crash, eventually leading to an (annoying)
	   split brain. Some people might take this as a feature instead
	   of a bug, but now the logfile transfer starts only after the
	   primary _knows_ that the data is successfully committed to
	   stable storage.
	 * Fix memory leaks in error path.
	 * Fix error propagation between client and server.
	 * Make string allocation fully dynamic (remove limitation).
	 * Fix some annoying messages.
	 * Fix usage output of marsadm.
	 * Userspace: contributed bugfix for Debian udev rules by Jörg Mann.
	 * Improved debugging (only for testing).

light0.1beta0.18 (feature release)
--------
	 * New commands marsadm view-$macroname
	 * New customizable macro processor
	 * New err/warn/inf reporting via symlinks
	 * Per-resource emergency mode
	 * Allow limiting the sync parallelism
	 * New flood-protected syslogging
	 * Some smaller improvements
	 * Update docs
	 * Update test suite

light0.1beta0.17
--------
	 * Major bugfix: race in logfile switchover could sometimes
	   lead to the wrong logfile (extremely rare to hit, but
	   potentially harmful).
	 * Disallow primary switching when some secondaries are
	   syncing.
	 * Fix logfile fetch from multiple peers.
	 * Fix computation of transitive closure (affected
	   log-purge-all, split brain detection, and many others).
	 * Fix incorrect emergency mode detection.
	 * Primaries no longer fetch logfiles (unnecessarily, only
	   makes a difference at concurrent split brain operations).
	 * Detached resources no longer fetch logfiles (unexpectedly).
	 * Myriads of smaller fixes.

light0.1beta0.16
--------

	 * Critical bugfix: "marsadm primary --force" was assumed to be given
	   by sysadmins only in case of emergency, when the network is down.
	   When given in non-emergency cases where the old primary continues
	   to run (/dev/mars/* being actively used and written), the
	   old primary could suddenly do a "logrotate" to the
	   new split-brain logfile produced by the new (second) primary.
	   Now two primaries should be able to run concurrently in split-brain
	   mode without mutually trashing their logfiles.
	 * primary --force now only works in disconnected mode, in order
	   to prevent unintended forceful creation of split brain during
	   normal operation.
	 * Stop fetching of logfiles behind split brain points (save space
	   at the target hosts - usually the data will be discarded later).
	 * Fixed split brain detection in userspace.
	 * leave-resource now waits for local actions to take place
	   (remote actions remain asynchronous).
	 * invalidate / join-resource now work only if a designated primary
	   exists (otherwise they would not know uniquely from whom
	   to start initial sync).
	 * Update docs, clarify scenarios intended <-> emergency switching.
	 * Fixed mutual overwrite of deletion symlinks in case of racing
	   log-deletes spawned in parallel by cron jobs (resilience).
	 * Fixed races between deletion and re-erection (e.g. fresh
	   join-resource after leave-resource during network partitions).
	 * Fixed duration of network timeouts in case the network is down
	   (replaced non-working TCP_KEEPALIVE by explicit timeouts).
	 * New option --dry-run which does not really create symlinks.
	 * New command "delete-resource" (VERY DANGEROUS) for
	   forcefully destroying a resource, even when it is in use.
	   Intended only for _emergency_ cases when sysadmins are
	   desperate. Use only by hand, first run with --dry-run in order
	   to check what will happen!
	 * New command "log-purge-all" (potentially DANGEROUS) for
	   resolving split brain in desperate situations (cleanup of
	   leftovers). Only use by hand, first run with --dry-run!
	 * Lots of smaller improvements / usability / readability etc.
	 * Update test suite.

light0.1beta0.15
--------

	 * Introduce write throttling of bulk writers.
	 * Update test suite.

light0.1beta0.14
--------

	 * Fix logfile transfer in case of "holes" created by
	   emergency mode.
	 * Fix "marsadm invalidate" after emergency mode had been entered.
	 * Fix "marsadm resize" capacity propagation from underlying LVM.
	 * Update test suite.

light0.1beta0.13
--------

	 * Fix shutdown during operation (flying requests).
	 * Fix unnecessary Lamport clock propagation storms.
	 * Reduce unnecessary page cache utilisation (mapfree).
	 * Update test suite.


light0.1beta0.12 and earlier
--------

	There was no dedicated ChangeLog. For details, look at the
	commit history.

Release Policy / Software Lifecycle
-----------------------------------

	New source releases are simply announced by appearance of git tags.

General Conventions
-------------------

The git tags have the following meaning:

	full* for future use.

	light1.0 The first number indicates the main symlink tree revision, the second number indicates the sub revision. The main symlink tree revision is only updated upon (potentially) incompatible changes. Upgrades of main revisions will always be possible, but downgrades are not automatically supported. The sub revision will indicate new releases, and they may also indicate symlink tree extensions which are both forwards and backwards compatible. It may just happen that new features are not available with older releases :)
	Example: 1.0 ff will indicate the future main production revision.
	Extensions: suffixes like pre1 indicate pre-releases. Other suffixes like testing2 are reserved for future use.
	Hint: you may automatically convert the MARS git tags into Debian release tags by a regex inserting a ~ after any transition from a digit to an alpha character. We just omitted the ~ because git treats it as an invalid character. The corresponding Debian tags _should_ result in the correct ordering according to the Debian guidelines. Please report a bug if not :)
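
	The conversion described in the hint can be sketched as follows. This is a minimal illustration, not a tool shipped with MARS: the function name debian_version is hypothetical, and the sketch applies only the digit-to-alpha substitution named above; it leaves the leading "light" prefix untouched, since the hint does not say how it should be handled.

```python
import re

# Zero-width match at every position where a digit is directly
# followed by an ASCII letter (the "transition" named in the hint).
_DIGIT_TO_ALPHA = re.compile(r"(?<=[0-9])(?=[A-Za-z])")


def debian_version(git_tag: str) -> str:
    """Insert '~' after each digit that is followed by a letter."""
    return _DIGIT_TO_ALPHA.sub("~", git_tag)


for tag in ("light0.1stable01", "light0.1beta0.18", "light1.0pre1"):
    print(tag, "->", debian_version(tag))
```

	In Debian version comparison a ~ sorts before anything else, even the end of the string, which is why a suffix such as pre1 ends up ordered before the plain release.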

	light0.1beta* Internal 1&1 releases during the pilot phase. May be used by the public, but you should know that the 1.0 symlink tree revision will appear soon.

	light0.0alpha* Very old prototypes; never use them. Vital features were missing. Only for historic inspection.