A primary appeared to have died and was rebooted.
In the meantime, the old secondary was forcefully switched
to primary.
Afterwards, the old primary (= new secondary) got stuck because two
versionlinks, which had been _produced_ by _itself_, were
missing, although they were present at the new primary (= old
secondary)! How could this happen?
All transaction logfiles were fully present and correct everywhere.
However, the kern.log of the old primary showed that a problem with
the RAID system must have existed. In addition, the RAID controller's
error log also reported some problems which appeared to have healed.
Problem analysis shows the following possibility:
The transaction logger can continue to write data, even via
fsync(), while the _writeback_ of other parts of the /mars filesystem
(e.g. symlink updates) is stuck for a long time due to an IO problem.
Usually, slow or even missing symlink updates are not a problem,
because upon recovery after a reboot everything is healed by
transaction replay (possibly replaying much more data than strictly
necessary, but this does not affect semantics, and it is even
advantageous when RAID disks might contain defective data).
There is one exception: after a logrotate, the corresponding new
versionlink should appear within a short time. Otherwise, the
above-mentioned scenario could emerge.
We use sync_filesystem() to ensure that any update to a _new_
versionlink is either guaranteed to become persistent, or (in case
of IO problems) the mars_light thread will hang, which will
hopefully be noticed soon by monitoring.
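For illustration only, here is a minimal userspace sketch of the
distinction involved; MARS itself calls sync_filesystem() inside the
kernel, and the paths and versionlink content below are hypothetical.
fsync() persists the transaction logfile itself, whereas a freshly
created symlink is separate filesystem metadata that only becomes
durable after a filesystem-wide sync such as syncfs().

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical paths, for illustration only. */
        int log_fd = open("/mars/resource-r0/log-000002",
                          O_WRONLY | O_CREAT | O_APPEND, 0600);
        int mars_fd = open("/mars", O_RDONLY | O_DIRECTORY);

        if (log_fd < 0 || mars_fd < 0) {
            perror("open");
            return 1;
        }

        /* fsync() persists the transaction logfile itself ... */
        if (write(log_fd, "logdata", 7) != 7 || fsync(log_fd) != 0) {
            perror("logfile");
            return 1;
        }

        /* ... but a versionlink written after logrotate is separate
         * metadata which may stay in the writeback queue for a long time. */
        if (symlink("log-000001,checksum",
                    "/mars/resource-r0/version-000002") != 0) {
            perror("symlink");
            return 1;
        }

        /* Sync the whole /mars filesystem: the new versionlink either
         * becomes persistent, or this call blocks on the IO problem. */
        if (syncfs(mars_fd) != 0) {
            perror("syncfs");
            return 1;
        }
        return 0;
    }

If the underlying device has an IO problem, the final sync blocks
instead of returning, which corresponds to the intended hang of the
mars_light thread described above.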
Originally a trivial silly bug (a boolean value was wrong), leading
to an endless loop when a local versionlink was missing. This can
happen only after a primary crash at the wrong moment shortly after a
logrotate (not during ordinary operation), followed by a hard reboot.
As documented in mars-manual.pdf, you simply need "modprobe mars"
to recover after such a crash reboot. MARS remembers the primary state
persistently for you and restores everything _automatically_.
Using "marsadm primary" in such a case to switch the current primary
to primary again (after an unnecessary "marsadm secondary", which is
strongly discouraged by mars-manual.pdf), although the host already
is / was in primary state after the reboot, is at least as silly as
the mentioned bug. Doing this in an /etc/init.d/ startup script,
where it really doesn't belong, is even more silly.
The latter is even an OPERATIONAL RISK, because "marsadm secondary"
works _globally_ throughout the whole cluster (as documented in
mars-manual.pdf). Such an improper startup script _can_ (potentially)
disturb another cluster member which had become primary in the
_meantime_, during the reboot. Global cluster operations don't belong
in startup scripts, because reboots may happen unintentionally at any
time.
Only a secondary is allowed to do this, because we assume that
logfile replay has the property of "anytime consistency"
only there.
When a primary cannot recover after a crash due to a defective
logfile, this is not true. The primary is simply lost in such a
(rare) case. Observed twice during almost 8 million operating hours.
In such a case, the hardware is truly defective, and you have only
the following options:
1) switch over to a secondary via "primary --force", OR
2) deconstruct the resource everywhere, run fsck or similar on
whatever replica seems to be the best version,
and reconstruct the resource from scratch, OR
3) restore your backup.
At an actual primary, "Inconsistent" would be the correct description
of the state of the _disk_.
However, most sysadmins will confuse this with the state of the
_replication_ (which is of course never inconsistent during
writeback from the memory buffer).
Although documented correctly, misunderstandings continue
to survive, because humans automatically abstract away from
detail components such as a "disk", and automatically assume
that "marsadm view" relates to the replication as a whole.
Avoid such misunderstandings by making more detailed message
distinctions, aiming to address all of these in parallel.
When logfile replay aborts with an error, becoming primary would be
impossible.
Without this patch, repair would only be possible by completely
destroying the resource.
A previous version of this patch introduced
/proc/sys/mars/allow_primary_when_damaged, which would have
complicated the sysadmin interface: people would have been unsure
what to do.
Only relevant for non-storage servers to which customers have access.
Notice that /mars is a _reserved_ filesystem for MARS-internal purposes.
It has nothing to do with an ordinary filesystem.
Users generally have to be kept out.
Very old idiotic bug.
Under some circumstances, a byte beyond the end of a non-null-terminated
string (such as one produced by the VFS) might be read, potentially
leading to a page fault just one byte after a page border.
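As a hedged illustration of this class of bug (not the original MARS
code): a name handed over by the VFS carries an explicit length and
need not be null-terminated, so strcmp() or strlen() on it may read
the byte just behind the buffer, which can lie on an unmapped page.
The struct and function names below are made up; the point is to
compare strictly within the known length.

    #include <stdbool.h>
    #include <string.h>

    /* Hypothetical representation of a VFS-style name: explicit length,
     * no guarantee of a trailing '\0'. */
    struct name_ref {
        const char *data;
        size_t      len;
    };

    /* Wrong: strcmp(name->data, literal) or strlen(name->data) may read
     * the byte at name->data[name->len], which can sit just behind a
     * page border.
     * Safe: compare strictly within the known length. */
    static bool name_equals(const struct name_ref *name, const char *literal)
    {
        size_t llen = strlen(literal);

        return name->len == llen && memcmp(name->data, literal, llen) == 0;
    }

    int main(void)
    {
        char buf[4] = { 'l', 'o', 'g', '!' };   /* no terminating '\0' */
        struct name_ref name = { .data = buf, .len = 3 };

        return name_equals(&name, "log") ? 0 : 1;
    }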