Some primary appeared to have died, and was rebooted.
In the meantime, the old secondary was forcefully switched
to primary.
Afterwards, the old primary = new secondary got stuck because 2
versionlinks, which had been _produced_ by _himself_, were
missing, but they were present at the new primary = old secondary!
How could this happen?
All transaction logfiles were fully present and correct everywhere.
However, the old primary kern.log showed that a problem with the
RAID system must have existed. In addition, the RAID controller
errorlog also reported some problems which appeared to have healed.
Problem analysis shows the following possibility:
The transaction logger can continue to write data, even via
fsync(), while the _writeback_ of other parts of the /mars filesystem
(e.g. symlink updates) got stuck for a long time due to an IO problem.
Usually, slow or even missing symlink updates are no problem because
upon recovery after a reboot, everything is healed by transaction
replay (possibly replaying much more data than really necessary,
but this does not affect semantics, and it is even advantageous
when RAID disks might contain defective data).
There is one exception: after a logrotate, the corresponding new
versionlink should appear after a small time. Otherwise, the
above mentioned scenario could emerge.
We use sync_filesystem() to ensure that any versionlink update
to a _new_ versionlink is either guaranteed to become persistent,
or (in case of IO problems) the mars_light thread will hang, which
will be (hopefully) noticed soon by monitoring.
In rare cases, this could lead to buffer overflows.
Replace buggy concept from the prototype phase with more
robust (although slightly less performant) code.
We report an error if there are unfreed mrefs after the device brick
has been switched to power off.
Instead of reporting an error at once, we report only warnings in the first 20
seconds. If there are still unfreed mrefs after that time an error is reported.
Signed-off-by: Thomas Schoebel-Theuer <tst@1und1.de>
Some filesystems like ext3 have only full second resolution.
Therefore, we _must_ advance the Lamport clock in whole seconds
when working on such gear, since we want to prevent lost
updates which would be caused by standstill Lamport clocks.
Sometimes, the lamport clock gets updated more frequently per second
than real time. In such cases, the Lamport clock will run much faster
than real time. After some weeks of operation, the Lamport clock
will be far in the future.
In general, we cannot do anything against that. When some fine-grained
information cannot be coded into some specific data type, it
cannot be coded.
However, when updates start to occur less frequently, we want to
_leave_ the workaround mode ASAP. The old code set tv_nsec to 0
which made it very likely that the workaround was triggered
again unnecessarily.
In order to _reduce_ that effect, we prevent unnecessary cascades
of whole-second leaps by setting the nanoseconds constantly to 1
if the full second was increased due to insufficient capabilities
of the underlying filesystem. At least in those cases where
Lamport timestamps are transferred over the network and/or we have
mixed configurations between ext3/ext4, we hope to
decrease the risk of endless cascades.
Experience shows that the new code behaves better.
Switch on only when all predecessor bricks are also on.
Failing to do so can result in fatal errors.
Similarly, switch only off if no successor exists any more.