mars/kernel
Thomas Schoebel-Theuer 8e2de8288d light: fix missing versionlink upon slow or defective IO
A primary appeared to have died and was rebooted.
In the meantime, the old secondary was forcefully switched
to primary.

Afterwards, the old primary (now the new secondary) got stuck because two
versionlinks, which had been _produced_ by _itself_, were
missing, although they were present at the new primary (the old secondary)!

How could this happen?

All transaction logfiles were fully present and correct everywhere.

However, the kern.log of the old primary showed that a problem with the
RAID system must have existed. In addition, the error log of the RAID
controller reported some problems which appeared to have healed.

Problem analysis shows the following possibility:

The transaction logger can continue to write data, even via
fsync(), while the _writeback_ of other parts of the /mars filesystem
(e.g. symlink updates) is stuck for a long time due to an IO problem.

Usually, slow or even missing symlink updates are no problem because
upon recovery after a reboot, everything is healed by transaction
replay (possibly replaying much more data than really necessary,
but this does not affect semantics, and it is even advantageous
when RAID disks might contain defective data).

There is one exception: after a logrotate, the corresponding new
versionlink should appear within a short time. Otherwise, the
above-mentioned scenario could emerge.

We use sync_filesystem() so that any update which creates a _new_
versionlink either becomes persistent, or (in case of IO problems)
the mars_light thread hangs, which will (hopefully) be noticed
soon by monitoring.
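
The idea can be illustrated in userspace. The following is only a minimal sketch, not the code of this commit: publish_link_durably() and the temporary-name scheme are hypothetical, and Linux syncfs(2) is used as the userspace counterpart of the in-kernel sync_filesystem(). The function replaces a symlink atomically and then blocks until the containing filesystem has been written back, so a stuck IO path shows up as a hanging caller instead of a silently missing link.

```c
#define _GNU_SOURCE             /* for syncfs() and O_DIRECTORY */
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper, NOT part of MARS.
 * Atomically (re)places the symlink dir/name -> target, then waits until
 * the whole filesystem containing dir has been written back.
 * Returns 0 on success, -1 on error (errno is set by the failing call
 * except when a path would be truncated).
 */
int publish_link_durably(const char *dir, const char *name, const char *target)
{
	char tmp[PATH_MAX];
	char dst[PATH_MAX];
	int fd;
	int rc;

	assert(dir && name && target);

	/* Build both paths and refuse silently truncated names. */
	if (snprintf(tmp, sizeof(tmp), "%s/.%s.tmp", dir, name) >= (int)sizeof(tmp) ||
	    snprintf(dst, sizeof(dst), "%s/%s", dir, name) >= (int)sizeof(dst))
		return -1;

	/* Remove a leftover from an earlier attempt; errors are ignored. */
	unlink(tmp);

	/* Create the link under a temporary name, then rename it into
	 * place, so that readers see either the old or the new target.
	 */
	if (symlink(target, tmp) < 0)
		return -1;
	if (rename(tmp, dst) < 0) {
		unlink(tmp);
		return -1;
	}

	/* Flush the filesystem holding dir. If the IO path is stuck,
	 * this call blocks, which is the intended, observable behaviour.
	 */
	fd = open(dir, O_RDONLY | O_DIRECTORY);
	if (fd < 0)
		return -1;
	rc = syncfs(fd);
	close(fd);
	return rc;
}
```

A caller would invoke, for example, publish_link_durably("/mars/some-resource", "some-versionlink", "some-value") right after a logrotate; the path and names here are placeholders. Flushing the whole filesystem is more expensive than syncing a single directory, but it matches the approach described in the commit message above.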
2016-02-03 22:01:48 +01:00
sy_old light: fix missing versionlink upon slow or defective IO 2016-02-03 22:01:48 +01:00
brick_atomic.h all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
brick_checking.h all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
brick_locks.h all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
brick_mem.c all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
brick_mem.h infra: fix potential fault 2016-01-02 10:18:33 +01:00
brick_say.c all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
brick_say.h all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
brick.c infra: remove outdated code 2015-03-23 13:48:11 +01:00
brick.h infra: fix wrong error code ENOSYS 2015-10-07 10:44:35 +02:00
gpl-2.0.txt all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
Kbuild infra: fix BUILDTAG for out-of-tree builds 2014-08-08 10:38:32 +02:00
Kconfig infra: disable DEBUG_SLAB 2014-06-18 12:10:55 +02:00
lamport.c all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
lamport.h all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
lib_limiter.c all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
lib_limiter.h all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
lib_log.c all: correct error code EIO 2015-01-20 15:20:10 +01:00
lib_log.h all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
lib_mapfree.c infra: clean buffer cache on opening block devices 2015-06-17 11:33:18 +02:00
lib_mapfree.h infra: clean buffer cache on opening block devices 2015-06-17 11:33:18 +02:00
lib_pairing_heap.h all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
lib_queue.h all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
lib_rank.c all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
lib_rank.h all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
lib_timing.c infra: report global IO hangs 2015-02-27 11:32:57 +01:00
lib_timing.h infra: report peak IO latencies 2015-02-27 11:32:57 +01:00
Makefile infra: add standalone make 2014-06-18 12:10:54 +02:00
mars_aio.c aio: fix race on shutdown 2015-07-15 10:38:49 +02:00
mars_aio.h aio: fix race on shutdown 2015-07-15 10:38:49 +02:00
mars_bio.c infra: clean buffer cache on opening block devices 2015-06-17 11:33:18 +02:00
mars_bio.h all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
mars_buf.c all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
mars_buf.h all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
mars_check.c all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
mars_check.h all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
mars_client.c all: correct error code EIO 2015-01-20 15:20:10 +01:00
mars_client.h all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
mars_copy.c copy: reset copy area upon consistency errors 2015-02-24 09:19:46 +01:00
mars_copy.h all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
mars_dummy.c all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
mars_dummy.h all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
mars_generic.c all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
mars_if.c if: fix wrong error code ENOSYS 2015-10-07 10:44:44 +02:00
mars_if.h all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
mars_net.c server: fix memory leak on writes 2015-10-19 07:24:20 +02:00
mars_net.h all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
mars_server.c server: fix memory leak on writes 2015-10-19 07:24:20 +02:00
mars_server.h server: fix memory leak on writes 2015-10-19 07:24:20 +02:00
mars_sio.c sio: fix race on shutdown 2015-07-15 10:38:49 +02:00
mars_sio.h sio: fix race on shutdown 2015-07-15 10:38:49 +02:00
mars_trans_logger.c logger: move ranking array from stack to brick instance 2016-01-02 10:18:22 +01:00
mars_trans_logger.h logger: move ranking array from stack to brick instance 2016-01-02 10:18:22 +01:00
mars_usebuf.c all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
mars_usebuf.h all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00
mars.h all: disallow sync IO during emergency mode 2015-02-11 15:20:26 +01:00
meta.h all: clarify license GPLv2+ 2014-11-25 18:09:17 +01:00