light: fix missing versionlink upon slow or defective IO

A primary appeared to have died and was rebooted.
In the meantime, the old secondary was forcefully switched
to primary.

Afterwards, the old primary = new secondary got stuck because 2
versionlinks, which had been _produced_ by _itself_, were
missing there, although they were present at the new primary = old
secondary!

How could this happen?

All transaction logfiles were fully present and correct everywhere.

However, the old primary's kern.log showed that there must have been
a problem with the RAID system. In addition, the RAID controller's
error log also reported some problems which appeared to have healed.

Problem analysis shows the following possibility:

The transaction logger can continue to write data, even via
fsync(), while the _writeback_ of other parts of the /mars filesystem
(e.g. symlink updates) is stuck for a long time due to an IO problem.
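
The reason is that fsync() only covers the one file it is called on.
A minimal userspace sketch of the distinction, as an analogy only (this
is not MARS code; the helper names append_durably() and
replace_symlink() are hypothetical):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: append data to a logfile and fsync() it.
 * fsync() makes this one file durable, nothing else in the fs.
 */
int append_durably(const char *logfile, const char *data)
{
	int fd = open(logfile, O_WRONLY | O_CREAT | O_APPEND, 0600);
	ssize_t len = (ssize_t)strlen(data);
	int ret = -1;

	if (fd < 0)
		return -1;
	/* A short write is treated as an error in this sketch. */
	if (write(fd, data, len) == len && fsync(fd) == 0)
		ret = 0;
	close(fd);
	return ret;
}

/* Hypothetical helper: replace a symlink via temporary name + rename().
 * The new link becomes visible atomically, but nothing here forces it
 * to disk; that is left to later writeback.
 */
int replace_symlink(const char *target, const char *linkpath)
{
	char tmp[4096];

	if (snprintf(tmp, sizeof(tmp), "%s.tmp", linkpath) >= (int)sizeof(tmp))
		return -1;
	unlink(tmp); /* ignore errors: a stale temp link may not exist */
	if (symlink(target, tmp) != 0)
		return -1;
	if (rename(tmp, linkpath) != 0) {
		unlink(tmp);
		return -1;
	}
	return 0;
}
```

After append_durably() returns 0, the log data is on stable storage,
while a link written by replace_symlink() may still exist only in
memory. This matches the observed state: complete logfiles, but
missing versionlinks.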

Usually, slow or even missing symlink updates are no problem because
upon recovery after a reboot, everything is healed by transaction
replay (possibly replaying much more data than really necessary,
but this does not affect semantics, and it is even advantageous
when RAID disks might contain defective data).
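
Why replaying too much is harmless can be illustrated with a toy model.
The sketch below assumes that every log record is a plain overwrite of
one block; struct log_rec, replay() and replay_is_equivalent() are
hypothetical names, and the real MARS log format is not modeled here:

```c
#include <stddef.h>
#include <string.h>

#define NR_BLOCKS 8

/* Hypothetical log record: overwrite one block with one value. */
struct log_rec {
	int block;
	int value;
};

/* Replay records [from, to) onto dev; each record is a plain overwrite. */
void replay(int dev[NR_BLOCKS], const struct log_rec *log,
	    size_t from, size_t to)
{
	size_t i;

	for (i = from; i < to; i++)
		dev[log[i].block] = log[i].value;
}

/* Return 1 if replaying from 'early' yields the same device content
 * as replaying from the exact position 'applied'.
 * Requires early <= applied <= n.
 */
int replay_is_equivalent(const struct log_rec *log, size_t n,
			 size_t applied, size_t early)
{
	int exact[NR_BLOCKS] = {0};
	int extra[NR_BLOCKS] = {0};

	/* State at crash time: records [0, applied) are already applied. */
	replay(exact, log, 0, applied);
	memcpy(extra, exact, sizeof(extra));

	replay(exact, log, applied, n); /* minimal replay */
	replay(extra, log, early, n);   /* replays more than necessary */

	return memcmp(exact, extra, sizeof(exact)) == 0;
}
```

In this model, the extra records only rewrite blocks with values that
are overwritten again in the original order, so the final state is the
same. The rewrite also refreshes blocks that may have been damaged on
disk.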

There is one exception: after a logrotate, the corresponding new
versionlink should appear within a short time. Otherwise, the
above-mentioned scenario could emerge.

We use sync_filesystem() to ensure that any versionlink update
to a _new_ versionlink is either guaranteed to become persistent,
or (in case of IO problems) the mars_light thread will hang, which
will be (hopefully) noticed soon by monitoring.
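
For illustration, a userspace analogue of this approach can be sketched
with syncfs(), the Linux system call that flushes the filesystem
containing an open file descriptor. The names struct seq_state,
sync_fs_of() and sync_on_new_sequence() are hypothetical; the kernel
code in this commit uses sync_filesystem() directly:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syncfs() */
#endif
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical state, loosely mirroring rot->inf_old_sequence. */
struct seq_state {
	int old_sequence; /* highest sequence number already synced */
	int sync_count;   /* number of full syncs issued (illustration) */
};

/* Flush the whole filesystem containing 'dir'.
 * Blocks until writeback has completed, so it may hang on broken IO.
 */
int sync_fs_of(const char *dir)
{
	int fd = open(dir, O_RDONLY | O_DIRECTORY);
	int ret;

	if (fd < 0)
		return -1;
	ret = syncfs(fd);
	close(fd);
	return ret;
}

/* Issue the expensive sync only when a new sequence number shows up.
 * Returns 1 if a sync was done, 0 if skipped, -1 on error.
 */
int sync_on_new_sequence(struct seq_state *st, const char *dir, int seq)
{
	if (seq <= st->old_sequence)
		return 0;
	if (sync_fs_of(dir) != 0)
		return -1;
	st->old_sequence = seq;
	st->sync_count++;
	return 1;
}
```

Unlike the kernel code, where mars_sync() returns void, this sketch
advances the remembered sequence number only after a successful sync.
Gating on the sequence number keeps the cost low: ordinary updates of
an existing versionlink do not trigger a full filesystem sync.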
Thomas Schoebel-Theuer 2016-01-29 07:59:54 +01:00 committed by Thomas Schoebel-Theuer
parent 0e6bb47cb6
commit 8e2de8288d
3 changed files with 33 additions and 0 deletions

@@ -544,6 +544,7 @@ struct mars_rotate {
	struct mars_limiter sync_limiter;
	struct mars_limiter fetch_limiter;
	int inf_prev_sequence;
	int inf_old_sequence;
	long long flip_start;
	loff_t dev_size;
	loff_t start_pos;
@@ -1320,6 +1321,10 @@ void write_info_links(struct mars_rotate *rot)
		}
		if (inf.inf_is_logging || inf.inf_is_replaying) {
			count += _update_version_link(rot, &inf);
			if (min > rot->inf_old_sequence) {
				mars_sync();
				rot->inf_old_sequence = min;
			}
		}
	}
	if (count) {
@@ -2462,6 +2467,7 @@ void _create_new_logfile(const char *path)
		}
	} else {
		MARS_DBG("created empty logfile '%s'\n", path);
		mars_sync();
		filp_close(f, NULL);
		mars_trigger();
	}

@@ -177,6 +177,7 @@ extern struct mars_brick *make_brick_all(
/* General fs wrappers (for abstraction)
*/
extern int mars_stat(const char *path, struct kstat *stat, bool use_lstat);
extern void mars_sync(void);
extern int mars_mkdir(const char *path);
extern int mars_rmdir(const char *path);
extern int mars_unlink(const char *path);

@@ -137,6 +137,32 @@ int mars_stat(const char *path, struct kstat *stat, bool use_lstat)
}
EXPORT_SYMBOL_GPL(mars_stat);

void mars_sync(void)
{
	struct file *f;
	mm_segment_t oldfs;

	oldfs = get_fs();
	set_fs(get_ds());
	f = filp_open("/mars", O_DIRECTORY | O_RDONLY, 0);
	set_fs(oldfs);
	if (unlikely(IS_ERR(f)))
		return;
	if (likely(f->f_mapping)) {
		struct inode *inode = f->f_mapping->host;

		if (likely(inode && inode->i_sb)) {
			struct super_block *sb = inode->i_sb;

			down_read(&sb->s_umount);
			sync_filesystem(sb);
			up_read(&sb->s_umount);
		}
	}
	filp_close(f, NULL);
}

int mars_mkdir(const char *path)
{
	mm_segment_t oldfs;