Weaken the assertions a bit and just adjust missing appropriately.
Things may not match up perfectly if the split point is a backlog
entry, so just make missing what it should be and worry less about
what it was.
Here is the specific crash:
09.06.18 16:29:15.085353 1124096336 osd1 10 pg[1.8( v 5'4/3'2 (0'0,5'4] n=2 ec=2 les=10 10/3) r=1 lcod 0'0 stray m=1] my log = log(0'0,5'4]+backlog
3'1 (0'0) m 200.00000000/head by mds0.1:1 09.06.18 16:20:07.524996 indexed
3'2 (0'0) m 2.00000000/head by mds0.1:5 09.06.18 16:20:07.527454 indexed
5'3 (3'1) m 200.00000000/head by mds0.1:23 09.06.18 16:20:25.128842 indexed
5'4 (5'3) m 200.00000000/head by mds0.1:35 09.06.18 16:20:48.623669 indexed
09.06.18 16:29:15.085393 1124096336 osd1 10 pg[1.8( v 5'4/3'2 (0'0,5'4] n=2 ec=2 les=10 10/3) r=1 lcod 0'0 stray m=1] osd2 log = log(8'68,9'69]+backlog
3'2 (0'0) b 2.00000000/head by mds0.1:5 09.06.18 16:20:07.527454
9'69 (8'68) m 200.00000000/head by mds0.1:1114 09.06.18 16:28:08.837907
09.06.18 16:29:15.085416 1124096336 osd1 10 pg[1.8( v 5'4/3'2 (0'0,5'4] n=2 ec=2 les=10 10/3) r=1 lcod 0'0 stray m=1] merge_log log(8'68,9'69]+backlog from osd2 into log(0'0,5'4]+backlog
09.06.18 16:29:15.085456 1124096336 osd1 10 pg[1.8( v 5'4/3'2 (0'0,5'4] n=2 ec=2 les=10 10/3) r=1 (log bound mismatch, actual=[3'2,9'69] len=2) lcod 0'0 stray m=1] merge_log split point is 3'2 (0'0) b 2.00000000/head by mds0.1:5 09.06.18 16:20:07.527454
09.06.18 16:29:15.085472 1124096336 osd1 10 pg[1.8( v 5'4/3'2 (0'0,5'4] n=2 ec=2 les=10 10/3) r=1 (log bound mismatch, actual=[3'2,9'69] len=2) lcod 0'0 stray m=1] merge_log merging 3'2 (0'0) b 2.00000000/head by mds0.1:5 09.06.18 16:20:07.527454
09.06.18 16:29:15.085493 1124096336 osd1 10 pg[1.8( v 5'4/3'2 (0'0,5'4] n=2 ec=2 les=10 10/3) r=1 (log bound mismatch, actual=[3'2,9'69] len=2) lcod 0'0 stray m=2] merge_log merging 9'69 (8'68) m 200.00000000/head by mds0.1:1114 09.06.18 16:28:08.837907
osd/PG.h: In function 'void PG::Missing::add_next_event(PG::Log::Entry&)':
osd/PG.h:494: FAILED assert(missing[e.soid].need == e.prior_version)
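A minimal sketch of the weakened check, with stand-in types (the real
structures live in osd/PG.h; this is illustration, not the actual Ceph
code): if the object already has a missing entry, don't assert that its
recorded need matches the incoming entry's prior_version; just overwrite
need with what it should now be.

    #include <map>
    #include <string>

    // Stand-in types for illustration only.
    struct eversion_t {
      unsigned epoch = 0, version = 0;
    };
    struct Entry { eversion_t version, prior_version; std::string soid; };
    struct MissingItem { eversion_t need, have; };

    std::map<std::string, MissingItem> missing;

    void add_next_event(const Entry &e) {
      auto p = missing.find(e.soid);
      if (p != missing.end()) {
        // was: assert(p->second.need == e.prior_version);  <- the crash above
        // The split point may be a backlog entry, so the recorded need can
        // legitimately disagree with prior_version.  Just set it.
        p->second.need = e.version;
      } else {
        missing[e.soid] = MissingItem{e.version, e.prior_version};
      }
    }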
And call it from trim_peers(), so that we always apply the same
conditions on log trimming.
This ensures we don't trim the logs while degraded via one of the
other trimming paths.
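A hedged sketch of the shape of this change, with hypothetical names;
the point is that trim_peers() and every other trim path consult one
shared predicate.

    // Hypothetical helper: one place that decides whether trimming is safe.
    struct PGInfo {
      bool degraded;
    };

    static bool can_trim_log(const PGInfo &pg) {
      return !pg.degraded;   // same condition on every trimming path
    }

    void trim_peers(const PGInfo &pg) {
      if (!can_trim_log(pg))
        return;              // never trim while degraded
      // ... tell replicas how far they may trim ...
    }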
I'm pretty sure this was giving inconsistent results across archs,
because bits would get shifted into the high 32 and then back again
on x86_64 but not x86_32.
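For illustration (hedged; not the exact expression from the code), this
is the class of bug: with an unsigned long, which is 64 bits on x86_64
but 32 bits on x86_32, bits shifted up past bit 31 survive a round trip
on one arch and are lost on the other.

    #include <cstdio>

    int main() {
      unsigned long v = 0xdeadbeefUL;   // 64-bit on x86_64, 32-bit on x86_32
      unsigned long r = (v << 16) >> 16;
      // x86_64: bits 16..31 shift into bits 32..47 and come back; r == v.
      // x86_32: those bits fall off the top and are gone; r == 0xbeef.
      printf("%lx\n", r);
      return 0;
    }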
When a replica finds itself fully up to date (last_complete ==
last_update) it tells the primary; the primary performs the same check
on itself. If the primary finds that min_last_complete_ondisk has
changed, it sends out a trim command.
This will let us drop huge pg logs out of memory after a recovery
without waiting for IO and the usual piggybacked trimming logic
to kick in.
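A hedged sketch of the flow with stand-in types and names (the real
messages and fields are Ceph's): the primary recomputes the minimum
last_complete_ondisk across peers whenever one reports completion, and
broadcasts a trim when that minimum advances.

    #include <algorithm>
    #include <vector>

    using eversion = unsigned long long;  // stand-in for eversion_t

    struct Peer { eversion last_update, last_complete, last_complete_ondisk; };

    struct Primary {
      std::vector<Peer> peers;            // primary + replicas
      eversion min_lcod = 0;              // min last_complete_ondisk

      // Called when a peer reports last_complete == last_update.
      void note_peer_complete() {
        if (peers.empty())
          return;
        eversion m = peers[0].last_complete_ondisk;
        for (const Peer &p : peers)
          m = std::min(m, p.last_complete_ondisk);
        if (m > min_lcod) {               // the minimum moved forward
          min_lcod = m;
          send_trim(m);                   // tell everyone to trim to m
        }
      }

      void send_trim(eversion to) {
        // ... hypothetical: message replicas to drop log entries <= to ...
      }
    };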
Okay, do not rely on the MDS to provide dentry positioning information,
since it is all relative to the start _string_ we provide, and that
string's position in the directory can change without notice.
Simplify readdir a bit wrt seeks. A seek to 0, to a new frag, or to a
point prior to the current chunk resets the buffered state.
For each frag, we walk through chunks, always in order. We set
dentry positions/offsets based on the frag and position within our
sweep across the frag. Successive chunks are grabbed from the MDS
relative to a filename (not offset), so concurrent
insertions/removals don't bother us (although we will not see
insertions lexicographically prior to our position).
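Roughly, the sweep looks like this (a hedged sketch; fetch_chunk and
emit are hypothetical stand-ins for the MDS request and the dirent
callback):

    #include <cstdint>
    #include <string>
    #include <vector>

    struct Dentry { std::string name; };

    // Hypothetical: ask the MDS for the next chunk of entries in 'frag'
    // sorting after filename 'after' ("" means start of frag).
    std::vector<Dentry> fetch_chunk(uint32_t frag, const std::string &after);
    void emit(const Dentry &d, uint64_t off);

    void sweep_frag(uint32_t frag) {
      std::string last;    // last filename we saw
      uint64_t pos = 0;    // position within our sweep of this frag
      for (;;) {
        std::vector<Dentry> chunk = fetch_chunk(frag, last);
        if (chunk.empty())
          break;           // frag exhausted
        for (const Dentry &d : chunk) {
          // Offsets come from the frag and our own sweep position, never
          // from anything the MDS says about where the dentry sits.
          emit(d, ((uint64_t)frag << 32) | pos++);
          last = d.name;
        }
      }
    }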
The final placement seed needs to factor in the pool, but the pool
can't be fed into stable_mod or you get weird results (for example,
1.ff and 1.adff won't necessarily map to the same thing because of
the stable_mod). Add the pool to the stable_mod result instead. The
seed itself doesn't need to be bounded; it's just an input for CRUSH,
so long as there are a limited number of such inputs for a given
pool.
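Concretely (a hedged sketch; stable_mod here follows the shape of Ceph's
ceph_stable_mod, and the seed composition is the point of the change):

    #include <cassert>
    #include <cstdint>

    // Like x % b, but stable as b (pg_num) grows past a power of two.
    static uint32_t stable_mod(uint32_t x, uint32_t b, uint32_t bmask) {
      if ((x & bmask) < b)
        return x & bmask;
      return x & (bmask >> 1);
    }

    // Fold the pool in *after* stable_mod; the sum is unbounded, which is
    // fine since it is only an input to CRUSH.
    static uint32_t placement_seed(uint32_t pool, uint32_t ps,
                                   uint32_t pg_num, uint32_t pg_num_mask) {
      return stable_mod(ps, pg_num, pg_num_mask) + pool;
    }

    int main() {
      // pg_num = 256 (mask 0xff): 1.ff and 1.adff share their low bits,
      // so they must land on the same seed.
      assert(placement_seed(1, 0xff, 256, 0xff) ==
             placement_seed(1, 0xadff, 256, 0xff));
      return 0;
    }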
Needs to factor in frag_is_leftmost to account for . and .., just
like the fi->offset calculation in readdir_prepopulate. Fixes the
problem where an ls on a large dir returns duplicate entries.
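Something like the following (a hedged sketch; the real calculation
lives in the client's readdir code):

    #include <cstdint>

    // In the leftmost frag the client synthesizes "." and "..", so real
    // entries start at offset 2 there; other frags start at 0.
    static uint64_t dentry_offset(uint32_t frag, bool leftmost,
                                  uint64_t index_in_frag) {
      uint64_t base = leftmost ? 2 : 0;   // skip the "." and ".." slots
      return ((uint64_t)frag << 32) | (base + index_in_frag);
    }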
This is mainly just because /bin/ls will use the size, or blocks,
or blksize to decide how big of a buffer to allocate for getdents,
and the default of 4MB is unreasonably big. 64k seems like an
okay number, I guess.
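For context, a hedged sketch of what such a caller effectively does (not
actual /bin/ls source): the buffer handed to getdents gets sized from the
directory's stat fields, so what we report there directly drives the
allocation.

    #include <algorithm>
    #include <cstdint>

    static uint64_t getdents_buf_size(uint64_t st_size, uint64_t st_blksize) {
      // Size the read buffer from whatever stat reports; a huge st_size
      // means a huge allocation, while a 64k st_blksize keeps it sane.
      return std::max(st_size, st_blksize);
    }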