Just concatenate operations to a bufferlist as we go. No
distinct decoding step is needed; we parse the transaction as it
is replayed/applied. This avoids the overhead of the old decoded
intermediate representation.
Since we still decode the old version, that code is still there,
but not used for anything new.
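A minimal sketch of the idea, using a plain byte vector in place of
Ceph's bufferlist and made-up op tags; not the actual Transaction
encoding:

    #include <cstdint>
    #include <cstring>
    #include <string>
    #include <vector>

    // Stand-in for a transaction that encodes ops straight into a flat byte
    // buffer as they are queued, instead of keeping a decoded op list around.
    struct TxSketch {
      std::vector<uint8_t> buf;

      void append_u32(uint32_t v) {
        const uint8_t *p = reinterpret_cast<const uint8_t*>(&v);
        buf.insert(buf.end(), p, p + sizeof(v));
      }

      // Hypothetical "write" op: tag, object id, length, payload.
      void write(uint32_t oid, const std::string &data) {
        append_u32(1);  // OP_WRITE (illustrative tag)
        append_u32(oid);
        append_u32(static_cast<uint32_t>(data.size()));
        buf.insert(buf.end(), data.begin(), data.end());
      }
    };

    // Replay walks the buffer and applies ops in place; no intermediate
    // decoded representation is built first.
    inline void replay(const TxSketch &t) {
      size_t pos = 0;
      auto read_u32 = [&]() {
        uint32_t v;
        std::memcpy(&v, t.buf.data() + pos, sizeof(v));
        pos += sizeof(v);
        return v;
      };
      while (pos < t.buf.size()) {
        uint32_t op = read_u32();
        if (op == 1) {                 // OP_WRITE
          uint32_t oid = read_u32();
          uint32_t len = read_u32();
          (void)oid;
          // ... apply the write from buf[pos, pos+len) here ...
          pos += len;
        }
      }
    }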
Missed this before:
- no need to initialize in create_pending(); the constructor does that
- int32_t, not int
- pool_max while we're at it
- initialize pool_max in OSDMap constructor
This potentially has issues, since pools are not removed from the map
until after all the PGs are removed (which is threaded, not inline with
map delivery). But Sage thinks it's okay and the system keeps working
even if you delete a pool while benchmarking on it with rados.
This lets us return NULL if the pool isn't in the map, which is
needed for pool deletion. Meanwhile, code that expects the pool to
exist will continue to crash if it doesn't.
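A minimal sketch of the lookup pattern, with hypothetical names (not
the actual OSDMap accessor):

    #include <cstddef>
    #include <cstdint>
    #include <map>

    struct pg_pool_sketch { /* pool fields elided */ };

    struct OSDMapSketch {
      std::map<int32_t, pg_pool_sketch> pools;

      // Returns NULL once the pool has been deleted from the map; callers
      // that assume the pool exists will still crash on dereference, as before.
      const pg_pool_sketch *get_pool(int32_t id) const {
        std::map<int32_t, pg_pool_sketch>::const_iterator it = pools.find(id);
        return it == pools.end() ? NULL : &it->second;
      }
    };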
There are lots of callers to journal_dirty_inode that may
unwittingly be dealing with a non-head inode (e.g.
check_file_max). If the provided inode is snapped, infer an
appropriate follows value so as not to cow_inode() again.
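A minimal sketch of the inference, with hypothetical inode fields (the
real CInode layout differs):

    #include <cstdint>

    typedef uint64_t snapid_t;

    // Hypothetical inode shape: a snapped (non-head) inode covers a snapshot
    // range starting at 'first'; the head inode is the live one.
    struct InodeSketch {
      snapid_t first;
      bool is_head;
    };

    // If a caller unwittingly hands us a snapped inode, derive 'follows' from
    // the inode itself so the journaling path does not cow_inode() again.
    inline snapid_t infer_follows(const InodeSketch &in, snapid_t follows) {
      return in.is_head ? follows : in.first - 1;
    }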
This allows _do_cap_update to clear out the client_range.
Kill (now) unused/unnecessary 'wanted' arg to _do_cap_update.
Also delay cap removal until after _do_cap_update (which takes
a Capability*). This probably needs further cleanup.
If a recovery op finished right as another recovery op was being
started, we could get into start_recovery_ops() with max = 0 and
not start anything. Since the PG wasn't being
requeued for later, it would never recover. So, requeue if we
race and get max == 0.
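A minimal sketch of the fix, with hypothetical hooks standing in for
the OSD's recovery queue:

    // Hypothetical PG with a requeue hook; names are illustrative.
    struct PGSketch {
      void queue_recovery() { /* put this PG back on the recovery work queue */ }
    };

    // If a finishing op consumed the budget while we were being scheduled, we
    // can arrive here with max == 0; requeue instead of silently dropping the PG.
    inline bool start_recovery_ops_sketch(PGSketch &pg, int max) {
      if (max == 0) {
        pg.queue_recovery();
        return false;   // nothing started; we'll get another turn later
      }
      // ... start up to 'max' recovery ops ...
      return true;
    }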
We caught a bunch of crashes like this:
10.02.11 17:01:01.600660 7f87070c3950 -- 10.3.14.134:6800/8203 >> 10.3.14.130:6800/18914 pipe(0x7fc2be2cebe0 sd=36 pgs=2409 cs=1 l=0).do_sendmsg error Broken pipe
10.02.11 17:01:01.600700 7f87070c3950 -- 10.3.14.134:6800/8203 >> 10.3.14.130:6800/18914 pipe(0x7fc2be2cebe0 sd=36 pgs=2409 cs=1 l=0).writer error sending 0x7fc27da1c570, 32: Broken pipe
10.02.11 17:01:01.600796 7f87070c3950 -- 10.3.14.134:6800/8203 >> 10.3.14.130:6800/18914 pipe(0x7fc2be2cebe0 sd=-1 pgs=2409 cs=1 l=0).fault initiating reconnect
...
./common/Thread.h: In function 'int Thread::join(void**)':
./common/Thread.h:66: FAILED assert(0)
1: (Thread::join(void**)+0x73) [0x64fcd3]
2: (SimpleMessenger::Pipe::join_reader()+0x68) [0x6555a2]
3: (SimpleMessenger::Pipe::connect()+0xf5) [0x645be9]
4: (SimpleMessenger::Pipe::writer()+0x157) [0x64793d]
5: (SimpleMessenger::Pipe::Writer::entry()+0x19) [0x63e107]
6: (Thread::_entry_func(void*)+0x20) [0x64e816]
7: /lib/libpthread.so.0 [0x7fc2c3bbdfc7]
8: (clone()+0x6d) [0x7fc2c2e005ad]
that look a bit like multiple threads were racing into
join_reader(). Add an assert to catch that if it happens again,
and also wrap thread starts in pipe_lock to ensure we keep the
_running flags in sync with reality. Add a few other sanity
checks too.
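A minimal sketch of the locking pattern, with std::thread standing in
for the Pipe's reader thread (names are illustrative, not the
SimpleMessenger code):

    #include <cassert>
    #include <mutex>
    #include <thread>

    // Sketch of keeping a running flag in sync with the thread it describes
    // by starting the thread and flipping the flag under the same lock.
    struct PipeSketch {
      std::mutex pipe_lock;
      bool reader_running = false;
      std::thread reader_thread;

      void start_reader() {
        std::lock_guard<std::mutex> l(pipe_lock);
        assert(!reader_running);   // catch a second racer trying to start it
        reader_running = true;
        reader_thread = std::thread([this] { /* read loop */ });
      }

      void join_reader() {
        std::thread t;
        {
          std::lock_guard<std::mutex> l(pipe_lock);
          if (!reader_running)
            return;                // nothing to join; flag matches reality
          reader_running = false;
          t = std::move(reader_thread);
        }
        t.join();                  // join outside the lock
      }
    };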
We need to update the beacon timestamp even when we are updating
the mds state. Otherwise we can get caught in a busy loop
between marking an mds laggy and !laggy because the beacon stamp
never updates. So even if we are updating and the reply will be
slow, refresh our timestamp so we don't mark the mds laggy.
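A minimal sketch of the ordering, with hypothetical names (not the
MDSMonitor code):

    #include <chrono>

    // Hypothetical monitor-side sketch: refresh the beacon stamp before doing
    // anything slow, so a pending state update can't leave the stamp stale
    // and flap the mds between laggy and !laggy.
    struct MDSBeaconSketch {
      std::chrono::steady_clock::time_point last_beacon;

      void handle_beacon(bool state_update_pending) {
        last_beacon = std::chrono::steady_clock::now();  // always refresh first
        if (state_update_pending) {
          // ... process the (possibly slow) state change; reply comes later ...
        }
      }
    };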
We're going backwards, so once this test fails, it always fails,
and we can break instead of continue. Any skipped intervals will
be pruned shortly anyway.
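A minimal sketch of the control-flow change, with made-up interval
fields:

    #include <map>

    struct IntervalSketch { unsigned last_epoch; };

    // Walking from newest to oldest: once an interval falls below the cutoff,
    // every older one does too, so break rather than continue.
    inline void scan_prior_intervals(const std::map<unsigned, IntervalSketch> &past,
                                     unsigned cutoff_epoch) {
      for (std::map<unsigned, IntervalSketch>::const_reverse_iterator p = past.rbegin();
           p != past.rend(); ++p) {
        if (p->second.last_epoch < cutoff_epoch)
          break;   // was 'continue'; skipped intervals get pruned shortly anyway
        // ... consider this interval ...
      }
    }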
We already required this if prior PG members were down, so this
affected the 'failure' case. We now also require it for
non-failure PG changes (expansion, migration).
This fixes our maybe_went_rw calculation for prior PG intervals,
which is based on up_thru. If maybe_went_rw is false when the
pg actually went rw, we can lose (and have lost) data. But it is
not practical to calculate without up_thru being consistently
updated, because determining whether a pg would have been able to
go active depends on knowing last_epoch_started at a previous
point in time, which then determines how many prior intervals
may have been considered, which in turn determines whether
up_thru would have been updated, etc. Much simpler to update it
all the time.
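A minimal sketch of how up_thru feeds the calculation (field names are
illustrative, not the actual PG interval code):

    // Hypothetical sketch of the dependency: a prior interval "maybe went rw"
    // only if the then-primary's recorded up_thru falls within that interval,
    // which is why up_thru must be updated consistently.
    struct PriorIntervalSketch {
      unsigned first, last;   // epoch range of the interval
    };

    inline bool maybe_went_rw_sketch(const PriorIntervalSketch &i,
                                     unsigned primary_up_thru) {
      // If up_thru is only updated on failure handling, this can come out
      // false even though the pg really did go rw in that interval.
      return primary_up_thru >= i.first && primary_up_thru <= i.last;
    }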
This should not impose a significantly greater cost, since we
already need it for the failure case. And in general the
migration/expansion/whatever case is no more common or critical
than the failure case.