If the queue's reference is the last one held on the pg, the pg->put()
in the loop will destroy the pg while its lock is still held,
leading to #3071. Thus, grab an extra reference in case we need to
drop the queue's reference.
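A minimal sketch of the ordering this fix relies on. It assumes a simple intrusive refcount whose last put() deletes the object. `RefCountedPG`, `drop_queue_ref()` and `pg_destroyed` are hypothetical names used for illustration, not Ceph's PG code. The extra get() guarantees that the put() made under the lock never reaches zero, so any destruction happens only after unlock.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

static int pg_destroyed = 0;  // counts deletions, for illustration only

// Hypothetical refcounted object standing in for a PG (not Ceph's PG class).
struct RefCountedPG {
  std::atomic<int> nref{1};  // starts with one ref, held by the queue
  std::mutex lock;

  void get() { ++nref; }
  void put() {
    if (--nref == 0) {
      ++pg_destroyed;
      delete this;
    }
  }
};

// Drop the queue's reference while holding the pg lock without ever
// destroying the pg (and its mutex) while that mutex is locked.
void drop_queue_ref(RefCountedPG* pg) {
  pg->get();          // extra ref so the put() below cannot hit zero
  pg->lock.lock();
  pg->put();          // drop the queue's ref under the lock
  pg->lock.unlock();
  pg->put();          // drop our extra ref; may destroy pg, but unlocked
}
```

If the queue held the only reference, the pg is freed by the final put(), after the unlock. If another holder exists, the pg survives with that holder's reference.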
Signed-off-by: Samuel Just <sam.just@inktank.com>
The user log entry contains the request id, which will be used
by replay ops to put themselves in the correct place in the
waiting_for_commit/ack maps. Thus, the repop needs to be tagged
with the same version as the log entry with the request id.
Therefore, the log entry bearing the request id should be the last in
the log entry vector.
This should fix #3072, wherein a replay which should wait on
the repop tagged as version '36 will instead wait on '35.
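A sketch of the ordering rule. The `LogEntry` type and the `build_log()` helper are hypothetical, not Ceph's pg log types. Entries without a request id are appended first and the request-id bearing entry last, so the repop takes that last entry's version.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a pg log entry; field names are illustrative.
struct LogEntry {
  uint64_t version;
  std::string reqid;  // empty for entries that carry no request id
};

// Append internal entries first and the request-id bearing entry last,
// assigning consecutive versions. Returns the version the repop should
// be tagged with: that of the request-id entry.
uint64_t build_log(std::vector<LogEntry>& log, uint64_t next_version,
                   int n_internal, const std::string& reqid) {
  for (int i = 0; i < n_internal; ++i)
    log.push_back({next_version++, ""});
  log.push_back({next_version, reqid});
  return log.back().version;
}
```

With `next_version = 35` and one internal entry, the internal entry gets '35 and the request-id entry gets '36. The repop is therefore tagged '36, which is the version a replay of that request should wait on.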
Signed-off-by: Samuel Just <sam.just@inktank.com>
want_acting is filled in during recovery completion in
order to move the newly backfilled osd into its correct
place. In this case, however, want_acting must contain
only members of acting and up, so we can be sure that
if any of them go down, we would restart peering anyway.
Therefore, we need not transition to WaitActingChange, which
does not reflect that we continue to serve client operations
in the interim.
Signed-off-by: Samuel Just <sam.just@inktank.com>
When we get a pool_op_reply, we find out which osdmap we need to wait for.
The wait_for_new_map() code was feeding that epoch into
maybe_request_map(), which was feeding it to the monitor with the subscribe
request. However, that epoch is the *start* epoch, not what we want. Fix
this code to always subscribe starting from the epoch after the one we have
(have + 1), and ensure we keep asking for more until we catch up to the
epoch we know we should eventually get.
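A sketch of the corrected subscription logic. It assumes a hypothetical `MapWaiter` struct that records subscribe requests instead of sending them to the monitor, and it is not the client's actual code. The epoch from the pool_op_reply only sets the target (`want`). Each request starts from `have + 1`, and every arriving map triggers another request until `have` reaches `want`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical client-side state; not the actual client API.
struct MapWaiter {
  uint32_t have = 0;           // newest osdmap epoch we hold
  uint32_t want = 0;           // epoch the reply says we must reach
  std::vector<uint32_t> subs;  // start epochs of subscribe requests sent

  // Always subscribe starting from what we have + 1, never from the
  // epoch carried in the reply.
  void maybe_request_map() {
    if (have < want)
      subs.push_back(have + 1);
  }

  void wait_for_new_map(uint32_t epoch) {
    if (epoch > want)
      want = epoch;
    maybe_request_map();
  }

  // Called as maps arrive; keep asking until we catch up to `want`.
  void handle_map(uint32_t epoch) {
    if (epoch > have)
      have = epoch;
    maybe_request_map();
  }
};
```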
Bug: #3075
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Commit dd011aba90 changed
the conf file sample to say {hostname}, but changed the
prose only from ``localhost`` to ``{localhost}``.
Signed-off-by: Tommi Virtanen <tv@inktank.com>
Do not special-case failure during connect. In particular, we may be
reconnecting and experience a second fault, and wipe out our session
(e.g., between the fs client and the mds) and destroy important session
state.
This logic dates back to the original patch in '08 when the standby
state was introduced.
Bug: #3070
Signed-off-by: Sage Weil <sage@inktank.com>
Uses a fixed access/secret key for easier testing. Starts a standalone
apache2 process with basic config (based on the teuthology one).
Signed-off-by: Sage Weil <sage@inktank.com>
If we encounter nobackfill, let ourselves fall out of the recovery
queue. If we encounter a map that does not have the flag set and we
are not clean, requeue ourselves. This is a big hammer, but simple.
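A sketch of the "big hammer" under stated assumptions. `RecoveringPG` and `RecoveryQueue` are hypothetical types, and the nobackfill flag and the clean state are passed in as plain booleans. While the flag is set, dequeued pgs are simply dropped. Each map without the flag requeues any pg that is not clean.

```cpp
#include <cassert>
#include <deque>

// Hypothetical pg state for this sketch; not Ceph's PG class.
struct RecoveringPG {
  bool clean = false;
  bool queued = false;
};

struct RecoveryQueue {
  std::deque<RecoveringPG*> q;

  void enqueue(RecoveringPG* pg) {
    if (!pg->queued) {
      pg->queued = true;
      q.push_back(pg);
    }
  }

  // Pop the next pg to recover. While nobackfill is set, pgs fall out
  // of the queue instead of being recovered.
  RecoveringPG* next(bool nobackfill) {
    while (!q.empty()) {
      RecoveringPG* pg = q.front();
      q.pop_front();
      pg->queued = false;
      if (!nobackfill)
        return pg;
    }
    return nullptr;
  }

  // On every new map without the flag, requeue a pg that is not clean.
  void handle_map(bool nobackfill, RecoveringPG* pg) {
    if (!nobackfill && !pg->clean)
      enqueue(pg);
  }
};
```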
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Mike Ryan <mike.ryan@inktank.com>
Reviewed-by: Greg Farnum <gregory.farnum@inktank.com>
Fixes: #3068
Bug #2954
Consider the following case:
1) Primary calls share_pg_info()
2) Primary processes client op and sends off sub_op to replica
3) Replica processes sub_op
4) Replica processes info, reverting stats to before 2)
Similarly:
1) Primary processes client op
2) Primary calls share_pg_info()
3) Replica processes info
[4) Replica processes sub_op]
If 4) is interrupted by a map change, we can end up in a case where
the replica's info has a stat which reflects a log entry which
is not there. If that log ends up authoritative, the most recent
op will be replayed and end up double counted in the log.
There should actually be no cases where the stats change after the
replica goes active except as part of a sub_op_modify. Thus,
ReplicaActive::MInfoRec should not update the stats.
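A sketch of the handler change. The `PGInfo`/`PGStats` types and the function names are hypothetical, not Ceph's `ReplicaActive` code. The active replica applies other fields from the primary's info (here just `last_epoch_started`), and stats change only through the sub_op path. The usage in the test replays the first interleaving above: the sub_op lands before the stale info, and the stats are not reverted.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical subset of pg info; field names are illustrative.
struct PGStats {
  uint64_t num_objects = 0;
};

struct PGInfo {
  uint32_t last_epoch_started = 0;
  PGStats stats;
};

// Replica already active: apply the primary's info, but keep our own
// stats, since they may already reflect a sub_op the info predates.
void replica_active_handle_info(PGInfo& mine, const PGInfo& from_primary) {
  mine.last_epoch_started = from_primary.last_epoch_started;
  // Intentionally no `mine.stats = from_primary.stats;` here.
}

// Stats change only as part of applying a modification.
void sub_op_modify(PGInfo& mine, uint64_t objects_added) {
  mine.stats.num_objects += objects_added;
}
```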
Signed-off-by: Samuel Just <sam.just@inktank.com>
CID 716882: Copy-paste error (COPY_PASTE_ERROR) At (2): "last_epoch_started" in
"other.last_epoch_started" looks like a copy-paste error. Should it say
"last_epoch_split" instead?
From what I can tell, this really should be checking other.last_epoch_split
rather than other.last_epoch_started.
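A sketch of the corrected comparison. It assumes a merge that keeps the maximum of each epoch field. `History` and `merge()` are hypothetical stand-ins for whatever structure the reported code lives in. Each field is compared against its own counterpart in `other`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical subset of pg history fields.
struct History {
  uint32_t last_epoch_started = 0;
  uint32_t last_epoch_split = 0;

  // Take the max of each field; returns true if anything changed.
  bool merge(const History& other) {
    bool modified = false;
    if (last_epoch_started < other.last_epoch_started) {
      last_epoch_started = other.last_epoch_started;
      modified = true;
    }
    if (last_epoch_split < other.last_epoch_split) {  // not other.last_epoch_started
      last_epoch_split = other.last_epoch_split;
      modified = true;
    }
    return modified;
  }
};
```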
Signed-off-by: Samuel Just <sam.just@inktank.com>
CID 717345: Uninitialized pointer field (UNINIT_CTOR) At (8): Non-static class
member "obc" is not initialized in this constructor nor in any functions that
it calls.
At (2): Non-static class member "id" is not initialized in this constructor nor
in any functions that it calls.
At (4): Non-static class member "reply" is not initialized in this constructor
nor in any functions that it calls.
At (6): Non-static class member "timeout" is not initialized in this
constructor nor in any functions that it calls.
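A minimal sketch of the usual fix for an UNINIT_CTOR report. It uses a hypothetical `WatchState` class whose members mirror the names in the report, and the types chosen for `id` and `timeout` are assumptions. Every member is initialized in the constructor's initializer list, so no path can read an indeterminate value.

```cpp
#include <cassert>
#include <cstdint>

struct ObjectContext;  // opaque in this sketch
struct Message;        // opaque in this sketch

// Hypothetical class; member names mirror the Coverity report above.
struct WatchState {
  ObjectContext* obc;
  uint64_t id;
  Message* reply;
  uint32_t timeout;

  // Initialize every member so no path reads an indeterminate value.
  WatchState() : obc(nullptr), id(0), reply(nullptr), timeout(0) {}
};
```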
Signed-off-by: Samuel Just <sam.just@inktank.com>
CID 717344: Uninitialized scalar field (UNINIT_CTOR) At (2): Non-static class
member "epoch_started" is not initialized in this constructor nor in any
functions that it calls.
Signed-off-by: Samuel Just <sam.just@inktank.com>
CID 717343: Uninitialized pointer field (UNINIT_CTOR) At (3): Non-static class
member "snapset" is not initialized in this constructor nor in any functions
that it calls.
Signed-off-by: Samuel Just <sam.just@inktank.com>