If we race with a reconnect we could get a second notify message
before the notify linger op is torn down. Ensure we only ever
call the notify completion once to prevent a segfault.
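Roughly the shape of the fix, as a sketch with made-up names (the real
change is in the librados watch/notify linger code): the completion is
dispatched through a flag that only the first caller can claim, so a
duplicate notify arriving during reconnect becomes a no-op.

    #include <atomic>
    #include <functional>

    struct NotifyOp {
      std::function<void(int)> on_complete;   // user-supplied completion
      std::atomic<bool> completed{false};     // first caller wins

      // Call the completion at most once, even if a second notify races in
      // before the linger op is torn down.
      void complete(int result) {
        bool expected = false;
        if (completed.compare_exchange_strong(expected, true)) {
          if (on_complete)
            on_complete(result);
        }
        // a racing second call falls through and does nothing
      }
    };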
Fixes: #13805
Signed-off-by: Sage Weil <sage@redhat.com>
Since each OSD only sends a failure report for a given peer once,
we don't need to count reports vs reporters separately. (This was
probably a bad idea anyway.) Remove this logic and the associated
config option.
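For illustration only (simplified, not the mon's actual structures):
once reports are keyed by the reporting OSD, "number of reports" and
"number of reporters" are the same value by construction, so one
threshold is enough.

    #include <cstddef>
    #include <map>
    #include <set>

    struct failure_info_t {
      std::set<int> reporters;   // OSDs that reported this peer down
      std::size_t num_reporters() const { return reporters.size(); }
    };

    std::map<int, failure_info_t> pending_failures;  // failed osd -> reporters

    bool enough_reporters(int failed_osd, std::size_t min_reporters) {
      auto it = pending_failures.find(failed_osd);
      return it != pending_failures.end() &&
             it->second.num_reporters() >= min_reporters;
    }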
Reported-by: Greg Farnum <gfarnum@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
We used to have a complicated pg creation process in which we
would query any previous mappings for the pg before we created the
new 'empty' pg locally. The tracking of the prior mappings was
very simple (and broken), but it didn't really matter because the
mon would resend pg create messages periodically. Now it doesn't,
so that broke.
However, none of this is necessary: the PG peering process does
all of the same things. Namely, it
- enumerates past intervals
- determines which ones may have been rw
- queries OSDs from each one to gather any potential changes
This is a more robust version of what the creation code was doing
(or should have been doing). So, let's rip it all out and let
peering handle it. As long as the newly instantiated PG sets
last_epoch_started and _clean to the created epoch we will probe
and consider all of these prior mappings and find any previous
instance of the PG (if one existed).
Yay for removing unnecessary code!
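Rough sketch of the idea with simplified types (the real pg_history_t
carries more fields than this): seeding the new PG's history with the
creation epoch is what makes peering probe every interval from creation
onward and find any prior instance of the pg.

    #include <cstdint>

    using epoch_t = uint32_t;

    struct pg_history_t {
      epoch_t epoch_created = 0;
      epoch_t last_epoch_started = 0;   // last epoch the pg went active
      epoch_t last_epoch_clean = 0;     // last epoch the pg was fully clean
    };

    pg_history_t initial_history(epoch_t created) {
      pg_history_t h;
      h.epoch_created = created;
      h.last_epoch_started = created;   // probe all intervals since creation
      h.last_epoch_clean = created;
      return h;
    }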
Signed-off-by: Sage Weil <sage@redhat.com>
If the .0 pg no longer exists, we know the entire pool was
deleted, and can avoid querying every other pg. (This is a good
thing because leveldb and rocksdb can be very slow to query
missing keys.)
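Sketch of the shortcut with toy types (the real check goes through the
mon's store interface): probe the pool's .0 pg first and only fall back
to per-pg lookups if it is still there.

    #include <cstdint>
    #include <set>
    #include <string>

    // Toy stand-in for the mon's key/value store.
    struct Store {
      std::set<std::string> keys;
      bool exists(const std::string& k) const { return keys.count(k) > 0; }
    };

    std::string pg_key(int64_t pool, uint32_t seed) {
      return std::to_string(pool) + "." + std::to_string(seed);
    }

    // If the pool's .0 pg is gone, the whole pool was deleted; skip the
    // (slow) missing-key lookups for every other pg in that pool.
    bool pool_was_deleted(const Store& store, int64_t pool) {
      return !store.exists(pg_key(pool, 0));
    }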
Signed-off-by: Sage Weil <sage@redhat.com>
Previously we were calculating and managing in-core state that
wasn't committed as part of the pg_map, leading to all sorts of
ugliness that didn't really work. Instead,
* set the mapping for all creating pgs in the committed pg_map
* base all pg create message sending on that committed state
* update the mappings for creating pgs every time we consume a new
  osdmap, so that we have a reliable/stable epoch to attach to
  each mapping.
In particular, having that stable epoch means we have a reference
we can put in the pg create message that will also be used for
the subscription version. That way OSDs get consistent creates
from any mon.
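Sketch of the shape of that state, with illustrative names rather than
the actual PGMap/OSDMonitor code: the mapping is recomputed on every
newly committed osdmap, and the recorded epoch is the one carried in
the pg create message and used as the subscription version.

    #include <cstdint>
    #include <map>

    using epoch_t = uint32_t;

    struct pg_t {
      int64_t pool; uint32_t seed;
      bool operator<(const pg_t& o) const {
        return pool < o.pool || (pool == o.pool && seed < o.seed);
      }
    };

    struct creating_pg_t {
      int acting_primary = -1;   // OSD expected to instantiate the pg
      epoch_t map_epoch = 0;     // osdmap epoch the mapping was computed from
    };

    // Lives in the committed pg_map, so every mon derives identical state.
    std::map<pg_t, creating_pg_t> creating_pgs;

    // Re-run whenever a newly committed osdmap is consumed.
    template <typename CalcPrimary>
    void update_creating_pgs(epoch_t new_epoch, CalcPrimary calc_primary) {
      for (auto& [pgid, c] : creating_pgs) {
        c.acting_primary = calc_primary(pgid);  // remap against the new osdmap
        c.map_epoch = new_epoch;                // stable epoch for the create msg
      }
    }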
Signed-off-by: Sage Weil <sage@redhat.com>
If the OSD is down it will ignore the message. If it gets marked up, we
will eventually consume that map and call check_subs().
Signed-off-by: Sage Weil <sage@redhat.com>
1. MonClient remembers our subscriptions; only indicate we want
osd_pg_creates once, in init.
2. We don't need to re-request the latest osdmap each time we
reconnect.
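Roughly what (1) looks like, with a paraphrased stand-in for the
MonClient interface (not the real one): wants are recorded once and
replayed automatically on every reconnect.

    #include <cstdint>
    #include <functional>
    #include <map>
    #include <string>

    struct MonSubscriber {
      std::map<std::string, uint64_t> wants;   // what -> start version

      void sub_want(const std::string& what, uint64_t start) {
        wants[what] = start;                   // remembered across reconnects
      }

      void on_reconnect(
          const std::function<void(const std::string&, uint64_t)>& send) {
        for (auto& [what, start] : wants)
          send(what, start);     // includes osdmap; no extra re-request needed
      }
    };

    void osd_init(MonSubscriber& monc) {
      monc.sub_want("osd_pg_creates", 0);      // once, here, and never again
    }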
Signed-off-by: Sage Weil <sage@redhat.com>
Instead of resending all subscriptions, only send the new ones. This
avoids races like
- ask for 4+
- mon sends maps 4-50
- ask for 4+ and something else
- mon has to resend same maps and the other thing
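Sketch of the bookkeeping, with illustrative names: keep the wants that
were already sent separate from the newly added ones, and only put the
delta on the wire.

    #include <cstdint>
    #include <map>
    #include <string>
    #include <utility>

    struct SubTracker {
      std::map<std::string, uint64_t> sub_sent;  // already on the wire
      std::map<std::string, uint64_t> sub_new;   // added/changed since last send

      void want(const std::string& what, uint64_t start) {
        auto it = sub_sent.find(what);
        if (it == sub_sent.end() || it->second != start)
          sub_new[what] = start;                 // only this needs to go out
      }

      // Return just the delta; an unchanged want is never resent, so the mon
      // never has to regenerate maps it already produced for it.
      std::map<std::string, uint64_t> take_new() {
        auto out = std::move(sub_new);
        sub_new.clear();
        for (auto& [what, start] : out)
          sub_sent[what] = start;
        return out;
      }
    };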
Signed-off-by: Sage Weil <sage@redhat.com>
Generate and send pg create messages only for those OSDs that have
subscribed on this monitor. This is N times more efficient (where
there are N monitors) than the previous method.
Signed-off-by: Sage Weil <sage@redhat.com>
We want to know about all future pg creations, not just those pending
when we start. (This only helps once the mon knows how to do this...)
Signed-off-by: Sage Weil <sage@redhat.com>
Track pg creations, grouped by the first epoch they mapped to a particular
OSD. This will be necessary to send messages only for new creations.
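Sketch of the grouping, with simplified types standing in for the real
ones: per OSD, creating pgs are bucketed by the first epoch at which
they mapped there, so a later pass can emit create messages only for
epochs newer than what that OSD has already been told.

    #include <cstdint>
    #include <map>
    #include <set>

    using epoch_t = uint32_t;

    struct pg_t {
      int64_t pool; uint32_t seed;
      bool operator<(const pg_t& o) const {
        return pool < o.pool || (pool == o.pool && seed < o.seed);
      }
    };

    // osd -> (first epoch the pg mapped to that osd -> pgs created then)
    std::map<int, std::map<epoch_t, std::set<pg_t>>> creating_pgs_by_osd_epoch;

    void note_creating_pg(int osd, epoch_t first_mapped_epoch, pg_t pgid) {
      creating_pgs_by_osd_epoch[osd][first_mapped_epoch].insert(pgid);
    }

    // Only pgs whose first-mapped epoch is newer than what the OSD already
    // has need to go into the next create message.
    std::set<pg_t> new_creates_for(int osd, epoch_t osd_has_through) {
      std::set<pg_t> out;
      auto it = creating_pgs_by_osd_epoch.find(osd);
      if (it == creating_pgs_by_osd_epoch.end())
        return out;
      for (auto e = it->second.upper_bound(osd_has_through);
           e != it->second.end(); ++e)
        out.insert(e->second.begin(), e->second.end());
      return out;
    }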
Signed-off-by: Sage Weil <sage@redhat.com>
It will be less work for the old primary to ignore the create message
and the new one to query it and find nothing than for the slightly more
complicated peering and removal process to happen. Also, this reduces
bloat in the OSDMap a bit.
Signed-off-by: Sage Weil <sage@redhat.com>
Currently the leader mon often replies to OSDs by sending a set of
incremental OSDmaps (e.g., in response to an osd boot or failure).
Instead, send a small message to the proxying peon mon (if any)
with the epoch to start from and let *them* generate a suitable
reply.
Signed-off-by: Sage Weil <sage@redhat.com>
This is only needed for legacy clients to avoid confusing them--
we don't actually need the renewals at all. Make them infrequent
to reduce mon load.
Signed-off-by: Sage Weil <sage@redhat.com>
Simplify the session liveness detection:
- renew on any message
- renew on keepalive[2] messages (lightweight ping in msgr)
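Sketch of the simplified handling, using a toy session type rather than
the real MonSession: both paths just stamp the session, and trimming
compares that stamp against the timeout.

    #include <chrono>

    struct MonSession {
      std::chrono::steady_clock::time_point last_alive =
          std::chrono::steady_clock::now();
      void touch() { last_alive = std::chrono::steady_clock::now(); }
    };

    // Any message from the client is proof of life...
    void handle_message(MonSession& s) { s.touch(); }

    // ...and so is a keepalive/keepalive2, a lightweight messenger-level
    // ping with nothing else to process.
    void handle_keepalive(MonSession& s) { s.touch(); }

    bool session_expired(const MonSession& s, std::chrono::seconds timeout) {
      return std::chrono::steady_clock::now() - s.last_alive > timeout;
    }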
Signed-off-by: Sage Weil <sage@redhat.com>
This significantly reduced CPU utilization on the bigbang scale
testing cluster at CERN. Note that it is already disabled for
leveldb by default (in ceph_mon.cc).
Signed-off-by: Sage Weil <sage@redhat.com>
This ensures that we don't throttle back mon reports so much that
the mon times out due to no pg stat reports. Since there is
little value in having a lower max anyway, just set this at an
upper bound (relative to the mon's timeout value).
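The intended relationship, sketched with illustrative option names and
a half-the-timeout bound chosen just for the example: keep the OSD's
maximum report interval comfortably below the mon's no-report timeout
so throttled reports can never look like a dead OSD.

    #include <algorithm>

    struct ReportConfig {
      int osd_mon_report_interval_max;  // max seconds between mon reports
      int mon_osd_report_timeout;       // mon gives up on an osd after this
    };

    int effective_report_interval_max(const ReportConfig& c) {
      return std::min(c.osd_mon_report_interval_max,
                      c.mon_osd_report_timeout / 2);
    }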
Signed-off-by: Sage Weil <sage@redhat.com>
We need an exclusive lock over paths that update state related to
mon reports, lest they step on fields like up_thru_*, *stats_ack*,
last_mon_report, and so on. Everybody still needs a read lock on
map_lock as well, in order to get a stable OSDMap epoch.
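Sketch of the lock discipline, with standard-library locks standing in
for the real ones: the mon-report path takes the report lock exclusively
while everyone, including that path, takes map_lock shared to pin an
OSDMap epoch.

    #include <mutex>
    #include <shared_mutex>

    std::mutex mon_report_lock;   // serializes up_thru_*, *stats_ack*,
                                  // last_mon_report, ...
    std::shared_mutex map_lock;   // guards the current OSDMap

    void send_mon_report() {
      std::lock_guard report(mon_report_lock);  // exclusive: report state
      std::shared_lock map(map_lock);           // shared: stable OSDMap epoch
      // ... read the osdmap epoch and update the report fields ...
    }

    void unrelated_map_reader() {
      std::shared_lock map(map_lock);           // readers only need map_lock
      // ... use the map without touching report state ...
    }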
Signed-off-by: Sage Weil <sage@redhat.com>
We don't need to restart the boot process unless we are in preboot;
if we are in booting state we just need to resend the boot
message.
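Sketch of the distinction, with simplified state names: only preboot
restarts the whole sequence; a reconnect while already booting just
resends the boot message.

    // Simplified states; the real OSD state machine has more of them.
    enum class BootState { PREBOOT, BOOTING, ACTIVE };

    struct Booter {
      BootState state = BootState::PREBOOT;

      void send_boot_message() { /* (re)send the boot message to the mon */ }
      void start_boot() {
        // re-verify maps, then boot
        state = BootState::BOOTING;
        send_boot_message();
      }

      // Called when the mon session is re-established.
      void on_mon_reconnect() {
        if (state == BootState::PREBOOT)
          start_boot();              // restart the whole boot process
        else if (state == BootState::BOOTING)
          send_boot_message();       // earlier boot message may have been lost
        // ACTIVE: nothing to do
      }
    };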
Signed-off-by: Sage Weil <sage@redhat.com>
The MonClient subscribe service is stateful--we don't need to
force a new subscribe send unless sub_want() says we need to.
Keep forcing it for instances where we request an *old* map.
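Sketch of the call-site logic, with a paraphrased stand-in for the
MonClient interface: sub_want() reports whether the recorded wants
actually changed, and the send is only forced when explicitly
requesting an older map.

    #include <cstdint>
    #include <map>
    #include <string>

    struct MonSubs {
      std::map<std::string, uint64_t> wants;   // stateful across reconnects

      bool sub_want(const std::string& what, uint64_t start) {
        auto it = wants.find(what);
        if (it != wants.end() && it->second == start)
          return false;                        // nothing new to send
        wants[what] = start;
        return true;
      }

      void renew_subs() { /* push the current wants to the mon */ }
    };

    void request_osdmaps(MonSubs& monc, uint64_t from, bool want_old_map) {
      bool changed = monc.sub_want("osdmap", from);
      if (changed || want_old_map)             // keep forcing for *old* maps
        monc.renew_subs();
    }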
Signed-off-by: Sage Weil <sage@redhat.com>