Share past intervals when starting up new replicas. This can happen via
an MOSDPGInfo or an MOSDPGLog message.
Fix up get_or_create_pg() so the past_intervals arg is required (and a ref,
like the other args). Fix doxygen comment.
Now the only time generate_past_intervals() should do any work is when
upgrading old clusters, during pg creation, and (possibly) during pg
split (when that is fully implemented).
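As a rough illustration only (the real function takes more arguments and the
Ceph-specific types), the shape of the get_or_create_pg() change is roughly:

    #include <map>
    #include <vector>

    struct pg_info_t { };                                   // simplified stand-in
    struct pg_interval_t { std::vector<int> up, acting; };  // simplified stand-in
    class PG;

    // before: past_intervals was optional and could be omitted by the caller
    // after:  it is a required const reference, like the other inputs
    PG *get_or_create_pg(const pg_info_t &info,
                         const std::map<unsigned, pg_interval_t> &past_intervals,
                         unsigned epoch, int from);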
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
If ceph-osd is way behind, we will advance through past maps before we
mark ourselves up. This avoids the slow recalculation once we are up, and
the ensuing badness.
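A minimal sketch of the idea (not the actual boot path; names are illustrative):
consume the backlog of map epochs before sending the boot message, so peering
does not have to grind through them after we are already up.

    void catch_up_then_boot(unsigned superblock_epoch, unsigned mon_newest_epoch) {
      unsigned e = superblock_epoch;
      while (e < mon_newest_epoch) {
        ++e;
        // fetch/load the incremental for epoch e and apply it here
      }
      // only once we are caught up do we send the boot message and mark up
    }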
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
There is a nice symmetry there with fulfill_log(), but it is a short
function with a single caller that mostly just forces us to copy a bunch
of data structures around unnecessarily. Drop it.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Send past_intervals along with pg_info_t on every notify. The reasoning
here is as follows:
- we already have the state in memory
- if we don't send it, and the primary doesn't have it, it will
recalculate it by reading/decoding many previous maps from disk
- for a highly-tortured cluster, I see past_intervals on the order of
  ~6 KB, times 600 pgs means ~3.5 MB sent for every activate_map(). For
  comparison, the same cluster would need to read and decode ~1 GB of
  maps to recalculate the same info.
- for healthy clusters, the data is small, and costs little.
- for unhealthy clusters, the data is large, but most useful.
In theory we could set a threshold so that we don't send it if it is
large, but allow the primary to query it explicitly. I doubt it's worth
the complexity.
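For illustration, assuming simplified stand-in types, the notify payload
effectively becomes the pair below rather than the bare info:

    #include <map>
    #include <vector>

    struct pg_info_t { };                                                       // simplified
    struct pg_interval_t { unsigned first, last; std::vector<int> acting; };    // simplified

    struct notify_payload {
      pg_info_t info;
      std::map<unsigned, pg_interval_t> past_intervals;  // keyed by first epoch of interval
    };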
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
We can (currently) get into a situation where we don't have the full
history back to last_epoch_clean: non-primaries record past intervals
as they go but don't initially have the full history, so they end up
with only a partial, recent history.
If this happens, only fill in what's missing; no need to rebuild the recent
parts too.
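A sketch of the intended behavior, with illustrative names (the real code works
on the pg's recorded interval map):

    #include <map>

    struct pg_interval_t { unsigned first = 0, last = 0; };

    // Generate only the intervals older than what we already hold, down to
    // last_epoch_clean; the recent intervals we recorded are kept as-is.
    void fill_missing(std::map<unsigned, pg_interval_t> &past_intervals,
                      unsigned last_epoch_clean) {
      if (past_intervals.empty())
        return;  // nothing recorded yet; the full rebuild path handles this
      unsigned have_back_to = past_intervals.begin()->second.first;
      if (have_back_to <= last_epoch_clean)
        return;  // already complete
      // ...compute intervals for [last_epoch_clean, have_back_to) and insert...
    }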
Signed-off-by: Sage Weil <sage@newdream.net>
We may not be able to recalculate all the way back to last_epoch_clean due to
the oldest_map floor. Figure out what we want, and what we could actually
calculate, before deciding whether what we have is insufficient.
Also, print something if we discard and recalculate so it is clear what is
happening and why.
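In other words (illustrative helper, not the real code), the target epoch is
clamped by the oldest map we still have before comparing it against what is
already recorded:

    #include <algorithm>

    // The furthest back we could possibly recalculate is bounded by the
    // oldest map still on disk, so that is the realistic target to compare
    // our existing past_intervals against.
    unsigned realistic_target(unsigned last_epoch_clean, unsigned oldest_map) {
      return std::max(last_epoch_clean, oldest_map);
    }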
Signed-off-by: Sage Weil <sage@newdream.net>
We may send an MOSDMap as a reply to various requests, including
- a failure report
- a boot message
- a pg_temp message
- an up_thru message
In these cases, send a single MOSDMap message, but limit how big it gets.
All recipients here are osds, which are smart enough to request more maps
based on the MOSDMap::newest_map field.
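A rough sketch of the bounding logic (names invented for illustration); the
recipient sees newest_map and asks for more if its range was truncated:

    #include <algorithm>

    struct map_range {
      unsigned first = 0, last = 0;  // inclusive range of epochs included
      unsigned newest_map = 0;       // newest epoch the sender has
    };

    map_range bounded_reply(unsigned peer_has, unsigned our_newest,
                            unsigned max_maps_per_message) {
      map_range r;
      r.newest_map = our_newest;
      r.first = peer_has + 1;
      r.last = std::min(our_newest, peer_has + max_maps_per_message);
      return r;
    }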
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Compare *every* address for a match, or else note that it is (or might be)
different. Previously, we falsely took diff==0 to mean that all addrs
were definitely equal, which was not necessarily the case.
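The corrected rule, sketched with plain types for illustration: only conclude
"same" after every slot has been checked.

    #include <string>
    #include <vector>

    bool addrs_definitely_equal(const std::vector<std::string> &a,
                                const std::vector<std::string> &b) {
      if (a.size() != b.size())
        return false;               // might be different; do not assume equal
      for (size_t i = 0; i < a.size(); ++i)
        if (a[i] != b[i])
          return false;
      return true;                  // every address matched
    }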
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Fixes: #2353. The problem was that (at least) two osd processes were
racing on fs detection, which triggered errors in the btrfs snapshot
create/remove.
Signed-off-by: Yehuda Sadeh <yehuda.sadeh@dreamhost.com>
This allows you to get and set subsystem debug levels via the
normal config access methods. Among other things, this allows librados
users to set debug levels.
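For example, a librados user can now tune a subsystem's debug level through the
usual config calls; this assumes the C++ librados bindings, and the subsystem
and level shown are arbitrary:

    #include <rados/librados.hpp>

    int main() {
      librados::Rados cluster;
      cluster.init(nullptr);                    // default client id
      cluster.conf_read_file(nullptr);          // default ceph.conf search path
      cluster.conf_set("debug_objecter", "20"); // set a subsystem debug level
      return cluster.connect() == 0 ? 0 : 1;
    }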
Fixes: #2350
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
We only deal with the case where the entire map is identical, since the
individual items are too small to make the pointer overhead worthwhile.
Too bad. An in-memory btree-like structure would work better for this.
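The sharing itself is just pointer reuse; a minimal sketch with generic types
(nothing Ceph-specific):

    #include <map>
    #include <memory>
    #include <string>

    using entity_map = std::map<int, std::string>;

    // If the freshly decoded map is identical to the previous epoch's,
    // keep pointing at the old object instead of storing a second copy.
    std::shared_ptr<const entity_map>
    maybe_share(const std::shared_ptr<const entity_map> &prev, entity_map &&fresh) {
      if (prev && *prev == fresh)
        return prev;
      return std::make_shared<const entity_map>(std::move(fresh));
    }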
Signed-off-by: Sage Weil <sage@newdream.net>
This is cruft from the old primary/chain/splay replication code. All
current code says <0 is stray, 0 is primary, and >0 is replica. That is,
the role is the acting vector position, or -1 if not in the vector.
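Stated as code (a sketch of the rule, not the exact helper in the tree):

    #include <vector>

    int calc_role(int whoami, const std::vector<int> &acting) {
      for (size_t i = 0; i < acting.size(); ++i)
        if (acting[i] == whoami)
          return static_cast<int>(i);  // 0 = primary, >0 = replica
      return -1;                       // not in the acting vector: stray
    }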
Signed-off-by: Sage Weil <sage@newdream.net>
Compare two maps. If an addr matches, share the reference. If all
addrs match, share the entire vector.
This leads to roughly a 70% drop in memory utilization for the set of
thrashed maps I'm working with.
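Sketch of the per-entry sharing, using generic shared pointers as stand-ins for
the real refcounted address types:

    #include <memory>
    #include <string>
    #include <vector>

    using addr_ref = std::shared_ptr<const std::string>;
    using addr_vec_ref = std::shared_ptr<const std::vector<addr_ref>>;

    addr_vec_ref dedup_addrs(const addr_vec_ref &old,
                             const std::vector<std::string> &fresh) {
      auto out = std::make_shared<std::vector<addr_ref>>();
      bool all_same = old && old->size() == fresh.size();
      for (size_t i = 0; i < fresh.size(); ++i) {
        if (old && i < old->size() && *(*old)[i] == fresh[i]) {
          out->push_back((*old)[i]);   // address matches: share the reference
        } else {
          out->push_back(std::make_shared<const std::string>(fresh[i]));
          all_same = false;
        }
      }
      if (all_same)
        return old;                    // everything matched: share the vector
      return out;
    }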
Signed-off-by: Sage Weil <sage@newdream.net>
It is possible that the crush map contains device ids that do not exist as
osds. Filter them out of the CRUSH result.
Drop the max devices assert, as that is trivially violated by adding a new
item to the crush map beyond max_osd (via 'ceph osd crush add ...').
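Illustrative filtering step (the "exists as an osd" predicate is whatever the
caller has on hand; here it is just a bitmap):

    #include <algorithm>
    #include <vector>

    // Drop raw CRUSH output ids that don't correspond to an existing osd,
    // including ids at or beyond max_osd.
    void filter_raw_result(std::vector<int> &result, int max_osd,
                           const std::vector<bool> &osd_exists) {
      result.erase(std::remove_if(result.begin(), result.end(),
                                  [&](int id) {
                                    return id < 0 || id >= max_osd || !osd_exists[id];
                                  }),
                   result.end());
    }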
Signed-off-by: Sage Weil <sage@newdream.net>