We should only rely on whether our paxos version overlaps with whatever
they have -- we'll catch up with them later.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
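The check reduces to a range-overlap test. A minimal sketch, assuming
hypothetical first_committed/last_committed fields (not named in this
commit):

    #include <cstdint>

    struct PaxosRange {
      uint64_t first_committed;
      uint64_t last_committed;
    };

    // Hypothetical helper: if our committed range overlaps the peer's we
    // can proceed now and catch up with them later via normal recovery.
    bool ranges_overlap(const PaxosRange& ours, const PaxosRange& theirs) {
      return ours.first_committed <= theirs.last_committed &&
             theirs.first_committed <= ours.last_committed;
    }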
We have timeouts that will clean everything up, and this can happen
in some cases that we've decided are legitimate. Hopefully we'll
be able to do something else later.
Signed-off-by: Greg Farnum <greg@inktank.com>
We reverted the gating by paxos sequences, so now we don't
need to look at them at all.
This reverts commit 1e6f02b337.
Signed-off-by: Greg Farnum <greg@inktank.com>
This was somehow broken -- out-of-date leaders were being elected -- and
we've decided smaller band-aids are more appropriate. We don't completely
revert the MMonElection changes, though -- there have been user clusters
running the code which includes these messages so we can't pretend it
never happened. We can make them clearly unused in the code, though.
This reverts commit fcaabf1a22.
Signed-off-by: Greg Farnum <greg@inktank.com>
Stopping the flusher is essentially the shutdown step for the
ObjectCacher - the next thing is actually destroying it.
If we leave any reads outstanding, when they complete they will
attempt to use the now-destroyed ObjectCacher. This is particularly a
problem with rbd images, since an -ENOENT can instantly complete many
readers, so the upper layers don't wait for the other rados-level
reads of that object to finish before trying to shut down the cache.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
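A hedged sketch of the shutdown ordering described above; the names here
are illustrative, not the actual ObjectCacher interface. The point is to
wait for in-flight reads to drain before tearing the cache down:

    #include <condition_variable>
    #include <mutex>

    // Illustrative only: tracks in-flight reads so shutdown can wait for them.
    class CacheShutdownSketch {
      std::mutex lock;
      std::condition_variable cond;
      int reads_outstanding = 0;   // bumped on submit, dropped on completion
    public:
      void read_started()  { std::lock_guard<std::mutex> l(lock); ++reads_outstanding; }
      void read_finished() {
        std::lock_guard<std::mutex> l(lock);
        if (--reads_outstanding == 0)
          cond.notify_all();
      }
      // Called before destroying the cache: any completion arriving after
      // destruction would touch freed memory, so block here until all finish.
      void wait_for_reads() {
        std::unique_lock<std::mutex> l(lock);
        cond.wait(l, [this] { return reads_outstanding == 0; });
      }
    };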
We need to call reset during every election cycle; luckily we
can call it more than once. bump_epoch is (by definition!) only called
once per cycle, and it's called at the beginning, so we put it there.
Fixes: #4858
Signed-off-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
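A sketch of that placement, with illustrative names rather than the real
Elector code:

    // Illustrative only: reset() is safe to repeat, bump_epoch() runs exactly
    // once per election cycle, so reset() lives at the top of bump_epoch().
    struct ElectorSketch {
      unsigned epoch = 0;

      void reset() {
        // clear acked peers, deferral state, expire timers, ...
      }

      void bump_epoch(unsigned e) {
        reset();       // guaranteed to happen once, at the start of the cycle
        epoch = e;
        // persist the epoch, then start the election proper
      }
    };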
#leveldb on freenode says > 2MB is nonsense (it might explain the weird
behavior we saw). Riak tuning guide suggests 256KB for large data block
environments. Default is 8KB. 64KB seems sane for us.
Signed-off-by: Sage Weil <sage@inktank.com>
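For reference, a hedged example of setting that block size through the
stock leveldb C++ API (the 64KB value is the choice above; the surrounding
function is illustrative):

    #include <leveldb/db.h>
    #include <string>

    leveldb::DB* open_store(const std::string& path) {
      leveldb::Options opts;
      opts.create_if_missing = true;
      opts.block_size = 64 * 1024;   // 64KB: between the 8KB default and the
                                     // 256KB Riak suggests for large blocks
      leveldb::DB* db = nullptr;
      leveldb::Status s = leveldb::DB::Open(opts, path, &db);
      return s.ok() ? db : nullptr;
    }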
In read cases track stats in PG::unstable_stats
Include unstable_stats in write_info() and publish_stats_to_osd()
For now this information may not get persisted
Fixes: #2209
Signed-off-by: David Zafman <david.zafman@inktank.com>
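A rough sketch of the idea; apart from unstable_stats, write_info(), and
publish_stats_to_osd(), the names below are hypothetical:

    // Read-side deltas accumulate in unstable_stats and are folded in when
    // stats are written or published; they may never be persisted on their own.
    struct object_stat_collection_sketch { uint64_t num_reads = 0; /* ... */ };

    struct PGSketch {
      object_stat_collection_sketch stats;           // persisted stats
      object_stat_collection_sketch unstable_stats;  // read-side, volatile

      void account_read() { ++unstable_stats.num_reads; }

      object_stat_collection_sketch stats_to_publish() const {
        object_stat_collection_sketch out = stats;
        out.num_reads += unstable_stats.num_reads;   // include unstable portion
        return out;   // what write_info()/publish_stats_to_osd() would report
      }
    };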
Rename for clarity:
pg_stats_lock to pg_stats_publish_lock
pg_stats_valid to pg_stats_publish_valid
pg_stats_stable to pg_stats_publish
update_stats() to publish_stats_to_osd()
clear_stats() to clear_publish_stats()
Signed-off-by: David Zafman <david.zafman@inktank.com>
This resolves the leveldb growth-without-bound problem observed by
mikedawson, and all the badness that stems from it. Enable this by
default until we figure out why leveldb is not behaving better.
While we are at it, trim more states at a time. This will make
compaction less frequent, which should help given that there is some
overhead unrelated to the amount of deleted data.
Fixes: #4815
Signed-off-by: Sage Weil <sage@inktank.com>
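Roughly, the defaults this implies look like the following; the option
names and trim count here are illustrative, not the actual config keys:

    // Illustrative defaults only.
    struct MonCompactionDefaults {
      bool compact_on_trim = true;  // on by default until leveldb behaves better
      int  trim_max = 500;          // trim more states per pass so compaction,
                                    // with its fixed overhead, runs less often
    };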
Each time we trim a PaxosService, have leveldb compact so that the
space from removed states is reclaimed.
This is probably not optimal if leveldb's heuristics are doing the right
thing, but it currently appears as if they are not.
Signed-off-by: Sage Weil <sage@inktank.com>
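A minimal sketch of the call site, with illustrative names: the trim path
queues a compaction of the service's own key prefix alongside the deletes
(the transaction side is sketched after the next commit):

    #include <cstdint>
    #include <string>
    #include <utility>
    #include <vector>

    // Illustrative transaction: records erases plus prefixes to compact.
    struct TxnSketch {
      std::vector<std::pair<std::string, std::string>> erases;
      std::vector<std::string> compact_prefixes;
      void erase(const std::string& p, const std::string& k) { erases.push_back({p, k}); }
      void compact_prefix(const std::string& p) { compact_prefixes.push_back(p); }
    };

    void trim_states(TxnSketch& t, const std::string& prefix,
                     uint64_t first, uint64_t new_first) {
      for (uint64_t v = first; v < new_first; ++v)
        t.erase(prefix, std::to_string(v));  // drop the old state
      t.compact_prefix(prefix);              // reclaim its space once applied
    }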
Add a prefix compaction operation to the transaction that will be
performed after the transaction applies.
Signed-off-by: Sage Weil <sage@inktank.com>
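And the store side of that operation, again with hypothetical names; only
the ordering matters here: the writes apply first, then any queued prefix
compactions run:

    #include <string>
    #include <vector>

    struct StoreSketch {
      void apply_writes() { /* commit the queued erases/updates */ }
      void compact(const std::string& prefix) {
        // e.g. a leveldb CompactRange() restricted to keys under this prefix
      }

      void apply_transaction(const std::vector<std::string>& compact_prefixes) {
        apply_writes();                         // the transaction applies first
        for (const auto& p : compact_prefixes)  // then the requested compactions
          compact(p);
      }
    };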
This is an opportunistic time to optimize our local data since we are
out of quorum. It serves as a safety net for cases where leveldb's
automatic compaction doesn't work quite right and lets things get out
of hand.
Anecdotally we have seen stores in excess of 30GB compact down to a few
hundred KB. And a 9GB store compact down to 900MB in only 1 minute.
Signed-off-by: Sage Weil <sage@inktank.com>
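A sketch of where this hooks in, with illustrative names: the monitor is
out of quorum during bootstrap, so a potentially slow whole-store
compaction is tolerable there:

    // Illustrative bootstrap hook only.
    struct WholeStoreSketch {
      void compact() { /* e.g. leveldb CompactRange(nullptr, nullptr) */ }
    };

    void bootstrap(WholeStoreSketch& store, bool compact_on_bootstrap) {
      if (compact_on_bootstrap)
        store.compact();   // slow is acceptable: we are out of quorum anyway
      // reset state, probe peers, rejoin quorum ...
    }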
This is a workaround that makes the warning go away. Not certain there
isn't something we should be changing...
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joe Buck <joe.buck@inktank.com>
After Monitor::init_paxos() has loaded all of the PaxosService state,
we should then map creating pgs to osds. This ensures we do so after the
osdmap has been loaded and the pgs actually map somewhere meaningful.
Fixes: #4675
Signed-off-by: Sage Weil <sage@inktank.com>
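The ordering this fixes, as a sketch with hypothetical helper names:

    // Illustrative startup ordering only.
    struct MonitorSketch {
      void init_paxos() {
        // load all PaxosService state, including the osdmap
      }
      void map_creating_pgs_to_osds() {
        // compute where each pending pg create maps under the loaded osdmap
      }
    };

    void startup(MonitorSketch& mon) {
      mon.init_paxos();                // osdmap is loaded here
      mon.map_creating_pgs_to_osds();  // so the mapping is now meaningful
    }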
This avoids calculating new pg creation mappings if the osdmap isn't
loaded yet, which currently happens during Monitor::init_paxos() on
startup. Assuming the osdmap epoch is nonzero, it should always be
safe to do this (although possibly unnecessary).
More cleanup here is certainly possible, but this is one step toward fixing
the bad behavior for #4675.
Signed-off-by: Sage Weil <sage@inktank.com>
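The guard itself, assuming a hypothetical helper; the only idea taken from
the commit is that an osdmap epoch of zero means the map isn't loaded yet:

    void maybe_map_creates(unsigned osdmap_epoch) {
      if (osdmap_epoch == 0)
        return;   // osdmap not loaded yet; mapping now would be meaningless
      // ... map creating pgs to osds ...
    }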
Factor out the portion of the function that remaps creating pgs to osds
from the part that sends those pending creates out.
Signed-off-by: Sage Weil <sage@inktank.com>
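A sketch of the split, with hypothetical function names:

    // Illustrative only.
    struct PGMonitorSketch {
      void map_pg_creates() {
        // recompute, under the current osdmap, which osd each pending
        // pg create maps to
      }
      void send_pg_creates() {
        // send the already-mapped pending creates to their osds;
        // does no remapping itself
      }
    };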