With writeahead journaling in particular, we can get requests that
stay in the queue for a long time even after the commit is sent to the
client while we are waiting for the transaction to apply to the fs.
Instead of showing up as 'waiting for subops', make it clear that the
client has gotten its reply and it is local state that is slow.
Signed-off-by: Sage Weil <sage@inktank.com>
Small transactions make pg removal nicer to the op queue. It also slows
down PG deletion a bit, which may exacerbate the PG resurrection case
until #3884 is addressed.
At least one user reported this fixed an osd that kept failing due to
an internal heartbeat failure.
Signed-off-by: Sage Weil <sage@inktank.com>
The motivation here is if there is a problem draining the op queue
during a sync. For XFS and ext4, this isn't generally a problem: you
can continue to make writes while a syncfs(2) is in progress. There
are currently some possible implementation issues with btrfs, but we
have not demonstrated them recently.
Meanwhile, this can cause queue length spikes that screw up latency.
During a commit, we allow too much into the queue (say, recovery
operations). After the sync finishes, we have to drain it out before
we can queue new work (say, a higher priority client request). Having
a deep queue below the point where priorities order work limits the
value of the priority queue.
Signed-off-by: Sage Weil <sage@inktank.com>
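
The queue-depth argument above can be illustrated with a toy model. This is
only a sketch under stated assumptions, not the filestore code: a priority
queue feeds a bounded FIFO, and ops already admitted to the FIFO can no longer
be reordered. The function name, the priority values, and the `fifo_limit`
parameter are hypothetical.

```python
import heapq
from collections import deque


def position_of_client_op(fifo_limit, recovery_ops=10):
    """Return how many recovery ops execute before a late client op.

    fifo_limit (>= 1) is the depth of the queue below the priority queue.
    Lower priority number means more urgent.
    """
    pq = []          # (priority, seq, name): ops still subject to priority ordering
    fifo = deque()   # ops already admitted; strictly first-in, first-out
    seq = 0

    # Recovery work arrives first and is admitted up to the FIFO limit.
    for i in range(recovery_ops):
        heapq.heappush(pq, (10, seq, f"recovery-{i}"))
        seq += 1
    while pq and len(fifo) < fifo_limit:
        fifo.append(heapq.heappop(pq)[2])

    # A higher-priority client request arrives afterwards.
    heapq.heappush(pq, (1, seq, "client"))

    executed = []
    while fifo or pq:
        if fifo:
            executed.append(fifo.popleft())
        # Refill the FIFO from the priority queue as slots free up.
        while pq and len(fifo) < fifo_limit:
            fifo.append(heapq.heappop(pq)[2])
    return executed.index("client")
```

In this model the client op waits behind exactly `fifo_limit` recovery ops
(as long as at least that many were queued), so a deeper FIFO directly
increases the delay seen by the high-priority request.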
commit_set() and flush_set() are identical in functionality,
so use flush_set everywhere and remove commit_set from
the code.
Also fixes a bug in flush_set where the finisher context was
getting freed twice if no objects needed to be flushed.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
This workunit runs the internal tests for our local branch of hadoop-common.
Requires ant be installed on the host running the test.
Signed-off-by: Joe Buck <jbbuck@gmail.com>
This is set on start, and subsequently gets into the changed set.
Once any other config value is injected, it is the first thing reported
by the logs, but is confusing and useless to the user. Hide it.
Signed-off-by: Sage Weil <sage@inktank.com>
I do not think we saw any bugs from this, but anything that involved
capability issues on restart or migrate might have been caused by
this.
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Farnum <greg@inktank.com>
The initial values of up/acting need to be based on the PG's osdmap, not
the OSD's latest. Using the latest map can cause various confusion in
pg_interval_t::check_new_interval() when calling OSDMap methods, for
example because the up/acting OSDs do not exist yet.
Fixes: #3879
Reported-by: Jens Kristian Søgaard <jens@mermaidconsulting.dk>
Tested-by: Jens Kristian Søgaard <jens@mermaidconsulting.dk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
The docs had the recommended journal size based on the option
"filestore min sync interval" when it should have been
"filestore max sync interval".
While in there, fix a couple of typos -- multiple when it should
be multiply, and a missing word. Change "Should at least twice"
to "Should be at least twice..."
Signed-off-by: Travis Rhoden <trhoden@gmail.com>
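
As a worked example of the sizing rule referred to above, here is a minimal
sketch. It assumes the rule is "at least twice the product of expected
throughput and filestore max sync interval"; the function name and the sample
numbers are hypothetical.

```python
def recommended_journal_size_mb(expected_throughput_mb_s, max_sync_interval_s):
    """Lower bound on journal size in MB under the assumed rule:
    2 * (expected throughput * filestore max sync interval)."""
    return 2 * expected_throughput_mb_s * max_sync_interval_s


# Hypothetical disk sustaining 100 MB/s with a 5 second max sync interval.
size_mb = recommended_journal_size_mb(100, 5)
```

With these sample inputs the lower bound works out to 1000 MB. Basing the
calculation on the min sync interval instead would understate the amount of
data the journal may have to hold between syncs.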
Check the integer (fixed-point) value to avoid any worries
about floating-point rounding. Add tests for reweight < 0.
Fixes: #3872
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Sage Weil <sage.weil@inktank.com>
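
The idea of validating the fixed-point value rather than the float can be
sketched as follows. This is an illustration, not the monitor code: the
16.16 scale (0x10000 representing 1.0) and the function name are assumptions.

```python
FIXED_ONE = 0x10000  # assumed fixed-point representation of 1.0


def parse_reweight(weight):
    """Convert a float weight to fixed point and range-check the integer.

    Raises ValueError if the converted value falls outside [0, FIXED_ONE].
    int() truncates toward zero, so inputs that truncate to the same
    integer are treated identically.
    """
    fixed = int(weight * FIXED_ONE)
    if fixed < 0 or fixed > FIXED_ONE:
        raise ValueError("reweight out of range: %r" % (weight,))
    return fixed
```

Comparing integers means the accept/reject decision is made on the value that
would actually be stored, so there is no separate float comparison that could
disagree with it.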
For a large PG these are saturating the filestore and journal queues. Do
them synchronously to make them more friendly. They don't need to be fast.
Signed-off-by: Sage Weil <sage@inktank.com>
If the file is opened with O_SYNC, O_DSYNC, or O_RSYNC, we need to
flush cached data (and metadata for O_SYNC) on a write.
For O_RSYNC, we need to flush dirty data on a read.
This patch adds a file_flush() call to the objectCacher
to allow a specific range to be flushed from the cache. In the
O_SYNC and O_DSYNC cases for write, and the O_RSYNC case for read,
it calls that function and waits for the flush to complete. The patch
also adds a flags field directly to the file handle struct, and
replaces the append boolean with the use of the flags field directly.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
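
The flag handling described above can be modelled with a small sketch. It is
not the client code: the flag bits, the FileHandle class, and the in-memory
"cache" are all hypothetical stand-ins, and the flush here is synchronous by
construction.

```python
# Hypothetical flag bits for illustration only. Real O_* values are
# platform-specific and can overlap, so they are not used here.
F_DSYNC = 0x2
F_SYNC = 0x4
F_RSYNC = 0x8


class FileHandle:
    def __init__(self, flags):
        self.flags = flags   # open flags kept on the handle itself
        self.dirty = []      # (offset, length) ranges cached but not yet written back
        self.flushed = []    # ranges that have been written back


def file_flush(fh, offset, length):
    """Write back every dirty range overlapping [offset, offset + length)."""
    end = offset + length
    still_dirty = []
    for (o, l) in fh.dirty:
        if o < end and offset < o + l:
            fh.flushed.append((o, l))
        else:
            still_dirty.append((o, l))
    fh.dirty = still_dirty


def write(fh, offset, data):
    fh.dirty.append((offset, len(data)))
    # O_SYNC / O_DSYNC: the written range must be flushed before returning.
    if fh.flags & (F_SYNC | F_DSYNC):
        file_flush(fh, offset, len(data))


def read(fh, offset, length):
    # O_RSYNC: dirty data in the requested range is flushed before the read.
    if fh.flags & F_RSYNC:
        file_flush(fh, offset, length)
```

Only the range touched by the read or write is flushed, which is the point of
a range-based flush call rather than flushing the whole file.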
osd pool get was missing size, min_size, crash_replay_interval,
and crush_ruleset; they're all easily added.
Fixes: #3869
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
Honor filestore_flush_min in the inline flush case.
Backport: bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
If sync_file_range is not present, we always close inline, and flush
via fdatasync(2).
Fixes compile on ancient platforms like RHEL5.8.
Backport: bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
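
The fallback described above can be sketched roughly as below. This is an
illustration only: the `have_sync_file_range` parameter stands in for a
build-time feature check, and the "queued" branch (handing the fd to a
background flusher) is an assumption about the non-fallback path.

```python
import os
import tempfile


def finish_write(fd, flusher_queue, have_sync_file_range=False):
    """Return 'queued' if the fd is handed to a flusher, else 'inline'."""
    if have_sync_file_range:
        # Assumed behaviour: a flusher would sync the range and close later.
        flusher_queue.append(fd)
        return "queued"
    # Fallback: flush data now and close inline. os.fdatasync is not
    # available on every platform, so fall back to os.fsync if missing.
    flush = getattr(os, "fdatasync", os.fsync)
    flush(fd)
    os.close(fd)
    return "inline"
```

A caller without sync_file_range support would see every write flushed and
closed before finish_write() returns.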
There was already a dependency on python in the debian control file;
a similar dependency was added to the rpm spec file. perl is needed
for the logrotate script, so a dependency on perl was added to
both. Bug 3768.
Signed-off-by: Gary Lowell <gary.lowell@inktank.com>