A bit of collision between the spec changes for rhel7/ceph-common
and Alfredo's pull request for wip-die-ceph-mkcephfs.
Signed-off-by: Sandon Van Ness <sandon@inktank.com>
Raise the clone probability to 8% and lower the flatten probability
to 2%. This should give us longer parent chains (before this we would
usually have one parent, and even then only for a few ops).
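A minimal sketch of how such percentages can drive op selection, assuming a
roll in [0, 100) is drawn per op. The names Op and pick_op are hypothetical
and are not taken from the test tool itself.

```cpp
#include <cassert>

// Hypothetical op kinds; only clone and flatten matter for this sketch.
enum class Op { Clone, Flatten, Other };

// Map a roll in [0, 100) to an op: 8% clone, 2% flatten, the rest other ops.
Op pick_op(int roll) {
    const int clone_pct = 8;    // from the commit message
    const int flatten_pct = 2;  // from the commit message
    if (roll < clone_pct) return Op::Clone;
    if (roll < clone_pct + flatten_pct) return Op::Flatten;
    return Op::Other;
}
```

With clones four times as likely as flattens, a chain tends to gain several
parents before a flatten collapses it.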
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Truncate base images after they have been cloned from to cover more
code paths and make sure that clients look at snapshot parent_overlap
(i.e. parent_overlap of the base image at the time the snapshot was
taken) and not that of the base image (i.e. parent_overlap of the base
image as of now).
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
The C++ version of rbd_get_parent_info() allows passing NULL for parent
image name, image name and snapshot name out parameters. Make the C API do
the same both for consistency and to make it easier to check whether
the image at hand has a parent or not.
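A minimal sketch of the NULL-tolerant out-parameter pattern, assuming a
simplified stand-in for the real call. ParentInfo, copy_if_wanted and
get_parent_info_sketch are hypothetical names, and the error codes are
assumptions rather than the documented librbd behaviour.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the parent record of an image.
struct ParentInfo { const char* pool; const char* image; const char* snap; };

// Copy src into dst only when the caller supplied a buffer.
static int copy_if_wanted(char* dst, std::size_t len, const char* src) {
    if (!dst) return 0;                      // caller does not want this field
    std::size_t need = std::strlen(src) + 1; // include the terminating NUL
    if (need > len) return -ERANGE;          // buffer too small (assumed code)
    std::memcpy(dst, src, need);
    return 0;
}

// Returns 0 if a parent exists, -ENOENT if not (assumed code).
int get_parent_info_sketch(const ParentInfo* parent,
                           char* pool, std::size_t pool_len,
                           char* image, std::size_t image_len,
                           char* snap, std::size_t snap_len) {
    if (!parent) return -ENOENT;
    int r = copy_if_wanted(pool, pool_len, parent->pool);
    if (r == 0) r = copy_if_wanted(image, image_len, parent->image);
    if (r == 0) r = copy_if_wanted(snap, snap_len, parent->snap);
    return r;
}
```

Passing NULL for all three names turns the call into a cheap "does this image
have a parent?" check.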
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Currently, for pools with different rules, "ceph df" cannot report
the correct available space for each pool. For a detailed assessment
of the bug, please refer to bug report #8943.
This patch fixes the bug and makes "ceph df" work correctly.
Fixes Bug #8943
Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
Fix dup bh_write for TX state bh
Tested-by: Sage Weil <sage@redhat.com>
Reviewed-by: Haomai Wang <haomaiwang@gmail.com>
Original changeset
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
A bh in TX state should be skipped because it is already in flight; we only
need to write dirty bhs. Both TX and dirty bhs should still be waited on until
they are flushed.
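A minimal sketch of that rule, assuming a flat list of buffer states. State,
FlushPlan and plan_flush are hypothetical names, not the ObjectCacher API.

```cpp
#include <cassert>
#include <vector>

enum class State { Clean, Dirty, Tx };

// How many buffers to submit for write and how many to wait on.
struct FlushPlan { int to_write; int to_wait; };

FlushPlan plan_flush(const std::vector<State>& bhs) {
    FlushPlan plan{0, 0};
    for (State s : bhs) {
        if (s == State::Dirty) {
            ++plan.to_write;  // needs a new write
            ++plan.to_wait;   // and must be waited on
        } else if (s == State::Tx) {
            ++plan.to_wait;   // already in flight: wait, but do not resubmit
        }
    }
    return plan;
}
```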
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
tx buffers need to go on the bh_lru_rest as well, and removing them erases
them from (rather than inserts them into) dirty_or_tx_bh.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
The else-if chain here was wrong. Handling dirty or tx buffers and
errors should be in independent conditions.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Librbd calculates the maximum number of dirty objects from rbd_cache_max_size,
which is not suitable for every case. If the user sets image order 24, the
calculated result is too small in practice. This increases the overhead of the
trim call, which is made on each read/write op.
Make it a tunable option; by default the value is still calculated.
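A minimal sketch of the "option overrides calculation" behaviour, assuming
that a configured value of 0 means "calculate". The function name and the
formula in the fallback branch are hypothetical and only illustrate why a
large order shrinks the result; they are not the librbd formula.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// configured == 0 means "not set"; any other value is used as-is.
uint64_t max_dirty_objects(uint64_t configured, uint64_t cache_max_size,
                           unsigned order) {
    if (configured != 0) return configured;  // explicit tuning wins
    uint64_t object_size = uint64_t{1} << order;
    // Illustrative formula only: scales with cache size / object size.
    return std::max<uint64_t>(10, cache_max_size / object_size * 2);
}
```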
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
The flush op in ObjectCacher iterates over the whole active object set, and
each dirty object may also own several BufferHeads. If the object set is large,
this consumes too much time.
Use dirty_bh instead to reduce overhead. Now only dirty BufferHeads are
checked.
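A minimal sketch of the idea, assuming a set of dirty buffers is maintained
on every state change so that flush never walks clean ones. CacheSketch and
this BufferHead are simplified, hypothetical stand-ins for the real classes.

```cpp
#include <cassert>
#include <set>
#include <vector>

struct BufferHead {
    int id;
    bool dirty;
    explicit BufferHead(int i) : id(i), dirty(false) {}
};

class CacheSketch {
    std::set<BufferHead*> dirty_bh;  // only the buffers that need flushing
public:
    void mark_dirty(BufferHead* bh) { bh->dirty = true; dirty_bh.insert(bh); }
    void mark_clean(BufferHead* bh) { bh->dirty = false; dirty_bh.erase(bh); }

    // Visit only dirty buffers; cost is proportional to the dirty set,
    // not to the number of cached objects.
    std::vector<int> flush() {
        std::vector<int> flushed;
        for (BufferHead* bh : dirty_bh) {
            flushed.push_back(bh->id);
            bh->dirty = false;
        }
        dirty_bh.clear();
        return flushed;
    }
};
```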
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
This reverts commit 74b386f03e, reversing
changes made to 36265d0db0.
The dirty_or_tx list is used by flush_set, which means we can
resubmit new IOs for writes that are already in progress. This
has a compounding effect that overwhelms the OSDs with dup IOs
and stalls out the client.
See, for example, the failures in this run:
/a/sage-2014-07-25_17:14:20-fs-wip-msgr-testing-basic-plana
The fix is probably pretty simple, but reverting for now to make
the tests pass.
Signed-off-by: Sage Weil <sage@inktank.com>
In the 0.82 release, standbyreplay MDS daemons would try
to reformat the journal if they saw an older version on
disk, where this should have only been done by the active
MDS for the rank. Depending on timing, this could cause
fatal corruption of the journal.
This change handles the following cases:
* only do reformat if not in standbyreplay (else raise EAGAIN
to keep trying until an active mds reformats it)
* if journal header goes away while in standbyreplay then raise
EAGAIN (handle rewrite happening in background)
* if journal version is greater than the max supported, suicide
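The cases above can be sketched as a single decision function. This is an
illustration under assumptions: Action and on_journal_header are hypothetical
names, RetryLater stands for raising EAGAIN, Abort stands for suicide, and the
handling of a missing header on an active MDS is not described above and is
assumed here.

```cpp
#include <cassert>

enum class Action { Proceed, Reformat, RetryLater, Abort };

Action on_journal_header(bool standby_replay, bool header_present,
                         int on_disk_version, int supported_version) {
    if (!header_present) {
        // In standbyreplay the header may be mid-rewrite: try again later.
        // The non-standby result is an assumption, not from the commit text.
        return standby_replay ? Action::RetryLater : Action::Abort;
    }
    if (on_disk_version > supported_version)
        return Action::Abort;  // newer than we understand
    if (on_disk_version < supported_version)
        // Only the active MDS may reformat; standbyreplay waits for it.
        return standby_replay ? Action::RetryLater : Action::Reformat;
    return Action::Proceed;
}
```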
Fixes: #8811
Signed-off-by: John Spray <john.spray@redhat.com>
(cherry picked from commit 5438500af8)
If the cache is full, we block some requests, and then we change the
cache_mode to something else (say, forward), the full waiters don't get
requeued until the cache becomes un-full. In the meantime, however, later
requests will get processed and redirected, breaking the op ordering.
Fix this by requeueing any full waiters if we see that the cache_mode is
not writeback.
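A minimal sketch of the requeue, assuming ops are identified by integers and
that requeued waiters go to the front of the queue in their original order.
PGSketch and its members are hypothetical names.

```cpp
#include <cassert>
#include <deque>

enum class CacheMode { None, Writeback, Forward, ReadOnly };

struct PGSketch {
    std::deque<int> op_queue;                    // ops ready to be processed
    std::deque<int> waiting_for_cache_not_full;  // ops blocked on a full cache

    void on_cache_mode_change(CacheMode mode) {
        if (mode == CacheMode::Writeback) return;  // waiters still make sense
        // Put the blocked ops ahead of anything queued later, keeping order.
        op_queue.insert(op_queue.begin(),
                        waiting_for_cache_not_full.begin(),
                        waiting_for_cache_not_full.end());
        waiting_for_cache_not_full.clear();
    }
};
```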
Fixes: #8931
Signed-off-by: Sage Weil <sage@redhat.com>
We only want to do this if is_active(). Otherwise, the normal
requeueing code will do its thing, taking care to get the queue orders
correct.
Backport: firefly
Signed-off-by: Sage Weil <sage@redhat.com>
We could race with another thread that deletes this right after we call
dec(). Our access of cct would then become a use-after-free. Valgrind
managed to turn this up.
Copy it into a local variable before the dec() to be safe, and move the
dout line below to make this possibility explicit and obvious in the code.
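A minimal sketch of the pattern, assuming a simplified refcounted class.
Context and RefCountedSketch are hypothetical stand-ins, and the counter
increment stands in for the dout line.

```cpp
#include <atomic>
#include <cassert>

// Stand-in for the logging context; outlives the refcounted object.
struct Context { std::atomic<int> log_lines{0}; };

class RefCountedSketch {
    std::atomic<int> nref{1};
    Context* cct;
public:
    explicit RefCountedSketch(Context* c) : cct(c) {}
    virtual ~RefCountedSketch() {}

    void get() { ++nref; }

    void put() {
        Context* local_cct = cct;  // copy while 'this' is certainly alive
        int v = --nref;
        if (v == 0) delete this;
        // After the decrement another thread may have freed 'this', so only
        // the local copy is touched from here on.
        if (local_cct) ++local_cct->log_lines;
    }
};
```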
Signed-off-by: Sage Weil <sage@redhat.com>
Fixes: #8442
Backport: firefly
Data pools might have strict write alignment requirements. Use pool
alignment info when setting the max_chunk_size for the write.
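A minimal sketch of aligning a chunk size to a pool's write alignment,
assuming an alignment of 0 means "no requirement". The function name and the
exact rounding policy are assumptions, not taken from the patch.

```cpp
#include <cassert>
#include <cstdint>

uint64_t aligned_max_chunk_size(uint64_t configured, uint64_t alignment) {
    if (alignment == 0) return configured;          // pool has no requirement
    if (configured <= alignment) return alignment;  // at least one full unit
    return configured - (configured % alignment);   // round down to a multiple
}
```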
Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
osd: set pg flag INCOMPLETE_CLONES when turning off cache pool
Reviewed-by: Greg Farnum <greg@inktank.com>
First patch Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
When closing the journal, check must_write_header and update the
journal header if must_write_header is already set.
This reduces needless journal replay after restarting the OSD.
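A minimal sketch of the close path, assuming a simplified journal object.
JournalSketch and its members other than must_write_header are hypothetical.

```cpp
#include <cassert>

struct JournalSketch {
    bool must_write_header = false;
    bool is_open = true;
    int header_writes = 0;  // counts header updates, for illustration

    void write_header() {
        ++header_writes;
        must_write_header = false;
    }

    void close() {
        // Persist a pending header update so the next start does not
        // replay entries that were already committed.
        if (must_write_header) write_header();
        is_open = false;
    }
};
```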
Signed-off-by: Ma Jianpeng <jianpeng.ma@intel.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Adding the available help arguments from the man page
Fixes: #8112
Reviewed-by: Yehuda Sadeh <yehuda@redhat.com>
Signed-off-by: Abhishek Lekshmanan <abhishek.lekshmanan@gmail.com>
Whitespace removal to make all help options align in a similar fashion
Reviewed-by: Yehuda Sadeh <yehuda@redhat.com>
Signed-off-by: Abhishek Lekshmanan <abhishek.lekshmanan@gmail.com>
We cannot assume that just because cache_mode is NONE that we will have
all clones present; check for the absence of the INCOMPLETE_CLONES flag
here too.
Signed-off-by: Sage Weil <sage@redhat.com>
During recovery, we can clone subsets if we know that all clones will be
present. We skip this on caching pools because they may not be; do the
same when INCOMPLETE_CLONES is set.
Signed-off-by: Sage Weil <sage@redhat.com>
When scrubbing, do not complain about missing clones when we are in a
caching mode *or* when the INCOMPLETE_CLONES flag is set. Both are
indicators that we may be missing clones and that that is okay.
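A minimal sketch of the combined check, assuming flags are kept in a bitmask.
The flag's bit value and the function name are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

enum class CacheMode { None, Writeback, Forward, ReadOnly };

// Hypothetical bit value for the INCOMPLETE_CLONES pool flag.
constexpr uint64_t FLAG_INCOMPLETE_CLONES = uint64_t{1} << 0;

// True when missing clones are expected and should not be reported.
bool may_be_missing_clones(CacheMode mode, uint64_t flags) {
    return mode != CacheMode::None || (flags & FLAG_INCOMPLETE_CLONES) != 0;
}
```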
Fixes: #8882
Signed-off-by: Sage Weil <sage@redhat.com>
Set a flag on the pg_pool_t when we change cache_mode to NONE. This
is because object promotion may promote heads without all of the clones,
and when we switch the cache_mode back those objects may remain. Do
this on any cache_mode change (to or from NONE) to capture legacy
pools that were set up before this flag existed.
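A minimal sketch of setting the flag on a mode change, assuming the flag is
sticky and set on any transition to or from NONE. PoolSketch, the bit value
and set_cache_mode are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

enum class CacheMode { None, Writeback, Forward, ReadOnly };

// Hypothetical bit value for the INCOMPLETE_CLONES pool flag.
constexpr uint64_t FLAG_INCOMPLETE_CLONES = uint64_t{1} << 0;

struct PoolSketch {
    CacheMode cache_mode = CacheMode::None;
    uint64_t flags = 0;

    void set_cache_mode(CacheMode next) {
        if (next == cache_mode) return;  // unchanged mode: nothing to do
        // Heads may have been promoted without all of their clones, so
        // remember that clones may be incomplete from now on.
        if (next == CacheMode::None || cache_mode == CacheMode::None)
            flags |= FLAG_INCOMPLETE_CLONES;
        cache_mode = next;
    }
};
```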
Signed-off-by: Sage Weil <sage@redhat.com>
If we have a pending pool value but the cache_mode hasn't changed, this is
still a no-op (and we don't need to block).
Backport: firefly
Signed-off-by: Sage Weil <sage@redhat.com>