Commit Graph

69795 Commits

Author SHA1 Message Date
Sage Weil
44d498332c os/bluestore: only discard deallocated regions of a blob if !shared
If a blob is shared, we can't discard deallocated regions: there may
be deferred buffers in flight and we might get a read via the clone.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:29 -05:00
Sage Weil
d3a425faf8 os/bluestore: avoid waking up kv thread on deferred write completion
In a simple HDD workload with queue depth of 1, we halve our throughput
because the kv thread does a full commit twice per IO: once for the
initial commit, and then again to clean up the deferred write record. The
second wakeup is unnecessary; we can clean it up on the next commit.

We do need to do this wakeup in a few cases, though, when draining the
OpSequencers: (1) on replay during startup, and (2) on shutdown in
_osr_drain_all().

Send everything through _osr_drain_all() for simplicity.

This doubles HDD qd=1 IOPS from ~50 to ~100 on my 7200 rpm test device
(rados bench 30 write -b 4096 -t 1).

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:28 -05:00
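The throughput figures quoted in d3a425faf8 are consistent with a simple serial model, sketched below. This is illustrative arithmetic, not code from the commit: the function name and the 10 ms commit latency are assumptions chosen to match the reported ~50 and ~100 IOPS at queue depth 1.

```cpp
#include <cassert>

// Hypothetical model: at queue depth 1, each write waits for
// `commits_per_io` serialized kv commits of `commit_ms` milliseconds each.
// Both parameters are assumptions for illustration only.
double qd1_iops(double commit_ms, int commits_per_io) {
  return 1000.0 / (commit_ms * commits_per_io);
}
```

With an assumed 10 ms commit, qd1_iops(10.0, 2) models the old behavior (initial commit plus cleanup commit) and qd1_iops(10.0, 1) models the behavior after the change.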
Sage Weil
78b9cea09f os/bluestore: move many initializations into header
This is less fragile, especially with 2 constructors.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:28 -05:00
Sage Weil
6db031be4d os/bluestore: restructure deferred write queue
First, eliminate the work queue--it's useless.  We are dispatching aio and
should not block.  And if a single thread isn't sufficient to do it, it
probably means we should be parallelizing kv_sync_thread too (which is our
only caller that matters).

Repurpose the old osr-list -> txc-list-per-osr queue structure to manage
the queuing.  For any given osr, dispatch one batch of aios at a time,
taking care to collapse any overwrites so that the latest write wins.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:28 -05:00
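The "latest write wins" collapsing described in 6db031be4d can be sketched roughly as follows. This is a simplified model, not the BlueStore implementation: the type and function names are hypothetical, and it assumes every deferred write covers exactly one block at a block-aligned offset, so overwrites never partially overlap.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for a deferred write and a transaction context.
struct DeferredWrite {
  uint64_t offset;   // block-aligned offset (assumption)
  std::string data;  // payload for that block
};
using DeferredTxc = std::vector<DeferredWrite>;

// Build one batch from the txcs queued on a single sequencer, oldest first.
// A later write to the same offset replaces the earlier one, so only the
// newest data for each block is dispatched.
std::map<uint64_t, std::string> collapse_batch(
    const std::vector<DeferredTxc>& txcs) {
  std::map<uint64_t, std::string> batch;
  for (const auto& txc : txcs) {
    for (const auto& w : txc) {
      batch[w.offset] = w.data;
    }
  }
  return batch;
}
```

Iterating oldest to newest and overwriting by key is what preserves ordering: the map ends up holding one entry per block, carrying the data from the most recent txc that touched it.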
Sage Weil
5fafd1fcc2 os/bluestore: fix OpSequencer/Sequencer lifecycle
Make osr_set refcounted so that it can tolerate a Sequencer destruction
racing with flush or a Sequencer that outlives the BlueStore instance
itself.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:28 -05:00
Sage Weil
3dc82d57e4 os/bluestore: move _osr_reap_done
Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:28 -05:00
Sage Weil
986776d30d os/bluestore: reimplement/rename _sync -> _flush_all
The old implementation is racy and doesn't actually work.  Instead, rely
on a list of all OpSequencers and drain them all.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:28 -05:00
Sage Weil
3cf2b0f9b7 os/bluestore: keep all OpSequencers registered
Maintain the set of all live OpSequencers.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:28 -05:00
Sage Weil
9b28d615e7 os/bluestore: keep onode refs for lifetime of obc
This ensures that we don't trim an onode from the cache while it has a
txc that is still in flight.  Which in turn ensures that if we try to read
the object, we will have any writing buffers available.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:28 -05:00
Sage Weil
4aa44d2b49 os/bluestore: make OnodeSpace onode_map private
Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:28 -05:00
Sage Weil
2d0d375809 os/bluestore: make Sequencer::flush() more efficient
BlueStore collection methods only need preceding transactions to be
applied to the kv db; they do not need to be committed.

Note that this is *only* needed for collection listings; all other read
operations are immediately safe after queue_transactions().

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:28 -05:00
Sage Weil
9ee0c842f2 os/bluestore: add OpSequencer::drain()
Currently this is the same as flush, but more precisely it is an internal
method that means all txc's must complete.  Update _wal_apply() to use it
instead of flush(), which is part of the public Sequencer interface.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:27 -05:00
Sage Weil
5cb5a902d2 os/bluestore: revert throttle perfcounters
This reverts 3e40595f3c

The individual throttles have their own set of perfcounters; no need to
duplicate them here.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:27 -05:00
Sage Weil
78df9b3e4d os/bluestore: release deferred throttle on io finish, before cleanup
The throttle is really about limiting deferred IO; we do not need to
actually remove the deferred record from the kv db before queueing more.
(In fact, the txc that queues more will do the cleanup.)

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:27 -05:00
Sage Weil
eff1e83145 os/bluestore: separate _txc_finish_kv into _txc_{applied,committed}_kv
We can unblock flush()ing threads as soon as we have applied to the kv db,
while the callbacks must wait until we have committed.

Move methods around a bit to better match the execution order.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:27 -05:00
Sage Weil
3238162bd9 os/bluestore: make flush() only wait for kv commit
The only remaining flush() users only need to see previous txc's applied
to the kv db (e.g., _omap_clear needs to see the records to delete them).

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:27 -05:00
Sage Weil
a56cd6ba38 os/bluestore: no need to Onode::flush() on truncate
We do not release extents until after any deferred IO, so this flush() is
unnecessary.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:27 -05:00
Sage Weil
83e33a32fd os/bluestore: no need to Onode::flush() in _do_read
We now ensure that deferred writes are in cache until the txc retires,
so there is no need to wait here.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:27 -05:00
Sage Weil
6f2f8b3e3b os/bluestore: pin writing cache buffers until txc is finished
Notably, this includes WAL writes, which means an in-flight WAL write will
always be in the cache.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:27 -05:00
Sage Weil
bcd2a32912 os/bluestore: write padded data into buffer cache
We rely on the buffer cache to avoid reading any deferred write data. In
order for that to work, we have to ensure the entire block whose
overwrite is deferred is in the buffer cache.  Otherwise, a write to 0~5
that results in a deferred write could break a subsequent read from 5~5
that reads the same block from disk before the deferred write lands.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:27 -05:00
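The padding requirement in bcd2a32912 can be illustrated with a small range calculation. This is a sketch under assumptions, not the BlueStore code: the helper names are hypothetical and the 4096-byte block size used in the usage note is assumed. Ranges use the offset~length notation from the commit message.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Expand offset~length to the smallest block-aligned range that covers it.
// Returns {padded_offset, padded_length}.
std::pair<uint64_t, uint64_t> pad_to_block(uint64_t offset, uint64_t length,
                                           uint64_t block_size) {
  uint64_t start = offset - offset % block_size;  // round down
  uint64_t end = offset + length;
  uint64_t rem = end % block_size;
  if (rem != 0) {
    end += block_size - rem;  // round up
  }
  return {start, end - start};
}

// True if the cached range fully contains the read offset~length.
bool range_contains(const std::pair<uint64_t, uint64_t>& cached,
                    uint64_t offset, uint64_t length) {
  return offset >= cached.first &&
         offset + length <= cached.first + cached.second;
}
```

With an assumed 4096-byte block, a write to 0~5 pads to 0~4096, so a later read of 5~5 is served from the cached range instead of going to disk before the deferred write lands.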
Sage Weil
bc5bfdd592 os/bluestore: update freelist on initial commit
It does not matter if we update the freelist in the initial commit or when
cleaning up the deferred transaction; both will eventually update the
persistent kv freelist.  We maintain one case to ensure that legacy
deferred events (from a kraken upgrade) release when they are replayed.

What matters while online is the Allocator, which has an independent
in-memory copy of the freelist to make decisions.  And we can delay that
as long as we want.  To avoid any concerns about deferred writes racing
against released blocks, just defer any release until the txc is fully
completed (including any deferred writes).  This ensures that even if we
have a pattern like

 txc 1: schedule deferred write on block A
 txc 2: release block A
 txc 1+2: commit
 txc 2: done!
 txc 1: do deferred write
 txc 1: done!

then txc 2 won't do its release because it is stuck behind txc 1 in the
OpSequencer queue:

 ...
 txc 1: reaped
 txc 2: reaped (and extents released to alloc)

This builds in some delay in just-released space being usable again, but
it should be a very small amount of space relative to the size of the
store!

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:27 -05:00
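The ordering argument in bc5bfdd592 can be modeled with a small in-order reaping queue. This is an illustrative sketch only: the struct and member names are hypothetical and do not mirror the real OpSequencer or Allocator interfaces.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical transaction context: blocks in `released` go back to the
// allocator only when the txc is reaped.
struct ReapTxc {
  int id;
  bool done;
  std::vector<uint64_t> released;
};

// Hypothetical model of one sequencer queue plus the allocator's free list.
struct OpSequencerModel {
  std::deque<ReapTxc> q;
  std::vector<uint64_t> allocator_free;

  void queue(int id, std::vector<uint64_t> released) {
    q.push_back(ReapTxc{id, false, std::move(released)});
  }

  // Mark a txc fully complete (including any deferred writes), then reap.
  void mark_done(int id) {
    for (auto& t : q) {
      if (t.id == id) t.done = true;
    }
    reap();
  }

  // Reap strictly from the front: a finished txc stays queued behind any
  // unfinished predecessor, so its release is delayed until then.
  void reap() {
    while (!q.empty() && q.front().done) {
      for (uint64_t b : q.front().released) allocator_free.push_back(b);
      q.pop_front();
    }
  }
};
```

In this model, if txc 1 has a deferred write on block A pending and txc 2 releases block A, marking txc 2 done first frees nothing; block A reaches the allocator only after txc 1 is done and both are reaped in order.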
Sage Weil
597383935e os/bluestore: wal -> deferred
"wal" can refer to both the rocksdb wal (effectively, our journal) and the
"wal" events we include in it (mainly promises to do future IO or release
extents to the freelist).  This is super confusing!

Instead, call them 'deferred': deferred transactions, ops, writes, or
releases.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:26 -05:00
Sage Weil
ffd4d2f1dd vstart.sh: larger wal device
Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:26 -05:00
Sage Weil
a6f9c198e4 Merge pull request #14030 from tchaikov/wip-denc-without-nullptr
os/bluestore: do not use nullptr to calc the size of bluestore_pextent_t

Reviewed-by: Sage Weil <sage@redhat.com>
2017-03-21 12:58:14 -05:00
Jason Dillaman
16f7da43c9 Merge pull request #12041 from yangdongsheng/rbd_mirror_clone
librbd: asynchronous clone state machine

Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-03-21 11:42:15 -04:00
Kefu Chai
ddd011be7f Merge pull request #14058 from tchaikov/wip-doc-linkcheck
doc: add optional argument for build-doc

Reviewed-by: Ken Dreyer <kdreyer@redhat.com>
Reviewed-by: liuchang0812 <liuchang0812@gmail.com>
2017-03-21 22:41:59 +08:00
Mykola Golub
75b1efd517 Merge pull request #14023 from dillaman/wip-rbd-coverity
librbd: fix valid coverity warnings

Reviewed-by: Pan Liu <liupan1111@gmail.com>
Reviewed-by: Mykola Golub <mgolub@mirantis.com>
2017-03-21 16:40:36 +02:00
Mykola Golub
089a024ae9 Merge pull request #14034 from liupan1111/wip-fix-comment-nbd
rbd-nbd: fix typo in comment

Reviewed-by: Mykola Golub <mgolub@mirantis.com>
2017-03-21 16:37:04 +02:00
Pan Liu
ef449cf54e rbd-nbd: fix typo in comment.
Signed-off-by: Pan Liu <liupan1111@gmail.com>
2017-03-21 19:22:26 +08:00
Kefu Chai
e423f0b597 doc: cephfs: fix the unexpected indent warning
Signed-off-by: Kefu Chai <kchai@redhat.com>
2017-03-21 13:47:09 +08:00
Kefu Chai
5b9ec53512 admin/build-doc: support optional argument for specifying sphinx builders
Signed-off-by: Kefu Chai <kchai@redhat.com>
2017-03-21 13:47:05 +08:00
Kefu Chai
b935248197 Merge pull request #13997 from tchaikov/wip-doc-fixings
doc: fixes to silence sphinx-build

Reviewed-by: Brad Hubbard <bhubbard@redhat.com>
2017-03-21 11:46:12 +08:00
Brad Hubbard
e2a53b28a5 Merge pull request #14057 from badone/wip-RadosImport-connect
tools/rados: Check return value of connect

Reviewed-by: David Zafman <dzafman@redhat.com>
2017-03-21 13:35:09 +10:00
Brad Hubbard
c119091ef0 tools/rados: Check return value of connect
Fail gracefully if Rados::connect returns an error.

Fixes: http://tracker.ceph.com/issues/19319
Signed-off-by: Brad Hubbard <bhubbard@redhat.com>
2017-03-21 12:22:20 +10:00
Haomai Wang
fe66443240 Merge pull request #13971 from optimistyzy/0315_1
os/bluestore/NVMEDevice: fix the I/O logic for read

Reviewed-by: Haomai Wang <haomai@xsky.com>
2017-03-21 05:18:20 +08:00
Yuri Weinstein
1732698962 Merge pull request #13923 from xiexingguo/wip-clean-pglog-t
OSD: drop parameter t from merge_log()

Reviewed-by: Gregory Farnum <gfarnum@redhat.com>
2017-03-20 13:05:54 -07:00
Yuri Weinstein
80a10ec16c Merge pull request #13938 from jimmyway/wip-chg-return-value-to-refs
osd: replace object_info_t::operator=() with decode()

Reviewed-by: Kefu Chai <kchai@redhat.com>
2017-03-20 13:04:44 -07:00
Yuri Weinstein
9b3c2daeb8 Merge pull request #13980 from majianpeng/filejournal-bufferlist-rebuild
os/filestore/FileJournal: bufferlist rebuild

Reviewed-by: Sage Weil <sage@redhat.com>
2017-03-20 13:03:58 -07:00
Sage Weil
decb9d00d7 Merge pull request #13535 from dongbula/add-rgw-finisher-to-perfcounter
rgw: add radosclient finisher to perf counter

Reviewed-by: Casey Bodley <cbodley@redhat.com>
2017-03-20 10:20:48 -05:00
Casey Bodley
6f5d509476 Merge pull request #13955 from wangzhengyong/notify_finish
rgw: handle error return value in build_linked_oids_index

Reviewed-by: Casey Bodley <cbodley@redhat.com>
2017-03-20 10:25:39 -04:00
Casey Bodley
0afb438cce Merge pull request #13820 from mikulely/cleanup-rgw-lc
rgw: cleanup lifecycle management

Reviewed-by: Casey Bodley <cbodley@redhat.com>
2017-03-20 10:25:14 -04:00
Casey Bodley
75be7eac7a Merge pull request #13481 from theanalyst/rgw/env-dout
rgw: don't log the env_map twice

Reviewed-by: Casey Bodley <cbodley@redhat.com>
2017-03-20 10:20:43 -04:00
Matt Benjamin
8af0e58642 Merge pull request #13895 from guihecheng/rgw_file-fix
rgw_file: fix reversed return value of getattr
2017-03-20 09:54:02 -04:00
Matt Benjamin
ef805fe862 Merge pull request #14045 from guihecheng/rgw_file-fix-retcode
rgw_file: fix non-negative return code for open operation
2017-03-20 09:44:18 -04:00
Jason Dillaman
4ad10888d2 Merge pull request #14049 from yangdongsheng/rbd_cleanup
cleanup: rbd: fix a typo in comment

Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-03-20 09:37:41 -04:00
Ziye Yang
f2a1e34f88 Bluestore,NVMEDEVICE: fix the I/O logic for READ
aio_submit submits both aio reads and aio writes, and there are also
synchronous reads and random reads, so we need to handle read I/O
completion correctly.

Because a random read has its own ioc, num_reading for that ioc will be
at most 1, which is easy to handle in io_complete. We only need to
differentiate whether it is an aio_read.

Also fix the exception logic in command send and make the style
consistent.

Signed-off-by: optimistyzy <optimistyzy@gmail.com>
2017-03-20 17:46:52 +08:00
Kefu Chai
ad4ffed838 Merge pull request #13936 from ZVampirEM77/cleanup-rgw-doc
doc: fix typos in radosgw-admin usage

Reviewed-by: Kefu Chai <kchai@redhat.com>
2017-03-20 15:49:08 +08:00
Kefu Chai
fa54a27132 Merge pull request #13559 from voidbag/wip-fix-_open_super_meta
os/bluestore: fix bug in _open_super_meta()

Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Kefu Chai <kchai@redhat.com>
2017-03-20 15:32:28 +08:00
Kefu Chai
a824249f87 Merge pull request #13718 from aclamk/wip-bs-indexed-bitshift
os/bluestore: cleanup, got rid of table reference of 1<<x

Reviewed-by: Sage Weil <sage@redhat.com>
2017-03-20 15:31:10 +08:00
Kefu Chai
684ae029e6 Merge pull request #13769 from wangzhengyong/wip-noid
os/bluestore: "noid" is not always necessary in clone op

Reviewed-by: Igor Fedotov <ifedotov@mirantis.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2017-03-20 15:30:29 +08:00