If a blob is shared, we can't discard deallocated regions: there may
be deferred buffers in flight and we might get a read via the clone.
Signed-off-by: Sage Weil <sage@redhat.com>
In a simple HDD workload with queue depth of 1, we halve our throughput
because the kv thread does a full commit twice per IO: once for the
initial commit, and then again to clean up the deferred write record. The
second wakeup is unnecessary; we can clean it up on the next commit.
We do need to do this wakeup in a few cases, though, when draining the
OpSequencers: (1) on replay during startup, and (2) on shutdown in
_osr_drain_all().
Send everything through _osr_drain_all() for simplicity.
This doubles HDD qd=1 IOPS from ~50 to ~100 on my 7200 rpm test device
(rados bench 30 write -b 4096 -t 1).
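The batching above can be sketched with a toy model; this is a minimal illustration, not BlueStore's actual code, and `KvQueue`, `queue_deferred_cleanup`, and `drain` are hypothetical names:

```cpp
#include <cassert>
#include <vector>

// Toy sketch: rather than waking the kv thread for a second commit just
// to remove a finished deferred-write record, stash the cleanup and fold
// it into the next commit. drain() covers the cases that still need an
// explicit flush (replay at startup, shutdown).
struct KvQueue {
  std::vector<int> pending_ops;        // newly queued transactions
  std::vector<int> deferred_cleanups;  // finished deferred records
  int commits = 0;

  void queue_op(int id) { pending_ops.push_back(id); }
  void queue_deferred_cleanup(int id) { deferred_cleanups.push_back(id); }

  // One kv commit flushes both new ops and any piggybacked cleanups.
  void commit() {
    ++commits;
    pending_ops.clear();
    deferred_cleanups.clear();
  }

  // Only needed when draining OpSequencers (replay, shutdown).
  void drain() {
    if (!pending_ops.empty() || !deferred_cleanups.empty())
      commit();
  }
};
```

With this shape, two deferred IOs cost two commits (each carrying the previous IO's cleanup) instead of four, matching the qd=1 doubling described above.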
Signed-off-by: Sage Weil <sage@redhat.com>
First, eliminate the work queue--it's useless. We are dispatching aio and
should not block. And if a single thread isn't sufficient to do it, it
probably means we should be parallelizing kv_sync_thread too (which is our
only caller that matters).
Repurpose the old osr-list -> txc-list-per-osr queue structure to manage
the queuing. For any given osr, dispatch one batch of aios at a time,
taking care to collapse any overwrites so that the latest write wins.
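The overwrite-collapse step can be sketched with a toy interval map keyed by offset; `WriteBatch` and `add` are illustrative names, not BlueStore's extent code:

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

// Toy sketch: queued writes keyed by offset, each covering
// [off, off + data.size()). Inserting a new write trims or erases any
// overlapping older writes, so the latest write to a byte range wins.
struct WriteBatch {
  std::map<uint64_t, std::string> writes;

  void add(uint64_t off, const std::string& data) {
    uint64_t end = off + data.size();
    auto p = writes.lower_bound(off);
    if (p != writes.begin()) {
      auto prev = std::prev(p);
      uint64_t pend = prev->first + prev->second.size();
      if (pend > off) {
        // Older write overlaps our start: keep only its head...
        std::string old = prev->second;
        prev->second = old.substr(0, off - prev->first);
        // ...and, if it extends past our end, keep its tail too.
        if (pend > end)
          writes[end] = old.substr(end - prev->first);
      }
    }
    // Erase or trim older writes starting inside the new range.
    while (p != writes.end() && p->first < end) {
      uint64_t pend = p->first + p->second.size();
      if (pend <= end) {
        p = writes.erase(p);
      } else {
        writes[end] = p->second.substr(end - p->first);
        writes.erase(p);
        break;
      }
    }
    writes[off] = data;
  }
};
```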
Signed-off-by: Sage Weil <sage@redhat.com>
Make osr_set refcounts so that it can tolerate a Sequencer destruction
racing with flush or a Sequencer that outlives the BlueStore instance
itself.
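One way to sketch such a refcounted set, using standard `shared_ptr`/`weak_ptr` semantics rather than Ceph's intrusive refcounting (`Registry` and `snapshot` are illustrative names):

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <set>
#include <vector>

// Toy sketch: the registry holds weak references; a flush takes strong
// refs while iterating, so a Sequencer destroyed concurrently (or one
// that outlives the store) cannot leave a dangling pointer behind.
struct OpSequencer {
  int id;
  explicit OpSequencer(int i) : id(i) {}
};

struct Registry {
  std::mutex lock;
  std::set<std::weak_ptr<OpSequencer>,
           std::owner_less<std::weak_ptr<OpSequencer>>> osr_set;

  void add(const std::shared_ptr<OpSequencer>& osr) {
    std::lock_guard<std::mutex> l(lock);
    osr_set.insert(osr);
  }

  // Snapshot strong refs so a drain/flush can proceed safely; entries
  // whose sequencer has already been destroyed are simply skipped.
  std::vector<std::shared_ptr<OpSequencer>> snapshot() {
    std::lock_guard<std::mutex> l(lock);
    std::vector<std::shared_ptr<OpSequencer>> out;
    for (auto& w : osr_set)
      if (auto s = w.lock())
        out.push_back(s);
    return out;
  }
};
```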
Signed-off-by: Sage Weil <sage@redhat.com>
The old implementation is racy and doesn't actually work. Instead, rely
on a list of all OpSequencers and drain them all.
Signed-off-by: Sage Weil <sage@redhat.com>
This ensures that we don't trim an onode from the cache while it has a
txc that is still in flight, which in turn ensures that if we try to
read the object, any writing buffers will be available.
Signed-off-by: Sage Weil <sage@redhat.com>
BlueStore collection methods only need preceding transactions to be
applied to the kv db; they do not need to be committed.
Note that this is *only* needed for collection listings; all other read
operations are immediately safe after queue_transactions().
Signed-off-by: Sage Weil <sage@redhat.com>
Currently this is the same as flush, but more precisely it is an internal
method that means all txc's must complete. Update _wal_apply() to use it
instead of flush(), which is part of the public Sequencer interface.
Signed-off-by: Sage Weil <sage@redhat.com>
This reverts commit 3e40595f3c.
The individual throttles have their own set of perfcounters; no need to
duplicate them here.
Signed-off-by: Sage Weil <sage@redhat.com>
The throttle is really about limiting deferred IO; we do not need to
actually remove the deferred record from the kv db before queueing more.
(In fact, the txc that queues more will do the cleanup.)
Signed-off-by: Sage Weil <sage@redhat.com>
We can unblock flush()ing threads as soon as we have applied to the kv db,
while the callbacks must wait until we have committed.
Move methods around a bit to better match the execution order.
Signed-off-by: Sage Weil <sage@redhat.com>
The only remaining flush() users only need to see previous txc's applied
to the kv db (e.g., _omap_clear needs to see the records to delete them).
Signed-off-by: Sage Weil <sage@redhat.com>
# Conflicts:
# src/os/bluestore/BlueStore.h
We do not release extents until after any deferred IO, so this flush() is
unnecessary.
Signed-off-by: Sage Weil <sage@redhat.com>
# Conflicts:
# src/os/bluestore/BlueStore.cc
We rely on the buffer cache to avoid reading any deferred write data. In
order for that to work, we have to ensure the entire block whose
overwrite is deferred is in the buffer cache. Otherwise, a write to 0~5
that results in a deferred write could break a subsequent read from 5~5
that reads the same block from disk before the deferred write lands.
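The rounding-out can be sketched as follows, assuming a fixed `block_size`; `containing_block` is an illustrative helper, not an actual BlueStore function:

```cpp
#include <cassert>
#include <cstdint>

// Toy sketch: round a small overwrite out to full block boundaries so
// the entire containing block lands in the buffer cache. E.g. a
// deferred write to 0~5 must cache block [0, 4096) so a later read of
// 5~5 is served from cache, not from stale data on disk.
struct BlockExtent {
  uint64_t offset, length;
};

BlockExtent containing_block(uint64_t off, uint64_t len,
                             uint64_t block_size) {
  uint64_t start = off - (off % block_size);                       // round down
  uint64_t end = off + len;
  uint64_t aligned_end =
      ((end + block_size - 1) / block_size) * block_size;          // round up
  return {start, aligned_end - start};
}
```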
Signed-off-by: Sage Weil <sage@redhat.com>
It does not matter if we update the freelist in the initial commit or when
cleaning up the deferred transaction; both will eventually update the
persistent kv freelist. We maintain one case to ensure that legacy
deferred events (from a kraken upgrade) release when they are replayed.
What matters while online is the Allocator, which has an independent
in-memory copy of the freelist to make decisions. And we can delay that
as long as we want. To avoid any concerns about deferred writes racing
against released blocks, just defer any release until the txc is fully
completed (including any deferred writes). This ensures that even if we
have a pattern like
txc 1: schedule deferred write on block A
txc 2: release block A
txc 1+2: commit
txc 2: done!
txc 1: do deferred write
txc 1: done!
then txc 2 won't do its release because it is stuck behind txc 1 in the
OpSequencer queue:
...
txc 1: reaped
txc 2: reaped (and extents released to alloc)
This builds in some delay in just-released space being usable again, but
it should be a very small amount of space relative to the size of the
store!
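The in-order reaping above can be modeled with a toy queue; `Txc`, `OpSequencer`, and `complete` here are illustrative, not the real classes:

```cpp
#include <cassert>
#include <deque>
#include <cstdint>
#include <vector>

// Toy sketch: txcs may finish in any order, but releases happen only
// when a txc is reaped from the front of its sequencer queue. A done
// txc stuck behind one still doing deferred IO must wait, so its
// extents cannot be reused while that deferred write is in flight.
struct Txc {
  int id;
  bool done = false;
  std::vector<uint64_t> to_release;  // extents freed on reap
};

struct OpSequencer {
  std::deque<Txc> q;
  std::vector<uint64_t> released;    // stand-in for the Allocator

  void queue(int id, std::vector<uint64_t> rel) {
    q.push_back({id, false, std::move(rel)});
  }

  void complete(int id) {
    for (auto& t : q)
      if (t.id == id)
        t.done = true;
    // Reap strictly from the front of the queue.
    while (!q.empty() && q.front().done) {
      for (auto e : q.front().to_release)
        released.push_back(e);
      q.pop_front();
    }
  }
};
```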
Signed-off-by: Sage Weil <sage@redhat.com>
"wal" can refer to both the rocksdb wal (effectively, or journal) and the
"wal" events we include in it (mainly promises to do future IO or release
extents to the freelist). This is super confusing!
Instead, call them 'deferred': deferred transactions, ops, writes, or
releases.
Signed-off-by: Sage Weil <sage@redhat.com>
aio_submit submits both aio_read and aio_write, and there are also
synchronous and random reads, so we need to handle read I/O completion
correctly.
Since a random read has its own ioc, num_reading for that ioc will be
at most 1, which is easy to handle in io_complete; we need only
differentiate whether the completion is an aio_read.
Also fix the exception logic in the command send path and make the
style consistent.
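A rough sketch of the completion-side distinction, with illustrative names (`IoType`, `IOContext`, `io_complete`) rather than the actual driver types:

```cpp
#include <cassert>
#include <atomic>

// Toy sketch: io_complete must decrement the right counter depending on
// whether the finished IO was a read or a write. Because a random read
// uses its own private ioc, its num_reading only ever goes 1 -> 0.
enum class IoType { AioRead, AioWrite, RandomRead };

struct IOContext {
  std::atomic<int> num_reading{0};
  std::atomic<int> num_pending{0};
};

void io_complete(IOContext& ioc, IoType t) {
  if (t == IoType::AioRead || t == IoType::RandomRead)
    --ioc.num_reading;
  else
    --ioc.num_pending;
}
```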
Signed-off-by: optimistyzy <optimistyzy@gmail.com>