RepoMirrors/ceph

mirror of https://github.com/ceph/ceph synced 2024-12-30 15:33:31 +00:00

Author	SHA1	Message	Date
Sage Weil	9c2eb28589	os/newstore: clean up kv commit debug output Signed-off-by: Sage Weil <sage@redhat.com>	2015-09-01 13:39:42 -04:00
Sage Weil	90e7f5e648	os/newstore: only ftruncate if i_size is incorrect Even a no-op ftruncate can block in the kernel. Prior to this change I could frequently see ftruncate wait for an aio completion on the same file. Signed-off-by: Sage Weil <sage@redhat.com>	2015-09-01 13:39:42 -04:00
Sage Weil	4c1552001a	Revert "os/newstore: avoid sync append for small ios" This reverts commit `69baab2f7e`. This is slower. :(	2015-09-01 13:39:42 -04:00
Sage Weil	e89b2474b7	os/newstore: avoid sync append for small ios An append is expensive in terms of latency (write, fdatasync, kv commit), while a wal write is just the kv commit and the write and fdatasync are async. For small IOs doing the wal may improve performance. Signed-off-by: Sage Weil <sage@redhat.com>	2015-09-01 13:39:42 -04:00
Sage Weil	668c277715	rocksdb: fallocate_with_keep_size = false This improves my 4k random writes on hdd by about 25%. Signed-off-by: Sage Weil <sage@redhat.com>	2015-09-01 13:39:42 -04:00
Sage Weil	08f3efb474	Revert "os/NewStore: data_map shouldn't be empty when writing all overlays" This reverts commit `0d9cce462f`. We may want to write an overlay if the object is new and the write is small to defer the cost of the fsync.	2015-09-01 13:39:42 -04:00
Zhiqiang Wang	02d0ef8fe0	os/NewStore: delay the read of all the overlays until wal applying The read of all the overlays can be delayed until applying the wal. If we are doing async wal apply, this can reduce write op latency by eliminating unnecessary reads in the write code path. Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>	2015-09-01 13:39:42 -04:00
Xiaoxi Chen	e3abf245ba	os/newstore: fix deadlock when newstore_sync_transaction=true There is a deadlock issue in Newstore when newstore_sync_transaction = true. With sync_transaction to true, the txc state machine will go all the way down from STATE_IO_DONE to STATE_FINISHING in the same thread, while holding the osr->qlock(). The deadlock is caused in _txc_finish and _osr_reap_done, when trying to lock osr->qlock again. Since the _txc_finish can be called with(in sync transaction mode) or without (in async transaction mode) holding the qlock, so fix this by setting the qlock to PTHREAD_MUTEX_RECURSIVE, thus we can recursive acquire the qlock. Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>	2015-09-01 13:39:42 -04:00
Zhiqiang Wang	cdc652ebbe	os/NewStore: fix the append of the later overlays when doing combination The data of the later contiguous overlays should be claim_append to 'op->data', instead of 'bl'. Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>	2015-09-01 13:39:41 -04:00
Xiaoxi Chen	36ed3dd20a	os/Newstore: flush_commit return true on STATE_KV_DONE There is a race condition here: if the flush_commit() call happens after _txc_finish_kv and before the next state, the context is pushed to on_commits but no one will handle it, since we have already passed _txc_finish_kv. This bug can be easily reproduced by putting a sleep(5) after _txc_finish_kv and triggering it with ceph-osd -i 0 --mkfs. Fix this bug by returning true directly when state >= STATE_KV_DONE (instead of > in the previous code). We have already persisted the data in STATE_KV_DONE, so it is safe for us to do this. Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>	2015-09-01 13:39:41 -04:00
Zhiqiang Wang	e02e743857	os/NewStore: avoid dup the data of the overlays in the WAL When writing all the overlays, there is no need to dup the data in WAL. Instead, we can reference the overlays in the WAL, and remove these overlays after committing them to the fs. When replaying, we can get this data from the referenced overlays. Doing it this way, we can save a write and a deletion for each of the overlay data in the db. Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>	2015-09-01 13:39:41 -04:00
Sage Weil	6399f1d060	os/newstore: fix multiple aio case Signed-off-by: Sage Weil <sage@redhat.com>	2015-09-01 13:39:41 -04:00
Sage Weil	2a7393a446	os/newstore: more conservative default for aio queue depth There appears to be a kernel aio bug when the queue depth is small. Signed-off-by: Sage Weil <sage@redhat.com>	2015-09-01 13:39:41 -04:00
Xiaoxi Chen	37da4292b3	os/newstore:close fd after writing with O_DIRECT fix bug in 2b4c60e0a521ad10b94bbc82865b49f2d28c2ac9 Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>	2015-09-01 13:39:41 -04:00
Zhiqiang Wang	65055a0207	os/NewStore: need to increase the wal op length when combining overlays Need to add the length of the combining overlays to the length of the wal op. Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>	2015-09-01 13:39:41 -04:00
Xiaoxi Chen	df239f0f62	os/Newstore:Fix collection_list_range We need to rule out hobject_t::max before calling get_object_key (in which will call get_filestore_key_u32 and get an assert failure) Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>	2015-09-01 13:39:41 -04:00
Sage Weil	4c9e37de8a	os/newstore: fix race in _txc_aio_submit We cannot rely on the iterator pointers being valid after we submit the aio because we are racing with the completion. Make our loop decision before submitting and avoid dereferencing txc after that point. Signed-off-by: Sage Weil <sage@redhat.com>	2015-09-01 13:39:40 -04:00
Xiaoxi Chen	117330045f	os/newstore : Do not need to call fdatasync if using direct. skip ::fdatasync if in direct mode. Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>	2015-09-01 13:39:40 -04:00
Zhiqiang Wang	c552cd20ab	osd/NewStore: fix for skipping the overlay in _do_overlay_trim When the offset of the write starts at the end of the overlay, that is, p->first + p->second.length == offset, the overlay could be skipped as well. Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>	2015-09-01 13:39:40 -04:00
Zhiqiang Wang	793dcc396c	os/NewStore: combine contiguous overlays when writing all the overlays Combine contiguous overlay writes to reduce the numbers of WAL writes and fs writes. Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>	2015-09-01 13:39:40 -04:00
Xiaoxi Chen	29ba720885	os/Nestore: batch cleanup batch cleanup wal. Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>	2015-09-01 13:39:40 -04:00
Sage Weil	4eca15a950	os/newstore: fix _txc_aio_submit The aios may complete before _txc_aio_submit completes. In fact, the aio may complete, commit to the kv store, and then queue more wal aio's before we finish the loop. Move aios to a separate list to ensure we only submit them once and do not right another CPU adjusting the list. Signed-off-by: Sage Weil <sage@redhat.com>	2015-09-01 13:39:40 -04:00
Sage Weil	41886c5420	os/newstore: throttle over entire write lifecycle Take a global throttle when we submit ops and release when they complete. The first throttles cover the period from submit to commit, while the wal ones also cover the async post-commit wal work. The configs are additive since the wal ones cover both periods; this should make them reasonably idiot-proof. Signed-off-by: Sage Weil <sage@redhat.com>	2015-09-01 13:39:40 -04:00
Zhiqiang Wang	b1136fbd33	os/NewStore: data_map shouldn't be empty when writing all overlays This should be an assert instead of creating new data_map. Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>	2015-09-01 13:39:40 -04:00
Zhiqiang Wang	a165fe81c5	os/NewStore: clear the shared_overlays after writing all the overlays Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>	2015-09-01 13:39:40 -04:00
Zhiqiang Wang	dffa43051a	os/NewStore: don't clear overlay in the create/append case of write Shouldn't clear the overlay in the create/append case of write. Otherwise, this removes the overlay data and leads to data loss. Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>	2015-09-01 13:39:40 -04:00
Sage Weil	f9f9e1b105	os/newstore: debug io_submit EAGAIN Signed-off-by: Sage Weil <sage@redhat.com>	2015-09-01 13:39:40 -04:00
Sage Weil	dd79b4d832	os/newstore: release wal throttle when wal completes, not when queued If we take the aio path, the io is queued immediately and the resources are released back to the pool. Instead release them when wal completes. Signed-off-by: Sage Weil <sage@redhat.com>	2015-09-01 13:39:39 -04:00
Sage Weil	715fd3b7a2	os/newstore: todo Signed-off-by: Sage Weil <sage@redhat.com>	2015-09-01 13:39:39 -04:00
Sage Weil	3b66712598	os/newstore: move toward state-machine Signed-off-by: Sage Weil <sage@redhat.com>	2015-09-01 13:39:39 -04:00
Sage Weil	2317e446c5	os/newstore: use aio for wal writes, too Signed-off-by: Sage Weil <sage@redhat.com>	2015-09-01 13:39:39 -04:00
Sage Weil	e580a82729	os/newstore: a few comments about wal Signed-off-by: Sage Weil <sage@redhat.com>	2015-09-01 13:39:39 -04:00
Sage Weil	5d8e14653d	os/newstore: combined O_DSYNC with O_DIRECT This avoids the need for an explicit fdatasync when doing O_DIRECT. Signed-off-by: Sage Weil <sage@redhat.com>	2015-09-01 13:39:39 -04:00
Sage Weil	b7a53b5874	os/newstore: basic aio support Signed-off-by: Sage Weil <sage@redhat.com>	2015-09-01 13:39:39 -04:00
Sage Weil	ba0d8d7fdd	os/Newstore: add newstore_db_path option The load of Keyvalue DB is heavy, allow user to put DB to a separate (fast) device. Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>	2015-09-01 13:39:39 -04:00
Sage Weil	143d48570f	os/newstore: throttle wal work Signed-off-by: Sage Weil <sage@redhat.com>	2015-09-01 13:39:39 -04:00
Sage Weil	efe218b4aa	os/newstore: show # o_direct buffers in debug output Signed-off-by: Sage Weil <sage@redhat.com>	2015-09-01 13:39:38 -04:00
Sage Weil	7e1af1e616	os/newstore: use a threadpool for applying wal events Signed-off-by: Sage Weil <sage@redhat.com>	2015-09-01 13:39:38 -04:00
Sage Weil	dfd389e66a	os/newstore: rebuild buffers to be page-aligned for O_DIRECT Signed-off-by: Sage Weil <sage@redhat.com>	2015-09-01 13:39:38 -04:00
Sage Weil	552d95213b	ceph_test_objectstore: fix omap test cleanup Signed-off-by: Sage Weil <sage@redhat.com>	2015-09-01 13:39:38 -04:00
Sage Weil	04f55d8d18	os/newstore: use fdatasync instead of fsync On XFS at least, fdatasync is sufficient to make data readable. Signed-off-by: Sage Weil <sage@redhat.com>	2015-09-01 13:39:38 -04:00
Sage Weil	1321b880cc	os/newstore: update todo Signed-off-by: Sage Weil <sage@redhat.com>	2015-09-01 13:39:38 -04:00
Xiaoxi Chen	65877832f8	os/Newstore: Check onode.omap_head in valid() and next() The db iter will be set to KeyValueDB::Iterator() if onode.omap_head not present. In that case if we touch the db iter we will get a segmentation fault. Prevent to touch the db iter when onode.omap_head is invalid(equals to 0). Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>	2015-09-01 13:39:38 -04:00
Xiaoxi Chen	1a97fd6cb7	Use .str() to output a stringstream. a nit. Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>	2015-09-01 13:39:38 -04:00
Xiaoxi Chen	9d0e925566	os/Newstore: Allow gap in _do_write append mode We can allow some gap so we only need to ensure onode.size <= offset. Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>	2015-09-01 13:39:38 -04:00
Xiaoxi Chen	5e9c64b4dd	Implement get_omap_iterator implemented get_omap_iterator Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>	2015-09-01 13:39:38 -04:00
Xiaoxi Chen	c86410239b	os/KeyValueDB: Add raw_key() interface for IteratorImpl raw_key() is useful to split out the prefix. Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>	2015-09-01 13:39:38 -04:00
Xiaoxi Chen	b595aac4e1	test/store_test Add get_omap_iterator test cases omap iterator test cases include: iter against omap lower_bound upper_bound Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>	2015-09-01 13:39:38 -04:00
Sage Weil	ca9bc6327d	os/newstore: drop sync() Signed-off-by: Sage Weil <sage@redhat.com>	2015-09-01 13:39:37 -04:00
Sage Weil	d57547f103	os/newstore: drop sync() Signed-off-by: Sage Weil <sage@redhat.com>	2015-09-01 13:39:37 -04:00

44206 Commits