44206 Commits

Author SHA1 Message Date
Sage Weil
9c2eb28589 os/newstore: clean up kv commit debug output
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:42 -04:00
Sage Weil
90e7f5e648 os/newstore: only ftruncate if i_size is incorrect
Even a no-op ftruncate can block in the kernel.  Prior to this change I
could frequently see ftruncate wait for an aio completion on the same
file.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:42 -04:00
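The idea in 90e7f5e648 (skip the ftruncate when the size is already right) can be sketched as follows. This is a minimal Python illustration, not the NewStore C++ code; the helper name `truncate_if_needed` is hypothetical.

```python
import os


def truncate_if_needed(fd: int, expected_size: int) -> bool:
    """Issue ftruncate only when the file's current size differs.

    Returns True if ftruncate was called, False if it was skipped.
    Hypothetical helper; it mirrors the "only ftruncate if i_size is
    incorrect" idea and is not taken from the Ceph source.
    """
    if os.fstat(fd).st_size == expected_size:
        # Size already matches: avoid a syscall that may block in the kernel.
        return False
    os.ftruncate(fd, expected_size)
    return True
```

The extra fstat is cheap compared with a truncate that may wait on in-flight aio for the same file, which is the cost the commit message describes.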
Sage Weil
4c1552001a Revert "os/newstore: avoid sync append for small ios"
This reverts commit 69baab2f7e.

This is slower.  :(
2015-09-01 13:39:42 -04:00
Sage Weil
e89b2474b7 os/newstore: avoid sync append for small ios
An append is expensive in terms of latency (write, fdatasync, kv commit),
while a wal write is just the kv commit and the write and fdatasync are
async.  For small IOs doing the wal may improve performance.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:42 -04:00
Sage Weil
668c277715 rocksdb: fallocate_with_keep_size = false
This improves my 4k random writes on hdd by about 25%.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:42 -04:00
Sage Weil
08f3efb474 Revert "os/NewStore: data_map shouldn't be empty when writing all overlays"
This reverts commit 0d9cce462f.

We may want to write an overlay if the object is new and the write is small, to defer the cost
of the fsync.
2015-09-01 13:39:42 -04:00
Zhiqiang Wang
02d0ef8fe0 os/NewStore: delay the read of all the overlays until wal applying
The read of all the overlays can be delayed until applying the wal. If
we are doing async wal apply, this can reduce write op latency by
eliminating unnecessary reads in the write code path.

Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
2015-09-01 13:39:42 -04:00
Xiaoxi Chen
e3abf245ba os/newstore: fix deadlock when newstore_sync_transaction=true
There is a deadlock in NewStore when newstore_sync_transaction = true.
With sync_transaction set to true, the txc state machine goes all the way
from STATE_IO_DONE to STATE_FINISHING in the same thread, while holding the osr->qlock().
The deadlock occurs in _txc_finish and _osr_reap_done, which try to
lock osr->qlock again.

Since _txc_finish can be called with (in sync transaction mode) or without
(in async transaction mode) the qlock held, fix this by setting the qlock
to PTHREAD_MUTEX_RECURSIVE so that the qlock can be acquired recursively.

Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
2015-09-01 13:39:42 -04:00
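The locking pattern behind e3abf245ba can be illustrated with a reentrant lock. This is a sketch in Python, where `threading.RLock` plays the role of a `PTHREAD_MUTEX_RECURSIVE` mutex; the class and method names are hypothetical stand-ins, not the NewStore code.

```python
import threading


class OpSequencer:
    """Hypothetical stand-in for the sequencer that owns qlock."""

    def __init__(self):
        # RLock may be re-acquired by the thread that already holds it,
        # analogous to a PTHREAD_MUTEX_RECURSIVE mutex.
        self.qlock = threading.RLock()
        self.finished = []

    def txc_finish(self, txc):
        # Takes qlock regardless of whether the caller already holds it.
        with self.qlock:
            self.finished.append(txc)

    def process_sync(self, txc):
        # Sync mode: the state machine runs to completion in this thread
        # while qlock is held, so txc_finish re-enters the lock.
        with self.qlock:
            self.txc_finish(txc)

    def process_async(self, txc):
        # Async mode: txc_finish is reached without qlock held.
        self.txc_finish(txc)
```

With a plain `threading.Lock` in place of the `RLock`, `process_sync` would block forever on the nested acquire, which is the deadlock the commit describes.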
Zhiqiang Wang
cdc652ebbe os/NewStore: fix the append of the later overlays when doing combination
The data of the later contiguous overlays should be claim_append to
'op->data', instead of 'bl'.

Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
2015-09-01 13:39:41 -04:00
Xiaoxi Chen
36ed3dd20a os/Newstore: flush_commit return true on STATE_KV_DONE
There is a race condition here: if the flush_commit() call
happens after _txc_finish_kv and before the next state, the context
is pushed to on_commits but nothing will handle it, since
we have already passed _txc_finish_kv. This bug can be easily reproduced
by putting a sleep(5) after _txc_finish_kv and then running
ceph-osd -i 0 --mkfs.

Fix this bug by returning true directly when state >= STATE_KV_DONE (instead
of > as in the previous code). The data is already persisted in STATE_KV_DONE, so
it is safe to do this.

Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
2015-09-01 13:39:41 -04:00
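The boundary condition fixed in 36ed3dd20a can be sketched like this. The Python below is only an illustration: the state names IO_DONE, KV_DONE and FINISHING come from the commit messages, while KV_QUEUED, the numeric values, and the `Txc` class are assumptions made for the example.

```python
from enum import IntEnum


class State(IntEnum):
    # Only the relative ordering matters here. KV_QUEUED and the numeric
    # values are assumptions for illustration.
    IO_DONE = 1
    KV_QUEUED = 2
    KV_DONE = 3
    FINISHING = 4


class Txc:
    """Hypothetical transaction context."""

    def __init__(self, state):
        self.state = state
        self.on_commits = []


def flush_commit(txc, callback) -> bool:
    """Return True if the txc is already durable; otherwise queue callback."""
    # Using >= (not >) means a txc sitting exactly in KV_DONE is treated as
    # committed, so no callback is queued after the point where queued
    # callbacks are drained.
    if txc.state >= State.KV_DONE:
        return True
    txc.on_commits.append(callback)
    return False
```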
Zhiqiang Wang
e02e743857 os/NewStore: avoid dup the data of the overlays in the WAL
When writing all the overlays, there is no need to dup the data in WAL.
Instead, we can reference the overlays in the WAL, and remove these
overlays after committing them to the fs. When replaying, we can get
this data from the referenced overlays. This way, we save a
write and a deletion for each overlay's data in the db.

Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
2015-09-01 13:39:41 -04:00
Sage Weil
6399f1d060 os/newstore: fix multiple aio case
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:41 -04:00
Sage Weil
2a7393a446 os/newstore: more conservative default for aio queue depth
There appears to be a kernel aio bug when the queue depth is small.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:41 -04:00
Xiaoxi Chen
37da4292b3 os/newstore:close fd after writing with O_DIRECT
fix bug in 2b4c60e0a521ad10b94bbc82865b49f2d28c2ac9

Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
2015-09-01 13:39:41 -04:00
Zhiqiang Wang
65055a0207 os/NewStore: need to increase the wal op length when combining overlays
Need to add the length of the combining overlays to the length of the
wal op.

Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
2015-09-01 13:39:41 -04:00
Xiaoxi Chen
df239f0f62 os/Newstore:Fix collection_list_range
We need to rule out hobject_t::max before calling get_object_key
(which calls get_filestore_key_u32 and hits an assert failure)

Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
2015-09-01 13:39:41 -04:00
Sage Weil
4c9e37de8a os/newstore: fix race in _txc_aio_submit
We cannot rely on the iterator pointers being valid after we submit the
aio because we are racing with the completion.  Make our loop decision
before submitting and avoid dereferencing txc after that point.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:40 -04:00
Xiaoxi Chen
117330045f os/newstore : Do not need to call fdatasync if using direct.
skip ::fdatasync if in direct mode.

Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
2015-09-01 13:39:40 -04:00
Zhiqiang Wang
c552cd20ab osd/NewStore: fix for skipping the overlay in _do_overlay_trim
When the offset of the write starts at the end of the overlay, that is,
p->first + p->second.length == offset, the overlay could be skipped as
well.

Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
2015-09-01 13:39:40 -04:00
Zhiqiang Wang
793dcc396c os/NewStore: combine contiguous overlays when writing all the overlays
Combine contiguous overlay writes to reduce the numbers of WAL writes
and fs writes.

Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
2015-09-01 13:39:40 -04:00
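The combining step described in 793dcc396c can be sketched as a merge of adjacent extents. This Python version is a simplified illustration that models overlays as a hypothetical `offset -> bytes` mapping of non-overlapping ranges; it is not the NewStore data structure.

```python
def combine_contiguous(overlays):
    """Merge overlays whose byte ranges touch end-to-start.

    overlays: dict mapping offset -> bytes, assumed non-overlapping.
    Returns a list of (offset, data) writes in offset order.
    """
    merged = []
    for off in sorted(overlays):
        data = overlays[off]
        if merged and merged[-1][0] + len(merged[-1][1]) == off:
            # Previous extent ends exactly where this one starts: append.
            prev_off, prev_data = merged[-1]
            merged[-1] = (prev_off, prev_data + data)
        else:
            merged.append((off, data))
    return merged
```

For example, overlays at offsets 0 (2 bytes), 2 (2 bytes) and 8 (2 bytes) would produce two writes instead of three: one at offset 0 covering 4 bytes and one at offset 8.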
Xiaoxi Chen
29ba720885 os/Newstore: batch cleanup
batch cleanup wal.

Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
2015-09-01 13:39:40 -04:00
Sage Weil
4eca15a950 os/newstore: fix _txc_aio_submit
The aios may complete before _txc_aio_submit completes.  In fact, the aio
may complete, commit to the kv store, and then queue more wal aio's before
we finish the loop.  Move aios to a separate list to ensure we only submit
them once and do not fight another CPU adjusting the list.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:40 -04:00
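The "move aios to a separate list" approach from 4eca15a950 can be sketched as below. This is a single-threaded Python illustration with hypothetical names (`AioContext`, `txc_aio_submit`); the completion path is simulated by a callback that queues more work while the loop is still running.

```python
class AioContext:
    """Hypothetical transaction context holding aios awaiting submission."""

    def __init__(self, aios):
        self.pending_aios = list(aios)


def txc_aio_submit(txc, submit):
    """Submit every aio that was pending at entry, each exactly once."""
    # Detach the batch before submitting anything. A completion may run
    # right after submit() and queue new aios on txc.pending_aios; those
    # land on the fresh list and are not touched by this loop.
    batch, txc.pending_aios = txc.pending_aios, []
    for aio in batch:
        submit(aio)
    return len(batch)
```

Iterating over the detached `batch` means the loop never reads a list that the completion path may be modifying.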
Sage Weil
41886c5420 os/newstore: throttle over entire write lifecycle
Take a global throttle when we submit ops and release when they complete.
The first throttles cover the period from submit to commit, while the wal
ones also cover the async post-commit wal work.  The configs are additive
since the wal ones cover both periods; this should make them reasonably
idiot-proof.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:40 -04:00
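One way to read the two-level throttle in 41886c5420 is sketched below. This Python version is an interpretation, not the NewStore code: the class, its method names, and the use of semaphores are assumptions, and it counts ops only (no byte accounting).

```python
import threading


class WriteThrottles:
    """Hypothetical two-level throttle: submit->commit and submit->wal done."""

    def __init__(self, max_ops, wal_max_ops):
        # Covers the period from submit until the kv commit.
        self.ops = threading.BoundedSemaphore(max_ops)
        # Covers submit until any async wal work finishes. Its budget is the
        # sum of both settings because it spans both periods.
        self.wal_ops = threading.BoundedSemaphore(max_ops + wal_max_ops)

    def submit(self):
        # Take both throttles up front; blocks when either is exhausted.
        self.ops.acquire()
        self.wal_ops.acquire()

    def committed(self, has_wal):
        # The commit releases the first throttle; the wal throttle is
        # released here only if no wal work remains.
        self.ops.release()
        if not has_wal:
            self.wal_ops.release()

    def wal_done(self):
        # Release when the wal work completes, not when it is queued.
        self.wal_ops.release()
```

Because the wal budget is additive, setting the wal limit to zero still leaves room for every op admitted by the first throttle, so the two settings cannot starve each other.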
Zhiqiang Wang
b1136fbd33 os/NewStore: data_map shouldn't be empty when writing all overlays
This should be an assert instead of creating new data_map.

Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
2015-09-01 13:39:40 -04:00
Zhiqiang Wang
a165fe81c5 os/NewStore: clear the shared_overlays after writing all the overlays
Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
2015-09-01 13:39:40 -04:00
Zhiqiang Wang
dffa43051a os/NewStore: don't clear overlay in the create/append case of write
Shouldn't clear the overlay in the create/append case of write.
Otherwise, this removes the overlay data and leads to data loss.

Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
2015-09-01 13:39:40 -04:00
Sage Weil
f9f9e1b105 os/newstore: debug io_submit EAGAIN
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:40 -04:00
Sage Weil
dd79b4d832 os/newstore: release wal throttle when wal completes, not when queued
If we take the aio path, the io is queued immediately and the resources
are released back to the pool.  Instead release them when wal completes.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:39 -04:00
Sage Weil
715fd3b7a2 os/newstore: todo
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:39 -04:00
Sage Weil
3b66712598 os/newstore: move toward state-machine
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:39 -04:00
Sage Weil
2317e446c5 os/newstore: use aio for wal writes, too
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:39 -04:00
Sage Weil
e580a82729 os/newstore: a few comments about wal
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:39 -04:00
Sage Weil
5d8e14653d os/newstore: combined O_DSYNC with O_DIRECT
This avoids the need for an explicit fdatasync when doing O_DIRECT.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:39 -04:00
Sage Weil
b7a53b5874 os/newstore: basic aio support
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:39 -04:00
Sage Weil
ba0d8d7fdd os/Newstore: add newstore_db_path option
The load on the key/value DB is heavy; allow the user to put
the DB on a separate (fast) device.

Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
2015-09-01 13:39:39 -04:00
Sage Weil
143d48570f os/newstore: throttle wal work
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:39 -04:00
Sage Weil
efe218b4aa os/newstore: show # o_direct buffers in debug output
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:38 -04:00
Sage Weil
7e1af1e616 os/newstore: use a threadpool for applying wal events
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:38 -04:00
Sage Weil
dfd389e66a os/newstore: rebuild buffers to be page-aligned for O_DIRECT
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:38 -04:00
Sage Weil
552d95213b ceph_test_objectstore: fix omap test cleanup
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:38 -04:00
Sage Weil
04f55d8d18 os/newstore: use fdatasync instead of fsync
On XFS at least, fdatasync is sufficient to make data readable.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:38 -04:00
Sage Weil
1321b880cc os/newstore: update todo
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:38 -04:00
Xiaoxi Chen
65877832f8 os/Newstore: Check onode.omap_head in valid() and next()
The db iter will be set to KeyValueDB::Iterator() if onode.omap_head is
not present. In that case, touching the db iter causes a segmentation
fault.

Avoid touching the db iter when onode.omap_head is invalid (equal to 0).

Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
2015-09-01 13:39:38 -04:00
Xiaoxi Chen
1a97fd6cb7 Use .str() to output a stringstream.
a nit.

Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
2015-09-01 13:39:38 -04:00
Xiaoxi Chen
9d0e925566 os/Newstore: Allow gap in _do_write append mode
We can allow some gap so we only need to ensure
onode.size <= offset.

Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
2015-09-01 13:39:38 -04:00
Xiaoxi Chen
5e9c64b4dd Implement get_omap_iterator
implemented get_omap_iterator

Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
2015-09-01 13:39:38 -04:00
Xiaoxi Chen
c86410239b os/KeyValueDB: Add raw_key() interface for IteratorImpl
raw_key() is useful to split out the prefix.

Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
2015-09-01 13:39:38 -04:00
Xiaoxi Chen
b595aac4e1 test/store_test Add get_omap_iterator test cases
omap iterator test cases include:
  iter against omap
  lower_bound
  upper_bound

Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
2015-09-01 13:39:38 -04:00
Sage Weil
ca9bc6327d os/newstore: drop sync()
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:37 -04:00
Sage Weil
d57547f103 os/newstore: drop sync()
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:37 -04:00