44194 Commits

Author SHA1 Message Date
Sage Weil
2a7393a446 os/newstore: more conservative default for aio queue depth
There appears to be a kernel aio bug when the queue depth is small.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:41 -04:00
Xiaoxi Chen
37da4292b3 os/newstore: close fd after writing with O_DIRECT
fix bug in 2b4c60e0a5

Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
2015-09-01 13:39:41 -04:00
Zhiqiang Wang
65055a0207 os/NewStore: need to increase the wal op length when combining overlays
Need to add the length of the combining overlays to the length of the
wal op.

Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
2015-09-01 13:39:41 -04:00
Xiaoxi Chen
df239f0f62 os/Newstore:Fix collection_list_range
We need to rule out hobject_t::max before calling get_object_key
(which calls get_filestore_key_u32 and hits an assert failure)

Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
2015-09-01 13:39:41 -04:00
Sage Weil
4c9e37de8a os/newstore: fix race in _txc_aio_submit
We cannot rely on the iterator pointers being valid after we submit the
aio because we are racing with the completion.  Make our loop decision
before submitting and avoid dereferencing txc after that point.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:40 -04:00
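The ordering described in 4c9e37de8a can be sketched as follows. This is an illustrative sketch, not the NewStore code: `Aio`, `Txc`, and `submit_all` are hypothetical names, and the completion is modeled as a callback that may free the transaction as soon as the last aio has been submitted.

```cpp
#include <cassert>
#include <functional>
#include <list>

// Hypothetical stand-ins for a queued aio and its owning transaction.
struct Aio { int id; };
struct Txc { std::list<Aio> submitting; };

// Submit every queued aio and return how many were submitted.
// Assumption: the completion can only free *txc once the last aio has
// been submitted, so the "is this the last one?" decision is made
// before each submit and nothing owned by txc is touched afterwards.
int submit_all(Txc* txc, const std::function<void(Aio&)>& submit) {
  assert(txc != nullptr);
  auto p = txc->submitting.begin();
  const auto end = txc->submitting.end();
  int submitted = 0;
  while (p != end) {
    Aio& aio = *p;
    ++p;
    const bool done = (p == end);  // decide before submitting
    submit(aio);                   // after the last submit, txc may be gone
    ++submitted;
    if (done) break;               // do not re-read iterators into txc
  }
  return submitted;
}
```

The point of the sketch is only the order of operations: the loop condition is captured in a local `done` flag, so the final iteration never reads state that the completion path may already have released.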
Xiaoxi Chen
117330045f os/newstore : Do not need to call fdatasync if using direct.
skip ::fdatasync if in direct mode.

Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
2015-09-01 13:39:40 -04:00
Zhiqiang Wang
c552cd20ab osd/NewStore: fix for skipping the overlay in _do_overlay_trim
When the offset of the write starts at the end of the overlay, that is,
p->first + p->second.length == offset, the overlay could be skipped as
well.

Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
2015-09-01 13:39:40 -04:00
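A minimal sketch of the boundary condition in c552cd20ab, under simplifying assumptions: overlays are modeled as a hypothetical `offset -> length` map (so `p->second` here plays the role of `p->second.length`), and `overlapping` is an illustrative helper, not the real `_do_overlay_trim`.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical simplification: overlay offset -> overlay length.
using Overlays = std::map<uint64_t, uint64_t>;

// Return the offsets of overlays that intersect the write
// [offset, offset + length).
std::vector<uint64_t> overlapping(const Overlays& ov,
                                  uint64_t offset, uint64_t length) {
  assert(length > 0);
  std::vector<uint64_t> hit;
  for (auto p = ov.begin(); p != ov.end(); ++p) {
    // An overlay ending exactly at the write start shares no bytes
    // with the write, so "<=" (not "<") is the skip condition.
    if (p->first + p->second <= offset) continue;
    // Overlays are sorted by offset; nothing later can intersect.
    if (p->first >= offset + length) break;
    hit.push_back(p->first);
  }
  return hit;
}
```

With overlays at offsets 0, 4, and 8 (each 4 bytes long), a 4-byte write at offset 4 touches only the overlay at 4; the overlay at 0 ends exactly where the write begins and is skipped.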
Zhiqiang Wang
793dcc396c os/NewStore: combine contiguous overlays when writing all the overlays
Combine contiguous overlay writes to reduce the number of WAL writes
and fs writes.

Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
2015-09-01 13:39:40 -04:00
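The idea in 793dcc396c can be illustrated with a short sketch. It assumes overlays are kept in an offset-ordered map and do not overlap each other; `Extent` and `combine_contiguous` are hypothetical names, not identifiers from the NewStore source.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for one write: payload plus object offset.
struct Extent {
  uint64_t offset;
  std::string data;
};

// Merge overlays whose end touches the next overlay's start, so that
// each contiguous run becomes a single write.
std::vector<Extent> combine_contiguous(
    const std::map<uint64_t, std::string>& overlays) {
  std::vector<Extent> out;
  for (const auto& kv : overlays) {
    if (!out.empty()) {
      const uint64_t prev_end = out.back().offset + out.back().data.size();
      assert(prev_end <= kv.first);  // assumption: overlays never overlap
      if (prev_end == kv.first) {
        out.back().data += kv.second;  // contiguous: extend the previous write
        continue;
      }
    }
    out.push_back(Extent{kv.first, kv.second});  // gap: start a new write
  }
  return out;
}
```

Because the merged payload grows with each appended overlay, the length of the combined write is the sum of the overlay lengths, which is the quantity 65055a0207 says must be reflected in the wal op.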
Xiaoxi Chen
29ba720885 os/NewStore: batch cleanup
batch cleanup wal.

Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
2015-09-01 13:39:40 -04:00
Sage Weil
4eca15a950 os/newstore: fix _txc_aio_submit
The aios may complete before _txc_aio_submit completes.  In fact, the aio
may complete, commit to the kv store, and then queue more wal aio's before
we finish the loop.  Move aios to a separate list to ensure we only submit
them once and do not fight another CPU adjusting the list.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:40 -04:00
Sage Weil
41886c5420 os/newstore: throttle over entire write lifecycle
Take a global throttle when we submit ops and release when they complete.
The first throttles cover the period from submit to commit, while the wal
ones also cover the async post-commit wal work.  The configs are additive
since the wal ones cover both periods; this should make them reasonably
idiot-proof.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:40 -04:00
Zhiqiang Wang
b1136fbd33 os/NewStore: data_map shouldn't be empty when writing all overlays
This should be an assert instead of creating new data_map.

Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
2015-09-01 13:39:40 -04:00
Zhiqiang Wang
a165fe81c5 os/NewStore: clear the shared_overlays after writing all the overlays
Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
2015-09-01 13:39:40 -04:00
Zhiqiang Wang
dffa43051a os/NewStore: don't clear overlay in the create/append case of write
Shouldn't clear the overlay in the create/append case of write.
Otherwise, this removes the overlay data and leads to data loss.

Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
2015-09-01 13:39:40 -04:00
Sage Weil
f9f9e1b105 os/newstore: debug io_submit EAGAIN
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:40 -04:00
Sage Weil
dd79b4d832 os/newstore: release wal throttle when wal completes, not when queued
If we take the aio path, the io is queued immediately and the resources
are released back to the pool.  Instead release them when wal completes.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:39 -04:00
Sage Weil
715fd3b7a2 os/newstore: todo
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:39 -04:00
Sage Weil
3b66712598 os/newstore: move toward state-machine
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:39 -04:00
Sage Weil
2317e446c5 os/newstore: use aio for wal writes, too
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:39 -04:00
Sage Weil
e580a82729 os/newstore: a few comments about wal
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:39 -04:00
Sage Weil
5d8e14653d os/newstore: combined O_DSYNC with O_DIRECT
This avoids the need for an explicit fdatasync when doing O_DIRECT.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:39 -04:00
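A hedged sketch of the flag selection implied by 5d8e14653d (together with the page-alignment rule from c67c9a2bee). On Linux, `O_DSYNC` makes each write return only after the data is durable, which is why no separate fdatasync is needed. The helper names and the fixed 4096-byte page size are assumptions for illustration, not taken from the NewStore source.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // O_DIRECT is a Linux extension
#endif
#include <fcntl.h>

#include <cassert>
#include <cstdint>

// Hypothetical helper: true when both the offset and the length of a
// write are multiples of the page size (assumed to be 4096 here).
bool page_aligned(uint64_t offset, uint64_t length,
                  uint64_t page_size = 4096) {
  assert(page_size > 0);
  return offset % page_size == 0 && length % page_size == 0;
}

// Hypothetical helper: choose open(2) flags for a write. Aligned writes
// bypass the page cache (O_DIRECT) and are made durable by the write
// call itself (O_DSYNC), so no explicit fdatasync is required.
int write_open_flags(uint64_t offset, uint64_t length) {
  int flags = O_WRONLY;
  if (page_aligned(offset, length)) {
    flags |= O_DIRECT | O_DSYNC;
  }
  return flags;
}
```

Unaligned writes fall back to plain buffered I/O in this sketch, in which case the caller would still need to sync explicitly.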
Sage Weil
b7a53b5874 os/newstore: basic aio support
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:39 -04:00
Sage Weil
ba0d8d7fdd os/Newstore: add newstore_db_path option
The load on the key-value DB is heavy; allow the user to put the
DB on a separate (fast) device.

Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
2015-09-01 13:39:39 -04:00
Sage Weil
143d48570f os/newstore: throttle wal work
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:39 -04:00
Sage Weil
efe218b4aa os/newstore: show # o_direct buffers in debug output
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:38 -04:00
Sage Weil
7e1af1e616 os/newstore: use a threadpool for applying wal events
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:38 -04:00
Sage Weil
dfd389e66a os/newstore: rebuild buffers to be page-aligned for O_DIRECT
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:38 -04:00
Sage Weil
552d95213b ceph_test_objectstore: fix omap test cleanup
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:38 -04:00
Sage Weil
04f55d8d18 os/newstore: use fdatasync instead of fsync
On XFS at least, fdatasync is sufficient to make data readable.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:38 -04:00
Sage Weil
1321b880cc os/newstore: update todo
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:38 -04:00
Xiaoxi Chen
65877832f8 os/Newstore: Check onode.omap_head in valid() and next()
The db iter will be set to KeyValueDB::Iterator() if onode.omap_head is
not present. In that case, touching the db iter causes a segmentation
fault.

Avoid touching the db iter when onode.omap_head is invalid (equal to 0).

Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
2015-09-01 13:39:38 -04:00
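The guard described in 65877832f8 might look roughly like this. It is a sketch under stated assumptions: a `std::map` stands in for the key-value store, `OmapIterator` is a hypothetical class, and the return type of `next()` is chosen for illustration rather than copied from the real interface.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for the key-value store.
using KV = std::map<std::string, std::string>;

class OmapIterator {
 public:
  // omap_head == 0 means the object has no omap; db may then be null
  // and the underlying iterator is never initialized or touched.
  OmapIterator(uint64_t omap_head, const KV* db)
      : head_(omap_head), db_(db) {
    assert(head_ == 0 || db_ != nullptr);
    if (head_ != 0) it_ = db_->begin();
  }

  // Check omap_head first so the db iterator is only read when it exists.
  bool valid() const { return head_ != 0 && it_ != db_->end(); }

  // Returns false without touching the db iterator when there is
  // nothing to advance.
  bool next() {
    if (!valid()) return false;
    ++it_;
    return true;
  }

  // Precondition: valid() is true.
  const std::string& key() const { return it_->first; }

 private:
  uint64_t head_;
  const KV* db_;
  KV::const_iterator it_;
};
```

An iterator built with `omap_head == 0` simply reports `valid() == false`, so callers can loop over it without special-casing objects that have no omap.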
Xiaoxi Chen
1a97fd6cb7 Use .str() to output a stringstream.
a nit.

Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
2015-09-01 13:39:38 -04:00
Xiaoxi Chen
9d0e925566 os/Newstore: Allow gap in _do_write append mode
We can allow some gap so we only need to ensure
onode.size <= offset.

Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
2015-09-01 13:39:38 -04:00
Xiaoxi Chen
5e9c64b4dd Implement get_omap_iterator
implemented get_omap_iterator

Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
2015-09-01 13:39:38 -04:00
Xiaoxi Chen
c86410239b os/KeyValueDB: Add raw_key() interface for IteratorImpl
raw_key() is useful to split out the prefix.

Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
2015-09-01 13:39:38 -04:00
Xiaoxi Chen
b595aac4e1 test/store_test Add get_omap_iterator test cases
omap iterator test cases include:
  iter against omap
  lower_bound
  upper_bound

Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
2015-09-01 13:39:38 -04:00
Sage Weil
ca9bc6327d os/newstore: drop sync()
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:37 -04:00
Sage Weil
d57547f103 os/newstore: drop sync()
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:37 -04:00
Sage Weil
205344d32d os/newstore: drop flush
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:37 -04:00
Sage Weil
f93856f71a os/newstore: drop sync_and_flush
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:37 -04:00
Sage Weil
28bc4ee76e os/newstore: use FS::zero()
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:37 -04:00
Sage Weil
c67c9a2bee os/newstore: use O_DIRECT if write is page-aligned
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:37 -04:00
Sage Weil
5539a75efb os/newstore: pass flags to _{open,create}_fid
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:37 -04:00
Sage Weil
48f639beec os/newstore: drop unused FragmentHandle
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:37 -04:00
Sage Weil
93fa4f1e30 os/newstore: do not call completions from kv thread
Reads may call wait_wal() holding user locks, and so we cannot block
progress on WAL completion/flushing by calling callbacks that may take
user locks.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:37 -04:00
Sage Weil
86a3f7dd51 os/newstore: let wal cleanup kv txn get batched
No need to trigger another sync kv commit here; just let the next KV
commit catch it.

We could possibly do a bit better here by not waking up the kv thread at
all...

Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:37 -04:00
Sage Weil
ec21f578a7 os/newstore: fix off-by-one on overlay_max_length
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:37 -04:00
Sage Weil
f9a7fd4e4c os/newstore: use lower_bound for finding overlay extents in map
Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:36 -04:00
Sage Weil
66aae98277 os/newstore: use overlay even if it is a new object or append
This avoids the fsync for small writes.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-09-01 13:39:36 -04:00
Xiaoxi Chen
0981428123 os/Newstore:Change assert in get_onode
db->get will return a negative value when the key is not found.

Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
2015-09-01 13:39:36 -04:00