Even a no-op ftruncate can block in the kernel. Prior to this change I
could frequently see ftruncate wait for an aio completion on the same
file.
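A minimal sketch of the idea, assuming a helper that knows the target length (maybe_truncate and its callers are illustrative, not the actual NewStore code); in practice the caller may already track the file size and can compare against that instead of calling fstat:

    #include <cerrno>
    #include <sys/stat.h>
    #include <unistd.h>

    // Skip the syscall entirely when the length is unchanged, so a no-op
    // truncate never enters the kernel and never waits on in-flight aio.
    int maybe_truncate(int fd, off_t new_size) {
      struct stat st;
      if (::fstat(fd, &st) < 0)
        return -errno;
      if (st.st_size == new_size)
        return 0;                        // no-op: nothing to do
      if (::ftruncate(fd, new_size) < 0)
        return -errno;
      return 0;
    }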
Signed-off-by: Sage Weil <sage@redhat.com>
An append is expensive in terms of latency (write, fdatasync, kv commit),
while a wal write is just the kv commit, with the write and fdatasync done
asynchronously. For small IOs, going through the wal may improve performance.
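A sketch of the resulting decision, with a hypothetical size threshold (wal_max_write_size) standing in for the real config option:

    #include <cstdint>

    // Small IO: pay only the kv commit up front; the file write and fdatasync
    // happen asynchronously as wal work, lowering per-op latency.
    // Large IO: append directly to avoid writing the data twice.
    bool use_wal_for_write(uint64_t length, uint64_t wal_max_write_size) {
      return length <= wal_max_write_size;
    }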
Signed-off-by: Sage Weil <sage@redhat.com>
The read of all the overlays can be delayed until applying the wal. If
we are doing async wal apply, this can reduce write op latency by
eliminating unnecessary reads in the write code path.
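A rough sketch of the shape of the change, with invented names (WalOp, note_overlays, read_overlay); the only point is that the kv reads move out of the write path and into wal apply:

    #include <functional>
    #include <string>
    #include <utility>
    #include <vector>

    struct WalOp {
      std::vector<std::string> overlay_keys;  // filled in the write path
      std::string data;                       // filled lazily at apply time
    };

    // write path: remember which overlays are needed, but do not read them
    void note_overlays(WalOp& op, std::vector<std::string> keys) {
      op.overlay_keys = std::move(keys);
    }

    // wal apply (possibly async): do the reads here instead
    void apply_wal(WalOp& op,
                   const std::function<std::string(const std::string&)>& read_overlay) {
      for (const auto& k : op.overlay_keys)
        op.data += read_overlay(k);
    }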
Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
There is a deadlock in NewStore when newstore_sync_transaction = true.
With sync_transaction set to true, the txc state machine goes all the way
from STATE_IO_DONE to STATE_FINISHING in the same thread while holding
osr->qlock. The deadlock occurs in _txc_finish and _osr_reap_done, which
try to lock osr->qlock again.
Since _txc_finish can be called either with (in sync transaction mode) or
without (in async transaction mode) the qlock held, fix this by making the
qlock PTHREAD_MUTEX_RECURSIVE so it can be acquired recursively.
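A minimal sketch of the mechanism (raw pthread calls shown for clarity, not the actual locking wrapper used by the code): a recursive mutex lets the thread that already holds qlock reacquire it in _txc_finish.

    #include <pthread.h>

    pthread_mutex_t qlock;

    void init_qlock() {
      pthread_mutexattr_t attr;
      pthread_mutexattr_init(&attr);
      pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
      pthread_mutex_init(&qlock, &attr);
      pthread_mutexattr_destroy(&attr);
    }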
Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
The data of the later contiguous overlays should be claim_append'ed to
'op->data', not to 'bl'.
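A self-contained illustration of the direction of the fix, with std::string standing in for bufferlist and append() for claim_append():

    #include <string>
    #include <vector>

    struct WalOp { std::string data; };

    // The later contiguous overlays must land on the wal op's own buffer.
    void append_later_overlays(WalOp* op, const std::vector<std::string>& later) {
      for (const auto& chunk : later) {
        op->data.append(chunk);   // correct destination
        // bl.append(chunk);      // the bug: data went to a local 'bl' and was lost
      }
    }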
Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
There is a race condition here: if the flush_commit() call happens
after _txc_finish_kv and before the next state, the context is pushed
onto on_commits but nobody will ever handle it, since we have already
passed _txc_finish_kv. The bug is easily reproduced by putting a
sleep(5) after _txc_finish_kv and then running ceph-osd -i 0 --mkfs.
Fix this by returning true directly when state >= STATE_KV_DONE (instead
of > as in the previous code). The data is already persisted in
STATE_KV_DONE, so this is safe.
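A trimmed-down sketch of the check (the enum lists only the states named in this log; the real state machine has more):

    enum State { STATE_IO_DONE, STATE_KV_DONE, STATE_FINISHING };

    // Once STATE_KV_DONE is reached the data is already persisted, so
    // flush_commit() may report the txc as committed right away instead of
    // queuing a context that nothing will ever fire.
    bool txc_committed(State s) {
      return s >= STATE_KV_DONE;   // was 's > STATE_KV_DONE', which missed KV_DONE itself
    }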
Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
When writing out all the overlays, there is no need to duplicate the data
in the WAL. Instead, we can reference the overlays from the WAL and remove
them after committing them to the fs. When replaying, we can get the data
from the referenced overlays. This way we save one write and one deletion
per overlay in the db.
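A sketch of the resulting wal entry, with invented field names; the only point is that it carries references rather than a second copy of the bytes:

    #include <cstdint>
    #include <vector>

    struct OverlayRef {
      uint64_t nid;   // object the overlay belongs to
      uint64_t key;   // overlay key under that object
    };

    struct WalWrite {
      std::vector<OverlayRef> overlay_refs;  // replay reads the data via these
      // no duplicated payload here; the overlays are deleted only after the
      // data has been committed to the fs
    };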
Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
We need to rule out hobject_t::max before calling get_object_key
(which would call get_filestore_key_u32 and hit an assert failure).
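A self-contained sketch of the guard, with a simplified stand-in for hobject_t and the key helpers:

    #include <string>

    struct Oid {
      bool is_max = false;            // stand-in for the hobject_t::max sentinel
      std::string name;
    };

    // Build the kv key used as an upper bound; the 'max' sentinel must never
    // reach the normal key encoder (which would assert), so special-case it.
    std::string bound_key(const Oid& o, const std::string& collection_end_key) {
      if (o.is_max)
        return collection_end_key;
      return "OBJ_" + o.name;         // stand-in for get_object_key()
    }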
Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
We cannot rely on the iterator pointers being valid after we submit the
aio because we are racing with the completion. Make our loop decision
before submitting and avoid dereferencing txc after that point.
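A sketch of the pattern (libaio shown for illustration): the loop decision is made before io_submit(), and nothing owned by the txc is touched afterwards.

    #include <libaio.h>
    #include <list>

    // Once io_submit() returns, the completion path may already have reaped the
    // iocb (and freed whatever structure owned it), so decide whether to keep
    // looping before submitting and dereference nothing afterwards.
    void submit_all(io_context_t ctx, std::list<iocb*>& pending) {
      while (!pending.empty()) {
        iocb* cb = pending.front();
        pending.pop_front();               // loop decision made here, up front
        iocb* cbs[1] = { cb };
        io_submit(ctx, 1, cbs);            // 'cb' must not be touched after this
      }
    }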
Signed-off-by: Sage Weil <sage@redhat.com>
When the write starts exactly at the end of the overlay, that is,
p->first + p->second.length == offset, the overlay can be skipped as
well.
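A self-contained sketch of the boundary case, with std::map standing in for the onode's overlay map:

    #include <cstdint>
    #include <map>

    struct Overlay { uint64_t length; };

    // p->first is the overlay start offset; an overlay that ends exactly at
    // 'offset' (==) does not intersect the write and is skipped as well.
    bool overlaps_write(std::map<uint64_t, Overlay>::const_iterator p,
                        uint64_t offset) {
      return p->first + p->second.length > offset;
    }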
Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
The aios may complete before _txc_aio_submit completes. In fact, an aio
may complete, commit to the kv store, and then queue more wal aios before
we finish the loop. Move the aios to a separate list to ensure we only
submit them once and do not race with another CPU adjusting the list.
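A sketch of the hand-off, with hypothetical container and method names; the pending aios are moved onto a list owned by the submitting thread before anything is issued:

    #include <list>

    struct Aio { void submit() { /* issue the io; placeholder */ } };

    void submit_once(std::list<Aio>& pending_aios /* owned by the txc */,
                     std::list<Aio>& submitting   /* local to the submitter */) {
      // Take everything in one step while we still safely own the txc; the
      // completion path can then queue new wal aios without us re-submitting
      // entries or walking a list another CPU is modifying.
      submitting.splice(submitting.end(), pending_aios);
      for (auto& a : submitting)
        a.submit();
    }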
Signed-off-by: Sage Weil <sage@redhat.com>
Take a global throttle when we submit ops and release when they complete.
The first throttles cover the period from submit to commit, while the wal
ones also cover the async post-commit wal work. The configs are additive
since the wal ones cover both periods; this should make them reasonably
idiot-proof.
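A minimal sketch of the two budgets, using C++20 counting semaphores in place of the real throttles (the limits shown are arbitrary):

    #include <semaphore>

    std::counting_semaphore<4096> ops_budget(64);    // covers submit -> kv commit
    std::counting_semaphore<4096> wal_budget(128);   // covers submit -> wal applied

    // Both budgets are taken at submit; the wal budget therefore also spans the
    // submit-to-commit window, which is why the two limits are additive.
    void on_submit()      { ops_budget.acquire(); wal_budget.acquire(); }
    void on_kv_commit()   { ops_budget.release(); }
    void on_wal_applied() { wal_budget.release(); }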
Signed-off-by: Sage Weil <sage@redhat.com>
We shouldn't clear the overlays in the create/append case of write.
Doing so removes the overlay data and leads to data loss.
Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
If we take the aio path, the io is queued immediately but the resources get
released back to the pool right away. Instead, release them when the wal
completes.
Signed-off-by: Sage Weil <sage@redhat.com>
The db iter will be set to an empty KeyValueDB::Iterator() if
onode.omap_head is not present. In that case, touching the db iter
results in a segmentation fault.
Avoid touching the db iter when onode.omap_head is invalid (equal to 0).
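A self-contained sketch of the guard, with a shared_ptr-style handle standing in for KeyValueDB::Iterator and a trimmed-down Onode:

    #include <cstdint>
    #include <memory>

    struct Onode { uint64_t omap_head = 0; };
    struct Iter  { /* iterator state */ };

    std::shared_ptr<Iter> get_omap_iterator(const Onode& o) {
      if (o.omap_head == 0)
        return nullptr;            // no omap for this object; never dereference
      return std::make_shared<Iter>();
    }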
Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>