It appears that db->submit_transaction() blocks while a sync commit is
in progress instead of simply queueing the new txn for later. To work
around this, submit the queued transactions to the backend in the
kv_sync_thread prior to the synchronous submit_transaction_sync().
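A minimal sketch of the ordering described above, assuming a simplified
backend. FakeDB, kv_sync_pass, and the "commit" transaction are hypothetical
stand-ins for illustration, not the real KeyValueDB interface or the actual
kv_sync_thread loop:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the key/value backend; it only records call order.
struct FakeDB {
  std::vector<std::string> log;
  void submit_transaction(const std::string& t) { log.push_back("async:" + t); }
  void submit_transaction_sync(const std::string& t) { log.push_back("sync:" + t); }
};

// One pass of a kv sync loop: hand every queued transaction to the backend
// first, then issue the single synchronous commit. No submit_transaction()
// call is made while the sync commit is in flight.
void kv_sync_pass(FakeDB& db, std::vector<std::string>& queued) {
  for (const auto& t : queued)
    db.submit_transaction(t);
  queued.clear();
  db.submit_transaction_sync("commit");
}
```

Calling kv_sync_pass(db, queued) with two queued entries records two async
submits followed by one sync submit, and leaves the queue empty.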
Signed-off-by: Sage Weil <sage@redhat.com>
Even a no-op ftruncate can block in the kernel. Prior to this change I
could frequently see ftruncate wait for an aio completion on the same
file.
Signed-off-by: Sage Weil <sage@redhat.com>
An append is expensive in terms of latency (write, fdatasync, kv commit),
while a WAL write needs only the kv commit; the write and fdatasync happen
asynchronously. For small IOs, going through the WAL may improve performance.
Signed-off-by: Sage Weil <sage@redhat.com>
The read of all the overlays can be delayed until applying the wal. If
we are doing async wal apply, this can reduce write op latency by
eliminating unnecessary reads in the write code path.
Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
NewStore deadlocks when newstore_sync_transaction = true.
With sync_transaction set to true, the txc state machine runs all the way
from STATE_IO_DONE to STATE_FINISHING in the same thread while holding osr->qlock.
The deadlock occurs in _txc_finish and _osr_reap_done, which try to
lock osr->qlock again.
_txc_finish can be called with qlock held (in sync transaction mode) or without
it (in async transaction mode), so fix this by setting the qlock type
to PTHREAD_MUTEX_RECURSIVE, which lets the owning thread acquire it recursively.
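A minimal sketch of the locking pattern, assuming a simplified sequencer.
It uses std::recursive_mutex as the standard-library counterpart of a
PTHREAD_MUTEX_RECURSIVE mutex; Sequencer, txc_finish, and run_sync are
hypothetical names, not the actual NewStore code:

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for the sequencer that owns qlock.
struct Sequencer {
  std::recursive_mutex qlock;  // a plain std::mutex would deadlock in run_sync()
  int finished = 0;

  // May be reached with or without qlock already held by the calling thread.
  void txc_finish() {
    std::lock_guard<std::recursive_mutex> l(qlock);
    ++finished;
  }

  // Sync-mode path: drives the transaction to completion in the same thread
  // while still holding qlock, so txc_finish() locks it a second time.
  void run_sync() {
    std::lock_guard<std::recursive_mutex> l(qlock);
    txc_finish();
  }
};
```

Both run_sync() (lock already held) and a direct txc_finish() call (lock not
held) complete, which is the property the recursive mutex type provides.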
Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
The data of the later contiguous overlays should be appended with
claim_append to 'op->data' instead of 'bl'.
Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
There is a race condition here: if the flush_commit() call
happens after _txc_finish_kv and before the next state, the context
is pushed to on_commits but nothing will ever handle it, since
we have already passed _txc_finish_kv. This bug is easy to reproduce
by putting a sleep(5) after _txc_finish_kv and then running
ceph-osd -i 0 --mkfs.
Fix this bug by returning true directly when state >= STATE_KV_DONE (instead
of > as in the previous code). The data is already persisted in STATE_KV_DONE,
so it is safe to do this.
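A minimal sketch of the comparison change, assuming a reduced state list.
The Txc struct, the int contexts, and the exact enum ordering are
illustrative assumptions; the real state machine has more states:

```cpp
#include <cassert>
#include <vector>

// Reduced, assumed ordering of the states named in this message.
enum State { STATE_IO_DONE, STATE_KV_DONE, STATE_FINISHING };

// Hypothetical stand-in for a transaction context.
struct Txc {
  State state = STATE_IO_DONE;
  std::vector<int> on_commits;  // stand-in for queued completion contexts

  // Returns true if the txc is already committed. Otherwise queues ctx to be
  // completed later and returns false.
  bool flush_commit(int ctx) {
    if (state >= STATE_KV_DONE)  // with '>' a call in STATE_KV_DONE would queue
      return true;               // a context that nothing completes
    on_commits.push_back(ctx);
    return false;
  }
};
```

With state == STATE_KV_DONE, flush_commit() now returns true and leaves
on_commits untouched instead of queueing a context that is never run.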
Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
When writing all the overlays, there is no need to duplicate the data in the
WAL. Instead, we can reference the overlays from the WAL and remove these
overlays after committing them to the fs. When replaying, we can read
the data from the referenced overlays. This saves a
write and a deletion in the db for each overlay's data.
Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
We need to rule out hobject_t::max before calling get_object_key
(which calls get_filestore_key_u32 and would hit an assert failure).
Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>