If a blob is shared, we can't discard deallocated regions: there may
be deferred buffers in flight and we might get a read via the clone.
Signed-off-by: Sage Weil <sage@redhat.com>
In a simple HDD workload with queue depth of 1, we halve our throughput
because the kv thread does a full commit twice per IO: once for the
initial commit, and then again to clean up the deferred write record. The
second wakeup is unnecessary; we can clean it up on the next commit.
We do need to do this wakeup in a few cases, though, when draining the
OpSequencers: (1) on replay during startup, and (2) on shutdown in
_osr_drain_all().
Send everything through _osr_drain_all() for simplicity.
This doubles HDD qd=1 IOPS from ~50 to ~100 on my 7200 rpm test device
(rados bench 30 write -b 4096 -t 1).
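The batching above can be sketched with a toy model; this is a minimal illustration, not BlueStore's actual code, and `KvQueue`, `queue_deferred_cleanup`, and `drain` are hypothetical names:

```cpp
#include <cassert>
#include <vector>

// Toy sketch: rather than waking the kv thread for a second commit just
// to remove a finished deferred-write record, stash the cleanup and fold
// it into the next commit. drain() covers the cases that still need an
// explicit flush (replay at startup, shutdown).
struct KvQueue {
  std::vector<int> pending_ops;        // newly queued transactions
  std::vector<int> deferred_cleanups;  // finished deferred records
  int commits = 0;

  void queue_op(int id) { pending_ops.push_back(id); }
  void queue_deferred_cleanup(int id) { deferred_cleanups.push_back(id); }

  // One kv commit flushes both new ops and any piggybacked cleanups.
  void commit() {
    ++commits;
    pending_ops.clear();
    deferred_cleanups.clear();
  }

  // Only needed when draining OpSequencers (replay, shutdown).
  void drain() {
    if (!pending_ops.empty() || !deferred_cleanups.empty())
      commit();
  }
};
```

With this shape, two deferred IOs cost two commits (each carrying the previous IO's cleanup) instead of four, matching the qd=1 doubling described above.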
Signed-off-by: Sage Weil <sage@redhat.com>
First, eliminate the work queue--it's useless. We are dispatching aio and
should not block. And if a single thread isn't sufficient to do it, it
probably means we should be parallelizing kv_sync_thread too (which is our
only caller that matters).
Repurpose the old osr-list -> txc-list-per-osr queue structure to manage
the queuing. For any given osr, dispatch one batch of aios at a time,
taking care to collapse any overwrites so that the latest write wins.
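The overwrite-collapse step can be sketched with a toy interval map keyed by offset; `WriteBatch` and `add` are illustrative names, not BlueStore's extent code:

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

// Toy sketch: queued writes keyed by offset, each covering
// [off, off + data.size()). Inserting a new write trims or erases any
// overlapping older writes, so the latest write to a byte range wins.
struct WriteBatch {
  std::map<uint64_t, std::string> writes;

  void add(uint64_t off, const std::string& data) {
    uint64_t end = off + data.size();
    auto p = writes.lower_bound(off);
    if (p != writes.begin()) {
      auto prev = std::prev(p);
      uint64_t pend = prev->first + prev->second.size();
      if (pend > off) {
        // Older write overlaps our start: keep only its head...
        std::string old = prev->second;
        prev->second = old.substr(0, off - prev->first);
        // ...and, if it extends past our end, keep its tail too.
        if (pend > end)
          writes[end] = old.substr(end - prev->first);
      }
    }
    // Erase or trim older writes starting inside the new range.
    while (p != writes.end() && p->first < end) {
      uint64_t pend = p->first + p->second.size();
      if (pend <= end) {
        p = writes.erase(p);
      } else {
        writes[end] = p->second.substr(end - p->first);
        writes.erase(p);
        break;
      }
    }
    writes[off] = data;
  }
};
```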
Signed-off-by: Sage Weil <sage@redhat.com>
Make osr_set refcounts so that it can tolerate a Sequencer destruction
racing with flush or a Sequencer that outlives the BlueStore instance
itself.
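One way to sketch such a refcounted set, using standard `shared_ptr`/`weak_ptr` semantics rather than Ceph's intrusive refcounting (`Registry` and `snapshot` are illustrative names):

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <set>
#include <vector>

// Toy sketch: the registry holds weak references; a flush takes strong
// refs while iterating, so a Sequencer destroyed concurrently (or one
// that outlives the store) cannot leave a dangling pointer behind.
struct OpSequencer {
  int id;
  explicit OpSequencer(int i) : id(i) {}
};

struct Registry {
  std::mutex lock;
  std::set<std::weak_ptr<OpSequencer>,
           std::owner_less<std::weak_ptr<OpSequencer>>> osr_set;

  void add(const std::shared_ptr<OpSequencer>& osr) {
    std::lock_guard<std::mutex> l(lock);
    osr_set.insert(osr);
  }

  // Snapshot strong refs so a drain/flush can proceed safely; entries
  // whose sequencer has already been destroyed are simply skipped.
  std::vector<std::shared_ptr<OpSequencer>> snapshot() {
    std::lock_guard<std::mutex> l(lock);
    std::vector<std::shared_ptr<OpSequencer>> out;
    for (auto& w : osr_set)
      if (auto s = w.lock())
        out.push_back(s);
    return out;
  }
};
```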
Signed-off-by: Sage Weil <sage@redhat.com>
The old implementation is racy and doesn't actually work. Instead, rely
on a list of all OpSequencers and drain them all.
Signed-off-by: Sage Weil <sage@redhat.com>
This ensures that we don't trim an onode from the cache while it has a
txc that is still in flight, which in turn ensures that if we try to
read the object, any writing buffers will be available.
Signed-off-by: Sage Weil <sage@redhat.com>
BlueStore collection methods only need preceding transactions to be
applied to the kv db; they do not need to be committed.
Note that this is *only* needed for collection listings; all other read
operations are immediately safe after queue_transactions().
Signed-off-by: Sage Weil <sage@redhat.com>
Currently this is the same as flush, but more precisely it is an internal
method that means all txc's must complete. Update _wal_apply() to use it
instead of flush(), which is part of the public Sequencer interface.
Signed-off-by: Sage Weil <sage@redhat.com>
This reverts commit 3e40595f3c.
The individual throttles have their own set of perfcounters; no need to
duplicate them here.
Signed-off-by: Sage Weil <sage@redhat.com>
The throttle is really about limiting deferred IO; we do not need to
actually remove the deferred record from the kv db before queueing more.
(In fact, the txc that queues more will do the cleanup.)
Signed-off-by: Sage Weil <sage@redhat.com>
We can unblock flush()ing threads as soon as we have applied to the kv db,
while the callbacks must wait until we have committed.
Move methods around a bit to better match the execution order.
Signed-off-by: Sage Weil <sage@redhat.com>
The only remaining flush() users only need to see previous txc's applied
to the kv db (e.g., _omap_clear needs to see the records to delete them).
Signed-off-by: Sage Weil <sage@redhat.com>
# Conflicts:
# src/os/bluestore/BlueStore.h
We do not release extents until after any deferred IO, so this flush() is
unnecessary.
Signed-off-by: Sage Weil <sage@redhat.com>
# Conflicts:
# src/os/bluestore/BlueStore.cc
We rely on the buffer cache to avoid reading any deferred write data. In
order for that to work, we have to ensure the entire block whose
overwrite is deferred is in the buffer cache. Otherwise, a write to 0~5
that results in a deferred write could break a subsequent read from 5~5
that reads the same block from disk before the deferred write lands.
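The rounding-out can be sketched as follows, assuming a fixed `block_size`; `containing_block` is an illustrative helper, not an actual BlueStore function:

```cpp
#include <cassert>
#include <cstdint>

// Toy sketch: round a small overwrite out to full block boundaries so
// the entire containing block lands in the buffer cache. E.g. a
// deferred write to 0~5 must cache block [0, 4096) so a later read of
// 5~5 is served from cache, not from stale data on disk.
struct BlockExtent {
  uint64_t offset, length;
};

BlockExtent containing_block(uint64_t off, uint64_t len,
                             uint64_t block_size) {
  uint64_t start = off - (off % block_size);                       // round down
  uint64_t end = off + len;
  uint64_t aligned_end =
      ((end + block_size - 1) / block_size) * block_size;          // round up
  return {start, aligned_end - start};
}
```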
Signed-off-by: Sage Weil <sage@redhat.com>
It does not matter if we update the freelist in the initial commit or when
cleaning up the deferred transaction; both will eventually update the
persistent kv freelist. We maintain one case to ensure that legacy
deferred events (from a kraken upgrade) release when they are replayed.
What matters while online is the Allocator, which has an independent
in-memory copy of the freelist to make decisions. And we can delay that
as long as we want. To avoid any concerns about deferred writes racing
against released blocks, just defer any release until the txc is fully
completed (including any deferred writes). This ensures that even if we
have a pattern like
txc 1: schedule deferred write on block A
txc 2: release block A
txc 1+2: commit
txc 2: done!
txc 1: do deferred write
txc 1: done!
then txc 2 won't do its release because it is stuck behind txc 1 in the
OpSequencer queue:
...
txc 1: reaped
txc 2: reaped (and extents released to alloc)
This builds in some delay in just-released space being usable again, but
it should be a very small amount of space relative to the size of the
store!
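The in-order reaping above can be modeled with a toy queue; `Txc`, `OpSequencer`, and `complete` here are illustrative, not the real classes:

```cpp
#include <cassert>
#include <deque>
#include <cstdint>
#include <vector>

// Toy sketch: txcs may finish in any order, but releases happen only
// when a txc is reaped from the front of its sequencer queue. A done
// txc stuck behind one still doing deferred IO must wait, so its
// extents cannot be reused while that deferred write is in flight.
struct Txc {
  int id;
  bool done = false;
  std::vector<uint64_t> to_release;  // extents freed on reap
};

struct OpSequencer {
  std::deque<Txc> q;
  std::vector<uint64_t> released;    // stand-in for the Allocator

  void queue(int id, std::vector<uint64_t> rel) {
    q.push_back({id, false, std::move(rel)});
  }

  void complete(int id) {
    for (auto& t : q)
      if (t.id == id)
        t.done = true;
    // Reap strictly from the front of the queue.
    while (!q.empty() && q.front().done) {
      for (auto e : q.front().to_release)
        released.push_back(e);
      q.pop_front();
    }
  }
};
```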
Signed-off-by: Sage Weil <sage@redhat.com>
"wal" can refer to both the rocksdb wal (effectively, or journal) and the
"wal" events we include in it (mainly promises to do future IO or release
extents to the freelist). This is super confusing!
Instead, call them 'deferred': deferred transactions, ops, writes, or
releases.
Signed-off-by: Sage Weil <sage@redhat.com>
aio_submit submits both aio_read and aio_write, and there are also
synchronous and random reads, so we need to handle read I/O completion
correctly.
Since a random read has its own ioc, num_reading for that ioc will be
at most 1, which is easy to handle in io_complete; we need only
differentiate whether the completion is an aio_read.
Also fix the exception logic in the command send path and make the
style consistent.
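A rough sketch of the completion-side distinction, with illustrative names (`IoType`, `IOContext`, `io_complete`) rather than the actual driver types:

```cpp
#include <cassert>
#include <atomic>

// Toy sketch: io_complete must decrement the right counter depending on
// whether the finished IO was a read or a write. Because a random read
// uses its own private ioc, its num_reading only ever goes 1 -> 0.
enum class IoType { AioRead, AioWrite, RandomRead };

struct IOContext {
  std::atomic<int> num_reading{0};
  std::atomic<int> num_pending{0};
};

void io_complete(IOContext& ioc, IoType t) {
  if (t == IoType::AioRead || t == IoType::RandomRead)
    --ioc.num_reading;
  else
    --ioc.num_pending;
}
```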
Signed-off-by: optimistyzy <optimistyzy@gmail.com>