69825 Commits

Author SHA1 Message Date
Kefu Chai
c9c2cf6b8a vstart.sh: remove start_*
so there are only two ways to override the number of daemons to start
- using the env var CEPH_NUM_{MON|OSD|MGR|MDS} or {MON|OSD|MGR|MDS}
- command line options: --{mon,osd,mds}_num

To prevent a daemon from running, set the corresponding env var to 0.

Signed-off-by: Kefu Chai <kchai@redhat.com>
2017-03-24 00:54:35 +08:00
Kefu Chai
c41fe1eae1 vstart.sh: do nothing if $CEPH_NUM_* is 0
Signed-off-by: Kefu Chai <kchai@redhat.com>
2017-03-22 13:16:42 +08:00
Kefu Chai
302d8d5f61 vstart.sh: extract start_{osd,mon,mgr,mds} into functions
Signed-off-by: Kefu Chai <kchai@redhat.com>
2017-03-22 13:16:41 +08:00
Orit Wasserman
68bc509413 Merge pull request #13963 from cbodley/wip-18725
rgw-admin: remove deprecated regionmap commands
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
2017-03-21 23:44:22 +02:00
Sage Weil
66b42be685 Merge pull request #13888 from liewegas/wip-bluestore-dw
os/bluestore: fix deferred writes; improve flush

Reviewed-by: Igor Fedotov <ifedotov@mirantis.com>
2017-03-21 15:05:56 -05:00
Casey Bodley
5444c9d092 Merge pull request #13902 from Wilhelmshaven/rm_redundant_code
rgw: remove redundant codes in rgw_cache.h

Reviewed-by: Casey Bodley <cbodley@redhat.com>
2017-03-21 15:43:48 -04:00
Sage Weil
def17606fc os/bluestore: handle zombie OpSequencers
It's possible for the Sequencer to go away while the OpSequencer still has
txcs in flight.  We were handling the case where the osr was on the
deferred_queue, but it may be off the deferred_queue but waiting for the
commit to happen, and we still need to wait for that.

Fix this by introducing a 'zombie' state for the osr, in which we keep the
osr in the osr_set.

Clean up the OpSequencer methods and a few other method names.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:31 -05:00
Sage Weil
d8fa788ca8 os/bluestore: clean up flush_all()
Add assertions if we fail to flush everything.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:31 -05:00
Sage Weil
9732b6c8e9 os/bluestore: move cached items around on collection split
We've been avoiding doing this for a while and it has finally caught up
with us: the SharedBlob may outlive the split due to deferred IO, and
a read on the child collection may load a competing Blob and SharedBlob
and read from the on-disk blocks that haven't been written yet.

Fix by preserving the one-SharedBlob-instance invariant by moving cache
items to the new Collection and cache shard like we should have from the
beginning.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:30 -05:00
Sage Weil
e4d547ede7 os/bluestore: simplify flush() wake-up condition
Clearer, and fewer wakeups.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:30 -05:00
Sage Weil
52c93f5b71 ceph_test_objectstore: set bluestore cache shards to 5
Better test coverage!

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:30 -05:00
Sage Weil
d93d6d0968 unittest_bluestore_types: fix Collection using tests
We can't use a bare Collection since we get/put refs, the last put will
delete it, and the dtor asserts nref == 0 (no faking a ref and deliberately
leaking!).

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:30 -05:00
Sage Weil
4de29d0f14 os/bluestore/KernelDevice: drop unused flush_lock
Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:30 -05:00
Sage Weil
ed9f54bae7 os/bluestore: better debugging around collections
Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:30 -05:00
Sage Weil
3ad789cef3 os/bluestore: nicer Onode dout prefix
Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:30 -05:00
Sage Weil
e83fc55491 os/bluestore: flush_cache on umount, fsck finish, etc.
Otherwise cache items survive beyond umount into the next mount cycle!

Also, ensure that we flush_cache *before* clearing coll_map, as some cache
items have references back to the Collection.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:30 -05:00
Sage Weil
01ef844421 os/bluestore: take Collection ref from SharedBlob
These can survive as long as the txc, which can be longer than the
Collection.  Make sure we have a valid ref as both finish_write and
~SharedBlob use coll for the SharedBlobSet (and coll->store->cct for
debug).

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:30 -05:00
Sage Weil
81e8682be1 os/bluestore: fix perfcounters for deferred io
Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:30 -05:00
Sage Weil
f4d4c9c68a os/bluestore: remove dead _do_deferred_op code
Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:29 -05:00
Sage Weil
3a3d9ad097 os/bluestore: make throttles tunable online
Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:29 -05:00
Sage Weil
3f9c216145 os/bluestore: prevent throttle deadlock due to deferred writes
Kick off deferred IOs if we pass the throttle midpoint or if we would
block during submission.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:29 -05:00
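The two trigger conditions in the commit above (passing the throttle midpoint, or being about to block) can be illustrated with a toy model. This is a minimal sketch, not BlueStore's code: `DeferredThrottleSketch`, its fields, and the assumption that submitted deferred IO completes immediately are all hypothetical.

```python
class DeferredThrottleSketch:
    """Toy throttle: queued deferred bytes count against a fixed budget."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.current = 0      # bytes currently held by the throttle
        self.pending = []     # sizes of deferred IOs not yet submitted
        self.kicks = 0        # how many times deferred IO was kicked off

    def submit_deferred(self):
        # Assumption: submitted IO completes at once and releases its bytes.
        if not self.pending:
            return
        self.kicks += 1
        self.current -= sum(self.pending)
        self.pending.clear()

    def queue(self, nbytes):
        # Condition 1: we would block, so kick deferred IO to free budget
        # instead of waiting on IO that nobody has submitted yet.
        if self.current + nbytes > self.max_bytes:
            self.submit_deferred()
        self.current += nbytes
        self.pending.append(nbytes)
        # Condition 2: we passed the midpoint, so kick deferred IO early.
        if self.current > self.max_bytes / 2:
            self.submit_deferred()


throttle = DeferredThrottleSketch(max_bytes=100)
throttle.queue(30)   # below the midpoint: nothing is submitted
kicks_after_first = throttle.kicks
throttle.queue(30)   # 60 > 50: midpoint passed, deferred IO is kicked
kicks_after_second = throttle.kicks
```

Without the first condition, a submitter could wait forever on throttle budget that only its own unsubmitted deferred IO can release.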
Sage Weil
e46081c8c6 ceph_test_objectstore: fix Synthetic to never modify bufferlists
We were modifying bufferlists in place, and kludging around it by making
full copies elsewhere.  Instead, never modify a buffer.

This fixes issues where the buffer we submit to ObjectStore ends up in
the cache and we modify in place later, corrupting the implementation's
copy.  (This was affecting BlueStore.)

Rearrange the data methods to be next to each other and clean them up a
bit too.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:29 -05:00
Sage Weil
ba159deb55 os/bluestore: drop obsolete comment
Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:29 -05:00
Sage Weil
1fefeeb39e os/bluestore: avoid extra dev flush on single device when all io is deferred
If we have no non-deferred IO to flush, and we are running bluefs on a
single shared device, then we can rely on the bluefs flush to make our
current batch of deferred ios stable.

Separate deferred into a "done" and "stable" list.  If we do sync, put
everything from "done" onto "stable".  Otherwise, after we do our kv
commit via bluefs, move "done" to "stable" then.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:29 -05:00
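The "done" and "stable" lists described above can be sketched as a single pass of a hypothetical sync loop. The function name, arguments, and step strings below are illustrative assumptions; only the ordering rule comes from the commit message.

```python
def kv_sync_pass(done, stable, have_nondeferred_io, single_shared_device):
    """Run one toy kv sync pass and return the list of steps performed."""
    steps = []
    # The device flush can be skipped only when there is nothing
    # non-deferred to flush and bluefs shares the single device.
    skip_dev_flush = single_shared_device and not have_nondeferred_io
    if not skip_dev_flush:
        steps.append("device flush")
        stable.extend(done)   # an explicit sync makes "done" IOs stable
        done.clear()
    steps.append("kv commit")
    if skip_dev_flush:
        # The bluefs flush done by the kv commit made these IOs stable.
        stable.extend(done)
        done.clear()
    return steps


done_a, stable_a = ["io1", "io2"], []
steps_a = kv_sync_pass(done_a, stable_a, have_nondeferred_io=False,
                       single_shared_device=True)

done_b, stable_b = ["io3"], []
steps_b = kv_sync_pass(done_b, stable_b, have_nondeferred_io=True,
                       single_shared_device=True)
```

In the first call the device flush is skipped entirely; in the second it still happens because non-deferred IO is outstanding.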
Sage Weil
c1f01082a1 os/bluestore: debug alloc release
Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:29 -05:00
Sage Weil
7a3e85f1a0 os/bluestore: flush old/discarded OpSequencers too
When the Sequencer goes away it gets deregistered.  If there are still
deferred IOs in flight, we need to wait for those too.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:29 -05:00
Sage Weil
a4b9012268 os/bluestore: batch up to bluestore_deferred_batch_ops before submitting
Allow several deferred writes to accumulate before we submit them.  In
general we have no time pressure, and on HDD (and perhaps sometimes SSD)
it is beneficial to accumulate and batch these so that they result in
fewer seeks.  On HDD, this is particularly true of seeks away from the
journal.  And on sequential workloads this can avoid seeks.  It may even
allow the block layer or SSD firmware to merge IOs and perform fewer
writes.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:29 -05:00
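A rough illustration of the batching idea above: hold deferred writes until a threshold is reached, then submit them together. `DeferredBatcher` is a hypothetical class, and sorting each batch by offset is an assumption made here to show why batching can reduce seeks; the commit does not say how a batch is ordered.

```python
class DeferredBatcher:
    """Toy batcher: accumulate deferred writes and submit them in groups."""

    def __init__(self, batch_ops):
        self.batch_ops = batch_ops   # stands in for bluestore_deferred_batch_ops
        self.pending = []            # (offset, data) tuples not yet submitted
        self.submitted = []          # list of batches, in submission order

    def queue(self, offset, data):
        self.pending.append((offset, data))
        if len(self.pending) >= self.batch_ops:
            self.flush()

    def flush(self):
        if self.pending:
            # Assumption: issuing a batch in offset order means fewer seeks.
            self.submitted.append(sorted(self.pending))
            self.pending = []


batcher = DeferredBatcher(batch_ops=3)
batcher.queue(8192, b"c")
batcher.queue(0, b"a")
batches_before_threshold = len(batcher.submitted)
batcher.queue(4096, b"b")   # third op reaches the threshold
```

A real implementation also needs a way to flush a partial batch, for example on shutdown or when a throttle is about to block.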
Sage Weil
44d498332c os/bluestore: only discard deallocated regions of a blob if !shared
If a blob is shared, we can't discard deallocated regions: there may
be deferred buffers in flight and we might get a read via the clone.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:29 -05:00
Sage Weil
d3a425faf8 os/bluestore: avoid waking up kv thread on deferred write completion
In a simple HDD workload with queue depth of 1, we halve our throughput
because the kv thread does a full commit twice per IO: once for the
initial commit, and then again to clean up the deferred write record. The
second wakeup is unnecessary; we can clean it up on the next commit.

We do need to do this wakeup in a few cases, though, when draining the
OpSequencers: (1) on replay during startup, and (2) on shutdown in
_osr_drain_all().

Send everything through _osr_drain_all() for simplicity.

This doubles HDD qd=1 IOPS from ~50 to ~100 on my 7200 rpm test device
(rados bench 30 write -b 4096 -t 1).

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:28 -05:00
Sage Weil
78b9cea09f os/bluestore: move many initializations into header
This is less fragile, especially with 2 constructors.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:28 -05:00
Sage Weil
6db031be4d os/bluestore: restructure deferred write queue
First, eliminate the work queue--it's useless.  We are dispatching aio and
should not block.  And if a single thread isn't sufficient to do it, it
probably means we should be parallelizing kv_sync_thread too (which is our
only caller that matters).

Repurpose the old osr-list -> txc-list-per-osr queue structure to manage
the queuing.  For any given osr, dispatch one batch of aios at a time,
taking care to collapse any overwrites so that the latest write wins.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:28 -05:00
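The "latest write wins" collapse mentioned above can be sketched as follows. This is a simplified illustration, assuming every write covers exactly one block identified by its offset; `collapse_batch` is a hypothetical helper, and real deferred writes may overlap only partially.

```python
def collapse_batch(writes):
    """Collapse (offset, data) writes so the latest write per offset wins."""
    latest = {}
    for offset, data in writes:   # iterate in submission order
        latest[offset] = data     # a later write replaces an earlier one
    return sorted(latest.items())


batch = [(0, b"aaaa"), (4096, b"bbbb"), (0, b"cccc")]
collapsed = collapse_batch(batch)
```

Here the first write to offset 0 is never issued, because the third write supersedes it within the same batch.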
Sage Weil
5fafd1fcc2 os/bluestore: fix OpSequencer/Sequencer lifecycle
Make osr_set refcounts so that it can tolerate a Sequencer destruction
racing with flush or a Sequencer that outlives the BlueStore instance
itself.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:28 -05:00
Sage Weil
3dc82d57e4 os/bluestore: move _osr_reap_done
Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:28 -05:00
Sage Weil
986776d30d os/bluestore: reimplement/rename _sync -> _flush_all
The old implementation is racy and doesn't actually work.  Instead, rely
on a list of all OpSequencers and drain them all.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:28 -05:00
Sage Weil
3cf2b0f9b7 os/bluestore: keep all OpSequencers registered
Maintain the set of all live OpSequencers.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:28 -05:00
Sage Weil
9b28d615e7 os/bluestore: keep onode refs for lifetime of obc
This ensures that we don't trim an onode from the cache while it has a
txc that is still in flight.  Which in turn ensures that if we try to read
the object, we will have any writing buffers available.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:28 -05:00
Sage Weil
4aa44d2b49 os/bluestore: make OnodeSpace onode_map private
Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:28 -05:00
Sage Weil
2d0d375809 os/bluestore: make Sequencer::flush() more efficient
BlueStore collection methods only need preceding transactions to be
applied to the kv db; they do not need to be committed.

Note that this is *only* needed for collection listings; all other read
operations are immediately safe after queue_transactions().

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:28 -05:00
Sage Weil
9ee0c842f2 os/bluestore: add OpSequencer::drain()
Currently this is the same as flush, but more precisely it is an internal
method that means all txc's must complete.  Update _wal_apply() to use it
instead of flush(), which is part of the public Sequencer interface.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:27 -05:00
Sage Weil
5cb5a902d2 os/bluestore: revert throttle perfcounters
This reverts 3e40595f3c

The individual throttles have their own set of perfcounters; no need to
duplicate them here.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:27 -05:00
Sage Weil
78df9b3e4d os/bluestore: release deferred throttle on io finish, before cleanup
The throttle is really about limiting deferred IO; we do not need to
actually remove the deferred record from the kv db before queueing more.
(In fact, the txc that queues more will do the cleanup.)

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:27 -05:00
Sage Weil
eff1e83145 os/bluestore: separate _txc_finish_kv into _txc_{applied,committed}_kv
We can unblock flush()ing threads as soon as we have applied to the kv db,
while the callbacks must wait until we have committed.

Move methods around a bit to better match the execution order.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:27 -05:00
Sage Weil
3238162bd9 os/bluestore: make flush() only wait for kv commit
The only remaining flush() users only need to see previous txc's applied
to the kv db (e.g., _omap_clear needs to see the records to delete them).

Signed-off-by: Sage Weil <sage@redhat.com>

# Conflicts:
#	src/os/bluestore/BlueStore.h
2017-03-21 13:56:27 -05:00
Sage Weil
a56cd6ba38 os/bluestore: no need to Onode::flush() on truncate
We do not release extents until after any deferred IO, so this flush() is
unnecessary.

Signed-off-by: Sage Weil <sage@redhat.com>

# Conflicts:
#	src/os/bluestore/BlueStore.cc
2017-03-21 13:56:27 -05:00
Sage Weil
83e33a32fd os/bluestore: no need to Onode::flush() in _do_read
We now ensure that deferred writes are in cache until the txc retires,
so there is no need to wait here.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:27 -05:00
Sage Weil
6f2f8b3e3b os/bluestore: pin writing cache buffers until txc is finished
Notably, this includes WAL writes, which means an in-flight WAL write will
always be in the cache.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:27 -05:00
Sage Weil
bcd2a32912 os/bluestore: write padded data into buffer cache
We rely on the buffer cache to avoid reading any deferred write data. In
order for that to work, we have to ensure the entire block whose
overwrite is deferred is in the buffer cache.  Otherwise, a write to 0~5
that results in a deferred write could break a subsequent read from 5~5
that reads the same block from disk before the deferred write lands.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:27 -05:00
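The padding described above can be sketched like this: a small write is expanded to whole blocks before it goes into the cache, so a later read of the same block never has to touch the disk. `BLOCK_SIZE`, `pad_to_block`, and the dict used as a cache are hypothetical, and the block size is kept tiny for readability.

```python
BLOCK_SIZE = 16  # hypothetical block size, far smaller than a real one


def pad_to_block(offset, data, read_block):
    """Return (aligned_offset, padded_data) covering whole blocks.

    read_block(block_offset) must return the BLOCK_SIZE bytes currently
    on disk for the block starting at block_offset.
    """
    start = offset - offset % BLOCK_SIZE
    end = offset + len(data)
    end_aligned = -(-end // BLOCK_SIZE) * BLOCK_SIZE   # round up
    buf = bytearray()
    for block_offset in range(start, end_aligned, BLOCK_SIZE):
        buf += read_block(block_offset)
    # Overlay the new data on top of the existing block contents.
    buf[offset - start:offset - start + len(data)] = data
    return start, bytes(buf)


disk = bytes(range(16))   # one block of existing on-disk data
cache = {}

# A write to 0~5 (offset 0, length 5) whose disk update is deferred.
start, padded = pad_to_block(0, b"HELLO", lambda b: disk[b:b + BLOCK_SIZE])
cache[start] = padded

# A read of 5~5 is now served from the cached block, not from disk.
read_5_5 = cache[0][5:10]
```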
Sage Weil
bc5bfdd592 os/bluestore: update freelist on initial commit
It does not matter if we update the freelist in the initial commit or when
cleaning up the deferred transaction; both will eventually update the
persistent kv freelist.  We maintain one case to ensure that legacy
deferred events (from a kraken upgrade) release when they are replayed.

What matters while online is the Allocator, which has an independent
in-memory copy of the freelist to make decisions.  And we can delay that
as long as we want.  To avoid any concerns about deferred writes racing
against released blocks, just defer any release until the txc is fully
completed (including any deferred writes).  This ensures that even if we
have a pattern like

 txc 1: schedule deferred write on block A
 txc 2: release block A
 txc 1+2: commit
 txc 2: done!
 txc 1: do deferred write
 txc 1: done!

then txc 2 won't do its release because it is stuck behind txc 1 in the
OpSequencer queue:

 ...
 txc 1: reaped
 txc 2: reaped (and extents released to alloc)

This builds in some delay in just-released space being usable again, but
it should be a very small amount of space relative to the size of the
store!

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:27 -05:00
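The ordering argument above can be modelled with a toy queue: a txc is reaped, and its extents released to the allocator, only once every txc ahead of it has finished. The class and field names are hypothetical; only the in-order reaping rule is taken from the commit message.

```python
from collections import deque


class Txc:
    def __init__(self, name, released=()):
        self.name = name
        self.released = list(released)   # blocks this txc wants to release
        self.done = False


class OpSequencerSketch:
    """Toy sequencer: txcs are reaped strictly in queue order."""

    def __init__(self):
        self.queue = deque()
        self.allocator_free = []   # blocks handed back to the allocator
        self.reaped = []

    def submit(self, txc):
        self.queue.append(txc)

    def finish(self, txc):
        txc.done = True
        # Reap from the head only; a finished txc stays queued while an
        # earlier txc is still doing its deferred write.
        while self.queue and self.queue[0].done:
            head = self.queue.popleft()
            self.allocator_free.extend(head.released)
            self.reaped.append(head.name)


osr = OpSequencerSketch()
txc1 = Txc("txc1")                    # has a deferred write on block A
txc2 = Txc("txc2", released=["A"])    # releases block A
osr.submit(txc1)
osr.submit(txc2)

osr.finish(txc2)                      # txc 2: done!
free_after_txc2 = list(osr.allocator_free)
osr.finish(txc1)                      # txc 1: deferred write lands, done!
free_after_txc1 = list(osr.allocator_free)
```

Block A only becomes allocatable after txc 1's deferred write has landed, so the deferred write cannot overwrite a block that has already been reused.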
Sage Weil
597383935e os/bluestore: wal -> deferred
"wal" can refer to both the rocksdb wal (effectively, our journal) and the
"wal" events we include in it (mainly promises to do future IO or release
extents to the freelist).  This is super confusing!

Instead, call them 'deferred': deferred transactions, ops, writes, or
releases.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:26 -05:00
Sage Weil
ffd4d2f1dd vstart.sh: larger wal device
Signed-off-by: Sage Weil <sage@redhat.com>
2017-03-21 13:56:26 -05:00