RepoMirrors/ceph

mirror of https://github.com/ceph/ceph synced 2025-01-29 14:34:40 +00:00

Author	SHA1	Message	Date
Kefu Chai	c9c2cf6b8a	vstart.sh: remove start_* so there are only two ways to override the number of daemons to start - using the env var CEPH_NUM_{MON|OSD|MGR|MDS} or {MON|OSD|MGR|MDS} - command line options: --{mon,osd,mds}_num. To prevent a daemon from running, set the corresponding env var to 0. Signed-off-by: Kefu Chai <kchai@redhat.com>	2017-03-24 00:54:35 +08:00
Kefu Chai	c41fe1eae1	vstart.sh: do nothing if $CEPH_NUM_* is 0 Signed-off-by: Kefu Chai <kchai@redhat.com>	2017-03-22 13:16:42 +08:00
Kefu Chai	302d8d5f61	vstart.sh: extract start_{osd,mon,mgr,mds} into functions Signed-off-by: Kefu Chai <kchai@redhat.com>	2017-03-22 13:16:41 +08:00
Orit Wasserman	68bc509413	Merge pull request #13963 from cbodley/wip-18725 rgw-admin: remove deprecated regionmap commands Reviewed-by: Orit Wasserman <owasserm@redhat.com>	2017-03-21 23:44:22 +02:00
Sage Weil	66b42be685	Merge pull request #13888 from liewegas/wip-bluestore-dw os/bluestore: fix deferred writes; improve flush Reviewed-by: Igor Fedotov <ifedotov@mirantis.com>	2017-03-21 15:05:56 -05:00
Casey Bodley	5444c9d092	Merge pull request #13902 from Wilhelmshaven/rm_redundant_code rgw: remove redundant codes in rgw_cache.h Reviewed-by: Casey Bodley <cbodley@redhat.com>	2017-03-21 15:43:48 -04:00
Sage Weil	def17606fc	os/bluestore: handle zombie OpSequencers It's possible for the Sequencer to go away while the OpSequencer still has txcs in flight. We were handling the case where the osr was on the deferred_queue, but it may be off the deferred_queue but waiting for the commit to happen, and we still need to wait for that. Fix this by introducing a 'zombie' state for the osr, in which we keep the osr in the osr_set. Clean up the OpSequencer methods and a few other method names. Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:31 -05:00
Sage Weil	d8fa788ca8	os/bluestore: clean up flush_all() Add assertions if we fail to flush everything. Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:31 -05:00
Sage Weil	9732b6c8e9	os/bluestore: move cached items around on collection split We've been avoiding doing this for a while and it has finally caught up with us: the SharedBlob may outlive the split due to deferred IO, and a read on the child collection may load a competing Blob and SharedBlob and read from the on-disk blocks that haven't been written yet. Fix by preserving the one-SharedBlob-instance invariant by moving cache items to the new Collection and cache shard like we should have from the beginning. Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:30 -05:00
Sage Weil	e4d547ede7	os/bluestore: simplify flush() wake-up condition Clearer, and fewer wakeups. Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:30 -05:00
Sage Weil	52c93f5b71	ceph_test_objectstore: set bluestore cache shards to 5 Better test coverage! Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:30 -05:00
Sage Weil	d93d6d0968	unittest_bluestore_types: fix Collection using tests We can't use a bare Collection since we get/put refs, the last put will delete it, and the dtor asserts nref == 0 (no faking a ref and deliberately leaking!). Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:30 -05:00
Sage Weil	4de29d0f14	os/bluestore/KernelDevice: drop unused flush_lock Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:30 -05:00
Sage Weil	ed9f54bae7	os/bluestore: better debugging around collections Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:30 -05:00
Sage Weil	3ad789cef3	os/bluestore: nicer Onode dout prefix Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:30 -05:00
Sage Weil	e83fc55491	os/bluestore: flush_cache on umount, fsck finish, etc. Otherwise cache items survive beyond umount into the next mount cycle! Also, ensure that we flush_cache before clearing coll_map, as some cache items have references back to the Collection. Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:30 -05:00
Sage Weil	01ef844421	os/bluestore: take Collection ref from SharedBlob These can survive as long as the txc, which can be longer than the Collection. Make sure we have a valid ref as both finish_write and ~SharedBlob use coll for the SharedBlobSet (and coll->store->cct for debug). Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:30 -05:00
Sage Weil	81e8682be1	os/bluestore: fix perfcounters for deferred io Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:30 -05:00
Sage Weil	f4d4c9c68a	os/bluestore: remove dead _do_deferred_op code Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:29 -05:00
Sage Weil	3a3d9ad097	os/bluestore: make throttles tunable online Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:29 -05:00
Sage Weil	3f9c216145	os/bluestore: prevent throttle deadlock due to deferred writes Kick off deferred IOs if we pass the throttle midpoint or if we would block during submission. Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:29 -05:00
Sage Weil	e46081c8c6	ceph_test_objectstore: fix Synthetic to never modify bufferlists We were modifying bufferlists in place, and kludging around it by making full copies elsewhere. Instead, never modify a buffer. This fixes issues where the buffer we submit to ObjectStore ends up in the cache and we modify in place later, corrupting the implementation's copy. (This was affecting BlueStore.) Rearrange the data methods to be next to each other and clean them up a bit too. Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:29 -05:00
Sage Weil	ba159deb55	os/bluestore: drop obsolete comment Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:29 -05:00
Sage Weil	1fefeeb39e	os/bluestore: avoid extra dev flush on single device when all io is deferred If we have no non-deferred IO to flush, and we are running bluefs on a single shared device, then we can rely on the bluefs flush to make our current batch of deferred ios stable. Separate deferred into a "done" and "stable" list. If we do sync, put everything from "done" onto "stable". Otherwise, after we do our kv commit via bluefs, move "done" to "stable" then. Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:29 -05:00
Sage Weil	c1f01082a1	os/bluestore: debug alloc release Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:29 -05:00
Sage Weil	7a3e85f1a0	os/bluestore: flush old/discarded OpSequencers too When the Sequencer goes away it gets deregistered. If there are still deferred IOs in flight, we need to wait for those too. Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:29 -05:00
Sage Weil	a4b9012268	os/bluestore: batch up to bluestore_deferred_batch_ops before submitting Allow several deferred writes to accumulate before we submit them. In general we have no time pressure, and on HDD (and perhaps sometimes SSD) it is beneficial to accumulate and batch these so that they result in fewer seeks. On HDD, this is particularly true of seeks away from the journal. And on sequential workloads this can avoid seeks. It may even allow the block layer or SSD firmware to merge IOs and perform fewer writes. Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:29 -05:00
Sage Weil	44d498332c	os/bluestore: only discard deallocated regions of a blob if !shared If a blob is shared, we can't discard deallocated regions: there may be deferred buffers in flight and we might get a read via the clone. Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:29 -05:00
Sage Weil	d3a425faf8	os/bluestore: avoid waking up kv thread on deferred write completion In a simple HDD workload with queue depth of 1, we halve our throughput because the kv thread does a full commit twice per IO: once for the initial commit, and then again to clean up the deferred write record. The second wakeup is unnecessary; we can clean it up on the next commit. We do need to do this wakeup in a few cases, though, when draining the OpSequencers: (1) on replay during startup, and (2) on shutdown in _osr_drain_all(). Send everything through _osr_drain_all() for simplicity. This doubles HDD qd=1 IOPS from ~50 to ~100 on my 7200 rpm test device (rados bench 30 write -b 4096 -t 1). Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:28 -05:00
Sage Weil	78b9cea09f	os/bluestore: move many initializations into header This is less fragile, especially with 2 constructors. Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:28 -05:00
Sage Weil	6db031be4d	os/bluestore: restructure deferred write queue First, eliminate the work queue--it's useless. We are dispatching aio and should not block. And if a single thread isn't sufficient to do it, it probably means we should be parallelizing kv_sync_thread too (which is our only caller that matters). Repurpose the old osr-list -> txc-list-per-osr queue structure to manage the queuing. For any given osr, dispatch one batch of aios at a time, taking care to collapse any overwrites so that the latest write wins. Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:28 -05:00
Sage Weil	5fafd1fcc2	os/bluestore: fix OpSequencer/Sequencer lifecycle Make osr_set refcounts so that it can tolerate a Sequencer destruction racing with flush or a Sequencer that outlives the BlueStore instance itself. Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:28 -05:00
Sage Weil	3dc82d57e4	os/bluestore: move _osr_reap_done Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:28 -05:00
Sage Weil	986776d30d	os/bluestore: reimplement/rename _sync -> _flush_all The old implementation is racy and doesn't actually work. Instead, rely on a list of all OpSequencers and drain them all. Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:28 -05:00
Sage Weil	3cf2b0f9b7	os/bluestore: keep all OpSequencers registered Maintain the set of all live OpSequencers. Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:28 -05:00
Sage Weil	9b28d615e7	os/bluestore: keep onode refs for lifetime of obc This ensures that we don't trim an onode from the cache while it has a txc that is still in flight. Which in turn ensures that if we try to read the object, we will have any writing buffers available. Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:28 -05:00
Sage Weil	4aa44d2b49	os/bluestore: make OnodeSpace onode_map private Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:28 -05:00
Sage Weil	2d0d375809	os/bluestore: make Sequencer::flush() more efficient BlueStore collection methods only need preceding transactions to be applied to the kv db; they do not need to be committed. Note that this is only needed for collection listings; all other read operations are immediately safe after queue_transactions(). Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:28 -05:00
Sage Weil	9ee0c842f2	os/bluestore: add OpSequencer::drain() Currently this is the same as flush, but more precisely it is an internal method that means all txc's must complete. Update _wal_apply() to use it instead of flush(), which is part of the public Sequencer interface. Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:27 -05:00
Sage Weil	5cb5a902d2	os/bluestore: revert throttle perfcounters This reverts `3e40595f3c` The individual throttles have their own set of perfcounters; no need to duplicate them here. Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:27 -05:00
Sage Weil	78df9b3e4d	os/bluestore: release deferred throttle on io finish, before cleanup The throttle is really about limiting deferred IO; we do not need to actually remove the deferred record from the kv db before queueing more. (In fact, the txc that queues more will do the cleanup.) Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:27 -05:00
Sage Weil	eff1e83145	os/bluestore: separate _txc_finish_kv into _txc_{applied,committed}_kv We can unblock flush()ing threads as soon as we have applied to the kv db, while the callbacks must wait until we have committed. Move methods around a bit to better match the execution order. Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:27 -05:00
Sage Weil	3238162bd9	os/bluestore: make flush() only wait for kv commit The only remaining flush() users only need to see previous txc's applied to the kv db (e.g., _omap_clear needs to see the records to delete them). Signed-off-by: Sage Weil <sage@redhat.com> # Conflicts: # src/os/bluestore/BlueStore.h	2017-03-21 13:56:27 -05:00
Sage Weil	a56cd6ba38	os/bluestore: no need to Onode::flush() on truncate We do not release extents until after any deferred IO, so this flush() is unnecessary. Signed-off-by: Sage Weil <sage@redhat.com> # Conflicts: # src/os/bluestore/BlueStore.cc	2017-03-21 13:56:27 -05:00
Sage Weil	83e33a32fd	os/bluestore: no need to Onode::flush() in _do_read We now ensure that deferred writes are in cache until the txc retires, so there is no need to wait here. Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:27 -05:00
Sage Weil	6f2f8b3e3b	os/bluestore: pin writing cache buffers until txc is finished Notably, this includes WAL writes, which means an in-flight WAL write will always be in the cache. Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:27 -05:00
Sage Weil	bcd2a32912	os/bluestore: write padded data into buffer cache We rely on the buffer cache to avoid reading any deferred write data. In order for that to work, we have to ensure the entire block whose overwrite is deferred is in the buffer cache. Otherwise, a write to 0~5 that results in a deferred write could break a subsequent read from 5~5 that reads the same block from disk before the deferred write lands. Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:27 -05:00
Sage Weil	bc5bfdd592	os/bluestore: update freelist on initial commit It does not matter if we update the freelist in the initial commit or when cleaning up the deferred transaction; both will eventually update the persistent kv freelist. We maintain one case to ensure that legacy deferred events (from a kraken upgrade) release when they are replayed. What matters while online is the Allocator, which has an independent in-memory copy of the freelist to make decisions. And we can delay that as long as we want. To avoid any concerns about deferred writes racing against released blocks, just defer any release until the txc is fully completed (including any deferred writes). This ensures that even if we have a pattern like txc 1: schedule deferred write on block A txc 2: release block A txc 1+2: commit txc 2: done! txc 1: do deferred write txc 1: done! then txc 2 won't do its release because it is stuck behind txc 1 in the OpSequencer queue: ... txc 1: reaped txc 2: reaped (and extents released to alloc) This builds in some delay in just-released space being usable again, but it should be a very small amount of space relative to the size of the store! Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:27 -05:00
Sage Weil	597383935e	os/bluestore: wal -> deferred "wal" can refer to both the rocksdb wal (effectively, or journal) and the "wal" events we include in it (mainly promises to do future IO or release extents to the freelist). This is super confusing! Instead, call them 'deferred'.. deferred transactions, ops, writes, or releases. Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:26 -05:00
Sage Weil	ffd4d2f1dd	vstart.sh: larger wal device Signed-off-by: Sage Weil <sage@redhat.com>	2017-03-21 13:56:26 -05:00
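
Commit 6db031be4d above describes dispatching one batch of deferred aios at a time while collapsing overwrites "so that the latest write wins". The following is a minimal sketch of that idea only, not BlueStore's implementation: the function name `collapse_writes` and the `(offset, data)` tuple representation are assumptions made for illustration.

```python
def collapse_writes(writes):
    """Collapse (offset, data) writes, given in submission order.

    Hypothetical helper for illustration: where writes overlap, the bytes
    from the later write replace the bytes from the earlier one. Returns a
    list of non-overlapping (offset, bytes) extents sorted by offset.
    """
    latest = {}
    for offset, data in writes:
        for i, byte in enumerate(data):
            # A later write to the same position overwrites the earlier byte.
            latest[offset + i] = byte

    extents = []
    run_start = None
    run = bytearray()
    for pos in sorted(latest):
        if run_start is not None and pos == run_start + len(run):
            # Position is adjacent to the current run, so extend it.
            run.append(latest[pos])
        else:
            # Gap found: close the current run (if any) and start a new one.
            if run_start is not None:
                extents.append((run_start, bytes(run)))
            run_start = pos
            run = bytearray([latest[pos]])
    if run_start is not None:
        extents.append((run_start, bytes(run)))
    return extents
```

For example, `collapse_writes([(0, b"aaaa"), (2, b"bb"), (10, b"cc")])` yields two extents, one starting at offset 0 in which the second write has replaced the tail of the first, and one at offset 10. Tracking individual bytes keeps the sketch short; a real queue would more likely work on block-aligned extents.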


69825 Commits