89926 Commits

Author SHA1 Message Date
Sage Weil
ba7f9af21c mon/OSDMonitor: fix long line
Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:08:40 -05:00
Sage Weil
9400a8f33c mon/OSDMonitor: move pool created check into caller
This makes for less confusing debug output.  Speaking from experience.

Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:08:40 -05:00
Sage Weil
0d5dc37c9d mon/OSDMonitor: adjust pgp_num_target down along with pg_num_target as needed
If the user asks to reduce pg_num, reduce pgp_num_target too at the same
time.

Don't completely hide pgp_num yet (by increasing it when pg_num_target
increases).

Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:08:40 -05:00
Sage Weil
1a08a41266 mon/OSDMonitor: add mon_osd_max_initial_pgs to cap initial pool pgs
Configure how many initial PGs we create a pool with.  If the user wants
more than this then we do subsequent splits.

Default to 1024, so that pool creation works in the usual way for most users,
but does some splitting for very large pools/clusters.

Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:08:40 -05:00
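The capping rule described in the mon_osd_max_initial_pgs commit above can be sketched roughly as follows. This is a minimal illustration in Python, not the monitor's actual code: the function name `initial_pg_num` is hypothetical, and only the option name and its default of 1024 come from the commit message.

```python
def initial_pg_num(requested_pg_num: int, mon_osd_max_initial_pgs: int = 1024) -> int:
    """Return how many PGs a pool would start with under the cap.

    Hypothetical helper: anything above the cap is assumed to be
    reached later through splits, as the commit message describes.
    """
    if requested_pg_num < 1:
        raise ValueError("pg_num must be positive")
    return min(requested_pg_num, mon_osd_max_initial_pgs)


if __name__ == "__main__":
    for requested in (64, 1024, 4096):
        start = initial_pg_num(requested)
        note = "reached by later splits" if start < requested else "created directly"
        print(f"requested={requested} initial={start} ({note})")
```

With the default cap, a request for 64 or 1024 PGs is created directly, while a request for 4096 would start at 1024 and rely on splitting for the rest.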
Sage Weil
93928ff029 osd/OSDMap: set pg[p]_num_target in build_simple*() methods
These are only used by unit tests and osdmaptool as far as I can tell.

Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:08:40 -05:00
Sage Weil
45d8d5dcf4 mon/PGMap: adjust SMALLER_PGP_NUM warning to use *_target values
If the cluster is failing to converge on the target values that is a
separate problem.

Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:08:40 -05:00
Sage Weil
aea329eb9a mon/OSDMonitor: set CREATING flag for force-create-pg
In order to recreate a lost PG, we need to set the CREATING flag for the
pool.  This prevents pg_num from changing in future OSDMap epochs until
*after* the PG has successfully been instantiated.

Note that a pg_num change in *this* epoch is fine; the recreated PG will
instantiate in *this* epoch, which is /after/ the split that a pg_num change
in this epoch would describe.

Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:08:40 -05:00
Sage Weil
fdfc5c64e8 mon/OSDMonitor: start sending new-style pg_create2 messages
The new sharded wq implementation cannot handle a resent mon create
message and a split child already existing.  This is a side effect of the
new pg create path instantiating the PG at the pool create epoch osdmap
and letting it roll forward through splits; the mon may be resending a
create for a pg that was already created elsewhere and split elsewhere,
such that one of those split children has peered back onto this same OSD.
When we roll forward our re-created empty parent it may split and find the
child already exists, crashing.

This is no longer a concern because the mgr-based controller for pg_num
will not split PGs until after the initial PGs are all created.  (We
know this because the pool has the CREATED flag set.)

The old-style path had its own problem:
http://tracker.ceph.com/issues/22165.  We would build the history and
instantiate the pg in the latest osdmap epoch, ignoring any split children
that should have been created between the pool create epoch and the
current epoch.  Since we're now taking the new path, that is no longer
a problem.

Fixes: http://tracker.ceph.com/issues/22165
Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:08:40 -05:00
Sage Weil
fffd50cee0 mon/OSDMonitor: set last_force_resend_prenautilus for pg_num_pending changes
This will force pre-nautilus clients to resend ops when we are adjusting
pg_num_pending.  This is a big hammer: for nautilus+ clients, we only have
an interval change for the affected PGs (the two PGs that are about to
merge), whereas this compat hack will do an op resend for the whole pool.
However, it is better than requiring all clients be upgraded to nautilus in
order to do PG merges.

Note that we already do the same thing for pre-luminous clients for
splits, so we've already inflicted similar pain in the past (and, to my
knowledge, have not seen any negative feedback or fallout from that).

Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:08:40 -05:00
Sage Weil
9426dc4c2b osd: ignore pg creates when pool FLAG_CREATING is not set
We only process mon-initiated PG creates while the pool is in CREATING
mode.  This ensures that we will not have any racing split or merge
operations.

Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:08:40 -05:00
Sage Weil
101671d95a mgr: do not adjust pg_num until FLAG_CREATING removed from pool
This is more reliable than looking at PG states because the PG may have
gone active and sent a notification to the mon (pg created!) and mgr
(new state!) but the mon may not have persisted that information yet.

Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:08:40 -05:00
Sage Weil
f4f014937d mon/OSDMonitor: add FLAG_CREATING on upgrade if pools still creating
Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:08:40 -05:00
Sage Weil
21458e88e4 mon/OSDMonitor: prevent FLAG_CREATING from getting set pre-nautilus
Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:08:40 -05:00
Sage Weil
7fc3a9bd07 mon/OSDMonitor: disallow pg_num changes while CREATING flag is set
Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:08:40 -05:00
Sage Weil
41c38559db mon/OSDMonitor: set POOL_CREATING flag until initial pool pgs are created
Set the flag when the pool is created, and clear it when the initial set
of PGs have been created by the mon.  Move the update_creating_pgs()
block so that we can process the pgid removal from the creating list and
the pool flag removal in the same epoch; otherwise we might remove the
pgid but have no cluster activity to roll over another osdmap epoch to
allow the pool flag to be removed.

Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:08:40 -05:00
Sage Weil
0e526b467a osd/osd_types: add pg_pool_t FLAG_POOL_CREATING
Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:08:40 -05:00
Sage Weil
52e18ec08a osd/osd_types: introduce last_force_resend_prenautilus
Previously, we renamed the old last_force_resend to
last_force_resend_preluminous and created a new last_force_resend for
luminous+.  This allowed us to force preluminous clients to resend ops
(because they didn't understand the new pg split => new interval rule)
without affecting luminous clients.

Do the same rename again, adding a last_force_resend_prenautilus (luminous
or mimic).

Adjust the OSD code accordingly so it matches the behavior we'll see from
a luminous client.

Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:08:40 -05:00
Sage Weil
7074ad4a56 osd/PGLog: merge_from helper
When merging two logs, we throw out all of the actual log entries.
However, we need to convert them to dup ops as appropriate, and merge
those together.  Reuse the trim code to do this.

Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:08:40 -05:00
Sage Weil
27168cf502 osd: no cache agent or snap trimming during premerge
The PG is quiesced; no background activity.

Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:08:40 -05:00
Sage Weil
bbf952125e osd: notify mon when pending PGs are ready to merge
When a PG is in the pending merge state it is >= pg_num_pending and <
pg_num.  When this happens quiesce IO, peer, wait for activate to commit,
and then notify the mon that we are idle and safe to merge.

Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:08:40 -05:00
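The pending-merge range stated in the commit above (at or beyond pg_num_pending, but below pg_num) can be written as a one-line predicate. The sketch below is illustrative Python, not OSD code; the function name `is_pending_merge` and the use of a bare integer placement seed are assumptions.

```python
def is_pending_merge(ps: int, pg_num: int, pg_num_pending: int) -> bool:
    """True if placement seed `ps` lies in [pg_num_pending, pg_num).

    Hypothetical helper mirroring the range given in the commit message.
    """
    return pg_num_pending <= ps < pg_num


if __name__ == "__main__":
    # Example: the pool is shrinking from 8 PGs towards 7.
    pg_num, pg_num_pending = 8, 7
    pending = [ps for ps in range(pg_num) if is_pending_merge(ps, pg_num, pg_num_pending)]
    print("pending merge sources:", pending)
```

When pg_num_pending equals pg_num the range is empty, so no PG is treated as a pending merge source.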
Sage Weil
a5274c75e2 mgr: add simple controller to adjust pg[p]_num_actual
This is a pretty trivial controller.  It adds some constraints that were
obviously not there before when the user could set these values to anything
they wanted, but does not implement all of the "nice" stepping that we'll
eventually want.  That can come later.

Splits:
- throttle pg_num increases, currently using the same config option
(mon_osd_max_creating_pgs) that we used to throttle pg creation
- do not increase pg_num until the initial pg creation has completed.

Merges:
- wait until the source and target pgs for merge are active and clean
before doing a merge.

Adjust pgp_num all at once for now.

Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:08:40 -05:00
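The split and merge constraints listed in the controller commit above could be sketched as a single step function. This is a rough Python illustration under stated assumptions, not the mgr's implementation: the function and parameter names are hypothetical, the throttle is assumed to limit each step to at most `max_creating_pgs` new PGs, and merges are assumed to proceed one PG at a time.

```python
def next_pg_num_actual(pg_num: int,
                       pg_num_target: int,
                       pool_creating: bool,
                       merge_pgs_active_clean: bool,
                       max_creating_pgs: int) -> int:
    """Return the pg_num value a controller step would move to."""
    if pg_num_target > pg_num:
        # Splits: wait for initial PG creation, then grow in throttled steps.
        if pool_creating:
            return pg_num
        return min(pg_num_target, pg_num + max_creating_pgs)
    if pg_num_target < pg_num:
        # Merges: only proceed once source and target PGs are active+clean.
        if not merge_pgs_active_clean:
            return pg_num
        return pg_num - 1  # assumption: one merge per step
    return pg_num


if __name__ == "__main__":
    pg_num = 8
    while pg_num != 64:
        pg_num = next_pg_num_actual(pg_num, 64, False, True, 16)
        print("pg_num ->", pg_num)
```

Keeping the decision in one pure function makes each constraint easy to test in isolation, which is one reason a sketch like this separates policy from the code that actually issues the change.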
Sage Weil
d48d7c9ce5 mon/OSDMonitor: MOSDPGReadyToMerge to complete a pg_num change
This message allows pg_num to be decremented (once the final PGs are
ready).

Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:08:40 -05:00
Sage Weil
7f3d156ebf mon/OSDMonitor: allow pg_num to be adjusted up or down via pg[p]_num_target
The CLI now sets the *_target values, imposing only the subset of constraints that
the user needs to be concerned with.

New "pg_num_actual" and "pgp_num_actual" properties/commands are added that allow
the underlying raw values to be adjusted.  For the merge case, this sets
pg_num_pending instead of pg_num so that the OSDs can go through the
merge prep process.

A controller (in a future commit) will make pg[p]_num converge to pg[p]_num_target.

Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:08:39 -05:00
Sage Weil
17b270a04f osd/osd_types: make pg merge an interval boundary
Both the merge itself *and* the pending merge are interval transitions.

Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:07:56 -05:00
Sage Weil
0540492461 osd/osd_types: add pg_t::is_merge() method
This checks if we are a merge *source*, and if so, who the parent (target)
will be.

Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:07:56 -05:00
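The check described in the is_merge() commit above (is this PG a merge source, and which PG is its target) can be sketched as follows. This is illustrative Python, not the C++ method itself: the function name `merge_parent` is hypothetical, and the rule that the parent is found by repeatedly clearing the highest set bit of the placement seed is an assumption about split lineage that the commit message does not state.

```python
from typing import Optional


def merge_parent(ps: int, old_pg_num: int, new_pg_num: int) -> Optional[int]:
    """Return the merge target for seed `ps`, or None if it is not a source.

    A seed is treated as a merge source when it exists under old_pg_num
    but falls outside new_pg_num.
    """
    if new_pg_num < 1:
        raise ValueError("new_pg_num must be at least 1")
    if not (new_pg_num <= ps < old_pg_num):
        return None
    parent = ps
    while parent >= new_pg_num:
        # Assumed lineage rule: drop the highest set bit to reach the parent.
        parent &= ~(1 << (parent.bit_length() - 1))
    return parent


if __name__ == "__main__":
    for ps in range(8):
        print(ps, "->", merge_parent(ps, old_pg_num=8, new_pg_num=7))
```

Under these assumptions, shrinking a pool from 8 PGs to 7 makes seed 7 the only merge source, with seed 3 as its target.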
Sage Weil
71f4691909 osd/osd_types: add pg_num_pending to pg_pool_t
Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:07:56 -05:00
Sage Weil
46ba9febab osd: allow multiple threads to block on wait_min_pg_epoch
Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:07:56 -05:00
Sage Weil
2177350b01 osd: restructure advance_pg() call mechanism
Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:07:56 -05:00
Sage Weil
45ef31d84f mon/PGMap: prune merged pgs
Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:07:56 -05:00
Sage Weil
8a9b3f33f0 mon/PGMap: track pgs by state for each pool
We had this globally, but it's useful to have the per-pool breakdowns.

Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:07:56 -05:00
Sage Weil
cba9dea7da osd/SnapMapper: allow split_bits to decrease (merge)
Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:07:56 -05:00
Sage Weil
d39337fdf8 os/bluestore: fix osr_drain before merge
We need to make sure the deferred writes on the source collection finish
before the merge so that ops ordered via the final target sequencer will
occur after those writes.

Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:07:56 -05:00
Sage Weil
044ce83b1e os/bluestore: allow reuse of osr from existing collection
We try to attach an old osr at prepare_new_collection time, but that
happens before a transaction is submitted, and we might have a
transaction that removes and then recreates a collection.

Move the logic to _osr_attach and extend it to include reusing an osr
in use by a collection already in coll_map.  Also adjust the
_osr_register_zombie method to behave if the osr is already there, which
can happen with a remove, create, remove+create transaction sequence.

Fixes: https://tracker.ceph.com/issues/25180
Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:07:56 -05:00
Sage Weil
8a1100bf59 os/filestore: (re)implement merge
Merging is a bit different than splitting, because the two collections
may already be hashed at different levels.  Since lookup etc rely on the
idea that the object is always at the deepest level of hashing, if you
merge collections with different levels that share some common bit prefix
then some objects will end up higher up the hierarchy even though deeper
hashed directories exist.

Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:07:56 -05:00
Sage Weil
2465df57b7 os/filestore: add _merge_collections post-check
Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:07:56 -05:00
Sage Weil
1a80ba0636 os: implement merge_collection
Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:07:56 -05:00
Sage Weil
ad3aab364b os/ObjectStore: add merge_collection operation to Transaction
Signed-off-by: Sage Weil <sage@redhat.com>
2018-09-07 12:07:56 -05:00
Casey Bodley
539c675db9
Merge pull request #23634 from cbodley/wip-21154
rgw: RGWRadosGetOmapKeysCR takes result by shared_ptr

Reviewed-by: Yehuda Sadeh <yehuda@redhat.com>
2018-09-07 11:05:20 -04:00
Casey Bodley
fd77ff74ae rgw: RGWRadosGetOmapKeysCR takes result by shared_ptr
Fixes: http://tracker.ceph.com/issues/21154

Signed-off-by: Casey Bodley <cbodley@redhat.com>
2018-09-07 09:34:05 -04:00
Casey Bodley
04959fe3e4
Merge pull request #23920 from cbodley/wip-rgw-cr-rados-fixes
rgw multisite: async rados requests don't access coroutine memory

Reviewed-by: Yehuda Sadeh <yehuda@redhat.com>
2018-09-07 09:30:29 -04:00
Kefu Chai
d4a781477f
Merge pull request #23959 from rubenk/doc-remove-unknown-option-from-manpage
doc: remove deprecated 'scrubq' from ceph(8)

Reviewed-by: Kefu Chai <kchai@redhat.com>
2018-09-07 19:12:12 +08:00
Kefu Chai
434a0294f5
Merge pull request #23931 from cyx1231st/wip-msgr-test
tests: fix to check server_conn in MessengerTest.NameAddrTest

Reviewed-by: Kefu Chai <kchai@redhat.com>
2018-09-07 14:43:47 +08:00
Yingxin
b3c28a16cf tests: fix to check server_conn in MessengerTest.NameAddrTest
Signed-off-by: Yingxin <yingxin.cheng@intel.com>
2018-09-07 19:54:44 +08:00
vasukulkarni
93748a325c
Merge pull request #23944 from ceph/wip-s3a-update-mirror
qa/tasks: update mirror link for maven
2018-09-06 14:44:29 -07:00
Andrew Schoen
331e4180be
Merge pull request #23963 from alfredodeza/wip-rm35535
ceph-volume: batch tests for mixed-type of devices

Reviewed-by: Andrew Schoen <aschoen@redhat.com>
2018-09-06 15:29:33 -05:00
Alfredo Deza
c1481799a2 ceph-volume lvm.batch use 'ceph' as the cluster name with filestore
Custom cluster names are currently broken on ceph-volume. This should be
addressed by http://tracker.ceph.com/issues/27210, which is out of
scope for these changes.

Signed-off-by: Alfredo Deza <adeza@redhat.com>
2018-09-06 15:19:01 -04:00
Alfredo Deza
a096a016cc ceph-volume tests/functional update filestore xenial test vars
Signed-off-by: Alfredo Deza <adeza@redhat.com>
2018-09-06 14:26:37 -04:00
Alfredo Deza
89e52dd197 ceph-volume tests/functional update bluestore xenial test vars
Signed-off-by: Alfredo Deza <adeza@redhat.com>
2018-09-06 14:26:37 -04:00
Alfredo Deza
a5ec54207a ceph-volume tests/functional update filestore centos7 test vars
Signed-off-by: Alfredo Deza <adeza@redhat.com>
2018-09-06 14:26:37 -04:00
Alfredo Deza
2549d57372 ceph-volume tests/functional update bluestore centos7 test vars
Signed-off-by: Alfredo Deza <adeza@redhat.com>
2018-09-06 14:26:37 -04:00