Fixed reference link for hit set type value. Restructured wording in description.
Fixes: https://tracker.ceph.com/issues/34539
Signed-off-by: James McClune <jmcclune@mcclunetechnologies.net>
* refs/pull/23449/head:
osd/OSDMap: cleanup: s/tmpmap/nextmap/
qa/standalone/osd/osd-backfill-stats: fixes
osd/OSDMap: clean out pg_temp mappings that exceed pool size
mon/OSDMonitor: clean temps and upmaps in encode_pending, efficiently
osd/OSDMapMapping: do not crash if acting > pool size
Reviewed-by: David Zafman <dzafman@redhat.com>
Reviewed-by: xie xingguo <xie.xingguo@zte.com.cn>
Reviewed-by: Neha Ojha <nojha@redhat.com>
* refs/pull/23984/head:
mon: test if gid exists in pending for prepare_beacon
Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Greg Farnum <gfarnum@redhat.com>
Grep from the primary's log, not every osd's log.
For the backfill_remapped task in particular, after the pg_temp change the
primary happens to change across the pool size change, so two different
primaries end up doing (some of the) backfill. Fix that test to pass the
correct primary.
Other tests are unaffected as they do not (happen to) trigger a primary
change and already satisfied the (removed) check that only one OSD does
backfill.
Signed-off-by: Sage Weil <sage@redhat.com>
If the pool size is reduced, we can end up with pg_temp mappings that are
too big. This can trigger bad behavior elsewhere (e.g., OSDMapMapping,
which assumes that acting and up are always <= pool size).
Fixes: http://tracker.ceph.com/issues/26866
Signed-off-by: Sage Weil <sage@redhat.com>
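A minimal sketch of the kind of cleanup this implies, with simplified
stand-in types rather than the real OSDMap/pg_t interfaces:

  #include <cstdint>
  #include <map>
  #include <utility>
  #include <vector>

  // Stand-ins for pool metadata and pg_temp state; not the real Ceph types.
  struct pool_info {
    uint32_t size;  // replica count
  };

  using pool_id_t = uint64_t;
  using pg_id_t = std::pair<pool_id_t, uint32_t>;  // (pool, seed)

  // Drop pg_temp entries that are larger than the (possibly just reduced)
  // pool size; such mappings are bogus and confuse consumers that assume
  // up/acting <= pool size.
  void clean_oversized_pg_temp(
      const std::map<pool_id_t, pool_info>& pools,
      std::map<pg_id_t, std::vector<int32_t>>& pg_temp)
  {
    for (auto it = pg_temp.begin(); it != pg_temp.end(); ) {
      auto p = pools.find(it->first.first);
      if (p == pools.end() || it->second.size() > p->second.size) {
        it = pg_temp.erase(it);  // pool gone or mapping exceeds pool size
      } else {
        ++it;
      }
    }
  }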
- do not rebuild the next map when we already have it
- do this work in encode_pending, not create_pending, so we catch bad
values before they are published (sketched below).
Signed-off-by: Sage Weil <sage@redhat.com>
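A rough sketch of the shape this suggests; the helper names and types
below are illustrative, not the actual OSDMonitor interface:

  #include <memory>

  struct OSDMapStub { /* would-be next full map */ };
  struct PendingInc { /* changes accumulated for this proposal */ };

  struct MonSketch {
    // Expensive: apply the pending incremental to the latest committed map.
    std::shared_ptr<OSDMapStub> build_next_map(const PendingInc&) {
      return std::make_shared<OSDMapStub>();
    }
    void clean_temps(const OSDMapStub&, PendingInc&)  { /* drop bad pg_temps */ }
    void clean_upmaps(const OSDMapStub&, PendingInc&) { /* drop bad upmaps */ }

    void encode_pending(PendingInc& inc) {
      // Build the next map once and reuse it for both cleanups, so bad
      // values are caught here, just before the map is published.
      auto next = build_next_map(inc);
      clean_temps(*next, inc);
      clean_upmaps(*next, inc);
      // ... encode inc ...
    }
  };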
Existing oversized pg_temp mappings (or some other bug) might make acting
exceed the pool size. Avoid overrunning our buffer if that happens.
Note that the mapping won't be completely accurate in that case!
Signed-off-by: Sage Weil <sage@redhat.com>
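One way to picture the defensive copy, with a simplified fixed-size row
standing in for the real OSDMapMapping layout:

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  // Simplified per-PG row, pre-sized to the pool's replica count.
  struct mapping_row {
    std::vector<int32_t> acting;
  };

  // Copy the acting set into the row without writing past its end, even
  // if a stale, oversized pg_temp made `acting` longer than the pool
  // size. The stored mapping is then incomplete, but nothing is overrun.
  void store_acting(mapping_row& row, const std::vector<int32_t>& acting)
  {
    size_t n = std::min(acting.size(), row.acting.size());
    std::copy(acting.begin(), acting.begin() + n, row.acting.begin());
  }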
If it does not, send a null map. The bug was introduced by 624efc6432,
which (correctly) made preprocess_beacon look only at the current fsmap;
prepare_beacon had relied on preprocess_beacon doing that check against
pending.
Running:
while sleep 0.5; do bin/ceph mds fail 0; done
is sufficient to reproduce this bug. You will see:
2018-09-07 15:33:30.350 7fffe36a8700 5 mon.a@0(leader).mds e69 preprocess_beacon mdsbeacon(24412/a up:reconnect seq 2 v69) v7 from mds.0 127.0.0.1:6813/2891525302 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
2018-09-07 15:33:30.350 7fffe36a8700 10 mon.a@0(leader).mds e69 preprocess_beacon: GID exists in map: 24412
2018-09-07 15:33:30.350 7fffe36a8700 5 mon.a@0(leader).mds e69 _note_beacon mdsbeacon(24412/a up:reconnect seq 2 v69) v7 noting time
2018-09-07 15:33:30.350 7fffe36a8700 7 mon.a@0(leader).mds e69 prepare_update mdsbeacon(24412/a up:reconnect seq 2 v69) v7
2018-09-07 15:33:30.350 7fffe36a8700 12 mon.a@0(leader).mds e69 prepare_beacon mdsbeacon(24412/a up:reconnect seq 2 v69) v7 from mds.0 127.0.0.1:6813/2891525302
2018-09-07 15:33:30.350 7fffe36a8700 15 mon.a@0(leader).mds e69 prepare_beacon got health from gid 24412 with 0 metrics.
2018-09-07 15:33:30.350 7fffe36a8700 5 mon.a@0(leader).mds e69 mds_beacon mdsbeacon(24412/a up:reconnect seq 2 v69) v7 is not in fsmap (state up:reconnect)
in the mon leader log. The last line indicates the problem was safely handled.
Fixes: http://tracker.ceph.com/issues/35848
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
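A trimmed-down illustration of the added check; FSMapStub and the helper
methods are stand-ins for the real MDSMonitor machinery:

  #include <cstdint>
  #include <set>

  struct FSMapStub {
    std::set<uint64_t> gids;
    bool gid_exists(uint64_t gid) const { return gids.count(gid) != 0; }
  };

  struct MDSMonitorStub {
    FSMapStub fsmap;          // last committed map (what preprocess_beacon checks)
    FSMapStub pending_fsmap;  // map being built this round

    void send_null_map(uint64_t /*gid*/) { /* tell the daemon to respawn */ }
    void apply_beacon(uint64_t /*gid*/)  { /* normal beacon handling */ }

    void prepare_beacon(uint64_t gid) {
      // preprocess_beacon only looked at the committed map; pending may
      // already have dropped this gid (e.g. after `ceph mds fail 0`), so
      // re-check before touching pending state.
      if (!pending_fsmap.gid_exists(gid)) {
        send_null_map(gid);
        return;
      }
      apply_beacon(gid);
    }
  };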
* refs/pull/20469/head:
osd/PG: remove warn on delete+merge race
osd: base project_pg_history on is_new_interval
osd: make project_pg_history handle concurrent osdmap publish
osd: handle pg delete vs merge race
osd/PG: do not purge strays in premerge state
doc/rados/operations/placement-groups: a few minor corrections
doc/man/8/ceph: drop enumeration of pg states
doc/dev/placement-groups: drop old 'splitting' reference
osd: wait for laggy pgs without osd_lock in handle_osd_map
osd: drain peering wq in start_boot, not _committed_maps
osd: kick split children
osd: no osd_lock for finish_splits
osd/osd_types: remove is_split assert
ceph-objectstore-tool: prevent import of pg that has since merged
qa/suites: test pg merging
qa/tasks/thrashosds: support merging pgs too
mon/OSDMonitor: mon_inject_pg_merge_bounce_probability
doc/rados/operations/placement-groups: update to describe pg_num reductions too
doc/rados/operations: remove reference to lpgs
osd: implement pg merge
osd/PG: implement merge_from
osdc/Objecter: resend ops on pg merge
osd: collect and record pg_num changes by pool
osd: make load_pgs remove message more accurate
osd/osd_types: pg_t: add is_merge_target()
osd/osd_types: pg_t::is_merge -> is_merge_source
osd/osd_types: adding or subtracting invalid stats -> invalid stats
osd/PG: clear_ready_to_merge on_shutdown (or final merge source prep)
osd: debug pending_creates_from_osd cleanup, don't use cbegin
ceph-objectstore-tool: debug intervals update
mgr/ClusterState: discard pg updates for pgs >= pg_num
mon/OSDMonitor: fix long line
mon/OSDMonitor: move pool created check into caller
mon/OSDMonitor: adjust pgp_num_target down along with pg_num_target as needed
mon/OSDMonitor: add mon_osd_max_initial_pgs to cap initial pool pgs
osd/OSDMap: set pg[p]_num_target in build_simple*() methods
mon/PGMap: adjust SMALLER_PGP_NUM warning to use *_target values
mon/OSDMonitor: set CREATING flag for force-create-pg
mon/OSDMonitor: start sending new-style pg_create2 messages
mon/OSDMonitor: set last_force_resend_prenautilus for pg_num_pending changes
osd: ignore pg creates when pool FLAG_CREATING is not set
mgr: do not adjust pg_num until FLAG_CREATING removed from pool
mon/OSDMonitor: add FLAG_CREATING on upgrade if pools still creating
mon/OSDMonitor: prevent FLAG_CREATING from getting set pre-nautilus
mon/OSDMonitor: disallow pg_num changes while CREATING flag is set
mon/OSDMonitor: set POOL_CREATING flag until initial pool pgs are created
osd/osd_types: add pg_pool_t FLAG_POOL_CREATING
osd/osd_types: introduce last_force_resend_prenautilus
osd/PGLog: merge_from helper
osd: no cache agent or snap trimming during premerge
osd: notify mon when pending PGs are ready to merge
mgr: add simple controller to adjust pg[p]_num_actual
mon/OSDMonitor: MOSDPGReadyToMerge to complete a pg_num change
mon/OSDMonitor: allow pg_num to adjusted up or down via pg[p]_num_target
osd/osd_types: make pg merge an interval boundary
osd/osd_types: add pg_t::is_merge() method
osd/osd_types: add pg_num_pending to pg_pool_t
osd: allow multiple threads to block on wait_min_pg_epoch
osd: restructure advance_pg() call mechanism
mon/PGMap: prune merged pgs
mon/PGMap: track pgs by state for each pool
osd/SnapMapper: allow split_bits to decrease (merge)
os/bluestore: fix osr_drain before merge
os/bluestore: allow reuse of osr from existing collection
os/filestore: (re)implement merge
os/filestore: add _merge_collections post-check
os: implement merge_collection
os/ObjectStore: add merge_collection operation to Transaction
This was there just to confirm that this path was exercised by the
rados suite (it is, several hits per rados run of 1/666).
Signed-off-by: Sage Weil <sage@redhat.com>
The class's osdmap may be updated while we are in our loop. Pass it in
explicitly instead.
Fixes: http://tracker.ceph.com/issues/26970
Signed-off-by: Sage Weil <sage@redhat.com>
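The fix amounts to the pattern below: take one snapshot and run the whole
loop against it (stand-in types, not the real signatures):

  #include <memory>

  struct OSDMapStub { unsigned epoch = 0; };
  using OSDMapRef = std::shared_ptr<const OSDMapStub>;

  // Before: the loop re-read a member osdmap that another thread could
  // republish mid-iteration. After: the caller passes one snapshot in,
  // so every iteration sees the same, consistent map.
  void project_pg_history(const OSDMapRef& osdmap, unsigned from_epoch)
  {
    for (unsigned e = from_epoch; e <= osdmap->epoch; ++e) {
      // ... evaluate interval changes against the same snapshot ...
    }
  }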
Deletion involves an awkward dance between the pg lock and shard locks,
while the merge prep and tracking is "shard down". If the delete has
finished its work we may find that a merge has since been prepped.
Unwinding the merge tracking is nontrivial, especially because it might
involve a second PG, possibly even a fabricated placeholder one. Instead,
if we delete and find that a merge is coming, undo our deletion and let
things play out in the future map epoch.
Signed-off-by: Sage Weil <sage@redhat.com>
The point of premerge is to ensure that the constituent parts of the
target PG are fully clean. If there is an intervening PG migration and
one of the halves finishes migrating before the other, one half could
get removed and the final merge could result in an incomplete PG. In the
worst case, the two halves (let's call them A and B) could have started
out together on say [0,1,2], A moves to [3,4,5] and gets deleted from
[0,1,2], and then the final merge happens such that *all* copies of the PG
are incomplete.
We could construct a clever check that does allow removal of strays when
the sibling PG is also ready to go, but it would be complicated. Do the
simple thing. In reality, this would be an extremely hard case to hit
because the premerge window is generally very short.
Signed-off-by: Sage Weil <sage@redhat.com>
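The simple rule, as a minimal sketch (stand-in type; here
pg_num_pending < pg_num is assumed to mark the premerge window):

  #include <cstdint>

  struct pool_stub {
    uint32_t pg_num = 0;
    uint32_t pg_num_pending = 0;
    bool is_premerge() const { return pg_num_pending < pg_num; }
  };

  // Never purge strays while the pool is in premerge, so neither half of
  // a pending merge can be removed out from under it.
  bool can_purge_strays(const pool_stub& pool, bool replicas_clean)
  {
    if (pool.is_premerge())
      return false;
    return replicas_clean;
  }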
We can't hold osd_lock while blocking because other objectstore completions
need to take osd_lock (e.g., _committed_osd_maps), and those objectstore
completions need to complete in order to finish_splits. Move the blocking
to the top before we establish any local state in this stack frame since
both the public and cluster dispatchers may race in handle_osd_map and
we are dropping and retaking osd_lock.
Signed-off-by: Sage Weil <sage@redhat.com>
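Very roughly, the shape of the change (illustrative names, a plain
std::mutex in place of the real lock):

  #include <mutex>

  struct OSDSketch {
    std::mutex osd_lock;

    void wait_min_pg_epoch(unsigned /*epoch*/) { /* may block a long time */ }

    // osd_lock is held on entry.
    void handle_osd_map(unsigned first_new_epoch) {
      // Block first, before building any local state in this frame, and
      // without osd_lock held: objectstore completions (e.g.
      // _committed_osd_maps) need osd_lock to make progress, and both
      // dispatchers may race in here while we wait.
      osd_lock.unlock();
      wait_min_pg_epoch(first_new_epoch - 1);
      osd_lock.lock();
      // ... re-check state (it may have changed) and apply the new maps ...
    }
  };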
We can't safely block in _committed_osd_maps because we are being run
by the store's finisher threads, and we may have to wait for a PG to split
and then merge via that same queue and deadlock.
Do not hold osd_lock while waiting as this can interfere with *other*
objectstore completions that take osd_lock.
Signed-off-by: Sage Weil <sage@redhat.com>
Ensure that we bring split children up to date to the latest map even in
the absence of new OSDMaps feeding in NullEvts. This is important when
the handle_osd_map (or boot) thread is blocked waiting for pgs to catch
up, but we also need a newly-split child to catch up (perhaps so that it
can merge).
Signed-off-by: Sage Weil <sage@redhat.com>
This probably used to protect the pg registration; there is no need for
it now.
More importantly, having it here can cause a deadlock when we are holding
osd_lock and blocking on wait_min_pg_epoch(), because a PG may need to
finish splitting to advance and then merge with a peer. (The wait won't
block on *this* PG since it isn't registered in the shard yet, but it
will block on the merge peer.)
Signed-off-by: Sage Weil <sage@redhat.com>
The problem is:
- the osd is at epoch 80
- we import pg 1.a as of e57
- 1.a and 1.1a merged in epoch 60-something
- we set up a merge now, but in should_restart_peering via advance_pg we
hit the is_split assert that ps < old_pg_num
We can meaningfully return false (this is not a split) for a pg that is
beyond pg_num.
Signed-off-by: Sage Weil <sage@redhat.com>
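Roughly, the change is an early return in the split test; simplified here,
since the real pg_t::is_split() also computes the set of child PGs:

  #include <cstdint>

  bool is_split(uint32_t ps, uint32_t old_pg_num, uint32_t new_pg_num)
  {
    // Previously assert(ps < old_pg_num). An imported PG whose pool has
    // since merged past it can legitimately have ps >= old_pg_num; it is
    // simply not a split source, so say so instead of crashing.
    if (ps >= old_pg_num)
      return false;
    if (new_pg_num <= old_pg_num)
      return false;
    // ... child enumeration elided ...
    return true;
  }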
We currently import a portion of the PG if it has split. Merge is more
complicated, though, mainly because COT is operating in a mode where it
fast-forwards the PG to the latest OSDMap epoch, which means it has to
implement any transformations to the PG (split/merge) independently.
Avoid doing this for merge.
Signed-off-by: Sage Weil <sage@redhat.com>
Optionally bounce pg_num back up right after we decrease it. This triggers
conditions in the OSD where the merge and split logic may conflict.
Signed-off-by: Sage Weil <sage@redhat.com>
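A sketch of the injection, assuming it keys off a just-lowered
pg_num_target (names and types are illustrative):

  #include <random>

  struct pool_target {
    unsigned pg_num_target;
    unsigned prev_pg_num_target;
  };

  // With probability mon_inject_pg_merge_bounce_probability, bounce
  // pg_num_target straight back up after a decrease, so the OSD's merge
  // and split paths get exercised against each other.
  void maybe_bounce_pg_num(pool_target& t, double probability, std::mt19937& rng)
  {
    if (t.pg_num_target >= t.prev_pg_num_target)
      return;  // only interesting right after a decrease
    std::bernoulli_distribution bounce(probability);
    if (bounce(rng))
      t.pg_num_target = t.prev_pg_num_target;
  }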
- Revamps the split tracking infrastructure and adds new tracking for
upcoming merges in consume_map. These are now unified into the same
identify_ method and consume the new pg_num change tracking
infrastructure we just added in the prior commit.
- PGs that are about to merge have a new wait infrastructure, since all
sources and the target have to reach the target epoch before the merge
can happen.
- If one of the sources for a merge does not exist, we create an empty
dummy PG to merge with. This implies that the resulting merged PG will
be incomplete (and mostly useless), but it unifies the code paths (see
the sketch below).
- The actual merge (PG::merge_from) happens in advance_pg().
Fixes: http://tracker.ceph.com/issues/85
Signed-off-by: Sage Weil <sage@redhat.com>
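A very rough sketch of the merge path described above, including the
empty-placeholder case; every name here is a stand-in for the real
OSD::advance_pg()/PG::merge_from() machinery:

  #include <map>
  #include <memory>
  #include <vector>

  struct PGStub {
    bool placeholder = false;
    unsigned epoch = 0;

    // Fold the sources into this PG; their logs are discarded.
    void merge_from(std::vector<std::shared_ptr<PGStub>>& sources) {
      sources.clear();
    }
  };

  // The target only merges once it and every source have reached the
  // merge epoch; a missing source is replaced by an empty placeholder,
  // which makes the result incomplete but keeps the code path uniform.
  bool try_merge(PGStub& target,
                 unsigned merge_epoch,
                 const std::vector<unsigned>& expected_sources,
                 std::map<unsigned, std::shared_ptr<PGStub>>& ready_sources)
  {
    if (target.epoch < merge_epoch)
      return false;  // target has not caught up yet
    std::vector<std::shared_ptr<PGStub>> srcs;
    for (unsigned id : expected_sources) {
      auto it = ready_sources.find(id);
      if (it != ready_sources.end() && it->second->epoch >= merge_epoch) {
        srcs.push_back(it->second);
      } else if (it == ready_sources.end()) {
        auto dummy = std::make_shared<PGStub>();
        dummy->placeholder = true;   // merged PG will be incomplete
        dummy->epoch = merge_epoch;
        srcs.push_back(dummy);
      } else {
        return false;                // a real source is still catching up
      }
    }
    target.merge_from(srcs);
    return true;
  }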
This is the building block that smooshes multiple PGs back into one. The
resulting combination PG will have no PG log. That means the sources
need to be clean and quiesced or else the result will end up being
marked incomplete.
Signed-off-by: Sage Weil <sage@redhat.com>