The event was previously not getting moved to the completed
list. There are a couple more cases too:
- When some pgs go away (a pool is removed) during the event
- When the OSD comes back in after going out
Signed-off-by: John Spray <john.spray@redhat.com>
It's not efficient to have python calling this
O(pg_num) times to find the pgs for an OSD, but
I'm just shooting for something functional for now.
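
For illustration only, the linear scan described above could look
like the sketch below. `pgs_for_osd` and `lookup_osds` are
hypothetical names, not the actual mgr interface; the point is that
one lookup is made per PG, so the cost is O(pg_num) per OSD.

```python
from typing import Callable, Iterable, List


def pgs_for_osd(
    pgids: Iterable[str],
    lookup_osds: Callable[[str], List[int]],  # hypothetical per-PG lookup
    osd_id: int,
) -> List[str]:
    """Return the PGs that map to osd_id, using one lookup call per PG."""
    return [pgid for pgid in pgids if osd_id in lookup_osds(pgid)]
```

A faster variant would build an OSD-to-PGs index once and reuse it,
which is the kind of optimisation deferred here.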
Signed-off-by: John Spray <john.spray@redhat.com>
Some kernels (4.9+) sometimes fail to return data when reading
from a block device under memory pressure. This patch retries
the read if the checksum verification fails. Tests show that
the first retried read succeeds in ~99.5% of cases, so
3 attempts are made by default before giving up on the data.
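
A minimal sketch of the retry idea, not the actual patch:
`read_verified`, `read_block` and `max_attempts` are hypothetical
names, CRC32 stands in for whatever checksum is configured, and the
limit is assumed to count total reads (the patch may count retries
instead).

```python
import zlib
from typing import Callable


def read_verified(
    read_block: Callable[[], bytes],  # hypothetical: performs one device read
    expected_crc: int,
    max_attempts: int = 3,            # assumption: total reads, not retries
) -> bytes:
    """Re-read the block until its checksum matches or attempts run out."""
    for _ in range(max_attempts):
        data = read_block()
        if zlib.crc32(data) == expected_crc:
            return data
    raise IOError(f"checksum mismatch after {max_attempts} attempts")
```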
Works-around: http://tracker.ceph.com/issues/22464
Signed-off-by: Paul Emmerich <paul.emmerich@croit.io>
As of nautilus, this will be more than two versions old:
external tooling should have been updated by now.
Signed-off-by: John Spray <john.spray@redhat.com>
Reworks the bluestore validation and reporting to account for reusable
VGs from fast devices, and adds validation calls to ensure that the new
calculation works.
Signed-off-by: Alfredo Deza <adeza@redhat.com>
If the object is known to exist in the image, the copy-up operation
can be skipped for that object.
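
As a rough sketch of the check, assuming the set of objects known to
exist is available up front (`objects_needing_copyup` and
`known_existing` are hypothetical names, not librbd identifiers):

```python
from typing import Iterable, List, Set


def objects_needing_copyup(
    object_nos: Iterable[int],
    known_existing: Set[int],  # hypothetical: objects known to exist in the image
) -> List[int]:
    """Keep only the objects for which copy-up cannot be skipped."""
    return [no for no in object_nos if no not in known_existing]
```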
Fixes: http://tracker.ceph.com/issues/23445
Signed-off-by: Mykola Golub <mgolub@suse.com>
* refs/pull/23845/head:
osd/OSDMap: include age in up and in counts for ceph status
mon/OSDMonitor: set new_last_{up,in}_change
osd/OSDMap: store last_up_change and last_in_change
mgr/MgrMap: include mgr age in map printer
mon/MgrMap: track active_changed timestamp
mon: include mon quorum age in status
include/utime: add utimespan_str helper
Reviewed-by: John Spray <john.spray@redhat.com>
* refs/pull/23975/head:
common/buffer.cc: add create_small_page_aligned to avoid wasting memory when allocating small buffers on an OS with a large page size (e.g. 64k)
Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Instead of creating an additional pool on every migration test
that needs it, create a pool on the test case setup and reuse.
Signed-off-by: Mykola Golub <mgolub@suse.com>
The old connections are gone, so we should not leave behind
a long list of obsolete ping_history entries, which is
misleading.
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
OSDs may be unaware that they have been marked down and remain
trapped at an obsolete map in which they are still marked as up:
```
host     osd  down_at  stuck_at
ceph-03  9    e712     e711
ceph-03  13   e700     e699
ceph-03  28   e697     e696
ceph-03  48   e697     e696
ceph-03  52   e707     e704
ceph-03  61   e710     e708
ceph-03  73   e712     e710
ceph-03  77   e708     e707
ceph-05  12   e711     e710
ceph-05  21   e703     e702
ceph-05  24   e700     e699
ceph-05  29   e703     e699
ceph-05  41   e711     e710
ceph-05  53   e711     e710
ceph-05  72   e712     e711
```
Since https://github.com/ceph/ceph/pull/23958, an OSD periodically
pings the monitor if it is stuck in __wait_for_healthy__. In the
case above, however, the OSDs still consider themselves __active__
and hence miss that fix.
Since these OSDs might still be able to contact the monitors
(otherwise there would be no way for them to be marked up again) and
send beacons continuously, we can simply get them out of the trap by
sharing some new maps with them.
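
The decision can be sketched roughly as follows. This is an
illustration under assumptions, not the monitor code: `Beacon`,
`epochs_to_share` and `up_osds` are hypothetical names, and the
beacon is assumed to carry the newest map epoch the OSD has.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class Beacon:
    osd_id: int
    epoch: int  # assumption: newest OSDMap epoch the sender has


def epochs_to_share(beacon: Beacon, current_epoch: int, up_osds: Set[int]) -> range:
    """Epochs to send to a beaconing OSD that is marked down but lags behind."""
    if beacon.osd_id in up_osds or beacon.epoch >= current_epoch:
        return range(0)  # nothing to do: OSD is up, or already has the latest map
    return range(beacon.epoch + 1, current_epoch + 1)
```

For osd 9 in the table above (stuck at e711, marked down at e712),
this would yield epoch 712 once the current epoch is 712.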
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: runsisi <runsisi@zte.com.cn>