The $() form is preferable to `` because folks (like me) might be using
` as a keyboard shortcut to GNU Screen, causing havoc to ensue whenever
copy-pasting the ` character.
Signed-off-by: Nathan Cutler <ncutler@suse.com>
mgr/dashboard: Set timeout in RestClient calls
Reviewed-by: Lenz Grimmer <lgrimmer@suse.com>
Reviewed-by: Sebastian Wagner <swagner@suse.com>
Reviewed-by: Tatjana Dehler <tdehler@suse.com>
* refs/pull/23223/head:
osd/PG: kill dead functions and related options
osd/osd_type: kill unused input ec_pool for iterate_mayberw_back_to
common: kill dead options
osd/PG: do not initialize up/acting twice
osd/PG: clear missing_loc properly if last location is gone
Reviewed-by: Sage Weil <sage@redhat.com>
* refs/pull/22692/head:
doc/mgr/devicehealth: document devicehealth module
doc/rados/operations/health-checks: document DEVICE_HEALTH* messages
mgr/devicehealth: fix style for returns
mgr/devicehealth: use constants for health warnings
mgr/devicehealth: deal with as many daemons as we can until limit
mgr/devicehealth: warn if too many daemons are expected to fail soon
mgr/devicehealth: set primary-affinity 0 for failing devices
mgr/devicehealth: fix config options
mgr/devicehealth: only fetch osdmap once from check_health
mgr/devicehealth: revise health messages
mgr/devicehealth: add 'device check-health' command and run periodically
mgr/devicehealth: fix new options
mgr/devicehealth: add helpers to life_expectancy_response()
mgr/devicehealth: simplify setting defaults
common/blkdev: remove debug statements
Reviewed-by: John Spray <john.spray@redhat.com>
- if mark_out_threshold is met we write to log.warn instead of raising a
health warning.
- check that OSD is 'in' before calling mark_out().
- raise a health warning in case OSD is marked 'out' but still has PGs
attached to it.
- cast the thresholds' default values to string.
- add SCSI multipath support to health warning message.
- change health warning message.
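A minimal sketch of the decision flow in the list above, in Python. Every
name here (Osd, evaluate_osd, life_expectancy, the message strings) is
hypothetical and is not the devicehealth module's API; it also assumes that
"threshold is met" means the predicted life expectancy is below
mark_out_threshold.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Osd:
    """Hypothetical stand-in for the per-OSD state the module inspects."""
    osd_id: int
    is_in: bool             # True if the OSD is currently marked 'in'
    num_pgs: int            # PGs still mapped to this OSD
    life_expectancy: float  # assumed unit: seconds until predicted failure


def evaluate_osd(osd: Osd, mark_out_threshold: float) -> Tuple[List[str], List[str], bool]:
    """Return (log_warnings, health_warnings, should_mark_out)."""
    log_warnings: List[str] = []
    health_warnings: List[str] = []
    should_mark_out = False

    if osd.life_expectancy < mark_out_threshold:
        if osd.is_in:
            # Threshold met: log a warning rather than raise a health warning,
            # and only request mark_out() for an OSD that is still 'in'.
            log_warnings.append(f"osd.{osd.osd_id} is expected to fail soon; marking out")
            should_mark_out = True
        elif osd.num_pgs > 0:
            # Already 'out' but data has not drained yet: health warning.
            health_warnings.append(
                f"osd.{osd.osd_id} is marked out but still has {osd.num_pgs} PGs"
            )

    return log_warnings, health_warnings, should_mark_out
```

The function only reports what should happen; the caller decides how to log
and how to mark the OSD out.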
Signed-off-by: Yaarit Hatuka <yaarithatuka@gmail.com>
* refs/pull/23334/head:
pybind/rados/rados: do not pass prval from stack
Reviewed-by: John Spray <john.spray@redhat.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
This change is motivated by the failure tracked in
https://tracker.ceph.com/issues/25198. The failure highlights a case where a
call to trim_log() after the PG has recovered races with the previous op
on a replica OSD. Since the previous operation has not completed, the
last_complete value for that OSD is not valid when we try to trim the
log. It is also worth noting that the race is due to MOSDPGTrim going through
the strict queue as a peering message, whereas regular ops go through the
non-strict queue.
During the investigation of this bug, we noticed that, with
https://tracker.ceph.com/issues/23979, we allow pg log trimming to
happen on the primary and replicas whenever we cross the upper bound of
the pg log. This also ensures that pg log trimming happens while processing
any new op.
Therefore, the function trim_log(), which earlier served the purpose of
trimming logs on the primary and replicas just before the PG went into
the Recovered state, is no longer required. It acted as a last line of
defense to trim logs when we did not need them any more, but this call
seems redundant now because we limit the pg log length at all times.
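To illustrate why a final trim becomes redundant once the length is bounded
on every op, here is a minimal Python sketch. BoundedLog and max_entries are
hypothetical names, not Ceph code, and the sketch ignores any constraints on
which entries are safe to discard.

```python
from collections import deque


class BoundedLog:
    """Toy log that is trimmed every time an entry is appended."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self.entries = deque()

    def append(self, entry) -> int:
        """Append an entry, trim the oldest ones, return how many were trimmed."""
        self.entries.append(entry)
        trimmed = 0
        # Trimming happens while processing each new op, so the log never
        # exceeds its upper bound and no separate trim pass is needed later.
        while len(self.entries) > self.max_entries:
            self.entries.popleft()
            trimmed += 1
        return trimmed
```

Because the bound holds after every append, there is nothing left for a
separate trim at recovery time to do.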
Signed-off-by: Neha Ojha <nojha@redhat.com>
"npm ci" is the recommended command to install dependencies
in a continuous integration system.
It will make sure node_modules is empty and that the versions in
"package-lock.json" match the ones in "package.json".
Signed-off-by: Tiago Melo <tmelo@suse.com>
Because we had a min_max setting with CRIT as the maximum,
it wasn't possible to actually turn off stats entirely.
Fixes: http://tracker.ceph.com/issues/25197
Signed-off-by: John Spray <john.spray@redhat.com>
This will make sure that, at any time, when someone runs 'npm install'
the resulting packages that are installed are always the same.
Signed-off-by: Tiago Melo <tmelo@suse.com>
The prval is a pointer to an int to write the final completion code of
the rados op. This can't be on the stack since we immediately leave the
current scope after preparing the op (looong before we do the rados op).
We keep the tuple return value to avoid breaking users of this API
(devicehealth module, gnocchi at a minimum).
Fixes: http://tracker.ceph.com/issues/25175
Signed-off-by: Sage Weil <sage@redhat.com>
This test is to prove that the issue from
http://tracker.ceph.com/issues/24957 was fixed
by http://tracker.ceph.com/issues/24784
When running lvm list against a raw device it should handle
gracefully the situation where there are multiple PVs with the
name of the given device.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
Delay declaring a peer healthy until we have received the first
replies from both the front and back connections.
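A minimal Python sketch of the rule described above. HeartbeatPeer and its
method names are hypothetical; only last_rx_front and last_rx_back mirror
field names used elsewhere in this log.

```python
from typing import Optional


class HeartbeatPeer:
    """Records the most recent reply time seen on each connection."""

    def __init__(self) -> None:
        self.last_rx_front: Optional[float] = None
        self.last_rx_back: Optional[float] = None

    def on_reply(self, side: str, now: float) -> None:
        if side == "front":
            self.last_rx_front = now
        elif side == "back":
            self.last_rx_back = now
        else:
            raise ValueError(f"unknown connection side: {side!r}")

    def is_healthy(self) -> bool:
        # Not healthy until both connections have answered at least once.
        return self.last_rx_front is not None and self.last_rx_back is not None
```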
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
If we never hear any replies from a heartbeat peer, use first_tx
to calculate failed_for, which is more accurate.
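As a rough illustration in Python: the function below is a hypothetical
helper, and the fallback to the last reply time when a reply has been heard
is an assumption, since the text above only describes the never-replied case.

```python
from typing import Optional


def failed_for(now: float, first_tx: float, last_rx: Optional[float]) -> float:
    """Return how long a peer has been failing, in the same unit as the inputs."""
    if last_rx is None:
        # Never heard a reply: measure from the first ping we sent.
        return now - first_tx
    # Assumption: otherwise measure from the last reply we received.
    return now - last_rx
```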
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
The original logic reuses the timestamp at which we sent pings to
a specific heartbeat peer to update the last_rx_front[back] field
on receiving the corresponding replies. That value is later treated
as the exact time at which we received those replies, and
is used to calculate the heartbeat latency and to determine whether the
relevant peer is dead.
However, this is not accurate enough, as there may be a delay between
receiving a reply and calling heartbeat_check(). We can eliminate
the delay by introducing a map to track the ping history,
each entry of which consists of three elements:
1. "tx_time", which serves as the map key, is the exact timestamp
at which we sent the pings.
2. "deadline" is the time by which we must have received all replies;
otherwise we consider this peer "dead".
3. "unacknowledged" is the number of pings sent at the corresponding
tx_time that are still unacknowledged. The initial value is 2 (as we send
two pings, one from the front and one from the back side, for each peer).
We insert an item into the map every time we send out a ping, and
decrease the "unacknowledged" counter by 1 each time we get a reply to
the tracked ping. If "unacknowledged" drops to 0, we know all the replies
have been successfully collected and we can safely erase the relevant
item from the map, as well as any earlier ones.
By comparing the current timestamp with the oldest deadline, we can now
make a much more accurate decision about whether the corresponding peer is
healthy. And by setting last_rx_* to the timestamp at which we received
the reply, the lower bound on when we stopped hearing replies from the
corresponding connection is also much clearer.
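The bookkeeping described above can be sketched in Python as follows. This is
an illustration under stated assumptions, not the OSD implementation: the
PingHistory class, its method names, the grace parameter used to derive the
deadline, and treating an empty history as healthy are all hypothetical.

```python
from typing import Dict, List, Optional


class PingHistory:
    """Per-peer ping history keyed by tx_time."""

    def __init__(self, grace: float) -> None:
        self.grace = grace  # assumed: deadline = tx_time + grace
        # tx_time -> [deadline, unacknowledged]
        self.history: Dict[float, List[float]] = {}
        self.last_rx: Dict[str, Optional[float]] = {"front": None, "back": None}

    def on_ping_sent(self, tx_time: float) -> None:
        # Two pings go out per round (front and back), so two acks are pending.
        self.history[tx_time] = [tx_time + self.grace, 2]

    def on_reply(self, side: str, tx_time: float, now: float) -> None:
        # Record when the reply actually arrived, not when the ping was sent.
        self.last_rx[side] = now
        entry = self.history.get(tx_time)
        if entry is None:
            return  # reply to a ping we no longer track
        entry[1] -= 1
        if entry[1] == 0:
            # Fully acknowledged: erase this entry and every earlier one.
            for t in [t for t in self.history if t <= tx_time]:
                del self.history[t]

    def is_healthy(self, now: float) -> bool:
        if not self.history:
            return True  # assumption: nothing outstanding means healthy
        oldest_deadline = self.history[min(self.history)][0]
        return now < oldest_deadline
```

Keying the map by tx_time lets a reply be matched to its ping directly, and
erasing earlier entries on a full acknowledgement keeps the oldest deadline
tied to the oldest ping that is still outstanding.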
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>