938 Commits

Author SHA1 Message Date
SrinivasaBharathKanta
1017b1d230
Merge pull request #59743 from sseshasa/wip-fix-mclock-low-iops-capacity-threshold
common,osd: Use last valid OSD IOPS value if measured IOPS is unrealistic
2024-11-06 15:46:54 +05:30
Kamoltat (Junior) Sirivadhna
28e38e30bb
Merge pull request #59483 from kamoltat/wip-ksirivad-exit-stretch-mode
mon [stretch mode]: support disable_stretch_mode
Reviewed-by: Nitzan Mordechai <nmordech@redhat.com>
2024-11-05 13:07:06 -05:00
Samuel Just
048ce81f45
Merge pull request #56677 from athanatos/sjust/for-review/wip-replica-read
osd,crimson/osd: rework of replica read and related state

Reviewed-by: Matan Breizman <mbreizma@redhat.com>
2024-11-04 09:49:09 -08:00
Ernesto Puerta
8ccb634804
mgr/zabbix: remove deprecated module
This (already deprecated) module is removed as a side-effect of the
deprecation and removal of the `restful` module.

Fixes: https://tracker.ceph.com/issues/47066
Signed-off-by: Ernesto Puerta <epuertat@redhat.com>
2024-10-28 14:17:19 +01:00
Ernesto Puerta
96ec7badb8
mgr/restful: remove deprecated module
Detailed changes:
* Remove `restful` mgr module dir,
* Remove Python dependencies (`pecan`, `werkzeug`) from ceph.spec and
  debian control,
* Remove docs,
* Remove associated QA tests,
* Update vstart.

Fixes: https://tracker.ceph.com/issues/47066
Signed-off-by: Ernesto Puerta <epuertat@redhat.com>
2024-10-28 14:17:18 +01:00
Samuel Just
dda683b20c suites/rados/thrash-erasure-code/.../ec-small-objects-balanced.yaml: remove
We don't support balanced reads on ec pools.  Additionally, the yaml
actually specifies 'balanced_reads' rather than 'balance_reads' and
therefore has no actual effect.

Signed-off-by: Samuel Just <sjust@redhat.com>
2024-10-21 17:04:51 +00:00
Sridhar Seshasayee
da4b85c55a common,osd: Use last valid OSD IOPS value if measured IOPS is unrealistic
The OSD's IOPS capacity is used by the mClock scheduler to determine the
quantum of bandwidth allocation for the various operations on the OSD.
Prior to this commit, maybe_override_max_osd_capacity_for_qos() only
checked if the measured IOPS capacity exceeded the higher threshold defined
by 'osd_mclock_iops_capacity_threshold_[hdd|ssd]' and, if so, fell back to the
last valid or the default IOPS capacity as defined by
osd_mclock_max_capacity_iops_[hdd|ssd].

It's quite possible that the reported IOPS is unrealistically low. This
could be due to transient factors on the underlying device or it could
indicate bad health of the device. Either way, the safer option would be
to fall back to the last valid or the default IOPS setting for that OSD in
order to avoid cluster performance (slow or stalled ops) issues down the
line.

Therefore, to handle this case, the commit introduces additional config
options viz.,
 - osd_mclock_iops_capacity_low_threshold_hdd - set to 50 IOPS and
 - osd_mclock_iops_capacity_low_threshold_ssd - set to 1000 IOPS

If the measured IOPS capacity doesn't fall within the low and high
threshold range, the default or the last valid IOPS capacity is used.
The existing cluster log warning is suitably modified to convey the
reason.

Additionally, for a couple of valgrind related teuthology tests, the
cluster warning is added to the ignorelist since the reported IOPS can
be very low due to slowness.

Fixes: https://tracker.ceph.com/issues/67421
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
2024-10-17 16:38:20 +05:30
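The range check described in commit da4b85c55a can be illustrated with a minimal Python sketch. This is not the OSD code: the function name, the inclusive bounds, and the example high threshold and fallback values are assumptions made for illustration. Only the low thresholds (50 IOPS for HDD, 1000 IOPS for SSD) come from the commit message.

```python
def effective_iops_capacity(measured, low, high, fallback):
    """Return the measured IOPS if it lies within [low, high].

    Otherwise return the fallback, which stands in for the last valid
    or default IOPS capacity. Inclusive bounds are an assumption.
    """
    if low <= measured <= high:
        return measured
    return fallback


# Low threshold for HDD taken from the commit message.
HDD_LOW = 50
# Hypothetical values, chosen only to make the example concrete.
HDD_HIGH = 500
HDD_FALLBACK = 315

too_low = effective_iops_capacity(20, HDD_LOW, HDD_HIGH, HDD_FALLBACK)
in_range = effective_iops_capacity(200, HDD_LOW, HDD_HIGH, HDD_FALLBACK)
too_high = effective_iops_capacity(10000, HDD_LOW, HDD_HIGH, HDD_FALLBACK)
```

Under these assumptions, both an unrealistically low and an unrealistically high measurement resolve to the fallback, while an in-range measurement is used as is.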
Kamoltat (Junior) Sirivadhna
abfff2b714
Merge pull request #57146 from kamoltat/wip-ksirivad-fix-connection-score-json
src/mon/ConnectionTracker.cc: Fix dump function
Reviewed-by: Kamoltat Sirivadhna <ksirivad@redhat.com>
2024-09-26 10:15:04 -04:00
Kamoltat Sirivadhna
4d2f8879be qa: Added tests for disabling stretch mode
Test disabling stretch mode in the following scenarios:

1. Healthy Stretch Mode
2. Degraded Stretch Mode

Fixes: https://tracker.ceph.com/issues/67467

Signed-off-by: Kamoltat Sirivadhna <ksirivad@redhat.com>
2024-09-22 17:12:07 +00:00
Adam Kupczyk
a787a91719
Merge pull request #54504 from aclamk/wip-aclamk-bs-refactor-write-path
os/bluestore: Recompression, part 2. New write path.
2024-08-13 15:15:50 +02:00
Laura Flores
bd1082daaa
Merge pull request #58736 from amathuria/wip-66922-amat
qa/rados/dashboard: Add PG_DEGRADED to ignorelist
2024-08-08 15:41:18 -05:00
Patrick Donnelly
cfed7c0baa
Merge PR #59029 into main
* refs/pull/59029/head:
	qa: simplify postmerge construction

Reviewed-by: Samuel Just <sjust@redhat.com>
Reviewed-by: Brad Hubbard <bhubbard@redhat.com>
2024-08-07 20:58:17 -04:00
Kamoltat (Junior) Sirivadhna
6a0d503a59
Merge pull request #56233 from kamoltat/wip-ksirivad-fix-64802
RADOS: Generalize stretch mode pg temp handling to be usable without stretch mode
Samuel Just <sjust@redhat.com>
2024-08-07 09:45:54 -04:00
Adam Kupczyk
8bd233bef5 qa/bluestore: Add write_v1/v2 selection
Add framework for various random options for debug bluestore.
Use framework to select:
- write_v1
- write_v2
- write_v1 / write_v2 selected at random

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
2024-08-07 10:55:46 +00:00
Adam Kupczyk
e88ab6547e
Merge pull request #58664 from aclamk/wip-aclamk-qa-less-bluestore-debug
qa/suites/rados: Reduced BlueStore log levels
2024-08-06 12:53:02 +02:00
Patrick Donnelly
382357dcd4
qa: simplify postmerge construction
and avoid errors when "clusternodes" is not defined.

Fixes: https://tracker.ceph.com/issues/67352
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
2024-08-05 21:07:24 -04:00
Yuri Weinstein
dd87d573cf
Merge pull request #58635 from badone/wip-tracker-50371-rados_api_test-timeout-failures
qa: Restrict rados api tests to large clusters and increase timeout

Reviewed-by: Laura Flores <lflores@redhat.com>
2024-08-01 06:44:34 -07:00
Yuri Weinstein
a2c60161de
Merge pull request #57863 from NitzanMordhai/wip-nitzan-thrash-erasure-code-crush-4-nodes-8-6-overrides
suites/ec-rados-plugin=jerasure-k=8-m=6-crush: roles set with overrides

Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Reviewed-by: Matan Breizman <Matan.Brz@gmail.com>
2024-08-01 06:42:09 -07:00
Yuri Weinstein
3956c4278a
Merge pull request #58205 from NitzanMordhai/wip-nitzan-rados-dashboard-test-update-ignorelist
suites: test should ignore osd_down warnings

Reviewed-by: Ronen Friedman <rfriedma@redhat.com>
2024-07-26 10:24:44 -07:00
Yuri Weinstein
4adc795c49
Merge pull request #58215 from badone/wip-tracker-59380-admin-socket-injectfull
qa/suites/rados: Cancel injectfull to allow cleanup

Reviewed-by: Neha Ojha <nojha@redhat.com>
2024-07-23 10:57:08 -07:00
Yuri Weinstein
1fa959e982
Merge pull request #57485 from sseshasa/wip-fix-validator-osd-down-grace-tmout
qa/suites/rados/verify/validater: increase heartbeat grace timeout

Reviewed-by: Samuel Just <sjust@redhat.com>
Reviewed-by: Laura Flores <lflores@redhat.com>
2024-07-23 10:50:32 -07:00
Laura Flores
39a09a3590
Merge pull request #58275 from NitzanMordhai/wip-nitzn-host-thraser-fix-min-in-checks
suites: host thrasher should check min_in before thrashing host
2024-07-22 13:22:30 -05:00
Aishwarya Mathuria
4a4f9a3e99 qa/rados/dashboard: Add PG_DEGRADED to ignorelist
Eventually, the PG_DEGRADED warning goes away and the cluster goes
back to a healthy state before the end of the test.

Fixes: https://tracker.ceph.com/issues/66922
Signed-off-by: Aishwarya Mathuria <amathuri@redhat.com>
2024-07-22 22:22:59 +05:30
Adam Kupczyk
8ee137f662 qa/suites/rados: Reduced BlueStore log levels
Having debug 20 is impractical. It slows down execution and takes disk space,
but gives little help in eventual debugging.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
2024-07-22 15:34:02 +02:00
Adam Kupczyk
337c8bf901
Merge pull request #57002 from aclamk/wip-aclamk-bs-storetest-expand-synthetic
Improved structure for objectstore unit tests.
2024-07-22 13:48:06 +02:00
Brad Hubbard
d034fec463 qa: Restrict rados api tests to large clusters and increase timeout
Running these tests with thrashers on small clusters leads to many very
slow ops due to the cluster being overloaded. That has a tendency to
make some of the API tests timeout and fail.

Fixes: https://tracker.ceph.com/issues/50371

Signed-off-by: Brad Hubbard <bhubbard@redhat.com>
2024-07-18 09:09:22 +10:00
Kamoltat
ed7f4e8829 qa: Added mon connection score tests
When we deploy 3 MONs, check that the connection scores are clean
with a 60-second grace period.

Fixes: https://tracker.ceph.com/issues/65695

Signed-off-by: Kamoltat <ksirivad@redhat.com>
2024-07-17 22:26:55 +00:00
Kamoltat
7b41aff3f0 qa/suites/rados: 3-az-stretch-cluster-netsplit test
Test the case where 2 DCs lose connection with each other
in a 3 AZ stretch cluster with stretch pool enabled.
Check that the cluster is accessible and PGs are active+clean
after they reconnect.

Signed-off-by: Kamoltat <ksirivad@redhat.com>
2024-07-17 22:16:01 +00:00
Kamoltat
4ca1320727 qa/suites/rados/singleton/all: init mon-stretch-pool.yaml
Test the following new Ceph CLI commands:

`ceph osd pool stretch set`
`ceph osd pool stretch unset`
`ceph osd pool stretch show`

`qa/workunits/mon/mon-stretch-pool.sh`

creates the stretch cluster
while performing input validation for the CLI
commands mentioned above.

`qa/tasks/stretch_cluster.py`

is in charge of
setting a pool as stretch
and checking whether that prevents PGs
from going active when there are not
enough buckets available in the acting
set of PGs.

It also tests different MON failover scenarios
after setting the pool as stretch.

`qa/suites/rados/singleton/all/mon-stretch-pool.yaml`

brings the scripts together.

Fixes: https://tracker.ceph.com/issues/64802

Signed-off-by: Kamoltat <ksirivad@redhat.com>
2024-07-17 22:12:04 +00:00
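The bucket condition that commit 4ca1320727 tests can be pictured with a small Python sketch. It is a simplified model, not Ceph's peering logic: the function name, the `osd_bucket` mapping, and the `min_buckets` parameter are hypothetical names introduced here for illustration.

```python
def enough_buckets(acting, osd_bucket, min_buckets):
    """Return True if the acting set spans at least min_buckets buckets.

    acting      -- list of OSD ids in the PG's acting set
    osd_bucket  -- dict mapping OSD id to its CRUSH bucket name (hypothetical)
    min_buckets -- minimum number of distinct buckets required (hypothetical)
    """
    buckets = {osd_bucket[osd] for osd in acting}
    return len(buckets) >= min_buckets


# Hypothetical layout: six OSDs spread over three datacenter buckets.
osd_bucket = {0: "dc1", 1: "dc1", 2: "dc2", 3: "dc2", 4: "dc3", 5: "dc3"}

spans_two = enough_buckets([0, 2, 3], osd_bucket, min_buckets=2)
spans_one = enough_buckets([0, 1], osd_bucket, min_buckets=2)
```

In this model, a PG whose acting set sits entirely in one bucket would not be allowed to go active when two buckets are required.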
Nitzan Mordechai
e5cd5469b2 suites/ec-rados-plugin=jerasure-k=8-m=6-crush: roles set with overrides
Setting roles without overrides caused "too many values to unpack (expected 1)".

Fixes: https://tracker.ceph.com/issues/66209
Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
2024-06-27 06:29:02 +00:00
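The error quoted in commit e5cd5469b2 is Python's standard unpacking failure. The sketch below only reproduces that class of error in isolation. The helper name and the role strings are hypothetical, and it is not the QA code that raised the original exception.

```python
def single_role(roles):
    """Unpack a sequence that is expected to hold exactly one role."""
    (role,) = roles  # raises ValueError if roles has more than one item
    return role


ok = single_role(["mon.a"])

try:
    single_role(["mon.a", "osd.0"])
    message = None
except ValueError as exc:
    # The message starts with "too many values to unpack"; the exact
    # suffix can vary between Python versions.
    message = str(exc)
```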
Nitzan Mordechai
89d695fb8b suites: check for host thrasher
The last PR modified the suites to only check for host thrasher.
This update fixes that issue by implementing different settings
with dedicated YAML files for host thrashing

Fixes: https://tracker.ceph.com/issues/66657
Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
2024-06-26 12:16:48 +00:00
Brad Hubbard
4c5d0e30d2 qa/suites/rados: Cancel injectfull to allow cleanup
IO is frozen when the injectfull command is sent as part of the test,
which causes the cleanup to hang, so we need to clear it.

Fixes: https://tracker.ceph.com/issues/59380
Signed-off-by: Brad Hubbard <bhubbard@redhat.com>
2024-06-26 10:03:43 +10:00
Yuri Weinstein
359d20f326
Merge pull request #58141 from ljflores/wip-tracker-65852
qa/suites/rados/thrash/workloads: remove cache tiering workload

Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Reviewed-by: Samuel Just <sjust@redhat.com>
2024-06-25 06:47:14 -07:00
Nitzan Mordechai
2c65f1da96 suites: test should ignore osd_down warnings
Fixes: https://tracker.ceph.com/issues/64870
Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
2024-06-23 08:49:45 +00:00
Adam King
98c986f1f5
Merge pull request #57412 from adk3798/stray-laundry2
qa/cephadm: fix ignorelist of CEPHADM_STRAY_DAEMON for rados_api_tests

Reviewed-by: Laura Flores <lflores@ibm.com>
2024-06-20 12:06:18 -04:00
Laura Flores
35505a7f1f qa/suites/rados/thrash/workloads: remove cache tiering workload
Fixes: https://tracker.ceph.com/issues/65852
Signed-off-by: Laura Flores <lflores@ibm.com>
2024-06-19 12:53:44 -05:00
Laura Flores
820e4004f3 qa/suites/rados/thrash-old-clients: update supported releases and distro
thrash-old-clients tests should only support N-3 releases. To fix this for
main, I have removed all releases < quincy and have added squid.

Also, we are fully switching to centos.9_stream packages/containers after
the centos.8_stream end of life, so I changed the distro from centos.8_stream
to centos.9_stream.

*** Note: If this commit is backported, it should be done in such a way that
only releases >= quincy reference centos.9_stream. For instance, if backporting to squid,
a reef/squid thrash test is okay to make references to centos.9_stream since both reef and
squid support this, but a pacific/squid test will have to take a different approach
since pacific does not support centos.9_stream.

Fixes: https://tracker.ceph.com/issues/66398
Signed-off-by: Laura Flores <lflores@ibm.com>
2024-06-10 17:34:27 -05:00
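The N-3 rule from commit 820e4004f3 amounts to keeping the three most recent releases. A minimal Python sketch, assuming a hand-written release list ordered oldest to newest (the function name and the list are illustrative, not part of the QA suite):

```python
def supported_old_clients(releases, window=3):
    """Return the newest `window` releases from an oldest-to-newest list."""
    return releases[-window:]


# Assumed ordering, oldest first, ending at the newest release before main.
releases = ["octopus", "pacific", "quincy", "reef", "squid"]
supported = supported_old_clients(releases)
```

With this list, everything older than quincy drops out and squid is included, which matches the change described in the commit message.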
nmordech@redhat.com
3f26a965f6 suites: adding dencoder test multi versions
We currently run regular ceph-dencoder tests for backward compatibility,
but we omit tests for forward compatibility.
This suite introduces tests against the ceph-objects-corpus to catch forward
compatibility issues that may arise.
The script installs the N-2 version and runs it against the latest corpus objects
that we have, then installs versions N-1 through N and checks them as well.

Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
2024-05-16 05:16:17 +00:00
Sridhar Seshasayee
aae02b6af4 qa/suites/rados/verify/validater: increase heartbeat grace timeout
OSD_DOWN cluster log warning is raised on rare occasions due to
the osd_heartbeat_grace timeout getting exceeded. The warning is
soon cleared. Given the nature of the test (valgrind), the
grace timeout is increased to 160 secs to avoid generating the
warning.

Fixes: https://tracker.ceph.com/issues/65768
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
2024-05-16 10:02:43 +05:30
Adam King
0a67436e36 qa/cephadm: fix ignorelist of CEPHADM_STRAY_DAEMON for rados_api_tests
Not every log with this error has the parentheses, so
these warnings were still causing the test to fail

[ERR] [WRN] CEPHADM_STRAY_DAEMON: 2 stray daemon(s)... in cluster log

Signed-off-by: Adam King <adking@redhat.com>
2024-05-10 16:21:45 -04:00
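The parentheses problem described in commit 0a67436e36 can be shown with a short Python sketch. It assumes ignorelist entries are regular expressions searched against each cluster log line. The helper name and both patterns are illustrative, since the commit message does not show the actual ignorelist entry. The log line is the one quoted in the commit message.

```python
import re


def is_ignored(line, ignorelist):
    """Return True if any ignorelist pattern matches somewhere in the line."""
    return any(re.search(pattern, line) for pattern in ignorelist)


line = "[ERR] [WRN] CEPHADM_STRAY_DAEMON: 2 stray daemon(s)... in cluster log"

# Hypothetical entry that requires literal parentheses around the code.
strict = [r"\(CEPHADM_STRAY_DAEMON\)"]
# Hypothetical entry that matches with or without parentheses.
relaxed = [r"CEPHADM_STRAY_DAEMON"]

strict_match = is_ignored(line, strict)
relaxed_match = is_ignored(line, relaxed)
```

Under that assumption, the strict pattern misses the quoted line, so the warning would still fail the test, while the relaxed pattern covers both forms.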
Patrick Donnelly
07afb4ae09
Merge PR #56997 into main
* refs/pull/56997/head:
	pybind/mgr: disable sqlite3/python autocommit
	qa/tasks/mgr: add tests for sqlite autocommit
	qa/tasks/vstart_runner: run daemons in foreground
	qa/tasks/vstart_runner: add missing poll method
	qa/suites/rados/mgr: add cli/devicehealth tasks
	qa: reorganize mgr unit tests
	qa: use position-independent link
	qa: add missing terminating newline
	pybind/mgr: add killpoint for sqlite3 database setup
	mgr: allow specifying module option level
	mon/MgrMonitor: promote standby when unsetting down flag
	mon/MgrMonitor: only drop active if exists

Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
2024-04-30 16:46:06 -04:00
Adam Kupczyk
9eb14fc01c qa/rados: Adapt bluestore tests to new naming in ceph_test_objectstore
Plus: fixed bluestore compression test invocation.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
2024-04-30 14:24:49 +00:00
Patrick Donnelly
fb82b6d35a
qa/tasks/mgr: add tests for sqlite autocommit
Test that autocommit is properly turned off and that commits via context managers
work as expected.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
2024-04-29 16:33:32 -04:00
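The behavior checked by commit fb82b6d35a can be illustrated with Python's standard `sqlite3` module. This sketch is not the mgr test: it uses an in-memory database and the module's default transaction handling, and the table and function names are made up for the example. It shows a connection used as a context manager committing on success and rolling back when an exception escapes the block.

```python
import sqlite3


def demo():
    """Insert one row that commits and one that is rolled back."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE t (v INTEGER)")
    db.commit()

    # Leaving the block normally commits the pending transaction.
    with db:
        db.execute("INSERT INTO t VALUES (1)")

    # An exception escaping the block rolls the transaction back.
    try:
        with db:
            db.execute("INSERT INTO t VALUES (2)")
            raise RuntimeError("simulated failure")
    except RuntimeError:
        pass

    rows = [row[0] for row in db.execute("SELECT v FROM t ORDER BY v")]
    db.close()
    return rows


rows = demo()
```

Only the row inserted in the successful block should remain after both blocks have run.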
Adam King
1d97f673d3 qa/cephadm: ignore stray daemon warning during rados_api_tests
The "stray daemon" warning logged in this test comes from
"stray daemon laundry.pid70383 on host smithi027 not managed by cephadm".
It seems the rados_api_tests is creating some additional "laundry" entity
during these tests that gets reported as an actual daemon in the mgr,
but cephadm is unaware of it, resulting in the warning. Originally
we thought to maybe add "laundry" itself to the ignorelist, but
without an additional patch that added extra logging for debug
purposes (which can't be merged) the log statement found in
the logs due to this problem will not say what daemon it found
to be stray. There will just be a generic warning about a stray
daemon. In a real cluster, a user would then check "ceph health detail"
to find out what daemon is stray, but the log scraper can't do this
and just fails the test due to the presence of the warning.

Signed-off-by: Adam King <adking@redhat.com>
2024-04-29 13:54:37 -04:00
Patrick Donnelly
440f25e1ec
qa/suites/rados/mgr: add cli/devicehealth tasks
These should have been part of the commit adding the tests.

Fixes: 9ebcbdbed0
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
2024-04-29 12:22:27 -04:00
Patrick Donnelly
2f48dc9a00
qa: reorganize mgr unit tests
Refactor common tasks and allow loading mgrmodules before unittests start.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
2024-04-29 12:22:27 -04:00
Patrick Donnelly
1749edd668
qa: use position-independent link
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
2024-04-29 12:22:27 -04:00
Casey Bodley
0c72fcc26a
Merge pull request #56008 from kchheda3/wip-notification-subsys
rgw/notification: add rgw notification specific debug log subsystem

Reviewed-by: Yuval Lifshitz <ylifshit@redhat.com>
2024-03-21 15:08:35 +00:00
Laura Flores
88f8db5c4b
Merge pull request #56146 from ljflores/wip-tracker-64725
qa/suites/rados/singleton: add POOL_APP_NOT_ENABLED to ignorelist
2024-03-20 16:50:47 -05:00
Yuri Weinstein
98a7421080
Merge pull request #53308 from NitzanMordhai/wip-nitzan-qa-tasks-with-crush-rules
suites: qa tasks with crush rules

Reviewed-by: Samuel Just <sjust@redhat.com>
2024-03-20 08:37:45 -07:00