Check total pg count for the cluster vs osd count and max pgs per osd
before allowing pool creation, pg_num change, or pool size change.
"in" OSDs are the ones we distribute data too, so this should be the right
count to use. (Whether they happen to be up or down at the moment is
incidental.)
If the user really wants to create the pool, they can change the
configurable limit.
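As a rough sketch of the behaviour (the option name mon_max_pg_per_osd and
the numbers below are assumptions for illustration only):
    # The mon compares the projected PGs per OSD -- roughly
    # sum(pg_num * pool size) / number of "in" OSDs -- against the limit and
    # refuses the pool creation or pg_num/size change if it would exceed it.
    ceph osd pool create bigpool 4096 4096    # rejected if over the limit
    # Raise the configurable limit at runtime if the pool is really wanted:
    ceph tell mon.\* injectargs '--mon_max_pg_per_osd=400'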
Signed-off-by: Sage Weil <sage@redhat.com>
This introduces two config parameters:
mds_cache_memory_limit: Sets the soft maximum of the cache to the given
byte count. (Like mds_cache_size, this doesn't actually limit the maximum
size of the cache. It just dictates the steady-state size.)
mds_cache_reservation: This replaces mds_health_cache_threshold everywhere
except the Beacon heartbeat sent to the mons. The idea here is to specify a
reservation of memory (5% by default) for operations and the MDS tries to
always maintain that reservation. So, the MDS will recall caps from clients
when it begins dipping into its reservation of memory.
mds_cache_size still limits the cache by inode count but now defaults to 0
(i.e. unlimited). The new preferred way of specifying cache limits is by memory
size. The default is 1GB.
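As a minimal sketch (the daemon name and values are only examples), both
knobs can be adjusted at runtime through the admin socket:
    # Cap the MDS cache at 4 GiB of memory instead of counting inodes.
    ceph daemon mds.a config set mds_cache_memory_limit 4294967296
    # Keep 5% of that in reserve for operations (the default).
    ceph daemon mds.a config set mds_cache_reservation 0.05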
Fixes: http://tracker.ceph.com/issues/20594
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1464976
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Reordered the RC releases sections back to their respective components,
added a ceph-mon section, added links to documentation wherever
possible, and added a few forgotten RGW announcements. Also cleared up the
PendingReleaseNotes up to this point.
Signed-off-by: Abhishek Lekshmanan <abhishek@suse.com>
Also clean up PendingReleaseNotes to an empty file so that only newer
changes are tracked, adding the relevant sections back to RC1 where
relevant, and moving all the RC1 announcements back to RC2. When we go to
12.2.0 we'll collapse all of these back into the release announcements.
Signed-off-by: Abhishek Lekshmanan <alekshmanan@suse.com>
This has a few problems:
1- It does not do its analysis over CRUSH rule roots/classes, which
means that an innocent user of classes will see skewed usage (because HDDs
are fuller than SSDs, say).
2- It does not take degraded clusters into account, which means the warning
will appear when a fresh OSD is added.
See http://tracker.ceph.com/issues/20730
Signed-off-by: Sage Weil <sage@redhat.com>
rgw: use a namespace for rgw reshard pool for upgrades as well
Reviewed-by: Casey Bodley <cbodley@redhat.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
This is used to dump extra weirdness to the health detail structured
output, but we are about to remove all of that in luminous.
Signed-off-by: Sage Weil <sage@redhat.com>
It's still sort of awkward to prefix these commands
with "mgr tell" but this makes them at least
somewhat accessible to the average user.
Signed-off-by: John Spray <john.spray@redhat.com>
Make an incompat change here with a release note since
this only affects pool creation, a rare event, and folks
who have customized their configs (also rare).
Keep it simple: a config sets the default rule, or else we pick
the first TYPE_REPLICATED rule in the crush map.
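A minimal sketch, assuming the option ends up named
osd_pool_default_crush_rule with -1 meaning "pick the first replicated rule":
    # -1 (the default) picks the first TYPE_REPLICATED rule in the crush map;
    # any other value names a specific rule id to use for new pools.
    ceph tell mon.\* injectargs '--osd_pool_default_crush_rule=-1'
    ceph osd pool create newpool 64 64    # uses the rule selected above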
Signed-off-by: Sage Weil <sage@redhat.com>
This is undocumented and untested -- it was something
written before and superseded by the "recover_dentries"
subcommand. While we're at it, also
s/scavenge_dentries/recover_dentries/
internally.
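For reference, the surviving and documented path is the recover_dentries
event subcommand; a rough sketch (exact flags vary by release):
    # Replay dentries from the journal back into the metadata store.
    cephfs-journal-tool event recover_dentries summary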
Signed-off-by: John Spray <john.spray@redhat.com>
- rename the option (max -> warn)
- add an err_..._ratio multiplier
- switch to HEALTH_ERR once requests are blocked long enough
- make the error ratio high (default is 32*128s -> about an hour) so that
we don't trigger on a heavily loaded cluster.
Signed-off-by: Sage Weil <sage@redhat.com>
With bluestore, making the smallest write match min_alloc_size avoids
write amplification. With EC pools this is the stripe unit, or
stripe_width / num_data_chunks. Rather than requiring people to divide
by k to get the smallest ec write, allow it to be specified directly
via stripe_unit. Store it in the ec profile so changing a monitor
config option isn't necessary to set it.
This is particularly important for ec overwrites since they allow random i/o
which should match bluestore's checksum granularity (aka min_alloc_size).
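A rough sketch of setting it per profile (the profile name, k/m values and
the 4K stripe_unit are only examples):
    # stripe_width = stripe_unit * k, so the smallest per-OSD write is
    # stripe_unit; pick it to match bluestore's min_alloc_size.
    ceph osd erasure-code-profile set ecprofile-4-2 k=4 m=2 stripe_unit=4K
    ceph osd erasure-code-profile get ecprofile-4-2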
Signed-off-by: Josh Durgin <jdurgin@redhat.com>
This had been broken for some time because, since the new
JournalStream stuff, zero padding was no longer a valid
encoding.
Fixes: http://tracker.ceph.com/issues/19691
Signed-off-by: John Spray <john.spray@redhat.com>
In practice this tends to get bubbled up the stack as an error to
the caller, and they usually do not handle it properly. For example,
with librbd, this turns into EIO and breaks the VM.
Instead, this will manifest as a hung op on the client. That is
also not ideal, but given that the root cause here is generally a
bug, it's not clear what else would be better.
We already log an error in the cluster log, so teuthology runs will
continue to fail.
Signed-off-by: Sage Weil <sage@redhat.com>
Expose public methods that include a new output argument to indicate
whether there are more keys to fetch or not.
Mark the old interfaces deprecated.
Signed-off-by: Sage Weil <sage@redhat.com>
This change prioritizes backfill of PGs which don't have min_size
active copies. Such PGs would cause IO stalls for clients and would
increase throttler usage.
This change also fixes a few subtle out-of-bounds bugs.
Signed-off-by: Bartłomiej Święcki <bartlomiej.swiecki@corp.ovh.com>
Tell users they need to set this to true before Monitors will allow
pools to be removed.
Also update the Pending Release Notes so that users can find this change
there.
This was changed with commit 5d7f4ea
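A minimal sketch of the resulting workflow (the pool name is an example;
the flag is assumed to be mon_allow_pool_delete):
    # Pool deletion is refused until the mons are told to allow it.
    ceph tell mon.\* injectargs '--mon_allow_pool_delete=true'
    ceph osd pool delete mypool mypool --yes-i-really-really-mean-it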
Signed-off-by: Wido den Hollander <wido@42on.com>
osd: set server-side limits on omap get operations
Reviewed-by: xie xingguo <xie.xingguo@zte.com.cn>
Reviewed-by: Kefu Chai <kchai@redhat.com>
Reviewed-by: Samuel Just <sjust@redhat.com>
If we have an OSD with a weight that's not 1.0 and mark it out,
we should restore the same weight when we mark it back in. We
already do this when an OSD is automatically marked out, just
not when it is explicitly marked out.
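For example (osd id and weight are arbitrary), with this change:
    ceph osd reweight 3 0.8    # override weight below 1.0
    ceph osd out 3             # explicitly mark it out
    ceph osd in 3              # restores weight 0.8 instead of resetting to 1.0
    ceph osd tree              # REWEIGHT column shows 0.8 for osd.3 again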
Signed-off-by: Sage Weil <sage@redhat.com>
This assumes that if the mon does not explicitly specify
the kv type, then it is leveldb. No prior version of
Ceph has had non-experimental rocksdb, so this is
relatively safe. It's also necessary because the
default is now 'rocksdb' and we shouldn't assume those
old mons are rocksdb.
This will break for users who explicitly specified
rocksdb for the mon despite it being experimental.
Signed-off-by: Sage Weil <sage@redhat.com>
Exclusive lock, object map, fast-diff, and deep-flatten have been
enabled by default for all new images.
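For example (pool and image names are placeholders):
    rbd create --size 1024 rbdpool/img1
    rbd info rbdpool/img1
    # features: layering, exclusive-lock, object-map, fast-diff, deep-flatten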
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
The rbd cli will warn about the deprecation when attempting to create
image format 1 images. librbd will log an error message when opening
a format 1 RBD image.
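For example (names are placeholders):
    # Explicitly requesting format 1 now prints a deprecation warning:
    rbd create --size 1024 --image-format 1 rbdpool/oldimg
    # Format 2, the default, is unaffected:
    rbd create --size 1024 rbdpool/newimg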
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
The symbols of buffer::list::iterator_impl<> were wrongly exposed
in the previous infernalis release, and clients linked against
librados are very likely using them, so we need to document this
change.
Signed-off-by: Kefu Chai <kchai@redhat.com>
Allow librados users to opt to receive ENOSPC or EDQUOT when they submit
an operation against a full cluster. This should only be used if the
librados app can handle those errors gracefully (librbd, for example,
cannot).
Also note that this allows savvy librados users to send delete operations;
they will get either a success or EDQUOT, depending on whether the
operation results in a net drop in space utilization.
Signed-off-by: Sage Weil <sage@redhat.com>
'ceph mon_metadata' was only added during this dev cycle, so there is
no need to deprecate it first.
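A minimal sketch, assuming (per the tracker) that the underscored form is
simply replaced by the space-separated 'ceph mon metadata':
    ceph mon metadata a    # metadata for mon.a (the id is an example)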
Fixes: #11545
Signed-off-by: Joao Eduardo Luis <joao@suse.de>