* refs/pull/16608/head:
qa: whitelist mds down wrn during cephfs testing
mds: add config to disable fragmentation
qa: add max_mds thrash test
qa: mds_thrash updates for new max_mds behavior
doc: update upgrade procedure and release notes
qa: add test for cluster resizing
qa: remove use of mds deactivate
cephfs: add new down/joinable fs flags
mds: evict all clients if last mds shutting down
cephfs: deprecate ceph mds deactivate
cephfs: kill allow_dirfrags
cephfs: Kill allow_multimds
cephfs: Change behavior of cluster_down flag
mon/FSCommands: Set extra MDS to standby
cephfs: Health check changes
mon/MDSMonitor: Remove command support for legacy syntax
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
MDS deactivation is now handled by the max_mds parameter. Deprecate
ceph mds deactivate and note it to be removed in a future release.
Signed-off-by: Douglas Fuller <dfuller@redhat.com>
As dirfrags are now standard in CephFS, remove the machinery for
tracking and enabling this feature.
ceph fs set <fs> allow_dirfrags is now deprecated and prints a warning
message.
Signed-off-by: Douglas Fuller <dfuller@redhat.com>
With multi-mds now declared stable, allow_multimds now defaults to 1.
Given the max_mds parameter, it is now redundant. Remove it, leaving a
comment placeholder in the features bitmap.
ceph fs set <fs> allow_multimds is now deprecated and prints a warning
message.
Signed-off-by: Douglas Fuller <dfuller@redhat.com>
JSON cannot express arbitrary binary blobs. Instead of outputting invalid
and unparseable JSON, represent the value of blobs as something like
'<<< binary blob of length 12 >>>'.
Fixes: http://tracker.ceph.com/issues/23622
Signed-off-by: Sage Weil <sage@redhat.com>
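The blob placeholder described above can be sketched in a few lines. This is a minimal Python illustration, not the Ceph C++ formatter: the helper names `json_safe_value` and `dump`, and the rule that "binary" means "does not decode as UTF-8", are assumptions made for the example. Only the placeholder text format comes from the message.

```python
import json


def json_safe_value(value: bytes) -> str:
    """Return text for `value`, or a placeholder if it is not valid UTF-8.

    Hypothetical helper: the check Ceph actually applies may differ.
    """
    try:
        return value.decode("utf-8")
    except UnicodeDecodeError:
        # Same shape as the placeholder quoted in the commit message.
        return "<<< binary blob of length %d >>>" % len(value)


def dump(entries: dict) -> str:
    """Serialize a mapping of name -> raw bytes as parseable JSON."""
    return json.dumps({k: json_safe_value(v) for k, v in entries.items()})
```

The output always parses as JSON, at the cost of losing the blob contents.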
This is too complete a rewrite to reasonably break down into small steps,
and even if I could, it would be harder to review that way than to simply
review the new implementation. The semantics of the old one were so weird
that it's harder to reason about the change in behavior than to simply
review the new behavior.
That's my story, at least, and I'm sticking to it!
So, here are the highlights:
- $foo meta expansions are evaluated at get_val() time. This means the
weird bool arguments to set_val specifying whether things were expanded
are removed (they didn't make any sense unless you were thinking about the
old implementation).
- for every option, we track values from any inputs (config, mon,
override). At get_val() time, we pick the highest priority one.
- diff() is rewritten to be simple and to show you all of the above.
- internal interfaces are simplified, and in terms of Option::value_t
whenever possible.
- unit tests simplified somewhat based on the above.
Known issues:
- legacy values get pushed out in select cases. Notably if foo=$bar
and bar is updated, we do not update $foo (there is no dependency
tracking to do this efficiently).
Signed-off-by: Sage Weil <sage@redhat.com>
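A rough Python sketch of the two central ideas above: values are tracked per input source, and `$foo` references are expanded at `get_val()` time. The class layout, the `PRIORITY` ordering and the loop check are assumptions for illustration; the real implementation is C++ and is expressed in terms of Option::value_t.

```python
import re

# Assumed priority order, lowest to highest. The message names these three
# inputs but does not spell out a full ordering, so this list is illustrative.
PRIORITY = ["config", "mon", "override"]


class Config:
    def __init__(self):
        # option name -> {source: raw value}
        self.values = {}

    def set_val(self, name, raw, source):
        # No "was this expanded?" flag is needed: raw text is stored as-is.
        self.values.setdefault(name, {})[source] = raw

    def _pick(self, name):
        by_source = self.values[name]
        # The highest-priority source that has a value wins.
        for source in reversed(PRIORITY):
            if source in by_source:
                return by_source[source]
        raise KeyError(name)

    def get_val(self, name, _seen=()):
        if name in _seen:
            raise ValueError("expansion loop at $%s" % name)
        raw = self._pick(name)
        # $foo references are expanded here, at read time.
        return re.sub(
            r"\$(\w+)",
            lambda m: self.get_val(m.group(1), _seen + (name,)),
            raw,
        )
```

Because expansion happens on read, a later update to `bar` is visible through `foo=$bar` in this sketch. The known issue above concerns legacy values, which are pushed out separately and are not modeled here.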
This script is pointless. It is equivalent to the built-in default
behavior, which makes it only useful as a sample for what a location
hook's output should be. The documentation has been updated to provide
that.
Signed-off-by: Sage Weil <sage@redhat.com>
This reverts commit 3189ba19a7, reversing
changes made to b7620de020.
Despite the change in json format being positive, the unfortunate side-effect
is that it broke upgrade testing (because the QA framework must handle the
transition of mdsmap["info"] to a list from object) and the ceph-mgr.
Fixes: http://tracker.ceph.com/issues/22527
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
* refs/pull/19369/head:
qa: update handling of fs status format
PendingReleaseNotes: add note for format change
mds/MDSMap: use array_section for mds stat
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Zheng Yan <zyan@redhat.com>
Reviewed-by: Xiaoxi Chen <xiaoxchen@ebay.com>
These configs were used for initialization but it is more appropriate to
require setting these file system attributes via `ceph fs set`. This is similar
to what was already done with max_mds. There are new variables added for `fs
set` where missing.
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
drop sections already in previous releases, keeping only Mimic sections
and a new section header for items going post 12.2.2
Signed-off-by: Abhishek Lekshmanan <abhishek@suse.com>
Check total pg count for the cluster vs osd count and max pgs per osd
before allowing pool creation, pg_num change, or pool size change.
"in" OSDs are the ones we distribute data to, so this should be the right
count to use. (Whether they happen to be up or down at the moment is
incidental.)
If the user really wants to create the pool, they can change the
configurable limit.
Signed-off-by: Sage Weil <sage@redhat.com>
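The check above amounts to a single comparison. The following Python sketch assumes PGs are counted per replica (pg_num times pool size) and that the limit is the per-OSD maximum multiplied by the number of "in" OSDs; the function and parameter names are hypothetical, and the exact formula in the monitor may differ.

```python
def pool_change_allowed(current_pgs, added_pgs, num_in_osds, max_pgs_per_osd):
    """Return True if the projected PG total fits under the cluster limit.

    current_pgs and added_pgs are assumed to count PG replicas
    (pg_num * pool size), not just pg_num.
    """
    projected = current_pgs + added_pgs
    # Only "in" OSDs count; whether they are up or down is ignored.
    return projected <= max_pgs_per_osd * num_in_osds
```

With 3 "in" OSDs and a limit of 300 per OSD, a cluster already holding 600 PG replicas can take 300 more but not 384.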
This introduces two config parameters:
mds_cache_memory_limit: Sets the soft maximum of the cache to the given
byte count. (Like mds_cache_size, this doesn't actually limit the maximum
size of the cache. It just dictates the steady-state size.)
mds_cache_reservation: This replaces mds_health_cache_threshold everywhere
except the Beacon heartbeat sent to the mons. The idea here is to specify a
reservation of memory (5% by default) for operations and the MDS tries to
always maintain that reservation. So, the MDS will recall caps from clients
when it begins dipping into its reservation of memory.
mds_cache_size still limits the cache by Inode count but is now by-default 0
(i.e. unlimited). The new preferred way of specifying cache limits is by memory
size. The default is 1GB.
Fixes: http://tracker.ceph.com/issues/20594
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1464976
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
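To make the reservation concrete, here is a small Python sketch of the recall decision. The function name and the idea of comparing against `limit * (1 - reservation)` are an interpretation of the description above, not the MDS code itself.

```python
def should_recall_caps(cache_bytes, memory_limit, reservation=0.05):
    """True once cache use dips into the reserved fraction of the limit.

    memory_limit plays the role of mds_cache_memory_limit and reservation
    the role of mds_cache_reservation (5% by default, per the message).
    """
    threshold = memory_limit * (1.0 - reservation)
    return cache_bytes > threshold
```

With the default 1GB limit, recall would start in this model once the cache grows past roughly 95% of the limit.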
Reordered the RC releases sections back to their respective components,
added a ceph-mon section, added links to documentation wherever
possible, and added a few forgotten RGW announcements. Also cleared up the
PendingReleaseNotes up to this point.
Signed-off-by: Abhishek Lekshmanan <abhishek@suse.com>
Also clean up PendingReleaseNotes to an empty file so that only newer
changes are tracked, adding the relevant section back to RC1 where
appropriate. Move all the RC1 announcements back to RC2; when we go to
12.2.0 we'll collapse all of these back into the release
announcements.
Signed-off-by: Abhishek Lekshmanan <alekshmanan@suse.com>
This has a few problems:
1- It does not do its analysis over CRUSH rule roots/classes, which
means that an innocent user of classes will see skewed usage (bc hdds are
more full than ssds, say)
2- It does not take degraded clusters into account, which means the warning
will appear when a fresh OSD is added.
See http://tracker.ceph.com/issues/20730
Signed-off-by: Sage Weil <sage@redhat.com>
rgw: use a namespace for rgw reshard pool for upgrades as well
Reviewed-by: Casey Bodley <cbodley@redhat.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
This is used to dump extra weirdness to the health detail structured
output, but we are about to remove all of that in luminous.
Signed-off-by: Sage Weil <sage@redhat.com>
It's still sort of awkward to prefix these commands
with "mgr tell" but this makes them at least
somewhat accessible to the average user.
Signed-off-by: John Spray <john.spray@redhat.com>
Make an incompat change here with a release note since
this only affects pool creation, a rare event, and folks
who have customized their configs (also rare).
Keep it simple: a config sets the default rule, or else we pick
the first TYPE_REPLICATED pool in the crush map.
Signed-off-by: Sage Weil <sage@redhat.com>
This is undocumented and untested -- it was something
written before and superseded by the "recover_dentries"
subcommand. While we're at it, also
s/scavenge_dentries/recover_dentries/
internally.
Signed-off-by: John Spray <john.spray@redhat.com>
- rename the option (max -> warn)
- add an err_..._ratio multiplier
- switch to HEALTH_ERR once requests are blocked long enough
- make the error ratio high (default is 32*128s -> about an hour) so that
we don't trigger on a heavily loaded cluster.
Signed-off-by: Sage Weil <sage@redhat.com>
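The escalation rule in the list above can be sketched as follows. The function name is hypothetical, and the defaults read "32*128s" as a 32-second warning age multiplied by an error ratio of 128, which is an assumption about what the two numbers mean.

```python
def blocked_request_health(oldest_age_s, warn_age_s=32, err_ratio=128):
    """Map the age of the oldest blocked request to a health status."""
    # The error threshold is the warning age scaled by the ratio
    # (32 * 128 = 4096 seconds with the assumed defaults).
    if oldest_age_s >= warn_age_s * err_ratio:
        return "HEALTH_ERR"
    if oldest_age_s >= warn_age_s:
        return "HEALTH_WARN"
    return "HEALTH_OK"
```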
With bluestore, making the smallest write match min_alloc_size avoids
write amplification. With EC pools this is the stripe unit, or
stripe_width / num_data_chunks. Rather than requiring people to divide
by k to get the smallest ec write, allow it to be specified directly
via stripe_unit. Store it in the ec profile so changing a monitor
config option isn't necessary to set it.
This is particularly important for ec overwrites since they allow random i/o
which should match bluestore's checksum granularity (aka min_alloc_size).
Signed-off-by: Josh Durgin <jdurgin@redhat.com>
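The arithmetic this saves users is small but easy to get wrong. A Python sketch, with hypothetical helper names, where k is the number of data chunks in the EC profile:

```python
def stripe_unit_from_width(stripe_width, k):
    """Old way: divide the stripe width by k to find the smallest EC write."""
    if stripe_width % k != 0:
        raise ValueError("stripe_width must be a multiple of k")
    return stripe_width // k


def stripe_width_from_unit(stripe_unit, k):
    """New way: specify stripe_unit directly; the width follows from it."""
    return stripe_unit * k
```

For example, matching a 4096-byte min_alloc_size with k=4 means a stripe width of 16384 bytes.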
This had been broken for some time, as since the new
JournalStream stuff, zero padding was no longer a valid
encoding.
Fixes: http://tracker.ceph.com/issues/19691
Signed-off-by: John Spray <john.spray@redhat.com>
In practice this tends to get bubbled up the stack as an error on
the caller, and they usually do not handle it properly. For example,
with librbd, this turns into EIO and breaks the VM.
Instead, this will manifest as a hung op on the client. That is
also not ideal, but given that the root cause here is generally a
bug, it's not clear what else would be better.
We already log an error in the cluster log, so teuthology runs will
continue to fail.
Signed-off-by: Sage Weil <sage@redhat.com>
Expose public methods that include a new output argument to indicate
whether there are more keys to fetch or not.
Mark the old interfaces deprecated.
Signed-off-by: Sage Weil <sage@redhat.com>
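A caller would typically use such a "more" output to page through keys. The Python sketch below models this with an in-memory dict; `get_keys` and `get_all_keys` are hypothetical stand-ins and do not mirror the actual librados method names or signatures.

```python
def get_keys(store, start_after, max_return):
    """Return (keys, more): up to max_return keys sorted after start_after."""
    keys = sorted(k for k in store if k > start_after)
    page = keys[:max_return]
    # more tells the caller whether another call is needed.
    more = len(keys) > len(page)
    return page, more


def get_all_keys(store, max_return=2):
    """Fetch every key by looping until more is False (max_return >= 1)."""
    result, cursor, more = [], "", True
    while more:
        page, more = get_keys(store, cursor, max_return)
        result.extend(page)
        if page:
            cursor = page[-1]
    return result
```

Without the flag, a caller can only guess that a short page means the end, which stops being reliable once the server may return fewer keys than requested.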
This change prioritizes backfill of PGs which don't
have min_size active copies. Such PGs would cause IO stalls
for clients and would increase throttler usage.
This change also fixes a few subtle out-of-bounds bugs.
Signed-off-by: Bartłomiej Święcki <bartlomiej.swiecki@corp.ovh.com>
Tell users they need to set this to true before Monitors will allow
pools to be removed.
Also update the Pending Release Notes so that users can find this change
there.
This was changed with commit 5d7f4ea
Signed-off-by: Wido den Hollander <wido@42on.com>
osd: set server-side limits on omap get operations
Reviewed-by: xie xingguo <xie.xingguo@zte.com.cn>
Reviewed-by: Kefu Chai <kchai@redhat.com>
Reviewed-by: Samuel Just <sjust@redhat.com>
If we have an OSD with a weight that's not 1.0 and mark it out,
we should restore the same weight when we mark it back in. We
already do this when an OSD is automatically marked out, just
not when it is explicitly marked out.
Signed-off-by: Sage Weil <sage@redhat.com>
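The intended behavior can be modeled in a few lines of Python. The class and its `saved_weight` field are hypothetical; they illustrate remembering the old weight on mark-out rather than how the OSDMap stores it.

```python
class OsdState:
    def __init__(self, weight=1.0):
        self.weight = weight
        self.saved_weight = None

    def mark_out(self):
        # Remember the previous weight, whether the mark-out is
        # automatic or explicit.
        if self.weight > 0:
            self.saved_weight = self.weight
        self.weight = 0.0

    def mark_in(self):
        # Restore the remembered weight; fall back to 1.0 if none was saved.
        if self.saved_weight is not None:
            self.weight = self.saved_weight
        else:
            self.weight = 1.0
        self.saved_weight = None
```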
This assumes that if the mon does not explicitly specify
the kv type that it is leveldb. No prior version of
Ceph has had non-experimental rocksdb, so this is
relatively safe. It's also necessary because the
default is now 'rocksdb' and we shouldn't assume those
old mons are rocksdb.
This will break for users who explicitly specified
rocksdb for the mon despite it being experimental.
Signed-off-by: Sage Weil <sage@redhat.com>
Exclusive lock, object map, fast-diff, and deep-flatten have been
enabled by default for all new images.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
The rbd cli will warn about the deprecation when attempting to create
image format 1 images. librbd will log an error message when opening
a format 1 RBD image.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
The symbols of buffer::list::iterator_impl<> were wrongly exposed
in the previous infernalis release, and clients linked against
librados are very likely using them, so we need to document this
change.
Signed-off-by: Kefu Chai <kchai@redhat.com>
Allow librados users to opt to receive ENOSPC or EDQUOT when they submit
an operation against a full cluster. This should only be used if the
librados app can handle those errors gracefully (librbd, for example,
cannot).
Also note that this allows savvy librados users to send delete operations;
they will get either a success or EDQUOT, depending on whether the
operation results in a net drop in space utilization.
Signed-off-by: Sage Weil <sage@redhat.com>
'ceph mon_metadata' was only added during this dev cycle, so there is
no need to deprecate it first.
Fixes: #11545
Signed-off-by: Joao Eduardo Luis <joao@suse.de>
Use a clean name for keyvaluestore (no -dev suffix), but mark as
experimental to ensure users know what they are signing up for.
Signed-off-by: Sage Weil <sage@redhat.com>
Recent versions of Python contain a change to thread shutdown that
causes ceph to hang on exit; see http://bugs.python.org/issue21963.
As it turns out, this is relatively easy to avoid by not spawning
threads on exit, as Rados.__del__() will certainly do by calling
shutdown(); I suspect, but haven't proven, that the problem is
that shutdown() tries to start() a threading.Thread() that never
makes it all the way back to signal start().
Also add a PendingReleaseNote and extra doc comments to clarify.
Fixes: #8797
Signed-off-by: Dan Mick <dan.mick@redhat.com>
Add release note
New librados interface
New pg_nls_response_t over the wire protocol
Ignore internal namespace (.ceph_internal)
Enhance ObjListCtx to keep independent IoCtxImpl so nspace won't change out from under listing code
Add ListObject with private implementation ListObjectImpl to return from iterator
Add EINVAL error for old librados interface when LIBRADOS_ALL_NSPACES set
Add throw to old librados c++ interface when all_nspaces set
Fixes: #9031
Signed-off-by: David Zafman <dzafman@redhat.com>
OSDs will now rely on 'leveldb_*' config options. We do, however, keep
leveldb's log enabled for OSDs by passing 'leveldb_log=""' as a default
argument to global_init() in ceph_osd.cc; users will be able
to override this at their own discretion.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
'leveldb_*' options are currently used both by the monitor and the osd.
However, the monitor has quite different requirements from those of the
osds.
We need to specify some default values that must squash the defaults we
have for 'leveldb_*' options, while allowing users to override them too.
We take this not-exactly-ideal-but-still-good-enough approach of
defining the monitor-specific defaults in the 'default arguments' to
global_init(), thus allowing the user's options to take precedence over
whatever we define.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
From this point onward, users should use leveldb's options and add them
to the appropriate config sections of their configuration file.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
A 'status' or 'health' request will return a HEALTH_WARN whenever the
monitor handling the request has the option set to zero.
Fixes: #7784
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
The FileStore's leveldb currently uses libleveldb's defaults for cache and
write buffer size, which are both 4 MB. Increase the cache size to 128MB and
the write buffer to 8MB.
Tested-by: Dmitry Smirnov <onlyjob@member.fsf.org>
Signed-off-by: Sage Weil <sage@inktank.com>
Reading past the end of a pointer returned by string.data() in c++98
is undefined. While we're fixing this, also allow comparison of xattrs
containing null bytes.
Fixes: #7250
Backport: dumpling
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Just before sending an op, prepare_mutate_op() is called, creating a
new Op. prepare_read_op() already copied over all the out-params
correctly, but for write operations the individual op return value
pointers were not copied, so they would not be filled in. With this
fixed, librados users can get the per-op return codes again.
Partially fixes: #6483
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Require that all OSDs support TMAP2OMAP before starting the MDS. This
avoids doing some work and then crashing with EOPNOTSUPP, and gives us
a more informative message in the logs.
Signed-off-by: Sage Weil <sage@inktank.com>
rbd_list will return -ENOENT when no rbd_directory object
exists. Handle this in the cli tool and interpret it as success with
an empty list.
Add this to the release notes since it changes command line behavior.
Fixes: #6693
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
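The CLI-side handling can be sketched in Python as below. `raw_list` is a hypothetical callable standing in for the underlying list call, assumed here to raise OSError with ENOENT when the rbd_directory object is missing; the real tool is C++ and checks a negative return code instead.

```python
import errno


def list_images(raw_list):
    """Return image names, treating a missing rbd_directory as 'no images'."""
    try:
        return list(raw_list())
    except OSError as e:
        if e.errno == errno.ENOENT:
            # No directory object yet: success with an empty list.
            return []
        raise
```

Any other error still propagates, so real failures are not masked.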
--osd-pool-default-crush-replicated-ruleset replaces
--osd-pool-default-crush-rule
If --osd-pool-default-crush-rule is set it takes precedence over
--osd-pool-default-crush-replicated-ruleset and a deprecation warning is
displayed.
The CrushWrapper::get_osd_pool_default_crush_replicated_ruleset helper is
used to implement this behaviour.
Signed-off-by: Loic Dachary <loic@dachary.org>
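The precedence rule reads naturally as a small function. This Python sketch uses a plain dict for options and treats a missing key as "not set"; both are assumptions for illustration, and the real helper lives in CrushWrapper.

```python
import warnings


def default_replicated_ruleset(options):
    """Pick the default ruleset, letting the deprecated option win if set."""
    legacy = options.get("osd_pool_default_crush_rule")
    if legacy is not None:
        warnings.warn(
            "osd_pool_default_crush_rule is deprecated; use "
            "osd_pool_default_crush_replicated_ruleset",
            DeprecationWarning,
        )
        return legacy
    return options.get("osd_pool_default_crush_replicated_ruleset")
```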
In commit 4f403c26dc we broke the general
non-daemon case.
Also make a note in the release notes.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Note that a bunch of stuff we thought would go in 0.70 is actually in 0.69,
so the update/release notes were adjusted accordingly.
Signed-off-by: Sage Weil <sage@inktank.com>
This is incomplete and unfortunately unusable in its current state:
- it would only set USES_TMAP for old encoded object_info_t and tmapput,
but would NOT set it for tmapup
- a config option turned that off by default.
That means that the mds conversion from tmap -> omap won't be able to use
this because any existing cluster has tmap objects without the USES_TMAP
flag set. And we don't want to unconditionally try a tmap->omap conversion
on omap operations because there are lots of existing librados users out
there that will be negatively impacted by this.
Instead, the MDS will need to handle this conversion on the client side by
reading either tmap or omap objects and explicitly rewriting the content
with omap (while truncating the tmap data away).
The auto-conversion function was added in v0.44.
Signed-off-by: Sage Weil <sage@inktank.com>
This way users can't put snapshots on their clusters unless they explicitly
ask for them and have seen a warning message. We take a bit of the MDSMap
flags in order to do so. The only thing a little weird here is that anybody
who upgrades to this patch who already has snapshots will hit the EPERM
and have to go through the warning, but it doesn't impact existing snapshots
at all so they should be good.
To go along with this, we add "ever_allowed_snaps" and "explicitly_allowed_snaps"
members to the MDSMap, which default to false and are set to true
when allow_new_snaps is set. Old maps decoded with new code default to true
and false, respectively, so we can tell.
Fixes: #6332
Signed-off-by: Greg Farnum <greg@inktank.com>
Signed-off-by: Loic Dachary <loic@dachary.org>
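The defaulting rules for the two new members can be sketched in Python. The class, the dict-based `decode`, and the method name `set_allow_new_snaps` are hypothetical; the real MDSMap uses a versioned binary encoding.

```python
class SnapFlags:
    def __init__(self, ever_allowed_snaps=False, explicitly_allowed_snaps=False):
        # New maps start with both flags false.
        self.ever_allowed_snaps = ever_allowed_snaps
        self.explicitly_allowed_snaps = explicitly_allowed_snaps

    def set_allow_new_snaps(self):
        # Setting allow_new_snaps flips both flags to true.
        self.ever_allowed_snaps = True
        self.explicitly_allowed_snaps = True

    @classmethod
    def decode(cls, fields):
        if "ever_allowed_snaps" not in fields:
            # Old encoding: snapshots may already exist, but nobody opted in.
            return cls(True, False)
        return cls(fields["ever_allowed_snaps"],
                   fields["explicitly_allowed_snaps"])
```

The (true, false) combination is what lets new code recognize a map written by an older version.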
The C++ AioCompletion::get_version() method only returns 32-bits. Sigh.
Add a get_version64() method that returns all 64-bits. Do not touch the
32-bit version to avoid breaking the ABI.
Backport: dumpling, cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
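The truncation is easy to see with a version just past 2^32. This Python sketch models the two C++ methods with bit masks; treating the 32-bit result as an unsigned mask is an assumption about the truncation, not a statement of the method's exact return type.

```python
def get_version(version):
    """Model of the old accessor: only the low 32 bits survive."""
    return version & 0xFFFFFFFF


def get_version64(version):
    """Model of the new accessor: the full 64-bit value is returned."""
    return version & 0xFFFFFFFFFFFFFFFF
```

A version of 2^32 + 7 reads back as 7 through the old accessor and unchanged through the new one.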
This is unlikely to be noticed by anybody, but it is a big change. Document
in the PendingReleaseNotes and bump up the librados minor version number
to 68.
Signed-off-by: Greg Farnum <greg@inktank.com>
This method is problematic because it both writes/mutates and returns data,
which means that an untimely client disconnect or peering event will result
in a success to the client with no payload.
It has not been used since v0.52 (18054ba46fe2779d8df8b1a0d69ec93ca6a66c34)
which is pre-bobtail; so this change breaks compatibility with pre-bobtail
librbd clients (at least for image creation).
Signed-off-by: Sage Weil <sage@inktank.com>
Also increase fd limit defaults to accommodate the larger number
of fds.
Fixes: #5692
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Mark Nelson <mark.nelson@inktank.com>
We can live with the incompatibility here; the hack is currently
not working anyway (see #5623).
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Maximum object size is 100GB configurable with osd_max_object_size
Error EFBIG if attempt to WRITE/WRITEFULL/TRUNCATE beyond osd_max_object_size
Error EINVAL if length < 1 for WRITE/WRITEFULL/ZERO
Make ZERO beyond existing size a no-op
Fixes: #5252
Fixes: #5340
Signed-off-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
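The error rules listed in this message can be summarized as a validation function. In this Python sketch the function name, the order of the checks, and the reading of "100GB" as 100 * 2^30 bytes are assumptions; the ZERO-past-end no-op is not modeled because it depends on the existing object size.

```python
import errno

# Stand-in for osd_max_object_size; assumes "100GB" means 100 * 2^30 bytes.
OSD_MAX_OBJECT_SIZE = 100 * (1 << 30)


def validate(op, offset, length, max_size=OSD_MAX_OBJECT_SIZE):
    """Return 0 if the op is acceptable, else a negative errno."""
    if op in ("WRITE", "WRITEFULL", "ZERO") and length < 1:
        return -errno.EINVAL
    # For TRUNCATE the target size is passed as offset with length 0.
    if op in ("WRITE", "WRITEFULL", "TRUNCATE") and offset + length > max_size:
        return -errno.EFBIG
    return 0
```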