We need the merge_log and proc_replica_log paths to result in the
same missing set. This patch adds some machinery for specifying
a log merge scenario and comparing both paths to the same correct
result. This machinery also makes it a bit easier to read and add
new tests.
Signed-off-by: Samuel Just <sam.just@inktank.com>
This test didn't quite make sense since the divergent entry
cannot be from a newer epoch. It also didn't quite match the
diagram.
Signed-off-by: Samuel Just <sam.just@inktank.com>
We can't merge using the primary's log since we haven't decided whether
to send them a complete log yet. Thus, merge based on the truncated olog
rather than the primary's log. This is a consequence of the division
between trimming divergent entries in peering/unfound search and sending
a complete log to actual members of the actingbackfill set in activate().
_merge_divergent_entries on the truncated log and add_next_event() on the
newer entries result in the same missing/log regardless of the order in
which they are performed.
Signed-off-by: Samuel Just <sam.just@inktank.com>
The _merge_old_entry structure had trouble distinguishing between the
following cases:
missing: foo, 1,1
merge_old_entry modify 1,1 0,0
merge_old_entry modify 1,2 1,1
and
merge_old_entry modify 1,2 1,1
In the first case, we should end up with foo removed from missing
at the end. In the second, we need foo added to missing at 1,1.
It's far simpler to present all of the divergent entries for a single
object at once.
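As an illustration only, the two cases above can be sketched by grouping
the divergent entries per object before touching the missing set. This is
not the actual PGLog code: the tuple layout, the merge_divergent name and
the rule "look at the prior version of the oldest divergent entry" are
assumptions made for the example, and rollback handling is ignored.

```python
from collections import defaultdict

ZERO = (0, 0)  # stands in for version 0,0: the object did not exist before


def merge_divergent(missing, divergent):
    """Hypothetical helper.

    missing:   dict mapping object name -> version we still need
    divergent: list of (name, version, prior_version) tuples
    """
    # Present all divergent entries for one object at once.
    by_object = defaultdict(list)
    for name, version, prior in divergent:
        by_object[name].append((version, prior))

    for name, entries in by_object.items():
        # The oldest divergent entry tells us what state to return to.
        prior = min(entries)[1]
        if prior == ZERO:
            # The object was created by divergent entries: nothing to recover.
            missing.pop(name, None)
        else:
            # We need the version that preceded the divergence.
            missing[name] = prior
    return missing


first = merge_divergent({"foo": (1, 1)},
                        [("foo", (1, 1), (0, 0)), ("foo", (1, 2), (1, 1))])
second = merge_divergent({}, [("foo", (1, 2), (1, 1))])
```

With the whole per-object list in hand, the first case drops foo from
missing and the second adds foo at 1,1, which is hard to decide when the
entries arrive one at a time.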
Signed-off-by: Samuel Just <sam.just@inktank.com>
load_pgs can take a while and it is nice to know what ceph-osd is doing
without cranking up logging.
Did a quick audit of the dout(1) calls and made level 1 the default. This
lets us see basic OSD state changes (load_pgs, boot, active) at the default
level. At this point all osd state changes should be logged at level 1.
Signed-off-by: Sage Weil <sage@inktank.com>
In an effort to reduce fragmentation, prefix every rbd write with
a CEPH_OSD_OP_SETALLOCHINT osd op with an expected_write_size value set
to the object size (1 << order). Backwards compatibility is taken care
of on the osd side.
"The CEPH_OSD_OP_SETALLOCHINT hint is durable, in that it's enough to
do it once. The reason every rbd write is prefixed is that rbd doesn't
explicitly create objects and relies on writes creating them
implicitly, so there is no place to stick a single hint op into. To
get around that we decided to prefix every rbd write with a hint (just
like write and setattr ops, hint op will create an object implicitly if
it doesn't exist)."
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Add a new config option, filestore_max_alloc_hint_size, to cap
SETALLOCHINT hint size. The unit is a byte, the default value is
1 megabyte.
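A minimal sketch of the capping rule, assuming the cap is applied as a
plain minimum. The function name is hypothetical; only the option name,
its default and the (1 << order) hint size come from these commit
messages.

```python
FILESTORE_MAX_ALLOC_HINT_SIZE = 1 << 20  # default from this commit: 1 megabyte


def capped_alloc_hint(order, max_hint=FILESTORE_MAX_ALLOC_HINT_SIZE):
    """Return the hint size in bytes for an rbd object of size 1 << order."""
    expected_write_size = 1 << order  # what rbd passes with SETALLOCHINT
    # Assumption: the filestore clamps the hint with a simple min().
    return min(expected_write_size, max_hint)


hint_4m = capped_alloc_hint(22)   # 4 MB objects are capped to 1 MB
hint_64k = capped_alloc_hint(16)  # small objects pass through unchanged
```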
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Introduce XfsFileStoreBackend class, currently the only filestore
backend implementing SETALLOCHINT op. This commit adds a build-time
dependency on libxfs as xfs-specific ioctl (XFS_IOC_FSSETXATTR /
XFS_XFLAG_EXTSIZE) is used to implement the new set_alloc_hint()
method.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Refactor FS detection checks in FileStore::_detect_fs() so that they
look the same as the ones in FileStore::mkfs(). This is in preparation
for adding XfsFileStoreBackend class.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
This is primarily for librbd/krbd's benefit and is supposed to combat
fragmentation:
"... knowing that rbd images have a 4m size, librbd can pass a hint
that will let the osd do the xfs allocation size ioctl on new files so
that they are allocated in 1m or 4m chunks. We've seen cases where
users with rbd workloads have very high levels of fragmentation in xfs
and this would mitigate that and probably have a pretty nice
performance benefit."
SETALLOCHINT is considered advisory, so our backwards compatibility
mechanism here is to set FAILOK flag for all SETALLOCHINT ops.
xfs is hooked up in the subsequent commits.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
When CEPH_ARGS is parsed each side of the -- must be appended to the
corresponding side of the existing argument list. For instance when
-a -b -- foo bar
is merged with a CEPH_ARGS containing
-c -d -- frob nitz
it must become
-a -b -c -d -- foo bar frob nitz
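The rule can be sketched as follows. This is an illustration in Python,
not the C++ argument parser itself; split_args and merge_args are
hypothetical names.

```python
def split_args(args):
    """Split an argument list at the first '--' into (options, rest, had_sep)."""
    if "--" in args:
        i = args.index("--")
        return args[:i], args[i + 1:], True
    return list(args), [], False


def merge_args(argv, env_args):
    """Append each side of '--' in env_args to the matching side of argv."""
    opts, rest, sep1 = split_args(argv)
    env_opts, env_rest, sep2 = split_args(env_args)
    merged = opts + env_opts
    if sep1 or sep2:
        merged += ["--"] + rest + env_rest
    return merged


merged = merge_args("-a -b -- foo bar".split(), "-c -d -- frob nitz".split())
```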
http://tracker.ceph.com/issues/7578 fixes #7578
Signed-off-by: Loic Dachary <loic@dachary.org>
1. Increased the String length for distro, version and os_desc columns in osds_info table
2. Corrected version information extraction in client/ceph-brag
3. Removed the version_id json entry when version list returned for UUID
4. Updated the README to reflect point 3
Signed-off-by: Babu Shanmugam <anbu@enovance.com>
Otherwise, a high enough 'count' value will trigger all sorts of timeouts
on the OSD; a low enough 'size' value will have the same effect for a
high enough value of 'count' (even the default value may have ill effects
on the osd's behaviour). Limiting these values does not fix how 'osd bench'
should behave, but it keeps someone from inadvertently borking an OSD.
Four options have been added and users may adjust them if they so
desire to play with the OSD's fate:
- 'osd_bench_small_size_max_iops' [default: 100] defines the number of
expected IOPS for a small block size (i.e., <1MB).
- 'osd_bench_large_size_max_throughput' [default: 100<<20] defines
the expected throughput in B/s. We assume 100MB/s.
- 'osd_bench_max_block_size' [default: 64 << 20] caps the block size
allowed. We have defined 64 MB.
- 'osd_bench_duration' [default: 30] caps the expected duration. This
value is used when calculating the maximum allowed 'count', and is
not enforced as the maximum duration of the operation. If other IO
is in progress, or 'osd bench' is somehow slowed down, 'osd bench' may
go over this duration. Adjusting this option does however allow the
user to specify higher 'count' values for (e.g.) a small block size,
as the benchmark is then assumed to run over a longer
time span.
These options attempt to avoid combinations of dangerous parameters. For
instance, we limit the block size to 64 MB (by default) so that there is
no temptation to specify a large enough block size, along with a very small
'count', such that the end result is similar to specifying a big count with
a sane block size.
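A rough sketch of how the four options could combine into a sanity check.
The formula for the maximum 'count' is an assumption made for this
example (it treats 'count' as the total number of bytes to write); the
commit message does not spell it out, and check_bench is a hypothetical
name.

```python
MB = 1 << 20


def check_bench(count, bsize,
                small_size_max_iops=100,
                large_size_max_throughput=100 << 20,
                max_block_size=64 << 20,
                duration=30):
    """Return (ok, reason) for an 'osd bench' request."""
    if bsize > max_block_size:
        return False, "block size exceeds osd_bench_max_block_size"
    if bsize < MB:
        # Small blocks: assume a fixed number of IOPS for 'duration'.
        max_count = bsize * small_size_max_iops * duration
    else:
        # Large blocks: assume a fixed throughput in B/s for 'duration'.
        max_count = large_size_max_throughput * duration
    if count > max_count:
        return False, "count exceeds the maximum of %d bytes" % max_count
    return True, ""


ok_large = check_bench(1 << 30, 4 << 20)     # 1 GB in 4 MB blocks
ok_small = check_bench(1 << 30, 4096)        # 1 GB in 4 KB blocks
ok_huge_block = check_bench(4 << 20, 128 << 20)
```

Under these assumptions the first request is accepted, while the second
and third are rejected: one for asking too many small writes, the other
for exceeding the block size cap.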
Fixes: 7248
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Check that rados put immediately followed by rados get retrieves exactly
the same content.
http://tracker.ceph.com/issues/7423 refs #7423
Signed-off-by: Loic Dachary <loic@dachary.org>
When reading from a replicated pool, trying to read more than the object
size results in a short read that does not go beyond the object size. In
erasure coded pools, objects are padded and the read will return more
bytes than the object actually contains.
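To illustrate the difference, here is a small sketch that assumes padding
rounds the stored length up to a multiple of a stripe width. The
stripe_width parameter and both function names are hypothetical; the
point is only that a read has to be clamped to the logical object size to
behave like a replicated pool.

```python
def padded_size(object_size, stripe_width):
    """Stored length once the object is padded to a full stripe (assumed)."""
    return -(-object_size // stripe_width) * stripe_width


def clamp_read(offset, length, object_size):
    """Number of bytes a read should return: never past the object size."""
    return max(0, min(length, object_size - offset))


stored = padded_size(10, 4096)      # what an erasure coded pool may hold
returned = clamp_read(0, 100, 10)   # what the client should get back
```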
http://tracker.ceph.com/issues/7423 fixes #7423
Signed-off-by: Loic Dachary <loic@dachary.org>
In the event that mod_desc.bl contains pointers into a large
message buffer, we'd otherwise end up keeping around the entire
MOSDECSubOpWrite which created each log entry.
Fixes: #7539
Signed-off-by: Samuel Just <sam.just@inktank.com>
The !tracking_enabled branch actually had a leak which was unreachable
since the caller does the check for tracking_enabled.
Signed-off-by: Samuel Just <sam.just@inktank.com>
Otherwise, clear_data on MOSDOp will leave essentially
all of the buffers intact. This is a problem since the
OpTracker mechanism relies on being able to keep the message
around without keeping around the data.
Signed-off-by: Samuel Just <sam.just@inktank.com>