Commit Graph

34512 Commits

Author SHA1 Message Date
Sage Weil
41364711a6 osd/ReplicatedPG: observe INCOMPLETE_CLONES when doing clone subsets
During recovery, we can clone subsets if we know that all clones will be
present.  We skip this on caching pools because they may not be; do the
same when INCOMPLETE_CLONES is set.

Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-24 10:07:33 -07:00
Sage Weil
956f28721d osd/ReplicatedPG: do not complain about missing clones when INCOMPLETE_CLONES is set
When scrubbing, do not complain about missing clones when we are in a
caching mode *or* when the INCOMPLETE_CLONES flag is set.  Both are
indicators that we may be missing clones and that that is okay.

Fixes: #8882
Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-24 10:07:33 -07:00
Sage Weil
54bf055c5d osd/osd_types: add pg_pool_t FLAG_INCOMPLETE_CLONES
Set a flag on the pg_pool_t when we change cache_mode to NONE.  This
is because object promotion may promote heads without all of the clones,
and when we switch the cache_mode back those objects may remain.  Do
this on any cache_mode change (to or from NONE) to capture legacy
pools that were set up before this flag existed.

Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-24 10:06:55 -07:00
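The three INCOMPLETE_CLONES commits above come down to one shared decision: a pool may be missing clones if it is in a caching mode or if the flag has been set by an earlier cache_mode change. The following is a minimal C++ sketch of that rule, not the actual Ceph code. `PoolSketch`, `kFlagIncompleteClones`, `set_cache_mode`, `may_be_missing_clones` and `can_clone_subsets` are hypothetical names, the bit value is arbitrary, and setting the flag on every real mode change is an assumption drawn from the commit message.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for illustration; not the real pg_pool_t.
enum class CacheMode { None, Writeback, ReadOnly };

constexpr std::uint64_t kFlagIncompleteClones = 1ull << 0;  // arbitrary bit

struct PoolSketch {
  CacheMode cache_mode = CacheMode::None;
  std::uint64_t flags = 0;
};

// Assumption: any real cache_mode change (to or from None) sets the flag,
// and the flag is never cleared afterwards.
void set_cache_mode(PoolSketch& pool, CacheMode new_mode) {
  if (new_mode == pool.cache_mode)
    return;  // no-op change, nothing to record
  pool.flags |= kFlagIncompleteClones;
  pool.cache_mode = new_mode;
}

// Scrub should not complain about missing clones when this returns true.
bool may_be_missing_clones(const PoolSketch& pool) {
  return pool.cache_mode != CacheMode::None ||
         (pool.flags & kFlagIncompleteClones) != 0;
}

// Recovery may clone subsets only when all clones are known to be present.
bool can_clone_subsets(const PoolSketch& pool) {
  return !may_be_missing_clones(pool);
}
```

A pool that was switched to a cache mode and back to NONE still reports that clones may be missing, which is the legacy case the flag is meant to capture.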
Sage Weil
67d13d76f5 mon/OSDMonitor: improve no-op cache_mode set check
If we have a pending pool value but the cache_mode hasn't changed, this is
still a no-op (and we don't need to block).

Backport: firefly
Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-24 10:06:54 -07:00
Sage Weil
a05a0da3b1 common/random_cache: fix typo
Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-23 10:11:59 -07:00
Sage Weil
54ea5c13ac Merge pull request #2136 from yuyuyu101/fix-randomcache
common/RandomCache: Fix inconsistency between contents and count

Reviewed-by: Sage Weil <sage@redhat.com>
2014-07-23 09:57:59 -07:00
Haomai Wang
5efdc6236c common/RandomCache: Fix inconsistency between contents and count
The add/clear methods may leave count inconsistent with the real size of
contents.

Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
2014-07-23 11:31:46 +08:00
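The commit message above does not say exactly how the counter drifted, so the following is only a hedged C++ sketch of the general invariant: a separately stored count must change only when the container's size changes. `CountedCacheSketch` and its members are hypothetical names and do not mirror the real RandomCache interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical cache that keeps an explicit counter next to its contents.
class CountedCacheSketch {
  std::unordered_map<std::string, int> contents_;
  std::size_t count_ = 0;

 public:
  void add(const std::string& key, int value) {
    auto res = contents_.emplace(key, value);
    if (res.second) {
      ++count_;                    // new key: contents grew by one
    } else {
      res.first->second = value;   // existing key: overwrite, count unchanged
    }
  }

  void clear() {
    contents_.clear();
    count_ = 0;                    // reset the counter together with contents
  }

  std::size_t count() const { return count_; }
  std::size_t size() const { return contents_.size(); }
};
```

Adding the same key twice and then clearing should leave `count()` equal to `size()` at every step.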
Josh Durgin
422218a3b3 Merge pull request #2120 from ceph/wip-8858
Wip 8858

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2014-07-22 16:58:25 -07:00
Gregory Farnum
29401c7e77 Merge pull request #2133 from ceph/wip-8897
os: fix build warnings with name/attr len checks (fixes 8897)

Reviewed-by: Greg Farnum <greg@inktank.com>
2014-07-22 15:36:40 -07:00
João Eduardo Luís
d37b2ac1e3 Merge pull request #2128 from ceph/wip-8851
mon: AuthMonitor: always encode full regardless of keyserver having keys

Reviewed-by: Gregory Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2014-07-22 22:10:17 +01:00
Sage Weil
253ca2b902 os: make name/attr max methods unsigned
This fixes warnings when we use these in MIN/MAX macros against
other unsigned values.

Fixes: #8897
Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-22 13:38:32 -07:00
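As a rough illustration of the warning fixed above: a MIN/MAX-style macro compares its two arguments directly, so mixing a signed return value with an unsigned operand triggers a sign-compare warning on common compilers. In this sketch `SKETCH_MIN`, `NameStoreSketch` and `effective_name_limit` are hypothetical names. Only the method name and the value 4096 come from the commits in this log.

```cpp
#include <cassert>

// Hypothetical macro with the same shape as a typical MIN macro.
#define SKETCH_MIN(a, b) ((a) < (b) ? (a) : (b))

struct NameStoreSketch {
  // Returning unsigned keeps both macro operands the same signedness.
  // If this returned int, (configured < limit) would compare signed
  // against unsigned and draw a -Wsign-compare warning.
  unsigned get_max_object_name_length() const { return 4096; }
};

// Effective limit is the smaller of the configured and backend limits.
unsigned effective_name_limit(const NameStoreSketch& store, unsigned configured) {
  return SKETCH_MIN(configured, store.get_max_object_name_length());
}
```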
Sage Weil
daac7508d2 os/KeyValueStore: make get_max_object_name_length() sane
This is getting the NAME_MAX from the OS, but in reality the backend
KV store is the limiter.  And for leveldb, there is no real limit.
Return 4096 for now.

Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-22 13:37:20 -07:00
Josh Durgin
74b386f03e Merge pull request #2129 from ceph/wip-librbd-oc
librbd: reduce cache flush overhead

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Haomai Wang <haomaiwang@gmail.com>
2014-07-22 13:33:24 -07:00
Sage Weil
36265d0db0 Merge pull request #2125 from ceph/wip-memstore
memstore: a few fixes, and enable the tests!

Reviewed-by: Haomai Wang <haomaiwang@gmail.com>
2014-07-22 10:52:40 -07:00
Sage Weil
f7112c5beb Merge pull request #2105 from rootfs/wip-qa-hadoop-wordcount
update hadoop-wordcount test to be able to run on hadoop 2.x.

Reviewed-by: Sage Weil <sage@redhat.com>
2014-07-22 08:42:03 -07:00
rootfs
e311a085a8 uncomment cleanup command
2014-07-22 11:31:37 -04:00
Wido den Hollander
d87e5b9f60 powerdns: RADOS Gateway backend for bucket directioning
This backend can be used to create one global namespace for multiple
RGW regions.

Using a CNAME DNS response the traffic is directed towards the RGW region
without using HTTP redirects.
2014-07-22 16:51:05 +02:00
Joao Eduardo Luis
b551ae2bce mon: AuthMonitor: always encode full regardless of keyserver having keys
On clusters without cephx, assuming an admin never added a key to the
cluster, the monitors have empty key servers.  A previous patch had the
AuthMonitor not encoding an empty keyserver as a full version.

As such, whenever the monitor restarts we will have to read the whole
state from disk in the form of incrementals.  This poses a problem upon
trimming, as we do every now and then: whenever we start the monitor, it
will start with an empty keyserver, waiting to be populated from whatever
we have on disk.  This is performed in update_from_paxos(), and the
AuthMonitor will rely on the keyserver version to decide which
incrementals we care about -- basically, all versions > keyserver version.

Although we started with an empty keyserver (version 0) and are expecting
to read state from disk, in this case it means we will attempt to read
version 1 first.  If the cluster has been running for a while now, and
even if no keys have been added, it's fair to assume that version is
greater than 0 (or even 1), as the AuthMonitor also deals with and keeps track
of auth global ids.  As such, we expect to read version 1, then version 2,
and so on.  If we trim at some point however this will not be possible,
as version 1 will not exist -- and we will assert because of that.

This is fixed by ensuring the AuthMonitor keeps track of full versions
of the key server, even if it's of an empty key server -- it will still
keep track of the key server's version, which is incremented each time
we update from paxos even if it is empty.

Fixes: #8851
Backport: dumpling, firefly

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2014-07-22 00:25:37 +01:00
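The failure described above can be modelled in a few lines. This is a toy C++ sketch under stated assumptions, not the monitor's real storage code: payloads are plain integers, applying an incremental means adding it, and `ReplayStoreSketch`, `rebuild` and `rebuild_ok` are hypothetical names. An exception stands in for the assert mentioned in the commit message.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <stdexcept>

// Toy on-disk state: trimmed incrementals are simply absent from the map.
struct ReplayStoreSketch {
  std::map<std::uint64_t, int> incrementals;  // version -> delta
  std::uint64_t full_version = 0;             // 0 means no full version stored
  int full_state = 0;
};

// Rebuild in-memory state up to last_committed. Start from the stored full
// version if there is one, otherwise from version 0, then apply every
// incremental with a greater version in order.
int rebuild(const ReplayStoreSketch& s, std::uint64_t last_committed) {
  std::uint64_t version = s.full_version;
  int state = (s.full_version != 0) ? s.full_state : 0;
  while (version < last_committed) {
    auto it = s.incrementals.find(version + 1);
    if (it == s.incrementals.end())
      throw std::runtime_error("incremental was trimmed");  // real code asserts
    state += it->second;
    ++version;
  }
  return state;
}

// Convenience wrapper: true if the rebuild completes without a gap.
bool rebuild_ok(const ReplayStoreSketch& s, std::uint64_t last_committed) {
  try {
    rebuild(s, last_committed);
    return true;
  } catch (const std::runtime_error&) {
    return false;
  }
}
```

With versions 1 and 2 trimmed and no full version stored, the rebuild fails at version 1. Once a full version at 2 exists, only incrementals 3 and 4 are needed and the rebuild succeeds.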
Josh Durgin
34b0efdec7 ObjectCacher: fix bh_{add,remove} dirty_or_tx_bh accounting
tx buffers need to go on the bh_lru_rest as well, and removing erases
them from (not inserts them into) dirty_or_tx_bh.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2014-07-21 14:13:21 -07:00
Josh Durgin
8a05f1ba0e ObjectCacher: fix dirty_or_tx_bh logic in bh_set_state()
The else-if chain here was wrong. Handling dirty or tx buffers and
errors should be in independent conditions.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2014-07-21 14:13:11 -07:00
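The two accounting fixes above can be pictured with a small C++ sketch. It is an illustration under assumptions, not ObjectCacher's real code: `CacherSketch`, `BufferHeadSketch` and `BhState` are hypothetical, and the rule that leaving the error state clears the stored error code is invented purely to show a second, independent condition.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

enum class BhState { Clean, Dirty, Tx, Error };

struct BufferHeadSketch {
  BhState state = BhState::Clean;
  int error = 0;
};

class CacherSketch {
  std::set<BufferHeadSketch*> dirty_or_tx_;

  static bool is_dirty_or_tx(BhState s) {
    return s == BhState::Dirty || s == BhState::Tx;
  }

 public:
  // Adding a dirty or tx buffer inserts it into the set.
  void bh_add(BufferHeadSketch* bh) {
    if (is_dirty_or_tx(bh->state)) dirty_or_tx_.insert(bh);
  }

  // Removing a dirty or tx buffer erases it from the set (never inserts).
  void bh_remove(BufferHeadSketch* bh) {
    if (is_dirty_or_tx(bh->state)) dirty_or_tx_.erase(bh);
  }

  void bh_set_state(BufferHeadSketch* bh, BhState next) {
    const bool was = is_dirty_or_tx(bh->state);
    const bool now = is_dirty_or_tx(next);
    // Independent condition 1: keep set membership in step with the state.
    if (was && !now) dirty_or_tx_.erase(bh);
    if (!was && now) dirty_or_tx_.insert(bh);
    // Independent condition 2 (assumed rule): leaving Error clears the code.
    // In an else-if chain this branch would be skipped whenever the
    // membership branch above had already fired.
    if (next != BhState::Error && bh->error != 0) bh->error = 0;
    bh->state = next;
  }

  std::size_t tracked() const { return dirty_or_tx_.size(); }
};
```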
Haomai Wang
d3587419da Wait tx state buffer in flush_set
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
2014-07-21 13:31:32 -07:00
Haomai Wang
3c7229a2fe Add rbdcache max dirty object option
Librbd calculates the max dirty object count from rbd_cache_max_size, which
does not suit every case. If the user sets image order 24, the calculated
result is too small in practice. That increases the overhead of the trim
call, which runs on each read/write op.

Now we make it an option for tuning; by default this value is calculated.

Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
2014-07-21 13:31:32 -07:00
Haomai Wang
5cb4b000dd Reduce ObjectCacher flush overhead
A flush op in ObjectCacher iterates over the whole active object set, and
each dirty object may also own several BufferHeads. If the object set is
large, this consumes too much time.

Use dirty_bh instead to reduce overhead. Now only dirty BufferHeads are
checked.

Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
2014-07-21 13:31:32 -07:00
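To make the cost difference above concrete, here is a hedged C++ sketch that counts how many buffers each flush strategy has to visit. `BhSketch`, `ObjectSketch`, `flush_by_scan` and `flush_by_dirty_set` are hypothetical names, and "flushing" is reduced to clearing a flag.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

struct BhSketch { bool dirty = false; };
struct ObjectSketch { std::vector<BhSketch> bhs; };

// Old shape: walk every buffer of every object and flush the dirty ones.
// Returns the number of buffers visited.
std::size_t flush_by_scan(std::vector<ObjectSketch>& objects) {
  std::size_t visited = 0;
  for (auto& ob : objects) {
    for (auto& bh : ob.bhs) {
      ++visited;
      if (bh.dirty) bh.dirty = false;
    }
  }
  return visited;
}

// New shape: walk only the set of dirty buffers, then empty the set.
// Returns the number of buffers visited.
std::size_t flush_by_dirty_set(std::set<BhSketch*>& dirty_bh) {
  std::size_t visited = 0;
  for (BhSketch* bh : dirty_bh) {
    ++visited;
    bh->dirty = false;
  }
  dirty_bh.clear();
  return visited;
}
```

With three objects of two buffers each and a single dirty buffer, the scan visits six buffers while the set-based flush visits one. The price is that the set has to be kept accurate on every state change.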
Ma, Jianpeng
9061988ec7 osd: init local_connection for fast_dispatch in _send_boot()
We were not properly setting up Sessions on the local_connection for
fast_dispatch'ed Messages if the cluster_addr was set explicitly: the OSD
was not in the dispatch list at bind() time (in ceph_osd.cc), and nothing
called it later on. This issue was missed in testing because Inktank only
uses unified NICs.

That led to errors like the following:

When doing an ec-read, I hit a bug that occurred 100% of the time. The messages are:
2014-07-14 10:03:07.318681 7f7654f6e700 -1 osd/OSD.cc: In function
'virtual void OSD::ms_fast_dispatch(Message*)' thread 7f7654f6e700 time
2014-07-14 10:03:07.316782 osd/OSD.cc: 5019: FAILED assert(session)

 ceph version 0.82-585-g79f3f67 (79f3f67491)
 1: (OSD::ms_fast_dispatch(Message*)+0x286) [0x6544b6]
 2: (DispatchQueue::fast_dispatch(Message*)+0x56) [0xb059d6]
 3: (DispatchQueue::run_local_delivery()+0x6b) [0xb08e0b]
 4: (DispatchQueue::LocalDeliveryThread::entry()+0xd) [0xa4a5fd]
 5: (()+0x8182) [0x7f7665670182]
 6: (clone()+0x6d) [0x7f7663a1130d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

To resolve this, we have the OSD invoke ms_handle_fast_connect() explicitly
in _send_boot(). It's not really an appropriate location, but we're already
doing a bunch of messenger twiddling there, so it's acceptable for now.

Signed-off-by: Ma Jianpeng <jianpeng.ma@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2014-07-21 13:13:44 -07:00
Sage Weil
c1c5f4b5f5 Merge pull request #2121 from ceph/wip-dencoder
limit leveldb linkage; move ceph-dencoder back into ceph-common

Reviewed-by: Dan Mick <dan.mick@inktank.com>

RGW patch Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2014-07-21 13:10:02 -07:00
Sage Weil
27f6dbb64a Merge pull request #2067 from thorstenb/wip-janitorial-clang-3
[werror] Fix mismatched tags (struct vs. class) inconsistency

Reviewed-by: Sage Weil <sage@redhat.com>
2014-07-21 09:08:31 -07:00
Thorsten Behrens
b6f3aff766 Fix mismatched tags (struct vs. class) inconsistency
Signed-off-by: Thorsten Behrens <tbehrens@suse.com>
2014-07-21 17:09:17 +02:00
Sage Weil
ff15a43c71 Merge pull request #2111 from ceph/wip-8174
osd: add config for osd_max_object_name_len = 2048 (was hard-coded at 4096)

Reviewed-by: Haomai Wang <haomaiwang@gmail.com>

and the first patch was
Reviewed-by: Samuel Just <sam.just@inktank.com>
2014-07-20 14:21:09 -07:00
Sage Weil
2aa3edcb13 os/FileStore: fix max object name limit
Our max object name is not limited by file name size, but by the length of
the name we can stuff in an xattr.  That will vary from file system to
file system, so just make this 4096.  In practice, it should be limited
via the global tunable, if it is adjusted at all.

Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-20 07:48:47 -07:00
Sage Weil
f4bffece8f ceph_test_objectstore: test memstore
Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-19 13:56:07 -07:00
Sage Weil
6f312b0584 os/MemStore: copy attrs on clone
Backport: firefly
Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-19 13:56:07 -07:00
Sage Weil
8dd6b8f9d8 os/MemStore: fix wrlock ordering checks
We can't compare the shared_ptrs themselves; we need to compare the
addresses of the actual objects.

Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-19 13:56:07 -07:00
Sage Weil
a2594a5472 os/MemStore: handle collection_move_rename within the same collection
Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-19 13:56:07 -07:00
Sage Weil
34671108ce ceph-dencoder: don't link librgw.la (and rados, etc.)
Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-18 22:44:51 -07:00
Sage Weil
b1a641f307 rgw: move a bunch of stuff into rgw_dencoder
This will help out ceph-dencoder ...

Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-18 22:39:46 -07:00
Sage Weil
1c170776cb libosd_types, libos_types, libmon_types
Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-18 22:33:42 -07:00
Sage Weil
58cc894b32 Revert "ceph.spec: move ceph-dencoder to ceph from ceph-common"
This reverts commit 95f5a448b5.
2014-07-18 20:55:39 -07:00
Sage Weil
f181f78b74 Revert "debian: move ceph-dencoder to ceph from ceph-common"
This reverts commit b37e3bde3b.
2014-07-18 20:55:35 -07:00
Sage Weil
ad4a4e1346 unittest_osdmap: revert a few broken changes
From commit 80ea6067f7.

Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-18 16:51:16 -07:00
Yehuda Sadeh
d7209c1125 rgw: dump prefix unconditionally
As part of issue #8858, and to be more in line with S3, dump the Prefix
field when listing bucket even if bucket is empty.

Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
2014-07-18 14:56:24 -07:00
Yehuda Sadeh
dc417e477d rgw: list extra objects to set truncation flag correctly
Otherwise we end up returning a wrong truncated value, and no data on the
next iteration.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
2014-07-18 14:56:24 -07:00
Yehuda Sadeh
82d2d612e7 rgw: account common prefixes for MaxKeys in bucket listing
To be more in line with the S3 API. Previously we didn't count the
common prefixes towards the MaxKeys (a single common prefix counts as a
single key). Also need to adjust the marker now if it is pointing at a
common prefix.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
2014-07-18 14:56:23 -07:00
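The listing rules from the commits above (each common prefix counts as one key towards MaxKeys, and one extra entry is examined to decide the truncation flag) can be sketched as follows. This is a simplified in-memory C++ model, not the RGW implementation: `ListingSketch` and `list_bucket_sketch` are hypothetical names, the bucket is a sorted `std::set`, and marker and prefix parameters are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

struct ListingSketch {
  std::vector<std::string> keys;
  std::vector<std::string> common_prefixes;
  bool truncated = false;
};

// Keys containing the delimiter are rolled up into a common prefix.
// Plain keys and common prefixes share the same max_keys budget.
ListingSketch list_bucket_sketch(const std::set<std::string>& bucket,
                                 const std::string& delim,
                                 std::size_t max_keys) {
  ListingSketch out;
  std::size_t used = 0;
  for (const auto& key : bucket) {
    const std::size_t pos =
        delim.empty() ? std::string::npos : key.find(delim);
    const bool rolls_up = pos != std::string::npos;
    std::string cp;
    if (rolls_up) {
      cp = key.substr(0, pos + delim.size());
      // Sorted order keeps keys with the same prefix adjacent, so comparing
      // with the last emitted prefix is enough to skip duplicates.
      if (!out.common_prefixes.empty() && out.common_prefixes.back() == cp)
        continue;
    }
    if (used == max_keys) {
      out.truncated = true;  // one more entry exists beyond the budget
      break;
    }
    if (rolls_up) out.common_prefixes.push_back(cp);
    else out.keys.push_back(key);
    ++used;
  }
  return out;
}
```

For a bucket holding `a/1`, `a/2`, `b` and `c` with delimiter `/` and a budget of 2, the result is the common prefix `a/`, the key `b`, and a truncation flag that is set because `c` remains.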
Yehuda Sadeh
924686f0b6 rgw: add NextMarker param for bucket listing
Partially fixes #8858.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
2014-07-18 14:55:58 -07:00
Wido den Hollander
09a5974fd3 crushtool: Send output to stdout instead of stderr
A lot of output was sent to stderr instead of stdout and vice versa.

Error messages should go to stderr, but all other output to stdout.
2014-07-18 20:18:18 +02:00
Gregory Farnum
b9463e3497 Merge pull request #2115 from ceph/wip-8811
Make standby-replay MDSes much more careful about journal formats; both changing them and generally being aware.

Reviewed-by: Greg Farnum <greg@inktank.com>
2014-07-18 11:17:52 -07:00
Yehuda Sadeh
e6cf618c25 rgw: improve delimited listing of bucket
If we find a prefix, calculate a string greater than it so that the next
request can skip to it. This is still not the most efficient way to
do it. It'll be better to push it down to the objclass, but that'll
require a much bigger change.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
2014-07-18 10:45:57 -07:00
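One way to picture "a string greater than the prefix" is to bump the last byte of the prefix, which yields the smallest string that sorts after every key starting with that prefix. The C++ sketch below works byte-wise on a sorted `std::set`. That is an assumption for illustration only, since the neighbouring commit exports UTF-8 helpers and the real code may operate on code points. `skip_past_prefix` is a hypothetical name.

```cpp
#include <cassert>
#include <set>
#include <string>

// Return the smallest string that sorts after every string beginning with
// `prefix` (byte-wise comparison). Trailing 0xff bytes cannot be incremented,
// so they are dropped first. An empty result means there is no upper bound.
std::string skip_past_prefix(std::string prefix) {
  while (!prefix.empty() &&
         static_cast<unsigned char>(prefix.back()) == 0xff) {
    prefix.pop_back();
  }
  if (prefix.empty()) return prefix;
  prefix.back() = static_cast<char>(
      static_cast<unsigned char>(prefix.back()) + 1);
  return prefix;
}
```

Given the prefix `a/`, the function returns `a0`, and `lower_bound("a0")` on a sorted key set jumps straight over `a/1`, `a/2` and so on to the next unrelated key.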
Yehuda Sadeh
49fc68cf8c utf8: export encode_utf8() and decode_utf8()
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
2014-07-18 10:45:57 -07:00
Sage Weil
bd3367eafb osd: add config for osd_max_attr_name_len = 100
Set a limit on the length of an attr name.  The fs can only take 128
bytes, but we were not imposing any limit.

Add a test.

Reported-by: Haomai Wang <haomaiwang@gmail.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2014-07-18 10:44:49 -07:00
Sage Weil
7c0b2a05b9 os: add ObjectStore::get_max_attr_name_length()
Most importantly, capture that attrs on FileStore can't be more than about
100 chars.  The Linux xattrs can only be 128 chars, but we also have some
prefixing we do.

Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-18 10:44:05 -07:00
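The two attr-name commits above amount to one check: the name must fit within both the backend's limit and the configured limit. The following is a minimal C++ sketch of such a check. `check_attr_name` is a hypothetical helper, and returning `-ENAMETOOLONG` is an assumption, since the commit messages do not say which error the OSD reports.

```cpp
#include <algorithm>
#include <cassert>
#include <cerrno>
#include <string>

// store_max models ObjectStore::get_max_attr_name_length();
// config_max models the osd_max_attr_name_len option (100 in the commit).
// Returns 0 if the name is acceptable, or a negative errno value otherwise.
int check_attr_name(const std::string& name,
                    unsigned store_max,
                    unsigned config_max) {
  const unsigned limit = std::min(store_max, config_max);
  return name.size() > limit ? -ENAMETOOLONG : 0;
}
```

Taking the minimum means a backend with a tighter limit than the configured value still wins, which matches the intent of exposing the limit through the object store interface.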
Sage Weil
7e0aca18a0 osd: add config for osd_max_object_name_len = 2048 (was hard-coded at 4096)
Previously we had a hard-coded limit of 4096.  Object names > 3k crash the
OSD when running on ext4, although they probably work on xfs.  But rgw only
generates object names a bit over 1024 bytes (maybe 1200 tops?), so let's set a
more reasonable limit here.  2048 is a nice round number and should be
safe.

Add a test.

Fixes: #8174
Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-18 10:44:05 -07:00