Commit Graph

34541 Commits

Author SHA1 Message Date
Ma Jianpeng
4eb18dd487 os/FileJournal: Update the journal header when closing journal
When closing the journal, check must_write_header and update the
journal header if must_write_header is already set.
This can reduce needless journal replay after restarting the OSD.

Signed-off-by: Ma Jianpeng <jianpeng.ma@intel.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2014-07-24 18:54:33 -07:00
John Wilkins
4fe07925e4 doc: Updated mon doc per feedback. Fixed hyperlinks.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
2014-07-24 16:00:52 -07:00
Gregory Farnum
2a6b5309e5 Merge pull request #2079 from nereocystis/seq_read_bench-args
Make the declaration argument names match those in the implementation (as used by callers).

Reviewed-by: Greg Farnum <greg@inktank.com>
2014-07-24 14:36:21 -07:00
Abhishek Lekshmanan
c51182257e doc: update radosgw man page with available opts
Fixes: #8112

Reviewed-by: Yehuda Sadeh <yehuda@redhat.com>
Signed-off-by: Abhishek Lekshmanan <abhishek.lekshmanan@gmail.com>
2014-07-24 13:21:25 -07:00
Abhishek Lekshmanan
e259aca55a rgw: list all available options during help()
Adding the available help arguments from the man page

Fixes: #8112

Reviewed-by: Yehuda Sadeh <yehuda@redhat.com>
Signed-off-by: Abhishek Lekshmanan <abhishek.lekshmanan@gmail.com>
2014-07-24 13:17:26 -07:00
Abhishek Lekshmanan
99e80a5f62 rgw: format help options to align with the rest
Whitespace removal to make all help options align in a similar fashion

Reviewed-by: Yehuda Sadeh <yehuda@redhat.com>
Signed-off-by: Abhishek Lekshmanan <abhishek.lekshmanan@gmail.com>
2014-07-24 13:15:53 -07:00
Sage Weil
7d137430aa Merge remote-tracking branch 'gh/next' 2014-07-23 19:14:52 -07:00
Sage Weil
f9e885b990 Merge pull request #2127 from ceph/wip-8701
filestore: fix collection_move behavior

Reviewed-by: Greg Farnum <greg@inktank.com>
2014-07-23 19:13:55 -07:00
Sage Weil
c757fbdd7b Merge pull request #2140 from ceph/wip-8889
osd: greedily get obc write lock in some cases

Reviewed-by: Greg Farnum <greg@inktank.com>
2014-07-23 19:13:11 -07:00
Sage Weil
d4faf747b7 ceph_test_objectstore: clean up on finish of MoveRename
Otherwise, we leave collections around, and the next test fails.

Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-23 14:48:14 -07:00
Sage Weil
3ec9a42b47 os/LFNIndex: use FDCloser for fsync_dir
This prevents an fd leak when maybe_inject_failure() throws an exception.

Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-23 14:48:14 -07:00
Sage Weil
6fb3260d59 os/LFNIndex: only consider alt xattr if nlink > 1
If we are doing a lookup and the main xattr fails, we'll check if there is an
alt xattr.  If it exists, but the nlink on the inode is only 1, we will
kill the xattr.  This cleans up the mess left over by an incomplete
lfn_unlink operation.

This resolves the problem with an lfn_link to a second long name that
hashes to the same short_name: we will ignore the old name the moment the
old link goes away.

Fixes: #8701
Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-23 14:48:14 -07:00
Sage Weil
ec36f0a130 os/LFNIndex: remove alt xattr after unlink
After we unlink, if the nlink on the inode is still non-zero, remove the
alt xattr.  We can *only* do this after the rename or unlink operation
because we don't want to leave a file system link in place without the
matching xattr; hence the fsync_dir() call.

Note that this might leak an alt xattr if we happen to fail after the
rename/unlink but before the removexattr is committed.  We'll fix that
next.

Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-23 14:48:14 -07:00
Sage Weil
a320c260a9 os/LFNIndex: FDCloser helper
Add a helper to close fd's when we leave scope.  This is important when
injecting failures by throwing exceptions.

Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-23 14:48:14 -07:00
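
The scope-bound closer described in the commit above can be sketched as follows. This is a minimal illustration, not the actual Ceph helper: the names ScopedFd and fsync_dir_sketch are hypothetical, and the injected failure is modeled with a plain std::runtime_error.

```cpp
#include <cassert>
#include <fcntl.h>
#include <stdexcept>
#include <unistd.h>

// Hypothetical RAII guard: closes the wrapped fd when it leaves scope,
// including when the scope is exited by an exception.
class ScopedFd {
public:
  explicit ScopedFd(int fd) : fd_(fd) {}
  ~ScopedFd() {
    if (fd_ >= 0)
      ::close(fd_);
  }
  ScopedFd(const ScopedFd&) = delete;
  ScopedFd& operator=(const ScopedFd&) = delete;
  int get() const { return fd_; }

private:
  int fd_;
};

// Opens a directory, optionally throws (standing in for an injected
// failure), then fsyncs it. The fd is closed on every exit path.
// The opened fd number is reported through *seen_fd for inspection.
int fsync_dir_sketch(const char* path, bool inject_failure, int* seen_fd) {
  int fd = ::open(path, O_RDONLY);
  if (fd < 0)
    return -1;
  ScopedFd closer(fd);
  if (seen_fd)
    *seen_fd = fd;
  if (inject_failure)
    throw std::runtime_error("injected failure");
  return ::fsync(fd);
}
```

Without the guard, the throwing path would return to the caller with the fd still open; with it, the destructor runs during stack unwinding and the fd is released.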
Sage Weil
b2cdfce646 os/LFNIndex: handle long object names with multiple links (i.e., rename)
When we rename an object (collection_move_rename) to a different name, and
the name is long, we run into problems because the lfn xattr can only track
a single long name linking to the inode.  For example, suppose we have

foobar -> foo_123_0 (attr: foobar) where foobar hashes to 123.

At first, collection_add could only link a file to another file in a
different collection with the same name. Allowing collection_move_rename
to rename the file, however, means that we have to convert:

col1/foobar -> foo_123_0 (attr: foobar)

to

col1/foobaz -> foo_234_0 (attr: foobaz)

This is a problem because if we link, reset xattr, unlink we end up with

col1/foobar -> foo_123_0 (attr: foobaz)

if we restart after we reset the attr.  This will cause the initial foobar
lookup to fail since the attr doesn't match, and the file won't be able to
be looked up.

Fix this by allowing *two* (long) names to link to the same inode.  If we
lfn_link a second (different) name, move the previous name to the "alt"
xattr and set the new name.  (This works because link is always followed
by unlink.)  On lookup, check either xattr.

Don't even bother to remove the alt xattr on unlink.  This works as long
as the old name and new name don't hash to the same shortname and end up
in the same LFN chain.  (Don't worry, we'll fix that next.)

Fixes part of #8701
Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-23 14:48:14 -07:00
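
The two-name scheme from the commit above can be modeled in memory as a rough sketch. The struct and function names here (InodeAttrs, lfn_link_sketch, lfn_matches_sketch) are hypothetical, and real xattr I/O is replaced by plain string fields.

```cpp
#include <cassert>
#include <string>

// Hypothetical in-memory stand-in for an inode's two name xattrs.
struct InodeAttrs {
  std::string lfn;      // primary long-name xattr
  std::string alt_lfn;  // "alt" xattr; empty means not set
};

// Record a new long name for the inode. If a different name is already
// recorded, keep it in the alt slot so that it still resolves.
void lfn_link_sketch(InodeAttrs& ino, const std::string& name) {
  if (!ino.lfn.empty() && ino.lfn != name)
    ino.alt_lfn = ino.lfn;
  ino.lfn = name;
}

// A lookup matches if either xattr holds the requested long name.
bool lfn_matches_sketch(const InodeAttrs& ino, const std::string& name) {
  if (name.empty())
    return false;
  return ino.lfn == name || ino.alt_lfn == name;
}
```

After linking foobar and then foobaz to the same inode, both names match, so a restart between the link and the unlink no longer leaves the old name unresolvable.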
Sage Weil
cf98805c09 ceph_test_objectstore: fix warning
Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-23 14:48:14 -07:00
Samuel Just
6aa48a485e store_test: add long name collection_move_rename tests
Currently fails.

Signed-off-by: Samuel Just <sam.just@inktank.com>
2014-07-23 14:48:14 -07:00
Dan Mick
c0cb56f6c8 ceph.spec.in: add bash completion file for radosgw-admin
Signed-off-by: Sandon Van Ness <sandon@inktank.com>
(cherry picked from commit b700963071)
2014-07-23 13:32:00 -07:00
Dan Mick
1ad4cd386e ceph.spec.in: rhel7-related changes:
udev rules: /lib -> /usr/lib
/sbin binaries move to /usr/sbin or %{_sbindir}

Signed-off-by: Sandon Van Ness <sandon@inktank.com>
(cherry picked from commit 235e4c7de8)
2014-07-23 13:31:53 -07:00
Dan Mick
c57811fc4b Fix/add missing dependencies:
- rbd-fuse depends on librados2/librbd1
- ceph-devel depends on specific releases of libs and libcephfs_jni1
- librbd1 depends on librados2
- python-ceph does not depend on libcephfs1

Signed-off-by: Sandon Van Ness <sandon@inktank.com>
(cherry picked from commit 7cf8132239)
2014-07-23 13:31:47 -07:00
Dan Mick
793e05a27a ceph.spec.in: whitespace fixes
Signed-off-by: Sandon Van Ness <sandon@inktank.com>
(cherry picked from commit ec8af52a5e)
2014-07-23 13:31:39 -07:00
Dan Mick
dae6ecbc31 ceph.spec.in: split out ceph-common as in Debian
Move files, postun scriptlet, and add dependencies on ceph-common
where appropriate

Signed-off-by: Sandon Van Ness <sandon@inktank.com>
(cherry picked from commit e131b9d5a5)
2014-07-23 13:31:22 -07:00
Sage Weil
a05a0da3b1 common/random_cache: fix typo
Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-23 10:11:59 -07:00
Sage Weil
54ea5c13ac Merge pull request #2136 from yuyuyu101/fix-randomcache
common/RandomCache: Fix inconsistence between contents and count

Reviewed-by: Sage Weil <sage@redhat.com>
2014-07-23 09:57:59 -07:00
Haomai Wang
5efdc6236c common/RandomCache: Fix inconsistence between contents and count
The add/clear methods may leave count inconsistent with the real size of
contents.

Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
2014-07-23 11:31:46 +08:00
Sage Weil
356af4bf46 osd/ReplicatedPG: debug obc locks
Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-22 18:01:14 -07:00
Sage Weil
6fe27823b8 osd/ReplicatedPG: greedily take write_lock for copyfrom finish, snapdir
In the cases where we are taking a write lock and are careful
enough that we know we should succeed (i.e., we assert(got)),
use the get_write_greedy() variant that skips the checks for
waiters (be they ops or backfill) that are normally necessary
to avoid starvation.  We don't care about starvation here
because our op is already in-progress and can't easily be
aborted, and new ops won't start because they do make those
checks.

Fixes: #8889
Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-22 18:00:34 -07:00
Sage Weil
09626501d7 osd: allow greedy get_write() for ObjectContext locks
Several lockers need to take a write lock because an operation
is already in progress and they know it is safe to do so.  In
particular, they need to skip the starvation checks (op
waiters, backfill waiting).

Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-22 18:00:09 -07:00
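
The difference between the normal and the greedy write-lock paths can be sketched with a single-threaded model. Only the name get_write_greedy() comes from the commit messages; the struct ObcLockSketch, its fields, and the choice to treat the write lock as exclusive are assumptions of this sketch, not the actual ObjectContext code.

```cpp
#include <cassert>

// Hypothetical, single-threaded model of an object-context lock state.
struct ObcLockSketch {
  int readers = 0;
  bool writer = false;
  int waiters = 0;                // queued ops waiting on this object
  bool backfill_waiting = false;  // backfill wants this object

  // Normal path: refuse if anyone is waiting, to avoid starving them.
  bool get_write() {
    if (waiters > 0 || backfill_waiting)
      return false;
    return get_write_greedy();
  }

  // Greedy path: skip the starvation checks; only lock state matters.
  bool get_write_greedy() {
    if (readers > 0 || writer)
      return false;
    writer = true;
    return true;
  }

  void put_write() { writer = false; }
};
```

An op that is already in progress calls the greedy variant and succeeds even with waiters queued, while newly arriving ops still go through get_write() and yield to those waiters.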
Josh Durgin
422218a3b3 Merge pull request #2120 from ceph/wip-8858
Wip 8858

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2014-07-22 16:58:25 -07:00
Gregory Farnum
29401c7e77 Merge pull request #2133 from ceph/wip-8897
os: fix build warnings with name/attr len checks (fixes 8889)

Reviewed-by: Greg Farnum <greg@inktank.com>
2014-07-22 15:36:40 -07:00
João Eduardo Luís
d37b2ac1e3 Merge pull request #2128 from ceph/wip-8851
mon: AuthMonitor: always encode full regardless of keyserver having keys

Reviewed-by: Gregory Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2014-07-22 22:10:17 +01:00
Sage Weil
253ca2b902 os: make name/attr max methods unsigned
This fixes warnings when we use these in MIN/MAX macros against
other unsigned values.

Fixes: #8897
Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-22 13:38:32 -07:00
Sage Weil
daac7508d2 os/KeyValueStore: make get_max_object_name_length() sane
This is getting the NAME_MAX from the OS, but in reality the backend
KV store is the limiter.  And for leveldb, there is no real limit.
Return 4096 for now.

Signed-off-by: Sage Weil <sage@redhat.com>
2014-07-22 13:37:20 -07:00
Josh Durgin
74b386f03e Merge pull request #2129 from ceph/wip-librbd-oc
librbd: reduce cache flush overhead

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Haomai Wang <haomaiwang@gmail.com>
2014-07-22 13:33:24 -07:00
Sage Weil
36265d0db0 Merge pull request #2125 from ceph/wip-memstore
memstore: a few fixes, and enable the tests!

Reviewed-by: Haomai Wang <haomaiwang@gmail.com>
2014-07-22 10:52:40 -07:00
Sage Weil
f7112c5beb Merge pull request #2105 from rootfs/wip-qa-hadoop-wordcount
update hadoop-wordcount test to be able to run on hadoop 2.x. 

Reviewed-by: Sage Weil <sage@redhat.com>
2014-07-22 08:42:03 -07:00
rootfs
e311a085a8 uncomment cleanup command 2014-07-22 11:31:37 -04:00
Wido den Hollander
d87e5b9f60 powerdns: RADOS Gateway backend for bucket directioning
This backend can be used to create one global namespace for multiple
RGW regions.

Using a CNAME DNS response, the traffic is directed towards the RGW region
without using HTTP redirects.
2014-07-22 16:51:05 +02:00
Joao Eduardo Luis
b551ae2bce mon: AuthMonitor: always encode full regardless of keyserver having keys
On clusters without cephx, assuming an admin never added a key to the
cluster, the monitors have empty key servers.  A previous patch had the
AuthMonitor not encoding an empty keyserver as a full version.

As such, whenever the monitor restarts we will have to read the whole
state from disk in the form of incrementals.  This poses a problem upon
trimming, as we do every now and then: whenever we start the monitor, it
will start with an empty keyserver, waiting to be populated from whatever
we have on disk.  This is performed in update_from_paxos(), and the
AuthMonitor will rely on the keyserver version to decide which
incrementals we care about -- basically, all versions > keyserver version.

Although we started with an empty keyserver (version 0) and are expecting
to read state from disk, in this case it means we will attempt to read
version 1 first.  If the cluster has been running for a while now, and
even if no keys have been added, it's fair to assume that version is
greater than 0 (or even 1), as the AuthMonitor also deals with and keeps track
of auth global ids.  As such, we expect to read version 1, then version 2,
and so on.  If we trim at some point however this will not be possible,
as version 1 will not exist -- and we will assert because of that.

This is fixed by ensuring the AuthMonitor keeps track of full versions
of the key server, even if it's of an empty key server -- it will still
keep track of the key server's version, which is incremented each time
we update from paxos even if it is empty.

Fixes: #8851
Backport: dumpling, firefly

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2014-07-22 00:25:37 +01:00
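
The failure mode described in the commit above, and why storing a full version avoids it, can be illustrated with a small model. Everything here is hypothetical (StoreSketch, replay_sketch, the use of 0 to mean "no full version"); it only mirrors the version arithmetic in the message, not the monitor's real storage layer.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical model of on-disk state: the incremental versions still
// present after trimming, plus the latest stored full version.
struct StoreSketch {
  std::set<uint64_t> incrementals;
  uint64_t latest_full = 0;  // 0 means no full version was ever stored
  uint64_t last_committed = 0;
};

// Replay: start from the full version (or 0) and apply incrementals one
// by one. Returns false where the real code would assert, i.e. when a
// needed incremental has been trimmed.
bool replay_sketch(const StoreSketch& s, uint64_t* out_version) {
  uint64_t v = s.latest_full;
  while (v < s.last_committed) {
    if (!s.incrementals.count(v + 1))
      return false;
    ++v;
  }
  *out_version = v;
  return true;
}
```

With versions 1 to 3 trimmed and no full version, replay needs version 1 and fails; once a full version at 3 exists, replay starts there and only needs 4 and 5.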
Ma, Jianpeng
1518fa2092 osd: init local_connection for fast_dispatch in _send_boot()
We were not properly setting up Sessions on the local_connection for
fast_dispatch'ed Messages if the cluster_addr was set explicitly: the OSD
was not in the dispatch list at bind() time (in ceph_osd.cc), and nothing
called it later on. This issue was missed in testing because Inktank only
uses unified NICs.

That led to errors like the following:

When doing an ec-read, I hit a bug that occurred 100% of the time. The messages are:
2014-07-14 10:03:07.318681 7f7654f6e700 -1 osd/OSD.cc: In function
'virtual void OSD::ms_fast_dispatch(Message*)' thread 7f7654f6e700 time
2014-07-14 10:03:07.316782 osd/OSD.cc: 5019: FAILED assert(session)

 ceph version 0.82-585-g79f3f67 (79f3f67491)
 1: (OSD::ms_fast_dispatch(Message*)+0x286) [0x6544b6]
 2: (DispatchQueue::fast_dispatch(Message*)+0x56) [0xb059d6]
 3: (DispatchQueue::run_local_delivery()+0x6b) [0xb08e0b]
 4: (DispatchQueue::LocalDeliveryThread::entry()+0xd) [0xa4a5fd]
 5: (()+0x8182) [0x7f7665670182]
 6: (clone()+0x6d) [0x7f7663a1130d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

To resolve this, we have the OSD invoke ms_handle_fast_connect() explicitly
in send_boot(). It's not really an appropriate location, but we're already
doing a bunch of messenger twiddling there, so it's acceptable for now.

Signed-off-by: Ma Jianpeng <jianpeng.ma@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 9061988ec7)
2014-07-21 14:53:47 -07:00
Josh Durgin
34b0efdec7 ObjectCacher: fix bh_{add,remove} dirty_or_tx_bh accounting
tx buffers need to go on the bh_lru_rest as well, and removing erases
them from (not inserts them into) dirty_or_tx_bh.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2014-07-21 14:13:21 -07:00
Josh Durgin
8a05f1ba0e ObjectCacher: fix dirty_or_tx_bh logic in bh_set_state()
The else-if chain here was wrong. Handling dirty or tx buffers and
errors should be in independent conditions.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2014-07-21 14:13:11 -07:00
Haomai Wang
d3587419da Wait tx state buffer in flush_set
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
2014-07-21 13:31:32 -07:00
Haomai Wang
3c7229a2fe Add rbdcache max dirty object option
librbd calculates the max dirty object count from rbd_cache_max_size, which
is not suitable for every case. If the user sets image order 24, the
calculated result is too small in practice. This increases the overhead of
the trim call, which is invoked on each read/write op.

Now make it an option for tuning; by default this value is still calculated.

Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
2014-07-21 13:31:32 -07:00
Haomai Wang
5cb4b000dd Reduce ObjectCacher flush overhead
The flush op in ObjectCacher iterates over the whole active object set, and
each dirty object may also own several BufferHeads. If the object set is
large, this consumes too much time.

Use dirty_bh instead to reduce overhead. Now only dirty BufferHeads will
be checked.

Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
2014-07-21 13:31:32 -07:00
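
The idea of flushing from a dirty index instead of scanning every cached buffer can be sketched as below. CacheSketch and BufferSketch are hypothetical names, and the write-out is reduced to clearing a flag; the real ObjectCacher tracks BufferHead objects rather than array indices.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

struct BufferSketch {
  bool dirty = false;
};

// Hypothetical cache that keeps an index of dirty buffers alongside the
// full buffer list, so flush does not need to scan clean buffers.
class CacheSketch {
public:
  explicit CacheSketch(std::size_t n) : buffers_(n) {}

  void mark_dirty(std::size_t i) {
    buffers_[i].dirty = true;
    dirty_.insert(i);
  }

  // Visits only the dirty index; returns how many buffers were examined.
  std::size_t flush() {
    std::size_t examined = 0;
    for (std::size_t i : dirty_) {
      buffers_[i].dirty = false;  // stands in for writing the buffer out
      ++examined;
    }
    dirty_.clear();
    return examined;
  }

  std::size_t dirty_count() const { return dirty_.size(); }

private:
  std::vector<BufferSketch> buffers_;
  std::set<std::size_t> dirty_;
};
```

With 1000 cached buffers and two of them dirty, flush() examines two entries rather than 1000, which is the saving the commit message describes.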
Ma, Jianpeng
9061988ec7 osd: init local_connection for fast_dispatch in _send_boot()
We were not properly setting up Sessions on the local_connection for
fast_dispatch'ed Messages if the cluster_addr was set explicitly: the OSD
was not in the dispatch list at bind() time (in ceph_osd.cc), and nothing
called it later on. This issue was missed in testing because Inktank only
uses unified NICs.

That led to errors like the following:

When doing an ec-read, I hit a bug that occurred 100% of the time. The messages are:
2014-07-14 10:03:07.318681 7f7654f6e700 -1 osd/OSD.cc: In function
'virtual void OSD::ms_fast_dispatch(Message*)' thread 7f7654f6e700 time
2014-07-14 10:03:07.316782 osd/OSD.cc: 5019: FAILED assert(session)

 ceph version 0.82-585-g79f3f67 (79f3f67491)
 1: (OSD::ms_fast_dispatch(Message*)+0x286) [0x6544b6]
 2: (DispatchQueue::fast_dispatch(Message*)+0x56) [0xb059d6]
 3: (DispatchQueue::run_local_delivery()+0x6b) [0xb08e0b]
 4: (DispatchQueue::LocalDeliveryThread::entry()+0xd) [0xa4a5fd]
 5: (()+0x8182) [0x7f7665670182]
 6: (clone()+0x6d) [0x7f7663a1130d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

To resolve this, we have the OSD invoke ms_handle_fast_connect() explicitly
in send_boot(). It's not really an appropriate location, but we're already
doing a bunch of messenger twiddling there, so it's acceptable for now.

Signed-off-by: Ma Jianpeng <jianpeng.ma@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2014-07-21 13:13:44 -07:00
Sage Weil
c1c5f4b5f5 Merge pull request #2121 from ceph/wip-dencoder
limit leveldb linkage; move ceph-dencoder back into ceph-common

Reviewed-by: Dan Mick <dan.mick@inktank.com>

RGW patch Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2014-07-21 13:10:02 -07:00
Sage Weil
27f6dbb64a Merge pull request #2067 from thorstenb/wip-janitorial-clang-3
[werror] Fix mismatched tags (struct vs. class) inconsistence

Reviewed-by: Sage Weil <sage@redhat.com>
2014-07-21 09:08:31 -07:00
Thorsten Behrens
b6f3aff766 Fix mismatched tags (struct vs. class) inconsistency
Signed-off-by: Thorsten Behrens <tbehrens@suse.com>
2014-07-21 17:09:17 +02:00
Sage Weil
ff15a43c71 Merge pull request #2111 from ceph/wip-8174
osd: add config for osd_max_object_name_len = 2048 (was hard-coded at 4096)

Reviewed-by: Haomai Wang <haomaiwang@gmail.com>

and the first patch was
Reviewed-by: Samuel Just <sam.just@inktank.com>
2014-07-20 14:21:09 -07:00