Commit Graph

22533 Commits

Author SHA1 Message Date
Dan Mick
bbd343a1d1 rbd: tests for copy with explicit/implicit pool names
Validate change to not assume dest pool == src pool

Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 39180430b9)
2012-11-27 14:06:23 -08:00
Dan Mick
e612afc2c0 rbd: fix import pool assumptions
import allows specifying one image, implicitly or explicitly the
"source" image, even though it's really the destination.  Fix up
the reassignment of 'source' to 'dest', and check for and complain
about specifying two different pools or images for import.

Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit c219698149)
2012-11-27 14:06:21 -08:00
Dan Mick
81d3830738 rbd: change destpool assumptions.
Don't default destpool to srcpool; it's surprising, and
not useful/helpful enough to violate the convention that
"default pool is rbd"

Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 3b0c360528)
2012-11-27 14:06:18 -08:00
Dan Mick
724cfd1b41 rbd: --size fixes
* require --size/-s for both create *and* resize
* explicitly permit create with size 0.

Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 08f47a42b5)
2012-11-27 14:06:15 -08:00
Dan Mick
66b148e3ab rbd: allow parsing image@snap even if --pool given
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit e452df6dad)
2012-11-27 14:06:12 -08:00
Sage Weil
8e9554e175 Merge remote-tracking branch 'gh/wip-mon-workloadgen' into next 2012-11-27 12:54:40 -08:00
Joao Eduardo Luis
3112cd8fbe test: mon: run_test.sh: helper script for the mon's workloadgen
Takes advantage of qa/workunits/mon/workloadgen.sh to avoid duplicating
code.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2012-11-27 20:00:44 +00:00
Joao Eduardo Luis
2a681052b2 qa: workunits: mon: add workloadgen's workunit
Uses test/mon/test_osd_workloadgen to generate a bunch of map
changes

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2012-11-27 20:00:44 +00:00
Joao Eduardo Luis
e1820d870e test: mon: workload generator
User-space tool that interacts with the monitor, with the objective of
generating a workload mimicking a set of OSDs and clients.

As it is, the tool will mimic any number of OSDs, by keeping in-memory
stubs that will act as independent OSDs, generating random operations
that will induce map updates; the client stub, on the other hand,
performs no operations besides connecting to the monitor and whatever
happens between the Objecter class and the monitor (mainly keeping
updated with map updates).

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2012-11-27 20:00:44 +00:00
Joao Eduardo Luis
f5029074da messages: MLog: make ctor's uuid argument a const
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2012-11-27 20:00:44 +00:00
Joao Eduardo Luis
317777436a mon: Monitor: use existing strict_strtol() on parse_pos_long()
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2012-11-27 20:00:44 +00:00
Joao Eduardo Luis
f7276deaff crush: relax the order by which rules and buckets must be defined
Before we only allowed buckets (say, 'root') to be defined *before*
rules.

With this patch, we allow buckets and rules to be defined in any order,
although some care should be taken when creating the plain-text crush
map, or the crushtool will error out when a rule uses a bucket only
defined later on in the file.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2012-11-27 20:00:44 +00:00
Joao Eduardo Luis
1fcccd3ea4 crushtool: rework how verbosity works
'verbose' was a bool that would either be passed as one or zero to class
CrushCompile. However, most messages would only be outputted with a
verbose level > 1.

This patch makes it so that multiple '-v' increase the verbosity level;
i.e., -v means verbose = 1; -v -v means verbose = 2; and so forth.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2012-11-27 20:00:44 +00:00
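The counted-flag behaviour described in 1fcccd3ea4 can be illustrated with a minimal Python sketch. This is not crushtool's actual C++ argument handling; it only shows the idea that each repeated -v raises an integer level. The function name parse_verbosity and the program name are hypothetical.

```python
import argparse


def parse_verbosity(argv):
    """Return the verbosity level: one step per -v on the command line."""
    parser = argparse.ArgumentParser(prog="crushtool-sketch")
    # action="count" increments the stored value each time the flag appears.
    parser.add_argument("-v", dest="verbose", action="count", default=0)
    args = parser.parse_args(argv)
    return args.verbose


if __name__ == "__main__":
    for argv in ([], ["-v"], ["-v", "-v"]):
        print(argv, "->", parse_verbosity(argv))
```

A caller would then compare the returned level against a threshold (for example, only emit detailed messages when the level is greater than 1), which is the situation the commit message describes.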
Sage Weil
15b4ac58b2 Merge remote-tracking branch 'gh/wip-perf' into next
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-11-27 09:29:03 -08:00
Sage Weil
60d8206286 Merge remote-tracking branch 'gh/wip-crush' into next 2012-11-27 09:28:18 -08:00
Danny Al-Gaaf
d4bc3729fd fix syncfs handling in error case
If the call to syncfs() fails, don't try to call syncfs again via
syscall(). If HAVE_SYS_SYNCFS is defined, don't fall through to try
syscall() with SYS_syncfs or __NR_syncfs.

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
2012-11-27 08:52:52 -08:00
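The control flow described in d4bc3729fd can be sketched in Python as follows. This is an illustration only, not the C code from the commit: the flag have_sys_syncfs stands in for the HAVE_SYS_SYNCFS build-time check, and the two callables are hypothetical stand-ins for syncfs() and syscall() with SYS_syncfs or __NR_syncfs.

```python
def sync_filesystem(fd, have_sys_syncfs, libc_syncfs, raw_syscall_syncfs):
    """Pick exactly one syncfs variant and return its result.

    All parameters other than fd are hypothetical stand-ins used to make the
    selection logic visible and testable.
    """
    if have_sys_syncfs:
        # Return whatever syncfs() reports, success or failure. A failure
        # here must not fall through to the raw syscall variant.
        return libc_syncfs(fd)
    # Only reached when the direct syncfs() call is not available at all.
    return raw_syscall_syncfs(fd)
```

The point of the early return is that an error from the first variant is reported to the caller as is, rather than being masked by a second attempt through a different entry point.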
Sage Weil
16215d9ca8 osdc/ObjectCacher: remove unused waitfor_{rd,wr}
Signed-off-by: Sage Weil <sage@inktank.com>
2012-11-26 21:13:35 -08:00
Sage Weil
011d1e79ab osdc/ObjectCacher: *do* pin object during write
This hopefully resolves #3431.

We originally did this in 46897fd4ff, and
then reverted in caed0e917f.

The current conundrum:
 - commit_set() will issue a write and queue a waiter on a tid
 - discard will discard all BufferHeads and unpin the object
 - trim will try to close and fail assert(ob->can_close())

But:
 - we can't wake the waiter on discard because we don't know what range(s)
   it is waiting for; discard needn't be the whole object.

So: pin the object so it doesn't get trimmed, and unpin when we write.

Adjust can_close() so that it is based on the lru pin status, and assert
that pinned implies the previous conditions are all true.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
2012-11-26 21:13:32 -08:00
Sage Weil
6efe977f3d mon, osd: adjust msgr requires for CRUSH_TUNABLES2 feature
Make this code a bit more manageable for multiple features.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-11-26 17:15:45 -08:00
Sage Weil
0cc47ff682 crush: introduce CRUSH_TUNABLES2 feature
For the chooseleaf_descend_once flag.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-11-26 17:15:45 -08:00
Jim Schutt
88f218181a crush: for chooseleaf rules, retry CRUSH map descent from root if leaf is failed
Consider the CRUSH rule
  step chooseleaf firstn 0 type <node_type>

This rule means that <n> replicas will be chosen in a manner such that
each chosen leaf's branch will contain a unique instance of <node_type>.

When an object is re-replicated after a leaf failure, if the CRUSH map uses
a chooseleaf rule the remapped replica ends up under the <node_type> bucket
that held the failed leaf.  This causes uneven data distribution across the
storage cluster, to the point that when all the leaves but one fail under a
particular <node_type> bucket, that remaining leaf holds all the data from
its failed peers.

This behavior also limits the number of peers that can participate in the
re-replication of the data held by the failed leaf, which increases the
time required to re-replicate after a failure.

For a chooseleaf CRUSH rule, the tree descent has two steps: call them the
inner and outer descents.

If the tree descent down to <node_type> is the outer descent, and the descent
from <node_type> down to a leaf is the inner descent, the issue is that a
down leaf is detected on the inner descent, so only the inner descent is
retried.

In order to disperse re-replicated data as widely as possible across a
storage cluster after a failure, we want to retry the outer descent. So,
fix up crush_choose() to allow the inner descent to return immediately on
choosing a failed leaf.  Wire this up as a new CRUSH tunable.

Note that after this change, for a chooseleaf rule, if the primary OSD
in a placement group has failed, choosing a replacement may result in
one of the other OSDs in the PG colliding with the new primary.  That
OSD's data for that PG then needs to be moved as well.  This
seems unavoidable but should be relatively rare.

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
2012-11-26 17:15:45 -08:00
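The difference between retrying the inner descent and retrying the outer descent, as described in 88f218181a, can be seen in a toy Python model. This is a heavily simplified sketch and not crush_choose(): it picks a single leaf rather than <n> replicas, it uses a two-level map of hosts and leaves, and the hash (SHA-256 via _h) is a stand-in for the real CRUSH hash. The names choose_leaf, descend_once and max_tries are assumptions made for illustration.

```python
import hashlib


def _h(*parts):
    """Deterministic stand-in for a CRUSH hash (not the real one)."""
    digest = hashlib.sha256(repr(parts).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def choose_leaf(x, hosts, failed, descend_once, max_tries=50):
    """Pick one healthy leaf for input x from hosts ({host: [leaf, ...]}).

    descend_once=False: a failed leaf triggers retries inside the same host.
    descend_once=True: a failed leaf makes the inner descent return at once,
    so the host choice (outer descent) is retried instead.
    """
    names = sorted(hosts)
    for outer in range(max_tries):
        host = names[_h(x, "host", outer) % len(names)]
        leaves = hosts[host]
        inner_tries = 1 if descend_once else max_tries
        for inner in range(inner_tries):
            leaf = leaves[_h(x, "leaf", outer, inner) % len(leaves)]
            if leaf not in failed:
                return host, leaf
    return None


if __name__ == "__main__":
    hosts = {
        "h0": ["osd.0", "osd.1"],
        "h1": ["osd.2", "osd.3"],
        "h2": ["osd.4", "osd.5"],
    }
    failed = {"osd.0"}
    # Inputs that mapped to osd.0 while every leaf was healthy.
    moved = [x for x in range(600)
             if choose_leaf(x, hosts, set(), False)[1] == "osd.0"]
    for mode in (False, True):
        landing = sorted({choose_leaf(x, hosts, failed, mode)[0] for x in moved})
        print("descend_once =", mode, "->", landing)
```

In this model, inputs that used to map to the failed leaf stay under the same host when only the inner descent is retried, and spread over several hosts when the outer descent is retried, which mirrors the distribution argument in the commit message.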
Yehuda Sadeh
0beeb47c43 rgw: document ops logging setup
Fixes: #3530

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
2012-11-26 15:55:00 -08:00
Yehuda Sadeh
6bc32b2008 rgw: usage REST api handles categories
Fixes: #3528
The usage REST api was missing the categories filter.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
2012-11-26 15:55:00 -08:00
Sage Weil
94423ac90f perfcounters: fl -> time, use u64 nsec instead of double
(Almost) all current float users are actually time values, so switch to
a utime_t-based interface and internally using nsec in a u64.  This avoids
using floating point in librbd, which is problematic for windows VMs that
leave the FPU in an unfriendly state.

There are two non-time users in the mds and osd that log the CPU load.
Just multiply those values by 100 and report as ints instead.

Fixes: #3521
Signed-off-by: Sage Weil <sage@inktank.com>
2012-11-26 15:30:25 -08:00
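The representation change described in 94423ac90f (time values kept as integer nanoseconds in a 64-bit counter instead of a double) can be sketched in Python. This is an illustration of the idea, not the perfcounters code itself; TimeCounter, to_nsec and load_as_int are hypothetical names, and the sketch only range-checks each added value.

```python
NSEC_PER_SEC = 1_000_000_000
U64_MAX = 2**64 - 1


def to_nsec(sec, nsec):
    """Convert a (seconds, nanoseconds) pair to one integer nanosecond count."""
    total = sec * NSEC_PER_SEC + nsec
    if not 0 <= total <= U64_MAX:
        raise OverflowError("value does not fit in an unsigned 64-bit counter")
    return total


class TimeCounter:
    """Accumulates durations as integer nanoseconds, with no floating point."""

    def __init__(self):
        self.total_nsec = 0

    def add(self, sec, nsec):
        self.total_nsec += to_nsec(sec, nsec)

    def as_sec_nsec(self):
        # Integer division and remainder recover the (sec, nsec) pair.
        return divmod(self.total_nsec, NSEC_PER_SEC)


def load_as_int(load):
    """Report a fractional load value as an integer scaled by 100."""
    return int(round(load * 100))
```

For example, adding 1.5 s and 0.75 s as (1, 500000000) and (0, 750000000) leaves the counter holding 2250000000 ns, and a load of 1.25 would be reported as 125.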
Sage Weil
3a0ee8e49d perfcounters: add 'perf' option to disable perf counters
Signed-off-by: Sage Weil <sage@inktank.com>
2012-11-26 15:30:25 -08:00
Alexandre Oliva
b1c71088bb logrotate on systems without invoke-rc.d
The which command doesn't output anything to stdout when it can't find
the given program name, and then [ -x ] passes.  Use the exit status
of which to tell whether the command exists, before testing whether
it's executable, to fix it.

Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
2012-11-26 15:04:08 -08:00
Alexandre Oliva
a37c34debd Search for srcdir/.git in check_version
Support srcdir != . looking for .git in srcdir when computing the ceph
release and git tag.

Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
2012-11-26 15:04:08 -08:00
Yehuda Sadeh
74b2a2d964 rgw: POST requests do not default to init multipart upload
Fixes: #3516
We don't default to init multipart upload request when
getting S3 POST. This way when the request is not really
init multipart upload we'd end up sending a 405 response
instead of 500. Also, it's cleaner this way.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
2012-11-26 12:29:40 -08:00
Noah Watkins
1f8c32347b java: add ceph_open_layout interface
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
2012-11-26 11:15:47 -08:00
Noah Watkins
f0c608c0d6 client: add ceph_open_layout interface
Adds an interface identical to ceph_open() that takes additional
parameters specifying a file layout to use on new files.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-11-26 11:15:47 -08:00
Josh Durgin
365ba0600b qa: add script to run objectcacher tests
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2012-11-26 10:37:43 -08:00
Sage Weil
30669d6d87 Merge remote-tracking branch 'gh/wip-upstart' into next 2012-11-26 08:38:25 -08:00
Sage Weil
525f942edc init-ceph: do not make noise about missing devs
It is pretty normal not to include the devs line in the ceph.conf.  Do not
print/warn about it.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-11-26 08:37:45 -08:00
Sage Weil
bc32fc42d2 syncfs: check for __NR_syncfs too
Also make the filestore startup tell us *all* variants that are
supported, not just the first one.

Tested-by: Stefan Priebe <s.priebe@profihost.ag>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-11-25 13:29:52 -08:00
Sage Weil
6890675b87 monmap: fix crash from dup initial seed mons
Fix bug reproduced by

	-m hostname,ip_that_hostname_resolves_to

Backport: argonaut
Reported-by: Drunkard Zhang <gongfan193@gmail.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-11-25 09:34:02 -08:00
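The commit message for 6890675b87 gives the reproducer but not the fix. One plausible way to avoid duplicate initial seeds is to compare entries after name resolution, sketched below in Python. This is an assumption about the general approach, not the code from the commit; build_initial_seeds and the resolve callable are hypothetical, and a lookup table replaces real DNS so the example is self-contained.

```python
def build_initial_seeds(mon_hosts, resolve):
    """Return resolved addresses with duplicates dropped, in first-seen order.

    mon_hosts is a comma-separated string as passed to -m; resolve is a
    hypothetical callable mapping a hostname or address to an address.
    """
    seen = set()
    seeds = []
    for entry in mon_hosts.split(","):
        entry = entry.strip()
        if not entry:
            continue
        addr = resolve(entry)
        if addr in seen:
            # Same monitor named twice, e.g. once by name and once by IP.
            continue
        seen.add(addr)
        seeds.append(addr)
    return seeds


if __name__ == "__main__":
    table = {"mon-a": "10.0.0.1"}  # hypothetical name-to-address mapping
    print(build_initial_seeds("mon-a,10.0.0.1", lambda n: table.get(n, n)))
```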
Sage Weil
7602a05576 osdc/ObjectCacher: fix BufferHead leak on ENOENT
This was detected by fsstress over ceph-fuse under valgrind.

Signed-off-by: Sage Weil <sage@inktank.com>
2012-11-24 10:05:52 -08:00
Sage Weil
8a03d50146 Merge remote-tracking branch 'gh/wip-mon-misc-fixes' into next 2012-11-24 09:16:13 -08:00
Danny Al-Gaaf
df550c9cce make mkcephfs and init-ceph osd filesystem handling more flexible
Remove btrfs specific keys and replace them by more generic
keys to be able to replace btrfs with e.g. xfs or ext4 easily.

Add new key to define the osd fs type: 'osd mkfs type', which can
get defined in the [osd] section for all OSDs.

Replaced config keys:
- 'btrfs devs' -> 'devs'
- 'btrfs path' -> 'fs path'
- 'btrfs options' -> 'osd mount options $fstype'

New config keys:
- 'osd mkfs options $fstype': file system specific options for mkfs
- 'osd mkfs type': to define the filesystem for mkfs and also mount

Replaced in mkcephfs: --mkbtrfs with --mkfs

Replaced in init-ceph:
- --btrfs with --fsmount
- --nobtrfs with --nofsmount
- --btrfsumount with --fsumount

NOTE: old options from mkcephfs and init-ceph will still work, but
      may be removed from the scripts in the future.

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
2012-11-23 19:14:52 -08:00
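The key renames listed in df550c9cce can be expressed as a small translation table. The Python sketch below is a hypothetical helper, not part of mkcephfs or init-ceph; it assumes that $fstype in the new key names is replaced by the concrete file system name (for example xfs), and it represents one [osd] section as a plain dictionary.

```python
def migrate_osd_keys(conf, fstype):
    """Return a copy of conf with the old btrfs-specific keys renamed.

    conf is a dict of config keys for one section; fstype is the file system
    name substituted for $fstype in the new key names.
    """
    renames = {
        "btrfs devs": "devs",
        "btrfs path": "fs path",
        "btrfs options": "osd mount options " + fstype,
    }
    out = {}
    for key, value in conf.items():
        out[renames.get(key, key)] = value
    # Record the file system type unless the section already sets it.
    out.setdefault("osd mkfs type", fstype)
    return out


if __name__ == "__main__":
    old = {"btrfs devs": "/dev/sdb", "btrfs options": "noatime"}
    print(migrate_osd_keys(old, "xfs"))
```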
Joao Eduardo Luis
96b82ebf87 mon: Monitor: wake up contexts based on paxos machine's state
When recovering the leader, only wake up a paxos machine's contexts if
the paxos machine is in a state that can handle said contexts.

Fixes: #3495

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2012-11-23 19:13:07 +00:00
Joao Eduardo Luis
3b061ab9d3 mon: AuthMonitor: increase log levels when logging secrets
Fixes: #3361

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2012-11-23 19:13:05 +00:00
Joao Eduardo Luis
7527a1ea6c auth: Keyring: increase log levels when logging secrets
Fixes: #3361

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2012-11-23 19:13:03 +00:00
Joao Eduardo Luis
deabdc8a10 auth: cephx: increase log levels when logging secrets
We understand that logging secrets may be useful when debugging the root
causes for auth issues. However, logging secrets is far from a good idea.
Therefore, just increase the log levels to a high enough value so that
most other debug infos can be obtained without even logging the secrets.
If one really wants to log the secrets, then setting --debug-auth 30 should
do the trick.

Fixes: #3361

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2012-11-23 19:13:01 +00:00
Joao Eduardo Luis
d6cf77dcbb crush: CrushWrapper: don't add item to a bucket with != type than wanted
We take little consideration about the type of the bucket we are adding
an item to. Although this works for the vast majority of cases, it was
also leaving room for silly little mistakes to become problematic and
leading a monitor to crash.

For instance, say that we ran:
  'ceph osd crush set 0 osd.0 1 root=foo row=foo'

If root 'foo' exists, then this will work and 'row=foo' will be ignored.
However, if there is no bucket named 'foo', then we would (in order)
create a bucket for row 'foo', adding osd.0 to it, and would then add
osd.0 to bucket 'foo' again -- remember, little consideration regarding
the bucket type was given.

This would trigger a monitor crash due to the recursion done in
'adjust_item_weight'. A solution to this problem is to make sure that we
do not allow specifying multiple buckets with the same name when adding
an item to crush. This not only solves our crash problem, but will also render
invalid any mistake when specifying the wrong bucket type (say, using
'row=bar' when in fact 'bar' is a rack).

Fixes: #3515

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2012-11-23 19:12:57 +00:00
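The check described in d6cf77dcbb (reject a location that names the same bucket more than once) can be sketched in Python. This is a minimal illustration, not the CrushWrapper code; check_location is a hypothetical name, the location is modelled as a {type: name} dictionary, and returning -EINVAL is an assumption borrowed from common errno conventions.

```python
import errno


def check_location(loc):
    """Return 0 if every bucket name in loc ({type: name}) is distinct.

    Returns -EINVAL when two bucket types share a name, as in
    root=foo row=foo.
    """
    names = list(loc.values())
    if len(names) != len(set(names)):
        return -errno.EINVAL
    return 0


if __name__ == "__main__":
    print(check_location({"root": "foo", "row": "foo"}))
    print(check_location({"root": "default", "row": "r1"}))
```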
Joao Eduardo Luis
95e1fe8822 mon: PGMonitor: check if pg exists when handling 'pg map <PG>'
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
2012-11-23 19:12:48 +00:00
Yehuda Sadeh
ab8327fec0 Merge remote-tracking branch 'origin/next' into next 2012-11-22 14:59:25 -08:00
Sage Weil
1c715a11f7 mds: child directory inherits SGID bit
Update the inode, not the local variable.

Reported-by: Giorgos Kappes <geokapp@gmail.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-11-22 13:53:29 -08:00
Yehuda Sadeh
3110e5ca42 Merge remote-tracking branch 'origin/next' into next 2012-11-22 12:57:33 -08:00
Yehuda Sadeh
a0e8452a09 Merge branch 'wip-opslog-socket2' into next
Conflicts:
	src/rgw/rgw_main.cc
2012-11-22 12:55:35 -08:00
Sage Weil
55081c2bea crush: prevent loops from insert_item
If the insertion would create a loop, return -EINVAL.

Fixes: #3515
Signed-off-by: Sage Weil <sage@inktank.com>
2012-11-22 09:17:34 -08:00
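The loop check described in 55081c2bea can be sketched in Python as a reachability test: inserting an item under a bucket creates a loop if the bucket is the item itself or already lies somewhere beneath it. This is an illustration under assumptions, not the crush code; the hierarchy is modelled as a {parent: [children]} dictionary, and the function names are hypothetical.

```python
import errno


def subtree_contains(children, root, target):
    """Return True if target is root or lies anywhere beneath root."""
    stack = [root]
    seen = set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(children.get(node, ()))
    return False


def insert_item(children, bucket, item):
    """Add item under bucket unless that would create a loop.

    Returns 0 on success and -EINVAL if the insertion would create a loop.
    """
    if subtree_contains(children, item, bucket):
        return -errno.EINVAL
    children.setdefault(bucket, []).append(item)
    return 0
```

With a hierarchy root -> rack1 -> host1, inserting root under host1 is rejected because host1 is already reachable from root, while inserting a new leaf under host1 succeeds.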
Dan Mick
b706945ae9 Try using syscall() for syncfs if not supported directly by glibc
Signed-off-by: Dan Mick <dan.mick@inktank.com>
2012-11-22 08:50:44 -08:00