Commit Graph

36168 Commits

Author SHA1 Message Date
Sage Weil
acd49892df Merge pull request #2604 from athanatos/wip-9113
ReplicatedPG: clean out completed trimmed objects as we go

Reviewed-by: Sage Weil <sage@redhat.com>
2014-09-29 14:02:15 -07:00
Samuel Just
78fc7b8198 Merge pull request #2549 from ceph/wip-9545
os/FileJournal: do not request sync while shutting down

Reviewed-by: Samuel Just <sam.just@inktank.com>
2014-09-29 13:54:31 -07:00
Samuel Just
f91c571ef6 Merge pull request #2550 from ceph/wip-8629
osd: fix cache_evict vs make_writeable/finish_ctx snapdir bug #8629

Reviewed-by: Samuel Just <sam.just@inktank.com>
2014-09-29 13:52:21 -07:00
Samuel Just
ffda34c4be Merge pull request #2510 from somnathr/wip-obj-delete-fix
FileStore: Race condition during object delete is fixed

Reviewed-by: Samuel Just <sam.just@inktank.com>
2014-09-29 13:44:37 -07:00
Sage Weil
b2416240b8 ceph.spec: fix python-flask dependency
This is needed by ceph-rest-api, which is in ceph.rpm; it's not related to
python-ceph (except that ceph-rest-api happens to require that too).

Backport: firefly
Signed-off-by: Sage Weil <sage@redhat.com>
2014-09-29 13:44:03 -07:00
Sage Weil
e42424e777 debian: python-flask is needed by ceph, not python-ceph
It's used by ceph-rest-api which is in the 'ceph' (server) package.

Backport: firefly
Signed-off-by: Sage Weil <sage@redhat.com>
2014-09-29 13:40:18 -07:00
Sage Weil
614157c288 Merge pull request #2598 from ceph/wip-9582
librados: fix other timeout segfault

Reviewed-by: Greg Farnum <greg@inktank.com>
2014-09-29 13:08:10 -07:00
Sage Weil
9af9df42f2 Merge pull request #2594 from dachary/wip-9620-test-mon-thrash
qa/workunits/cephtool/test.sh: fix thrash (ultimate)

Reviewed-by: Sage Weil <sage@redhat.com>
2014-09-29 08:18:36 -07:00
Loic Dachary
beade63a17 qa/workunits/cephtool/test.sh: fix thrash (ultimate)
Keep the osd thrash test to ensure it is a valid command but make it a
noop by giving it a zero argument (meaning thrash 0 OSD maps).

Remove the loops that were added after the command in an attempt to wait
for the cluster to recover and not pollute the rest of the tests. Actual
testing of osd thrash would require a dedicated cluster because the
side effects are random and it is unnecessarily difficult to ensure they
are finished.

http://tracker.ceph.com/issues/9620 Fixes: #9620

Signed-off-by: Loic Dachary <loic-201408@dachary.org>
2014-09-29 13:47:06 +02:00
Dan van der Ster
f8ac2248af ceph-disk: add Scientific Linux as a Redhat clone
Scientific Linux is a RHEL clone and needs to use partx.

Signed-off-by: Dan van der Ster <daniel.vanderster@cern.ch>
(cherry picked from commit 5ca7ea5b53)
2014-09-26 17:46:15 -07:00
Sage Weil
5c2984e6e1 Merge pull request #2531 from dachary/wip-9536-isa-alignment
erasure-code: isa plugin alignment fixes

Reviewed-by: Sage Weil <sage@redhat.com>
2014-09-25 14:05:57 -07:00
Sage Weil
d851c3f233 osd: improve debug output for do_{notifies,queries,infos}
Hunting #9389

Signed-off-by: Sage Weil <sage@redhat.com>
2014-09-25 13:51:46 -07:00
Sage Weil
2ba5ed57b3 Merge pull request #2540 from ceph/wip-giant-messenger-fixes
giant messenger fixes

Reviewed-by: Sage Weil <sage@redhat.com>
2014-09-25 13:01:38 -07:00
Sage Weil
126d0b30e9 osdc/Objecter: only post_rx_buffer if no op timeout
If we post an rx buffer and there is a timeout, the revocation can happen
while the reader has consumed the buffers but before it has decoded and
constructed the message.  In particular, we calculate a crc32c over the
data portion of the message after we've taken the buffers and dropped the
lock.

Instead of fixing this race (for example, by reverifying rx_buffers under
the lock while calculating the crc.. bleh), just skip the rx buffer
optimization entirely when a timeout is present.

Note that this doesn't cover the op_cancel() paths, but none of those users
provide static buffers to read into.

Fixes: #9582
Backport: firefly, dumpling
Signed-off-by: Sage Weil <sage@redhat.com>
2014-09-25 12:34:11 -07:00
Sage Weil
0115a55aa3 Merge pull request #2574 from ceph/wip-msgr-shutdown
msg: allow calling dtor immediately after ctor

Reviewed-by: Sage Weil <sage@redhat.com>
2014-09-25 09:26:18 -07:00
Loic Dachary
ba02a5e638 erasure-code: test isa encode/decode with various object sizes
Create an encode_decode() helper method to be called from the
encode_decode test function with various object size arguments. The
helper method is a copy/paste of the previous test that was using a
single object of a fixed size. The test is slightly adapted to
accommodate different object sizes but the logic is not modified.

The object sizes being tested are chosen to be under the size of the
required size alignment or on multiple pages, size aligned or not.

Signed-off-by: Loic Dachary <loic-201408@dachary.org>
2014-09-25 18:05:01 +02:00
Loic Dachary
eb8fdfa4f5 erasure-code: add test for isa chunk_size method
Signed-off-by: Loic Dachary <loic-201408@dachary.org>
2014-09-25 18:04:58 +02:00
John Spray
7a468f358b msg: allow calling dtor immediately after ctor
Asserting on reaper_stop only made sense if the
messenger had ever been started: as it stood,
one couldn't create and destroy a messenger
without also starting and stopping it.
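
A minimal sketch of the relaxed check, assuming a toy class rather than the real messenger: the names `ToyMessenger`, `started`, `reaper_stopped` and `destroy` are hypothetical. The reaper-stopped invariant is only enforced when the messenger was actually started.

```python
class ToyMessenger:
    """Hypothetical stand-in for a messenger with a reaper thread."""

    def __init__(self):
        self.started = False
        self.reaper_stopped = False

    def start(self):
        self.started = True

    def shutdown(self):
        self.reaper_stopped = True

    def destroy(self):
        # Only a messenger that was started must have stopped its reaper.
        if self.started and not self.reaper_stopped:
            raise RuntimeError("destroyed while the reaper is still running")
```

With this check, `ToyMessenger().destroy()` succeeds without an intervening `start()` and `shutdown()`.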

Signed-off-by: John Spray <john.spray@redhat.com>
2014-09-25 17:01:10 +01:00
Loic Dachary
af07d29e27 erasure-code: isa encode tests adapted to per chunk alignment
The encode tests use the alignment constraints. It has been changed to
be aligned on a per chunk basis instead of computing a more expensive
object alignment constraint. The test function is modified to take the
change into account but the logic is otherwise unmodified.

Signed-off-by: Loic Dachary <loic-201408@dachary.org>
2014-09-25 17:39:16 +02:00
Loic Dachary
aa9d70be38 erasure-code: isa test compare chunks with memcmp instead of strncmp
Because they may contain null characters.
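
The difference can be illustrated with a small Python emulation (the actual tests are C++; the helper names below are hypothetical). A `strncmp`-style comparison stops at the first NUL byte, so two chunks that differ only after a NUL look equal, while a `memcmp`-style comparison examines every byte.

```python
def strncmp_equal(a: bytes, b: bytes, n: int) -> bool:
    """Emulate C `strncmp(a, b, n) == 0`: stop at the first NUL byte."""
    for i in range(n):
        ca = a[i] if i < len(a) else 0
        cb = b[i] if i < len(b) else 0
        if ca != cb:
            return False
        if ca == 0:
            # Both strings ended here; later bytes are never inspected.
            return True
    return True


def memcmp_equal(a: bytes, b: bytes, n: int) -> bool:
    """Emulate C `memcmp(a, b, n) == 0`: compare all n bytes."""
    return a[:n] == b[:n]


# Two chunks that differ only after an embedded NUL byte.
chunk_a = b"ab\x00cd"
chunk_b = b"ab\x00xy"
```

Here `strncmp_equal(chunk_a, chunk_b, 5)` reports the chunks as equal, whereas `memcmp_equal(chunk_a, chunk_b, 5)` detects the difference.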

Signed-off-by: Loic Dachary <loic-201408@dachary.org>
2014-09-25 17:39:16 +02:00
Loic Dachary
ed77178e7d erasure-code: run isa tests via libtool and valgrind
Because running valgrind with no libtool does not test the binary but
the enclosing shell script.

Signed-off-by: Loic Dachary <loic-201408@dachary.org>
2014-09-25 17:39:16 +02:00
Loic Dachary
668c352721 erasure-code: do not use typed tests for isa
Because there is only one type.

Signed-off-by: Loic Dachary <loic-201408@dachary.org>
2014-09-25 17:39:16 +02:00
Loic Dachary
28c2b6e4f2 erasure-code: isa uses per chunk alignment constraints
Copy code from the jerasure plugin to enforce alignment constraints per
chunk instead of using the total object size. It is simpler and reduces
the size of the chunks. See
c7daaaf5e6
for more information.
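
A rough sketch of a per chunk alignment computation, assuming the object is split into `k` data chunks and each chunk is padded up to a 32-byte boundary. The function name and the default alignment are illustrative assumptions, not the plugin's actual code.

```python
def chunk_size_per_chunk(object_size: int, k: int, alignment: int = 32) -> int:
    """Return the size of one chunk, padded to `alignment` bytes.

    Hypothetical helper: the object is divided into k chunks (rounding up),
    then only the chunk size is padded, not the whole object.
    """
    chunk = -(-object_size // k)  # ceiling division
    remainder = chunk % alignment
    if remainder:
        chunk += alignment - remainder
    return chunk
```

For example, with `k=2` a 4096-byte object needs no padding (2048 bytes per chunk), while a 4097-byte object yields 2080-byte chunks.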

Signed-off-by: Loic Dachary <loic-201408@dachary.org>
2014-09-25 17:39:10 +02:00
Andreas Peters
6f4909ae59 erasure-code: [ISA] modify get_alignment function to imply a platform/compiler independent alignment constraint of 32-byte aligned buffer addresses & length
2014-09-25 17:37:27 +02:00
Guang Yang
0f884fdb31 For pgls OP, get/put budget on per list session basis, instead of per OP basis, which could lead to deadlock.
Signed-off-by: Guang Yang (yguang@yahoo-inc.com)
2014-09-25 00:47:46 +00:00
Samuel Just
7f87cf1b1d ReplicatedPG: clean out completed trimmed objects as we go
Also, explicitly maintain a max number of concurrently trimming
objects.

Fixes: 9113
Backport: dumpling, firefly, giant
Signed-off-by: Samuel Just <sam.just@inktank.com>
2014-09-24 15:33:11 -07:00
Loic Dachary
468d245a02 Merge pull request #2506 from dachary/wip-9304-unintended-implicit-ruleset
erasure-code: pool create must not always create a ruleset

Reviewed-by: João Eduardo Luís <joao@redhat.com>
2014-09-24 13:35:55 +02:00
Samuel Just
c17ac03a50 ReplicatedPG: don't move on to the next snap immediately
If we have a bunch of trimmed snaps for which we have no
objects, we'll spin for a long time.  Instead, requeue.

Fixes: #9487
Backport: dumpling, firefly, giant
Reviewed-by: Sage Weil <sage@redhat.com>
Signed-off-by: Samuel Just <sam.just@inktank.com>
2014-09-23 16:28:04 -07:00
Sage Weil
255b430a87 osd: initialize purged_snap on backfill start; restart backfill if change
If we backfill a PG to a new OSD, we currently neglect to initialize
purged_snaps.  As a result, the first time the snaptrimmer runs it has to
churn through every deleted snap for all time, and to make matters worse
does so in one go with the PG lock held.  This leads to badness on any
cluster with a significant number of removed snaps that experiences
backfill.

Resolve this by initializing purged_snaps when we finish backfill.  The
backfill itself will clear out any stray snaps and ensure the object set
is in sync with purged_snaps.  Note that purged_snaps on the primary
that is driving backfill will not change during this period as the
snaptrimmer is not scheduled unless the PG is clean (which it won't be
during backfill).

If we happen to interrupt backfill, go clean with other OSDs,
purge snaps, and then let this OSD rejoin, we will either restart
backfill (non-contiguous log) or the log will include the result of
the snap trim (the events that remove the trimmed snap).

Fixes: #9487
Backport: firefly, dumpling
Signed-off-by: Sage Weil <sage@redhat.com>
2014-09-23 16:28:04 -07:00
Samuel Just
4be53d5eeb PG: check full ratio again post-reservation
Otherwise, we might queue 30 pgs for backfill at 0.80 fullness
and then never check again, filling the osd after pg 11.
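
The idea can be sketched as follows, assuming a simplified model where each queued PG adds a fixed number of bytes. The function name, parameters and the 0.85 threshold are illustrative assumptions, not the OSD's actual logic.

```python
def start_backfills(queued_pgs, used, capacity, pg_size, full_ratio=0.85):
    """Start queued backfills one by one, re-checking fullness each time.

    Hypothetical model: every PG consumes `pg_size` bytes on the target.
    """
    started = []
    for pg in queued_pgs:
        # Re-check after the reservation is granted, not only when queueing.
        if used / capacity >= full_ratio:
            break
        used += pg_size
        started.append(pg)
    return started
```

With `used=800`, `capacity=1000` and `pg_size=20`, only the first three of the queued PGs are started before the target crosses the threshold; without the re-check all of them would proceed.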

Fixes: #9574
Backport: dumpling, firefly, giant
Signed-off-by: Samuel Just <sam.just@inktank.com>
2014-09-23 12:53:41 -07:00
Sage Weil
f711819df5 Merge pull request #2561 from athanatos/wip-9293
Wip 9293

Reviewed-by: Sage Weil <sage@redhat.com>
2014-09-23 11:40:13 -07:00
Loic Dachary
34e665867e Merge pull request #2557 from ceph/wip-mon-fix-checks
ceph-mon: check fs stats just before preforking

Reviewed-by: Loic Dachary <loic-201408@dachary.org>
2014-09-23 17:59:02 +02:00
Joao Eduardo Luis
7f71c11666 ceph-mon: check fs stats just before preforking
Otherwise statfs may fail if mkfs hasn't been run yet or if the monitor
data directory does not exist.  There are checks to account for the mon
data dir not existing and we should wait for them to clear before we go
ahead and check the fs stats.

Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
2014-09-23 14:05:48 +01:00
Loic Dachary
9d3fbe92a8 Merge pull request #2551 from dachary/wip-9343-erasure-code-feature
erasure code feature

Reviewed-by: João Eduardo Luís <joao@redhat.com>
2014-09-23 13:37:27 +02:00
Loic Dachary
9687150cea erasure-code: isa/lrc plugin feature
There are two new plugins (isa and lrc). When upgrading a cluster, there
must be a protection against the following scenario:

  * the mons are upgraded but not the OSDs
  * a new pool is created using plugin isa
  * the OSDs fail to load the isa plugin because they have not been
    upgraded

A feature bit is added : PLUGINS_V2. The monitor will only agree to
create an erasure code profile for the isa or lrc plugin if all OSDs
support PLUGINS_V2. Once such an erasure code profile is stored in the
OSDMap, an OSD can only boot if it supports the PLUGINS_V2 feature,
which means it is able to load the isa and lrc plugins.

The monitors will only activate the PLUGINS_V2 feature if all monitors
in the quorum support it. It protects against the following scenario:

  * the leader is upgraded but the peons are not
  * the leader creates a pool with plugin=lrc because all OSDs have
    the PLUGINS_V2 feature
  * the leader goes down and a non upgraded peon becomes the leader
  * an old OSD tries to join the cluster
  * the new leader will let the OSD boot because it does not contain
    the logic that would exclude it
  * the old OSD will fail when required to load the plugin lrc

This is going to be needed each time new plugins are added, which is
impractical. A more generic plugin upgrade support should be added
instead, as described in http://tracker.ceph.com/issues/7291.
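
The two gating rules described above can be sketched with feature bitmasks. This is an illustration only: the bit value and the function names are assumptions and do not reflect the real monitor code.

```python
PLUGINS_V2 = 1 << 0            # hypothetical bit position
NEW_PLUGINS = {"isa", "lrc"}   # plugins that require PLUGINS_V2


def can_create_profile(plugin, osd_features, quorum_mon_features):
    """The monitor accepts a profile for a new plugin only if every OSD
    and every monitor in the quorum advertises PLUGINS_V2."""
    if plugin not in NEW_PLUGINS:
        return True
    osds_ok = all(f & PLUGINS_V2 for f in osd_features)
    mons_ok = all(f & PLUGINS_V2 for f in quorum_mon_features)
    return osds_ok and mons_ok


def can_osd_boot(osd_feature, plugins_in_osdmap):
    """Once a profile using a new plugin is in the OSDMap, only OSDs
    that support PLUGINS_V2 may boot."""
    needs_v2 = any(p in NEW_PLUGINS for p in plugins_in_osdmap)
    return (not needs_v2) or bool(osd_feature & PLUGINS_V2)
```

For example, `can_create_profile("lrc", [1, 1], [1, 0])` is refused because one monitor in the quorum lacks the feature, which is the leader failover scenario listed above.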

http://tracker.ceph.com/issues/9343 Refs: #9343

Signed-off-by: Loic Dachary <loic-201408@dachary.org>
2014-09-23 13:34:52 +02:00
Loic Dachary
f51d21b53d erasure-code: restore jerasure BlaumRoth default w
Changing from W=7 to W=6 by default for the BlaumRoth technique is
correct but introduces a regression. The content that was encoded with
the previous version cannot be read again. Although the prime(w+1)
constraint was not obeyed by W=7, the encoded content was useable and
should keep being readable.

W=7 remains the default for backward compatibility, as an exception
to the prime(w+1) check.
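
A minimal sketch of such a check, assuming the rule is "w+1 must be prime, except for the legacy default w=7". The function and constant names are hypothetical and the real plugin may apply further constraints.

```python
LEGACY_DEFAULT_W = 7  # kept readable for content encoded by earlier versions


def is_prime(n: int) -> bool:
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def blaum_roth_w_is_accepted(w: int) -> bool:
    """Accept w when w+1 is prime, plus the legacy default as an exception."""
    if w == LEGACY_DEFAULT_W:
        return True
    return is_prime(w + 1)
```

Under this sketch w=6 and w=10 pass because 7 and 11 are prime, w=8 is rejected because 9 is not, and w=7 is accepted only through the exception.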

http://tracker.ceph.com/issues/9572 Fixes: #9572

Signed-off-by: Loic Dachary <loic-201408@dachary.org>
2014-09-23 11:42:47 +02:00
Sage Weil
7354165c1f Merge pull request #2538 from ceph/wip-mon-data-space-die
mon: die if 'mon data' fs has critically low available disk space & fix logging issues

Reviewed-by: Sage Weil <sage@redhat.com>
2014-09-22 19:16:18 -07:00
Joao Eduardo Luis
89fceb3c36 mon: Monitor: log RO commands on 'debug' level, RWX on 'info'
Fixes: #9455

Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
2014-09-22 17:36:29 +01:00
Joao Eduardo Luis
2c5b12d909 mon: Monitor: use MonCommand::requires_perm() when checking perms
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
2014-09-22 17:36:29 +01:00
Joao Eduardo Luis
bb55862093 mon: Monitor.h: add 'requires_perm()' function to MonCommand struct
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
2014-09-22 17:36:29 +01:00
Joao Eduardo Luis
f1b814e515 mon: Monitor: log RO admin socket commands on 'debug' level
Reduces the noise caused by read-only operations via the admin socket.
RW commands are still logged at 'info' level.
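
One way to picture the rule, as a sketch with hypothetical names (`Command`, `req_perms`, `log_level_for`): the log level is chosen from the permissions a command requires.

```python
import logging
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    """Hypothetical command descriptor with a permission string
    such as "r", "rw" or "rwx"."""
    name: str
    req_perms: str

    def requires_perm(self, perm: str) -> bool:
        return perm in self.req_perms


def log_level_for(cmd: Command) -> int:
    # Anything that can write or execute is logged at 'info';
    # read-only commands drop to 'debug' to reduce noise.
    if cmd.requires_perm("w") or cmd.requires_perm("x"):
        return logging.INFO
    return logging.DEBUG
```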

Fixes: #9455

Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
2014-09-22 17:36:29 +01:00
Joao Eduardo Luis
282bac79b4 mon: LogMonitor: adjust debug messages output levels
Reduce the noise.

Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
2014-09-22 17:36:29 +01:00
Joao Eduardo Luis
9686044a23 mon: LogMonitor: add debug message upon logging to a channel's file
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
2014-09-22 17:36:29 +01:00
Joao Eduardo Luis
3760bc1b46 mon: LogMonitor: appropriately expand channel meta variables
We must only expand the log file's channel meta variables upon requiring
a channel's log file.  As we may have a 'default' channel that will
cover all channels, we must wait to expand channels as they come in and
do so if they haven't yet been expanded.  Expanding the 'log_file' in
place would have the unfortunate side effect of expanding, say,

default=/tmp/whatever.$channel.log

to

default=/tmp/whatever.default.log

which would not be what we wanted upon receiving a message that should
go into channel 'foo' -- assuming we specified no such channel in the
options, channel 'foo' should go into '/tmp/whatever.foo.log'.
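
The lazy expansion can be sketched like this. The class and method names are hypothetical; only the `$channel` variable and the 'default' fallback come from the description above.

```python
class ChannelLogFiles:
    """Keep the configured templates untouched and expand `$channel`
    only when a given channel's log file is first requested."""

    def __init__(self, templates: dict):
        self.templates = dict(templates)  # e.g. {"default": "/tmp/x.$channel.log"}
        self.expanded = {}                # channel name -> expanded path

    def log_file_for(self, channel: str) -> str:
        if channel not in self.expanded:
            # Fall back to the 'default' template for unknown channels.
            template = self.templates.get(channel, self.templates.get("default", ""))
            self.expanded[channel] = template.replace("$channel", channel)
        return self.expanded[channel]


files = ChannelLogFiles({"default": "/tmp/whatever.$channel.log"})
foo_path = files.log_file_for("foo")
```

Because the template itself is never rewritten, a later message on another channel still sees `$channel` and gets its own file.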

Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
2014-09-22 17:36:29 +01:00
Joao Eduardo Luis
6c378aebcb common: LogEntry: if channel is missing, default to "cluster"
Keeps backward compatibility when there are entities that do not know
what a channel is.  This way we ensure that those messages are logged as
they were expected to be before channels were introduced: to the cluster
log.
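
As a sketch, assuming a log entry is represented as a plain dictionary (the real LogEntry is a C++ structure and `channel_of` is a hypothetical helper):

```python
DEFAULT_CHANNEL = "cluster"


def channel_of(entry: dict) -> str:
    """Return the entry's channel, falling back to the cluster log when
    the sender did not set one (or set it to an empty string)."""
    return entry.get("channel") or DEFAULT_CHANNEL
```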

Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
2014-09-22 17:36:29 +01:00
Joao Eduardo Luis
2da1a2914a ceph_mon: check available storage space for mon data dir on start
error out if available storage space is below 'mon data avail crit'
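
A minimal sketch of the startup check, assuming the threshold is expressed as a percentage of available space. The default of 5, the exception type and the strict `<` comparison are assumptions made for illustration.

```python
class LowDiskSpaceError(RuntimeError):
    """Hypothetical error raised when the mon data dir is nearly full."""


def check_mon_data_space(avail_percent: int, crit_percent: int = 5) -> None:
    # Assumption: "below" is read as a strict comparison.
    if avail_percent < crit_percent:
        raise LowDiskSpaceError(
            "available space %d%% is below critical threshold %d%%"
            % (avail_percent, crit_percent)
        )
```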

Fixes: #9502

Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
2014-09-22 17:36:29 +01:00
Joao Eduardo Luis
9996d44698 mon: DataHealthService: use get_fs_stats() instead
and relieve the DataStats struct from clutter by using
ceph_data_stats_t instead of multiple fields.

Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
2014-09-22 17:36:29 +01:00
Joao Eduardo Luis
3d74230d1c common: util: add get_fs_stats() function
simplifies the task of obtaining available/used disk space, as well as
the percentage of space available.
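
A Python analogue of such a helper, built on the standard `shutil.disk_usage` call. The `FsStats` fields are illustrative and are not claimed to match the C++ structure.

```python
import shutil
from dataclasses import dataclass


@dataclass
class FsStats:
    byte_total: int
    byte_used: int
    byte_avail: int
    avail_percent: int


def get_fs_stats(path: str) -> FsStats:
    """Collect total/used/available bytes and the available percentage
    for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    percent = usage.free * 100 // usage.total if usage.total else 0
    return FsStats(usage.total, usage.used, usage.free, percent)
```

Callers such as a health check can then work with a single structure instead of several separate fields.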

Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
2014-09-22 17:36:28 +01:00
Loic Dachary
f421d5cc35 documentation: comment the CompatSet data members
Signed-off-by: Loic Dachary <loic-201408@dachary.org>
2014-09-22 16:31:00 +02:00
Sage Weil
ce8eefca13 osd/ReplicatedPG: do not clone or preserve snapdir on cache_evict
If we cache_evict a head in a cache pool, we need to prevent
make_writeable() from cloning the head and finish_ctx() from
preserving the snapdir object.

Fixes: #8629
Backport: firefly
Signed-off-by: Sage Weil <sage@redhat.com>
2014-09-21 15:56:18 -07:00