This way, we avoid grabbing the map_lock. Furthermore,
get curmap at the beginning of the method to ensure that
we send the message using the same map used to check
is_up.
This should also fix #2798, which was caused by
an osd being marked up between service.get_osdmap()
and OSD::osdmap.
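The idea of taking one map snapshot and using it for both the is_up check
and the send can be sketched as follows. This is an illustrative Python
sketch, not the OSD code: OSDMapSnapshot, Service and send_if_up are
hypothetical names introduced only for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class OSDMapSnapshot:
    """Hypothetical immutable view of an osdmap at one epoch."""
    epoch: int
    up_osds: frozenset

    def is_up(self, osd_id: int) -> bool:
        return osd_id in self.up_osds


@dataclass
class Service:
    """Hypothetical stand-in for the object that publishes map snapshots."""
    current: OSDMapSnapshot
    sent: list = field(default_factory=list)

    def get_osdmap(self) -> OSDMapSnapshot:
        return self.current


def send_if_up(service: Service, peer: int, payload: str) -> bool:
    # Take the snapshot once, at the top, so the is_up check and the
    # epoch stamped on the message both come from the same map.
    curmap = service.get_osdmap()
    if not curmap.is_up(peer):
        return False
    service.sent.append((peer, curmap.epoch, payload))
    return True
```

Because the snapshot is read once, a peer marked up between two separate
map reads cannot make the check and the send disagree.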
Signed-off-by: Samuel Just <sam.just@inktank.com>
This option makes the osd skip zeroing old trimmed regions of the log. The
data is never read, since the xattrs indicate which part of the log is
valid. We've never actually used this to debug a problem, and it consumes
space, so let's disable it.
Signed-off-by: Sage Weil <sage@inktank.com>
Whether an entry is eligible to log/dump is independent of the channel it
is sent to. Some channels impose additional restrictions.
Signed-off-by: Sage Weil <sage@inktank.com>
Below is a patch which makes the ceph-rbdnamer script more robust and
fixes a problem with the rbd udev rules.
On our setup we encountered a symlink which was linked to the wrong rbd:
/dev/rbd/mypool/myrbd -> /dev/rbd1
While that link should have gone to /dev/rbd3 (on which a
partition /dev/rbd3p1 was present).
The old udev rule passes %n to the ceph-rbdnamer script. The problem
with %n is that it yields 3 for rbd3 but 1 for rbd3p1, so it can't be
depended upon for rbd naming.
In the patch below the ceph-rbdnamer script is made more robust and it
can now be called in various ways:
/usr/bin/ceph-rbdnamer /dev/rbd3
/usr/bin/ceph-rbdnamer /dev/rbd3p1
/usr/bin/ceph-rbdnamer rbd3
/usr/bin/ceph-rbdnamer rbd3p1
/usr/bin/ceph-rbdnamer 3
Even with all these different styles of calling the modified script, it
should now return the same rbdname. This change "has" to be combined
with calling it from udev with %k though.
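As a rough illustration of the normalization described above (not the
shell code in the patch), the following Python sketch maps each accepted
calling style to the same rbd device number. The function name
rbd_device_number and the regular expression are assumptions made for
the example; the lookup of the image name from that number is left out.

```python
import re

# Accepts "/dev/rbd3", "/dev/rbd3p1", "rbd3", "rbd3p1" or a bare "3".
# Group 1 is set for the rbd-prefixed forms, group 2 for the bare number.
_RBD_ARG = re.compile(r"^(?:(?:/dev/)?rbd(\d+)(?:p\d+)?|(\d+))$")


def rbd_device_number(arg: str) -> int:
    """Return the rbd device number for any of the accepted spellings."""
    match = _RBD_ARG.match(arg)
    if match is None:
        raise ValueError(f"not an rbd device argument: {arg!r}")
    return int(match.group(1) or match.group(2))
```

With this, the disk and its partitions resolve to the same number, which
is why the caller still has to tell them apart when creating symlinks.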
With that fixed, we hit the second problem. We ended up with:
/dev/rbd/mypool/myrbd -> /dev/rbd3p1
So the rbdname was symlinked to the partition on the rbd instead of the
rbd itself. So what probably went wrong is udev discovering the disk and
running ceph-rbdnamer which resolved it to myrbd so the following
symlink was created:
/dev/rbd/mypool/myrbd -> /dev/rbd3
However partitions would be discovered next and ceph-rbdnamer would be
run with rbd3p1 (%k) as parameter, resulting in the name myrbd too, with
the previous correct symlink being overwritten with a faulty one:
/dev/rbd/mypool/myrbd -> /dev/rbd3p1
The solution to the problem is in differentiating between disks and
partitions in udev and handling them slightly differently. So with the
patch below partitions now get their own symlinks in the following style
(which is fairly consistent with other udev rules):
/dev/rbd/mypool/myrbd-part1 -> /dev/rbd3p1
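The naming scheme can be sketched like this. The patch makes the
disk/partition distinction in the udev rules; the Python below only
illustrates the resulting symlink names, and symlink_path is a
hypothetical helper that is not part of the patch.

```python
import re

# Kernel names as passed by udev's %k: "rbd3" for the disk, "rbd3p1"
# for its first partition.
_KERNEL_NAME = re.compile(r"^rbd(\d+)(?:p(\d+))?$")


def symlink_path(kernel_name: str, pool: str, image: str) -> str:
    """Return the /dev/rbd/... symlink path for a disk or a partition."""
    match = _KERNEL_NAME.match(kernel_name)
    if match is None:
        raise ValueError(f"unexpected kernel name: {kernel_name!r}")
    base = f"/dev/rbd/{pool}/{image}"
    partition = match.group(2)
    # Disks keep the plain image name; partitions get a -partN suffix,
    # so they can no longer overwrite the disk's symlink.
    return base if partition is None else f"{base}-part{partition}"
```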
Please let me know any feedback you have on this patch or the approach
used.
Regards,
Pascal de Bruijn
Unilogic B.V.
Signed-off-by: Pascal de Bruijn <pascal@unilogicnetworks.net>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
The ceph-mon --mkfs function no longer wipes out the directory; it is in
fact mostly a no-op that just verifies the dir exists.
So, ensure that the directory is empty at mkfs time. This could
alternatively do an 'rm -r' in that directory (that is in fact what
ceph-mon used to do), but this is safer.
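A minimal sketch of such a check, assuming it runs just before mkfs is
invoked. The helper name require_empty_dir and the choice of exceptions
are illustrative and not taken from the actual change.

```python
import os


def require_empty_dir(path: str) -> None:
    """Refuse to continue unless path is an existing, empty directory."""
    if not os.path.isdir(path):
        raise FileNotFoundError(f"{path} does not exist or is not a directory")
    leftovers = sorted(os.listdir(path))
    if leftovers:
        # Fail instead of deleting: an 'rm -r' here could destroy a
        # store that was pointed at by mistake.
        raise RuntimeError(f"{path} is not empty: {leftovers}")
```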
Signed-off-by: Sage Weil <sage@inktank.com>
In non-crash situations, we want to make sure the message is both below the
syslog/stderr threshold and also below the normal log threshold. Otherwise
we get anything we gather on those channels, even when the log level is
low.
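The non-crash rule can be written as a small predicate. This is a sketch
under two assumptions that are not stated in the message: lower numbers
mean more important entries, and thresholds are inclusive. The name
emit_to_channel is hypothetical.

```python
def emit_to_channel(prio: int, log_level: int, channel_level: int) -> bool:
    """Non-crash path: the entry must pass both thresholds.

    channel_level is the syslog or stderr threshold; log_level is the
    normal log threshold for the entry's subsystem.
    """
    return prio <= channel_level and prio <= log_level
```

An entry that was only gathered (priority above the log level) is thus
kept off syslog/stderr even when the channel threshold alone would
admit it.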
Signed-off-by: Sage Weil <sage@inktank.com>
Otherwise, accessing the pg via _applied_recovered_object
isn't safe. Using intrusive_ptr clarifies the reference
ownership.
Signed-off-by: Samuel Just <sam.just@inktank.com>
We should gather an event if it is below the log or gather threshold.
Previously we were only gathering if we were going to print it, which makes
the dump no more useful than what was already logged.
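A sketch of the gather decision, assuming lower numbers mean more
important entries and inclusive thresholds (both assumptions). The
function names are hypothetical.

```python
def should_print(prio: int, log_level: int) -> bool:
    """An entry is written to the log if it is within the log threshold."""
    return prio <= log_level


def should_gather(prio: int, log_level: int, gather_level: int) -> bool:
    """Keep the entry in memory if either threshold admits it."""
    return prio <= max(log_level, gather_level)
```

With a log level of 1 and a gather level of 20, a priority-10 entry is
gathered but not printed, so a later dump carries more detail than the
log already had.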
Signed-off-by: Sage Weil <sage@inktank.com>
If the osd receiving the info has divergent entries, it will
also have a "divergent" stat structure.
Probably fixes #2769.
In cases like #2769, this bug can result in a primary with a stat
structure which double counts an operation: once for the
divergent operation, and once for the replay.
This is another way for the bug addressed in
5924f8e4a8 to happen.
Signed-off-by: Samuel Just <sam.just@inktank.com>
The purged_snaps set can grow without bound as snaps are
created and removed. Because the filestore doesn't
provide unlimited size collection attributes, it's better
to place the full info on the biginfo object, since we
need to write it during write_info anyway.
Added CEPH_OSD_FEATURE_INCOMPAT_BIGINFO to prevent downgrade.
Signed-off-by: Samuel Just <sam.just@inktank.com>
At one point, snap_collections were written to a pg collection
attribute. Subsequently, they were moved to the biginfo object
since the structure can grow too large for limited size xattrs.
make_snap_collection, however, was not updated.
Using write_info here should prevent this from happening in
the future.
Signed-off-by: Samuel Just <sam.just@inktank.com>
Previously, we did not actually persist the osd compatibility
mask. Without persisting the current compat mask, a previous,
incompatible version of the OSD would not be prevented from
starting on the same store.
Signed-off-by: Samuel Just <sam.just@inktank.com>
CompatSet users number the Feature objects rather than
providing masks. Thus, we should do
mask |= (1 << f.id) rather than mask |= f.id.
In order to detect old, broken encodings, the lowest
bit will be set in memory but not set in the encoding.
We can reconstruct the correct mask from the names map.
This bug can cause an incompat bit to not be detected
since 1|2 == 1|2|3.
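The difference between the two expressions can be checked directly. The
sketch below only models how the mask is built from feature ids; the
in-memory marker bit and the reconstruction from the names map are not
modelled, and the function names are made up for the example.

```python
def buggy_mask(feature_ids):
    """Old behaviour: OR the ids themselves into the mask."""
    mask = 0
    for fid in feature_ids:
        mask |= fid
    return mask


def correct_mask(feature_ids):
    """Fixed behaviour: each id selects one bit position."""
    mask = 0
    for fid in feature_ids:
        mask |= 1 << fid
    return mask
```

With ids {1, 2} and {1, 2, 3}, buggy_mask yields 3 in both cases, so
feature 3 is invisible, while correct_mask yields 6 and 14.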
fixes: #2748
Signed-off-by: Samuel Just <sam.just@inktank.com>
We want to look at the acting set here, nothing else. This was causing us
to erroneously queue ops for later (wasting memory) and to erroneously
print out a 'misdirected op' message in the cluster log (confusion and
incorrect [but ignored] -ENXIO reply).
Fixes: #2022
Signed-off-by: Sage Weil <sage@inktank.com>
It is possible for a .new file to already exist, potentially with a
larger size. This would happen if:
- we were proposing a different value
- we crashed (or were stopped) before it got renamed into place
- after restarting, a different value was proposed and accepted.
This isn't so unlikely for the log state machine, where we're
aggregating random messages. O_TRUNC ensures we avoid getting the tail
end of some previous junk.
I observed #2593 and found that a logm state value had a larger size on
one mon (after slurping) than the others, pointing to put_bl_sn_map().
While we are at it, O_TRUNC put_int() too; the same type of bug is
possible there, too.
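The effect of the missing O_TRUNC is easy to reproduce outside the mon.
The Python sketch below is not the store code; write_new_file is a
hypothetical helper that writes a value into an existing file with and
without truncation.

```python
import os


def write_new_file(path: str, data: bytes, truncate: bool) -> None:
    """Write data at offset 0 of path, optionally truncating it first."""
    flags = os.O_WRONLY | os.O_CREAT
    if truncate:
        flags |= os.O_TRUNC
    fd = os.open(path, flags, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            # os.write may write fewer bytes than requested.
            written = os.write(fd, view)
            view = view[written:]
    finally:
        os.close(fd)
```

Writing b"new" over a leftover b"stale-longer-value" without O_TRUNC
leaves b"newle-longer-value" on disk; with O_TRUNC the file holds
exactly b"new".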
Fixes: #2593
Signed-off-by: Sage Weil <sage@inktank.com>
Greg points out that we could have a situation like:
- mon recovers..
- goes through osdmaps, notes an osd was removed and removes from
full/nearfull
- goes through pgmaps, and re-adds it when it encounters some osd_stat_ts.
Fix this by removing the osd from the full/nearfull set when we remove
the osd_stat_t from the pgmap. Any osd removal is always followed by
an osd_stat_rm[] record when the primary processes the new osdmap and
proposes the appropriate pgmap updates.
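The invariant behind the fix, that full/nearfull membership follows the
lifetime of the osd's stat record, can be sketched as below. PGMapSketch
and its fields are hypothetical, and the 0.85/0.95 ratios are
illustrative values chosen for the example.

```python
class PGMapSketch:
    """Toy model: full/nearfull sets are derived from per-osd stats."""

    def __init__(self, nearfull_ratio: float = 0.85, full_ratio: float = 0.95):
        self.nearfull_ratio = nearfull_ratio
        self.full_ratio = full_ratio
        self.osd_stat = {}          # osd id -> used ratio
        self.full_osds = set()
        self.nearfull_osds = set()

    def update_osd_stat(self, osd: int, used_ratio: float) -> None:
        self.osd_stat[osd] = used_ratio
        self.full_osds.discard(osd)
        self.nearfull_osds.discard(osd)
        if used_ratio >= self.full_ratio:
            self.full_osds.add(osd)
        elif used_ratio >= self.nearfull_ratio:
            self.nearfull_osds.add(osd)

    def remove_osd_stat(self, osd: int) -> None:
        # Dropping the stat record also drops the osd from both sets,
        # so replaying old stats followed by the removal cannot leave a
        # removed osd marked full or nearfull.
        self.osd_stat.pop(osd, None)
        self.full_osds.discard(osd)
        self.nearfull_osds.discard(osd)
```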
Signed-off-by: Sage Weil <sage@inktank.com>
Use a helper to dump /proc/self/fd when we hit EMFILE in the filestore.
Ideally, we should trigger this in other appropriate places, but it is
not immediately clear that there is a sane way to do that.
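A rough Python sketch of such a helper, assuming a Linux-style
/proc/self/fd directory. The names dump_open_fds and open_or_dump are
hypothetical and do not correspond to the filestore code.

```python
import errno
import os
import sys


def dump_open_fds(fd_dir: str = "/proc/self/fd") -> list:
    """Return 'fd -> target' lines for this process's open descriptors."""
    lines = []
    try:
        names = sorted(os.listdir(fd_dir))
    except OSError:
        # No such directory (non-Linux), or no descriptor left to list it.
        return lines
    for name in names:
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            target = "?"  # e.g. the descriptor used for the listing itself
        lines.append(f"{name} -> {target}")
    return lines


def open_or_dump(path: str, flags: int) -> int:
    """Open path; on EMFILE, print the descriptor table before re-raising."""
    try:
        return os.open(path, flags)
    except OSError as exc:
        if exc.errno == errno.EMFILE:
            for line in dump_open_fds():
                print(line, file=sys.stderr)
        raise
```

Listing the directory itself needs a free descriptor, so in a real
EMFILE situation this sketch may return an empty list.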
Fixes: #2330
Signed-off-by: Sage Weil <sage@inktank.com>
Users probably want get_pg_acting_rank(). If they don't, they probably
have the mapping and can calculate the rank themselves. Having this here
is asking for bugs like #2022.
Signed-off-by: Sage Weil <sage@inktank.com>
Make the helper exclusively for the PG != NULL cases, and open-code the
one PG == NULL caller. This is simpler, and lets us include more useful
information in the log message.
Signed-off-by: Sage Weil <sage@inktank.com>
Stores absolute path to the generated keyring so that tests running in
other directories (e.g. src/java/test) can simply reference the
generated ceph.conf.
Signed-off-by: Noah Watkins <jawhawk@cs.ucsc.edu>