We need to sync the object_map too. We could _almost_ check whether the
object has any keys and sync only in that case, except that keys may have
existed previously and since been deleted.
So, always sync. leveldb is reasonably nice about this; it should cost
just one more fsync.
Signed-off-by: Sage Weil <sage@newdream.net>
The old strategy was to initiate a commit after any non-idempotent
transaction. This only worked if the transaction was idempotent with
respect to itself, or could be replayed partially without problems,
and in reality that isn't the case. For example:
- clone A -> B
- write to A
- <sync>
If we crash before the sync, and replay the clone A->B, we corrupt B with
the new A data.
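A minimal sketch of the failure, using a hypothetical in-memory store
(Store, clone_object and write_object are names made up for this
illustration, not FileStore code):

```cpp
#include <map>
#include <string>

// Hypothetical in-memory object store, used only for this illustration.
using Store = std::map<std::string, std::string>;

// clone src -> dst: dst receives whatever src holds *right now*.
inline void clone_object(Store& s, const std::string& src,
                         const std::string& dst) {
  s[dst] = s[src];
}

inline void write_object(Store& s, const std::string& name,
                         const std::string& data) {
  s[name] = data;
}

// Applies "clone A -> B; write to A", then models a crash before the sync
// by replaying the clone against the already-modified store.
inline std::string b_after_crash_and_replay() {
  Store s{{"A", "old"}};
  clone_object(s, "A", "B");    // B holds the old A data
  write_object(s, "A", "new");  // A changes; no sync has happened yet
  // <crash here; replay restarts from the clone>
  clone_object(s, "A", "B");    // replayed clone copies the *new* A data
  return s["B"];
}
```

After the replay B holds "new" rather than "old", which is the corruption
described above.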
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Also, do the getxattr using fgetxattr, to avoid duplicating code. This is
probably slightly slower because we open a file handle, but if we care we
should really clean up the code to use lfn_open instead of lfn_find and
avoid the repeated path traversal too.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
watch_lock is inside map_lock (and pg->lock), which means we need to
drop it to take pg->lock here. That means verifying in
handle_watch_timeout that we haven't raced with another thread canceling
the timeout event, which would be indicated by
- the entity not appearing in unconnected_watchers
- the entity having a different (presumably newer) expire time
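The two checks can be sketched as below. This is an illustration only:
the Watchers map type, the long expire time and timeout_still_valid() are
assumptions for the sketch, not the actual OSD data structures.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for unconnected_watchers: entity -> expire time.
using Watchers = std::map<std::string, long>;

// Called after re-taking the locks. Returns true only if nobody canceled
// or rescheduled the timeout while the locks were dropped.
inline bool timeout_still_valid(const Watchers& unconnected_watchers,
                                const std::string& entity, long expire) {
  auto it = unconnected_watchers.find(entity);
  if (it == unconnected_watchers.end())
    return false;  // entity no longer listed: the timeout was canceled
  if (it->second != expire)
    return false;  // different (presumably newer) expire time: stale event
  return true;
}
```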
Fixes: #2103
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Reviewed-by: Samuel Just <samuel.just@dreamhost.com>
Before, we were being very careful about updating the heartbeat peers if
new PGs were created or when certain types of messages were received.
However, the PG can change its peers in lots of cases (e.g., when
recovery completes), but the OSD doesn't re-aggregate.
Instead, set a flag when each PG updates its set, and check that flag in
the OSD code periodically or in likely places. A call in tick() acts as
a catch-all.
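A rough sketch of the flag approach, assuming simplified stand-in types
(SketchPG, SketchOSD and their members are hypothetical and only meant to
show the pattern, not the real classes):

```cpp
#include <atomic>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

struct SketchPG {
  std::set<int> peers;  // OSD ids this PG wants heartbeats with
};

struct SketchOSD {
  std::vector<SketchPG> pgs;
  std::set<int> heartbeat_peers;         // aggregate over all PGs
  std::atomic<bool> need_update{false};  // set whenever any PG changes
  int aggregations = 0;                  // counts real re-aggregations

  // A PG changes its peer set and just raises the flag.
  void pg_set_peers(std::size_t idx, std::set<int> p) {
    pgs[idx].peers = std::move(p);
    need_update.store(true);
  }

  // Cheap to call from tick() and other likely places: it re-aggregates
  // only if some PG raised the flag since the last call.
  bool maybe_update_heartbeat_peers() {
    if (!need_update.exchange(false))
      return false;
    heartbeat_peers.clear();
    for (const auto& pg : pgs)
      heartbeat_peers.insert(pg.peers.begin(), pg.peers.end());
    ++aggregations;
    return true;
  }
};
```

Several PG updates between two checks collapse into a single
re-aggregation, which is why the periodic check can be called liberally.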
The num_created counts can probably be cleaned out now...
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Greg Farnum <gregory.farnum@dreamhost.com>
This adds to the repository a document that I wrote about how Ceph
client file data is striped across Ceph objects. It's a text
document. Someone with better document preparation skills than I
should use the content below as a basis for something prettier if
that's appropriate.
[Made a few edits... -sage]
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Track which region of the log has been zeroed on disk. This may be
different from tail if 'osd preserved trimmed log = false' in the config.
Only zero the portion of the log we need to. This avoids rezeroing regions
or missing bits when 'osd preserved trimmed log' was off and is then turned
on.
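One way to picture the bookkeeping, as a sketch under assumptions:
LogZeroTracker, zeroed_to and the zero_now parameter are hypothetical
names, and zero_now simply stands for "the config currently asks for
trimmed log to be zeroed" without claiming which option value that is.

```cpp
#include <cstdint>
#include <utility>

struct LogZeroTracker {
  uint64_t tail = 0;       // start of the live log on disk
  uint64_t zeroed_to = 0;  // everything before this offset is zeroed

  // Advance the tail. Returns the [begin, end) byte range the caller must
  // zero now; begin == end means there is nothing to zero.
  std::pair<uint64_t, uint64_t> trim(uint64_t new_tail, bool zero_now) {
    if (new_tail > tail)
      tail = new_tail;
    if (!zero_now || zeroed_to >= tail)
      return {zeroed_to, zeroed_to};  // skipped, or already zeroed
    std::pair<uint64_t, uint64_t> range{zeroed_to, tail};
    zeroed_to = tail;  // remember it so we never rezero this region
    return range;
  }
};
```

Because zeroed_to is tracked separately from tail, a trim that skipped
zeroing leaves a gap that the next zeroing trim covers in full, and a
region that was already zeroed is not written again.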
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Samuel Just <samuel.just@dreamhost.com>
First try the FALLOC_FL_PUNCH_HOLE fallocate() flag. If we get EOPNOTSUPP,
fall back to writing zeros.
Check for fallocate(2) with configure. Also, avoid this if we are not on
Linux, since I'm not sure about the hard-coded FALLOC_FL_PUNCH_HOLE being
correct on other platforms.
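The fallback logic looks roughly like the sketch below. zero_range() is
a hypothetical helper, not the FileStore code; it uses the Linux
punch-hole flag as named in <linux/falloc.h> (FALLOC_FL_PUNCH_HOLE),
which Linux requires to be combined with FALLOC_FL_KEEP_SIZE.

```cpp
#include <algorithm>
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <unistd.h>
#ifdef __linux__
#include <linux/falloc.h>
#endif

// Zero the byte range [off, off + len) of fd.
// Returns 0 on success or a negative errno value on failure.
inline int zero_range(int fd, off_t off, off_t len) {
  if (len <= 0)
    return 0;
#ifdef __linux__
  // Try to punch a hole first; the file size is left unchanged.
  if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                off, len) == 0)
    return 0;
  if (errno != EOPNOTSUPP)
    return -errno;
#endif
  // Fallback: write zeros explicitly. Unlike the hole punch, this
  // extends the file if the range runs past the current end of file.
  static const char zeros[4096] = {0};
  while (len > 0) {
    std::size_t n = std::min<off_t>(len, sizeof(zeros));
    ssize_t w = pwrite(fd, zeros, n, off);
    if (w < 0) {
      if (errno == EINTR)
        continue;
      return -errno;
    }
    off += w;
    len -= w;
  }
  return 0;
}
```

The #ifdef mirrors the "avoid this if we are not on Linux" rule: other
platforms go straight to the write-zeros path.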
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Samuel Just <samuel.just@dreamhost.com>