ceph/doc/dev/filestore-filesystem-compat...

====================================
 Filestore filesystem compatilibity
====================================

http://marc.info/?l=ceph-devel&m=131942130322957&w=2

Although running on ext4, xfs, or whatever other non-btrfs you want mostly
works, there are a few important remaining issues:

OSD journal replay of non-idempotent transactions
=================================================

**Resolved** with full sync but not ideal.
See http://tracker.newdream.net/issues/213

On non-btrfs backends, the Ceph OSDs use a write-ahead journal. After
restart, the OSD does not know exactly which transactions in the
journal may have already been committed to disk, and may reapply a
transaction again during replay. For most operations (write, delete,
truncate) this is fine.

Some operations, though, are non-idempotent. The simplest example is
CLONE, which copies (efficiently, on btrfs) data from one object to
another. If the source object is modified, the osd restarts, and then
the clone is replayed, the target will get incorrect (newer) data. For
example,

- clone A -> B
- modify A
- <osd crash, replay from 1>

B will get new instead of old contents.

(This doesn't happen on btrfs because the snapshots allow us to replay
from a known consistent point in time.)

Possibilities:

- full sync after any non-idempotent operation
- re-evaluate the lower level interface based on needs from higher
  levels, construct only safe operations, be very careful; brittle
- use xattrs to add sequence numbers to objects:

  - on non-btrfs, we set a xattr on every modified object with the
    op_seq, the unique sequence number for the transaction.
  - for any (potentially) non-idempotent operation, we fsync() before
    continuing to the next transaction, to ensure that xattr hits disk.
  - on replay, we skip a transaction if the xattr indicates we already
    performed this transaction.

  Because every 'transaction' only modifies on a single object (file),
  this ought to work. It'll make things like clone slow, but let's
  face it: they're already slow on non-btrfs file systems because they
  actually copy the data (instead of duplicating the extent refs in
  btrfs). And it should make the full ObjectStore interface safe,
  without upper layers having to worry about the kinds and orders of
  transactions they perform.