journal is distributed among different nodes. because authority changes over time, it's not immedicatley clear to a recoverying node relaying the journal whether the data is "real" or not (it might be exported later in the journal). possibilities: ONE.. bloat the journal! - journal entry includes full trace of dirty data (dentries, inodes) up until import point - local renames implicit.. cache is reattached on replay - exports are a list of exported dirs.. which are then dumped ... recovery phase 1 - each entry includes full trace (inodes + dentries) up until the import point - cache during recovery is fragmetned/dangling beneath import points - when export is encountered items are discarded (marked clean) recovery phase 2 - import roots ping store to determine attachment points (if not already known) - if it was imported during period, attachment point is already known. - renames affecting imports are logged too - import roots discovered from other nodes, attached to hierarchy then - maybe resume normal operations - if recovery is a background process on a takeover mds, "export" everything to that node. -> journal contains lots of clean data.. maybe 5+ times bigger as a result! possible fixes: - collect dir traces into journal chunks so they aren't repeated as often - each chunk summarizes traces in previous chunk - hopefully next chunk will include many of the same traces - if not, then the entry will include it === log entry types === - all inode, dentry, dir items include a dirty flag. - dirs are implicitly _never_ complete; even if they are, a fetch before commit is necessary to confirm ImportPath - log change in import path Import - log import addition (w/ path, dirino) InoAlloc - allocate ino InoRelease - release ino Inode - inode info, along with dentry+inode trace up to import point Unlink - (null) dentry + trace, + flag (whether inode/dir is destroyed) Link - (new) dentry + inode + trace ----------------------------- TWO.. - directories in store contain path at time of commit (relative to import, and root) - replay without attaching anything to heirarchy - after replay, directories pinged in store to attach to hierarchy -> phase 2 too slow! -> and nested dirs may reattach... that won't be apparent from journal. - put just parent dir+dentry in dir store.. even worse on phase 2! THREE - metadata journal/log event types: chown, chmod, utime InodeUpdate mknod, mkdir, symlink Mknod .. new inode + link unlink, rmdir Unlink rename Link + Unlink (foreign) or Rename (local) link Link .. link existing inode InodeUpdate DentryLink DentryUnlink InodeCreate InodeDestroy Mkdir?