v0.3

v0.4
- ENOSPC
- finish client failure recovery (reconnect after long eviction; and slow
  delayed reconnect)
- make kclient use ->sendpage?
- rip out io interruption?

big items
- ENOSPC
  - enforceable quotas?
- snapshots
- mds security enforcement
  - client, user authentication
- cas
- osd failure declarations
- libuuid?

snapshots
- rados bits to do clone+write
  - fix cloning on unlinked file (where snaps=[], but head may have
    follows_snap attr)
- figure out how to fix up rados logging
- snap collections
- garbage collection
- client reconnect vs snaps
- hard link backpointers
  - anchor source dir
  - build snaprealm for any hardlinked file
  - include snaps for all (primary+remote) parents
- migrator import/export of versioned dentries, inodes... drop them on export...
/- pin/unpin open_past_parents.
  - call open_parents() where needed.
- mds server ops
  - link rollback
  - rename rollback

userspace client
- handle session STALE
  - time out caps, wake up waiters on renewal
  - link caps with mds session
- validate dn leases
  - fix lease validation to check session ttl
- clean up ll_ interface, now that we have leases!
- clean up client mds session vs mdsmap behavior?
- stop using mds's inode_t?
- fix readdir vs fragment race by keeping a separate frag pos, and ignoring
  dentries below it

kernel client
- make writepages maybe skip pages with errors?
  - EIO, or ENOSPC?
  - ... writeback vs ENOSPC vs flush vs close()... hrm...
- set mapping bits for ENOSPC, EIO?
- flush caps on sync, fsync, etc.
  - do we need to block?
- timeout mds session close on umount
- deal with CAP_RDCACHE properly: invalidate cache pages?
- procfs/debugfs
  - adjust granular debug levels too
  - should we be using debugfs?
  - a dir for each client instance (client###)?
  - hooks to get mds, osd, monmap epoch #s
- vfs
  - can we use dentry_path(), if it gets merged into mainline?
- io / osd client
  - osd ack vs commit handling.  hmm!
- fix readdir vs fragment race by keeping a separate frag pos, and ignoring
  dentries below it

vfs issues
- real_lookup() race:
  1- hash lookup finds no dentry
  2- real_lookup() takes dir i_mutex, but then finds a dentry
  3- drops mutex, then calls d_revalidate.  if that fails, we return ENOENT
     (instead of looping?)
- vfs_rename_dir()

client
- clean up client mds session vs mdsmap behavior?

osdmon
- monitor needs to monitor some osds...

crush
- more efficient failure when all/too many osds are down
- allow forcefeed for more complicated rule structures.  (e.g. make
  force_stack a list< set >)
- "knob" bucket

pgmon
- monitor pg states, notify on out?
- watch osd utilization; adjust overload in cluster map

mon
- paxos needs to clean up old states.
- some sort of tester for PaxosService...
- osdmon needs to lower-bound old osdmap versions it keeps around?

mds
- proper handling of cache expire messages during rejoin phase?
  -> i think cache expires are fine; the rejoin_ack handler just has to
     behave if rejoining items go missing
- try_remove_unlinked_dn thing
- rename: importing inode... also journal imported client map?
- rerun destro trace against latest, with various journal lengths
- lease length heuristics
- mds lock last_change stamp?
- handle slow client reconnect (i.e. after mds has gone active)
- fix reconnect/rejoin open file weirdness
- get rid of C*Discover objects for replicate_to.. encode to bufferlists
  directly?
- can we get rid of the dirlock remote auth_pin weirdness on subtree roots?
- anchor_destroy needs to xlock linklock.. which means it needs a Mutation
  wrapper?
  - ... when it gets a caller.. someday..
- make truncate faster with a trunc_seq, attached to objects as attributes?
- osd needs a set_floor_and_read op for safe failover/STONITH-like
  semantics (see the sketch after this section).
- could mark dir complete in EMetaBlob by counting how many dentries are
  dirtied in the current log epoch in CDir...
- FIXME how to journal/store root and stray inode content?
  - in particular, i care about dirfragtree.. get it on rejoin?
  - and dir sizes, if i add that... also on rejoin?
- efficient stat for single writers
- add FILE_CAP_EXTEND capability bit
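
The set_floor_and_read op above (also wanted by the journaler, below, for
safe takeover) amounts to a fencing/ballot check on the osd.  A minimal
sketch of what the semantics could look like; all names and types here are
hypothetical, not actual ceph code:

  // hypothetical sketch of set_floor_and_read semantics on the osd.
  // a caller presents its "floor" (e.g. a takeover epoch); the osd
  // atomically raises the stored floor and returns the object contents.
  // any later op carrying a lower floor is rejected, fencing the old,
  // possibly-not-quite-dead writer (hence "STONITH-like").
  #include <cstdint>
  #include <map>
  #include <string>
  #include <utility>

  struct Object {
    uint64_t floor = 0;     // highest floor accepted so far
    std::string data;       // object contents
  };

  enum class Result { OK, EFENCED };

  // atomically: if caller_floor >= stored floor, raise the floor and read;
  // otherwise reject, so a stale writer learns it has been superseded.
  std::pair<Result, std::string>
  set_floor_and_read(std::map<std::string, Object>& store,
                     const std::string& oid, uint64_t caller_floor) {
    Object& o = store[oid];
    if (caller_floor < o.floor)
      return {Result::EFENCED, ""};
    o.floor = caller_floor;          // all ops below this floor now fail
    return {Result::OK, o.data};
  }

The point is that the floor check and the read happen in one atomic step,
so a new mds taking over a journal both fences the old writer and sees its
last committed state.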
journaler
- fix up for large events (e.g. imports)
- use set_floor_and_read for safe takeover from the possibly-not-quite-dead
  other guy.
- should we pad with zeros to avoid splitting individual entries?
  - make it a g_conf flag?
  - have to fix reader to skip over zeros (either <4 bytes for size, or
    zeroed sizes)
  - need to truncate at detected (valid) write_pos to clear out any other
    partial trailing writes

fsck
- fsck.ebofs
- online mds fsck?
  - object backpointer attrs to hint catastrophic reconstruction?

objecter
- fix failure handler...
- generic mon client?
- maybe_request_map should set a timer event to periodically re-request.
- transaction prepare/commit?
- read+floor_lockout

osd/rados
- how does an admin intervene when a pg needs a dead osd to repeer?
- a more general fencing mechanism?  per-object granularity isn't usually
  a good match.
- consider implications of nvram writeahead logs
- flag missing log entries on crash recovery  --> WRNOOP?  or WRLOST?
- efficiently replicate clone() objects
- fix heartbeat wrt new replication
- mark residual pgs obsolete  ???
- rdlocks
- optimize remove wrt recovery pushes
- report crashed pgs?

messenger
- fix messenger shutdown.. we shouldn't delete messenger, since the caller
  may be referencing it, etc.

simplemessenger
- close idle connections

objectcacher
- merge clean bh's
- ocacher caps transitions vs locks
- test read locks

reliability
- heartbeat vs ping?
- osdmonitor, filter

ebofs
- btrees
  - checksums
  - dups
  - sets
- optionally scrub deallocated extents
- clone()
- map ObjectStore
- verify proper behavior of conflicting/overlapping reads of clones
- combine inodes and/or cnodes into same blocks
- fix bug in node rotation on insert (and reenable)
- fix NEAR_LAST_FWD (?)
- awareness of underlying software/hardware raid in allocator so that we
  write full stripes _only_.
  - hmm, that's basically just a large block size.
- rewrite the btree code!
  - multithreaded
  - eliminate nodepools
  - allow btree sets
  - allow arbitrary embedded data?
  - allow arbitrary btrees
  - allow root node(s?) to be embedded in onode, or wherever.
  - keys and values can be uniform (fixed-size) or non-uniform (see the
    encoding sketch after this section).
    - fixed size (if any) is a value in the btree struct.
    - negative indicates bytes of length value?  (1 -> 255 bytes,
      2 -> 65535 bytes, etc.?)
    - non-uniform records preceded by length.
  - keys sorted via a comparator defined in btree root.
    - lexicographically, by default.
- goal
  - object btree key->value payload, not just a data blob payload.
  - better threading behavior.
    - with transactional goodness!
- onode
  - object attributes.. as a btree?
  - blob stream
  - map stream.
  - allow blob values.
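
One way the uniform vs. non-uniform record encoding above could work; a
sketch only, with made-up names (nothing here is existing ebofs code):

  // 'fixed' would live in the btree root struct (assumed nonzero here):
  //   fixed > 0: every value is exactly 'fixed' bytes; no per-record header.
  //   fixed < 0: values are length-prefixed with -fixed bytes of length
  //              (-1 -> up to 255 bytes, -2 -> up to 65535 bytes, etc.).
  #include <cassert>
  #include <cstdint>
  #include <string>
  #include <vector>

  using buffer = std::vector<uint8_t>;

  void encode_value(buffer& bl, int fixed, const std::string& val) {
    if (fixed > 0) {
      assert((int)val.size() == fixed);      // uniform: size is implicit
    } else {
      size_t lenbytes = -fixed;              // non-uniform: length prefix
      uint64_t len = val.size();
      assert(lenbytes >= 8 || (len >> (8 * lenbytes)) == 0);
      for (size_t i = 0; i < lenbytes; i++)  // little-endian length
        bl.push_back((len >> (8 * i)) & 0xff);
    }
    bl.insert(bl.end(), val.begin(), val.end());
  }

  std::string decode_value(const buffer& bl, size_t& off, int fixed) {
    uint64_t len = 0;
    if (fixed > 0) {
      len = fixed;
    } else {
      for (size_t i = 0; i < (size_t)-fixed; i++)
        len |= (uint64_t)bl[off++] << (8 * i);
    }
    std::string val(bl.begin() + off, bl.begin() + off + (size_t)len);
    off += (size_t)len;
    return val;
  }

The same scheme works for keys: with fixed > 0 a node can locate record i
by multiplication, while fixed < 0 costs a scan (or an offset table).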
remaining hard problems
- how to cope with file size changes and read/write sharing

snapshot notes --

osd
- pass snap lineage with osd op/capability
- tag each non-live object with the set of snaps it is defined over
- osdmap has sparse map of extant snapids.  incrementals are simple
  rmsnapid, and max_snapid increase
- put each object in first_snap, last_snap collections.
- use background thread to trim old snaps.
  - for each object in first_snap|last_snap collections,
    - get snap list,
    - filter against extant snaps,
    - adjust collections, or delete
- adjust coll_t namespace to allow first_snap/last_snap collections..
  - pg.u.type = CEPH_PG_TYPE_SNAP_LB/UB?

- read:
    oid.snap=NOSNAP op.snaps=[3,2,1]   // read latest
    oid.snap=3 op.snaps=[3,2,1]        // read old snap 3
- write (see the sketch at the end of this section):
  - oid.snap=NOSNAP op.snaps=[3,2,1]   // write to latest
    clone NOSNAP to oid.snaps[0] if oid:snaps[0] != snaps[0]
    set snap_first based on snaps array.
  - write(oid.snap=NOSNAP snaps=[3,2,1])
      oid.snap=NOSNAP oid:prior=3,2,1
    -> do nothing, write to existing
  - write(oid.snap=NOSNAP snaps=[300,2,1])
      oid.snap=NOSNAP oid:prior=2,1
    -> clone to oid.snap=300, snaps=300
  - write(oid.snap=NOSNAP snaps=[2,1])
      oid.snap=NOSNAP oid:prior=3,2,1
    -> slow request, but who cares.  use snaps=3,2,1, write to existing as
       above.
  - write(oid.snap=NOSNAP snaps=[3,2,1])
      oid.snap=NOSNAP oid:prior=1
    -> clone to oid.snap=3, snaps=3,2

screw "snap writeback":
- if the client is buffering, it has exclusive access to the object.
  therefore, it can flush older snaps before newer ones.  done.
- if multiple clients are writing synchronously, then the client doesn't
  care if the osd pushes a write "forward" in time to the next snapshot,
  since it hasn't completed yet.

write [10]
        NOSNAP  follows=10
write [20,10]
        NOSNAP  follows=20,10
        20      [20,10]
write [45,42,20,10]
        NOSNAP  follows=45,42,20,10
        45      [45,42]
        20      [20,10]
write [42,20,10]
        NOSNAP  follows=45,42,20,10
        45      [45,42]
        20      [20,10]
delete [45,42,20,10]
        NOSNAP  follows=45,42,20,10 (empty)
        45      [45,42]
        20      [20,10]
write [42,20]
        NOSNAP  follows=45,42,20,10   * just write to head.
        45      [45,42]
        20      [20,10]
delete [45,...]
        NOSNAP  follows=45,42,20,10 (empty)
        45      [45,42]
        20      [20,10]
write [60,45,..]
        NOSNAP  follows=60,45,42,20,10
        45      [45,42]
        20      [20,10]
write [70,60,..]
        NOSNAP  follows=60,45,42,20,10
        70      [70]
        45      [45,42]
        20      [20,10]

issues:
- how to log and replicate all this cloning...
- snap trimming is async from replicated writes
- snap trimming in general needs to be fast.. and probably a background
  process..
  -> logically put per-pg snap removal as a discrete pg log event?
- need to index snap_first/last on a per-pg basis... 128-bit collections?
  or ditch ebofs?

btrfs rev de-duping
- i.e. when sub_op_push gets an object
  - query checksums
  - userland will read+verify ranges are actually a match?
  - or, in pull, do FIEMAP against prior, and next object?  how do we know
    what those are?
  - punch hole (?)
  - clone file range (not entire file)

interface
$ ls -al .snapshot       # list snaps.  show both symbolic names, and
                         # timestamp names?  (symbolic -> timestamp
                         # symlinks, maybe)
$ mkdir .snapshot/blah   # create snap
$ rmdir .snapshot/blah   # remove it
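
A sketch of the head-write/clone decision the write examples above imply:
clone before writing only when the client's newest snap is newer than
anything head already accounts for.  Hypothetical types; not the real osd
code:

  #include <cstdint>
  #include <string>
  #include <vector>

  using snapid_t = uint64_t;

  struct Head {
    std::vector<snapid_t> prior;  // snap context as of the last write
                                  // (descending, newest first)
    std::string data;
  };

  struct Clone {
    snapid_t snap;                // oid.snap the clone is stored under
    std::vector<snapid_t> snaps;  // snaps the clone is defined over
  };

  // decide whether a write with snap context 'snaps' (descending) must
  // clone head first.  returns true and fills in 'clone' if so.
  bool prepare_head_write(Head& head, const std::vector<snapid_t>& snaps,
                          Clone& clone) {
    snapid_t newest_prior = head.prior.empty() ? 0 : head.prior.front();
    if (snaps.empty() || snaps.front() <= newest_prior)
      return false;            // stale/same context: write to existing head
    // head's current contents are what every snap newer than newest_prior
    // saw; preserve them in a clone named by the newest such snap.
    clone.snap = snaps.front();
    for (snapid_t s : snaps)
      if (s > newest_prior)
        clone.snaps.push_back(s);
    head.prior = snaps;          // head now accounts for these snaps
    return true;
  }

Checked against the cases above: prior=3,2,1 snaps=[3,2,1] -> no clone;
prior=2,1 snaps=[300,2,1] -> clone 300 [300]; prior=3,2,1 snaps=[2,1] ->
no clone (the "slow request" case); prior=1 snaps=[3,2,1] -> clone 3 [3,2].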