v0.5
/- debug restart, cosd reformat, etc.
/- finish btrfs ioctl interface
/- efficient snap recovery
/- throttle osd recovery
/- forced unmount?

v0.6
- ENOSPC
- async metadata ops

v0.7
- cas?

big items
- finish client failure recovery (reconnect after long eviction; and slow delayed reconnect)
- ENOSPC
  - space reservation in ObjectStore, redeemed by Transactions?
  - reserved as PG goes active; reservation canceled when pg goes inactive
  - something similar during recovery
    - ?
- repair
- enforceable quotas?
- mds security enforcement
- client, user authentication
- cas
- osd failure declarations
- libuuid?

repair
- are we concerned about
  - scrubbing
  - reconstruction after loss of subset of cdirs
  - reconstruction after loss of md log
- data object
  - path backpointers?
  - parent dir pointer?
- cdir objects
  - parent dir pointer
    - update on rename? or on cdir store? on cdir store is sufficient if mdlog survives...
    - or what the hell, full trace?
- mds scrubbing
- rados scrubbing

snaps on osd
- garbage collection
  - don't start collection on replica until clean?
- efficient recovery of clones using the clone diff info
  - recovery on primary
    - log order vs recovery/clone order?
    - rep_op_push on primary needs to be smart
    - IndexLog could somehow aggregate data_subsets?
  - primary pushes to replicas
- clone/rename, + push of new head

kernel client
- make osd retry writes if failure after ack..
- ACLs
- reconnect path should include pathbase, not just a string?
- make writepages maybe skip pages with errors?
  - EIO, or ENOSPC?
  - ... writeback vs ENOSPC vs flush vs close()... hrm...
- set mapping bits for ENOSPC, EIO?
- flush caps on sync, fsync, etc.
  - do we need to block? how do we track that?
- forced unmount?
- procfs/debugfs
  - adjust granular debug levels too
  - should we be using debugfs?
  - a dir for each client instance (client###)?
  - hooks to get mds, osd, monmap epoch #s
- populate sysfs?
  - things that would be useful to see
    - fsid
    - map versions on client
    - outstanding mds, osd, mon requests?
- fix readdir vs fragment race by keeping a separate frag pos, and ignoring dentries below it
- reconnect after being disconnected from the mds

kclient items to review
- fill_trace locking
- async trunc
- async writeback
- cache invalidation race, locking problems
  - cap changes are serialized by i_lock, but (thorough) cache invalidation may block..

vfs issues
- real_lookup() race:
  1- hash lookup finds no dentry
  2- real_lookup() takes dir i_mutex, but then finds a dentry
  3- drops mutex, then calls d_revalidate. if that fails, we return ENOENT (instead of looping?)
- vfs_rename_dir()

userspace client
- handle session STALE
- time out caps, wake up waiters on renewal
- link caps with mds session
- validate dn leases
- fix lease validation to check session ttl
- clean up ll_ interface, now that we have leases!
- clean up client mds session vs mdsmap behavior?
- stop using mds's inode_t?
- fix readdir vs fragment race by keeping a separate frag pos, and ignoring dentries below it (see the sketch after this section)

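sketch: readdir vs fragment race
a minimal sketch of the "separate frag pos" idea, assuming readdir position is
tracked as a (frag, offset) pair and dentries below the recorded offset in the
current frag are ignored; ReaddirPos and readdir_frag are hypothetical names,
not the existing client code.

  #include <cstdint>
  #include <string>
  #include <vector>

  struct ReaddirPos {
    uint32_t frag = 0;   // fragment currently being listed (hypothetical)
    uint32_t off = 0;    // dentries of that fragment already returned
  };

  // Return the dentries of 'frag' not yet handed to the caller, advancing pos.
  std::vector<std::string> readdir_frag(const std::vector<std::string>& dentries,
                                        uint32_t frag, ReaddirPos& pos) {
    if (frag != pos.frag) {       // crossed into a new fragment: restart its offset
      pos.frag = frag;
      pos.off = 0;
    }
    std::vector<std::string> out;
    for (uint32_t i = pos.off; i < dentries.size(); i++)
      out.push_back(dentries[i]); // anything below pos.off is ignored, so a frag
                                  // split/merge mid-readdir doesn't replay entries
    pos.off = dentries.size();
    return out;
  }
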
mds
- we either need to stop forwarding mds requests on behalf of the client, or we need to gracefully deal with multiple replies that (may) contain caps, or we need to not forward requests that may include a cap reply.
  * the problem is that when we get duplicate replies with caps, we drop the second one, and in so doing lose important state that is difficult to clean up...
- hard link backpointers
  - anchor source dir
  - build snaprealm for any hardlinked file
  - include snaps for all (primary+remote) parents
- remove anchors when purging
  - delayed open_remote_ino vs purge?
- how do we properly clean up inodes when doing a snap purge?
  - when they are mid-recover? see 136470cf7ca876febf68a2b0610fa3bb77ad3532
- what's with the 'clear if dirtyscattered' bit in decode_import_inode()?
- what if a recovery is queued, or in progress, and the inode is then cowed? can that happen?
- proper handling of cache expire messages during rejoin phase?
  -> i think cache expires are fine; the rejoin_ack handler just has to behave if rejoining items go missing
- try_remove_unlinked_dn thing
- rename: importing inode... also journal imported client map?
- rerun destro trace against latest, with various journal lengths
- lease length heuristics
  - mds lock last_change stamp?
- handle slow client reconnect (i.e. after mds has gone active)
- fix reconnect/rejoin open file weirdness
- can we get rid of the dirlock remote auth_pin weirdness on subtree roots?
- anchor_destroy needs to xlock linklock.. which means it needs a Mutation wrapper?
  - ... when it gets a caller.. someday..
- make truncate faster with a trunc_seq, attached to objects as attributes?
- osd needs a set_floor_and_read op for safe failover/STOGITH-like semantics.
- could mark dir complete in EMetaBlob by counting how many dentries are dirtied in the current log epoch in CDir...
- FIXME: how to journal/store root and stray inode content?
  - in particular, i care about dirfragtree.. get it on rejoin?
  - and dir sizes, if i add that... also on rejoin?
- add FILE_CAP_EXTEND capability bit

journaler
- fix up for large events (e.g. imports)
- use set_floor_and_read for safe takeover from possibly-not-quite-dead otherguy.
- should we pad with zeros to avoid splitting individual entries?
  - make it a g_conf flag?
  - have to fix reader to skip over zeros (either <4 bytes for size, or zeroed sizes); see the reader sketch after the simplemessenger section
- need to truncate at detected (valid) write_pos to clear out any other partial trailing writes

osdmon
- monitor needs to monitor some osds...

crush
- allow forcefeed for more complicated rule structures. (e.g. make force_stack a list< set >)

pgmon
- include osd vector with pg state
- check for orphan pgs
- monitor pg states, notify on out?
- watch osd utilization; adjust overload in cluster map

mon
- paxos needs to clean up old states.
- some sort of tester for PaxosService...
- osdmon needs to lower-bound old osdmap versions it keeps around?

osd
- snap_trimmer should detect, remove unused snap collections (and update snap_collections set)
- cope with divergent logs (update AND removal) in merge_log... (make merge_log augment omissing?)
- how does an admin intervene when a pg needs a dead osd to repeer?
- a more general fencing mechanism? per-object granularity isn't usually a good match.
- consider implications of nvram writeahead logs
- flag missing log entries on crash recovery --> WRNOOP? or WRLOST?
- efficiently replicate clone() objects
- fix heartbeat wrt new replication
- mark residual pgs obsolete ???
- rdlocks
- optimize remove wrt recovery pushes

simplemessenger
- close idle connections

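sketch: journaler reader skipping zero padding
a minimal sketch of the zero-padding reader fix noted in the journaler section,
assuming 4-byte little-endian length prefixes and a fixed pad boundary
('period'); PaddedLogReader and its fields are hypothetical names, not the
actual Journaler code.

  #include <cstdint>
  #include <cstring>
  #include <optional>
  #include <string>
  #include <vector>

  struct PaddedLogReader {
    std::vector<uint8_t> buf;   // journal bytes already fetched
    uint64_t pos = 0;           // read offset into buf
    uint64_t period;            // boundary the writer pads up to (e.g. stripe size)

    explicit PaddedLogReader(uint64_t p) : period(p) {}

    // Return the next entry, skipping padding: either <4 bytes left before the
    // boundary (no room for a size), or a zeroed size field.
    std::optional<std::string> next_entry() {
      while (pos < buf.size()) {
        uint64_t to_boundary = period - (pos % period);
        uint32_t len = 0;
        if (to_boundary < sizeof(len)) {  // <4 bytes for size: all padding
          pos += to_boundary;
          continue;
        }
        std::memcpy(&len, &buf[pos], sizeof(len));
        if (len == 0) {                   // zeroed size: skip to next boundary
          pos += to_boundary;
          continue;
        }
        if (pos + sizeof(len) + len > buf.size())
          return std::nullopt;            // partial trailing write; truncate here
        std::string entry((const char*)&buf[pos + sizeof(len)], len);
        pos += sizeof(len) + len;
        return entry;
      }
      return std::nullopt;
    }
  };
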
objectcacher
- read locks?
- maintain more explicit inode grouping instead of wonky hashes

ebofs
- btrees
  - checksums
  - dups
  - sets
- optionally scrub deallocated extents
- clone()
- map ObjectStore
- verify proper behavior of conflicting/overlapping reads of clones
- combine inodes and/or cnodes into same blocks
- fix bug in node rotation on insert (and reenable)
- fix NEAR_LAST_FWD (?)
- awareness of underlying software/hardware raid in allocator so that we write full stripes _only_.
  - hmm, that's basically just a large block size.
- rewrite the btree code!
  - multithreaded
  - eliminate nodepools
  - allow btree sets
  - allow arbitrary embedded data?
  - allow arbitrary btrees
  - allow root node(s?) to be embedded in onode, or wherever.
  - keys and values can be uniform (fixed-size) or non-uniform.
    - fixed size (if any) is a value in the btree struct.
      - negative indicates bytes of length value? (1 -> 255 bytes, 2 -> 65535 bytes, etc.?) (see the encoding sketch at the end of this file)
    - non-uniform records preceded by length.
  - keys sorted via a comparator defined in btree root.
    - lexicographically, by default.
- goal
  - object btree key->value payload, not just a data blob payload.
  - better threading behavior.
  - with transactional goodness!
- onode
  - object attributes.. as a btree?
  - blob stream
  - map stream.
  - allow blob values.

remaining hard problems
- how to cope with file size changes and read/write sharing

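sketch: uniform vs non-uniform btree record sizes
a minimal sketch of the size convention floated in the btree-rewrite notes,
assuming the btree root carries a signed size field: positive means fixed-size
records of that many bytes, negative means each record is preceded by a
little-endian length prefix of that many bytes (1 byte -> up to 255-byte
records, 2 bytes -> up to 65535, etc.); BtreeRoot and append_record are
hypothetical names, not the ebofs on-disk format.

  #include <cstdint>
  #include <string>
  #include <vector>

  struct BtreeRoot {
    int16_t record_size;  // >0: fixed record size; <0: bytes of length prefix
  };

  // Append one record to a node buffer under the convention above.
  void append_record(std::vector<uint8_t>& node, const BtreeRoot& root,
                     const std::string& rec) {
    if (root.record_size > 0) {
      // uniform: caller guarantees rec.size() == root.record_size
      node.insert(node.end(), rec.begin(), rec.end());
    } else {
      // non-uniform: little-endian length prefix of -record_size bytes
      uint64_t len = rec.size();
      for (int i = 0; i < -root.record_size; i++)
        node.push_back((len >> (8 * i)) & 0xff);
      node.insert(node.end(), rec.begin(), rec.end());
    }
  }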