code cleanup
- userspace encoding/decoding needs major cleanup
  - use le32 etc annotation
  - probably kill base case in encoder.h, replace with int types, with appropriate swabbing?
- addr=?
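A minimal sketch of what the swabbing replacement for the encoder.h base case could look like.  The helper names (to_le32, encode_le32, decode_le32) are made up for illustration, and a std::string stands in for a bufferlist:

  #include <cstdint>
  #include <cstring>
  #include <string>

  // Swab to little-endian; on little-endian hosts this is the identity.
  static inline uint32_t to_le32(uint32_t v) {
  #if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    return __builtin_bswap32(v);
  #else
    return v;
  #endif
  }

  // Append v to the buffer in little-endian byte order.
  static inline void encode_le32(uint32_t v, std::string &bl) {
    uint32_t le = to_le32(v);
    bl.append(reinterpret_cast<const char *>(&le), sizeof(le));
  }

  // Read a little-endian u32 from the cursor and advance it.
  static inline uint32_t decode_le32(const char *&p) {
    uint32_t le;
    std::memcpy(&le, p, sizeof(le));
    p += sizeof(le);
    return to_le32(le);  // swab back to host order if needed
  }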
userspace client
- per-mds session struct.. and put the cap_ttl in there!
- move the size check(s) on read from _read() into FileCache
- validate dn leases
- clean up ll_ interface, now that we have leases!
- obey file_max
- revoke own caps when they time out
- clean up client mds session vs mdsmap behavior?
- client caps migration races
  - caps need a seq number; reap logic needs to be a bit smarter
  - also needs to cope with mds failures
- reference count lease validations on path lookup?

kernel client
- flush caps on sync, fsync, etc.
  - do we need to block?
- timeout mds session close on umount
- file_data_version stuff!
- deal with CAP_RDCACHE properly: invalidate cache pages?
- what happens after reconnect when we get a successful reply but no trace (!) on e.g. rename, unlink, link, open+O_CREAT, etc...
  - fallback in each case (in ceph_unlink, _rename, etc.) when the reply has no trace?
  - what about open(O_WR or O_CREAT)?  probably needs a fix in the mds... though it's racy now...
- procfs/debugfs
  - adjust granular debug levels too
  - should we be using debugfs?
  - a dir for each client instance (client###)?
  - hooks to get mds, osd, monmap epoch #s
- nfs exporting
  - fill_trace needs to use d_splice_alias
  - lookup needs to return last_dentry if it gets swapped by fill_trace/d_splice_alias
  - build fat fh's with multiple ancestors.. filenames..?

vfs
- can we use dentry_path(), if it gets merged into mainline?

io / osd client
- osd ack vs commit handling.  hmm!
- fix handling of resent message pages

osdmon
- monitor needs to monitor some osds...

crush
- more efficient failure when all/too many osds are down
- allow forcefeed for more complicated rule structures (e.g. make force_stack a list<set>)
- "knob" bucket

pgmon
- monitor pg states, notify on out?
- watch osd utilization; adjust overload in cluster map

mon
- paxos needs to clean up old states
- some sort of tester for PaxosService...
- osdmon needs to lower-bound old osdmap versions it keeps around?

mds mustfix
- rename slave in-memory rollback on failure
- proper handling of cache expire messages during rejoin phase?
  -> i think cache expires are fine; the rejoin_ack handler just has to behave if rejoining items go missing
- try_remove_unlinked_dn thing
- rerun destro trace against latest, with various journal lengths

mds
- lease length heuristics
- mds lock last_change stamp?
- fix file_data_version
- on recovery, validate file sizes when max_size > size
- coalesce lease revocations on dir inode + dentry, where possible
- fix reconnect/rejoin open file weirdness
- get rid of C*Discover objects for replicate_to.. encode to bufferlists directly?
- consistency points/snapshots
- dentry versions vs dirfrags...
- failure during reconnect vs clientmap
- inode.rmtime (recursive mtime)?
- make inode.size reflect directory size (number of entries)?
- osd needs a set_floor_and_read op for safe failover/STOGITH-like semantics
- could mark dir complete in EMetaBlob by counting how many dentries are dirtied in the current log epoch in CDir...
- fix rmdir empty exported dirfrag race
  - export all frags <= 1 item?  then we ensure freezing before empty, avoiding any last unlink + export vs rmdir race.
- how to know full dir size (when trimming)?
  - put frag size/mtime in fragmap in inode?  we will need that anyway for stat on dirs
    - will need to make inode discover/import_decode smart about dirfrag auth
    - or, only put frag size/mtime in inode when the frag is closed; otherwise it is soft (journaled) state, possibly on another mds
- need to move state from replicas to auth; simplelock doesn't currently support that
  - ScatterLock or something?  hrm.
- FIXME: how to journal/store root and stray inode content?
  - in particular, i care about dirfragtree.. get it on rejoin?
  - and dir sizes, if i add that... also on rejoin?
- efficient stat for single writers
- add FILE_CAP_EXTEND capability bit

journaler
- fix up for large events (e.g. imports)
- use set_floor_and_read for safe takeover from possibly-not-quite-dead otherguy
- should we pad with zeros to avoid splitting individual entries?  (see the padding sketch after the osd/rados section)
  - make it a g_conf flag?
  - have to fix reader to skip over zeros (either <4 bytes for size, or zeroed sizes)
- need to truncate at detected (valid) write_pos to clear out any other partial trailing writes

fsck
- fsck.ebofs
- online mds fsck?
  - object backpointer attrs to hint catastrophic reconstruction?

rados snapshots
- integrate revisions into ObjectCacher?
- clean up oid.rev vs op.rev in osd+osdc  (the rules below are restated as a sketch after the osd/rados section)
- attr.crev is the rev we were created in
- oid.rev=0 is "live"; defined for attr.crev <= rev
- otherwise, defined for attr.crev <= rev < oid.rev (i.e. oid.rev is an upper bound, non-inclusive)
- write|delete is tagged with op.rev
  - if attr.crev < op.rev
    - we clone to oid.rev=rev (the clone keeps the old crev)
    - change live attr.crev=rev
  - apply the update
- read is tagged with op.rev
  - if 0, we read from 0 (if it exists)
  - otherwise we choose the object rev based on op.rev vs oid.rev, and then verify attr.crev <= op.rev

objecter
- maybe_request_map should set a timer event to periodically re-request
- transaction prepare/commit
- read+floor_lockout

osd/rados
- fix build_prior_set behavior; it needs to not always exclude currently down nodes.  e.g.,
    epoch 1: A B
    epoch 2:   B
    epoch 3: A
  -> prior_set should include B, because B may have independently applied updates.
    epoch 1: A B C
    epoch 2:   B C
    epoch 3: A   C
  -> prior_set need not include B, because C would carry any epoch 2 updates.
  -> so: we need at least 1 osd from each epoch, IFF we make the store sync on osdmap boundaries.
  -> so, use calc_priors_during in build_prior, then make the recovery code check is_up.  (a rough sketch follows this section)
- paxos replication (i.e. majority voting)?
- transaction prepare/commit
  - rollback
  - rollback logging (to fix the slow prepare vs rollback race)
- a more general fencing mechanism?  per-object granularity isn't usually a good match
- consider implications of nvram writeahead logs
- flag missing log entries on crash recovery --> WRNOOP? or WRLOST?
- efficiently replicate clone() objects
- fix heartbeat wrt new replication
- mark residual pgs obsolete ???
- rdlocks
- optimize remove wrt recovery pushes
- report crashed pgs?
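A rough sketch of the "at least one osd from each epoch" idea above, assuming calc_priors_during walks the up-set history; the types here are simplified stand-ins, not the real osdmap interface:

  #include <map>
  #include <set>
  #include <vector>

  using epoch_t = unsigned;

  // Gather every osd that was up at some point in [first, last), so the
  // prior set covers each epoch (valid IFF the store syncs on osdmap
  // boundaries); recovery then checks is_up separately rather than
  // excluding currently down nodes here.
  std::set<int> calc_priors_during(
      const std::map<epoch_t, std::vector<int>> &up_history,
      epoch_t first, epoch_t last) {
    std::set<int> prior;
    for (epoch_t e = first; e < last; e++) {
      auto it = up_history.find(e);
      if (it == up_history.end())
        continue;
      for (int osd : it->second)
        prior.insert(osd);  // any osd up in epoch e carries epoch e's updates
    }
    return prior;
  }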
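Restating the rados snapshot rev rules above as code; the struct and helpers are illustrative only, not the actual osd/osdc types:

  using rev_t = unsigned;

  struct obj_revs {
    rev_t oid_rev;  // 0 = "live"; otherwise a non-inclusive upper bound
    rev_t crev;     // attr.crev: the rev this object was created in
  };

  // oid.rev=0 is defined for attr.crev <= rev; otherwise defined for
  // attr.crev <= rev < oid.rev.
  bool defined_for(const obj_revs &o, rev_t rev) {
    if (o.oid_rev == 0)
      return o.crev <= rev;
    return o.crev <= rev && rev < o.oid_rev;
  }

  // write|delete tagged with op.rev: if the live object predates op.rev,
  // clone to oid.rev=rev (the clone keeps the old crev), set the live
  // attr.crev=rev, and then apply the update.
  bool write_needs_clone(const obj_revs &live, rev_t op_rev) {
    return live.crev < op_rev;
  }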
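If journal entries carry a 4-byte size prefix, the zero-padding idea could work roughly like this; the block size and both function names are assumptions for illustration:

  #include <cstdint>

  static const uint64_t JBLOCK = 4096;  // hypothetical alignment unit

  // Writer: zero-pad to the next boundary instead of splitting an entry,
  // when the [4-byte size][payload] record would straddle the boundary but
  // fits within a single block on its own.
  uint64_t pad_before_entry(uint64_t pos, uint32_t len) {
    uint64_t rec = sizeof(uint32_t) + (uint64_t)len;
    uint64_t room = JBLOCK - pos % JBLOCK;
    return (rec > room && rec <= JBLOCK) ? room : 0;
  }

  // Reader: <4 bytes left before the boundary, or a zeroed size, means
  // padding; skip ahead to the next block boundary.
  uint64_t skip_padding(uint64_t pos, uint32_t size_read) {
    uint64_t room = JBLOCK - pos % JBLOCK;
    if (room < sizeof(uint32_t) || size_read == 0)
      return pos + room;
    return pos;
  }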
messenger
- fix messenger shutdown.. we shouldn't delete the messenger, since the caller may still be referencing it, etc.

simplemessenger
- fix/audit accept() logic to detect reset, do callback
- close idle connections
- take a look at RDS?  http://oss.oracle.com/projects/rds/

objectcacher
- merge clean bh's
- ocacher caps transitions vs locks
- test read locks

reliability
- heartbeat vs ping?
- osdmonitor, filter

ebofs
- btrees
  - checksums
  - dups
  - sets
- optionally scrub deallocated extents
- clone()
  - map ObjectStore
  - verify proper behavior of conflicting/overlapping reads of clones
- combine inodes and/or cnodes into the same blocks
- fix the bug in node rotation on insert (and reenable it)
- fix NEAR_LAST_FWD (?)
- awareness of underlying software/hardware raid in the allocator so that we write full stripes _only_
  - hmm, that's basically just a large block size
- rewrite the btree code!
  - multithreaded
  - eliminate nodepools
  - allow btree sets
  - allow arbitrary embedded data?
  - allow arbitrary btrees
  - allow root node(s?) to be embedded in the onode, or wherever
  - keys and values can be uniform (fixed-size) or non-uniform  (an encoding sketch is at the end of this file)
    - the fixed size (if any) is a value in the btree struct
      - negative indicates bytes of length value?  (1 -> 255 bytes, 2 -> 65535 bytes, etc.?)
    - non-uniform records are preceded by their length
  - keys sorted via a comparator defined in the btree root
    - lexicographically, by default
- goal
  - object btree key->value payload, not just a data blob payload
  - better threading behavior
    - with transactional goodness!
- onode
  - object attributes.. as a btree?
  - blob stream
  - map stream
  - allow blob values

remaining hard problems
- how to cope with file size changes and read/write sharing
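A small sketch of the uniform vs non-uniform record sizing rule from the btree rewrite notes above; the function name and the use of std::string as a buffer are illustrative only:

  #include <cstdint>
  #include <string>

  // fixed_size > 0: uniform records of exactly that many bytes.
  // fixed_size < 0: each record is preceded by |fixed_size| bytes of
  // little-endian length (1 -> up to 255 bytes, 2 -> up to 65535, etc.).
  void encode_record(int fixed_size, const std::string &rec, std::string &out) {
    if (fixed_size > 0) {
      out.append(rec);  // uniform: caller guarantees rec.size() == fixed_size
      return;
    }
    int lenbytes = -fixed_size;          // e.g. -1 -> one length byte
    uint64_t len = rec.size();
    for (int i = 0; i < lenbytes; i++)   // little-endian length prefix
      out.push_back(static_cast<char>((len >> (8 * i)) & 0xff));
    out.append(rec);
  }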