v0.3
v0.4
- ENOSPC
- finish client failure recovery (reconnect after long eviction; and slow delayed reconnect)
- make kclient use ->sendpage?
- rip out io interruption?

big items
- ENOSPC
  - quotas
  - accounting
  - enforcement
- rados cow/snapshot infrastructure
- mds snapshots
- mds security enforcement
  - client, user authentication
  - cas
- use libuuid

userspace client
- handle session STALE
- time out caps, wake up waiters on renewal
  - link caps with mds session
- validate dn leases
  - fix lease validation to check session ttl
- clean up ll_ interface, now that we have leases!
- clean up client mds session vs mdsmap behavior?
- stop using mds's inode_t?
- fix readdir vs fragment race by keeping a separate frag pos, and ignoring dentries below it
  (see the readdir sketch at the end of this file)

kernel client
- make writepages maybe skip pages with errors?
  - EIO, or ENOSPC?
  - ... writeback vs ENOSPC vs flush vs close()... hrm...
- set mapping bits for ENOSPC, EIO?
- flush caps on sync, fsync, etc.
  - do we need to block?
- timeout mds session close on umount
- deal with CAP_RDCACHE properly: invalidate cache pages?
- procfs/debugfs
  - adjust granular debug levels too
  - should we be using debugfs?
  - a dir for each client instance (client###)?
  - hooks to get mds, osd, monmap epoch #s
- vfs
  - can we use dentry_path(), if it gets merged into mainline?
- io / osd client
  - osd ack vs commit handling. hmm!
- fix readdir vs fragment race by keeping a separate frag pos, and ignoring dentries below it

vfs issues
- real_lookup() race:
    1- hash lookup finds no dentry
    2- real_lookup() takes dir i_mutex, but then finds a dentry
    3- drops mutex, then calls d_revalidate. if that fails, we return ENOENT (instead of looping?)
- vfs_rename_dir()

client
- clean up client mds session vs mdsmap behavior?

osdmon
- monitor needs to monitor some osds...

crush
- more efficient failure when all/too many osds are down
- allow forcefeed for more complicated rule structures. (e.g. make force_stack a list< set >)
- "knob" bucket

pgmon
- monitor pg states, notify on out?
- watch osd utilization; adjust overload in cluster map

mon
- paxos needs to clean up old states.
- some sort of tester for PaxosService...
- osdmon needs to lower-bound old osdmap versions it keeps around?

mds
- proper handling of cache expire messages during rejoin phase?
  -> i think cache expires are fine; the rejoin_ack handler just has to behave if rejoining items go missing
- try_remove_unlinked_dn thing
- rename: importing inode... also journal imported client map?
- rerun destro trace against latest, with various journal lengths
- lease length heuristics
  - mds lock last_change stamp?
- handle slow client reconnect (i.e. after mds has gone active)
- fix reconnect/rejoin open file weirdness
- get rid of C*Discover objects for replicate_to .. encode to bufferlists directly?
- can we get rid of the dirlock remote auth_pin weirdness on subtree roots?
- anchor_destroy needs to xlock linklock.. which means it needs a Mutation wrapper?
  - ... when it gets a caller.. someday..
- make truncate faster with a trunc_seq, attached to objects as attributes?
- osd needs a set_floor_and_read op for safe failover/STONITH-like semantics.
- could mark dir complete in EMetaBlob by counting how many dentries are dirtied in the current log epoch in CDir...
- FIXME: how to journal/store root and stray inode content?
  - in particular, i care about dirfragtree.. get it on rejoin?
  - and dir sizes, if i add that... also on rejoin?
- efficient stat for single writers
- add FILE_CAP_EXTEND capability bit

journaler
- fix up for large events (e.g. imports)
- use set_floor_and_read for safe takeover from possibly-not-quite-dead otherguy.
- should we pad with zeros to avoid splitting individual entries?
  - make it a g_conf flag?
  - have to fix reader to skip over zeros (either <4 bytes for size, or zeroed sizes)
    (see the journal reader sketch at the end of this file)
- need to truncate at detected (valid) write_pos to clear out any other partial trailing writes

fsck
- fsck.ebofs
- online mds fsck?
  - object backpointer attrs to hint catastrophic reconstruction?

objecter
- fix failure handler...
- generic mon client?
- maybe_request_map should set a timer event to periodically re-request.
- transaction prepare/commit?
- read+floor_lockout

osd/rados
- how does an admin intervene when a pg needs a dead osd to repeer?
- a more general fencing mechanism? per-object granularity isn't usually a good match.
- consider implications of nvram writeahead logs
- flag missing log entries on crash recovery --> WRNOOP? or WRLOST?
- efficiently replicate clone() objects
- fix heartbeat wrt new replication
- mark residual pgs obsolete ???
- rdlocks
- optimize remove wrt recovery pushes
- report crashed pgs?

messenger
- fix messenger shutdown.. we shouldn't delete messenger, since the caller may be referencing it, etc.

simplemessenger
- close idle connections

objectcacher
- merge clean bh's
- ocacher caps transitions vs locks
- test read locks

reliability
- heartbeat vs ping?
- osdmonitor, filter

ebofs
- btrees
  - checksums
  - dups
  - sets
- optionally scrub deallocated extents
- clone()
- map ObjectStore
- verify proper behavior of conflicting/overlapping reads of clones
- combine inodes and/or cnodes into same blocks
- fix bug in node rotation on insert (and reenable)
- fix NEAR_LAST_FWD (?)
- awareness of underlying software/hardware raid in allocator so that we write full stripes _only_.
  - hmm, that's basically just a large block size.
- rewrite the btree code!
  - multithreaded
  - eliminate nodepools
  - allow btree sets
  - allow arbitrary embedded data?
  - allow arbitrary btrees
  - allow root node(s?) to be embedded in onode, or wherever.
  - keys and values can be uniform (fixed-size) or non-uniform.
    - fixed size (if any) is a value in the btree struct.
      - negative indicates bytes of length value? (1 -> 255 bytes, 2 -> 65535 bytes, etc.?)
    - non-uniform records preceded by length.
  - keys sorted via a comparator defined in btree root.
    - lexicographically, by default.
  - goal
    - object btree key->value payload, not just a data blob payload.
    - better threading behavior.
      - with transactional goodness!
  - onode
    - object attributes.. as a btree?
    - blob stream
    - map stream.
      - allow blob values.
- remaining hard problems
  - how to cope with file size changes and read/write sharing

snapshot notes --

todo
- rados bits to do clone+write
  - fix cloning on unlinked file (where snaps=[], but head may have follows_snap attr)
- figure out how to fix up rados logging
- snap collections
- garbage collection
- fetch may need to adjust loaded dentry first,last?
- client reconnect vs snaps
- hard link backpointers
  - anchor source dir
  - build snaprealm for any hardlinked file
    - include snaps for all (primary+remote) parents
- migrator import/export of versioned dentries, inodes... drop them on export...
    primary file link -> old inode
    primary dir link -> multiversion inode
    remote link -> multiversion inode
- for simplicity, don't replicate any snapshot data.
- rename() needs to create a new realm if src/dst realms differ and (rrealms, or open_children, or not subtree leaf)
  (similar logic to the anchor update)
- will snapshots and CAS play nice?
  - cas object refs should follow same deletion semantics as non-cas objects.
- locker
  - dirstat/fragstats
- mds server ops
  - link rollback
  - rename rollback
  - snaplock semantics
    - fast bcast of any updates?
    - what about dir rename snaprealm update? make parallel update on each mds?
- when we create a snapshot,
  - xlock snaplock
  - create realm, if necessary
  - add it to the realm snaps list.
  - build list of current children
  - send client a capgroup update for each affected realm
    (as we unlock the snaplock? or via a separate lock event that pushes the update out to replicas?)
- when a client is opening a file
  - if it is in an existing capgroup, all is well.
  - if it is not, rdlock all ancestor snaprealms, and open a new capgroup with the client.
    - or should we even bother rdlocking? the snap creation is going to be somewhat async, regardless...
- what is snapid?
  - can we get away with it _not_ being ordered?
    - for osds.. yes.
    - for mds.. may make the cdentry range info tricky!
  - osds need to see snapid deletion events in osdmap incrementals. or, snapmap?
  - so... assign it via mds0 metadata
- fix up inode_map to key off vinodeno.. or have a second map for non-zero snapids..
  - no, just key off vinodeno_t, and make it
      CInode *get_inode(inodeno_t ino, snapid_t sn=NOSNAP);
      struct vinodeno_t {
        inodeno_t ino;
        snapid_t snapid;
      };
- dentry: replace  dname -> ino, rino+rtype  with  (dname, first, last) -> vino, vino+rtype
  - live dentries have last = NOSNAP. kept in separate map:
      map<string, CDentry*> items;
      map<pair<string, snapid_t>, CDentry*> vitems;
  - or? clean up dir item map/hash at the same time (keep name storage in CDentry)
      map<pair<string, snapid_t>, CDentry*> items;        // all items
    - or? map<snapid_t, map<string, CDentry*> > items;    // lastsnap -> name ->
      - no.. then all the CDir::map_t coded loops break
      CDentry *lookup(string &dname, snapid_t sn=NOSNAP);
  - track vitem count in fragstat.
    - when vitem count gets large, add pointer in fnode indicating vitem range stored in separate dir object.
  (see the dentry keying sketch at the end of this file)

client
- also keep caps linked into snaprealm list
  - current snapid (lineage) for each snaprealm.
  - just keep it simple; don't bother with snaprealm linkages!
- attach snapid (lineage) to each dirty page
  - can we cow the page if it's dirty but a different realm? ...hmm probably not, but we can flush it in
    write_begin, just like when we do a read to make it clean

osd
- pass snap lineage with osd op/capability
- tag each non-live object with the set of snaps it is defined over
- osdmap has sparse map of extant snapids. incrementals are simple rmsnapid, and max_snapid increase
- put each object in first_snap, last_snap collections.
- use background thread to trim old snaps.
  - for each object in first_snap|last_snap collections,
    - get snap list,
    - filter against extant snaps
    - adjust collections, or delete
- adjust coll_t namespace to allow first_snap/last_snap collections..
  - pg.u.type = CEPH_PG_TYPE_SNAP_LB/UB?
- read:
    oid.snap=NOSNAP  op.snaps=[3,2,1]   // read latest
    oid.snap=3       op.snaps=[3,2,1]   // read old snap 3
- write:
  - oid.snap=NOSNAP  op.snaps=[3,2,1]   // write to latest
      clone NOSNAP to oid.snaps[0] if oid:snaps[0] != snaps[0]
      set snap_first based on snaps array.
- write(oid.snap=NOSNAP snaps=[3,2,1])
    oid.snap=NOSNAP oid:prior=3,2,1 -> do nothing, write to existing
- write(oid.snap=NOSNAP snaps=[300,2,1])
    oid.snap=NOSNAP oid:prior=2,1 -> clone to oid.snap=300, snaps=300
- write(oid.snap=NOSNAP snaps=[2,1])
    oid.snap=NOSNAP oid:prior=3,2,1 -> slow request, but who cares. use snaps=3,2,1, write to existing as above.
- write(oid.snap=NOSNAP snaps=[3,2,1])
    oid.snap=NOSNAP oid:prior=1 -> clone to oid.snap=3, snaps=3,2
  (the clone-or-write decision for these four cases is sketched at the end of this file)

screw "snap writeback":
- if client is buffering, it has exclusive access to the object. therefore, it can flush older snaps before newer ones. done.
- if multiple clients are writing synchronously, then the client doesn't care if the osd pushes a write "forward" in time to the next snapshot, since it hasn't completed yet.

  write [10]
    NOSNAP follows=10

  write [20,10]
    NOSNAP follows=20,10
    20     [20,10]

  write [45,42,20,10]
    NOSNAP follows=45,42,20,10
    45     [45,42]
    20     [20,10]

  write [42,20,10]
    NOSNAP follows=45,42,20,10
    45     [45,42]
    20     [20,10]

  delete [45,42,20,10]
    NOSNAP follows=45,42,20,10 (empty)
    45     [45,42]
    20     [20,10]

  write [42,20]
    NOSNAP follows=45,42,20,10   * just write to head.
    45     [45,42]
    20     [20,10]

  delete [45,...]
    NOSNAP follows=45,42,20,10 (empty)
    45     [45,42]
    20     [20,10]

  write [60,45,..]
    NOSNAP follows=60,45,42,20,10
    45     [45,42]
    20     [20,10]

  write [70,60,..]
    NOSNAP follows=60,45,42,20,10
    70     [70]
    45     [45,42]
    20     [20,10]

issues:
- how to log and replicate all this cloning...
- snap trimming is async from replicated writes
- snap trimming in general needs to be fast.. and probably a background process..
  -> logically put per-pg snap removal as a discrete pg log event?
- need to index snap_first/last on a per-pg basis... 128-bit collections? or ditch ebofs?

btrfs rev de-duping
- i.e. when sub_op_push gets an object
  - query checksums
  - userland will read+verify ranges are actually a match?
  - or, in pull, do FIEMAP against prior, and next object? how do we know what those are?
  - punch hole (?)
  - clone file range (not entire file)

interface
$ ls -al .snapshot   # list snaps. show both symbolic names, and timestamp names? (symbolic -> timestamp symlinks, maybe)
$ mkdir .snapshot/blah   # create snap
$ rmdir .snapshot/blah   # remove it
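
readdir sketch (illustrative only)
- a rough sketch of the "separate frag pos" idea from the userspace/kernel client items above:
  remember the readdir position as a (fragment, last name) pair and ignore dentries that sort at
  or below it, so a fragment split/merge mid-readdir doesn't hand entries back twice. the types
  here (readdir_pos, dentry_key) are made-up stand-ins, not the real client code.

    #include <cstdint>
    #include <string>
    #include <tuple>
    #include <vector>

    struct readdir_pos {
      uint32_t frag;     // which directory fragment we are walking
      std::string name;  // last dentry name already handed back in that fragment
    };

    struct dentry_key {
      uint32_t frag;
      std::string name;
    };

    // skip anything that sorts at or below the position we already returned
    static bool already_returned(const dentry_key& k, const readdir_pos& pos) {
      return std::tie(k.frag, k.name) <= std::tie(pos.frag, pos.name);
    }

    // walk one fragment's dentries, advancing the separate frag pos as we go
    static void readdir_one_frag(const std::vector<dentry_key>& frag_contents,
                                 readdir_pos& pos,
                                 std::vector<std::string>& out) {
      for (const auto& k : frag_contents) {
        if (already_returned(k, pos))
          continue;                  // dentry is "below" our pos; ignore it
        out.push_back(k.name);
        pos = {k.frag, k.name};      // remember how far we got
      }
    }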
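
journal reader sketch (illustrative only)
- a rough sketch for the journaler zero-padding item: skip padding while decoding length-prefixed
  entries. the framing assumed here (32-bit length per entry, zero length or a short tail meaning
  padding, host byte order) is for illustration only, not the actual Journaler on-disk format.

    #include <cstdint>
    #include <cstring>
    #include <string>
    #include <vector>

    struct journal_entry {
      std::string payload;
    };

    static std::vector<journal_entry> read_entries(const std::vector<uint8_t>& buf) {
      std::vector<journal_entry> out;
      size_t p = 0;
      while (p < buf.size()) {
        if (buf.size() - p < 4)
          break;                        // <4 bytes left: can only be padding
        uint32_t len;
        std::memcpy(&len, &buf[p], 4);  // assumed framing: 32-bit length prefix
        if (len == 0) {                 // zeroed size: padding, keep scanning
          p += 4;
          continue;
        }
        p += 4;
        if (len > buf.size() - p)
          break;                        // truncated trailing write: stop here
        out.push_back({std::string(buf.begin() + p, buf.begin() + p + len)});
        p += len;
      }
      return out;
    }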
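
dentry keying sketch (illustrative only)
- a rough sketch of the vinodeno_t inode_map and the (name, last-snapid) dentry keying discussed
  in the snapshot notes: a lookup for snapid sn finds the first entry whose last >= sn and checks
  that sn >= first. the stub types, the NOSNAP value, and the choice of (name, last) as the map
  key are simplified assumptions, not the real CInode/CDentry/CDir code.

    #include <cstdint>
    #include <map>
    #include <string>
    #include <tuple>
    #include <utility>

    typedef uint64_t inodeno_t;
    typedef uint64_t snapid_t;
    static const snapid_t NOSNAP = ~0ull;   // stand-in value for "live / head"

    struct vinodeno_t {
      inodeno_t ino;
      snapid_t snapid;
      bool operator<(const vinodeno_t& o) const {
        return std::tie(ino, snapid) < std::tie(o.ino, o.snapid);
      }
    };

    struct CInodeStub {};   // placeholder for CInode
    struct CDentryStub {};  // placeholder for CDentry

    std::map<vinodeno_t, CInodeStub*> inode_map;

    CInodeStub* get_inode(inodeno_t ino, snapid_t sn = NOSNAP) {
      auto it = inode_map.find(vinodeno_t{ino, sn});
      return it == inode_map.end() ? nullptr : it->second;
    }

    // dentries keyed by (name, last); live dentries have last = NOSNAP
    struct dn_info { snapid_t first; CDentryStub* dn; };
    std::map<std::pair<std::string, snapid_t>, dn_info> items;

    CDentryStub* lookup(const std::string& dname, snapid_t sn = NOSNAP) {
      auto it = items.lower_bound(std::make_pair(dname, sn));
      if (it == items.end() || it->first.first != dname)
        return nullptr;               // no dentry version with last >= sn
      if (sn < it->second.first)
        return nullptr;               // sn falls in a gap between versions
      return it->second.dn;
    }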
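
clone decision sketch (illustrative only)
- the four write() cases above boil down to: clone the head before writing only when the op's
  snap context carries a snap newer than anything the head already follows, and the clone covers
  exactly those newer snaps. a rough sketch, assuming descending snap lists and a made-up
  clone_decision struct; the deleted/empty-head cases from the timeline are not handled.

    #include <cstdint>
    #include <vector>

    typedef uint64_t snapid_t;

    struct clone_decision {
      bool do_clone;                      // clone head before writing?
      snapid_t clone_snap;                // clone is named by the newest op snap
      std::vector<snapid_t> clone_snaps;  // snaps the clone covers
    };

    // head_prior: snaps the head currently follows (descending, may be empty)
    // op_snaps:   snap context sent with the write (descending)
    clone_decision prepare_write(const std::vector<snapid_t>& head_prior,
                                 const std::vector<snapid_t>& op_snaps) {
      clone_decision d{false, 0, {}};
      if (op_snaps.empty())
        return d;                          // no snap context: plain write to head
      snapid_t newest_followed = head_prior.empty() ? 0 : head_prior.front();
      if (op_snaps.front() <= newest_followed)
        return d;                          // stale or equal context: write to existing head
      d.do_clone = true;
      d.clone_snap = op_snaps.front();     // e.g. prior=2,1 snaps=[300,2,1] -> clone to 300
      for (snapid_t s : op_snaps) {
        if (s <= newest_followed)
          break;                           // clone covers only snaps newer than prior
        d.clone_snaps.push_back(s);        // e.g. prior=1 snaps=[3,2,1] -> clone snaps=3,2
      }
      return d;
    }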