some smallish projects:

- crush rewrite in C
  - generalize any memory management etc. to allow use in kernel and userspace
- userspace crush tools
  - xml import/export?
  - ?

- pg monitor service
  - to support statfs?
  - general pg health
  - some sort of (throttled) osd status reporting
  - dynamic pg creation (eventually!)

- SimpleMessenger
  - clean up/merge Messenger/Dispatcher interfaces
  - auto close idle connections
  - delivery ack and buffering, and then reconnect
  - take a look at RDS?  http://oss.oracle.com/projects/rds/

- generalize monitor client?
  - throttle message resend attempts

- ENOSPC on client, OSD

code cleanup
- endian portability
- word size
- clean up all encoded structures

general kernel planning
- soft consistency on (kernel) lookup?
- accurate reconstruction of (syscall) path?

sage doc
- mdsmonitor beacon semantics
- cache expiration, cache invariants
  - including dual expire states, transition, vs subtree grouping of expire messages
- recovery states, implicit barrier are rejoin
- journal content
  - importmaps and up:resolve
  - metablob version semantics

sage mds

- fix server unlink .. needs to use slave_requests to clean up any failures during the resolve stage

/- .ceph_hosts file, so we can use the infiniband addresses

- look at mds osds

- the split/merge plan:

- hmm, should we move ESubtreeMap out of the journal?
  that would avoid all the icky weirdness in shutdown, with periodic logging, etc.

- extend/clean up filepath to allow paths relative to an ino
  - fix path_traverse

- fix reconnect/rejoin open file weirdness

- stray reintegration
- stray purge on shutdown
  - need to export stray crap to another mds..
  - verify stray is empty on shutdown

- consistency points/snapshots
  - dentry versions vs dirfrags...

- more testing of failures + thrashing.
  - is export prep dir open deadlock properly fixed by forge_replica_dir()?
- failures during recovery stages (resolve, rejoin)... make sure rejoin still works!
- detect and deal with client failure
  - failure during reconnect vs clientmap.  although probably the whole thing needs a larger overhaul...

- inode.max_size
- inode.allocated_size

- real chdir (directory "open")
  - relative metadata ops

- osd needs a set_floor_and_read op for safe failover/STOGITH-like semantics.

- EMetaBlob should return 'expired' if they have higher versions (and are thus described by a newer journal entry)

- could mark dir complete in EMetaBlob by counting how many dentries are dirtied in the current log epoch in CDir...

- fix rmdir empty exported dirfrag race
  - export all frags <= 1 item?  then we ensure freezing before empty, avoiding any last unlink + export vs rmdir race.
  - how to know full dir size (when trimming)?
    - put frag size/mtime in fragmap in inode?  we will need that anyway for stat on dirs
      - will need to make inode discover/import_decode smart about dirfrag auth
    - or, only put frag size/mtime in inode when frag is closed.  otherwise, soft (journaled) state, possibly on another mds.
  - need to move state from replicas to auth.  simplelock doesn't currently support that.
    - ScatterLock or something?  hrm.

- FIXME how to journal root and stray inode content?
  - in particular, i care about dirfragtree.. get it on rejoin?
  - and dir sizes, if i add that... also on rejoin?

osdmon
- allow fresh replacement osds.  add osd_created in osdmap, probably
- monitor needs to monitor some osds...
- monitor pg states, notify on out?
- watch osd utilization; adjust overload in cluster map

journaler
- fix up for large events (e.g. imports)
- use set_floor_and_read for safe takeover from possibly-not-quite-dead otherguy.
- should we pad with zeros to avoid splitting individual entries?
  - make it a g_conf flag?
  - have to fix reader to skip over zeros (either <4 bytes for size, or zeroed sizes)
- need to truncate at detected (valid) write_pos to clear out any other partial trailing writes

crush
- xml import/export?
- crush tools

rados snapshots
- integrate revisions into ObjectCacher
- clean up oid.rev vs op.rev in osd+osdc

- attr.crev is rev we were created in.
- oid.rev=0 is "live".  defined for attr.crev <= rev.
- otherwise, defined for attr.crev <= rev < oid.rev  (i.e. oid.rev is upper bound, non-inclusive.)

- write|delete is tagged with op.rev
  - if attr.crev < op.rev
    - we clone to oid.rev=rev (clone keeps old crev)
    - change live attr.crev=rev.
  - apply update
- read is tagged with op.rev
  - if 0, we read from 0 (if it exists).
  - otherwise we choose object rev based on op.rev vs oid.rev, and then verifying attr.crev <= op.rev.

- how to get usage feedback to monitor?

- clean up mds caps release in exporter
- figure out client failure modes
- add connection retry.

objecter
- transaction prepare/commit
- read+floor_lockout

osd/rados
- transaction prepare/commit
  - rollback
  - rollback logging (to fix slow prepare vs rollback race)
- read+floor_lockout for clean STOGITH-like/fencing semantics after failover.
- consider implications of nvram writeahead logs
- clean shutdown?
- pgmonitor should supplement failure detection
- flag missing log entries on crash recovery  --> WRNOOP? or WRLOST?
- efficiently replicate clone() objects
- fix heartbeat wrt new replication
- mark residual pgs obsolete  ???
- rdlocks
- optimize remove wrt recovery pushes
- report crashed pgs?

messenger
- fix messenger shutdown.. we shouldn't delete messenger, since the caller may be referencing it, etc.

simplemessenger
- close idle connections
- buffer sent messages until a receive is acknowledged (handshake!)
  - retry, timeout on connection or transmission failure
- exponential backoff on monitor resend attempts (actually, this should go outside the messenger!)

objectcacher
- merge clean bh's
- ocacher caps transitions vs locks
- test read locks

reliability
- heartbeat vs ping?
- osdmonitor, filter

ebofs
- allow holes
- verify proper behavior of conflicting/overlapping reads of clones
- combine inodes and/or cnodes into same blocks
- allow btree sets instead of maps
- eliminate nodepools
- nonblocking write on missing onodes?
- fix bug in node rotation on insert (and reenable)
- fix NEAR_LAST_FWD (?)

- awareness of underlying software/hardware raid in allocator so that we write full stripes _only_.
  - hmm, that's basically just a large block size.

- rewrite the btree code!
  - multithreaded
  - eliminate nodepools
  - allow btree sets
  - allow arbitrary embedded data?
  - allow arbitrary btrees
  - allow root node(s?) to be embedded in onode, or wherever.
  - keys and values can be uniform (fixed-size) or non-uniform.
    - fixed size (if any) is a value in the btree struct.
      - negative indicates bytes of length value?  (1 -> 255bytes, 2 -> 65535 bytes, etc.?)
    - non-uniform records preceded by length.
  - keys sorted via a comparator defined in btree root.
    - lexicographically, by default.

- goal
  - object btree key->value payload, not just a data blob payload.
  - better threading behavior.
    - with transactional goodness!

- onode
  - object attributes.. as a btree?
  - blob stream
  - map stream.
    - allow blob values.

remaining hard problems
- how to cope with file size changes and read/write sharing

crush
- more efficient failure when all/too many osds are down
- allow forcefeed for more complicated rule structures.  (e.g. make force_stack a list< set >)

mds
- distributed client management
- chdir (directory opens!)
- rewrite logstream
  - clean up
  - be smart about rados ack vs reread
  - log locking?  root log object
  - trimming, rotation

- efficient stat for single writers
- lstat vs stat
- add FILE_CAP_EXTEND capability bit
- only share osdmap updates with clients holding capabilities
- delayed replica caps release... we need to set a timer event? (and cancel it when appropriate?)
- finish hard links!
  - reclaim danglers from inode file on discover...
- fix rename wrt hard links
- interactive hash/unhash interface
- test hashed readdir
- make logstream.flush align itself to stripes

- carefully define/document frozen wrt dir_auth vs hashing

client
- fstat
- mixed lazy and non-lazy io will clobber each others' caps in the buffer cache.. how to isolate..
- test client caps migration w/ mds exports
- some heuristic behavior to consolidate caps to inode auth?

why qsync could be wrong (for very strict POSIX) : varying mds -> client message transit or processing times.
- mds -> 1,2 : qsync
- client1 writes at byte 100
- client1 -> mds : qsync reply (size=100)
- client1 writes at byte 300
- client1 -> client2 (outside channel)
- client2 writes at byte 200
- client2 -> mds : qsync reply (size=200)
-> stat results in size 200, even though at no single point in time was the max size 200.
-> for correct result, need to _stop_ client writers while gathering metadata.

SAGE:

- string table?

- hard links
  - fix MExportAck and others to use dir+dentry, not inode
    (otherwise this all breaks with hard links.. altho it probably needs reworking already!)

- do real permission checks?

ISSUES

- discover
  - soft: authority selectively replicates, or sets a 'forward' flag in reply
  - hard: authority always replicates (eg. discover for export)
  - forward flag (see soft)
  - error flag (if file not found, etc.)
  - [what was i talking about?] make sure waiters are properly triggered, either upon dir_rep update, or (empty!) discover reply

DOCUMENT
- cache, distributed cache structure and invariants
- export process
- hash/unhash process

TEST
- hashing
  - test hash/unhash operation
  - hash+export: encode list of replicated dir inodes so they can be discovered before import is processed.
  - test nauthitems (wrt hashing?)

IMPLEMENT
- smarter balancing
  - popularity calculation and management is inconsistent/wrong.
  - does it work?
- dump active config in run output somewhere
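
a rough sketch of the "why qsync could be wrong" timeline from the client notes, to make the race concrete. this is a toy model, not mds or client code: replay(), TIMELINE and stop_writers are made-up names, file size is taken to be the highest byte offset written so far, and "stopping" a writer is modeled by simply dropping any write a client issues after it has sent its qsync reply.

```python
def replay(events, stop_writers=False):
    """Replay (kind, client, offset) events; return (stat_size, size_history).

    stat_size is the max over all qsync replies (what the mds would report).
    size_history lists every size the file actually had during the replay.
    """
    local_max = {}   # largest offset each client has written itself
    replies = {}     # size each client reported in its qsync reply
    true_size = 0
    history = [0]
    for kind, client, offset in events:
        if kind == "write":
            if stop_writers and client in replies:
                # hypothetical fix: a client that already replied may not
                # write until the gather completes (write omitted here)
                continue
            local_max[client] = max(local_max.get(client, 0), offset)
            true_size = max(true_size, offset)
            history.append(true_size)
        elif kind == "reply":
            replies[client] = local_max.get(client, 0)
    return max(replies.values()), history


TIMELINE = [
    ("write", 1, 100),
    ("reply", 1, None),  # client1 -> mds : qsync reply (size=100)
    ("write", 1, 300),
    ("write", 2, 200),   # follows client1 -> client2 over an outside channel
    ("reply", 2, None),  # client2 -> mds : qsync reply (size=200)
]

racy_stat, racy_history = replay(TIMELINE)
safe_stat, safe_history = replay(TIMELINE, stop_writers=True)

print("racy:", racy_stat, racy_history)
print("safe:", safe_stat, safe_history)
```

in the racy replay the gathered stat size is not a value the file ever had. with writers stopped after they reply, the gathered size equals the real size at the moment the last reply arrives.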