ceph/branches/riccardo/monitor2/TODO



monitor
- finish generic paxos

osdmon
- distribute w/ paxos framework
- allow fresh replacement osds.  add osd_created in osdmap, probably
- monitor needs to monitor some osds...
- monitor pg states, notify on out?
- watch osd utilization; adjust overload in cluster map

mdsmon
- distribute w/ paxos framework

journaler
- fix up for large events (e.g. imports)
- use set_floor_and_read for safe takeover from possibly-not-quite-dead otherguy.
- should we pad with zeros to avoid splitting individual entries?
  - make it a g_conf flag?
  - have to fix reader to skip over zeros (either <4 bytes for size, or zeroed sizes)
- need to truncate at detected (valid) write_pos to clear out any other partial trailing writes


crush
- xml import/export?
- crush tools


rados+ebofs
- purge replicated writes from cache.  (with exception of partial tail blocks.)

rados paper todo?
- better experiments
  - berkeleydb objectstore?
- flush log only in response to subsequent read or write?
- better behaving recovery
- justify use of splay.
  - dynamic replication
- snapshots

rados snapshots
- integrate revisions into ObjectCacher
- clean up oid.rev vs op.rev in osd+osdc

- attr.crev is rev we were created in.
- oid.rev=0 is "live".  defined for attr.crev <= rev.
- otherwise, defined for attr.crev <= rev < oid.rev  (i.e. oid.rev is upper bound, non-inclusive.)

- write|delete is tagged with op.rev
  - if attr.crev < op.rev
    - we clone to oid.rev=rev (clone keeps old crev)
    - change live attr.crev=rev.
  - apply update
- read is tagged with op.rev
  - if 0, we read from 0 (if it exists).
  - otherwise we choose object rev based on op.rev vs oid.rev, and then verifying attr.crev <= op.rev.

- how to get usage feedback to monitor?

- change messenger entity_inst_t
 - no more rank!  make it a uniquish nonce?

- clean up mds caps release in exporter
- figure out client failure modes
- clean up messenger failure modes.
- add connection retry.


objecter
- read+floor_lockout

osd/rados
- read+floor_lockout for clean STOGITH-like/fencing semantics after failover.
- separate out replication code into a PG class, to pave way for RAID

- efficiently replicate clone() objects
- pg_num instead of pg_bits
- flag missing log entries on crash recovery  --> WRNOOP? or WRLOST?
- consider implications of nvram writeahead logs
- fix heartbeat wrt new replication
- mark residual pgs obsolete  ???
- rdlocks
- optimize remove wrt recovery pushes
- pg_bit/pg_num changes
- report crashed pgs?

simplemessenger
- close idle connections
- retry, timeout on connection or transmission failure

objectcacher
- ocacher caps transitions vs locks
- test read locks

reliability
- heartbeat vs ping?
- osdmonitor, filter

ebofs
- verify proper behavior of conflicting/overlapping reads of clones
- test(fix) sync()
- combine inodes and/or cnodes into same blocks
- allow btree sets instead of maps
- eliminate nodepools
- nonblocking write on missing onodes?
- fix bug in node rotation on insert (and reenable)
- fix NEAR_LAST_FWD (?)
- journaling? in NVRAM?
- metadata in nvram?  flash?


remaining hard problems
- how to cope with file size changes and read/write sharing


crush
- more efficient failure when all/too many osds are down
- allow forcefeed for more complicated rule structures.  (e.g. make force_stack a list< set<int> >)


mds
- distributed client management
- anchormgr
  - 2pc
  - independent journal?
  - distributed?
- link count management
  - also 2pc
- chdir (directory opens!)
- rewrite logstream
  - clean up
  - be smart about rados ack vs reread
  - log locking?  root log object
  - trimming, rotation

- efficient stat for single writers
- lstat vs stat
- add FILE_CAP_EXTEND capability bit
- only share osdmap updates with clients holding capabilities
- delayed replica caps release... we need to set a timer event? (and cancel it when appropriate?)
- finish hard links!
 - reclaim danglers from inode file on discover...
 - fix rename wrt hard links
- interactive hash/unhash interface
- test hashed readdir
- make logstream.flush align itself to stripes

- carefully define/document frozen wrt dir_auth vs hashing


client
- fstat
- make_request: cope with mds failure
- mixed lazy and non-lazy io will clobber each others' caps in the buffer cache.. how to isolate..
- test client caps migration w/ mds exports
- some heuristic behavior to consolidate caps to inode auth?


MDS TODO
- fix hashed readdir: should (optionally) do a lock on dir namespace?
- fix hard links
  - they mostly work, but they're fragile
- sync clients on stat
  - will need to ditch 10s client metadata caching before this is useful
  - implement truncate
- implement hashed directories
- statfs?
- rewrite journal + recovery
- figure out online failure recovery
- more distributed fh management?
- btree directories (for efficient large directories)
- consistency points/snapshots

- fix MExportAck and others to use dir+dentry, not inode
  (otherwise this all breaks with hard links.. altho it probably needs reworking already?)


why qsync could be wrong (for very strict POSIX) : varying mds -> client message transit or processing times.
- mds -> 1,2 : qsync
- client1 writes at byte 100
- client1 -> mds : qsync reply (size=100)
- client1 writes at byte 300
- client1 -> client2 (outside channel)
- client2 writes at byte 200
- client2 -> mds : qsync reply (size=200)
-> stat results in size 200, even though at no single point in time was the max size 500.
-> for correct result, need to _stop_ client writers while gathering metadata.


SAGE:

- string table?

- hard links
 - fix MExportAck and others to use dir+dentry, not inode
   (otherwise this all breaks with hard links.. altho it probably needs reworking already!)

- do real permission checks?


ISSUES


- discover
 - soft: authority selectively repicates, or sets a 'forward' flag in reply
 - hard: authority always replicates (eg. discover for export)
 - forward flag (see soft)
 - error flag   (if file not found, etc.)
 - [what was i talking about?] make sure waiters are properly triggered, either upon dir_rep update, or (empty!) discover reply


DOCUMENT
- cache, distributed cache structure and invariants
- export process
- hash/unhash process


TEST
- hashing
 - test hash/unhash operation
 - hash+export: encode list of replicated dir inodes so they can be discovered before import is procesed.
 - test nauthitems (wrt hashing?)


IMPLEMENT

- smarter balancing
  - popularity calculation and management is inconsistent/wrong.
  - does it work?

- dump active config in run output somewhere


==== MDS RECOVERY ====

- how to reliably deliver cache expire messages?
  - how should proxy behave?
  - exporter failure
    - all cacheexpire info has been passed on up until point where export is permanent.  no impact.
  - importer failure
    - exporter collects expire info, so that it can reverse.
    - ???
  - maybe hosts should double-up expires until after export is known to have committed?
--> just send expires to both nodes.  dir_auth+dir_auth2.  clean up export ack/notify process.  :)

*** dar... no, separate bystander dir_auth updates from the prepare/ack/commit cycle!
- expire should go to both old and new auth
- set_dir_auth should take optional second auth, and authority() should optionally set/return a second possible auth
- does inode need it's own replica list?  no!
- dirslices.


/- exporter recovery if importer fails during EXPORT_EXPORTING stage
- importer recovery if exporter fails

/?- delay response to sending import_map if export in progress?
/?- finish export before sending import_map?
/- ambiguous imports on active node should include in-progress imports!
/- how to effectively trim cache after resolve but before rejoin
/  - we need to eliminate unneed non-auth metadata, without hosing potentially useful auth metadata

- osd needs a set_floor_and_read op for safe failover/STOGITH-like semantics.

- failures during recovery stages (resolve, rejoin)... make sure rejoin still works!

- fix mds initial osdmap weirdness (which will currently screw up on standby -> almost anything)


importmap only sent after exports have completed.
failures update export ack waitlists, so exports will compelte if unrelated nodes fail.
importmap can be sent regardless of import status -- pending import is just flagged ambiguous.
failure of exporter induces some cleanup on importer.  importer will disambiguate when it gets an importmap on exporter recovery.
failure of importer induces cleanup on exporter.  no ambiguity.


/- no new mds may join if cluster is in a recovery state.  starting -> standby (unless failed)
/  - make sure creating -> standby, and are not included in recovery set?


mdsmap notes
- mds don't care about intervening states, except rejoin > active, and
  that transition requires active involvement.  thus, no need worry
  about delivering/processing the full sequence of maps.

blech:
- EMetablob should return 'expired' if they have
  higher versions (and are thus described by a newer journal entry)

mds
- mds falure vs clients
  - clean up client op redirection
  - idempotent ops

- journal+recovery
  - unlink
  - open(wr cap), open+create
  - file capabilities i/o
  - link
  - rename

- should auth_pins really go to the root?
  - FIXME: auth_pins on importer versus import beneath an authpinned region?