ceph/src/TODO

big items
- quotas
- accounting
- enforcement
- rados cow/snapshot infrastructure
- mds snapshots
- mds security enforcement
- client, user authentication
- cas
- meta vs data crush rules
- use libuuid
userspace client
- handle session STALE
- rm -rf on fragmented directory
- time out caps, wake up waiters on renewal
- link caps with mds session
- validate dn leases
- fix lease validation to check session ttl
- clean up ll_ interface, now that we have leases!
- clean up client mds session vs mdsmap behavior?
kernel client
- flush caps on sync, fsync, etc.
- do we need to block?
- timeout mds session close on umount
- deal with CAP_RDCACHE properly: invalidate cache pages?
- procfs/debugfs
- adjust granular debug levels too
- should we be using debugfs?
- a dir for each client instance (client###)?
- hooks to get mds, osd, monmap epoch #s
- clean up messenger vs ktcp
- hook into sysfs?
- vfs
- can we use dentry_path(), if it gets merged into mainline?
- io / osd client
- osd ack vs commit handling. hmm!
client
- clean up client mds session vs mdsmap behavior?
osdmon
- monitor needs to monitor some osds...
crush
- more efficient failure when all/too many osds are down
- allow forcefeed for more complicated rule structures. (e.g. make force_stack a list< set<int> >)
- "knob" bucket
pgmon
- monitor pg states, notify on out?
- watch osd utilization; adjust overload in cluster map
mon
- paxos needs to clean up old states.
- some sort of tester for PaxosService...
- osdmon needs to lower-bound old osdmap versions it keeps around?
mds
- dir frags
- fix replay (don't want dir frozen, pins, etc.?)
- fix accounting
- proper handling of cache expire messages during rejoin phase?
-> i think cache expires are fine; the rejoin_ack handler just has to behave if rejoining items go missing
- try_remove_unlinked_dn thing
- rerun destro trace against latest, with various journal lengths
- lease length heuristics
- mds lock last_change stamp?
- handle slow client reconnect (i.e. after mds has gone active)
- fix reconnect/rejoin open file weirdness
- get rid of C*Discover objects for replicate_to .. encode to bufferlists directly?
- can we get rid of the dirlock remote auth_pin weirdness on subtree roots?
- anchor_destroy needs to xlock linklock.. which means it needs a Mutation wrapper?
- ... when it gets a caller.. someday..
- make truncate faster with a trunc_seq, attached to objects as attributes?
- osd needs a set_floor_and_read op for safe failover/STOGITH-like semantics.
- could mark dir complete in EMetaBlob by counting how many dentries are dirtied in the current log epoch in CDir...
- FIXME how to journal/store root and stray inode content?
- in particular, i care about dirfragtree.. get it on rejoin?
- and dir sizes, if i add that... also on rejoin?
- efficient stat for single writers
- add FILE_CAP_EXTEND capability bit
journaler
- fix up for large events (e.g. imports)
- use set_floor_and_read for safe takeover from possibly-not-quite-dead otherguy.
- should we pad with zeros to avoid splitting individual entries?
- make it a g_conf flag?
- have to fix reader to skip over zeros (either <4 bytes for size, or zeroed sizes)
- need to truncate at detected (valid) write_pos to clear out any other partial trailing writes
fsck
- fsck.ebofs
- online mds fsck?
- object backpointer attrs to hint catastrophic reconstruction?
objecter
- maybe_request_map should set a timer event to periodically re-request.
- transaction prepare/commit?
- read+floor_lockout
osd/rados
- how does an admin intervene when a pg needs a dead osd to repeer?
- a more general fencing mechanism? per-object granularity isn't usually a good match.
- consider implications of nvram writeahead logs
- flag missing log entries on crash recovery --> WRNOOP? or WRLOST?
- efficiently replicate clone() objects
- fix heartbeat wrt new replication
- mark residual pgs obsolete ???
- rdlocks
- optimize remove wrt recovery pushes
- report crashed pgs?
messenger
- fix messenger shutdown.. we shouldn't delete messenger, since the caller may be referencing it, etc.
simplemessenger
- close idle connections
objectcacher
- merge clean bh's
- ocacher caps transitions vs locks
- test read locks
reliability
- heartbeat vs ping?
- osdmonitor, filter
ebofs
- btrees
- checksums
- dups
- sets
- optionally scrub deallocated extents
- clone()
- map ObjectStore
- verify proper behavior of conflicting/overlapping reads of clones
- combine inodes and/or cnodes into same blocks
- fix bug in node rotation on insert (and reenable)
- fix NEAR_LAST_FWD (?)
- awareness of underlying software/hardware raid in allocator so that we write full stripes _only_.
- hmm, that's basically just a large block size.
- rewrite the btree code!
- multithreaded
- eliminate nodepools
- allow btree sets
- allow arbitrary embedded data?
- allow arbitrary btrees
- allow root node(s?) to be embedded in onode, or wherever.
- keys and values can be uniform (fixed-size) or non-uniform.
- fixed size (if any) is a value in the btree struct.
- negative indicates bytes of length value? (1 -> 255bytes, 2 -> 65535 bytes, etc.?)
- non-uniform records preceded by length.
- keys sorted via a comparator defined in btree root.
- lexicographically, by default.
- goal
- object btree key->value payload, not just a data blob payload.
- better threading behavior.
- with transactional goodness!
- onode
- object attributes.. as a btree?
- blob stream
- map stream.
- allow blob values.
remaining hard problems
- how to cope with file size changes and read/write sharing
snapshot notes --
mds
- break mds hierarchy into snaprealms
- keep per-realm inode xlists, so that breaking a realm is O(size(realm))
struct snap {
  u64 rev;
  string name;
  utime_t ctime;
};
struct snaprealm {
  list<rev> revs;
  snaprealm *parent;
  list<snaprealm*> children;
  xlist<CInode*> inodes_with_caps; // used for efficient realm splits
};
- link client caps to realm, so that snapshot creation is O(num_child_realms*num_clients)
- keep per-realm, per-client record with cap refcount, to avoid traversing realm inode lists looking for caps
struct CapabilityGroup {
  int client;
  xlist<Capability*> caps;
  snaprealm *realm;
};
in snaprealm,
  map<int, CapabilityGroup*> client_cap_groups; // used to identify clients who need snap notifications
- for each realm,
- list<rev> revs;
- rev can be an ino? or whatever. can we get away with it _not_ being ordered?
- for osds.. yes.
- for mds.. may make the cdentry range info tricky!
metadata
- fix up inode_map to key off vinodeno.. or better yet have a second map for non-zero revs.
struct vinodeno_t {
  inodeno_t ino;
  __u64 rev;
};
- dentry: replace dname -> ino, rino+rtype with
(dname, crev, drev) -> vino, vino+rtype (where valid range is [crev, drev))
- live dentries have drev = 0. kept in separate map:
- map<string, CDentry*> items;
- map<pair<string,drev>, CDentry> vitems;
- track vitem count in fragstat.
- when vitem count gets large, add pointer in fnode indicating vitem range stored in separate dir object.
client
- also keep caps linked into snaprealm list
- current rev for each snaprealm
- attach rev to each dirty page
- can we cow page if it's dirty but a different realm?
... hmm probably not, but we can flush it first, just like we do a read to make it clean
osd
- pass realm lineage with osd op/capability
- tag each non-live object with the set of realms it is defined over
- osdmap has sparse map of extant revs. incrementals are simple rmrev, and max_rev increase
- is it possible to efficiently clean up whiteout objects when old snaprealms go away?
rados snapshots
- integrate revisions into ObjectCacher?
- clean up oid.rev vs op.rev in osd+osdc
- attr.crev is rev we were created in.
- oid.rev=0 is "live". defined for attr.crev <= rev.
- otherwise, defined for attr.crev <= rev < oid.rev (i.e. oid.rev is deletion time. upper bound, non-inclusive.)
- write|delete is tagged with op.rev
- if attr.crev != op.rev
- we clone to oid.rev=rev (clone keeps old crev)
- tag clone with list of revs it is defined over
- change live attr.crev=rev.
- apply update
- read is tagged with op.rev
- if 0, we read from 0 (if it exists).
- otherwise we choose object rev based on op.rev vs oid.rev, and then verifying attr.crev <= op.rev.
- walk backwards through snap lineage? i.e. if lineage = 1, 5, 30, 77, 100(now), and op.rev = 30, try 100, 77.
- or, tag live (0) object with attr listing which revs exist (and keep it around at size 0 if it doesn't logically exist)
- no, the dir lookup on old revs will be in a cached btrfs btree dir node (no inode needed until we have a hit)
btrfs rev de-duping
- i.e. when sub_op_push gets an object
- query checksums
- userland will read+verify ranges are actually a match?
- punch hole (?)
- clone file range (not entire file)
interface
$ ls -al .snapshot # list snaps. show both symbolic names, and timestamp names? (symbolic -> timestamp symlinks, maybe)
$ mkdir .snapshot/blah # create snap
$ rmdir .snapshot/blah # remove it