v0.15
/- clean up msgr protocol checks
/- kclient: checkpatch fixes, cleanups. allow msg revoke (nice interface cleanup)
/- monclient fixes; ceph detects monitor session drop
/- msgr: protocol check cleanups; ack seq # fix;
/- debian: radosgw package, fix header perms
/- kclient: GET_DATALOC ioctl
/- kclient: osdc bug fix
/- kclient: clean up debugfs layout

v0.16
- kclient: fix msgr bug (out_qlen thing)
- kclient cleanup: uninline strings, use pr_fmt, prefix frag_ macros
- kclient: xattr cleanups
- kclient: fix invalidate recursion bug
- libceph: identify self
- hadoop: set primary replica on self
- kclient: akpm review fixups
  - uninline frags
  - uninline string hash
  - document data structures
  - audit all inline in kclient
  - ceph_buffer and vmalloc?
  - ceph_i_test smp_mb instead of spinlock
  - bit ops in messenger
  - name args in ceph_osd_op union
- disk format, wire protocol changes
- use sockaddr_storage; some ipv6 groundwork

v0.16.1
- mds: put migration vectors in mdsmap
- rgw: fix
- include buffer.c in kernel package, tarball

v0.17
- kclient: fix multiple mds mdsmap decoding
- kclient: fix mon subscription renewal
- crush: fix crush map creation with empty buckets (occurs on larger clusters)
- osdmap: fix encoding bug (crashes kclient); make kclient not crash
- msgr: simplified policy, failure model
- mon: less push, more pull
- mon: request routing
- mon cluster expansion
- osd: fix pg parsing, restarts on larger clusters

v0.18
- basic ENOSPC handling
- big endian fixes (required protocol/disk format change)
- improved object -> pg hash function; selectable
- selectable crush hash function(s)
- mds restart bug fixes
- kclient mds reconnect bug fixes
- fixed mds log trimming bug
- fixed mds cap vs snap deadlock
- filestore faster flushing
- mount btrfs by UUID?
- qa
- osd: rebuild pg log
- osd: handle storage errors
- rebuild mds hierarchy
- kclient: msgs built with a page list
- kclient: retry alloc on ENOMEM when reading from connection?

pending wire, disk format changes

bugs
- SIGBUS
- mds rstat bug (on 2* cp -av usr + 11_kernel_untar_build)
mds/CInode.cc: In function 'virtual void CInode::finish_scatter_gather_update(int)':
mds/CInode.cc:1233: FAILED assert(pi->rstat.rfiles >= 0)
- mislinked directory? (cpusr.sh, mv /c/* /c/t, more cpusr, ls /c/t)
- premature filejournal trimming?
- weird osd_lock contention during osd restart?
- kclient:
[85858.693538] BUG: sleeping function called from invalid context at kernel/mutex.c:280
[85858.701570] in_atomic(): 1, irqs_disabled(): 0, pid: 2762, name: cp
[85858.708027] 1 lock held by cp/2762:
[85858.711652] #0: (&dentry->d_lock){+.+...}, at: [] ceph_d_revalidate+0xae/0x41c [ceph]
[85858.721612] Pid: 2762, comm: cp Not tainted 2.6.32-rc2 #1
[85858.727176] Call Trace:
[85858.729738] [] ? __debug_show_held_locks+0x22/0x24
[85858.736309] [] __might_sleep+0x115/0x11a
[85858.742000] [] mutex_lock_nested+0x29/0x32a
[85858.747957] [] ? get_lock_stats+0x19/0x4c
[85858.753761] [] reset_connection+0x28/0xe4 [ceph]
[85858.760148] [] ceph_con_shutdown+0x2f/0x70 [ceph]
[85858.766630] [] ceph_put_mds_session+0x48/0x9a [ceph]
[85858.773378] [] __ceph_mdsc_drop_dentry_lease+0x18/0x23 [ceph]
[85858.780924] [] ceph_d_revalidate+0x17b/0x41c [ceph]
[85858.787569] [] ? __d_lookup+0x0/0x195
[85858.793001] [] do_lookup+0x166/0x1bb
[85858.798362] [] __link_path_walk+0x38b/0xe8c
[85858.804319] [] path_walk+0x69/0xd4
[85858.809476] [] do_filp_open+0x178/0x9dc
[85858.815088] [] ? put_lock_stats+0xe/0x27
[85858.820771] [] ? _spin_unlock+0x30/0x4b
[85858.826373] [] ? alloc_fd+0x11d/0x12e
[85858.831811] [] do_sys_open+0x5d/0x10b
[85858.837241] [] sys_open+0x1b/0x1d
[85858.842334] [] tracesys+0xd0/0xd5
- kclient: after reconnect, cp: writing `/c/ceph2.2/bin/gs-gpl': Bad file descriptor
  - need to somehow wake up unreconnected caps? hrm!!
- kclient: ~300 (306, 311) second delay before able to reconnect to restarted monitor???
- kclient: socket creation
- kclient: bdi thing after mount failures, multiple attempts
[ 1438.509155] ------------[ cut here ]------------
[ 1438.513933] WARNING: at fs/sysfs/dir.c:487 sysfs_add_one+0xf3/0x10a()
[ 1438.520560] Hardware name: PDSMi
[ 1438.523898] sysfs: cannot create duplicate filename '/class/bdi/0:25'
[ 1438.530526] Modules linked in: ceph fan ac battery container ehci_hcd uhci_hcd thermal button processor
[ 1438.546600] Pid: 2829, comm: mount.ceph Tainted: G W 2.6.32-rc2 #1
[ 1438.553722] Call Trace:
[ 1438.556279] [] ? sysfs_add_one+0xf3/0x10a
[ 1438.562179] [] warn_slowpath_common+0x77/0xa4
[ 1438.568399] [] warn_slowpath_fmt+0x64/0x66
[ 1438.574364] [] ? trace_hardirqs_on_caller+0x113/0x13e
[ 1438.581312] [] ? sysfs_pathname+0x37/0x3f
[ 1438.587132] [] ? sysfs_pathname+0x37/0x3f
[ 1438.593017] [] ? sysfs_pathname+0x37/0x3f
[ 1438.598894] [] sysfs_add_one+0xf3/0x10a
[ 1438.604593] [] create_dir+0x58/0x93
[ 1438.609929] [] sysfs_create_dir+0x38/0x4f
[ 1438.615825] [] ? _spin_unlock+0x30/0x4b
[ 1438.621520] [] kobject_add_internal+0x125/0x201
[ 1438.627939] [] kobject_add_varg+0x41/0x4d
[ 1438.633820] [] kobject_add+0x89/0x8b
[ 1438.639263] [] ? mark_held_locks+0x4d/0x6b
[ 1438.645245] [] ? lockdep_init_map+0xae/0x540
[ 1438.651351] [] ? kobject_get+0x1a/0x22
[ 1438.656906] [] ? get_device+0x14/0x1a
[ 1438.662371] [] device_add+0x119/0x627
[ 1438.667877] [] ? __spin_lock_init+0x31/0x54
[ 1438.673933] [] device_register+0x19/0x1d
[ 1438.679703] [] device_create_vargs+0x10e/0x13b
[ 1438.686028] [] bdi_register+0x80/0x192
[ 1438.691635] [] ? lockdep_init_map+0xae/0x540
[ 1438.697762] [] ? mempool_kmalloc+0x11/0x13
[ 1438.703714] [] ? mempool_create_node+0x122/0x16e
[ 1438.710218] [] ? ceph_set_super+0x0/0xd8 [ceph]
[ 1438.716620] [] ? mempool_kfree+0x0/0xb
[ 1438.722221] [] ? mempool_kmalloc+0x0/0x13
[ 1438.728072] [] bdi_register_dev+0x23/0x25
[ 1438.733944] [] ceph_get_sb+0xa20/0x104f [ceph]
[ 1438.740267] [] ? __kmalloc+0x15c/0x1ef
[ 1438.745869] [] ? __alloc_percpu+0xb/0xd
[ 1438.751545] [] vfs_kern_mount+0x9d/0x158
[ 1438.757359] [] do_kern_mount+0x47/0xe7
[ 1438.762967] [] do_mount+0x743/0x7a9
[ 1438.768284] [] ? strndup_user+0x5d/0x85
[ 1438.773962] [] sys_mount+0x7f/0xc1
[ 1438.779204] [] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 1438.785846] [] system_call_fastpath+0x16/0x1b

greg
- osd: error handling
- uclient: readdir from cache
- mds: basic auth checks

later
- document on-wire protocol
- authentication
- client reconnect after long eviction; and slow delayed reconnect
- repair
- mds security enforcement
- client, user authentication
- cas
- osd failure declarations
- rename over old files should flush data, or revert back to old contents

rados
- make rest interface superset of s3?
- create/delete snapshots
- list, access snapped version
- perl swig wrapper
- 'rados call foo.bar'?
- merge pgs
- destroy pg_pools
- autosize pg_pools?
- security

repair
- namespace reconstruction tool
- repair pg (rebuild log) (online or offline? ./cosd --repair_pg 1.ef?)
- repair file ioctl?
- are we concerned about
  - scrubbing
  - reconstruction after loss of subset of cdirs
  - reconstruction after loss of md log
- data object
  - path backpointers?
  - parent dir pointer?
- mds scrubbing

kclient
- ENOMEM
  - message pools
  - sockets? (this can actually generate a lockdep warning :/)
  - use page lists for large messages? e.g. reconnect
- fs-portable file layout virtual xattr (see Andreas' -fsdevel thread)
- statlite
- audit/combine/rework/whatever invalidate, writeback threads and associated invariants
- add cap to release if we get fouled up in fill_inode et al?
- make caps reservations per-client
- fix up ESTALE handling
- don't retry on ENOMEM on non-nofail requests in kick_requests
- make cap import/export more efficient?
- flock, fcntl locks
- ACLs
- init security xattrs
- should we try to ref CAP_PIN on special inodes that are open?
- fix readdir vs fragment race by keeping a separate frag pos, and ignoring dentries below it
- inotify for updates from other clients?

vfs issues
- real_lookup() race:
  1- hash lookup finds no dentry
  2- real_lookup() takes dir i_mutex, but then finds a dentry
  3- drops mutex, then calls d_revalidate. if that fails, we return ENOENT (instead of looping?)
- vfs_rename_dir()
- a getattr mask would be really nice

filestore
- make min sync interval self-tuning (ala xfs, ext3?)
- get file csum?

btrfs
- clone compressed inline extents
- ioctl to pull out data csum?

osd
- gracefully handle ENOSPC
- gracefully handle EIO?
- client session object
  - track client's osdmap; and only share latest osdmap with them once!
- what to do with lost objects.. continue peering?
- segregate backlog from log ondisk?
- preserve pg logs on disk for longer period
- make scrub interruptible
- optionally separate osd interfaces (ips) for clients and osds (replication, peering, etc.)
- pg repair
- pg split should be a work queue
- optimize remove wrt recovery pushes?

uclient
- clean up check_caps to more closely mirror kclient logic
- readdir from cache
- fix readdir vs fragment race by keeping a separate frag pos, and ignoring dentries below it
- hadoop: clean up assert usage

mds
- pass issued, wanted into eval(lock) when eval() already has it? (and otherwise optimize eval paths..)
- add an up:shadow mode?
  - tail the mds log as it is written
  - periodically check head so that we trim, too
- handle slow client reconnect (i.e. after mds has gone active)
- anchor_destroy needs to xlock linklock.. which means it needs a Mutation wrapper?
  - ... when it gets a caller.. someday..
- add FILE_CAP_EXTEND capability bit
- dir fragment
  - maybe just take dftlock for now, to keep it simple.
- dir merge
- snap
  - hard link backpointers
  - anchor source dir
  - build snaprealm for any hardlinked file
  - include snaps for all (primary+remote) parents
  - how do we properly clean up inodes when doing a snap purge?
    - when they are mid-recover? see 136470cf7ca876febf68a2b0610fa3bb77ad3532
    - what if a recovery is queued, or in progress, and the inode is then cowed? can that happen?
- proper handling of cache expire messages during rejoin phase?
  -> i think cache expires are fine; the rejoin_ack handler just has to behave if rejoining items go missing
- clustered
  - on replay, put dirty scatter replicas on lists so that they get flushed? or does rejoin handle that?
  - linkage vs cdentry replicas and remote rename....
  - rename: importing inode... also journal imported client map?

mon
- how to shrink cluster?
- how to tell osd to cleanly shut down
- mds injectargs N should take mds# or id. * should bcast to standby mds's.
- paxos needs to clean up old states.
  - default: simple max of (state count, min age), so that we have at least N hours of history, say?
- osd map: trim only old maps < oldest "in" osd up_from

osdmon
- monitor needs to monitor some osds...

pgmon
/- include osd vector with pg state
- check for orphan pgs
- monitor pg states, notify on out?
- watch osd utilization; adjust overload in cluster map

crush
- allow forcefeed for more complicated rule structures. (e.g. make force_stack a list< set >)

simplemessenger
- close idle connections?

objectcacher
- read locks?
- maintain more explicit inode grouping instead of wonky hashes

cas
- chunking. see TTTD in ESHGHI, K. A framework for analyzing and improving content-based chunking algorithms. Tech. Rep. HPL-2005-30(R.1), Hewlett Packard Laboratories, Palo Alto, 2005.

radosgw
- handle gracefully location related requests
- logging control (?)
- parse date/time better
- upload using post
- torrent
- handle gracefully PUT/GET requestPayment
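
sketch: paxos state trimming (see the mon item on cleaning up old paxos states)
One possible reading of "simple max of (state count, min age)" is: retain whichever set is larger, the newest N states or every state committed within the minimum-age window. The snippet below is only an illustration of that reading, not monitor code. PaxosState, first_version_to_keep, min_count and min_age_seconds are hypothetical names invented here, and commit times are assumed to be plain seconds.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PaxosState:
    """Hypothetical record for one committed state (not a real Ceph type)."""
    version: int         # monotonically increasing state number
    committed_at: float  # commit time in seconds (assumed clock)


def first_version_to_keep(states: List[PaxosState], now: float,
                          min_count: int,
                          min_age_seconds: float) -> Optional[int]:
    """Return the lowest version to retain; older versions may be trimmed.

    Keeps the union of (a) the newest min_count states and (b) every state
    committed within min_age_seconds of now, i.e. the larger of the two sets.
    """
    if not states:
        return None
    ordered = sorted(states, key=lambda s: s.version)

    # Count floor: always keep at least one state, normally the newest min_count.
    count_idx = max(0, len(ordered) - max(1, min_count))
    keep_from_count = ordered[count_idx].version

    # Age floor: lowest version still inside the history window.
    recent = [s.version for s in ordered
              if now - s.committed_at <= min_age_seconds]
    keep_from_age = recent[0] if recent else ordered[-1].version

    # Retaining the larger set means using the lower cutoff.
    return min(keep_from_count, keep_from_age)


if __name__ == "__main__":
    hour = 3600.0
    history = [PaxosState(v, v * hour) for v in range(1, 11)]
    print(first_version_to_keep(history, now=10 * hour,
                                min_count=3, min_age_seconds=5 * hour))
```

With a policy like this, a quiet cluster keeps at least min_count states no matter how old they are, while a busy one keeps the full N hours of history even if that is many more states.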