v0.17
- kclient: fix multiple mds mdsmap decoding
- kclient: fix mon subscription renewal
- crush: fix crush map creation with empty buckets (occurs on larger clusters)
- osdmap: fix encoding bug (crashes kclient); make kclient not crash
- msgr: simplified policy, failure model
- mon: less push, more pull
- mon: request routing
- mon cluster expansion
- osd: fix pg parsing, restarts on larger clusters

v0.18
- osd: basic ENOSPC handling
- big endian fixes (required protocol/disk format change)
- osd: improved object -> pg hash function; selectable
- crush: selectable hash function(s)
- mds restart bug fixes
- kclient: mds reconnect bug fixes
- fixed mds log trimming bug
- fixed mds cap vs snap deadlock
- filestore: faster flushing
- uclient,kclient: snapshot fixes
- mds: fix recursive accounting bug
- uclient: fixes for 32bit clients
- auth: 'none' security framework
- mon: "safely" bail on write errors (e.g. ENOSPC)
- mds: fix replay/reconnect race (caused (fast) client reconnect to fail)
- mds: misc journal replay, session fixes

v0.19
- ms_dispatch fairness
- kclient: bad fsid deadlock fix
- tids in fixed msg header (protocol change)
- feature bits during connection handshake
- remove erank from ceph_entity_addr
- compat/incompat bits
- kclient: handle enomem on reply using tid in msg header
- audit truncation sequence
  - should mds recovery recover truncation metadata?
- qa: snap test.  maybe walk through 2.6.* kernel trees?
- osd: rebuild pg log
- osd: handle storage errors
- rebuild mds hierarchy
- kclient: retry alloc on ENOMEM when reading from connection?

pending wire format changes
/- include a __u64 tid in ceph_msg_header
/- compat bits during connection handshake
  - compat bits during auth/mount with monitor?
/- remove erank from ceph_entity_addr

pending mds format changes
- compat/incompat flags

pending osd format changes
- current/ subdir
- compat/incompat flags

pending mon format changes
- add v to PGMap, PGMap::Incremental
- others?
- compat/incompat flags

bugs
- 'mount.ceph server:/ relpath' puts relpath (not the canonicalized name) in /etc/mtab, so df gets confused:
    uml:~# grep ceph /etc/mtab
    10.0.1.252:/ mnt ceph rw 0 0
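  (illustrative only: the fix direction is to canonicalize the mount point
  before the mtab entry is written.  hypothetical userspace sketch, not the
  actual mount.ceph code; realpath(3) does the normalization.)

    // hypothetical sketch, not mount.ceph itself: resolve a relative
    // mount point to an absolute path before recording it in /etc/mtab,
    // so df can match the entry.
    #include <stdio.h>
    #include <stdlib.h>

    static char *canonicalize_mount_point(const char *path)
    {
        // realpath(3) with a NULL buffer returns a malloc'd absolute path
        char *resolved = realpath(path, NULL);
        if (!resolved)
            perror("realpath");
        return resolved;    // caller frees; NULL on error
    }

    int main(int argc, char **argv)
    {
        if (argc < 2)
            return 1;
        char *mnt = canonicalize_mount_point(argv[1]);
        if (!mnt)
            return 1;
        // the mtab entry then uses mnt instead of the raw argument:
        printf("10.0.1.252:/ %s ceph rw 0 0\n", mnt);
        free(mnt);
        return 0;
    }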
- kclient: on umount -f
    [ 4683.361323] INFO: task umount:15840 blocked for more than 120 seconds.
    [ 4683.367910] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 4683.375792] umount        D 0000000000000000     0 15840   2614 0x00000000
    [ 4683.382761]  ffff880104ee9c58 0000000000000046 0000000000000000 ffffffff8145f50a
    [ 4683.390297]  ffff88010901aa00 ffff880104ee9fd8 ffff880104ee9fd8 0000000000004000
    [ 4683.397862]  000000000000df18 ffff880104ee9fd8 ffff880104ee9fd8 00000000001d2d80
    [ 4683.405414] Call Trace:
    [ 4683.407911]  [] ? _spin_unlock_irq+0x36/0x51
    [ 4683.413794]  [] ? check_cap_flush+0xa4/0x235 [ceph]
    [ 4683.420285]  [] mutex_lock_nested+0x1b9/0x317
    [ 4683.426255]  [] ? check_cap_flush+0xa4/0x235 [ceph]
    [ 4683.432769]  [] check_cap_flush+0xa4/0x235 [ceph]
    [ 4683.439090]  [] ceph_mdsc_sync+0x186/0x303 [ceph]
    [ 4683.445396]  [] ? mutex_unlock+0x9/0xb
    [ 4683.450766]  [] ? ceph_osdc_sync+0xe4/0x170 [ceph]
    [ 4683.457188]  [] ceph_syncfs+0x4e/0xc8 [ceph]
    [ 4683.469161]  [] __sync_filesystem+0x5e/0x72
    [ 4683.474968]  [] sync_filesystem+0x35/0x4c
    [ 4683.480596]  [] generic_shutdown_super+0x28/0xcd
    [ 4683.486831]  [] kill_anon_super+0x11/0x4f
    [ 4683.492474]  [] ceph_kill_sb+0x4d/0xa3 [ceph]
    [ 4683.498453]  [] ? deactivate_super+0x60/0x7d
    [ 4683.504344]  [] deactivate_super+0x68/0x7d
    [ 4683.510067]  [] mntput_no_expire+0x78/0xb3
    [ 4683.515784]  [] sys_umount+0x2b2/0x2df
    [ 4683.521159]  [] ? do_page_fault+0x104/0x278
    [ 4683.526947]  [] system_call_fastpath+0x16/0x1b
- kclient: multiple incoming replies, or an aborted (osd) request, can deplete the reply msgpool
  - reproduce: read a large file, hit control-c.  dropping the request empties out the reply pool.
  - this is actually harmless, except that one aborted request and one active request means the aborted reply gets the message but the active request doesn't.
  - actually, we can replace prepare_pages with a smarter alloc_msg, now that tid is in the header (see the sketch below).
  - then we need a revoke_msg_incoming in place of revoke_pages?
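  (illustrative sketch of the alloc_msg idea above -- userspace-style c++
  with hypothetical names, not the kclient api: with the tid in
  ceph_msg_header, the receive path can look up the waiting request before
  allocating a reply, so an aborted request yields no buffer instead of
  draining the shared msgpool.)

    #include <stdint.h>
    #include <map>

    struct msg_header_min {       // only the fields this sketch needs
      uint64_t tid;               // transaction id (now in the wire header)
      uint32_t front_len;
    };

    struct Request {              // in-flight request w/ preallocated reply buffer
      void *reply_buf;
    };

    std::map<uint64_t, Request*> pending;   // tid -> in-flight request

    // alloc_msg-style hook, called when a reply header arrives.
    // returning NULL tells the reader to drop the message body, which
    // replaces the old prepare_pages/revoke_pages dance.
    void *alloc_reply(const msg_header_min &hdr)
    {
      std::map<uint64_t, Request*>::iterator p = pending.find(hdr.tid);
      if (p == pending.end())
        return 0;                 // request was aborted: drop, don't touch the pool
      return p->second->reply_buf;
    }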
- mon: dup osd boot messages to log
    09.12.21 14:09:33.634098 log 09.12.21 14:09:32.612955 mon0 10.3.14.128:6789/0/0 198 : [INF] osd6 10.3.14.133:6800/14770/0 boot
    09.12.21 14:09:33.634125 log 09.12.21 14:09:32.614155 mon0 10.3.14.128:6789/0/0 199 : [INF] osd6 10.3.14.133:6800/14770/0 boot
    09.12.21 14:09:33.634137 log 09.12.21 14:09:32.614726 mon0 10.3.14.128:6789/0/0 200 : [INF] osd6 10.3.14.133:6800/14770/0 boot
    09.12.21 14:09:33.634148 log 09.12.21 14:09:32.615444 mon0 10.3.14.128:6789/0/0 201 : [INF] osd6 10.3.14.133:6800/14770/0 boot
- fix mon delay when starting new mds, when current mds is already laggy
- bonnie++ -u root -d /mnt/ceph/ -s 0 -n 1
    (03:35:29 PM) Isteriat: Using uid:0, gid:0.
    (03:35:29 PM) Isteriat: Create files in sequential order...done.
    (03:35:29 PM) Isteriat: Stat files in sequential order...Expected 1024 files but only got 0
    (03:35:29 PM) Isteriat: Cleaning up test directory after error.
- osd pg split breaks if not all osds are up...
- mds recovery flag set on inode that didn't get recovered??
- mds memory leak (after some combo of client failures, mds restarts+reconnects?)
- mislinked directory?  (cpusr.sh, mv /c/* /c/t, more cpusr, ls /c/t)
- kclient: after reconnect, cp: writing `/c/ceph2.2/bin/gs-gpl': Bad file descriptor
  - need to somehow wake up unreconnected caps?  hrm!!
- kclient: socket creation
- mds file purge should truncate in place, or remove from namespace before purge.  otherwise a new ref can appear before the inode is destroyed:
    mds/MDCache.cc: In function 'void MDCache::remove_inode(CInode*)':
    mds/MDCache.cc:217: FAILED assert(o->get_num_ref() == 0)
     1: /tmp/cmds.20091211.084324(_Z18__ceph_assert_failPKcS0_iS0_+0x34) [0x9656ea]
     2: /tmp/cmds.20091211.084324(_ZN7MDCache12remove_inodeEP6CInode+0x1ad) [0x7af283]
     3: /tmp/cmds.20091211.084324(_ZN7MDCache19_purge_stray_loggedEP7CDentrymP10LogSegment+0x115) [0x7af3b5]
     4: /tmp/cmds.20091211.084324(_ZN22C_MDC_PurgeStrayLogged6finishEi+0x34) [0x83286e]
     5: /tmp/cmds.20091211.084324(_Z15finish_contextsRSt4listIP7ContextSaIS1_EEi+0x130) [0x736d96]
     6: /tmp/cmds.20091211.084324(_ZN9Journaler13_finish_flushEil7utime_tb+0x873) [0x915f4d]
     7: /tmp/cmds.20091211.084324(_ZN9Journaler7C_Flush6finishEi+0x43) [0x91d5eb]
     8: /tmp/cmds.20091211.084324(_ZN8Objecter19handle_osd_op_replyEP11MOSDOpReply+0xcf5) [0x8e7415]
     9: /tmp/cmds.20091211.084324(_ZN3MDS9_dispatchEP7Message+0x1f04) [0x715dda]
    10: /tmp/cmds.20091211.084324(_ZN3MDS11ms_dispatchEP7Message+0x2f) [0x716dc1]
    11: /tmp/cmds.20091211.084324(_ZN9Messenger19ms_deliver_dispatchEP7Message+0x54) [0x70a658]
    12: /tmp/cmds.20091211.084324(_ZN15SimpleMessenger8Endpoint14dispatch_entryEv+0x4df) [0x6f78af]
    13: /tmp/cmds.20091211.084324(_ZN15SimpleMessenger8Endpoint14DispatchThread5entryEv+0x19) [0x70ce77]
    14: /tmp/cmds.20091211.084324(_ZN6Thread11_entry_funcEPv+0x20) [0x704b0c]
    15: /lib/libpthread.so.0 [0x7f3ea5bf2fc7]
    16: /lib/libc.so.6(clone+0x6d) [0x7f3ea4e355ad]
- snaprealm thing
    ceph3:~# find /c
    /c
    /c/.ceph
    /c/.ceph/mds0
    /c/.ceph/mds0/journal
    /c/.ceph/mds0/stray
    [68663.397407] ceph: ceph_add_cap: couldn't find snap realm 10000491bb5
    ...
    ceph3:/c# [68724.067160] BUG: unable to handle kernel NULL pointer dereference at 0000000000000088
    [68724.071069] IP: [] __send_cap+0x237/0x585 [ceph]
    [68724.078917] PGD f7a12067 PUD f688c067 PMD 0
    [68724.082907] Oops: 0000 [#1] PREEMPT SMP
    [68724.082907] last sysfs file: /sys/class/net/lo/operstate
    [68724.082907] CPU 1
    [68724.082907] Modules linked in: ceph fan ac battery psmouse ehci_hcd ohci_hcd ide_pci_generic thermal processor button
    [68724.082907] Pid: 10, comm: events/1 Not tainted 2.6.32-rc2 #1 H8SSL
    [68724.082907] RIP: 0010:[]  [] __send_cap+0x237/0x585 [ceph]
    [68724.114907] RSP: 0018:ffff8800f96e3a50  EFLAGS: 00010202
    [68724.114907] RAX: 0000000000000000 RBX: 0000000000000354 RCX: 0000000000000000
    [68724.114907] RDX: 0000000000000000 RSI: ffff8800f76e8ba8 RDI: ffff8800f581a508
    [68724.114907] RBP: ffff8800f96e3bb0 R08: 0000000000000000 R09: 0000000000000001
    [68724.114907] R10: ffff8800cea922b8 R11: ffffffffa0082982 R12: 0000000000000001
    [68724.114907] R13: 0000000000000000 R14: ffff8800cea95378 R15: 0000000000000000
    [68724.114907] FS:  00007f54be9a06e0(0000) GS:ffff880009200000(0000) knlGS:0000000000000000
    [68724.114907] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
    [68724.114907] CR2: 0000000000000088 CR3: 00000000f7118000 CR4: 00000000000006e0
    [68724.178904] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [68724.178904] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    [68724.178904] Process events/1 (pid: 10, threadinfo ffff8800f96e2000, task ffff8800f96e02c0)
    [68724.178904] Stack:
    [68724.178904]  ffff8800f96e0980 ffff8800f96e02c0 ffff8800f96e3a80 ffffffff8106a3b9
    [68724.178904] <0> ffff8800f96e3a80 0000000000000003 00006589ac4ca260 0000000000000004
    [68724.178904] <0> 0cb13589944c0262 0000000000000000 ffff8800f96e3b30 ffffffff81ca7c80
    [68724.178904] Call Trace:
    [68724.178904]  [] ? get_lock_stats+0x19/0x4c
    [68724.178904]  [] ? mark_held_locks+0x4d/0x6b
    [68724.178904]  [] ceph_check_caps+0x740/0xa70 [ceph]
    [68724.178904]  [] ? get_lock_stats+0x19/0x4c
    [68724.178904]  [] ? put_lock_stats+0xe/0x27
    [68724.178904]  [] ceph_check_delayed_caps+0xcb/0x14a [ceph]
    [68724.178904]  [] delayed_work+0x3f/0x368 [ceph]
    [68724.178904]  [] ? worker_thread+0x229/0x398
    [68724.178904]  [] worker_thread+0x283/0x398
    [68724.178904]  [] ? worker_thread+0x229/0x398
    [68724.178904]  [] ? delayed_work+0x0/0x368 [ceph]
    [68724.178904]  [] ? preempt_schedule+0x3e/0x4b
    [68724.306901]  [] ? autoremove_
filestore performance notes
- write ordering options (see the sketch after these notes)
  - fs only (no journal)
  - fs, journal
  - fs + journal in parallel
  - journal sync, then fs
- and the issues
  - latency
  - effect of a btrfs hang
  - unexpected error handling (EIO, ENOSPC)
  - impact on ack, sync ordering semantics.
  - how to throttle request stream to disk io rate
  - rmw vs delayed mode
- if journal is on fs, then
  - throttling isn't an issue, but
  - fs stalls are also journal stalls

- fs only
  - latency: commits are bad.
  - hang: bad.
  - errors: could be handled, aren't
  - acks: supported
  - throttle: fs does it
  - rmw: pg toggles mode
- fs, journal
  - latency: good, unless fs hangs
  - hang: bad.  latency spikes.  overall throughput drops.
  - errors: could probably be handled, aren't.
  - acks: supported
  - throttle: btrfs does it (by hanging), which leads to a (necessary) latency spike
  - rmw: pg toggles mode
- fs | journal
  - latency: good
  - hang: no latency spike.  fs throughput may drop, to the extent btrfs throughput necessarily will.
  - errors: not detected until later.  could journal an addendum record.  or die (like we do now)
  - acks: could be flexible.. maybe supported, maybe not.  will need some extra locking smarts?
  - throttle: ??
  - rmw: rmw must block on prior fs writes.
- journal, fs (writeahead)
  - latency: good (commit only, no acks)
  - hang: same as |
  - errors: same as |
  - acks: never.
  - throttle: ??
  - rmw: rmw must block on prior fs writes.

* JournalingObjectStore interface needs work?
  - separate reads/writes into separate op queues?
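  (the four orderings restated as code -- our own naming, not the
  JournalingObjectStore interface; the point is where ack and commit can be
  signaled in each mode.)

    #include <functional>

    enum JournalMode { FS_ONLY, FS_THEN_JOURNAL, PARALLEL, WRITEAHEAD };

    struct WriteOp {
      std::function<void()> apply_fs;        // apply to fs (e.g. btrfs)
      std::function<void()> append_journal;  // append + flush journal entry
      std::function<void()> ack;             // unstable ack to client
      std::function<void()> commit;          // stable commit to client
    };

    void submit(JournalMode mode, WriteOp& op)
    {
      switch (mode) {
      case FS_ONLY:              // latency: commits wait for the next fs sync
        op.apply_fs();
        op.ack();                // commit deferred until the fs commits
        break;
      case FS_THEN_JOURNAL:      // fs first; a btrfs hang stalls the journal too
        op.apply_fs();
        op.ack();
        op.append_journal();
        op.commit();
        break;
      case PARALLEL:             // fs | journal: commit on journal flush even if
        op.append_journal();     // fs lags; fs errors surface later
        op.apply_fs();
        op.commit();
        break;
      case WRITEAHEAD:           // journal sync, then fs: commit only, no acks;
        op.append_journal();     // rmw must block on prior fs writes
        op.commit();
        op.apply_fs();
        break;
      }
    }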
- greg
  - csync data import/export tool?
  - uclient: readdir from cache
  - mds: basic auth checks

later
- document on-wire protocol
- authentication
- client reconnect after long eviction; and slow delayed reconnect
- repair
- mds security enforcement
- client, user authentication
- cas
- osd failure declarations
- rename over old files should flush data, or revert back to old contents
- clean up SimpleMessenger interface and usage a little.  Can probably unify some/all of shutdown, wait, destroy.  Possibly move destroy into put() and make get/put usage more consistent/stringently mandated.

rados
- make rest interface superset of s3?
  - create/delete snapshots
  - list, access snapped version
- perl swig wrapper
- 'rados call foo.bar'?
- merge pgs
- destroy pg_pools
- autosize pg_pools?
- security

repair
- namespace reconstruction tool
- repair pg (rebuild log)  (online or offline?  ./cosd --repair_pg 1.ef?)
- repair file ioctl?
- are we concerned about
  - scrubbing
  - reconstruction after loss of a subset of cdirs
  - reconstruction after loss of md log
- data object
  - path backpointers?
  - parent dir pointer?
- mds scrubbing

kclient
- replace radix tree in monc with rbtree on statfs requests
- ENOMEM
  - message pools
  - sockets?  (this can actually generate a lockdep warning :/)
- fs-portable file layout virtual xattr (see Andreas' -fsdevel thread)
- statlite
- audit/combine/rework/whatever invalidate, writeback threads and associated invariants
- add cap to release if we get fouled up in fill_inode et al?
- make caps reservations per-client
- fix up ESTALE handling
- don't retry on ENOMEM on non-nofail requests in kick_requests
- make cap import/export more efficient?
- flock, fcntl locks
- ACLs
  - init security xattrs
- should we try to ref CAP_PIN on special inodes that are open?
- fix readdir vs fragment race by keeping a separate frag pos, and ignoring dentries below it
- inotify for updates from other clients?

vfs issues
- real_lookup() race:
  1- hash lookup finds no dentry
  2- real_lookup() takes dir i_mutex, but then finds a dentry
  3- drops mutex, then calls d_revalidate.  if that fails, we return ENOENT (instead of looping?)
- vfs_rename_dir()
- a getattr mask would be really nice

filestore
- make min sync interval self-tuning (ala xfs, ext3?)
- get file csum?

btrfs
- clone compressed inline extents
- ioctl to pull out data csum?

osd
- gracefully handle ENOSPC
- gracefully handle EIO?
- client session object
  - track each client's osdmap; only share the latest osdmap with them once!
- what to do with lost objects.. continue peering?
- segregate backlog from log ondisk?
- preserve pg logs on disk for longer period
- make scrub interruptible
- optionally separate osd interfaces (ips) for clients and osds (replication, peering, etc.)
- pg repair
- pg split should be a work queue
- optimize remove wrt recovery pushes?

uclient
- fix client_lock vs other mutex with C_SafeCond
- clean up check_caps to more closely mirror kclient logic
- readdir from cache
- fix readdir vs fragment race by keeping a separate frag pos, and ignoring dentries below it
- hadoop: clean up assert usage

mds
- don't sync log on every clientreplay request?
- pass issued, wanted into eval(lock) when eval() already has it?  (and otherwise optimize eval paths..)
- add an up:shadow mode?
  - tail the mds log as it is written
  - periodically check head so that we trim, too
- handle slow client reconnect (i.e. after mds has gone active)
- anchor_destroy needs to xlock linklock.. which means it needs a Mutation wrapper?
  - ... when it gets a caller.. someday..
- add FILE_CAP_EXTEND capability bit
- dir fragment
  - maybe just take dftlock for now, to keep it simple.
- dir merge
- snap
  - hard link backpointers
    - anchor source dir
    - build snaprealm for any hardlinked file
    - include snaps for all (primary+remote) parents
- how do we properly clean up inodes when doing a snap purge?
  - when they are mid-recover?  see 136470cf7ca876febf68a2b0610fa3bb77ad3532
- what if a recovery is queued, or in progress, and the inode is then cowed?  can that happen?
- proper handling of cache expire messages during rejoin phase?
  -> i think cache expires are fine; the rejoin_ack handler just has to behave if rejoining items go missing
- clustered
  - on replay, put dirty scatter replicas on lists so that they get flushed?  or does rejoin handle that?
  - linkage vs cdentry replicas and remote rename....
  - rename: importing inode... also journal imported client map?

mon
- don't allow lpg_num expansion and osd addition at the same time?
- how to shrink a cluster?
- how to tell an osd to cleanly shut down
- mds injectargs N should take mds# or id
  * should bcast to standby mds's
- paxos needs to clean up old states (see the trim sketch below)
  - default: simple max of (state count, min age), so that we have at least N hours of history, say?
  - osd map: trim only old maps < oldest "in" osd up_from
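  (one reading of the "simple max of (state count, min age)" default, as a
  sketch with hypothetical names: keep a state if it is within the last
  min_states states OR younger than min_age, and trim only what fails both
  tests.)

    #include <stdint.h>
    #include <time.h>

    uint64_t paxos_trim_to(uint64_t first_committed, uint64_t last_committed,
                           uint64_t min_states, time_t min_age,
                           time_t (*state_mtime)(uint64_t v))
    {
      time_t now = time(0);
      uint64_t count_floor =
        last_committed > min_states ? last_committed - min_states : 0;
      uint64_t v = first_committed;
      // advance only past states that are both outside the count window
      // and older than the age floor
      while (v < count_floor && now - state_mtime(v) > min_age)
        v++;
      return v;   // states below v can be trimmed
    }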
osdmon
- monitor needs to monitor some osds...

pgmon
/- include osd vector with pg state
- check for orphan pgs
- monitor pg states, notify on out?
- watch osd utilization; adjust overload in cluster map

crush
- allow forcefeed for more complicated rule structures.  (e.g. make force_stack a list< set >)

simplemessenger
- close idle connections?

objectcacher
- read locks?
- maintain more explicit inode grouping instead of wonky hashes

cas
- chunking.  see TTTD in
    Eshghi, K.  A framework for analyzing and improving content-based
    chunking algorithms.  Tech. Rep. HPL-2005-30(R.1), Hewlett Packard
    Laboratories, Palo Alto, 2005.
  (a sketch appears at the end of this file)

radosgw
- handle location related requests gracefully
- logging control (?)
- parse date/time better
- upload using POST
- torrent
- handle PUT/GET requestPayment gracefully

-- for nicer kclient debug output (everything but messenger, but including msg in/out)
echo 'module ceph +p' > /sys/kernel/debug/dynamic_debug/control
echo 'file fs/ceph/messenger.c -p' > /sys/kernel/debug/dynamic_debug/control
echo 'file ' `grep -- --- /sys/kernel/debug/dynamic_debug/control | grep ceph | awk '{print $1}' | sed 's/:/ line /'` +p > /sys/kernel/debug/dynamic_debug/control
echo 'file ' `grep -- === /sys/kernel/debug/dynamic_debug/control | grep ceph | awk '{print $1}' | sed 's/:/ line /'` +p > /sys/kernel/debug/dynamic_debug/control
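-- sketch of TTTD chunking for the 'cas' item above, after the cited Eshghi
   report.  the rolling hash here is a toy stand-in for the paper's, and the
   thresholds/divisors are illustrative, not tuned values.

    #include <stddef.h>
    #include <stdint.h>
    #include <vector>

    // two thresholds (tmin, tmax), two divisors (d, d_backup < d):
    // never cut below tmin; cut on a strong hash match; at tmax, force a
    // cut, preferring the last weaker (backup) match if one was seen.
    std::vector<size_t> tttd_chunk(const uint8_t *data, size_t len,
                                   size_t tmin, size_t tmax,
                                   uint32_t d, uint32_t d_backup)
    {
      std::vector<size_t> cuts;          // chunk boundary offsets
      size_t start = 0, backup = 0;
      uint32_t h = 0;
      for (size_t i = 0; i < len; i++) {
        h = h * 31 + data[i];            // toy hash; real tttd rolls a window
        size_t sz = i - start + 1;
        if (sz < tmin)
          continue;                      // never cut below the minimum threshold
        if (h % d_backup == d_backup - 1)
          backup = i + 1;                // remember a weaker (backup) boundary
        if (h % d == d - 1) {            // strong boundary: cut here
          cuts.push_back(i + 1);
          start = i + 1;
          backup = 0;
          h = 0;
        } else if (sz >= tmax) {         // forced cut: prefer the backup boundary
          size_t cut = backup > start ? backup : i + 1;
          cuts.push_back(cut);
          start = cut;
          backup = 0;
          h = 0;
          i = cut - 1;                   // resume scanning just after the cut
        }
      }
      if (start < len)
        cuts.push_back(len);             // trailing partial chunk
      return cuts;
    }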