
47218 Commits

Author SHA1 Message Date
Sage Weil
2e1edef3ff os/bluestore/BlueFS: fix replay of unlink
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:57 -05:00
Sage Weil
3745afb4c8 os/bluestore: support second block.wal device
Use this device for the bluefs log.

Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:57 -05:00
Sage Weil
02605a6612 os/bluestore/BlueStore: fix zero gap bug
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:57 -05:00
Sage Weil
9f114ac24b os/bluestore: disable overlay for now
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:57 -05:00
Sage Weil
b48798787d os/bluestore/BlockDevice: restructure interface
Use atomics; do not track in-flight extents or magically cope
with racing I/Os (that is the user's responsibility).

Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:57 -05:00
Sage Weil
1727cebdae os/bluestore/BlueFS: fix overwrite
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:56 -05:00
Sage Weil
13655fbb4a os/bluestore/BlueFS: fix writes spanning extents
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:56 -05:00
Sage Weil
ccce793f60 os/bluestore: reenable rocksdb recycling
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:56 -05:00
Sage Weil
ef06380b9a os/bluestore/BlockDevice: lock device while open
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:56 -05:00
Sage Weil
e3fd2795d0 os/bluestore/BlockDevice: debug read result
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:56 -05:00
Sage Weil
f6f4ed3dfc os/bluestore/BlockDevice: fix alignment check
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:56 -05:00
Sage Weil
db754e7df3 os/bluestore/BlockDevice: check aio return values
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:56 -05:00
Sage Weil
e7cce09c4d os/bluestore/BlueFS: avoid lock during reads
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:55 -05:00
Sage Weil
05be4c6c11 os/bluestore/BlueFS: prevent read+write sharing
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:55 -05:00
Sage Weil
9785bc9866 vstart.sh: debug bluefs and rocksdb
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:55 -05:00
Sage Weil
73adec4c98 os/bluestore/BlueFS: periodically compact log
Rewrite only the current metadata in a fresh log
periodically to free log space.

Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:55 -05:00
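The log compaction described in 73adec4c98 can be illustrated with a small sketch. This is not the BlueFS code: the class name `MetadataLog`, the `(op, name, size)` entry format, and the entry-count trigger are all assumptions made for illustration. It shows only the idea of rewriting the current metadata into a fresh log so that replay yields the same state from fewer entries.

```python
class MetadataLog:
    """Hypothetical append-only metadata log with periodic compaction."""

    def __init__(self, compact_threshold=8):
        self.entries = []   # logged operations: (op, name, size)
        self.files = {}     # current metadata: name -> size
        self.compact_threshold = compact_threshold  # arbitrary policy knob

    def log(self, op, name, size=0):
        # Record the operation, apply it, then compact if the log grew too long.
        self.entries.append((op, name, size))
        apply_op(self.files, op, name, size)
        if len(self.entries) > self.compact_threshold:
            self.compact()

    def compact(self):
        # Start a fresh log that contains only the current metadata.
        self.entries = [("update", n, s) for n, s in sorted(self.files.items())]


def apply_op(files, op, name, size):
    if op == "update":
        files[name] = size
    elif op == "unlink":
        files.pop(name, None)


def replay(entries):
    """Rebuild the metadata by replaying a log from the start."""
    files = {}
    for op, name, size in entries:
        apply_op(files, op, name, size)
    return files
```

Replaying the compacted log should give the same metadata as the in-memory state, which is the property a real implementation would have to preserve across restarts.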
Sage Weil
dd901498c9 os/bluestore/BlueFS: simplify extent list
Merge contiguous extents.

Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:55 -05:00
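As a rough illustration of "merge contiguous extents" in dd901498c9, the sketch below appends an extent to a list and folds it into the previous one when the two are adjacent. The `(offset, length)` tuple representation and the function name `append_extent` are assumptions, not taken from the BlueFS source.

```python
from typing import List, Tuple

Extent = Tuple[int, int]  # (offset, length); hypothetical representation


def append_extent(extents: List[Extent], offset: int, length: int) -> None:
    """Append an extent, merging it into the last one if they are contiguous."""
    if extents:
        last_off, last_len = extents[-1]
        if last_off + last_len == offset:
            # The new extent starts exactly where the last one ends: extend it.
            extents[-1] = (last_off, last_len + length)
            return
    extents.append((offset, length))
```

Merging only against the tail keeps the operation O(1), which fits a file that mostly grows by appending.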
Sage Weil
b073028528 os/bluestore/BlueFS: fix read
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:55 -05:00
Sage Weil
ac05b4c1c5 ceph_test_objectstore: trivial init fix
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:55 -05:00
Sage Weil
9341eec54d kv/RocksDBStore: rocksdb_separate_wal_dir option
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:54 -05:00
Sage Weil
3649a80a89 os/bluestore/BlueFS: ref count BlueFS::File *
There are FileWriters that exist when the file is
deleted.

Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:54 -05:00
Sage Weil
98485dee05 os/bluestore/BlueFS: readdir list dirs, too
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:54 -05:00
Sage Weil
b8630ee48c ceph-bluefs-tool: simple tool to export bluefs content
Currently we just do a dump.  We'll add more
functionality later.

Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:54 -05:00
Sage Weil
2d0537853a os/bluestore/BlueFS: many fixes
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:54 -05:00
Sage Weil
e4f6148c9f os/bluestore/BlueStore: share space with BlueFS
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:54 -05:00
Sage Weil
653882c446 os/bluestore/BlockDevice: move to simple mutex model
Just for now, while we get the rest of this working.

Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:53 -05:00
Sage Weil
dd04391706 os/bluestore/BlueFS: simple file system to back rocksdb
BlueFS is a simple file system that will back rocksdb.
BlueRocksEnv is the rocksdb::Env implementation that
glues them together.

Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:53 -05:00
Sage Weil
6f5ac50171 ceph_test_objectstore: less verbose
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:53 -05:00
Sage Weil
226b3476a3 ceph_test_objectstore: less verbose on hash collision test
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:53 -05:00
Sage Weil
1b8d5b6068 os/bluestore/BlueStore: fix _do_read
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:53 -05:00
Sage Weil
1ffd5e6963 os/bluestore/StupidAllocator: fix locking
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:53 -05:00
Sage Weil
14460484ff os/bluestore/StupidAllocator: fix misc bugs
Can't use invalid iterator; fix init_rm_free.

Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:52 -05:00
Sage Weil
08a94d95e1 os/bluestore/Allocator: init_rm_free
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:52 -05:00
Sage Weil
65f720ae9d kv/RocksDBStore: take custom Env
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:52 -05:00
Sage Weil
a869f92fac os/bluestore: fix _do_read return value
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:52 -05:00
Sage Weil
d704628cab os/bluestore/BlockDevice: fix read return value
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:52 -05:00
Sage Weil
9d01b8df9a os/bluestore: separate Allocator from freelist storage
FreelistManager persists our freelist.  Allocator is a policy that
allocates from it.

Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:52 -05:00
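The split described in 9d01b8df9a, persistence in one class and allocation policy in another, might look roughly like the sketch below. Everything here is an assumption for illustration: a plain dict stands in for the key/value store, first-fit stands in for the allocation policy, and the method names are hypothetical rather than the actual Ceph interfaces.

```python
class FreelistManager:
    """Persists free extents. A dict (offset -> length) stands in for the kv store."""

    def __init__(self, kv):
        self.kv = kv

    def release(self, offset, length):
        self.kv[offset] = length

    def allocate(self, offset, length):
        # Carve [offset, offset + length) out of the free extent that contains it.
        end = offset + length
        for off, ln in list(self.kv.items()):
            if off <= offset and end <= off + ln:
                del self.kv[off]
                if off < offset:
                    self.kv[off] = offset - off
                if end < off + ln:
                    self.kv[end] = off + ln - end
                return
        raise ValueError("range is not free")

    def enumerate(self):
        return sorted(self.kv.items())


class FirstFitAllocator:
    """In-memory policy: hands out the lowest free extent that is large enough."""

    def __init__(self):
        self.free = {}  # offset -> length

    def init_add_free(self, offset, length):
        self.free[offset] = length

    def allocate(self, want):
        for off in sorted(self.free):
            ln = self.free[off]
            if ln >= want:
                del self.free[off]
                if ln > want:
                    self.free[off + want] = ln - want
                return off
        raise ValueError("no free extent large enough")
```

At startup the allocator would be seeded from `FreelistManager.enumerate()`. Each allocation decision is then made in memory and recorded through the freelist manager, so a different policy can be swapped in without touching the on-disk format.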
Sage Weil
a62ffb0d03 newstore -> bluestore
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:52 -05:00
Sage Weil
3a4d583f85 os/newstore: always create db.wal
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:32 -05:00
Sage Weil
ad9f9fad01 os/newstore: create db dir
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:17 -05:00
Sage Weil
5658665ce7 os/newstore: consume a raw block device
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:06:17 -05:00
Sage Weil
32e768391f os/newstore: make collection_list tolerate sloppy start position
Because of this change (#6076), the hobject_t will contain the pool id, hence
a ghobject_t holding this hobject_t will not be equal to ghobject_t().

In newstore, this will cause assertion failure:
FAILED assert(k >= start_key && k < end_key)

The fix is to stay compatible with the previous change by creating a
ghobject_t object with pool id and shard id in newstore.

Fixes: #13801
Reported-by: Zhi Zhang <zhangz.david@outlook.com>
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:05:18 -05:00
Sage Weil
2dae3df8af os/newstore: make key names more efficient
- pack u32 and u64 in binary (instead of in hex)
- avoid duplicating the object name while making things still
  sort by (key,name).  Use a prefix (< when key < name, = when
  key == name, > when key > name).  In the = case (which is
  basically always) include the name just once.

Note that this breaks on-disk compatibility.

Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:05:18 -05:00
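The first bullet of 2dae3df8af (binary instead of hex for u32 and u64) can be illustrated as follows. The sketch assumes big-endian packing, since that is what makes bytewise key comparison agree with numeric order. The helper names are hypothetical, and the `<`/`=`/`>` name marker from the second bullet is not modelled here.

```python
import struct


def pack_u32(v: int) -> bytes:
    """Fixed-width big-endian encoding: 4 bytes, sorts like the integer."""
    return struct.pack(">I", v)


def pack_u64(v: int) -> bytes:
    """Fixed-width big-endian encoding: 8 bytes, sorts like the integer."""
    return struct.pack(">Q", v)


def hex_u32(v: int) -> bytes:
    """The older style for comparison: 8 hex characters for the same value."""
    return format(v, "08x").encode()
```

Both encodings preserve ordering, but the binary form takes half the space per integer in every key, at the cost of keys that are no longer printable.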
Sage Weil
5e566dd7cb os/newstore: fix collection_list vs max entries
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:05:18 -05:00
Sage Weil
84646ab1c2 os/newstore: do not set/change frag_size if there are overlays
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:05:18 -05:00
Sage Weil
9291e1682a os/newstore: define a fid_backpointer_t type
Signed-off-by: Sage Weil <sage@redhat.com>

fix wal_op_t
2016-01-01 13:05:17 -05:00
Sage Weil
b2db842e4d os/newstore: add newstore types to ceph-dencoder
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:05:17 -05:00
Sage Weil
0af0dbdc14 os/newstore: set alloc hint on new frags
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:05:17 -05:00
Sage Weil
f0f815fb9e os/newstore: dump onode contents
Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:05:17 -05:00
Sage Weil
299350461b os/newstore: fixed fragment size
Instead of a single, variable-length fragment for each object,
set a fixed size (newstore_min_frag_size = 1 MB) and stripe the
object over these.  The last fragment will be smaller
than 1 MB if the object is not a multiple of 1 MB.

On write, this is basically free: we can write 4 inodes created
together and fsync them just as cheaply as we can one.  On
overwrite, it allows us to replace individual fragments and avoid
write-ahead in many cases.

On read it is a bit slower because of inode lookups and disk
seeks.  In the common case (big object written sequentially) we
hope that fs prefetching will hide most of it (e.g., all inodes
will be loaded together in the same metadata btree node, and the
files' data is written sequentially on disk).

Allowing for a single large fragment in the case of a sequentially
written large object may save us something, but it complicates
the code significantly.

Signed-off-by: Sage Weil <sage@redhat.com>
2016-01-01 13:05:17 -05:00
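The striping arithmetic described in 299350461b can be sketched as below. The 1 MB size comes from the commit message (newstore_min_frag_size). The function names and the tuple layout are assumptions for illustration and do not reflect the newstore code.

```python
FRAG_SIZE = 1 << 20  # 1 MB, per newstore_min_frag_size in the commit message


def fragment_sizes(object_size, frag_size=FRAG_SIZE):
    """Sizes of the fragments an object is striped over; the last may be short."""
    full, tail = divmod(object_size, frag_size)
    return [frag_size] * full + ([tail] if tail else [])


def fragments_for_range(offset, length, frag_size=FRAG_SIZE):
    """Map a byte range to (fragment index, offset in fragment, length) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        index = offset // frag_size
        in_off = offset % frag_size
        n = min(frag_size - in_off, end - offset)
        pieces.append((index, in_off, n))
        offset += n
    return pieces
```

An overwrite that covers a whole piece (`in_off == 0` and `n == frag_size`) is the case where a fragment could be replaced outright, which is where the commit message expects to avoid write-ahead.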