==========
 SeaStore
==========

This is a rough design doc for a new ObjectStore implementation
designed to facilitate higher performance on solid state devices.

Name
====

SeaStore maximizes the opportunity for confusion (seastar? seashore?)
and associated fun.  Alternative suggestions welcome.

Goals
=====

* Target NVMe devices.  Not primarily concerned with pmem or HDD.
* Make use of SPDK for user-space driven IO.
* Use the Seastar futures programming model to facilitate
  run-to-completion and a sharded memory/processing model (a minimal
  sketch follows this list).
* Allow zero- (or minimal) data copying on read and write paths when
  combined with a seastar-based messenger using DPDK.
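
To make the programming model concrete, here is a minimal sketch of
the run-to-completion style, assuming Seastar's ``app_template`` and
future APIs; the write handler itself is a hypothetical stand-in, not
part of any settled design.

.. code-block:: cpp

    #include <seastar/core/app-template.hh>
    #include <seastar/core/future.hh>
    #include <iostream>

    // A chain of continuations runs to completion on a single shard;
    // no locks or cross-core synchronization on the hot path.
    seastar::future<> handle_write() {
        return seastar::make_ready_future<>().then([] {
            std::cout << "write handled on one shard\n";
        });
    }

    int main(int argc, char** argv) {
        seastar::app_template app;
        return app.run(argc, argv, [] { return handle_write(); });
    }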

Motivation and background
=========================

All flash devices are internally structured in terms of segments that
can be written efficiently but must be erased in their entirety.  The
NVMe device generally has limited knowledge about what data in a
segment is still "live" (hasn't been logically discarded), making the
inevitable garbage collection within the device inefficient.  We can
design an on-disk layout that is friendly to GC at lower layers and
drive garbage collection at higher layers.

In principle a fine-grained discard could communicate our intent to
the device, but in practice discard is poorly implemented in the
device and intervening software layers.

Basics
======

The basic idea is that all data will be streamed out sequentially to
large segments on the device.  In the SSD hardware, segments are
likely to be on the order of hundreds of MB to tens of GB.

SeaStore's logical segments would ideally be perfectly aligned with
the hardware segments.  In practice, it may be challenging to
determine geometry and to sufficiently hint to the device that LBAs
being written should be aligned to the underlying hardware.  In the
worst case, we can structure our logical segments to correspond to,
e.g., 5x the physical segment size so that roughly 20% of our data is
misaligned: each misaligned logical segment fully covers four physical
segments, leaving the equivalent of one physical segment's worth of
data (1/5 of the total) straddling the two boundary segments.

When we reach some utilization threshold, we mix cleaning work in with
the ongoing write workload in order to evacuate live data from
previously written segments.  Once they are completely free we can
discard the entire segment so that it can be erased and reclaimed by
the device.

The key is to mix a small bit of cleaning work with every write
transaction to avoid spikes and variance in write latency.
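
As a rough illustration of that pacing, the hypothetical helper below
decides how much cleaning work to attach to a single write; the linear
ramp, the 0.8 threshold, and all names are assumptions rather than
settled design.

.. code-block:: cpp

    #include <cstdint>

    // How many bytes of live data to evacuate alongside a write of
    // `write_bytes`, so cleaning keeps pace with incoming writes.
    // `utilization` is the fraction of segments in use; cleaning
    // starts at `threshold` and ramps up linearly, amortizing the
    // cost over every transaction instead of paying it in bursts.
    std::uint64_t cleaning_bytes_for_write(std::uint64_t write_bytes,
                                           double utilization,
                                           double threshold = 0.8) {
        if (utilization <= threshold)
            return 0;  // plenty of free segments; no cleaning needed
        double pressure = (utilization - threshold) / (1.0 - threshold);
        return static_cast<std::uint64_t>(write_bytes * pressure);
    }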

Data layout basics
==================

One or more cores/shards will be reading and writing to the device at
once.  Each shard will have its own independent data it is operating
on and will stream to its own open segments.  Devices that support
streams can be hinted accordingly so that data from different shards
is not mixed on the underlying media.
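
As one possible hinting mechanism when IO goes through the Linux
kernel rather than SPDK, per-file write-lifetime hints (Linux 4.13+)
can stand in for streams; the shard-to-hint mapping below is purely
illustrative.

.. code-block:: cpp

    #include <fcntl.h>
    #include <cstdint>

    // Map each shard to a distinct write-lifetime hint so its data is
    // kept separate on media that honors streams. The mapping itself
    // is an assumption, not part of the design.
    bool hint_for_shard(int fd, unsigned shard) {
        static const std::uint64_t hints[] = {
            RWH_WRITE_LIFE_SHORT, RWH_WRITE_LIFE_MEDIUM,
            RWH_WRITE_LIFE_LONG,  RWH_WRITE_LIFE_EXTREME,
        };
        std::uint64_t hint = hints[shard % 4];
        return fcntl(fd, F_SET_RW_HINT, &hint) == 0;
    }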

Global state
------------

There will be a simple global table of segments and their usage/empty
status.  Each shard will occasionally claim new empty segments for
writing as needed, or return cleaned segments to the global free list.
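
A minimal sketch of what that table might look like follows; the
state enum and claim/release operations are illustrative, and a real
Seastar-based implementation would avoid the mutex in favor of a
sharded or message-passing scheme.

.. code-block:: cpp

    #include <cstdint>
    #include <mutex>
    #include <optional>
    #include <vector>

    enum class segment_state { empty, open, sealed };

    struct segment_entry {
        segment_state state = segment_state::empty;
        std::uint32_t owner_shard = 0;  // shard writing it, if open
    };

    class segment_table {
        std::vector<segment_entry> segments;
        std::mutex lock;  // placeholder; see note above
    public:
        explicit segment_table(std::size_t n) : segments(n) {}

        // A shard claims an empty segment to open for writing.
        std::optional<std::size_t> claim(std::uint32_t shard) {
            std::scoped_lock l{lock};
            for (std::size_t i = 0; i < segments.size(); ++i) {
                if (segments[i].state == segment_state::empty) {
                    segments[i] = {segment_state::open, shard};
                    return i;
                }
            }
            return std::nullopt;  // nothing free; caller must clean
        }

        // A fully cleaned segment returns to the global free list.
        void release(std::size_t i) {
            std::scoped_lock l{lock};
            segments[i] = {segment_state::empty, 0};
        }
    };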

At a high level, all metadata will be structured as a b-tree.  The
root for the metadata b-tree will also be stored centrally (along with
the segment allocation table).

This is hand-wavey, but it is probably sufficient to update the root
pointer for the b-tree either as each segment is sealed or as a new
segment is opened.
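
For illustration only, the central root record might carry something
like the following; every field here is an assumption about a layout
that is explicitly not settled.

.. code-block:: cpp

    #include <cstdint>

    // Hypothetical root record, rewritten as segments are sealed or
    // opened. On replay, the record with the highest seq wins.
    struct root_record {
        std::uint64_t btree_root_addr;     // current metadata b-tree root
        std::uint64_t segment_table_addr;  // segment allocation table
        std::uint64_t seq;                 // monotonically increasing
        std::uint32_t crc;                 // integrity check
    };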

Writing segments
----------------

Each segment will be written sequentially as a sequence of
transactions.  Each transaction will be an on-disk expression of an
ObjectStore::Transaction.  It will consist of:

* new data blocks
* some metadata describing changes to b-tree metadata blocks.  This
  will be written compactly as a delta: which keys are removed and
  which keys/values are inserted into the b-tree block (see the
  sketch below).
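
The sketch below gives one hypothetical in-memory shape for such a
transaction record before serialization; the actual on-disk encoding
is undecided, and all names are illustrative.

.. code-block:: cpp

    #include <cstdint>
    #include <string>
    #include <utility>
    #include <vector>

    // Compact description of changes to a single b-tree block.
    struct btree_delta {
        std::uint64_t block_addr;               // block being amended
        std::vector<std::string> removed_keys;  // keys deleted
        std::vector<std::pair<std::string, std::string>>
            inserted;                           // key/value insertions
    };

    // One on-disk transaction: the two payload kinds listed above.
    struct transaction_record {
        std::vector<std::vector<std::uint8_t>> data_blocks;
        std::vector<btree_delta> deltas;
    };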

As each b-tree block is modified, we update the block in memory and
put it on a 'dirty' list.  However, again, only the (compact) delta is
journaled to the segment.

As we approach the end of the segment, the goal is to undirty all of
our dirty blocks in memory.  Based on the number of dirty blocks and
the remaining space, we include a proportional number of dirty blocks
in each transaction write so that we undirty some of the b-tree
blocks.  Eventually, the last transaction written to the segment will
include all of the remaining dirty b-tree blocks.
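
One way to express that proportionality is the hypothetical helper
below; the round-up is an assumption chosen to keep writeback slightly
ahead of pace.

.. code-block:: cpp

    #include <cstddef>

    // How many dirty b-tree blocks to fold into the next transaction,
    // given the bytes the transaction will consume and the bytes left
    // in the segment. Writing back in proportion to space consumed
    // drains the dirty list by segment end.
    std::size_t blocks_to_writeback(std::size_t dirty_blocks,
                                    std::size_t txn_bytes,
                                    std::size_t remaining_bytes) {
        if (remaining_bytes == 0 || txn_bytes >= remaining_bytes)
            return dirty_blocks;  // segment ending: flush everything
        // Round up so writeback never falls behind the pace.
        std::size_t n =
            (dirty_blocks * txn_bytes + remaining_bytes - 1) / remaining_bytes;
        return n < dirty_blocks ? n : dirty_blocks;
    }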

Segment inventory
-----------------

At the end of each segment, an inventory will be written that includes
any metadata needed to test whether blocks in the segment are still
live.  For data blocks, that means an object id (e.g., ino number) and
offset to test whether the block is still referenced.  For metadata
blocks, it would be at least one metadata key that lands in any b-tree
block that is modified (via a delta) in the segment--enough for us to
do a forward lookup in the b-tree to check whether the b-tree block is
still referenced.  Once this is written, the segment is sealed and
read-only.
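
A sketch of what inventory entries could carry, matching the two cases
above; the names and types are illustrative only.

.. code-block:: cpp

    #include <cstdint>
    #include <string>
    #include <variant>
    #include <vector>

    // For a data block: enough to ask "does this object still
    // reference this block at this offset?"
    struct data_block_ref {
        std::uint64_t ino;     // owning object id
        std::uint64_t offset;  // logical offset within the object
    };

    // For a metadata block: any key landing in the modified b-tree
    // block, enough to re-find the block by forward lookup.
    struct metadata_block_ref {
        std::string key;
    };

    using inventory_entry = std::variant<data_block_ref, metadata_block_ref>;

    // Written at the end of the segment; after this, the segment is
    // sealed and read-only.
    using segment_inventory = std::vector<inventory_entry>;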

Crash recovery
--------------

On any crash, we simply "replay" the currently open segment in memory.
For any b-tree delta encountered, we load the original block, modify
it in memory, and mark it dirty.  Once we continue writing, the normal
"write dirty blocks as we near the end of the segment" behavior will
pick up where we left off.
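
A sketch of that replay loop, using a simplified version of the delta
record from earlier; ``load_block`` and ``apply_delta`` are stand-ins
for machinery this document does not specify.

.. code-block:: cpp

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct btree_block { std::vector<std::uint8_t> bytes; bool dirty = false; };
    struct btree_delta { std::uint64_t block_addr; /* removed/inserted keys */ };

    btree_block load_block(std::uint64_t) { return {}; }   // stand-in
    void apply_delta(btree_block&, const btree_delta&) {}  // stand-in

    // Rebuild the in-memory dirty set from the open segment's deltas.
    std::unordered_map<std::uint64_t, btree_block>
    replay(const std::vector<btree_delta>& deltas) {
        std::unordered_map<std::uint64_t, btree_block> cache;
        for (const auto& d : deltas) {
            auto [it, inserted] = cache.try_emplace(d.block_addr);
            if (inserted)
                it->second = load_block(d.block_addr);  // original block
            apply_delta(it->second, d);                 // modify in memory
            it->second.dirty = true;  // normal writeback picks it up
        }
        return cache;
    }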

ObjectStore considerations
==========================

Splits, merges, and sharding
----------------------------

One of the current ObjectStore requirements is to be able to split a
collection (PG) in O(1) time.  Starting in mimic, we also need to be
able to merge two collections into one (i.e., exactly the reverse of a
split).

However, the PGs that we split into would hash to different shards of
the OSD in the current sharding scheme.  One can imagine replacing
that sharding scheme with a temporary mapping directing the smaller
child PG to the right shard, since we generally then migrate that PG
to another OSD anyway.  This wouldn't help us in the merge case,
however, where the constituent pieces may start out on different
shards and ultimately need to be handled in the same collection (and
be operated on via single transactions).

This suggests that we likely need a way for data written via one shard
to "switch ownership" and later be read and managed by a different
shard.