=======================
Zoned Storage Support
=======================

http://zonedstorage.io

Zoned Storage is a class of storage devices that enables host and storage
devices to cooperate to achieve higher storage capacities, increased throughput,
and lower latencies. The zoned storage interface is available through the SCSI
Zoned Block Commands (ZBC) and Zoned Device ATA Command Set (ZAC) standards on
Shingled Magnetic Recording (SMR) hard disks today and is also being adopted for
NVMe Solid State Disks with the upcoming NVMe Zoned Namespaces (ZNS) standard.

This project aims to enable Ceph to work on zoned storage drives and at the same
time explore research problems related to adopting this new interface. The
first target is to enable non-overwrite workloads (e.g. RGW) on host-managed SMR
(HM-SMR) drives and explore cleaning (garbage collection) policies. HM-SMR
drives are high-capacity hard drives with the ZBC/ZAC interface. The longer-term
goal is to support ZNS SSDs, as they become available, as well as overwrite
workloads.

The first patch in this series enables writing data to HM-SMR drives. The
second patch will introduce ZonedFreelistManager, a FreelistManager
implementation that passes enough information to ZonedAllocator to correctly
initialize the state of zones. We have to introduce a new FreelistManager
implementation because on zoned devices a region of disk can be in one of three
states (empty, used, and stale), whereas the current BitmapFreelistManager
tracks only two states (empty and used). It is not possible to accurately
initialize the state of zones in ZonedAllocator by tracking only two states.
The third planned patch will introduce a rudimentary cleaner to form a baseline
for further research.

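
To illustrate why two states are not enough, here is a minimal Python sketch.
It is not Ceph code: ``RegionState``, ``write_pointer`` and ``to_two_state`` are
hypothetical names, and the sketch assumes that a two-state tracker records a
stale block as "not used" and that a zone is written strictly sequentially.
Under those assumptions, two zones with different write pointers collapse to
the same two-state bitmap, so the allocator cannot tell where the next write in
each zone must go.

```python
from enum import Enum


class RegionState(Enum):
    EMPTY = "empty"  # not written since the zone was last reset
    USED = "used"    # written and still referenced by live data
    STALE = "stale"  # written, no longer referenced, not yet cleaned


def write_pointer(blocks):
    """Index of the first EMPTY block (assumes sequential writes in a zone)."""
    for index, state in enumerate(blocks):
        if state is RegionState.EMPTY:
            return index
    return len(blocks)


def to_two_state(blocks):
    """Project onto used / not used (assumption: stale is recorded as not used)."""
    return [state is RegionState.USED for state in blocks]


# Two hypothetical four-block zones that differ only in one stale block.
zone_a = [RegionState.USED, RegionState.STALE, RegionState.EMPTY, RegionState.EMPTY]
zone_b = [RegionState.USED, RegionState.EMPTY, RegionState.EMPTY, RegionState.EMPTY]

same_bitmap = to_two_state(zone_a) == to_two_state(zone_b)
different_pointers = write_pointer(zone_a) != write_pointer(zone_b)
print("same two-state bitmap:", same_bitmap)
print("different write pointers:", different_pointers)
```

With all three states tracked, the write pointer of each zone (and the amount
of stale data a cleaner could reclaim) can be recovered after a restart.
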
Currently we can run basic RADOS benchmarks on an OSD backed by an HM-SMR
drive, restart the OSD, and read back the written data, as can be seen below.

Please contact Abutalib Aghayev <agayev@cs.cmu.edu> for questions.

::

  $ zbc_info /dev/sdc
  Device /dev/sdc:
      Vendor ID: ATA HGST HSH721414AL T240
      Zoned block device interface, Host-managed zone model
      27344764928 512-bytes sectors
      3418095616 logical blocks of 4096 B
      3418095616 physical blocks of 4096 B
      14000.520 GB capacity
      Read commands are unrestricted
      672 KiB max R/W size
      Maximum number of open sequential write required zones: 128

  $ MON=1 OSD=1 MDS=0 sudo ../src/vstart.sh --new --localhost --bluestore --bluestore-devs /dev/sdc --bluestore-zoned
  <snipped verbose output>

  $ sudo ./bin/ceph osd pool create bench 32 32
  pool 'bench' created

  $ sudo ./bin/rados bench -p bench 10 write --no-cleanup
  hints = 1
  Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
  Object prefix: benchmark_data_h0.cc.journaling712.narwhal.p_29846
    sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
      0       0         0         0         0         0           -           0
      1      16        45        29   115.943       116    0.384175    0.407806
      2      16        86        70   139.949       164    0.259845    0.391488
      3      16       125       109   145.286       156     0.31727    0.404727
      4      16       162       146   145.953       148    0.826671    0.409003
      5      16       203       187   149.553       164     0.44815    0.404303
      6      16       242       226   150.621       156    0.227488    0.409872
      7      16       281       265   151.384       156    0.411896    0.408686
      8      16       320       304   151.956       156    0.435135    0.411473
      9      16       359       343   152.401       156    0.463699    0.408658
     10      15       396       381   152.356       152    0.409554    0.410851
  Total time run:         10.3305
  Total writes made:      396
  Write size:             4194304
  Object size:            4194304
  Bandwidth (MB/sec):     153.333
  Stddev Bandwidth:       13.6561
  Max bandwidth (MB/sec): 164
  Min bandwidth (MB/sec): 116
  Average IOPS:           38
  Stddev IOPS:            3.41402
  Max IOPS:               41
  Min IOPS:               29
  Average Latency(s):     0.411226
  Stddev Latency(s):      0.180238
  Max latency(s):         1.00844
  Min latency(s):         0.108616

  $ sudo ../src/stop.sh
  $ # Notice the lack of "--new" parameter to vstart.sh
  $ MON=1 OSD=1 MDS=0 sudo ../src/vstart.sh --localhost --bluestore --bluestore-devs /dev/sdc --bluestore-zoned
  <snipped verbose output>

  $ sudo ./bin/rados bench -p bench 10 rand
  hints = 1
    sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
      0       0         0         0         0         0           -           0
      1      16        61        45   179.903       180    0.117329    0.244067
      2      16       116       100   199.918       220    0.144162    0.292305
      3      16       174       158   210.589       232    0.170941    0.285481
      4      16       251       235   234.918       308    0.241175    0.256543
      5      16       316       300   239.914       260    0.206044    0.255882
      6      15       392       377   251.206       308    0.137972    0.247426
      7      15       458       443   252.984       264   0.0800146    0.245138
      8      16       529       513   256.346       280    0.103529    0.239888
      9      16       587       571   253.634       232    0.145535      0.2453
     10      15       646       631   252.254       240    0.837727    0.246019
  Total time run:         10.272
  Total reads made:       646
  Read size:              4194304
  Object size:            4194304
  Bandwidth (MB/sec):     251.558
  Average IOPS:           62
  Stddev IOPS:            10.005
  Max IOPS:               77
  Min IOPS:               45
  Average Latency(s):     0.249385
  Max latency(s):         0.888654
  Min latency(s):         0.0103208
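
The summary figures in the session above can be cross-checked with a little
arithmetic. The Python sketch below only recomputes values from numbers printed
above. The function and variable names are ours, and the reading of "MB" as
2^20 bytes is an assumption that happens to reproduce the reported bandwidth.

```python
# Device geometry as reported by zbc_info.
SECTORS = 27344764928
SECTOR_BYTES = 512
LOGICAL_BLOCKS = 3418095616
BLOCK_BYTES = 4096

capacity_bytes = SECTORS * SECTOR_BYTES
# Both views of the device should describe the same number of bytes.
geometry_consistent = capacity_bytes == LOGICAL_BLOCKS * BLOCK_BYTES
capacity_gb = capacity_bytes / 10**9  # decimal gigabytes


def bandwidth(ops, op_bytes, seconds):
    """Average bandwidth in units of 2**20 bytes per second (assumed unit)."""
    return ops * op_bytes / seconds / 2**20


OBJECT_BYTES = 4194304

# Write run: 396 writes in 10.3305 s. Read run: 646 reads in 10.272 s.
write_bw = bandwidth(396, OBJECT_BYTES, 10.3305)
read_bw = bandwidth(646, OBJECT_BYTES, 10.272)
write_iops = 396 / 10.3305
read_iops = 646 / 10.272

print(f"geometry consistent: {geometry_consistent}")
print(f"capacity:   {capacity_gb:.3f} GB")
print(f"write:      {write_bw:.3f} MB/s, {write_iops:.2f} ops/s")
print(f"read:       {read_bw:.3f} MB/s, {read_iops:.2f} ops/s")
```

The recomputed bandwidths agree with the reported 153.333 and 251.558 to within
the rounding of the printed run times, and the reported average IOPS (38 and
62) are consistent with truncating the computed operation rates.
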