iSCSI vs NVMe over TCP

To keep conditions identical, both protocols are measured in the same environment (same OS, NIC, and machines), and a null block device is used on the target instead of an SSD so that the storage medium itself adds no variance.

In the lsblk output below, sde is the iSCSI device and nvme1n1 is the NVMe over TCP device; both are backed by the same /dev/nullb0 on the remote machine.

# lsblk 
NAME          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda             8:0    0 447.1G  0 disk 
├─sda1          8:1    0   600M  0 part /boot/efi
├─sda2          8:2    0     1G  0 part /boot
└─sda3          8:3    0 445.6G  0 part 
  ├─fc31-root 253:0    0    15G  0 lvm  /
  └─fc31-swap 253:1    0  31.4G  0 lvm  [SWAP]
sdb             8:16   0   1.5T  0 disk 
sdc             8:32   0   1.5T  0 disk 
sdd             8:48   1  14.6G  0 disk 
└─sdd1          8:49   1  14.6G  0 part 
sde             8:64   0   250G  0 disk 
sr0            11:0    1     2G  0 rom  
nvme0n1       259:0    0   477G  0 disk 
├─nvme0n1p1   259:1    0   512M  0 part 
└─nvme0n1p2   259:2    0 476.4G  0 part 
nvme1n1       259:4    0   250G  0 disk 

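For reference, exporting one null device over both protocols can be sketched roughly as below. This is a hypothetical reconstruction, not the actual configuration used for this test: the IP address, IQN, and NQN are made-up placeholders, and ACL/auth setup is omitted.

```shell
# --- on the target (remote) machine; address and NQN/IQN are hypothetical ---
modprobe null_blk nr_devices=1                 # creates /dev/nullb0

# NVMe over TCP target via the nvmet configfs interface
modprobe nvmet-tcp
cd /sys/kernel/config/nvmet
mkdir subsystems/nqn.2019-12.example:nullb0
echo 1 > subsystems/nqn.2019-12.example:nullb0/attr_allow_any_host
mkdir subsystems/nqn.2019-12.example:nullb0/namespaces/1
echo -n /dev/nullb0 > subsystems/nqn.2019-12.example:nullb0/namespaces/1/device_path
echo 1 > subsystems/nqn.2019-12.example:nullb0/namespaces/1/enable
mkdir ports/1
echo tcp          > ports/1/addr_trtype
echo ipv4         > ports/1/addr_adrfam
echo 192.168.0.21 > ports/1/addr_traddr        # hypothetical target IP
echo 4420         > ports/1/addr_trsvcid
ln -s /sys/kernel/config/nvmet/subsystems/nqn.2019-12.example:nullb0 \
      ports/1/subsystems/

# iSCSI target for the same device (LIO/targetcli)
targetcli /backstores/block create name=nullb0 dev=/dev/nullb0
targetcli /iscsi create iqn.2019-12.example:nullb0
targetcli /iscsi/iqn.2019-12.example:nullb0/tpg1/luns create /backstores/block/nullb0

# --- on the initiator, connect to each ---
nvme connect -t tcp -a 192.168.0.21 -s 4420 -n nqn.2019-12.example:nullb0
iscsiadm -m discovery -t sendtargets -p 192.168.0.21
iscsiadm -m node -p 192.168.0.21 --login
```

This is a configuration fragment requiring root and the relevant kernel modules, so treat it as an outline rather than a copy-paste script.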
Measurements are taken with fio: 70% reads / 30% writes, 4 KB block size, 256 parallel jobs, and a queue depth of 1 per job.

First, iSCSI. Reads come in at roughly 24.8k IOPS with 97.0 MiB/s (102 MB/s) of throughput and an average latency of about 8.2 ms.

# fio --name=iscsi --filename=/dev/sde --rw=randrw --rwmixread=70 --direct=1 --invalidate=1 --ioengine=libaio --bs=4k --numjobs=256 --time_based --runtime=10 --group_reporting --iodepth=1
iscsi: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
...
fio-3.7
Starting 256 processes
Jobs: 256 (f=256): [m(256)][100.0%][r=98.8MiB/s,w=43.3MiB/s][r=25.3k,w=11.1k IOPS][eta 00m:00s]
iscsi: (groupid=0, jobs=256): err= 0: pid=2766: Fri Dec  6 22:59:38 2019
   read: IOPS=24.8k, BW=97.0MiB/s (102MB/s)(973MiB/10026msec)
    slat (nsec): min=1395, max=12744k, avg=71535.37, stdev=174994.31
    clat (nsec): min=1088, max=42954k, avg=8100632.98, stdev=6503994.47
     lat (usec): min=126, max=42972, avg=8172.73, stdev=6508.26
    clat percentiles (usec):
     |  1.00th=[  799],  5.00th=[ 1156], 10.00th=[ 1500], 20.00th=[ 2245],
     | 30.00th=[ 3195], 40.00th=[ 4490], 50.00th=[ 6128], 60.00th=[ 8225],
     | 70.00th=[10814], 80.00th=[13829], 90.00th=[17957], 95.00th=[20841],
     | 99.00th=[26608], 99.50th=[28443], 99.90th=[32637], 99.95th=[34341],
     | 99.99th=[38011]
   bw (  KiB/s): min=  216, max=  624, per=0.39%, avg=389.93, stdev=55.55, samples=4892
   iops        : min=   54, max=  156, avg=97.43, stdev=13.89, samples=4892
  write: IOPS=10.7k, BW=41.8MiB/s (43.8MB/s)(419MiB/10026msec)
    slat (usec): min=2, max=1860, avg=72.70, stdev=174.42
    clat (nsec): min=1193, max=32943k, avg=4786197.03, stdev=3747743.28
     lat (usec): min=197, max=32991, avg=4859.45, stdev=3753.62
    clat percentiles (usec):
     |  1.00th=[  742],  5.00th=[ 1020], 10.00th=[ 1270], 20.00th=[ 1713],
     | 30.00th=[ 2212], 40.00th=[ 2802], 50.00th=[ 3556], 60.00th=[ 4555],
     | 70.00th=[ 5800], 80.00th=[ 7570], 90.00th=[10159], 95.00th=[12518],
     | 99.00th=[16909], 99.50th=[18482], 99.90th=[22676], 99.95th=[25035],
     | 99.99th=[29230]
   bw (  KiB/s): min=   48, max=  328, per=0.39%, avg=168.04, stdev=41.55, samples=4892
   iops        : min=   12, max=   82, avg=41.96, stdev=10.39, samples=4892
  lat (usec)   : 2=0.01%, 4=0.01%, 250=0.01%, 500=0.06%, 750=0.77%
  lat (usec)   : 1000=2.57%
  lat (msec)   : 2=16.19%, 4=22.56%, 10=31.77%, 20=21.64%, 50=4.44%
  cpu          : usr=0.21%, sys=0.66%, ctx=392549, majf=0, minf=3695
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=249030,107245,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=97.0MiB/s (102MB/s), 97.0MiB/s-97.0MiB/s (102MB/s-102MB/s), io=973MiB (1020MB), run=10026-10026msec
  WRITE: bw=41.8MiB/s (43.8MB/s), 41.8MiB/s-41.8MiB/s (43.8MB/s-43.8MB/s), io=419MiB (439MB), run=10026-10026msec

Disk stats (read/write):
  sdd: ios=246238/106056, merge=0/0, ticks=1986382/502897, in_queue=2313614, util=97.08%
# 
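As a sanity check on these numbers: with 256 jobs at iodepth=1 there are about 256 I/Os in flight, so by Little's law total IOPS should be roughly concurrency divided by average latency. Plugging in the "lat" averages fio reported above (a back-of-the-envelope check, not part of the measurement):

```python
# Little's law sanity check for the iSCSI run: with 256 jobs at iodepth=1
# there are ~256 I/Os in flight, so total IOPS ≈ concurrency / avg latency.
jobs = 256
read_lat_s = 8172.73e-6    # avg read completion latency (s), from fio "lat"
write_lat_s = 4859.45e-6   # avg write completion latency (s), from fio "lat"
avg_lat_s = 0.7 * read_lat_s + 0.3 * write_lat_s   # 70/30 read/write mix
predicted_iops = jobs / avg_lat_s
measured_iops = 24_800 + 10_700                    # read + write IOPS from fio
print(f"predicted: {predicted_iops:,.0f} IOPS, measured: {measured_iops:,} IOPS")
```

The prediction (about 35.7k IOPS) agrees with the measured 35.5k total to within about half a percent, which confirms the run was latency-bound at a fixed concurrency of 256.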

Next, NVMe over TCP under identical conditions. Looking again at reads: 104k IOPS, 407 MiB/s (427 MB/s) of throughput, and an average latency of about 1.7 ms.

[root@rdma21 ~]# fio --name=nvme-tcp --filename=/dev/nvme0n1 --rw=randrw --rwmixread=70 --direct=1 --invalidate=1 --ioengine=libaio --bs=4k --numjobs=256 --time_based --runtime=10 --group_reporting --iodepth=1
rdma: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
...
fio-3.7
Starting 256 processes
Jobs: 256 (f=256): [m(256)][100.0%][r=407MiB/s,w=175MiB/s][r=104k,w=44.8k IOPS][eta 00m:00s]
rdma: (groupid=0, jobs=256): err= 0: pid=2591: Fri Dec  6 22:14:10 2019
   read: IOPS=104k, BW=407MiB/s (427MB/s)(4074MiB/10008msec)
    slat (nsec): min=1461, max=16435k, avg=13431.75, stdev=55340.30
    clat (nsec): min=627, max=33919k, avg=1692241.44, stdev=1242439.66
     lat (usec): min=56, max=33964, avg=1705.99, stdev=1242.58
    clat percentiles (usec):
     |  1.00th=[  151],  5.00th=[  245], 10.00th=[  355], 20.00th=[  594],
     | 30.00th=[  857], 40.00th=[ 1156], 50.00th=[ 1467], 60.00th=[ 1778],
     | 70.00th=[ 2147], 80.00th=[ 2606], 90.00th=[ 3326], 95.00th=[ 4015],
     | 99.00th=[ 5604], 99.50th=[ 6325], 99.90th=[ 8225], 99.95th=[ 9241],
     | 99.99th=[14353]
   bw (  KiB/s): min= 1056, max= 3272, per=0.39%, avg=1625.09, stdev=240.45, samples=4870
   iops        : min=  264, max=  818, avg=406.27, stdev=60.12, samples=4870
  write: IOPS=44.7k, BW=175MiB/s (183MB/s)(1749MiB/10008msec)
    slat (nsec): min=1677, max=18010k, avg=14509.37, stdev=52819.88
    clat (nsec): min=486, max=47722k, avg=1714147.63, stdev=1264153.19
     lat (usec): min=53, max=47816, avg=1728.97, stdev=1264.08
    clat percentiles (usec):
     |  1.00th=[  135],  5.00th=[  233], 10.00th=[  347], 20.00th=[  594],
     | 30.00th=[  873], 40.00th=[ 1172], 50.00th=[ 1483], 60.00th=[ 1811],
     | 70.00th=[ 2180], 80.00th=[ 2638], 90.00th=[ 3359], 95.00th=[ 4047],
     | 99.00th=[ 5669], 99.50th=[ 6390], 99.90th=[ 8291], 99.95th=[ 9372],
     | 99.99th=[14484]
   bw (  KiB/s): min=  416, max= 1328, per=0.39%, avg=697.71, stdev=116.18, samples=4870
   iops        : min=  104, max=  332, avg=174.40, stdev=29.05, samples=4870
  lat (nsec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  lat (usec)   : 100=0.16%, 250=5.24%, 500=10.92%, 750=9.63%, 1000=8.69%
  lat (msec)   : 2=31.38%, 4=28.87%, 10=5.05%, 20=0.04%, 50=0.01%
  cpu          : usr=0.45%, sys=0.91%, ctx=1980857, majf=0, minf=3471
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=1042843,447857,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=4074MiB (4271MB), run=10008-10008msec
  WRITE: bw=175MiB/s (183MB/s), 175MiB/s-175MiB/s (183MB/s-183MB/s), io=1749MiB (1834MB), run=10008-10008msec

Disk stats (read/write):
  nvme0n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
# 
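Putting the two runs side by side makes the gap concrete. The ratios below are computed directly from the read-side figures fio reported above:

```python
# Ratio of NVMe over TCP to iSCSI on the read-side numbers fio reported.
iscsi    = {"iops": 24_800,  "bw_mib_s": 97.0,  "lat_us": 8172.73}
nvme_tcp = {"iops": 104_000, "bw_mib_s": 407.0, "lat_us": 1705.99}
for metric in iscsi:
    if metric == "lat_us":
        ratio = iscsi[metric] / nvme_tcp[metric]   # lower latency is better
    else:
        ratio = nvme_tcp[metric] / iscsi[metric]
    print(f"{metric}: {ratio:.1f}x in favour of NVMe over TCP")
```

Roughly 4.2x the IOPS and bandwidth, and 4.8x lower average read latency, from the same wire and the same backing device.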

NVMe over TCP is often said to underperform the established NVMe over Fabrics RDMA transport (RoCE), but compared with iSCSI it is dramatically faster.

Next, I will set up RDMA in the same environment and run the same comparison against NVMe over Fabrics (RoCE).