Poor performance with NVMe SRs


Tim LoGiudice

Question

We're seeing some really slow performance on NVMe SRs. We've got 12 Micron 9300 MAX 3.2TB NVMe drives across 4 servers (all dual EPYC 7542 machines with 256GB RAM); the servers have dedicated SATA DOMs for the hypervisor, which are not used as SRs. Running 8.2, up to date. The drives are capable of 3000+ MBps sequential read and write, but with a single VM on a drive and nothing else running on the host, the best we'll see is around 400 MBps. The drives are capable of ~800K read IOPS, but the best we see is 50K; ~200K write IOPS should be possible, but the best we see is 52K. It doesn't matter whether the guest is running Windows or Linux, or whether we run the tests with fio, bonnie, crystal, anvil, etc. PV drivers are installed and up to date. Newly created VMs behave the same as VMs imported from other pools. The results are consistent across all the servers, and the drives perform as expected if a native OS is booted on the machines.

Interestingly, if we run IO tests in 2, 3, even 4 VMs on the same drive, they each still return the same results as if they were running alone, so the underlying drive/hypervisor seems to be capable, and the limit appears to be on what a single guest is allowed to use.

 

Is there something obvious I'm missing that is killing our performance on these drives? I notice tapdisk using 100% of a CPU core on the host while running IO tests, which implies to me that the guest's throughput is somehow limited by the speed of a single CPU core. Is that just all there is to it, or is there a way to decouple those things? Or am I barking up the wrong tree entirely there?
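For anyone wanting to reproduce the observation, something like the following in dom0 shows per-thread tapdisk CPU usage while a test runs (a rough sketch; process names and PIDs will vary):

    # per-thread view of all tapdisk processes in dom0
    top -H -p $(pgrep -d, tapdisk)
    # per-domain CPU usage, to rule out overall dom0 saturation
    xentop -d 2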

 

Thanks in advance for any insights you may have.


12 answers to this question



What RAID configuration do you have? Also, if you run iostat, do you see any saturation of CPU or queue taking place? How many vCPUs are allocated to dom0? How are the hosts interfacing with the storage unit? If tapdisk is maxed out, you probably need either more vCPUs or a faster CPU, depending on whether the routine runs single-threaded or not.
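If it helps, a quick way to gather those numbers from dom0 (a rough sketch; the device name is an assumption and will differ on your hosts):

    # dom0 vCPU allocation (on a pool, vm-list returns one dom0 per host; pick the local one)
    xe vm-list is-control-domain=true params=name-label,VCPUs-max
    # extended per-device stats; watch the utilisation and queue-size columns for saturation
    iostat -xm 2 /dev/nvme0n1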

 

There are many possibilities that can lead to slower than anticipated performance, including limitations within XS/CH itself. You'll not see anything close to bare metal performance.


Since this is more of a performance problem, I am curious to know the answers to the following.

  • Have you ever seen better performance with Citrix Hypervisor and NVMe drives, or are you testing this for the first time?
  • Is this being compared with another hypervisor? (As in, what is the reference benchmark?)
  • Could you run the same fio test on dom0 as well and see what results you get? (One way to do that is sketched below.)
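For example, a dom0 baseline along these lines would give a reference point (a sketch only, assuming fio is available in dom0; the device path is a placeholder, and --readonly keeps the run non-destructive):

    fio --name=dom0-seqread --filename=/dev/nvme0n1 --readonly --rw=read --bs=4M \
        --direct=1 --ioengine=libaio --iodepth=32 --runtime=60 --time_based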
On 12/21/2021 at 4:22 PM, Tobias Kreidl said:

What RAID configuration do you have? Also, if you run iostat, do you see any saturation of CPU or queue taking place? How many vCPUs are allocated to dom0? How are the hosts interfacing with the storage unit? If tapdisk is maxed out, you probably need either more vCPUs or a faster CPU, depending on whether the routine runs single-threaded or not.

 

There are many possibilities that can lead to slower than anticipated performance, including limitations within XS/CH itself. You'll not see anything close to bare metal performance.

 

There is no RAID; these are just standalone NVMe U.2 drives connected directly to the PCIe bus on each host. dom0 has 16 vCPUs. Tapdisk hits 100% during any test (though dom0 CPUs overall are 92% idle, so it looks like it must be single-threaded; iowait also remains at 0 throughout), and the device queue will hit 95% on a 16+ thread 4K random read but is generally much lower, 30-60%.

A faster CPU would make sense to me here, but we actually have another pool with single Xeon 6130s (so half as many CPUs, half as many cores per CPU, and a lower single-core speed). I connected one of these drives to one of those machines for testing, and even with the machine under load, with 10 VMs running, mostly doing code builds, I see 2x the performance on the same drive (e.g. 1200 MBps sequential read, where the same drive connected to the other pool struggles to maintain 400 MBps on an otherwise idle host). The pools are running the same version of the hypervisor.

 

On 12/21/2021 at 5:35 PM, Alan Lantz said:

I wonder whether having the hypervisor on traditional drives rather than SATA DOMs would address the performance.

 

--Alan--

 

I am very curious why that might be the case. Does the hypervisor use its local storage as part of the IO pipeline for independent SRs? I'd expect to see some evidence of that in iostat, unless it's masked for some reason.

 

On 12/22/2021 at 3:56 AM, Boby John1709155536 said:

Since this is more of a performance problem, I am curious to know the answers to the following.

  • Have you ever seen better performance with Citrix Hypervisor and NVMe drives, or are you testing this for the first time?
  • Is this being compared with another hypervisor? (As in, what is the reference benchmark?)
  • Could you run the same fio test on dom0 as well and see what results you get?
  • I've seen better performance with Citrix Hypervisor in the past with SATA SSDs than we're seeing right now with these NVMe SSDs on these particular hosts.  As noted above, these NVMe SSDs are also showing higher performance when connected to our older/slower hosts.
  • The reference is the same hypervisor on older/slower hardware. 
  • Yes.

 

 

 


A single drive will suffer, particularly with write performance, since the writes are not being spread over multiple drives. At some point the queue depth also imposes limitations. iostat should show you whether other forms of saturation are occurring, and the top and xentop utilities will show individual CPU usage stats, which are useful for spotting CPU saturation.
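For completeness, the queue limits dom0 exposes for the device can be read directly from sysfs (the device name is an assumption):

    cat /sys/block/nvme0n1/queue/nr_requests   # request queue depth the block layer allows
    cat /sys/block/nvme0n1/queue/scheduler     # active I/O scheduler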


fio results, 1GB test file.

dom0:

 

4MB sequential single thread

write: IOPS=750, BW=3000MiB/s (3146MB/s)(176GiB/60009msec)

read: IOPS=653, BW=2614MiB/s (2741MB/s)(153GiB/60013msec)

 

4K random single thread:

 write: IOPS=117k, BW=457MiB/s (479MB/s)(26.8GiB/60001msec)

 read: IOPS=108k, BW=421MiB/s (441MB/s)(24.7GiB/60001msec)

 

Linux guest on the same drive, same host, 8.2 tools installed, same vCPU count as dom0, more RAM:

 

4MB sequential single thread

write: IOPS=101, BW=385MiB/s (404MB/s)(23.1GiB/60001msec)

read: IOPS=142, BW=542MiB/s (569MB/s)(32.5GiB/60011msec)

 

4K random single thread:

write: IOPS=8108, BW=30.2MiB/s (31.7MB/s)(1812MiB/60001msec)

read: IOPS=4839, BW=18MiB/s (18.9MB/s)(1081MiB/60001msec)
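For reference, tests of this shape can be reproduced with invocations like the following (a sketch, not the exact commands used; the target path, I/O direction, and queue depth are assumptions):

    # 4MB sequential, single job, 60s, 1GB file
    fio --name=seq4m --filename=testfile --size=1G --bs=4M --rw=write --numjobs=1 \
        --direct=1 --ioengine=libaio --runtime=60 --time_based
    # 4K random, single job, 60s, 1GB file
    fio --name=rand4k --filename=testfile --size=1G --bs=4k --rw=randwrite --numjobs=1 \
        --direct=1 --ioengine=libaio --runtime=60 --time_based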

5 minutes ago, Tobias Kreidl said:

That's not what I consider bad for a single drive. To get reasonable SSD performance I found you need at least four or more individual drives, at least in the environment I ran them in (NexentaStor). Even then, writes were limited to around 20-30 msec latency. Reads were down in the tens of nanoseconds range.

 

-=Tobias

 

3 GBps from dom0 vs 0.385 GBps, or 117K IOPS vs barely 8K on the same operation, from a guest with exclusive use of the entire host seems extreme to me. On our older pool, the same drive can hit 1.2 GBps on the same test in a similarly configured guest competing with 9 other live VMs. A >50% performance hit still seems high to me, but that would be a vast improvement over what we're seeing here. I definitely expect less than bare-metal performance, but I would not expect to see only 10-12% of bare metal. I don't see that kind of hit on CPU performance, GPU performance, or network IO.

 

I could see how additional drives might help if the guest were sharing the drive with other VMs, but that doesn't seem to apply here: running a test simultaneously on 4 VMs sharing the same disk doesn't lower the individual VM performance at all, so the physical drive is clearly able to service all of those requests, and something is limiting the performance of the individual VMs.

 

 


Honestly, I never got close to bare metal I/O performance under XS/CH. Some of it can be addressed with some tuning, but not all. Queue depths and many other factors contribute. I have not run XS in production since 7.6 and hence cannot speak to the newer releases. About half of bare metal performance is probably about the best you can hope for.

 

It'd be interesting to see whether you get similar stats under XCP-ng.

 

-=Tobias


This is a normal performance range; the real bottleneck here is SMAPIv1 (the storage stack), with one disk per process (tapdisk). That process is hitting 100% of one dom0 CPU, so you can't go faster. However, this limitation only applies per disk, so it will scale with more disks in total (in the end, with multiple disks and VMs, you'll reach your bare-metal storage performance).

 

However, for one disk only, you can't do much more right now. There are multiple possibilities:

  • add more disks to the VM and create a RAID0 inside; you should immediately see more performance (not ideal, but it demonstrates my initial claim; see the sketch after this list)
  • wait for the new storage stack (SMAPIv3), which will help bring higher-performance "drivers"
  • here in the XCP-ng project, we are doing R&D to bring SPDK to the storage stack, to drastically improve per-drive performance. It's still in the R&D phase, however.
  • we are also working on DPUs, passing NVMe "slices" directly to the guest so you get bare-metal performance (see https://xcp-ng.org/blog/2021/12/20/dpu-for-storage-a-first-look/ ). This is also R&D work, but we expect a real PoC in 2022.
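As a rough illustration of the first option (a sketch only; the UUIDs, size, and guest device names are placeholders):

    # in dom0: create extra VDIs on the NVMe SR and attach them to the VM
    xe vdi-create sr-uuid=<SR-UUID> name-label=stripe-1 virtual-size=200GiB
    xe vbd-create vm-uuid=<VM-UUID> vdi-uuid=<VDI-UUID> device=1 mode=RW type=Disk
    xe vbd-plug uuid=<VBD-UUID>
    # repeat for three more VDIs (each gets its own tapdisk process), then inside the guest:
    mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde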

 

We'd love to work in closer collaboration with Citrix on those topics (so we can go faster together), but so far I assume there are other priorities for them.

3 hours ago, Olivier Lambert said:

This is a normal performance range; the real bottleneck here is SMAPIv1 (the storage stack), with one disk per process (tapdisk). That process is hitting 100% of one dom0 CPU, so you can't go faster. However, this limitation only applies per disk, so it will scale with more disks in total (in the end, with multiple disks and VMs, you'll reach your bare-metal storage performance).

 

However, for one disk only, you can't do much more right now. There are multiple possibilities:

  • add more disks to the VM and create a RAID0 inside; you should immediately see more performance (not ideal, but it demonstrates my initial claim)
  • wait for the new storage stack (SMAPIv3), which will help bring higher-performance "drivers"
  • here in the XCP-ng project, we are doing R&D to bring SPDK to the storage stack, to drastically improve per-drive performance. It's still in the R&D phase, however.
  • we are also working on DPUs, passing NVMe "slices" directly to the guest so you get bare-metal performance (see https://xcp-ng.org/blog/2021/12/20/dpu-for-storage-a-first-look/ ). This is also R&D work, but we expect a real PoC in 2022.

 

We'd love to work in closer collaboration with Citrix on those topics (so we can go faster together), but so far I assume there are other priorities for them.

As I said -- and in full agreement with Olivier -- the only way to currently get more IOPS is with more disks in a RAID configuration, because of the queue depth limitation. The spindle count will make a huge difference, especially with writes, as at least in some cases cache can help reads. From my own experience, I found 4 disks to be pretty much the minimum; I had good results with non-SSD drives, but it took around 20 disks in a RAID20 configuration!

SMAPIv3 will make a big difference, as will the great research you're doing, Olivier, with SPDK and DPUs!

-=Tobias

27 minutes ago, Tobias Kreidl said:

As I said -- and in full agreement with Olivier -- the only way to currently get more IOPS is with more disks in a RAID configuration, because of the queue depth limitation. The spindle count will make a huge difference, especially with writes, as at least in some cases cache can help reads. From my own experience, I found 4 disks to be pretty much the minimum; I had good results with non-SSD drives, but it took around 20 disks in a RAID20 configuration!

SMAPIv3 will make a big difference, as will the great research you're doing, Olivier, with SPDK and DPUs!

-=Tobias

 

Thanks Tobias. There's a lot of innovation to do in that area indeed. I truly hope we get visible results in 2022!
