
Changing or viewing NUMA allocation on a running VM


Dan Q

Question

I'm running some performance tests to understand how Xen handles NUMA allocation.  The brief summary of the results: using the various sysbench memory benchmarks, I've demonstrated that memory performance drops significantly (~30%) if the vCPUs are left unpinned, or if I pin half of a VM's vCPUs to each of the two physical CPUs.  That all makes sense.  The result that doesn't make sense to me is that if I run the test with all vCPUs pinned to one physical CPU and then switch them all to be pinned to the other physical CPU, I see the same memory benchmark performance.

 

The system under test has two 20-core (40 hyperthreaded) Intel 6230 processors.  For the test I created two VMs, each allocated 40 vCPUs, and I ran the sysbench tests on both VMs simultaneously in order to exercise all cores and NUMA channels at once.

 

In the last test my expectation was that I would see worse performance than in the 50/50 split test, but the performance was the same as in my initial test.  I also did a test where each VM's vCPUs stayed pinned to a single physical CPU, but every 2 seconds I swapped which CPU each VM was pinned to (each sysbench call used 10-second runs, so I would expect 4 swaps during each test).  Still, it ran at the same rate as the first test, where each VM's vCPUs stayed pinned to its own physical CPU for the whole run.
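The dom0 side of a swap test like that can be scripted roughly as follows. This is a minimal sketch, not the harness actually used here; it assumes the xl toolstack, guests named "vm1" and "vm2", and that pCPUs 0-19 / 20-39 sit on sockets 0 / 1 (check `xl info -n` for the real layout). sysbench itself runs inside the guests and is started separately, e.g. something like `sysbench --threads=40 --time=10 memory run` with sysbench 1.0.

#!/usr/bin/env python3
"""Minimal sketch of the pin-and-swap step (dom0 side only).

Assumptions, not taken from the post: the xl toolstack is in use, the two
guests are named "vm1" and "vm2", and pCPUs 0-19 / 20-39 belong to
sockets 0 / 1 respectively (verify with `xl info -n`). sysbench runs
inside the guests and is started separately.
"""
import subprocess
import time

SOCKET0 = "0-19"   # pCPUs assumed to be on NUMA node 0
SOCKET1 = "20-39"  # pCPUs assumed to be on NUMA node 1


def pin_all_vcpus(domain, pcpus):
    """Hard-pin every vCPU of `domain` to the given pCPU range."""
    subprocess.run(["xl", "vcpu-pin", domain, "all", pcpus], check=True)


def swap_test(duration_s=10, swap_every_s=2):
    """Swap which socket each VM is pinned to every `swap_every_s` seconds."""
    placements = [("vm1", SOCKET0), ("vm2", SOCKET1)]
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        for domain, pcpus in placements:
            pin_all_vcpus(domain, pcpus)
        time.sleep(swap_every_s)
        # Flip the assignment so each VM lands on the other socket next round.
        placements = [(placements[0][0], placements[1][1]),
                      (placements[1][0], placements[0][1])]


if __name__ == "__main__":
    swap_test()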

 

This is leading me to wonder how memory is allocated to VMs.  My original goal with this test was to see if I could cause a VM's memory to be moved to a different NUMA channel by changing the vCPU pinning.  I thought that this would require a shutdown of the VM to release the memory from use, but that doesn't seem to be the case.  Our VMs use fixed memory allocation.  The only explanation I can come up with is that reserved physical memory is allocated to a VM on demand, when something inside the VM tries to allocate memory, and that sysbench allocates and releases memory frequently, so when the pinning swap happened it basically had no effect.
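One way to sanity-check where a domain's pages actually sit, per NUMA node, is to have Xen dump its NUMA info and then read the hypervisor log from dom0. The sketch below assumes the xl toolstack and the 'u' debug key described in the Xen on NUMA Machines wiki; the exact output format varies between Xen versions, so it only prints the relevant lines rather than parsing them.

#!/usr/bin/env python3
"""Sketch: ask Xen to dump per-node memory info and show it from dom0.

Assumes the xl toolstack and that the hypervisor's 'u' debug key dumps
NUMA/heap information (including per-domain page counts per node) into
the console ring; output format differs between Xen versions.
"""
import subprocess


def dump_numa_info():
    subprocess.run(["xl", "debug-keys", "u"], check=True)
    result = subprocess.run(["xl", "dmesg"], check=True,
                            capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    for line in dump_numa_info().splitlines():
        if "node" in line.lower() or "dom" in line.lower():
            print(line)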

 

I'm going to try another test now using a JVM which will reserve a large block of memory on startup to see if I see the same results.
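If the goal is a heap that is fully committed and touched before the benchmark starts, the standard HotSpot flags for that are -Xms equal to -Xmx plus -XX:+AlwaysPreTouch. A rough launcher sketch follows; the heap size and the "MemBench" class are placeholders, not taken from the post.

#!/usr/bin/env python3
"""Sketch: launch the follow-up JVM test with the whole heap touched up front.

Assumptions: a HotSpot JVM inside the guest, and "MemBench" as a placeholder
benchmark class. -Xms equal to -Xmx sizes the heap at startup, and
-XX:+AlwaysPreTouch makes the JVM touch every heap page immediately, so the
memory backing the heap is faulted in before any measurement starts.
"""
import subprocess

HEAP = "32g"  # example value only; pick something that fits inside the VM

subprocess.run([
    "java",
    "-Xms" + HEAP, "-Xmx" + HEAP,  # fixed-size heap, reserved at startup
    "-XX:+AlwaysPreTouch",         # touch every heap page during JVM init
    "MemBench",                    # placeholder benchmark entry point
], check=True)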

 

Can someone shed more light on how I can optimize for NUMA allocation after a VM has been created?  Or am I right about how sysbench works, and we only need to pin before the application under test allocates its memory?


6 answers to this question



Crossing a NUMA/vNUMA boundary can seriously degrade performance, as can allocating more memory to the VM than is associated with a single socket.

There are a number of points that can be fine-tuned. Experimentation is highly encouraged!
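For example, before deciding how to pin, it helps to confirm which pCPUs and how much memory actually belong to each node. Here is a small sketch using the xl toolstack; if you manage the hosts purely through XenServer's xe CLI, the equivalent information is exposed differently.

#!/usr/bin/env python3
"""Sketch: show the host's NUMA layout before choosing a pinning scheme.

Assumes the xl toolstack is available in dom0; `xl info -n` appends the
NUMA topology (cpu -> core/socket/node table and per-node memory) to the
normal host info, which is printed verbatim here.
"""
import subprocess

out = subprocess.run(["xl", "info", "-n"], check=True,
                     capture_output=True, text=True).stdout
print(out)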

 

See also this series of articles that may be of some use:

https://www.mycugc.org/blogs/tobias-kreidl/2019/03/07/tale-of-two-servers-bios-settings-affect-apps-gpu

https://www.mycugc.org/blogs/tobias-kreidl/2019/04/30/a-tale-of-two-servers-part-2

https://www.mycugc.org/blogs/tobias-kreidl/2019/04/30/a-tale-of-two-servers-part-3

 

-=Tobias


Hi,

 

I'm not 100% sure about how it works on Xen, but on KVM, there is a thing called Automatic NUMA balancing, which will periodically check if memory is optimally placed and move it in chunks if necessary. Here's a slide deck on that:

https://www.linux-kvm.org/images/7/75/01x07b-NumaAutobalancing.pdf
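For what it's worth, on a Linux/KVM host you can check whether that balancing is currently active via the kernel.numa_balancing sysctl; a small sketch:

#!/usr/bin/env python3
"""Sketch: check whether automatic NUMA balancing is enabled on a KVM host.

The kernel exposes it as the sysctl kernel.numa_balancing, i.e. the file
/proc/sys/kernel/numa_balancing (1 = enabled, 0 = disabled). This applies
to the Linux host kernel under KVM, not to the Xen hypervisor.
"""
from pathlib import Path

SYSCTL = Path("/proc/sys/kernel/numa_balancing")

if SYSCTL.exists():
    enabled = SYSCTL.read_text().strip() == "1"
    print("automatic NUMA balancing:", "enabled" if enabled else "disabled")
else:
    print("kernel.numa_balancing not available on this kernel")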

 

Here's an interesting read on NUMA on Xen:

https://wiki.xen.org/wiki/Xen_on_NUMA_Machines

 

By the way, I don't know if you have tested CPU operations/speed, but I tend to see that assigning and pinning more vCPUs to a VM than there are physical cores - so essentially also assigning and pinning the hyperthreads - negatively affects performance if you start hammering all vCPUs at 100%.
I'm also not sure whether pinning all vCPUs is the best approach, but I guess it depends on your specific use case.


Best,

 

Koenraad

2 hours ago, Koenraad Willems said:

By the way, I don't know if you have tested CPU operations/speed, but I tend to see that assigning and pinning more vCPUs to a VM than there are physical cores - so essentially also assigning and pinning the hyperthreads - negatively affects performance if you start hammering all vCPUs at 100%.
I'm also not sure whether pinning all vCPUs is the best approach, but I guess it depends on your specific use case.

 

Thanks for the links!  I hadn't seen the Redhat one.  I'll need to dig deeper into the topics in it.

 

One of the big surprises we found was a 30-40% increase in the memory tests and up to a 68% increase in the CPU tests in sysbench as we went beyond the physical core count of 40 all the way up to 80 vCPUs.  We really didn't expect tightly looped prime number calculations to benefit from hyperthreading at all.  We saw almost no difference in the CPU test from pinning the vCPUs, but we got a 30% increase in the memory tests by pinning.  Also quite interesting: when I did a split-NUMA test, that is, I assigned half of the vCPUs to each processor, the performance was about the same as without any pinning at all.

 

We're going to apply pinning to some application workloads soon to see if it makes a difference.  My hope is that we'll see a nice increase due to the NUMA alignment. 


Hi,

 

You're welcome for the links. There is a lot of info out there - very cool stuff, too little time.
 

Quote

One of the big surprises we found was a 30-40% increase in the memory tests and up to a 68% increase in the CPU tests in sysbench as we went beyond the physical core count of 40 all the way up to 80 vCPUs.  We really didn't expect tightly looped prime number calculations to benefit from hyperthreading at all.  We saw almost no difference in the CPU test from pinning the vCPUs, but we got a 30% increase in the memory tests by pinning.  Also quite interesting: when I did a split-NUMA test, that is, I assigned half of the vCPUs to each processor, the performance was about the same as without any pinning at all.

 

That's interesting. On physical workloads, hyper-threading should add about a 20% performance increase, but on virtual workloads I've not seen this consistently. It could also be a difference between real-world workloads and (synthetic) benchmarks.
I do have to add that we run on KVM, so there might be differences, and there are also a lot of optimisations you need to do yourself there.
 

Something interesting to look into as well is Turbo Speed. I don't know how well Xen leverages that, but there are a lot of optimisations that can be done there too.

 

Good luck with the tests.

 

Best,

 

Koenraad


It looks like the reason we saw continued performance gains on the prime number test of sysbench is that the CPU has 4 arithmetic logic units (ALUs) per core, so it's very capable of utilizing hyperthreading for integer-math-heavy workloads.  See section 2.8 and figure 2-2 of https://software.intel.com/en-us/download/intel-64-and-ia-32-architectures-optimization-reference-manual

