KVM - NUMA Declaration

Computers can have more than one system bus, which allows them to have multiple processors (sockets) in a single system.

This allows servers to achieve higher compute capability at a better price-to-performance ratio, or simply to exceed the compute limits of a single-socket machine.

Such systems are NUMA (Non-Uniform Memory Access) systems. Each CPU has its own block of memory that it can access at low latency. However, sometimes a CPU will need to access memory attached to another socket. This is called remote memory access, and the high latency it incurs can leave processors under-utilized.

KVM and NUMA

Such systems can be great as KVM hosts, as they allow you to host more guests on a single machine. Better yet, KVM can be configured to take the NUMA topology into account to improve performance for its guests. Since remote memory access causes high latency, you probably want to ensure that a guest is isolated to one CPU socket and its local memory, whilst spreading the guests out across the NUMA nodes. Thus on a quad-socket system, you could have 4 guests that never affect each other in any way, each running on its own CPUs and memory. For details on how you can use cputune and numatune declarations to optimize the performance of your guests, read this.
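
For example, here is a minimal sketch of pinning a 4-vCPU guest to a single host NUMA node. It assumes host CPUs 0-3 and their local memory belong to host node 0, so adjust the cpuset and nodeset values to match your own topology:

<domain type="kvm">
    ...
    <vcpu placement="static">4</vcpu>
    <cputune>
        <!-- Pin each guest vCPU to a host CPU on the same socket/node -->
        <vcpupin vcpu="0" cpuset="0"/>
        <vcpupin vcpu="1" cpuset="1"/>
        <vcpupin vcpu="2" cpuset="2"/>
        <vcpupin vcpu="3" cpuset="3"/>
    </cputune>
    <numatune>
        <!-- Only allocate the guest's memory from host NUMA node 0 -->
        <memory mode="strict" nodeset="0"/>
    </numatune>
    ...
</domain>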

The NUMA Declaration

Now I will go on to explain how to use the numa declaration inside the cpu block to set up a NUMA topology inside the guest (i.e. what the guest thinks its NUMA topology is, rather than how the guest runs on the host's NUMA topology). This can be necessary if a guest is granted access to multiple CPU sockets, or if you want to enable hotplugging of memory.

Here is an example XML configuration to set the NUMA topology of a guest:

<cpu>
    <numa>
        <cell id="0" cpus="0-1" memory="3" unit="GiB"/>
        <cell id="1" cpus="2-3" memory="3" unit="GiB"/>
    </numa>
</cpu>

This tells the guest that it has 2 memory buses, each with 3 GiB of memory, and that cores 0-1 have low-latency access to one of the 3 GiB sets, whilst cores 2-3 have low-latency access to the other set. Applications that can be optimized for NUMA will be able to take this into account so that they try to limit the number of remote memory calls they make. You probably want this to reflect your host's topology for optimal performance. E.g. don't create NUMA nodes just for the heck of it, and if you have more than one node, make sure the CPU and memory assignments map appropriately onto the host's. I probably wouldn't even bother giving a guest access to more than one node.
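
As an aside, if your reason for declaring a NUMA topology is memory hotplugging rather than performance, you will also want a maxMemory element with a slots attribute. The sketch below is illustrative only (a single node with made-up sizes), so adapt the values to your guest:

<domain type="kvm">
    ...
    <!-- Illustrative values: 6 GiB now, hot-pluggable up to 16 GiB across 8 slots -->
    <maxMemory slots="8" unit="GiB">16</maxMemory>
    <memory unit="GiB">6</memory>
    <currentMemory unit="GiB">6</currentMemory>
    <cpu>
        <numa>
            <cell id="0" cpus="0-3" memory="6" unit="GiB"/>
        </numa>
    </cpu>
    ...
</domain>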

How Many NUMA Nodes Do I Have?

Most consumer and entry-level server hardware will only have 1 NUMA cell, so you probably don't want to have more than one cell declaration. You can find out how many NUMA cells you have by running lscpu. For example, for my KVM host I get the following:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 94
Model name:            Intel(R) Xeon(R) CPU E3-1220 v5 @ 3.00GHz
Stepping:              3
CPU MHz:               799.987
CPU max MHz:           3500.0000
CPU min MHz:           800.0000
BogoMIPS:              6000.00
Virtualisation:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-3
...

As you can see, it explicitly states:

NUMA node(s):          1

If you want to fetch statistics about your NUMA nodes, such as how often they are making remote memory calls, then look into numastat.
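
For example, something along these lines gives a quick overview (numastat typically ships with the numactl package, and the "qemu" pattern below is just a placeholder for whatever your QEMU processes are named):

# Install the tool on Debian/Ubuntu based systems
sudo apt install numactl

# Show per-node allocation counters, including numa_miss (remote allocations)
numastat

# Show a per-process breakdown for processes matching "qemu"
sudo numastat -p qemu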

Last updated: 16th September 2021
First published: 16th August 2018