AMD EPYC 7313P Energy Consumption Test

April 29, 2022

Introduction

After building a too-powerful-for-home server, I wondered how I could optimize its power consumption. There is not much I can do about the mainboard, storage and RAM, so I focus only on the CPU. When I say EPYC in this post, I mean AMD EPYC 7313P, but I think the basics should be the same for all EPYC processors.

AMD EPYC 7313P (EPYC Milan series) has 4 Core Complex Dies (CCD), where each CCD has one Core Complex (CCX). Each CCX has 4 cores (so in total 4x4=16 cores), and each core has 2 threads (so in total 16x2=32 threads). The CPU base clock is 3 GHz, and the boost clock is 3.7 GHz. The boost clock can only be used, I think, if the actual power stays under the design limit (TDP). It has a thermal design power (TDP) figure of 155 W. In this post, I use the terms CCX, core and thread in this hardware sense.

I have a few basic questions to answer:

  • What is the minimum power use when the CPU is almost 100% idle?
  • What are the C-states and P-states?
  • How does using one thread, multiple threads, one core and multiple cores, in the same CCX and in different CCXs, change the power use?
  • What is the behavior of the different cpufreq governors?

It might be useful to clarify something first. Energy (Joule or Watt-hour) is what a system uses to do work. A CPU uses energy to compute. Power (Watt) is the rate at which energy is used. You pay for the energy consumed (in kWh), not for power, but devices specify their power, the rate at which they consume energy. So if a system has a constant power (let's say 10 W), power multiplied by time is the energy used (a 10 W device running for 1 hour uses 10 Wh). However, like many devices, the power of a processor is not fixed, and that is the point of power saving.
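As a small illustration of the relation between power and energy, here is a minimal Python sketch. The function name `energy_wh` and the sample values are made up for this example; it only assumes that power is sampled at a fixed interval.

```python
def energy_wh(power_samples_w, interval_s):
    """Integrate power samples (W), taken every interval_s seconds, into Wh."""
    joules = sum(p * interval_s for p in power_samples_w)  # 1 J = 1 W x 1 s
    return joules / 3600.0  # 1 Wh = 3600 J


# A constant 10 W for one hour, sampled once per second.
constant = energy_wh([10.0] * 3600, 1.0)

# Hypothetical varying load: half an hour at 10 W, half an hour at 30 W.
varying = energy_wh([10.0] * 1800 + [30.0] * 1800, 1.0)

print(f"constant: {constant:.1f} Wh, varying: {varying:.1f} Wh")
```

The second case shows why the average power over time, not the peak figure, decides the energy bill.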

Setup

In order to have maximum power saving and efficiency, I am using the following BIOS options:

  • APBDIS = 0: dynamically switch Infinity Fabric P-state based on link use
  • DF C-States = Enabled: allow Infinity Fabric to go to low-power
  • Global C-State Control = Auto: enables C2 state
  • cTDP and Package Power Limit Control = Auto: default power limits
  • Core Performance Boost = Auto: enables boost frequency when possible

Particularly the first two options, maybe also the third one, can diminish the performance under load, so they are usually recommended to have opposite values (APBDIS=1, DF C-States=Disabled, Global C-State Control=Disabled) for performance oriented setups.

In order to have a reasonably repeatable test, I am running an Ubuntu 22.04 Desktop Live image.

I have NPS=4, so NUMA topology reflects the underlying CPU hardware architecture:

$ numactl -H

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 16 17 18 19
node 0 size: 32049 MB
node 0 free: 30719 MB
node 1 cpus: 4 5 6 7 20 21 22 23
node 1 size: 32239 MB
node 1 free: 31061 MB
node 2 cpus: 8 9 10 11 24 25 26 27
node 2 size: 32206 MB
node 2 free: 30338 MB
node 3 cpus: 12 13 14 15 28 29 30 31
node 3 size: 32217 MB
node 3 free: 31253 MB
node distances:
node   0   1   2   3
  0:  10  12  12  12
  1:  12  10  12  12
  2:  12  12  10  12
  3:  12  12  12  10

Each node above corresponds to a CCD/CCX, where cpu #X and cpu #(X+16) run on the same physical core, so cpu 0 and cpu 16 are on the same physical core. This can also be verified in the turbostat output.
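The numbering in the numactl output can be written down as a few helper functions. This is only a sketch of the layout observed on this system with NPS=4 and SMT enabled; the function names are my own, and other systems may number their cpus differently.

```python
THREADS_PER_CORE = 2
CORES_PER_CCX = 4
NUM_CCX = 4
NUM_CORES = CORES_PER_CCX * NUM_CCX  # 16 physical cores


def physical_core(cpu):
    """Physical core index (0-15) of a logical cpu (0-31)."""
    return cpu % NUM_CORES


def ccx(cpu):
    """CCX / NUMA node index (0-3) of a logical cpu."""
    return physical_core(cpu) // CORES_PER_CCX


def sibling(cpu):
    """The other logical cpu sharing the same physical core."""
    return (cpu + NUM_CORES) % (NUM_CORES * THREADS_PER_CORE)


def node_cpus(node):
    """Logical cpus of a node, in the order numactl -H lists them."""
    first = node * CORES_PER_CCX
    cores = list(range(first, first + CORES_PER_CCX))
    return cores + [c + NUM_CORES for c in cores]


for node in range(NUM_CCX):
    print(f"node {node} cpus:", *node_cpus(node))
```

For example, `sibling(0)` is 16 and `ccx(31)` is 3, which matches the output above.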

I am using turbostat and cpupower utilities for power management and monitoring, and stress utility to stress the cpu. I use stress with taskset to set the cpu affinity of stress, so stress only runs on the cpus/threads I specify.

How Energy Use is Measured

On EPYC (and on many recent processors), there are two (MSR) registers:

  • Core Energy Status CORE_ENERGY_STAT
  • Package Energy Status PKG_ENERGY_STAT

These continuously reflect the energy (not power) use of each core and the package. The unit of this register (what does one increment, a change in LSB, mean physically) is also given in another register: RAPL Power Unit RAPL_PWR_UNIT.

On EPYC, the unit is ~15.3 microjoules (1/2^16 J = 1/65536 J), and 1 J is 1 Ws (Watt x second). The MSR registers increase continuously and wrap around to zero, which happens quickly when the registers are 32-bit. On EPYC, it seems they are 64-bit, so an overflow is practically impossible. The difference between two readings of an energy status register, scaled by the unit (in Joule) and divided by the time between the readings (in seconds), gives the power in Watt (Joule=Watt x second, Watt x second/second=Watt).
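The calculation can be sketched in Python as below. It does not read any MSR; the raw values are made up. I am assuming that the energy status unit is stored in bits 12:8 of RAPL_PWR_UNIT, which is how I read the amd_energy driver, so treat the bit positions as an assumption and check them against the PPR.

```python
def energy_unit_joules(rapl_pwr_unit):
    """Joules per counter increment, from a raw RAPL_PWR_UNIT value.

    Assumption: the energy status unit (ESU) is in bits 12:8 and the
    unit is 1/2^ESU Joule.
    """
    esu = (rapl_pwr_unit >> 8) & 0x1F
    return 1.0 / (1 << esu)


def average_power_w(raw_before, raw_after, dt_s, unit_j, counter_bits=64):
    """Average power between two raw energy readings taken dt_s apart."""
    # The modulo keeps the difference correct if the counter wrapped once.
    delta = (raw_after - raw_before) % (1 << counter_bits)
    return delta * unit_j / dt_s


# Hypothetical register value with ESU = 16, as on this EPYC.
unit = energy_unit_joules(16 << 8)
print(f"unit: {unit * 1e6:.2f} microjoule")

# Hypothetical package readings taken one second apart.
power = average_power_w(1_000_000, 1_000_000 + 2_293_760, 1.0, unit)
print(f"power: {power:.1f} W")
```

With a 32-bit counter the same function can be called with `counter_bits=32`, as long as the counter wraps at most once between the two readings.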

In case it is not clear, the package power is not the same as the sum of its cores, since the package also contains things like memory controllers, PCIe controllers, infinity fabric etc.

The core energy status register is per physical core. When SMT (simultaneous multithreading) is enabled (as here), Linux identifies each thread as a cpu, so the two logical cores within the same physical core report the same value. turbostat shows only one value for each physical core. The package power is called PkgWatt and the power of a physical core is called CorWatt in turbostat.

The best reference for this information is the repository of amd_energy.

The reference of EPYC MSR registers is Preliminary Processor Programming Reference for AMD Family 19h.

C-States

Let's look at the available C-states:

$ cpupower idle-info

CPUidle driver: acpi_idle
CPUidle governor: menu
analyzing CPU 0:

Number of idle states: 3
Available idle states: POLL C1 C2
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
Usage: 93
Duration: 3919
C1:
Flags/Description: ACPI FFH MWAIT 0x0
Latency: 1
Usage: 5568
Duration: 990283
C2:
Flags/Description: ACPI IOPORT 0x814
Latency: 30
Usage: 23359
Duration: 845218748

There are three C-states:

  • C0: operational/active state
  • C1: idle
  • C2: idle and power gated, deep sleep

C0 is the normal operating state, whereas C1 and particularly C2 are power-save/idle states. C2 consumes less power than C1 but waking a core up from C2 takes more time than C1, so there is a performance penalty.

There is no particular information for EPYC, but at C1, there should be no execution but the clock would be still running, whereas at C2, the clock would also be stopped, hence it is called deep sleep.

As reported above, this system uses acpi_idle driver and menu governor. There is usually no need to modify anything here and there are limited options. The governor decides when to put a core to C1 and C2 and when to wake it up.

AFAIK from desktop Intel processors, the idle states of threads, cores and the package are different, in the sense that when, let's say, all threads in a core are in an idle state, that core can also be put into an idle state, which saves more power than idling the threads alone. It seems EPYC is simpler in this respect: there are only C0, C1 and C2, and these are visible and settable for each thread.

P-states

P-States are not idle states, but to save power, the clock frequency can be reduced. So a thread can run not at its full capacity but consumes less power. Again from the desktop processors, I normally think of P-States as many fine grained (frequency) values, but it seems for EPYC there are a very small number of P-States, only three.

$ cpupower frequency-info

analyzing CPU 0:
  driver: acpi-cpufreq
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 1.50 GHz - 3.73 GHz
  available frequency steps:  3.00 GHz, 2.20 GHz, 1.50 GHz
  available cpufreq governors: conservative ondemand userspace powersave performance schedutil
  current policy: frequency should be within 1.50 GHz and 3.00 GHz.
                  The governor "schedutil" may decide which speed to use
                  within this range.
  current CPU frequency: 1.50 GHz (asserted by call to hardware)
  boost state support:
    Supported: yes
    Active: yes
    Boost States: 0
    Total States: 3
    Pstate-P0:  3000MHz
    Pstate-P1:  2200MHz
    Pstate-P2:  1500MHz

The above output is only for cpu (thread) 0, but it is the same for all of them. So there are only three P-States: P0, P1 and P2.

Because there are no independent Boost States, I think the boosted state is part of P0, so 3000 MHz actually means it can be up to 3700 MHz when possible when boost is enabled (/sys/devices/system/cpu/cpufreq/boost).

P-State/cpufreq Governors

As reported above, the current governor is schedutil. I did not select this, it seems it is the default. The governors basically operate like this:

  • conservative and ondemand are similar to schedutil, they all set the frequency (or select the P-state) based on the load.
  • performance sets the frequency to maximum (3.00 GHz), or selects P0. If boost is enabled, this can go up to 3.73 GHz.
  • powersave sets the frequency to minimum (1.50 GHz), or selects the last P-state, P2 here.
  • userspace lets the user set the frequency, but not freely, only to one of the available frequency steps/P-states.

The userspace governor is easy to observe since it does not change the frequency dynamically, so I will use that. The governor and the frequency are set like this, and it is possible to do this per cpu (thread) with the -c X option.

$ cpupower frequency-set -g userspace
$ cpupower frequency-set -f 1500MHz

The current status of threads can be verified with cpupower monitor and turbostat.

Minimum Power

The minimum power is when all cores are in C2. They cannot stay all the time/100% at C2, otherwise there would be nothing running, so they should be as much as possible at C2 (e.g. >99%) and when they run, they should run at the lowest P-state P2.

governor   frequency
userspace  1500 MHz

turbostat reports:

  • PkgWatt: 35 W
  • CorWatt: all cores are close to 0 W

So I assume this is the minimum that can be reached, 35 W.

Stressing One Logical Core @ 1.5GHz

governor   frequency
userspace  1500 MHz

$ taskset -c 31 stress --cpu 1
  • PkgWatt: 53 W
  • CorWatt: all cores except core #31 are close to 0 W. Core #31 consumes 0.63 W.

That is a pretty big jump from 35 W to 53 W for just one core consuming 0.63 W. It is because many shared resources must also be active when a thread is running.

I wonder what happens when all cores are stressed.

Stressing All Cores @ 1.5GHz

governor   frequency
userspace  1500 MHz

$ taskset -c 0-31 stress --cpu 32
  • PkgWatt: 65 W
  • CorWatt: each core uses 0.85 W. All cores consume ~13 W.

There is only a small difference, so a core actually does not consume much power compared to whatever is going on in the rest of the package.

Another interesting thing is that when one logical core of a physical core is used, the core consumes 0.63 W, and when both logical cores are used, it consumes 0.85 W. So obviously there is a large shared part that consumes power in a single core.

Now I wonder what happens at 3 GHz, because the TDP is 155 W, but the package is only at 65 W now.

Stressing All Cores @ 3.0 GHz (but actually 3.73 GHz)

governor   frequency
userspace  3000 MHz

$ taskset -c 0-31 stress --cpu 32
  • PkgWatt: 151 W
  • CorWatt: each core uses ~6.2 W. All cores consume ~100 W.

OK, so now it has almost reached the TDP figure of 155 W. An interesting thing is that although I requested 3.0 GHz, so the P0 state, all cores are running at 3.73 GHz, which is the boost frequency. I think, because the total power is still under TDP (151<155), it can run all cores at the boosted frequency. I have an air cooled system but the heatsink and the fan are pretty large, so maybe that is helping.

Jumping from 1.5 GHz to 3.73 GHz, so 2.5 times, the power use of a core increased from 0.85 W to 6.2 W, more than 7 times. This really explains why that cpufreq governor is called powersave. Keeping the cores at the lowest frequency decreases power consumption a lot.

I wonder what happens if boost is disabled.

Stressing All Cores @ 3.0 GHz

governor   frequency
userspace  3000 MHz

but also boost is disabled (/sys/devices/system/cpu/cpufreq/boost is 0).

$ taskset -c 0-31 stress --cpu 32
  • PkgWatt: 104 W
  • CorWatt: each core uses ~3.3 W. All cores consume ~53 W.

So now all the cores are running at the base clock of 3 GHz. The result is quite interesting. Increasing the frequency from 3 GHz to 3.73 GHz, so only by about 25%, increases the core power use almost 2 times.

I wonder what happens in the P1 state, at 2.2 GHz, so that I have data for all P-states.

Stressing All Cores @ 2.2 GHz

governor   frequency
userspace  2200 MHz

$ taskset -c 0-31 stress --cpu 32
  • PkgWatt: 78 W
  • CorWatt: each core uses ~ 1.65 W. All cores consume ~26 W.

That is exactly half of the previous result, and the core power at 1.5 GHz was also about half of this. Very interesting. So the P-states are designed in a way that core power consumption roughly doubles at every step, from P2 to P1 and from P1 to P0. It also almost doubles from P0 at base clock to P0 at boost clock. A surprising result, I was not expecting this.

Another result is that there is always a difference of around 50 W between PkgWatt and the total of CorWatt. So the rest of the package uses about 50 W, and this does not change much. This is about the same as the total CorWatt at 3.0 GHz.
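Both observations can be checked with a few lines of Python using the rounded turbostat figures from the all-core tests above (per-core CorWatt and PkgWatt at each frequency). This is just arithmetic on the numbers already reported, not a new measurement.

```python
CORES = 16

# frequency (GHz) -> (CorWatt per core, PkgWatt), all 32 threads stressed
measured = {
    1.5: (0.85, 65),
    2.2: (1.65, 78),
    3.0: (3.3, 104),
    3.73: (6.2, 151),
}

freqs = sorted(measured)

# Ratio of per-core power between neighbouring frequency steps.
ratios = [measured[hi][0] / measured[lo][0] for lo, hi in zip(freqs, freqs[1:])]
for (lo, hi), ratio in zip(zip(freqs, freqs[1:]), ratios):
    print(f"{lo} -> {hi} GHz: core power x{ratio:.2f}")

# Package power that is not attributed to the cores.
overheads = [measured[f][1] - measured[f][0] * CORES for f in freqs]
for f, overhead in zip(freqs, overheads):
    print(f"{f} GHz: package minus cores = {overhead:.1f} W")
```

The ratios come out between about 1.9 and 2.0, and the non-core part stays at roughly 51 to 52 W at every frequency.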

I think the relation between frequency and power use is clear now, I wonder if the location of stressed core matters.

Stressing Two Logical Cores @ 3.0 GHz

governor   frequency
userspace  3000 MHz

boost is still disabled.

$ taskset -c 0,16 stress --cpu 2

PkgWatt: 56 W

CorWatt: Not surprisingly, the physical core that contains logical cores 0 and 16 consumes ~3.3 W. Same as before.

$ taskset -c 0,1 stress --cpu 2

PkgWatt: 57 W

CorWatt: Not surprisingly, now there are two active physical cores and each consumes ~2.4 W. So instead of 3.3 W in total, 2x2.4=4.8 W is used. Naturally, if different physical cores are activated, more power is used.

What happens if these two logical cores are in two different CCXs?

$ taskset -c 0,4 stress --cpu 2

PkgWatt: 57 W

CorWatt: Because there are two different physical cores, each of them uses ~2.4 W like before. There is also no difference in PkgWatt. So I guess the location of a logical core does not matter, other than whether it shares a physical core with another active logical core.

Full Results

I measured many combinations and all the data is below. The script I use to run the tests and generate the table is on GitHub.

P-state PB in the table below means P0 boosted frequency (3.73 GHz).

Test config is in the format (# of threads)-(# of cores)-(# of CCXs). For example, 2-2-2 means there are 2 threads, 2 cores (so each thread is running on a different core) and 2 CCXs (so the cores are in different CCXs). The 2-2-1 and 2-2-2 configs also have a single-letter variant indicator, like 2-2-2-A, since there are 6 ways to pick two of the four cores in a CCX, or two of the four CCXs (C(4,2)=6). I ran the variant tests to see if there is any difference, but there is none.
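The six variants can be enumerated with itertools. The sketch below reproduces the CPU sets that appear in the results table for the 2-2-1 and 2-2-2 configs; the `variants` helper is my own naming for this example and is not taken from the test script.

```python
from itertools import combinations
from string import ascii_uppercase

same_ccx_cores = [12, 13, 14, 15]   # the four cores of the last CCX
one_core_per_ccx = [3, 7, 11, 15]   # the last core of each of the four CCXs


def variants(config, cpus):
    """Name every 2-cpu subset of cpus as config-A, config-B, ..."""
    return {
        f"{config}-{letter}": pair
        for letter, pair in zip(ascii_uppercase, combinations(cpus, 2))
    }


for name, pair in {**variants("2-2-1", same_ccx_cores),
                   **variants("2-2-2", one_core_per_ccx)}.items():
    # Each pair could be passed to taskset, e.g. "taskset -c 12,13 stress --cpu 2".
    print(name, ",".join(str(cpu) for cpu in pair))
```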

The tests are run without NUMA balancing, no ASLR, no swap. Before each test, caches are cleaned. Each test is run for 30 seconds and turbostat makes a measurement every second.

P-state  Config    CPU Set  CorWatt (W)  PkgWatt (W)
P2       1-1-1     12              0.66        52.69
P2       2-1-1     12,28           0.88        52.92
P2       2-2-1-A   12,13           1.28        53.33
P2       2-2-1-B   12,14           1.28        53.34
P2       2-2-1-C   12,15           1.28        53.34
P2       2-2-1-D   13,14           1.28        53.34
P2       2-2-1-E   13,15           1.28        53.34
P2       2-2-1-F   14,15           1.28        53.34
P2       2-2-2-A   3,7             1.25        53.31
P2       2-2-2-B   3,11            1.23        53.30
P2       2-2-2-C   3,15            1.25        53.32
P2       2-2-2-D   7,11            1.23        53.30
P2       2-2-2-E   7,15            1.25        53.32
P2       2-2-2-F   11,15           1.24        53.31
P2       32-16-4   0-31           13.37        65.06
P1       1-1-1     12              1.20        53.47
P1       2-1-1     12,28           1.63        53.88
P1       2-2-1-A   12,13           2.40        54.63
P1       2-2-1-B   12,14           2.39        54.61
P1       2-2-1-C   12,15           2.40        54.62
P1       2-2-1-D   13,14           2.40        54.64
P1       2-2-1-E   13,15           2.42        54.65
P1       2-2-1-F   14,15           2.39        54.62
P1       2-2-2-A   3,7             2.42        54.68
P1       2-2-2-B   3,11            2.34        54.57
P1       2-2-2-C   3,15            2.38        54.60
P1       2-2-2-D   7,11            2.39        54.65
P1       2-2-2-E   7,15            2.42        54.67
P1       2-2-2-F   11,15           2.33        54.57
P1       32-16-4   0-31           26.14        77.59
P0       1-1-1     12              2.35        55.04
P0       2-1-1     12,28           3.23        55.95
P0       2-2-1-A   12,13           4.72        57.32
P0       2-2-1-B   12,14           4.71        57.31
P0       2-2-1-C   12,15           4.73        57.32
P0       2-2-1-D   13,14           4.70        57.30
P0       2-2-1-E   13,15           4.71        57.31
P0       2-2-1-F   14,15           4.66        57.26
P0       2-2-2-A   3,7             4.83        57.50
P0       2-2-2-B   3,11            4.68        57.30
P0       2-2-2-C   3,15            4.80        57.46
P0       2-2-2-D   7,11            4.77        57.44
P0       2-2-2-E   7,15            4.83        57.50
P0       2-2-2-F   11,15           4.59        57.18
P0       32-16-4   0-31           52.55       103.56
PB       1-1-1     12              4.35        57.93
PB       2-1-1     12,28           5.81        59.34
PB       2-2-1-A   12,13           8.75        62.18
PB       2-2-1-B   12,14           8.68        62.16
PB       2-2-1-C   12,15           8.71        62.13
PB       2-2-1-D   13,14           8.54        61.94
PB       2-2-1-E   13,15           8.54        61.94
PB       2-2-1-F   14,15           8.50        61.90
PB       2-2-2-A   3,7             8.97        62.52
PB       2-2-2-B   3,11            8.80        62.31
PB       2-2-2-C   3,15            8.92        62.44
PB       2-2-2-D   7,11            8.78        62.35
PB       2-2-2-E   7,15            8.92        62.49
PB       2-2-2-F   11,15           8.43        61.81
PB       32-16-4   0-31           99.37       149.94

Conclusion

I think there are three major results of this experiment.

Not surprisingly, running both threads on a single core consumes less power than running one thread on each of two different cores.

There is no difference between running two threads in any combination of cores, and any combination of CCXs. The reason I did this test is internally the processor has a Ring topology (Infinity Fabric). So because it is not a fully connected topology, I was thinking there might be a difference, but there is none.

If a core consumes X units of power at P2, it consumes 2X at P1, 4X at P0, and 8X at P0 boosted. If the performance is directly related to frequency only, the performance is X at P2, 1.5X at P1, 2X at P0, and 2.5X at P0 boosted. However, there is also the roughly constant power use of the rest of the package, which is about the same as the total core power at P0, so 4X. If all these are combined and normalized, performance/power is:

  • P2: X / (X+4X) = 0.2
  • P1: 1.5X / (2X+4X) = 0.25
  • P0: 2X / (4X+4X) = 0.25
  • P0 boosted: 2.5X / (8X+4X) ≈ 0.21
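The same comparison can be repeated in Python, first with the idealized X factors and then with the all-core rows of the results table. Using frequency as a stand-in for performance is the same simplification as in the text above, so the second part is only a rough sanity check.

```python
# Idealized model: core power doubles at every step, package overhead is 4X.
core_power = {"P2": 1, "P1": 2, "P0": 4, "P0 boosted": 8}
performance = {"P2": 1.0, "P1": 1.5, "P0": 2.0, "P0 boosted": 2.5}
PACKAGE_OVERHEAD = 4

efficiency = {
    state: performance[state] / (core_power[state] + PACKAGE_OVERHEAD)
    for state in core_power
}
for state, value in efficiency.items():
    print(f"{state}: {value:.2f}")

# Measured all-core rows of the table: state -> (frequency in GHz, PkgWatt).
all_core = {
    "P2": (1.5, 65.06),
    "P1": (2.2, 77.59),
    "P0": (3.0, 103.56),
    "P0 boosted": (3.73, 149.94),
}
ghz_per_watt = {state: ghz / watt for state, (ghz, watt) in all_core.items()}
for state, value in ghz_per_watt.items():
    print(f"{state}: {value * 1000:.1f} MHz per W")
```

Both versions rank P1 and P0 ahead of P2 and the boosted P0 when all cores are loaded.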

So my conclusion is:

  • if there is a very light load, then it makes sense to run everything at P2.
  • if there is some load but not full, then P1 or P0 makes sense.
  • if there is a full load, then naturally P0 boost is fine assuming TDP will not limit the number of threads that can run at boosted frequency.

Most of the time the load is not very stable, so the governor should be load aware. Hence, using ondemand, conservative or the newer schedutil is probably a must.