HeFFTe is the Highly Efficient FFT for Exascale library. This benchmark has 64 subtests. Some fail for unusual reasons, including a missing libelf library or running too quickly. Most, however, run and produce a result. The tests are a mixture: most are single-threaded, and the rest use a thread count that matches the number of cores.

The topdown profile appears to show a persistently high floor of frontend-bound stalls, patches of backend stalls, and a somewhat lower retirement rate.

The AMD metrics show an average of about 3 of 16 cores busy (on_cpu 0.187). This is floating-point code with about 60% backend memory stalls. Frontend stalls average 17% overall, and the retirement rate is below 10%.
elapsed 2584.235
on_cpu 0.187 # 2.99 / 16 cores
utime 6127.346
stime 1597.466
nvcsw 2219897 # 96.96%
nivcsw 69603 # 3.04%
inblock 22620712 # 8753.35/sec
onblock 3811880 # 1475.05/sec
cpu-clock 9104977687955 # 9104.978 seconds
task-clock 9105744639000 # 9105.745 seconds
page faults 708443241 # 77801.791/sec
context switches 3181073 # 349.348/sec
cpu migrations 52924 # 5.812/sec
major page faults 226527 # 24.877/sec
minor page faults 708216255 # 77776.863/sec
alignment faults 0 # 0.000/sec
emulation faults 0 # 0.000/sec
branches 2648061711421 # 127.197 branches per 1000 inst
branch misses 137376535860 # 5.19% branch miss
conditional 1746794451531 # 83.906 conditional branches per 1000 inst
indirect 72161814734 # 3.466 indirect branches per 1000 inst
cpu-cycles 39238084515767 # 0.95 GHz
instructions 20656319189605 # 0.53 IPC low
slots 78565017276084 #
retiring 7430090396720 # 9.5% ( 9.5%) low
-- ucode 22792162227 # 0.0%
-- fastpath 7407298234493 # 9.4%
frontend 13446209644619 # 17.1% (17.2%)
-- latency 9245125183632 # 11.8%
-- bandwidth 4201084460987 # 5.3%
backend 56879497108114 # 72.4% (73.0%) high
-- cpu 9300687286741 # 11.8%
-- memory 47578809821373 # 60.6%
speculation 213890267340 # 0.3% ( 0.3%) low
-- branch mispredict 211243116854 # 0.3%
-- pipeline restart 2647150486 # 0.0%
smt-contention 595242856933 # 0.8% ( 0.0%)
cpu-cycles 39180222661690 # 0.96 GHz
instructions 20555473254360 # 0.52 IPC low
instructions 6868140531111 # 57.742 l2 access per 1000 inst
l2 hit from l1 303283223345 # 38.82% l2 miss
l2 miss from l1 101327869773 #
l2 hit from l2 pf 40674479706 #
l3 hit from l2 pf 5564960545 #
l3 miss from l2 pf 47056539540 #
instructions 6852144871392 # 128.190 float per 1000 inst
float 512 919 # 0.000 AVX-512 per 1000 inst
float 256 7426 # 0.000 AVX-256 per 1000 inst
float 128 878379612146 # 128.190 AVX-128 per 1000 inst
float MMX 0 # 0.000 MMX per 1000 inst
float scalar 0 # 0.000 scalar per 1000 inst
instructions 20699274879600 #
opcache 3632503887154 # 175.489 opcache per 1000 inst
opcache miss 906178654575 # 24.9% opcache miss rate
l1 dTLB miss 66246872882 # 3.200 L1 dTLB per 1000 inst
l2 dTLB miss 14096459068 # 0.681 L2 dTLB per 1000 inst
instructions 20785223761254 #
icache 2087723305047 # 100.443 icache per 1000 inst
icache miss 54202803799 # 2.6% icache miss rate
l1 iTLB miss 67609643 # 0.003 L1 iTLB per 1000 inst
l2 iTLB miss 0 # 0.000 L2 iTLB per 1000 inst
tlb flush 20274864 # 0.001 TLB flush per 1000 inst
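The headline figures quoted for the AMD run can be re-derived from the raw counters. The Python sketch below is only an illustration and is not part of the profiling tool. The variable names are invented here, and the formulas are assumptions that happen to reproduce the printed values: on-CPU fraction is taken as (utime + stime) / elapsed / cores, and each topdown share as category slots / total slots.

```python
# Raw values copied from the AMD metrics listing above.
elapsed = 2584.235   # wall-clock seconds
utime = 6127.346     # user CPU seconds
stime = 1597.466     # system CPU seconds
cores = 16           # from the "2.99 / 16 cores" annotation

slots = 78565017276084
topdown = {
    "retiring": 7430090396720,
    "frontend": 13446209644619,
    "backend": 56879497108114,
    "speculation": 213890267340,
}
backend_memory = 47578809821373

# Assumed formula: average busy cores = total CPU time / wall-clock time.
avg_cores = (utime + stime) / elapsed
on_cpu = avg_cores / cores

# Assumed formula: share of pipeline slots attributed to each category.
shares = {name: 100.0 * count / slots for name, count in topdown.items()}
memory_share = 100.0 * backend_memory / slots

print(f"average busy cores: {avg_cores:.2f} (on_cpu {on_cpu:.3f})")
for name, pct in shares.items():
    print(f"{name:12s} {pct:5.1f}%")
print(f"backend memory {memory_share:.1f}%")
```

Under these assumptions the backend share comes out above 70% of slots, with most of that in the memory sub-category, which matches the characterization of this run as memory-stall dominated.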
Intel metrics
The process overview shows that mpirun is used to launch the tests and that most time is spent in either speed3d_c2c or speed3d_r2c.
9007 processes
3240 speed3d_c2c 14178.01 4914.72
3456 speed3d_r2c 7832.05 3804.90
1602 mpirun 79.39 544.27
68 clinfo 16.17 9.66
38 vulkaninfo 1.14 1.91
6 php 0.63 206.14
4 vulkani:disk$0 0.12 0.21
6 glxinfo:gdrv0 0.09 0.15
6 glxinfo:gl0 0.09 0.15
2 llvmpipe-0 0.06 0.11
2 llvmpipe-1 0.06 0.11
2 llvmpipe-10 0.06 0.11
2 llvmpipe-11 0.06 0.11
2 llvmpipe-12 0.06 0.11
2 llvmpipe-13 0.06 0.11
2 llvmpipe-14 0.06 0.11
2 llvmpipe-15 0.06 0.11
2 llvmpipe-2 0.06 0.11
2 llvmpipe-4 0.06 0.11
2 llvmpipe-5 0.06 0.11
2 llvmpipe-6 0.06 0.11
2 llvmpipe-7 0.06 0.11
2 llvmpipe-8 0.06 0.11
2 llvmpipe-9 0.06 0.11
6 clang 0.06 0.10
2 llvmpipe-3 0.06 0.10
2 glxinfo 0.05 0.05
2 glxinfo:cs0 0.05 0.05
2 glxinfo:disk$0 0.05 0.05
2 glxinfo:sh0 0.05 0.05
2 glxinfo:shlo0 0.05 0.05
3 rocminfo 0.03 0.00
1 lspci 0.00 0.03
1 ps 0.00 0.01
267 heffte 0.00 0.00
176 sh 0.00 0.00
13 gcc 0.00 0.00
9 gsettings 0.00 0.00
8 stat 0.00 0.00
8 systemd-detect- 0.00 0.00
6 llvm-link 0.00 0.00
5 gmain 0.00 0.00
5 phoronix-test-s 0.00 0.00
2 cc 0.00 0.00
2 dconf worker 0.00 0.00
2 lscpu 0.00 0.00
2 uname 0.00 0.00
2 which 0.00 0.00
2 xset 0.00 0.00
1 date 0.00 0.00
1 dirname 0.00 0.00
1 dmesg 0.00 0.00
1 dmidecode 0.00 0.00
1 grep 0.00 0.00
1 ifconfig 0.00 0.00
1 ip 0.00 0.00
1 lsmod 0.00 0.00
1 mktemp 0.00 0.00
1 qdbus 0.00 0.00
1 readlink 0.00 0.00
1 realpath 0.00 0.00
1 sed 0.00 0.00
1 sort 0.00 0.00
1 stty 0.00 0.00
1 systemctl 0.00 0.00
1 template.sh 0.00 0.00
1 wc 0.00 0.00
1 xrandr 0.00 0.00
4 processes running
51 maximum processes
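To make the "most time in speed3d" observation concrete, the top rows of the overview can be aggregated. This Python sketch is an illustration under stated assumptions: the report does not label the two numeric columns, so they are carried as anonymous values (col_a, col_b are hypothetical names), and only the first six rows are included because the remaining rows contribute very little.

```python
# Top rows copied from the process overview above:
# (process count, command, col_a, col_b).
# The report does not name the two numeric columns, so no meaning is assumed.
rows = [
    (3240, "speed3d_c2c", 14178.01, 4914.72),
    (3456, "speed3d_r2c", 7832.05, 3804.90),
    (1602, "mpirun", 79.39, 544.27),
    (68, "clinfo", 16.17, 9.66),
    (38, "vulkaninfo", 1.14, 1.91),
    (6, "php", 0.63, 206.14),
]

def is_fft(command):
    """True for the two HeFFTe benchmark binaries."""
    return command.startswith("speed3d_")

total_a = sum(col_a for _, _, col_a, _ in rows)
fft_a = sum(col_a for _, cmd, col_a, _ in rows if is_fft(cmd))
fft_share = fft_a / total_a
fft_processes = sum(count for count, cmd, _, _ in rows if is_fft(cmd))

print(f"speed3d share of first numeric column: {fft_share:.1%}")
print(f"speed3d processes: {fft_processes}")
```

Among these rows the two speed3d binaries account for well over 99% of the first numeric column.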
Example of a core computation block
65412) heffte cpu=2 start=5.91 finish=6.82
65413) mpirun cpu=6 start=5.91 finish=6.80
65414) mpirun cpu=1 start=6.11 finish=6.80
65415) mpirun cpu=4 start=6.11 finish=6.11
65416) mpirun cpu=4 start=6.14 finish=6.79
65417) mpirun cpu=8 start=6.24 finish=6.79
65418) mpirun cpu=5 start=6.24 finish=6.79
65419) speed3d_c2c cpu=12 start=6.28 finish=6.78
65424) speed3d_c2c cpu=9 start=6.29 finish=6.78
65427) speed3d_c2c cpu=13 start=6.30 finish=6.78
65420) speed3d_c2c cpu=7 start=6.28 finish=6.78
65423) speed3d_c2c cpu=4 start=6.29 finish=6.78
65428) speed3d_c2c cpu=11 start=6.30 finish=6.78
65421) speed3d_c2c cpu=9 start=6.29 finish=6.78
65425) speed3d_c2c cpu=5 start=6.29 finish=6.78
65430) speed3d_c2c cpu=6 start=6.30 finish=6.78
65422) speed3d_c2c cpu=2 start=6.29 finish=6.78
65429) speed3d_c2c cpu=11 start=6.30 finish=6.78
65432) speed3d_c2c cpu=12 start=6.30 finish=6.78
65426) speed3d_c2c cpu=14 start=6.29 finish=6.78
65433) speed3d_c2c cpu=12 start=6.30 finish=6.78
65437) speed3d_c2c cpu=10 start=6.31 finish=6.78
65431) speed3d_c2c cpu=5 start=6.30 finish=6.78
65435) speed3d_c2c cpu=8 start=6.31 finish=6.78
65439) speed3d_c2c cpu=8 start=6.31 finish=6.78
65434) speed3d_c2c cpu=0 start=6.30 finish=6.78
65438) speed3d_c2c cpu=7 start=6.31 finish=6.78
65441) speed3d_c2c cpu=14 start=6.32 finish=6.78
65436) speed3d_c2c cpu=3 start=6.31 finish=6.78
65440) speed3d_c2c cpu=4 start=6.32 finish=6.78
65442) speed3d_c2c cpu=1 start=6.32 finish=6.78
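A listing like the one above can be parsed to measure how long each process lived and how many workers overlapped. The Python sketch below is a minimal illustration, assuming the line shape seen in this report (identifier, command, cpu, start, finish). The regular expression and field names are invented here, and the start and finish values are treated as plain numbers because the report does not state their unit. Only a few sample lines are included.

```python
import re

# A few lines copied from the listing above; the full block has the same shape.
sample = """\
65412) heffte cpu=2 start=5.91 finish=6.82
65413) mpirun cpu=6 start=5.91 finish=6.80
65419) speed3d_c2c cpu=12 start=6.28 finish=6.78
65424) speed3d_c2c cpu=9 start=6.29 finish=6.78
65442) speed3d_c2c cpu=1 start=6.32 finish=6.78
"""

# Hypothetical pattern for one line: "<id>) <command> cpu=N start=X finish=Y".
LINE = re.compile(
    r"(\d+)\)\s+(\S+)\s+cpu=(\d+)\s+start=([\d.]+)\s+finish=([\d.]+)"
)

records = []
for line in sample.splitlines():
    m = LINE.match(line)
    if m is None:
        continue  # skip anything that does not look like a process line
    ident, command, cpu, start, finish = m.groups()
    records.append({
        "id": int(ident),
        "command": command,
        "cpu": int(cpu),
        "start": float(start),
        "finish": float(finish),
    })

workers = [r for r in records if r["command"] == "speed3d_c2c"]
# Window covered by the workers: earliest start to latest finish.
worker_span = max(r["finish"] for r in workers) - min(r["start"] for r in workers)

for r in records:
    print(f'{r["id"]} {r["command"]:12s} lived {r["finish"] - r["start"]:.2f}')
print(f"{len(workers)} workers, span {worker_span:.2f}")
```

Applied to the full listing, the same filter would count 24 speed3d_c2c entries, all starting within a few hundredths of each other and finishing together at 6.78.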
