HeFFTe is the Highly Efficient FFT for Exascale library. This benchmark has 64 different subtests. Some fail for odd reasons, including a missing libelf library or running too quickly. However, most run and provide an example result. The tests run as a mixture of mostly single-threaded work and work whose thread count matches the number of cores.

The topdown profile appears to show a persistent floor of frontend-bound stalls, patches of backend stalls, and a somewhat low retirement rate.

AMD metrics show an average of about 3 of 16 cores busy (on_cpu 0.187). This is floating-point code with about 60% backend memory stalls. Frontend stalls average 17% overall, and the retirement rate is below 10%.

elapsed              2584.235
on_cpu               0.187          # 2.99 / 16 cores
utime                6127.346
stime                1597.466
nvcsw                2219897        # 96.96%
nivcsw               69603          # 3.04%
inblock              22620712       # 8753.35/sec
onblock              3811880        # 1475.05/sec
cpu-clock            9104977687955  # 9104.978 seconds
task-clock           9105744639000  # 9105.745 seconds
page faults          708443241      # 77801.791/sec
context switches     3181073        # 349.348/sec
cpu migrations       52924          # 5.812/sec
major page faults    226527         # 24.877/sec
minor page faults    708216255      # 77776.863/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             2648061711421  # 127.197 branches per 1000 inst
branch misses        137376535860   # 5.19% branch miss
conditional          1746794451531  # 83.906 conditional branches per 1000 inst
indirect             72161814734    # 3.466 indirect branches per 1000 inst
cpu-cycles           39238084515767 # 0.95 GHz
instructions         20656319189605 # 0.53 IPC low
slots                78565017276084 #
retiring             7430090396720  #  9.5% ( 9.5%) low
-- ucode             22792162227    #     0.0%
-- fastpath          7407298234493  #     9.4%
frontend             13446209644619 # 17.1% (17.2%)
-- latency           9245125183632  #    11.8%
-- bandwidth         4201084460987  #     5.3%
backend              56879497108114 # 72.4% (73.0%) high
-- cpu               9300687286741  #    11.8%
-- memory            47578809821373 #    60.6%
speculation          213890267340   #  0.3% ( 0.3%) low
-- branch mispredict 211243116854   #     0.3%
-- pipeline restart  2647150486     #     0.0%
smt-contention       595242856933   #  0.8% ( 0.0%)
cpu-cycles           39180222661690 # 0.96 GHz
instructions         20555473254360 # 0.52 IPC low
instructions         6868140531111  # 57.742 l2 access per 1000 inst
l2 hit from l1       303283223345   # 38.82% l2 miss
l2 miss from l1      101327869773   #
l2 hit from l2 pf    40674479706    #
l3 hit from l2 pf    5564960545     #
l3 miss from l2 pf   47056539540    #
instructions         6852144871392  # 128.190 float per 1000 inst
float 512            919            # 0.000 AVX-512 per 1000 inst
float 256            7426           # 0.000 AVX-256 per 1000 inst
float 128            878379612146   # 128.190 AVX-128 per 1000 inst
float MMX            0              # 0.000 MMX per 1000 inst
float scalar         0              # 0.000 scalar per 1000 inst
instructions         20699274879600 #
opcache              3632503887154  # 175.489 opcache per 1000 inst
opcache miss         906178654575   # 24.9% opcache miss rate
l1 dTLB miss         66246872882    # 3.200 L1 dTLB per 1000 inst
l2 dTLB miss         14096459068    # 0.681 L2 dTLB per 1000 inst
instructions         20785223761254 #
icache               2087723305047  # 100.443 icache per 1000 inst
icache miss          54202803799    #  2.6% icache miss rate
l1 iTLB miss         67609643       # 0.003 L1 iTLB per 1000 inst
l2 iTLB miss         0              # 0.000 L2 iTLB per 1000 inst
tlb flush            20274864       # 0.001 TLB flush per 1000 inst
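
The summary figures quoted above (cores busy, IPC, and the topdown percentages) can be re-derived from the raw counters in the listing. The following is a minimal sketch rather than the profiling tool's own code: the variable names and the `pct` helper are illustrative, and it assumes that on_cpu is (utime + stime) / elapsed / cores and that each topdown category is reported as a share of `slots`, which is consistent with the printed values.

```python
# Raw counters copied from the AMD listing above.
elapsed = 2584.235      # wall-clock seconds
utime = 6127.346        # user CPU seconds
stime = 1597.466        # system CPU seconds
ncores = 16             # from the "2.99 / 16 cores" annotation

cycles = 39238084515767
instructions = 20656319189605
slots = 78565017276084
retiring = 7430090396720
frontend = 13446209644619
backend = 56879497108114
backend_memory = 47578809821373
speculation = 213890267340


def pct(part, whole):
    """Return part as a percentage of whole (illustrative helper)."""
    return 100.0 * part / whole


# Assumed derivation: total CPU time divided by wall-clock time gives the
# average number of busy cores; dividing by the core count gives on_cpu.
busy_cores = (utime + stime) / elapsed
on_cpu = busy_cores / ncores

# Instructions per cycle.
ipc = instructions / cycles

print(f"busy cores  {busy_cores:.2f}")
print(f"on_cpu      {on_cpu:.3f}")
print(f"IPC         {ipc:.2f}")
print(f"retiring    {pct(retiring, slots):.1f}%")
print(f"frontend    {pct(frontend, slots):.1f}%")
print(f"backend     {pct(backend, slots):.1f}%")
print(f"-- memory   {pct(backend_memory, slots):.1f}%")
print(f"speculation {pct(speculation, slots):.1f}%")
```

Under these assumptions the computed values agree with the annotations in the listing, which supports reading the backend memory share (about 60% of slots) as the dominant limiter.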

Intel metrics

The process overview shows that mpirun is used to launch the tests and that most time is spent in either speed3d_c2c or speed3d_r2c.

9007 processes
	3240 speed3d_c2c          14178.01  4914.72
	3456 speed3d_r2c           7832.05  3804.90
	1602 mpirun                  79.39   544.27
	 68 clinfo                  16.17     9.66
	 38 vulkaninfo               1.14     1.91
	  6 php                      0.63   206.14
	  4 vulkani:disk$0           0.12     0.21
	  6 glxinfo:gdrv0            0.09     0.15
	  6 glxinfo:gl0              0.09     0.15
	  2 llvmpipe-0               0.06     0.11
	  2 llvmpipe-1               0.06     0.11
	  2 llvmpipe-10              0.06     0.11
	  2 llvmpipe-11              0.06     0.11
	  2 llvmpipe-12              0.06     0.11
	  2 llvmpipe-13              0.06     0.11
	  2 llvmpipe-14              0.06     0.11
	  2 llvmpipe-15              0.06     0.11
	  2 llvmpipe-2               0.06     0.11
	  2 llvmpipe-4               0.06     0.11
	  2 llvmpipe-5               0.06     0.11
	  2 llvmpipe-6               0.06     0.11
	  2 llvmpipe-7               0.06     0.11
	  2 llvmpipe-8               0.06     0.11
	  2 llvmpipe-9               0.06     0.11
	  6 clang                    0.06     0.10
	  2 llvmpipe-3               0.06     0.10
	  2 glxinfo                  0.05     0.05
	  2 glxinfo:cs0              0.05     0.05
	  2 glxinfo:disk$0           0.05     0.05
	  2 glxinfo:sh0              0.05     0.05
	  2 glxinfo:shlo0            0.05     0.05
	  3 rocminfo                 0.03     0.00
	  1 lspci                    0.00     0.03
	  1 ps                       0.00     0.01
	267 heffte                   0.00     0.00
	176 sh                       0.00     0.00
	 13 gcc                      0.00     0.00
	  9 gsettings                0.00     0.00
	  8 stat                     0.00     0.00
	  8 systemd-detect-          0.00     0.00
	  6 llvm-link                0.00     0.00
	  5 gmain                    0.00     0.00
	  5 phoronix-test-s          0.00     0.00
	  2 cc                       0.00     0.00
	  2 dconf worker             0.00     0.00
	  2 lscpu                    0.00     0.00
	  2 uname                    0.00     0.00
	  2 which                    0.00     0.00
	  2 xset                     0.00     0.00
	  1 date                     0.00     0.00
	  1 dirname                  0.00     0.00
	  1 dmesg                    0.00     0.00
	  1 dmidecode                0.00     0.00
	  1 grep                     0.00     0.00
	  1 ifconfig                 0.00     0.00
	  1 ip                       0.00     0.00
	  1 lsmod                    0.00     0.00
	  1 mktemp                   0.00     0.00
	  1 qdbus                    0.00     0.00
	  1 readlink                 0.00     0.00
	  1 realpath                 0.00     0.00
	  1 sed                      0.00     0.00
	  1 sort                     0.00     0.00
	  1 stty                     0.00     0.00
	  1 systemctl                0.00     0.00
	  1 template.sh              0.00     0.00
	  1 wc                       0.00     0.00
	  1 xrandr                   0.00     0.00
4 processes running
51 maximum processes

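Example of a core computation block: one heffte invocation launches mpirun, which starts eight speed3d_c2c ranks, each with two child tasks, all running for about half a second.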

      65412) heffte           cpu=2 start=5.91  finish=6.82 
        65413) mpirun           cpu=6 start=5.91  finish=6.80 
          65414) mpirun           cpu=1 start=6.11  finish=6.80 
          65415) mpirun           cpu=4 start=6.11  finish=6.11 
          65416) mpirun           cpu=4 start=6.14  finish=6.79 
          65417) mpirun           cpu=8 start=6.24  finish=6.79 
          65418) mpirun           cpu=5 start=6.24  finish=6.79 
          65419) speed3d_c2c      cpu=12 start=6.28  finish=6.78 
            65424) speed3d_c2c      cpu=9 start=6.29  finish=6.78 
            65427) speed3d_c2c      cpu=13 start=6.30  finish=6.78 
          65420) speed3d_c2c      cpu=7 start=6.28  finish=6.78 
            65423) speed3d_c2c      cpu=4 start=6.29  finish=6.78 
            65428) speed3d_c2c      cpu=11 start=6.30  finish=6.78 
          65421) speed3d_c2c      cpu=9 start=6.29  finish=6.78 
            65425) speed3d_c2c      cpu=5 start=6.29  finish=6.78 
            65430) speed3d_c2c      cpu=6 start=6.30  finish=6.78 
          65422) speed3d_c2c      cpu=2 start=6.29  finish=6.78 
            65429) speed3d_c2c      cpu=11 start=6.30  finish=6.78 
            65432) speed3d_c2c      cpu=12 start=6.30  finish=6.78 
          65426) speed3d_c2c      cpu=14 start=6.29  finish=6.78 
            65433) speed3d_c2c      cpu=12 start=6.30  finish=6.78 
            65437) speed3d_c2c      cpu=10 start=6.31  finish=6.78 
          65431) speed3d_c2c      cpu=5 start=6.30  finish=6.78 
            65435) speed3d_c2c      cpu=8 start=6.31  finish=6.78 
            65439) speed3d_c2c      cpu=8 start=6.31  finish=6.78 
          65434) speed3d_c2c      cpu=0 start=6.30  finish=6.78 
            65438) speed3d_c2c      cpu=7 start=6.31  finish=6.78 
            65441) speed3d_c2c      cpu=14 start=6.32  finish=6.78 
          65436) speed3d_c2c      cpu=3 start=6.31  finish=6.78 
            65440) speed3d_c2c      cpu=4 start=6.32  finish=6.78 
            65442) speed3d_c2c      cpu=1 start=6.32  finish=6.78
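
A listing like the one above can be summarized with a short script. This is a rough sketch under stated assumptions: the line format is taken to be `<pid>) <name> cpu=<n> start=<s> finish=<s>` with nesting shown by indentation, the `parse` function and `LINE` pattern are hypothetical names, and only a few lines of the listing are embedded as sample input.

```python
import re

# A few lines copied from the listing above (assumed format:
# "<pid>) <name> cpu=<n> start=<s> finish=<s>", nesting shown by indentation).
sample = """\
      65412) heffte           cpu=2 start=5.91  finish=6.82
        65413) mpirun           cpu=6 start=5.91  finish=6.80
          65419) speed3d_c2c      cpu=12 start=6.28  finish=6.78
            65424) speed3d_c2c      cpu=9 start=6.29  finish=6.78
            65427) speed3d_c2c      cpu=13 start=6.30  finish=6.78
"""

LINE = re.compile(
    r"^(\s*)(\d+)\)\s+(\S+)\s+cpu=(\d+)\s+start=([\d.]+)\s+finish=([\d.]+)"
)


def parse(text):
    """Parse process-tree lines into dicts; lines that do not match are skipped."""
    rows = []
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            indent, pid, name, cpu, start, finish = m.groups()
            rows.append({
                "depth": len(indent),   # indentation width as a nesting proxy
                "pid": int(pid),
                "name": name,
                "cpu": int(cpu),
                "start": float(start),
                "finish": float(finish),
            })
    return rows


rows = parse(sample)
fft = [r for r in rows if r["name"] == "speed3d_c2c"]

# Wall-clock span covered by the FFT tasks in this block.
span = max(r["finish"] for r in fft) - min(r["start"] for r in fft)
print(f"{len(fft)} speed3d_c2c entries, span {span:.2f} s")
```

Run over the full listing, the same approach would count all speed3d_c2c entries and show how short each computation block is relative to the launch overhead of heffte and mpirun.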