Tests a fast Fourier transform library with FFTs in 32 different sizes and dimensions. Overall, the benchmark looks single-threaded, though how busy the CPU cores are varies.

The topdown overview varies by benchmark, but most runs show few frontend stalls and more backend stalls. The balance between backend stalls and retirement also seems to vary. This is a case where I expect contrasts if you pull apart the different FFT sizes.

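AMD metrics show almost 40% floating-point instructions (about 385 per 1000) with few branches (about 40 per 1000 instructions). The L2 miss rate is moderate (19.92%), and memory stalls (44.3% of slots) dominate the backend over CPU-bound stalls such as floating point (15.5%).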

elapsed              6035.507
on_cpu               0.052          # 0.84 / 16 cores
utime                5057.912
stime                4.091
nvcsw                3674           # 13.24%
nivcsw               24069          # 86.76%
inblock              2608           # 0.43/sec
onblock              132160         # 21.90/sec
cpu-clock            5063022032326  # 5063.022 seconds
task-clock           5063105505005  # 5063.106 seconds
page faults          1403691        # 277.239/sec
context switches     57281          # 11.313/sec
cpu migrations       1619           # 0.320/sec
major page faults    2              # 0.000/sec
minor page faults    1403689        # 277.239/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             1965580397202  # 39.916 branches per 1000 inst
branch misses        3298066028     # 0.17% branch miss
conditional          1656507779012  # 33.640 conditional branches per 1000 inst
indirect             91167644504    # 1.851 indirect branches per 1000 inst
cpu-cycles           19871548371102 # 0.23 GHz
instructions         41895510041240 # 2.11 IPC
slots                39757752428328 #
retiring             14486190179724 # 36.4% (36.4%)
-- ucode             16153741583    #     0.0%
-- fastpath          14470036438141 #    36.4%
frontend             1299222028903  #  3.3% ( 3.3%)
-- latency           488647318608   #     1.2%
-- bandwidth         810574710295   #     2.0%
backend              23765685993487 # 59.8% (59.8%)
-- cpu               6143636408133  #    15.5%
-- memory            17622049585354 #    44.3%
speculation          205999054100   #  0.5% ( 0.5%)
-- branch mispredict 115970837458   #     0.3%
-- pipeline restart  90028216642    #     0.2%
smt-contention       654137508      #  0.0% ( 0.0%)
cpu-cycles           22621150909378 # 0.23 GHz
instructions         46306295646487 # 2.05 IPC
instructions         15440006308981 # 54.750 l2 access per 1000 inst
l2 hit from l1       551234904807   # 19.92% l2 miss
l2 miss from l1      60719432531    #
l2 hit from l2 pf    186431059131   #
l3 hit from l2 pf    48964245685    #
l3 miss from l2 pf   58708583119    #
instructions         15435614877141 # 384.957 float per 1000 inst
float 512            230            # 0.000 AVX-512 per 1000 inst
float 256            390            # 0.000 AVX-256 per 1000 inst
float 128            5942046309998  # 384.957 AVX-128 per 1000 inst
float MMX            0              # 0.000 MMX per 1000 inst
float scalar         0              # 0.000 scalar per 1000 inst
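
As a cross-check on the listing above, the level-1 topdown percentages and the IPC can be recomputed from the raw counts. This is a minimal sketch that assumes each percentage is simply the category count divided by `slots`, and that IPC is `instructions / cpu-cycles` from the same counter group; the variable and function names are illustrative and not taken from the measurement tool.

```python
# Raw counts copied from the AMD listing (first cpu-cycles/instructions group).
slots = 39_757_752_428_328
topdown = {
    "retiring": 14_486_190_179_724,
    "frontend": 1_299_222_028_903,
    "backend": 23_765_685_993_487,
    "speculation": 205_999_054_100,
    "smt-contention": 654_137_508,
}
cycles = 19_871_548_371_102
instructions = 41_895_510_041_240


def percent(count, total):
    """Share of `total` taken by `count`, as a percentage (assumed formula)."""
    return 100.0 * count / total


def ipc(instructions, cycles):
    """Instructions retired per cycle."""
    return instructions / cycles


for name, count in topdown.items():
    print(f"{name:15s} {percent(count, slots):5.1f}%")
print(f"IPC             {ipc(instructions, cycles):5.2f}")
```

Under these assumptions the recomputed values agree with the listing after rounding, and the five level-1 categories add up to the slot count to within a tiny fraction.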

Intel metrics

elapsed              3010.244
on_cpu               0.049          # 0.78 / 16 cores
utime                2335.260
stime                2.102
nvcsw                2964           # 20.89%
nivcsw               11223          # 79.11%
inblock              24             # 0.01/sec
onblock              105536         # 35.06/sec
cpu-clock            2337833277758  # 2337.833 seconds
task-clock           2337870126196  # 2337.870 seconds
page faults          683541         # 292.378/sec
context switches     28760          # 12.302/sec
cpu migrations       651            # 0.278/sec
major page faults    0              # 0.000/sec
minor page faults    683541         # 292.378/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             827918918657   # 38.741 branches per 1000 inst
branch misses        2461654182     # 0.30% branch miss
conditional          827918940033   # 38.741 conditional branches per 1000 inst
indirect             41023014118    # 1.920 indirect branches per 1000 inst
slots                86916464289776 #
retiring             39007424491498 # 44.9% (44.9%)
-- ucode             1317253774803  #     1.5%
-- fastpath          37690170716695 #    43.4%
frontend             3309488424692  #  3.8% ( 3.8%)
-- latency           1285710019402  #     1.5%
-- bandwidth         2023778405290  #     2.3%
backend              46132471585328 # 53.1% (53.1%)
-- cpu               10736832716887 #    12.4%
-- memory            35395638868441 #    40.7%
speculation          1999383497821  #  2.3% ( 2.3%)
-- branch mispredict 1441628404359  #     1.7%
-- pipeline restart  557755093462   #     0.6%
smt-contention       0              #  0.0% ( 0.0%)
cpu-cycles           8626088451177  # 0.17 GHz
instructions         22951385088498 # 2.66 IPC
l2 access            725794661607   # 31.630 l2 access per 1000 inst
l2 miss              323765912211   # 44.61% l2 miss
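
The single-threaded impression can be checked against the `on_cpu` lines of both listings. The sketch below assumes `on_cpu` is `(utime + stime) / elapsed`, divided by the 16 cores; that formula is an inference from the printed numbers rather than something the tool documents here, and the function name is hypothetical.

```python
CORES = 16  # taken from the "/ 16 cores" annotation in both listings


def busy_cores(utime, stime, elapsed):
    """Average number of cores kept busy: CPU seconds per wall-clock second."""
    return (utime + stime) / elapsed


# utime, stime, and elapsed copied from the AMD and Intel listings.
runs = {
    "AMD": dict(utime=5057.912, stime=4.091, elapsed=6035.507),
    "Intel": dict(utime=2335.260, stime=2.102, elapsed=3010.244),
}

for name, r in runs.items():
    busy = busy_cores(**r)
    print(f"{name}: {busy:.2f} cores busy, {busy / CORES:.3f} of {CORES} cores")
```

Both systems come out below one busy core on average (0.84 and 0.78), which is consistent with a single-threaded workload that is not running the whole time.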

The process overview shows relatively few distinct processes, with the work done by an internal bench program. The benchmark crashed towards the end of the first run.

613 processes
	161 bench                 2461.12     0.84
	 34 clinfo                   9.74     3.33
	 19 vulkaninfo               0.76     0.57
	  2 vulkani:disk$0           0.08     0.06
	  3 glxinfo:gdrv0            0.07     0.06
	  6 clang                    0.05     0.07
	  1 llvmpipe-0               0.04     0.03
	  1 llvmpipe-1               0.04     0.03
	  1 llvmpipe-10              0.04     0.03
	  1 llvmpipe-11              0.04     0.03
	  1 llvmpipe-12              0.04     0.03
	  1 llvmpipe-13              0.04     0.03
	  1 llvmpipe-14              0.04     0.03
	  1 llvmpipe-15              0.04     0.03
	  1 llvmpipe-2               0.04     0.03
	  1 llvmpipe-3               0.04     0.03
	  1 llvmpipe-4               0.04     0.03
	  1 llvmpipe-5               0.04     0.03
	  1 llvmpipe-6               0.04     0.03
	  1 llvmpipe-7               0.04     0.03
	  1 llvmpipe-8               0.04     0.03
	  1 llvmpipe-9               0.04     0.03
	  1 glxinfo                  0.04     0.02
	  1 glxinfo:cs0              0.04     0.02
	  1 glxinfo:disk$0           0.03     0.02
	  1 glxinfo:sh0              0.03     0.02
	  1 glxinfo:shlo0            0.03     0.02
	  1 ps                       0.00     0.01
	281 sh                       0.00     0.00
	 13 gcc                      0.00     0.00
	  8 gsettings                0.00     0.00
	  8 systemd-detect-          0.00     0.00
	  7 stat                     0.00     0.00
	  6 llvm-link                0.00     0.00
	  5 gmain                    0.00     0.00
	  4 phoronix-test-s          0.00     0.00
	  2 dconf worker             0.00     0.00
	  2 which                    0.00     0.00
	  1 cc                       0.00     0.00
	  1 date                     0.00     0.00
	  1 dirname                  0.00     0.00
	  1 grep                     0.00     0.00
	  1 ifconfig                 0.00     0.00
	  1 ip                       0.00     0.00
	  1 lscpu                    0.00     0.00
	  1 mktemp                   0.00     0.00
	  1 qdbus                    0.00     0.00
	  1 readlink                 0.00     0.00
	  1 realpath                 0.00     0.00
	  1 sed                      0.00     0.00
	  1 sort                     0.00     0.00
	  1 stty                     0.00     0.00
	  1 systemctl                0.00     0.00
	  1 template.sh              0.00     0.00
	  1 wc                       0.00     0.00
	  1 xrandr                   0.00     0.00
	  1 xset                     0.00     0.00
11 processes running
47 maximum processes

The computation consists of repeated invocations of bench, e.g.

      294703) sh               cpu=0 start=29.18 finish=32.71
        294704) bench            cpu=4 start=29.18 finish=32.71
      294705) sh               cpu=0 start=36.72 finish=40.27
        294706) bench            cpu=9 start=36.72 finish=40.27
      294707) sh               cpu=0 start=44.27 finish=47.81
        294708) bench            cpu=1 start=44.27 finish=47.81
      294709) sh               cpu=0 start=47.81 finish=47.82
        294710) sh               cpu=1 start=47.81 finish=47.82
      294711) sh               cpu=1 start=58.22 finish=60.70
        294712) bench            cpu=2 start=58.23 finish=60.70
      294713) sh               cpu=8 start=64.70 finish=67.14
        294714) bench            cpu=9 start=64.70 finish=67.14
      294715) sh               cpu=8 start=71.14 finish=73.58
        294716) bench            cpu=1 start=71.14 finish=73.58
      294717) sh               cpu=10 start=73.58 finish=73.58
        294718) sh               cpu=11 start=73.58 finish=73.58
      294719) sh               cpu=10 start=92.97 finish=96.45
        294720) bench            cpu=3 start=92.97 finish=96.45
      294721) sh               cpu=10 start=100.45 finish=103.90
        294722) bench            cpu=11 start=100.45 finish=103.90
      294723) sh               cpu=2 start=107.90 finish=111.35
        294724) bench            cpu=3 start=107.91 finish=111.35
      294725) sh               cpu=3 start=111.35 finish=111.35
        294726) sh               cpu=12 start=111.35 finish=111.35
      294727) sh               cpu=2 start=123.78 finish=126.36
        294728) bench            cpu=3 start=123.78 finish=126.36
      294729) sh               cpu=2 start=130.37 finish=132.90
        294730) bench            cpu=11 start=130.37 finish=132.89
      294731) sh               cpu=10 start=136.90 finish=139.41
        294732) bench            cpu=3 start=136.90 finish=139.41
      294733) sh               cpu=12 start=139.41 finish=139.42
        294734) sh               cpu=5 start=139.41 finish=139.41
      294735) sh               cpu=2 start=155.84 finish=158.60
        294736) bench            cpu=3 start=155.84 finish=158.60
      294737) sh               cpu=10 start=162.61 finish=165.38
        294738) bench            cpu=3 start=162.61 finish=165.38
      294740) sh               cpu=10 start=169.38 finish=172.19
        294741) bench            cpu=11 start=169.38 finish=172.19
      294742) sh               cpu=10 start=172.19 finish=172.19
        294743) sh               cpu=11 start=172.19 finish=172.19
      294745) sh               cpu=2 start=183.37 finish=186.57
        294746) bench            cpu=11 start=183.37 finish=186.57
      294747) sh               cpu=2 start=190.57 finish=193.77
        294748) bench            cpu=3 start=190.57 finish=193.76
      294750) sh               cpu=2 start=197.77 finish=200.99
        294751) bench            cpu=3 start=197.77 finish=200.98
      294752) sh               cpu=2 start=200.99 finish=200.99
        294753) sh               cpu=3 start=200.99 finish=200.99
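
To pull apart the individual invocations, the trace lines above can be parsed for run times and the idle gaps between runs. This is a minimal sketch: the regular expression is written against the line format shown here (`pid) name cpu=N start=S finish=F`) and assumes process names contain no spaces; `parse` and `bench_runs` are hypothetical helper names.

```python
import re

# Matches lines such as "294704) bench   cpu=4 start=29.18 finish=32.71".
LINE = re.compile(
    r"(?P<pid>\d+)\)\s+(?P<comm>\S+)\s+cpu=(?P<cpu>\d+)\s+"
    r"start=(?P<start>[\d.]+)\s+finish=(?P<finish>[\d.]+)"
)


def parse(trace):
    """Yield (name, start, finish) for each recognizable trace line."""
    for line in trace.splitlines():
        m = LINE.search(line)
        if m:
            yield m["comm"], float(m["start"]), float(m["finish"])


def bench_runs(trace):
    """Return (start, finish) pairs for the bench processes only."""
    return [(s, f) for comm, s, f in parse(trace) if comm == "bench"]


# First three invocations copied from the trace.
sample = """\
294703) sh               cpu=0 start=29.18 finish=32.71
  294704) bench            cpu=4 start=29.18 finish=32.71
294705) sh               cpu=0 start=36.72 finish=40.27
  294706) bench            cpu=9 start=36.72 finish=40.27
294707) sh               cpu=0 start=44.27 finish=47.81
  294708) bench            cpu=1 start=44.27 finish=47.81
"""

runs = bench_runs(sample)
durations = [finish - start for start, finish in runs]
gaps = [nxt[0] - cur[1] for cur, nxt in zip(runs, runs[1:])]

print("durations:", [round(d, 2) for d in durations])
print("gaps:     ", [round(g, 2) for g in gaps])
```

For these three invocations each bench run lasts about 3.5 seconds and is followed by a gap of about 4 seconds before the next one starts. Grouping the durations by position in the trace would be one way to look for contrasts between FFT sizes, although the trace itself does not say which size each invocation ran.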