This is the Facebook Llama model implemented in C/C++. There are three models and I ran only the smallest one. The first of the three runs seems quicker than the other two, but otherwise this is a fast-running test that uses about half the cores.

The topdown profile varies somewhat from run to run, but overall it shows very high backend stalls and low frontend stalls.

The AMD metrics show a moderate amount of floating point and some L2 misses. Overall, however, memory-bound stalls dominate, accounting for about 60% of the available pipeline slots. This chart also shows the “high” and “low” markers I added.
elapsed 128.978
on_cpu 0.403 # 6.45 / 16 cores
utime 802.751
stime 29.595
nvcsw 3121 # 25.65%
nivcsw 9049 # 74.35%
inblock 0 # 0.00/sec
onblock 14976 # 116.11/sec
cpu-clock 834030785056 # 834.031 seconds
task-clock 834038425512 # 834.038 seconds
page faults 408869 # 490.228/sec
context switches 12606 # 15.114/sec
cpu migrations 1976 # 2.369/sec
major page faults 18 # 0.022/sec
minor page faults 408851 # 490.206/sec
alignment faults 0 # 0.000/sec
emulation faults 0 # 0.000/sec
branches 326046896419 # 61.105 branches per 1000 inst
branch misses 5579250038 # 1.71% branch miss
conditional 303320131491 # 56.846 conditional branches per 1000 inst
indirect 2509260760 # 0.470 indirect branches per 1000 inst
cpu-cycles 4282152900761 # 1.75 GHz
instructions 6160699287190 # 1.44 IPC
slots 8778375490104 #
retiring 1984204239811 # 22.6% (22.6%)
-- ucode 1016982872 # 0.0%
-- fastpath 1983187256939 # 22.6%
frontend 425233076222 # 4.8% ( 4.8%) low
-- latency 365603861874 # 4.2%
-- bandwidth 59629214348 # 0.7%
backend 6337124379199 # 72.2% (72.3%) high
-- cpu 1028969144665 # 11.7%
-- memory 5308155234534 # 60.5%
speculation 23594960441 # 0.3% ( 0.3%) low
-- branch mispredict 23239203975 # 0.3%
-- pipeline restart 355756466 # 0.0%
smt-contention 8215691503 # 0.1% ( 0.0%)
cpu-cycles 5696554478782 # 1.90 GHz
instructions 8184302765251 # 1.44 IPC
instructions 2759506227804 # 35.157 l2 access per 1000 inst
l2 hit from l1 65442572379 # 22.50% l2 miss
l2 miss from l1 1987747212 #
l2 hit from l2 pf 11727059547 #
l3 hit from l2 pf 496781684 #
l3 miss from l2 pf 19348396440 #
instructions 2757346957013 # 127.940 float per 1000 inst
float 512 53 # 0.000 AVX-512 per 1000 inst
float 256 596 # 0.000 AVX-256 per 1000 inst
float 128 352775686064 # 127.940 AVX-128 per 1000 inst
float MMX 0 # 0.000 MMX per 1000 inst
float scalar 0 # 0.000 scalar per 1000 inst
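The derived figures in the comments of the AMD dump can be cross-checked from the raw values. The sketch below is only my reading of how they are computed: I am assuming on_cpu is (utime + stime) / (elapsed × cores), IPC is instructions / cycles, and the branch miss rate is misses / branches. These formulas reproduce the printed numbers, but they are not taken from the tool's source.

```python
# Values copied from the AMD counter dump above.
elapsed = 128.978   # wall-clock seconds
utime = 802.751     # user CPU seconds
stime = 29.595      # system CPU seconds
cores = 16          # from the "/ 16 cores" annotation

# Assumed formula: average number of busy cores, then the fraction of the machine.
busy_cores = (utime + stime) / elapsed
on_cpu = busy_cores / cores

# Assumed formula: instructions per cycle from the topdown counter group.
cycles = 4282152900761
instructions = 6160699287190
ipc = instructions / cycles

# Assumed formula: branch misses as a percentage of all branches.
branches = 326046896419
branch_misses = 5579250038
branch_miss_pct = 100.0 * branch_misses / branches

print(f"busy cores: {busy_cores:.2f} of {cores} (on_cpu {on_cpu:.3f})")
print(f"IPC: {ipc:.2f}")
print(f"branch miss: {branch_miss_pct:.2f}%")
```

The per-1000-instruction figures are left out on purpose: each counter group in the dump reports its own instruction count, so those ratios have to be taken against the count from the same group.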
Intel metrics
elapsed 239.452
on_cpu 0.693 # 11.09 / 16 cores
utime 2162.827
stime 492.449
nvcsw 3539 # 12.09%
nivcsw 25740 # 87.91%
inblock 0 # 0.00/sec
onblock 5136 # 21.45/sec
cpu-clock 2657716156554 # 2657.716 seconds
task-clock 2657758773079 # 2657.759 seconds
page faults 529091 # 199.074/sec
context switches 30300 # 11.401/sec
cpu migrations 5084 # 1.913/sec
major page faults 27 # 0.010/sec
minor page faults 529064 # 199.064/sec
alignment faults 0 # 0.000/sec
emulation faults 0 # 0.000/sec
branches 1288959932154 # 101.427 branches per 1000 inst
branch misses 2034096364 # 0.16% branch miss
conditional 1288960487290 # 101.427 conditional branches per 1000 inst
indirect 329525395006 # 25.930 indirect branches per 1000 inst
slots 12514986885902 #
retiring 5925087936848 # 47.3% (47.3%)
-- ucode 826542358038 # 6.6%
-- fastpath 5098545578810 # 40.7%
frontend 1791866046493 # 14.3% (14.3%)
-- latency 842163481633 # 6.7%
-- bandwidth 949702564860 # 7.6%
backend 4736701088681 # 37.8% (37.8%)
-- cpu 2305606850907 # 18.4%
-- memory 2431094237774 # 19.4%
speculation 60467120680 # 0.5% ( 0.5%) low
-- branch mispredict 50619440821 # 0.4%
-- pipeline restart 9847679859 # 0.1%
smt-contention 0 # 0.0% ( 0.0%)
cpu-cycles 15159327437532 # 1.00 GHz
instructions 34380603646888 # 2.27 IPC
l2 access 259418071488 # 8.897 l2 access per 1000 inst
l2 miss 161943722185 # 62.43% l2 miss
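To compare the two topdown breakdowns side by side, each bucket can be expressed as a share of the total slots. This is a minimal sketch assuming that the percentages in the dumps are simply bucket / slots; the helper name topdown_shares is hypothetical and not part of any tool used here.

```python
def topdown_shares(slots, buckets):
    """Return each bucket as a percentage of the total pipeline slots."""
    return {name: 100.0 * count / slots for name, count in buckets.items()}

# Raw counts copied from the AMD dump.
amd = topdown_shares(8778375490104, {
    "retiring": 1984204239811,
    "frontend": 425233076222,
    "backend": 6337124379199,
    "backend memory": 5308155234534,
    "speculation": 23594960441,
})

# Raw counts copied from the Intel dump.
intel = topdown_shares(12514986885902, {
    "retiring": 5925087936848,
    "frontend": 1791866046493,
    "backend": 4736701088681,
    "backend memory": 2431094237774,
    "speculation": 60467120680,
})

print(f"{'bucket':<16}{'AMD %':>8}{'Intel %':>9}")
for name in amd:
    print(f"{name:<16}{amd[name]:>8.1f}{intel[name]:>9.1f}")
```

Note that "backend memory" is a sub-bucket of "backend", so the rows are not meant to add up to 100.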
The process overview shows many “main” processes
7945 processes
7594 main 1483052.64 62245.03
68 clinfo 16.20 6.34
38 vulkaninfo 1.13 1.15
4 vulkani:disk$0 0.12 0.13
6 glxinfo:gdrv0 0.11 0.07
6 glxinfo:gl0 0.11 0.06
6 php 0.07 0.10
2 llvmpipe-0 0.06 0.07
2 llvmpipe-1 0.06 0.07
2 llvmpipe-10 0.06 0.07
2 llvmpipe-11 0.06 0.07
2 llvmpipe-12 0.06 0.07
2 llvmpipe-13 0.06 0.07
2 llvmpipe-14 0.06 0.07
2 llvmpipe-15 0.06 0.07
2 llvmpipe-2 0.06 0.07
2 llvmpipe-3 0.06 0.07
2 llvmpipe-4 0.06 0.07
2 llvmpipe-5 0.06 0.07
2 llvmpipe-6 0.06 0.07
2 llvmpipe-7 0.06 0.07
2 llvmpipe-8 0.06 0.07
2 llvmpipe-9 0.06 0.07
6 clang 0.06 0.06
2 glxinfo 0.05 0.03
2 glxinfo:cs0 0.05 0.03
2 glxinfo:disk$0 0.05 0.03
2 glxinfo:sh0 0.05 0.03
2 glxinfo:shlo0 0.05 0.03
3 rocminfo 0.03 0.00
1 lspci 0.00 0.02
1 ps 0.00 0.01
82 sh 0.00 0.00
13 gcc 0.00 0.00
10 gsettings 0.00 0.00
8 stat 0.00 0.00
8 systemd-detect- 0.00 0.00
6 llvm-link 0.00 0.00
5 phoronix-test-s 0.00 0.00
4 gmain 0.00 0.00
3 llama-cpp 0.00 0.00
2 cc 0.00 0.00
2 dconf worker 0.00 0.00
2 lscpu 0.00 0.00
2 uname 0.00 0.00
2 which 0.00 0.00
2 xset 0.00 0.00
1 date 0.00 0.00
1 dirname 0.00 0.00
1 dmesg 0.00 0.00
1 dmidecode 0.00 0.00
1 grep 0.00 0.00
1 ifconfig 0.00 0.00
1 ip 0.00 0.00
1 lsmod 0.00 0.00
1 mktemp 0.00 0.00
1 qdbus 0.00 0.00
1 readlink 0.00 0.00
1 realpath 0.00 0.00
1 sed 0.00 0.00
1 sort 0.00 0.00
1 stty 0.00 0.00
1 systemctl 0.00 0.00
1 template.sh 0.00 0.00
1 wc 0.00 0.00
1 xrandr 0.00 0.00
0 processes running
47 maximum processes
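I won’t list all 7000+ processes, but the overall structure follows this pattern. In this excerpt, a llama-cpp parent and 16 long-lived “main” entries are followed by back-to-back batches of seven short-lived “main” entries: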
1082877) llama-cpp cpu=12 start=10.12 finish=63.86
1082878) main cpu=15 start=10.12 finish=63.85
1082879) main cpu=15 start=10.13 finish=63.85
1082880) main cpu=2 start=10.13 finish=63.85
1082881) main cpu=1 start=10.13 finish=63.85
1082882) main cpu=6 start=10.13 finish=63.85
1082883) main cpu=0 start=10.13 finish=63.85
1082884) main cpu=13 start=10.13 finish=63.85
1082885) main cpu=12 start=10.13 finish=63.85
1082886) main cpu=14 start=10.13 finish=63.85
1082887) main cpu=7 start=10.13 finish=63.85
1082888) main cpu=10 start=10.13 finish=63.85
1082889) main cpu=9 start=10.13 finish=63.85
1082890) main cpu=3 start=10.13 finish=63.85
1082891) main cpu=8 start=10.13 finish=63.85
1082892) main cpu=5 start=10.13 finish=63.85
1082893) main cpu=4 start=10.13 finish=63.85
1082894) main cpu=15 start=10.58 finish=10.70
1082895) main cpu=8 start=10.58 finish=10.70
1082896) main cpu=9 start=10.58 finish=10.70
1082897) main cpu=10 start=10.58 finish=10.70
1082898) main cpu=3 start=10.58 finish=10.70
1082899) main cpu=4 start=10.58 finish=10.70
1082900) main cpu=5 start=10.58 finish=10.70
1082901) main cpu=8 start=10.70 finish=11.27
1082902) main cpu=7 start=10.70 finish=11.27
1082903) main cpu=9 start=10.70 finish=11.27
1082904) main cpu=10 start=10.70 finish=11.27
1082905) main cpu=5 start=10.70 finish=11.27
1082906) main cpu=12 start=10.70 finish=11.27
1082907) main cpu=11 start=10.70 finish=11.27
1082908) main cpu=15 start=11.27 finish=11.37
1082909) main cpu=10 start=11.27 finish=11.37
1082910) main cpu=0 start=11.27 finish=11.37
1082911) main cpu=3 start=11.27 finish=11.37
1082912) main cpu=5 start=11.27 finish=11.37
1082913) main cpu=9 start=11.27 finish=11.37
1082914) main cpu=4 start=11.27 finish=11.37
1082915) main cpu=0 start=11.37 finish=11.47
1082916) main cpu=2 start=11.37 finish=11.47
1082917) main cpu=3 start=11.37 finish=11.47
1082918) main cpu=5 start=11.37 finish=11.47
1082919) main cpu=12 start=11.37 finish=11.47
1082920) main cpu=15 start=11.37 finish=11.47
1082921) main cpu=1 start=11.37 finish=11.47
1082922) main cpu=7 start=11.47 finish=11.56
1082923) main cpu=9 start=11.47 finish=11.56
1082924) main cpu=11 start=11.47 finish=11.56
1082925) main cpu=0 start=11.47 finish=11.56
1082926) main cpu=5 start=11.47 finish=11.56
1082927) main cpu=12 start=11.47 finish=11.56
1082928) main cpu=10 start=11.47 finish=11.56
1082929) main cpu=12 start=11.56 finish=11.66
1082930) main cpu=0 start=11.56 finish=11.66
1082931) main cpu=13 start=11.56 finish=11.66
1082932) main cpu=15 start=11.56 finish=11.66
1082933) main cpu=1 start=11.56 finish=11.66
1082934) main cpu=11 start=11.56 finish=11.66
1082935) main cpu=10 start=11.56 finish=11.66
1082936) main cpu=11 start=11.66 finish=11.76
1082937) main cpu=4 start=11.66 finish=11.76
1082938) main cpu=0 start=11.66 finish=11.76
1082939) main cpu=15 start=11.66 finish=11.76
1082940) main cpu=9 start=11.66 finish=11.76
1082941) main cpu=13 start=11.66 finish=11.76
1082942) main cpu=10 start=11.66 finish=11.76
1082943) main cpu=10 start=11.76 finish=11.85
1082944) main cpu=13 start=11.76 finish=11.85
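The batch structure in the listing above can be made explicit by grouping entries that share the same start and finish times. The following is a rough sketch that assumes every line has the form "pid) name cpu=N start=S finish=F" as in the excerpt; the regular expression and the helper name group_by_interval are my own and only illustrative.

```python
import re
from collections import defaultdict

# Matches lines such as "1082894) main cpu=15 start=10.58 finish=10.70".
LINE = re.compile(
    r"(\d+)\)\s+(\S+)\s+cpu=(\d+)\s+start=([\d.]+)\s+finish=([\d.]+)"
)

def group_by_interval(text):
    """Group process ids by their (start, finish) interval."""
    groups = defaultdict(list)
    for match in LINE.finditer(text):
        pid, _name, _cpu, start, finish = match.groups()
        groups[(float(start), float(finish))].append(int(pid))
    return dict(groups)

# A few lines copied from the listing above.
sample = """
1082893) main cpu=4 start=10.13 finish=63.85
1082894) main cpu=15 start=10.58 finish=10.70
1082895) main cpu=8 start=10.58 finish=10.70
1082901) main cpu=8 start=10.70 finish=11.27
1082902) main cpu=7 start=10.70 finish=11.27
"""

groups = group_by_interval(sample)
for (start, finish), pids in sorted(groups.items()):
    lifetime = finish - start
    print(f"start={start:.2f} finish={finish:.2f} "
          f"lifetime={lifetime:.2f}s entries={len(pids)}")
```

Run over the full listing, this would give one long-lived group plus a sequence of short batches, each starting when the previous one finishes.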
