Facebook's Llama model implemented in C/C++. The test offers three models and I ran only the smallest one. The first of the three runs seems quicker than the other two; otherwise this is a fast-running test that keeps a little under half the cores busy.

The topdown profile varies somewhat from run to run, but overall it shows very high backend stalls and low frontend stalls.

The AMD metrics show a moderate amount of floating point (about 128 floating-point instructions per 1000 instructions, all 128-bit) and some L2 misses. Overall, however, memory-bound backend stalls dominate, taking about 60% of all available slots. The output below also shows the “high” and “low” markers I added.

elapsed              128.978
on_cpu               0.403          # 6.45 / 16 cores
utime                802.751
stime                29.595
nvcsw                3121           # 25.65%
nivcsw               9049           # 74.35%
inblock              0              # 0.00/sec
onblock              14976          # 116.11/sec
cpu-clock            834030785056   # 834.031 seconds
task-clock           834038425512   # 834.038 seconds
page faults          408869         # 490.228/sec
context switches     12606          # 15.114/sec
cpu migrations       1976           # 2.369/sec
major page faults    18             # 0.022/sec
minor page faults    408851         # 490.206/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             326046896419   # 61.105 branches per 1000 inst
branch misses        5579250038     # 1.71% branch miss
conditional          303320131491   # 56.846 conditional branches per 1000 inst
indirect             2509260760     # 0.470 indirect branches per 1000 inst
cpu-cycles           4282152900761  # 1.75 GHz
instructions         6160699287190  # 1.44 IPC
slots                8778375490104  #
retiring             1984204239811  # 22.6% (22.6%)
-- ucode             1016982872     #     0.0%
-- fastpath          1983187256939  #    22.6%
frontend             425233076222   #  4.8% ( 4.8%) low
-- latency           365603861874   #     4.2%
-- bandwidth         59629214348    #     0.7%
backend              6337124379199  # 72.2% (72.3%) high
-- cpu               1028969144665  #    11.7%
-- memory            5308155234534  #    60.5%
speculation          23594960441    #  0.3% ( 0.3%) low
-- branch mispredict 23239203975    #     0.3%
-- pipeline restart  355756466      #     0.0%
smt-contention       8215691503     #  0.1% ( 0.0%)
cpu-cycles           5696554478782  # 1.90 GHz
instructions         8184302765251  # 1.44 IPC
instructions         2759506227804  # 35.157 l2 access per 1000 inst
l2 hit from l1       65442572379    # 22.50% l2 miss
l2 miss from l1      1987747212     #
l2 hit from l2 pf    11727059547    #
l3 hit from l2 pf    496781684      #
l3 miss from l2 pf   19348396440    #
instructions         2757346957013  # 127.940 float per 1000 inst
float 512            53             # 0.000 AVX-512 per 1000 inst
float 256            596            # 0.000 AVX-256 per 1000 inst
float 128            352775686064   # 127.940 AVX-128 per 1000 inst
float MMX            0              # 0.000 MMX per 1000 inst
float scalar         0              # 0.000 scalar per 1000 inst
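
As a cross-check on the AMD listing above, the summary figures can be recomputed from the raw counters. This is a minimal sketch, not the tool's actual code. It assumes that on_cpu is (utime + stime) / elapsed divided by the core count, that IPC is instructions over cpu-cycles, and that each topdown percentage is the counter divided by slots. The variable and function names are hypothetical.

```python
# Raw values copied from the AMD run above.
ELAPSED = 128.978             # wall-clock seconds
UTIME = 802.751               # user CPU seconds
STIME = 29.595                # system CPU seconds
CORES = 16

CYCLES = 4282152900761        # first cpu-cycles line
INSTRUCTIONS = 6160699287190  # first instructions line

SLOTS = 8778375490104
BACKEND = 6337124379199
BACKEND_MEMORY = 5308155234534


def busy_cores(utime, stime, elapsed):
    """Average number of cores kept busy (assumed definition)."""
    return (utime + stime) / elapsed


def percent(part, whole):
    """Share of `whole` taken by `part`, in percent."""
    return 100.0 * part / whole


cores_busy = busy_cores(UTIME, STIME, ELAPSED)
on_cpu = cores_busy / CORES
ipc = INSTRUCTIONS / CYCLES
backend_pct = percent(BACKEND, SLOTS)
memory_pct = percent(BACKEND_MEMORY, SLOTS)

print(f"busy cores   {cores_busy:.2f} of {CORES}")
print(f"on_cpu       {on_cpu:.3f}")
print(f"IPC          {ipc:.2f}")
print(f"backend      {backend_pct:.1f}% of slots")
print(f"-- memory    {memory_pct:.1f}% of slots")
```

Under these assumptions the recomputed values agree with the annotations in the listing, which suggests the percentages are taken against slots rather than against stalled slots only.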

Intel metrics

elapsed              239.452
on_cpu               0.693          # 11.09 / 16 cores
utime                2162.827
stime                492.449
nvcsw                3539           # 12.09%
nivcsw               25740          # 87.91%
inblock              0              # 0.00/sec
onblock              5136           # 21.45/sec
cpu-clock            2657716156554  # 2657.716 seconds
task-clock           2657758773079  # 2657.759 seconds
page faults          529091         # 199.074/sec
context switches     30300          # 11.401/sec
cpu migrations       5084           # 1.913/sec
major page faults    27             # 0.010/sec
minor page faults    529064         # 199.064/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             1288959932154  # 101.427 branches per 1000 inst
branch misses        2034096364     # 0.16% branch miss
conditional          1288960487290  # 101.427 conditional branches per 1000 inst
indirect             329525395006   # 25.930 indirect branches per 1000 inst
slots                12514986885902 #
retiring             5925087936848  # 47.3% (47.3%)
-- ucode             826542358038   #     6.6%
-- fastpath          5098545578810  #    40.7%
frontend             1791866046493  # 14.3% (14.3%)
-- latency           842163481633   #     6.7%
-- bandwidth         949702564860   #     7.6%
backend              4736701088681  # 37.8% (37.8%)
-- cpu               2305606850907  #    18.4%
-- memory            2431094237774  #    19.4%
speculation          60467120680    #  0.5% ( 0.5%) low
-- branch mispredict 50619440821    #     0.4%
-- pipeline restart  9847679859     #     0.1%
smt-contention       0              #  0.0% ( 0.0%)
cpu-cycles           15159327437532 # 1.00 GHz
instructions         34380603646888 # 2.27 IPC
l2 access            259418071488   # 8.897 l2 access per 1000 inst
l2 miss              161943722185   # 62.43% l2 miss
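
To put the two listings above next to each other, the level-1 topdown categories can be turned into percentages of slots for each run. This is a rough sketch that only reuses the counters printed above; the dictionary layout and the `shares` helper are hypothetical, and it assumes each percentage is simply the counter divided by slots.

```python
# Level-1 topdown slot counts copied from the two listings above.
RUNS = {
    "AMD": {
        "slots": 8778375490104,
        "retiring": 1984204239811,
        "frontend": 425233076222,
        "backend": 6337124379199,
        "speculation": 23594960441,
    },
    "Intel": {
        "slots": 12514986885902,
        "retiring": 5925087936848,
        "frontend": 1791866046493,
        "backend": 4736701088681,
        "speculation": 60467120680,
    },
}
CATEGORIES = ("retiring", "frontend", "backend", "speculation")


def shares(run):
    """Return each category as a percentage of the run's slots."""
    return {name: 100.0 * run[name] / run["slots"] for name in CATEGORIES}


table = {label: shares(run) for label, run in RUNS.items()}

# Print one row per category, one column per run.
print(f"{'category':<12}" + "".join(f"{label:>8}" for label in table))
for name in CATEGORIES:
    row = "".join(f"{table[label][name]:>7.1f}%" for label in table)
    print(f"{name:<12}{row}")
```

Laid out this way, the difference is easy to see: the backend share is roughly twice as large in the AMD listing, while the retiring share is roughly twice as large in the Intel listing.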

The process overview shows many “main” processes:

7945 processes
	7594 main                 1483052.64 62245.03
	 68 clinfo                  16.20     6.34
	 38 vulkaninfo               1.13     1.15
	  4 vulkani:disk$0           0.12     0.13
	  6 glxinfo:gdrv0            0.11     0.07
	  6 glxinfo:gl0              0.11     0.06
	  6 php                      0.07     0.10
	  2 llvmpipe-0               0.06     0.07
	  2 llvmpipe-1               0.06     0.07
	  2 llvmpipe-10              0.06     0.07
	  2 llvmpipe-11              0.06     0.07
	  2 llvmpipe-12              0.06     0.07
	  2 llvmpipe-13              0.06     0.07
	  2 llvmpipe-14              0.06     0.07
	  2 llvmpipe-15              0.06     0.07
	  2 llvmpipe-2               0.06     0.07
	  2 llvmpipe-3               0.06     0.07
	  2 llvmpipe-4               0.06     0.07
	  2 llvmpipe-5               0.06     0.07
	  2 llvmpipe-6               0.06     0.07
	  2 llvmpipe-7               0.06     0.07
	  2 llvmpipe-8               0.06     0.07
	  2 llvmpipe-9               0.06     0.07
	  6 clang                    0.06     0.06
	  2 glxinfo                  0.05     0.03
	  2 glxinfo:cs0              0.05     0.03
	  2 glxinfo:disk$0           0.05     0.03
	  2 glxinfo:sh0              0.05     0.03
	  2 glxinfo:shlo0            0.05     0.03
	  3 rocminfo                 0.03     0.00
	  1 lspci                    0.00     0.02
	  1 ps                       0.00     0.01
	 82 sh                       0.00     0.00
	 13 gcc                      0.00     0.00
	 10 gsettings                0.00     0.00
	  8 stat                     0.00     0.00
	  8 systemd-detect-          0.00     0.00
	  6 llvm-link                0.00     0.00
	  5 phoronix-test-s          0.00     0.00
	  4 gmain                    0.00     0.00
	  3 llama-cpp                0.00     0.00
	  2 cc                       0.00     0.00
	  2 dconf worker             0.00     0.00
	  2 lscpu                    0.00     0.00
	  2 uname                    0.00     0.00
	  2 which                    0.00     0.00
	  2 xset                     0.00     0.00
	  1 date                     0.00     0.00
	  1 dirname                  0.00     0.00
	  1 dmesg                    0.00     0.00
	  1 dmidecode                0.00     0.00
	  1 grep                     0.00     0.00
	  1 ifconfig                 0.00     0.00
	  1 ip                       0.00     0.00
	  1 lsmod                    0.00     0.00
	  1 mktemp                   0.00     0.00
	  1 qdbus                    0.00     0.00
	  1 readlink                 0.00     0.00
	  1 realpath                 0.00     0.00
	  1 sed                      0.00     0.00
	  1 sort                     0.00     0.00
	  1 stty                     0.00     0.00
	  1 systemctl                0.00     0.00
	  1 template.sh              0.00     0.00
	  1 wc                       0.00     0.00
	  1 xrandr                   0.00     0.00
0 processes running
47 maximum processes

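I won’t list all 7000+ processes, but the overall structure follows this pattern: llama-cpp starts “main”, which creates 15 children that last for the whole run, followed by repeated batches of seven short-lived children that start and finish together.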

      1082877) llama-cpp        cpu=12 start=10.12 finish=63.86
        1082878) main             cpu=15 start=10.12 finish=63.85
          1082879) main             cpu=15 start=10.13 finish=63.85
          1082880) main             cpu=2 start=10.13 finish=63.85
          1082881) main             cpu=1 start=10.13 finish=63.85
          1082882) main             cpu=6 start=10.13 finish=63.85
          1082883) main             cpu=0 start=10.13 finish=63.85
          1082884) main             cpu=13 start=10.13 finish=63.85
          1082885) main             cpu=12 start=10.13 finish=63.85
          1082886) main             cpu=14 start=10.13 finish=63.85
          1082887) main             cpu=7 start=10.13 finish=63.85
          1082888) main             cpu=10 start=10.13 finish=63.85
          1082889) main             cpu=9 start=10.13 finish=63.85
          1082890) main             cpu=3 start=10.13 finish=63.85
          1082891) main             cpu=8 start=10.13 finish=63.85
          1082892) main             cpu=5 start=10.13 finish=63.85
          1082893) main             cpu=4 start=10.13 finish=63.85
          1082894) main             cpu=15 start=10.58 finish=10.70
          1082895) main             cpu=8 start=10.58 finish=10.70
          1082896) main             cpu=9 start=10.58 finish=10.70
          1082897) main             cpu=10 start=10.58 finish=10.70
          1082898) main             cpu=3 start=10.58 finish=10.70
          1082899) main             cpu=4 start=10.58 finish=10.70
          1082900) main             cpu=5 start=10.58 finish=10.70
          1082901) main             cpu=8 start=10.70 finish=11.27
          1082902) main             cpu=7 start=10.70 finish=11.27
          1082903) main             cpu=9 start=10.70 finish=11.27
          1082904) main             cpu=10 start=10.70 finish=11.27
          1082905) main             cpu=5 start=10.70 finish=11.27
          1082906) main             cpu=12 start=10.70 finish=11.27
          1082907) main             cpu=11 start=10.70 finish=11.27
          1082908) main             cpu=15 start=11.27 finish=11.37
          1082909) main             cpu=10 start=11.27 finish=11.37
          1082910) main             cpu=0 start=11.27 finish=11.37
          1082911) main             cpu=3 start=11.27 finish=11.37
          1082912) main             cpu=5 start=11.27 finish=11.37
          1082913) main             cpu=9 start=11.27 finish=11.37
          1082914) main             cpu=4 start=11.27 finish=11.37
          1082915) main             cpu=0 start=11.37 finish=11.47
          1082916) main             cpu=2 start=11.37 finish=11.47
          1082917) main             cpu=3 start=11.37 finish=11.47
          1082918) main             cpu=5 start=11.37 finish=11.47
          1082919) main             cpu=12 start=11.37 finish=11.47
          1082920) main             cpu=15 start=11.37 finish=11.47
          1082921) main             cpu=1 start=11.37 finish=11.47
          1082922) main             cpu=7 start=11.47 finish=11.56
          1082923) main             cpu=9 start=11.47 finish=11.56
          1082924) main             cpu=11 start=11.47 finish=11.56
          1082925) main             cpu=0 start=11.47 finish=11.56
          1082926) main             cpu=5 start=11.47 finish=11.56
          1082927) main             cpu=12 start=11.47 finish=11.56
          1082928) main             cpu=10 start=11.47 finish=11.56
          1082929) main             cpu=12 start=11.56 finish=11.66
          1082930) main             cpu=0 start=11.56 finish=11.66
          1082931) main             cpu=13 start=11.56 finish=11.66
          1082932) main             cpu=15 start=11.56 finish=11.66
          1082933) main             cpu=1 start=11.56 finish=11.66
          1082934) main             cpu=11 start=11.56 finish=11.66
          1082935) main             cpu=10 start=11.56 finish=11.66
          1082936) main             cpu=11 start=11.66 finish=11.76
          1082937) main             cpu=4 start=11.66 finish=11.76
          1082938) main             cpu=0 start=11.66 finish=11.76
          1082939) main             cpu=15 start=11.66 finish=11.76
          1082940) main             cpu=9 start=11.66 finish=11.76
          1082941) main             cpu=13 start=11.66 finish=11.76
          1082942) main             cpu=10 start=11.66 finish=11.76
          1082943) main             cpu=10 start=11.76 finish=11.85
          1082944) main             cpu=13 start=11.76 finish=11.85
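
The batching in the listing above is easier to see when the entries are grouped by their start and finish times. The following is a small sketch that does this for an excerpt of the listing. The regular expression and helper names are hypothetical, and it assumes every line has the `pid) name cpu=N start=S finish=F` shape shown above, with a name that contains no spaces.

```python
import re
from collections import Counter

# Excerpt copied from the process tree above (pids 1082892 to 1082907).
LISTING = """\
          1082892) main             cpu=5 start=10.13 finish=63.85
          1082893) main             cpu=4 start=10.13 finish=63.85
          1082894) main             cpu=15 start=10.58 finish=10.70
          1082895) main             cpu=8 start=10.58 finish=10.70
          1082896) main             cpu=9 start=10.58 finish=10.70
          1082897) main             cpu=10 start=10.58 finish=10.70
          1082898) main             cpu=3 start=10.58 finish=10.70
          1082899) main             cpu=4 start=10.58 finish=10.70
          1082900) main             cpu=5 start=10.58 finish=10.70
          1082901) main             cpu=8 start=10.70 finish=11.27
          1082902) main             cpu=7 start=10.70 finish=11.27
          1082903) main             cpu=9 start=10.70 finish=11.27
          1082904) main             cpu=10 start=10.70 finish=11.27
          1082905) main             cpu=5 start=10.70 finish=11.27
          1082906) main             cpu=12 start=10.70 finish=11.27
          1082907) main             cpu=11 start=10.70 finish=11.27
"""

# Assumed line shape: "<pid>) <name> cpu=<n> start=<s> finish=<s>".
LINE_RE = re.compile(
    r"(?P<pid>\d+)\)\s+(?P<name>\S+)\s+cpu=(?P<cpu>\d+)"
    r"\s+start=(?P<start>[\d.]+)\s+finish=(?P<finish>[\d.]+)"
)


def parse(listing):
    """Return (name, start, finish) for every line that matches."""
    rows = []
    for line in listing.splitlines():
        match = LINE_RE.search(line)
        if match:
            rows.append(
                (match["name"], float(match["start"]), float(match["finish"]))
            )
    return rows


def batches(rows):
    """Count how many entries share the same (start, finish) pair."""
    return Counter((start, finish) for _name, start, finish in rows)


rows = parse(LISTING)
groups = batches(rows)
for (start, finish), count in sorted(groups.items()):
    lifetime = finish - start
    print(f"start={start:6.2f} finish={finish:6.2f} "
          f"lifetime={lifetime:6.2f}s count={count}")
```

Run over the full listing rather than this excerpt, the same grouping would separate the long-lived entries from the short-lived batches without reading through thousands of lines by hand.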