{"id":914,"date":"2024-01-26T01:29:14","date_gmt":"2024-01-26T01:29:14","guid":{"rendered":"https:\/\/mvermeulen.org\/perf\/?page_id=914"},"modified":"2024-01-28T19:12:04","modified_gmt":"2024-01-28T19:12:04","slug":"llama-cpp","status":"publish","type":"page","link":"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/llama-cpp\/","title":{"rendered":"llama.cpp"},"content":{"rendered":"\n<p>Facebook Llama model in C\/C++. There are three models and I ran only the smallest one. The first of the three runs seems quicker than the other two, but otherwise this is a fast-running test that uses about half the cores.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-73.png\" alt=\"\" class=\"wp-image-995\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-73.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-73-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-73-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>The topdown profile varies somewhat between runs, but overall it shows very high backend stalls and low frontend stalls.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-110.png\" alt=\"\" class=\"wp-image-997\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-110.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-110-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-110-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" 
\/><\/figure>\n\n\n\n<p>AMD metrics include a moderate amount of floating point (about 128 float instructions per 1000) and some L2 misses. Overall, however, memory-bound backend stalls dominate at 60.5% of all pipeline slots. The output below also shows the &#8220;high&#8221; and &#8220;low&#8221; markers I added.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              128.978\non_cpu               0.403          # 6.45 \/ 16 cores\nutime                802.751\nstime                29.595\nnvcsw                3121           # 25.65%\nnivcsw               9049           # 74.35%\ninblock              0              # 0.00\/sec\nonblock              14976          # 116.11\/sec\ncpu-clock            834030785056   # 834.031 seconds\ntask-clock           834038425512   # 834.038 seconds\npage faults          408869         # 490.228\/sec\ncontext switches     12606          # 15.114\/sec\ncpu migrations       1976           # 2.369\/sec\nmajor page faults    18             # 0.022\/sec\nminor page faults    408851         # 490.206\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             326046896419   # 61.105 branches per 1000 inst\nbranch misses        5579250038     # 1.71% branch miss\nconditional          303320131491   # 56.846 conditional branches per 1000 inst\nindirect             2509260760     # 0.470 indirect branches per 1000 inst\ncpu-cycles           4282152900761  # 1.75 GHz\ninstructions         6160699287190  # 1.44 IPC\nslots                8778375490104  #\nretiring             1984204239811  # 22.6% (22.6%)\n-- ucode             1016982872     #     0.0%\n-- fastpath          1983187256939  #    22.6%\nfrontend             425233076222   #  4.8% ( 4.8%) low\n-- latency           365603861874   #     4.2%\n-- bandwidth         59629214348    #     0.7%\nbackend              6337124379199  # 72.2% (72.3%) high\n-- cpu               1028969144665  #    11.7%\n-- memory            5308155234534  #    
60.5%\nspeculation          23594960441    #  0.3% ( 0.3%) low\n-- branch mispredict 23239203975    #     0.3%\n-- pipeline restart  355756466      #     0.0%\nsmt-contention       8215691503     #  0.1% ( 0.0%)\ncpu-cycles           5696554478782  # 1.90 GHz\ninstructions         8184302765251  # 1.44 IPC\ninstructions         2759506227804  # 35.157 l2 access per 1000 inst\nl2 hit from l1       65442572379    # 22.50% l2 miss\nl2 miss from l1      1987747212     #\nl2 hit from l2 pf    11727059547    #\nl3 hit from l2 pf    496781684      #\nl3 miss from l2 pf   19348396440    #\ninstructions         2757346957013  # 127.940 float per 1000 inst\nfloat 512            53             # 0.000 AVX-512 per 1000 inst\nfloat 256            596            # 0.000 AVX-256 per 1000 inst\nfloat 128            352775686064   # 127.940 AVX-128 per 1000 inst\nfloat MMX            0              # 0.000 MMX per 1000 inst\nfloat scalar         0              # 0.000 scalar per 1000 inst\n<\/code><\/pre>\n\n\n\n<p>Intel metrics<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              239.452\non_cpu               0.693          # 11.09 \/ 16 cores\nutime                2162.827\nstime                492.449\nnvcsw                3539           # 12.09%\nnivcsw               25740          # 87.91%\ninblock              0              # 0.00\/sec\nonblock              5136           # 21.45\/sec\ncpu-clock            2657716156554  # 2657.716 seconds\ntask-clock           2657758773079  # 2657.759 seconds\npage faults          529091         # 199.074\/sec\ncontext switches     30300          # 11.401\/sec\ncpu migrations       5084           # 1.913\/sec\nmajor page faults    27             # 0.010\/sec\nminor page faults    529064         # 199.064\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             1288959932154  # 101.427 branches per 1000 inst\nbranch misses        2034096364     # 0.16% 
branch miss\nconditional          1288960487290  # 101.427 conditional branches per 1000 inst\nindirect             329525395006   # 25.930 indirect branches per 1000 inst\nslots                12514986885902 #\nretiring             5925087936848  # 47.3% (47.3%)\n-- ucode             826542358038   #     6.6%\n-- fastpath          5098545578810  #    40.7%\nfrontend             1791866046493  # 14.3% (14.3%)\n-- latency           842163481633   #     6.7%\n-- bandwidth         949702564860   #     7.6%\nbackend              4736701088681  # 37.8% (37.8%)\n-- cpu               2305606850907  #    18.4%\n-- memory            2431094237774  #    19.4%\nspeculation          60467120680    #  0.5% ( 0.5%) low\n-- branch mispredict 50619440821    #     0.4%\n-- pipeline restart  9847679859     #     0.1%\nsmt-contention       0              #  0.0% ( 0.0%)\ncpu-cycles           15159327437532 # 1.00 GHz\ninstructions         34380603646888 # 2.27 IPC\nl2 access            259418071488   # 8.897 l2 access per 1000 inst\nl2 miss              161943722185   # 62.43% l2 miss\n<\/code><\/pre>\n\n\n\n<p>Process overview gives many &#8220;main&#8221; processes<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>7945 processes\n\t7594 main                 1483052.64 62245.03\n\t 68 clinfo                  16.20     6.34\n\t 38 vulkaninfo               1.13     1.15\n\t  4 vulkani:disk$0           0.12     0.13\n\t  6 glxinfo:gdrv0            0.11     0.07\n\t  6 glxinfo:gl0              0.11     0.06\n\t  6 php                      0.07     0.10\n\t  2 llvmpipe-0               0.06     0.07\n\t  2 llvmpipe-1               0.06     0.07\n\t  2 llvmpipe-10              0.06     0.07\n\t  2 llvmpipe-11              0.06     0.07\n\t  2 llvmpipe-12              0.06     0.07\n\t  2 llvmpipe-13              0.06     0.07\n\t  2 llvmpipe-14              0.06     0.07\n\t  2 llvmpipe-15              0.06     0.07\n\t  2 llvmpipe-2               0.06     0.07\n\t  2 llvmpipe-3               
0.06     0.07\n\t  2 llvmpipe-4               0.06     0.07\n\t  2 llvmpipe-5               0.06     0.07\n\t  2 llvmpipe-6               0.06     0.07\n\t  2 llvmpipe-7               0.06     0.07\n\t  2 llvmpipe-8               0.06     0.07\n\t  2 llvmpipe-9               0.06     0.07\n\t  6 clang                    0.06     0.06\n\t  2 glxinfo                  0.05     0.03\n\t  2 glxinfo:cs0              0.05     0.03\n\t  2 glxinfo:disk$0           0.05     0.03\n\t  2 glxinfo:sh0              0.05     0.03\n\t  2 glxinfo:shlo0            0.05     0.03\n\t  3 rocminfo                 0.03     0.00\n\t  1 lspci                    0.00     0.02\n\t  1 ps                       0.00     0.01\n\t 82 sh                       0.00     0.00\n\t 13 gcc                      0.00     0.00\n\t 10 gsettings                0.00     0.00\n\t  8 stat                     0.00     0.00\n\t  8 systemd-detect-          0.00     0.00\n\t  6 llvm-link                0.00     0.00\n\t  5 phoronix-test-s          0.00     0.00\n\t  4 gmain                    0.00     0.00\n\t  3 llama-cpp                0.00     0.00\n\t  2 cc                       0.00     0.00\n\t  2 dconf worker             0.00     0.00\n\t  2 lscpu                    0.00     0.00\n\t  2 uname                    0.00     0.00\n\t  2 which                    0.00     0.00\n\t  2 xset                     0.00     0.00\n\t  1 date                     0.00     0.00\n\t  1 dirname                  0.00     0.00\n\t  1 dmesg                    0.00     0.00\n\t  1 dmidecode                0.00     0.00\n\t  1 grep                     0.00     0.00\n\t  1 ifconfig                 0.00     0.00\n\t  1 ip                       0.00     0.00\n\t  1 lsmod                    0.00     0.00\n\t  1 mktemp                   0.00     0.00\n\t  1 qdbus                    0.00     0.00\n\t  1 readlink                 0.00     0.00\n\t  1 realpath                 0.00     0.00\n\t  1 sed                      0.00     0.00\n\t  1 
sort                     0.00     0.00\n\t  1 stty                     0.00     0.00\n\t  1 systemctl                0.00     0.00\n\t  1 template.sh              0.00     0.00\n\t  1 wc                       0.00     0.00\n\t  1 xrandr                   0.00     0.00\n0 processes running\n47 maximum processes\n<\/code><\/pre>\n\n\n\n<p>I won&#8217;t list all 7000+ processes, but the overall structure follows this pattern:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>      1082877) llama-cpp        cpu=12 start=10.12 finish=63.86\n        1082878) main             cpu=15 start=10.12 finish=63.85\n          1082879) main             cpu=15 start=10.13 finish=63.85\n          1082880) main             cpu=2 start=10.13 finish=63.85\n          1082881) main             cpu=1 start=10.13 finish=63.85\n          1082882) main             cpu=6 start=10.13 finish=63.85\n          1082883) main             cpu=0 start=10.13 finish=63.85\n          1082884) main             cpu=13 start=10.13 finish=63.85\n          1082885) main             cpu=12 start=10.13 finish=63.85\n          1082886) main             cpu=14 start=10.13 finish=63.85\n          1082887) main             cpu=7 start=10.13 finish=63.85\n          1082888) main             cpu=10 start=10.13 finish=63.85\n          1082889) main             cpu=9 start=10.13 finish=63.85\n          1082890) main             cpu=3 start=10.13 finish=63.85\n          1082891) main             cpu=8 start=10.13 finish=63.85\n          1082892) main             cpu=5 start=10.13 finish=63.85\n          1082893) main             cpu=4 start=10.13 finish=63.85\n          1082894) main             cpu=15 start=10.58 finish=10.70\n          1082895) main             cpu=8 start=10.58 finish=10.70\n          1082896) main             cpu=9 start=10.58 finish=10.70\n          1082897) main             cpu=10 start=10.58 finish=10.70\n          1082898) main             cpu=3 start=10.58 finish=10.70\n          1082899) main            
 cpu=4 start=10.58 finish=10.70\n          1082900) main             cpu=5 start=10.58 finish=10.70\n          1082901) main             cpu=8 start=10.70 finish=11.27\n          1082902) main             cpu=7 start=10.70 finish=11.27\n          1082903) main             cpu=9 start=10.70 finish=11.27\n          1082904) main             cpu=10 start=10.70 finish=11.27\n          1082905) main             cpu=5 start=10.70 finish=11.27\n          1082906) main             cpu=12 start=10.70 finish=11.27\n          1082907) main             cpu=11 start=10.70 finish=11.27\n          1082908) main             cpu=15 start=11.27 finish=11.37\n          1082909) main             cpu=10 start=11.27 finish=11.37\n          1082910) main             cpu=0 start=11.27 finish=11.37\n          1082911) main             cpu=3 start=11.27 finish=11.37\n          1082912) main             cpu=5 start=11.27 finish=11.37\n          1082913) main             cpu=9 start=11.27 finish=11.37\n          1082914) main             cpu=4 start=11.27 finish=11.37\n          1082915) main             cpu=0 start=11.37 finish=11.47\n          1082916) main             cpu=2 start=11.37 finish=11.47\n          1082917) main             cpu=3 start=11.37 finish=11.47\n          1082918) main             cpu=5 start=11.37 finish=11.47\n          1082919) main             cpu=12 start=11.37 finish=11.47\n          1082920) main             cpu=15 start=11.37 finish=11.47\n          1082921) main             cpu=1 start=11.37 finish=11.47\n          1082922) main             cpu=7 start=11.47 finish=11.56\n          1082923) main             cpu=9 start=11.47 finish=11.56\n          1082924) main             cpu=11 start=11.47 finish=11.56\n          1082925) main             cpu=0 start=11.47 finish=11.56\n          1082926) main             cpu=5 start=11.47 finish=11.56\n          1082927) main             cpu=12 start=11.47 finish=11.56\n          1082928) main             cpu=10 
start=11.47 finish=11.56\n          1082929) main             cpu=12 start=11.56 finish=11.66\n          1082930) main             cpu=0 start=11.56 finish=11.66\n          1082931) main             cpu=13 start=11.56 finish=11.66\n          1082932) main             cpu=15 start=11.56 finish=11.66\n          1082933) main             cpu=1 start=11.56 finish=11.66\n          1082934) main             cpu=11 start=11.56 finish=11.66\n          1082935) main             cpu=10 start=11.56 finish=11.66\n          1082936) main             cpu=11 start=11.66 finish=11.76\n          1082937) main             cpu=4 start=11.66 finish=11.76\n          1082938) main             cpu=0 start=11.66 finish=11.76\n          1082939) main             cpu=15 start=11.66 finish=11.76\n          1082940) main             cpu=9 start=11.66 finish=11.76\n          1082941) main             cpu=13 start=11.66 finish=11.76\n          1082942) main             cpu=10 start=11.66 finish=11.76\n          1082943) main             cpu=10 start=11.76 finish=11.85\n          1082944) main             cpu=13 start=11.76 finish=11.85\n<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Facebook Llama model in C\/C++. There are three models and I ran only the smallest one. The first of the three runs seems quicker than the other two, but otherwise this is a fast-running test that uses about half the cores. 
Topdown profile has a <span class=\"excerpt-dots\">&hellip;<\/span> <a class=\"more-link\" href=\"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/llama-cpp\/\"><span class=\"more-msg\">Continue reading &rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":58,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-914","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/914","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/comments?post=914"}],"version-history":[{"count":3,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/914\/revisions"}],"predecessor-version":[{"id":998,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/914\/revisions\/998"}],"up":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/58"}],"wp:attachment":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/media?parent=914"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}