{"id":1123,"date":"2024-01-30T11:20:40","date_gmt":"2024-01-30T11:20:40","guid":{"rendered":"https:\/\/mvermeulen.org\/perf\/?page_id=1123"},"modified":"2024-01-31T00:50:47","modified_gmt":"2024-01-31T00:50:47","slug":"whisper-cpp","status":"publish","type":"page","link":"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/whisper-cpp\/","title":{"rendered":"whisper.cpp"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">A C++ implementation of the OpenAI Whisper model for audio transcription. Three different models are used to transcribe the same audio file. The workload appears to run in parallel on about half the cores. The AMD processor is over 2.5x faster overall on this workload (5530.7 vs. 14368.3 seconds elapsed).<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-89.png\" alt=\"\" class=\"wp-image-1128\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-89.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-89-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-89-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The topdown profile is dominated by backend stalls, and frontend stalls are low. 
A very similar profile is found with llama.cpp, which was written by the same author.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-126.png\" alt=\"\" class=\"wp-image-1130\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-126.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-126-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-126-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The AMD profile shows about half the cores busy (7.60 of 16). There is some floating-point work, though not as much as in other floating-point codes. There is a reasonable number of L2 misses.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              5530.727\non_cpu               0.475          # 7.60 \/ 16 cores\nutime                41948.223\nstime                74.837\nnvcsw                88182          # 21.69%\nnivcsw               318356         # 78.31%\ninblock              121448         # 21.96\/sec\nonblock              39528          # 7.15\/sec\ncpu-clock            43185206747693 # 43185.207 seconds\ntask-clock           43185307800638 # 43185.308 seconds\npage faults          4596579        # 106.438\/sec\ncontext switches     433591         # 10.040\/sec\ncpu migrations       66000          # 1.528\/sec\nmajor page faults    3              # 0.000\/sec\nminor page faults    4596576        # 106.438\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             43085660001008 # 117.442 branches per 1000 inst\nbranch misses        37668494486    # 0.09% branch miss\nconditional          42866388416712 # 116.844 conditional branches per 1000 inst\nindirect           
  45204891247    # 0.123 indirect branches per 1000 inst\ncpu-cycles           170523008924934 # 1.93 GHz\ninstructions         360130923953209 # 2.11 IPC\nslots                353078813046786 #\nretiring             114261747623360 # 32.4% (32.4%)\n-- ucode             687310766168   #     0.2%\n-- fastpath          113574436857192 #    32.2%\nfrontend             10740260701037 #  3.0% ( 3.0%) low\n-- latency           5933745440010  #     1.7%\n-- bandwidth         4806515261027  #     1.4%\nbackend              226502059685765 # 64.2% (64.2%)\n-- cpu               54930607366266 #    15.6%\n-- memory            171571452319499 #    48.6%\nspeculation          1399502085660  #  0.4% ( 0.4%) low\n-- branch mispredict 911735370202   #     0.3%\n-- pipeline restart  487766715458   #     0.1%\nsmt-contention       175114447106   #  0.0% ( 0.0%)\ncpu-cycles           170419137834019 # 1.93 GHz\ninstructions         360038148851712 # 2.11 IPC\ninstructions         122240539336491 # 78.098 l2 access per 1000 inst\nl2 hit from l1       5310074428252  # 39.15% l2 miss\nl2 miss from l1      207073886950   #\nl2 hit from l2 pf    705880443601   #\nl3 hit from l2 pf    3106777843787  #\nl3 miss from l2 pf   424027854896   #\ninstructions         122178608709053 # 66.991 float per 1000 inst\nfloat 512            75             # 0.000 AVX-512 per 1000 inst\nfloat 256            672            # 0.000 AVX-256 per 1000 inst\nfloat 128            8184827126616  # 66.991 AVX-128 per 1000 inst\nfloat MMX            0              # 0.000 MMX per 1000 inst\nfloat scalar         0              # 0.000 scalar per 1000 inst\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Intel metrics<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              14368.253\non_cpu               0.730          # 11.68 \/ 16 cores\nutime                167717.180\nstime                147.268\nnvcsw                97481          # 10.82%\nnivcsw               803547         # 
89.18%\ninblock              6863912        # 477.71\/sec\nonblock              25736          # 1.79\/sec\ncpu-clock            169666851070848 # 169666.851 seconds\ntask-clock           169668782732666 # 169668.783 seconds\npage faults          5269925        # 31.060\/sec\ncontext switches     970642         # 5.721\/sec\ncpu migrations       262405         # 1.547\/sec\nmajor page faults    33             # 0.000\/sec\nminor page faults    5269892        # 31.060\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             227128281463420 # 208.515 branches per 1000 inst\nbranch misses        29121652723    # 0.01% branch miss\nconditional          227128311572956 # 208.515 conditional branches per 1000 inst\nindirect             22920041459410 # 21.042 indirect branches per 1000 inst\nslots                1150135161525986 #\nretiring             651783482407376 # 56.7% (56.7%) high\n-- ucode             4936987792977  #     0.4%\n-- fastpath          646846494614399 #    56.2%\nfrontend             16180417417749 #  1.4% ( 1.4%) low\n-- latency           8941995996169  #     0.8%\n-- bandwidth         7238421421580  #     0.6%\nbackend              479717990945457 # 41.7% (41.7%)\n-- cpu               383188217770082 #    33.3%\n-- memory            96529773175375 #     8.4%\nspeculation          2980965978192  #  0.3% ( 0.3%) low\n-- branch mispredict 657433004912   #     0.1%\n-- pipeline restart  2323532973280  #     0.2%\nsmt-contention       0              #  0.0% ( 0.0%)\ncpu-cycles           245659047612452 # 1.06 GHz\ninstructions         1011662256267344 # 4.12 IPC high\nl2 access            8328352125465  # 10.041 l2 access per 1000 inst\nl2 miss              4426583244304  # 53.15% l2 miss\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The process profile includes almost 500,000 processes.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>496963 processes\n\t496713 main        
         1507285.76 2012754.91\n\t 34 clinfo                   9.26     3.99\n\t 19 vulkaninfo               0.76     0.57\n\t  2 vulkani:disk$0           0.08     0.06\n\t  6 clang                    0.05     0.07\n\t  3 glxinfo:gdrv0            0.05     0.06\n\t  3 glxinfo:gl0              0.05     0.06\n\t  1 llvmpipe-0               0.04     0.03\n\t  1 llvmpipe-1               0.04     0.03\n\t  1 llvmpipe-10              0.04     0.03\n\t  1 llvmpipe-11              0.04     0.03\n\t  1 llvmpipe-12              0.04     0.03\n\t  1 llvmpipe-13              0.04     0.03\n\t  1 llvmpipe-14              0.04     0.03\n\t  1 llvmpipe-15              0.04     0.03\n\t  1 llvmpipe-2               0.04     0.03\n\t  1 llvmpipe-3               0.04     0.03\n\t  1 llvmpipe-4               0.04     0.03\n\t  1 llvmpipe-5               0.04     0.03\n\t  1 llvmpipe-6               0.04     0.03\n\t  1 llvmpipe-7               0.04     0.03\n\t  1 llvmpipe-8               0.04     0.03\n\t  1 llvmpipe-9               0.04     0.03\n\t  1 glxinfo                  0.03     0.02\n\t  1 glxinfo:cs0              0.03     0.02\n\t  1 glxinfo:disk$0           0.03     0.02\n\t  1 glxinfo:sh0              0.03     0.02\n\t  1 glxinfo:shlo0            0.03     0.02\n\t  1 ps                       0.00     0.01\n\t 62 sh                       0.00     0.00\n\t 13 gcc                      0.00     0.00\n\t 11 gsettings                0.00     0.00\n\t  8 systemd-detect-          0.00     0.00\n\t  7 stat                     0.00     0.00\n\t  7 whisper-cpp              0.00     0.00\n\t  6 llvm-link                0.00     0.00\n\t  4 phoronix-test-s          0.00     0.00\n\t  3 gmain                    0.00     0.00\n\t  2 which                    0.00     0.00\n\t  1 cc                       0.00     0.00\n\t  1 date                     0.00     0.00\n\t  1 dconf worker             0.00     0.00\n\t  1 dirname                  0.00     0.00\n\t  1 grep                     0.00 
    0.00\n\t  1 ifconfig                 0.00     0.00\n\t  1 ip                       0.00     0.00\n\t  1 lscpu                    0.00     0.00\n\t  1 mktemp                   0.00     0.00\n\t  1 qdbus                    0.00     0.00\n\t  1 readlink                 0.00     0.00\n\t  1 realpath                 0.00     0.00\n\t  1 sed                      0.00     0.00\n\t  1 sort                     0.00     0.00\n\t  1 stty                     0.00     0.00\n\t  1 systemctl                0.00     0.00\n\t  1 template.sh              0.00     0.00\n\t  1 wc                       0.00     0.00\n\t  1 xrandr                   0.00     0.00\n\t  1 xset                     0.00     0.00\n18 processes running\n47 maximum processes\n<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>A C++ implementation of OpenAI Whisper model for audio transcription. Three different models are used to transcribe the same audio file. Looks like the workload runs in parallel on half the cores. 
The AMD processor is over 2.5x faster overall <span class=\"excerpt-dots\">&hellip;<\/span> <a class=\"more-link\" href=\"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/whisper-cpp\/\"><span class=\"more-msg\">Continue reading &rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":58,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-1123","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1123","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/comments?post=1123"}],"version-history":[{"count":2,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1123\/revisions"}],"predecessor-version":[{"id":1131,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1123\/revisions\/1131"}],"up":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/58"}],"wp:attachment":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/media?parent=1123"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}