{"id":1953,"date":"2024-03-03T19:02:43","date_gmt":"2024-03-03T19:02:43","guid":{"rendered":"https:\/\/mvermeulen.org\/perf\/?page_id=1953"},"modified":"2024-03-04T01:36:38","modified_gmt":"2024-03-04T01:36:38","slug":"deepspeech","status":"publish","type":"page","link":"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/deepspeech\/","title":{"rendered":"deepspeech"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">A speech-to-text process that uses TensorFlow to transcribe a three-minute audio recording. A single workload that runs for about a minute and is repeated three times. It looks to be single-threaded.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/systemtime-14.png\" alt=\"\" class=\"wp-image-1966\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/systemtime-14.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/systemtime-14-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/systemtime-14-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Topdown profile looks to be backend bound.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/amdtopdown-15.png\" alt=\"\" class=\"wp-image-1968\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/amdtopdown-15.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/amdtopdown-15-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/amdtopdown-15-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p 
class=\"wp-block-paragraph\">AMD metrics confirm a backend-stall heavy workload with low frontend stalls. There is not much floating point. A good amount of L2 access.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              160.615\non_cpu               0.070          # 1.12 \/ 16 cores\nutime                169.116\nstime                10.486\nnvcsw                487526         # 99.62%\nnivcsw               1847           # 0.38%\ninblock              0              # 0.00\/sec\nonblock              12568          # 78.25\/sec\ncpu-clock            177998788938   # 177.999 seconds\ntask-clock           178430160190   # 178.430 seconds\npage faults          235378         # 1319.160\/sec\ncontext switches     489985         # 2746.088\/sec\ncpu migrations       23465          # 131.508\/sec\nmajor page faults    3416           # 19.145\/sec\nminor page faults    231962         # 1300.016\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             49259821446    # 63.246 branches per 1000 inst\nbranch misses        521285591      # 1.06% branch miss\nconditional          40408849370    # 51.882 conditional branches per 1000 inst\nindirect             1381988941     # 1.774 indirect branches per 1000 inst\ncpu-cycles           696267329926   # 0.24 GHz\ninstructions         772925724731   # 1.11 IPC\nslots                1390055158284  #\nretiring             256880226504   # 18.5% (18.7%)\n-- ucode             151528830      #     0.0%\n-- fastpath          256728697674   #    18.5%\nfrontend             58842049778    #  4.2% ( 4.3%) low\n-- latency           47353735122    #     3.4%\n-- bandwidth         11488314656    #     0.8%\nbackend              1047097332617  # 75.3% (76.2%) high\n-- cpu               130578369248   #     9.4%\n-- memory            916518963369   #    65.9%\nspeculation          11335440378    #  0.8% ( 0.8%) low\n-- branch mispredict 10259441048    #     
0.7%\n-- pipeline restart  1075999330     #     0.1%\nsmt-contention       15848595211    #  1.1% ( 0.0%)\ncpu-cycles           697932624106   # 0.28 GHz\ninstructions         777877029300   # 1.11 IPC\ninstructions         259176178766   # 172.445 l2 access per 1000 inst\nl2 hit from l1       26157068658    # 31.31% l2 miss\nl2 miss from l1      2105517564     #\nl2 hit from l2 pf    6648857572     #\nl3 hit from l2 pf    269934316      #\nl3 miss from l2 pf   11617658376    #\ninstructions         259534288888   # 15.867 float per 1000 inst\nfloat 512            65             # 0.000 AVX-512 per 1000 inst\nfloat 256            2318481        # 0.009 AVX-256 per 1000 inst\nfloat 128            4115648027     # 15.858 AVX-128 per 1000 inst\nfloat MMX            0              # 0.000 MMX per 1000 inst\nfloat scalar         4              # 0.000 scalar per 1000 inst\ninstructions         778907715084   #\nopcache              94567228200    # 121.410 opcache per 1000 inst\nopcache miss         4023676179     #  4.3% opcache miss rate\nl1 dTLB miss         4315198834     # 5.540 L1 dTLB per 1000 inst\nl2 dTLB miss         1044990269     # 1.342 L2 dTLB per 1000 inst\ninstructions         778450113458   #\nicache               7298837934     # 9.376 icache per 1000 inst\nicache miss          1369530995     # 18.8% icache miss rate\nl1 iTLB miss         23755944       # 0.031 L1 iTLB per 1000 inst\nl2 iTLB miss         0              # 0.000 L2 iTLB per 1000 inst\ntlb flush            16742          # 0.000 TLB flush per 1000 inst\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Intel metrics confirm both L3 and dram-bound natures of the memory-bound stalls.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              201.626\non_cpu               0.056          # 0.90 \/ 16 cores\nutime                176.392\nstime                5.582\nnvcsw                468751         # 97.63%\nnivcsw               11384          # 2.37%\ninblock              1208 
          # 5.99\/sec\nonblock              1288           # 6.39\/sec\ncpu-clock            180258810774   # 180.259 seconds\ntask-clock           180510870708   # 180.511 seconds\npage faults          222169         # 1230.779\/sec\ncontext switches     480881         # 2664.000\/sec\ncpu migrations       38123          # 211.195\/sec\nmajor page faults    2428           # 13.451\/sec\nminor page faults    219741         # 1217.328\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             49026825305    # 59.224 branches per 1000 inst\nbranch misses        354171008      # 0.72% branch miss\nconditional          49026840985    # 59.224 conditional branches per 1000 inst\nindirect             3105678778     # 3.752 indirect branches per 1000 inst\nslots                3095032907966  #\nretiring             773986641565   # 25.0% (25.0%)\n-- ucode             10055304339    #     0.3%\n-- fastpath          763931337226   #    24.7%\nfrontend             74881303981    #  2.4% ( 2.4%) low\n-- latency           36919463368    #     1.2%\n-- bandwidth         37961840613    #     1.2%\nbackend              2231130561294  # 72.1% (72.1%) high\n-- cpu               1108840747474  #    35.8%\n-- memory            1122289813820  #    36.3%\nspeculation          42072490016    #  1.4% ( 1.4%)\n-- branch mispredict 39314118809    #     1.3%\n-- pipeline restart  2758371207     #     0.1%\nsmt-contention       0              #  0.0% ( 0.0%)\ncpu-cycles           618989502966   # 0.24 GHz\ninstructions         878893138266   # 1.42 IPC\nl2 access            144912559758   # 188.820 l2 access per 1000 inst\nl2 miss              98850887983    # 68.21% l2 miss\ncpu-cycles           540075543736   # 42.5% memory latency\nload stalls          228742926778   #  0.0% l1 bound\nl1 miss              463786101064   # 47.1% l2 bound\nl2 miss              209158694764   # 14.2% l3 bound\nl3 miss              
132645464628   # 24.6% dram bound\nstore_stalls         624046835      #  0.1% store bound\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Process profile shows time spent on the deepspeech process.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>450 processes\n\t102 deepspeech            5788.84   199.24\n\t 68 clinfo                  16.53     6.33\n\t 38 vulkaninfo               1.34     0.96\n\t  4 vulkani:disk$0           0.15     0.11\n\t  6 glxinfo:gdrv0            0.09     0.09\n\t  6 glxinfo:gl0              0.09     0.09\n\t  2 llvmpipe-0               0.07     0.05\n\t  2 llvmpipe-1               0.07     0.05\n\t  2 llvmpipe-10              0.07     0.05\n\t  2 llvmpipe-11              0.07     0.05\n\t  2 llvmpipe-12              0.07     0.05\n\t  2 llvmpipe-13              0.07     0.05\n\t  2 llvmpipe-14              0.07     0.05\n\t  2 llvmpipe-15              0.07     0.05\n\t  2 llvmpipe-2               0.07     0.05\n\t  2 llvmpipe-3               0.07     0.05\n\t  2 llvmpipe-4               0.07     0.05\n\t  2 llvmpipe-5               0.07     0.05\n\t  2 llvmpipe-6               0.07     0.05\n\t  2 llvmpipe-7               0.07     0.05\n\t  2 llvmpipe-8               0.07     0.05\n\t  2 llvmpipe-9               0.07     0.05\n\t  6 php                      0.06     0.09\n\t  2 glxinfo                  0.06     0.03\n\t  6 clang                    0.05     0.07\n\t  2 glxinfo:cs0              0.05     0.03\n\t  2 glxinfo:disk$0           0.05     0.03\n\t  2 glxinfo:sh0              0.05     0.03\n\t  2 glxinfo:shlo0            0.05     0.03\n\t  3 rocminfo                 0.00     0.03\n\t  1 lspci                    0.00     0.02\n\t 81 sh                       0.00     0.00\n\t 12 gcc                      0.00     0.00\n\t  9 gsettings                0.00     0.00\n\t  8 stat                     0.00     0.00\n\t  8 systemd-detect-          0.00     0.00\n\t  6 llvm-link                0.00     0.00\n\t  5 gmain                  
  0.00     0.00\n\t  5 phoronix-test-s          0.00     0.00\n\t  3 deepspeech-run           0.00     0.00\n\t  2 dconf worker             0.00     0.00\n\t  2 lscpu                    0.00     0.00\n\t  2 uname                    0.00     0.00\n\t  2 which                    0.00     0.00\n\t  2 xset                     0.00     0.00\n\t  1 cc                       0.00     0.00\n\t  1 date                     0.00     0.00\n\t  1 dirname                  0.00     0.00\n\t  1 dmesg                    0.00     0.00\n\t  1 dmidecode                0.00     0.00\n\t  1 grep                     0.00     0.00\n\t  1 ifconfig                 0.00     0.00\n\t  1 ip                       0.00     0.00\n\t  1 lsmod                    0.00     0.00\n\t  1 mktemp                   0.00     0.00\n\t  1 ps                       0.00     0.00\n\t  1 qdbus                    0.00     0.00\n\t  1 readlink                 0.00     0.00\n\t  1 realpath                 0.00     0.00\n\t  1 sed                      0.00     0.00\n\t  1 sort                     0.00     0.00\n\t  1 stty                     0.00     0.00\n\t  1 systemctl                0.00     0.00\n\t  1 template.sh              0.00     0.00\n\t  1 wc                       0.00     0.00\n\t  1 xrandr                   0.00     0.00\n0 processes running\n47 maximum processes\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Computation blocks<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>      906663) deepspeech-run   cpu=5 start=5.99  finish=49.12\n        906664) deepspeech       cpu=5 start=6.00  finish=49.11\n          906665) deepspeech       cpu=4 start=6.01  finish=49.11\n          906666) deepspeech       cpu=14 start=6.01  finish=49.11\n          906667) deepspeech       cpu=13 start=6.01  finish=49.11\n          906668) deepspeech       cpu=2 start=6.01  finish=49.11\n          906669) deepspeech       cpu=9 start=6.01  finish=49.11\n          906670) deepspeech       cpu=7 start=6.01  
finish=49.11\n          906671) deepspeech       cpu=1 start=6.01  finish=49.11\n          906672) deepspeech       cpu=8 start=6.01  finish=49.11\n          906673) deepspeech       cpu=12 start=6.01  finish=49.11\n          906674) deepspeech       cpu=8 start=6.01  finish=49.11\n          906675) deepspeech       cpu=15 start=6.01  finish=49.11\n          906676) deepspeech       cpu=6 start=6.01  finish=49.11\n          906677) deepspeech       cpu=10 start=6.01  finish=49.11\n          906678) deepspeech       cpu=3 start=6.01  finish=49.11\n          906679) deepspeech       cpu=9 start=6.01  finish=49.11\n          906680) deepspeech       cpu=11 start=6.01  finish=49.11\n          906681) deepspeech       cpu=7 start=6.01  finish=49.11\n          906682) deepspeech       cpu=11 start=6.01  finish=49.11\n          906683) deepspeech       cpu=0 start=6.01  finish=49.11\n          906684) deepspeech       cpu=14 start=6.01  finish=49.11\n          906685) deepspeech       cpu=6 start=6.01  finish=49.11\n          906686) deepspeech       cpu=13 start=6.01  finish=49.11\n          906687) deepspeech       cpu=4 start=6.01  finish=49.11\n          906688) deepspeech       cpu=12 start=6.01  finish=49.11\n          906689) deepspeech       cpu=13 start=6.01  finish=49.11\n          906690) deepspeech       cpu=15 start=6.01  finish=49.11\n          906691) deepspeech       cpu=5 start=6.01  finish=49.11\n          906692) deepspeech       cpu=10 start=6.01  finish=49.11\n          906693) deepspeech       cpu=1 start=6.01  finish=49.11\n          906694) deepspeech       cpu=2 start=6.01  finish=49.11\n          906695) deepspeech       cpu=0 start=6.01  finish=49.11\n          906696) deepspeech       cpu=3 start=6.01  finish=49.11\n          906697) deepspeech       cpu=11 start=6.01  finish=49.11\n<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>A speech to text process using Tensorflow to transcribe a three minute audio recording. 
A single workload that runs for about a minute and is repeated three times. It looks to be single-threaded. Topdown profile looks to be backend bound. AMD metrics confirm <span class=\"excerpt-dots\">&hellip;<\/span> <a class=\"more-link\" href=\"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/deepspeech\/\"><span class=\"more-msg\">Continue reading &rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":58,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-1953","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1953","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/comments?post=1953"}],"version-history":[{"count":2,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1953\/revisions"}],"predecessor-version":[{"id":1969,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1953\/revisions\/1969"}],"up":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/58"}],"wp:attachment":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/media?parent=1953"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}