{"id":714,"date":"2024-01-20T00:51:21","date_gmt":"2024-01-20T00:51:21","guid":{"rendered":"https:\/\/mvermeulen.org\/perf\/?page_id=714"},"modified":"2024-01-20T11:34:10","modified_gmt":"2024-01-20T11:34:10","slug":"deepsparse","status":"publish","type":"page","link":"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/deepsparse\/","title":{"rendered":"deepsparse"},"content":{"rendered":"\n<p>A neural network framework with 24 different workloads. Mostly runs on half of the logical cores.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-39.png\" alt=\"\" class=\"wp-image-720\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-39.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-39-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-39-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>Topdown metrics show a backend-bound workload with occasional frontend stalls.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-77.png\" alt=\"\" class=\"wp-image-722\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-77.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-77-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-77-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>AMD topdown metrics show a small amount of floating point and not many branches. 
Backend stalls are attributed somewhat more to CPU (39.8%) than to memory (28.4%).<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              3569.478\non_cpu               0.326          # 5.22 \/ 16 cores\nutime                18324.364\nstime                307.259\nnvcsw                11222907       # 99.54%\nnivcsw               52164          # 0.46%\ninblock              1228440        # 344.15\/sec\nonblock              42214160       # 11826.42\/sec\ncpu-clock            18625705808601 # 18625.706 seconds\ntask-clock           18629500805795 # 18629.501 seconds\npage faults          69392523       # 3724.873\/sec\ncontext switches     11292506       # 606.163\/sec\ncpu migrations       5906           # 0.317\/sec\nmajor page faults    243            # 0.013\/sec\nminor page faults    69392280       # 3724.860\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             3840677894116  # 42.361 branches per 1000 inst\nbranch misses        101209872829   # 2.64% branch miss\nconditional          2983734079416  # 32.909 conditional branches per 1000 inst\nindirect             118343443749   # 1.305 indirect branches per 1000 inst\ncpu-cycles           78591628660514 # 1.38 GHz\ninstructions         90362073448677 # 1.15 IPC\nslots                157193781261678 #\nretiring             33408779346212 # 21.3% (21.3%)\n-- ucode             107489339676   #     0.1%\n-- fastpath          33301290006536 #    21.2%\nfrontend             14717913722454 #  9.4% ( 9.4%)\n-- latency           11303657501748 #     7.2%\n-- bandwidth         3414256220706  #     2.2%\nbackend              107252745611847 # 68.2% (68.3%)\n-- cpu               62638564249864 #    39.8%\n-- memory            44614181361983 #    28.4%\nspeculation          1668870353449  #  1.1% ( 1.1%)\n-- branch mispredict 1552840004870  #     1.0%\n-- pipeline restart  116030348579   #     0.1%\nsmt-contention       145376168253   #  0.1% ( 0.0%)\ncpu-cycles  
         78661479112363 # 1.38 GHz\ninstructions         90267718612366 # 1.15 IPC\ninstructions         30088349387486 # 156.378 l2 access per 1000 inst\nl2 hit from l1       3522876544780  # 11.60% l2 miss\nl2 miss from l1      250597431430   #\nl2 hit from l2 pf    887065395908   #\nl3 hit from l2 pf    228456679605   #\nl3 miss from l2 pf   66743478461    #\ninstructions         30088835388315 # 50.472 float per 1000 inst\nfloat 512            133            # 0.000 AVX-512 per 1000 inst\nfloat 256            381138356114   # 12.667 AVX-256 per 1000 inst\nfloat 128            1137465319616  # 37.804 AVX-128 per 1000 inst\nfloat MMX            0              # 0.000 MMX per 1000 inst\nfloat scalar         29596531       # 0.001 scalar per 1000 inst\n<\/code><\/pre>\n\n\n\n<p>Intel metrics show ~3 of 16 cores.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              3686.610\non_cpu               0.189          # 3.03 \/ 16 cores\nutime                10969.732\nstime                203.687\nnvcsw                6807043        # 98.55%\nnivcsw               100101         # 1.45%\ninblock              1998880        # 542.20\/sec\nonblock              48694912       # 13208.59\/sec\ncpu-clock            11168917301443 # 11168.917 seconds\ntask-clock           11169935020733 # 11169.935 seconds\npage faults          69133490       # 6189.247\/sec\ncontext switches     6925161        # 619.982\/sec\ncpu migrations       4705           # 0.421\/sec\nmajor page faults    3716           # 0.333\/sec\nminor page faults    69129774       # 6188.915\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             4876027591344  # 42.096 branches per 1000 inst\nbranch misses        45054577411    # 0.92% branch miss\nconditional          4876027683984  # 42.096 conditional branches per 1000 inst\nindirect             132760442627   # 1.146 indirect branches per 1000 inst\nslots                
249817192973432 #\nretiring             113578962480686 # 45.5% (45.5%)\n-- ucode             4381356017946  #     1.8%\n-- fastpath          109197606462740 #    43.7%\nfrontend             15248899292191 #  6.1% ( 6.1%)\n-- latency           9155034273169  #     3.7%\n-- bandwidth         6093865019022  #     2.4%\nbackend              115519281492570 # 46.2% (46.2%)\n-- cpu               81476276360751 #    32.6%\n-- memory            34043005131819 #    13.6%\nspeculation          6054177131575  #  2.4% ( 2.4%)\n-- branch mispredict 5526964401535  #     2.2%\n-- pipeline restart  527212730040   #     0.2%\nsmt-contention       0              #  0.0% ( 0.0%)\ncpu-cycles           41759566746190 # 0.72 GHz\ninstructions         115237922937775 # 2.76 IPC\nl2 access            6658511399645  # 57.800 l2 access per 1000 inst\nl2 miss              1068977335236  # 16.05% l2 miss\n<\/code><\/pre>\n\n\n\n<p>Process overview shows an internal benchmark process<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>1637 processes\n\t1152 deepsparse.benc        517.65  1067.38\n\t 68 clinfo                  19.83     6.33\n\t 38 vulkaninfo               0.97     1.53\n\t  6 glxinfo:gdrv0            0.19     0.09\n\t  6 php                      0.16     0.40\n\t  4 vulkani:disk$0           0.11     0.17\n\t  2 glxinfo                  0.09     0.03\n\t  2 glxinfo:cs0              0.09     0.03\n\t  2 glxinfo:disk$0           0.09     0.03\n\t  2 glxinfo:sh0              0.09     0.03\n\t  2 glxinfo:shlo0            0.09     0.03\n\t  2 llvmpipe-0               0.06     0.09\n\t  2 llvmpipe-1               0.06     0.09\n\t  2 llvmpipe-10              0.06     0.09\n\t  2 llvmpipe-11              0.06     0.09\n\t  2 llvmpipe-12              0.06     0.09\n\t  2 llvmpipe-13              0.06     0.09\n\t  2 llvmpipe-14              0.06     0.09\n\t  2 llvmpipe-15              0.06     0.09\n\t  2 llvmpipe-3               0.06     0.09\n\t  2 llvmpipe-4               0.06     
0.09\n\t  2 llvmpipe-5               0.06     0.09\n\t  2 llvmpipe-6               0.06     0.09\n\t  2 llvmpipe-7               0.06     0.09\n\t  2 llvmpipe-8               0.06     0.09\n\t  2 llvmpipe-9               0.06     0.09\n\t  2 llvmpipe-2               0.06     0.08\n\t  6 clang                    0.05     0.07\n\t  3 rocminfo                 0.03     0.00\n\t  1 lspci                    0.01     0.02\n\t  1 ps                       0.00     0.01\n\t 81 sh                       0.00     0.00\n\t 72 arch.bin                 0.00     0.00\n\t 72 deepsparse               0.00     0.00\n\t 12 gcc                      0.00     0.00\n\t  9 gsettings                0.00     0.00\n\t  8 stat                     0.00     0.00\n\t  8 systemd-detect-          0.00     0.00\n\t  6 llvm-link                0.00     0.00\n\t  5 phoronix-test-s          0.00     0.00\n\t  4 gmain                    0.00     0.00\n\t  3 dconf worker             0.00     0.00\n\t  2 lscpu                    0.00     0.00\n\t  2 uname                    0.00     0.00\n\t  2 which                    0.00     0.00\n\t  2 xset                     0.00     0.00\n\t  1 cc                       0.00     0.00\n\t  1 date                     0.00     0.00\n\t  1 dirname                  0.00     0.00\n\t  1 dmesg                    0.00     0.00\n\t  1 dmidecode                0.00     0.00\n\t  1 grep                     0.00     0.00\n\t  1 ifconfig                 0.00     0.00\n\t  1 ip                       0.00     0.00\n\t  1 lsmod                    0.00     0.00\n\t  1 mktemp                   0.00     0.00\n\t  1 python                   0.00     0.00\n\t  1 python3                  0.00     0.00\n\t  1 qdbus                    0.00     0.00\n\t  1 readlink                 0.00     0.00\n\t  1 realpath                 0.00     0.00\n\t  1 sed                      0.00     0.00\n\t  1 sort                     0.00     0.00\n\t  1 stty                     0.00     0.00\n\t  1 systemctl 
               0.00     0.00\n\t  1 template.sh              0.00     0.00\n\t  1 wc                       0.00     0.00\n\t  1 xrandr                   0.00     0.00\n0 processes running\n47 maximum processes\n<\/code><\/pre>\n\n\n\n<p>Regular set of processes started<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>      2648224) deepsparse       cpu=14 start=5.84  finish=7.55 \n        2648225) deepsparse.benc  cpu=15 start=5.85  finish=7.54 \n          2648226) deepsparse.benc  cpu=11 start=5.88  finish=7.54 \n          2648227) deepsparse.benc  cpu=5 start=5.88  finish=7.54 \n          2648228) deepsparse.benc  cpu=9 start=5.88  finish=7.54 \n          2648229) deepsparse.benc  cpu=10 start=5.88  finish=7.54 \n          2648230) deepsparse.benc  cpu=12 start=5.88  finish=7.54 \n          2648231) deepsparse.benc  cpu=14 start=5.88  finish=7.54 \n          2648232) deepsparse.benc  cpu=8 start=5.88  finish=7.54 \n          2648233) deepsparse.benc  cpu=0 start=5.88  finish=7.54 \n          2648234) deepsparse.benc  cpu=1 start=5.88  finish=7.54 \n          2648235) deepsparse.benc  cpu=3 start=5.88  finish=7.54 \n          2648236) deepsparse.benc  cpu=13 start=5.88  finish=7.54 \n          2648237) deepsparse.benc  cpu=2 start=5.88  finish=7.54 \n          2648238) deepsparse.benc  cpu=6 start=5.88  finish=7.54 \n          2648239) deepsparse.benc  cpu=4 start=5.88  finish=7.54 \n          2648240) deepsparse.benc  cpu=7 start=5.88  finish=7.54 \n          2648241) arch.bin         cpu=15 start=7.52  finish=7.53 \n<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>A neural network framework with 24 different workloads. Mostly runs on half of the logical cores. Topdown metrics show a backend bound workload but occasional front-end stalls. 
AMD topdown metrics show a small amount of floating point and not many <span class=\"excerpt-dots\">&hellip;<\/span> <a class=\"more-link\" href=\"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/deepsparse\/\"><span class=\"more-msg\">Continue reading &rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":58,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-714","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/714","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/comments?post=714"}],"version-history":[{"count":2,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/714\/revisions"}],"predecessor-version":[{"id":723,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/714\/revisions\/723"}],"up":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/58"}],"wp:attachment":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/media?parent=714"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}