{"id":2546,"date":"2024-09-05T01:34:06","date_gmt":"2024-09-05T01:34:06","guid":{"rendered":"https:\/\/mvermeulen.org\/perf\/?page_id=2546"},"modified":"2024-09-05T14:32:39","modified_gmt":"2024-09-05T14:32:39","slug":"xnnpack","status":"publish","type":"page","link":"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/xnnpack\/","title":{"rendered":"xnnpack"},"content":{"rendered":"\n<p>XNNPACK is a Google library of high-efficiency floating-point neural network inference operators, used by other frameworks. The benchmark runs a sequence of nine operations, each using all cores.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/09\/systemtime-1.png\" alt=\"\" class=\"wp-image-2554\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/09\/systemtime-1.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/09\/systemtime-1-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/09\/systemtime-1-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>The topdown profile is mostly backend bound, though the mix varies among the operations.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/09\/amdtopdown-1.png\" alt=\"\" class=\"wp-image-2556\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/09\/amdtopdown-1.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/09\/amdtopdown-1-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/09\/amdtopdown-1-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>AMD metrics show backend stalls averaging 71% and split 
between memory and core. The frontend and speculation stalls are small. There is a moderate amount of floating point.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              931.826\non_cpu               0.959          # 15.34 \/ 16 cores\nutime                14257.581\nstime                33.219\nnvcsw                10099          # 9.45%\nnivcsw               96767          # 90.55%\ninblock              8              # 0.01\/sec\nonblock              17472          # 18.75\/sec\ncpu-clock            14294818899155 # 14294.819 seconds\ntask-clock           14294890468287 # 14294.890 seconds\npage faults          944654         # 66.083\/sec\ncontext switches     111341         # 7.789\/sec\ncpu migrations       285            # 0.020\/sec\nmajor page faults    2              # 0.000\/sec\nminor page faults    944652         # 66.083\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             2082975726202  # 39.545 branches per 1000 inst\nbranch misses        15827594675    # 0.76% branch miss\nconditional          1918405492130  # 36.420 conditional branches per 1000 inst\nindirect             38210190153    # 0.725 indirect branches per 1000 inst\ncpu-cycles           56773076641271 # 3.81 GHz\ninstructions         52309348452668 # 0.92 IPC\nslots                113599063615566 #\nretiring             19097588418675 # 16.8% (23.5%)\n-- ucode             66616981438    #     0.1%\n-- fastpath          19030971437237 #    16.8%\nfrontend             3440658672312  #  3.0% ( 4.2%) low\n-- latency           1913602394898  #     1.7%\n-- bandwidth         1527056277414  #     1.3%\nbackend              58361692315514 # 51.4% (71.7%) high\n-- cpu               28346810089652 #    25.0%\n-- memory            30014882225862 #    26.4%\nspeculation          497091553516   #  0.4% ( 0.6%) low\n-- branch mispredict 310968324668   #     0.3%\n-- pipeline restart  186123228848   #     
0.2%\nsmt-contention       32201569235059 # 28.3% ( 0.0%)\ncpu-cycles           57035182888061 # 3.81 GHz\ninstructions         52381979244633 # 0.92 IPC\ninstructions         17467779825069 # 61.298 l2 access per 1000 inst\nl2 hit from l1       659001928214   # 26.58% l2 miss\nl2 miss from l1      71983136525    #\nl2 hit from l2 pf    199094421089   #\nl3 hit from l2 pf    195194100072   #\nl3 miss from l2 pf   17449215935    #\ninstructions         17446199487461 # 72.488 float per 1000 inst\nfloat 512            62             # 0.000 AVX-512 per 1000 inst\nfloat 256            10591429521    # 0.607 AVX-256 per 1000 inst\nfloat 128            1254051102509  # 71.881 AVX-128 per 1000 inst\nfloat MMX            0              # 0.000 MMX per 1000 inst\nfloat scalar         112            # 0.000 scalar per 1000 inst\ninstructions         52817290218833 #\nopcache              5292338275116  # 100.201 opcache per 1000 inst\nopcache miss         117763496128   #  2.2% opcache miss rate\nl1 dTLB miss         23326497562    # 0.442 L1 dTLB per 1000 inst\nl2 dTLB miss         4634341166     # 0.088 L2 dTLB per 1000 inst\ninstructions         52695988373276 #\nicache               164780128406   # 3.127 icache per 1000 inst\nicache miss          20642440619    # 12.5% icache miss rate\nl1 iTLB miss         7388292        # 0.000 L1 iTLB per 1000 inst\nl2 iTLB miss         0              # 0.000 L2 iTLB per 1000 inst\ntlb flush            70176          # 0.000 TLB flush per 1000 inst\n<\/code><\/pre>\n\n\n\n<p>Intel metrics show largest percentage of memory stalls are L1 and then L3.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              970.960\non_cpu               0.954          # 15.26 \/ 16 cores\nutime                14796.665\nstime                16.394\nnvcsw                8948           # 1.51%\nnivcsw               582911         # 98.49%\ninblock              392            # 0.40\/sec\nonblock              5968           # 
6.15\/sec\ncpu-clock            14816772460134 # 14816.772 seconds\ntask-clock           14816918352865 # 14816.918 seconds\npage faults          792902         # 53.513\/sec\ncontext switches     596500         # 40.258\/sec\ncpu migrations       5685           # 0.384\/sec\nmajor page faults    0              # 0.000\/sec\nminor page faults    792902         # 53.513\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             1460766975019  # 36.514 branches per 1000 inst\nbranch misses        9973175448     # 0.68% branch miss\nconditional          1460767229579  # 36.514 conditional branches per 1000 inst\nindirect             486302274617   # 12.156 indirect branches per 1000 inst\nslots                70348736732066 #\nretiring             26079489676296 # 37.1% (37.1%)\n-- ucode             3101142779251  #     4.4%\n-- fastpath          22978346897045 #    32.7%\nfrontend             8773456230690  # 12.5% (12.5%)\n-- latency           7397857194056  #    10.5%\n-- bandwidth         1375599036634  #     2.0%\nbackend              34824709912208 # 49.5% (49.5%)\n-- cpu               19735277735638 #    28.1%\n-- memory            15089432176570 #    21.4%\nspeculation          696075894965   #  1.0% ( 1.0%) low\n-- branch mispredict 600904200955   #     0.9%\n-- pipeline restart  95171694010    #     0.1%\nsmt-contention       0              #  0.0% ( 0.0%)\ncpu-cycles           70960300997299 # 1.54 GHz\ninstructions         71462354213740 # 1.01 IPC\nl2 access            1915667603277  # 27.541 l2 access per 1000 inst\nl2 miss              749955564088   # 39.15% l2 miss\ncpu-cycles           23457104488956 # 31.0% memory latency\nload stalls          7146853814664  # 16.1% l1 bound\nl1 miss              3377786328787  #  2.6% l2 bound\nl2 miss              2773148184220  #  9.9% l3 bound\nl3 miss              451171967163   #  1.9% dram bound\nstore_stalls         125913079992   #  0.5% 
store bound\n<\/code><\/pre>\n\n\n\n<p>Process profile shows most time spent in an end2end-bench driver with ~8000 invocations<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>8439 processes<br>\t8133 end2end-bench        18530586.84 38821.55<br>\t 36 clinfo                   4.11     2.24<br>\t 38 vulkaninfo               1.32     1.14<br>\t  4 vulkani:disk$0           0.14     0.12<br>\t  6 php                      0.09     0.12<br>\t  2 llvmpipe-0               0.07     0.06<br>\t  2 llvmpipe-1               0.07     0.06<br>\t  2 llvmpipe-10              0.07     0.06<br>\t  2 llvmpipe-11              0.07     0.06<br>\t  2 llvmpipe-12              0.07     0.06<br>\t  2 llvmpipe-13              0.07     0.06<br>\t  2 llvmpipe-14              0.07     0.06<br>\t  2 llvmpipe-15              0.07     0.06<br>\t  2 llvmpipe-2               0.07     0.06<br>\t  2 llvmpipe-3               0.07     0.06<br>\t  2 llvmpipe-4               0.07     0.06<br>\t  2 llvmpipe-5               0.07     0.06<br>\t  2 llvmpipe-6               0.07     0.06<br>\t  2 llvmpipe-7               0.07     0.06<br>\t  2 llvmpipe-8               0.07     0.06<br>\t  2 llvmpipe-9               0.07     0.06<br>\t  6 clang                    0.06     0.06<br>\t  3 rocminfo                 0.03     0.00<br>\t  1 lspci                    0.00     0.02<br>\t 85 sh                       0.00     0.00<br>\t 13 gcc                      0.00     0.00<br>\t  8 gsettings                0.00     0.00<br>\t  8 stat                     0.00     0.00<br>\t  8 systemd-detect-          0.00     0.00<br>\t  6 llvm-link                0.00     0.00<br>\t  5 glxinfo                  0.00     0.00<br>\t  5 gmain                    0.00     0.00<br>\t  5 phoronix-test-s          0.00     0.00<br>\t  3 dconf worker             0.00     0.00<br>\t  3 xnnpack                  0.00     0.00<br>\t  2 cc                       0.00     0.00<br>\t  2 dmesg                    0.00     0.00<br>\t  2 grep                  
   0.00     0.00<br>\t  2 lscpu                    0.00     0.00<br>\t  2 setterm                  0.00     0.00<br>\t  2 uname                    0.00     0.00<br>\t  2 which                    0.00     0.00<br>\t  1 date                     0.00     0.00<br>\t  1 dirname                  0.00     0.00<br>\t  1 dmidecode                0.00     0.00<br>\t  1 ifconfig                 0.00     0.00<br>\t  1 ip                       0.00     0.00<br>\t  1 lsmod                    0.00     0.00<br>\t  1 mktemp                   0.00     0.00<br>\t  1 ps                       0.00     0.00<br>\t  1 qdbus                    0.00     0.00<br>\t  1 readlink                 0.00     0.00<br>\t  1 realpath                 0.00     0.00<br>\t  1 sed                      0.00     0.00<br>\t  1 sort                     0.00     0.00<br>\t  1 stty                     0.00     0.00<br>\t  1 systemctl                0.00     0.00<br>\t  1 template.sh              0.00     0.00<br><br><br><\/code><\/pre>\n\n\n\n<p>Process profile shows a lot of short driver calls<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>      1133056) xnnpack          cpu=4 start=5.23  finish=312.52\n        1133057) end2end-bench    cpu=0 start=5.24  finish=312.48\n          1133058) end2end-bench    cpu=10 start=5.24  finish=5.33 \n          1133059) end2end-bench    cpu=5 start=5.24  finish=5.33 \n          1133060) end2end-bench    cpu=6 start=5.24  finish=5.33 \n          1133061) end2end-bench    cpu=15 start=5.24  finish=5.33 \n          1133062) end2end-bench    cpu=0 start=5.24  finish=5.33 \n          1133063) end2end-bench    cpu=11 start=5.24  finish=5.33 \n          1133064) end2end-bench    cpu=4 start=5.24  finish=5.33 \n          1133065) end2end-bench    cpu=13 start=5.24  finish=5.33 \n          1133066) end2end-bench    cpu=1 start=5.24  finish=5.33 \n          1133067) end2end-bench    cpu=2 start=5.24  finish=5.33 \n          1133068) end2end-bench    cpu=14 start=5.24  finish=5.33 \n    
      1133069) end2end-bench    cpu=7 start=5.24  finish=5.33 \n          1133070) end2end-bench    cpu=8 start=5.24  finish=5.33 \n          1133071) end2end-bench    cpu=3 start=5.24  finish=5.33 \n          1133072) end2end-bench    cpu=12 start=5.24  finish=5.33 \n          1133073) end2end-bench    cpu=0 start=5.34  finish=5.44 \n          1133074) end2end-bench    cpu=6 start=5.34  finish=5.44 \n          1133075) end2end-bench    cpu=15 start=5.34  finish=5.44 \n          1133076) end2end-bench    cpu=9 start=5.34  finish=5.44 \n          1133077) end2end-bench    cpu=12 start=5.34  finish=5.44 \n          1133078) end2end-bench    cpu=13 start=5.34  finish=5.44 \n          1133079) end2end-bench    cpu=2 start=5.34  finish=5.43 \n          1133080) end2end-bench    cpu=3 start=5.34  finish=5.43 \n          1133081) end2end-bench    cpu=14 start=5.34  finish=5.43 \n          1133082) end2end-bench    cpu=7 start=5.34  finish=5.43 \n          1133083) end2end-bench    cpu=8 start=5.34  finish=5.43 \n          1133084) end2end-bench    cpu=5 start=5.34  finish=5.43 \n          1133085) end2end-bench    cpu=1 start=5.34  finish=5.43 \n          1133086) end2end-bench    cpu=10 start=5.34  finish=5.43 \n          1133087) end2end-bench    cpu=4 start=5.34  finish=5.43 \n          1133088) end2end-bench    cpu=4 start=5.44  finish=5.73 \n          1133089) end2end-bench    cpu=6 start=5.44  finish=5.73 \n          1133090) end2end-bench    cpu=9 start=5.44  finish=5.73 \n          1133091) end2end-bench    cpu=2 start=5.44  finish=5.73 \n          1133092) end2end-bench    cpu=0 start=5.44  finish=5.73 \n          1133093) end2end-bench    cpu=15 start=5.44  finish=5.73 \n          1133094) end2end-bench    cpu=13 start=5.44  finish=5.73 \n          1133095) end2end-bench    cpu=3 start=5.44  finish=5.73 \n          1133096) end2end-bench    cpu=5 start=5.44  finish=5.73 \n          1133097) end2end-bench    cpu=14 start=5.44  finish=5.73 \n          1133098) 
end2end-bench    cpu=1 start=5.44  finish=5.73 \n          1133099) end2end-bench    cpu=10 start=5.44  finish=5.73 \n          1133100) end2end-bench    cpu=8 start=5.44  finish=5.73 \n          1133101) end2end-bench    cpu=7 start=5.44  finish=5.73 \n          1133102) end2end-bench    cpu=12 start=5.44  finish=5.73 \n          1133103) end2end-bench    cpu=6 start=5.73  finish=8.37 \n          1133104) end2end-bench    cpu=0 start=5.73  finish=8.37 \n          1133105) end2end-bench    cpu=9 start=5.73  finish=8.37 \n          1133106) end2end-bench    cpu=13 start=5.73  finish=8.37 \n...<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Google library for high efficiency floating-point neural network inference operators. Used by other frameworks. There is a sequence of nine operations. These run on all cores. Topdown profile shows mostly backend bound but a mix among the operations. AMD metrics <span class=\"excerpt-dots\">&hellip;<\/span> <a class=\"more-link\" href=\"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/xnnpack\/\"><span class=\"more-msg\">Continue reading 
&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":58,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-2546","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/2546","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/comments?post=2546"}],"version-history":[{"count":2,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/2546\/revisions"}],"predecessor-version":[{"id":2557,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/2546\/revisions\/2557"}],"up":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/58"}],"wp:attachment":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/media?parent=2546"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}