{"id":2259,"date":"2024-06-01T08:18:21","date_gmt":"2024-06-01T08:18:21","guid":{"rendered":"https:\/\/mvermeulen.org\/perf\/?page_id=2259"},"modified":"2024-06-01T08:49:20","modified_gmt":"2024-06-01T08:49:20","slug":"onnx","status":"publish","type":"page","link":"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/onnx\/","title":{"rendered":"onnx"},"content":{"rendered":"\n<p>ONNX Runtime with 20 different workloads. These run with varying degrees of parallelism.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/06\/systemtime.png\" alt=\"\" class=\"wp-image-2260\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/06\/systemtime.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/06\/systemtime-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/06\/systemtime-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>The topdown profile is mostly backend bound, with periods of high frontend stalls.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/06\/amdtopdown.png\" alt=\"\" class=\"wp-image-2261\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/06\/amdtopdown.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/06\/amdtopdown-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/06\/amdtopdown-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>AMD metrics show the workload running on about half the cores, with little floating point and a moderate L2 hit rate.  
Backend bound with high memory stalls but also CPU stalls.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              7666.905\non_cpu               0.484          # 7.75 \/ 16 cores\nutime                59267.083\nstime                121.864\nnvcsw                126506         # 56.48%\nnivcsw               97469          # 43.52%\ninblock              8              # 0.00\/sec\nonblock              31152          # 4.06\/sec\ncpu-clock            59392070765107 # 59392.071 seconds\ntask-clock           59392404433514 # 59392.404 seconds\npage faults          75921573       # 1278.304\/sec\ncontext switches     261852         # 4.409\/sec\ncpu migrations       45585          # 0.768\/sec\nmajor page faults    268            # 0.005\/sec\nminor page faults    75921305       # 1278.300\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             13964968401705 # 59.551 branches per 1000 inst\nbranch misses        24286886257    # 0.17% branch miss\nconditional          13150504035833 # 56.078 conditional branches per 1000 inst\nindirect             75766798166    # 0.323 indirect branches per 1000 inst\ncpu-cycles           170936473902854 # 2.13 GHz\ninstructions         149610925486559 # 0.88 IPC\nslots                341882092221648 #\nretiring             50270481479148 # 14.7% (15.8%)\n-- ucode             449679582297   #     0.1%\n-- fastpath          49820801896851 #    14.6%\nfrontend             7839231529047  #  2.3% ( 2.5%) low\n-- latency           3811377380796  #     1.1%\n-- bandwidth         4027854148251  #     1.2%\nbackend              259444684020020 # 75.9% (81.5%) high\n-- cpu               108192204233662 #    31.6%\n-- memory            151252479786358 #    44.2%\nspeculation          667988414339   #  0.2% ( 0.2%) low\n-- branch mispredict 416633463189   #     0.1%\n-- pipeline restart  251354951150   #     0.1%\nsmt-contention       23659541944969 #  6.9% ( 
0.0%)\ncpu-cycles           225365754216755 # 1.98 GHz\ninstructions         209421223495808 # 0.93 IPC\ninstructions         69791759643743 # 102.929 l2 access per 1000 inst\nl2 hit from l1       5092499628764  # 12.04% l2 miss\nl2 miss from l1      187031638880   #\nl2 hit from l2 pf    1413548447320  #\nl3 hit from l2 pf    451119563582   #\nl3 miss from l2 pf   226457942800   #\ninstructions         69772481102765 # 78.194 float per 1000 inst\nfloat 512            167            # 0.000 AVX-512 per 1000 inst\nfloat 256            10196733872    # 0.146 AVX-256 per 1000 inst\nfloat 128            5445615832780  # 78.048 AVX-128 per 1000 inst\nfloat MMX            0              # 0.000 MMX per 1000 inst\nfloat scalar         0              # 0.000 scalar per 1000 inst\ninstructions         140399342837497 #\nopcache              12547602639130 # 89.371 opcache per 1000 inst\nopcache miss         336571040544   #  2.7% opcache miss rate\nl1 dTLB miss         85216624277    # 0.607 L1 dTLB per 1000 inst\nl2 dTLB miss         20923237061    # 0.149 L2 dTLB per 1000 inst\ninstructions         228624533030111 #\nicache               782732232728   # 3.424 icache per 1000 inst\nicache miss          106861319043   # 13.7% icache miss rate\nl1 iTLB miss         4080671841     # 0.018 L1 iTLB per 1000 inst\nl2 iTLB miss         0              # 0.000 L2 iTLB per 1000 inst\ntlb flush            78761          # 0.000 TLB flush per 1000 inst\n<\/code><\/pre>\n\n\n\n<p>Intel metrics show most backend stalls are CPU stalls.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              5583.167\non_cpu               0.727          # 11.64 \/ 16 cores\nutime                64884.355\nstime                77.838\nnvcsw                91202          # 20.23%\nnivcsw               359685         # 79.77%\ninblock              352            # 0.06\/sec\nonblock              18768          # 3.36\/sec\ncpu-clock            64965405314194 # 64965.405 seconds\ntask-clock       
    64965592525832 # 64965.593 seconds\npage faults          60825931       # 936.279\/sec\ncontext switches     478409         # 7.364\/sec\ncpu migrations       61992          # 0.954\/sec\nmajor page faults    709            # 0.011\/sec\nminor page faults    60825222       # 936.268\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             9379557277083  # 50.485 branches per 1000 inst\nbranch misses        22703149215    # 0.24% branch miss\nconditional          9379557342747  # 50.485 conditional branches per 1000 inst\nindirect             1939366372296  # 10.439 indirect branches per 1000 inst\nslots                471739138825190 #\nretiring             128933116492684 # 27.3% (27.3%)\n-- ucode             7027709037148  #     1.5%\n-- fastpath          121905407455536 #    25.8%\nfrontend             28156329685609 #  6.0% ( 6.0%)\n-- latency           20790135084826 #     4.4%\n-- bandwidth         7366194600783  #     1.6%\nbackend              310047888633198 # 65.7% (65.7%)\n-- cpu               245552390551012 #    52.1%\n-- memory            64495498082186 #    13.7%\nspeculation          4944247087919  #  1.0% ( 1.0%)\n-- branch mispredict 2039556773679  #     0.4%\n-- pipeline restart  2904690314240  #     0.6%\nsmt-contention       0              #  0.0% ( 0.0%)\ncpu-cycles           97716947289441 # 1.16 GHz\ninstructions         110341301754781 # 1.13 IPC\nl2 access            3318562539997  # 40.480 l2 access per 1000 inst\nl2 miss              1067638444672  # 32.17% l2 miss\ncpu-cycles           75242602539682 # 25.9% memory latency\nload stalls          18912792284043 #  0.0% l1 bound\nl1 miss              19009622083529 #  5.5% l2 bound\nl2 miss              14894522086162 #  2.9% l3 bound\nl3 miss              12690193751695 # 16.9% dram bound\nstore_stalls         550951507058   #  0.7% store bound\n<\/code><\/pre>\n\n\n\n<p>Process summary shows time in 
onnxruntime_per<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>852 processes\n\t537 onnxruntime_per      292179.20   381.75\n\t 34 clinfo                  10.07     2.99\n\t 19 vulkaninfo               0.95     0.57\n\t  2 vulkani:disk$0           0.10     0.06\n\t  3 glxinfo:gdrv0            0.08     0.03\n\t  3 glxinfo:gl0              0.08     0.03\n\t  6 clang                    0.06     0.06\n\t  1 llvmpipe-0               0.05     0.03\n\t  1 llvmpipe-1               0.05     0.03\n\t  1 llvmpipe-10              0.05     0.03\n\t  1 llvmpipe-11              0.05     0.03\n\t  1 llvmpipe-12              0.05     0.03\n\t  1 llvmpipe-13              0.05     0.03\n\t  1 llvmpipe-14              0.05     0.03\n\t  1 llvmpipe-15              0.05     0.03\n\t  1 llvmpipe-2               0.05     0.03\n\t  1 llvmpipe-3               0.05     0.03\n\t  1 llvmpipe-4               0.05     0.03\n\t  1 llvmpipe-5               0.05     0.03\n\t  1 llvmpipe-6               0.05     0.03\n\t  1 llvmpipe-7               0.05     0.03\n\t  1 llvmpipe-8               0.05     0.03\n\t  1 llvmpipe-9               0.05     0.03\n\t  1 glxinfo                  0.04     0.01\n\t  1 glxinfo:cs0              0.04     0.01\n\t  1 glxinfo:disk$0           0.04     0.01\n\t  1 glxinfo:sh0              0.04     0.01\n\t  1 glxinfo:shlo0            0.04     0.01\n\t 78 sh                       0.00     0.00\n\t 54 onnx                     0.00     0.00\n\t 13 gcc                      0.00     0.00\n\t 10 gsettings                0.00     0.00\n\t  8 systemd-detect-          0.00     0.00\n\t  7 stat                     0.00     0.00\n\t  6 llvm-link                0.00     0.00\n\t  4 gmain                    0.00     0.00\n\t  4 phoronix-test-s          0.00     0.00\n\t  2 which                    0.00     0.00\n\t  1 cc                       0.00     0.00\n\t  1 date                     0.00     0.00\n\t  1 dconf worker             0.00     0.00\n\t  1 dirname                  
0.00     0.00\n\t  1 grep                     0.00     0.00\n\t  1 ifconfig                 0.00     0.00\n\t  1 ip                       0.00     0.00\n\t  1 lscpu                    0.00     0.00\n\t  1 mktemp                   0.00     0.00\n\t  1 ps                       0.00     0.00\n\t  1 python                   0.00     0.00\n\t  1 python3                  0.00     0.00\n\t  1 qdbus                    0.00     0.00\n\t  1 readlink                 0.00     0.00\n\t  1 realpath                 0.00     0.00\n\t  1 sed                      0.00     0.00\n\t  1 sort                     0.00     0.00\n\t  1 stty                     0.00     0.00\n\t  1 systemctl                0.00     0.00\n\t  1 template.sh              0.00     0.00\n\t  1 wc                       0.00     0.00\n\t  1 xrandr                   0.00     0.00\n\t  1 xset                     0.00     0.00\n18 processes running\n47 maximum processes\n<\/code><\/pre>\n\n\n\n<p>Computation blocks are relatively regular.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>      23628) onnx             cpu=1 start=71.35 finish=132.96\n        23629) onnxruntime_per  cpu=5 start=71.35 finish=132.94\n          23630) onnxruntime_per  cpu=3 start=71.84 finish=132.90\n          23631) onnxruntime_per  cpu=4 start=71.84 finish=132.90\n          23632) onnxruntime_per  cpu=6 start=71.84 finish=132.90\n          23633) onnxruntime_per  cpu=15 start=71.84 finish=132.90\n          23634) onnxruntime_per  cpu=8 start=71.84 finish=132.90\n          23635) onnxruntime_per  cpu=2 start=71.84 finish=132.90\n          23636) onnxruntime_per  cpu=1 start=71.84 finish=132.90\n          23637) onnxruntime_per  cpu=14 start=71.85 finish=132.90\n          23638) onnxruntime_per  cpu=7 start=71.85 finish=132.90\n          23639) onnxruntime_per  cpu=12 start=71.85 finish=132.90\n          23640) onnxruntime_per  cpu=0 start=71.85 finish=132.90\n          23641) onnxruntime_per  cpu=11 start=71.85 finish=132.90\n          
23642) onnxruntime_per  cpu=10 start=71.85 finish=132.90\n          23643) onnxruntime_per  cpu=5 start=71.85 finish=132.90\n<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Onnx runtime with 20 different workloads. These run with a variety of different parallelism. Topdown profile shows mostly backend bound with periods of high frontend stalls. AMD metrics show running on half the cores, not much floating point, with moderate <span class=\"excerpt-dots\">&hellip;<\/span> <a class=\"more-link\" href=\"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/onnx\/\"><span class=\"more-msg\">Continue reading &rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":58,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-2259","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/2259","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/comments?post=2259"}],"version-history":[{"count":2,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/2259\/revisions"}],"predecessor-version":[{"id":2279,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/2259\/revisions\/2279"}],"up":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/58"}],"wp:attachment":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/media?parent=2259"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}