{"id":376,"date":"2024-01-09T02:06:15","date_gmt":"2024-01-09T02:06:15","guid":{"rendered":"https:\/\/mvermeulen.org\/perf\/?page_id=376"},"modified":"2024-01-09T13:03:29","modified_gmt":"2024-01-09T13:03:29","slug":"libxsmm","status":"publish","type":"page","link":"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/libxsmm\/","title":{"rendered":"libxsmm"},"content":{"rendered":"\n<p>libxsmm computes dense and sparse matrix operations. There are four workloads with different characteristics, as shown below. All of them are generally backend\/memory bound, with few frontend or speculation stalls.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-32.png\" alt=\"\" class=\"wp-image-384\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-32.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-32-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-32-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>The AMD metrics show a backend\/memory-bound application with L2 misses, a moderate share of floating-point instructions, few branches, and little speculation.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              1200.458\non_cpu               0.717          # 11.47 \/ 16 cores\nutime                13718.215\nstime                45.335\nnvcsw                4385           # 2.62%\nnivcsw               162934         # 97.38%\ninblock              1000           # 0.83\/sec\nonblock              3960           # 3.30\/sec\ncpu-clock            13768026176988 # 13768.026 seconds\ntask-clock           13768338206084 # 13768.338 seconds\npage faults          8547672        # 620.821\/sec\ncontext switches     173105         # 
12.573\/sec\ncpu migrations       4624           # 0.336\/sec\nmajor page faults    5              # 0.000\/sec\nminor page faults    8547667        # 620.821\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             1857322849343  # 34.536 branches per 1000 inst\nbranch misses        9198990418     # 0.50% branch miss\nconditional          1717674980350  # 31.939 conditional branches per 1000 inst\nindirect             28717902408    # 0.534 indirect branches per 1000 inst\ncpu-cycles           59079522725954 # 3.07 GHz\ninstructions         53707353318899 # 0.91 IPC\nslots                118146870952488 #\nretiring             18332446411746 # 15.5% (16.9%)\n-- ucode             26086845722    #     0.0%\n-- fastpath          18306359566024 #    15.5%\nfrontend             3910888575373  #  3.3% ( 3.6%)\n-- latency           3151894748598  #     2.7%\n-- bandwidth         758993826775   #     0.6%\nbackend              85105805287604 # 72.0% (78.5%)\n-- cpu               22326738329569 #    18.9%\n-- memory            62779066958035 #    53.1%\nspeculation          1089519233419  #  0.9% ( 1.0%)\n-- branch mispredict 373902864471   #     0.3%\n-- pipeline restart  715616368948   #     0.6%\nsmt-contention       9708155311998  #  8.2% ( 0.0%)\ncpu-cycles           59091290358687 # 3.06 GHz\ninstructions         53737511307216 # 0.91 IPC\ninstructions         17913341548162 # 131.561 l2 access per 1000 inst\nl2 hit from l1       1643968275128  # 11.23% l2 miss\nl2 miss from l1      106129979128   #\nl2 hit from l2 pf    554249205024   #\nl3 hit from l2 pf    124053766622   #\nl3 miss from l2 pf   34419515356    #\ninstructions         17910401741829 # 84.494 float per 1000 inst\nfloat 512            43             # 0.000 AVX-512 per 1000 inst\nfloat 256            1008           # 0.000 AVX-256 per 1000 inst\nfloat 128            1513315108501  # 84.494 AVX-128 per 1000 inst\nfloat MMX            
0              # 0.000 MMX per 1000 inst\nfloat scalar         0              # 0.000 scalar per 1000 inst\n<\/code><\/pre>\n\n\n\n<p>Intel metrics<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              2522.838\non_cpu               0.682          # 10.91 \/ 16 cores\nutime                26885.493\nstime                633.641\nnvcsw                155641175      # 99.83%\nnivcsw               269976         # 0.17%\ninblock              7240           # 2.87\/sec\nonblock              4072           # 1.61\/sec\ncpu-clock            27423281062373 # 27423.281 seconds\ntask-clock           27447874273562 # 27447.874 seconds\npage faults          12043704       # 438.785\/sec\ncontext switches     155923601      # 5680.717\/sec\ncpu migrations       349702         # 12.741\/sec\nmajor page faults    97             # 0.004\/sec\nminor page faults    12043607       # 438.781\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             1325105869957  # 18.401 branches per 1000 inst\nbranch misses        6973190243     # 0.53% branch miss\nconditional          1325105891109  # 18.401 conditional branches per 1000 inst\nindirect             568603144299   # 7.896 indirect branches per 1000 inst\nslots                113834189052350 #\nretiring             34522415093139 # 30.3% (30.3%)\n-- ucode             795046442826   #     0.7%\n-- fastpath          33727368650313 #    29.6%\nfrontend             13659591684925 # 12.0% (12.0%)\n-- latency           12302798991273 #    10.8%\n-- bandwidth         1356792693652  #     1.2%\nbackend              64978302757717 # 57.1% (57.1%)\n-- cpu               16016754739937 #    14.1%\n-- memory            48961548017780 #    43.0%\nspeculation          858815841121   #  0.8% ( 0.8%)\n-- branch mispredict 571923334820   #     0.5%\n-- pipeline restart  286892506301   #     0.3%\nsmt-contention       0              #  0.0% ( 0.0%)\ncpu-cycles           
76217336930148 # 1.98 GHz\ninstructions         83494508679423 # 1.10 IPC\nl2 access            3360799226144  # 103.904 l2 access per 1000 inst\nl2 miss              555528752845   # 16.53% l2 miss\n<\/code><\/pre>\n\n\n\n<p>Straightforward process structure<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>560 processes\n\t192 specialized          219147.73   667.84\n\t 64 clinfo                  10.56     3.52\n\t 38 vulkaninfo               0.96     0.95\n\t  6 php                      0.17     0.34\n\t  6 glxinfo:gdrv0            0.12     0.07\n\t  4 vulkani:disk$0           0.11     0.10\n\t  2 llvmpipe-0               0.06     0.05\n\t  2 llvmpipe-1               0.06     0.05\n\t  2 llvmpipe-10              0.06     0.05\n\t  2 llvmpipe-11              0.06     0.05\n\t  2 llvmpipe-12              0.06     0.05\n\t  2 llvmpipe-13              0.06     0.05\n\t  2 llvmpipe-14              0.06     0.05\n\t  2 llvmpipe-15              0.06     0.05\n\t  2 llvmpipe-2               0.06     0.05\n\t  2 llvmpipe-3               0.06     0.05\n\t  2 llvmpipe-4               0.06     0.05\n\t  2 llvmpipe-5               0.06     0.05\n\t  2 llvmpipe-6               0.06     0.05\n\t  2 llvmpipe-7               0.06     0.05\n\t  2 llvmpipe-8               0.06     0.05\n\t  2 llvmpipe-9               0.06     0.05\n\t  2 glxinfo                  0.06     0.03\n\t  2 glxinfo:cs0              0.06     0.03\n\t  2 glxinfo:disk$0           0.06     0.03\n\t  2 glxinfo:sh0              0.06     0.03\n\t  2 glxinfo:shlo0            0.06     0.03\n\t  6 clang                    0.02     0.07\n\t  1 lspci                    0.01     0.03\n\t 95 sh                       0.00     0.00\n\t 13 gcc                      0.00     0.00\n\t 12 libxsmm                  0.00     0.00\n\t  9 stty                     0.00     0.00\n\t  8 stat                     0.00     0.00\n\t  8 systemd-detect-          0.00     0.00\n\t  7 gsettings                0.00     0.00\n\t  6 llvm-link      
          0.00     0.00\n\t  5 gmain                    0.00     0.00\n\t  5 phoronix-test-s          0.00     0.00\n\t  4 dconf worker             0.00     0.00\n\t  2 cc                       0.00     0.00\n\t  2 lscpu                    0.00     0.00\n\t  2 uname                    0.00     0.00\n\t  2 which                    0.00     0.00\n\t  2 xset                     0.00     0.00\n\t  1 date                     0.00     0.00\n\t  1 dirname                  0.00     0.00\n\t  1 dmesg                    0.00     0.00\n\t  1 dmidecode                0.00     0.00\n\t  1 grep                     0.00     0.00\n\t  1 ifconfig                 0.00     0.00\n\t  1 ip                       0.00     0.00\n\t  1 lsmod                    0.00     0.00\n\t  1 mktemp                   0.00     0.00\n\t  1 ps                       0.00     0.00\n\t  1 qdbus                    0.00     0.00\n\t  1 readlink                 0.00     0.00\n\t  1 realpath                 0.00     0.00\n\t  1 sed                      0.00     0.00\n\t  1 sort                     0.00     0.00\n\t  1 systemctl                0.00     0.00\n\t  1 template.sh              0.00     0.00\n\t  1 wc                       0.00     0.00\n\t  1 xrandr                   0.00     0.00\n0 processes running\n47 maximum processes\n<\/code><\/pre>\n\n\n\n<p>With parallel computation on all cores<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>      417620) libxsmm          cpu=8 start=5.70  finish=118.89\n        417621) specialized      cpu=15 start=5.70  finish=118.89\n          417622) specialized      cpu=1 start=5.70  finish=118.89\n          417623) specialized      cpu=3 start=5.70  finish=118.89\n          417624) specialized      cpu=14 start=5.70  finish=118.89\n          417625) specialized      cpu=13 start=5.70  finish=118.89\n          417626) specialized      cpu=4 start=5.70  finish=118.89\n          417627) specialized      cpu=10 start=5.70  finish=118.89\n          417628) specialized      
cpu=8 start=5.70  finish=118.89\n          417629) specialized      cpu=4 start=5.71  finish=118.89\n          417630) specialized      cpu=12 start=5.71  finish=118.89\n          417631) specialized      cpu=3 start=5.71  finish=118.89\n          417632) specialized      cpu=14 start=5.71  finish=118.89\n          417633) specialized      cpu=5 start=5.71  finish=118.89\n          417634) specialized      cpu=11 start=5.71  finish=118.89\n          417635) specialized      cpu=8 start=5.71  finish=118.89\n          417636) specialized      cpu=10 start=5.71  finish=118.89<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>libxsmm calculates dense and sparse matrix operations. There are four different workloads with different characteristics as shown below. However, generally backend\/memory bound and not much front end or speculation stalls. AMD metrics show a backend\/memory bound application with L2 misses <span class=\"excerpt-dots\">&hellip;<\/span> <a class=\"more-link\" href=\"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/libxsmm\/\"><span class=\"more-msg\">Continue reading 
&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":58,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-376","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/376","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/comments?post=376"}],"version-history":[{"count":2,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/376\/revisions"}],"predecessor-version":[{"id":386,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/376\/revisions\/386"}],"up":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/58"}],"wp:attachment":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/media?parent=376"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}