{"id":1220,"date":"2024-02-01T01:29:04","date_gmt":"2024-02-01T01:29:04","guid":{"rendered":"https:\/\/mvermeulen.org\/perf\/?page_id=1220"},"modified":"2024-02-01T01:38:48","modified_gmt":"2024-02-01T01:38:48","slug":"lulesh","status":"publish","type":"page","link":"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/lulesh\/","title":{"rendered":"lulesh"},"content":{"rendered":"\n<p>LULESH is an acronym for Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics. The benchmark runs very quickly. MPI appears to run only on the physical cores.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/systemtime-2.png\" alt=\"\" class=\"wp-image-1224\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/systemtime-2.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/systemtime-2-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/systemtime-2-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>The topdown profile is sparse because the workload runs quickly. 
However, in aggregate, backend stalls predominate.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/amdtopdown-2.png\" alt=\"\" class=\"wp-image-1226\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/amdtopdown-2.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/amdtopdown-2-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/amdtopdown-2-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>The AMD metrics make the summary easier to see. On-CPU time averages about 4.7 of the 16 cores (0.296), a little over 1\/4. Backend memory stalls are high (47.1% of slots), and backend CPU stalls (21.9%) also contribute. Approximately 40% of the instructions are floating point.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              48.970\non_cpu               0.296          # 4.73 \/ 16 cores\nutime                191.535\nstime                40.245\nnvcsw                45784          # 96.73%\nnivcsw               1548           # 3.27%\ninblock              8              # 0.16\/sec\nonblock              62080          # 1267.73\/sec\ncpu-clock            231739627464   # 231.740 seconds\ntask-clock           231757527279   # 231.758 seconds\npage faults          19718776       # 85083.649\/sec\ncontext switches     47385          # 204.459\/sec\ncpu migrations       1131           # 4.880\/sec\nmajor page faults    234            # 1.010\/sec\nminor page faults    19718542       # 85082.639\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             82537114616    # 75.499 branches per 1000 inst\nbranch misses        3072310653     # 3.72% branch miss\nconditional          55678608704    # 50.931 conditional branches per 1000 inst\nindirect             2686831025     # 
2.458 indirect branches per 1000 inst\ncpu-cycles           986477655776   # 1.27 GHz\ninstructions         1082464092791  # 1.10 IPC\nslots                1974576003252  #\nretiring             377954134774   # 19.1% (19.2%)\n-- ucode             509902685      #     0.0%\n-- fastpath          377444232089   #    19.1%\nfrontend             228199081339   # 11.6% (11.6%)\n-- latency           170613588120   #     8.6%\n-- bandwidth         57585493219    #     2.9%\nbackend              1361308062049  # 68.9% (69.0%)\n-- cpu               432111990127   #    21.9%\n-- memory            929196071922   #    47.1%\nspeculation          5124623997     #  0.3% ( 0.3%) low\n-- branch mispredict 5037122371     #     0.3%\n-- pipeline restart  87501626       #     0.0%\nsmt-contention       1988503188     #  0.1% ( 0.0%)\ncpu-cycles           986280789225   # 1.27 GHz\ninstructions         1079037081656  # 1.09 IPC\ninstructions         360898263957   # 40.265 l2 access per 1000 inst\nl2 hit from l1       9682858043     # 24.84% l2 miss\nl2 miss from l1      712645490      #\nl2 hit from l2 pf    1951552116     #\nl3 hit from l2 pf    139447103      #\nl3 miss from l2 pf   2757536174     #\ninstructions         361496500819   # 406.170 float per 1000 inst\nfloat 512            76             # 0.000 AVX-512 per 1000 inst\nfloat 256            690            # 0.000 AVX-256 per 1000 inst\nfloat 128            146829108908   # 406.170 AVX-128 per 1000 inst\nfloat MMX            0              # 0.000 MMX per 1000 inst\nfloat scalar         0              # 0.000 scalar per 1000 inst\n<\/code><\/pre>\n\n\n\n<p>Intel metrics<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              55.049\non_cpu               0.317          # 5.07 \/ 16 cores\nutime                242.439\nstime                36.925\nnvcsw                83411          # 98.53%\nnivcsw               1248           # 1.47%\ninblock              519472         # 9436.55\/sec\nonblock              
50664          # 920.34\/sec\ncpu-clock            279287193118   # 279.287 seconds\ntask-clock           279309002068   # 279.309 seconds\npage faults          19700096       # 70531.547\/sec\ncontext switches     84722          # 303.327\/sec\ncpu migrations       1460           # 5.227\/sec\nmajor page faults    3526           # 12.624\/sec\nminor page faults    19696570       # 70518.923\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             268786231444   # 136.459 branches per 1000 inst\nbranch misses        58265385       # 0.02% branch miss\nconditional          268786245332   # 136.459 conditional branches per 1000 inst\nindirect             41617151216    # 21.128 indirect branches per 1000 inst\nslots                15445604506322 #\nretiring             7770478502776  # 50.3% (50.3%)\n-- ucode             790445170397   #     5.1%\n-- fastpath          6980033332379  #    45.2%\nfrontend             748657265289   #  4.8% ( 4.8%) low\n-- latency           355686368439   #     2.3%\n-- bandwidth         392970896850   #     2.5%\nbackend              6870999071370  # 44.5% (44.5%)\n-- cpu               2408157938972  #    15.6%\n-- memory            4462841132398  #    28.9%\nspeculation          137401946204   #  0.9% ( 0.9%) low\n-- branch mispredict 71814841286    #     0.5%\n-- pipeline restart  65587104918    #     0.4%\nsmt-contention       0              #  0.0% ( 0.0%)\ncpu-cycles           5203449155222  # 1.14 GHz\ninstructions         15428229523592 # 2.97 IPC\nl2 access            75190794668    # 9.551 l2 access per 1000 inst\nl2 miss              44769402597    # 59.54% l2 miss\n<\/code><\/pre>\n\n\n\n<p>Process overview shows lulesh2.0 invocations under MPI<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>441 processes\n\t 72 lulesh2.0              570.95   112.10\n\t 68 clinfo                  15.88     6.32\n\t 38 vulkaninfo               0.94     1.33\n\t 18 mpirun    
               0.77     2.15\n\t  6 glxinfo:gdrv0            0.12     0.04\n\t  6 glxinfo:gl0              0.12     0.04\n\t  4 vulkani:disk$0           0.10     0.14\n\t  6 clang                    0.08     0.03\n\t  6 php                      0.07     0.07\n\t  2 glxinfo                  0.06     0.03\n\t  2 glxinfo:cs0              0.06     0.02\n\t  2 glxinfo:disk$0           0.06     0.02\n\t  2 glxinfo:sh0              0.06     0.02\n\t  2 glxinfo:shlo0            0.06     0.02\n\t  2 llvmpipe-0               0.05     0.07\n\t  2 llvmpipe-1               0.05     0.07\n\t  2 llvmpipe-10              0.05     0.07\n\t  2 llvmpipe-11              0.05     0.07\n\t  2 llvmpipe-12              0.05     0.07\n\t  2 llvmpipe-13              0.05     0.07\n\t  2 llvmpipe-14              0.05     0.07\n\t  2 llvmpipe-15              0.05     0.07\n\t  2 llvmpipe-2               0.05     0.07\n\t  2 llvmpipe-3               0.05     0.07\n\t  2 llvmpipe-4               0.05     0.07\n\t  2 llvmpipe-5               0.05     0.07\n\t  2 llvmpipe-6               0.05     0.07\n\t  2 llvmpipe-7               0.05     0.07\n\t  2 llvmpipe-8               0.05     0.07\n\t  2 llvmpipe-9               0.05     0.07\n\t  3 rocminfo                 0.03     0.00\n\t  1 lspci                    0.00     0.02\n\t  1 ps                       0.00     0.01\n\t 82 sh                       0.00     0.00\n\t 13 gcc                      0.00     0.00\n\t 13 gsettings                0.00     0.00\n\t  8 stat                     0.00     0.00\n\t  8 systemd-detect-          0.00     0.00\n\t  6 llvm-link                0.00     0.00\n\t  5 phoronix-test-s          0.00     0.00\n\t  3 lulesh                   0.00     0.00\n\t  2 cc                       0.00     0.00\n\t  2 gmain                    0.00     0.00\n\t  2 lscpu                    0.00     0.00\n\t  2 uname                    0.00     0.00\n\t  2 which                    0.00     0.00\n\t  2 xset                     0.00   
  0.00\n\t  1 date                     0.00     0.00\n\t  1 dconf worker             0.00     0.00\n\t  1 dirname                  0.00     0.00\n\t  1 dmesg                    0.00     0.00\n\t  1 dmidecode                0.00     0.00\n\t  1 grep                     0.00     0.00\n\t  1 ifconfig                 0.00     0.00\n\t  1 ip                       0.00     0.00\n\t  1 lsmod                    0.00     0.00\n\t  1 mktemp                   0.00     0.00\n\t  1 qdbus                    0.00     0.00\n\t  1 readlink                 0.00     0.00\n\t  1 realpath                 0.00     0.00\n\t  1 sed                      0.00     0.00\n\t  1 sort                     0.00     0.00\n\t  1 stty                     0.00     0.00\n\t  1 systemctl                0.00     0.00\n\t  1 template.sh              0.00     0.00\n\t  1 wc                       0.00     0.00\n\t  1 xrandr                   0.00     0.00\n0 processes running\n47 maximum processes\n<\/code><\/pre>\n\n\n\n<p>Computation blocks<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>      7923) lulesh           cpu=1 start=5.85  finish=16.90\n        7924) mpirun           cpu=0 start=5.85  finish=16.88\n          7927) mpirun           cpu=4 start=6.46  finish=16.88\n          7928) mpirun           cpu=7 start=6.46  finish=6.46 \n          7929) mpirun           cpu=9 start=6.48  finish=16.87\n          7930) mpirun           cpu=15 start=6.97  finish=16.87\n          7931) mpirun           cpu=10 start=6.97  finish=16.88\n          7932) lulesh2.0        cpu=10 start=6.98  finish=16.82\n            7934) lulesh2.0        cpu=15 start=6.98  finish=16.81\n            7938) lulesh2.0        cpu=15 start=6.99  finish=16.81\n          7933) lulesh2.0        cpu=12 start=6.98  finish=16.82\n            7936) lulesh2.0        cpu=0 start=6.99  finish=16.81\n            7940) lulesh2.0        cpu=14 start=7.00  finish=16.81\n          7935) lulesh2.0        cpu=3 start=6.99  finish=16.82\n            7939) 
lulesh2.0        cpu=14 start=6.99  finish=16.81\n            7943) lulesh2.0        cpu=5 start=7.00  finish=16.81\n          7937) lulesh2.0        cpu=4 start=6.99  finish=16.77\n            7942) lulesh2.0        cpu=1 start=7.00  finish=16.77\n            7947) lulesh2.0        cpu=11 start=7.00  finish=16.77\n          7941) lulesh2.0        cpu=11 start=7.00  finish=16.77\n            7945) lulesh2.0        cpu=10 start=7.00  finish=16.77\n            7950) lulesh2.0        cpu=3 start=7.01  finish=16.77\n          7944) lulesh2.0        cpu=8 start=7.00  finish=16.77\n            7948) lulesh2.0        cpu=5 start=7.01  finish=16.77\n            7952) lulesh2.0        cpu=4 start=7.01  finish=16.77\n          7946) lulesh2.0        cpu=6 start=7.00  finish=16.77\n            7951) lulesh2.0        cpu=13 start=7.01  finish=16.77\n            7954) lulesh2.0        cpu=5 start=7.02  finish=16.77\n          7949) lulesh2.0        cpu=7 start=7.01  finish=16.77\n            7953) lulesh2.0        cpu=15 start=7.01  finish=16.77\n            7955) lulesh2.0        cpu=2 start=7.02  finish=16.77\n<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Lulesh is an acronym for Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics. This is a very quick running benchmark. Looks like MPI runs just on physical cores. Topdown profile is sparse because the workload runs quickly. 
However on aggregate backend stalls <span class=\"excerpt-dots\">&hellip;<\/span> <a class=\"more-link\" href=\"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/lulesh\/\"><span class=\"more-msg\">Continue reading &rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":58,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-1220","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1220","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/comments?post=1220"}],"version-history":[{"count":2,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1220\/revisions"}],"predecessor-version":[{"id":1227,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1220\/revisions\/1227"}],"up":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/58"}],"wp:attachment":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/media?parent=1220"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}