{"id":494,"date":"2024-01-13T19:21:04","date_gmt":"2024-01-13T19:21:04","guid":{"rendered":"https:\/\/mvermeulen.org\/perf\/?page_id=494"},"modified":"2024-01-13T19:21:39","modified_gmt":"2024-01-13T19:21:39","slug":"npb","status":"publish","type":"page","link":"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/npb\/","title":{"rendered":"npb"},"content":{"rendered":"\n<p>The NAS Parallel Benchmarks (<a href=\"https:\/\/nas.nasa.gov\/software\/npb.html\">link<\/a>) test a set of computational kernels:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>IS &#8211; integer sort<\/li>\n\n\n\n<li>EP &#8211; embarrassingly parallel<\/li>\n\n\n\n<li>CG &#8211; conjugate gradient<\/li>\n\n\n\n<li>MG &#8211; multi-grid<\/li>\n\n\n\n<li>FT &#8211; Fourier transform<\/li>\n\n\n\n<li>BT &#8211; block tri-diagonal solver<\/li>\n\n\n\n<li>SP &#8211; scalar penta-diagonal solver<\/li>\n\n\n\n<li>LU &#8211; lower-upper Gauss-Seidel solver<\/li>\n<\/ul>\n\n\n\n<p>Each kernel comes in a variety of sizes (S = small, W = workstation, A\/B\/C = standard tests, D\/E\/F = large tests), where each letter is larger than the previous one. This test tries 10 configurations: BT.C, CG.C, EP.C, EP.D, FT.C, IS.D, LU.C, MG.C, SP.B and SP.C.  
The IS.D test doesn&#8217;t run on Intel, but all the others run. Depending on the problem size, different numbers of threads are run.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-9.png\" alt=\"\" class=\"wp-image-495\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-9.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-9-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-9-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>The overall topdown distribution shows about 65% backend bound, with CPU and memory carrying about equal weight. However, some tests approach 90% backend bound while others are closer to 60%.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-47.png\" alt=\"\" class=\"wp-image-496\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-47.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-47-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-47-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>The AMD metrics show about 29% of instructions are floating point, about 8% are branches, and ~5% of slots are lost to branch misprediction. 
We are about 1\/3 on cpu and initial graph suggests this is mostly because the algorithms don&#8217;t always run on 16 cores.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              2283.426\non_cpu               0.329          # 5.26 \/ 16 cores\nutime                11712.085\nstime                293.999\nnvcsw                418029         # 92.54%\nnivcsw               33714          # 7.46%\ninblock              24920          # 10.91\/sec\nonblock              726560         # 318.19\/sec\ncpu-clock            12006890392461 # 12006.890 seconds\ntask-clock           12007051116953 # 12007.051 seconds\npage faults          32449764       # 2702.559\/sec\ncontext switches     462377         # 38.509\/sec\ncpu migrations       18933          # 1.577\/sec\nmajor page faults    3595           # 0.299\/sec\nminor page faults    32446169       # 2702.260\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             5515738653990  # 82.620 branches per 1000 inst\nbranch misses        108608594762   # 1.97% branch miss\nconditional          3742804280329  # 56.063 conditional branches per 1000 inst\nindirect             615681271121   # 9.222 indirect branches per 1000 inst\ncpu-cycles           60351169928545 # 1.49 GHz\ninstructions         79670697330954 # 1.32 IPC\nslots                120710831646738 #\nretiring             27615568683340 # 22.9% (22.9%)\n-- ucode             3933717481     #     0.0%\n-- fastpath          27611634965859 #    22.9%\nfrontend             7033153177869  #  5.8% ( 5.8%)\n-- latency           3463417792602  #     2.9%\n-- bandwidth         3569735385267  #     3.0%\nbackend              79549840661075 # 65.9% (65.9%)\n-- cpu               40042540131536 #    33.2%\n-- memory            39507300529539 #    32.7%\nspeculation          6459137499915  #  5.4% ( 5.4%)\n-- branch mispredict 6321715953737  #     5.2%\n-- pipeline restart  137421546178   #     
0.1%\nsmt-contention       53098719153    #  0.0% ( 0.0%)\ncpu-cycles           80335334077607 # 1.64 GHz\ninstructions         117089528823367 # 1.46 IPC\ninstructions         39035071246839 # 28.407 l2 access per 1000 inst\nl2 hit from l1       724297829438   # 21.89% l2 miss\nl2 miss from l1      50355819375    #\nl2 hit from l2 pf    192229686294   #\nl3 hit from l2 pf    82702658409    #\nl3 miss from l2 pf   109634385230   #\ninstructions         39021936306273 # 290.912 float per 1000 inst\nfloat 512            197            # 0.000 AVX-512 per 1000 inst\nfloat 256            135099664      # 0.003 AVX-256 per 1000 inst\nfloat 128            11351820981171 # 290.909 AVX-128 per 1000 inst\nfloat MMX            0              # 0.000 MMX per 1000 inst\nfloat scalar         0              # 0.000 scalar per 1000 inst\n<\/code><\/pre>\n\n\n\n<p>Intel metrics<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              3395.363\non_cpu               0.498          # 7.97 \/ 16 cores\nutime                26824.162\nstime                240.716\nnvcsw                571012         # 89.91%\nnivcsw               64057          # 10.09%\ninblock              1526680        # 449.64\/sec\nonblock              848720         # 249.96\/sec\ncpu-clock            27828052765411 # 27828.053 seconds\ntask-clock           27828221812104 # 27828.222 seconds\npage faults          39637056       # 1424.347\/sec\ncontext switches     683518         # 24.562\/sec\ncpu migrations       38996          # 1.401\/sec\nmajor page faults    17773          # 0.639\/sec\nminor page faults    39619248       # 1423.707\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             23921316362428 # 146.786 branches per 1000 inst\nbranch misses        103717613886   # 0.43% branch miss\nconditional          23921316438460 # 146.786 conditional branches per 1000 inst\nindirect             4336551074975  # 26.610 indirect 
branches per 1000 inst\nslots                328695207828722 #\nretiring             171019529525049 # 52.0% (52.0%)\n-- ucode             15361044597237 #     4.7%\n-- fastpath          155658484927812 #    47.4%\nfrontend             22635542803013 #  6.9% ( 6.9%)\n-- latency           7363451194826  #     2.2%\n-- bandwidth         15272091608187 #     4.6%\nbackend              124024545934012 # 37.7% (37.7%)\n-- cpu               50372780801574 #    15.3%\n-- memory            73651765132438 #    22.4%\nspeculation          11136414067764 #  3.4% ( 3.4%)\n-- branch mispredict 8681080075941  #     2.6%\n-- pipeline restart  2455333991823  #     0.7%\nsmt-contention       0              #  0.0% ( 0.0%)\ncpu-cycles           102330630188559 # 1.55 GHz\ninstructions         317073908099680 # 3.10 IPC\nl2 access            1934847494161  # 12.244 l2 access per 1000 inst\nl2 miss              510893459816   # 26.40% l2 miss\n<\/code><\/pre>\n\n\n\n<p>The process tree shows this is MPI code with solvers named for the algorithm.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>1446 processes\n\t 96 ep.D.x               10489.55     1.66\n\t 36 sp.C.x                5617.93    36.90\n\t 36 bt.C.x                4643.11    18.33\n\t 72 lu.C.x                4173.60    15.98\n\t 72 is.D.x                2628.17   434.67\n\t 72 cg.C.x                1664.50    19.92\n\t 72 ft.C.x                1461.35   223.01\n\t 36 sp.B.x                1360.87    14.18\n\t 72 mg.C.x                 663.37    20.76\n\t 72 ep.C.x                 493.43     0.98\n\t 67 clinfo                  16.63     5.57\n\t186 mpiexec                  8.56    23.04\n\t 38 vulkaninfo               0.83     1.32\n\t  6 php                      0.15     0.77\n\t  6 glxinfo:gdrv0            0.15     0.06\n\t  4 vulkani:disk$0           0.09     0.14\n\t  2 glxinfo                  0.07     0.02\n\t  2 glxinfo:cs0              0.07     0.02\n\t  2 glxinfo:disk$0           0.07     0.02\n\t  2 glxinfo:sh0    
          0.07     0.02\n\t  2 glxinfo:shlo0            0.07     0.02\n\t  2 llvmpipe-0               0.05     0.07\n\t  2 llvmpipe-1               0.05     0.07\n\t  2 llvmpipe-10              0.05     0.07\n\t  2 llvmpipe-11              0.05     0.07\n\t  2 llvmpipe-12              0.05     0.07\n\t  2 llvmpipe-13              0.05     0.07\n\t  2 llvmpipe-14              0.05     0.07\n\t  2 llvmpipe-15              0.05     0.07\n\t  2 llvmpipe-2               0.05     0.07\n\t  2 llvmpipe-3               0.05     0.07\n\t  2 llvmpipe-4               0.05     0.07\n\t  2 llvmpipe-5               0.05     0.07\n\t  2 llvmpipe-6               0.05     0.07\n\t  2 llvmpipe-7               0.05     0.07\n\t  2 llvmpipe-8               0.05     0.07\n\t  2 llvmpipe-9               0.05     0.07\n\t  6 clang                    0.03     0.09\n\t  3 rocminfo                 0.03     0.00\n\t  1 lspci                    0.00     0.02\n\t194 npb                      0.00     0.00\n\t100 sh                       0.00     0.00\n\t 31 cut                      0.00     0.00\n\t 24 bc                       0.00     0.00\n\t 15 awk                      0.00     0.00\n\t 13 gcc                      0.00     0.00\n\t 11 gsettings                0.00     0.00\n\t  8 stat                     0.00     0.00\n\t  8 systemd-detect-          0.00     0.00\n\t  6 llvm-link                0.00     0.00\n\t  5 phoronix-test-s          0.00     0.00\n\t  3 gmain                    0.00     0.00\n\t  2 cc                       0.00     0.00\n\t  2 dconf worker             0.00     0.00\n\t  2 lscpu                    0.00     0.00\n\t  2 uname                    0.00     0.00\n\t  2 which                    0.00     0.00\n\t  2 xset                     0.00     0.00\n\t  1 date                     0.00     0.00\n\t  1 dirname                  0.00     0.00\n\t  1 dmesg                    0.00     0.00\n\t  1 dmidecode                0.00     0.00\n\t  1 grep                     0.00     
0.00\n\t  1 ifconfig                 0.00     0.00\n\t  1 ip                       0.00     0.00\n\t  1 lsmod                    0.00     0.00\n\t  1 mktemp                   0.00     0.00\n\t  1 ps                       0.00     0.00\n\t  1 qdbus                    0.00     0.00\n\t  1 readlink                 0.00     0.00\n\t  1 realpath                 0.00     0.00\n\t  1 sed                      0.00     0.00\n\t  1 sort                     0.00     0.00\n\t  1 stty                     0.00     0.00\n\t  1 systemctl                0.00     0.00\n\t  1 template.sh              0.00     0.00\n\t  1 wc                       0.00     0.00\n\t  1 xrandr                   0.00     0.00\n0 processes running\n47 maximum processes\n<\/code><\/pre>\n\n\n\n<p>Here is an example run of the BT.C workload<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>      86732) npb              cpu=13 start=5.79  finish=136.63\n        86733) npb              cpu=14 start=5.79  finish=5.79 \n          86734) npb              cpu=15 start=5.79  finish=5.79 \n          86735) cut              cpu=10 start=5.79  finish=5.79 \n        86736) npb              cpu=0 start=5.79  finish=5.79 \n        86737) npb              cpu=1 start=5.79  finish=5.80 \n          86738) npb              cpu=14 start=5.79  finish=5.79 \n          86739) bc               cpu=15 start=5.79  finish=5.80 \n        86740) mpiexec          cpu=4 start=5.80  finish=136.60\n          86743) mpiexec          cpu=2 start=6.38  finish=136.60\n          86744) mpiexec          cpu=11 start=6.38  finish=6.38 \n          86745) mpiexec          cpu=15 start=6.40  finish=136.60\n          86747) mpiexec          cpu=13 start=6.88  finish=136.60\n          86748) mpiexec          cpu=7 start=6.88  finish=136.60\n          86749) bt.C.x           cpu=1 start=6.89  finish=136.57\n            86751) bt.C.x           cpu=12 start=6.89  finish=136.57\n            86754) bt.C.x           cpu=14 start=6.90  finish=136.56\n          
86750) bt.C.x           cpu=5 start=6.89  finish=136.57\n            86753) bt.C.x           cpu=11 start=6.90  finish=136.57\n            86757) bt.C.x           cpu=2 start=6.91  finish=136.56\n          86752) bt.C.x           cpu=15 start=6.90  finish=136.57\n            86756) bt.C.x           cpu=0 start=6.90  finish=136.57\n            86759) bt.C.x           cpu=4 start=6.91  finish=136.56\n          86755) bt.C.x           cpu=0 start=6.90  finish=136.57\n            86758) bt.C.x           cpu=11 start=6.91  finish=136.57\n            86760) bt.C.x           cpu=12 start=6.91  finish=136.56\n<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>The NAS parallel benchmarks &#8211; link &#8211; test a set of computational kernels: With a variety of sizes (S = small, W = workstation, A\/B\/C = standard tests, D\/E\/F = large tests) where each letter is larger than the previous <span class=\"excerpt-dots\">&hellip;<\/span> <a class=\"more-link\" href=\"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/npb\/\"><span class=\"more-msg\">Continue reading 
&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":58,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-494","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/494","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/comments?post=494"}],"version-history":[{"count":1,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/494\/revisions"}],"predecessor-version":[{"id":498,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/494\/revisions\/498"}],"up":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/58"}],"wp:attachment":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/media?parent=494"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}