{"id":2177,"date":"2024-03-24T13:16:59","date_gmt":"2024-03-24T13:16:59","guid":{"rendered":"https:\/\/mvermeulen.org\/perf\/?page_id=2177"},"modified":"2024-03-24T13:17:00","modified_gmt":"2024-03-24T13:17:00","slug":"pbbs","status":"publish","type":"page","link":"https:\/\/mvermeulen.org\/perf\/workloads\/pbbs\/","title":{"rendered":"pbbs"},"content":{"rendered":"\n<p>The <a href=\"https:\/\/github.com\/cmuparlay\/pbbsbench\">Problem Based Benchmark Suite<\/a> (PBBS) is a source repository of about 20 different algorithms expressed in short benchmarks. For example:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>ANN\nbreadthFirstSearch\nBWDecode\nclassify\ncomparisonSort\nconcurrentKNN\nconvexHull\ndelaunayRefine\ndelaunayTriangulation\nhistogram\nintegerSort\ninvertedIndex\nlongestRepeatedSubstring\nmaximalIndependentSet\nmaximalMatching\nminSpanningForest\nnBody\nnearestNeighbors\nrangeQuery2d\nrangeQueryKDTree\nrangeSearch\nrayCast\nremoveDuplicates\nspanningForest\nsuffixArray\nwordCounts\n<\/code><\/pre>\n\n\n\n<p>The benchmarks come in a small mode and large mode and have a quick implementation.  The system requires 64GB of RAM for large mode, so I have run the smaller mode which only needs 12GB of RAM. 
However, this also results in some of them running in just seconds, so I collected them together to show a composite.<\/p>\n\n\n\n<p>A system overview shows a mixture of benchmarks running on one core and benchmarks running on all available cores.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/systemtime-41.png\" alt=\"\" class=\"wp-image-2178\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/systemtime-41.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/systemtime-41-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/systemtime-41-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>The topdown profile of the combined run is somewhat blurred, since the profile is benchmark dependent.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/amdtopdown-43.png\" alt=\"\" class=\"wp-image-2179\" style=\"width:1180px;height:auto\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/amdtopdown-43.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/amdtopdown-43-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/amdtopdown-43-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>Individual tests also don&#8217;t run for very long, e.g. 
here is the output for nBody:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>cd benchmarks\/nBody\/parallelCK ; make -s\ncd benchmarks\/nBody\/parallelCK ; numactl -i all .\/testInputs_small -r 3 -p 16\n3DonSphere_100000 :  -r 3 -o \/tmp\/ofile4755_557782 : '0.175', '0.168', '0.172', geomean = 0.172\n3DinCube_100000 :  -r 3 -o \/tmp\/ofile752134_819802 : '0.334', '0.332', '0.336', geomean = 0.334\n3Dplummer_100000 :  -r 3 -o \/tmp\/ofile998621_657874 : '0.724', '0.732', '0.701', geomean = 0.719\nparallelCK : 16 : geomean of mins = 0.339, geomean of geomeans = 0.345\nSmall Inputs\n<\/code><\/pre>\n\n\n\n<p>The large mode runs longer but still in seconds, e.g. about 2 seconds, 4 seconds, and 6 seconds per run for the three inputs, or about 30 seconds overall.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>HOSTNAME: augusta\nRunning only:  &#91;&#91;'nBody\/parallelCK', True, 0]]\nrunning on 16 threads\n\ncd benchmarks\/nBody\/parallelCK ; make -s\ncd benchmarks\/nBody\/parallelCK ; numactl -i all .\/testInputs -r 3 -p 16\n3DonSphere_1000000 :  -r 3 -o \/tmp\/ofile687062_310171 : '1.858', '1.714', '1.703', geomean = 1.757\n3DinCube_1000000 :  -r 3 -o \/tmp\/ofile245353_156304 : '4.125', '4.147', '4.162', geomean = 4.145\n3Dplummer_1000000 :  -r 3 -o \/tmp\/ofile878794_743375 : '6.202', '6.166', '6.17', geomean = 6.18\nparallelCK : 16 : geomean of mins = 3.512, geomean of geomeans = 3.557\n<\/code><\/pre>\n\n\n\n<p>It is possible to extend that slightly by providing a &#8220;-r&#8221; option for more runs, but overall this is still fairly quick-running code.<\/p>\n\n\n\n<p>The AMD metrics show a composite with a fairly average overall mix of floating point, branches, opcache, etc.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              434.115\non_cpu               0.480          # 7.68 \/ 16 cores\nutime                3058.909\nstime                274.399\nnvcsw                2261861        # 98.08%\nnivcsw               44272          # 1.92%\ninblock              16             
# 0.04\/sec\nonblock              50915928       # 117286.83\/sec\ncpu-clock            3332164205502  # 3332.164 seconds\ntask-clock           3333021473267  # 3333.021 seconds\npage faults          104897984      # 31472.340\/sec\ncontext switches     2304869        # 691.525\/sec\ncpu migrations       24437          # 7.332\/sec\nmajor page faults    576            # 0.173\/sec\nminor page faults    104897408      # 31472.167\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             3147241511460  # 177.373 branches per 1000 inst\nbranch misses        73529245988    # 2.34% branch miss\nconditional          2590299010108  # 145.985 conditional branches per 1000 inst\nindirect             32139961657    # 1.811 indirect branches per 1000 inst\ncpu-cycles           13860752529316 # 2.00 GHz\ninstructions         17687485942547 # 1.28 IPC\nslots                27802291281930 #\nretiring             5843337876958  # 21.0% (27.6%)\n-- ucode             8123407764     #     0.0%\n-- fastpath          5835214469194  #    21.0%\nfrontend             5116890434599  # 18.4% (24.2%)\n-- latency           3089230464624  #    11.1%\n-- bandwidth         2027659969975  #     7.3%\nbackend              9176199683559  # 33.0% (43.4%)\n-- cpu               1960965413017  #     7.1%\n-- memory            7215234270542  #    26.0%\nspeculation          1007172315506  #  3.6% ( 4.8%)\n-- branch mispredict 998151183069   #     3.6%\n-- pipeline restart  9021132437     #     0.0%\nsmt-contention       6658563270305  # 23.9% ( 0.0%)\ncpu-cycles           13868909164745 # 2.00 GHz\ninstructions         17703619245800 # 1.28 IPC\ninstructions         5905011790886  # 14.896 l2 access per 1000 inst\nl2 hit from l1       61946567593    # 37.33% l2 miss\nl2 miss from l1      14558905287    #\nl2 hit from l2 pf    7740802349     #\nl3 hit from l2 pf    5494122426     #\nl3 miss from l2 pf   12779715907    #\ninstructions   
      5898789410239  # 49.118 float per 1000 inst\nfloat 512            410            # 0.000 AVX-512 per 1000 inst\nfloat 256            476            # 0.000 AVX-256 per 1000 inst\nfloat 128            289735641329   # 49.118 AVX-128 per 1000 inst\nfloat MMX            0              # 0.000 MMX per 1000 inst\nfloat scalar         55             # 0.000 scalar per 1000 inst\ninstructions         17711244839796 #\nopcache              3147913227436  # 177.735 opcache per 1000 inst\nopcache miss         115380444621   #  3.7% opcache miss rate\nl1 dTLB miss         59779381014    # 3.375 L1 dTLB per 1000 inst\nl2 dTLB miss         14251524341    # 0.805 L2 dTLB per 1000 inst\ninstructions         17707297103321 #\nicache               252770587972   # 14.275 icache per 1000 inst\nicache miss          10786644623    #  4.3% icache miss rate\nl1 iTLB miss         58838420       # 0.003 L1 iTLB per 1000 inst\nl2 iTLB miss         0              # 0.000 L2 iTLB per 1000 inst\ntlb flush            208869         # 0.000 TLB flush per 1000 inst\n<\/code><\/pre>\n\n\n\n<p>Overall, I explored these as a potential alternative to SPEC CPU as compiler-type benchmarks, but they seem to run a bit too quickly to be interesting.  They are still useful if one is looking for a particular implementation of a classic problem.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The Problem Based Benchmark Suite (PBBS) is a source repository of about 20 different algorithms expressed in short benchmarks. For example: The benchmarks come in a small mode and large mode and have a quick implementation. 
The system requires 64GB <span class=\"excerpt-dots\">&hellip;<\/span> <a class=\"more-link\" href=\"https:\/\/mvermeulen.org\/perf\/workloads\/pbbs\/\"><span class=\"more-msg\">Continue reading &rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":48,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-2177","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/2177","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/comments?post=2177"}],"version-history":[{"count":1,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/2177\/revisions"}],"predecessor-version":[{"id":2180,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/2177\/revisions\/2180"}],"up":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/48"}],"wp:attachment":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/media?parent=2177"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}