The Problem Based Benchmark Suite (PBBS) is a source repository of about 20 different algorithms expressed as short benchmarks, for example:
ANN
breadthFirstSearch
BWDecode
classify
comparisonSort
concurrentKNN
convexHull
delaunayRefine
delaunayTriangulation
histogram
integerSort
invertedIndex
longestRepeatedSubstring
maximalIndependentSet
maximalMatching
minSpanningForest
nBody
nearestNeighbors
rangeQuery2d
rangeQueryKDTree
rangeSearch
rayCast
removeDuplicates
spanningForest
suffixArray
wordCounts
The benchmarks come in a small mode and a large mode, and each has a quick implementation. The system requires 64GB of RAM for large mode, so I ran the smaller mode, which only needs 12GB. However, this also results in some of them completing in just seconds, so I collected them together to show an aggregate run.
A system overview shows a mixture of benchmarks running on one core versus those running on all available cores.

The topdown profile of the benchmarks is somewhat blurred and benchmark-dependent.

Test outputs also don’t always show a very long running test; e.g. here is the output for nBody:
cd benchmarks/nBody/parallelCK ; make -s
cd benchmarks/nBody/parallelCK ; numactl -i all ./testInputs_small -r 3 -p 16
3DonSphere_100000 : -r 3 -o /tmp/ofile4755_557782 : '0.175', '0.168', '0.172', geomean = 0.172
3DinCube_100000 : -r 3 -o /tmp/ofile752134_819802 : '0.334', '0.332', '0.336', geomean = 0.334
3Dplummer_100000 : -r 3 -o /tmp/ofile998621_657874 : '0.724', '0.732', '0.701', geomean = 0.719
parallelCK : 16 : geomean of mins = 0.339, geomean of geomeans = 0.345
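The "geomean of mins" summary line above can be reproduced with a few lines of Python (a minimal sketch; the inputs are the per-input minimum times taken from the three runs shown above):

```python
import math

# Minimum time (seconds) across the 3 runs of each small nBody input,
# taken from the testInputs_small output above.
mins = [0.168, 0.332, 0.701]  # 3DonSphere, 3DinCube, 3Dplummer

def geomean(xs):
    """Geometric mean, computed in log space for numerical stability."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

print(f"geomean of mins = {geomean(mins):.3f}")  # → 0.339, matching the report
```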
Large Inputs
The large mode runs take longer but still complete in seconds, e.g. 2 seconds, 4 seconds and 6 seconds per input, or about 30 seconds overall:
HOSTNAME: augusta
Running only: [['nBody/parallelCK', True, 0]]
running on 16 threads
cd benchmarks/nBody/parallelCK ; make -s
cd benchmarks/nBody/parallelCK ; numactl -i all ./testInputs -r 3 -p 16
3DonSphere_1000000 : -r 3 -o /tmp/ofile687062_310171 : '1.858', '1.714', '1.703', geomean = 1.757
3DinCube_1000000 : -r 3 -o /tmp/ofile245353_156304 : '4.125', '4.147', '4.162', geomean = 4.145
3Dplummer_1000000 : -r 3 -o /tmp/ofile878794_743375 : '6.202', '6.166', '6.17', geomean = 6.18
parallelCK : 16 : geomean of mins = 3.512, geomean of geomeans = 3.557
It is possible to extend that slightly by providing a larger “-r” option for more runs, but overall it is still fairly quick-running code.
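For example, the large-mode run could be repeated with more iterations per input (a sketch; “-r 10” is an arbitrary choice, “-r” being the run-count option and “-p” the processor count shown in the invocations above):

```shell
# Re-run the large-mode nBody benchmark with more repetitions per input.
cd benchmarks/nBody/parallelCK
numactl -i all ./testInputs -r 10 -p 16
```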
The AMD metrics show a composite with a fairly average overall mix of floating point, branch, opcache, etc. activity:
elapsed 434.115
on_cpu 0.480 # 7.68 / 16 cores
utime 3058.909
stime 274.399
nvcsw 2261861 # 98.08%
nivcsw 44272 # 1.92%
inblock 16 # 0.04/sec
onblock 50915928 # 117286.83/sec
cpu-clock 3332164205502 # 3332.164 seconds
task-clock 3333021473267 # 3333.021 seconds
page faults 104897984 # 31472.340/sec
context switches 2304869 # 691.525/sec
cpu migrations 24437 # 7.332/sec
major page faults 576 # 0.173/sec
minor page faults 104897408 # 31472.167/sec
alignment faults 0 # 0.000/sec
emulation faults 0 # 0.000/sec
branches 3147241511460 # 177.373 branches per 1000 inst
branch misses 73529245988 # 2.34% branch miss
conditional 2590299010108 # 145.985 conditional branches per 1000 inst
indirect 32139961657 # 1.811 indirect branches per 1000 inst
cpu-cycles 13860752529316 # 2.00 GHz
instructions 17687485942547 # 1.28 IPC
slots 27802291281930 #
retiring 5843337876958 # 21.0% (27.6%)
-- ucode 8123407764 # 0.0%
-- fastpath 5835214469194 # 21.0%
frontend 5116890434599 # 18.4% (24.2%)
-- latency 3089230464624 # 11.1%
-- bandwidth 2027659969975 # 7.3%
backend 9176199683559 # 33.0% (43.4%)
-- cpu 1960965413017 # 7.1%
-- memory 7215234270542 # 26.0%
speculation 1007172315506 # 3.6% ( 4.8%)
-- branch mispredict 998151183069 # 3.6%
-- pipeline restart 9021132437 # 0.0%
smt-contention 6658563270305 # 23.9% ( 0.0%)
cpu-cycles 13868909164745 # 2.00 GHz
instructions 17703619245800 # 1.28 IPC
instructions 5905011790886 # 14.896 l2 access per 1000 inst
l2 hit from l1 61946567593 # 37.33% l2 miss
l2 miss from l1 14558905287 #
l2 hit from l2 pf 7740802349 #
l3 hit from l2 pf 5494122426 #
l3 miss from l2 pf 12779715907 #
instructions 5898789410239 # 49.118 float per 1000 inst
float 512 410 # 0.000 AVX-512 per 1000 inst
float 256 476 # 0.000 AVX-256 per 1000 inst
float 128 289735641329 # 49.118 AVX-128 per 1000 inst
float MMX 0 # 0.000 MMX per 1000 inst
float scalar 55 # 0.000 scalar per 1000 inst
instructions 17711244839796 #
opcache 3147913227436 # 177.735 opcache per 1000 inst
opcache miss 115380444621 # 3.7% opcache miss rate
l1 dTLB miss 59779381014 # 3.375 L1 dTLB per 1000 inst
l2 dTLB miss 14251524341 # 0.805 L2 dTLB per 1000 inst
instructions 17707297103321 #
icache 252770587972 # 14.275 icache per 1000 inst
icache miss 10786644623 # 4.3% icache miss rate
l1 iTLB miss 58838420 # 0.003 L1 iTLB per 1000 inst
l2 iTLB miss 0 # 0.000 L2 iTLB per 1000 inst
tlb flush 208869 # 0.000 TLB flush per 1000 inst
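As a sanity check on the topdown section above, the percentages can be derived from the raw counts: each category is reported as a fraction of total slots, and the parenthesized figures appear to be the same fractions rescaled to exclude smt-contention (an inference from the numbers, not documented output). A minimal sketch:

```python
# Raw counts from the topdown section of the metrics dump above.
slots = 27_802_291_281_930
counts = {
    "retiring":       5_843_337_876_958,
    "frontend":       5_116_890_434_599,
    "backend":        9_176_199_683_559,
    "speculation":    1_007_172_315_506,
    "smt-contention": 6_658_563_270_305,
}

smt_frac = counts["smt-contention"] / slots
for name, c in counts.items():
    frac = c / slots
    # Share of the non-SMT slots (zero for smt-contention itself),
    # which matches the parenthesized percentages in the dump.
    rescaled = 0.0 if name == "smt-contention" else frac / (1 - smt_frac)
    print(f"{name:14s} {frac:5.1%} ({rescaled:5.1%})")
```

Running this reproduces the reported pairs, e.g. retiring 21.0% (27.6%) and backend 33.0% (43.4%).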
Overall, I had explored these as a potential alternative to SPEC CPU as compiler-type benchmarks, but they seem to run a bit too quickly to be interesting for that purpose. They are still useful if one is looking for a particular implementation of classic problems.
