The NAS Parallel Benchmarks (NPB) test a set of computational kernels:
- IS – integer sort
- EP – embarrassingly parallel
- CG – conjugate gradient
- MG – multi-grid
- FT – Fourier transform
- BT – block tri-diagonal solver
- SP – scalar penta-diagonal solver
- LU – lower-upper Gauss-Seidel solver
Each kernel comes in a variety of sizes (S = small, W = workstation, A/B/C = standard tests, D/E/F = large tests), where each letter is larger than the previous one. This test tries 10 configurations: BT.C, CG.C, EP.C, EP.D, FT.C, IS.D, LU.C, MG.C, SP.B and SP.C. IS.D doesn’t run on Intel, but all the others do. Depending on the problem size, different numbers of threads are run.
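The configuration names follow a kernel.class pattern. As a minimal Python sketch of that naming scheme (the `KERNELS`, `CLASS_ORDER`, `CONFIGS` and `describe` names are illustrative and not part of the benchmark suite):

```python
# Kernel abbreviations as listed above.
KERNELS = {
    "IS": "integer sort",
    "EP": "embarrassingly parallel",
    "CG": "conjugate gradient",
    "MG": "multi-grid",
    "FT": "Fourier transform",
    "BT": "block tri-diagonal solver",
    "SP": "scalar penta-diagonal solver",
    "LU": "lower-upper Gauss-Seidel solver",
}

# Size classes from smallest to largest, as described in the text.
CLASS_ORDER = "SWABCDEF"

# The 10 configurations tried by this test.
CONFIGS = ["BT.C", "CG.C", "EP.C", "EP.D", "FT.C",
           "IS.D", "LU.C", "MG.C", "SP.B", "SP.C"]


def describe(config):
    """Split 'BT.C' into (kernel description, size rank)."""
    kernel, size = config.split(".")
    return KERNELS[kernel], CLASS_ORDER.index(size)


for config in CONFIGS:
    name, rank = describe(config)
    print(f"{config}: {name}, size rank {rank}")
```

A higher size rank means a larger problem, so EP.D is a bigger run than EP.C.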

The overall top-down distribution shows about 65% backend bound, with CPU and memory carrying about equal weight. However, some tests approach 90% backend bound and others are closer to 60%.

The AMD metrics show that about 29% of instructions are floating point (almost all 128-bit), about 8% are branches, and ~5% of slots are lost to branch misprediction. On-CPU time is only about 1/3 of the 16 cores available, and an initial graph suggests this is mostly because the algorithms don’t always run on all 16 cores.
elapsed 2283.426
on_cpu 0.329 # 5.26 / 16 cores
utime 11712.085
stime 293.999
nvcsw 418029 # 92.54%
nivcsw 33714 # 7.46%
inblock 24920 # 10.91/sec
onblock 726560 # 318.19/sec
cpu-clock 12006890392461 # 12006.890 seconds
task-clock 12007051116953 # 12007.051 seconds
page faults 32449764 # 2702.559/sec
context switches 462377 # 38.509/sec
cpu migrations 18933 # 1.577/sec
major page faults 3595 # 0.299/sec
minor page faults 32446169 # 2702.260/sec
alignment faults 0 # 0.000/sec
emulation faults 0 # 0.000/sec
branches 5515738653990 # 82.620 branches per 1000 inst
branch misses 108608594762 # 1.97% branch miss
conditional 3742804280329 # 56.063 conditional branches per 1000 inst
indirect 615681271121 # 9.222 indirect branches per 1000 inst
cpu-cycles 60351169928545 # 1.49 GHz
instructions 79670697330954 # 1.32 IPC
slots 120710831646738 #
retiring 27615568683340 # 22.9% (22.9%)
-- ucode 3933717481 # 0.0%
-- fastpath 27611634965859 # 22.9%
frontend 7033153177869 # 5.8% ( 5.8%)
-- latency 3463417792602 # 2.9%
-- bandwidth 3569735385267 # 3.0%
backend 79549840661075 # 65.9% (65.9%)
-- cpu 40042540131536 # 33.2%
-- memory 39507300529539 # 32.7%
speculation 6459137499915 # 5.4% ( 5.4%)
-- branch mispredict 6321715953737 # 5.2%
-- pipeline restart 137421546178 # 0.1%
smt-contention 53098719153 # 0.0% ( 0.0%)
cpu-cycles 80335334077607 # 1.64 GHz
instructions 117089528823367 # 1.46 IPC
instructions 39035071246839 # 28.407 l2 access per 1000 inst
l2 hit from l1 724297829438 # 21.89% l2 miss
l2 miss from l1 50355819375 #
l2 hit from l2 pf 192229686294 #
l3 hit from l2 pf 82702658409 #
l3 miss from l2 pf 109634385230 #
instructions 39021936306273 # 290.912 float per 1000 inst
float 512 197 # 0.000 AVX-512 per 1000 inst
float 256 135099664 # 0.003 AVX-256 per 1000 inst
float 128 11351820981171 # 290.909 AVX-128 per 1000 inst
float MMX 0 # 0.000 MMX per 1000 inst
float scalar 0 # 0.000 scalar per 1000 inst
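The derived figures in the AMD listing can be checked against the raw counters. The following is a minimal Python sketch, assuming that on_cpu is (utime + stime) divided by elapsed time times 16 cores, and that each top-down percentage is that category's slots divided by total slots. Both formulas are inferred from the numbers above rather than taken from the tool's source, and the variable names are illustrative.

```python
# Raw values copied from the AMD listing above.
elapsed = 2283.426
utime = 11712.085
stime = 293.999
cores = 16

# Assumed formula: busy CPU seconds over available CPU seconds.
busy_cores = (utime + stime) / elapsed
on_cpu = busy_cores / cores

slots = 120710831646738
topdown = {
    "retiring": 27615568683340,
    "frontend": 7033153177869,
    "backend": 79549840661075,
    "speculation": 6459137499915,
}
# Assumed formula: share of pipeline slots attributed to each category.
topdown_pct = {name: 100.0 * count / slots for name, count in topdown.items()}

# First cpu-cycles / instructions pair in the listing.
ipc = 79670697330954 / 60351169928545
branch_miss_pct = 100.0 * 108608594762 / 5515738653990

print(f"on_cpu {on_cpu:.3f} ({busy_cores:.2f} / {cores} cores)")
for name, pct in topdown_pct.items():
    print(f"{name:12s} {pct:5.1f}%")
print(f"IPC {ipc:.2f}, branch miss {branch_miss_pct:.2f}%")
```

Under these assumptions the results match the listing: 0.329 on_cpu, 65.9% backend, 1.32 IPC and a 1.97% branch miss rate.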
Intel metrics
elapsed 3395.363
on_cpu 0.498 # 7.97 / 16 cores
utime 26824.162
stime 240.716
nvcsw 571012 # 89.91%
nivcsw 64057 # 10.09%
inblock 1526680 # 449.64/sec
onblock 848720 # 249.96/sec
cpu-clock 27828052765411 # 27828.053 seconds
task-clock 27828221812104 # 27828.222 seconds
page faults 39637056 # 1424.347/sec
context switches 683518 # 24.562/sec
cpu migrations 38996 # 1.401/sec
major page faults 17773 # 0.639/sec
minor page faults 39619248 # 1423.707/sec
alignment faults 0 # 0.000/sec
emulation faults 0 # 0.000/sec
branches 23921316362428 # 146.786 branches per 1000 inst
branch misses 103717613886 # 0.43% branch miss
conditional 23921316438460 # 146.786 conditional branches per 1000 inst
indirect 4336551074975 # 26.610 indirect branches per 1000 inst
slots 328695207828722 #
retiring 171019529525049 # 52.0% (52.0%)
-- ucode 15361044597237 # 4.7%
-- fastpath 155658484927812 # 47.4%
frontend 22635542803013 # 6.9% ( 6.9%)
-- latency 7363451194826 # 2.2%
-- bandwidth 15272091608187 # 4.6%
backend 124024545934012 # 37.7% (37.7%)
-- cpu 50372780801574 # 15.3%
-- memory 73651765132438 # 22.4%
speculation 11136414067764 # 3.4% ( 3.4%)
-- branch mispredict 8681080075941 # 2.6%
-- pipeline restart 2455333991823 # 0.7%
smt-contention 0 # 0.0% ( 0.0%)
cpu-cycles 102330630188559 # 1.55 GHz
instructions 317073908099680 # 3.10 IPC
l2 access 1934847494161 # 12.244 l2 access per 1000 inst
l2 miss 510893459816 # 26.40% l2 miss
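To compare the two listings directly, the sketch below recomputes a few headline ratios for both systems from the raw counters. It is a rough illustration in Python: the `summarize` helper and dictionary layout are made up for this example, and the formulas (CPU seconds over elapsed time times 16 cores, category slots over total slots, L2 misses over L2 accesses) are assumptions inferred from the printed values.

```python
# Raw counters copied from the AMD and Intel listings.
amd = {
    "elapsed": 2283.426, "utime": 11712.085, "stime": 293.999,
    "slots": 120710831646738,
    "retiring": 27615568683340, "backend": 79549840661075,
}
intel = {
    "elapsed": 3395.363, "utime": 26824.162, "stime": 240.716,
    "slots": 328695207828722,
    "retiring": 171019529525049, "backend": 124024545934012,
}


def summarize(m, cores=16):
    """Return on_cpu fraction and top-down shares (in percent)."""
    return {
        "on_cpu": (m["utime"] + m["stime"]) / (m["elapsed"] * cores),
        "retiring": 100.0 * m["retiring"] / m["slots"],
        "backend": 100.0 * m["backend"] / m["slots"],
    }


for label, metrics in (("AMD", amd), ("Intel", intel)):
    s = summarize(metrics)
    print(f"{label:5s} on_cpu {s['on_cpu']:.3f} "
          f"retiring {s['retiring']:.1f}% backend {s['backend']:.1f}%")

# Intel-only ratios from the last four lines of the Intel listing.
intel_ipc = 317073908099680 / 102330630188559
intel_l2_miss_pct = 100.0 * 510893459816 / 1934847494161
print(f"Intel IPC {intel_ipc:.2f}, l2 miss {intel_l2_miss_pct:.2f}%")
```

Keep in mind that IS.D does not run on Intel, so the two totals do not cover exactly the same work.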
The process tree shows this is MPI code launched through mpiexec, with each solver binary named for its kernel and size class (for example bt.C.x).
1446 processes
96 ep.D.x 10489.55 1.66
36 sp.C.x 5617.93 36.90
36 bt.C.x 4643.11 18.33
72 lu.C.x 4173.60 15.98
72 is.D.x 2628.17 434.67
72 cg.C.x 1664.50 19.92
72 ft.C.x 1461.35 223.01
36 sp.B.x 1360.87 14.18
72 mg.C.x 663.37 20.76
72 ep.C.x 493.43 0.98
67 clinfo 16.63 5.57
186 mpiexec 8.56 23.04
38 vulkaninfo 0.83 1.32
6 php 0.15 0.77
6 glxinfo:gdrv0 0.15 0.06
4 vulkani:disk$0 0.09 0.14
2 glxinfo 0.07 0.02
2 glxinfo:cs0 0.07 0.02
2 glxinfo:disk$0 0.07 0.02
2 glxinfo:sh0 0.07 0.02
2 glxinfo:shlo0 0.07 0.02
2 llvmpipe-0 0.05 0.07
2 llvmpipe-1 0.05 0.07
2 llvmpipe-10 0.05 0.07
2 llvmpipe-11 0.05 0.07
2 llvmpipe-12 0.05 0.07
2 llvmpipe-13 0.05 0.07
2 llvmpipe-14 0.05 0.07
2 llvmpipe-15 0.05 0.07
2 llvmpipe-2 0.05 0.07
2 llvmpipe-3 0.05 0.07
2 llvmpipe-4 0.05 0.07
2 llvmpipe-5 0.05 0.07
2 llvmpipe-6 0.05 0.07
2 llvmpipe-7 0.05 0.07
2 llvmpipe-8 0.05 0.07
2 llvmpipe-9 0.05 0.07
6 clang 0.03 0.09
3 rocminfo 0.03 0.00
1 lspci 0.00 0.02
194 npb 0.00 0.00
100 sh 0.00 0.00
31 cut 0.00 0.00
24 bc 0.00 0.00
15 awk 0.00 0.00
13 gcc 0.00 0.00
11 gsettings 0.00 0.00
8 stat 0.00 0.00
8 systemd-detect- 0.00 0.00
6 llvm-link 0.00 0.00
5 phoronix-test-s 0.00 0.00
3 gmain 0.00 0.00
2 cc 0.00 0.00
2 dconf worker 0.00 0.00
2 lscpu 0.00 0.00
2 uname 0.00 0.00
2 which 0.00 0.00
2 xset 0.00 0.00
1 date 0.00 0.00
1 dirname 0.00 0.00
1 dmesg 0.00 0.00
1 dmidecode 0.00 0.00
1 grep 0.00 0.00
1 ifconfig 0.00 0.00
1 ip 0.00 0.00
1 lsmod 0.00 0.00
1 mktemp 0.00 0.00
1 ps 0.00 0.00
1 qdbus 0.00 0.00
1 readlink 0.00 0.00
1 realpath 0.00 0.00
1 sed 0.00 0.00
1 sort 0.00 0.00
1 stty 0.00 0.00
1 systemctl 0.00 0.00
1 template.sh 0.00 0.00
1 wc 0.00 0.00
1 xrandr 0.00 0.00
0 processes running
47 maximum processes
Here is an example run of the BT.C workload:
86732) npb cpu=13 start=5.79 finish=136.63
86733) npb cpu=14 start=5.79 finish=5.79
86734) npb cpu=15 start=5.79 finish=5.79
86735) cut cpu=10 start=5.79 finish=5.79
86736) npb cpu=0 start=5.79 finish=5.79
86737) npb cpu=1 start=5.79 finish=5.80
86738) npb cpu=14 start=5.79 finish=5.79
86739) bc cpu=15 start=5.79 finish=5.80
86740) mpiexec cpu=4 start=5.80 finish=136.60
86743) mpiexec cpu=2 start=6.38 finish=136.60
86744) mpiexec cpu=11 start=6.38 finish=6.38
86745) mpiexec cpu=15 start=6.40 finish=136.60
86747) mpiexec cpu=13 start=6.88 finish=136.60
86748) mpiexec cpu=7 start=6.88 finish=136.60
86749) bt.C.x cpu=1 start=6.89 finish=136.57
86751) bt.C.x cpu=12 start=6.89 finish=136.57
86754) bt.C.x cpu=14 start=6.90 finish=136.56
86750) bt.C.x cpu=5 start=6.89 finish=136.57
86753) bt.C.x cpu=11 start=6.90 finish=136.57
86757) bt.C.x cpu=2 start=6.91 finish=136.56
86752) bt.C.x cpu=15 start=6.90 finish=136.57
86756) bt.C.x cpu=0 start=6.90 finish=136.57
86759) bt.C.x cpu=4 start=6.91 finish=136.56
86755) bt.C.x cpu=0 start=6.90 finish=136.57
86758) bt.C.x cpu=11 start=6.91 finish=136.57
86760) bt.C.x cpu=12 start=6.91 finish=136.56
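The listing above can be summarized with a short script. This is a sketch in Python, assuming that start and finish are seconds since the trace began and that the cpu field is the CPU a process was observed on; neither is stated explicitly by the tool output, and the `LINE` pattern and variable names are illustrative.

```python
import re

# The bt.C.x lines copied from the listing above.
listing = """\
86749) bt.C.x cpu=1 start=6.89 finish=136.57
86751) bt.C.x cpu=12 start=6.89 finish=136.57
86754) bt.C.x cpu=14 start=6.90 finish=136.56
86750) bt.C.x cpu=5 start=6.89 finish=136.57
86753) bt.C.x cpu=11 start=6.90 finish=136.57
86757) bt.C.x cpu=2 start=6.91 finish=136.56
86752) bt.C.x cpu=15 start=6.90 finish=136.57
86756) bt.C.x cpu=0 start=6.90 finish=136.57
86759) bt.C.x cpu=4 start=6.91 finish=136.56
86755) bt.C.x cpu=0 start=6.90 finish=136.57
86758) bt.C.x cpu=11 start=6.91 finish=136.57
86760) bt.C.x cpu=12 start=6.91 finish=136.56
"""

LINE = re.compile(r"(\d+)\) (\S+) cpu=(\d+) start=([\d.]+) finish=([\d.]+)")

records = []
for line in listing.splitlines():
    match = LINE.match(line)
    if match:
        pid, name, cpu, start, finish = match.groups()
        records.append((int(pid), name, int(cpu), float(start), float(finish)))

ranks = [r for r in records if r[1] == "bt.C.x"]
durations = [finish - start for (_, _, _, start, finish) in ranks]
distinct_cpus = {cpu for (_, _, cpu, _, _) in ranks}

print(f"{len(ranks)} bt.C.x processes")
print(f"run time {min(durations):.2f} to {max(durations):.2f} seconds")
print(f"{len(distinct_cpus)} distinct CPUs recorded")
```

With 12 solver processes on a 16-core machine, this run can occupy at most 12/16 of the cores, which is consistent with the on_cpu figures being well below 1.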
