An open-source Quantum Monte Carlo (QMC) simulation code with six workloads, run using MPI.

The topdown profile shows differences among workloads, but overall a moderately high retirement rate.

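AMD metrics confirm this is floating-point code (189.4 float per 1000 instructions) with a low L2 miss rate (1.45%). Frontend stalls are on the high side at 20.4% of slots, including opcache misses (31.2% miss rate) and icache misses (43.4% miss rate).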
elapsed 2499.876
on_cpu 0.471 # 7.53 / 16 cores
utime 18540.133
stime 279.802
nvcsw 267253 # 87.33%
nivcsw 38781 # 12.67%
inblock 0 # 0.00/sec
onblock 612760 # 245.12/sec
cpu-clock 18820238026460 # 18820.238 seconds
task-clock 18820339009536 # 18820.339 seconds
page faults 3109375 # 165.214/sec
context switches 318172 # 16.906/sec
cpu migrations 21668 # 1.151/sec
major page faults 3045 # 0.162/sec
minor page faults 3106330 # 165.052/sec
alignment faults 0 # 0.000/sec
emulation faults 0 # 0.000/sec
branches 22461830733574 # 112.402 branches per 1000 inst
branch misses 95072047637 # 0.42% branch miss
conditional 15351486721874 # 76.821 conditional branches per 1000 inst
indirect 1594633030241 # 7.980 indirect branches per 1000 inst
cpu-cycles 81503776961511 # 1.87 GHz
instructions 216873675248429 # 2.66 IPC
slots 163024728552936 #
retiring 74147801543055 # 45.5% (45.5%)
-- ucode 223663585020 # 0.1%
-- fastpath 73924137958035 # 45.3%
frontend 33251313416882 # 20.4% (20.4%)
-- latency 15458432843526 # 9.5%
-- bandwidth 17792880573356 # 10.9%
backend 52902396193136 # 32.5% (32.5%)
-- cpu 28130036258952 # 17.3%
-- memory 24772359934184 # 15.2%
speculation 2652527659474 # 1.6% ( 1.6%)
-- branch mispredict 2312381697199 # 1.4%
-- pipeline restart 340145962275 # 0.2%
smt-contention 70637863915 # 0.0% ( 0.0%)
cpu-cycles 70056047900986 # 1.89 GHz
instructions 184301782099454 # 2.63 IPC
instructions 61443640810233 # 58.085 l2 access per 1000 inst
l2 hit from l1 3203344833251 # 1.45% l2 miss
l2 miss from l1 27545024648 #
l2 hit from l2 pf 341279478407 #
l3 hit from l2 pf 22923123161 #
l3 miss from l2 pf 1419531841 #
instructions 61414388678104 # 189.442 float per 1000 inst
float 512 124 # 0.000 AVX-512 per 1000 inst
float 256 1028 # 0.000 AVX-256 per 1000 inst
float 128 11634484218727 # 189.442 AVX-128 per 1000 inst
float MMX 0 # 0.000 MMX per 1000 inst
float scalar 7 # 0.000 scalar per 1000 inst
instructions 183320760693268 #
opcache 25758437755830 # 140.510 opcache per 1000 inst
opcache miss 8035610764180 # 31.2% opcache miss rate
l1 dTLB miss 926871253871 # 5.056 L1 dTLB per 1000 inst
l2 dTLB miss 4877453192 # 0.027 L2 dTLB per 1000 inst
instructions 216933399893634 #
icache 12980381188061 # 59.836 icache per 1000 inst
icache miss 5639705054563 # 43.4% icache miss rate
l1 iTLB miss 672008848887 # 3.098 L1 iTLB per 1000 inst
l2 iTLB miss 0 # 0.000 L2 iTLB per 1000 inst
tlb flush 614276 # 0.000 TLB flush per 1000 inst
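The derived values in the comment column can be reproduced from the raw counters. The sketch below is an assumption about how they are computed, not the profiling tool's actual code: the function names are hypothetical, and the formulas were chosen because they match the figures printed in the AMD dump above.

```python
def on_cpu_fraction(utime, stime, elapsed, cores):
    """Average fraction of all cores kept busy: CPU seconds / wall seconds / cores."""
    return (utime + stime) / elapsed / cores


def ipc(instructions, cycles):
    """Instructions retired per CPU cycle."""
    return instructions / cycles


def topdown_shares(slots, counts):
    """Express each topdown category as a percentage of total pipeline slots."""
    return {name: 100.0 * count / slots for name, count in counts.items()}


# Raw values copied from the AMD run.
amd_on_cpu = on_cpu_fraction(utime=18540.133, stime=279.802,
                             elapsed=2499.876, cores=16)
amd_ipc = ipc(instructions=216873675248429, cycles=81503776961511)
amd_topdown = topdown_shares(
    slots=163024728552936,
    counts={
        "retiring": 74147801543055,
        "frontend": 33251313416882,
        "backend": 52902396193136,
        "speculation": 2652527659474,
    },
)

print(f"on_cpu {amd_on_cpu:.3f}")
print(f"IPC    {amd_ipc:.2f}")
for name, pct in amd_topdown.items():
    print(f"{name:12s} {pct:.1f}%")
```

Counters that are collected in separate groups (for example the several `instructions` lines) do not share a common denominator, so ratios should only be taken between values from the same group.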
Intel metrics confirm a low DRAM-bound rate (0.1%) and show a low overall level of backend stalls (14.6% of slots).
elapsed 5737.400
on_cpu 0.734 # 11.74 / 16 cores
utime 67038.996
stime 300.316
nvcsw 304024 # 75.55%
nivcsw 98410 # 24.45%
inblock 318032 # 55.43/sec
onblock 509584 # 88.82/sec
cpu-clock 67339817205818 # 67339.817 seconds
task-clock 67339907346354 # 67339.907 seconds
page faults 2014512 # 29.916/sec
context switches 430696 # 6.396/sec
cpu migrations 48520 # 0.721/sec
major page faults 5995 # 0.089/sec
minor page faults 2008517 # 29.827/sec
alignment faults 0 # 0.000/sec
emulation faults 0 # 0.000/sec
branches 59500946712395 # 127.456 branches per 1000 inst
branch misses 197934244602 # 0.33% branch miss
conditional 59500946747051 # 127.456 conditional branches per 1000 inst
indirect 13124976106133 # 28.115 indirect branches per 1000 inst
slots 415180653423602 #
retiring 273696487216963 # 65.9% (65.9%) high
-- ucode 20184709555080 # 4.9%
-- fastpath 253511777661883 # 61.1%
frontend 61819690085221 # 14.9% (14.9%)
-- latency 17123034533870 # 4.1%
-- bandwidth 44696655551351 # 10.8%
backend 60638837676964 # 14.6% (14.6%) low
-- cpu 44744051942427 # 10.8%
-- memory 15894785734537 # 3.8%
speculation 17966962240492 # 4.3% ( 4.3%)
-- branch mispredict 12994602079182 # 3.1%
-- pipeline restart 4972360161310 # 1.2%
smt-contention 0 # 0.0% ( 0.0%)
cpu-cycles 208272485168282 # 2.00 GHz
instructions 843020141703026 # 4.05 IPC high
l2 access 4483162925590 # 14.309 l2 access per 1000 inst
l2 miss 43256958269 # 0.96% l2 miss
cpu-cycles 68419223332582 # 6.8% memory latency
load stalls 3123740243951 # 0.0% l1 bound
l1 miss 3242717002137 # 4.6% l2 bound
l2 miss 98796768732 # 0.1% l3 bound
l3 miss 53243773665 # 0.1% dram bound
store_stalls 1510444863594 # 2.2% store bound
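For comparing the two runs programmatically, the dump lines can be split into a name, a raw value, and the trailing comment. This is a minimal sketch that assumes every metric line has the shape `name value # note`, with the value as the last whitespace-separated token before the `#`; `Metric` and `parse_metric` are hypothetical names, and the sample lines are copied from the Intel run above.

```python
from typing import NamedTuple, Optional


class Metric(NamedTuple):
    name: str
    value: float
    note: str


def parse_metric(line: str) -> Optional[Metric]:
    """Split one 'name value # note' line; return None if no numeric value is found."""
    left, _, note = line.partition("#")
    parts = left.split()
    if len(parts) < 2:
        return None
    token = parts[-1]
    try:
        # Keep large counters as exact integers, fall back to float for timings.
        value = int(token) if token.isdigit() else float(token)
    except ValueError:
        return None
    return Metric(" ".join(parts[:-1]), value, note.strip())


SAMPLE = """\
elapsed 5737.400
branches 59500946712395 # 127.456 branches per 1000 inst
branch misses 197934244602 # 0.33% branch miss
-- memory 15894785734537 # 3.8%
smt-contention 0 # 0.0% ( 0.0%)
"""

metrics = {}
for raw in SAMPLE.splitlines():
    parsed = parse_metric(raw)
    if parsed is not None:
        metrics[parsed.name] = parsed

miss_rate = 100.0 * metrics["branch misses"].value / metrics["branches"].value
print(f"branch miss rate: {miss_rate:.2f}%")
```

Some names, such as `instructions` and `l2 miss`, appear more than once in a full dump, so a dictionary keyed by name keeps only the last occurrence; a list of `Metric` tuples preserves all of them.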
The process overview shows qmcpack as the primary process.
1243 processes
725 qmcpack 59023.27 889.06
68 clinfo 17.18 6.52
174 mpirun 8.26 22.61
38 vulkaninfo 1.33 1.15
4 vulkani:disk$0 0.14 0.13
6 php 0.08 0.44
2 llvmpipe-0 0.07 0.07
2 llvmpipe-1 0.07 0.07
2 llvmpipe-10 0.07 0.07
2 llvmpipe-11 0.07 0.07
2 llvmpipe-12 0.07 0.07
2 llvmpipe-13 0.07 0.07
2 llvmpipe-14 0.07 0.07
2 llvmpipe-15 0.07 0.07
2 llvmpipe-2 0.07 0.07
2 llvmpipe-3 0.07 0.07
2 llvmpipe-4 0.07 0.07
2 llvmpipe-5 0.07 0.07
2 llvmpipe-6 0.07 0.07
2 llvmpipe-7 0.07 0.07
2 llvmpipe-8 0.07 0.07
2 llvmpipe-9 0.07 0.07
6 clang 0.04 0.08
3 rocminfo 0.00 0.03
1 lspci 0.00 0.03
1 ps 0.00 0.01
95 sh 0.00 0.00
13 gcc 0.00 0.00
12 gsettings 0.00 0.00
8 stat 0.00 0.00
8 systemd-detect- 0.00 0.00
6 llvm-link 0.00 0.00
5 phoronix-test-s 0.00 0.00
4 glxinfo 0.00 0.00
3 gmain 0.00 0.00
2 cc 0.00 0.00
2 lscpu 0.00 0.00
2 setterm 0.00 0.00
2 uname 0.00 0.00
2 which 0.00 0.00
1 date 0.00 0.00
1 dconf worker 0.00 0.00
1 dirname 0.00 0.00
1 dmesg 0.00 0.00
1 dmidecode 0.00 0.00
1 grep 0.00 0.00
1 ifconfig 0.00 0.00
1 ip 0.00 0.00
1 lsmod 0.00 0.00
1 mktemp 0.00 0.00
1 python 0.00 0.00
1 python3 0.00 0.00
1 qdbus 0.00 0.00
1 readlink 0.00 0.00
1 realpath 0.00 0.00
1 sed 0.00 0.00
1 sort 0.00 0.00
1 stty 0.00 0.00
1 systemctl 0.00 0.00
1 template.sh 0.00 0.00
1 wc 0.00 0.00
0 processes running
47 maximum processes
Computation blocks
175875) qmcpack cpu=3 start=6.32 finish=34.76
175876) mpirun cpu=13 start=6.32 finish=34.73
175880) mpirun cpu=2 start=6.52 finish=34.73
175881) mpirun cpu=14 start=6.52 finish=6.52
175882) mpirun cpu=3 start=6.54 finish=34.73
175884) mpirun cpu=6 start=6.64 finish=34.73
175885) mpirun cpu=8 start=6.64 finish=34.73
175886) qmcpack cpu=8 start=6.68 finish=34.72
175894) qmcpack cpu=13 start=6.71 finish=34.72
175895) qmcpack cpu=10 start=6.71 finish=34.72
175887) qmcpack cpu=5 start=6.68 finish=34.72
175896) qmcpack cpu=13 start=6.72 finish=34.72
175897) qmcpack cpu=4 start=6.72 finish=34.72
175888) qmcpack cpu=9 start=6.68 finish=34.72
175898) qmcpack cpu=1 start=6.72 finish=34.72
175900) qmcpack cpu=7 start=6.72 finish=34.72
175889) qmcpack cpu=14 start=6.69 finish=34.72
175899) qmcpack cpu=11 start=6.72 finish=34.72
175901) qmcpack cpu=11 start=6.73 finish=34.72
175890) qmcpack cpu=7 start=6.69 finish=34.72
175902) qmcpack cpu=11 start=6.73 finish=34.72
175903) qmcpack cpu=13 start=6.73 finish=34.72
175891) qmcpack cpu=12 start=6.70 finish=34.72
175904) qmcpack cpu=10 start=6.73 finish=34.72
175905) qmcpack cpu=14 start=6.74 finish=34.72
175892) qmcpack cpu=2 start=6.70 finish=34.72
175906) qmcpack cpu=6 start=6.74 finish=34.72
175907) qmcpack cpu=10 start=6.74 finish=34.72
175893) qmcpack cpu=0 start=6.71 finish=34.72
175908) qmcpack cpu=13 start=6.74 finish=34.72
175909) qmcpack cpu=6 start=6.75 finish=34.72
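The computation-block listing can be turned into per-process lifetimes with a small parser. This is a sketch under assumptions: the leading number is treated as a process ID, `start` and `finish` are treated as timestamps in the same unit, and the `Block` type and `parse_blocks` function are hypothetical names. The sample lines are copied from the listing above.

```python
import re
from typing import List, NamedTuple


class Block(NamedTuple):
    pid: int          # assumed to be the process ID
    command: str
    cpu: int
    start: float
    finish: float

    @property
    def duration(self) -> float:
        """Lifetime of the block, in the same unit as start and finish."""
        return self.finish - self.start


BLOCK_RE = re.compile(
    r"^(\d+)\)\s+(\S+)\s+cpu=(\d+)\s+start=([\d.]+)\s+finish=([\d.]+)$"
)


def parse_blocks(text: str) -> List[Block]:
    """Parse lines of the form 'ID) command cpu=N start=T finish=T'; skip others."""
    blocks = []
    for line in text.splitlines():
        match = BLOCK_RE.match(line.strip())
        if match:
            pid, command, cpu, start, finish = match.groups()
            blocks.append(Block(int(pid), command, int(cpu),
                                float(start), float(finish)))
    return blocks


SAMPLE = """\
175875) qmcpack cpu=3 start=6.32 finish=34.76
175876) mpirun cpu=13 start=6.32 finish=34.73
175881) mpirun cpu=14 start=6.52 finish=6.52
175886) qmcpack cpu=8 start=6.68 finish=34.72
"""

for block in parse_blocks(SAMPLE):
    print(f"{block.pid} {block.command:8s} cpu={block.cpu:2d} "
          f"duration={block.duration:.2f}")
```

Sorting the parsed blocks by `duration` separates short-lived helper processes, such as the `mpirun` entry that starts and finishes at 6.52, from the long-running `qmcpack` ranks.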
