A suite of Python HPC benchmarks that can run on both the CPU and the GPU. There are multiple backends, but my system seemed to have only NumPy available: JAX, Numba and Aesara were missing, and TensorFlow and PyTorch were not selected. That left two of the twelve workloads to run. The workload looks single-threaded in the middle of the chart, with the edges being the JAX, Numba and Aesara attempts.
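
Since the set of installed backends decides which workloads actually run, a quick availability check before launching the suite can be useful. The following is a minimal sketch, not part of the benchmark itself: the function name `available_backends` and the mapping of backend names to import names (for example `torch` for PyTorch) are assumptions made for illustration.

```python
import importlib.util

# Assumed mapping of backend name -> Python import name (illustrative only).
BACKENDS = {
    "numpy": "numpy",
    "jax": "jax",
    "numba": "numba",
    "aesara": "aesara",
    "tensorflow": "tensorflow",
    "pytorch": "torch",
}


def available_backends(candidates=BACKENDS):
    """Return {backend name: True/False} without importing the modules."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in candidates.items()
    }


if __name__ == "__main__":
    for name, present in available_backends().items():
        print(f"{name:12s} {'available' if present else 'missing'}")
```

Using `find_spec` only locates each module rather than importing it, so the check stays cheap even when a heavy backend is installed.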

The topdown profile shows a mix across the workload attempts, with frontend stalls on the failing cases and backend stalls on the passing ones.
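
To make the topdown percentages easier to check, here is a small sketch that recomputes them from the AMD slot counts listed in the metrics further down. The helper names are hypothetical. The reading of the parenthesized column as "share of slots with smt-contention removed from the denominator" is an inference from the numbers, not something the tool output states.

```python
# Slot counts copied from the AMD topdown rows of this report.
AMD_SLOTS = 852041413506
AMD_TOPDOWN = {
    "retiring": 239696482165,
    "frontend": 235886914660,
    "backend": 321872190110,
    "speculation": 28443856449,
    "smt-contention": 26141421008,
}


def share_of_slots(counts, slots):
    """Percentage of all pipeline slots attributed to each category."""
    return {name: 100.0 * value / slots for name, value in counts.items()}


def share_excluding(counts, slots, excluded="smt-contention"):
    """Percentages with one category's slots removed from the denominator.

    Assumption: this is how the parenthesized column is derived.
    """
    usable = slots - counts[excluded]
    return {
        name: 100.0 * value / usable
        for name, value in counts.items()
        if name != excluded
    }


if __name__ == "__main__":
    plain = share_of_slots(AMD_TOPDOWN, AMD_SLOTS)
    adjusted = share_excluding(AMD_TOPDOWN, AMD_SLOTS)
    for name, pct in plain.items():
        extra = f" ({adjusted[name]:.1f}%)" if name in adjusted else ""
        print(f"{name:15s} {pct:5.1f}%{extra}")
```

On the Intel run smt-contention is zero, so under this reading both columns coincide, which matches the listing.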

The AMD metrics show some floating-point activity and a balance of frontend and backend stalls. I expect this to vary depending on the backends chosen.
elapsed 214.590
on_cpu 0.029 # 0.46 / 16 cores
utime 65.401
stime 33.947
nvcsw 2373 # 70.79%
nivcsw 979 # 29.21%
inblock 0 # 0.00/sec
onblock 2256 # 10.51/sec
cpu-clock 99401250819 # 99.401 seconds
task-clock 99408371231 # 99.408 seconds
page faults 9395772 # 94516.909/sec
context switches 4242 # 42.672/sec
cpu migrations 222 # 2.233/sec
major page faults 0 # 0.000/sec
minor page faults 9395772 # 94516.909/sec
alignment faults 0 # 0.000/sec
emulation faults 0 # 0.000/sec
branches 103193296965 # 153.981 branches per 1000 inst
branch misses 4321031643 # 4.19% branch miss
conditional 61129993121 # 91.216 conditional branches per 1000 inst
indirect 10014830714 # 14.944 indirect branches per 1000 inst
cpu-cycles 421731065390 # 0.12 GHz
instructions 662316153608 # 1.57 IPC
slots 852041413506 #
retiring 239696482165 # 28.1% (29.0%)
-- ucode 1192405255 # 0.1%
-- fastpath 238504076910 # 28.0%
frontend 235886914660 # 27.7% (28.6%)
-- latency 182114380362 # 21.4%
-- bandwidth 53772534298 # 6.3%
backend 321872190110 # 37.8% (39.0%)
-- cpu 119350501661 # 14.0%
-- memory 202521688449 # 23.8%
speculation 28443856449 # 3.3% ( 3.4%)
-- branch mispredict 28180477317 # 3.3%
-- pipeline restart 263379132 # 0.0%
smt-contention 26141421008 # 3.1% ( 0.0%)
cpu-cycles 420627774223 # 0.12 GHz
instructions 667713898623 # 1.59 IPC
instructions 221072463900 # 81.519 l2 access per 1000 inst
l2 hit from l1 9622992118 # 31.34% l2 miss
l2 miss from l1 512043469 #
l2 hit from l2 pf 3262559785 #
l3 hit from l2 pf 3743710843 #
l3 miss from l2 pf 1392296327 #
instructions 222193823037 # 108.414 float per 1000 inst
float 512 58 # 0.000 AVX-512 per 1000 inst
float 256 95640867 # 0.430 AVX-256 per 1000 inst
float 128 23993315635 # 107.984 AVX-128 per 1000 inst
float MMX 0 # 0.000 MMX per 1000 inst
float scalar 40 # 0.000 scalar per 1000 inst
instructions 2376781 #
opcache 893250 # 375.823 opcache per 1000 inst
opcache miss 473918 # 53.1% opcache miss rate
l1 dTLB miss 4172 # 1.755 L1 dTLB per 1000 inst
l2 dTLB miss 1012 # 0.426 L2 dTLB per 1000 inst
instructions 2402871 #
icache 1176420 # 489.589 icache per 1000 inst
icache miss 108062 # 9.2% icache miss rate
l1 iTLB miss 10 # 0.004 L1 iTLB per 1000 inst
l2 iTLB miss 0 # 0.000 L2 iTLB per 1000 inst
tlb flush 20 # 0.008 TLB flush per 1000 inst
Intel metrics
elapsed 237.303
on_cpu 0.032 # 0.51 / 16 cores
utime 91.367
stime 29.632
nvcsw 2776 # 4.58%
nivcsw 57898 # 95.42%
inblock 30536 # 128.68/sec
onblock 2200 # 9.27/sec
cpu-clock 121049310913 # 121.049 seconds
task-clock 121056968018 # 121.057 seconds
page faults 9552684 # 78910.650/sec
context switches 61654 # 509.297/sec
cpu migrations 388 # 3.205/sec
major page faults 240 # 1.983/sec
minor page faults 9552444 # 78908.667/sec
alignment faults 0 # 0.000/sec
emulation faults 0 # 0.000/sec
branches 124270109121 # 151.128 branches per 1000 inst
branch misses 1427179445 # 1.15% branch miss
conditional 124270133921 # 151.128 conditional branches per 1000 inst
indirect 18073500647 # 21.980 indirect branches per 1000 inst
slots 2221669248344 #
retiring 747027275446 # 33.6% (33.6%)
-- ucode 73452005367 # 3.3%
-- fastpath 673575270079 # 30.3%
frontend 279668953965 # 12.6% (12.6%)
-- latency 122976164207 # 5.5%
-- bandwidth 156692789758 # 7.1%
backend 996700033973 # 44.9% (44.9%)
-- cpu 314848326319 # 14.2%
-- memory 681851707654 # 30.7%
speculation 205285930989 # 9.2% ( 9.2%)
-- branch mispredict 193227678611 # 8.7%
-- pipeline restart 12058252378 # 0.5%
smt-contention 0 # 0.0% ( 0.0%)
cpu-cycles 398681714399 # 0.11 GHz
instructions 745549574653 # 1.87 IPC
l2 access 45007184463 # 60.675 l2 access per 1000 inst
l2 miss 25658347372 # 57.01% l2 miss
cpu-cycles 397444517269 # 35.9% memory latency
load stalls 106883580882 # 0.0% l1 bound
l1 miss 123115394313 # 14.8% l2 bound
l2 miss 64305341449 # 6.9% l3 bound
l3 miss 36690735023 # 9.2% dram bound
store_stalls 35742162866 # 9.0% store bound
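
The derived figures in the comment column can be reproduced from the raw counters. The sketch below does this for on_cpu, IPC, and the branch miss rate, using values copied from the two listings above. The function names are hypothetical, and the on_cpu formula, (utime + stime) / elapsed / cores, is an assumption that happens to match the printed values.

```python
def on_cpu(utime, stime, elapsed, cores=16):
    """Average fraction of the whole machine kept busy (assumed formula)."""
    return (utime + stime) / elapsed / cores


def ipc(instructions, cycles):
    """Instructions retired per CPU cycle."""
    return instructions / cycles


def miss_rate_pct(misses, total):
    """Misses as a percentage of total events."""
    return 100.0 * misses / total


if __name__ == "__main__":
    # AMD run
    print(f"AMD   on_cpu {on_cpu(65.401, 33.947, 214.590):.3f}")
    print(f"AMD   IPC    {ipc(662316153608, 421731065390):.2f}")
    print(f"AMD   branch miss {miss_rate_pct(4321031643, 103193296965):.2f}%")
    # Intel run
    print(f"Intel on_cpu {on_cpu(91.367, 29.632, 237.303):.3f}")
    print(f"Intel IPC    {ipc(745549574653, 398681714399):.2f}")
    print(f"Intel branch miss {miss_rate_pct(1427179445, 124270109121):.2f}%")
```

An on_cpu value of about 0.03 corresponds to roughly half of one core out of sixteen, averaged over the whole run.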
The process summary highlights that these are Python-driven tests.
661 processes
385 python3 1031.50 534.99
38 vulkaninfo 1.51 1.34
6 glxinfo:gdrv0 0.19 0.07
6 glxinfo:gl0 0.19 0.07
4 vulkani:disk$0 0.16 0.15
6 php 0.11 0.14
2 glxinfo 0.09 0.03
2 glxinfo:cs0 0.09 0.03
2 glxinfo:disk$0 0.09 0.03
2 glxinfo:sh0 0.09 0.03
2 glxinfo:shlo0 0.09 0.03
2 llvmpipe-0 0.08 0.07
2 llvmpipe-1 0.08 0.07
2 llvmpipe-10 0.08 0.07
2 llvmpipe-11 0.08 0.07
2 llvmpipe-12 0.08 0.07
2 llvmpipe-13 0.08 0.07
2 llvmpipe-14 0.08 0.07
2 llvmpipe-15 0.08 0.07
2 llvmpipe-2 0.08 0.07
2 llvmpipe-3 0.08 0.07
2 llvmpipe-4 0.08 0.07
2 llvmpipe-5 0.08 0.07
2 llvmpipe-6 0.08 0.07
2 llvmpipe-7 0.08 0.07
2 llvmpipe-8 0.08 0.07
2 llvmpipe-9 0.08 0.07
1 lspci 0.01 0.02
1 ps 0.00 0.01
70 sh 0.00 0.00
24 pyhpc 0.00 0.00
12 gcc 0.00 0.00
10 gsettings 0.00 0.00
8 stat 0.00 0.00
8 systemd-detect- 0.00 0.00
5 phoronix-test-s 0.00 0.00
3 dconf worker 0.00 0.00
3 gmain 0.00 0.00
2 clinfo 0.00 0.00
2 lscpu 0.00 0.00
2 uname 0.00 0.00
2 which 0.00 0.00
2 xset 0.00 0.00
1 cc 0.00 0.00
1 date 0.00 0.00
1 dirname 0.00 0.00
1 dmesg 0.00 0.00
1 dmidecode 0.00 0.00
1 grep 0.00 0.00
1 ifconfig 0.00 0.00
1 ip 0.00 0.00
1 lsmod 0.00 0.00
1 mktemp 0.00 0.00
1 readlink 0.00 0.00
1 realpath 0.00 0.00
1 sed 0.00 0.00
1 sort 0.00 0.00
1 stty 0.00 0.00
1 systemctl 0.00 0.00
1 template.sh 0.00 0.00
1 wc 0.00 0.00
1 xrandr 0.00 0.00
0 processes running
47 maximum processes
Example execution block
130168) pyhpc cpu=3 start=78.30 finish=84.94
130169) python3 cpu=12 start=78.30 finish=84.94
130170) python3 cpu=13 start=78.33 finish=84.93
130171) python3 cpu=6 start=78.33 finish=84.93
130172) python3 cpu=15 start=78.33 finish=84.93
130173) python3 cpu=0 start=78.33 finish=84.93
130174) python3 cpu=9 start=78.33 finish=84.93
130175) python3 cpu=2 start=78.33 finish=84.93
130176) python3 cpu=11 start=78.33 finish=84.93
130177) python3 cpu=4 start=78.33 finish=84.93
130178) python3 cpu=14 start=78.34 finish=84.93
130179) python3 cpu=7 start=78.34 finish=84.93
130180) python3 cpu=8 start=78.34 finish=84.93
130181) python3 cpu=1 start=78.34 finish=84.93
130182) python3 cpu=5 start=78.34 finish=84.93
130183) python3 cpu=10 start=78.34 finish=84.93
130184) python3 cpu=3 start=78.34 finish=84.93
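
The execution block above can be summarized programmatically. The following is a sketch under the assumption that every line follows the `pid) name cpu=N start=S finish=F` layout seen in the excerpt. The regular expression and the helper name `parse_block` are illustrative and not part of any tool.

```python
import re

# Assumed line layout: "<pid>) <name> cpu=<n> start=<sec> finish=<sec>"
LINE_RE = re.compile(
    r"^\s*(?P<pid>\d+)\)\s+(?P<name>\S+)\s+cpu=(?P<cpu>\d+)"
    r"\s+start=(?P<start>[\d.]+)\s+finish=(?P<finish>[\d.]+)\s*$"
)


def parse_block(text):
    """Parse execution-block lines into dicts, skipping lines that do not match."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue
        start, finish = float(m["start"]), float(m["finish"])
        rows.append({
            "pid": int(m["pid"]),
            "name": m["name"],
            "cpu": int(m["cpu"]),
            "start": start,
            "finish": finish,
            "duration": finish - start,
        })
    return rows


# Three lines copied from the excerpt above.
SAMPLE = """\
130168) pyhpc cpu=3 start=78.30 finish=84.94
130169) python3 cpu=12 start=78.30 finish=84.94
130170) python3 cpu=13 start=78.33 finish=84.93
"""

if __name__ == "__main__":
    rows = parse_block(SAMPLE)
    cpus = sorted({r["cpu"] for r in rows if r["name"] == "python3"})
    print(f"{len(rows)} entries, python3 on cpus {cpus}")
    for r in rows:
        print(f"{r['pid']} {r['name']:8s} {r['duration']:.2f}s")
```

Applied to the full block, the same parsing would show the python3 entries spread over all sixteen CPUs within one window of about 6.6 seconds.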
Overall, this is a test that could be elaborated further to really exercise particular backends, though it is also not long-running.
