A set of computing benchmarks that use OpenCL, OpenMP and CUDA. The OpenCL ones fail; a total of four workloads (lbm, mri-gridding, stencil and cutcp) run correctly.

The topdown profile shows that the workloads are dominated by backend stalls.

AMD metrics confirm high backend stalls (72.1% of slots), with small contributions from the other stall categories and a low retirement rate. This is floating-point code with a low IPC (0.32).
elapsed 581.377
on_cpu 0.648 # 10.37 / 16 cores
utime 6019.004
stime 8.001
nvcsw 11085 # 16.04%
nivcsw 58007 # 83.96%
inblock 0 # 0.00/sec
onblock 618240 # 1063.41/sec
cpu-clock 6028754320350 # 6028.754 seconds
task-clock 6028877226692 # 6028.877 seconds
page faults 2529117 # 419.500/sec
context switches 71602 # 11.877/sec
cpu migrations 1516 # 0.251/sec
major page faults 13 # 0.002/sec
minor page faults 2529104 # 419.498/sec
alignment faults 0 # 0.000/sec
emulation faults 0 # 0.000/sec
branches 356376744888 # 41.762 branches per 1000 inst
branch misses 5808698352 # 1.63% branch miss
conditional 311127182240 # 36.460 conditional branches per 1000 inst
indirect 6537032472 # 0.766 indirect branches per 1000 inst
cpu-cycles 26789856580851 # 2.91 GHz
instructions 8523753717315 # 0.32 IPC low
slots 53574576334074 #
retiring 3037285457309 # 5.7% ( 6.9%) low
-- ucode 30795275761 # 0.1%
-- fastpath 3006490181548 # 5.6%
frontend 2105424630950 # 3.9% ( 4.8%) low
-- latency 1145189084004 # 2.1%
-- bandwidth 960235546946 # 1.8%
backend 38648611175245 # 72.1% (87.7%) high
-- cpu 17650289931049 # 32.9%
-- memory 20998321244196 # 39.2%
speculation 254667013033 # 0.5% ( 0.6%) low
-- branch mispredict 167646250146 # 0.3%
-- pipeline restart 87020762887 # 0.2%
smt-contention 9528539434241 # 17.8% ( 0.0%)
cpu-cycles 26694322086662 # 2.91 GHz
instructions 8523373024551 # 0.32 IPC low
instructions 2839498090258 # 49.711 l2 access per 1000 inst
l2 hit from l1 109922376042 # 29.68% l2 miss
l2 miss from l1 23104420522 #
l2 hit from l2 pf 12447739921 #
l3 hit from l2 pf 1892129919 #
l3 miss from l2 pf 16891383017 #
instructions 2839703362814 # 335.613 float per 1000 inst
float 512 126 # 0.000 AVX-512 per 1000 inst
float 256 926 # 0.000 AVX-256 per 1000 inst
float 128 953042328671 # 335.613 AVX-128 per 1000 inst
float MMX 0 # 0.000 MMX per 1000 inst
float scalar 5 # 0.000 scalar per 1000 inst
instructions 8524876284215 #
opcache 971449144496 # 113.955 opcache per 1000 inst
opcache miss 18397630060 # 1.9% opcache miss rate
l1 dTLB miss 19074903429 # 2.238 L1 dTLB per 1000 inst
l2 dTLB miss 15094481558 # 1.771 L2 dTLB per 1000 inst
instructions 8520906203159 #
icache 26149384533 # 3.069 icache per 1000 inst
icache miss 2194992940 # 8.4% icache miss rate
l1 iTLB miss 54453575 # 0.006 L1 iTLB per 1000 inst
l2 iTLB miss 0 # 0.000 L2 iTLB per 1000 inst
tlb flush 2101941 # 0.000 TLB flush per 1000 inst
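
The annotations in the AMD dump can be reproduced from the raw counters. The following is a minimal sketch, assuming the comment columns are plain ratios (CPU time over elapsed time, instructions over cycles, category over slots); the function and variable names are illustrative and not part of the profiling tool.

```python
# Counter values copied from the AMD run above.
ELAPSED = 581.377            # wall-clock seconds
UTIME = 6019.004             # user CPU seconds
STIME = 8.001                # system CPU seconds
CORES = 16
CPU_CYCLES = 26789856580851
INSTRUCTIONS = 8523753717315
SLOTS = 53574576334074
BACKEND = 38648611175245
BACKEND_MEMORY = 20998321244196


def derived_metrics():
    """Recompute the annotated ratios (assumed formulas, see lead-in)."""
    busy_cores = (UTIME + STIME) / ELAPSED
    return {
        "busy_cores": busy_cores,                    # average cores in use
        "on_cpu": busy_cores / CORES,                # fraction of the machine
        "ipc": INSTRUCTIONS / CPU_CYCLES,            # instructions per cycle
        "backend_pct": 100 * BACKEND / SLOTS,        # share of issue slots
        "backend_memory_pct": 100 * BACKEND_MEMORY / SLOTS,
    }


if __name__ == "__main__":
    for key, value in derived_metrics().items():
        print(f"{key:20s} {value:8.3f}")
```

Under these assumptions the results round to the values printed in the dump: about 10.37 busy cores, an on_cpu fraction of 0.648, an IPC of 0.32, and 72.1% backend-bound slots of which 39.2% are memory.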

Intel metrics show that the L3-bound portion of the memory stalls is the largest (30.8% of cycles), followed by the L1-bound portion (24.9%).
elapsed 1375.760
on_cpu 0.801 # 12.81 / 16 cores
utime 17614.523
stime 9.219
nvcsw 8620 # 6.50%
nivcsw 123986 # 93.50%
inblock 744 # 0.54/sec
onblock 804440 # 584.72/sec
cpu-clock 17625758069268 # 17625.758 seconds
task-clock 17625905653801 # 17625.906 seconds
page faults 4638836 # 263.183/sec
context switches 139046 # 7.889/sec
cpu migrations 4590 # 0.260/sec
major page faults 1 # 0.000/sec
minor page faults 4638835 # 263.183/sec
alignment faults 0 # 0.000/sec
emulation faults 0 # 0.000/sec
branches 419343046573 # 20.778 branches per 1000 inst
branch misses 4547421869 # 1.08% branch miss
conditional 419343085101 # 20.778 conditional branches per 1000 inst
indirect 67921372342 # 3.365 indirect branches per 1000 inst
slots 136495034123360 #
retiring 7814481448059 # 5.7% ( 5.7%) low
-- ucode 1698017378425 # 1.2%
-- fastpath 6116464069634 # 4.5%
frontend 5269661840749 # 3.9% ( 3.9%) low
-- latency 4422139728561 # 3.2%
-- bandwidth 847522112188 # 0.6%
backend 122695895540374 # 89.9% (89.9%) high
-- cpu 24290445256795 # 17.8%
-- memory 98405450283579 # 72.1%
speculation 1063790694054 # 0.8% ( 0.8%) low
-- branch mispredict 830124740747 # 0.6%
-- pipeline restart 233665953307 # 0.2%
smt-contention 0 # 0.0% ( 0.0%)
cpu-cycles 35539188290105 # 1.96 GHz
instructions 11454483768790 # 0.32 IPC low
l2 access 305084043503 # 38.065 l2 access per 1000 inst
l2 miss 118297250080 # 38.78% l2 miss
cpu-cycles 43478714436613 # 73.5% memory latency
load stalls 27281801869658 # 24.9% l1 bound
l1 miss 16441424261761 # 1.0% l2 bound
l2 miss 16022798500216 # 30.8% l3 bound
l3 miss 2650993519337 # 6.1% dram bound
store_stalls 4674882882278 # 10.8% store bound
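
The per-level percentages in the Intel memory breakdown are consistent with subtracting each stall counter from the one above it and dividing by cycles. The sketch below assumes that formula (it matches the printed numbers but is not taken from the tool's source), and the names are illustrative.

```python
# Counter values copied from the Intel memory-latency block above.
CYCLES = 43478714436613
LOAD_STALLS = 27281801869658
L1_MISS_STALLS = 16441424261761
L2_MISS_STALLS = 16022798500216
L3_MISS_STALLS = 2650993519337
STORE_STALLS = 4674882882278


def memory_breakdown():
    """Attribute stall cycles to each memory level (assumed formula)."""
    def pct(stall_cycles):
        return 100 * stall_cycles / CYCLES

    return {
        # Stalled on a load that did not miss L1.
        "l1_bound": pct(LOAD_STALLS - L1_MISS_STALLS),
        # Missed L1 but was served by L2.
        "l2_bound": pct(L1_MISS_STALLS - L2_MISS_STALLS),
        # Missed L2 but was served by L3.
        "l3_bound": pct(L2_MISS_STALLS - L3_MISS_STALLS),
        # Missed L3 and went to DRAM.
        "dram_bound": pct(L3_MISS_STALLS),
        "store_bound": pct(STORE_STALLS),
        # Total share of cycles stalled on loads or stores.
        "memory_latency": pct(LOAD_STALLS + STORE_STALLS),
    }


if __name__ == "__main__":
    for level, value in memory_breakdown().items():
        print(f"{level:15s} {value:5.1f}%")
```

With these inputs the L3-bound share comes out largest among the individual levels, in line with the annotation in the dump.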

The process overview shows a separate set of processes for each workload (48 each for lbm, mri-gridding, stencil and cutcp).
1086 processes
48 lbm 54292.96 24.48
48 mri-gridding 32846.56 11.68
48 stencil 6022.24 6.72
48 cutcp 1348.80 4.48
408 clinfo 98.72 36.61
42 python2 6.09 1.72
38 vulkaninfo 0.39 1.52
6 php 0.09 0.32
6 glxinfo:gdrv0 0.08 0.10
6 glxinfo:gl0 0.08 0.10
3 ld 0.05 0.03
4 vulkani:disk$0 0.04 0.16
6 clang 0.04 0.05
2 glxinfo 0.04 0.04
2 glxinfo:cs0 0.04 0.04
2 glxinfo:disk$0 0.04 0.04
2 glxinfo:sh0 0.04 0.04
2 glxinfo:shlo0 0.04 0.04
3 rocminfo 0.03 0.00
2 llvmpipe-0 0.02 0.08
2 llvmpipe-1 0.02 0.08
2 llvmpipe-10 0.02 0.08
2 llvmpipe-11 0.02 0.08
2 llvmpipe-12 0.02 0.08
2 llvmpipe-13 0.02 0.08
2 llvmpipe-14 0.02 0.08
2 llvmpipe-15 0.02 0.08
2 llvmpipe-2 0.02 0.08
2 llvmpipe-3 0.02 0.08
2 llvmpipe-4 0.02 0.08
2 llvmpipe-5 0.02 0.08
2 llvmpipe-6 0.02 0.08
2 llvmpipe-7 0.02 0.08
2 llvmpipe-8 0.02 0.08
2 llvmpipe-9 0.02 0.08
1 lspci 0.01 0.02
145 sh 0.00 0.00
60 make 0.00 0.00
30 parboil 0.00 0.00
13 gcc 0.00 0.00
12 gsettings 0.00 0.00
8 stat 0.00 0.00
8 systemd-detect- 0.00 0.00
6 llvm-link 0.00 0.00
5 phoronix-test-s 0.00 0.00
3 c++ 0.00 0.00
3 collect2 0.00 0.00
3 gmain 0.00 0.00
2 cc 0.00 0.00
2 lscpu 0.00 0.00
2 uname 0.00 0.00
2 which 0.00 0.00
2 xset 0.00 0.00
1 date 0.00 0.00
1 dconf worker 0.00 0.00
1 dirname 0.00 0.00
1 dmesg 0.00 0.00
1 dmidecode 0.00 0.00
1 grep 0.00 0.00
1 ifconfig 0.00 0.00
1 ip 0.00 0.00
1 lsmod 0.00 0.00
1 mktemp 0.00 0.00
1 ps 0.00 0.00
1 python 0.00 0.00
1 python3 0.00 0.00
1 qdbus 0.00 0.00
1 readlink 0.00 0.00
1 realpath 0.00 0.00
1 sed 0.00 0.00
1 sort 0.00 0.00
1 stty 0.00 0.00
1 systemctl 0.00 0.00
1 template.sh 0.00 0.00
1 wc 0.00 0.00
1 xrandr 0.00 0.00
0 processes running
47 maximum processes

An example computation block, in which 16 lbm processes run concurrently, one per CPU:
938107) parboil cpu=7 start=121.12 finish=193.82
938108) python2 cpu=8 start=121.12 finish=193.82
938109) make cpu=9 start=121.13 finish=121.14
938110) make cpu=10 start=121.14 finish=192.13
938111) lbm cpu=11 start=121.14 finish=192.13
938112) lbm cpu=13 start=121.14 finish=192.13
938113) lbm cpu=9 start=121.14 finish=192.13
938114) lbm cpu=4 start=121.14 finish=192.13
938115) lbm cpu=14 start=121.14 finish=192.13
938116) lbm cpu=7 start=121.14 finish=192.13
938117) lbm cpu=8 start=121.14 finish=192.13
938118) lbm cpu=10 start=121.14 finish=192.13
938119) lbm cpu=5 start=121.14 finish=192.13
938120) lbm cpu=12 start=121.14 finish=192.13
938121) lbm cpu=1 start=121.14 finish=192.13
938122) lbm cpu=15 start=121.14 finish=192.13
938123) lbm cpu=6 start=121.14 finish=192.13
938124) lbm cpu=0 start=121.14 finish=192.13
938125) lbm cpu=2 start=121.14 finish=192.13
938126) lbm cpu=3 start=121.14 finish=192.13
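
A block like the one above can be summarised with a few lines of parsing. This is a rough sketch that assumes each line has the form `id) name cpu=N start=S finish=F` with times in seconds; the regular expression, the field names, and the helpers `parse_block` and `summarize` are hypothetical and not part of the profiling tool.

```python
import re

# Three lines copied from the block above, used as sample input.
SAMPLE = """\
938110) make cpu=10 start=121.14 finish=192.13
938111) lbm cpu=11 start=121.14 finish=192.13
938112) lbm cpu=13 start=121.14 finish=192.13
"""

# Assumed line layout: "<id>) <name> cpu=<n> start=<sec> finish=<sec>".
LINE_RE = re.compile(
    r"^(\d+)\)\s+(.+?)\s+cpu=(\d+)\s+start=([\d.]+)\s+finish=([\d.]+)$"
)


def parse_block(text):
    """Return one dict per line that matches the assumed layout."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip blank or unrecognised lines
        ident, name, cpu, start, finish = match.groups()
        rows.append({
            "id": int(ident),
            "name": name,
            "cpu": int(cpu),
            "start": float(start),
            "finish": float(finish),
        })
    return rows


def summarize(rows, name):
    """Count entries with the given name, list their CPUs and longest run."""
    selected = [r for r in rows if r["name"] == name]
    return {
        "count": len(selected),
        "cpus": sorted({r["cpu"] for r in selected}),
        "max_duration": max(
            (r["finish"] - r["start"] for r in selected), default=0.0
        ),
    }


if __name__ == "__main__":
    print(summarize(parse_block(SAMPLE), "lbm"))
```

Run over the full block, a summary of this kind would show whether every lbm entry landed on a distinct CPU and how long each one ran.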
