libxsmm computes dense and sparse matrix operations. The benchmark runs four workloads with different characteristics, as shown below. In general, however, it is backend/memory bound, with few frontend or speculation stalls.

AMD metrics show a backend/memory-bound application (72.0% of slots backend, 53.1% memory) with L2 misses, a moderate amount of floating point, and few branches or speculation stalls.
elapsed 1200.458
on_cpu 0.717 # 11.47 / 16 cores
utime 13718.215
stime 45.335
nvcsw 4385 # 2.62%
nivcsw 162934 # 97.38%
inblock 1000 # 0.83/sec
onblock 3960 # 3.30/sec
cpu-clock 13768026176988 # 13768.026 seconds
task-clock 13768338206084 # 13768.338 seconds
page faults 8547672 # 620.821/sec
context switches 173105 # 12.573/sec
cpu migrations 4624 # 0.336/sec
major page faults 5 # 0.000/sec
minor page faults 8547667 # 620.821/sec
alignment faults 0 # 0.000/sec
emulation faults 0 # 0.000/sec
branches 1857322849343 # 34.536 branches per 1000 inst
branch misses 9198990418 # 0.50% branch miss
conditional 1717674980350 # 31.939 conditional branches per 1000 inst
indirect 28717902408 # 0.534 indirect branches per 1000 inst
cpu-cycles 59079522725954 # 3.07 GHz
instructions 53707353318899 # 0.91 IPC
slots 118146870952488 #
retiring 18332446411746 # 15.5% (16.9%)
-- ucode 26086845722 # 0.0%
-- fastpath 18306359566024 # 15.5%
frontend 3910888575373 # 3.3% ( 3.6%)
-- latency 3151894748598 # 2.7%
-- bandwidth 758993826775 # 0.6%
backend 85105805287604 # 72.0% (78.5%)
-- cpu 22326738329569 # 18.9%
-- memory 62779066958035 # 53.1%
speculation 1089519233419 # 0.9% ( 1.0%)
-- branch mispredict 373902864471 # 0.3%
-- pipeline restart 715616368948 # 0.6%
smt-contention 9708155311998 # 8.2% ( 0.0%)
cpu-cycles 59091290358687 # 3.06 GHz
instructions 53737511307216 # 0.91 IPC
instructions 17913341548162 # 131.561 l2 access per 1000 inst
l2 hit from l1 1643968275128 # 11.23% l2 miss
l2 miss from l1 106129979128 #
l2 hit from l2 pf 554249205024 #
l3 hit from l2 pf 124053766622 #
l3 miss from l2 pf 34419515356 #
instructions 17910401741829 # 84.494 float per 1000 inst
float 512 43 # 0.000 AVX-512 per 1000 inst
float 256 1008 # 0.000 AVX-256 per 1000 inst
float 128 1513315108501 # 84.494 AVX-128 per 1000 inst
float MMX 0 # 0.000 MMX per 1000 inst
float scalar 0 # 0.000 scalar per 1000 inst
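The topdown percentages in the AMD listing can be reproduced from the raw slot counts. The following is a minimal sketch, not the profiler's own code: the function name topdown_share is hypothetical, and it assumes that the parenthesized figure is the share of slots once smt-contention slots are excluded, an interpretation inferred only from the fact that the numbers match.

```python
def topdown_share(count, slots, smt_contention=0):
    """Return (raw %, % of slots excluding SMT contention).

    Assumption: the parenthesized value in the listing is computed
    against slots minus smt-contention slots.
    """
    raw = 100.0 * count / slots
    adjusted = 100.0 * count / (slots - smt_contention)
    return raw, adjusted


# Slot counts copied from the AMD listing.
SLOTS = 118146870952488
SMT_CONTENTION = 9708155311998
categories = {
    "retiring": 18332446411746,
    "frontend": 3910888575373,
    "backend": 85105805287604,
    "speculation": 1089519233419,
}

for name, count in categories.items():
    raw, adjusted = topdown_share(count, SLOTS, SMT_CONTENTION)
    print(f"{name:12s} {raw:5.1f}% ({adjusted:5.1f}%)")
```

With smt_contention left at 0 the two values coincide, which is consistent with the Intel listing, where both columns are identical.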
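Intel metrics show a similar backend/memory-bound profile, with a larger frontend share (12.0%) and a higher L2 miss rate (16.53%) than on AMD.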
elapsed 2522.838
on_cpu 0.682 # 10.91 / 16 cores
utime 26885.493
stime 633.641
nvcsw 155641175 # 99.83%
nivcsw 269976 # 0.17%
inblock 7240 # 2.87/sec
onblock 4072 # 1.61/sec
cpu-clock 27423281062373 # 27423.281 seconds
task-clock 27447874273562 # 27447.874 seconds
page faults 12043704 # 438.785/sec
context switches 155923601 # 5680.717/sec
cpu migrations 349702 # 12.741/sec
major page faults 97 # 0.004/sec
minor page faults 12043607 # 438.781/sec
alignment faults 0 # 0.000/sec
emulation faults 0 # 0.000/sec
branches 1325105869957 # 18.401 branches per 1000 inst
branch misses 6973190243 # 0.53% branch miss
conditional 1325105891109 # 18.401 conditional branches per 1000 inst
indirect 568603144299 # 7.896 indirect branches per 1000 inst
slots 113834189052350 #
retiring 34522415093139 # 30.3% (30.3%)
-- ucode 795046442826 # 0.7%
-- fastpath 33727368650313 # 29.6%
frontend 13659591684925 # 12.0% (12.0%)
-- latency 12302798991273 # 10.8%
-- bandwidth 1356792693652 # 1.2%
backend 64978302757717 # 57.1% (57.1%)
-- cpu 16016754739937 # 14.1%
-- memory 48961548017780 # 43.0%
speculation 858815841121 # 0.8% ( 0.8%)
-- branch mispredict 571923334820 # 0.5%
-- pipeline restart 286892506301 # 0.3%
smt-contention 0 # 0.0% ( 0.0%)
cpu-cycles 76217336930148 # 1.98 GHz
instructions 83494508679423 # 1.10 IPC
l2 access 3360799226144 # 103.904 l2 access per 1000 inst
l2 miss 555528752845 # 16.53% l2 miss
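The on_cpu and IPC figures for both platforms follow from a few of the raw values. The sketch below is illustrative only: the helper names cpu_utilization and ipc are hypothetical, and the on_cpu formula, (utime + stime) / elapsed divided by the core count, is an assumption inferred from the reported numbers rather than taken from the profiler's source.

```python
def cpu_utilization(utime, stime, elapsed, cores):
    """Return (average busy cores, fraction of all cores in use)."""
    busy_cores = (utime + stime) / elapsed
    return busy_cores, busy_cores / cores


def ipc(instructions, cycles):
    """Instructions retired per CPU cycle."""
    return instructions / cycles


# Values copied from the AMD and Intel listings; both report 16 cores.
runs = {
    "AMD": dict(utime=13718.215, stime=45.335, elapsed=1200.458,
                instructions=53707353318899, cycles=59079522725954),
    "Intel": dict(utime=26885.493, stime=633.641, elapsed=2522.838,
                  instructions=83494508679423, cycles=76217336930148),
}

for name, r in runs.items():
    busy, frac = cpu_utilization(r["utime"], r["stime"], r["elapsed"], cores=16)
    print(f"{name:6s} on_cpu {frac:.3f} ({busy:.2f} / 16 cores), "
          f"IPC {ipc(r['instructions'], r['cycles']):.2f}")
```

Both runs keep roughly 11 of 16 cores busy on average, so the difference in elapsed time is not explained by utilization alone.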
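The process structure is straightforward: the specialized entries dominate, and the remaining processes are short-lived setup and system-information tools.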
560 processes
192 specialized 219147.73 667.84
64 clinfo 10.56 3.52
38 vulkaninfo 0.96 0.95
6 php 0.17 0.34
6 glxinfo:gdrv0 0.12 0.07
4 vulkani:disk$0 0.11 0.10
2 llvmpipe-0 0.06 0.05
2 llvmpipe-1 0.06 0.05
2 llvmpipe-10 0.06 0.05
2 llvmpipe-11 0.06 0.05
2 llvmpipe-12 0.06 0.05
2 llvmpipe-13 0.06 0.05
2 llvmpipe-14 0.06 0.05
2 llvmpipe-15 0.06 0.05
2 llvmpipe-2 0.06 0.05
2 llvmpipe-3 0.06 0.05
2 llvmpipe-4 0.06 0.05
2 llvmpipe-5 0.06 0.05
2 llvmpipe-6 0.06 0.05
2 llvmpipe-7 0.06 0.05
2 llvmpipe-8 0.06 0.05
2 llvmpipe-9 0.06 0.05
2 glxinfo 0.06 0.03
2 glxinfo:cs0 0.06 0.03
2 glxinfo:disk$0 0.06 0.03
2 glxinfo:sh0 0.06 0.03
2 glxinfo:shlo0 0.06 0.03
6 clang 0.02 0.07
1 lspci 0.01 0.03
95 sh 0.00 0.00
13 gcc 0.00 0.00
12 libxsmm 0.00 0.00
9 stty 0.00 0.00
8 stat 0.00 0.00
8 systemd-detect- 0.00 0.00
7 gsettings 0.00 0.00
6 llvm-link 0.00 0.00
5 gmain 0.00 0.00
5 phoronix-test-s 0.00 0.00
4 dconf worker 0.00 0.00
2 cc 0.00 0.00
2 lscpu 0.00 0.00
2 uname 0.00 0.00
2 which 0.00 0.00
2 xset 0.00 0.00
1 date 0.00 0.00
1 dirname 0.00 0.00
1 dmesg 0.00 0.00
1 dmidecode 0.00 0.00
1 grep 0.00 0.00
1 ifconfig 0.00 0.00
1 ip 0.00 0.00
1 lsmod 0.00 0.00
1 mktemp 0.00 0.00
1 ps 0.00 0.00
1 qdbus 0.00 0.00
1 readlink 0.00 0.00
1 realpath 0.00 0.00
1 sed 0.00 0.00
1 sort 0.00 0.00
1 systemctl 0.00 0.00
1 template.sh 0.00 0.00
1 wc 0.00 0.00
1 xrandr 0.00 0.00
0 processes running
47 maximum processes
The computation runs in parallel on all cores:
417620) libxsmm cpu=8 start=5.70 finish=118.89
417621) specialized cpu=15 start=5.70 finish=118.89
417622) specialized cpu=1 start=5.70 finish=118.89
417623) specialized cpu=3 start=5.70 finish=118.89
417624) specialized cpu=14 start=5.70 finish=118.89
417625) specialized cpu=13 start=5.70 finish=118.89
417626) specialized cpu=4 start=5.70 finish=118.89
417627) specialized cpu=10 start=5.70 finish=118.89
417628) specialized cpu=8 start=5.70 finish=118.89
417629) specialized cpu=4 start=5.71 finish=118.89
417630) specialized cpu=12 start=5.71 finish=118.89
417631) specialized cpu=3 start=5.71 finish=118.89
417632) specialized cpu=14 start=5.71 finish=118.89
417633) specialized cpu=5 start=5.71 finish=118.89
417634) specialized cpu=11 start=5.71 finish=118.89
417635) specialized cpu=8 start=5.71 finish=118.89
417636) specialized cpu=10 start=5.71 finish=118.89
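Timeline lines in this format are easy to post-process, for example to check how long each worker ran and how the workers are spread across CPUs. This is a minimal sketch under the assumption that every line follows the "pid) name cpu=N start=S finish=F" layout shown above; parse_timeline and LINE_RE are hypothetical names, not part of any existing tool.

```python
import re
from collections import Counter

# Matches lines such as: 417621) specialized cpu=15 start=5.70 finish=118.89
LINE_RE = re.compile(
    r"(?P<pid>\d+)\)\s+(?P<name>\S+)\s+cpu=(?P<cpu>\d+)"
    r"\s+start=(?P<start>[\d.]+)\s+finish=(?P<finish>[\d.]+)"
)


def parse_timeline(text):
    """Parse timeline lines into dicts; lines that do not match are skipped."""
    records = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue
        start, finish = float(m["start"]), float(m["finish"])
        records.append({
            "pid": int(m["pid"]),
            "name": m["name"],
            "cpu": int(m["cpu"]),
            "start": start,
            "finish": finish,
            "duration": finish - start,
        })
    return records


# A few lines copied from the timeline above.
sample = """\
417620) libxsmm cpu=8 start=5.70 finish=118.89
417621) specialized cpu=15 start=5.70 finish=118.89
417628) specialized cpu=8 start=5.70 finish=118.89
417629) specialized cpu=4 start=5.71 finish=118.89
"""

records = parse_timeline(sample)
per_cpu = Counter(r["cpu"] for r in records)
for r in records:
    print(f"{r['pid']} {r['name']:12s} cpu={r['cpu']:2d} ran {r['duration']:.2f}s")
print("entries per cpu:", dict(per_cpu))
```

The cpu field is a single recorded value per entry, so several entries sharing a CPU number in the listing does not by itself show that they ran there for the whole interval.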
