XNNPACK is Google's library of high-efficiency floating-point neural network inference operators, used as a backend by other frameworks. The benchmark runs a sequence of nine operations, and these run on all cores.

The topdown profile is mostly backend bound, though the mix varies among the operations.

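AMD metrics show backend stalls averaging 71.7% of slots once SMT contention is excluded (51.4% of all slots), split about evenly between memory (26.4%) and core (25.0%). Frontend and speculation stalls are small. There is a moderate amount of floating point, about 72 operations per 1000 instructions, almost all of it 128-bit.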
elapsed 931.826
on_cpu 0.959 # 15.34 / 16 cores
utime 14257.581
stime 33.219
nvcsw 10099 # 9.45%
nivcsw 96767 # 90.55%
inblock 8 # 0.01/sec
onblock 17472 # 18.75/sec
cpu-clock 14294818899155 # 14294.819 seconds
task-clock 14294890468287 # 14294.890 seconds
page faults 944654 # 66.083/sec
context switches 111341 # 7.789/sec
cpu migrations 285 # 0.020/sec
major page faults 2 # 0.000/sec
minor page faults 944652 # 66.083/sec
alignment faults 0 # 0.000/sec
emulation faults 0 # 0.000/sec
branches 2082975726202 # 39.545 branches per 1000 inst
branch misses 15827594675 # 0.76% branch miss
conditional 1918405492130 # 36.420 conditional branches per 1000 inst
indirect 38210190153 # 0.725 indirect branches per 1000 inst
cpu-cycles 56773076641271 # 3.81 GHz
instructions 52309348452668 # 0.92 IPC
slots 113599063615566 #
retiring 19097588418675 # 16.8% (23.5%)
-- ucode 66616981438 # 0.1%
-- fastpath 19030971437237 # 16.8%
frontend 3440658672312 # 3.0% ( 4.2%) low
-- latency 1913602394898 # 1.7%
-- bandwidth 1527056277414 # 1.3%
backend 58361692315514 # 51.4% (71.7%) high
-- cpu 28346810089652 # 25.0%
-- memory 30014882225862 # 26.4%
speculation 497091553516 # 0.4% ( 0.6%) low
-- branch mispredict 310968324668 # 0.3%
-- pipeline restart 186123228848 # 0.2%
smt-contention 32201569235059 # 28.3% ( 0.0%)
cpu-cycles 57035182888061 # 3.81 GHz
instructions 52381979244633 # 0.92 IPC
instructions 17467779825069 # 61.298 l2 access per 1000 inst
l2 hit from l1 659001928214 # 26.58% l2 miss
l2 miss from l1 71983136525 #
l2 hit from l2 pf 199094421089 #
l3 hit from l2 pf 195194100072 #
l3 miss from l2 pf 17449215935 #
instructions 17446199487461 # 72.488 float per 1000 inst
float 512 62 # 0.000 AVX-512 per 1000 inst
float 256 10591429521 # 0.607 AVX-256 per 1000 inst
float 128 1254051102509 # 71.881 AVX-128 per 1000 inst
float MMX 0 # 0.000 MMX per 1000 inst
float scalar 112 # 0.000 scalar per 1000 inst
instructions 52817290218833 #
opcache 5292338275116 # 100.201 opcache per 1000 inst
opcache miss 117763496128 # 2.2% opcache miss rate
l1 dTLB miss 23326497562 # 0.442 L1 dTLB per 1000 inst
l2 dTLB miss 4634341166 # 0.088 L2 dTLB per 1000 inst
instructions 52695988373276 #
icache 164780128406 # 3.127 icache per 1000 inst
icache miss 20642440619 # 12.5% icache miss rate
l1 iTLB miss 7388292 # 0.000 L1 iTLB per 1000 inst
l2 iTLB miss 0 # 0.000 L2 iTLB per 1000 inst
tlb flush 70176 # 0.000 TLB flush per 1000 inst
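
The parenthesized percentages in the topdown output appear to be the same slot counts renormalized after removing the SMT-contention slots; the counters above are consistent with that reading. A minimal sketch of the arithmetic, assuming that interpretation (the name `topdown_shares` is illustrative and not part of any profiling tool):

```python
# Slot counts copied from the AMD topdown output above.
slots = 113599063615566
smt_contention = 32201569235059
counts = {
    "retiring": 19097588418675,
    "frontend": 3440658672312,
    "backend": 58361692315514,
    "speculation": 497091553516,
}


def topdown_shares(counts, slots, smt_contention):
    """Return {category: (percent of all slots, percent of non-SMT slots)}.

    Assumption: the second figure divides by slots minus SMT contention.
    """
    usable = slots - smt_contention
    return {
        name: (100.0 * value / slots, 100.0 * value / usable)
        for name, value in counts.items()
    }


shares = topdown_shares(counts, slots, smt_contention)
for name, (raw, adjusted) in shares.items():
    print(f"{name:12s} {raw:5.1f}% ({adjusted:5.1f}%)")
```

With these inputs the backend row comes out at 51.4% of all slots and 71.7% of the non-SMT slots, matching the figures reported by the tool.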
Intel metrics show that the largest shares of memory stalls are L1 bound (16.1%) and then L3 bound (9.9%).
elapsed 970.960
on_cpu 0.954 # 15.26 / 16 cores
utime 14796.665
stime 16.394
nvcsw 8948 # 1.51%
nivcsw 582911 # 98.49%
inblock 392 # 0.40/sec
onblock 5968 # 6.15/sec
cpu-clock 14816772460134 # 14816.772 seconds
task-clock 14816918352865 # 14816.918 seconds
page faults 792902 # 53.513/sec
context switches 596500 # 40.258/sec
cpu migrations 5685 # 0.384/sec
major page faults 0 # 0.000/sec
minor page faults 792902 # 53.513/sec
alignment faults 0 # 0.000/sec
emulation faults 0 # 0.000/sec
branches 1460766975019 # 36.514 branches per 1000 inst
branch misses 9973175448 # 0.68% branch miss
conditional 1460767229579 # 36.514 conditional branches per 1000 inst
indirect 486302274617 # 12.156 indirect branches per 1000 inst
slots 70348736732066 #
retiring 26079489676296 # 37.1% (37.1%)
-- ucode 3101142779251 # 4.4%
-- fastpath 22978346897045 # 32.7%
frontend 8773456230690 # 12.5% (12.5%)
-- latency 7397857194056 # 10.5%
-- bandwidth 1375599036634 # 2.0%
backend 34824709912208 # 49.5% (49.5%)
-- cpu 19735277735638 # 28.1%
-- memory 15089432176570 # 21.4%
speculation 696075894965 # 1.0% ( 1.0%) low
-- branch mispredict 600904200955 # 0.9%
-- pipeline restart 95171694010 # 0.1%
smt-contention 0 # 0.0% ( 0.0%)
cpu-cycles 70960300997299 # 1.54 GHz
instructions 71462354213740 # 1.01 IPC
l2 access 1915667603277 # 27.541 l2 access per 1000 inst
l2 miss 749955564088 # 39.15% l2 miss
cpu-cycles 23457104488956 # 31.0% memory latency
load stalls 7146853814664 # 16.1% l1 bound
l1 miss 3377786328787 # 2.6% l2 bound
l2 miss 2773148184220 # 9.9% l3 bound
l3 miss 451171967163 # 1.9% dram bound
store_stalls 125913079992 # 0.5% store bound
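
The `on_cpu` and IPC figures in both runs can be reproduced from the raw counters. The sketch below assumes `on_cpu` is (utime + stime) / elapsed divided by the core count, which matches the reported values but is inferred from the numbers rather than taken from the tool; the function names are illustrative.

```python
def on_cpu(utime, stime, elapsed, cores):
    """Return (busy cores, fraction of all cores busy) over the run."""
    busy = (utime + stime) / elapsed
    return busy, busy / cores


def ipc(instructions, cycles):
    """Instructions retired per CPU cycle."""
    return instructions / cycles


# Values copied from the two counter dumps above.
amd_busy, amd_frac = on_cpu(14257.581, 33.219, 931.826, 16)
intel_busy, intel_frac = on_cpu(14796.665, 16.394, 970.960, 16)
amd_ipc = ipc(52309348452668, 56773076641271)
intel_ipc = ipc(71462354213740, 70960300997299)

print(f"AMD   on_cpu {amd_frac:.3f} ({amd_busy:.2f} / 16 cores), IPC {amd_ipc:.2f}")
print(f"Intel on_cpu {intel_frac:.3f} ({intel_busy:.2f} / 16 cores), IPC {intel_ipc:.2f}")
```

Both runs keep a little over 15 of the 16 cores busy, so the difference in elapsed time is not explained by idle cores.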
The process profile shows most time spent in the end2end-bench driver, which accounts for roughly 8000 invocations (8133 of the 8439 entries).
8439 processes
8133 end2end-bench 18530586.84 38821.55
36 clinfo 4.11 2.24
38 vulkaninfo 1.32 1.14
4 vulkani:disk$0 0.14 0.12
6 php 0.09 0.12
2 llvmpipe-0 0.07 0.06
2 llvmpipe-1 0.07 0.06
2 llvmpipe-10 0.07 0.06
2 llvmpipe-11 0.07 0.06
2 llvmpipe-12 0.07 0.06
2 llvmpipe-13 0.07 0.06
2 llvmpipe-14 0.07 0.06
2 llvmpipe-15 0.07 0.06
2 llvmpipe-2 0.07 0.06
2 llvmpipe-3 0.07 0.06
2 llvmpipe-4 0.07 0.06
2 llvmpipe-5 0.07 0.06
2 llvmpipe-6 0.07 0.06
2 llvmpipe-7 0.07 0.06
2 llvmpipe-8 0.07 0.06
2 llvmpipe-9 0.07 0.06
6 clang 0.06 0.06
3 rocminfo 0.03 0.00
1 lspci 0.00 0.02
85 sh 0.00 0.00
13 gcc 0.00 0.00
8 gsettings 0.00 0.00
8 stat 0.00 0.00
8 systemd-detect- 0.00 0.00
6 llvm-link 0.00 0.00
5 glxinfo 0.00 0.00
5 gmain 0.00 0.00
5 phoronix-test-s 0.00 0.00
3 dconf worker 0.00 0.00
3 xnnpack 0.00 0.00
2 cc 0.00 0.00
2 dmesg 0.00 0.00
2 grep 0.00 0.00
2 lscpu 0.00 0.00
2 setterm 0.00 0.00
2 uname 0.00 0.00
2 which 0.00 0.00
1 date 0.00 0.00
1 dirname 0.00 0.00
1 dmidecode 0.00 0.00
1 ifconfig 0.00 0.00
1 ip 0.00 0.00
1 lsmod 0.00 0.00
1 mktemp 0.00 0.00
1 ps 0.00 0.00
1 qdbus 0.00 0.00
1 readlink 0.00 0.00
1 realpath 0.00 0.00
1 sed 0.00 0.00
1 sort 0.00 0.00
1 stty 0.00 0.00
1 systemctl 0.00 0.00
1 template.sh 0.00 0.00
The process timeline shows many short-lived driver calls, launched in batches that start and finish together across the cores.
1133056) xnnpack cpu=4 start=5.23 finish=312.52
1133057) end2end-bench cpu=0 start=5.24 finish=312.48
1133058) end2end-bench cpu=10 start=5.24 finish=5.33
1133059) end2end-bench cpu=5 start=5.24 finish=5.33
1133060) end2end-bench cpu=6 start=5.24 finish=5.33
1133061) end2end-bench cpu=15 start=5.24 finish=5.33
1133062) end2end-bench cpu=0 start=5.24 finish=5.33
1133063) end2end-bench cpu=11 start=5.24 finish=5.33
1133064) end2end-bench cpu=4 start=5.24 finish=5.33
1133065) end2end-bench cpu=13 start=5.24 finish=5.33
1133066) end2end-bench cpu=1 start=5.24 finish=5.33
1133067) end2end-bench cpu=2 start=5.24 finish=5.33
1133068) end2end-bench cpu=14 start=5.24 finish=5.33
1133069) end2end-bench cpu=7 start=5.24 finish=5.33
1133070) end2end-bench cpu=8 start=5.24 finish=5.33
1133071) end2end-bench cpu=3 start=5.24 finish=5.33
1133072) end2end-bench cpu=12 start=5.24 finish=5.33
1133073) end2end-bench cpu=0 start=5.34 finish=5.44
1133074) end2end-bench cpu=6 start=5.34 finish=5.44
1133075) end2end-bench cpu=15 start=5.34 finish=5.44
1133076) end2end-bench cpu=9 start=5.34 finish=5.44
1133077) end2end-bench cpu=12 start=5.34 finish=5.44
1133078) end2end-bench cpu=13 start=5.34 finish=5.44
1133079) end2end-bench cpu=2 start=5.34 finish=5.43
1133080) end2end-bench cpu=3 start=5.34 finish=5.43
1133081) end2end-bench cpu=14 start=5.34 finish=5.43
1133082) end2end-bench cpu=7 start=5.34 finish=5.43
1133083) end2end-bench cpu=8 start=5.34 finish=5.43
1133084) end2end-bench cpu=5 start=5.34 finish=5.43
1133085) end2end-bench cpu=1 start=5.34 finish=5.43
1133086) end2end-bench cpu=10 start=5.34 finish=5.43
1133087) end2end-bench cpu=4 start=5.34 finish=5.43
1133088) end2end-bench cpu=4 start=5.44 finish=5.73
1133089) end2end-bench cpu=6 start=5.44 finish=5.73
1133090) end2end-bench cpu=9 start=5.44 finish=5.73
1133091) end2end-bench cpu=2 start=5.44 finish=5.73
1133092) end2end-bench cpu=0 start=5.44 finish=5.73
1133093) end2end-bench cpu=15 start=5.44 finish=5.73
1133094) end2end-bench cpu=13 start=5.44 finish=5.73
1133095) end2end-bench cpu=3 start=5.44 finish=5.73
1133096) end2end-bench cpu=5 start=5.44 finish=5.73
1133097) end2end-bench cpu=14 start=5.44 finish=5.73
1133098) end2end-bench cpu=1 start=5.44 finish=5.73
1133099) end2end-bench cpu=10 start=5.44 finish=5.73
1133100) end2end-bench cpu=8 start=5.44 finish=5.73
1133101) end2end-bench cpu=7 start=5.44 finish=5.73
1133102) end2end-bench cpu=12 start=5.44 finish=5.73
1133103) end2end-bench cpu=6 start=5.73 finish=8.37
1133104) end2end-bench cpu=0 start=5.73 finish=8.37
1133105) end2end-bench cpu=9 start=5.73 finish=8.37
1133106) end2end-bench cpu=13 start=5.73 finish=8.37
...
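
To see the batching more directly, the timeline rows can be parsed and grouped by start time. This is a minimal sketch that assumes the `pid) name cpu=N start=T finish=T` layout shown above; the helper names `parse_timeline` and `batches_by_start` are hypothetical, and the sample is a handful of rows copied from the listing.

```python
import re
from collections import defaultdict

# One timeline row: "<pid>) <name> cpu=<n> start=<sec> finish=<sec>"
LINE_RE = re.compile(
    r"^(?P<pid>\d+)\)\s+(?P<name>.+?)\s+cpu=(?P<cpu>\d+)\s+"
    r"start=(?P<start>[\d.]+)\s+finish=(?P<finish>[\d.]+)$"
)


def parse_timeline(text):
    """Parse timeline rows, skipping lines (such as '...') that do not match."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue
        start, finish = float(m["start"]), float(m["finish"])
        rows.append({
            "pid": int(m["pid"]),
            "name": m["name"],
            "cpu": int(m["cpu"]),
            "start": start,
            "finish": finish,
            "duration": finish - start,
        })
    return rows


def batches_by_start(rows):
    """Group rows that share a start time, ordered by start time."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["start"]].append(row)
    return dict(sorted(groups.items()))


sample = """\
1133057) end2end-bench cpu=0 start=5.24 finish=312.48
1133058) end2end-bench cpu=10 start=5.24 finish=5.33
1133059) end2end-bench cpu=5 start=5.24 finish=5.33
1133073) end2end-bench cpu=0 start=5.34 finish=5.44
1133079) end2end-bench cpu=2 start=5.34 finish=5.43
1133088) end2end-bench cpu=4 start=5.44 finish=5.73
...
"""

rows = parse_timeline(sample)
batches = batches_by_start(rows)
for start, members in batches.items():
    longest = max(r["duration"] for r in members)
    print(f"start={start:.2f} entries={len(members)} longest={longest:.2f}s")
```

Grouping by start time rather than by PID keeps the long-lived entry (1133057, which runs until 312.48) in the same batch as the short calls it starts alongside.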
