TNN is an open source deep learning framework from Tencent. The benchmark has four workloads, all run on the CPU: densenet, mobilenet, squeezenet v2 and squeezenet v1.1. The densenet workload runs on all cores, while the other workloads appear to be single-threaded.

The topdown profile also differs between the benchmarks. The general theme, however, is that all of them are dominated by backend stalls and have mostly low levels of frontend stalls, except during transitions.

The AMD metrics show relatively few floating point instructions and about 50 L2 accesses per 1000 instructions.
elapsed 739.513
on_cpu 0.302 # 4.82 / 16 cores
utime 3560.039
stime 7.444
nvcsw 223692 # 87.59%
nivcsw 31706 # 12.41%
inblock 0 # 0.00/sec
onblock 13776 # 18.63/sec
cpu-clock 3564316007029 # 3564.316 seconds
task-clock 3564837738544 # 3564.838 seconds
page faults 228836 # 64.193/sec
context switches 258899 # 72.626/sec
cpu migrations 814 # 0.228/sec
major page faults 3 # 0.001/sec
minor page faults 228833 # 64.192/sec
alignment faults 0 # 0.000/sec
emulation faults 0 # 0.000/sec
branches 4196404420376 # 189.470 branches per 1000 inst
branch misses 3237305237 # 0.08% branch miss
conditional 3022374000895 # 136.462 conditional branches per 1000 inst
indirect 305139741479 # 13.777 indirect branches per 1000 inst
cpu-cycles 14722665973061 # 1.25 GHz
instructions 22146096055369 # 1.50 IPC
slots 29444273560320 #
retiring 7301956221892 # 24.8% (35.8%)
-- ucode 40904035395 # 0.1%
-- fastpath 7261052186497 # 24.7%
frontend 1517745054634 # 5.2% ( 7.4%)
-- latency 419893118628 # 1.4%
-- bandwidth 1097851936006 # 3.7%
backend 11560250031251 # 39.3% (56.6%)
-- cpu 5778461213660 # 19.6%
-- memory 5781788817591 # 19.6%
speculation 39744504863 # 0.1% ( 0.2%) low
-- branch mispredict 36045076797 # 0.1%
-- pipeline restart 3699428066 # 0.0%
smt-contention 9024536450254 # 30.6% ( 0.0%)
cpu-cycles 14721801786379 # 1.24 GHz
instructions 22187875339042 # 1.51 IPC
instructions 7369246766729 # 51.188 l2 access per 1000 inst
l2 hit from l1 234221533537 # 0.97% l2 miss
l2 miss from l1 1817504517 #
l2 hit from l2 pf 141169526887 #
l3 hit from l2 pf 1747943816 #
l3 miss from l2 pf 78990687 #
instructions 7382349757638 # 48.622 float per 1000 inst
float 512 65 # 0.000 AVX-512 per 1000 inst
float 256 770 # 0.000 AVX-256 per 1000 inst
float 128 358941089279 # 48.622 AVX-128 per 1000 inst
float MMX 0 # 0.000 MMX per 1000 inst
float scalar 0 # 0.000 scalar per 1000 inst
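
The AMD topdown lines above carry two percentages each. A minimal sketch of how they seem to relate, assuming (this is inferred from the numbers, not documented here) that the first value is the share of all slots and the parenthesized value is the share of slots left after removing smt-contention. The function name `topdown_shares` is hypothetical and is not part of the profiling tool.

```python
# Slot counts copied from the AMD topdown output above.
slots = 29444273560320
counts = {
    "retiring": 7301956221892,
    "frontend": 1517745054634,
    "backend": 11560250031251,
    "speculation": 39744504863,
    "smt-contention": 9024536450254,
}


def topdown_shares(slots, counts, exclude="smt-contention"):
    """Return {category: (raw_pct, adjusted_pct)}.

    raw_pct is the count as a percentage of all slots.
    adjusted_pct is the count as a percentage of the slots that remain
    after subtracting the excluded category (assumed interpretation of
    the parenthesized column); the excluded category itself gets 0.0.
    """
    usable = slots - counts.get(exclude, 0)
    shares = {}
    for name, count in counts.items():
        raw = 100.0 * count / slots
        adjusted = 0.0 if name == exclude else 100.0 * count / usable
        shares[name] = (raw, adjusted)
    return shares


if __name__ == "__main__":
    for name, (raw, adjusted) in topdown_shares(slots, counts).items():
        print(f"{name:15s} {raw:5.1f}% ({adjusted:5.1f}%)")
```

Under that assumption the sketch reproduces the reported pairs, for example 24.8% (35.8%) for retiring and 39.3% (56.6%) for backend, which makes the AMD parenthesized column directly comparable with the Intel run, where smt-contention is zero.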
Intel metrics
elapsed 957.141
on_cpu 0.418 # 6.68 / 16 cores
utime 6393.282
stime 4.919
nvcsw 205201 # 78.38%
nivcsw 56603 # 21.62%
inblock 328 # 0.34/sec
onblock 2664 # 2.78/sec
cpu-clock 6393687302717 # 6393.687 seconds
task-clock 6393965412585 # 6393.965 seconds
page faults 232344 # 36.338/sec
context switches 266387 # 41.662/sec
cpu migrations 40747 # 6.373/sec
major page faults 2 # 0.000/sec
minor page faults 232342 # 36.338/sec
alignment faults 0 # 0.000/sec
emulation faults 0 # 0.000/sec
branches 5392050354022 # 188.279 branches per 1000 inst
branch misses 5089990262 # 0.09% branch miss
conditional 5392050370150 # 188.279 conditional branches per 1000 inst
indirect 1477539033770 # 51.592 indirect branches per 1000 inst
slots 40379208406658 #
retiring 15114663323800 # 37.4% (37.4%)
-- ucode 1079718189390 # 2.7%
-- fastpath 14034945134410 # 34.8%
frontend 2281353403271 # 5.6% ( 5.6%)
-- latency 1328892806649 # 3.3%
-- bandwidth 952460596622 # 2.4%
backend 22305036556297 # 55.2% (55.2%)
-- cpu 20253897577472 # 50.2%
-- memory 2051138978825 # 5.1%
speculation 428791222988 # 1.1% ( 1.1%)
-- branch mispredict 335441759078 # 0.8%
-- pipeline restart 93349463910 # 0.2%
smt-contention 0 # 0.0% ( 0.0%)
cpu-cycles 16045151722351 # 1.32 GHz
instructions 22585698220732 # 1.41 IPC
l2 access 408311732315 # 30.691 l2 access per 1000 inst
l2 miss 9044196543 # 2.22% l2 miss
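
Some of the derived values in the two metric dumps can be checked by hand. The sketch below recomputes on_cpu and IPC from the raw counters, assuming on_cpu is (utime + stime) / elapsed divided by the 16 cores, and IPC is instructions / cpu-cycles from the same counter group. The helper names are hypothetical. The per-1000-instruction rates are left out because several of them appear to use instruction counts from other counter groups that are not all shown.

```python
def on_cpu(utime, stime, elapsed, cores):
    """Return (fraction of the machine kept busy, average busy cores)."""
    busy_cores = (utime + stime) / elapsed
    return busy_cores / cores, busy_cores


def ipc(instructions, cycles):
    """Instructions retired per CPU cycle."""
    return instructions / cycles


# Values copied from the AMD and Intel dumps above.
runs = {
    "AMD": dict(elapsed=739.513, utime=3560.039, stime=7.444,
                cycles=14722665973061, instructions=22146096055369),
    "Intel": dict(elapsed=957.141, utime=6393.282, stime=4.919,
                  cycles=16045151722351, instructions=22585698220732),
}

if __name__ == "__main__":
    for name, r in runs.items():
        share, busy = on_cpu(r["utime"], r["stime"], r["elapsed"], cores=16)
        print(f"{name}: on_cpu {share:.3f} ({busy:.2f} / 16 cores), "
              f"IPC {ipc(r['instructions'], r['cycles']):.2f}")
```

Both runs keep well under half of the 16 cores busy on average, which fits one workload using all cores and the rest running on a single thread.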
The process overview shows almost all of the time in TNNTest.
468 processes
102 TNNTest 56099.20 83.22
68 clinfo 19.17 7.99
38 vulkaninfo 1.50 1.33
4 vulkani:disk$0 0.15 0.14
6 glxinfo:gdrv0 0.15 0.07
6 glxinfo:gl0 0.15 0.07
6 php 0.10 0.14
2 llvmpipe-0 0.08 0.07
2 llvmpipe-1 0.08 0.07
2 llvmpipe-10 0.08 0.07
2 llvmpipe-11 0.08 0.07
2 llvmpipe-12 0.08 0.07
2 llvmpipe-13 0.08 0.07
2 llvmpipe-14 0.08 0.07
2 llvmpipe-15 0.08 0.07
2 llvmpipe-2 0.08 0.07
2 llvmpipe-3 0.08 0.07
2 llvmpipe-4 0.08 0.07
2 llvmpipe-5 0.08 0.07
2 llvmpipe-6 0.08 0.07
2 llvmpipe-7 0.08 0.07
2 llvmpipe-8 0.08 0.07
2 llvmpipe-9 0.08 0.07
2 glxinfo 0.07 0.03
2 glxinfo:cs0 0.07 0.03
2 glxinfo:disk$0 0.07 0.03
2 glxinfo:sh0 0.07 0.03
2 glxinfo:shlo0 0.07 0.03
6 clang 0.06 0.06
3 rocminfo 0.03 0.00
1 lspci 0.00 0.03
1 ps 0.00 0.01
88 sh 0.00 0.00
13 gcc 0.00 0.00
12 tnn 0.00 0.00
10 gsettings 0.00 0.00
8 stat 0.00 0.00
8 systemd-detect- 0.00 0.00
6 llvm-link 0.00 0.00
5 phoronix-test-s 0.00 0.00
4 gmain 0.00 0.00
2 cc 0.00 0.00
2 dconf worker 0.00 0.00
2 lscpu 0.00 0.00
2 uname 0.00 0.00
2 which 0.00 0.00
2 xset 0.00 0.00
1 date 0.00 0.00
1 dirname 0.00 0.00
1 dmesg 0.00 0.00
1 dmidecode 0.00 0.00
1 grep 0.00 0.00
1 ifconfig 0.00 0.00
1 ip 0.00 0.00
1 lsmod 0.00 0.00
1 mktemp 0.00 0.00
1 qdbus 0.00 0.00
1 readlink 0.00 0.00
1 realpath 0.00 0.00
1 sed 0.00 0.00
1 sort 0.00 0.00
1 stty 0.00 0.00
1 systemctl 0.00 0.00
1 template.sh 0.00 0.00
1 wc 0.00 0.00
1 xrandr 0.00 0.00
0 processes running
47 maximum processes
The computation structure starts one thread on each core, at least for the first workload.
1119747) tnn cpu=1 start=7.90 finish=183.99
1119748) TNNTest cpu=5 start=7.90 finish=183.99
1119749) TNNTest cpu=3 start=8.11 finish=183.99
1119750) TNNTest cpu=15 start=8.11 finish=183.99
1119751) TNNTest cpu=6 start=8.11 finish=183.99
1119752) TNNTest cpu=9 start=8.11 finish=183.99
1119753) TNNTest cpu=7 start=8.11 finish=183.99
1119754) TNNTest cpu=14 start=8.11 finish=183.99
1119755) TNNTest cpu=13 start=8.11 finish=183.99
1119756) TNNTest cpu=2 start=8.11 finish=183.99
1119757) TNNTest cpu=12 start=8.11 finish=183.99
1119758) TNNTest cpu=10 start=8.11 finish=183.99
1119759) TNNTest cpu=0 start=8.11 finish=183.99
1119760) TNNTest cpu=11 start=8.11 finish=183.99
1119761) TNNTest cpu=4 start=8.11 finish=183.99
1119762) TNNTest cpu=8 start=8.11 finish=183.99
1119763) TNNTest cpu=1 start=8.11 finish=183.99
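
The one-thread-per-core observation can be checked mechanically from the listing above. This is a rough sketch that assumes each line has the form `pid) name cpu=N start=S finish=F`; the regular expression and the `parse_threads` helper are illustrative and not part of the tracing tool.

```python
import re
from collections import Counter

# Thread listing copied from the output above.
LISTING = """\
1119747) tnn cpu=1 start=7.90 finish=183.99
1119748) TNNTest cpu=5 start=7.90 finish=183.99
1119749) TNNTest cpu=3 start=8.11 finish=183.99
1119750) TNNTest cpu=15 start=8.11 finish=183.99
1119751) TNNTest cpu=6 start=8.11 finish=183.99
1119752) TNNTest cpu=9 start=8.11 finish=183.99
1119753) TNNTest cpu=7 start=8.11 finish=183.99
1119754) TNNTest cpu=14 start=8.11 finish=183.99
1119755) TNNTest cpu=13 start=8.11 finish=183.99
1119756) TNNTest cpu=2 start=8.11 finish=183.99
1119757) TNNTest cpu=12 start=8.11 finish=183.99
1119758) TNNTest cpu=10 start=8.11 finish=183.99
1119759) TNNTest cpu=0 start=8.11 finish=183.99
1119760) TNNTest cpu=11 start=8.11 finish=183.99
1119761) TNNTest cpu=4 start=8.11 finish=183.99
1119762) TNNTest cpu=8 start=8.11 finish=183.99
1119763) TNNTest cpu=1 start=8.11 finish=183.99
"""

# Assumed line layout: "<pid>) <name> cpu=<n> start=<sec> finish=<sec>".
LINE_RE = re.compile(
    r"^(\d+)\)\s+(.+?)\s+cpu=(\d+)\s+start=([\d.]+)\s+finish=([\d.]+)$"
)


def parse_threads(text):
    """Parse the listing into dicts; lines that do not match are skipped."""
    threads = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            pid, name, cpu, start, finish = match.groups()
            threads.append({"pid": int(pid), "name": name, "cpu": int(cpu),
                            "start": float(start), "finish": float(finish)})
    return threads


threads = parse_threads(LISTING)
# Count how many TNNTest threads were placed on each CPU.
per_cpu = Counter(t["cpu"] for t in threads if t["name"] == "TNNTest")

if __name__ == "__main__":
    print(f"{sum(per_cpu.values())} TNNTest threads on {len(per_cpu)} CPUs")
    print("CPUs with more than one thread:",
          sorted(cpu for cpu, n in per_cpu.items() if n > 1))
```

For this listing the count comes out to 16 TNNTest threads spread over CPUs 0 through 15 with no CPU used twice; the parent tnn process shares cpu 1 with one of them.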
