onednn – Performance analysis, tools and experiments

onednn is a neural network library; this test runs its convolution benchmark. Except for Stream, this is the most memory-bound application I have observed. A few data points show front-end misses, but overall the front-end share is low.

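AMD metrics show only about 50% on-CPU (8.13 of 16 cores). There is not much floating point (about 4.5 float operations per 1000 instructions), despite this being listed as an f32 and u8/i8 test. IPC is very low (0.20) and the L2 miss rate is high (44.79%).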

elapsed              71.340
on_cpu               0.508          # 8.13 / 16 cores
utime                577.606
stime                2.283
nvcsw                2933           # 21.82%
nivcsw               10507          # 78.18%
inblock              7824           # 109.67/sec
onblock              1808           # 25.34/sec
cpu-clock            580267448241   # 580.267 seconds
task-clock           580293281735   # 580.293 seconds
page faults          917435         # 1580.985/sec
context switches     13603          # 23.442/sec
cpu migrations       355            # 0.612/sec
major page faults    101            # 0.174/sec
minor page faults    917334         # 1580.811/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             14973262484    # 27.616 branches per 1000 inst
branch misses        247963050      # 1.66% branch miss
conditional          12771157805    # 23.555 conditional branches per 1000 inst
indirect             148342180      # 0.274 indirect branches per 1000 inst
cpu-cycles           2780469655011  # 2.44 GHz
instructions         543789492995   # 0.20 IPC
slots                5553310978572  #
retiring             189704440465   #  3.4% ( 3.6%)
-- ucode             965399781      #     0.0%
-- fastpath          188739040684   #     3.4%
frontend             200601330078   #  3.6% ( 3.8%)
-- latency           172522010298   #     3.1%
-- bandwidth         28079319780    #     0.5%
backend              4904067730753  # 88.3% (92.6%)
-- cpu               843972890978   #    15.2%
-- memory            4060094839775  #    73.1%
speculation          1460554868     #  0.0% ( 0.0%)
-- branch mispredict 1317644471     #     0.0%
-- pipeline restart  142910397      #     0.0%
smt-contention       257472348837   #  4.6% ( 0.0%)
cpu-cycles           2780748578640  # 2.29 GHz
instructions         542814219642   # 0.20 IPC
instructions         181076126057   # 50.266 l2 access per 1000 inst
l2 hit from l1       6471351931     # 44.79% l2 miss
l2 miss from l1      2012992425     #
l2 hit from l2 pf    566893824      #
l3 hit from l2 pf    538088144      #
l3 miss from l2 pf   1525658477     #
instructions         181285666295   # 4.541 float per 1000 inst
float 512            62             # 0.000 AVX-512 per 1000 inst
float 256            314            # 0.000 AVX-256 per 1000 inst
float 128            823292867      # 4.541 AVX-128 per 1000 inst
float MMX            0              # 0.000 MMX per 1000 inst
float scalar         0              # 0.000 scalar per 1000 inst
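
The headline figures in the listing above follow from a handful of raw counters. The Python sketch below recomputes them. The exact formulas the measurement tool uses are not stated on this page, so dividing task-clock by elapsed time is an assumption that happens to reproduce the printed values, and the helper name `derived_metrics` is made up for illustration.

```python
# Raw values copied from the AMD listing above.
ELAPSED_S = 71.340                 # wall-clock seconds
TASK_CLOCK_NS = 580_293_281_735    # task-clock, nanoseconds
CORES = 16
CYCLES = 2_780_469_655_011
INSTRUCTIONS = 543_789_492_995


def derived_metrics(elapsed_s, task_clock_ns, cores, cycles, instructions):
    """Recompute on-CPU fraction and IPC from raw counters (assumed formulas)."""
    busy_cores = (task_clock_ns / 1e9) / elapsed_s   # average cores kept busy
    return {
        "busy_cores": busy_cores,
        "on_cpu": busy_cores / cores,                # fraction of the machine in use
        "ipc": instructions / cycles,                # instructions per cycle
    }


m = derived_metrics(ELAPSED_S, TASK_CLOCK_NS, CORES, CYCLES, INSTRUCTIONS)
print(f"{m['busy_cores']:.2f} / {CORES} cores, "
      f"on_cpu={m['on_cpu']:.3f}, IPC={m['ipc']:.2f}")
```

An on-CPU fraction near 0.5 on a 16-CPU machine means that, on average, about half of the hardware threads were running the workload at any moment.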

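Intel metrics also confirm this is a short-running workload. They show the same pattern: about 51% on-CPU, low IPC (0.41), a 58.33% L2 miss rate, and 80.3% of slots backend-bound (67.2% on memory).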

elapsed              70.776
on_cpu               0.515          # 8.24 / 16 cores
utime                581.506
stime                1.509
nvcsw                2728           # 29.18%
nivcsw               6621           # 70.82%
inblock              35816          # 506.05/sec
onblock              1888           # 26.68/sec
cpu-clock            583151242836   # 583.151 seconds
task-clock           583163697663   # 583.164 seconds
page faults          913034         # 1565.656/sec
context switches     9520           # 16.325/sec
cpu migrations       476            # 0.816/sec
major page faults    197            # 0.338/sec
minor page faults    912837         # 1565.319/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             24742585163    # 27.518 branches per 1000 inst
branch misses        55106113       # 0.22% branch miss
conditional          24742600267    # 27.518 conditional branches per 1000 inst
indirect             8415132396     # 9.359 indirect branches per 1000 inst
slots                3324114012128  #
retiring             466162249004   # 14.0% (14.0%)
-- ucode             10119389453    #     0.3%
-- fastpath          456042859551   #    13.7%
frontend             180764613017   #  5.4% ( 5.4%)
-- latency           156300290055   #     4.7%
-- bandwidth         24464322962    #     0.7%
backend              2669530026167  # 80.3% (80.3%)
-- cpu               436293792636   #    13.1%
-- memory            2233236233531  #    67.2%
speculation          5291484831     #  0.2% ( 0.2%)
-- branch mispredict 3942250602     #     0.1%
-- pipeline restart  1349234229     #     0.0%
smt-contention       0              #  0.0% ( 0.0%)
cpu-cycles           1100899965042  # 0.97 GHz
instructions         454783623331   # 0.41 IPC
l2 access            10699740373    # 23.529 l2 access per 1000 inst
l2 miss              6241519885     # 58.33% l2 miss
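
The top-down percentages in the Intel listing above are each category's slot count divided by the total number of slots. The sketch below redoes that arithmetic in Python; the helper name `percent` is hypothetical, and the assumption is simply that each printed percentage is `category / slots`.

```python
# Raw slot counts copied from the Intel listing above.
SLOTS = 3_324_114_012_128
LEVEL1 = {
    "retiring": 466_162_249_004,
    "frontend": 180_764_613_017,
    "backend": 2_669_530_026_167,
    "speculation": 5_291_484_831,
}
BACKEND_MEMORY = 2_233_236_233_531
L2_ACCESS = 10_699_740_373
L2_MISS = 6_241_519_885


def percent(part, whole, digits=1):
    """Return part/whole as a percentage rounded to `digits` decimals."""
    return round(100.0 * part / whole, digits)


shares = {name: percent(count, SLOTS) for name, count in LEVEL1.items()}
memory_share = percent(BACKEND_MEMORY, SLOTS)
l2_miss_rate = percent(L2_MISS, L2_ACCESS, digits=2)

for name, share in shares.items():
    print(f"{name:12s} {share:5.1f}%")
print(f"{'-- memory':12s} {memory_share:5.1f}%")
print(f"l2 miss rate {l2_miss_rate:.2f}%")
```

Because the four level-1 categories are rounded independently, their printed shares need not add up to exactly 100%.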

The process summary shows that almost all of the time is spent in benchdnn.

454 processes
	 96 benchdnn              9231.20    22.88
	 64 clinfo                  10.88     3.96
	 38 vulkaninfo               0.76     1.14
	  6 glxinfo:gdrv0            0.11     0.09
	  4 vulkani:disk$0           0.08     0.12
	  6 php                      0.06     0.09
	  6 clang                    0.05     0.03
	  2 glxinfo                  0.05     0.03
	  2 glxinfo:cs0              0.05     0.03
	  2 glxinfo:disk$0           0.05     0.03
	  2 glxinfo:sh0              0.05     0.03
	  2 glxinfo:shlo0            0.05     0.03
	  2 llvmpipe-0               0.04     0.06
	  2 llvmpipe-1               0.04     0.06
	  2 llvmpipe-10              0.04     0.06
	  2 llvmpipe-11              0.04     0.06
	  2 llvmpipe-12              0.04     0.06
	  2 llvmpipe-13              0.04     0.06
	  2 llvmpipe-14              0.04     0.06
	  2 llvmpipe-15              0.04     0.06
	  2 llvmpipe-2               0.04     0.06
	  2 llvmpipe-3               0.04     0.06
	  2 llvmpipe-4               0.04     0.06
	  2 llvmpipe-5               0.04     0.06
	  2 llvmpipe-6               0.04     0.06
	  2 llvmpipe-7               0.04     0.06
	  2 llvmpipe-8               0.04     0.06
	  2 llvmpipe-9               0.04     0.06
	  1 lspci                    0.01     0.03
	 90 sh                       0.00     0.00
	 13 gcc                      0.00     0.00
	 10 gsettings                0.00     0.00
	  9 stty                     0.00     0.00
	  8 stat                     0.00     0.00
	  8 systemd-detect-          0.00     0.00
	  6 llvm-link                0.00     0.00
	  6 onednn                   0.00     0.00
	  5 phoronix-test-s          0.00     0.00
	  4 gmain                    0.00     0.00
	  2 cc                       0.00     0.00
	  2 dconf worker             0.00     0.00
	  2 lscpu                    0.00     0.00
	  2 uname                    0.00     0.00
	  2 which                    0.00     0.00
	  2 xset                     0.00     0.00
	  1 date                     0.00     0.00
	  1 dirname                  0.00     0.00
	  1 dmesg                    0.00     0.00
	  1 dmidecode                0.00     0.00
	  1 grep                     0.00     0.00
	  1 ifconfig                 0.00     0.00
	  1 ip                       0.00     0.00
	  1 lsmod                    0.00     0.00
	  1 mktemp                   0.00     0.00
	  1 ps                       0.00     0.00
	  1 qdbus                    0.00     0.00
	  1 readlink                 0.00     0.00
	  1 realpath                 0.00     0.00
	  1 sed                      0.00     0.00
	  1 sort                     0.00     0.00
	  1 systemctl                0.00     0.00
	  1 template.sh              0.00     0.00
	  1 wc                       0.00     0.00
	  1 xrandr                   0.00     0.00
1 processes running
48 maximum processes
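
To put a number on how dominant benchdnn is, the summary rows can be parsed and the first numeric column turned into shares. This is a minimal sketch on a four-row sample of the table above. The two numeric columns are not labelled in the listing, so the code does not interpret them, and `parse_row` is a hypothetical helper.

```python
# Four rows copied from the process summary above (a sample, not the full table).
SAMPLE = """\
\t 96 benchdnn              9231.20    22.88
\t 64 clinfo                  10.88     3.96
\t 38 vulkaninfo               0.76     1.14
\t  2 dconf worker             0.00     0.00
"""


def parse_row(line):
    """Split one row into (name, count, first_value, second_value)."""
    # Split from the right so that names containing spaces stay intact.
    head, first, second = line.rsplit(None, 2)
    count, name = head.split(None, 1)
    return name, int(count), float(first), float(second)


rows = [parse_row(line) for line in SAMPLE.splitlines() if line.strip()]
total = sum(row[2] for row in rows)
share = {row[0]: row[2] / total for row in rows}

for name, fraction in share.items():
    print(f"{name:15s} {fraction:7.2%}")
```

Running the same parser over the full table would change the shares only slightly, since every row other than benchdnn and clinfo is below 1 in the first column.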

The computational core starts benchmark threads on all 16 CPUs.

      454702) onednn           cpu=0 start=15.52 finish=21.76
        454703) benchdnn         cpu=8 start=15.53 finish=21.75
          454704) benchdnn         cpu=0 start=15.56 finish=21.75
          454705) benchdnn         cpu=9 start=15.56 finish=21.75
          454706) benchdnn         cpu=1 start=15.56 finish=21.75
          454707) benchdnn         cpu=10 start=15.56 finish=21.75
          454708) benchdnn         cpu=2 start=15.56 finish=21.75
          454709) benchdnn         cpu=11 start=15.56 finish=21.75
          454710) benchdnn         cpu=3 start=15.56 finish=21.75
          454711) benchdnn         cpu=12 start=15.56 finish=21.75
          454712) benchdnn         cpu=4 start=15.56 finish=21.75
          454713) benchdnn         cpu=13 start=15.56 finish=21.75
          454714) benchdnn         cpu=5 start=15.56 finish=21.75
          454715) benchdnn         cpu=14 start=15.56 finish=21.75
          454716) benchdnn         cpu=6 start=15.56 finish=21.75
          454717) benchdnn         cpu=15 start=15.56 finish=21.75
          454718) benchdnn         cpu=7 start=15.56 finish=21.75
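
The placement claim can be checked mechanically from the listing above. The sketch below parses each row with a regular expression and confirms that the benchdnn threads cover CPUs 0 through 15. It assumes the `cpu=` field names the CPU each thread was observed on, which the listing itself does not spell out, and the pattern `ROW` is written for this example only.

```python
import re

# Thread listing copied from above, with indentation dropped.
LISTING = """\
454702) onednn           cpu=0 start=15.52 finish=21.76
454703) benchdnn         cpu=8 start=15.53 finish=21.75
454704) benchdnn         cpu=0 start=15.56 finish=21.75
454705) benchdnn         cpu=9 start=15.56 finish=21.75
454706) benchdnn         cpu=1 start=15.56 finish=21.75
454707) benchdnn         cpu=10 start=15.56 finish=21.75
454708) benchdnn         cpu=2 start=15.56 finish=21.75
454709) benchdnn         cpu=11 start=15.56 finish=21.75
454710) benchdnn         cpu=3 start=15.56 finish=21.75
454711) benchdnn         cpu=12 start=15.56 finish=21.75
454712) benchdnn         cpu=4 start=15.56 finish=21.75
454713) benchdnn         cpu=13 start=15.56 finish=21.75
454714) benchdnn         cpu=5 start=15.56 finish=21.75
454715) benchdnn         cpu=14 start=15.56 finish=21.75
454716) benchdnn         cpu=6 start=15.56 finish=21.75
454717) benchdnn         cpu=15 start=15.56 finish=21.75
454718) benchdnn         cpu=7 start=15.56 finish=21.75
"""

# id) name cpu=N start=S finish=F
ROW = re.compile(r"(\d+)\)\s+(\S+)\s+cpu=(\d+)\s+start=([\d.]+)\s+finish=([\d.]+)")

threads = [
    (int(tid), name, int(cpu), float(start), float(finish))
    for tid, name, cpu, start, finish in ROW.findall(LISTING)
]
bench = [t for t in threads if t[1] == "benchdnn"]
cpus = sorted(t[2] for t in bench)
longest = max(finish - start for _, _, _, start, finish in bench)

print(f"{len(bench)} benchdnn threads on CPUs {cpus[0]}..{cpus[-1]}")
print(f"all 16 CPUs covered: {cpus == list(range(16))}")
print(f"longest thread lifetime: {longest:.2f} s")
```

Each thread lives for only about six seconds in this listing, which is short compared with the roughly 71 seconds of elapsed time reported for the whole measurement.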