TensorFlow-based engine for inference. There are six different models. With the exception of one model, the tests mostly run on all cores.

The topdown profile shows most tests are constrained by backend stalls, though to different degrees. Overall, frontend stalls are low. This is consistent with tensorflow and ai-benchmark, two other workloads built on the TensorFlow code base.

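AMD metrics show little floating point: the float counters total about 88 per 1000 instructions, nearly all of it 128-bit. Backend stalls are more CPU than memory (32.4% vs. 17.2% of slots). On-CPU time averages 14.49 of the 16 cores.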

elapsed              1781.202
on_cpu               0.906          # 14.49 / 16 cores
utime                25768.805
stime                46.546
nvcsw                2497631        # 90.04%
nivcsw               276147         # 9.96%
inblock              0              # 0.00/sec
onblock              13600          # 7.64/sec
cpu-clock            25817374390635 # 25817.374 seconds
task-clock           25818707618172 # 25818.708 seconds
page faults          543211         # 21.039/sec
context switches     2782445        # 107.769/sec
cpu migrations       2740           # 0.106/sec
major page faults    402            # 0.016/sec
minor page faults    542809         # 21.024/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             8176257999730  # 75.297 branches per 1000 inst
branch misses        18848036907    # 0.23% branch miss
conditional          6168687726909  # 56.809 conditional branches per 1000 inst
indirect             708765353562   # 6.527 indirect branches per 1000 inst
cpu-cycles           66488699011329 # 3.47 GHz
instructions         62700919022287 # 0.94 IPC
slots                132972780453384 #
retiring             20997039894536 # 15.8% (22.3%)
-- ucode             70604174137    #     0.1%
-- fastpath          20926435720399 #    15.7%
frontend             6812430443169  #  5.1% ( 7.2%)
-- latency           4262617441938  #     3.2%
-- bandwidth         2549813001231  #     1.9%
backend              65880605697902 # 49.5% (70.1%) high
-- cpu               43055389397505 #    32.4%
-- memory            22825216300397 #    17.2%
speculation          304512513870   #  0.2% ( 0.3%) low
-- branch mispredict 231994676551   #     0.2%
-- pipeline restart  72517837319    #     0.1%
smt-contention       38978063022358 # 29.3% ( 0.0%)
cpu-cycles           66437980534368 # 3.46 GHz
instructions         62886224269728 # 0.95 IPC
instructions         20963778736844 # 110.104 l2 access per 1000 inst
l2 hit from l1       1484797600488  # 17.09% l2 miss
l2 miss from l1      76449398591    #
l2 hit from l2 pf    505326269978   #
l3 hit from l2 pf    300588021776   #
l3 miss from l2 pf   17475885955    #
instructions         20952703348688 # 88.143 float per 1000 inst
float 512            86             # 0.000 AVX-512 per 1000 inst
float 256            16638293267    # 0.794 AVX-256 per 1000 inst
float 128            1830204124915  # 87.349 AVX-128 per 1000 inst
float MMX            0              # 0.000 MMX per 1000 inst
float scalar         126            # 0.000 scalar per 1000 inst
instructions         2683871        #
opcache              1001419        # 373.125 opcache per 1000 inst
opcache miss         538595         # 53.8% opcache miss rate
l1 dTLB miss         5215           # 1.943 L1 dTLB per 1000 inst
l2 dTLB miss         1072           # 0.399 L2 dTLB per 1000 inst
instructions         2719642        #
icache               1329853        # 488.981 icache per 1000 inst
icache miss          113221         #  8.5% icache miss rate
l1 iTLB miss         13             # 0.005 L1 iTLB per 1000 inst
l2 iTLB miss         0              # 0.000 L2 iTLB per 1000 inst
tlb flush            19             # 0.007 TLB flush per 1000 inst
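
The topdown percentages in the AMD block can be reproduced from the raw slot counts. The following is a minimal Python sketch, not the profiling tool's own code. It assumes, based only on the numbers matching, that the first percentage is a share of all slots and that the parenthesized percentage is a share of the slots that remain after subtracting smt-contention. The names `topdown_shares`, `COUNTS`, and `usable` are illustrative.

```python
# Raw AMD slot counts copied from the metrics block above.
SLOTS = 132972780453384
SMT_CONTENTION = 38978063022358
COUNTS = {
    "retiring": 20997039894536,
    "frontend": 6812430443169,
    "backend": 65880605697902,
    "speculation": 304512513870,
}


def topdown_shares(counts, slots, smt_contention):
    """Return {category: (pct of all slots, pct of slots excluding SMT contention)}."""
    # Assumption: the parenthesized column uses slots minus smt-contention as its base.
    usable = slots - smt_contention
    return {
        name: (round(100.0 * n / slots, 1), round(100.0 * n / usable, 1))
        for name, n in counts.items()
    }


if __name__ == "__main__":
    shares = topdown_shares(COUNTS, SLOTS, SMT_CONTENTION)
    for name, (raw, adjusted) in shares.items():
        print(f"{name:12s} {raw:5.1f}% ({adjusted:5.1f}%)")
```

Under that assumption the backend share rises from 49.5% of all slots to 70.1% once SMT contention is excluded, which is what the block reports.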

Intel metrics

elapsed              3505.681
on_cpu               0.900          # 14.41 / 16 cores
utime                50462.029
stime                45.015
nvcsw                3228584        # 86.96%
nivcsw               484151         # 13.04%
inblock              671152         # 191.45/sec
onblock              3288           # 0.94/sec
cpu-clock            50501413499110 # 50501.413 seconds
task-clock           50502962043307 # 50502.962 seconds
page faults          1267501        # 25.098/sec
context switches     3729982        # 73.857/sec
cpu migrations       20244          # 0.401/sec
major page faults    5621           # 0.111/sec
minor page faults    1261880        # 24.986/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             7113178270992  # 42.878 branches per 1000 inst
branch misses        33305181988    # 0.47% branch miss
conditional          7113178311760  # 42.878 conditional branches per 1000 inst
indirect             2612548769133  # 15.748 indirect branches per 1000 inst
slots                195950194535378 #
retiring             60251227935729 # 30.7% (30.7%)
-- ucode             2905514460597  #     1.5%
-- fastpath          57345713475132 #    29.3%
frontend             15083119217621 #  7.7% ( 7.7%)
-- latency           11672569239183 #     6.0%
-- bandwidth         3410549978438  #     1.7%
backend              120615883054053 # 61.6% (61.6%)
-- cpu               102213139192031 #    52.2%
-- memory            18402743862022 #     9.4%
speculation          1161461077519  #  0.6% ( 0.6%) low
-- branch mispredict 1029624209035  #     0.5%
-- pipeline restart  131836868484   #     0.1%
smt-contention       0              #  0.0% ( 0.0%)
cpu-cycles           90788972703168 # 2.18 GHz
instructions         100133205361729 # 1.10 IPC
l2 access            2519301193908  # 41.427 l2 access per 1000 inst
l2 miss              695926152907   # 27.62% l2 miss
cpu-cycles           58978337968420 # 21.4% memory latency
load stalls          12514222603573 # 12.5% l1 bound
l1 miss              5158405253880  #  3.4% l2 bound
l2 miss              3175529755484  #  4.3% l3 bound
l3 miss              641250892911   #  1.1% dram bound
store_stalls         100727600884   #  0.2% store bound
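
The on_cpu lines in both metric blocks are consistent with total CPU time (utime + stime) divided by elapsed time and by the core count. The Python sketch below checks that arithmetic against the reported values. The formula is inferred from the numbers rather than taken from the tool, and the function name `on_cpu` is illustrative.

```python
def on_cpu(utime, stime, elapsed, ncores=16):
    """Return (fraction of the machine busy, average number of busy cores).

    Assumption: on_cpu = (utime + stime) / elapsed / ncores, which matches
    the figures reported in both metric blocks.
    """
    busy_cores = (utime + stime) / elapsed
    return round(busy_cores / ncores, 3), round(busy_cores, 2)


if __name__ == "__main__":
    # Values copied from the AMD and Intel metric blocks.
    amd = on_cpu(utime=25768.805, stime=46.546, elapsed=1781.202)
    intel = on_cpu(utime=50462.029, stime=45.015, elapsed=3505.681)
    print("AMD   on_cpu %.3f  (%.2f / 16 cores)" % amd)
    print("Intel on_cpu %.3f  (%.2f / 16 cores)" % intel)
```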

Process summary

814 processes
	432 linux_x86-64_be      412501.71   546.44
	 68 clinfo                  17.52     4.67
	 38 vulkaninfo               1.13     1.14
	  6 php                      0.16     0.27
	  4 vulkani:disk$0           0.12     0.12
	  6 glxinfo:gdrv0            0.09     0.04
	  6 glxinfo:gl0              0.09     0.04
	  2 llvmpipe-0               0.06     0.06
	  2 llvmpipe-1               0.06     0.06
	  2 llvmpipe-10              0.06     0.06
	  2 llvmpipe-11              0.06     0.06
	  2 llvmpipe-12              0.06     0.06
	  2 llvmpipe-13              0.06     0.06
	  2 llvmpipe-14              0.06     0.06
	  2 llvmpipe-15              0.06     0.06
	  2 llvmpipe-2               0.06     0.06
	  2 llvmpipe-3               0.06     0.06
	  2 llvmpipe-4               0.06     0.06
	  2 llvmpipe-5               0.06     0.06
	  2 llvmpipe-6               0.06     0.06
	  2 llvmpipe-7               0.06     0.06
	  2 llvmpipe-8               0.06     0.06
	  2 llvmpipe-9               0.06     0.06
	  2 glxinfo                  0.06     0.02
	  2 glxinfo:cs0              0.06     0.02
	  2 glxinfo:disk$0           0.06     0.02
	  2 glxinfo:sh0              0.05     0.02
	  2 glxinfo:shlo0            0.05     0.02
	  6 clang                    0.03     0.09
	  3 rocminfo                 0.03     0.00
	  1 lspci                    0.00     0.02
	  1 ps                       0.00     0.01
	 91 sh                       0.00     0.00
	 27 tensorflow-lite          0.00     0.00
	 12 gcc                      0.00     0.00
	 10 gsettings                0.00     0.00
	  8 stat                     0.00     0.00
	  8 systemd-detect-          0.00     0.00
	  6 llvm-link                0.00     0.00
	  5 phoronix-test-s          0.00     0.00
	  4 gmain                    0.00     0.00
	  2 dconf worker             0.00     0.00
	  2 lscpu                    0.00     0.00
	  2 uname                    0.00     0.00
	  2 which                    0.00     0.00
	  2 xset                     0.00     0.00
	  1 cc                       0.00     0.00
	  1 date                     0.00     0.00
	  1 dirname                  0.00     0.00
	  1 dmesg                    0.00     0.00
	  1 dmidecode                0.00     0.00
	  1 grep                     0.00     0.00
	  1 ifconfig                 0.00     0.00
	  1 ip                       0.00     0.00
	  1 lsmod                    0.00     0.00
	  1 mktemp                   0.00     0.00
	  1 qdbus                    0.00     0.00
	  1 readlink                 0.00     0.00
	  1 realpath                 0.00     0.00
	  1 sed                      0.00     0.00
	  1 sort                     0.00     0.00
	  1 stty                     0.00     0.00
	  1 systemctl                0.00     0.00
	  1 template.sh              0.00     0.00
	  1 wc                       0.00     0.00
	  1 xrandr                   0.00     0.00
0 processes running
47 maximum processes


The core computation blocks each start one process on each core:

      233538) tensorflow-lite  cpu=13 start=5.82  finish=66.34
        233539) linux_x86-64_be  cpu=0 start=5.82  finish=66.34
          233540) linux_x86-64_be  cpu=11 start=5.83  finish=66.34
          233541) linux_x86-64_be  cpu=4 start=5.83  finish=66.34
          233542) linux_x86-64_be  cpu=6 start=5.83  finish=66.34
          233543) linux_x86-64_be  cpu=2 start=5.83  finish=66.34
          233544) linux_x86-64_be  cpu=1 start=5.83  finish=66.34
          233545) linux_x86-64_be  cpu=5 start=5.83  finish=66.34
          233546) linux_x86-64_be  cpu=7 start=5.83  finish=66.34
          233547) linux_x86-64_be  cpu=8 start=5.83  finish=66.34
          233548) linux_x86-64_be  cpu=14 start=5.83  finish=66.34
          233549) linux_x86-64_be  cpu=10 start=5.83  finish=66.34
          233550) linux_x86-64_be  cpu=3 start=5.83  finish=66.34
          233551) linux_x86-64_be  cpu=12 start=5.83  finish=66.34
          233552) linux_x86-64_be  cpu=9 start=5.83  finish=66.34
          233553) linux_x86-64_be  cpu=13 start=5.83  finish=66.34
          233554) linux_x86-64_be  cpu=15 start=5.83  finish=66.34
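
The placement shown in this tree can be checked mechanically. The Python sketch below parses the lines above and confirms that the 16 linux_x86-64_be processes sit on 16 distinct CPUs. The regular expression and the helper name `worker_cpus` are illustrative and assume the line layout shown here. The process summary is consistent with this pattern: 432 linux_x86-64_be processes across 27 tensorflow-lite invocations is exactly 16 per invocation.

```python
import re

# Process tree lines copied from above (indentation is not significant here).
TREE = """\
233538) tensorflow-lite  cpu=13 start=5.82  finish=66.34
233539) linux_x86-64_be  cpu=0 start=5.82  finish=66.34
233540) linux_x86-64_be  cpu=11 start=5.83  finish=66.34
233541) linux_x86-64_be  cpu=4 start=5.83  finish=66.34
233542) linux_x86-64_be  cpu=6 start=5.83  finish=66.34
233543) linux_x86-64_be  cpu=2 start=5.83  finish=66.34
233544) linux_x86-64_be  cpu=1 start=5.83  finish=66.34
233545) linux_x86-64_be  cpu=5 start=5.83  finish=66.34
233546) linux_x86-64_be  cpu=7 start=5.83  finish=66.34
233547) linux_x86-64_be  cpu=8 start=5.83  finish=66.34
233548) linux_x86-64_be  cpu=14 start=5.83  finish=66.34
233549) linux_x86-64_be  cpu=10 start=5.83  finish=66.34
233550) linux_x86-64_be  cpu=3 start=5.83  finish=66.34
233551) linux_x86-64_be  cpu=12 start=5.83  finish=66.34
233552) linux_x86-64_be  cpu=9 start=5.83  finish=66.34
233553) linux_x86-64_be  cpu=13 start=5.83  finish=66.34
233554) linux_x86-64_be  cpu=15 start=5.83  finish=66.34
"""

# Assumed layout: "<pid>) <command> cpu=<n> start=<t> finish=<t>"
LINE_RE = re.compile(r"(\d+)\)\s+(\S+)\s+cpu=(\d+)\s+start=([\d.]+)\s+finish=([\d.]+)")


def worker_cpus(text, command="linux_x86-64_be"):
    """Return the sorted list of CPUs on which `command` processes ran."""
    cpus = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match and match.group(2) == command:
            cpus.append(int(match.group(3)))
    return sorted(cpus)


if __name__ == "__main__":
    cpus = worker_cpus(TREE)
    print(len(cpus), "workers on", len(set(cpus)), "distinct CPUs")
```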