ONNX Runtime with 20 different workloads. These run with varying degrees of parallelism.

The topdown profile shows the run is mostly backend bound, with periods of high frontend stalls.

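AMD metrics show the workload running on about half the cores (7.75 of 16), with little floating point (about 78 float operations per 1000 instructions, almost all 128-bit) and a moderate L2 hit rate (12% L2 miss). It is backend bound (75.9% of slots), with high memory stalls (44.2%) but also significant CPU stalls (31.6%).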

elapsed              7666.905
on_cpu               0.484          # 7.75 / 16 cores
utime                59267.083
stime                121.864
nvcsw                126506         # 56.48%
nivcsw               97469          # 43.52%
inblock              8              # 0.00/sec
onblock              31152          # 4.06/sec
cpu-clock            59392070765107 # 59392.071 seconds
task-clock           59392404433514 # 59392.404 seconds
page faults          75921573       # 1278.304/sec
context switches     261852         # 4.409/sec
cpu migrations       45585          # 0.768/sec
major page faults    268            # 0.005/sec
minor page faults    75921305       # 1278.300/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             13964968401705 # 59.551 branches per 1000 inst
branch misses        24286886257    # 0.17% branch miss
conditional          13150504035833 # 56.078 conditional branches per 1000 inst
indirect             75766798166    # 0.323 indirect branches per 1000 inst
cpu-cycles           170936473902854 # 2.13 GHz
instructions         149610925486559 # 0.88 IPC
slots                341882092221648 #
retiring             50270481479148 # 14.7% (15.8%)
-- ucode             449679582297   #     0.1%
-- fastpath          49820801896851 #    14.6%
frontend             7839231529047  #  2.3% ( 2.5%) low
-- latency           3811377380796  #     1.1%
-- bandwidth         4027854148251  #     1.2%
backend              259444684020020 # 75.9% (81.5%) high
-- cpu               108192204233662 #    31.6%
-- memory            151252479786358 #    44.2%
speculation          667988414339   #  0.2% ( 0.2%) low
-- branch mispredict 416633463189   #     0.1%
-- pipeline restart  251354951150   #     0.1%
smt-contention       23659541944969 #  6.9% ( 0.0%)
cpu-cycles           225365754216755 # 1.98 GHz
instructions         209421223495808 # 0.93 IPC
instructions         69791759643743 # 102.929 l2 access per 1000 inst
l2 hit from l1       5092499628764  # 12.04% l2 miss
l2 miss from l1      187031638880   #
l2 hit from l2 pf    1413548447320  #
l3 hit from l2 pf    451119563582   #
l3 miss from l2 pf   226457942800   #
instructions         69772481102765 # 78.194 float per 1000 inst
float 512            167            # 0.000 AVX-512 per 1000 inst
float 256            10196733872    # 0.146 AVX-256 per 1000 inst
float 128            5445615832780  # 78.048 AVX-128 per 1000 inst
float MMX            0              # 0.000 MMX per 1000 inst
float scalar         0              # 0.000 scalar per 1000 inst
instructions         140399342837497 #
opcache              12547602639130 # 89.371 opcache per 1000 inst
opcache miss         336571040544   #  2.7% opcache miss rate
l1 dTLB miss         85216624277    # 0.607 L1 dTLB per 1000 inst
l2 dTLB miss         20923237061    # 0.149 L2 dTLB per 1000 inst
instructions         228624533030111 #
icache               782732232728   # 3.424 icache per 1000 inst
icache miss          106861319043   # 13.7% icache miss rate
l1 iTLB miss         4080671841     # 0.018 L1 iTLB per 1000 inst
l2 iTLB miss         0              # 0.000 L2 iTLB per 1000 inst
tlb flush            78761          # 0.000 TLB flush per 1000 inst
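
The derived columns above can be reproduced from the raw counters. The sketch below is an illustration, not the profiler's own code: it assumes on_cpu is (utime + stime) / elapsed divided by the core count, and that the parenthesized topdown percentages exclude the smt-contention slots. Both assumptions agree with the printed values for this run. The function and variable names are made up for the example.

```python
# Raw values copied from the AMD run above.
elapsed = 7666.905                    # wall-clock seconds
utime, stime = 59267.083, 121.864     # user and system CPU seconds
ncores = 16

slots = 341882092221648
smt_contention = 23659541944969
topdown = {
    "retiring": 50270481479148,
    "frontend": 7839231529047,
    "backend": 259444684020020,
    "speculation": 667988414339,
}

def amd_summary():
    # Average number of busy cores, and that number as a fraction of all cores.
    busy_cores = (utime + stime) / elapsed
    on_cpu = busy_cores / ncores
    # Assumption: the second percentage drops slots lost to SMT contention.
    usable = slots - smt_contention
    shares = {
        name: (100 * count / slots, 100 * count / usable)
        for name, count in topdown.items()
    }
    return busy_cores, on_cpu, shares

if __name__ == "__main__":
    busy, frac, shares = amd_summary()
    print(f"on_cpu {frac:.3f}  # {busy:.2f} / {ncores} cores")
    for name, (raw, adjusted) in shares.items():
        print(f"{name:12s} {raw:5.1f}% ({adjusted:5.1f}%)")
```

With these inputs, backend comes out at 75.9% of all slots and 81.5% of the slots that were not lost to SMT contention, matching the two figures printed on the backend line.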

Intel metrics show that most backend stalls are CPU stalls (52.1% of slots, versus 13.7% for memory).

elapsed              5583.167
on_cpu               0.727          # 11.64 / 16 cores
utime                64884.355
stime                77.838
nvcsw                91202          # 20.23%
nivcsw               359685         # 79.77%
inblock              352            # 0.06/sec
onblock              18768          # 3.36/sec
cpu-clock            64965405314194 # 64965.405 seconds
task-clock           64965592525832 # 64965.593 seconds
page faults          60825931       # 936.279/sec
context switches     478409         # 7.364/sec
cpu migrations       61992          # 0.954/sec
major page faults    709            # 0.011/sec
minor page faults    60825222       # 936.268/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             9379557277083  # 50.485 branches per 1000 inst
branch misses        22703149215    # 0.24% branch miss
conditional          9379557342747  # 50.485 conditional branches per 1000 inst
indirect             1939366372296  # 10.439 indirect branches per 1000 inst
slots                471739138825190 #
retiring             128933116492684 # 27.3% (27.3%)
-- ucode             7027709037148  #     1.5%
-- fastpath          121905407455536 #    25.8%
frontend             28156329685609 #  6.0% ( 6.0%)
-- latency           20790135084826 #     4.4%
-- bandwidth         7366194600783  #     1.6%
backend              310047888633198 # 65.7% (65.7%)
-- cpu               245552390551012 #    52.1%
-- memory            64495498082186 #    13.7%
speculation          4944247087919  #  1.0% ( 1.0%)
-- branch mispredict 2039556773679  #     0.4%
-- pipeline restart  2904690314240  #     0.6%
smt-contention       0              #  0.0% ( 0.0%)
cpu-cycles           97716947289441 # 1.16 GHz
instructions         110341301754781 # 1.13 IPC
l2 access            3318562539997  # 40.480 l2 access per 1000 inst
l2 miss              1067638444672  # 32.17% l2 miss
cpu-cycles           75242602539682 # 25.9% memory latency
load stalls          18912792284043 #  0.0% l1 bound
l1 miss              19009622083529 #  5.5% l2 bound
l2 miss              14894522086162 #  2.9% l3 bound
l3 miss              12690193751695 # 16.9% dram bound
store_stalls         550951507058   #  0.7% store bound
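
The memory-latency breakdown at the end of the Intel output appears to come from differencing successive stall counters. The sketch below assumes the left-hand numbers are stall-cycle counts at each cache level and that each "bound" percentage is the difference between adjacent levels divided by cpu-cycles. That reading reproduces the printed percentages for this run, but it is inferred from the numbers rather than taken from the tool. The variable names are illustrative.

```python
# Raw counters copied from the Intel run above.
cycles = 75242602539682
load_stalls = 18912792284043
l1_miss_stalls = 19009622083529
l2_miss_stalls = 14894522086162
l3_miss_stalls = 12690193751695
store_stalls = 550951507058

def pct(count):
    """Express a stall-cycle count as a percentage of all cycles."""
    return 100 * count / cycles

def memory_breakdown():
    return {
        "memory latency": pct(load_stalls + store_stalls),
        # Stalls that hit in L1; clamped because the raw difference is
        # slightly negative in this run.
        "l1 bound": max(0.0, pct(load_stalls - l1_miss_stalls)),
        # Missed L1 but were served by L2.
        "l2 bound": pct(l1_miss_stalls - l2_miss_stalls),
        # Missed L2 but were served by L3.
        "l3 bound": pct(l2_miss_stalls - l3_miss_stalls),
        # Missed L3, so served from DRAM.
        "dram bound": pct(l3_miss_stalls),
        "store bound": pct(store_stalls),
    }

if __name__ == "__main__":
    for name, value in memory_breakdown().items():
        print(f"{name:15s} {value:5.1f}%")
```

Read this way, about two thirds of the memory-latency cycles (16.9% of 25.9%) are DRAM bound, which fits the 32% L2 miss rate reported just above.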

The process summary shows nearly all time in onnxruntime_per.

852 processes
	537 onnxruntime_per      292179.20   381.75
	 34 clinfo                  10.07     2.99
	 19 vulkaninfo               0.95     0.57
	  2 vulkani:disk$0           0.10     0.06
	  3 glxinfo:gdrv0            0.08     0.03
	  3 glxinfo:gl0              0.08     0.03
	  6 clang                    0.06     0.06
	  1 llvmpipe-0               0.05     0.03
	  1 llvmpipe-1               0.05     0.03
	  1 llvmpipe-10              0.05     0.03
	  1 llvmpipe-11              0.05     0.03
	  1 llvmpipe-12              0.05     0.03
	  1 llvmpipe-13              0.05     0.03
	  1 llvmpipe-14              0.05     0.03
	  1 llvmpipe-15              0.05     0.03
	  1 llvmpipe-2               0.05     0.03
	  1 llvmpipe-3               0.05     0.03
	  1 llvmpipe-4               0.05     0.03
	  1 llvmpipe-5               0.05     0.03
	  1 llvmpipe-6               0.05     0.03
	  1 llvmpipe-7               0.05     0.03
	  1 llvmpipe-8               0.05     0.03
	  1 llvmpipe-9               0.05     0.03
	  1 glxinfo                  0.04     0.01
	  1 glxinfo:cs0              0.04     0.01
	  1 glxinfo:disk$0           0.04     0.01
	  1 glxinfo:sh0              0.04     0.01
	  1 glxinfo:shlo0            0.04     0.01
	 78 sh                       0.00     0.00
	 54 onnx                     0.00     0.00
	 13 gcc                      0.00     0.00
	 10 gsettings                0.00     0.00
	  8 systemd-detect-          0.00     0.00
	  7 stat                     0.00     0.00
	  6 llvm-link                0.00     0.00
	  4 gmain                    0.00     0.00
	  4 phoronix-test-s          0.00     0.00
	  2 which                    0.00     0.00
	  1 cc                       0.00     0.00
	  1 date                     0.00     0.00
	  1 dconf worker             0.00     0.00
	  1 dirname                  0.00     0.00
	  1 grep                     0.00     0.00
	  1 ifconfig                 0.00     0.00
	  1 ip                       0.00     0.00
	  1 lscpu                    0.00     0.00
	  1 mktemp                   0.00     0.00
	  1 ps                       0.00     0.00
	  1 python                   0.00     0.00
	  1 python3                  0.00     0.00
	  1 qdbus                    0.00     0.00
	  1 readlink                 0.00     0.00
	  1 realpath                 0.00     0.00
	  1 sed                      0.00     0.00
	  1 sort                     0.00     0.00
	  1 stty                     0.00     0.00
	  1 systemctl                0.00     0.00
	  1 template.sh              0.00     0.00
	  1 wc                       0.00     0.00
	  1 xrandr                   0.00     0.00
	  1 xset                     0.00     0.00
18 processes running
47 maximum processes

Computation blocks are relatively regular.

      23628) onnx             cpu=1 start=71.35 finish=132.96
        23629) onnxruntime_per  cpu=5 start=71.35 finish=132.94
          23630) onnxruntime_per  cpu=3 start=71.84 finish=132.90
          23631) onnxruntime_per  cpu=4 start=71.84 finish=132.90
          23632) onnxruntime_per  cpu=6 start=71.84 finish=132.90
          23633) onnxruntime_per  cpu=15 start=71.84 finish=132.90
          23634) onnxruntime_per  cpu=8 start=71.84 finish=132.90
          23635) onnxruntime_per  cpu=2 start=71.84 finish=132.90
          23636) onnxruntime_per  cpu=1 start=71.84 finish=132.90
          23637) onnxruntime_per  cpu=14 start=71.85 finish=132.90
          23638) onnxruntime_per  cpu=7 start=71.85 finish=132.90
          23639) onnxruntime_per  cpu=12 start=71.85 finish=132.90
          23640) onnxruntime_per  cpu=0 start=71.85 finish=132.90
          23641) onnxruntime_per  cpu=11 start=71.85 finish=132.90
          23642) onnxruntime_per  cpu=10 start=71.85 finish=132.90
          23643) onnxruntime_per  cpu=5 start=71.85 finish=132.90
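
One way to check that regularity is to compute each entry's lifetime from its start and finish times. The sketch below parses a few lines copied from the tree above. The regular expression and the `lifetimes` helper are assumptions made for this illustration and are not part of the profiling tool.

```python
import re

# A few lines copied from the process tree above (indentation removed).
TREE = """\
23628) onnx             cpu=1 start=71.35 finish=132.96
23629) onnxruntime_per  cpu=5 start=71.35 finish=132.94
23630) onnxruntime_per  cpu=3 start=71.84 finish=132.90
23637) onnxruntime_per  cpu=14 start=71.85 finish=132.90
"""

LINE = re.compile(
    r"(\d+)\)\s+(\S+)\s+cpu=(\d+)\s+start=([\d.]+)\s+finish=([\d.]+)"
)

def lifetimes(text):
    """Return (id, name, cpu, seconds alive) for every tree line in text."""
    rows = []
    for match in LINE.finditer(text):
        ident, name, cpu, start, finish = match.groups()
        seconds = round(float(finish) - float(start), 2)
        rows.append((int(ident), name, int(cpu), seconds))
    return rows

if __name__ == "__main__":
    for ident, name, cpu, seconds in lifetimes(TREE):
        print(f"{ident} {name:16s} cpu={cpu:<2d} {seconds:.2f}s")
```

For the block shown, the worker entries all live for about 61 seconds and start within 0.01 seconds of one another.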