Open Image Denoise is a denoising library for ray tracing and part of the oneAPI Rendering Toolkit. Three tests run on the CPU. On AMD the HIP and SYCL tests fail. The six failures may correspond to the single-threaded segment at the end.

Topdown profile is dominated by backend stalls.

AMD metrics show little floating point (about 6 per 1000 instructions). Backend stalls are CPU-bound (58.2%) rather than memory-bound (8.9%). Frontend stalls are very low (0.7%).

elapsed              1143.453
on_cpu               0.852          # 13.64 / 16 cores
utime                15569.790
stime                24.394
nvcsw                137174         # 47.97%
nivcsw               148800         # 52.03%
inblock              8              # 0.01/sec
onblock              13656          # 11.94/sec
cpu-clock            15595849921789 # 15595.850 seconds
task-clock           15596082503631 # 15596.083 seconds
page faults          7315810        # 469.080/sec
context switches     291471         # 18.689/sec
cpu migrations       931            # 0.060/sec
major page faults    57             # 0.004/sec
minor page faults    7315753        # 469.076/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             202447109427   # 4.901 branches per 1000 inst
branch misses        2965059032     # 1.46% branch miss
conditional          185803354842   # 4.498 conditional branches per 1000 inst
indirect             786178808      # 0.019 indirect branches per 1000 inst
cpu-cycles           62242756524025 # 3.39 GHz
instructions         41309080431827 # 0.66 IPC low
slots                124479064490070 #
retiring             14028040556278 # 11.3% (14.2%)
-- ucode             4540852651     #     0.0%
-- fastpath          14023499703627 #    11.3%
frontend             903299780714   #  0.7% ( 0.9%) low
-- latency           805847673036   #     0.6%
-- bandwidth         97452107678    #     0.1%
backend              83506378748576 # 67.1% (84.8%) high
-- cpu               72430204392186 #    58.2%
-- memory            11076174356390 #     8.9%
speculation          23559215421    #  0.0% ( 0.0%) low
-- branch mispredict 19678376604    #     0.0%
-- pipeline restart  3880838817     #     0.0%
smt-contention       26017722604434 # 20.9% ( 0.0%)
cpu-cycles           62237080562275 # 3.38 GHz
instructions         41309704710978 # 0.66 IPC low
instructions         13768582732070 # 125.298 l2 access per 1000 inst
l2 hit from l1       1477657932786  # 4.28% l2 miss
l2 miss from l1      13861287643    #
l2 hit from l2 pf    187554180525   #
l3 hit from l2 pf    20200972860    #
l3 miss from l2 pf   39763266084    #
instructions         13770331503339 # 6.083 float per 1000 inst
float 512            108            # 0.000 AVX-512 per 1000 inst
float 256            1120053095     # 0.081 AVX-256 per 1000 inst
float 128            82649393722    # 6.002 AVX-128 per 1000 inst
float MMX            0              # 0.000 MMX per 1000 inst
float scalar         0              # 0.000 scalar per 1000 inst
instructions         2655157        #
opcache              985914         # 371.320 opcache per 1000 inst
opcache miss         525657         # 53.3% opcache miss rate
l1 dTLB miss         5852           # 2.204 L1 dTLB per 1000 inst
l2 dTLB miss         1012           # 0.381 L2 dTLB per 1000 inst
instructions         2809369        #
icache               1346399        # 479.253 icache per 1000 inst
icache miss          118242         #  8.8% icache miss rate
l1 iTLB miss         13             # 0.005 L1 iTLB per 1000 inst
l2 iTLB miss         0              # 0.000 L2 iTLB per 1000 inst
tlb flush            19             # 0.007 TLB flush per 1000 inst
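
The `on_cpu` and IPC annotations in the AMD listing above can be reproduced from the raw values. The sketch below is a minimal check, assuming `on_cpu` is user plus system time divided by elapsed time and by the 16 cores named in the annotation; the function and constant names are illustrative and not taken from the profiling tool.

```python
# Values copied from the AMD listing above.
ELAPSED = 1143.453                  # wall-clock seconds
UTIME = 15569.790                   # user CPU seconds
STIME = 24.394                      # system CPU seconds
CORES = 16                          # from the "/ 16 cores" annotation
CYCLES = 62_242_756_524_025         # cpu-cycles
INSTRUCTIONS = 41_309_080_431_827   # instructions


def busy_cores(utime, stime, elapsed):
    """Average number of cores kept busy over the whole run."""
    return (utime + stime) / elapsed


def ipc(instructions, cycles):
    """Instructions retired per CPU cycle."""
    return instructions / cycles


cores_busy = busy_cores(UTIME, STIME, ELAPSED)
on_cpu = cores_busy / CORES
print(f"busy cores {cores_busy:.2f}, on_cpu {on_cpu:.3f}, "
      f"IPC {ipc(INSTRUCTIONS, CYCLES):.2f}")
```

With the values above this gives 13.64 busy cores, an `on_cpu` of 0.852 and an IPC of 0.66, matching the listing.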

Backend CPU stalls of 58% are almost as high as minibude (64%) and well above the mean; both show up as outliers in the distribution.
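
The 58% figure is the `backend cpu` slot count divided by total slots. The sketch below recomputes the AMD topdown shares from the listing. It assumes the percentages in parentheses are shares of the slots left after removing `smt-contention`, which is an inference from the numbers rather than documented behaviour; the names are illustrative.

```python
# Slot counts copied from the AMD topdown listing above.
SLOTS = 124_479_064_490_070
COUNTS = {
    "retiring": 14_028_040_556_278,
    "frontend": 903_299_780_714,
    "backend": 83_506_378_748_576,
    "backend cpu": 72_430_204_392_186,
    "backend memory": 11_076_174_356_390,
    "speculation": 23_559_215_421,
    "smt-contention": 26_017_722_604_434,
}


def percent(count, total):
    return 100.0 * count / total


def topdown_shares(counts, slots):
    """Return {name: (share of all slots, share of non-SMT slots)}."""
    # Assumption: the parenthesised column excludes smt-contention slots.
    non_smt = slots - counts["smt-contention"]
    return {
        name: (percent(count, slots), percent(count, non_smt))
        for name, count in counts.items()
    }


for name, (of_all, of_non_smt) in topdown_shares(COUNTS, SLOTS).items():
    print(f"{name:16s} {of_all:5.1f}% ({of_non_smt:5.1f}%)")
```

Under that assumption the backend row comes out as 67.1% (84.8%) and `backend cpu` as 58.2% of all slots, consistent with the listing.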

Intel metrics show most memory stalls are L1-bound (20.3%), with only 2.4% DRAM-bound.

elapsed              1784.190
on_cpu               0.919          # 14.71 / 16 cores
utime                26221.932
stime                21.883
nvcsw                228045         # 48.75%
nivcsw               239714         # 51.25%
inblock              18752          # 10.51/sec
onblock              1800           # 1.01/sec
cpu-clock            26244075480642 # 26244.075 seconds
task-clock           26244333593954 # 26244.334 seconds
page faults          8712850        # 331.990/sec
context switches     476473         # 18.155/sec
cpu migrations       30947          # 1.179/sec
major page faults    161            # 0.006/sec
minor page faults    8712689        # 331.984/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             1321680383971  # 12.574 branches per 1000 inst
branch misses        2731388696     # 0.21% branch miss
conditional          1321680401059  # 12.574 conditional branches per 1000 inst
indirect             358569107192   # 3.411 indirect branches per 1000 inst
slots                98943282613106 #
retiring             49432693869376 # 50.0% (50.0%)
-- ucode             563001550256   #     0.6%
-- fastpath          48869692319120 #    49.4%
frontend             22964816174943 # 23.2% (23.2%)
-- latency           22183147237452 #    22.4%
-- bandwidth         781668937491   #     0.8%
backend              25631568005899 # 25.9% (25.9%)
-- cpu               13970973300638 #    14.1%
-- memory            11660594705261 #    11.8%
speculation          398396612737   #  0.4% ( 0.4%) low
-- branch mispredict 281168117116   #     0.3%
-- pipeline restart  117228495621   #     0.1%
smt-contention       0              #  0.0% ( 0.0%)
cpu-cycles           64601012956599 # 2.60 GHz
instructions         103400515861605 # 1.60 IPC
l2 access            803899147867   # 15.693 l2 access per 1000 inst
l2 miss              186431498354   # 23.19% l2 miss
cpu-cycles           32018963882436 # 26.2% memory latency
load stalls          8267191766134  # 20.3% l1 bound
l1 miss              1767537943547  #  1.8% l2 bound
l2 miss              1198804839203  #  1.4% l3 bound
l3 miss              759185998010   #  2.4% dram bound
store_stalls         128410516875   #  0.4% store bound
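
The memory-bound percentages in the Intel listing above appear to follow from differences between successive stall counters, each divided by the cycle count of the same counter group. The sketch below checks that reading. The formula is inferred from the numbers and is an assumption, not taken from the tool's source, and the names are illustrative.

```python
# Counters copied from the Intel memory-latency group above.
CYCLES = 32_018_963_882_436
LOAD_STALLS = 8_267_191_766_134
L1_MISS = 1_767_537_943_547
L2_MISS = 1_198_804_839_203
L3_MISS = 759_185_998_010
STORE_STALLS = 128_410_516_875


def memory_breakdown(cycles, load_stalls, l1_miss, l2_miss, l3_miss,
                     store_stalls):
    """Split stalled cycles by the cache level that bounds them."""
    def percent(count):
        return 100.0 * count / cycles

    # Assumption: each level's share is the stalls seen at that level
    # minus the stalls that also missed the next level down.
    return {
        "l1 bound": percent(load_stalls - l1_miss),
        "l2 bound": percent(l1_miss - l2_miss),
        "l3 bound": percent(l2_miss - l3_miss),
        "dram bound": percent(l3_miss),
        "store bound": percent(store_stalls),
    }


breakdown = memory_breakdown(CYCLES, LOAD_STALLS, L1_MISS, L2_MISS,
                             L3_MISS, STORE_STALLS)
for name, value in breakdown.items():
    print(f"{name:12s} {value:4.1f}%")
print(f"{'total':12s} {sum(breakdown.values()):4.1f}%")
```

Under this reading the parts come to 20.3%, 1.8%, 1.4%, 2.4% and 0.4%, and their sum matches the 26.2% memory latency figure.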

The process summary shows that time is spent almost entirely in the benchmark application (oidnBenchmark).

592 processes
	200 oidnBenchmark        236248.38   341.85
	 68 clinfo                  19.50     6.32
	 38 vulkaninfo               1.34     1.35
	  4 vulkani:disk$0           0.15     0.15
	  6 glxinfo:gdrv0            0.15     0.07
	  6 glxinfo:gl0              0.15     0.07
	  6 php                      0.13     0.28
	  2 llvmpipe-0               0.08     0.07
	  2 llvmpipe-1               0.08     0.07
	  2 llvmpipe-10              0.08     0.07
	  2 llvmpipe-11              0.08     0.07
	  2 llvmpipe-12              0.08     0.07
	  2 llvmpipe-13              0.08     0.07
	  2 llvmpipe-14              0.08     0.07
	  2 llvmpipe-15              0.08     0.07
	  2 llvmpipe-2               0.08     0.07
	  2 llvmpipe-3               0.08     0.07
	  2 llvmpipe-4               0.08     0.07
	  2 llvmpipe-5               0.08     0.07
	  2 llvmpipe-6               0.08     0.07
	  2 llvmpipe-7               0.08     0.07
	  2 llvmpipe-8               0.08     0.07
	  2 llvmpipe-9               0.08     0.07
	  2 glxinfo                  0.07     0.03
	  2 glxinfo:cs0              0.07     0.03
	  2 glxinfo:disk$0           0.07     0.03
	  2 glxinfo:sh0              0.07     0.03
	  2 glxinfo:shlo0            0.07     0.03
	  6 clang                    0.06     0.05
	  3 rocminfo                 0.03     0.03
	  1 lspci                    0.01     0.02
	 85 sh                       0.00     0.00
	 27 oidn                     0.00     0.00
	 12 gcc                      0.00     0.00
	 10 gsettings                0.00     0.00
	  8 stat                     0.00     0.00
	  8 systemd-detect-          0.00     0.00
	  6 llvm-link                0.00     0.00
	  5 phoronix-test-s          0.00     0.00
	  4 gmain                    0.00     0.00
	  2 dconf worker             0.00     0.00
	  2 lscpu                    0.00     0.00
	  2 uname                    0.00     0.00
	  2 which                    0.00     0.00
	  2 xset                     0.00     0.00
	  1 cc                       0.00     0.00
	  1 date                     0.00     0.00
	  1 dirname                  0.00     0.00
	  1 dmesg                    0.00     0.00
	  1 dmidecode                0.00     0.00
	  1 grep                     0.00     0.00
	  1 ifconfig                 0.00     0.00
	  1 ip                       0.00     0.00
	  1 lsmod                    0.00     0.00
	  1 mktemp                   0.00     0.00
	  1 ps                       0.00     0.00
	  1 qdbus                    0.00     0.00
	  1 readlink                 0.00     0.00
	  1 realpath                 0.00     0.00
	  1 sed                      0.00     0.00
	  1 sort                     0.00     0.00
	  1 stty                     0.00     0.00
	  1 systemctl                0.00     0.00
	  1 template.sh              0.00     0.00
	  1 wc                       0.00     0.00
	  1 xrandr                   0.00     0.00
16 processes running
63 maximum processes
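
To quantify how much of the summary belongs to one process name, the rows can be parsed and summed. The sketch below is a minimal parser, assuming each row is a count, a name that may contain spaces, and two numeric columns. The listing does not label those two columns, so the code calls them `first` and `second`; all names here are illustrative.

```python
# A few rows copied from the process summary above.
SAMPLE = """\
592 processes
    200 oidnBenchmark        236248.38   341.85
     68 clinfo                  19.50     6.32
     38 vulkaninfo               1.34     1.35
      2 dconf worker             0.00     0.00
16 processes running
"""


def parse_summary(text):
    """Return (count, name, first, second) tuples for each data row."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue  # header and footer lines have fewer fields
        try:
            count = int(parts[0])
            first, second = float(parts[-2]), float(parts[-1])
        except ValueError:
            continue  # not a data row
        rows.append((count, " ".join(parts[1:-2]), first, second))
    return rows


def share_of(rows, name):
    """Percentage of the first numeric column attributed to `name`."""
    total = sum(first for _, _, first, _ in rows)
    if total == 0:
        return 0.0
    matched = sum(first for _, n, first, _ in rows if n == name)
    return 100.0 * matched / total


rows = parse_summary(SAMPLE)
print(f"oidnBenchmark share: {share_of(rows, 'oidnBenchmark'):.2f}%")
```

Taking the last two fields as numbers and joining everything between them and the count keeps names such as `dconf worker` intact.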

Computation blocks show a similar pattern.

      8061) oidn             cpu=1 start=89.65 finish=169.60
        8062) oidnBenchmark    cpu=10 start=89.65 finish=169.58
          8065) oidnBenchmark    cpu=12 start=89.68 finish=169.58
          8066) oidnBenchmark    cpu=5 start=89.68 finish=89.68
          8067) oidnBenchmark    cpu=15 start=90.14 finish=169.58
            8068) oidnBenchmark    cpu=5 start=90.14 finish=169.58
              8070) oidnBenchmark    cpu=3 start=90.14 finish=169.58
                8075) ?? cpu=0 start=90.14 finish=0.00 
                  8078) ?? cpu=0 start=90.14 finish=0.00 
                8077) oidnBenchmark    cpu=14 start=90.14 finish=169.58
              8074) oidnBenchmark    cpu=11 start=90.14 finish=169.58
            8072) oidnBenchmark    cpu=1 start=90.14 finish=169.58
              8076) oidnBenchmark    cpu=13 start=90.14 finish=169.58
              8079) oidnBenchmark    cpu=8 start=90.14 finish=169.58
          8069) oidnBenchmark    cpu=10 start=90.14 finish=169.58
            8071) oidnBenchmark    cpu=9 start=90.14 finish=169.58
              8080) oidnBenchmark    cpu=6 start=90.14 finish=169.58
              8081) oidnBenchmark    cpu=7 start=90.14 finish=169.58
            8073) oidnBenchmark    cpu=0 start=90.14 finish=169.58
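
The tree above can be turned into per-task lifetimes for further analysis. The sketch below parses the `pid) name cpu=N start=S finish=F` lines, assuming two spaces of indentation per tree level and treating `finish=0.00` as "no exit recorded". Both are assumptions drawn from the listing, and the class and function names are illustrative.

```python
import re
from dataclasses import dataclass
from typing import Optional

# A few lines copied from the computation-block listing above.
SAMPLE = """\
      8061) oidn             cpu=1 start=89.65 finish=169.60
        8062) oidnBenchmark    cpu=10 start=89.65 finish=169.58
          8066) oidnBenchmark    cpu=5 start=89.68 finish=89.68
          8067) oidnBenchmark    cpu=15 start=90.14 finish=169.58
                8075) ?? cpu=0 start=90.14 finish=0.00
"""

LINE = re.compile(
    r"^(\s*)(\d+)\)\s+(\S+)\s+cpu=(\d+)\s+start=([\d.]+)\s+finish=([\d.]+)"
)


@dataclass
class Block:
    pid: int
    name: str
    cpu: int
    depth: int
    start: float
    finish: float

    @property
    def lifetime(self) -> Optional[float]:
        # Assumption: finish=0.00 means no exit was recorded.
        if self.finish == 0.0:
            return None
        return self.finish - self.start


def parse_blocks(text):
    matches = [m for m in map(LINE.match, text.splitlines()) if m]
    if not matches:
        return []
    base = min(len(m.group(1)) for m in matches)
    return [
        Block(int(m.group(2)), m.group(3), int(m.group(4)),
              (len(m.group(1)) - base) // 2,  # two spaces per level
              float(m.group(5)), float(m.group(6)))
        for m in matches
    ]


blocks = parse_blocks(SAMPLE)
for b in blocks:
    life = "n/a" if b.lifetime is None else f"{b.lifetime:.2f}s"
    print(f"{'  ' * b.depth}{b.pid} {b.name} cpu={b.cpu} lifetime={life}")
```

On the sample, task 8062 lives for about 79.93 seconds, while 8066 starts and finishes at the same timestamp.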