A small C++ program for illumination rendering. It appears to be multi-threaded and runs quickly.

The topdown profile shows backend stalls as the largest issue, alongside a moderate retirement rate.

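The AMD metrics show that this is floating-point code with a small amount of L2 access (0.574 accesses per 1000 instructions). Backend stalls are mostly CPU stalls (27.4% of slots) rather than memory stalls (4.0%).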

elapsed              45.090
on_cpu               0.654          # 10.47 / 16 cores
utime                471.119
stime                0.870
nvcsw                1678           # 25.47%
nivcsw               4911           # 74.53%
inblock              0              # 0.00/sec
onblock              62944          # 1395.95/sec
cpu-clock            472009776505   # 472.010 seconds
task-clock           472014224464   # 472.014 seconds
page faults          163926         # 347.290/sec
context switches     6635           # 14.057/sec
cpu migrations       212            # 0.449/sec
major page faults    12             # 0.025/sec
minor page faults    163914         # 347.265/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             290450354693   # 101.312 branches per 1000 inst
branch misses        6280963989     # 2.16% branch miss
conditional          224571701814   # 78.333 conditional branches per 1000 inst
indirect             8558837143     # 2.985 indirect branches per 1000 inst
cpu-cycles           1882296852708  # 2.61 GHz
instructions         2865046175872  # 1.52 IPC
slots                3766723443972  #
retiring             1031193614822  # 27.4% (41.8%)
-- ucode             740624499      #     0.0%
-- fastpath          1030452990323  #    27.4%
frontend             128086560200   #  3.4% ( 5.2%)
-- latency           89260572702    #     2.4%
-- bandwidth         38825987498    #     1.0%
backend              1182909332439  # 31.4% (47.9%)
-- cpu               1030784053449  #    27.4%
-- memory            152125278990   #     4.0%
speculation          124961204508   #  3.3% ( 5.1%)
-- branch mispredict 122654034594   #     3.3%
-- pipeline restart  2307169914     #     0.1%
smt-contention       1299568541868  # 34.5% ( 0.0%)
cpu-cycles           1879246306828  # 2.61 GHz
instructions         2869706329426  # 1.53 IPC
instructions         954570926607   # 0.574 l2 access per 1000 inst
l2 hit from l1       383137104      # 4.29% l2 miss
l2 miss from l1      12700086       #
l2 hit from l2 pf    154085933      #
l3 hit from l2 pf    5453940        #
l3 miss from l2 pf   5369977        #
instructions         955485420525   # 391.055 float per 1000 inst
float 512            77             # 0.000 AVX-512 per 1000 inst
float 256            586            # 0.000 AVX-256 per 1000 inst
float 128            373647513792   # 391.055 AVX-128 per 1000 inst
float MMX            0              # 0.000 MMX per 1000 inst
float scalar         0              # 0.000 scalar per 1000 inst
instructions         2866802424129  #
opcache              402332512660   # 140.342 opcache per 1000 inst
opcache miss         1292555935     #  0.3% opcache miss rate
l1 dTLB miss         31443107       # 0.011 L1 dTLB per 1000 inst
l2 dTLB miss         5381707        # 0.002 L2 dTLB per 1000 inst
instructions         2866800052557  #
icache               2307657177     # 0.805 icache per 1000 inst
icache miss          286395407      # 12.4% icache miss rate
l1 iTLB miss         8608125        # 0.003 L1 iTLB per 1000 inst
l2 iTLB miss         0              # 0.000 L2 iTLB per 1000 inst
tlb flush            16974          # 0.000 TLB flush per 1000 inst
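
The derived columns in the listing above can be cross-checked from the raw counters. The following is a minimal sketch, assuming that on_cpu is (utime + stime) / elapsed divided by the core count, that IPC is instructions / cpu-cycles, and that each topdown percentage is the category's slots over total slots, with the parenthesized figure taken over the slots not lost to smt-contention. These formulas are inferred from the numbers rather than taken from the tool's source, and the function names are hypothetical.

```python
# Raw values copied from the AMD listing above.
ELAPSED, UTIME, STIME = 45.090, 471.119, 0.870
NCORES = 16
CYCLES = 1882296852708
INSTRUCTIONS = 2865046175872
SLOTS = 3766723443972
SMT_CONTENTION = 1299568541868
CATEGORIES = {
    "retiring": 1031193614822,
    "frontend": 128086560200,
    "backend": 1182909332439,
    "speculation": 124961204508,
}


def cores_busy(utime, stime, elapsed):
    """Average number of cores kept busy over the run (assumed formula)."""
    return (utime + stime) / elapsed


def topdown_shares(categories, slots, smt_contention):
    """Map each category to (% of all slots, % of slots not lost to SMT)."""
    usable = slots - smt_contention
    return {
        name: (100.0 * count / slots, 100.0 * count / usable)
        for name, count in categories.items()
    }


busy = cores_busy(UTIME, STIME, ELAPSED)
ipc = INSTRUCTIONS / CYCLES
shares = topdown_shares(CATEGORIES, SLOTS, SMT_CONTENTION)

print(f"on_cpu   {busy / NCORES:.3f}  # {busy:.2f} / {NCORES} cores")
print(f"IPC      {ipc:.2f}")
for name, (of_all, of_usable) in shares.items():
    print(f"{name:<12} {of_all:5.1f}% ({of_usable:5.1f}%)")
```

With the values above, this reproduces the listing's 10.47 busy cores, 1.52 IPC, and the 27.4% (41.8%) retiring and 31.4% (47.9%) backend figures.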

The Intel metrics confirm the low L2 access rate and show a higher level of branch misprediction (14.3% of slots lost to speculation, versus 3.3% on AMD).

elapsed              46.525
on_cpu               0.673          # 10.77 / 16 cores
utime                500.506
stime                0.384
nvcsw                1211           # 19.40%
nivcsw               5030           # 80.60%
inblock              4656           # 100.08/sec
onblock              51592          # 1108.91/sec
cpu-clock            500904808159   # 500.905 seconds
task-clock           500907949143   # 500.908 seconds
page faults          99508          # 198.655/sec
context switches     6290           # 12.557/sec
cpu migrations       222            # 0.443/sec
major page faults    58             # 0.116/sec
minor page faults    99450          # 198.539/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             289123524356   # 101.078 branches per 1000 inst
branch misses        6697550758     # 2.32% branch miss
conditional          289123534756   # 101.078 conditional branches per 1000 inst
indirect             53838970043    # 18.822 indirect branches per 1000 inst
slots                2855244531524  #
retiring             1627251960026  # 57.0% (57.0%) high
-- ucode             35202779377    #     1.2%
-- fastpath          1592049180649  #    55.8%
frontend             521415453265   # 18.3% (18.3%)
-- latency           452095257232   #    15.8%
-- bandwidth         69320196033    #     2.4%
backend              297810108968   # 10.4% (10.4%) low
-- cpu               214954934707   #     7.5%
-- memory            82855174261    #     2.9%
speculation          409638370395   # 14.3% (14.3%) high
-- branch mispredict 409107060915   #    14.3%
-- pipeline restart  531309480      #     0.0%
smt-contention       0              #  0.0% ( 0.0%)
cpu-cycles           3296409027952  # 2.19 GHz
instructions         5741847939829  # 1.74 IPC
l2 access            306096009      # 0.094 l2 access per 1000 inst
l2 miss              67249322       # 21.97% l2 miss
cpu-cycles           1860051893950  # 16.4% memory latency
load stalls          304362400662   # 16.3% l1 bound
l1 miss              798391811      #  0.0% l2 bound
l2 miss              362458552      #  0.0% l3 bound
l3 miss              166527645      #  0.0% dram bound
store_stalls         143931493      #  0.0% store bound
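
The memory-bound breakdown at the end of the Intel listing can be reproduced from its counters. This is a minimal sketch, assuming that "load stalls" and each "miss" line count stall cycles at that level of the hierarchy, so that each level's share is the difference between adjacent counters divided by cpu-cycles. That interpretation is inferred from the numbers, not confirmed by the tool's source, and the function name is hypothetical.

```python
# Raw values copied from the last group of the Intel listing above.
CYCLES = 1860051893950
LOAD_STALLS = 304362400662
L1_MISS = 798391811
L2_MISS = 362458552
L3_MISS = 166527645
STORE_STALLS = 143931493

# Separate counter group: L2 accesses and misses.
L2_ACCESS = 306096009
L2_ACCESS_MISS = 67249322


def memory_breakdown(cycles, load_stalls, l1_miss, l2_miss, l3_miss, store_stalls):
    """Return each level's share of cycles, in percent (assumed formulas)."""
    def pct(count):
        return 100.0 * count / cycles

    return {
        "memory latency": pct(load_stalls),
        "l1 bound": pct(load_stalls - l1_miss),
        "l2 bound": pct(l1_miss - l2_miss),
        "l3 bound": pct(l2_miss - l3_miss),
        "dram bound": pct(l3_miss),
        "store bound": pct(store_stalls),
    }


breakdown = memory_breakdown(
    CYCLES, LOAD_STALLS, L1_MISS, L2_MISS, L3_MISS, STORE_STALLS
)
l2_miss_rate = 100.0 * L2_ACCESS_MISS / L2_ACCESS

for name, value in breakdown.items():
    print(f"{name:<15} {value:5.1f}%")
print(f"{'l2 miss':<15} {l2_miss_rate:5.2f}%")
```

Under these assumptions the sketch matches the listing: 16.4% memory latency, 16.3% l1 bound, 0.0% for the remaining levels, and a 21.97% L2 miss rate. Nearly all of the load stall time is satisfied from L1, which is consistent with the low L2 access rate.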

The process profile shows that smallpt-rendere is the primary process.

384 processes
	 48 smallpt-rendere       7485.92     2.08
	 68 clinfo                  15.87     6.24
	 38 vulkaninfo               1.15     1.14
	  4 vulkani:disk$0           0.12     0.12
	  6 php                      0.06     0.06
	  2 llvmpipe-0               0.06     0.06
	  2 llvmpipe-1               0.06     0.06
	  2 llvmpipe-10              0.06     0.06
	  2 llvmpipe-11              0.06     0.06
	  2 llvmpipe-12              0.06     0.06
	  2 llvmpipe-13              0.06     0.06
	  2 llvmpipe-14              0.06     0.06
	  2 llvmpipe-15              0.06     0.06
	  2 llvmpipe-2               0.06     0.06
	  2 llvmpipe-3               0.06     0.06
	  2 llvmpipe-4               0.06     0.06
	  2 llvmpipe-5               0.06     0.06
	  2 llvmpipe-6               0.06     0.06
	  2 llvmpipe-7               0.06     0.06
	  2 llvmpipe-8               0.06     0.06
	  2 llvmpipe-9               0.06     0.06
	  6 clang                    0.05     0.07
	  3 rocminfo                 0.00     0.03
	  1 lspci                    0.00     0.02
	 84 sh                       0.00     0.00
	 13 gcc                      0.00     0.00
	 11 gsettings                0.00     0.00
	  8 stat                     0.00     0.00
	  8 systemd-detect-          0.00     0.00
	  6 llvm-link                0.00     0.00
	  5 glxinfo                  0.00     0.00
	  5 phoronix-test-s          0.00     0.00
	  3 gmain                    0.00     0.00
	  3 smallpt                  0.00     0.00
	  2 cc                       0.00     0.00
	  2 dconf worker             0.00     0.00
	  2 grep                     0.00     0.00
	  2 lscpu                    0.00     0.00
	  2 setterm                  0.00     0.00
	  2 uname                    0.00     0.00
	  2 which                    0.00     0.00
	  1 date                     0.00     0.00
	  1 dirname                  0.00     0.00
	  1 dmesg                    0.00     0.00
	  1 dmidecode                0.00     0.00
	  1 ifconfig                 0.00     0.00
	  1 ip                       0.00     0.00
	  1 lsmod                    0.00     0.00
	  1 mktemp                   0.00     0.00
	  1 ps                       0.00     0.00
	  1 qdbus                    0.00     0.00
	  1 readlink                 0.00     0.00
	  1 realpath                 0.00     0.00
	  1 sed                      0.00     0.00
	  1 sort                     0.00     0.00
	  1 stty                     0.00     0.00
	  1 systemctl                0.00     0.00
	  1 template.sh              0.00     0.00
	  1 wc                       0.00     0.00
0 processes running
47 maximum processes

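The process tree shows the following pattern for each core computation block: a smallpt parent and 16 smallpt-rendere tasks, one per core, that start together and finish at the same time.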

      230105) smallpt          cpu=1 start=5.51  finish=15.35
        230106) smallpt-rendere  cpu=11 start=5.51  finish=15.35
          230107) smallpt-rendere  cpu=6 start=5.52  finish=15.35
          230108) smallpt-rendere  cpu=12 start=5.52  finish=15.35
          230109) smallpt-rendere  cpu=8 start=5.52  finish=15.35
          230110) smallpt-rendere  cpu=15 start=5.52  finish=15.35
          230111) smallpt-rendere  cpu=5 start=5.52  finish=15.35
          230112) smallpt-rendere  cpu=2 start=5.52  finish=15.35
          230113) smallpt-rendere  cpu=9 start=5.52  finish=15.35
          230114) smallpt-rendere  cpu=3 start=5.52  finish=15.35
          230115) smallpt-rendere  cpu=13 start=5.52  finish=15.35
          230116) smallpt-rendere  cpu=4 start=5.52  finish=15.35
          230117) smallpt-rendere  cpu=14 start=5.52  finish=15.35
          230118) smallpt-rendere  cpu=7 start=5.52  finish=15.35
          230119) smallpt-rendere  cpu=10 start=5.53  finish=15.35
          230120) smallpt-rendere  cpu=0 start=5.53  finish=15.35
          230121) smallpt-rendere  cpu=1 start=5.53  finish=15.35
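
The tree above can be summarized mechanically. The following is a minimal sketch that parses lines in this format and reports each task's duration and the set of CPUs used. The regular expression and the `parse_tree` helper are hypothetical, written against the lines shown here, and assume command names without embedded spaces. Only a few of the lines are repeated to keep the example short.

```python
import re

# A subset of the process tree lines shown above.
TREE = """\
      230105) smallpt          cpu=1 start=5.51  finish=15.35
        230106) smallpt-rendere  cpu=11 start=5.51  finish=15.35
          230107) smallpt-rendere  cpu=6 start=5.52  finish=15.35
          230121) smallpt-rendere  cpu=1 start=5.53  finish=15.35
"""

LINE_RE = re.compile(
    r"(?P<pid>\d+)\)\s+(?P<comm>\S+)\s+cpu=(?P<cpu>\d+)"
    r"\s+start=(?P<start>[\d.]+)\s+finish=(?P<finish>[\d.]+)"
)


def parse_tree(text):
    """Return one dict per recognized line; unrecognized lines are skipped."""
    tasks = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue
        start = float(match["start"])
        finish = float(match["finish"])
        tasks.append({
            "pid": int(match["pid"]),
            "comm": match["comm"],
            "cpu": int(match["cpu"]),
            "duration": round(finish - start, 2),
        })
    return tasks


tasks = parse_tree(TREE)
renderers = [t for t in tasks if t["comm"] == "smallpt-rendere"]
cpus = sorted({t["cpu"] for t in renderers})

for task in tasks:
    print(task["pid"], task["comm"], task["cpu"], task["duration"])
print("renderer cpus:", cpus)
```

Run against the full tree, the same approach would show every smallpt-rendere task lasting a little under 10 seconds, with the 16 tasks spread across CPUs 0 through 15.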