Google's libwebp2 library, used here for image encoding. There are five workloads with differing run times; the last one appears to take the majority of the time.

The topdown profile shows a high overall retirement rate.

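AMD metrics confirm the high retirement rate (37.0% of all slots, 62.1% in the parenthesized column). There is some floating-point code (121.102 per 1000 instructions, all of it 128-bit) and a low rate of L2 accesses (7.474 per 1000 instructions).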

elapsed              6454.668
on_cpu               0.938          # 15.01 / 16 cores
utime                96826.664
stime                45.413
nvcsw                494978         # 29.55%
nivcsw               1179978        # 70.45%
inblock              8              # 0.00/sec
onblock              190864         # 29.57/sec
cpu-clock            96875999087966 # 96875.999 seconds
task-clock           96876761793744 # 96876.762 seconds
page faults          10494175       # 108.325/sec
context switches     1706972        # 17.620/sec
cpu migrations       160719         # 1.659/sec
major page faults    3              # 0.000/sec
minor page faults    10494172       # 108.325/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             126898490547422 # 158.007 branches per 1000 inst
branch misses        1318886240479  # 1.04% branch miss
conditional          77448549388599 # 96.435 conditional branches per 1000 inst
indirect             13673282628499 # 17.025 indirect branches per 1000 inst
cpu-cycles           384962570882363 # 3.72 GHz
instructions         803239029833114 # 2.09 IPC
slots                769854586083126 #
retiring             284609100009382 # 37.0% (62.1%) high
-- ucode             4043501393113  #     0.5%
-- fastpath          280565598616269 #    36.4%
frontend             78261693919570 # 10.2% (17.1%)
-- latency           42905861803014 #     5.6%
-- bandwidth         35355832116556 #     4.6%
backend              80176935537309 # 10.4% (17.5%) low
-- cpu               49494451462269 #     6.4%
-- memory            30682484075040 #     4.0%
speculation          15095623143789 #  2.0% ( 3.3%)
-- branch mispredict 15003422604144 #     1.9%
-- pipeline restart  92200539645    #     0.0%
smt-contention       311710573425872 # 40.5% ( 0.0%)
cpu-cycles           384784177103671 # 3.71 GHz
instructions         803248721506170 # 2.09 IPC
instructions         267743308628030 # 7.474 l2 access per 1000 inst
l2 hit from l1       1602663711417  # 7.55% l2 miss
l2 miss from l1      35570585376    #
l2 hit from l2 pf    282998514358   #
l3 hit from l2 pf    43222627984    #
l3 miss from l2 pf   72300954073    #
instructions         267647173786960 # 121.102 float per 1000 inst
float 512            79             # 0.000 AVX-512 per 1000 inst
float 256            616            # 0.000 AVX-256 per 1000 inst
float 128            32412525008255 # 121.102 AVX-128 per 1000 inst
float MMX            0              # 0.000 MMX per 1000 inst
float scalar         0              # 0.000 scalar per 1000 inst
instructions         803109168877818 #
opcache              127066718164852 # 158.218 opcache per 1000 inst
opcache miss         3493204387362  #  2.7% opcache miss rate
l1 dTLB miss         665818661116   # 0.829 L1 dTLB per 1000 inst
l2 dTLB miss         6124862374     # 0.008 L2 dTLB per 1000 inst
instructions         803111520402727 #
icache               4365248868945  # 5.435 icache per 1000 inst
icache miss          710717117556   # 16.3% icache miss rate
l1 iTLB miss         95935516176    # 0.119 L1 iTLB per 1000 inst
l2 iTLB miss         0              # 0.000 L2 iTLB per 1000 inst
tlb flush            142328         # 0.000 TLB flush per 1000 inst
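
As a cross-check on the AMD listing above, the following minimal Python sketch recomputes a few of the derived values from the raw counters. The formulas are assumptions inferred from the printed numbers, not taken from the profiling tool's source: on_cpu is taken as (utime + stime) / (elapsed * cores), and the parenthesized topdown percentages are taken as shares of the slots that remain after smt-contention is subtracted. The function names are illustrative only.

```python
# Raw counters copied from the AMD listing above.
ELAPSED = 6454.668            # wall-clock seconds
UTIME, STIME = 96826.664, 45.413
CORES = 16                    # from the "15.01 / 16 cores" comment

CYCLES = 384962570882363
INSTRUCTIONS = 803239029833114

SLOTS = 769854586083126
SMT_CONTENTION = 311710573425872
TOPDOWN = {
    "retiring": 284609100009382,
    "frontend": 78261693919570,
    "backend": 80176935537309,
    "speculation": 15095623143789,
}


def on_cpu(utime, stime, elapsed, cores):
    """Fraction of available core time spent running (assumed formula)."""
    return (utime + stime) / (elapsed * cores)


def topdown_shares(counts, slots, excluded=0):
    """Percentage of (slots - excluded) attributed to each category."""
    denom = slots - excluded
    return {name: 100.0 * count / denom for name, count in counts.items()}


utilization = on_cpu(UTIME, STIME, ELAPSED, CORES)
ipc = INSTRUCTIONS / CYCLES
raw_shares = topdown_shares(TOPDOWN, SLOTS)
no_smt_shares = topdown_shares(TOPDOWN, SLOTS, excluded=SMT_CONTENTION)

print(f"on_cpu {utilization:.3f} ({utilization * CORES:.2f} / {CORES} cores)")
print(f"IPC    {ipc:.2f}")
for name in TOPDOWN:
    print(f"{name:12s} {raw_shares[name]:5.1f}% ({no_smt_shares[name]:5.1f}%)")
```

Under these assumptions the recomputed values agree with the listing to the printed precision, which suggests the parenthesized column is the topdown breakdown with SMT contention factored out.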

Intel metrics

elapsed              8161.727
on_cpu               0.937          # 15.00 / 16 cores
utime                122359.757
stime                47.232
nvcsw                1448562        # 50.39%
nivcsw               1426173        # 49.61%
inblock              22528          # 2.76/sec
onblock              179512         # 21.99/sec
cpu-clock            122409623127842 # 122409.623 seconds
task-clock           122410292988041 # 122410.293 seconds
page faults          10143595       # 82.866/sec
context switches     2915091        # 23.814/sec
cpu migrations       124435         # 1.017/sec
major page faults    82             # 0.001/sec
minor page faults    10143513       # 82.865/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             127006812912740 # 158.044 branches per 1000 inst
branch misses        1410065467341  # 1.11% branch miss
conditional          127006812933188 # 158.044 conditional branches per 1000 inst
indirect             50858947018231 # 63.288 indirect branches per 1000 inst
slots                557076544137674 #
retiring             400021032719744 # 71.8% (71.8%) high
-- ucode             43305567539593 #     7.8%
-- fastpath          356715465180151 #    64.0%
frontend             80086580625828 # 14.4% (14.4%)
-- latency           33826533132000 #     6.1%
-- bandwidth         46260047493828 #     8.3%
backend              27264603206660 #  4.9% ( 4.9%) low
-- cpu               19162921555765 #     3.4%
-- memory            8101681650895  #     1.5%
speculation          59508409749058 # 10.7% (10.7%) high
-- branch mispredict 59089662975152 #    10.6%
-- pipeline restart  418746773906   #     0.1%
smt-contention       0              #  0.0% ( 0.0%)
cpu-cycles           346572446103804 # 2.67 GHz
instructions         760195516210084 # 2.19 IPC
l2 access            2119361444064  # 5.371 l2 access per 1000 inst
l2 miss              440106375672   # 20.77% l2 miss
cpu-cycles           179557431925404 # 11.4% memory latency
load stalls          20166657919693 #  8.7% l1 bound
l1 miss              4621141485201  #  1.9% l2 bound
l2 miss              1141520305635  #  0.2% l3 bound
l3 miss              849908241729   #  0.5% dram bound
store_stalls         259645621514   #  0.1% store bound
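
The metric listings share a simple "name, value, optional # comment" layout, so they can be read back mechanically. The sketch below parses an excerpt of the Intel listing above and recomputes a few ratios from the raw counters. The regular expression, the parse_metrics helper, and the on_cpu formula are assumptions for illustration and are not part of the profiling tool; the "per 1000 inst" comments are deliberately left alone because they may use a different instruction count than the one shown here.

```python
import re

# Excerpt of the Intel listing above (values copied verbatim).
EXCERPT = """\
elapsed              8161.727
utime                122359.757
stime                47.232
branches             127006812912740 # 158.044 branches per 1000 inst
branch misses        1410065467341  # 1.11% branch miss
cpu-cycles           346572446103804 # 2.67 GHz
instructions         760195516210084 # 2.19 IPC
l2 access            2119361444064  # 5.371 l2 access per 1000 inst
l2 miss              440106375672   # 20.77% l2 miss
"""

# name (may contain spaces), whitespace, numeric value, optional "# comment"
LINE = re.compile(r"^(?P<name>.+?)\s+(?P<value>\d+(?:\.\d+)?)\s*(?:#.*)?$")


def parse_metrics(text):
    """Return {name: value}; the first occurrence of a repeated name wins."""
    metrics = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            raw = match.group("value")
            value = float(raw) if "." in raw else int(raw)
            metrics.setdefault(match.group("name"), value)
    return metrics


CORES = 16  # from the "15.00 / 16 cores" comment
m = parse_metrics(EXCERPT)

on_cpu = (m["utime"] + m["stime"]) / (m["elapsed"] * CORES)
ipc = m["instructions"] / m["cpu-cycles"]
branch_miss_pct = 100.0 * m["branch misses"] / m["branches"]
l2_miss_pct = 100.0 * m["l2 miss"] / m["l2 access"]

print(f"on_cpu      {on_cpu:.3f}")
print(f"IPC         {ipc:.2f}")
print(f"branch miss {branch_miss_pct:.2f}%")
print(f"l2 miss     {l2_miss_pct:.2f}%")
```

Names such as cpu-cycles and l2 miss appear more than once in the full listing, so a dictionary keyed by name keeps only one of them; a list of (name, value) pairs would be the safer choice when parsing the whole output.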

The process overview shows that the time is spent almost entirely in cwp2.

512 processes
	234 cwp2                 896093.32   511.22
	 34 clinfo                  10.01     3.00
	 19 vulkaninfo               0.74     0.76
	  2 vulkani:disk$0           0.07     0.08
	  6 clang                    0.05     0.07
	  1 llvmpipe-0               0.04     0.04
	  1 llvmpipe-1               0.04     0.04
	  1 llvmpipe-10              0.04     0.04
	  1 llvmpipe-11              0.04     0.04
	  1 llvmpipe-12              0.04     0.04
	  1 llvmpipe-13              0.04     0.04
	  1 llvmpipe-14              0.04     0.04
	  1 llvmpipe-15              0.04     0.04
	  1 llvmpipe-2               0.04     0.04
	  1 llvmpipe-3               0.04     0.04
	  1 llvmpipe-4               0.04     0.04
	  1 llvmpipe-5               0.04     0.04
	  1 llvmpipe-6               0.04     0.04
	  1 llvmpipe-7               0.04     0.04
	  1 llvmpipe-8               0.04     0.04
	  1 llvmpipe-9               0.04     0.04
	  1 ps                       0.00     0.01
	 68 sh                       0.00     0.00
	 13 gcc                      0.00     0.00
	 13 rm                       0.00     0.00
	 13 webp2                    0.00     0.00
	 11 gsettings                0.00     0.00
	  8 systemd-detect-          0.00     0.00
	  7 stat                     0.00     0.00
	  6 llvm-link                0.00     0.00
	  4 glxinfo                  0.00     0.00
	  4 phoronix-test-s          0.00     0.00
	  3 gmain                    0.00     0.00
	  2 grep                     0.00     0.00
	  2 which                    0.00     0.00
	  1 cc                       0.00     0.00
	  1 date                     0.00     0.00
	  1 dconf worker             0.00     0.00
	  1 dirname                  0.00     0.00
	  1 ifconfig                 0.00     0.00
	  1 ip                       0.00     0.00
	  1 lscpu                    0.00     0.00
	  1 mktemp                   0.00     0.00
	  1 qdbus                    0.00     0.00
	  1 readlink                 0.00     0.00
	  1 realpath                 0.00     0.00
	  1 sed                      0.00     0.00
	  1 setterm                  0.00     0.00
	  1 sort                     0.00     0.00
	  1 stty                     0.00     0.00
	  1 systemctl                0.00     0.00
	  1 template.sh              0.00     0.00
	  1 wc                       0.00     0.00
28 processes running
47 maximum processes
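
To put a number on how dominant cwp2 is, the sketch below parses the top rows of the process overview and computes the share of the first numeric column that belongs to cwp2. The listing does not label its columns, so the column is treated only as "the first numeric column"; the regular expression and variable names are assumptions for illustration, and only an excerpt of the rows is used.

```python
import re

# Top rows of the process overview above: count, name, two unlabeled columns.
EXCERPT = """\
    234 cwp2                 896093.32   511.22
     34 clinfo                  10.01     3.00
     19 vulkaninfo               0.74     0.76
      2 vulkani:disk$0           0.07     0.08
      6 clang                    0.05     0.07
      1 llvmpipe-0               0.04     0.04
"""

# count, name (may contain spaces), then two decimal columns
ROW = re.compile(r"^\s*(\d+)\s+(.+?)\s+(\d+\.\d+)\s+(\d+\.\d+)\s*$")

rows = []
for line in EXCERPT.splitlines():
    match = ROW.match(line)
    if match:
        count, name, first, second = match.groups()
        rows.append((int(count), name, float(first), float(second)))

total_first = sum(row[2] for row in rows)
cwp2_first = sum(row[2] for row in rows if row[1] == "cwp2")
cwp2_share = cwp2_first / total_first

print(f"cwp2 share of first numeric column: {100.0 * cwp2_share:.4f}%")
```

The remaining rows of the full listing contribute at most 0.04 each in that column, so including them would not change the conclusion.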

Computation blocks

      13497) webp2            cpu=0 start=13.20 finish=16.73
        13498) cwp2             cpu=9 start=13.20 finish=16.71
          13499) cwp2             cpu=8 start=13.70 finish=16.67
          13500) cwp2             cpu=15 start=13.70 finish=16.67
          13501) cwp2             cpu=0 start=13.70 finish=16.67
          13502) cwp2             cpu=7 start=13.70 finish=16.67
          13503) cwp2             cpu=11 start=13.70 finish=16.67
          13504) cwp2             cpu=15 start=13.70 finish=16.67
          13505) cwp2             cpu=7 start=13.70 finish=16.67
          13506) cwp2             cpu=13 start=13.70 finish=16.67
          13507) cwp2             cpu=10 start=13.70 finish=16.67
          13508) cwp2             cpu=4 start=13.70 finish=16.67
          13509) cwp2             cpu=12 start=13.70 finish=16.67
          13510) cwp2             cpu=6 start=13.70 finish=16.67
          13511) cwp2             cpu=1 start=13.70 finish=16.67
          13512) cwp2             cpu=3 start=13.70 finish=16.69
          13513) cwp2             cpu=2 start=13.70 finish=16.69
          13514) cwp2             cpu=11 start=13.70 finish=16.69
          13515) cwp2             cpu=14 start=13.70 finish=16.70
        13517) rm               cpu=2 start=16.73 finish=16.73
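
The listing above can be summarized mechanically. The sketch below parses it, derives each entry's nesting depth from its indentation, and reports how long the top-level webp2 entry ran and how many entries sit beneath the first cwp2. It assumes that start and finish are in seconds and that two spaces of indentation correspond to one nesting level; neither is stated in the listing, and the helper name is illustrative.

```python
import re

LISTING = """\
      13497) webp2            cpu=0 start=13.20 finish=16.73
        13498) cwp2             cpu=9 start=13.20 finish=16.71
          13499) cwp2             cpu=8 start=13.70 finish=16.67
          13500) cwp2             cpu=15 start=13.70 finish=16.67
          13501) cwp2             cpu=0 start=13.70 finish=16.67
          13502) cwp2             cpu=7 start=13.70 finish=16.67
          13503) cwp2             cpu=11 start=13.70 finish=16.67
          13504) cwp2             cpu=15 start=13.70 finish=16.67
          13505) cwp2             cpu=7 start=13.70 finish=16.67
          13506) cwp2             cpu=13 start=13.70 finish=16.67
          13507) cwp2             cpu=10 start=13.70 finish=16.67
          13508) cwp2             cpu=4 start=13.70 finish=16.67
          13509) cwp2             cpu=12 start=13.70 finish=16.67
          13510) cwp2             cpu=6 start=13.70 finish=16.67
          13511) cwp2             cpu=1 start=13.70 finish=16.67
          13512) cwp2             cpu=3 start=13.70 finish=16.69
          13513) cwp2             cpu=2 start=13.70 finish=16.69
          13514) cwp2             cpu=11 start=13.70 finish=16.69
          13515) cwp2             cpu=14 start=13.70 finish=16.70
        13517) rm               cpu=2 start=16.73 finish=16.73
"""

LINE = re.compile(
    r"^(?P<indent>[ \t]*)(?P<pid>\d+)\)\s+(?P<name>\S+)\s+"
    r"cpu=(?P<cpu>\d+)\s+start=(?P<start>[\d.]+)\s+finish=(?P<finish>[\d.]+)\s*$"
)


def parse_blocks(text):
    """Parse the listing into dicts with pid, name, cpu, start, finish, depth."""
    rows = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            rows.append({
                "indent": len(match.group("indent")),
                "pid": int(match.group("pid")),
                "name": match.group("name"),
                "cpu": int(match.group("cpu")),
                "start": float(match.group("start")),
                "finish": float(match.group("finish")),
            })
    base = min(row["indent"] for row in rows)
    for row in rows:
        row["depth"] = (row["indent"] - base) // 2  # assumed 2 spaces per level
    return rows


blocks = parse_blocks(LISTING)
root = blocks[0]
root_duration = root["finish"] - root["start"]
workers = [row for row in blocks if row["depth"] == 2]
cpus_used = sorted({row["cpu"] for row in workers})

print(f"{root['name']} ran for {root_duration:.2f}")
print(f"{len(workers)} nested cwp2 entries on {len(cpus_used)} distinct CPUs")
```

The cpu field here is a single value per entry, so the count of distinct CPUs reflects only what the listing records and not every core an entry may have run on.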