A JPEG image processing library, exercised with two quick-running workloads. The first runs on one thread; the second runs on “all” threads.

The topdown profile shows the single-threaded workload with a higher retirement rate and some backend stalls. The parallel workload shows the opposite pattern.

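AMD metrics show little floating point (about 47 operations per 1000 instructions, all 128-bit) and moderate L2 traffic (about 46 accesses per 1000 instructions, with an 11.34% L2 miss rate).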

elapsed              214.087
on_cpu               0.230          # 3.69 / 16 cores
utime                572.235
stime                217.040
nvcsw                46264          # 80.71%
nivcsw               11056          # 19.29%
inblock              88             # 0.41/sec
onblock              293008         # 1368.64/sec
cpu-clock            789175841055   # 789.176 seconds
task-clock           789261200257   # 789.261 seconds
page faults          78020185       # 98852.173/sec
context switches     58208          # 73.750/sec
cpu migrations       7653           # 9.696/sec
major page faults    4              # 0.005/sec
minor page faults    78020181       # 98852.168/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             322457994967   # 72.656 branches per 1000 inst
branch misses        15346807047    # 4.76% branch miss
conditional          231733673746   # 52.214 conditional branches per 1000 inst
indirect             6110886568     # 1.377 indirect branches per 1000 inst
cpu-cycles           3143196687512  # 0.92 GHz
instructions         4405132309118  # 1.40 IPC
slots                6291991187694  #
retiring             1536387710543  # 24.4% (29.6%)
-- ucode             3228779195     #     0.1%
-- fastpath          1533158931348  #    24.4%
frontend             929644428644   # 14.8% (17.9%)
-- latency           707692046958   #    11.2%
-- bandwidth         221952381686   #     3.5%
backend              2565494368112  # 40.8% (49.5%)
-- cpu               956203684287   #    15.2%
-- memory            1609290683825  #    25.6%
speculation          150701191697   #  2.4% ( 2.9%)
-- branch mispredict 145671010991   #     2.3%
-- pipeline restart  5030180706     #     0.1%
smt-contention       1109755007091  # 17.6% ( 0.0%)
cpu-cycles           3147769160356  # 0.91 GHz
instructions         4396937620175  # 1.40 IPC
instructions         1480153486902  # 45.732 l2 access per 1000 inst
l2 hit from l1       54965304511    # 11.34% l2 miss
l2 miss from l1      3863029248     #
l2 hit from l2 pf    8912497384     #
l3 hit from l2 pf    2337614534     #
l3 miss from l2 pf   1474996298     #
instructions         1468633262644  # 47.498 float per 1000 inst
float 512            57             # 0.000 AVX-512 per 1000 inst
float 256            602            # 0.000 AVX-256 per 1000 inst
float 128            69757460749    # 47.498 AVX-128 per 1000 inst
float MMX            0              # 0.000 MMX per 1000 inst
float scalar         0              # 0.000 scalar per 1000 inst
instructions         4410783289171  #
opcache              639131229094   # 144.902 opcache per 1000 inst
opcache miss         75966023342    # 11.9% opcache miss rate
l1 dTLB miss         15719138380    # 3.564 L1 dTLB per 1000 inst
l2 dTLB miss         696242527      # 0.158 L2 dTLB per 1000 inst
instructions         4411366720558  #
icache               147700356249   # 33.482 icache per 1000 inst
icache miss          9449915162     #  6.4% icache miss rate
l1 iTLB miss         30947776       # 0.007 L1 iTLB per 1000 inst
l2 iTLB miss         0              # 0.000 L2 iTLB per 1000 inst
tlb flush            23457          # 0.000 TLB flush per 1000 inst
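
The derived columns in the dump above can be reproduced from the raw counters. The sketch below is inferred from the numbers, not taken from the profiling tool: it assumes on_cpu is (utime + stime) / elapsed divided by the core count, and that the parenthesized topdown percentages drop smt-contention slots from the denominator. The function names are hypothetical.

```python
# Values copied from the AMD dump above.
ELAPSED_S = 214.087
UTIME_S = 572.235
STIME_S = 217.040
CORES = 16

SLOTS = 6291991187694
SMT_CONTENTION = 1109755007091
CATEGORIES = {
    "retiring": 1536387710543,
    "frontend": 929644428644,
    "backend": 2565494368112,
    "speculation": 150701191697,
}


def on_cpu(utime_s, stime_s, elapsed_s, cores):
    """Return (average busy cores, fraction of the machine kept busy).

    Assumption: the tool divides user + system time by elapsed time.
    """
    busy = (utime_s + stime_s) / elapsed_s
    return busy, busy / cores


def topdown_shares(categories, slots, smt_contention):
    """Return {name: (pct of all slots, pct of slots excluding SMT contention)}.

    Assumption: the parenthesized column uses (slots - smt_contention)
    as its denominator.
    """
    usable = slots - smt_contention
    return {
        name: (100.0 * count / slots, 100.0 * count / usable)
        for name, count in categories.items()
    }


if __name__ == "__main__":
    busy, frac = on_cpu(UTIME_S, STIME_S, ELAPSED_S, CORES)
    print(f"on_cpu {frac:.3f}  # {busy:.2f} / {CORES} cores")
    shares = topdown_shares(CATEGORIES, SLOTS, SMT_CONTENTION)
    for name, (raw, adjusted) in shares.items():
        print(f"{name:<12} {raw:5.1f}% ({adjusted:5.1f}%)")
```

With these inputs the results agree with the dump: 3.69 busy cores out of 16 (0.230), and retiring at 24.4% of all slots or 29.6% once SMT contention is excluded.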

Intel metrics show that backend memory stalls are concentrated at the L1 and L2 levels (11.1% L1 bound, 8.2% L2 bound) rather than at L3 or DRAM (2.1% L3 bound, 2.2% DRAM bound).

elapsed              229.959
on_cpu               0.221          # 3.54 / 16 cores
utime                655.417
stime                157.900
nvcsw                47596          # 81.51%
nivcsw               10796          # 18.49%
inblock              248224         # 1079.43/sec
onblock              281728         # 1225.12/sec
cpu-clock            812664634134   # 812.665 seconds
task-clock           812731141309   # 812.731 seconds
page faults          64066346       # 78828.462/sec
context switches     59368          # 73.048/sec
cpu migrations       11485          # 14.131/sec
major page faults    1276           # 1.570/sec
minor page faults    64065070       # 78826.892/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             289467025331   # 66.628 branches per 1000 inst
branch misses        4791047711     # 1.66% branch miss
conditional          289467037075   # 66.628 conditional branches per 1000 inst
indirect             46740820621    # 10.759 indirect branches per 1000 inst
slots                7250543859452  #
retiring             3182672569517  # 43.9% (43.9%)
-- ucode             284462783358   #     3.9%
-- fastpath          2898209786159  #    40.0%
frontend             798789567358   # 11.0% (11.0%)
-- latency           466544490290   #     6.4%
-- bandwidth         332245077068   #     4.6%
backend              2613200296738  # 36.0% (36.0%)
-- cpu               1445575001374  #    19.9%
-- memory            1167625295364  #    16.1%
speculation          713440337672   #  9.8% ( 9.8%)
-- branch mispredict 642636577796   #     8.9%
-- pipeline restart  70803759876    #     1.0%
smt-contention       0              #  0.0% ( 0.0%)
cpu-cycles           3015898317055  # 0.82 GHz
instructions         5329185911432  # 1.77 IPC
l2 access            103439376307   # 33.427 l2 access per 1000 inst
l2 miss              23259806947    # 22.49% l2 miss
cpu-cycles           1737674083545  # 27.5% memory latency
load stalls          410906786285   # 11.1% l1 bound
l1 miss              217693607649   #  8.2% l2 bound
l2 miss              74926260792    #  2.1% l3 bound
l3 miss              38350978722    #  2.2% dram bound
store_stalls         66085442220    #  3.8% store bound
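
The per-level percentages in the last six lines of the Intel dump follow from subtracting each counter from the one above it. The sketch below is an inference that happens to reproduce the printed values, not the tool's source: it assumes each row counts stall cycles with a miss outstanding at that level or deeper, so the difference between adjacent rows is the time attributable to a single level. The function name is hypothetical.

```python
# Values copied from the final counter group of the Intel dump above.
CPU_CYCLES = 1737674083545
LOAD_STALLS = 410906786285
L1_MISS = 217693607649
L2_MISS = 74926260792
L3_MISS = 38350978722
STORE_STALLS = 66085442220


def memory_bound_breakdown(cycles, load, l1_miss, l2_miss, l3_miss, store):
    """Return each memory-stall category as a percentage of cycles.

    Assumption: counters are nested (load >= l1_miss >= l2_miss >= l3_miss),
    so adjacent differences isolate the stalls served by one level.
    """
    def pct(count):
        return 100.0 * count / cycles

    return {
        "memory latency": pct(load + store),
        "l1 bound": pct(load - l1_miss),
        "l2 bound": pct(l1_miss - l2_miss),
        "l3 bound": pct(l2_miss - l3_miss),
        "dram bound": pct(l3_miss),
        "store bound": pct(store),
    }


if __name__ == "__main__":
    breakdown = memory_bound_breakdown(
        CPU_CYCLES, LOAD_STALLS, L1_MISS, L2_MISS, L3_MISS, STORE_STALLS
    )
    for name, value in breakdown.items():
        print(f"{name:<15} {value:5.1f}%")
```

Under that assumption, the L1 and L2 shares (11.1% and 8.2%) together account for most of the load-related stalls, while L3 and DRAM contribute about 2% each.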