Performance analysis, tools and experiments

An eclectic collection

Tag Archives: compiler

cpu2017

Posted on June 6, 2024 by mev

I have reached the point of diminishing returns for Phoronix tests. I have now analyzed 240 workloads and skipped another ~30+, mostly because they are GPU-centric tests. These 270+ tests fully cover the 56 Phoronix benchmark articles published so far this year. It has also become increasingly rare for a new article to reference an uncharacterized Phoronix test. When this happens, I will add the test to my analysis, but I don’t see as much point in going through other, often obsolete, workload examples. So I expect the count to creep up, but slowly.

SPEC CPU is interesting both as a workload and as a study of performance counters. Three general issues kept me from jumping full-bore into adding SPEC CPU until now:

  • The suite is expensive, ~$1000.
  • My point is to characterize it as a workload, not to create hardware measurements. SPEC has detailed reporting rules and an emphasis on publicizing SPEC numbers to measure and compare hardware. While I would like the code to be somewhat optimized, I am not trying for the absolute highest scores. So I will refrain from creating specific numbers with my “estimates” and will use compiler options that I generally find, without searching for the optimal ones.
  • SPEC CPU is a good measure of processor, memory and compiler. So for these measurements, I created config files with the AMD AOCC compiler suite.

SPEC CPU has both rate benchmarks and speed benchmarks. The rate benchmarks maximize throughput by running multiple copies, typically one per logical core. The speed benchmarks minimize latency, sometimes with a single copy but now also using OpenMP where it makes sense. I have concentrated first on the rate benchmarks. Looking at their profiles, I see some commonality among them, and occasional differences from many Phoronix benchmarks.

benchmark  status  elapsed  on_cpu  inblock  onblock  page-fault  context-switch  IPC  GHz  retire-rate  frontend-stall  backend-stall  spec-stall  retire-ucode  retire-fastpath  float-density  frontend-latency  frontend-bandwidth  opcache-miss  icache-miss  backend-cpu  backend-memory  amd-l2-miss  amd-l2-density  spec-branch  spec-pipeline  branch-miss  branch-density  branch-cond  branch-ind  smt-contention
500.perlbench_r  CPU  1272.531  15.75  0.02  542.18  389.437  10.721  1.58  4.10  33.3  19.7  44.7  2.3  0.1  25.2  16.803  8.8  6.2  5.6  42.5  3.1  30.8  11.62  15.736  1.7  0.1  0.63  184.569  132.594  12.996  23.9
502.gcc_r  CPU  1280.597  15.70  0.00  4845.61  26600.971  10.948  0.60  4.39  12.7  25.6  59.8  1.9  0.0  9.6  7.943  12.6  6.9  15.2  29.7  3.1  42.5  29.37  60.271  1.4  0.0  1.99  222.751  170.047  5.594  23.8
503.bwaves_r  CPU  4632.332  15.85  0.00  4.72  565.445  11.053  0.13  4.47  2.3  1.4  96.3  0.1  0.0  2.2  260.237  1.2  0.2  9.7  6.5  6.8  86.5  49.20  132.988  0.0  0.0  1.38  19.172  14.975  1.398  3.0
505.mcf_r  CPU  706.488  15.73  0.00  15.39  1495.202  10.497  0.98  4.06  19.9  26.9  43.9  9.3  0.0  16.3  0.687  13.1  8.9  0.4  16.7  4.3  31.6  16.33  53.401  7.6  0.0  4.98  169.000  147.312  0.017  18.2
507.cactuBSSN_r  CPU  609.922  15.69  0.00  495.30  1135.260  13.491  0.23  4.21  4.3  4.8  90.9  0.1  0.0  4.1  55.472  3.6  1.1  47.7  41.8  9.1  78.5  10.76  476.467  0.0  0.0  0.64  48.787  33.403  3.734  3.6
508.namd_r  CPU  705.983  15.81  0.00  45.01  239.918  11.385  1.83  3.74  51.3  4.8  40.9  3.0  0.0  31.3  395.622  2.2  0.8  0.2  18.8  17.2  7.7  1.35  63.010  1.8  0.0  4.40  26.587  24.258  0.021  39.1
510.parest_r  CPU  4550.517  15.80  0.00  16.88  213.125  10.974  0.29  4.41  5.8  4.2  89.7  0.3  0.0  4.8  338.504  1.9  1.6  1.4  23.3  6.8  67.4  34.77  93.116  0.2  0.0  1.59  106.362  90.964  3.249  17.3
511.povray_r  CPU  1207.797  15.60  0.00  1449.68  63.683  10.402  1.81  3.78  52.4  6.6  38.9  2.1  0.3  31.4  244.832  2.5  1.5  3.7  20.9  11.2  12.4  0.06  63.187  1.1  0.2  0.25  157.689  109.665  10.900  39.5
519.lbm_r  CPU  1416.599  15.53  0.00  7.13  256.969  10.588  0.26  4.42  4.7  2.1  93.2  0.0  0.0  4.5  51.354  1.1  1.0  15.3  6.0  2.2  88.9  24.72  172.647  0.0  0.0  0.08  138.203  137.728  0.026  2.3
520.omnetpp_r  CPU  2024.583  15.83  0.00  870.95  107.720  9.994  0.28  4.58  5.6  9.6  81.9  3.0  0.0  4.9  16.375  4.4  3.9  2.3  28.4  4.8  66.7  44.62  72.990  2.5  0.1  3.02  196.963  143.767  11.785  12.8
521.wrf_r  CPU  1953.561  15.79  0.00  1454.42  198.479  12.169  0.43  4.35  7.8  7.9  83.8  0.5  0.0  7.4  278.072  6.2  1.2  3.1  27.3  15.4  63.5  26.15  77.326  0.4  0.0  0.90  113.379  77.677  13.945  5.9
523.xalancbmk_r  CPU  716.422  15.72  0.00  8293.00  930.346  10.489  0.77  4.32  19.0  8.9  71.1  0.9  0.1  12.2  34.341  3.6  2.2  4.0  20.7  3.0  42.9  15.96  75.229  0.5  0.1  0.34  267.659  234.583  7.449  35.4
525.x264_r  CPU  427.195  15.00  0.00  16419.20  1583.996  12.872  1.59  3.69  38.0  21.0  38.6  2.5  0.1  27.2  189.560  11.2  3.9  28.6  42.4  11.4  16.3  6.16  37.180  1.8  0.0  2.76  66.296  48.490  3.845  28.0
526.blender_r  CPU  1003.543  15.62  0.00  655.15  1981.576  11.564  1.08  3.88  24.9  6.6  62.3  6.2  0.0  17.9  397.608  3.7  1.1  1.0  28.3  13.6  31.2  15.56  31.225  4.5  0.0  2.09  134.627  120.414  1.256  28.1
527.cam4_r  CPU  1312.508  15.81  7.50  935.89  3656.413  10.919  0.84  4.07  16.4  16.0  66.9  0.7  0.0  14.5  189.588  9.3  4.9  8.0  25.9  19.5  40.0  20.78  62.695  0.6  0.0  1.15  124.383  89.321  8.772  11.1
531.deepsjeng_r  CPU  811.223  15.78  0.00  37.06  718.782  10.181  1.42  4.08  30.2  29.4  34.9  5.5  0.0  23.6  21.274  15.0  7.9  17.7  16.4  3.7  23.6  4.85  23.537  4.2  0.1  3.99  123.840  97.520  0.907  21.9
538.imagick_r  CPU  371.223  14.33  0.00  1329.38  1026.201  10.757  2.19  3.63  56.0  11.1  25.2  7.7  0.0  33.8  149.855  3.8  2.9  0.1  14.4  8.3  6.9  6.13  11.978  4.6  0.0  0.89  182.276  175.161  0.187  39.6
541.leela_r  CPU  1062.974  15.76  0.00  144.03  156.446  10.467  1.05  4.07  23.6  50.0  12.7  13.7  0.0  18.2  81.302  27.5  11.2  2.6  8.3  3.0  6.8  4.14  18.021  10.5  0.1  12.17  141.333  118.833  0.187  22.7
544.nab_r  CPU  581.066  15.53  0.00  23.20  301.419  11.174  1.27  3.77  36.9  8.7  52.1  2.3  0.1  25.5  318.530  5.0  1.1  1.4  15.1  22.8  13.3  4.94  52.578  1.5  0.0  1.30  83.289  72.175  1.774  30.8
548.exchange2_r  CPU  557.819  15.77  0.00  23.76  71.047  10.670  1.89  3.99  46.4  36.9  14.2  2.6  0.0  32.3  126.182  12.0  13.6  1.6  17.7  4.6  5.3  0.76  0.827  1.8  0.0  1.30  165.361  157.689  1.022  30.3
549.fotonik3d_r  CPU  4829.371  15.91  0.00  32.94  144.718  11.510  0.11  4.54  2.0  1.9  96.1  0.1  0.0  1.9  286.386  1.3  0.5  6.8  12.8  2.7  91.8  44.60  137.645  0.0  0.0  0.45  36.518  33.907  0.471  1.7
554.roms_r  CPU  2913.604  15.82  0.00  8.12  600.297  11.852  0.16  4.45  2.8  2.3  94.9  0.1  0.0  2.8  129.670  1.8  0.4  4.4  21.7  7.4  85.6  36.36  196.455  0.0  0.0  0.43  76.566  57.327  6.664  2.0
557.xz_r  CPU  1413.262  15.76  0.00  11.48  1363.376  9.484  0.77  4.41  19.4  9.5  64.8  6.3  0.0  12.6  21.356  4.5  1.7  1.0  17.8  2.2  39.8  30.11  23.253  4.1  0.0  4.53  115.607  104.940  1.340  35.2

  • The on_cpu values are high. This is very much a test of a CPU-dominated workload. There are not as many delays waiting for I/O, networking, graphics or other parts of the system. So there is an intensity to the mix that isn’t always as present with a more generic set of applications. Correspondingly the “GHz” values as a number of clock cycles per second are also high.
  • Most of the floating point benchmarks are dominated by backend stalls. On my 7840 processor, the memory subsystem more often becomes a limiter.
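
The backend-stall observation can be checked directly against the table. The sketch below copies the backend-stall column for the 13 fprate benchmarks and counts how many exceed a cut-off. The 50% threshold and the names `BACKEND_STALL`, `THRESHOLD` and `backend_dominated` are assumptions made for illustration, not something the measurement tool defines.

```python
# backend-stall percentages copied from the table above (fprate benchmarks only).
BACKEND_STALL = {
    "503.bwaves_r": 96.3,
    "507.cactuBSSN_r": 90.9,
    "508.namd_r": 40.9,
    "510.parest_r": 89.7,
    "511.povray_r": 38.9,
    "519.lbm_r": 93.2,
    "521.wrf_r": 83.8,
    "526.blender_r": 62.3,
    "527.cam4_r": 66.9,
    "538.imagick_r": 25.2,
    "544.nab_r": 52.1,
    "549.fotonik3d_r": 96.1,
    "554.roms_r": 94.9,
}

THRESHOLD = 50.0  # assumed cut-off for "dominated by backend stalls"


def backend_dominated(stalls, threshold=THRESHOLD):
    """Return the benchmark names whose backend-stall share exceeds threshold."""
    return sorted(name for name, pct in stalls.items() if pct > threshold)


hits = backend_dominated(BACKEND_STALL)
print(f"{len(hits)} of {len(BACKEND_STALL)} fprate benchmarks exceed {THRESHOLD}%")
for name in hits:
    print(f"  {name}: {BACKEND_STALL[name]}")
```

With this cut-off, 508.namd_r, 511.povray_r and 538.imagick_r are the floating point benchmarks that fall below it.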

I have gone through fprate and am in the process of working through intrate. While I have run the intspeed and fpspeed benchmarks, those are lower on my list to characterize. This sets me up for two later exercises: (a) after zen5 processors are available, I can use the benchmarks to see how the workloads compare on a zen5 vs. a zen4 core, and (b) I am thinking of a “clustering” exercise to look for similarities between Phoronix and SPEC CPU workloads.

Posted in experiment | Tagged benchmarks, compiler, cpu2017

graphics-magick sharpen, compiler improvements

Posted on March 19, 2024 by mev

The Phoronix article https://www.phoronix.com/review/nvidia-gh200-compilers compares GCC 13.2 with Clang 17.0.2 on an ARM platform. In the attached discussion, the improvement for the graphics-magick sharpen benchmark particularly stands out. So I thought I would check whether I see a similar improvement, and whether performance tools can spot likely areas contributing to the difference.

My system has the Ubuntu 22.04 system compiler, gcc 11.4, and also aocc 4.1, which is based on clang 16.0.3, so not exactly the same but close enough. I forced a rebuild by reinstalling the test with environment variables set, e.g.

export CC=/opt/AMD/aocc-compiler-4.1.0/bin/clang
export CXX=/opt/AMD/aocc-compiler-4.1.0/bin/clang++
export CFLAGS="-O3 -march=native"
export CXXFLAGS="-O3 -march=native"

With these differences, I see the following with gcc 11.4

    Operation: Sharpen:
        107
        108
        108

    Average: 108 Iterations Per Minute
    Deviation: 0.54%

and the following with clang 16.0

    Operation: Sharpen:
        177
        178
        178

    Average: 178 Iterations Per Minute
    Deviation: 0.32%

So overall a 1.65x speedup. Not quite the 2x speedup seen on the AArch64 system but close enough given different compilers.
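
The 1.65x figure is the ratio of the two reported averages. As a quick check (the variable names are only illustrative):

```python
# Average "Iterations Per Minute" reported by the two runs above.
gcc_ipm = 108    # gcc 11.4
clang_ipm = 178  # clang 16.0 (aocc 4.1)

speedup = clang_ipm / gcc_ipm
print(f"speedup: {speedup:.2f}x")

# The reported run-to-run deviations (0.54% and 0.32%) are far smaller
# than the gap between the compilers, so the difference is not noise.
gain_pct = 100.0 * (speedup - 1.0)
print(f"gain: {gain_pct:.0f}%")
```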

Here is what my topdown profile shows for gcc

Here is the comparison point with clang

Interestingly the total runtime is close to the same (time-bound test?) but we definitely have dropped backend stalls in favor of retiring a higher percentage of instructions.

Here is what the metrics show for gcc

elapsed              196.669
on_cpu               0.912          # 14.59 / 16 cores
utime                2847.753
stime                22.595
nvcsw                7799           # 21.59%
nivcsw               28324          # 78.41%
inblock              72             # 0.37/sec
onblock              12832          # 65.25/sec
cpu-clock            2870386800612  # 2870.387 seconds
task-clock           2870418438024  # 2870.418 seconds
page faults          8219671        # 2863.579/sec
context switches     36937          # 12.868/sec
cpu migrations       252            # 0.088/sec
major page faults    3              # 0.001/sec
minor page faults    8219668        # 2863.578/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             1140203600927  # 59.971 branches per 1000 inst
branch misses        9706629710     # 0.85% branch miss
conditional          1123801038031  # 59.108 conditional branches per 1000 inst
indirect             79758239       # 0.004 indirect branches per 1000 inst
cpu-cycles           12006859596451 # 3.83 GHz
instructions         18841741204039 # 1.57 IPC
slots                24015874376394 #
retiring             6873291214752  # 28.6% (44.9%)
-- ucode             776914684      #     0.0%
-- fastpath          6872514300068  #    28.6%
frontend             280560426894   #  1.2% ( 1.8%) low
-- latency           204333739230   #     0.9%
-- bandwidth         76226687664    #     0.3%
backend              8106573021445  # 33.8% (52.9%)
-- cpu               7904629941195  #    32.9%
-- memory            201943080250   #     0.8%
speculation          52507444606    #  0.2% ( 0.3%) low
-- branch mispredict 52421287288    #     0.2%
-- pipeline restart  86157318       #     0.0%
smt-contention       8702915928072  # 36.2% ( 0.0%)
cpu-cycles           12008757786517 # 3.84 GHz
instructions         18832540244485 # 1.57 IPC
instructions         6279648919771  # 2.124 l2 access per 1000 inst
l2 hit from l1       7173349879     # 20.29% l2 miss
l2 miss from l1      704685663      #
l2 hit from l2 pf    4162164156     #
l3 hit from l2 pf    1757001598     #
l3 miss from l2 pf   244954442      #
instructions         6277843164084  # 351.548 float per 1000 inst
float 512            57             # 0.000 AVX-512 per 1000 inst
float 256            584            # 0.000 AVX-256 per 1000 inst
float 128            2206965520855  # 351.548 AVX-128 per 1000 inst
float MMX            0              # 0.000 MMX per 1000 inst
float scalar         0              # 0.000 scalar per 1000 inst
instructions         18950819136221 #
opcache              2107351794817  # 111.201 opcache per 1000 inst
opcache miss         9523189344     #  0.5% opcache miss rate
l1 dTLB miss         902958198      # 0.048 L1 dTLB per 1000 inst
l2 dTLB miss         68055690       # 0.004 L2 dTLB per 1000 inst
instructions         18892305597227 #
icache               18578037535    # 0.983 icache per 1000 inst
icache miss          1477678165     #  8.0% icache miss rate
l1 iTLB miss         8626682        # 0.000 L1 iTLB per 1000 inst
l2 iTLB miss         0              # 0.000 L2 iTLB per 1000 inst
tlb flush            34816          # 0.000 TLB flush per 1000 inst
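
The topdown percentages in this dump can be reproduced from the raw slot counts. The sketch below assumes that the first percentage on each line is a share of all slots and that the value in parentheses is a share of the slots left after smt-contention is excluded. That reading is inferred from the numbers, not taken from the tool's documentation.

```python
# Slot counts copied from the gcc run above.
slots = 24015874376394
counts = {
    "retiring": 6873291214752,
    "frontend": 280560426894,
    "backend": 8106573021445,
    "speculation": 52507444606,
    "smt-contention": 8702915928072,
}


def pct(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# First percentage on each line: share of all slots.
of_slots = {name: pct(value, slots) for name, value in counts.items()}

# Parenthesised percentage (assumed meaning): share with smt-contention excluded.
core = {name: value for name, value in counts.items() if name != "smt-contention"}
core_total = sum(core.values())
of_core = {name: pct(value, core_total) for name, value in core.items()}

for name in counts:
    extra = f" ({of_core[name]:.1f}%)" if name in of_core else ""
    print(f"{name:15s} {of_slots[name]:5.1f}%{extra}")
```

Under that assumption the output matches the dump: retiring comes out at 28.6% of all slots and 44.9% once smt-contention is set aside.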

Here is what they show for clang

elapsed              198.605
on_cpu               0.910          # 14.55 / 16 cores
utime                2846.489
stime                43.933
nvcsw                10817          # 26.18%
nivcsw               30507          # 73.82%
inblock              8              # 0.04/sec
onblock              12904          # 64.97/sec
cpu-clock            2890592540363  # 2890.593 seconds
task-clock           2890613273288  # 2890.613 seconds
page faults          13446401       # 4651.747/sec
context switches     42134          # 14.576/sec
cpu migrations       320            # 0.111/sec
major page faults    51             # 0.018/sec
minor page faults    13446350       # 4651.729/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             1895985554702  # 85.338 branches per 1000 inst
branch misses        16208435546    # 0.85% branch miss
conditional          1856679790302  # 83.569 conditional branches per 1000 inst
indirect             162624101      # 0.007 indirect branches per 1000 inst
cpu-cycles           11802414391046 # 3.76 GHz
instructions         22167963494891 # 1.88 IPC
slots                23606223963606 #
retiring             7476214292393  # 31.7% (52.1%)
-- ucode             3459130581     #     0.0%
-- fastpath          7472755161812  #    31.7%
frontend             577593637926   #  2.4% ( 4.0%) low
-- latency           362394713874   #     1.5%
-- bandwidth         215198924052   #     0.9%
backend              6205319253065  # 26.3% (43.3%)
-- cpu               5685432163067  #    24.1%
-- memory            519887089998   #     2.2%
speculation          83292194787    #  0.4% ( 0.6%) low
-- branch mispredict 83160520795    #     0.4%
-- pipeline restart  131673992      #     0.0%
smt-contention       9263789209330  # 39.2% ( 0.0%)
cpu-cycles           11818914678350 # 3.74 GHz
instructions         22211450935976 # 1.88 IPC
instructions         7404943446705  # 2.939 l2 access per 1000 inst
l2 hit from l1       11586386135    # 19.79% l2 miss
l2 miss from l1      1111590991     #
l2 hit from l2 pf    6979543347     #
l3 hit from l2 pf    2793906722     #
l3 miss from l2 pf   399941912      #
instructions         7400104673984  # 491.708 float per 1000 inst
float 512            72             # 0.000 AVX-512 per 1000 inst
float 256            668            # 0.000 AVX-256 per 1000 inst
float 128            3638689694804  # 491.708 AVX-128 per 1000 inst
float MMX            0              # 0.000 MMX per 1000 inst
float scalar         0              # 0.000 scalar per 1000 inst
instructions         22251978896991 #
opcache              3428837218389  # 154.091 opcache per 1000 inst
opcache miss         16257852042    #  0.5% opcache miss rate
l1 dTLB miss         1527716103     # 0.069 L1 dTLB per 1000 inst
l2 dTLB miss         108536720      # 0.005 L2 dTLB per 1000 inst
instructions         22248633347533 #
icache               35471913129    # 1.594 icache per 1000 inst
icache miss          1971706421     #  5.6% icache miss rate
l1 iTLB miss         9490954        # 0.000 L1 iTLB per 1000 inst
l2 iTLB miss         0              # 0.000 L2 iTLB per 1000 inst
tlb flush            71325          # 0.000 TLB flush per 1000 inst
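
To line the two dumps up metric by metric, a small parser helps. This is a minimal sketch that assumes the `name value # comment` layout shown above. The function names `parse_metrics` and `compare` are hypothetical, and only a few lines of each dump are included as sample input.

```python
def parse_metrics(text):
    """Parse 'name value # comment' lines into {name: number}.

    Names may contain spaces ("page faults") and some names repeat
    ("instructions"), so the first occurrence of each name is kept.
    Leading "-- " on sub-metrics is dropped, which loses the parent name.
    """
    metrics = {}
    for line in text.splitlines():
        body = line.split("#", 1)[0].strip()
        parts = body.rsplit(None, 1)  # split the value off the right-hand side
        if len(parts) != 2:
            continue
        name, raw = parts
        name = name.lstrip("- ").strip()
        try:
            value = float(raw)
        except ValueError:
            continue
        metrics.setdefault(name, value)
    return metrics


def compare(before, after):
    """Yield (name, before, after, ratio) for metrics present in both runs."""
    for name, old in before.items():
        if name in after and old != 0:
            yield name, old, after[name], after[name] / old


GCC = """\
elapsed              196.669
page faults          8219671        # 2863.579/sec
cpu-cycles           12006859596451 # 3.83 GHz
instructions         18841741204039 # 1.57 IPC
-- cpu               7904629941195  #    32.9%
"""

CLANG = """\
elapsed              198.605
page faults          13446401       # 4651.747/sec
cpu-cycles           11802414391046 # 3.76 GHz
instructions         22167963494891 # 1.88 IPC
-- cpu               5685432163067  #    24.1%
"""

for name, old, new, ratio in compare(parse_metrics(GCC), parse_metrics(CLANG)):
    print(f"{name:15s} {old:>14.6g} {new:>14.6g} {ratio:6.2f}x")
```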

From a rough comparison I notice:

  • User time is almost identical so likely some time-bounded loop
  • There are more instructions overall, and particularly AVX-128 has gone from 351 per thousand to 491 per thousand. The number of branches has also gone up
  • IPC has gone from 1.57 to 1.88.

Based on this, my best guess is that greater vectorization has tightened the core loop. This indirectly results in more branches (smaller loop). CPU stalls still contribute most to backend stalls, but they have gone down while the number of vector instructions has gone up.
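
The ratios behind these observations can be recomputed from the counters in the two dumps. This is only an arithmetic cross-check under stated assumptions: elapsed time is treated as equal in both runs, and the AVX-128 counts appear to come from a separate counter group, so their ratio is more trustworthy than their absolute values. The dictionary keys are illustrative names.

```python
# Counters copied from the two runs above.
gcc = {
    "cycles": 12006859596451,
    "instructions": 18841741204039,
    "avx128": 2206965520855,
    "iters_per_min": 108,
}
clang = {
    "cycles": 11802414391046,
    "instructions": 22167963494891,
    "avx128": 3638689694804,
    "iters_per_min": 178,
}


def ipc(run):
    """Instructions per cycle for one run."""
    return run["instructions"] / run["cycles"]


def ratio(key):
    """clang value divided by gcc value for one counter."""
    return clang[key] / gcc[key]


print(f"IPC: gcc {ipc(gcc):.2f}, clang {ipc(clang):.2f}")
print(f"instructions ratio: {ratio('instructions'):.2f}")
print(f"AVX-128 ratio:      {ratio('avx128'):.2f}")
print(f"throughput ratio:   {ratio('iters_per_min'):.2f}")

# With near-equal elapsed time, instructions per iteration scale roughly as
# (instruction ratio) / (throughput ratio).
per_iter = ratio("instructions") / ratio("iters_per_min")
print(f"instructions per iteration, clang vs gcc: {per_iter:.2f}")
```

A per-iteration ratio below 1.0 means clang needs fewer instructions for each sharpen iteration, which is at least consistent with a tighter core loop.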

There may be other more direct ways to compare compiler options and results, but this is at least an indirect way to view the effects looking at the overall performance characterization.

Posted in experiment | Tagged benchmarks, compiler, phoronix
