The following are affinity clusters of similar benchmarks, based on on_cpu (number of cores used) and the topdown metrics (retirement, frontend stall, backend stall, speculation).
Cluster 0 (16 entries): 500.perlbench_r 508.namd_r 511.povray_r 525.x264_r 544.nab_r brl-cad c-ray john-the-ripper namd openssl povray qe quicksilver specfem3d svt-hevc vvenc
Cluster 1 (23 entries): 503.bwaves_r 507.cactuBSSN_r 510.parest_r 519.lbm_r 520.omnetpp_r 521.wrf_r 549.fotonik3d_r 554.roms_r ai-benchmark cloverleaf easywave kripke libxsmm minibude mt-dgemm ncnn oidn onednn openvino stream tensorflow xmrig y-cruncher
Cluster 2 (4 entries): 541.leela_r compress-7zip m-queens n-queens
Cluster 3 (5 entries): aobench compress-lz4 gnupg lammps lzbench
Cluster 4 (7 entries): 538.imagick_r 548.exchange2_r avifenc helsing hmmer rays1bench uvg266
Cluster 5 (6 entries): build-eigen build-python compress-gzip hadoop inkscape tscp
Cluster 6 (8 entries): 505.mcf_r 531.deepsjeng_r appleseed asmfish blender primesieve stockfish v-ray
Cluster 7 (14 entries): build2 build-erlang build-ffmpeg build-gcc build-gdb build-gem5 build-godot build-imagemagick build-linux-kernel build-llvm build-mesa build-mplayer build-php build-wasmer
Cluster 8 (10 entries): apache cassandra compilebench ctx-clock dbench fast-cli ipc-benchmark memcached sqlite wireguard
Cluster 9 (8 entries): cp2k graphics-magick qmcpack rodinia smallpt stargate vpxenc x264
Cluster 10 (6 entries): blosc clickhouse core-latency daphne gimp mbw
Cluster 11 (6 entries): arrayfire dragonflydb nginx pgbench pjsip rbenchmark
Cluster 12 (5 entries): espeak phpbench pybench securemark smhasher
Cluster 13 (7 entries): apache-iotdb encode-wavpack fftw gcrypt java-scimark2 polybench-c synthmark
Cluster 14 (12 entries): 523.xalancbmk_r 526.blender_r 527.cam4_r 557.xz_r embree graph500 lczero openvkl ospray-studio quadray sysbench tensorflow-lite
Cluster 15 (10 entries): amg askap heffte hpcg incompact3d onnx openfoam parboil pytorch ramspeed
Cluster 16 (9 entries): compress-zstd cpp-perf-bench draco indigobench jpegxl jpegxl-decode polyhedron scimark2 z3
Cluster 17 (12 entries): clomp darktable deepsparse deepspeech ffte himeno llama.cpp llamafile lulesh ngspice npb palabos
Cluster 18 (6 entries): 502.gcc_r build-nodejs ebizzy faiss minife mnn
Cluster 19 (5 entries): botan cachebench glibc-bench gnuradio nettle
Cluster 20 (8 entries): blake2 build-apache build-clash octave-benchmark openjpeg selenium tungsten vkpeak
Cluster 21 (5 entries): mpcbench node-octane openscad pyperformance rav1e
Cluster 22 (7 entries): bork byte libreoffice numpy perl-benchmark rsvg sudokut
Cluster 23 (9 entries): compress-rar dacapobench duckdb ffmpeg gegl node-web-tooling pyhpc renaissance spark-tpcds
Cluster 24 (11 entries): aircrack-ng astcenc basis coremark cpuminer-opt java-jmh kvazaar mrbayes quantlib toybrot webp2
Cluster 25 (6 entries): dav1d opencv ospray schbench svt-av1 svt-vp9
Cluster 26 (9 entries): cryptopp dolfyn encode-flac gmpbench mutex nwchem rnnoise simdjson spark
Cluster 27 (6 entries): cockroach hackbench rocksdb scylladb speedb stress-ng
Cluster 28 (13 entries): aom-av1 financebench gpaw gromacs liquid-dsp neat openradioss pennant rawtherapee srsran tnn whisper.cpp x265
Cluster 29 (11 entries): bullet compress-pbzip2 crafty cython-bench encode-mp3 encode-opus etcpak fhourstones git libraw webp
After some experimentation, I settled on the following approach for clustering.
Attributes of interest
The first question was "clustering based on what?". Below is a fuller set of candidate metrics, ranging from the amount of I/O to floating-point density to counter-based instrumentation. Their ranges differ widely, but they can be normalized using the mean and standard deviation. I first experimented with a set of 10 metrics (on_cpu, retire, frontend, backend, speculation, IPC, GHz, float-density, branch-density, smt-contention) before settling on just the first five.
Some of these metrics are correlated, which is why I figured it wouldn't add value to use both the AMD and Intel counters. Others, such as the I/O metrics, are quite orthogonal and likely useful in a broader context, but I haven't studied them enough to characterize them clearly.
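The per-metric normalization mentioned above (subtract the mean, divide by the standard deviation, so metrics with very different units become comparable) can be sketched in plain Python; the metric names and sample values here are illustrative, not the actual dataset:

```python
from statistics import mean, stdev

def zscore(values):
    """Normalize one metric to zero mean and unit standard deviation,
    so metrics with different ranges contribute comparably to distance."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical per-benchmark samples for two metrics with very
# different raw ranges; after normalization both are comparable.
on_cpu = [1.0, 6.3, 15.9, 6.7]
backend_stall = [4.1, 38.2, 97.4, 42.9]
norm = {"on_cpu": zscore(on_cpu), "backend-stall": zscore(backend_stall)}
```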
| metric | count | min | max | median | mean | stddev |
|---|---|---|---|---|---|---|
| elapsed | 247 | 3.53 | 8.76e+03 | 532 | 1.15e+03 | 1.53e+03 |
| on_cpu | 247 | 0 | 15.9 | 6.34 | 6.68 | 5.42 |
| inblock | 247 | 0 | 4.46e+05 | 0 | 3.65e+03 | 3.92e+04 |
| onblock | 247 | 0.46 | 8.97e+05 | 131 | 1.57e+04 | 7.91e+04 |
| page-fault | 247 | 2.6 | 1.31e+05 | 2.1e+03 | 1.17e+04 | 2.1e+04 |
| context-switch | 247 | 2.88 | 4.77e+04 | 66 | 2.49e+03 | 7.58e+03 |
| IPC | 247 | 0.01 | 4.59 | 1.51 | 1.64 | 0.888 |
| GHz | 247 | 0 | 5.23 | 1.87 | 1.82 | 1.4 |
| retire-rate | 247 | 0.7 | 76.2 | 29.2 | 32.1 | 16 |
| frontend-stall | 247 | 0.1 | 72.8 | 16.3 | 21.7 | 18 |
| backend-stall | 247 | 4.1 | 97.4 | 38.2 | 42.9 | 21.9 |
| spec-stall | 247 | 0 | 21.3 | 2 | 3.28 | 3.78 |
| retire-ucode | 247 | 0 | 1.2 | 0 | 0.0769 | 0.141 |
| retire-fastpath | 247 | 0.7 | 76.2 | 25.5 | 27.7 | 14.5 |
| float-density | 247 | 0.016 | 676 | 62.9 | 129 | 151 |
| frontend-latency | 247 | 0.1 | 57.8 | 8.4 | 13.4 | 13.6 |
| frontend-bandwidth | 247 | 0 | 31.7 | 4.8 | 5.83 | 4.8 |
| opcache-miss | 247 | 0 | 65.9 | 6.1 | 13.4 | 15.1 |
| icache-miss | 247 | 0.1 | 69 | 13.4 | 16.8 | 11.5 |
| backend-cpu | 247 | 0.7 | 64 | 9.4 | 12.8 | 11.4 |
| backend-memory | 247 | 0.2 | 95.9 | 19.9 | 24.6 | 18.1 |
| amd-l2-miss | 247 | 0.05 | 67.5 | 17.1 | 18.3 | 13 |
| amd-l2-density | 247 | 0.022 | 4.29e+04 | 35.1 | 229 | 2.72e+03 |
| spec-branch | 247 | 0 | 21.2 | 1.7 | 2.82 | 3.6 |
| spec-pipeline | 247 | 0 | 2 | 0 | 0.113 | 0.249 |
| branch-miss | 247 | 0 | 14.8 | 1.85 | 2.72 | 2.99 |
| branch-density | 247 | 4.9 | 317 | 128 | 130 | 61.8 |
| branch-cond | 247 | 4.5 | 311 | 92.3 | 98.3 | 48.8 |
| branch-ind | 247 | 0.003 | 28.7 | 2.85 | 4.44 | 5.12 |
| smt-contention | 247 | 0 | 48.4 | 9.6 | 12.7 | 13.3 |

The rows above (247 samples, with the amd-l2-* counters) are from the AMD run; the table below (238 samples, with the intel-l2-* counters) is from the Intel run.

| metric | count | min | max | median | mean | stddev |
|---|---|---|---|---|---|---|
| elapsed | 238 | 16 | 1.69e+04 | 655 | 1.66e+03 | 2.5e+03 |
| on_cpu | 238 | 0 | 15.8 | 6.77 | 7 | 5.54 |
| inblock | 238 | 0.01 | 3.86e+05 | 65.8 | 6.31e+03 | 3.54e+04 |
| onblock | 238 | 0.37 | 6.35e+05 | 19.9 | 1.1e+04 | 5.18e+04 |
| page-fault | 238 | 4.07 | 1.21e+05 | 1.68e+03 | 1.06e+04 | 2.02e+04 |
| context-switch | 238 | 2.23 | 6.44e+04 | 62.4 | 2.49e+03 | 8.92e+03 |
| IPC | 238 | 0.01 | 5.53 | 1.89 | 2.02 | 1.01 |
| GHz | 238 | 0 | 3.04 | 1.32 | 1.2 | 0.88 |
| retire-rate | 238 | 3.7 | 87.3 | 43.2 | 43.4 | 15.4 |
| frontend-stall | 238 | 0.5 | 52 | 15.9 | 17.8 | 11 |
| backend-stall | 238 | 1.3 | 95.3 | 26.2 | 30.9 | 20.2 |
| spec-stall | 238 | 0 | 46.7 | 6.1 | 8.38 | 8.3 |
| retire-ucode | 238 | 0 | 16.7 | 2.9 | 3.26 | 2.26 |
| retire-fastpath | 238 | 2.4 | 83.2 | 39.6 | 40.2 | 14.9 |
| frontend-latency | 238 | 0.3 | 39.2 | 8.6 | 9.57 | 6.73 |
| frontend-bandwidth | 238 | 0.1 | 25.8 | 7.3 | 8.2 | 5.84 |
| backend-cpu | 238 | 0.6 | 76.1 | 11.5 | 15.5 | 12.4 |
| backend-memory | 238 | 0.3 | 89.9 | 10.7 | 15.4 | 16.1 |
| l1-stall | 238 | 0 | 29.6 | 4.1 | 5.53 | 5.66 |
| l2-stall | 238 | 0 | 58.8 | 7.1 | 7.84 | 8.18 |
| l3-stall | 238 | 0 | 35 | 2.3 | 3.98 | 5.42 |
| dram-stall | 238 | 0 | 86.9 | 3.3 | 7.46 | 11.7 |
| store-stall | 238 | 0 | 28.3 | 0.8 | 1.7 | 3.11 |
| intel-l2-miss | 238 | 0.35 | 92.7 | 29.1 | 30.1 | 18.2 |
| intel-l2-density | 238 | 0.019 | 2.12e+04 | 21.9 | 126 | 1.37e+03 |
| spec-branch | 238 | 0 | 46.7 | 5.7 | 7.95 | 8.3 |
| spec-pipeline | 238 | 0 | 6.2 | 0.3 | 0.427 | 0.652 |
| branch-miss | 238 | 0 | 20.4 | 0.83 | 1.57 | 2.32 |
| branch-density | 238 | 6.24 | 320 | 129 | 129 | 61 |
| branch-cond | 238 | 6.24 | 320 | 129 | 129 | 61 |
| branch-ind | 238 | 0.027 | 83 | 20.8 | 22.1 | 17.4 |
Clustering algorithm
After a web search, I ended up with a variation of Lloyd's algorithm (the classic k-means iteration). It is relatively straightforward and settles fairly quickly on a set of clusters. In summary, it goes through the following steps:
- Start with an initial set of cluster points (normally chosen at random); I picked every Nth benchmark as my starting points.
- Assign each benchmark to the closest cluster point, using the Euclidean distance in N dimensions: sqrt(d1^2 + d2^2 + ... + dN^2).
- Recompute each cluster point as the center (mean) of the benchmarks assigned to it.
- Iterate the last two steps until the assignment converges on a stable set of clusters.
This took ~8 iterations when I tried it on ~240 Phoronix tests along with the 23 SPEC CPU 2017 benchmarks.
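The steps above can be sketched in plain Python. The every-Nth seeding and Euclidean distance follow the description; the function and variable names are mine, not from the actual implementation:

```python
import math

def lloyd(points, k, max_iter=50):
    """k-means via Lloyd's algorithm. `points` is a list of equal-length
    feature vectors (already normalized). Seeds with every Nth point,
    as described above, rather than random picks."""
    step = max(1, len(points) // k)
    centers = [list(points[i * step]) for i in range(k)]

    def dist(a, b):  # Euclidean distance in N dimensions
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    assign = None
    for _ in range(max_iter):
        # Assign each point to its nearest cluster center.
        new_assign = [min(range(k), key=lambda c: dist(p, centers[c]))
                      for p in points]
        if new_assign == assign:  # converged: no point changed cluster
            break
        assign = new_assign
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members)
                              for col in zip(*members)]
    return assign, centers
```

On a toy 2-D input such as `[[0,0],[0,1],[1,0],[10,10],[10,11],[11,10]]` with `k=2`, this converges in a couple of iterations to the two obvious groups.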
With these clusters, I can spread a benchmark analysis across a wide range of different benchmarks, and also identify similar benchmarks that might substitute for each other. For example, cluster #7 from the list above collects a set of build benchmarks, while cluster #1 looks like a set of backend-bound benchmarks running on all cores.
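One way to act on the substitution idea is to pick, for each cluster, the benchmark whose feature vector sits closest to the cluster centroid and treat it as the group's representative. A sketch, assuming the assignment list and centers produced by a k-means run plus a parallel list of benchmark names (all names here are hypothetical):

```python
import math

def representatives(names, points, assign, centers):
    """For each cluster, return the benchmark whose feature vector is
    nearest the cluster centroid -- a natural stand-in for the group."""
    best = {}
    for name, p, c in zip(names, points, assign):
        d = math.sqrt(sum((x - y) ** 2 for x, y in zip(p, centers[c])))
        if c not in best or d < best[c][0]:
            best[c] = (d, name)
    return {c: name for c, (d, name) in best.items()}
```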
