The following are affinity clusters of similar benchmarks, based on on_cpu (number of cores used) and the topdown metrics (retirement, frontend stall, backend stall, speculation).
Cluster 0 (16 entries): 500.perlbench_r 508.namd_r 511.povray_r 525.x264_r 544.nab_r brl-cad c-ray john-the-ripper namd openssl povray qe quicksilver specfem3d svt-hevc vvenc
Cluster 1 (23 entries): 503.bwaves_r 507.cactuBSSN_r 510.parest_r 519.lbm_r 520.omnetpp_r 521.wrf_r 549.fotonik3d_r 554.roms_r ai-benchmark cloverleaf easywave kripke libxsmm minibude mt-dgemm ncnn oidn onednn openvino stream tensorflow xmrig y-cruncher
Cluster 2 (4 entries): 541.leela_r compress-7zip m-queens n-queens
Cluster 3 (5 entries): aobench compress-lz4 gnupg lammps lzbench
Cluster 4 (7 entries): 538.imagick_r 548.exchange2_r avifenc helsing hmmer rays1bench uvg266
Cluster 5 (6 entries): build-eigen build-python compress-gzip hadoop inkscape tscp
Cluster 6 (8 entries): 505.mcf_r 531.deepsjeng_r appleseed asmfish blender primesieve stockfish v-ray
Cluster 7 (14 entries): build2 build-erlang build-ffmpeg build-gcc build-gdb build-gem5 build-godot build-imagemagick build-linux-kernel build-llvm build-mesa build-mplayer build-php build-wasmer
Cluster 8 (10 entries): apache cassandra compilebench ctx-clock dbench fast-cli ipc-benchmark memcached sqlite wireguard
Cluster 9 (8 entries): cp2k graphics-magick qmcpack rodinia smallpt stargate vpxenc x264
Cluster 10 (6 entries): blosc clickhouse core-latency daphne gimp mbw
Cluster 11 (6 entries): arrayfire dragonflydb nginx pgbench pjsip rbenchmark
Cluster 12 (5 entries): espeak phpbench pybench securemark smhasher
Cluster 13 (7 entries): apache-iotdb encode-wavpack fftw gcrypt java-scimark2 polybench-c synthmark
Cluster 14 (12 entries): 523.xalancbmk_r 526.blender_r 527.cam4_r 557.xz_r embree graph500 lczero openvkl ospray-studio quadray sysbench tensorflow-lite
Cluster 15 (10 entries): amg askap heffte hpcg incompact3d onnx openfoam parboil pytorch ramspeed
Cluster 16 (9 entries): compress-zstd cpp-perf-bench draco indigobench jpegxl jpegxl-decode polyhedron scimark2 z3
Cluster 17 (12 entries): clomp darktable deepsparse deepspeech ffte himeno llama.cpp llamafile lulesh ngspice npb palabos
Cluster 18 (6 entries): 502.gcc_r build-nodejs ebizzy faiss minife mnn
Cluster 19 (5 entries): botan cachebench glibc-bench gnuradio nettle
Cluster 20 (8 entries): blake2 build-apache build-clash octave-benchmark openjpeg selenium tungsten vkpeak
Cluster 21 (5 entries): mpcbench node-octane openscad pyperformance rav1e
Cluster 22 (7 entries): bork byte libreoffice numpy perl-benchmark rsvg sudokut
Cluster 23 (9 entries): compress-rar dacapobench duckdb ffmpeg gegl node-web-tooling pyhpc renaissance spark-tpcds
Cluster 24 (11 entries): aircrack-ng astcenc basis coremark cpuminer-opt java-jmh kvazaar mrbayes quantlib toybrot webp2
Cluster 25 (6 entries): dav1d opencv ospray schbench svt-av1 svt-vp9
Cluster 26 (9 entries): cryptopp dolfyn encode-flac gmpbench mutex nwchem rnnoise simdjson spark
Cluster 27 (6 entries): cockroach hackbench rocksdb scylladb speedb stress-ng
Cluster 28 (13 entries): aom-av1 financebench gpaw gromacs liquid-dsp neat openradioss pennant rawtherapee srsran tnn whisper.cpp x265
Cluster 29 (11 entries): bullet compress-pbzip2 crafty cython-bench encode-mp3 encode-opus etcpak fhourstones git libraw webp
After some experimentation, I settled on the following approach for clustering.
Attributes of interest
The first question was "clustering based on what?". Below is a fuller set of candidate metrics, ranging from the amount of I/O to floating-point density to counter-based instrumentation. Their ranges differ widely, but they can be normalized using the mean and standard deviation. I first experimented with a set of 10 metrics (on_cpu, retire, frontend, backend, speculation, IPC, GHz, float-density, branch-density, smt-contention) before settling on just the first five.
Some of these metrics are correlated, which is why I figured it wouldn't add value to use both the AMD and Intel counters. Others, such as the I/O metrics, are quite orthogonal and likely useful in a broader context, but I haven't studied them enough to characterize them clearly.
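The per-metric normalization mentioned above (subtract the mean, divide by the standard deviation, so metrics with very different units become comparable) can be sketched in plain Python; the metric names and sample values here are illustrative, not the actual dataset:

```python
from statistics import mean, stdev

def zscore(values):
    """Normalize one metric to zero mean and unit standard deviation,
    so metrics with different ranges contribute comparably to distance."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical per-benchmark samples for two metrics with very
# different raw ranges; after normalization both are comparable.
on_cpu = [1.0, 6.3, 15.9, 6.7]
backend_stall = [4.1, 38.2, 97.4, 42.9]
norm = {"on_cpu": zscore(on_cpu), "backend-stall": zscore(backend_stall)}
```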
| metric | count | min | max | median | mean | stddev |
|---|---|---|---|---|---|---|
| elapsed | 247 | 3.53 | 8.76e+03 | 532 | 1.15e+03 | 1.53e+03 |
| on_cpu | 247 | 0 | 15.9 | 6.34 | 6.68 | 5.42 |
| inblock | 247 | 0 | 4.46e+05 | 0 | 3.65e+03 | 3.92e+04 |
| onblock | 247 | 0.46 | 8.97e+05 | 131 | 1.57e+04 | 7.91e+04 |
| page-fault | 247 | 2.6 | 1.31e+05 | 2.1e+03 | 1.17e+04 | 2.1e+04 |
| context-switch | 247 | 2.88 | 4.77e+04 | 66 | 2.49e+03 | 7.58e+03 |
| IPC | 247 | 0.01 | 4.59 | 1.51 | 1.64 | 0.888 |
| GHz | 247 | 0 | 5.23 | 1.87 | 1.82 | 1.4 |
| retire-rate | 247 | 0.7 | 76.2 | 29.2 | 32.1 | 16 |
| frontend-stall | 247 | 0.1 | 72.8 | 16.3 | 21.7 | 18 |
| backend-stall | 247 | 4.1 | 97.4 | 38.2 | 42.9 | 21.9 |
| spec-stall | 247 | 0 | 21.3 | 2 | 3.28 | 3.78 |
| retire-ucode | 247 | 0 | 1.2 | 0 | 0.0769 | 0.141 |
| retire-fastpath | 247 | 0.7 | 76.2 | 25.5 | 27.7 | 14.5 |
| float-density | 247 | 0.016 | 676 | 62.9 | 129 | 151 |
| frontend-latency | 247 | 0.1 | 57.8 | 8.4 | 13.4 | 13.6 |
| frontend-bandwidth | 247 | 0 | 31.7 | 4.8 | 5.83 | 4.8 |
| opcache-miss | 247 | 0 | 65.9 | 6.1 | 13.4 | 15.1 |
| icache-miss | 247 | 0.1 | 69 | 13.4 | 16.8 | 11.5 |
| backend-cpu | 247 | 0.7 | 64 | 9.4 | 12.8 | 11.4 |
| backend-memory | 247 | 0.2 | 95.9 | 19.9 | 24.6 | 18.1 |
| amd-l2-miss | 247 | 0.05 | 67.5 | 17.1 | 18.3 | 13 |
| amd-l2-density | 247 | 0.022 | 4.29e+04 | 35.1 | 229 | 2.72e+03 |
| spec-branch | 247 | 0 | 21.2 | 1.7 | 2.82 | 3.6 |
| spec-pipeline | 247 | 0 | 2 | 0 | 0.113 | 0.249 |
| branch-miss | 247 | 0 | 14.8 | 1.85 | 2.72 | 2.99 |
| branch-density | 247 | 4.9 | 317 | 128 | 130 | 61.8 |
| branch-cond | 247 | 4.5 | 311 | 92.3 | 98.3 | 48.8 |
| branch-ind | 247 | 0.003 | 28.7 | 2.85 | 4.44 | 5.12 |
| smt-contention | 247 | 0 | 48.4 | 9.6 | 12.7 | 13.3 |

The rows above (247 samples, with the amd-l2-* counters) are from the AMD run; the table below (238 samples, with the intel-l2-* counters) is from the Intel run.

| metric | count | min | max | median | mean | stddev |
|---|---|---|---|---|---|---|
| elapsed | 238 | 16 | 1.69e+04 | 655 | 1.66e+03 | 2.5e+03 |
| on_cpu | 238 | 0 | 15.8 | 6.77 | 7 | 5.54 |
| inblock | 238 | 0.01 | 3.86e+05 | 65.8 | 6.31e+03 | 3.54e+04 |
| onblock | 238 | 0.37 | 6.35e+05 | 19.9 | 1.1e+04 | 5.18e+04 |
| page-fault | 238 | 4.07 | 1.21e+05 | 1.68e+03 | 1.06e+04 | 2.02e+04 |
| context-switch | 238 | 2.23 | 6.44e+04 | 62.4 | 2.49e+03 | 8.92e+03 |
| IPC | 238 | 0.01 | 5.53 | 1.89 | 2.02 | 1.01 |
| GHz | 238 | 0 | 3.04 | 1.32 | 1.2 | 0.88 |
| retire-rate | 238 | 3.7 | 87.3 | 43.2 | 43.4 | 15.4 |
| frontend-stall | 238 | 0.5 | 52 | 15.9 | 17.8 | 11 |
| backend-stall | 238 | 1.3 | 95.3 | 26.2 | 30.9 | 20.2 |
| spec-stall | 238 | 0 | 46.7 | 6.1 | 8.38 | 8.3 |
| retire-ucode | 238 | 0 | 16.7 | 2.9 | 3.26 | 2.26 |
| retire-fastpath | 238 | 2.4 | 83.2 | 39.6 | 40.2 | 14.9 |
| frontend-latency | 238 | 0.3 | 39.2 | 8.6 | 9.57 | 6.73 |
| frontend-bandwidth | 238 | 0.1 | 25.8 | 7.3 | 8.2 | 5.84 |
| backend-cpu | 238 | 0.6 | 76.1 | 11.5 | 15.5 | 12.4 |
| backend-memory | 238 | 0.3 | 89.9 | 10.7 | 15.4 | 16.1 |
| l1-stall | 238 | 0 | 29.6 | 4.1 | 5.53 | 5.66 |
| l2-stall | 238 | 0 | 58.8 | 7.1 | 7.84 | 8.18 |
| l3-stall | 238 | 0 | 35 | 2.3 | 3.98 | 5.42 |
| dram-stall | 238 | 0 | 86.9 | 3.3 | 7.46 | 11.7 |
| store-stall | 238 | 0 | 28.3 | 0.8 | 1.7 | 3.11 |
| intel-l2-miss | 238 | 0.35 | 92.7 | 29.1 | 30.1 | 18.2 |
| intel-l2-density | 238 | 0.019 | 2.12e+04 | 21.9 | 126 | 1.37e+03 |
| spec-branch | 238 | 0 | 46.7 | 5.7 | 7.95 | 8.3 |
| spec-pipeline | 238 | 0 | 6.2 | 0.3 | 0.427 | 0.652 |
| branch-miss | 238 | 0 | 20.4 | 0.83 | 1.57 | 2.32 |
| branch-density | 238 | 6.24 | 320 | 129 | 129 | 61 |
| branch-cond | 238 | 6.24 | 320 | 129 | 129 | 61 |
| branch-ind | 238 | 0.027 | 83 | 20.8 | 22.1 | 17.4 |
Clustering algorithm
After a web search, I ended up with a variation of Lloyd's algorithm (the classic k-means iteration). It is relatively straightforward and settles fairly quickly on a set of clusters. In summary, it goes through the following steps:
- Start with an initial set of cluster points (normally chosen at random); I picked every Nth benchmark as my starting points.
- Assign each benchmark to the closest cluster point, using the Euclidean distance in N dimensions: sqrt(d1^2 + d2^2 + ... + dN^2).
- Recompute each cluster point as the center (mean) of the benchmarks assigned to it.
- Iterate the last two steps until the assignment converges on a stable set of clusters.
This took ~8 iterations when I tried it on ~240 Phoronix tests along with the 23 SPEC CPU 2017 benchmarks.
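The steps above can be sketched in plain Python. The every-Nth seeding and Euclidean distance follow the description; the function and variable names are mine, not from the actual implementation:

```python
import math

def lloyd(points, k, max_iter=50):
    """k-means via Lloyd's algorithm. `points` is a list of equal-length
    feature vectors (already normalized). Seeds with every Nth point,
    as described above, rather than random picks."""
    step = max(1, len(points) // k)
    centers = [list(points[i * step]) for i in range(k)]

    def dist(a, b):  # Euclidean distance in N dimensions
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    assign = None
    for _ in range(max_iter):
        # Assign each point to its nearest cluster center.
        new_assign = [min(range(k), key=lambda c: dist(p, centers[c]))
                      for p in points]
        if new_assign == assign:  # converged: no point changed cluster
            break
        assign = new_assign
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members)
                              for col in zip(*members)]
    return assign, centers
```

On a toy 2-D input such as `[[0,0],[0,1],[1,0],[10,10],[10,11],[11,10]]` with `k=2`, this converges in a couple of iterations to the two obvious groups.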
With these clusters, I can spread a benchmark analysis across a wide range of different benchmarks, and also identify similar benchmarks that might substitute for each other. For example, cluster #7 from the list above collects a set of build benchmarks, while cluster #1 looks like a set of backend-bound benchmarks running on all cores.
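One way to act on the substitution idea is to pick, for each cluster, the benchmark whose feature vector sits closest to the cluster centroid and treat it as the group's representative. A sketch, assuming the assignment list and centers produced by a k-means run plus a parallel list of benchmark names (all names here are hypothetical):

```python
import math

def representatives(names, points, assign, centers):
    """For each cluster, return the benchmark whose feature vector is
    nearest the cluster centroid -- a natural stand-in for the group."""
    best = {}
    for name, p, c in zip(names, points, assign):
        d = math.sqrt(sum((x - y) ** 2 for x, y in zip(p, centers[c])))
        if c not in best or d < best[c][0]:
            best[c] = (d, name)
    return {c: name for c, (d, name) in best.items()}
```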
