Performance analysis, tools and experiments
Posted on by mev
I was surprised at the narrow range of the opcache hit/miss rate and the related icache metrics.
As it turns out there is an obvious explanation. The metrics were being measured as expected, but my addition to the script was always running an invalid phoronix-test-suite configuration, so I was consistently measuring an invalid run. I have now restarted the (many) runs and expect to eventually have more interesting metrics.
After updating and rerunning tests, I now see the following distributions which make a lot more sense than what I saw before…
Performance analysis, tools and experiments
Posted on by mev
I now have the ability to create summary histograms characterizing the workloads. These are (re)generated as I update the performance reports, but the following are the values with ~170 workloads included. Walking through the histograms and what they describe…
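As a minimal sketch of the binning behind these summary histograms (the bin edges and run times below are illustrative, not the real measurement data):

```python
# Sketch: bucket per-workload metric values into summary-histogram bins.
# Values and edges are made up for illustration.

def histogram(values, edges):
    """Count how many values fall into each [edges[i], edges[i+1]) bin."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

elapsed_seconds = [12, 45, 90, 300, 1800, 7200]   # hypothetical run times
edges = [0, 60, 600, 3600, 36000]                 # <1 min, <10 min, <1 h, <10 h
print(histogram(elapsed_seconds, edges))          # [2, 2, 1, 1]
```

The same binning applies to any of the per-workload metrics below; only the edges change.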
Most of the runs are fairly quick, though a few benchmarks run for up to several hours. The elapsed time covers one benchmark invocation, which often runs the workload three times; I then repeat each benchmark ~6 times collecting various metrics.
The distribution of workloads shows a small number of single-threaded workloads, a cluster around the number of cores without hyperthreading, and then some that use as many cores as possible.
The number of page faults has a few outliers that are interesting for their own analysis: octave-benchmark, gimp, lulesh, openjpeg, tungsten… are these bringing file information into memory and operating on it? There is a similar story with context switches and stress-ng, wireguard, compress-rar, which I assume are all more interrupt-driven than CPU-bound.
IPC shows a lower range than I expected, but presumably some of these workloads can't take as much advantage of the core-bound aspects.
The picture is similar for GHz, which I calculate as the number of cycles divided by seconds. For some of those on the low end, the behavior is similar to stream – waiting on memory traffic or a similar reason? I assume some others hit power limitations. Given how dynamic power is, I assume the combination of IPC and GHz matters more than either alone – perhaps try an X/Y scatter plot with both variables?
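As a sketch, both derived metrics come straight from the raw counters (the counter values below are illustrative, not real measurements):

```python
# Sketch: derive IPC and effective GHz from perf-style counters.
# Counter values are made up for illustration.

def derived_metrics(instructions, cycles, seconds):
    ipc = instructions / cycles       # retired instructions per cycle
    ghz = cycles / seconds / 1e9      # effective clock: cycles per wall second
    return ipc, ghz

ipc, ghz = derived_metrics(instructions=8.0e11, cycles=5.0e11, seconds=120.0)
print(f"IPC={ipc:.2f}  GHz={ghz:.2f}")   # IPC=1.60  GHz=4.17
```

Note that on a multi-core run the cycle count aggregates across cores, so this "GHz" is an effective rate rather than any single core's clock.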
Retirement rate, as a percent of available slots, shows more of a bell curve.
Frontend stalls show a falling-off distribution; the workloads at the high end might be a subset to dive into more deeply.
Backend stalls are more of a bell curve, with a minimal amount for every workload and a small subset with a very high percentage.
Speculative stalls are low for most workloads, with a small number of outliers.
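The four metrics above are the topdown level-1 breakdown, each expressed as a percent of available pipeline slots. A minimal sketch of that normalization, with hypothetical slot counts (a real run would read the vendor's pipeline-slot events rather than these made-up numbers):

```python
# Sketch: topdown level-1 breakdown as a percent of available slots.
# Slot counts are hypothetical, not real counter values.

def topdown_level1(retiring, frontend, backend, bad_spec):
    total = retiring + frontend + backend + bad_spec
    return {name: 100.0 * v / total
            for name, v in [("retiring", retiring), ("frontend", frontend),
                            ("backend", backend), ("bad_spec", bad_spec)]}

pct = topdown_level1(retiring=3.2e11, frontend=2.2e11,
                     backend=4.2e11, bad_spec=0.4e11)
print(pct)   # the four categories sum to 100%
```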
Float density is split: a peak at the low end for code with little floating point, and the rest spread across a distribution.
Both the opcache and the icache miss rates surprise me, mostly in how narrow the range of miss rates is. It seems this doesn't contribute to frontend stalls by itself as much as other factors do, e.g. the TLB? Separately, is the miss rate the right metric, or is there a more distilled one?
Related picture with the icache miss rates.
The L2 cache density (accesses per 1000 instructions) shows where various benchmarks use the L2.
Branch miss rates have a similar distribution as frontend stalls with most having a low miss rate and then a tail of a few benchmarks with higher miss rates.
How branchy the code is, as determined by the number of retired branches per 1000 instructions.
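Both the L2 and branch histograms use the same per-1000-instruction normalization, which is a one-liner (the counts below are made up):

```python
# Sketch: express an event count as a density per 1000 retired instructions,
# the normalization used for the L2 and branch histograms. Counts are made up.

def per_kilo_instructions(event_count, instructions):
    return 1000.0 * event_count / instructions

branch_density = per_kilo_instructions(event_count=1.5e10, instructions=1.0e11)
print(branch_density)   # 150.0 retired branches per 1000 instructions
```

Normalizing by instructions rather than time makes the density comparable across workloads with very different run lengths and clock rates.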
SMT contention is the number of slots going to the “other” core in a hyperthread. The large bar on the left reflects both single-threaded workloads and the MPI workloads pinned to physical cores.
There is a similar set of Intel 13500H benchmark plots. I won’t include them here because they reflect similar profiles (fortunately).
Overall, the histograms provide a nice summary of a population of workloads (phoronix), and it would also be interesting to compare/contrast with different workload suites such as SPEC. It could also be interesting to aggregate the subset of benchmarks used for a specific article, or to dive deeper on the outliers to understand what drives them and how best to optimize. So many different avenues opened from this…
Performance analysis, tools and experiments
Posted on by mev
After adding general parsing of measurement statistics, I can now also create a statistical summary across all ~170 benchmarks as shown below. This lets me see, for example, the minimum IPC, maximum IPC, mean IPC and standard deviation. This will then give some indication of whether a particular workload is “low” or “high” in a metric and by how much.
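The summary itself is straightforward; as a sketch over a hypothetical set of per-benchmark IPC values (not the real ~170-benchmark data):

```python
# Sketch: min/max/mean/stdev summary over per-benchmark metric values.
# The IPC list is illustrative, not real data.
from statistics import mean, stdev

ipc = [0.6, 0.9, 1.1, 1.4, 1.6, 2.1, 2.4]
summary = {"min": min(ipc), "max": max(ipc),
           "mean": mean(ipc), "stdev": stdev(ipc)}
print(summary)
```

The same dict can be computed per metric column to build the whole table.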
The statistics below come from the workload statistics with AMD metrics first followed by Intel metrics. For example, based on this table we can see mean values for topdown metrics:
Retirement go from 0.8% to 76.2% with a mean of 32.3% and standard deviation of 15.6%. A retirement rate over 64.5% would be two standard deviations above the mean.
Frontend stalls go from 0.1% to 73% with a mean of 22.5% and a standard deviation of 17.5%.
Backend stalls go from 4.1% to 97.1% with a mean of 41.6% and a standard deviation of 21.3%.
Speculative stalls go from 0% to 21.2% with a mean of 3.56% and a standard deviation of 3.95%.
These numbers are recalculated as the reports are regenerated, but with ~170 workloads now mostly included, they provide a good first overview of how the workloads behave on my AMD 7840.
Some next steps include flagging the outliers in the metrics and seeing how I can create histograms for the different fields below.
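The outlier flagging could be as simple as marking workloads more than two standard deviations from the population mean, per the retirement-rate example above. A minimal sketch, with hypothetical workload names and values:

```python
# Sketch of planned outlier flagging: mark workloads whose metric sits
# more than two standard deviations from the population mean.
# Names and values are hypothetical.
from statistics import mean, stdev

samples = {"b%02d" % i: v for i, v in
           enumerate([28, 29, 30, 30, 30, 31, 31, 32, 33, 90])}

mu, sigma = mean(samples.values()), stdev(samples.values())
outliers = sorted(name for name, v in samples.items()
                  if abs(v - mu) > 2 * sigma)
print(outliers)   # only the far-out 90% workload is flagged
```

One caveat: with small populations a single extreme value inflates the standard deviation, so a two-sigma cut can miss obvious outliers; a robust measure (e.g. median-based) may flag them more reliably.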
Performance analysis, tools and experiments
Posted on by mev
I have been maintaining a table by hand with various performance metrics – both on the website and separately in Google Sheets. In addition to the extra work required, I also, by nature, only include some of the columns. So …
Performance analysis, tools and experiments
Posted on by mev
Phoronix has published its roundup of benchmark/performance/review articles – https://www.phoronix.com/news/January-2024-Highlights Included were 10 articles with reviews and benchmarks. I’ve been keeping up with the CPU workloads listed and am now at >130 workloads total. I haven’t added GPU/graphics tests because I haven’t developed …