Performance analysis, tools and experiments
Posted on by mev
I was surprised at the narrow range of the opcache hit/miss rate and the related icache metrics.
As it turns out there is an obvious explanation. The metrics were being measured as expected, but my addition to the script was always running an invalid phoronix-test-suite configuration, so I was consistently measuring an invalid run. I have now restarted the (many) runs and expect to eventually have more interesting metrics.
After updating and rerunning tests, I now see the following distributions which make a lot more sense than what I saw before…
Performance analysis, tools and experiments
Posted on by mev
I now have the ability to create summary histograms characterizing the workloads. These are (re)generated as I update the performance reports, but the following are the values with ~170 workloads included. Walking through the histograms and what they describe…
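As a minimal sketch of the binning behind these summary histograms (the bin edges and run times below are illustrative, not the real measurement data):

```python
# Sketch: bucket per-workload metric values into summary-histogram bins.
# Values and edges are made up for illustration.

def histogram(values, edges):
    """Count how many values fall into each [edges[i], edges[i+1]) bin."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

elapsed_seconds = [12, 45, 90, 300, 1800, 7200]   # hypothetical run times
edges = [0, 60, 600, 3600, 36000]                 # <1 min, <10 min, <1 h, <10 h
print(histogram(elapsed_seconds, edges))          # [2, 2, 1, 1]
```

The same binning applies to any of the per-workload metrics below; only the edges change.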
Most of the runs are fairly quick, though a few benchmarks run for up to several hours. The elapsed time covers one benchmark invocation, which often runs the workload three times; I then repeat each benchmark ~6 times collecting various metrics.
The distribution of workloads shows a small number of single-threaded workloads, a cluster around the number of cores without hyperthreading, and then some that use as many cores as possible.
The number of page faults has a few outliers that are interesting for their own analysis: octave-benchmark, gimp, lulesh, openjpeg, tungsten… are these bringing file information into memory and operating on it? There is a similar story with context switches and stress-ng, wireguard, compress-rar, which I assume are all more interrupt-driven than CPU-bound.
IPC shows a lower range than I expected, but presumably some of these workloads can't take as much advantage of the core-bound aspects.
The picture is similar for GHz, which I calculate as the number of cycles divided by seconds. For some of those on the low end, the behavior is similar to stream – waiting on memory traffic or a similar reason? I assume some others hit power limitations. Given how dynamic power is, I assume the combination of IPC and GHz matters more than either alone – perhaps try an X/Y scatter plot with both variables?
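As a sketch, both derived metrics come straight from the raw counters (the counter values below are illustrative, not real measurements):

```python
# Sketch: derive IPC and effective GHz from perf-style counters.
# Counter values are made up for illustration.

def derived_metrics(instructions, cycles, seconds):
    ipc = instructions / cycles       # retired instructions per cycle
    ghz = cycles / seconds / 1e9      # effective clock: cycles per wall second
    return ipc, ghz

ipc, ghz = derived_metrics(instructions=8.0e11, cycles=5.0e11, seconds=120.0)
print(f"IPC={ipc:.2f}  GHz={ghz:.2f}")   # IPC=1.60  GHz=4.17
```

Note that on a multi-core run the cycle count aggregates across cores, so this "GHz" is an effective rate rather than any single core's clock.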
Retirement rate, as a percent of available slots, shows more of a bell curve.
Frontend stalls show a falling-off distribution; the workloads at the high end might be a subset to dive into more deeply.
Backend stalls are more of a bell curve, with a minimal amount for every workload and a small subset with a very high percentage.
Speculative stalls are low for most workloads, with a small number of outliers.
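The four metrics above are the topdown level-1 breakdown, each expressed as a percent of available pipeline slots. A minimal sketch of that normalization, with hypothetical slot counts (a real run would read the vendor's pipeline-slot events rather than these made-up numbers):

```python
# Sketch: topdown level-1 breakdown as a percent of available slots.
# Slot counts are hypothetical, not real counter values.

def topdown_level1(retiring, frontend, backend, bad_spec):
    total = retiring + frontend + backend + bad_spec
    return {name: 100.0 * v / total
            for name, v in [("retiring", retiring), ("frontend", frontend),
                            ("backend", backend), ("bad_spec", bad_spec)]}

pct = topdown_level1(retiring=3.2e11, frontend=2.2e11,
                     backend=4.2e11, bad_spec=0.4e11)
print(pct)   # the four categories sum to 100%
```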
Float density is split: a peak at the low end for code with little floating point, and the rest spread across a distribution.
Both the opcache and the icache miss rates surprise me, mostly in how narrow the range of miss rates is. It seems this doesn't contribute to frontend stalls by itself as much as other factors do, e.g. the TLB? Separately, is the miss rate the right metric, or is there a more distilled one?
Related picture with the icache miss rates.
The L2 cache density (accesses per 1000 instructions) shows where various benchmarks use the L2.
Branch miss rates have a similar distribution as frontend stalls with most having a low miss rate and then a tail of a few benchmarks with higher miss rates.
How branchy the code is, as determined by the number of retired branches per 1000 instructions.
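Both the L2 and branch histograms use the same per-1000-instruction normalization, which is a one-liner (the counts below are made up):

```python
# Sketch: express an event count as a density per 1000 retired instructions,
# the normalization used for the L2 and branch histograms. Counts are made up.

def per_kilo_instructions(event_count, instructions):
    return 1000.0 * event_count / instructions

branch_density = per_kilo_instructions(event_count=1.5e10, instructions=1.0e11)
print(branch_density)   # 150.0 retired branches per 1000 instructions
```

Normalizing by instructions rather than time makes the density comparable across workloads with very different run lengths and clock rates.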
SMT contention is the number of slots going to the “other” core in a hyperthread. The large bar on the left reflects both single-threaded workloads and the MPI workloads pinned to physical cores.
There is a similar set of Intel 13500H benchmark plots. I won’t include them here because they reflect similar profiles (fortunately).
Overall, the histograms provide a nice summary of a population of workloads (phoronix), and it would also be interesting to compare/contrast with different workload suites such as SPEC. It could also be interesting to aggregate the subset of benchmarks used for a specific article, or to dive deeper on the outliers to understand what drives them and how best to optimize. So many different avenues opened from this…
Performance analysis, tools and experiments
Posted on by mev
After adding general parsing of measurement statistics, I can now also create a statistical summary across all ~170 benchmarks as shown below. This lets me see, for example, the minimum IPC, maximum IPC, mean IPC and standard deviation. This will then give some indication of whether a particular workload is “low” or “high” in a metric and by how much.
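The summary itself is straightforward; as a sketch over a hypothetical set of per-benchmark IPC values (not the real ~170-benchmark data):

```python
# Sketch: min/max/mean/stdev summary over per-benchmark metric values.
# The IPC list is illustrative, not real data.
from statistics import mean, stdev

ipc = [0.6, 0.9, 1.1, 1.4, 1.6, 2.1, 2.4]
summary = {"min": min(ipc), "max": max(ipc),
           "mean": mean(ipc), "stdev": stdev(ipc)}
print(summary)
```

The same dict can be computed per metric column to build the whole table.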
The statistics below come from the workload statistics with AMD metrics first followed by Intel metrics. For example, based on this table we can see mean values for topdown metrics:
Retirement go from 0.8% to 76.2% with a mean of 32.3% and standard deviation of 15.6%. A retirement rate over 64.5% would be two standard deviations above the mean.
Frontend stalls go from 0.1% to 73% with a mean of 22.5% and a standard deviation of 17.5%.
Backend stalls go from 4.1% to 97.1% with a mean of 41.6% and a standard deviation of 21.3%.
Speculative stalls go from 0% to 21.2% with a mean of 3.56% and a standard deviation of 3.95%.
These numbers are recalculated as the reports are regenerated, but with ~170 workloads now mostly included, they provide a good first overview of how the workloads behave on my AMD 7840.
Some next steps include flagging the outliers in the metrics and seeing how I can create histograms for the different fields below.
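The outlier flagging could be as simple as marking workloads more than two standard deviations from the population mean, per the retirement-rate example above. A minimal sketch, with hypothetical workload names and values:

```python
# Sketch of planned outlier flagging: mark workloads whose metric sits
# more than two standard deviations from the population mean.
# Names and values are hypothetical.
from statistics import mean, stdev

samples = {"b%02d" % i: v for i, v in
           enumerate([28, 29, 30, 30, 30, 31, 31, 32, 33, 90])}

mu, sigma = mean(samples.values()), stdev(samples.values())
outliers = sorted(name for name, v in samples.items()
                  if abs(v - mu) > 2 * sigma)
print(outliers)   # only the far-out 90% workload is flagged
```

One caveat: with small populations a single extreme value inflates the standard deviation, so a two-sigma cut can miss obvious outliers; a robust measure (e.g. median-based) may flag them more reliably.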
Performance analysis, tools and experiments
Posted on by mev
I have been maintaining a table by hand with various performance metrics – both on the website and separately in Google Sheets. In addition to the extra work required, I also, by nature, only include some of the columns. So …
Performance analysis, tools and experiments
Posted on by mev
Phoronix has published its roundup of benchmark/performance/review articles – https://www.phoronix.com/news/January-2024-Highlights Included were 10 articles with reviews and benchmarks. I’ve been keeping up with the CPU workloads listed and am now at >130 workloads total. I haven’t added GPU/graphics tests because I haven’t developed …