{"id":1693,"date":"2024-02-11T02:07:29","date_gmt":"2024-02-11T02:07:29","guid":{"rendered":"https:\/\/mvermeulen.org\/perf\/?p=1693"},"modified":"2024-02-11T02:09:49","modified_gmt":"2024-02-11T02:09:49","slug":"histograms","status":"publish","type":"post","link":"https:\/\/mvermeulen.org\/perf\/2024\/02\/11\/histograms\/","title":{"rendered":"Histograms"},"content":{"rendered":"\n<p>I now have the ability to create summary histograms characterizing the workloads. These are (re)-generated as I update performance reports, but the following shows values with ~170 workloads included. Walking through the histograms and what they describe&#8230;<\/p>\n\n\n\n<p>Most of the runs are fairly quick, though I have a few benchmarks that run up to several hours.  This is elapsed time; a single run often executes the workload three times. I then run each benchmark ~6 times, collecting various metrics.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"400\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/02.elapsed.png\" alt=\"\" class=\"wp-image-1694\"\/><\/figure>\n\n\n\n<p>The distribution of workloads shows a small number of single-threaded workloads, a cluster around the number of cores w\/o hyperthreading, and then some that use as many cores as possible.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"400\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/03.on_cpu.png\" alt=\"\" class=\"wp-image-1695\"\/><\/figure>\n\n\n\n<p>The number of page faults has a few outliers that are interesting for their own analysis: octave-benchmark, gimp, lulesh, openjpeg, tungsten&#8230; are these bringing file information into memory and operating on it? 
There is a similar story with context switches for stress-ng, wireguard, and compress-rar, which I assume are all more interrupt-driven than CPU-bound.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"400\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/06.page-fault.png\" alt=\"\" class=\"wp-image-1696\"\/><\/figure>\n\n\n\n<p>IPC shows a range that is lower than I expected, but presumably some of these workloads can&#8217;t take full advantage of the core resources.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"400\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/08.IPC_.png\" alt=\"\" class=\"wp-image-1697\"\/><\/figure>\n\n\n\n<p>Similar picture for GHz, which I calculate as the number of cycles divided by elapsed seconds. For some of those on the low end, it is similar to stream &#8211; waiting on memory traffic or a similar reason? For some others, I assume we have power limitations. 
Given how dynamic power management is, I assume the combination of IPC and GHz is more important &#8211; perhaps try an X\/Y scatter plot with both variables?<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"400\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/09.GHz_.png\" alt=\"\" class=\"wp-image-1698\"\/><\/figure>\n\n\n\n<p>Retirement rate as a percent of available slots shows more of a bell curve.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"400\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/10.retire-rate.png\" alt=\"\" class=\"wp-image-1699\"\/><\/figure>\n\n\n\n<p>Frontend stalls show a diminishing distribution, where those at the high end might be a subset to dive into more deeply.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"400\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/11.frontend-stall.png\" alt=\"\" class=\"wp-image-1700\"\/><\/figure>\n\n\n\n<p>Backend stalls are more of a bell curve, with a minimal amount for most workloads and a small subset with a very high percentage.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"400\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/12.backend-stall.png\" alt=\"\" class=\"wp-image-1701\"\/><\/figure>\n\n\n\n<p>Speculative stalls are low for most workloads, with a small number of outliers.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"400\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/13.spec-stall.png\" alt=\"\" class=\"wp-image-1702\"\/><\/figure>\n\n\n\n<p>Float density shows up to half the workloads with little floating point 
and the rest spread across a distribution.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"400\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/16.float-density.png\" alt=\"\" class=\"wp-image-1703\"\/><\/figure>\n\n\n\n<p>Both the opcache and the i-cache miss rates surprise me, mostly in how narrow the range of miss rates is. It seems like this doesn&#8217;t contribute by itself to frontend stalls as much as other factors, e.g. TLB?  Separately, is the miss rate the right metric, or is there a more distilled one?<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"400\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/19.opcache-miss.png\" alt=\"\" class=\"wp-image-1704\"\/><\/figure>\n\n\n\n<p>Related picture with the icache miss rates.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"400\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/19.opcache-miss-1.png\" alt=\"\" class=\"wp-image-1705\"\/><\/figure>\n\n\n\n<p>The L2 cache density (accesses per 1000 instructions) shows where various benchmarks use L2.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"400\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/24.amd-l2-density.png\" alt=\"\" class=\"wp-image-1706\"\/><\/figure>\n\n\n\n<p>Branch miss rates have a similar distribution to frontend stalls, with most having a low miss rate and then a tail of a few benchmarks with higher miss rates.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"400\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/27.branch-miss.png\" alt=\"\" 
class=\"wp-image-1707\"\/><\/figure>\n\n\n\n<p>Branch density shows how branchy the code is, as determined by the number of retired branches per 1000 instructions.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"400\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/28.branch-density.png\" alt=\"\" class=\"wp-image-1708\"\/><\/figure>\n\n\n\n<p>SMT contention is the number of slots going to the &#8220;other&#8221; core in a hyperthread. The large bar on the left reflects both single-threaded workloads and those MPI workloads running on physical cores.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"400\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/31.smt-contention.png\" alt=\"\" class=\"wp-image-1709\"\/><\/figure>\n\n\n\n<p>There is a similar set of Intel 13500H benchmark plots.  I won&#8217;t include them here because they reflect similar profiles (fortunately).<\/p>\n\n\n\n<p>Overall, the histograms provide a nice summary of a population of workloads (phoronix); it would also be interesting to compare\/contrast with different workloads such as SPEC.  It could also be interesting to aggregate the subset of benchmarks used for a specific article, or to dive deeper on the outliers to understand how they affect things and how best to optimize.  So many different avenues opened from this&#8230;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I now have the ability to create summary histograms characterizing the workloads. These are (re)-generated as I update performance reports, but the following shows values with ~170 workloads included. 
Walking through the histograms and what they describe&#8230; Most of the runs <span class=\"excerpt-dots\">&hellip;<\/span> <a class=\"more-link\" href=\"https:\/\/mvermeulen.org\/perf\/2024\/02\/11\/histograms\/\"><span class=\"more-msg\">Continue reading &rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[11,3],"tags":[23,19,4],"class_list":["post-1693","post","type-post","status-publish","format-standard","hentry","category-experiment","category-website","tag-benchmarks","tag-gnuplot","tag-website"],"_links":{"self":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/posts\/1693","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/comments?post=1693"}],"version-history":[{"count":1,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/posts\/1693\/revisions"}],"predecessor-version":[{"id":1710,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/posts\/1693\/revisions\/1710"}],"wp:attachment":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/media?parent=1693"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/categories?post=1693"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/tags?post=1693"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}