{"id":121,"date":"2023-12-23T17:51:57","date_gmt":"2023-12-23T17:51:57","guid":{"rendered":"https:\/\/mvermeulen.org\/perf\/?p=121"},"modified":"2023-12-23T17:51:58","modified_gmt":"2023-12-23T17:51:58","slug":"topdown-updated-tool-and-metrics","status":"publish","type":"post","link":"https:\/\/mvermeulen.org\/perf\/2023\/12\/23\/topdown-updated-tool-and-metrics\/","title":{"rendered":"topdown &#8211; updated tool and metrics"},"content":{"rendered":"\n<p>I have updated and enhanced the topdown tool, and used the occasion to explore the Zen4 topdown performance counters and the Intel hybrid CPU while building something to compare Intel i5-13500H and Ryzen 7940 processor metrics. The interface might change, but below are examples of what I currently collect with the improved &#8220;topdown&#8221; tool:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>prompt% topdown -T phoronix-test-suite batch-run coremark\n\n... output from phoronix deleted...\n\nelapsed              82.260\non_cpu               0.697          # 11.15 \/ 16 cores\nutime                916.661\nstime                0.242\nnvcsw                1132           # 17.98%\nnivcsw               5164           # 82.02%\ninblock              0\nonblock              1056\ncpu-clock            916903184175   # 916.903 seconds\ntask-clock           916906647071   # 916.907 seconds\npage faults          74549          # 81.305\/sec\ncontext switches     6488           # 7.076\/sec\ncpu migrations       202            # 0.220\/sec\nmajor page faults    0              # 0.000\/sec\nminor page faults    74549          # 81.305\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\ncpu-cycles           3464621073039  # 2.63 GHz\ninstructions         8049045984713  # 2.32 IPC\nbranches             1522094984210  # 18.91%\nbranch-misses        4704540398     # 0.31%\nslots                3647019612782  #\nretiring             2224473675479  # 61.0%\nfrontend             
711279775555   # 19.5%\nbackend              544779039260   # 14.9%\nspeculation          166666296338   #  4.6%\nslots                1823104942056  #<\/code><\/pre>\n\n\n\n<p>Here is the corresponding output from my Ryzen 7840 machine:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>prompt% topdown -T phoronix-test-suite batch-run coremark\n\n... output from phoronix deleted...\n\n\nelapsed              85.541\non_cpu               0.745          # 11.92 \/ 16 cores\nutime                1018.861\nstime                0.459\nnvcsw                1093           # 9.61%\nnivcsw               10282          # 90.39%\ninblock              0\nonblock              1096\ncpu-clock            1019331677633   # 1019.332 seconds\ntask-clock           1019337578594   # 1019.338 seconds\npage faults          78468          # 76.979\/sec\ncontext switches     11572          # 11.352\/sec\ncpu migrations       141            # 0.138\/sec\nmajor page faults    2              # 0.002\/sec\nminor page faults    78466          # 76.977\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\ncpu-cycles           4365934085487  # 3.19 GHz\ninstructions         10082216971551 # 2.31 IPC\nbranches             1906657885465  # 18.91%\nbranch-misses        3041680542     # 0.16%\nslots                8724903154470  #\nretiring             3014426454611  # 34.5%\nfrontend             1218044327924  # 14.0%\nbackend              944914660522   # 10.8%\nspeculation          36368040752    #  0.4%\nslots                8728117977618  #\nsmt-contention       3511505952833  # 40.2%<\/code><\/pre>\n\n\n\n<p>I expect to modify the interface somewhat, but the following explains what is collected and how:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Elapsed time is the wall-clock running time of the command<\/li>\n\n\n\n<li>The next six items come from an OS call to getrusage(2).  I print these by default.   
The OS can provide information about a process tree, including\n<ul class=\"wp-block-list\">\n<li>The amount of user time and system time.<\/li>\n\n\n\n<li>An &#8220;on_cpu&#8221; metric calculated from the amount of user time, the elapsed time and the number of available cores &#8211; essentially what fraction of the time all the cores were scheduled for this application.  Non-scheduled time might occur because\n<ul class=\"wp-block-list\">\n<li>The app is single-threaded or perhaps doesn&#8217;t use all the threads in the CPU<\/li>\n\n\n\n<li>The process is not running because it is waiting for disk I\/O or network<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Context switches, both voluntary and involuntary<\/li>\n\n\n\n<li>Block input\/output operations<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>The next nine items come from the OS software counters.  These let me see things like faults, context switches and cpu clocks.<\/li>\n\n\n\n<li>The next four items are &#8220;generic&#8221; performance counters, expected to be available on any CPU, together with metrics calculated from them\n<ul class=\"wp-block-list\">\n<li>Note: The processor has a limited number of hardware performance counters (6? 5?) and reports how long each counter was enabled vs. running, so you can scale the counts as necessary.  In my example, I am running three such groups &#8211; one of them with cpu-cycles, instructions, branches and branch-misses.<\/li>\n\n\n\n<li>The cpu-cycles counter, the elapsed time and the number of cores let us calculate the effective GHz we were running at<\/li>\n\n\n\n<li>The instructions and cpu-cycles let us calculate IPC (instructions per cycle)<\/li>\n\n\n\n<li>The branches and instructions let us calculate the &#8220;branchiness&#8221; of the code<\/li>\n\n\n\n<li>The branch-misses and branches let us calculate how often branches are mispredicted<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>The next items are the top-down performance counters.  I have different counters for each processor.  
In addition, Intel uses four buckets: &lt;retiring, frontend, backend, bad-speculation&gt; and AMD uses five buckets: &lt;retiring, frontend, backend, bad-speculation and smt-contention&gt;.  To compare between AMD and Intel I may remove smt-contention, but when it is high it also gives some areas to explore further.<\/li>\n<\/ul>\n\n\n\n<p>A few additional things I&#8217;ve noticed in adding these metrics:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Documentation says that the AMD processor has 6 performance counters.  However, when I set up and read six counters as a group, the sixth one reads as 0.  This is why I added a second multiplexed group to read it.  This also gives me an opportunity to read additional top-down metrics (e.g. another four topdown-related counters for both AMD and Intel) or, if I decide to cross-compare Intel\/AMD by dividing up SMT-contention, to remove it<\/li>\n\n\n\n<li>The Intel processor is a &#8220;hybrid&#8221; processor with both performance and efficiency cores.  As best I can tell, I only seem to be reading from the performance cores.  Attempts to read partial results elsewhere give me bad reads.<\/li>\n<\/ol>\n\n\n\n<p>Now that I have a basic top-down tool running, there are several areas where I am considering enhancing it for additional experiments before using it to measure various workloads:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>I can look at additional collections of counters, e.g. going deeper in top-down metrics or looking at specialized studies for different parts of the microarchitecture, e.g. caches, TLBs, uop-cache, etc.<\/li>\n\n\n\n<li>I would like to look at CSV output to make it easier to export key metrics to a table, e.g. so they can be compared with other tools<\/li>\n\n\n\n<li>I would like to look at creating periodic output, e.g. 
combined with CSV output this can let me see how metrics vary as the program runs<\/li>\n<\/ol>\n\n\n\n<p>I expect to work on the tool and the workload analysis together as I look at some performance studies.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I have updated and enhanced the topdown tool and also used this as an occasion to explore Zen4 topdown performance counters, Intel hybrid CPU while building something to compare Intel i5-13500H and Ryzen 7940 processor metrics. The interface might change, <span class=\"excerpt-dots\">&hellip;<\/span> <a class=\"more-link\" href=\"https:\/\/mvermeulen.org\/perf\/2023\/12\/23\/topdown-updated-tool-and-metrics\/\"><span class=\"more-msg\">Continue reading &rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5],"tags":[14,15,7],"class_list":["post-121","post","type-post","status-publish","format-standard","hentry","category-tools","tag-getrusage","tag-perf_event_open","tag-performance-counters"],"_links":{"self":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/posts\/121","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/comments?post=121"}],"version-history":[{"count":1,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/posts\/121\/revisions"}],"predecessor-version":[{"id":122,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/posts\/121\/revisions\/122"}],"wp:attachment":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/media?parent=121"}],"wp:term":[{"taxonomy":"cat
egory","embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/categories?post=121"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/tags?post=121"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}