I have updated and enhanced the topdown tool, and used the occasion to explore Zen 4 topdown performance counters and Intel hybrid CPUs while building something to compare metrics from an Intel i5-13500H and an AMD Ryzen 7940 processor. The interface might change, but below are examples of what I currently collect with the improved “topdown” tool:
prompt% topdown -T phoronix-test-suite batch-run coremark
... output from phoronix deleted...
elapsed 82.260
on_cpu 0.697 # 11.15 / 16 cores
utime 916.661
stime 0.242
nvcsw 1132 # 17.98%
nivcsw 5164 # 82.02%
inblock 0
onblock 1056
cpu-clock 916903184175 # 916.903 seconds
task-clock 916906647071 # 916.907 seconds
page faults 74549 # 81.305/sec
context switches 6488 # 7.076/sec
cpu migrations 202 # 0.220/sec
major page faults 0 # 0.000/sec
minor page faults 74549 # 81.305/sec
alignment faults 0 # 0.000/sec
emulation faults 0 # 0.000/sec
cpu-cycles 3464621073039 # 2.63 GHz
instructions 8049045984713 # 2.32 IPC
branches 1522094984210 # 18.91%
branch-misses 4704540398 # 0.31%
slots 3647019612782 #
retiring 2224473675479 # 61.0%
frontend 711279775555 # 19.5%
backend 544779039260 # 14.9%
speculation 166666296338 # 4.6%
slots 1823104942056 #
Here is the corresponding output from my Ryzen 7840 machine:
prompt% topdown -T phoronix-test-suite batch-run coremark
... output from phoronix deleted...
elapsed 85.541
on_cpu 0.745 # 11.92 / 16 cores
utime 1018.861
stime 0.459
nvcsw 1093 # 9.61%
nivcsw 10282 # 90.39%
inblock 0
onblock 1096
cpu-clock 1019331677633 # 1019.332 seconds
task-clock 1019337578594 # 1019.338 seconds
page faults 78468 # 76.979/sec
context switches 11572 # 11.352/sec
cpu migrations 141 # 0.138/sec
major page faults 2 # 0.002/sec
minor page faults 78466 # 76.977/sec
alignment faults 0 # 0.000/sec
emulation faults 0 # 0.000/sec
cpu-cycles 4365934085487 # 3.19 GHz
instructions 10082216971551 # 2.31 IPC
branches 1906657885465 # 18.91%
branch-misses 3041680542 # 0.16%
slots 8724903154470 #
retiring 3014426454611 # 34.5%
frontend 1218044327924 # 14.0%
backend 944914660522 # 10.8%
speculation 36368040752 # 0.4%
slots 8728117977618 #
smt-contention 3511505952833 # 40.2%
I expect to modify the interface some, but here is an explanation of what is being collected and how:
- Elapsed time is the wall-clock running time of the command
- The next six items come from an OS call to getrusage(2), which I print by default. The OS can provide information about a process tree, including:
- The amount of user time and system time.
- An “on_cpu” metric calculated from the CPU time (user plus system), the elapsed time, and the number of available cores – essentially, what percent of the time were all the cores scheduled for this application. Non-scheduled time might occur because:
- The app is single-threaded or perhaps doesn’t use all the threads in the CPU
- The process is not running because it is waiting for disk I/O or network
- Context switches both voluntary and involuntary
- Block input/output operations
- The next nine items come from the OS software counters. These let me see things like faults, context switches and cpu clocks.
- The next four items are derived from “generic” performance counters, expected to be available on any CPU, along with calculated metrics
- Note: The processor has a limited number of hardware performance counters (six, or perhaps five), and reports how long each counter was enabled vs. actually running so you can scale as necessary. In my example, I am running three such groups – one of them with cpu-cycles, instructions, branches and branch-misses.
- The cpu-cycles counter and elapsed time let us calculate the effective GHz at which we were running
- The instructions and cpu-cycles counters let us calculate IPC (instructions per cycle)
- The branches and instructions counters let us calculate the “branchiness” of the code
- The branch-misses counter tells us how often branches are mispredicted
- The next items are the top-down performance counters. I use different counters for each processor, and the buckets differ too: Intel uses four buckets <retiring, frontend, backend, bad-speculation>, while AMD uses five <retiring, frontend, backend, bad-speculation, smt-contention>. When smt-contention is high, I may remove it to compare AMD against Intel, but it also points at areas to explore further.
A few additional things I’ve noticed in adding these metrics:
- Documentation says the AMD processor has six performance counters. However, when I set up and read six counters as a group, the sixth one somehow reads as 0. This is why I added a second multiplexed block to read it. That block also gives me room to read additional top-down metrics (e.g., another four topdown-related counters for both AMD and Intel), or, if I decide to cross-compare Intel/AMD by dividing up smt-contention, I can remove it.
- The Intel processor is a “hybrid” processor with both performance and efficiency cores. As best I can tell, I am only reading from the performance cores; attempts to read partial results elsewhere give me bad reads.
Now that I have a basic top-down tool running, there are several ways I am considering enhancing it for additional experiments before using it to measure various workloads:
- I can look at additional collections of counters, e.g., going deeper into the top-down metrics or doing specialized studies of different parts of the microarchitecture: caches, TLBs, the uop-cache, etc.
- I would like to add CSV output to make it easier to export key metrics to a table, e.g., so they can be compared with results from other tools
- I would like to add periodic output; combined with CSV output, this would let me see how the metrics vary as the program runs
I expect to evolve the tool and the workload analysis together as I work through some performance studies.
