I have updated and enhanced the topdown tool, and used the occasion to explore Zen 4 topdown performance counters and Intel hybrid CPUs while building something to compare metrics from an Intel i5-13500H and an AMD Ryzen 7940 processor. The interface might change, but below are examples of what I currently collect with the improved “topdown” tool:
prompt% topdown -T phoronix-test-suite batch-run coremark
... output from phoronix deleted...
elapsed 82.260
on_cpu 0.697 # 11.15 / 16 cores
utime 916.661
stime 0.242
nvcsw 1132 # 17.98%
nivcsw 5164 # 82.02%
inblock 0
onblock 1056
cpu-clock 916903184175 # 916.903 seconds
task-clock 916906647071 # 916.907 seconds
page faults 74549 # 81.305/sec
context switches 6488 # 7.076/sec
cpu migrations 202 # 0.220/sec
major page faults 0 # 0.000/sec
minor page faults 74549 # 81.305/sec
alignment faults 0 # 0.000/sec
emulation faults 0 # 0.000/sec
cpu-cycles 3464621073039 # 2.63 GHz
instructions 8049045984713 # 2.32 IPC
branches 1522094984210 # 18.91%
branch-misses 4704540398 # 0.31%
slots 3647019612782 #
retiring 2224473675479 # 61.0%
frontend 711279775555 # 19.5%
backend 544779039260 # 14.9%
speculation 166666296338 # 4.6%
slots 1823104942056 #
Here is the corresponding output from my Ryzen 7840 machine:
prompt% topdown -T phoronix-test-suite batch-run coremark
... output from phoronix deleted...
elapsed 85.541
on_cpu 0.745 # 11.92 / 16 cores
utime 1018.861
stime 0.459
nvcsw 1093 # 9.61%
nivcsw 10282 # 90.39%
inblock 0
onblock 1096
cpu-clock 1019331677633 # 1019.332 seconds
task-clock 1019337578594 # 1019.338 seconds
page faults 78468 # 76.979/sec
context switches 11572 # 11.352/sec
cpu migrations 141 # 0.138/sec
major page faults 2 # 0.002/sec
minor page faults 78466 # 76.977/sec
alignment faults 0 # 0.000/sec
emulation faults 0 # 0.000/sec
cpu-cycles 4365934085487 # 3.19 GHz
instructions 10082216971551 # 2.31 IPC
branches 1906657885465 # 18.91%
branch-misses 3041680542 # 0.16%
slots 8724903154470 #
retiring 3014426454611 # 34.5%
frontend 1218044327924 # 14.0%
backend 944914660522 # 10.8%
speculation 36368040752 # 0.4%
slots 8728117977618 #
smt-contention 3511505952833 # 40.2%
I expect to modify the interface some, but here is an explanation of what is being collected and how:
- Elapsed time is the wall-clock running time of the command
- The next six items come from an OS call to getrusage(2), which I print by default. The OS can provide information about a process tree, including:
- The amount of user time and system time.
- An “on_cpu” metric calculated from the CPU time (user plus system), the elapsed time, and the number of available cores – essentially, what percent of the time were all the cores scheduled for this application. Non-scheduled time might occur because:
- The app is single-threaded or perhaps doesn’t use all the threads in the CPU
- The process is not running because it is waiting for disk I/O or network
- Context switches both voluntary and involuntary
- Block input/output operations
- The next nine items come from the OS software counters. These let me see things like faults, context switches and cpu clocks.
- The next four items are derived from “generic” performance counters, expected to be available on any CPU, along with calculated metrics
- Note: The processor has a limited number of hardware performance counters (six, or perhaps five), and reports how long each counter was enabled vs. actually running so you can scale as necessary. In my example, I am running three such groups – one of them with cpu-cycles, instructions, branches and branch-misses.
- The cpu-cycles counter and elapsed time let us calculate the effective GHz at which we were running
- The instructions and cpu-cycles counters let us calculate IPC (instructions per cycle)
- The branches and instructions counters let us calculate the “branchiness” of the code
- The branch-misses counter tells us how often branches are mispredicted
- The next items are the top-down performance counters. I use different counters for each processor, and the buckets differ too: Intel uses four buckets <retiring, frontend, backend, bad-speculation>, while AMD uses five <retiring, frontend, backend, bad-speculation, smt-contention>. When smt-contention is high, I may remove it to compare AMD against Intel, but it also points at areas to explore further.
A few additional things I’ve noticed in adding these metrics:
- Documentation says the AMD processor has six performance counters. However, when I set up and read six counters as a group, the sixth one somehow reads as 0. This is why I added a second multiplexed block to read it. That block also gives me room to read additional top-down metrics (e.g., another four topdown-related counters for both AMD and Intel), or, if I decide to cross-compare Intel/AMD by dividing up smt-contention, I can remove it.
- The Intel processor is a “hybrid” processor with both performance and efficiency cores. As best I can tell, I am only reading from the performance cores; attempts to read partial results elsewhere give me bad reads.
Now that I have a basic top-down tool running, there are several ways I am considering enhancing it for additional experiments before using it to measure various workloads:
- I can look at additional collections of counters, e.g., going deeper into the top-down metrics or doing specialized studies of different parts of the microarchitecture: caches, TLBs, the uop-cache, etc.
- I would like to add CSV output to make it easier to export key metrics to a table, e.g., so they can be compared with results from other tools
- I would like to add periodic output; combined with CSV output, this would let me see how the metrics vary as the program runs
I expect to evolve the tool and the workload analysis together as I work through some performance studies.
