Tools – Performance analysis, tools and experiments


Close to 100 workloads, adding thresholds

Posted on January 26, 2024 by mev

I have now added close to 100 Phoronix tests overall. Recent articles still include a number of new benchmarks: typically I already have about 2/3 of the ones in an article and then need to add the remaining ones. Over time, though, I should get closer to having all of them as articles come out.

With this many workloads, I can now start to set more precise thresholds for what it means to be “high” or “low” on a metric. These could be slightly different between my AMD and Intel CPUs:

IPC reported by Intel is slightly higher
Retirement rate reported by Intel is slightly higher
Frontend and backend stalls reported by Intel are slightly lower
Speculation misses reported by Intel are higher

Some of this might be due to differences in how the metrics are defined and counted, and some due to the processors themselves. For now, I’ve hard-coded thresholds into the tool that are the same for both, since these are mostly guidance. If my workload mix shifts over time or I find different values on other processors, I might adjust them. The initial guidelines are as follows:

Metric	High	Low
IPC	3.0	0.7
retiring	54%	14%
frontend	45%	5%
backend	70%	18%
speculation	10%	1%
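
A minimal sketch of how guideline thresholds like these could be applied to a measured value. This is not the code in the tool: the function name `classify`, the inclusive comparisons and the "normal" label are assumptions made for illustration, and the 10% / 1% row is taken to be the speculation threshold.

```python
# Thresholds copied from the table above as (high, low).
# Assumption: the 10% / 1% row is the speculation threshold.
THRESHOLDS = {
    "IPC": (3.0, 0.7),
    "retiring": (54.0, 14.0),
    "frontend": (45.0, 5.0),
    "backend": (70.0, 18.0),
    "speculation": (10.0, 1.0),
}


def classify(metric, value):
    """Return 'high', 'low' or 'normal' for a metric value.

    The inclusive comparisons are a choice made for this sketch.
    """
    high, low = THRESHOLDS[metric]
    if value >= high:
        return "high"
    if value <= low:
        return "low"
    return "normal"


if __name__ == "__main__":
    for name, value in [("IPC", 2.33), ("backend", 75.0), ("frontend", 3.0)]:
        print(f"{name:12s} {value:6.2f} {classify(name, value)}")
```

Keeping the thresholds in one table makes it easy to swap in per-vendor values later if AMD and Intel turn out to need different cut-offs.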

topdown – adding process trees and statistics

Posted on January 1, 2024 by mev

I have now enhanced the topdown tool with the ability to print process trees. This provides the key features of my previous “wspy” command.

The interface is as follows. I added these options to topdown to record process information:

	--tree <file>             - create CSV of processes
	--tree-cmdline            - record full command lines

The --tree option uses ptrace(2) to record fork/exec/exit events and saves the information to the file for later processing. An example of the information saved is as follows:

0.000 14119 root
0.002 14119 fork 14120
0.017 14120 comm cc1
0.017 14120 cmdline /usr/lib/gcc/x86_64-linux-gnu/11/cc1 -quiet -imultiarch x86_64-linux-gnu hello.c -quiet -dumpbase hello.c -dumpbase-ext .c -mtune=generic -march=x86-64 -fasynchronous-unwind-tables -fstack-protector-strong -Wformat -Wformat-security -fstack-clash-protection -fcf-protection -o /tmp/ccIyphnx.s
0.017 14120 exit 14120 (cc1) t 14119 14118 13158 34819 14118 1077936128 1221 0 0 0 1 0 0 0 20 0 1 0 1313576 46571520 3835 18446744073709551615 5890048 21573621 140722169364960 0 0 0 0 0 1256 1 0 0 17 18 0 0 0 0 0 30095936 30148584 59514880 140722169373230 140722169373523 140722169373523 140722169376723 0
0.018 14119 fork 14121
0.021 14121 comm as
0.021 14121 cmdline as --64 -o /tmp/cceku4R5.o /tmp/ccIyphnx.s
0.021 14121 exit 14121 (as) t 14119 14118 13158 34819 14118 1077936128 441 0 0 0 0 0 0 0 20 0 1 0 1313578 12435456 1332 18446744073709551615 94138168020992 94138168333961 140723056042896 0 0 0 0 0 1256 1 0 0 17 11 0 0 0 0 0 94138168430896 94138168453272 94138176069632 140723056051009 140723056051052 140723056051052 140723056054252 0
0.022 14119 fork 14124
0.023 14124 fork 14125
0.040 14125 comm ld
0.040 14125 cmdline /usr/bin/ld -plugin /usr/lib/gcc/x86_64-linux-gnu/11/liblto_plugin.so -plugin-opt=/usr/lib/gcc/x86_64-linux-gnu/11/lto-wrapper -plugin-opt=-fresolution=/tmp/ccWwmx4O.res -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s --build-id --eh-frame-hdr -m elf_x86_64 --hash-style=gnu --as-needed -dynamic-linker /lib64/ld-linux-x86-64.so.2 -pie -z now -z relro -o hello /usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu/Scrt1.o /usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu/crti.o /usr/lib/gcc/x86_64-linux-gnu/11/crtbeginS.o -L/usr/lib/gcc/x86_64-linux-gnu/11 -L/usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/11/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/11/../../.. /tmp/cceku4R5.o -lgcc --push-state --as-needed -lgcc_s --pop-state -lc -lgcc --push-state --as-needed -lgcc_s --pop-state /usr/lib/gcc/x86_64-linux-gnu/11/crtendS.o /usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu/crtn.o
0.040 14125 exit 14125 (ld) t 14124 14118 13158 34819 14118 1077936128 1732 0 0 0 1 0 0 0 20 0 1 0 1313578 16846848 2276 18446744073709551615 94366230806528 94366231102693 140724178835088 0 0 0 0 0 0 1 0 0 17 20 0 0 0 0 0 94366232461104 94366232495352 94366259900416 140724178836727 140724178837886 140724178837886 140724178841580 0
0.040 14124 comm collect2
0.040 14124 cmdline /usr/lib/gcc/x86_64-linux-gnu/11/collect2 -plugin /usr/lib/gcc/x86_64-linux-gnu/11/liblto_plugin.so -plugin-opt=/usr/lib/gcc/x86_64-linux-gnu/11/lto-wrapper -plugin-opt=-fresolution=/tmp/ccWwmx4O.res -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s --build-id --eh-frame-hdr -m elf_x86_64 --hash-style=gnu --as-needed -dynamic-linker /lib64/ld-linux-x86-64.so.2 -pie -z now -z relro -o hello /usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu/Scrt1.o /usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu/crti.o /usr/lib/gcc/x86_64-linux-gnu/11/crtbeginS.o -L/usr/lib/gcc/x86_64-linux-gnu/11 -L/usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/11/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/11/../../.. /tmp/cceku4R5.o -lgcc --push-state --as-needed -lgcc_s --pop-state -lc -lgcc --push-state --as-needed -lgcc_s --pop-state /usr/lib/gcc/x86_64-linux-gnu/11/crtendS.o /usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu/crtn.o
0.040 14124 exit 14124 (collect2) t 14119 14118 13158 34819 14118 1077936128 85 1732 0 0 0 0 1 0 20 0 1 0 1313578 8839168 250 18446744073709551615 4202496 4414097 140733621318048 0 0 0 0 0 9287 1 0 0 17 19 0 0 0 0 0 4488704 4494640 30085120 140733621320891 140733621322080 140733621322080 140733621325774 0
0.041 14119 comm gcc
0.041 14119 cmdline gcc -o hello hello.c
0.041 14119 exit 14119 (gcc) t 14118 14118 13158 34819 14118 1077936128 135 3479 0 0 0 0 2 1 20 0 1 0 1313576 9617408 251 18446744073709551615 4206592 4563953 140732373361024 0 0 0 0 0 20483 1 0 0 17 14 0 0 0 0 0 5114624 5124176 36139008 140732373369904 140732373369925 140732373369925 140732373372907 0

The “exit” event captures the contents of /proc/<pid>/stat when the process exits. I am not sure whether this is reliable for hundreds of thousands of processes, but for smaller examples with several hundred processes it works fine. If the --tree-cmdline option is given, then we also capture /proc/<pid>/cmdline when the process exits.
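
As an illustration of the record layout, here is a small sketch that parses one line of the file and pulls a few fields out of the saved /proc/<pid>/stat text. The function names are hypothetical and this is not how proctree is implemented; the field positions (ppid is field 4, utime field 14, stime field 15) follow proc(5).

```python
def parse_exit_stat(stat_text):
    """Split the /proc/<pid>/stat text saved with an "exit" event.

    The command name sits in parentheses and may itself contain spaces,
    so split around the last ')' before splitting the remaining fields.
    """
    left, _, right = stat_text.rpartition(")")
    pid_text, _, comm = left.partition("(")
    rest = right.split()  # rest[0] is field 3 (state), so field n is rest[n - 3]
    return {
        "pid": int(pid_text),
        "comm": comm,
        "ppid": int(rest[1]),    # field 4
        "utime": int(rest[11]),  # field 14, in clock ticks
        "stime": int(rest[12]),  # field 15, in clock ticks
    }


def parse_event(line):
    """Parse one "<seconds> <pid> <event> [details]" line of the tree file."""
    fields = line.split(None, 3)
    record = {
        "time": float(fields[0]),
        "pid": int(fields[1]),
        "event": fields[2],
        "detail": fields[3] if len(fields) > 3 else "",
    }
    if record["event"] == "exit":
        record["stat"] = parse_exit_stat(record["detail"])
    return record


if __name__ == "__main__":
    # Leading fields of the cc1 exit record shown above; the tail is omitted.
    sample = ("0.017 14120 exit 14120 (cc1) t 14119 14118 13158 34819 14118 "
              "1077936128 1221 0 0 0 1 0 0 0 20")
    print(parse_event(sample))
```

Splitting around the last closing parenthesis matters because a command name can contain spaces or parentheses of its own.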

This data file can then be processed with the proctree program, which has the following options:

./source/wspy/proctree: fatal error: usage: ./source/wspy/proctree -[CcFfSsTtuv][-w width] file
	-C	turn on longer command line
	-c	turn on abbreviated command (default)
	-F	turn on start/finish info (default)
	-f	turn off start/finish info
	-S	turn on summary output
	-s	turn off summary output (default)
	-T	turn on tree output (default)
	-t	turn off tree output
	-U	turn off utime in tree
	-u	turn on utime in tree
	-v	verbose messages
	-w width	set command width

Here is a basic output with both summary statistics and tree information:

5 processes
	  1 cc1                      0.01     0.00
	  1 ld                       0.01     0.00
	  1 as                       0.00     0.00
	  1 collect2                 0.00     0.00
	  1 gcc                      0.00     0.00
0 processes running
3 maximum processes

14119) gcc start=0.00  finish=0.04 
  14120) cc1 start=0.00  finish=0.02 
  14121) as start=0.02  finish=0.02 
  14124) collect2 start=0.02  finish=0.04 
    14125) ld start=0.02  finish=0.04
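
The tree above follows directly from the fork, comm and exit events in the file. The sketch below rebuilds it from a hand-transcribed copy of those events; the tuple layout and function names are assumptions for illustration, not the proctree implementation, and the output spacing is not meant to match proctree exactly.

```python
from collections import defaultdict

# (seconds, pid, event, detail) transcribed from the sample file above;
# the cmdline and /proc stat payloads are left out.
EVENTS = [
    (0.000, 14119, "root", ""),
    (0.002, 14119, "fork", "14120"),
    (0.017, 14120, "comm", "cc1"),
    (0.017, 14120, "exit", ""),
    (0.018, 14119, "fork", "14121"),
    (0.021, 14121, "comm", "as"),
    (0.021, 14121, "exit", ""),
    (0.022, 14119, "fork", "14124"),
    (0.023, 14124, "fork", "14125"),
    (0.040, 14125, "comm", "ld"),
    (0.040, 14125, "exit", ""),
    (0.040, 14124, "comm", "collect2"),
    (0.040, 14124, "exit", ""),
    (0.041, 14119, "comm", "gcc"),
    (0.041, 14119, "exit", ""),
]


def build_tree(events):
    """Return (root pid, children by parent pid, per-process info)."""
    children = defaultdict(list)
    info = {}
    root = None
    for when, pid, event, detail in events:
        if event == "root":
            root = pid
            info[pid] = {"comm": "?", "start": when, "finish": when}
        elif event == "fork":
            child = int(detail)
            children[pid].append(child)
            info[child] = {"comm": "?", "start": when, "finish": when}
        elif event == "comm":
            info[pid]["comm"] = detail
        elif event == "exit":
            info[pid]["finish"] = when
    return root, children, info


def format_tree(pid, children, info, depth=0):
    """Return indented lines for pid and all of its descendants."""
    proc = info[pid]
    lines = [f"{'  ' * depth}{pid}) {proc['comm']} "
             f"start={proc['start']:.2f} finish={proc['finish']:.2f}"]
    for child in children[pid]:
        lines.extend(format_tree(child, children, info, depth + 1))
    return lines


if __name__ == "__main__":
    root, children, info = build_tree(EVENTS)
    print("\n".join(format_tree(root, children, info)))
```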

We can see more of the command line by adding the -C switch and also increasing the -w width:

5 processes
	  1 cc1                      0.01     0.00
	  1 ld                       0.01     0.00
	  1 as                       0.00     0.00
	  1 collect2                 0.00     0.00
	  1 gcc                      0.00     0.00
0 processes running
3 maximum processes

14119) gcc -o hello hello.c start=0.00  finish=0.04 
  14120) /usr/lib/gcc/x86_64-linux-gnu/11/cc1 -quiet -imultiarch x86_64-linux-gnu hello.c -quiet -dumpbase hello.c -dum start=0.00  finish=0.02 
  14121) as --64 -o /tmp/cceku4R5.o /tmp/ccIyphnx.s start=0.02  finish=0.02 
  14124) /usr/lib/gcc/x86_64-linux-gnu/11/collect2 -plugin /usr/lib/gcc/x86_64-linux-gnu/11/liblto_plugin.so -plugin-op start=0.02  finish=0.04 
    14125) /usr/bin/ld -plugin /usr/lib/gcc/x86_64-linux-gnu/11/liblto_plugin.so -plugin-opt=/usr/lib/gcc/x86_64-linux- start=0.02  finish=0.04

Overall, this is a useful tool that gives me more of a process overview, e.g. single-threaded vs. multi-threaded, and summarizes the processes that take the most time. As needed, I also have a mechanism to decorate it with additional instrumentation. Two examples might be (a) checking for particular syscalls, e.g. file open events, and (b) drilling down into a process, not at the initial parent but at multiple sub-runs.
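
The summary figures shown earlier can be approximated from the exit records alone. The sketch below is illustrative rather than the proctree code: the names are hypothetical, the tick values are fields 14 and 15 of the exit records in the sample file, and 100 clock ticks per second is an assumption (the usual Linux value).

```python
from collections import defaultdict

TICKS_PER_SECOND = 100  # assumption: the usual Linux clock tick rate

# (comm, utime ticks, stime ticks) from the exit records, in file order.
EXITS = [("cc1", 1, 0), ("as", 0, 0), ("ld", 1, 0),
         ("collect2", 0, 0), ("gcc", 0, 0)]

# Order of process starts and exits in the sample file
# (the root counts as a start, every fork as a start).
LIFECYCLE = ["start", "start", "exit", "start", "exit",
             "start", "start", "exit", "exit", "exit"]


def summarize(exits):
    """Return (comm, count, user seconds, system seconds), busiest first."""
    totals = defaultdict(lambda: [0, 0, 0])
    for comm, utime, stime in exits:
        entry = totals[comm]
        entry[0] += 1
        entry[1] += utime
        entry[2] += stime
    rows = [(comm, n, u / TICKS_PER_SECOND, s / TICKS_PER_SECOND)
            for comm, (n, u, s) in totals.items()]
    rows.sort(key=lambda row: row[2] + row[3], reverse=True)
    return rows


def peak_processes(lifecycle):
    """Return (processes still running at the end, maximum running at once)."""
    running = peak = 0
    for step in lifecycle:
        running += 1 if step == "start" else -1
        peak = max(peak, running)
    return running, peak


if __name__ == "__main__":
    rows = summarize(EXITS)
    print(f"{len(EXITS)} processes")
    for comm, count, user, system in rows:
        print(f"\t{count:3d} {comm:20s} {user:8.2f} {system:8.2f}")
    running, peak = peak_processes(LIFECYCLE)
    print(f"{running} processes running")
    print(f"{peak} maximum processes")
```

A low maximum process count like this one is a quick hint that a workload is effectively serial even when it spawns several helpers.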

However, for now I have a basic topdown tool with both periodic output and a process tree to examine different workloads.

Creating basic metrics and adding topdown plots

Posted on December 31, 2023 by mev

I have made several enhancements to the topdown tool. I also have some fragile things I still need to sort out along the way.

I have added metrics for --topdown2, --cache2, --float, --branch and --opcache. These behave as I expect on AMD systems. I am still sorting things out on Intel systems, where something acts strangely with my topdown2 counters: if I use them alone, all is well, but when I combine them with other counters, the perf_event_open call reports an invalid argument.
I have done a first implementation of level 1 caches (--dcache, --icache) and TLB (--tlb). All of these use the PERF_TYPE_HW_CACHE type from perf_event_open(2). However, the results don’t quite seem right, so I may add corresponding PERF_TYPE_RAW events and see if they make more sense.
I did an initial implementation of --memory using the LS core counters for memory operations. This is also used for local/remote memory for likwid. However, the numbers are lower than what stream reports for memory traffic, so I am not sure this is the right counter recipe. I also have references to the /sys/devices/amd_df counters and can see them after loading the driver, but I am not yet sure which counter to use for memory channel reads/writes.
I have created an initial summary block “topdown.txt” for counters that work as I expect, and for both AMD and Intel processors I have a high-level summary that I show below.
I have implemented the --interval option, which lets me sample counters periodically. Combined with gnuplot and the --csv and -o options, this lets me create *.png files that plot topdown metrics.
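
For reference, PERF_TYPE_HW_CACHE events are selected by packing three small IDs into the config field, as described in perf_event_open(2). The sketch below only computes that config value; it does not open any counters. Which cache, operation and result combinations the --dcache, --icache and --tlb options actually request is not stated here, so the combinations printed are examples only.

```python
# Values from <linux/perf_event.h>, as documented in perf_event_open(2).
PERF_TYPE_HW_CACHE = 3

CACHE_ID = {"L1D": 0, "L1I": 1, "LL": 2, "DTLB": 3, "ITLB": 4, "BPU": 5, "NODE": 6}
OP_ID = {"READ": 0, "WRITE": 1, "PREFETCH": 2}
RESULT_ID = {"ACCESS": 0, "MISS": 1}


def hw_cache_config(cache, op, result):
    """Build perf_event_attr.config for a PERF_TYPE_HW_CACHE event.

    Layout: cache id | (operation id << 8) | (result id << 16).
    """
    return CACHE_ID[cache] | (OP_ID[op] << 8) | (RESULT_ID[result] << 16)


if __name__ == "__main__":
    # Example combinations only; not necessarily the ones the tool uses.
    for cache, op, result in [
        ("L1D", "READ", "ACCESS"),
        ("L1D", "READ", "MISS"),
        ("L1I", "READ", "MISS"),
        ("DTLB", "READ", "MISS"),
    ]:
        config = hw_cache_config(cache, op, result)
        print(f"{cache:5s} {op:8s} {result:6s} "
              f"type={PERF_TYPE_HW_CACHE} config={config:#x}")
```

Not every combination is implemented by every CPU; the kernel rejects unsupported ones when the event is opened, which is one reason raw events can be a useful cross-check.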

The net result is best seen below, where I include both a topdown metrics summary (created from three runs of “topdown” with different options) and a topdown chart (created from a fourth run with additional options). This is a fair step along the way towards having a basic analysis tool for looking at benchmark loads. In addition to clearing up some of the issues above, I also want to add a --tree option to plot a process tree. Once I have that, I’ll have most of the useful bits of the program formerly named “wspy” and might also rename “topdown” so that it accepts the “wspy” name as well.

Here is an AMD summary block with the major metrics for coremark:

elapsed              83.410
on_cpu               0.747          # 11.95 / 16 cores
utime                996.029
stime                0.451
nvcsw                1162           # 12.25%
nivcsw               8320           # 87.75%
inblock              0
onblock              1096
cpu-clock            996492501279   # 996.493 seconds
task-clock           996497240698   # 996.497 seconds
page faults          49987          # 50.163/sec
context switches     9695           # 9.729/sec
cpu migrations       136            # 0.136/sec
major page faults    0              # 0.000/sec
minor page faults    49985          # 50.161/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             1905721388306  # 189.110 branches per 1000 inst
branch misses        3005711443     # 0.16% branch miss
conditional          1674633740961  # 166.178 conditional branches per 1000 inst
indirect             9422915848     # 0.935 indirect branches per 1000 inst
cpu-cycles           4319923640733  # 3.23 GHz
instructions         10080742579393 # 2.33 IPC
slots                8640874657662  #
retiring             3015427410903  # 34.9% (58.9%)
-- ucode             6726058        #     0.0%
-- fastpath          3015420684845  #    34.9%
frontend             1175050211309  # 13.6% (22.9%)
-- latency           530224174536   #     6.1%
-- bandwidth         644826036773   #     7.5%
backend              894468621667   # 10.4% (17.5%)
-- cpu               270749606784   #     3.1%
-- memory            623719014883   #     7.2%
speculation          36309001429    #  0.4% ( 0.7%)
-- branch mispredict 34321580391    #     0.4%
-- pipeline restart  1987421038     #     0.0%
smt-contention       3519610791947  # 40.7% ( 0.0%)
instructions         5040563575655  # 0.024 l2 access per 1000 inst
l2 hit from l1       114170557      # 8.80% l2 miss
l2 miss from l1      7864844        #
l2 hit from l2 pf    5961997        #
l3 hit from l2 pf    1759222        #
l3 miss from l2 pf   1202870        #
instructions         5036908689193  # 0.085 float per 1000 inst
float 512            92             # 0.000 AVX-512 per 1000 inst
float 256            852            # 0.000 AVX-256 per 1000 inst
float 128            427687605      # 0.085 AVX-128 per 1000 inst
float MMX            0              # 0.000 MMX per 1000 inst
float scalar         0              # 0.000 scalar per 1000 inst

Here is the corresponding Intel summary block, also for coremark:

elapsed              82.626
on_cpu               0.707          # 11.31 / 16 cores
utime                934.350
stime                0.259
nvcsw                1122           # 16.56%
nivcsw               5653           # 83.44%
inblock              0
onblock              1064
cpu-clock            934609836035   # 934.610 seconds
task-clock           934612788300   # 934.613 seconds
page faults          74644          # 79.866/sec
context switches     6966           # 7.453/sec
cpu migrations       190            # 0.203/sec
major page faults    0              # 0.000/sec
minor page faults    74644          # 79.866/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             1487191047680  # 189.103 branches per 1000 inst
branch misses        3750608715     # 0.25% branch miss
conditional          1487191057952  # 189.103 conditional branches per 1000 inst
indirect             441335072192   # 56.118 indirect branches per 1000 inst
slots                6076449129938  #
retiring             3906991250131  # 64.3% (64.3%)
-- ucode             67666336195    #     1.1%
-- fastpath          3839324913936  #    63.2%
frontend             1246450345074  # 20.5% (20.5%)
-- latency           751572503238   #    12.4%
-- bandwidth         494877841836   #     8.1%
backend              629022362428   # 10.4% (10.4%)
-- cpu               335343935853   #     5.5%
-- memory            293678426575   #     4.8%
speculation          272715027078   #  4.5% ( 4.5%)
-- branch mispredict 256635566653   #     4.2%
-- pipeline restart  16079460425    #     0.3%
smt-contention       0              #  0.0% ( 0.0%)
cpu-cycles           3907422305230  # 2.65 GHz
instructions         9072449306543  # 2.32 IPC
l2 access            130609511      # 0.029 l2 access per 1000 inst
l2 miss              41959615       # 32.13% l2 miss
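
The derived columns in these blocks can be checked by hand. The sketch below redoes the arithmetic for the AMD block. The summary does not say how the parenthesized percentages are computed; the numbers are consistent with dividing by the sum of the four topdown categories (that is, excluding SMT contention), so that reading is an assumption. It would also explain why both columns match in the Intel block, where smt-contention is 0.

```python
# Raw counts copied from the AMD coremark summary block above.
cycles = 4319923640733
instructions = 10080742579393
slots = 8640874657662
topdown = {
    "retiring": 3015427410903,
    "frontend": 1175050211309,
    "backend": 894468621667,
    "speculation": 36309001429,
}
smt_contention = 3519610791947


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


ipc = instructions / cycles
# Assumption: the parenthesized column uses only the slots attributed to
# the four topdown categories, leaving out SMT contention.
attributed = sum(topdown.values())

print(f"IPC          {ipc:.2f}")
for name, count in topdown.items():
    print(f"{name:12s} {percent(count, slots):5.1f}% "
          f"({percent(count, attributed):5.1f}%)")
print(f"{'smt':12s} {percent(smt_contention, slots):5.1f}%")
```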

Here is the plot of topdown metrics for coremark, followed by the one for stream. The plots show the repetition within each benchmark as well as the overall pattern: stream is backend bound, while coremark is mostly retiring.

Potential interface and potential counter groups for topdown tool

Posted on December 25, 2023 by mev

I have looked through the Family 19h PPR reference, output from “perf list -v --detail” and also some likwid counter groups to figure out combinations of counters I might be able to add as instrumentation options for a topdown command. …

topdown – updated tool and metrics

Posted on December 23, 2023 by mev

I have updated and enhanced the topdown tool and also used this as an occasion to explore Zen4 topdown performance counters and Intel hybrid CPUs while building something to compare Intel i5-13500H and Ryzen 7940 processor metrics. The interface might change, …

perf – new performance counters with Linux 6.2

Posted on February 22, 2023 by mev

It looks like there are many new capabilities in the Linux perf command when run on a Zen4 core under Linux 6.2, compared with a Zen1 core under Linux 5.4. I compared the “perf list” output between: Ubuntu 20.04, Linux 5.4, …
