The lmbench benchmark can be installed using:

prompt% sudo apt install lmbench

This includes the lat_mem_rd program, which I have used to measure cache latencies. An example run and its output for a memory size of 20000 MB (20 GB) and a stride of 1024 bytes is shown below. I have also annotated known architectural information on the cache hierarchy:

  • L1 = 32KB and 4 clocks
  • L2 = 1MB and 14 clocks
  • L3 = 96MB and 47 clocks

Where the cycle counts come from Agner Fog's microarchitecture manuals.

mev@montpelier:~$ /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_mem_rd 20000 1024
"stride=1024
0.00098 0.819
0.00195 0.796
0.00293 0.821
0.00391 0.814
0.00586 0.819
0.00781 0.809
0.01172 0.820
0.01562 0.792
0.02344 0.821
0.03125 0.792 # L1 = 32KB
0.04688 1.552
0.06250 2.886
0.09375 1.303
0.12500 2.877
0.18750 2.882
0.25000 2.888
0.37500 3.195
0.50000 3.198
0.75000 3.251
1.00000 3.311 # L2 = 1MB
1.50000 4.004
2.00000 4.693
3.00000 6.296
4.00000 6.743
6.00000 7.496
8.00000 7.411
12.00000 7.674
16.00000 8.166
24.00000 8.054
32.00000 7.817
48.00000 8.194
64.00000 7.956
96.00000 8.212 # L3 = 96MB
128.00000 8.857
192.00000 10.671
256.00000 15.057
384.00000 21.861
512.00000 22.165
768.00000 22.636
1024.00000 22.624
1536.00000 23.139
2048.00000 23.651
3072.00000 24.744
4096.00000 25.005
6144.00000 25.829
8192.00000 25.676
12288.00000 26.301
16384.00000 26.037

The numbers are in nanoseconds and suggest to me a few things:

  • CPUs have become more dynamic, with separate base and boost frequencies. Compared to a decade ago it may not be as simple to convert from nanoseconds to cycles, but we are in the almost-5 GHz range, or about 0.2 ns per cycle, and at that rate the numbers above correspond to the documented cycle counts.
  • A stride of 1024 bytes was chosen; as one varies this value the measured latencies also change, most likely because hardware prefetching comes into play at smaller strides.