CachyOS (cachyos.org) is an Arch-based distribution designed for speed. It uses several techniques, including packages compiled for a specific x86-64 ISA level rather than for a generic baseline. A Phoronix article shows that the v3 (modern ISA) packages generally win, while the v4 (AVX-512) packages are slightly better still but also show some regressions.
So I installed CachyOS on an AMD 7940HS system and compared it against Ubuntu 22.04 on a similar system. Overall, CachyOS is about 6.6% faster on the first workload and about 5.9% faster on the second.
The following is the output for CachyOS:
NAMD 3.0b6:
pts/namd-1.3.1 [Input: ATPase with 327,506 Atoms]
Test 1 of 2
Estimated Trial Run Count: 3
Estimated Test Run-Time: 3 Minutes
Estimated Time To Completion: 9 Minutes [13:33 UTC]
Started Run 1 @ 13:24:53
Started Run 2 @ 13:25:35
Started Run 3 @ 13:26:16
Input: ATPase with 327,506 Atoms:
1.3094369813811
1.3209132265683
1.3370467578622
Average: 1.32247 ns/day
Deviation: 1.05%
NAMD 3.0b6:
pts/namd-1.3.1 [Input: STMV with 1,066,628 Atoms]
Test 2 of 2
Estimated Trial Run Count: 3
Estimated Time To Completion: 7 Minutes [13:33 UTC]
Started Run 1 @ 13:27:02
Started Run 2 @ 13:29:06
Started Run 3 @ 13:31:09
Input: STMV with 1,066,628 Atoms:
0.38845511401158
0.3892307632426
0.39149056116257
Average: 0.38973 ns/day
Deviation: 0.40%
And the following is the output for Ubuntu:
NAMD 3.0b6:
pts/namd-1.3.1 [Input: ATPase with 327,506 Atoms]
Test 1 of 2
Estimated Trial Run Count: 3
Estimated Test Run-Time: 3 Minutes
Estimated Time To Completion: 9 Minutes [09:21 CST]
Started Run 1 @ 09:12:59
Started Run 2 @ 09:13:42
Started Run 3 @ 09:14:25
Input: ATPase with 327,506 Atoms:
1.2429462798618
1.2405624213794
1.2391773349509
Average: 1.24090 ns/day
Deviation: 0.15%
NAMD 3.0b6:
pts/namd-1.3.1 [Input: STMV with 1,066,628 Atoms]
Test 2 of 2
Estimated Trial Run Count: 3
Estimated Time To Completion: 7 Minutes [09:21 CST]
Started Run 1 @ 09:15:14
Started Run 2 @ 09:17:24
Started Run 3 @ 09:19:36
Input: STMV with 1,066,628 Atoms:
0.37073149030352
0.36840554081933
0.36519408239509
Average: 0.36811 ns/day
Deviation: 0.76%
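The overall comparison can be re-derived from the four averages reported above. The following is a minimal sketch; the names `averages` and `percent_gain` are illustrative and not part of any tool used here. It assumes that a higher ns/day value is better, which is how NAMD reports throughput.

```python
# Average ns/day values copied from the benchmark output above.
averages = {
    "ATPase": {"cachyos": 1.32247, "ubuntu": 1.24090},
    "STMV": {"cachyos": 0.38973, "ubuntu": 0.36811},
}


def percent_gain(new, old):
    """Relative improvement of `new` over `old` in percent (higher is better)."""
    return (new / old - 1.0) * 100.0


# Gain of CachyOS over Ubuntu for each input.
gains = {
    name: percent_gain(values["cachyos"], values["ubuntu"])
    for name, values in averages.items()
}

for name, gain in gains.items():
    print(f"{name}: CachyOS is {gain:.2f}% faster than Ubuntu")
```

With three runs per test and deviations around 1% or less, differences of this size are larger than the run-to-run noise shown in the output.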
Comparing the performance metrics for the two runs shows:
- about a 1/3 reduction in system time (stime 20.3 vs. 31.8 seconds)
- a higher average clock of 3.8 GHz instead of 3.6 GHz
Most of the other workload metrics, including the floating-point mix, are very similar.
Following are the overall metrics for CachyOS:
elapsed 495.359
on_cpu 0.919 # 14.71 / 16 cores
utime 7264.584
stime 20.290
nvcsw 174165 # 71.79%
nivcsw 68431 # 28.21%
inblock 0 # 0.00/sec
onblock 3848 # 7.77/sec
cpu-clock 7359760148766 # 7359.760 seconds
task-clock 7359865910736 # 7359.866 seconds
page faults 3872735 # 526.196/sec
context switches 244414 # 33.209/sec
cpu migrations 503 # 0.068/sec
major page faults 0 # 0.000/sec
minor page faults 3872735 # 526.196/sec
alignment faults 0 # 0.000/sec
emulation faults 0 # 0.000/sec
branches 1897775256392 # 76.988 branches per 1000 inst
branch misses 39722841032 # 2.09% branch miss
conditional 1509458303101 # 61.235 conditional branches per 1000 inst
indirect 49014559620 # 1.988 indirect branches per 1000 inst
cpu-cycles 30431898149240 # 3.80 GHz
instructions 24655998647905 # 0.81 IPC
slots 60848642266458 #
retiring 10719221264151 # 17.6% (22.1%)
-- ucode 56759595164 # 0.1%
-- fastpath 10662461668987 # 17.5%
frontend 10432837231580 # 17.1% (21.6%)
-- latency 8735537140284 # 14.4%
-- bandwidth 1697300091296 # 2.8%
backend 26123312845140 # 42.9% (54.0%)
-- cpu 15479675594829 # 25.4%
-- memory 10643637250311 # 17.5%
speculation 1125198830253 # 1.8% ( 2.3%)
-- branch mispredict 945591758746 # 1.6%
-- pipeline restart 179607071507 # 0.3%
smt-contention 12447893595568 # 20.5% ( 0.0%)
cpu-cycles 30432992029522 # 3.81 GHz
instructions 24655633594125 # 0.81 IPC
instructions 8213813061144 # 24.296 l2 access per 1000 inst
l2 hit from l1 139816264099 # 18.94% l2 miss
l2 miss from l1 14272375641 #
l2 hit from l2 pf 36215623282 #
l3 hit from l2 pf 2654557357 #
l3 miss from l2 pf 20872911653 #
instructions 8211249455440 # 182.438 float per 1000 inst
float 512 53 # 0.000 AVX-512 per 1000 inst
float 256 39471842819 # 4.807 AVX-256 per 1000 inst
float 128 1458574293164 # 177.631 AVX-128 per 1000 inst
float MMX 0 # 0.000 MMX per 1000 inst
float scalar 3837 # 0.000 scalar per 1000 inst
instructions 24639840844799 #
opcache 4278105143452 # 173.626 opcache per 1000 inst
opcache miss 66805357105 # 1.6% opcache miss rate
l1 dTLB miss 28144769542 # 1.142 L1 dTLB per 1000 inst
l2 dTLB miss 2914548040 # 0.118 L2 dTLB per 1000 inst
instructions 24757437810677 #
icache 95378615773 # 3.853 icache per 1000 inst
icache miss 20027564740 # 21.0% icache miss rate
l1 iTLB miss 353644845 # 0.014 L1 iTLB per 1000 inst
l2 iTLB miss 0 # 0.000 L2 iTLB per 1000 inst
tlb flush 53161 # 0.000 TLB flush per 1000 inst
Following are the metrics for Ubuntu:
elapsed 534.402
on_cpu 0.933 # 14.93 / 16 cores
utime 7946.582
stime 31.785
nvcsw 159398 # 68.33%
nivcsw 73870 # 31.67%
inblock 0 # 0.00/sec
onblock 107872 # 201.86/sec
cpu-clock 7979741490188 # 7979.741 seconds
task-clock 7979891388497 # 7979.891 seconds
page faults 4288047 # 537.357/sec
context switches 235752 # 29.543/sec
cpu migrations 555 # 0.070/sec
major page faults 391 # 0.049/sec
minor page faults 4287656 # 537.308/sec
alignment faults 0 # 0.000/sec
emulation faults 0 # 0.000/sec
branches 1925290347271 # 77.458 branches per 1000 inst
branch misses 38840160112 # 2.02% branch miss
conditional 1517305547805 # 61.044 conditional branches per 1000 inst
indirect 50858503572 # 2.046 indirect branches per 1000 inst
cpu-cycles 31169386494557 # 3.61 GHz
instructions 24971268101144 # 0.80 IPC
slots 62339872752540 #
retiring 10839188885703 # 17.4% (21.8%)
-- ucode 54615869178 # 0.1%
-- fastpath 10784573016525 # 17.3%
frontend 10368247652561 # 16.6% (20.8%)
-- latency 8672309211630 # 13.9%
-- bandwidth 1695938440931 # 2.7%
backend 27434861319569 # 44.0% (55.1%)
-- cpu 16334908995677 # 26.2%
-- memory 11099952323892 # 17.8%
speculation 1114337539722 # 1.8% ( 2.2%)
-- branch mispredict 932720011460 # 1.5%
-- pipeline restart 181617528262 # 0.3%
smt-contention 12583164650371 # 20.2% ( 0.0%)
cpu-cycles 31195692318508 # 3.61 GHz
instructions 24862284718475 # 0.80 IPC
instructions 8286838897986 # 23.231 l2 access per 1000 inst
l2 hit from l1 135748832053 # 19.01% l2 miss
l2 miss from l1 13484958347 #
l2 hit from l2 pf 33649472706 #
l3 hit from l2 pf 2330209780 #
l3 miss from l2 pf 20785824419 #
instructions 8288576901008 # 183.896 float per 1000 inst
float 512 68 # 0.000 AVX-512 per 1000 inst
float 256 39465430187 # 4.761 AVX-256 per 1000 inst
float 128 1484772747474 # 179.135 AVX-128 per 1000 inst
float MMX 0 # 0.000 MMX per 1000 inst
float scalar 754 # 0.000 scalar per 1000 inst
instructions 24970173846783 #
opcache 4332161426447 # 173.493 opcache per 1000 inst
opcache miss 63720650094 # 1.5% opcache miss rate
l1 dTLB miss 29343624637 # 1.175 L1 dTLB per 1000 inst
l2 dTLB miss 3279370921 # 0.131 L2 dTLB per 1000 inst
instructions 24960638488597 #
icache 88015691133 # 3.526 icache per 1000 inst
icache miss 17928613619 # 20.4% icache miss rate
l1 iTLB miss 2049956023 # 0.082 L1 iTLB per 1000 inst
l2 iTLB miss 0 # 0.000 L2 iTLB per 1000 inst
tlb flush 48995 # 0.000 TLB flush per 1000 inst
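Several of the headline numbers can be cross-checked from the raw values in the two listings above. The sketch below is only a sanity check under stated assumptions: the `on_cpu` formula (user plus system time divided by elapsed time times 16 logical CPUs) is my inference, not something documented in the output, although it does reproduce the 0.919 and 0.933 values. The names `runs`, `derived` and `reduction` are hypothetical.

```python
# Values copied from the two metric listings above.
runs = {
    "cachyos": {
        "elapsed": 495.359, "utime": 7264.584, "stime": 20.290,
        "cycles": 30431898149240, "instructions": 24655998647905,
    },
    "ubuntu": {
        "elapsed": 534.402, "utime": 7946.582, "stime": 31.785,
        "cycles": 31169386494557, "instructions": 24971268101144,
    },
}
CORES = 16  # taken from the "/ 16 cores" annotation in the listings


def derived(run):
    """Derived ratios; the on_cpu formula is an assumption (see lead-in)."""
    return {
        "on_cpu": (run["utime"] + run["stime"]) / (run["elapsed"] * CORES),
        "ipc": run["instructions"] / run["cycles"],
    }


def reduction(new, old):
    """Percentage by which `new` is smaller than `old`."""
    return (1.0 - new / old) * 100.0


for name, run in runs.items():
    d = derived(run)
    print(f"{name}: on_cpu={d['on_cpu']:.3f} ipc={d['ipc']:.2f}")

for key in ("elapsed", "stime", "instructions"):
    pct = reduction(runs["cachyos"][key], runs["ubuntu"][key])
    print(f"{key}: {pct:.1f}% lower on CachyOS")
```

The instruction counts differ by only about 1% between the two systems while IPC is nearly identical, which is consistent with the same binary doing the same work in both cases.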
Looking a little deeper, it appears that the namd package ships pre-compiled binaries, so what I am comparing is the rest of the system rather than my own compilations. For example, system time for the namd executable drops from 460.5 seconds to 285 seconds.
It is useful to remember that compilation might occur as part of the benchmark, but it can also happen earlier, such as at installation time, or not at all when pre-compiled binaries are used. Based on this, I need to find tests that actually compile code rather than just run pre-compiled binaries. A quick grep of the process tree suggests a few possibilities, including polyhedron and openfoam.
For example, the gfortran invocations include compilations such as the following:
gfortran -ffast-math -funroll-loops -O3 ac.f90 -o ac
However, those options seem to be built into the script, so other than gfortran picking up settings from the environment, the result might not change between distributions. It therefore probably comes down to building with different options.
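One way to build with different options would be to generate a variant of the quoted gfortran line for each ISA level. The sketch below only constructs and prints the command lines; it does not run them. The list of targets and the output naming scheme are my own assumptions, not taken from the benchmark script. The `-march=x86-64-v3` and `-march=x86-64-v4` values require GCC 11 or newer.

```python
import shlex

# Flags taken from the compile line quoted above.
BASE_FLAGS = ["-ffast-math", "-funroll-loops", "-O3"]

# Hypothetical set of ISA targets to compare.
ISA_TARGETS = ["x86-64", "x86-64-v3", "x86-64-v4", "native"]


def compile_command(source, march):
    """Build a gfortran argument list for one source file and one -march value."""
    stem = source.rsplit(".", 1)[0]
    output = f"{stem}-{march}"  # e.g. ac-x86-64-v3, so binaries do not collide
    return ["gfortran", *BASE_FLAGS, f"-march={march}", source, "-o", output]


commands = [compile_command("ac.f90", march) for march in ISA_TARGETS]

for cmd in commands:
    # shlex.join gives a copy-pasteable shell line (Python 3.8+).
    print(shlex.join(cmd))
```

Each resulting binary could then be timed separately on the same OS, which isolates the effect of the ISA level from the rest of the distribution.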
A further check on CachyOS, trying to install lczero, results in build errors. So my general conclusion is that Ubuntu makes the most sense as a general build/benchmark platform, but CachyOS can be useful for trying specific OS-package-related changes. To check the effects of particular ISAs, I might need to find specific benchmarks, e.g. polyhedron or SPEC, and recompile them to compare results.
