{"id":1911,"date":"2024-03-02T15:53:36","date_gmt":"2024-03-02T15:53:36","guid":{"rendered":"https:\/\/mvermeulen.org\/perf\/?p=1911"},"modified":"2024-03-02T15:53:38","modified_gmt":"2024-03-02T15:53:38","slug":"cachyos-and-namd","status":"publish","type":"post","link":"https:\/\/mvermeulen.org\/perf\/2024\/03\/02\/cachyos-and-namd\/","title":{"rendered":"cachyos and namd"},"content":{"rendered":"\n<p><a href=\"http:\/\/cachyos.org\" data-type=\"link\" data-id=\"cachyos.org\">cachyos.org <\/a>is an Arch-based distribution designed to be fast. It uses several techniques, including packages compiled for a specific x86-64 ISA level rather than a generic baseline. The <a href=\"https:\/\/www.phoronix.com\/review\/cachyos-x86-64-v3-v4\">following Phoronix article <\/a>shows that the v3 (modern ISA) packages generally win, while the v4 (AVX-512) packages are slightly better still but also show some regressions.<\/p>\n\n\n\n<p>So I installed cachyos on a 7940HS AMD system and compared it against Ubuntu 22.04 on a similar system. The overall numbers for cachyos are roughly 6.5% better on the first workload (ATPase) and 5.8% better on the second workload (STMV).<\/p>\n\n\n\n<p>The following is the output for cachyos:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>NAMD 3.0b6:\n    pts\/namd-1.3.1 &#91;Input: ATPase with 327,506 Atoms]\n    Test 1 of 2\n    Estimated Trial Run Count:    3                     \n    Estimated Test Run-Time:      3 Minutes             \n    Estimated Time To Completion: 9 Minutes &#91;13:33 UTC] \n        Started Run 1 @ 13:24:53\n        Started Run 2 @ 
13:29:06\n        Started Run 3 @ 13:31:09\n\n    Input: STMV with 1,066,628 Atoms:\n        0.38845511401158\n        0.3892307632426\n        0.39149056116257\n\n    Average: 0.38973 ns\/day\n    Deviation: 0.40%\n<\/code><\/pre>\n\n\n\n<p>The following is the output for Ubuntu:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>NAMD 3.0b6:\n    pts\/namd-1.3.1 &#91;Input: ATPase with 327,506 Atoms]\n    Test 1 of 2\n    Estimated Trial Run Count:    3                     \n    Estimated Test Run-Time:      3 Minutes             \n    Estimated Time To Completion: 9 Minutes &#91;09:21 CST] \n        Started Run 1 @ 09:12:59\n        Started Run 2 @ 09:13:42\n        Started Run 3 @ 09:14:25\n\n    Input: ATPase with 327,506 Atoms:\n        1.2429462798618\n        1.2405624213794\n        1.2391773349509\n\n    Average: 1.24090 ns\/day\n    Deviation: 0.15%\n\nNAMD 3.0b6:\n    pts\/namd-1.3.1 &#91;Input: STMV with 1,066,628 Atoms]\n    Test 2 of 2\n    Estimated Trial Run Count:    3                     \n    Estimated Time To Completion: 7 Minutes &#91;09:21 CST] \n        Started Run 1 @ 09:15:14\n        Started Run 2 @ 09:17:24\n        Started Run 3 @ 09:19:36\n\n    Input: STMV with 1,066,628 Atoms:\n        0.37073149030352\n        0.36840554081933\n        0.36519408239509\n\n    Average: 0.36811 ns\/day\n    Deviation: 0.76%\n<\/code><\/pre>\n\n\n\n<p>Comparing my performance metrics, cachyos shows:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>About a 1\/3 reduction in system time (20.3 vs. 31.8 seconds)<\/li>\n\n\n\n<li>An average clock of 3.8 GHz instead of 3.6 GHz<\/li>\n<\/ul>\n\n\n\n<p>Most of the workload metrics, including floating point, are very similar.<\/p>\n\n\n\n<p>The following are the overall metrics for cachyos:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              495.359\non_cpu               0.919          # 14.71 \/ 16 cores\nutime                7264.584\nstime                20.290\nnvcsw                174165         # 71.79%\nnivcsw               68431          # 28.21%\ninblock              0  
            # 0.00\/sec\nonblock              3848           # 7.77\/sec\ncpu-clock            7359760148766  # 7359.760 seconds\ntask-clock           7359865910736  # 7359.866 seconds\npage faults          3872735        # 526.196\/sec\ncontext switches     244414         # 33.209\/sec\ncpu migrations       503            # 0.068\/sec\nmajor page faults    0              # 0.000\/sec\nminor page faults    3872735        # 526.196\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             1897775256392  # 76.988 branches per 1000 inst\nbranch misses        39722841032    # 2.09% branch miss\nconditional          1509458303101  # 61.235 conditional branches per 1000 inst\nindirect             49014559620    # 1.988 indirect branches per 1000 inst\ncpu-cycles           30431898149240 # 3.80 GHz\ninstructions         24655998647905 # 0.81 IPC\nslots                60848642266458 #\nretiring             10719221264151 # 17.6% (22.1%)\n-- ucode             56759595164    #     0.1%\n-- fastpath          10662461668987 #    17.5%\nfrontend             10432837231580 # 17.1% (21.6%)\n-- latency           8735537140284  #    14.4%\n-- bandwidth         1697300091296  #     2.8%\nbackend              26123312845140 # 42.9% (54.0%)\n-- cpu               15479675594829 #    25.4%\n-- memory            10643637250311 #    17.5%\nspeculation          1125198830253  #  1.8% ( 2.3%)\n-- branch mispredict 945591758746   #     1.6%\n-- pipeline restart  179607071507   #     0.3%\nsmt-contention       12447893595568 # 20.5% ( 0.0%)\ncpu-cycles           30432992029522 # 3.81 GHz\ninstructions         24655633594125 # 0.81 IPC\ninstructions         8213813061144  # 24.296 l2 access per 1000 inst\nl2 hit from l1       139816264099   # 18.94% l2 miss\nl2 miss from l1      14272375641    #\nl2 hit from l2 pf    36215623282    #\nl3 hit from l2 pf    2654557357     #\nl3 miss from l2 pf   20872911653    #\ninstructions   
      8211249455440  # 182.438 float per 1000 inst\nfloat 512            53             # 0.000 AVX-512 per 1000 inst\nfloat 256            39471842819    # 4.807 AVX-256 per 1000 inst\nfloat 128            1458574293164  # 177.631 AVX-128 per 1000 inst\nfloat MMX            0              # 0.000 MMX per 1000 inst\nfloat scalar         3837           # 0.000 scalar per 1000 inst\ninstructions         24639840844799 #\nopcache              4278105143452  # 173.626 opcache per 1000 inst\nopcache miss         66805357105    #  1.6% opcache miss rate\nl1 dTLB miss         28144769542    # 1.142 L1 dTLB per 1000 inst\nl2 dTLB miss         2914548040     # 0.118 L2 dTLB per 1000 inst\ninstructions         24757437810677 #\nicache               95378615773    # 3.853 icache per 1000 inst\nicache miss          20027564740    # 21.0% icache miss rate\nl1 iTLB miss         353644845      # 0.014 L1 iTLB per 1000 inst\nl2 iTLB miss         0              # 0.000 L2 iTLB per 1000 inst\ntlb flush            53161          # 0.000 TLB flush per 1000 inst\n<\/code><\/pre>\n\n\n\n<p>Following are the metrics for ubuntu<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              534.402\non_cpu               0.933          # 14.93 \/ 16 cores\nutime                7946.582\nstime                31.785\nnvcsw                159398         # 68.33%\nnivcsw               73870          # 31.67%\ninblock              0              # 0.00\/sec\nonblock              107872         # 201.86\/sec\ncpu-clock            7979741490188  # 7979.741 seconds\ntask-clock           7979891388497  # 7979.891 seconds\npage faults          4288047        # 537.357\/sec\ncontext switches     235752         # 29.543\/sec\ncpu migrations       555            # 0.070\/sec\nmajor page faults    391            # 0.049\/sec\nminor page faults    4287656        # 537.308\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             
1925290347271  # 77.458 branches per 1000 inst\nbranch misses        38840160112    # 2.02% branch miss\nconditional          1517305547805  # 61.044 conditional branches per 1000 inst\nindirect             50858503572    # 2.046 indirect branches per 1000 inst\ncpu-cycles           31169386494557 # 3.61 GHz\ninstructions         24971268101144 # 0.80 IPC\nslots                62339872752540 #\nretiring             10839188885703 # 17.4% (21.8%)\n-- ucode             54615869178    #     0.1%\n-- fastpath          10784573016525 #    17.3%\nfrontend             10368247652561 # 16.6% (20.8%)\n-- latency           8672309211630  #    13.9%\n-- bandwidth         1695938440931  #     2.7%\nbackend              27434861319569 # 44.0% (55.1%)\n-- cpu               16334908995677 #    26.2%\n-- memory            11099952323892 #    17.8%\nspeculation          1114337539722  #  1.8% ( 2.2%)\n-- branch mispredict 932720011460   #     1.5%\n-- pipeline restart  181617528262   #     0.3%\nsmt-contention       12583164650371 # 20.2% ( 0.0%)\ncpu-cycles           31195692318508 # 3.61 GHz\ninstructions         24862284718475 # 0.80 IPC\ninstructions         8286838897986  # 23.231 l2 access per 1000 inst\nl2 hit from l1       135748832053   # 19.01% l2 miss\nl2 miss from l1      13484958347    #\nl2 hit from l2 pf    33649472706    #\nl3 hit from l2 pf    2330209780     #\nl3 miss from l2 pf   20785824419    #\ninstructions         8288576901008  # 183.896 float per 1000 inst\nfloat 512            68             # 0.000 AVX-512 per 1000 inst\nfloat 256            39465430187    # 4.761 AVX-256 per 1000 inst\nfloat 128            1484772747474  # 179.135 AVX-128 per 1000 inst\nfloat MMX            0              # 0.000 MMX per 1000 inst\nfloat scalar         754            # 0.000 scalar per 1000 inst\ninstructions         24970173846783 #\nopcache              4332161426447  # 173.493 opcache per 1000 inst\nopcache miss         63720650094    #  1.5% opcache miss rate\nl1 
dTLB miss         29343624637    # 1.175 L1 dTLB per 1000 inst\nl2 dTLB miss         3279370921     # 0.131 L2 dTLB per 1000 inst\ninstructions         24960638488597 #\nicache               88015691133    # 3.526 icache per 1000 inst\nicache miss          17928613619    # 20.4% icache miss rate\nl1 iTLB miss         2049956023     # 0.082 L1 iTLB per 1000 inst\nl2 iTLB miss         0              # 0.000 L2 iTLB per 1000 inst\ntlb flush            48995          # 0.000 TLB flush per 1000 inst\n<\/code><\/pre>\n\n\n\n<p>Looking a little deeper, the namd package appears to ship pre-compiled binaries, so what I am comparing is the rest of the system rather than my own compilations. For example, system time for the namd executable drops from 460.5 seconds to 285 seconds.<\/p>\n\n\n\n<p>It is worth remembering that compilation might occur as part of the benchmark, but it can also happen earlier, such as at installation, or be skipped entirely by using pre-compiled binaries. Based on this, I need to find tests that actually compile rather than just run pre-built binaries. A quick grep of the process tree suggests a few possibilities, including polyhedron and openfoam.<\/p>\n\n\n\n<p>For example, the process tree includes gfortran compilations such as the following:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>gfortran -ffast-math -funroll-loops -O3 ac.f90 -o ac<\/code><\/pre>\n\n\n\n<p>However, those options seem to be built into the test script, so apart from whatever gfortran picks up from the environment, the compilation might not change. So it probably comes down to building with different options.<\/p>\n\n\n\n<p>A further check on cachyos, trying to install lczero, results in build errors. So my general conclusion is that Ubuntu seems to make the most sense as a general build\/benchmark platform, but cachyos can be useful for trying specific OS-package-related changes. To check the effects of particular ISAs, I might need to find specific benchmarks, e.g. 
polyhedron or SPEC, and recompile them to compare results.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>cachyos.org is an Arch-based distribution designed to be fast. It uses several techniques, including packages compiled for a specific x86-64 ISA level rather than a generic baseline. The following Phoronix article shows that the v3 (modern ISA) packages generally win, while the v4 <span class=\"excerpt-dots\">&hellip;<\/span> <a class=\"more-link\" href=\"https:\/\/mvermeulen.org\/perf\/2024\/03\/02\/cachyos-and-namd\/\"><span class=\"more-msg\">Continue reading &rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[11],"tags":[30,31],"class_list":["post-1911","post","type-post","status-publish","format-standard","hentry","category-experiment","tag-cachyos","tag-namd"],"_links":{"self":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/posts\/1911","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/comments?post=1911"}],"version-history":[{"count":4,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/posts\/1911\/revisions"}],"predecessor-version":[{"id":1915,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/posts\/1911\/revisions\/1915"}],"wp:attachment":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/media?parent=1911"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/categories?post=1911"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/tags?post=1911"}],"curies":[{"name":
"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}