{"id":645,"date":"2024-01-17T00:43:06","date_gmt":"2024-01-17T00:43:06","guid":{"rendered":"https:\/\/mvermeulen.org\/perf\/?page_id=645"},"modified":"2024-01-17T00:43:06","modified_gmt":"2024-01-17T00:43:06","slug":"hmmer","status":"publish","type":"page","link":"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/hmmer\/","title":{"rendered":"hmmer"},"content":{"rendered":"\n<p>hmmer is scientific code that searches using profile hidden Markov models. There is one test, where the goal is to minimize run time. This is parallel code that keeps about half the cores busy. The Intel results suggest it runs on physical cores but does not use hyperthreads.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-29.png\" alt=\"\" class=\"wp-image-646\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-29.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-29-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-29-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>The topdown profile shows a moderate retirement rate, with some frontend stalls and not as many backend stalls.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-66.png\" alt=\"\" class=\"wp-image-647\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-66.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-66-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-66-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 
100vw, 1280px\" \/><\/figure>\n\n\n\n<p>AMD metrics show heavily floating-point code with a low level of L2 access. Backend stalls are CPU-centric, so this would be a good code for drilling further into CPU-centric bottlenecks. Since the floating-point work is almost entirely 128-bit, it is also a good candidate for trying at least AVX-256.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              331.598\non_cpu               0.472          # 7.56 \/ 16 cores\nutime                2446.019\nstime                60.105\nnvcsw                1974074        # 98.50%\nnivcsw               30023          # 1.50%\ninblock              2888           # 8.71\/sec\nonblock              13016          # 39.25\/sec\ncpu-clock            2512599080302  # 2512.599 seconds\ntask-clock           2513608145421  # 2513.608 seconds\npage faults          551481         # 219.398\/sec\ncontext switches     2005436        # 797.832\/sec\ncpu migrations       130384         # 51.871\/sec\nmajor page faults    101            # 0.040\/sec\nminor page faults    551380         # 219.358\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             2258521454096  # 92.354 branches per 1000 inst\nbranch misses        28635942877    # 1.27% branch miss\nconditional          2126819155402  # 86.969 conditional branches per 1000 inst\nindirect             17843603406    # 0.730 indirect branches per 1000 inst\ncpu-cycles           10482667685704 # 1.97 GHz\ninstructions         24364658068770 # 2.32 IPC\nslots                21107611910478 #\nretiring             7659667626542  # 36.3% (44.4%)\n-- ucode             2206940738     #     0.0%\n-- fastpath          7657460685804  #    36.3%\nfrontend             3554660839585  # 16.8% (20.6%)\n-- latency           1513553670744  #     7.2%\n-- bandwidth         2041107168841  #     9.7%\nbackend              4871025342216  # 23.1% (28.3%)\n-- cpu               4439682897969  #    21.0%\n-- memory            431342444247   #     2.0%\nspeculation   
       1151927521901  #  5.5% ( 6.7%)\n-- branch mispredict 1148950630289  #     5.4%\n-- pipeline restart  2976891612     #     0.0%\nsmt-contention       3870104848807  # 18.3% ( 0.0%)\ncpu-cycles           10487395728592 # 1.97 GHz\ninstructions         24380098876766 # 2.32 IPC\ninstructions         8155654202288  # 10.013 l2 access per 1000 inst\nl2 hit from l1       68286282185    # 1.92% l2 miss\nl2 miss from l1      1012878111     #\nl2 hit from l2 pf    12822770451    #\nl3 hit from l2 pf    467457741      #\nl3 miss from l2 pf   88265283       #\ninstructions         8149997728883  # 587.540 float per 1000 inst\nfloat 512            81             # 0.000 AVX-512 per 1000 inst\nfloat 256            584            # 0.000 AVX-256 per 1000 inst\nfloat 128            4788447283765  # 587.540 AVX-128 per 1000 inst\nfloat MMX            0              # 0.000 MMX per 1000 inst\nfloat scalar         2              # 0.000 scalar per 1000 inst\n<\/code><\/pre>\n\n\n\n<p>Intel metrics show mostly running on 12 cores.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              459.336\non_cpu               0.721          # 11.54 \/ 16 cores\nutime                5237.164\nstime                65.096\nnvcsw                3057554        # 94.06%\nnivcsw               193158         # 5.94%\ninblock              320            # 0.70\/sec\nonblock              1768           # 3.85\/sec\ncpu-clock            5308742804653  # 5308.743 seconds\ntask-clock           5309513369462  # 5309.513 seconds\npage faults          750501         # 141.350\/sec\ncontext switches     3252164        # 612.516\/sec\ncpu migrations       639780         # 120.497\/sec\nmajor page faults    18             # 0.003\/sec\nminor page faults    750483         # 141.347\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             3382433495485  # 92.205 branches per 1000 inst\nbranch misses        40757068576    # 
1.20% branch miss\nconditional          3382447770653  # 92.206 conditional branches per 1000 inst\nindirect             1567243251234  # 42.723 indirect branches per 1000 inst\nslots                30187099759652 #\nretiring             15109005184623 # 50.1% (50.1%)\n-- ucode             126581066445   #     0.4%\n-- fastpath          14982424118178 #    49.6%\nfrontend             4109048417357  # 13.6% (13.6%)\n-- latency           1786065710888  #     5.9%\n-- bandwidth         2322982706469  #     7.7%\nbackend              7816602085255  # 25.9% (25.9%)\n-- cpu               5358445823079  #    17.8%\n-- memory            2458156262176  #     8.1%\nspeculation          3111458169602  # 10.3% (10.3%)\n-- branch mispredict 3096804788571  #    10.3%\n-- pipeline restart  14653381031    #     0.0%\nsmt-contention       0              #  0.0% ( 0.0%)\ncpu-cycles           17094061098071 # 2.32 GHz\ninstructions         41367833831551 # 2.42 IPC\nl2 access            50706897733    # 3.145 l2 access per 1000 inst\nl2 miss              4410666064     # 8.70% l2 miss\n<\/code><\/pre>\n\n\n\n<p>The process tree shows a huge number of threads, though interestingly they don&#8217;t show up in the runnable processes above.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>297507 processes\n\t297144 hmmsearch            15169403.33 214883.04\n\t 68 clinfo                  16.53     6.32\n\t 18 mpirun                   7.41    66.93\n\t 38 vulkaninfo               0.95     1.23\n\t  6 php                      0.44     4.27\n\t  6 glxinfo:gdrv0            0.15     0.06\n\t  4 vulkani:disk$0           0.10     0.13\n\t  2 glxinfo                  0.07     0.02\n\t  2 glxinfo:cs0              0.07     0.02\n\t  2 glxinfo:disk$0           0.07     0.02\n\t  2 glxinfo:sh0              0.07     0.02\n\t  2 glxinfo:shlo0            0.07     0.02\n\t  6 clang                    0.06     0.06\n\t  2 llvmpipe-0               0.05     0.07\n\t  2 llvmpipe-1               0.05     0.07\n\t  2 
llvmpipe-10              0.05     0.07\n\t  2 llvmpipe-11              0.05     0.07\n\t  2 llvmpipe-12              0.05     0.07\n\t  2 llvmpipe-13              0.05     0.07\n\t  2 llvmpipe-14              0.05     0.07\n\t  2 llvmpipe-15              0.05     0.07\n\t  2 llvmpipe-2               0.05     0.07\n\t  2 llvmpipe-3               0.05     0.07\n\t  2 llvmpipe-4               0.05     0.07\n\t  2 llvmpipe-5               0.05     0.07\n\t  2 llvmpipe-6               0.05     0.07\n\t  2 llvmpipe-7               0.05     0.07\n\t  2 llvmpipe-8               0.05     0.07\n\t  2 llvmpipe-9               0.05     0.07\n\t  1 lspci                    0.01     0.02\n\t  3 rocminfo                 0.00     0.03\n\t  1 ps                       0.00     0.01\n\t 82 sh                       0.00     0.00\n\t 13 gcc                      0.00     0.00\n\t 11 gsettings                0.00     0.00\n\t  8 stat                     0.00     0.00\n\t  8 systemd-detect-          0.00     0.00\n\t  6 llvm-link                0.00     0.00\n\t  5 phoronix-test-s          0.00     0.00\n\t  4 gmain                    0.00     0.00\n\t  3 hmmer                    0.00     0.00\n\t  2 cc                       0.00     0.00\n\t  2 lscpu                    0.00     0.00\n\t  2 uname                    0.00     0.00\n\t  2 which                    0.00     0.00\n\t  2 xset                     0.00     0.00\n\t  1 date                     0.00     0.00\n\t  1 dconf worker             0.00     0.00\n\t  1 dirname                  0.00     0.00\n\t  1 dmesg                    0.00     0.00\n\t  1 dmidecode                0.00     0.00\n\t  1 grep                     0.00     0.00\n\t  1 ifconfig                 0.00     0.00\n\t  1 ip                       0.00     0.00\n\t  1 lsmod                    0.00     0.00\n\t  1 mktemp                   0.00     0.00\n\t  1 qdbus                    0.00     0.00\n\t  1 readlink                 0.00     0.00\n\t  1 realpath              
   0.00     0.00\n\t  1 sed                      0.00     0.00\n\t  1 sort                     0.00     0.00\n\t  1 stty                     0.00     0.00\n\t  1 systemctl                0.00     0.00\n\t  1 template.sh              0.00     0.00\n\t  1 wc                       0.00     0.00\n\t  1 xrandr                   0.00     0.00\n0 processes running\n47 maximum processes\n<\/code><\/pre>\n\n\n\n<p>The computation blocks show the starts of these short-lived threads.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>      1600545) hmmer            cpu=0 start=5.74  finish=110.22\n        1600546) mpirun           cpu=7 start=5.74  finish=110.19\n          1600549) mpirun           cpu=12 start=6.33  finish=110.19\n          1600550) mpirun           cpu=13 start=6.33  finish=6.33 \n          1600551) mpirun           cpu=10 start=6.35  finish=110.19\n          1600552) mpirun           cpu=12 start=6.83  finish=110.19\n          1600553) mpirun           cpu=1 start=6.83  finish=110.19\n          1600554) hmmsearch        cpu=9 start=6.86  finish=109.62\n            1600560) hmmsearch        cpu=3 start=6.89  finish=6.89 \n            1600561) hmmsearch        cpu=1 start=6.89  finish=6.89 \n            1600568) hmmsearch        cpu=14 start=6.90  finish=6.90 \n            1600569) hmmsearch        cpu=9 start=6.90  finish=6.90 \n            1600588) hmmsearch        cpu=9 start=6.92  finish=6.92 \n            1600589) hmmsearch        cpu=2 start=6.92  finish=6.92 \n            1600608) hmmsearch        cpu=4 start=6.94  finish=6.94 \n            1600609) hmmsearch        cpu=15 start=6.94  finish=6.94 \n            1600624) hmmsearch        cpu=13 start=6.96  finish=6.96 \n            1600625) hmmsearch        cpu=15 start=6.96  finish=6.96 \n            1600634) hmmsearch        cpu=13 start=6.97  finish=6.97 \n            1600635) hmmsearch        cpu=15 start=6.97  finish=6.97 \n            1600648) hmmsearch        cpu=15 start=6.99  finish=6.99 \n            
1600649) hmmsearch        cpu=13 start=6.99  finish=6.99 \n            1600656) hmmsearch        cpu=15 start=7.00  finish=7.00 \n            1600657) hmmsearch        cpu=13 start=7.00  finish=7.00 \n            1600682) hmmsearch        cpu=6 start=7.02  finish=7.02 \n            1600683) hmmsearch        cpu=5 start=7.02  finish=7.02 \n            1600702) hmmsearch        cpu=0 start=7.04  finish=7.04 \n            1600703) hmmsearch        cpu=5 start=7.04  finish=7.04 \n            1600722) hmmsearch        cpu=14 start=7.07  finish=7.07 \n            1600723) hmmsearch        cpu=0 start=7.07  finish=7.07 \n            1600738) hmmsearch        cpu=9 start=7.09  finish=7.09 \n            1600739) hmmsearch        cpu=6 start=7.09  finish=7.09 \n            1600748) hmmsearch        cpu=12 start=7.11  finish=7.11 \n            1600749) hmmsearch        cpu=11 start=7.11  finish=7.11 \n            1600762) hmmsearch        cpu=14 start=7.12  finish=7.12 \n            1600763) hmmsearch        cpu=11 start=7.12  finish=7.12 \n            1600774) hmmsearch        cpu=6 start=7.14  finish=7.14 \n            1600775) hmmsearch        cpu=11 start=7.14  finish=7.14 \n            1600788) hmmsearch        cpu=13 start=7.15  finish=7.15 \n            1600789) hmmsearch        cpu=3 start=7.15  finish=7.15 \n            1600822) hmmsearch        cpu=11 start=7.20  finish=7.20 \n            1600823) hmmsearch        cpu=4 start=7.20  finish=7.20 \n            1600836) hmmsearch        cpu=12 start=7.21  finish=7.21 \n            1600837) hmmsearch        cpu=3 start=7.21  finish=7.21 \n            1600848) hmmsearch        cpu=13 start=7.22  finish=7.22 \n            1600849) hmmsearch        cpu=4 start=7.22  finish=7.22 \n            1600858) hmmsearch        cpu=12 start=7.24  finish=7.24 \n            1600859) hmmsearch        cpu=1 start=7.24  finish=7.24 \n            1600876) hmmsearch        cpu=13 start=7.26  finish=7.26 \n            1600877) hmmsearch       
 cpu=4 start=7.26  finish=7.26 \n            1600892) hmmsearch        cpu=5 start=7.28  finish=7.28 \n            1600893) hmmsearch        cpu=4 start=7.28  finish=7.28 \n            1600902) hmmsearch        cpu=12 start=7.29  finish=7.29 \n            1600903) hmmsearch        cpu=5 start=7.29  finish=7.29 \n            1600924) hmmsearch        cpu=13 start=7.32  finish=7.32 \n            1600925) hmmsearch        cpu=12 start=7.32  finish=7.32 \n            1600934) hmmsearch        cpu=5 start=7.34  finish=7.34 \n            1600935) hmmsearch        cpu=12 start=7.34  finish=7.34 \n<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>hmmer is scientific code looking through profile hidden markov models. There is one test where the goal is to minimize time. This is parallel code running on half the cores. Looking at Intel code suggests it runs on cores but <span class=\"excerpt-dots\">&hellip;<\/span> <a class=\"more-link\" href=\"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/hmmer\/\"><span class=\"more-msg\">Continue reading 
&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":58,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-645","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/645","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/comments?post=645"}],"version-history":[{"count":1,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/645\/revisions"}],"predecessor-version":[{"id":648,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/645\/revisions\/648"}],"up":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/58"}],"wp:attachment":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/media?parent=645"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}