{"id":2479,"date":"2024-06-07T11:03:32","date_gmt":"2024-06-07T11:03:32","guid":{"rendered":"https:\/\/mvermeulen.org\/perf\/?page_id=2479"},"modified":"2024-06-07T11:03:34","modified_gmt":"2024-06-07T11:03:34","slug":"heffte","status":"publish","type":"page","link":"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/heffte\/","title":{"rendered":"heffte"},"content":{"rendered":"\n<p>HeFFTe is the Highly Efficient FFT for Exascale. This benchmark has 64 different subtests. Some fail for unexpected reasons, including a missing libelf library or a run that finishes too quickly. However, most run and provide an example result. The tests run in a mixture of modes: mostly single-threaded, with some phases using as many threads as there are cores.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/06\/systemtime-28.png\" alt=\"\" class=\"wp-image-2480\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/06\/systemtime-28.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/06\/systemtime-28-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/06\/systemtime-28-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>The topdown profile seems to show a persistent band of frontend-bound stalls, patches of backend stalls, and a fairly low retirement rate.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/06\/amdtopdown-30.png\" alt=\"\" class=\"wp-image-2482\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/06\/amdtopdown-30.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/06\/amdtopdown-30-1024x768.png 1024w, 
https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/06\/amdtopdown-30-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>The AMD metrics show an average of about 3 of the 16 cores in use. This is floating-point code with about 60% backend memory stalls. Frontend stalls average 17% overall, and the retirement rate is below 10%.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              2584.235\non_cpu               0.187          # 2.99 \/ 16 cores\nutime                6127.346\nstime                1597.466\nnvcsw                2219897        # 96.96%\nnivcsw               69603          # 3.04%\ninblock              22620712       # 8753.35\/sec\nonblock              3811880        # 1475.05\/sec\ncpu-clock            9104977687955  # 9104.978 seconds\ntask-clock           9105744639000  # 9105.745 seconds\npage faults          708443241      # 77801.791\/sec\ncontext switches     3181073        # 349.348\/sec\ncpu migrations       52924          # 5.812\/sec\nmajor page faults    226527         # 24.877\/sec\nminor page faults    708216255      # 77776.863\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             2648061711421  # 127.197 branches per 1000 inst\nbranch misses        137376535860   # 5.19% branch miss\nconditional          1746794451531  # 83.906 conditional branches per 1000 inst\nindirect             72161814734    # 3.466 indirect branches per 1000 inst\ncpu-cycles           39238084515767 # 0.95 GHz\ninstructions         20656319189605 # 0.53 IPC low\nslots                78565017276084 #\nretiring             7430090396720  #  9.5% ( 9.5%) low\n-- ucode             22792162227    #     0.0%\n-- fastpath          7407298234493  #     9.4%\nfrontend             13446209644619 # 17.1% (17.2%)\n-- latency           9245125183632  #    11.8%\n-- bandwidth         4201084460987  #     5.3%\nbackend              56879497108114 # 72.4% (73.0%) 
high\n-- cpu               9300687286741  #    11.8%\n-- memory            47578809821373 #    60.6%\nspeculation          213890267340   #  0.3% ( 0.3%) low\n-- branch mispredict 211243116854   #     0.3%\n-- pipeline restart  2647150486     #     0.0%\nsmt-contention       595242856933   #  0.8% ( 0.0%)\ncpu-cycles           39180222661690 # 0.96 GHz\ninstructions         20555473254360 # 0.52 IPC low\ninstructions         6868140531111  # 57.742 l2 access per 1000 inst\nl2 hit from l1       303283223345   # 38.82% l2 miss\nl2 miss from l1      101327869773   #\nl2 hit from l2 pf    40674479706    #\nl3 hit from l2 pf    5564960545     #\nl3 miss from l2 pf   47056539540    #\ninstructions         6852144871392  # 128.190 float per 1000 inst\nfloat 512            919            # 0.000 AVX-512 per 1000 inst\nfloat 256            7426           # 0.000 AVX-256 per 1000 inst\nfloat 128            878379612146   # 128.190 AVX-128 per 1000 inst\nfloat MMX            0              # 0.000 MMX per 1000 inst\nfloat scalar         0              # 0.000 scalar per 1000 inst\ninstructions         20699274879600 #\nopcache              3632503887154  # 175.489 opcache per 1000 inst\nopcache miss         906178654575   # 24.9% opcache miss rate\nl1 dTLB miss         66246872882    # 3.200 L1 dTLB per 1000 inst\nl2 dTLB miss         14096459068    # 0.681 L2 dTLB per 1000 inst\ninstructions         20785223761254 #\nicache               2087723305047  # 100.443 icache per 1000 inst\nicache miss          54202803799    #  2.6% icache miss rate\nl1 iTLB miss         67609643       # 0.003 L1 iTLB per 1000 inst\nl2 iTLB miss         0              # 0.000 L2 iTLB per 1000 inst\ntlb flush            20274864       # 0.001 TLB flush per 1000 inst\n<\/code><\/pre>\n\n\n\n<p>Intel metrics<\/p>\n\n\n\n<p><\/p>\n\n\n\n<p>The process overview shows that mpirun is used to launch the tests and that most of the time is spent in either speed3d_c2c or speed3d_r2c.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>9007 processes\n\t3240 
speed3d_c2c          14178.01  4914.72\n\t3456 speed3d_r2c           7832.05  3804.90\n\t1602 mpirun                  79.39   544.27\n\t 68 clinfo                  16.17     9.66\n\t 38 vulkaninfo               1.14     1.91\n\t  6 php                      0.63   206.14\n\t  4 vulkani:disk$0           0.12     0.21\n\t  6 glxinfo:gdrv0            0.09     0.15\n\t  6 glxinfo:gl0              0.09     0.15\n\t  2 llvmpipe-0               0.06     0.11\n\t  2 llvmpipe-1               0.06     0.11\n\t  2 llvmpipe-10              0.06     0.11\n\t  2 llvmpipe-11              0.06     0.11\n\t  2 llvmpipe-12              0.06     0.11\n\t  2 llvmpipe-13              0.06     0.11\n\t  2 llvmpipe-14              0.06     0.11\n\t  2 llvmpipe-15              0.06     0.11\n\t  2 llvmpipe-2               0.06     0.11\n\t  2 llvmpipe-4               0.06     0.11\n\t  2 llvmpipe-5               0.06     0.11\n\t  2 llvmpipe-6               0.06     0.11\n\t  2 llvmpipe-7               0.06     0.11\n\t  2 llvmpipe-8               0.06     0.11\n\t  2 llvmpipe-9               0.06     0.11\n\t  6 clang                    0.06     0.10\n\t  2 llvmpipe-3               0.06     0.10\n\t  2 glxinfo                  0.05     0.05\n\t  2 glxinfo:cs0              0.05     0.05\n\t  2 glxinfo:disk$0           0.05     0.05\n\t  2 glxinfo:sh0              0.05     0.05\n\t  2 glxinfo:shlo0            0.05     0.05\n\t  3 rocminfo                 0.03     0.00\n\t  1 lspci                    0.00     0.03\n\t  1 ps                       0.00     0.01\n\t267 heffte                   0.00     0.00\n\t176 sh                       0.00     0.00\n\t 13 gcc                      0.00     0.00\n\t  9 gsettings                0.00     0.00\n\t  8 stat                     0.00     0.00\n\t  8 systemd-detect-          0.00     0.00\n\t  6 llvm-link                0.00     0.00\n\t  5 gmain                    0.00     0.00\n\t  5 phoronix-test-s          0.00     0.00\n\t  2 cc                  
     0.00     0.00\n\t  2 dconf worker             0.00     0.00\n\t  2 lscpu                    0.00     0.00\n\t  2 uname                    0.00     0.00\n\t  2 which                    0.00     0.00\n\t  2 xset                     0.00     0.00\n\t  1 date                     0.00     0.00\n\t  1 dirname                  0.00     0.00\n\t  1 dmesg                    0.00     0.00\n\t  1 dmidecode                0.00     0.00\n\t  1 grep                     0.00     0.00\n\t  1 ifconfig                 0.00     0.00\n\t  1 ip                       0.00     0.00\n\t  1 lsmod                    0.00     0.00\n\t  1 mktemp                   0.00     0.00\n\t  1 qdbus                    0.00     0.00\n\t  1 readlink                 0.00     0.00\n\t  1 realpath                 0.00     0.00\n\t  1 sed                      0.00     0.00\n\t  1 sort                     0.00     0.00\n\t  1 stty                     0.00     0.00\n\t  1 systemctl                0.00     0.00\n\t  1 template.sh              0.00     0.00\n\t  1 wc                       0.00     0.00\n\t  1 xrandr                   0.00     0.00\n4 processes running\n51 maximum processes\n<\/code><\/pre>\n\n\n\n<p>Example of a core computation block<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>      65412) heffte           cpu=2 start=5.91  finish=6.82 \n        65413) mpirun           cpu=6 start=5.91  finish=6.80 \n          65414) mpirun           cpu=1 start=6.11  finish=6.80 \n          65415) mpirun           cpu=4 start=6.11  finish=6.11 \n          65416) mpirun           cpu=4 start=6.14  finish=6.79 \n          65417) mpirun           cpu=8 start=6.24  finish=6.79 \n          65418) mpirun           cpu=5 start=6.24  finish=6.79 \n          65419) speed3d_c2c      cpu=12 start=6.28  finish=6.78 \n            65424) speed3d_c2c      cpu=9 start=6.29  finish=6.78 \n            65427) speed3d_c2c      cpu=13 start=6.30  finish=6.78 \n          65420) speed3d_c2c      cpu=7 start=6.28  finish=6.78 
\n            65423) speed3d_c2c      cpu=4 start=6.29  finish=6.78 \n            65428) speed3d_c2c      cpu=11 start=6.30  finish=6.78 \n          65421) speed3d_c2c      cpu=9 start=6.29  finish=6.78 \n            65425) speed3d_c2c      cpu=5 start=6.29  finish=6.78 \n            65430) speed3d_c2c      cpu=6 start=6.30  finish=6.78 \n          65422) speed3d_c2c      cpu=2 start=6.29  finish=6.78 \n            65429) speed3d_c2c      cpu=11 start=6.30  finish=6.78 \n            65432) speed3d_c2c      cpu=12 start=6.30  finish=6.78 \n          65426) speed3d_c2c      cpu=14 start=6.29  finish=6.78 \n            65433) speed3d_c2c      cpu=12 start=6.30  finish=6.78 \n            65437) speed3d_c2c      cpu=10 start=6.31  finish=6.78 \n          65431) speed3d_c2c      cpu=5 start=6.30  finish=6.78 \n            65435) speed3d_c2c      cpu=8 start=6.31  finish=6.78 \n            65439) speed3d_c2c      cpu=8 start=6.31  finish=6.78 \n          65434) speed3d_c2c      cpu=0 start=6.30  finish=6.78 \n            65438) speed3d_c2c      cpu=7 start=6.31  finish=6.78 \n            65441) speed3d_c2c      cpu=14 start=6.32  finish=6.78 \n          65436) speed3d_c2c      cpu=3 start=6.31  finish=6.78 \n            65440) speed3d_c2c      cpu=4 start=6.32  finish=6.78 \n            65442) speed3d_c2c      cpu=1 start=6.32  finish=6.78 \n<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>HeFFTe is the Highly Efficient FFT for Exascale. This benchmark has 64 different subtests. Some fail for strange reasons including a missing libelf library or running too quickly. However, most run and provide an example result. 
These tests run in <span class=\"excerpt-dots\">&hellip;<\/span> <a class=\"more-link\" href=\"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/heffte\/\"><span class=\"more-msg\">Continue reading &rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":58,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-2479","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/2479","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/comments?post=2479"}],"version-history":[{"count":1,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/2479\/revisions"}],"predecessor-version":[{"id":2483,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/2479\/revisions\/2483"}],"up":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/58"}],"wp:attachment":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/media?parent=2479"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}