{"id":800,"date":"2024-01-22T10:29:22","date_gmt":"2024-01-22T10:29:22","guid":{"rendered":"https:\/\/mvermeulen.org\/perf\/?page_id=800"},"modified":"2024-01-22T10:29:23","modified_gmt":"2024-01-22T10:29:23","slug":"fftw","status":"publish","type":"page","link":"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/fftw\/","title":{"rendered":"fftw"},"content":{"rendered":"\n<p>Tests the FFTW fast Fourier transform library with FFTs in 32 different sizes and dimensions. Overall, the benchmark looks single-threaded and varies in how busy the CPU cores are.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-51.png\" alt=\"\" class=\"wp-image-801\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-51.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-51-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-51-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>The topdown overview varies by benchmark, but most show few frontend stalls and more backend stalls. It also seems to vary with backend retirement. 
This is a case where I expect contrasts if you pull apart FFTs of different sizes.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-89.png\" alt=\"\" class=\"wp-image-802\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-89.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-89-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-89-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>AMD topdown metrics show almost 40% floating point (about 385 float instructions per 1000) with few branches (about 40 per 1000). The L2 miss rate is moderate (19.92%), and memory dominates backend stalls (44.3% of slots) over CPU-bound stalls such as floating point (15.5%).<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              6035.507\non_cpu               0.052          # 0.84 \/ 16 cores\nutime                5057.912\nstime                4.091\nnvcsw                3674           # 13.24%\nnivcsw               24069          # 86.76%\ninblock              2608           # 0.43\/sec\nonblock              132160         # 21.90\/sec\ncpu-clock            5063022032326  # 5063.022 seconds\ntask-clock           5063105505005  # 5063.106 seconds\npage faults          1403691        # 277.239\/sec\ncontext switches     57281          # 11.313\/sec\ncpu migrations       1619           # 0.320\/sec\nmajor page faults    2              # 0.000\/sec\nminor page faults    1403689        # 277.239\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             1965580397202  # 39.916 branches per 1000 inst\nbranch misses        3298066028     # 0.17% branch miss\nconditional          1656507779012  # 33.640 conditional branches per 1000 inst\nindirect             91167644504    # 1.851 indirect 
branches per 1000 inst\ncpu-cycles           19871548371102 # 0.23 GHz\ninstructions         41895510041240 # 2.11 IPC\nslots                39757752428328 #\nretiring             14486190179724 # 36.4% (36.4%)\n-- ucode             16153741583    #     0.0%\n-- fastpath          14470036438141 #    36.4%\nfrontend             1299222028903  #  3.3% ( 3.3%)\n-- latency           488647318608   #     1.2%\n-- bandwidth         810574710295   #     2.0%\nbackend              23765685993487 # 59.8% (59.8%)\n-- cpu               6143636408133  #    15.5%\n-- memory            17622049585354 #    44.3%\nspeculation          205999054100   #  0.5% ( 0.5%)\n-- branch mispredict 115970837458   #     0.3%\n-- pipeline restart  90028216642    #     0.2%\nsmt-contention       654137508      #  0.0% ( 0.0%)\ncpu-cycles           22621150909378 # 0.23 GHz\ninstructions         46306295646487 # 2.05 IPC\ninstructions         15440006308981 # 54.750 l2 access per 1000 inst\nl2 hit from l1       551234904807   # 19.92% l2 miss\nl2 miss from l1      60719432531    #\nl2 hit from l2 pf    186431059131   #\nl3 hit from l2 pf    48964245685    #\nl3 miss from l2 pf   58708583119    #\ninstructions         15435614877141 # 384.957 float per 1000 inst\nfloat 512            230            # 0.000 AVX-512 per 1000 inst\nfloat 256            390            # 0.000 AVX-256 per 1000 inst\nfloat 128            5942046309998  # 384.957 AVX-128 per 1000 inst\nfloat MMX            0              # 0.000 MMX per 1000 inst\nfloat scalar         0              # 0.000 scalar per 1000 inst\n<\/code><\/pre>\n\n\n\n<p>Intel metrics<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              3010.244\non_cpu               0.049          # 0.78 \/ 16 cores\nutime                2335.260\nstime                2.102\nnvcsw                2964           # 20.89%\nnivcsw               11223          # 79.11%\ninblock              24             # 0.01\/sec\nonblock              105536         # 
35.06\/sec\ncpu-clock            2337833277758  # 2337.833 seconds\ntask-clock           2337870126196  # 2337.870 seconds\npage faults          683541         # 292.378\/sec\ncontext switches     28760          # 12.302\/sec\ncpu migrations       651            # 0.278\/sec\nmajor page faults    0              # 0.000\/sec\nminor page faults    683541         # 292.378\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             827918918657   # 38.741 branches per 1000 inst\nbranch misses        2461654182     # 0.30% branch miss\nconditional          827918940033   # 38.741 conditional branches per 1000 inst\nindirect             41023014118    # 1.920 indirect branches per 1000 inst\nslots                86916464289776 #\nretiring             39007424491498 # 44.9% (44.9%)\n-- ucode             1317253774803  #     1.5%\n-- fastpath          37690170716695 #    43.4%\nfrontend             3309488424692  #  3.8% ( 3.8%)\n-- latency           1285710019402  #     1.5%\n-- bandwidth         2023778405290  #     2.3%\nbackend              46132471585328 # 53.1% (53.1%)\n-- cpu               10736832716887 #    12.4%\n-- memory            35395638868441 #    40.7%\nspeculation          1999383497821  #  2.3% ( 2.3%)\n-- branch mispredict 1441628404359  #     1.7%\n-- pipeline restart  557755093462   #     0.6%\nsmt-contention       0              #  0.0% ( 0.0%)\ncpu-cycles           8626088451177  # 0.17 GHz\ninstructions         22951385088498 # 2.66 IPC\nl2 access            725794661607   # 31.630 l2 access per 1000 inst\nl2 miss              323765912211   # 44.61% l2 miss\n<\/code><\/pre>\n\n\n\n<p>The process overview shows not many processes, with the work done by an internal bench program. 
This did crash towards the end of the first run.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>613 processes\n\t161 bench                 2461.12     0.84\n\t 34 clinfo                   9.74     3.33\n\t 19 vulkaninfo               0.76     0.57\n\t  2 vulkani:disk$0           0.08     0.06\n\t  3 glxinfo:gdrv0            0.07     0.06\n\t  6 clang                    0.05     0.07\n\t  1 llvmpipe-0               0.04     0.03\n\t  1 llvmpipe-1               0.04     0.03\n\t  1 llvmpipe-10              0.04     0.03\n\t  1 llvmpipe-11              0.04     0.03\n\t  1 llvmpipe-12              0.04     0.03\n\t  1 llvmpipe-13              0.04     0.03\n\t  1 llvmpipe-14              0.04     0.03\n\t  1 llvmpipe-15              0.04     0.03\n\t  1 llvmpipe-2               0.04     0.03\n\t  1 llvmpipe-3               0.04     0.03\n\t  1 llvmpipe-4               0.04     0.03\n\t  1 llvmpipe-5               0.04     0.03\n\t  1 llvmpipe-6               0.04     0.03\n\t  1 llvmpipe-7               0.04     0.03\n\t  1 llvmpipe-8               0.04     0.03\n\t  1 llvmpipe-9               0.04     0.03\n\t  1 glxinfo                  0.04     0.02\n\t  1 glxinfo:cs0              0.04     0.02\n\t  1 glxinfo:disk$0           0.03     0.02\n\t  1 glxinfo:sh0              0.03     0.02\n\t  1 glxinfo:shlo0            0.03     0.02\n\t  1 ps                       0.00     0.01\n\t281 sh                       0.00     0.00\n\t 13 gcc                      0.00     0.00\n\t  8 gsettings                0.00     0.00\n\t  8 systemd-detect-          0.00     0.00\n\t  7 stat                     0.00     0.00\n\t  6 llvm-link                0.00     0.00\n\t  5 gmain                    0.00     0.00\n\t  4 phoronix-test-s          0.00     0.00\n\t  2 dconf worker             0.00     0.00\n\t  2 which                    0.00     0.00\n\t  1 cc                       0.00     0.00\n\t  1 date                     0.00     0.00\n\t  1 dirname                  0.00     0.00\n\t  1 grep 
                    0.00     0.00\n\t  1 ifconfig                 0.00     0.00\n\t  1 ip                       0.00     0.00\n\t  1 lscpu                    0.00     0.00\n\t  1 mktemp                   0.00     0.00\n\t  1 qdbus                    0.00     0.00\n\t  1 readlink                 0.00     0.00\n\t  1 realpath                 0.00     0.00\n\t  1 sed                      0.00     0.00\n\t  1 sort                     0.00     0.00\n\t  1 stty                     0.00     0.00\n\t  1 systemctl                0.00     0.00\n\t  1 template.sh              0.00     0.00\n\t  1 wc                       0.00     0.00\n\t  1 xrandr                   0.00     0.00\n\t  1 xset                     0.00     0.00\n11 processes running\n47 maximum processes\n<\/code><\/pre>\n\n\n\n<p>Computation is repeated invocations of bench, e.g.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>      294703) sh               cpu=0 start=29.18 finish=32.71\n        294704) bench            cpu=4 start=29.18 finish=32.71\n      294705) sh               cpu=0 start=36.72 finish=40.27\n        294706) bench            cpu=9 start=36.72 finish=40.27\n      294707) sh               cpu=0 start=44.27 finish=47.81\n        294708) bench            cpu=1 start=44.27 finish=47.81\n      294709) sh               cpu=0 start=47.81 finish=47.82\n        294710) sh               cpu=1 start=47.81 finish=47.82\n      294711) sh               cpu=1 start=58.22 finish=60.70\n        294712) bench            cpu=2 start=58.23 finish=60.70\n      294713) sh               cpu=8 start=64.70 finish=67.14\n        294714) bench            cpu=9 start=64.70 finish=67.14\n      294715) sh               cpu=8 start=71.14 finish=73.58\n        294716) bench            cpu=1 start=71.14 finish=73.58\n      294717) sh               cpu=10 start=73.58 finish=73.58\n        294718) sh               cpu=11 start=73.58 finish=73.58\n      294719) sh               cpu=10 start=92.97 finish=96.45\n        294720) 
bench            cpu=3 start=92.97 finish=96.45\n      294721) sh               cpu=10 start=100.45 finish=103.90\n        294722) bench            cpu=11 start=100.45 finish=103.90\n      294723) sh               cpu=2 start=107.90 finish=111.35\n        294724) bench            cpu=3 start=107.91 finish=111.35\n      294725) sh               cpu=3 start=111.35 finish=111.35\n        294726) sh               cpu=12 start=111.35 finish=111.35\n      294727) sh               cpu=2 start=123.78 finish=126.36\n        294728) bench            cpu=3 start=123.78 finish=126.36\n      294729) sh               cpu=2 start=130.37 finish=132.90\n        294730) bench            cpu=11 start=130.37 finish=132.89\n      294731) sh               cpu=10 start=136.90 finish=139.41\n        294732) bench            cpu=3 start=136.90 finish=139.41\n      294733) sh               cpu=12 start=139.41 finish=139.42\n        294734) sh               cpu=5 start=139.41 finish=139.41\n      294735) sh               cpu=2 start=155.84 finish=158.60\n        294736) bench            cpu=3 start=155.84 finish=158.60\n      294737) sh               cpu=10 start=162.61 finish=165.38\n        294738) bench            cpu=3 start=162.61 finish=165.38\n      294740) sh               cpu=10 start=169.38 finish=172.19\n        294741) bench            cpu=11 start=169.38 finish=172.19\n      294742) sh               cpu=10 start=172.19 finish=172.19\n        294743) sh               cpu=11 start=172.19 finish=172.19\n      294745) sh               cpu=2 start=183.37 finish=186.57\n        294746) bench            cpu=11 start=183.37 finish=186.57\n      294747) sh               cpu=2 start=190.57 finish=193.77\n        294748) bench            cpu=3 start=190.57 finish=193.76\n      294750) sh               cpu=2 start=197.77 finish=200.99\n        294751) bench            cpu=3 start=197.77 finish=200.98\n      294752) sh               cpu=2 start=200.99 finish=200.99\n        294753) sh        
       cpu=3 start=200.99 finish=200.99\n<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Tests the FFTW fast Fourier transform library with FFTs in 32 different sizes and dimensions. Overall, the benchmark looks single-threaded and varies in how busy the CPU cores are. The topdown overview varies by benchmark, but most show few frontend stalls and more backend <span class=\"excerpt-dots\">&hellip;<\/span> <a class=\"more-link\" href=\"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/fftw\/\"><span class=\"more-msg\">Continue reading &rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":58,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-800","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/800","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/comments?post=800"}],"version-history":[{"count":1,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/800\/revisions"}],"predecessor-version":[{"id":803,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/800\/revisions\/803"}],"up":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/58"}],"wp:attachment":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/media?parent=800"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}