{"id":1405,"date":"2024-02-03T20:54:12","date_gmt":"2024-02-03T20:54:12","guid":{"rendered":"https:\/\/mvermeulen.org\/perf\/?page_id=1405"},"modified":"2024-02-03T22:38:51","modified_gmt":"2024-02-03T22:38:51","slug":"arrayfire","status":"publish","type":"page","link":"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/arrayfire\/","title":{"rendered":"arrayfire"},"content":{"rendered":"\n<p>A CPU and GPU numeric processing library, using both built-in CPU and OpenCL benchmarks. All run on my AMD system and the OpenCL fp16 fails on my Intel system. The OpenCL fp32 passes on Intel. The AMD is considerably faster, so I am curious whether it is getting some help from the GPU. Looks like the first two workloads are multi-threaded and the rest are single-threaded.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/systemtime-24.png\" alt=\"\" class=\"wp-image-1431\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/systemtime-24.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/systemtime-24-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/systemtime-24-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>Topdown profile shows high frontend stalls, with some variation between workloads and over time.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/amdtopdown-24.png\" alt=\"\" class=\"wp-image-1432\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/amdtopdown-24.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/amdtopdown-24-1024x768.png 1024w, 
https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/amdtopdown-24-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>AMD metrics show little floating point and moderate numbers of branches. Some L2 access though not particularly high backend stalls.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              255.589\non_cpu               0.316          # 5.05 \/ 16 cores\nutime                769.161\nstime                522.034\nnvcsw                84146          # 54.56%\nnivcsw               70084          # 45.44%\ninblock              0              # 0.00\/sec\nonblock              156304         # 611.54\/sec\ncpu-clock            1297086975088  # 1297.087 seconds\ntask-clock           1297163581400  # 1297.164 seconds\npage faults          1828517        # 1409.627\/sec\ncontext switches     155290         # 119.715\/sec\ncpu migrations       1592           # 1.227\/sec\nmajor page faults    155            # 0.119\/sec\nminor page faults    1828362        # 1409.508\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             566661991334   # 91.881 branches per 1000 inst\nbranch misses        61259641104    # 10.81% branch miss\nconditional          252994830141   # 41.022 conditional branches per 1000 inst\nindirect             24440357358    # 3.963 indirect branches per 1000 inst\ncpu-cycles           5083213642342  # 1.21 GHz\ninstructions         6172430147473  # 1.21 IPC\nslots                10170139884144 #\nretiring             2242627936735  # 22.1% (29.2%)\n-- ucode             8206700190     #     0.1%\n-- fastpath          2234421236545  #    22.0%\nfrontend             3184418817599  # 31.3% (41.5%)\n-- latency           2649897791970  #    26.1%\n-- bandwidth         534521025629   #     5.3%\nbackend              2238686350080  # 22.0% (29.2%)\n-- cpu               1024364604864  #    10.1%\n-- memory    
        1214321745216  #    11.9%\nspeculation          10863172263    #  0.1% ( 0.1%) low\n-- branch mispredict 10849523258    #     0.1%\n-- pipeline restart  13649005       #     0.0%\nsmt-contention       2493533444045  # 24.5% ( 0.0%)\ncpu-cycles           5022114467177  # 1.22 GHz\ninstructions         6157130266342  # 1.23 IPC\ninstructions         2057106720487  # 49.333 l2 access per 1000 inst\nl2 hit from l1       86581943288    # 9.06% l2 miss\nl2 miss from l1      2752199852     #\nl2 hit from l2 pf    8456824069     #\nl3 hit from l2 pf    5607231135     #\nl3 miss from l2 pf   836878029      #\ninstructions         2053827320707  # 21.171 float per 1000 inst\nfloat 512            83             # 0.000 AVX-512 per 1000 inst\nfloat 256            508            # 0.000 AVX-256 per 1000 inst\nfloat 128            43480827313    # 21.171 AVX-128 per 1000 inst\nfloat MMX            0              # 0.000 MMX per 1000 inst\nfloat scalar         0              # 0.000 scalar per 1000 inst\ninstructions         2665431        #\nopcache              988658         # 370.919 opcache per 1000 inst\nopcache miss         530873         # 53.7% opcache miss rate\nl1 dTLB miss         5558           # 2.085 L1 dTLB per 1000 inst\nl2 dTLB miss         1178           # 0.442 L2 dTLB per 1000 inst\ninstructions         2715463        #\nicache               1322587        # 487.058 icache per 1000 inst\nicache miss          112382         #  8.5% icache miss rate\nl1 iTLB miss         14             # 0.005 L1 iTLB per 1000 inst\nl2 iTLB miss         0              # 0.000 L2 iTLB per 1000 inst\ntlb flush            19             # 0.007 TLB flush per 1000 inst\n<\/code><\/pre>\n\n\n\n<p>Intel metrics show lower on-cpu and both L2 and dram stalls.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              749.501\non_cpu               0.116          # 1.85 \/ 16 cores\nutime                1332.666\nstime                55.026\nnvcsw                515494   
      # 6.33%\nnivcsw               7622108        # 93.67%\ninblock              15760          # 21.03\/sec\nonblock              10456          # 13.95\/sec\ncpu-clock            1384483475706  # 1384.483 seconds\ntask-clock           1384896944201  # 1384.897 seconds\npage faults          5647449        # 4077.884\/sec\ncontext switches     8141137        # 5878.515\/sec\ncpu migrations       120081         # 86.708\/sec\nmajor page faults    163            # 0.118\/sec\nminor page faults    5647286        # 4077.766\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             327470422030   # 29.701 branches per 1000 inst\nbranch misses        2265053860     # 0.69% branch miss\nconditional          327470459310   # 29.701 conditional branches per 1000 inst\nindirect             45635483920    # 4.139 indirect branches per 1000 inst\nslots                21933232930778 #\nretiring             12264288512056 # 55.9% (55.9%) high\n-- ucode             701472801784   #     3.2%\n-- fastpath          11562815710272 #    52.7%\nfrontend             2803267333224  # 12.8% (12.8%)\n-- latency           1783243291316  #     8.1%\n-- bandwidth         1020024041908  #     4.7%\nbackend              6871443097914  # 31.3% (31.3%)\n-- cpu               3842821497310  #    17.5%\n-- memory            3028621600604  #    13.8%\nspeculation          1485815766476  #  6.8% ( 6.8%)\n-- branch mispredict 1451257286366  #     6.6%\n-- pipeline restart  34558480110    #     0.2%\nsmt-contention       0              #  0.0% ( 0.0%)\ncpu-cycles           5173265968393  # 0.39 GHz\ninstructions         14473440438126 # 2.80 IPC\nl2 access            178180311971   # 17.066 l2 access per 1000 inst\nl2 miss              83307874133    # 46.75% l2 miss\ncpu-cycles           4579755625449  # 28.7% memory latency\nload stalls          1298392731852  #  0.5% l1 bound\nl1 miss              1273798954929  # 10.5% l2 
bound\nl2 miss              794601654431   #  5.6% l3 bound\nl3 miss              539003203401   # 11.8% dram bound\nstore_stalls         17972241280    #  0.4% store bound\n\n\n<\/code><\/pre>\n\n\n\n<p>AMD metrics show most of the time in the blas_cpu process.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>1198 processes\n\t102 blas_cpu             12168.09  8389.18\n\t 51 cg_cpu                 695.13   183.60\n\t306 blas_opencl            373.02   658.65\n\t153 cg_opencl              140.62   150.05\n\t272 clinfo                  74.48    24.64\n\t 38 vulkaninfo               1.14     1.14\n\t  6 glxinfo:gdrv0            0.13     0.03\n\t  6 glxinfo:gl0              0.13     0.03\n\t  4 vulkani:disk$0           0.12     0.12\n\t  6 php                      0.10     0.15\n\t  2 glxinfo                  0.08     0.02\n\t  2 glxinfo:cs0              0.08     0.02\n\t  2 glxinfo:disk$0           0.07     0.02\n\t  2 glxinfo:sh0              0.07     0.01\n\t  2 glxinfo:shlo0            0.07     0.01\n\t  2 llvmpipe-0               0.06     0.06\n\t  2 llvmpipe-1               0.06     0.06\n\t  2 llvmpipe-10              0.06     0.06\n\t  2 llvmpipe-11              0.06     0.06\n\t  2 llvmpipe-12              0.06     0.06\n\t  2 llvmpipe-13              0.06     0.06\n\t  2 llvmpipe-14              0.06     0.06\n\t  2 llvmpipe-15              0.06     0.06\n\t  2 llvmpipe-2               0.06     0.06\n\t  2 llvmpipe-3               0.06     0.06\n\t  2 llvmpipe-4               0.06     0.06\n\t  2 llvmpipe-5               0.06     0.06\n\t  2 llvmpipe-6               0.06     0.06\n\t  2 llvmpipe-7               0.06     0.06\n\t  2 llvmpipe-8               0.06     0.06\n\t  2 llvmpipe-9               0.06     0.06\n\t  6 clang                    0.05     0.05\n\t  3 rocminfo                 0.03     0.03\n\t  1 lspci                    0.01     0.02\n\t  1 ps                       0.00     0.01\n\t 98 sh                       0.00     0.00\n\t 18 arrayfire   
             0.00     0.00\n\t 13 gcc                      0.00     0.00\n\t 11 gsettings                0.00     0.00\n\t  8 stat                     0.00     0.00\n\t  8 systemd-detect-          0.00     0.00\n\t  6 llvm-link                0.00     0.00\n\t  5 phoronix-test-s          0.00     0.00\n\t  3 gmain                    0.00     0.00\n\t  2 cc                       0.00     0.00\n\t  2 dconf worker             0.00     0.00\n\t  2 lscpu                    0.00     0.00\n\t  2 uname                    0.00     0.00\n\t  2 which                    0.00     0.00\n\t  2 xset                     0.00     0.00\n\t  1 date                     0.00     0.00\n\t  1 dirname                  0.00     0.00\n\t  1 dmesg                    0.00     0.00\n\t  1 dmidecode                0.00     0.00\n\t  1 grep                     0.00     0.00\n\t  1 ifconfig                 0.00     0.00\n\t  1 ip                       0.00     0.00\n\t  1 lsmod                    0.00     0.00\n\t  1 mktemp                   0.00     0.00\n\t  1 qdbus                    0.00     0.00\n\t  1 readlink                 0.00     0.00\n\t  1 realpath                 0.00     0.00\n\t  1 sed                      0.00     0.00\n\t  1 sort                     0.00     0.00\n\t  1 stty                     0.00     0.00\n\t  1 systemctl                0.00     0.00\n\t  1 template.sh              0.00     0.00\n\t  1 wc                       0.00     0.00\n\t  1 xrandr                   0.00     0.00\n0 processes running\n59 maximum processes\n<\/code><\/pre>\n\n\n\n<p>Computation blocks look as follows<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>      363808) arrayfire        cpu=5 start=6.95  finish=23.75\n        363809) blas_cpu         cpu=7 start=6.95  finish=23.73\n          363810) blas_cpu         cpu=14 start=6.96  finish=23.73\n          363811) blas_cpu         cpu=10 start=6.96  finish=23.73\n          363812) blas_cpu         cpu=4 start=6.96  finish=23.73\n          363813) 
blas_cpu         cpu=9 start=6.96  finish=23.73\n          363814) blas_cpu         cpu=8 start=6.96  finish=23.73\n          363815) blas_cpu         cpu=3 start=6.96  finish=23.73\n          363816) blas_cpu         cpu=5 start=6.96  finish=23.73\n          363817) blas_cpu         cpu=15 start=6.96  finish=23.73\n          363818) blas_cpu         cpu=0 start=6.96  finish=23.73\n          363819) blas_cpu         cpu=6 start=6.96  finish=23.73\n          363820) blas_cpu         cpu=12 start=6.96  finish=23.73\n          363821) blas_cpu         cpu=1 start=6.96  finish=23.73\n          363822) blas_cpu         cpu=11 start=6.96  finish=23.73\n          363823) blas_cpu         cpu=2 start=6.96  finish=23.72\n          363824) blas_cpu         cpu=13 start=6.96  finish=23.72\n          363825) blas_cpu         cpu=7 start=6.96  finish=23.73\n<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>A CPU and GPU numeric processing library, using both built-in CPU and OpenCL benchmarks. All run on my AMD system and the OpenCL fp16 fails on my Intel system. 
The OpenCL fp32 passes on Intel. The AMD is considerably faster, <span class=\"excerpt-dots\">&hellip;<\/span> <a class=\"more-link\" href=\"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/arrayfire\/\"><span class=\"more-msg\">Continue reading &rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":58,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-1405","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1405","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/comments?post=1405"}],"version-history":[{"count":3,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1405\/revisions"}],"predecessor-version":[{"id":1435,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1405\/revisions\/1435"}],"up":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/58"}],"wp:attachment":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/media?parent=1405"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}