{"id":1990,"date":"2024-03-04T12:05:02","date_gmt":"2024-03-04T12:05:02","guid":{"rendered":"https:\/\/mvermeulen.org\/perf\/?page_id=1990"},"modified":"2024-03-05T00:29:24","modified_gmt":"2024-03-05T00:29:24","slug":"parboil","status":"publish","type":"page","link":"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/parboil\/","title":{"rendered":"parboil"},"content":{"rendered":"\n<p>A set of computing benchmarks that use OpenCL, OpenMP and CUDA. The OpenCL ones fail, leaving four workloads that run correctly.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/systemtime-17.png\" alt=\"\" class=\"wp-image-1993\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/systemtime-17.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/systemtime-17-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/systemtime-17-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>The topdown profile shows workloads dominated by backend stalls.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/amdtopdown-18.png\" alt=\"\" class=\"wp-image-1995\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/amdtopdown-18.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/amdtopdown-18-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/amdtopdown-18-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>AMD metrics confirm high backend stalls, with low shares for the other stall categories and for retirement. 
This is floating point code with a low IPC.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              581.377\non_cpu               0.648          # 10.37 \/ 16 cores\nutime                6019.004\nstime                8.001\nnvcsw                11085          # 16.04%\nnivcsw               58007          # 83.96%\ninblock              0              # 0.00\/sec\nonblock              618240         # 1063.41\/sec\ncpu-clock            6028754320350  # 6028.754 seconds\ntask-clock           6028877226692  # 6028.877 seconds\npage faults          2529117        # 419.500\/sec\ncontext switches     71602          # 11.877\/sec\ncpu migrations       1516           # 0.251\/sec\nmajor page faults    13             # 0.002\/sec\nminor page faults    2529104        # 419.498\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             356376744888   # 41.762 branches per 1000 inst\nbranch misses        5808698352     # 1.63% branch miss\nconditional          311127182240   # 36.460 conditional branches per 1000 inst\nindirect             6537032472     # 0.766 indirect branches per 1000 inst\ncpu-cycles           26789856580851 # 2.91 GHz\ninstructions         8523753717315  # 0.32 IPC low\nslots                53574576334074 #\nretiring             3037285457309  #  5.7% ( 6.9%) low\n-- ucode             30795275761    #     0.1%\n-- fastpath          3006490181548  #     5.6%\nfrontend             2105424630950  #  3.9% ( 4.8%) low\n-- latency           1145189084004  #     2.1%\n-- bandwidth         960235546946   #     1.8%\nbackend              38648611175245 # 72.1% (87.7%) high\n-- cpu               17650289931049 #    32.9%\n-- memory            20998321244196 #    39.2%\nspeculation          254667013033   #  0.5% ( 0.6%) low\n-- branch mispredict 167646250146   #     0.3%\n-- pipeline restart  87020762887    #     0.2%\nsmt-contention       9528539434241  # 17.8% ( 0.0%)\ncpu-cycles    
       26694322086662 # 2.91 GHz\ninstructions         8523373024551  # 0.32 IPC low\ninstructions         2839498090258  # 49.711 l2 access per 1000 inst\nl2 hit from l1       109922376042   # 29.68% l2 miss\nl2 miss from l1      23104420522    #\nl2 hit from l2 pf    12447739921    #\nl3 hit from l2 pf    1892129919     #\nl3 miss from l2 pf   16891383017    #\ninstructions         2839703362814  # 335.613 float per 1000 inst\nfloat 512            126            # 0.000 AVX-512 per 1000 inst\nfloat 256            926            # 0.000 AVX-256 per 1000 inst\nfloat 128            953042328671   # 335.613 AVX-128 per 1000 inst\nfloat MMX            0              # 0.000 MMX per 1000 inst\nfloat scalar         5              # 0.000 scalar per 1000 inst\ninstructions         8524876284215  #\nopcache              971449144496   # 113.955 opcache per 1000 inst\nopcache miss         18397630060    #  1.9% opcache miss rate\nl1 dTLB miss         19074903429    # 2.238 L1 dTLB per 1000 inst\nl2 dTLB miss         15094481558    # 1.771 L2 dTLB per 1000 inst\ninstructions         8520906203159  #\nicache               26149384533    # 3.069 icache per 1000 inst\nicache miss          2194992940     #  8.4% icache miss rate\nl1 iTLB miss         54453575       # 0.006 L1 iTLB per 1000 inst\nl2 iTLB miss         0              # 0.000 L2 iTLB per 1000 inst\ntlb flush            2101941        # 0.000 TLB flush per 1000 inst\n<\/code><\/pre>\n\n\n\n<p>Intel metrics show the L3 portion of memory stalls is the largest.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              1375.760\non_cpu               0.801          # 12.81 \/ 16 cores\nutime                17614.523\nstime                9.219\nnvcsw                8620           # 6.50%\nnivcsw               123986         # 93.50%\ninblock              744            # 0.54\/sec\nonblock              804440         # 584.72\/sec\ncpu-clock            17625758069268 # 17625.758 seconds\ntask-clock           
17625905653801 # 17625.906 seconds\npage faults          4638836        # 263.183\/sec\ncontext switches     139046         # 7.889\/sec\ncpu migrations       4590           # 0.260\/sec\nmajor page faults    1              # 0.000\/sec\nminor page faults    4638835        # 263.183\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             419343046573   # 20.778 branches per 1000 inst\nbranch misses        4547421869     # 1.08% branch miss\nconditional          419343085101   # 20.778 conditional branches per 1000 inst\nindirect             67921372342    # 3.365 indirect branches per 1000 inst\nslots                136495034123360 #\nretiring             7814481448059  #  5.7% ( 5.7%) low\n-- ucode             1698017378425  #     1.2%\n-- fastpath          6116464069634  #     4.5%\nfrontend             5269661840749  #  3.9% ( 3.9%) low\n-- latency           4422139728561  #     3.2%\n-- bandwidth         847522112188   #     0.6%\nbackend              122695895540374 # 89.9% (89.9%) high\n-- cpu               24290445256795 #    17.8%\n-- memory            98405450283579 #    72.1%\nspeculation          1063790694054  #  0.8% ( 0.8%) low\n-- branch mispredict 830124740747   #     0.6%\n-- pipeline restart  233665953307   #     0.2%\nsmt-contention       0              #  0.0% ( 0.0%)\ncpu-cycles           35539188290105 # 1.96 GHz\ninstructions         11454483768790 # 0.32 IPC low\nl2 access            305084043503   # 38.065 l2 access per 1000 inst\nl2 miss              118297250080   # 38.78% l2 miss\ncpu-cycles           43478714436613 # 73.5% memory latency\nload stalls          27281801869658 # 24.9% l1 bound\nl1 miss              16441424261761 #  1.0% l2 bound\nl2 miss              16022798500216 # 30.8% l3 bound\nl3 miss              2650993519337  #  6.1% dram bound\nstore_stalls         4674882882278  # 10.8% store bound\n<\/code><\/pre>\n\n\n\n<p>Process overview shows different 
processes per workload.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>1086 processes\n\t 48 lbm                  54292.96    24.48\n\t 48 mri-gridding         32846.56    11.68\n\t 48 stencil               6022.24     6.72\n\t 48 cutcp                 1348.80     4.48\n\t408 clinfo                  98.72    36.61\n\t 42 python2                  6.09     1.72\n\t 38 vulkaninfo               0.39     1.52\n\t  6 php                      0.09     0.32\n\t  6 glxinfo:gdrv0            0.08     0.10\n\t  6 glxinfo:gl0              0.08     0.10\n\t  3 ld                       0.05     0.03\n\t  4 vulkani:disk$0           0.04     0.16\n\t  6 clang                    0.04     0.05\n\t  2 glxinfo                  0.04     0.04\n\t  2 glxinfo:cs0              0.04     0.04\n\t  2 glxinfo:disk$0           0.04     0.04\n\t  2 glxinfo:sh0              0.04     0.04\n\t  2 glxinfo:shlo0            0.04     0.04\n\t  3 rocminfo                 0.03     0.00\n\t  2 llvmpipe-0               0.02     0.08\n\t  2 llvmpipe-1               0.02     0.08\n\t  2 llvmpipe-10              0.02     0.08\n\t  2 llvmpipe-11              0.02     0.08\n\t  2 llvmpipe-12              0.02     0.08\n\t  2 llvmpipe-13              0.02     0.08\n\t  2 llvmpipe-14              0.02     0.08\n\t  2 llvmpipe-15              0.02     0.08\n\t  2 llvmpipe-2               0.02     0.08\n\t  2 llvmpipe-3               0.02     0.08\n\t  2 llvmpipe-4               0.02     0.08\n\t  2 llvmpipe-5               0.02     0.08\n\t  2 llvmpipe-6               0.02     0.08\n\t  2 llvmpipe-7               0.02     0.08\n\t  2 llvmpipe-8               0.02     0.08\n\t  2 llvmpipe-9               0.02     0.08\n\t  1 lspci                    0.01     0.02\n\t145 sh                       0.00     0.00\n\t 60 make                     0.00     0.00\n\t 30 parboil                  0.00     0.00\n\t 13 gcc                      0.00     0.00\n\t 12 gsettings                0.00     0.00\n\t  8 stat                 
    0.00     0.00\n\t  8 systemd-detect-          0.00     0.00\n\t  6 llvm-link                0.00     0.00\n\t  5 phoronix-test-s          0.00     0.00\n\t  3 c++                      0.00     0.00\n\t  3 collect2                 0.00     0.00\n\t  3 gmain                    0.00     0.00\n\t  2 cc                       0.00     0.00\n\t  2 lscpu                    0.00     0.00\n\t  2 uname                    0.00     0.00\n\t  2 which                    0.00     0.00\n\t  2 xset                     0.00     0.00\n\t  1 date                     0.00     0.00\n\t  1 dconf worker             0.00     0.00\n\t  1 dirname                  0.00     0.00\n\t  1 dmesg                    0.00     0.00\n\t  1 dmidecode                0.00     0.00\n\t  1 grep                     0.00     0.00\n\t  1 ifconfig                 0.00     0.00\n\t  1 ip                       0.00     0.00\n\t  1 lsmod                    0.00     0.00\n\t  1 mktemp                   0.00     0.00\n\t  1 ps                       0.00     0.00\n\t  1 python                   0.00     0.00\n\t  1 python3                  0.00     0.00\n\t  1 qdbus                    0.00     0.00\n\t  1 readlink                 0.00     0.00\n\t  1 realpath                 0.00     0.00\n\t  1 sed                      0.00     0.00\n\t  1 sort                     0.00     0.00\n\t  1 stty                     0.00     0.00\n\t  1 systemctl                0.00     0.00\n\t  1 template.sh              0.00     0.00\n\t  1 wc                       0.00     0.00\n\t  1 xrandr                   0.00     0.00\n0 processes running\n47 maximum processes\n<\/code><\/pre>\n\n\n\n<p>An example computation block<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>      938107) parboil          cpu=7 start=121.12 finish=193.82\n        938108) python2          cpu=8 start=121.12 finish=193.82\n          938109) make             cpu=9 start=121.13 finish=121.14\n          938110) make             cpu=10 start=121.14 finish=192.13\n 
           938111) lbm              cpu=11 start=121.14 finish=192.13\n              938112) lbm              cpu=13 start=121.14 finish=192.13\n              938113) lbm              cpu=9 start=121.14 finish=192.13\n              938114) lbm              cpu=4 start=121.14 finish=192.13\n              938115) lbm              cpu=14 start=121.14 finish=192.13\n              938116) lbm              cpu=7 start=121.14 finish=192.13\n              938117) lbm              cpu=8 start=121.14 finish=192.13\n              938118) lbm              cpu=10 start=121.14 finish=192.13\n              938119) lbm              cpu=5 start=121.14 finish=192.13\n              938120) lbm              cpu=12 start=121.14 finish=192.13\n              938121) lbm              cpu=1 start=121.14 finish=192.13\n              938122) lbm              cpu=15 start=121.14 finish=192.13\n              938123) lbm              cpu=6 start=121.14 finish=192.13\n              938124) lbm              cpu=0 start=121.14 finish=192.13\n              938125) lbm              cpu=2 start=121.14 finish=192.13\n              938126) lbm              cpu=3 start=121.14 finish=192.13\n<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>A set of computing benchmarks that use OpenCL, OpenMP and CUDA. The OpenCL ones fail, leaving four workloads that run correctly. The topdown profile shows workloads dominated by backend stalls. 
AMD metrics confirm high backend stalls and <span class=\"excerpt-dots\">&hellip;<\/span> <a class=\"more-link\" href=\"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/parboil\/\"><span class=\"more-msg\">Continue reading &rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":58,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-1990","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1990","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/comments?post=1990"}],"version-history":[{"count":2,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1990\/revisions"}],"predecessor-version":[{"id":1996,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1990\/revisions\/1996"}],"up":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/58"}],"wp:attachment":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/media?parent=1990"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}