{"id":1023,"date":"2024-01-28T21:25:21","date_gmt":"2024-01-28T21:25:21","guid":{"rendered":"https:\/\/mvermeulen.org\/perf\/?page_id=1023"},"modified":"2024-01-29T10:44:06","modified_gmt":"2024-01-29T10:44:06","slug":"himeno","status":"publish","type":"page","link":"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/himeno\/","title":{"rendered":"himeno"},"content":{"rendered":"\n<p>Linear solver for the pressure Poisson equation using a point-Jacobi method. This is one of the few tests where the Intel CPU performs substantially better (6046) than the AMD CPU (3966). It is also a case where AMD has a variable rate. This looks like a single-threaded test.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-80.png\" alt=\"\" class=\"wp-image-1060\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-80.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-80-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-80-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>The topdown profile shows high backend stalls, both CPU and memory. It also suggests multiple rounds of tests to find a stable point. Frontend stalls are among the smallest of all benchmarks.  
Perhaps interesting to see factors here like code\/data footprint sizes?<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-117.png\" alt=\"\" class=\"wp-image-1061\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-117.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-117-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-117-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>AMD metrics show a floating point program with some L2 access and misses.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              562.793\non_cpu               0.056          # 0.90 \/ 16 cores\nutime                506.784\nstime                1.067\nnvcsw                2113           # 51.46%\nnivcsw               1993           # 48.54%\ninblock              0              # 0.00\/sec\nonblock              12800          # 22.74\/sec\ncpu-clock            507956077142   # 507.956 seconds\ntask-clock           507965279796   # 507.965 seconds\npage faults          240790         # 474.028\/sec\ncontext switches     6723           # 13.235\/sec\ncpu migrations       325            # 0.640\/sec\nmajor page faults    2              # 0.004\/sec\nminor page faults    240788         # 474.025\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             78486493682    # 20.748 branches per 1000 inst\nbranch misses        624071097      # 0.80% branch miss\nconditional          77692509960    # 20.538 conditional branches per 1000 inst\nindirect             47753355       # 0.013 indirect branches per 1000 inst\ncpu-cycles           2940577091356  # 0.26 GHz\ninstructions      
   4771889380353  # 1.62 IPC\nslots                5884521661530  #\nretiring             1586483683880  # 27.0% (27.0%)\n-- ucode             14146366       #     0.0%\n-- fastpath          1586469537514  #    27.0%\nfrontend             55781933015    #  0.9% ( 0.9%) low\n-- latency           31688469264    #     0.5%\n-- bandwidth         24093463751    #     0.4%\nbackend              4231018348818  # 71.9% (71.9%) high\n-- cpu               2277772735762  #    38.7%\n-- memory            1953245613056  #    33.2%\nspeculation          11018268385    #  0.2% ( 0.2%) low\n-- branch mispredict 10153156053    #     0.2%\n-- pipeline restart  865112332      #     0.0%\nsmt-contention       218961520      #  0.0% ( 0.0%)\ncpu-cycles           2938472632454  # 0.26 GHz\ninstructions         4629990328053  # 1.58 IPC\ninstructions         1544094148954  # 88.762 l2 access per 1000 inst\nl2 hit from l1       80167148026    # 20.63% l2 miss\nl2 miss from l1      1331488191     #\nl2 hit from l2 pf    29950113188    #\nl3 hit from l2 pf    13061396001    #\nl3 miss from l2 pf   13878019099    #\ninstructions         1543498098310  # 184.685 float per 1000 inst\nfloat 512            69             # 0.000 AVX-512 per 1000 inst\nfloat 256            518            # 0.000 AVX-256 per 1000 inst\nfloat 128            285061640703   # 184.685 AVX-128 per 1000 inst\nfloat MMX            0              # 0.000 MMX per 1000 inst\nfloat scalar         0              # 0.000 scalar per 1000 inst\n<\/code><\/pre>\n\n\n\n<p>Intel metrics<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              168.565\non_cpu               0.057          # 0.92 \/ 16 cores\nutime                154.165\nstime                0.438\nnvcsw                1887           # 74.50%\nnivcsw               646            # 25.50%\ninblock              0              # 0.00\/sec\nonblock              1312           # 7.78\/sec\ncpu-clock            154630765914   # 154.631 seconds\ntask-clock        
   154634206479   # 154.634 seconds\npage faults          159168         # 1029.319\/sec\ncontext switches     3205           # 20.726\/sec\ncpu migrations       321            # 2.076\/sec\nmajor page faults    0              # 0.000\/sec\nminor page faults    159168         # 1029.319\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             35913794453    # 21.089 branches per 1000 inst\nbranch misses        88509738       # 0.25% branch miss\nconditional          35913806389    # 21.089 conditional branches per 1000 inst\nindirect             59414046       # 0.035 indirect branches per 1000 inst\nslots                3725864821436  #\nretiring             2390025479822  # 64.1% (64.1%) high\n-- ucode             620664419824   #    16.7%\n-- fastpath          1769361059998  #    47.5%\nfrontend             553782693331   # 14.9% (14.9%)\n-- latency           17755344084    #     0.5%\n-- bandwidth         536027349247   #    14.4%\nbackend              769551179018   # 20.7% (20.7%)\n-- cpu               500685850178   #    13.4%\n-- memory            268865328840   #     7.2%\nspeculation          17170761530    #  0.5% ( 0.5%) low\n-- branch mispredict 12411755919    #     0.3%\n-- pipeline restart  4759005611     #     0.1%\nsmt-contention       0              #  0.0% ( 0.0%)\ncpu-cycles           648805684715   # 0.22 GHz\ninstructions         1890731557963  # 2.91 IPC\nl2 access            101309940231   # 53.657 l2 access per 1000 inst\nl2 miss              59683326414    # 58.91% l2 miss\n<\/code><\/pre>\n\n\n\n<p>Process overview suggests computation in the himenobmtxpa application<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>372 processes\n\t 12 himenobmtxpa           506.42     0.11\n\t 68 clinfo                  16.48     6.32\n\t 38 vulkaninfo               0.76     1.53\n\t  6 glxinfo:gdrv0            0.11     0.07\n\t  6 glxinfo:gl0              0.11     0.06\n\t  4 
vulkani:disk$0           0.08     0.17\n\t  6 php                      0.07     0.15\n\t  2 glxinfo                  0.05     0.04\n\t  2 glxinfo:cs0              0.05     0.04\n\t  2 glxinfo:disk$0           0.05     0.03\n\t  2 glxinfo:sh0              0.05     0.03\n\t  2 glxinfo:shlo0            0.05     0.03\n\t  2 llvmpipe-0               0.04     0.09\n\t  2 llvmpipe-1               0.04     0.09\n\t  2 llvmpipe-10              0.04     0.09\n\t  2 llvmpipe-11              0.04     0.09\n\t  2 llvmpipe-12              0.04     0.09\n\t  2 llvmpipe-13              0.04     0.09\n\t  2 llvmpipe-14              0.04     0.09\n\t  2 llvmpipe-15              0.04     0.09\n\t  2 llvmpipe-2               0.04     0.09\n\t  2 llvmpipe-3               0.04     0.09\n\t  2 llvmpipe-5               0.04     0.09\n\t  2 llvmpipe-6               0.04     0.09\n\t  2 llvmpipe-7               0.04     0.09\n\t  2 llvmpipe-8               0.04     0.09\n\t  2 llvmpipe-9               0.04     0.09\n\t  2 llvmpipe-4               0.04     0.08\n\t  6 clang                    0.03     0.09\n\t  3 rocminfo                 0.00     0.03\n\t  1 lspci                    0.00     0.02\n\t  1 ps                       0.00     0.01\n\t 82 sh                       0.00     0.00\n\t 13 gcc                      0.00     0.00\n\t 12 gsettings                0.00     0.00\n\t 12 himeno                   0.00     0.00\n\t  8 stat                     0.00     0.00\n\t  8 systemd-detect-          0.00     0.00\n\t  6 llvm-link                0.00     0.00\n\t  5 phoronix-test-s          0.00     0.00\n\t  2 cc                       0.00     0.00\n\t  2 dconf worker             0.00     0.00\n\t  2 gmain                    0.00     0.00\n\t  2 lscpu                    0.00     0.00\n\t  2 uname                    0.00     0.00\n\t  2 which                    0.00     0.00\n\t  2 xset                     0.00     0.00\n\t  1 date                     0.00     0.00\n\t  1 dirname               
   0.00     0.00\n\t  1 dmesg                    0.00     0.00\n\t  1 dmidecode                0.00     0.00\n\t  1 grep                     0.00     0.00\n\t  1 ifconfig                 0.00     0.00\n\t  1 ip                       0.00     0.00\n\t  1 lsmod                    0.00     0.00\n\t  1 mktemp                   0.00     0.00\n\t  1 qdbus                    0.00     0.00\n\t  1 readlink                 0.00     0.00\n\t  1 realpath                 0.00     0.00\n\t  1 sed                      0.00     0.00\n\t  1 sort                     0.00     0.00\n\t  1 stty                     0.00     0.00\n\t  1 systemctl                0.00     0.00\n\t  1 template.sh              0.00     0.00\n\t  1 wc                       0.00     0.00\n\t  1 xrandr                   0.00     0.00\n0 processes running\n47 maximum processes\n<\/code><\/pre>\n\n\n\n<p>Code sizes for the application are small<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>mev@augusta:~\/.phoronix-test-suite\/installed-tests\/pts\/himeno-1.3.0$ size himenobmtxpa\n   text\t   data\t    bss\t    dec\t    hex\tfilename\n   8399\t    676\t    240\t   9315\t   2463\thimenobmtxpa<\/code><\/pre>\n\n\n\n<p>Essentially repeated calls to the following single Jacobi function. 
Probably also a good case for -O3 vectorization options.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>float\njacobi(int nn, Matrix* a,Matrix* b,Matrix* c,\n       Matrix* p,Matrix* bnd,Matrix* wrk1,Matrix* wrk2)\n{\n  int    i,j,k,n,imax,jmax,kmax;\n  float  gosa,s0,ss;\n\n  imax= p->mrows-1;\n  jmax= p->mcols-1;\n  kmax= p->mdeps-1;\n\n  for(n=0 ; n&lt;nn ; n++){\n    gosa = 0.0;\n\n\n    for(i=1 ; i&lt;imax; i++)\n      for(j=1 ; j&lt;jmax ; j++)\n        for(k=1 ; k&lt;kmax ; k++){\n            s0= a->m&#91;0]&#91;i]&#91;j]&#91;k]*p->m&#91;0]&#91;i+1]&#91;j]&#91;k]\n            + a->m&#91;1]&#91;i]&#91;j]&#91;k]*p->m&#91;0]&#91;i]&#91;j+1]&#91;k]\n            + a->m&#91;2]&#91;i]&#91;j]&#91;k]*p->m&#91;0]&#91;i]&#91;j]&#91;k+1]\n            + b->m&#91;0]&#91;i]&#91;j]&#91;k]\n             *( p->m&#91;0]&#91;i+1]&#91;j+1]&#91;k] - p->m&#91;0]&#91;i+1]&#91;j-1]&#91;k]\n              - p->m&#91;0]&#91;i-1]&#91;j+1]&#91;k] + p->m&#91;0]&#91;i-1]&#91;j-1]&#91;k] )\n            + b->m&#91;1]&#91;i]&#91;j]&#91;k]\n             *( p->m&#91;0]&#91;i]&#91;j+1]&#91;k+1] - p->m&#91;0]&#91;i]&#91;j-1]&#91;k+1]\n              - p->m&#91;0]&#91;i]&#91;j+1]&#91;k-1] + p->m&#91;0]&#91;i]&#91;j-1]&#91;k-1] )\n            + b->m&#91;2]&#91;i]&#91;j]&#91;k]\n             *( p->m&#91;0]&#91;i+1]&#91;j]&#91;k+1] - p->m&#91;0]&#91;i-1]&#91;j]&#91;k+1]\n              - p->m&#91;0]&#91;i+1]&#91;j]&#91;k-1] + p->m&#91;0]&#91;i-1]&#91;j]&#91;k-1] )\n            + c->m&#91;0]&#91;i]&#91;j]&#91;k] * p->m&#91;0]&#91;i-1]&#91;j]&#91; k]\n            + c->m&#91;1]&#91;i]&#91;j]&#91;k] * p->m&#91;0]&#91;i]&#91;j-1]&#91;k]\n            + c->m&#91;2]&#91;i]&#91;j]&#91;k] * p->m&#91;0]&#91;i]&#91;j]&#91;k-1]\n            + wrk1->m&#91;0]&#91;i]&#91;j]&#91;k];\n\n          ss= (s0*a->m&#91;3]&#91;i]&#91;j]&#91;k] - p->m&#91;0]&#91;i]&#91;j]&#91;k])*bnd->m&#91;0]&#91;i]&#91;j]&#91;k];\n\n\n          gosa+= ss*ss;\n          wrk2->m&#91;0]&#91;i]&#91;j]&#91;k]= p->m&#91;0]&#91;i]&#91;j]&#91;k] + 
omega*ss;\n        }\n\n    for(i=1 ; i&lt;imax ; i++)\n      for(j=1 ; j&lt;jmax ; j++)\n        for(k=1 ; k&lt;kmax ; k++)\n          p->m&#91;0]&#91;i]&#91;j]&#91;k]= wrk2->m&#91;0]&#91;i]&#91;j]&#91;k];\n    \n  } \/* end n loop *\/\n\n  return(gosa);\n}\n<\/code><\/pre>\n\n\n\n<p>Computation is repeated runs of this application<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>      1175981) himeno           cpu=14 start=8.11  finish=49.58\n        1175982) himenobmtxpa     cpu=7 start=8.11  finish=49.58\n      1175985) himeno           cpu=14 start=53.59 finish=95.91\n        1175986) himenobmtxpa     cpu=15 start=53.59 finish=95.91\n      1175987) himeno           cpu=6 start=99.91 finish=142.20\n        1175988) himenobmtxpa     cpu=15 start=99.92 finish=142.20\n      1175989) himeno           cpu=14 start=146.21 finish=188.63\n        1175990) himenobmtxpa     cpu=7 start=146.21 finish=188.63\n      1175992) himeno           cpu=6 start=192.63 finish=235.00\n        1175993) himenobmtxpa     cpu=15 start=192.63 finish=235.00\n      1175994) himeno           cpu=6 start=239.01 finish=281.36\n        1175995) himenobmtxpa     cpu=15 start=239.01 finish=281.35\n      1175998) himeno           cpu=6 start=285.36 finish=327.68\n        1175999) himenobmtxpa     cpu=15 start=285.36 finish=327.68\n      1176001) himeno           cpu=11 start=331.68 finish=373.93\n        1176002) himenobmtxpa     cpu=12 start=331.69 finish=373.93\n      1176003) himeno           cpu=3 start=377.94 finish=420.20\n        1176004) himenobmtxpa     cpu=12 start=377.94 finish=420.20\n      1176005) himeno           cpu=11 start=424.21 finish=466.42\n        1176006) himenobmtxpa     cpu=4 start=424.21 finish=466.42\n      1176010) himeno           cpu=11 start=470.43 finish=512.50\n        1176011) himenobmtxpa     cpu=4 start=470.43 finish=512.50\n      1176020) himeno           cpu=11 start=516.50 finish=558.76\n        1176021) himenobmtxpa     cpu=12 start=516.51 
finish=558.76\n<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Linear solver for the pressure Poisson equation using a point-Jacobi method. This is one of the few tests where the Intel CPU performs substantially better (6046) than the AMD CPU (3966). It is also a case where AMD has a variable rate. This looks like <span class=\"excerpt-dots\">&hellip;<\/span> <a class=\"more-link\" href=\"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/himeno\/\"><span class=\"more-msg\">Continue reading &rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":58,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-1023","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1023","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/comments?post=1023"}],"version-history":[{"count":3,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1023\/revisions"}],"predecessor-version":[{"id":1064,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1023\/revisions\/1064"}],"up":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/58"}],"wp:attachment":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/media?parent=1023"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}