himeno – Performance analysis, tools and experiments

Linear solver of pressure Poisson using a point-Jacobi method. This is one of fed tests where Intel CPU performs substantially better (6046) than AMD CPU (3966). It is also a case where AMD has a variable rate. This looks like a single-threaded test.

Topdown profile shows high backend stalls, both CPU and memory. It also suggests multiple rounds of tests to find a stable point. Frontend stalls are some of the smallest of all benchmarks. Perhaps interesting to see factors here like code/data footprint sizes?

AMD metrics show a floating point program with some L2 access and misses.

elapsed              562.793
on_cpu               0.056          # 0.90 / 16 cores
utime                506.784
stime                1.067
nvcsw                2113           # 51.46%
nivcsw               1993           # 48.54%
inblock              0              # 0.00/sec
onblock              12800          # 22.74/sec
cpu-clock            507956077142   # 507.956 seconds
task-clock           507965279796   # 507.965 seconds
page faults          240790         # 474.028/sec
context switches     6723           # 13.235/sec
cpu migrations       325            # 0.640/sec
major page faults    2              # 0.004/sec
minor page faults    240788         # 474.025/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             78486493682    # 20.748 branches per 1000 inst
branch misses        624071097      # 0.80% branch miss
conditional          77692509960    # 20.538 conditional branches per 1000 inst
indirect             47753355       # 0.013 indirect branches per 1000 inst
cpu-cycles           2940577091356  # 0.26 GHz
instructions         4771889380353  # 1.62 IPC
slots                5884521661530  #
retiring             1586483683880  # 27.0% (27.0%)
-- ucode             14146366       #     0.0%
-- fastpath          1586469537514  #    27.0%
frontend             55781933015    #  0.9% ( 0.9%) low
-- latency           31688469264    #     0.5%
-- bandwidth         24093463751    #     0.4%
backend              4231018348818  # 71.9% (71.9%) high
-- cpu               2277772735762  #    38.7%
-- memory            1953245613056  #    33.2%
speculation          11018268385    #  0.2% ( 0.2%) low
-- branch mispredict 10153156053    #     0.2%
-- pipeline restart  865112332      #     0.0%
smt-contention       218961520      #  0.0% ( 0.0%)
cpu-cycles           2938472632454  # 0.26 GHz
instructions         4629990328053  # 1.58 IPC
instructions         1544094148954  # 88.762 l2 access per 1000 inst
l2 hit from l1       80167148026    # 20.63% l2 miss
l2 miss from l1      1331488191     #
l2 hit from l2 pf    29950113188    #
l3 hit from l2 pf    13061396001    #
l3 miss from l2 pf   13878019099    #
instructions         1543498098310  # 184.685 float per 1000 inst
float 512            69             # 0.000 AVX-512 per 1000 inst
float 256            518            # 0.000 AVX-256 per 1000 inst
float 128            285061640703   # 184.685 AVX-128 per 1000 inst
float MMX            0              # 0.000 MMX per 1000 inst
float scalar         0              # 0.000 scalar per 1000 inst

Intel metrics

elapsed              168.565
on_cpu               0.057          # 0.92 / 16 cores
utime                154.165
stime                0.438
nvcsw                1887           # 74.50%
nivcsw               646            # 25.50%
inblock              0              # 0.00/sec
onblock              1312           # 7.78/sec
cpu-clock            154630765914   # 154.631 seconds
task-clock           154634206479   # 154.634 seconds
page faults          159168         # 1029.319/sec
context switches     3205           # 20.726/sec
cpu migrations       321            # 2.076/sec
major page faults    0              # 0.000/sec
minor page faults    159168         # 1029.319/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             35913794453    # 21.089 branches per 1000 inst
branch misses        88509738       # 0.25% branch miss
conditional          35913806389    # 21.089 conditional branches per 1000 inst
indirect             59414046       # 0.035 indirect branches per 1000 inst
slots                3725864821436  #
retiring             2390025479822  # 64.1% (64.1%) high
-- ucode             620664419824   #    16.7%
-- fastpath          1769361059998  #    47.5%
frontend             553782693331   # 14.9% (14.9%)
-- latency           17755344084    #     0.5%
-- bandwidth         536027349247   #    14.4%
backend              769551179018   # 20.7% (20.7%)
-- cpu               500685850178   #    13.4%
-- memory            268865328840   #     7.2%
speculation          17170761530    #  0.5% ( 0.5%) low
-- branch mispredict 12411755919    #     0.3%
-- pipeline restart  4759005611     #     0.1%
smt-contention       0              #  0.0% ( 0.0%)
cpu-cycles           648805684715   # 0.22 GHz
instructions         1890731557963  # 2.91 IPC
l2 access            101309940231   # 53.657 l2 access per 1000 inst
l2 miss              59683326414    # 58.91% l2 miss

Process overview suggests computation in the himenobmtxpa application

372 processes
	 12 himenobmtxpa           506.42     0.11
	 68 clinfo                  16.48     6.32
	 38 vulkaninfo               0.76     1.53
	  6 glxinfo:gdrv0            0.11     0.07
	  6 glxinfo:gl0              0.11     0.06
	  4 vulkani:disk$0           0.08     0.17
	  6 php                      0.07     0.15
	  2 glxinfo                  0.05     0.04
	  2 glxinfo:cs0              0.05     0.04
	  2 glxinfo:disk$0           0.05     0.03
	  2 glxinfo:sh0              0.05     0.03
	  2 glxinfo:shlo0            0.05     0.03
	  2 llvmpipe-0               0.04     0.09
	  2 llvmpipe-1               0.04     0.09
	  2 llvmpipe-10              0.04     0.09
	  2 llvmpipe-11              0.04     0.09
	  2 llvmpipe-12              0.04     0.09
	  2 llvmpipe-13              0.04     0.09
	  2 llvmpipe-14              0.04     0.09
	  2 llvmpipe-15              0.04     0.09
	  2 llvmpipe-2               0.04     0.09
	  2 llvmpipe-3               0.04     0.09
	  2 llvmpipe-5               0.04     0.09
	  2 llvmpipe-6               0.04     0.09
	  2 llvmpipe-7               0.04     0.09
	  2 llvmpipe-8               0.04     0.09
	  2 llvmpipe-9               0.04     0.09
	  2 llvmpipe-4               0.04     0.08
	  6 clang                    0.03     0.09
	  3 rocminfo                 0.00     0.03
	  1 lspci                    0.00     0.02
	  1 ps                       0.00     0.01
	 82 sh                       0.00     0.00
	 13 gcc                      0.00     0.00
	 12 gsettings                0.00     0.00
	 12 himeno                   0.00     0.00
	  8 stat                     0.00     0.00
	  8 systemd-detect-          0.00     0.00
	  6 llvm-link                0.00     0.00
	  5 phoronix-test-s          0.00     0.00
	  2 cc                       0.00     0.00
	  2 dconf worker             0.00     0.00
	  2 gmain                    0.00     0.00
	  2 lscpu                    0.00     0.00
	  2 uname                    0.00     0.00
	  2 which                    0.00     0.00
	  2 xset                     0.00     0.00
	  1 date                     0.00     0.00
	  1 dirname                  0.00     0.00
	  1 dmesg                    0.00     0.00
	  1 dmidecode                0.00     0.00
	  1 grep                     0.00     0.00
	  1 ifconfig                 0.00     0.00
	  1 ip                       0.00     0.00
	  1 lsmod                    0.00     0.00
	  1 mktemp                   0.00     0.00
	  1 qdbus                    0.00     0.00
	  1 readlink                 0.00     0.00
	  1 realpath                 0.00     0.00
	  1 sed                      0.00     0.00
	  1 sort                     0.00     0.00
	  1 stty                     0.00     0.00
	  1 systemctl                0.00     0.00
	  1 template.sh              0.00     0.00
	  1 wc                       0.00     0.00
	  1 xrandr                   0.00     0.00
0 processes running
47 maximum processes

Code sizes for the application are small

mev@augusta:~/.phoronix-test-suite/installed-tests/pts/himeno-1.3.0$ size himenobmtxpa
   text	   data	    bss	    dec	    hex	filename
   8399	    676	    240	   9315	   2463	himenobmtxpa

Essentially repeated calls to the following single Jacobi function. Probably also a good case for -O3 vectorization options.

float
jacobi(int nn, Matrix* a,Matrix* b,Matrix* c,
       Matrix* p,Matrix* bnd,Matrix* wrk1,Matrix* wrk2)
{
  int    i,j,k,n,imax,jmax,kmax;
  float  gosa,s0,ss;

  imax= p->mrows-1;
  jmax= p->mcols-1;
  kmax= p->mdeps-1;

  for(n=0 ; n<nn ; n++){
    gosa = 0.0;


    for(i=1 ; i<imax; i++)
      for(j=1 ; j<jmax ; j++)
        for(k=1 ; k<kmax ; k++){
            s0= a->m[0][i][j][k]*p->m[0][i+1][j][k]
            + a->m[1][i][j][k]*p->m[0][i][j+1][k]
            + a->m[2][i][j][k]*p->m[0][i][j][k+1]
            + b->m[0][i][j][k]
             *( p->m[0][i+1][j+1][k] - p->m[0][i+1][j-1][k]
              - p->m[0][i-1][j+1][k] + p->m[0][i-1][j-1][k] )
            + b->m[1][i][j][k]
             *( p->m[0][i][j+1][k+1] - p->m[0][i][j-1][k+1]
              - p->m[0][i][j+1][k-1] + p->m[0][i][j-1][k-1] )
            + b->m[2][i][j][k]
             *( p->m[0][i+1][j][k+1] - p->m[0][i-1][j][k+1]
              - p->m[0][i+1][j][k-1] + p->m[0][i-1][j][k-1] )
            + c->m[0][i][j][k] * p->m[0][i-1][j][ k]
            + c->m[1][i][j][k] * p->m[0][i][j-1][k]
            + c->m[2][i][j][k] * p->m[0][i][j][k-1]
            + wrk1->m[0][i][j][k];

          ss= (s0*a->m[3][i][j][k] - p->m[0][i][j][k])*bnd->m[0][i][j][k];


          gosa+= ss*ss;
          wrk2->m[0][i][j][k]= p->m[0][i][j][k] + omega*ss;
        }

    for(i=1 ; i<imax ; i++)
      for(j=1 ; j<jmax ; j++)
        for(k=1 ; k<kmax ; k++)
          p->m[0][i][j][k]= wrk2->m[0][i][j][k];
    
  } /* end n loop */

  return(gosa);
}

Computation is repeated runs of this application

      1175981) himeno           cpu=14 start=8.11  finish=49.58
        1175982) himenobmtxpa     cpu=7 start=8.11  finish=49.58
      1175985) himeno           cpu=14 start=53.59 finish=95.91
        1175986) himenobmtxpa     cpu=15 start=53.59 finish=95.91
      1175987) himeno           cpu=6 start=99.91 finish=142.20
        1175988) himenobmtxpa     cpu=15 start=99.92 finish=142.20
      1175989) himeno           cpu=14 start=146.21 finish=188.63
        1175990) himenobmtxpa     cpu=7 start=146.21 finish=188.63
      1175992) himeno           cpu=6 start=192.63 finish=235.00
        1175993) himenobmtxpa     cpu=15 start=192.63 finish=235.00
      1175994) himeno           cpu=6 start=239.01 finish=281.36
        1175995) himenobmtxpa     cpu=15 start=239.01 finish=281.35
      1175998) himeno           cpu=6 start=285.36 finish=327.68
        1175999) himenobmtxpa     cpu=15 start=285.36 finish=327.68
      1176001) himeno           cpu=11 start=331.68 finish=373.93
        1176002) himenobmtxpa     cpu=12 start=331.69 finish=373.93
      1176003) himeno           cpu=3 start=377.94 finish=420.20
        1176004) himenobmtxpa     cpu=12 start=377.94 finish=420.20
      1176005) himeno           cpu=11 start=424.21 finish=466.42
        1176006) himenobmtxpa     cpu=4 start=424.21 finish=466.42
      1176010) himeno           cpu=11 start=470.43 finish=512.50
        1176011) himenobmtxpa     cpu=4 start=470.43 finish=512.50
      1176020) himeno           cpu=11 start=516.50 finish=558.76
        1176021) himenobmtxpa     cpu=12 start=516.51 finish=558.76