hpcg – Performance analysis, tools and experiments

High performance conjugate gradient benchmark. – link – used to measure super computers. There are multiple sizes from 104x104x104 to 192x192x192 and two runtimes from 60 to 1800 where 1800 is what is used to officially submit scores. I picked the 144x144x144 size. It reports a metric in GFLOP/s. Overall profile runs consistently on eight threads, suggesting a MPI run.

Topdown profile shows mostly backend bound with a low retirement rate.

AMD metrics show ~150 floating point instructions per 1000, and confirm the low retirement, high backend memory stalls and low speculation.

elapsed              875.134
on_cpu               0.488          # 7.81 / 16 cores
utime                6797.617
stime                41.324
nvcsw                35941          # 66.67%
nivcsw               17964          # 33.33%
inblock              6880           # 7.86/sec
onblock              62312          # 71.20/sec
cpu-clock            6840078255450  # 6840.078 seconds
task-clock           6840172325218  # 6840.172 seconds
page faults          16133100       # 2358.581/sec
context switches     58112          # 8.496/sec
cpu migrations       9711           # 1.420/sec
major page faults    331            # 0.048/sec
minor page faults    16132769       # 2358.533/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             2634432109888  # 143.289 branches per 1000 inst
branch misses        3672582283     # 0.14% branch miss
conditional          2568043704731  # 139.678 conditional branches per 1000 inst
indirect             6032629780     # 0.328 indirect branches per 1000 inst
cpu-cycles           30968079590724 # 2.21 GHz
instructions         18395333044799 # 0.59 IPC low
slots                61921102171122 #
retiring             6078821159224  #  9.8% ( 9.8%) low
-- ucode             839491495      #     0.0%
-- fastpath          6077981667729  #     9.8%
frontend             4307346648297  #  7.0% ( 7.0%)
-- latency           1798699792212  #     2.9%
-- bandwidth         2508646856085  #     4.1%
backend              51459872447555 # 83.1% (83.2%) high
-- cpu               8352863384115  #    13.5%
-- memory            43107009063440 #    69.6%
speculation          26131829493    #  0.0% ( 0.0%) low
-- branch mispredict 24835216381    #     0.0%
-- pipeline restart  1296613112     #     0.0%
smt-contention       48912860156    #  0.1% ( 0.0%)
cpu-cycles           30958977523557 # 2.21 GHz
instructions         18352926259460 # 0.59 IPC low
instructions         6114645283233  # 65.650 l2 access per 1000 inst
l2 hit from l1       224097131601   # 42.88% l2 miss
l2 miss from l1      5729334541     #
l2 hit from l2 pf    10927118257    #
l3 hit from l2 pf    778865718      #
l3 miss from l2 pf   165624449263   #
instructions         6111909121253  # 145.740 float per 1000 inst
float 512            49             # 0.000 AVX-512 per 1000 inst
float 256            772            # 0.000 AVX-256 per 1000 inst
float 128            890750308112   # 145.740 AVX-128 per 1000 inst
float MMX            0              # 0.000 MMX per 1000 inst
float scalar         45             # 0.000 scalar per 1000 inst
instructions         2664431        #
opcache              983447         # 369.102 opcache per 1000 inst
opcache miss         528168         # 53.7% opcache miss rate
l1 dTLB miss         5740           # 2.154 L1 dTLB per 1000 inst
l2 dTLB miss         1194           # 0.448 L2 dTLB per 1000 inst
instructions         2681848        #
icache               1285641        # 479.386 icache per 1000 inst
icache miss          107266         #  8.3% icache miss rate
l1 iTLB miss         15             # 0.006 L1 iTLB per 1000 inst
l2 iTLB miss         0              # 0.000 L2 iTLB per 1000 inst
tlb flush            20             # 0.007 TLB flush per 1000 inst

Process overview shows xhpcg taking almost all the time.

447 processes
	 72 xhpcg                20371.80   101.43
	 68 clinfo                  17.54     5.97
	 18 mpirun                   0.93     2.65
	 38 vulkaninfo               0.91     1.34
	  6 php                      0.11     1.11
	  4 vulkani:disk$0           0.09     0.15
	  6 glxinfo:gdrv0            0.08     0.10
	  6 glxinfo:gl0              0.08     0.09
	  6 clang                    0.06     0.06
	  2 llvmpipe-0               0.05     0.07
	  2 llvmpipe-1               0.05     0.07
	  2 llvmpipe-10              0.05     0.07
	  2 llvmpipe-11              0.05     0.07
	  2 llvmpipe-12              0.05     0.07
	  2 llvmpipe-13              0.05     0.07
	  2 llvmpipe-14              0.05     0.07
	  2 llvmpipe-15              0.05     0.07
	  2 llvmpipe-2               0.05     0.07
	  2 llvmpipe-3               0.05     0.07
	  2 llvmpipe-4               0.05     0.07
	  2 llvmpipe-5               0.05     0.07
	  2 llvmpipe-6               0.05     0.07
	  2 llvmpipe-7               0.05     0.07
	  2 llvmpipe-8               0.05     0.07
	  2 llvmpipe-9               0.05     0.07
	  2 glxinfo                  0.04     0.04
	  2 glxinfo:cs0              0.04     0.04
	  2 glxinfo:disk$0           0.04     0.04
	  2 glxinfo:sh0              0.04     0.04
	  2 glxinfo:shlo0            0.04     0.04
	  3 rocminfo                 0.03     0.00
	  1 lspci                    0.00     0.01
	  1 ps                       0.00     0.01
	 82 sh                       0.00     0.00
	 13 gcc                      0.00     0.00
	  8 stat                     0.00     0.00
	  8 systemd-detect-          0.00     0.00
	  7 gsettings                0.00     0.00
	  6 llvm-link                0.00     0.00
	  5 gmain                    0.00     0.00
	  5 phoronix-test-s          0.00     0.00
	  4 dconf worker             0.00     0.00
	  3 cat                      0.00     0.00
	  3 hpcg                     0.00     0.00
	  3 rm                       0.00     0.00
	  2 cc                       0.00     0.00
	  2 lscpu                    0.00     0.00
	  2 uname                    0.00     0.00
	  2 which                    0.00     0.00
	  2 xset                     0.00     0.00
	  1 date                     0.00     0.00
	  1 dirname                  0.00     0.00
	  1 dmesg                    0.00     0.00
	  1 dmidecode                0.00     0.00
	  1 grep                     0.00     0.00
	  1 ifconfig                 0.00     0.00
	  1 ip                       0.00     0.00
	  1 lsmod                    0.00     0.00
	  1 mktemp                   0.00     0.00
	  1 qdbus                    0.00     0.00
	  1 readlink                 0.00     0.00
	  1 realpath                 0.00     0.00
	  1 sed                      0.00     0.00
	  1 sort                     0.00     0.00
	  1 stty                     0.00     0.00
	  1 systemctl                0.00     0.00
	  1 template.sh              0.00     0.00
	  1 wc                       0.00     0.00
	  1 xrandr                   0.00     0.00
0 processes running
47 maximum processes

Computation blocks confirm this is run with MPI

      402268) hpcg             cpu=4 start=6.71  finish=292.31
        402269) rm               cpu=6 start=6.71  finish=6.72 
        402270) mpirun           cpu=9 start=6.72  finish=292.28
          402274) mpirun           cpu=5 start=7.30  finish=292.28
          402275) mpirun           cpu=14 start=7.30  finish=7.30 
          402276) mpirun           cpu=15 start=7.32  finish=292.27
          402278) mpirun           cpu=10 start=7.80  finish=292.27
          402279) mpirun           cpu=3 start=7.80  finish=292.27
          402280) xhpcg            cpu=6 start=7.83  finish=292.06
            402282) xhpcg            cpu=6 start=7.84  finish=292.00
            402285) xhpcg            cpu=11 start=7.85  finish=291.99
          402281) xhpcg            cpu=12 start=7.84  finish=292.05
            402284) xhpcg            cpu=6 start=7.85  finish=291.99
            402288) xhpcg            cpu=7 start=7.85  finish=291.99
          402283) xhpcg            cpu=2 start=7.84  finish=292.06
            402287) xhpcg            cpu=4 start=7.85  finish=291.99
            402290) xhpcg            cpu=14 start=7.86  finish=291.99
          402286) xhpcg            cpu=15 start=7.85  finish=292.04
            402291) xhpcg            cpu=4 start=7.86  finish=291.99
            402294) xhpcg            cpu=13 start=7.86  finish=291.99
          402289) xhpcg            cpu=1 start=7.85  finish=292.04
            402293) xhpcg            cpu=6 start=7.86  finish=291.99
            402298) xhpcg            cpu=8 start=7.87  finish=291.99
          402292) xhpcg            cpu=3 start=7.86  finish=292.05
            402296) xhpcg            cpu=6 start=7.87  finish=291.99
            402300) xhpcg            cpu=4 start=7.87  finish=291.99
          402295) xhpcg            cpu=5 start=7.86  finish=292.05
            402299) xhpcg            cpu=11 start=7.87  finish=291.99
            402302) xhpcg            cpu=9 start=7.88  finish=291.99
          402297) xhpcg            cpu=8 start=7.87  finish=292.05
            402301) xhpcg            cpu=11 start=7.87  finish=291.99
            402303) xhpcg            cpu=2 start=7.88  finish=291.99
        402308) cat              cpu=7 start=292.30 finish=292.31