Code for modeling electronic structure and materials. One benchmark is run and one result is reported. The code is multi-threaded, with a mix of one thread per hyperthreaded core and one per physical core.

The topdown profile shows a mix of backend stalls and occasional periods of higher retirement rates. Frontend stalls are low.

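AMD metrics confirm that backend stalls are split roughly evenly between memory (19.9%) and CPU (20.9%). The code has relatively few floating-point instructions (about 44 per 1000 instructions) and branches (about 63 per 1000 instructions).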

elapsed              2009.644
on_cpu               0.892          # 14.27 / 16 cores
utime                27794.710
stime                888.971
nvcsw                66275          # 16.07%
nivcsw               346047         # 83.93%
inblock              0              # 0.00/sec
onblock              705448         # 351.03/sec
cpu-clock            28689747750498 # 28689.748 seconds
task-clock           28690022795733 # 28690.023 seconds
page faults          81383819       # 2836.659/sec
context switches     422153         # 14.714/sec
cpu migrations       16702          # 0.582/sec
major page faults    458            # 0.016/sec
minor page faults    81383361       # 2836.643/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             11096495451105 # 63.130 branches per 1000 inst
branch misses        81279292087    # 0.73% branch miss
conditional          8071827699704  # 45.922 conditional branches per 1000 inst
indirect             385031131344   # 2.191 indirect branches per 1000 inst
cpu-cycles           108628710804268 # 3.38 GHz
instructions         175396936466455 # 1.61 IPC
slots                217224690829842 #
retiring             58560058869307 # 27.0% (36.9%)
-- ucode             59432676601    #     0.0%
-- fastpath          58500626192706 #    26.9%
frontend             11170785657522 #  5.1% ( 7.0%)
-- latency           6588015021012  #     3.0%
-- bandwidth         4582770636510  #     2.1%
backend              88710120335677 # 40.8% (55.9%)
-- cpu               45427624864433 #    20.9%
-- memory            43282495471244 #    19.9%
speculation          253387692929   #  0.1% ( 0.2%) low
-- branch mispredict 240323593293   #     0.1%
-- pipeline restart  13064099636    #     0.0%
smt-contention       58530240425718 # 26.9% ( 0.0%)
cpu-cycles           108712417550875 # 3.37 GHz
instructions         175866105021602 # 1.62 IPC
instructions         58623162544056 # 66.044 l2 access per 1000 inst
l2 hit from l1       2318656460357  # 17.18% l2 miss
l2 miss from l1      74091602930    #
l2 hit from l2 pf    961995367095   #
l3 hit from l2 pf    499044508637   #
l3 miss from l2 pf   92004877695    #
instructions         58611513941603 # 43.639 float per 1000 inst
float 512            73             # 0.000 AVX-512 per 1000 inst
float 256            666            # 0.000 AVX-256 per 1000 inst
float 128            2557768514987  # 43.639 AVX-128 per 1000 inst
float MMX            0              # 0.000 MMX per 1000 inst
float scalar         0              # 0.000 scalar per 1000 inst
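
The derived columns in the AMD dump can be reproduced from the raw counters. The sketch below is a minimal check in Python, not the profiling tool's own code. It assumes that on_cpu is (utime + stime) / (elapsed * cores) and that the parenthesized topdown percentages are shares of the slots that remain after subtracting smt-contention. Both assumptions are inferred from the arithmetic, and all variable names are illustrative.

```python
# Raw values copied from the AMD dump above.
elapsed = 2009.644                    # wall-clock seconds
utime, stime = 27794.710, 888.971     # user and system CPU seconds
cores = 16                            # from the "/ 16 cores" annotation

cycles = 108628710804268
instructions = 175396936466455

slots = 217224690829842
smt_contention = 58530240425718
topdown = {
    "retiring": 58560058869307,
    "frontend": 11170785657522,
    "backend": 88710120335677,
    "speculation": 253387692929,
}

# Assumed definition: CPU seconds consumed per wall-clock second, per core.
busy_cores = (utime + stime) / elapsed
on_cpu = busy_cores / cores
ipc = instructions / cycles

print(f"on_cpu {on_cpu:.3f}  ({busy_cores:.2f} / {cores} cores)")
print(f"IPC    {ipc:.2f}")

# Assumed definition: the second percentage excludes smt-contention slots.
usable = slots - smt_contention
for name, count in topdown.items():
    share_all = 100 * count / slots
    share_usable = 100 * count / usable
    print(f"{name:12s} {share_all:5.1f}% ({share_usable:5.1f}%)")
```

Under these assumptions the computed values agree with the annotations in the dump: on_cpu 0.892, IPC 1.61, and backend 40.8% (55.9%).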

Intel metrics

elapsed              6333.440
on_cpu               0.742          # 11.88 / 16 cores
utime                74750.330
stime                473.297
nvcsw                1313743        # 89.74%
nivcsw               150279         # 10.26%
inblock              30890000       # 4877.29/sec
onblock              694880         # 109.72/sec
cpu-clock            75222931189368 # 75222.931 seconds
task-clock           75223322994820 # 75223.323 seconds
page faults          85985637       # 1143.072/sec
context switches     1495425        # 19.880/sec
cpu migrations       54829          # 0.729/sec
major page faults    1233961        # 16.404/sec
minor page faults    84751671       # 1126.667/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             45503197495161 # 71.793 branches per 1000 inst
branch misses        90982505997    # 0.20% branch miss
conditional          45503197511129 # 71.793 conditional branches per 1000 inst
indirect             11993405533405 # 18.923 indirect branches per 1000 inst
slots                433838595012596 #
retiring             283286737116435 # 65.3% (65.3%) high
-- ucode             13222986405835 #     3.0%
-- fastpath          270063750710600 #    62.2%
frontend             19439778785847 #  4.5% ( 4.5%) low
-- latency           9361508360355  #     2.2%
-- bandwidth         10078270425492 #     2.3%
backend              112743375337721 # 26.0% (26.0%)
-- cpu               63018362871765 #    14.5%
-- memory            49725012465956 #    11.5%
speculation          12881927224817 #  3.0% ( 3.0%)
-- branch mispredict 12368006069230 #     2.9%
-- pipeline restart  513921155587   #     0.1%
smt-contention       0              #  0.0% ( 0.0%)
cpu-cycles           216187026514697 # 2.16 GHz
instructions         883987677201512 # 4.09 IPC high
l2 access            3937606297777  # 13.285 l2 access per 1000 inst
l2 miss              358227811293   # 9.10% l2 miss
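
The per-second rates in these dumps appear to use two different denominators. The sketch below checks this against the Intel numbers. It is an inference from the arithmetic rather than documented tool behavior: it assumes the block I/O rate is divided by elapsed wall-clock time, the page-fault rate is divided by task-clock (reported in nanoseconds), and the context-switch percentages are shares of nvcsw + nivcsw. Variable names are illustrative.

```python
# Raw values copied from the Intel dump above.
elapsed = 6333.440                    # wall-clock seconds
task_clock = 75223322994820 / 1e9     # nanoseconds -> CPU seconds

nvcsw, nivcsw = 1313743, 150279       # voluntary / involuntary context switches
inblock = 30890000                    # block input operations
major_faults = 1233961

# Share of context switches that were voluntary (for example, blocking on I/O).
voluntary_pct = 100 * nvcsw / (nvcsw + nivcsw)

# Assumed denominators: elapsed time for block I/O, task-clock for faults.
inblock_rate = inblock / elapsed
major_fault_rate = major_faults / task_clock

print(f"voluntary switches {voluntary_pct:.2f}%")
print(f"inblock            {inblock_rate:.2f}/sec (per elapsed second)")
print(f"major page faults  {major_fault_rate:.3f}/sec (per CPU second)")
```

With these denominators the results match the annotated values (89.74%, 4877.29/sec, 16.404/sec), so rates for different counters should not be compared directly without checking which time base they use.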

Process overview shows pw.x as the primary process.

498 processes
	120 pw.x                 138331.72  4422.93
	 68 clinfo                  16.54     5.99
	 18 mpirun                   1.04     2.27
	 38 vulkaninfo               0.95     1.33
	  6 php                      0.22     0.38
	  4 vulkani:disk$0           0.10     0.14
	  6 glxinfo:gdrv0            0.10     0.08
	  6 glxinfo:gl0              0.10     0.08
	  6 clang                    0.05     0.07
	  2 llvmpipe-0               0.05     0.07
	  2 llvmpipe-1               0.05     0.07
	  2 llvmpipe-10              0.05     0.07
	  2 llvmpipe-11              0.05     0.07
	  2 llvmpipe-12              0.05     0.07
	  2 llvmpipe-13              0.05     0.07
	  2 llvmpipe-14              0.05     0.07
	  2 llvmpipe-15              0.05     0.07
	  2 llvmpipe-2               0.05     0.07
	  2 llvmpipe-3               0.05     0.07
	  2 llvmpipe-4               0.05     0.07
	  2 llvmpipe-5               0.05     0.07
	  2 llvmpipe-6               0.05     0.07
	  2 llvmpipe-7               0.05     0.07
	  2 llvmpipe-8               0.05     0.07
	  2 llvmpipe-9               0.05     0.07
	  2 glxinfo                  0.04     0.04
	  2 glxinfo:cs0              0.04     0.04
	  2 glxinfo:disk$0           0.04     0.04
	  2 glxinfo:sh0              0.04     0.04
	  2 glxinfo:shlo0            0.04     0.04
	  3 rocminfo                 0.03     0.00
	  1 lspci                    0.00     0.02
	  1 ps                       0.00     0.01
	 82 sh                       0.00     0.00
	 14 gsettings                0.00     0.00
	 13 gcc                      0.00     0.00
	 10 sed                      0.00     0.00
	  8 stat                     0.00     0.00
	  8 systemd-detect-          0.00     0.00
	  6 llvm-link                0.00     0.00
	  5 phoronix-test-s          0.00     0.00
	  3 qe                       0.00     0.00
	  2 cc                       0.00     0.00
	  2 lscpu                    0.00     0.00
	  2 uname                    0.00     0.00
	  2 which                    0.00     0.00
	  2 xset                     0.00     0.00
	  1 date                     0.00     0.00
	  1 dconf worker             0.00     0.00
	  1 dirname                  0.00     0.00
	  1 dmesg                    0.00     0.00
	  1 dmidecode                0.00     0.00
	  1 gmain                    0.00     0.00
	  1 grep                     0.00     0.00
	  1 ifconfig                 0.00     0.00
	  1 ip                       0.00     0.00
	  1 lsmod                    0.00     0.00
	  1 mktemp                   0.00     0.00
	  1 qdbus                    0.00     0.00
	  1 readlink                 0.00     0.00
	  1 realpath                 0.00     0.00
	  1 sort                     0.00     0.00
	  1 stty                     0.00     0.00
	  1 systemctl                0.00     0.00
	  1 template.sh              0.00     0.00
	  1 wc                       0.00     0.00
	  1 xrandr                   0.00     0.00
0 processes running
55 maximum processes

A sample computation shows MPI being used: mpirun launches eight pw.x processes, each of which has four pw.x child tasks.

      44262) qe               cpu=15 start=5.68  finish=665.97
        44263) mpirun           cpu=0 start=5.69  finish=665.94
          44268) mpirun           cpu=10 start=6.27  finish=665.94
          44269) mpirun           cpu=3 start=6.27  finish=6.27 
          44270) mpirun           cpu=2 start=6.29  finish=665.93
          44271) mpirun           cpu=15 start=6.77  finish=665.93
          44272) mpirun           cpu=13 start=6.77  finish=665.94
          44273) pw.x             cpu=9 start=6.80  finish=665.92
            44275) pw.x             cpu=4 start=6.81  finish=665.92
            44277) pw.x             cpu=0 start=6.81  finish=665.92
            44282) pw.x             cpu=6 start=6.82  finish=665.92
            44307) pw.x             cpu=8 start=8.44  finish=665.92
          44274) pw.x             cpu=6 start=6.81  finish=665.92
            44278) pw.x             cpu=12 start=6.81  finish=665.92
            44280) pw.x             cpu=10 start=6.82  finish=665.92
            44286) pw.x             cpu=15 start=6.82  finish=665.92
            44308) pw.x             cpu=15 start=8.45  finish=665.92
          44276) pw.x             cpu=11 start=6.81  finish=665.92
            44281) pw.x             cpu=5 start=6.82  finish=665.92
            44284) pw.x             cpu=15 start=6.82  finish=665.92
            44290) pw.x             cpu=5 start=6.83  finish=665.92
            44310) pw.x             cpu=5 start=8.47  finish=665.92
          44279) pw.x             cpu=13 start=6.82  finish=665.92
            44285) pw.x             cpu=2 start=6.82  finish=665.92
            44288) pw.x             cpu=10 start=6.83  finish=665.92
            44294) pw.x             cpu=0 start=6.83  finish=665.92
            44305) pw.x             cpu=5 start=8.41  finish=665.92
          44283) pw.x             cpu=1 start=6.82  finish=665.92
            44289) pw.x             cpu=15 start=6.83  finish=665.92
            44292) pw.x             cpu=4 start=6.83  finish=665.92
            44298) pw.x             cpu=3 start=6.84  finish=665.92
            44312) pw.x             cpu=11 start=8.59  finish=665.92
          44287) pw.x             cpu=10 start=6.83  finish=665.92
            44293) pw.x             cpu=2 start=6.83  finish=665.92
            44297) pw.x             cpu=10 start=6.84  finish=665.92
            44301) pw.x             cpu=10 start=6.84  finish=665.92
            44309) pw.x             cpu=8 start=8.46  finish=665.92
          44291) pw.x             cpu=14 start=6.83  finish=665.92
            44296) pw.x             cpu=8 start=6.84  finish=665.92
            44299) pw.x             cpu=5 start=6.84  finish=665.92
            44303) pw.x             cpu=13 start=6.85  finish=665.92
            44306) pw.x             cpu=7 start=8.42  finish=665.92
          44295) pw.x             cpu=7 start=6.83  finish=665.92
            44300) pw.x             cpu=10 start=6.84  finish=665.92
            44302) pw.x             cpu=5 start=6.85  finish=665.92
            44304) pw.x             cpu=4 start=6.85  finish=665.92
            44311) pw.x             cpu=1 start=8.48  finish=665.92
        44359) sed              cpu=6 start=665.97 finish=665.97
        44360) sed              cpu=12 start=665.97 finish=665.97
        44361) sed              cpu=9 start=665.97 finish=665.97
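
The rank and child-task counts can be read off the tree mechanically. The following is a minimal sketch that parses the indented listing format shown above and counts, for each top-level pw.x entry, how many pw.x entries are nested directly beneath it. The function name `children_per_rank` and the regular expression are illustrative assumptions, and the sample is a two-rank excerpt of the listing rather than the whole tree.

```python
import re

# Two-rank excerpt copied from the listing above.
SAMPLE = """\
          44273) pw.x             cpu=9 start=6.80  finish=665.92
            44275) pw.x             cpu=4 start=6.81  finish=665.92
            44277) pw.x             cpu=0 start=6.81  finish=665.92
            44282) pw.x             cpu=6 start=6.82  finish=665.92
            44307) pw.x             cpu=8 start=8.44  finish=665.92
          44274) pw.x             cpu=6 start=6.81  finish=665.92
            44278) pw.x             cpu=12 start=6.81  finish=665.92
            44280) pw.x             cpu=10 start=6.82  finish=665.92
            44286) pw.x             cpu=15 start=6.82  finish=665.92
            44308) pw.x             cpu=15 start=8.45  finish=665.92
"""

# Leading indentation, numeric id, then the command name.
LINE = re.compile(r"^(\s*)(\d+)\)\s+(\S+)")


def children_per_rank(text, name="pw.x"):
    """Map each top-level `name` entry to its number of direct `name` children."""
    counts = {}
    stack = []  # (indent, pid, command) for the ancestors of the current line
    for line in text.splitlines():
        match = LINE.match(line)
        if not match:
            continue
        indent = len(match.group(1))
        pid = int(match.group(2))
        cmd = match.group(3)
        # Drop entries that are not ancestors of this line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        if cmd == name:
            if stack and stack[-1][2] == name:
                parent = stack[-1][1]
                counts[parent] = counts.get(parent, 0) + 1
            else:
                counts[pid] = 0
        stack.append((indent, pid, cmd))
    return counts


print(children_per_rank(SAMPLE))
```

Applied to the full listing, the same counting gives eight top-level pw.x entries with four children each, or 40 pw.x tasks for this run.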