LULESH stands for Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics. This benchmark runs very quickly. MPI appears to place its ranks only on the physical cores.

The topdown profile is sparse because the workload runs quickly. In aggregate, however, backend stalls predominate.

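The AMD metrics make the summary easier to see. On-CPU utilization is only about 30% (4.73 of 16 cores). Backend memory stalls are high (47.1% of slots) and backend CPU stalls also contribute (21.9%). Approximately 40% of the instructions are floating point, essentially all of them 128-bit operations.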

elapsed              48.970
on_cpu               0.296          # 4.73 / 16 cores
utime                191.535
stime                40.245
nvcsw                45784          # 96.73%
nivcsw               1548           # 3.27%
inblock              8              # 0.16/sec
onblock              62080          # 1267.73/sec
cpu-clock            231739627464   # 231.740 seconds
task-clock           231757527279   # 231.758 seconds
page faults          19718776       # 85083.649/sec
context switches     47385          # 204.459/sec
cpu migrations       1131           # 4.880/sec
major page faults    234            # 1.010/sec
minor page faults    19718542       # 85082.639/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             82537114616    # 75.499 branches per 1000 inst
branch misses        3072310653     # 3.72% branch miss
conditional          55678608704    # 50.931 conditional branches per 1000 inst
indirect             2686831025     # 2.458 indirect branches per 1000 inst
cpu-cycles           986477655776   # 1.27 GHz
instructions         1082464092791  # 1.10 IPC
slots                1974576003252  #
retiring             377954134774   # 19.1% (19.2%)
-- ucode             509902685      #     0.0%
-- fastpath          377444232089   #    19.1%
frontend             228199081339   # 11.6% (11.6%)
-- latency           170613588120   #     8.6%
-- bandwidth         57585493219    #     2.9%
backend              1361308062049  # 68.9% (69.0%)
-- cpu               432111990127   #    21.9%
-- memory            929196071922   #    47.1%
speculation          5124623997     #  0.3% ( 0.3%) low
-- branch mispredict 5037122371     #     0.3%
-- pipeline restart  87501626       #     0.0%
smt-contention       1988503188     #  0.1% ( 0.0%)
cpu-cycles           986280789225   # 1.27 GHz
instructions         1079037081656  # 1.09 IPC
instructions         360898263957   # 40.265 l2 access per 1000 inst
l2 hit from l1       9682858043     # 24.84% l2 miss
l2 miss from l1      712645490      #
l2 hit from l2 pf    1951552116     #
l3 hit from l2 pf    139447103      #
l3 miss from l2 pf   2757536174     #
instructions         361496500819   # 406.170 float per 1000 inst
float 512            76             # 0.000 AVX-512 per 1000 inst
float 256            690            # 0.000 AVX-256 per 1000 inst
float 128            146829108908   # 406.170 AVX-128 per 1000 inst
float MMX            0              # 0.000 MMX per 1000 inst
float scalar         0              # 0.000 scalar per 1000 inst
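
The summary figures can be re-derived from the raw counters. The sketch below assumes that on_cpu is (utime + stime) / elapsed / cores and that each topdown share is the category count divided by slots. Both formulas are inferred from the numbers in the AMD listing rather than taken from the tool's source, and the variable names are illustrative.

```python
# Values copied from the AMD listing above.
elapsed = 48.970   # wall-clock seconds
utime = 191.535    # user CPU seconds
stime = 40.245     # system CPU seconds
cores = 16

# Assumed formula: average number of busy cores, then the fraction of all cores.
busy_cores = (utime + stime) / elapsed
on_cpu = busy_cores / cores

slots = 1974576003252
topdown = {
    "retiring": 377954134774,
    "frontend": 228199081339,
    "backend": 1361308062049,
    "speculation": 5124623997,
}
# Assumed formula: share of pipeline slots attributed to each category.
shares = {name: 100.0 * count / slots for name, count in topdown.items()}

print(f"on_cpu {on_cpu:.3f}  # {busy_cores:.2f} / {cores} cores")
for name, pct in shares.items():
    print(f"{name:<12} {pct:5.1f}%")
```

Under these assumptions the script reproduces the reported 0.296 on_cpu and the 68.9% backend share.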

Intel metrics

elapsed              55.049
on_cpu               0.317          # 5.07 / 16 cores
utime                242.439
stime                36.925
nvcsw                83411          # 98.53%
nivcsw               1248           # 1.47%
inblock              519472         # 9436.55/sec
onblock              50664          # 920.34/sec
cpu-clock            279287193118   # 279.287 seconds
task-clock           279309002068   # 279.309 seconds
page faults          19700096       # 70531.547/sec
context switches     84722          # 303.327/sec
cpu migrations       1460           # 5.227/sec
major page faults    3526           # 12.624/sec
minor page faults    19696570       # 70518.923/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             268786231444   # 136.459 branches per 1000 inst
branch misses        58265385       # 0.02% branch miss
conditional          268786245332   # 136.459 conditional branches per 1000 inst
indirect             41617151216    # 21.128 indirect branches per 1000 inst
slots                15445604506322 #
retiring             7770478502776  # 50.3% (50.3%)
-- ucode             790445170397   #     5.1%
-- fastpath          6980033332379  #    45.2%
frontend             748657265289   #  4.8% ( 4.8%) low
-- latency           355686368439   #     2.3%
-- bandwidth         392970896850   #     2.5%
backend              6870999071370  # 44.5% (44.5%)
-- cpu               2408157938972  #    15.6%
-- memory            4462841132398  #    28.9%
speculation          137401946204   #  0.9% ( 0.9%) low
-- branch mispredict 71814841286    #     0.5%
-- pipeline restart  65587104918    #     0.4%
smt-contention       0              #  0.0% ( 0.0%)
cpu-cycles           5203449155222  # 1.14 GHz
instructions         15428229523592 # 2.97 IPC
l2 access            75190794668    # 9.551 l2 access per 1000 inst
l2 miss              44769402597    # 59.54% l2 miss
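
Both listings use the same "name value # note" layout, so they can be compared programmatically. The following is a minimal sketch of a parser for that layout. The function names parse_metrics and first are hypothetical, and the sample strings are a few lines copied from the two listings. Rows are kept in a list rather than a dictionary because some names (for example cpu-cycles and instructions) appear more than once.

```python
def parse_metrics(text):
    """Return a list of (name, value, note) tuples, one per metric line."""
    rows = []
    for line in text.splitlines():
        left, _, note = line.partition("#")
        # The value is the last whitespace-separated token before the '#'.
        parts = left.strip().rsplit(None, 1)
        if len(parts) != 2:
            continue  # blank line or a line without a value
        name, raw = parts
        try:
            value = float(raw)
        except ValueError:
            continue  # not a metric line
        rows.append((name, value, note.strip()))
    return rows


def first(rows, name):
    """Return the value of the first row with the given name."""
    return next(value for n, value, _ in rows if n == name)


amd = """\
slots                1974576003252  #
-- memory            929196071922   #    47.1%
-- branch mispredict 5037122371     #     0.3%
"""
intel = """\
slots                15445604506322 #
-- memory            4462841132398  #    28.9%
"""

for label, text in (("AMD", amd), ("Intel", intel)):
    rows = parse_metrics(text)
    share = 100.0 * first(rows, "-- memory") / first(rows, "slots")
    print(f"{label}: backend memory {share:.1f}% of slots")
```

Splitting on the last token rather than on column position handles rows such as "-- branch mispredict", where only a single space separates the name from the value.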

The process overview shows the lulesh2.0 invocations launched under MPI.

441 processes
	 72 lulesh2.0              570.95   112.10
	 68 clinfo                  15.88     6.32
	 38 vulkaninfo               0.94     1.33
	 18 mpirun                   0.77     2.15
	  6 glxinfo:gdrv0            0.12     0.04
	  6 glxinfo:gl0              0.12     0.04
	  4 vulkani:disk$0           0.10     0.14
	  6 clang                    0.08     0.03
	  6 php                      0.07     0.07
	  2 glxinfo                  0.06     0.03
	  2 glxinfo:cs0              0.06     0.02
	  2 glxinfo:disk$0           0.06     0.02
	  2 glxinfo:sh0              0.06     0.02
	  2 glxinfo:shlo0            0.06     0.02
	  2 llvmpipe-0               0.05     0.07
	  2 llvmpipe-1               0.05     0.07
	  2 llvmpipe-10              0.05     0.07
	  2 llvmpipe-11              0.05     0.07
	  2 llvmpipe-12              0.05     0.07
	  2 llvmpipe-13              0.05     0.07
	  2 llvmpipe-14              0.05     0.07
	  2 llvmpipe-15              0.05     0.07
	  2 llvmpipe-2               0.05     0.07
	  2 llvmpipe-3               0.05     0.07
	  2 llvmpipe-4               0.05     0.07
	  2 llvmpipe-5               0.05     0.07
	  2 llvmpipe-6               0.05     0.07
	  2 llvmpipe-7               0.05     0.07
	  2 llvmpipe-8               0.05     0.07
	  2 llvmpipe-9               0.05     0.07
	  3 rocminfo                 0.03     0.00
	  1 lspci                    0.00     0.02
	  1 ps                       0.00     0.01
	 82 sh                       0.00     0.00
	 13 gcc                      0.00     0.00
	 13 gsettings                0.00     0.00
	  8 stat                     0.00     0.00
	  8 systemd-detect-          0.00     0.00
	  6 llvm-link                0.00     0.00
	  5 phoronix-test-s          0.00     0.00
	  3 lulesh                   0.00     0.00
	  2 cc                       0.00     0.00
	  2 gmain                    0.00     0.00
	  2 lscpu                    0.00     0.00
	  2 uname                    0.00     0.00
	  2 which                    0.00     0.00
	  2 xset                     0.00     0.00
	  1 date                     0.00     0.00
	  1 dconf worker             0.00     0.00
	  1 dirname                  0.00     0.00
	  1 dmesg                    0.00     0.00
	  1 dmidecode                0.00     0.00
	  1 grep                     0.00     0.00
	  1 ifconfig                 0.00     0.00
	  1 ip                       0.00     0.00
	  1 lsmod                    0.00     0.00
	  1 mktemp                   0.00     0.00
	  1 qdbus                    0.00     0.00
	  1 readlink                 0.00     0.00
	  1 realpath                 0.00     0.00
	  1 sed                      0.00     0.00
	  1 sort                     0.00     0.00
	  1 stty                     0.00     0.00
	  1 systemctl                0.00     0.00
	  1 template.sh              0.00     0.00
	  1 wc                       0.00     0.00
	  1 xrandr                   0.00     0.00
0 processes running
47 maximum processes

Computation blocks

      7923) lulesh           cpu=1 start=5.85  finish=16.90
        7924) mpirun           cpu=0 start=5.85  finish=16.88
          7927) mpirun           cpu=4 start=6.46  finish=16.88
          7928) mpirun           cpu=7 start=6.46  finish=6.46 
          7929) mpirun           cpu=9 start=6.48  finish=16.87
          7930) mpirun           cpu=15 start=6.97  finish=16.87
          7931) mpirun           cpu=10 start=6.97  finish=16.88
          7932) lulesh2.0        cpu=10 start=6.98  finish=16.82
            7934) lulesh2.0        cpu=15 start=6.98  finish=16.81
            7938) lulesh2.0        cpu=15 start=6.99  finish=16.81
          7933) lulesh2.0        cpu=12 start=6.98  finish=16.82
            7936) lulesh2.0        cpu=0 start=6.99  finish=16.81
            7940) lulesh2.0        cpu=14 start=7.00  finish=16.81
          7935) lulesh2.0        cpu=3 start=6.99  finish=16.82
            7939) lulesh2.0        cpu=14 start=6.99  finish=16.81
            7943) lulesh2.0        cpu=5 start=7.00  finish=16.81
          7937) lulesh2.0        cpu=4 start=6.99  finish=16.77
            7942) lulesh2.0        cpu=1 start=7.00  finish=16.77
            7947) lulesh2.0        cpu=11 start=7.00  finish=16.77
          7941) lulesh2.0        cpu=11 start=7.00  finish=16.77
            7945) lulesh2.0        cpu=10 start=7.00  finish=16.77
            7950) lulesh2.0        cpu=3 start=7.01  finish=16.77
          7944) lulesh2.0        cpu=8 start=7.00  finish=16.77
            7948) lulesh2.0        cpu=5 start=7.01  finish=16.77
            7952) lulesh2.0        cpu=4 start=7.01  finish=16.77
          7946) lulesh2.0        cpu=6 start=7.00  finish=16.77
            7951) lulesh2.0        cpu=13 start=7.01  finish=16.77
            7954) lulesh2.0        cpu=5 start=7.02  finish=16.77
          7949) lulesh2.0        cpu=7 start=7.01  finish=16.77
            7953) lulesh2.0        cpu=15 start=7.01  finish=16.77
            7955) lulesh2.0        cpu=2 start=7.02  finish=16.77
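
The lifetime of each process in the tree above is simply finish minus start. The sketch below parses a few of those lines and prints the durations. The regular expression and the parse_blocks name are illustrative assumptions about the layout, not part of the tool that produced the listing, and the pattern assumes process names without spaces, which holds for this block.

```python
import re

# indent, pid, name, cpu, start, finish
LINE = re.compile(
    r"^(\s*)(\d+)\)\s+(\S+)\s+cpu=(\d+)\s+start=([\d.]+)\s+finish=([\d.]+)"
)


def parse_blocks(text):
    """Parse process-tree lines into dictionaries; indentation gives the depth."""
    rows = []
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue
        indent, pid, name, cpu, start, finish = m.groups()
        rows.append({
            "depth": len(indent),
            "pid": int(pid),
            "name": name,
            "cpu": int(cpu),
            "start": float(start),
            "finish": float(finish),
        })
    return rows


sample = """\
      7923) lulesh           cpu=1 start=5.85  finish=16.90
        7924) mpirun           cpu=0 start=5.85  finish=16.88
          7928) mpirun           cpu=7 start=6.46  finish=6.46
          7932) lulesh2.0        cpu=10 start=6.98  finish=16.82
            7934) lulesh2.0        cpu=15 start=6.98  finish=16.81
"""

rows = parse_blocks(sample)
for r in rows:
    duration = r["finish"] - r["start"]
    print(f"{r['pid']} {r['name']:<10} cpu={r['cpu']:<2} {duration:5.2f}s")
```

In this sample each lulesh2.0 entry lives for a little under 10 seconds, while the helper mpirun process 7928 exits immediately.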