A CPU and GPU numeric processing library, benchmarked with both its built-in CPU tests and its OpenCL tests. All of them run on my AMD system. On my Intel system the OpenCL fp32 test passes but the OpenCL fp16 test fails. The AMD system is considerably faster, so I am curious whether some of the work is landing on a GPU there. It looks like the first two workloads are multi-threaded and the rest are single-threaded.

The topdown profile shows high frontend stalls, with some variation between workloads and over time.

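AMD metrics show little floating point (about 21 per 1000 instructions, all in the 128-bit category) and a moderate number of branches (about 92 per 1000 instructions). There is some L2 access (about 49 per 1000 instructions), though backend stalls are not particularly high at 22.0% of slots.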

elapsed              255.589
on_cpu               0.316          # 5.05 / 16 cores
utime                769.161
stime                522.034
nvcsw                84146          # 54.56%
nivcsw               70084          # 45.44%
inblock              0              # 0.00/sec
onblock              156304         # 611.54/sec
cpu-clock            1297086975088  # 1297.087 seconds
task-clock           1297163581400  # 1297.164 seconds
page faults          1828517        # 1409.627/sec
context switches     155290         # 119.715/sec
cpu migrations       1592           # 1.227/sec
major page faults    155            # 0.119/sec
minor page faults    1828362        # 1409.508/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             566661991334   # 91.881 branches per 1000 inst
branch misses        61259641104    # 10.81% branch miss
conditional          252994830141   # 41.022 conditional branches per 1000 inst
indirect             24440357358    # 3.963 indirect branches per 1000 inst
cpu-cycles           5083213642342  # 1.21 GHz
instructions         6172430147473  # 1.21 IPC
slots                10170139884144 #
retiring             2242627936735  # 22.1% (29.2%)
-- ucode             8206700190     #     0.1%
-- fastpath          2234421236545  #    22.0%
frontend             3184418817599  # 31.3% (41.5%)
-- latency           2649897791970  #    26.1%
-- bandwidth         534521025629   #     5.3%
backend              2238686350080  # 22.0% (29.2%)
-- cpu               1024364604864  #    10.1%
-- memory            1214321745216  #    11.9%
speculation          10863172263    #  0.1% ( 0.1%) low
-- branch mispredict 10849523258    #     0.1%
-- pipeline restart  13649005       #     0.0%
smt-contention       2493533444045  # 24.5% ( 0.0%)
cpu-cycles           5022114467177  # 1.22 GHz
instructions         6157130266342  # 1.23 IPC
instructions         2057106720487  # 49.333 l2 access per 1000 inst
l2 hit from l1       86581943288    # 9.06% l2 miss
l2 miss from l1      2752199852     #
l2 hit from l2 pf    8456824069     #
l3 hit from l2 pf    5607231135     #
l3 miss from l2 pf   836878029      #
instructions         2053827320707  # 21.171 float per 1000 inst
float 512            83             # 0.000 AVX-512 per 1000 inst
float 256            508            # 0.000 AVX-256 per 1000 inst
float 128            43480827313    # 21.171 AVX-128 per 1000 inst
float MMX            0              # 0.000 MMX per 1000 inst
float scalar         0              # 0.000 scalar per 1000 inst
instructions         2665431        #
opcache              988658         # 370.919 opcache per 1000 inst
opcache miss         530873         # 53.7% opcache miss rate
l1 dTLB miss         5558           # 2.085 L1 dTLB per 1000 inst
l2 dTLB miss         1178           # 0.442 L2 dTLB per 1000 inst
instructions         2715463        #
icache               1322587        # 487.058 icache per 1000 inst
icache miss          112382         #  8.5% icache miss rate
l1 iTLB miss         14             # 0.005 L1 iTLB per 1000 inst
l2 iTLB miss         0              # 0.000 L2 iTLB per 1000 inst
tlb flush            19             # 0.007 TLB flush per 1000 inst
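
The commented figures in the dump above can be recomputed from the raw counters. The sketch below is my reading of how they are derived, not the profiler's own code: it assumes on_cpu is (utime + stime) / elapsed divided by the core count, and that the parenthesized topdown percentages are shares of the slots left after removing smt-contention. The variable names are mine.

```python
# Raw counters copied from the AMD dump above.
elapsed = 255.589
utime = 769.161
stime = 522.034
cores = 16

cpu_cycles = 5083213642342
instructions = 6172430147473
branches = 566661991334
branch_misses = 61259641104

slots = 10170139884144
smt_contention = 2493533444045
topdown = {
    "retiring": 2242627936735,
    "frontend": 3184418817599,
    "backend": 2238686350080,
    "speculation": 10863172263,
}

# Average number of busy cores, and the same value as a fraction of the machine.
busy_cores = (utime + stime) / elapsed
on_cpu = busy_cores / cores

ipc = instructions / cpu_cycles
branch_miss_pct = 100 * branch_misses / branches

# First percentage column: share of all issue slots.
raw_pct = {name: 100 * value / slots for name, value in topdown.items()}

# Assumption: the parenthesized column excludes slots lost to SMT contention.
usable_slots = slots - smt_contention
adjusted_pct = {name: 100 * value / usable_slots for name, value in topdown.items()}

print(f"on_cpu {on_cpu:.3f}  # {busy_cores:.2f} / {cores} cores")
print(f"IPC {ipc:.2f}, branch miss {branch_miss_pct:.2f}%")
for name in topdown:
    print(f"{name:12s} {raw_pct[name]:5.1f}% ({adjusted_pct[name]:5.1f}%)")
```

Under these assumptions the script reproduces the reported values: on_cpu 0.316 (5.05 of 16 cores), 1.21 IPC, 10.81% branch miss, and frontend at 31.3% of all slots or 41.5% once smt-contention is excluded.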

Intel metrics show lower on-cpu (0.116 versus 0.316 on AMD) and stalls at both L2 and DRAM (10.5% L2 bound and 11.8% DRAM bound).

elapsed              749.501
on_cpu               0.116          # 1.85 / 16 cores
utime                1332.666
stime                55.026
nvcsw                515494         # 6.33%
nivcsw               7622108        # 93.67%
inblock              15760          # 21.03/sec
onblock              10456          # 13.95/sec
cpu-clock            1384483475706  # 1384.483 seconds
task-clock           1384896944201  # 1384.897 seconds
page faults          5647449        # 4077.884/sec
context switches     8141137        # 5878.515/sec
cpu migrations       120081         # 86.708/sec
major page faults    163            # 0.118/sec
minor page faults    5647286        # 4077.766/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             327470422030   # 29.701 branches per 1000 inst
branch misses        2265053860     # 0.69% branch miss
conditional          327470459310   # 29.701 conditional branches per 1000 inst
indirect             45635483920    # 4.139 indirect branches per 1000 inst
slots                21933232930778 #
retiring             12264288512056 # 55.9% (55.9%) high
-- ucode             701472801784   #     3.2%
-- fastpath          11562815710272 #    52.7%
frontend             2803267333224  # 12.8% (12.8%)
-- latency           1783243291316  #     8.1%
-- bandwidth         1020024041908  #     4.7%
backend              6871443097914  # 31.3% (31.3%)
-- cpu               3842821497310  #    17.5%
-- memory            3028621600604  #    13.8%
speculation          1485815766476  #  6.8% ( 6.8%)
-- branch mispredict 1451257286366  #     6.6%
-- pipeline restart  34558480110    #     0.2%
smt-contention       0              #  0.0% ( 0.0%)
cpu-cycles           5173265968393  # 0.39 GHz
instructions         14473440438126 # 2.80 IPC
l2 access            178180311971   # 17.066 l2 access per 1000 inst
l2 miss              83307874133    # 46.75% l2 miss
cpu-cycles           4579755625449  # 28.7% memory latency
load stalls          1298392731852  #  0.5% l1 bound
l1 miss              1273798954929  # 10.5% l2 bound
l2 miss              794601654431   #  5.6% l3 bound
l3 miss              539003203401   # 11.8% dram bound
store_stalls         17972241280    #  0.4% store bound
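
The memory-latency lines at the end of the Intel dump appear to be a nested breakdown of stall cycles. The sketch below assumes each counter is a count of stall cycles with a miss outstanding at that level, so the time bound at one level is the difference between adjacent counters. That interpretation is mine; the source does not describe the formula.

```python
# Raw counters copied from the last six lines of the Intel dump above.
cycles = 4579755625449
load_stalls = 1298392731852
l1_miss = 1273798954929
l2_miss = 794601654431
l3_miss = 539003203401
store_stalls = 17972241280


def pct_of_cycles(count):
    """Express a stall-cycle count as a percentage of all cycles."""
    return 100 * count / cycles


# Assumption: each level's share is the stall cycles not explained by the next level.
breakdown = {
    "l1 bound": pct_of_cycles(load_stalls - l1_miss),
    "l2 bound": pct_of_cycles(l1_miss - l2_miss),
    "l3 bound": pct_of_cycles(l2_miss - l3_miss),
    "dram bound": pct_of_cycles(l3_miss),
    "store bound": pct_of_cycles(store_stalls),
}

# Assumption: the headline figure is load plus store stalls over cycles.
memory_latency = pct_of_cycles(load_stalls + store_stalls)

print(f"memory latency {memory_latency:.1f}%")
for name, value in breakdown.items():
    print(f"{name:12s} {value:5.1f}%")
```

With these formulas the output matches the dump: 28.7% memory latency, split into 0.5% L1, 10.5% L2, 5.6% L3, 11.8% DRAM and 0.4% store bound.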


The AMD process summary shows most of the time in blas_cpu.

1198 processes
	102 blas_cpu             12168.09  8389.18
	 51 cg_cpu                 695.13   183.60
	306 blas_opencl            373.02   658.65
	153 cg_opencl              140.62   150.05
	272 clinfo                  74.48    24.64
	 38 vulkaninfo               1.14     1.14
	  6 glxinfo:gdrv0            0.13     0.03
	  6 glxinfo:gl0              0.13     0.03
	  4 vulkani:disk$0           0.12     0.12
	  6 php                      0.10     0.15
	  2 glxinfo                  0.08     0.02
	  2 glxinfo:cs0              0.08     0.02
	  2 glxinfo:disk$0           0.07     0.02
	  2 glxinfo:sh0              0.07     0.01
	  2 glxinfo:shlo0            0.07     0.01
	  2 llvmpipe-0               0.06     0.06
	  2 llvmpipe-1               0.06     0.06
	  2 llvmpipe-10              0.06     0.06
	  2 llvmpipe-11              0.06     0.06
	  2 llvmpipe-12              0.06     0.06
	  2 llvmpipe-13              0.06     0.06
	  2 llvmpipe-14              0.06     0.06
	  2 llvmpipe-15              0.06     0.06
	  2 llvmpipe-2               0.06     0.06
	  2 llvmpipe-3               0.06     0.06
	  2 llvmpipe-4               0.06     0.06
	  2 llvmpipe-5               0.06     0.06
	  2 llvmpipe-6               0.06     0.06
	  2 llvmpipe-7               0.06     0.06
	  2 llvmpipe-8               0.06     0.06
	  2 llvmpipe-9               0.06     0.06
	  6 clang                    0.05     0.05
	  3 rocminfo                 0.03     0.03
	  1 lspci                    0.01     0.02
	  1 ps                       0.00     0.01
	 98 sh                       0.00     0.00
	 18 arrayfire                0.00     0.00
	 13 gcc                      0.00     0.00
	 11 gsettings                0.00     0.00
	  8 stat                     0.00     0.00
	  8 systemd-detect-          0.00     0.00
	  6 llvm-link                0.00     0.00
	  5 phoronix-test-s          0.00     0.00
	  3 gmain                    0.00     0.00
	  2 cc                       0.00     0.00
	  2 dconf worker             0.00     0.00
	  2 lscpu                    0.00     0.00
	  2 uname                    0.00     0.00
	  2 which                    0.00     0.00
	  2 xset                     0.00     0.00
	  1 date                     0.00     0.00
	  1 dirname                  0.00     0.00
	  1 dmesg                    0.00     0.00
	  1 dmidecode                0.00     0.00
	  1 grep                     0.00     0.00
	  1 ifconfig                 0.00     0.00
	  1 ip                       0.00     0.00
	  1 lsmod                    0.00     0.00
	  1 mktemp                   0.00     0.00
	  1 qdbus                    0.00     0.00
	  1 readlink                 0.00     0.00
	  1 realpath                 0.00     0.00
	  1 sed                      0.00     0.00
	  1 sort                     0.00     0.00
	  1 stty                     0.00     0.00
	  1 systemctl                0.00     0.00
	  1 template.sh              0.00     0.00
	  1 wc                       0.00     0.00
	  1 xrandr                   0.00     0.00
0 processes running
59 maximum processes
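
To put a number on "most of the time", the summary can be parsed and ranked. The two numeric columns are not labeled in the listing, so this sketch simply assumes the first one is a time-like total and computes each name's share of it. It uses only a handful of rows copied from above, and `parse_rows` is a helper name I made up.

```python
import re

# A few rows copied from the AMD process summary above (not the full table).
SAMPLE = """\
102 blas_cpu             12168.09  8389.18
 51 cg_cpu                 695.13   183.60
306 blas_opencl            373.02   658.65
153 cg_opencl              140.62   150.05
272 clinfo                  74.48    24.64
  2 dconf worker             0.00     0.00
"""

# count, name (may contain spaces), then two numeric columns at the end of the line.
ROW = re.compile(r"^\s*(\d+)\s+(.+?)\s+([\d.]+)\s+([\d.]+)\s*$")


def parse_rows(text):
    """Return (name, count, first_column, second_column) tuples."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            count, name, first, second = match.groups()
            rows.append((name, int(count), float(first), float(second)))
    return rows


rows = parse_rows(SAMPLE)
total = sum(first for _, _, first, _ in rows)
shares = {name: first / total for name, _, first, _ in rows}

for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:14s} {100 * share:5.1f}%")
```

On the rows shown, blas_cpu accounts for roughly 90% of the first column. The remaining rows of the full table add only about 3 to the total, so the share barely changes.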

Computation blocks look as follows; this one is a blas_cpu run whose threads are spread across all 16 CPUs:

      363808) arrayfire        cpu=5 start=6.95  finish=23.75
        363809) blas_cpu         cpu=7 start=6.95  finish=23.73
          363810) blas_cpu         cpu=14 start=6.96  finish=23.73
          363811) blas_cpu         cpu=10 start=6.96  finish=23.73
          363812) blas_cpu         cpu=4 start=6.96  finish=23.73
          363813) blas_cpu         cpu=9 start=6.96  finish=23.73
          363814) blas_cpu         cpu=8 start=6.96  finish=23.73
          363815) blas_cpu         cpu=3 start=6.96  finish=23.73
          363816) blas_cpu         cpu=5 start=6.96  finish=23.73
          363817) blas_cpu         cpu=15 start=6.96  finish=23.73
          363818) blas_cpu         cpu=0 start=6.96  finish=23.73
          363819) blas_cpu         cpu=6 start=6.96  finish=23.73
          363820) blas_cpu         cpu=12 start=6.96  finish=23.73
          363821) blas_cpu         cpu=1 start=6.96  finish=23.73
          363822) blas_cpu         cpu=11 start=6.96  finish=23.73
          363823) blas_cpu         cpu=2 start=6.96  finish=23.72
          363824) blas_cpu         cpu=13 start=6.96  finish=23.72
          363825) blas_cpu         cpu=7 start=6.96  finish=23.73
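
The block above can be checked mechanically for how many threads ran and how many CPUs they covered. This is a minimal sketch that assumes `start` and `finish` are seconds since the start of the trace and that `cpu` is the CPU each thread was last seen on; neither is stated in the listing.

```python
import re

# The computation block copied from above.
TRACE = """\
363808) arrayfire        cpu=5 start=6.95  finish=23.75
  363809) blas_cpu         cpu=7 start=6.95  finish=23.73
    363810) blas_cpu         cpu=14 start=6.96  finish=23.73
    363811) blas_cpu         cpu=10 start=6.96  finish=23.73
    363812) blas_cpu         cpu=4 start=6.96  finish=23.73
    363813) blas_cpu         cpu=9 start=6.96  finish=23.73
    363814) blas_cpu         cpu=8 start=6.96  finish=23.73
    363815) blas_cpu         cpu=3 start=6.96  finish=23.73
    363816) blas_cpu         cpu=5 start=6.96  finish=23.73
    363817) blas_cpu         cpu=15 start=6.96  finish=23.73
    363818) blas_cpu         cpu=0 start=6.96  finish=23.73
    363819) blas_cpu         cpu=6 start=6.96  finish=23.73
    363820) blas_cpu         cpu=12 start=6.96  finish=23.73
    363821) blas_cpu         cpu=1 start=6.96  finish=23.73
    363822) blas_cpu         cpu=11 start=6.96  finish=23.73
    363823) blas_cpu         cpu=2 start=6.96  finish=23.72
    363824) blas_cpu         cpu=13 start=6.96  finish=23.72
    363825) blas_cpu         cpu=7 start=6.96  finish=23.73
"""

ENTRY = re.compile(
    r"(\d+)\)\s+(\S+)\s+cpu=(\d+)\s+start=([\d.]+)\s+finish=([\d.]+)"
)

entries = []
for line in TRACE.splitlines():
    match = ENTRY.search(line)
    if match:
        pid, name, cpu, start, finish = match.groups()
        entries.append({
            "pid": int(pid),
            "name": name,
            "cpu": int(cpu),
            "start": float(start),
            "finish": float(finish),
        })

blas = [e for e in entries if e["name"] == "blas_cpu"]
distinct_cpus = {e["cpu"] for e in blas}
durations = [e["finish"] - e["start"] for e in blas]

print(f"{len(blas)} blas_cpu entries on {len(distinct_cpus)} distinct CPUs")
print(f"durations {min(durations):.2f}s to {max(durations):.2f}s")
```

For this block the result is 17 blas_cpu entries on 16 distinct CPUs, each lasting a little under 17 seconds, which is consistent with the workload being multi-threaded.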