A speech-to-text process using TensorFlow to transcribe a three-minute audio recording. The workload runs for about a minute and is repeated three times. It looks to be effectively single-threaded: it creates many threads but keeps only about one core busy.

Topdown profile looks to be backend bound.

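AMD metrics confirm a backend-stall-heavy workload (75.3% of slots, mostly memory) with low frontend stalls (4.2%). There is not much floating point (about 16 float instructions per 1000 instructions). There is a good amount of L2 access (about 172 accesses per 1000 instructions).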

elapsed              160.615
on_cpu               0.070          # 1.12 / 16 cores
utime                169.116
stime                10.486
nvcsw                487526         # 99.62%
nivcsw               1847           # 0.38%
inblock              0              # 0.00/sec
onblock              12568          # 78.25/sec
cpu-clock            177998788938   # 177.999 seconds
task-clock           178430160190   # 178.430 seconds
page faults          235378         # 1319.160/sec
context switches     489985         # 2746.088/sec
cpu migrations       23465          # 131.508/sec
major page faults    3416           # 19.145/sec
minor page faults    231962         # 1300.016/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             49259821446    # 63.246 branches per 1000 inst
branch misses        521285591      # 1.06% branch miss
conditional          40408849370    # 51.882 conditional branches per 1000 inst
indirect             1381988941     # 1.774 indirect branches per 1000 inst
cpu-cycles           696267329926   # 0.24 GHz
instructions         772925724731   # 1.11 IPC
slots                1390055158284  #
retiring             256880226504   # 18.5% (18.7%)
-- ucode             151528830      #     0.0%
-- fastpath          256728697674   #    18.5%
frontend             58842049778    #  4.2% ( 4.3%) low
-- latency           47353735122    #     3.4%
-- bandwidth         11488314656    #     0.8%
backend              1047097332617  # 75.3% (76.2%) high
-- cpu               130578369248   #     9.4%
-- memory            916518963369   #    65.9%
speculation          11335440378    #  0.8% ( 0.8%) low
-- branch mispredict 10259441048    #     0.7%
-- pipeline restart  1075999330     #     0.1%
smt-contention       15848595211    #  1.1% ( 0.0%)
cpu-cycles           697932624106   # 0.28 GHz
instructions         777877029300   # 1.11 IPC
instructions         259176178766   # 172.445 l2 access per 1000 inst
l2 hit from l1       26157068658    # 31.31% l2 miss
l2 miss from l1      2105517564     #
l2 hit from l2 pf    6648857572     #
l3 hit from l2 pf    269934316      #
l3 miss from l2 pf   11617658376    #
instructions         259534288888   # 15.867 float per 1000 inst
float 512            65             # 0.000 AVX-512 per 1000 inst
float 256            2318481        # 0.009 AVX-256 per 1000 inst
float 128            4115648027     # 15.858 AVX-128 per 1000 inst
float MMX            0              # 0.000 MMX per 1000 inst
float scalar         4              # 0.000 scalar per 1000 inst
instructions         778907715084   #
opcache              94567228200    # 121.410 opcache per 1000 inst
opcache miss         4023676179     #  4.3% opcache miss rate
l1 dTLB miss         4315198834     # 5.540 L1 dTLB per 1000 inst
l2 dTLB miss         1044990269     # 1.342 L2 dTLB per 1000 inst
instructions         778450113458   #
icache               7298837934     # 9.376 icache per 1000 inst
icache miss          1369530995     # 18.8% icache miss rate
l1 iTLB miss         23755944       # 0.031 L1 iTLB per 1000 inst
l2 iTLB miss         0              # 0.000 L2 iTLB per 1000 inst
tlb flush            16742          # 0.000 TLB flush per 1000 inst
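
As a cross-check, the derived values in the comment column can be recomputed from the raw counters. The sketch below is a minimal illustration, not the profiling tool's own code: the dictionary keys are hypothetical names, and the `on_cpu` formula, (utime + stime) / elapsed / cores, is an assumption that happens to reproduce the reported 1.12 cores and 0.070.

```python
# Raw counters copied from the AMD run above. The key names are illustrative.
amd = {
    "elapsed": 160.615,
    "utime": 169.116,
    "stime": 10.486,
    "cores": 16,
    "cpu_cycles": 696267329926,
    "instructions": 772925724731,
    "branches": 49259821446,
    "branch_misses": 521285591,
    "slots": 1390055158284,
    "retiring": 256880226504,
    "frontend": 58842049778,
    "backend": 1047097332617,
    "speculation": 11335440378,
}


def derived_metrics(c):
    """Recompute the ratios shown after '#' in the counter dump."""
    # Assumption: busy cores = CPU time consumed / wall-clock time.
    busy_cores = (c["utime"] + c["stime"]) / c["elapsed"]
    slots = c["slots"]
    return {
        "busy_cores": busy_cores,
        "on_cpu": busy_cores / c["cores"],
        "ipc": c["instructions"] / c["cpu_cycles"],
        "branch_miss_pct": 100.0 * c["branch_misses"] / c["branches"],
        # Topdown level 1: each category as a share of issue slots.
        "retiring_pct": 100.0 * c["retiring"] / slots,
        "frontend_pct": 100.0 * c["frontend"] / slots,
        "backend_pct": 100.0 * c["backend"] / slots,
        "speculation_pct": 100.0 * c["speculation"] / slots,
    }


metrics = derived_metrics(amd)
for name, value in metrics.items():
    print(f"{name:16s} {value:8.3f}")
```

A busy-core figure close to 1 on a 16-core machine is what supports the effectively single-threaded reading, even though the process creates many threads.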

Intel metrics confirm that the memory-bound stalls are both L3 bound (14.2%) and DRAM bound (24.6%), in addition to a large L2-bound share (47.1%).

elapsed              201.626
on_cpu               0.056          # 0.90 / 16 cores
utime                176.392
stime                5.582
nvcsw                468751         # 97.63%
nivcsw               11384          # 2.37%
inblock              1208           # 5.99/sec
onblock              1288           # 6.39/sec
cpu-clock            180258810774   # 180.259 seconds
task-clock           180510870708   # 180.511 seconds
page faults          222169         # 1230.779/sec
context switches     480881         # 2664.000/sec
cpu migrations       38123          # 211.195/sec
major page faults    2428           # 13.451/sec
minor page faults    219741         # 1217.328/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             49026825305    # 59.224 branches per 1000 inst
branch misses        354171008      # 0.72% branch miss
conditional          49026840985    # 59.224 conditional branches per 1000 inst
indirect             3105678778     # 3.752 indirect branches per 1000 inst
slots                3095032907966  #
retiring             773986641565   # 25.0% (25.0%)
-- ucode             10055304339    #     0.3%
-- fastpath          763931337226   #    24.7%
frontend             74881303981    #  2.4% ( 2.4%) low
-- latency           36919463368    #     1.2%
-- bandwidth         37961840613    #     1.2%
backend              2231130561294  # 72.1% (72.1%) high
-- cpu               1108840747474  #    35.8%
-- memory            1122289813820  #    36.3%
speculation          42072490016    #  1.4% ( 1.4%)
-- branch mispredict 39314118809    #     1.3%
-- pipeline restart  2758371207     #     0.1%
smt-contention       0              #  0.0% ( 0.0%)
cpu-cycles           618989502966   # 0.24 GHz
instructions         878893138266   # 1.42 IPC
l2 access            144912559758   # 188.820 l2 access per 1000 inst
l2 miss              98850887983    # 68.21% l2 miss
cpu-cycles           540075543736   # 42.5% memory latency
load stalls          228742926778   #  0.0% l1 bound
l1 miss              463786101064   # 47.1% l2 bound
l2 miss              209158694764   # 14.2% l3 bound
l3 miss              132645464628   # 24.6% dram bound
store_stalls         624046835      #  0.1% store bound
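
The memory-bound breakdown above can be reproduced from the stall counters in the last group. The formulas in this sketch are inferred from the reported percentages rather than taken from the tool's source, so treat them as an assumption: each level is the difference between successive stall counters divided by cycles, and a negative difference is clamped to zero (which would explain the 0.0% L1-bound figure).

```python
# Counters copied from the last counter group of the Intel run above.
cycles = 540075543736
load_stalls = 228742926778
l1_miss_stalls = 463786101064
l2_miss_stalls = 209158694764
l3_miss_stalls = 132645464628
store_stalls = 624046835


def pct(numerator, denominator):
    """Percentage of cycles, clamped at zero for negative differences."""
    return max(0.0, 100.0 * numerator / denominator)


# Assumed formulas, chosen because they match the reported values.
breakdown = {
    "memory_latency": pct(load_stalls + store_stalls, cycles),
    "l1_bound": pct(load_stalls - l1_miss_stalls, cycles),
    "l2_bound": pct(l1_miss_stalls - l2_miss_stalls, cycles),
    "l3_bound": pct(l2_miss_stalls - l3_miss_stalls, cycles),
    "dram_bound": pct(l3_miss_stalls, cycles),
    "store_bound": pct(store_stalls, cycles),
}

for name, value in breakdown.items():
    print(f"{name:16s} {value:5.1f}%")
```

Under these assumed formulas the per-level shares sum to more than the overall memory latency figure, so they are best read as relative indicators of where misses are served rather than as an exact partition of stall cycles.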

The process profile shows that nearly all of the time is spent in the deepspeech process.

450 processes
	102 deepspeech            5788.84   199.24
	 68 clinfo                  16.53     6.33
	 38 vulkaninfo               1.34     0.96
	  4 vulkani:disk$0           0.15     0.11
	  6 glxinfo:gdrv0            0.09     0.09
	  6 glxinfo:gl0              0.09     0.09
	  2 llvmpipe-0               0.07     0.05
	  2 llvmpipe-1               0.07     0.05
	  2 llvmpipe-10              0.07     0.05
	  2 llvmpipe-11              0.07     0.05
	  2 llvmpipe-12              0.07     0.05
	  2 llvmpipe-13              0.07     0.05
	  2 llvmpipe-14              0.07     0.05
	  2 llvmpipe-15              0.07     0.05
	  2 llvmpipe-2               0.07     0.05
	  2 llvmpipe-3               0.07     0.05
	  2 llvmpipe-4               0.07     0.05
	  2 llvmpipe-5               0.07     0.05
	  2 llvmpipe-6               0.07     0.05
	  2 llvmpipe-7               0.07     0.05
	  2 llvmpipe-8               0.07     0.05
	  2 llvmpipe-9               0.07     0.05
	  6 php                      0.06     0.09
	  2 glxinfo                  0.06     0.03
	  6 clang                    0.05     0.07
	  2 glxinfo:cs0              0.05     0.03
	  2 glxinfo:disk$0           0.05     0.03
	  2 glxinfo:sh0              0.05     0.03
	  2 glxinfo:shlo0            0.05     0.03
	  3 rocminfo                 0.00     0.03
	  1 lspci                    0.00     0.02
	 81 sh                       0.00     0.00
	 12 gcc                      0.00     0.00
	  9 gsettings                0.00     0.00
	  8 stat                     0.00     0.00
	  8 systemd-detect-          0.00     0.00
	  6 llvm-link                0.00     0.00
	  5 gmain                    0.00     0.00
	  5 phoronix-test-s          0.00     0.00
	  3 deepspeech-run           0.00     0.00
	  2 dconf worker             0.00     0.00
	  2 lscpu                    0.00     0.00
	  2 uname                    0.00     0.00
	  2 which                    0.00     0.00
	  2 xset                     0.00     0.00
	  1 cc                       0.00     0.00
	  1 date                     0.00     0.00
	  1 dirname                  0.00     0.00
	  1 dmesg                    0.00     0.00
	  1 dmidecode                0.00     0.00
	  1 grep                     0.00     0.00
	  1 ifconfig                 0.00     0.00
	  1 ip                       0.00     0.00
	  1 lsmod                    0.00     0.00
	  1 mktemp                   0.00     0.00
	  1 ps                       0.00     0.00
	  1 qdbus                    0.00     0.00
	  1 readlink                 0.00     0.00
	  1 realpath                 0.00     0.00
	  1 sed                      0.00     0.00
	  1 sort                     0.00     0.00
	  1 stty                     0.00     0.00
	  1 systemctl                0.00     0.00
	  1 template.sh              0.00     0.00
	  1 wc                       0.00     0.00
	  1 xrandr                   0.00     0.00
0 processes running
47 maximum processes

Computation blocks

      906663) deepspeech-run   cpu=5 start=5.99  finish=49.12
        906664) deepspeech       cpu=5 start=6.00  finish=49.11
          906665) deepspeech       cpu=4 start=6.01  finish=49.11
          906666) deepspeech       cpu=14 start=6.01  finish=49.11
          906667) deepspeech       cpu=13 start=6.01  finish=49.11
          906668) deepspeech       cpu=2 start=6.01  finish=49.11
          906669) deepspeech       cpu=9 start=6.01  finish=49.11
          906670) deepspeech       cpu=7 start=6.01  finish=49.11
          906671) deepspeech       cpu=1 start=6.01  finish=49.11
          906672) deepspeech       cpu=8 start=6.01  finish=49.11
          906673) deepspeech       cpu=12 start=6.01  finish=49.11
          906674) deepspeech       cpu=8 start=6.01  finish=49.11
          906675) deepspeech       cpu=15 start=6.01  finish=49.11
          906676) deepspeech       cpu=6 start=6.01  finish=49.11
          906677) deepspeech       cpu=10 start=6.01  finish=49.11
          906678) deepspeech       cpu=3 start=6.01  finish=49.11
          906679) deepspeech       cpu=9 start=6.01  finish=49.11
          906680) deepspeech       cpu=11 start=6.01  finish=49.11
          906681) deepspeech       cpu=7 start=6.01  finish=49.11
          906682) deepspeech       cpu=11 start=6.01  finish=49.11
          906683) deepspeech       cpu=0 start=6.01  finish=49.11
          906684) deepspeech       cpu=14 start=6.01  finish=49.11
          906685) deepspeech       cpu=6 start=6.01  finish=49.11
          906686) deepspeech       cpu=13 start=6.01  finish=49.11
          906687) deepspeech       cpu=4 start=6.01  finish=49.11
          906688) deepspeech       cpu=12 start=6.01  finish=49.11
          906689) deepspeech       cpu=13 start=6.01  finish=49.11
          906690) deepspeech       cpu=15 start=6.01  finish=49.11
          906691) deepspeech       cpu=5 start=6.01  finish=49.11
          906692) deepspeech       cpu=10 start=6.01  finish=49.11
          906693) deepspeech       cpu=1 start=6.01  finish=49.11
          906694) deepspeech       cpu=2 start=6.01  finish=49.11
          906695) deepspeech       cpu=0 start=6.01  finish=49.11
          906696) deepspeech       cpu=3 start=6.01  finish=49.11
          906697) deepspeech       cpu=11 start=6.01  finish=49.11
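
The listing above can be summarized programmatically. The following is a small sketch, assuming each line has the form `pid) name cpu=N start=S finish=F` as shown; the regular expression and function name are illustrative and only handle names without spaces.

```python
import re

# Matches lines such as "906665) deepspeech  cpu=4 start=6.01  finish=49.11".
LINE = re.compile(
    r"(\d+)\)\s+(\S+)\s+cpu=(\d+)\s+start=([\d.]+)\s+finish=([\d.]+)"
)


def parse_blocks(text):
    """Return one record per computation-block line found in the text."""
    rows = []
    for match in LINE.finditer(text):
        pid, name, cpu, start, finish = match.groups()
        rows.append({
            "pid": int(pid),
            "name": name,
            "cpu": int(cpu),
            "start": float(start),
            "finish": float(finish),
        })
    return rows


# First few lines of the listing above, used as sample input.
sample = """\
      906663) deepspeech-run   cpu=5 start=5.99  finish=49.12
        906664) deepspeech       cpu=5 start=6.00  finish=49.11
          906665) deepspeech       cpu=4 start=6.01  finish=49.11
          906666) deepspeech       cpu=14 start=6.01  finish=49.11
"""

rows = parse_blocks(sample)
for row in rows:
    lifetime = row["finish"] - row["start"]
    print(f'{row["pid"]} {row["name"]:16s} cpu={row["cpu"]:<2d} {lifetime:.2f}s')
```

In the full listing every deepspeech thread has the same start and finish time, so thread lifetime alone says nothing about how busy each thread was; the roughly one busy core reported in the counter summaries is the better indicator of actual parallelism.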