x265 – Performance analysis, tools and experiments

x265 benchmark with standard input and two workloads

AMD metrics show a 128-bit floating point program with some L2 misses and backend memory but otherwise medium range retiring. We spend ~50% of time on the CPU so threaded but not fully threaded operation?

elapsed              713.905
on_cpu               0.502          # 8.03 / 16 cores
utime                5668.028
stime                66.483
nvcsw                2807701        # 92.74%
nivcsw               219710         # 7.26%
inblock              6170832
onblock              27632
cpu-clock            5730360170277  # 5730.360 seconds
task-clock           5731795930958  # 5731.796 seconds
page faults          6120568        # 1067.827/sec
context switches     3030769        # 528.764/sec
cpu migrations       176614         # 30.813/sec
major page faults    67             # 0.012/sec
minor page faults    6120501        # 1067.816/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             2298223476101  # 64.254 branches per 1000 inst
branch misses        38765597182    # 1.69% branch miss
conditional          1447123909275  # 40.459 conditional branches per 1000 inst
indirect             184907459078   # 5.170 indirect branches per 1000 inst
cpu-cycles           6226498530298  # 1.98 GHz
instructions         9678813453851  # 1.55 IPC
slots                12467027650776 #
retiring             3419335741620  # 27.4% (35.4%)
-- ucode             47510862855    #     0.4%
-- fastpath          3371824878765  #    27.0%
frontend             1352978437698  # 10.9% (14.0%)
-- latency           861177646632   #     6.9%
-- bandwidth         491800791066   #     3.9%
backend              4618690961014  # 37.0% (47.8%)
-- cpu               1412014283462  #    11.3%
-- memory            3206676677552  #    25.7%
speculation          263046398653   #  2.1% ( 2.7%)
-- branch mispredict 242439249427   #     1.9%
-- pipeline restart  20607149226    #     0.2%
smt-contention       2812859705014  # 22.6% ( 0.0%)
cpu-cycles           7936497399423  # 2.00 GHz
instructions         12283162200513 # 1.55 IPC
instructions         4099519478946  # 39.677 l2 access per 1000 inst
l2 hit from l1       134192432419   # 12.06% l2 miss
l2 miss from l1      12511817133    #
l2 hit from l2 pf    21366966534    #
l3 hit from l2 pf    3874073579     #
l3 miss from l2 pf   3222493799     #
instructions         4091045784055  # 190.915 float per 1000 inst
float 512            71             # 0.000 AVX-512 per 1000 inst
float 256            1060           # 0.000 AVX-256 per 1000 inst
float 128            781040286672   # 190.915 AVX-128 per 1000 inst
float MMX            0              # 0.000 MMX per 1000 inst
float scalar         0              # 0.000 scalar per 1000 inst

Corresponding Intel metrics shows >50% retiring rate.

elapsed              788.657
on_cpu               0.610          # 9.76 / 16 cores
utime                7603.378
stime                90.091
nvcsw                3688077        # 90.89%
nivcsw               369791         # 9.11%
inblock              31549176
onblock              26776
cpu-clock            7679288766057  # 7679.289 seconds
task-clock           7681427192071  # 7681.427 seconds
page faults          6542506        # 851.731/sec
context switches     4061655        # 528.763/sec
cpu migrations       594059         # 77.337/sec
major page faults    793            # 0.103/sec
minor page faults    6541713        # 851.627/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             2477806622104  # 64.408 branches per 1000 inst
branch misses        43684720181    # 1.76% branch miss
conditional          2477806644216  # 64.408 conditional branches per 1000 inst
indirect             880974334257   # 22.900 indirect branches per 1000 inst
slots                10702457016518 #
retiring             5872152202319  # 54.9% (54.9%)
-- ucode             373372570662   #     3.5%
-- fastpath          5498779631657  #    51.4%
frontend             1885584589158  # 17.6% (17.6%)
-- latency           953145946637   #     8.9%
-- bandwidth         932438642521   #     8.7%
backend              2105534559715  # 19.7% (19.7%)
-- cpu               981934933436   #     9.2%
-- memory            1123599626279  #    10.5%
speculation          939803939689   #  8.8% ( 8.8%)
-- branch mispredict 884451371813   #     8.3%
-- pipeline restart  55352567876    #     0.5%
smt-contention       0              #  0.0% ( 0.0%)
cpu-cycles           6467518831703  # 1.91 GHz
instructions         12325634319096 # 1.91 IPC
l2 access            173564026661   # 31.787 l2 access per 1000 inst
l2 miss              36974658876    # 21.30% l2 miss

No surprises from the process summary, all time spent in x265 process

434 processes
	150 x265                 36013.72   343.12
	 38 vulkaninfo               0.76     1.14
	  6 glxinfo:gdrv0            0.16     0.13
	  2 glxinfo                  0.12     0.05
	  2 glxinfo:cs0              0.12     0.05
	  2 glxinfo:disk$0           0.12     0.05
	  2 glxinfo:sh0              0.12     0.05
	  2 glxinfo:shlo0            0.12     0.05
	  4 vulkani:disk$0           0.08     0.12
	  2 llvmpipe-0               0.04     0.06
	  2 llvmpipe-1               0.04     0.06
	  2 llvmpipe-10              0.04     0.06
	  2 llvmpipe-11              0.04     0.06
	  2 llvmpipe-12              0.04     0.06
	  2 llvmpipe-13              0.04     0.06
	  2 llvmpipe-14              0.04     0.06
	  2 llvmpipe-15              0.04     0.06
	  2 llvmpipe-2               0.04     0.06
	  2 llvmpipe-3               0.04     0.06
	  2 llvmpipe-4               0.04     0.06
	  2 llvmpipe-5               0.04     0.06
	  2 llvmpipe-6               0.04     0.06
	  2 llvmpipe-7               0.04     0.06
	  2 llvmpipe-8               0.04     0.06
	  2 llvmpipe-9               0.04     0.06
	  6 php                      0.03     0.14
	  6 clang                    0.02     0.06
	  1 lspci                    0.00     0.03
	 88 sh                       0.00     0.00
	 13 gcc                      0.00     0.00
	  9 gsettings                0.00     0.00
	  9 stty                     0.00     0.00
	  8 stat                     0.00     0.00
	  8 systemd-detect-          0.00     0.00
	  6 llvm-link                0.00     0.00
	  5 phoronix-test-s          0.00     0.00
	  4 gmain                    0.00     0.00
	  3 dconf worker             0.00     0.00
	  2 cc                       0.00     0.00
	  2 lscpu                    0.00     0.00
	  2 uname                    0.00     0.00
	  2 which                    0.00     0.00
	  2 xset                     0.00     0.00
	  1 date                     0.00     0.00
	  1 dirname                  0.00     0.00
	  1 dmesg                    0.00     0.00
	  1 dmidecode                0.00     0.00
	  1 grep                     0.00     0.00
	  1 ifconfig                 0.00     0.00
	  1 ip                       0.00     0.00
	  1 lsmod                    0.00     0.00
	  1 mktemp                   0.00     0.00
	  1 ps                       0.00     0.00
	  1 readlink                 0.00     0.00
	  1 realpath                 0.00     0.00
	  1 sed                      0.00     0.00
	  1 sort                     0.00     0.00
	  1 systemctl                0.00     0.00
	  1 template.sh              0.00     0.00
	  1 wc                       0.00     0.00
	  1 xrandr                   0.00     0.00

The core blocks start slightly more threads than cores

      1666842) x265 start=4.92  finish=52.17
        1666843) x265 start=4.92  finish=52.16
          1666844) x265 start=4.92  finish=50.25
          1666845) x265 start=4.92  finish=52.06
          1666846) x265 start=4.92  finish=52.06
          1666847) x265 start=4.92  finish=52.06
          1666848) x265 start=4.92  finish=52.06
          1666849) x265 start=4.92  finish=52.06
          1666850) x265 start=4.92  finish=52.06
          1666851) x265 start=4.92  finish=52.06
          1666852) x265 start=4.92  finish=52.06
          1666853) x265 start=4.92  finish=52.06
          1666854) x265 start=4.92  finish=52.06
          1666855) x265 start=4.92  finish=52.06
          1666856) x265 start=4.92  finish=52.06
          1666857) x265 start=4.92  finish=52.06
          1666858) x265 start=4.92  finish=52.06
          1666859) x265 start=4.92  finish=52.06
          1666860) x265 start=4.92  finish=52.06
          1666861) x265 start=4.92  finish=52.06
          1666862) x265 start=4.94  finish=52.06
          1666863) x265 start=4.94  finish=52.06
          1666864) x265 start=4.94  finish=52.06
          1666865) x265 start=4.94  finish=52.16
          1666866) x265 start=4.94  finish=50.58

What accounts for the relatively lower amount of on-core time? The AMD metrics provide one clue with the number of inblock/outblock operations (need to update the metrics to provide these as a rate).