{"id":288,"date":"2024-01-06T13:19:21","date_gmt":"2024-01-06T13:19:21","guid":{"rendered":"https:\/\/mvermeulen.org\/perf\/?page_id=288"},"modified":"2024-01-07T13:34:29","modified_gmt":"2024-01-07T13:34:29","slug":"embree","status":"publish","type":"page","link":"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/embree\/","title":{"rendered":"embree"},"content":{"rendered":"\n<p>Embree is a set of ray tracing kernels. In the test below, I run three workloads; the first differs slightly from the other two. The workloads are listed as AVX-512 capable, but I don&#8217;t see those instructions in my trace, most likely because I am using default compilation rather than -march=native. That is an experiment to try in the future. The code is floating-point intensive, with most time spent in backend memory operations. Branch misprediction is also higher than average.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-21.png\" alt=\"\" class=\"wp-image-301\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-21.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-21-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-21-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>AMD metrics show a relatively low IPC and many backend stalls. 
The floating point code is predominantly AVX-128 bit.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              589.246\non_cpu               0.871          # 13.93 \/ 16 cores\nutime                8196.986\nstime                11.873\nnvcsw                93191          # 41.33%\nnivcsw               132272         # 58.67%\ninblock              2348544        # 3985.68\/sec\nonblock              1648           # 2.80\/sec\ncpu-clock            8210676277740  # 8210.676 seconds\ntask-clock           8210826881166  # 8210.827 seconds\npage faults          2579835        # 314.199\/sec\ncontext switches     228219         # 27.795\/sec\ncpu migrations       657            # 0.080\/sec\nmajor page faults    111            # 0.014\/sec\nminor page faults    2579724        # 314.186\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             1362232904120  # 80.306 branches per 1000 inst\nbranch misses        184544546458   # 13.55% branch miss\nconditional          1140352434370  # 67.226 conditional branches per 1000 inst\nindirect             12415374703    # 0.732 indirect branches per 1000 inst\ncpu-cycles           34845974472315 # 3.69 GHz\ninstructions         16962277518169 # 0.49 IPC\nslots                69674090415768 #\nretiring             7581529448521  # 10.9% (13.5%)\n-- ucode             24936273893    #     0.0%\n-- fastpath          7556593174628  #    10.8%\nfrontend             7598757925485  # 10.9% (13.5%)\n-- latency           6124859496648  #     8.8%\n-- bandwidth         1473898428837  #     2.1%\nbackend              37121990917073 # 53.3% (65.9%)\n-- cpu               9400871625693  #    13.5%\n-- memory            27721119291380 #    39.8%\nspeculation          3987229045557  #  5.7% ( 7.1%)\n-- branch mispredict 3967073762620  #     5.7%\n-- pipeline restart  20155282937    #     0.0%\nsmt-contention       13384540352281 # 19.2% ( 0.0%)\ncpu-cycles           
34862170519282 # 3.68 GHz\ninstructions         16966704924565 # 0.49 IPC\ninstructions         5654335743036  # 98.697 l2 access per 1000 inst\nl2 hit from l1       440791291703   # 25.36% l2 miss\nl2 miss from l1      93098250807    #\nl2 hit from l2 pf    68873331326    #\nl3 hit from l2 pf    20579962754    #\nl3 miss from l2 pf   27819968309    #\ninstructions         5652446118217  # 296.440 float per 1000 inst\nfloat 512            91             # 0.000 AVX-512 per 1000 inst\nfloat 256            2921024696     # 0.517 AVX-256 per 1000 inst\nfloat 128            1672690446523  # 295.923 AVX-128 per 1000 inst\nfloat MMX            0              # 0.000 MMX per 1000 inst\nfloat scalar         0              # 0.000 scalar per 1000 inst<\/code><\/pre>\n\n\n\n<p>Intel metrics including a large percentage of time with branch misses.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              1463.161\non_cpu               0.904          # 14.46 \/ 16 cores\nutime                21137.635\nstime                18.313\nnvcsw                368397         # 55.12%\nnivcsw               299967         # 44.88%\ninblock              0              # 0.00\/sec\nonblock              1912           # 1.31\/sec\ncpu-clock            21156464346568 # 21156.464 seconds\ntask-clock           21156834367225 # 21156.834 seconds\npage faults          3295364        # 155.759\/sec\ncontext switches     675478         # 31.927\/sec\ncpu migrations       70820          # 3.347\/sec\nmajor page faults    0              # 0.000\/sec\nminor page faults    3295364        # 155.759\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             3008212425185  # 71.821 branches per 1000 inst\nbranch misses        441975034062   # 14.69% branch miss\nconditional          3008212445665  # 71.821 conditional branches per 1000 inst\nindirect             610431103988   # 14.574 indirect branches per 1000 inst\nslots     
           52138940679776 #\nretiring             13804850785894 # 26.5% (26.5%)\n-- ucode             1354930255474  #     2.6%\n-- fastpath          12449920530420 #    23.9%\nfrontend             8695611248482  # 16.7% (16.7%)\n-- latency           5828517798001  #    11.2%\n-- bandwidth         2867093450481  #     5.5%\nbackend              17257445503008 # 33.1% (33.1%)\n-- cpu               5308618896422  #    10.2%\n-- memory            11948826606586 #    22.9%\nspeculation          11980383511891 # 23.0% (23.0%)\n-- branch mispredict 11929015562360 #    22.9%\n-- pipeline restart  51367949531    #     0.1%\nsmt-contention       0              #  0.0% ( 0.0%)\ncpu-cycles           33660387719224 # 2.61 GHz\ninstructions         26349322565350 # 0.78 IPC\nl2 access            824384239858   # 61.635 l2 access per 1000 inst\nl2 miss              320741599938   # 38.91% l2 miss<\/code><\/pre>\n\n\n\n<p>Process time is spent in embree_pathtrac program<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>504 processes\n\t135 embree_pathtrac      123173.29   153.69\n\t 64 clinfo                  11.20     2.88\n\t 38 vulkaninfo               0.75     1.32\n\t  6 glxinfo:gdrv0            0.17     0.03\n\t  6 php                      0.08     0.18\n\t  4 vulkani:disk$0           0.08     0.14\n\t  2 glxinfo                  0.07     0.01\n\t  2 glxinfo:cs0              0.07     0.01\n\t  2 glxinfo:disk$0           0.07     0.01\n\t  2 glxinfo:sh0              0.07     0.01\n\t  2 glxinfo:shlo0            0.07     0.01\n\t  2 llvmpipe-0               0.04     0.07\n\t  2 llvmpipe-1               0.04     0.07\n\t  2 llvmpipe-10              0.04     0.07\n\t  2 llvmpipe-11              0.04     0.07\n\t  2 llvmpipe-12              0.04     0.07\n\t  2 llvmpipe-13              0.04     0.07\n\t  2 llvmpipe-14              0.04     0.07\n\t  2 llvmpipe-15              0.04     0.07\n\t  2 llvmpipe-2               0.04     0.07\n\t  2 llvmpipe-3               0.04     
0.07\n\t  2 llvmpipe-4               0.04     0.07\n\t  2 llvmpipe-5               0.04     0.07\n\t  2 llvmpipe-6               0.04     0.07\n\t  2 llvmpipe-7               0.04     0.07\n\t  2 llvmpipe-8               0.04     0.07\n\t  2 llvmpipe-9               0.04     0.07\n\t  6 clang                    0.03     0.03\n\t  1 lspci                    0.00     0.03\n\t 92 sh                       0.00     0.00\n\t 12 gcc                      0.00     0.00\n\t 10 gsettings                0.00     0.00\n\t  9 embree                   0.00     0.00\n\t  9 stty                     0.00     0.00\n\t  8 stat                     0.00     0.00\n\t  8 systemd-detect-          0.00     0.00\n\t  6 llvm-link                0.00     0.00\n\t  5 phoronix-test-s          0.00     0.00\n\t  4 gmain                    0.00     0.00\n\t  2 dconf worker             0.00     0.00\n\t  2 lscpu                    0.00     0.00\n\t  2 uname                    0.00     0.00\n\t  2 which                    0.00     0.00\n\t  2 xset                     0.00     0.00\n\t  1 cc                       0.00     0.00\n\t  1 date                     0.00     0.00\n\t  1 dirname                  0.00     0.00\n\t  1 dmesg                    0.00     0.00\n\t  1 dmidecode                0.00     0.00\n\t  1 grep                     0.00     0.00\n\t  1 ifconfig                 0.00     0.00\n\t  1 ip                       0.00     0.00\n\t  1 lsmod                    0.00     0.00\n\t  1 mktemp                   0.00     0.00\n\t  1 ps                       0.00     0.00\n\t  1 qdbus                    0.00     0.00\n\t  1 readlink                 0.00     0.00\n\t  1 realpath                 0.00     0.00\n\t  1 sed                      0.00     0.00\n\t  1 sort                     0.00     0.00\n\t  1 systemctl                0.00     0.00\n\t  1 template.sh              0.00     0.00\n\t  1 wc                       0.00     0.00\n\t  1 xrandr                   0.00     0.00\n9 processes 
running\n56 maximum processes\n<\/code><\/pre>\n\n\n\n<p>The following is an example of the recurring process pattern. Occasionally an &#8220;exit&#8221; event is missed, which is why 9 processes were still listed as running above.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>      6117) embree start=137.17 finish=198.93\n        6118) embree_pathtrac start=137.17 finish=198.92\n          6119) embree_pathtrac start=137.37 finish=198.92\n            6121) embree_pathtrac start=137.37 finish=198.92\n              6125) embree_pathtrac start=137.37 finish=198.92\n              6129) embree_pathtrac start=137.37 finish=198.92\n            6123) embree_pathtrac start=137.37 finish=198.92\n              6130) embree_pathtrac start=137.37 finish=198.92\n              6131) embree_pathtrac start=137.37 finish=198.92\n          6120) embree_pathtrac start=137.37 finish=198.92\n            6122) embree_pathtrac start=137.37 finish=198.92\n              6124) embree_pathtrac start=137.37 finish=198.92\n                6128) embree_pathtrac start=137.37 finish=198.92\n                6132) embree_pathtrac start=137.37 finish=198.92\n              6126) embree_pathtrac start=137.37 finish=198.92\n                6133) embree_pathtrac start=137.37 finish=198.92\n            6127) embree_pathtrac start=137.37 finish=198.92<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Embree is a set of ray tracing kernels. In the test below, I run three workloads showing slight differences between the first and the other workloads. 
The workloads are listed as AVX-512 capable, but I don&#8217;t see those instructions in my <span class=\"excerpt-dots\">&hellip;<\/span> <a class=\"more-link\" href=\"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/embree\/\"><span class=\"more-msg\">Continue reading &rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":58,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-288","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/288","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/comments?post=288"}],"version-history":[{"count":5,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/288\/revisions"}],"predecessor-version":[{"id":331,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/288\/revisions\/331"}],"up":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/58"}],"wp:attachment":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/media?parent=288"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}