{"id":1753,"date":"2024-02-22T07:59:30","date_gmt":"2024-02-22T07:59:30","guid":{"rendered":"https:\/\/mvermeulen.org\/perf\/?page_id=1753"},"modified":"2024-02-24T01:33:53","modified_gmt":"2024-02-24T01:33:53","slug":"oidn","status":"publish","type":"page","link":"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/oidn\/","title":{"rendered":"oidn"},"content":{"rendered":"\n<p>Open Image Denoise is a denoising library for ray-traced images and part of the oneAPI rendering toolkit. Three tests run on the CPU. On AMD, the HIP and SYCL tests fail. The six failures may account for the single-threaded segment at the end.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/systemtime-55.png\" alt=\"\" class=\"wp-image-1757\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/systemtime-55.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/systemtime-55-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/systemtime-55-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>The topdown profile is dominated by backend stalls.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/amdtopdown-57.png\" alt=\"\" class=\"wp-image-1758\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/amdtopdown-57.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/amdtopdown-57-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/amdtopdown-57-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>AMD metrics show little 
floating point.  Backend stalls are cpu-bound not memory bound.  Frontend stalls are very low.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              1143.453\non_cpu               0.852          # 13.64 \/ 16 cores\nutime                15569.790\nstime                24.394\nnvcsw                137174         # 47.97%\nnivcsw               148800         # 52.03%\ninblock              8              # 0.01\/sec\nonblock              13656          # 11.94\/sec\ncpu-clock            15595849921789 # 15595.850 seconds\ntask-clock           15596082503631 # 15596.083 seconds\npage faults          7315810        # 469.080\/sec\ncontext switches     291471         # 18.689\/sec\ncpu migrations       931            # 0.060\/sec\nmajor page faults    57             # 0.004\/sec\nminor page faults    7315753        # 469.076\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             202447109427   # 4.901 branches per 1000 inst\nbranch misses        2965059032     # 1.46% branch miss\nconditional          185803354842   # 4.498 conditional branches per 1000 inst\nindirect             786178808      # 0.019 indirect branches per 1000 inst\ncpu-cycles           62242756524025 # 3.39 GHz\ninstructions         41309080431827 # 0.66 IPC low\nslots                124479064490070 #\nretiring             14028040556278 # 11.3% (14.2%)\n-- ucode             4540852651     #     0.0%\n-- fastpath          14023499703627 #    11.3%\nfrontend             903299780714   #  0.7% ( 0.9%) low\n-- latency           805847673036   #     0.6%\n-- bandwidth         97452107678    #     0.1%\nbackend              83506378748576 # 67.1% (84.8%) high\n-- cpu               72430204392186 #    58.2%\n-- memory            11076174356390 #     8.9%\nspeculation          23559215421    #  0.0% ( 0.0%) low\n-- branch mispredict 19678376604    #     0.0%\n-- pipeline restart  3880838817     #     0.0%\nsmt-contention    
   26017722604434 # 20.9% ( 0.0%)\ncpu-cycles           62237080562275 # 3.38 GHz\ninstructions         41309704710978 # 0.66 IPC low\ninstructions         13768582732070 # 125.298 l2 access per 1000 inst\nl2 hit from l1       1477657932786  # 4.28% l2 miss\nl2 miss from l1      13861287643    #\nl2 hit from l2 pf    187554180525   #\nl3 hit from l2 pf    20200972860    #\nl3 miss from l2 pf   39763266084    #\ninstructions         13770331503339 # 6.083 float per 1000 inst\nfloat 512            108            # 0.000 AVX-512 per 1000 inst\nfloat 256            1120053095     # 0.081 AVX-256 per 1000 inst\nfloat 128            82649393722    # 6.002 AVX-128 per 1000 inst\nfloat MMX            0              # 0.000 MMX per 1000 inst\nfloat scalar         0              # 0.000 scalar per 1000 inst\ninstructions         2655157        #\nopcache              985914         # 371.320 opcache per 1000 inst\nopcache miss         525657         # 53.3% opcache miss rate\nl1 dTLB miss         5852           # 2.204 L1 dTLB per 1000 inst\nl2 dTLB miss         1012           # 0.381 L2 dTLB per 1000 inst\ninstructions         2809369        #\nicache               1346399        # 479.253 icache per 1000 inst\nicache miss          118242         #  8.8% icache miss rate\nl1 iTLB miss         13             # 0.005 L1 iTLB per 1000 inst\nl2 iTLB miss         0              # 0.000 L2 iTLB per 1000 inst\ntlb flush            19             # 0.007 TLB flush per 1000 inst\n<\/code><\/pre>\n\n\n\n<p>CPU stalls of 58% are almost as high as minibude (64%) and much above the mean with both showing up as outliers on the distribution.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"400\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/21.backend-cpu.png\" alt=\"\" class=\"wp-image-1761\"\/><\/figure>\n\n\n\n<p>Intel metrics show most memory is L1 with only 2.4% 
dram.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              1784.190\non_cpu               0.919          # 14.71 \/ 16 cores\nutime                26221.932\nstime                21.883\nnvcsw                228045         # 48.75%\nnivcsw               239714         # 51.25%\ninblock              18752          # 10.51\/sec\nonblock              1800           # 1.01\/sec\ncpu-clock            26244075480642 # 26244.075 seconds\ntask-clock           26244333593954 # 26244.334 seconds\npage faults          8712850        # 331.990\/sec\ncontext switches     476473         # 18.155\/sec\ncpu migrations       30947          # 1.179\/sec\nmajor page faults    161            # 0.006\/sec\nminor page faults    8712689        # 331.984\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             1321680383971  # 12.574 branches per 1000 inst\nbranch misses        2731388696     # 0.21% branch miss\nconditional          1321680401059  # 12.574 conditional branches per 1000 inst\nindirect             358569107192   # 3.411 indirect branches per 1000 inst\nslots                98943282613106 #\nretiring             49432693869376 # 50.0% (50.0%)\n-- ucode             563001550256   #     0.6%\n-- fastpath          48869692319120 #    49.4%\nfrontend             22964816174943 # 23.2% (23.2%)\n-- latency           22183147237452 #    22.4%\n-- bandwidth         781668937491   #     0.8%\nbackend              25631568005899 # 25.9% (25.9%)\n-- cpu               13970973300638 #    14.1%\n-- memory            11660594705261 #    11.8%\nspeculation          398396612737   #  0.4% ( 0.4%) low\n-- branch mispredict 281168117116   #     0.3%\n-- pipeline restart  117228495621   #     0.1%\nsmt-contention       0              #  0.0% ( 0.0%)\ncpu-cycles           64601012956599 # 2.60 GHz\ninstructions         103400515861605 # 1.60 IPC\nl2 access            803899147867   # 15.693 l2 access per 1000 
inst\nl2 miss              186431498354   # 23.19% l2 miss\ncpu-cycles           32018963882436 # 26.2% memory latency\nload stalls          8267191766134  # 20.3% l1 bound\nl1 miss              1767537943547  #  1.8% l2 bound\nl2 miss              1198804839203  #  1.4% l3 bound\nl3 miss              759185998010   #  2.4% dram bound\nstore_stalls         128410516875   #  0.4% store bound\n<\/code><\/pre>\n\n\n\n<p>Process summary shows time spent in the benchmark application.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>592 processes\n\t200 oidnBenchmark        236248.38   341.85\n\t 68 clinfo                  19.50     6.32\n\t 38 vulkaninfo               1.34     1.35\n\t  4 vulkani:disk$0           0.15     0.15\n\t  6 glxinfo:gdrv0            0.15     0.07\n\t  6 glxinfo:gl0              0.15     0.07\n\t  6 php                      0.13     0.28\n\t  2 llvmpipe-0               0.08     0.07\n\t  2 llvmpipe-1               0.08     0.07\n\t  2 llvmpipe-10              0.08     0.07\n\t  2 llvmpipe-11              0.08     0.07\n\t  2 llvmpipe-12              0.08     0.07\n\t  2 llvmpipe-13              0.08     0.07\n\t  2 llvmpipe-14              0.08     0.07\n\t  2 llvmpipe-15              0.08     0.07\n\t  2 llvmpipe-2               0.08     0.07\n\t  2 llvmpipe-3               0.08     0.07\n\t  2 llvmpipe-4               0.08     0.07\n\t  2 llvmpipe-5               0.08     0.07\n\t  2 llvmpipe-6               0.08     0.07\n\t  2 llvmpipe-7               0.08     0.07\n\t  2 llvmpipe-8               0.08     0.07\n\t  2 llvmpipe-9               0.08     0.07\n\t  2 glxinfo                  0.07     0.03\n\t  2 glxinfo:cs0              0.07     0.03\n\t  2 glxinfo:disk$0           0.07     0.03\n\t  2 glxinfo:sh0              0.07     0.03\n\t  2 glxinfo:shlo0            0.07     0.03\n\t  6 clang                    0.06     0.05\n\t  3 rocminfo                 0.03     0.03\n\t  1 lspci                    0.01     0.02\n\t 85 sh                   
    0.00     0.00\n\t 27 oidn                     0.00     0.00\n\t 12 gcc                      0.00     0.00\n\t 10 gsettings                0.00     0.00\n\t  8 stat                     0.00     0.00\n\t  8 systemd-detect-          0.00     0.00\n\t  6 llvm-link                0.00     0.00\n\t  5 phoronix-test-s          0.00     0.00\n\t  4 gmain                    0.00     0.00\n\t  2 dconf worker             0.00     0.00\n\t  2 lscpu                    0.00     0.00\n\t  2 uname                    0.00     0.00\n\t  2 which                    0.00     0.00\n\t  2 xset                     0.00     0.00\n\t  1 cc                       0.00     0.00\n\t  1 date                     0.00     0.00\n\t  1 dirname                  0.00     0.00\n\t  1 dmesg                    0.00     0.00\n\t  1 dmidecode                0.00     0.00\n\t  1 grep                     0.00     0.00\n\t  1 ifconfig                 0.00     0.00\n\t  1 ip                       0.00     0.00\n\t  1 lsmod                    0.00     0.00\n\t  1 mktemp                   0.00     0.00\n\t  1 ps                       0.00     0.00\n\t  1 qdbus                    0.00     0.00\n\t  1 readlink                 0.00     0.00\n\t  1 realpath                 0.00     0.00\n\t  1 sed                      0.00     0.00\n\t  1 sort                     0.00     0.00\n\t  1 stty                     0.00     0.00\n\t  1 systemctl                0.00     0.00\n\t  1 template.sh              0.00     0.00\n\t  1 wc                       0.00     0.00\n\t  1 xrandr                   0.00     0.00\n16 processes running\n63 maximum processes\n<\/code><\/pre>\n\n\n\n<p>Computation blocks show a similar pattern<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>      8061) oidn             cpu=1 start=89.65 finish=169.60\n        8062) oidnBenchmark    cpu=10 start=89.65 finish=169.58\n          8065) oidnBenchmark    cpu=12 start=89.68 finish=169.58\n          8066) oidnBenchmark    cpu=5 start=89.68 
finish=89.68\n          8067) oidnBenchmark    cpu=15 start=90.14 finish=169.58\n            8068) oidnBenchmark    cpu=5 start=90.14 finish=169.58\n              8070) oidnBenchmark    cpu=3 start=90.14 finish=169.58\n                8075) ?? cpu=0 start=90.14 finish=0.00 \n                  8078) ?? cpu=0 start=90.14 finish=0.00 \n                8077) oidnBenchmark    cpu=14 start=90.14 finish=169.58\n              8074) oidnBenchmark    cpu=11 start=90.14 finish=169.58\n            8072) oidnBenchmark    cpu=1 start=90.14 finish=169.58\n              8076) oidnBenchmark    cpu=13 start=90.14 finish=169.58\n              8079) oidnBenchmark    cpu=8 start=90.14 finish=169.58\n          8069) oidnBenchmark    cpu=10 start=90.14 finish=169.58\n            8071) oidnBenchmark    cpu=9 start=90.14 finish=169.58\n              8080) oidnBenchmark    cpu=6 start=90.14 finish=169.58\n              8081) oidnBenchmark    cpu=7 start=90.14 finish=169.58\n            8073) oidnBenchmark    cpu=0 start=90.14 finish=169.58\n<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Open Image Denoise library for ray-tracing and part of the oneAPI rendering toolkit. There are three tests that run on the CPU. On AMD the hip and SYCL tests fail. 
Looks like the six failures may be that single-threaded segment <span class=\"excerpt-dots\">&hellip;<\/span> <a class=\"more-link\" href=\"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/oidn\/\"><span class=\"more-msg\">Continue reading &rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":58,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-1753","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1753","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/comments?post=1753"}],"version-history":[{"count":5,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1753\/revisions"}],"predecessor-version":[{"id":1803,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1753\/revisions\/1803"}],"up":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/58"}],"wp:attachment":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/media?parent=1753"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}