{"id":2188,"date":"2024-03-24T20:20:45","date_gmt":"2024-03-24T20:20:45","guid":{"rendered":"https:\/\/mvermeulen.org\/perf\/?page_id=2188"},"modified":"2024-03-25T01:22:49","modified_gmt":"2024-03-25T01:22:49","slug":"jpegxl-decode","status":"publish","type":"page","link":"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/jpegxl-decode\/","title":{"rendered":"jpegxl-decode"},"content":{"rendered":"\n<p>A JPEG XL image processing library, exercised with two quick-running workloads. The first runs on one thread, the second on &#8220;all&#8221; threads.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/systemtime-44.png\" alt=\"\" class=\"wp-image-2214\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/systemtime-44.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/systemtime-44-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/systemtime-44-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>The topdown profile shows the single-threaded workload with a higher retirement rate and some backend stalls. 
The parallel one is flipped, with more backend stalls and a lower retirement rate.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/amdtopdown-46.png\" alt=\"\" class=\"wp-image-2216\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/amdtopdown-46.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/amdtopdown-46-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/amdtopdown-46-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>AMD metrics show little floating point and some L2 accesses, including misses.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              214.087\non_cpu               0.230          # 3.69 \/ 16 cores\nutime                572.235\nstime                217.040\nnvcsw                46264          # 80.71%\nnivcsw               11056          # 19.29%\ninblock              88             # 0.41\/sec\nonblock              293008         # 1368.64\/sec\ncpu-clock            789175841055   # 789.176 seconds\ntask-clock           789261200257   # 789.261 seconds\npage faults          78020185       # 98852.173\/sec\ncontext switches     58208          # 73.750\/sec\ncpu migrations       7653           # 9.696\/sec\nmajor page faults    4              # 0.005\/sec\nminor page faults    78020181       # 98852.168\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             322457994967   # 72.656 branches per 1000 inst\nbranch misses        15346807047    # 4.76% branch miss\nconditional          231733673746   # 52.214 conditional branches per 1000 inst\nindirect             6110886568     # 1.377 indirect branches per 1000 inst\ncpu-cycles           3143196687512  # 0.92 GHz\ninstructions         4405132309118  # 1.40 IPC\nslots    
            6291991187694  #\nretiring             1536387710543  # 24.4% (29.6%)\n-- ucode             3228779195     #     0.1%\n-- fastpath          1533158931348  #    24.4%\nfrontend             929644428644   # 14.8% (17.9%)\n-- latency           707692046958   #    11.2%\n-- bandwidth         221952381686   #     3.5%\nbackend              2565494368112  # 40.8% (49.5%)\n-- cpu               956203684287   #    15.2%\n-- memory            1609290683825  #    25.6%\nspeculation          150701191697   #  2.4% ( 2.9%)\n-- branch mispredict 145671010991   #     2.3%\n-- pipeline restart  5030180706     #     0.1%\nsmt-contention       1109755007091  # 17.6% ( 0.0%)\ncpu-cycles           3147769160356  # 0.91 GHz\ninstructions         4396937620175  # 1.40 IPC\ninstructions         1480153486902  # 45.732 l2 access per 1000 inst\nl2 hit from l1       54965304511    # 11.34% l2 miss\nl2 miss from l1      3863029248     #\nl2 hit from l2 pf    8912497384     #\nl3 hit from l2 pf    2337614534     #\nl3 miss from l2 pf   1474996298     #\ninstructions         1468633262644  # 47.498 float per 1000 inst\nfloat 512            57             # 0.000 AVX-512 per 1000 inst\nfloat 256            602            # 0.000 AVX-256 per 1000 inst\nfloat 128            69757460749    # 47.498 AVX-128 per 1000 inst\nfloat MMX            0              # 0.000 MMX per 1000 inst\nfloat scalar         0              # 0.000 scalar per 1000 inst\ninstructions         4410783289171  #\nopcache              639131229094   # 144.902 opcache per 1000 inst\nopcache miss         75966023342    # 11.9% opcache miss rate\nl1 dTLB miss         15719138380    # 3.564 L1 dTLB per 1000 inst\nl2 dTLB miss         696242527      # 0.158 L2 dTLB per 1000 inst\ninstructions         4411366720558  #\nicache               147700356249   # 33.482 icache per 1000 inst\nicache miss          9449915162     #  6.4% icache miss rate\nl1 iTLB miss         30947776       # 0.007 L1 iTLB per 1000 inst\nl2 iTLB 
miss         0              # 0.000 L2 iTLB per 1000 inst\ntlb flush            23457          # 0.000 TLB flush per 1000 inst\n<\/code><\/pre>\n\n\n\n<p>Intel metrics show that backend stalls occur more at the L1\/L2 levels than all the way out to DRAM.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              229.959\non_cpu               0.221          # 3.54 \/ 16 cores\nutime                655.417\nstime                157.900\nnvcsw                47596          # 81.51%\nnivcsw               10796          # 18.49%\ninblock              248224         # 1079.43\/sec\nonblock              281728         # 1225.12\/sec\ncpu-clock            812664634134   # 812.665 seconds\ntask-clock           812731141309   # 812.731 seconds\npage faults          64066346       # 78828.462\/sec\ncontext switches     59368          # 73.048\/sec\ncpu migrations       11485          # 14.131\/sec\nmajor page faults    1276           # 1.570\/sec\nminor page faults    64065070       # 78826.892\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             289467025331   # 66.628 branches per 1000 inst\nbranch misses        4791047711     # 1.66% branch miss\nconditional          289467037075   # 66.628 conditional branches per 1000 inst\nindirect             46740820621    # 10.759 indirect branches per 1000 inst\nslots                7250543859452  #\nretiring             3182672569517  # 43.9% (43.9%)\n-- ucode             284462783358   #     3.9%\n-- fastpath          2898209786159  #    40.0%\nfrontend             798789567358   # 11.0% (11.0%)\n-- latency           466544490290   #     6.4%\n-- bandwidth         332245077068   #     4.6%\nbackend              2613200296738  # 36.0% (36.0%)\n-- cpu               1445575001374  #    19.9%\n-- memory            1167625295364  #    16.1%\nspeculation          713440337672   #  9.8% ( 9.8%)\n-- branch mispredict 642636577796   #     8.9%\n-- pipeline restart  
70803759876    #     1.0%\nsmt-contention       0              #  0.0% ( 0.0%)\ncpu-cycles           3015898317055  # 0.82 GHz\ninstructions         5329185911432  # 1.77 IPC\nl2 access            103439376307   # 33.427 l2 access per 1000 inst\nl2 miss              23259806947    # 22.49% l2 miss\ncpu-cycles           1737674083545  # 27.5% memory latency\nload stalls          410906786285   # 11.1% l1 bound\nl1 miss              217693607649   #  8.2% l2 bound\nl2 miss              74926260792    #  2.1% l3 bound\nl3 miss              38350978722    #  2.2% dram bound\nstore_stalls         66085442220    #  3.8% store bound\n<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>A JPEG XL image processing library, exercised with two quick-running workloads. The first runs on one thread, the second on &#8220;all&#8221; threads. The topdown profile shows the single-threaded workload with a higher retirement rate and some backend stalls. The parallel <span class=\"excerpt-dots\">&hellip;<\/span> <a class=\"more-link\" href=\"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/jpegxl-decode\/\"><span class=\"more-msg\">Continue reading 
&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":58,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-2188","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/2188","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/comments?post=2188"}],"version-history":[{"count":2,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/2188\/revisions"}],"predecessor-version":[{"id":2217,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/2188\/revisions\/2217"}],"up":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/58"}],"wp:attachment":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/media?parent=2188"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}