{"id":2109,"date":"2024-03-19T18:43:23","date_gmt":"2024-03-19T18:43:23","guid":{"rendered":"https:\/\/mvermeulen.org\/perf\/?p=2109"},"modified":"2024-03-19T18:51:27","modified_gmt":"2024-03-19T18:51:27","slug":"graphics-magick-sharpen-compiler-improvements","status":"publish","type":"post","link":"https:\/\/mvermeulen.org\/perf\/2024\/03\/19\/graphics-magick-sharpen-compiler-improvements\/","title":{"rendered":"graphics-magick sharpen, compiler improvements"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">The Phoronix article <a href=\"https:\/\/www.phoronix.com\/review\/nvidia-gh200-compilers\">https:\/\/www.phoronix.com\/review\/nvidia-gh200-compilers<\/a> compares GCC 13.2 with Clang 17.0.2 on an ARM platform. In the attached discussion, the improvement for the graphics-magick sharpen benchmark particularly stands out.  So I thought I would check whether I see a similar improvement, and whether performance tools can spot likely areas contributing to the difference.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"237\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/graphicsmagick-sharpen.png\" alt=\"\" class=\"wp-image-2110\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">My system has the Ubuntu 22.04 system compiler, gcc 11.4, and also aocc 4.1, which is based on clang 16.0.3, so not exactly the same compilers but close enough.  
I forced a rebuild by reinstalling the test and setting environment variables, e.g.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>export CC=\/opt\/AMD\/aocc-compiler-4.1.0\/bin\/clang\nexport CXX=\/opt\/AMD\/aocc-compiler-4.1.0\/bin\/clang++\nexport CFLAGS=\"-O3 -march=native\"\nexport CXXFLAGS=\"-O3 -march=native\"<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">With this setup, I see the following with gcc 11.4:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>    Operation: Sharpen:\n        107\n        108\n        108\n\n    Average: 108 Iterations Per Minute\n    Deviation: 0.54%<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">and the following with clang 16.0:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>    Operation: Sharpen:\n        177\n        178\n        178\n\n    Average: 178 Iterations Per Minute\n    Deviation: 0.32%\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">So overall a 1.65x speedup (178 vs. 108 iterations per minute). Not quite the 2x speedup seen on the AArch64 system, but close enough given the different compilers.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here is what my topdown profile shows for gcc<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/amdtopdown-34.png\" alt=\"\" class=\"wp-image-2111\" style=\"width:1060px;height:auto\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/amdtopdown-34.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/amdtopdown-34-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/amdtopdown-34-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Here is the comparison point with clang<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" 
decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/amdtopdown-35.png\" alt=\"\" class=\"wp-image-2112\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/amdtopdown-35.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/amdtopdown-35-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/03\/amdtopdown-35-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Interestingly the total runtime is close to the same (time-bound test?) but we definitely have dropped backend stalls in favor of retiring a higher percentage of instructions.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here is what the metrics show for gcc<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              196.669\non_cpu               0.912          # 14.59 \/ 16 cores\nutime                2847.753\nstime                22.595\nnvcsw                7799           # 21.59%\nnivcsw               28324          # 78.41%\ninblock              72             # 0.37\/sec\nonblock              12832          # 65.25\/sec\ncpu-clock            2870386800612  # 2870.387 seconds\ntask-clock           2870418438024  # 2870.418 seconds\npage faults          8219671        # 2863.579\/sec\ncontext switches     36937          # 12.868\/sec\ncpu migrations       252            # 0.088\/sec\nmajor page faults    3              # 0.001\/sec\nminor page faults    8219668        # 2863.578\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             1140203600927  # 59.971 branches per 1000 inst\nbranch misses        9706629710     # 0.85% branch miss\nconditional          1123801038031  # 59.108 conditional branches per 1000 inst\nindirect             79758239       # 0.004 indirect branches per 1000 
inst\ncpu-cycles           12006859596451 # 3.83 GHz\ninstructions         18841741204039 # 1.57 IPC\nslots                24015874376394 #\nretiring             6873291214752  # 28.6% (44.9%)\n-- ucode             776914684      #     0.0%\n-- fastpath          6872514300068  #    28.6%\nfrontend             280560426894   #  1.2% ( 1.8%) low\n-- latency           204333739230   #     0.9%\n-- bandwidth         76226687664    #     0.3%\nbackend              8106573021445  # 33.8% (52.9%)\n-- cpu               7904629941195  #    32.9%\n-- memory            201943080250   #     0.8%\nspeculation          52507444606    #  0.2% ( 0.3%) low\n-- branch mispredict 52421287288    #     0.2%\n-- pipeline restart  86157318       #     0.0%\nsmt-contention       8702915928072  # 36.2% ( 0.0%)\ncpu-cycles           12008757786517 # 3.84 GHz\ninstructions         18832540244485 # 1.57 IPC\ninstructions         6279648919771  # 2.124 l2 access per 1000 inst\nl2 hit from l1       7173349879     # 20.29% l2 miss\nl2 miss from l1      704685663      #\nl2 hit from l2 pf    4162164156     #\nl3 hit from l2 pf    1757001598     #\nl3 miss from l2 pf   244954442      #\ninstructions         6277843164084  # 351.548 float per 1000 inst\nfloat 512            57             # 0.000 AVX-512 per 1000 inst\nfloat 256            584            # 0.000 AVX-256 per 1000 inst\nfloat 128            2206965520855  # 351.548 AVX-128 per 1000 inst\nfloat MMX            0              # 0.000 MMX per 1000 inst\nfloat scalar         0              # 0.000 scalar per 1000 inst\ninstructions         18950819136221 #\nopcache              2107351794817  # 111.201 opcache per 1000 inst\nopcache miss         9523189344     #  0.5% opcache miss rate\nl1 dTLB miss         902958198      # 0.048 L1 dTLB per 1000 inst\nl2 dTLB miss         68055690       # 0.004 L2 dTLB per 1000 inst\ninstructions         18892305597227 #\nicache               18578037535    # 0.983 icache per 1000 inst\nicache miss       
   1477678165     #  8.0% icache miss rate\nl1 iTLB miss         8626682        # 0.000 L1 iTLB per 1000 inst\nl2 iTLB miss         0              # 0.000 L2 iTLB per 1000 inst\ntlb flush            34816          # 0.000 TLB flush per 1000 inst\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Here is what they show for clang<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              198.605\non_cpu               0.910          # 14.55 \/ 16 cores\nutime                2846.489\nstime                43.933\nnvcsw                10817          # 26.18%\nnivcsw               30507          # 73.82%\ninblock              8              # 0.04\/sec\nonblock              12904          # 64.97\/sec\ncpu-clock            2890592540363  # 2890.593 seconds\ntask-clock           2890613273288  # 2890.613 seconds\npage faults          13446401       # 4651.747\/sec\ncontext switches     42134          # 14.576\/sec\ncpu migrations       320            # 0.111\/sec\nmajor page faults    51             # 0.018\/sec\nminor page faults    13446350       # 4651.729\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             1895985554702  # 85.338 branches per 1000 inst\nbranch misses        16208435546    # 0.85% branch miss\nconditional          1856679790302  # 83.569 conditional branches per 1000 inst\nindirect             162624101      # 0.007 indirect branches per 1000 inst\ncpu-cycles           11802414391046 # 3.76 GHz\ninstructions         22167963494891 # 1.88 IPC\nslots                23606223963606 #\nretiring             7476214292393  # 31.7% (52.1%)\n-- ucode             3459130581     #     0.0%\n-- fastpath          7472755161812  #    31.7%\nfrontend             577593637926   #  2.4% ( 4.0%) low\n-- latency           362394713874   #     1.5%\n-- bandwidth         215198924052   #     0.9%\nbackend              6205319253065  # 26.3% (43.3%)\n-- cpu               5685432163067  
#    24.1%\n-- memory            519887089998   #     2.2%\nspeculation          83292194787    #  0.4% ( 0.6%) low\n-- branch mispredict 83160520795    #     0.4%\n-- pipeline restart  131673992      #     0.0%\nsmt-contention       9263789209330  # 39.2% ( 0.0%)\ncpu-cycles           11818914678350 # 3.74 GHz\ninstructions         22211450935976 # 1.88 IPC\ninstructions         7404943446705  # 2.939 l2 access per 1000 inst\nl2 hit from l1       11586386135    # 19.79% l2 miss\nl2 miss from l1      1111590991     #\nl2 hit from l2 pf    6979543347     #\nl3 hit from l2 pf    2793906722     #\nl3 miss from l2 pf   399941912      #\ninstructions         7400104673984  # 491.708 float per 1000 inst\nfloat 512            72             # 0.000 AVX-512 per 1000 inst\nfloat 256            668            # 0.000 AVX-256 per 1000 inst\nfloat 128            3638689694804  # 491.708 AVX-128 per 1000 inst\nfloat MMX            0              # 0.000 MMX per 1000 inst\nfloat scalar         0              # 0.000 scalar per 1000 inst\ninstructions         22251978896991 #\nopcache              3428837218389  # 154.091 opcache per 1000 inst\nopcache miss         16257852042    #  0.5% opcache miss rate\nl1 dTLB miss         1527716103     # 0.069 L1 dTLB per 1000 inst\nl2 dTLB miss         108536720      # 0.005 L2 dTLB per 1000 inst\ninstructions         22248633347533 #\nicache               35471913129    # 1.594 icache per 1000 inst\nicache miss          1971706421     #  5.6% icache miss rate\nl1 iTLB miss         9490954        # 0.000 L1 iTLB per 1000 inst\nl2 iTLB miss         0              # 0.000 L2 iTLB per 1000 inst\ntlb flush            71325          # 0.000 TLB flush per 1000 inst\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Looking with a rough comparison I notice:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>User time is almost identical so likely some time-bounded loop<\/li>\n\n\n\n<li>There are more instructions overall, and particularly AVX-128 
has gone from 351 per thousand to 491 per thousand.  The number of branches has also gone up, from about 60 to 85 per thousand instructions.<\/li>\n\n\n\n<li>IPC has gone from 1.57 to 1.88.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Based on this, my best guess is that clang vectorizes more of the core loop and tightens it.  A smaller loop body indirectly results in more branches per instruction.  CPU stalls still contribute most of the backend stalls, but they have gone down (from 32.9% to 24.1% of slots) while the number of vector instructions has gone up.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">There may be other more direct ways to compare compiler options and results, but this is at least an indirect way to view the effects by looking at the overall performance characterization.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The Phoronix article https:\/\/www.phoronix.com\/review\/nvidia-gh200-compilers compares GCC 13.2 with Clang 17.0.2 on an ARM platform. In the attached discussion, the improvement for the graphics-magick sharpen benchmark particularly stands out. 
So I thought I would see if I could see a <span class=\"excerpt-dots\">&hellip;<\/span> <a class=\"more-link\" href=\"https:\/\/mvermeulen.org\/perf\/2024\/03\/19\/graphics-magick-sharpen-compiler-improvements\/\"><span class=\"more-msg\">Continue reading &rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[11],"tags":[23,32,22],"class_list":["post-2109","post","type-post","status-publish","format-standard","hentry","category-experiment","tag-benchmarks","tag-compiler","tag-phoronix"],"_links":{"self":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/posts\/2109","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/comments?post=2109"}],"version-history":[{"count":2,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/posts\/2109\/revisions"}],"predecessor-version":[{"id":2115,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/posts\/2109\/revisions\/2115"}],"wp:attachment":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/media?parent=2109"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/categories?post=2109"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/tags?post=2109"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}