{"id":1720,"date":"2024-02-11T09:24:48","date_gmt":"2024-02-11T09:24:48","guid":{"rendered":"https:\/\/mvermeulen.org\/perf\/?page_id=1720"},"modified":"2024-02-11T14:57:51","modified_gmt":"2024-02-11T14:57:51","slug":"clomp","status":"publish","type":"page","link":"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/clomp\/","title":{"rendered":"clomp"},"content":{"rendered":"\n<p>Livermore OpenMP benchmark with a single workload. It looks to be mostly single-threaded, with short multi-threaded sections.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/systemtime-52.png\" alt=\"\" class=\"wp-image-1723\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/systemtime-52.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/systemtime-52-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/systemtime-52-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>The topdown profile looks backend bound, with the short parallel sections less so.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/amdtopdown-54.png\" alt=\"\" class=\"wp-image-1725\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/amdtopdown-54.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/amdtopdown-54-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/amdtopdown-54-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>AMD metrics show an average of only 3.5 of 16 cores in use. This is floating-point code with few branch misses. 
Frontend stalls are low and backend stalls are high.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              546.949\non_cpu               0.218          # 3.49 \/ 16 cores\nutime                1904.013\nstime                2.173\nnvcsw                16236          # 51.16%\nnivcsw               15499          # 48.84%\ninblock              0              # 0.00\/sec\nonblock              2056           # 3.76\/sec\ncpu-clock            1906902816556  # 1906.903 seconds\ntask-clock           1906923992083  # 1906.924 seconds\npage faults          138839         # 72.808\/sec\ncontext switches     34302          # 17.988\/sec\ncpu migrations       339            # 0.178\/sec\nmajor page faults    0              # 0.000\/sec\nminor page faults    138839         # 72.808\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             773040432866   # 112.899 branches per 1000 inst\nbranch misses        1708677115     # 0.22% branch miss\nconditional          770588283706   # 112.540 conditional branches per 1000 inst\nindirect             374106320      # 0.055 indirect branches per 1000 inst\ncpu-cycles           1692125537767  # 0.94 GHz\ninstructions         1370561071471  # 0.81 IPC\nslots                3386944277190  #\nretiring             409289835857   # 12.1% (18.2%)\n-- ucode             463432023      #     0.0%\n-- fastpath          408826403834   #    12.1%\nfrontend             52934629700    #  1.6% ( 2.4%) low\n-- latency           26013781506    #     0.8%\n-- bandwidth         26920848194    #     0.8%\nbackend              1774101880707  # 52.4% (79.0%) high\n-- cpu               635790079324   #    18.8%\n-- memory            1138311801383  #    33.6%\nspeculation          8381426298     #  0.2% ( 0.4%) low\n-- branch mispredict 7898633078     #     0.2%\n-- pipeline restart  482793220      #     0.0%\nsmt-contention       1142234822831  # 33.7% ( 0.0%)\ncpu-cycles       
    1697637019615  # 0.95 GHz\ninstructions         1372189486239  # 0.81 IPC\ninstructions         456767926414   # 142.516 l2 access per 1000 inst\nl2 hit from l1       26648010394    # 44.94% l2 miss\nl2 miss from l1      1513928287     #\nl2 hit from l2 pf    10710793952    #\nl3 hit from l2 pf    25090146143    #\nl3 miss from l2 pf   2647865317     #\ninstructions         457188883424   # 329.247 float per 1000 inst\nfloat 512            72             # 0.000 AVX-512 per 1000 inst\nfloat 256            344            # 0.000 AVX-256 per 1000 inst\nfloat 128            150528050110   # 329.247 AVX-128 per 1000 inst\nfloat MMX            0              # 0.000 MMX per 1000 inst\nfloat scalar         0              # 0.000 scalar per 1000 inst\ninstructions         2390641        #\nopcache              897366         # 375.366 opcache per 1000 inst\nopcache miss         478333         # 53.3% opcache miss rate\nl1 dTLB miss         5470           # 2.288 L1 dTLB per 1000 inst\nl2 dTLB miss         1094           # 0.458 L2 dTLB per 1000 inst\ninstructions         2418972        #\nicache               1193224        # 493.277 icache per 1000 inst\nicache miss          111159         #  9.3% icache miss rate\nl1 iTLB miss         7              # 0.003 L1 iTLB per 1000 inst\nl2 iTLB miss         0              # 0.000 L2 iTLB per 1000 inst\ntlb flush            19             # 0.008 TLB flush per 1000 inst\n<\/code><\/pre>\n\n\n\n<p>The Intel metrics run is much quicker (about 115 s elapsed versus 547 s); it looks like the AMD run needed additional iterations to bring the result within tolerance.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              115.355\non_cpu               0.318          # 5.08 \/ 16 cores\nutime                585.447\nstime                0.776\nnvcsw                4673           # 48.25%\nnivcsw               5012           # 51.75%\ninblock              616            # 5.34\/sec\nonblock              1416           # 12.28\/sec\ncpu-clock            586340601288   # 586.341 
seconds\ntask-clock           586347075893   # 586.347 seconds\npage faults          152353         # 259.834\/sec\ncontext switches     10093          # 17.213\/sec\ncpu migrations       421            # 0.718\/sec\nmajor page faults    2              # 0.003\/sec\nminor page faults    152351         # 259.831\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             157824839681   # 114.070 branches per 1000 inst\nbranch misses        328274560      # 0.21% branch miss\nconditional          157824853057   # 114.070 conditional branches per 1000 inst\nindirect             47055770829    # 34.010 indirect branches per 1000 inst\nslots                10662556995632 #\nretiring             2005207474450  # 18.8% (18.8%)\n-- ucode             38968677224    #     0.4%\n-- fastpath          1966238797226  #    18.4%\nfrontend             1142542081923  # 10.7% (10.7%)\n-- latency           1027167896973  #     9.6%\n-- bandwidth         115374184950   #     1.1%\nbackend              7477821619753  # 70.1% (70.1%) high\n-- cpu               4816740097479  #    45.2%\n-- memory            2661081522274  #    25.0%\nspeculation          51623307331    #  0.5% ( 0.5%) low\n-- branch mispredict 48674846500    #     0.5%\n-- pipeline restart  2948460831     #     0.0%\nsmt-contention       0              #  0.0% ( 0.0%)\ncpu-cycles           1389467375584  # 0.74 GHz\ninstructions         1100892371408  # 0.79 IPC\nl2 access            104540813869   # 108.996 l2 access per 1000 inst\nl2 miss              55139555141    # 52.74% l2 miss\ncpu-cycles           1612435730781  # 32.1% memory latency\nload stalls          517933116525   # 17.9% l1 bound\nl1 miss              229865098888   #  4.7% l2 bound\nl2 miss              154331027827   #  5.8% l3 bound\nl3 miss              60457495909    #  3.7% dram bound\nstore_stalls         315269774      #  0.0% store bound\n<\/code><\/pre>\n\n\n\n<p>Process 
overview shows most of the time in clomp_build.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>293 processes\n\t 48 clomp_build           6106.56     6.88\n\t 38 vulkaninfo               1.31     0.95\n\t  6 glxinfo:gdrv0            0.14     0.10\n\t  4 vulkani:disk$0           0.13     0.10\n\t  6 php                      0.07     0.05\n\t  2 llvmpipe-0               0.07     0.05\n\t  2 llvmpipe-1               0.07     0.05\n\t  2 llvmpipe-10              0.07     0.05\n\t  2 llvmpipe-11              0.07     0.05\n\t  2 llvmpipe-12              0.07     0.05\n\t  2 llvmpipe-13              0.07     0.05\n\t  2 llvmpipe-14              0.07     0.05\n\t  2 llvmpipe-15              0.07     0.05\n\t  2 llvmpipe-2               0.07     0.05\n\t  2 llvmpipe-3               0.07     0.05\n\t  2 llvmpipe-4               0.07     0.05\n\t  2 llvmpipe-5               0.07     0.05\n\t  2 llvmpipe-6               0.07     0.05\n\t  2 llvmpipe-7               0.07     0.05\n\t  2 llvmpipe-8               0.07     0.05\n\t  2 llvmpipe-9               0.07     0.05\n\t  2 glxinfo                  0.06     0.04\n\t  2 glxinfo:cs0              0.06     0.04\n\t  2 glxinfo:disk$0           0.06     0.04\n\t  2 glxinfo:sh0              0.06     0.04\n\t  2 glxinfo:shlo0            0.06     0.04\n\t  1 lspci                    0.01     0.01\n\t  1 ps                       0.00     0.01\n\t 66 sh                       0.00     0.00\n\t 13 gcc                      0.00     0.00\n\t  8 gsettings                0.00     0.00\n\t  8 stat                     0.00     0.00\n\t  8 systemd-detect-          0.00     0.00\n\t  5 gmain                    0.00     0.00\n\t  5 phoronix-test-s          0.00     0.00\n\t  3 clomp                    0.00     0.00\n\t  3 dconf worker             0.00     0.00\n\t  2 cc                       0.00     0.00\n\t  2 lscpu                    0.00     0.00\n\t  2 uname                    0.00     0.00\n\t  2 which                    0.00     0.00\n\t  
2 xset                     0.00     0.00\n\t  1 date                     0.00     0.00\n\t  1 dirname                  0.00     0.00\n\t  1 dmesg                    0.00     0.00\n\t  1 dmidecode                0.00     0.00\n\t  1 grep                     0.00     0.00\n\t  1 ifconfig                 0.00     0.00\n\t  1 ip                       0.00     0.00\n\t  1 lsmod                    0.00     0.00\n\t  1 mktemp                   0.00     0.00\n\t  1 readlink                 0.00     0.00\n\t  1 realpath                 0.00     0.00\n\t  1 sed                      0.00     0.00\n\t  1 sort                     0.00     0.00\n\t  1 stty                     0.00     0.00\n\t  1 systemctl                0.00     0.00\n\t  1 template.sh              0.00     0.00\n\t  1 wc                       0.00     0.00\n\t  1 xrandr                   0.00     0.00\n0 processes running\n47 maximum processes\n<\/code><\/pre>\n\n\n\n<p> Computation blocks<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>      83118) clomp            cpu=3 start=4.90  finish=36.81\n        83119) clomp_build      cpu=13 start=4.90  finish=36.81\n          83120) clomp_build      cpu=15 start=4.90  finish=36.81\n          83121) clomp_build      cpu=0 start=4.90  finish=36.81\n          83122) clomp_build      cpu=14 start=4.90  finish=36.81\n          83123) clomp_build      cpu=4 start=4.90  finish=36.81\n          83124) clomp_build      cpu=9 start=4.90  finish=36.81\n          83125) clomp_build      cpu=10 start=4.90  finish=36.81\n          83126) clomp_build      cpu=3 start=4.90  finish=36.81\n          83127) clomp_build      cpu=5 start=4.90  finish=36.81\n          83128) clomp_build      cpu=1 start=4.90  finish=36.81\n          83129) clomp_build      cpu=7 start=4.90  finish=36.81\n          83130) clomp_build      cpu=8 start=4.90  finish=36.81\n          83131) clomp_build      cpu=12 start=4.90  finish=36.81\n          83132) clomp_build      cpu=6 start=4.90  finish=36.81\n     
     83133) clomp_build      cpu=2 start=4.90  finish=36.81\n          83134) clomp_build      cpu=11 start=4.90  finish=36.81\n<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Livermore OpenMP test with one workload test. Looks to be mostly single-threaded with short sections of multi-threaded runs. Topdown profile looks backend bound with the short parallel sections less so. AMD metrics show an average of only 3.5 cores. This <span class=\"excerpt-dots\">&hellip;<\/span> <a class=\"more-link\" href=\"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/clomp\/\"><span class=\"more-msg\">Continue reading &rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":58,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-1720","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1720","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/comments?post=1720"}],"version-history":[{"count":3,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1720\/revisions"}],"predecessor-version":[{"id":1740,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1720\/revisions\/1740"}],"up":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/58"}],"wp:attachment":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/media?parent=1720"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}