A Mandelbrot fractal generator with four workloads, each using a different parallelism method: TBB, OpenMP, C++ tasks and C++ threads. The methods also differ in how many threads they create: C++ threads create the most, C++ tasks create moderately more than the number of cores, and OpenMP and TBB match the number of cores.

The topdown profile is surprisingly similar across the four methods.

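AMD metrics confirm this is floating-point code (about 364 floating-point operations per 1000 instructions, all 128-bit) with very little L2 access (0.039 accesses per 1000 instructions). The retirement rate is high.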

elapsed              534.380
on_cpu               0.863          # 13.81 / 16 cores
utime                7377.820
stime                2.399
nvcsw                3143           # 0.44%
nivcsw               716953         # 99.56%
inblock              0              # 0.00/sec
onblock              14408          # 26.96/sec
cpu-clock            7382597325204  # 7382.597 seconds
task-clock           7382631028362  # 7382.631 seconds
page faults          345620         # 46.815/sec
context switches     722570         # 97.874/sec
cpu migrations       2794           # 0.378/sec
major page faults    2              # 0.000/sec
minor page faults    345618         # 46.815/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             8083876110088  # 132.798 branches per 1000 inst
branch misses        80137273285    # 0.99% branch miss
conditional          7915372159838  # 130.030 conditional branches per 1000 inst
indirect             64437985819    # 1.059 indirect branches per 1000 inst
cpu-cycles           31071237393434 # 3.63 GHz
instructions         60879442564782 # 1.96 IPC
slots                62176734680100 #
retiring             22110032250570 # 35.6% (64.9%) high
-- ucode             10771596099    #     0.0%
-- fastpath          22099260654471 #    35.5%
frontend             1744368334608  #  2.8% ( 5.1%)
-- latency           894253211706   #     1.4%
-- bandwidth         850115122902   #     1.4%
backend              9010849789479  # 14.5% (26.5%)
-- cpu               8877823608547  #    14.3%
-- memory            133026180932   #     0.2%
speculation          1192503018886  #  1.9% ( 3.5%)
-- branch mispredict 1192479459812  #     1.9%
-- pipeline restart  23559074       #     0.0%
smt-contention       28118884517461 # 45.2% ( 0.0%)
cpu-cycles           31064294398784 # 3.62 GHz
instructions         60860623406151 # 1.96 IPC
instructions         20295654465016 # 0.039 l2 access per 1000 inst
l2 hit from l1       717271512      # 9.91% l2 miss
l2 miss from l1      39658005       #
l2 hit from l2 pf    31850550       #
l3 hit from l2 pf    24623451       #
l3 miss from l2 pf   13728835       #
instructions         20287877027354 # 363.504 float per 1000 inst
float 512            59             # 0.000 AVX-512 per 1000 inst
float 256            616            # 0.000 AVX-256 per 1000 inst
float 128            7374729455923  # 363.504 AVX-128 per 1000 inst
float MMX            0              # 0.000 MMX per 1000 inst
float scalar         1              # 0.000 scalar per 1000 inst
instructions         60874438174522 #
opcache              5446273322626  # 89.467 opcache per 1000 inst
opcache miss         3328964533     #  0.1% opcache miss rate
l1 dTLB miss         111328090      # 0.002 L1 dTLB per 1000 inst
l2 dTLB miss         23225462       # 0.000 L2 dTLB per 1000 inst
instructions         60875026269815 #
icache               7052704318     # 0.116 icache per 1000 inst
icache miss          652940617      #  9.3% icache miss rate
l1 iTLB miss         9231663        # 0.000 L1 iTLB per 1000 inst
l2 iTLB miss         0              # 0.000 L2 iTLB per 1000 inst
tlb flush            38525          # 0.000 TLB flush per 1000 inst
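
The two percentages on each AMD topdown line can be reproduced from the raw slot counts. The sketch below assumes, based only on the numbers matching, that the first percentage is the share of all slots and the parenthesized one is the share after smt-contention slots are excluded. The source does not state this, and the name `topdown_shares` is made up for illustration.

```python
# Slot counts copied from the AMD topdown output above.
SLOTS = 62176734680100
SMT_CONTENTION = 28118884517461
CATEGORIES = {
    "retiring": 22110032250570,
    "frontend": 1744368334608,
    "backend": 9010849789479,
    "speculation": 1192503018886,
}


def topdown_shares(count, slots=SLOTS, smt=SMT_CONTENTION):
    """Return (percent of all slots, percent of slots excluding SMT contention)."""
    raw = 100.0 * count / slots
    # Assumption: the parenthesized figure renormalizes over non-SMT slots.
    adjusted = 100.0 * count / (slots - smt)
    return round(raw, 1), round(adjusted, 1)


if __name__ == "__main__":
    for name, count in CATEGORIES.items():
        raw, adjusted = topdown_shares(count)
        print(f"{name:12s} {raw:5.1f}% ({adjusted:5.1f}%)")
```

Under that assumption the four parenthesized shares sum to 100%, and the AMD retiring share of 64.9% sits close to the 66.9% reported in the Intel run.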

Intel metrics also show a high retirement rate (66.9%) and low backend stalls (6.8%).

elapsed              638.535
on_cpu               0.858          # 13.72 / 16 cores
utime                8759.240
stime                1.467
nvcsw                4116           # 0.49%
nivcsw               842516         # 99.51%
inblock              12336          # 19.32/sec
onblock              3176           # 4.97/sec
cpu-clock            8763097608450  # 8763.098 seconds
task-clock           8763121289241  # 8763.121 seconds
page faults          332556         # 37.949/sec
context switches     849614         # 96.953/sec
cpu migrations       3215           # 0.367/sec
major page faults    93             # 0.011/sec
minor page faults    332463         # 37.939/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             8083416801050  # 132.788 branches per 1000 inst
branch misses        56866033035    # 0.70% branch miss
conditional          8083416852346  # 132.788 conditional branches per 1000 inst
indirect             1384972721313  # 22.751 indirect branches per 1000 inst
slots                47671596043970 #
retiring             31887711190056 # 66.9% (66.9%) high
-- ucode             15614157267    #     0.0%
-- fastpath          31872097032789 #    66.9%
frontend             9325764960018  # 19.6% (19.6%)
-- latency           8097339304921  #    17.0%
-- bandwidth         1228425655097  #     2.6%
backend              3227716735681  #  6.8% ( 6.8%) low
-- cpu               2892801822478  #     6.1%
-- memory            334914913203   #     0.7%
speculation          3365729603445  #  7.1% ( 7.1%)
-- branch mispredict 3363574411730  #     7.1%
-- pipeline restart  2155191715     #     0.0%
smt-contention       0              #  0.0% ( 0.0%)
cpu-cycles           25692550556350 # 2.33 GHz
instructions         54368912295309 # 2.12 IPC
l2 access            626319059      # 0.019 l2 access per 1000 inst
l2 miss              189667419      # 30.28% l2 miss
cpu-cycles           16897128544726 #  7.4% memory latency
load stalls          1244125349928  #  7.4% l1 bound
l1 miss              1996402438     #  0.0% l2 bound
l2 miss              1032469931     #  0.0% l3 bound
l3 miss              322943920      #  0.0% dram bound
store_stalls         377235985      #  0.0% store bound
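
The derived values in the comment column (cores in use, IPC, branch miss rate, share of involuntary context switches) can be recomputed from the raw counters of both runs. The formulas below are inferred from the printed values rather than taken from the profiler's source, and the function and dictionary names are hypothetical.

```python
# Raw counters copied from the AMD and Intel outputs above.
AMD = dict(elapsed=534.380, utime=7377.820, stime=2.399,
           instructions=60879442564782, cycles=31071237393434,
           branches=8083876110088, branch_misses=80137273285,
           nvcsw=3143, nivcsw=716953)
INTEL = dict(elapsed=638.535, utime=8759.240, stime=1.467,
             instructions=54368912295309, cycles=25692550556350,
             branches=8083416801050, branch_misses=56866033035,
             nvcsw=4116, nivcsw=842516)


def derived(run, cores=16):
    """Recompute the comment-column metrics for one run (assumed formulas)."""
    busy_cores = (run["utime"] + run["stime"]) / run["elapsed"]
    switches = run["nvcsw"] + run["nivcsw"]
    return {
        "busy_cores": round(busy_cores, 2),           # e.g. "13.81 / 16 cores"
        "on_cpu": round(busy_cores / cores, 3),       # fraction of all cores
        "ipc": round(run["instructions"] / run["cycles"], 2),
        "branch_miss_pct": round(100.0 * run["branch_misses"] / run["branches"], 2),
        "involuntary_pct": round(100.0 * run["nivcsw"] / switches, 2),
    }


if __name__ == "__main__":
    for label, run in (("AMD", AMD), ("Intel", INTEL)):
        print(label, derived(run))
```

In both runs over 99% of context switches are involuntary, which fits a workload where more runnable threads than cores are being preempted rather than blocking.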

The process overview shows one “rm*” entry for each type of parallelism:

1602 processes
	963 rmSTD_THREADS        526068.01   145.63
	195 rmSTD_TASKS          117381.36    19.48
	 48 rmOpenMP             29588.00     2.56
	 30 rmTBB                18339.68     4.18
	 68 clinfo                  16.52     6.33
	 38 vulkaninfo               1.34     1.14
	  4 vulkani:disk$0           0.15     0.12
	  6 php                      0.09     0.09
	  2 llvmpipe-0               0.07     0.06
	  2 llvmpipe-1               0.07     0.06
	  2 llvmpipe-10              0.07     0.06
	  2 llvmpipe-11              0.07     0.06
	  2 llvmpipe-12              0.07     0.06
	  2 llvmpipe-13              0.07     0.06
	  2 llvmpipe-14              0.07     0.06
	  2 llvmpipe-15              0.07     0.06
	  2 llvmpipe-2               0.07     0.06
	  2 llvmpipe-3               0.07     0.06
	  2 llvmpipe-4               0.07     0.06
	  2 llvmpipe-5               0.07     0.06
	  2 llvmpipe-6               0.07     0.06
	  2 llvmpipe-7               0.07     0.06
	  2 llvmpipe-8               0.07     0.06
	  2 llvmpipe-9               0.07     0.06
	  6 clang                    0.04     0.08
	  1 lspci                    0.00     0.02
	  3 rocminfo                 0.00     0.01
	  1 ps                       0.00     0.01
	 89 sh                       0.00     0.00
	 14 gsettings                0.00     0.00
	 13 gcc                      0.00     0.00
	 12 toybrot                  0.00     0.00
	  8 stat                     0.00     0.00
	  8 systemd-detect-          0.00     0.00
	  6 llvm-link                0.00     0.00
	  5 phoronix-test-s          0.00     0.00
	  4 glxinfo                  0.00     0.00
	  2 cc                       0.00     0.00
	  2 lscpu                    0.00     0.00
	  2 setterm                  0.00     0.00
	  2 uname                    0.00     0.00
	  2 which                    0.00     0.00
	  1 date                     0.00     0.00
	  1 dconf worker             0.00     0.00
	  1 dirname                  0.00     0.00
	  1 dmesg                    0.00     0.00
	  1 dmidecode                0.00     0.00
	  1 gmain                    0.00     0.00
	  1 grep                     0.00     0.00
	  1 ifconfig                 0.00     0.00
	  1 ip                       0.00     0.00
	  1 lsmod                    0.00     0.00
	  1 mktemp                   0.00     0.00
	  1 qdbus                    0.00     0.00
	  1 readlink                 0.00     0.00
	  1 realpath                 0.00     0.00
	  1 sed                      0.00     0.00
	  1 sort                     0.00     0.00
	  1 stty                     0.00     0.00
	  1 systemctl                0.00     0.00
	  1 template.sh              0.00     0.00
	  1 wc                       0.00     0.00
18 processes running
349 maximum processes

The TBB and OpenMP sections of the process tree show roughly one thread per core:

      189095) toybrot          cpu=8 start=90.09 finish=128.74
        189096) rmTBB            cpu=9 start=90.10 finish=128.74
          189097) ?? cpu=0 start=90.13 finish=0.00 
            189099) rmTBB            cpu=5 start=90.13 finish=128.74
              189102) rmTBB            cpu=4 start=90.13 finish=128.74
                189109) ?? cpu=0 start=90.13 finish=0.00 
              189105) rmTBB            cpu=14 start=90.13 finish=128.74
            189100) rmTBB            cpu=8 start=90.13 finish=128.74
              189108) rmTBB            cpu=12 start=90.13 finish=128.74
              189111) rmTBB            cpu=9 start=90.13 finish=128.74
          189098) rmTBB            cpu=3 start=90.13 finish=128.74
            189101) rmTBB            cpu=6 start=90.13 finish=128.74
              189104) rmTBB            cpu=13 start=90.13 finish=128.74
              189107) rmTBB            cpu=15 start=90.13 finish=128.74
            189103) rmTBB            cpu=7 start=90.13 finish=128.74
              189106) rmTBB            cpu=0 start=90.13 finish=128.74
              189110) rmTBB            cpu=11 start=90.13 finish=128.74
      189112) sh               cpu=5 start=128.74 finish=128.74
        189113) sh               cpu=2 start=128.74 finish=128.74
      189114) toybrot          cpu=12 start=138.95 finish=177.74
        189115) rmOpenMP         cpu=6 start=138.96 finish=177.74
          189116) rmOpenMP         cpu=9 start=138.99 finish=177.74
          189117) rmOpenMP         cpu=15 start=138.99 finish=177.74
          189118) rmOpenMP         cpu=8 start=138.99 finish=177.74
          189119) rmOpenMP         cpu=5 start=138.99 finish=177.74
          189120) rmOpenMP         cpu=14 start=138.99 finish=177.74
          189121) rmOpenMP         cpu=3 start=138.99 finish=177.74
          189122) rmOpenMP         cpu=4 start=138.99 finish=177.74
          189123) rmOpenMP         cpu=13 start=138.99 finish=177.74
          189124) rmOpenMP         cpu=7 start=138.99 finish=177.74
          189125) rmOpenMP         cpu=0 start=138.99 finish=177.74
          189126) rmOpenMP         cpu=2 start=138.99 finish=177.74
          189127) rmOpenMP         cpu=10 start=138.99 finish=177.74
          189128) rmOpenMP         cpu=1 start=138.99 finish=177.74
          189129) rmOpenMP         cpu=11 start=138.99 finish=177.74
          189130) rmOpenMP         cpu=12 start=138.99 finish=177.74

Tasks (and threads) look more like this, with many more threads than cores and staggered finish times:

      189167) toybrot          cpu=12 start=273.86 finish=312.40
        189168) rmSTD_TASKS      cpu=7 start=273.86 finish=312.40
          189169) rmSTD_TASKS      cpu=5 start=273.90 finish=311.35
          189170) rmSTD_TASKS      cpu=5 start=273.90 finish=311.38
          189171) rmSTD_TASKS      cpu=10 start=273.90 finish=310.99
          189172) rmSTD_TASKS      cpu=1 start=273.90 finish=310.67
          189173) rmSTD_TASKS      cpu=2 start=273.90 finish=310.94
          189174) rmSTD_TASKS      cpu=11 start=273.90 finish=310.64
          189175) rmSTD_TASKS      cpu=4 start=273.90 finish=310.25
          189176) rmSTD_TASKS      cpu=13 start=273.90 finish=309.85
          189177) rmSTD_TASKS      cpu=8 start=273.90 finish=309.94
          189178) rmSTD_TASKS      cpu=9 start=273.90 finish=310.13
          189179) rmSTD_TASKS      cpu=10 start=273.90 finish=310.34
          189180) rmSTD_TASKS      cpu=5 start=273.90 finish=310.51
          189181) rmSTD_TASKS      cpu=11 start=273.90 finish=311.01
          189182) rmSTD_TASKS      cpu=6 start=273.90 finish=310.93
          189183) rmSTD_TASKS      cpu=4 start=273.90 finish=310.99
          189184) rmSTD_TASKS      cpu=5 start=273.90 finish=310.71
          189185) rmSTD_TASKS      cpu=5 start=273.90 finish=310.55
          189186) rmSTD_TASKS      cpu=13 start=273.90 finish=310.43
          189187) rmSTD_TASKS      cpu=13 start=273.90 finish=310.40
          189188) rmSTD_TASKS      cpu=5 start=273.90 finish=310.53
          189189) rmSTD_TASKS      cpu=0 start=273.90 finish=311.49
          189190) rmSTD_TASKS      cpu=7 start=273.90 finish=311.45
          189191) rmSTD_TASKS      cpu=2 start=273.90 finish=311.15
          189192) rmSTD_TASKS      cpu=11 start=273.90 finish=311.49
          189193) rmSTD_TASKS      cpu=14 start=273.90 finish=311.55
          189194) rmSTD_TASKS      cpu=5 start=273.90 finish=311.49
          189195) rmSTD_TASKS      cpu=12 start=273.90 finish=311.50
          189196) rmSTD_TASKS      cpu=8 start=273.90 finish=311.93
          189197) rmSTD_TASKS      cpu=9 start=273.90 finish=311.96
          189198) rmSTD_TASKS      cpu=10 start=273.90 finish=311.89
          189199) rmSTD_TASKS      cpu=14 start=273.90 finish=312.27
          189200) rmSTD_TASKS      cpu=6 start=273.90 finish=312.13
          189201) rmSTD_TASKS      cpu=13 start=273.90 finish=312.13
          189202) rmSTD_TASKS      cpu=10 start=273.90 finish=312.14
          189203) rmSTD_TASKS      cpu=1 start=273.90 finish=312.21
          189204) rmSTD_TASKS      cpu=2 start=273.90 finish=312.27
          189205) rmSTD_TASKS      cpu=5 start=273.90 finish=311.94
          189206) rmSTD_TASKS      cpu=5 start=273.90 finish=312.32
          189207) rmSTD_TASKS      cpu=0 start=273.90 finish=312.33
          189208) rmSTD_TASKS      cpu=15 start=273.90 finish=312.39
          189209) rmSTD_TASKS      cpu=2 start=273.90 finish=312.08
          189210) rmSTD_TASKS      cpu=2 start=273.90 finish=312.16
          189211) rmSTD_TASKS      cpu=9 start=273.90 finish=312.29
          189212) rmSTD_TASKS      cpu=4 start=273.90 finish=312.31
          189213) rmSTD_TASKS      cpu=11 start=273.90 finish=312.29
          189214) rmSTD_TASKS      cpu=8 start=273.90 finish=312.25
          189215) rmSTD_TASKS      cpu=10 start=273.90 finish=311.96
          189216) rmSTD_TASKS      cpu=4 start=273.90 finish=312.10
          189217) rmSTD_TASKS      cpu=14 start=273.90 finish=312.01
          189218) rmSTD_TASKS      cpu=7 start=273.90 finish=312.13
          189219) rmSTD_TASKS      cpu=0 start=273.90 finish=312.11
          189220) rmSTD_TASKS      cpu=12 start=273.90 finish=312.16
          189221) rmSTD_TASKS      cpu=12 start=273.90 finish=312.03
          189222) rmSTD_TASKS      cpu=14 start=273.90 finish=312.03
          189223) rmSTD_TASKS      cpu=10 start=273.90 finish=311.37
          189224) rmSTD_TASKS      cpu=13 start=273.90 finish=311.55
          189225) rmSTD_TASKS      cpu=9 start=273.90 finish=311.38
          189226) rmSTD_TASKS      cpu=0 start=273.90 finish=311.51
          189227) rmSTD_TASKS      cpu=3 start=273.90 finish=311.26
          189228) rmSTD_TASKS      cpu=1 start=273.90 finish=311.36
          189229) rmSTD_TASKS      cpu=11 start=273.90 finish=311.53
          189230) rmSTD_TASKS      cpu=2 start=273.90 finish=310.92
          189231) rmSTD_TASKS      cpu=14 start=273.90 finish=310.84
          189232) rmSTD_TASKS      cpu=6 start=273.90 finish=310.83
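
The difference between the two tree shapes can be quantified by parsing the tree lines and comparing thread counts and finish-time spread. This is a minimal sketch that assumes the line layout shown above (`pid) name cpu=N start=S finish=F`); the regular expression and helper names are illustrative, and the sample holds only a few lines copied from the listings.

```python
import re

# Matches lines such as "189169) rmSTD_TASKS      cpu=5 start=273.90 finish=311.35".
LINE_RE = re.compile(
    r"^\s*(?P<pid>\d+)\)\s+(?P<name>\S+)\s+cpu=(?P<cpu>\d+)"
    r"\s+start=(?P<start>[\d.]+)\s+finish=(?P<finish>[\d.]+)"
)

SAMPLE = """\
189115) rmOpenMP         cpu=6 start=138.96 finish=177.74
  189116) rmOpenMP         cpu=9 start=138.99 finish=177.74
  189117) rmOpenMP         cpu=15 start=138.99 finish=177.74
189168) rmSTD_TASKS      cpu=7 start=273.86 finish=312.40
  189169) rmSTD_TASKS      cpu=5 start=273.90 finish=311.35
  189176) rmSTD_TASKS      cpu=13 start=273.90 finish=309.85
"""


def parse_tree(text):
    """Turn process-tree lines into dicts; lines that do not match are skipped."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            rows.append({
                "pid": int(m["pid"]),
                "name": m["name"],
                "cpu": int(m["cpu"]),
                "start": float(m["start"]),
                "finish": float(m["finish"]),
            })
    return rows


def summarize(rows, name):
    """Return (thread count, finish-time spread in seconds) for one workload."""
    # Entries with finish=0.00 (the "??" lines) carry no finish time and are ignored.
    finishes = [r["finish"] for r in rows if r["name"] == name and r["finish"] > 0]
    return len(finishes), round(max(finishes) - min(finishes), 2)


if __name__ == "__main__":
    rows = parse_tree(SAMPLE)
    for name in ("rmOpenMP", "rmSTD_TASKS"):
        print(name, summarize(rows, name))
```

In the full listings every rmTBB and rmOpenMP thread finishes at the same timestamp, while the rmSTD_TASKS finish times range from 309.85 to 312.40.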