{"id":1004,"date":"2024-01-28T19:29:18","date_gmt":"2024-01-28T19:29:18","guid":{"rendered":"https:\/\/mvermeulen.org\/perf\/?page_id=1004"},"modified":"2024-01-28T20:58:02","modified_gmt":"2024-01-28T20:58:02","slug":"tnn","status":"publish","type":"page","link":"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/tnn\/","title":{"rendered":"tnn"},"content":{"rendered":"\n<p>TNN is an open source deep learning framework from Tencent. There are four workloads, all running on the CPU: densenet, mobilenet, squeezenet v2 and squeezenet v1.1. The densenet workload runs on all cores, while the other workloads appear to be single-threaded.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-75.png\" alt=\"\" class=\"wp-image-1011\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-75.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-75-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/systemtime-75-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>The topdown profile also differs between the benchmarks. 
However, the general theme is that the benchmarks are dominated by backend stalls and have mostly low levels of frontend stalls, except during transitions.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-112.png\" alt=\"\" class=\"wp-image-1013\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-112.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-112-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/01\/amdtopdown-112-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>AMD metrics show few floating point instructions (about 49 per 1000 instructions, essentially all 128-bit) and about 50 L2 accesses per 1000 instructions.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              739.513\non_cpu               0.302          # 4.82 \/ 16 cores\nutime                3560.039\nstime                7.444\nnvcsw                223692         # 87.59%\nnivcsw               31706          # 12.41%\ninblock              0              # 0.00\/sec\nonblock              13776          # 18.63\/sec\ncpu-clock            3564316007029  # 3564.316 seconds\ntask-clock           3564837738544  # 3564.838 seconds\npage faults          228836         # 64.193\/sec\ncontext switches     258899         # 72.626\/sec\ncpu migrations       814            # 0.228\/sec\nmajor page faults    3              # 0.001\/sec\nminor page faults    228833         # 64.192\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             4196404420376  # 189.470 branches per 1000 inst\nbranch misses        3237305237     # 0.08% branch miss\nconditional          3022374000895  # 136.462 conditional branches per 1000 inst\nindirect             305139741479   # 13.777 indirect branches 
per 1000 inst\ncpu-cycles           14722665973061 # 1.25 GHz\ninstructions         22146096055369 # 1.50 IPC\nslots                29444273560320 #\nretiring             7301956221892  # 24.8% (35.8%)\n-- ucode             40904035395    #     0.1%\n-- fastpath          7261052186497  #    24.7%\nfrontend             1517745054634  #  5.2% ( 7.4%)\n-- latency           419893118628   #     1.4%\n-- bandwidth         1097851936006  #     3.7%\nbackend              11560250031251 # 39.3% (56.6%)\n-- cpu               5778461213660  #    19.6%\n-- memory            5781788817591  #    19.6%\nspeculation          39744504863    #  0.1% ( 0.2%) low\n-- branch mispredict 36045076797    #     0.1%\n-- pipeline restart  3699428066     #     0.0%\nsmt-contention       9024536450254  # 30.6% ( 0.0%)\ncpu-cycles           14721801786379 # 1.24 GHz\ninstructions         22187875339042 # 1.51 IPC\ninstructions         7369246766729  # 51.188 l2 access per 1000 inst\nl2 hit from l1       234221533537   # 0.97% l2 miss\nl2 miss from l1      1817504517     #\nl2 hit from l2 pf    141169526887   #\nl3 hit from l2 pf    1747943816     #\nl3 miss from l2 pf   78990687       #\ninstructions         7382349757638  # 48.622 float per 1000 inst\nfloat 512            65             # 0.000 AVX-512 per 1000 inst\nfloat 256            770            # 0.000 AVX-256 per 1000 inst\nfloat 128            358941089279   # 48.622 AVX-128 per 1000 inst\nfloat MMX            0              # 0.000 MMX per 1000 inst\nfloat scalar         0              # 0.000 scalar per 1000 inst\n<\/code><\/pre>\n\n\n\n<p>Intel metrics<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              957.141\non_cpu               0.418          # 6.68 \/ 16 cores\nutime                6393.282\nstime                4.919\nnvcsw                205201         # 78.38%\nnivcsw               56603          # 21.62%\ninblock              328            # 0.34\/sec\nonblock              2664           # 
2.78\/sec\ncpu-clock            6393687302717  # 6393.687 seconds\ntask-clock           6393965412585  # 6393.965 seconds\npage faults          232344         # 36.338\/sec\ncontext switches     266387         # 41.662\/sec\ncpu migrations       40747          # 6.373\/sec\nmajor page faults    2              # 0.000\/sec\nminor page faults    232342         # 36.338\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             5392050354022  # 188.279 branches per 1000 inst\nbranch misses        5089990262     # 0.09% branch miss\nconditional          5392050370150  # 188.279 conditional branches per 1000 inst\nindirect             1477539033770  # 51.592 indirect branches per 1000 inst\nslots                40379208406658 #\nretiring             15114663323800 # 37.4% (37.4%)\n-- ucode             1079718189390  #     2.7%\n-- fastpath          14034945134410 #    34.8%\nfrontend             2281353403271  #  5.6% ( 5.6%)\n-- latency           1328892806649  #     3.3%\n-- bandwidth         952460596622   #     2.4%\nbackend              22305036556297 # 55.2% (55.2%)\n-- cpu               20253897577472 #    50.2%\n-- memory            2051138978825  #     5.1%\nspeculation          428791222988   #  1.1% ( 1.1%)\n-- branch mispredict 335441759078   #     0.8%\n-- pipeline restart  93349463910    #     0.2%\nsmt-contention       0              #  0.0% ( 0.0%)\ncpu-cycles           16045151722351 # 1.32 GHz\ninstructions         22585698220732 # 1.41 IPC\nl2 access            408311732315   # 30.691 l2 access per 1000 inst\nl2 miss              9044196543     # 2.22% l2 miss\n<\/code><\/pre>\n\n\n\n<p>The process overview shows almost all of the time in TNNTest.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>468 processes\n\t102 TNNTest              56099.20    83.22\n\t 68 clinfo                  19.17     7.99\n\t 38 vulkaninfo               1.50     1.33\n\t  4 vulkani:disk$0           0.15     
0.14\n\t  6 glxinfo:gdrv0            0.15     0.07\n\t  6 glxinfo:gl0              0.15     0.07\n\t  6 php                      0.10     0.14\n\t  2 llvmpipe-0               0.08     0.07\n\t  2 llvmpipe-1               0.08     0.07\n\t  2 llvmpipe-10              0.08     0.07\n\t  2 llvmpipe-11              0.08     0.07\n\t  2 llvmpipe-12              0.08     0.07\n\t  2 llvmpipe-13              0.08     0.07\n\t  2 llvmpipe-14              0.08     0.07\n\t  2 llvmpipe-15              0.08     0.07\n\t  2 llvmpipe-2               0.08     0.07\n\t  2 llvmpipe-3               0.08     0.07\n\t  2 llvmpipe-4               0.08     0.07\n\t  2 llvmpipe-5               0.08     0.07\n\t  2 llvmpipe-6               0.08     0.07\n\t  2 llvmpipe-7               0.08     0.07\n\t  2 llvmpipe-8               0.08     0.07\n\t  2 llvmpipe-9               0.08     0.07\n\t  2 glxinfo                  0.07     0.03\n\t  2 glxinfo:cs0              0.07     0.03\n\t  2 glxinfo:disk$0           0.07     0.03\n\t  2 glxinfo:sh0              0.07     0.03\n\t  2 glxinfo:shlo0            0.07     0.03\n\t  6 clang                    0.06     0.06\n\t  3 rocminfo                 0.03     0.00\n\t  1 lspci                    0.00     0.03\n\t  1 ps                       0.00     0.01\n\t 88 sh                       0.00     0.00\n\t 13 gcc                      0.00     0.00\n\t 12 tnn                      0.00     0.00\n\t 10 gsettings                0.00     0.00\n\t  8 stat                     0.00     0.00\n\t  8 systemd-detect-          0.00     0.00\n\t  6 llvm-link                0.00     0.00\n\t  5 phoronix-test-s          0.00     0.00\n\t  4 gmain                    0.00     0.00\n\t  2 cc                       0.00     0.00\n\t  2 dconf worker             0.00     0.00\n\t  2 lscpu                    0.00     0.00\n\t  2 uname                    0.00     0.00\n\t  2 which                    0.00     0.00\n\t  2 xset                     0.00     0.00\n\t  1 date      
               0.00     0.00\n\t  1 dirname                  0.00     0.00\n\t  1 dmesg                    0.00     0.00\n\t  1 dmidecode                0.00     0.00\n\t  1 grep                     0.00     0.00\n\t  1 ifconfig                 0.00     0.00\n\t  1 ip                       0.00     0.00\n\t  1 lsmod                    0.00     0.00\n\t  1 mktemp                   0.00     0.00\n\t  1 qdbus                    0.00     0.00\n\t  1 readlink                 0.00     0.00\n\t  1 realpath                 0.00     0.00\n\t  1 sed                      0.00     0.00\n\t  1 sort                     0.00     0.00\n\t  1 stty                     0.00     0.00\n\t  1 systemctl                0.00     0.00\n\t  1 template.sh              0.00     0.00\n\t  1 wc                       0.00     0.00\n\t  1 xrandr                   0.00     0.00\n0 processes running\n47 maximum processes\n<\/code><\/pre>\n\n\n\n<p>The computation structure starts one thread on each core, at least for the first workload.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>     1119747) tnn              cpu=1 start=7.90  finish=183.99\n        1119748) TNNTest          cpu=5 start=7.90  finish=183.99\n          1119749) TNNTest          cpu=3 start=8.11  finish=183.99\n          1119750) TNNTest          cpu=15 start=8.11  finish=183.99\n          1119751) TNNTest          cpu=6 start=8.11  finish=183.99\n          1119752) TNNTest          cpu=9 start=8.11  finish=183.99\n          1119753) TNNTest          cpu=7 start=8.11  finish=183.99\n          1119754) TNNTest          cpu=14 start=8.11  finish=183.99\n          1119755) TNNTest          cpu=13 start=8.11  finish=183.99\n          1119756) TNNTest          cpu=2 start=8.11  finish=183.99\n          1119757) TNNTest          cpu=12 start=8.11  finish=183.99\n          1119758) TNNTest          cpu=10 start=8.11  finish=183.99\n          1119759) TNNTest          cpu=0 start=8.11  finish=183.99\n          1119760) TNNTest          cpu=11 
start=8.11  finish=183.99\n          1119761) TNNTest          cpu=4 start=8.11  finish=183.99\n          1119762) TNNTest          cpu=8 start=8.11  finish=183.99\n          1119763) TNNTest          cpu=1 start=8.11  finish=183.99\n<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>An open source deep learning framework from Tencent. There are four workloads, all on the CPU and for densenet, mobilenet, squeezenet v2 and squeezenet v1.1. The densenet workload runs on all cores and other workloads look single-threaded. Topdown profile also <span class=\"excerpt-dots\">&hellip;<\/span> <a class=\"more-link\" href=\"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/tnn\/\"><span class=\"more-msg\">Continue reading &rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":58,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-1004","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1004","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/comments?post=1004"}],"version-history":[{"count":3,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1004\/revisions"}],"predecessor-version":[{"id":1014,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1004\/revisions\/1014"}],"up":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/58"}],"wp:attachment":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/media?parent=1004"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}