{"id":1363,"date":"2024-02-03T15:14:24","date_gmt":"2024-02-03T15:14:24","guid":{"rendered":"https:\/\/mvermeulen.org\/perf\/?page_id=1363"},"modified":"2024-02-03T17:28:29","modified_gmt":"2024-02-03T17:28:29","slug":"tensorflow-lite","status":"publish","type":"page","link":"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/tensorflow-lite\/","title":{"rendered":"tensorflow-lite"},"content":{"rendered":"\n<p>TensorFlow-based engine for inference. The benchmark runs six different models. With the exception of one model, the tests mostly run on all cores.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/systemtime-17.png\" alt=\"\" class=\"wp-image-1375\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/systemtime-17.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/systemtime-17-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/systemtime-17-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>The topdown profile shows that most tests are constrained by backend stalls, though to different degrees. Overall, frontend stalls are low. 
This is consistent with tensorflow and ai-benchmark, two other workloads built on the TensorFlow source base.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"960\" src=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/amdtopdown-17.png\" alt=\"\" class=\"wp-image-1377\" srcset=\"https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/amdtopdown-17.png 1280w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/amdtopdown-17-1024x768.png 1024w, https:\/\/mvermeulen.org\/perf\/wp-content\/uploads\/sites\/7\/2024\/02\/amdtopdown-17-768x576.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n\n\n\n<p>AMD metrics show relatively little floating point (about 88 per 1000 instructions, nearly all AVX-128). Backend stalls are more CPU-bound (32.4%) than memory-bound (17.2%). The workload keeps most of the 16 cores busy (on_cpu 0.906, about 14.5 of 16 cores).<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              1781.202\non_cpu               0.906          # 14.49 \/ 16 cores\nutime                25768.805\nstime                46.546\nnvcsw                2497631        # 90.04%\nnivcsw               276147         # 9.96%\ninblock              0              # 0.00\/sec\nonblock              13600          # 7.64\/sec\ncpu-clock            25817374390635 # 25817.374 seconds\ntask-clock           25818707618172 # 25818.708 seconds\npage faults          543211         # 21.039\/sec\ncontext switches     2782445        # 107.769\/sec\ncpu migrations       2740           # 0.106\/sec\nmajor page faults    402            # 0.016\/sec\nminor page faults    542809         # 21.024\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             8176257999730  # 75.297 branches per 1000 inst\nbranch misses        18848036907    # 0.23% branch miss\nconditional          6168687726909  # 56.809 conditional branches per 1000 inst\nindirect             708765353562   # 6.527 indirect branches 
per 1000 inst\ncpu-cycles           66488699011329 # 3.47 GHz\ninstructions         62700919022287 # 0.94 IPC\nslots                132972780453384 #\nretiring             20997039894536 # 15.8% (22.3%)\n-- ucode             70604174137    #     0.1%\n-- fastpath          20926435720399 #    15.7%\nfrontend             6812430443169  #  5.1% ( 7.2%)\n-- latency           4262617441938  #     3.2%\n-- bandwidth         2549813001231  #     1.9%\nbackend              65880605697902 # 49.5% (70.1%) high\n-- cpu               43055389397505 #    32.4%\n-- memory            22825216300397 #    17.2%\nspeculation          304512513870   #  0.2% ( 0.3%) low\n-- branch mispredict 231994676551   #     0.2%\n-- pipeline restart  72517837319    #     0.1%\nsmt-contention       38978063022358 # 29.3% ( 0.0%)\ncpu-cycles           66437980534368 # 3.46 GHz\ninstructions         62886224269728 # 0.95 IPC\ninstructions         20963778736844 # 110.104 l2 access per 1000 inst\nl2 hit from l1       1484797600488  # 17.09% l2 miss\nl2 miss from l1      76449398591    #\nl2 hit from l2 pf    505326269978   #\nl3 hit from l2 pf    300588021776   #\nl3 miss from l2 pf   17475885955    #\ninstructions         20952703348688 # 88.143 float per 1000 inst\nfloat 512            86             # 0.000 AVX-512 per 1000 inst\nfloat 256            16638293267    # 0.794 AVX-256 per 1000 inst\nfloat 128            1830204124915  # 87.349 AVX-128 per 1000 inst\nfloat MMX            0              # 0.000 MMX per 1000 inst\nfloat scalar         126            # 0.000 scalar per 1000 inst\ninstructions         2683871        #\nopcache              1001419        # 373.125 opcache per 1000 inst\nopcache miss         538595         # 53.8% opcache miss rate\nl1 dTLB miss         5215           # 1.943 L1 dTLB per 1000 inst\nl2 dTLB miss         1072           # 0.399 L2 dTLB per 1000 inst\ninstructions         2719642        #\nicache               1329853        # 488.981 icache per 1000 
inst\nicache miss          113221         #  8.5% icache miss rate\nl1 iTLB miss         13             # 0.005 L1 iTLB per 1000 inst\nl2 iTLB miss         0              # 0.000 L2 iTLB per 1000 inst\ntlb flush            19             # 0.007 TLB flush per 1000 inst\n<\/code><\/pre>\n\n\n\n<p>Intel metrics<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>elapsed              3505.681\non_cpu               0.900          # 14.41 \/ 16 cores\nutime                50462.029\nstime                45.015\nnvcsw                3228584        # 86.96%\nnivcsw               484151         # 13.04%\ninblock              671152         # 191.45\/sec\nonblock              3288           # 0.94\/sec\ncpu-clock            50501413499110 # 50501.413 seconds\ntask-clock           50502962043307 # 50502.962 seconds\npage faults          1267501        # 25.098\/sec\ncontext switches     3729982        # 73.857\/sec\ncpu migrations       20244          # 0.401\/sec\nmajor page faults    5621           # 0.111\/sec\nminor page faults    1261880        # 24.986\/sec\nalignment faults     0              # 0.000\/sec\nemulation faults     0              # 0.000\/sec\nbranches             7113178270992  # 42.878 branches per 1000 inst\nbranch misses        33305181988    # 0.47% branch miss\nconditional          7113178311760  # 42.878 conditional branches per 1000 inst\nindirect             2612548769133  # 15.748 indirect branches per 1000 inst\nslots                195950194535378 #\nretiring             60251227935729 # 30.7% (30.7%)\n-- ucode             2905514460597  #     1.5%\n-- fastpath          57345713475132 #    29.3%\nfrontend             15083119217621 #  7.7% ( 7.7%)\n-- latency           11672569239183 #     6.0%\n-- bandwidth         3410549978438  #     1.7%\nbackend              120615883054053 # 61.6% (61.6%)\n-- cpu               102213139192031 #    52.2%\n-- memory            18402743862022 #     9.4%\nspeculation          1161461077519  #  0.6% ( 0.6%) low\n-- 
branch mispredict 1029624209035  #     0.5%\n-- pipeline restart  131836868484   #     0.1%\nsmt-contention       0              #  0.0% ( 0.0%)\ncpu-cycles           90788972703168 # 2.18 GHz\ninstructions         100133205361729 # 1.10 IPC\nl2 access            2519301193908  # 41.427 l2 access per 1000 inst\nl2 miss              695926152907   # 27.62% l2 miss\ncpu-cycles           58978337968420 # 21.4% memory latency\nload stalls          12514222603573 # 12.5% l1 bound\nl1 miss              5158405253880  #  3.4% l2 bound\nl2 miss              3175529755484  #  4.3% l3 bound\nl3 miss              641250892911   #  1.1% dram bound\nstore_stalls         100727600884   #  0.2% store bound\n<\/code><\/pre>\n\n\n\n<p>Process summary<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>814 processes\n\t432 linux_x86-64_be      412501.71   546.44\n\t 68 clinfo                  17.52     4.67\n\t 38 vulkaninfo               1.13     1.14\n\t  6 php                      0.16     0.27\n\t  4 vulkani:disk$0           0.12     0.12\n\t  6 glxinfo:gdrv0            0.09     0.04\n\t  6 glxinfo:gl0              0.09     0.04\n\t  2 llvmpipe-0               0.06     0.06\n\t  2 llvmpipe-1               0.06     0.06\n\t  2 llvmpipe-10              0.06     0.06\n\t  2 llvmpipe-11              0.06     0.06\n\t  2 llvmpipe-12              0.06     0.06\n\t  2 llvmpipe-13              0.06     0.06\n\t  2 llvmpipe-14              0.06     0.06\n\t  2 llvmpipe-15              0.06     0.06\n\t  2 llvmpipe-2               0.06     0.06\n\t  2 llvmpipe-3               0.06     0.06\n\t  2 llvmpipe-4               0.06     0.06\n\t  2 llvmpipe-5               0.06     0.06\n\t  2 llvmpipe-6               0.06     0.06\n\t  2 llvmpipe-7               0.06     0.06\n\t  2 llvmpipe-8               0.06     0.06\n\t  2 llvmpipe-9               0.06     0.06\n\t  2 glxinfo                  0.06     0.02\n\t  2 glxinfo:cs0              0.06     0.02\n\t  2 glxinfo:disk$0           0.06     
0.02\n\t  2 glxinfo:sh0              0.05     0.02\n\t  2 glxinfo:shlo0            0.05     0.02\n\t  6 clang                    0.03     0.09\n\t  3 rocminfo                 0.03     0.00\n\t  1 lspci                    0.00     0.02\n\t  1 ps                       0.00     0.01\n\t 91 sh                       0.00     0.00\n\t 27 tensorflow-lite          0.00     0.00\n\t 12 gcc                      0.00     0.00\n\t 10 gsettings                0.00     0.00\n\t  8 stat                     0.00     0.00\n\t  8 systemd-detect-          0.00     0.00\n\t  6 llvm-link                0.00     0.00\n\t  5 phoronix-test-s          0.00     0.00\n\t  4 gmain                    0.00     0.00\n\t  2 dconf worker             0.00     0.00\n\t  2 lscpu                    0.00     0.00\n\t  2 uname                    0.00     0.00\n\t  2 which                    0.00     0.00\n\t  2 xset                     0.00     0.00\n\t  1 cc                       0.00     0.00\n\t  1 date                     0.00     0.00\n\t  1 dirname                  0.00     0.00\n\t  1 dmesg                    0.00     0.00\n\t  1 dmidecode                0.00     0.00\n\t  1 grep                     0.00     0.00\n\t  1 ifconfig                 0.00     0.00\n\t  1 ip                       0.00     0.00\n\t  1 lsmod                    0.00     0.00\n\t  1 mktemp                   0.00     0.00\n\t  1 qdbus                    0.00     0.00\n\t  1 readlink                 0.00     0.00\n\t  1 realpath                 0.00     0.00\n\t  1 sed                      0.00     0.00\n\t  1 sort                     0.00     0.00\n\t  1 stty                     0.00     0.00\n\t  1 systemctl                0.00     0.00\n\t  1 template.sh              0.00     0.00\n\t  1 wc                       0.00     0.00\n\t  1 xrandr                   0.00     0.00\n0 processes running\n47 maximum processes\n\n\n<\/code><\/pre>\n\n\n\n<p>Core computation blocks start one process on each core<\/p>\n\n\n\n<pre 
class=\"wp-block-code\"><code>      233538) tensorflow-lite  cpu=13 start=5.82  finish=66.34\n        233539) linux_x86-64_be  cpu=0 start=5.82  finish=66.34\n          233540) linux_x86-64_be  cpu=11 start=5.83  finish=66.34\n          233541) linux_x86-64_be  cpu=4 start=5.83  finish=66.34\n          233542) linux_x86-64_be  cpu=6 start=5.83  finish=66.34\n          233543) linux_x86-64_be  cpu=2 start=5.83  finish=66.34\n          233544) linux_x86-64_be  cpu=1 start=5.83  finish=66.34\n          233545) linux_x86-64_be  cpu=5 start=5.83  finish=66.34\n          233546) linux_x86-64_be  cpu=7 start=5.83  finish=66.34\n          233547) linux_x86-64_be  cpu=8 start=5.83  finish=66.34\n          233548) linux_x86-64_be  cpu=14 start=5.83  finish=66.34\n          233549) linux_x86-64_be  cpu=10 start=5.83  finish=66.34\n          233550) linux_x86-64_be  cpu=3 start=5.83  finish=66.34\n          233551) linux_x86-64_be  cpu=12 start=5.83  finish=66.34\n          233552) linux_x86-64_be  cpu=9 start=5.83  finish=66.34\n          233553) linux_x86-64_be  cpu=13 start=5.83  finish=66.34\n          233554) linux_x86-64_be  cpu=15 start=5.83  finish=66.34\n<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>TensorFlow-based engine for inference. The benchmark runs six different models. With the exception of one model, the tests mostly run on all cores. The topdown profile shows that most tests are constrained by backend stalls, though to different degrees. Overall, frontend stalls are low. 
<span class=\"excerpt-dots\">&hellip;<\/span> <a class=\"more-link\" href=\"https:\/\/mvermeulen.org\/perf\/workloads\/phoronix\/tensorflow-lite\/\"><span class=\"more-msg\">Continue reading &rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":58,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-1363","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1363","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/comments?post=1363"}],"version-history":[{"count":4,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1363\/revisions"}],"predecessor-version":[{"id":1383,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/1363\/revisions\/1383"}],"up":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/pages\/58"}],"wp:attachment":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/media?parent=1363"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}