{"id":79,"date":"2023-12-16T20:29:06","date_gmt":"2023-12-16T20:29:06","guid":{"rendered":"https:\/\/mvermeulen.org\/perf\/?p=79"},"modified":"2023-12-17T19:49:10","modified_gmt":"2023-12-17T19:49:10","slug":"stream-experiments","status":"publish","type":"post","link":"https:\/\/mvermeulen.org\/perf\/2023\/12\/16\/stream-experiments\/","title":{"rendered":"Stream, experiments"},"content":{"rendered":"\n<p>I copied Stream from <a href=\"https:\/\/www.cs.virginia.edu\/stream\/\">https:\/\/www.cs.virginia.edu\/stream\/<\/a> and put a copy in <a href=\"https:\/\/github.com\/cycletourist\/perf\">https:\/\/github.com\/cycletourist\/perf<\/a>.  This suggested the following compilation flags<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/opt\/AMD\/aocc-compiler-4.1.0\/bin\/clang -O2 -fopenmp -mcmodel=large -ffp-contract=fast -fnt-store stream.c -DSTREAM_TYPE=double -DSTREAM_ARRAY_SIZE=100000000 -DNTIMES=100 -o stream<\/code><\/pre>\n\n\n\n<p>On my system with a <a href=\"https:\/\/www.amd.com\/en\/products\/apu\/amd-ryzen-7-7800x3d\">Ryzen 7 7800X3D <\/a>this results in the following performance:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Function    Best Rate MB\/s  Avg time     Min time     Max time\nCopy:           44965.3     0.035895     0.035583     0.041095\nScale:          44902.1     0.035803     0.035633     0.040071\nAdd:            44214.4     0.054599     0.054281     0.057224\nTriad:          44659.5     0.054155     0.053740     0.062816<\/code><\/pre>\n\n\n\n<p>The question is what is the sensitivity of various alternatives for compiling\/running stream and can I do better than roughly ~45k MB\/s?<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Number of threads<\/h2>\n\n\n\n<p>The first dimension to try is the number of concurrent threads.  By default, we run on all cores so 16 threads (8 cores by 2-way hyperthreading).  However, there are only two memory channels on the processor, so perhaps if we limit threads we get less contention?  
Using <a href=\"https:\/\/github.com\/RRZE-HPC\/likwid\/wiki\/likwid-topology\">likwid-topology<\/a> we see the following<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>********************************************************************************\nCache Topology\n********************************************************************************\nLevel:\t\t\t1\nSize:\t\t\t32 kB\nCache groups:\t\t( 0 8 ) ( 1 9 ) ( 2 10 ) ( 3 11 ) ( 4 12 ) ( 5 13 ) ( 6 14 ) ( 7 15 )\n--------------------------------------------------------------------------------\nLevel:\t\t\t2\nSize:\t\t\t1 MB\nCache groups:\t\t( 0 8 ) ( 1 9 ) ( 2 10 ) ( 3 11 ) ( 4 12 ) ( 5 13 ) ( 6 14 ) ( 7 15 )\n--------------------------------------------------------------------------------\nLevel:\t\t\t3\nSize:\t\t\t96 MB\nCache groups:\t\t( 0 8 1 9 2 10 3 11 4 12 5 13 6 14 7 15 )\n--------------------------------------------------------------------------------<\/code><\/pre>\n\n\n\n<p>So, other than avoiding the sibling hardware threads on the same core (which share L1 and L2), we try lower numbers of threads.  
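For scale: at STREAM_ARRAY_SIZE=100000000 the three stream arrays (a, b, c) add up to far more than even the shared 96 MB L3, which is why caching cannot help regardless of how threads are placed.  A quick sketch of the arithmetic (stream_working_set is my own helper name, not part of stream.c):

```c
#include <stddef.h>

/* Total working set of the three stream arrays (a, b, c) in bytes,
 * for a given element count and element size. */
size_t stream_working_set(size_t array_size, size_t elem_size) {
    return (size_t)3 * array_size * elem_size;
}
/* With STREAM_ARRAY_SIZE=100000000 doubles this is 2,400,000,000 bytes,
 * roughly 24x the 96 MB L3, so the arrays cannot stay cached. */
```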
Using 8 copies (one for each core) and &#8220;taskset -c 0,1,2,3,4,5,6,7&#8221; we are slightly higher<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Function    Best Rate MB\/s  Avg time     Min time     Max time\nCopy:           45801.8     0.035042     0.034933     0.036696\nScale:          45887.3     0.034946     0.034868     0.035780\nAdd:            44956.4     0.053540     0.053385     0.054156\nTriad:          45478.8     0.052950     0.052772     0.053537<\/code><\/pre>\n\n\n\n<p>Using 4 copies and &#8220;taskset -c 0,2,4,6&#8221; we are higher still<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Function    Best Rate MB\/s  Avg time     Min time     Max time\nCopy:           47930.1     0.033463     0.033382     0.034021\nScale:          47935.9     0.033456     0.033378     0.033836\nAdd:            47367.2     0.050786     0.050668     0.051453\nTriad:          47514.5     0.050662     0.050511     0.053086<\/code><\/pre>\n\n\n\n<p>Using 2 copies and &#8220;taskset -c 0,4&#8221; we are even higher<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Function    Best Rate MB\/s  Avg time     Min time     Max time\nCopy:           49953.0     0.032219     0.032030     0.032577\nScale:          50139.6     0.032096     0.031911     0.032346\nAdd:            49384.7     0.048934     0.048598     0.049323\nTriad:          49297.6     0.049014     0.048684     0.049403<\/code><\/pre>\n\n\n\n<p>Using 1 copy and &#8220;taskset -c 0&#8221; we are lower again<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Function    Best Rate MB\/s  Avg time     Min time     Max time\nCopy:           45256.7     0.035486     0.035354     0.035758\nScale:          45941.4     0.034933     0.034827     0.035466\nAdd:            45713.4     0.052640     0.052501     0.053054\nTriad:          45691.7     0.052665     0.052526     0.053191<\/code><\/pre>\n\n\n\n<p>For completeness, we try using 3 copies and &#8220;taskset -c 0,2,4&#8221; and are also slightly lower than the two-copy 
run<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Function    Best Rate MB\/s  Avg time     Min time     Max time\nCopy:           49209.8     0.032582     0.032514     0.032765\nScale:          49169.0     0.032624     0.032541     0.032854\nAdd:            48532.8     0.049577     0.049451     0.049879\nTriad:          48672.6     0.049475     0.049309     0.049913<\/code><\/pre>\n\n\n\n<p>So it looks like this processor runs fastest with two threads, one per memory channel.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Compiler options<\/h2>\n\n\n\n<p>The next dimension to try is the compiler and compiler options.  Here we start from the recommended AOCC compiler and options, so we don&#8217;t expect removing any of them to improve performance &#8211; but it is useful to see anyway.<\/p>\n\n\n\n<p>Running with &#8220;gcc -O2&#8221; instead of aocc results in slower performance<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>gcc -O2 -fopenmp -mcmodel=large stream.c -DSTREAM_TYPE=double -DSTREAM_ARRAY_SIZE=100000000 -DNTIMES=100 -o stream\n\nFunction    Best Rate MB\/s  Avg time     Min time     Max time\nCopy:           38927.6     0.041432     0.041102     0.041713\nScale:          31689.5     0.050793     0.050490     0.051133\nAdd:            34335.7     0.070263     0.069898     0.070816\nTriad:          34257.9     0.070204     0.070057     0.070589<\/code><\/pre>\n\n\n\n<p>The -fnt-store option generates non-temporal stores.  These keep the processor from allocating the stored data in the caches.  This makes sense for stream: since we are streaming through arrays much larger than the cache, the cache would otherwise be polluted as existing entries conflict with new fetches from memory.  
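To make the mechanism concrete, here is a rough sketch of a streaming copy with SSE2 intrinsics; this is my illustration of the idea, not the code AOCC actually emits for -fnt-store:

```c
#include <immintrin.h>
#include <stddef.h>

/* Plain copy: every store allocates a line in the cache hierarchy. */
void copy_plain(double *restrict dst, const double *restrict src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Streaming copy: _mm_stream_pd writes around the caches, so the
 * arrays being swept through do not evict other cached data.
 * Assumes dst is 16-byte aligned, as _mm_stream_pd requires. */
void copy_nt(double *restrict dst, const double *restrict src, size_t n) {
    size_t i = 0;
    for (; i + 2 <= n; i += 2)
        _mm_stream_pd(&dst[i], _mm_loadu_pd(&src[i]));
    for (; i < n; i++)          /* scalar tail for odd n */
        dst[i] = src[i];
    _mm_sfence();               /* order the non-temporal stores */
}
```

For data that fits in cache the plain version is the right choice; the streaming version only pays off when, as in stream, each element is touched once and the arrays dwarf the last-level cache.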
Removing the -fnt-store option results in numbers close to the gcc numbers<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/opt\/AMD\/aocc-compiler-4.1.0\/bin\/clang -O2 -fopenmp -mcmodel=large -ffp-contract=fast stream.c -DSTREAM_TYPE=double -DSTREAM_ARRAY_SIZE=100000000 -DNTIMES=100 -o stream\n\nFunction    Best Rate MB\/s  Avg time     Min time     Max time\nCopy:           38575.4     0.041629     0.041477     0.041940\nScale:          31698.9     0.050659     0.050475     0.051251\nAdd:            34480.2     0.069831     0.069605     0.070409\nTriad:          34473.8     0.069811     0.069618     0.070233<\/code><\/pre>\n\n\n\n<p>The -ffp-contract=fast option allows the compiler to contract separate operations (e.g. a multiply and an add) into a single fused instruction, so its effect should show up primarily in &#8220;Triad&#8221;, which performs a multiply-add per element.  We essentially see no difference removing this option, so it may be a &#8220;don&#8217;t care&#8221;<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/opt\/AMD\/aocc-compiler-4.1.0\/bin\/clang -O2 -fopenmp -mcmodel=large -fnt-store stream.c -DSTREAM_TYPE=double -DSTREAM_ARRAY_SIZE=100000000 -DNTIMES=100 -o stream\n\nFunction    Best Rate MB\/s  Avg time     Min time     Max time\nCopy:           49973.1     0.032147     0.032017     0.032620\nScale:          50160.2     0.032049     0.031898     0.033233\nAdd:            49402.9     0.048854     0.048580     0.050982\nTriad:          49385.9     0.048923     0.048597     0.049789<\/code><\/pre>\n\n\n\n<p>The -O3 option enables a further set of optimizations such as vectorization and loop transformations. 
While beneficial for some codes, it slows stream down slightly, suggesting these optimizations are not helpful here<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/opt\/AMD\/aocc-compiler-4.1.0\/bin\/clang -O3 -fopenmp -mcmodel=large -ffp-contract=fast -fnt-store stream.c -DSTREAM_TYPE=double -DSTREAM_ARRAY_SIZE=100000000 -DNTIMES=100 -o stream\n\nFunction    Best Rate MB\/s  Avg time     Min time     Max time\nCopy:           49810.3     0.032276     0.032122     0.032575\nScale:          49892.5     0.032225     0.032069     0.032478\nAdd:            48962.7     0.049157     0.049017     0.049637\nTriad:          49026.6     0.049232     0.048953     0.049644<\/code><\/pre>\n\n\n\n<p>So overall, we use the options given.<\/p>\n\n\n\n<p>Another point of comparison is the Intel compiler (icx).  Compiling with options -axCORE-AVX2 -O3 -qopenmp -qopt-streaming-stores results in slightly lower Copy performance and substantially lower Scale, Add, and Triad performance.  Will check this on an Intel system as well.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Function    Best Rate MB\/s  Avg time     Min time     Max time\nCopy:           47829.7     0.033554     0.033452     0.033797\nScale:          31703.3     0.050603     0.050468     0.050823\nAdd:            34460.0     0.069887     0.069646     0.070249\nTriad:          34514.5     0.069775     0.069536     0.070086<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Memory configuration<\/h2>\n\n\n\n<p>One variable we have not changed is the memory configuration.  This system has the following memory &#8211; <a href=\"https:\/\/www.gskill.com\/product\/165\/374\/1648545408\/F5-5600J3636D32GX2-TZ5RK-F5-5600J3636D32GA2-TZ5RK\">https:\/\/www.gskill.com\/product\/165\/374\/1648545408\/F5-5600J3636D32GX2-TZ5RK-F5-5600J3636D32GA2-TZ5RK<\/a>.  Each DIMM is 32 GB and there are four DIMMs.  <a href=\"https:\/\/forums.guru3d.com\/threads\/single-rank-vs-dual-rank-ram-just-in-case.447117\/\">This thread<\/a> suggests that 32GB DDR5 DIMMs are dual rank and 8GB DDR5 DIMMs are single rank.  
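As a back-of-the-envelope check on how close ~50k MB\/s is to the hardware limit, here is a sketch of the arithmetic, assuming the platform's two DDR5 channels form a 128-bit (16-byte) data path and the DDR5-5200\/DDR5-3600 speeds from the processor specification (ddr5_peak_gbs is my own helper name):

```c
/* Back-of-envelope DDR5 peak bandwidth: transfers per second times
 * data-path width.  Assumes the two DDR5 channels together form a
 * 128-bit (16-byte) path, per the platform's dual-channel design. */
double ddr5_peak_gbs(double mega_transfers_per_s) {
    return mega_transfers_per_s * 1e6 * 16.0 / 1e9;   /* GB/s */
}
/* ddr5_peak_gbs(5200) -> 83.2 GB/s; ddr5_peak_gbs(3600) -> 57.6 GB/s.
 * The best measured rate of ~50 GB/s is roughly 87% of the DDR5-3600
 * peak but only about 60% of the DDR5-5200 peak. */
```

So whether the four dual-rank DIMMs are actually running at 3600 or at a faster profile determines whether these results are already near the bus limit or leaving headroom.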
Also, the processor specification indicates that a 2x2R configuration (two dual-rank DIMMs) can run at DDR5-5200 while 4x2R runs at DDR5-3600.<\/p>\n\n\n\n<p>I haven&#8217;t done the experiments, but this suggests that if the processor were paired with 16GB of RAM in 2 single-rank DIMMs, we might see faster stream performance than with the current 128GB of RAM in 4 DIMMs.  There would of course be a tradeoff for other workloads with a larger working set size.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Versions of Stream<\/h2>\n\n\n\n<p>Another variable would be the specific version of stream.  As a comparison, I tried running the phoronix-test-suite copy of the stream benchmark and got the following results<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Stream 2013-01-17:\n    pts\/stream-1.3.4 &#91;Type: Copy]\n    Test 1 of 4\n    Estimated Trial Run Count:    5                      \n    Estimated Test Run-Time:      4 Minutes              \n    Estimated Time To Completion: 16 Minutes &#91;15:04 CST] \n        Started Run 1 @ 14:48:35\n        Started Run 2 @ 14:50:18\n        Started Run 3 @ 14:52:02\n        Started Run 4 @ 14:53:45\n        Started Run 5 @ 14:55:28\n\n    Type: Copy:\n        44593\n        44619.9\n        44697.6\n        44632.3\n        44658.9\n\n    Average: 44640.3 MB\/s\n    Deviation: 0.09%\n\nStream 2013-01-17:\n    pts\/stream-1.3.4 &#91;Type: Scale]\n    Test 2 of 4\n    Estimated Trial Run Count:    5                      \n    Estimated Test Run-Time:      9 Minutes              \n    Estimated Time To Completion: 26 Minutes &#91;15:22 CST] \n        Utilizing Data From Shared Cache @ 14:57:12\n\n    Type: Scale:\n        28951.6\n        28930.4\n        28951.2\n        28969.1\n        28933.6\n\n    Average: 28947.2 MB\/s\n    Deviation: 0.05%\n\nStream 2013-01-17:\n    pts\/stream-1.3.4 &#91;Type: Triad]\n    Test 3 of 4\n    Estimated Trial Run Count:    5                      \n    Estimated Test Run-Time:      9 Minutes              \n    Estimated Time To 
Completion: 17 Minutes &#91;15:13 CST] \n        Utilizing Data From Shared Cache @ 14:57:14\n\n    Type: Triad:\n        32139.4\n        32127.3\n        32150.8\n        32161.8\n        32152.7\n\n    Average: 32146.4 MB\/s\n    Deviation: 0.04%\n\nStream 2013-01-17:\n    pts\/stream-1.3.4 &#91;Type: Add]\n    Test 4 of 4\n    Estimated Trial Run Count:    5                     \n    Estimated Time To Completion: 9 Minutes &#91;15:05 CST] \n        Utilizing Data From Shared Cache @ 14:57:16\n\n    Type: Add:\n        32140.2\n        32135.8\n        32107.5\n        32110.3\n        32120.3\n\n    Average: 32122.8 MB\/s\n    Deviation: 0.05%<\/code><\/pre>\n\n\n\n<p>These are substantially slower.  A peek at the installed directory and run logs suggests several reasons<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Sixteen threads are run instead of two<\/li>\n\n\n\n<li>The compiler is gcc and compiler options are &#8220;-mcmodel=medium -O3 -march=native -fopenmp&#8221;<\/li>\n\n\n\n<li>We run an array size of 402653184 elements instead of 100000000 elements (though I suspect this doesn&#8217;t have as much effect)<\/li>\n<\/ul>\n\n\n\n<p>The other remaining variable would be to find the older version of stream I previously ran ~five years ago.  Will find this and also compare.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I copied Stream from https:\/\/www.cs.virginia.edu\/stream\/ and put a copy in https:\/\/github.com\/cycletourist\/perf. 
This suggested the following compilation flags On my system with a Ryzen 7 7800X3D this results in the following performance: The question is what is the sensitivity of various <span class=\"excerpt-dots\">&hellip;<\/span> <a class=\"more-link\" href=\"https:\/\/mvermeulen.org\/perf\/2023\/12\/16\/stream-experiments\/\"><span class=\"more-msg\">Continue reading &rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[11],"tags":[10],"class_list":["post-79","post","type-post","status-publish","format-standard","hentry","category-experiment","tag-stream"],"_links":{"self":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/posts\/79","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/comments?post=79"}],"version-history":[{"count":3,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/posts\/79\/revisions"}],"predecessor-version":[{"id":110,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/posts\/79\/revisions\/110"}],"wp:attachment":[{"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/media?parent=79"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/categories?post=79"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mvermeulen.org\/perf\/wp-json\/wp\/v2\/tags?post=79"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}