Boost C++ Libraries: Ticket #6057: Lazy variance in accumulator gives wrong result when too high statistics https://svn.boost.org/trac10/ticket/6057 <p> Hello, </p> <p> I was trying to compare performances of various ways to compute the variance of a distribution of integer values, because I need a very fast one. </p> <p> I tried to use lazy_variance and variance, a custom algorithm, (and also TProfile class from ROOT). </p> <p> What I saw is that using lazy_variance gives the correct mean value but a weird variance value when the number of events becomes large (n &gt; 0x3FFFF). </p> <p> The bug was seen first with boost 1.44.0 on a 32 bit Fedora 14 machine (using gcc 4.5.1), and also with boost 1.46.1 on a 64 bit Mandriva 2011 (using gcc 4.6.1). </p> <p> Here below is the source code to reproduce the bug (ROOT stuff commented) : </p> <pre class="wiki">#include &lt;iostream&gt; #include &lt;vector&gt; #include &lt;cmath&gt; // accumulators #include &lt;boost/accumulators/accumulators.hpp&gt; #include &lt;boost/accumulators/statistics/stats.hpp&gt; #include &lt;boost/accumulators/statistics/mean.hpp&gt; #include &lt;boost/accumulators/statistics/variance.hpp&gt; // root //#include &lt;TProfile.h&gt; // timing info #include &lt;boost/progress.hpp&gt; // Nbr aléatoires #include &lt;boost/random/lagged_fibonacci.hpp&gt; #include &lt;boost/random/variate_generator.hpp&gt; #include &lt;boost/random/uniform_int.hpp&gt; using namespace std; using namespace boost::accumulators; using namespace boost; class MyStat { public: MyStat() : NbrEvt( 0u ), Mean( 0. ), MeanSq( 0. ) {}; ~MyStat() {}; void fill( const int&amp; ev ) { Mean = ( Mean * NbrEvt + ev ) / double( NbrEvt + 1 ); MeanSq = ( MeanSq * NbrEvt + ev * ev ) / double( NbrEvt + 1 ); NbrEvt++; }; double mean() const { return Mean; }; double variance() const { return MeanSq - Mean * Mean; }; protected: long unsigned NbrEvt; double Mean; double MeanSq; }; int main( int argc, char* argv[] ) { lagged_fibonacci44497 GenerateurBase( 12345 ); uniform_int&lt; int &gt; Distribution( -128, 127 ); variate_generator&lt; lagged_fibonacci44497&amp;, uniform_int&lt; int &gt; &gt; Generateur( GenerateurBase, Distribution ); vector&lt; int &gt; Data; long unsigned i, total( 0xFFFF ); for ( i = 0; i &lt; total; i++ ) Data . push_back( Generateur() ); cout &lt;&lt; "Test using " &lt;&lt; total &lt;&lt; " data points." &lt;&lt; endl &lt;&lt; endl; accumulator_set&lt; signed, stats&lt; tag::mean, tag::lazy_variance &gt; &gt; lazyVar; accumulator_set&lt; signed, stats&lt; tag::mean, tag::variance &gt; &gt; immedVar; //TProfile * profile = new TProfile( "profile", "My profile", 1, -.5, .5, "S" ); MyStat simpleStat; progress_timer * chrono = 0; cout &lt;&lt; "Lazy variance accumulator : "; chrono = new progress_timer(); for ( i = 0; i &lt; total; i++ ) lazyVar( Data[ i ] ); delete chrono; cout &lt;&lt; "Mean : " &lt;&lt; extract_result&lt; tag::mean &gt;( lazyVar ) &lt;&lt; endl &lt;&lt; "Variance : " &lt;&lt; extract_result&lt; tag::lazy_variance &gt;( lazyVar ) &lt;&lt; endl &lt;&lt; endl; cout &lt;&lt; "Immediate variance accumulator : "; chrono = new progress_timer(); for ( i = 0; i &lt; total; i++ ) immedVar( Data[ i ] ); delete chrono; cout &lt;&lt; "Mean : " &lt;&lt; extract_result&lt; tag::mean &gt;( immedVar ) &lt;&lt; endl &lt;&lt; "Variance : " &lt;&lt; extract_result&lt; tag::variance &gt;( immedVar ) &lt;&lt; endl &lt;&lt; endl; /* cout &lt;&lt; "TProfile : "; chrono = new progress_timer(); for ( i = 0; i &lt; total; i++ ) profile -&gt; Fill( 0., double( Data[ i ] ) ); delete chrono; cout &lt;&lt; "Mean : " &lt;&lt; profile -&gt; GetBinContent( 1 ) &lt;&lt; endl &lt;&lt; "Variance : " &lt;&lt; profile -&gt; GetBinError( 1 ) * profile -&gt; GetBinError( 1 ) &lt;&lt; endl &lt;&lt; endl; */ cout &lt;&lt; "Simple classe : "; chrono = new progress_timer(); for ( i = 0; i &lt; total; i++ ) simpleStat . fill( Data[ i ] ); delete chrono; cout &lt;&lt; "Mean : " &lt;&lt; simpleStat . mean() &lt;&lt; endl &lt;&lt; "Variance : " &lt;&lt; simpleStat . variance() &lt;&lt; endl &lt;&lt; endl &lt;&lt; flush; delete profile; return 0; } </pre><p> And here is the output : </p> <pre class="wiki">Test using 268435455 data points. Lazy variance accumulator : 0.27 s Mean : -0.497278 Variance : 5.30971 Immediate variance accumulator : 9.07 s Mean : -0.497278 Variance : 5461.31 TProfile : 15.56 s Mean : -0.497278 Variance : 5461.31 Simple classe : 7.28 s Mean : -0.497278 Variance : 5461.31 </pre><p> Note that all mean values agree, and all variance values but the lazy one do too. </p> <p> Cheers, johan </p> en-us Boost C++ Libraries /htdocs/site/boost.png https://svn.boost.org/trac10/ticket/6057 Trac 1.4.3 Eric Niebler Thu, 15 Nov 2012 18:46:18 GMT owner changed https://svn.boost.org/trac10/ticket/6057#comment:1 https://svn.boost.org/trac10/ticket/6057#comment:1 <ul> <li><strong>owner</strong> changed from <span class="trac-author">Eric Niebler</span> to <span class="trac-author">Matthias Troyer</span> </li> </ul> <p> Matthias, can you have a look at this one when you get a chance? </p> Ticket Matthias Troyer Thu, 15 Nov 2012 21:12:46 GMT status changed; resolution set https://svn.boost.org/trac10/ticket/6057#comment:2 https://svn.boost.org/trac10/ticket/6057#comment:2 <ul> <li><strong>status</strong> <span class="trac-field-old">new</span> → <span class="trac-field-new">closed</span> </li> <li><strong>resolution</strong> → <span class="trac-field-new">worksforme</span> </li> </ul> <p> I have failed to reproduce this bug with both g++ 4.5 and 4.6 and using Boost 1.44, 1.46.1 and later releases but always get the same variance with lazy, immediate, and the "Simple classe". I am thus closing the ticket. </p> Ticket Johan <johan.luisier@…> Fri, 16 Nov 2012 07:01:28 GMT <link>https://svn.boost.org/trac10/ticket/6057#comment:3 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/6057#comment:3</guid> <description> <p> Well, thank you for having a look. However I have to disagree. I did the test again this morning, with this little change made to the given source code: </p> <pre class="wiki">unsigned i, total( 0xFFFF1 ); </pre><p> and the result is: </p> <pre class="wiki">Test using 1048561 data points. Lazy variance accumulator : 0.00 s Mean : -0.436519 Variance : 1369.89 Immediate variance accumulator : 0.02 s Mean : -0.436519 Variance : 5465.95 Simple classe : 0.01 s Mean : -0.436519 Variance : 5465.95 </pre><p> Test performed on Mandriva 2011 64 bits using gcc 4.6.1 and boost 1.46.1. </p> </description> <category>Ticket</category> </item> <item> <dc:creator>Matthias Troyer</dc:creator> <pubDate>Fri, 16 Nov 2012 08:06:43 GMT</pubDate> <title>status changed; resolution deleted https://svn.boost.org/trac10/ticket/6057#comment:4 https://svn.boost.org/trac10/ticket/6057#comment:4 <ul> <li><strong>status</strong> <span class="trac-field-old">closed</span> → <span class="trac-field-new">reopened</span> </li> <li><strong>resolution</strong> <span class="trac-field-deleted">worksforme</span> </li> </ul> <p> Thank you for the updated example which now indeed shows what you report. The reason is simply an overflow in the sum of squares. To avoid it just use double instead of signed int as sample type - just as you do in your own reference class where you store <a class="missing wiki">MeanSq</a> also as double. </p> Ticket Matthias Troyer Fri, 16 Nov 2012 08:06:59 GMT status changed; resolution set https://svn.boost.org/trac10/ticket/6057#comment:5 https://svn.boost.org/trac10/ticket/6057#comment:5 <ul> <li><strong>status</strong> <span class="trac-field-old">reopened</span> → <span class="trac-field-new">closed</span> </li> <li><strong>resolution</strong> → <span class="trac-field-new">wontfix</span> </li> </ul> Ticket