Opened 11 years ago
Closed 10 years ago
#6057 closed Bugs (wontfix)
Lazy variance in accumulator gives wrong result when too high statistics
Reported by: | Owned by: | Matthias Troyer | |
---|---|---|---|
Milestone: | To Be Determined | Component: | accumulator |
Version: | Boost 1.44.0 | Severity: | Problem |
Keywords: | accumulator lazy_variance | Cc: |
Description
Hello,
I was trying to compare performances of various ways to compute the variance of a distribution of integer values, because I need a very fast one.
I tried to use lazy_variance and variance, a custom algorithm, (and also TProfile class from ROOT).
What I saw is that using lazy_variance gives the correct mean value but a weird variance value when the number of events becomes large (n > 0x3FFFF).
The bug was seen first with boost 1.44.0 on a 32 bit Fedora 14 machine (using gcc 4.5.1), and also with boost 1.46.1 on a 64 bit Mandriva 2011 (using gcc 4.6.1).
Here below is the source code to reproduce the bug (ROOT stuff commented) :
#include <iostream> #include <vector> #include <cmath> // accumulators #include <boost/accumulators/accumulators.hpp> #include <boost/accumulators/statistics/stats.hpp> #include <boost/accumulators/statistics/mean.hpp> #include <boost/accumulators/statistics/variance.hpp> // root //#include <TProfile.h> // timing info #include <boost/progress.hpp> // Nbr aléatoires #include <boost/random/lagged_fibonacci.hpp> #include <boost/random/variate_generator.hpp> #include <boost/random/uniform_int.hpp> using namespace std; using namespace boost::accumulators; using namespace boost; class MyStat { public: MyStat() : NbrEvt( 0u ), Mean( 0. ), MeanSq( 0. ) {}; ~MyStat() {}; void fill( const int& ev ) { Mean = ( Mean * NbrEvt + ev ) / double( NbrEvt + 1 ); MeanSq = ( MeanSq * NbrEvt + ev * ev ) / double( NbrEvt + 1 ); NbrEvt++; }; double mean() const { return Mean; }; double variance() const { return MeanSq - Mean * Mean; }; protected: long unsigned NbrEvt; double Mean; double MeanSq; }; int main( int argc, char* argv[] ) { lagged_fibonacci44497 GenerateurBase( 12345 ); uniform_int< int > Distribution( -128, 127 ); variate_generator< lagged_fibonacci44497&, uniform_int< int > > Generateur( GenerateurBase, Distribution ); vector< int > Data; long unsigned i, total( 0xFFFF ); for ( i = 0; i < total; i++ ) Data . push_back( Generateur() ); cout << "Test using " << total << " data points." << endl << endl; accumulator_set< signed, stats< tag::mean, tag::lazy_variance > > lazyVar; accumulator_set< signed, stats< tag::mean, tag::variance > > immedVar; //TProfile * profile = new TProfile( "profile", "My profile", 1, -.5, .5, "S" ); MyStat simpleStat; progress_timer * chrono = 0; cout << "Lazy variance accumulator : "; chrono = new progress_timer(); for ( i = 0; i < total; i++ ) lazyVar( Data[ i ] ); delete chrono; cout << "Mean : " << extract_result< tag::mean >( lazyVar ) << endl << "Variance : " << extract_result< tag::lazy_variance >( lazyVar ) << endl << endl; cout << "Immediate variance accumulator : "; chrono = new progress_timer(); for ( i = 0; i < total; i++ ) immedVar( Data[ i ] ); delete chrono; cout << "Mean : " << extract_result< tag::mean >( immedVar ) << endl << "Variance : " << extract_result< tag::variance >( immedVar ) << endl << endl; /* cout << "TProfile : "; chrono = new progress_timer(); for ( i = 0; i < total; i++ ) profile -> Fill( 0., double( Data[ i ] ) ); delete chrono; cout << "Mean : " << profile -> GetBinContent( 1 ) << endl << "Variance : " << profile -> GetBinError( 1 ) * profile -> GetBinError( 1 ) << endl << endl; */ cout << "Simple classe : "; chrono = new progress_timer(); for ( i = 0; i < total; i++ ) simpleStat . fill( Data[ i ] ); delete chrono; cout << "Mean : " << simpleStat . mean() << endl << "Variance : " << simpleStat . variance() << endl << endl << flush; delete profile; return 0; }
And here is the output :
Test using 268435455 data points. Lazy variance accumulator : 0.27 s Mean : -0.497278 Variance : 5.30971 Immediate variance accumulator : 9.07 s Mean : -0.497278 Variance : 5461.31 TProfile : 15.56 s Mean : -0.497278 Variance : 5461.31 Simple classe : 7.28 s Mean : -0.497278 Variance : 5461.31
Note that all mean values agree, and all variance values but the lazy one do too.
Cheers, johan
Change History (5)
comment:1 by , 10 years ago
Owner: | changed from | to
---|
comment:2 by , 10 years ago
Resolution: | → worksforme |
---|---|
Status: | new → closed |
I have failed to reproduce this bug with both g++ 4.5 and 4.6 and using Boost 1.44, 1.46.1 and later releases but always get the same variance with lazy, immediate, and the "Simple classe". I am thus closing the ticket.
comment:3 by , 10 years ago
Well, thank you for having a look. However I have to disagree. I did the test again this morning, with this little change made to the given source code:
unsigned i, total( 0xFFFF1 );
and the result is:
Test using 1048561 data points. Lazy variance accumulator : 0.00 s Mean : -0.436519 Variance : 1369.89 Immediate variance accumulator : 0.02 s Mean : -0.436519 Variance : 5465.95 Simple classe : 0.01 s Mean : -0.436519 Variance : 5465.95
Test performed on Mandriva 2011 64 bits using gcc 4.6.1 and boost 1.46.1.
comment:4 by , 10 years ago
Resolution: | worksforme |
---|---|
Status: | closed → reopened |
Thank you for the updated example which now indeed shows what you report. The reason is simply an overflow in the sum of squares. To avoid it just use double instead of signed int as sample type - just as you do in your own reference class where you store MeanSq also as double.
comment:5 by , 10 years ago
Resolution: | → wontfix |
---|---|
Status: | reopened → closed |
Matthias, can you have a look at this one when you get a chance?