Opened 11 years ago

Closed 10 years ago

#6057 closed Bugs (wontfix)

Lazy variance in accumulator gives wrong result when too high statistics

Reported by: Johan <johan.luisier@…> Owned by: Matthias Troyer
Milestone: To Be Determined Component: accumulator
Version: Boost 1.44.0 Severity: Problem
Keywords: accumulator lazy_variance Cc:

Description

Hello,

I was trying to compare performances of various ways to compute the variance of a distribution of integer values, because I need a very fast one.

I tried to use lazy_variance and variance, a custom algorithm, (and also TProfile class from ROOT).

What I saw is that using lazy_variance gives the correct mean value but a weird variance value when the number of events becomes large (n > 0x3FFFF).

The bug was seen first with boost 1.44.0 on a 32 bit Fedora 14 machine (using gcc 4.5.1), and also with boost 1.46.1 on a 64 bit Mandriva 2011 (using gcc 4.6.1).

Here below is the source code to reproduce the bug (ROOT stuff commented) :

#include <iostream>

#include <vector>

#include <cmath>

// accumulators
#include <boost/accumulators/accumulators.hpp>
#include <boost/accumulators/statistics/stats.hpp>
#include <boost/accumulators/statistics/mean.hpp>
#include <boost/accumulators/statistics/variance.hpp>

// root
//#include <TProfile.h>

// timing info
#include <boost/progress.hpp>

// Nbr aléatoires
#include <boost/random/lagged_fibonacci.hpp>
#include <boost/random/variate_generator.hpp>
#include <boost/random/uniform_int.hpp>

using namespace std;
using namespace boost::accumulators;
using namespace boost;

class MyStat
{
public:
  MyStat()
    : NbrEvt( 0u ), Mean( 0. ), MeanSq( 0. )
  {};
  ~MyStat()
  {};
  void fill( const int& ev )
  {
    Mean   = ( Mean * NbrEvt + ev ) / double( NbrEvt + 1 );
    MeanSq = ( MeanSq * NbrEvt + ev * ev ) / double( NbrEvt + 1 );
    NbrEvt++;
  };
  double mean() const
  {
    return Mean;
  };
  double variance() const
  {
    return MeanSq - Mean * Mean;
  };
protected:
  long unsigned NbrEvt;
  double Mean;
  double MeanSq;
};

int main( int argc, char* argv[] )
{
  lagged_fibonacci44497 GenerateurBase( 12345 );
  uniform_int< int > Distribution( -128, 127 );

  variate_generator< lagged_fibonacci44497&,
		     uniform_int< int > > Generateur( GenerateurBase,
						       Distribution );

  vector< int > Data;

  long unsigned i, total( 0xFFFF );

  for ( i = 0; i < total; i++ )
    Data . push_back( Generateur() );

  cout << "Test using " << total << " data points." << endl << endl;

  accumulator_set< signed, stats< tag::mean,
				  tag::lazy_variance > > lazyVar;
  accumulator_set< signed, stats< tag::mean,
				  tag::variance > > immedVar;
  
  //TProfile * profile = new TProfile( "profile", "My profile", 1, -.5, .5, "S" );

  MyStat simpleStat;

  progress_timer * chrono = 0;

  cout << "Lazy variance accumulator : ";

  chrono = new progress_timer();
  
  for ( i = 0; i < total; i++ )
    lazyVar( Data[ i ] );

  delete chrono;

  cout << "Mean : " << extract_result< tag::mean >( lazyVar ) << endl
       << "Variance : " << extract_result< tag::lazy_variance >( lazyVar )
       << endl << endl;

  cout << "Immediate variance accumulator : ";

  chrono = new progress_timer();
  
  for ( i = 0; i < total; i++ )
    immedVar( Data[ i ] );

  delete chrono;

  cout << "Mean : " << extract_result< tag::mean >( immedVar ) << endl
       << "Variance : "
       << extract_result< tag::variance >( immedVar )
       << endl << endl;

  /*
  cout << "TProfile : ";

  chrono = new progress_timer();
  
  for ( i = 0; i < total; i++ )
    profile -> Fill( 0., double( Data[ i ] ) );

  delete chrono;

  cout << "Mean : " << profile -> GetBinContent( 1 ) << endl
       << "Variance : " << profile -> GetBinError( 1 ) * profile -> GetBinError( 1 )
       << endl << endl;
  */

  cout << "Simple classe : ";

  chrono = new progress_timer();
  
  for ( i = 0; i < total; i++ )
     simpleStat . fill( Data[ i ] );

  delete chrono;

  cout << "Mean : " << simpleStat . mean() << endl << "Variance : "
       << simpleStat . variance() << endl << endl << flush;
  

  delete profile;

  return 0;
}

And here is the output :

Test using 268435455 data points.
Lazy variance accumulator : 0.27 s

Mean : -0.497278
Variance : 5.30971

Immediate variance accumulator : 9.07 s
Mean : -0.497278
Variance : 5461.31

TProfile : 15.56 s
Mean : -0.497278
Variance : 5461.31

Simple classe : 7.28 s
Mean : -0.497278
Variance : 5461.31

Note that all mean values agree, and all variance values but the lazy one do too.

Cheers, johan

Change History (5)

comment:1 by Eric Niebler, 10 years ago

Owner: changed from Eric Niebler to Matthias Troyer

Matthias, can you have a look at this one when you get a chance?

comment:2 by Matthias Troyer, 10 years ago

Resolution: worksforme
Status: newclosed

I have failed to reproduce this bug with both g++ 4.5 and 4.6 and using Boost 1.44, 1.46.1 and later releases but always get the same variance with lazy, immediate, and the "Simple classe". I am thus closing the ticket.

comment:3 by Johan <johan.luisier@…>, 10 years ago

Well, thank you for having a look. However I have to disagree. I did the test again this morning, with this little change made to the given source code:

unsigned i, total( 0xFFFF1 );

and the result is:

Test using 1048561 data points.

Lazy variance accumulator : 0.00 s

Mean : -0.436519
Variance : 1369.89

Immediate variance accumulator : 0.02 s

Mean : -0.436519
Variance : 5465.95

Simple classe : 0.01 s

Mean : -0.436519
Variance : 5465.95

Test performed on Mandriva 2011 64 bits using gcc 4.6.1 and boost 1.46.1.

comment:4 by Matthias Troyer, 10 years ago

Resolution: worksforme
Status: closedreopened

Thank you for the updated example which now indeed shows what you report. The reason is simply an overflow in the sum of squares. To avoid it just use double instead of signed int as sample type - just as you do in your own reference class where you store MeanSq also as double.

comment:5 by Matthias Troyer, 10 years ago

Resolution: wontfix
Status: reopenedclosed
Note: See TracTickets for help on using tickets.