Opened 10 years ago

Last modified 6 years ago

#6992 assigned Bugs

accumulator's median feature skips 1st two data points.

Reported by: polyactis@… Owned by: Eric Niebler
Milestone: To Be Determined Component: accumulator
Version: Boost 1.41.0 Severity: Problem
Keywords: Cc:

Description

Hi,

I used accumulator's median feature to calculate mean/median of input data. I also used armadillo library's mean/median. The mean from two libraries always agree. However the medians don't. I tried to give 1,2,3,4,5 input data to the program. and found accumulator's median outputs 0 if the number of input data is 1 or 2. It starts to output non-zero median if given more than 2 data points however, first two data points are not used.

Please check what is going on.

Thanks, yu

Attachments (2)

CalculateMedianMeanOfInputColumn.cc (5.6 KB ) - added by polyactis@… 10 years ago.
source code
CalculateMedianMeanOfInputColumn.h (2.1 KB ) - added by polyactis@… 10 years ago.
header file

Download all attachments as: .zip

Change History (6)

by polyactis@…, 10 years ago

source code

by polyactis@…, 10 years ago

header file

comment:1 by Eric Niebler, 10 years ago

Owner: changed from Eric Niebler to Matthias Troyer

Matthias, can you have a look at this one when you get a chance?

comment:2 by Matthias Troyer, 10 years ago

Eric, having dinner with you in Redmond is dangerous especially if I agree to take a look at the open accumulator tickets. Here is my conclusion after looking at the issue:

First, median estimated are notoriously hard and you never get an exact median unless you store at least half of the samples. Hence, unlike the mean which can easily and unambiguously be estimated (as long as the variance is finite), median estimation is harder and different algorithms to estimate the median will give different results.

Our default estimator is a P2 quantile estimator, which only stores and updates five numbers and hence has a minimal memory footprint. However, it requires at least five samples before it gives sensible output and I am thus not surprised that using less than five samples does not work.

Shall we throw an exception if the count is less than 5, or just document it more clearly?

Last edited 10 years ago by Eric Niebler (previous) (diff)

comment:3 by Eric Niebler, 10 years ago

Owner: changed from Matthias Troyer to Eric Niebler
Status: newassigned

There are many open accumulator tickets. I only gave you the two that required knowledge of the algorithms that you wrote. :-) Thanks for having a look.

If the precondition of the p2 quantile accumulator is that it must have at least 5 samples, then the right thing to do is assert the condition in the accessor. And document it. I'll add this to my to-do list, unless you beat me to it.

Thanks again.

comment:4 by A. Sinan Unur <sinan@…>, 6 years ago

Leaving aside the fact that the point of the P2 algorithm is to deal with much larger sample sizes, there is no reason the implementation cannot return the exact median for n <= 5.

It is the P2 algorithm's approximation that kicks in when the 6th observation arrives. Your implementation could simply return the exact median until then.

Note: See TracTickets for help on using tickets.