Opened 7 years ago

Closed 7 years ago

Last modified 7 years ago

#12039 closed Bugs (fixed)

cpp_bin_float convert_to<double>() rounding mistake

Reported by: Michael Shatz <shatz@…> Owned by: John Maddock
Milestone: To Be Determined Component: multiprecision
Version: Boost 1.61.0 Severity: Problem
Keywords: Cc:

Description

There are cases where cpp_bin_float convert_to<double>() does not produce the nearest 'double' result. It happens when the source argument differs form the middle point between two double-precision values in 65th bit or further. I didn't look at boost source code, but pretty sure that the mistakes occurs due to double rounding. I.e. instead of direct rounding to 53-bit precision of 'double' the number is initially converted to 64-bit 'long double' and then converted from 'long double' to 'double'.

In particular, on Microsoft compilers 'long double' and 'double' refer to the same type and the mistake does not happen. But on the same machine/OS with gcc compiler the mistake does happen.

Below attached a simple test case that prints 'good' on platforms without bug and prints bad on platforms with bug.

Attachments (3)

convert_test.cpp (516 bytes ) - added by Michael Shatz <shatz@…> 7 years ago.
convert_to<double>() bug reproducer
convert32_test.cpp (719 bytes ) - added by Michael Shatz <shatz@…> 7 years ago.
convert_to_double_core.cpp (716 bytes ) - added by shatz@… 7 years ago.

Download all attachments as: .zip

Change History (10)

by Michael Shatz <shatz@…>, 7 years ago

Attachment: convert_test.cpp added

convert_to<double>() bug reproducer

comment:1 by John Maddock, 7 years ago

Component: Nonemultiprecision
Owner: set to John Maddock

comment:2 by John Maddock, 7 years ago

Resolution: fixed
Status: newclosed

by Michael Shatz <shatz@…>, 7 years ago

Attachment: convert32_test.cpp added

comment:3 by Michael Shatz <shatz@…>, 7 years ago

Hi John

There exists another more obscure but very similar problem: double rounding in conversion to 'float' due to initial conversion to 'double'. Practically, due to huge difference in precision between 'double' and 'float', this problem is far less severe than the previous one. It is very unlikely that it negatively affect real-world applications. Still, when you know where to look, it can be easily demonstrated.

May be, when you are at it, it makes sense to fix this minor problem as well?

A test case (convert32_test.cpp) attached above.

comment:4 by John Maddock, 7 years ago

That case was already fixed by the patch above - however there was a different one involving rounding of ties - test case and patch for that in: https://github.com/boostorg/multiprecision/commit/a96bea66e191ba827626a75c9721f3018c7fb1f3

by shatz@…, 7 years ago

Attachment: convert_to_double_core.cpp added

comment:5 by shatz@…, 7 years ago

May be, I don't know how fetch your patch (my github skills are below basic), but it seems to me that instead of fixing bad case you had broken good case. Since my stupid test only compares two results of conversion to each other, it is happy. But it shouldn't.

I can't say that I fully understood your cpp_bin_float format (in particular, I can't figure out the business with guards) but assuming that I didn't misunderstood too badly, I recommend the attached core routine for conversion to double. In this routine rounding/ties handling is done by compiler/hardware rather than by us. Sometimes it does a better job. As additional advantage, it's likely several times (or many times for wide numbers) faster than your variant. Of course, it only works when arg.backend().bits().limbs() has a type 'uint64_t*' or its equivalent. I didn't figure out if it's a case on all supported platforms or only on mine (x64). But even if it's the later, it still makes sense to specialize, because I think it's safe to assume that x64 platform is by far the most important for your customers.

Another thing that I didn't figure out is what happens when # of binary digits is not an integer multiple of 64. But I would believe, that you will have no trouble to handle this case as well. At worst it will take a simple mask applied to the first (i.e. least significant) word.

Best regards, Michael

comment:6 by shatz@…, 7 years ago

Hi,

My previous comment was wrong. Because of my unfamiliarity with github, I was testing the version after the first patch, not after the second patch. The second version looks o.k. Actually, it is pretty close to my proposed variant above. A bit slower than mine, but a difference in speed is not much above noise level.

Good job, John. Thank you.

Regards, Michael

comment:7 by anonymous, 7 years ago

OK good, I just pushed a few additional commits, adds an exhaustive convert-to-float test program (takes about half a day to run, so not part of the regular tests). This uncovered a fencepost error in the subtraction code, plus a double-rounding error in mpfr_float_backend (both also fixed).

The main advantage of the new code is that it uses cpp_bin_float's existing rounding code to ensure a correct result, and is completely agnostic as to the widest integer size, or the size of the float being converted to. In trivial cases it basically degenerates to the same principle as yours - the temporary cpp_bin_float becomes a trivial wrapper around a single unsigned integer, and the bit-extraction loop executes just the once.

Note: See TracTickets for help on using tickets.