Boost C++ Libraries: Ticket #12039: cpp_bin_float convert_to() rounding mistake

attachment set

Wed, 02 Mar 2016 15:33:58 GMT

attachment → convert_test.cpp

convert_to<double>() bug reproducer

component changed; owner set

John Maddock — Sat, 12 Mar 2016 16:52:59 GMT

owner set to John Maddock
component None → multiprecision

status changed; resolution set

John Maddock — Tue, 15 Mar 2016 08:16:47 GMT

status new → closed
resolution → fixed

Fixed in https://github.com/boostorg/multiprecision/commit/8a8b2211d4348477dead2b4f4a21a23ccbd84a48

Thanks for the report!

attachment set

Tue, 15 Mar 2016 11:45:29 GMT

attachment → convert32_test.cpp

Tue, 15 Mar 2016 11:47:31 GMT

Hi John

There is another, more obscure but very similar problem: double rounding in conversion to 'float' caused by an initial conversion to 'double'. In practice, because of the huge difference in precision between 'double' and 'float', this problem is far less severe than the previous one, and it is very unlikely to negatively affect real-world applications. Still, when you know where to look, it can be easily demonstrated.

Maybe, while you are at it, it makes sense to fix this minor problem as well?

A test case (convert32_test.cpp) is attached above.

John Maddock — Tue, 15 Mar 2016 18:25:51 GMT

That case was already fixed by the patch above. However, there was a different one involving rounding of ties; the test case and patch for that are in: https://github.com/boostorg/multiprecision/commit/a96bea66e191ba827626a75c9721f3018c7fb1f3

attachment set

shatz@… — Sat, 19 Mar 2016 22:02:26 GMT

attachment → convert_to_double_core.cpp

shatz@… — Sat, 19 Mar 2016 22:05:30 GMT

Maybe I don't know how to fetch your patch (my github skills are below basic), but it seems to me that instead of fixing the bad case you have broken the good case. Since my stupid test only compares two conversion results to each other, it is happy. But it shouldn't be.

I can't say that I fully understand your cpp_bin_float format (in particular, I can't figure out the business with guards), but assuming that I haven't misunderstood too badly, I recommend the attached core routine for conversion to double. In this routine, rounding and tie handling are done by the compiler/hardware rather than by us. Sometimes it does a better job. As an additional advantage, it is likely several times (or many times for wide numbers) faster than your variant. Of course, it only works when arg.backend().bits().limbs() has the type 'uint64_t*' or its equivalent. I didn't figure out whether that is the case on all supported platforms or only on mine (x64). But even if it's the latter, it still makes sense to specialize, because I think it's safe to assume that the x64 platform is by far the most important for your customers.

Another thing that I didn't figure out is what happens when the number of binary digits is not an integer multiple of 64. But I believe you will have no trouble handling this case as well. At worst it will take a simple mask applied to the first (i.e. least significant) word.

Best regards, Michael

shatz@… — Mon, 21 Mar 2016 23:24:44 GMT

Hi,

My previous comment was wrong. Because of my unfamiliarity with github, I was testing the version after the first patch, not after the second patch. The second version looks OK. Actually, it is pretty close to my proposed variant above. It is a bit slower than mine, but the difference in speed is not much above noise level.

Good job, John. Thank you.

Regards, Michael

anonymous — Tue, 22 Mar 2016 08:22:40 GMT

OK good, I just pushed a few additional commits that add an exhaustive convert-to-float test program (it takes about half a day to run, so it is not part of the regular tests). This uncovered a fencepost error in the subtraction code, plus a double-rounding error in mpfr_float_backend (both also fixed).

The main advantage of the new code is that it uses cpp_bin_float's existing rounding code to ensure a correct result, and is completely agnostic as to the widest integer size, or the size of the float being converted to. In trivial cases it basically degenerates to the same principle as yours - the temporary cpp_bin_float becomes a trivial wrapper around a single unsigned integer, and the bit-extraction loop executes just the once.