Opened 5 years ago

Last modified 4 years ago

#13402 new Bugs

Log format JUNIT generates invalid XML files with incorrect encoding

Reported by: gallien@… Owned by: Gennadiy Rozental
Milestone: To Be Determined Component: test
Version: Boost 1.66.0 Severity: Problem
Keywords: Cc: gallien@…

Description

The JUNIT XML file written on Windows (test module compiled with Visual Studio 2013) is encoded in CP1252, but its XML declaration states 'encoding="UTF-8"'. The output should either always be converted to UTF-8, or the encoding declared in the JUNIT file should be replaced by the actual encoding of the stream.

Example output file:

<?xml version="1.0" encoding="UTF-8"?>
<testsuite tests="0" skipped="0" errors="1" failures="2" id="0" name="Master_Test_Suite" time="44.222">
<properties>
<property name="platform" value="Win32" />
<property name="compiler" value="Microsoft Visual C++ version 12.0" />
<property name="stl" value="Dinkumware standard library version 610" />
<property name="boost" value="1.66.0" />
</properties>

Change History (14)

comment:1 by Andreas Gallien <gallien@…>, 5 years ago

Cc: gallien@… added

comment:2 by Raffi Enficiaud, 5 years ago

There is no easy way to know the encoding of the source files. Also, the sources of boost.test and the current test module should have consistent encoding. So outputting the "real" encoding will not work.

Apart from mentioning the issue in the documentation, I do not see any easy way to address this. All the logged information coming from boost.test is ANSI, so claiming UTF8 should not cause any issue.

Your thoughts?

comment:3 by gallien@…, 5 years ago

In my opinion, an ANSI-encoded file with the header information UTF-8 <?xml version="1.0" encoding="UTF-8"?> is not valid. Why is it not possible to force a specific encoding like UTF-8 for the written log file, e.g. with a new boost test parameter (http://www.boost.org/doc/libs/1_66_0/libs/test/doc/html/boost_test/utf_reference/rt_param_reference.html) like 'log_encoding'?

comment:4 by Raffi Enficiaud, 5 years ago

Sorry I meant ASCII-7bits: it is a subset of chars that does not require any escaping, so in my opinion it is correct to say that this is UTF-8 since no UTF-8 escaping is involved. See for instance here: https://en.wikipedia.org/wiki/ASCII#Unicode
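The claim above holds only as long as every logged byte is below 0x80. A minimal sketch of that check follows; the function name is_seven_bit_ascii is hypothetical and not part of Boost.Test.

```cpp
#include <string>

// Hypothetical helper (not a Boost.Test API): returns true when every byte
// of `s` is below 0x80. Such a string is 7-bit ASCII and is therefore
// already valid UTF-8 without any conversion.
bool is_seven_bit_ascii(const std::string& s) {
    for (unsigned char c : s) {
        if (c >= 0x80) {
            return false;
        }
    }
    return true;
}
```

A string such as "Master_Test_Suite" passes this check, while any string containing a CP1252 byte such as 0xD6 does not.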

In order to be correct in the encoding, we have to include other libraries (eg. ICU) that handle any source encoding and then map it to UTF-8 properly. While relevant, especially for string comparison, this is not my priority right now.

Determining the source encoding depends on how the files/compiler were configured at the time of Boost.Test compilation (MBCS vs Unicode on Windows) and for the module being tested. There is no trivial way to detect this, so assumptions have to be made, and these assumptions should be consistent in all situations (boost.test as an external static library, for instance). Also, you see that this is not at all related to the current locale.

What would be possible is to let the user specify the encoding in some way, and since Boost.Test is ASCII only, then it should not matter how Boost.Test is compiled as long as the user's encoding is ASCII backward compatible (latin1, utf8).

But to me, all this will just lead to confusion.

comment:5 by anonymous, 5 years ago

Thank you for your clarification. I will take a closer look at our log file. IMHO ASCII-7bits will be fine. Maybe the log file is manipulated in another way, because of our setup.

comment:6 by Raffi Enficiaud, 5 years ago

Maybe you can attach to this ticket your problematic JUNIT file?

comment:7 by Raffi Enficiaud, 5 years ago

Kind reminder

comment:8 by Raffi Enficiaud, 5 years ago

Any news on this? Were you able to check something on your end?

comment:9 by sebastian.freitag@…, 5 years ago

I just found this ticket after experiencing the same issue.

tl;dr summary: boost test writes one-byte characters into junit xml output that are not supposed to exist in utf-8. for example the german umlaut Ö is 0x00D6 in UTF-8 but gets written as 0xD6 into the file. Only one-byte character values < 128 are valid 1-byte UTF-8 sequences.

How I found it:

One of my tests is doing the following comparison:

// oelniveau is std::string, previously read from a windows-1252 encoded textfile
// Ö is escaped here as \326 because our source code file is UTF-8
// and comparing the Ö string literal, in UTF-8, with the variable will 
// fail even when it is supposed to pass.
BOOST_TEST("\326lniveau" == oelniveau);

The JUNIT output then contains something like this (when I let it fail on purpose by putting "something" into the variable):

ASSERTION FAILURE:
(...)
- message: check "\326lniveau" == oelniveau has failed [?lniveau != something]
(...)

Here opened in an editor "as utf-8". The ? shows that the xml file has a character for the Ö that will not pass as a valid UTF-8 sequence. And xmllint complains about the file:

result.xml: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xD6 0x6C 0x6E 0x69

And a typical junit plugin from jenkins complains as well:

com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
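Both parser errors above come down to the same rule: a lead byte such as 0xD6 announces a two-byte sequence, so the next byte must be a continuation byte in the range 0x80..0xBF, and 0x6C is not. The following is a sketch of a well-formedness check that could be run on a log before handing it to an XML parser; the function name is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the offset of the first byte at which `s`
// stops being well-formed UTF-8, or std::string::npos when all of it is valid.
std::size_t first_invalid_utf8_offset(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (lead < 0x80) {
            len = 1; cp = lead;                 // plain ASCII
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            len = 2; cp = lead & 0x1Fu;         // two-byte sequence
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            len = 3; cp = lead & 0x0Fu;         // three-byte sequence
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            len = 4; cp = lead & 0x07u;         // four-byte sequence
        } else {
            return i;                           // stray continuation or illegal lead byte
        }
        if (len > n - i) return i;              // sequence is truncated
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0u) != 0x80u) return i; // not a continuation byte
            cp = (cp << 6) | (c & 0x3Fu);
        }
        // Reject overlong encodings, surrogates and values above U+10FFFF.
        if (len == 3 && cp < 0x800) return i;
        if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return i;
        if (cp >= 0xD800 && cp <= 0xDFFF) return i;
        i += len;
    }
    return std::string::npos;
}
```

For the bytes reported by xmllint (0xD6 0x6C 0x6E 0x69) the check fails at the 0xD6 byte itself, because the byte after it is an ASCII letter.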

comment:10 by sebastian.freitag@…, 5 years ago

One additional piece of info: I compile my stuff with clang on osx, with clang on linux and with msvc on windows 10. The error is on all three platforms.

comment:11 by Raffi Enficiaud, 5 years ago

I do not quite understand why you need to escape the Ö if your file is encoded in UTF-8. So, I will ask dumb questions until I get it right.

From this table http://www.utf8-chartable.de/ the correct utf-8 for Ö / U+00D6 is the sequence of bytes "0xc3 0x96".
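For reference, the byte pair from the table can be derived from the code point with the standard UTF-8 bit layout. This is only an illustrative sketch: the function name encode_utf8 is hypothetical, and the input is assumed to be a valid Unicode scalar value (no surrogate or range checks).

```cpp
#include <string>

// Hypothetical helper: encodes one code point as UTF-8.
// Assumes `cp` is a valid Unicode scalar value; nothing is validated here.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For U+00D6 the two-byte branch applies: the upper bits give 0xC3 and the lower six bits give 0x96.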

What about transforming your string to either

  • BOOST_TEST("Ölniveau" == oelniveau); as you are saying your files are written in UTF-8
  • or BOOST_TEST("\xc3\x96lniveau" == oelniveau);

My gut feeling is that the preprocessor does something with the octal representation of Ö. 0xD6 0x6C 0x6E 0x69 seems to mean

  • 0xD6 missing the following 00 for the Ö and/or not being escaped as UTF-8 (should be 0xC3 0x96)
  • 0x6C for the l of Ölniveau
  • 0x6E for the n of Ölniveau
  • 0x69 for the i of Ölniveau

The other possibility is that the file that is opened for the JUNIT output interprets its content based on the locale. Would you mind also changing the locale like this

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8

and rerun the check?

Thanks

Last edited 5 years ago by Raffi Enficiaud

comment:12 by sebastian.freitag@…, 5 years ago

Your first question: I need to escape, because my source code file is utf-8, but I want to test the string in the variable oelniveau, and THIS string is supposed to be single byte windows-1252 encoded "Ölniveau". If I transform it, like you suggest, my test will always fail.
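To make the mismatch concrete, here is a small sketch of the two literals under discussion. The variable names are made up for illustration; only the byte values come from the thread.

```cpp
#include <string>

// Windows-1252 bytes for "Ölniveau": the octal escape \326 is the single
// byte 0xD6, so this literal matches data read from a CP1252 text file.
const std::string cp1252_oelniveau = "\326lniveau";

// UTF-8 bytes for the same text: Ö becomes the two bytes 0xC3 0x96, which is
// also what a plain "Ölniveau" literal contains in a UTF-8 source file when
// the compiler keeps the source bytes unchanged.
const std::string utf8_oelniveau = "\xc3\x96lniveau";
```

The first string is 8 bytes long and the second 9, so comparing the UTF-8 literal against the CP1252 variable can never succeed.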

Regarding your second question: I compile and run tests on a few different VMs. I checked the locale and on the Linux machine it's identical to the one you suggested.

I realize how difficult it is to guarantee that everything printed to the junit XML is valid utf-8. My hack now is to just replace the first line in the XML before processing it further and rather live with a few mangled garbage utf-8 characters than having to deal with exceptions from the Java program parsing the XML. Would it hurt to do so in general and set encoding="windows-1252" by default? We are speaking about output from failed tests in the end. So I guess in any case you would rather want to have any info over crashing your post processor.
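The first-line replacement described above could look roughly like the following sketch. The function name is hypothetical, and it assumes the declaration is spelled exactly encoding="UTF-8" as in the example output of this ticket.

```cpp
#include <cstddef>
#include <string>

// Hypothetical post-processing step: replaces encoding="UTF-8" inside the XML
// declaration with another encoding name. The input is returned unchanged if
// the attribute is not found before the end of the declaration ("?>").
std::string redeclare_encoding(std::string xml, const std::string& new_encoding) {
    const std::string old_attr = "encoding=\"UTF-8\"";
    const std::size_t decl_end = xml.find("?>");
    const std::size_t pos = xml.find(old_attr);
    if (decl_end != std::string::npos && pos != std::string::npos && pos < decl_end) {
        xml.replace(pos, old_attr.size(), "encoding=\"" + new_encoding + "\"");
    }
    return xml;
}
```

This only changes what the file claims about itself. Any UTF-8 text that also ends up in the log would then be misread as windows-1252.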

comment:13 by Raffi Enficiaud, 5 years ago

> I realize how difficult it is to guarantee that everything printed to the junit XML is valid utf-8.

Indeed. The problem that you are facing, as I understand it, is that you are comparing a string in the cp1252 domain that is not pure ASCII, while an std::string does not carry any encoding information. This cp1252 string is written as is to the JUNIT file, because boost.test does not interpret anything.

This is a shortcoming that I believe boost.test should address at some point, but OTOH boost.test does not interpret any char that is outputted, simply because boost.test does not know anything about encoding. I do not know if I should at some point support this: unicode and code-point transformation are natively supported on Windows, while on other operating systems I need to include an external library, which I do not want. I haven't looked into C++11 encoding facilities, maybe it is easier now.

The idea would be to be able to declare what encoding is being used for strings, and to transform them to utf-8. Transforming to utf-8 is also something that you have to do to be correct: if you say that your source code is utf-8, it is likely that at some point you will output a string that is utf-8 encoded, while here you are willing to turn everything to cp1252 because the input is cp1252. This approach will not scale very well, as several encodings will get mixed in the resulting log. The right approach would be to transform everything to e.g. utf-8 (or at least to the encoding that is declared in the xml file).

For now, I would just suggest transforming the strings to utf-8 before they reach the log, until I come up with a correct handling in boost.test. After all, there are not so many cp1252 chars that need to be transformed (and that you need).
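A user-side conversion along these lines might look like the sketch below. It is not part of Boost.Test, and the function name is hypothetical. Bytes 0xA0..0xFF map to the Unicode code point with the same value, bytes 0x80..0x9F go through a small table, and the five bytes that Windows-1252 leaves undefined are passed through as the C1 control of the same value, which is an assumption of this sketch.

```cpp
#include <string>

namespace {
// Unicode code points for Windows-1252 bytes 0x80..0x9F. The undefined bytes
// (0x81, 0x8D, 0x8F, 0x90, 0x9D) are mapped to the C1 control with the same
// value; that choice is an assumption of this sketch.
const char32_t kCp1252High[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
};
}  // namespace

// Hypothetical helper: converts a Windows-1252 byte string to UTF-8.
std::string cp1252_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char b : in) {
        char32_t cp = b;
        if (b >= 0x80 && b <= 0x9F) {
            cp = kCp1252High[b - 0x80];
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            // Every table entry is below U+10000, so three bytes suffice.
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The converted copy would be used only for the values that get printed; the comparison itself still needs the original cp1252 bytes.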

comment:14 by kai.unger@…, 4 years ago

Sometimes it is not possible to convert the strings logged: There might be system messages logged internally.

In my case a network error (on a German Windows 7) generates an exception in boost.asio which is logged by boost.test automatically as "![CDATA[class boost::exception_detail::clone_impl<struct boost::exception_detail::error_info_injector<class boost::system::system_error> >: bind: Die angeforderte Adresse ist in diesem Kontext ungültig]]". The u-umlaut corrupts the xml.
