Boost C++ Libraries: Ticket #13402: Log format JUNIT generates invalid XML files with incorrect encoding https://svn.boost.org/trac10/ticket/13402 <p> The encoding of the written JUNIT XML file is CP1252 when compiled on Windows with Visual Studio 2013, but the XML declaration says 'encoding="UTF-8"'. The output should always be converted to 'UTF-8', or the encoding declared in the JUNIT file should be replaced by the encoding of the stream. </p> <p> Example output file: </p> <pre class="wiki">&lt;?xml version="1.0" **encoding="UTF-8"**?&gt; &lt;testsuite tests="0" skipped="0" errors="1" failures="2" id="0" name="Master_Test_Suite" time="44.222"&gt; &lt;properties&gt; &lt;property name="platform" value="Win32" &lt;property name="compiler" value="Microsoft Visual C++ version 12.0" &lt;property name="stl" value="Dinkumware standard library version 610" &lt;property name="boost" value="1.66.0" &lt;/properties&gt; </pre> Andreas Gallien <gallien@…> Thu, 18 Jan 2018 12:17:25 GMT cc set https://svn.boost.org/trac10/ticket/13402#comment:1 https://svn.boost.org/trac10/ticket/13402#comment:1 <ul> <li><strong>cc</strong> <span class="trac-author">gallien@…</span> added </li> </ul> Ticket Raffi Enficiaud Thu, 18 Jan 2018 14:20:38 GMT <link>https://svn.boost.org/trac10/ticket/13402#comment:2 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/13402#comment:2</guid> <description> <p> There is no easy way to know the encoding of the source files. Also, the sources of boost.test and the current test module should have consistent encoding. So outputting the "real" encoding will not work. </p> <p> Apart from mentioning the issue in the documentation, I do not see any easy way to address this. All the logged information coming from boost.test is ANSI, so claiming UTF-8 should not cause any issue. </p> <p> Your thoughts? 
</p> </description> <category>Ticket</category> </item> <item> <author>gallien@…</author> <pubDate>Fri, 26 Jan 2018 02:39:24 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/13402#comment:3 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/13402#comment:3</guid> <description> <p> In my opinion, an ANSI-written file whose header declares UTF-8 (&lt;?xml version="1.0" <strong>encoding="UTF-8"</strong>?&gt;) is not valid. Why is it not possible to force a specific encoding such as UTF-8 for the written log file, e.g. with a new Boost.Test parameter (<a href="http://www.boost.org/doc/libs/1_66_0/libs/test/doc/html/boost_test/utf_reference/rt_param_reference.html">http://www.boost.org/doc/libs/1_66_0/libs/test/doc/html/boost_test/utf_reference/rt_param_reference.html</a>) like 'log_encoding'? </p> </description> <category>Ticket</category> </item> <item> <dc:creator>Raffi Enficiaud</dc:creator> <pubDate>Fri, 26 Jan 2018 08:37:41 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/13402#comment:4 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/13402#comment:4</guid> <description> <p> Sorry, I meant 7-bit ASCII: it is a subset of characters that does not require any escaping, so in my opinion it is correct to say that this is UTF-8, since no UTF-8 escaping is involved. See for instance here: <a class="ext-link" href="https://en.wikipedia.org/wiki/ASCII#Unicode"><span class="icon">​</span>https://en.wikipedia.org/wiki/ASCII#Unicode</a> </p> <p> In order to be correct in the encoding, we would have to include other libraries (e.g. ICU) that handle any source encoding and then map it to UTF-8 properly. While relevant, especially for string comparison, this is not my priority right now. </p> <p> Determining the source encoding depends on how the files/compiler were configured at the time of Boost.Test compilation (MBCS vs Unicode on Windows) and for the module being tested. 
There is no trivial way to detect this, so assumptions would have to be made, and these assumptions should be consistent in all situations (boost.test as an external static library, for instance). Also, you see that this is not at all related to the current locale. </p> <p> What would be possible is to let the user specify the encoding in some way, and since Boost.Test is ASCII only, it should not matter how Boost.Test is compiled as long as the user's encoding is backward compatible with ASCII (latin1, utf8). </p> <p> But to me, all this will just lead to confusion. </p> </description> <category>Ticket</category> </item> <item> <dc:creator>anonymous</dc:creator> <pubDate>Fri, 26 Jan 2018 18:40:22 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/13402#comment:5 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/13402#comment:5</guid> <description> <p> Thank you for your clarification. I will take a closer look at our log file. IMHO 7-bit ASCII will be fine. Maybe the log file is manipulated in another way because of our setup. </p> </description> <category>Ticket</category> </item> <item> <dc:creator>Raffi Enficiaud</dc:creator> <pubDate>Fri, 26 Jan 2018 20:04:28 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/13402#comment:6 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/13402#comment:6</guid> <description> <p> Maybe you can attach your problematic JUNIT file to this ticket? 
</p> </description> <category>Ticket</category> </item> <item> <dc:creator>Raffi Enficiaud</dc:creator> <pubDate>Sun, 04 Feb 2018 18:43:58 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/13402#comment:7 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/13402#comment:7</guid> <description> <p> Kind reminder </p> </description> <category>Ticket</category> </item> <item> <dc:creator>Raffi Enficiaud</dc:creator> <pubDate>Sat, 10 Feb 2018 14:17:14 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/13402#comment:8 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/13402#comment:8</guid> <description> <p> Any news on this? Were you able to check something on your end? </p> </description> <category>Ticket</category> </item> <item> <author>sebastian.freitag@…</author> <pubDate>Fri, 06 Apr 2018 17:56:11 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/13402#comment:9 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/13402#comment:9</guid> <description> <p> I just found this ticket after experiencing the same issue. </p> <p> tl;dr summary: Boost.Test writes one-byte characters into the JUnit XML output that are not supposed to exist in UTF-8. For example, the German umlaut Ö is U+00D6 in Unicode but gets written as the single byte 0xD6 into the file. Only one-byte character values &lt; 128 are valid 1-byte UTF-8 sequences. </p> <p> How I found it: </p> <p> One of my tests is doing the following comparison: </p> <pre class="wiki">// oelniveau is std::string, previously read from a windows-1252 encoded textfile // Ö is escaped here as \326 because our source code file is UTF-8 // and comparing the Ö string literal, in UTF-8, with the variable will // fail even when it is supposed to pass. 
BOOST_TEST("\326lniveau" == oelniveau); </pre><p> The JUNIT output then contains something like this (when I let it fail on purpose by putting "something" into the variable): </p> <pre class="wiki">ASSERTION FAILURE: (...) - message: check "\326lniveau" == oelniveau has failed [?lniveau != something] (...) </pre><p> Here it is opened in an editor "as utf-8". The ? shows that the XML file contains a byte for the Ö that is not a valid UTF-8 sequence. And xmllint complains about the file: </p> <pre class="wiki">result.xml: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xD6 0x6C 0x6E 0x69 </pre><p> And a typical JUnit plugin for Jenkins complains as well: </p> <pre class="wiki">com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence. </pre> </description> <category>Ticket</category> </item> <item> <author>sebastian.freitag@…</author> <pubDate>Fri, 06 Apr 2018 17:58:10 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/13402#comment:10 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/13402#comment:10</guid> <description> <p> One additional piece of info: I compile my stuff with clang on OS X, with clang on Linux, and with MSVC on Windows 10. The error occurs on all three platforms. </p> </description> <category>Ticket</category> </item> <item> <dc:creator>Raffi Enficiaud</dc:creator> <pubDate>Fri, 06 Apr 2018 21:54:42 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/13402#comment:11 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/13402#comment:11</guid> <description> <p> I do not quite understand why you need to escape the Ö if your file is encoded in UTF-8. So, I will ask dumb questions until I get it right. </p> <p> According to this table <a class="ext-link" href="http://www.utf8-chartable.de/"><span class="icon">​</span>http://www.utf8-chartable.de/</a>, the correct UTF-8 encoding for Ö / U+00D6 is the byte sequence "0xc3 0x96". 
</p> <p> What about changing your check to either </p> <ul><li><code>BOOST_TEST("Ölniveau" == oelniveau);</code> since you say your files are written in UTF-8 </li><li>or <code>BOOST_TEST("\xc3\x96lniveau" == oelniveau);</code> </li></ul><p> My gut feeling is that the preprocessor does something with the octal representation of Ö. <code>0xD6 0x6C 0x6E 0x69</code> seems to mean </p> <ul><li><code>0xD6</code> missing the following <code>00</code> for the Ö and/or not being escaped as UTF-8 (should be <code>0xC3 0x96</code>) </li><li><code>0x6C</code> for the <code>l</code> of <code>Ölniveau</code> </li><li><code>0x6E</code> for the <code>n</code> of <code>Ölniveau</code> </li><li><code>0x69</code> for the <code>i</code> of <code>Ölniveau</code> </li></ul><p> The other possibility is that the file opened for the JUNIT output interprets its content based on the locale. Would you mind also changing the locale like this </p> <pre class="wiki">export LC_ALL=en_US.UTF-8 export LANG=en_US.UTF-8 export LANGUAGE=en_US.UTF-8 </pre><p> and rerunning the check? </p> <p> Thanks </p> </description> <category>Ticket</category> </item> <item> <author>sebastian.freitag@…</author> <pubDate>Sat, 07 Apr 2018 07:27:20 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/13402#comment:12 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/13402#comment:12</guid> <description> <p> Your first question: I need to escape because my source code file is UTF-8, but I want to test the string in the variable oelniveau, and THIS string is supposed to be the single-byte windows-1252 encoding of "Ölniveau". If I transform it as you suggest, my test will always fail. </p> <p> Regarding your second question: I compile and run tests on a few different VMs. I checked the locale, and on the Linux machine it is identical to the one you suggested. </p> <p> I realize how difficult it is to guarantee that everything printed to the junit XML is valid utf-8. 
My hack for now is to replace the first line of the XML before processing it further, and to live with a few mangled characters rather than deal with exceptions from the Java program parsing the XML. Would it hurt to do so in general and set encoding="windows-1252" by default? We are speaking about output from failed tests in the end. So I guess in any case you would rather have some information than crash your post-processor. </p> </description> <category>Ticket</category> </item> <item> <dc:creator>Raffi Enficiaud</dc:creator> <pubDate>Sat, 07 Apr 2018 09:18:59 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/13402#comment:13 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/13402#comment:13</guid> <description> <blockquote class="citation"> <p> I realize how difficult it is to guarantee that everything printed to the junit XML is valid utf-8. </p> </blockquote> <p> Indeed. The problem that you are facing, as I understand it, is that you are comparing a string in the cp1252 domain that is not pure ASCII, while an <code>std::string</code> does not carry any encoding information. This cp1252 string is output <strong>as is</strong> to the JUNIT file, because boost.test does not interpret anything. </p> <p> This is a shortcoming that I believe boost.test should address at some point, but OTOH boost.test does not interpret any char that is output, simply because boost.test does not know anything about encoding. I do not know if I should support this at some point: Unicode and code-point transformations are natively supported on Windows, while on other operating systems I would need to include an external library, which I do not want. I haven't looked into the C++11 encoding facilities; maybe it is easier now. </p> <p> The idea would be to be able to declare what encoding is being used for strings, and to transform to utf-8. 
Transforming to utf-8 is also something that you have to do to be correct: if you say that your source code is utf-8, it is likely that at some point you will output a string that is utf-8 encoded, while here you are willing to turn everything into cp1252 because the input is cp1252. This approach will not scale very well, as several encodings will get mixed in the resulting log. The right approach would be to transform everything to e.g. utf-8 (or at least to the encoding that is declared in the XML file). </p> <p> For now, I would just suggest transforming the strings to utf-8, until I come up with correct handling in boost.test. After all, there are not that many cp1252 characters that would have to be transformed (and that you need). </p> </description> <category>Ticket</category> </item> <item> <author>kai.unger@…</author> <pubDate>Mon, 02 Jul 2018 11:26:26 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/13402#comment:14 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/13402#comment:14</guid> <description> <p> Sometimes it is not possible to convert the logged strings: there might be system messages logged internally. </p> <p> In my case, a network error (on a German Windows 7) generates an exception in boost.asio, which is logged by boost.test automatically as "![CDATA[class boost::exception_detail::clone_impl&lt;struct boost::exception_detail::error_info_injector&lt;class boost::system::system_error&gt; &gt;: bind: Die angeforderte Adresse ist in diesem Kontext ungültig]]". The u-umlaut corrupts the XML. </p> </description> <category>Ticket</category> </item> </channel> </rss>