Boost C++ Libraries: Ticket #1678: Boost.property_tree::read_xml does not parse UNICODE file with BOMs https://svn.boost.org/trac10/ticket/1678 <p> I've reported this on both the users list and the developers list, with no response. Please refer to Boost-users Digest, Vol 1560, Issue 1 and Boost Digest, Vol 2114, Issue 3. </p> Marshall Clow Wed, 12 Mar 2008 14:55:17 GMT component changed; owner set https://svn.boost.org/trac10/ticket/1678#comment:1 <ul> <li><strong>owner</strong> set to <span class="trac-author">kaalus</span> </li> <li><strong>component</strong> <span class="trac-field-old">None</span> → <span class="trac-field-new">property_tree</span> </li> </ul> Ticket Sebastian Redl Sun, 04 Oct 2009 11:51:12 GMT owner, status changed https://svn.boost.org/trac10/ticket/1678#comment:2 <ul> <li><strong>owner</strong> changed from <span class="trac-author">kaalus</span> to <span class="trac-author">Sebastian Redl</span> </li> <li><strong>status</strong> <span class="trac-field-old">new</span> → <span class="trac-field-new">assigned</span> </li> </ul> Ticket zhuo.qiang@… Wed, 26 Jan 2011 02:19:31 GMT type, severity, milestone changed https://svn.boost.org/trac10/ticket/1678#comment:3 <ul> <li><strong>type</strong> <span class="trac-field-old">Bugs</span> → <span class="trac-field-new">Patches</span> </li> <li><strong>severity</strong> <span class="trac-field-old">Problem</span> → <span class="trac-field-new">Showstopper</span> </li> <li><strong>milestone</strong> <span class="trac-field-old">Boost 1.36.0</span> → <span class="trac-field-new">Boost 1.47.0</span> </li> </ul> <p> The following is a fix for this; could someone please review it and apply it to the trunk: </p> <p> /property_tree/detail/rapidxml.hpp(1730): </p>
<div class="wiki-code"><div class="code"><pre><span class="c1">// Parse BOM, if any</span>
<span class="k">template</span><span class="o">&lt;</span><span class="kt">int</span> <span class="n">Flags</span><span class="o">&gt;</span>
<span class="kt">void</span> <span class="n">parse_bom</span><span class="p">(</span><span class="kt">char</span> <span class="o">*&amp;</span><span class="n">text</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// UTF-8; cast to unsigned char because plain char may be signed,</span>
    <span class="c1">// in which case text[0] == 0xEF could never be true</span>
    <span class="k">if</span> <span class="p">(</span><span class="k">static_cast</span><span class="o">&lt;</span><span class="kt">unsigned</span> <span class="kt">char</span><span class="o">&gt;</span><span class="p">(</span><span class="n">text</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">==</span> <span class="mh">0xEF</span> <span class="o">&amp;&amp;</span>
        <span class="k">static_cast</span><span class="o">&lt;</span><span class="kt">unsigned</span> <span class="kt">char</span><span class="o">&gt;</span><span class="p">(</span><span class="n">text</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="o">==</span> <span class="mh">0xBB</span> <span class="o">&amp;&amp;</span>
        <span class="k">static_cast</span><span class="o">&lt;</span><span class="kt">unsigned</span> <span class="kt">char</span><span class="o">&gt;</span><span class="p">(</span><span class="n">text</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span> <span class="o">==</span> <span class="mh">0xBF</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="n">text</span> <span class="o">+=</span> <span class="mi">3</span><span class="p">;</span> <span class="c1">// Skip utf-8 bom</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="c1">// Parse BOM, if any</span>
<span class="k">template</span><span class="o">&lt;</span><span class="kt">int</span> <span class="n">Flags</span><span class="o">&gt;</span>
<span class="kt">void</span> <span class="n">parse_bom</span><span class="p">(</span><span class="kt">wchar_t</span> <span class="o">*&amp;</span><span class="n">text</span><span class="p">)</span>
<span class="p">{</span>
    <span class="c1">// UTF-16</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">text</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">==</span> <span class="n">L</span><span class="err">&#39;\</span><span 
class="n">uFEFF</span><span class="err">&#39;</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="n">text</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// Skip utf-16 bom</span>
    <span class="p">}</span>
<span class="p">}</span>
</pre></div></div> Ticket <item> <dc:creator>Sebastian Redl</dc:creator> <pubDate>Fri, 18 Feb 2011 16:30:44 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/1678#comment:4 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/1678#comment:4</guid> <description> <p> Fixed on trunk in <a class="changeset" href="https://svn.boost.org/trac10/changeset/68991" title="Apply patch from bug 1678 with slight modification: allow BOM in XML ...">r68991</a>. </p> </description> <category>Ticket</category> </item> <item> <dc:creator>Marshall Clow</dc:creator> <pubDate>Fri, 18 Feb 2011 17:28:32 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/1678#comment:5 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/1678#comment:5</guid> <description> <p> I don't think this is right, especially in the case of UTF-16 (and UTF-32). </p> <p> For UTF-16: </p> <ul><li>If there is a BOM, and it is "FE FF", then the rest of the UTF-16 must be interpreted as big endian. </li><li>If there is a BOM, and it is "FF FE", then the rest of the UTF-16 must be interpreted as little endian. </li></ul><p> I see no code here to do that. It just notes "Hey, there's a BOM here" (assuming that the BOM matches the endianness of the processor that is consuming the XML), and continues on. </p> <p> See <a class="ext-link" href="http://www.opentag.com/xfaq_enc.htm"><span class="icon"></span>http://www.opentag.com/xfaq_enc.htm</a> for more info. 
</p> </description> <category>Ticket</category> </item> <item> <dc:creator>Sebastian Redl</dc:creator> <pubDate>Fri, 18 Feb 2011 20:04:07 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/1678#comment:6 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/1678#comment:6</guid> <description> <p> True. The code is just to skip a BOM and continue as if it wasn't there. The data still has to be in the expected format. </p> <p> This is the right thing to do. The XML parser should not concern itself with encodings. It's the responsibility of the code conversion facet of the input stream to bring the data into the right format. The code here is just to skip the BOM. </p> </description> <category>Ticket</category> </item> <item> <dc:creator>Marshall Clow</dc:creator> <pubDate>Fri, 18 Feb 2011 20:59:27 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/1678#comment:7 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/1678#comment:7</guid> <description> <p> Ok, then - let's add these additional test cases: </p> <pre class="wiki">// Note: "\0001" is the octal escape "\000" followed by '1'; a bare "\01"
// would be read as the single byte 0x01 (likewise "\00" before '0').

// byte order mark (UTF-16, big endian)
const char *bug_data_pr1678be =
    "\xFE\xFF\0&lt;\0?\0x\0m\0l\0 \0v\0e\0r\0s\0i\0o\0n\0=\0\"\0001\0.\0000\0\"\0 \0e\0n\0c\0o\0d\0i\0n\0g\0=\0\"\0u\0t\0f\0-\08\0\"\0?\0&gt;\0&lt;\0r\0o\0o\0t\0/\0&gt;";

// byte order mark (UTF-16, little endian)
const char *bug_data_pr1678le =
    "\xFF\xFE&lt;\0?\0x\0m\0l\0 \0v\0e\0r\0s\0i\0o\0n\0=\0\"\0001\0.\0000\0\"\0 \0e\0n\0c\0o\0d\0i\0n\0g\0=\0\"\0u\0t\0f\0-\08\0\"\0?\0&gt;\0&lt;\0r\0o\0o\0t\0/\0&gt;\0";
</pre><p> and make sure that they work - because they don't at the moment. </p> </description> <category>Ticket</category> </item> <item> <dc:creator>Sebastian Redl</dc:creator> <pubDate>Fri, 18 Feb 2011 21:26:23 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/1678#comment:8 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/1678#comment:8</guid> <description> <p> No. 
These tests are invalid because of the way the test system works, and because of the way PTree's XML support works. </p> <p> PTree's XML parser doesn't insulate you from encoding issues. In fact, I have an error in my test case in that I specify UTF-16 as the encoding of the XML snippet. That's incorrect: the encoding is UTF-8 in the test file that is created. And PTree doesn't care anyway, because all PTree does is read data from an input stream (wistream in the wchar_t case) and process it, assuming that it's in the platform default encoding for this character type. The encoding declaration of the XML is completely ignored. </p> <p> So the only thing that does encoding conversion is the input stream. The test cases install a UTF-8 conversion facet in the global locale, so that the wide stream tests expect the input files to contain UTF-8. Any other encoding would require replacing the code conversion facet for such tests, and wouldn't work at all for the narrow version, because narrow streams don't transcode AFAIK. </p> <p> Yes, this is technically invalid handling of XML. But that's a completely different issue and has nothing to do with the BOM issue here. That would be a much bigger issue: it would mean that the library would have to load the file as a binary block, detect the encoding, transcode the data to the native encoding for the given character type, and only then actually parse the XML. </p> <p> Sorry, but I'm not going to do that. I may do it if Boost ever has a usable encoding handling library that I can use, but not before that. </p> <p> This bug is about reading UTF-8 files that contain a BOM with a wide-character property_tree. I've fixed this bug by correctly skipping the BOM for wchar_t sequences under the assumption that the input stream has correctly converted whatever was on disk to the native encoding for wchar_t, which is further assumed to be native-endian UTF-16/32. 
That's actually a precondition for the XML parser, even though it's probably not documented. </p> <p> But poor documentation is also another bug. </p> </description> <category>Ticket</category> </item> <item> <dc:creator>Sebastian Redl</dc:creator> <pubDate>Mon, 16 May 2011 18:33:41 GMT</pubDate> <title>status changed; resolution set</title> <link>https://svn.boost.org/trac10/ticket/1678#comment:9 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/1678#comment:9</guid> <description> <ul> <li><strong>status</strong> <span class="trac-field-old">assigned</span> → <span class="trac-field-new">closed</span> </li> <li><strong>resolution</strong> → <span class="trac-field-new">fixed</span> </li> </ul> <p> (In <a class="changeset" href="https://svn.boost.org/trac10/changeset/71991" title="Merge r68990-68993, several fixes to PTree. Fixes bug 1678. Fixes bug 4387.">[71991]</a>) Merge <a class="changeset" href="https://svn.boost.org/trac10/changeset/68990" title="Fix a compile error in PTree JSON parser.">r68990</a>-68993, several fixes to PTree. Fixes bug 1678. Fixes bug 4387. </p> </description> <category>Ticket</category> </item> <item> <dc:creator>Sebastian Redl</dc:creator> <pubDate>Mon, 16 May 2011 18:34:44 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/1678#comment:10 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/1678#comment:10</guid> <description> <p> (In <a class="changeset" href="https://svn.boost.org/trac10/changeset/71992" title="Merge r68990-68993, several fixes to PTree. Fixes bug 1678. Fixes bug ...">[71992]</a>) Merge <a class="changeset" href="https://svn.boost.org/trac10/changeset/68990" title="Fix a compile error in PTree JSON parser.">r68990</a>-68993, several fixes to PTree. Fixes bug 1678. Fixes bug 4387. Forgot to commit these together with the header part. </p> </description> <category>Ticket</category> </item> </channel> </rss>