Boost C++ Libraries: Ticket #11600: boost property_tree exponential newline growth
https://svn.boost.org/trac10/ticket/11600
<h2 class="section" id="Problem">Problem</h2>
<p>
Boost property_tree's XML writer adds many newlines on every roundtrip when the file was read without the trim_whitespace option.
This makes ptree unusable without trim_whitespace, so ptree is not an option when whitespace in XML text actually has to be preserved.
</p>
<h2 class="section" id="Example">Example</h2>
<p>
in.xml:
</p>
<pre class="wiki"> <simona_input>
<simona_configuration>
<coord residue_name="LI1" residue_id="0" chain_id="0" name="C" id="0">
<X>0.0</X>
<Y>0.0</Y>
</pre><p>
rewritten.xml:
</p>
<pre class="wiki"> <simona_input>
&#10; &#10; &#10; &#10; &#10; &#10; &#10;
<simona_configuration>
&#10; &#10; &#10; &#10;
<coord residue_name="LI1" residue_id="0" chain_id="0" name="C" id="0">
</pre><h2 class="section" id="Cause">Cause</h2>
<p>
The problem is due to the interpretation of strings in the following XML example:
</p>
<pre class="wiki"> <element1>
<subelement/>
<subelement/>
<subelement/>
</element1>
</pre><p>
rapidxml interprets this as element1 with three children plus a text element that consists of nothing but newlines and other whitespace.
</p>
<h2 class="section" id="Fix">Fix</h2>
<p>
Fixing this is easy, because the logic for handling this scenario is already present in trim_whitespace. The solution is to remove all whitespace in an XML element if that element consists ONLY of whitespace. The diff (which applies against the 1.59 rapidxml.hpp) is attached.
</p>
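<p>
The rule described above can be sketched in isolation. This is a minimal illustration, not the attached diff: the helper names are hypothetical, and the whitespace set (space, tab, CR, LF) is an assumption based on the XML definition of whitespace.
</p>

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true if `text` contains no non-whitespace character.
// Assumes XML whitespace is space, tab, carriage return and line feed.
bool is_whitespace_only(const std::string& text) {
    return text.find_first_not_of(" \t\r\n") == std::string::npos;
}

// Hypothetical helper: drop element text that is pure whitespace,
// leave any other text completely untouched (no trimming, no normalizing).
std::string prune_whitespace_only(const std::string& text) {
    return is_whitespace_only(text) ? std::string() : text;
}

// Usage sketch.
void check_prune() {
    assert(prune_whitespace_only("\n    \n\t") == "");      // indentation only
    assert(prune_whitespace_only("AB  CD") == "AB  CD");    // inner spaces kept
    assert(prune_whitespace_only("  x ") == "  x ");        // edges kept too
}
```

<p>
Text that contains at least one non-whitespace character passes through unchanged, which is what distinguishes this rule from trimming or normalizing.
</p>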
<h2 class="section" id="Reproduction">Reproduction</h2>
<p>
I attached a testcase, which includes a clean 1.59 boost property tree and a fixed 1.59 boost property tree and a big test xml file.
To reproduce: Call ./build.sh, which will generate a fixed and a broken executable. The executable will read in.xml and write it as stage1.xml. Then it will read stage1.xml and write it as stage2.xml and then again the same for stage3.xml.
</p>
<p>
Open in.xml and compare the first lines against stage3.xml and you will see that the roundtrip included many newlines, which actually got encoded in the text as &#10;
</p>
<h2 class="section" id="Remarks">Remarks</h2>
<p>
This is a regression; it did not happen in earlier versions. The problem has existed for at least four years: <a class="ext-link" href="http://stackoverflow.com/questions/6572550/boostproperty-tree-xml-pretty-printing"><span class="icon"></span>http://stackoverflow.com/questions/6572550/boostproperty-tree-xml-pretty-printing</a>
</p>
<h2 class="section" id="Changeofexistingbehaviour-secondbug">Change of existing behaviour - second bug</h2>
<p>
The Stack Overflow answer also shows a difference between the old and the current behaviour: newlines are now encoded. This is a separate bug, but in my opinion there is no reason to encode \n or \t, as these are not reserved XML characters. The two cases should be removed from detail/xml_parser_utils.hpp:73-74, as they change existing behaviour without a reason.
</p>
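<p>
For comparison, an encoder that escapes only the five characters with predefined XML entities and passes \n and \t through literally could look like the sketch below. The function name is hypothetical and this is not the code in xml_parser_utils.hpp; it only illustrates the behaviour argued for above.
</p>

```cpp
#include <cassert>
#include <string>

// Hypothetical encoder: escapes the characters that have predefined XML
// entities and copies everything else, including '\n' and '\t', verbatim.
std::string encode_reserved_only(const std::string& s) {
    std::string r;
    r.reserve(s.size());
    for (char c : s) {
        switch (c) {
            case '<':  r += "&lt;";   break;
            case '>':  r += "&gt;";   break;
            case '&':  r += "&amp;";  break;
            case '"':  r += "&quot;"; break;
            case '\'': r += "&apos;"; break;
            default:   r += c;        break;  // newline and tab land here
        }
    }
    return r;
}

// Usage sketch.
void check_encode() {
    assert(encode_reserved_only("a<b\n\tc&d") == "a&lt;b\n\tc&amp;d");
    assert(encode_reserved_only("line1\nline2") == "line1\nline2");
}
```

<p>
This applies to element text. Inside attribute values an XML parser normalizes literal newlines and tabs to spaces, so an encoder for attributes would still need character references there.
</p>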
Timo Strunk <Timo.Strunk@…>, Sun, 30 Aug 2015 14:07:21 GMT: attachment set
<ul>
<li><strong>attachment</strong>
→ <span class="trac-field-new">property_tree_bugreport.tar.bz2</span>
</li>
</ul>
<p>
Testcase and diff
</p>
Timo Strunk <Timo.Strunk@…>, Mon, 26 Oct 2015 15:08:15 GMT
<link>https://svn.boost.org/trac10/ticket/11600#comment:1 </link>
<guid isPermaLink="false">https://svn.boost.org/trac10/ticket/11600#comment:1</guid>
<description>
<p>
I made a pull request fixing the newline introduction.
<a class="ext-link" href="https://github.com/boostorg/property_tree/pull/16"><span class="icon"></span>https://github.com/boostorg/property_tree/pull/16</a>
</p>
<p>
The newline and tab translation behaviour is unchanged, but I think it should still be fixed. I can send a pull request for that immediately, too, if required.
</p>
</description>
<category>Ticket</category>
</item>
<item>
<dc:creator>Sebastian Redl</dc:creator>
<pubDate>Wed, 10 Feb 2016 12:56:49 GMT</pubDate>
<title/>
<link>https://svn.boost.org/trac10/ticket/11600#comment:2 </link>
<guid isPermaLink="false">https://svn.boost.org/trac10/ticket/11600#comment:2</guid>
<description>
<p>
XML whitespace behavior is a mess, but anything that introduces greater roundtrip fidelity under non-strip_whitespace mode is an improvement. If you still have that second pull request, please send it.
</p>
</description>
<category>Ticket</category>
</item>
<item>
<author>Timo Strunk <timo.strunk@…></author>
<pubDate>Wed, 10 Feb 2016 16:01:29 GMT</pubDate>
<title/>
<link>https://svn.boost.org/trac10/ticket/11600#comment:3 </link>
<guid isPermaLink="false">https://svn.boost.org/trac10/ticket/11600#comment:3</guid>
<description>
<p>
Thank you for merging!
</p>
<p>
I made a pull request for the second issue and explained it in more detail there:
<a class="ext-link" href="https://github.com/boostorg/property_tree/pull/18"><span class="icon"></span>https://github.com/boostorg/property_tree/pull/18</a>
</p>
</description>
<category>Ticket</category>
</item>
<item>
<dc:creator>Sebastian Redl</dc:creator>
<pubDate>Thu, 11 Feb 2016 09:35:04 GMT</pubDate>
<title>status changed; resolution set</title>
<link>https://svn.boost.org/trac10/ticket/11600#comment:4 </link>
<guid isPermaLink="false">https://svn.boost.org/trac10/ticket/11600#comment:4</guid>
<ul>
<li><strong>status</strong>
<span class="trac-field-old">new</span> → <span class="trac-field-new">closed</span>
</li>
<li><strong>resolution</strong>
→ <span class="trac-field-new">invalid</span>
</li>
</ul>
<p>
I've reverted this. After thinking it over, it doesn't make sense to parse XML without stripping whitespace, writing it out in pretty-print mode, and expecting this to roundtrip.
</p>
Timo Strunk <timo.strunk@…>, Thu, 11 Feb 2016 09:51:54 GMT
<link>https://svn.boost.org/trac10/ticket/11600#comment:5 </link>
<guid isPermaLink="false">https://svn.boost.org/trac10/ticket/11600#comment:5</guid>
<description>
<p>
Without my code you currently cannot round-trip the following XML using Property Tree:
</p>
<pre class="wiki"><XML>
  <Text>AB  CD</Text>
</XML>
</pre>
<p>
There are two protected spaces in between AB and CD. I need those two spaces.
</p>
<p>
Using the previous boost 1.59 code you end up with:
</p>
<pre class="wiki"><?xml version="1.0" encoding="utf-8"?>
<XML>
&#10; &#10; &#10; &#10;&#10; &#10;&#10; &#10;
  <Text>AB  CD</Text>
</XML>
</pre>
<p>
This is good, because the text is not broken, but if the XML is rewritten several million times you end up with a GB of '&#10;' (or actual newlines in the case of <a class="ext-link" href="https://github.com/boostorg/property_tree/pull/18"><span class="icon"></span>pull request #18</a>).
</p>
<p>
Using trim_whitespace you end up with:
</p>
<pre class="wiki"><?xml version="1.0" encoding="utf-8"?>
<XML>
  <Text>AB CD</Text>
</XML>
</pre>
<p>
which makes everything look nice, but the double whitespace in the middle is gone.
</p>
<p>
The problem here is that property_tree::trim_whitespace is converted to rapidxml::normalize_whitespace. The trim_whitespace option of rapidxml is not exposed.
</p>
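<p>
The difference between the two rapidxml behaviours matters for the "AB  CD" example. The sketch below models them on plain strings, following the rapidxml documentation (trimming removes leading and trailing whitespace of data nodes, normalizing condenses each whitespace run to one space). The function names are hypothetical and this is not rapidxml's implementation.
</p>

```cpp
#include <cassert>
#include <string>

// Assumed XML whitespace set: space, tab, CR, LF.
static bool is_xml_space(char c) {
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

// Model of trimming: drop leading and trailing whitespace, keep inner runs.
std::string trim_ws(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::string::size_type first = s.find_first_not_of(ws);
    if (first == std::string::npos) return std::string();  // whitespace only
    const std::string::size_type last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Model of normalizing: condense every whitespace run to a single space.
std::string normalize_ws(const std::string& s) {
    std::string out;
    bool in_run = false;
    for (char c : s) {
        if (is_xml_space(c)) {
            if (!in_run) out += ' ';
            in_run = true;
        } else {
            out += c;
            in_run = false;
        }
    }
    return out;
}

// Usage sketch: trimming alone keeps the double space, normalizing loses it.
void check_whitespace_modes() {
    assert(trim_ws("\n  AB  CD\n") == "AB  CD");
    assert(normalize_ws("AB  CD") == "AB CD");
    assert(trim_ws(" \n\t") == "");
}
```

<p>
Trimming alone would remove the indentation-only text while keeping the two spaces inside the Text element, which is the combination asked for here.
</p>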
<p>
My change (incorrectly, you are right) trims whitespace even if trim_whitespace is not enabled.
Therefore my suggestion would be: I send a third pull request that adds a new option, boost::property_tree::xml_parser::trim_but_dont_normalize_whitespace, which enables rapidxml::trim_whitespace but not rapidxml::normalize_whitespace.
</p>
<p>
Would this be acceptable for you?
</p>
</description>
<category>Ticket</category>
</item>
<item>
<author>Timo Strunk <timo.strunk@…></author>
<pubDate>Thu, 11 Feb 2016 09:57:29 GMT</pubDate>
<title/>
<link>https://svn.boost.org/trac10/ticket/11600#comment:6 </link>
<guid isPermaLink="false">https://svn.boost.org/trac10/ticket/11600#comment:6</guid>
<description>
<p>
Quick remark:
"My change (incorrectly, you are right) trims whitespace, even if trim_whitespace is not enabled."
</p>
<p>
This is not completely correct. It actually only trims if the XML element contains ONLY whitespace. So I could instead add an option such as boost::property_tree::xml_parser::prune_whitespace_xml_data.
</p>
</description>
<category>Ticket</category>
</item>
</channel>
</rss>