Boost C++ Libraries: Ticket #11600: boost property_tree exponential newline growth https://svn.boost.org/trac10/ticket/11600 <h2 class="section" id="Problem">Problem</h2> <p> A Boost property_tree XML roundtrip accumulates many extra newlines when the file is read without the trim_whitespace option. This makes ptree unusable without trim_whitespace, so ptree is not an option when whitespace in XML text actually has to be preserved. </p> <h2 class="section" id="Example">Example</h2> <p> in.xml: </p> <pre class="wiki"> &lt;simona_input&gt; &lt;simona_configuration&gt; &lt;coord residue_name="LI1" residue_id="0" chain_id="0" name="C" id="0"&gt; &lt;X&gt;0.0&lt;/X&gt; &lt;Y&gt;0.0&lt;/Y&gt; </pre><p> rewritten.xml: </p> <pre class="wiki"> &lt;simona_input&gt; &amp;#10; &amp;#10; &amp;#10; &amp;#10; &amp;#10; &amp;#10; &amp;#10; &lt;simona_configuration&gt; &amp;#10; &amp;#10; &amp;#10; &amp;#10; &lt;coord residue_name="LI1" residue_id="0" chain_id="0" name="C" id="0"&gt; </pre><h2 class="section" id="Cause">Cause</h2> <p> The problem is due to the interpretation of strings in the following XML example: </p> <pre class="wiki"> &lt;element1&gt; &lt;subelement/&gt; &lt;subelement/&gt; &lt;subelement/&gt; &lt;/element1&gt; </pre><p> rapidxml interprets this as element1 with 3 children plus a text element that contains a bunch of newlines and whitespace. </p> <h2 class="section" id="Fix">Fix</h2> <p> Fixing this is easy, because the logic for handling this scenario is already present in trim_whitespace. The solution is to remove all whitespace in an XML element if that element consists ONLY of whitespace. The diff (which applies to the 1.59 rapidxml.hpp) is attached. </p> <h2 class="section" id="Reproduction">Reproduction</h2> <p> I attached a testcase, which includes a clean 1.59 Boost property tree, a fixed 1.59 Boost property tree, and a big test XML file. To reproduce: call ./build.sh, which will generate a fixed and a broken executable. 
Each executable reads in.xml and writes it as stage1.xml, then reads stage1.xml and writes it as stage2.xml, and then does the same again for stage3.xml. </p> <p> Open in.xml and compare its first lines against stage3.xml, and you will see that the roundtrip introduced many newlines, which are encoded in the text as &amp;#10;. </p> <h2 class="section" id="Remarks">Remarks</h2> <p> This is a regression; it did not happen before. The problem has existed for at least four years, see <a class="ext-link" href="http://stackoverflow.com/questions/6572550/boostproperty-tree-xml-pretty-printing"><span class="icon">​</span>http://stackoverflow.com/questions/6572550/boostproperty-tree-xml-pretty-printing</a> </p> <h2 class="section" id="Changeofexistingbehaviour-secondbug">Change of existing behaviour - second bug</h2> <p> The Stack Overflow answer also shows a difference between the old and the current behaviour: newlines are now encoded. This is a separate bug, but in my opinion there is no reason to encode \n or \t, as these are not reserved XML characters. Their encodings should be removed from detail/xml_parser_utils.hpp:73-74, as they change existing behaviour without a reason. </p> Timo Strunk <Timo.Strunk@…> Sun, 30 Aug 2015 14:07:21 GMT attachment set https://svn.boost.org/trac10/ticket/11600 <ul> <li><strong>attachment</strong> → <span class="trac-field-new">property_tree_bugreport.tar.bz2</span> </li> </ul> <p> Testcase and diff </p> Ticket Timo Strunk <Timo.Strunk@…> Mon, 26 Oct 2015 15:08:15 GMT <link>https://svn.boost.org/trac10/ticket/11600#comment:1 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/11600#comment:1</guid> <description> <p> I made a pull request fixing the newline introduction. 
<a class="ext-link" href="https://github.com/boostorg/property_tree/pull/16"><span class="icon">​</span>https://github.com/boostorg/property_tree/pull/16</a> </p> <p> The newline and tab translation behaviour is unchanged, but I think it should still be fixed. I can send a pull request for that immediately, too, if required. </p> </description> <category>Ticket</category> </item> <item> <dc:creator>Sebastian Redl</dc:creator> <pubDate>Wed, 10 Feb 2016 12:56:49 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/11600#comment:2 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/11600#comment:2</guid> <description> <p> XML whitespace behavior is a mess, but anything that introduces greater roundtrip fidelity under non-strip_whitespace mode is an improvement. If you still have that second pull request, please send it. </p> </description> <category>Ticket</category> </item> <item> <author>Timo Strunk <timo.strunk@…></author> <pubDate>Wed, 10 Feb 2016 16:01:29 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/11600#comment:3 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/11600#comment:3</guid> <description> <p> Thank you for merging! </p> <p> I made a pull request for the second issue and explained it in more detail there: <a class="ext-link" href="https://github.com/boostorg/property_tree/pull/18"><span class="icon">​</span>https://github.com/boostorg/property_tree/pull/18</a> </p> </description> <category>Ticket</category> </item> <item> <dc:creator>Sebastian Redl</dc:creator> <pubDate>Thu, 11 Feb 2016 09:35:04 GMT</pubDate> <title>status changed; resolution set https://svn.boost.org/trac10/ticket/11600#comment:4 https://svn.boost.org/trac10/ticket/11600#comment:4 <ul> <li><strong>status</strong> <span class="trac-field-old">new</span> → <span class="trac-field-new">closed</span> </li> <li><strong>resolution</strong> → <span class="trac-field-new">invalid</span> </li> </ul> <p> I've reverted this. 
After thinking it over, it doesn't make sense to parse XML without stripping whitespace, writing it out in pretty-print mode, and expecting this to roundtrip. </p> Ticket Timo Strunk <timo.strunk@…> Thu, 11 Feb 2016 09:51:54 GMT <link>https://svn.boost.org/trac10/ticket/11600#comment:5 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/11600#comment:5</guid> <description> <p> Without my code you currently cannot round-trip the following XML using Property Tree: </p> <blockquote> <p> &lt;XML&gt; </p> <blockquote> <p> &lt;Text&gt;AB CD&lt;/Text&gt; </p> </blockquote> <p> &lt;/XML&gt; </p> </blockquote> <p> There are two protected spaces in between AB and CD. I need those two spaces. </p> <p> Using the previous Boost 1.59 code you end up with: </p> <blockquote> <p> &lt;?xml version="1.0" encoding="utf-8"?&gt; &lt;XML&gt; </p> <blockquote> <p> &amp;#10; &amp;#10; &amp;#10; &amp;#10;&amp;#10; &amp;#10;&amp;#10; &amp;#10; &lt;Text&gt;AB CD&lt;/Text&gt; </p> </blockquote> <p> &lt;/XML&gt; </p> </blockquote> <p> This is good, because the text is not broken, but if the XML is rewritten several million times you end up with a GB of '&amp;#10;' (or actual newlines in case of pull request #18). </p> <p> Using trim_whitespace you end up with: </p> <blockquote> <p> &lt;?xml version="1.0" encoding="utf-8"?&gt; &lt;XML&gt; </p> <blockquote> <p> &lt;Text&gt;AB CD&lt;/Text&gt; </p> </blockquote> <p> &lt;/XML&gt; </p> </blockquote> <p> which makes everything look nice, but the double whitespace in the middle is gone. </p> <p> The problem here is that property_tree::trim_whitespace is converted to rapidxml::normalize_whitespace. The trim_whitespace option of rapidxml is not exposed. </p> <p> My change (incorrectly, you are right) trims whitespace, even if trim_whitespace is not enabled. 
Therefore my suggestion would be: I send a third pull request in which I integrate a new option boost::property_tree::xml_parser::trim_but_dont_normalize_whitespace, which enables rapidxml::trim_whitespace but not rapidxml::normalize_whitespace. </p> <p> Would this be acceptable to you? </p> </description> <category>Ticket</category> </item> <item> <author>Timo Strunk <timo.strunk@…></author> <pubDate>Thu, 11 Feb 2016 09:57:29 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/11600#comment:6 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/11600#comment:6</guid> <description> <p> Quick remark: "My change (incorrectly, you are right) trims whitespace, even if trim_whitespace is not enabled." </p> <p> This is not completely correct. It actually trims only if the XML element contains ONLY whitespace. So I could instead integrate the option boost::property_tree::xml_parser::prune_whitespace_xml_data. </p> </description> <category>Ticket</category> </item> </channel> </rss>