Opened 7 years ago

Closed 7 years ago

Last modified 7 years ago

#11600 closed Bugs (invalid)

boost property_tree exponential newline growth

Reported by: Timo Strunk <Timo.Strunk@…>
Owned by: Sebastian Redl
Milestone: To Be Determined
Component: property_tree
Version: Boost 1.59.0
Severity: Problem
Keywords:
Cc:

Description

Problem

Boost's property_tree XML writer adds many extra newlines on every read/write roundtrip when the file is read without the trim_whitespace option. This makes ptree unusable without trim_whitespace, so ptree is not an option when whitespace in XML text actually has to be preserved.

Example

in.xml:

    <simona_input>
      <simona_configuration>
        <coord residue_name="LI1" residue_id="0" chain_id="0" name="C" id="0">
          <X>0.0</X>
          <Y>0.0</Y>

rewritten.xml:

    <simona_input>
      &#10;  &#10;  &#10;  &#10;  &#10;  &#10;  &#10;
      <simona_configuration>
        &#10;    &#10;    &#10;    &#10;   
        <coord residue_name="LI1" residue_id="0" chain_id="0" name="C" id="0">

Cause

The problem is due to the interpretation of strings in the following XML example:

    <element1>
        <subelement/>
        <subelement/>
        <subelement/>
    </element1>

rapidxml interprets this as element1 with three child elements plus a text node that contains the newlines and indentation found between those children.

Fix

Fixing this is easy, because the logic for handling this scenario is already present in trim_whitespace. The solution is to remove the text of an XML element when that text consists ONLY of whitespace. The diff (which applies against the 1.59 rapidxml.hpp) is attached.
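As a rough illustration of the rule described above (this is not the attached diff, and the function names are hypothetical), a minimal standard-library sketch could look like this. It assumes that "whitespace" means space, tab, carriage return and newline, and it leaves any text with non-whitespace content untouched.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: true when every character is XML whitespace
// (space, tab, carriage return, newline). An empty string also counts.
bool is_whitespace_only(const std::string& text) {
    return std::all_of(text.begin(), text.end(), [](unsigned char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n';
    });
}

// Hypothetical helper: drop element text that is ONLY whitespace and
// return all other text unchanged, including its inner and outer spaces.
std::string prune_whitespace_only(const std::string& text) {
    return is_whitespace_only(text) ? std::string() : text;
}
```

With this rule, the indentation between the subelement children is dropped, while a text such as "AB  CD" keeps both of its spaces.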

Reproduction

I attached a testcase, which includes a clean 1.59 boost property tree and a fixed 1.59 boost property tree and a big test xml file. To reproduce: Call ./build.sh, which will generate a fixed and a broken executable. The executable will read in.xml and write it as stage1.xml. Then it will read stage1.xml and write it as stage2.xml and then again the same for stage3.xml.

Open in.xml and compare its first lines against stage3.xml. You will see that the roundtrips introduced many newlines, which were encoded in the text as &#10;.

Remarks

This is a regression and did not happen before. The problem has existed for at least four years; see http://stackoverflow.com/questions/6572550/boostproperty-tree-xml-pretty-printing

Change of existing behaviour - second bug

The Stack Overflow answer also shows a difference between the old and the current behaviour: newlines are now encoded. This is a separate bug, but in my opinion there is no reason to encode \n or \t, because they are not reserved XML characters. The encoding should be removed from detail/xml_parser_utils.hpp:73-74, as it changes existing behaviour without a reason.
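To make the proposal concrete, here is a minimal sketch of an escaping routine that encodes only the five predefined XML entities and passes \n and \t through unchanged. This is not the code in detail/xml_parser_utils.hpp; the function name is hypothetical, and the sketch assumes it is applied to element text only. Attribute values are a different case, because XML parsers normalize literal tabs and newlines inside them.

```cpp
#include <string>

// Hypothetical helper: escape only the characters that are reserved in
// XML markup. Newlines and tabs are copied through as-is.
std::string escape_reserved_only(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        switch (c) {
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '&':  out += "&amp;";  break;
            case '"':  out += "&quot;"; break;
            case '\'': out += "&apos;"; break;
            default:   out += c;        break;
        }
    }
    return out;
}
```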

Attachments (1)

property_tree_bugreport.tar.bz2 (55.5 KB ) - added by Timo Strunk <Timo.Strunk@…> 7 years ago.
Testcase and diff

Change History (7)

by Timo Strunk <Timo.Strunk@…>, 7 years ago

Testcase and diff

comment:1 by Timo Strunk <Timo.Strunk@…>, 7 years ago

I made a pull request fixing the newline introduction. https://github.com/boostorg/property_tree/pull/16

The newline and tab translation behaviour is unchanged, but I think it should still be fixed. I can send a pull request for that immediately, too, if required.

comment:2 by Sebastian Redl, 7 years ago

XML whitespace behavior is a mess, but anything that introduces greater roundtrip fidelity under non-strip_whitespace mode is an improvement. If you still have that second pull request, please send it.

comment:3 by Timo Strunk <timo.strunk@…>, 7 years ago

Thank you for merging!

I made a pull request for the second issue and explained it in more detail there: https://github.com/boostorg/property_tree/pull/18

comment:4 by Sebastian Redl, 7 years ago

Resolution: invalid
Status: new → closed

I've reverted this. After thinking it over, it doesn't make sense to parse XML without stripping whitespace, writing it out in pretty-print mode, and expecting this to roundtrip.

comment:5 by Timo Strunk <timo.strunk@…>, 7 years ago

Without my code you currently cannot round-trip the following XML using Property Tree:

    <XML>
    <Text>AB  CD</Text>
    </XML>

There are two protected spaces in between AB and CD. I need those two spaces.

Using the previous boost 1.59 code you end up with:

    <?xml version="1.0" encoding="utf-8"?> <XML>
    &#10; &#10; &#10; &#10;&#10; &#10;&#10; &#10; <Text>AB  CD</Text>
    </XML>

This is good, because the text is not broken, but if the XML is rewritten several million times you end up with a GB of '&#10;' (or actual newlines in case of pull request #18).

Using trim_whitespace you end up with:

    <?xml version="1.0" encoding="utf-8"?> <XML>
    <Text>AB CD</Text>
    </XML>

which makes everything look nice, but the double whitespace in the middle is gone.

The problem here is that property_tree::trim_whitespace is converted to rapidxml::normalize_whitespace. The trim_whitespace option of rapidxml is not exposed.
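The distinction between the two behaviours can be modelled with a small standard-library sketch. It is only an illustration and not rapidxml's implementation: it assumes that trimming removes leading and trailing whitespace, that normalizing collapses each run of whitespace into a single space, and the function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: the whitespace characters considered in this sketch.
bool is_space(char c) {
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

// Model of trimming: strip leading and trailing whitespace only.
std::string trim_only(const std::string& s) {
    std::size_t b = 0, e = s.size();
    while (b < e && is_space(s[b])) ++b;
    while (e > b && is_space(s[e - 1])) --e;
    return s.substr(b, e - b);
}

// Model of normalizing: collapse every whitespace run to one space.
std::string normalize_runs(const std::string& s) {
    std::string out;
    bool prev_space = false;
    for (char c : s) {
        if (is_space(c)) {
            if (!prev_space) out += ' ';
            prev_space = true;
        } else {
            out += c;
            prev_space = false;
        }
    }
    return out;
}
```

In this model, trimming alone turns "\n  AB  CD \n" into "AB  CD" and keeps the inner double space, while normalizing reduces it to a single space.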

My change (incorrectly, you are right) trims whitespace, even if trim_whitespace is not enabled. Therefore my suggestion would be: I send a third Pull request in which I integrate a new option boost::property_tree::xml_parser::trim_but_dont_normalize_whitespace, which enables rapidxml::trim_whitespace but not rapidxml::normalize_whitespace.

Would this be acceptable to you?

comment:6 by Timo Strunk <timo.strunk@…>, 7 years ago

Quick remark: "My change (incorrectly, you are right) trims whitespace, even if trim_whitespace is not enabled."

This is not completely correct. It actually trims only if the XML element contains ONLY whitespace. So I could instead integrate the option boost::property_tree::xml_parser::prune_whitespace_xml_data.
