Opened 15 years ago

Closed 11 years ago

Last modified 11 years ago

#1678 closed Patches (fixed)

Boost.property_tree::read_xml does not parse UNICODE file with BOMs

Reported by: tom
Owned by: Sebastian Redl
Milestone: Boost 1.47.0
Component: property_tree
Version: Boost Development Trunk
Severity: Showstopper
Keywords: property_tree UNICODE BOM read_xml
Cc: Marcin Kalicinski

Description

I've reported this on both the users list and the developers list with no response. Please refer to Boost-users Digest, Vol 1560, Issue 1 and Boost Digest, Vol 2114, Issue 3.

Change History (10)

comment:1 by Marshall Clow, 15 years ago

Component: None → property_tree
Owner: set to kaalus

comment:2 by Sebastian Redl, 13 years ago

Owner: changed from kaalus to Sebastian Redl
Status: new → assigned

comment:3 by zhuo.qiang@…, 12 years ago

Milestone: Boost 1.36.0 → Boost 1.47.0
Severity: Problem → Showstopper
Type: Bugs → Patches

The following is a fix for this; could someone please review it and apply it to the trunk:

/property_tree/detail/rapidxml.hpp(1730):

        // Parse BOM, if any
        template<int Flags>
        void parse_bom(char *&text)
        {
            // UTF-8 BOM is EF BB BF; compare through unsigned char,
            // since plain char may be signed and the test would
            // otherwise always be false
            if (static_cast<unsigned char>(text[0]) == 0xEF &&
                static_cast<unsigned char>(text[1]) == 0xBB &&
                static_cast<unsigned char>(text[2]) == 0xBF)
            {
                text += 3;      // Skip UTF-8 BOM
            }
        }

        // Parse BOM, if any
        template<int Flags>
        void parse_bom(wchar_t *&text)
        {
            // UTF-16 (or UTF-32, where wchar_t is 4 bytes);
            // the BOM is U+FEFF either way
            if (text[0] == L'\uFEFF')
            {
                text += 1;      // Skip BOM
            }
        }

comment:4 by Sebastian Redl, 12 years ago

Fixed on trunk in r68991.

comment:5 by Marshall Clow, 12 years ago

I don't think this is right, especially in the case of UTF-16 (and UTF-32).

For UTF-16:

  • If there is a BOM, and it is "FE FF", then the rest of the UTF-16 must be interpreted as big-endian.
  • If there is a BOM, and it is "FF FE", then the rest of the UTF-16 must be interpreted as little-endian.

I see no code here to do that. It just notes "Hey, there's a BOM here" (assuming that the BOM matches the endianness of the processor that is consuming the XML), and continues on.

See http://www.opentag.com/xfaq_enc.htm for more info.

comment:6 by Sebastian Redl, 12 years ago

True. The code is just to skip a BOM and continue as if it wasn't there. The data still has to be in the expected format.

This is the right thing to do. The XML parser should not concern itself with encodings. It is the responsibility of the code conversion facet of the input stream to bring the data into the right format. The code here only skips the BOM.

comment:7 by Marshall Clow, 12 years ago

Ok, then - let's add these additional test cases:

// byte order mark (UTF-16, big endian)
// (embedded NULs mean the data length must be given explicitly,
// e.g. via sizeof, rather than strlen)
const char *bug_data_pr1678be =
    "\xFE\xFF\0<\0?\0x\0m\0l\0 \0v\0e\0r\0s\0i\0o\0n\0=\0\"\0001\0.\0000\0\"\0 \0e\0n\0c\0o\0d\0i\0n\0g\0=\0\"\0u\0t\0f\0-\08\0\"\0?\0>\0<\0r\0o\0o\0t\0/\0>";

// byte order mark (UTF-16, little endian)
const char *bug_data_pr1678le =
    "\xFF\xFE<\0?\0x\0m\0l\0 \0v\0e\0r\0s\0i\0o\0n\0=\0\"\0001\0.\0000\0\"\0 \0e\0n\0c\0o\0d\0i\0n\0g\0=\0\"\0u\0t\0f\0-\08\0\"\0?\0>\0<\0r\0o\0o\0t\0/\0>\0";

and make sure that they work - because they don't at the moment.

comment:8 by Sebastian Redl, 12 years ago

No. These tests are invalid because of the way the test system works, and because of the way PTree's XML support works.

PTree's XML parser doesn't insulate you from encoding issues. In fact, I have an error in my test case in that I specify UTF-16 as the encoding of the XML snippet. That's incorrect: the encoding is UTF-8 in the test file that is created. And PTree doesn't care anyway, because all PTree does is read data from an input stream (wistream in the wchar_t case) and process it, assuming that it's in the platform default encoding for this character type. The encoding declaration of the XML is completely ignored.

So the only thing that does encoding conversion is the input stream. The test cases install a UTF-8 conversion facet in the global locale, so that the wide stream tests expect the input files to contain UTF-8. Any other encoding would require replacing the code conversion facet for such tests, and wouldn't work at all for the narrow version, because narrow streams don't transcode AFAIK.

Yes, this is technically invalid handling of XML. But that's a completely different issue and has nothing to do with the BOM issue here. That would be a much bigger issue: it would mean that the library would have to load the file as a binary block, detect the encoding, transcode the data to the native encoding for the given character type, and only then actually parse the XML.

Sorry, but I'm not going to do that. I may do it if Boost ever has a usable encoding handling library that I can use, but not before that.

This bug is about reading UTF-8 files that contain a BOM with a wide-character property_tree. I've fixed this bug by correctly skipping the BOM for wchar_t sequences under the assumption that the input stream has correctly converted whatever was on disk to the native encoding for wchar_t, which is further assumed to be native-endian UTF-16/32. That's actually a precondition for the XML parser, even though it's probably not documented.

But poor documentation is also another bug.

comment:9 by Sebastian Redl, 11 years ago

Resolution: fixed
Status: assigned → closed

(In [71991]) Merge r68990-68993, several fixes to PTree. Fixes bug 1678. Fixes bug 4387.

comment:10 by Sebastian Redl, 11 years ago

(In [71992]) Merge r68990-68993, several fixes to PTree. Fixes bug 1678. Fixes bug 4387. Forgot to commit these together with the header part.

Note: See TracTickets for help on using tickets.