#1678 closed Patches (fixed)
Boost.property_tree::read_xml does not parse UNICODE file with BOMs
Reported by: | tom | Owned by: | Sebastian Redl |
---|---|---|---|
Milestone: | Boost 1.47.0 | Component: | property_tree |
Version: | Boost Development Trunk | Severity: | Showstopper |
Keywords: | property_tree UNICODE BOM read_xml | Cc: | Marcin Kalicinski |
Description
I've reported it in both the users list and developers list with no response. please refer to Boost-users Digest, Vol 1560, Issue 1 and Boost Digest, Vol 2114, Issue 3
Change History (10)
comment:1 by , 15 years ago
Component: | None → property_tree |
---|---|
Owner: | set to |
comment:2 by , 13 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:3 by , 12 years ago
Milestone: | Boost 1.36.0 → Boost 1.47.0 |
---|---|
Severity: | Problem → Showstopper |
Type: | Bugs → Patches |
comment:5 by , 12 years ago
I don't think this is right - especially in the case of UTF-16 (and UTF-32)
For UTF-16:
- If there is a BOM, and it is "FE FF", then the rest of the UTF-16 must be interpreted as "little endian"
- If there is a BOM, and it is "FF FE", then the rest of the UTF-16 must be interpreted as "big endian".
I see no code here to do that. It just notes "Hey, there's a BOM here" (assuming that the BOM matches the endianness of the processors that is consuming the XML), and continues on.
See http://www.opentag.com/xfaq_enc.htm for more info.
comment:6 by , 12 years ago
True. The code is just to skip a BOM and continue as if it wasn't there. The data still has to be in the expected format.
This is the right thing to do. The XML parser should not concern itself with encodings. It's the responsibility of the code convert facet of the input stream to bring the data into the right format. The code here is just to skip the BOM.
comment:7 by , 12 years ago
Ok, then - let's add these additional test cases:
// byte order mark (UTF-16, big endian) const char *bug_data_pr1678be = "\xFE\xFF\0<\0?\0x\0m\0l\0 \0v\0e\0r\0s\0i\0o\0n\0=\0\"\01\0.\00\0\"\0 \0e\0n\0c\0o\0d\0i\0n\0g\0=\0\"\0u\0t\0f\0-\08\0\"\0?\0>\0<\0r\0o\0o\0t\0/\0>"; // byte order mark (UTF-16, little endian) const char *bug_data_pr1678le = "\xFF\xFE<\0?\0x\0m\0l\0 \0v\0e\0r\0s\0i\0o\0n\0=\0\"\01\0.\00\0\"\0 \0e\0n\0c\0o\0d\0i\0n\0g\0=\0\"\0u\0t\0f\0-\08\0\"\0?\0>\0<\0r\0o\0o\0t\0/\0>\0";
and make sure that they work - because they don't at the moment.
comment:8 by , 12 years ago
No. These tests are invalid because of the way the test system works, and because of the way PTree's XML support works.
PTree's XML parser doesn't insulate you from encoding issues. In fact, I have an error in my test case in that I specify UTF-16 as the encoding of the XML snippet. That's incorrect: the encoding is UTF-8 in the test file that is created. And PTree doesn't care anyway, because all PTree does is read data from an input stream (wistream in the wchar_t case) and process it, assuming that it's in the platform default encoding for this character type. The encoding declaration of the XML is completely ignored.
So the only thing that does encoding conversion is the input stream. The test cases install a UTF-8 conversion facet in the global locale, so that the wide stream tests expect the input files to contain UTF-8. Any other encoding would require replacing the code conversion facet for such tests, and wouldn't work at all for the narrow version, because narrow streams don't transcode AFAIK.
Yes, this is technically invalid handling of XML. But that's a completely different issue and has nothing to do with the BOM issue here. That would be a much bigger issue: it would mean that the library would have to load the file as a binary block, detect the encoding, transcode the data to the native encoding for the given character type, and only then actually parse the XML.
Sorry, but I'm not going to do that. I may do it if Boost ever has a usable encoding handling library that I can use, but not before that.
This bug is about reading UTF-8 files that contain a BOM with a wide-character property_tree. I've fixed this bug by correctly skipping the BOM for wchar_t sequences under the assumption that the input stream has correctly converted whatever was on disk to the native encoding for wchar_t, which is further assumed to be native-endian UTF-16/32. That's actually a precondition for the XML parser, even though it's probably not documented.
But poor documentation is also another bug.
comment:9 by , 11 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
The following is a fix for this, anyone pleaes review this fix and apply it to the trunk:
/property_tree/detail/rapidxml.hpp(1730):