Opened 12 years ago
Closed 7 years ago
#5033 closed Bugs (fixed)
Property Tree JSON Parser cannot handle utf-8 string correctly.
Reported by: | | Owned by: | Sebastian Redl
---|---|---|---
Milestone: | To Be Determined | Component: | property_tree
Version: | Boost 1.45.0 | Severity: | Problem
Keywords: | | Cc: |
Description
Please refer to the following code fragment.
#include <boost/foreach.hpp>
#include <boost/format.hpp>
#include <boost/property_tree/json_parser.hpp>
#include <boost/property_tree/ptree.hpp>
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    // This is a JSON UTF-8 string: {"value": "天津"}
    const char json[] = {0x7B, 0x22, 0x76, 0x61, 0x6C, 0x75, 0x65, 0x22, 0x3A, 0x20,
                         0x22, 0xE5, 0xA4, 0xA9, 0xE6, 0xB4, 0xA5, 0x22, 0x7D, 0x00};

    boost::property_tree::ptree pt;
    boost::format fmter("%1% : %2% \n");
    std::stringstream strm;
    std::string value;

    strm << json;
    read_json(strm, pt);
    value = pt.get<std::string>("value");

    // Print the individual chars one by one.
    // However, the wrong result appears: all chars printed to the console are 0x7F.
    // The expected result is the chars 0xE5, 0xA4, 0xA9, 0xE6, 0xB4 and 0xA5.
    BOOST_FOREACH(char c, value)
        std::cout << (fmter % (int)(unsigned char) c % c) << std::endl;
}
After some investigation, I believe this is a bug in boost/property_tree/detail/json_parser_read.hpp.
My patch for this issue is as follows.
--- json_parser_read.hpp.orig	2010-12-24 15:49:06.000000000 +0800
+++ json_parser_read.hpp	2011-01-02 10:26:37.000000000 +0800
@@ -145,7 +145,7 @@
     a_unicode(context &c): c(c) { }
     void operator()(unsigned long u) const
     {
-        u = (std::min)(u, static_cast<unsigned long>((std::numeric_limits<Ch>::max)()));
+        // u = (std::min)(u, static_cast<unsigned long>((std::numeric_limits<Ch>::max)()));
         c.string += Ch(u);
     }
 };
Attachments (1)
Change History (9)
comment:1 by , 11 years ago
follow-up: 3 comment:2 by , 11 years ago
I think the approach taken in the patch is not correct.
Since a single byte of a UTF-8 string can take a value larger than the maximum that signed char can hold (0x7F), I think that certain characters may overflow a Ch (or just char) object.
I'm guessing that this is exactly the reason why the original code takes the std::min() approach, am I correct? So if my reading is right, I think we should store the UTF-8 characters in an unsigned char sequence...
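Here is a minimal sketch (not part of the original comment; the standalone main() is only for demonstration) of what the existing clamp does to a UTF-8 byte such as 0xE5 on a platform where char is signed:

#include <algorithm>
#include <iostream>
#include <limits>

int main()
{
    unsigned long u = 0xE5; // first byte of the UTF-8 sequence for '天'
    // Mirrors the clamp in a_unicode::operator(): with a signed char,
    // numeric_limits<char>::max() is 0x7F, so every byte above it collapses.
    unsigned long clamped =
        (std::min)(u, static_cast<unsigned long>((std::numeric_limits<char>::max)()));
    std::cout << std::hex << clamped << '\n'; // prints 7f, not e5
}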
comment:3 by , 11 years ago
Replying to rshhh <ryushiro.sugehara@…>:
I think the approach taken in the patch is not correct.
Since a single byte of a UTF-8 string can take a value larger than the maximum that signed char can hold (0x7F), I think that certain characters may overflow a Ch (or just char) object.
I'm guessing that this is exactly the reason why the original code takes the std::min() approach, am I correct? So if my reading is right, I think we should store the UTF-8 characters in an unsigned char sequence...
I think the problem is: a_unicode SHOULD handle a Unicode character, which can't fit in a Ch (or just char). Should mbrtowc()/wcrtomb() be used to convert between Unicode (wchar_t) and Ch?
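As a rough sketch of that suggestion (assuming a UTF-8 locale is installed and that wchar_t is wide enough to hold the code point, as on Linux), wcrtomb() can turn a single wide character into its multibyte (UTF-8) bytes:

#include <climits>
#include <clocale>
#include <cstdio>
#include <cwchar>

int main()
{
    // Assumption: an "en_US.UTF-8" locale exists; otherwise wcrtomb() would
    // use whatever multibyte encoding the default locale provides.
    std::setlocale(LC_ALL, "en_US.UTF-8");
    std::mbstate_t state = std::mbstate_t();
    char buf[MB_LEN_MAX];
    std::size_t n = std::wcrtomb(buf, L'\u5929', &state); // U+5929, '天'
    if (n != static_cast<std::size_t>(-1))
        std::fwrite(buf, 1, n, stdout); // writes the bytes E5 A4 A9
}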
by , 10 years ago
Attachment: | property.tree.read.UTF-8.patch added
---|---
Property Tree JSON reader fix for UTF-8 encoded string
comment:5 by , 10 years ago
While it is true that char cannot hold every Unicode code point, it can still hold values larger than 0x7F if you view it as an unsigned integer. There was a fix for the JSON writer in version 1.45 that makes it unconditionally view the character type as unsigned, thus allowing it to save UTF-8-encoded strings even if char is signed. Here is a similar patch for the JSON reader. While it still has std::min() in there, it uses the maximum value of unsigned char when clamping a character value being read.
This way the JSON writer and the JSON reader apply the same kind of transformation to the characters, and UTF-8-encoded strings can go through a full save/load cycle.
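Roughly, the change described above could look like the following sketch (a hypothetical helper for illustration, not the attached patch itself): clamp to the unsigned char maximum and narrow through unsigned char, so each UTF-8 byte keeps its bit pattern even when char is signed.

#include <algorithm>
#include <limits>
#include <string>

// Hypothetical helper, not the attached patch: append a parsed character
// value the way the comment above describes.
template <typename Ch>
void append_clamped(std::basic_string<Ch> &s, unsigned long u)
{
    // Clamp to 0xFF instead of numeric_limits<Ch>::max(), so bytes such as
    // 0xE5 are no longer collapsed to 0x7F.
    u = (std::min)(u, static_cast<unsigned long>(
            (std::numeric_limits<unsigned char>::max)()));
    // Narrow through unsigned char to preserve the byte's bit pattern.
    s += Ch(static_cast<unsigned char>(u));
}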
comment:6 by , 9 years ago
Encoding JSON strings is straightforward with UTF-8. The specification says that 'any' Unicode character except '"' or '\' may be represented literally in the string. For UTF-8 output, either parse the std::string to determine code points, and if any code point (1-4 bytes in UTF-8) contains a '"' or '\' byte, then the whole code point needs to be escaped as "\uXXXX" or "\uXXXX\uXXXX" (which is actually UTF-16). Otherwise you can dump the 1-4 literal bytes into the JSON string content.
When decoding JSON strings, the escaped Unicode characters (UTF-16) must be decoded and converted back to UTF-8 bytes, which can be injected literally into the result string.
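For illustration only (not code from the library), the escaping rule described above amounts to something like this: a code point at or below U+FFFF becomes a single \uXXXX escape, and anything above becomes a UTF-16 surrogate pair.

#include <cstdio>

// Hypothetical helper: write a code point as the "\uXXXX" escape(s) that
// RFC 4627 describes, using a surrogate pair above U+FFFF.
void write_json_escape(unsigned long cp)
{
    if (cp <= 0xFFFF) {
        std::printf("\\u%04lX", cp);
    } else {
        cp -= 0x10000;
        std::printf("\\u%04lX\\u%04lX",
                    0xD800 + (cp >> 10),    // high surrogate
                    0xDC00 + (cp & 0x3FF)); // low surrogate
    }
}

int main()
{
    write_json_escape(0x5929);  // prints \u5929
    write_json_escape(0x1F600); // prints \uD83D\uDE00
}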
http://www.ietf.org/rfc/rfc4627.txt (Sec 2.5) http://www.json.org/
comment:7 by , 9 years ago
This issue is related to https://svn.boost.org/trac/boost/ticket/8883
I posted a patch there that should make Unicode code points result in UTF-8 encodings.
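The gist of such a fix, sketched here under the assumption of plain UTF-8 output (an illustration, not the patch actually posted on ticket 8883), is to emit the code point's UTF-8 byte sequence instead of clamping it:

#include <string>

// Illustrative helper: append the UTF-8 encoding of a Unicode code point.
void append_utf8(std::string &s, unsigned long cp)
{
    if (cp < 0x80) {
        s += static_cast<char>(cp);
    } else if (cp < 0x800) {
        s += static_cast<char>(0xC0 | (cp >> 6));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        s += static_cast<char>(0xE0 | (cp >> 12));
        s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        s += static_cast<char>(0xF0 | (cp >> 18));
        s += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
// Example: append_utf8(s, 0x5929) appends the three bytes 0xE5 0xA4 0xA9.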
comment:8 by , 7 years ago
Resolution: | → fixed
---|---
Status: | new → closed
The JSON parser rewrite has landed on master, with full support for UTF-8, including converting \u-encoded UTF-16 surrogate pairs to a single UTF-8-encoded codepoint while parsing. It assumes, however, that the encoding of all narrow strings is UTF-8, even on Windows.
(Note that the writer still has poor Unicode support.)
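For reference, converting a \u-escaped surrogate pair back into a single code point (which can then be written out as UTF-8) amounts to the following; this is an illustrative sketch, not the parser's actual code.

// Illustrative sketch: merge a UTF-16 surrogate pair (two "\uXXXX" escapes)
// into one Unicode code point.
unsigned long combine_surrogates(unsigned long high, unsigned long low)
{
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}
// Example: combine_surrogates(0xD83D, 0xDE00) == 0x1F600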
I can confirm this bug. As in the previous code snippets, a_unicode::operator() will be called with 0xE5. Because std::numeric_limits<char>::max() is 127, std::min(0xE5, 127) == 127 will be appended to the result string.
Hope this bug can be fixed in the next release.