Opened 12 years ago

Closed 7 years ago

#5033 closed Bugs (fixed)

Property Tree JSON Parser cannot handle utf-8 string correctly.

Reported by: Lorin Liu <liu.lorin@…>
Owned by: Sebastian Redl
Milestone: To Be Determined
Component: property_tree
Version: Boost 1.45.0
Severity: Problem
Keywords:
Cc:

Description

Please refer to the following code fragment.

    #include <iostream>
    #include <sstream>
    #include <string>
    #include <boost/foreach.hpp>
    #include <boost/format.hpp>
    #include <boost/property_tree/ptree.hpp>
    #include <boost/property_tree/json_parser.hpp>

    int main()
    {
        // UTF-8 encoded JSON text: {"value": "天津"}
        // 天 is 0xE5 0xA4 0xA9 and 津 is 0xE6 0xB4 0xA5.
        const char json[] = "{\"value\": \"\xE5\xA4\xA9\xE6\xB4\xA5\"}";

        boost::property_tree::ptree pt;
        boost::format fmter("%1% : %2% \n");
        std::stringstream strm;

        strm << json;
        boost::property_tree::read_json(strm, pt);

        std::string value = pt.get<std::string>("value");

        // Print each char of the parsed value together with its numeric code.
        // Expected: the bytes 0xE5, 0xA4, 0xA9, 0xE6, 0xB4 and 0xA5.
        // Actual: every char printed to the console is 0x7F.
        BOOST_FOREACH(char c, value)
            std::cout << (fmter % (int)(unsigned char) c % c) << std::endl;
    }

After my investigation, this might be a bug in boost/property_tree/detail/json_parser_read.hpp.

My patch for this issue is as follows.

--- json_parser_read.hpp.orig	2010-12-24 15:49:06.000000000 +0800
+++ json_parser_read.hpp	2011-01-02 10:26:37.000000000 +0800
@@ -145,7 +145,7 @@
             a_unicode(context &c): c(c) { }
             void operator()(unsigned long u) const
             {
-                u = (std::min)(u, static_cast<unsigned long>((std::numeric_limits<Ch>::max)()));
+                // u = (std::min)(u, static_cast<unsigned long>((std::numeric_limits<Ch>::max)()));
                 c.string += Ch(u);
             }
         };

Attachments (1)

property.tree.read.UTF-8.patch (951 bytes ) - added by Ilya Bobyr <ilya.bobyr@…> 10 years ago.
Property Tree JSON reader fix for UTF-8 encoded string


Change History (9)

comment:1 by Tommy, 11 years ago

I can confirm this bug. With the code snippet above, a_unicode::operator() is called with 0xE5. Because std::numeric_limits<char>::max() is 127 when char is signed, std::min(0xE5, 127) == 127, so 127 (0x7F) is appended to the result string instead of 0xE5.
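
The clamping described above can be reproduced in isolation. The following is a minimal sketch, not the Boost source: `clamp_to_ch_max` is a hypothetical helper that repeats the `std::min` step from the `a_unicode` functor for an arbitrary character type `Ch`, so the effect of a signed versus an unsigned character type can be compared directly.

```cpp
#include <algorithm>
#include <limits>

// Hypothetical helper (not part of Boost): repeats the clamping step of the
// old a_unicode functor, then converts the clamped value to Ch.
template <typename Ch>
Ch clamp_to_ch_max(unsigned long u)
{
    // For a signed 8-bit Ch the limit is 127, so every byte >= 0x80 becomes 0x7F.
    const unsigned long limit =
        static_cast<unsigned long>((std::numeric_limits<Ch>::max)());
    u = (std::min)(u, limit);
    return static_cast<Ch>(u);
}
```

With `Ch = signed char` the UTF-8 lead byte 0xE5 comes back as 127, while with `Ch = unsigned char` it survives unchanged, which matches the behaviour reported in this ticket on platforms where plain `char` is signed.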

Hope this bug can be fixed in the next release.

comment:2 by rshhh <ryushiro.sugehara@…>, 11 years ago

I think the approach taken in the patch is not correct.
Since a single byte of a UTF-8 string can take a value larger than the maximum of a signed char (0x7F), I think that certain characters may overflow a Ch (or just char) object.
I'm guessing that this is exactly why the original code takes the std::min() approach, am I correct? If so, I think we should store the UTF-8 characters in an unsigned char sequence...

in reply to:  2 comment:3 by Tommy, 11 years ago

Replying to rshhh <ryushiro.sugehara@…>:

I think the approach taken in the patch is not correct.
Since a single byte of a UTF-8 string can take a value larger than the maximum of a signed char (0x7F), I think that certain characters may overflow a Ch (or just char) object.
I'm guessing that this is exactly why the original code takes the std::min() approach, am I correct? If so, I think we should store the UTF-8 characters in an unsigned char sequence...

I think the problem is that a_unicode should handle a full Unicode character, which may not fit in a Ch (or just char). Should mbrtowc()/wcrtomb() be used to convert between Unicode (wchar_t) and Ch?

comment:4 by Jan Ciger <jan.ciger@…>, 10 years ago

Just got bitten by the same bug. What is the recommended fix for this?

by Ilya Bobyr <ilya.bobyr@…>, 10 years ago

Property Tree JSON reader fix for UTF-8 encoded string

comment:5 by Ilya Bobyr <ilya.bobyr@…>, 10 years ago

While it is true that char cannot hold all of Unicode, it can still hold values larger than 0x7F if you view it as an unsigned integer. There was a fix for the JSON writer in version 1.45 that makes it unconditionally treat the character type as unsigned, which allows it to save UTF-8 encoded strings even if char is signed. Here is a similar patch for the JSON reader. It still uses std::min(), but it clamps the character value being read to the maximum value of unsigned char.

This way JSON writer and JSON reader are doing the same kind of transformation to the characters and UTF-8 encoded strings can go through a full save/load cycle.
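
The attached patch itself is not reproduced in this ticket, so the following is only a sketch of the idea described in this comment, under the assumption that the clamp limit is taken from the unsigned counterpart of `Ch`. The name `clamp_to_unsigned_max` is hypothetical, and the sketch uses C++11 `std::make_unsigned` for brevity.

```cpp
#include <algorithm>
#include <limits>
#include <type_traits>

// Hypothetical helper (not the attached patch): clamps against the maximum of
// the unsigned counterpart of Ch, so byte values 0x80..0xFF pass through.
template <typename Ch>
Ch clamp_to_unsigned_max(unsigned long u)
{
    typedef typename std::make_unsigned<Ch>::type UCh;
    const unsigned long limit =
        static_cast<unsigned long>((std::numeric_limits<UCh>::max)());
    u = (std::min)(u, limit);
    // Convert through UCh so the bit pattern of the byte is what ends up in Ch.
    return static_cast<Ch>(static_cast<UCh>(u));
}
```

For `Ch = char` this keeps each UTF-8 byte intact regardless of whether `char` is signed, while values above 0xFF are still clamped, so it does not turn a `\uXXXX` escape into UTF-8.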

comment:6 by bmccart@…, 9 years ago

Encoding JSON strings is straightforward for UTF-8. The specification says that any Unicode character except '"' or '\' may be represented literally in the string. For UTF-8, parse the std::string to determine code points; if a code point (1-4 chars in UTF-8) contains a '"' or '\' char, the whole code point needs to be escaped as "\uXXXX" or "\uXXXX\uXXXX" (which is actually UTF-16). Otherwise you can dump the 1-4 literal chars into the JSON string content.

When decoding JSON strings, the escaped Unicode chars (UTF-16) must be decoded and converted back to UTF-8 chars, which can then be injected literally into the result string.
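
The conversion from a decoded code point to UTF-8 bytes mentioned above can be sketched as follows. This is an illustration, not code from Boost.PropertyTree: `append_utf8` is a hypothetical helper, and it assumes the caller passes a valid code point (no check for values above 0x10FFFF or for lone surrogates).

```cpp
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of code point cp to out.
// Assumes cp is a valid Unicode scalar value; no validation is performed.
inline void append_utf8(std::string& out, unsigned long cp)
{
    if (cp < 0x80) {                      // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {              // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                              // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

Calling it with U+5929 and U+6D25 (the two characters from the original report) yields the same six bytes as the literal UTF-8 string in the test case.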

http://www.ietf.org/rfc/rfc4627.txt (Sec 2.5) http://www.json.org/

comment:7 by ecotax@…, 9 years ago

This issue is related to https://svn.boost.org/trac/boost/ticket/8883

I posted a patch there that should make Unicode code points result in UTF-8 encodings.

comment:8 by Sebastian Redl, 7 years ago

Resolution: fixed
Status: new → closed

The JSON parser rewrite has landed on master, with full support for UTF-8, including converting \u-encoded UTF-16 surrogate pairs to a single UTF-8-encoded codepoint while parsing. It assumes, however, that the encoding of all narrow strings is UTF-8, even on Windows.
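
For readers unfamiliar with surrogate pairs, the arithmetic behind combining two \u escapes into one code point can be sketched like this. It is a minimal illustration of the standard UTF-16 rule, not the code from the rewritten parser; the function names are hypothetical.

```cpp
// Hypothetical helpers illustrating UTF-16 surrogate handling.

// True if u is a high (leading) surrogate, 0xD800..0xDBFF.
inline bool is_high_surrogate(unsigned long u)
{
    return u >= 0xD800UL && u <= 0xDBFFUL;
}

// True if u is a low (trailing) surrogate, 0xDC00..0xDFFF.
inline bool is_low_surrogate(unsigned long u)
{
    return u >= 0xDC00UL && u <= 0xDFFFUL;
}

// Combines a high/low surrogate pair into a single code point.
// The caller is assumed to have checked both halves with the predicates above.
inline unsigned long combine_surrogates(unsigned long high, unsigned long low)
{
    return 0x10000UL + ((high - 0xD800UL) << 10) + (low - 0xDC00UL);
}
```

For example, the escape sequence "\uD83D\uDE00" combines to the code point U+1F600, which a UTF-8 aware parser would then store as four bytes.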

(Note that the writer still has poor Unicode support.)
