Boost C++ Libraries: Ticket #5033: Property Tree JSON Parser cannot handle utf-8 string correctly.

Tommy — Wed, 24 Aug 2011 06:26:40 GMT

I can confirm this bug. As in the previous code snippets, a_unicode::operator() will be called with 0xE5. Because std::numeric_limits<char>::max() is 127 when char is signed, std::min(0xE5, 127) == 127 is what gets appended to the result string.
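To make the clamping concrete, here is a minimal sketch of the behaviour described above. It is not the Boost source: clamp_like_reader and append_clamped are hypothetical names, and signed char is used explicitly (an assumption) so the result is the same on platforms where plain char happens to be unsigned.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <string>

// Hypothetical stand-in for the clamp described in this comment.
// signed char is used so the limit is 127 on every platform.
inline unsigned long clamp_like_reader(unsigned long u) {
    const unsigned long max_signed =
        static_cast<unsigned long>(std::numeric_limits<signed char>::max());
    return std::min(u, max_signed);
}

// Appends the clamped value to the string, as the buggy path would.
inline void append_clamped(std::string& out, unsigned long u) {
    out += static_cast<char>(clamp_like_reader(u));
}

// Usage: every UTF-8 lead or continuation byte (0x80..0xFF) collapses to 0x7F.
inline void demo_clamp() {
    std::string s;
    append_clamped(s, 0xE5);   // first byte of a three-byte UTF-8 sequence
    assert(s.size() == 1 && s[0] == 0x7F);
    assert(clamp_like_reader(0x41) == 0x41);   // ASCII is unaffected
}
```

Under these assumptions, any non-ASCII byte is replaced by 0x7F, which matches the corruption reported in this ticket.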

Hope this bug can be fixed in the next release.

Sun, 20 Nov 2011 00:24:05 GMT

I think the approach taken in the patch is not correct.
Since a single byte of a UTF-8 string can take a value larger than the maximum a signed char can hold (0x7F), I think certain characters may overflow a Ch (or just char) object.
I'm guessing this is exactly why the original code takes the std::min() approach, am I correct? If so, I think we should store the UTF-8 characters in an unsigned char sequence...

Tommy — Thu, 01 Dec 2011 08:01:19 GMT

Replying to rshhh <ryushiro.sugehara@…>:

I think the approach taken in the patch is not correct.
Since a single byte of a UTF-8 string can take a value larger than the maximum a signed char can hold (0x7F), I think certain characters may overflow a Ch (or just char) object.
I'm guessing this is exactly why the original code takes the std::min() approach, am I correct? If so, I think we should store the UTF-8 characters in an unsigned char sequence...

I think the problem is that a_unicode SHOULD handle a Unicode character, which can't fit in a Ch (or just char). Should mbrtowc()/wcrtomb() be used to convert between Unicode (wchar_t) and Ch?

Thu, 03 May 2012 09:03:01 GMT

Just got bitten by the same bug. What is the recommended fix for this?

Mon, 17 Sep 2012 17:15:34 GMT

attachment → property.tree.read.UTF-8.patch

Property Tree JSON reader fix for UTF-8 encoded string

Mon, 17 Sep 2012 17:21:31 GMT

While it is true that char cannot hold the whole of Unicode, it can still hold values larger than 0x7F if you view it as an unsigned integer. There was a fix for the JSON writer in version 1.45 that makes it unconditionally view the character type as unsigned, thus allowing it to save UTF-8 encoded strings even if char is signed. Here is a similar patch for the JSON reader. While it still has std::min() in there, it uses the maximum value of unsigned char when clamping the character value being read.

This way the JSON writer and JSON reader apply the same kind of transformation to characters, and UTF-8 encoded strings can go through a full save/load cycle.
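The idea of viewing the character type as unsigned can be sketched as follows. This is an illustration only, not the attached patch: to_char_unsigned_view and from_char_unsigned_view are hypothetical names, and the sketch assumes an 8-bit char where converting an unsigned char to char preserves the bit pattern (guaranteed from C++20, and the usual behaviour before that).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>

// Hypothetical reader-side helper: clamp against unsigned char's maximum
// (255 for 8-bit chars) instead of char's, then store the byte as char.
inline char to_char_unsigned_view(unsigned long u) {
    const unsigned long max_unsigned =
        static_cast<unsigned long>(std::numeric_limits<unsigned char>::max());
    return static_cast<char>(static_cast<unsigned char>(std::min(u, max_unsigned)));
}

// Hypothetical writer-side helper: read a char back as an unsigned value.
inline unsigned long from_char_unsigned_view(char c) {
    return static_cast<unsigned char>(c);
}

// Usage: the UTF-8 bytes of U+4E2D (E4 B8 AD) survive a store/load cycle.
inline void demo_round_trip() {
    const unsigned long bytes[] = {0xE4, 0xB8, 0xAD};
    std::string s;
    for (unsigned long b : bytes) s += to_char_unsigned_view(b);
    assert(s == "\xE4\xB8\xAD");
    for (std::size_t i = 0; i < s.size(); ++i)
        assert(from_char_unsigned_view(s[i]) == bytes[i]);
}
```

Note that this only lets individual UTF-8 bytes pass through unchanged; it does not turn a \u escape above 0xFF into a multi-byte UTF-8 sequence.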

bmccart@… — Wed, 17 Jul 2013 18:18:28 GMT

Encoding JSON strings is straightforward for UTF-8. The specification says that 'any' Unicode character except '"', '\' and the control characters (U+0000 through U+001F) may be represented literally in the string. For UTF-8, parse the std::string to determine code points; if a code point (1-4 chars in UTF-8) contains '"' or '\' chars, then the whole code point needs to be escaped with "\uXXXX" or "\uXXXX\uXXXX" (which is actually UTF-16). Otherwise you can dump the 1-4 literal chars into the JSON string content.
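A minimal sketch of such an encoder, assuming the input std::string is already valid UTF-8. The function name escape_json_utf8 is hypothetical and not part of Boost.PropertyTree. The sketch relies on a property of UTF-8: every byte of a multi-byte sequence is 0x80 or above, so a '"' or '\' byte is always a complete ASCII character and the encoder can work byte by byte without decoding code points.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: escape a UTF-8 std::string for use as JSON string
// content. Only '"', '\' and control characters below 0x20 are escaped;
// all other bytes, including multi-byte UTF-8 sequences, pass through.
inline std::string escape_json_utf8(const std::string& in) {
    std::string out;
    for (char ch : in) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c == '"') {
            out += "\\\"";
        } else if (c == '\\') {
            out += "\\\\";
        } else if (c < 0x20) {
            char buf[7];  // backslash, 'u', four hex digits, terminator
            std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(c));
            out += buf;
        } else {
            out += ch;
        }
    }
    return out;
}

// Usage: quotes and backslashes are escaped, UTF-8 bytes are kept literally.
inline void demo_escape() {
    assert(escape_json_utf8("a\"b\\c") == "a\\\"b\\\\c");
    assert(escape_json_utf8("\xE4\xB8\xAD") == "\xE4\xB8\xAD");
}
```

The sketch uses the short forms \" and \\ rather than \uXXXX for the two special characters; both spellings are permitted by the JSON grammar.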

When decoding JSON strings, the escaped Unicode chars (UTF-16) must be decoded and converted back to UTF-8 chars, which can be injected literally into the result string.
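The decoding step can be sketched as two small helpers, assuming the four hex digits of each \uXXXX escape have already been parsed into an integer code unit. Both names (combine_surrogates, append_utf8) are hypothetical and only illustrate the conversion, not any existing parser.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Combine a UTF-16 surrogate pair (from "\uD8xx\uDCxx" style escapes)
// into a single code point in the range U+10000..U+10FFFF.
inline std::uint32_t combine_surrogates(std::uint32_t high, std::uint32_t low) {
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

// Append one code point to a std::string as 1 to 4 UTF-8 bytes.
inline void append_utf8(std::string& out, std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

For example, the escape \u4E2D yields the bytes E4 B8 AD, and the pair \uD83D\uDE00 combines to U+1F600, which is encoded as F0 9F 98 80. A real parser would also need to reject unpaired surrogates, which this sketch leaves to the caller.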

http://www.ietf.org/rfc/rfc4627.txt (Sec 2.5) http://www.json.org/

ecotax@… — Thu, 29 Aug 2013 15:15:30 GMT

This issue is related to https://svn.boost.org/trac/boost/ticket/8883

I posted a patch there that should make Unicode code points result in UTF-8 encodings.

Sebastian Redl — Tue, 07 Jul 2015 12:26:58 GMT

status new → closed
resolution → fixed

The JSON parser rewrite has landed on master, with full support for UTF-8, including converting \u-encoded UTF-16 surrogate pairs to a single UTF-8-encoded codepoint while parsing. It assumes, however, that the encoding of all narrow strings is UTF-8, even on Windows.

(Note that the writer still has poor Unicode support.)