Boost C++ Libraries: Ticket #5033: Property Tree JSON Parser cannot handle utf-8 string correctly. https://svn.boost.org/trac10/ticket/5033
<p> Please refer to the following code fragment. </p>
<pre class="wiki">// This is a JSON UTF-8 string {"value": "天津"}
const char json[] = {0x7B, 0x22, 0x76, 0x61, 0x6C, 0x75, 0x65, 0x22, 0x3A, 0x20, 0x22,
                     0xE5, 0xA4, 0xA9, 0xE6, 0xB4, 0xA5, 0x22, 0x7D, 0x00};
boost::property_tree::ptree pt;
boost::format fmter("%1% : %2% \n");
std::stringstream strm;
std::string value;

strm &lt;&lt; json;
read_json(strm, pt);
value = pt.get&lt;std::string&gt;("value");

// Print the individual chars one by one.
// However, the wrong result appears: all chars printed to the console are 0x7F.
// The expected result is the chars 0xE5, 0xA4, 0xA9, 0xE6, 0xB4 and 0xA5.
BOOST_FOREACH(char c, value)
    std::cout &lt;&lt; (fmter % (int)(unsigned char) c % c) &lt;&lt; std::endl;
</pre>
<p> After my investigation, this might be a bug in boost/property_tree/detail/json_parser_read.hpp. </p>
<p> My patch for this issue is as follows. </p>
<pre class="wiki">--- json_parser_read.hpp.orig 2010-12-24 15:49:06.000000000 +0800
+++ json_parser_read.hpp 2011-01-02 10:26:37.000000000 +0800
@@ -145,7 +145,7 @@
             a_unicode(context &amp;c): c(c) { }
             void operator()(unsigned long u) const
             {
-                u = (std::min)(u, static_cast&lt;unsigned long&gt;((std::numeric_limits&lt;Ch&gt;::max)()));
+                // u = (std::min)(u, static_cast&lt;unsigned long&gt;((std::numeric_limits&lt;Ch&gt;::max)()));
                 c.string += Ch(u);
             }
         };
</pre>
Tommy Wed, 24 Aug 2011 06:26:40 GMT <link>https://svn.boost.org/trac10/ticket/5033#comment:1 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/5033#comment:1</guid> <description> <p> I can confirm this bug. As in the previous code snippet, a_unicode::operator() will be called with 0xE5. 
Because std::numeric_limits&lt;char&gt;::max() is 127 here, std::min(0xE5, 127) == 127 is appended to the result string. </p> <p> Hope this bug can be fixed in the next release. </p> </description> <category>Ticket</category> </item> <item> <author>rshhh <ryushiro.sugehara@…></author> <pubDate>Sun, 20 Nov 2011 00:24:05 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/5033#comment:2 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/5033#comment:2</guid> <description> <p> I think the approach taken in the patch is not correct. <br /> Since a single byte of a UTF-8 string can take a value larger than the maximum that a <strong><em>signed char</em></strong> can hold (0x7F), I think that certain characters may overflow a <strong><em>Ch</em></strong> (or just <strong><em>char</em></strong>) object. <br /> I'm guessing that the issue I just described is exactly the reason why the original code takes the std::min() approach; am I correct? So if my opinion is right, I think we should store the UTF-8 character in an <strong><em>unsigned char</em></strong> sequence... </p> </description> <category>Ticket</category> </item> <item> <dc:creator>Tommy</dc:creator> <pubDate>Thu, 01 Dec 2011 08:01:19 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/5033#comment:3 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/5033#comment:3</guid> <description> <p> Replying to <a class="ticket" href="https://svn.boost.org/trac10/ticket/5033#comment:2" title="Comment 2">rshhh &lt;ryushiro.sugehara@…&gt;</a>: </p> <blockquote class="citation"> <p> I think the approach taken in the patch is not correct. <br /> Since a single byte of a UTF-8 string can take a value larger than the maximum that a <strong><em>signed char</em></strong> can hold (0x7F), I think that certain characters may overflow a <strong><em>Ch</em></strong> (or just <strong><em>char</em></strong>) object. 
<br /> I'm guessing that the issue I just described is exactly the reason why the original code takes the std::min() approach; am I correct? So if my opinion is right, I think we should store the UTF-8 character in an <strong><em>unsigned char</em></strong> sequence... </p> </blockquote> <p> I think the problem is: a_unicode SHOULD handle a Unicode character, which can't fit in a Ch (or just char). Should mbrtowc()/wcrtomb() be used to convert between Unicode (wchar_t) and Ch? </p> </description> <category>Ticket</category> </item> <item> <author>Jan Ciger <jan.ciger@…></author> <pubDate>Thu, 03 May 2012 09:03:01 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/5033#comment:4 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/5033#comment:4</guid> <description> <p> Just got bitten by the same bug. What is the recommended fix for this? </p> </description> <category>Ticket</category> </item> <item> <author>Ilya Bobyr <ilya.bobyr@…></author> <pubDate>Mon, 17 Sep 2012 17:15:34 GMT</pubDate> <title>attachment set https://svn.boost.org/trac10/ticket/5033 https://svn.boost.org/trac10/ticket/5033 <ul> <li><strong>attachment</strong> → <span class="trac-field-new">property.tree.read.UTF-8.patch</span> </li> </ul> <p> Property Tree JSON reader fix for UTF-8 encoded string </p> Ticket Ilya Bobyr <ilya.bobyr@…> Mon, 17 Sep 2012 17:21:31 GMT <link>https://svn.boost.org/trac10/ticket/5033#comment:5 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/5033#comment:5</guid> <description> <p> While it is true that char cannot hold all of Unicode, it can still handle values larger than 0x7F if you view it as an unsigned integer. There was a fix for the JSON writer in version 1.45 that makes it unconditionally view the character type as unsigned, thus allowing it to save UTF-8 encoded strings even if char is signed. Here is a similar patch for the JSON reader. 
While it still has std::min() in there, it uses the maximum value of unsigned char when clamping a character value being read. </p> <p> This way the JSON writer and the JSON reader do the same kind of transformation to the characters, and UTF-8 encoded strings can go through a full save/load cycle. </p> </description> <category>Ticket</category> </item> <item> <author>bmccart@…</author> <pubDate>Wed, 17 Jul 2013 18:18:28 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/5033#comment:6 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/5033#comment:6</guid> <description> <p> Encoding of JSON strings is straightforward for UTF-8 encoding. The specification says that 'any' Unicode character except '"' or '\' may be literally represented in the string. For UTF-8 encoding, parse the std::string to determine code points, and if any code point (1-4 chars in UTF-8) contains '"' or '\' chars then the whole code point needs to be escaped with "\uXXXX" or "\uXXXX\uXXXX" (which is actually UTF-16). Otherwise you can dump the 1-4 literal chars into the JSON string content. </p> <p> When decoding JSON strings, the escaped Unicode characters (UTF-16) must be decoded and converted back to UTF-8 characters, which can be literally injected into the result string. 
</p> <p> <a class="ext-link" href="http://www.ietf.org/rfc/rfc4627.txt"><span class="icon">​</span>http://www.ietf.org/rfc/rfc4627.txt</a> (Sec 2.5) <a class="ext-link" href="http://www.json.org/"><span class="icon">​</span>http://www.json.org/</a> </p> </description> <category>Ticket</category> </item> <item> <author>ecotax@…</author> <pubDate>Thu, 29 Aug 2013 15:15:30 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/5033#comment:7 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/5033#comment:7</guid> <description> <p> This issue is related to <a class="ext-link" href="https://svn.boost.org/trac/boost/ticket/8883"><span class="icon">​</span>https://svn.boost.org/trac/boost/ticket/8883</a> </p> <p> I posted a patch there that should make Unicode code points result in UTF-8 encodings. </p> </description> <category>Ticket</category> </item> <item> <dc:creator>Sebastian Redl</dc:creator> <pubDate>Tue, 07 Jul 2015 12:26:58 GMT</pubDate> <title>status changed; resolution set https://svn.boost.org/trac10/ticket/5033#comment:8 https://svn.boost.org/trac10/ticket/5033#comment:8 <ul> <li><strong>status</strong> <span class="trac-field-old">new</span> → <span class="trac-field-new">closed</span> </li> <li><strong>resolution</strong> → <span class="trac-field-new">fixed</span> </li> </ul> <p> The JSON parser rewrite has landed on master, with full support for UTF-8, including converting \u-encoded UTF-16 surrogate pairs to a single UTF-8-encoded codepoint while parsing. It assumes, however, that the encoding of all narrow strings is UTF-8, even on Windows. </p> <p> (Note that the writer still has poor Unicode support.) </p> Ticket
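For readers following the thread, the two mechanisms discussed above can be sketched in a few lines of C++. This is only an illustration, not the code from json_parser_read.hpp or from the rewritten parser: the helper names clamp_to_max, combine_surrogates and append_utf8 are hypothetical and were chosen for this example. The first helper mirrors the std::min() clamp quoted in the patch, and the other two show one common way to turn a \u-escaped UTF-16 surrogate pair into UTF-8 bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <string>

// Hypothetical helper (not Boost code): clamp a parsed value to the maximum
// of character type T, mirroring the std::min() line quoted in the patch.
template <typename T>
unsigned long clamp_to_max(unsigned long u) {
    return (std::min)(u, static_cast<unsigned long>((std::numeric_limits<T>::max)()));
}

// Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
unsigned long combine_surrogates(unsigned long high, unsigned long low) {
    assert(high >= 0xD800ul && high <= 0xDBFFul);  // high (lead) surrogate
    assert(low >= 0xDC00ul && low <= 0xDFFFul);    // low (trail) surrogate
    return 0x10000ul + ((high - 0xD800ul) << 10) + (low - 0xDC00ul);
}

// Hypothetical helper: append the UTF-8 encoding of code point cp to out.
void append_utf8(std::string& out, unsigned long cp) {
    assert(cp <= 0x10FFFFul);  // largest valid Unicode code point
    if (cp < 0x80ul) {
        // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800ul) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0ul | (cp >> 6));
        out += static_cast<char>(0x80ul | (cp & 0x3Ful));
    } else if (cp < 0x10000ul) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0ul | (cp >> 12));
        out += static_cast<char>(0x80ul | ((cp >> 6) & 0x3Ful));
        out += static_cast<char>(0x80ul | (cp & 0x3Ful));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0ul | (cp >> 18));
        out += static_cast<char>(0x80ul | ((cp >> 12) & 0x3Ful));
        out += static_cast<char>(0x80ul | ((cp >> 6) & 0x3Ful));
        out += static_cast<char>(0x80ul | (cp & 0x3Ful));
    }
}
```

With these helpers, clamp_to_max<signed char>(0xE5) yields 0x7F, the value the reporter saw, while clamp_to_max<unsigned char>(0xE5) leaves 0xE5 intact, which is the idea behind the reader patch attached in comment 5. Calling append_utf8 with U+5929 and then U+6D25 produces the bytes E5 A4 A9 E6 B4 A5 from the original report.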