Boost C++ Libraries: Ticket #8883: property_tree JSON reader does not parse unicode characters properly https://svn.boost.org/trac10/ticket/8883 <p> The JSON writer converts every character outside the ASCII range (&gt;127) to \u notation (e.g. an 'ä' in UTF-8 will become \u00C3\u00A4 ). The JSON reader on the other hand does not read such \u encoded characters back properly. The reason is the a_unicode struct in json_parser_read.hpp. </p> <p> This code: </p> <pre class="wiki">u = (std::min)(u, static_cast&lt;unsigned long&gt;((std::numeric_limits&lt;Ch&gt;::max)())); </pre><p> should actually be more like this: </p> <pre class="wiki">typedef typename make_unsigned&lt;Ch&gt;::type UCh; u = (std::min)(u, static_cast&lt;unsigned long&gt;((std::numeric_limits&lt;UCh&gt;::max)())); </pre><p> Otherwise every \u escaped character whose value is by definition greater than 127 will be clamped to 127 and will therefor never get its original value. </p> en-us Boost C++ Libraries /htdocs/site/boost.png https://svn.boost.org/trac10/ticket/8883 Trac 1.4.3 ecotax@… Thu, 29 Aug 2013 14:56:47 GMT attachment set https://svn.boost.org/trac10/ticket/8883 https://svn.boost.org/trac10/ticket/8883 <ul> <li><strong>attachment</strong> → <span class="trac-field-new">json_parser_diff.txt</span> </li> </ul> <p> property_tree json_parser_read UTF-8 patch </p> Ticket ecotax@… Thu, 29 Aug 2013 15:02:44 GMT <link>https://svn.boost.org/trac10/ticket/8883#comment:1 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8883#comment:1</guid> <description> <p> I also encountered problems with parsing Unicode characters. Specifically, when using the normal (non-wide-string) version of property_tree, parsing JSON containing explicitly encoded Unicode characters, using \uXXXX, characters above 0x7F, for example \u2026 (horizontal ellipsis) are not converted to UTF-8 (giving E2 80 A6). I just added a patch that will do so. I'm not sure if this is a good thing in general, and if so, whether the patch as is is production-ready, but maybe it's helpful in some way. </p> </description> <category>Ticket</category> </item> <item> <dc:creator>Lettort</dc:creator> <pubDate>Wed, 04 Sep 2013 08:11:26 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8883#comment:2 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8883#comment:2</guid> <description> <p> Hello, </p> <p> maybe I am asking about basic unicode knowledge. Even than I would be happy to get a link or some explanation to the following question: </p> <p> The description says: "...'ä' in UTF-8 will become \u00C3\u00A4 ..." </p> <p> Why is 'ä' not becoming /u00E4? The "/u"-pair is no surrogate pair, isn't it? So what's the meaning with it? </p> <p> Why am I asking? My problem is, that I am generating a JSON file which will be read via JQuery to be shown in a web page. In that case, this pair will be shown as two separated unicode symbols. </p> </description> <category>Ticket</category> </item> <item> <author>ecotax@…</author> <pubDate>Wed, 04 Sep 2013 09:05:41 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8883#comment:3 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8883#comment:3</guid> <description> <p> @Lettort: There is a difference betweeen Unicode, specifying 'ä' maps to code point E4, and the various ways to encode this code point in bits or bytes. There is UTF-16, encoding this as 00E4 (16 bits, fits in a wide char), but also UTF-8, encoding this as two bytes, C3 A4. When parsing a /u00E4, the correct way to handle this depends on what encoding you want for your string. If you have a wide string and expect UTF-16, then yes, you'd expect the wide char 00E4. If you have a regular string and expect UTF-8, you'd expect the two bytes C3 A4. </p> <p> The original bug report states that first writing and then reading 'ä', the writer (defensively?) writes this using two \u encoded characters, each being one byte of the UTF-8 encoding. Regardless if this is the best choice or not, you'd want the reader to handle this in such a way that it 'round-trips' as much as possible, which currently is not the case. </p> <p> BTW, For future questions/discussions, I guess a site like stackoverflow.com is more appropriate. </p> </description> <category>Ticket</category> </item> <item> <author>Ben McCart <bmccart@…></author> <pubDate>Mon, 09 Sep 2013 20:23:35 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8883#comment:4 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8883#comment:4</guid> <description> <p> After inspecting /detail/json_parser_write.hpp and /detail/json_parser_read.hpp it has become apparent to me that the current implementation is completely deficient concerning Unicode (both UTF8 and UTF16). The writer creates a Unicode escape sequence block (/uXXXX) for each incoming 'C' char in the UTF8 code sequence, which is incorrect. Each escaped Unicode character is to be represented as 16-bit hex values of UTF16 single or surrogate pairs (see RFC4627 section 2.5 <a class="ext-link" href="http://tools.ietf.org/html/rfc4627"><span class="icon">​</span>http://tools.ietf.org/html/rfc4627</a>). In addition, the existing implementation will only encode UCS2 correctly for 16 bit wchar_t strings as it isn't doing anything to correctly handle surrogate pairs either in the writer or the reader. </p> <p> I have a working implementation which I'm sure does not meet Boost coding guidelines, but may be adaptable with use of Boost Locale (I am using codecvt). I couldn't figure out how to add an attachment so I'm including the diffs inline bellow (The UTF8 implementation has been tested some, the UTF16 portion hasn't. It is ugly, but it works.): </p> <pre class="wiki"> Index: json_parser_read.hpp =================================================================== --- json_parser_read.hpp (revision 85628) +++ json_parser_read.hpp (working copy) @@ -22,6 +22,7 @@ #include &lt;istream&gt; #include &lt;vector&gt; #include &lt;algorithm&gt; +#include &lt;codecvt&gt; namespace boost { namespace property_tree { namespace json_parser { @@ -41,7 +42,10 @@ Str name; Ptree root; std::vector&lt;Ptree *&gt; stack; + unsigned long u_surrogate; + context() : u_surrogate(0) {} + struct a_object_s { context &amp;c; @@ -146,8 +150,46 @@ a_unicode(context &amp;c): c(c) { } void operator()(unsigned long u) const { - u = (std::min)(u, static_cast&lt;unsigned long&gt;((std::numeric_limits&lt;Ch&gt;::max)())); - c.string += Ch(u); + typedef typename make_unsigned&lt;Ch&gt;::type UCh; + if (long(std::numeric_limits&lt;UCh&gt;::max()) &gt;= 0xFFFF) + { + if (c.u_surrogate) + c.string += Ch((std::min)(c.u_surrogate, static_cast&lt;unsigned long&gt;((std::numeric_limits&lt;Ch&gt;::max)()))); + + u = (std::min)(u, static_cast&lt;unsigned long&gt;((std::numeric_limits&lt;Ch&gt;::max)())); + c.string += Ch(u); + } + else // Ch is one byte - encode the given Unicode code point as UTF-8 + { + if ((c.u_surrogate == 0) &amp;&amp; (0xD7FF &lt; u &amp;&amp; u &lt; 0xE000)) + { + c.u_surrogate = u; + } + else + { + wchar_t utf16str[3] = { wchar_t(0), wchar_t(0), wchar_t(0) }; + wchar_t const *from_next = utf16str; + + Ch utf8str[5] = { Ch(0), Ch(0), Ch(0), Ch(0), Ch(0) }; + Ch * to_next = utf8str; + + std::size_t size = 0; + if (c.u_surrogate) + { + utf16str[size++] = wchar_t(c.u_surrogate); + c.u_surrogate = 0; + } + utf16str[size++] = wchar_t(u); + + std::mbstate_t state = 0; + std::codecvt_utf8_utf16&lt;wchar_t&gt; utf16_utf8; + + if (utf16_utf8.out(state, &amp;utf16str[0], &amp;utf16str[0] + size, from_next, &amp;utf8str[0], &amp;utf8str[4], to_next) != 0) + BOOST_PROPERTY_TREE_THROW(json_parser_error("write error", "", 0)); + + c.string += utf8str; + } + } } }; </pre><pre class="wiki">Index: json_parser_read.hpp =================================================================== --- json_parser_read.hpp (revision 85628) +++ json_parser_read.hpp (working copy) @@ -22,6 +22,7 @@ #include &lt;istream&gt; #include &lt;vector&gt; #include &lt;algorithm&gt; +#include &lt;codecvt&gt; namespace boost { namespace property_tree { namespace json_parser { @@ -41,7 +42,10 @@ Str name; Ptree root; std::vector&lt;Ptree *&gt; stack; + unsigned long u_surrogate; + context() : u_surrogate(0) {} + struct a_object_s { context &amp;c; @@ -146,8 +150,46 @@ a_unicode(context &amp;c): c(c) { } void operator()(unsigned long u) const { - u = (std::min)(u, static_cast&lt;unsigned long&gt;((std::numeric_limits&lt;Ch&gt;::max)())); - c.string += Ch(u); + typedef typename make_unsigned&lt;Ch&gt;::type UCh; + if (long(std::numeric_limits&lt;UCh&gt;::max()) &gt;= 0xFFFF) + { + if (c.u_surrogate) + c.string += Ch((std::min)(c.u_surrogate, static_cast&lt;unsigned long&gt;((std::numeric_limits&lt;Ch&gt;::max)()))); + + u = (std::min)(u, static_cast&lt;unsigned long&gt;((std::numeric_limits&lt;Ch&gt;::max)())); + c.string += Ch(u); + } + else // Ch is one byte - encode the given Unicode code point as UTF-8 + { + if ((c.u_surrogate == 0) &amp;&amp; (0xD7FF &lt; u &amp;&amp; u &lt; 0xE000)) + { + c.u_surrogate = u; + } + else + { + wchar_t utf16str[3] = { wchar_t(0), wchar_t(0), wchar_t(0) }; + wchar_t const *from_next = utf16str; + + Ch utf8str[5] = { Ch(0), Ch(0), Ch(0), Ch(0), Ch(0) }; + Ch * to_next = utf8str; + + std::size_t size = 0; + if (c.u_surrogate) + { + utf16str[size++] = wchar_t(c.u_surrogate); + c.u_surrogate = 0; + } + utf16str[size++] = wchar_t(u); + + std::mbstate_t state = 0; + std::codecvt_utf8_utf16&lt;wchar_t&gt; utf16_utf8; + + if (utf16_utf8.out(state, &amp;utf16str[0], &amp;utf16str[0] + size, from_next, &amp;utf8str[0], &amp;utf8str[4], to_next) != 0) + BOOST_PROPERTY_TREE_THROW(json_parser_error("write error", "", 0)); + + c.string += utf8str; + } + } } }; </pre><p> </p> <p> </p> </description> <category>Ticket</category> </item> <item> <author>Ben McCart <bmccart@…></author> <pubDate>Mon, 09 Sep 2013 21:06:21 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8883#comment:5 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8883#comment:5</guid> <description> <p> I accidentally pasted the reader diff twice rather than pasting the reader and the writer. In addition, a bug in the UTF16 reader has been fixed. (How embarrassing) </p> <pre class="wiki"> Index: json_parser_write.hpp =================================================================== --- json_parser_write.hpp (revision 85628) +++ json_parser_write.hpp (working copy) @@ -10,12 +10,15 @@ #ifndef BOOST_PROPERTY_TREE_DETAIL_JSON_PARSER_WRITE_HPP_INCLUDED #define BOOST_PROPERTY_TREE_DETAIL_JSON_PARSER_WRITE_HPP_INCLUDED +#include &lt;boost/cstdint.hpp&gt; #include &lt;boost/property_tree/ptree.hpp&gt; #include &lt;boost/next_prior.hpp&gt; #include &lt;boost/type_traits/make_unsigned.hpp&gt; #include &lt;string&gt; #include &lt;ostream&gt; +#include &lt;sstream&gt; #include &lt;iomanip&gt; +#include &lt;codecvt&gt; namespace boost { namespace property_tree { namespace json_parser { @@ -33,7 +36,7 @@ // We escape everything outside ASCII, because this code can't // handle high unicode characters. if (*b == 0x20 || *b == 0x21 || (*b &gt;= 0x23 &amp;&amp; *b &lt;= 0x2E) || - (*b &gt;= 0x30 &amp;&amp; *b &lt;= 0x5B) || (*b &gt;= 0x5D &amp;&amp; *b &lt;= 0xFF)) + (*b &gt;= 0x30 &amp;&amp; *b &lt;= 0x5B) || (*b &gt;= 0x5D &amp;&amp; *b &lt;= 0x80)) result += *b; else if (*b == Ch('\b')) result += Ch('\\'), result += Ch('b'); else if (*b == Ch('\f')) result += Ch('\\'), result += Ch('f'); @@ -44,18 +47,59 @@ else if (*b == Ch('\\')) result += Ch('\\'), result += Ch('\\'); else { - const char *hexdigits = "0123456789ABCDEF"; + std::ostringstream oss; typedef typename make_unsigned&lt;Ch&gt;::type UCh; - unsigned long u = (std::min)(static_cast&lt;unsigned long&gt;( - static_cast&lt;UCh&gt;(*b)), - 0xFFFFul); - int d1 = u / 4096; u -= d1 * 4096; - int d2 = u / 256; u -= d2 * 256; - int d3 = u / 16; u -= d3 * 16; - int d4 = u; - result += Ch('\\'); result += Ch('u'); - result += Ch(hexdigits[d1]); result += Ch(hexdigits[d2]); - result += Ch(hexdigits[d3]); result += Ch(hexdigits[d4]); + if (long(std::numeric_limits&lt;UCh&gt;::max()) &gt;= 0xFFFF) + { + // Assume UTF16. + oss &lt;&lt; "\\u" &lt;&lt; std::setw(4) &lt;&lt; std::setfill('0') &lt;&lt; std::hex &lt;&lt; std::uppercase &lt;&lt; uint16_t(*b); + if ((0xD7FF &lt; uint16_t(*b)) &amp;&amp; (uint16_t(*b) &lt; 0xE000)) + { + // Add second 16 bit value for surrogat6e pair + if (e-b == 1) + BOOST_PROPERTY_TREE_THROW(json_parser_error("write error", "", 0)); + + ++b; + oss &lt;&lt; "\\u" &lt;&lt; std::setw(4) &lt;&lt; std::setfill('0') &lt;&lt; std::hex &lt;&lt; std::uppercase &lt;&lt; uint16_t(*b); + } + + result += oss.str(); + } + else + { + // Assume UTF8 + std::mbstate_t state = 0; + std::codecvt_utf8_utf16&lt;wchar_t&gt; utf16_utf8; + + std::size_t in_max = std::min(e-b, 4);; + // Determine how many 'C' chars will be consumed to constuct a single UTF8 codepoint - this is + // required as codecvt.in will fail if it begins another code point that can't be completed + // because the 'end' of input sequence is reached... + if ((*b &amp; uint8_t(0xE0)) == uint8_t(0xC0)) + in_max = std::min(e-b, 2); + else if ((*b &amp; uint8_t(0xF0)) == uint8_t(0xE0)) + in_max = std::min(e-b, 3); + else if ((*b &amp; uint8_t(0xF8)) == uint8_t(0xF0)) + in_max = std::min(e-b, 4); + else + BOOST_PROPERTY_TREE_THROW(json_parser_error("write error", "", 0)); + + Ch const *from_next = &amp;*b; + wchar_t utf16str[2] = { wchar_t(0), wchar_t(0) }; + wchar_t *to_next = utf16str; + if (utf16_utf8.in(state, &amp;*b, (&amp;*b) + in_max, from_next, utf16str, utf16str + 2, to_next) != 0) + BOOST_PROPERTY_TREE_THROW(json_parser_error("write error", "", 0)); + + oss &lt;&lt; "\\u" &lt;&lt; std::setw(4) &lt;&lt; std::setfill('0') &lt;&lt; std::hex &lt;&lt; std::uppercase &lt;&lt; uint16_t(utf16str[0]); + if ( to_next - utf16str == 2) + { + // Add second 16 bit value for surrogat6e pair + oss &lt;&lt; "\\u" &lt;&lt; std::setw(4) &lt;&lt; std::setfill('0') &lt;&lt; std::hex &lt;&lt; std::uppercase &lt;&lt; uint16_t(utf16str[1]); + } + result += oss.str(); + + b += ((from_next - &amp;*b) - std::size_t(1)); // Additinal incrementing for additional UTF8 chars consumed. + } } ++b; } </pre><pre class="wiki">Index: json_parser_read.hpp =================================================================== --- json_parser_read.hpp (revision 85628) +++ json_parser_read.hpp (working copy) @@ -22,6 +22,7 @@ #include &lt;istream&gt; #include &lt;vector&gt; #include &lt;algorithm&gt; +#include &lt;codecvt&gt; namespace boost { namespace property_tree { namespace json_parser { @@ -41,7 +42,10 @@ Str name; Ptree root; std::vector&lt;Ptree *&gt; stack; + unsigned long u_surrogate; + context() : u_surrogate(0) {} + struct a_object_s { context &amp;c; @@ -146,8 +150,53 @@ a_unicode(context &amp;c): c(c) { } void operator()(unsigned long u) const { - u = (std::min)(u, static_cast&lt;unsigned long&gt;((std::numeric_limits&lt;Ch&gt;::max)())); - c.string += Ch(u); + typedef typename make_unsigned&lt;Ch&gt;::type UCh; + if (long(std::numeric_limits&lt;UCh&gt;::max()) &gt;= 0xFFFF) + { + if ((c.u_surrogate == 0) &amp;&amp; (0xD7FF &lt; u &amp;&amp; u &lt; 0xE000)) + { + c.u_surrogate = u; + } + else if (c.u_surrogate) + { + c.string += Ch((std::min)(c.u_surrogate, static_cast&lt;unsigned long&gt;((std::numeric_limits&lt;Ch&gt;::max)()))); + c.u_surrogate = 0; + } + + u = (std::min)(u, static_cast&lt;unsigned long&gt;((std::numeric_limits&lt;Ch&gt;::max)())); + c.string += Ch(u); + } + else // Ch is one byte - encode the given Unicode code point as UTF-8 + { + if ((c.u_surrogate == 0) &amp;&amp; (0xD7FF &lt; u &amp;&amp; u &lt; 0xE000)) + { + c.u_surrogate = u; + } + else + { + wchar_t utf16str[3] = { wchar_t(0), wchar_t(0), wchar_t(0) }; + wchar_t const *from_next = utf16str; + + Ch utf8str[5] = { Ch(0), Ch(0), Ch(0), Ch(0), Ch(0) }; + Ch * to_next = utf8str; + + std::size_t size = 0; + if (c.u_surrogate) + { + utf16str[size++] = wchar_t(c.u_surrogate); + c.u_surrogate = 0; + } + utf16str[size++] = wchar_t(u); + + std::mbstate_t state = 0; + std::codecvt_utf8_utf16&lt;wchar_t&gt; utf16_utf8; + + if (utf16_utf8.out(state, &amp;utf16str[0], &amp;utf16str[0] + size, from_next, &amp;utf8str[0], &amp;utf8str[4], to_next) != 0) + BOOST_PROPERTY_TREE_THROW(json_parser_error("write error", "", 0)); + + c.string += utf8str; + } + } } }; </pre><p> </p> </description> <category>Ticket</category> </item> <item> <author>ecotax@…</author> <pubDate>Tue, 10 Sep 2013 08:14:35 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8883#comment:6 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8883#comment:6</guid> <description> <p> It looks like the patch by Ben <a class="missing wiki">McCart</a> is a much better starting point for fixing the Boost library than the one I posted previously. </p> <p> My patch was good enough for the product I'm maintaining: we don't use the writer at all, so I didn't look at it, and for the reader we only have to deal with the occasional escaped charachter, for which assuming it's within the Basic Multilingual Plane is not a serious limitation. </p> <p> For the Boost libraries, the more complete solution is of course the better one. </p> </description> <category>Ticket</category> </item> <item> <dc:creator>anonymous</dc:creator> <pubDate>Fri, 16 Jan 2015 10:13:34 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8883#comment:7 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8883#comment:7</guid> <description> <p> So will this be ever fixed? </p> </description> <category>Ticket</category> </item> <item> <dc:creator>Sebastian Redl</dc:creator> <pubDate>Fri, 23 Jan 2015 11:09:23 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8883#comment:8 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8883#comment:8</guid> <description> <p> Yes, when I get around to making <a class="missing wiki">PropertyTree</a> actually support Unicode in any meaningful way. I have finally time to work on the library again, and given how many bug reports there are from people who think it supports Unicode, I've made this a high priority. </p> </description> <category>Ticket</category> </item> <item> <dc:creator>anonymous</dc:creator> <pubDate>Mon, 23 Mar 2015 09:46:43 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8883#comment:9 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8883#comment:9</guid> <description> <p> By when this issue will be getting fixed?? </p> </description> <category>Ticket</category> </item> <item> <dc:creator>Sebastian Redl</dc:creator> <pubDate>Tue, 07 Jul 2015 13:55:27 GMT</pubDate> <title>status changed; resolution set https://svn.boost.org/trac10/ticket/8883#comment:10 https://svn.boost.org/trac10/ticket/8883#comment:10 <ul> <li><strong>status</strong> <span class="trac-field-old">new</span> → <span class="trac-field-new">closed</span> </li> <li><strong>resolution</strong> → <span class="trac-field-new">fixed</span> </li> </ul> <p> The JSON reader rewrite has landed in master. It fully supports reading UTF-8 input files, including converting \u-encoded surrogate pairs to proper UTF-8. </p> Ticket