Boost C++ Libraries: Ticket #8883: property_tree JSON reader does not parse unicode characters properly

attachment set

ecotax@… — Thu, 29 Aug 2013 14:56:47 GMT

attachment → json_parser_diff.txt

property_tree json_parser_read UTF-8 patch

ecotax@… — Thu, 29 Aug 2013 15:02:44 GMT

I also encountered problems with parsing Unicode characters. Specifically, when using the normal (non-wide-string) version of property_tree, parsing JSON containing explicitly encoded Unicode characters, using \uXXXX, characters above 0x7F, for example \u2026 (horizontal ellipsis) are not converted to UTF-8 (giving E2 80 A6). I just added a patch that will do so. I'm not sure if this is a good thing in general, and if so, whether the patch as is is production-ready, but maybe it's helpful in some way.

Lettort — Wed, 04 Sep 2013 08:11:26 GMT

Hello,

maybe I am asking about basic unicode knowledge. Even than I would be happy to get a link or some explanation to the following question:

The description says: "...'ä' in UTF-8 will become \u00C3\u00A4 ..."

Why is 'ä' not becoming /u00E4? The "/u"-pair is no surrogate pair, isn't it? So what's the meaning with it?

Why am I asking? My problem is, that I am generating a JSON file which will be read via JQuery to be shown in a web page. In that case, this pair will be shown as two separated unicode symbols.

ecotax@… — Wed, 04 Sep 2013 09:05:41 GMT

@Lettort: There is a difference betweeen Unicode, specifying 'ä' maps to code point E4, and the various ways to encode this code point in bits or bytes. There is UTF-16, encoding this as 00E4 (16 bits, fits in a wide char), but also UTF-8, encoding this as two bytes, C3 A4. When parsing a /u00E4, the correct way to handle this depends on what encoding you want for your string. If you have a wide string and expect UTF-16, then yes, you'd expect the wide char 00E4. If you have a regular string and expect UTF-8, you'd expect the two bytes C3 A4.

The original bug report states that first writing and then reading 'ä', the writer (defensively?) writes this using two \u encoded characters, each being one byte of the UTF-8 encoding. Regardless if this is the best choice or not, you'd want the reader to handle this in such a way that it 'round-trips' as much as possible, which currently is not the case.

BTW, For future questions/discussions, I guess a site like stackoverflow.com is more appropriate.

Mon, 09 Sep 2013 20:23:35 GMT

After inspecting /detail/json_parser_write.hpp and /detail/json_parser_read.hpp it has become apparent to me that the current implementation is completely deficient concerning Unicode (both UTF8 and UTF16). The writer creates a Unicode escape sequence block (/uXXXX) for each incoming 'C' char in the UTF8 code sequence, which is incorrect. Each escaped Unicode character is to be represented as 16-bit hex values of UTF16 single or surrogate pairs (see RFC4627 section 2.5 http://tools.ietf.org/html/rfc4627). In addition, the existing implementation will only encode UCS2 correctly for 16 bit wchar_t strings as it isn't doing anything to correctly handle surrogate pairs either in the writer or the reader.

I have a working implementation which I'm sure does not meet Boost coding guidelines, but may be adaptable with use of Boost Locale (I am using codecvt). I couldn't figure out how to add an attachment so I'm including the diffs inline bellow (The UTF8 implementation has been tested some, the UTF16 portion hasn't. It is ugly, but it works.):

Index: json_parser_read.hpp
===================================================================
--- json_parser_read.hpp	(revision 85628)
+++ json_parser_read.hpp	(working copy)
@@ -22,6 +22,7 @@
 #include <istream>
 #include <vector>
 #include <algorithm>
+#include <codecvt>
 namespace boost { namespace property_tree { namespace json_parser
 {
@@ -41,7 +42,10 @@
         Str name;
         Ptree root;
         std::vector<Ptree *> stack;
+        unsigned long u_surrogate;
+        context() : u_surrogate(0) {}
+
         struct a_object_s
         {
             context &c;
@@ -146,8 +150,46 @@
             a_unicode(context &c): c(c) { }
             void operator()(unsigned long u) const
             {
-                u = (std::min)(u, static_cast<unsigned long>((std::numeric_limits<Ch>::max)()));
-                c.string += Ch(u);
+                typedef typename make_unsigned<Ch>::type UCh;
+                if (long(std::numeric_limits<UCh>::max()) >= 0xFFFF)
+                {
+                    if (c.u_surrogate)
+                        c.string += Ch((std::min)(c.u_surrogate, static_cast<unsigned long>((std::numeric_limits<Ch>::max)())));
+
+                    u = (std::min)(u, static_cast<unsigned long>((std::numeric_limits<Ch>::max)()));
+                    c.string += Ch(u);
+                }
+                else // Ch is one byte - encode the given Unicode code point as UTF-8
+                {
+                    if ((c.u_surrogate == 0) && (0xD7FF < u && u < 0xE000))
+                    {
+                        c.u_surrogate = u;
+                    }
+                    else
+                    {
+                        wchar_t utf16str[3] = { wchar_t(0), wchar_t(0), wchar_t(0) };
+                        wchar_t const *from_next = utf16str;
+
+                        Ch utf8str[5] = { Ch(0), Ch(0), Ch(0), Ch(0), Ch(0) };
+                        Ch * to_next = utf8str;
+
+                        std::size_t size = 0;
+                        if (c.u_surrogate)
+                        {
+                            utf16str[size++] = wchar_t(c.u_surrogate);
+                            c.u_surrogate = 0;
+                        }
+                        utf16str[size++] = wchar_t(u);
+
+                        std::mbstate_t state = 0;
+                        std::codecvt_utf8_utf16<wchar_t> utf16_utf8;
+
+                        if (utf16_utf8.out(state, &utf16str[0], &utf16str[0] + size, from_next, &utf8str[0], &utf8str[4], to_next) != 0)
+                            BOOST_PROPERTY_TREE_THROW(json_parser_error("write error", "", 0));
+
+                        c.string += utf8str;
+                    }
+                }
             }
         };

Index: json_parser_read.hpp
===================================================================
--- json_parser_read.hpp	(revision 85628)
+++ json_parser_read.hpp	(working copy)
@@ -22,6 +22,7 @@
 #include <istream>
 #include <vector>
 #include <algorithm>
+#include <codecvt>
 namespace boost { namespace property_tree { namespace json_parser
 {
@@ -41,7 +42,10 @@
         Str name;
         Ptree root;
         std::vector<Ptree *> stack;
+        unsigned long u_surrogate;
+        context() : u_surrogate(0) {}
+
         struct a_object_s
         {
             context &c;
@@ -146,8 +150,46 @@
             a_unicode(context &c): c(c) { }
             void operator()(unsigned long u) const
             {
-                u = (std::min)(u, static_cast<unsigned long>((std::numeric_limits<Ch>::max)()));
-                c.string += Ch(u);
+                typedef typename make_unsigned<Ch>::type UCh;
+                if (long(std::numeric_limits<UCh>::max()) >= 0xFFFF)
+                {
+                    if (c.u_surrogate)
+                        c.string += Ch((std::min)(c.u_surrogate, static_cast<unsigned long>((std::numeric_limits<Ch>::max)())));
+
+                    u = (std::min)(u, static_cast<unsigned long>((std::numeric_limits<Ch>::max)()));
+                    c.string += Ch(u);
+                }
+                else // Ch is one byte - encode the given Unicode code point as UTF-8
+                {
+                    if ((c.u_surrogate == 0) && (0xD7FF < u && u < 0xE000))
+                    {
+                        c.u_surrogate = u;
+                    }
+                    else
+                    {
+                        wchar_t utf16str[3] = { wchar_t(0), wchar_t(0), wchar_t(0) };
+                        wchar_t const *from_next = utf16str;
+
+                        Ch utf8str[5] = { Ch(0), Ch(0), Ch(0), Ch(0), Ch(0) };
+                        Ch * to_next = utf8str;
+
+                        std::size_t size = 0;
+                        if (c.u_surrogate)
+                        {
+                            utf16str[size++] = wchar_t(c.u_surrogate);
+                            c.u_surrogate = 0;
+                        }
+                        utf16str[size++] = wchar_t(u);
+
+                        std::mbstate_t state = 0;
+                        std::codecvt_utf8_utf16<wchar_t> utf16_utf8;
+
+                        if (utf16_utf8.out(state, &utf16str[0], &utf16str[0] + size, from_next, &utf8str[0], &utf8str[4], to_next) != 0)
+                            BOOST_PROPERTY_TREE_THROW(json_parser_error("write error", "", 0));
+
+                        c.string += utf8str;
+                    }
+                }
             }
         };

Mon, 09 Sep 2013 21:06:21 GMT

I accidentally pasted the reader diff twice rather than pasting the reader and the writer. In addition, a bug in the UTF16 reader has been fixed. (How embarrassing)

Index: json_parser_write.hpp
===================================================================
--- json_parser_write.hpp	(revision 85628)
+++ json_parser_write.hpp	(working copy)
@@ -10,12 +10,15 @@
 #ifndef BOOST_PROPERTY_TREE_DETAIL_JSON_PARSER_WRITE_HPP_INCLUDED
 #define BOOST_PROPERTY_TREE_DETAIL_JSON_PARSER_WRITE_HPP_INCLUDED
+#include <boost/cstdint.hpp>
 #include <boost/property_tree/ptree.hpp>
 #include <boost/next_prior.hpp>
 #include <boost/type_traits/make_unsigned.hpp>
 #include <string>
 #include <ostream>
+#include <sstream>
 #include <iomanip>
+#include <codecvt>
 namespace boost { namespace property_tree { namespace json_parser
 {
@@ -33,7 +36,7 @@
             // We escape everything outside ASCII, because this code can't
             // handle high unicode characters.
             if (*b == 0x20 || *b == 0x21 || (*b >= 0x23 && *b <= 0x2E) ||
-                (*b >= 0x30 && *b <= 0x5B) || (*b >= 0x5D && *b <= 0xFF))
+                (*b >= 0x30 && *b <= 0x5B) || (*b >= 0x5D && *b <= 0x80))
                 result += *b;
             else if (*b == Ch('\b')) result += Ch('\\'), result += Ch('b');
             else if (*b == Ch('\f')) result += Ch('\\'), result += Ch('f');
@@ -44,18 +47,59 @@
             else if (*b == Ch('\\')) result += Ch('\\'), result += Ch('\\');
             else
             {
-                const char *hexdigits = "0123456789ABCDEF";
+                std::ostringstream oss;
                 typedef typename make_unsigned<Ch>::type UCh;
-                unsigned long u = (std::min)(static_cast<unsigned long>(
-                                                 static_cast<UCh>(*b)),
-                                             0xFFFFul);
-                int d1 = u / 4096; u -= d1 * 4096;
-                int d2 = u / 256; u -= d2 * 256;
-                int d3 = u / 16; u -= d3 * 16;
-                int d4 = u;
-                result += Ch('\\'); result += Ch('u');
-                result += Ch(hexdigits[d1]); result += Ch(hexdigits[d2]);
-                result += Ch(hexdigits[d3]); result += Ch(hexdigits[d4]);
+                if (long(std::numeric_limits<UCh>::max()) >= 0xFFFF)
+                {
+                    // Assume UTF16.
+                    oss << "\\u" << std::setw(4) << std::setfill('0') << std::hex << std::uppercase << uint16_t(*b);
+                    if ((0xD7FF < uint16_t(*b)) && (uint16_t(*b) < 0xE000))
+                    {
+                        // Add second 16 bit value for surrogat6e pair
+                        if (e-b == 1)
+                            BOOST_PROPERTY_TREE_THROW(json_parser_error("write error", "", 0));
+
+                        ++b;
+                        oss << "\\u" << std::setw(4) << std::setfill('0') << std::hex << std::uppercase << uint16_t(*b);
+                    }
+
+                    result += oss.str();
+                }
+                else
+                {
+                    // Assume UTF8
+                    std::mbstate_t state = 0;
+                    std::codecvt_utf8_utf16<wchar_t> utf16_utf8;
+
+                    std::size_t in_max =  std::min(e-b, 4);;
+                    // Determine how many 'C' chars will be consumed to constuct a single UTF8 codepoint - this is
+                    // required as codecvt.in will fail if it begins another code point that can't be completed
+                    // because the 'end' of input sequence is reached...
+                    if ((*b & uint8_t(0xE0)) == uint8_t(0xC0))
+                        in_max =  std::min(e-b, 2);
+                    else if ((*b & uint8_t(0xF0)) == uint8_t(0xE0))
+                        in_max =  std::min(e-b, 3);
+                    else if ((*b & uint8_t(0xF8)) == uint8_t(0xF0))
+                        in_max =  std::min(e-b, 4);
+                    else
+                        BOOST_PROPERTY_TREE_THROW(json_parser_error("write error", "", 0));
+
+                    Ch const *from_next = &*b;
+                    wchar_t utf16str[2] = { wchar_t(0), wchar_t(0) };
+                    wchar_t *to_next = utf16str;
+                    if (utf16_utf8.in(state, &*b, (&*b) + in_max, from_next, utf16str, utf16str + 2, to_next) != 0)
+                        BOOST_PROPERTY_TREE_THROW(json_parser_error("write error", "", 0));
+
+                    oss << "\\u" << std::setw(4) << std::setfill('0') << std::hex << std::uppercase << uint16_t(utf16str[0]);
+                    if ( to_next - utf16str == 2)
+                    {
+                        // Add second 16 bit value for surrogat6e pair
+                        oss << "\\u" << std::setw(4) << std::setfill('0') << std::hex << std::uppercase << uint16_t(utf16str[1]);
+                    }
+                    result += oss.str();
+
+                    b += ((from_next - &*b) - std::size_t(1)); // Additinal incrementing for additional UTF8 chars consumed.
+                }
             }
             ++b;
         }

Index: json_parser_read.hpp
===================================================================
--- json_parser_read.hpp	(revision 85628)
+++ json_parser_read.hpp	(working copy)
@@ -22,6 +22,7 @@
 #include <istream>
 #include <vector>
 #include <algorithm>
+#include <codecvt>
 namespace boost { namespace property_tree { namespace json_parser
 {
@@ -41,7 +42,10 @@
         Str name;
         Ptree root;
         std::vector<Ptree *> stack;
+        unsigned long u_surrogate;
+        context() : u_surrogate(0) {}
+
         struct a_object_s
         {
             context &c;
@@ -146,8 +150,53 @@
             a_unicode(context &c): c(c) { }
             void operator()(unsigned long u) const
             {
-                u = (std::min)(u, static_cast<unsigned long>((std::numeric_limits<Ch>::max)()));
-                c.string += Ch(u);
+                typedef typename make_unsigned<Ch>::type UCh;
+                if (long(std::numeric_limits<UCh>::max()) >= 0xFFFF)
+                {
+                    if ((c.u_surrogate == 0) && (0xD7FF < u && u < 0xE000))
+                    {
+                        c.u_surrogate = u;
+                    }
+                    else if (c.u_surrogate)
+                    {
+                        c.string += Ch((std::min)(c.u_surrogate, static_cast<unsigned long>((std::numeric_limits<Ch>::max)())));
+                        c.u_surrogate = 0;
+                    }
+
+                    u = (std::min)(u, static_cast<unsigned long>((std::numeric_limits<Ch>::max)()));
+                    c.string += Ch(u);
+                }
+                else // Ch is one byte - encode the given Unicode code point as UTF-8
+                {
+                    if ((c.u_surrogate == 0) && (0xD7FF < u && u < 0xE000))
+                    {
+                        c.u_surrogate = u;
+                    }
+                    else
+                    {
+                        wchar_t utf16str[3] = { wchar_t(0), wchar_t(0), wchar_t(0) };
+                        wchar_t const *from_next = utf16str;
+
+                        Ch utf8str[5] = { Ch(0), Ch(0), Ch(0), Ch(0), Ch(0) };
+                        Ch * to_next = utf8str;
+
+                        std::size_t size = 0;
+                        if (c.u_surrogate)
+                        {
+                            utf16str[size++] = wchar_t(c.u_surrogate);
+                            c.u_surrogate = 0;
+                        }
+                        utf16str[size++] = wchar_t(u);
+
+                        std::mbstate_t state = 0;
+                        std::codecvt_utf8_utf16<wchar_t> utf16_utf8;
+
+                        if (utf16_utf8.out(state, &utf16str[0], &utf16str[0] + size, from_next, &utf8str[0], &utf8str[4], to_next) != 0)
+                            BOOST_PROPERTY_TREE_THROW(json_parser_error("write error", "", 0));
+
+                        c.string += utf8str;
+                    }
+                }
             }
         };

ecotax@… — Tue, 10 Sep 2013 08:14:35 GMT

It looks like the patch by Ben McCart is a much better starting point for fixing the Boost library than the one I posted previously.

My patch was good enough for the product I'm maintaining: we don't use the writer at all, so I didn't look at it, and for the reader we only have to deal with the occasional escaped charachter, for which assuming it's within the Basic Multilingual Plane is not a serious limitation.

For the Boost libraries, the more complete solution is of course the better one.

anonymous — Fri, 16 Jan 2015 10:13:34 GMT

So will this be ever fixed?

Sebastian Redl — Fri, 23 Jan 2015 11:09:23 GMT

Yes, when I get around to making PropertyTree actually support Unicode in any meaningful way. I have finally time to work on the library again, and given how many bug reports there are from people who think it supports Unicode, I've made this a high priority.

anonymous — Mon, 23 Mar 2015 09:46:43 GMT

By when this issue will be getting fixed??

status changed; resolution set

Sebastian Redl — Tue, 07 Jul 2015 13:55:27 GMT

status new → closed
resolution → fixed

The JSON reader rewrite has landed in master. It fully supports reading UTF-8 input files, including converting \u-encoded surrogate pairs to proper UTF-8.