Opened 9 years ago

Closed 9 years ago

#9435 closed Bugs (wontfix)

Erroneous character set conversions of strings larger than INT32_MAX bytes

Reported by: Martin Korp <martin.korp@…> Owned by: Artyom Beilis
Milestone: To Be Determined Component: locale
Version: Boost 1.54.0 Severity: Problem
Keywords: character set conversion Cc:

Description

To internationalize our software, we use Boost.Locale together with ICU for character set conversions. During our tests we found that it is not possible to convert strings larger than INT32_MAX bytes, because icu::UnicodeString, which the functions boost::locale::conv::to_utf and boost::locale::conv::from_utf use to perform character set conversions, is limited to lengths that fit into an int32_t. Since Boost.Locale does not check whether the size of the given string exceeds this limit, the behavior of boost::locale::conv::to_utf and boost::locale::conv::from_utf is undefined for such large strings.
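For illustration, a minimal sketch of the call pattern that runs into the limit (the source charset is made up for this example; actually constructing a string this large needs a 64-bit build and several gigabytes of RAM):

{{{
#include <boost/locale/encoding.hpp>
#include <cstdint>
#include <string>

int main()
{
    // A string one byte larger than INT32_MAX, shown only to illustrate
    // the call pattern that triggers the problem.
    std::string big(static_cast<std::size_t>(INT32_MAX) + 1, 'a');

    // Boost.Locale hands the data to icu::UnicodeString, whose length is an
    // int32_t, so the behavior of this call is undefined for `big`.
    std::wstring utf = boost::locale::conv::to_utf<wchar_t>(big, "ISO-8859-1");
    return 0;
}
}}}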

PS: We have already contacted the ICU support mailing list. They told us that the UText API (http://icu-project.org/apiref/icu4c/utext_8h.html) might be able to handle strings larger than INT32_MAX bytes. Another possibility, according to the ICU support mailing list, would be to use ICU's lower-level conversion API (ucnv).

Change History (1)

comment:1 by Artyom Beilis, 9 years ago

Resolution: wontfix
Status: new → closed
  • This is a limitation of ICU.
  • It is a bad idea to convert huge chunks of text via the to_utf API, as it allocates the entire text in memory.

However, you can use the std::codecvt facet (installed into a std::locale) for stream-based conversions, which integrates with iostreams:

http://www.boost.org/doc/libs/1_55_0/libs/locale/doc/html/charset_handling.html#codecvt_codecvt

Of course it is not as simple as calling to_utf or from_utf; however, allocating a buffer of more than 2 GB for a string is not a good idea either.
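For example, a minimal sketch of the stream-based approach, assuming the input lives in a UTF-8 encoded file named big_utf8.txt (both the file name and the charset are placeholders for this example):

{{{
#include <boost/locale.hpp>
#include <fstream>
#include <string>

int main()
{
    // Build a locale whose std::codecvt facet converts between the narrow
    // on-disk encoding (UTF-8 here) and wide characters.
    boost::locale::generator gen;
    std::locale loc = gen("en_US.UTF-8");

    std::wifstream in;
    in.imbue(loc);              // imbue before opening so the facet is used
    in.open("big_utf8.txt");

    std::wstring line;
    while (std::getline(in, line)) {
        // Process each converted chunk; the whole text is never held
        // in memory at once.
    }
    return 0;
}
}}}

The conversion happens inside the codecvt facet as the stream is read, so memory use stays bounded by the stream buffer rather than by the size of the text.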

Closing this bug.
