Opened 9 years ago
Closed 9 years ago
#9435 closed Bugs (wontfix)
Erroneous character set conversions of strings with more than int32 bytes
Reported by: | | Owned by: | Artyom Beilis |
---|---|---|---|
Milestone: | To Be Determined | Component: | locale |
Version: | Boost 1.54.0 | Severity: | Problem |
Keywords: | character set conversion | Cc: |
Description
To internationalize our software, we use Boost.Locale together with ICU for character set conversions. During our tests we found that it is not possible to convert strings larger than the maximum value of int32_t (2^31 - 1 bytes), because icu::UnicodeString, which boost::locale::conv::to_utf and boost::locale::conv::from_utf use to perform character set conversions, is limited to strings of at most that size. Because Boost.Locale does not check whether the size of the given string exceeds this limit, the behavior of boost::locale::conv::to_utf and boost::locale::conv::from_utf is undefined for larger strings.
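As a minimal sketch of the missing check described above, a caller could reject oversized input before handing it to a conversion function. The helper names `fits_int32_length` and `require_int32_length` are hypothetical and are not part of Boost.Locale or ICU. The sketch also assumes that comparing the input byte count against the int32_t maximum is a sufficient guard.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Largest length that an int32_t-based length field can represent.
constexpr std::size_t kMaxInt32Length =
    static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max());

// Hypothetical helper: true if byte_count can be expressed as an int32_t.
inline bool fits_int32_length(std::size_t byte_count) noexcept {
    return byte_count <= kMaxInt32Length;
}

// Hypothetical helper: throws instead of letting an oversized string reach
// a conversion routine whose behavior would be undefined for it.
inline void require_int32_length(const std::string& input) {
    if (!fits_int32_length(input.size())) {
        throw std::length_error("input exceeds the int32_t length limit");
    }
}
```

Calling `require_int32_length(text)` before `boost::locale::conv::to_utf` or `boost::locale::conv::from_utf` would turn the undefined behavior into a catchable `std::length_error`.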
PS: We have already contacted the ICU support mailing list. They told us that the UText API (http://icu-project.org/apiref/icu4c/utext_8h.html) might be able to handle strings with more than int32_t bytes. Another possibility, according to the ICU support mailing list, would be to use ICU's lower-level conversion API (uconv).
However, you can use the std::codecvt facet for stream-based conversions, which integrates with iostreams:
http://www.boost.org/doc/libs/1_55_0/libs/locale/doc/html/charset_handling.html#codecvt_codecvt
Of course, it is not as simple as calling to_utf or from_utf; however, allocating a buffer of more than 2 GB for a string is not a good idea either.
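A rough sketch of the stream-based approach might look like the following. It uses only the standard library and takes the locale as a parameter; the assumption is that the caller supplies a locale whose codecvt facet matches the file's encoding, for example one created with boost::locale::generator as described in the linked documentation. The function name `decode_file_in_chunks` and the chunk size are hypothetical.

```cpp
#include <cstddef>
#include <fstream>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Decodes the file at `path` through the codecvt facet of `loc` and passes the
// text to `sink` in pieces of at most `chunk_chars` wide characters, so the
// whole content never has to sit in one huge buffer.
// Returns the total number of wide characters delivered to `sink`.
template <typename Sink>
std::size_t decode_file_in_chunks(const std::string& path,
                                  const std::locale& loc,
                                  Sink sink,
                                  std::size_t chunk_chars = 4096) {
    std::wifstream in;
    in.imbue(loc);  // imbue before open so the facet applies from the first byte
    in.open(path.c_str(), std::ios::in | std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }

    std::vector<wchar_t> buffer(chunk_chars);
    std::size_t total = 0;
    while (in) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::streamsize got = in.gcount();
        if (got <= 0) {
            break;  // nothing more could be decoded
        }
        sink(buffer.data(), static_cast<std::size_t>(got));
        total += static_cast<std::size_t>(got);
    }
    return total;
}
```

The sink can be any callable taking `(const wchar_t*, std::size_t)`, such as a lambda that writes each piece to another stream. Memory use then stays bounded by the chunk size rather than by the input size.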
Closing this bug.