Opened 9 years ago

Last modified 5 years ago

#9827 assigned Bugs

Missing support for some code pages (e.g. 949, 950) in Windows conversion with std backend

Reported by: hucaonju@… Owned by: Artyom Beilis
Milestone: To Be Determined Component: locale
Version: Boost 1.55.0 Severity: Problem
Keywords: locale, code page, Korean, Traditional Chinese, exception Cc:

Description

The table windows_encoding all_windows_encodings[] in wconv_codepage.ipp contains a number of code page definitions, but some are missing, such as the Korean code page (949) and the Traditional Chinese Big5 code page (950). On a Windows system that uses one of these code pages, the following code throws invalid_charset_error:

// Assuming the std backend is in use, so that ANSI encodings are supported
boost::locale::generator gen;
gen.use_ansi_encoding(true);

std::locale loc(gen(""));
// Throws invalid_charset_error on Korean Windows but works on English Windows.
// On Korean Windows the charset is looked up as "windows949", which is not in the table.
std::string us = boost::locale::conv::to_utf<char>("abcdefg", loc);

The root cause of the exception is that the generated code page string is not in the table. When the locale generator with the std backend generates a locale on Windows, it calls boost::locale::util::get_system_locale(bool use_utf8). This function builds the locale string with the following code (in default_locale.cpp):

if(GetLocaleInfoA(LOCALE_USER_DEFAULT,LOCALE_IDEFAULTANSICODEPAGE,buf,sizeof(buf))!=0) {
    if(atoi(buf)==0)
        lc_name+=".UTF-8";
    else {
        lc_name +=".windows-";
        lc_name +=buf;
    }
}

So the encoding part of lc_name is windows-(code page). On a system with the Korean (949) or Traditional Chinese (950) code page, this produces an encoding string such as "windows-949" or "windows-950". However, when wconv_from_utf::open() initializes, it searches the array all_windows_encodings[] for "windows949" or "windows950". Neither string is in the table, so open() fails and the exception is thrown.
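
The mismatch can be illustrated without Windows. The following is a minimal sketch, not the library's code: normalize_name, encoding_entry, sample_table and lookup_codepage are hypothetical names, and the normalization rule (drop non-alphanumeric characters, lowercase the rest) is an assumption inferred from the fact that "windows-949" is looked up as "windows949". The table is a two-entry excerpt for illustration only.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the library's name normalization (assumed rule):
// keep only letters and digits, lowercased.
std::string normalize_name(char const *name)
{
    assert(name != 0);
    std::string out;
    for(; *name; ++name) {
        unsigned char c = static_cast<unsigned char>(*name);
        if(std::isalnum(c))
            out += static_cast<char>(std::tolower(c));
    }
    return out;
}

struct encoding_entry {
    char const *name;
    unsigned codepage;
};

// Illustrative excerpt only, not the real all_windows_encodings[] table.
static encoding_entry const sample_table[] = {
    { "big5",        950  },
    { "windows1252", 1252 },
};

// Returns the code page for a known name, or -1 when the name is absent.
int lookup_codepage(char const *name)
{
    std::string const key = normalize_name(name);
    std::size_t const count = sizeof(sample_table) / sizeof(sample_table[0]);
    for(std::size_t i = 0; i < count; ++i) {
        if(key == sample_table[i].name)
            return static_cast<int>(sample_table[i].codepage);
    }
    return -1;
}
```

With this excerpt, lookup_codepage("windows-1252") finds an entry, while lookup_codepage("windows-949") returns -1 because no "windows949" entry exists, which corresponds to the failure in open().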

As a quick fix, I suggest adding the missing code pages to the table:

{ "cp949",      949, 0 }, // Korean
{ "uhc",        949, 0 }, // From "iconv -l"
{ "windows949", 949, 0 }, // Korean
// "big5" already in the table
{ "windows950", 950, 0 }, // TC, big5

However, the list may still be incomplete, and the same problem would occur on a system whose code page is not in the list. A more general fix would be to add the following code to the function int encoding_to_windows_codepage(char const *ccharset) in wconv_codepage.ipp:

--- E:\Build1\boost_1_55_0\libs\locale\src\encoding\wconv_codepage.ipp	2014-04-02 16:34:52.000000000 +0800
+++ E:\Build2\boost_1_55_0\libs\locale\src\encoding\wconv_codepage.ipp	2014-04-02 17:31:37.000000000 +0800
@@ -206,12 +206,18 @@
                 return ptr->codepage;
             }
             else {
                 return -1;
             }
         }
+        if(ptr==end && charset.size()>7 && charset.substr(0,7)=="windows") {
+            int cp = atoi(charset.substr(7).c_str());
+            if(IsValidCodePage(cp)) {
+                return cp;
+            }
+        }
         return -1;
         
     }
 
     template<typename CharType>
     bool validate_utf16(CharType const *str,unsigned len)

This code parses and validates the encoding string directly. One concern is that the call to IsValidCodePage may reduce performance (not tested).
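
For reference, here is a stricter, portable sketch of the same fallback idea. It is not part of the patch. parse_windows_codepage, codepage_validator and accept_cjk_sample are hypothetical names, and the validator is passed in as a function pointer so the logic can be tried outside Windows; on Windows, IsValidCodePage would play that role. Unlike the atoi-based version, it rejects names with trailing non-digit characters. The five-digit limit is my own assumption to avoid overflow.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical predicate type; on Windows, IsValidCodePage would fill this role.
typedef bool (*codepage_validator)(unsigned);

// Parses an already normalized name such as "windows949".
// Returns the code page number, or -1 if the name is not "windows" followed
// only by digits, or if the validator rejects the number.
int parse_windows_codepage(std::string const &charset, codepage_validator is_valid)
{
    assert(is_valid != 0);
    static char const prefix[] = "windows";
    std::size_t const prefix_len = sizeof(prefix) - 1;
    if(charset.size() <= prefix_len || charset.compare(0, prefix_len, prefix) != 0)
        return -1;
    // Assumption: cap the suffix at five digits so the conversion cannot overflow.
    if(charset.size() - prefix_len > 5)
        return -1;
    // Reject trailing garbage such as "windows949x", which atoi would accept.
    for(std::size_t i = prefix_len; i < charset.size(); ++i) {
        if(!std::isdigit(static_cast<unsigned char>(charset[i])))
            return -1;
    }
    unsigned long const cp = std::strtoul(charset.c_str() + prefix_len, 0, 10);
    if(cp == 0 || !is_valid(static_cast<unsigned>(cp)))
        return -1;
    return static_cast<int>(cp);
}

// Stand-in validator for the example: accepts only 949 and 950.
bool accept_cjk_sample(unsigned cp)
{
    return cp == 949 || cp == 950;
}
```

Because the fallback would only run after the table search fails, the validation call would be reached only for names that are not in the table, which may limit the performance impact (also not tested).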

Change History (1)

comment:1 by Artyom Beilis, 5 years ago

Status: new → assigned