Opened 13 years ago
Closed 11 years ago
#3634 closed Bugs (fixed)
to_upper / to_lower incorrect for machines with signed chars
Reported by: | Owned by: | Marshall Clow | |
---|---|---|---|
Milestone: | Boost 1.41.0 | Component: | algorithm |
Version: | Boost 1.40.0 | Severity: | Problem |
Keywords: | Cc: |
Description
I'm using an ISO8859-1 / ISO8859-15 (Latin-1 / Latin-9) character set for the source, and a string containing German umlauts is not converted correctly according to the locale configuration. The problem is reproducible on any machine where the default configuration of C/C++ uses signed chars, in our case Sun Solaris:
1. The conversion functions toupper and tolower of the standard C library expect an int parameter.
2. On Solaris, char is signed by default (and Sun explicitly advises against changing that; the man page for Sun Studio 12.1's CC says about the -xchar option:
"It is strongly recommended that you never use -xchar to compile routines for any interface exported through a library. The Solaris ABI specifies type char as signed, and system libraries behave accordingly. The effect of making char unsigned has not been extensively tested with system libraries. Instead of using this option, modify your code so that it does not depend on whether type char is signed or unsigned. The sign of type char varies among compilers and operating systems.")
3. Characters with an unsigned value >= 128 (e.g. an umlaut) reach toupper and tolower as negative values and thus are never converted for any locale.
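The effect described in the three points above can be made visible with a small sketch. The helper names (`promoted_plain`, `promoted_unsigned`, `char_is_signed`) are hypothetical and exist only for illustration; the byte 0xE4 is 'ä' in ISO8859-1.

```cpp
#include <limits>

// Hypothetical helper: the int value a C function such as toupper receives
// when a plain char is passed directly (ordinary integral promotion).
inline int promoted_plain(char c) { return c; }

// Hypothetical helper: the int value received when the char is first
// converted to unsigned char, which is always in the range 0..UCHAR_MAX.
inline int promoted_unsigned(char c) { return static_cast<unsigned char>(c); }

// Reports whether plain char is signed on the current platform.
inline bool char_is_signed() { return std::numeric_limits<char>::is_signed; }
```

On a platform where `char_is_signed()` is true, `promoted_plain(static_cast<char>(0xE4))` is negative, whereas `promoted_unsigned` yields 228 in either case.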
An explicit static_cast to unsigned char in the calls to the corresponding standard C library functions should solve this problem.
Note that this may be needed for other functions as well, e.g. the classification functions. I haven't checked those so far.
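A minimal sketch of the cast suggested above, applied to the C library functions from `<cctype>`. The wrapper names (`to_upper_c`, `is_alpha_c`, `to_upper_copy_c`) are hypothetical and are not part of Boost; this is not the fix that was eventually checked in, and it ignores any locale argument because the C functions use the global C locale.

```cpp
#include <cctype>
#include <string>

// Hypothetical wrapper: convert to unsigned char before calling the C
// function, so the argument is always representable as unsigned char.
inline char to_upper_c(char c) {
    return static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
}

// The same cast applied to a classification function.
inline bool is_alpha_c(char c) {
    return std::isalpha(static_cast<unsigned char>(c)) != 0;
}

// Upper-case a whole string using the wrapper above.
inline std::string to_upper_copy_c(std::string s) {
    for (char& c : s) {
        c = to_upper_c(c);
    }
    return s;
}
```

Whether a byte such as 0xE4 is actually converted still depends on the C locale in effect (for example, one selected with `std::setlocale`); in the default "C" locale only ASCII letters change.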
Attachments (2)
Change History (10)
by , 13 years ago
Attachment: | testLocaleBoost.cpp added |
---|
comment:1 by , 13 years ago
Status: | new → assigned |
---|
Would you by any chance be able to submit a patch that is tested on Solaris? I don't have access to such a machine, so I would be fixing blindly.
I will gladly incorporate it into the library code.
comment:2 by , 13 years ago
I don't have something you would like to use as I only have a fix that would completely break the locale parameter to the function call. I've attached the patch anyway, maybe you could give me an additional hint what to try instead. (I'm a bit lost between all those facet templates. ;-)
follow-up: 6 comment:3 by , 13 years ago
Looking at the problem again, I think that the problem actually lies in Solaris's C++ locales. There is nothing wrong with chars being signed.
Anyway, I'll try to check it out and come up with a solution.
comment:4 by , 13 years ago
Thanks! If you have something ready, feel free to contact me to test it.
comment:5 by , 11 years ago
Component: | string_algo → algorithm |
---|---|
Owner: | changed from | to
Status: | assigned → new |
comment:6 by , 11 years ago
Replying to pavol_droba:
Looking at the problem again, I think that the problem actually lies in Solaris's C++ locales. There is nothing wrong with chars being signed.
Anyway, I'll try to check it out and come up with a solution.
I'm coming to that conclusion, too.
The C versions of tolower et al. take an int as a parameter. The C99 standard (section 7.4.1) says that the input to tolower needs to be representable as unsigned char, or EOF. To me that means "no negative numbers". Microsoft has a page that talks about this issue, too: http://msdn.microsoft.com/en-us/library/ms245348.aspx
However, the C++ version, std::tolower, takes a char (templated) as a parameter, and I can't find any similar restriction in either the C++03 standard or the (draft) C++11 standard. To me, that means that all possible values of char (or whatever type the function is templated upon) are allowable.
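For comparison, here is a sketch of the templated C++ overload from `<locale>` discussed in this comment, called with a plain char and no cast. The function name `to_lower_loc` is hypothetical, and the sketch assumes the reading given above, namely that this overload accepts every char value.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: lower-case a string with the locale-aware overload
//   template <class charT> charT std::tolower(charT c, const std::locale& loc);
// The character is passed as char, not as int.
inline std::string to_lower_loc(std::string s,
                                const std::locale& loc = std::locale()) {
    for (char& c : s) {
        c = std::tolower(c, loc);
    }
    return s;
}
```

With `std::locale::classic()` only ASCII letters are affected; converting umlauts requires a suitable named locale to be installed on the system.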
comment:7 by , 11 years ago
I just checked in a fix in [76435], will merge to release after tests cycle.
comment:8 by , 11 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
example demonstrating the problem