Opened 13 years ago

Closed 11 years ago

#3634 closed Bugs (fixed)

to_upper / to_lower incorrect for machines with signed chars

Reported by: Thomas Dorner <td-eclipse@…>
Owned by: Marshall Clow
Milestone: Boost 1.41.0
Component: algorithm
Version: Boost 1.40.0
Severity: Problem
Keywords:
Cc:

Description

I'm using an ISO8859-1 / ISO8859-15 (Latin-1 / Latin-9) character set for the source, and a string containing German umlauts is not converted correctly according to the locale configuration. The problem is reproducible on any machine where char is signed by default in C/C++, in our case Sun Solaris:

1. The conversion functions toupper and tolower of the standard C library expect an int parameter.

2. Solaris' characters are signed by default (and Sun explicitly advises against changing that; the manpage of Sun Studio 12.1's CC says about the -xchar option:

"It is strongly recommended that you never use -xchar to compile routines for any interface exported through a library. The Solaris ABI specifies type char as signed, and system libraries behave accordingly. The effect of making char unsigned has not been extensively tested with system libraries. Instead of using this option, modify your code so that it does not depend on whether type char is signed or unsigned. The sign of type char varies among compilers and operating systems.")

3. Characters with an unsigned value >= 128 (e.g. an umlaut) are passed to toupper and tolower as negative values and thus are never converted, for any locale.

An explicit static_cast to unsigned char in the calls to the corresponding standard C library functions should solve this problem.
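The cast suggested above can be sketched as follows. This is only an illustration of the idea, not the code from dirty_fix.patch or from Boost; the helper names as_ctype_arg, widened_without_cast and upper_copy_c are made up for this example.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not part of Boost): convert a char to the int value
// that the C <cctype> functions expect, i.e. one representable as
// unsigned char.
inline int as_ctype_arg(char c) {
    return static_cast<unsigned char>(c);
}

// Hypothetical helper showing the problem: a signed char converts to int
// with its sign preserved. signed char is used explicitly here so that the
// result does not depend on the platform's default for plain char.
inline int widened_without_cast(signed char c) {
    return c;  // a signed char holding -28 (bit pattern 0xE4) stays -28
}

// Upper-case a narrow string with the C library, using the current C locale.
inline std::string upper_copy_c(std::string s) {
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        s[i] = static_cast<char>(std::toupper(as_ctype_arg(s[i])));
    }
    return s;
}
```

Whether upper_copy_c then maps an umlaut correctly still depends on a suitable Latin-1 / Latin-9 locale being installed and selected with setlocale; the cast only ensures that toupper receives a value in the range it is specified for.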

Note that this may be needed for other functions as well, e.g. the classification functions. I haven't checked those so far.

Attachments (2)

testLocaleBoost.cpp (1.1 KB) - added by Thomas Dorner <td-eclipse@…> 13 years ago.
example demonstrating the problem
dirty_fix.patch (1.3 KB) - added by Thomas Dorner <td-eclipse@…> 13 years ago.
unified diff of a dirty quick-fix


Change History (10)

by Thomas Dorner <td-eclipse@…>, 13 years ago

Attachment: testLocaleBoost.cpp added

example demonstrating the problem

comment:1 by Pavol Droba, 13 years ago

Status: new → assigned

Would you by any chance be able to submit a patch that is tested on Solaris? I don't have access to such a machine, so I would be fixing blindly.

I will gladly incorporate it into the library code.

by Thomas Dorner <td-eclipse@…>, 13 years ago

Attachment: dirty_fix.patch added

unified diff of a dirty quick-fix

comment:2 by Thomas Dorner <td-eclipse@…>, 13 years ago

I don't have anything you would want to use: my only fix completely breaks the locale parameter of the function call. I've attached the patch anyway; maybe you could give me a hint about what to try instead. (I'm a bit lost among all those facet templates. ;-)

comment:3 by Pavol Droba, 13 years ago

Looking at the problem again, I think that the problem actually lies in Solaris's C++ locales. There is nothing wrong with chars being signed.

Anyway, I'll try to check it out and come up with a solution.

comment:4 by Thomas Dorner <td-eclipse@…>, 13 years ago

Thanks! If you have something ready, feel free to contact me to test it.

comment:5 by Marshall Clow, 11 years ago

Component: string_algo → algorithm
Owner: changed from Pavol Droba to Marshall Clow
Status: assigned → new

comment:6 by Marshall Clow, 11 years ago, in reply to comment:3

Replying to pavol_droba:

> Looking at the problem again, I think that the problem actually lies in Solaris's C++ locales. There is nothing wrong with chars being signed.
>
> Anyway, I'll try to check it out and come up with a solution.

I'm coming to that conclusion, too.

The C versions of tolower, et al. take an int as a parameter. The C99 standard (section 7.4, paragraph 1) says that the input to tolower needs to be representable as unsigned char, or be EOF. To me that means "no negative numbers" other than EOF. Microsoft has a page that talks about this issue, too: http://msdn.microsoft.com/en-us/library/ms245348.aspx

However, the C++ version, std::tolower, takes a char (templated) as a parameter, and I can't find any similar restriction in either the C++03 standard or the (draft) C++11 standard. To me, that means that all possible values of char (or of whatever type the function is templated upon) are allowable.
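For comparison, a minimal sketch of a call to the templated, locale-aware std::tolower overload from <locale>. It is not the change checked in for this ticket; lower_copy is a hypothetical helper written only to show that the character is passed as its own type, with no conversion to int.

```cpp
#include <locale>
#include <string>

// Hypothetical helper (not the Boost implementation): lower-case a string
// through the locale-aware std::tolower overload declared in <locale>.
template <typename CharT>
std::basic_string<CharT> lower_copy(std::basic_string<CharT> s,
                                    const std::locale& loc = std::locale()) {
    typedef typename std::basic_string<CharT>::size_type size_type;
    for (size_type i = 0; i < s.size(); ++i) {
        // The character is passed as CharT; the conversion is delegated to
        // the std::ctype<CharT> facet of loc.
        s[i] = std::tolower(s[i], loc);
    }
    return s;
}
```

How characters outside ASCII are mapped depends on the std::ctype facet of the locale passed in, which is where the comments above suspect the Solaris behaviour comes from.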

comment:7 by Marshall Clow, 11 years ago

I just checked in a fix in [76435]; I will merge it to release after the test cycle.

comment:8 by Marshall Clow, 11 years ago

Resolution: fixed
Status: new → closed

(In [76522]) Merge changes to release; fixes #3634
