Opened 10 years ago

Closed 10 years ago

Last modified 10 years ago

#8304 closed Bugs (invalid)

Regex not matching case in character ranges if collate flag specified

Reported by: dave@…
Owned by: John Maddock
Milestone: To Be Determined
Component: regex
Version: Boost 1.51.0
Severity: Problem
Keywords:
Cc:

Description

If the regex::collate option is specified, character ranges in regular expressions no longer seem to respect case, e.g.

#include <boost/regex.hpp>
boost::wregex test( L"res[A-Z]+", boost::regex::collate );
bool bMatch = boost::regex_search( L"resource", test ); // reported: true with collate, false without

I would not expect the collate flag to change the result of the code above. However, if the collate flag is specified, bMatch is set to true, and without it the result is false.

Similar code using Visual Studio 2010 and the MS regex implementation works as I expect, i.e. the collate flag does not affect the result, e.g.

#include <regex>
std::tr1::wregex test( L"res[A-Z]+", std::tr1::regex_constants::collate );
bool bMatch = std::tr1::regex_search( L"resource", test ); // reported: false with or without collate

Change History (4)

comment:1 by viboes, 10 years ago

Component: None → regex
Owner: set to John Maddock

comment:2 by John Maddock, 10 years ago

Resolution: invalid
Status: new → closed

I suspect the MS implementation ignores that flag completely (haven't looked though).

Look at it this way: if you expect it to have no effect, why are you setting it? Its purpose as documented is, in this case, to match any character which collates between 'A' and 'Z'. The fact is that all of the characters in your test string will collate inside that range in most Win32 locales.

In any event, as soon as you set that flag you get implementation-, platform-, and locale-specific behavior. In short: use with extreme caution!
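To make the "collates between 'A' and 'Z'" rule concrete, here is a minimal sketch using only the standard std::collate facet. The helper name collates_in_range is hypothetical and is not part of Boost.Regex; how Boost actually evaluates ranges internally is not shown here. The point is only that the answer depends on the locale passed in.

```cpp
#include <locale>

// Hypothetical helper (not a Boost.Regex API): returns true when `c`
// collates between `lo` and `hi` (inclusive) according to locale `loc`.
bool collates_in_range(wchar_t c, wchar_t lo, wchar_t hi, const std::locale& loc)
{
    const std::collate<wchar_t>& coll =
        std::use_facet<std::collate<wchar_t> >(loc);

    // compare() returns a negative, zero, or positive value, using the
    // locale's ordering rather than raw code point order.
    const bool not_below = coll.compare(&lo, &lo + 1, &c, &c + 1) <= 0;
    const bool not_above = coll.compare(&c, &c + 1, &hi, &hi + 1) <= 0;
    return not_below && not_above;
}
```

With std::locale::classic() the ordering is plain code point order, so L'e' falls outside L'A'..L'Z'. With a locale that uses dictionary-style collation, the same call may well return true, which is the behavior described in the comment above.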

comment:3 by dave@…, 10 years ago

Isn't case insensitive collation different from case sensitive collation?

Your comment seems to suggest that collation is an odd or unusual thing. We use the regex engine to allow users to search through text and they expect to be able to use '[a-z]+' and match against 'Années'. I guess we could add a 'Foreign language support' option to support both cases, I just wanted to make sure that the Boost implementation was 'correct'.

Thanks.

P.S. The MS implementation does indeed seem to ignore the collation flag altogether.

comment:4 by John Maddock, 10 years ago

> Isn't case insensitive collation different from case sensitive collation?

Yes. Since these are wide characters on Win32, it's basically Unicode collation (http://www.unicode.org/reports/tr10/#Multi_Level_Comparison), which is based on "levels".

So character shape comes first, then accent, then case, then some other differences. That means that 'a', 'à' and 'A' collate next to each other, followed by 'b' and 'B', if case sensitivity is on. If matching is case insensitive, then 'a' and 'A' are treated as equivalent (i.e. they collate the same), but 'à' is still separate.
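The level-based ordering can be illustrated with a toy comparison. This is a sketch with hand-written weights for five characters only, not the Unicode Collation Algorithm and not what Win32 or Boost.Regex execute; ToyKey, toy_key and toy_compare are hypothetical names.

```cpp
// Toy three-level collation key: base letter, then accent, then case.
// The weights below are invented for illustration only.
struct ToyKey {
    int base;
    int accent;
    int letter_case;
};

ToyKey make_key(int base, int accent, int letter_case)
{
    ToyKey k = { base, accent, letter_case };
    return k;
}

ToyKey toy_key(wchar_t c)
{
    switch (c) {
        case L'a':    return make_key(0, 0, 0);
        case L'A':    return make_key(0, 0, 1);
        case L'\xE0': return make_key(0, 1, 0); // a with grave accent
        case L'b':    return make_key(1, 0, 0);
        case L'B':    return make_key(1, 0, 1);
        default:      return make_key(99, 0, 0); // anything else sorts last
    }
}

// Returns -1, 0 or 1. Levels are compared in order; the case level is
// skipped entirely when case_sensitive is false.
int toy_compare(wchar_t x, wchar_t y, bool case_sensitive)
{
    const ToyKey a = toy_key(x);
    const ToyKey b = toy_key(y);
    if (a.base != b.base)     return a.base < b.base ? -1 : 1;
    if (a.accent != b.accent) return a.accent < b.accent ? -1 : 1;
    if (case_sensitive && a.letter_case != b.letter_case)
        return a.letter_case < b.letter_case ? -1 : 1;
    return 0;
}
```

Under these assumed weights, all three forms of 'a' sort before 'b' and 'B'. Dropping the case level makes 'a' and 'A' compare equal while the accented form stays distinct.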

> Your comment seems to suggest that collation is an odd or unusual thing. We use the regex engine to allow users to search through text and they expect to be able to use '[a-z]+' and match against 'Années'. I guess we could add a 'Foreign language support' option to support both cases, I just wanted to make sure that the Boost implementation was 'correct'.

Sigh... I see where you're coming from (and why your users would want that), but regex would have to implement its own collation algorithm to support that. You could probably do that yourself, actually, by using a custom traits class.
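A custom traits class is one route. A simpler alternative, and a different technique from the one suggested above, is to fold accents out of the text before matching with an ordinary, non-collating expression. The sketch below uses std::wregex rather than Boost so that it stands alone; fold_accent, fold_accents and matches_letters_only are hypothetical names, and the folding table is hand-written and covers only a handful of Latin-1 letters.

```cpp
#include <regex>
#include <string>

// Hypothetical folding table: maps a few accented Latin-1 letters to
// their unaccented base letter and leaves everything else unchanged.
wchar_t fold_accent(wchar_t c)
{
    switch (c) {
        case L'\xE0': case L'\xE1': case L'\xE2': return L'a'; // a grave/acute/circumflex
        case L'\xE8': case L'\xE9': case L'\xEA': return L'e'; // e grave/acute/circumflex
        case L'\xC0': case L'\xC1': case L'\xC2': return L'A'; // A grave/acute/circumflex
        case L'\xC8': case L'\xC9': case L'\xCA': return L'E'; // E grave/acute/circumflex
        default: return c;
    }
}

// Folds every character; the mapping is one-to-one, so match offsets in
// the folded string line up with offsets in the original text.
std::wstring fold_accents(const std::wstring& text)
{
    std::wstring out(text);
    for (std::wstring::size_type i = 0; i < out.size(); ++i)
        out[i] = fold_accent(out[i]);
    return out;
}

// True when the whole text consists of letters once accents are folded.
bool matches_letters_only(const std::wstring& text)
{
    static const std::wregex letters(L"[A-Za-z]+");
    return std::regex_match(fold_accents(text), letters);
}
```

With this approach 'Années' is folded to 'Annees' and matches a plain letter range without the collate flag, so the result does not depend on the platform's collation order. A real deployment would need a complete folding table or a Unicode normalization library.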
