Boost C++ Libraries: Ticket #8304: Regex not matching case in character ranges if collate flag specified

component changed; owner set

viboes — Wed, 20 Mar 2013 21:42:02 GMT

owner set to John Maddock
component None → regex

status changed; resolution set

John Maddock — Thu, 21 Mar 2013 17:13:08 GMT

status new → closed
resolution → invalid

I suspect the MS implementation ignores that flag completely (haven't looked though).

Look at it this way: if you expect it to have no effect, why are you setting it? Its purpose as documented is, in this case, to match any character which collates between 'A' and 'Z'. The fact is that all of the characters in your test string will collate inside that range in most Win32 locales.

In any event, as soon as you set that flag you get implementation-, platform-, and locale-specific behavior. In short: use with extreme caution!
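To make "collates between 'A' and 'Z'" concrete, here is a minimal sketch using the standard std::collate facet. The helper name collates_between is hypothetical, and this is not claimed to be how Boost.Regex implements the collate flag; it only illustrates that the answer depends on the locale passed in.

```cpp
#include <locale>

// True if `ch` collates between `lo` and `hi` (inclusive) in locale `loc`.
// Illustrative helper only; the name is made up for this example.
bool collates_between(const std::locale& loc, char lo, char ch, char hi)
{
    const std::collate<char>& coll = std::use_facet<std::collate<char> >(loc);
    // compare() returns a negative, zero or positive value, like strcmp.
    return coll.compare(&lo, &lo + 1, &ch, &ch + 1) <= 0
        && coll.compare(&ch, &ch + 1, &hi, &hi + 1) <= 0;
}
```

With std::locale::classic(), collates_between(std::locale::classic(), 'A', 'M', 'Z') is true and collates_between(std::locale::classic(), 'A', 'b', 'Z') is false, because the classic locale compares by character code. Under a named locale (if one is installed on the system) the result for 'b' may well differ, which is the platform- and locale-specific behavior described above.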

dave@… — Thu, 21 Mar 2013 18:15:42 GMT

Isn't case insensitive collation different from case sensitive collation?

Your comment seems to suggest that collation is an odd or unusual thing. We use the regex engine to allow users to search through text and they expect to be able to use '[a-z]+' and match against 'Années'. I guess we could add a 'Foreign language support' option to support both cases, I just wanted to make sure that the Boost implementation was 'correct'.

Thanks.

P.S. The MS implementation does indeed seem to ignore the collation flag altogether.

John Maddock — Thu, 21 Mar 2013 18:46:03 GMT

> Isn't case insensitive collation different from case sensitive collation?

Yes, since this is wide characters on Win32, it's basically Unicode collation: http://www.unicode.org/reports/tr10/#Multi_Level_Comparison which is based on "levels".

So character shape comes first, then accent, then case, then some other differences. That means that 'a', 'à' and 'A' collate next to each other, followed by 'b' and 'B', if case sensitivity is on; if matching is case insensitive, then 'a' and 'A' are treated as equivalent (i.e. they collate the same), but 'à' is still separate.
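The "levels" idea can be sketched with a toy key compared level by level. Everything here (ToyKey, toy_key, in_toy_range, and the tiny character table) is hypothetical and covers only ASCII letters plus 'à' and 'é'; it is neither the Unicode Collation Algorithm nor what Win32 or Boost.Regex actually does.

```cpp
#include <tuple>

// Toy three-level key: base letter first, then accent, then case.
struct ToyKey { int base; int accent; int upper; };

bool operator<(const ToyKey& a, const ToyKey& b)
{
    // Lexicographic comparison: a later level only matters on a tie.
    return std::tie(a.base, a.accent, a.upper)
         < std::tie(b.base, b.accent, b.upper);
}

// Hypothetical table: ASCII letters plus U+00E0 (à) and U+00E9 (é).
ToyKey toy_key(char32_t c)
{
    if (c >= U'a' && c <= U'z') return {int(c - U'a'), 0, 0};
    if (c >= U'A' && c <= U'Z') return {int(c - U'A'), 0, 1};
    if (c == U'\u00E0') return {0, 1, 0};  // a with grave accent
    if (c == U'\u00E9') return {4, 1, 0};  // e with acute accent
    return {1000, 0, 0};                   // everything else sorts last
}

// True if c falls within [lo, hi] under the toy ordering.
bool in_toy_range(char32_t lo, char32_t c, char32_t hi)
{
    return !(toy_key(c) < toy_key(lo)) && !(toy_key(hi) < toy_key(c));
}
```

Under this toy ordering 'a', 'A' and 'à' all sort before 'b', and every character of 'Années' falls inside the range 'a' to 'z', whereas a plain code-point comparison would reject both 'A' and 'é'. Dropping the third field from the comparison would model the case-insensitive situation, where 'a' and 'A' compare equal but 'à' stays distinct.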

> Your comment seems to suggest that collation is an odd or unusual thing. We use the regex engine to allow users to search through text and they expect to be able to use '[a-z]+' and match against 'Années'. I guess we could add a 'Foreign language support' option to support both cases, I just wanted to make sure that the Boost implementation was 'correct'.

Sigh... I see where you're coming from (and why your users would want that), but regex would have to implement its own collation algorithm to support that. You could probably do that yourself, actually, by using a custom traits class?