Boost C++ Libraries: Ticket #8304: Regex not matching case in character ranges if collate flag specified https://svn.boost.org/trac10/ticket/8304
<p> If the regex::collate option is specified, regular expressions do not seem to match case in character ranges, e.g. </p>
<pre class="wiki">boost::wregex test( L"res[A-Z]+", boost::regex::collate );
bool bMatch = boost::regex_search( L"resource", test );
</pre>
<p> I would not expect the collate flag to change the result of the code above. However, if the collate flag is specified, bMatch is set to true, and without it the result is false. </p>
<p> Similar code using Visual Studio 2010 and the MS regex implementation works as I expect, i.e. the collate flag does not affect the result, e.g. </p>
<pre class="wiki">std::tr1::wregex test( L"res[A-Z]+", std::tr1::regex_constants::collate );
bool bMatch = std::tr1::regex_search( L"resource", test );
</pre>
viboes Wed, 20 Mar 2013 21:42:02 GMT component changed; owner set https://svn.boost.org/trac10/ticket/8304#comment:1
<ul> <li><strong>owner</strong> set to <span class="trac-author">John Maddock</span> </li> <li><strong>component</strong> <span class="trac-field-old">None</span> → <span class="trac-field-new">regex</span> </li> </ul> Ticket
John Maddock Thu, 21 Mar 2013 17:13:08 GMT status changed; resolution set https://svn.boost.org/trac10/ticket/8304#comment:2
<ul> <li><strong>status</strong> <span class="trac-field-old">new</span> → <span class="trac-field-new">closed</span> </li> <li><strong>resolution</strong> → <span class="trac-field-new">invalid</span> </li> </ul>
<p> I suspect the MS implementation ignores that flag completely (haven't looked though). </p>
<p> Look at it this way: if you expect it to have no effect, why are you setting it?
Its purpose, as documented, is in this case to match any character <em>which collates between 'A' and 'Z'</em>. The fact is that all of the characters in your test string <em>will</em> collate inside that range in most Win32 locales. </p>
<p> In any event, as soon as you set that flag you get implementation-, platform-, and locale-specific behavior. In short, use with extreme caution! </p> Ticket
dave@… Thu, 21 Mar 2013 18:15:42 GMT <link>https://svn.boost.org/trac10/ticket/8304#comment:3 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8304#comment:3</guid>
<description> <p> Isn't case insensitive collation different from case sensitive collation? </p>
<p> Your comment seems to suggest that collation is an odd or unusual thing. We use the regex engine to allow users to search through text and they expect to be able to use '[a-z]+' and match against 'Années'. I guess we could add a 'Foreign language support' option to support both cases, I just wanted to make sure that the Boost implementation was 'correct'. </p>
<p> Thanks. </p>
<p> P.S. The MS implementation does indeed seem to ignore the collation flag altogether. </p> </description> <category>Ticket</category> </item>
<item> <dc:creator>John Maddock</dc:creator> <pubDate>Thu, 21 Mar 2013 18:46:03 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8304#comment:4 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8304#comment:4</guid>
<description> <blockquote class="citation"> <p> Isn't case insensitive collation different from case sensitive collation? </p> </blockquote>
<p> Yes, since this is wide characters on Win32, it's basically Unicode collation: <a class="ext-link" href="http://www.unicode.org/reports/tr10/#Multi_Level_Comparison"><span class="icon">​</span>http://www.unicode.org/reports/tr10/#Multi_Level_Comparison</a> which is based on "levels". </p>
<p> So character shape first, then accent, then case, then some other differences.
That means that 'a', 'à' and 'A' collate next to each other, followed by 'b' and 'B', if case sensitivity is on, while if matching is case insensitive then 'a' and 'A' are treated as equivalent (i.e. collate the same), but 'à' is still separate. </p>
<blockquote class="citation"> <p> Your comment seems to suggest that collation is an odd or unusual thing. We use the regex engine to allow users to search through text and they expect to be able to use '[a-z]+' and match against 'Années'. I guess we could add a 'Foreign language support' option to support both cases, I just wanted to make sure that the Boost implementation was 'correct'. </p> </blockquote>
<p> Sigh... I see where you're coming from (and why your users would want that), but regex would have to implement its own collation algorithm to support that. You could probably do that yourself, actually, by using a custom traits class? </p> </description> <category>Ticket</category> </item> </channel> </rss>
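The "levels" explanation in comment 4 can be illustrated with a small self-contained sketch. It is a toy model only: `ToyKey`, `toy_key`, `in_range`, `all_in_range` and `code_point_in_range` are hypothetical names invented for this example, the key table covers just ASCII letters plus 'à' and 'é', and none of it reflects how Boost.Regex or the Win32 locale actually computes collation. It assumes keys are compared by letter shape first, then accent, then case, as described above.

```cpp
#include <cassert>
#include <string>
#include <tuple>

// Toy collation key with the three levels named in the ticket discussion.
// Hypothetical model: not Boost.Regex internals and not a real locale.
struct ToyKey {
    int base;    // level 1: letter shape (0 for a/A, 1 for b/B, ...)
    int accent;  // level 2: 0 = unaccented, 1 = accented
    int upper;   // level 3: 0 = lower case, 1 = upper case
};

// Compare level by level: shape first, then accent, then case.
inline bool key_less(const ToyKey& a, const ToyKey& b) {
    return std::tie(a.base, a.accent, a.upper) <
           std::tie(b.base, b.accent, b.upper);
}

// Look up the toy key; returns false for characters outside the tiny table.
inline bool toy_key(wchar_t c, ToyKey& out) {
    if (c >= L'a' && c <= L'z') {
        out = ToyKey{static_cast<int>(c - L'a'), 0, 0};
        return true;
    }
    if (c >= L'A' && c <= L'Z') {
        out = ToyKey{static_cast<int>(c - L'A'), 0, 1};
        return true;
    }
    if (c == L'\u00E0') { out = ToyKey{0, 1, 0}; return true; }  // a with grave
    if (c == L'\u00E9') { out = ToyKey{4, 1, 0}; return true; }  // e with acute
    return false;
}

// Range test by collation key: key(lo) <= key(c) <= key(hi).
inline bool in_range(wchar_t c, wchar_t lo, wchar_t hi) {
    ToyKey kc{}, klo{}, khi{};
    if (!toy_key(c, kc) || !toy_key(lo, klo) || !toy_key(hi, khi)) return false;
    return !key_less(kc, klo) && !key_less(khi, kc);
}

// Range test by raw code point, for contrast with the collating version.
inline bool code_point_in_range(wchar_t c, wchar_t lo, wchar_t hi) {
    return lo <= c && c <= hi;
}

// True if every character of s falls inside [lo-hi] by collation key.
inline bool all_in_range(const std::wstring& s, wchar_t lo, wchar_t hi) {
    for (wchar_t c : s) {
        if (!in_range(c, lo, hi)) return false;
    }
    return true;
}

// Usage sketch mirroring the two cases raised in the ticket.
inline void toy_demo() {
    // "resource" after the literal "res": every letter collates inside [A-Z].
    assert(all_in_range(L"ource", L'A', L'Z'));
    // By code point, lower-case 'o' is outside [A-Z].
    assert(!code_point_in_range(L'o', L'A', L'Z'));
    // "Années" against [a-z]: accented and upper-case letters fall inside.
    assert(all_in_range(L"Ann\u00E9es", L'a', L'z'));
}
```

In this toy ordering, lower-case letters sort just before their upper-case counterparts, so 'a' itself falls outside [A-Z] and 'Z' falls outside [a-z]. That edge behavior is a property of the sketch, and real locales may order the case level differently.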