Opened 7 years ago

Closed 7 years ago

#12076 closed Bugs (invalid)

A couple issues matching with unicode regular expressions (word delimiters, brackets)

Reported by: anonymous Owned by: John Maddock
Milestone: To Be Determined Component: regex
Version: Boost 1.61.0 Severity: Problem
Keywords: Cc:

Description

Hi,

The kakoune code editor uses boost-regex in order to search through a file using a regular expression, and I've stumbled upon some issues which I think are related to how boost handles unicode codepoints.

The syntax used is the Perl one.

First, the \b word delimiter doesn't seem to work when Unicode characters are involved: some strings that should match do not, e.g. "abc” 123" with the pattern "”\b".

Secondly, using the "." pattern on strings that contain Unicode seems to select bytes rather than entire codepoints, e.g. "”" with the pattern "." will select two bytes.

Finally, using brackets around Unicode characters does not work, for example "[”“]". This issue is probably related to the one above.
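For illustration, here is a minimal sketch of how a byte-oriented regex engine sees UTF-8 input. It uses std::regex over char as a stand-in and is an assumption for demonstration only, not the code kakoune or Boost.Regex actually runs, so exact byte counts may differ from what is reported above. The helper first_match_bytes and the constants are hypothetical names.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// UTF-8 encodings spelled out as explicit bytes, so the result does not
// depend on the encoding of the source file.
const std::string kRightQuote = "\xE2\x80\x9D"; // U+201D, right double quote
const std::string kLeftQuote  = "\xE2\x80\x9C"; // U+201C, left double quote
const std::string kEuro       = "\xE2\x82\xAC"; // U+20AC, euro sign

// Returns the number of bytes consumed by the first match of `pattern`
// in `text`, or 0 if nothing matches. std::regex over char treats every
// byte as one character; it has no notion of UTF-8 codepoints.
std::size_t first_match_bytes(const std::string& text,
                              const std::string& pattern) {
    std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(text, m, re)) {
        return 0;
    }
    return static_cast<std::size_t>(m.length(0));
}
```

With this helper, first_match_bytes(kRightQuote, ".") consumes fewer bytes than kRightQuote.size(), and the bracket pattern "[" + kRightQuote + kLeftQuote + "]" is really a set of individual bytes, so it also matches inside kEuro, which shares the lead byte 0xE2.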

I have had a look at the documentation, namely the Unicode & boost.regex / Characters classes supported by Unicode regular expressions pages, but I'm not sure if they are related to the issues above (please let me know if I missed something).

Thanks.

Change History (4)

comment:1 by anonymous, 7 years ago

Can you please post a self contained test case so I can see exactly which code you're using?

Also "”\b" against "abc” 123" should not match since there is no word boundary *after* the ” character.
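The word-boundary point can be sketched as follows. This is a hedged illustration using std::regex over char in the default "C" locale rather than Boost.Regex, and has_match is a hypothetical helper. \b only matches between a word character and a non-word character; the closing quote and the following space are both non-word characters, so there is no boundary between them.

```cpp
#include <regex>
#include <string>

// Returns true if `pattern` matches anywhere in `text`.
// Assumption: the default "C" locale, in which bytes above 0x7F are not
// classified as word characters.
bool has_match(const std::string& text, const std::string& pattern) {
    std::regex re(pattern);
    return std::regex_search(text, re);
}

// "abc” 123" with the right double quote (U+201D) written as UTF-8 bytes.
const std::string kSample = "abc\xE2\x80\x9D 123";

// The quote followed by \b: no word boundary exists after the quote,
// because the next character is a space.
const std::string kQuoteThenBoundary = "\xE2\x80\x9D\\b";

// 'c' followed by \b: a boundary does exist here, because 'c' is a word
// character and the quote that follows is not.
const std::string kLetterThenBoundary = "c\\b";
```

Here has_match(kSample, kQuoteThenBoundary) is false while has_match(kSample, kLetterThenBoundary) is true, which is consistent with the explanation in this comment.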

comment:2 by anonymous, 7 years ago

I think this issue no longer has any reason to exist: the behavior of Boost in the examples I gave is to be expected, and I just need to use ICU to get what I want.

Thanks, closing this now.

comment:3 by anonymous, 7 years ago

Actually I can't close this issue, probably because I'm not logged in. Could someone please close it?

comment:4 by John Maddock, 7 years ago

Resolution: invalid
Status: new → closed