id	summary	reporter	owner	description	type	status	milestone	component	version	severity	resolution	keywords	cc
12076	A couple issues matching with unicode regular expressions (word delimiters, brackets)	anonymous	John Maddock	"Hi,

The [https://github.com/mawww/kakoune/ kakoune] code editor uses Boost.Regex to search through files with regular expressions, and I've run into some issues that I think are related to how Boost handles Unicode code points.

The syntax used is the Perl one.

First, the `\b` word delimiter doesn't seem to work when Unicode characters are involved: some strings that should match do not, e.g. ""abc” 123"" with the pattern ""”\b"".

Secondly, the ""."" pattern seems to match individual bytes rather than entire code points in strings that contain Unicode characters, e.g. matching ""”"" against the pattern ""."" selects two bytes.

Finally, bracket expressions that contain Unicode characters do not work, for example ""[”“]"". This issue is probably related to the one above.
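
To illustrate the second and third points outside of Boost, here is a minimal Python sketch. It rests on an assumption of mine, namely that the behaviour comes from the pattern and the text being matched as UTF-8 bytes rather than as code points; it is not Boost code, and the variable names are only illustrative.

```python
import re

# The sample text from this report, written with escapes: 'abc” 123'.
text = 'abc\u201d 123'
raw = text.encode('utf-8')  # the same text as UTF-8 bytes

# Byte-oriented matching: '.' consumes one byte of the multi-byte quote.
byte_hit = re.search(b'.', '\u201d'.encode('utf-8')).group()

# Code-point matching: '.' consumes the whole character.
char_hit = re.search('.', '\u201d').group()

# A bracket expression built from the UTF-8 encoding of [”“] is really
# a set of individual bytes, so every match is a single byte.
byte_class = re.findall(b'[' + '\u201d\u201c'.encode('utf-8') + b']', raw)

# The same bracket expression over code points matches the character once.
char_class = re.findall('[\u201d\u201c]', text)

print(byte_hit, char_hit, byte_class, char_class)
```

If that assumption holds, the fix would be on the matching side (decoding to code points before matching) rather than in the patterns themselves.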

I have had a look at the documentation, namely the [http://www.boost.org/doc/libs/1_60_0/libs/regex/doc/html/boost_regex/unicode.html Unicode & Boost.Regex] and [http://www.boost.org/doc/libs/1_60_0/libs/regex/doc/html/boost_regex/syntax/character_classes/optional_char_class_names.html Character classes supported by Unicode regular expressions] pages, but I'm not sure whether they are related to the issues above (please let me know if I missed something).

Thanks."	Bugs	closed	To Be Determined	regex	Boost 1.61.0	Problem	invalid