3 | | '''Project Description''' |
4 | | The Boost regex library does not implement recursive regexes (e.g., using “\((\w*|(?R))*\”) in PCRE to match nested parenthesis and words) or named sub-expressions (like Perl's “?P<name>” capability). These two features are found in similar libraries such as Philip Hazel's PCRE library and Microsoft's GRETA, and developers have come to expect to be able to use them when constructing regular expressions. This stands in the way of making Boost's stated goal of providing libraries suitable for eventual standardization. |
| 2 | = About = |
| 3 | The Boost Regex library written by John Maddock is great, and used in many projects. I'm working to add a few features to Boost.Regex, and then work on performance and predictability of the library for the remainder of the summer. |
| 4 | |
| 5 | === Named capture groups === |
| 6 | Normally, in a regular expression, anything enclosed by parenthesis is "captured", and stored for later retrieval. Two things can be done with a captured group: it can be used as a back reference or recalled later as a sub match. Like all other regular expression engines, Boost.Regex does both of these using the numerical index as the lookup index. However, most modern regular expression engines also allow a capture group to be named, so that it can be referenced or recalled by a string index. |
| 7 | |
| 8 | Example: This could possibly be used to read emails line by line and print out the subject line. (the original is [http://www.boost.org/more/getting_started/unix-variants.html#link-your-program-to-a-boost-library here]) |
| 9 | {{{ |
| 10 | std::string line; |
| 11 | boost::regex pat( "^Subject: (Re: |Fw: )*(?P<subject>.*)" ); |
| 12 | |
| 13 | while (std::cin) |
| 14 | { |
| 15 | std::getline(std::cin, line); |
| 16 | boost::smatch matches; |
| 17 | if (boost::regex_match(line, matches, pat)) |
| 18 | std::cout << matches.name("subject") << std::endl; |
| 19 | } |
| 20 | }}} |
| 21 | The reason for naming the capture is not obvious in the example above, but if there are multiple capture groups, it is more natural than counting. Furthermore, in the above example, adding another capture group that captures an earlier header in the email wouldn't require changing how the capture is retrieved, whereas if the capture was referenced by index, the index would change from ```1``` to ```2```. |
| 22 | |
| 23 | The regular expression ```[.|\n]*Subject: (Re: |Fw: )*(?P<subject>.*)\n[.|\n]*(?P=subject)[.|\n]*``` would match an email that contained the subject line in the body of the email. As above, this is somewhat more readable, especially if there are multiple capture groups, and is less likely to break if the regex is changed. |
| 24 | |
| 25 | There's a pretty good overview of existing practice for all of this [http://www.regular-expressions.info/named.html here]. .NET supports named captured, but uses different syntax than everybody else, so Boost.Regex is using the more widely-used Perl syntax. |
| 26 | |
| 27 | |
| 28 | === Recursive matching in regular expressions === |
| 29 | In Perl, |
| 30 | {{{ |
| 31 | $paren = qr/ |
| 32 | \( |
| 33 | ( |
| 34 | [^()]+ # Not parens |
| 35 | | |
| 36 | (??{ $paren }) # Another balanced group (not interpolated yet) |
| 37 | )* |
| 38 | \) |
| 39 | /x; |
| 40 | }}} |
| 41 | Matches text with balanced parenthesis. The same text can be matched with the PCRE regular expression ```\(([^()]*|(?R))*\)```. Furthermore, PCRE allows weird recursive matches because it handles matching atomically. Wikipedia's example is ```/^(<(?:[^<>]+|(?3)|(?1))*>)()(!>!>!>)$/``` matching the text ```<<!>!>!>><>>!>!>!>```. |
| 42 | |
| 43 | == More == |
5 | 44 | During the summer, I plan to add these two capabilities to the Boost regex library. However, I think that I can have these features working significantly before the end of the summer, so if things go smoothly, I would very much like to work on extending the core regex matching algorithm to make its performance more predictable by using a DFA-based approach to find matches. This is the approach taken by most Unix tools, but not by most programming languages. My inspiration for this plan comes from reading the article by Russ Cox on automata-based regular expression evaluation, found at http://swtch.com/~rsc/regexp/regexp1.html/ Automata-based approaches generally significantly out-perform traditional recursive backtracking methods, but the primary reason for their use is that there are many regular expressions that require very complex branch-prediction and memoization techniques to avoid exponential running time on matching texts, and even these techniques still cannot prevent exponential blowup in all cases. However, a good DFA implementation is simple (I wrote one for my Models of Computation class) and never requires exponential running time. The reason they are not generally used is that they cannot model regular expressions that use backreferences, recursive patterns, or code callouts, and are difficult to modify to deal with assertions. |
6 | 45 | The deficiencies of a DFA regex implementation can be remedied by deciding between using an automata and using recursive backtracking when a regular expression is compiled, producing a hybrid approach that performs much better in some circumstances. This is an approach taken by GNU awk and GNU egrep, while Tcl takes it one step further and switches between backtracking and using an automata during the actual run of the regular expression. |