wiki:soc/2007/RegexRecursiveMatch

Version 2 (modified by hughwimberly, 15 years ago) ( diff )

update

About

The Boost Regex library written by John Maddock is great, and used in many projects. I'm working to add a few features to Boost.Regex, and then work on performance and predictability of the library for the remainder of the summer.

Named capture groups

Normally, in a regular expression, anything enclosed by parenthesis is "captured", and stored for later retrieval. Two things can be done with a captured group: it can be used as a back reference or recalled later as a sub match. Like all other regular expression engines, Boost.Regex does both of these using the numerical index as the lookup index. However, most modern regular expression engines also allow a capture group to be named, so that it can be referenced or recalled by a string index.

Example: This could possibly be used to read emails line by line and print out the subject line. (the original is here)

std::string line;
boost::regex pat( "^Subject: (Re: |Fw: )*(?P<subject>.*)" );

while (std::cin)
{
   std::getline(std::cin, line);
   boost::smatch matches;
   if (boost::regex_match(line, matches, pat))
      std::cout << matches.name("subject") << std::endl;
}

The reason for naming the capture is not obvious in the example above, but if there are multiple capture groups, it is more natural than counting. Furthermore, in the above example, adding another capture group that captures an earlier header in the email wouldn't require changing how the capture is retrieved, whereas if the capture was referenced by index, the index would change from 1 to 2.

The regular expression [.|\n]*Subject: (Re: |Fw: )*(?P<subject>.*)\n[.|\n]*(?P=subject)[.|\n]* would match an email that contained the subject line in the body of the email. As above, this is somewhat more readable, especially if there are multiple capture groups, and is less likely to break if the regex is changed.

There's a pretty good overview of existing practice for all of this here. .NET supports named captured, but uses different syntax than everybody else, so Boost.Regex is using the more widely-used Perl syntax.

Recursive matching in regular expressions

In Perl,

$paren = qr/
      \(
        ( 
           [^()]+  # Not parens
         | 
           (??{ $paren })  # Another balanced group (not interpolated yet)
        )*
      \)
    /x;

Matches text with balanced parenthesis. The same text can be matched with the PCRE regular expression \(([^()]*|(?R))*\). Furthermore, PCRE allows weird recursive matches because it handles matching atomically. Wikipedia's example is /^(<(?:[^<>]+|(?3)|(?1))*>)()(!>!>!>)$/ matching the text <<!>!>!>><>>!>!>!>.

More

During the summer, I plan to add these two capabilities to the Boost regex library. However, I think that I can have these features working significantly before the end of the summer, so if things go smoothly, I would very much like to work on extending the core regex matching algorithm to make its performance more predictable by using a DFA-based approach to find matches. This is the approach taken by most Unix tools, but not by most programming languages. My inspiration for this plan comes from reading the article by Russ Cox on automata-based regular expression evaluation, found at http://swtch.com/~rsc/regexp/regexp1.html/ Automata-based approaches generally significantly out-perform traditional recursive backtracking methods, but the primary reason for their use is that there are many regular expressions that require very complex branch-prediction and memoization techniques to avoid exponential running time on matching texts, and even these techniques still cannot prevent exponential blowup in all cases. However, a good DFA implementation is simple (I wrote one for my Models of Computation class) and never requires exponential running time. The reason they are not generally used is that they cannot model regular expressions that use backreferences, recursive patterns, or code callouts, and are difficult to modify to deal with assertions. The deficiencies of a DFA regex implementation can be remedied by deciding between using an automata and using recursive backtracking when a regular expression is compiled, producing a hybrid approach that performs much better in some circumstances. This is an approach taken by GNU awk and GNU egrep, while Tcl takes it one step further and switches between backtracking and using an automata during the actual run of the regular expression. Obviously, performance is very important. The fact that Perl, Ruby, Java, etc. all use recursive backtracking indicates that developers have accepted the pathological cases of recursive backtracking. However, good regular expression tutorials spend a lot of time explaining how to write regular expressions to allow them to complete before the programmer dies, but even with such guides, it can be difficult to write a regular expression that will properly search long texts. For inexperienced programmers, lazy programmers, and situations where a user can write their own regular expressions, the predictability of an automata approach can be better. I plan to extensively test my hybrid implementation to give Boost everything they need to consider whether or not the hybrid approach is a valuable change to the regex library. Even if it is not, the work performed earlier in the summer will have improved the Boost regex library, and I will be incredibly happy at the work and research I was able to accomplish.

Success Criteria

At the end of the summer, the Boost regex library should be able to handle regular expressions that use recursive patterns and regular expressions that have named sub-expressions so that matches can be retrieved directly by name. Neither of these improvements should cause the regex library to behave differently on any expressions it could handle before the changes, and the Boost regex library should continue to perform comparably to Philip Hazel's PCRE library on benchmarks using and not using these two features. Furthermore, I should have a working version of the Boost regex library that uses an automation instead of recursive backtracking when possible. This version should pass all regression tests, and should never use exponential memory or runtime on a regex unless that regex contains backreferences or recursive matching and requires exponential memory or runtime on the base version of Boost regex. This hybrid version of the Boost regex library should be accompanied by enough performance data to justify a decision on whether or not it should replace the previous version.

Roadmap I use specific dates, but I understand that these are quite rough and will change as I become more familiar with the project. I intend to only predict how much time (in approximate weeks) I should spend on each stage.

April 9 through May 27: Get to know my mentor, the Boost project in general, and the regex library specifically. Work out a plan for which library files need to be changed for each addition, and try to determine what specific problems I will have so they can be addressed early. After my school lets out, start prototyping additions.

May 12: My exams are over.

May 28 through June 1: Construct extensive regression and unit tests for the recursive patterns and named sub-expression additions. Perform benchmarking and performance testing for use on all stages.

June 2 through June 22: Extend regex library to allow recursive matching.

June 23 through July 13: Extend regex library to allow named sub-expressions.

July 14 through July 28: Build DFA-based regex matching engine.

July 29 through August 17: Incorporate DFA-based matching into regex library, building hybrid regex library.

August 21: My school starts back up.

Note: See TracWiki for help on using the wiki.