wiki:soc/2007/RegexRecursiveMatch

About

The Boost Regex library written by John Maddock is great, and used in many projects. I'm working to add a few features to Boost.Regex, and then work on performance and predictability of the library for the remainder of the summer.

Named capture groups

Update: The prototype for named capture groups has been finished. It will be modified to properly deal with ICU, and any other issues that come up, but seems to work well. The syntax will probably not change.

Normally, in a regular expression, anything enclosed by parenthesis is "captured", and stored for later retrieval. Two things can be done with a captured group: it can be used as a back reference or recalled later as a sub match. Like all other regular expression engines, Boost.Regex does both of these using the numerical index as the lookup index. However, most modern regular expression engines also allow a capture group to be named, so that it can be referenced or recalled by a string index.

Example: This could be used to read emails line by line and print out the subject line. (the original is here)

std::string line;
boost::regex pat( "^Subject: (Re: |Fw: )*(?P<subject>.*)" );

while (std::cin)
{
   std::getline(std::cin, line);
   boost::smatch matches;
   if (boost::regex_match(line, matches, pat))
      std::cout << matches["subject"] << std::endl;
}

The reason for naming the capture is not obvious in the example above, but if there are multiple capture groups, it is more natural than counting. Furthermore, in the above example, adding another capture group that captures an earlier header in the email wouldn't require changing how the capture is retrieved, whereas if the capture was referenced by index, the index would change from 1 to 2.

The regular expression [.|\n]*Subject: (Re: |Fw: )*(?P<subject>.*)\n[.|\n]*(?P=subject)[.|\n]* would match an email that contained the subject line in the body of the email. As above, this is somewhat more readable, especially if there are multiple capture groups, and is less likely to break if the regex is changed.

There's a pretty good overview of existing practice for all of this here. .NET supports named captured, but uses different syntax than everybody else, so Boost.Regex is using the more widely-used Python syntax. Specifically, Boost.Regex now allows the following regular expression syntax:

(?P<name>pattern)

if pattern is a valid regular expression, then it is captured and assigned to the group named name. The group is also assigned a number that can be used to recall it or as a backreference; this number is the same as it would be if the group were simply an ordinary capture group (i.e. its number in order). At this point name may contain any character other than >, but this will probably be restricted to alphabetic or alphanumeric names. Note that a ) will prevent backreferences using this name. A name may be used only once in a given regex.

(?P=name)

is a backreference to the capture group identified by name. Fails if no previous group has such name. Currently, name can contain any character other than ), but this will be eventually restricted to the same character set allowed by named captures.

m["name"]

if m is a submatch object that has been used in the matching of a regular expression with a named capture group name, then this returns that capture group.

Recursive matches

In Perl,

$paren = qr/
      \(
        ( 
           [^()]+  # Not parens
         | 
           (??{ $paren })  # Another balanced group (not interpreted yet)
        )*
      \)
    /x;

Matches text with balanced parenthesis. The same text can be matched with the PCRE regular expression \(([^()]*|(?R))*\). This is basically the same expression without comments or spaces, but is somewhat simplified. Furthermore, PCRE allows weird recursive matches because it handles matching atomically. Wikipedia's example is /^(<(?:[^<>]+|(?3)|(?1))*>)()(!>!>!>)$/ matching the text <<!>!>!>><>>!>!>!>.

NFA-based matching

Schedule & Status

I fell pretty far behind schedule during the first half of the summer, and am unsure as to whether it will be possible to do the work I had hoped to do with DFA-based matching. However, DFA-based matching is simple, and as long as hybridizing doesn't turn out to be too difficult, then I can probably still get a hybridized NFA/DFA regex engine that uses the DFA matcher only for very simple regex constructs. Since simple regexes get used more frequently then complex regexes (I assume--I have no data on this), this could still be worthwhile.

According to my original schedule, I should have finished both named captures and recursive matching by July 13. As of then, I had only named captures working, and it still needs to be tested and reviewed for defects.

Completed:

  • Nothing is completely finished yet.

Near completed:

  • Named captures are implemented, but do not have support for ICU. They also need to be checked out for completeness and correctness, and undergo testing on more platforms.
    • recall by name (e.g., m["name"]) currently takes only one argument, of type std::basic_string<re_detail::regex_iterator_traits<BidiIterator>::sub_match<BidiIterator> >. This will probably be overloaded, but works well and will probably not have any real change.
    • there is an pointer copy between a map using std::basic_string<charT> as its key and a map using the above string type as its key. They will usually be the same, but there will be a runtime error if they aren't.
    • the character set allowed inside the regex for the capture group and the backreference should be limited, but I'm not sure by how much, or what the best way to do this is.
  • Regression tests for named captures are present, but unfinished. There doesn't seem to be an easy way to test recall, but named capture groups and named backreferences are fairly simple, and there are several tests for each.

In progress:

  • Recursive matching has just started. There is currently no working code.

Not started:

  • I haven't written any code towards a hybrid NFA/DFA matching engine.

Dates

In light of my missed deadlines, this is a proposed schedule for the remainder of the summer. Given how long everything has taken me so far, it is ambitious, and the NFA-capabilities may not get finished, but I view them as trivial compared to the goal of getting named and recursive matching a polished addition to Boost.Regex.

July 16 through July 30: Finish working on named captures, until they are a polished addition to Boost.Regex. This is intended to be background work, stretched out to give the community time to input changes, and if necessary will be stretched out even further.

July 16 through July 28: Develop a working, feature complete prototype for recursive matching.

July 30 through August 13: Polish recursive matching implementation. As above, this is background work.

July 30 through August 4: Develop rough prototype for NFA-based matching.

August 6 through August 11: Work on hybridizing NFA prototype with existing regex library.

August 13 through August 20: Test hybrid library for stability and performance, then incrementally modify hybrid library to improve the NFA capabilities as much as possible.

Last modified 15 years ago Last modified on Jul 17, 2007, 2:58:58 AM
Note: See TracWiki for help on using the wiki.