Context Navigation

← Previous Change
Wiki History
Next Change →

RegexRecursiveMatch

Timestamp:: Jul 17, 2007, 2:58:58 AM (15 years ago)
Author:: hughwimberly
Comment:: schedule work

Legend:

: Unmodified
: Added
: Removed
: Modified

soc/2007/RegexRecursiveMatch

-              v4
+              v5
 The Boost Regex library written by John Maddock is great, and used in many projects. I'm working to add a few features to Boost.Regex, and then work on performance and predictability of the library for the remainder of the summer.
 === Named capture groups ===
+== Named capture groups ==
 ''Update'': The prototype for named capture groups has been finished. It will be modified to properly deal with ICU, and any other issues that come up, but seems to work well. The syntax will probably not change.
 …
 Example: This could be used to read emails line by line and print out the subject line. (the original is [http://www.boost.org/more/getting_started/unix-variants.html#link-your-program-to-a-boost-library here])
 {{{
+#!cpp
 std::string line;
 boost::regex pat( "^Subject: (Re: |Fw: )*(?P<subject>.*)" );
 …
 The reason for naming the capture is not obvious in the example above, but if there are multiple capture groups, it is more natural than counting. Furthermore, in the above example, adding another capture group that captures an earlier header in the email wouldn't require changing how the capture is retrieved, whereas if the capture was referenced by index, the index would change from ```1``` to ```2```.
 The regular expression ```[.|\n]*Subject: (Re: |Fw: )*(?P<subject>.*)\n[.|\n]*(?P=subject)[.|\n]*``` would match an email that contained the subject line in the body of the email. As above, this is somewhat more readable, especially if there are multiple capture groups, and is less likely to break if the regex is changed.
+The regular expression `[.|\n]*Subject: (Re: |Fw: )*(?P<subject>.*)\n[.|\n]*(?P=subject)[.|\n]*` would match an email that contained the subject line in the body of the email. As above, this is somewhat more readable, especially if there are multiple capture groups, and is less likely to break if the regex is changed.
 There's a pretty good overview of existing practice for all of this [http://www.regular-expressions.info/named.html here]. .NET supports named captured, but uses different syntax than everybody else, so Boost.Regex is using the more widely-used Python syntax. Specifically, Boost.Regex now allows the following regular expression syntax:
 ```(?P<name>pattern)```
   if ''pattern'' is a valid regular expression, then it is captured and assigned to the group named ''name''. The group is also assigned a number that can be used to recall it or as a backreference; this number is the same as it would be if the group were simply an ordinary capture group (i.e. its number in order). At this point ''name'' may contain any character other than ```>```, but this will probably be restricted to alphabetic or alphanumeric names. Note that a ```)``` will prevent backreferences using this name. A name may be used only once in a given regex.
+`(?P<name>pattern)`
+  if `pattern` is a valid regular expression, then it is captured and assigned to the group named `name`. The group is also assigned a number that can be used to recall it or as a backreference; this number is the same as it would be if the group were simply an ordinary capture group (i.e. its number in order). At this point `name` may contain any character other than `>`, but this will probably be restricted to alphabetic or alphanumeric names. Note that a `)` will prevent backreferences using this name. A name may be used only once in a given regex.
 ```(?P=name)```
   is a backreference to the capture group identified by ```name```. Fails if no previous group has such name. Currently, ```name``` can contain any character other than ```)```, but this will be eventually restricted to the same character set allowed by named captures.
+`(?P=name)`
+  is a backreference to the capture group identified by `name`. Fails if no previous group has such name. Currently, `name` can contain any character other than `)`, but this will be eventually restricted to the same character set allowed by named captures.
 ```m["name"]```
   if ```m``` is a submatch object that has been used in the matching of a regular expression with a named capture group ```name```, then this returns that capture group.
+`m["name"]`
+  if `m` is a submatch object that has been used in the matching of a regular expression with a named capture group `name`, then this returns that capture group.
 === Recursive matching in regular expressions ===
+== Recursive matches ==
 In Perl,
 {{{
 …
            [^()]+  # Not parens
+         |
            (??{ $paren })  # Another balanced group (not interpolated yet)
+           (??{ $paren })  # Another balanced group (not interpreted yet)
         )*
       \)
     /x;
 }}}
 Matches text with balanced parenthesis. The same text can be matched with the PCRE regular expression ```\(([^()]*|(?R))*\)```. Furthermore, PCRE allows weird recursive matches because it handles matching atomically. Wikipedia's example is ```/^(<(?:[^<>]+|(?3)|(?1))*>)()(!>!>!>)$/``` matching the text ```<<!>!>!>><>>!>!>!>```.
+Matches text with balanced parenthesis. The same text can be matched with the PCRE regular expression `\(([^()]*|(?R))*\)`. This is basically the same expression without comments or spaces, but is somewhat simplified. Furthermore, PCRE allows weird recursive matches because it handles matching atomically. Wikipedia's example is `/^(<(?:[^<>]+|(?3)|(?1))*>)()(!>!>!>)$/` matching the text `<<!>!>!>><>>!>!>!>`.
+== More ==
+ During the summer, I plan to add these two capabilities to the Boost regex library. However, I think that I can have these features working significantly before the end of the summer, so if things go smoothly, I would very much like to work on extending the core regex matching algorithm to make its performance more predictable by using a DFA-based approach to find matches. This is the approach taken by most Unix tools, but not by most programming languages. My inspiration for this plan comes from reading the article by Russ Cox on automata-based regular expression evaluation, found at http://swtch.com/~rsc/regexp/regexp1.html/ Automata-based approaches generally significantly out-perform traditional recursive backtracking methods, but the primary reason for their use is that there are many regular expressions that require very complex branch-prediction and memoization techniques to avoid exponential running time on matching texts, and even these techniques still cannot prevent exponential blowup in all cases. However, a good DFA implementation is simple (I wrote one for my Models of Computation class) and never requires exponential running time. The reason they are not generally used is that they cannot model regular expressions that use backreferences, recursive patterns, or code callouts, and are difficult to modify to deal with assertions.
+ The deficiencies of a DFA regex implementation can be remedied by deciding between using an automata and using recursive backtracking when a regular expression is compiled, producing a hybrid approach that performs much better in some circumstances. This is an approach taken by GNU awk and GNU egrep, while Tcl takes it one step further and switches between backtracking and using an automata during the actual run of the regular expression.
+ Obviously, performance is very important. The fact that Perl, Ruby, Java, etc. all use recursive backtracking indicates that developers have accepted the pathological cases of recursive backtracking. However, good regular expression tutorials spend a lot of time explaining how to write regular expressions to allow them to complete before the programmer dies, but even with such guides, it can be difficult to write a regular expression that will properly search long texts. For inexperienced programmers, lazy programmers, and situations where a user can write their own regular expressions, the predictability of an automata approach can be better. I plan to extensively test my hybrid implementation to give Boost everything they need to consider whether or not the hybrid approach is a valuable change to the regex library. Even if it is not, the work performed earlier in the summer will have improved the Boost regex library, and I will be incredibly happy at the work and research I was able to accomplish.
+== NFA-based matching ==
+'''Success Criteria'''
+ At the end of the summer, the Boost regex library should be able to handle regular expressions that use recursive patterns and regular expressions that have named sub-expressions so that matches can be retrieved directly by name. Neither of these improvements should cause the regex library to behave differently on any expressions it could handle before the changes, and the Boost regex library should continue to perform comparably to Philip Hazel's PCRE library on benchmarks using and not using these two features.
+ Furthermore, I should have a working version of the Boost regex library that uses an automation instead of recursive backtracking when possible. This version should pass all regression tests, and should never use exponential memory or runtime on a regex unless that regex contains backreferences or recursive matching and requires exponential memory or runtime on the base version of Boost regex. This hybrid version of the Boost regex library should be accompanied by enough performance data to justify a decision on whether or not it should replace the previous version.
+== Schedule & Status ==
+'''Roadmap'''
+I use specific dates, but I understand that these are quite rough and will change as I become more familiar with the project. I intend to only predict how much time (in approximate weeks) I should spend on each stage.
+I fell pretty far behind schedule during the first half of the summer, and am unsure as to whether it will be possible to do the work I had hoped to do with DFA-based matching. However, DFA-based matching is simple, and as long as hybridizing doesn't turn out to be too difficult, then I can probably still get a hybridized NFA/DFA regex engine that uses the DFA matcher only for very simple regex constructs. Since simple regexes get used more frequently then complex regexes (I assume--I have no data on this), this could still be worthwhile.
 April 9 through May 27: Get to know my mentor, the Boost project in general, and the regex library specifically. Work out a plan for which library files need to be changed for each addition, and try to determine what specific problems I will have so they can be addressed early. After my school lets out, start prototyping additions.
+According to my original schedule, I should have finished both named captures and recursive matching by July 13. As of then, I had only named captures working, and it still needs to be tested and reviewed for defects.
+May 12: My exams are over.
+'''Completed:'''
+ * Nothing is completely finished yet.
+May 28 through June 1: Construct extensive regression and unit tests for the recursive patterns and named sub-expression additions. Perform benchmarking and performance testing for use on all stages.
+'''Near completed:'''
+ * Named captures are implemented, but do not have support for ICU. They also need to be checked out for completeness and correctness, and undergo testing on more platforms.
+   * recall by name (e.g., `m["name"]`) currently takes only one argument, of type `std::basic_string<re_detail::regex_iterator_traits<BidiIterator>::sub_match<BidiIterator> >`. This will probably be overloaded, but works well and will probably not have any real change.
+   * there is an pointer copy between a map using `std::basic_string<charT>` as its key and a map using the above string type as its key. They will usually be the same, but there will be a runtime error if they aren't.
+   * the character set allowed inside the regex for the capture group and the backreference should be limited, but I'm not sure by how much, or what the best way to do this is.
+ * Regression tests for named captures are present, but unfinished. There doesn't seem to be an easy way to test recall, but named capture groups and named backreferences are fairly simple, and there are several tests for each.
+June 2 through June 22: Extend regex library to allow recursive matching.
+'''In progress:'''
+ * Recursive matching has just started. There is currently no working code.
+June 23 through July 13: Extend regex library to allow named sub-expressions.
+'''Not started:'''
+ * I haven't written any code towards a hybrid NFA/DFA matching engine.
+July 14 through July 28: Build DFA-based regex matching engine.
+=== Dates ===
 July 29 through August 17: Incorporate DFA-based matching into regex library, building hybrid regex library.
+In light of my missed deadlines, this is a proposed schedule for the remainder of the summer. Given how long everything has taken me so far, it is ambitious, and the NFA-capabilities may not get finished, but I view them as trivial compared to the goal of getting named and recursive matching a polished addition to Boost.Regex.
+August 21: My school starts back up.
+July 16 through July 30: Finish working on named captures, until they are a polished addition to Boost.Regex. This is intended to be background work, stretched out to give the community time to input changes, and if necessary will be stretched out even further.
+July 16 through July 28: Develop a working, feature complete prototype for recursive matching.
+July 30 through August 13: Polish recursive matching implementation. As above, this is background work.
+July 30 through August 4: Develop rough prototype for NFA-based matching.
+August 6 through August 11: Work on hybridizing NFA prototype with existing regex library.
+August 13 through August 20: Test hybrid library for stability and performance, then incrementally modify hybrid library to improve the NFA capabilities as much as possible.