Version 1 (modified by 15 years ago) ( diff ) | ,
---|
Just finished exams; I'm moving to a new place, and should be back on the internet and solidly working on this project in about a week. For now, here is my original project description.
Project Description
The Boost regex library does not implement recursive regexes (e.g., using “\((\w*|(?R))*\”) in PCRE to match nested parenthesis and words) or named sub-expressions (like Perl's “?P<name>” capability). These two features are found in similar libraries such as Philip Hazel's PCRE library and Microsoft's GRETA, and developers have come to expect to be able to use them when constructing regular expressions. This stands in the way of making Boost's stated goal of providing libraries suitable for eventual standardization. During the summer, I plan to add these two capabilities to the Boost regex library. However, I think that I can have these features working significantly before the end of the summer, so if things go smoothly, I would very much like to work on extending the core regex matching algorithm to make its performance more predictable by using a DFA-based approach to find matches. This is the approach taken by most Unix tools, but not by most programming languages. My inspiration for this plan comes from reading the article by Russ Cox on automata-based regular expression evaluation, found at http://swtch.com/~rsc/regexp/regexp1.html/ Automata-based approaches generally significantly out-perform traditional recursive backtracking methods, but the primary reason for their use is that there are many regular expressions that require very complex branch-prediction and memoization techniques to avoid exponential running time on matching texts, and even these techniques still cannot prevent exponential blowup in all cases. However, a good DFA implementation is simple (I wrote one for my Models of Computation class) and never requires exponential running time. The reason they are not generally used is that they cannot model regular expressions that use backreferences, recursive patterns, or code callouts, and are difficult to modify to deal with assertions. The deficiencies of a DFA regex implementation can be remedied by deciding between using an automata and using recursive backtracking when a regular expression is compiled, producing a hybrid approach that performs much better in some circumstances. This is an approach taken by GNU awk and GNU egrep, while Tcl takes it one step further and switches between backtracking and using an automata during the actual run of the regular expression. Obviously, performance is very important. The fact that Perl, Ruby, Java, etc. all use recursive backtracking indicates that developers have accepted the pathological cases of recursive backtracking. However, good regular expression tutorials spend a lot of time explaining how to write regular expressions to allow them to complete before the programmer dies, but even with such guides, it can be difficult to write a regular expression that will properly search long texts. For inexperienced programmers, lazy programmers, and situations where a user can write their own regular expressions, the predictability of an automata approach can be better. I plan to extensively test my hybrid implementation to give Boost everything they need to consider whether or not the hybrid approach is a valuable change to the regex library. Even if it is not, the work performed earlier in the summer will have improved the Boost regex library, and I will be incredibly happy at the work and research I was able to accomplish.
Success Criteria
At the end of the summer, the Boost regex library should be able to handle regular expressions that use recursive patterns and regular expressions that have named sub-expressions so that matches can be retrieved directly by name. Neither of these improvements should cause the regex library to behave differently on any expressions it could handle before the changes, and the Boost regex library should continue to perform comparably to Philip Hazel's PCRE library on benchmarks using and not using these two features. Furthermore, I should have a working version of the Boost regex library that uses an automation instead of recursive backtracking when possible. This version should pass all regression tests, and should never use exponential memory or runtime on a regex unless that regex contains backreferences or recursive matching and requires exponential memory or runtime on the base version of Boost regex. This hybrid version of the Boost regex library should be accompanied by enough performance data to justify a decision on whether or not it should replace the previous version.
Roadmap I use specific dates, but I understand that these are quite rough and will change as I become more familiar with the project. I intend to only predict how much time (in approximate weeks) I should spend on each stage.
April 9 through May 27: Get to know my mentor, the Boost project in general, and the regex library specifically. Work out a plan for which library files need to be changed for each addition, and try to determine what specific problems I will have so they can be addressed early. After my school lets out, start prototyping additions.
May 12: My exams are over.
May 28 through June 1: Construct extensive regression and unit tests for the recursive patterns and named sub-expression additions. Perform benchmarking and performance testing for use on all stages.
June 2 through June 22: Extend regex library to allow recursive matching.
June 23 through July 13: Extend regex library to allow named sub-expressions.
July 14 through July 28: Build DFA-based regex matching engine.
July 29 through August 17: Incorporate DFA-based matching into regex library, building hybrid regex library.
August 21: My school starts back up.