Changes between Version 2 and Version 3 of soc/2007/RegexRecursiveMatch


Timestamp:
Jul 16, 2007, 4:49:43 PM (15 years ago)
Author:
hughwimberly
Comment:

more information on named captures, specifically in the Boost implementation

  • soc/2007/RegexRecursiveMatch

    v2 v3  
    44
    55=== Named capture groups ===
     6''Update'': The prototype for named capture groups has been finished. It still needs changes to work properly with ICU and to address any other issues that come up, but it seems to work well. The syntax will probably not change.
     7
    68Normally, in a regular expression, anything enclosed by parentheses is "captured" and stored for later retrieval. Two things can be done with a captured group: it can be used as a backreference, or it can be recalled later as a sub-match. Like all other regular expression engines, Boost.Regex does both of these using the numerical index as the lookup index. However, most modern regular expression engines also allow a capture group to be named, so that it can be referenced or recalled by a string index.
    79
    8 Example: This could possibly be used to read emails line by line and print out the subject line. (the original is [http://www.boost.org/more/getting_started/unix-variants.html#link-your-program-to-a-boost-library here])
     10Example: This could be used to read emails line by line and print out the subject line. (the original is [http://www.boost.org/more/getting_started/unix-variants.html#link-your-program-to-a-boost-library here])
    911{{{
    1012std::string line;
     
    1618   boost::smatch matches;
    1719   if (boost::regex_match(line, matches, pat))
    18       std::cout << matches.name("subject") << std::endl;
     20      std::cout << matches["subject"] << std::endl;
    1921}
    2022}}}
     
    2325The regular expression ```(?:.|\n)*Subject: (Re: |Fw: )*(?P<subject>.*)\n(?:.|\n)*(?P=subject)(?:.|\n)*``` would match an email whose subject line is repeated in the body of the email. As above, this is somewhat more readable, especially if there are multiple capture groups, and is less likely to break if the regex is changed.
    2426
    25 There's a pretty good overview of existing practice for all of this [http://www.regular-expressions.info/named.html here]. .NET supports named captures, but uses different syntax than everybody else, so Boost.Regex is using the more widely-used Perl syntax.
     27There's a pretty good overview of existing practice for all of this [http://www.regular-expressions.info/named.html here]. .NET supports named captures, but uses different syntax than everybody else, so Boost.Regex is using the more widely-used Python syntax. Specifically, Boost.Regex now allows the following regular expression syntax:
     28
     29```(?P<name>pattern)```::
     30  if ''pattern'' is a valid regular expression, then its match is captured and assigned to the group named ''name''. The group is also assigned a number that can be used to recall it or as a backreference; this number is the same as it would be if the group were an ordinary capture group (i.e. its position when counting opening parentheses from the left). At this point ''name'' may contain any character other than ```>```, but this will probably be restricted to alphabetic or alphanumeric names. Note that a ```)``` in ''name'' will prevent backreferences using that name, because ```(?P=name)``` ends at the first ```)```. A name may be used only once in a given regex.
     31
     32```(?P=name)```::
     33  is a backreference to the capture group identified by ```name```. Fails if no previous group has that name. Currently, ```name``` can contain any character other than ```)```, but this will eventually be restricted to the same character set allowed by named captures.
     34
     35```m["name"]```::
     36  if ```m``` is a match results object (such as ```boost::smatch```) that has been used in the matching of a regular expression with a named capture group ```name```, then this returns the sub-match for that capture group.
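
The relationship between names and numbers described above can be illustrated without the prototype. The following is only a rough sketch, not the Boost.Regex implementation: it assumes a hypothetical helper, ```strip_named_groups```, that rewrites ```(?P<name>...)``` into an ordinary group and ```(?P=name)``` into a numeric backreference, and then hands the result to ```std::regex```. The names ```NamedPattern```, ```strip_named_groups``` and ```extract_subject``` are invented for this example, and the helper ignores character classes for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical illustration only; this is not how Boost.Regex stores names.
struct NamedPattern {
    std::string pattern;               // pattern with the names stripped out
    std::map<std::string, int> index;  // name -> ordinary capture group number
};

// Rewrites (?P<name>...) to (...) and (?P=name) to a numeric backreference.
// Simplification: a '(' inside a character class [...] would be miscounted.
NamedPattern strip_named_groups(const std::string& src) {
    NamedPattern out;
    int group = 0;  // number of capture groups opened so far
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '\\' && i + 1 < src.size()) {
            // Copy escaped characters verbatim so "\(" is not counted.
            out.pattern += src[i];
            out.pattern += src[++i];
        } else if (src.compare(i, 4, "(?P<") == 0) {
            const std::size_t end = src.find('>', i + 4);
            if (end == std::string::npos)
                throw std::invalid_argument("unterminated group name");
            const std::string name = src.substr(i + 4, end - (i + 4));
            // A named group gets the same number an unnamed group would get.
            if (!out.index.emplace(name, ++group).second)
                throw std::invalid_argument("name used more than once: " + name);
            out.pattern += '(';
            i = end;
        } else if (src.compare(i, 4, "(?P=") == 0) {
            // The name runs up to the first ')', so it cannot contain one.
            const std::size_t end = src.find(')', i + 4);
            if (end == std::string::npos)
                throw std::invalid_argument("unterminated backreference");
            const auto it = out.index.find(src.substr(i + 4, end - (i + 4)));
            if (it == out.index.end())
                throw std::invalid_argument("no previous group has that name");
            // Wrapped in (?:...) so a following digit is not read as part of the number.
            out.pattern += "(?:\\" + std::to_string(it->second) + ")";
            i = end;
        } else {
            // Plain "(" opens a numbered group; "(?..." extensions do not.
            if (src[i] == '(' && (i + 1 == src.size() || src[i + 1] != '?')) ++group;
            out.pattern += src[i];
        }
    }
    return out;
}

// Returns the text of the "subject" group, or an empty string if the line does not match.
std::string extract_subject(const std::string& line) {
    static const NamedPattern np =
        strip_named_groups("Subject: (Re: |Fw: )*(?P<subject>.*)");
    static const std::regex pat(np.pattern);
    std::smatch m;
    if (!std::regex_match(line, m, pat)) return "";
    return m[np.index.at("subject")].str();  // look the name up, then index by number
}
```

In this sketch ```subject``` ends up as group 2, because the unnamed ```(Re: |Fw: )``` group opens first. That is the same number it would have as an ordinary capture group, so code that uses the numeric index keeps working when a name is added.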
    2637
    2738