Opened 12 years ago

Closed 12 years ago

Last modified 12 years ago

#4721 closed Bugs (fixed)

multiple capture groups with the same name break regex

Reported by: robin.snyder@… Owned by: John Maddock
Milestone: To Be Determined Component: regex
Version: Boost 1.44.0 Severity: Problem
Keywords: regex named capture group Cc:

Description

If a pattern contains several capture groups that share the same name, so that the name has the possibility of matching in more than one place, the named group is always an empty string.

See the attached file for a full example of both the working and the broken case.

Note that when broke==true the group names are not unique, and when broke==false they are.

This problem could be caused by the partial results overwriting the good match.

The commented-out string e is a regex that works in .NET 2.0. I didn't like having my hands tied to one platform/runtime, which is the reason for rewriting in C/C++.

Attachments (1)

regexbroke.cpp (4.6 KB ) - added by robin.snyder@… 12 years ago.
Example of the broken condition and what is expected when working


Change History (8)

by robin.snyder@…, 12 years ago

Attachment: regexbroke.cpp added

Example of the broken condition and what is expected when working

comment:1 by anonymous, 12 years ago

Confirmed.

I have a fix for this in the pipeline, however, please note that things will be fixed to do what Perl does which is subtly different to how .NET handles things... in this particular case I think they'll work the same way though.

John.

comment:2 by anonymous, 12 years ago

Here is another pattern for testing.

(?<MPAT01>((?<MPAT01.zone>[0-5][0-9]|60)\s*)((?<MPAT01.band>[CcDdEeFfGgHhJjKkLlMmNnPpQqRrSsTtUuVvWwXx])\s*)(?<MPAT01.grid>[AaBbCcDdEeFfGgHhJjKkLlMmNnPpQqRrSsTtUuVvWwXxYyZz][AaBbCcDdEeFfGgHhJjKkLlMmNnPpQqRrSsTtUuVv])\s*(((?<MPAT01.easting>\d{5})\s*(?<MPAT01.northing>\d{5}))|((?<MPAT01.easting>\d{4})\s*(?<MPAT01.northing>\d{4}))|((?<MPAT01.easting>\d{3})\s*(?<MPAT01.northing>\d{3}))))([^\d]|$)

The repeated easting/northing pairs cause the earlier capture groups to be overwritten by the later ones.

for

11spa 12345 67890

11spa 1234 6789

11spa 123 678

the capture groups work only for the last one.

comment:3 by anonymous, 12 years ago

To make the above pattern work, the easting and northing names have to be made unique by appending numbers:

(?<MPAT01>((?<MPAT01.zone>[0-5][0-9]|60)\s*)((?<MPAT01.band>[CcDdEeFfGgHhJjKkLlMmNnPpQqRrSsTtUuVvWwXx])\s*)(?<MPAT01.grid>[AaBbCcDdEeFfGgHhJjKkLlMmNnPpQqRrSsTtUuVvWwXxYyZz][AaBbCcDdEeFfGgHhJjKkLlMmNnPpQqRrSsTtUuVv])\s*(((?<MPAT01.easting5>\d{5})\s*(?<MPAT01.northing5>\d{5}))|((?<MPAT01.easting4>\d{4})\s*(?<MPAT01.northing4>\d{4}))|((?<MPAT01.easting3>\d{3})\s*(?<MPAT01.northing3>\d{3}))))([^\d]|$)

Once this is done, all three inputs match properly. However, getting at the named groups is now more difficult, because the caller has to check each numbered name in turn.
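A minimal sketch of what that extra work looks like: a helper that returns the first sub-expression, out of a list of candidates, that actually took part in the match. The names first_matched and easting_of are hypothetical, and std::regex with numbered groups is used here only so the sketch has no Boost dependency (std::regex has no named groups). With Boost.Regex the same helper would presumably be called with the names, e.g. {"MPAT01.easting5", "MPAT01.easting4", "MPAT01.easting3"}; that variant is an untested assumption.

```cpp
#include <initializer_list>
#include <regex>
#include <string>

// Return the text of the first candidate sub-expression that participated
// in the match, or an empty string if none did.
// Key is whatever the match object accepts in operator[]: an index here,
// or (assumption) a group name for a Boost.Regex match_results.
template <class Match, class Key>
std::string first_matched(const Match& what, std::initializer_list<Key> keys)
{
   for (const Key& key : keys)
   {
      if (what[key].matched)
         return what[key].str();
   }
   return std::string();
}

// Hypothetical demo: only the easting/northing tail of the ticket's pattern,
// with numbered groups standing in for easting5/easting4/easting3.
// Groups 1, 3 and 5 are the three alternative "easting" captures.
std::string easting_of(const std::string& text)
{
   static const std::regex e(
      "(?:([0-9]{5})\\s*([0-9]{5})|([0-9]{4})\\s*([0-9]{4})|([0-9]{3})\\s*([0-9]{3}))(?:[^0-9]|$)");
   std::smatch what;
   if (!std::regex_search(text, what, e))
      return std::string();
   return first_matched(what, {1, 3, 5});
}
```

Calling easting_of on each of the three sample inputs should return whichever of the 5-, 4- or 3-digit alternatives matched, which is the value a single shared group name would otherwise provide directly.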

comment:4 by anonymous, 12 years ago

Still testing the revised code, but I believe your examples work now. For the new one I get:

The following match was found for text 11spa 12345 67890
$0 = "11spa 12345 67890"
$1 = "11spa 12345 67890"
$2 = "11"
$3 = "11"
$4 = "s"
$5 = "s"
$6 = "pa"
$7 = "12345 67890"
$8 = "12345 67890"
$9 = "12345"
$10 = "67890"
$11 = ""
$12 = ""
$13 = ""
$14 = ""
$15 = ""
$16 = ""
$17 = ""
MPAT01 = 11spa 12345 67890
MPAT01.zone = 11
MPAT01.band = s
MPAT01.grid = pa
MPAT01.easting = 12345
MPAT01.northing = 67890

The following match was found for text 11spa 1234 6789
$0 = "11spa 1234 6789"
$1 = "11spa 1234 6789"
$2 = "11"
$3 = "11"
$4 = "s"
$5 = "s"
$6 = "pa"
$7 = "1234 6789"
$8 = ""
$9 = ""
$10 = ""
$11 = "1234 6789"
$12 = "1234"
$13 = "6789"
$14 = ""
$15 = ""
$16 = ""
$17 = ""
MPAT01 = 11spa 1234 6789
MPAT01.zone = 11
MPAT01.band = s
MPAT01.grid = pa
MPAT01.easting = 1234
MPAT01.northing = 6789

The following match was found for text 11spa 123 678
$0 = "11spa 123 678"
$1 = "11spa 123 678"
$2 = "11"
$3 = "11"
$4 = "s"
$5 = "s"
$6 = "pa"
$7 = "123 678"
$8 = ""
$9 = ""
$10 = ""
$11 = ""
$12 = ""
$13 = ""
$14 = "123 678"
$15 = "123"
$16 = "678"
$17 = ""
MPAT01 = 11spa 123 678
MPAT01.zone = 11
MPAT01.band = s
MPAT01.grid = pa
MPAT01.easting = 123
MPAT01.northing = 678

Which I believe was what you were hoping for?

Note that the way in which named sub-expressions get numbered differs between Perl and .NET. They also differ in how they treat multiple named sub-expressions with the same name: in .NET they are treated as the same named capture group, whereas in Perl they are separate groups (with different numbers) that happen to have the same name, so $+{name} returns the leftmost capture group called "name" that matched. As long as only one of the identically named captures can match at a time, the two approaches give the same result, apart from the numbers assigned to the capture groups. However, if more than one capture with a given name can match at a time, then it is possible to tell the difference between them, for example:

(?<A>a)(?<A>b) against "ab"

will result in $+{A} being "a" for Perl and "b" for .NET (at least I think that's what .NET does).
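The two lookup rules can be sketched as a toy model. This is not Boost.Regex or .NET code: the Capture struct and both function names are hypothetical, and the ".NET-style" rule simply encodes the behaviour guessed at in this comment (same-named groups share one slot, so the last capture wins).

```cpp
#include <string>
#include <vector>

// One capture group after a match attempt, listed in left-to-right order.
struct Capture
{
   std::string name;   // group name, may repeat
   bool matched;       // did this group take part in the match?
   std::string text;   // captured text (empty if not matched)
};

// Perl-style rule: same-named groups stay separate; a lookup by name
// returns the leftmost group with that name that matched.
std::string lookup_leftmost(const std::vector<Capture>& groups, const std::string& name)
{
   for (const Capture& g : groups)
   {
      if (g.name == name && g.matched)
         return g.text;
   }
   return std::string();
}

// Assumed .NET-style rule: same-named groups share one slot, so each
// later capture overwrites the earlier one and the last match wins.
std::string lookup_last(const std::vector<Capture>& groups, const std::string& name)
{
   std::string result;
   for (const Capture& g : groups)
   {
      if (g.name == name && g.matched)
         result = g.text;
   }
   return result;
}
```

For the two captures of (?<A>a)(?<A>b) against "ab", lookup_leftmost gives "a" and lookup_last gives "b". When only one same-named group can match at a time, as in the easting/northing alternatives, both rules return the same text.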

in reply to:  4 comment:5 by anonymous, 12 years ago

That is the correct parsing. Is there a way that I can make the currently released version work? From what you said above, it seems that I should be able to access the multiple captures. What would be the name to access the subexpression, match_results$easting? Thanks again.

comment:6 by John Maddock, 12 years ago

Resolution: fixed
Status: new → closed

Fixed in this changeset: https://svn.boost.org/trac/boost/changeset/65943

It's rather a large patch, but you would need to apply this to 1.44.0 to get this working correctly.

Named sub-expressions are accessed as before, by subscripting the match_results object with the name. Here's my test code for your example:

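#include <iostream>
#include <boost/regex.hpp>

// Search text with r and print every numbered and named sub-expression.
void test(const boost::regex& r, const char* text)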
{
   using namespace std;
   boost::cmatch what;
   if(regex_search(text, what, r))
   {
      cout << "The following match was found for text " << text << endl;
      for(unsigned i = 0; i < what.size(); ++i)
      {
         cout << "$" << i << " = \"" << what[i] << "\"" << endl;
      }
      cout << "MPAT01 = " << what["MPAT01"] << endl;
      cout << "MPAT01.zone = " << what["MPAT01.zone"] << endl;
      cout << "MPAT01.band = " << what["MPAT01.band"] << endl;
      cout << "MPAT01.grid = " << what["MPAT01.grid"] << endl;
      cout << "MPAT01.easting = " << what["MPAT01.easting"] << endl;
      cout << "MPAT01.northing = " << what["MPAT01.northing"] << endl;
   }
   else
   {
      cout << "No match found for text " << text << endl;
   }
   cout << endl;
}


int main()
{

   boost::regex e("(?<MPAT01>((?<MPAT01.zone>[0-5][0-9]|60)\\s*)((?<MPAT01.band>[CcDdEeFfGgHhJjKkLlMmNnPpQqRrSsTtUuVvWwXx])\\s*)(?<MPAT01.grid>[AaBbCcDdEeFfGgHhJjKkLlMmNnPpQqRrSsTtUuVvWwXxYyZz][AaBbCcDdEeFfGgHhJjKkLlMmNnPpQqRrSsTtUuVv])\\s*(((?<MPAT01.easting>\\d{5})\\s*(?<MPAT01.northing>\\d{5}))|((?<MPAT01.easting>\\d{4})\\s*(?<MPAT01.northing>\\d{4}))|((?<MPAT01.easting>\\d{3})\\s*(?<MPAT01.northing>\\d{3}))))([^\\d]|$)");
   test(e, "11spa 12345 67890");
   test(e, "11spa 1234 6789");
   test(e, "11spa 123 678");
   return 0;
}


comment:7 by John Maddock, 12 years ago

(In [66116]) Merge fixes from Trunk. Fixes #4721.
