#4721 closed Bugs (fixed)
multiple capture groups with the same name break regex
Reported by: | Owned by: | John Maddock | |
---|---|---|---|
Milestone: | To Be Determined | Component: | regex |
Version: | Boost 1.44.0 | Severity: | Problem |
Keywords: | regex named capture group | Cc: |
Description
If I have a named capture group that has the possibility of matching multiple times, the named group is always an empty string.
See attached file for full example of it working / breaking
Notice that when broke==true the group names are not unique, and when broke==false they are.
This problem could be caused by the partial results overwriting the good match.
The commented-out string e is a regex that works in .NET 2.0. I didn't like having my hands tied to one platform/runtime, which is the reason for rewriting in C/C++.
Attachments (1)
Change History (8)
by , 12 years ago
Attachment: | regexbroke.cpp added |
---|
comment:1 by , 12 years ago
Confirmed.
I have a fix for this in the pipeline, however, please note that things will be fixed to do what Perl does which is subtly different to how .NET handles things... in this particular case I think they'll work the same way though.
John.
comment:2 by , 12 years ago
Here is another pattern for testing.
(?<MPAT01>((?<MPAT01.zone>[0-5][0-9]|60)\s*)((?<MPAT01.band>[CcDdEeFfGgHhJjKkLlMmNnPpQqRrSsTtUuVvWwXx])\s*)(?<MPAT01.grid>[AaBbCcDdEeFfGgHhJjKkLlMmNnPpQqRrSsTtUuVvWwXxYyZz][AaBbCcDdEeFfGgHhJjKkLlMmNnPpQqRrSsTtUuVv])\s*(((?<MPAT01.easting>\d{5})\s*(?<MPAT01.northing>\d{5}))|((?<MPAT01.easting>\d{4})\s*(?<MPAT01.northing>\d{4}))|((?<MPAT01.easting>\d{3})\s*(?<MPAT01.northing>\d{3}))))([^\d]|$)
The repeated easting/northing names cause the earlier capture groups to be overwritten by the later ones.
For the inputs
11spa 12345 67890
11spa 1234 6789
11spa 123 678
only the last one yields working capture groups.
comment:3 by , 12 years ago
To make the above pattern work, it needs to be edited by adding numbers after the easting and northing names:
(?<MPAT01>((?<MPAT01.zone>[0-5][0-9]|60)\s*)((?<MPAT01.band>[CcDdEeFfGgHhJjKkLlMmNnPpQqRrSsTtUuVvWwXx])\s*)(?<MPAT01.grid>[AaBbCcDdEeFfGgHhJjKkLlMmNnPpQqRrSsTtUuVvWwXxYyZz][AaBbCcDdEeFfGgHhJjKkLlMmNnPpQqRrSsTtUuVv])\s*(((?<MPAT01.easting5>\d{5})\s*(?<MPAT01.northing5>\d{5}))|((?<MPAT01.easting4>\d{4})\s*(?<MPAT01.northing4>\d{4}))|((?<MPAT01.easting3>\d{3})\s*(?<MPAT01.northing3>\d{3}))))([^\d]|$)
Once this is done, all three inputs match properly. However, getting at the named groups is now more difficult.
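One way to soften that difficulty is to look the alternatives up in order and take the first one that took part in the match. The sketch below is only an illustration: it uses std::regex with numbered groups as a stand-in (std::regex has no named groups), it covers only the easting/northing part of the pattern, and first_matched, Coordinates and parse_coordinates are hypothetical names, not part of Boost.Regex.

```cpp
#include <cstddef>
#include <initializer_list>
#include <regex>
#include <string>

// Hypothetical helper: return the text of the first group in `indices`
// that took part in the match, or an empty string if none did.
std::string first_matched(const std::smatch& m,
                          std::initializer_list<std::size_t> indices)
{
    for (std::size_t i : indices) {
        if (i < m.size() && m[i].matched) {
            return m[i].str();
        }
    }
    return std::string();
}

// Hypothetical result type for the coordinate part of the pattern.
struct Coordinates {
    std::string easting;
    std::string northing;
};

// Simplified stand-in for the easting/northing alternation in the ticket.
// Groups 1, 3, 5 play the role of easting5/easting4/easting3 and
// groups 2, 4, 6 the role of northing5/northing4/northing3.
bool parse_coordinates(const std::string& text, Coordinates& out)
{
    static const std::regex re(
        R"((\d{5})\s*(\d{5})|(\d{4})\s*(\d{4})|(\d{3})\s*(\d{3}))");
    std::smatch m;
    if (!std::regex_search(text, m, re)) {
        return false;
    }
    out.easting = first_matched(m, {1, 3, 5});
    out.northing = first_matched(m, {2, 4, 6});
    return true;
}
```

With uniquely named groups in Boost.Regex the same idea would apply, trying each numbered name in turn, at the cost of one extra lookup per alternative.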
follow-up: 5 comment:4 by , 12 years ago
Still testing the revised code, but I believe your examples work now, for the new one I get:
The following match was found for text 11spa 12345 67890
$0 = "11spa 12345 67890"
$1 = "11spa 12345 67890"
$2 = "11"
$3 = "11"
$4 = "s"
$5 = "s"
$6 = "pa"
$7 = "12345 67890"
$8 = "12345 67890"
$9 = "12345"
$10 = "67890"
$11 = ""
$12 = ""
$13 = ""
$14 = ""
$15 = ""
$16 = ""
$17 = ""
MPAT01 = 11spa 12345 67890
MPAT01.zone = 11
MPAT01.band = s
MPAT01.grid = pa
MPAT01.easting = 12345
MPAT01.northing = 67890

The following match was found for text 11spa 1234 6789
$0 = "11spa 1234 6789"
$1 = "11spa 1234 6789"
$2 = "11"
$3 = "11"
$4 = "s"
$5 = "s"
$6 = "pa"
$7 = "1234 6789"
$8 = ""
$9 = ""
$10 = ""
$11 = "1234 6789"
$12 = "1234"
$13 = "6789"
$14 = ""
$15 = ""
$16 = ""
$17 = ""
MPAT01 = 11spa 1234 6789
MPAT01.zone = 11
MPAT01.band = s
MPAT01.grid = pa
MPAT01.easting = 1234
MPAT01.northing = 6789

The following match was found for text 11spa 123 678
$0 = "11spa 123 678"
$1 = "11spa 123 678"
$2 = "11"
$3 = "11"
$4 = "s"
$5 = "s"
$6 = "pa"
$7 = "123 678"
$8 = ""
$9 = ""
$10 = ""
$11 = ""
$12 = ""
$13 = ""
$14 = "123 678"
$15 = "123"
$16 = "678"
$17 = ""
MPAT01 = 11spa 123 678
MPAT01.zone = 11
MPAT01.band = s
MPAT01.grid = pa
MPAT01.easting = 123
MPAT01.northing = 678
Which I believe was what you were hoping for?
Note that the way in which named sub-expressions get numbered differs between Perl and .NET. They also differ in how they treat multiple named subs with the same name - in .NET they are treated as the same named capture group. In Perl they are separate groups (with different numbers) that happen to have the same name - so $+{name} returns the leftmost capture group called "name" that matched. As long as only one of the identically named captures can match at a time then the two approaches are the same; other than for the numbers assigned to the capture groups. However, if more than one capture with a given name can match at a time, then it is possible to tell the difference between them, for example:
(?<A>a)(?<A>b) against "ab"
will result in $+{A} being "a" for Perl and "b" for .NET (at least I think that's what .NET does!!).
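To make the difference concrete, here is a toy model of the two lookup rules. It is not Boost.Regex code and does not run a regex engine: Capture, lookup_leftmost and lookup_last are hypothetical names, the captures are filled in by hand, and the "last capture wins" rule simply encodes the .NET behaviour as tentatively described above rather than anything verified against .NET.

```cpp
#include <string>
#include <vector>

// One capture group in a toy model of a finished match.
// Groups are stored in left-to-right pattern order; names may repeat.
struct Capture {
    std::string name;
    bool matched;
    std::string text;
};

// Perl-style rule from the comment above: same-named groups stay separate,
// and a lookup by name returns the leftmost one that matched.
std::string lookup_leftmost(const std::vector<Capture>& groups,
                            const std::string& name)
{
    for (const Capture& g : groups) {
        if (g.name == name && g.matched) {
            return g.text;
        }
    }
    return std::string();
}

// Assumed ".NET-style" rule: same-named groups act as one group,
// so the most recent (rightmost) capture is the one reported.
std::string lookup_last(const std::vector<Capture>& groups,
                        const std::string& name)
{
    for (auto it = groups.rbegin(); it != groups.rend(); ++it) {
        if (it->name == name && it->matched) {
            return it->text;
        }
    }
    return std::string();
}
```

For (?<A>a)(?<A>b) against "ab" both groups match, so the two rules disagree ("a" versus "b"). For the easting/northing alternation only one branch can match at a time, so both rules return the same text.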
comment:5 by , 12 years ago
That is the correct parsing. Is there a way to make the currently released version work? From what you said above, it seems that I should be able to access the multiple captures. What would be the name to access the subexpression, match_results$easting? Thanks again.
comment:6 by , 12 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Fixed in this changeset: https://svn.boost.org/trac/boost/changeset/65943
It's rather a large patch, but you would need to apply this to 1.44.0 to get this working correctly.
Named sub-expressions are accessed as before, by subscripting the match_results object with the name. Here's my test code for your example:
#include <boost/regex.hpp>
#include <iostream>
#include <tchar.h>   // _tmain and _TCHAR are MSVC-specific

// Run the regex against text and print every numbered and named capture.
void test(const boost::regex& r, const char* text)
{
    using namespace std;
    boost::cmatch what;
    if(regex_search(text, what, r))
    {
        cout << "The following match was found for text " << text << endl;
        // Numbered sub-expressions, $0 being the whole match.
        for(unsigned i = 0; i < what.size(); ++i)
        {
            cout << "$" << i << " = \"" << what[i] << "\"" << endl;
        }
        // Named sub-expressions, accessed by subscripting with the name.
        cout << "MPAT01 = " << what["MPAT01"] << endl;
        cout << "MPAT01.zone = " << what["MPAT01.zone"] << endl;
        cout << "MPAT01.band = " << what["MPAT01.band"] << endl;
        cout << "MPAT01.grid = " << what["MPAT01.grid"] << endl;
        cout << "MPAT01.easting = " << what["MPAT01.easting"] << endl;
        cout << "MPAT01.northing = " << what["MPAT01.northing"] << endl;
    }
    else
    {
        cout << "No match found for text " << text << endl;
    }
    cout << endl;
}

int _tmain(int argc, _TCHAR* argv[])
{
    boost::regex e("(?<MPAT01>((?<MPAT01.zone>[0-5][0-9]|60)\\s*)((?<MPAT01.band>[CcDdEeFfGgHhJjKkLlMmNnPpQqRrSsTtUuVvWwXx])\\s*)(?<MPAT01.grid>[AaBbCcDdEeFfGgHhJjKkLlMmNnPpQqRrSsTtUuVvWwXxYyZz][AaBbCcDdEeFfGgHhJjKkLlMmNnPpQqRrSsTtUuVv])\\s*(((?<MPAT01.easting>\\d{5})\\s*(?<MPAT01.northing>\\d{5}))|((?<MPAT01.easting>\\d{4})\\s*(?<MPAT01.northing>\\d{4}))|((?<MPAT01.easting>\\d{3})\\s*(?<MPAT01.northing>\\d{3}))))([^\\d]|$)");
    test(e, "11spa 12345 67890");
    test(e, "11spa 1234 6789");
    test(e, "11spa 123 678");
    return 0;
}
Example of the broken condition and what is expected when working