Boost C++ Libraries: Ticket #4721: multiple capture groups with the same name break regex https://svn.boost.org/trac10/ticket/4721 <p> If a pattern uses the same name for more than one capture group, so that the name can match in several places, looking the group up by name always yields an empty string. </p> <p> See the attached file for a full example of it working and breaking. </p> <p> Notice that when broke==true the group names are not unique, and when broke==false they are. </p> <p> This problem could be caused by partial results overwriting the good match. </p> <p> The commented-out string e is a regex that works in .NET 2.0. I didn't like having my hands tied to one platform/runtime, which is the reason for rewriting in C/C++. </p> robin.snyder@… Fri, 08 Oct 2010 12:34:08 GMT attachment set https://svn.boost.org/trac10/ticket/4721 https://svn.boost.org/trac10/ticket/4721 <ul> <li><strong>attachment</strong> → <span class="trac-field-new">regexbroke.cpp</span> </li> </ul> <p> Example of the broken condition and of what is expected when working. </p> Ticket anonymous Tue, 12 Oct 2010 15:57:34 GMT <link>https://svn.boost.org/trac10/ticket/4721#comment:1 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/4721#comment:1</guid> <description> <p> Confirmed. </p> <p> I have a fix for this in the pipeline; however, please note that things will be fixed to do what Perl does, which is subtly different from how .NET handles things... in this particular case I <em>think</em> they'll work the same way though. </p> <p> John. </p> </description> <category>Ticket</category> </item> <item> <dc:creator>anonymous</dc:creator> <pubDate>Tue, 12 Oct 2010 20:57:16 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/4721#comment:2 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/4721#comment:2</guid> <description> <p> Here is another pattern for testing. 
</p> <pre class="wiki">(?&lt;MPAT01&gt;((?&lt;MPAT01.zone&gt;[0-5][0-9]|60)\s*)((?&lt;MPAT01.band&gt;[CcDdEeFfGgHhJjKkLlMmNnPpQqRrSsTtUuVvWwXx])\s*)(?&lt;MPAT01.grid&gt;[AaBbCcDdEeFfGgHhJjKkLlMmNnPpQqRrSsTtUuVvWwXxYyZz][AaBbCcDdEeFfGgHhJjKkLlMmNnPpQqRrSsTtUuVv])\s*(((?&lt;MPAT01.easting&gt;\d{5})\s*(?&lt;MPAT01.northing&gt;\d{5}))|((?&lt;MPAT01.easting&gt;\d{4})\s*(?&lt;MPAT01.northing&gt;\d{4}))|((?&lt;MPAT01.easting&gt;\d{3})\s*(?&lt;MPAT01.northing&gt;\d{3}))))([^\d]|$) </pre><p> The easting/northing pairs cause the earlier capture groups to be overwritten by the later ones. </p> <p> For the inputs <br /> </p> <p> 11spa 12345 67890<br /> </p> <p> 11spa 1234 6789<br /> </p> <p> 11spa 123 678<br /> </p> <p> the capture groups work only for the last one. </p> </description> <category>Ticket</category> </item> <item> <dc:creator>anonymous</dc:creator> <pubDate>Tue, 12 Oct 2010 21:09:01 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/4721#comment:3 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/4721#comment:3</guid> <description> <p> To make the above pattern work, it needs to be edited by adding numbers after the easting and northing names: </p> <pre class="wiki">(?&lt;MPAT01&gt;((?&lt;MPAT01.zone&gt;[0-5][0-9]|60)\s*)((?&lt;MPAT01.band&gt;[CcDdEeFfGgHhJjKkLlMmNnPpQqRrSsTtUuVvWwXx])\s*)(?&lt;MPAT01.grid&gt;[AaBbCcDdEeFfGgHhJjKkLlMmNnPpQqRrSsTtUuVvWwXxYyZz][AaBbCcDdEeFfGgHhJjKkLlMmNnPpQqRrSsTtUuVv])\s*(((?&lt;MPAT01.easting5&gt;\d{5})\s*(?&lt;MPAT01.northing5&gt;\d{5}))|((?&lt;MPAT01.easting4&gt;\d{4})\s*(?&lt;MPAT01.northing4&gt;\d{4}))|((?&lt;MPAT01.easting3&gt;\d{3})\s*(?&lt;MPAT01.northing3&gt;\d{3}))))([^\d]|$) </pre><p> Once this is done, all 3 patterns will match properly. However, getting at the named groups is now more difficult, because each name has to be checked separately. 
</p> </description> <category>Ticket</category> </item> <item> <dc:creator>anonymous</dc:creator> <pubDate>Wed, 13 Oct 2010 09:20:25 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/4721#comment:4 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/4721#comment:4</guid> <description> <p> Still testing the revised code, but I believe your examples work now, for the new one I get: </p> <pre class="wiki">The following match was found for text 11spa 12345 67890
$0 = "11spa 12345 67890"
$1 = "11spa 12345 67890"
$2 = "11"
$3 = "11"
$4 = "s"
$5 = "s"
$6 = "pa"
$7 = "12345 67890"
$8 = "12345 67890"
$9 = "12345"
$10 = "67890"
$11 = ""
$12 = ""
$13 = ""
$14 = ""
$15 = ""
$16 = ""
$17 = ""
MPAT01 = 11spa 12345 67890
MPAT01.zone = 11
MPAT01.band = s
MPAT01.grid = pa
MPAT01.easting = 12345
MPAT01.northing = 67890

The following match was found for text 11spa 1234 6789
$0 = "11spa 1234 6789"
$1 = "11spa 1234 6789"
$2 = "11"
$3 = "11"
$4 = "s"
$5 = "s"
$6 = "pa"
$7 = "1234 6789"
$8 = ""
$9 = ""
$10 = ""
$11 = "1234 6789"
$12 = "1234"
$13 = "6789"
$14 = ""
$15 = ""
$16 = ""
$17 = ""
MPAT01 = 11spa 1234 6789
MPAT01.zone = 11
MPAT01.band = s
MPAT01.grid = pa
MPAT01.easting = 1234
MPAT01.northing = 6789

The following match was found for text 11spa 123 678
$0 = "11spa 123 678"
$1 = "11spa 123 678"
$2 = "11"
$3 = "11"
$4 = "s"
$5 = "s"
$6 = "pa"
$7 = "123 678"
$8 = ""
$9 = ""
$10 = ""
$11 = ""
$12 = ""
$13 = ""
$14 = "123 678"
$15 = "123"
$16 = "678"
$17 = ""
MPAT01 = 11spa 123 678
MPAT01.zone = 11
MPAT01.band = s
MPAT01.grid = pa
MPAT01.easting = 123
MPAT01.northing = 678
</pre><p> Which I believe was what you were hoping for? </p> <p> Note that the way in which named sub-expressions get numbered differs between Perl and .NET. They also differ in how they treat multiple named subs with the same name - in .NET they are treated as the same named capture group. 
In Perl they are separate groups (with different numbers) that happen to have the same name, so $+{name} returns the leftmost capture group called "name" that matched. As long as only one of the identically named captures can match at a time, the two approaches give the same result, apart from the numbers assigned to the capture groups. However, if more than one capture with a given name can match at a time, then it is possible to tell the difference between them, for example: </p> <p> (?&lt;A&gt;a)(?&lt;A&gt;b) against "ab" </p> <p> will result in $+{A} being "a" for Perl and "b" for .NET (at least I think that's what .NET does!!). </p> </description> <category>Ticket</category> </item> <item> <dc:creator>anonymous</dc:creator> <pubDate>Wed, 13 Oct 2010 12:15:45 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/4721#comment:5 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/4721#comment:5</guid> <description> <p> That is the correct parsing. Is there a way that I can make the currently released version work? From what you said above, it seems that I should be able to access the multiple captures. What would be the name to access the subexpression, match_results<a class="missing wiki">$easting</a>? Thanks again. 
</p> </description> <category>Ticket</category> </item> <item> <dc:creator>John Maddock</dc:creator> <pubDate>Wed, 13 Oct 2010 16:58:26 GMT</pubDate> <title>status changed; resolution set https://svn.boost.org/trac10/ticket/4721#comment:6 https://svn.boost.org/trac10/ticket/4721#comment:6 <ul> <li><strong>status</strong> <span class="trac-field-old">new</span> → <span class="trac-field-new">closed</span> </li> <li><strong>resolution</strong> → <span class="trac-field-new">fixed</span> </li> </ul> <p> Fixed in this changeset: <a class="ext-link" href="https://svn.boost.org/trac/boost/changeset/65943"><span class="icon"></span>https://svn.boost.org/trac/boost/changeset/65943</a> </p> <p> It's rather a large patch, but you would need to apply this to 1.44.0 to get this working correctly. </p> <p> Named sub-expressions are accessed as before, by subscripting the match_results object with the name. Here's my test code for your example: </p> <pre class="wiki">#include &lt;iostream&gt;
#include &lt;boost/regex.hpp&gt;

// Search text with r, then print every numbered group followed by the named groups.
void test(const boost::regex&amp; r, const char* text)
{
   using namespace std;
   boost::cmatch what;
   if(regex_search(text, what, r))
   {
      cout &lt;&lt; "The following match was found for text " &lt;&lt; text &lt;&lt; endl;
      for(unsigned i = 0; i &lt; what.size(); ++i)
      {
         cout &lt;&lt; "$" &lt;&lt; i &lt;&lt; " = \"" &lt;&lt; what[i] &lt;&lt; "\"" &lt;&lt; endl;
      }
      // Named sub-expressions: subscript the match_results object with the name.
      cout &lt;&lt; "MPAT01 = " &lt;&lt; what["MPAT01"] &lt;&lt; endl;
      cout &lt;&lt; "MPAT01.zone = " &lt;&lt; what["MPAT01.zone"] &lt;&lt; endl;
      cout &lt;&lt; "MPAT01.band = " &lt;&lt; what["MPAT01.band"] &lt;&lt; endl;
      cout &lt;&lt; "MPAT01.grid = " &lt;&lt; what["MPAT01.grid"] &lt;&lt; endl;
      cout &lt;&lt; "MPAT01.easting = " &lt;&lt; what["MPAT01.easting"] &lt;&lt; endl;
      cout &lt;&lt; "MPAT01.northing = " &lt;&lt; what["MPAT01.northing"] &lt;&lt; endl;
   }
   else
   {
      cout &lt;&lt; "No match found for text " &lt;&lt; text &lt;&lt; endl;
   }
   cout &lt;&lt; endl;
}

// Portable entry point (the original used the MSVC-specific _tmain/_TCHAR).
int main()
{
   boost::regex e("(?&lt;MPAT01&gt;((?&lt;MPAT01.zone&gt;[0-5][0-9]|60)\\s*)((?&lt;MPAT01.band&gt;[CcDdEeFfGgHhJjKkLlMmNnPpQqRrSsTtUuVvWwXx])\\s*)(?&lt;MPAT01.grid&gt;[AaBbCcDdEeFfGgHhJjKkLlMmNnPpQqRrSsTtUuVvWwXxYyZz][AaBbCcDdEeFfGgHhJjKkLlMmNnPpQqRrSsTtUuVv])\\s*(((?&lt;MPAT01.easting&gt;\\d{5})\\s*(?&lt;MPAT01.northing&gt;\\d{5}))|((?&lt;MPAT01.easting&gt;\\d{4})\\s*(?&lt;MPAT01.northing&gt;\\d{4}))|((?&lt;MPAT01.easting&gt;\\d{3})\\s*(?&lt;MPAT01.northing&gt;\\d{3}))))([^\\d]|$)");
   test(e, "11spa 12345 67890");
   test(e, "11spa 1234 6789");
   test(e, "11spa 123 678");
   return 0;
}
</pre> Ticket John Maddock Wed, 20 Oct 2010 12:11:23 GMT <link>https://svn.boost.org/trac10/ticket/4721#comment:7 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/4721#comment:7</guid> <description> <p> (In <a class="changeset" href="https://svn.boost.org/trac10/changeset/66116" title="Merge fixes from Trunk. Fixes #4721.">[66116]</a>) Merge fixes from Trunk. Fixes <a class="closed ticket" href="https://svn.boost.org/trac10/ticket/4721" title="#4721: Bugs: multiple capture groups with the same name break regex (closed: fixed)">#4721</a>. </p> </description> <category>Ticket</category> </item> </channel> </rss>