Boost C++ Libraries: Ticket #3899: Regex: Bug in handling of "\Z" https://svn.boost.org/trac10/ticket/3899 <p> Bug in handling of "\Z" </p> <p> According to the documentation, the RE "\Z" is equivalent to "\n*\z", but it behaves differently. Given the text: </p> <pre class="wiki">1\r
2\r
\r
</pre><p> Replacing all instances of "\n*\z" with "A" correctly results in: </p> <pre class="wiki">1\r
2\r
\r
A
</pre><p> Whereas replacing all instances of "\Z" with "A" incorrectly results in: </p> <pre class="wiki">1\r
2A\r
A\r
A
</pre><p> It does not seem to differentiate between "\r" and "\n", nor to correctly recognize the end of the buffer. </p> en-us Boost C++ Libraries /htdocs/site/boost.png https://svn.boost.org/trac10/ticket/3899 Trac 1.4.3 John Maddock Thu, 04 Feb 2010 19:00:41 GMT <link>https://svn.boost.org/trac10/ticket/3899#comment:1 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/3899#comment:1</guid> <description> <p> Strange, works for me, given: </p> <blockquote> <p> std::string s = "1\r2\r3"; regex re("\\Z"); </p> </blockquote> <p> Then </p> <blockquote> <p> std::string s2 = regex_replace(s, re, "[end]"); </p> </blockquote> <p> results in "1\r2\r3[end]" in s2. </p> <p> Do you have a test case? </p> <p> BTW \r and \n are intentionally treated as equivalent. 
</p> </description> <category>Ticket</category> </item> <item> <author>Keith MacDonald <keith@…></author> <pubDate>Thu, 04 Feb 2010 20:30:52 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/3899#comment:2 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/3899#comment:2</guid> <description> <p> Here you go: </p> <pre class="wiki">// cl /EHsc /IC:\Dev\boost_1_41_0 re.cpp /link /LIBPATH:C:\Dev\boost_1_41_0\stage\lib
#include &lt;string&gt;
#include &lt;iostream&gt;
#include &lt;boost/regex.hpp&gt;

int main()
{
    std::string s = "1\r2\r\r";
    boost::regex re("\\Z");
    //boost::regex re("\\n*\\z");
    std::string s2 = regex_replace(s, re, "A");

    for (int i = 0; i &lt; s2.length(); ++i) {
        const char c = s2[i];
        if (c == '\r')
            std::cout &lt;&lt; "\\r";
        else if (c == '\n')
            std::cout &lt;&lt; "\\n";
        else
            std::cout &lt;&lt; c;
    }
    return 0;
}
</pre><p> This, when built with Visual Studio 2008, results in "1\r2A\rA\rA". </p> <p> The Perl documentation for \Z states that it works with "\n", not "\r", which is necessary for it to be equivalent to "\n*\z". My tests with perl 5.10.1 confirm this behaviour. </p> </description> <category>Ticket</category> </item> <item> <dc:creator>John Maddock</dc:creator> <pubDate>Fri, 05 Feb 2010 12:50:29 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/3899#comment:3 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/3899#comment:3</guid> <description> <p> There are two separate issues here: </p> <p> 1) Boost.Regex has always treated all line-termination characters as equivalent, so for example $ will match before any line-termination sequence: \n \r\n \r plus a few other unicode-specific sequences. This is different to Perl's behaviour, but then Perl has complete control over file IO and text file formats and line endings whereas Boost.Regex does not - and is intended to work with all text file formats wherever they're from and however they're read in. 
This seems to have worked well in practice up until now, and I don't really want to change it. </p> <p> 2) The behaviour of \Z in Perl seems to be quite "quirky" ;-) In fact it's quite hard to write a regular expression that matches its behaviour exactly! From messing around it seems to be: </p> <p> $(?=\n\z)|\z </p> <p> whereas Boost is doing: </p> <p> $(?=\v+\z)|\z </p> <p> This one I will look into changing, even though I would argue that the current behaviour is often more useful :-) </p> <p> John. </p> </description> <category>Ticket</category> </item> <item> <author>Keith MacDonald <keith@…></author> <pubDate>Fri, 05 Feb 2010 14:20:51 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/3899#comment:4 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/3899#comment:4</guid> <description> <p> I'd settle for "\Z" being equivalent to "\v*\z", and the documentation reflecting the equivalence of line termination characters (except where "\r" and "\n" are explicitly used). That should minimise any surprises for Perl users. ;) </p> <p> Thanks, Keith </p> </description> <category>Ticket</category> </item> <item> <dc:creator>John Maddock</dc:creator> <pubDate>Fri, 05 Feb 2010 16:43:23 GMT</pubDate> <title>status changed https://svn.boost.org/trac10/ticket/3899#comment:5 https://svn.boost.org/trac10/ticket/3899#comment:5 <ul> <li><strong>status</strong> <span class="trac-field-old">new</span> → <span class="trac-field-new">assigned</span> </li> </ul> <p> In that case I'll just update the docs :-) </p> <p> Thanks, John. 
</p> Ticket John Maddock Fri, 05 Feb 2010 17:05:11 GMT status changed; resolution set https://svn.boost.org/trac10/ticket/3899#comment:6 https://svn.boost.org/trac10/ticket/3899#comment:6 <ul> <li><strong>status</strong> <span class="trac-field-old">assigned</span> → <span class="trac-field-new">closed</span> </li> <li><strong>resolution</strong> → <span class="trac-field-new">fixed</span> </li> </ul> <p> (In <a class="changeset" href="https://svn.boost.org/trac10/changeset/59512" title="Highlight the differences between \Z in Boost and Perl. Regenerate ...">[59512]</a>) Highlight the differences between \Z in Boost and Perl. Regenerate docs. Fixes <a class="closed ticket" href="https://svn.boost.org/trac10/ticket/3899" title="#3899: Bugs: Regex: Bug in handling of &#34;\Z&#34; (closed: worksforme)">#3899</a>. </p> Ticket Keith MacDonald <keith@…> Sat, 06 Feb 2010 14:42:21 GMT status changed; resolution deleted https://svn.boost.org/trac10/ticket/3899#comment:7 https://svn.boost.org/trac10/ticket/3899#comment:7 <ul> <li><strong>status</strong> <span class="trac-field-old">closed</span> → <span class="trac-field-new">reopened</span> </li> <li><strong>resolution</strong> <span class="trac-field-deleted">fixed</span> </li> </ul> <p> Hmmm, still not sure about this. The documentation now states: </p> <pre class="wiki">\Z Matches a zero-width assertion consisting of an optional sequence of newlines at the end of a buffer: equivalent to the regular expression (?=\v*\z). </pre><p> However, a few lines earlier, we have: </p> <pre class="wiki">a "buffer" in this context is the whole of the input text that is being matched against. </pre><p> In that case, shouldn't my sample code above result in "1\r2A" rather than "1\r2A\rA\rA"? 
</p> Ticket John Maddock Sat, 06 Feb 2010 15:03:25 GMT <link>https://svn.boost.org/trac10/ticket/3899#comment:8 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/3899#comment:8</guid> <description> <p> No, that's why I said that it's a zero-width assertion - it matches zero characters preceding a sequence of newlines at the end of a buffer - whereas Perl matches zero characters preceding up to one newline at the end of the buffer. </p> </description> <category>Ticket</category> </item> <item> <dc:creator>John Maddock</dc:creator> <pubDate>Tue, 02 Mar 2010 17:03:36 GMT</pubDate> <title>status changed; resolution set https://svn.boost.org/trac10/ticket/3899#comment:9 https://svn.boost.org/trac10/ticket/3899#comment:9 <ul> <li><strong>status</strong> <span class="trac-field-old">reopened</span> → <span class="trac-field-new">closed</span> </li> <li><strong>resolution</strong> → <span class="trac-field-new">worksforme</span> </li> </ul> Ticket
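For readers following the thread, the difference between the two zero-width rules discussed above can be modelled without a regex engine. The sketch below is only an illustration under stated assumptions, not Boost.Regex's or Perl's implementation: the helper names (is_vertical_space, boost_style_Z, perl_style_Z, insert_at) are hypothetical, \v is approximated by its ASCII members only, and the Perl rule is taken to be "at the end of the buffer, or before a single final \n".

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: ASCII-only approximation of \v (vertical whitespace).
// The Unicode line separators mentioned in the thread are left out.
inline bool is_vertical_space(char c) {
    return c == '\n' || c == '\r' || c == '\v' || c == '\f';
}

// Positions satisfying the documented Boost rule (?=\v*\z): everything from
// the position to the end of the buffer is vertical whitespace.
inline std::vector<std::size_t> boost_style_Z(const std::string& s) {
    std::size_t first = s.size();
    while (first > 0 && is_vertical_space(s[first - 1])) --first;
    std::vector<std::size_t> out;
    for (std::size_t p = first; p <= s.size(); ++p) out.push_back(p);
    return out;
}

// Positions satisfying the assumed Perl rule: the end of the buffer, or
// immediately before one final '\n'.
inline std::vector<std::size_t> perl_style_Z(const std::string& s) {
    std::vector<std::size_t> out;
    if (!s.empty() && s.back() == '\n') out.push_back(s.size() - 1);
    out.push_back(s.size());
    return out;
}

// Models a global replace of zero-width matches: inserts `text` at each
// position. Positions must be ascending and within the buffer.
inline std::string insert_at(const std::string& s,
                             const std::vector<std::size_t>& positions,
                             const std::string& text) {
    std::string out;
    std::size_t prev = 0;
    for (std::size_t p : positions) {
        assert(p >= prev && p <= s.size());
        out.append(s, prev, p - prev);  // copy the text before the match
        out += text;                    // the zero-width match consumes nothing
        prev = p;
    }
    out.append(s, prev, std::string::npos);
    return out;
}
```

Under this model, insert_at(s, boost_style_Z(s), "A") with s = "1\r2\r\r" gives "1\r2A\rA\rA", the output reported in the test case, because the assertion holds at each of the three trailing positions. The Perl-style rule holds only at the very end of that buffer and gives "1\r2\r\rA".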