Opened 13 years ago
Closed 13 years ago
#3899 closed Bugs (worksforme)
Regex: Bug in handling of "\Z"
Reported by: | | Owned by: | John Maddock |
---|---|---|---|
Milestone: | Boost 1.43.0 | Component: | None |
Version: | Boost 1.42.0 | Severity: | Problem |
Keywords: | | Cc: | |
Description
Bug in handling of "\Z"
According to the documentation, the regular expression "\Z" is equivalent to "\n*\z", but the two behave differently. Given the text:
1\r2\r\r
Replacing all instances of "\n*\z" with "A" correctly results in:
1\r2\r\rA
Whereas replacing all instances of "\Z" with "A" incorrectly results in:
1\r2A\rA\rA
It seems neither to differentiate between "\r" and "\n" nor to recognize the end of the buffer correctly.
Change History (9)
comment:1 by , 13 years ago
comment:2 by , 13 years ago
Here you go:
```cpp
// cl /EHsc /IC:\Dev\boost_1_41_0 re.cpp /link /LIBPATH:C:\Dev\boost_1_41_0\stage\lib
#include <string>
#include <iostream>
#include <boost/regex.hpp>

int main()
{
    std::string s = "1\r2\r\r";
    boost::regex re("\\Z");
    //boost::regex re("\\n*\\z");
    std::string s2 = regex_replace(s, re, "A");
    for (int i = 0; i < s2.length(); ++i)
    {
        const char c = s2[i];
        if (c == '\r')
            std::cout << "\\r";
        else if (c == '\n')
            std::cout << "\\n";
        else
            std::cout << c;
    }
    return 0;
}
```
This, when built with Visual Studio 2008, results in "1\r2A\rA\rA".
The Perl documentation for \Z states that it works with "\n", not "\r", which is necessary for it to be equivalent to "\n*\z". My tests with perl 5.10.1 confirm this behaviour.
comment:3 by , 13 years ago
There are two separate issues here:
1) Boost.Regex has always treated all line-termination characters as equivalent, so for example $ will match before any line-termination sequence: \n, \r\n, \r, plus a few other Unicode-specific sequences. This is different to Perl's behaviour, but then Perl has complete control over file IO, text file formats and line endings, whereas Boost.Regex does not; it is intended to work with all text file formats wherever they're from and however they're read in. This seems to have worked well in practice up until now, and I don't really want to change it.
2) The behaviour of \Z in Perl seems to be quite "quirky" ;-) In fact it's quite hard to write a regular expression that matches its behaviour exactly! From messing around it seems to be:
$(?=\n\z)|\z
whereas Boost is doing:
$(?=\v+\z)|\z
This one I will look into changing, even though I would argue that the current behaviour is often more useful :-)
John.
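To make the difference between the two expressions concrete, here is a minimal sketch that models each assertion as a plain position test, without using Boost.Regex at all. All names (`is_vertical_space`, `perl_style_Z`, `boost_style_Z`, `match_positions`) are hypothetical. The Boost variant is modelled as "only vertical whitespace remains before the end of the buffer", which is an assumption about how $ and \v interact, and \v is restricted to its ASCII members.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// ASCII-only stand-in for \v (vertical whitespace). The Unicode line
// separators are deliberately left out of this sketch.
inline bool is_vertical_space(char c)
{
    return c == '\n' || c == '\v' || c == '\f' || c == '\r';
}

// Model of the Perl behaviour described above, $(?=\n\z)|\z:
// true at the end of the buffer, or just before one final "\n".
bool perl_style_Z(const std::string& s, std::size_t pos)
{
    if (pos == s.size())
        return true;
    return pos + 1 == s.size() && s[pos] == '\n';
}

// Model of the Boost behaviour (assumption): true wherever everything
// from pos to the end of the buffer is vertical whitespace.
bool boost_style_Z(const std::string& s, std::size_t pos)
{
    for (std::size_t i = pos; i < s.size(); ++i)
        if (!is_vertical_space(s[i]))
            return false;
    return true;
}

typedef bool (*assertion_fn)(const std::string&, std::size_t);

// Collect every offset (0..size inclusive) at which the assertion holds.
std::vector<std::size_t> match_positions(const std::string& s, assertion_fn holds)
{
    std::vector<std::size_t> out;
    for (std::size_t pos = 0; pos <= s.size(); ++pos)
        if (holds(s, pos))
            out.push_back(pos);
    return out;
}
```

Under this model, `match_positions("1\r2\r\r", boost_style_Z)` yields three offsets (3, 4 and 5), while `perl_style_Z` yields only the end of the buffer (5), because "\r" is not "\n" as far as Perl is concerned.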
comment:4 by , 13 years ago
I'd settle for "\Z" being equivalent to "\v*\z", and the documentation reflecting the equivalence of line termination characters (except where "\r" and "\n" are explicitly used). That should minimise any surprises for Perl users. ;)
Thanks, Keith
comment:5 by , 13 years ago
Status: | new → assigned |
---|
In that case I'll just update the docs :-)
Thanks, John.
comment:6 by , 13 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
comment:7 by , 13 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
Hmmm, still not sure about this. The documentation now states:
\Z Matches a zero-width assertion consisting of an optional sequence of newlines at the end of a buffer: equivalent to the regular expression (?=\v*\z).
However, a few lines earlier, we have:
a "buffer" in this context is the whole of the input text that is being matched against.
In that case, shouldn't my sample code above result in "1\r2A" rather than "1\r2A\rA\rA"?
comment:8 by , 13 years ago
No, that's why I said that it's a zero-width assertion - it matches zero characters preceding a sequence of newlines at the end of a buffer - whereas Perl matches zero characters preceding up to one newline at the end of the buffer.
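The following sketch shows why a zero-width assertion of this kind produces three replacements in the sample rather than one. It does not call Boost.Regex; it is a hand-written model that assumes \Z behaves like (?=\v*\z) with \v limited to ASCII vertical whitespace, and the helper names (`only_newlines_remain`, `insert_at_every_match`, `escape_line_ends`) are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Assumed model of (?=\v*\z): nothing but vertical whitespace
// (ASCII subset only) lies between pos and the end of the buffer.
bool only_newlines_remain(const std::string& s, std::size_t pos)
{
    for (std::size_t i = pos; i < s.size(); ++i) {
        const char c = s[i];
        if (c != '\n' && c != '\r' && c != '\f' && c != '\v')
            return false;
    }
    return true;
}

// Mimics a replace-all over a zero-width match: the replacement is
// emitted at every offset where the assertion holds, and no input
// character is consumed by the match itself.
std::string insert_at_every_match(const std::string& s, const std::string& replacement)
{
    std::string out;
    for (std::size_t pos = 0; pos <= s.size(); ++pos) {
        if (only_newlines_remain(s, pos))
            out += replacement;
        if (pos < s.size())
            out += s[pos];
    }
    return out;
}

// Makes "\r" and "\n" visible, like the loop in the sample program.
std::string escape_line_ends(const std::string& s)
{
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] == '\r')
            out += "\\r";
        else if (s[i] == '\n')
            out += "\\n";
        else
            out += s[i];
    }
    return out;
}
```

With the input "1\r2\r\r", the assertion holds before each of the two trailing "\r" characters and again at the very end, so `escape_line_ends(insert_at_every_match("1\r2\r\r", "A"))` gives `1\r2A\rA\rA`, the output reported in comment:2.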
comment:9 by , 13 years ago
Resolution: | → worksforme |
---|---|
Status: | reopened → closed |
Strange, works for me, given:
Then
results in "1\r2\r3[end]" in s2.
Do you have a test case?
BTW \r and \n are intentionally treated as equivalent.