Boost C++ Libraries: Ticket #3899: Regex: Bug in handling of "\Z"

John Maddock — Thu, 04 Feb 2010 19:00:41 GMT

Strange, works for me, given:

std::string s = "1\r2\r3"; regex re("\\Z");

Then

std::string s2 = regex_replace(s, re, "[end]");

results in "1\r2\r3[end]" in s2.

Do you have a test case?

BTW \r and \n are intentionally treated as equivalent.

Thu, 04 Feb 2010 20:30:52 GMT

Here you go:

// cl /EHsc /IC:\Dev\boost_1_41_0 re.cpp /link /LIBPATH:C:\Dev\boost_1_41_0\stage\lib
#include <string>
#include <iostream>
#include <boost/regex.hpp>
int main()
{
	std::string s = "1\r2\r\r";
	boost::regex re("\\Z");
	//boost::regex re("\\n*\\z");
	std::string s2 = regex_replace(s, re, "A");
	for (std::size_t i = 0; i < s2.length(); ++i) {
		const char	c = s2[i];
		if (c == '\r')
			std::cout << "\\r";
		else if (c == '\n')
			std::cout << "\\n";
		else
			std::cout << c;
	}
	return 0;
}

When built with Visual Studio 2008, this prints "1\r2A\rA\rA".

The Perl documentation for \Z states that it works with "\n", not "\r", which is necessary for it to be equivalent to "\n*\z". My tests with perl 5.10.1 confirm this behaviour.

John Maddock — Fri, 05 Feb 2010 12:50:29 GMT

There are two separate issues here:

1) Boost.Regex has always treated all line-termination characters as equivalent, so for example $ will match before any line-termination sequence: \n, \r\n, \r, plus a few other Unicode-specific sequences. This is different to Perl's behaviour, but then Perl has complete control over file IO, text file formats and line endings, whereas Boost.Regex does not; it is intended to work with all text file formats, wherever they're from and however they're read in. This seems to have worked well in practice up until now, and I don't really want to change it.

2) The behaviour of \Z in Perl seems to be quite "quirky" ;-) In fact it's quite hard to write a regular expression that matches its behaviour exactly! From messing around it seems to be:

$(?=\n\z)|\z

whereas Boost is doing:

$(?=\v+\z)|\z

This one I will look into changing, even though I would argue that the current behaviour is often more useful :-)

John.
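
The difference between the two expressions above can be modelled without a regex engine. The following is only a rough sketch, not Boost.Regex or Perl code: the function names perl_style_Z and boost_style_Z are hypothetical, and the model assumes that '\n' and '\r' are the only line terminators (the real \v class covers more characters). Each function returns the offsets at which the corresponding \Z would succeed.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplifying assumption: only '\n' and '\r' count as line terminators.
bool is_line_terminator(char c) { return c == '\n' || c == '\r'; }

// Hypothetical model of Perl's \Z, i.e. $(?=\n\z)|\z:
// succeeds at the end of the buffer, or just before a single final '\n'.
std::vector<std::size_t> perl_style_Z(const std::string& s) {
    std::vector<std::size_t> out;
    const std::size_t n = s.size();
    if (n > 0 && s[n - 1] == '\n') out.push_back(n - 1);
    out.push_back(n);
    return out;
}

// Hypothetical model of the Boost behaviour described above:
// succeeds at every offset that is followed only by line terminators.
std::vector<std::size_t> boost_style_Z(const std::string& s) {
    std::size_t first = s.size();
    while (first > 0 && is_line_terminator(s[first - 1])) --first;
    std::vector<std::size_t> out;
    for (std::size_t p = first; p <= s.size(); ++p) out.push_back(p);
    return out;
}
```

Under these assumptions, the test string "1\r2\r\r" from the sample program gives offsets 3, 4 and 5 in the Boost-style model but only offset 5 in the Perl-style model, because the string contains no '\n'.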

Fri, 05 Feb 2010 14:20:51 GMT

I'd settle for "\Z" being equivalent to "\v*\z", and the documentation reflecting the equivalence of line termination characters (except where "\r" and "\n" are explicitly used). That should minimise any surprises for Perl users. ;)

Thanks, Keith

status changed

John Maddock — Fri, 05 Feb 2010 16:43:23 GMT

status new → assigned

In that case I'll just update the docs :-)

Thanks, John.

status changed; resolution set

John Maddock — Fri, 05 Feb 2010 17:05:11 GMT

status assigned → closed
resolution → fixed

(In [59512]) Highlight the differences between \Z in Boost and Perl. Regenerate docs. Fixes #3899.

status changed; resolution deleted

Sat, 06 Feb 2010 14:42:21 GMT

status closed → reopened
resolution fixed → (deleted)

Hmmm, still not sure about this. The documentation now states:

\Z Matches a zero-width assertion consisting of an optional sequence of newlines
at the end of a buffer: equivalent to the regular expression (?=\v*\z).

However, a few lines earlier, we have:

a "buffer" in this context is the whole of the input text that is being matched
against.

In that case, shouldn't my sample code above result in "1\r2A" rather than "1\r2A\rA\rA"?

John Maddock — Sat, 06 Feb 2010 15:03:25 GMT

No, that's why I said that it's a zero-width assertion: it matches zero characters preceding a sequence of newlines at the end of a buffer, whereas Perl matches zero characters preceding up to one newline at the end of the buffer.
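
To make the zero-width point concrete, here is a small sketch that imitates the two readings of the documentation by hand. It is not the Boost.Regex implementation: trailing_run_start, replace_zero_width and replace_trailing_run are hypothetical helpers, and the model assumes '\n' and '\r' are the only line terminators. The first function inserts the replacement at every offset followed only by terminators and removes nothing; the second consumes the trailing run, which is the result the question above expected.

```cpp
#include <cstddef>
#include <string>

// Offset where the trailing run of line terminators begins.
// Simplifying assumption: only '\n' and '\r' are treated as terminators.
std::size_t trailing_run_start(const std::string& s) {
    std::size_t first = s.size();
    while (first > 0 && (s[first - 1] == '\n' || s[first - 1] == '\r')) --first;
    return first;
}

// Zero-width reading: the assertion succeeds at each offset in the trailing
// run and at the end, so the replacement is inserted there and no input
// characters are consumed.
std::string replace_zero_width(const std::string& s, const std::string& rep) {
    const std::size_t first = trailing_run_start(s);
    std::string out = s.substr(0, first);
    for (std::size_t p = first; p < s.size(); ++p) {
        out += rep;   // match before this terminator
        out += s[p];  // the terminator itself is kept
    }
    out += rep;       // match at the very end of the buffer
    return out;
}

// Consuming reading: the whole trailing run is replaced once.
std::string replace_trailing_run(const std::string& s, const std::string& rep) {
    return s.substr(0, trailing_run_start(s)) + rep;
}
```

With the sample input "1\r2\r\r" and replacement "A", replace_zero_width yields "1\r2A\rA\rA", the output reported for the sample program, while replace_trailing_run yields "1\r2A".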

status changed; resolution set

John Maddock — Tue, 02 Mar 2010 17:03:36 GMT

status reopened → closed
resolution → worksforme