Opened 13 years ago

Closed 13 years ago

#3899 closed Bugs (worksforme)

Regex: Bug in handling of "\Z"

Reported by: Keith MacDonald <keith@…>
Owned by: John Maddock
Milestone: Boost 1.43.0
Component: None
Version: Boost 1.42.0
Severity: Problem
Keywords:
Cc:

Description

Bug in handling of "\Z"

According to the documentation, the RE "\Z" is equivalent to "\n*\z", but it behaves differently. Given the text:

1\r
2\r
\r

Replacing all instances of "\n*\z" with "A" correctly results in:

1\r
2\r
\r
A

Whereas replacing all instances of "\Z" with "A" incorrectly results in:

1\r
2A\r
A\r
A

It does not seem to differentiate between "\r" and "\n", nor to correctly recognize the end of the buffer.

Change History (9)

comment:1 by John Maddock, 13 years ago

Strange, works for me, given:

std::string s = "1\r2\r3";
regex re("\\Z");

Then

std::string s2 = regex_replace(s, re, "[end]");

results in "1\r2\r3[end]" in s2.

Do you have a test case?

BTW \r and \n are intentionally treated as equivalent.

comment:2 by Keith MacDonald <keith@…>, 13 years ago

Here you go:

// cl /EHsc /IC:\Dev\boost_1_41_0 re.cpp /link /LIBPATH:C:\Dev\boost_1_41_0\stage\lib

#include <string>
#include <iostream>
#include <boost/regex.hpp>

int main()
{
	std::string s = "1\r2\r\r";
	boost::regex re("\\Z");
	//boost::regex re("\\n*\\z");

	std::string s2 = regex_replace(s, re, "A");

	// Use the string's own size type to avoid a signed/unsigned comparison.
	for (std::string::size_type i = 0; i < s2.length(); ++i) {
		const char	c = s2[i];

		if (c == '\r')
			std::cout << "\\r";
		else if (c == '\n')
			std::cout << "\\n";
		else
			std::cout << c;
	}

	return 0;
}

This, when built with Visual Studio 2008 results in "1\r2A\rA\rA".

The Perl documentation for \Z states that it works with "\n", not "\r", which is necessary for it to be equivalent to "\n*\z". My tests with perl 5.10.1 confirm this behaviour.

comment:3 by John Maddock, 13 years ago

There are two separate issues here:

1) Boost.Regex has always treated all line-termination characters as equivalent, so for example $ will match before any line-termination sequence: \n \r\n \r plus a few other Unicode-specific sequences. This is different to Perl's behaviour, but then Perl has complete control over file IO, text file formats and line endings, whereas Boost.Regex does not - it is intended to work with all text file formats wherever they're from and however they're read in. This seems to have worked well in practice up until now, and I don't really want to change it.

2) The behaviour of \Z in Perl seems to be quite "quirky" ;-) In fact it's quite hard to write a regular expression that matches its behaviour exactly! From messing around it seems to be:

$(?=\n\z)|\z

whereas Boost is doing:

$(?=\v+\z)|\z

This one I will look into changing, even though I would argue that the current behaviour is often more useful :-)

John.
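To make the difference between the two expressions above concrete, here is a minimal sketch that does not use Boost.Regex at all: it hand-codes the two rules as position predicates over a std::string. The function names are made up for illustration. It assumes that for plain char input "vertical whitespace" means \n, \v, \f and \r, and that the Boost rule is simply "everything from here to the end is vertical whitespace"; it does not try to model how $ treats a \r\n pair or any Unicode line separators.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Assumption: for plain char input, vertical whitespace is \n, \v, \f, \r.
bool is_vertical_space(char c) {
    return c == '\n' || c == '\v' || c == '\f' || c == '\r';
}

// Perl-style \Z as described above: the very end of the buffer,
// or immediately before a single "\n" that ends the buffer.
bool perl_style_Z(const std::string& s, std::size_t pos) {
    assert(pos <= s.size());
    if (pos == s.size()) return true;
    return pos + 1 == s.size() && s[pos] == '\n';
}

// Boost-style \Z as described above: only vertical whitespace
// (possibly none) lies between pos and the end of the buffer.
bool boost_style_Z(const std::string& s, std::size_t pos) {
    assert(pos <= s.size());
    for (std::size_t i = pos; i < s.size(); ++i) {
        if (!is_vertical_space(s[i])) return false;
    }
    return true;
}

// Collect every position 0..s.size() at which the predicate holds.
template <typename Pred>
std::vector<std::size_t> match_positions(const std::string& s, Pred pred) {
    std::vector<std::size_t> out;
    for (std::size_t pos = 0; pos <= s.size(); ++pos) {
        if (pred(s, pos)) out.push_back(pos);
    }
    return out;
}
```

With the test string from comment:2, match_positions(std::string("1\r2\r\r"), boost_style_Z) yields positions 3, 4 and 5, while the Perl-style predicate yields only position 5, because the trailing characters are \r rather than \n.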

comment:4 by Keith MacDonald <keith@…>, 13 years ago

I'd settle for "\Z" being equivalent to "\v*\z", and the documentation reflecting the equivalence of line termination characters (except where "\r" and "\n" are explicitly used). That should minimise any surprises for Perl users. ;)

Thanks, Keith

comment:5 by John Maddock, 13 years ago

Status: new → assigned

In that case I'll just update the docs :-)

Thanks, John.

comment:6 by John Maddock, 13 years ago

Resolution: fixed
Status: assigned → closed

(In [59512]) Highlight the differences between \Z in Boost and Perl. Regenerate docs. Fixes #3899.

comment:7 by Keith MacDonald <keith@…>, 13 years ago

Resolution: fixed
Status: closed → reopened

Hmmm, still not sure about this. The documentation now states:

\Z Matches a zero-width assertion consisting of an optional sequence of newlines
at the end of a buffer: equivalent to the regular expression (?=\v*\z).

However, a few lines earlier, we have:

a "buffer" in this context is the whole of the input text that is being matched
against.

In that case, shouldn't my sample code above result in "1\r2A" rather than "1\r2A\rA\rA"?

comment:8 by John Maddock, 13 years ago

No, that's why I said that it's a zero-width assertion - it matches zero characters preceding a sequence of newlines at the end of a buffer - whereas Perl matches zero characters preceding up to one newline at the end of the buffer.
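
As a rough illustration of why a zero-width assertion produces three replacements for the sample in comment:2, the sketch below inserts the replacement text at every position that satisfies each rule, without consuming any characters. It is a hand-written simulation using only the standard library, not Boost.Regex itself; the function names are hypothetical, and it assumes vertical whitespace means \n, \v, \f and \r for plain char input.

```cpp
#include <cstddef>
#include <string>

// Boost-style rule: insert `rep` before each character of the trailing run
// of vertical whitespace, and once more at the very end of the buffer.
std::string replace_boost_style_Z(const std::string& s, const std::string& rep) {
    // Assumption: vertical whitespace is \n, \v, \f, \r for plain char input.
    auto is_vspace = [](char c) {
        return c == '\n' || c == '\v' || c == '\f' || c == '\r';
    };
    // Find where the trailing run of vertical whitespace starts.
    std::size_t first = s.size();
    while (first > 0 && is_vspace(s[first - 1])) --first;

    std::string out = s.substr(0, first);
    for (std::size_t i = first; i < s.size(); ++i) {
        out += rep;   // zero-width match just before s[i]
        out += s[i];  // the character itself is kept, not consumed
    }
    out += rep;       // zero-width match at the end of the buffer
    return out;
}

// Perl-style rule: insert `rep` before a single final "\n" (if present)
// and once more at the very end of the buffer.
std::string replace_perl_style_Z(const std::string& s, const std::string& rep) {
    std::string out = s;
    if (!out.empty() && out.back() == '\n') out.insert(out.size() - 1, rep);
    out += rep;
    return out;
}

// Make \r and \n visible, in the same way as the loop in comment:2.
std::string show_line_endings(const std::string& s) {
    std::string out;
    for (char c : s) {
        if (c == '\r') out += "\\r";
        else if (c == '\n') out += "\\n";
        else out += c;
    }
    return out;
}
```

Under these assumptions, show_line_endings(replace_boost_style_Z("1\r2\r\r", "A")) gives 1\r2A\rA\rA, the output reported in comment:2, whereas the Perl-style rule gives 1\r2\r\rA for the same input.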

comment:9 by John Maddock, 13 years ago

Resolution: worksforme
Status: reopened → closed