Opened 10 years ago

Closed 10 years ago

Last modified 10 years ago

#8392 closed Bugs (wontfix)

Too complex regex for boost::regex (perl, extended) isn't so complex

Reported by: bkalinczuk@… Owned by: John Maddock
Milestone: To Be Determined Component: regex
Version: Boost 1.52.0 Severity: Problem
Keywords: Cc:

Description

Here I have regular expression: "[a-e]+[b-f]+[ac-f]+[abd-f]+[a-cef]+[a-df]+"

This expression raises exception: "The complexity of matching the regular expression exceeded predefined bounds. Try refactoring the regular expression to make each choice made by the state machine unambiguous. This exception is thrown to prevent "eternal" matches that take an indefinite period time to locate."

Obviously perl regex and grep has no problem with this. According to my calculations this regex should use up to 27 states, which is barely 128. However boost implementation states, that it exceeds maximum of ptrdiff.

Here is code in c++ (compiled with gcc-4.7.2-11 and boost-1.53.0 and no additional flags (except -I-L-l)): #define BOOST_REGEX_MAX_STATE_COUNT std::ptrdiff_t(std::numeric_limits<std::ptrdiff_t>::max)

#include <iostream> #include <boost/regex.hpp> #include <string>

int main() {

boost::regex exp("[a-e]+[b-f]+[ac-f]+[abd-f]+[a-cef]+[a-df]+", boost::regex::perl); std::string text1 = "bbbbbbbccccccccccccccccccccccccbbbbbbbbbbbbbbbbcccccccccccccccccccccbbbbbbbbaaaaa"; std::string text2 = "bbbbbbbccccccccccccccccccccccccbbbbbbbbbbbbbbbbcccccccccccccccccccccbbbbbbbbaaaaae"; std::cout << BOOST_REGEX_MAX_STATE_COUNT << std::endl; std::cout << boost::regex_match(text1, exp) << std::endl; std::cout << boost::regex_match(text2, exp) << std::endl;

}

Here analog grep match: echo bbbbbbccccccccccccccccccccccccbbbbbbbbbbbbbbbbcccccccccccccccccccccbbbbbbbbaaaaae | grep "[a-e]\+[b-f]\+[ac-f]\+[abd-f]\+[a-cef]\+[a-df]\+"

And analog perl code: if ("bbbbbbbccccccccccccccccccccccccbbbbbbbbbbbbbbbbcccccccccccccccccccccbbbbbbbbaaaaa" =~ m/[a-e]+[b-f]+[ac-f]+[abd-f]+[a-cef]+[a-df]+$/) {

print "MATCHED\n";

} else {

print "NOT MATCHED\n";

} if ("bbbbbbbccccccccccccccccccccccccbbbbbbbbbbbbbbbbcccccccccccccccccccccbbbbbbbbaaaaae" =~ m/[a-e]+[b-f]+[ac-f]+[abd-f]+[a-cef]+[a-df]+$/) {

print "MATCHED\n";

} else {

print "NOT MATCHED\n";

}

Change History (7)

comment:1 by viboes, 10 years ago

Component: Noneregex
Owner: set to John Maddock

comment:2 by John Maddock, 10 years ago

Resolution: wontfix
Status: newclosed

This is not nearly so clear cut as you think, both Boost.Regex and Perl are backtracking NFA's and if I make the string being matched slightly longer then Perl takes several minutes to figure out that the string can't be matched:


if ("bbbbbbbccccccccccccccccccccccccbbbbbbbbbbbbbbbbcccccccccccccccccccccbbbbbbbbaaaaa" =~ m/[a-e]+[b-f]+[ac-f]+[abd-f]+[a-cef]+[a-df]+$/) {

    print "MATCHED\n";

} else {

    print "NOT MATCHED\n";

} if ("bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbcccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbaaaaae" =~ m/[a-e]+[b-f]+[ac-f]+[abd-f]+[a-cef]+[a-df]+$/) {

    print "MATCHED\n";

} else {

    print "NOT MATCHED\n";

} 

Make the string being matched longer still and Perl will go into an effectively infinite loop.

Boost.Regex takes the view that these pathological cases should be caught as soon as possible, and that's what you're seeing here. It's true that this behavior might not suite everyone all of the time, but it's also safer, and helps prevent deliberately "bad" regexes being used in DOS attacks etc.

comment:3 by anonymous, 10 years ago

I checked your example with grep and it had no problem with matching the largest matching substring. (There wasn't even any delay).

I understand, that for the purpose of limiting the time taken for regex matching/searching a developer can set a BOOST_REGEX_MAX_STATE_COUNT as he wishes. I would expect though, that regex matching will take O(BOOST_REGEX_MAX_STATE_COUNT * exp(size_of_regex) ) (128n in the example), which is not the case.

I believe there is a bug with counting number of states in this example and the behaviour is not as it is described.

comment:4 by bkalinczuk@…, 10 years ago

forgot email :P

comment:5 by anonymous, 10 years ago

I checked your example with grep and it had no problem with matching the largest matching substring. (There wasn't even any delay).

Grep is a completely different beast: it's a DFA meaning the complexity of matching is O(N). Perl compatible engines are NFA's, and in the worst case the complexity of obtaining a match is NP-complete, so something like O(N!) for an N-character string.

I understand, that for the purpose of limiting the time taken for regex matching/searching a developer can set a BOOST_REGEX_MAX_STATE_COUNT as he wishes. I would expect though, that regex matching will take O(BOOST_REGEX_MAX_STATE_COUNT * exp(size_of_regex) ) (128n in the example), which is not the case.

No it's much worst than that as mentioned above. NP-complete in the worst case (which strangely this nearly is).

I believe there is a bug with counting number of states in this example and the behaviour is not as it is described.

It's not states in the machine that matter - it's how many are visited while trying to find the match. The maximum is set to the lowest of:

  • BOOST_REGEX_MAX_STATE_COUNT
  • N2 for an N character string.
  • A lower limit k, which is hard coded to 100000.

You can change this behavour by editing estimate_max_state_count in perl_matcher_common.hpp.

in reply to:  5 comment:6 by bkalinczuk@…, 10 years ago

Replying to anonymous:

I checked your example with grep and it had no problem with matching the largest matching substring. (There wasn't even any delay).

Grep is a completely different beast: it's a DFA meaning the complexity of matching is O(N). Perl compatible engines are NFA's, and in the worst case the complexity of obtaining a match is NP-complete, so something like O(N!) for an N-character string.

I understand, that for the purpose of limiting the time taken for regex matching/searching a developer can set a BOOST_REGEX_MAX_STATE_COUNT as he wishes. I would expect though, that regex matching will take O(BOOST_REGEX_MAX_STATE_COUNT * exp(size_of_regex) ) (128n in the example), which is not the case.

No it's much worst than that as mentioned above. NP-complete in the worst case (which strangely this nearly is).

I believe there is a bug with counting number of states in this example and the behaviour is not as it is described.

It's not states in the machine that matter - it's how many are visited while trying to find the match. The maximum is set to the lowest of:

  • BOOST_REGEX_MAX_STATE_COUNT
  • N2 for an N character string.
  • A lower limit k, which is hard coded to 100000.

You can change this behavour by editing estimate_max_state_count in perl_matcher_common.hpp.

My estimation have been N * 2number_of_DFA_states, so for your example it would be 56448 (even less then hardcoded), which could be calculated in no time. I would never expect the regex matching to be dependent on text size in any other way then linear.

There is this article about comparing perl regexes to the awk and grep approach: http:\/\/ swtch.com \/ ~rsc \/ regexp \/ regexp1.html

Clearly I expected a Thompson approach.

in reply to:  5 comment:7 by Steven Watanabe, 10 years ago

Replying to anonymous:

I checked your example with grep and it had no problem with matching the largest matching substring. (There wasn't even any delay).

Grep is a completely different beast: it's a DFA meaning the complexity of matching is O(N). Perl compatible engines are NFA's, and in the worst case the complexity of obtaining a match is NP-complete, so something like O(N!) for an N-character string.

That isn't really true. NFA matching is in P. The problem is that a Perl regular expression cannot necessarily be represented by an NFA. The representation is similar to an NFA, but it isn't really one.

Note: See TracTickets for help on using tickets.