Boost C++ Libraries: Ticket #8392: Too complex regex for boost::regex (perl, extended) isn't so complex

component changed; owner set

viboes — Sun, 07 Apr 2013 18:27:59 GMT

owner set to John Maddock
component None → regex

status changed; resolution set

John Maddock — Sat, 20 Apr 2013 16:48:45 GMT

status new → closed
resolution → wontfix

This is not nearly as clear-cut as you think: both Boost.Regex and Perl are backtracking NFAs, and if I make the string being matched slightly longer, Perl takes several minutes to figure out that the string can't be matched:

if ("bbbbbbbccccccccccccccccccccccccbbbbbbbbbbbbbbbbcccccccccccccccccccccbbbbbbbbaaaaa" =~ m/[a-e]+[b-f]+[ac-f]+[abd-f]+[a-cef]+[a-df]+$/) {
    print "MATCHED\n";
} else {
    print "NOT MATCHED\n";
} if ("bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbcccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbaaaaae" =~ m/[a-e]+[b-f]+[ac-f]+[abd-f]+[a-cef]+[a-df]+$/) {
    print "MATCHED\n";
} else {
    print "NOT MATCHED\n";
}

Make the string being matched longer still and Perl will go into an effectively infinite loop.

Boost.Regex takes the view that these pathological cases should be caught as soon as possible, and that's what you're seeing here. It's true that this behavior might not suit everyone all of the time, but it's also safer, and it helps prevent deliberately "bad" regexes from being used in DoS attacks etc.
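To illustrate the idea of cutting off a pathological search early, here is a minimal C++ sketch of a backtracking matcher for the pattern above that counts every attempted step and gives up once a budget is exceeded. It is not Boost.Regex's implementation: the names `search_backtracking`, `match_here` and `Budget`, and the error message, are hypothetical, and `$` is treated simply as end of string.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// The six character classes of the pattern, spelled out:
// [a-e]+[b-f]+[ac-f]+[abd-f]+[a-cef]+[a-df]+$
static const std::string kClasses[6] = {
    "abcde", "bcdef", "acdef", "abdef", "abcef", "abcdf"};

static bool in_class(std::size_t i, char c) {
    return kClasses[i].find(c) != std::string::npos;
}

// Hypothetical step counter; `limit` plays the role of a state-count cap.
struct Budget {
    std::size_t used;
    std::size_t limit;
};

// Try to match classes cls..5 followed by end of string, starting at pos.
static bool match_here(const std::string& t, std::size_t pos,
                       std::size_t cls, Budget& b) {
    if (++b.used > b.limit)
        throw std::runtime_error("step budget exceeded");
    if (cls == 6) return pos == t.size();  // the trailing `$`
    // Greedy: consume as many characters of this class as possible...
    std::size_t end = pos;
    while (end < t.size() && in_class(cls, t[end])) ++end;
    // ...then backtrack one character at a time (at least one must remain).
    for (std::size_t stop = end; stop > pos; --stop)
        if (match_here(t, stop, cls + 1, b)) return true;
    return false;
}

// Unanchored search; throws std::runtime_error when the budget runs out.
bool search_backtracking(const std::string& t, std::size_t limit) {
    Budget b = {0, limit};
    for (std::size_t start = 0; start < t.size(); ++start)
        if (match_here(t, start, 0, b)) return true;
    return false;
}
```

With a budget of 100000 steps, `search_backtracking(std::string(60, 'e'), 100000)` throws, because a run of 60 `e` characters can be split among the first five classes in a very large number of ways while the last class `[a-df]` rejects `e`. A short matching input such as `"abcaaa"` returns true well within the budget.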

anonymous — Sat, 20 Apr 2013 19:41:22 GMT

I checked your example with grep and it had no problem with matching the largest matching substring. (There wasn't even any delay).

I understand, that for the purpose of limiting the time taken for regex matching/searching a developer can set a BOOST_REGEX_MAX_STATE_COUNT as he wishes. I would expect though, that regex matching will take O(BOOST_REGEX_MAX_STATE_COUNT * exp(size_of_regex) ) (128n in the example), which is not the case.

I believe there is a bug with counting number of states in this example and the behaviour is not as it is described.

bkalinczuk@… — Sat, 20 Apr 2013 19:42:08 GMT

forgot email :P

anonymous — Sun, 21 Apr 2013 11:00:25 GMT

I checked your example with grep and it had no problem with matching the largest matching substring. (There wasn't even any delay).

Grep is a completely different beast: it's a DFA, meaning the complexity of matching is O(N). Perl-compatible engines are NFAs, and in the worst case the complexity of obtaining a match is NP-complete, so something like O(N!) for an N-character string.

I understand, that for the purpose of limiting the time taken for regex matching/searching a developer can set a BOOST_REGEX_MAX_STATE_COUNT as he wishes. I would expect though, that regex matching will take O(BOOST_REGEX_MAX_STATE_COUNT * exp(size_of_regex) ) (128n in the example), which is not the case.

No, it's much worse than that, as mentioned above: NP-complete in the worst case (which, strangely, this nearly is).

I believe there is a bug with counting number of states in this example and the behaviour is not as it is described.

It's not states in the machine that matter - it's how many are visited while trying to find the match. The maximum is set to the lowest of:

BOOST_REGEX_MAX_STATE_COUNT
N² for an N character string.
A lower limit k, which is hard coded to 100000.

You can change this behavior by editing estimate_max_state_count in perl_matcher_common.hpp.
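Read literally, the three bounds listed above combine as a simple minimum. The following C++ sketch only illustrates that description; it is not the code of estimate_max_state_count in perl_matcher_common.hpp. The function name `visited_state_limit` and its parameters are hypothetical, `configured_max` stands in for whatever BOOST_REGEX_MAX_STATE_COUNT is set to, and overflow of N² for very long inputs is ignored.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: cap on the number of states visited during a match,
// taken as the lowest of the three bounds described in the comment above.
std::size_t visited_state_limit(std::size_t n, std::size_t configured_max) {
    const std::size_t k = 100000;          // hard-coded bound from the comment
    const std::size_t n_squared = n * n;   // N^2 for an N-character string
    return std::min({configured_max, n_squared, k});
}
```

For a 441-character input, for example, N² is 194481, so under this reading the hard-coded 100000 is the effective cap unless the configured maximum is lower still.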

bkalinczuk@… — Sun, 21 Apr 2013 11:53:37 GMT

Replying to anonymous:

I checked your example with grep and it had no problem with matching the largest matching substring. (There wasn't even any delay).

Grep is a completely different beast: it's a DFA, meaning the complexity of matching is O(N). Perl-compatible engines are NFAs, and in the worst case the complexity of obtaining a match is NP-complete, so something like O(N!) for an N-character string.

I understand, that for the purpose of limiting the time taken for regex matching/searching a developer can set a BOOST_REGEX_MAX_STATE_COUNT as he wishes. I would expect though, that regex matching will take O(BOOST_REGEX_MAX_STATE_COUNT * exp(size_of_regex) ) (128n in the example), which is not the case.

No, it's much worse than that, as mentioned above: NP-complete in the worst case (which, strangely, this nearly is).

I believe there is a bug with counting number of states in this example and the behaviour is not as it is described.

It's not states in the machine that matter - it's how many are visited while trying to find the match. The maximum is set to the lowest of:

BOOST_REGEX_MAX_STATE_COUNT
N² for an N character string.
A lower limit k, which is hard coded to 100000.

You can change this behavior by editing estimate_max_state_count in perl_matcher_common.hpp.

My estimate was N * 2^{number_of_DFA_states}, so for your example it would be 56448 (even less than the hard-coded limit), which could be calculated in no time.
I would never expect regex matching to depend on the text size in any way other than linearly.

There is this article comparing Perl regexes to the awk and grep approach: http://swtch.com/~rsc/regexp/regexp1.html

Clearly I expected a Thompson approach.
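For comparison, a Thompson-style simulation of this particular pattern can be sketched in C++ as follows. Instead of backtracking, it tracks the set of NFA states that are active after each input character, so the work per character is bounded by the number of states and the total time is linear in the text length. With six class states plus a start state there are at most 2^7 = 128 distinct state sets, which is consistent with the factor of 128 in the estimate above. The function name `search_linear` and the bit layout are assumptions made for this sketch, `$` is treated as end of string only, and none of this is Boost.Regex code.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// The six character classes of the pattern, spelled out:
// [a-e]+[b-f]+[ac-f]+[abd-f]+[a-cef]+[a-df]+$
static const std::string kClassSets[6] = {
    "abcde", "bcdef", "acdef", "abdef", "abcef", "abcdf"};

static bool class_has(std::size_t i, char c) {
    return kClassSets[i].find(c) != std::string::npos;
}

// Unanchored search by state-set simulation.
// Bit s (1..6) set means: some attempt has matched classes 1..s, each with
// at least one character, and class s consumed the most recent character.
bool search_linear(const std::string& text) {
    std::uint32_t active = 0;
    for (char c : text) {
        std::uint32_t next = 0;
        // A new attempt may start at every position (implicit start state).
        if (class_has(0, c)) next |= 1u << 1;
        for (std::size_t s = 1; s <= 6; ++s) {
            if (!(active & (1u << s))) continue;
            // Stay in class s (the `+` loop).
            if (class_has(s - 1, c)) next |= 1u << s;
            // Move on to class s + 1.
            if (s < 6 && class_has(s, c)) next |= 1u << (s + 1);
        }
        active = next;
    }
    // `$`: the last class must be active exactly at the end of the text.
    return (active & (1u << 6)) != 0;
}
```

Each character costs a fixed number of bit operations regardless of how many ways the text could be split among the classes, so an input such as 100000 `b` characters followed by `aaaaae` is rejected in a single pass.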

Steven Watanabe — Sun, 21 Apr 2013 17:17:35 GMT

Replying to anonymous:

I checked your example with grep and it had no problem with matching the largest matching substring. (There wasn't even any delay).

Grep is a completely different beast: it's a DFA, meaning the complexity of matching is O(N). Perl-compatible engines are NFAs, and in the worst case the complexity of obtaining a match is NP-complete, so something like O(N!) for an N-character string.

That isn't really true. NFA matching is in P. The problem is that a Perl regular expression cannot necessarily be represented by an NFA. The representation is similar to an NFA, but it isn't really one.