#8392 closed Bugs (wontfix)
Too complex regex for boost::regex (perl, extended) isn't so complex
Reported by: | Owned by: | John Maddock | |
---|---|---|---|
Milestone: | To Be Determined | Component: | regex |
Version: | Boost 1.52.0 | Severity: | Problem |
Keywords: | Cc: |
Description
Here I have regular expression: "[a-e]+[b-f]+[ac-f]+[abd-f]+[a-cef]+[a-df]+"
This expression raises exception: "The complexity of matching the regular expression exceeded predefined bounds. Try refactoring the regular expression to make each choice made by the state machine unambiguous. This exception is thrown to prevent "eternal" matches that take an indefinite period time to locate."
Obviously perl regex and grep has no problem with this. According to my calculations this regex should use up to 27 states, which is barely 128. However boost implementation states, that it exceeds maximum of ptrdiff.
Here is code in c++ (compiled with gcc-4.7.2-11 and boost-1.53.0 and no additional flags (except -I-L-l)): #define BOOST_REGEX_MAX_STATE_COUNT std::ptrdiff_t(std::numeric_limits<std::ptrdiff_t>::max)
#include <iostream> #include <boost/regex.hpp> #include <string>
int main() {
boost::regex exp("[a-e]+[b-f]+[ac-f]+[abd-f]+[a-cef]+[a-df]+", boost::regex::perl); std::string text1 = "bbbbbbbccccccccccccccccccccccccbbbbbbbbbbbbbbbbcccccccccccccccccccccbbbbbbbbaaaaa"; std::string text2 = "bbbbbbbccccccccccccccccccccccccbbbbbbbbbbbbbbbbcccccccccccccccccccccbbbbbbbbaaaaae"; std::cout << BOOST_REGEX_MAX_STATE_COUNT << std::endl; std::cout << boost::regex_match(text1, exp) << std::endl; std::cout << boost::regex_match(text2, exp) << std::endl;
}
Here analog grep match: echo bbbbbbccccccccccccccccccccccccbbbbbbbbbbbbbbbbcccccccccccccccccccccbbbbbbbbaaaaae | grep "[a-e]\+[b-f]\+[ac-f]\+[abd-f]\+[a-cef]\+[a-df]\+"
And analog perl code: if ("bbbbbbbccccccccccccccccccccccccbbbbbbbbbbbbbbbbcccccccccccccccccccccbbbbbbbbaaaaa" =~ m/[a-e]+[b-f]+[ac-f]+[abd-f]+[a-cef]+[a-df]+$/) {
print "MATCHED\n";
} else {
print "NOT MATCHED\n";
} if ("bbbbbbbccccccccccccccccccccccccbbbbbbbbbbbbbbbbcccccccccccccccccccccbbbbbbbbaaaaae" =~ m/[a-e]+[b-f]+[ac-f]+[abd-f]+[a-cef]+[a-df]+$/) {
print "MATCHED\n";
} else {
print "NOT MATCHED\n";
}
Change History (7)
comment:1 by , 10 years ago
Component: | None → regex |
---|---|
Owner: | set to |
comment:2 by , 10 years ago
Resolution: | → wontfix |
---|---|
Status: | new → closed |
comment:3 by , 10 years ago
I checked your example with grep and it had no problem with matching the largest matching substring. (There wasn't even any delay).
I understand, that for the purpose of limiting the time taken for regex matching/searching a developer can set a BOOST_REGEX_MAX_STATE_COUNT as he wishes. I would expect though, that regex matching will take O(BOOST_REGEX_MAX_STATE_COUNT * exp(size_of_regex) ) (128n in the example), which is not the case.
I believe there is a bug with counting number of states in this example and the behaviour is not as it is described.
follow-ups: 6 7 comment:5 by , 10 years ago
I checked your example with grep and it had no problem with matching the largest matching substring. (There wasn't even any delay).
Grep is a completely different beast: it's a DFA meaning the complexity of matching is O(N). Perl compatible engines are NFA's, and in the worst case the complexity of obtaining a match is NP-complete, so something like O(N!) for an N-character string.
I understand, that for the purpose of limiting the time taken for regex matching/searching a developer can set a BOOST_REGEX_MAX_STATE_COUNT as he wishes. I would expect though, that regex matching will take O(BOOST_REGEX_MAX_STATE_COUNT * exp(size_of_regex) ) (128n in the example), which is not the case.
No it's much worst than that as mentioned above. NP-complete in the worst case (which strangely this nearly is).
I believe there is a bug with counting number of states in this example and the behaviour is not as it is described.
It's not states in the machine that matter - it's how many are visited while trying to find the match. The maximum is set to the lowest of:
- BOOST_REGEX_MAX_STATE_COUNT
- N2 for an N character string.
- A lower limit k, which is hard coded to 100000.
You can change this behavour by editing estimate_max_state_count
in perl_matcher_common.hpp.
comment:6 by , 10 years ago
Replying to anonymous:
I checked your example with grep and it had no problem with matching the largest matching substring. (There wasn't even any delay).
Grep is a completely different beast: it's a DFA meaning the complexity of matching is O(N). Perl compatible engines are NFA's, and in the worst case the complexity of obtaining a match is NP-complete, so something like O(N!) for an N-character string.
I understand, that for the purpose of limiting the time taken for regex matching/searching a developer can set a BOOST_REGEX_MAX_STATE_COUNT as he wishes. I would expect though, that regex matching will take O(BOOST_REGEX_MAX_STATE_COUNT * exp(size_of_regex) ) (128n in the example), which is not the case.
No it's much worst than that as mentioned above. NP-complete in the worst case (which strangely this nearly is).
I believe there is a bug with counting number of states in this example and the behaviour is not as it is described.
It's not states in the machine that matter - it's how many are visited while trying to find the match. The maximum is set to the lowest of:
- BOOST_REGEX_MAX_STATE_COUNT
- N2 for an N character string.
- A lower limit k, which is hard coded to 100000.
You can change this behavour by editing
estimate_max_state_count
in perl_matcher_common.hpp.
My estimation have been N * 2number_of_DFA_states, so for your example it would be 56448 (even less then hardcoded), which could be calculated in no time. I would never expect the regex matching to be dependent on text size in any other way then linear.
There is this article about comparing perl regexes to the awk and grep approach: http:\/\/ swtch.com \/ ~rsc \/ regexp \/ regexp1.html
Clearly I expected a Thompson approach.
comment:7 by , 10 years ago
Replying to anonymous:
I checked your example with grep and it had no problem with matching the largest matching substring. (There wasn't even any delay).
Grep is a completely different beast: it's a DFA meaning the complexity of matching is O(N). Perl compatible engines are NFA's, and in the worst case the complexity of obtaining a match is NP-complete, so something like O(N!) for an N-character string.
That isn't really true. NFA matching is in P. The problem is that a Perl regular expression cannot necessarily be represented by an NFA. The representation is similar to an NFA, but it isn't really one.
This is not nearly so clear cut as you think, both Boost.Regex and Perl are backtracking NFA's and if I make the string being matched slightly longer then Perl takes several minutes to figure out that the string can't be matched:
Make the string being matched longer still and Perl will go into an effectively infinite loop.
Boost.Regex takes the view that these pathological cases should be caught as soon as possible, and that's what you're seeing here. It's true that this behavior might not suite everyone all of the time, but it's also safer, and helps prevent deliberately "bad" regexes being used in DOS attacks etc.