Boost C++ Libraries: Ticket #8392: Too complex regex for boost::regex (perl, extended) isn't so complex https://svn.boost.org/trac10/ticket/8392 <p> Here I have regular expression: "[a-e]+[b-f]+[ac-f]+[abd-f]+[a-cef]+[a-df]+" </p> <p> This expression raises exception: "The complexity of matching the regular expression exceeded predefined bounds. Try refactoring the regular expression to make each choice made by the state machine unambiguous. This exception is thrown to prevent "eternal" matches that take an indefinite period time to locate." </p> <p> Obviously perl regex and grep has no problem with this. According to my calculations this regex should use up to 2<sup>7 states, which is barely 128. However boost implementation states, that it exceeds maximum of ptrdiff. </sup></p> <p> Here is code in c++ (compiled with gcc-4.7.2-11 and boost-1.53.0 and no additional flags (except -I-L-l)): #define BOOST_REGEX_MAX_STATE_COUNT std::ptrdiff_t(std::numeric_limits&lt;std::ptrdiff_t&gt;::max) </p> <p> #include &lt;iostream&gt; #include &lt;boost/regex.hpp&gt; #include &lt;string&gt; </p> <p> int main() { </p> <blockquote> <p> boost::regex exp("[a-e]+[b-f]+[ac-f]+[abd-f]+[a-cef]+[a-df]+", boost::regex::perl); std::string text1 = "bbbbbbbccccccccccccccccccccccccbbbbbbbbbbbbbbbbcccccccccccccccccccccbbbbbbbbaaaaa"; std::string text2 = "bbbbbbbccccccccccccccccccccccccbbbbbbbbbbbbbbbbcccccccccccccccccccccbbbbbbbbaaaaae"; std::cout &lt;&lt; BOOST_REGEX_MAX_STATE_COUNT &lt;&lt; std::endl; std::cout &lt;&lt; boost::regex_match(text1, exp) &lt;&lt; std::endl; std::cout &lt;&lt; boost::regex_match(text2, exp) &lt;&lt; std::endl; </p> </blockquote> <p> } </p> <p> Here analog grep match: echo bbbbbbccccccccccccccccccccccccbbbbbbbbbbbbbbbbcccccccccccccccccccccbbbbbbbbaaaaae | grep "[a-e]\+[b-f]\+[ac-f]\+[abd-f]\+[a-cef]\+[a-df]\+" </p> <p> And analog perl code: if ("bbbbbbbccccccccccccccccccccccccbbbbbbbbbbbbbbbbcccccccccccccccccccccbbbbbbbbaaaaa" =~ m/<sup>[a-e]+[b-f]+[ac-f]+[abd-f]+[a-cef]+[a-df]+$/) { </sup></p> <blockquote> <p> print "MATCHED\n"; </p> </blockquote> <p> } else { </p> <blockquote> <p> print "NOT MATCHED\n"; </p> </blockquote> <p> } if ("bbbbbbbccccccccccccccccccccccccbbbbbbbbbbbbbbbbcccccccccccccccccccccbbbbbbbbaaaaae" =~ m/<sup>[a-e]+[b-f]+[ac-f]+[abd-f]+[a-cef]+[a-df]+$/) { </sup></p> <blockquote> <p> print "MATCHED\n"; </p> </blockquote> <p> } else { </p> <blockquote> <p> print "NOT MATCHED\n"; </p> </blockquote> <p> } </p> en-us Boost C++ Libraries /htdocs/site/boost.png https://svn.boost.org/trac10/ticket/8392 Trac 1.4.3 viboes Sun, 07 Apr 2013 18:27:59 GMT component changed; owner set https://svn.boost.org/trac10/ticket/8392#comment:1 https://svn.boost.org/trac10/ticket/8392#comment:1 <ul> <li><strong>owner</strong> set to <span class="trac-author">John Maddock</span> </li> <li><strong>component</strong> <span class="trac-field-old">None</span> → <span class="trac-field-new">regex</span> </li> </ul> Ticket John Maddock Sat, 20 Apr 2013 16:48:45 GMT status changed; resolution set https://svn.boost.org/trac10/ticket/8392#comment:2 https://svn.boost.org/trac10/ticket/8392#comment:2 <ul> <li><strong>status</strong> <span class="trac-field-old">new</span> → <span class="trac-field-new">closed</span> </li> <li><strong>resolution</strong> → <span class="trac-field-new">wontfix</span> </li> </ul> <p> This is not nearly so clear cut as you think, both Boost.Regex and Perl are backtracking NFA's and if I make the string being matched slightly longer then Perl takes several minutes to figure out that the string can't be matched: </p> <pre class="wiki"> if ("bbbbbbbccccccccccccccccccccccccbbbbbbbbbbbbbbbbcccccccccccccccccccccbbbbbbbbaaaaa" =~ m/[a-e]+[b-f]+[ac-f]+[abd-f]+[a-cef]+[a-df]+$/) { print "MATCHED\n"; } else { print "NOT MATCHED\n"; } if ("bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbcccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbaaaaae" =~ m/[a-e]+[b-f]+[ac-f]+[abd-f]+[a-cef]+[a-df]+$/) { print "MATCHED\n"; } else { print "NOT MATCHED\n"; } </pre><p> Make the string being matched longer still and Perl will go into an effectively infinite loop. </p> <p> Boost.Regex takes the view that these pathological cases should be caught as soon as possible, and that's what you're seeing here. It's true that this behavior might not suite everyone all of the time, but it's also safer, and helps prevent deliberately "bad" regexes being used in DOS attacks etc. </p> Ticket anonymous Sat, 20 Apr 2013 19:41:22 GMT <link>https://svn.boost.org/trac10/ticket/8392#comment:3 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8392#comment:3</guid> <description> <p> I checked your example with grep and it had no problem with matching the largest matching substring. (There wasn't even any delay). </p> <p> I understand, that for the purpose of limiting the time taken for regex matching/searching a developer can set a BOOST_REGEX_MAX_STATE_COUNT as he wishes. I would expect though, that regex matching will take O(BOOST_REGEX_MAX_STATE_COUNT * exp(size_of_regex) ) (128n in the example), which is not the case. </p> <p> I believe there is a bug with counting number of states in this example and the behaviour is not as it is described. </p> </description> <category>Ticket</category> </item> <item> <author>bkalinczuk@…</author> <pubDate>Sat, 20 Apr 2013 19:42:08 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8392#comment:4 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8392#comment:4</guid> <description> <p> forgot email :P </p> </description> <category>Ticket</category> </item> <item> <dc:creator>anonymous</dc:creator> <pubDate>Sun, 21 Apr 2013 11:00:25 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8392#comment:5 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8392#comment:5</guid> <description> <blockquote class="citation"> <p> I checked your example with grep and it had no problem with matching the largest matching substring. (There wasn't even any delay). </p> </blockquote> <p> Grep is a completely different beast: it's a DFA meaning the complexity of matching is O(N). Perl compatible engines are NFA's, and in the worst case the complexity of obtaining a match is NP-complete, so something like O(N!) for an N-character string. </p> <blockquote class="citation"> <p> I understand, that for the purpose of limiting the time taken for regex matching/searching a developer can set a BOOST_REGEX_MAX_STATE_COUNT as he wishes. I would expect though, that regex matching will take O(BOOST_REGEX_MAX_STATE_COUNT * exp(size_of_regex) ) (128n in the example), which is not the case. </p> </blockquote> <p> No it's much worst than that as mentioned above. NP-complete in the worst case (which strangely this nearly is). </p> <blockquote class="citation"> <p> I believe there is a bug with counting number of states in this example and the behaviour is not as it is described. </p> </blockquote> <p> It's not states in the machine that matter - it's how many are <em>visited</em> while trying to find the match. The maximum is set to the lowest of: </p> <ul><li>BOOST_REGEX_MAX_STATE_COUNT </li><li>N<sup>2</sup> for an N character string. </li><li>A lower limit k, which is hard coded to 100000. </li></ul><p> You can change this behavour by editing <code>estimate_max_state_count</code> in perl_matcher_common.hpp. </p> </description> <category>Ticket</category> </item> <item> <author>bkalinczuk@…</author> <pubDate>Sun, 21 Apr 2013 11:53:37 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8392#comment:6 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8392#comment:6</guid> <description> <p> Replying to <a class="ticket" href="https://svn.boost.org/trac10/ticket/8392#comment:5" title="Comment 5">anonymous</a>: </p> <blockquote class="citation"> <blockquote class="citation"> <p> I checked your example with grep and it had no problem with matching the largest matching substring. (There wasn't even any delay). </p> </blockquote> <p> Grep is a completely different beast: it's a DFA meaning the complexity of matching is O(N). Perl compatible engines are NFA's, and in the worst case the complexity of obtaining a match is NP-complete, so something like O(N!) for an N-character string. </p> <blockquote class="citation"> <p> I understand, that for the purpose of limiting the time taken for regex matching/searching a developer can set a BOOST_REGEX_MAX_STATE_COUNT as he wishes. I would expect though, that regex matching will take O(BOOST_REGEX_MAX_STATE_COUNT * exp(size_of_regex) ) (128n in the example), which is not the case. </p> </blockquote> <p> No it's much worst than that as mentioned above. NP-complete in the worst case (which strangely this nearly is). </p> <blockquote class="citation"> <p> I believe there is a bug with counting number of states in this example and the behaviour is not as it is described. </p> </blockquote> <p> It's not states in the machine that matter - it's how many are <em>visited</em> while trying to find the match. The maximum is set to the lowest of: </p> <ul><li>BOOST_REGEX_MAX_STATE_COUNT </li><li>N<sup>2</sup> for an N character string. </li><li>A lower limit k, which is hard coded to 100000. </li></ul><p> You can change this behavour by editing <code>estimate_max_state_count</code> in perl_matcher_common.hpp. </p> </blockquote> <p> My estimation have been N * 2<sup>number_of_DFA_states, so for your example it would be 56448 (even less then hardcoded), which could be calculated in no time. I would never expect the regex matching to be dependent on text size in any other way then linear. </sup></p> <p> There is this article about comparing perl regexes to the awk and grep approach: http:\/\/ swtch.com \/ ~rsc \/ regexp \/ regexp1.html </p> <p> Clearly I expected a Thompson approach. </p> </description> <category>Ticket</category> </item> <item> <dc:creator>Steven Watanabe</dc:creator> <pubDate>Sun, 21 Apr 2013 17:17:35 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8392#comment:7 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8392#comment:7</guid> <description> <p> Replying to <a class="ticket" href="https://svn.boost.org/trac10/ticket/8392#comment:5" title="Comment 5">anonymous</a>: </p> <blockquote class="citation"> <blockquote class="citation"> <p> I checked your example with grep and it had no problem with matching the largest matching substring. (There wasn't even any delay). </p> </blockquote> <p> Grep is a completely different beast: it's a DFA meaning the complexity of matching is O(N). Perl compatible engines are NFA's, and in the worst case the complexity of obtaining a match is NP-complete, so something like O(N!) for an N-character string. </p> </blockquote> <p> That isn't really true. NFA matching is in P. The problem is that a Perl regular expression cannot necessarily be represented by an NFA. The representation is similar to an NFA, but it isn't really one. </p> </description> <category>Ticket</category> </item> </channel> </rss>