Boost C++ Libraries: Ticket #11776: Effective way to find all regex matches in large file https://svn.boost.org/trac10/ticket/11776 <p> Finding all regexes in a file via boost::regex_iterator is a very complicated task as you can normally not load the whole file into a buffer (could be too large). </p> <p> A possible solution is presented in the documentation of regular expressions in section <a href="http://www.boost.org/doc/libs/1_59_0/libs/regex/doc/html/boost_regex/partial_matches.html">Partial Matches</a>, see the second example. </p> <p> Unfortunately, it is not correct: Consider a file with content "12abc", a regex "[a-z]+", and a buffer size of 4. This would result in the matches ab and c, but should be abc. The first match is not partial and touches the end of the buffer. Increasing the buffer size does not solve the problem in general, and with more complex regexes it even gets worse. Another example: same as earlier except with regex "[a-z]{2,}" (i. e. words with at least two letters), what results in one match ab, but should be abc. </p> <p> The easiest solution seems to be to add a new match flag (“range_incomplete” or “input_incomplete” (?)), that checks if the beginning of the current match and the end of the buffer build a partial or full match. In that case this “match” should be marked to the user as possibly incomplete (e. g. by the already existing member sub_match::matched). There probably exist better solutions. </p> <p> If you do not want to or cannot follow this feature request, I ask you at least to update the discussed 2nd example in the partial matches section. Thanks! </p> en-us Boost C++ Libraries /htdocs/site/boost.png https://svn.boost.org/trac10/ticket/11776 Trac 1.4.3 John Maddock Sat, 12 Mar 2016 19:03:36 GMT status changed; resolution set https://svn.boost.org/trac10/ticket/11776#comment:1 https://svn.boost.org/trac10/ticket/11776#comment:1 <ul> <li><strong>status</strong> <span class="trac-field-old">new</span> → <span class="trac-field-new">closed</span> </li> <li><strong>resolution</strong> → <span class="trac-field-new">wontfix</span> </li> </ul> <p> Documentation updated. </p> <p> Note that this particular case can be detected by checking whether the end of the match is at the end of the current string. However, there are expression such as "abc.*123" that actually need the whole input string (no matter how long it may be) to determine the longest match. </p> Ticket der-storch-85@… Tue, 15 Mar 2016 17:00:05 GMT attachment set https://svn.boost.org/trac10/ticket/11776 https://svn.boost.org/trac10/ticket/11776 <ul> <li><strong>attachment</strong> → <span class="trac-field-new">main.cpp</span> </li> </ul> <p> Correctly finding all regexes in stream (e. g. file) of unknown size by workaround </p> Ticket der-storch-85@… Tue, 15 Mar 2016 17:06:50 GMT status, summary changed; resolution deleted https://svn.boost.org/trac10/ticket/11776#comment:2 https://svn.boost.org/trac10/ticket/11776#comment:2 <ul> <li><strong>status</strong> <span class="trac-field-old">closed</span> → <span class="trac-field-new">reopened</span> </li> <li><strong>resolution</strong> <span class="trac-field-deleted">wontfix</span> </li> <li><strong>summary</strong> <span class="trac-field-old">Need way to find all regex matches in large file</span> → <span class="trac-field-new">Effective way to find all regex matches in large file</span> </li> </ul> <p> Replying to <a class="ticket" href="https://svn.boost.org/trac10/ticket/11776#comment:1" title="Comment 1">johnmaddock</a>: </p> <blockquote class="citation"> <p> Documentation updated. </p> </blockquote> <p> Great. But, as I wrote, an additional update of <a href="http://www.boost.org/doc/libs/1_59_0/libs/regex/doc/html/boost_regex/partial_matches.html">Partial Matches</a> is really necessary, because it’s one of the first answers when searching a solution for this special problem. For a (tortuous) workaround see attached file <code>main.cpp</code> that can be compiled with </p> <pre class="wiki">g++ main.cpp -std=c++11 -lboost_regex -o regex </pre><blockquote class="citation"> <p> Note that this particular case can be detected by checking whether the end of the match is at the end of the current string. </p> </blockquote> <p> In the <em>above very particular</em> case: yes, but normally: no. That would be to easy ;-) Example: buffer size 4, regex "(ab)+", and input 1abab2. </p> <blockquote class="citation"> <p> However, there are expression such as "abc.*123" that actually need the whole input string (no matter how long it may be) to determine the longest match. </p> </blockquote> <p> Correct, you have to keep that in mind, especially when parsing large files. Nevertheless: Erroneous or senseless regexes lead to unwanted or erroneous program executions. </p> Ticket Dr. Robert van Engelen <engelen@…> Wed, 23 Nov 2016 16:13:11 GMT <link>https://svn.boost.org/trac10/ticket/11776#comment:3 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/11776#comment:3</guid> <description> <p> I'm working on several C++ projects that require searching and scanning large text files. I fully agree with the OP that the <code>partial_match</code> implementation is broken. This is very unfortunate, because <code>partial_match</code> provides the only possible mechanism to search streaming input text without buffering the entire text. To restrict the regex to simple forms that do not include repetitions (*, +) is not a viable workaround for these projects (such as the RE/flex <a class="ext-link" href="https://sourceforge.net/projects/re-flex"><span class="icon">​</span>https://sourceforge.net/projects/re-flex</a> project). There are use cases in which we must take interactive input (i.e. buffering one char at a time) or take large files in which the pattern searched may not fit in the current buffer allocated, thus not producing the longest match, and worse we don't know if the buffer must be enlarged to continue iterating to find the longest match. </p> <p> The correct algorithm should consider that <strong>as long as backtracking on a repetition pattern in the regex is still possible given some partial input text, Boost.Regex should flag the result as a partial match instead of a full match.</strong>. With this change, matching "<code>abc.*123</code>" may require the whole input, but that is OK! </p> <p> I added the following note to my project documentation until this problem is resolved: </p> <pre class="wiki"> Boost.Regex may fail to find the longest match when repetition patterns such as `.*` are used. Under certain conditions greedy repetitions may behave as lazy repetitions. For example, the Boost.Regex engine may return the short match `abc` when the regex `a.*c` is applied to `abc abc`, instead of returning the full match `abc abc`. The problem is caused by the limitations of Boost.Regex `match_partial` matching algorithm. To work around this limitation, we suggest to make the repetition pattern as specific as possible and not overlap with the pattern that follows the repetition. *The best solution is to read the entire input, which the RE/flex scanner generator does if the input file length can be determined* (but this won't always work, e.g. with interactive TTY input). We consider this Boost.Regex partial match behavior a bug, not a restriction, because *as long as backtracking on a repetition pattern is possible given some partial text, Boost.Regex should flag the result as a partial match instead of a full match.* Otherwise, it is impossible to reliably use Boost.Regex partial matching. </pre><p> Looking forward to have a resolution on this issue. I will be happy to help out. </p> </description> <category>Ticket</category> </item> </channel> </rss>