Boost C++ Libraries: Ticket #11776: Effective way to find all regex matches in large file

status changed; resolution set

John Maddock — Sat, 12 Mar 2016 19:03:36 GMT

status new → closed
resolution → wontfix

Documentation updated.

Note that this particular case can be detected by checking whether the end of the match is at the end of the current string. However, there are expression such as "abc.*123" that actually need the whole input string (no matter how long it may be) to determine the longest match.

attachment set

der-storch-85@… — Tue, 15 Mar 2016 17:00:05 GMT

attachment → main.cpp

Correctly finding all regexes in stream (e. g. file) of unknown size by workaround

status, summary changed; resolution deleted

der-storch-85@… — Tue, 15 Mar 2016 17:06:50 GMT

status closed → reopened
resolution wontfix
summary Need way to find all regex matches in large file → Effective way to find all regex matches in large file

Replying to johnmaddock:

Documentation updated.

Great. But, as I wrote, an additional update of Partial Matches is really necessary, because it’s one of the first answers when searching a solution for this special problem. For a (tortuous) workaround see attached file main.cpp that can be compiled with

g++ main.cpp -std=c++11 -lboost_regex -o regex

Note that this particular case can be detected by checking whether the end of the match is at the end of the current string.

In the above very particular case: yes, but normally: no. That would be to easy ;-) Example: buffer size 4, regex "(ab)+", and input 1abab2.

However, there are expression such as "abc.*123" that actually need the whole input string (no matter how long it may be) to determine the longest match.

Correct, you have to keep that in mind, especially when parsing large files. Nevertheless: Erroneous or senseless regexes lead to unwanted or erroneous program executions.

Wed, 23 Nov 2016 16:13:11 GMT

I'm working on several C++ projects that require searching and scanning large text files. I fully agree with the OP that the partial_match implementation is broken. This is very unfortunate, because partial_match provides the only possible mechanism to search streaming input text without buffering the entire text. To restrict the regex to simple forms that do not include repetitions (*, +) is not a viable workaround for these projects (such as the RE/flex https://sourceforge.net/projects/re-flex project). There are use cases in which we must take interactive input (i.e. buffering one char at a time) or take large files in which the pattern searched may not fit in the current buffer allocated, thus not producing the longest match, and worse we don't know if the buffer must be enlarged to continue iterating to find the longest match.

The correct algorithm should consider that as long as backtracking on a repetition pattern in the regex is still possible given some partial input text, Boost.Regex should flag the result as a partial match instead of a full match.. With this change, matching "abc.*123" may require the whole input, but that is OK!

I added the following note to my project documentation until this problem is resolved:

  Boost.Regex may fail to find the longest match when repetition patterns such
  as `.*` are used.  Under certain conditions greedy repetitions may behave as
  lazy repetitions.  For example, the Boost.Regex engine may return the short
  match `abc` when the regex `a.*c` is applied to `abc abc`, instead of
  returning the full match `abc abc`.  The problem is caused by the limitations
  of Boost.Regex `match_partial` matching algorithm.  To work around this
  limitation, we suggest to make the repetition pattern as specific as possible
  and not overlap with the pattern that follows the repetition.  *The best
  solution is to read the entire input, which the RE/flex scanner generator
  does if the input file length can be determined* (but this won't always work,
  e.g. with interactive TTY input).  We consider this Boost.Regex partial match
  behavior a bug, not a restriction, because *as long as backtracking on a
  repetition pattern is possible given some partial text, Boost.Regex should
  flag the result as a partial match instead of a full match.*  Otherwise, it
  is impossible to reliably use Boost.Regex partial matching.

Looking forward to have a resolution on this issue. I will be happy to help out.