Opened 7 years ago

Last modified 6 years ago

#11776 reopened Feature Requests

Effective way to find all regex matches in large file

Reported by: der-storch-85@… Owned by: John Maddock
Milestone: To Be Determined Component: regex
Version: Boost 1.59.0 Severity: Optimization
Keywords: Cc:

Description

Finding all regexes in a file via boost::regex_iterator is a very complicated task as you can normally not load the whole file into a buffer (could be too large).

A possible solution is presented in the documentation of regular expressions in section Partial Matches, see the second example.

Unfortunately, it is not correct: Consider a file with content "12abc", a regex "[a-z]+", and a buffer size of 4. This would result in the matches ab and c, but should be abc. The first match is not partial and touches the end of the buffer. Increasing the buffer size does not solve the problem in general, and with more complex regexes it even gets worse. Another example: same as earlier except with regex "[a-z]{2,}" (i. e. words with at least two letters), what results in one match ab, but should be abc.

The easiest solution seems to be to add a new match flag (“range_incomplete” or “input_incomplete” (?)), that checks if the beginning of the current match and the end of the buffer build a partial or full match. In that case this “match” should be marked to the user as possibly incomplete (e. g. by the already existing member sub_match::matched). There probably exist better solutions.

If you do not want to or cannot follow this feature request, I ask you at least to update the discussed 2nd example in the partial matches section. Thanks!

Attachments (1)

main.cpp (2.8 KB ) - added by der-storch-85@… 7 years ago.
Correctly finding all regexes in stream (e. g. file) of unknown size by workaround

Download all attachments as: .zip

Change History (4)

comment:1 by John Maddock, 7 years ago

Resolution: wontfix
Status: newclosed

Documentation updated.

Note that this particular case can be detected by checking whether the end of the match is at the end of the current string. However, there are expression such as "abc.*123" that actually need the whole input string (no matter how long it may be) to determine the longest match.

by der-storch-85@…, 7 years ago

Attachment: main.cpp added

Correctly finding all regexes in stream (e. g. file) of unknown size by workaround

in reply to:  1 comment:2 by der-storch-85@…, 7 years ago

Resolution: wontfix
Status: closedreopened
Summary: Need way to find all regex matches in large fileEffective way to find all regex matches in large file

Replying to johnmaddock:

Documentation updated.

Great. But, as I wrote, an additional update of Partial Matches is really necessary, because it’s one of the first answers when searching a solution for this special problem. For a (tortuous) workaround see attached file main.cpp that can be compiled with

g++ main.cpp -std=c++11 -lboost_regex -o regex

Note that this particular case can be detected by checking whether the end of the match is at the end of the current string.

In the above very particular case: yes, but normally: no. That would be to easy ;-) Example: buffer size 4, regex "(ab)+", and input 1abab2.

However, there are expression such as "abc.*123" that actually need the whole input string (no matter how long it may be) to determine the longest match.

Correct, you have to keep that in mind, especially when parsing large files. Nevertheless: Erroneous or senseless regexes lead to unwanted or erroneous program executions.

comment:3 by Dr. Robert van Engelen <engelen@…>, 6 years ago

I'm working on several C++ projects that require searching and scanning large text files. I fully agree with the OP that the partial_match implementation is broken. This is very unfortunate, because partial_match provides the only possible mechanism to search streaming input text without buffering the entire text. To restrict the regex to simple forms that do not include repetitions (*, +) is not a viable workaround for these projects (such as the RE/flex https://sourceforge.net/projects/re-flex project). There are use cases in which we must take interactive input (i.e. buffering one char at a time) or take large files in which the pattern searched may not fit in the current buffer allocated, thus not producing the longest match, and worse we don't know if the buffer must be enlarged to continue iterating to find the longest match.

The correct algorithm should consider that as long as backtracking on a repetition pattern in the regex is still possible given some partial input text, Boost.Regex should flag the result as a partial match instead of a full match.. With this change, matching "abc.*123" may require the whole input, but that is OK!

I added the following note to my project documentation until this problem is resolved:

  Boost.Regex may fail to find the longest match when repetition patterns such
  as `.*` are used.  Under certain conditions greedy repetitions may behave as
  lazy repetitions.  For example, the Boost.Regex engine may return the short
  match `abc` when the regex `a.*c` is applied to `abc abc`, instead of
  returning the full match `abc abc`.  The problem is caused by the limitations
  of Boost.Regex `match_partial` matching algorithm.  To work around this
  limitation, we suggest to make the repetition pattern as specific as possible
  and not overlap with the pattern that follows the repetition.  *The best
  solution is to read the entire input, which the RE/flex scanner generator
  does if the input file length can be determined* (but this won't always work,
  e.g. with interactive TTY input).  We consider this Boost.Regex partial match
  behavior a bug, not a restriction, because *as long as backtracking on a
  repetition pattern is possible given some partial text, Boost.Regex should
  flag the result as a partial match instead of a full match.*  Otherwise, it
  is impossible to reliably use Boost.Regex partial matching.

Looking forward to have a resolution on this issue. I will be happy to help out.

Note: See TracTickets for help on using tickets.