Opened 7 years ago
Last modified 6 years ago
#11776 reopened Feature Requests
Effective way to find all regex matches in large file
Reported by: | Owned by: | John Maddock | |
---|---|---|---|
Milestone: | To Be Determined | Component: | regex |
Version: | Boost 1.59.0 | Severity: | Optimization |
Keywords: | Cc: |
Description
Finding all regexes in a file via boost::regex_iterator is a very complicated task as you can normally not load the whole file into a buffer (could be too large).
A possible solution is presented in the documentation of regular expressions in section Partial Matches, see the second example.
Unfortunately, it is not correct: Consider a file with content "12abc", a regex "[a-z]+", and a buffer size of 4. This would result in the matches ab and c, but should be abc. The first match is not partial and touches the end of the buffer. Increasing the buffer size does not solve the problem in general, and with more complex regexes it even gets worse. Another example: same as earlier except with regex "[a-z]{2,}" (i. e. words with at least two letters), what results in one match ab, but should be abc.
The easiest solution seems to be to add a new match flag (“range_incomplete” or “input_incomplete” (?)), that checks if the beginning of the current match and the end of the buffer build a partial or full match. In that case this “match” should be marked to the user as possibly incomplete (e. g. by the already existing member sub_match::matched). There probably exist better solutions.
If you do not want to or cannot follow this feature request, I ask you at least to update the discussed 2nd example in the partial matches section. Thanks!
Attachments (1)
Change History (4)
follow-up: 2 comment:1 by , 7 years ago
Resolution: | → wontfix |
---|---|
Status: | new → closed |
by , 7 years ago
Correctly finding all regexes in stream (e. g. file) of unknown size by workaround
comment:2 by , 7 years ago
Resolution: | wontfix |
---|---|
Status: | closed → reopened |
Summary: | Need way to find all regex matches in large file → Effective way to find all regex matches in large file |
Replying to johnmaddock:
Documentation updated.
Great. But, as I wrote, an additional update of Partial Matches is really necessary, because it’s one of the first answers when searching a solution for this special problem. For a (tortuous) workaround see attached file main.cpp
that can be compiled with
g++ main.cpp -std=c++11 -lboost_regex -o regex
Note that this particular case can be detected by checking whether the end of the match is at the end of the current string.
In the above very particular case: yes, but normally: no. That would be to easy ;-) Example: buffer size 4, regex "(ab)+", and input 1abab2.
However, there are expression such as "abc.*123" that actually need the whole input string (no matter how long it may be) to determine the longest match.
Correct, you have to keep that in mind, especially when parsing large files. Nevertheless: Erroneous or senseless regexes lead to unwanted or erroneous program executions.
comment:3 by , 6 years ago
I'm working on several C++ projects that require searching and scanning large text files. I fully agree with the OP that the partial_match
implementation is broken. This is very unfortunate, because partial_match
provides the only possible mechanism to search streaming input text without buffering the entire text. To restrict the regex to simple forms that do not include repetitions (*, +) is not a viable workaround for these projects (such as the RE/flex https://sourceforge.net/projects/re-flex project). There are use cases in which we must take interactive input (i.e. buffering one char at a time) or take large files in which the pattern searched may not fit in the current buffer allocated, thus not producing the longest match, and worse we don't know if the buffer must be enlarged to continue iterating to find the longest match.
The correct algorithm should consider that as long as backtracking on a repetition pattern in the regex is still possible given some partial input text, Boost.Regex should flag the result as a partial match instead of a full match.. With this change, matching "abc.*123
" may require the whole input, but that is OK!
I added the following note to my project documentation until this problem is resolved:
Boost.Regex may fail to find the longest match when repetition patterns such as `.*` are used. Under certain conditions greedy repetitions may behave as lazy repetitions. For example, the Boost.Regex engine may return the short match `abc` when the regex `a.*c` is applied to `abc abc`, instead of returning the full match `abc abc`. The problem is caused by the limitations of Boost.Regex `match_partial` matching algorithm. To work around this limitation, we suggest to make the repetition pattern as specific as possible and not overlap with the pattern that follows the repetition. *The best solution is to read the entire input, which the RE/flex scanner generator does if the input file length can be determined* (but this won't always work, e.g. with interactive TTY input). We consider this Boost.Regex partial match behavior a bug, not a restriction, because *as long as backtracking on a repetition pattern is possible given some partial text, Boost.Regex should flag the result as a partial match instead of a full match.* Otherwise, it is impossible to reliably use Boost.Regex partial matching.
Looking forward to have a resolution on this issue. I will be happy to help out.
Documentation updated.
Note that this particular case can be detected by checking whether the end of the match is at the end of the current string. However, there are expression such as "abc.*123" that actually need the whole input string (no matter how long it may be) to determine the longest match.