Opened 10 years ago
Last modified 10 years ago
#7606 new Bugs
u32regex causes bus error
Reported by: | Owned by: | John Maddock | |
---|---|---|---|
Milestone: | To Be Determined | Component: | regex |
Version: | Boost 1.51.0 | Severity: | Problem |
Keywords: | Cc: |
Description
The Unicode regular expression
boost::make_u32regex ("[pq]\\.\\.[xy]");
causes a Bus Error. The same r.e. as a boost::regex is fine.
The r.e. "[pq]..[xy]" seems to be okay, so it looks like the repeated "\\." is at least part of the problem.
The following program crashes with a Bus Error on Solaris 10 with gcc 4.6.1 and Boost 1.51 (and all previous versions of Boost, as far as I can tell).
#include <boost/regex/icu.hpp>
int main (int, char **) {
  const boost::u32regex re = boost::make_u32regex ("[pq]\\.\\.[xy]");
  return 0;
}
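To make the difference between the two patterns concrete, here is a minimal sketch that uses std::regex as a stand-in. It does not exercise Boost.Regex or ICU at all, so it says nothing about the crash itself; it only illustrates what each pattern is meant to match. The helper name full_match is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if `text` matches `pattern` in full.
// Uses std::regex (ECMAScript syntax), NOT boost::u32regex.
inline bool full_match(const std::string& pattern, const std::string& text)
{
    assert(!pattern.empty());
    const std::regex re(pattern);
    return std::regex_match(text, re);
}

// Escaped dots match only literal '.' characters:
//   full_match("[pq]\\.\\.[xy]", "p..x")  -> true
//   full_match("[pq]\\.\\.[xy]", "pabx")  -> false
// Unescaped dots match any two characters:
//   full_match("[pq]..[xy]", "pabx")      -> true
```

The escaped form is the one reported to crash when built with boost::make_u32regex; the unescaped form is the one reported to work.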
Attachments (5)
Change History (19)
comment:2 by , 10 years ago
Just noticed I attached the wrong version of the file before (a version that didn't cause a Bus Error.) So I have now attached the version that does cause the Bus Error. At least you now have both versions.
Apologies for any confusion.
comment:3 by , 10 years ago
I'm unable to reproduce this on either Win32 or Ubuntu Linux with current SVN Trunk and ICU 49. Are you able to debug locally?
comment:5 by , 10 years ago
Built a debug version of libboost_regex. I'll attach a stack trace from gdb. If you'd like me to investigate further please give me some pointers as to what to look for!
comment:6 by , 10 years ago
Thanks for the stack trace. Unfortunately, that makes even less sense now :-(
Can you set a breakpoint (basic_regex.hpp:399) inside:
template <class InputIterator>
basic_regex(InputIterator arg_first, InputIterator arg_last, flag_type f = regex_constants::normal)
{
   typedef typename traits::string_type seq_type;
   seq_type a(arg_first, arg_last);
   if(a.size())
      assign(static_cast<const charT*>(&*a.begin()), static_cast<const charT*>(&*a.begin() + a.size()), f);
   else
      assign(static_cast<const charT*>(0), static_cast<const charT*>(0), f);
}
What are the contents of "a" after construction?
Any chance that your code is compiled using a compiler code page that results in the input string not being valid ASCII/UTF8?
Thanks, John.
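One way to rule out the code-page question is to check the bytes of the pattern directly. The following is a minimal sketch, assuming the pattern is held in a std::string; the helper names first_non_ascii and check_ticket_pattern are hypothetical and not part of Boost.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: index of the first byte outside 7-bit ASCII,
// or std::string::npos if every byte is plain ASCII.
inline std::size_t first_non_ascii(const std::string& s)
{
    for (std::size_t i = 0; i < s.size(); ++i)
        if (static_cast<unsigned char>(s[i]) > 0x7F)
            return i;
    return std::string::npos;
}

// Usage sketch: the pattern from this ticket is expected to be pure ASCII.
inline void check_ticket_pattern()
{
    const std::string p = "[pq]\\.\\.[xy]";
    assert(first_non_ascii(p) == std::string::npos);
}
```

If the compiler had translated the literal through an unexpected code page, a check like this would report the position of the first suspicious byte.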
comment:7 by , 10 years ago
Here you are. Not sure that this looks helpful. At the "if" statement following the construction of "a", execution took the true branch.
I'm not setting any compiler code page anywhere.
Breakpoint 2 at 0x15738: file /export/home/ashley/src/boost_1_51_0/boost/regex/v4/basic_regex.hpp, line 399.
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /export/home/ashley/tmp/regex
[Thread debugging using libthread_db enabled]
[New Thread 1 (LWP 1)]
[Switching to Thread 1 (LWP 1)]
Breakpoint 2, boost::basic_regex<int, boost::icu_regex_traits>::basic_regex<boost::u8_to_u32_iterator<char const*, int> > (
    this=0xffbffab0, arg_first=..., arg_last=..., f=0)
    at /export/home/ashley/src/boost_1_51_0/boost/regex/v4/basic_regex.hpp:400
400 {
(gdb) list
395 assign(p, f);
396 }
397
398 template <class InputIterator>
399 basic_regex(InputIterator arg_first, InputIterator arg_last, flag_type f = regex_constants::normal)
400 {
401 typedef typename traits::string_type seq_type;
402 seq_type a(arg_first, arg_last);
403 if(a.size())
404 assign(static_cast<const charT*>(&*a.begin()), static_cast<const charT*>(&*a.begin() + a.size()), f);
(gdb) n
402 seq_type a(arg_first, arg_last);
(gdb) n
403 if(a.size())
(gdb) print a
$1 = {<std::_Vector_base<int, std::allocator<int> >> = {
    _M_impl = {<std::allocator<int>> = {<__gnu_cxx::new_allocator<int>> = {<No data fields>}, <No data fields>},
      _M_start = 0x27cd0, _M_finish = 0x27d00, _M_end_of_storage = 0x27d00}}, <No data fields>}
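The raw vector dump does carry one piece of information: the distance between _M_start and _M_finish gives the element count. The sketch below works that out, assuming int is 4 bytes on this platform (an assumption, not something the gdb output states). The helper name element_count is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of elements between two raw addresses,
// given the size of one element in bytes.
inline std::size_t element_count(std::uintptr_t start,
                                 std::uintptr_t finish,
                                 std::size_t elem_size)
{
    assert(finish >= start && elem_size != 0);
    return static_cast<std::size_t>(finish - start) / elem_size;
}

// With the values printed by gdb and an assumed 4-byte int:
//   element_count(0x27cd0, 0x27d00, 4)  ->  0x30 / 4  ->  12
// The pattern [pq]\.\.[xy] is also 12 characters long.
```

Under that assumption the vector holds 12 code points, which matches the length of the pattern, so the size of "a" at least looks plausible even though its contents are not shown.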
comment:9 by , 10 years ago
You're right, doesn't really help :-(
Try:
#include <boost/regex/icu.hpp>
#include <iostream>
#include <string>

// Print each element of a container as a hexadecimal value.
template <class C>
void printout(const C& c)
{
   for(unsigned i = 0; i < c.size(); ++i)
      std::cout << std::hex << (int)c[i] << " ";
   std::cout << std::endl;
}

int main()
{
   using namespace boost;
   typedef u32regex::traits_type::string_type st;
   typedef boost::u8_to_u32_iterator<std::string::const_iterator, UChar32> conv_type;
   const std::string p = "[pq]\\.\\.[xy]";
   // Convert the UTF-8 pattern to UTF-32 the same way make_u32regex does.
   st t(conv_type(p.begin(), p.begin(), p.end()), conv_type(p.end(), p.begin(), p.end()));
   printout(p);
   printout(t);
   return 0;
}
Which should output:
5b 70 71 5d 5c 2e 5c 2e 5b 78 79 5d
5b 70 71 5d 5c 2e 5c 2e 5b 78 79 5d
Thanks! John.
comment:10 by , 10 years ago
It does indeed output
5b 70 71 5d 5c 2e 5c 2e 5b 78 79 5d
5b 70 71 5d 5c 2e 5c 2e 5b 78 79 5d
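The two dumps are identical because every character in the pattern is 7-bit ASCII, and ASCII bytes decode to code points of the same value. The following is a minimal sketch of that decoding step in plain C++, as a rough stand-in for what boost::u8_to_u32_iterator does; it is not Boost's implementation, it assumes well-formed input, and it does not reject overlong or surrogate encodings. The name decode_utf8 is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical decoder: turns well-formed UTF-8 into UTF-32 code points.
// Malformed input is only caught by assert, so this is for illustration.
inline std::vector<std::uint32_t> decode_utf8(const std::string& s)
{
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;    // number of continuation bytes
        std::uint32_t cp = lead;  // code point being assembled
        if (lead >= 0x80) {
            if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; }
            else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
            else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
            else assert(false && "invalid UTF-8 lead byte");
        }
        assert(i + extra < s.size());
        for (std::size_t j = 1; j <= extra; ++j) {
            const unsigned char c = static_cast<unsigned char>(s[i + j]);
            assert((c & 0xC0) == 0x80);  // must be a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

For the 12-byte pattern in this ticket the decoder returns 12 code points, each equal to the corresponding input byte, which is consistent with the output reported above.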
follow-up: 13 comment:12 by , 10 years ago
Some supplementary questions before we get too involved:
1) Is
regex e("[pq]\\.\\.[xy]");
OK?
2) Is:
wregex we(L"[pq]\\.\\.[xy]");
OK?
3) Are there multiple versions of ICU installed on this system? Any chance there's a mismatch between the headers included and libraries loaded, or between the version used when Boost was built, and the one used by the test program?
4) Do the regex tests run OK? To run, cd into libs/regex/test and do a "bjam toolset=sun". Assuming ICU is installed in the usual location, you should see a message at the start to say it's being used/tested.
Thanks again! John.
comment:13 by , 10 years ago
Apologies for the delay in doing the stuff you asked for. Life and work get in the way...
Replying to anonymous:
Some supplementary questions before we get too involved:
1) Is
regex e("[pq]\\.\\.[xy]");
Compiles and runs okay.
2) Is:
wregex we(L"[pq]\\.\\.[xy]");
Also compiles and runs okay.
3) Are there multiple versions of ICU installed on this system? Any chance there's a mismatch between the headers included and libraries loaded, or between the version used when Boost was built, and the one used by the test program?
There are multiple versions of the .so files, but as far as I can tell there is only one set of header files. I don't think this should be a problem.
4) Do the regex tests run OK? To run, cd into libs/regex/test and do a "bjam toolset=sun". Assuming ICU is installed in the usual location, you should see a message at the start to say it's being used/tested.
I'm using gcc to compile, so I did "bjam toolset=gcc". I'll attach the output separately. There were errors, but they are a bit hard to spot among the warnings and errors spat out by the compiler. I'll attach two files. The first is the output from running bjam the first time (rather a large file). The second is from running bjam again; it has less output, which hopefully makes it easier to spot the bus error from one of the tests.
Ashley.
by , 10 years ago
Attachment: regex-gcc-test.txt.bz2 added
Output from first run of bjam toolset=gcc
by , 10 years ago
Attachment: regex-gcc-test-2.txt added
Output from second run of bjam toolset=gcc
comment:14 by , 10 years ago
My turn to apologize for the delay - I blame Christmas! :)
Thanks for running the tests. They show the same issue as you reported in your test case, but I don't understand why it would work for wregex and not for u32regex :(
There must be some memory corruption/overrun going on, but it's going to be hard to diagnose by email! Is Valgrind available for that platform? If so, its output might help a lot; otherwise I guess I'll have to write a special instrumented version for you to test with.
Thanks, John.
Wiki formatting made a mess of the quoted program, so I have attached it.