Opened 6 years ago
Closed 6 years ago
#12919 closed Bugs (invalid)
multi_index indexed with hashed_non_unique produces buffer overrun.
Reported by: | Owned by: | Joaquín M López Muñoz | |
---|---|---|---|
Milestone: | To Be Determined | Component: | multi_index |
Version: | Boost 1.63.0 | Severity: | Problem |
Keywords: | multi_index | Cc: | joaquin@… |
Description
Using multi_index_container on a struct indexing using hashed_non_unique (hashing a std::string) can cause a buffer overrun in unchecked_rehash(size_type,hashed_non_unique_tag), typically culminating in a segfault.
The segment below, taken from lines 1351-1372 of boost/multi_index/hash_index.hpp, can iterate between two different nodes, causing x==end_
to never occur, causing a auto_space to be overrun, eventually leading to a segmentation fault.
Code highlighting:
auto_space< std::size_t,allocator_type> hashes(get_allocator(),size()); auto_space< node_impl_pointer,allocator_type> node_ptrs(get_allocator(),size()); std::size_t i=0; bool within_bucket=false; BOOST_TRY{ for(;;++i){ node_impl_pointer x=end_->prior(); if(x==end_)break; /* only this can possibly throw */ std::size_t h=hash_(key(node_type::from_impl(x)->value())); hashes.data()[i]=h; node_ptrs.data()[i]=x; std::pair<node_impl_pointer,bool> p= node_alg::unlink_last_group(end_); node_alg::link_range( p.first,x,buckets_cpy.at(buckets_cpy.position(h)),cpy_end); within_bucket=!(p.second); } }
The solution is quite simple, and is shown in the other implementation of unchecked_rehash(size_type,hashed_unique_tag), by ensuring the rehash does not exceed the number of elements. The proper implementation (possibly suboptimal) would look something like this:
Code highlighting:
auto_space< std::size_t,allocator_type> hashes(get_allocator(),size()); auto_space< node_impl_pointer,allocator_type> node_ptrs(get_allocator(),size()); std::size_t i=0, size_=size(); bool within_bucket=false; BOOST_TRY{ for(;i!=size_;++i){ node_impl_pointer x=end_->prior(); if(x==end_)break; /* only this can possibly throw */ std::size_t h=hash_(key(node_type::from_impl(x)->value())); hashes.data()[i]=h; node_ptrs.data()[i]=x; std::pair<node_impl_pointer,bool> p= node_alg::unlink_last_group(end_); node_alg::link_range( p.first,x,buckets_cpy.at(buckets_cpy.position(h)),cpy_end); within_bucket=!(p.second); } }
As the data leading to this segfault is private, I currently do not have a minimal, verified example to reproduce the bug, however, I believe this is a rational explanation, with a straight-forward fix.
I've seen examples where an ~800 element buffer will continue until i > 15,000, culminating in a segfault. I will submit a pull request with the above diff.
Thanks, Alex
Attachments (1)
Change History (7)
by , 6 years ago
Attachment: | hashed_index.patch added |
---|
comment:1 by , 6 years ago
Hi Alex,
Thank you for reporting the problem. I can't just accept your patch because, even f it worked, it would be masking a more serious problem, namely that erasing elements from a hashed index does not eventually lead to an empty structure where end_->prior()==end_
--this is a fundamental property of the underyling data structure. So, I need to know more about the context where this is happening. I understand the data you work with is private, but could you at least show the hashing code for std::string
you're using? I suspect there might be a problem with that.
follow-up: 3 comment:2 by , 6 years ago
Hi Joaquin,
I can email you private details about the data, most of the strings are concatenations of UniProt identifiers, a residue, and a protein position (of the form "P46406:C15", or "P46406:C15-P46406:C230", and as the data has not yet been published, I would rather not provide it on a public mailing list.
The hash function used is boost::hash (the default hash is), and I have noticed that changing the hash function (std::hash does not cause an issue), or slightly tweaking the input data (such as removing the ":" in the identifiers), will make a given test-case that normally overruns the container size on a rehash to run normally.
Just for the full, public bug tracker, I am using the multi_index characteristic of the container, the primary indexer is a hashed struct using boost::hash_combine, which includes an atypical data structure (a short, immutable, ordered set of pointers), and I have confirmed in all working cases the hash function is unique (there is not a single collision, and bucket loads are typically low).
One other thing is that my proposed patch also does not fix infinite iteration when I was attempting to calculate the bucket size of the hashed_non_unique index, so my patch seems to be woefully insufficient, but I am unsure how to continue.
I can provide a working test case privately via email.
Rudimentary Code Snippet
Note: This snippet is missing includes, some template specializations, and a few other things. It's meant as a general idea of the model, not as working code.
Code highlighting:
// this is identical to boost hash_combine_impl, just exported here inline void hash_combine_impl(size_t& seed, size_t value) { seed ^= value + 0x9e3779b9 + (seed<<6) + (seed>>2); } // this uses the same magic numbers as Python's frozenset hash, // for immutable set hashing struct set_hash { template <typename T> size_t operator(const T &t) { size_t h = 1927868237 * (t.size() + 1); size_t i; for (auto &v: t) { i = std::hash<decltype(v)>()(v); h ^= (i ^ (i << 16) ^ 89869747UL) * 3644798167UL; } h = h * 69069U + 907133923UL; return h; } }; struct my_item { std::set<uintptr_t> pointers; std::set<uint32_t> positions; std::string value; }; bool operator==(const my_item &lhs, const my_item &rhs) { auto l = std::tie(lhs.pointers, lhs.positions, lhs.value); auto r = std::tie(rhs.pointers, rhs.positions, rhs.value); return l == r; } struct my_item_hash { size_t operator(const my_item &i) { size_t seed = 0; hash_combine_impl(seed, set_hash()(i.pointers)); hash_combine_impl(seed, set_hash()(i.positions)); boost::hash_combine(seed, i.value); return seed; } }; struct my_struct { std::set<my_item, my_item_hash> set; std::string string; uint32_t count; }; bool operator==(const my_struct &lhs, const my_struct &rhs) { auto l = std::tie(lhs.set, lhs.string, lhs.count); auto r = std::tie(rhs.set, rhs.string, rhs.count); return l == r; } struct my_struct_hash { size_t operator(const my_struct &s) { size_t seed = 0; // these following lines are identical to boost::hash_combine_impl hash_combine_impl(seed, set_hash()(s.set)); boost::hash_combine(seed, s.string); boost::hash_combine(seed, s.count); return seed; } }; typedef boost::multi_index_container< my_struct, boost::multi_index::indexed_by< boost::multi_index::hashed_unique< boost::multi_index::identity<my_struct>, my_struct_hash >, boost::multi_index::hashed_non_unique< boost::multi_index::tag<tags::my_tag>, boost::multi_index::const_mem_fun<my_struct, std::string, &my_struct::linkage> > > > my_struct_memo;
Best, Alex
comment:3 by , 6 years ago
Sorry, I am realizing I cannot edit comments, I forgot a few small items in my code snippet:
Code highlighting:
namespace tags { struct my_tag {}; } // tags // skip many lines // ... struct my_struct { std::set<my_item, my_item_hash> set; std::string string; uint32_t count; std::string linkage() const; }; // skip many lines // ... std::string my_struct::linkage() const { // implementation not shown }
comment:4 by , 6 years ago
For some reason, if I use only std::hash, or only boost::hash for primitives, everything works fine. This might be some weird magic in magic numbers. I will be running a lot more test cases (and fuzz with random data) to see if I can actually create a working example, if not, please close this.
comment:5 by , 6 years ago
std::set<my_item, my_item_hash> set;
This looks very weird to me: std::set
expects a strict weak order-inducing binary predicate as its second template parameter, yet you are passing a hash functor. How can this even compile?
comment:6 by , 6 years ago
Resolution: | → invalid |
---|---|
Status: | new → closed |
Closing due to lack of further feedback from OP.
Diff between old and new for bug fix