Opened 6 years ago

Closed 6 years ago

#12919 closed Bugs (invalid)

multi_index indexed with hashed_non_unique produces buffer overrun.

Reported by: Alex Huszagh <ahuszagh@…> Owned by: Joaquín M López Muñoz
Milestone: To Be Determined Component: multi_index
Version: Boost 1.63.0 Severity: Problem
Keywords: multi_index Cc: joaquin@…

Description

Using multi_index_container on a struct indexing using hashed_non_unique (hashing a std::string) can cause a buffer overrun in unchecked_rehash(size_type,hashed_non_unique_tag), typically culminating in a segfault.

The segment below, taken from lines 1351-1372 of boost/multi_index/hash_index.hpp, can iterate between two different nodes, causing x==end_ to never occur, causing a auto_space to be overrun, eventually leading to a segmentation fault.

Code highlighting:

    auto_space<
      std::size_t,allocator_type>       hashes(get_allocator(),size());
    auto_space<
      node_impl_pointer,allocator_type> node_ptrs(get_allocator(),size());
    std::size_t                         i=0;
    bool                                within_bucket=false;
     BOOST_TRY{
      for(;;++i){
        node_impl_pointer x=end_->prior();
        if(x==end_)break;

        /* only this can possibly throw */
        std::size_t h=hash_(key(node_type::from_impl(x)->value()));

        hashes.data()[i]=h;
        node_ptrs.data()[i]=x;
        std::pair<node_impl_pointer,bool> p=
          node_alg::unlink_last_group(end_);
        node_alg::link_range(
          p.first,x,buckets_cpy.at(buckets_cpy.position(h)),cpy_end);
        within_bucket=!(p.second);
      }
    }

The solution is quite simple, and is shown in the other implementation of unchecked_rehash(size_type,hashed_unique_tag), by ensuring the rehash does not exceed the number of elements. The proper implementation (possibly suboptimal) would look something like this:

Code highlighting:

    auto_space<
      std::size_t,allocator_type>       hashes(get_allocator(),size());
    auto_space<
      node_impl_pointer,allocator_type> node_ptrs(get_allocator(),size());
    std::size_t                         i=0, size_=size();
    bool                                within_bucket=false;
     BOOST_TRY{
      for(;i!=size_;++i){
        node_impl_pointer x=end_->prior();
        if(x==end_)break;

        /* only this can possibly throw */
        std::size_t h=hash_(key(node_type::from_impl(x)->value()));

        hashes.data()[i]=h;
        node_ptrs.data()[i]=x;
        std::pair<node_impl_pointer,bool> p=
          node_alg::unlink_last_group(end_);
        node_alg::link_range(
          p.first,x,buckets_cpy.at(buckets_cpy.position(h)),cpy_end);
        within_bucket=!(p.second);
      }
    }

As the data leading to this segfault is private, I currently do not have a minimal, verified example to reproduce the bug, however, I believe this is a rational explanation, with a straight-forward fix.

I've seen examples where an ~800 element buffer will continue until i > 15,000, culminating in a segfault. I will submit a pull request with the above diff.

Thanks, Alex

Attachments (1)

hashed_index.patch (743 bytes ) - added by Alex Huszagh <ahuszagh@…> 6 years ago.
Diff between old and new for bug fix

Download all attachments as: .zip

Change History (7)

by Alex Huszagh <ahuszagh@…>, 6 years ago

Attachment: hashed_index.patch added

Diff between old and new for bug fix

comment:1 by Joaquín M López Muñoz, 6 years ago

Hi Alex,

Thank you for reporting the problem. I can't just accept your patch because, even f it worked, it would be masking a more serious problem, namely that erasing elements from a hashed index does not eventually lead to an empty structure where end_->prior()==end_ --this is a fundamental property of the underyling data structure. So, I need to know more about the context where this is happening. I understand the data you work with is private, but could you at least show the hashing code for std::string you're using? I suspect there might be a problem with that.

comment:2 by Alex Huszagh <ahuszagh@…>, 6 years ago

Hi Joaquin,

I can email you private details about the data, most of the strings are concatenations of UniProt identifiers, a residue, and a protein position (of the form "P46406:C15", or "P46406:C15-P46406:C230", and as the data has not yet been published, I would rather not provide it on a public mailing list.

The hash function used is boost::hash (the default hash is), and I have noticed that changing the hash function (std::hash does not cause an issue), or slightly tweaking the input data (such as removing the ":" in the identifiers), will make a given test-case that normally overruns the container size on a rehash to run normally.

Just for the full, public bug tracker, I am using the multi_index characteristic of the container, the primary indexer is a hashed struct using boost::hash_combine, which includes an atypical data structure (a short, immutable, ordered set of pointers), and I have confirmed in all working cases the hash function is unique (there is not a single collision, and bucket loads are typically low).

One other thing is that my proposed patch also does not fix infinite iteration when I was attempting to calculate the bucket size of the hashed_non_unique index, so my patch seems to be woefully insufficient, but I am unsure how to continue.

I can provide a working test case privately via email.

Rudimentary Code Snippet

Note: This snippet is missing includes, some template specializations, and a few other things. It's meant as a general idea of the model, not as working code.

Code highlighting:

// this is identical to boost hash_combine_impl, just exported here
inline void hash_combine_impl(size_t& seed, size_t value)
{
    seed ^= value + 0x9e3779b9 + (seed<<6) + (seed>>2);
}


// this uses the same magic numbers as Python's frozenset hash, 
// for immutable set hashing
struct set_hash
{
    template <typename T>
    size_t operator(const T &t)
    {
        size_t h = 1927868237 * (t.size() + 1);
        size_t i;

        for (auto &v: t) {
            i = std::hash<decltype(v)>()(v);
            h ^= (i ^ (i << 16) ^ 89869747UL)  * 3644798167UL;
        }
        h = h * 69069U + 907133923UL;
        return h;
    }
};


struct my_item
{
    std::set<uintptr_t> pointers;
    std::set<uint32_t> positions;
    std::string value;
};


bool operator==(const my_item &lhs,
    const my_item &rhs)
{
    auto l = std::tie(lhs.pointers, lhs.positions, lhs.value);
    auto r = std::tie(rhs.pointers, rhs.positions, rhs.value);
    return l == r;
}


struct my_item_hash
{
    size_t operator(const my_item &i)
    {
        size_t seed = 0;
        hash_combine_impl(seed, set_hash()(i.pointers));
        hash_combine_impl(seed, set_hash()(i.positions));
        boost::hash_combine(seed, i.value);

        return seed;
    }
};


struct my_struct
{
    std::set<my_item, my_item_hash> set;
    std::string string;
    uint32_t count;
};


bool operator==(const my_struct &lhs,
    const my_struct &rhs)
{
    auto l = std::tie(lhs.set, lhs.string, lhs.count);
    auto r = std::tie(rhs.set, rhs.string, rhs.count);
    return l == r;
}


struct my_struct_hash
{
    size_t operator(const my_struct &s)
    {
        size_t seed = 0;
        // these following lines are identical to boost::hash_combine_impl
        hash_combine_impl(seed, set_hash()(s.set));
        boost::hash_combine(seed, s.string);
        boost::hash_combine(seed, s.count);

        return seed;
    }
};


typedef boost::multi_index_container<
    my_struct,
    boost::multi_index::indexed_by<
        boost::multi_index::hashed_unique<
            boost::multi_index::identity<my_struct>,
            my_struct_hash
        >,
        boost::multi_index::hashed_non_unique<
            boost::multi_index::tag<tags::my_tag>,
            boost::multi_index::const_mem_fun<my_struct, std::string, &my_struct::linkage>
        >
    >
> my_struct_memo;

Best, Alex

in reply to:  2 comment:3 by Alex Huszagh <ahuszagh@…>, 6 years ago

Sorry, I am realizing I cannot edit comments, I forgot a few small items in my code snippet:

Code highlighting:

namespace tags {
struct my_tag {};
}   // tags

// skip many lines
// ...

struct my_struct
{
    std::set<my_item, my_item_hash> set;
    std::string string;
    uint32_t count;

    std::string linkage() const;
};

// skip many lines
// ...

std::string my_struct::linkage() const
{
    // implementation not shown
}

comment:4 by Alex Huszagh <ahuszagh@…>, 6 years ago

For some reason, if I use only std::hash, or only boost::hash for primitives, everything works fine. This might be some weird magic in magic numbers. I will be running a lot more test cases (and fuzz with random data) to see if I can actually create a working example, if not, please close this.

comment:5 by Joaquín M López Muñoz, 6 years ago

std::set<my_item, my_item_hash> set;

This looks very weird to me: std::set expects a strict weak order-inducing binary predicate as its second template parameter, yet you are passing a hash functor. How can this even compile?

comment:6 by Joaquín M López Muñoz, 6 years ago

Resolution: invalid
Status: newclosed

Closing due to lack of further feedback from OP.

Note: See TracTickets for help on using tickets.