Opened 6 years ago

Closed 6 years ago

#12578 closed Bugs (fixed)

Crash with boost::filesystem's directory_iterator and recursive_directory_iterator

Reported by: anonymous Owned by: Beman Dawes
Milestone: To Be Determined Component: filesystem
Version: Boost 1.61.0 Severity: Problem
Keywords: Cc: mlimber@…

Description

I have a simple example using recursive_directory_iterator that crashes, but it also crashes with directory_iterator. I am aware from the documentation that "[t]he practical consequence of not preserving equality is that directory iterators can only be used for single-pass algorithms", but I don't see why my use here would invoke undefined behavior since I make a copy of a const iterator.

#include <boost/filesystem.hpp>
#include <boost/range/iterator_range.hpp>
#include <iostream>

namespace fs = boost::filesystem;

int main() 
{
  try
  {
    const auto iter = fs::recursive_directory_iterator{ "." };
    std::cout << std::distance( iter, {} ) << "\n";
    for( const auto& entry : boost::make_iterator_range( 
                               iter, // CRASH! 
                               {} ) )
    {
      std::cout << entry << "\n";
    }
  }
  catch( const std::exception& e )
  {
    std::cerr << e.what() << "\n";
  }
  catch( ... )
  {
    std::cerr << "Unknown exception.\n";
  }
}

This crashes on several compilers, Windows and Linux, and multiple versions of Boost.(a)

Note that iter is const. I can workaround this by (1) replacing iter(b) on the line marked "CRASH" with fs::recursive_directory_iterator{ "." } or (2) deleting the call to std::distance() just before the for loop or by changing its arguments(c). Either of these have the effect of not advancing a copy of iter to the end.

I suspect this has something to do with the copy of iter made by std::distance() and boost::make_iterator_range() still being secretly connected to the internal state of iter, though it is a const iterator from which a deep copy was supposed to be made. It also happens with experimental::filesystem(d) and multiple versions of Boost (at least 1.61.0 and 1.55.0).

This same problem can also be seen in other scenarios such as iterating over a directory range twice(e) or any other use of a recursive directory iterator which has had a copy made and advanced to end.

I asked about this on the Boost Users list(f), but got no reply.

Links

  • (a) See, e.g., the crash on Coliru with gcc: coliru.stacked-crooked.com/a/15c4a731100d25f2
  • (b) coliru.stacked-crooked.com/a/b37f94ebc809fa65
  • (c) coliru.stacked-crooked.com/a/17277df962ab5989
  • (d) coliru.stacked-crooked.com/a/8e83f5c7e2adfe90
  • (e) coliru.stacked-crooked.com/a/0bfab95a513ebed8
  • (f) lists.boost.org/boost-users/2016/10/86840.php

Change History (13)

comment:1 by mlimber@…, 6 years ago

Component: Nonefilesystem
Owner: set to Beman Dawes

comment:2 by mlimber@…, 6 years ago

Cc: mlimber@… added

comment:3 by Charles Olivi <charles.olivi@…>, 6 years ago

Hi,

There is no defect, the std::distance calls iteratively the same operation on the single_pass_traversal_tag iterator, i.e. increment().

recursive_directory_iterator is shared_ptr implemented and increment() call pops one element from the iterator stack.

In the end, when you enter your loop, iter points to end() which makes the segfault you get.

comment:4 by mlimber@…, 6 years ago

I guessed that that is what was happening. Thanks for confirming.

I would suggest that if that is intentionally part of the design, it is quite a surprising choice and reflects at least a defect in the documentation, if not the design or code.

Moreover, if this is in fact intended, I do not understand the reasoning on why this is a good design choice. Prima facie, I would not expect a copy of an iterator to share state with the original (except that they are iterating over the same range). I would expect this function to behave the same regardless of whether I used boost::filesystem::directory_iterator or std::vector::const_iterator:

template<class Iter>
void print( const Iter begin, const Iter end )
{
    std::cout << std::distance( begin, end ) << "\n";
    for( const auto& entry : boost::make_iterator_range( begin, end ) )
    {
      std::cout << entry << "\n";
    }
}

But watch it fail here: coliru.stacked-crooked.com/a/ce85e7adb3c16bdc

If I remove the call to std::distance(), it works as expected. This, I contend, is a defect of some kind.

comment:5 by mlimber@…, 6 years ago

The design rationale appears to be this comment in the code for a member of directory_iterator:

  // shared_ptr provides shallow-copy semantics required for InputIterators.
  // m_imp.get()==0 indicates the end iterator.
  boost::shared_ptr< detail::dir_itr_imp >  m_imp;

While the single-pass requirement is clear from the InputIterator concept, I don't see how it requires this shallow-copy behavior, which appears to me to be a defect in the design/code, as described above.

I took a closer look to see what it would take to change that shared_ptr to a unique_ptr (or similar) in order to move directory_iterator to deep-copy semantics, but detail::dir_itr_imp would also need them and that looked platform-specific and non-trivial.

I'd be willing to take a whack at creating a patch if you tell me it is a desired change, though I mainly work with mainstream platforms like Windows and POSIX/Linux and so couldn't easily test others.

comment:6 by Charles Olivi <charles.olivi@…>, 6 years ago

Hi,

Yep it's indeed the shared pointer that does the job for the single pass iterator used and described here: http://www.boost.org/doc/libs/1_55_0/doc/html/InputIterator.html

I'm not the maintainer of boost.org/filesystem, I'm pretty sure your fix will be discarded.

If you really want to do it with shared_ptr you probably need a copy constructor based on [ {std::move; std::make_shared} / boost::make_shared ] and operator= based on std::swap.

Thanks, Charles

comment:7 by mlimber@…, 6 years ago

Thanks, Charles.

I see now: InputIterator and single-pass behavior means a copy of an iterator invalidates any other input iterators to the underlying source, akin to std::istream_iterator. My intuition was that a directory_iterator would be a ForwardIterator, and honestly I didn't fully grasp the distinction between those categories until now.

Strictly speaking, there is no defect since this is documented, but (1) the documentation of this is too oblique, IMHO, and (2) I'm not clear on why directory iterators should not be ForwardIterators.

For (1), I would propose adjusting the wording of the notes on directory_iterator and recursive_directory_iterator to make this behavior more apparent, something like:

Note: The practical consequence of not preserving equality is that directory iterators can only be used for single-pass algorithms. Also, each copy of an iterator shares state with all the copies, so when one is incremented, they are all effectively incremented. --end note

For (2), I would think ForwardIterator would be the preferable concept here, so I'm not sure why InputIterator was chosen.

comment:8 by Charles Olivi <charles.olivi@…>, 6 years ago

Beware that if you copy iterator the origin is not invalidated, it's just that the shared_ptr increases its ref_count.

test the following code based on your example:

  • define a method that takes a fs::recursive_directory_iterator as parameter

void process_iter(fs::recursive_directory_iterator myIter) {

std::cout << *(myIter++) << std::endl;

}

  • and then in main:

int main(void) {

fs::path aPath(".");

fs::recursive_directory_iterator iter(aPath);

fs::recursive_directory_iterator end;

process_iter(iter);

for( const auto& entry : boost::make_iterator_range(iter, end) {

std::cout << entry;

}

return 0;

}

expected result is that process_iter call will pop the first element in current directory stream, successive calls in following loop will pop remaining elements, starting from the second one. in the end iter points to end().

Hope this clarifies

comment:9 by mlimber@…, 6 years ago

Ah, but but but.....

Your example still crashes if the folder you run in has one entry only, as here:

#include <boost/filesystem.hpp>
#include <boost/range/iterator_range.hpp>
#include <iostream>

namespace fs = boost::filesystem;

void process_iter(fs::recursive_directory_iterator myIter) 
{
  std::cout << *(myIter++) << std::endl;
}

int main() 
{
  try
  {
    // Move to a folder with only one entry
    fs::create_directories( "dir/dir2" );
    fs::current_path( fs::current_path() / "dir" );

    auto       iter = fs::recursive_directory_iterator{ "." };
    const auto end  = fs::recursive_directory_iterator{}; 
    process_iter( iter );
    
    std::cout << (iter != end) << std::endl;

    for( const auto& entry : boost::make_iterator_range( iter, end ) )
    // OR for( const auto& entry : iter )
    {
      std::cout << entry << "\n";
    }
    
    // OR
    // while( iter != end )
    // {
    //     std::cout << "Post: " << *iter << "\n";
    //     ++iter;
    // }
  }
  catch( ... )
  {
    std::cerr << "Exception.\n";
  }
}

See it fail: coliru.stacked-crooked.com/a/3c4e31bbef79cc6b

Note that we are NOT dereferencing iter after the call to process_iter(). The problem is that iter is itself invalidated once the myIter copy is advanced to the end. But as far as I can tell, there is NO WAY to tell whether iter has been invalidated or can still be used. Adding a print statement as I did after the call to process_iter() indicates that the iterator is not equal to end and is therefore valid, but that is wrong!

So at the very least, your suggested usage invokes undefined behavior.

Compare how istream_iterator behaves: coliru.stacked-crooked.com/a/e5988c4626f2c339

Is that lesson here that input iterators should be considered move-only, not copyable, because copies are too prone to error in untested circumstances (like having only one file, above)? Should that be codified in the standard or at least coding standards?

comment:10 by mlimber@…, 6 years ago

On further thought, perhaps the solution is that the directory_iterator should use some sort of observer pattern or weak reference rather than just a shared_ptr for its pointer to the implementation m_imp.

At present, the iterator depends on m_imp becoming null when the last directory entry has been reached (see the comment in the operations.hpp: "m_imp.get()==0 indicates the end iterator."), but in the face of shared ownership, one owner can get to end without the others hearing about it or being able to detect it. From boost_1_62_0/libs/filesystem/operations.cpp:

  void directory_iterator_increment(directory_iterator& it,
    system::error_code* ec)
  {
      // ...
      if (it.m_imp->handle == 0)  // eof, make end
      {
        it.m_imp.reset();
        return;
      }
      // ...
  }

So the current copy of the it gets its m_imp reset and is thus now equal to the end iterator, but this does not tell the other shared owners of m_imp that they should reset too. By the nature of shared_ptr, the reference count decreased but the object still lives for all other owners.

I personally would still prefer a copy-on-write or other deep-copy semantic (or making it a full-up forward iterator) but I understand that this may be difficult to impossible within the bounds of POSIX readdir_r() and other supported OS file systems.

To sum up, possible paths I see to address this are:

  1. Make directory iterators do a deep copy or COW.
  2. Make directory iterators a ForwardIterator (additional semantics beyond deep copy).
  3. Make directory iterators move-only rather than copyable. This could involve changing the concept of InputIterator (or making a new concept MoveableInputIterator so as to leave other input iterators undisturbed).
  4. Make directory iterator use some sort of weak ref or observer pattern to update all copies that they have been invalidated.
  5. Update the documentation to indicate the pitfalls more clearly and suggest best practices for coding standards to adopt.

I'm willing to lend a hand, but I don't want to go down a path that you all know would be rejected in principle. What is the best option?

comment:11 by mlimber@…, 6 years ago

On further inspection, I have a proposed patch without doing any of the above. Instead, I change the equal() function of directory_iterator and recursive_directory_iterator to take account of what their shared owners may have done:

Old

New

  bool equal(const directory_iterator& rhs) const
    { return m_imp == rhs.m_imp; }
  bool equal(const directory_iterator& rhs) const
  { 
    const bool is_lhs_valid = m_imp && m_imp->handle;
    const bool is_rhs_valid = rhs.m_imp && rhs.m_imp->handle;
    return (is_lhs_valid && is_rhs_valid && m_imp == rhs.m_imp) 
      || (!is_lhs_valid && !is_rhs_valid);
  }
  bool equal(const recursive_directory_iterator& rhs) const
    { return m_imp == rhs.m_imp; }
  bool equal(const recursive_directory_iterator& rhs) const
  {
    const bool is_lhs_valid = m_imp && !m_imp->m_stack.empty();
    const bool is_rhs_valid = rhs.m_imp && !rhs.m_imp->m_stack.empty();
    return (is_lhs_valid && is_rhs_valid && m_imp == rhs.m_imp)
      || (!is_lhs_valid && !is_rhs_valid);
  }

This allows my problematic programs above to detect if the iterator has been invalidated by comparing it against the empty iterator.

Should I submit this as an official patch? (If so, what process should I go through to do so? I assume I'll have to add a unit test and validate existing tests as well.)

comment:12 by mlimber@…, 6 years ago

I created a pull request with unit test here: github.com/boostorg/filesystem/pull/37

comment:13 by Beman Dawes, 6 years ago

Resolution: fixed
Status: newclosed

Fixed in develop. Should make it into 1.63.

Thanks for the bug report, and thanks to Charles Olivi for helping with discussion of the issue.

Note: See TracTickets for help on using tickets.