Opened 12 years ago
Closed 12 years ago
#5021 closed Bugs (fixed)
io_service destructor hangs on Mac OS X
| Reported by: | | Owned by: | chris_kohlhoff |
|---|---|---|---|
| Milestone: | To Be Determined | Component: | asio |
| Version: | Boost 1.45.0 | Severity: | Problem |
| Keywords: | | Cc: | |
Description
Sometimes, when destroying an io_service object on Mac OS X, my application is locked indefinitely. I have the feeling this happens when I destroy the io_service object right at the time when something's wrong with my network connection.
I have attached Instruments, and it indicates that the process is stuck in the destructor of pipe_select_interrupter, in the call to close() (see the attached screenshot).
Is there something that can be done to avoid this?
Attachments (2)
Change History (26)
by , 12 years ago
Attachment: | BoostAsioBug.png added |
---|
comment:1 by , 12 years ago
FYI, this also seems to happen in healthy network conditions, and other users of our application have reported the same issue (also on Mac OS X), with the same profile trace.
comment:2 by , 12 years ago
Possibly a Mac OS bug, e.g. see http://bugs.python.org/issue7401.
What happens if you modify:
asio/detail/select_interrupter.hpp asio/detail/socket_select_interrupter.hpp asio/detail/impl/socket_select_interrupter.ipp
so that socket_select_interrupter is used on Mac OS X?
comment:3 by , 12 years ago
The test program in that report doesn't seem to exhibit the problem here, but the description of the problem sounds plausible.
Since it's hard for me to reproduce the problem, I changed our code to ensure that close() and write() are never called simultaneously, and will check with the user later if he still has the problem.
Thanks!
comment:4 by , 12 years ago
Several users still report the same problem after I changed the code to ensure that nothing is done on the socket from different threads simultaneously. Could this be a different issue from the one on the Python tracker?
comment:6 by , 12 years ago
Hi Chris,
Same problem it seems:
017 2659 Thread_1779307 DispatchQueue_1: com.apple.main-thread (serial)
018 2659 start
019 2659 main
020 2659 Swift::QtSwift::~QtSwift()
021 2659 Swift::BoostNetworkFactories::~BoostNetworkFactories()
022 2659 Swift::BoostIOServiceThread::~BoostIOServiceThread()
023 2659 boost::asio::io_service::~io_service()
024 2659 boost::asio::detail::service_registry::~service_registry()
025 2659 boost::asio::detail::service_registry::destroy(boost::asio::io_service::service*)
026 2659 boost::asio::detail::kqueue_reactor::~kqueue_reactor()
027 2659 boost::asio::detail::socket_select_interrupter::~socket_select_interrupter()
028 2659 boost::asio::detail::socket_ops::close(int, unsigned char&, bool, boost::system::error_code&)
029 2659 close
comment:7 by , 12 years ago
All I can suggest is that you change the select_interrupter's destructor (pipe or socket, whichever is easier to test) to call reset() before it closes the descriptors. Let me know if that makes any difference, thanks.
follow-up: 19 comment:9 by , 12 years ago
Well, I'm afraid that unless you can cut it down to a small test case there's little I can do.
BTW, you never mentioned which version of Mac OS X is involved. Is there any pattern there?
comment:10 by , 12 years ago
You might also want to see what happens if you disable kqueue (by building with BOOST_ASIO_DISABLE_KQUEUE defined).
comment:11 by , 12 years ago
Cutting it down will probably be hard, since I can't reproduce it myself. The OS X version is the latest (10.6.6, I believe).
Disabling kqueue seems to make the problem go away. Does this help anything?
comment:12 by , 12 years ago
Maybe useful: the program from the Python tracker also gets stuck in close() for the user who is reporting the problem. With the test program the process is not in an uninterruptible sleep, though, whereas with our application it is.
comment:13 by , 12 years ago
Hmmm, if it's kqueue related, perhaps the interrupter's descriptor needs to be explicitly removed from the kqueue before closing:
--- kqueue_reactor.ipp 24 Oct 2010 04:03:09 -0000 1.1.2.7
+++ kqueue_reactor.ipp 22 Jan 2011 10:52:47 -0000
@@ -54,6 +54,11 @@
 kqueue_reactor::~kqueue_reactor()
 {
+  struct kevent event;
+  BOOST_ASIO_KQUEUE_EV_SET(&event, interrupter_.read_descriptor(),
+      EVFILT_READ, EV_DELETE, 0, 0, &interrupter_);
+  ::kevent(kqueue_fd_, &event, 1, 0, 0, 0);
+
   close(kqueue_fd_);
 }
Just a stab in the dark really, but probably worth trying.
comment:15 by , 12 years ago
Sadly, I think that means you'll need to disable kqueue for the moment. It smells like it could be an OS bug. However, I don't know what support channels Apple provides for this sort of thing.
comment:16 by , 12 years ago
I fixed a destruction order problem in our application where a timer was being destructed after the io_service object was destroyed. Making sure the io_service stays alive until the last timer goes away seems to have made the loop disappear.
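For anyone hitting the same ordering problem, here is a minimal sketch of one way to guarantee that the io_service outlives its timers: declare it first in the owning class, since C++ destroys members in reverse order of declaration. The types fake_io_service, fake_timer and network_factories are hypothetical stand-ins made up for this illustration (they are not Boost.Asio classes and not our application's code); they only record when they are destroyed.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for an io_service; it only logs its own destruction.
struct fake_io_service
{
  std::vector<std::string>& log;
  explicit fake_io_service(std::vector<std::string>& l) : log(l) {}
  ~fake_io_service() { log.push_back("io_service destroyed"); }
};

// Hypothetical stand-in for a timer that refers to its io_service.
struct fake_timer
{
  fake_io_service& service;
  explicit fake_timer(fake_io_service& s) : service(s) {}
  // Touches the service on destruction, so the service must still be alive.
  ~fake_timer() { service.log.push_back("timer destroyed"); }
};

// Members are destroyed in reverse declaration order, so the service
// (declared first) is destroyed after the timer (declared second).
struct network_factories
{
  fake_io_service service;
  fake_timer timer;
  explicit network_factories(std::vector<std::string>& log)
    : service(log), timer(service) {}
};

// Builds and tears down the owner, returning the recorded destruction order.
std::vector<std::string> destruction_order()
{
  std::vector<std::string> log;
  {
    network_factories factories(log);
  }
  return log;
}
```

When ownership is shared through shared_ptr instead, the declaration order no longer decides this, so whoever holds the last reference to the timer also needs to keep the io_service alive.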
comment:17 by , 12 years ago
I have a similar issue in libtorrent on Mac OS X 10.6.5, built as 64 bit. I'm not sure what made this start to happen, but it appears to have started around the time I merged uTP support into trunk, which essentially means a lot more traffic (and events) over a single UDP socket. It seems to be related to how busy the process has been, as it is more likely to hang after running for a while (an hour or so). It hangs here (I'm on boost 1.44):
Call graph:
2674 libtorrent::session::~session()
2674 boost::shared_ptr<libtorrent::aux::session_impl>::~shared_ptr()
2674 boost::detail::shared_count::~shared_count()
2674 boost::detail::sp_counted_base::release()
2674 boost::detail::sp_counted_impl_p<libtorrent::aux::session_impl>::dispose()
2674 void boost::checked_delete<libtorrent::aux::session_impl>(libtorrent::aux::session_impl*)
2674 libtorrent::aux::session_impl::~session_impl()
2674 boost::asio::io_service::~io_service()
2674 boost::asio::detail::service_registry::~service_registry()
2674 boost::asio::detail::service_registry::destroy(boost::asio::io_service::service*)
2674 boost::asio::detail::kqueue_reactor::~kqueue_reactor()
2674 boost::asio::detail::pipe_select_interrupter::~pipe_select_interrupter()
2674 close
This is the last thread alive at this point, so I don't think it's related to multithreading.
It definitely seems like an OS bug to me. close() isn't ever supposed to hang indefinitely, right?
comment:18 by , 12 years ago
Arvid,
I can at least confirm that we didn't have reports of the problem anymore since we avoided calling close() after destroying the io_service object. If you're sure your session shared_ptr isn't accidentally kept alive longer than io_service (a shared_ptr pitfall we fell into), maybe your problem is slightly different.
comment:19 by , 12 years ago
Replying to chris_kohlhoff:
Well, I'm afraid that unless you can cut it down to a small test case there's little I can do.
BTW, you never mentioned which version of Mac OS X is involved. Is there any pattern there?
I've run into the same problem on Mac OS X 10.6.4
Attaching simplified repro code (repro.cpp); note in the comments there are a few places that seem to be crucial to reproducing the hang.
follow-up: 21 comment:20 by , 12 years ago
Thank you for the test case. I was able to reproduce the issue on several different Mac OS X 10.6 systems. It seems to be an OS bug triggered by the use of EV_ONESHOT. Please try the following diff to see if it fixes the problem for you, and doesn't cause any other problems. (Note that you may need to apply the diff by hand since it is made against the trunk.)
Index: boost/asio/detail/impl/kqueue_reactor.ipp
===================================================================
--- boost/asio/detail/impl/kqueue_reactor.ipp (revision 69227)
+++ boost/asio/detail/impl/kqueue_reactor.ipp (working copy)
@@ -47,9 +47,9 @@
     interrupter_(),
     shutdown_(false)
 {
-  // The interrupter is put into a permanently readable state. Whenever we
-  // want to interrupt the blocked kevent call we register a one-shot read
-  // operation against the descriptor.
+  // The interrupter is put into a permanently readable state. Whenever we want
+  // to interrupt the blocked kevent call we register a read operation against
+  // the descriptor.
   interrupter_.interrupt();
 }
@@ -108,15 +108,15 @@
   {
   case read_op:
     BOOST_ASIO_KQUEUE_EV_SET(&event, descriptor, EVFILT_READ,
-        EV_ADD | EV_ONESHOT, 0, 0, descriptor_data);
+        EV_ADD | EV_CLEAR, 0, 0, descriptor_data);
     break;
   case write_op:
     BOOST_ASIO_KQUEUE_EV_SET(&event, descriptor, EVFILT_WRITE,
-        EV_ADD | EV_ONESHOT, 0, 0, descriptor_data);
+        EV_ADD | EV_CLEAR, 0, 0, descriptor_data);
     break;
   case except_op:
     BOOST_ASIO_KQUEUE_EV_SET(&event, descriptor, EVFILT_READ,
-        EV_ADD | EV_ONESHOT, EV_OOBAND, 0, descriptor_data);
+        EV_ADD | EV_CLEAR, EV_OOBAND, 0, descriptor_data);
     break;
   }
   ::kevent(kqueue_fd_, &event, 1, 0, 0, 0);
@@ -170,17 +170,17 @@
   {
   case read_op:
     BOOST_ASIO_KQUEUE_EV_SET(&event, descriptor, EVFILT_READ,
-        EV_ADD | EV_ONESHOT, 0, 0, descriptor_data);
+        EV_ADD | EV_CLEAR, 0, 0, descriptor_data);
     break;
   case write_op:
     BOOST_ASIO_KQUEUE_EV_SET(&event, descriptor, EVFILT_WRITE,
-        EV_ADD | EV_ONESHOT, 0, 0, descriptor_data);
+        EV_ADD | EV_CLEAR, 0, 0, descriptor_data);
     break;
   case except_op:
     if (!descriptor_data->op_queue_[read_op].empty())
       return; // Already registered for read events.
     BOOST_ASIO_KQUEUE_EV_SET(&event, descriptor, EVFILT_READ,
-        EV_ADD | EV_ONESHOT, EV_OOBAND, 0, descriptor_data);
+        EV_ADD | EV_CLEAR, EV_OOBAND, 0, descriptor_data);
     break;
   }
@@ -290,7 +290,7 @@
       if (ptr == &interrupter_)
       {
         // No need to reset the interrupter since we're leaving the descriptor
-        // in a ready-to-read state and relying on one-shot notifications.
+        // in a ready-to-read state and relying on edge-triggered notifications.
       }
       else
       {
@@ -339,17 +339,17 @@
           case EVFILT_READ:
             if (!descriptor_data->op_queue_[read_op].empty())
               BOOST_ASIO_KQUEUE_EV_SET(&event, descriptor, EVFILT_READ,
-                  EV_ADD | EV_ONESHOT, 0, 0, descriptor_data);
+                  EV_ADD | EV_CLEAR, 0, 0, descriptor_data);
             else if (!descriptor_data->op_queue_[except_op].empty())
               BOOST_ASIO_KQUEUE_EV_SET(&event, descriptor, EVFILT_READ,
-                  EV_ADD | EV_ONESHOT, EV_OOBAND, 0, descriptor_data);
+                  EV_ADD | EV_CLEAR, EV_OOBAND, 0, descriptor_data);
             else
               continue;
             break;
           case EVFILT_WRITE:
             if (!descriptor_data->op_queue_[write_op].empty())
               BOOST_ASIO_KQUEUE_EV_SET(&event, descriptor, EVFILT_WRITE,
-                  EV_ADD | EV_ONESHOT, 0, 0, descriptor_data);
+                  EV_ADD | EV_CLEAR, 0, 0, descriptor_data);
             else
               continue;
             break;
@@ -381,7 +381,7 @@
 {
   struct kevent event;
   BOOST_ASIO_KQUEUE_EV_SET(&event, interrupter_.read_descriptor(),
-      EVFILT_READ, EV_ADD | EV_ONESHOT, 0, 0, &interrupter_);
+      EVFILT_READ, EV_ADD | EV_CLEAR, 0, 0, &interrupter_);
   ::kevent(kqueue_fd_, &event, 1, 0, 0, 0);
 }
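For readers unfamiliar with the flag the diff switches to, here is a minimal standalone sketch of an EV_ADD | EV_CLEAR registration on a pipe, written against the plain EV_SET macro from <sys/event.h> rather than Asio's BOOST_ASIO_KQUEUE_EV_SET wrapper. The function name count_edge_triggered_reports and the KQUEUE_SKETCH_AVAILABLE macro are made up for this illustration. The kqueue code is only compiled on Mac OS X and FreeBSD (elsewhere the function returns -1), and it does not attempt to reproduce the hang.

```cpp
#include <cassert>
#include <unistd.h>

#if defined(__APPLE__) || defined(__FreeBSD__)
# include <sys/types.h>
# include <sys/event.h>
# include <sys/time.h>
# define KQUEUE_SKETCH_AVAILABLE 1
#endif

// Registers the read end of a pipe with EV_ADD | EV_CLEAR, writes one byte,
// then polls the kqueue twice without blocking. Returns how many of the two
// polls reported the descriptor, or -1 if kqueue is unavailable or setup fails.
int count_edge_triggered_reports()
{
#if defined(KQUEUE_SKETCH_AVAILABLE)
  int fds[2];
  if (::pipe(fds) != 0)
    return -1;

  int kq = ::kqueue();
  if (kq < 0)
  {
    ::close(fds[0]);
    ::close(fds[1]);
    return -1;
  }

  // EV_CLEAR resets the event state once it has been retrieved, so the
  // descriptor is reported again only when new data arrives.
  struct kevent change;
  EV_SET(&change, fds[0], EVFILT_READ, EV_ADD | EV_CLEAR, 0, 0, nullptr);
  ::kevent(kq, &change, 1, 0, 0, 0);

  // Leave the pipe readable, much like the interrupter's descriptor.
  char byte = 0;
  ssize_t written = ::write(fds[1], &byte, 1);
  (void)written;

  struct timespec no_wait = { 0, 0 };
  struct kevent fired;
  int reports = 0;
  for (int i = 0; i < 2; ++i)
    if (::kevent(kq, 0, 0, &fired, 1, &no_wait) > 0)
      ++reports;

  ::close(kq);
  ::close(fds[0]);
  ::close(fds[1]);
  return reports;
#else
  return -1;
#endif
}
```

With EV_CLEAR the byte left in the pipe is expected to be reported by the first poll only, which is why the patched reactor comment talks about edge-triggered rather than one-shot notifications.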
follow-up: 22 comment:21 by , 12 years ago
Thanks, Chris, for the quick turn around on this. I'll apply the patch and retry my original test case to verify.
--Tony
comment:22 by , 12 years ago
Replying to aastolfi@…:
Thanks, Chris, for the quick turn around on this. I'll apply the patch and retry my original test case to verify.
--Tony
All tests are now passing; I've been running them continuously for the last 3 hours or so. Before, I'd see a failure within a couple minutes.
Thanks again.
comment:23 by , 12 years ago
(In [69467])
- Add support for the fork() system call. Programs that use fork must call io_service.notify_fork() at the appropriate times. Two new examples have been added showing how to use this feature. Refs #3238, #4162.
- Clean up the handling of errors reported by the close() system call. In particular, assume that most operating systems won't have close() fail with EWOULDBLOCK, but if it does then set blocking mode and restart the call. If any other error occurs we assume the descriptor is closed. Refs #3307.
- EV_ONESHOT seems to cause problems on some versions of Mac OS X, with the io_service destructor getting stuck inside the close() system call. Use EV_CLEAR instead. Refs #5021.
- Include function name in exception what() messages.
- Fix insufficient initialisers warning with MinGW.
- Make the shutdown_service() member functions private.
- Add archetypes for testing socket option functions.
- Add missing lock in signal_set_service::cancel().
- Fix copy/paste error in SignalHandler example.
- The signal header needs to be included in signal_set_service.hpp so that we can use constants like NSIG and SIGRTMAX.
- Don't use Boost.Thread's convenience header. Use the header file that is specifically for the boost::thread class instead.
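The close() handling described in the changeset above can be sketched roughly as follows. This is an illustration under stated assumptions, not Asio's actual socket_ops::close(): the helper name close_descriptor is hypothetical, and it uses fcntl() with O_NONBLOCK to restore blocking mode, which may differ from what the library does internally.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: closes fd, retrying once in blocking mode if the
// first attempt fails with EWOULDBLOCK (or EAGAIN). Returns the result of
// the last close() call; errno is left as set by that call.
int close_descriptor(int fd)
{
  int result = ::close(fd);
  if (result != 0 && (errno == EWOULDBLOCK || errno == EAGAIN))
  {
    // Unusual case: put the descriptor back into blocking mode and
    // restart the call.
    int flags = ::fcntl(fd, F_GETFL, 0);
    if (flags >= 0)
      ::fcntl(fd, F_SETFL, flags & ~O_NONBLOCK);
    result = ::close(fd);
  }
  // For any other error the descriptor is assumed to be closed already,
  // so no further retry is attempted.
  return result;
}
```

A caller would use it in place of a bare ::close(fd); closing a descriptor that is already closed simply returns -1 without retrying.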
comment:24 by , 12 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
(In [69680]) Merge selected bug fixes from trunk:
- Fixed a compile error on some versions of g++ due to anonymous enums. Fixes #4883.
- Fixed a bug in asio::streambuf where the consume() function did not always update the internal buffer pointers correctly. The problem may occur when the asio::streambuf is filled with data using the standard C++ member functions such as sputn(). (Note: the problem does not manifest when the streambuf is populated by the Asio free functions read(), async_read(), read_until() or async_read_until().)
- EV_ONESHOT seems to cause problems on some versions of Mac OS X, with the io_service destructor getting stuck inside the close() system call. Use EV_CLEAR instead. Fixes #5021.
- Fixed a bug on kqueue-based platforms, where reactor read operations that return false from their perform() function are not correctly re-registered with kqueue.
- Fixed the linger socket option on non-Windows platforms.
- Fixed function name in comment for asio::placeholders::iterator.
Profile data from hanging process