Opened 12 years ago

Closed 12 years ago

#5021 closed Bugs (fixed)

io_service destructor hangs on Mac OS X

Reported by: Remko Tronçon <remko@…>
Owned by: chris_kohlhoff
Milestone: To Be Determined
Component: asio
Version: Boost 1.45.0
Severity: Problem
Keywords:
Cc:

Description

Sometimes, when destroying an io_service object on Mac OS X, my application is locked indefinitely. I have the feeling this happens when I destroy the io_service object right at the time when something's wrong with my network connection.

I attached Instruments to the process, and the profile suggests that it is stuck in the destructor of pipe_select_interrupter, in the call to close() (see the attached screenshot).

Is there something that can be done to avoid this?

Attachments (2)

BoostAsioBug.png (165.8 KB ) - added by Remko Tronçon <remko@…> 12 years ago.
Profile data from hanging process
repro.zip (1.2 KB ) - added by aastolfi@… 12 years ago.
reproduction case (zipped .cpp file)


Change History (26)

by Remko Tronçon <remko@…>, 12 years ago

Attachment: BoostAsioBug.png added

Profile data from hanging process

comment:1 by Remko Tronçon <remko@…>, 12 years ago

FYI, this also seems to happen in healthy network conditions, and other users of our application have reported the same issue (also on Mac OS X), with the same profile trace.

comment:2 by chris_kohlhoff, 12 years ago

Possibly a Mac OS bug, e.g. see http://bugs.python.org/issue7401.

What happens if you modify:

asio/detail/select_interrupter.hpp
asio/detail/socket_select_interrupter.hpp
asio/detail/impl/socket_select_interrupter.ipp

so that socket_select_interrupter is used on Mac OS X?

comment:3 by Remko Tronçon <remko@…>, 12 years ago

The test program in that report doesn't seem to exhibit the problem here, but the description of the problem sounds plausible.

Since it's hard for me to reproduce the problem, I changed our code to ensure that close() and write() are never called simultaneously, and will check with the user later if he still has the problem.

Thanks!

comment:4 by Remko Tronçon <remko@…>, 12 years ago

Several users have still reported the same problem after we changed the code to ensure that nothing is done on the socket from different threads simultaneously. It seems to be a different issue from the one on the Python tracker?

comment:5 by chris_kohlhoff, 12 years ago

Did you try changing the select_interrupter?

comment:6 by Remko Tronçon <remko@…>, 12 years ago

Hi Chris,

Same problem it seems:

    2659 Thread_1779307   DispatchQueue_1: com.apple.main-thread  (serial)
      2659 start
        2659 main
          2659 Swift::QtSwift::~QtSwift()
            2659 Swift::BoostNetworkFactories::~BoostNetworkFactories()
              2659 Swift::BoostIOServiceThread::~BoostIOServiceThread()
                2659 boost::asio::io_service::~io_service()
                  2659 boost::asio::detail::service_registry::~service_registry()
                    2659 boost::asio::detail::service_registry::destroy(boost::asio::io_service::service*)
                      2659 boost::asio::detail::kqueue_reactor::~kqueue_reactor()
                        2659 boost::asio::detail::socket_select_interrupter::~socket_select_interrupter()
                          2659 boost::asio::detail::socket_ops::close(int, unsigned char&, bool, boost::system::error_code&)
                            2659 close

comment:7 by chris_kohlhoff, 12 years ago

All I can suggest is that you change the select_interrupter's destructor (pipe or socket, whichever is easier to test) to call reset() before it closes the descriptors. Let me know if that makes any difference, thanks.

comment:8 by Remko Tronçon <remko@…>, 12 years ago

Calling reset() doesn't help.

comment:9 by chris_kohlhoff, 12 years ago

Well, I'm afraid that unless you can cut it down to a small test case there's little I can do.

BTW, you never mentioned which version of Mac OS X is involved. Is there any pattern there?

comment:10 by chris_kohlhoff, 12 years ago

You might also want to see what happens if you disable kqueue (by building with BOOST_ASIO_DISABLE_KQUEUE defined).

comment:11 by Remko Tronçon <remko@…>, 12 years ago

Cutting it down will probably be hard, since I can't reproduce it myself. The OS X version is the latest (10.6.6, I believe).

Disabling kqueue seems to make the problem go away. Does this help anything?

comment:12 by Remko Tronçon <remko@…>, 12 years ago

Maybe useful: it seems the program from the Python tracker also gets stuck in close() for the user who is reporting the problem. For the test program, it's not in an uninterruptible sleep though, whereas it is for our application.

comment:13 by chris_kohlhoff, 12 years ago

Hmmm, if it's kqueue related, perhaps the interrupter's descriptor needs to be explicitly removed from the kqueue before closing:

--- kqueue_reactor.ipp	24 Oct 2010 04:03:09 -0000	1.1.2.7
+++ kqueue_reactor.ipp	22 Jan 2011 10:52:47 -0000
@@ -54,6 +54,11 @@
 
 kqueue_reactor::~kqueue_reactor()
 {
+  struct kevent event;
+  BOOST_ASIO_KQUEUE_EV_SET(&event, interrupter_.read_descriptor(),
+      EVFILT_READ, EV_DELETE, 0, 0, &interrupter_);
+  ::kevent(kqueue_fd_, &event, 1, 0, 0, 0);
+
   close(kqueue_fd_);
 }

Just a stab in the dark really, but probably worth trying.

comment:14 by Remko Tronçon <remko@…>, 12 years ago

Nope, that doesn't seem to help :-(

comment:15 by chris_kohlhoff, 12 years ago

Sadly, I think that means you'll need to disable kqueue for the moment. It smells like it could be an OS bug. However, I don't know what support channels Apple provides for this sort of thing.

comment:16 by Remko, 12 years ago

I fixed a destruction order problem in our application where a timer was being destroyed after the io_service object was destroyed. Making sure the io_service stays alive until the last timer goes away seems to have made the hang disappear.
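
One way to get the ordering described above is to rely on the C++ rule that data members are destroyed in reverse order of declaration. The following is only a minimal sketch: FakeIoService, FakeTimer, Owner and destruction_order are hypothetical stand-ins that merely log their destruction, not Asio classes, and the application's real fix may have looked different.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for boost::asio::io_service; it only records when
// its destructor runs.
struct FakeIoService {
  explicit FakeIoService(std::vector<std::string>& log) : log_(log) {}
  ~FakeIoService() { log_.push_back("io_service"); }
  std::vector<std::string>& log_;
};

// Hypothetical stand-in for a timer that refers to its io_service.
struct FakeTimer {
  FakeTimer(FakeIoService& service, std::vector<std::string>& log)
      : service_(service), log_(log) {}
  ~FakeTimer() { log_.push_back("timer"); }
  FakeIoService& service_;
  std::vector<std::string>& log_;
};

// Members are destroyed in reverse declaration order, so declaring the
// service first means it outlives the timer.
struct Owner {
  explicit Owner(std::vector<std::string>& log)
      : service(log), timer(service, log) {}
  FakeIoService service;
  FakeTimer timer;
};

// Builds an Owner, lets it go out of scope, and returns the order in which
// the destructors ran.
std::vector<std::string> destruction_order() {
  std::vector<std::string> log;
  {
    Owner owner(log);
  }  // owner is destroyed here
  return log;
}
```

Calling destruction_order() returns the timer entry before the io_service entry. Swapping the two member declarations in Owner would reverse that order, which is the situation the fix avoids.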

comment:17 by arvid@…, 12 years ago

I have a similar issue in libtorrent on Mac OS X 10.6.5, built as 64 bit. I'm not sure what might have made this start to happen, but it appears to have started around the time when I merged uTP support into trunk, which essentially means a lot more traffic (and events) over a single UDP socket. It seems to somehow be related to busyness, as it seems to be more likely to hang when it's been running for a while (an hour or so). It hangs here (I'm on boost 1.44):

Call graph:
          2674 libtorrent::session::~session()
            2674 boost::shared_ptr<libtorrent::aux::session_impl>::~shared_ptr()
              2674 boost::detail::shared_count::~shared_count()
                2674 boost::detail::sp_counted_base::release()
                  2674 boost::detail::sp_counted_impl_p<libtorrent::aux::session_impl>::dispose()
                    2674 void boost::checked_delete<libtorrent::aux::session_impl>(libtorrent::aux::session_impl*)
                      2674 libtorrent::aux::session_impl::~session_impl()
                        2674 boost::asio::io_service::~io_service()
                          2674 boost::asio::detail::service_registry::~service_registry()
                            2674 boost::asio::detail::service_registry::destroy(boost::asio::io_service::service*)
                              2674 boost::asio::detail::kqueue_reactor::~kqueue_reactor()
                                2674 boost::asio::detail::pipe_select_interrupter::~pipe_select_interrupter()
                                  2674 close

This is the last thread alive at this point, so I don't think it's related to multithreading.

It definitely seems like an OS bug to me. close() isn't ever supposed to hang indefinitely, right?

comment:18 by Remko Tronçon <remko@…>, 12 years ago

Arvid,

I can at least confirm that we didn't have reports of the problem anymore since we avoided calling close() after destroying the io_service object. If you're sure your session shared_ptr isn't accidentally kept alive longer than io_service (a shared_ptr pitfall we fell into), maybe your problem is slightly different.

in reply to:  9 comment:19 by aastolfi@…, 12 years ago

Replying to chris_kohlhoff:

Well, I'm afraid that unless you can cut it down to a small test case there's little I can do.

BTW, you never mentioned which version of Mac OS X is involved. Is there any pattern there?

I've run into the same problem on Mac OS X 10.6.4

Attaching simplified repro code (repro.cpp); note in the comments there are a few places that seem to be crucial to reproducing the hang.

by aastolfi@…, 12 years ago

Attachment: repro.zip added

reproduction case (zipped .cpp file)

comment:20 by chris_kohlhoff, 12 years ago

Thank you for the test case. I was able to reproduce the issue on several different Mac OS X 10.6 systems. It seems to be an OS bug triggered by the use of EV_ONESHOT. Please try the following diff to see if it fixes the problem for you, and doesn't cause any other problems. (Note that you may need to apply the diff by hand since it is made against the trunk.)

Index: boost/asio/detail/impl/kqueue_reactor.ipp
===================================================================
--- boost/asio/detail/impl/kqueue_reactor.ipp	(revision 69227)
+++ boost/asio/detail/impl/kqueue_reactor.ipp	(working copy)
@@ -47,9 +47,9 @@
     interrupter_(),
     shutdown_(false)
 {
-  // The interrupter is put into a permanently readable state. Whenever we
-  // want to interrupt the blocked kevent call we register a one-shot read
-  // operation against the descriptor.
+  // The interrupter is put into a permanently readable state. Whenever we want
+  // to interrupt the blocked kevent call we register a read operation against
+  // the descriptor.
   interrupter_.interrupt();
 }
 
@@ -108,15 +108,15 @@
   {
   case read_op:
     BOOST_ASIO_KQUEUE_EV_SET(&event, descriptor, EVFILT_READ,
-        EV_ADD | EV_ONESHOT, 0, 0, descriptor_data);
+        EV_ADD | EV_CLEAR, 0, 0, descriptor_data);
     break;
   case write_op:
     BOOST_ASIO_KQUEUE_EV_SET(&event, descriptor, EVFILT_WRITE,
-        EV_ADD | EV_ONESHOT, 0, 0, descriptor_data);
+        EV_ADD | EV_CLEAR, 0, 0, descriptor_data);
     break;
   case except_op:
     BOOST_ASIO_KQUEUE_EV_SET(&event, descriptor, EVFILT_READ,
-        EV_ADD | EV_ONESHOT, EV_OOBAND, 0, descriptor_data);
+        EV_ADD | EV_CLEAR, EV_OOBAND, 0, descriptor_data);
     break;
   }
   ::kevent(kqueue_fd_, &event, 1, 0, 0, 0);
@@ -170,17 +170,17 @@
     {
     case read_op:
       BOOST_ASIO_KQUEUE_EV_SET(&event, descriptor, EVFILT_READ,
-          EV_ADD | EV_ONESHOT, 0, 0, descriptor_data);
+          EV_ADD | EV_CLEAR, 0, 0, descriptor_data);
       break;
     case write_op:
       BOOST_ASIO_KQUEUE_EV_SET(&event, descriptor, EVFILT_WRITE,
-          EV_ADD | EV_ONESHOT, 0, 0, descriptor_data);
+          EV_ADD | EV_CLEAR, 0, 0, descriptor_data);
       break;
     case except_op:
       if (!descriptor_data->op_queue_[read_op].empty())
         return; // Already registered for read events.
       BOOST_ASIO_KQUEUE_EV_SET(&event, descriptor, EVFILT_READ,
-          EV_ADD | EV_ONESHOT, EV_OOBAND, 0, descriptor_data);
+          EV_ADD | EV_CLEAR, EV_OOBAND, 0, descriptor_data);
       break;
     }
 
@@ -290,7 +290,7 @@
     if (ptr == &interrupter_)
     {
       // No need to reset the interrupter since we're leaving the descriptor
-      // in a ready-to-read state and relying on one-shot notifications.
+      // in a ready-to-read state and relying on edge-triggered notifications.
     }
     else
     {
@@ -339,17 +339,17 @@
       case EVFILT_READ:
         if (!descriptor_data->op_queue_[read_op].empty())
           BOOST_ASIO_KQUEUE_EV_SET(&event, descriptor, EVFILT_READ,
-              EV_ADD | EV_ONESHOT, 0, 0, descriptor_data);
+              EV_ADD | EV_CLEAR, 0, 0, descriptor_data);
         else if (!descriptor_data->op_queue_[except_op].empty())
           BOOST_ASIO_KQUEUE_EV_SET(&event, descriptor, EVFILT_READ,
-              EV_ADD | EV_ONESHOT, EV_OOBAND, 0, descriptor_data);
+              EV_ADD | EV_CLEAR, EV_OOBAND, 0, descriptor_data);
         else
           continue;
         break;
       case EVFILT_WRITE:
         if (!descriptor_data->op_queue_[write_op].empty())
           BOOST_ASIO_KQUEUE_EV_SET(&event, descriptor, EVFILT_WRITE,
-              EV_ADD | EV_ONESHOT, 0, 0, descriptor_data);
+              EV_ADD | EV_CLEAR, 0, 0, descriptor_data);
         else
           continue;
         break;
@@ -381,7 +381,7 @@
 {
   struct kevent event;
   BOOST_ASIO_KQUEUE_EV_SET(&event, interrupter_.read_descriptor(),
-      EVFILT_READ, EV_ADD | EV_ONESHOT, 0, 0, &interrupter_);
+      EVFILT_READ, EV_ADD | EV_CLEAR, 0, 0, &interrupter_);
   ::kevent(kqueue_fd_, &event, 1, 0, 0, 0);
 }
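
For readers unfamiliar with the two flags swapped in this diff, here is a small sketch of how they differ in general. It does not reproduce the hang and is not Asio code: count_notifications and KQUEUE_DEMO_AVAILABLE are hypothetical names, and on platforms without kqueue the function simply returns -1. It registers a read filter on a pipe once, then writes one byte and polls, twice.

```cpp
#include <cassert>
#if defined(__APPLE__) || defined(__FreeBSD__) || defined(__NetBSD__) || \
    defined(__OpenBSD__) || defined(__DragonFly__)
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <unistd.h>
#define KQUEUE_DEMO_AVAILABLE 1
#endif

// Registers a read filter on a pipe with EV_CLEAR (edge_triggered == true)
// or EV_ONESHOT (false), then writes a byte and polls, twice. Returns the
// number of events delivered, or -1 if kqueue is unavailable or setup fails.
int count_notifications(bool edge_triggered) {
#ifdef KQUEUE_DEMO_AVAILABLE
  int fds[2];
  if (::pipe(fds) != 0) return -1;
  int kq = ::kqueue();
  if (kq < 0) {
    ::close(fds[0]);
    ::close(fds[1]);
    return -1;
  }

  // Register the read end exactly once.
  struct kevent change;
  EV_SET(&change, fds[0], EVFILT_READ,
         EV_ADD | (edge_triggered ? EV_CLEAR : EV_ONESHOT), 0, 0, 0);
  ::kevent(kq, &change, 1, 0, 0, 0);

  int delivered = 0;
  struct timespec no_wait = {0, 0};
  for (int i = 0; i < 2; ++i) {
    char byte = 'x';
    if (::write(fds[1], &byte, 1) != 1) break;
    struct kevent result;
    int n = ::kevent(kq, 0, 0, &result, 1, &no_wait);  // non-blocking poll
    if (n > 0) delivered += n;
  }

  ::close(kq);
  ::close(fds[0]);
  ::close(fds[1]);
  return delivered;
#else
  (void)edge_triggered;
  return -1;  // no kqueue on this platform
#endif
}
```

Going by the documented kqueue semantics, the EV_ONESHOT registration should be removed after its first delivery, so only the first write is reported, while the EV_CLEAR registration stays in place and reports each new write. That is why the patched reactor no longer has to re-add the filter after every event.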

in reply to:  20 ; comment:21 by aastolfi@…, 12 years ago

Thanks, Chris, for the quick turnaround on this. I'll apply the patch and retry my original test case to verify.

--Tony

in reply to:  21 comment:22 by aastolfi@…, 12 years ago

Replying to aastolfi@…:

Thanks, Chris, for the quick turnaround on this. I'll apply the patch and retry my original test case to verify.

--Tony

All tests are now passing; I've been running them continuously for the last 3 hours or so. Before, I'd see a failure within a couple minutes.

Thanks again.

comment:23 by chris_kohlhoff, 12 years ago

(In [69467])

  • Add support for the fork() system call. Programs that use fork must call io_service.notify_fork() at the appropriate times. Two new examples have been added showing how to use this feature. Refs #3238, #4162.
  • Clean up the handling of errors reported by the close() system call. In particular, assume that most operating systems won't have close() fail with EWOULDBLOCK, but if it does then set blocking mode and restart the call. If any other error occurs we assume the descriptor is closed. Refs #3307.
  • EV_ONESHOT seems to cause problems on some versions of Mac OS X, with the io_service destructor getting stuck inside the close() system call. Use EV_CLEAR instead. Refs #5021.
  • Include function name in exception what() messages.
  • Fix insufficient initialisers warning with MinGW.
  • Make the shutdown_service() member functions private.
  • Add archetypes for testing socket option functions.
  • Add missing lock in signal_set_service::cancel().
  • The signal header needs to be included in signal_set_service.hpp so that we can use constants like NSIG and SIGRTMAX.
  • Don't use Boost.Thread's convenience header. Use the header file that is specifically for the boost::thread class instead.
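
The close() handling described in the list above could be sketched roughly as follows. This is an illustration under assumptions, not the code from the changeset: close_descriptor is a hypothetical helper, and it uses plain POSIX fcntl() to restore blocking mode.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper, not Asio's socket_ops::close. If close() fails with
// EWOULDBLOCK, switch the descriptor back to blocking mode and retry once.
// For any other error the caller should assume the descriptor is closed.
int close_descriptor(int fd) {
  int result = ::close(fd);
  if (result != 0 && (errno == EWOULDBLOCK || errno == EAGAIN)) {
    int flags = ::fcntl(fd, F_GETFL, 0);
    if (flags >= 0)
      ::fcntl(fd, F_SETFL, flags & ~O_NONBLOCK);  // clear non-blocking mode
    result = ::close(fd);                         // restart the call
  }
  return result;
}
```

The helper returns 0 on success and -1 on failure, with errno left as set by the last close() call.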

comment:24 by chris_kohlhoff, 12 years ago

Resolution: fixed
Status: new → closed

(In [69680]) Merge selected bug fixes from trunk:

  • Fixed a compile error on some versions of g++ due to anonymous enums. Fixes #4883.
  • Fixed a bug in asio::streambuf where the consume() function did not always update the internal buffer pointers correctly. The problem may occur when the asio::streambuf is filled with data using the standard C++ member functions such as sputn(). (Note: the problem does not manifest when the streambuf is populated by the Asio free functions read(), async_read(), read_until() or async_read_until().)
  • EV_ONESHOT seems to cause problems on some versions of Mac OS X, with the io_service destructor getting stuck inside the close() system call. Use EV_CLEAR instead. Fixes #5021.
  • Fixed a bug on kqueue-based platforms, where reactor read operations that return false from their perform() function are not correctly re-registered with kqueue.
  • Fixed the linger socket option on non-Windows platforms.
  • Fixed function name in comment for asio::placeholders::iterator.