Opened 10 years ago

Last modified 4 years ago

#7611 reopened Bugs

segfault in epoll_reactor.ipp

Reported by: Fredrik Jansson <fredrik.jansson.se@…> Owned by: chris_kohlhoff
Milestone: To Be Determined Component: asio
Version: Boost 1.52.0 Severity: Problem
Keywords: Cc:

Description

While testing versions 1.46.1 and 1.51 on 64-bit Ubuntu 12.04, I found a segfault condition in epoll_reactor.ipp.

The function is

    void epoll_reactor::deregister_descriptor(socket_type descriptor,
        epoll_reactor::per_descriptor_data& descriptor_data, bool closing)
    {
      if (!descriptor_data)
        return;

      mutex::scoped_lock descriptor_lock(descriptor_data->mutex_);

      if (!descriptor_data->shutdown_) {

The descriptor_data pointer is checked for NULL before the mutex is locked. In rare conditions, by the time the if-statement after the lock is reached, descriptor_data is NULL.

I have solved this by adding a second check after the mutex is locked, i.e.

      if (!descriptor_data)
        return;

      mutex::scoped_lock descriptor_lock(descriptor_data->mutex_);

      if (!descriptor_data)
        return;

      if (!descriptor_data->shutdown_) {

Best regards, Fredrik Jansson

Attachments (2)

asio_bug.cpp (8.3 KB ) - added by bronf 5 years ago.
minimal example to reproduce the bug
gdb.log (44.2 KB ) - added by bronf 5 years ago.
gdb log with backtrace


Change History (17)

comment:1 by nanyu@…, 10 years ago

Yes, I hit this bug too. Here is the dump from gdb:

warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fff405fc000

Core was generated by `./input_controller --id 0'.

Program terminated with signal 11, Segmentation fault.

#0 0x00000000004155ca in boost::asio::detail::epoll_reactor::start_op (this=0x5f40390, op_type=0, descriptor=10, descriptor_data=@0x2aaab0000aa0, op=0x2aaaac002dc0, allow_speculative=true)

at /usr/local/include/boost/asio/detail/impl/epoll_reactor.ipp:219

219 if (descriptor_data->shutdown_)

...

comment:2 by chris_kohlhoff, 10 years ago

Resolution: invalid
Status: new → closed

The descriptor_data variable is only set to NULL when the corresponding socket is deregistered (see epoll_reactor::deregister_descriptor, which is in turn called from reactive_socket_service_base::destroy/close). This means that your program has closed the socket or destroyed the socket object.

Most likely you have a threading issue in your program where you close a socket from one thread while simultaneously starting another async operation on the same socket from another thread. If you are sure this is not the case, please attach a small, complete program that exhibits the problem. Thanks.

comment:3 by jrob_email@…, 9 years ago

When was this bug introduced? I see it in 1.45 on Ubuntu 12.04.

#0 0x00002b55ee92fe84 in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0 0x00002b55ee92fe84 in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00002b55ed7e1049 in lock (this=0x3233332ecb) at /usr/include/boost/asio/detail/posix_mutex.hpp:52
#2 scoped_lock (m=..., this=<synthetic pointer>) at /usr/include/boost/asio/detail/scoped_lock.hpp:36
#3 boost::asio::detail::epoll_reactor::close_descriptor (this=0x3233332e53, descriptor_data=@0x51afeb2f0: 0x4f293a770) at /usr/include/boost/asio/detail/impl/epoll_reactor.ipp:195
#4 0x00002b55ee02d577 in destroy (impl=..., this=<optimized out>) at /usr/include/boost/asio/detail/impl/reactive_socket_service_base.ipp:53
#5 destroy (impl=..., this=<optimized out>) at /usr/include/boost/asio/datagram_socket_service.hpp:101
#6 ~basic_io_object (this=0x51afeb2e0, in_chrg=<optimized out>) at /usr/include/boost/asio/basic_io_object.hpp:85
#7 ~basic_socket (this=0x51afeb2e0, in_chrg=<optimized out>) at /usr/include/boost/asio/basic_socket.hpp:1054
#8 ~basic_datagram_socket (this=0x51afeb2e0, in_chrg=<optimized out>) at /usr/include/boost/asio/basic_datagram_socket.hpp:41
#9 msg::kit::ChannelState::shutdown (this=0x51b969000) at UDPReceiver.cpp:1985
#10 0x00002b55ee02d7cb in msg::kit::ChannelState::~ChannelState (this=0x51b969000, in_chrg=<optimized o

in reply to:  2 ; comment:4 by jyu@…, 6 years ago

Replying to chris_kohlhoff:

The descriptor_data variable is only set to NULL when the corresponding socket is deregistered (see epoll_reactor::deregister_descriptor, which is in turn called from reactive_socket_service_base::destroy/close). This means that your program has closed the socket or destroyed the socket object.

Most likely you have a threading issue in your program where you close a socket from one thread while simultaneously starting another async operation on the same socket from another thread. If you are sure this is not the case, please attach a small, complete program that exhibits the problem. Thanks.

In my case, the crash is due to two threads closing the socket simultaneously. Can you make the socket close or shutdown function thread-safe, just as Fredrik Jansson suggested?

in reply to:  4 ; comment:5 by anonymous, 6 years ago

Replying to jyu@…:

In my case, the crash is due to two threads simultaneously closing the socket. can you make the socket close or shutdown function thread-safe, just as Fredrik Jansson suggested?

You have a threading issue that's even worse than that of the original poster. And it's all your own fault.

Here's how you can solve it: Use a mutex, such that you do NOT have 2 threads messing with the socket at the same time. (Within the protected region, you could find out if the socket-descriptor is already closed, by using something like descriptor.is_open() , see http://www.boost.org/doc/libs/1_63_0/doc/html/boost_asio/reference.html#boost_asio.reference.posix__basic_descriptor.is_open )
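The mutex approach described above can be sketched roughly as follows. This is a minimal illustration that assumes nothing about asio itself: FakeSocket and GuardedSocket are hypothetical stand-ins invented for this example, and only is_open() and close() mirror names mentioned in the comment.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a socket object that is NOT thread-safe.
struct FakeSocket {
    bool open = true;
    int close_calls = 0;
    bool is_open() const { return open; }
    void close() { open = false; ++close_calls; }
};

// Hypothetical wrapper: every access to the socket goes through one mutex.
class GuardedSocket {
public:
    // Closes the socket at most once; returns true if this call closed it.
    bool close_once() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!socket_.is_open())
            return false;  // another thread already closed it
        socket_.close();
        return true;
    }

    int close_calls() {
        std::lock_guard<std::mutex> lock(mutex_);
        return socket_.close_calls;
    }

private:
    std::mutex mutex_;
    FakeSocket socket_;
};

// Two threads race to close the same socket; returns how often close() ran.
int close_from_two_threads() {
    GuardedSocket s;
    std::thread a([&s] { s.close_once(); });
    std::thread b([&s] { s.close_once(); });
    a.join();
    b.join();
    return s.close_calls();
}
```

In a real program the same mutex would presumably also have to guard the calls that start new operations on the socket, since those are what race with close in the original report.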

in reply to:  5 comment:6 by anonymous, 6 years ago

Replying to anonymous:

Replying to jyu@…:

In my case, the crash is due to two threads simultaneously closing the socket. can you make the socket close or shutdown function thread-safe, just as Fredrik Jansson suggested?

You have a threading issue that's even worse than that of the original poster. And it's all your own fault.

Here's how you can solve it: Use a mutex, such that you do NOT have 2 threads messing with the socket at the same time. (Within the protected region, you could find out if the socket-descriptor is already closed, by using something like descriptor.is_open() , see http://www.boost.org/doc/libs/1_63_0/doc/html/boost_asio/reference.html#boost_asio.reference.posix__basic_descriptor.is_open )

I ended up doing compare-and-swap on a flag to make sure the close-socket is called only once.

I am just wondering whether the asio close-socket function could be made thread-safe by using compare-and-swap on that crashing pointer.

In my app, only the thread-safe asio socket functions are used concurrently. I did not realize that the asio close-socket function is an exception. My bad, I did not read the docs carefully.
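For readers wondering what "compare-and-swap on a flag" might look like, here is a rough sketch using std::atomic. It is not the commenter's actual code: Connection, FakeSocket and race_close are hypothetical names made up for the illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real socket; it only counts close() calls.
struct FakeSocket {
    std::atomic<int> close_calls{0};
    void close() { ++close_calls; }
};

class Connection {
public:
    // Only the thread that flips closed_ from false to true calls close().
    bool close_once() {
        bool expected = false;
        if (!closed_.compare_exchange_strong(expected, true))
            return false;  // some other thread won the race
        socket_.close();
        return true;
    }

    int close_calls() const { return socket_.close_calls.load(); }

private:
    std::atomic<bool> closed_{false};
    FakeSocket socket_;
};

// Races several threads on close_once(); returns how often close() ran.
int race_close(int thread_count) {
    Connection c;
    std::vector<std::thread> threads;
    for (int i = 0; i < thread_count; ++i)
        threads.emplace_back([&c] { c.close_once(); });
    for (auto& t : threads)
        t.join();
    return c.close_calls();
}
```

Note that a flag like this only stops two close calls from racing each other. It does not, by itself, protect against another thread starting an operation on the socket while the close is in progress.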

comment:7 by anonymous, 5 years ago

I am running into a similar issue (using asio http server)

Thread 1 (the main thread) segfaulting in the same spot as the OP while trying to shutdown the server.

#0 0x00000000005a6667 in boost::asio::detail::epoll_reactor::deregister_descriptor (this=0x979c90, descriptor=41, descriptor_data=@0x7fffe4003e68: 0x0, closing=false) at /usr/include/boost/asio/detail/impl/epoll_reactor.ipp:309
#1 0x000000000066068e in boost::asio::detail::reactive_socket_service_base::close (this=0x979e18, impl=..., ec=...) at /usr/include/boost/asio/detail/impl/reactive_socket_service_base.ipp:104
#2 0x000000000066e1b2 in boost::asio::stream_socket_service<boost::asio::ip::tcp>::close (this=0x979df0, impl=..., ec=...) at /usr/include/boost/asio/stream_socket_service.hpp:170
#3 0x000000000066da0a in boost::asio::basic_socket<boost::asio::ip::tcp, boost::asio::stream_socket_service<boost::asio::ip::tcp> >::close (this=0x7fffe4003e60) at /usr/include/boost/asio/basic_socket.hpp:356
#4 0x000000000066af48 in http::server::connection::stop (this=0x7fffe4003e50) at src/main/include/asio_http/connection.cc:35
#5 0x00000000006693b3 in http::server::connection_manager::stop_all (this=0x978560) at src/main/include/asio_http/connection_manager.cc:35
#6 0x000000000065a24e in http::server::server::stop (this=0x978510) at src/main/include/asio_http/server.cc:104

Thread 2 (detached thread) started from thread 1 earlier in the application, waiting for thread 3 to return. Thread 3 (detached thread) started from thread 2, running server.run()

#0 lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007ffff7bc3dbd in GI_pthread_mutex_lock (mutex=0x7fffe4003d68) at ../nptl/pthread_mutex_lock.c:80
#2 0x00000000005a5dea in boost::asio::detail::posix_mutex::lock (this=0x7fffe4003d68) at /usr/include/boost/asio/detail/posix_mutex.hpp:52
#3 0x000000000065e9b7 in boost::asio::detail::epoll_reactor::descriptor_state::perform_io (this=0x7fffe4003d40, events=1) at /usr/include/boost/asio/detail/impl/epoll_reactor.ipp:610
#4 0x000000000065eb57 in boost::asio::detail::epoll_reactor::descriptor_state::do_complete (owner=0x976050, base=0x7fffe4003d40, ec=..., bytes_transferred=1) at /usr/include/boost/asio/detail/impl/epoll_reactor.ipp:649
#5 0x000000000065ce0e in boost::asio::detail::task_io_service_operation::complete (this=0x7fffe4003d40, owner=..., ec=..., bytes_transferred=1) at /usr/include/boost/asio/detail/task_io_service_operation.hpp:38
#6 0x000000000065f78f in boost::asio::detail::task_io_service::do_run_one (this=0x976050, lock=..., this_thread=..., ec=...) at /usr/include/boost/asio/detail/impl/task_io_service.ipp:372
#7 0x000000000065f1c9 in boost::asio::detail::task_io_service::run (this=0x976050, ec=...) at /usr/include/boost/asio/detail/impl/task_io_service.ipp:149
#8 0x000000000065f9e2 in boost::asio::io_service::run (this=0x978510) at /usr/include/boost/asio/impl/io_service.ipp:59
#9 0x0000000000659ff0 in http::server::server::run (this=0x978510) at src/main/include/asio_http/server.cc:64

server thread and some other threads and waits for them to return.

Although thread 3 can also be in other parts of the code when this occurs.

#0 0x00007ffff5edbe23 in epoll_wait () at ../sysdeps/unix/syscall-template.S:84
#1 0x000000000065e319 in boost::asio::detail::epoll_reactor::run (this=0x979c90, block=true, ops=...) at /usr/include/boost/asio/detail/impl/epoll_reactor.ipp:392
#2 0x000000000065f70b in boost::asio::detail::task_io_service::do_run_one (this=0x976050, lock=..., this_thread=..., ec=...) at /usr/include/boost/asio/detail/impl/task_io_service.ipp:356
#3 0x000000000065f1c9 in boost::asio::detail::task_io_service::run (this=0x976050, ec=...) at /usr/include/boost/asio/detail/impl/task_io_service.ipp:149
#4 0x000000000065f9e2 in boost::asio::io_service::run (this=0x978510) at /usr/include/boost/asio/impl/io_service.ipp:59
#5 0x0000000000659ff0 in http::server::server::run (this=0x978510) at src/main/include/asio_http/server.cc:64

This occurs after the server has been running for some time (serving data without issue), and only during the shutdown sequence. I have done no modifications to the asio http server code and am using libcurl to do all of the transactions (which are done by thread 1 before the shutdown is called.) Making the changes in the OP does seem to resolve the issue.

comment:8 by anonymous, 5 years ago

Resolution: invalid
Status: closed → reopened

by bronf, 5 years ago

Attachment: asio_bug.cpp added

minimal example to reproduce the bug

by bronf, 5 years ago

Attachment: gdb.log added

gdb log with backtrace

comment:9 by bronf, 5 years ago

I encountered the same bug and attached a minimal example to this page that reproduces it (tested with 1.65.1). See also the gdb output. I kept the core in case you would like me to extract more information.

My program starts a server which just waits with the connected socket while the client writes a large amount of data. Because the server does not read, the write operation is stopped and the timer expires and cancels the write operation by closing the client socket. (This is just a test program, not a real program).

Apparently, in rare situations, closing the socket while in async_write causes a segmentation fault because a null pointer (descriptor_data) is dereferenced.

boost/include/boost/asio/detail/impl/epoll_reactor.ipp:230
230 if (descriptor_data->shutdown_)
(gdb) print descriptor_data
$1 = (boost::asio::detail::epoll_reactor::per_descriptor_data &) @0x64de48: 0x0

Because the bug appears very rarely, this is what I do to make it happen and stop in gdb:

    while gdb -ex run -ex quit ./asio_bug ; do true; done

In parallel, I try to load the computer with a lot of things (not sure if this helps to make the bug appear).

Tested on linux 64 bits with gcc 7.2.0 and boost 1.65.1.

comment:10 by ddsherstennikov@…, 4 years ago

Any progress since then?

comment:11 by ddsherstennikov@…, 4 years ago

boost 1.65.0

linux 64

gcc 7.2.0

comment:12 by Eduardo Iglesias <7i77an@…>, 4 years ago

Same situation with boost 1.59 version.

#0 0x000000000057ecef in boost::asio::detail::epoll_reactor::start_op (this=0x272c050, op_type=0, descriptor=852, descriptor_data=@0x7f536c000e68: 0x0, op=0x7f54c8005f80, is_continuation=true,

allow_speculative=true) at /usr/local/include/boost-1_59/boost/asio/detail/impl/epoll_reactor.ipp:219

As additional information, the failure started to happen with Debian Jessie. On Debian Squeeze, the error does not appear.

comment:13 by Eduardo Iglesias <7i77an@…>, 4 years ago

Correction: with Boost 1.44 on Debian Squeeze, the error does not appear.

comment:14 by abdurrahim.cakar@…, 4 years ago

Why C++ sucks so bad version 1.67

in reply to:  2 comment:15 by anonymous, 4 years ago

Replying to chris_kohlhoff:

The descriptor_data variable is only set to NULL when the corresponding socket is deregistered (see epoll_reactor::deregister_descriptor, which is in turn called from reactive_socket_service_base::destroy/close). This means that your program has closed the socket or destroyed the socket object.

Most likely you have a threading issue in your program where you close a socket from one thread while simultaneously starting another async operation on the same socket from another thread. If you are sure this is not the case, please attach a small, complete program that exhibits the problem. Thanks.

Hi all, I'm using Boost.Asio 1.67.0 in a fairly large application. It is a WebRTC server that uses both TCP and UDP sockets to forward data among users.

The server can run on both Windows and Linux.

I see the issue reported above on Linux only. On Windows it works fine.

It is difficult to reproduce, but I found a scenario that works using valgrind.

The threading model I'm using is one io_service, multiple threads.

The behavior of the application in the point near the crash is the following: when the tcp socket of the user disconnects, then the server closes the udp socket and finalizes the session for that user.

The crash happens (sometimes) when closing the UDP socket while it is reading/writing.

Following your suggestion, I wrapped all async read/write/close operations in the same strand (one per UDP socket) and magically it stopped crashing.

Before this, the strand was used only to serialize the write operations and avoid out-of-order transmission of packets. Reads did not go through it, because each async read is issued only after the previous data chunk has been processed.
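The idea behind routing read, write and close through one strand can be illustrated without asio at all. The sketch below is a simplified model under stated assumptions: SerialExecutor is a hypothetical stand-in for a strand (it runs posted handlers one at a time on a single worker thread), and FakeUdpSocket is a hypothetical, deliberately non-thread-safe socket. Real code would use asio's own strand rather than anything like this.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical strand stand-in: runs handlers one at a time, in FIFO order.
class SerialExecutor {
public:
    SerialExecutor() : worker_([this] { run(); }) {}
    ~SerialExecutor() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        ready_.notify_one();
        worker_.join();  // the worker drains the queue before exiting
    }
    void post(std::function<void()> handler) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(handler));
        }
        ready_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> handler;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
                if (queue_.empty())
                    return;  // stopping and nothing left to run
                handler = std::move(queue_.front());
                queue_.pop();
            }
            handler();  // runs outside the lock, never concurrently
        }
    }
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::function<void()>> queue_;
    bool stopping_ = false;
    std::thread worker_;  // declared last so the other members exist first
};

// Hypothetical socket with plain (non-atomic) state: unsafe if shared directly.
struct FakeUdpSocket {
    bool open = true;
    int sends = 0, rejected = 0, close_calls = 0;
    void send() { if (open) ++sends; else ++rejected; }
    void close() { if (!open) return; open = false; ++close_calls; }
};

// Four threads post sends and a close; all socket access happens on the worker.
bool run_serialized_demo() {
    FakeUdpSocket socket;
    {
        SerialExecutor strand;
        std::vector<std::thread> threads;
        for (int t = 0; t < 4; ++t) {
            threads.emplace_back([&] {
                for (int i = 0; i < 100; ++i)
                    strand.post([&socket] { socket.send(); });
                strand.post([&socket] { socket.close(); });
            });
        }
        for (auto& th : threads)
            th.join();
    }  // ~SerialExecutor drains the queue and joins the worker
    return socket.close_calls == 1 && socket.sends + socket.rejected == 400;
}
```

Because every handler touches the socket from the same worker thread, a close can never overlap a send, which is the property the strand gives in the scenario described above. The cost is that sends no longer run in parallel with each other.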

I checked the release notes of the previous versions of Boost and it seems that the problem was fixed in 1.65.0: Fixed a race condition in the Linux epoll backend, which may occur when a socket or descriptor is closed while another thread is blocked on epoll.

But it is not.

From what I understand about asio, it should be possible to read and write on the same socket from different threads without any problem. But if the close requires serialization, then all read/write operations must be wrapped by a strand, and that will affect performance (a bit).

Is this a bug inside the epoll/linux implementation of asio sockets or is it the correct way of working and it must be managed by the application?

Thank you. Let me know if I can do something to help solve this.

Emanuele
