Opened 6 years ago

Last modified 6 years ago

#12474 new Bugs

Using two different resolver instances on the same io_service causes a race condition

Reported by: michele.de.stefano@… Owned by: chris_kohlhoff
Milestone: To Be Determined Component: asio
Version: Boost 1.61.0 Severity: Problem
Keywords: race condition, data race Cc:

Description

Dear developer (or developers) of Boost Asio,

I think I've found a bug in Asio. If I instantiate two different TCP resolvers on the same io_service and use these two distinct resolvers to resolve the same endpoint, a race condition occurs.
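For clarity, here is a minimal, self-contained sketch of the pattern I'm describing (the host, port, thread count, and handler are illustrative only; the actual reproducers are in the attached sources):

#include <boost/asio.hpp>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    namespace asio = boost::asio;
    using asio::ip::tcp;

    asio::io_service io;

    // Two distinct resolvers on the same io_service.
    tcp::resolver resolver1(io);
    tcp::resolver resolver2(io);

    // Both resolve the same endpoint (host and port are placeholders).
    tcp::resolver::query query("127.0.0.1", "5555");

    auto handler = [](const boost::system::error_code& ec,
                      tcp::resolver::iterator it) {
        if (!ec)
            std::cout << "resolved: " << it->endpoint() << std::endl;
    };

    resolver1.async_resolve(query, handler);
    resolver2.async_resolve(query, handler);

    // io_service::run() is called from multiple threads, which is the
    // configuration under which the failure shows up.
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back([&io] { io.run(); });
    for (auto& t : workers)
        t.join();
    return 0;
}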

I have attached two sources (C++11 is required) that reproduce the issue. Unfortunately, you need to run the programs many times to trigger a failure. I have bash scripts that run these programs thousands of times, each time using a different, random, free port on the local host.

Quick instructions for running the tests:

test_engine_client_dbg_x -s -p <port>

runs the server.

test_engine_client_dbg_x -c -p <port>

runs the client.

test_engine_client_dbg_2.cpp is the source that fails (at line 364): for no apparent reason, the mEngineClient pointer is sometimes found to be NULL. I've also verified that if I insert a while loop in this code that waits until mEngineClient.get() != nullptr, the code proceeds with no error (meaning that, at some point, the pointer is set to the correct value). But, I repeat, there is no reason why mEngineClient.get() should be NULL at this point.

test_engine_client_dbg_4.cpp is the source that works. Notice that this time I'm not instantiating a second resolver within the EngineClient class; instead, I'm simply passing the already-resolved endpoint iterator to the EngineClient's constructor. With this change, we never lose the mEngineClient pointer.
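For comparison, here is a minimal sketch of the working approach. This EngineClient is a simplified stand-in, not the class from the attachment; the names, port, and connect handler are illustrative only:

#include <boost/asio.hpp>
#include <iostream>
#include <memory>
#include <thread>
#include <vector>

namespace asio = boost::asio;
using asio::ip::tcp;

// Minimal stand-in for the EngineClient of the attachments: it receives an
// already-resolved iterator instead of creating a second resolver itself.
class EngineClient {
public:
    EngineClient(asio::io_service& io, tcp::resolver::iterator endpoints)
        : mSocket(io) {
        asio::async_connect(mSocket, endpoints,
            [](const boost::system::error_code& ec, tcp::resolver::iterator) {
                if (!ec)
                    std::cout << "connected" << std::endl;
            });
    }

private:
    tcp::socket mSocket;
};

int main() {
    asio::io_service io;

    // A single resolver: the endpoint is resolved once, up front.
    tcp::resolver resolver(io);
    tcp::resolver::iterator it =
        resolver.resolve(tcp::resolver::query("127.0.0.1", "5555"));

    // The resolved iterator is handed to the client's constructor; no
    // second resolver is created inside the class.
    std::unique_ptr<EngineClient> mEngineClient(new EngineClient(io, it));

    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back([&io] { io.run(); });
    for (auto& t : workers)
        t.join();
    return 0;
}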

This race condition occurs only if we call io_service.run() from multiple threads. I've verified that it does not happen if io_service.run() is called from a single thread.

Also, I think it is difficult to reproduce, because I've experienced it only on one specific machine that, perhaps because of its hardware, has different timing from the others I have. On this machine (Fedora 23, kernel 4.7.3, gcc 5.3.1) I've also tried rebuilding with gcc 4.8.5, but the issue is the same (so this is not a compiler bug). I've also tried, on the same machine, with Fedora 24 (yes, I reinstalled the OS), which let me test with gcc 6.1 as well, and the issue comes out again. On other machines (including a CentOS 6 box with 8 cores and gcc 4.8.5) I was not able to reproduce this issue even after running the tests thousands of times, while on the Fedora machine where I'm experiencing it, it basically always happens within 1000 runs.

In summary, a stress test is the only way to have a chance of hitting this bug (although sometimes it also shows up on the very first run).

Attachments (2)

test_engine_client_dbg_2.cpp (11.9 KB) - added by michele.de.stefano@… 6 years ago.
This is the code that has the race condition.
test_engine_client_dbg_4.cpp (11.0 KB) - added by michele.de.stefano@… 6 years ago.
This is the code that does NOT have the race condition.


Change History (3)

by michele.de.stefano@…, 6 years ago

Attachment test_engine_client_dbg_2.cpp added: this is the code that has the race condition.

by michele.de.stefano@…, 6 years ago

Attachment test_engine_client_dbg_4.cpp added: this is the code that does NOT have the race condition.

comment:1 by michele.de.stefano@…, 6 years ago

I simply wanted to add that I've also tried with Boost 1.59.0, and the issue was already present there.
