Boost C++ Libraries: Ticket #8933: on the windows platform async reads with multiple threads can produce random EOF errors https://svn.boost.org/trac10/ticket/8933 <p> After upgrading upgrading boost to 1.54.0 I started getting lots of eof errors out of some of my ASIO based applications that didn't make sense. </p> <p> My applications use the single io_service with a thread pool pattern. When the thread pool contains more than one thread I started getting random eof errors whenever the network traffic started to get heavy. </p> <p> Attached is an example program that can reproduce the problem. </p> <p> The application built with boost 1.53.0 works fine. Built with boost 1.54.0 fails. </p> <p> I have tried visual studio version 2008 and 2012 with the same results. </p> <p> To reproduce the problem I use the following parameters with two instances on the same machine: <a class="missing wiki">SendReceiveTest</a>.exe server 127.0.0.1 100 1 1000 10 <a class="missing wiki">SendReceiveTest</a>.exe client 127.0.0.1 100 2 0 0 </p> <p> The client will error out on the read with a eof error. </p> <p> I tracked the problem back to a posted IO completion event that was generated from a on_pending call. The event gets converted to eof because ::<a class="missing wiki">GetLastError</a> always returns '0' and the transferred bytes are always 'zero' from the on_pending call. Looking at the changes between boost asio v1.53.0 and v1.54.0 I spotted in asio\detail\impl\win_iocp_io_service.ipp that some <a class="missing wiki">PostQueuedCompletionStatus</a> calls where changed form passing 'overlapped_contains_result' to passing '0'. Changing them back to 'overlapped_contains_result' fixes the problem. </p> en-us Boost C++ Libraries /htdocs/site/boost.png https://svn.boost.org/trac10/ticket/8933 Trac 1.4.3 Shane Powell <killerbee@…> Sat, 27 Jul 2013 21:10:45 GMT attachment set https://svn.boost.org/trac10/ticket/8933 https://svn.boost.org/trac10/ticket/8933 <ul> <li><strong>attachment</strong> → <span class="trac-field-new">SendReceiveTest.zip</span> </li> </ul> <p> Test application that reproduces the problem </p> Ticket Shane Powell <killerbee@…> Sat, 27 Jul 2013 21:11:05 GMT attachment set https://svn.boost.org/trac10/ticket/8933 https://svn.boost.org/trac10/ticket/8933 <ul> <li><strong>attachment</strong> → <span class="trac-field-new">win_iocp_io_service.ipp.patch</span> </li> </ul> <p> Patch to fix the problem </p> Ticket smueller@… Fri, 02 Aug 2013 21:27:57 GMT <link>https://svn.boost.org/trac10/ticket/8933#comment:1 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8933#comment:1</guid> <description> <p> I saw this in my Windows application after upgrading to Boost 1.54. I noticed that, during a stress test, frequently the bytes received field in my <a class="missing wiki">ReadHandler</a> was getting zero and an error_code of zero while a non-zero number of bytes was requested from an async_read call. In my case this caused stream corruption which could only be resolved in non-library space by passing in the number of bytes requested for the read into the handler and using that instead of the bytes received value passed to the <a class="missing wiki">ReadHandler</a>. </p> <p> Strangely enough, in my testing, it seemed that activating BOOST_ASIO_ENABLE_HANDLER_TRACKING made this problem pop up more frequently. </p> </description> <category>Ticket</category> </item> <item> <author>Shane Powell <killerbee@…></author> <pubDate>Sun, 04 Aug 2013 02:12:39 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8933#comment:2 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8933#comment:2</guid> <description> <p> Where you using UDP? I was mainly taking about TCP. The same kind-of problem would affect UDP though. In UDP, you wouldn't get a EOF, you would get a zero length packet and you would lose data (hence your data lose). </p> </description> <category>Ticket</category> </item> <item> <author>Smueller@…</author> <pubDate>Sun, 04 Aug 2013 03:08:02 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8933#comment:3 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8933#comment:3</guid> <description> <p> I am using named pipes in Windows via the stream_handle interface. I should have mentioned this. </p> </description> <category>Ticket</category> </item> <item> <author>smueller@…</author> <pubDate>Mon, 05 Aug 2013 19:09:37 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8933#comment:4 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8933#comment:4</guid> <description> <p> Replying to <a class="ticket" href="https://svn.boost.org/trac10/ticket/8933#comment:2" title="Comment 2">Shane Powell &lt;killerbee@…&gt;</a>: </p> <blockquote class="citation"> <p> Where you using UDP? I was mainly taking about TCP. The same kind-of problem would affect UDP though. In UDP, you wouldn't get a EOF, you would get a zero length packet and you would lose data (hence your data lose). </p> </blockquote> <p> I did not get an EOF, but rather I had a successful result from a composed read operation which indicated that zero bytes were received. These read operations were performed on a windows::stream_handle object created from a named pipe handle. </p> <p> I'm planning on making further use of the ASIO library in our communication stack for TCP/IP, where I'm sure I would have seen the exact same symptoms. Thanks a ton for your investigation! </p> </description> <category>Ticket</category> </item> <item> <dc:creator>anonymous</dc:creator> <pubDate>Mon, 05 Aug 2013 19:26:32 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8933#comment:5 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8933#comment:5</guid> <description> <p> I haven't looked at the named pipes asio implementation, but it does sound like the same issue and my patch may work for you. </p> <p> Did you apply my patch and did it fix your problem? </p> </description> <category>Ticket</category> </item> <item> <author>smueller@…</author> <pubDate>Mon, 05 Aug 2013 20:30:06 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8933#comment:6 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8933#comment:6</guid> <description> <p> Replying to <a class="ticket" href="https://svn.boost.org/trac10/ticket/8933#comment:5" title="Comment 5">anonymous</a>: </p> <blockquote class="citation"> <p> I haven't looked at the named pipes asio implementation, but it does sound like the same issue and my patch may work for you. </p> <p> Did you apply my patch and did it fix your problem? </p> </blockquote> <p> I did, and it did fix the problem. I'm curious as to why the read completion handler is posted for execution by the io_service rather than executed in-place. Perhaps it is to control stack growth. I know I'd have to ask Mr. Kohlhoff about this one to know for certain. </p> </description> <category>Ticket</category> </item> <item> <dc:creator>anonymous</dc:creator> <pubDate>Wed, 07 Aug 2013 15:28:05 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8933#comment:7 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8933#comment:7</guid> <description> <p> For the record: it looks like this patch solves <a class="closed ticket" href="https://svn.boost.org/trac10/ticket/8935" title="#8935: Bugs: async_write successful but only wrote 64K, handler invoked with ... (closed: duplicate)">#8935</a> too. Same problem observed on send. </p> </description> <category>Ticket</category> </item> <item> <dc:creator>anonymous</dc:creator> <pubDate>Fri, 09 Aug 2013 22:40:31 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8933#comment:8 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8933#comment:8</guid> <description> <p> This looks like it may be the same underlying cause as <a class="ext-link" href="https://svn.boost.org/trac/boost/ticket/8967"><span class="icon">​</span>https://svn.boost.org/trac/boost/ticket/8967</a> </p> <p> I believe there are some issues in the Windows IOCP implementation that are described in ricket 8967 - the main ones are that on_pending should probably not be used and WSARecv/WSARecv should have null pointers for the bytes read parameter. </p> </description> <category>Ticket</category> </item> <item> <author>Shane Powell <killerbee@…></author> <pubDate>Sat, 10 Aug 2013 06:53:51 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8933#comment:9 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8933#comment:9</guid> <description> <p> This is the same problem but his patch is dangerous as the on_pending is used to delay the calling of the complete handler until after the returning of the winsock functions as they are not reentrant for a single socket. Removing the it completely would result in existing asio programs not working correctly. </p> </description> <category>Ticket</category> </item> <item> <author>simoncperkins@…</author> <pubDate>Sat, 10 Aug 2013 11:02:12 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8933#comment:10 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8933#comment:10</guid> <description> <p> I don't believe that handling the notification directly without on_pending is a problem for the following reasons: </p> <p> 1) The existing I/O operation has already completed (even though the API may not have returned) if another thread picks up the notification using <a class="missing wiki">GetQueueCompletionStatus</a>(). </p> <p> 2) None of the multi-threaded samples I looked at on MSDN worry about calling WSARecv/WSASend in a notification handler even though the originating API may not have returned. </p> <p> 3) The MSDN docs for WSARecv explicitly state that this is acceptable: </p> <p> <strong>* "The WSARecv function using overlapped I/O can be called from within the completion routine of a previous WSARecv, WSARecvFrom, WSASend or WSASendTo function. For a given socket, I/O completion routines will not be nested. This permits time-sensitive data transmissions to occur entirely within a preemptive context." </strong>* </p> </description> <category>Ticket</category> </item> <item> <author>Shane Powell <killerbee@…></author> <pubDate>Sat, 10 Aug 2013 20:56:10 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8933#comment:11 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8933#comment:11</guid> <description> <p> I think you may be correct then. I missed that part of the WSARecv doc's. </p> <p> So the op_ready / on_pending may not be strictly required. </p> <p> Altho, if you checkout my patch, there is actually more than the one place where ASIO uses the 'overlapped_contains_result'. They also use it in deferring the result until some other time which v1.54.0 has stuffed up on. </p> <p> I'm going to have a play around when I get time to see if it breaks any of my apps. </p> <p> Thanks for pointing that out. </p> </description> <category>Ticket</category> </item> <item> <author>smueller@…</author> <pubDate>Sun, 11 Aug 2013 01:33:07 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8933#comment:12 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8933#comment:12</guid> <description> <p> I think that this is right, but for the wrong reason. I also don't like the idea of re-posting to the IOcompletionport when we can execute the callback in-place. </p> <p> The comment about the document, however, is irrelevant, as it refers to the mode of WSASend/Recv/etc which uses the OVERLAPPED struct and a callback function. In this mode, the callback is executed as an APC which requires that the thread be placed in an alertable state. It is well-agreed upon that this does not make for a maintainable program. They aren't talking about iocompletionport, for which all you have to do is have your handle assigned to the iocompletionport for it to queue completion messages on threads calling <a class="missing wiki">GetQueuedCompletionStatus</a>. </p> </description> <category>Ticket</category> </item> <item> <author>Shane Powell <killerbee@…></author> <pubDate>Sun, 11 Aug 2013 07:10:58 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8933#comment:13 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8933#comment:13</guid> <description> <p> The io completion port setup for WASSend/Recv/etc do use a OVERLAPPED struct and the MSDN doc's can kind of be read either way I think. </p> <p> That is why I would like to play with it when I get a chance to see if I can break it. </p> <p> ASIO just side-steps the whole issue by delaying the handler complete until after the function has returned. It's been doing this for awhile now so it's been working for asio without any problems up until now. Maybe it's the "best" way to go overall that is the safest for most naive asio programs. </p> <p> The performance junkie in me tho would like to see it removed (or have a option to see it removed) if we can make it safe. </p> </description> <category>Ticket</category> </item> <item> <author>simoncperkins@…</author> <pubDate>Mon, 12 Aug 2013 13:41:37 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8933#comment:14 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8933#comment:14</guid> <description> <p> Agreed - that section in the WSARecv docs does specify completion routines but as I see it the it completion port is effectively a way of running a completion routine on a pool thread. </p> <p> I think the key point is that the existing I/O operation has finished when the handler is executed. Therefore, it is safe to call WSARecv/WSASend again because the existing buffers are no longer in use. </p> </description> <category>Ticket</category> </item> <item> <author>simoncperkins@…</author> <pubDate>Mon, 12 Aug 2013 13:45:41 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8933#comment:15 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8933#comment:15</guid> <description> <p> Shane, </p> <p> It would also be interesting to know if the polling of <a class="missing wiki">GetQueuedCompletionStatus</a> every 500 ms is still required or could the gqcs_timeout be set to INFINITE. </p> <p> The comment for this timeout is as follows: </p> <blockquote> <p> <em> Timeout to use with <a class="missing wiki">GetQueuedCompletionStatus</a>. Some versions of windows </em> have a "bug" where a call to <a class="missing wiki">GetQueuedCompletionStatus</a> can appear stuck <em> even though there are events waiting on the queue. Using a timeout helps </em> to work around the issue. </p> </blockquote> <p> I didn't find any problem with running my tests with a value of INFINITE - although that was only on Windows 7. </p> <p> If you had two service threads then a thread would be waking up four times a second even if the ASIO client/server was idle. Presumably, this could have power implications on some devices. </p> </description> <category>Ticket</category> </item> <item> <author>danobrega@…</author> <pubDate>Tue, 03 Sep 2013 15:06:29 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8933#comment:16 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8933#comment:16</guid> <description> <p> Hi all, </p> <p> We have recently started using boost::asio and have hit this problem. Sometimes our completion handler is called after a boost::asio::async_read (1.54) with 0 bytes_transferred and error == boost::asio::error::eof. </p> <p> Although the bytes_transferred argument is zero, our buffer is correctly read from the tcp socket. If we just ignore this error, our program works like nothing is happening. </p> <p> I've tried to debug the program and reached win_iocp_io_service.ipp, win_iocp_io_service::do_one at line 341. Upon inspecting, the line: </p> <pre class="wiki">bytes_transferred = op-&gt;OffsetHigh; </pre><p> sets bytes_transferred to zero. I don't know why we're trying to get the bytes read from op-&gt;<a class="missing wiki">OffsetHigh</a>. Looking at the OVERLAPPED microsoft documentation it seems we should be using the field <a class="missing wiki">InternalHigh</a> instead. In our case the OVERLAPPED <a class="missing wiki">InternalHigh</a> field has the correct number of bytes read, and they are actually read. Our read buffer has all the information it should have. </p> <p> With this information I looked for existing asio defects and found this one. It sure looks like the same problem: async reads produce random EOF errors. </p> <p> I never used windows async IO before, so I'm not sure of what I'm saying. I hope it helps though. I'm available if you want me to test/try any fix, just send me an email. </p> <p> Best Regards, danobrega </p> </description> <category>Ticket</category> </item> <item> <author>Shane Powell <killerbee@…></author> <pubDate>Tue, 03 Sep 2013 20:57:54 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8933#comment:17 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8933#comment:17</guid> <description> <p> Hi danobrega, </p> <p> Did you try my patch? </p> </description> <category>Ticket</category> </item> <item> <author>danobrega@…</author> <pubDate>Wed, 04 Sep 2013 17:47:03 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8933#comment:18 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8933#comment:18</guid> <description> <p> Hi again. </p> <p> We've compiled boost with Shanes patch and so far the problem has not occurred again. We are doing load tests and the problem used to occur several times and pretty fast... </p> <p> If we have any other problems related to this issue I'll add a new comment. If you want to test something else we're also available. </p> <p> Best Regards, danobrega </p> </description> <category>Ticket</category> </item> <item> <author>gladitz@…</author> <pubDate>Fri, 06 Sep 2013 14:07:41 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8933#comment:19 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8933#comment:19</guid> <description> <p> I think I've hit this one as well. After the Boost 1.54.0 upgrade I sporadically started getting "asio.misc:2" (which supposedly is EOF?) in my asynchronous read handler. Mostly this shows up in heavy load tests. </p> </description> <category>Ticket</category> </item> <item> <author>gladitz@…</author> <pubDate>Mon, 09 Sep 2013 13:00:58 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8933#comment:20 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8933#comment:20</guid> <description> <p> The patch seems to have fixed the problem for me as well. Thanks! </p> </description> <category>Ticket</category> </item> <item> <dc:creator>chris_kohlhoff</dc:creator> <pubDate>Tue, 01 Oct 2013 10:51:41 GMT</pubDate> <title>status changed; resolution set https://svn.boost.org/trac10/ticket/8933#comment:21 https://svn.boost.org/trac10/ticket/8933#comment:21 <ul> <li><strong>status</strong> <span class="trac-field-old">new</span> → <span class="trac-field-new">closed</span> </li> <li><strong>resolution</strong> → <span class="trac-field-new">fixed</span> </li> </ul> <p> Fixed on trunk in <a class="changeset" href="https://svn.boost.org/trac10/changeset/85741" title="Fix a regression on Windows where multiple threads are running an ...">[85741]</a>. Merged to release in <a class="changeset" href="https://svn.boost.org/trac10/changeset/85838" title="Merge asio from trunk. ...">[85838]</a>. </p> Ticket naidu.trk@… Mon, 24 Feb 2014 16:36:26 GMT <link>https://svn.boost.org/trac10/ticket/8933#comment:22 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8933#comment:22</guid> <description> <p> i am seeing this in 1.55 for boost local stream sockets on the client side. async_receive completion handler is being called with zero bytes transferred with out any errors. The completion handler is getting invoked infinitely with 0 bytes_transferred. </p> </description> <category>Ticket</category> </item> <item> <author>simoncperkins@…</author> <pubDate>Mon, 24 Feb 2014 17:00:40 GMT</pubDate> <title/> <link>https://svn.boost.org/trac10/ticket/8933#comment:23 </link> <guid isPermaLink="false">https://svn.boost.org/trac10/ticket/8933#comment:23</guid> <description> <p> As I have pointed our in the past, I think there are some issues with the implementation on Windows. One issue is that it passes a lpNumberOfBytesRevd/Sent to WSASend/WSARecv even though the docs state: </p> <p> <strong>Use NULL for this parameter if the lpOverlapped parameter is not NULL to avoid potentially erroneous results. This parameter can be NULL only if the lpOverlapped parameter is not NULL.</strong> </p> <p> Also, I believe there is a potential efficiency issue due to the overuse of <a class="missing wiki">PostQueuedCompletionStatus</a>. </p> <p> See this ticket for more information: </p> <p> <a class="ext-link" href="https://svn.boost.org/trac/boost/ticket/8967"><span class="icon">​</span>https://svn.boost.org/trac/boost/ticket/8967</a> </p> <p> We are running with the patch in that ticket applied to boost 1.55 </p> </description> <category>Ticket</category> </item> </channel> </rss>