#8933 closed Bugs (fixed)
on the windows platform async reads with multiple threads can produce random EOF errors
Reported by: | Owned by: | chris_kohlhoff | |
---|---|---|---|
Milestone: | To Be Determined | Component: | asio |
Version: | Boost 1.54.0 | Severity: | Showstopper |
Keywords: | Cc: |
Description
After upgrading upgrading boost to 1.54.0 I started getting lots of eof errors out of some of my ASIO based applications that didn't make sense.
My applications use the single io_service with a thread pool pattern. When the thread pool contains more than one thread I started getting random eof errors whenever the network traffic started to get heavy.
Attached is an example program that can reproduce the problem.
The application built with boost 1.53.0 works fine. Built with boost 1.54.0 fails.
I have tried visual studio version 2008 and 2012 with the same results.
To reproduce the problem I use the following parameters with two instances on the same machine: SendReceiveTest.exe server 127.0.0.1 100 1 1000 10 SendReceiveTest.exe client 127.0.0.1 100 2 0 0
The client will error out on the read with a eof error.
I tracked the problem back to a posted IO completion event that was generated from a on_pending call. The event gets converted to eof because ::GetLastError always returns '0' and the transferred bytes are always 'zero' from the on_pending call. Looking at the changes between boost asio v1.53.0 and v1.54.0 I spotted in asio\detail\impl\win_iocp_io_service.ipp that some PostQueuedCompletionStatus calls where changed form passing 'overlapped_contains_result' to passing '0'. Changing them back to 'overlapped_contains_result' fixes the problem.
Attachments (2)
Change History (25)
by , 9 years ago
Attachment: | SendReceiveTest.zip added |
---|
comment:1 by , 9 years ago
I saw this in my Windows application after upgrading to Boost 1.54. I noticed that, during a stress test, frequently the bytes received field in my ReadHandler was getting zero and an error_code of zero while a non-zero number of bytes was requested from an async_read call. In my case this caused stream corruption which could only be resolved in non-library space by passing in the number of bytes requested for the read into the handler and using that instead of the bytes received value passed to the ReadHandler.
Strangely enough, in my testing, it seemed that activating BOOST_ASIO_ENABLE_HANDLER_TRACKING made this problem pop up more frequently.
follow-up: 4 comment:2 by , 9 years ago
Where you using UDP? I was mainly taking about TCP. The same kind-of problem would affect UDP though. In UDP, you wouldn't get a EOF, you would get a zero length packet and you would lose data (hence your data lose).
comment:3 by , 9 years ago
I am using named pipes in Windows via the stream_handle interface. I should have mentioned this.
comment:4 by , 9 years ago
Replying to Shane Powell <killerbee@…>:
Where you using UDP? I was mainly taking about TCP. The same kind-of problem would affect UDP though. In UDP, you wouldn't get a EOF, you would get a zero length packet and you would lose data (hence your data lose).
I did not get an EOF, but rather I had a successful result from a composed read operation which indicated that zero bytes were received. These read operations were performed on a windows::stream_handle object created from a named pipe handle.
I'm planning on making further use of the ASIO library in our communication stack for TCP/IP, where I'm sure I would have seen the exact same symptoms. Thanks a ton for your investigation!
follow-up: 6 comment:5 by , 9 years ago
I haven't looked at the named pipes asio implementation, but it does sound like the same issue and my patch may work for you.
Did you apply my patch and did it fix your problem?
comment:6 by , 9 years ago
Replying to anonymous:
I haven't looked at the named pipes asio implementation, but it does sound like the same issue and my patch may work for you.
Did you apply my patch and did it fix your problem?
I did, and it did fix the problem. I'm curious as to why the read completion handler is posted for execution by the io_service rather than executed in-place. Perhaps it is to control stack growth. I know I'd have to ask Mr. Kohlhoff about this one to know for certain.
comment:7 by , 9 years ago
For the record: it looks like this patch solves #8935 too. Same problem observed on send.
comment:8 by , 9 years ago
This looks like it may be the same underlying cause as https://svn.boost.org/trac/boost/ticket/8967
I believe there are some issues in the Windows IOCP implementation that are described in ricket 8967 - the main ones are that on_pending should probably not be used and WSARecv/WSARecv should have null pointers for the bytes read parameter.
comment:9 by , 9 years ago
This is the same problem but his patch is dangerous as the on_pending is used to delay the calling of the complete handler until after the returning of the winsock functions as they are not reentrant for a single socket. Removing the it completely would result in existing asio programs not working correctly.
comment:10 by , 9 years ago
I don't believe that handling the notification directly without on_pending is a problem for the following reasons:
1) The existing I/O operation has already completed (even though the API may not have returned) if another thread picks up the notification using GetQueueCompletionStatus().
2) None of the multi-threaded samples I looked at on MSDN worry about calling WSARecv/WSASend in a notification handler even though the originating API may not have returned.
3) The MSDN docs for WSARecv explicitly state that this is acceptable:
* "The WSARecv function using overlapped I/O can be called from within the completion routine of a previous WSARecv, WSARecvFrom, WSASend or WSASendTo function. For a given socket, I/O completion routines will not be nested. This permits time-sensitive data transmissions to occur entirely within a preemptive context." *
comment:11 by , 9 years ago
I think you may be correct then. I missed that part of the WSARecv doc's.
So the op_ready / on_pending may not be strictly required.
Altho, if you checkout my patch, there is actually more than the one place where ASIO uses the 'overlapped_contains_result'. They also use it in deferring the result until some other time which v1.54.0 has stuffed up on.
I'm going to have a play around when I get time to see if it breaks any of my apps.
Thanks for pointing that out.
comment:12 by , 9 years ago
I think that this is right, but for the wrong reason. I also don't like the idea of re-posting to the IOcompletionport when we can execute the callback in-place.
The comment about the document, however, is irrelevant, as it refers to the mode of WSASend/Recv/etc which uses the OVERLAPPED struct and a callback function. In this mode, the callback is executed as an APC which requires that the thread be placed in an alertable state. It is well-agreed upon that this does not make for a maintainable program. They aren't talking about iocompletionport, for which all you have to do is have your handle assigned to the iocompletionport for it to queue completion messages on threads calling GetQueuedCompletionStatus.
comment:13 by , 9 years ago
The io completion port setup for WASSend/Recv/etc do use a OVERLAPPED struct and the MSDN doc's can kind of be read either way I think.
That is why I would like to play with it when I get a chance to see if I can break it.
ASIO just side-steps the whole issue by delaying the handler complete until after the function has returned. It's been doing this for awhile now so it's been working for asio without any problems up until now. Maybe it's the "best" way to go overall that is the safest for most naive asio programs.
The performance junkie in me tho would like to see it removed (or have a option to see it removed) if we can make it safe.
comment:14 by , 9 years ago
Agreed - that section in the WSARecv docs does specify completion routines but as I see it the it completion port is effectively a way of running a completion routine on a pool thread.
I think the key point is that the existing I/O operation has finished when the handler is executed. Therefore, it is safe to call WSARecv/WSASend again because the existing buffers are no longer in use.
comment:15 by , 9 years ago
Shane,
It would also be interesting to know if the polling of GetQueuedCompletionStatus every 500 ms is still required or could the gqcs_timeout be set to INFINITE.
The comment for this timeout is as follows:
Timeout to use with GetQueuedCompletionStatus. Some versions of windows have a "bug" where a call to GetQueuedCompletionStatus can appear stuck even though there are events waiting on the queue. Using a timeout helps to work around the issue.
I didn't find any problem with running my tests with a value of INFINITE - although that was only on Windows 7.
If you had two service threads then a thread would be waking up four times a second even if the ASIO client/server was idle. Presumably, this could have power implications on some devices.
comment:16 by , 9 years ago
Hi all,
We have recently started using boost::asio and have hit this problem. Sometimes our completion handler is called after a boost::asio::async_read (1.54) with 0 bytes_transferred and error == boost::asio::error::eof.
Although the bytes_transferred argument is zero, our buffer is correctly read from the tcp socket. If we just ignore this error, our program works like nothing is happening.
I've tried to debug the program and reached win_iocp_io_service.ipp, win_iocp_io_service::do_one at line 341. Upon inspecting, the line:
bytes_transferred = op->OffsetHigh;
sets bytes_transferred to zero. I don't know why we're trying to get the bytes read from op->OffsetHigh. Looking at the OVERLAPPED microsoft documentation it seems we should be using the field InternalHigh instead. In our case the OVERLAPPED InternalHigh field has the correct number of bytes read, and they are actually read. Our read buffer has all the information it should have.
With this information I looked for existing asio defects and found this one. It sure looks like the same problem: async reads produce random EOF errors.
I never used windows async IO before, so I'm not sure of what I'm saying. I hope it helps though. I'm available if you want me to test/try any fix, just send me an email.
Best Regards, danobrega
comment:18 by , 9 years ago
Hi again.
We've compiled boost with Shanes patch and so far the problem has not occurred again. We are doing load tests and the problem used to occur several times and pretty fast...
If we have any other problems related to this issue I'll add a new comment. If you want to test something else we're also available.
Best Regards, danobrega
comment:19 by , 9 years ago
I think I've hit this one as well. After the Boost 1.54.0 upgrade I sporadically started getting "asio.misc:2" (which supposedly is EOF?) in my asynchronous read handler. Mostly this shows up in heavy load tests.
comment:21 by , 9 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
comment:22 by , 9 years ago
i am seeing this in 1.55 for boost local stream sockets on the client side. async_receive completion handler is being called with zero bytes transferred with out any errors. The completion handler is getting invoked infinitely with 0 bytes_transferred.
comment:23 by , 9 years ago
As I have pointed our in the past, I think there are some issues with the implementation on Windows. One issue is that it passes a lpNumberOfBytesRevd/Sent to WSASend/WSARecv even though the docs state:
Use NULL for this parameter if the lpOverlapped parameter is not NULL to avoid potentially erroneous results. This parameter can be NULL only if the lpOverlapped parameter is not NULL.
Also, I believe there is a potential efficiency issue due to the overuse of PostQueuedCompletionStatus.
See this ticket for more information:
https://svn.boost.org/trac/boost/ticket/8967
We are running with the patch in that ticket applied to boost 1.55
Test application that reproduces the problem