Opened 7 years ago
Last modified 5 years ago
#11728 new Bugs
interprocess::message_queue deadlocked when process exits unexpectedly on Windows
Reported by: | | Owned by: | Ion Gaztañaga |
---|---|---|---|
Milestone: | To Be Determined | Component: | interprocess |
Version: | Boost 1.59.0 | Severity: | Showstopper |
Keywords: | deadlock | Cc: | igaztanaga@…, daniel.kruegler@… |
Description
Suppose two processes communicate via a message_queue. On Windows operating systems, when either of the two exits unexpectedly (e.g., via a crash or a kill process command), the other gets deadlocked. Attached is code that reproduces the deadlock reliably. First, launch the server process (reader), and then the client process (writer). Kill the server, and the client gets deadlocked within try_send().
Attachments (2)
Change History (8)
by , 7 years ago
Attachment: | example.zip added |
---|
comment:1 by , 7 years ago
In other words, when a process exits unexpectedly, it is not guaranteed that locks owned by it with kernel or file-system persistence are released, at least on Windows platforms. This could lead to very serious problems. For example, suppose multiple processes are sending log records to a viewer process using an interprocess::message_queue. When any process, be it a logger or the viewer, crashes, all processes involved may be deadlocked.
comment:2 by , 7 years ago
Thanks for the ticket and the test case. This is a known issue. There is no guarantee of deadlock detection even on non-Windows systems, since robust mutexes are not mandatory, and there is no easy fix for this. In case this helps you, you can define the following before including interprocess:
BOOST_INTERPROCESS_ENABLE_TIMEOUT_WHEN_LOCKING
and optionally define a timeout (by default 10 seconds)
BOOST_INTERPROCESS_TIMEOUT_WHEN_LOCKING_DURATION_MS
This converts any infinite mutex lock into a timed lock and throws an exception if the timeout passes. This won't help if the message queue is waiting in the condition variable (one can in theory wait for a message for a long time). This could detect dead processes when trying to send a message. You can use timed receives as a workaround in reception.
It's the best we can do now. It's similar to inter-thread communication: if a thread dies or deadlocks, then you are lost.
In any case BOOST_INTERPROCESS_ENABLE_TIMEOUT_WHEN_LOCKING, which was experimental, should be documented. Let me know if this at least alleviates the problem.
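As a rough illustration of the idea behind this option (turning an unbounded lock into a timed lock that throws), here is a minimal standard-library sketch. It is not Boost.Interprocess's actual implementation; `lock_or_throw` and `stuck_mutex` are hypothetical names introduced only for this example, and `stuck_mutex` merely stands in for a lock whose owner has died.

```cpp
#include <chrono>
#include <mutex>
#include <stdexcept>
#include <thread>

// Hypothetical helper (not part of Boost.Interprocess): try to acquire `m`,
// but throw once `timeout` has elapsed instead of blocking forever.
template <class TimedMutex>
std::unique_lock<TimedMutex> lock_or_throw(TimedMutex& m,
                                           std::chrono::milliseconds timeout)
{
    std::unique_lock<TimedMutex> guard(m, std::defer_lock);
    if (!guard.try_lock_for(timeout)) {
        throw std::runtime_error("lock timed out; the owner may have died");
    }
    return guard;  // the caller now holds the lock until `guard` is destroyed
}

// Hypothetical stand-in for a lock whose owner crashed while holding it:
// every timed attempt waits out the full timeout and then fails.
struct stuck_mutex {
    template <class Rep, class Period>
    bool try_lock_for(const std::chrono::duration<Rep, Period>& d)
    {
        std::this_thread::sleep_for(d);
        return false;
    }
    void unlock() {}
};
```

With a healthy `std::timed_mutex` the helper returns an owning guard; with `stuck_mutex` it throws `std::runtime_error` after the timeout, which is the signal a sender could use to suspect a dead peer. As noted above, this does nothing for a process blocked in a condition-variable wait.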
comment:3 by , 7 years ago
Thanks for the reply. I had thought the library was no longer maintained.
As to the issue, I think there is a not-so-difficult fix, at least on Windows platforms. The key point is that the mutex kernel object in Windows is indeed robust. Please see the description of the return value WAIT_ABANDONED on [this](https://msdn.microsoft.com/en-us/library/windows/desktop/ms687032(v=vs.85).aspx) MSDN page.
With a robust mutex mechanism, the fix is fairly straightforward. First, we synthesize a name for the mutex kernel object based on the name of the message queue, which is supplied by the user. Then, every time a process tries to access the message queue, it first creates or opens the mutex with the synthesized name. Note that this does not affect the kernel or filesystem persistence nature of the message queue.
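The proposal above might be sketched as follows. This is only an assumption about how it could look, not code from the attachments or from Boost: `mutex_name_for_queue`, `lock_queue`, `unlock_queue`, and the `_mq_mutex` suffix are hypothetical. The Windows-specific part uses the Win32 calls `CreateMutexA`, `WaitForSingleObject`, `ReleaseMutex`, and `CloseHandle`, and is compiled only on Windows.

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical naming scheme: derive the kernel mutex name from the
// user-supplied message queue name.
inline std::string mutex_name_for_queue(const std::string& queue_name)
{
    return queue_name + "_mq_mutex";
}

#ifdef _WIN32
enum class lock_result { acquired, acquired_after_abandon, failed };

// Create or open the named mutex for this queue and wait for it.
// On success, `handle` must later be passed to unlock_queue().
inline lock_result lock_queue(const std::string& queue_name, HANDLE& handle)
{
    handle = CreateMutexA(nullptr, FALSE,
                          mutex_name_for_queue(queue_name).c_str());
    if (handle == nullptr) {
        return lock_result::failed;
    }
    const DWORD rc = WaitForSingleObject(handle, INFINITE);
    if (rc == WAIT_OBJECT_0) {
        return lock_result::acquired;
    }
    if (rc == WAIT_ABANDONED) {
        // We own the mutex, but the previous owner terminated without
        // releasing it, so the queue state may be inconsistent.
        return lock_result::acquired_after_abandon;
    }
    CloseHandle(handle);
    handle = nullptr;
    return lock_result::failed;
}

inline void unlock_queue(HANDLE handle)
{
    ReleaseMutex(handle);
    CloseHandle(handle);
}
#endif
```

The sketch only shows how a waiter learns that the previous owner died. What to do with the queue's own internal lock state in the `acquired_after_abandon` case is not specified in this ticket and is left open here.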
comment:4 by , 6 years ago
Cc: | added |
---|
comment:5 by , 6 years ago
We are suffering from this problem.
On our systems (Windows 7), the mentioned workaround does not help with either Boost 1.57 or 1.60.
Also, the problem is triggered by timed_send() in the same way as by try_send().
This is rather critical for us, so any help would be greatly appreciated!
by , 6 years ago
Attachment: | reproducer.zip added |
---|
Reproducer with workaround and using timed_send()
comment:6 by , 5 years ago
Can you elaborate on this solution? How does the use of a Windows named mutex prevent trying to lock an abandoned file-persistent mutex?
Code that reproduces the issue reliably