Opened 7 years ago

Last modified 5 years ago

#11728 new Bugs

interprocess::message_queue deadlocked when process exits unexpectedly on Windows

Reported by: Lingxi Li <lilingxi.cs@…> Owned by: Ion Gaztañaga
Milestone: To Be Determined Component: interprocess
Version: Boost 1.59.0 Severity: Showstopper
Keywords: deadlock Cc: igaztanaga@…, daniel.kruegler@…

Description

Suppose two processes communicate via a message_queue. On Windows operating systems, when either of the two exits unexpectedly (e.g., via a crash or a kill process command), the other gets deadlocked. Attached is code that reproduces the deadlock reliably. First, launch the server process (reader), and then the client process (writer). Kill the server, and the client gets deadlocked within try_send().

Attachments (2)

example.zip (1.0 KB ) - added by Lingxi Li <lilingxi.cs@…> 7 years ago.
Code that reproduces the issue reliably
reproducer.zip (1.4 KB ) - added by Arne.Brix@… 6 years ago.
Reproducer with workaround and using timed_send()

Change History (8)

by Lingxi Li <lilingxi.cs@…>, 7 years ago

Attachment: example.zip added

Code that reproduces the issue reliably

comment:1 by Lingxi Li <lilingxi.cs@…>, 7 years ago

In other words, when a process exits unexpectedly, it is not guaranteed that locks it owns with kernel or filesystem persistence are released, at least on Windows platforms. This could lead to very serious problems. For example, suppose multiple processes are sending log records to a viewer process using an interprocess::message_queue. When any process crashes, be it a logger or the viewer, all processes involved may be deadlocked.

comment:2 by Ion Gaztañaga, 7 years ago

Thanks for the ticket and the test case. This is a known issue: there is no guarantee of deadlock detection even on non-Windows systems, robust mutexes are not mandatory, and there is no easy fix. In case it helps, you can define the following before including any interprocess header:

BOOST_INTERPROCESS_ENABLE_TIMEOUT_WHEN_LOCKING

and optionally define a timeout (by default 10 seconds)

BOOST_INTERPROCESS_TIMEOUT_WHEN_LOCKING_DURATION_MS

This converts any infinite mutex lock into a timed lock and throws an exception if the timeout expires. It won't help if the message queue is waiting on the condition variable (one can in theory wait for a message for a long time), but it could detect dead processes when trying to send a message. For reception, you can use timed receives as a workaround.
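As a configuration sketch only, the two macros named above could be set like this. The 5000 ms value is an arbitrary illustration, not a recommendation, and whether the macros take effect can depend on the Boost version in use.

```cpp
// Both macros must be visible before the first Boost.Interprocess header
// is included, otherwise the headers are compiled without the timeout.
#define BOOST_INTERPROCESS_ENABLE_TIMEOUT_WHEN_LOCKING

// Optional: override the default timeout. 5000 is an illustrative value.
#define BOOST_INTERPROCESS_TIMEOUT_WHEN_LOCKING_DURATION_MS 5000

#include <boost/interprocess/ipc/message_queue.hpp>
```

Passing the same definitions on the compiler command line (for example with `-D` or `/D`) avoids depending on include order across translation units.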

It's the best we can do for now. It's similar to inter-thread communication: if a thread dies or deadlocks, you are lost.

In any case BOOST_INTERPROCESS_ENABLE_TIMEOUT_WHEN_LOCKING, which was experimental, should be documented. Let me know if this at least alleviates the problem.

comment:3 by Lingxi Li <lilingxi.cs@…>, 7 years ago

Thanks for the reply. I had thought the library was no longer maintained.

As to the issue, I think there is a not-so-difficult fix, at least on Windows platforms. The key point is that the mutex kernel object in Windows is indeed robust. Please see the description of the return value WAIT_ABANDONED on [this](https://msdn.microsoft.com/en-us/library/windows/desktop/ms687032(v=vs.85).aspx) MSDN page.

With a robust mutex mechanism, the fix is fairly straightforward. First, we synthesize a name for the mutex kernel object based on the user-supplied name of the message queue. Then, every time a process tries to access the message queue, it first creates or opens the mutex with the synthesized name. Note that this does not affect the kernel or filesystem persistence of the message queue.
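A minimal sketch of the mechanism described above, under stated assumptions: `synthesize_mutex_name`, `acquire_queue_guard`, `guard_result`, and the `_mq_guard` suffix are hypothetical names invented for illustration and are not part of Boost.Interprocess. The Windows-specific part uses only `CreateMutexA`, `WaitForSingleObject`, and `CloseHandle`. It shows how an abandoned mutex is reported to the next waiter; it does not show how the queue's internal state would be recovered afterwards.

```cpp
#include <cassert>
#include <string>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical naming scheme: derive the guard mutex name from the
// user-supplied queue name. The suffix is an assumption of this sketch.
inline std::string synthesize_mutex_name(const std::string& queue_name)
{
    return queue_name + "_mq_guard";
}

#ifdef _WIN32
// Outcome of trying to take the guard mutex.
enum class guard_result { acquired, abandoned, failed };

// Hypothetical helper: create or open the named mutex, then wait for it.
// On `acquired` or `abandoned` the caller owns the mutex and must later
// call ReleaseMutex() and CloseHandle() on out_handle.
inline guard_result acquire_queue_guard(const std::string& queue_name,
                                        HANDLE& out_handle,
                                        DWORD timeout_ms = INFINITE)
{
    const std::string name = synthesize_mutex_name(queue_name);

    // CreateMutexA opens the existing object if the name is already in use.
    out_handle = ::CreateMutexA(nullptr, FALSE, name.c_str());
    if (out_handle == nullptr) {
        return guard_result::failed;
    }

    switch (::WaitForSingleObject(out_handle, timeout_ms)) {
    case WAIT_OBJECT_0:
        return guard_result::acquired;
    case WAIT_ABANDONED:
        // The previous owner terminated without releasing the mutex.
        // Ownership passes to this caller, but whatever the mutex was
        // protecting may have been left in an inconsistent state.
        return guard_result::abandoned;
    default:
        // WAIT_TIMEOUT or WAIT_FAILED: this caller does not own the mutex.
        ::CloseHandle(out_handle);
        out_handle = nullptr;
        return guard_result::failed;
    }
}
#endif
```

A caller would treat `guard_result::abandoned` as a signal that the peer died while holding the guard, and decide at that point whether to rebuild the queue or report an error.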

comment:4 by daniel.kruegler@…, 6 years ago

Cc: daniel.kruegler@… added

comment:5 by Arne.Brix@…, 6 years ago

We are suffering from this problem.

On our systems (Windows 7), the workaround mentioned above does not help with either Boost 1.57 or 1.60.

The problem is also triggered by timed_send(), in the same way as by try_send().

This is rather critical for us, so any help would be greatly appreciated!

by Arne.Brix@…, 6 years ago

Attachment: reproducer.zip added

Reproducer with workaround and using timed_send()

comment:6 by davids@…, 5 years ago

Can you elaborate on this solution? How does using a Windows named mutex prevent a process from trying to lock an abandoned file-persistent mutex?
