Opened 9 years ago

Closed 8 years ago

#9856 closed Bugs (wontfix)

[windows] condition_variable::wait_for returns wrong cv_status on timeout.

Reported by: pi88el@… Owned by: Niall Douglas
Milestone: Component: thread
Version: Boost 1.57.0 Severity: Problem
Keywords: condition_variable wait_for cv_status Cc:

Description

The following bug seems to work on *nix systems. But not on Windows 7 with Visual Studio 2013 Express.
If a condition variable is waiting for a defined amount of time, sometimes (nearly always) the returned cv_status is wrong.

int main() {
    mutex mtx;
    condition_variable cond;
    
    unique_lock<mutex> lk(mtx);
    
    std::cout << BOOST_VERSION << "\n";
    
    for(int i=0; i<10; i++) {
        std::cout << i << " : " << chrono::steady_clock::now() << "\n";
        if (cv_status::no_timeout == cond.wait_for(lk, chrono::seconds(1)))
            std::cout << "no_timeout\n";
    }
}

While the condition is never notified, i get no_timeout.

debuging the code:

steady_clock::time_point c_now = steady_clock::now();
do_wait(lock, ceil<milliseconds>(d).count());
return steady_clock::now() - c_now < d ? cv_status::no_timeout :
                                         cv_status::timeout;

steady_clock::now() - c_now < d is true. It seems do_wait is sleeping to short.

Change History (25)

comment:1 by pi88el@…, 9 years ago

This is also happening with wait_until.

comment:2 by viboes, 9 years ago

Owner: changed from Anthony Williams to viboes
Status: newassigned

comment:3 by viboes, 9 years ago

Summary: condition_variable::wait_for returns wrong cv_status on timeout.[windows] condition_variable::wait_for returns wrong cv_status on timeout.

comment:4 by sclark@…, 8 years ago

This bug seems closely related to, or the same as, something I found on Linux in Boost 1.54, 1.48, and 1.46. When passed a deadline represented as a system_time, condition_variable::timed_wait() returns false (i.e. timed out) after a little under a millisecond. For more info and a test program, see the boost-users mailing list archive, thread subject "[Thread] thousands of spurious wakeups from timed_wait() per second" around Sept. 11, 2014.

(Sorry, I can't get a link to it past trac's spam filter. The spam filter says "Please respond to the following" and nothing follows.)

comment:5 by viboes, 8 years ago

Milestone: To Be DeterminedBoost 1.57.0

comment:6 by Niall Douglas, 8 years ago

I am unable to replicate this issue with trunk 1.57 on VS2013 x86 on Win8.1.

comment:7 by viboes, 8 years ago

@pi88el,@sclark please, could you check on trunk?

comment:8 by viboes, 8 years ago

Milestone: Boost 1.57.0
Resolution: worksforme
Status: assignedclosed

comment:9 by yinon, 8 years ago

It reproduces on 1.57. Windows 8.1 Visual Studio 2013 Ultimate

Using the function

boost::condition_variable condition; condition.wait_until(lock, _waitUntil)

comment:10 by yinon@…, 8 years ago

Resolution: worksforme
Status: closedreopened
Version: Boost 1.55.0Boost 1.57.0

comment:11 by Niall Douglas, 8 years ago

I had a look for the testcase I wrote for #9856 but it looks like I deleted it last time I reset my Boost installation.

However, if I am to be blunt here, this must be a behaviour which differs between people's machines. Perhaps an AMD vs Intel difference, or maybe a HPET difference? On Windows it also matters hugely if any multimedia program is running because then the Windows context switch granularity goes from ~43ms to ~1ms. A multimedia program famous for this is the Chrome web browser.

Obviously whether a multimedia program is running should make no difference to Boost code. But I am trying to think of a repeatable demonstration, as otherwise finding and fixing any problem is very hard.

comment:12 by yinon@…, 8 years ago

I have reproduced the problem on a Parallels Desktop 10 Virtual Machine, running on a MacBook Pro, Processor: Intel Core i5.

comment:13 by Niall Douglas, 8 years ago

I've put together the following test case test_9856.cpp:

#include "boost/thread.hpp"
#include <iostream>

using namespace boost;

int main() {
    mutex mtx;
    condition_variable cond;
    
    unique_lock<mutex> lk(mtx);
    
    std::cout << BOOST_VERSION << "\n";
    
    for(int i=0; i<10; i++) {
        //std::cout << i << " : " << chrono::steady_clock::now().count() << "\n";
        if (cv_status::timeout != cond.wait_for(lk, chrono::seconds(1)))
        {
            std::cout << "ERROR: Wrong return code!\n";
            return 1;
        }
    }
    return 0;
}

I'll test it on Windows when I reboot out of Linux later today. I've also lined up a KVM virtualised Windows to test upon.

Niall

comment:14 by viboes, 8 years ago

Owner: changed from viboes to Niall Douglas
Status: reopenednew

comment:15 by Niall Douglas, 8 years ago

I've successfully replicated this bug! You have to loop running the test case a few times, but it'll eventually appear. This was on Windows 8.1 x64.

I'll see if I can fix it next Saturday.

Niall

comment:16 by Niall Douglas, 8 years ago

I have an update on this issue. Firstly, I have committed a testcase for this to Boost.Thread trunk, so it is now being tested as part of regular unit testing.

I have some quite scary news though. Boost.Thread decides whether the wait has timed out by comparing steady_clock (really QueryPerformanceCounter) from before and after, and if the wait took the timeout or longer it assumes that timeout occurred. Unfortunately, WaitForMultipleObjectsEx() provides no guarantees that it actually waits for the timeout requested - you might request 50ms, but might actually get 10ms. Here is an example program:

#include <windows.h>
#include <stdio.h>
#include <chrono>

int main(void)
{
  ULONG ulDelay_ms = 20;
  HANDLE hSemaphoreDelay = CreateSemaphore(NULL, 0, 1, NULL);

  for (size_t n = 0; n < 50; n++)
  {
    while (1) {
      auto begin = std::chrono::high_resolution_clock::now();
      ULONG hr = WaitForSingleObject(hSemaphoreDelay, ulDelay_ms);
      auto end = std::chrono::high_resolution_clock::now();
      auto diff = end - begin;

      if (hr == WAIT_ABANDONED)
        printf("Wait Abandoned ");
      else if (hr == WAIT_TIMEOUT)
        printf("Timed out ");
      else
        printf("Signaled ");
      DWORD lTDelta = std::chrono::duration_cast<std::chrono::milliseconds>(diff).count();
      printf("Target Wait Interval: %u Real Wait Interval: %u (%u)\n", ulDelay_ms, lTDelta, diff.count());
      if (lTDelta >= ulDelay_ms) break;
    }
    printf("\n");
  }

  CloseHandle(hSemaphoreDelay);
  return 0;
}

... and indeed very occasionally it is returning after 16-18ms instead of the requested 20ms at the same frequency that Boost.Thread is not returning timeout.

This is scary because I believe every single wait routine in Windows in Boost.Thread, and possibly other Boost libraries, will need to be audited and potentially refactored. I'll raise it on boost-dev tomorrow morning. I believe this new early timeout feature was probably introduced when Microsoft added timer coalescing to Windows to save battery on mobile devices, and so only appeared in Windows 7 and later. The fact it hasn't presented sooner is worrying.

Niall

comment:17 by Niall Douglas, 8 years ago

FYI the current consensus in the discussion on boost-dev is that this should be a wontfix. See http://comments.gmane.org/gmane.comp.lib.boost.devel/257246.

Personally speaking I'd prefer to adjust the code to produce fewer spurious wakeups, but not completely prevent them. We'll see if Microsoft come back to us to say what their STL does.

comment:18 by Niall Douglas, 8 years ago

Did some testing of the performance of passing a timeout to WaitForMultipleObjects versus using SetWaitableTimerEx:

Where underlying object is not signalled, 1ms timeout:

WaitForMultipleObjects alone: 1293.457580 microsecs

CreateWaitableTimer + SetWaitableTimer + WaitForMultipleObjects + CloseHandle: 1000.870380 microsecs

Where underlying object is signalled:

WaitForMultipleObjects alone: 0.246493 microsecs

SetWaitableTimer + WaitForMultipleObjects: 1.062579 microsecs

CreateWaitableTimer + SetWaitableTimer + WaitForMultipleObjects + CloseHandle: 2.055087 microsecs

Moving the creation of the waitable timer outside of interruptible_wait() will be a pain - every interruptible_wait() user will need to cache a copy, or else we keep a copy in thread local storage. Given that the number of occasions where the underlying semaphore goes signalled in between the check before starting the wait and the wait beginning will be few, I reckon the 8x slowdown for timeouts < 20ms will be worth the ease of implementation. So this is what I will do.

Niall

comment:19 by viboes, 8 years ago

It seems this is related to #10967 and #9720.

in reply to:  19 comment:20 by Niall Douglas, 8 years ago

Replying to viboes:

It seems this is related to #10967 and #9720.

It does appear that Windows respects the timeout you pass to a waitable timer, whilst it fuzzes any timeout given to WaitForMultipleObjects to be either too short or too long (on average 30% too long).

BTW I have more empirical findings. It turns out Windows ignores all TolerableDelay values <= 26 ms. I have no idea why. But 27 ms is definitely the minimum for it to have any effect at all.

Windows usually ticks at 15.6 ms, so a tolerable delay of 27 ms is about 1.75 context switches. Do I have permission to set that minimum for all timers i.e.

27 ms <= tolerable delay <= 5% of duration between now and timeout.

Do we need an upper bound on tolerable delay? I can't see Windows running for ten seconds without at least one interrupt firing though ...

Niall

comment:21 by viboes, 8 years ago

Niall, are these data dependent on the windows version?

comment:22 by Niall Douglas, 8 years ago

@viboes I only have access to Windows 8.1 nowadays. They added coalesced timer support in Vista and a tickless kernel in Windows 8, so I would assume each Windows will be a little bit different.

The code I have written should compile and work on Win XP however. As with GetTickCount64(), it emulates the SetWaitableTimerEx() function on XP and earlier.

Niall

comment:23 by viboes, 8 years ago

I let you take the best decision.

comment:24 by Niall Douglas, 8 years ago

Addition of coalesced timers is in https://github.com/boostorg/thread/pull/44.

The fact that Thread doesn't return timeout up to 10% of the time on Windows is wontfix.

comment:25 by Niall Douglas, 8 years ago

Resolution: wontfix
Status: newclosed
Note: See TracTickets for help on using tickets.