Opened 10 years ago

Closed 10 years ago

#8323 closed Bugs (fixed)

boost::thread::try_join_for/try_join_until may block indefinitely due to a combination of problems in Boost.Thread and Boost.Chrono

Reported by: szakharchenko@… Owned by: viboes
Milestone: Boost 1.54.0 Component: thread
Version: Boost 1.50.0 Severity: Problem
Keywords: thread, chrono, steady_clock, try_jon_for, try_join_until, QueryPerformanceCounter Cc:

Description

In the following, try_join_test may hang indefinitely (using Boost 1.50):

void very_busy_thread()
{
    while (true) boost::this_thread::interruption_point();
}

void try_join_test()
{
    boost::thread t(very_busy_thread);
    for (int i=0;i<1000000;i++)
    {
        t.try_join_for(boost::chrono::milliseconds(0)); // This call may hang indefinitely
    }
    t.interrupt();
    t.join();
}

This has been reproduced on a real dual-core Intel Xeon system and also on 2 multicore virtual machines (one using VirtualBox, another with KVM).

Here's my (incomplete, possibly incorrect) analysis of the problem:

thread.try_join_for(boost::chrono::milliseconds(0)) called on thread which is still running

bool try_join_for(const chrono::duration<Rep, Period>& rel_time)
{
   return try_join_until(chrono::steady_clock::now() + rel_time);
}

steady_clock comes into play, which calls QueryPerformanceCounter the "steadiness" of which is disputable

here Clock is steady_clock

t = rep_ = 5591678518200

system_clock::time_point s_now = system_clock::now() = rep_=130084071987844000;

typename Clock::time_point c_now = Clock::now() = rep_ = 5591679763370 which is LESS THAN t = rep_ = 5591678518200 (QueryPerformanceCounter says hello)

try_join_until(s_now + ceil<nanoseconds>(t - c_now)) is called with a system time in the PAST AND in a past ~1msec ago (possibly important!)

thread.cpp:342:

   chrono::milliseconds rel_time= chrono::ceil<chrono::milliseconds>(tp-chrono::system_clock::now());

rel_time becomes EXACTLY -1;

detail::timeout(uintmax_t milliseconds_) accepts a _signed_ integer, converting it to a huge number;

target_time.is_sentinel() == false

target_time.remaining_milliseconds(), time_left.milliseconds > min_timer_wait_period is false (thread.cpp:482)

using_timer == false (not using SetWaitableTimer)

target_time becomes detail::timeout(time_left.milliseconds) (a huge number) (thread.cpp:501)

is_sentinel() returns true -> remaining_milliseconds returns remaining_time(win32::infinite) (thread_data.hpp:151)

time_left.milliseconds becomes a large number

using_timer is false, time_left.milliseconds is used as timeout which is huge

WaitForMultipleObjects waits for a nearly-infinite amount of time

Attachments (1)

test.cpp (1000 bytes ) - added by szakharchenko@… 10 years ago.
Simple test case (should hang with 99% certainty)

Download all attachments as: .zip

Change History (12)

comment:1 by viboes, 10 years ago

Hi,

It is strange to me that successive calls to steady_clock::now() are not monotonic. The trace you gave got the second call to steady_clock::now() 5591679763370 which is GT the fist one 5591678518200

                                                  t = rep_ = 5591678518200
    typename Clock::time_point c_now = Clock::now() = rep_ = 5591679763370
                                                                   ^

What am I missing?

comment:2 by viboes, 10 years ago

Maybe the problem is that the system_clock is taken too early

Coulod you try

          typename Clock::time_point  c_now = Clock::now();
          return try_join_until(system_clock::now() + ceil<nanoseconds>(t - c_now));

instead of

          system_clock::time_point     s_now = system_clock::now();
          typename Clock::time_point  c_now = Clock::now();
          return try_join_until(s_now + ceil<nanoseconds>(t - c_now));

comment:3 by viboes, 10 years ago

Owner: changed from Anthony Williams to viboes
Status: newassigned

in reply to:  1 comment:4 by viboes, 10 years ago

Replying to viboes:

Hi,

It is strange to me that successive calls to steady_clock::now() are not monotonic. The trace you gave got the second call to steady_clock::now() 5591679763370 which is GT the fist one 5591678518200

                                                  t = rep_ = 5591678518200
    typename Clock::time_point c_now = Clock::now() = rep_ = 5591679763370
                                                                   ^

What am I missing?

Any news on this?

by szakharchenko@…, 10 years ago

Attachment: test.cpp added

Simple test case (should hang with 99% certainty)

comment:5 by szakharchenko@…, 10 years ago

I've worked around my original problem, and can't offer more explanations or testing. I've attached a more deterministic testcase. In short, QueryPerformanceCounter isn't steady at least on some multicore processors, nor it should be, steady_clock isn't steady, but even when all of this will be fixed some moment in the future in Boost.Chrono, Boost.Thread will hang when told to try_join_until(now-exactly 1 msec) while any other moment in the past seems fine. So this is actually a combination of two problems, you may want to clone this ticket or something. I can reproduce it any day (you should be able too, with the attached testcase). Following is another stack trace.

Frame 0: boost::detail::timeout::remaining_milliseconds()
thread_data.hpp Line 151
is_sentinel must have returned true, going to return infinite wait time.
this: {start=1714216 milliseconds=18446744073709551615 relative=true}

Frame 1: boost::this_thread::interruptible_wait()
thread.cpp Line 513
using_timer: false
handle_to_wait_for: 0x38 (OK)
handle_count: 1
wait_handle_index: 0
interruption_index: ~0u
handles: [0x38,0,0]
timeout_index: ~0u

Frame 2: boost::thread::try_join_until()
thread.cpp Line 343
local_thread_info: not null,thread_handle={handle_to_manage=0x00000038 }
rel_time: {rep_=-1 } /* !!! */
rel_time is calculated with a difference of the system clock measurements
on line 342, so it could have run backwards there as well.

Frame 3:
boost::thread::try_join_until<Clock=steady>()
thread.hpp Line 355
t       {d_={rep_=3089317882890 } }
s_now   {d_={rep_=130091140137592000 }
c_now   {d_={rep_=3089318990860 }

t-c_now is 3089317882890-3089318990860 = -1107970, which gives a ceil<milliseconds> of -1.

comment:6 by viboes, 10 years ago

I see the problem now.

Could you try to replace

        template <class Clock, class Duration>
        bool try_join_until(const chrono::time_point<Clock, Duration>& t)
        {
          using namespace chrono;
          system_clock::time_point     s_now = system_clock::now();
          typename Clock::time_point  c_now = Clock::now();
          return try_join_until(s_now + ceil<nanoseconds>(t - c_now));
        }

by

        template <class Clock, class Duration>
        bool try_join_until(const chrono::time_point<Clock, Duration>& t)
        {
          using namespace chrono;
          system_clock::time_point     s_now = system_clock::now();
          bool joined= false;
          do {
            typename Clock::duration   d = ceil<nanoseconds>(t-Clock::now());
            if (d <= Clock::duration::zero()) return false; // in case the Clock::time_point t is already reached
            joined = try_join_until(s_now + d);
          } while (! joined);
          return true;
        }
Last edited 10 years ago by viboes (previous) (diff)

comment:7 by szakharchenko@…, 10 years ago

This change doesn't affect the problem which is presented in the attached test case, because the try_join_until overload that you patched isn't selected if the time point is based on the system_clock (the next one, which mentions the system clock, is selected). It does work with the steady clock, though.

comment:7 by szakharchenko@…, 10 years ago

This change doesn't affect the problem which is presented in the attached test case, because the try_join_until overload that you patched isn't selected if the time point is based on the system_clock (the next one, which mentions the system clock, is selected). It does work with the steady clock, though.

comment:8 by viboes, 10 years ago

Milestone: To Be DeterminedBoost 1.54.0

Committed revision [83652].

in reply to:  7 comment:9 by viboes, 10 years ago

Replying to szakharchenko@…:

This change doesn't affect the problem which is presented in the attached test case, because the try_join_until overload that you patched isn't selected if the time point is based on the system_clock (the next one, which mentions the system clock, is selected). It does work with the steady clock, though.

This is another issue that could be discussed. Be free to create a new ticket.

comment:11 by viboes, 10 years ago

Resolution: fixed
Status: assignedclosed

(In [83660]) Thread: merge from trunk 1.54. Fix #8027,#8323,#8337.

Note: See TracTickets for help on using tickets.