Opened 10 years ago
Closed 10 years ago
#8323 closed Bugs (fixed)
boost::thread::try_join_for/try_join_until may block indefinitely due to a combination of problems in Boost.Thread and Boost.Chrono
Reported by: | Owned by: | viboes | |
---|---|---|---|
Milestone: | Boost 1.54.0 | Component: | thread |
Version: | Boost 1.50.0 | Severity: | Problem |
Keywords: | thread, chrono, steady_clock, try_jon_for, try_join_until, QueryPerformanceCounter | Cc: |
Description
In the following, try_join_test may hang indefinitely (using Boost 1.50):
void very_busy_thread() { while (true) boost::this_thread::interruption_point(); } void try_join_test() { boost::thread t(very_busy_thread); for (int i=0;i<1000000;i++) { t.try_join_for(boost::chrono::milliseconds(0)); // This call may hang indefinitely } t.interrupt(); t.join(); }
This has been reproduced on a real dual-core Intel Xeon system and also on 2 multicore virtual machines (one using VirtualBox, another with KVM).
Here's my (incomplete, possibly incorrect) analysis of the problem:
thread.try_join_for(boost::chrono::milliseconds(0)) called on thread which is still running
bool try_join_for(const chrono::duration<Rep, Period>& rel_time) { return try_join_until(chrono::steady_clock::now() + rel_time); }
steady_clock comes into play, which calls QueryPerformanceCounter the "steadiness" of which is disputable
here Clock is steady_clock
t = rep_ = 5591678518200
system_clock::time_point s_now = system_clock::now() = rep_=130084071987844000;
typename Clock::time_point c_now = Clock::now() = rep_ = 5591679763370 which is LESS THAN t = rep_ = 5591678518200 (QueryPerformanceCounter says hello)
try_join_until(s_now + ceil<nanoseconds>(t - c_now)) is called with a system time in the PAST AND in a past ~1msec ago (possibly important!)
thread.cpp:342:
chrono::milliseconds rel_time= chrono::ceil<chrono::milliseconds>(tp-chrono::system_clock::now());
rel_time becomes EXACTLY -1;
detail::timeout(uintmax_t milliseconds_) accepts a _signed_ integer, converting it to a huge number;
target_time.is_sentinel() == false
target_time.remaining_milliseconds(), time_left.milliseconds > min_timer_wait_period is false (thread.cpp:482)
using_timer == false (not using SetWaitableTimer)
target_time becomes detail::timeout(time_left.milliseconds) (a huge number) (thread.cpp:501)
is_sentinel() returns true -> remaining_milliseconds returns remaining_time(win32::infinite) (thread_data.hpp:151)
time_left.milliseconds becomes a large number
using_timer is false, time_left.milliseconds is used as timeout which is huge
WaitForMultipleObjects waits for a nearly-infinite amount of time
Attachments (1)
Change History (12)
follow-up: 4 comment:1 by , 10 years ago
comment:2 by , 10 years ago
Maybe the problem is that the system_clock is taken too early
Coulod you try
typename Clock::time_point c_now = Clock::now(); return try_join_until(system_clock::now() + ceil<nanoseconds>(t - c_now));
instead of
system_clock::time_point s_now = system_clock::now(); typename Clock::time_point c_now = Clock::now(); return try_join_until(s_now + ceil<nanoseconds>(t - c_now));
comment:3 by , 10 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:4 by , 10 years ago
Replying to viboes:
Hi,
It is strange to me that successive calls to steady_clock::now() are not monotonic. The trace you gave got the second call to steady_clock::now() 5591679763370 which is GT the fist one 5591678518200
t = rep_ = 5591678518200 typename Clock::time_point c_now = Clock::now() = rep_ = 5591679763370 ^What am I missing?
Any news on this?
comment:5 by , 10 years ago
I've worked around my original problem, and can't offer more explanations or testing. I've attached a more deterministic testcase. In short, QueryPerformanceCounter isn't steady at least on some multicore processors, nor it should be, steady_clock isn't steady, but even when all of this will be fixed some moment in the future in Boost.Chrono, Boost.Thread will hang when told to try_join_until(now-exactly 1 msec) while any other moment in the past seems fine. So this is actually a combination of two problems, you may want to clone this ticket or something. I can reproduce it any day (you should be able too, with the attached testcase). Following is another stack trace.
Frame 0: boost::detail::timeout::remaining_milliseconds() thread_data.hpp Line 151 is_sentinel must have returned true, going to return infinite wait time. this: {start=1714216 milliseconds=18446744073709551615 relative=true} Frame 1: boost::this_thread::interruptible_wait() thread.cpp Line 513 using_timer: false handle_to_wait_for: 0x38 (OK) handle_count: 1 wait_handle_index: 0 interruption_index: ~0u handles: [0x38,0,0] timeout_index: ~0u Frame 2: boost::thread::try_join_until() thread.cpp Line 343 local_thread_info: not null,thread_handle={handle_to_manage=0x00000038 } rel_time: {rep_=-1 } /* !!! */ rel_time is calculated with a difference of the system clock measurements on line 342, so it could have run backwards there as well. Frame 3: boost::thread::try_join_until<Clock=steady>() thread.hpp Line 355 t {d_={rep_=3089317882890 } } s_now {d_={rep_=130091140137592000 } c_now {d_={rep_=3089318990860 } t-c_now is 3089317882890-3089318990860 = -1107970, which gives a ceil<milliseconds> of -1.
comment:6 by , 10 years ago
I see the problem now.
Could you try to replace
template <class Clock, class Duration> bool try_join_until(const chrono::time_point<Clock, Duration>& t) { using namespace chrono; system_clock::time_point s_now = system_clock::now(); typename Clock::time_point c_now = Clock::now(); return try_join_until(s_now + ceil<nanoseconds>(t - c_now)); }
by
template <class Clock, class Duration> bool try_join_until(const chrono::time_point<Clock, Duration>& t) { using namespace chrono; system_clock::time_point s_now = system_clock::now(); bool joined= false; do { typename Clock::duration d = ceil<nanoseconds>(t-Clock::now()); if (d <= Clock::duration::zero()) return false; // in case the Clock::time_point t is already reached joined = try_join_until(s_now + d); } while (! joined); return true; }
follow-up: 9 comment:7 by , 10 years ago
This change doesn't affect the problem which is presented in the attached test case, because the try_join_until overload that you patched isn't selected if the time point is based on the system_clock (the next one, which mentions the system clock, is selected). It does work with the steady clock, though.
follow-up: 9 comment:7 by , 10 years ago
This change doesn't affect the problem which is presented in the attached test case, because the try_join_until overload that you patched isn't selected if the time point is based on the system_clock (the next one, which mentions the system clock, is selected). It does work with the steady clock, though.
comment:9 by , 10 years ago
Replying to szakharchenko@…:
This change doesn't affect the problem which is presented in the attached test case, because the try_join_until overload that you patched isn't selected if the time point is based on the system_clock (the next one, which mentions the system clock, is selected). It does work with the steady clock, though.
This is another issue that could be discussed. Be free to create a new ticket.
comment:11 by , 10 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Hi,
It is strange to me that successive calls to steady_clock::now() are not monotonic. The trace you gave got the second call to steady_clock::now() 5591679763370 which is GT the fist one 5591678518200
What am I missing?