Opened 12 years ago

Closed 10 years ago

Last modified 10 years ago

#4345 closed Bugs (fixed)

thread::id and joining problem with cascade of threads

Reported by: bartek szurgot <bartosz.szurgot@…> Owned by: viboes
Milestone: Boost 1.50.0 Component: thread
Version: Boost 1.43.0 Severity: Problem
Keywords: thread::id Cc:

Description

i believe there is a problem when saving thread's ids in an external (for a thread) collection, when joining cascade of threads and collection lives on.

example code, showing this issue has been attached this this report. what happens there:

  1. main() creates collection of thread::id objects (i.e. Data)
  2. main() creates two threads (in classes A and B)
  3. ThreadJoiner helper class interrupts and joins threads in d-tor, and so main() int+join thread in B class' instance, that in turn int+join A class' instance thread (member of B object)
  4. check if threads exited, as they should.

each thread increments global counter when entering operator()() and decrements it when leaving operator()()'s scope, so when leaving scope of ThreadJoiner root object all threads should be already joined. it happens this way (as expected) as long as one does not save thread::ids. if they are saved, and live longer then thread, threads are not (always) joined properly (in attached example program: assertion on global counter equals 0 fails). it looks like race, but is reproducible in ~95% of the cases.

problem does not seem to appear when threads are run directly, and joined from main as well (not cascade).

in example code there are 2 ways to make code work as expected:

  1. uncomment main.cpp:139 (deallocation of collection of saved IDs before calling joins in d-tors)
  2. change main.cpp:32 from '#if 1' to '#if 0' - this changes collection to hold std::string (generated via stream, from boost::thread::id) from boost::thread::id

i was able to reproduce this issue with the same results with boost versions:

  1. 1.38
  2. 1.40
  3. 1.43
  4. today's trunk

toolchain is g++ 4.4.x under Linux (Debian and Ubuntu).

to build and run example code extract bzip2 archive, enter directory and run: make && gen/debug/a.out

Attachments (4)

thread_bug_test.tar.bz2 (2.8 KB ) - added by bartek szurgot <bartosz.szurgot@…> 12 years ago.
thread_bug_test.tar.2.bz2 (2.9 KB ) - added by anonymous 12 years ago.
patch_thread_id.diff.bz2 (766 bytes ) - added by anonymous 12 years ago.
patch_cond_warn.diff.bz2 (386 bytes ) - added by anonymous 12 years ago.

Download all attachments as: .zip

Change History (23)

by bartek szurgot <bartosz.szurgot@…>, 12 years ago

Attachment: thread_bug_test.tar.bz2 added

comment:1 by anonymous, 12 years ago

g_count is not protected against race conditions. You should use an atomic integer or use a mutext to protect g_count.

comment:2 by Steven Watanabe, 12 years ago

This code is very hard to follow, but here's what's happening in detail:

  • An object of type B is created
    • This creates a shared_ptr with a new A object
      • The new A object starts thread A
  • Thread B is created, and receives a copy of the B object, this copy is not destroyed until the last thread::id for this thread is gone.
  • Thread B runs to completion and is joined. The last copy of B has not yet been destroyed, so there is no attempt to join thread A.
  • Kaboom.

I don't think this should be considered a bug in the library. If you want to join another thread when a thread exits, join at the end of the thread function instead of in the destructor.

in reply to:  1 ; comment:3 by bartek szurgot <bartosz.szurgot@…>, 12 years ago

Replying to anonymous:

g_count is not protected against race conditions. You should use an atomic integer or use a mutext to protect g_count.

it's true, what you say - the same goes for g_started. i left it this way intentionally since i work on x86 where ++x on integer is atomic any way and i didn't want to complicate example code any more.

in reply to:  2 comment:4 by bartek szurgot <bartosz.szurgot@…>, 12 years ago

Replying to steven_watanabe:

This code is very hard to follow, but here's what's happening in detail:

this is the easy case - trust me. i've spent last 2 days tracking it down in "real" system we're developing. ;)

  • An object of type B is created
    • This creates a shared_ptr with a new A object
      • The new A object starts thread A
  • Thread B is created, and receives a copy of the B object, this copy is not destroyed until the last thread::id for this thread is gone.

this is the key - thread::id holds identifier of a thread as well as handle to it. it means that thread::id has in fact double responsibility, thus non-obvious behavior (id!=handle).

  • Thread B runs to completion and is joined. The last copy of B has not yet been destroyed, so there is no attempt to join thread A.
  • Kaboom.

I don't think this should be considered a bug in the library. If you want to join another thread when a thread exits, join at the end of the thread function instead of in the destructor.

technically this can be done, though i don't think it's a proper way - by doing so one must remember in each thread's operator()()'s implementation to explicitly join all owned threads, instead of doing it in error-prone, RAII-way (d-tor perform clean up - here: int+join).

a took a look in thread::id implementation - it holds shared_ptr to "thread's data" (i assume this is a handle to thread itself). my suggestion is to change it to weak_ptr, so that thread::id would become only "observer" of the thread, that can be compared, put into stream, etc... this would eliminate double responsibility of thread::id. alternatively only data required to identify thread should be stored in thread::id - ex.: name string.

in reply to:  3 ; comment:5 by Anthony Williams, 12 years ago

Replying to bartek szurgot <bartosz.szurgot@…>:

Replying to anonymous:

g_count is not protected against race conditions. You should use an atomic integer or use a mutext to protect g_count.

it's true, what you say - the same goes for g_started. i left it this way intentionally since i work on x86 where ++x on integer is atomic any way and i didn't want to complicate example code any more.

++x is not atomic on integers, even on x86. See http://www.devx.com/cplus/Article/42725

comment:6 by Anthony Williams, 12 years ago

I agree that the thread function should be destroyed when the thread is joined. I am not sure of the best way to handle this --- maybe weak_ptr is the way to go, maybe another way is preferable.

in reply to:  5 comment:7 by bartek szurgot <bartosz.szurgot@…>, 12 years ago

Replying to anthonyw:

Replying to bartek szurgot <bartosz.szurgot@…>:

Replying to anonymous:

g_count is not protected against race conditions. You should use an atomic integer or use a mutext to protect g_count.

it's true, what you say - the same goes for g_started. i left it this way intentionally since i work on x86 where ++x on integer is atomic any way and i didn't want to complicate example code any more.

++x is not atomic on integers, even on x86. See http://www.devx.com/cplus/Article/42725

i've read linked article and run tests, but got different results. since this discussion is a bit out of scope of this ticket, please contact me directly (bartosz dot szurgot at pwr dot wroc dot pl) so we can proceed with this topic.

comment:8 by viboes, 12 years ago

Cc: viboes added

Please, could you give a try to week_ptr and see how the solution you propose behaves?

by anonymous, 12 years ago

Attachment: thread_bug_test.tar.2.bz2 added

by anonymous, 12 years ago

Attachment: patch_thread_id.diff.bz2 added

by anonymous, 12 years ago

Attachment: patch_cond_warn.diff.bz2 added

comment:9 by bartosz.szurgot@…, 12 years ago

i've tested it with boost from svn (revision 66677) and created patch that changes shared_prt<> to weak_ptr<> in thread::id. it appears to work (i.e. solves the problem). i've also attached changed version of test-program (locks for ++ and -- operations).

when doing so i've also found warning on condition variable (unused variable when no assert is present). i've made separate patch for this as well.

comment:10 by viboes, 11 years ago

Cc: viboes removed
Milestone: Boost 1.43.0To Be Determined
Owner: changed from Anthony Williams to viboes
Status: newassigned

comment:11 by viboes, 11 years ago

I was wondering what prevent from making thread::id an opaque type around the thread::native_handle, Anthony?

comment:12 by viboes, 11 years ago

Keywords: thread::id added

comment:13 by viboes, 11 years ago

See also #5173 Bug/design issue with boost::thread::id

Last edited 11 years ago by viboes (previous) (diff)

comment:14 by viboes, 11 years ago

I've attached a patch to #5173 that should solve this issue also. Please could you give it a try?

comment:15 by viboes, 11 years ago

Milestone: To Be DeterminedBoost 1.50.0

Committed in trunk r77838

comment:16 by bartek szurgot <bartek.szurgot@…>, 10 years ago

i've checked r77838 - it appears that it does not fix the problem. using boost's trunk (r78191, as of writing these words) however appears to fix it, i.e. using provided test program it does join both threads, as it should.

comment:17 by viboes, 10 years ago

Have you defined BOOST_THREAD_PROVIDES_BASIC_THREAD_ID or BOOST_THREAD_VERSION=2?

comment:18 by viboes, 10 years ago

Resolution: fixed
Status: assignedclosed

Committed in release branch at [78543]

comment:19 by bartek.szurgot@…, 10 years ago

when BOOST_THREAD_PROVIDES_BASIC_THREAD_ID is set in r77838 test app works fine. i see that on trunk (r79003 as of writing these words) is is enabled by default. this probably was the difference.

Note: See TracTickets for help on using tickets.