Opened 10 years ago
Closed 10 years ago
#7906 closed Bugs (wontfix)
Very bad performance of generic implementation of shared_mutex on windows
Reported by: | Owned by: | viboes | |
---|---|---|---|
Milestone: | Component: | thread | |
Version: | Boost 1.52.0 | Severity: | Optimization |
Keywords: | thread shared_mutex | Cc: |
Description
Now (including boost 1.52) shared_mutex on win uses very efficient implementation. But after upgrade from boost 1.44 to boost 1.52 I have found what some base scenarios stopped working. For example: https://svn.boost.org/trac/boost/ticket/7755. Vicente asked me to try define BOOST_THREAD_PROVIDES_GENERIC_SHARED_MUTEX_ON_WIN to try generic implementation of shared_mutex. And problem 7755 was resolved as "will be fixed by generic implementation which will be default in boost 1.53". Some time later I have found one more broken in 1.52 base scenario when using upgrade_lock. The problem is the same as https://svn.boost.org/trac/boost/ticket/7720. I have switched to the generic implementation in my second project and problem seems to be resolved. But when I have found some delays in my time-critical code which uses shared_mutex'es (all 3 kinds of locks is used: shared, unique and upgrade). I have made some performance analysis and found what the problem is in new shared_mutex. On boost 1.44 some code is performed for 0.44s (real CPU time as shown in VTune) and with boost 1.52 with generic implementation the same code is performed for 1.52s.
For now I have rewrite my code to don't use upgrade_lock and compiled it with win32 implementation of shared_mutex. The main problem I see is this slow implementation will be default in the boost 1.53. I believe this will impact many boost users and this issue will be raised anyway when 1.53 will be released.
I have made a simple test for you:
#include "stdafx.h" using namespace boost; shared_mutex mtx; const int cycles = 10000; void shared() { int cycle(0); while (++cycle < cycles) { shared_lock<shared_mutex> lock(mtx); } } void unique() { int cycle(0); while (++cycle < cycles) { unique_lock<shared_mutex> lock(mtx); } } int main() { boost::chrono::high_resolution_clock clock; boost::chrono::high_resolution_clock::time_point s1 = clock.now(); thread t0(shared); thread t1(shared); thread t2(unique); t0.join(); t1.join(); t2.join(); boost::chrono::high_resolution_clock::time_point f1 = clock.now(); std::cout << "Time spent:" << (f1 - s1) << std::endl; return 0; }
The results (I made 10 runs of each exe, below is average results):
win32 implementation: Time spent:3450301 nanoseconds
generic implementation: Time spent:12010409 nanoseconds
Attachments (1)
Change History (12)
comment:1 by , 10 years ago
comment:2 by , 10 years ago
Component: | None → thread |
---|---|
Keywords: | thread shared_mutex added |
Owner: | set to |
Severity: | Problem → Regression |
comment:3 by , 10 years ago
comment:4 by , 10 years ago
Owner: | changed from | to
---|---|
Severity: | Regression → Optimization |
Status: | new → assigned |
Moved to optimization as BOOST_THREAD_PROVIDES_GENERIC_SHARED_MUTEX_ON_WIN is not defined by default.
by , 10 years ago
Attachment: | 7906_it.patch added |
---|
follow-up: 6 comment:5 by , 10 years ago
One of the factors making the difference is that the win32 doesn't mask the interruption.
Once the attached patch is applied the measures are closer even if the win32 is yet better.
Please could you try the attached patch on trunk? BTW, give the measure for the best sample as done on the performance test example/perf_shared_mutex.cpp
comment:6 by , 10 years ago
Replying to viboes:
One of the factors making the difference is that the win32 doesn't mask the interruption.
Once the attached patch is applied the measures are closer even if the win32 is yet better.
Please could you try the attached patch on trunk? BTW, give the measure for the best sample as done on the performance test example/perf_shared_mutex.cpp
Andrey, any news on the measures?
follow-ups: 8 9 comment:7 by , 10 years ago
I am going to test you patch on next week. Sorry for the delay.
comment:8 by , 10 years ago
Replying to Andrey <nikolay@…>:
I am going to test you patch on next week. Sorry for the delay.
After thinking a little bit more on this subject, I'm not sure shared_mutex should be an interruption point or not. With the pthread implementation most of the operations are interruption points as they use condition_variables, but the win32 implementation don't use condition_variables. What do you think?
In any case the patch is incomplete as it only disable interrupts, but doesn't adds calls to this_thread::interruption_point().
The test I also did is to compile without interruptions. Please could you give the performances when you enable/disable interruptions (BOOST_THREAD_PROVIDES_INTERRUPTIONS/BOOST_THREAD_DONT_PROVIDE_INTERRUPTIONS)?
comment:9 by , 10 years ago
Results (average for 5 runs) on my new PC:
shared_mutex_speed_win32_no_patch: Time spent:1910861 nanoseconds
shared_mutex_speed_generic_no_patch: Time spent:6831144 nanoseconds
shared_mutex_speed_win32_patch: Time spent:1969341 nanoseconds
shared_mutex_speed_generic_patch: Time spent:6782312 nanoseconds
shared_mutex_speed_generic_patch_no_interruptions (defined BOOST_THREAD_DONT_PROVIDE_INTERRUPTIONS): Time spent:6798547 nanoseconds
shared_mutex_speed_win32_patch_no_interruptions (defined BOOST_THREAD_DONT_PROVIDE_INTERRUPTIONS): Time spent:1998582 nanoseconds
In generic implementation most of CPU time is spent on boost::mutex::scoped_lock lk(state_change);. Win32 implementation is lock free and it is a reason why this implementation is very fast.
comment:10 by , 10 years ago
I don't know what I can do to improve this situation. I'm working on fixing some of the shared_mutex fixes on win specific implementation. The generic implementation provides more services, e.g. timed operations which are missing on the win specific implementation. Could I close the ticket?
comment:11 by , 10 years ago
Milestone: | To Be Determined |
---|---|
Resolution: | → wontfix |
Status: | assigned → closed |
My proposal is to fix bugs 7720 and 7755 in win32 implementation and leave it default in boost 1.53.