Opened 10 years ago

Closed 10 years ago

#7906 closed Bugs (wontfix)

Very bad performance of generic implementation of shared_mutex on windows

Reported by: Andrey <nikolay@…> Owned by: viboes
Milestone: Component: thread
Version: Boost 1.52.0 Severity: Optimization
Keywords: thread shared_mutex Cc:

Description

Now (including boost 1.52) shared_mutex on win uses very efficient implementation. But after upgrade from boost 1.44 to boost 1.52 I have found what some base scenarios stopped working. For example: https://svn.boost.org/trac/boost/ticket/7755. Vicente asked me to try define BOOST_THREAD_PROVIDES_GENERIC_SHARED_MUTEX_ON_WIN to try generic implementation of shared_mutex. And problem 7755 was resolved as "will be fixed by generic implementation which will be default in boost 1.53". Some time later I have found one more broken in 1.52 base scenario when using upgrade_lock. The problem is the same as https://svn.boost.org/trac/boost/ticket/7720. I have switched to the generic implementation in my second project and problem seems to be resolved. But when I have found some delays in my time-critical code which uses shared_mutex'es (all 3 kinds of locks is used: shared, unique and upgrade). I have made some performance analysis and found what the problem is in new shared_mutex. On boost 1.44 some code is performed for 0.44s (real CPU time as shown in VTune) and with boost 1.52 with generic implementation the same code is performed for 1.52s.

For now I have rewrite my code to don't use upgrade_lock and compiled it with win32 implementation of shared_mutex. The main problem I see is this slow implementation will be default in the boost 1.53. I believe this will impact many boost users and this issue will be raised anyway when 1.53 will be released.

I have made a simple test for you:

#include "stdafx.h"

using namespace boost;

shared_mutex mtx;
const int cycles = 10000;

void shared()
{
	int cycle(0);
	while (++cycle < cycles)
	{
		shared_lock<shared_mutex> lock(mtx);
	}
}

void unique()
{
	int cycle(0);
	while (++cycle < cycles)
	{
		unique_lock<shared_mutex> lock(mtx);
	}
}

int main()
{
	boost::chrono::high_resolution_clock clock;
	boost::chrono::high_resolution_clock::time_point s1 = clock.now();
	thread t0(shared);
	thread t1(shared);
	thread t2(unique);

	t0.join();
	t1.join();
	t2.join();
	boost::chrono::high_resolution_clock::time_point f1 = clock.now();
	std::cout << "Time spent:" << (f1 - s1) << std::endl;

	return 0;
}

The results (I made 10 runs of each exe, below is average results):
win32 implementation: Time spent:3450301 nanoseconds
generic implementation: Time spent:12010409 nanoseconds

Attachments (1)

7906_it.patch (5.0 KB ) - added by viboes 10 years ago.

Download all attachments as: .zip

Change History (12)

comment:1 by Andrey <nikolay@…>, 10 years ago

My proposal is to fix bugs 7720 and 7755 in win32 implementation and leave it default in boost 1.53.

comment:2 by Andrey <nikolay@…>, 10 years ago

Component: Nonethread
Keywords: thread shared_mutex added
Owner: set to Anthony Williams
Severity: ProblemRegression

comment:3 by viboes, 10 years ago

I don't know if you are aware but BOOST_THREAD_PROVIDES_GENERIC_SHARED_MUTEX_ON_WIN will not be defined by default in 1.53.

Anyway, I have reopened the tickets that were closed as duplicated #7720 and #7755.

BTW, thanks for providing a performance test.

comment:4 by viboes, 10 years ago

Owner: changed from Anthony Williams to viboes
Severity: RegressionOptimization
Status: newassigned

Moved to optimization as BOOST_THREAD_PROVIDES_GENERIC_SHARED_MUTEX_ON_WIN is not defined by default.

by viboes, 10 years ago

Attachment: 7906_it.patch added

comment:5 by viboes, 10 years ago

One of the factors making the difference is that the win32 doesn't mask the interruption.

Once the attached patch is applied the measures are closer even if the win32 is yet better.

Please could you try the attached patch on trunk? BTW, give the measure for the best sample as done on the performance test example/perf_shared_mutex.cpp

in reply to:  5 comment:6 by viboes, 10 years ago

Replying to viboes:

One of the factors making the difference is that the win32 doesn't mask the interruption.

Once the attached patch is applied the measures are closer even if the win32 is yet better.

Please could you try the attached patch on trunk? BTW, give the measure for the best sample as done on the performance test example/perf_shared_mutex.cpp

Andrey, any news on the measures?

comment:7 by Andrey <nikolay@…>, 10 years ago

I am going to test you patch on next week. Sorry for the delay.

in reply to:  7 comment:8 by viboes, 10 years ago

Replying to Andrey <nikolay@…>:

I am going to test you patch on next week. Sorry for the delay.

After thinking a little bit more on this subject, I'm not sure shared_mutex should be an interruption point or not. With the pthread implementation most of the operations are interruption points as they use condition_variables, but the win32 implementation don't use condition_variables. What do you think?

In any case the patch is incomplete as it only disable interrupts, but doesn't adds calls to this_thread::interruption_point().

The test I also did is to compile without interruptions. Please could you give the performances when you enable/disable interruptions (BOOST_THREAD_PROVIDES_INTERRUPTIONS/BOOST_THREAD_DONT_PROVIDE_INTERRUPTIONS)?

in reply to:  7 comment:9 by Andrey <nikolay@…>, 10 years ago

Results (average for 5 runs) on my new PC:

shared_mutex_speed_win32_no_patch: Time spent:1910861 nanoseconds

shared_mutex_speed_generic_no_patch: Time spent:6831144 nanoseconds

shared_mutex_speed_win32_patch: Time spent:1969341 nanoseconds

shared_mutex_speed_generic_patch: Time spent:6782312 nanoseconds

shared_mutex_speed_generic_patch_no_interruptions (defined BOOST_THREAD_DONT_PROVIDE_INTERRUPTIONS): Time spent:6798547 nanoseconds

shared_mutex_speed_win32_patch_no_interruptions (defined BOOST_THREAD_DONT_PROVIDE_INTERRUPTIONS): Time spent:1998582 nanoseconds

In generic implementation most of CPU time is spent on boost::mutex::scoped_lock lk(state_change);. Win32 implementation is lock free and it is a reason why this implementation is very fast.

comment:10 by viboes, 10 years ago

I don't know what I can do to improve this situation. I'm working on fixing some of the shared_mutex fixes on win specific implementation. The generic implementation provides more services, e.g. timed operations which are missing on the win specific implementation. Could I close the ticket?

Last edited 10 years ago by viboes (previous) (diff)

comment:11 by viboes, 10 years ago

Milestone: To Be Determined
Resolution: wontfix
Status: assignedclosed
Note: See TracTickets for help on using tickets.