Created: 13 years, 5 months ago by dvyukov
Modified: 13 years, 5 months ago
Reviewers:
CC: rsc, golang-dev
Visibility: Public.
Description:
runtime: replace Semacquire/Semrelease implementation
1. The implementation uses a distributed hash table of waitlists instead of a centralized one.
It significantly improves scalability for uncontended semaphores.
2. The implementation provides a wait-free fast path for signalers.
3. The implementation uses fewer locks (1 lock/unlock instead of 5 for Semacquire).
4. The runtime·ready() call is moved out of the critical section.
5. Semacquire() does not call semwake().
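For orientation, here is a rough sketch of the data layout behind points 1 and 2. The identifiers (SemaRoot, Sema, semtable, SEMTABLESZ, semroot, nwait, head/tail) come from the review comments below; the exact fields, the table size, and the padding amount are assumptions for illustration, not the committed code.

    // A single waiter; one per goroutine blocked in Semacquire.
    typedef struct Sema Sema;
    struct Sema
    {
        uint32 volatile *addr;  // semaphore address this goroutine waits on
        G *g;                   // the waiting goroutine
        Sema *prev;             // list links (a doubly-linked list is assumed)
        Sema *next;
    };

    // One waitlist root per hash bucket.
    typedef struct SemaRoot SemaRoot;
    struct SemaRoot
    {
        Lock;           // embedded lock; protects head/tail
        Sema *head;     // FIFO of waiters hashing to this root
        Sema *tail;
        uint32 nwait;   // number of waiters; checked lock-free on the release fast path
    };

    enum { SEMTABLESZ = 251 };  // table size is an assumption

    // Pad each root so that distinct roots never share a cache line.
    static union
    {
        SemaRoot root;
        uint8 pad[128];
    } semtable[SEMTABLESZ];

    // Hash a semaphore address to its root, so unrelated semaphores
    // contend on different locks and different cache lines.
    static SemaRoot*
    semroot(uint32 volatile *addr)
    {
        return &semtable[((uintptr)addr >> 3) % SEMTABLESZ].root;
    }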
Benchmark results on HP Z600 (2 x Xeon E5620, 8 HT cores, 2.40GHz)
are as follows:
benchmark old ns/op new ns/op delta
runtime_test.BenchmarkSemaUncontended 58.20 36.30 -37.63%
runtime_test.BenchmarkSemaUncontended-2 199.00 18.30 -90.80%
runtime_test.BenchmarkSemaUncontended-4 327.00 9.20 -97.19%
runtime_test.BenchmarkSemaUncontended-8 491.00 5.32 -98.92%
runtime_test.BenchmarkSemaUncontended-16 946.00 4.18 -99.56%
runtime_test.BenchmarkSemaSyntNonblock 59.00 36.80 -37.63%
runtime_test.BenchmarkSemaSyntNonblock-2 167.00 138.00 -17.37%
runtime_test.BenchmarkSemaSyntNonblock-4 333.00 129.00 -61.26%
runtime_test.BenchmarkSemaSyntNonblock-8 464.00 130.00 -71.98%
runtime_test.BenchmarkSemaSyntNonblock-16 1015.00 136.00 -86.60%
runtime_test.BenchmarkSemaSyntBlock 58.80 36.70 -37.59%
runtime_test.BenchmarkSemaSyntBlock-2 294.00 149.00 -49.32%
runtime_test.BenchmarkSemaSyntBlock-4 333.00 177.00 -46.85%
runtime_test.BenchmarkSemaSyntBlock-8 471.00 221.00 -53.08%
runtime_test.BenchmarkSemaSyntBlock-16 990.00 227.00 -77.07%
runtime_test.BenchmarkSemaWorkNonblock 829.00 832.00 +0.36%
runtime_test.BenchmarkSemaWorkNonblock-2 425.00 419.00 -1.41%
runtime_test.BenchmarkSemaWorkNonblock-4 308.00 220.00 -28.57%
runtime_test.BenchmarkSemaWorkNonblock-8 394.00 147.00 -62.69%
runtime_test.BenchmarkSemaWorkNonblock-16 1510.00 149.00 -90.13%
runtime_test.BenchmarkSemaWorkBlock 828.00 813.00 -1.81%
runtime_test.BenchmarkSemaWorkBlock-2 428.00 436.00 +1.87%
runtime_test.BenchmarkSemaWorkBlock-4 232.00 219.00 -5.60%
runtime_test.BenchmarkSemaWorkBlock-8 392.00 251.00 -35.97%
runtime_test.BenchmarkSemaWorkBlock-16 1524.00 298.00 -80.45%
sync_test.BenchmarkMutexUncontended 24.10 24.00 -0.41%
sync_test.BenchmarkMutexUncontended-2 12.00 12.00 +0.00%
sync_test.BenchmarkMutexUncontended-4 6.25 6.17 -1.28%
sync_test.BenchmarkMutexUncontended-8 3.43 3.34 -2.62%
sync_test.BenchmarkMutexUncontended-16 2.34 2.32 -0.85%
sync_test.BenchmarkMutex 24.70 24.70 +0.00%
sync_test.BenchmarkMutex-2 208.00 99.50 -52.16%
sync_test.BenchmarkMutex-4 2744.00 256.00 -90.67%
sync_test.BenchmarkMutex-8 5137.00 556.00 -89.18%
sync_test.BenchmarkMutex-16 5368.00 1284.00 -76.08%
sync_test.BenchmarkMutexSlack 24.70 25.00 +1.21%
sync_test.BenchmarkMutexSlack-2 1094.00 186.00 -83.00%
sync_test.BenchmarkMutexSlack-4 3430.00 402.00 -88.28%
sync_test.BenchmarkMutexSlack-8 5051.00 1066.00 -78.90%
sync_test.BenchmarkMutexSlack-16 6806.00 1363.00 -79.97%
sync_test.BenchmarkMutexWork 793.00 792.00 -0.13%
sync_test.BenchmarkMutexWork-2 398.00 398.00 +0.00%
sync_test.BenchmarkMutexWork-4 1441.00 308.00 -78.63%
sync_test.BenchmarkMutexWork-8 8532.00 847.00 -90.07%
sync_test.BenchmarkMutexWork-16 8225.00 2760.00 -66.44%
sync_test.BenchmarkMutexWorkSlack 793.00 793.00 +0.00%
sync_test.BenchmarkMutexWorkSlack-2 418.00 414.00 -0.96%
sync_test.BenchmarkMutexWorkSlack-4 4481.00 480.00 -89.29%
sync_test.BenchmarkMutexWorkSlack-8 6317.00 1598.00 -74.70%
sync_test.BenchmarkMutexWorkSlack-16 9111.00 3038.00 -66.66%
Patch Set 1
Patch Set 2: diff -r 6f1145ee588d https://go.googlecode.com/hg/
Patch Set 3: diff -r 6f1145ee588d https://go.googlecode.com/hg/
Total comments: 12
Patch Set 4: diff -r 6f1145ee588d https://go.googlecode.com/hg/
Patch Set 5: diff -r 6f1145ee588d https://go.googlecode.com/hg/
Patch Set 6: diff -r 9e53a1312e25 https://go.googlecode.com/hg/
Total comments: 18
Patch Set 7: diff -r 607e0f74161f https://go.googlecode.com/hg/
Patch Set 8: diff -r 607e0f74161f https://go.googlecode.com/hg/
Patch Set 9: diff -r 607e0f74161f https://go.googlecode.com/hg/
Patch Set 10: diff -r 607e0f74161f https://go.googlecode.com/hg/
Total comments: 4
Patch Set 11: diff -r 607e0f74161f https://go.googlecode.com/hg/
Total comments: 1
Patch Set 12: diff -r 607e0f74161f https://go.googlecode.com/hg/

Messages
Total messages: 42
Hello golang-dev@googlegroups.com (cc: dvyukov@google.com, golang-dev@googlegroups.com), I'd like you to review this change to https://go.googlecode.com/hg/
Here is comparison graph for uncontended semaphore:
http://chart.googleapis.com/chart?cht=lc&chs=600x500&chxt=x,y&chxr=0,1,16,1|1...

Here is comparison graph for contended nonblocking semaphore:
http://chart.googleapis.com/chart?cht=lc&chs=600x500&chxt=x,y&chxr=0,1,16,1|1...

Here is comparison graph for contended blocking semaphore:
http://chart.googleapis.com/chart?cht=lc&chs=600x500&chxt=x,y&chxr=0,1,16,1|1...

Here is comparison graph for contended sync.Mutex:
http://chart.googleapis.com/chart?cht=lc&chs=600x500&chxt=x,y&chxr=0,1,16,1|1...

Here is comparison graph for contended sync.Mutex with some amount of local work outside of critical section:
http://chart.googleapis.com/chart?cht=lc&chs=600x500&chxt=x,y&chxr=0,1,16,1|1...
Here is comparison graph for uncontended semaphore:
http://goo.gl/yZcE4

Here is comparison graph for contended nonblocking semaphore:
http://goo.gl/uY9S7

Here is comparison graph for contended blocking semaphore:
http://goo.gl/OUk5N

Here is comparison graph for contended sync.Mutex:
http://goo.gl/PfBEL

Here is comparison graph for contended sync.Mutex with some amount of local work outside of critical section:
http://goo.gl/RQLvh

The results are obtained with the scalability benchmarking suite (http://codereview.appspot.com/4644051) on a 2 x Intel(R) Xeon(R) CPU E5620 @ 2.40GHz (HP Z600) system.
can we get the benchmarks checked in first? that way people can update to before this change and run the benchmarks themselves and then update after and compare.
On 2011/06/22 14:50:29, rsc wrote:
> can we get the benchmarks checked in first?
> that way people can update to before this change
> and run the benchmarks themselves
> and then update after and compare.

Ok, let's wait for the benchmarks. The changes are quite free-standing, so hopefully I will be able to commit them later without significant modifications.
I'm thrilled to see the performance improvements here. I'm less thrilled about the fact that it's hard to see what changed, because even what stayed the same has been reformatted.

Reading diffs is a central part of code review and how we make progress in Go. Please take the time to make the diffs smaller. Thanks.

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/386/asm.s
File src/pkg/runtime/386/asm.s (right):

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/386/asm.s#new...
src/pkg/runtime/386/asm.s:321: TEXT runtime·load(SB), 7, $0
It's not clear to me why this is in assembly. Why not implement this in C?

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc
File src/pkg/runtime/sema.goc (left):

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc#oldc...
src/pkg/runtime/sema.goc:144: // Easy case.
Please put this comment back.

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc#oldc...
src/pkg/runtime/sema.goc:148: // Harder case:
Please put this comment back.

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc
File src/pkg/runtime/sema.goc (right):

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc#newc...
src/pkg/runtime/sema.goc:40: // Modern processors tend to have 64-byte cache lines,
Better:

    static union {
        SemaRoot;
        uint8 pad[128];
    } semtable[SEMTABLESZ];

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc#newc...
src/pkg/runtime/sema.goc:58: semqueue(SemaRoot *root, uint32 volatile *addr, Sema *sem)
The diffs here would be smaller without the variable name change. Please s/sem/s/ throughout to reduce the diffs.

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc#newc...
src/pkg/runtime/sema.goc:87: trysemacquire(uint32 volatile *addr)
This function is identical to the old cansemacquire, except that it's named differently, formatted differently, and uses a different loop construct. That is, except for rewriting it, nothing changed.

Please put the old function back, to reduce the diffs.

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc#newc...
src/pkg/runtime/sema.goc:92: {
The brace goes on the previous line. In general, when you are replacing code, the new code should look like the code being replaced.

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc#newc...
src/pkg/runtime/sema.goc:138: if(runtime·load(&root->cnt) == 0)
Given that cnt is declared volatile, why not just write if(root->cnt == 0) ?
On Wed, Jun 22, 2011 at 10:42, <dvyukov@google.com> wrote:
> Here is comparison graph for uncontended semaphore:
> http://goo.gl/yZcE4

Fantastic!

> Here is comparison graph for contended nonblocking semaphore:
> http://goo.gl/uY9S7
>
> Here is comparison graph for contended blocking semaphore:
> http://goo.gl/OUk5N
>
> Here is comparison graph for contended sync.Mutex:
> http://goo.gl/PfBEL
>
> Here is comparison graph for contended sync.Mutex with some amount of
> local work outside of critical section:
> http://goo.gl/RQLvh

On these, there is a spot in the graph key for 'ideal' but I don't see the corresponding gray line. Am I missing it? What would ideal be in these situations?
On 2011/06/22 15:07:12, rsc wrote:
> On Wed, Jun 22, 2011 at 10:42, <mailto:dvyukov@google.com> wrote:
> > Here is comparison graph for uncontended semaphore:
> > http://goo.gl/yZcE4
>
> Fantastic!

Yeah. It's the result of the distributed hash table with waitlists and the fast-pathed release().

> > Here is comparison graph for contended nonblocking semaphore:
> > http://goo.gl/uY9S7
> >
> > Here is comparison graph for contended blocking semaphore:
> > http://goo.gl/OUk5N
> >
> > Here is comparison graph for contended sync.Mutex:
> > http://goo.gl/PfBEL
> >
> > Here is comparison graph for contended sync.Mutex with some amount of
> > local work outside of critical section:
> > http://goo.gl/RQLvh
>
> On these, there is a spot in the graph key for 'ideal'
> but I don't see the corresponding gray line. Am I missing it?
> What would ideal be in these situations?

The suite contains a flag -draw.ideal=false that turns off the 'ideal' line. What you see is an issue with the URL-generating code (it's just a bit simpler to generate such a URL). If I enable 'ideal' then all lines become just horizontal lines near 0 :) So I turn off 'ideal' on benchmarks that are not intended to scale (like contended mutex).
http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/386/asm.s
File src/pkg/runtime/386/asm.s (right):

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/386/asm.s#new...
src/pkg/runtime/386/asm.s:321: TEXT runtime·load(SB), 7, $0
On 2011/06/22 15:05:41, rsc wrote:
> It's not clear to me why this is in assembly.
> Why not implement this in C?

For consistency with other atomic functions. It seems that the author decided to use asm instead of C+intrinsics (which makes some sense). If you wish I can move cas/xadd/load to a C file.

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc
File src/pkg/runtime/sema.goc (right):

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc#newc...
src/pkg/runtime/sema.goc:40: // Modern processors tend to have 64-byte cache lines,
On 2011/06/22 15:05:41, rsc wrote:
> Better:
>
> static union {
>     SemaRoot;
>     uint8 pad[128];
> } semtable[SEMTABLESZ];

Done.

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc#newc...
src/pkg/runtime/sema.goc:58: semqueue(SemaRoot *root, uint32 volatile *addr, Sema *sem)
On 2011/06/22 15:05:41, rsc wrote:
> The diffs here would be smaller without the variable name change.
> Please s/sem/s/ throughout to reduce the diffs.

Done.

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc#newc...
src/pkg/runtime/sema.goc:87: trysemacquire(uint32 volatile *addr)
On 2011/06/22 15:05:41, rsc wrote:
> This function is identical to the old cansemacquire,
> except that it's named differently, formatted differently,
> and uses a different loop construct. That is, except
> for rewriting it, nothing changed.
>
> Please put the old function back, to reduce the diffs.

Done.
>> It's not clear to me why this is in assembly.
>> Why not implement this in C?
>
> For consistency with other atomic functions. It seems that an author
> decided to use asm instead of C+intrinsics (which makes some sense). If
> you wish I can move cas/xadd/load to a C file.

You can't, because the C compiler you are using has no asm builtin.

My question was really why not write

    uint32
    runtime.load(uint32 *addr)
    {
        return *addr;
    }

which will compile to the assembly you have.

But the second level question is why you need the load function at all, given the volatile marking.

Russ
On 2011/06/22 15:05:41, rsc wrote:
> I'm thrilled to see the performance improvements here.
> I'm less thrilled about the fact that it's hard to see what
> changed, because even what stayed the same has been
> reformatted.
>
> Reading diffs is a central part of code review and how
> we make progress in Go. Please take the time to make
> the diffs smaller.

Actually I am considering this as a rewrite, so diffs are not the way to work with it. The high-level algo is changed, and the details are changed.

> http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc#oldc...
> src/pkg/runtime/sema.goc:144: // Easy case.
> Please put this comment back.

Done.

> http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc#oldc...
> src/pkg/runtime/sema.goc:148: // Harder case:
> Please put this comment back.

Done.

> http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc#newc...
> src/pkg/runtime/sema.goc:92: {
> The brace goes on the previous line.
> In general, when you are replacing code, the new code should
> look like the code being replaced.

Aha! I assumed that goc uses "braces on separate lines" style for some reason. Perhaps the reason was an initial look at a function like:

    void func()
    {
        single line
    }

Now I see that goc uses mixed style. Done.
>> Reading diffs is a central part of code review and how
>> we make progress in Go. Please take the time to make
>> the diffs smaller.
>
> Actually I am considering this as a rewrite, so diffs are not the way to
> work with it. The high level algo is changes, details are changes.

Even though you may consider it a rewrite, it helps me understand the new code if I can see what is different from the old code. Thanks for putting some of the original pieces back.

By the way, you need to run hg mail once you're ready for us to take another look. (hg mail uploads your current copy before sending its mail.)

Russ
On 2011/06/22 16:06:56, rsc wrote:
> >> It's not clear to me why this is in assembly.
> >> Why not implement this in C?
> >
> > For consistency with other atomic functions. It seems that an author
> > decided to use asm instead of C+intrinsics (which makes some sense). If
> > you wish I can move cas/xadd/load to a C file.
>
> You can't, because the C compiler you are using has no asm builtin.

But does it have __sync_xxx builtins (__sync_bool_compare_and_swap)?

> My question was really why not write
>
> uint32
> runtime.load(uint32 *addr)
> {
>     return *addr;
> }
>
> which will compile to the assembly you have.
>
> But the second level question is why you need the load
> function at all, given the volatile marking.

Well, how I would implement it in my own project with vanilla gcc is:

    uint32 load(uint32 volatile* a)
    {
        __asm__ __volatile__ ("" ::: "memory");
        uint32 v = *a;
        __asm__ __volatile__ ("" ::: "memory");
        return v;
    }

Is your proposed C code equivalent (the problem may be with potential inlining and memory access reordering at the compiler level)? If yes, then I am happy to move it to a C file.

> Given that cnt is declared volatile, why not just write
> if(root->cnt == 0)

What I need in terms of C1x is:

    atomic_fetch_add(addr, 1, memory_order_seq_cst);
    if (atomic_load(&root->cnt, memory_order_seq_cst) == 0)
        return;

Go runtime atomics are not documented, so it's hard to tell whether it's enough or not. But on ARM a seq_cst load is most likely not just a plain load. On IA-32/Intel64 it's definitely enough, though. I think it's better to have such fundamental primitives as separate functions anyway. Well, actually even on x86 a plain load may be not enough, because it actually depends on how xadd and cas are implemented, so it's better to keep them all in one place.
On 2011/06/22 16:11:09, rsc wrote:
> >> Reading diffs is a central part of code review and how
> >> we make progress in Go. Please take the time to make
> >> the diffs smaller.
> >
> > Actually I am considering this as a rewrite, so diffs are not the way to
> > work with it. The high level algo is changes, details are changes.
>
> Even though you may consider it a rewrite, it helps me
> understand the new code if I can see what is different from
> the old code. Thanks for putting some of the original pieces back.
>
> By the way, you need to run hg mail once you're ready for us
> to take another look. (hg mail uploads your current copy
> before sending its mail.)

Yes, I just want to wait for our decision on runtime.load() and upload it in a batch.
>> You can't, because the C compiler you are using has no asm builtin.
>
> But does it have __sync_xxx builtins (__sync_bool_compare_and_swap)?

Nope, just C. That's why there's the assembly file.

> Does your proposed C code equivalent (the problem may be with potential
> inlining and memory access reorderings on compiler level)?
> If yes, then I am happy to move it to a C file.

The C compiler does no inlining. Putting it in a separate function guarantees that it will not be reordered.

If you want to keep it in a separate function, please call it atomic_load. load sounds like it might have something to do with loading libraries or files.

Russ
Hello rsc@golang.org (cc: golang-dev@googlegroups.com), Please take another look.
On 2011/06/22 16:24:00, rsc wrote:
> >> You can't, because the C compiler you are using has no asm builtin.
> >
> > But does it have __sync_xxx builtins (__sync_bool_compare_and_swap)?
>
> Nope, just C. That's why there's the assembly file.
>
> > Does your proposed C code equivalent (the problem may be with potential
> > inlining and memory access reorderings on compiler level)?
> > If yes, then I am happy to move it to a C file.
>
> The C compiler does no inlining. Putting it in a separate
> function guarantees that it will not be reordered.

OK, moving to a C file.

> If you want to keep it in a separate function, please call it
> atomic_load. load sounds like it might have something
> to do with loading libraries or files.

I call it atomicload, because the file uses the C naming convention (the name looks way too long already :) ).
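For context on why a plain C function is enough here: the runtime's C compiler never inlines across functions, so the call itself acts as a compiler barrier. A sketch of what such a C load might look like (an illustration mirroring the x86 reasoning above, not the exact patch; per-architecture details, in particular ARM, may differ):

    // atomic.c (sketch) - an atomic 32-bit load for the runtime.
    // Being a separate, never-inlined function, the compiler cannot
    // reorder or cache the load across the call; on x86 an aligned
    // 32-bit load is itself atomic, and the locked xadd/cas used on
    // the other side of the protocol supply the required ordering.
    #include "runtime.h"

    uint32
    runtime·atomicload(uint32 volatile *addr)
    {
        return *addr;
    }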
Hello rsc@golang.org (cc: golang-dev@googlegroups.com), Please take another look.
Hello rsc@golang.org (cc: golang-dev@googlegroups.com), Please take another look.
http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/Makefile
File src/pkg/runtime/Makefile (right):

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/Makefile#newcode84
src/pkg/runtime/Makefile:84: atomic.$O\
move up (sort)

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/arm/atomic.c
File src/pkg/runtime/arm/atomic.c (right):

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/arm/atomic.c#new...
src/pkg/runtime/arm/atomic.c:9: runtime·load(uint32 volatile* addr)
atomicload

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc
File src/pkg/runtime/sema.goc (right):

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc#newcode35
src/pkg/runtime/sema.goc:35: Lock lock;
can just say Lock; here

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc#newcode55
src/pkg/runtime/sema.goc:55: static SemaRoot *
s/ */*/

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc#newcode119
src/pkg/runtime/sema.goc:119: runtime·lock(&root->lock);
runtime·lock(root);
etc

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc#newcode141
src/pkg/runtime/sema.goc:141: // Easy case (incremented the counter and there is no waiters).

    root = semroot(addr);
    runtime·xadd(addr, 1);

    // Easy case: no waiters?
    if(runtime·atomicload(&root->cnt) == 0)
        return;

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc#newcode147
src/pkg/runtime/sema.goc:147: // Harder case (search for a waiter and wake-up it).
s/wake-up it/wake it/

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc#newcode154
src/pkg/runtime/sema.goc:154: s = 0;
s = nil;
(for pointers)

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc#newcode162
src/pkg/runtime/sema.goc:162: runtime·xadd(&root->cnt, -1);
The manipulation of cnt here seems a lot more racy than it needs to be. I expected:

    lock(root)
    if(load(cnt) == 0) {
        unlock(root);
        return;
    }
    for(each s) {
        if(s->addr == addr) {
            semdequeue(root, s)
            break
        }
    }
    unlock(root)
    if(s)
        ready(s->g)

Does that work?
http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/Makefile
File src/pkg/runtime/Makefile (right):

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/Makefile#newcode84
src/pkg/runtime/Makefile:84: atomic.$O\
On 2011/06/28 14:08:51, rsc wrote:
> move up (sort)

Done.

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/arm/atomic.c
File src/pkg/runtime/arm/atomic.c (right):

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/arm/atomic.c#new...
src/pkg/runtime/arm/atomic.c:9: runtime·load(uint32 volatile* addr)
On 2011/06/28 14:08:51, rsc wrote:
> atomicload

Done.

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc
File src/pkg/runtime/sema.goc (right):

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc#newcode35
src/pkg/runtime/sema.goc:35: Lock lock;
On 2011/06/28 14:08:51, rsc wrote:
> can just say Lock; here

Done.

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc#newcode55
src/pkg/runtime/sema.goc:55: static SemaRoot *
On 2011/06/28 14:08:51, rsc wrote:
> s/ */*/

Done.

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc#newcode119
src/pkg/runtime/sema.goc:119: runtime·lock(&root->lock);
On 2011/06/28 14:08:51, rsc wrote:
> runtime·lock(root);
> etc

Done.

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc#newcode147
src/pkg/runtime/sema.goc:147: // Harder case (search for a waiter and wake-up it).
On 2011/06/28 14:08:51, rsc wrote:
> s/wake-up it/wake it/

Done.

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc#newcode154
src/pkg/runtime/sema.goc:154: s = 0;
On 2011/06/28 14:08:51, rsc wrote:
> s = nil;
> (for pointers)

Done.

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc#newcode162
src/pkg/runtime/sema.goc:162: runtime·xadd(&root->cnt, -1);
On 2011/06/28 14:08:51, rsc wrote:
> The manipulation of cnt here seems a lot more racy
> than it needs to be. I expected:
>
> lock(root)
> if(load(cnt) == 0) {
>     unlock(root);
>     return;
> }
> for(each s) {
>     if(s->addr == addr) {
>         semdequeue(root, s)
>         break
>     }
> }
> unlock(root)
> if(s)
>     ready(s->g)
>
> Does that work?

Well, yes, it works. But I especially moved manipulations of root->cnt out of the critical section, because otherwise it visibly affects uncontended performance and significantly affects scaling.

I've made a quick prototype:

    void
    runtime·semacquire(uint32 volatile *addr)
    {
        Sema s;
        SemaRoot *root;

        if(cansemacquire(addr))
            return;

        root = semroot(addr);
        for(;;) {
            runtime·lock(root);
            root->cnt += 1;
            if(cansemacquire(addr)) {
                root->cnt -= 1;
                runtime·unlock(root);
                return;
            }
            semqueue(root, addr, &s);
            g->status = Gwaiting;
            runtime·unlock(root);
            runtime·gosched();
            if(cansemacquire(addr))
                return;
        }
    }

    void
    runtime·semrelease(uint32 volatile *addr)
    {
        Sema *s;
        SemaRoot *root;

        root = semroot(addr);
        runtime·xadd(addr, 1);
        runtime·lock(root);
        if(root->cnt == 0) {
            runtime·unlock(root);
            return;
        }
        for(s = root->head; s; s = s->next) {
            if(s->addr == addr) {
                semdequeue(root, s);
                root->cnt -= 1;
                break;
            }
        }
        runtime·unlock(root);
        if(s) {
            runtime·ready(s->g);
        }
    }

And here are the results:

runtime_test.BenchmarkSemaUncontended 50000000 65.3 ns/op
runtime_test.BenchmarkSemaUncontended-2 50000000 32.7 ns/op
runtime_test.BenchmarkSemaUncontended-4 100000000 17.1 ns/op
runtime_test.BenchmarkSemaUncontended-8 200000000 9.36 ns/op
runtime_test.BenchmarkSemaSyntNonblock 50000000 66.0 ns/op
runtime_test.BenchmarkSemaSyntNonblock-2 10000000 222 ns/op
runtime_test.BenchmarkSemaSyntNonblock-4 10000000 271 ns/op
runtime_test.BenchmarkSemaSyntNonblock-8 5000000 530 ns/op
runtime_test.BenchmarkSemaSyntBlock 50000000 66.4 ns/op
runtime_test.BenchmarkSemaSyntBlock-2 10000000 343 ns/op
runtime_test.BenchmarkSemaSyntBlock-4 5000000 310 ns/op
runtime_test.BenchmarkSemaSyntBlock-8 5000000 534 ns/op
runtime_test.BenchmarkSemaWorkNonblock 2000000 836 ns/op
runtime_test.BenchmarkSemaWorkNonblock-2 5000000 449 ns/op
runtime_test.BenchmarkSemaWorkNonblock-4 5000000 358 ns/op
runtime_test.BenchmarkSemaWorkNonblock-8 5000000 424 ns/op
runtime_test.BenchmarkSemaWorkBlock 2000000 835 ns/op
runtime_test.BenchmarkSemaWorkBlock-2 5000000 446 ns/op
runtime_test.BenchmarkSemaWorkBlock-4 10000000 284 ns/op
runtime_test.BenchmarkSemaWorkBlock-8 5000000 445 ns/op
Hello rsc@golang.org (cc: golang-dev@googlegroups.com), Please take another look.
On Tue, Jun 28, 2011 at 10:55, <dvyukov@google.com> wrote:
> Well, yes, it works. But I especially moved manipulations of root->cnt
> out of the critical section, because otherwise it visibly affects
> uncontended performance and significantly affects scaling. I've made a
> quick prototype:

I wasn't clear enough. My proposal was to leave the fast path check but then after that use the code I wrote.

Your comment on cnt says // Read w/o lock but the code is actually writing it without the lock too, which is confusing and doesn't seem to be necessary for the scaling.

Russ
Hello rsc@golang.org (cc: golang-dev@googlegroups.com), Please take another look.
Also the racy cnt access looks the same.

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc
File src/pkg/runtime/sema.goc (right):

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc#newcode141
src/pkg/runtime/sema.goc:141: // Easy case (incremented the counter and there is no waiters).
On 2011/06/28 14:08:51, rsc wrote:
> root = semroot(addr);
> runtime·xadd(addr, 1);
>
> // Easy case: no waiters?
> if(runtime·atomicload(&root->cnt) == 0)
>     return;

You didn't apply this change.
On 2011/06/28 15:10:08, rsc wrote:
> On Tue, Jun 28, 2011 at 10:55, <mailto:dvyukov@google.com> wrote:
> > Well, yes, it works. But I especially moved manipulations of root->cnt
> > out of the critical section, because otherwise it visibly affects
> > uncontended performance and significantly affects scaling. I've made a
> > quick prototype:
>
> I wasn't clear enough. My proposal was to leave the
> fast path check but then after that use the code I wrote.
>
> Your comment on cnt says // Read w/o lock
> but the code is actually writing it without the lock too,
> which is confusing and doesn't seem to be necessary
> for the scaling.

Since root->cnt is manipulated in a lock-free manner on the fast path, it should be manipulated in a lock-free manner everywhere. Mutexes can't protect it anymore. The essence of the pattern is:

    // producer thread
    atomic_fetch_add(addr, 1, memory_order_seq_cst);
    r1 = atomic_load(root->cnt, memory_order_seq_cst);

    // consumer thread
    atomic_fetch_add(root->cnt, 1, memory_order_seq_cst);
    r2 = atomic_load(addr, memory_order_seq_cst);

Atomic operations with seq_cst memory ordering ensure that r1==r2==0 can't happen (this in turn means that either the producer signals the consumer, or the consumer does not block, or both; that is, no deadlock happens). Since one part of the pattern happens w/o the mutex, the second part under the mutex can't be expressed with plain memory operations.
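Mapping that pattern onto the primitives used in this CL gives the following fragment-level illustration of the argument (not standalone code; cansemacquire is the existing helper that tries to take the count at *addr):

    // semrelease side:
    runtime·xadd(addr, 1);                      // (R1) publish the count
    if(runtime·atomicload(&root->cnt) == 0)     // (R2) then look for waiters
        return;                                 // fast path: nobody to wake

    // semacquire slow-path side:
    runtime·xadd(&root->cnt, 1);                // (A1) announce that a waiter exists
    if(cansemacquire(addr)) {                   // (A2) then re-check the count
        runtime·xadd(&root->cnt, -1);           // got it; undo the announcement
        return;
    }
    // ...otherwise queue and go to sleep...

    // Each side performs its write (R1/A1) before its read (R2/A2). With
    // seq_cst ordering at least one side must observe the other's write:
    // either the releaser sees cnt != 0 and goes on to wake a waiter, or
    // the acquirer sees the released count and never sleeps. The lost-wakeup
    // outcome (r1 == r2 == 0 above) is therefore impossible.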
Hello rsc@golang.org (cc: golang-dev@googlegroups.com), Please take another look.
On 2011/06/28 15:39:46, rsc wrote: > Also the racy cnt access looks the same. Please post a prototype of semacquire/semrelease, perhaps I do not understand what exactly you want to change.
> Since root->cnt is manipulated in a lock-free manner on fast-path, it
> should be manipulated in a lock-free manner everywhere. Mutexes can't
> protect it anymore.

I am not saying to change the way you access it. I am saying to change *where* you access it, to make the correctness of the algorithm easier to see.

+ root = semroot(addr);
+ runtime·xadd(addr, 1);
+
+ // Easy case: no waiters?
+ if(runtime·atomicload(&root->cnt) == 0)
+     return;
+
+ // Harder case: search for a waiter and wake it.
+ runtime·lock(root);
+ if(runtime·atomicload(&root->cnt) == 0) {
+     // Count already consumed by another goroutine.
+     runtime·unlock(root);
+     return;
+ }
+ for(s = root->head; s; s = s->next) {
+     if(s->addr == addr) {
+         runtime·xadd(&root->cnt, -1);
+         semdequeue(root, s);
+         break;
+     }
+ }
+ runtime·unlock(root);
+ if(s)
+     runtime·ready(s->g);

In addition to being easier (at least for me) to understand, this version avoids spurious semdequeue calls.

In fact, maybe the manipulations of root->cnt should be done only inside semqueue and semdequeue.

Russ
On 2011/06/28 15:51:39, rsc wrote:
> > Since root->cnt is manipulated in a lock-free manner on fast-path, it
> > should be manipulated in a lock-free manner everywhere. Mutexes can't
> > protect it anymore.
>
> I am not saying to change the way you access it.
> I am saying to change *where* you access it, to
> make the correctness of the algorithm easier to see.
>
> + root = semroot(addr);
> + runtime·xadd(addr, 1);
> +
> + // Easy case: no waiters?
> + if(runtime·atomicload(&root->cnt) == 0)
> +     return;
> +
> + // Harder case: search for a waiter and wake it.
> + runtime·lock(root);
> + if(runtime·atomicload(&root->cnt) == 0) {
> +     // Count already consumed by another goroutine.
> +     runtime·unlock(root);
> +     return;
> + }
> + for(s = root->head; s; s = s->next) {
> +     if(s->addr == addr) {
> +         runtime·xadd(&root->cnt, -1);
> +         semdequeue(root, s);
> +         break;
> +     }
> + }
> + runtime·unlock(root);
> + if(s)
> +     runtime·ready(s->g);

OK, I see what you mean.

> In addition to being easier (at least for me) to understand,
> this version avoids spurious semdequeue calls.
>
> In fact, maybe the manipulations of root->cnt should be done
> only inside semqueue and semdequeue.

As for understanding, I can't make up my mind. On one hand, if the cnt increment/decrement are in semqueue/semdequeue, respectively, it makes for a discipline (it's the node count in the end). On the other hand, I tried to move all synchronization to the upper level (the semacquire/semrelease functions) so that it's easier to observe and comprehend the synchronization algorithm w/o missing or forgetting any "details"; semdequeue/semenqueue are only single-threaded list manipulation functions. That was the main difficulty in comprehending your previous implementation, with a lot of small functions all of which are involved in the synchronization algorithm.

But what do you mean by "this version avoids spurious semdequeue calls"? My version also checks the root->cnt value before semdequeue.

As for performance, I see that your version is somewhat slower on some benchmarks and somewhat faster on others, but there are no radical differences (they are roughly the same).
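To make the "single-threaded list manipulation" point concrete, semqueue and semdequeue are presumably nothing more than FIFO list operations on the root, always called with the root lock held. A sketch (the doubly-linked prev/next layout and the s->g assignment are assumptions, not the actual patch):

    // Append s to root's wait list. Caller holds the root lock.
    static void
    semqueue(SemaRoot *root, uint32 volatile *addr, Sema *s)
    {
        s->g = g;           // current goroutine
        s->addr = addr;
        s->next = nil;
        s->prev = root->tail;
        if(root->tail)
            root->tail->next = s;
        else
            root->head = s;
        root->tail = s;
    }

    // Unlink s from root's wait list. Caller holds the root lock.
    static void
    semdequeue(SemaRoot *root, Sema *s)
    {
        if(s->next)
            s->next->prev = s->prev;
        else
            root->tail = s->prev;
        if(s->prev)
            s->prev->next = s->next;
        else
            root->head = s->next;
        s->prev = nil;
        s->next = nil;
    }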
On Tue, Jun 28, 2011 at 12:40, <dvyukov@google.com> wrote:
> On 2011/06/28 15:51:39, rsc wrote:
>> > Since root->cnt is manipulated in a lock-free manner on fast-path, it
>> > should be manipulated in a lock-free manner everywhere. Mutexes can't
>> > protect it anymore.
>>
>> I am not saying to change the way you access it.
>> I am saying to change *where* you access it, to
>> make the correctness of the algorithm easier to see.
>>
>> + root = semroot(addr);
>> + runtime·xadd(addr, 1);
>> +
>> + // Easy case: no waiters?
>> + if(runtime·atomicload(&root->cnt) == 0)
>> +     return;
>> +
>> + // Harder case: search for a waiter and wake it.
>> + runtime·lock(root);
>> + if(runtime·atomicload(&root->cnt) == 0) {
>> +     // Count already consumed by another goroutine.
>> +     runtime·unlock(root);
>> +     return;
>> + }
>> + for(s = root->head; s; s = s->next) {
>> +     if(s->addr == addr) {
>> +         runtime·xadd(&root->cnt, -1);
>> +         semdequeue(root, s);
>> +         break;
>> +     }
>> + }
>> + runtime·unlock(root);
>> + if(s)
>> +     runtime·ready(s->g);
>
> OK, I see what you mean.
>
>> In addition to being easier (at least for me) to understand,
>> this version avoids spurious semdequeue calls.
>>
>> In fact, maybe the manipulations of root->cnt should be done
>> only inside semqueue and semdequeue.
>
> As for understanding, I can't make my mind. On one hand, if cnt
> increment/decrement are in semqueue/semdequeue, respectively, it makes
> for a discipline (it's node count in the end). On the other hand, I
> tried to move all synchronization to the upper level
> (semacquire/semrelease functions) so that it's easier to observe and
> comprehend the synchronization algorithm w/o missing or forgetting any
> "details" (semdequeue/semenqueue are only single-threaded list
> manipulation functions) (that was the main difficulty to comprehend your
> previous implementation with a lot of small functions all of which are
> involved in the synchronization algorithm).
>
> But what do you mean by "this version avoids spurious semdequeue calls"?
> My version also checks root->cnt value before semdequeue.
>
> As for performance, I see that your version somewhat slower on some
> benchmarks and somewhat faster on another, but there are no radical
> differences (they are roughly the same).

semqueue/semdequeue are adding a waiter. cnt is the number of waiters. I think it makes sense to have them do the manipulation, as long as there's no performance difference.

Also, cnt is probably not a great name. s/cnt/nwait/

Russ
Hello rsc@golang.org (cc: golang-dev@googlegroups.com), Please take another look.
On 2011/06/28 16:43:51, rsc wrote:
> semqueue/semdequeue are adding a waiter.
> cnt is the number of waiters.
> I think it makes sense to have them do the
> manipulation, as long as there's no performance
> difference.
>
> Also, cnt is probably not a great name.
> s/cnt/nwait/

I argue that it's more important to structure such code around synchronization rather than around single-threaded logic. It's almost trivial for an experienced developer to follow intra-thread flow, while inter-thread interactions and mutual memory access ordering can hardly be called trivial for a developer of any level.

However, I moved all root->cnt mutations into critical sections and renamed it to nwait. Please take a look at this version.
http://codereview.appspot.com/4631059/diff/24002/src/pkg/runtime/sema.goc
File src/pkg/runtime/sema.goc (right):

http://codereview.appspot.com/4631059/diff/24002/src/pkg/runtime/sema.goc#new...
src/pkg/runtime/sema.goc:120: runtime·xadd(&root->nwait, 1);
There's an important but undocumented ordering here. Comment why this is important. That is, why not write

    runtime.lock(root);
    if(cansemacquire(addr)) {
        runtime.unlock(root);
        return;
    }
    runtime.xadd(&root->nwait, 1);

?

I know the answer but the code should say.

http://codereview.appspot.com/4631059/diff/24002/src/pkg/runtime/sema.goc#new...
src/pkg/runtime/sema.goc:122: // It's OK to mutate it non-atomically here.
This is far too clever. The comment doesn't say why it's okay, and even if it did I wouldn't want to have to figure out whether it was correct. Please use xadd everywhere or nowhere. If the latter, please use ++ and --.

http://codereview.appspot.com/4631059/diff/24002/src/pkg/runtime/sema.goc#new...
src/pkg/runtime/sema.goc:159: // It's OK to mutate it non-atomically here.
Same comment.

http://codereview.appspot.com/4631059/diff/24002/src/pkg/runtime/sema.goc#new...
src/pkg/runtime/sema.goc:166: if(s) {
Drop { }.
Hello rsc@golang.org (cc: golang-dev@googlegroups.com), Please take another look.
On 2011/06/28 17:30:46, rsc wrote:
> http://codereview.appspot.com/4631059/diff/24002/src/pkg/runtime/sema.goc#new...
> src/pkg/runtime/sema.goc:120: runtime·xadd(&root->nwait, 1);
> There's an important but undocumented ordering here.
> Comment why this is important. That is, why not write
>
> runtime.lock(root);
> if(cansemacquire(addr)) {
>     runtime.unlock(root);
>     return;
> }
> runtime.xadd(&root->nwait, 1);
>
> ?
>
> I know the answer but the code should say.

I doubt that it's possible to actually explain the thing in any reasonably sized comment, so I just clearly stated that the memory access sequence is critical, pointed to the pairing sequence, and referred to the Dekker/Peterson algorithm as the core pattern.

> http://codereview.appspot.com/4631059/diff/24002/src/pkg/runtime/sema.goc#new...
> src/pkg/runtime/sema.goc:122: // It's OK to mutate it non-atomically here.
> This is far too clever. The comment doesn't say why it's okay,
> and even if it did I wouldn't want to have to figure out whether
> it was correct. Please use xadd everywhere or nowhere.
>
> If the latter, please use ++ and --.

It's a pity, but done :(

> http://codereview.appspot.com/4631059/diff/24002/src/pkg/runtime/sema.goc#new...
> src/pkg/runtime/sema.goc:159: // It's OK to mutate it non-atomically here.
> Same comment.

Done.

> http://codereview.appspot.com/4631059/diff/24002/src/pkg/runtime/sema.goc#new...
> src/pkg/runtime/sema.goc:166: if(s) {
> Drop { }.

Done.
http://codereview.appspot.com/4631059/diff/7004/src/pkg/runtime/sema.goc
File src/pkg/runtime/sema.goc (right):

http://codereview.appspot.com/4631059/diff/7004/src/pkg/runtime/sema.goc#newc...
src/pkg/runtime/sema.goc:120: // root->nwait increment followed by the load of *addr in cansemacquire()
I'd rather explain the code than refer to algorithms that people might not know. Here:

    // Add ourselves to nwait to disable "easy case" in semrelease.
    runtime.xadd(&root->nwait, 1);
    // Check cansemacquire to avoid missed wakeup.
    if(cansemacquire(addr)) {
        ...
    }
    // Any semrelease after the cansemacquire knows we're waiting
    // (we set nwait above), so go to sleep.
    semqueue....

and in semrelease:

    runtime.xadd(addr, 1);
    // Easy case: no waiters?
    // This check must happen after the xadd, to avoid a missed wakeup
    // (see loop in semacquire).
    if(runtime.atomicload(&root->nwait) == 0)
        return;
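Folding these suggested comments into the slow-path loop discussed earlier gives roughly the following shape for semacquire (a reconstruction from the review thread, not the committed diff):

    void
    runtime·semacquire(uint32 volatile *addr)
    {
        Sema s;
        SemaRoot *root;

        // Easy case.
        if(cansemacquire(addr))
            return;

        // Harder case: enqueue as a waiter and sleep, retrying until
        // the count can be taken.
        root = semroot(addr);
        for(;;) {
            runtime·lock(root);
            // Add ourselves to nwait to disable "easy case" in semrelease.
            runtime·xadd(&root->nwait, 1);
            // Check cansemacquire to avoid missed wakeup.
            if(cansemacquire(addr)) {
                runtime·xadd(&root->nwait, -1);
                runtime·unlock(root);
                return;
            }
            // Any semrelease after the cansemacquire knows we're waiting
            // (we set nwait above), so go to sleep.
            semqueue(root, addr, &s);
            g->status = Gwaiting;
            runtime·unlock(root);
            runtime·gosched();
            if(cansemacquire(addr))
                return;
        }
    }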
Hello rsc@golang.org (cc: golang-dev@googlegroups.com), Please take another look.
LGTM Thanks for iterating. Do the numbers in the CL description need to be updated?
On 2011/06/28 18:49:52, rsc wrote:
> LGTM
>
> Thanks for iterating. Do the numbers in the CL description
> need to be updated?

No problem. It depends on what workflows the description is involved in after an issue is closed. There are no radical changes in the numbers, so I would prefer not to remeasure unless you insist.
LGTM
*** Submitted as http://code.google.com/p/go/source/detail?r=0d277395434f ***

runtime: replace Semacquire/Semrelease implementation

(The committed CL description and benchmark table are the same as in the Description section at the top of this issue.)

R=rsc
CC=golang-dev
http://codereview.appspot.com/4631059

Committer: Russ Cox <rsc@golang.org>