Created: 13 years, 5 months ago by dvyukov
Modified: 13 years, 5 months ago
Reviewers:
CC: rsc, golang-dev
Visibility: Public.
Description:
runtime: replace Semacquire/Semrelease implementation
1. The implementation uses a distributed hash table of waitlists instead of a centralized one.
It significantly improves scalability for uncontended semaphores.
2. The implementation provides a wait-free fast path for signalers.
3. The implementation uses fewer locks (1 lock/unlock instead of 5 for Semacquire).
4. The runtime·ready() call is moved out of the critical section.
5. Semacquire() does not call semwake().
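For orientation, here is a rough sketch of the data layout behind points 1 and 2. The identifiers (SemaRoot, Sema, semtable, SEMTABLESZ, semroot, nwait, head/tail) come from the review comments below; the exact fields, the table size, and the padding amount are assumptions for illustration, not the committed code.

    // A single waiter; one per goroutine blocked in Semacquire.
    typedef struct Sema Sema;
    struct Sema
    {
        uint32 volatile *addr;  // semaphore address this goroutine waits on
        G *g;                   // the waiting goroutine
        Sema *prev;             // list links (a doubly-linked list is assumed)
        Sema *next;
    };

    // One waitlist root per hash bucket.
    typedef struct SemaRoot SemaRoot;
    struct SemaRoot
    {
        Lock;           // embedded lock; protects head/tail
        Sema *head;     // FIFO of waiters hashing to this root
        Sema *tail;
        uint32 nwait;   // number of waiters; checked lock-free on the release fast path
    };

    enum { SEMTABLESZ = 251 };  // table size is an assumption

    // Pad each root so that distinct roots never share a cache line.
    static union
    {
        SemaRoot root;
        uint8 pad[128];
    } semtable[SEMTABLESZ];

    // Hash a semaphore address to its root, so unrelated semaphores
    // contend on different locks and different cache lines.
    static SemaRoot*
    semroot(uint32 volatile *addr)
    {
        return &semtable[((uintptr)addr >> 3) % SEMTABLESZ].root;
    }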
Benchmark results on HP Z600 (2 x Xeon E5620, 8 HT cores, 2.40GHz)
are as follows:
benchmark old ns/op new ns/op delta
runtime_test.BenchmarkSemaUncontended 58.20 36.30 -37.63%
runtime_test.BenchmarkSemaUncontended-2 199.00 18.30 -90.80%
runtime_test.BenchmarkSemaUncontended-4 327.00 9.20 -97.19%
runtime_test.BenchmarkSemaUncontended-8 491.00 5.32 -98.92%
runtime_test.BenchmarkSemaUncontended-16 946.00 4.18 -99.56%
runtime_test.BenchmarkSemaSyntNonblock 59.00 36.80 -37.63%
runtime_test.BenchmarkSemaSyntNonblock-2 167.00 138.00 -17.37%
runtime_test.BenchmarkSemaSyntNonblock-4 333.00 129.00 -61.26%
runtime_test.BenchmarkSemaSyntNonblock-8 464.00 130.00 -71.98%
runtime_test.BenchmarkSemaSyntNonblock-16 1015.00 136.00 -86.60%
runtime_test.BenchmarkSemaSyntBlock 58.80 36.70 -37.59%
runtime_test.BenchmarkSemaSyntBlock-2 294.00 149.00 -49.32%
runtime_test.BenchmarkSemaSyntBlock-4 333.00 177.00 -46.85%
runtime_test.BenchmarkSemaSyntBlock-8 471.00 221.00 -53.08%
runtime_test.BenchmarkSemaSyntBlock-16 990.00 227.00 -77.07%
runtime_test.BenchmarkSemaWorkNonblock 829.00 832.00 +0.36%
runtime_test.BenchmarkSemaWorkNonblock-2 425.00 419.00 -1.41%
runtime_test.BenchmarkSemaWorkNonblock-4 308.00 220.00 -28.57%
runtime_test.BenchmarkSemaWorkNonblock-8 394.00 147.00 -62.69%
runtime_test.BenchmarkSemaWorkNonblock-16 1510.00 149.00 -90.13%
runtime_test.BenchmarkSemaWorkBlock 828.00 813.00 -1.81%
runtime_test.BenchmarkSemaWorkBlock-2 428.00 436.00 +1.87%
runtime_test.BenchmarkSemaWorkBlock-4 232.00 219.00 -5.60%
runtime_test.BenchmarkSemaWorkBlock-8 392.00 251.00 -35.97%
runtime_test.BenchmarkSemaWorkBlock-16 1524.00 298.00 -80.45%
sync_test.BenchmarkMutexUncontended 24.10 24.00 -0.41%
sync_test.BenchmarkMutexUncontended-2 12.00 12.00 +0.00%
sync_test.BenchmarkMutexUncontended-4 6.25 6.17 -1.28%
sync_test.BenchmarkMutexUncontended-8 3.43 3.34 -2.62%
sync_test.BenchmarkMutexUncontended-16 2.34 2.32 -0.85%
sync_test.BenchmarkMutex 24.70 24.70 +0.00%
sync_test.BenchmarkMutex-2 208.00 99.50 -52.16%
sync_test.BenchmarkMutex-4 2744.00 256.00 -90.67%
sync_test.BenchmarkMutex-8 5137.00 556.00 -89.18%
sync_test.BenchmarkMutex-16 5368.00 1284.00 -76.08%
sync_test.BenchmarkMutexSlack 24.70 25.00 +1.21%
sync_test.BenchmarkMutexSlack-2 1094.00 186.00 -83.00%
sync_test.BenchmarkMutexSlack-4 3430.00 402.00 -88.28%
sync_test.BenchmarkMutexSlack-8 5051.00 1066.00 -78.90%
sync_test.BenchmarkMutexSlack-16 6806.00 1363.00 -79.97%
sync_test.BenchmarkMutexWork 793.00 792.00 -0.13%
sync_test.BenchmarkMutexWork-2 398.00 398.00 +0.00%
sync_test.BenchmarkMutexWork-4 1441.00 308.00 -78.63%
sync_test.BenchmarkMutexWork-8 8532.00 847.00 -90.07%
sync_test.BenchmarkMutexWork-16 8225.00 2760.00 -66.44%
sync_test.BenchmarkMutexWorkSlack 793.00 793.00 +0.00%
sync_test.BenchmarkMutexWorkSlack-2 418.00 414.00 -0.96%
sync_test.BenchmarkMutexWorkSlack-4 4481.00 480.00 -89.29%
sync_test.BenchmarkMutexWorkSlack-8 6317.00 1598.00 -74.70%
sync_test.BenchmarkMutexWorkSlack-16 9111.00 3038.00 -66.66%
Patch Set 1
Patch Set 2: diff -r 6f1145ee588d https://go.googlecode.com/hg/
Patch Set 3: diff -r 6f1145ee588d https://go.googlecode.com/hg/
Total comments: 12
Patch Set 4: diff -r 6f1145ee588d https://go.googlecode.com/hg/
Patch Set 5: diff -r 6f1145ee588d https://go.googlecode.com/hg/
Patch Set 6: diff -r 9e53a1312e25 https://go.googlecode.com/hg/
Total comments: 18
Patch Set 7: diff -r 607e0f74161f https://go.googlecode.com/hg/
Patch Set 8: diff -r 607e0f74161f https://go.googlecode.com/hg/
Patch Set 9: diff -r 607e0f74161f https://go.googlecode.com/hg/
Patch Set 10: diff -r 607e0f74161f https://go.googlecode.com/hg/
Total comments: 4
Patch Set 11: diff -r 607e0f74161f https://go.googlecode.com/hg/
Total comments: 1
Patch Set 12: diff -r 607e0f74161f https://go.googlecode.com/hg/

Messages
Total messages: 42
Hello golang-dev@googlegroups.com (cc: dvyukov@google.com, golang-dev@googlegroups.com), I'd like you to review this change to https://go.googlecode.com/hg/
Here is comparison graph for uncontended semaphore:
http://chart.googleapis.com/chart?cht=lc&chs=600x500&chxt=x,y&chxr=0,1,16,1|1...

Here is comparison graph for contended nonblocking semaphore:
http://chart.googleapis.com/chart?cht=lc&chs=600x500&chxt=x,y&chxr=0,1,16,1|1...

Here is comparison graph for contended blocking semaphore:
http://chart.googleapis.com/chart?cht=lc&chs=600x500&chxt=x,y&chxr=0,1,16,1|1...

Here is comparison graph for contended sync.Mutex:
http://chart.googleapis.com/chart?cht=lc&chs=600x500&chxt=x,y&chxr=0,1,16,1|1...

Here is comparison graph for contended sync.Mutex with some amount of local work outside of critical section:
http://chart.googleapis.com/chart?cht=lc&chs=600x500&chxt=x,y&chxr=0,1,16,1|1...
Here is comparison graph for uncontended semaphore:
http://goo.gl/yZcE4

Here is comparison graph for contended nonblocking semaphore:
http://goo.gl/uY9S7

Here is comparison graph for contended blocking semaphore:
http://goo.gl/OUk5N

Here is comparison graph for contended sync.Mutex:
http://goo.gl/PfBEL

Here is comparison graph for contended sync.Mutex with some amount of local work outside of critical section:
http://goo.gl/RQLvh

The results are obtained with the scalability benchmarking suite (http://codereview.appspot.com/4644051) on a 2 x Intel(R) Xeon(R) CPU E5620 @ 2.40GHz (HP Z600) system.
can we get the benchmarks checked in first? that way people can update to before this change and run the benchmarks themselves and then update after and compare.
On 2011/06/22 14:50:29, rsc wrote:
> can we get the benchmarks checked in first?
> that way people can update to before this change
> and run the benchmarks themselves
> and then update after and compare.

Ok, let's wait for the benchmarks. The changes are quite free-standing, so hopefully I will be able to commit them later without significant modifications.
I'm thrilled to see the performance improvements here. I'm less thrilled about the fact that it's hard to see what changed, because even what stayed the same has been reformatted.

Reading diffs is a central part of code review and how we make progress in Go. Please take the time to make the diffs smaller. Thanks.

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/386/asm.s
File src/pkg/runtime/386/asm.s (right):

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/386/asm.s#new...
src/pkg/runtime/386/asm.s:321: TEXT runtime·load(SB), 7, $0
It's not clear to me why this is in assembly. Why not implement this in C?

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc
File src/pkg/runtime/sema.goc (left):

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc#oldc...
src/pkg/runtime/sema.goc:144: // Easy case.
Please put this comment back.

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc#oldc...
src/pkg/runtime/sema.goc:148: // Harder case:
Please put this comment back.

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc
File src/pkg/runtime/sema.goc (right):

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc#newc...
src/pkg/runtime/sema.goc:40: // Modern processors tend to have 64-byte cache lines,
Better:

    static union {
        SemaRoot;
        uint8 pad[128];
    } semtable[SEMTABLESZ];

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc#newc...
src/pkg/runtime/sema.goc:58: semqueue(SemaRoot *root, uint32 volatile *addr, Sema *sem)
The diffs here would be smaller without the variable name change. Please s/sem/s/ throughout to reduce the diffs.

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc#newc...
src/pkg/runtime/sema.goc:87: trysemacquire(uint32 volatile *addr)
This function is identical to the old cansemacquire, except that it's named differently, formatted differently, and uses a different loop construct. That is, except for rewriting it, nothing changed.

Please put the old function back, to reduce the diffs.

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc#newc...
src/pkg/runtime/sema.goc:92: {
The brace goes on the previous line. In general, when you are replacing code, the new code should look like the code being replaced.

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc#newc...
src/pkg/runtime/sema.goc:138: if(runtime·load(&root->cnt) == 0)
Given that cnt is declared volatile, why not just write if(root->cnt == 0) ?
On Wed, Jun 22, 2011 at 10:42, <dvyukov@google.com> wrote:
> Here is comparison graph for uncontended semaphore:
> http://goo.gl/yZcE4

Fantastic!

> Here is comparison graph for contended nonblocking semaphore:
> http://goo.gl/uY9S7
>
> Here is comparison graph for contended blocking semaphore:
> http://goo.gl/OUk5N
>
> Here is comparison graph for contended sync.Mutex:
> http://goo.gl/PfBEL
>
> Here is comparison graph for contended sync.Mutex with some amount of
> local work outside of critical section:
> http://goo.gl/RQLvh

On these, there is a spot in the graph key for 'ideal' but I don't see the corresponding gray line. Am I missing it? What would ideal be in these situations?
On 2011/06/22 15:07:12, rsc wrote:
> On Wed, Jun 22, 2011 at 10:42, <mailto:dvyukov@google.com> wrote:
> > Here is comparison graph for uncontended semaphore:
> > http://goo.gl/yZcE4
>
> Fantastic!

Yeah. It's the result of the distributed hash table with waitlists and the fast-pathed release().

> > Here is comparison graph for contended nonblocking semaphore:
> > http://goo.gl/uY9S7
> >
> > Here is comparison graph for contended blocking semaphore:
> > http://goo.gl/OUk5N
> >
> > Here is comparison graph for contended sync.Mutex:
> > http://goo.gl/PfBEL
> >
> > Here is comparison graph for contended sync.Mutex with some amount of
> > local work outside of critical section:
> > http://goo.gl/RQLvh
>
> On these, there is a spot in the graph key for 'ideal'
> but I don't see the corresponding gray line. Am I missing it?
> What would ideal be in these situations?

The suite contains a flag -draw.ideal=false that turns off the 'ideal' line. What you see is an issue with the URL-generating code (it's just a bit simpler to generate such a URL). If I enable 'ideal' then all lines become just horizontal lines near 0 :) So I turn off 'ideal' on benchmarks that are not intended to scale (like contended mutex).
http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/386/asm.s
File src/pkg/runtime/386/asm.s (right):

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/386/asm.s#new...
src/pkg/runtime/386/asm.s:321: TEXT runtime·load(SB), 7, $0
On 2011/06/22 15:05:41, rsc wrote:
> It's not clear to me why this is in assembly.
> Why not implement this in C?

For consistency with other atomic functions. It seems that the author decided to use asm instead of C+intrinsics (which makes some sense). If you wish I can move cas/xadd/load to a C file.

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc
File src/pkg/runtime/sema.goc (right):

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc#newc...
src/pkg/runtime/sema.goc:40: // Modern processors tend to have 64-byte cache lines,
On 2011/06/22 15:05:41, rsc wrote:
> Better:
>
> static union {
>     SemaRoot;
>     uint8 pad[128];
> } semtable[SEMTABLESZ];

Done.

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc#newc...
src/pkg/runtime/sema.goc:58: semqueue(SemaRoot *root, uint32 volatile *addr, Sema *sem)
On 2011/06/22 15:05:41, rsc wrote:
> The diffs here would be smaller without the variable name change.
> Please s/sem/s/ throughout to reduce the diffs.

Done.

http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc#newc...
src/pkg/runtime/sema.goc:87: trysemacquire(uint32 volatile *addr)
On 2011/06/22 15:05:41, rsc wrote:
> This function is identical to the old cansemacquire,
> except that it's named differently, formatted differently,
> and uses a different loop construct. That is, except
> for rewriting it, nothing changed.
>
> Please put the old function back, to reduce the diffs.

Done.
>> It's not clear to me why this is in assembly.
>> Why not implement this in C?
>
> For consistency with other atomic functions. It seems that an author
> decided to use asm instead of C+intrinsics (which makes some sense). If
> you wish I can move cas/xadd/load to a C file.

You can't, because the C compiler you are using has no asm builtin.

My question was really why not write

    uint32
    runtime.load(uint32 *addr)
    {
        return *addr;
    }

which will compile to the assembly you have.

But the second level question is why you need the load function at all, given the volatile marking.

Russ
On 2011/06/22 15:05:41, rsc wrote:
> I'm thrilled to see the performance improvements here.
> I'm less thrilled about the fact that it's hard to see what
> changed, because even what stayed the same has been
> reformatted.
>
> Reading diffs is a central part of code review and how
> we make progress in Go. Please take the time to make
> the diffs smaller.

Actually I am considering this as a rewrite, so diffs are not the way to work with it. The high-level algo is changed, and the details are changed.

> http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc#oldc...
> src/pkg/runtime/sema.goc:144: // Easy case.
> Please put this comment back.

Done.

> http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc#oldc...
> src/pkg/runtime/sema.goc:148: // Harder case:
> Please put this comment back.

Done.

> http://codereview.appspot.com/4631059/diff/4001/src/pkg/runtime/sema.goc#newc...
> src/pkg/runtime/sema.goc:92: {
> The brace goes on the previous line.
> In general, when you are replacing code, the new code should
> look like the code being replaced.

Aha! I assumed that goc uses "braces on separate lines" style for some reason. Perhaps the reason was an initial look at a function like:

    void func()
    {
        single line
    }

Now I see that goc uses mixed style. Done.
>> Reading diffs is a central part of code review and how
>> we make progress in Go. Please take the time to make
>> the diffs smaller.
>
> Actually I am considering this as a rewrite, so diffs are not the way to
> work with it. The high level algo is changes, details are changes.

Even though you may consider it a rewrite, it helps me understand the new code if I can see what is different from the old code. Thanks for putting some of the original pieces back.

By the way, you need to run hg mail once you're ready for us to take another look. (hg mail uploads your current copy before sending its mail.)

Russ
On 2011/06/22 16:06:56, rsc wrote:
> >> It's not clear to me why this is in assembly.
> >> Why not implement this in C?
> >
> > For consistency with other atomic functions. It seems that an author
> > decided to use asm instead of C+intrinsics (which makes some sense). If
> > you wish I can move cas/xadd/load to a C file.
>
> You can't, because the C compiler you are using has no asm builtin.

But does it have __sync_xxx builtins (__sync_bool_compare_and_swap)?

> My question was really why not write
>
> uint32
> runtime.load(uint32 *addr)
> {
>     return *addr;
> }
>
> which will compile to the assembly you have.
>
> But the second level question is why you need the load
> function at all, given the volatile marking.

Well, how I would implement it in my own project with vanilla gcc is:

    uint32 load(uint32 volatile* a)
    {
        __asm__ __volatile__ ("" ::: "memory");
        uint32 v = *a;
        __asm__ __volatile__ ("" ::: "memory");
        return v;
    }

Is your proposed C code equivalent (the problem may be with potential inlining and memory access reordering at the compiler level)? If yes, then I am happy to move it to a C file.

> Given that cnt is declared volatile, why not just write
> if(root->cnt == 0)

What I need in terms of C1x is:

    atomic_fetch_add(addr, 1, memory_order_seq_cst);
    if (atomic_load(&root->cnt, memory_order_seq_cst) == 0)
        return;

Go runtime atomics are not documented, so it's hard to tell whether it's enough or not. But on ARM a seq_cst load is most likely not just a plain load. On IA-32/Intel64 it's definitely enough, though. I think it's better to have such fundamental primitives as separate functions anyway. Well, actually even on x86 a plain load may be not enough, because it actually depends on how xadd and cas are implemented, so it's better to keep them all in one place.
On 2011/06/22 16:11:09, rsc wrote:
> >> Reading diffs is a central part of code review and how
> >> we make progress in Go. Please take the time to make
> >> the diffs smaller.
> >
> > Actually I am considering this as a rewrite, so diffs are not the way to
> > work with it. The high level algo is changes, details are changes.
>
> Even though you may consider it a rewrite, it helps me
> understand the new code if I can see what is different from
> the old code. Thanks for putting some of the original pieces back.
>
> By the way, you need to run hg mail once you're ready for us
> to take another look. (hg mail uploads your current copy
> before sending its mail.)

Yes, I just want to wait for our decision on runtime.load() and upload it in a batch.
>> You can't, because the C compiler you are using has no asm builtin.
>
> But does it have __sync_xxx builtins (__sync_bool_compare_and_swap)?

Nope, just C. That's why there's the assembly file.

> Does your proposed C code equivalent (the problem may be with potential
> inlining and memory access reorderings on compiler level)?
> If yes, then I am happy to move it to a C file.

The C compiler does no inlining. Putting it in a separate function guarantees that it will not be reordered.

If you want to keep it in a separate function, please call it atomic_load. load sounds like it might have something to do with loading libraries or files.

Russ
Hello rsc@golang.org (cc: golang-dev@googlegroups.com), Please take another look.
On 2011/06/22 16:24:00, rsc wrote:
> >> You can't, because the C compiler you are using has no asm builtin.
> >
> > But does it have __sync_xxx builtins (__sync_bool_compare_and_swap)?
>
> Nope, just C. That's why there's the assembly file.
>
> > Does your proposed C code equivalent (the problem may be with potential
> > inlining and memory access reorderings on compiler level)?
> > If yes, then I am happy to move it to a C file.
>
> The C compiler does no inlining. Putting it in a separate
> function guarantees that it will not be reordered.

OK, moving to a C file.

> If you want to keep it in a separate function, please call it
> atomic_load. load sounds like it might have something
> to do with loading libraries or files.

I call it atomicload, because the file uses the C naming convention (the name looks way too long already :) ).
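For context on why a plain C function is enough here: the runtime's C compiler never inlines across functions, so the call itself acts as a compiler barrier. A sketch of what such a C load might look like (an illustration mirroring the x86 reasoning above, not the exact patch; per-architecture details, in particular ARM, may differ):

    // atomic.c (sketch) - an atomic 32-bit load for the runtime.
    // Being a separate, never-inlined function, the compiler cannot
    // reorder or cache the load across the call; on x86 an aligned
    // 32-bit load is itself atomic, and the locked xadd/cas used on
    // the other side of the protocol supply the required ordering.
    #include "runtime.h"

    uint32
    runtime·atomicload(uint32 volatile *addr)
    {
        return *addr;
    }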
Hello rsc@golang.org (cc: golang-dev@googlegroups.com), Please take another look.
Hello rsc@golang.org (cc: golang-dev@googlegroups.com), Please take another look.
http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/Makefile
File src/pkg/runtime/Makefile (right):

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/Makefile#newcode84
src/pkg/runtime/Makefile:84: atomic.$O\
move up (sort)

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/arm/atomic.c
File src/pkg/runtime/arm/atomic.c (right):

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/arm/atomic.c#new...
src/pkg/runtime/arm/atomic.c:9: runtime·load(uint32 volatile* addr)
atomicload

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc
File src/pkg/runtime/sema.goc (right):

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc#newcode35
src/pkg/runtime/sema.goc:35: Lock lock;
can just say Lock; here

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc#newcode55
src/pkg/runtime/sema.goc:55: static SemaRoot *
s/ */*/

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc#newcode119
src/pkg/runtime/sema.goc:119: runtime·lock(&root->lock);
runtime·lock(root);
etc

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc#newcode141
src/pkg/runtime/sema.goc:141: // Easy case (incremented the counter and there is no waiters).

    root = semroot(addr);
    runtime·xadd(addr, 1);

    // Easy case: no waiters?
    if(runtime·atomicload(&root->cnt) == 0)
        return;

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc#newcode147
src/pkg/runtime/sema.goc:147: // Harder case (search for a waiter and wake-up it).
s/wake-up it/wake it/

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc#newcode154
src/pkg/runtime/sema.goc:154: s = 0;
s = nil;
(for pointers)

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc#newcode162
src/pkg/runtime/sema.goc:162: runtime·xadd(&root->cnt, -1);
The manipulation of cnt here seems a lot more racy than it needs to be. I expected:

    lock(root)
    if(load(cnt) == 0) {
        unlock(root);
        return;
    }
    for(each s) {
        if(s->addr == addr) {
            semdequeue(root, s)
            break
        }
    }
    unlock(root)
    if(s)
        ready(s->g)

Does that work?
http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/Makefile
File src/pkg/runtime/Makefile (right):

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/Makefile#newcode84
src/pkg/runtime/Makefile:84: atomic.$O\
On 2011/06/28 14:08:51, rsc wrote:
> move up (sort)

Done.

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/arm/atomic.c
File src/pkg/runtime/arm/atomic.c (right):

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/arm/atomic.c#new...
src/pkg/runtime/arm/atomic.c:9: runtime·load(uint32 volatile* addr)
On 2011/06/28 14:08:51, rsc wrote:
> atomicload

Done.

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc
File src/pkg/runtime/sema.goc (right):

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc#newcode35
src/pkg/runtime/sema.goc:35: Lock lock;
On 2011/06/28 14:08:51, rsc wrote:
> can just say Lock; here

Done.

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc#newcode55
src/pkg/runtime/sema.goc:55: static SemaRoot *
On 2011/06/28 14:08:51, rsc wrote:
> s/ */*/

Done.

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc#newcode119
src/pkg/runtime/sema.goc:119: runtime·lock(&root->lock);
On 2011/06/28 14:08:51, rsc wrote:
> runtime·lock(root);
> etc

Done.

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc#newcode147
src/pkg/runtime/sema.goc:147: // Harder case (search for a waiter and wake-up it).
On 2011/06/28 14:08:51, rsc wrote:
> s/wake-up it/wake it/

Done.

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc#newcode154
src/pkg/runtime/sema.goc:154: s = 0;
On 2011/06/28 14:08:51, rsc wrote:
> s = nil;
> (for pointers)

Done.

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc#newcode162
src/pkg/runtime/sema.goc:162: runtime·xadd(&root->cnt, -1);
On 2011/06/28 14:08:51, rsc wrote:
> The manipulation of cnt here seems a lot more racy
> than it needs to be. I expected:
>
> lock(root)
> if(load(cnt) == 0) {
>     unlock(root);
>     return;
> }
> for(each s) {
>     if(s->addr == addr) {
>         semdequeue(root, s)
>         break
>     }
> }
> unlock(root)
> if(s)
>     ready(s->g)
>
> Does that work?

Well, yes, it works. But I especially moved manipulations of root->cnt out of the critical section, because otherwise it visibly affects uncontended performance and significantly affects scaling.

I've made a quick prototype:

    void
    runtime·semacquire(uint32 volatile *addr)
    {
        Sema s;
        SemaRoot *root;

        if(cansemacquire(addr))
            return;

        root = semroot(addr);
        for(;;) {
            runtime·lock(root);
            root->cnt += 1;
            if(cansemacquire(addr)) {
                root->cnt -= 1;
                runtime·unlock(root);
                return;
            }
            semqueue(root, addr, &s);
            g->status = Gwaiting;
            runtime·unlock(root);
            runtime·gosched();
            if(cansemacquire(addr))
                return;
        }
    }

    void
    runtime·semrelease(uint32 volatile *addr)
    {
        Sema *s;
        SemaRoot *root;

        root = semroot(addr);
        runtime·xadd(addr, 1);
        runtime·lock(root);
        if(root->cnt == 0) {
            runtime·unlock(root);
            return;
        }
        for(s = root->head; s; s = s->next) {
            if(s->addr == addr) {
                semdequeue(root, s);
                root->cnt -= 1;
                break;
            }
        }
        runtime·unlock(root);
        if(s) {
            runtime·ready(s->g);
        }
    }

And here are the results:

runtime_test.BenchmarkSemaUncontended 50000000 65.3 ns/op
runtime_test.BenchmarkSemaUncontended-2 50000000 32.7 ns/op
runtime_test.BenchmarkSemaUncontended-4 100000000 17.1 ns/op
runtime_test.BenchmarkSemaUncontended-8 200000000 9.36 ns/op
runtime_test.BenchmarkSemaSyntNonblock 50000000 66.0 ns/op
runtime_test.BenchmarkSemaSyntNonblock-2 10000000 222 ns/op
runtime_test.BenchmarkSemaSyntNonblock-4 10000000 271 ns/op
runtime_test.BenchmarkSemaSyntNonblock-8 5000000 530 ns/op
runtime_test.BenchmarkSemaSyntBlock 50000000 66.4 ns/op
runtime_test.BenchmarkSemaSyntBlock-2 10000000 343 ns/op
runtime_test.BenchmarkSemaSyntBlock-4 5000000 310 ns/op
runtime_test.BenchmarkSemaSyntBlock-8 5000000 534 ns/op
runtime_test.BenchmarkSemaWorkNonblock 2000000 836 ns/op
runtime_test.BenchmarkSemaWorkNonblock-2 5000000 449 ns/op
runtime_test.BenchmarkSemaWorkNonblock-4 5000000 358 ns/op
runtime_test.BenchmarkSemaWorkNonblock-8 5000000 424 ns/op
runtime_test.BenchmarkSemaWorkBlock 2000000 835 ns/op
runtime_test.BenchmarkSemaWorkBlock-2 5000000 446 ns/op
runtime_test.BenchmarkSemaWorkBlock-4 10000000 284 ns/op
runtime_test.BenchmarkSemaWorkBlock-8 5000000 445 ns/op
Hello rsc@golang.org (cc: golang-dev@googlegroups.com), Please take another look.
On Tue, Jun 28, 2011 at 10:55, <dvyukov@google.com> wrote:
> Well, yes, it works. But I especially moved manipulations of root->cnt
> out of the critical section, because otherwise it visibly affects
> uncontended performance and significantly affects scaling. I've made a
> quick prototype:

I wasn't clear enough. My proposal was to leave the fast path check but then after that use the code I wrote.

Your comment on cnt says // Read w/o lock but the code is actually writing it without the lock too, which is confusing and doesn't seem to be necessary for the scaling.

Russ
Hello rsc@golang.org (cc: golang-dev@googlegroups.com), Please take another look.
Also the racy cnt access looks the same.

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc
File src/pkg/runtime/sema.goc (right):

http://codereview.appspot.com/4631059/diff/4/src/pkg/runtime/sema.goc#newcode141
src/pkg/runtime/sema.goc:141: // Easy case (incremented the counter and there is no waiters).
On 2011/06/28 14:08:51, rsc wrote:
> root = semroot(addr);
> runtime·xadd(addr, 1);
>
> // Easy case: no waiters?
> if(runtime·atomicload(&root->cnt) == 0)
>     return;

You didn't apply this change.
On 2011/06/28 15:10:08, rsc wrote:
> On Tue, Jun 28, 2011 at 10:55, <mailto:dvyukov@google.com> wrote:
> > Well, yes, it works. But I especially moved manipulations of root->cnt
> > out of the critical section, because otherwise it visibly affects
> > uncontended performance and significantly affects scaling. I've made a
> > quick prototype:
>
> I wasn't clear enough. My proposal was to leave the
> fast path check but then after that use the code I wrote.
>
> Your comment on cnt says // Read w/o lock
> but the code is actually writing it without the lock too,
> which is confusing and doesn't seem to be necessary
> for the scaling.

Since root->cnt is manipulated in a lock-free manner on the fast path, it should be manipulated in a lock-free manner everywhere. Mutexes can't protect it anymore. The essence of the pattern is:

    // producer thread
    atomic_fetch_add(addr, 1, memory_order_seq_cst);
    r1 = atomic_load(root->cnt, memory_order_seq_cst);

    // consumer thread
    atomic_fetch_add(root->cnt, 1, memory_order_seq_cst);
    r2 = atomic_load(addr, memory_order_seq_cst);

Atomic operations with seq_cst memory ordering ensure that r1==r2==0 can't happen (this in turn means that either the producer signals the consumer, or the consumer does not block, or both; that is, no deadlock happens). Since one part of the pattern happens w/o the mutex, the second part under the mutex can't be expressed with plain memory operations.
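Mapping that pattern onto the primitives used in this CL gives the following fragment-level illustration of the argument (not standalone code; cansemacquire is the existing helper that tries to take the count at *addr):

    // semrelease side:
    runtime·xadd(addr, 1);                      // (R1) publish the count
    if(runtime·atomicload(&root->cnt) == 0)     // (R2) then look for waiters
        return;                                 // fast path: nobody to wake

    // semacquire slow-path side:
    runtime·xadd(&root->cnt, 1);                // (A1) announce that a waiter exists
    if(cansemacquire(addr)) {                   // (A2) then re-check the count
        runtime·xadd(&root->cnt, -1);           // got it; undo the announcement
        return;
    }
    // ...otherwise queue and go to sleep...

    // Each side performs its write (R1/A1) before its read (R2/A2). With
    // seq_cst ordering at least one side must observe the other's write:
    // either the releaser sees cnt != 0 and goes on to wake a waiter, or
    // the acquirer sees the released count and never sleeps. The lost-wakeup
    // outcome (r1 == r2 == 0 above) is therefore impossible.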
Hello rsc@golang.org (cc: golang-dev@googlegroups.com), Please take another look.
On 2011/06/28 15:39:46, rsc wrote: > Also the racy cnt access looks the same. Please post a prototype of semacquire/semrelease, perhaps I do not understand what exactly you want to change.
> Since root->cnt is manipulated in a lock-free manner on fast-path, it
> should be manipulated in a lock-free manner everywhere. Mutexes can't
> protect it anymore.

I am not saying to change the way you access it. I am saying to change *where* you access it, to make the correctness of the algorithm easier to see.

+ root = semroot(addr);
+ runtime·xadd(addr, 1);
+
+ // Easy case: no waiters?
+ if(runtime·atomicload(&root->cnt) == 0)
+     return;
+
+ // Harder case: search for a waiter and wake it.
+ runtime·lock(root);
+ if(runtime·atomicload(&root->cnt) == 0) {
+     // Count already consumed by another goroutine.
+     runtime·unlock(root);
+     return;
+ }
+ for(s = root->head; s; s = s->next) {
+     if(s->addr == addr) {
+         runtime·xadd(&root->cnt, -1);
+         semdequeue(root, s);
+         break;
+     }
+ }
+ runtime·unlock(root);
+ if(s)
+     runtime·ready(s->g);

In addition to being easier (at least for me) to understand, this version avoids spurious semdequeue calls.

In fact, maybe the manipulations of root->cnt should be done only inside semqueue and semdequeue.

Russ
On 2011/06/28 15:51:39, rsc wrote:
> > Since root->cnt is manipulated in a lock-free manner on fast-path, it
> > should be manipulated in a lock-free manner everywhere. Mutexes can't
> > protect it anymore.
>
> I am not saying to change the way you access it.
> I am saying to change *where* you access it, to
> make the correctness of the algorithm easier to see.
>
> + root = semroot(addr);
> + runtime·xadd(addr, 1);
> +
> + // Easy case: no waiters?
> + if(runtime·atomicload(&root->cnt) == 0)
> +     return;
> +
> + // Harder case: search for a waiter and wake it.
> + runtime·lock(root);
> + if(runtime·atomicload(&root->cnt) == 0) {
> +     // Count already consumed by another goroutine.
> +     runtime·unlock(root);
> +     return;
> + }
> + for(s = root->head; s; s = s->next) {
> +     if(s->addr == addr) {
> +         runtime·xadd(&root->cnt, -1);
> +         semdequeue(root, s);
> +         break;
> +     }
> + }
> + runtime·unlock(root);
> + if(s)
> +     runtime·ready(s->g);

OK, I see what you mean.

> In addition to being easier (at least for me) to understand,
> this version avoids spurious semdequeue calls.
>
> In fact, maybe the manipulations of root->cnt should be done
> only inside semqueue and semdequeue.

As for understanding, I can't make up my mind. On one hand, if the cnt increment/decrement are in semqueue/semdequeue, respectively, it makes for a discipline (it's the node count in the end). On the other hand, I tried to move all synchronization to the upper level (the semacquire/semrelease functions) so that it's easier to observe and comprehend the synchronization algorithm w/o missing or forgetting any "details"; semdequeue/semenqueue are only single-threaded list manipulation functions. That was the main difficulty in comprehending your previous implementation, with a lot of small functions all of which are involved in the synchronization algorithm.

But what do you mean by "this version avoids spurious semdequeue calls"? My version also checks the root->cnt value before semdequeue.

As for performance, I see that your version is somewhat slower on some benchmarks and somewhat faster on others, but there are no radical differences (they are roughly the same).
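To make the "single-threaded list manipulation" point concrete, semqueue and semdequeue are presumably nothing more than FIFO list operations on the root, always called with the root lock held. A sketch (the doubly-linked prev/next layout and the s->g assignment are assumptions, not the actual patch):

    // Append s to root's wait list. Caller holds the root lock.
    static void
    semqueue(SemaRoot *root, uint32 volatile *addr, Sema *s)
    {
        s->g = g;           // current goroutine
        s->addr = addr;
        s->next = nil;
        s->prev = root->tail;
        if(root->tail)
            root->tail->next = s;
        else
            root->head = s;
        root->tail = s;
    }

    // Unlink s from root's wait list. Caller holds the root lock.
    static void
    semdequeue(SemaRoot *root, Sema *s)
    {
        if(s->next)
            s->next->prev = s->prev;
        else
            root->tail = s->prev;
        if(s->prev)
            s->prev->next = s->next;
        else
            root->head = s->next;
        s->prev = nil;
        s->next = nil;
    }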
On Tue, Jun 28, 2011 at 12:40, <dvyukov@google.com> wrote:
> On 2011/06/28 15:51:39, rsc wrote:
>> > Since root->cnt is manipulated in a lock-free manner on fast-path, it
>> > should be manipulated in a lock-free manner everywhere. Mutexes can't
>> > protect it anymore.
>>
>> I am not saying to change the way you access it.
>> I am saying to change *where* you access it, to
>> make the correctness of the algorithm easier to see.
>>
>> + root = semroot(addr);
>> + runtime·xadd(addr, 1);
>> +
>> + // Easy case: no waiters?
>> + if(runtime·atomicload(&root->cnt) == 0)
>> +     return;
>> +
>> + // Harder case: search for a waiter and wake it.
>> + runtime·lock(root);
>> + if(runtime·atomicload(&root->cnt) == 0) {
>> +     // Count already consumed by another goroutine.
>> +     runtime·unlock(root);
>> +     return;
>> + }
>> + for(s = root->head; s; s = s->next) {
>> +     if(s->addr == addr) {
>> +         runtime·xadd(&root->cnt, -1);
>> +         semdequeue(root, s);
>> +         break;
>> +     }
>> + }
>> + runtime·unlock(root);
>> + if(s)
>> +     runtime·ready(s->g);
>
> OK, I see what you mean.
>
>> In addition to being easier (at least for me) to understand,
>> this version avoids spurious semdequeue calls.
>>
>> In fact, maybe the manipulations of root->cnt should be done
>> only inside semqueue and semdequeue.
>
> As for understanding, I can't make my mind. On one hand, if cnt
> increment/decrement are in semqueue/semdequeue, respectively, it makes
> for a discipline (it's node count in the end). On the other hand, I
> tried to move all synchronization to the upper level
> (semacquire/semrelease functions) so that it's easier to observe and
> comprehend the synchronization algorithm w/o missing or forgetting any
> "details" (semdequeue/semenqueue are only single-threaded list
> manipulation functions) (that was the main difficulty to comprehend your
> previous implementation with a lot of small functions all of which are
> involved in the synchronization algorithm).
>
> But what do you mean by "this version avoids spurious semdequeue calls"?
> My version also checks root->cnt value before semdequeue.
>
> As for performance, I see that your version somewhat slower on some
> benchmarks and somewhat faster on another, but there are no radical
> differences (they are roughly the same).

semqueue/semdequeue are adding a waiter. cnt is the number of waiters. I think it makes sense to have them do the manipulation, as long as there's no performance difference.

Also, cnt is probably not a great name. s/cnt/nwait/

Russ
Hello rsc@golang.org (cc: golang-dev@googlegroups.com), Please take another look.
On 2011/06/28 16:43:51, rsc wrote:
> semqueue/semdequeue are adding a waiter.
> cnt is the number of waiters.
> I think it makes sense to have them do the
> manipulation, as long as there's no performance
> difference.
>
> Also, cnt is probably not a great name.
> s/cnt/nwait/

I argue that it's more important to structure such code around synchronization rather than around single-threaded logic. It's almost trivial for an experienced developer to follow intra-thread flow, while inter-thread interactions and mutual memory access ordering can hardly be called trivial for a developer of any level.

However, I moved all root->cnt mutations into critical sections and renamed it to nwait. Please take a look at this version.
http://codereview.appspot.com/4631059/diff/24002/src/pkg/runtime/sema.goc
File src/pkg/runtime/sema.goc (right):

http://codereview.appspot.com/4631059/diff/24002/src/pkg/runtime/sema.goc#new...
src/pkg/runtime/sema.goc:120: runtime·xadd(&root->nwait, 1);
There's an important but undocumented ordering here. Comment why this is important. That is, why not write

    runtime.lock(root);
    if(cansemacquire(addr)) {
        runtime.unlock(root);
        return;
    }
    runtime.xadd(&root->nwait, 1);

?

I know the answer but the code should say.

http://codereview.appspot.com/4631059/diff/24002/src/pkg/runtime/sema.goc#new...
src/pkg/runtime/sema.goc:122: // It's OK to mutate it non-atomically here.
This is far too clever. The comment doesn't say why it's okay, and even if it did I wouldn't want to have to figure out whether it was correct. Please use xadd everywhere or nowhere. If the latter, please use ++ and --.

http://codereview.appspot.com/4631059/diff/24002/src/pkg/runtime/sema.goc#new...
src/pkg/runtime/sema.goc:159: // It's OK to mutate it non-atomically here.
Same comment.

http://codereview.appspot.com/4631059/diff/24002/src/pkg/runtime/sema.goc#new...
src/pkg/runtime/sema.goc:166: if(s) {
Drop { }.
Hello rsc@golang.org (cc: golang-dev@googlegroups.com), Please take another look.
On 2011/06/28 17:30:46, rsc wrote:
> http://codereview.appspot.com/4631059/diff/24002/src/pkg/runtime/sema.goc#new...
> src/pkg/runtime/sema.goc:120: runtime·xadd(&root->nwait, 1);
> There's an important but undocumented ordering here.
> Comment why this is important. That is, why not write
>
> runtime.lock(root);
> if(cansemacquire(addr)) {
>     runtime.unlock(root);
>     return;
> }
> runtime.xadd(&root->nwait, 1);
>
> ?
>
> I know the answer but the code should say.

I doubt that it's possible to actually explain the thing in any reasonably sized comment, so I just clearly stated that the memory access sequence is critical, pointed to the pairing sequence, and referred to the Dekker/Peterson algorithm as the core pattern.

> http://codereview.appspot.com/4631059/diff/24002/src/pkg/runtime/sema.goc#new...
> src/pkg/runtime/sema.goc:122: // It's OK to mutate it non-atomically here.
> This is far too clever. The comment doesn't say why it's okay,
> and even if it did I wouldn't want to have to figure out whether
> it was correct. Please use xadd everywhere or nowhere.
>
> If the latter, please use ++ and --.

It's a pity, but done :(

> http://codereview.appspot.com/4631059/diff/24002/src/pkg/runtime/sema.goc#new...
> src/pkg/runtime/sema.goc:159: // It's OK to mutate it non-atomically here.
> Same comment.

Done.

> http://codereview.appspot.com/4631059/diff/24002/src/pkg/runtime/sema.goc#new...
> src/pkg/runtime/sema.goc:166: if(s) {
> Drop { }.

Done.
http://codereview.appspot.com/4631059/diff/7004/src/pkg/runtime/sema.goc
File src/pkg/runtime/sema.goc (right):

http://codereview.appspot.com/4631059/diff/7004/src/pkg/runtime/sema.goc#newc...
src/pkg/runtime/sema.goc:120: // root->nwait increment followed by the load of *addr in cansemacquire()
I'd rather explain the code than refer to algorithms that people might not know. Here:

    // Add ourselves to nwait to disable "easy case" in semrelease.
    runtime.xadd(&root->nwait, 1);
    // Check cansemacquire to avoid missed wakeup.
    if(cansemacquire(addr)) {
        ...
    }
    // Any semrelease after the cansemacquire knows we're waiting
    // (we set nwait above), so go to sleep.
    semqueue....

and in semrelease:

    runtime.xadd(addr, 1);
    // Easy case: no waiters?
    // This check must happen after the xadd, to avoid a missed wakeup
    // (see loop in semacquire).
    if(runtime.atomicload(&root->nwait) == 0)
        return;
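Folding these suggested comments into the slow-path loop discussed earlier gives roughly the following shape for semacquire (a reconstruction from the review thread, not the committed diff):

    void
    runtime·semacquire(uint32 volatile *addr)
    {
        Sema s;
        SemaRoot *root;

        // Easy case.
        if(cansemacquire(addr))
            return;

        // Harder case: enqueue as a waiter and sleep, retrying until
        // the count can be taken.
        root = semroot(addr);
        for(;;) {
            runtime·lock(root);
            // Add ourselves to nwait to disable "easy case" in semrelease.
            runtime·xadd(&root->nwait, 1);
            // Check cansemacquire to avoid missed wakeup.
            if(cansemacquire(addr)) {
                runtime·xadd(&root->nwait, -1);
                runtime·unlock(root);
                return;
            }
            // Any semrelease after the cansemacquire knows we're waiting
            // (we set nwait above), so go to sleep.
            semqueue(root, addr, &s);
            g->status = Gwaiting;
            runtime·unlock(root);
            runtime·gosched();
            if(cansemacquire(addr))
                return;
        }
    }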
Hello rsc@golang.org (cc: golang-dev@googlegroups.com), Please take another look.
LGTM Thanks for iterating. Do the numbers in the CL description need to be updated?
On 2011/06/28 18:49:52, rsc wrote:
> LGTM
>
> Thanks for iterating. Do the numbers in the CL description
> need to be updated?

No problem. It depends on what workflows the description is involved in after an issue is closed. There are no radical changes in the numbers, so I would prefer not to remeasure unless you insist.
LGTM
*** Submitted as http://code.google.com/p/go/source/detail?r=0d277395434f ***

runtime: replace Semacquire/Semrelease implementation

(The committed CL description and benchmark table are the same as in the Description section at the top of this issue.)

R=rsc
CC=golang-dev
http://codereview.appspot.com/4631059

Committer: Russ Cox <rsc@golang.org>