|
|
|
Created:
13 years, 2 months ago by dho Modified:
12 years, 4 months ago CC:
golang-codereviews Visibility:
Public. |
Descriptionruntime: use lock-free ring for work scheduling
The algorithm used is based on the implementation in Concurrency Kit
(http://concurrencykit.org). The ring has single producer, multiple
consumer access semantics; is wait-free on insert and lock-free on
retrieval. The size must be a power of 2. This changeset removes the
work stealing code; it may be possible to add that back, but it is a
bit hairy.
When a P's ring is full, the G "overflows" into the global run queue.
Patch Set 1 #Patch Set 2 : diff -r f545866390ab https://code.google.com/p/go/ #Patch Set 3 : diff -r f545866390ab https://code.google.com/p/go/ #
Total comments: 22
MessagesTotal messages: 23
Hello dvyukov@google.com, rsc@golang.org, minux.ma@gmail.com, dave@cheney.net (cc: golang-dev@googlegroups.com), I'd like you to review this change to https://code.google.com/p/go/
Sign in to reply to this message.
I haven't tested this on ARM yet; I'm not sure if it compiles. This
algorithm is sensitive to store / load ordering, which means that if
this causes ARM to fail, we're going to need to implement DSB/DMB
instructions in 5a. I'm not sure if the LL/SC loops are enough, but I
figured that work should go in a different CL. (Also, I have no idea
how to do it having never really messed with the compilers.)
Interesting benchmarks:
BenchmarkFields 30561068 24279386 -20.55%
BenchmarkFieldsFunc 30054191 24357377 -18.96%
BenchmarkReaderCopyOptimal 1509 1534 +1.66%
BenchmarkReaderCopyUnoptimal 7562 5602 -25.92%
BenchmarkReaderCopyNoWriteTo 10675 11640 +9.04%
BenchmarkWriterCopyOptimal 9261 4473 -51.70%
BenchmarkWriterCopyUnoptimal 7318 2757 -62.33%
BenchmarkWriterCopyNoReadFrom 10609 12184 +14.85%
BenchmarkReadString 5494 6213 +13.09%
BenchmarkIndexByte32 6 7 +13.89%
BenchmarkIndexByte4K 238 253 +6.30%
BenchmarkIndexByte4M 261574 261425 -0.06%
BenchmarkIndexByte64M 4331658 5021060 +15.92%
BenchmarkIndexBytePortable32 42 44 +5.19%
BenchmarkIndexBytePortable4K 2345 2448 +4.39%
BenchmarkIndexBytePortable4M 2416851 2533766 +4.84%
BenchmarkIndexBytePortable64M 38112846 38451022 +0.89%
BenchmarkEqual32 39 42 +7.91%
BenchmarkEqual4K 2331 2418 +3.73%
BenchmarkEqual4M 2586779 2677882 +3.52%
BenchmarkEqual64M 40851645 41153023 +0.74%
BenchmarkEqualPort32 46 49 +6.29%
BenchmarkEqualPort4K 3496 3624 +3.66%
BenchmarkEqualPortable4M 3776748 3765554 -0.30%
BenchmarkEqualPortable64M 58643898 60004963 +2.32%
BenchmarkIndex32 664 685 +3.16%
BenchmarkIndex4K 101669 105488 +3.76%
BenchmarkIndex4M 104609738 111066889 +6.17%
BenchmarkIndex64M 1676273044 1750548375 +4.43%
BenchmarkIndexEasy32 58 60 +3.09%
BenchmarkIndexEasy4K 288 292 +1.39%
BenchmarkIndexEasy4M 254642 282289 +10.86%
BenchmarkIndexEasy64M 4385937 4525697 +3.19%
BenchmarkCount32 673 667 -0.89%
BenchmarkCount4K 103380 103880 +0.48%
BenchmarkCount4M 106307010 104518944 -1.68%
BenchmarkCount64M 1702085125 1677917865 -1.42%
BenchmarkCountEasy32 59 59 +1.01%
BenchmarkCountEasy4K 289 292 +1.04%
BenchmarkCountEasy4M 262517 267442 +1.88%
BenchmarkCountEasy64M 4344423 4299636 -1.03%
BenchmarkTrimSpace 59 55 -6.61%
BenchmarkSprintfEmpty 98 99 +1.63%
BenchmarkSprintfString 284 279 -1.76%
BenchmarkSprintfInt 226 217 -3.98%
BenchmarkSprintfIntInt 345 335 -2.90%
BenchmarkSprintfPrefixedInt 345 356 +3.19%
BenchmarkSprintfFloat 470 468 -0.43%
BenchmarkManyArgs 1338 1301 -2.77%
BenchmarkScanInts 964898 974407 +0.99%
BenchmarkScanRecursiveInt 1180694 1250752 +5.93%
BenchmarkTCPOneShot 121556 118597 -2.43%
BenchmarkTCPOneShotTimeout 636851 601332 -5.58%
BenchmarkTCPPersistent 50440 48279 -4.28%
BenchmarkTCPPersistentTimeout 50095 49055 -2.08%
BenchmarkSelectUncontended 243 234 -3.70%
BenchmarkSelectContended 241 236 -2.07%
BenchmarkSelectNonblock 105 104 -0.95%
BenchmarkChanUncontended 64 65 +2.03%
BenchmarkChanContended 64 64 -0.47%
BenchmarkChanSync 137 120 -12.41%
BenchmarkChanProdCons0 140 121 -13.57%
BenchmarkChanProdCons10 87 83 -5.24%
BenchmarkChanProdCons100 70 66 -5.55%
BenchmarkChanProdConsWork0 633 584 -7.74%
BenchmarkChanProdConsWork10 544 549 +0.92%
BenchmarkChanProdConsWork100 526 533 +1.33%
BenchmarkChanCreation 164 172 +4.88%
BenchmarkChanSem 61 65 +6.01%
BenchmarkCallClosure 2 2 +7.24%
BenchmarkCallClosure1 3 3 +0.32%
BenchmarkCallClosure2 42 42 +0.70%
BenchmarkCallClosure3 43 45 +3.90%
BenchmarkCallClosure4 43 43 +0.23%
BenchmarkSyscall 28 27 -2.84%
BenchmarkSyscallWork 192 187 -2.60%
BenchmarkSyscallExcess 28 27 -2.47%
BenchmarkSyscallExcessWork 189 193 +2.12%
BenchmarkCreateGoroutines 111 87 -21.08%
BenchmarkCreateGoroutinesParallel 114 90 -20.79%
BenchmarkUncontendedSemaphore 25 26 +3.59%
BenchmarkContendedSemaphore 80 26 -66.50%
BenchmarkMutexUncontended 17 18 +3.43%
BenchmarkMutex 17 18 +0.56%
BenchmarkMutexSlack 17 17 +0.56%
BenchmarkMutexWork 176 178 +1.14%
BenchmarkMutexWorkSlack 178 176 -1.12%
BenchmarkOnce 5 5 -2.09%
BenchmarkSemaUncontended 25 26 +3.17%
BenchmarkSemaSyntNonblock 25 26 +0.77%
BenchmarkSemaSyntBlock 25 26 +1.54%
BenchmarkSemaWorkNonblock 191 192 +0.52%
BenchmarkSemaWorkBlock 191 192 +0.52%
BenchmarkRWMutexUncontended 66 69 +4.35%
BenchmarkRWMutexWrite100 22 22 +1.33%
BenchmarkRWMutexWrite10 24 24 +0.00%
BenchmarkRWMutexWorkWrite100 189 186 -1.59%
BenchmarkRWMutexWorkWrite10 173 177 +2.31%
BenchmarkWaitGroupUncontended 23 24 +1.26%
BenchmarkWaitGroupAddDone 21 21 +1.42%
BenchmarkWaitGroupAddDoneWork 184 184 +0.00%
BenchmarkWaitGroupWait 5 5 +0.92%
BenchmarkWaitGroupWaitWork 170 173 +1.76%
BenchmarkAfterFunc 428 362 -15.42%
BenchmarkAfter 829 607 -26.78%
BenchmarkStop 358 280 -21.79%
BenchmarkTicker 17123 14846 -13.30%
BenchmarkNow 14 11 -19.31%
BenchmarkNowUnixNano 12 11 -5.56%
BenchmarkFormat 904 617 -31.75%
BenchmarkFormatNow 685 584 -14.74%
BenchmarkParse 486 426 -12.35%
BenchmarkHour 19 17 -10.88%
BenchmarkSecond 16 15 -2.48%
BenchmarkYear 29 30 +2.04%
BenchmarkDay 37 40 +6.08%
Devons-MacBook-Pro:shootout dho$ cat thread-ring-old.txt
3.02 real 3.01 user 0.00 sys
10.21 real 11.51 user 4.07 sys
6.73 real 7.91 user 3.08 sys
6.11 real 7.27 user 2.74 sys
Devons-MacBook-Pro:shootout dho$ cat thread-ring-new.txt
2.67 real 2.67 user 0.00 sys
5.87 real 6.30 user 2.54 sys
4.66 real 4.88 user 2.17 sys
4.42 real 4.81 user 1.98 sys
Sign in to reply to this message.
2013/3/4 Devon H. O'Dell <devon.odell@gmail.com>: > BenchmarkReaderCopyOptimal 1509 1534 +1.66% > BenchmarkReaderCopyUnoptimal 7562 5602 -25.92% > BenchmarkReaderCopyNoWriteTo 10675 11640 +9.04% > BenchmarkWriterCopyOptimal 9261 4473 -51.70% > BenchmarkWriterCopyUnoptimal 7318 2757 -62.33% > BenchmarkWriterCopyNoReadFrom 10609 12184 +14.85% Why are there differences here? Is it because of a stack split? Rémy.
Sign in to reply to this message.
2013/3/4 Rémy Oudompheng <remyoudompheng@gmail.com>: > 2013/3/4 Devon H. O'Dell <devon.odell@gmail.com>: >> BenchmarkReaderCopyOptimal 1509 1534 +1.66% >> BenchmarkReaderCopyUnoptimal 7562 5602 -25.92% >> BenchmarkReaderCopyNoWriteTo 10675 11640 +9.04% >> BenchmarkWriterCopyOptimal 9261 4473 -51.70% >> BenchmarkWriterCopyUnoptimal 7318 2757 -62.33% >> BenchmarkWriterCopyNoReadFrom 10609 12184 +14.85% > > Why are there differences here? Is it because of a stack split? > GC accounts for 75% of the profile, and it runs thousands of times during the benchmark, I'm not sure anything can be said reliably. Rémy.
Sign in to reply to this message.
2013/3/4 Rémy Oudompheng <remyoudompheng@gmail.com>: > 2013/3/4 Rémy Oudompheng <remyoudompheng@gmail.com>: >> 2013/3/4 Devon H. O'Dell <devon.odell@gmail.com>: >>> BenchmarkReaderCopyOptimal 1509 1534 +1.66% >>> BenchmarkReaderCopyUnoptimal 7562 5602 -25.92% >>> BenchmarkReaderCopyNoWriteTo 10675 11640 +9.04% >>> BenchmarkWriterCopyOptimal 9261 4473 -51.70% >>> BenchmarkWriterCopyUnoptimal 7318 2757 -62.33% >>> BenchmarkWriterCopyNoReadFrom 10609 12184 +14.85% >> >> Why are there differences here? Is it because of a stack split? >> > > GC accounts for 75% of the profile, and it runs thousands of times > during the benchmark, I'm not sure anything can be said reliably. Ok, that's fair. I was curious about this myself. (But I'm still ramping up on figuring out how to profile Go programs; apparently it's unreliable on mac, or so says #go-nuts.) I'd hope this isn't also true of the select/chan/goroutine benchmarks. --dho > Rémy.
Sign in to reply to this message.
2013/3/4 Devon H. O'Dell <devon.odell@gmail.com>: > 2013/3/4 Rémy Oudompheng <remyoudompheng@gmail.com>: >> 2013/3/4 Rémy Oudompheng <remyoudompheng@gmail.com>: >>> 2013/3/4 Devon H. O'Dell <devon.odell@gmail.com>: >>>> BenchmarkReaderCopyOptimal 1509 1534 +1.66% >>>> BenchmarkReaderCopyUnoptimal 7562 5602 -25.92% >>>> BenchmarkReaderCopyNoWriteTo 10675 11640 +9.04% >>>> BenchmarkWriterCopyOptimal 9261 4473 -51.70% >>>> BenchmarkWriterCopyUnoptimal 7318 2757 -62.33% >>>> BenchmarkWriterCopyNoReadFrom 10609 12184 +14.85% >>> >>> Why are there differences here? Is it because of a stack split? >>> >> >> GC accounts for 75% of the profile, and it runs thousands of times >> during the benchmark, I'm not sure anything can be said reliably. > > Ok, that's fair. I was curious about this myself. (But I'm still > ramping up on figuring out how to profile Go programs; apparently it's > unreliable on mac, or so says #go-nuts.) > > I'd hope this isn't also true of the select/chan/goroutine benchmarks. The improvement seems real in the threadring benchmark. Before: Total: 657 samples 184 28.0% 28.0% 184 28.0% runtime.xchg 87 13.2% 41.2% 87 13.2% runtime.futex 73 11.1% 52.4% 306 46.6% runtime.chansend 42 6.4% 58.8% 472 71.8% testa.f 37 5.6% 64.4% 37 5.6% dequeue 35 5.3% 69.7% 118 18.0% runtime.chanrecv 35 5.3% 75.0% 103 15.7% runtime.unlock 23 3.5% 78.5% 124 18.9% schedule 22 3.3% 81.9% 149 22.7% runtime.lock 19 2.9% 84.8% 43 6.5% runqget After: Total: 597 samples 116 19.4% 19.4% 116 19.4% runtime.xchg 84 14.1% 33.5% 84 14.1% runtime.futex 56 9.4% 42.9% 427 71.5% testa.f 54 9.0% 51.9% 263 44.1% runtime.chansend 36 6.0% 58.0% 95 15.9% runtime.chanrecv 30 5.0% 63.0% 30 5.0% dequeue 29 4.9% 67.8% 104 17.4% runtime.lock 27 4.5% 72.4% 27 4.5% runtime.atomicload 26 4.4% 76.7% 113 18.9% schedule 21 3.5% 80.2% 43 7.2% runqget package main import ( "sync" "testing" ) const Nthread = 503 func f(i int, in <-chan int, out chan<- int, wg *sync.WaitGroup) { for { n := <-in if n == 0 { wg.Done() out <- 0 } out <- n - 1 } } func BenchmarkThreadRing(b *testing.B) { one := make(chan int) // will be input to thread 1 var in, out chan int = nil, one wg := new(sync.WaitGroup) wg.Add(Nthread) for i := 1; i <= Nthread-1; i++ { in, out = out, make(chan int) go f(i, in, out, wg) } go f(Nthread, out, one, wg) one <- b.N wg.Wait() }
Sign in to reply to this message.
Devon, Thanks for working on this. Have you coordinated with Dmitriy about this? I don't know what he plans for future work and want to make sure we don't step on him. Separately, I would prefer to wait on this for at least the rest of the week. There have been vague reports of crashes and such using the new scheduler, and I'd like to let things stop changing for a few days to see if we can stabilize before moving on. Thanks. Russ
Sign in to reply to this message.
2013/3/4 Russ Cox <rsc@golang.org>: > Devon, > > Thanks for working on this. Have you coordinated with Dmitriy about this? I > don't know what he plans for future work and want to make sure we don't step > on him. Separately, I would prefer to wait on this for at least the rest of > the week. There have been vague reports of crashes and such using the new > scheduler, and I'd like to let things stop changing for a few days to see if > we can stabilize before moving on. We had some discussion in a thread on the ML, he suggested uploading the CL. Happy to wait on submitting this; I think it probably doesn't work on ARM anyway (but I don't have my ARM boards set up to test quite yet). --dho > Thanks. > Russ
Sign in to reply to this message.
https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c File src/pkg/runtime/proc.c (right): https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... src/pkg/runtime/proc.c:279: if (runqput(m->p, gp) == false) { Perhaps combine it in a separate function, because it's copied several times. https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... src/pkg/runtime/proc.c:281: globrunqput(gp); it important to transfer a batch of G's to globrunq (half?) consider that a goroutine spawns lots of goroutines in a loop, you do not want to hit the slowpath every time https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... src/pkg/runtime/proc.c:355: runtime·fastatomicstore((uint32*)&runtime·gcwaiting, 1); This needs full memory barrier between store to gcwaiting and load of p->status, otherwise it can deadlock with entersyscall (both won't notice the other). https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... src/pkg/runtime/proc.c:1014: gp = runqget(p); "steal half" is required here. consider that a goroutine spawns lots of other goroutines in a loop, then other worker threads starts stealing 1 goroutine from slowpath each time, this will lead to unnecessary contention and slowdown. https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... src/pkg/runtime/proc.c:1214: runtime·fastatomicstore(&m->p->status, Psyscall); full memory barrier is required, otherwise it will deadlock with stoptheworld. https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... src/pkg/runtime/proc.c:1760: if(old < 0 || old > MaxGomaxprocs || new <= 0 || new > MaxGomaxprocs) such edits better go in a separate patch https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... src/pkg/runtime/proc.c:2042: if(runqput(p, gp1) == false) { steal at most runqsize (128), then runqput() will always succeed https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... src/pkg/runtime/proc.c:2087: if((delta & mask) == (start & mask)) Is it intended that you do not full the last slot? If so, it can be: if((delta^start)&mask)) the compiler is not particularly good at code optimization. https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... src/pkg/runtime/proc.c:2093: runtime·fastatomicstore(&p->r_tail, delta); atomicstorerelease should contain the necessary fence https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... src/pkg/runtime/proc.c:2094: please drop at least some of the empty lines https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... src/pkg/runtime/proc.c:2107: start = runtime·atomicload(&p->r_head); please use consistent maning: head/head instead of head/start (tail/end) https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... src/pkg/runtime/proc.c:2109: /* TODO: load fence */ atomicload contains the fence https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... src/pkg/runtime/proc.c:2114: /* TODO: load fence */ why is it needed? https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... src/pkg/runtime/proc.c:2116: gp = p->ring[start & mask]; I think you can implement steal in a similar way, just copying a batch of elements here and incrementing head by N instead on 1. https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... src/pkg/runtime/proc.c:2116: gp = p->ring[start & mask]; I think you can implement steal() in a similar way, just copying a batch of elements here and incrementing head by N instead of 1. https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... src/pkg/runtime/proc.c:2118: /* TODO: full memory fence */ why is it needed? https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... src/pkg/runtime/proc.c:2133: p.r_size = 8192; please test overflow as well https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/runtime.h File src/pkg/runtime/runtime.h (right): https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/runtime.h#ne... src/pkg/runtime/runtime.h:339: G** ring; I think size 128 should be enough. Embed it into P directly: enum { RunqSize = 128 }; G* ring[RunqSize]; Then you don't need mask. https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/runtime.h#ne... src/pkg/runtime/runtime.h:340: uint32 r_head; Rename, the codebase does not use underscores. Please refrain from unnecessary renaming, uint32 runqhead is fine. This will eliminate lots of edits in proc.c https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/runtime.h#ne... src/pkg/runtime/runtime.h:690: bool runtime·casvalue(uint32*, uint32, uint32, uint32*); Change cas(uint32*, uint32, uint32) to cas(uint32*, uint32*, uint32) similar to cas64. Then you don't need the separate function. This should go in a separate patch. https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/runtime.h#ne... src/pkg/runtime/runtime.h:697: void runtime·fastatomicstore(uint32 volatile*, uint32); rename to atomicstorerelease https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/runtime.h#ne... src/pkg/runtime/runtime.h:738: void runtime·setmg(M*, G*); Such changes better go in a separate patch.
Sign in to reply to this message.
Hi, Haven't had a lot of time to fix this recently but I wanted to address some of the comments. 2013/3/8 <dvyukov@google.com>: > > https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c > File src/pkg/runtime/proc.c (right): > > https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... > src/pkg/runtime/proc.c:279: if (runqput(m->p, gp) == false) { > Perhaps combine it in a separate function, because it's copied several > times. That will get inlined, yes? With some of the other comments you made, the number of times this is copied will probably decrease (as a result of implementing the work stealing). > https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... > src/pkg/runtime/proc.c:281: globrunqput(gp); > it important to transfer a batch of G's to globrunq (half?) > consider that a goroutine spawns lots of goroutines in a loop, you do > not want to hit the slowpath every time That's fair -- didn't want to make this a bigger change than I needed to. I'll look at doing this. > https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... > src/pkg/runtime/proc.c:355: > runtime·fastatomicstore((uint32*)&runtime·gcwaiting, 1); > This needs full memory barrier between store to gcwaiting and load of > p->status, otherwise it can deadlock with entersyscall (both won't > notice the other). Good point. I'm going to remove the other places where I changed this that aren't specifically related to the run queue. I didn't look at independed load / store dependencies, and that's clearly showing. > https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... > src/pkg/runtime/proc.c:1014: gp = runqget(p); > "steal half" is required here. > consider that a goroutine spawns lots of other goroutines in a loop, > then other worker threads starts stealing 1 goroutine from slowpath each > time, this will lead to unnecessary contention and slowdown. > > https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... > src/pkg/runtime/proc.c:1214: runtime·fastatomicstore(&m->p->status, > Psyscall); > full memory barrier is required, otherwise it will deadlock with > stoptheworld. > > https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... > src/pkg/runtime/proc.c:1760: if(old < 0 || old > MaxGomaxprocs || new <= > 0 || new > MaxGomaxprocs) > such edits better go in a separate patch > > https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... > src/pkg/runtime/proc.c:2042: if(runqput(p, gp1) == false) { > steal at most runqsize (128), then runqput() will always succeed > > https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... > src/pkg/runtime/proc.c:2087: if((delta & mask) == (start & mask)) > Is it intended that you do not full the last slot? > If so, it can be: > if((delta^start)&mask)) > the compiler is not particularly good at code optimization. This check won't work. We can't enqueue a new value into the ring if that would overflow the ring. So, if delta == start, we don't insert. If we have a ring of size 4, the tail is at 0x3 and the head is at 0x0, this will end up being if(0&mask), which is of course always false. > https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... > src/pkg/runtime/proc.c:2093: runtime·fastatomicstore(&p->r_tail, delta); > atomicstorerelease should contain the necessary fence The store fence is required to serialize the update of the tail with respect to putting the entry in the ring, so it's not really a question of whether this interface provides the fence. It's to make sure that all other consumers do not read that the tail is updated without being guaranteed to be able to read the entry at the tail. > https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... > src/pkg/runtime/proc.c:2094: > please drop at least some of the empty lines > > https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... > src/pkg/runtime/proc.c:2107: start = runtime·atomicload(&p->r_head); > please use consistent maning: head/head instead of head/start (tail/end) I will change these for consistency / to reduce diff size as suggested later. > https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... > src/pkg/runtime/proc.c:2109: /* TODO: load fence */ > atomicload contains the fence This is to guarantee consistency of our view of the snapshot of the head. > https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... > src/pkg/runtime/proc.c:2114: /* TODO: load fence */ > why is it needed? Same reason, guarantee we're looking at the proper state of the head when we do the assignment from the ring. > https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... > src/pkg/runtime/proc.c:2116: gp = p->ring[start & mask]; > I think you can implement steal in a similar way, just copying a batch > of elements here and incrementing head by N instead on 1. It would be possible to do this with a memcpy indeed, but you have to do the copy every iteration of the loop because head may have changed. Another way I was thinking of doing it was just to "steal" the blocks and do the copies once that was successful -- but the problem is that another producer might invalidate the consistency of the ring by the time the copy started. I'll do the memcpy approach for now. > https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... > src/pkg/runtime/proc.c:2116: gp = p->ring[start & mask]; > I think you can implement steal() in a similar way, just copying a batch > of elements here and incrementing head by N instead of 1. > > https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... > src/pkg/runtime/proc.c:2118: /* TODO: full memory fence */ > why is it needed? To serialize loading with respect to updating the head poitner. > https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... > src/pkg/runtime/proc.c:2133: p.r_size = 8192; > please test overflow as well Good call. Will do. > https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/runtime.h > File src/pkg/runtime/runtime.h (right): > > https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/runtime.h#ne... > src/pkg/runtime/runtime.h:339: G** ring; > I think size 128 should be enough. > Embed it into P directly: > enum { RunqSize = 128 }; > G* ring[RunqSize]; > Then you don't need mask. Also good call. > https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/runtime.h#ne... > src/pkg/runtime/runtime.h:340: uint32 r_head; > Rename, the codebase does not use underscores. > Please refrain from unnecessary renaming, uint32 runqhead is fine. This > will eliminate lots of edits in proc.c > > https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/runtime.h#ne... > src/pkg/runtime/runtime.h:690: bool runtime·casvalue(uint32*, uint32, > uint32, uint32*); > Change cas(uint32*, uint32, uint32) to cas(uint32*, uint32*, uint32) > similar to cas64. Then you don't need the separate function. > This should go in a separate patch. Then we pay the cost for assigning to the old value everywhere, even when we don't need it. I think this just adds overhead. > https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/runtime.h#ne... > src/pkg/runtime/runtime.h:697: void runtime·fastatomicstore(uint32 > volatile*, uint32); > rename to atomicstorerelease > > https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/runtime.h#ne... > src/pkg/runtime/runtime.h:738: void runtime·setmg(M*, G*); > Such changes better go in a separate patch. > > https://codereview.appspot.com/7455049/
Sign in to reply to this message.
The C compiler does not inline.
Sign in to reply to this message.
2013/3/12 Russ Cox <rsc@golang.org>: > The C compiler does not inline. Maybe we should revisit this after Go 1.1. I need help implementing support for the DMB instruction on ARM. I don't understand the assembler and linker code enough to add the instruction and this won't work on ARM without that. Additionally, we'd need an interface for generating memory barriers. If the C compiler doesn't inline, that's going to be a ton of overhead just to generate sfence/lfence/dmb. That in mind, I'm happy to work hard to get this going for the 1.1 release, but I think it's maybe going to take too much time from other things that really need to go in. --dho
Sign in to reply to this message.
On Tue, Mar 12, 2013 at 11:50 AM, Devon H. O'Dell <devon.odell@gmail.com>wrote: > 2013/3/12 Russ Cox <rsc@golang.org>: > > The C compiler does not inline. > > Maybe we should revisit this after Go 1.1. > > I need help implementing support for the DMB instruction on ARM. I > don't understand the assembler and linker code enough to add the > instruction and this won't work on ARM without that. Additionally, > we'd need an interface for generating memory barriers. If the C > compiler doesn't inline, that's going to be a ton of overhead just to > generate sfence/lfence/dmb. > I am happy to add intrinsics if you demonstrate that the performance win is enough. We did this already with prefetch. I very much doubt the C compiler will ever inline. For the tiny amount of C we have, it's just not worth the effort. It certainly can't inline calls to assembly functions that it can't see. Russ
Sign in to reply to this message.
2013/3/12 Russ Cox <rsc@golang.org>: > On Tue, Mar 12, 2013 at 11:50 AM, Devon H. O'Dell <devon.odell@gmail.com> > wrote: >> >> 2013/3/12 Russ Cox <rsc@golang.org>: >> > The C compiler does not inline. >> >> Maybe we should revisit this after Go 1.1. >> >> >> I need help implementing support for the DMB instruction on ARM. I >> don't understand the assembler and linker code enough to add the >> instruction and this won't work on ARM without that. Additionally, >> we'd need an interface for generating memory barriers. If the C >> compiler doesn't inline, that's going to be a ton of overhead just to >> generate sfence/lfence/dmb. > > > I am happy to add intrinsics if you demonstrate that the performance win is > enough. We did this already with prefetch. What benchmarks do you want me to run? I'll only be able to run them reliably on x86 and AMD64 until there's support for DMB. This instruction is required for consistency using the lock-free data structure to guarantee that stores / loads are observed by other threads; it's not really for performance. I guess in the meantime I could make an ASM stub using .BYTE? > I very much doubt the C compiler will ever inline. For the tiny amount of C > we have, it's just not worth the effort. It certainly can't inline calls to > assembly functions that it can't see. That's fair. > Russ --dho
Sign in to reply to this message.
On Tue, Mar 12, 2013 at 7:38 PM, Devon H. O'Dell <devon.odell@gmail.com> wrote: >> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... >> src/pkg/runtime/proc.c:279: if (runqput(m->p, gp) == false) { >> Perhaps combine it in a separate function, because it's copied several >> times. > > That will get inlined, yes? No, it won't. >> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... >> src/pkg/runtime/proc.c:2087: if((delta & mask) == (start & mask)) >> Is it intended that you do not full the last slot? >> If so, it can be: >> if((delta^start)&mask)) >> the compiler is not particularly good at code optimization. > > This check won't work. We can't enqueue a new value into the ring if > that would overflow the ring. So, if delta == start, we don't insert. > If we have a ring of size 4, the tail is at 0x3 and the head is at > 0x0, this will end up being if(0&mask), which is of course always > false. I meant ((delta^start)&mask) == 0 >> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... >> src/pkg/runtime/proc.c:2093: runtime·fastatomicstore(&p->r_tail, delta); >> atomicstorerelease should contain the necessary fence > > The store fence is required to serialize the update of the tail with > respect to putting the entry in the ring, so it's not really a > question of whether this interface provides the fence. It's to make > sure that all other consumers do not read that the tail is updated > without being guaranteed to be able to read the entry at the tail. atomicstorerelease should contain the necessary fence, the TODO is not necessary >> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... >> src/pkg/runtime/proc.c:2109: /* TODO: load fence */ >> atomicload contains the fence > > This is to guarantee consistency of our view of the snapshot of the head. Yes, I understand, but please remove the TODO, because it's not actually the TODO, the fence is already there. >> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... >> src/pkg/runtime/proc.c:2114: /* TODO: load fence */ >> why is it needed? > > Same reason, guarantee we're looking at the proper state of the head > when we do the assignment from the ring. The necessary fence is in atomicload/cas. >> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... >> src/pkg/runtime/proc.c:2116: gp = p->ring[start & mask]; >> I think you can implement steal in a similar way, just copying a batch >> of elements here and incrementing head by N instead on 1. > > It would be possible to do this with a memcpy indeed, but you have to > do the copy every iteration of the loop I think it is fine > because head may have changed. > Another way I was thinking of doing it was just to "steal" the blocks > and do the copies once that was successful -- but the problem is that > another producer might invalidate the consistency of the ring by the > time the copy started. I'll do the memcpy approach for now. > >> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... >> src/pkg/runtime/proc.c:2116: gp = p->ring[start & mask]; >> I think you can implement steal() in a similar way, just copying a batch >> of elements here and incrementing head by N instead of 1. >> >> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco... >> src/pkg/runtime/proc.c:2118: /* TODO: full memory fence */ >> why is it needed? > > To serialize loading with respect to updating the head poitner. cas contains the necessary fence, these are *not* relaxed atomic >> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/runtime.h#ne... >> src/pkg/runtime/runtime.h:690: bool runtime·casvalue(uint32*, uint32, >> uint32, uint32*); >> Change cas(uint32*, uint32, uint32) to cas(uint32*, uint32*, uint32) >> similar to cas64. Then you don't need the separate function. >> This should go in a separate patch. > > Then we pay the cost for assigning to the old value everywhere, even > when we don't need it. I think this just adds overhead. Local assignment is not a great cost. The cc compiler makes unnecessary assignments everywhere. And it's needed in mutexes and will be needed in chans. Most other places are not performance critical.
Sign in to reply to this message.
On Tue, Mar 12, 2013 at 8:07 PM, Devon H. O'Dell <devon.odell@gmail.com> wrote: > 2013/3/12 Russ Cox <rsc@golang.org>: >> On Tue, Mar 12, 2013 at 11:50 AM, Devon H. O'Dell <devon.odell@gmail.com> >> wrote: >>> >>> 2013/3/12 Russ Cox <rsc@golang.org>: >>> > The C compiler does not inline. >>> >>> Maybe we should revisit this after Go 1.1. >>> >>> >>> I need help implementing support for the DMB instruction on ARM. I >>> don't understand the assembler and linker code enough to add the >>> instruction and this won't work on ARM without that. Additionally, >>> we'd need an interface for generating memory barriers. If the C >>> compiler doesn't inline, that's going to be a ton of overhead just to >>> generate sfence/lfence/dmb. >> >> >> I am happy to add intrinsics if you demonstrate that the performance win is >> enough. We did this already with prefetch. > > What benchmarks do you want me to run? I'll only be able to run them > reliably on x86 and AMD64 until there's support for DMB. This > instruction is required for consistency using the lock-free data > structure to guarantee that stores / loads are observed by other > threads; it's not really for performance. I guess in the meantime I > could make an ASM stub using .BYTE? DMB should not be a separate function, it should be combined with atomicload/store/cas/etc. So there will be no additional cost for these. For ARM you can just use existing atomic functions, they meant to contain all the necessary fences. You can think of them as C atomic_foo(memory_order_seq_cst).
Sign in to reply to this message.
On Tue, Mar 12, 2013 at 12:07 PM, Devon H. O'Dell <devon.odell@gmail.com>wrote: > 2013/3/12 Russ Cox <rsc@golang.org>: > > On Tue, Mar 12, 2013 at 11:50 AM, Devon H. O'Dell <devon.odell@gmail.com > > > > wrote: > >> > >> 2013/3/12 Russ Cox <rsc@golang.org>: > >> > The C compiler does not inline. > >> > >> Maybe we should revisit this after Go 1.1. > >> > >> > >> I need help implementing support for the DMB instruction on ARM. I > >> don't understand the assembler and linker code enough to add the > >> instruction and this won't work on ARM without that. Additionally, > >> we'd need an interface for generating memory barriers. If the C > >> compiler doesn't inline, that's going to be a ton of overhead just to > >> generate sfence/lfence/dmb. > > > > > > I am happy to add intrinsics if you demonstrate that the performance win > is > > enough. We did this already with prefetch. > > What benchmarks do you want me to run? I'll only be able to run them > reliably on x86 and AMD64 until there's support for DMB. This > instruction is required for consistency using the lock-free data > structure to guarantee that stores / loads are observed by other > threads; it's not really for performance. I guess in the meantime I > could make an ASM stub using .BYTE? > I'd like to see a version that calls into assembly (you want WORD $x, not BYTE, and no dot), compared with a version that doesn't. In the case of prefetch, the difference was significant enough to add to the compiler. To compare with the version that doesn't call into assembly, you can patch in CL 7756043, with two caveats: 1. It's not done well, and it would need redoing if we keep it. 2. The linker emits DSB as 0xFFFFFFFF, which I assume is not the actual encoding. You'll need to tweak that before using. Russ
Sign in to reply to this message.
On Tue, Mar 12, 2013 at 8:26 PM, Russ Cox <rsc@golang.org> wrote: > On Tue, Mar 12, 2013 at 12:07 PM, Devon H. O'Dell <devon.odell@gmail.com> > wrote: >> >> 2013/3/12 Russ Cox <rsc@golang.org>: >> > On Tue, Mar 12, 2013 at 11:50 AM, Devon H. O'Dell >> > <devon.odell@gmail.com> >> > wrote: >> >> >> >> 2013/3/12 Russ Cox <rsc@golang.org>: >> >> > The C compiler does not inline. >> >> >> >> Maybe we should revisit this after Go 1.1. >> >> >> >> >> >> I need help implementing support for the DMB instruction on ARM. I >> >> don't understand the assembler and linker code enough to add the >> >> instruction and this won't work on ARM without that. Additionally, >> >> we'd need an interface for generating memory barriers. If the C >> >> compiler doesn't inline, that's going to be a ton of overhead just to >> >> generate sfence/lfence/dmb. >> > >> > >> > I am happy to add intrinsics if you demonstrate that the performance win >> > is >> > enough. We did this already with prefetch. >> >> What benchmarks do you want me to run? I'll only be able to run them >> reliably on x86 and AMD64 until there's support for DMB. This >> instruction is required for consistency using the lock-free data >> structure to guarantee that stores / loads are observed by other >> threads; it's not really for performance. I guess in the meantime I >> could make an ASM stub using .BYTE? > > > I'd like to see a version that calls into assembly (you want WORD $x, not > BYTE, and no dot), compared with a version that doesn't. In the case of > prefetch, the difference was significant enough to add to the compiler. > > To compare with the version that doesn't call into assembly, you can patch > in CL 7756043, with two caveats: > 1. It's not done well, and it would need redoing if we keep it. > 2. The linker emits DSB as 0xFFFFFFFF, which I assume is not the actual > encoding. You'll need to tweak that before using. We do not want stand-alone memory fences, the atomic operations we have include the memory fences. We can make them finer-grained if necessary (e.g. load-acquire, store-release, cas-release, etc). While the atomic operations are in assembly, intrinsic for DMB makes no sense. If anything, we need ATOMIC_LOAD/ATOMIC_STORE/ETC intrinsics.
Sign in to reply to this message.
Late to the party, but keen to test this change list. I am trying to use the DSB primitive from inside a linked assembly routine. Is DSB expected to work from assembly? I'm currently getting the following errors: test_arm.s:19 syntax error, last name: DSB On Tuesday, 12 March 2013 16:26:07 UTC, rsc wrote: > > On Tue, Mar 12, 2013 at 12:07 PM, Devon H. O'Dell <devon...@gmail.com<javascript:> > > wrote: > >> 2013/3/12 Russ Cox <r...@golang.org <javascript:>>: >> > On Tue, Mar 12, 2013 at 11:50 AM, Devon H. O'Dell <devon...@gmail.com<javascript:> >> > >> > wrote: >> >> >> >> 2013/3/12 Russ Cox <r...@golang.org <javascript:>>: >> >> > The C compiler does not inline. >> >> >> >> Maybe we should revisit this after Go 1.1. >> >> >> >> >> >> I need help implementing support for the DMB instruction on ARM. I >> >> don't understand the assembler and linker code enough to add the >> >> instruction and this won't work on ARM without that. Additionally, >> >> we'd need an interface for generating memory barriers. If the C >> >> compiler doesn't inline, that's going to be a ton of overhead just to >> >> generate sfence/lfence/dmb. >> > >> > >> > I am happy to add intrinsics if you demonstrate that the performance >> win is >> > enough. We did this already with prefetch. >> >> What benchmarks do you want me to run? I'll only be able to run them >> reliably on x86 and AMD64 until there's support for DMB. This >> instruction is required for consistency using the lock-free data >> structure to guarantee that stores / loads are observed by other >> threads; it's not really for performance. I guess in the meantime I >> could make an ASM stub using .BYTE? >> > > I'd like to see a version that calls into assembly (you want WORD $x, not > BYTE, and no dot), compared with a version that doesn't. In the case of > prefetch, the difference was significant enough to add to the compiler. > > To compare with the version that doesn't call into assembly, you can patch > in CL 7756043, with two caveats: > 1. It's not done well, and it would need redoing if we keep it. > 2. The linker emits DSB as 0xFFFFFFFF, which I assume is not the actual > encoding. You'll need to tweak that before using. > > Russ >
Sign in to reply to this message.
No, it doesn't. You can use WORD $0x12345678 with the constant corrected. Russ
Sign in to reply to this message.
I'd like to leave this until after Go 1.1.
Sign in to reply to this message.
Replacing golang-dev with golang-codereviews.
Sign in to reply to this message.
R=close replacement CL: https://codereview.appspot.com/46170044/
Sign in to reply to this message.
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
