Created: 13 years, 10 months ago by hector
Modified: 13 years, 10 months ago
Reviewers:
CC: rsc, dvyukov, brainman, jp, golang-dev
Visibility: Public.
Description: runtime: eliminate handle churn when churning channels on Windows
The Windows implementation of the net package churns through a couple of channels for every read/write operation. This translates into a lot of time spent in the kernel creating and deleting event objects.
The Windows implementation of the net package churns through a couple of channels for every read/write operation. This translates into a lot of time spent in the kernel creating and deleting event objects.
Patch Set 1 #
Patch Set 2 : diff -r 546f21eebee8 https://go.googlecode.com/hg/ #
Total comments: 4
Patch Set 3 : diff -r 546f21eebee8 https://go.googlecode.com/hg/ #
Total comments: 10
Patch Set 4 : diff -r 546f21eebee8 https://go.googlecode.com/hg/ #
Total comments: 6
Patch Set 5 : diff -r 2302c9faa3ff https://go.googlecode.com/hg/ #
Total comments: 2
Messages
Total messages: 18
Hello rsc@golang.org, dvyukov@google.com, alex.brainman@gmail.com (cc: golang-dev@googlegroups.com),

I'd like you to review this change to https://go.googlecode.com/hg/
Nice.

http://codereview.appspot.com/4997044/diff/9001/src/pkg/runtime/windows/thread.c
File src/pkg/runtime/windows/thread.c (right):

http://codereview.appspot.com/4997044/diff/9001/src/pkg/runtime/windows/threa...
src/pkg/runtime/windows/thread.c:133: M *m;
rename to avoid confusion with global m

http://codereview.appspot.com/4997044/diff/9001/src/pkg/runtime/windows/threa...
src/pkg/runtime/windows/thread.c:196: thandle = runtime·stdcall(runtime·CreateThread, 6, (uintptr)0, (uintptr)0, runtime·tstart_stdcall, m, (uintptr)4, (uintptr)0);
comment CREATE_SUSPENDED or add it to defs.h
http://codereview.appspot.com/4997044/diff/9001/src/pkg/runtime/windows/thread.c
File src/pkg/runtime/windows/thread.c (right):

http://codereview.appspot.com/4997044/diff/9001/src/pkg/runtime/windows/threa...
src/pkg/runtime/windows/thread.c:133: M *m;
On 2011/09/11 07:23:27, jp wrote:
> rename to avoid confusion with global m

should be ok, look at proc.c for precedent.

http://codereview.appspot.com/4997044/diff/9001/src/pkg/runtime/windows/threa...
src/pkg/runtime/windows/thread.c:196: thandle = runtime·stdcall(runtime·CreateThread, 6, (uintptr)0, (uintptr)0, runtime·tstart_stdcall, m, (uintptr)4, (uintptr)0);
On 2011/09/11 07:23:27, jp wrote:
> comment CREATE_SUSPENDED or add it to defs.h

Done.
This program doesn't terminate with this change.

package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

var (
	lock sync.Mutex
	m    = make(map[int]int)
)

func f(i int) {
	for {
		lock.Lock()
		m[i]++
		lock.Unlock()
	}
}

func main() {
	runtime.GOMAXPROCS(2)
	for i := 0; i < 10; i++ {
		go f(i)
	}
	time.Sleep(1e10)
	lock.Lock()
	sum := 0
	for i := 0; i < 10; i++ {
		sum += m[i]
	}
	for i := 0; i < 10; i++ {
		fmt.Println(i, float64(m[i])*100/float64(sum))
	}
}

I'll investigate using a doubly linked list to make the lock more fair.
http://codereview.appspot.com/4997044/diff/12001/src/pkg/runtime/windows/thre...
File src/pkg/runtime/windows/thread.c (right):

http://codereview.appspot.com/4997044/diff/12001/src/pkg/runtime/windows/thre...
src/pkg/runtime/windows/thread.c:126: while(m->waitm != m)
this has to be load-acquire, because it acquires the lock

http://codereview.appspot.com/4997044/diff/12001/src/pkg/runtime/windows/thre...
src/pkg/runtime/windows/thread.c:140: if(m = runtime·atomicloadp(&l->waitm))
Unbounded active spinning is bad for performance, especially on single-core machines. Here you basically say: if l->waitm is not yet filled, eat my whole quantum in this loop, and only then switch to that other thread to finally fill it. Add runtime·procyield() into the loop + episodic runtime·osyield().

http://codereview.appspot.com/4997044/diff/12001/src/pkg/runtime/windows/thre...
src/pkg/runtime/windows/thread.c:144: m->waitm = m;
this has to be store-release, because it releases the lock

http://codereview.appspot.com/4997044/diff/12001/src/pkg/runtime/windows/thre...
src/pkg/runtime/windows/thread.c:145: runtime·stdcall(runtime·NtAlertThread, 1, m->hthread);
If NtAlertThread(T) is called before NtDelayExecution(T), NtDelayExecution(T) will return immediately, right? It's not clear from the docs.

http://codereview.appspot.com/4997044/diff/12001/src/pkg/runtime/windows/thre...
src/pkg/runtime/windows/thread.c:198: m, (uintptr)4/*CREATE_SUSPENDED*/, nil);
What is the reason to create a suspended thread and then resume it?
On 2011/09/11 12:03:52, hector wrote:
> This program doesn't terminate with this change.
> I'll investigate using a doubly linked list to make the lock more fair.

I think the event in the current version uses FIFO wakeup. You don't need a doubly linked list (plus it's problematic to implement in a lock-free manner); you need a singly linked FIFO queue. However, I am not sure whether it's important to make this program work; it seems to be working by accident.
PTAL. I looked into using a FIFO queue, but the implementation I looked at (Michael and Scott) requires a dummy node, which implies that it depends on malloc.

http://codereview.appspot.com/4997044/diff/12001/src/pkg/runtime/windows/thre...
File src/pkg/runtime/windows/thread.c (right):

http://codereview.appspot.com/4997044/diff/12001/src/pkg/runtime/windows/thre...
src/pkg/runtime/windows/thread.c:126: while(m->waitm != m)
On 2011/09/11 13:00:05, dvyukov wrote:
> this has to be load-acquire, because it acquires the lock

Done.

http://codereview.appspot.com/4997044/diff/12001/src/pkg/runtime/windows/thre...
src/pkg/runtime/windows/thread.c:140: if(m = runtime·atomicloadp(&l->waitm))
On 2011/09/11 13:00:05, dvyukov wrote:
> Unbounded active spinning is bad for performance. Especially on single core
> machines. Here you basically say, if l->waitm is not yet filled, eat whole my
> quantum in this loop, and only then switch to that other thread to finally fill it.
> Add runtime.procyield() into the loop + episodic runtime.osyield().

Done.

http://codereview.appspot.com/4997044/diff/12001/src/pkg/runtime/windows/thre...
src/pkg/runtime/windows/thread.c:144: m->waitm = m;
On 2011/09/11 13:00:05, dvyukov wrote:
> this has to be store-release, because it releases the lock

Done.

http://codereview.appspot.com/4997044/diff/12001/src/pkg/runtime/windows/thre...
src/pkg/runtime/windows/thread.c:145: runtime·stdcall(runtime·NtAlertThread, 1, m->hthread);
On 2011/09/11 13:00:05, dvyukov wrote:
> If NtAlertThread(T) is called before NtDelayExecution(T), NtDelayExecution(T)
> will return immediately, right? It's not clear from docs.

I just tested this and it turns out it doesn't. I will go back to using event objects, but only one per M.

http://codereview.appspot.com/4997044/diff/12001/src/pkg/runtime/windows/thre...
src/pkg/runtime/windows/thread.c:198: m, (uintptr)4/*CREATE_SUSPENDED*/, nil);
On 2011/09/11 13:00:05, dvyukov wrote:
> What is reason to create a suspended thread and then to resume it?

So there isn't a race to set m->hthread = thandle.
I think the spinning in unlock() when l->waitm == nil is unnecessary - it should be enough to cas it to -1, and let the thread in lock() pick this up and run with it. Am I off base here?
I agree with Hector's idea about changing l.waitm from nil to LOCK_HELD as a quick exit. That should eliminate the spinning nicely.

http://codereview.appspot.com/4997044/diff/15001/src/pkg/runtime/windows/thre...
File src/pkg/runtime/windows/thread.c (right):

http://codereview.appspot.com/4997044/diff/15001/src/pkg/runtime/windows/thre...
src/pkg/runtime/windows/thread.c:158: // someone else has it; wait
This code deserves a bit more explanation. I suggest:

Insert above this function:

#define LOCK_HELD ((M*)-1)

Rename M.waitm to M.nextwaitm. This becomes:

// Someone else has it.
// l->waitm points to a linked list of M's waiting for this lock,
// chained through m->nextwaitm.
// To pass the lock to this m, another M will set m->waitm = LOCK_HELD
// and signal m->event.

// Queue.
for(;;) {
	...
}

// Wait.
while(runtime...)

http://codereview.appspot.com/4997044/diff/15001/src/pkg/runtime/windows/thre...
src/pkg/runtime/windows/thread.c:172: M *m;
Please call this M *mp; There is a global m, and it is easier to read this code if you understand that it is not referring to the global m.

http://codereview.appspot.com/4997044/diff/15001/src/pkg/runtime/windows/thre...
src/pkg/runtime/windows/thread.c:173: uint32 i = 0, spin = 0;
initializations separate from declaration please. move spin = 0 down before if(proccount > 1) and i = 0 down before for(;;).

http://codereview.appspot.com/4997044/diff/15001/src/pkg/runtime/windows/thre...
src/pkg/runtime/windows/thread.c:175: if(runtime·xadd(&l->key, -1) == 0)
Same comments about explanation. The comments I'd want to see as a reader are:

// Other M's are waiting for the lock. Wake one.
if(proccount == 0)
	...

// Wait for an M to appear on the waiting list and dequeue it.
for(;;) {
	...
}

// Wake that M.
runtime.atomicstorep(&m->nextwaitm, LOCK_HELD);
runtime.stdcall(runtime.SetEvent, 1, m->event);

http://codereview.appspot.com/4997044/diff/15001/src/pkg/runtime/windows/thre...
src/pkg/runtime/windows/thread.c:185: while(m = runtime·atomicloadp(&l->waitm))
Please write while((m = ...) != nil) just so that it's clear that's not an ==.

http://codereview.appspot.com/4997044/diff/15001/src/pkg/runtime/windows/thre...
src/pkg/runtime/windows/thread.c:188: if(++i % (spin + 1) > 0)
Easier to read:

if(i++ < spin)
	runtime.procyield(ACTIVE_SPIN_CNT);
else {
	i = 0;
	runtime.osyield();
}
Hello rsc@golang.org, dvyukov@google.com, alex.brainman@gmail.com, jp@webmaster.ms (cc: golang-dev@googlegroups.com),

Please take another look.
http://codereview.appspot.com/4997044/diff/27001/src/pkg/runtime/windows/thre...
File src/pkg/runtime/windows/thread.c (right):

http://codereview.appspot.com/4997044/diff/27001/src/pkg/runtime/windows/thre...
src/pkg/runtime/windows/thread.c:132: m->nextwaitm = runtime·atomicloadp(&l->waitm);
Does this need to be atomicstorep(&m->nextwaitm, l->waitm)?
On 2011/09/13 11:06:12, hector wrote:
> http://codereview.appspot.com/4997044/diff/27001/src/pkg/runtime/windows/thre...
> File src/pkg/runtime/windows/thread.c (right):

Sorry Hector, I don't comment on your code, because I always miss something with those fiddly locks.

On the other hand, I will try to change the network code to reuse channels like the unix code does. I didn't do it at the start to keep the code simple. But it is probably worth it, even if only to avoid allocating memory on every I/O.

Alex
No problem, and perfectly understood, Alex. I agree it is a tricky topic, these lock-free concurrent algorithms. I think Dmitry knows a lot about the issues, so I hope he can confirm the final version of the code.

As for changing the net code to reuse channels, sounds great. I'm happy for the time being; with this change the CPU load is slightly lower under high throughput.
http://codereview.appspot.com/4997044/diff/27001/src/pkg/runtime/windows/thre...
File src/pkg/runtime/windows/thread.c (right):

http://codereview.appspot.com/4997044/diff/27001/src/pkg/runtime/windows/thre...
src/pkg/runtime/windows/thread.c:132: m->nextwaitm = runtime·atomicloadp(&l->waitm);
On 2011/09/13 11:06:12, hector wrote:
> Does this need to be atomicstorep(&m->nextwaitm, l->waitm)?

To answer my own question, it doesn't, because the cas acts as a memory barrier.
On 2011/09/13 13:22:24, hector wrote:
> No problem and perfectly understood Alex. I agree it is a tricky topic,
> these lock-free concurrent algorithms. I think Dmitry knows a lot about the
> issues, so I hope he can confirm the final version of the code.

Hi Hector,

I see no correctness issues at first glance. However, you know, you're never sure with such things. I do not have time for a really close examination right now. Ideally, one verifies such things with the Relacy Race Detector; I can't do better than that. It requires time, though.
Correct me if I'm wrong, but as far as I'm aware there is no more work to be done on this. Can someone please approve the change? Thanks.
LGTM
*** Submitted as http://code.google.com/p/go/source/detail?r=af0ac80bbb92 ***

runtime: eliminate handle churn when churning channels on Windows

The Windows implementation of the net package churns through a couple of channels for every read/write operation. This translates into a lot of time spent in the kernel creating and deleting event objects.

R=rsc, dvyukov, alex.brainman, jp
CC=golang-dev
http://codereview.appspot.com/4997044

Committer: Russ Cox <rsc@golang.org>