Description

cmd/gc: Inline pointer sized T2I interface conversions
This CL also adds support for marking how likely the condition of an IF node in the AST is to be true. That feature is used here to mark the slow path as unlikely.
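For readers skimming the thread, here is a minimal sketch of the kind of conversion this CL targets: assigning a pointer-sized concrete type to a non-empty interface (T2I). The type and method names are hypothetical, chosen only for illustration.

```go
package main

import "fmt"

// I is a non-empty interface and T is a pointer-sized concrete type,
// so the assignment in main is exactly the T2I conversion being inlined.
type I interface{ Value() uintptr }

type T uintptr

func (t T) Value() uintptr { return uintptr(t) }

func main() {
	var t T = 42
	var i I = t // pointer-sized T2I conversion: the fast path hits a cached itab
	fmt.Println(i.Value())
}
```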
src/pkg/runtime:
benchmark old ns/op new ns/op delta
BenchmarkConvT2IUintptr 16 1 -91.63%
test/bench/go1:
benchmark old ns/op new ns/op delta
BenchmarkBinaryTree17 5416917000 5461355000 +0.82%
BenchmarkFannkuch11 3810355000 3842609000 +0.85%
BenchmarkGobDecode 19950950 19855420 -0.48%
BenchmarkGobEncode 11301220 11308530 +0.06%
BenchmarkGzip 548119600 546869200 -0.23%
BenchmarkGunzip 176145400 180208300 +2.31%
BenchmarkJSONEncode 93117400 70163100 -24.65%
BenchmarkJSONDecode 406626800 409999200 +0.83%
BenchmarkMandelbrot200 6300992 6317866 +0.27%
BenchmarkParse 7664396 7451625 -2.78%
BenchmarkRevcomp 1189424000 1412332000 +18.74%
BenchmarkTemplate 491308400 458654200 -6.65%
benchmark old MB/s new MB/s speedup
BenchmarkGobDecode 38.47 38.66 1.00x
BenchmarkGobEncode 67.92 67.87 1.00x
BenchmarkGzip 35.40 35.48 1.00x
BenchmarkGunzip 110.16 107.68 0.98x
BenchmarkJSONEncode 20.84 27.66 1.33x
BenchmarkJSONDecode 4.77 4.73 0.99x
BenchmarkParse 7.56 7.77 1.03x
BenchmarkRevcomp 213.69 179.96 0.84x
BenchmarkTemplate 3.95 4.23 1.07x
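For context on the BenchmarkConvT2IUintptr numbers above, a microbenchmark of this shape times the conversion in a tight loop. This is an illustrative sketch, not the actual src/pkg/runtime benchmark; the sink variable and names are assumptions.

```go
package main

import (
	"fmt"
	"testing"
)

type I interface{ M() }

type T uintptr

func (T) M() {}

// Package-level sink keeps the compiler from optimizing the conversion away.
var sink I

func main() {
	r := testing.Benchmark(func(b *testing.B) {
		var t T = 1
		for i := 0; i < b.N; i++ {
			sink = t // the pointer-sized T2I conversion being timed
		}
	})
	fmt.Println(r.N > 0) // true once the benchmark has run
}
```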
Patch Set 1 #Patch Set 2 : diff -r 2874979af37a https://code.google.com/p/go/ #
Total comments: 9
Patch Set 3 : diff -r 64195dbf632d https://code.google.com/p/go/ #Patch Set 4 : diff -r 64195dbf632d https://code.google.com/p/go/ #
Total comments: 4
Patch Set 5 : diff -r 5d3aaf0b717e https://code.google.com/p/go/ #Patch Set 6 : diff -r 5d3aaf0b717e https://code.google.com/p/go/ #
Total comments: 1
Patch Set 7 : diff -r ca5e20f93081 https://code.google.com/p/go/ #
Total comments: 1
Patch Set 8 : diff -r 46e193d130ce https://code.google.com/p/go/ #
Messages

Total messages: 36
linux/arm pandaboard

pando(~/go/test/bench/go1) % ~/go/misc/benchcmp {old,new}.txt
benchmark old ns/op new ns/op delta
BenchmarkBinaryTree17 37124725000 37525849000 +1.08%
BenchmarkFannkuch11 36154999000 34552552000 -4.43%
BenchmarkGobDecode 124050900 123104800 -0.76%
BenchmarkGobEncode 62636720 61475220 -1.85%
BenchmarkGzip 5584015000 5551667000 -0.58%
BenchmarkGunzip 1197876000 1183319000 -1.22%
BenchmarkJSONEncode 758874600 755590800 -0.43%
BenchmarkJSONDecode 2393341000 2969239000 +24.06%
BenchmarkMandelbrot200 45813000 45698840 -0.25%
BenchmarkParse 59043580 57451780 -2.70%
BenchmarkRevcomp 144458100 138739000 -3.96%
BenchmarkTemplate 5426147000 5195496000 -4.25%

benchmark old MB/s new MB/s speedup
BenchmarkGobDecode 6.19 6.23 1.01x
BenchmarkGobEncode 12.25 12.49 1.02x
BenchmarkGzip 3.48 3.50 1.01x
BenchmarkGunzip 16.20 16.40 1.01x
BenchmarkJSONEncode 2.56 2.57 1.00x
BenchmarkJSONDecode 0.81 0.65 0.80x
BenchmarkParse 0.98 1.01 1.03x
BenchmarkRevcomp 17.60 18.32 1.04x
BenchmarkTemplate 0.36 0.37 1.03x
On 2012/07/14 00:40:30, dfc wrote:
> linux/arm pandaboard
>
> pando(~/go/test/bench/go1) % ~/go/misc/benchcmp {old,new}.txt
> benchmark old ns/op new ns/op delta
> BenchmarkBinaryTree17 37124725000 37525849000 +1.08%
> BenchmarkFannkuch11 36154999000 34552552000 -4.43%
> BenchmarkGobDecode 124050900 123104800 -0.76%
> BenchmarkGobEncode 62636720 61475220 -1.85%
> BenchmarkGzip 5584015000 5551667000 -0.58%
> BenchmarkGunzip 1197876000 1183319000 -1.22%
> BenchmarkJSONEncode 758874600 755590800 -0.43%
> BenchmarkJSONDecode 2393341000 2969239000 +24.06%
> BenchmarkMandelbrot200 45813000 45698840 -0.25%
> BenchmarkParse 59043580 57451780 -2.70%
> BenchmarkRevcomp 144458100 138739000 -3.96%
> BenchmarkTemplate 5426147000 5195496000 -4.25%
>
> benchmark old MB/s new MB/s speedup
> BenchmarkGobDecode 6.19 6.23 1.01x
> BenchmarkGobEncode 12.25 12.49 1.02x
> BenchmarkGzip 3.48 3.50 1.01x
> BenchmarkGunzip 16.20 16.40 1.01x
> BenchmarkJSONEncode 2.56 2.57 1.00x
> BenchmarkJSONDecode 0.81 0.65 0.80x
> BenchmarkParse 0.98 1.01 1.03x
> BenchmarkRevcomp 17.60 18.32 1.04x
> BenchmarkTemplate 0.36 0.37 1.03x

Huh, I didn't know codereview would send out comments on unpublished CLs. For those not in the know, I sent this CL to Dave on IRC to get some benchmark figures on arm. I'll have the proper review mail sent out later today.
Ok, I didn't know codereview would send out the CL along with the reply I just wrote. The only thing missing from the mail is a mention that the benchmarks were done on a linux/amd64 machine.
On 14 July 2012 16:25, <daniel.morsing@gmail.com> wrote:
> Ok, didn't know codereview would send out the CL along with the reply I
> just wrote.

Is this change ready for review or not?
On Mon, Jul 16, 2012 at 4:03 AM, Nigel Tao <nigeltao@golang.org> wrote:
> On 14 July 2012 16:25, <daniel.morsing@gmail.com> wrote:
>> Ok, didn't know codereview would send out the CL along with the reply I
>> just wrote.
>
> Is this change ready for review or not?

Yes it is. Sorry for not making that clear.
ping?
I'm not sure if introducing atomicloadtype and typ2Itab is the right way to cut it. I'll defer to what rsc thinks.

I'm not sure what I learned from the benchmarks. amd64 gets dramatically better on JSONEncode. arm gets dramatically worse on JSONDecode. Do we have any idea why? Is it stack flapping again?

It may be worth (as a hack) replacing the atomicloadtype call with a simple assignment for now, just to get some numbers on how fast it gets if we can avoid function calls entirely. Doing this properly might require introducing an AATOMICLOAD virtual op, but I'm still too new to 6g/6l to know if that's the right thing to do, and not familiar enough with arm to know if that's even feasible. Again, I'll defer to rsc.

http://codereview.appspot.com/6351090/diff/2001/src/cmd/6g/cgen.c
File src/cmd/6g/cgen.c (right):

http://codereview.appspot.com/6351090/diff/2001/src/cmd/6g/cgen.c#newcode47
src/cmd/6g/cgen.c:47: case OEFACE:
If the op is no longer just for empty-interface values, then it's probably worth doing s/OEFACE/OIFACE/. Let's see what rsc says.

http://codereview.appspot.com/6351090/diff/2001/src/cmd/gc/go.h
File src/cmd/gc/go.h (right):

http://codereview.appspot.com/6351090/diff/2001/src/cmd/gc/go.h#newcode260
src/cmd/gc/go.h:260: uchar likely; // likeliness of if statement
Does unsignedness work if -1 is supposed to mean unlikely?

How much difference does the likeliness make on the benchmarks?

http://codereview.appspot.com/6351090/diff/2001/src/cmd/gc/walk.c
File src/cmd/gc/walk.c (right):

http://codereview.appspot.com/6351090/diff/2001/src/cmd/gc/walk.c#newcode777
src/cmd/gc/walk.c:777: r = l;
Some commentary on the C equivalent of what this generated AST does would be nice.

http://codereview.appspot.com/6351090/diff/2001/src/cmd/gc/walk.c#newcode1199
src/cmd/gc/walk.c:1199: walkexpr(&r, init);
Unrelated? Or an existing bug? What broke without it?

http://codereview.appspot.com/6351090/diff/2001/src/pkg/runtime/iface.c
File src/pkg/runtime/iface.c (right):

http://codereview.appspot.com/6351090/diff/2001/src/pkg/runtime/iface.c#newco...
src/pkg/runtime/iface.c:186: void
I think that this and runtime·typ2Itab should have a
#pragma textflag 7
before them to pick up NOSPLIT.
Also, can you paste the before/after of a disassembly of what the line "ifaceValue = concreteValue" looks like? I think that will help me understand the walk.c change. For example, the before is:

$ cat main.go
package main

func main() {
	i = t
}

type I interface {
	Method()
}

type T uintptr

func (T) Method() {}

var (
	i I
	t T
)

$ go tool 6g -S -o /dev/null main.go | sed -e '/Method/q'
--- prog list "main" ---
0000 (main.go:3) TEXT main+0(SB),$48-0
0001 (main.go:4) MOVQ $type."".T+0(SB),(SP)
0002 (main.go:4) MOVQ $type."".I+0(SB),8(SP)
0003 (main.go:4) MOVQ $go.itab."".T."".I+0(SB),16(SP)
0004 (main.go:4) MOVQ t+0(SB),BX
0005 (main.go:4) MOVQ BX,24(SP)
0006 (main.go:4) CALL ,runtime.convT2I+0(SB)
0007 (main.go:4) MOVQ 32(SP),BX
0008 (main.go:4) MOVQ BX,i+0(SB)
0009 (main.go:4) MOVQ 40(SP),BX
0010 (main.go:4) MOVQ BX,i+8(SB)
0011 (main.go:5) RET ,
--- prog list "T.Method" ---
go tool 6g: signal 13
On 2012/07/23 07:52:56, nigeltao wrote:
> I'm not sure if introducing atomicloadtype and typ2Itab is the right way to cut
> it. I'll defer to what rsc thinks.
>
> I'm not sure what I learned from the benchmarks. amd64 gets dramatically better
> on JSONEncode. arm gets dramatically worse on JSONDecode. Do we have any idea
> why? Is it stack flapping again?
>
> It may be worth (as a hack) replacing the atomicloadtype call with a simple
> assignment for now, just to get some numbers on how fast it gets if we can avoid
> function calls entirely. Doing this properly might require introducing an
> AATOMICLOAD virtual op, but I'm still too new to 6g/6l to know if that's the
> right thing to do, and not familiar enough with arm to know if that's even
> feasible. Again, I'll defer to rsc.

I've done testing where I relied on amd64 loading the pointer atomically. The conversion time was reduced to around 1 ns per op. On arm, the atomic loading isn't that straightforward. It involves loading an instruction from memory and executing it, and it's complicated enough to make it a pain to emit inline.

As for typ2Itab, the only way I can see to get rid of it would be making the codegen emit it. I haven't looked too hard at this possibility since I'm not well versed in how the codegen works.
http://codereview.appspot.com/6351090/diff/2001/src/cmd/gc/go.h
File src/cmd/gc/go.h (right):

http://codereview.appspot.com/6351090/diff/2001/src/cmd/gc/go.h#newcode260
src/cmd/gc/go.h:260: uchar likely; // likeliness of if statement
On 2012/07/23 07:52:56, nigeltao wrote:
> Does unsignedness work if -1 is supposed to mean unlikely?
>
> How much difference does the likeliness make on the benchmarks?

Whoops. I think this may have worked solely by luck.

The performance impact of not having likely is small. IIRC, it's about 1 ns. In this version of the patch, that gain isn't significant. However, I made a version where the pointer loading was done inline (relying on amd64 implicitly loading the pointer atomically). There the gain was about half the time spent per conversion. This optimization is a relic from that version, but I kept it in since it's still beneficial to benchmark.

http://codereview.appspot.com/6351090/diff/2001/src/cmd/gc/walk.c
File src/cmd/gc/walk.c (right):

http://codereview.appspot.com/6351090/diff/2001/src/cmd/gc/walk.c#newcode777
src/cmd/gc/walk.c:777: r = l;
On 2012/07/23 07:52:56, nigeltao wrote:
> Some commentary on the C equivalent of what this generated AST does would be
> nice.

I'll include some variant of the pseudocode from my initial design posting.

http://codereview.appspot.com/6351090/diff/2001/src/cmd/gc/walk.c#newcode1199
src/cmd/gc/walk.c:1199: walkexpr(&r, init);
On 2012/07/23 07:52:56, nigeltao wrote:
> Unrelated? Or an existing bug? What broke without it?

The gc compiler will segfault without this. Apparently, whoever wrote this code assumed that the r expression couldn't generate any init statements, which was true before this change.

http://codereview.appspot.com/6351090/diff/2001/src/pkg/runtime/iface.c
File src/pkg/runtime/iface.c (right):

http://codereview.appspot.com/6351090/diff/2001/src/pkg/runtime/iface.c#newco...
src/pkg/runtime/iface.c:186: void
On 2012/07/23 07:52:56, nigeltao wrote:
> I think that this and runtime·typ2Itab should have a
> #pragma textflag 7
> before them to pick up NOSPLIT.

I was a bit unsure as to what exactly "#pragma textflag 7" did. I could only see it on functions with ... arguments, so I thought it was related. I'll add this to the next version.
On 2012/07/23 08:02:28, nigeltao wrote:
> Also, can you paste what the before/after of a disassembly of what the line
> "ifaceValue = concreteValue" looks like? I think that will help me understand
> the walk.c change.

This is the assembly after the patch:

--- prog list "main" ---
0000 (main.go:3) TEXT main+0(SB),$32-0
0001 (main.go:4) MOVQ $go.itab."".T."".I+0(SB),(SP)
0002 (main.go:4) CALL ,runtime.atomicloadtype+0(SB)
0003 (main.go:4) MOVQ 8(SP),AX
0004 (main.go:4) MOVQ $0,BP
0005 (main.go:4) CMPQ AX,BP
0006 (main.go:4) JNE $1,12
0007 (main.go:4) MOVQ $type."".T+0(SB),(SP)
0008 (main.go:4) MOVQ $type."".I+0(SB),8(SP)
0009 (main.go:4) MOVQ $go.itab."".T."".I+0(SB),16(SP)
0010 (main.go:4) CALL ,runtime.typ2Itab+0(SB)
0011 (main.go:4) MOVQ 24(SP),AX
0012 (main.go:4) MOVQ AX,i+0(SB)
0013 (main.go:4) MOVQ t+0(SB),BX
0014 (main.go:4) MOVQ BX,i+8(SB)
0015 (main.go:5) RET ,

Note that the linker will move the typ2Itab branch to the end of the function.
Hello rsc@golang.org, dave@cheney.net, nigeltao@golang.org (cc: golang-dev@googlegroups.com), I'd like you to review this change to https://code.google.com/p/go/
On 23 July 2012 21:10, <daniel.morsing@gmail.com> wrote:
> As for typ2Itab, The only way I can see how to get rid of it, would be
> making the codegen emit it. I havn't looked too hard at this possibility
> since I'm not well versed in how the codegen works.

Well, an alternative is to call convT2I if the itab cache misses, instead of calling a new function typ2Itab. It might lead to simpler code. Sure, it adds a tiny amount of redundant computation, but cache misses should be rare.

I'm not saying we should do that; it's just an idea.
On 23 July 2012 21:10, <daniel.morsing@gmail.com> wrote:
> However I made a version, where the pointer loading was done inline
> (relying on amd64 implicitly loading the pointer atomically). There the
> gain was about half the time spent per conversion.

I'm not sure how to interpret that paragraph. What are the benchmark numbers if you use a plain assignment instead of a runtime·atomicloadtype call? It won't be correct (on arm), but it would give me a better idea of the potential impact of this change. As it is, you're trying to avoid a convT2I function call, but always call runtime·atomicloadtype instead, so it's not an obvious win. Ideally, there'd be no function calls on the common path.
On Tue, Jul 24, 2012 at 10:51 AM, Nigel Tao <nigeltao@golang.org> wrote:
> On 23 July 2012 21:10, <daniel.morsing@gmail.com> wrote:
>> However I made a version, where the pointer loading was done inline
>> (relying on amd64 implicitly loading the pointer atomically). There the
>> gain was about half the time spent per conversion.
>
> I'm not sure how to interpret that paragraph. What are the benchmark
> numbers if you use a plain assignment instead of a
> runtime·atomicloadtype call? It won't be correct (on arm), but it
> would give me a better idea of the potential impact of this change. As
> it is, you're trying to avoid a convT2I function call, but always call
> runtime·atomicloadtype instead, so it's not an obvious win. Ideally,
> there'd be no function calls on the common path.

Sorry for being inadvertently obtuse. When doing plain assignment, the benchmark takes around 1 ns per op in BenchmarkConvT2IUintptr.

I've tried making a version of this change that does the following:

	Itab tab = typ2Itab(type, Itype, &cache);
	EFACE{tab, ptr}

where the typ2Itab function does the check on cache and returns it if set. That version had benchmarks around 8 ns per op. Apparently, the path for interface conversion is getting so fast that pushing 2 words onto the stack makes a significant difference.

As for using convT2I instead of typ2Itab, there's a problem with that. We would have to evaluate the data being converted in the init list, but where the init list is inserted, that data might be uninitialized.

Consider:

	//begin init list
	Itab tab = cache //atomically
	if tab == nil {
		tmpiface = convT2I(type, itype, &cache, data)
	} else {
		tmpiface = EFACE{tab, data}
	}
	//end init list
	data = realValue
	//the expression we were evaluating. tmpiface will be wrong
	foo(tmpiface)

This situation arises when evaluating AS2FUNC nodes.
Sorry that this one is taking so long...

On 24 July 2012 20:30, Daniel Morsing <daniel.morsing@gmail.com> wrote:
> Sorry for being inadvertently obtuse. When doing plain assignment, the
> benchmark takes around 1 ns per op in BenchmarkConvT2IUintptr.

What about the test/bench/go1 numbers?

> As for using convT2I instead of typ2Itab, there's a problem with that.
> We would have to evaluate the data being converted in the init list,
> but where the init list is inserted, that data might be uninitialized.
>
> Consider:
>
> //begin init list
> Itab tab = cache //atomically
> if tab == nil {
>     tmpiface = convT2I(type, itype, &cache, data)
> } else {
>     tmpiface = EFACE{tab, data}
> }
> //end init list
> data = realValue
> //the expression we were evaluating. tmpiface will be wrong
> foo(tmpiface)
>
> This situation arises when evaluating AS2FUNC nodes.

Sorry, I'm probably being dumb, but I don't understand how tmpiface will be wrong with OAS2FUNC. Can you give a small example.go program that fails in this way?

IIUC, the proposal is to do

	Itab tab = cache //atomically
	if tab == nil {
		tmpiface = convT2I(type, itype, &cache, data)
	} else {
		tmpiface = EFACE{tab, data}
	}

where we currently do

	tmpiface = convT2I(type, itype, &cache, data)

Why are we also generating init code? I didn't notice that in my previous readings. It's true that you could conceivably initialize the itab caches eagerly instead of lazily, but I would want that to be a separate change to this one. I wouldn't expect it to be trivial if you want to also preserve dead code elimination for unused types.
On Wed, Jul 25, 2012 at 9:47 AM, Nigel Tao <nigeltao@golang.org> wrote:
> Sorry that this one is taking so long...
>
> On 24 July 2012 20:30, Daniel Morsing <daniel.morsing@gmail.com> wrote:
>> Sorry for being inadvertently obtuse. When doing plain assignment, the
>> benchmark takes around 1 ns per op in BenchmarkConvT2IUintptr.
>
> What about the test/bench/go1 numbers?

I don't have those numbers written down anywhere. I'll go get them when I get to a PC where I have my stuff.

>> As for using convT2I instead of typ2Itab, there's a problem with that.
>> We would have to evaluate the data being converted in the init list,
>> but where the init list is inserted, that data might be uninitialized.
>>
>> Consider:
>>
>> //begin init list
>> Itab tab = cache //atomically
>> if tab == nil {
>>     tmpiface = convT2I(type, itype, &cache, data)
>> } else {
>>     tmpiface = EFACE{tab, data}
>> }
>> //end init list
>> data = realValue
>> //the expression we were evaluating. tmpiface will be wrong
>> foo(tmpiface)
>>
>> This situation arises when evaluating AS2FUNC nodes.
>
> Sorry, I'm probably being dumb, but I don't understand how tmpiface
> will be wrong with OAS2FUNC. Can you give a small example.go program
> that fails in this way?
>
> IIUC, the proposal is to do
>
> Itab tab = cache //atomically
> if tab == nil {
>     tmpiface = convT2I(type, itype, &cache, data)
> } else {
>     tmpiface = EFACE{tab, data}
> }
>
> where we currently do
>
> tmpiface = convT2I(type, itype, &cache, data)
>
> Why are we also generating init code? I didn't notice that in my
> previous readings. It's true that you could conceivably initialize the
> itab caches eagerly instead of lazily, but I would want that to be a
> separate change to this one. I wouldn't expect it to be trivial if you
> want to also preserve dead code elimination for unused types.

I think you have the init list misunderstood. It's a list for taking stuff that must be run in statement context and inserting it immediately before the expression is evaluated.

The CONVIFACE node occurs as part of an expression. The convT2I call is also an expression, so there is no problem with inserting it at the place where the interface conversion was originally, but the if statement is a statement. We can't just insert it into the AST at the CONVIFACE node. We have to put it onto the init list, so it's executed just before the expression is evaluated.

If we use convT2I, we have to evaluate the data being converted in the init list. The value of the data may change between the time the init list is inserted and the time the conversion expression is evaluated.
Here is the atomic load version benchmark, compared to the atomicloadtype() version:

src/pkg/runtime:
benchmark old ns/op new ns/op delta
BenchmarkConvT2ESmall 11 11 +1.79%
BenchmarkConvT2EUintptr 0 0 +1.43%
BenchmarkConvT2ELarge 75 76 +0.93%
BenchmarkConvT2ISmall 17 13 -19.41%
BenchmarkConvT2IUintptr 5 1 -74.32%
BenchmarkConvT2ILarge 77 78 +0.90%
BenchmarkConvI2E 4 5 +14.79%
BenchmarkConvI2I 17 17 +0.00%

test/bench/go1:
benchmark old ns/op new ns/op delta
BenchmarkBinaryTree17 5528479000 5526532000 -0.04%
BenchmarkFannkuch11 3920745000 3755168000 -4.22%
BenchmarkGobDecode 20638900 20096130 -2.63%
BenchmarkGobEncode 11270940 11592350 +2.85%
BenchmarkGzip 559168200 553892000 -0.94%
BenchmarkGunzip 182290200 180607000 -0.92%
BenchmarkJSONEncode 70364760 72977700 +3.71%
BenchmarkJSONDecode 418688000 407103000 -2.77%
BenchmarkMandelbrot200 6324712 6407060 +1.30%
BenchmarkParse 8040630 7889790 -1.88%
BenchmarkRevcomp 1693365000 1792351000 +5.85%
BenchmarkTemplate 465117000 456074200 -1.94%

benchmark old MB/s new MB/s speedup
BenchmarkGobDecode 37.19 38.19 1.03x
BenchmarkGobEncode 68.10 66.21 0.97x
BenchmarkGzip 34.70 35.03 1.01x
BenchmarkGunzip 106.45 107.44 1.01x
BenchmarkJSONEncode 27.58 26.59 0.96x
BenchmarkJSONDecode 4.63 4.77 1.03x
BenchmarkParse 7.20 7.34 1.02x
BenchmarkRevcomp 150.10 141.81 0.94x
BenchmarkTemplate 4.17 4.25 1.02x
2012/7/26 Daniel Morsing <daniel.morsing@gmail.com>:
> test/bench/go1:
> benchmark old ns/op new ns/op delta
> BenchmarkBinaryTree17 5528479000 5526532000 -0.04%
> BenchmarkFannkuch11 3920745000 3755168000 -4.22%
> BenchmarkGobDecode 20638900 20096130 -2.63%
> BenchmarkGobEncode 11270940 11592350 +2.85%
> BenchmarkGzip 559168200 553892000 -0.94%
> BenchmarkGunzip 182290200 180607000 -0.92%
> BenchmarkJSONEncode 70364760 72977700 +3.71%
> BenchmarkJSONDecode 418688000 407103000 -2.77%
> BenchmarkMandelbrot200 6324712 6407060 +1.30%
> BenchmarkParse 8040630 7889790 -1.88%
> BenchmarkRevcomp 1693365000 1792351000 +5.85%
> BenchmarkTemplate 465117000 456074200 -1.94%

So, some wins, some losses, but no clear trend. Even if you compare this to tip instead of to atomicloadtype, I don't think it's conclusive. Hmm...

As an extra data point, here are the bench numbers comparing tip (old) to the current atomicloadtype-using patch (new) on my linux, amd64 desktop:

benchmark old ns/op new ns/op delta
BenchmarkConvT2ESmall 10 10 +0.00%
BenchmarkConvT2EUintptr 0 0 -1.45%
BenchmarkConvT2ELarge 74 70 -4.96%
BenchmarkConvT2ISmall 12 12 +0.78%
BenchmarkConvT2IUintptr 12 4 -62.23%
BenchmarkConvT2ILarge 76 75 -0.91%
BenchmarkConvI2E 4 5 +14.61%
BenchmarkConvI2I 20 19 -2.99%

benchmark old ns/op new ns/op delta
BenchmarkBinaryTree17 5920179000 5953817000 +0.57%
BenchmarkFannkuch11 4019578000 3860152000 -3.97%
BenchmarkGobDecode 21358690 21715730 +1.67%
BenchmarkGobEncode 12917560 12678230 -1.85%
BenchmarkGzip 574320000 569419400 -0.85%
BenchmarkGunzip 179199300 178863100 -0.19%
BenchmarkJSONEncode 95139100 71945500 -24.38%
BenchmarkJSONDecode 422837000 406048200 -3.97%
BenchmarkMandelbrot200 7043688 7047108 +0.05%
BenchmarkParse 7893900 8311090 +5.28%
BenchmarkRevcomp 1257351000 1280234000 +1.82%
BenchmarkTemplate 509244200 475117200 -6.70%

I'll poke at the BenchmarkJSONEncode case some more tomorrow.
On 25 July 2012 18:34, Daniel Morsing <daniel.morsing@gmail.com> wrote:
> I think you have the init list misunderstood. It's a list for taking
> stuff that must be run in statement context and inserting it
> immediately before the expression is evaluated.

Ah, I see. I'm still relatively new to the gc code.

I don't feel qualified to give an LGTM on this one yet. I'll wait for rsc's opinion.
Thanks for looking into this. If I understand correctly, this replaces 1 function call with 1 or 2 function calls depending on the path taken. The only win is avoiding the generality of arbitrary-sized convT2I.

One possibility is to write a convT2Ip like convT2I and use that, which is then 1 function call always. That's a simpler function than convT2I so it would be faster.

The other possibility is to keep the current conditional but make the atomic load be generated with real instructions. On the x86 an ordinary load is fine. On the ARM I think if you do a write flush before the store of the cache pointer then you can do an ordinary load on the lookup path too. Let's move forward under the assumption that an ordinary load is fine on all architectures.

http://codereview.appspot.com/6351090/diff/19001/src/cmd/5g/cgen.c
File src/cmd/5g/cgen.c (right):

http://codereview.appspot.com/6351090/diff/19001/src/cmd/5g/cgen.c#newcode46
src/cmd/5g/cgen.c:46: !n->left->addable || !n->right->addable) {
What condition is this trying to handle?

http://codereview.appspot.com/6351090/diff/19001/src/cmd/6g/cgen.c
File src/cmd/6g/cgen.c (right):

http://codereview.appspot.com/6351090/diff/19001/src/cmd/6g/cgen.c#newcode49
src/cmd/6g/cgen.c:49: !n->left->addable || !n->right->addable) {
What condition is this trying to handle?

http://codereview.appspot.com/6351090/diff/19001/src/cmd/8g/cgen.c
File src/cmd/8g/cgen.c (right):

http://codereview.appspot.com/6351090/diff/19001/src/cmd/8g/cgen.c#newcode79
src/cmd/8g/cgen.c:79: !n->left->addable || !n->right->addable) {
What condition is this trying to handle?

http://codereview.appspot.com/6351090/diff/19001/src/cmd/gc/go.h
File src/cmd/gc/go.h (right):

http://codereview.appspot.com/6351090/diff/19001/src/cmd/gc/go.h#newcode260
src/cmd/gc/go.h:260: int likely; // likeliness of if statement
schar please
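The cache discipline described here (fully initialize the itab, publish it with an ordered store, then read it with a cheap load on the lookup path) can be sketched in user-level Go with sync/atomic. This is only an illustration of the idea under assumed names; it is not the compiler's generated code, and the runtime's real itab structure is different.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"unsafe"
)

// itab stands in for the runtime's itab structure.
type itab struct{ methods [4]uintptr }

// cache is the per-conversion-site itab cache, published atomically.
var cache unsafe.Pointer // *itab

// getItab models the lookup: a cheap load on the fast path; nil means
// take the slow path, build the itab, and publish it for later lookups.
func getItab() *itab {
	p := (*itab)(atomic.LoadPointer(&cache))
	if p == nil {
		p = &itab{} // fully initialize before publishing
		atomic.StorePointer(&cache, unsafe.Pointer(p))
	}
	return p
}

func main() {
	first := getItab()
	second := getItab() // hits the cache, same pointer
	fmt.Println(first == second)
}
```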
On Mon, Aug 6, 2012 at 12:26 AM, <rsc@golang.org> wrote:
> Thanks for looking into this. If I understand correctly, this replaces 1
> function call with 1 or 2 function calls depending on the path taken.
> The only win is avoiding the generality of arbitrary-sized convT2I.
>
> One possibility is to write a convT2Ip like convT2I and use that, which
> is then 1 function call always. That's a simpler function than convT2I
> so it would be faster.

I'm not at a computer where I can test this, but I think this would be slower than the current solution, since there would be a call for convT2Ip and a call to atomicloadp within that function.

> The other possibility is to keep the current conditional but make the
> atomic load be generated with real instructions. On the x86 an ordinary
> load is fine. On the ARM I think if you do a write flush before the
> store of the cache pointer then you can do an ordinary load on the
> lookup path too. Let's move forward under the assumption that an
> ordinary load is fine on all architectures.

The amount of code that's there for arm atomics frightens me into thinking that an ordinary load has subtle ways of breaking atomicity.

I've had some time to think about this CL, and I think there is some value in re-spinning it to have the logic happen in the codegen. It has a couple of advantages:

- Having the atomic logic be emitted inline would be easier.
- It would get rid of the 2 new functions that I added, since I could use convT2I, and emit the atomic load inline.
- No need to add the likely field to node.

However, I'm not very familiar with the codegen part of gc, and I'm not sure if it's a good idea to handle the interface conversion in codegen. The current CL will work fine, but I think it can be done cleaner.

How does this re-spin sound to you?

> http://codereview.appspot.com/6351090/diff/19001/src/cmd/5g/cgen.c
> File src/cmd/5g/cgen.c (right):
>
> http://codereview.appspot.com/6351090/diff/19001/src/cmd/5g/cgen.c#newcode46
> src/cmd/5g/cgen.c:46: !n->left->addable || !n->right->addable) {
> What condition is this trying to handle?

I ran into a problem with compiling the go/ast package. That package has a compilation AST similar to this:

	AS
	\-- NAME: iface
	\-- EFACE
	    \-- NAME: go.type.itab
	    \-- CALLFUNC
	        \-- LIST
	            \-- NAME: iface

In that case, the interface type word would be assigned first and the half-assigned interface would then be used as an argument. Spilling to a tempname solves this problem.

> http://codereview.appspot.com/6351090/diff/19001/src/cmd/6g/cgen.c
> File src/cmd/6g/cgen.c (right):
>
> http://codereview.appspot.com/6351090/diff/19001/src/cmd/6g/cgen.c#newcode49
> src/cmd/6g/cgen.c:49: !n->left->addable || !n->right->addable) {
> What condition is this trying to handle?
>
> http://codereview.appspot.com/6351090/diff/19001/src/cmd/8g/cgen.c
> File src/cmd/8g/cgen.c (right):
>
> http://codereview.appspot.com/6351090/diff/19001/src/cmd/8g/cgen.c#newcode79
> src/cmd/8g/cgen.c:79: !n->left->addable || !n->right->addable) {
> What condition is this trying to handle?
>
> http://codereview.appspot.com/6351090/diff/19001/src/cmd/gc/go.h
> File src/cmd/gc/go.h (right):
>
> http://codereview.appspot.com/6351090/diff/19001/src/cmd/gc/go.h#newcode260
> src/cmd/gc/go.h:260: int likely; // likeliness of if statement
> schar please
>
> http://codereview.appspot.com/6351090/
>> One possibility is to write a convT2Ip like convT2I and use that, which >> is then 1 function call always. That's a simpler function than convT2I >> so it would be faster. > > I'm not at a computer where I can test this, but I think this would be > slower than the current solution, since there would be a call for > convT2Ip and a call to atomicloadp within that function. Given the choice between '1 call always' or '1 call always and sometimes 2', I will always choose '1 call always'. It is strictly less expensive in generated code, and you're unlikely to be able to measure a significant difference caused by pushing the optional function call down a bit. It's a very important point, because otherwise we'd just inline everything and never make function calls. This is a real problem in C++ compilers. On the other hand, if you can make the fast path 0 function calls instead of 1 function call, then the increased speed may justify the larger generated code. Not making a function call cuts not just the overhead of the function call but also the overhead of having to spill any registers to the stack and then reload them after the call. It also makes the computation being done instead of the call available to the optimizer to understand and possibly rewrite. That's definitely the goal, and the second approach should get us there. > I've had some time to think about this CL, and I think there is some > value in re-spinning it to have the logic happen in the codegen. It > has a couple of advantages: > > - Having the atomic logic be emitted inline would be easier. > - It would get rid of the 2 new functions that I added, since I could > use convT2I, and emit the atomic load inline > - No need to add the likely field to node. > > However, I'm not very familiar with the codegen part of gc, and I'm > not sure if it's a good idea to handle the interface conversion in > codegen. The current CL will work fine, but I think it can be done > cleaner. 
How does this re-spin sound to you? Please do as much in the front end as possible. Otherwise we end up with three copies of everything, which is a significant maintenance burden. I've been working to move as much of the logic duplication as possible back into the front ends. The back end should have whatever primitive is needed (like cgen_eface for the T2E optimization) but little else. It's fine to have a cgen_iface and OIFACE that has basically the same implementation as cgen_eface but takes a cached itab value. Also, please just make the ARM do a plain load for now instead of adding complexity we're not sure is necessary. We can worry about the ARM later, once the rest of the code is good. I still believe we can arrange to use a plain load, and even if not we can fix it once we're happy with the x86 code. >> http://codereview.appspot.com/6351090/diff/19001/src/cmd/5g/cgen.c#newcode46 >> src/cmd/5g/cgen.c:46: !n->left->addable || !n->right->addable) { >> What condition is this trying to handle? > > I ran into a problem with compiling the go/ast package. That package > has a compilation AST similar to this: > > AS > \-- NAME: iface > \-- EFACE > \-- NAME: go.type.itab > \-- CALLFUNC > \-- LIST > \-- NAME: iface Do you know about the dump function? I am not sure how you generated that, but the dump function already does dumps like that, with more information than you're showing. > In that case, the interface type word would be assigned first and the > half assigned interface would then be used as an argument. Spilling to > a tempname solves this problem. Okay. But then the problem being solved is that a function call has to be made. That's not really a condition of addressability of the n side. Probably you want if n->ullman >= UINF or something like that. 
In this code:

	case OEFACE:
		if (res->op != ONAME || !res->addable) {
			tempname(&n1, n->type);
			cgen_eface(n, &n1);
			cgen(&n1, res);
		} else
			cgen_eface(n, res);
		goto ret;

It says "if res is inappropriate, call cgen_eface with a different res that is appropriate". But when you add the conditions about n->left and n->right, it looks like a check for an inappropriate n, but the cgen_eface still gets called with the same n, which looks suspicious. I think cgen_eface is at fault here if anything is: it should avoid writing to res until it has all the pieces it needs.

Right now it is:

	void
	cgen_eface(Node *n, Node *res)
	{
		Node dst;
		dst = *res;
		dst.type = types[tptr];
		cgen(n->left, &dst);
		dst.xoffset += widthptr;
		cgen(n->right, &dst);
	}

One fix is to make cgen_eface check for n->left->ullman >= UINF and generate into temporaries here. But I don't think that's necessary either. Since n->left is always a constant (a type pointer known at compile time), it should work well enough to swap the order of the generation:

	dst.type = types[tptr];
	dst.xoffset += widthptr;
	cgen(n->right, &dst);  // might be difficult
	dst.xoffset -= widthptr;
	cgen(n->left, &dst);  // always a constant

Russ
On Mon, Aug 6, 2012 at 10:16 PM, Russ Cox <rsc@golang.org> wrote:
>>> One possibility is to write a convT2Ip like convT2I and use that, which
>>> is then 1 function call always. That's a simpler function than convT2I
>>> so it would be faster.
>>
>> I'm not at a computer where I can test this, but I think this would be
>> slower than the current solution, since there would be a call for
>> convT2Ip and a call to atomicloadp within that function.
>
> Given the choice between '1 call always' or '1 call always and
> sometimes 2', I will always choose '1 call always'. It is strictly
> less expensive in generated code, and you're unlikely to be able to
> measure a significant difference caused by pushing the optional
> function call down a bit. It's a very important point, because
> otherwise we'd just inline everything and never make function calls.
> This is a real problem in C++ compilers.
>
> On the other hand, if you can make the fast path 0 function calls
> instead of 1 function call, then the increased speed may justify the
> larger generated code. Not making a function call cuts not just the
> overhead of the function call but also the overhead of having to spill
> any registers to the stack and then reload them after the call. It
> also makes the computation being done instead of the call available to
> the optimizer to understand and possibly rewrite. That's definitely
> the goal, and the second approach should get us there.

I've done the zero call fastpath solution before, and it's a trivial amount of code that needs to be changed. I'll do that for the next patchset.

>
>> I've had some time to think about this CL, and I think there is some
>> value in re-spinning it to have the logic happen in the codegen. It
>> has a couple of advantages:
>>
>> - Having the atomic logic be emitted inline would be easier.
>> - It would get rid of the 2 new functions that I added, since I could
>> use convT2I, and emit the atomic load inline
>> - No need to add the likely field to node.
>>
>> However, I'm not very familiar with the codegen part of gc, and I'm
>> not sure if it's a good idea to handle the interface conversion in
>> codegen. The current CL will work fine, but I think it can be done
>> cleaner. How does this re-spin sound to you?
>
> Please do as much in the front end as possible. Otherwise we end up
> with three copies of everything, which is a significant maintenance
> burden. I've been working to move as much of the logic duplication as
> possible back into the front ends. The back end should have whatever
> primitive is needed (like cgen_eface for the T2E optimization) but
> little else. It's fine to have a cgen_iface and OIFACE that has
> basically the same implementation as cgen_eface but takes a cached
> itab value.
>
> Also, please just make the ARM do a plain load for now instead of
> adding complexity we're not sure is necessary. We can worry about the
> ARM later, once the rest of the code is good. I still believe we can
> arrange to use a plain load, and even if not we can fix it once we're
> happy with the x86 code.
>
>>> http://codereview.appspot.com/6351090/diff/19001/src/cmd/5g/cgen.c#newcode46
>>> src/cmd/5g/cgen.c:46: !n->left->addable || !n->right->addable) {
>>> What condition is this trying to handle?
>>
>> I ran into a problem with compiling the go/ast package. That package
>> has a compilation AST similar to this:
>>
>> AS
>> \-- NAME: iface
>> \-- EFACE
>>     \-- NAME: go.type.itab
>>     \-- CALLFUNC
>>         \-- LIST
>>             \-- NAME: iface
>
> Do you know about the dump function? I am not sure how you generated
> that, but the dump function already does dumps like that, with more
> information than you're showing.

This was transcribed by hand from memory. I was at my laptop when I wrote this mail, and couldn't get a proper dump.
>
>> In that case, the interface type word would be assigned first and the
>> half assigned interface would then be used as an argument. Spilling to
>> a tempname solves this problem.
>
> Okay. But then the problem being solved is that a function call has to
> be made. That's not really a condition of addressability of the n
> side. Probably you want if n->ullman >= UINF or something like that.
>
> In this code:
>
>	case OEFACE:
>		if (res->op != ONAME || !res->addable) {
>			tempname(&n1, n->type);
>			cgen_eface(n, &n1);
>			cgen(&n1, res);
>		} else
>			cgen_eface(n, res);
>		goto ret;
>
> It says "if res is inappropriate, call cgen_eface with a different res
> that is appropriate".
> But when you add the conditions about n->left and n->right, it looks
> like a check for an inappropriate n, but the cgen_eface still gets
> called with the same n, which looks suspicious. I think cgen_eface is
> at fault here if anything is: it should avoid writing to res until it
> has all the pieces it needs.
>
> Right now it is:
>
>	void
>	cgen_eface(Node *n, Node *res)
>	{
>		Node dst;
>		dst = *res;
>		dst.type = types[tptr];
>		cgen(n->left, &dst);
>		dst.xoffset += widthptr;
>		cgen(n->right, &dst);
>	}
>
> One fix is to make cgen_eface check for n->left->ullman >= UINF and
> generate into temporaries here. But I don't think that's necessary
> either. Since n->left is always a constant (a type pointer known at
> compile time), it should work well enough to swap the order of the
> generation:
>
>	dst.type = types[tptr];
>	dst.xoffset += widthptr;
>	cgen(n->right, &dst);  // might be difficult
>	dst.xoffset -= widthptr;
>	cgen(n->left, &dst);  // always a constant

Will do this for next patch set.

> Russ
Hello rsc@golang.org, dave@cheney.net, nigeltao@golang.org (cc: golang-dev@googlegroups.com), Please take another look.
Note that I've updated the description of the CL to include benchmarks taken with a plain assignment serving as the atomic load.
ping.
I think that this is in rsc's court, but out of curiosity, what does "6g -S" give for this program:

--------
package main

type T uintptr

func (T) Method() {}

type I interface {
	Method()
}

var t T
var i I

func main() {
	i = t
}
--------

For example, 6g tip gives:

--- prog list "main" ---
0002 (main.go:14) TEXT main+0(SB),$48-0
0003 (main.go:15) MOVQ $type."".T+0(SB),(SP)
0004 (main.go:15) MOVQ $type."".I+0(SB),8(SP)
0005 (main.go:15) MOVQ $go.itab."".T."".I+0(SB),16(SP)
0006 (main.go:15) MOVQ t+0(SB),BX
0007 (main.go:15) MOVQ BX,24(SP)
0008 (main.go:15) CALL ,runtime.convT2I+0(SB)
0009 (main.go:15) MOVQ 32(SP),BX
0010 (main.go:15) MOVQ BX,i+0(SB)
0011 (main.go:15) MOVQ 40(SP),BX
0012 (main.go:15) MOVQ BX,i+8(SB)
0013 (main.go:16) RET ,

https://codereview.appspot.com/6351090/diff/26002/src/cmd/gc/walk.c
File src/cmd/gc/walk.c (right):

https://codereview.appspot.com/6351090/diff/26002/src/cmd/gc/walk.c#newcode787
src/cmd/gc/walk.c:787: r = l->left;
Just use "sym->def" a few lines below instead of assigning to r?
On 2012/08/24 05:02:51, nigeltao wrote:
> I think that this is in rsc's court, but out of curiosity, what does "6g -S"
> give for this program:
>
> For example, 6g tip gives:
>
> --- prog list "main" ---
> 0002 (main.go:14) TEXT main+0(SB),$48-0
> 0003 (main.go:15) MOVQ $type."".T+0(SB),(SP)
> 0004 (main.go:15) MOVQ $type."".I+0(SB),8(SP)
> 0005 (main.go:15) MOVQ $go.itab."".T."".I+0(SB),16(SP)
> 0006 (main.go:15) MOVQ t+0(SB),BX
> 0007 (main.go:15) MOVQ BX,24(SP)
> 0008 (main.go:15) CALL ,runtime.convT2I+0(SB)
> 0009 (main.go:15) MOVQ 32(SP),BX
> 0010 (main.go:15) MOVQ BX,i+0(SB)
> 0011 (main.go:15) MOVQ 40(SP),BX
> 0012 (main.go:15) MOVQ BX,i+8(SB)
> 0013 (main.go:16) RET ,

This is the 6g -S output:

--- prog list "main" ---
0002 (main.go:14) TEXT main+0(SB),$32-0
0003 (main.go:15) MOVQ go.itab."".T."".I+0(SB),AX
0004 (main.go:15) MOVQ $0,BP
0005 (main.go:15) CMPQ AX,BP
0006 (main.go:15) JNE $1,12
0007 (main.go:15) MOVQ $type."".T+0(SB),(SP)
0008 (main.go:15) MOVQ $type."".I+0(SB),8(SP)
0009 (main.go:15) MOVQ $go.itab."".T."".I+0(SB),16(SP)
0010 (main.go:15) CALL ,runtime.typ2Itab+0(SB)
0011 (main.go:15) MOVQ 24(SP),AX
0012 (main.go:15) MOVQ t+0(SB),BX
0013 (main.go:15) MOVQ BX,i+8(SB)
0014 (main.go:15) MOVQ AX,i+0(SB)
0015 (main.go:16) RET ,

However, the linker will move the unlikely typ2Itab branch to the end of the function for better pipelining.
Hello rsc@golang.org, dave@cheney.net, nigeltao@golang.org (cc: golang-dev@googlegroups.com), Please take another look.
Huh, I thought hg mail would publish draft comments... Anyway, Nigel's suggestion has been added.
On Sat, Aug 25, 2012 at 3:24 AM, <daniel.morsing@gmail.com> wrote: > Huh, thought hg mail would publish draft comments... It only does that if your user name is agl. I don't understand it either.
LGTM

I expected a win in code that uses interfaces as "unions", so you get lots of T2I conversions. The parser and template speedups reflect that, so great. I am a little surprised by the minor slowdowns in the other benchmarks, but that can be code motion and cache lines. It's hard to believe they are actually executing T2I conversions in any serious amount.

One thing that might be interesting in a separate CL: now that we have likely bits on the if, if there is a loop with an if inside, and the if branches out of the loop (return, break), mark it unlikely so that the loop body stays as tight as possible.

http://codereview.appspot.com/6351090/diff/41001/src/pkg/runtime/iface.c
File src/pkg/runtime/iface.c (right):

http://codereview.appspot.com/6351090/diff/41001/src/pkg/runtime/iface.c#newc...
src/pkg/runtime/iface.c:186: #pragma textflag 7
This is fine to mark as textflag 7. I just want to point out that it doesn't have to be for any particular reason.
On 2012/09/11 02:45:14, rsc wrote:
> LGTM
> http://codereview.appspot.com/6351090/diff/41001/src/pkg/runtime/iface.c#newc...
> src/pkg/runtime/iface.c:186: #pragma textflag 7
> This is fine to mark as textflag 7. I just want to point out that it doesn't
> have to be for any particular reason.

I'm submitting as is, but if there are any set-in-stone rules about when to use textflag 7, I'd love to hear them.
*** Submitted as http://code.google.com/p/go/source/detail?r=6d4707371015 ***

cmd/gc: Inline pointer sized T2I interface conversions

This CL also adds support for marking the likelyness of IF nodes in the AST being true. This feature is being used here to mark the slow path as unlikely.

src/pkg/runtime:
benchmark                 old ns/op    new ns/op    delta
BenchmarkConvT2IUintptr          16            1  -91.63%

test/bench/go1:
benchmark                 old ns/op    new ns/op    delta
BenchmarkBinaryTree17    5416917000   5461355000   +0.82%
BenchmarkFannkuch11      3810355000   3842609000   +0.85%
BenchmarkGobDecode         19950950     19855420   -0.48%
BenchmarkGobEncode         11301220     11308530   +0.06%
BenchmarkGzip             548119600    546869200   -0.23%
BenchmarkGunzip           176145400    180208300   +2.31%
BenchmarkJSONEncode        93117400     70163100  -24.65%
BenchmarkJSONDecode       406626800    409999200   +0.83%
BenchmarkMandelbrot200      6300992      6317866   +0.27%
BenchmarkParse              7664396      7451625   -2.78%
BenchmarkRevcomp         1189424000   1412332000  +18.74%
BenchmarkTemplate         491308400    458654200   -6.65%

benchmark                 old MB/s     new MB/s  speedup
BenchmarkGobDecode           38.47        38.66    1.00x
BenchmarkGobEncode           67.92        67.87    1.00x
BenchmarkGzip                35.40        35.48    1.00x
BenchmarkGunzip             110.16       107.68    0.98x
BenchmarkJSONEncode          20.84        27.66    1.33x
BenchmarkJSONDecode           4.77         4.73    0.99x
BenchmarkParse                7.56         7.77    1.03x
BenchmarkRevcomp            213.69       179.96    0.84x
BenchmarkTemplate             3.95         4.23    1.07x

R=rsc, dave, nigeltao
CC=golang-dev
http://codereview.appspot.com/6351090
Use $7 when the function is a leaf: that is, it never calls another function. Setting textflag to 7 is necessary when it's a function involved in the stack-splitting code: splitting stacks inside the stack-splitting code would be a disaster. Other than that, it's optional and affects only performance, not correctness. -rob