Description

cmd/gc: Inline pointer sized T2I interface conversions
This CL also adds support for marking how likely the condition of an IF node in the AST is to be true. That feature is used here to mark the slow path as unlikely.
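For readers skimming the thread, here is a minimal sketch of the kind of conversion this CL targets: assigning a pointer-sized concrete type to a non-empty interface (T2I). The type and method names are hypothetical, chosen only for illustration.

```go
package main

import "fmt"

// I is a non-empty interface and T is a pointer-sized concrete type,
// so the assignment in main is exactly the T2I conversion being inlined.
type I interface{ Value() uintptr }

type T uintptr

func (t T) Value() uintptr { return uintptr(t) }

func main() {
	var t T = 42
	var i I = t // pointer-sized T2I conversion: the fast path hits a cached itab
	fmt.Println(i.Value())
}
```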
src/pkg/runtime:
benchmark old ns/op new ns/op delta
BenchmarkConvT2IUintptr 16 1 -91.63%
test/bench/go1:
benchmark old ns/op new ns/op delta
BenchmarkBinaryTree17 5416917000 5461355000 +0.82%
BenchmarkFannkuch11 3810355000 3842609000 +0.85%
BenchmarkGobDecode 19950950 19855420 -0.48%
BenchmarkGobEncode 11301220 11308530 +0.06%
BenchmarkGzip 548119600 546869200 -0.23%
BenchmarkGunzip 176145400 180208300 +2.31%
BenchmarkJSONEncode 93117400 70163100 -24.65%
BenchmarkJSONDecode 406626800 409999200 +0.83%
BenchmarkMandelbrot200 6300992 6317866 +0.27%
BenchmarkParse 7664396 7451625 -2.78%
BenchmarkRevcomp 1189424000 1412332000 +18.74%
BenchmarkTemplate 491308400 458654200 -6.65%
benchmark old MB/s new MB/s speedup
BenchmarkGobDecode 38.47 38.66 1.00x
BenchmarkGobEncode 67.92 67.87 1.00x
BenchmarkGzip 35.40 35.48 1.00x
BenchmarkGunzip 110.16 107.68 0.98x
BenchmarkJSONEncode 20.84 27.66 1.33x
BenchmarkJSONDecode 4.77 4.73 0.99x
BenchmarkParse 7.56 7.77 1.03x
BenchmarkRevcomp 213.69 179.96 0.84x
BenchmarkTemplate 3.95 4.23 1.07x
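For context on the BenchmarkConvT2IUintptr numbers above, a microbenchmark of this shape times the conversion in a tight loop. This is an illustrative sketch, not the actual src/pkg/runtime benchmark; the sink variable and names are assumptions.

```go
package main

import (
	"fmt"
	"testing"
)

type I interface{ M() }

type T uintptr

func (T) M() {}

// Package-level sink keeps the compiler from optimizing the conversion away.
var sink I

func main() {
	r := testing.Benchmark(func(b *testing.B) {
		var t T = 1
		for i := 0; i < b.N; i++ {
			sink = t // the pointer-sized T2I conversion being timed
		}
	})
	fmt.Println(r.N > 0) // true once the benchmark has run
}
```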
Patch Set 1 #Patch Set 2 : diff -r 2874979af37a https://code.google.com/p/go/ #
Total comments: 9
Patch Set 3 : diff -r 64195dbf632d https://code.google.com/p/go/ #Patch Set 4 : diff -r 64195dbf632d https://code.google.com/p/go/ #
Total comments: 4
Patch Set 5 : diff -r 5d3aaf0b717e https://code.google.com/p/go/ #Patch Set 6 : diff -r 5d3aaf0b717e https://code.google.com/p/go/ #
Total comments: 1
Patch Set 7 : diff -r ca5e20f93081 https://code.google.com/p/go/ #
Total comments: 1
Patch Set 8 : diff -r 46e193d130ce https://code.google.com/p/go/ #
Messages

Total messages: 36
linux/arm pandaboard

pando(~/go/test/bench/go1) % ~/go/misc/benchcmp {old,new}.txt
benchmark old ns/op new ns/op delta
BenchmarkBinaryTree17 37124725000 37525849000 +1.08%
BenchmarkFannkuch11 36154999000 34552552000 -4.43%
BenchmarkGobDecode 124050900 123104800 -0.76%
BenchmarkGobEncode 62636720 61475220 -1.85%
BenchmarkGzip 5584015000 5551667000 -0.58%
BenchmarkGunzip 1197876000 1183319000 -1.22%
BenchmarkJSONEncode 758874600 755590800 -0.43%
BenchmarkJSONDecode 2393341000 2969239000 +24.06%
BenchmarkMandelbrot200 45813000 45698840 -0.25%
BenchmarkParse 59043580 57451780 -2.70%
BenchmarkRevcomp 144458100 138739000 -3.96%
BenchmarkTemplate 5426147000 5195496000 -4.25%

benchmark old MB/s new MB/s speedup
BenchmarkGobDecode 6.19 6.23 1.01x
BenchmarkGobEncode 12.25 12.49 1.02x
BenchmarkGzip 3.48 3.50 1.01x
BenchmarkGunzip 16.20 16.40 1.01x
BenchmarkJSONEncode 2.56 2.57 1.00x
BenchmarkJSONDecode 0.81 0.65 0.80x
BenchmarkParse 0.98 1.01 1.03x
BenchmarkRevcomp 17.60 18.32 1.04x
BenchmarkTemplate 0.36 0.37 1.03x
On 2012/07/14 00:40:30, dfc wrote:
> linux/arm pandaboard
>
> pando(~/go/test/bench/go1) % ~/go/misc/benchcmp {old,new}.txt
> benchmark old ns/op new ns/op delta
> BenchmarkBinaryTree17 37124725000 37525849000 +1.08%
> BenchmarkFannkuch11 36154999000 34552552000 -4.43%
> BenchmarkGobDecode 124050900 123104800 -0.76%
> BenchmarkGobEncode 62636720 61475220 -1.85%
> BenchmarkGzip 5584015000 5551667000 -0.58%
> BenchmarkGunzip 1197876000 1183319000 -1.22%
> BenchmarkJSONEncode 758874600 755590800 -0.43%
> BenchmarkJSONDecode 2393341000 2969239000 +24.06%
> BenchmarkMandelbrot200 45813000 45698840 -0.25%
> BenchmarkParse 59043580 57451780 -2.70%
> BenchmarkRevcomp 144458100 138739000 -3.96%
> BenchmarkTemplate 5426147000 5195496000 -4.25%
>
> benchmark old MB/s new MB/s speedup
> BenchmarkGobDecode 6.19 6.23 1.01x
> BenchmarkGobEncode 12.25 12.49 1.02x
> BenchmarkGzip 3.48 3.50 1.01x
> BenchmarkGunzip 16.20 16.40 1.01x
> BenchmarkJSONEncode 2.56 2.57 1.00x
> BenchmarkJSONDecode 0.81 0.65 0.80x
> BenchmarkParse 0.98 1.01 1.03x
> BenchmarkRevcomp 17.60 18.32 1.04x
> BenchmarkTemplate 0.36 0.37 1.03x

Huh, I didn't know codereview would send out comments on unpublished CLs. For those not in the know, I sent this CL to Dave on IRC to get some benchmark figures on arm. I'll have the proper review mail sent out later today.
Ok, I didn't know codereview would send out the CL along with the reply I just wrote. The only thing missing from the mail is a mention that the benchmarks were done on a linux/amd64 machine.
On 14 July 2012 16:25, <daniel.morsing@gmail.com> wrote:
> Ok, didn't know codereview would send out the CL along with the reply I
> just wrote.

Is this change ready for review or not?
On Mon, Jul 16, 2012 at 4:03 AM, Nigel Tao <nigeltao@golang.org> wrote:
> On 14 July 2012 16:25, <daniel.morsing@gmail.com> wrote:
>> Ok, didn't know codereview would send out the CL along with the reply I
>> just wrote.
>
> Is this change ready for review or not?

Yes it is. Sorry for not making that clear.
ping?
I'm not sure if introducing atomicloadtype and typ2Itab is the right way to cut it. I'll defer to what rsc thinks.

I'm not sure what I learned from the benchmarks. amd64 gets dramatically better on JSONEncode. arm gets dramatically worse on JSONDecode. Do we have any idea why? Is it stack flapping again?

It may be worth (as a hack) replacing the atomicloadtype call with a simple assignment for now, just to get some numbers on how fast it gets if we can avoid function calls entirely. Doing this properly might require introducing an AATOMICLOAD virtual op, but I'm still too new to 6g/6l to know if that's the right thing to do, and not familiar enough with arm to know if that's even feasible. Again, I'll defer to rsc.

http://codereview.appspot.com/6351090/diff/2001/src/cmd/6g/cgen.c
File src/cmd/6g/cgen.c (right):

http://codereview.appspot.com/6351090/diff/2001/src/cmd/6g/cgen.c#newcode47
src/cmd/6g/cgen.c:47: case OEFACE:
If the op is no longer just for empty-interface values, then it's probably worth doing s/OEFACE/OIFACE/. Let's see what rsc says.

http://codereview.appspot.com/6351090/diff/2001/src/cmd/gc/go.h
File src/cmd/gc/go.h (right):

http://codereview.appspot.com/6351090/diff/2001/src/cmd/gc/go.h#newcode260
src/cmd/gc/go.h:260: uchar likely; // likeliness of if statement
Does unsignedness work if -1 is supposed to mean unlikely?

How much difference does the likeliness make on the benchmarks?

http://codereview.appspot.com/6351090/diff/2001/src/cmd/gc/walk.c
File src/cmd/gc/walk.c (right):

http://codereview.appspot.com/6351090/diff/2001/src/cmd/gc/walk.c#newcode777
src/cmd/gc/walk.c:777: r = l;
Some commentary on the C equivalent of what this generated AST does would be nice.

http://codereview.appspot.com/6351090/diff/2001/src/cmd/gc/walk.c#newcode1199
src/cmd/gc/walk.c:1199: walkexpr(&r, init);
Unrelated? Or an existing bug? What broke without it?

http://codereview.appspot.com/6351090/diff/2001/src/pkg/runtime/iface.c
File src/pkg/runtime/iface.c (right):

http://codereview.appspot.com/6351090/diff/2001/src/pkg/runtime/iface.c#newco...
src/pkg/runtime/iface.c:186: void
I think that this and runtime·typ2Itab should have a
#pragma textflag 7
before them to pick up NOSPLIT.
Also, can you paste the before/after of a disassembly of what the line "ifaceValue = concreteValue" looks like? I think that will help me understand the walk.c change. For example, the before is:

$ cat main.go
package main

func main() {
	i = t
}

type I interface {
	Method()
}

type T uintptr

func (T) Method() {}

var (
	i I
	t T
)

$ go tool 6g -S -o /dev/null main.go | sed -e '/Method/q'
--- prog list "main" ---
0000 (main.go:3) TEXT main+0(SB),$48-0
0001 (main.go:4) MOVQ $type."".T+0(SB),(SP)
0002 (main.go:4) MOVQ $type."".I+0(SB),8(SP)
0003 (main.go:4) MOVQ $go.itab."".T."".I+0(SB),16(SP)
0004 (main.go:4) MOVQ t+0(SB),BX
0005 (main.go:4) MOVQ BX,24(SP)
0006 (main.go:4) CALL ,runtime.convT2I+0(SB)
0007 (main.go:4) MOVQ 32(SP),BX
0008 (main.go:4) MOVQ BX,i+0(SB)
0009 (main.go:4) MOVQ 40(SP),BX
0010 (main.go:4) MOVQ BX,i+8(SB)
0011 (main.go:5) RET ,
--- prog list "T.Method" ---
go tool 6g: signal 13
On 2012/07/23 07:52:56, nigeltao wrote:
> I'm not sure if introducing atomicloadtype and typ2Itab is the right way to cut
> it. I'll defer to what rsc thinks.
>
> I'm not sure what I learned from the benchmarks. amd64 gets dramatically better
> on JSONEncode. arm gets dramatically worse on JSONDecode. Do we have any idea
> why? Is it stack flapping again?
>
> It may be worth (as a hack) replacing the atomicloadtype call with a simple
> assignment for now, just to get some numbers on how fast it gets if we can avoid
> function calls entirely. Doing this properly might require introducing an
> AATOMICLOAD virtual op, but I'm still too new to 6g/6l to know if that's the
> right thing to do, and not familiar enough with arm to know if that's even
> feasible. Again, I'll defer to rsc.

I've done testing where I relied on amd64 loading the pointer atomically. The conversion time was reduced to around 1 ns per op. On arm, the atomic loading isn't that straightforward. It involves loading an instruction from memory and executing it, and it's complicated enough to make it a pain to emit inline.

As for typ2Itab, the only way I can see to get rid of it would be making the codegen emit it. I haven't looked too hard at this possibility since I'm not well versed in how the codegen works.
http://codereview.appspot.com/6351090/diff/2001/src/cmd/gc/go.h
File src/cmd/gc/go.h (right):

http://codereview.appspot.com/6351090/diff/2001/src/cmd/gc/go.h#newcode260
src/cmd/gc/go.h:260: uchar likely; // likeliness of if statement
On 2012/07/23 07:52:56, nigeltao wrote:
> Does unsignedness work if -1 is supposed to mean unlikely?
>
> How much difference does the likeliness make on the benchmarks?

Whoops. I think this may have worked solely by luck.

The performance impact of not having likely is small. IIRC, it's about 1 ns. In this version of the patch, that gain isn't significant. However, I made a version where the pointer loading was done inline (relying on amd64 implicitly loading the pointer atomically). There the gain was about half the time spent per conversion. This optimization is a relic from that version, but I kept it in since it's still beneficial to benchmark.

http://codereview.appspot.com/6351090/diff/2001/src/cmd/gc/walk.c
File src/cmd/gc/walk.c (right):

http://codereview.appspot.com/6351090/diff/2001/src/cmd/gc/walk.c#newcode777
src/cmd/gc/walk.c:777: r = l;
On 2012/07/23 07:52:56, nigeltao wrote:
> Some commentary on the C equivalent of what this generated AST does would be
> nice.

I'll include some variant of the pseudocode from my initial design posting.

http://codereview.appspot.com/6351090/diff/2001/src/cmd/gc/walk.c#newcode1199
src/cmd/gc/walk.c:1199: walkexpr(&r, init);
On 2012/07/23 07:52:56, nigeltao wrote:
> Unrelated? Or an existing bug? What broke without it?

The gc compiler will segfault without this. Apparently, whoever wrote this code assumed that the r expression couldn't generate any init statements, which was true before this change.

http://codereview.appspot.com/6351090/diff/2001/src/pkg/runtime/iface.c
File src/pkg/runtime/iface.c (right):

http://codereview.appspot.com/6351090/diff/2001/src/pkg/runtime/iface.c#newco...
src/pkg/runtime/iface.c:186: void
On 2012/07/23 07:52:56, nigeltao wrote:
> I think that this and runtime·typ2Itab should have a
> #pragma textflag 7
> before them to pick up NOSPLIT.

I was a bit unsure as to what exactly "#pragma textflag 7" did. I could only see it on functions with ... arguments, so I thought it was related. I'll add this to the next version.
On 2012/07/23 08:02:28, nigeltao wrote:
> Also, can you paste what the before/after of a disassembly of what the line
> "ifaceValue = concreteValue" looks like? I think that will help me understand
> the walk.c change.

This is the assembly after the patch:

--- prog list "main" ---
0000 (main.go:3) TEXT main+0(SB),$32-0
0001 (main.go:4) MOVQ $go.itab."".T."".I+0(SB),(SP)
0002 (main.go:4) CALL ,runtime.atomicloadtype+0(SB)
0003 (main.go:4) MOVQ 8(SP),AX
0004 (main.go:4) MOVQ $0,BP
0005 (main.go:4) CMPQ AX,BP
0006 (main.go:4) JNE $1,12
0007 (main.go:4) MOVQ $type."".T+0(SB),(SP)
0008 (main.go:4) MOVQ $type."".I+0(SB),8(SP)
0009 (main.go:4) MOVQ $go.itab."".T."".I+0(SB),16(SP)
0010 (main.go:4) CALL ,runtime.typ2Itab+0(SB)
0011 (main.go:4) MOVQ 24(SP),AX
0012 (main.go:4) MOVQ AX,i+0(SB)
0013 (main.go:4) MOVQ t+0(SB),BX
0014 (main.go:4) MOVQ BX,i+8(SB)
0015 (main.go:5) RET ,

Note that the linker will move the typ2Itab branch to the end of the function.
Hello rsc@golang.org, dave@cheney.net, nigeltao@golang.org (cc: golang-dev@googlegroups.com), I'd like you to review this change to https://code.google.com/p/go/
On 23 July 2012 21:10, <daniel.morsing@gmail.com> wrote:
> As for typ2Itab, The only way I can see how to get rid of it, would be
> making the codegen emit it. I havn't looked too hard at this possibility
> since I'm not well versed in how the codegen works.

Well, an alternative is to call convT2I if the itab cache misses, instead of calling a new function typ2Itab. It might lead to simpler code. Sure, it adds a tiny amount of redundant computation, but cache misses should be rare.

I'm not saying we should do that; it's just an idea.
On 23 July 2012 21:10, <daniel.morsing@gmail.com> wrote:
> However I made a version, where the pointer loading was done inline
> (relying on amd64 implicitly loading the pointer atomically). There the
> gain was about half the time spent per conversion.

I'm not sure how to interpret that paragraph. What are the benchmark numbers if you use a plain assignment instead of a runtime·atomicloadtype call? It won't be correct (on arm), but it would give me a better idea of the potential impact of this change. As it is, you're trying to avoid a convT2I function call, but always call runtime·atomicloadtype instead, so it's not an obvious win. Ideally, there'd be no function calls on the common path.
On Tue, Jul 24, 2012 at 10:51 AM, Nigel Tao <nigeltao@golang.org> wrote:
> On 23 July 2012 21:10, <daniel.morsing@gmail.com> wrote:
>> However I made a version, where the pointer loading was done inline
>> (relying on amd64 implicitly loading the pointer atomically). There the
>> gain was about half the time spent per conversion.
>
> I'm not sure how to interpret that paragraph. What are the benchmark
> numbers if you use a plain assignment instead of a
> runtime·atomicloadtype call? It won't be correct (on arm), but it
> would give me a better idea of the potential impact of this change. As
> it is, you're trying to avoid a convT2I function call, but always call
> runtime·atomicloadtype instead, so it's not an obvious win. Ideally,
> there'd be no function calls on the common path.

Sorry for being inadvertently obtuse. When doing plain assignment, the benchmark takes around 1 ns per op in BenchmarkConvT2IUintptr.

I've tried making a version of this change that does the following:

	Itab tab = typ2Itab(type, Itype, &cache);
	EFACE{tab, ptr}

where the typ2Itab function does the check on cache and returns it if set. That version had benchmarks around 8 ns per op. Apparently, the path for interface conversion is getting so fast that pushing 2 words onto the stack makes a significant difference.

As for using convT2I instead of typ2Itab, there's a problem with that. We would have to evaluate the data being converted in the init list, but where the init list is inserted, that data might be uninitialized.

Consider:

	//begin init list
	Itab tab = cache //atomically
	if tab == nil {
		tmpiface = convT2I(type, itype, &cache, data)
	} else {
		tmpiface = EFACE{tab, data}
	}
	//end init list
	data = realValue
	//the expression we were evaluating. tmpiface will be wrong
	foo(tmpiface)

This situation arises when evaluating AS2FUNC nodes.
Sorry that this one is taking so long...

On 24 July 2012 20:30, Daniel Morsing <daniel.morsing@gmail.com> wrote:
> Sorry for being inadvertently obtuse. When doing plain assignment, the
> benchmark takes around 1 ns per op in BenchmarkConvT2IUintptr.

What about the test/bench/go1 numbers?

> As for using convT2I instead of typ2Itab, there's a problem with that.
> We would have to evaluate the data being converted in the init list,
> but where the init list is inserted, that data might be uninitialized.
>
> Consider:
>
> //begin init list
> Itab tab = cache //atomically
> if tab == nil {
>     tmpiface = convT2I(type, itype, &cache, data)
> } else {
>     tmpiface = EFACE{tab, data}
> }
> //end init list
> data = realValue
> //the expression we were evaluating. tmpiface will be wrong
> foo(tmpiface)
>
> This situation arises when evaluating AS2FUNC nodes.

Sorry, I'm probably being dumb, but I don't understand how tmpiface will be wrong with OAS2FUNC. Can you give a small example.go program that fails in this way?

IIUC, the proposal is to do

	Itab tab = cache //atomically
	if tab == nil {
		tmpiface = convT2I(type, itype, &cache, data)
	} else {
		tmpiface = EFACE{tab, data}
	}

where we currently do

	tmpiface = convT2I(type, itype, &cache, data)

Why are we also generating init code? I didn't notice that in my previous readings. It's true that you could conceivably initialize the itab caches eagerly instead of lazily, but I would want that to be a separate change to this one. I wouldn't expect it to be trivial if you want to also preserve dead code elimination for unused types.
On Wed, Jul 25, 2012 at 9:47 AM, Nigel Tao <nigeltao@golang.org> wrote:
> Sorry that this one is taking so long...
>
> On 24 July 2012 20:30, Daniel Morsing <daniel.morsing@gmail.com> wrote:
>> Sorry for being inadvertently obtuse. When doing plain assignment, the
>> benchmark takes around 1 ns per op in BenchmarkConvT2IUintptr.
>
> What about the test/bench/go1 numbers?

I don't have those numbers written down anywhere. I'll go get them when I get to a PC where I have my stuff.

>> As for using convT2I instead of typ2Itab, there's a problem with that.
>> We would have to evaluate the data being converted in the init list,
>> but where the init list is inserted, that data might be uninitialized.
>>
>> Consider:
>>
>> //begin init list
>> Itab tab = cache //atomically
>> if tab == nil {
>>     tmpiface = convT2I(type, itype, &cache, data)
>> } else {
>>     tmpiface = EFACE{tab, data}
>> }
>> //end init list
>> data = realValue
>> //the expression we were evaluating. tmpiface will be wrong
>> foo(tmpiface)
>>
>> This situation arises when evaluating AS2FUNC nodes.
>
> Sorry, I'm probably being dumb, but I don't understand how tmpiface
> will be wrong with OAS2FUNC. Can you give a small example.go program
> that fails in this way?
>
> IIUC, the proposal is to do
>
> Itab tab = cache //atomically
> if tab == nil {
>     tmpiface = convT2I(type, itype, &cache, data)
> } else {
>     tmpiface = EFACE{tab, data}
> }
>
> where we currently do
>
> tmpiface = convT2I(type, itype, &cache, data)
>
> Why are we also generating init code? I didn't notice that in my
> previous readings. It's true that you could conceivably initialize the
> itab caches eagerly instead of lazily, but I would want that to be a
> separate change to this one. I wouldn't expect it to be trivial if you
> want to also preserve dead code elimination for unused types.

I think you have the init list misunderstood. It's a list for taking stuff that must be run in statement context and inserting it immediately before the expression is evaluated.

The CONVIFACE node occurs as part of an expression. The convT2I call is also an expression, so there is no problem with inserting it at the place where the interface conversion was originally, but the if statement is a statement. We can't just insert it into the AST at the CONVIFACE node. We have to put it onto the init list, so it's executed just before the expression is evaluated.

If we use convT2I, we have to evaluate the data being converted in the init list. The value of the data may change between the time the init list is inserted and the time the conversion expression is evaluated.
Here is the atomic load version benchmark, compared to the atomicloadtype() version:

src/pkg/runtime:
benchmark old ns/op new ns/op delta
BenchmarkConvT2ESmall 11 11 +1.79%
BenchmarkConvT2EUintptr 0 0 +1.43%
BenchmarkConvT2ELarge 75 76 +0.93%
BenchmarkConvT2ISmall 17 13 -19.41%
BenchmarkConvT2IUintptr 5 1 -74.32%
BenchmarkConvT2ILarge 77 78 +0.90%
BenchmarkConvI2E 4 5 +14.79%
BenchmarkConvI2I 17 17 +0.00%

test/bench/go1:
benchmark old ns/op new ns/op delta
BenchmarkBinaryTree17 5528479000 5526532000 -0.04%
BenchmarkFannkuch11 3920745000 3755168000 -4.22%
BenchmarkGobDecode 20638900 20096130 -2.63%
BenchmarkGobEncode 11270940 11592350 +2.85%
BenchmarkGzip 559168200 553892000 -0.94%
BenchmarkGunzip 182290200 180607000 -0.92%
BenchmarkJSONEncode 70364760 72977700 +3.71%
BenchmarkJSONDecode 418688000 407103000 -2.77%
BenchmarkMandelbrot200 6324712 6407060 +1.30%
BenchmarkParse 8040630 7889790 -1.88%
BenchmarkRevcomp 1693365000 1792351000 +5.85%
BenchmarkTemplate 465117000 456074200 -1.94%

benchmark old MB/s new MB/s speedup
BenchmarkGobDecode 37.19 38.19 1.03x
BenchmarkGobEncode 68.10 66.21 0.97x
BenchmarkGzip 34.70 35.03 1.01x
BenchmarkGunzip 106.45 107.44 1.01x
BenchmarkJSONEncode 27.58 26.59 0.96x
BenchmarkJSONDecode 4.63 4.77 1.03x
BenchmarkParse 7.20 7.34 1.02x
BenchmarkRevcomp 150.10 141.81 0.94x
BenchmarkTemplate 4.17 4.25 1.02x
2012/7/26 Daniel Morsing <daniel.morsing@gmail.com>:
> test/bench/go1:
> benchmark old ns/op new ns/op delta
> BenchmarkBinaryTree17 5528479000 5526532000 -0.04%
> BenchmarkFannkuch11 3920745000 3755168000 -4.22%
> BenchmarkGobDecode 20638900 20096130 -2.63%
> BenchmarkGobEncode 11270940 11592350 +2.85%
> BenchmarkGzip 559168200 553892000 -0.94%
> BenchmarkGunzip 182290200 180607000 -0.92%
> BenchmarkJSONEncode 70364760 72977700 +3.71%
> BenchmarkJSONDecode 418688000 407103000 -2.77%
> BenchmarkMandelbrot200 6324712 6407060 +1.30%
> BenchmarkParse 8040630 7889790 -1.88%
> BenchmarkRevcomp 1693365000 1792351000 +5.85%
> BenchmarkTemplate 465117000 456074200 -1.94%

So, some wins, some losses, but no clear trend. Even if you compare this to tip instead of to atomicloadtype, I don't think it's conclusive. Hmm...

As an extra data point, here are the bench numbers comparing tip (old) to the current atomicloadtype-using patch (new) on my linux, amd64 desktop:

benchmark old ns/op new ns/op delta
BenchmarkConvT2ESmall 10 10 +0.00%
BenchmarkConvT2EUintptr 0 0 -1.45%
BenchmarkConvT2ELarge 74 70 -4.96%
BenchmarkConvT2ISmall 12 12 +0.78%
BenchmarkConvT2IUintptr 12 4 -62.23%
BenchmarkConvT2ILarge 76 75 -0.91%
BenchmarkConvI2E 4 5 +14.61%
BenchmarkConvI2I 20 19 -2.99%

benchmark old ns/op new ns/op delta
BenchmarkBinaryTree17 5920179000 5953817000 +0.57%
BenchmarkFannkuch11 4019578000 3860152000 -3.97%
BenchmarkGobDecode 21358690 21715730 +1.67%
BenchmarkGobEncode 12917560 12678230 -1.85%
BenchmarkGzip 574320000 569419400 -0.85%
BenchmarkGunzip 179199300 178863100 -0.19%
BenchmarkJSONEncode 95139100 71945500 -24.38%
BenchmarkJSONDecode 422837000 406048200 -3.97%
BenchmarkMandelbrot200 7043688 7047108 +0.05%
BenchmarkParse 7893900 8311090 +5.28%
BenchmarkRevcomp 1257351000 1280234000 +1.82%
BenchmarkTemplate 509244200 475117200 -6.70%

I'll poke at the BenchmarkJSONEncode case some more tomorrow.
On 25 July 2012 18:34, Daniel Morsing <daniel.morsing@gmail.com> wrote:
> I think you have the init list misunderstood. It's a list for taking
> stuff that must be run in statement context and inserting it
> immediately before the expression is evaluated.

Ah, I see. I'm still relatively new to the gc code.

I don't feel qualified to give an LGTM on this one yet. I'll wait for rsc's opinion.
Thanks for looking into this. If I understand correctly, this replaces 1 function call with 1 or 2 function calls depending on the path taken. The only win is avoiding the generality of arbitrary-sized convT2I.

One possibility is to write a convT2Ip like convT2I and use that, which is then 1 function call always. That's a simpler function than convT2I so it would be faster.

The other possibility is to keep the current conditional but make the atomic load be generated with real instructions. On the x86 an ordinary load is fine. On the ARM I think if you do a write flush before the store of the cache pointer then you can do an ordinary load on the lookup path too. Let's move forward under the assumption that an ordinary load is fine on all architectures.

http://codereview.appspot.com/6351090/diff/19001/src/cmd/5g/cgen.c
File src/cmd/5g/cgen.c (right):

http://codereview.appspot.com/6351090/diff/19001/src/cmd/5g/cgen.c#newcode46
src/cmd/5g/cgen.c:46: !n->left->addable || !n->right->addable) {
What condition is this trying to handle?

http://codereview.appspot.com/6351090/diff/19001/src/cmd/6g/cgen.c
File src/cmd/6g/cgen.c (right):

http://codereview.appspot.com/6351090/diff/19001/src/cmd/6g/cgen.c#newcode49
src/cmd/6g/cgen.c:49: !n->left->addable || !n->right->addable) {
What condition is this trying to handle?

http://codereview.appspot.com/6351090/diff/19001/src/cmd/8g/cgen.c
File src/cmd/8g/cgen.c (right):

http://codereview.appspot.com/6351090/diff/19001/src/cmd/8g/cgen.c#newcode79
src/cmd/8g/cgen.c:79: !n->left->addable || !n->right->addable) {
What condition is this trying to handle?

http://codereview.appspot.com/6351090/diff/19001/src/cmd/gc/go.h
File src/cmd/gc/go.h (right):

http://codereview.appspot.com/6351090/diff/19001/src/cmd/gc/go.h#newcode260
src/cmd/gc/go.h:260: int likely; // likeliness of if statement
schar please
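The cache discipline described here (fully initialize the itab, publish it with an ordered store, then read it with a cheap load on the lookup path) can be sketched in user-level Go with sync/atomic. This is only an illustration of the idea under assumed names; it is not the compiler's generated code, and the runtime's real itab structure is different.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"unsafe"
)

// itab stands in for the runtime's itab structure.
type itab struct{ methods [4]uintptr }

// cache is the per-conversion-site itab cache, published atomically.
var cache unsafe.Pointer // *itab

// getItab models the lookup: a cheap load on the fast path; nil means
// take the slow path, build the itab, and publish it for later lookups.
func getItab() *itab {
	p := (*itab)(atomic.LoadPointer(&cache))
	if p == nil {
		p = &itab{} // fully initialize before publishing
		atomic.StorePointer(&cache, unsafe.Pointer(p))
	}
	return p
}

func main() {
	first := getItab()
	second := getItab() // hits the cache, same pointer
	fmt.Println(first == second)
}
```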
On Mon, Aug 6, 2012 at 12:26 AM, <rsc@golang.org> wrote:
> Thanks for looking into this. If I understand correctly, this replaces 1
> function call with 1 or 2 function calls depending on the path taken.
> The only win is avoiding the generality of arbitrary-sized convT2I.
>
> One possibility is to write a convT2Ip like convT2I and use that, which
> is then 1 function call always. That's a simpler function than convT2I
> so it would be faster.

I'm not at a computer where I can test this, but I think this would be slower than the current solution, since there would be a call for convT2Ip and a call to atomicloadp within that function.

> The other possibility is to keep the current conditional but make the
> atomic load be generated with real instructions. On the x86 an ordinary
> load is fine. On the ARM I think if you do a write flush before the
> store of the cache pointer then you can do an ordinary load on the
> lookup path too. Let's move forward under the assumption that an
> ordinary load is fine on all architectures.

The amount of code that's there for arm atomics frightens me into thinking that an ordinary load has subtle ways of breaking atomicity.

I've had some time to think about this CL, and I think there is some value in re-spinning it to have the logic happen in the codegen. It has a couple of advantages:

- Having the atomic logic be emitted inline would be easier.
- It would get rid of the 2 new functions that I added, since I could use convT2I, and emit the atomic load inline.
- No need to add the likely field to node.

However, I'm not very familiar with the codegen part of gc, and I'm not sure if it's a good idea to handle the interface conversion in codegen. The current CL will work fine, but I think it can be done cleaner.

How does this re-spin sound to you?

> http://codereview.appspot.com/6351090/diff/19001/src/cmd/5g/cgen.c
> File src/cmd/5g/cgen.c (right):
>
> http://codereview.appspot.com/6351090/diff/19001/src/cmd/5g/cgen.c#newcode46
> src/cmd/5g/cgen.c:46: !n->left->addable || !n->right->addable) {
> What condition is this trying to handle?

I ran into a problem with compiling the go/ast package. That package has a compilation AST similar to this:

	AS
	\-- NAME: iface
	\-- EFACE
	    \-- NAME: go.type.itab
	    \-- CALLFUNC
	        \-- LIST
	            \-- NAME: iface

In that case, the interface type word would be assigned first and the half-assigned interface would then be used as an argument. Spilling to a tempname solves this problem.

> http://codereview.appspot.com/6351090/diff/19001/src/cmd/6g/cgen.c
> File src/cmd/6g/cgen.c (right):
>
> http://codereview.appspot.com/6351090/diff/19001/src/cmd/6g/cgen.c#newcode49
> src/cmd/6g/cgen.c:49: !n->left->addable || !n->right->addable) {
> What condition is this trying to handle?
>
> http://codereview.appspot.com/6351090/diff/19001/src/cmd/8g/cgen.c
> File src/cmd/8g/cgen.c (right):
>
> http://codereview.appspot.com/6351090/diff/19001/src/cmd/8g/cgen.c#newcode79
> src/cmd/8g/cgen.c:79: !n->left->addable || !n->right->addable) {
> What condition is this trying to handle?
>
> http://codereview.appspot.com/6351090/diff/19001/src/cmd/gc/go.h
> File src/cmd/gc/go.h (right):
>
> http://codereview.appspot.com/6351090/diff/19001/src/cmd/gc/go.h#newcode260
> src/cmd/gc/go.h:260: int likely; // likeliness of if statement
> schar please
>
> http://codereview.appspot.com/6351090/
>> One possibility is to write a convT2Ip like convT2I and use that, which >> is then 1 function call always. That's a simpler function than convT2I >> so it would be faster. > > I'm not at a computer where I can test this, but I think this would be > slower than the current solution, since there would be a call for > convT2Ip and a call to atomicloadp within that function. Given the choice between '1 call always' or '1 call always and sometimes 2', I will always choose '1 call always'. It is strictly less expensive in generated code, and you're unlikely to be able to measure a significant difference caused by pushing the optional function call down a bit. It's a very important point, because otherwise we'd just inline everything and never make function calls. This is a real problem in C++ compilers. On the other hand, if you can make the fast path 0 function calls instead of 1 function call, then the increased speed may justify the larger generated code. Not making a function call cuts not just the overhead of the function call but also the overhead of having to spill any registers to the stack and then reload them after the call. It also makes the computation being done instead of the call available to the optimizer to understand and possibly rewrite. That's definitely the goal, and the second approach should get us there. > I've had some time to think about this CL, and I think there is some > value in re-spinning it to have the logic happen in the codegen. It > has a couple of advantages: > > - Having the atomic logic be emitted inline would be easier. > - It would get rid of the 2 new functions that I added, since I could > use convT2I, and emit the atomic load inline > - No need to add the likely field to node. > > However, I'm not very familiar with the codegen part of gc, and I'm > not sure if it's a good idea to handle the interface conversion in > codegen. The current CL will work fine, but I think it can be done > cleaner. 
How does this re-spin sound to you? Please do as much in the front end as possible. Otherwise we end up with three copies of everything, which is a significant maintenance burden. I've been working to move as much of the logic duplication as possible back into the front ends. The back end should have whatever primitive is needed (like cgen_eface for the T2E optimization) but little else. It's fine to have a cgen_iface and OIFACE that has basically the same implementation as cgen_eface but takes a cached itab value. Also, please just make the ARM do a plain load for now instead of adding complexity we're not sure is necessary. We can worry about the ARM later, once the rest of the code is good. I still believe we can arrange to use a plain load, and even if not we can fix it once we're happy with the x86 code. >> http://codereview.appspot.com/6351090/diff/19001/src/cmd/5g/cgen.c#newcode46 >> src/cmd/5g/cgen.c:46: !n->left->addable || !n->right->addable) { >> What condition is this trying to handle? > > I ran into a problem with compiling the go/ast package. That package > has a compilation AST similar to this: > > AS > \-- NAME: iface > \-- EFACE > \-- NAME: go.type.itab > \-- CALLFUNC > \-- LIST > \-- NAME: iface Do you know about the dump function? I am not sure how you generated that, but the dump function already does dumps like that, with more information than you're showing. > In that case, the interface type word would be assigned first and the > half assigned interface would then be used as an argument. Spilling to > a tempname solves this problem. Okay. But then the problem being solved is that a function call has to be made. That's not really a condition of addressability of the n side. Probably you want if n->ullman >= UINF or something like that. 
In this code:

	case OEFACE:
		if (res->op != ONAME || !res->addable) {
			tempname(&n1, n->type);
			cgen_eface(n, &n1);
			cgen(&n1, res);
		} else
			cgen_eface(n, res);
		goto ret;

It says "if res is inappropriate, call cgen_eface with a different res that is appropriate". But when you add the conditions about n->left and n->right, it looks like a check for an inappropriate n, but the cgen_eface still gets called with the same n, which looks suspicious. I think cgen_eface is at fault here if anything is: it should avoid writing to res until it has all the pieces it needs.

Right now it is:

	void
	cgen_eface(Node *n, Node *res)
	{
		Node dst;
		dst = *res;
		dst.type = types[tptr];
		cgen(n->left, &dst);
		dst.xoffset += widthptr;
		cgen(n->right, &dst);
	}

One fix is to make cgen_eface check for n->left->ullman >= UINF and generate into temporaries here. But I don't think that's necessary either. Since n->left is always a constant (a type pointer known at compile time), it should work well enough to swap the order of the generation:

	dst.type = types[tptr];
	dst.xoffset += widthptr;
	cgen(n->right, &dst);  // might be difficult
	dst.xoffset -= widthptr;
	cgen(n->left, &dst);  // always a constant

Russ
On Mon, Aug 6, 2012 at 10:16 PM, Russ Cox <rsc@golang.org> wrote:
>>> One possibility is to write a convT2Ip like convT2I and use that, which
>>> is then 1 function call always. That's a simpler function than convT2I
>>> so it would be faster.
>>
>> I'm not at a computer where I can test this, but I think this would be
>> slower than the current solution, since there would be a call for
>> convT2Ip and a call to atomicloadp within that function.
>
> Given the choice between '1 call always' or '1 call always and
> sometimes 2', I will always choose '1 call always'. It is strictly
> less expensive in generated code, and you're unlikely to be able to
> measure a significant difference caused by pushing the optional
> function call down a bit. It's a very important point, because
> otherwise we'd just inline everything and never make function calls.
> This is a real problem in C++ compilers.
>
> On the other hand, if you can make the fast path 0 function calls
> instead of 1 function call, then the increased speed may justify the
> larger generated code. Not making a function call cuts not just the
> overhead of the function call but also the overhead of having to spill
> any registers to the stack and then reload them after the call. It
> also makes the computation being done instead of the call available to
> the optimizer to understand and possibly rewrite. That's definitely
> the goal, and the second approach should get us there.

I've done the zero call fastpath solution before, and it's a trivial amount of code that needs to be changed. I'll do that for the next patchset.

>
>> I've had some time to think about this CL, and I think there is some
>> value in re-spinning it to have the logic happen in the codegen. It
>> has a couple of advantages:
>>
>> - Having the atomic logic be emitted inline would be easier.
>> - It would get rid of the 2 new functions that I added, since I could
>> use convT2I, and emit the atomic load inline
>> - No need to add the likely field to node.
>>
>> However, I'm not very familiar with the codegen part of gc, and I'm
>> not sure if it's a good idea to handle the interface conversion in
>> codegen. The current CL will work fine, but I think it can be done
>> cleaner. How does this re-spin sound to you?
>
> Please do as much in the front end as possible. Otherwise we end up
> with three copies of everything, which is a significant maintenance
> burden. I've been working to move as much of the logic duplication as
> possible back into the front ends. The back end should have whatever
> primitive is needed (like cgen_eface for the T2E optimization) but
> little else. It's fine to have a cgen_iface and OIFACE that has
> basically the same implementation as cgen_eface but takes a cached
> itab value.
>
> Also, please just make the ARM do a plain load for now instead of
> adding complexity we're not sure is necessary. We can worry about the
> ARM later, once the rest of the code is good. I still believe we can
> arrange to use a plain load, and even if not we can fix it once we're
> happy with the x86 code.
>
>>> http://codereview.appspot.com/6351090/diff/19001/src/cmd/5g/cgen.c#newcode46
>>> src/cmd/5g/cgen.c:46: !n->left->addable || !n->right->addable) {
>>> What condition is this trying to handle?
>>
>> I ran into a problem with compiling the go/ast package. That package
>> has a compilation AST similar to this:
>>
>> AS
>> \-- NAME: iface
>> \-- EFACE
>>     \-- NAME: go.type.itab
>>     \-- CALLFUNC
>>         \-- LIST
>>             \-- NAME: iface
>
> Do you know about the dump function? I am not sure how you generated
> that, but the dump function already does dumps like that, with more
> information than you're showing.

This was transcribed by hand from memory. I was at my laptop when I wrote this mail, and couldn't get a proper dump.
>
>> In that case, the interface type word would be assigned first and the
>> half assigned interface would then be used as an argument. Spilling to
>> a tempname solves this problem.
>
> Okay. But then the problem being solved is that a function call has to
> be made. That's not really a condition of addressability of the n
> side. Probably you want if n->ullman >= UINF or something like that.
>
> In this code:
>
>	case OEFACE:
>		if (res->op != ONAME || !res->addable) {
>			tempname(&n1, n->type);
>			cgen_eface(n, &n1);
>			cgen(&n1, res);
>		} else
>			cgen_eface(n, res);
>		goto ret;
>
> It says "if res is inappropriate, call cgen_eface with a different res
> that is appropriate".
> But when you add the conditions about n->left and n->right, it looks
> like a check for an inappropriate n, but the cgen_eface still gets
> called with the same n, which looks suspicious. I think cgen_eface is
> at fault here if anything is: it should avoid writing to res until it
> has all the pieces it needs.
>
> Right now it is:
>
>	void
>	cgen_eface(Node *n, Node *res)
>	{
>		Node dst;
>		dst = *res;
>		dst.type = types[tptr];
>		cgen(n->left, &dst);
>		dst.xoffset += widthptr;
>		cgen(n->right, &dst);
>	}
>
> One fix is to make cgen_eface check for n->left->ullman >= UINF and
> generate into temporaries here. But I don't think that's necessary
> either. Since n->left is always a constant (a type pointer known at
> compile time), it should work well enough to swap the order of the
> generation:
>
>	dst.type = types[tptr];
>	dst.xoffset += widthptr;
>	cgen(n->right, &dst);  // might be difficult
>	dst.xoffset -= widthptr;
>	cgen(n->left, &dst);  // always a constant

Will do this for next patch set.

> Russ
Hello rsc@golang.org, dave@cheney.net, nigeltao@golang.org (cc: golang-dev@googlegroups.com), Please take another look.
Note that I've updated the description of the CL to include benchmarks taken with a plain assignment serving as the atomic load.
ping.
I think that this is in rsc's court, but out of curiosity, what does "6g -S" give for this program:

--------
package main

type T uintptr

func (T) Method() {}

type I interface {
	Method()
}

var t T
var i I

func main() {
	i = t
}
--------

For example, 6g tip gives:

--- prog list "main" ---
0002 (main.go:14) TEXT main+0(SB),$48-0
0003 (main.go:15) MOVQ $type."".T+0(SB),(SP)
0004 (main.go:15) MOVQ $type."".I+0(SB),8(SP)
0005 (main.go:15) MOVQ $go.itab."".T."".I+0(SB),16(SP)
0006 (main.go:15) MOVQ t+0(SB),BX
0007 (main.go:15) MOVQ BX,24(SP)
0008 (main.go:15) CALL ,runtime.convT2I+0(SB)
0009 (main.go:15) MOVQ 32(SP),BX
0010 (main.go:15) MOVQ BX,i+0(SB)
0011 (main.go:15) MOVQ 40(SP),BX
0012 (main.go:15) MOVQ BX,i+8(SB)
0013 (main.go:16) RET ,

https://codereview.appspot.com/6351090/diff/26002/src/cmd/gc/walk.c
File src/cmd/gc/walk.c (right):

https://codereview.appspot.com/6351090/diff/26002/src/cmd/gc/walk.c#newcode787
src/cmd/gc/walk.c:787: r = l->left;
Just use "sym->def" a few lines below instead of assigning to r?
On 2012/08/24 05:02:51, nigeltao wrote:
> I think that this is in rsc's court, but out of curiosity, what does "6g -S"
> give for this program:
>
> For example, 6g tip gives:
>
> --- prog list "main" ---
> 0002 (main.go:14) TEXT main+0(SB),$48-0
> 0003 (main.go:15) MOVQ $type."".T+0(SB),(SP)
> 0004 (main.go:15) MOVQ $type."".I+0(SB),8(SP)
> 0005 (main.go:15) MOVQ $go.itab."".T."".I+0(SB),16(SP)
> 0006 (main.go:15) MOVQ t+0(SB),BX
> 0007 (main.go:15) MOVQ BX,24(SP)
> 0008 (main.go:15) CALL ,runtime.convT2I+0(SB)
> 0009 (main.go:15) MOVQ 32(SP),BX
> 0010 (main.go:15) MOVQ BX,i+0(SB)
> 0011 (main.go:15) MOVQ 40(SP),BX
> 0012 (main.go:15) MOVQ BX,i+8(SB)
> 0013 (main.go:16) RET ,

This is the 6g -S output:

--- prog list "main" ---
0002 (main.go:14) TEXT main+0(SB),$32-0
0003 (main.go:15) MOVQ go.itab."".T."".I+0(SB),AX
0004 (main.go:15) MOVQ $0,BP
0005 (main.go:15) CMPQ AX,BP
0006 (main.go:15) JNE $1,12
0007 (main.go:15) MOVQ $type."".T+0(SB),(SP)
0008 (main.go:15) MOVQ $type."".I+0(SB),8(SP)
0009 (main.go:15) MOVQ $go.itab."".T."".I+0(SB),16(SP)
0010 (main.go:15) CALL ,runtime.typ2Itab+0(SB)
0011 (main.go:15) MOVQ 24(SP),AX
0012 (main.go:15) MOVQ t+0(SB),BX
0013 (main.go:15) MOVQ BX,i+8(SB)
0014 (main.go:15) MOVQ AX,i+0(SB)
0015 (main.go:16) RET ,

However, the linker will move the unlikely typ2Itab branch to the end of the function for better pipelining.
Hello rsc@golang.org, dave@cheney.net, nigeltao@golang.org (cc: golang-dev@googlegroups.com), Please take another look.
Huh, I thought hg mail would publish draft comments... Anyway, Nigel's suggestion has been added.
On Sat, Aug 25, 2012 at 3:24 AM, <daniel.morsing@gmail.com> wrote: > Huh, thought hg mail would publish draft comments... It only does that if your user name is agl. I don't understand it either.
LGTM

I expected a win in code that uses interfaces as "unions", so you get lots of T2I conversions. The parser and template speedups reflect that, so great. I am a little surprised by the minor slowdowns in the other benchmarks, but that can be code motion and cache lines. It's hard to believe they are actually executing T2I conversions in any serious amount.

One thing that might be interesting in a separate CL: now that we have likely bits on the if, if there is a loop with an if inside, and the if branches out of the loop (return, break), mark it unlikely so that the loop body stays as tight as possible.

http://codereview.appspot.com/6351090/diff/41001/src/pkg/runtime/iface.c
File src/pkg/runtime/iface.c (right):

http://codereview.appspot.com/6351090/diff/41001/src/pkg/runtime/iface.c#newc...
src/pkg/runtime/iface.c:186: #pragma textflag 7
This is fine to mark as textflag 7. I just want to point out that it doesn't have to be for any particular reason.
On 2012/09/11 02:45:14, rsc wrote:
> LGTM
> http://codereview.appspot.com/6351090/diff/41001/src/pkg/runtime/iface.c#newc...
> src/pkg/runtime/iface.c:186: #pragma textflag 7
> This is fine to mark as textflag 7. I just want to point out that it doesn't
> have to be for any particular reason.

I'm submitting as is, but if there are any set-in-stone rules about when to use textflag 7, I'd love to hear them.
*** Submitted as http://code.google.com/p/go/source/detail?r=6d4707371015 ***

cmd/gc: Inline pointer sized T2I interface conversions

This CL also adds support for marking the likelyness of IF nodes in the AST being true. This feature is being used here to mark the slow path as unlikely.

src/pkg/runtime:
benchmark                 old ns/op    new ns/op    delta
BenchmarkConvT2IUintptr          16            1  -91.63%

test/bench/go1:
benchmark                 old ns/op    new ns/op    delta
BenchmarkBinaryTree17    5416917000   5461355000   +0.82%
BenchmarkFannkuch11      3810355000   3842609000   +0.85%
BenchmarkGobDecode         19950950     19855420   -0.48%
BenchmarkGobEncode         11301220     11308530   +0.06%
BenchmarkGzip             548119600    546869200   -0.23%
BenchmarkGunzip           176145400    180208300   +2.31%
BenchmarkJSONEncode        93117400     70163100  -24.65%
BenchmarkJSONDecode       406626800    409999200   +0.83%
BenchmarkMandelbrot200      6300992      6317866   +0.27%
BenchmarkParse              7664396      7451625   -2.78%
BenchmarkRevcomp         1189424000   1412332000  +18.74%
BenchmarkTemplate         491308400    458654200   -6.65%

benchmark                 old MB/s     new MB/s  speedup
BenchmarkGobDecode           38.47        38.66    1.00x
BenchmarkGobEncode           67.92        67.87    1.00x
BenchmarkGzip                35.40        35.48    1.00x
BenchmarkGunzip             110.16       107.68    0.98x
BenchmarkJSONEncode          20.84        27.66    1.33x
BenchmarkJSONDecode           4.77         4.73    0.99x
BenchmarkParse                7.56         7.77    1.03x
BenchmarkRevcomp            213.69       179.96    0.84x
BenchmarkTemplate             3.95         4.23    1.07x

R=rsc, dave, nigeltao
CC=golang-dev
http://codereview.appspot.com/6351090
Use $7 when the function is a leaf: that is, it never calls another function. Setting textflag to 7 is necessary when it's a function involved in the stack-splitting code: splitting stacks inside the stack-splitting code would be a disaster. Other than that, it's optional and affects only performance, not correctness. -rob