|
|
Created:
12 years, 7 months ago by remyoudompheng Modified:
11 years, 7 months ago CC:
golang-codereviews Visibility:
Public. |
Descriptioncmd/6g, cmd/8g: add address propagation to peepholes.
This peephole optimization applies to sequences like:
LEAQ (BX)(BP*4), SI
MOVQ (SI), BP
by turning them into
MOVQ (BX)(BP*4), BP
when applicable.
The latter form was naturally produced in code generation
when using the sudoaddable method.
Patch Set 1 #Patch Set 2 : diff -r b88820dbcb86 https://go.googlecode.com/hg/ #Patch Set 3 : diff -r b88820dbcb86 https://go.googlecode.com/hg/ #Patch Set 4 : diff -r b88820dbcb86 https://go.googlecode.com/hg/ #
Total comments: 3
MessagesTotal messages: 21
Hello golang-dev@googlegroups.com (cc: golang-dev@googlegroups.com), I'd like you to review this change to https://go.googlecode.com/hg/
Sign in to reply to this message.
Update issue 1914. maybe ?
Sign in to reply to this message.
I seem to have uploaded a broken version.
Sign in to reply to this message.
This is happening constantly, has anyone logged a bug on rietveld about this ? On Thu, Jan 31, 2013 at 10:35 AM, <remyoudompheng@gmail.com> wrote: > I seem to have uploaded a broken version. > > https://codereview.appspot.com/7221077/
Sign in to reply to this message.
On 2013/01/30 23:36:13, dfc wrote: > This is happening constantly, has anyone logged a bug on rietveld about this ? > > On Thu, Jan 31, 2013 at 10:35 AM, <mailto:remyoudompheng@gmail.com> wrote: > > I seem to have uploaded a broken version. > > > > https://codereview.appspot.com/7221077/ I mean the code is broken, after I tried an unspaghettification. By the way, the patch on 8g needs https://codereview.appspot.com/7226066/ (which is independent) to avoid breakage.
Sign in to reply to this message.
Ahh right. My mistake. I can test on 386, are there any specific packages which might receive a performance improvement from this change ? On Thu, Jan 31, 2013 at 11:04 AM, <remyoudompheng@gmail.com> wrote: > On 2013/01/30 23:36:13, dfc wrote: >> >> This is happening constantly, has anyone logged a bug on rietveld > > about this ? > >> On Thu, Jan 31, 2013 at 10:35 AM, <mailto:remyoudompheng@gmail.com> > > wrote: >> >> > I seem to have uploaded a broken version. >> > >> > https://codereview.appspot.com/7221077/ > > > I mean the code is broken, after I tried an unspaghettification. > By the way, the patch on 8g needs > https://codereview.appspot.com/7226066/ (which is independent) to avoid > breakage. > > > https://codereview.appspot.com/7221077/
Sign in to reply to this message.
The goal was to speed up the following code (adapted from crypto/rc4) benchmark old MB/s new MB/s speedup BenchmarkRC4_128 122.06 142.69 1.17x BenchmarkRC4_1K 123.53 145.18 1.18x BenchmarkRC4_8K 123.65 145.56 1.18x (recently committed assembly routine has ~160MB/s) Before: --- prog list "(*Cipher).XORKeyStream" --- 0124 (rc4.go:47) TEXT (*Cipher).XORKeyStream+0(SB),$0-56 0125 (rc4.go:47) MOVQ dst+16(FP),R15 0126 (rc4.go:47) MOVQ c+0(FP),AX 0127 (rc4.go:48) MOVBQZX 257(AX),BP 0128 (rc4.go:48) MOVQ BP,SI 0129 (rc4.go:48) MOVBQZX 256(AX),BP 0130 (rc4.go:49) MOVQ src+32(FP),R10 0131 (rc4.go:49) MOVQ src+40(FP),R14 0132 (rc4.go:49) MOVQ src+48(FP),BX 0133 (rc4.go:48) MOVQ BP,DI 0134 (rc4.go:49) MOVQ $0,DX 0135 (rc4.go:49) JMP ,137 0136 (rc4.go:49) INCQ ,DX 0137 (rc4.go:49) CMPQ DX,R14 0138 (rc4.go:49) JGE $0,177 0139 (rc4.go:49) MOVBQZX (R10),BP 0140 (rc4.go:49) MOVQ BP,R11 0141 (rc4.go:49) INCQ ,R10 0142 (rc4.go:50) INCQ ,DI 0143 (rc4.go:51) LEAQ (AX),BP 0144 (rc4.go:51) MOVBQZX DI,BX 0145 (rc4.go:51) LEAQ (BP)(BX*1),BP 0146 (rc4.go:51) MOVBQZX (BP),BX 0147 (rc4.go:51) MOVQ BX,CX 0148 (rc4.go:52) ADDQ BX,SI 0149 (rc4.go:53) LEAQ (AX),BP 0150 (rc4.go:53) MOVBQZX SI,BX 0151 (rc4.go:53) LEAQ (BP)(BX*1),BP 0152 (rc4.go:53) MOVBQZX (BP),BX 0153 (rc4.go:53) MOVQ BX,R9 0154 (rc4.go:54) LEAQ (AX),BP 0155 (rc4.go:54) MOVBQZX DI,BX 0156 (rc4.go:54) LEAQ (BP)(BX*1),BP 0157 (rc4.go:54) MOVB R9,(BP) 0158 (rc4.go:54) LEAQ (AX),BP 0159 (rc4.go:54) MOVBQZX SI,BX 0160 (rc4.go:54) LEAQ (BP)(BX*1),BP 0161 (rc4.go:54) MOVB CX,(BP) 0162 (rc4.go:55) ADDQ R9,CX 0163 (rc4.go:55) LEAQ (AX),BP 0164 (rc4.go:55) MOVBQZX CX,R8 0165 (rc4.go:55) LEAQ (BP)(R8*1),BP 0166 (rc4.go:55) MOVBQZX (BP),BX 0167 (rc4.go:55) MOVQ dst+8(FP),BP 0168 (rc4.go:55) XORQ R11,BX 0169 (rc4.go:55) MOVQ DX,R8 0170 (rc4.go:55) CMPQ DX,R15 0171 (rc4.go:55) JCS $1,174 0172 (rc4.go:55) CALL ,runtime.panicindex+0(SB) 0173 (rc4.go:55) UNDEF , 0174 (rc4.go:55) LEAQ (BP)(R8*1),BP 0175 (rc4.go:55) MOVB BX,(BP) 0176 (rc4.go:49) JMP ,136 0177 (rc4.go:57) MOVB DI,256(AX) 0178 (rc4.go:57) MOVB SI,257(AX) 0179 (rc4.go:58) RET , After: --- prog list "(*Cipher).XORKeyStream" --- 0120 (rc4.go:47) TEXT (*Cipher).XORKeyStream+0(SB),$0-56 0121 (rc4.go:47) MOVQ dst+16(FP),R15 0122 (rc4.go:47) MOVQ c+0(FP),AX 0123 (rc4.go:48) MOVBQZX 257(AX),BP 0124 (rc4.go:48) MOVQ BP,SI 0125 (rc4.go:48) MOVBQZX 256(AX),BP 0126 (rc4.go:49) MOVQ src+32(FP),R10 0127 (rc4.go:49) MOVQ src+40(FP),R14 0128 (rc4.go:49) MOVQ src+48(FP),BX 0129 (rc4.go:48) MOVQ BP,DI 0130 (rc4.go:49) MOVQ $0,DX 0131 (rc4.go:49) JMP ,133 0132 (rc4.go:49) INCQ ,DX 0133 (rc4.go:49) CMPQ DX,R14 0134 (rc4.go:49) JGE $0,162 0135 (rc4.go:49) MOVBQZX (R10),BP 0136 (rc4.go:49) INCQ ,R10 0137 (rc4.go:50) INCQ ,DI 0138 (rc4.go:51) MOVBQZX DI,BX 0139 (rc4.go:51) MOVBQZX (AX)(BX*1),BX 0140 (rc4.go:51) MOVQ BX,CX 0141 (rc4.go:52) ADDQ BX,SI 0142 (rc4.go:53) MOVBQZX SI,BX 0143 (rc4.go:53) MOVBQZX (AX)(BX*1),BX 0144 (rc4.go:53) MOVQ BX,R9 0145 (rc4.go:54) MOVBQZX DI,BX 0146 (rc4.go:54) MOVB R9,(AX)(BX*1) 0147 (rc4.go:54) MOVBQZX SI,BX 0148 (rc4.go:54) MOVB CX,(AX)(BX*1) 0149 (rc4.go:55) ADDQ R9,CX 0150 (rc4.go:55) MOVBQZX CX,R8 0151 (rc4.go:55) MOVBQZX (AX)(R8*1),BX 0152 (rc4.go:55) XORQ BP,BX 0153 (rc4.go:55) MOVQ dst+8(FP),BP 0154 (rc4.go:55) MOVQ DX,R8 0155 (rc4.go:55) CMPQ DX,R15 0156 (rc4.go:55) JCS $1,159 0157 (rc4.go:55) CALL ,runtime.panicindex+0(SB) 0158 (rc4.go:55) UNDEF , 0159 (rc4.go:55) LEAQ (BP)(R8*1),BP 0160 (rc4.go:55) MOVB BX,(BP) 0161 (rc4.go:49) JMP ,132 0162 (rc4.go:57) MOVB DI,256(AX) 0163 (rc4.go:57) MOVB SI,257(AX) 0164 (rc4.go:58) RET , Go 1: --- prog list "(*Cipher).XORKeyStream" --- 0112 (rc4.go:47) TEXT (*Cipher).XORKeyStream+0(SB),$24-40 0113 (rc4.go:47) MOVQ c+0(FP),BX 0114 (rc4.go:48) MOVB 257(BX),BP 0115 (rc4.go:48) MOVB BP,CX 0116 (rc4.go:48) MOVQ BX,AX 0117 (rc4.go:48) MOVB 256(BX),BP 0118 (rc4.go:48) MOVB BP,DI 0119 (rc4.go:48) MOVB CX,BX 0120 (rc4.go:48) MOVB BX,SI 0121 (rc4.go:49) MOVQ src+24(FP),BX 0122 (rc4.go:49) MOVQ BX,autotmp_0004+-16(SP) 0123 (rc4.go:49) MOVL src+32(FP),BX 0124 (rc4.go:49) MOVL BX,autotmp_0004+-8(SP) 0125 (rc4.go:49) MOVL src+36(FP),BX 0126 (rc4.go:49) MOVL BX,autotmp_0004+-4(SP) 0127 (rc4.go:49) MOVL $0,DX 0128 (rc4.go:49) MOVL autotmp_0004+-8(SP),BX 0129 (rc4.go:49) MOVL BX,autotmp_0006+-20(SP) 0130 (rc4.go:49) LEAQ autotmp_0004+-16(SP),BX 0131 (rc4.go:49) MOVQ (BX),R9 0132 (rc4.go:49) JMP ,134 0133 (rc4.go:49) INCL ,DX 0134 (rc4.go:49) MOVL autotmp_0006+-20(SP),BP 0135 (rc4.go:49) CMPL DX,BP 0136 (rc4.go:49) JGE ,180 0137 (rc4.go:49) MOVB (R9),BP 0138 (rc4.go:49) MOVB BP,R13 0139 (rc4.go:49) MOVL DX,BX 0140 (rc4.go:49) MOVL DX,R12 0141 (rc4.go:49) MOVB R13,BX 0142 (rc4.go:49) MOVB BX,R11 0143 (rc4.go:49) INCQ ,R9 0144 (rc4.go:49) MOVQ R9,BX 0145 (rc4.go:50) INCB ,DI 0146 (rc4.go:51) MOVQ AX,BX 0147 (rc4.go:51) MOVB DI,BP 0148 (rc4.go:51) MOVBQZX BP,BP 0149 (rc4.go:51) MOVB (AX)(BP*1),R8 0150 (rc4.go:51) MOVB R8,CX 0151 (rc4.go:52) MOVB CX,BX 0152 (rc4.go:52) ADDB BX,SI 0153 (rc4.go:53) MOVB SI,BP 0154 (rc4.go:53) MOVBQZX BP,BP 0155 (rc4.go:53) MOVB (AX)(BP*1),R8 0156 (rc4.go:53) MOVB R8,R10 0157 (rc4.go:54) MOVB DI,BP 0158 (rc4.go:54) MOVBQZX BP,BP 0159 (rc4.go:54) MOVB R10,R8 0160 (rc4.go:54) MOVB R8,(AX)(BP*1) 0161 (rc4.go:54) MOVQ AX,BX 0162 (rc4.go:54) MOVB SI,BP 0163 (rc4.go:54) MOVBQZX BP,BP 0164 (rc4.go:54) MOVB CX,R8 0165 (rc4.go:54) MOVB R8,(AX)(BP*1) 0166 (rc4.go:55) MOVB CX,BP 0167 (rc4.go:55) MOVB R10,R8 0168 (rc4.go:55) ADDB R8,BP 0169 (rc4.go:55) MOVBQZX BP,BP 0170 (rc4.go:55) MOVB (AX)(BP*1),BX 0171 (rc4.go:55) MOVB R11,BP 0172 (rc4.go:55) XORB BP,BX 0173 (rc4.go:55) MOVLQSX R12,BP 0174 (rc4.go:55) CMPL BP,dst+16(FP) 0175 (rc4.go:55) JCS ,177 0176 (rc4.go:55) CALL ,runtime.panicindex+0(SB) 0177 (rc4.go:55) MOVQ dst+8(FP),R8 0178 (rc4.go:55) MOVB BX,(R8)(BP*1) 0179 (rc4.go:49) JMP ,133 0180 (rc4.go:57) MOVB DI,BP 0181 (rc4.go:57) MOVB BP,256(AX) 0182 (rc4.go:57) MOVB SI,BP 0183 (rc4.go:57) MOVB BP,257(AX) 0184 (rc4.go:58) RET ,
Sign in to reply to this message.
On 2013/01/31 00:12:29, remyoudompheng wrote: > The goal was to speed up the following code (adapted from crypto/rc4) > > benchmark old MB/s new MB/s speedup > BenchmarkRC4_128 122.06 142.69 1.17x > BenchmarkRC4_1K 123.53 145.18 1.18x > BenchmarkRC4_8K 123.65 145.56 1.18x > > (recently committed assembly routine has ~160MB/s) > Forgot the source: // XORKeyStream sets dst to the result of XORing src with the key stream. // Dst and src may be the same slice but otherwise should not overlap. func (c *Cipher) XORKeyStream(dst, src []byte) { i, j := c.i, c.j for k, v := range src { i += 1 x := c.s[i] j += x y := c.s[j] c.s[i], c.s[j] = y, x dst[k] = v ^ c.s[x+y] } c.i, c.j = i, j }
Sign in to reply to this message.
This is fine but I suspect you can get even better performance if you can make sudoaddable generate the MOVQ directly. The peephole frees up SI too late for it to be used. Russ
Sign in to reply to this message.
On 2013/01/31 00:10:34, dfc wrote: > Ahh right. My mistake. > > I can test on 386, are there any specific packages which might receive > a performance improvement from this change ? My example on crypto/rc4 shows a noticeable speedup, but not all code benefits from the change. I'd be happy if you have some benchmarks you want to try.
Sign in to reply to this message.
On 2013/01/31 00:13:39, rsc wrote: > This is fine but I suspect you can get even better performance if you can > make sudoaddable generate the MOVQ directly. The peephole frees up SI too > late for it to be used. > > Russ sudoaddable was disabled by this changeset: it is not clear to me how to re-enable it without breaking tests. changeset: 14785:0167b45ce063 user: Rémy Oudompheng <oudomphe@phare.normalesup.org> date: Fri Nov 02 07:50:59 2012 +0100 cmd/5g, cmd/6g: fix out of registers with array indexing. Compiling expressions like: s[s[s[s[s[s[s[s[s[s[s[s[i]]]]]]]]]]]] make 5g and 6g run out of registers. Such expressions can arise if a slice is used to represent a permutation and the user wants to iterate it. [...] In the code which is now generated SI and DI are truncated to bytes 3 times each...
Sign in to reply to this message.
Applying both CLs on linux/386, benchmarks for crypto/rc4 appear to have regressed by 10% 220887(~/go/src/pkg/crypto/rc4) % ~/go/misc/benchcmp {old,new}.txtbenchmark old ns/op new ns/op delta BenchmarkRC4_128 12420 13710 +10.39% BenchmarkRC4_1K 99146 109387 +10.33% BenchmarkRC4_8K 783317 869264 +10.97% benchmark old MB/s new MB/s speedup BenchmarkRC4_128 10.31 9.34 0.91x BenchmarkRC4_1K 10.33 9.36 0.91x BenchmarkRC4_8K 10.34 9.31 0.90x
Sign in to reply to this message.
go1 benchmarks tell a confusing story 220887(~/go/test/bench/go1) % ./go1.07047d188e5d+ -test.bench=. > old.txt && ./go1.test -test.bench=. > new.txt && ~/go/misc/benchcmp {old,new}.txt testing: warning: no tests to run testing: warning: no tests to run benchmark old ns/op new ns/op delta BenchmarkBinaryTree17 2147483647 2147483647 +0.38% BenchmarkFannkuch11 2147483647 2147483647 -4.12% BenchmarkGobDecode 116948991 118130290 +1.01% BenchmarkGobEncode 46165429 45144971 -2.21% BenchmarkGzip 2147483647 2147483647 -3.77% BenchmarkGunzip 578569131 573901528 -0.81% BenchmarkJSONEncode 595919633 593787673 -0.36% BenchmarkJSONDecode 894130881 864169486 -3.35% BenchmarkMandelbrot200 33309028 33309361 +0.00% BenchmarkParse 36518266 36188068 -0.90% BenchmarkRevcomp 2147483647 2147483647 -3.33% BenchmarkTemplate 1525819164 1537115231 +0.74% benchmark old MB/s new MB/s speedup BenchmarkGobDecode 6.56 6.50 0.99x BenchmarkGobEncode 16.63 17.00 1.02x BenchmarkGzip 5.68 5.90 1.04x BenchmarkGunzip 33.54 33.81 1.01x BenchmarkJSONEncode 3.26 3.27 1.00x BenchmarkJSONDecode 2.17 2.25 1.04x BenchmarkParse 1.59 1.60 1.01x BenchmarkRevcomp 52.64 54.45 1.03x BenchmarkTemplate 1.27 1.26 0.99x The 3 and 4 % speedups are constant across several runs.
Sign in to reply to this message.
First impressions here. I will probably look at this more closely some other time. https://codereview.appspot.com/7221077/diff/9001/src/cmd/6g/peep.c File src/cmd/6g/peep.c (right): https://codereview.appspot.com/7221077/diff/9001/src/cmd/6g/peep.c#newcode235 src/cmd/6g/peep.c:235: p->from.type -= D_INDIR; the resulting move might make new optimizations possible. t++ at the end here. https://codereview.appspot.com/7221077/diff/9001/src/cmd/6g/peep.c#newcode1469 src/cmd/6g/peep.c:1469: if(r == R || r == r0) non-unique predecessors and successors aren't necesarily cause to bail out here. The structure from copyprop should be applicable here. https://codereview.appspot.com/7221077/diff/9001/src/cmd/6g/peep.c#newcode1473 src/cmd/6g/peep.c:1473: t = addru(r->prog, t0, f0); Why have addru and addrsub be separated? copyu calls copysub, which i guess addru was based on.
Sign in to reply to this message.
On 2013/01/31 00:44:37, dfc wrote: > Applying both CLs on linux/386, benchmarks for crypto/rc4 appear to have > regressed by 10% > > 220887(~/go/src/pkg/crypto/rc4) % ~/go/misc/benchcmp {old,new}.txtbenchmark > old ns/op new ns/op delta > BenchmarkRC4_128 12420 13710 +10.39% > BenchmarkRC4_1K 99146 109387 +10.33% > BenchmarkRC4_8K 783317 869264 +10.97% > > benchmark old MB/s new MB/s speedup > BenchmarkRC4_128 10.31 9.34 0.91x > BenchmarkRC4_1K 10.33 9.36 0.91x > BenchmarkRC4_8K 10.34 9.31 0.90x What kind of CPU are you using in this benchmark?
Sign in to reply to this message.
220887(~) % uname -a Linux 220887 3.2.0-32-generic-pae #51-Ubuntu SMP Wed Sep 26 21:54:23 UTC 2012 i686 i686 i386 GNU/Linux 220887(~) % head /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 28 model name : Intel(R) Atom(TM) CPU 330 @ 1.6 0GHz stepping : 2 microcode : 0x20d cpu MHz : 1596.088 cache size : 512 KB physical id : 0 220887(~) % On 01/02/2013, at 7:59, remyoudompheng@gmail.com wrote: > On 2013/01/31 00:44:37, dfc wrote: >> Applying both CLs on linux/386, benchmarks for crypto/rc4 appear to > have >> regressed by 10% > >> 220887(~/go/src/pkg/crypto/rc4) % ~/go/misc/benchcmp > {old,new}.txtbenchmark >> old ns/op new ns/op delta >> BenchmarkRC4_128 12420 13710 +10.39% >> BenchmarkRC4_1K 99146 109387 +10.33% >> BenchmarkRC4_8K 783317 869264 +10.97% > >> benchmark old MB/s new MB/s speedup >> BenchmarkRC4_128 10.31 9.34 0.91x >> BenchmarkRC4_1K 10.33 9.36 0.91x >> BenchmarkRC4_8K 10.34 9.31 0.90x > > What kind of CPU are you using in this benchmark? > > https://codereview.appspot.com/7221077/
Sign in to reply to this message.
ping
Sign in to reply to this message.
My main reservation is that I think the compiler should do a better job avoiding the LEAQs in the first place, because then the optimizer has more registers available for the earlier stages. But if you address Daniel's comments, I'm willing to run with it.
Sign in to reply to this message.
Let's hold off on this now until after Go 1.1. Thanks.
Sign in to reply to this message.
Replacing golang-dev with golang-codereviews.
Sign in to reply to this message.
|