|
|
|
Created:
13 years, 9 months ago by remyoudompheng Modified:
13 years, 9 months ago CC:
golang-dev, remy_archlinux.org Visibility:
Public. |
Descriptionmath/big: unroll loops a bit in amd64 assembly routines.
Processing 4 words at a time reduces the amount of instructions
needed to save and restore the carry flag, among other things.
Benchmarks on a Core 2 Quad Q8200@2.33GHz
benchmark old ns/op new ns/op delta
BenchmarkAdd_1w 50 48 -2.40%
BenchmarkAdd_2w 50 52 +4.55%
BenchmarkAdd_5w 55 59 +5.73%
BenchmarkAdd_100kb 4285 2528 -41.00%
BenchmarkAdd_1Mb 44307 24145 -45.51%
BenchmarkAdd_5Mb 325697 289706 -11.05%
BenchmarkAdd_10Mb 1137018 1106273 -2.70%
BenchmarkMul_1w 52 52 -0.76%
BenchmarkMul_2w 117 117 +0.00%
BenchmarkMul_5w 241 228 -5.39%
BenchmarkMul_1kb 1101 940 -14.62%
BenchmarkMul_10kb 59019 47135 -20.14%
BenchmarkMul_50kb 829171 643858 -22.35%
BenchmarkMul_100kb 2563856 1999235 -22.02%
BenchmarkMul_1Mb 105886450 83408800 -21.23%
BenchmarkMul_5Mb 1285270000 1005876000 -21.74%
BenchmarkMul_10Mb 3869718000 3029543000 -21.71%
Patch Set 1 #Patch Set 2 : diff -r b855390a295f https://go.googlecode.com/hg/ #Patch Set 3 : diff -r b855390a295f https://go.googlecode.com/hg/ #Patch Set 4 : diff -r b855390a295f https://go.googlecode.com/hg/ #
Total comments: 20
MessagesTotal messages: 22
Hello gri@golang.org, golang-dev@googlegroups.org (cc: golang-dev@googlegroups.com, remy@archlinux.org), I'd like you to review this change to https://go.googlecode.com/hg/
Sign in to reply to this message.
I will look at this a bit later (today or tomorrow). In the meantime could you add benchmarks showing that this doesn't slow down small numbers? Thanks. - gri On Tue, Aug 21, 2012 at 10:55 AM, <remyoudompheng@gmail.com> wrote: > Reviewers: gri, golang-dev_googlegroups.org, > > Message: > Hello gri@golang.org, golang-dev@googlegroups.org (cc: > golang-dev@googlegroups.com, remy@archlinux.org), > > I'd like you to review this change to > https://go.googlecode.com/hg/ > > > Description: > math/big: unroll loops a bit in amd64 assembly routines. > > Processing 4 words at a time reduces the amount of instructions > needed to save and restore the carry flag, among other things. > > Benchmarks on a Core 2 Quad Q8200@2.33GHz > > benchmark old ns/op new ns/op delta > BenchmarkAdd_100kb 4400 2517 -42.80% > BenchmarkAdd_1Mb 43965 24363 -44.59% > BenchmarkAdd_5Mb 335207 264941 -20.96% > BenchmarkAdd_10Mb 1148330 1158397 +0.88% > BenchmarkMul_1kb 1189 940 -20.94% > BenchmarkMul_10kb 58555 46842 -20.00% > BenchmarkMul_50kb 825427 642377 -22.18% > BenchmarkMul_100kb 2555077 1983290 -22.38% > BenchmarkMul_1Mb 105479750 82940000 -21.37% > BenchmarkMul_5Mb 1284871000 1004868000 -21.79% > BenchmarkMul_10Mb 3875074000 3038164000 -21.60% > > Please review this at http://codereview.appspot.com/**6446165/<http://codereview.appspot.com/6446165/> > > Affected files: > M src/pkg/math/big/arith_amd64.s > M src/pkg/math/big/nat_test.go > > >
Sign in to reply to this message.
Hello gri@golang.org, golang-dev@googlegroups.org (cc: golang-dev@googlegroups.com, remy@archlinux.org), Please take another look.
Sign in to reply to this message.
On 2012/08/21 19:03:31, remyoudompheng wrote: > Hello mailto:gri@golang.org, mailto:golang-dev@googlegroups.org (cc: > mailto:golang-dev@googlegroups.com, mailto:remy@archlinux.org), > > Please take another look. I saw this patch and took a quick look. I spotted a couple of potential improvements. Though, when we are talking about 1Mb+ numbers for multiplication, we should really look into Toom-Cook multiplication, or even an FFT-based multiply.
Sign in to reply to this message.
http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s File src/pkg/math/big/arith_amd64.s (right): http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#n... src/pkg/math/big/arith_amd64.s:38: MOVQ $0, BX // i = 0 This instruction is 7 bytes long, whereas XOR BX, BX is only 3. Ditto for the following instruction. Although this might cause the pipeline to stall, I believe that the instruction size difference should make up for it. If the pipeline stall is too much, move MOVQ $0, BX to before the shifts, and change the latter to MOVQ BX, DX. http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#n... src/pkg/math/big/arith_amd64.s:46: RCRQ $1, DX Move this rotate to before the MOVQs to ease the pipeline a little? On most architectures the processor should reorder this already, but some (like, say, Atom) it could cause a delay. The carry flag should be preserved through the MOVQs. http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#n... src/pkg/math/big/arith_amd64.s:51: RCLQ $1, DX Ditto above. http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#n... src/pkg/math/big/arith_amd64.s:56: ADDL $4, BX // i+=4 I feel like we should move $4 into a register (perhaps AX) to save the byte here. http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#n... src/pkg/math/big/arith_amd64.s:85: MOVQ $0, BX // i = 0 Ditto previous comments about instruction size. http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#n... src/pkg/math/big/arith_amd64.s:93: RCRQ $1, DX See previous comments about moving the RCRQ/RCLQ. http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#n... src/pkg/math/big/arith_amd64.s:103: ADDL $4, BX // i+=4 Ditto above about moving $4 into AX. http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#n... src/pkg/math/big/arith_amd64.s:114: ADDL $1, BX // i++ INCL BX, like above.
Sign in to reply to this message.
http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s File src/pkg/math/big/arith_amd64.s (right): http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#n... src/pkg/math/big/arith_amd64.s:38: MOVQ $0, BX // i = 0 On 2012/08/21 21:42:57, Christopher Swenson wrote: > This instruction is 7 bytes long, whereas XOR BX, BX is only 3. I forget exactly where in 6g/6l it happens, but "MOVQ $0, BX" gets rewritten to "XORQ BX, BX".
Sign in to reply to this message.
http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s File src/pkg/math/big/arith_amd64.s (right): http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#n... src/pkg/math/big/arith_amd64.s:37: SHLQ $2, CX // CX = (n/4)*4 Should the SHRQ/SHLQ be "ANDQ $~3, CX"? objdump -d tells me that ANDQ is 4 bytes and two shifts are 4+4 bytes: 400c87: 48 c1 e8 02 shr $0x2,%rax 400c8b: 48 c1 e0 02 shl $0x2,%rax 400c8f: 48 83 e0 fc and $0xfffffffffffffffc,%rax
Sign in to reply to this message.
The 'linkers' (6l etc.) turn the pseudo-ops into the best instruction for the job. The assemblers are a peculiar mix of literal instructions and pseudo-ops like MOV, but the correspondence with machine instructions can sometimes be unpredictable. Best to disassemble a binary or run 6l -a to see what happens. -rob
Sign in to reply to this message.
On 2012/08/21 21:42:11, Christopher Swenson wrote: > Though, when we are talking about 1Mb+ numbers for multiplication, we should > really look into Toom-Cook multiplication, or even an FFT-based multiply. I am currently working on a FFT-based implementation. It may not be finished but it is already working, and gets closer to GMP performance after applying this CL (1.2-1.5x slower, 2x slower for very large inputs where the algorithm should be recursive but is not yet). The assembly routines are the most deterministic improvement I can find at the moment.
Sign in to reply to this message.
On 2012/08/22 03:44:59, remyoudompheng wrote: > I am currently working on a FFT-based implementation. I sent an e-mail a few days ago about this: it is implemented externally: https://github.com/remyoudompheng/bigfft
Sign in to reply to this message.
Thanks for working on this. I think the assembly code can be made more tight. Also, the tests should be arith-specific. - gri http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s File src/pkg/math/big/arith_amd64.s (right): http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#n... src/pkg/math/big/arith_amd64.s:30: TEXT ·addVV(SB),7,$0 This can be much improved. I'd like to see almost 0% slow-down for the short vectors. For one, here's a routine that does what we have now with less code (and no need to shuffle around carry bits). The unrolled version should be along the same lines. // func addVV(z, x, y []Word) (c Word) TEXT ·addVV(SB),7,$0 MOVQ z+0(FP), R10 MOVQ x+16(FP), R8 MOVQ y+32(FP), R9 MOVL n+8(FP), CX ANDQ $0x00000000ffffffff, CX // "sign-extension" (TODO determine correct MOV w/ sign extension instruction) MOVQ $0, DX TESTQ CX, CX JZ E1 MOVQ $0, BX // i = 0 CLC L1: MOVQ (R8)(BX*8), AX ADCQ (R9)(BX*8), AX MOVQ AX, (R10)(BX*8) INCQ BX // i++ LOOP L1 // n-- ADCQ $0, DX E1: MOVQ DX, c+48(FP) RET http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go File src/pkg/math/big/nat_test.go (right): http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go#new... src/pkg/math/big/nat_test.go:214: func benchmarkAdd(b *testing.B, sizex, sizey int) { You always provide the same size below for x and y - so just pass one argument. If it needs to change later, it's trivially changed, but for now it's not necessary. http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go#new... src/pkg/math/big/nat_test.go:217: b.ResetTimer() you should do this immediately before the loop, otherwise you are also measuring the SetBits operation. http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go#new... src/pkg/math/big/nat_test.go:222: _ = z.Add(&x, &y) These benchmarks are explicitly testing the performance of a handfull a few assembly routines. Should call them directly. Otherwise, when other (Go) improvements to Add and Mul are made, the benchmark results are not comparable anymore. Specifically, the overhead for small numbers (1-5 words) is so large that the assembly code barely matters - they almost all use the same amount of time (around 50ns on your machine, around 43ns on mine). Thus, you are not measuring your code. (It is correct that at the end we care about the top-level operations, but these are the low-level primitives that might be used in a variety of situations. We need to measure them alone.) These tests (in modified form) should be in arith_test.go. Also, to try to compensate for caching effects, it might be useful to run one operation outside the measured loop. http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go#new... src/pkg/math/big/nat_test.go:226: func benchmarkMul(b *testing.B, sizex, sizey int) { same comments apply here http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go#new... src/pkg/math/big/nat_test.go:241: func BenchmarkAdd_100kb(b *testing.B) { benchmarkAdd(b, 100e3, 100e3) } 100<10 (1kb == 1<<10) http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go#new... src/pkg/math/big/nat_test.go:242: func BenchmarkAdd_1Mb(b *testing.B) { benchmarkAdd(b, 1e6, 1e6) } 1<<20 http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go#new... src/pkg/math/big/nat_test.go:243: func BenchmarkAdd_5Mb(b *testing.B) { benchmarkAdd(b, 5e6, 5e6) } I might be wrong, but going past 100kb sized numbers is not really important for all practical purposes. But more importantly, the benchmark results are likely dominated by memory latency (the improvements drop significantly). It's more important that some of the "smaller" numbers perform reasonably well. In particular, ideally there should be almost no slowdown (less than 5%) for any size due to this change. Please test the following sizes: 1w 2w 5w 10w 50w 1Kb 10Kb 100Kb Also 1Kb == 1024.
Sign in to reply to this message.
http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s File src/pkg/math/big/arith_amd64.s (right): http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#n... src/pkg/math/big/arith_amd64.s:30: TEXT ·addVV(SB),7,$0 On 2012/08/22 23:35:02, gri wrote: > MOVL n+8(FP), CX > ANDQ $0x00000000ffffffff, CX // "sign-extension" (TODO determine correct MOV w/ > sign extension instruction) "MOVL n+8(FP), CX" will fill the top 32 bits of CX with zeroes; the MOVL is equivalent to a MOVLQZX. If you need sign extension, say "MOVLQSX n+8(FP), CX".
Sign in to reply to this message.
FYI. http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s File src/pkg/math/big/arith_amd64.s (right): http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#n... src/pkg/math/big/arith_amd64.s:30: TEXT ·addVV(SB),7,$0 Here's a version which has even less setup instructions (and the same tight inner loop): // func addVV(z, x, y []Word) (c Word) TEXT ·addVV(SB),7,$0 MOVL n+8(FP), CX // 32bit n ANDQ $0xffffffff, CX // 64bit n; CF = 0; ZF = (n == 0) JZ E1 // if n == 0 goto E1 // CX != 0 && CF == 0 // initialize pointers to x, y, z MOVQ x+16(FP), R8 MOVQ y+32(FP), R9 MOVQ z+0(FP), R10 MOVQ $0, BX // i := 0 // CX != 0 && CF == 0 L1: MOVQ (R8)(BX*8), AX ADCQ (R9)(BX*8), AX MOVQ AX, (R10)(BX*8) INCQ BX // i++, CF unchanged // Note: The LOOP instruction can be replaced with // DECQ CX // JNZ L1 // if they run faster then the LOOP microcode. LOOP L1 // n--; if n != 0 goto L1 // CX == 0 ADCQ $0, CX // CX = carry E1: MOVQ CX, c+48(FP) // return carry RET I'd like to see this code integrated with the unrolled loop body. You need to check for the case n >= 4 and that can be done with just 2 extra instructions in the standard path: After the initialization of BX (i := 0), you need: TESTQ $-4, CX // if n&0xff..ffC == 0 (i.e., n < 4) goto L1, CF = 0 JZ L1 // CX >= 4 && CF == 0 ... unrolled loop code here ... L1: MOVQ (R8)(BX*8), AX ...
Sign in to reply to this message.
Good to know (MOVLQZX). So the first instructions can be:
MOVL n+8(FP), CX
TESTQ CX, CX
JZ E1
On Wed, Aug 22, 2012 at 9:36 PM, <nigeltao@golang.org> wrote:
>
> http://codereview.appspot.com/**6446165/diff/5/src/pkg/math/**
>
big/arith_amd64.s<http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s>
> File src/pkg/math/big/arith_amd64.s (right):
>
> http://codereview.appspot.com/**6446165/diff/5/src/pkg/math/**
>
big/arith_amd64.s#newcode30<http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#newcode30>
> src/pkg/math/big/arith_amd64.**s:30: TEXT ·addVV(SB),7,$0
> On 2012/08/22 23:35:02, gri wrote:
>
>> MOVL n+8(FP), CX
>> ANDQ $0x00000000ffffffff, CX // "sign-extension" (TODO determine
>>
> correct MOV w/
>
>> sign extension instruction)
>>
>
> "MOVL n+8(FP), CX" will fill the top 32 bits of CX with zeroes; the MOVL
> is equivalent to a MOVLQZX. If you need sign extension, say "MOVLQSX
> n+8(FP), CX".
>
>
http://codereview.appspot.com/**6446165/<http://codereview.appspot.com/6446165/>
>
Sign in to reply to this message.
For some reason here DECQ/JNZ is 2x times slower than CMPQ/JL (for both rolled/unrolled versions), I'm not sure why. Maybe someone can find an architecture where it runs faster?
Sign in to reply to this message.
It may well be that DECQ/JNZ is slower than CMPQ/JL. It used to be the case a very long time ago (first Pentiums) that some of the fancier instructions that ran longer sequences of micro-instructions (e.g. LOOP) were significantly slower than a much longer equivalent sequence of more basic "RISC" instructions - and that one was best advised to just stick to the very basic instructions. I was hoping this might have changed, and I am a bit surprised at DECQ being a problem, but I haven't measured myself yet. Also, different architectures may have wildly different results, but we should probably stick to some of the newer machines. Either way, this is why it's important to measure just the assembly routines in the benchmark so we have a clear(er) picture. I think the effects on execution time we are going to see with these routines are a) cycle count for small vectors (data is in cache); b) memory latency for large vectors (data is in uncached memory); and c) various variations of the three. Unrolling will mostly be beneficial for case b) because extra memory fetches are overlapping outstanding ones. - gri On Wed, Aug 22, 2012 at 11:58 PM, <remyoudompheng@gmail.com> wrote: > For some reason here DECQ/JNZ is 2x times slower than CMPQ/JL (for both > rolled/unrolled versions), I'm not sure why. Maybe someone can find an > architecture where it runs faster? > > http://codereview.appspot.com/**6446165/<http://codereview.appspot.com/6446165/> >
Sign in to reply to this message.
Could case b) be helped by using prefetching? I would guess that loop prediction + prefetching might be good enough to make up the difference for most superscalar chips (at least, I've heard Intel claim such things). I also wonder how the benchmarks differ on different Intel and AMD chips. --Christopher On Thu, Aug 23, 2012 at 11:32 AM, Robert Griesemer <gri@golang.org> wrote: > It may well be that DECQ/JNZ is slower than CMPQ/JL. It used to be the > case a very long time ago (first Pentiums) that some of the fancier > instructions that ran longer sequences of micro-instructions (e.g. LOOP) > were significantly slower than a much longer equivalent sequence of more > basic "RISC" instructions - and that one was best advised to just stick to > the very basic instructions. I was hoping this might have changed, and I am > a bit surprised at DECQ being a problem, but I haven't measured myself yet. > Also, different architectures may have wildly different results, but we > should probably stick to some of the newer machines. > > Either way, this is why it's important to measure just the assembly > routines in the benchmark so we have a clear(er) picture. I think the > effects on execution time we are going to see with these routines are a) > cycle count for small vectors (data is in cache); b) memory latency for > large vectors (data is in uncached memory); and c) various variations of > the three. Unrolling will mostly be beneficial for case b) because extra > memory fetches are overlapping outstanding ones. > > - gri > > > On Wed, Aug 22, 2012 at 11:58 PM, <remyoudompheng@gmail.com> wrote: > >> For some reason here DECQ/JNZ is 2x times slower than CMPQ/JL (for both >> rolled/unrolled versions), I'm not sure why. Maybe someone can find an >> architecture where it runs faster? >> >> http://codereview.appspot.com/**6446165/<http://codereview.appspot.com/6446165/> >> > > -- Christopher Swenson cswenson@google.com
Sign in to reply to this message.
FYI: I just submitted http://codereview.appspot.com/**6478055/<http://codereview.appspot.com/6478055/> which contains benchmarks for some of the core vector routines. They measure the routines more precisely and thus permit more accurate tuning. I also experimented a bit with DECQ and I can confirm that simply using DECQ reg (rather then SUBQ $1, reg) makes the code 2x slower. I suspect that for DECQ the condition code might not be readily available for the next instruction (Jcc) and cause a bad pipeline bubble. Anyway, the lesson is to avoid it. Unless you beat me to it, I may have some rather fast versions of addVV and subVV later tonight. - gri On Wed, Aug 22, 2012 at 11:58 PM, <remyoudompheng@gmail.com> wrote: > For some reason here DECQ/JNZ is 2x times slower than CMPQ/JL (for both > rolled/unrolled versions), I'm not sure why. Maybe someone can find an > architecture where it runs faster? > > http://codereview.appspot.com/**6446165/<http://codereview.appspot.com/6446165/> >
Sign in to reply to this message.
On Thu, Aug 23, 2012 at 4:00 PM, Robert Griesemer <gri@golang.org> wrote: > > I also experimented a bit with DECQ and I can confirm that simply using DECQ > reg (rather then SUBQ $1, reg) makes the code 2x slower. I suspect that for > DECQ the condition code might not be readily available for the next > instruction (Jcc) and cause a bad pipeline bubble. Anyway, the lesson is to > avoid it. The Intel optimization manual says: The inc and dec instructions modify only a subset of the bits in the flag register. This creates a dependence on all previous writes of the flag register. This is especially problematic when these instructions are on the critical path because they are used to change an address for a load on which many other instructions depend. Assembly/Compiler Coding Rule 42. (M impact, H generality) inc and dec instructions should be replaced with an add or sub instruction, because add and sub overwrite all flags, whereas inc and dec do not, therefore creating false dependencies on earlier instructions that set the flags. Ian
Sign in to reply to this message.
There you go! Used to be true in 1994, and it's still true :-) - gri On Thu, Aug 23, 2012 at 4:24 PM, Ian Lance Taylor <iant@google.com> wrote: > On Thu, Aug 23, 2012 at 4:00 PM, Robert Griesemer <gri@golang.org> wrote: > > > > I also experimented a bit with DECQ and I can confirm that simply using > DECQ > > reg (rather then SUBQ $1, reg) makes the code 2x slower. I suspect that > for > > DECQ the condition code might not be readily available for the next > > instruction (Jcc) and cause a bad pipeline bubble. Anyway, the lesson is > to > > avoid it. > > The Intel optimization manual says: > > The inc and dec instructions modify only a subset of the bits in the > flag > register. This creates a dependence on all previous writes of the flag > register. This is especially problematic when these instructions are on > the critical path because they are used to change an address for a > load on > which many other instructions depend. > > Assembly/Compiler Coding Rule 42. (M impact, H generality) inc and > dec instructions should be replaced with an add or sub instruction, > because > add and sub overwrite all flags, whereas inc and dec do not, therefore > creating false dependencies on earlier instructions that set the flags. > > Ian >
Sign in to reply to this message.
Superseded by http://codereview.appspot.com/6482062/
Sign in to reply to this message.
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
