Issue 6446165: code review 6446165: math/big: unroll loops a bit in amd64 assembly routines. (Closed)

Created:
13 years, 9 months ago by remyoudompheng

Modified:
13 years, 9 months ago

Reviewers:
nigeltao, r, Christopher Swenson, gri, iant2

CC:
golang-dev, remy_archlinux.org

Visibility:
Public.

Description

math/big: unroll loops a bit in amd64 assembly routines.

Processing 4 words at a time reduces the number of instructions needed to save and restore the carry flag, among other things.

Benchmarks on a Core 2 Quad Q8200@2.33GHz:

benchmark             old ns/op     new ns/op     delta
BenchmarkAdd_1w              50            48    -2.40%
BenchmarkAdd_2w              50            52    +4.55%
BenchmarkAdd_5w              55            59    +5.73%
BenchmarkAdd_100kb         4285          2528   -41.00%
BenchmarkAdd_1Mb          44307         24145   -45.51%
BenchmarkAdd_5Mb         325697        289706   -11.05%
BenchmarkAdd_10Mb       1137018       1106273    -2.70%
BenchmarkMul_1w              52            52    -0.76%
BenchmarkMul_2w             117           117    +0.00%
BenchmarkMul_5w             241           228    -5.39%
BenchmarkMul_1kb           1101           940   -14.62%
BenchmarkMul_10kb         59019         47135   -20.14%
BenchmarkMul_50kb        829171        643858   -22.35%
BenchmarkMul_100kb      2563856       1999235   -22.02%
BenchmarkMul_1Mb      105886450      83408800   -21.23%
BenchmarkMul_5Mb     1285270000    1005876000   -21.74%
BenchmarkMul_10Mb    3869718000    3029543000   -21.71%
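
The idea of processing four words per iteration can be sketched in portable Go. This is only an illustration, not the assembly from this change: the name `addVVUnrolled` is hypothetical, words are assumed to be 64 bits wide, and `bits.Add64` from the standard `math/bits` package stands in for the `ADCQ` instruction.

```go
package main

import (
	"fmt"
	"math/bits"
)

// addVVUnrolled sets z = x + y for vectors of 64-bit words and returns the
// final carry. Hypothetical name; a sketch of the unrolling idea only.
func addVVUnrolled(z, x, y []uint64) (c uint64) {
	n := len(z)
	x, y = x[:n], y[:n] // x and y must hold at least n words

	i := 0
	// Main loop: four words per iteration, carry threaded through each add.
	for ; i+4 <= n; i += 4 {
		z[i], c = bits.Add64(x[i], y[i], c)
		z[i+1], c = bits.Add64(x[i+1], y[i+1], c)
		z[i+2], c = bits.Add64(x[i+2], y[i+2], c)
		z[i+3], c = bits.Add64(x[i+3], y[i+3], c)
	}
	// Tail loop: the remaining n%4 words, one at a time.
	for ; i < n; i++ {
		z[i], c = bits.Add64(x[i], y[i], c)
	}
	return c
}

func main() {
	max := ^uint64(0)
	x := []uint64{max, max, max, max, max}
	y := []uint64{1, 0, 0, 0, 0}
	z := make([]uint64, len(x))
	c := addVVUnrolled(z, x, y)
	fmt.Println(z, c) // prints "[0 0 0 0 0] 1": the carry ripples through all five words
}
```

With five words, the example exercises one pass of the unrolled loop and one pass of the tail loop.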

Patch Set 1 #

Patch Set 2 : diff -r b855390a295f https://go.googlecode.com/hg/ #

Patch Set 3 : diff -r b855390a295f https://go.googlecode.com/hg/ #

Patch Set 4 : diff -r b855390a295f https://go.googlecode.com/hg/ #

Total comments: 20

Created: 13 years, 9 months ago

Stats: +158 lines, -25 lines

File                              Chunks    Lines             Comments
M src/pkg/math/big/arith_amd64.s  2 chunks  +115 lines, -25   13
M src/pkg/math/big/nat_test.go    1 chunk   +43 lines, -0     7

Messages

Total messages: 22

remyoudompheng

Hello gri@golang.org, golang-dev@googlegroups.org (cc: golang-dev@googlegroups.com, remy@archlinux.org), I'd like you to review this change to https://go.googlecode.com/hg/

13 years, 9 months ago (2012-08-21 17:55:46 UTC) #1

gri

I will look at this a bit later (today or tomorrow). In the meantime could ...

13 years, 9 months ago (2012-08-21 18:53:59 UTC) #2

remyoudompheng

Hello gri@golang.org, golang-dev@googlegroups.org (cc: golang-dev@googlegroups.com, remy@archlinux.org), Please take another look.

13 years, 9 months ago (2012-08-21 19:03:31 UTC) #3

Christopher Swenson

On 2012/08/21 19:03:31, remyoudompheng wrote: > Hello gri@golang.org, golang-dev@googlegroups.org (cc: > golang-dev@googlegroups.com, remy@archlinux.org), > > ...

13 years, 9 months ago (2012-08-21 21:42:11 UTC) #4

Christopher Swenson

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s File src/pkg/math/big/arith_amd64.s (right): http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#newcode38 src/pkg/math/big/arith_amd64.s:38: MOVQ $0, BX // i = 0 This instruction ...

13 years, 9 months ago (2012-08-21 21:42:57 UTC) #5

nigeltao

13 years, 9 months ago (2012-08-22 01:32:39 UTC) #6

nigeltao

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s File src/pkg/math/big/arith_amd64.s (right): http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#newcode37 src/pkg/math/big/arith_amd64.s:37: SHLQ $2, CX // CX = (n/4)*4 Should the ...

13 years, 9 months ago (2012-08-22 01:55:47 UTC) #7

The 'linkers' (6l etc.) turn the pseudo-ops into the best instruction for the job. The ...

13 years, 9 months ago (2012-08-22 02:16:17 UTC) #8

remyoudompheng

On 2012/08/21 21:42:11, Christopher Swenson wrote: > Though, when we are talking about 1Mb+ numbers ...

13 years, 9 months ago (2012-08-22 03:44:59 UTC) #9

remyoudompheng

On 2012/08/22 03:44:59, remyoudompheng wrote: > I am currently working on a FFT-based implementation. I ...

13 years, 9 months ago (2012-08-22 03:45:44 UTC) #10

gri

13 years, 9 months ago (2012-08-22 23:35:02 UTC) #11

Thanks for working on this.

I think the assembly code can be made more tight. Also, the tests should be
arith-specific.

- gri

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s
File src/pkg/math/big/arith_amd64.s (right):

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#n...
src/pkg/math/big/arith_amd64.s:30: TEXT ·addVV(SB),7,$0
This can be much improved. I'd like to see almost 0% slow-down for the short
vectors. For one, here's a routine that does what we have now with less code
(and no need to shuffle around carry bits).

The unrolled version should be along the same lines.

// func addVV(z, x, y []Word) (c Word)
TEXT ·addVV(SB),7,$0
	MOVQ z+0(FP), R10
	MOVQ x+16(FP), R8
	MOVQ y+32(FP), R9
	MOVL n+8(FP), CX
	ANDQ $0x00000000ffffffff, CX // "sign-extension" (TODO determine correct MOV w/ sign extension instruction)
	
	MOVQ $0, DX
	TESTQ CX, CX
	JZ E1

	MOVQ $0, BX		// i = 0
	CLC
	
L1:	MOVQ (R8)(BX*8), AX
	ADCQ (R9)(BX*8), AX
	MOVQ AX, (R10)(BX*8)
	
	INCQ BX		// i++
	LOOP L1         // n--

	ADCQ $0, DX

E1:	MOVQ DX, c+48(FP)
	RET
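
For cross-checking a routine like the one above, a portable Go counterpart of the same rolled loop may be useful. In the assembly, neither `MOVQ`, `INCQ`, nor `LOOP` modifies the carry flag, so the carry survives from one `ADCQ` to the next; the sketch below makes that carry explicit. The name `addVVRef` and the use of `uint64` words are assumptions for illustration.

```go
package main

import (
	"fmt"
	"math/bits"
)

// addVVRef is a hypothetical reference version of the rolled loop:
// one add-with-carry per word, carry handed to the next iteration.
// For len(z) == 0 it returns 0, matching the early exit to E1.
func addVVRef(z, x, y []uint64) (c uint64) {
	for i := range z {
		z[i], c = bits.Add64(x[i], y[i], c)
	}
	return c
}

func main() {
	x := []uint64{^uint64(0), 1}
	y := []uint64{1, 2}
	z := make([]uint64, 2)
	c := addVVRef(z, x, y)
	fmt.Println(z, c) // prints "[0 4] 0": word 0 overflows, word 1 absorbs the carry
}
```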

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go
File src/pkg/math/big/nat_test.go (right):

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go#new...
src/pkg/math/big/nat_test.go:214: func benchmarkAdd(b *testing.B, sizex, sizey int) {
You always provide the same size below for x and y - so just pass one argument.
If it needs to change later, it's trivially changed, but for now it's not
necessary.

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go#new...
src/pkg/math/big/nat_test.go:217: b.ResetTimer()
you should do this immediately before the loop, otherwise you are also measuring
the SetBits operation.

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go#new...
src/pkg/math/big/nat_test.go:222: _ = z.Add(&x, &y)
These benchmarks are explicitly testing the performance of a handful of
assembly routines. They should call them directly. Otherwise, when other (Go)
improvements to Add and Mul are made, the benchmark results are not comparable
anymore.

Specifically, the overhead for small numbers (1-5 words) is so large that the
assembly code barely matters - they almost all use the same amount of time
(around 50ns on your machine, around 43ns on mine). Thus, you are not measuring
your code. (It is correct that at the end we care about the top-level
operations, but these are the low-level primitives that might be used in a
variety of situations. We need to measure them alone.)

These tests (in modified form) should be in arith_test.go.

Also, to try to compensate for caching effects, it might be useful to run one
operation outside the measured loop.
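
The three suggestions above (call the primitive directly, reset the timer immediately before the loop, and run one warm-up operation outside the measured loop) could be combined roughly as follows. This is a sketch under assumptions: `addVV` here is a pure-Go stand-in on `uint64` words rather than the package-internal assembly routine, and `benchmarkAddVV` and `sink` are hypothetical names.

```go
package main

import (
	"fmt"
	"math/bits"
	"testing"
)

// addVV is a stand-in for the primitive under test (assumption: a real
// benchmark inside math/big would call the assembly routine directly).
func addVV(z, x, y []uint64) (c uint64) {
	for i := range z {
		z[i], c = bits.Add64(x[i], y[i], c)
	}
	return c
}

// sink keeps the result alive so the calls are not optimized away.
var sink uint64

// benchmarkAddVV times addVV alone on n-word operands.
func benchmarkAddVV(b *testing.B, n int) {
	x := make([]uint64, n)
	y := make([]uint64, n)
	z := make([]uint64, n)
	for i := range x {
		x[i] = uint64(i) * 0x9e3779b97f4a7c15 // arbitrary fill pattern
		y[i] = ^x[i]
	}
	sink = addVV(z, x, y) // warm-up run outside the measured loop
	b.ResetTimer()        // exclude setup and warm-up from the timing
	for i := 0; i < b.N; i++ {
		sink = addVV(z, x, y)
	}
}

func main() {
	for _, n := range []int{1, 2, 5} {
		r := testing.Benchmark(func(b *testing.B) { benchmarkAddVV(b, n) })
		fmt.Printf("addVV %dw: %s\n", n, r)
	}
}
```

Because nothing but the primitive runs inside the timed loop, later changes to the higher-level `Add` and `Mul` code would not disturb these numbers.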

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go#new...
src/pkg/math/big/nat_test.go:226: func benchmarkMul(b *testing.B, sizex, sizey int) {
same comments apply here

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go#new...
src/pkg/math/big/nat_test.go:241: func BenchmarkAdd_100kb(b *testing.B) { benchmarkAdd(b, 100e3, 100e3) }
100<<10

(1kb == 1<<10)

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go#new...
src/pkg/math/big/nat_test.go:242: func BenchmarkAdd_1Mb(b *testing.B)   { benchmarkAdd(b, 1e6, 1e6) }
1<<20

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/nat_test.go#new...
src/pkg/math/big/nat_test.go:243: func BenchmarkAdd_5Mb(b *testing.B)   { benchmarkAdd(b, 5e6, 5e6) }
I might be wrong, but going past 100kb sized numbers is not really important for
all practical purposes. But more importantly, the benchmark results are likely
dominated by memory latency (the improvements drop significantly).

It's more important that some of the "smaller" numbers perform reasonably well.
In particular, ideally there should be almost no slowdown (less than 5%) for any
size due to this change. Please test the following sizes:

1w
2w
5w
10w
50w
1Kb
10Kb
100Kb

Also 1Kb == 1024.
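
The requested sizes could be written as a table with power-of-two multipliers, for example as below. The type and variable names are hypothetical, and the unit of `n` (whatever unit the benchmark helper takes) is an assumption; the point is only that "Kb" sizes use `1<<10` rather than `1e3`.

```go
package main

import "fmt"

// benchSize pairs a benchmark label with its operand size.
// Hypothetical type; n is in whatever unit the benchmark helper expects.
type benchSize struct {
	name string
	n    int
}

// benchSizes lists the sizes requested in the review, with 1Kb == 1<<10.
var benchSizes = []benchSize{
	{"1w", 1},
	{"2w", 2},
	{"5w", 5},
	{"10w", 10},
	{"50w", 50},
	{"1Kb", 1 << 10},
	{"10Kb", 10 << 10},
	{"100Kb", 100 << 10},
}

func main() {
	for _, s := range benchSizes {
		fmt.Printf("%-6s %7d\n", s.name, s.n)
	}
}
```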

nigeltao

http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s File src/pkg/math/big/arith_amd64.s (right): http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#newcode30 src/pkg/math/big/arith_amd64.s:30: TEXT ·addVV(SB),7,$0 On 2012/08/22 23:35:02, gri wrote: > MOVL ...

13 years, 9 months ago (2012-08-23 04:36:30 UTC) #12

gri

FYI. http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s File src/pkg/math/big/arith_amd64.s (right): http://codereview.appspot.com/6446165/diff/5/src/pkg/math/big/arith_amd64.s#newcode30 src/pkg/math/big/arith_amd64.s:30: TEXT ·addVV(SB),7,$0 Here's a version which has even ...

13 years, 9 months ago (2012-08-23 05:10:17 UTC) #13

gri

Good to know (MOVLQZX). So the first instructions can be: MOVL n+8(FP), CX TESTQ CX, ...

13 years, 9 months ago (2012-08-23 05:18:39 UTC) #14

remyoudompheng

For some reason here DECQ/JNZ is 2x slower than CMPQ/JL (for both rolled/unrolled versions), ...

13 years, 9 months ago (2012-08-23 06:58:28 UTC) #15

gri

It may well be that DECQ/JNZ is slower than CMPQ/JL. It used to be the ...

13 years, 9 months ago (2012-08-23 15:32:51 UTC) #16

Christopher Swenson

13 years, 9 months ago (2012-08-23 15:49:59 UTC) #17

Could case b) be helped by using prefetching? I would guess that loop
prediction + prefetching might be good enough to make up the difference for
most superscalar chips (at least, I've heard Intel claim such things).

I also wonder how the benchmarks differ on different Intel and AMD chips.

--Christopher


On Thu, Aug 23, 2012 at 11:32 AM, Robert Griesemer <gri@golang.org> wrote:

> It may well be that DECQ/JNZ is slower than CMPQ/JL. It used to be the
> case a very long time ago (first Pentiums) that some of the fancier
> instructions that ran longer sequences of micro-instructions  (e.g. LOOP)
> were significantly slower than a much longer equivalent sequence of more
> basic "RISC" instructions - and that one was best advised to just stick to
> the very basic instructions. I was hoping this might have changed, and I am
> a bit surprised at DECQ being a problem, but I haven't measured myself yet.
> Also, different architectures may have wildly different results, but we
> should probably stick to some of the newer machines.
>
> Either way, this is why it's important to measure just the assembly
> routines in the benchmark so we have a clear(er) picture. I think the
> effects on execution time we are going to see with these routines are a)
> cycle count for small vectors (data is in cache); b) memory latency for
> large vectors (data is in uncached memory); and c) various variations of
> the three. Unrolling will mostly be beneficial for case b) because extra
> memory fetches are overlapping outstanding ones.
>
> - gri
>
>
> On Wed, Aug 22, 2012 at 11:58 PM, <remyoudompheng@gmail.com> wrote:
>
>> For some reason here DECQ/JNZ is 2x slower than CMPQ/JL (for both
>> rolled/unrolled versions), I'm not sure why. Maybe someone can find an
>> architecture where it runs faster?
>>
>>
>> http://codereview.appspot.com/6446165/
>>
>
>


-- 
Christopher Swenson
cswenson@google.com

gri

FYI: I just submitted http://codereview.appspot.com/6478055/ which contains benchmarks for some of the core vector routines. ...

13 years, 9 months ago (2012-08-23 23:00:42 UTC) #18

iant2

On Thu, Aug 23, 2012 at 4:00 PM, Robert Griesemer <gri@golang.org> wrote: > > I ...

13 years, 9 months ago (2012-08-23 23:24:26 UTC) #19

gri

There you go! Used to be true in 1994, and it's still true :-) - ...

13 years, 9 months ago (2012-08-23 23:26:45 UTC) #20

remyoudompheng

Superseded by http://codereview.appspot.com/6482062/

13 years, 9 months ago (2012-08-24 19:26:13 UTC) #21

*** Abandoned ***
