Issue 6482062: code review 6482062: math/big: faster (add|sub)V(V|W) routines

Issue 6482062: code review 6482062: math/big: faster (add|sub)V(V|W) routines (Closed)

Can't Edit
Can't Publish+Mail
Start Review

Created:
13 years, 9 months ago by gri

Modified:
13 years, 9 months ago

Reviewers:

CC:
iant, remyoudompheng, ioe, minux1, golang-dev

Visibility:
Public.

Description

math/big: faster (add|sub)V(V|W) routines Benchmarks run on 3.06GHz Intel Core 2 Duo, 4GB 800MHz DDR2 SDRAM ("iMac"). benchmark old ns/op new ns/op delta BenchmarkAddVV_1 6 6 +2.75% BenchmarkAddVV_2 9 7 -19.71% BenchmarkAddVV_3 9 9 +2.25% BenchmarkAddVV_4 10 8 -20.46% BenchmarkAddVV_5 12 10 -19.53% BenchmarkAddVV_1e1 23 15 -32.48% BenchmarkAddVV_1e2 213 107 -49.77% BenchmarkAddVV_1e3 2088 993 -52.44% BenchmarkAddVV_1e4 20874 12027 -42.38% BenchmarkAddVV_1e5 209858 121480 -42.11% BenchmarkAddVW_1 5 5 +0.90% BenchmarkAddVW_2 11 11 -3.51% BenchmarkAddVW_3 7 7 -0.27% BenchmarkAddVW_4 8 7 -6.32% BenchmarkAddVW_5 9 8 -10.89% BenchmarkAddVW_1e1 17 12 -26.01% BenchmarkAddVW_1e2 155 89 -42.32% BenchmarkAddVW_1e3 1479 873 -40.97% BenchmarkAddVW_1e4 13838 8764 -36.67% BenchmarkAddVW_1e5 147353 89560 -39.22% benchmark old MB/s new MB/s speedup BenchmarkAddVV_1 9765.57 9508.55 0.97x BenchmarkAddVV_2 13077.63 16284.97 1.25x BenchmarkAddVV_3 20599.58 20156.67 0.98x BenchmarkAddVV_4 23591.58 29516.02 1.25x BenchmarkAddVV_5 24920.95 31194.10 1.25x BenchmarkAddVV_1e1 27393.76 40621.71 1.48x BenchmarkAddVV_1e2 29911.96 59592.99 1.99x BenchmarkAddVV_1e3 30650.73 64429.84 2.10x BenchmarkAddVV_1e4 30660.09 53213.08 1.74x BenchmarkAddVV_1e5 30496.74 52683.46 1.73x BenchmarkAddVW_1 11503.39 11405.98 0.99x BenchmarkAddVW_2 11203.56 11586.92 1.03x BenchmarkAddVW_3 26173.45 26224.75 1.00x BenchmarkAddVW_4 30560.30 32621.94 1.07x BenchmarkAddVW_5 33183.81 37269.94 1.12x BenchmarkAddVW_1e1 36991.75 50098.53 1.35x BenchmarkAddVW_1e2 41087.14 71549.93 1.74x BenchmarkAddVW_1e3 43266.42 73279.83 1.69x BenchmarkAddVW_1e4 46246.74 73021.97 1.58x BenchmarkAddVW_1e5 43433.00 71459.96 1.65x Benchmarks run on 2.8GHz Quad-Code Intel Xeon, 4GB 800MHz DDR2 FB-DIMM ("PowerMac"). benchmark old ns/op new ns/op delta BenchmarkAddVV_1 7 7 +2.51% BenchmarkAddVV_2 8 8 +3.70% BenchmarkAddVV_3 10 10 +4.00% BenchmarkAddVV_4 11 9 -19.49% BenchmarkAddVV_5 14 11 -18.44% BenchmarkAddVV_1e1 23 17 -27.00% BenchmarkAddVV_1e2 234 117 -50.00% BenchmarkAddVV_1e3 2284 1095 -52.06% BenchmarkAddVV_1e4 22906 13149 -42.60% BenchmarkAddVV_1e5 229860 135133 -41.21% BenchmarkAddVW_1 6 6 +1.15% BenchmarkAddVW_2 7 7 +1.37% BenchmarkAddVW_3 7 8 +1.00% BenchmarkAddVW_4 9 8 -6.93% BenchmarkAddVW_5 10 9 -13.21% BenchmarkAddVW_1e1 18 14 -24.32% BenchmarkAddVW_1e2 170 97 -42.41% BenchmarkAddVW_1e3 1619 953 -41.14% BenchmarkAddVW_1e4 15142 9776 -35.44% BenchmarkAddVW_1e5 160835 102396 -36.33% benchmark old MB/s new MB/s speedup BenchmarkAddVV_1 8928.95 8702.84 0.97x BenchmarkAddVV_2 15298.84 14739.60 0.96x BenchmarkAddVV_3 19116.52 18375.37 0.96x BenchmarkAddVV_4 21644.30 26935.44 1.24x BenchmarkAddVV_5 22771.64 27754.04 1.22x BenchmarkAddVV_1e1 27017.62 37050.89 1.37x BenchmarkAddVV_1e2 27326.09 54289.15 1.99x BenchmarkAddVV_1e3 28016.84 58428.83 2.09x BenchmarkAddVV_1e4 27939.38 48670.55 1.74x BenchmarkAddVV_1e5 27843.00 47360.54 1.70x BenchmarkAddVW_1 10510.97 10397.27 0.99x BenchmarkAddVW_2 17499.71 17279.03 0.99x BenchmarkAddVW_3 24093.93 23858.39 0.99x BenchmarkAddVW_4 27733.08 29799.42 1.07x BenchmarkAddVW_5 30267.17 34781.83 1.15x BenchmarkAddVW_1e1 34566.78 45629.88 1.32x BenchmarkAddVW_1e2 37521.89 65341.93 1.74x BenchmarkAddVW_1e3 39513.18 67153.67 1.70x BenchmarkAddVW_1e4 42263.80 65464.60 1.55x BenchmarkAddVW_1e5 39792.21 62501.88 1.57x

Patch Set 1 #

Patch Set 2 : diff -r b397807815a6 https://go.googlecode.com/hg/ #

Patch Set 3 : diff -r b397807815a6 https://go.googlecode.com/hg/ #

Patch Set 4 : diff -r b397807815a6 https://go.googlecode.com/hg/ #

Patch Set 5 : diff -r b397807815a6 https://go.googlecode.com/hg/ #

Total comments: 10

Patch Set 6 : diff -r a1c27198d91f https://code.google.com/p/go #

Created: 13 years, 9 months ago

Download [raw] [tar.bz2]

		Unified diffs	Side-by-side diffs	Delta from patch set	Stats (+185 lines, -58 lines)			Patch
	M	src/pkg/math/big/arith_amd64.s	View	1 2 3	2 chunks	+185 lines, -58 lines	0 comments	Download

Messages

Total messages: 10

Expand All Messages | Collapse All Messages

gri

Hello iant@golang.org (cc: golang-dev@googlegroups.com), I'd like you to review this change to https://go.googlecode.com/hg/

13 years, 9 months ago (2012-08-24 06:17:42 UTC) #1

ioe

Stupid question: Any reason this kind of low level optimization is not done at some ...

13 years, 9 months ago (2012-08-24 12:32:23 UTC) #3

minux1

On Fri, Aug 24, 2012 at 8:32 PM, Ingo Oeser <nightlyone@googlemail.com>wrote: > Stupid question: Any ...

13 years, 9 months ago (2012-08-24 13:07:18 UTC) #4

ioe

On Friday, August 24, 2012 3:06:57 PM UTC+2, minux wrote: > > > On Fri, ...

13 years, 9 months ago (2012-08-24 13:32:33 UTC) #5

iant

It's difficult to use peephole optimizations to generate add-with-carry instructions. The C or Go code ...

13 years, 9 months ago (2012-08-24 14:50:51 UTC) #6

iant

LGTM Please feel free to either consider or ignore the various suggestions. http://codereview.appspot.com/6482062/diff/6002/src/pkg/math/big/arith_amd64.s File src/pkg/math/big/arith_amd64.s ...

13 years, 9 months ago (2012-08-24 15:02:42 UTC) #7

gri

Thanks for the suggestions, Ian. Just for the record: I also wrote a version where ...

13 years, 9 months ago (2012-08-24 16:08:53 UTC) #8

Thanks for the suggestions, Ian.

Just for the record: I also wrote a version where at the end I have separate
completely unrolled code for n = 1, 2, 3 (rather than the simple loop). This
gives some modest speedup for short vectors but the amount of extra code and
added irregularity seemed not worth it. I also tried moving that code before
(rather than after the unrolled loop) because it permits the use of simpler
(non-indexed) addressing modes for n = 1, 2, 3 - there was no difference in
performance.

http://codereview.appspot.com/6482062/diff/6002/src/pkg/math/big/arith_amd64.s
File src/pkg/math/big/arith_amd64.s (right):

http://codereview.appspot.com/6482062/diff/6002/src/pkg/math/big/arith_amd64....
src/pkg/math/big/arith_amd64.s:42: 
On 2012/08/24 15:02:42, iant wrote:
> At this point you could subtract 4 from DI.  That would save you the CMPQ in
the
> loop, because you could branch back based on the SUBQ result.  Then you could
> add 4 back again before entering V1.

Yes, I was planning to do this but first wanted to get the loop right. Then I
forgot to make the change. Will do in a separate CL.

http://codereview.appspot.com/6482062/diff/6002/src/pkg/math/big/arith_amd64....
src/pkg/math/big/arith_amd64.s:49: RCRQ $1, CX		// restore CF
On 2012/08/24 15:02:42, iant wrote:
> Did you try moving the RCRQ instruction before the MOVQ instructions?

I did in an earlier version and the results were inconclusive. Will try and
perhaps produce a separate CL.

http://codereview.appspot.com/6482062/diff/6002/src/pkg/math/big/arith_amd64....
src/pkg/math/big/arith_amd64.s:54: RCLQ $1, CX		// save CF
On 2012/08/24 15:02:42, iant wrote:
> Did you try moving the RCLQ instruction after the MOVQ instructions?

see above

http://codereview.appspot.com/6482062/diff/6002/src/pkg/math/big/arith_amd64....
src/pkg/math/big/arith_amd64.s:55: MOVQ R11, 0(R10)(SI*8)
On 2012/08/24 15:02:42, iant wrote:
> Did you experiment with putting the MOVQ instruction immediately after the
ADCQ
> instruction that generates it?  That would be worse on Atom but it might save
a
> cycle on other x86_64 processors because it would avoid the immediate
> instruction dependency on the carry flag.  I don't know if it would help or
> hurt.

I have not. Will try all these - it's easy enough.

http://codereview.appspot.com/6482062/diff/6002/src/pkg/math/big/arith_amd64....
src/pkg/math/big/arith_amd64.s:99: 
On 2012/08/24 15:02:42, iant wrote:
> Same suggestion here about subtracting 4, of course.  And below, too.

Done.

gri

*** Submitted as http://code.google.com/p/go/source/detail?r=2507138fe625 *** math/big: faster (add|sub)V(V|W) routines Benchmarks run on 3.06GHz Intel Core ...

13 years, 9 months ago (2012-08-24 16:21:07 UTC) #9

*** Submitted as http://code.google.com/p/go/source/detail?r=2507138fe625 ***

math/big: faster (add|sub)V(V|W) routines

Benchmarks run on 3.06GHz Intel Core 2 Duo,
4GB 800MHz DDR2 SDRAM ("iMac").

benchmark             old ns/op    new ns/op    delta
BenchmarkAddVV_1              6            6   +2.75%
BenchmarkAddVV_2              9            7  -19.71%
BenchmarkAddVV_3              9            9   +2.25%
BenchmarkAddVV_4             10            8  -20.46%
BenchmarkAddVV_5             12           10  -19.53%
BenchmarkAddVV_1e1           23           15  -32.48%
BenchmarkAddVV_1e2          213          107  -49.77%
BenchmarkAddVV_1e3         2088          993  -52.44%
BenchmarkAddVV_1e4        20874        12027  -42.38%
BenchmarkAddVV_1e5       209858       121480  -42.11%
BenchmarkAddVW_1              5            5   +0.90%
BenchmarkAddVW_2             11           11   -3.51%
BenchmarkAddVW_3              7            7   -0.27%
BenchmarkAddVW_4              8            7   -6.32%
BenchmarkAddVW_5              9            8  -10.89%
BenchmarkAddVW_1e1           17           12  -26.01%
BenchmarkAddVW_1e2          155           89  -42.32%
BenchmarkAddVW_1e3         1479          873  -40.97%
BenchmarkAddVW_1e4        13838         8764  -36.67%
BenchmarkAddVW_1e5       147353        89560  -39.22%

benchmark              old MB/s     new MB/s  speedup
BenchmarkAddVV_1        9765.57      9508.55    0.97x
BenchmarkAddVV_2       13077.63     16284.97    1.25x
BenchmarkAddVV_3       20599.58     20156.67    0.98x
BenchmarkAddVV_4       23591.58     29516.02    1.25x
BenchmarkAddVV_5       24920.95     31194.10    1.25x
BenchmarkAddVV_1e1     27393.76     40621.71    1.48x
BenchmarkAddVV_1e2     29911.96     59592.99    1.99x
BenchmarkAddVV_1e3     30650.73     64429.84    2.10x
BenchmarkAddVV_1e4     30660.09     53213.08    1.74x
BenchmarkAddVV_1e5     30496.74     52683.46    1.73x
BenchmarkAddVW_1       11503.39     11405.98    0.99x
BenchmarkAddVW_2       11203.56     11586.92    1.03x
BenchmarkAddVW_3       26173.45     26224.75    1.00x
BenchmarkAddVW_4       30560.30     32621.94    1.07x
BenchmarkAddVW_5       33183.81     37269.94    1.12x
BenchmarkAddVW_1e1     36991.75     50098.53    1.35x
BenchmarkAddVW_1e2     41087.14     71549.93    1.74x
BenchmarkAddVW_1e3     43266.42     73279.83    1.69x
BenchmarkAddVW_1e4     46246.74     73021.97    1.58x
BenchmarkAddVW_1e5     43433.00     71459.96    1.65x

Benchmarks run on 2.8GHz Quad-Code Intel Xeon,
4GB 800MHz DDR2 FB-DIMM ("PowerMac").

benchmark             old ns/op    new ns/op    delta
BenchmarkAddVV_1              7            7   +2.51%
BenchmarkAddVV_2              8            8   +3.70%
BenchmarkAddVV_3             10           10   +4.00%
BenchmarkAddVV_4             11            9  -19.49%
BenchmarkAddVV_5             14           11  -18.44%
BenchmarkAddVV_1e1           23           17  -27.00%
BenchmarkAddVV_1e2          234          117  -50.00%
BenchmarkAddVV_1e3         2284         1095  -52.06%
BenchmarkAddVV_1e4        22906        13149  -42.60%
BenchmarkAddVV_1e5       229860       135133  -41.21%
BenchmarkAddVW_1              6            6   +1.15%
BenchmarkAddVW_2              7            7   +1.37%
BenchmarkAddVW_3              7            8   +1.00%
BenchmarkAddVW_4              9            8   -6.93%
BenchmarkAddVW_5             10            9  -13.21%
BenchmarkAddVW_1e1           18           14  -24.32%
BenchmarkAddVW_1e2          170           97  -42.41%
BenchmarkAddVW_1e3         1619          953  -41.14%
BenchmarkAddVW_1e4        15142         9776  -35.44%
BenchmarkAddVW_1e5       160835       102396  -36.33%

benchmark              old MB/s     new MB/s  speedup
BenchmarkAddVV_1        8928.95      8702.84    0.97x
BenchmarkAddVV_2       15298.84     14739.60    0.96x
BenchmarkAddVV_3       19116.52     18375.37    0.96x
BenchmarkAddVV_4       21644.30     26935.44    1.24x
BenchmarkAddVV_5       22771.64     27754.04    1.22x
BenchmarkAddVV_1e1     27017.62     37050.89    1.37x
BenchmarkAddVV_1e2     27326.09     54289.15    1.99x
BenchmarkAddVV_1e3     28016.84     58428.83    2.09x
BenchmarkAddVV_1e4     27939.38     48670.55    1.74x
BenchmarkAddVV_1e5     27843.00     47360.54    1.70x
BenchmarkAddVW_1       10510.97     10397.27    0.99x
BenchmarkAddVW_2       17499.71     17279.03    0.99x
BenchmarkAddVW_3       24093.93     23858.39    0.99x
BenchmarkAddVW_4       27733.08     29799.42    1.07x
BenchmarkAddVW_5       30267.17     34781.83    1.15x
BenchmarkAddVW_1e1     34566.78     45629.88    1.32x
BenchmarkAddVW_1e2     37521.89     65341.93    1.74x
BenchmarkAddVW_1e3     39513.18     67153.67    1.70x
BenchmarkAddVW_1e4     42263.80     65464.60    1.55x
BenchmarkAddVW_1e5     39792.21     62501.88    1.57x

R=iant, remyoudompheng, nightlyone, minux.ma
CC=golang-dev
http://codereview.appspot.com/6482062

gri

13 years, 9 months ago (2012-08-24 17:14:23 UTC) #10

I have proposed in the past the use of the comma-ok form in Go for getting
carry bit information. For instance, one could imagine writing

    sum, carry = x + y

where carry is 0 or 1. With this one might write

    func addVV(z, x, y []Word) Word
        carry := 0
        for i := range z {
            z[i], carry = x[i] + y[i] + carry
        }
        return carry
    }

which corresponds to the addVV assembly routine. One would also have to
tell the compiler to not do any array bounds checks. It's conceivable that
even a fairly simple compiler could generate rather good for this. A better
compiler might even do the unrolling. That said, it's a lot of stuff that
needs to be just right and at that point it's still not clear that a
compiler can beat a hand-tuned version. The reason for the assembly
routines in the first place is best-possible performance - they are not
really needed for functionality since there are Go versions of all these
assembly routines.

- gri

On Fri, Aug 24, 2012 at 6:32 AM, Ingo Oeser <nightlyone@googlemail.com>wrote:

> On Friday, August 24, 2012 3:06:57 PM UTC+2, minux wrote:
>
>>
>> On Fri, Aug 24, 2012 at 8:32 PM, Ingo Oeser <night...@googlemail.com>wrote:
>>
>>> Stupid question: Any reason this kind of low level optimization is not
>>> done at some peep-hole optimizer stage in the compiler via instruction
>>> pattern matching?
>>>
>> you mean unrolling loops written in assembly programs?
>
>
> That would be a linker stage but would require code size and code speed
> metrics to find a tradeoff.
>
>
>> as we don't have much assembly routines, adding such a optimizing pass
>> doesn't
>> seem worthy (beside this pass will only have very limited applicability).
>>
>>
> The idea was: We add highly optimized assembler routines for things we
> could detect and optimize in the compiler. Either in a very hacky way
> of simple pattern matching or by better building blocks/optimizations in
> the compiler.
>
> After all programming is a conversation with the compiler. If it doesn't
> get what you mean you can either ignore it and write assembler or teach it
> to understand you.
>
> But I understand that there are currently implemented "point in time
> solutions" here.
>
> I just sense a pattern here, that hand crafted assembler routines get
> accepted, while changes in the compiler to generate better code are neither
> (publically visible) done nor accepted.
>
> This just feels wrong to me and I wondered whether this will change or is
> the intended direction of development.
>
> Best Regards
>
> Ingo Oeser
>

Expand All Messages | Collapse All Messages