Code review - Issue 9038048: runtime: faster x86 memmove (a.k.a. built-in copy())
https://codereview.appspot.com/9038048/ (Rietveld, closed 2013-05-17)
Message from khr@golang.org
2013-05-16 20:51 UTC
Hello golang-dev@googlegroups.com,
I'd like you to review this change to
https://khr%40golang.org@code.google.com/p/go/
Message from bradfitz@golang.org
2013-05-16 21:52 UTC
Nice.
Update CL to say which CPU this was on?
The cutover point will likely change in the future.
On May 16, 2013 1:51 PM, <khr@golang.org> wrote:
> Reviewers: golang-dev1,
>
> Message:
> Hello golang-dev@googlegroups.com,
>
> I'd like you to review this change to
> https://khr%40golang.org@code.google.com/p/go/
>
>
> Description:
> runtime: faster x86 memmove (a.k.a. built-in copy())
>
> REP instructions have a high startup cost, so we handle small
> sizes with some straightline code. The REP MOVSx instructions
> are really fast for large sizes. The cutover is approximately
> 1K. We implement up to 128/256 because that is the maximum
> SSE register load (loading all data into registers lets us
> ignore copy direction).
>
> benchmark old ns/op new ns/op delta
> BenchmarkMemmove0 3 3 +0.86%
> BenchmarkMemmove1 5 5 +5.40%
> BenchmarkMemmove2 18 8 -56.84%
> BenchmarkMemmove3 18 7 -58.45%
> BenchmarkMemmove4 36 7 -78.63%
> BenchmarkMemmove5 36 8 -77.91%
> BenchmarkMemmove6 36 8 -77.76%
> BenchmarkMemmove7 36 8 -77.82%
> BenchmarkMemmove8 18 8 -56.33%
> BenchmarkMemmove9 18 7 -58.34%
> BenchmarkMemmove10 18 7 -58.34%
> BenchmarkMemmove11 18 7 -58.45%
> BenchmarkMemmove12 36 7 -78.51%
> BenchmarkMemmove13 36 7 -78.48%
> BenchmarkMemmove14 36 7 -78.56%
> BenchmarkMemmove15 36 7 -78.56%
> BenchmarkMemmove16 18 7 -58.24%
> BenchmarkMemmove32 18 8 -54.33%
> BenchmarkMemmove64 18 8 -53.37%
> BenchmarkMemmove128 20 9 -55.93%
> BenchmarkMemmove256 25 11 -55.16%
> BenchmarkMemmove512 33 33 -1.19%
> BenchmarkMemmove1024 43 44 +2.06%
> BenchmarkMemmove2048 61 61 +0.16%
> BenchmarkMemmove4096 95 95 +0.00%
>
> Please review this at https://codereview.appspot.com/9038048/
>
> Affected files:
> src/pkg/runtime/memmove_386.s
> src/pkg/runtime/memmove_amd64.s
> src/pkg/runtime/memmove_test.go
>
>
Message from remyoudompheng@gmail.com
2013-05-16 21:54 UTC
This is very interesting; I should try combining it with CL 9101048.
Message from khr@google.com
2013-05-16 22:06 UTC
If you're calling memmove directly with a constant size, maybe we should
split out the copy ranges (e.g. memmove_17through32) as full-fledged
functions that you can call directly, avoiding the size switch at the top
of memmove. Downside, the compiler is exposed to some of the memmove
implementation. Upside, no size switch.
Message from remyoudompheng@gmail.com
2013-05-16 23:06 UTC
On 2013/05/16 22:06:36, khr1 wrote:
> If you're calling memmove directly with a constant size, maybe we should
> split out the copy ranges (e.g. memmove_17through32) as full-fledged
> functions that you can call directly, avoiding the size switch at the top
> of memmove. Downside, the compiler is exposed to some of the memmove
> implementation. Upside, no size switch.
>
The naïve combination is already extremely interesting.
This CL alone vs. This CL + CL9101048
benchmark old ns/op new ns/op delta
BenchmarkCopy1Byte 9 8 -5.49%
BenchmarkCopy2Byte 13 8 -35.38%
BenchmarkCopy4Byte 14 7 -44.41%
BenchmarkCopy8Byte 14 8 -40.43%
BenchmarkCopy12Byte 15 9 -39.47%
BenchmarkCopy16Byte 15 9 -39.40%
BenchmarkCopy32Byte 18 11 -35.68%
BenchmarkCopy128Byte 32 25 -23.31%
BenchmarkCopy1024Byte 101 94 -6.93%
BenchmarkCopy1String 10 8 -24.02%
BenchmarkCopy2String 10 8 -24.49%
BenchmarkCopy4String 10 7 -28.35%
BenchmarkCopy8String 10 8 -21.83%
BenchmarkCopy12String 11 9 -18.30%
BenchmarkCopy16String 11 9 -17.50%
BenchmarkCopy32String 14 12 -17.12%
BenchmarkCopy128String 28 24 -12.01%
BenchmarkCopy1024String 96 93 -3.22%
CL9101048 alone vs. This CL + CL9101048
benchmark old ns/op new ns/op delta
BenchmarkCopy1Byte 4 8 +75.87%
BenchmarkCopy2Byte 26 8 -67.94%
BenchmarkCopy4Byte 55 7 -85.65%
BenchmarkCopy8Byte 26 8 -68.29%
BenchmarkCopy12Byte 55 9 -83.61%
BenchmarkCopy16Byte 26 9 -65.31%
BenchmarkCopy32Byte 26 11 -54.58%
BenchmarkCopy128Byte 28 25 -13.19%
BenchmarkCopy1024Byte 96 94 -2.08%
BenchmarkCopy1String 4 8 +89.51%
BenchmarkCopy2String 25 8 -68.68%
BenchmarkCopy4String 54 7 -85.67%
BenchmarkCopy8String 25 8 -66.85%
BenchmarkCopy12String 54 9 -83.21%
BenchmarkCopy16String 25 9 -64.19%
BenchmarkCopy32String 25 12 -52.92%
BenchmarkCopy128String 28 24 -12.32%
BenchmarkCopy1024String 96 93 -3.22%
tip vs. both CLs
benchmark old ns/op new ns/op delta
BenchmarkCopy1Byte 10 8 -16.50%
BenchmarkCopy2Byte 36 8 -76.92%
BenchmarkCopy4Byte 74 7 -89.36%
BenchmarkCopy8Byte 36 8 -77.09%
BenchmarkCopy12Byte 74 9 -87.84%
BenchmarkCopy16Byte 36 9 -75.03%
BenchmarkCopy32Byte 36 11 -67.31%
BenchmarkCopy128Byte 38 25 -34.55%
BenchmarkCopy1024Byte 107 94 -12.15%
BenchmarkCopy1String 27 8 -69.89%
BenchmarkCopy2String 27 8 -70.07%
BenchmarkCopy4String 56 7 -86.10%
BenchmarkCopy8String 27 8 -68.44%
BenchmarkCopy12String 56 9 -83.72%
BenchmarkCopy16String 27 9 -65.78%
BenchmarkCopy32String 27 12 -55.35%
BenchmarkCopy128String 29 24 -15.88%
BenchmarkCopy1024String 96 93 -3.62%
Message from remyoudompheng@gmail.com
2013-05-16 23:06 UTC
On 2013/05/16 23:06:01, remyoudompheng wrote:
> The naïve combination is already extremely interesting.
My CPU is a Core 2 Quad Q8200 (2.33GHz).
Message from iant@golang.org
2013-05-17 13:55 UTC
These timings vary an awful lot across different versions of Intel CPUs, so it's worth doing some timings on other CPU types if possible.
Message from dominik.honnef@gmail.com
2013-05-17 14:42 UTC
Testing just CL9038048
Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
benchmark old ns/op new ns/op delta
BenchmarkMemmove0 3 3 +8.86%
BenchmarkMemmove1 5 5 +5.06%
BenchmarkMemmove2 18 7 -57.87%
BenchmarkMemmove3 18 7 -59.52%
BenchmarkMemmove4 36 7 -79.30%
BenchmarkMemmove5 36 7 -78.59%
BenchmarkMemmove6 36 7 -78.59%
BenchmarkMemmove7 36 7 -78.59%
BenchmarkMemmove8 18 7 -57.98%
BenchmarkMemmove9 18 7 -59.36%
BenchmarkMemmove10 18 7 -59.36%
BenchmarkMemmove11 18 7 -59.58%
BenchmarkMemmove12 36 7 -79.30%
BenchmarkMemmove13 36 7 -79.30%
BenchmarkMemmove14 36 7 -79.27%
BenchmarkMemmove15 36 7 -79.30%
BenchmarkMemmove16 18 8 -55.96%
BenchmarkMemmove32 18 8 -56.54%
BenchmarkMemmove64 18 8 -56.54%
BenchmarkMemmove128 21 9 -57.45%
BenchmarkMemmove256 25 10 -57.09%
BenchmarkMemmove512 33 32 -3.25%
BenchmarkMemmove1024 44 43 -2.27%
BenchmarkMemmove2048 61 60 -2.28%
BenchmarkMemmove4096 95 94 -1.36%
Intel(R) Atom(TM) CPU N455 @ 1.66GHz
benchmark old ns/op new ns/op delta
BenchmarkMemmove0 20 20 +0.00%
BenchmarkMemmove1 27 27 +0.00%
BenchmarkMemmove2 72 43 -40.28%
BenchmarkMemmove3 73 43 -40.36%
BenchmarkMemmove4 74 43 -41.40%
BenchmarkMemmove5 75 44 -41.59%
BenchmarkMemmove6 77 44 -42.75%
BenchmarkMemmove7 77 44 -43.39%
BenchmarkMemmove8 69 44 -35.94%
BenchmarkMemmove9 73 44 -39.35%
BenchmarkMemmove10 75 44 -40.27%
BenchmarkMemmove11 76 44 -41.26%
BenchmarkMemmove12 77 44 -42.19%
BenchmarkMemmove13 78 44 -43.06%
BenchmarkMemmove14 79 44 -43.93%
BenchmarkMemmove15 81 44 -44.81%
BenchmarkMemmove16 70 44 -36.18%
BenchmarkMemmove32 72 53 -26.62%
BenchmarkMemmove64 77 68 -11.74%
BenchmarkMemmove128 87 98 +13.10%
BenchmarkMemmove256 106 158 +49.06%
BenchmarkMemmove512 147 140 -4.76%
BenchmarkMemmove1024 225 218 -3.11%
BenchmarkMemmove2048 300 293 -2.33%
BenchmarkMemmove4096 710 702 -1.13%
(In one iteration, the numbers for 2048 and 4096 jumped to +30/-31; repeated runs couldn't reproduce that.)
Message from iant@golang.org
2013-05-17 17:33 UTC
LGTM
Message from khr@golang.org
2013-05-17 19:53 UTC
*** Submitted as https://code.google.com/p/go/source/detail?r=4cb93e2900d0 ***
runtime: faster x86 memmove (a.k.a. built-in copy())
REP instructions have a high startup cost, so we handle small
sizes with some straightline code. The REP MOVSx instructions
are really fast for large sizes. The cutover is approximately
1K. We implement up to 128/256 because that is the maximum
SSE register load (loading all data into registers before any
stores lets us ignore copy direction).
(on a Sandy Bridge E5-1650 @ 3.20GHz)
benchmark old ns/op new ns/op delta
BenchmarkMemmove0 3 3 +0.86%
BenchmarkMemmove1 5 5 +5.40%
BenchmarkMemmove2 18 8 -56.84%
BenchmarkMemmove3 18 7 -58.45%
BenchmarkMemmove4 36 7 -78.63%
BenchmarkMemmove5 36 8 -77.91%
BenchmarkMemmove6 36 8 -77.76%
BenchmarkMemmove7 36 8 -77.82%
BenchmarkMemmove8 18 8 -56.33%
BenchmarkMemmove9 18 7 -58.34%
BenchmarkMemmove10 18 7 -58.34%
BenchmarkMemmove11 18 7 -58.45%
BenchmarkMemmove12 36 7 -78.51%
BenchmarkMemmove13 36 7 -78.48%
BenchmarkMemmove14 36 7 -78.56%
BenchmarkMemmove15 36 7 -78.56%
BenchmarkMemmove16 18 7 -58.24%
BenchmarkMemmove32 18 8 -54.33%
BenchmarkMemmove64 18 8 -53.37%
BenchmarkMemmove128 20 9 -55.93%
BenchmarkMemmove256 25 11 -55.16%
BenchmarkMemmove512 33 33 -1.19%
BenchmarkMemmove1024 43 44 +2.06%
BenchmarkMemmove2048 61 61 +0.16%
BenchmarkMemmove4096 95 95 +0.00%
R=golang-dev, bradfitz, remyoudompheng, khr, iant, dominik.honnef
CC=golang-dev
https://codereview.appspot.com/9038048