|
|
Created:
11 years, 9 months ago by khr Modified:
11 years, 8 months ago Reviewers:
dave, albert.strasheim CC:
bradfitz, r, khr1, remyoudompheng, minux1, ality, golang-dev Visibility:
Public. |
Descriptionruntime: Implement faster equals for strings and bytes.
(amd64)
benchmark old ns/op new ns/op delta
BenchmarkEqual0 16 6 -63.15%
BenchmarkEqual9 22 7 -65.37%
BenchmarkEqual32 36 9 -74.91%
BenchmarkEqual4K 2187 120 -94.51%
benchmark old MB/s new MB/s speedup
BenchmarkEqual9 392.22 1134.38 2.89x
BenchmarkEqual32 866.72 3457.39 3.99x
BenchmarkEqual4K 1872.73 33998.87 18.15x
(386)
benchmark old ns/op new ns/op delta
BenchmarkEqual0 16 5 -63.85%
BenchmarkEqual9 22 7 -67.84%
BenchmarkEqual32 34 12 -64.94%
BenchmarkEqual4K 2196 113 -94.85%
benchmark old MB/s new MB/s speedup
BenchmarkEqual9 405.81 1260.18 3.11x
BenchmarkEqual32 919.55 2631.21 2.86x
BenchmarkEqual4K 1864.85 36072.54 19.34x
Update issue 3751
Patch Set 1 #Patch Set 2 : diff -r ff83e39f1348 https://khr%40golang.org@code.google.com/p/go/ #Patch Set 3 : diff -r d040d5f08d5d https://khr%40golang.org@code.google.com/p/go/ #Patch Set 4 : diff -r d040d5f08d5d https://khr%40golang.org@code.google.com/p/go/ #Patch Set 5 : diff -r d040d5f08d5d https://khr%40golang.org@code.google.com/p/go/ #Patch Set 6 : diff -r d040d5f08d5d https://khr%40golang.org@code.google.com/p/go/ #
Total comments: 2
Patch Set 7 : diff -r d040d5f08d5d https://khr%40golang.org@code.google.com/p/go/ #
Total comments: 2
Patch Set 8 : diff -r 2433c3567273 https://khr%40golang.org@code.google.com/p/go/ #
Total comments: 6
Patch Set 9 : diff -r 7505bf6d8988 https://khr%40golang.org@code.google.com/p/go/ #Patch Set 10 : diff -r 7505bf6d8988 https://khr%40golang.org@code.google.com/p/go/ #
Total comments: 2
Patch Set 11 : diff -r 7505bf6d8988 https://khr%40golang.org@code.google.com/p/go/ #Patch Set 12 : diff -r 52e3407d249f https://khr%40golang.org@code.google.com/p/go/ #Patch Set 13 : diff -r 52e3407d249f https://khr%40golang.org@code.google.com/p/go/ #Patch Set 14 : diff -r 52e3407d249f https://khr%40golang.org@code.google.com/p/go/ #
MessagesTotal messages: 49
Hello bradfitz@golang.org (cc: golang-dev@googlegroups.com), I'd like you to review this change to https://khr%40golang.org@code.google.com/p/go/
Sign in to reply to this message.
run go vet. i think it will ask for a few tweaks to the assembly
Sign in to reply to this message.
LGTM but I'm not qualified to review runtime/asm_arm.s On Wed, Mar 27, 2013 at 3:14 PM, <khr@golang.org> wrote: > Reviewers: bradfitz, > > Message: > Hello bradfitz@golang.org (cc: golang-dev@googlegroups.com), > > I'd like you to review this change to > https://khr%40golang.org@code.**google.com/p/go/<http://40golang.org@code.goo... > > > Description: > runtime: Implement faster equals for strings and bytes. > > (amd64) > benchmark old ns/op new ns/op delta > BenchmarkMegEqMap 1164712 103451 -91.12% > > (386) > benchmark old ns/op new ns/op delta > BenchmarkMegEqMap 4935048 171105 -96.53% > > (arm) > benchmark old ns/op new ns/op delta > BenchmarkMegEqMap 12018127 8923492 -25.75% > > Update issue 3751 > > Please review this at https://codereview.appspot.**com/8056043/<https://codereview.appspot.com/8056... > > Affected files: > M src/pkg/bytes/asm_386.s > M src/pkg/bytes/asm_amd64.s > M src/pkg/bytes/asm_arm.s > M src/pkg/bytes/bytes_test.go > M src/pkg/runtime/alg.c > M src/pkg/runtime/asm_386.s > M src/pkg/runtime/asm_amd64.s > M src/pkg/runtime/asm_arm.s > M src/pkg/runtime/hashmap.c > M src/pkg/runtime/mapspeed_test.**go > M src/pkg/runtime/runtime.h > M src/pkg/runtime/string.goc > > >
Sign in to reply to this message.
https://codereview.appspot.com/8056043/diff/11001/src/pkg/bytes/bytes_test.go File src/pkg/bytes/bytes_test.go (right): https://codereview.appspot.com/8056043/diff/11001/src/pkg/bytes/bytes_test.go... src/pkg/bytes/bytes_test.go:64: {[]byte("abcdefgh"), []byte("abcdefgh"), 0}, comment above this line about what these are testing? (alignment) https://codereview.appspot.com/8056043/diff/11001/src/pkg/runtime/mapspeed_te... File src/pkg/runtime/mapspeed_test.go (right): https://codereview.appspot.com/8056043/diff/11001/src/pkg/runtime/mapspeed_te... src/pkg/runtime/mapspeed_test.go:121: func BenchmarkMegEqMap(b *testing.B) { it's not clear what "Eq" in the benchmark names here. "Mostly equal"? (it ends in a Y) I'd just make the lookup key be all "X" too.
Sign in to reply to this message.
On Wed, Mar 27, 2013 at 4:35 PM, <bradfitz@golang.org> wrote: > > https://codereview.appspot.**com/8056043/diff/11001/src/** > pkg/bytes/bytes_test.go<https://codereview.appspot.com/8056043/diff/11001/src/pkg/bytes/bytes_test.go> > File src/pkg/bytes/bytes_test.go (right): > > https://codereview.appspot.**com/8056043/diff/11001/src/** > pkg/bytes/bytes_test.go#**newcode64<https://codereview.appspot.com/8056043/diff/11001/src/pkg/bytes/bytes_test.go#newcode64> > src/pkg/bytes/bytes_test.go:**64: {[]byte("abcdefgh"), []byte("abcdefgh"), > 0}, > comment above this line about what these are testing? (alignment) > Done. Not so much testing alignment as testing around the boundary of the chunk size of memeq. > > https://codereview.appspot.**com/8056043/diff/11001/src/** > pkg/runtime/mapspeed_test.go<https://codereview.appspot.com/8056043/diff/11001/src/pkg/runtime/mapspeed_test.go> > File src/pkg/runtime/mapspeed_test.**go (right): > > https://codereview.appspot.**com/8056043/diff/11001/src/** > pkg/runtime/mapspeed_test.go#**newcode121<https://codereview.appspot.com/8056043/diff/11001/src/pkg/runtime/mapspeed_test.go#newcode121> > src/pkg/runtime/mapspeed_test.**go:121: func BenchmarkMegEqMap(b > *testing.B) { > it's not clear what "Eq" in the benchmark names here. > > "Mostly equal"? (it ends in a Y) > > I'd just make the lookup key be all "X" too. > Done. > > https://codereview.appspot.**com/8056043/<https://codereview.appspot.com/8056... >
Sign in to reply to this message.
Thanks. I'll try this on some arm systems soon and get some benchmarks. I'd like to see data from the bytes package.
Sign in to reply to this message.
https://codereview.appspot.com/8056043/diff/18001/src/pkg/bytes/asm_386.s File src/pkg/bytes/asm_386.s (right): https://codereview.appspot.com/8056043/diff/18001/src/pkg/bytes/asm_386.s#new... src/pkg/bytes/asm_386.s:19: TEXT ·Equal(SB),7,$12 I would like to suggest that this function moves into the runtime package (see the way we do time.now() as an example) so that we have all the moving parts closer together.
Sign in to reply to this message.
Some excellent speedups on arm (chromebook) for Equals, but a regression on some other benchmarks. A very brief look at the profile makes me want to blame the additional branch in Index although there doesn't appear to be any stack splitting. benchmark old ns/op new ns/op delta BenchmarkReadString 125100 127392 +1.83% BenchmarkIndexByte32 76 90 +18.74% BenchmarkIndexByte4K 7605 7298 -4.04% BenchmarkIndexByte4M 7574065 7567863 -0.08% BenchmarkIndexByte64M 120986945 120754408 -0.19% BenchmarkIndexBytePortable32 175 175 +0.00% BenchmarkIndexBytePortable4K 17347 17338 -0.05% BenchmarkIndexBytePortable4M 18353531 18341798 -0.06% BenchmarkIndexBytePortable64M 293546342 293211533 -0.11% BenchmarkEqual32 77 63 -17.81% BenchmarkEqual4K 7289 1855 -74.55% BenchmarkEqual4M 8367520 3209180 -61.65% BenchmarkEqual64M 142515029 61400849 -56.92% BenchmarkEqualPort32 224 224 +0.00% BenchmarkEqualPort4K 24263 24250 -0.05% BenchmarkEqualPortable4M 25179603 25247419 +0.27% BenchmarkEqualPortable64M 403745450 403779625 +0.01% BenchmarkIndex32 816 1302 +59.56% BenchmarkIndex4K 120982 192998 +59.53% BenchmarkIndex4M 124260183 193296808 +55.56% BenchmarkIndex64M 1988046002 3095359752 +55.70% BenchmarkIndexEasy32 144 161 +11.81% BenchmarkIndexEasy4K 7352 7369 +0.23% BenchmarkIndexEasy4M 7578549 7575628 -0.04% BenchmarkIndexEasy64M 120737458 120832933 +0.08% BenchmarkCount32 818 1241 +51.71% BenchmarkCount4K 120977 188073 +55.46% BenchmarkCount4M 124265360 192719445 +55.09% BenchmarkCount64M 1988792002 3085427626 +55.14% BenchmarkCountEasy32 142 162 +14.08% BenchmarkCountEasy4K 7349 7361 +0.16% BenchmarkCountEasy4M 7574511 7574003 -0.01% BenchmarkCountEasy64M 120729902 120718660 -0.01% BenchmarkFields 114759929 115037127 +0.24% BenchmarkFieldsFunc 113932977 113976645 +0.04% BenchmarkTrimSpace 345 339 -1.74% benchmark old MB/s new MB/s speedup BenchmarkReadString 261.93 257.22 0.98x BenchmarkIndexByte32 419.35 353.35 0.84x BenchmarkIndexByte4K 538.54 561.22 1.04x BenchmarkIndexByte4M 553.77 554.23 1.00x BenchmarkIndexByte64M 554.68 555.75 1.00x BenchmarkIndexBytePortable32 182.72 182.41 1.00x BenchmarkIndexBytePortable4K 236.11 236.24 1.00x BenchmarkIndexBytePortable4M 228.53 228.67 1.00x BenchmarkIndexBytePortable64M 228.61 228.88 1.00x BenchmarkEqual32 412.82 502.15 1.22x BenchmarkEqual4K 561.90 2207.99 3.93x BenchmarkEqual4M 501.26 1306.97 2.61x BenchmarkEqual64M 470.89 1092.96 2.32x BenchmarkEqualPort32 142.66 142.81 1.00x BenchmarkEqualPort4K 168.82 168.91 1.00x BenchmarkEqualPortable4M 166.58 166.13 1.00x BenchmarkEqualPortable64M 166.22 166.20 1.00x BenchmarkIndex32 39.19 24.56 0.63x BenchmarkIndex4K 33.86 21.22 0.63x BenchmarkIndex4M 33.75 21.70 0.64x BenchmarkIndex64M 33.76 21.68 0.64x BenchmarkIndexEasy32 221.74 198.25 0.89x BenchmarkIndexEasy4K 557.11 555.79 1.00x BenchmarkIndexEasy4M 553.44 553.66 1.00x BenchmarkIndexEasy64M 555.82 555.39 1.00x BenchmarkCount32 39.08 25.78 0.66x BenchmarkCount4K 33.86 21.78 0.64x BenchmarkCount4M 33.75 21.76 0.64x BenchmarkCount64M 33.74 21.75 0.64x BenchmarkCountEasy32 225.27 196.43 0.87x BenchmarkCountEasy4K 557.32 556.37 1.00x BenchmarkCountEasy4M 553.74 553.78 1.00x BenchmarkCountEasy64M 555.86 555.91 1.00x BenchmarkFields 9.14 9.12 1.00x BenchmarkFieldsFunc 9.20 9.20 1.00x
Sign in to reply to this message.
It looks like both the slow Index and Count ones are calling eq with 7-byte slices. For some reason that parameter makes it slow. I'll see what I can do to fix that, 7-byte eq is probably pretty common. On Wed, Mar 27, 2013 at 8:55 PM, Dave Cheney <dave@cheney.net> wrote: > Some excellent speedups on arm (chromebook) for Equals, but a > regression on some other benchmarks. A very brief look at the profile > makes me want to blame the additional branch in Index although there > doesn't appear to be any stack splitting. > > benchmark old ns/op new ns/op delta > BenchmarkReadString 125100 127392 +1.83% > BenchmarkIndexByte32 76 90 +18.74% > BenchmarkIndexByte4K 7605 7298 -4.04% > BenchmarkIndexByte4M 7574065 7567863 -0.08% > BenchmarkIndexByte64M 120986945 120754408 -0.19% > BenchmarkIndexBytePortable32 175 175 +0.00% > BenchmarkIndexBytePortable4K 17347 17338 -0.05% > BenchmarkIndexBytePortable4M 18353531 18341798 -0.06% > BenchmarkIndexBytePortable64M 293546342 293211533 -0.11% > BenchmarkEqual32 77 63 -17.81% > BenchmarkEqual4K 7289 1855 -74.55% > BenchmarkEqual4M 8367520 3209180 -61.65% > BenchmarkEqual64M 142515029 61400849 -56.92% > BenchmarkEqualPort32 224 224 +0.00% > BenchmarkEqualPort4K 24263 24250 -0.05% > BenchmarkEqualPortable4M 25179603 25247419 +0.27% > BenchmarkEqualPortable64M 403745450 403779625 +0.01% > BenchmarkIndex32 816 1302 +59.56% > BenchmarkIndex4K 120982 192998 +59.53% > BenchmarkIndex4M 124260183 193296808 +55.56% > BenchmarkIndex64M 1988046002 3095359752 +55.70% > BenchmarkIndexEasy32 144 161 +11.81% > BenchmarkIndexEasy4K 7352 7369 +0.23% > BenchmarkIndexEasy4M 7578549 7575628 -0.04% > BenchmarkIndexEasy64M 120737458 120832933 +0.08% > BenchmarkCount32 818 1241 +51.71% > BenchmarkCount4K 120977 188073 +55.46% > BenchmarkCount4M 124265360 192719445 +55.09% > BenchmarkCount64M 1988792002 3085427626 +55.14% > BenchmarkCountEasy32 142 162 +14.08% > BenchmarkCountEasy4K 7349 7361 +0.16% > BenchmarkCountEasy4M 7574511 7574003 -0.01% > BenchmarkCountEasy64M 120729902 120718660 -0.01% > BenchmarkFields 114759929 115037127 +0.24% > BenchmarkFieldsFunc 113932977 113976645 +0.04% > BenchmarkTrimSpace 345 339 -1.74% > > benchmark old MB/s new MB/s speedup > BenchmarkReadString 261.93 257.22 0.98x > BenchmarkIndexByte32 419.35 353.35 0.84x > BenchmarkIndexByte4K 538.54 561.22 1.04x > BenchmarkIndexByte4M 553.77 554.23 1.00x > BenchmarkIndexByte64M 554.68 555.75 1.00x > BenchmarkIndexBytePortable32 182.72 182.41 1.00x > BenchmarkIndexBytePortable4K 236.11 236.24 1.00x > BenchmarkIndexBytePortable4M 228.53 228.67 1.00x > BenchmarkIndexBytePortable64M 228.61 228.88 1.00x > BenchmarkEqual32 412.82 502.15 1.22x > BenchmarkEqual4K 561.90 2207.99 3.93x > BenchmarkEqual4M 501.26 1306.97 2.61x > BenchmarkEqual64M 470.89 1092.96 2.32x > BenchmarkEqualPort32 142.66 142.81 1.00x > BenchmarkEqualPort4K 168.82 168.91 1.00x > BenchmarkEqualPortable4M 166.58 166.13 1.00x > BenchmarkEqualPortable64M 166.22 166.20 1.00x > BenchmarkIndex32 39.19 24.56 0.63x > BenchmarkIndex4K 33.86 21.22 0.63x > BenchmarkIndex4M 33.75 21.70 0.64x > BenchmarkIndex64M 33.76 21.68 0.64x > BenchmarkIndexEasy32 221.74 198.25 0.89x > BenchmarkIndexEasy4K 557.11 555.79 1.00x > BenchmarkIndexEasy4M 553.44 553.66 1.00x > BenchmarkIndexEasy64M 555.82 555.39 1.00x > BenchmarkCount32 39.08 25.78 0.66x > BenchmarkCount4K 33.86 21.78 0.64x > BenchmarkCount4M 33.75 21.76 0.64x > BenchmarkCount64M 33.74 21.75 0.64x > BenchmarkCountEasy32 225.27 196.43 0.87x > BenchmarkCountEasy4K 557.32 556.37 1.00x > BenchmarkCountEasy4M 553.74 553.78 1.00x > BenchmarkCountEasy64M 555.86 555.91 1.00x > BenchmarkFields 9.14 9.12 1.00x > BenchmarkFieldsFunc 9.20 9.20 1.00x >
Sign in to reply to this message.
I have added the following benchmarks to runtime/string_test.go: func BenchmarkCompareStringBigUnaligned(b *testing.B) { bytes := make([]byte, 0, 1<<20) for len(bytes) < 1<<20 { bytes = append(bytes, "Hello Gophers!"...) } s1, s2 := string(bytes), "hello"+string(bytes) for i := 0; i < b.N; i++ { if s1 != s2[len("hello"):] { b.Fatal("s1 != s2") } } b.SetBytes(int64(len(s1))) } func BenchmarkCompareStringBig(b *testing.B) { bytes := make([]byte, 0, 1<<20) for len(bytes) < 1<<20 { bytes = append(bytes, "Hello Gophers!"...) } s1, s2 := string(bytes), string(bytes) for i := 0; i < b.N; i++ { if s1 != s2 { b.Fatal("s1 != s2") } } b.SetBytes(int64(len(s1))) } And obtained the following results on ARM (ODROID-X) benchmark old ns/op new ns/op delta BenchmarkCompareStringBigUnaligned 11285388 2904806 -74.26% BenchmarkCompareStringBig 10686704 1807713 -83.08% benchmark old MB/s new MB/s speedup BenchmarkCompareStringBigUnaligned 92.92 360.98 3.88x BenchmarkCompareStringBig 98.12 580.06 5.91x I could squeeze a bit more (20%) on ARM by using the following form for the unaligned case (Dave, do you want to run your benchmarks on this?): // B is not aligned. Compare the word at A // with the value constructed from the // two words near B. // Note: this code is little-endian specific. // Beware that R9, R10 are reserved for m, g. // R11 can be used by linker. MOVW (R4), R7 one_aligned: CMP $4, R3 BLO tail MOVW R7>>R5, R12 MOVW.W 4(R4), R7 ORR R7<<R6, R12 MOVW.P 4(R1), R8 ADD $4, R2 SUB $4, R3 CMP R8, R12 BEQ one_aligned MOVW $0, R0 RET
Sign in to reply to this message.
https://codereview.appspot.com/8056043/diff/18001/src/pkg/bytes/bytes_test.go File src/pkg/bytes/bytes_test.go (right): https://codereview.appspot.com/8056043/diff/18001/src/pkg/bytes/bytes_test.go... src/pkg/bytes/bytes_test.go:92: I would have added a test like this: func TestEqual(t *testing.T) { b1 := make([]byte, 64) b2 := make([]byte, 64) for l := 0; l <= 56; l++ { for off1 := 0; off1 < 8; off1++ { for off2 := 0; off2 < 8; off2++ { for diffoff := 0; diffoff <= l; diffoff++ { for i := range b1 { b1[i], b2[i] = 0, 0 } for i := 0; i < l; i++ { b1[off1+i], b2[off2+i] = byte(i), byte(i) } b1[off1+diffoff], b2[off2+diffoff] = 'x', 'y' cmp := Equal(b1[off1:off1+l], b2[off2:off2+l]) if cmp != (diffoff == l) { t.Errorf("Equal(%q, %q) = %v", b1[off1:off1+l], b2[off2:off2+l], cmp) } } } } } }
Sign in to reply to this message.
On 2013/03/28 07:31:24, remyoudompheng wrote: > I could squeeze a bit more (20%) on ARM by using the following form for the > unaligned case (Dave, do you want to run your benchmarks on this?): > > // B is not aligned. Compare the word at A > // with the value constructed from the > // two words near B. > // Note: this code is little-endian specific. > // Beware that R9, R10 are reserved for m, g. > // R11 can be used by linker. > MOVW (R4), R7 > one_aligned: > CMP $4, R3 > BLO tail > MOVW R7>>R5, R12 > MOVW.W 4(R4), R7 > ORR R7<<R6, R12 > MOVW.P 4(R1), R8 > ADD $4, R2 > SUB $4, R3 > CMP R8, R12 > BEQ one_aligned > MOVW $0, R0 > RET More precisely, I obtained this result compared to the original proposal: benchmark old ns/op new ns/op delta BenchmarkCompareStringBigUnaligned 2894508 2447519 -15.44% BenchmarkCompareStringBig 1802048 1811376 +0.52% benchmark old MB/s new MB/s speedup BenchmarkCompareStringBigUnaligned 362.27 428.43 1.18x BenchmarkCompareStringBig 581.89 578.89 0.99x
Sign in to reply to this message.
Dave: I've moved bytes.Equal to the runtime files. Good idea, that puts all that code in one place. Try your benchmarks again, I've special-cased <8 byte comparisons, they should be only slightly slower than before. Remy: I've included your test and your optimization, thanks. Did you check in your benchmarks, or should I include them in my patch? Brad: I've changed the amd64/386 code quite a bit, seems that REP;CMPSx has a significant startup cost and we're better off just explicitly looping. On Thu, Mar 28, 2013 at 12:51 AM, <remyoudompheng@gmail.com> wrote: > On 2013/03/28 07:31:24, remyoudompheng wrote: > >> I could squeeze a bit more (20%) on ARM by using the following form >> > for the > >> unaligned case (Dave, do you want to run your benchmarks on this?): >> > > // B is not aligned. Compare the word at A >> // with the value constructed from the >> // two words near B. >> // Note: this code is little-endian specific. >> // Beware that R9, R10 are reserved for m, g. >> // R11 can be used by linker. >> MOVW (R4), R7 >> one_aligned: >> CMP $4, R3 >> BLO tail >> MOVW R7>>R5, R12 >> MOVW.W 4(R4), R7 >> ORR R7<<R6, R12 >> MOVW.P 4(R1), R8 >> ADD $4, R2 >> SUB $4, R3 >> CMP R8, R12 >> BEQ one_aligned >> MOVW $0, R0 >> RET >> > > More precisely, I obtained this result compared to the original > proposal: > > > benchmark old ns/op new ns/op delta > BenchmarkCompareStringBigUnali**gned 2894508 2447519 -15.44% > BenchmarkCompareStringBig 1802048 1811376 +0.52% > > > benchmark old MB/s new MB/s speedup > BenchmarkCompareStringBigUnali**gned 362.27 428.43 1.18x > BenchmarkCompareStringBig 581.89 578.89 0.99x > > > https://codereview.appspot.**com/8056043/<https://codereview.appspot.com/8056... >
Sign in to reply to this message.
LGTM Assembly people should chime in, though. https://codereview.appspot.com/8056043/diff/39001/src/pkg/bytes/bytes_test.go File src/pkg/bytes/bytes_test.go (right): https://codereview.appspot.com/8056043/diff/39001/src/pkg/bytes/bytes_test.go... src/pkg/bytes/bytes_test.go:335: buf := make([]byte, 4) or: var buf [4]byte https://codereview.appspot.com/8056043/diff/39001/src/pkg/runtime/asm_386.s File src/pkg/runtime/asm_386.s (right): https://codereview.appspot.com/8056043/diff/39001/src/pkg/runtime/asm_386.s#n... src/pkg/runtime/asm_386.s:996: TEXT bytes·Equal(SB),7,$0 This seems like new precedent. At least leave a comment above "func equalPortable" in pkg/bytes/bytes.go directing people where to find Equal. Like: // func Equal is implemented in assembly in pkg/runtime for // amd64, 386, and arm. Then grepping for "func Equal" finds it. https://codereview.appspot.com/8056043/diff/39001/src/pkg/runtime/mapspeed_te... File src/pkg/runtime/mapspeed_test.go (right): https://codereview.appspot.com/8056043/diff/39001/src/pkg/runtime/mapspeed_te... src/pkg/runtime/mapspeed_test.go:124: key := strings.Repeat("X", 1<<20) move this up a line and just say m[key] = true
Sign in to reply to this message.
On 2013/03/29 07:06:48, khr1 wrote: > Dave: > I've moved bytes.Equal to the runtime files. Good idea, that puts all that > code in one place. > Try your benchmarks again, I've special-cased <8 byte comparisons, they > should be only slightly slower than before. > > Remy: > I've included your test and your optimization, thanks. > Did you check in your benchmarks, or should I include them in my patch? Yes, please include my string_test.go proposals if you wish. Did you update the benchmarks in issue description? What is the speed difference between REP MOVSQ and the SSE2 way on amd64?
Sign in to reply to this message.
On Fri, Mar 29, 2013 at 12:26 AM, <bradfitz@golang.org> wrote: > LGTM > > Assembly people should chime in, though. > > > > https://codereview.appspot.**com/8056043/diff/39001/src/** > pkg/bytes/bytes_test.go<https://codereview.appspot.com/8056043/diff/39001/src/pkg/bytes/bytes_test.go> > File src/pkg/bytes/bytes_test.go (right): > > https://codereview.appspot.**com/8056043/diff/39001/src/** > pkg/bytes/bytes_test.go#**newcode335<https://codereview.appspot.com/8056043/diff/39001/src/pkg/bytes/bytes_test.go#newcode335> > src/pkg/bytes/bytes_test.go:**335: buf := make([]byte, 4) > or: > var buf [4]byte > > https://codereview.appspot.**com/8056043/diff/39001/src/** > pkg/runtime/asm_386.s<https://codereview.appspot.com/8056043/diff/39001/src/pkg/runtime/asm_386.s> > File src/pkg/runtime/asm_386.s (right): > > https://codereview.appspot.**com/8056043/diff/39001/src/** > pkg/runtime/asm_386.s#**newcode996<https://codereview.appspot.com/8056043/diff/39001/src/pkg/runtime/asm_386.s#newcode996> > src/pkg/runtime/asm_386.s:996: TEXT bytes·Equal(SB),7,$0 > This seems like new precedent. At least leave a comment above "func > equalPortable" in pkg/bytes/bytes.go directing people where to find > Equal. > > Like: > > // func Equal is implemented in assembly in pkg/runtime for > // amd64, 386, and arm. > > Then grepping for "func Equal" finds it. > > https://codereview.appspot.**com/8056043/diff/39001/src/** > pkg/runtime/mapspeed_test.go<https://codereview.appspot.com/8056043/diff/39001/src/pkg/runtime/mapspeed_test.go> > File src/pkg/runtime/mapspeed_test.**go (right): > > https://codereview.appspot.**com/8056043/diff/39001/src/** > pkg/runtime/mapspeed_test.go#**newcode124<https://codereview.appspot.com/8056043/diff/39001/src/pkg/runtime/mapspeed_test.go#newcode124> > src/pkg/runtime/mapspeed_test.**go:124: key := strings.Repeat("X", 1<<20) > move this up a line and just say m[key] = true > I want two different strings so the memory comparison has to happen. With your change, it could (and does) detect equality by just comparing pointers. > > https://codereview.appspot.**com/8056043/<https://codereview.appspot.com/8056... >
Sign in to reply to this message.
https://codereview.appspot.com/8056043/diff/39001/src/pkg/runtime/asm_386.s File src/pkg/runtime/asm_386.s (right): https://codereview.appspot.com/8056043/diff/39001/src/pkg/runtime/asm_386.s#n... src/pkg/runtime/asm_386.s:1027: PCMPEQB X1, X0 Do we explicitly require SSE2 on 386 now? Maybe we should document what is required by the runtime code (this instruction for atomics, that one for blah...). I don't think SSE2 is a requirement for Go 1.1, but I don't remember exactly. I would adda CPUID check and switch to the REP MOVSL way if no SSE2 is avaiable.
Sign in to reply to this message.
The SSE way is about 2x as fast, at least when the strings are intermediate length. For really long strings there isn't much difference because it's limited by memory bandwidth. Those SSE2 instructions are already being used in bytes/asm_amd64.s:IndexByte, so they are OK for amd64 at least. I'll ask around about 386. I'll update the benchmark numbers tomorrow. On Fri, Mar 29, 2013 at 12:34 AM, <remyoudompheng@gmail.com> wrote: > On 2013/03/29 07:06:48, khr1 wrote: > >> Dave: >> I've moved bytes.Equal to the runtime files. Good idea, that puts all >> > that > >> code in one place. >> Try your benchmarks again, I've special-cased <8 byte comparisons, >> > they > >> should be only slightly slower than before. >> > > Remy: >> I've included your test and your optimization, thanks. >> Did you check in your benchmarks, or should I include them in my >> > patch? > > Yes, please include my string_test.go proposals if you wish. > Did you update the benchmarks in issue description? > > What is the speed difference between REP MOVSQ and the SSE2 way on > amd64? > > https://codereview.appspot.**com/8056043/<https://codereview.appspot.com/8056... >
Sign in to reply to this message.
Thanks Keith, I'll get to benchmarking. https://codereview.appspot.com/8056043/diff/39001/src/pkg/runtime/asm_386.s File src/pkg/runtime/asm_386.s (right): https://codereview.appspot.com/8056043/diff/39001/src/pkg/runtime/asm_386.s#n... src/pkg/runtime/asm_386.s:996: TEXT bytes·Equal(SB),7,$0 On 2013/03/29 07:26:09, bradfitz wrote: > This seems like new precedent. At least leave a comment above "func > equalPortable" in pkg/bytes/bytes.go directing people where to find Equal. > > Like: > > // func Equal is implemented in assembly in pkg/runtime for > // amd64, 386, and arm. > > Then grepping for "func Equal" finds it. This was my suggestion. I believe the are precedents from the time package. It does deserve a comment.
Sign in to reply to this message.
Looks very good compared to glibc 4.7.2 on a Core i7-3720QM: BenchmarkRuntimeMemequal1K 25993.11 MB/s BenchmarkRuntimeMemequal1M 15857.93 MB/s [2] BenchmarkRuntimeMemequal1G 9783.48 MB/s BenchmarkStringEqual1M 16991.31 MB/s vs BenchmarkMemcmp1K 12672.80 MB/s [1] BenchmarkMemcmp1M 20434.50 MB/s [2] BenchmarkMemcmp1G 9518.44 MB/s BenchmarkStrncmp1M 15551.37 MB/s [1] BenchmarkMemcmp1K has some cgo overhead. [2] This difference might be worth a look: looking at bytes.Equal vs memcmp: BenchmarkBytesEqual1K 29738.67 MB/s BenchmarkBytesEqual32K 22170.51 MB/s BenchmarkBytesEqual64K 22458.35 MB/s BenchmarkBytesEqual128K 21313.28 MB/s BenchmarkBytesEqual256K 16965.64 MB/s BenchmarkBytesEqual512K 16581.24 MB/s BenchmarkBytesEqual1M 16706.55 MB/s BenchmarkMemcmp32Kb 23958.85 MB/s BenchmarkMemcmp64K 26175.43 MB/s BenchmarkMemcmp128K 27440.97 MB/s BenchmarkMemcmp256K 25026.15 MB/s BenchmarkMemcmp512K 22581.35 MB/s BenchmarkMemcmp1M 19161.69 MB/s BenchmarkMemcmp2M 16326.83 MB/s The Go functions have some kind of crossover point at around 256K which only kicks in at around 2M in glibc. code here: github.com/alberts/gofrags
Sign in to reply to this message.
On Fri, Mar 29, 2013 at 3:53 PM, Keith Randall <khr@google.com> wrote: > The SSE way is about 2x as fast, at least when the strings are intermediate length. For really long strings there isn't much difference because it's limited by memory bandwidth. > Those SSE2 instructions are already being used in bytes/asm_amd64.s:IndexByte, so they are OK for amd64 at least. I'll ask around about 386. please use cpuid to detect sse2 on 386 as we still support as old as pentium mmx chips. sse2 is required feature for amd64 so using it is ok.
Sign in to reply to this message.
https://codereview.appspot.com/8056043/diff/39001/src/pkg/runtime/mapspeed_te... File src/pkg/runtime/mapspeed_test.go (right): https://codereview.appspot.com/8056043/diff/39001/src/pkg/runtime/mapspeed_te... src/pkg/runtime/mapspeed_test.go:124: key := strings.Repeat("X", 1<<20) On 2013/03/29 07:26:09, bradfitz wrote: > move this up a line and just say m[key] = true Oh, you probably did it this way so their .str fields were different, to avoid that fastpath. Leave a comment? key1 := strings.Repeat("X", 1<<20) key2 := strings.Repeat("X", 1<<20) // equal but different instance
Sign in to reply to this message.
> This was my suggestion. I believe the are precedents from the time > package. It does deserve a comment. There is already a declaration + comment for Equal in bytes_decl.go. I've updated the comment to point to runtime. > please use cpuid to detect sse2 on 386 as we still support as old as > pentium mmx chips. Done. > key1 := strings.Repeat("X", 1<<20) > key2 := strings.Repeat("X", 1<<20) // equal but different instance Done. > The Go functions have some kind of crossover point at around 256K which > only kicks in at around 2M in glibc. > [2] This difference might be worth a look: There are a lot of things that could be making a difference here. Maybe they do prefetching, maybe they do the extra work to use aligned loads, ... I'll take a look at the glibc code, it might have some useful ideas in it. On Fri, Mar 29, 2013 at 8:44 AM, <bradfitz@golang.org> wrote: > > https://codereview.appspot.**com/8056043/diff/39001/src/** > pkg/runtime/mapspeed_test.go<https://codereview.appspot.com/8056043/diff/39001/src/pkg/runtime/mapspeed_test.go> > File src/pkg/runtime/mapspeed_test.**go (right): > > https://codereview.appspot.**com/8056043/diff/39001/src/** > pkg/runtime/mapspeed_test.go#**newcode124<https://codereview.appspot.com/8056043/diff/39001/src/pkg/runtime/mapspeed_test.go#newcode124> > src/pkg/runtime/mapspeed_test.**go:124: key := strings.Repeat("X", 1<<20) > On 2013/03/29 07:26:09, bradfitz wrote: > >> move this up a line and just say m[key] = true >> > > Oh, you probably did it this way so their .str fields were different, to > avoid that fastpath. Leave a comment? > > key1 := strings.Repeat("X", 1<<20) > key2 := strings.Repeat("X", 1<<20) // equal but different instance > > https://codereview.appspot.**com/8056043/<https://codereview.appspot.com/8056... >
Sign in to reply to this message.
minux <minux.ma@gmail.com> once said: > On Fri, Mar 29, 2013 at 3:53 PM, Keith Randall <khr@google.com> wrote: > > I'll ask around about 386. > please use cpuid to detect sse2 on 386 as we still support as old as > pentium mmx chips. > sse2 is required feature for amd64 so using it is ok. According to my notes¹: We haven't supported the Pentium MMX since January of 2010. Around that time, the compiler started emitting FUCOMIP during floating point codegen so the bare minimum was a Pentium Pro. The math package started using FUCOMI at the same time. In May of 2012, Russ added the PREFETCH builtin for the garbage collector which causes 8c to generate PREFETCHNTA. This bumped the lower limit up to the Pentium III. I don't know how the PREFETCH instructions are decoded on processors older than the Pentium III. I was 11 when it came out. :) Does it end up as a NOP like PAUSE? If so, the oldest we support is a Pentium Pro. If not, the oldest is a Pentium III. Anthony 1. I keep a file of notes about how the toolchain differs from the original Plan 9 code.
Sign in to reply to this message.
https://codereview.appspot.com/8056043/diff/35002/src/pkg/bytes/bytes_test.go File src/pkg/bytes/bytes_test.go (right): https://codereview.appspot.com/8056043/diff/35002/src/pkg/bytes/bytes_test.go... src/pkg/bytes/bytes_test.go:94: b1 := make([]byte, 64) const size = 64 https://codereview.appspot.com/8056043/diff/35002/src/pkg/bytes/bytes_test.go... src/pkg/bytes/bytes_test.go:102: } outside the loop create zeros := make([]byte, size) then copy(b1, zeros) copy(b2, zeros) makes it very clear
Sign in to reply to this message.
Anthony Martin <ality@pbrane.org> once said: > minux <minux.ma@gmail.com> once said: > > On Fri, Mar 29, 2013 at 3:53 PM, Keith Randall <khr@google.com> wrote: > > > I'll ask around about 386. > > please use cpuid to detect sse2 on 386 as we still support as old as > > pentium mmx chips. > > sse2 is required feature for amd64 so using it is ok. > > According to my notes¹: > > We haven't supported the Pentium MMX since January of 2010. > Around that time, the compiler started emitting FUCOMIP > during floating point codegen so the bare minimum was a > Pentium Pro. > > The math package started using FUCOMI at the same time. > > In May of 2012, Russ added the PREFETCH builtin for the > garbage collector which causes 8c to generate PREFETCHNTA. > This bumped the lower limit up to the Pentium III. > > I don't know how the PREFETCH instructions are decoded > on processors older than the Pentium III. I was 11 when > it came out. :) > > Does it end up as a NOP like PAUSE? > If so, the oldest we support is a Pentium Pro. > If not, the oldest is a Pentium III. > > Anthony > > 1. I keep a file of notes about how the toolchain differs > from the original Plan 9 code. > Please s/Pentium Pro/Pentium II/ before reading. I forgot that the Pro didn't have MMX. Anthony
Sign in to reply to this message.
On Sat, Mar 30, 2013 at 1:58 AM, Anthony Martin <ality@pbrane.org> wrote: > According to my notes¹: > > We haven't supported the Pentium MMX since January of 2010. > Around that time, the compiler started emitting FUCOMIP > during floating point codegen so the bare minimum was a > Pentium Pro. oh, seems this one is correct, we require pentium pro. thank you for the correction. > In May of 2012, Russ added the PREFETCH builtin for the > garbage collector which causes 8c to generate PREFETCHNTA. > This bumped the lower limit up to the Pentium III. in CL 7323061, Russ disabled the generation of PREFETCH instructions when GO386=387 (i.e. sse2 is either not detected or explicitly disabled by the user), so this is a non-issue now and the minimum supported 386 hardware is still pentium pro.
Sign in to reply to this message.
minux <minux.ma@gmail.com> once said: > On Sat, Mar 30, 2013 at 1:58 AM, Anthony Martin <ality@pbrane.org> wrote: > > According to my notes¹: > > > > We haven't supported the Pentium MMX since January of 2010. > > Around that time, the compiler started emitting FUCOMIP > > during floating point codegen so the bare minimum was a > > Pentium Pro. > oh, seems this one is correct, we require pentium pro. > thank you for the correction. > > > In May of 2012, Russ added the PREFETCH builtin for the > > garbage collector which causes 8c to generate PREFETCHNTA. > > This bumped the lower limit up to the Pentium III. > in CL 7323061, Russ disabled the generation of PREFETCH instructions > when GO386=387 (i.e. sse2 is either not detected or explicitly disabled > by the user), so this is a non-issue now and the minimum supported > 386 hardware is still pentium pro. I see. I overlooked that CL. Thanks. We should mention somewhere that the minimum requirement is a P6 family processor (i686) with MMX extensions, i.e., the Pentium II. Anthony
Sign in to reply to this message.
On Sun, Mar 31, 2013 at 12:51 AM, Anthony Martin <ality@pbrane.org> wrote: > minux <minux.ma@gmail.com> once said: > We should mention somewhere that the minimum requirement is > a P6 family processor (i686) with MMX extensions, i.e., the > Pentium II. Agreed. For the record, we require MMX because sync/atomic uses MMX instructions for 64-bit atomic. we require Pentium-Pro (P6 class) because we use FUCOMI for FP comparison in 8g. because Pentium Pro is the only P6 class processor that doesn't support MMX, we need Pentium II or higher. I'm not familiar with AMD produce line, so could someone familiar with it supply the info for AMD CPU and update doc/install.html?
Sign in to reply to this message.
I benchmarked patchset 10 on my chromebook, the results are mixed. The regressions in the bytes benchmarks appear to flow through in to strings and fmt. lucky(~) % cat bytes.txt benchmark old ns/op new ns/op delta BenchmarkReadString 112131 127869 +14.04% BenchmarkBufferNotEmptyWriteRead 3634051 3681321 +1.30% BenchmarkBufferFullSmallReads 440993 441639 +0.15% BenchmarkIndexByte32 76 76 +0.00% BenchmarkIndexByte4K 7279 7286 +0.10% BenchmarkIndexByte4M 7533005 7564464 +0.42% BenchmarkIndexByte64M 120355191 120813462 +0.38% BenchmarkIndexBytePortable32 174 175 +0.57% BenchmarkIndexBytePortable4K 17302 17342 +0.23% BenchmarkIndexBytePortable4M 18276630 18349412 +0.40% BenchmarkIndexBytePortable64M 292254766 293383533 +0.39% BenchmarkEqual32 90 60 -33.22% BenchmarkEqual4K 7285 1853 -74.56% BenchmarkEqual4M 8829644 3517534 -60.16% BenchmarkEqual64M 138950637 59169793 -57.42% BenchmarkEqualPort32 223 224 +0.45% BenchmarkEqualPort4K 24201 24255 +0.22% BenchmarkEqualPortable4M 25031372 25131132 +0.40% BenchmarkEqualPortable64M 402236591 403740442 +0.37% BenchmarkIndex32 800 1182 +47.75% BenchmarkIndex4K 118306 178704 +51.05% BenchmarkIndex4M 121457245 183889629 +51.40% BenchmarkIndex64M 1943517375 2939630294 +51.25% BenchmarkIndexEasy32 142 142 +0.00% BenchmarkIndexEasy4K 7331 7336 +0.07% BenchmarkIndexEasy4M 7548436 7543304 -0.07% BenchmarkIndexEasy64M 120458129 120302727 -0.13% BenchmarkCount32 806 1182 +46.65% BenchmarkCount4K 118344 178649 +50.96% BenchmarkCount4M 121673656 183400741 +50.73% BenchmarkCount64M 1944794002 2934607085 +50.90% BenchmarkCountEasy32 144 141 -2.08% BenchmarkCountEasy4K 7360 7335 -0.34% BenchmarkCountEasy4M 7554226 7564146 +0.13% BenchmarkCountEasy64M 120317272 120715047 +0.33% BenchmarkFields 116213231 114222702 -1.71% BenchmarkFieldsFunc 114475948 113263035 -1.06% BenchmarkTrimSpace 342 336 -1.75% benchmark old MB/s new MB/s speedup BenchmarkReadString 292.23 256.26 0.88x BenchmarkIndexByte32 419.49 419.49 1.00x BenchmarkIndexByte4K 562.70 562.14 1.00x BenchmarkIndexByte4M 556.79 554.47 1.00x BenchmarkIndexByte64M 557.59 555.48 1.00x BenchmarkIndexBytePortable32 183.22 182.84 1.00x BenchmarkIndexBytePortable4K 236.72 236.18 1.00x BenchmarkIndexBytePortable4M 229.49 228.58 1.00x BenchmarkIndexBytePortable64M 229.62 228.74 1.00x BenchmarkEqual32 354.47 530.72 1.50x BenchmarkEqual4K 562.22 2210.19 3.93x BenchmarkEqual4M 475.03 1192.40 2.51x BenchmarkEqual64M 482.97 1134.17 2.35x BenchmarkEqualPort32 143.47 142.77 1.00x BenchmarkEqualPort4K 169.24 168.87 1.00x BenchmarkEqualPortable4M 167.56 166.90 1.00x BenchmarkEqualPortable64M 166.84 166.22 1.00x BenchmarkIndex32 39.97 27.07 0.68x BenchmarkIndex4K 34.62 22.92 0.66x BenchmarkIndex4M 34.53 22.81 0.66x BenchmarkIndex64M 34.53 22.83 0.66x BenchmarkIndexEasy32 224.99 224.98 1.00x BenchmarkIndexEasy4K 558.69 558.33 1.00x BenchmarkIndexEasy4M 555.65 556.03 1.00x BenchmarkIndexEasy64M 557.11 557.83 1.00x BenchmarkCount32 39.69 27.06 0.68x BenchmarkCount4K 34.61 22.93 0.66x BenchmarkCount4M 34.47 22.87 0.66x BenchmarkCount64M 34.51 22.87 0.66x BenchmarkCountEasy32 221.56 225.87 1.02x BenchmarkCountEasy4K 556.51 558.36 1.00x BenchmarkCountEasy4M 555.23 554.50 1.00x BenchmarkCountEasy64M 557.77 555.93 1.00x BenchmarkFields 9.02 9.18 1.02x BenchmarkFieldsFunc 9.16 9.26 1.01x lucky(~) % cat strings.txt benchmark old ns/op new ns/op delta BenchmarkGenericNoMatch 7555 7427 -1.69% BenchmarkGenericMatch1 29053 29279 +0.78% BenchmarkGenericMatch2 175206 188767 +7.74% BenchmarkSingleMaxSkipping 5981 5976 -0.08% BenchmarkSingleLongSuffixFail 6083 6043 -0.66% BenchmarkSingleMatch 376140 385756 +2.56% BenchmarkByteByteNoMatch 2301 2301 +0.00% BenchmarkByteByteMatch 4902 4913 +0.22% BenchmarkByteStringMatch 7346 7210 -1.85% BenchmarkHTMLEscapeNew 2305 2339 +1.48% BenchmarkHTMLEscapeOld 3783 3812 +0.77% BenchmarkByteByteReplaces 18253 18348 +0.52% BenchmarkByteByteMap 9875 9685 -1.92% BenchmarkIndexRune 246 254 +3.25% BenchmarkIndexRuneFastPath 130 116 -10.77% BenchmarkIndex 129 129 +0.00% BenchmarkMapNoChanges 799 843 +5.51% BenchmarkIndexHard1 7637311 7636978 -0.00% BenchmarkIndexHard2 7647121 7647579 +0.01% BenchmarkIndexHard3 7628415 7637353 +0.12% BenchmarkCountHard1 8250395 8260440 +0.12% BenchmarkCountHard2 8253451 8260638 +0.09% BenchmarkCountHard3 8250371 8263897 +0.16% BenchmarkIndexTorture 47275 47389 +0.24% BenchmarkCountTorture 49252 49139 -0.23% BenchmarkCountTortureOverlapping 52097076 29447134 -43.48% BenchmarkFields 90120216 89680439 -0.49% BenchmarkFieldsFunc 90004377 89325775 -0.75% BenchmarkSplit1 76455569 76104520 -0.46% BenchmarkSplit2 20392480 19932127 -2.26% BenchmarkSplit3 18983460 18571189 -2.17% benchmark old MB/s new MB/s speedup BenchmarkSingleMaxSkipping 1671.80 1673.30 1.00x BenchmarkSingleLongSuffixFail 164.71 165.78 1.01x BenchmarkSingleMatch 39.88 38.88 0.97x BenchmarkFields 11.64 11.69 1.00x BenchmarkFieldsFunc 11.65 11.74 1.01x lucky(~) % cat fmt.txt benchmark old ns/op new ns/op delta BenchmarkSprintfEmpty 424 424 +0.00% BenchmarkSprintfString 1499 1451 -3.20% BenchmarkSprintfInt 1261 1438 +14.04% BenchmarkSprintfIntInt 2042 2031 -0.54% BenchmarkSprintfPrefixedInt 2028 2165 +6.76% BenchmarkSprintfFloat 3579 3686 +2.99% BenchmarkManyArgs 5573 6130 +9.99% BenchmarkScanInts 5413886 5166830 -4.56% BenchmarkScanRecursiveInt 6447875 6478815 +0.48%
Sign in to reply to this message.
Which arch is your chromebook? On Sat, Mar 30, 2013 at 5:04 PM, <dave@cheney.net> wrote: > I benchmarked patchset 10 on my chromebook, the results are mixed. The > regressions in the bytes benchmarks appear to flow through in to strings > and fmt. > > lucky(~) % cat bytes.txt > > benchmark old ns/op new ns/op delta > BenchmarkReadString 112131 127869 +14.04% > BenchmarkBufferNotEmptyWriteRe**ad 3634051 3681321 +1.30% > BenchmarkBufferFullSmallReads 440993 441639 +0.15% > BenchmarkIndexByte32 76 76 +0.00% > BenchmarkIndexByte4K 7279 7286 +0.10% > BenchmarkIndexByte4M 7533005 7564464 +0.42% > BenchmarkIndexByte64M 120355191 120813462 +0.38% > BenchmarkIndexBytePortable32 174 175 +0.57% > BenchmarkIndexBytePortable4K 17302 17342 +0.23% > BenchmarkIndexBytePortable4M 18276630 18349412 +0.40% > BenchmarkIndexBytePortable64M 292254766 293383533 +0.39% > BenchmarkEqual32 90 60 -33.22% > BenchmarkEqual4K 7285 1853 -74.56% > BenchmarkEqual4M 8829644 3517534 -60.16% > BenchmarkEqual64M 138950637 59169793 -57.42% > BenchmarkEqualPort32 223 224 +0.45% > BenchmarkEqualPort4K 24201 24255 +0.22% > BenchmarkEqualPortable4M 25031372 25131132 +0.40% > BenchmarkEqualPortable64M 402236591 403740442 +0.37% > BenchmarkIndex32 800 1182 +47.75% > BenchmarkIndex4K 118306 178704 +51.05% > BenchmarkIndex4M 121457245 183889629 +51.40% > BenchmarkIndex64M 1943517375 2939630294 +51.25% > BenchmarkIndexEasy32 142 142 +0.00% > BenchmarkIndexEasy4K 7331 7336 +0.07% > BenchmarkIndexEasy4M 7548436 7543304 -0.07% > BenchmarkIndexEasy64M 120458129 120302727 -0.13% > BenchmarkCount32 806 1182 +46.65% > BenchmarkCount4K 118344 178649 +50.96% > BenchmarkCount4M 121673656 183400741 +50.73% > BenchmarkCount64M 1944794002 2934607085 +50.90% > BenchmarkCountEasy32 144 141 -2.08% > BenchmarkCountEasy4K 7360 7335 -0.34% > BenchmarkCountEasy4M 7554226 7564146 +0.13% > BenchmarkCountEasy64M 120317272 120715047 +0.33% > BenchmarkFields 116213231 114222702 -1.71% > BenchmarkFieldsFunc 114475948 113263035 -1.06% > BenchmarkTrimSpace 342 336 -1.75% > > > benchmark old MB/s new MB/s speedup > BenchmarkReadString 292.23 256.26 0.88x > BenchmarkIndexByte32 419.49 419.49 1.00x > BenchmarkIndexByte4K 562.70 562.14 1.00x > BenchmarkIndexByte4M 556.79 554.47 1.00x > BenchmarkIndexByte64M 557.59 555.48 1.00x > BenchmarkIndexBytePortable32 183.22 182.84 1.00x > BenchmarkIndexBytePortable4K 236.72 236.18 1.00x > BenchmarkIndexBytePortable4M 229.49 228.58 1.00x > BenchmarkIndexBytePortable64M 229.62 228.74 1.00x > BenchmarkEqual32 354.47 530.72 1.50x > BenchmarkEqual4K 562.22 2210.19 3.93x > BenchmarkEqual4M 475.03 1192.40 2.51x > BenchmarkEqual64M 482.97 1134.17 2.35x > BenchmarkEqualPort32 143.47 142.77 1.00x > BenchmarkEqualPort4K 169.24 168.87 1.00x > BenchmarkEqualPortable4M 167.56 166.90 1.00x > BenchmarkEqualPortable64M 166.84 166.22 1.00x > BenchmarkIndex32 39.97 27.07 0.68x > BenchmarkIndex4K 34.62 22.92 0.66x > BenchmarkIndex4M 34.53 22.81 0.66x > BenchmarkIndex64M 34.53 22.83 0.66x > BenchmarkIndexEasy32 224.99 224.98 1.00x > BenchmarkIndexEasy4K 558.69 558.33 1.00x > BenchmarkIndexEasy4M 555.65 556.03 1.00x > BenchmarkIndexEasy64M 557.11 557.83 1.00x > BenchmarkCount32 39.69 27.06 0.68x > BenchmarkCount4K 34.61 22.93 0.66x > BenchmarkCount4M 34.47 22.87 0.66x > BenchmarkCount64M 34.51 22.87 0.66x > BenchmarkCountEasy32 221.56 225.87 1.02x > BenchmarkCountEasy4K 556.51 558.36 1.00x > BenchmarkCountEasy4M 555.23 554.50 1.00x > BenchmarkCountEasy64M 557.77 555.93 1.00x > BenchmarkFields 9.02 9.18 1.02x > BenchmarkFieldsFunc 9.16 9.26 1.01x > lucky(~) % cat strings.txt > > benchmark old ns/op new ns/op delta > BenchmarkGenericNoMatch 7555 7427 -1.69% > BenchmarkGenericMatch1 29053 29279 +0.78% > BenchmarkGenericMatch2 175206 188767 +7.74% > BenchmarkSingleMaxSkipping 5981 5976 -0.08% > BenchmarkSingleLongSuffixFail 6083 6043 -0.66% > BenchmarkSingleMatch 376140 385756 +2.56% > BenchmarkByteByteNoMatch 2301 2301 +0.00% > BenchmarkByteByteMatch 4902 4913 +0.22% > BenchmarkByteStringMatch 7346 7210 -1.85% > BenchmarkHTMLEscapeNew 2305 2339 +1.48% > BenchmarkHTMLEscapeOld 3783 3812 +0.77% > BenchmarkByteByteReplaces 18253 18348 +0.52% > BenchmarkByteByteMap 9875 9685 -1.92% > BenchmarkIndexRune 246 254 +3.25% > BenchmarkIndexRuneFastPath 130 116 -10.77% > BenchmarkIndex 129 129 +0.00% > BenchmarkMapNoChanges 799 843 +5.51% > BenchmarkIndexHard1 7637311 7636978 -0.00% > BenchmarkIndexHard2 7647121 7647579 +0.01% > BenchmarkIndexHard3 7628415 7637353 +0.12% > BenchmarkCountHard1 8250395 8260440 +0.12% > BenchmarkCountHard2 8253451 8260638 +0.09% > BenchmarkCountHard3 8250371 8263897 +0.16% > BenchmarkIndexTorture 47275 47389 +0.24% > BenchmarkCountTorture 49252 49139 -0.23% > BenchmarkCountTortureOverlappi**ng 52097076 29447134 -43.48% > BenchmarkFields 90120216 89680439 -0.49% > BenchmarkFieldsFunc 90004377 89325775 -0.75% > BenchmarkSplit1 76455569 76104520 -0.46% > BenchmarkSplit2 20392480 19932127 -2.26% > BenchmarkSplit3 18983460 18571189 -2.17% > > > benchmark old MB/s new MB/s speedup > BenchmarkSingleMaxSkipping 1671.80 1673.30 1.00x > BenchmarkSingleLongSuffixFail 164.71 165.78 1.01x > BenchmarkSingleMatch 39.88 38.88 0.97x > BenchmarkFields 11.64 11.69 1.00x > BenchmarkFieldsFunc 11.65 11.74 1.01x > lucky(~) % cat fmt.txt > > benchmark old ns/op new ns/op delta > BenchmarkSprintfEmpty 424 424 +0.00% > BenchmarkSprintfString 1499 1451 -3.20% > BenchmarkSprintfInt 1261 1438 +14.04% > BenchmarkSprintfIntInt 2042 2031 -0.54% > BenchmarkSprintfPrefixedInt 2028 2165 +6.76% > BenchmarkSprintfFloat 3579 3686 +2.99% > BenchmarkManyArgs 5573 6130 +9.99% > BenchmarkScanInts 5413886 5166830 -4.56% > BenchmarkScanRecursiveInt 6447875 6478815 +0.48% > > > https://codereview.appspot.**com/8056043/<https://codereview.appspot.com/8056... >
Sign in to reply to this message.
This is the dual A15 Exynos 5 version. I can do some benchmarks on a pandaboard or a RPi but I don't expect much change. On Sun, Mar 31, 2013 at 11:54 AM, Brad Fitzpatrick <bradfitz@golang.org> wrote: > Which arch is your chromebook? > > > On Sat, Mar 30, 2013 at 5:04 PM, <dave@cheney.net> wrote: >> >> I benchmarked patchset 10 on my chromebook, the results are mixed. The >> regressions in the bytes benchmarks appear to flow through in to strings >> and fmt. >> >> lucky(~) % cat bytes.txt >> >> benchmark old ns/op new ns/op delta >> BenchmarkReadString 112131 127869 +14.04% >> BenchmarkBufferNotEmptyWriteRead 3634051 3681321 +1.30% >> BenchmarkBufferFullSmallReads 440993 441639 +0.15% >> BenchmarkIndexByte32 76 76 +0.00% >> BenchmarkIndexByte4K 7279 7286 +0.10% >> BenchmarkIndexByte4M 7533005 7564464 +0.42% >> BenchmarkIndexByte64M 120355191 120813462 +0.38% >> BenchmarkIndexBytePortable32 174 175 +0.57% >> BenchmarkIndexBytePortable4K 17302 17342 +0.23% >> BenchmarkIndexBytePortable4M 18276630 18349412 +0.40% >> BenchmarkIndexBytePortable64M 292254766 293383533 +0.39% >> BenchmarkEqual32 90 60 -33.22% >> BenchmarkEqual4K 7285 1853 -74.56% >> BenchmarkEqual4M 8829644 3517534 -60.16% >> BenchmarkEqual64M 138950637 59169793 -57.42% >> BenchmarkEqualPort32 223 224 +0.45% >> BenchmarkEqualPort4K 24201 24255 +0.22% >> BenchmarkEqualPortable4M 25031372 25131132 +0.40% >> BenchmarkEqualPortable64M 402236591 403740442 +0.37% >> BenchmarkIndex32 800 1182 +47.75% >> BenchmarkIndex4K 118306 178704 +51.05% >> BenchmarkIndex4M 121457245 183889629 +51.40% >> BenchmarkIndex64M 1943517375 2939630294 +51.25% >> BenchmarkIndexEasy32 142 142 +0.00% >> BenchmarkIndexEasy4K 7331 7336 +0.07% >> BenchmarkIndexEasy4M 7548436 7543304 -0.07% >> BenchmarkIndexEasy64M 120458129 120302727 -0.13% >> BenchmarkCount32 806 1182 +46.65% >> BenchmarkCount4K 118344 178649 +50.96% >> BenchmarkCount4M 121673656 183400741 +50.73% >> BenchmarkCount64M 1944794002 2934607085 +50.90% >> BenchmarkCountEasy32 144 141 -2.08% >> BenchmarkCountEasy4K 7360 7335 -0.34% >> BenchmarkCountEasy4M 7554226 7564146 +0.13% >> BenchmarkCountEasy64M 120317272 120715047 +0.33% >> BenchmarkFields 116213231 114222702 -1.71% >> BenchmarkFieldsFunc 114475948 113263035 -1.06% >> BenchmarkTrimSpace 342 336 -1.75% >> >> >> benchmark old MB/s new MB/s speedup >> BenchmarkReadString 292.23 256.26 0.88x >> BenchmarkIndexByte32 419.49 419.49 1.00x >> BenchmarkIndexByte4K 562.70 562.14 1.00x >> BenchmarkIndexByte4M 556.79 554.47 1.00x >> BenchmarkIndexByte64M 557.59 555.48 1.00x >> BenchmarkIndexBytePortable32 183.22 182.84 1.00x >> BenchmarkIndexBytePortable4K 236.72 236.18 1.00x >> BenchmarkIndexBytePortable4M 229.49 228.58 1.00x >> BenchmarkIndexBytePortable64M 229.62 228.74 1.00x >> BenchmarkEqual32 354.47 530.72 1.50x >> BenchmarkEqual4K 562.22 2210.19 3.93x >> BenchmarkEqual4M 475.03 1192.40 2.51x >> BenchmarkEqual64M 482.97 1134.17 2.35x >> BenchmarkEqualPort32 143.47 142.77 1.00x >> BenchmarkEqualPort4K 169.24 168.87 1.00x >> BenchmarkEqualPortable4M 167.56 166.90 1.00x >> BenchmarkEqualPortable64M 166.84 166.22 1.00x >> BenchmarkIndex32 39.97 27.07 0.68x >> BenchmarkIndex4K 34.62 22.92 0.66x >> BenchmarkIndex4M 34.53 22.81 0.66x >> BenchmarkIndex64M 34.53 22.83 0.66x >> BenchmarkIndexEasy32 224.99 224.98 1.00x >> BenchmarkIndexEasy4K 558.69 558.33 1.00x >> BenchmarkIndexEasy4M 555.65 556.03 1.00x >> BenchmarkIndexEasy64M 557.11 557.83 1.00x >> BenchmarkCount32 39.69 27.06 0.68x >> BenchmarkCount4K 34.61 22.93 0.66x >> BenchmarkCount4M 34.47 22.87 0.66x >> BenchmarkCount64M 34.51 22.87 0.66x >> BenchmarkCountEasy32 221.56 225.87 1.02x >> BenchmarkCountEasy4K 556.51 558.36 1.00x >> BenchmarkCountEasy4M 555.23 554.50 1.00x >> BenchmarkCountEasy64M 557.77 555.93 1.00x >> BenchmarkFields 9.02 9.18 1.02x >> BenchmarkFieldsFunc 9.16 9.26 1.01x >> lucky(~) % cat strings.txt >> >> benchmark old ns/op new ns/op delta >> BenchmarkGenericNoMatch 7555 7427 -1.69% >> BenchmarkGenericMatch1 29053 29279 +0.78% >> BenchmarkGenericMatch2 175206 188767 +7.74% >> BenchmarkSingleMaxSkipping 5981 5976 -0.08% >> BenchmarkSingleLongSuffixFail 6083 6043 -0.66% >> BenchmarkSingleMatch 376140 385756 +2.56% >> BenchmarkByteByteNoMatch 2301 2301 +0.00% >> BenchmarkByteByteMatch 4902 4913 +0.22% >> BenchmarkByteStringMatch 7346 7210 -1.85% >> BenchmarkHTMLEscapeNew 2305 2339 +1.48% >> BenchmarkHTMLEscapeOld 3783 3812 +0.77% >> BenchmarkByteByteReplaces 18253 18348 +0.52% >> BenchmarkByteByteMap 9875 9685 -1.92% >> BenchmarkIndexRune 246 254 +3.25% >> BenchmarkIndexRuneFastPath 130 116 -10.77% >> BenchmarkIndex 129 129 +0.00% >> BenchmarkMapNoChanges 799 843 +5.51% >> BenchmarkIndexHard1 7637311 7636978 -0.00% >> BenchmarkIndexHard2 7647121 7647579 +0.01% >> BenchmarkIndexHard3 7628415 7637353 +0.12% >> BenchmarkCountHard1 8250395 8260440 +0.12% >> BenchmarkCountHard2 8253451 8260638 +0.09% >> BenchmarkCountHard3 8250371 8263897 +0.16% >> BenchmarkIndexTorture 47275 47389 +0.24% >> BenchmarkCountTorture 49252 49139 -0.23% >> BenchmarkCountTortureOverlapping 52097076 29447134 -43.48% >> BenchmarkFields 90120216 89680439 -0.49% >> BenchmarkFieldsFunc 90004377 89325775 -0.75% >> BenchmarkSplit1 76455569 76104520 -0.46% >> BenchmarkSplit2 20392480 19932127 -2.26% >> BenchmarkSplit3 18983460 18571189 -2.17% >> >> >> benchmark old MB/s new MB/s speedup >> BenchmarkSingleMaxSkipping 1671.80 1673.30 1.00x >> BenchmarkSingleLongSuffixFail 164.71 165.78 1.01x >> BenchmarkSingleMatch 39.88 38.88 0.97x >> BenchmarkFields 11.64 11.69 1.00x >> BenchmarkFieldsFunc 11.65 11.74 1.01x >> lucky(~) % cat fmt.txt >> >> benchmark old ns/op new ns/op delta >> BenchmarkSprintfEmpty 424 424 +0.00% >> BenchmarkSprintfString 1499 1451 -3.20% >> BenchmarkSprintfInt 1261 1438 +14.04% >> BenchmarkSprintfIntInt 2042 2031 -0.54% >> BenchmarkSprintfPrefixedInt 2028 2165 +6.76% >> BenchmarkSprintfFloat 3579 3686 +2.99% >> BenchmarkManyArgs 5573 6130 +9.99% >> BenchmarkScanInts 5413886 5166830 -4.56% >> BenchmarkScanRecursiveInt 6447875 6478815 +0.48% >> >> >> https://codereview.appspot.com/8056043/ > >
Sign in to reply to this message.
Here are some linux/386 from a 4 core atom 330. 220887(~) % cat bytes.txt benchmark old ns/op new ns/op delta BenchmarkReadString 121057 126571 +4.55% BenchmarkBufferNotEmptyWriteRead 5503485 5454785 -0.88% BenchmarkBufferFullSmallReads 891073 889944 -0.13% BenchmarkIndexByte32 94 94 -0.11% BenchmarkIndexByte4K 7781 7813 +0.41% BenchmarkIndexByte4M 8619393 8483188 -1.58% BenchmarkIndexByte64M 136653719 135896031 -0.55% BenchmarkIndexBytePortable32 466 465 -0.21% BenchmarkIndexBytePortable4K 56721 56661 -0.11% BenchmarkIndexBytePortable4M 58213450 58148782 -0.11% BenchmarkIndexBytePortable64M 931072226 930535077 -0.06% BenchmarkEqual32 119 63 -46.64% BenchmarkEqual4K 10348 2443 -76.39% BenchmarkEqual4M 11544390 3224185 -72.07% BenchmarkEqual64M 185016250 56713834 -69.35% BenchmarkEqualPort32 782 785 +0.38% BenchmarkEqualPort4K 95370 96032 +0.69% BenchmarkEqualPortable4M 98030915 98312648 +0.29% BenchmarkEqualPortable64M 1568407685 1573236860 +0.31% BenchmarkIndex32 2319 1923 -17.08% BenchmarkIndex4K 354957 295392 -16.78% BenchmarkIndex4M 364731912 303230991 -16.86% BenchmarkIndex64M 2147483647 2147483647 -16.86% BenchmarkIndexEasy32 227 210 -7.49% BenchmarkIndexEasy4K 7899 7901 +0.03% BenchmarkIndexEasy4M 8486584 8516794 +0.36% BenchmarkIndexEasy64M 135771448 136296410 +0.39% BenchmarkCount32 1838 1460 -20.57% BenchmarkCount4K 288436 220826 -23.44% BenchmarkCount4M 286733235 227035260 -20.82% BenchmarkCount64M 2147483647 2147483647 -21.37% BenchmarkCountEasy32 210 193 -8.10% BenchmarkCountEasy4K 7888 7875 -0.16% BenchmarkCountEasy4M 8483266 8527767 +0.52% BenchmarkCountEasy64M 135677742 136246780 +0.42% BenchmarkFields 226913045 222574539 -1.91% BenchmarkFieldsFunc 226840521 222770626 -1.79% BenchmarkTrimSpace 477 503 +5.45% benchmark old MB/s new MB/s speedup BenchmarkReadString 270.68 258.89 0.96x BenchmarkIndexByte32 338.57 339.07 1.00x BenchmarkIndexByte4K 526.40 524.23 1.00x BenchmarkIndexByte4M 486.61 494.43 1.02x BenchmarkIndexByte64M 491.09 493.83 1.01x BenchmarkIndexBytePortable32 68.55 68.69 1.00x BenchmarkIndexBytePortable4K 72.21 72.29 1.00x BenchmarkIndexBytePortable4M 72.05 72.13 1.00x BenchmarkIndexBytePortable64M 72.08 72.12 1.00x BenchmarkEqual32 267.72 503.99 1.88x BenchmarkEqual4K 395.79 1676.25 4.24x BenchmarkEqual4M 363.32 1300.89 3.58x BenchmarkEqual64M 362.72 1183.29 3.26x BenchmarkEqualPort32 40.88 40.74 1.00x BenchmarkEqualPort4K 42.95 42.65 0.99x BenchmarkEqualPortable4M 42.79 42.66 1.00x BenchmarkEqualPortable64M 42.79 42.66 1.00x BenchmarkIndex32 13.80 16.64 1.21x BenchmarkIndex4K 11.54 13.87 1.20x BenchmarkIndex4M 11.50 13.83 1.20x BenchmarkIndex64M 11.50 13.83 1.20x BenchmarkIndexEasy32 140.52 152.16 1.08x BenchmarkIndexEasy4K 518.55 518.40 1.00x BenchmarkIndexEasy4M 494.23 492.47 1.00x BenchmarkIndexEasy64M 494.28 492.37 1.00x BenchmarkCount32 17.40 21.91 1.26x BenchmarkCount4K 14.20 18.55 1.31x BenchmarkCount4M 14.63 18.47 1.26x BenchmarkCount64M 14.53 18.48 1.27x BenchmarkCountEasy32 152.37 165.42 1.09x BenchmarkCountEasy4K 519.27 520.10 1.00x BenchmarkCountEasy4M 494.42 491.84 0.99x BenchmarkCountEasy64M 494.62 492.55 1.00x BenchmarkFields 4.62 4.71 1.02x BenchmarkFieldsFunc 4.62 4.71 1.02x 220887(~) % cat strings.txt benchmark old ns/op new ns/op delta BenchmarkGenericNoMatch 19988 18825 -5.82% BenchmarkGenericMatch1 68559 68689 +0.19% BenchmarkGenericMatch2 364551 364833 +0.08% BenchmarkSingleMaxSkipping 16873 15462 -8.36% BenchmarkSingleLongSuffixFail 29288 28511 -2.65% BenchmarkSingleMatch 826237 743121 -10.06% BenchmarkByteByteNoMatch 3571 3340 -6.47% BenchmarkByteByteMatch 9512 7961 -16.31% BenchmarkByteStringMatch 10368 10101 -2.58% BenchmarkHTMLEscapeNew 3159 3015 -4.56% BenchmarkHTMLEscapeOld 9431 9076 -3.76% BenchmarkByteByteReplaces 44356 44370 +0.03% BenchmarkByteByteMap 22273 21531 -3.33% BenchmarkIndexRune 606 603 -0.50% BenchmarkIndexRuneFastPath 451 454 +0.67% BenchmarkIndex 499 480 -3.81% BenchmarkMapNoChanges 2166 2145 -0.97% BenchmarkIndexHard1 20219289 18614626 -7.94% BenchmarkIndexHard2 18818291 18609084 -1.11% BenchmarkIndexHard3 18784650 18606772 -0.95% BenchmarkCountHard1 18776488 18589004 -1.00% BenchmarkCountHard2 18633691 18589560 -0.24% BenchmarkCountHard3 19331696 18586440 -3.86% BenchmarkIndexTorture 104744 104592 -0.15% BenchmarkCountTorture 104791 104574 -0.21% BenchmarkCountTortureOverlapping 80928224 61623014 -23.85% BenchmarkFields 171732962 165121290 -3.85% BenchmarkFieldsFunc 165616856 165492483 -0.08% BenchmarkSplit1 187759567 168792316 -10.10% BenchmarkSplit2 48277196 48175434 -0.21% BenchmarkSplit3 40313030 39335365 -2.43% benchmark old MB/s new MB/s speedup BenchmarkSingleMaxSkipping 592.64 646.75 1.09x BenchmarkSingleLongSuffixFail 34.21 35.14 1.03x BenchmarkSingleMatch 18.15 20.19 1.11x BenchmarkFields 6.11 6.35 1.04x BenchmarkFieldsFunc 6.33 6.34 1.00x 220887(~) % cat fmt.txt benchmark old ns/op new ns/op delta BenchmarkSprintfEmpty 681 646 -5.14% BenchmarkSprintfString 2417 2438 +0.87% BenchmarkSprintfInt 1904 1947 +2.26% BenchmarkSprintfIntInt 2835 2810 -0.88% BenchmarkSprintfPrefixedInt 3444 3487 +1.25% BenchmarkSprintfFloat 4997 4978 -0.38% BenchmarkManyArgs 10427 10465 +0.36% BenchmarkScanInts 7494022 7444609 -0.66% BenchmarkScanRecursiveInt 8661682 8842178 +2.08%
Sign in to reply to this message.
Here is a bit more data from the linxu/arm regression for Count4K before: Total: 118 samples 64 54.2% 54.2% 118 100.0% bytes.Count 54 45.8% 100.0% 54 45.8% bytes.Equal 0 0.0% 100.0% 64 54.2% bytes_test.BenchmarkCount4K http://dave.cheney.net/paste/bytes-before.svg after: Total: 180 samples 133 73.9% 73.9% 139 77.2% bytes.Count 38 21.1% 95.0% 38 21.1% runtime.memeqbody 9 5.0% 100.0% 180 100.0% bytes.Equal 0 0.0% 100.0% 137 76.1% bytes_test.BenchmarkCount4K 0 0.0% 100.0% 45 25.0% bytes_test.bmCount 0 0.0% 100.0% 4 2.2% gosched0 0 0.0% 100.0% 4 2.2% testing.(*B).launch 0 0.0% 100.0% 4 2.2% testing.(*B).runN http://dave.cheney.net/paste/bytes-after.svg
Sign in to reply to this message.
I don't understand why it is so slow. A bit slower, maybe, but not that much. The main loop in both the old and new implementations for n<8 are now identical. (The CountX benchmarks are doing n==7.) The new implementation differs just by a few constant moves, a compare and branch, and a BL (together with its stack pointer bump). I'm happy to revert to the old code if we can't fix this. I did the arm just for completeness, the big wins are in the x86 world. On Sat, Mar 30, 2013 at 8:01 PM, <dave@cheney.net> wrote: > Here is a bit more data from the linxu/arm regression for Count4K > > before: > > Total: 118 samples > 64 54.2% 54.2% 118 100.0% bytes.Count > 54 45.8% 100.0% 54 45.8% bytes.Equal > 0 0.0% 100.0% 64 54.2% bytes_test.BenchmarkCount4K > > http://dave.cheney.net/paste/**bytes-before.svg<http://dave.cheney.net/paste/... > > after: > > Total: 180 samples > 133 73.9% 73.9% 139 77.2% bytes.Count > 38 21.1% 95.0% 38 21.1% runtime.memeqbody > 9 5.0% 100.0% 180 100.0% bytes.Equal > 0 0.0% 100.0% 137 76.1% bytes_test.BenchmarkCount4K > 0 0.0% 100.0% 45 25.0% bytes_test.bmCount > 0 0.0% 100.0% 4 2.2% gosched0 > 0 0.0% 100.0% 4 2.2% testing.(*B).launch > 0 0.0% 100.0% 4 2.2% testing.(*B).runN > > http://dave.cheney.net/paste/**bytes-after.svg<http://dave.cheney.net/paste/b... > > https://codereview.appspot.**com/8056043/<https://codereview.appspot.com/8056... >
Sign in to reply to this message.
Your new runtime.memeqbody is much faster than my old byte by byte attempt, i'm experimenting with copying it directly into bytes.Equals to see if that causes a regression in bytes.Count. On Sun, Mar 31, 2013 at 3:14 PM, Keith Randall <khr@google.com> wrote: > I don't understand why it is so slow. A bit slower, maybe, but not that > much. The main loop in both the old and new implementations for n<8 are > now identical. (The CountX benchmarks are doing n==7.) The new > implementation differs just by a few constant moves, a compare and branch, > and a BL (together with its stack pointer bump). > > I'm happy to revert to the old code if we can't fix this. I did the arm > just for completeness, the big wins are in the x86 world. > > > On Sat, Mar 30, 2013 at 8:01 PM, <dave@cheney.net> wrote: > >> Here is a bit more data from the linxu/arm regression for Count4K >> >> before: >> >> Total: 118 samples >> 64 54.2% 54.2% 118 100.0% bytes.Count >> 54 45.8% 100.0% 54 45.8% bytes.Equal >> 0 0.0% 100.0% 64 54.2% bytes_test.BenchmarkCount4K >> >> http://dave.cheney.net/paste/**bytes-before.svg<http://dave.cheney.net/paste/... >> >> after: >> >> Total: 180 samples >> 133 73.9% 73.9% 139 77.2% bytes.Count >> 38 21.1% 95.0% 38 21.1% runtime.memeqbody >> 9 5.0% 100.0% 180 100.0% bytes.Equal >> 0 0.0% 100.0% 137 76.1% bytes_test.BenchmarkCount4K >> 0 0.0% 100.0% 45 25.0% bytes_test.bmCount >> 0 0.0% 100.0% 4 2.2% gosched0 >> 0 0.0% 100.0% 4 2.2% testing.(*B).launch >> 0 0.0% 100.0% 4 2.2% testing.(*B).runN >> >> http://dave.cheney.net/paste/**bytes-after.svg<http://dave.cheney.net/paste/b... >> >> https://codereview.appspot.**com/8056043/<https://codereview.appspot.com/8056... >> > > -- > > --- > You received this message because you are subscribed to the Google Groups > "golang-dev" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to golang-dev+unsubscribe@googlegroups.com. > For more options, visit https://groups.google.com/groups/opt_out. > > >
Sign in to reply to this message.
Here is another arm datapoint, from a pandaboard, Cortex-A9. alarm(~/go/src/pkg/bytes) % ~/go/misc/benchcmp {old,new}.txt benchmark old ns/op new ns/op delta BenchmarkReadString 207691 212456 +2.29% BenchmarkBufferNotEmptyWriteRead 10302655 10189737 -1.10% BenchmarkBufferFullSmallReads 1472871 1447450 -1.73% BenchmarkIndexByte32 164 165 +0.61% BenchmarkIndexByte4K 13756 13779 +0.17% BenchmarkIndexByte4M 27057480 26680883 -1.39% BenchmarkIndexByte64M 438272584 428702042 -2.18% BenchmarkIndexBytePortable32 515 515 +0.00% BenchmarkIndexBytePortable4K 56701 56541 -0.28% BenchmarkIndexBytePortable4M 64782309 64326364 -0.70% BenchmarkIndexBytePortable64M 1031455804 1030296066 -0.11% BenchmarkEqual32 188 106 -43.62% BenchmarkEqual4K 15506 4636 -70.10% BenchmarkEqual4M 45102310 46034946 +2.07% BenchmarkEqual64M 743546635 1001547877 +34.70% BenchmarkEqualPort32 807 796 -1.36% BenchmarkEqualPort4K 92419 90848 -1.70% BenchmarkEqualPortable4M 106740630 108727363 +1.86% BenchmarkEqualPortable64M 1774788812 1798989707 +1.36% BenchmarkIndex32 2503 2939 +17.42% BenchmarkIndex4K 374861 442905 +18.15% BenchmarkIndex4M 391445561 458860175 +17.22% BenchmarkIndex64M 6260006548 7347176750 +17.37% BenchmarkIndexEasy32 330 348 +5.45% BenchmarkIndexEasy4K 14001 13923 -0.56% BenchmarkIndexEasy4M 26652902 26637547 -0.06% BenchmarkIndexEasy64M 428365684 428122196 -0.06% BenchmarkCount32 2622 2956 +12.74% BenchmarkCount4K 384545 444107 +15.49% BenchmarkCount4M 398216529 459708583 +15.44% BenchmarkCount64M 6369725802 7357919173 +15.51% BenchmarkCountEasy32 353 373 +5.67% BenchmarkCountEasy4K 13907 13935 +0.20% BenchmarkCountEasy4M 26631245 26946392 +1.18% BenchmarkCountEasy64M 428438891 433169914 +1.10% BenchmarkFields 306475738 306250622 -0.07% BenchmarkFieldsFunc 305981589 306415421 +0.14% BenchmarkTrimSpace 753 733 -2.66% benchmark old MB/s new MB/s speedup BenchmarkReadString 157.77 154.23 0.98x BenchmarkIndexByte32 194.79 193.42 0.99x BenchmarkIndexByte4K 297.75 297.24 1.00x BenchmarkIndexByte4M 155.01 157.20 1.01x BenchmarkIndexByte64M 153.12 156.54 1.02x BenchmarkIndexBytePortable32 62.12 62.14 1.00x BenchmarkIndexBytePortable4K 72.24 72.44 1.00x BenchmarkIndexBytePortable4M 64.74 65.20 1.01x BenchmarkIndexBytePortable64M 65.06 65.14 1.00x BenchmarkEqual32 169.90 300.05 1.77x BenchmarkEqual4K 264.16 883.50 3.34x BenchmarkEqual4M 93.00 91.11 0.98x BenchmarkEqual64M 90.26 67.01 0.74x BenchmarkEqualPort32 39.61 40.15 1.01x BenchmarkEqualPort4K 44.32 45.09 1.02x BenchmarkEqualPortable4M 39.29 38.58 0.98x BenchmarkEqualPortable64M 37.81 37.30 0.99x BenchmarkIndex32 12.78 10.89 0.85x BenchmarkIndex4K 10.93 9.25 0.85x BenchmarkIndex4M 10.71 9.14 0.85x BenchmarkIndex64M 10.72 9.13 0.85x BenchmarkIndexEasy32 96.80 91.88 0.95x BenchmarkIndexEasy4K 292.54 294.18 1.01x BenchmarkIndexEasy4M 157.37 157.46 1.00x BenchmarkIndexEasy64M 156.66 156.75 1.00x BenchmarkCount32 12.20 10.83 0.89x BenchmarkCount4K 10.65 9.22 0.87x BenchmarkCount4M 10.53 9.12 0.87x BenchmarkCount64M 10.54 9.12 0.87x BenchmarkCountEasy32 90.47 85.58 0.95x BenchmarkCountEasy4K 294.52 293.93 1.00x BenchmarkCountEasy4M 157.50 155.65 0.99x BenchmarkCountEasy64M 156.64 154.93 0.99x BenchmarkFields 3.42 3.42 1.00x BenchmarkFieldsFunc 3.43 3.42 1.00x From my amateur eye, the A9 has a shorter pipeline and is probably taking a smaller hit from the branch than the A15 in the chromebook.
Sign in to reply to this message.
On Sat, Mar 30, 2013 at 9:14 PM, Keith Randall <khr@google.com> wrote: > I don't understand why it is so slow. A bit slower, maybe, but not that > much. The main loop in both the old and new implementations for n<8 are > now identical. (The CountX benchmarks are doing n==7.) The new > implementation differs just by a few constant moves, a compare and branch, > and a BL (together with its stack pointer bump). > > I'm happy to revert to the old code if we can't fix this. I did the arm > just for completeness, the big wins are in the x86 world. > Yeah. I'm in favor of leaving memeq and memeqbody in runtime/asm_arm.s but deleting or renaming bytes·Equal on arm to bytes·equalTODOIssueNNNNN where Issue NNNN is a new bug about investigating this issue specifically on ARM.
Sign in to reply to this message.
Last night I copied khr's new code back into bytes.Equals, removing all the indirection and still saw a slowdown in Count. I suggest, for this round revert bytes.Equals to my crummy version for arm, but keep runtime.memeqbody as it is an improvement over the old version in alg.c (minux do you agree?) and get this change in. We can figure out what is going on with arm without blocking the rest of this change. Cheers Dave On 01/04/2013, at 9:54 AM, Brad Fitzpatrick <bradfitz@golang.org> wrote: > > > > On Sat, Mar 30, 2013 at 9:14 PM, Keith Randall <khr@google.com> wrote: >> I don't understand why it is so slow. A bit slower, maybe, but not that much. The main loop in both the old and new implementations for n<8 are now identical. (The CountX benchmarks are doing n==7.) The new implementation differs just by a few constant moves, a compare and branch, and a BL (together with its stack pointer bump). >> >> I'm happy to revert to the old code if we can't fix this. I did the arm just for completeness, the big wins are in the x86 world. > > Yeah. I'm in favor of leaving memeq and memeqbody in runtime/asm_arm.s but deleting or renaming bytes·Equal on arm to bytes·equalTODOIssueNNNNN where Issue NNNN is a new bug about investigating this issue specifically on ARM. > >
Sign in to reply to this message.
If the simple implemention really is much faster for short strings, I'm in favor of both reverting bytes.Equals and using the same simple implementation in runtime. Faster equality for long strings is just not that important. We can always revisit if we figure out what is going on. On Sun, Mar 31, 2013 at 4:39 PM, Dave Cheney <dave@cheney.net> wrote: > Last night I copied khr's new code back into bytes.Equals, removing all > the indirection and still saw a slowdown in Count. I suggest, for this > round revert bytes.Equals to my crummy version for arm, but keep > runtime.memeqbody as it is an improvement over the old version in alg.c > (minux do you agree?) and get this change in. We can figure out what is > going on with arm without blocking the rest of this change. > > Cheers > > Dave > > > > On 01/04/2013, at 9:54 AM, Brad Fitzpatrick <bradfitz@golang.org> wrote: > > > > > On Sat, Mar 30, 2013 at 9:14 PM, Keith Randall <khr@google.com> wrote: > >> I don't understand why it is so slow. A bit slower, maybe, but not that >> much. The main loop in both the old and new implementations for n<8 are >> now identical. (The CountX benchmarks are doing n==7.) The new >> implementation differs just by a few constant moves, a compare and branch, >> and a BL (together with its stack pointer bump). >> >> I'm happy to revert to the old code if we can't fix this. I did the arm >> just for completeness, the big wins are in the x86 world. >> > > Yeah. I'm in favor of leaving memeq and memeqbody in runtime/asm_arm.s > but deleting or renaming bytes·Equal on arm to bytes·equalTODOIssueNNNNN > where Issue NNNN is a new bug about investigating this issue specifically > on ARM. > > >
Sign in to reply to this message.
SGTM. Lets try that. On 01/04/2013, at 11:17, Keith Randall <khr@google.com> wrote: > If the simple implemention really is much faster for short strings, I'm in favor of both reverting bytes.Equals and using the same simple implementation in runtime. Faster equality for long strings is just not that important. We can always revisit if we figure out what is going on. > > > > On Sun, Mar 31, 2013 at 4:39 PM, Dave Cheney <dave@cheney.net> wrote: >> Last night I copied khr's new code back into bytes.Equals, removing all the indirection and still saw a slowdown in Count. I suggest, for this round revert bytes.Equals to my crummy version for arm, but keep runtime.memeqbody as it is an improvement over the old version in alg.c (minux do you agree?) and get this change in. We can figure out what is going on with arm without blocking the rest of this change. >> >> Cheers >> >> Dave >> >> >> >> On 01/04/2013, at 9:54 AM, Brad Fitzpatrick <bradfitz@golang.org> wrote: >> >>> >>> >>> >>> On Sat, Mar 30, 2013 at 9:14 PM, Keith Randall <khr@google.com> wrote: >>>> I don't understand why it is so slow. A bit slower, maybe, but not that much. The main loop in both the old and new implementations for n<8 are now identical. (The CountX benchmarks are doing n==7.) The new implementation differs just by a few constant moves, a compare and branch, and a BL (together with its stack pointer bump). >>>> >>>> I'm happy to revert to the old code if we can't fix this. I did the arm just for completeness, the big wins are in the x86 world. >>> >>> Yeah. I'm in favor of leaving memeq and memeqbody in runtime/asm_arm.s but deleting or renaming bytes·Equal on arm to bytes·equalTODOIssueNNNNN where Issue NNNN is a new bug about investigating this issue specifically on ARM. >
Sign in to reply to this message.
*** Submitted as https://code.google.com/p/go/source/detail?r=7f91fab04475 *** runtime: Implement faster equals for strings and bytes. (amd64) benchmark old ns/op new ns/op delta BenchmarkEqual0 16 6 -63.15% BenchmarkEqual9 22 7 -65.37% BenchmarkEqual32 36 9 -74.91% BenchmarkEqual4K 2187 120 -94.51% benchmark old MB/s new MB/s speedup BenchmarkEqual9 392.22 1134.38 2.89x BenchmarkEqual32 866.72 3457.39 3.99x BenchmarkEqual4K 1872.73 33998.87 18.15x (386) benchmark old ns/op new ns/op delta BenchmarkEqual0 16 5 -63.85% BenchmarkEqual9 22 7 -67.84% BenchmarkEqual32 34 12 -64.94% BenchmarkEqual4K 2196 113 -94.85% benchmark old MB/s new MB/s speedup BenchmarkEqual9 405.81 1260.18 3.11x BenchmarkEqual32 919.55 2631.21 2.86x BenchmarkEqual4K 1864.85 36072.54 19.34x Update issue 3751 R=bradfitz, r, khr, dave, remyoudompheng, fullung, minux.ma, ality CC=golang-dev https://codereview.appspot.com/8056043
Sign in to reply to this message.
Message was sent while issue was closed.
Thanks khr, i'll try to integrate your faster word-at-a-time version into the asm_arm.s Cheers Dave
Sign in to reply to this message.
Message was sent while issue was closed.
Since this change has gone in, I'm seeing the following crash on linux/amd64: go test -v -short -cpu 1,2,4,8,16,256 std === RUN TestLargeStringWrites-2 unexpected fault address 0xc2003da008 fatal error: fault [signal 0xb code=0x2 addr=0xc2003da008 pc=0x40e6fb] goroutine 57 [running]: [fp=0xc20042a140] runtime.throw(0x605737) /build/go.tip/go/src/pkg/runtime/panic.c:473 +0x67 [fp=0xc20042a158] runtime.sigpanic() /build/go.tip/go/src/pkg/runtime/os_linux.c:239 +0xe7 [fp=0xc20042a210] sweepspan(0xc20008e480, 0xf8) /build/go.tip/go/src/pkg/runtime/mgc0.c:1625 +0x33b [fp=0xc20042a288] runtime.parfordo(0xc20008e480) /build/go.tip/go/src/pkg/runtime/parfor.c:120 +0x125 [fp=0xc20042a3b8] gc(0x7f52d1892c1c) /build/go.tip/go/src/pkg/runtime/mgc0.c:1940 +0x2e3 ----- stack segment boundary ----- [fp=0x7f52d1892c30] runtime.gc(0xc200000000) /build/go.tip/go/src/pkg/runtime/mgc0.c:1857 +0x11b [fp=0x7f52d1892c88] runtime.mallocgc(0x12000, 0x100000001, 0x7f5200000001) /build/go.tip/go/src/pkg/runtime/zmalloc_linux_amd64.c:101 +0x1e4 [fp=0x7f52d1892cc0] makeslice1(0x4e76e0, 0x11170, 0x11170, 0x7f52d1892d08) /build/go.tip/go/src/pkg/runtime/slice.c:61 +0x89 [fp=0x7f52d1892cf0] runtime.makeslice(0x4e76e0, 0x11170, 0x11170, 0xc, 0x11170, ...) /build/go.tip/go/src/pkg/runtime/slice.c:34 +0x9a [fp=0x7f52d1892d28] bytes.makeSlice(0x11170, 0x0, 0x0, 0x0) /build/go.tip/go/src/pkg/bytes/buffer.go:191 +0x61 [fp=0x7f52d1892de0] bytes.(*Buffer).grow(0xc20045a310, 0x2710, 0x2) /build/go.tip/go/src/pkg/bytes/buffer.go:99 +0x20b [fp=0x7f52d1892e30] bytes.(*Buffer).WriteString(0xc20045a310, 0xc2001c4000, 0x2710, 0xc20045a310, 0x0, ...) /build/go.tip/go/src/pkg/bytes/buffer.go:136 +0x49 [fp=0x7f52d1892ef0] bytes_test.fillString(0xc200481120, 0x54e990, 0x13, 0xc20045a310, 0xc2004cf000, ...) /build/go.tip/go/src/pkg/bytes/buffer_test.go:55 +0xe8 [fp=0x7f52d1892f78] bytes_test.TestLargeStringWrites(0xc200481120) /build/go.tip/go/src/pkg/bytes/buffer_test.go:178 +0xd5 [fp=0x7f52d1892fa8] testing.tRunner(0xc200481120, 0x608bc8) /build/go.tip/go/src/pkg/testing/testing.go:349 +0x8a [fp=0x7f52d1892fb0] runtime.goexit() /build/go.tip/go/src/pkg/runtime/proc.c:1214 created by testing.RunTests /build/go.tip/go/src/pkg/testing/testing.go:429 +0x86b goroutine 1 [chan receive]: runtime.park(0x409e70, 0xc200339d10, 0x60682a) /build/go.tip/go/src/pkg/runtime/proc.c:1167 +0x64 runtime.chanrecv(0x4e8260, 0xc200339cc0, 0x7f52d1884ce0, 0x0, 0x0, ...) /build/go.tip/go/src/pkg/runtime/chan.c:366 +0x566 runtime.chanrecv1() /build/go.tip/go/src/pkg/runtime/chan.c:458 +0x38 testing.RunTests(0x56e1e8, 0x608b80, 0x33, 0x33, 0x1, ...) /build/go.tip/go/src/pkg/testing/testing.go:430 +0x88e testing.Main(0x56e1e8, 0x608b80, 0x33, 0x33, 0x608700, ...) /build/go.tip/go/src/pkg/testing/testing.go:361 +0x8a main.main() bytes/_test/_testmain.go:245 +0x9a runtime.main() /build/go.tip/go/src/pkg/runtime/proc.c:182 +0x92 runtime.goexit() /build/go.tip/go/src/pkg/runtime/proc.c:1214 goroutine 2 [syscall]: runtime.entersyscallblock() /build/go.tip/go/src/pkg/runtime/proc.c:1324 +0x16e runtime.MHeap_Scavenger() /build/go.tip/go/src/pkg/runtime/mheap.c:435 +0xee runtime.goexit() /build/go.tip/go/src/pkg/runtime/proc.c:1214 created by runtime.main /build/go.tip/go/src/pkg/runtime/proc.c:165 FAIL bytes 0.158s
Sign in to reply to this message.
Go 1.1 will be fast enough[1] without this change. If it's unstable then my vote would be to revert this change so we have a chance of publishing a release candidate. Thoughts ? [1]. https://github.com/davecheney/autobench On Wed, Apr 3, 2013 at 4:43 PM, <fullung@gmail.com> wrote: > Since this change has gone in, I'm seeing the following crash on > linux/amd64: > > go test -v -short -cpu 1,2,4,8,16,256 std > > === RUN TestLargeStringWrites-2 > unexpected fault address 0xc2003da008 > fatal error: fault > [signal 0xb code=0x2 addr=0xc2003da008 pc=0x40e6fb] > > goroutine 57 [running]: > [fp=0xc20042a140] runtime.throw(0x605737) > /build/go.tip/go/src/pkg/runtime/panic.c:473 +0x67 > [fp=0xc20042a158] runtime.sigpanic() > /build/go.tip/go/src/pkg/runtime/os_linux.c:239 +0xe7 > [fp=0xc20042a210] sweepspan(0xc20008e480, 0xf8) > /build/go.tip/go/src/pkg/runtime/mgc0.c:1625 +0x33b > [fp=0xc20042a288] runtime.parfordo(0xc20008e480) > /build/go.tip/go/src/pkg/runtime/parfor.c:120 +0x125 > [fp=0xc20042a3b8] gc(0x7f52d1892c1c) > /build/go.tip/go/src/pkg/runtime/mgc0.c:1940 +0x2e3 > ----- stack segment boundary ----- > [fp=0x7f52d1892c30] runtime.gc(0xc200000000) > /build/go.tip/go/src/pkg/runtime/mgc0.c:1857 +0x11b > [fp=0x7f52d1892c88] runtime.mallocgc(0x12000, 0x100000001, > 0x7f5200000001) > /build/go.tip/go/src/pkg/runtime/zmalloc_linux_amd64.c:101 +0x1e4 > [fp=0x7f52d1892cc0] makeslice1(0x4e76e0, 0x11170, 0x11170, > 0x7f52d1892d08) > /build/go.tip/go/src/pkg/runtime/slice.c:61 +0x89 > [fp=0x7f52d1892cf0] runtime.makeslice(0x4e76e0, 0x11170, 0x11170, 0xc, > 0x11170, ...) > /build/go.tip/go/src/pkg/runtime/slice.c:34 +0x9a > [fp=0x7f52d1892d28] bytes.makeSlice(0x11170, 0x0, 0x0, 0x0) > /build/go.tip/go/src/pkg/bytes/buffer.go:191 +0x61 > [fp=0x7f52d1892de0] bytes.(*Buffer).grow(0xc20045a310, 0x2710, 0x2) > /build/go.tip/go/src/pkg/bytes/buffer.go:99 +0x20b > [fp=0x7f52d1892e30] bytes.(*Buffer).WriteString(0xc20045a310, > 0xc2001c4000, 0x2710, 0xc20045a310, 0x0, ...) > /build/go.tip/go/src/pkg/bytes/buffer.go:136 +0x49 > [fp=0x7f52d1892ef0] bytes_test.fillString(0xc200481120, 0x54e990, 0x13, > 0xc20045a310, 0xc2004cf000, ...) > /build/go.tip/go/src/pkg/bytes/buffer_test.go:55 +0xe8 > [fp=0x7f52d1892f78] bytes_test.TestLargeStringWrites(0xc200481120) > /build/go.tip/go/src/pkg/bytes/buffer_test.go:178 +0xd5 > [fp=0x7f52d1892fa8] testing.tRunner(0xc200481120, 0x608bc8) > /build/go.tip/go/src/pkg/testing/testing.go:349 +0x8a > [fp=0x7f52d1892fb0] runtime.goexit() > /build/go.tip/go/src/pkg/runtime/proc.c:1214 > created by testing.RunTests > /build/go.tip/go/src/pkg/testing/testing.go:429 +0x86b > > goroutine 1 [chan receive]: > runtime.park(0x409e70, 0xc200339d10, 0x60682a) > /build/go.tip/go/src/pkg/runtime/proc.c:1167 +0x64 > runtime.chanrecv(0x4e8260, 0xc200339cc0, 0x7f52d1884ce0, 0x0, 0x0, ...) > /build/go.tip/go/src/pkg/runtime/chan.c:366 +0x566 > runtime.chanrecv1() > /build/go.tip/go/src/pkg/runtime/chan.c:458 +0x38 > testing.RunTests(0x56e1e8, 0x608b80, 0x33, 0x33, 0x1, ...) > /build/go.tip/go/src/pkg/testing/testing.go:430 +0x88e > testing.Main(0x56e1e8, 0x608b80, 0x33, 0x33, 0x608700, ...) > /build/go.tip/go/src/pkg/testing/testing.go:361 +0x8a > main.main() > bytes/_test/_testmain.go:245 +0x9a > runtime.main() > /build/go.tip/go/src/pkg/runtime/proc.c:182 +0x92 > runtime.goexit() > /build/go.tip/go/src/pkg/runtime/proc.c:1214 > > goroutine 2 [syscall]: > runtime.entersyscallblock() > /build/go.tip/go/src/pkg/runtime/proc.c:1324 +0x16e > runtime.MHeap_Scavenger() > /build/go.tip/go/src/pkg/runtime/mheap.c:435 +0xee > runtime.goexit() > /build/go.tip/go/src/pkg/runtime/proc.c:1214 > created by runtime.main > /build/go.tip/go/src/pkg/runtime/proc.c:165 > FAIL bytes 0.158s > > https://codereview.appspot.com/8056043/
Sign in to reply to this message.
Message was sent while issue was closed.
On 2013/04/03 05:47:04, dfc wrote: > Go 1.1 will be fast enough[1] without this change. If it's unstable > then my vote would be to revert this change so we have a chance of > publishing a release candidate. Thoughts ? Without looking past the stack trace, it seems like it might be a GC issue though, in which case reverting this isn't going to fix the underlying problem... Always forward, never back! :-) Cheers Albert
Sign in to reply to this message.
On 3 Apr 2013 07:47, "Dave Cheney" <dave@cheney.net> wrote: > > Go 1.1 will be fast enough[1] without this change. If it's unstable > then my vote would be to revert this change so we have a chance of > publishing a release candidate. Thoughts ? > > It could be wise but since it's absolutely unclear why the GC is affected a proper investigation is needed. If in the end the change is reverted I would like to keep the new test code. Rémy
Sign in to reply to this message.
On 2013/4/3 <fullung@gmail.com> wrote: > On 2013/04/03 05:47:04, dfc wrote: >> >> Go 1.1 will be fast enough[1] without this change. If it's unstable >> then my vote would be to revert this change so we have a chance of >> publishing a release candidate. Thoughts ? > > > Without looking past the stack trace, it seems like it might be a GC > issue though, in which case reverting this isn't going to fix the > underlying problem... Please try https://codereview.appspot.com/8300044 I'm surprised the test could possibly work for anyone. [maybe the better fix is to check the error returned by syscall.Mprotect] Rémy.
Sign in to reply to this message.
Wow, great catch! On Wed, Apr 3, 2013 at 5:33 PM, Rémy Oudompheng <remyoudompheng@gmail.com> wrote: > On 2013/4/3 <fullung@gmail.com> wrote: >> On 2013/04/03 05:47:04, dfc wrote: >>> >>> Go 1.1 will be fast enough[1] without this change. If it's unstable >>> then my vote would be to revert this change so we have a chance of >>> publishing a release candidate. Thoughts ? >> >> >> Without looking past the stack trace, it seems like it might be a GC >> issue though, in which case reverting this isn't going to fix the >> underlying problem... > > Please try https://codereview.appspot.com/8300044 > I'm surprised the test could possibly work for anyone. > [maybe the better fix is to check the error returned by syscall.Mprotect] > > Rémy.
Sign in to reply to this message.
|