|
|
Created:
12 years, 6 months ago by notcarl Modified:
12 years, 6 months ago Reviewers:
CC:
dave_cheney.net, rsc, iant2, golang-dev Visibility:
Public. |
Descriptioncrypto/sha1: Make sha-1 do block mixup in place
Benchmarks:
benchmark old ns/op new ns/op delta
BenchmarkHash8Bytes 762 674 -11.55%
BenchmarkHash1K 8791 7375 -16.11%
BenchmarkHash8K 65094 54881 -15.69%
benchmark old MB/s new MB/s speedup
BenchmarkHash8Bytes 10.50 11.86 1.13x
BenchmarkHash1K 116.48 138.84 1.19x
BenchmarkHash8K 125.85 149.27 1.19x
Patch Set 1 #Patch Set 2 : diff -r c33545ae0ec0 https://code.google.com/p/go/ #Patch Set 3 : diff -r c33545ae0ec0 https://code.google.com/p/go/ #
Total comments: 6
Patch Set 4 : diff -r c33545ae0ec0 https://code.google.com/p/go/ #
Total comments: 1
MessagesTotal messages: 21
Hello golang-dev@googlegroups.com, I'd like you to review this change to https://code.google.com/p/go/
Sign in to reply to this message.
Hello golang-dev@googlegroups.com (cc: golang-dev@googlegroups.com), Please take another look.
Sign in to reply to this message.
On 2012/11/06 02:12:48, notcarl wrote: > Hello mailto:golang-dev@googlegroups.com (cc: mailto:golang-dev@googlegroups.com), > > Please take another look. lucky(~/go/src/pkg/crypto/sha1) % ~/go/misc/benchcmp {old,new}.txt benchmark old ns/op new ns/op delta BenchmarkHash8Bytes 863 831 -3.71% BenchmarkHash1K 10049 9693 -3.54% BenchmarkHash8K 74555 72132 -3.25% benchmark old MB/s new MB/s speedup BenchmarkHash8Bytes 9.27 9.62 1.04x BenchmarkHash1K 101.90 105.64 1.04x BenchmarkHash8K 109.88 113.57 1.03x I think this can be made a little faster.
Sign in to reply to this message.
https://codereview.appspot.com/6820096/diff/6001/src/pkg/crypto/sha1/sha1bloc... File src/pkg/crypto/sha1/sha1block.go (right): https://codereview.appspot.com/6820096/diff/6001/src/pkg/crypto/sha1/sha1bloc... src/pkg/crypto/sha1/sha1block.go:35: for i := 0; i < 16; i++ { lifting i outside the for loop, then not initalising it again lower down gets another 2% on amd64 i := 0 for ; i < 16 ; i++ lucky(~/go/src/pkg/crypto/sha1) % ~/go/misc/benchcmp {old,new}.txt benchmark old ns/op new ns/op delta BenchmarkHash8Bytes 863 819 -5.10% BenchmarkHash1K 10049 9499 -5.47% BenchmarkHash8K 74555 70395 -5.58% benchmark old MB/s new MB/s speedup BenchmarkHash8Bytes 9.27 9.76 1.05x BenchmarkHash1K 101.90 107.80 1.06x BenchmarkHash8K 109.88 116.37 1.06x https://codereview.appspot.com/6820096/diff/6001/src/pkg/crypto/sha1/sha1bloc... src/pkg/crypto/sha1/sha1block.go:42: for i := 16; i < 20; i++ { for ; i < 20; i++ { .. } and so forth
Sign in to reply to this message.
looks good to me. will leave for dfc to lgtm and submit https://codereview.appspot.com/6820096/diff/6001/src/pkg/crypto/sha1/sha1bloc... File src/pkg/crypto/sha1/sha1block.go (right): https://codereview.appspot.com/6820096/diff/6001/src/pkg/crypto/sha1/sha1bloc... src/pkg/crypto/sha1/sha1block.go:39: t := a5 + f + e + w[i] + _K0 i&0xf will get you something here. https://codereview.appspot.com/6820096/diff/6001/src/pkg/crypto/sha1/sha1bloc... src/pkg/crypto/sha1/sha1block.go:42: for i := 16; i < 20; i++ { On 2012/11/06 04:06:47, dfc wrote: > for ; i < 20; i++ { .. } > and so forth FWIW this is unlikely to matter.
Sign in to reply to this message.
Thanks for the quick review! https://codereview.appspot.com/6820096/diff/6001/src/pkg/crypto/sha1/sha1bloc... File src/pkg/crypto/sha1/sha1block.go (right): https://codereview.appspot.com/6820096/diff/6001/src/pkg/crypto/sha1/sha1bloc... src/pkg/crypto/sha1/sha1block.go:39: t := a5 + f + e + w[i] + _K0 On 2012/11/06 18:52:06, rsc wrote: > i&0xf will get you something here. I am not sure I understand. Since the loop only goes from [0, 15], anding with 0xf should be a no-op. https://codereview.appspot.com/6820096/diff/6001/src/pkg/crypto/sha1/sha1bloc... src/pkg/crypto/sha1/sha1block.go:42: for i := 16; i < 20; i++ { On 2012/11/06 18:52:06, rsc wrote: > On 2012/11/06 04:06:47, dfc wrote: > > for ; i < 20; i++ { .. } > > and so forth > > FWIW this is unlikely to matter. > I was hoping that maybe the compiler saw this. I didn't unroll the loops because the compiler should be doing it and it decreases readability. In addition, the comments at the top of this file hint at it not being a priority to make this as fast as possible.
Sign in to reply to this message.
>> i&0xf will get you something here. > > I am not sure I understand. Since the loop only goes from [0, 15], > anding with 0xf should be a no-op. The compiler isn't smart enough to recognize that i is guaranteed to be in range here, but &0xf will force it to notice that and avoid the bounds check. It is unlikely to matter much. >> On 2012/11/06 04:06:47, dfc wrote: >> > for ; i < 20; i++ { .. } >> > and so forth > >> FWIW this is unlikely to matter. > > I was hoping that maybe the compiler saw this. I didn't unroll the > loops because the compiler should be doing it and it decreases > readability. In addition, the comments at the top of this file hint at > it not being a priority to make this as fast as possible. Dave suggested a single i instead of 5 different i variables. My comment was only that from a performance standpoint it is unlikely to matter. The compiler does not unroll the loops, and in general I think it is unreasonable to expect it to. The big win you can get from the SHA1 block function is if you unroll the loops 5x and do the variable renamings in each iteration, so that all the rotating assignments disappear. If you want to do this it should be done in a separate CL and the speed gain should justify the increase in code. The last time I looked at this I think it was not quite enough. crypto/md5/gen.go is the program that generates the MD5 block routine. You could use something similar for SHA1. Russ
Sign in to reply to this message.
Hello dave@cheney.net, rsc@golang.org (cc: golang-dev@googlegroups.com), Please take another look.
Sign in to reply to this message.
On 2012/11/06 19:08:54, rsc wrote: > >> i&0xf will get you something here. > > > > I am not sure I understand. Since the loop only goes from [0, 15], > > anding with 0xf should be a no-op. > > The compiler isn't smart enough to recognize that i is guaranteed to > be in range here, but &0xf will force it to notice that and avoid the > bounds check. It is unlikely to matter much. > I adding this change, and got a small increase in speed. > >> On 2012/11/06 04:06:47, dfc wrote: > >> > for ; i < 20; i++ { .. } > >> > and so forth > > > >> FWIW this is unlikely to matter. > > > > I was hoping that maybe the compiler saw this. I didn't unroll the > > loops because the compiler should be doing it and it decreases > > readability. In addition, the comments at the top of this file hint at > > it not being a priority to make this as fast as possible. > > Dave suggested a single i instead of 5 different i variables. My > comment was only that from a performance standpoint it is unlikely to > matter. The compiler does not unroll the loops, and in general I think > it is unreasonable to expect it to. The big win you can get from the > SHA1 block function is if you unroll the loops 5x and do the variable > renamings in each iteration, so that all the rotating assignments > disappear. If you want to do this it should be done in a separate CL > and the speed gain should justify the increase in code. The last time > I looked at this I think it was not quite enough. crypto/md5/gen.go is > the program that generates the MD5 block routine. You could use > something similar for SHA1. > > Russ The fear is that unrolling the loop means potentially more i-cache misses, so I agree it should be in a future CL. Hoisting i out of the loop in the meantime showed a little extra gain.
Sign in to reply to this message.
Please put the current benchcmp numbers in the CL description. Leaving for Dave. Thanks. Russ
Sign in to reply to this message.
On 2012/11/06 19:38:57, rsc wrote: > Please put the current benchcmp numbers in the CL description. > Leaving for Dave. > > Thanks. > Russ Done.
Sign in to reply to this message.
hmmm, i can't replicate your benchmark results. linux/amd64: Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz benchmark old ns/op new ns/op delta BenchmarkHash8Bytes 866 824 -4.85% BenchmarkHash1K 10058 9520 -5.35% BenchmarkHash8K 74567 70953 -4.85% benchmark old MB/s new MB/s speedup BenchmarkHash8Bytes 9.24 9.71 1.05x BenchmarkHash1K 101.81 107.56 1.06x BenchmarkHash8K 109.86 115.46 1.05x
Sign in to reply to this message.
$ cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 44 model name : Intel(R) Xeon(R) CPU W3690 @ 3.47GHz ...
Sign in to reply to this message.
Looks pretty good to me as well. I'll run some benchmarks on 32 bit platforms once mine get through building the overnight changes. https://codereview.appspot.com/6820096/diff/4002/src/pkg/crypto/sha1/sha1bloc... File src/pkg/crypto/sha1/sha1block.go (right): https://codereview.appspot.com/6820096/diff/4002/src/pkg/crypto/sha1/sha1bloc... src/pkg/crypto/sha1/sha1block.go:26: j := i * 4 j := i << 2 avoids the imul on intel. The compiler should be smarter about this.
Sign in to reply to this message.
On 2012/11/06 22:04:24, dfc wrote: > Looks pretty good to me as well. I'll run some benchmarks on 32 bit platforms > once mine get through building the overnight changes. > > https://codereview.appspot.com/6820096/diff/4002/src/pkg/crypto/sha1/sha1bloc... > File src/pkg/crypto/sha1/sha1block.go (right): > > https://codereview.appspot.com/6820096/diff/4002/src/pkg/crypto/sha1/sha1bloc... > src/pkg/crypto/sha1/sha1block.go:26: j := i * 4 > j := i << 2 avoids the imul on intel. The compiler should be smarter about this. I tried changing the variables to unsigned to see if it would change anything. I also tried adding in the shift instead of the mult, but neither of these two changes affected the results of my benchmarks. I think it would be safe to leave this alone for now.
Sign in to reply to this message.
Fair enough. The compiler should be able to do this for us eventually. I know you're a Googler, so most of the CLA process does not apply to you, but have you done whatever is needed on your side ?
Sign in to reply to this message.
On Tue, Nov 6, 2012 at 3:07 PM, <dave@cheney.net> wrote: > Fair enough. The compiler should be able to do this for us eventually. > > I know you're a Googler, so most of the CLA process does not apply to > you, but have you done whatever is needed on your side ? To be clear, are you approving this patch? Let me know. We just need to get him into CONTRIBUTORS. Ian
Sign in to reply to this message.
Yes, LGTM. On Wed, Nov 7, 2012 at 10:18 AM, Ian Lance Taylor <iant@google.com> wrote: > On Tue, Nov 6, 2012 at 3:07 PM, <dave@cheney.net> wrote: >> Fair enough. The compiler should be able to do this for us eventually. >> >> I know you're a Googler, so most of the CLA process does not apply to >> you, but have you done whatever is needed on your side ? > > To be clear, are you approving this patch? Let me know. > > We just need to get him into CONTRIBUTORS. > > Ian
Sign in to reply to this message.
On 2012/11/06 23:07:38, dfc wrote: > Fair enough. The compiler should be able to do this for us eventually. > > I know you're a Googler, so most of the CLA process does not apply to you, but > have you done whatever is needed on your side ? Checked with other Googlers, I believe I'm good to go. Thanks for the speedy review!
Sign in to reply to this message.
Some results from linux/arm: benchmark old ns/op new ns/op delta BenchmarkHash8Bytes 6487 6497 +0.15% BenchmarkHash1K 69794 70773 +1.40% BenchmarkHash8K 517529 525604 +1.56% benchmark old MB/s new MB/s speedup BenchmarkHash8Bytes 1.23 1.23 1.00x BenchmarkHash1K 14.67 14.47 0.99x BenchmarkHash8K 15.83 15.59 0.98x Wasn't expecting much, and it looks like my expectations were met. This is within the margin of error.
Sign in to reply to this message.
*** Submitted as http://code.google.com/p/go/source/detail?r=1369d7bb329d *** crypto/sha1: Make sha-1 do block mixup in place Benchmarks: benchmark old ns/op new ns/op delta BenchmarkHash8Bytes 762 674 -11.55% BenchmarkHash1K 8791 7375 -16.11% BenchmarkHash8K 65094 54881 -15.69% benchmark old MB/s new MB/s speedup BenchmarkHash8Bytes 10.50 11.86 1.13x BenchmarkHash1K 116.48 138.84 1.19x BenchmarkHash8K 125.85 149.27 1.19x R=dave, rsc, iant CC=golang-dev http://codereview.appspot.com/6820096 Committer: Dave Cheney <dave@cheney.net>
Sign in to reply to this message.
|