Description

image/jpeg: unroll the IDCT loop.
The fundamental gain is bounds check elimination for 6g during array
indexing. The "x7 := src[y8+3]" generated code that used to be:
000000000042d174: MOVQ CX, DX
000000000042d177: ADDQ $0x3, DX
000000000042d17b: CMPQ $0x40, DX
000000000042d17f: JAE 0x42d381
000000000042d185: LEAQ 0(AX)(DX*4), BX
000000000042d189: MOVL 0(BX), DI
000000000042d381: CALL runtime.panicindex(SB)
000000000042d386: UD2
becomes eight separate single-instruction translations of a constant
array offset like:
000000000042dcbe: MOVL 0xcc(AX), DX
In the CPU profile for decoding a baseline JPEG, idct drops from 21%
of the CPU time to 13%.
benchmark                      old ns/op    new ns/op    delta
BenchmarkIDCT                       2124         1319  -37.90%
BenchmarkDecodeBaseline          1220819      1137013   -6.86%
BenchmarkDecodeProgressive       1947365      1857279   -4.63%

benchmark                       old MB/s     new MB/s  speedup
BenchmarkDecodeBaseline            50.62        54.35    1.07x
BenchmarkDecodeProgressive         31.74        33.27    1.05x
Patch Set 1
Patch Set 2 : diff -r 90c0d9c4a9ad https://code.google.com/p/go/
Patch Set 3 : diff -r 90c0d9c4a9ad https://code.google.com/p/go/
Messages
Total messages: 3