Improve blending BlitRow functions on x86 by around 10% by making better use of SSE2
instructions.
The optimizations done in this patch are:
- Use _mm_mulhi_epi16 instead of _mm_mullo_epi16 for scale
multiplication, which avoids the separate division (or shift).
- Take the Atom micro-architecture's execution-port constraints into
account and interleave the operations to make better (parallel)
use of the two execution ports.
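
The first optimization can be illustrated with a minimal sketch. It is not the code from this patch: the function names are hypothetical, and it assumes 8-bit channel values already unpacked into 16-bit lanes and a scale in the range 0..255. It also swaps in the unsigned intrinsic _mm_mulhi_epu16, so that a scale of 128 or more does not turn negative once it is moved into the high byte; the signed _mm_mulhi_epi16 named above would need the scale restricted accordingly.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

#include <cassert>
#include <cstdint>

// Baseline: keep the low 16 bits of each product, then shift right by 8
// to divide by 256. Each lane holds a channel value in 0..255.
static inline __m128i scale_mullo(__m128i channels, unsigned scale) {
    assert(scale <= 255);
    const __m128i s = _mm_set1_epi16(static_cast<short>(scale));
    return _mm_srli_epi16(_mm_mullo_epi16(channels, s), 8);
}

// Alternative: move the scale into the high byte once, then take the high
// 16 bits of each 32-bit product:
//   (c * (scale << 8)) >> 16 == (c * scale) >> 8
// Assumption: the unsigned variant is used so that scale >= 128 works.
static inline __m128i scale_mulhi(__m128i channels, unsigned scale) {
    assert(scale <= 255);
    const __m128i s = _mm_set1_epi16(static_cast<short>(scale << 8));
    return _mm_mulhi_epu16(channels, s);
}

// Hypothetical self-check: both variants agree with the scalar formula
// for every channel value and every scale.
static bool scale_variants_agree() {
    for (unsigned scale = 0; scale < 256; ++scale) {
        for (unsigned c = 0; c < 256; c += 8) {
            alignas(16) uint16_t in[8], lo[8], hi[8];
            for (unsigned i = 0; i < 8; ++i) {
                in[i] = static_cast<uint16_t>(c + i);
            }
            const __m128i v =
                _mm_load_si128(reinterpret_cast<const __m128i*>(in));
            _mm_store_si128(reinterpret_cast<__m128i*>(lo),
                            scale_mullo(v, scale));
            _mm_store_si128(reinterpret_cast<__m128i*>(hi),
                            scale_mulhi(v, scale));
            for (unsigned i = 0; i < 8; ++i) {
                const unsigned expected = ((c + i) * scale) >> 8;
                if (lo[i] != expected || hi[i] != expected) return false;
            }
        }
    }
    return true;
}
```

In a blit loop the shifted scale vector would be built once outside the loop, so each group of pixels costs one multiply instead of a multiply plus a shift.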
Using:
bench -config 8888 -forceBlend 1 -match bitmap_8888 -repeat 100
(a benchmark in which most cycles are spent in S32A_Blend_BlitRow32_SSE2 and S32_Blend_BlitRow32_SSE2), we get on a 64-bit Z600:
before:
running bench [640 480] bitmap_8888_update 8888: cmsecs = 5.60
running bench [640 480] bitmap_8888_update_volatile 8888: cmsecs = 5.61
running bench [640 480] bitmap_8888 8888: cmsecs = 5.59
running bench [640 480] bitmap_8888_A 8888: cmsecs = 6.98
after:
running bench [640 480] bitmap_8888_update 8888: cmsecs = 5.15
running bench [640 480] bitmap_8888_update_volatile 8888: cmsecs = 5.05
running bench [640 480] bitmap_8888 8888: cmsecs = 5.03
running bench [640 480] bitmap_8888_A 8888: cmsecs = 6.30
or between 8 and 11% faster on a 64-bit Z600. On a 32-bit Atom, the results are between 4 and 10%
faster.
Credits: Tom C at Intel and lcwu
Patch Set 1
Patch Set 2: Added a few more comments and fixed some typos in the comments