DescriptionImprove performance of S32_{opaque|alpha}_D32_filter_DX on SSSE3 platforms.
The assembly code was adapted from code we received from Intel.
The optimizations done are:
- in 1/16 of the cases, the math that is done becomes much simpler,
and we take advantage of this. It is of interest that this optimization
is valid regardless of the platform, and could be applied on all
versions of that code. Assuming a complete random distribution of
samples, this optimization can therefore at most increase speed
by 1/16 (6%), if the processing was completely removed. Actual
benchmarks show an improvement of 4% of the actual function or
2.5% in the complete benchmark mentioned below.
- math is making use of various new instructions that come with
SSSE3 which pretty much allows to process twice as many pixels for
each pass.
- loop unrolling allows to make use of the fancier instructions above
but also allows to perform fewer memory accesses.
For more details, see the code where more comments detail the
optimizations.
Benchmarking was done on a variety of platforms:
- Z600 64 bit
- Z600 32 bit
- GoogleTV Atom 32 bit.
This was benchmarked using perf/skia bench, and real world scenarios.
using skia bench:
out/Release/bench -config 8888 -scale -forceFilter 1 -match bitmap -repeat 10
(Example output:)
Before:
running bench [640 480] bitmap_8888_update 8888: cmsecs = 42.40
running bench [640 480] bitmap_8888_update_volatile 8888: cmsecs = 42.41
running bench [640 480] bitmap_8888 8888: cmsecs = 42.45
running bench [640 480] bitmap_8888_A 8888: cmsecs = 46.86
After:
running bench [640 480] bitmap_8888_update 8888: cmsecs = 31.25
running bench [640 480] bitmap_8888_update_volatile 8888: cmsecs = 31.26
running bench [640 480] bitmap_8888 8888: cmsecs = 31.21
running bench [640 480] bitmap_8888_A 8888: cmsecs = 35.48
So, in this bench, performance improvement (only the first of the four
benchmarks mentioned was used, since the improvement is the same accross
all 4 benchmarks)
| before | after | improvement |
Z600 64| 42.4 | 31.25 | 1.35 |
Z600 32| 46.27 | 39.46 | 1.17 |
Atom 32| 271.12 | 193.77| 1.4 |
The actual speed up of the function is larger simply because the functions
are only 60% of the benchmarks mentioned above. Benchmarks using that function
alone show speed ups between 1.4 (z600 64 bit) to over 2x (Atom 32 bit).
Credits: This code is the work of many people. Most of the praise should
go to Intel's Tom C who wrote and optimized the code loop. Then lcwu and
I share the reviewing, testing, clean up effort (templatization, factorization, etc).
In real life browser scenarios, this function shows up on a variety of
GFX intensive benchmarks. For example,
http://ie.microsoft.com/testdrive/Performance/AsteroidBelt/Default.html
the function. Before optimization represents 47% of workload, after 38%, so
an improvement in the function of 1.44, which gives a real world improvement
of roughly 16 % in frame rate.
Patch Set 1 #Patch Set 2 : '' #
Total comments: 3
Patch Set 3 : major cleanup, factorization as requested #Patch Set 4 : updated checkin comments #
MessagesTotal messages: 12
|