Issue 5515044: Improve performance of S32_{opaque|alpha}_D32_filter_DX on SSSE3 platforms. (Closed)

Created:
13 years, 7 months ago by evannier

Modified:
13 years, 5 months ago

Reviewers:
epoger, TomH, reed1

CC:
skia-review_googlegroups.com

Base URL:
http://skia.googlecode.com/svn/trunk/

Visibility:
Public.

Description

Improve performance of S32_{opaque|alpha}_D32_filter_DX on SSSE3 platforms.

The assembly code was adapted from code we received from Intel.

The optimizations are:
- In 1/16 of the cases, the math becomes much simpler, and we take advantage of this. This optimization is valid regardless of the platform and could be applied to all versions of that code. Assuming a completely random distribution of samples, it can increase speed by at most 1/16 (6%), and only if the processing for those cases were removed entirely. Actual benchmarks show an improvement of 4% in the function itself, or 2.5% in the complete benchmark mentioned below.
- The math uses new instructions that come with SSSE3, which make it possible to process roughly twice as many pixels per pass.
- Loop unrolling makes it possible to use the instructions above and also to perform fewer memory accesses.

For more details, see the comments in the code, which describe the optimizations.

Benchmarking was done on a variety of platforms:
- Z600 64 bit
- Z600 32 bit
- GoogleTV Atom 32 bit

This was benchmarked using perf/skia bench and real-world scenarios.

Using skia bench:
out/Release/bench -config 8888 -scale -forceFilter 1 -match bitmap -repeat 10

(Example output:)
Before:
running bench [640 480] bitmap_8888_update 8888: cmsecs = 42.40
running bench [640 480] bitmap_8888_update_volatile 8888: cmsecs = 42.41
running bench [640 480] bitmap_8888 8888: cmsecs = 42.45
running bench [640 480] bitmap_8888_A 8888: cmsecs = 46.86
After:
running bench [640 480] bitmap_8888_update 8888: cmsecs = 31.25
running bench [640 480] bitmap_8888_update_volatile 8888: cmsecs = 31.26
running bench [640 480] bitmap_8888 8888: cmsecs = 31.21
running bench [640 480] bitmap_8888_A 8888: cmsecs = 35.48

Performance improvement in this bench, with times in cmsecs (only the first of the four benchmarks is used, since the improvement is the same across all four):

        | before | after  | improvement |
Z600 64 | 42.4   | 31.25  | 1.35        |
Z600 32 | 46.27  | 39.46  | 1.17        |
Atom 32 | 271.12 | 193.77 | 1.4         |

The actual speedup of the function is larger, simply because the functions account for only 60% of the benchmarks mentioned above. Benchmarks using that function alone show speedups from 1.4 (z600 64 bit) to over 2x (Atom 32 bit).

Credits: This code is the work of many people. Most of the praise should go to Intel's Tom C, who wrote and optimized the code loop. lcwu and I then shared the reviewing, testing, and cleanup effort (templatization, factorization, etc).

In real-life browser scenarios, this function shows up on a variety of GFX-intensive benchmarks. For example, in http://ie.microsoft.com/testdrive/Performance/AsteroidBelt/Default.html the function represents 47% of the workload before optimization and 38% after. That is an improvement in the function of 1.44, which gives a real-world improvement of roughly 16% in frame rate.
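The "1/16 of the cases" shortcut in the description can be illustrated with a scalar sketch. This is not the SSSE3 code from the patch: the function names are hypothetical, and it assumes the sub-pixel fractions subX and subY are 4-bit values (0 to 15) applied to 32-bit pixels with four 8-bit channels. Under that assumption, subY == 0 removes the second row from the blend entirely and the result is bit-identical to the general path.

```cpp
#include <cstdint>

// Hypothetical scalar reference, NOT the SSSE3 code from this patch.
// Assumption: subX and subY are 4-bit fractions in [0, 15].

// Blend one 8-bit channel from the four neighbouring pixels.
static inline uint32_t filter_channel(uint32_t a00, uint32_t a01,
                                      uint32_t a10, uint32_t a11,
                                      unsigned subX, unsigned subY) {
    // The four weights always sum to 16 * 16 = 256, hence the >> 8.
    unsigned w00 = (16 - subX) * (16 - subY);
    unsigned w01 = subX * (16 - subY);
    unsigned w10 = (16 - subX) * subY;
    unsigned w11 = subX * subY;
    return (a00 * w00 + a01 * w01 + a10 * w10 + a11 * w11) >> 8;
}

// General path: reads two rows (p00, p01 on top; p10, p11 below).
uint32_t bilinear_general(uint32_t p00, uint32_t p01,
                          uint32_t p10, uint32_t p11,
                          unsigned subX, unsigned subY) {
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t c = filter_channel((p00 >> shift) & 0xFF, (p01 >> shift) & 0xFF,
                                    (p10 >> shift) & 0xFF, (p11 >> shift) & 0xFF,
                                    subX, subY);
        out |= c << shift;
    }
    return out;
}

// Special case for subY == 0: the bottom-row weights are zero, so only the
// top row is read and the blend reduces to a 1-D interpolation.
// (a00 * 16 * (16 - subX) + a01 * 16 * subX) >> 8
//     == (a00 * (16 - subX) + a01 * subX) >> 4
uint32_t bilinear_suby_zero(uint32_t p00, uint32_t p01, unsigned subX) {
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t a00 = (p00 >> shift) & 0xFF;
        uint32_t a01 = (p01 >> shift) & 0xFF;
        uint32_t c = (a00 * (16 - subX) + a01 * subX) >> 4;
        out |= c << shift;
    }
    return out;
}
```

If subY took each of its 16 values equally often, the simpler path would apply to 1/16 of the rows, which is where the upper bound of roughly 6% comes from.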

Patch Set 1 #

Patch Set 2 : '' #

Total comments: 3

Patch Set 3 : major cleanup, factorization as requested #

Patch Set 4 : updated checkin comments #

Created: 13 years, 5 months ago

Stats (+584 lines, -12 lines)
M	gyp/opts.gyp	2 chunks	+35 lines, -5 lines	0 comments
A	src/opts/SkBitmapProcState_opts_SSSE3.h	1 chunk	+15 lines, -0 lines	0 comments
A	src/opts/SkBitmapProcState_opts_SSSE3.cpp	1 chunk	+497 lines, -0 lines	0 comments
M	src/opts/opts_check_SSE2.cpp	4 chunks	+37 lines, -7 lines	0 comments

Messages

Total messages: 12


TomH

Elliot, can you shed some light on the gyp organization question? Or is there somebody ...

13 years, 6 months ago (2012-02-07 21:21:59 UTC) #1

evannier

On 2012/02/07 21:21:59, TomH wrote: > Elliot, can you shed some light on the gyp ...

13 years, 6 months ago (2012-02-08 03:17:15 UTC) #2

evannier

http://codereview.appspot.com/5515044/diff/1005/src/opts/SkBitmapProcState_opts_SSSE3.cpp File src/opts/SkBitmapProcState_opts_SSSE3.cpp (right): http://codereview.appspot.com/5515044/diff/1005/src/opts/SkBitmapProcState_opts_SSSE3.cpp#newcode83 src/opts/SkBitmapProcState_opts_SSSE3.cpp:83: if (subY == 0) { On 2012/02/07 21:21:59, TomH ...

13 years, 6 months ago (2012-02-08 03:17:28 UTC) #3

epoger

On 2012/02/07 21:21:59, TomH wrote: > Elliot, can you shed some light on the gyp ...

13 years, 6 months ago (2012-02-08 14:48:07 UTC) #4

epoger

On 2012/02/08 14:48:07, epoger wrote: > On 2012/02/07 21:21:59, TomH wrote: > > Elliot, can ...

13 years, 6 months ago (2012-02-08 19:34:36 UTC) #5

TomH

On 2012/02/08 03:17:28, evannier wrote: > So, I am willing to get rid of it ...

13 years, 6 months ago (2012-02-08 19:49:53 UTC) #6

evannier

On 2012/02/08 19:34:36, epoger wrote: > On 2012/02/08 14:48:07, epoger wrote: > > On 2012/02/07 ...

13 years, 6 months ago (2012-02-09 03:12:25 UTC) #7

evannier

On 2012/02/08 19:49:53, TomH wrote: > On 2012/02/08 03:17:28, evannier wrote: > > So, I ...

13 years, 6 months ago (2012-02-09 03:34:53 UTC) #8

On 2012/02/08 19:49:53, TomH wrote:
> On 2012/02/08 03:17:28, evannier wrote:
> > So, I am willing to get rid of it (and remeasure).
> 
> On Windows, it doesn't make a measurable difference (if I run bench multiple
> times at repeat -150, the distributions of if (subY==0) and if (false) overlap
> significantly).
> 
> Is there a particular reason you chose the -scale -forceFilter 1? I know it
> significantly increases how long the benchmarks take to run; is that a use
> case GoogleTV hits?
According to my perf run:
perf record out/Release/bench -config 8888 -scale -match bitmap -repeat 10
Without the forceFilter, the function does not show up at all. With the
forceFilter, it shows up as 16% of the workload, which also shows that the
speedup of the actual function is a lot greater than what the benchmark
results from skia_bench suggest.
My micro benchmarks (measuring just that function, using rdtsc averages) were
showing 2x improvements on Atom, and 1.4 on 32 bit z600, more on 64 bit.

A slightly different command line on a 64 bit z600:
out/Release/bench -config 8888 -scale -forceFilter 1 -match bitmap_8888 -repeat 10

(basically limiting the bench to bitmap_8888) shows the function taking 57.56%
of the workload. Removing the subY optimization makes it show up at 58.52%.

Numbers retrieved using:
out/Release/bench -config 8888 -scale -forceFilter 1 -match bitmap_8888 -repeat 200

running bench [640 480]           bitmap_8888_update  8888: cmsecs =  31.17
running bench [640 480]  bitmap_8888_update_volatile  8888: cmsecs =  31.15
running bench [640 480]                  bitmap_8888  8888: cmsecs =  31.19
running bench [640 480]                bitmap_8888_A  8888: cmsecs =  35.75


Without the subY optimization:
running bench [640 480]           bitmap_8888_update  8888: cmsecs =  31.97
running bench [640 480]  bitmap_8888_update_volatile  8888: cmsecs =  31.95
running bench [640 480]                  bitmap_8888  8888: cmsecs =  31.92
running bench [640 480]                bitmap_8888_A  8888: cmsecs =  36.68

So, this amounts to a difference of roughly 2.5% on this workload. But since the
function is roughly 60% of the workload, this means an improvement of roughly 4%
in the function itself. Whether this is worth it is debatable, but the
difference seems clearly measurable. The question is then whether there is a way
to factor out some of this code (which I assume there is).

> 
> Let's either test on other hardware without the subY == 0 specialization, or
> look for a way to coalesce redundant parts of the code and comments between
> the two branches to make the duplication more manageable. I love the
> explanatory comments, which are better than a lot of our opt procs, but that
> very density works against us when there are two copies of
> not-quite-identical comments.

Let me know if you think the 4% is worth some more work in factoring this code
out. I can go either way.

I will also update the checkin comments with the more detailed data collected
above.

TomH

On 2012/02/09 03:34:53, evannier wrote: > Let me know if you think the 4% are ...

13 years, 6 months ago (2012-02-10 19:54:16 UTC) #9

evannier

On 2012/02/10 19:54:16, TomH wrote: > On 2012/02/09 03:34:53, evannier wrote: > > Let me ...

13 years, 5 months ago (2012-02-13 23:04:14 UTC) #10

TomH

Attempted to commit to Skia in r3193, but it breaks both Windows and Mac compiles.

13 years, 5 months ago (2012-02-14 18:36:30 UTC) #11

TomH

13 years, 5 months ago (2012-02-14 19:58:35 UTC) #12

Fixes in r3194 and r3195 look like they should take; this CL is in and the
review can be closed.
