Issue 5569077: Implementing Color32 functions for Neon platforms.

Created:
12 years, 11 months ago by evannier

Modified:
12 years, 5 months ago

Reviewers:
reed1, EricB, DerekS, TomH

CC:
skia-review_googlegroups.com

Base URL:
http://skia.googlecode.com/svn/trunk/

Visibility:
Public.

Description

Implementing Color32 functions for Neon platforms.

Besides the raw processing improvement provided by Neon, the code uses memory prefetches (pld), which seem to improve performance greatly when dealing with very large counts.

This was tested using bench, where color32 accounts for the majority of the workload:

bench -match rects_1 -config 8888 -repeat 500 -forceBlend 1

(The forceBlend is there so that the Color32 code does not go through the special case where alpha == 0xFF, as that would transform color32 into a sk_memset32.)

Numbers averaged over 3 runs:

bench name       | Before | Neon, no pld | Neon with pld | full boost
rrects_1         | 153.9  | 128.3        | 92            | 1.66x
rects_1_stroke_4 | 32.8   | 31.4         | 28.45         | 1.15x
rects_1          | 125.35 | 97.2         | 63.59         | 1.97x

Credits: various googletv team members.
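
For readers unfamiliar with the routine, the blend that Color32 performs can be sketched in portable C++ roughly as follows. This is a minimal sketch and not the patch's Neon code: the names PMColor, alpha_mul_q and color32_sketch are hypothetical, alpha is assumed to sit in the top byte of a premultiplied pixel, the prefetch distance of 64 pixels is an arbitrary illustration, and __builtin_prefetch (a GCC/Clang builtin) stands in for the pld instruction. The special cases for alpha == 0 and alpha == 0xFF are left out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a premultiplied 32-bit pixel type.
using PMColor = std::uint32_t;

// Scale all four 8-bit channels of a packed pixel by scale/256 (scale in 0..256).
static inline PMColor alpha_mul_q(PMColor c, unsigned scale) {
    const std::uint32_t mask = 0x00FF00FFu;
    std::uint32_t rb = ((c & mask) * scale) >> 8;  // bytes 0 and 2
    std::uint32_t ag = ((c >> 8) & mask) * scale;  // bytes 1 and 3
    return (rb & mask) | (ag & ~mask);
}

// Blend a constant premultiplied color over src:
//   dst[i] = color + src[i] * (256 - (alpha + 1)) / 256
void color32_sketch(PMColor* dst, const PMColor* src, int count, PMColor color) {
    assert(count >= 0);
    const unsigned alpha = color >> 24;  // assumption: alpha in the top byte
    const unsigned scale = 256 - (alpha + 1);
    for (int i = 0; i < count; ++i) {
#if defined(__GNUC__) || defined(__clang__)
        // One hint per 64 bytes of source, 64 pixels ahead, only while in range.
        if ((i & 15) == 0 && i + 64 < count) {
            __builtin_prefetch(src + i + 64);
        }
#endif
        dst[i] = color + alpha_mul_q(src[i], scale);
    }
}
```

With alpha == 0xFF the scale becomes 0 and every destination pixel equals the color, which is why the real code can replace that case with a memset-style fill and why the benchmark needs -forceBlend to exercise the blending path.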

Patch Set 1 #

Total comments: 9

Patch Set 2 : small modifications based on comments, small code cleanup, still some issues unresolved #

Patch Set 3 : '' #

Patch Set 4 : '' #

Created: 12 years, 10 months ago

Stats: +143 lines, -4 lines

M  src/opts/SkBlitRow_opts_arm.cpp  | 3 chunks | +109 lines, -4 lines | 0 comments
A  src/opts/SkCachePreload_arm.h    | 1 chunk  | +34 lines, -0 lines  | 0 comments

Messages

Total messages: 15

TomH

How did you find measured speedup? I ran the full Skia benchmark suite on a ...

12 years, 11 months ago (2012-02-07 16:31:53 UTC) #1

evannier

12 years, 11 months ago (2012-02-09 03:01:06 UTC) #2

On 2012/02/07 16:31:53, TomH wrote:
> How did you find measured speedup? I ran the full Skia benchmark suite on a
> nexus_s with and without your changes, and see very few improvements.
> 
> We need to understand where you think this will help us in order to accept it.
> Do we need new benchmarks?
> 
I am sorry. This function does not show up in many of the benchmarks covered by
skia_bench. To be more precise, if I remember correctly, when it does show up,
it accounts for only a very small percentage, which made it impossible for me
to benchmark this part of the code using skia_bench. Instead, I created a
rudimentary piece of code that tests that function, and only that function,
allowing me to get results.
This function does, however, show up in many real-life scenarios at large
percentages; I just have not found the exact combination of command line
options that will make it appear as sizeable using skia_bench.
I suspect we should work on improving skia_bench in this regard, or maybe
somebody more knowledgeable about the option behavior will be able to find a
command line that makes this method account for more than a few percent.
If you do not have a suggestion for a command line, I can probably find a way to
integrate this into skia_bench. Let me know if you have any suggestions.


> http://codereview.appspot.com/5569077/diff/1/src/opts/SkBlitRow_opts_arm.cpp
> File src/opts/SkBlitRow_opts_arm.cpp (right):
> 
> http://codereview.appspot.com/5569077/diff/1/src/opts/SkBlitRow_opts_arm.cpp#...
> src/opts/SkBlitRow_opts_arm.cpp:19: #define CACHE_LINE_SIZE     32
> Is your intent that compilers would redefine this on other targets? Would that
> make an #ifndef guard reasonable?
> 
> The web suggests that there are 64B neon cache line architectures, although I
> don't know if we're compiling Skia on any of them yet.
> 
> http://codereview.appspot.com/5569077/diff/1/src/opts/SkBlitRow_opts_arm.cpp#...
> src/opts/SkBlitRow_opts_arm.cpp:24: #   define PLD128(x, n)     PLD64(x, n)
> PLD64(x, (n) + 64)
> Style nit: in Skia, spaces go before the #, not after.
> 
> http://codereview.appspot.com/5569077/diff/1/src/opts/SkBlitRow_opts_arm.cpp#...
> src/opts/SkBlitRow_opts_arm.cpp:1293: asm volatile (
> Only indenting the subsequent 4 instead of 14 would make the line length limit
> a little less stringent
> 
> http://codereview.appspot.com/5569077/diff/1/src/opts/SkBlitRow_opts_arm.cpp#...
> src/opts/SkBlitRow_opts_arm.cpp:1357: // left to process.
> Very nice explanatory comment; thanks!
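
The #ifndef guard suggested in the review comment above might look like the following sketch. It is only an illustration and is not taken from the patch or from SkCachePreload_arm.h: CACHE_LINE_SIZE and its default of 32 come from the reviewed line, while lines_to_cover and prefetch_range are hypothetical helpers, and __builtin_prefetch (a GCC/Clang builtin) stands in for the PLD macros.

```cpp
#include <cassert>
#include <cstddef>

// Default to 32 bytes unless the build overrides it,
// for example with -DCACHE_LINE_SIZE=64 on a 64-byte-line core.
#ifndef CACHE_LINE_SIZE
#define CACHE_LINE_SIZE 32
#endif

static_assert((CACHE_LINE_SIZE & (CACHE_LINE_SIZE - 1)) == 0,
              "cache line size must be a power of two");

// Hypothetical helper: number of cache lines needed to cover `bytes` bytes.
constexpr std::size_t lines_to_cover(std::size_t bytes) {
    return (bytes + CACHE_LINE_SIZE - 1) / CACHE_LINE_SIZE;
}

// Hypothetical helper: issue one prefetch hint per cache line over [p, p + bytes).
inline void prefetch_range(const void* p, std::size_t bytes) {
    assert(p != nullptr || bytes == 0);
    const char* c = static_cast<const char*>(p);
    for (std::size_t off = 0; off < bytes; off += CACHE_LINE_SIZE) {
#if defined(__GNUC__) || defined(__clang__)
        __builtin_prefetch(c + off);  // a hint only; it never changes program results
#else
        (void)c;  // no portable prefetch available; do nothing
#endif
    }
}
```

Deriving the number of hints from CACHE_LINE_SIZE, rather than hard-coding a chain of fixed-size macros, is one way to keep the prefetch coverage correct if a 64-byte target is added later.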

evannier

http://codereview.appspot.com/5569077/diff/1/src/opts/SkBlitRow_opts_arm.cpp File src/opts/SkBlitRow_opts_arm.cpp (right): http://codereview.appspot.com/5569077/diff/1/src/opts/SkBlitRow_opts_arm.cpp#newcode19 src/opts/SkBlitRow_opts_arm.cpp:19: #define CACHE_LINE_SIZE 32 I admit this is not exactly ...

12 years, 11 months ago (2012-02-09 03:01:26 UTC) #3

TomH

It looks like bench -match rects is the only benchmark that heavily hits the nontrivial ...

12 years, 11 months ago (2012-02-09 18:58:47 UTC) #4

evannier

On 2012/02/09 18:58:47, TomH wrote: > It looks like bench -match rects is the only ...

12 years, 11 months ago (2012-02-14 21:58:30 UTC) #5

evannier

Thanks for the previous reviews. I have made a few modifications based on your comments, ...

12 years, 10 months ago (2012-03-01 23:33:31 UTC) #6

TomH

Eric may be able to do the performance study on actual hardware we wanted?

12 years, 5 months ago (2012-07-24 21:39:04 UTC) #8

evannier

On 2012/07/24 21:39:04, TomH wrote: > Eric may be able to do the performance study ...

12 years, 5 months ago (2012-07-24 21:53:00 UTC) #9

TomH

On 2012/07/24 21:53:00, evannier wrote: > On 2012/07/24 21:39:04, TomH wrote: > > Eric may ...

12 years, 5 months ago (2012-07-24 21:54:30 UTC) #10

EricB

On 2012/07/24 21:54:30, TomH wrote: > On 2012/07/24 21:53:00, evannier wrote: > > On 2012/07/24 ...

12 years, 5 months ago (2012-07-24 22:02:36 UTC) #11

EricB

12 years, 5 months ago (2012-07-25 12:45:05 UTC) #12

Looks like we're seeing 5-10% improvement on the rects_* benchmarks.  Before:

D/skia    (30219): running bench [640 480]             rects_3_stroke_4
D/skia    (30219):   8888: cmsecs =   6.31
D/skia    (30219):    565: cmsecs =   4.91
D/skia    (30219):    GPU: cmsecs =  14.42
D/skia    (30219):   NULLGPU: cmsecs =   4.17
D/skia    (30219): 
D/skia    (30219): running bench [640 480]                      rects_3
D/skia    (30219):   8888: cmsecs =   4.87
D/skia    (30219):    565: cmsecs =   3.40
D/skia    (30219):    GPU: cmsecs =  11.69
D/skia    (30219):   NULLGPU: cmsecs =   3.86
D/skia    (30219): 
D/skia    (30219): running bench [640 480]             rects_1_stroke_4
D/skia    (30219):   8888: cmsecs =  17.98
D/skia    (30219):    565: cmsecs =  13.68
D/skia    (30219):    GPU: cmsecs =  24.15
D/skia    (30219):   NULLGPU: cmsecs =   4.17
D/skia    (30219): 
D/skia    (30219): running bench [640 480]                      rects_1
D/skia    (30219):   8888: cmsecs =  22.82
D/skia    (30219):    565: cmsecs =  11.81
D/skia    (30219):    GPU: cmsecs =  31.50
D/skia    (30219):   NULLGPU: cmsecs =   3.86

After:

D/skia    (24301): running bench [640 480]             rects_3_stroke_4
D/skia    (24301):   8888: cmsecs =   5.89
D/skia    (24301):    565: cmsecs =   4.72
D/skia    (24301):    GPU: cmsecs =  14.32
D/skia    (24301):   NULLGPU: cmsecs =   4.24
D/skia    (24301): 
D/skia    (24301): running bench [640 480]                      rects_3
D/skia    (24301):   8888: cmsecs =   4.65
D/skia    (24301):    565: cmsecs =   3.49
D/skia    (24301):    GPU: cmsecs =  11.84
D/skia    (24301):   NULLGPU: cmsecs =   3.92
D/skia    (24301): 
D/skia    (24301): running bench [640 480]             rects_1_stroke_4
D/skia    (24301):   8888: cmsecs =  16.33
D/skia    (24301):    565: cmsecs =  13.07
D/skia    (24301):    GPU: cmsecs =  23.69
D/skia    (24301):   NULLGPU: cmsecs =   4.23
D/skia    (24301): 
D/skia    (24301): running bench [640 480]                      rects_1
D/skia    (24301):   8888: cmsecs =  21.66
D/skia    (24301):    565: cmsecs =  11.58
D/skia    (24301):    GPU: cmsecs =  32.09
D/skia    (24301):   NULLGPU: cmsecs =   3.94
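
The 5-10% figure can be checked against the 8888 rows above with a small helper. This is a sketch for the arithmetic only: percent_faster, BenchRow and k8888Rows are hypothetical names, and the numbers are copied from the two logs.

```cpp
#include <cassert>

// Relative time saved, in percent: 100 * (before - after) / before.
inline double percent_faster(double before, double after) {
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}

// 8888 config timings (cmsecs) copied from the Before and After logs.
struct BenchRow {
    const char* name;
    double before;
    double after;
};

static const BenchRow k8888Rows[] = {
    {"rects_3_stroke_4", 6.31, 5.89},
    {"rects_3", 4.87, 4.65},
    {"rects_1_stroke_4", 17.98, 16.33},
    {"rects_1", 22.82, 21.66},
};
```

For these rows the helper gives roughly 6.7%, 4.5%, 9.2% and 5.1% respectively, which is in line with the range quoted at the top of the message.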

evannier

12 years, 5 months ago (2012-07-25 16:23:21 UTC) #13

As mentioned in the Change description, on a different processor, the
improvements are much larger (see numbers above).

So, if there is an improvement on your CPU as well as ours, this is good news
;-).

On 2012/07/25 12:45:05, EricB wrote:
> Looks like we're seeing 5-10% improvement on the rects_* benchmarks.  Before:
> 
> D/skia    (30219): running bench [640 480]             rects_3_stroke_4
> D/skia    (30219):   8888: cmsecs =   6.31
> D/skia    (30219):    565: cmsecs =   4.91
> D/skia    (30219):    GPU: cmsecs =  14.42
> D/skia    (30219):   NULLGPU: cmsecs =   4.17
> D/skia    (30219): 
> D/skia    (30219): running bench [640 480]                      rects_3
> D/skia    (30219):   8888: cmsecs =   4.87
> D/skia    (30219):    565: cmsecs =   3.40
> D/skia    (30219):    GPU: cmsecs =  11.69
> D/skia    (30219):   NULLGPU: cmsecs =   3.86
> D/skia    (30219): 
> D/skia    (30219): running bench [640 480]             rects_1_stroke_4
> D/skia    (30219):   8888: cmsecs =  17.98
> D/skia    (30219):    565: cmsecs =  13.68
> D/skia    (30219):    GPU: cmsecs =  24.15
> D/skia    (30219):   NULLGPU: cmsecs =   4.17
> D/skia    (30219): 
> D/skia    (30219): running bench [640 480]                      rects_1
> D/skia    (30219):   8888: cmsecs =  22.82
> D/skia    (30219):    565: cmsecs =  11.81
> D/skia    (30219):    GPU: cmsecs =  31.50
> D/skia    (30219):   NULLGPU: cmsecs =   3.86
> 
> After:
> 
> D/skia    (24301): running bench [640 480]             rects_3_stroke_4
> D/skia    (24301):   8888: cmsecs =   5.89
> D/skia    (24301):    565: cmsecs =   4.72
> D/skia    (24301):    GPU: cmsecs =  14.32
> D/skia    (24301):   NULLGPU: cmsecs =   4.24
> D/skia    (24301): 
> D/skia    (24301): running bench [640 480]                      rects_3
> D/skia    (24301):   8888: cmsecs =   4.65
> D/skia    (24301):    565: cmsecs =   3.49
> D/skia    (24301):    GPU: cmsecs =  11.84
> D/skia    (24301):   NULLGPU: cmsecs =   3.92
> D/skia    (24301): 
> D/skia    (24301): running bench [640 480]             rects_1_stroke_4
> D/skia    (24301):   8888: cmsecs =  16.33
> D/skia    (24301):    565: cmsecs =  13.07
> D/skia    (24301):    GPU: cmsecs =  23.69
> D/skia    (24301):   NULLGPU: cmsecs =   4.23
> D/skia    (24301): 
> D/skia    (24301): running bench [640 480]                      rects_1
> D/skia    (24301):   8888: cmsecs =  21.66
> D/skia    (24301):    565: cmsecs =  11.58
> D/skia    (24301):    GPU: cmsecs =  32.09
> D/skia    (24301):   NULLGPU: cmsecs =   3.94

EricB

12 years, 5 months ago (2012-07-26 14:20:50 UTC) #15

On 2012/07/25 17:00:01, TomH wrote:
> LGTM.

Committed as r4779.  Please close.