Issue 1148042: ARM Neon optimization for S32A_Opaque_BlitRow32


Created:
15 years, 3 months ago by XinQi

Modified:
15 years, 1 month ago

Reviewers:
reed, ray.essick, agl

Base URL:
http://skia.googlecode.com/svn/trunk/src

Visibility:
Public.

Description

Implements S32A_Opaque_BlitRow32 using ARMv7 NEON instructions, taking advantage of the 16 8-bit channels held in each quad-word (Q) register. It also uses software pipelining to interleave the loads and stores with the vector operations. Measured roughly 70% improvement in simulation environments. I am a first-time contributor, so please let me know if anything is missing or if other reviewers are needed.

Patch Set 1 #

Patch Set 2 : Update license in header #

Total comments: 3

Patch Set 3 : Update patch upon comments #

Patch Set 4 : Make S32A_Opaque_BlitRow32_neon2 bit exact as S32A_Opaque_BlitRow32_neon #

Created: 15 years, 1 month ago


Stats (+284 lines, -1 line)
opts/S32A_Opaque_BlitRow32_neon2.S: 1 chunk, +280 lines, -0 lines, 0 comments
opts/SkBlitRow_opts_arm.cpp: 2 chunks, +4 lines, -1 line, 0 comments

Messages

Total messages: 18


agl

The .S file appears to have a BSD like license at the top of it. ...

15 years, 3 months ago (2010-05-12 21:25:29 UTC) #2

reed

Skia has compliance and performance tests that can be run using the proposed patch, in ...

15 years, 2 months ago (2010-05-14 15:06:23 UTC) #3

XinQi

Hi Mike, We've used the following methods to verify the patch: 1. I printed out ...

15 years, 2 months ago (2010-05-17 19:32:44 UTC) #4

agl

http://codereview.appspot.com/1148042/diff/6001/7001 File opts/S32A_Opaque_BlitRow32_neon2.S (right): http://codereview.appspot.com/1148042/diff/6001/7001#newcode18 opts/S32A_Opaque_BlitRow32_neon2.S:18: .fpu neon Assembly code always needs a stonking lot ...

15 years, 2 months ago (2010-06-03 21:47:41 UTC) #6

ray.essick

I'd have written the version in opts/SkBlitRow_opts_arm.cpp run with even wider operations if gcc had ...

15 years, 1 month ago (2010-06-16 22:28:52 UTC) #9

XinQi

Could you expand what you mean? Use inline assembly to implement it or to use ...

15 years, 1 month ago (2010-06-17 00:36:31 UTC) #10

ray.essick

15 years, 1 month ago (2010-06-17 03:40:15 UTC) #11

sure, i can expand/explain where i'm coming from.  I wrote a lot of that 
neon code -- basically all the pieces using intrinsics. i'm a big fan of 
the intrinsics -- let the compiler manage all the register assignments, 
scheduling instructions and such. and I think that they are easier to 
maintain than hand assembly.

your comments indicate that you're getting 70% improvement (i take this 
as "almost 2x the speed").  that's pretty good considering that the 
first neon version was already 3x faster than the scalar version. That 
extra bit would have me thinking about whether to bend my natural 
inclination to stay with the intrinsics.

as to what I was thinking to try ...

i was thinking perhaps to expand the int8x8's that i'd been using into 
int8x16's [with appropriate fattening of the other intermediate types 
i'd used].  but as I re-read my code, i see that i widen my int8x8's to 
int16x8 -- so i'd need to widen an int8x16 to int16x16 and we don't have 
that data type.
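
To illustrate the widening constraint described above, here is a minimal portable C++ sketch that emulates the lanes with arrays. It is not the code from opts/SkBlitRow_opts_arm.cpp; the names U8x8, U16x8, widenMul, narrowHigh and scaleLanes are hypothetical. The product of two 8-bit values can need 16 bits (255 * 255 = 65025), so 8 input lanes already yield 8 x 16 = 128 bits, which is a full Q register. Sixteen input lanes would need 256 bits for the products.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical lane types: arrays standing in for 64-bit and 128-bit vectors.
using U8x8  = std::array<std::uint8_t, 8>;   // 8 lanes x 8 bits  = 64 bits
using U16x8 = std::array<std::uint16_t, 8>;  // 8 lanes x 16 bits = 128 bits

// Widening multiply: each pair of 8-bit lanes yields one 16-bit product.
U16x8 widenMul(const U8x8& a, const U8x8& b) {
    U16x8 out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = static_cast<std::uint16_t>(a[i] * b[i]);
    }
    return out;
}

// Narrow back to 8 bits by keeping the high byte of each product (>> 8).
U8x8 narrowHigh(const U16x8& v) {
    U8x8 out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = static_cast<std::uint8_t>(v[i] >> 8);
    }
    return out;
}

// Scale 8 channels by one 8-bit factor: widen, multiply, narrow.
U8x8 scaleLanes(const U8x8& px, unsigned scale) {
    assert(scale <= 255);  // the factor must itself fit in an 8-bit lane
    U8x8 factor;
    factor.fill(static_cast<std::uint8_t>(scale));
    return narrowHigh(widenMul(px, factor));
}
```

In NEON terms this roughly corresponds to a widening multiply followed by a narrowing shift; the sketch only shows why the intermediate type doubles in width.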

It appears that the place I was having the ugly register spills was 
where I'd wanted to use intrinsics to generate some vld4.8 instructions 
in the S32A_D565_Opaque_Dither_neon routine.  so my memory is doubly bad 
this evening. You can see there how I worked around the limitations in 
gcc to get clean vld4.8 generated code; the register allocation worked 
nicely and I wasn't seeing any strange register motion.

I might play the same trick that I used for the vld4.8's to get a 
"vld1.32 {d0,d1,d2,d3}" so that we get the big "wider loads are better" 
advantage; i'd have to code it to see whether it would hold together 
through the actual operations once it's in registers.

And your assembly code does do better with loading for the next round 
while finishing this round; i never mastered that with the intrinsics. I 
wouldn't be surprised if it reduces to that.
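
The "load for the next round while finishing this round" structure can be sketched in scalar C++ as below. This is only an illustration of the loop shape (prologue, steady state, epilogue), not the assembly from the patch, and a compiler is free to reorder it. The helper names are hypothetical, and the assumptions that alpha sits in the top byte and that the 255-to-256 conversion is a + 1 are mine.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: scale each 8-bit channel of a packed pixel by scale/256.
static inline std::uint32_t alphaMulQ(std::uint32_t c, unsigned scale) {
    const std::uint32_t mask = 0x00FF00FFu;
    std::uint32_t rb = ((c & mask) * scale) >> 8;   // bytes 0 and 2
    std::uint32_t ag = ((c >> 8) & mask) * scale;   // bytes 1 and 3
    return (rb & mask) | (ag & ~mask);
}

// Hypothetical per-pixel blend; assumes alpha in the top byte and a + 1 scaling.
static inline std::uint32_t blendOne(std::uint32_t s, std::uint32_t d) {
    return s + alphaMulQ(d, (255 - (s >> 24)) + 1);
}

// Plain loop: load, compute, store in each iteration.
void blitRowSimple(std::uint32_t* dst, const std::uint32_t* src, int count) {
    assert(count >= 0);
    for (int i = 0; i < count; ++i) {
        dst[i] = blendOne(src[i], dst[i]);
    }
}

// Pipelined shape: the loads for pixel i+1 are issued before the work on
// pixel i is finished and stored.
void blitRowPipelined(std::uint32_t* dst, const std::uint32_t* src, int count) {
    assert(count >= 0);
    if (count == 0) return;
    std::uint32_t s = src[0];          // prologue: first loads
    std::uint32_t d = dst[0];
    for (int i = 0; i + 1 < count; ++i) {
        std::uint32_t nextS = src[i + 1];  // loads for the next round
        std::uint32_t nextD = dst[i + 1];
        dst[i] = blendOne(s, d);           // finish and store this round
        s = nextS;
        d = nextD;
    }
    dst[count - 1] = blendOne(s, d);   // epilogue: last pixel
}
```

Both functions should produce identical rows; the pipelined form only changes when the loads happen relative to the arithmetic.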

make sure to build & run skia_test  -- the last go round of changes that 
I did for neon with Mike involved changes to track some updated 
semantics of SkAlpha255To256(). this took a little bit to get right -- 
and I think that the semantics of that routine might have changed back.  
skia_test was good for ferreting out that this wasn't right. you'll want 
to check external/skia/tests/Android.mk to make sure skia_test is being 
built for you.
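
As a rough illustration of why a change in the 255-to-256 conversion matters for bit exactness, the sketch below compares two plausible definitions. Both are assumptions for the sake of the example and are not taken from the Skia tree at this revision; the names to256PlusOne, to256PlusHighBit, scaleChannel and countDifferingInputs are hypothetical.

```cpp
#include <cassert>

// Candidate A (assumed): map 0..255 to 1..256 by adding one.
static inline unsigned to256PlusOne(unsigned a) { return a + 1; }

// Candidate B (assumed): map 0..255 to 0..256 by adding the top bit.
static inline unsigned to256PlusHighBit(unsigned a) { return a + (a >> 7); }

// Scale one 8-bit channel by scale256/256.
static inline unsigned scaleChannel(unsigned channel, unsigned scale256) {
    assert(channel <= 255 && scale256 <= 256);
    return (channel * scale256) >> 8;
}

// Count the 8-bit alpha values for which the two candidates disagree.
int countDifferingInputs() {
    int n = 0;
    for (unsigned a = 0; a <= 255; ++a) {
        if (to256PlusOne(a) != to256PlusHighBit(a)) ++n;
    }
    return n;
}
```

Under these assumptions the two candidates disagree for every alpha below 128, and the scaled channel can then differ in its lowest bit, which is the kind of mismatch a bit-exact test would flag.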

and to be clear -- I haven't seen anything wrong in the code you posted 
(haven't worked through it either); it was more of a "have we exhausted 
what we could do with intrinsics" comment.

-- Ray Essick

According to XinQi@codeaurora.org  on 06/16/10 19:36:
> Could you expand what you mean?  Use inline assembly to implement it or
> to use intrinsic?
>
> On 2010/06/16 22:28:52, ray.essick wrote:
>> I'd have written the version in opts/SkBlitRow_opts_arm.cpp run with even
>> wider operations if gcc had generated nicer code. The gcc 4.3.2 compiler
>> I'd been using did a lot of extra register spills when I last tried to get
>> really wide -- perhaps the gcc 4.4.x stuff fixes that; it might be worth
>> checking.
>
> http://codereview.appspot.com/1148042/show

XinQi

Ray, Thanks for the detailed comments. * Using testing vector dumped in chromium browser loading ...

15 years, 1 month ago (2010-06-17 18:54:29 UTC) #12

agl

On Thu, Jun 17, 2010 at 2:54 PM, <XinQi@codeaurora.org> wrote: > May have one bit ...

15 years, 1 month ago (2010-06-17 18:58:09 UTC) #13

agl

On Thu, Jun 17, 2010 at 2:58 PM, Adam Langley <agl@chromium.org> wrote: > I haven't ...

15 years, 1 month ago (2010-06-18 15:34:38 UTC) #14

XinQi

Make S32A_Opaque_BlitRow32_neon2 bit exact as S32A_Opaque_BlitRow32_neon

15 years, 1 month ago (2010-06-18 18:28:49 UTC) #15

XinQi

Make S32A_Opaque_BlitRow32_neon2 bit exact as S32A_Opaque_BlitRow32_neon

15 years, 1 month ago (2010-06-18 18:45:03 UTC) #16

XinQi

* Update algorithm to use dst = src + SkAlphaMulQ(dst, SkAlpha255To256(255 - SkGetPackedA32(src))) it is ...

15 years, 1 month ago (2010-06-18 18:50:27 UTC) #17
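
For readers following the formula quoted in the message above, here is a minimal scalar C++ sketch of dst = src + SkAlphaMulQ(dst, SkAlpha255To256(255 - SkGetPackedA32(src))). The functions below are hypothetical stand-ins written for this example, not Skia's definitions: the sketch assumes premultiplied 32-bit pixels with alpha in the top byte and assumes the 255-to-256 conversion is a + 1.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for SkGetPackedA32 (assumption: alpha is the top byte).
static inline unsigned getPackedA32(std::uint32_t c) { return c >> 24; }

// Stand-in for SkAlpha255To256 (assumption: a + 1).
static inline unsigned alpha255To256(unsigned a) { return a + 1; }

// Stand-in for SkAlphaMulQ: scale all four channels by scale/256,
// handling two channels per 32-bit multiply.
static inline std::uint32_t alphaMulQ(std::uint32_t c, unsigned scale) {
    const std::uint32_t mask = 0x00FF00FFu;
    std::uint32_t rb = ((c & mask) * scale) >> 8;   // bytes 0 and 2
    std::uint32_t ag = ((c >> 8) & mask) * scale;   // bytes 1 and 3
    return (rb & mask) | (ag & ~mask);
}

// Source-over blend of one row, one pixel at a time.
void blitRowSrcOver(std::uint32_t* dst, const std::uint32_t* src, int count) {
    assert(count >= 0);
    for (int i = 0; i < count; ++i) {
        unsigned scale = alpha255To256(255 - getPackedA32(src[i]));
        dst[i] = src[i] + alphaMulQ(dst[i], scale);
    }
}
```

With these assumptions a fully transparent source leaves the destination unchanged (scale is 256), and a fully opaque source replaces it (scale is 1, so every scaled channel rounds down to zero).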

XinQi

15 years, 1 month ago (2010-06-18 18:56:42 UTC) #18

AGL,

We need the following updates to chromium/src/skia/skia.gyp for the newly added
.S files.

Thanks,
Xin

diff --git a/skia.gyp b/skia.gyp
index 2b17094..4305fa4 100644
--- a/skia.gyp
+++ b/skia.gyp
@@ -726,7 +726,13 @@
           'sources': [
             '../third_party/skia/src/opts/SkBitmapProcState_opts_arm.cpp',
             '../third_party/skia/src/opts/SkBlitRow_opts_arm.cpp',
-            '../third_party/skia/src/opts/SkUtils_opts_none.cpp',
+            '../third_party/skia/src/opts/S32A_Opaque_BlitRow32_neon2.S', 
+            '../third_party/skia/src/opts/S32_Opaque_D32_nofilter_DX_gather.S',
+            '../third_party/skia/src/opts/xfer.S',
+            '../third_party/skia/src/opts/memset16_neon.S', 
+            '../third_party/skia/src/opts/memset32_neon.S', 
+            '../third_party/skia/src/opts/opts_check_arm_neon.cpp',
+            '../third_party/skia/src/opts/t32cb16blend.S', 
           ],
         }],
       ],
