|
|
Created:
14 years, 7 months ago by XinQi Modified:
14 years, 6 months ago Base URL:
http://skia.googlecode.com/svn/trunk/src Visibility:
Public. |
DescriptionImplementing S32A_Opaque_BlitRow32 using v7 neon instructions.
Taking the advantage of 16 channels of each QualWord register. Also using the software pipelining to scatter the loads/stores among vector operations.
Got roughly 70% improvements on simulation environments.
First-time contributor, please let me know of anything missing. And other reviewers needed.
Patch Set 1 #Patch Set 2 : Update license in header #
Total comments: 3
Patch Set 3 : Update patch upon comments #Patch Set 4 : Make S32A_Opaque_BlitRow32_neon2 bit exact as S32A_Opaque_BlitRow32_neon #MessagesTotal messages: 18
The .S file appears to have a BSD like license at the top of it. However, Skia is Apache licensed. The boilerplate should be copied from an existing Skia file, such as src/core/SkPaint.cpp. Skia probably has closer ties to Android than Chromium, so you should confirm that you've filled out the agreement first: http://source.android.com/license/individual-contributor-license---android-op... After that, I'll try and find someone to review the ARM code. Cheers
Sign in to reply to this message.
Skia has compliance and performance tests that can be run using the proposed patch, in addition to reviewing the source code. mike On Wed, May 12, 2010 at 5:25 PM, <agl@chromium.org> wrote: > The .S file appears to have a BSD like license at the top of it. > However, Skia is Apache licensed. The boilerplate should be copied from > an existing Skia file, such as src/core/SkPaint.cpp. > > Skia probably has closer ties to Android than Chromium, so you should > confirm that you've filled out the agreement first: > http://source.android.com/license/individual-contributor-license---android-op... > > After that, I'll try and find someone to review the ARM code. > > > Cheers > > http://codereview.appspot.com/1148042/show >
Sign in to reply to this message.
Hi Mike, We've used the following methods to verify the patch: 1. I printed out testing vector in Chromium browser code (when browsing), and verifying in simulation environment that new neon code is generating the same output as old one. Also the cycle number gain. 2. We've merged the patch to arm build of Chromium browser, verifying that it works. (I once had a bug that read more through to the next line, you will see that when rendering nytimes.com, the "New York Times" picture is not rendered correctly, also other small images. So by displaying them correctly, it is sort of verifying that it works as expected ) 3. I've also tried the Skia bench with the source tree, which doesn't seems to make much difference. My guess was that this specific call is not exercised much in "bench" application. Please let me know more details about the compliance and performance tests if you want me to run them. Thanks, Xin On 2010/05/14 15:06:23, reed wrote: > Skia has compliance and performance tests that can be run using the > proposed patch, in addition to reviewing the source code. > > mike
Sign in to reply to this message.
Update license in header
Sign in to reply to this message.
http://codereview.appspot.com/1148042/diff/6001/7001 File opts/S32A_Opaque_BlitRow32_neon2.S (right): http://codereview.appspot.com/1148042/diff/6001/7001#newcode18 opts/S32A_Opaque_BlitRow32_neon2.S:18: .fpu neon Assembly code always needs a stonking lot of comments. Feel free to include the C prototype of the function, inputs, outputs, what its doing, etc etc. http://codereview.appspot.com/1148042/diff/6001/7002 File opts/SkBlitRow_opts_arm.cpp (right): http://codereview.appspot.com/1148042/diff/6001/7002#newcode434 opts/SkBlitRow_opts_arm.cpp:434: extern "C" void S32A_Opaque_BlitRow32_neon2(SkPMColor* SK_RESTRICT dst, fix double space after "C" http://codereview.appspot.com/1148042/diff/6001/7002#newcode560 opts/SkBlitRow_opts_arm.cpp:560: #define S32A_Opaque_BlitRow32_PROC S32A_Opaque_BlitRow32_neon2 there are a couple of tabs in this line. Only spaces should be used.
Sign in to reply to this message.
Update patch upon comments
Sign in to reply to this message.
I'd have written the version in opts/SkBlitRow_opts_arm.cpp run with even wider operations if gcc had generated nicer code. The gcc 4.3.2 compiler I'd been using did a lot of extra register spills when I last tried to get really wide -- perhaps the gcc 4.4.x stuff fixes that; it might be worth checking.
Sign in to reply to this message.
Could you expand what you mean? Use inline assembly to implement it or to use intrinsic? On 2010/06/16 22:28:52, ray.essick wrote: > I'd have written the version in opts/SkBlitRow_opts_arm.cpp run with even wider > operations if gcc had generated nicer code. The gcc 4.3.2 compiler I'd been > using did a lot of extra register spills when I last tried to get really wide -- > perhaps the gcc 4.4.x stuff fixes that; it might be worth checking.
Sign in to reply to this message.
sure, i can expand/explain where i'm coming from. I wrote a lot of that neon code -- basically all the pieces using intrinsics. i'm a big fan of the intrinsics -- let the compiler manage all the register assignments, scheduling instructions and such. and I think that they are easier to maintain than hand assembly. your comments indicate that you're getting 70% improvement (i take this as "almost 2x the speed"). that's pretty good considering that the first neon version was alread 3x faster than the scalar version. That extra bit would have me thinking about whether to bend my natural inclination to stay with the intrinsics. as to what I was thinking to try ... i was thinking perhaps to expand the int8x8's that i'd been using into int8x16's [with appropriate fattening of the other intermediate types i'd used]. but as I re-read my code, i see that i widen my int8x8's to int16x8 -- so i'd need to widen an int8x16 to int16x16 and we don't have that data type. It appears that the place I was having the ugly register spills was where I'd wanted to use intrinsics to generate some vld4.8 instructions in the S32A_D565_Opaque_Dither_neon routine. so my memory is doubly bad this evening. You can see there how I worked around the limitations in gcc to get clean vld4.8 generated code; the register allocation worked nicely and I wasn't seeing any strange register motion. I might play the same trick that I used for the vld4.8's to get a "vld1.32 {d0,d1,d2,d3}" so that we get the big "wider loads are better" advantage; i'd have to code it to see whether it would hold together through the actual operations once it's in registers. And your assembly code does do better with loading for the next round while finishing this round; i never mastered that with the intrinsics. I wouldn't be surprised if it reduces to that. make sure to build & run skia_test -- the last go round of changes that I did for neon with Mike involved changes to track some updated semantics of SkAlpha255To256(). this took a little bit to get right -- and I think that the semantics of that routine might have changed back. skia_test was good for ferreting out that this wasn't right. you'll want to check external/skia/tests/Android.mk to make sure skia_test is being built for you. and to be clear -- I haven't seen anything wrong in the code you posted (haven't worked through it either); it was more of a "have we exhausted what we could do with intrinsics" comment. -- Ray Essick According to XinQi@codeaurora.org on 06/16/10 19:36: > Could you expand what you mean? Use inline assembly to implement it or > to use intrinsic? > > On 2010/06/16 22:28:52, ray.essick wrote: >> I'd have written the version in opts/SkBlitRow_opts_arm.cpp run with > even wider >> operations if gcc had generated nicer code. The gcc 4.3.2 compiler I'd > been >> using did a lot of extra register spills when I last tried to get > really wide -- >> perhaps the gcc 4.4.x stuff fixes that; it might be worth checking. > > > > http://codereview.appspot.com/1148042/show
Sign in to reply to this message.
Ray, Thanks for the detailed comments. * Using testing vector dumped in chromium browser loading nytimes.com, this version uses 76% less cycles in simulator. So it is about 4x speed :-). Of course, it depends on the specific hardware pipelining, and model of memory latency. And we also didn't see dramatic improvements in real usage case as page loading. * You are very much right that I have struggled with SkAlpha255To256 too. dst = src + SkAlphaMulQ(dst, SkAlpha255To256(255 - SkGetPackedA32(src))) and dst = src + (dst * ((255-a) + (255-a)>>7)) >> 8 May have one bit difference. I am using the second algo in 8 points chunk, and only use the first one in the last len%8 points. So it will not be bit exact. I had assumed that 1/255 difference in color should be OK, but I checked your latest code, it seems that you changed all to use (256-a) in 16 bits channel. Is it prefered way? I can change the code too. * I was using chromium browser on chromiumOS to test it. I will run skia_test on Android env * One big part of the gain is to process 8 points at a chunk instead of 4 points. I didn't check but I believe there should be same intrinsic to do the work. * My former experience using intrinsic/inline assembly is that it is easy to start with, but we may have to spend more time understanding why gcc is not generating exactly what we are expecting. I am also not sure whether you can tune for the hardware pipeline using intrinsics because gcc may have the freedom to shift the instruction based on its standards of data dependency and register allocation. Pure assembly is more clean, but you do have to take care of more things as ABI and register allocation. Just personal preference and depends on the complexity of the work. Happy coding! Xin
Sign in to reply to this message.
On Thu, Jun 17, 2010 at 2:54 PM, <XinQi@codeaurora.org> wrote: > May have one bit difference. I am using the second algo in 8 points > chunk, and only use the first one in the last len%8 points. So it will > not be bit exact. I had assumed that 1/255 difference in color should > be OK, but I checked your latest code, it seems that you changed all to > use (256-a) in 16 bits channel. Is it prefered way? I can change the > code too. I haven't rolled this change into Chromium yet. However, if it causes pixel differences it will fail the layout tests and be reverted. AGL
Sign in to reply to this message.
On Thu, Jun 17, 2010 at 2:58 PM, Adam Langley <agl@chromium.org> wrote: > I haven't rolled this change into Chromium yet. However, if it causes > pixel differences it will fail the layout tests and be reverted. Reverted. Broke the build. (Possibly just the new files were missing from the build.) AGL
Sign in to reply to this message.
Make S32A_Opaque_BlitRow32_neon2 bit exact as S32A_Opaque_BlitRow32_neon
Sign in to reply to this message.
Make S32A_Opaque_BlitRow32_neon2 bit exact as S32A_Opaque_BlitRow32_neon
Sign in to reply to this message.
* Update algorithm to use dst = src + SkAlphaMulQ(dst, SkAlpha255To256(255 - SkGetPackedA32(src))) it is not a big change because we have to expand to 16 bits lane to do multiplication anyway. * Have verified that this function has bit exact output as original S32A_Opaque_BlitRow32_neon on Chromium testing vector. * Have also verified that Android skia_test passes with this change. * This patch is based on the reverted r582. So it is the full patch. Removed space at the end of the lines too. On 2010/06/18 18:45:03, XinQi wrote: > Make S32A_Opaque_BlitRow32_neon2 bit exact as S32A_Opaque_BlitRow32_neon
Sign in to reply to this message.
AGL, We need the following updates on chromium/src/skia/skia.gyp for the new added .S files. Thanks, Xin diff --git a/skia.gyp b/skia.gyp index 2b17094..4305fa4 100644 --- a/skia.gyp +++ b/skia.gyp @@ -726,7 +726,13 @@ 'sources': [ '../third_party/skia/src/opts/SkBitmapProcState_opts_arm.cpp', '../third_party/skia/src/opts/SkBlitRow_opts_arm.cpp', - '../third_party/skia/src/opts/SkUtils_opts_none.cpp', + '../third_party/skia/src/opts/S32A_Opaque_BlitRow32_neon2.S', + '../third_party/skia/src/opts/S32_Opaque_D32_nofilter_DX_gather.S', + '../third_party/skia/src/opts/xfer.S', + '../third_party/skia/src/opts/memset16_neon.S', + '../third_party/skia/src/opts/memset32_neon.S', + '../third_party/skia/src/opts/opts_check_arm_neon.cpp', + '../third_party/skia/src/opts/t32cb16blend.S', ], }], ],
Sign in to reply to this message.
|