|
|
Created:
11 years, 9 months ago by shikhas Modified:
10 years, 10 months ago CC:
skia-review_googlegroups.com Base URL:
http://skia.googlecode.com/svn/trunk/ Visibility:
Public. |
DescriptionAdd optimization for 32bit blits on neon
Patch Set 1 #
Total comments: 1
MessagesTotal messages: 21
Thanks for writing this! Can you let us know which benchmarks you've run against it, what hardware that was on, and what the results look like (comparing with your patch and without)? With many Skia benchmarks, you need to run -repeat 150 (or higher repeat counts) to get noise down below 1%.
Sign in to reply to this message.
adding our android dudes
Sign in to reply to this message.
On 2012/07/27 16:35:59, TomH wrote: > Thanks for writing this! > > Can you let us know which benchmarks you've run against it, what hardware that > was on, and what the results look like (comparing with your patch and without)? > > With many Skia benchmarks, you need to run -repeat 150 (or higher repeat counts) > to get noise down below 1%. This patch was tested on Tegra3 platform and the Quadrant 2D benchmark scores tripled when compared to the score without the patch.
Sign in to reply to this message.
If you need any help figuring out how to run our benchmark suite please feel free to ask me as well as check out the documentation on https://sites.google.com/site/skiadocs/user-documentation/quick-start-guides/.... http://codereview.appspot.com/6430069/diff/1/src/opts/SkBlitRow_opts_arm.cpp File src/opts/SkBlitRow_opts_arm.cpp (right): http://codereview.appspot.com/6430069/diff/1/src/opts/SkBlitRow_opts_arm.cpp#... src/opts/SkBlitRow_opts_arm.cpp:425: #if defined(__ARM_HAVE_NEON) && defined(SK_CPU_LENDIAN) && defined(TEST_SRC_ALPHA) This code won't get compiled in Skia as we never define TEST_SRC_ALPHA. In fact we are planning to remove that check in https://codereview.appspot.com/6457056. Is this function intended to replace S32A_Opaque_BlitRow32_neon? I ran Skia's benchmark suite after replacing S32A_Opaque_BlitRow32_neon with your function. However, none of our tests showed the 3x speedup you mentioned. This could easily be a gap in our benchmark coverage, but we will need to demonstrate the advantage of this patch in our benchmark suite before we can accept it. Can you identify the specific steps this benchmark is taking so we can write a benchmark for our suite to replicate it?
Sign in to reply to this message.
On 2012/07/30 20:32:31, DerekS wrote: > If you need any help figuring out how to run our benchmark suite please feel > free to ask me as well as check out the documentation on > https://sites.google.com/site/skiadocs/user-documentation/quick-start-guides/.... > > http://codereview.appspot.com/6430069/diff/1/src/opts/SkBlitRow_opts_arm.cpp > File src/opts/SkBlitRow_opts_arm.cpp (right): > > http://codereview.appspot.com/6430069/diff/1/src/opts/SkBlitRow_opts_arm.cpp#... > src/opts/SkBlitRow_opts_arm.cpp:425: #if defined(__ARM_HAVE_NEON) && > defined(SK_CPU_LENDIAN) && defined(TEST_SRC_ALPHA) > This code won't get compiled in Skia as we never define TEST_SRC_ALPHA. In fact > we are planning to remove that check in https://codereview.appspot.com/6457056. > > Is this function intended to replace S32A_Opaque_BlitRow32_neon? I ran Skia's > benchmark suite after replacing S32A_Opaque_BlitRow32_neon with your function. > However, none of our tests showed the 3x speedup you mentioned. This could > easily be a gap in our benchmark coverage, but we will need to demonstrate the > advantage of this patch in our benchmark suite before we can accept it. Can you > identify the specific steps this benchmark is taking so we can write a benchmark > for our suite to replicate it? Thanks for the review. Would get back to you on results of skia benchmark suite. Yes the function is supposed to replace S32A_Opaque_BlitRow32_neon. I believe you did not try Qaudrant 2D benchmark ?
Sign in to reply to this message.
On 2012/07/31 17:44:34, shikhas wrote: > Thanks for the review. > Would get back to you on results of skia benchmark suite. > Yes the function is supposed to replace S32A_Opaque_BlitRow32_neon. > I believe you did not try Qaudrant 2D benchmark ? Can you tell us where to get it and how to compile it with our code? But we will also need a test that's part of our suite; Skia tries to provide a wide suite of microbenchmarks with easily-understood structure, as well as some vertical slices.
Sign in to reply to this message.
On 2012/07/31 17:50:22, TomH wrote: You can download the free edition from here -http://www.aurorasoftworks.com/products/quadrant. It shall run five different kind of tests. We shall be interested in 2D scores. Thanks!
Sign in to reply to this message.
On 2012/07/31 18:04:13, shikhas wrote: > On 2012/07/31 17:50:22, TomH wrote: > > You can download the free edition from here > -http://www.aurorasoftworks.com/products/quadrant. > It shall run five different kind of tests. We shall be interested in 2D scores. > Thanks! Have a question for DerekS. Did you try this patch on Tegra2 or Tegra3 platform? Tegra2 platform does not have NEON and would not show any benefits with this patch. skia_bench has shown significant performance gains with this patch on Tegra3 Platform. Thanks!!
Sign in to reply to this message.
On 2012/08/16 18:28:41, shikhas wrote: > On 2012/07/31 18:04:13, shikhas wrote: > > On 2012/07/31 17:50:22, TomH wrote: > > > > You can download the free edition from here > > -http://www.aurorasoftworks.com/products/quadrant. > > It shall run five different kind of tests. We shall be interested in 2D > scores. > > Thanks! > > Have a question for DerekS. Did you try this patch on Tegra2 or Tegra3 platform? > Tegra2 platform does not have NEON and would not show any benefits with this > patch. skia_bench has shown significant performance gains with this patch on > Tegra3 Platform. Thanks!! Sorry that I haven't updated this CL lately, but it is still very much on my radar. I'm working to reconcile the NEON changes that are already in AOSP (Android OpenSource Project) with this patch so that we don't have duplicate code.
Sign in to reply to this message.
On 2012/08/17 12:51:34, DerekS wrote: > On 2012/08/16 18:28:41, shikhas wrote: > > On 2012/07/31 18:04:13, shikhas wrote: > > > On 2012/07/31 17:50:22, TomH wrote: > > > > > > You can download the free edition from here > > > -http://www.aurorasoftworks.com/products/quadrant. > > > It shall run five different kind of tests. We shall be interested in 2D > > scores. > > > Thanks! > > > > Have a question for DerekS. Did you try this patch on Tegra2 or Tegra3 > platform? > > Tegra2 platform does not have NEON and would not show any benefits with this > > patch. skia_bench has shown significant performance gains with this patch on > > Tegra3 Platform. Thanks!! > > Sorry that I haven't updated this CL lately, but it is still very much on my > radar. I'm working to reconcile the NEON changes that are already in AOSP > (Android OpenSource Project) with this patch so that we don't have duplicate > code. Ok. So I think I've gotten everything in order with AOSP and this patch is just a sub-set of the code that is already present there. The upstream patch from AOSP to Skia (http://codereview.appspot.com/6465075/) contains not only this function but the non-neon version of this function as well. However, in Skia's trunk we already had a contribution from ARM that implemented these functions. So I still need to take the time and verify that this patch is indeed more efficient than the one we have gotten from ARM.
Sign in to reply to this message.
@dereks, is there any way we can delegate some work to @shikas to help out? Is the other patch available? I believe that if we give shikas some things to run, they can do the tests for us.
Sign in to reply to this message.
On 2012/08/22 23:43:42, nduca wrote: > @dereks, is there any way we can delegate some work to @shikas to help out? Is > the other patch available? I believe that if we give shikas some things to run, > they can do the tests for us. The patch that @shikas would need to replace the existing optimizations with those provided by NVidia is at https://codereview.appspot.com/6465075/. To run our microbenchmark tool with/without that patch he would need to follow these steps https://sites.google.com/site/skiadocs/user-documentation/quick-start-guides/... to build and run on a device. The skia android build needs a device profile so for testing I used... -d nexus_7 for testing the NEON path -d xoom for testing the ARM path After compiling the appropriate code I ran our benchmark tool with a repeat count between 75 and 150 but I couldn't determine that the new Nvidia optimizations were faster than the existing optimizations. If Nvidia could contribute a new bench that shows an improvement with there code or demonstrate an improvement with an existing bench I would be happy to take the patch.
Sign in to reply to this message.
The first test defined in BitmapBench.cpp should see a benefit from this patch: static SkBenchmark* Fact0(void* p) { return new BitmapBench(p, false, SkBitmap::kARGB_8888_Config); From my understanding, the test draws an opaque circle with anti-aliased edges over a transparent background, all within an opaque outline of a rectangle. So most of the blending ops can be skipped if TEST_SRC_ALPHA is defined. This patch attempts to optimize further the TEST_SRC_ALPHA fast paths, especially for runs of fully opaque or transparent pixels. The exact use case of the above test. Quadrant is slightly different in that pretty much all source pixels are opaque. Perhaps someone can add a skia bench test that uses the blending path even when the source pixels are mostly opaque? What is the motivation for removing TEST_SRC_ALPHA, btw? I think there are some use cases on mobile that benefit from it a lot. Also, I think skia-bench uses microsecond resolution timers. If that's true, is there any chance of running this test with higher precision timers?
Sign in to reply to this message.
On 2012/08/27 20:46:34, brianderson wrote: > Also, I think skia-bench uses microsecond resolution timers. If that's true, is > there any chance of running this test with higher precision timers? Ideally, that microsecond-resolution timer should be applied to 3-6ms worth of repeats of the microbenchmark, so it'll have plenty of resolution - we'll only be interested in changes of 10s or 100s of us. There are some tests where the repeat count is too high and we run for scores of ms; there are others where the repeat count is too low and we run for << 1ms. Some tests, particularly GPU, are much slower on some test bots due presumably to driver or hardware issues. Some tests are too fast on some architectures, or too slow on others. But the goal is to be in a range where us resolution is plenty.
Sign in to reply to this message.
On 2012/08/27 21:25:29, TomH wrote: > On 2012/08/27 20:46:34, brianderson wrote: > > Also, I think skia-bench uses microsecond resolution timers. If that's true, > is > > there any chance of running this test with higher precision timers? > > Ideally, that microsecond-resolution timer should be applied to 3-6ms worth of > repeats of the microbenchmark, so it'll have plenty of resolution - we'll only > be interested in changes of 10s or 100s of us. > > There are some tests where the repeat count is too high and we run for scores of > ms; there are others where the repeat count is too low and we run for << 1ms. > Some tests, particularly GPU, are much slower on some test bots due presumably > to driver or hardware issues. Some tests are too fast on some architectures, or > too slow on others. But the goal is to be in a range where us resolution is > plenty. Brian, the Fact0 bench does indeed show some improvement. However, in that case we don't explicitly mark the bitmap as opaque even though it is. The test we are missing is one where the bitmaps pixels aren't opaque where testing the src alpha will result in an branch in our code for a path we wouldn't take. At this point I just don't know how much of a penalty it is in the real world for doing that extra conditional check given I don't know how often the src alpha will really be opaque for a non-opaque bitmap. Also to be clear the presence of the src alpha test is not the reason we haven't accepted the patch. The primary reason is that we have a optimized version of the functions already (provided by ARM) and I have yet to come up with the empirical evidence to prove that the functions in this CL are better.
Sign in to reply to this message.
> Brian, the Fact0 bench does indeed show some improvement. However, in that case we don't explicitly mark the bitmap as opaque even though it is. The test we are missing is one where the bitmaps pixels aren't opaque where testing the src alpha will result in an branch in our code for a path we wouldn't take. Is the improvement over ARM's version or just over the C version? Maybe I misunderstand what Fact0 does, but I thought the source bitmap had a good mix of alpha=0 (the transparent background) and alpha=255 (the red filled circle and blue stroked rectangle), with a small number of pixels at the edges of the circle with alphas in between, somewhat representing an icon use case. Does "bm.eraseColor(isOpaque ? SK_ColorBLACK : 0)" provide a transparent background when isOpaque is false? > At this point I just don't know how much of a penalty it is in the real world for doing that extra conditional check given I don't know how often the src alpha will really be opaque for a non-opaque bitmap. > The primary reason is that we have a optimized version of the functions already (provided by ARM) and I have yet to come up with the empirical evidence to prove that the functions in this CL are better. Ok, thanks for the clarification. You make a good point that the optimization in this patch is very sensitive to the source content. If it's a win at all, it'll likely only be a win in some cases. For example, if there are a lot of transitions between the alpha states, the thrashing will almost definitely make it perform worse. If it's an overall win will depend on the distribution of source content in the real world and if we are okay with a wider variance in performance.
Sign in to reply to this message.
On 2012/08/28 18:39:05, brianderson wrote: > > Brian, the Fact0 bench does indeed show some improvement. However, in that > case we don't explicitly mark the bitmap as opaque even though it is. The test > we are missing is one where the bitmaps pixels aren't opaque where testing the > src alpha will result in an branch in our code for a path we wouldn't take. > > Is the improvement over ARM's version or just over the C version? > > Maybe I misunderstand what Fact0 does, but I thought the source bitmap had a > good mix of alpha=0 (the transparent background) and alpha=255 (the red filled > circle and blue stroked rectangle), with a small number of pixels at the edges > of the circle with alphas in between, somewhat representing an icon use case. > > Does "bm.eraseColor(isOpaque ? SK_ColorBLACK : 0)" provide a transparent > background when isOpaque is false? > > > At this point I just don't know how much of a penalty it is in the real world > for doing that extra conditional check given I don't know how often the src > alpha will really be opaque for a non-opaque bitmap. > > > The primary reason is that we have a optimized version of > the functions already (provided by ARM) and I have yet to come up with the > empirical evidence to prove that the functions in this CL are better. > > Ok, thanks for the clarification. You make a good point that the optimization in > this patch is very sensitive to the source content. If it's a win at all, it'll > likely only be a win in some cases. For example, if there are a lot of > transitions between the alpha states, the thrashing will almost definitely make > it perform worse. If it's an overall win will depend on the distribution of > source content in the real world and if we are okay with a wider variance in > performance. The drawimage calls into canvas context for opaque and transparent image show significant performance improvement, almost ~ 80%. Running a simple javascript in chrome browser (test was done in ChromeOS) to make repeated drawimage calls shows that time taken in paintSkBitmap() in chromium/src/third_party/WebKit/Source/WebCore/platform/graphics/skia/ImageSkia.cpp gets significantly reduced. about:tracing in chrome browser confirmed the above observations. The tests were done with this patch for neon optimization on tegra3 platform and compared with browser build with only ARM patch. No performance improvement was seen for the drawimage call with globalAlpha value 0.5
Sign in to reply to this message.
On 2012/08/28 22:14:25, shikhas wrote: > On 2012/08/28 18:39:05, brianderson wrote: > > > Brian, the Fact0 bench does indeed show some improvement. However, in that > > case we don't explicitly mark the bitmap as opaque even though it is. The > test > > we are missing is one where the bitmaps pixels aren't opaque where testing the > > src alpha will result in an branch in our code for a path we wouldn't take. > > > > Is the improvement over ARM's version or just over the C version? > > > > Maybe I misunderstand what Fact0 does, but I thought the source bitmap had a > > good mix of alpha=0 (the transparent background) and alpha=255 (the red filled > > circle and blue stroked rectangle), with a small number of pixels at the edges > > of the circle with alphas in between, somewhat representing an icon use case. > > > > Does "bm.eraseColor(isOpaque ? SK_ColorBLACK : 0)" provide a transparent > > background when isOpaque is false? > > > > > At this point I just don't know how much of a penalty it is in the real > world > > for doing that extra conditional check given I don't know how often the src > > alpha will really be opaque for a non-opaque bitmap. > > > > > The primary reason is that we have a optimized version of > > the functions already (provided by ARM) and I have yet to come up with the > > empirical evidence to prove that the functions in this CL are better. > > > > Ok, thanks for the clarification. You make a good point that the optimization > in > > this patch is very sensitive to the source content. If it's a win at all, > it'll > > likely only be a win in some cases. For example, if there are a lot of > > transitions between the alpha states, the thrashing will almost definitely > make > > it perform worse. If it's an overall win will depend on the distribution of > > source content in the real world and if we are okay with a wider variance in > > performance. > > The drawimage calls into canvas context for opaque and transparent image show > significant performance improvement, almost ~ 80%. Running a simple javascript > in chrome browser (test was done in ChromeOS) to make repeated drawimage calls > shows that time taken in paintSkBitmap() in > chromium/src/third_party/WebKit/Source/WebCore/platform/graphics/skia/ImageSkia.cpp > gets significantly reduced. about:tracing in chrome browser confirmed the above > observations. The tests were done with this patch for neon optimization on > tegra3 platform and compared with browser build with only ARM patch. > > No performance improvement was seen for the drawimage call with globalAlpha > value 0.5 Thanks for providing these additional numbers. I'm starting to feel like this patch could be an overall gain and the ChromeOS numbers are reassuring, but I think the thing to do before accepting this patch is to create a new bench that is similar to BitmapBech::Fact0. The new bench would test drawing bitmaps with a variety of different src alphas. I'll see if I can construct such a bench and attach it to my existing patch.
Sign in to reply to this message.
I also posted some numbers to Derek's patch in http://codereview.appspot.com/6465075/
Sign in to reply to this message.
On 2012/08/29 16:21:45, Antti wrote: > I also posted some numbers to Derek's patch in > http://codereview.appspot.com/6465075/ I've incorporated this change into https://codereview.appspot.com/6465075/ which has been committed so you can close this issue
Sign in to reply to this message.
|