|
|
DescriptionImplementing Color32 functions for Neon platforms.
Besides the raw processing improvement provided by Neon, the code uses memory
preteches (pld) which seem to improve performance greatly when dealing with
very large counts.
This was tested using bench where color32 accounts for the majority of the
workload:
bench -match rects_1 -config 8888 -repeat 500 -forceBlend 1
(the forceBlend is there so that the Color32 code does not go through the
special cases where alpha == 0xFF as it would transform color32 into
a sk_memset32.
Numbers averaged over 3 runs:
bench name | Before | Neon, no pld | Neon with pld | full boost
rrects_1 | 153.9 | 128.3 | 92 | 1.66x
rects_1_stroke_4| 32.8 | 31.4 | 28.45 | 1.15x
rects_1 | 125.35 | 97.2 | 63.59 | 1.97x
Credits: various googletv team members.
Patch Set 1 #
Total comments: 9
Patch Set 2 : small modifications based on comments, small code cleanup, still some issues unresolved #Patch Set 3 : '' #Patch Set 4 : '' #MessagesTotal messages: 15
How did you find measured speedup? I ran the full Skia benchmark suite on a nexus_s with and without your changes, and see very few improvements. We need to understand where you think this will help us in order to accept it. Do we need new benchmarks? http://codereview.appspot.com/5569077/diff/1/src/opts/SkBlitRow_opts_arm.cpp File src/opts/SkBlitRow_opts_arm.cpp (right): http://codereview.appspot.com/5569077/diff/1/src/opts/SkBlitRow_opts_arm.cpp#... src/opts/SkBlitRow_opts_arm.cpp:19: #define CACHE_LINE_SIZE 32 Is your intent that compilers would redefine this on other targets? Would that make an #ifndef guard reasonable? The web suggests that there are 64B neon cache line architectures, although I don't know if we're compiling Skia on any of them yet. http://codereview.appspot.com/5569077/diff/1/src/opts/SkBlitRow_opts_arm.cpp#... src/opts/SkBlitRow_opts_arm.cpp:24: # define PLD128(x, n) PLD64(x, n) PLD64(x, (n) + 64) Style nit: in Skia, spaces go before the #, not after. http://codereview.appspot.com/5569077/diff/1/src/opts/SkBlitRow_opts_arm.cpp#... src/opts/SkBlitRow_opts_arm.cpp:1293: asm volatile ( Only indenting the subsequent 4 instead of 14 would make the line length limit a little less stringent http://codereview.appspot.com/5569077/diff/1/src/opts/SkBlitRow_opts_arm.cpp#... src/opts/SkBlitRow_opts_arm.cpp:1357: // left to process. Very nice explanatory comment; thanks!
Sign in to reply to this message.
On 2012/02/07 16:31:53, TomH wrote: > How did you find measured speedup? I ran the full Skia benchmark suite on a > nexus_s with and without your changes, and see very few improvements. > > We need to understand where you think this will help us in order to accept it. > Do we need new benchmarks? > I am sorry. This function does not show up on many benchmarks covered by skia bench. To be more precise, if I remember correctly, when it does, it does show up with a very small percentage, meaning that it proved impossible for me to benchmark this very part of the code using skia_bench. Instead, I have created a rudimentary piece of code that tests that function, and only that function, allowing me to get results. This function though shows up in many real life scenarios at large percentages, I have just found the exact combination of command line options that will make this appear as sizeable using skia_bench. I suspect we should work on improving skia_bench in this regard, or maybe somebody more knowledgeable on the option behavior will be able to find a command line that allows this method to be at anything but a few percents. If you do not have a suggestion for a command line, I can probably find a way to integrate this into skia_bench. Let me know if you have any suggestions. > http://codereview.appspot.com/5569077/diff/1/src/opts/SkBlitRow_opts_arm.cpp > File src/opts/SkBlitRow_opts_arm.cpp (right): > > http://codereview.appspot.com/5569077/diff/1/src/opts/SkBlitRow_opts_arm.cpp#... > src/opts/SkBlitRow_opts_arm.cpp:19: #define CACHE_LINE_SIZE 32 > Is your intent that compilers would redefine this on other targets? Would that > make an #ifndef guard reasonable? > > The web suggests that there are 64B neon cache line architectures, although I > don't know if we're compiling Skia on any of them yet. > > http://codereview.appspot.com/5569077/diff/1/src/opts/SkBlitRow_opts_arm.cpp#... > src/opts/SkBlitRow_opts_arm.cpp:24: # define PLD128(x, n) PLD64(x, n) > PLD64(x, (n) + 64) > Style nit: in Skia, spaces go before the #, not after. > > http://codereview.appspot.com/5569077/diff/1/src/opts/SkBlitRow_opts_arm.cpp#... > src/opts/SkBlitRow_opts_arm.cpp:1293: asm volatile ( > Only indenting the subsequent 4 instead of 14 would make the line length limit a > little less stringent > > http://codereview.appspot.com/5569077/diff/1/src/opts/SkBlitRow_opts_arm.cpp#... > src/opts/SkBlitRow_opts_arm.cpp:1357: // left to process. > Very nice explanatory comment; thanks!
Sign in to reply to this message.
http://codereview.appspot.com/5569077/diff/1/src/opts/SkBlitRow_opts_arm.cpp File src/opts/SkBlitRow_opts_arm.cpp (right): http://codereview.appspot.com/5569077/diff/1/src/opts/SkBlitRow_opts_arm.cpp#... src/opts/SkBlitRow_opts_arm.cpp:19: #define CACHE_LINE_SIZE 32 I admit this is not exactly pretty and lacks comments. I will have to go back to the original developer on these. On 2012/02/07 16:31:53, TomH wrote: > Is your intent that compilers would redefine this on other targets? Would that > make an #ifndef guard reasonable? > > The web suggests that there are 64B neon cache line architectures, although I > don't know if we're compiling Skia on any of them yet. http://codereview.appspot.com/5569077/diff/1/src/opts/SkBlitRow_opts_arm.cpp#... src/opts/SkBlitRow_opts_arm.cpp:1293: asm volatile ( On 2012/02/07 16:31:53, TomH wrote: > Only indenting the subsequent 4 instead of 14 would make the line length limit a > little less stringent Done. http://codereview.appspot.com/5569077/diff/1/src/opts/SkBlitRow_opts_arm.cpp#... src/opts/SkBlitRow_opts_arm.cpp:1357: // left to process. On 2012/02/07 16:31:53, TomH wrote: > Very nice explanatory comment; thanks! I found that placing this count outside the if was not necessary, since count is guaranteed to be less than 8 in the hypothetical else part of the if. As a result, I moved the code above, inside the if, adjusting the comment.
Sign in to reply to this message.
It looks like bench -match rects is the only benchmark that heavily hits the nontrivial cases of Color32_SSE2, and even that is only around 12%. Windows benchmark results are too noisy to be useful here, but on Linux with very high -repeat counts (1500?) results are quite stable and it seems like SSE2 gives us a clear 10% win on the rects_1 and rects_3 tests. Can you test that benchmark on neon with and without your patch? If that isn't sufficient we can try to put together a new, more artificial benchmark tomorrow. Could you give us the code you use to test Color32 in isolation? Also, can you point us to some real life workloads that use it heavily?
Sign in to reply to this message.
On 2012/02/09 18:58:47, TomH wrote: > It looks like bench -match rects is the only benchmark that heavily hits the > nontrivial cases of Color32_SSE2, and even that is only around 12%. > When I try on x64 http://ie.microsoft.com/testdrive/Performance/PulsatingBubbles/Default.xhtml the method color32_SSE2 accounts between 20 and 27 % of the workload (using zoom 2x seems to increase the overall weight of this function). In all cases, it is the hottest function of the run where the second hottest SkAlphaRuns::add is coming at 17 or 13 % of the workload. Note that this was measured on x86 which already has this method optimized, so I assume that the weight would be much higher on ARM. (unfortunately, the perf tool on my ARM platform, while working, does affect performance significantly on the CPU I have, so I always take the numbers with a grain of salt). > Windows benchmark results are too noisy to be useful here, but on Linux with > very high -repeat counts (1500?) results are quite stable and it seems like SSE2 > gives us a clear 10% win on the rects_1 and rects_3 tests. Can you test that > benchmark on neon with and without your patch? If that isn't sufficient we can > try to put together a new, more artificial benchmark tomorrow. I think a difference benchmark might be in order. > > Could you give us the code you use to test Color32 in isolation? > > Also, can you point us to some real life workloads that use it heavily? As mentioned above: http://ie.microsoft.com/testdrive/Performance/PulsatingBubbles/Default.xhtml (maybe not that real world really, but at least a web page that exercises this heavily).
Sign in to reply to this message.
Thanks for the previous reviews. I have made a few modifications based on your comments, managed to get the improvements to be easily benchmarked with bench, and to show some sizable improvements. See in the updated change description. Please take another look ! Thanks, http://codereview.appspot.com/5569077/diff/1/src/opts/SkBlitRow_opts_arm.cpp File src/opts/SkBlitRow_opts_arm.cpp (right): http://codereview.appspot.com/5569077/diff/1/src/opts/SkBlitRow_opts_arm.cpp#... src/opts/SkBlitRow_opts_arm.cpp:19: #define CACHE_LINE_SIZE 32 On 2012/02/09 03:01:26, evannier wrote: > I admit this is not exactly pretty and lacks comments. I will have to go back to > the original developer on these. > I have researched a bit more, and I have included all of this inside a different header file, with the macros controlled by defines that might be controlled in gyp. As a result, unless explicitly defined, this will no effect. See though the performance numbers, and there should be a strong motivation to have that enabled if the cache line size can be deduced. I have further extracted this inside a header file, because using PLD in many places seems to help quite a bit, and further changes I will make, will actually make use of this. > > > On 2012/02/07 16:31:53, TomH wrote: > > Is your intent that compilers would redefine this on other targets? Would that > > make an #ifndef guard reasonable? > > > > The web suggests that there are 64B neon cache line architectures, although I > > don't know if we're compiling Skia on any of them yet. > http://codereview.appspot.com/5569077/diff/1/src/opts/SkBlitRow_opts_arm.cpp#... src/opts/SkBlitRow_opts_arm.cpp:24: # define PLD128(x, n) PLD64(x, n) PLD64(x, (n) + 64) On 2012/02/07 16:31:53, TomH wrote: > Style nit: in Skia, spaces go before the #, not after. Done.
Sign in to reply to this message.
Derek, is this more up your alley for actual testing?
Sign in to reply to this message.
Eric may be able to do the performance study on actual hardware we wanted?
Sign in to reply to this message.
On 2012/07/24 21:39:04, TomH wrote: > Eric may be able to do the performance study on actual hardware we wanted? Tom, unfortunately, I have moved to another group, and I do not have access to HW I used to. I will ask if there is anybody else that might be willing to take this over.
Sign in to reply to this message.
On 2012/07/24 21:53:00, evannier wrote: > On 2012/07/24 21:39:04, TomH wrote: > > Eric may be able to do the performance study on actual hardware we wanted? > > Tom, unfortunately, I have moved to another group, and I do not have access to > HW I used to. I will ask if there is anybody else that might be willing to take > this over. Sorry if that was unclear, I meant Eric Boren, our newest Android hire, who was asking me off-thread about this bug this morning; I just wanted to get him on the record as somebody to go to (since I may be temporarily moving to another group myself).
Sign in to reply to this message.
On 2012/07/24 21:54:30, TomH wrote: > On 2012/07/24 21:53:00, evannier wrote: > > On 2012/07/24 21:39:04, TomH wrote: > > > Eric may be able to do the performance study on actual hardware we wanted? > > > > Tom, unfortunately, I have moved to another group, and I do not have access to > > HW I used to. I will ask if there is anybody else that might be willing to > take > > this over. > > Sorry if that was unclear, I meant Eric Boren, our newest Android hire, who was > asking me off-thread about this bug this morning; I just wanted to get him on > the record as somebody to go to (since I may be temporarily moving to another > group myself). The benchmarks are in progress as we speak. I'm about to head out for the day, but I'll have results first thing tomorrow.
Sign in to reply to this message.
Looks like we're seeing 5-10% improvement on the rects_* benchmarks. Before: D/skia (30219): running bench [640 480] rects_3_stroke_4 D/skia (30219): 8888: cmsecs = 6.31 D/skia (30219): 565: cmsecs = 4.91 D/skia (30219): GPU: cmsecs = 14.42 D/skia (30219): NULLGPU: cmsecs = 4.17 D/skia (30219): D/skia (30219): running bench [640 480] rects_3 D/skia (30219): 8888: cmsecs = 4.87 D/skia (30219): 565: cmsecs = 3.40 D/skia (30219): GPU: cmsecs = 11.69 D/skia (30219): NULLGPU: cmsecs = 3.86 D/skia (30219): D/skia (30219): running bench [640 480] rects_1_stroke_4 D/skia (30219): 8888: cmsecs = 17.98 D/skia (30219): 565: cmsecs = 13.68 D/skia (30219): GPU: cmsecs = 24.15 D/skia (30219): NULLGPU: cmsecs = 4.17 D/skia (30219): D/skia (30219): running bench [640 480] rects_1 D/skia (30219): 8888: cmsecs = 22.82 D/skia (30219): 565: cmsecs = 11.81 D/skia (30219): GPU: cmsecs = 31.50 D/skia (30219): NULLGPU: cmsecs = 3.86 After: D/skia (24301): running bench [640 480] rects_3_stroke_4 D/skia (24301): 8888: cmsecs = 5.89 D/skia (24301): 565: cmsecs = 4.72 D/skia (24301): GPU: cmsecs = 14.32 D/skia (24301): NULLGPU: cmsecs = 4.24 D/skia (24301): D/skia (24301): running bench [640 480] rects_3 D/skia (24301): 8888: cmsecs = 4.65 D/skia (24301): 565: cmsecs = 3.49 D/skia (24301): GPU: cmsecs = 11.84 D/skia (24301): NULLGPU: cmsecs = 3.92 D/skia (24301): D/skia (24301): running bench [640 480] rects_1_stroke_4 D/skia (24301): 8888: cmsecs = 16.33 D/skia (24301): 565: cmsecs = 13.07 D/skia (24301): GPU: cmsecs = 23.69 D/skia (24301): NULLGPU: cmsecs = 4.23 D/skia (24301): D/skia (24301): running bench [640 480] rects_1 D/skia (24301): 8888: cmsecs = 21.66 D/skia (24301): 565: cmsecs = 11.58 D/skia (24301): GPU: cmsecs = 32.09 D/skia (24301): NULLGPU: cmsecs = 3.94
Sign in to reply to this message.
As mentioned in the Change description, on a different processor, the improvements are much larger (see numbers above). So, if there is an improvement on your CPU as well as ours, this is good news ;-). On 2012/07/25 12:45:05, EricB wrote: > Looks like we're seeing 5-10% improvement on the rects_* benchmarks. Before: > > D/skia (30219): running bench [640 480] rects_3_stroke_4 > D/skia (30219): 8888: cmsecs = 6.31 > D/skia (30219): 565: cmsecs = 4.91 > D/skia (30219): GPU: cmsecs = 14.42 > D/skia (30219): NULLGPU: cmsecs = 4.17 > D/skia (30219): > D/skia (30219): running bench [640 480] rects_3 > D/skia (30219): 8888: cmsecs = 4.87 > D/skia (30219): 565: cmsecs = 3.40 > D/skia (30219): GPU: cmsecs = 11.69 > D/skia (30219): NULLGPU: cmsecs = 3.86 > D/skia (30219): > D/skia (30219): running bench [640 480] rects_1_stroke_4 > D/skia (30219): 8888: cmsecs = 17.98 > D/skia (30219): 565: cmsecs = 13.68 > D/skia (30219): GPU: cmsecs = 24.15 > D/skia (30219): NULLGPU: cmsecs = 4.17 > D/skia (30219): > D/skia (30219): running bench [640 480] rects_1 > D/skia (30219): 8888: cmsecs = 22.82 > D/skia (30219): 565: cmsecs = 11.81 > D/skia (30219): GPU: cmsecs = 31.50 > D/skia (30219): NULLGPU: cmsecs = 3.86 > > After: > > D/skia (24301): running bench [640 480] rects_3_stroke_4 > D/skia (24301): 8888: cmsecs = 5.89 > D/skia (24301): 565: cmsecs = 4.72 > D/skia (24301): GPU: cmsecs = 14.32 > D/skia (24301): NULLGPU: cmsecs = 4.24 > D/skia (24301): > D/skia (24301): running bench [640 480] rects_3 > D/skia (24301): 8888: cmsecs = 4.65 > D/skia (24301): 565: cmsecs = 3.49 > D/skia (24301): GPU: cmsecs = 11.84 > D/skia (24301): NULLGPU: cmsecs = 3.92 > D/skia (24301): > D/skia (24301): running bench [640 480] rects_1_stroke_4 > D/skia (24301): 8888: cmsecs = 16.33 > D/skia (24301): 565: cmsecs = 13.07 > D/skia (24301): GPU: cmsecs = 23.69 > D/skia (24301): NULLGPU: cmsecs = 4.23 > D/skia (24301): > D/skia (24301): running bench [640 480] rects_1 > D/skia (24301): 8888: cmsecs = 21.66 > D/skia (24301): 565: cmsecs = 11.58 > D/skia (24301): GPU: cmsecs = 32.09 > D/skia (24301): NULLGPU: cmsecs = 3.94
Sign in to reply to this message.
LGTM.
Sign in to reply to this message.
|