Created: 12 years, 10 months ago by gyagp
Modified: 10 years, 10 months ago
CC: skia-review@googlegroups.com, yupingx.chen@intel.com
Base URL: https://skia.googlecode.com/svn/trunk
Visibility: Public
Description:
Add assembly versions of memset32 and memset16 that utilize 64-bit registers for x86 Posix systems.
Contributed by yupingx.chen@intel.com

Patch Set 1 #
Patch Set 2 : add memset bench #
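The core trick the patch relies on — packing two 32-bit values into one 64-bit register so each store fills two slots at once — can be illustrated in portable C++. This is a sketch of the technique only; the function name is invented here, and the actual patch implements the idea (plus alignment handling and wider stores) in hand-written x86-64 assembly:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Sketch: fill 'count' 32-bit slots with 'value', eight bytes per store.
// Name and structure are illustrative, not taken from the patch.
void memset32_wide_sketch(uint32_t* dst, uint32_t value, size_t count) {
    // Pack two copies of the 32-bit value into one 64-bit word.
    const uint64_t wide = (static_cast<uint64_t>(value) << 32) | value;
    const size_t pairs = count / 2;
    for (size_t i = 0; i < pairs; ++i) {
        // An 8-byte memcpy compiles down to a single 64-bit store.
        std::memcpy(dst + 2 * i, &wide, sizeof(wide));
    }
    if (count & 1) {  // trailing odd element
        dst[count - 1] = value;
    }
}
```

The same packing idea extends to memset16, where one 64-bit store covers four 16-bit values.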
Messages
Total messages: 19
Thank you in advance and happy new year.
On 2013/01/04 03:27:13, gyagp wrote:
> Thank you in advance and happy new year.

Thank you for the patch.

Is there a reason that this needs to be assembly, and cannot use the SSE intrinsics? Relatedly, does the current SSE2 implementation not work for you, or have other missing functionality? I'd rather fix it if possible than add a separate implementation.
On 2013/01/04 15:44:15, Stephen White wrote:
> On 2013/01/04 03:27:13, gyagp wrote:
> > Thank you in advance and happy new year.
>
> Thank you for the patch.
>
> Is there a reason that this needs to be assembly, and cannot use the SSE intrinsics? Relatedly, does the current SSE2 implementation not work for you, or have other missing functionality? I'd rather fix it if possible, than add a separate implementation.

The assembly code will provide 5% - 13% speedup for memset32 and memset16.
On 2013/01/05 01:31:45, gyagp wrote:
> On 2013/01/04 15:44:15, Stephen White wrote:
> > On 2013/01/04 03:27:13, gyagp wrote:
> > > Thank you in advance and happy new year.
> >
> > Thank you for the patch.
> >
> > Is there a reason that this needs to be assembly, and cannot use the SSE intrinsics? Relatedly, does the current SSE2 implementation not work for you, or have other missing functionality? I'd rather fix it if possible, than add a separate implementation.
>
> The assembly code will provide 5% - 13% speedup for memset32 and memset16.

That's great, but I'd rather not fork the implementations (one posix, one non-posix). Could the same speedup be achieved using intrinsics, and shared across platforms?
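For reference, the kind of shared intrinsics version being asked about might look like this — a minimal sketch under an invented name, not Skia's actual SkUtils_opts_SSE2.cpp, and compiled only on SSE2-capable x86 targets:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Sketch: fill 'count' 32-bit slots with 'value' using 16-byte SSE2 stores.
// Name and structure are illustrative only.
void memset32_sse2_sketch(uint32_t* dst, uint32_t value, int count) {
    // Scalar stores until dst reaches a 16-byte boundary.
    while (count > 0 && (reinterpret_cast<uintptr_t>(dst) & 15) != 0) {
        *dst++ = value;
        --count;
    }
    // Broadcast the value into all four lanes of an XMM register.
    const __m128i wide = _mm_set1_epi32(static_cast<int32_t>(value));
    while (count >= 4) {
        _mm_store_si128(reinterpret_cast<__m128i*>(dst), wide);
        dst += 4;
        count -= 4;
    }
    while (count-- > 0) {  // remaining tail
        *dst++ = value;
    }
}
```

Whether intrinsics can match hand-tuned assembly here is exactly the open question in this thread: with intrinsics, instruction selection and scheduling are left to the compiler.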
On 2013/01/05 01:31:45, gyagp wrote:
> The assembly code will provide 5% - 13% speedup for memset32 and memset16.

Using which tests and which parameters on which platforms?
On 2013/01/07 15:48:40, Stephen White wrote:
> On 2013/01/05 01:31:45, gyagp wrote:
> > On 2013/01/04 15:44:15, Stephen White wrote:
> > > On 2013/01/04 03:27:13, gyagp wrote:
> > > > Thank you in advance and happy new year.
> > >
> > > Thank you for the patch.
> > >
> > > Is there a reason that this needs to be assembly, and cannot use the SSE intrinsics? Relatedly, does the current SSE2 implementation not work for you, or have other missing functionality? I'd rather fix it if possible, than add a separate implementation.
> >
> > The assembly code will provide 5% - 13% speedup for memset32 and memset16.
>
> That's great, but I'd rather not fork the implementations (one posix, one non-posix). Could the same speedup be achieved using intrinsics, and shared across platforms?

I have tried to use intrinsics, but failed. The assembly code is not easy to convert to intrinsics.
I got the test results on an Ubuntu 12.02 x64 platform; the attachment is my test code.

The test binaries: memset32_tst and memset16_tst. Compile them yourself:

gcc -I ./ -O2 -o memset32_tst memset32_tst.cpp SkUtils_opts_SSE2.cpp memset32_x64_posix.S
gcc -I ./ -O2 -o memset16_tst memset16_tst.cpp SkUtils_opts_SSE2.cpp memset16_x64_posix.S

-----Original Message-----
From: tomhudson@google.com
Sent: Tuesday, January 08, 2013 12:01 AM
To: Gu, Yang; agl@chromium.org; reed@android.com; piman@chromium.org; senorblanco@chromium.org
Cc: skia-review@googlegroups.com; Chen, YupingX; reply@codereview-hr.appspotmail.com
Subject: Re: Add assembly versions of memset32 and memset16 for x64 posix systems (issue 7033051)

On 2013/01/05 01:31:45, gyagp wrote:
> The assembly code will provide 5% - 13% speedup for memset32 and memset16.

Using which tests and which parameters on which platforms?

https://codereview.appspot.com/7033051/
On 2013/01/08 08:46:32, yupingx.chen_intel.com wrote:
> I got the test results on an Ubuntu 12.02 x64 platform; the attachment is my test code.
> The test binaries: memset32_tst and memset16_tst. Compile them yourself:
> gcc -I ./ -O2 -o memset32_tst memset32_tst.cpp SkUtils_opts_SSE2.cpp memset32_x64_posix.S
> gcc -I ./ -O2 -o memset16_tst memset16_tst.cpp SkUtils_opts_SSE2.cpp memset16_x64_posix.S

Thanks for attaching that.

If we take code changes based on a microbenchmark, we like to make sure we have an equivalent microbenchmark as part of our suite, so we can evaluate it ourselves on different platforms and repeat as necessary.

Also, I note that your SIZE and TURN parameters are hardwired. Have you looked at all at the sensitivity of your results to these values? What about the repeated use of the same VALUE - does it change the results if you pass a different value to memset() on every call?

Since (as far as I know) we don't have x86 64b posix test platforms in house, I'm going to push really hard on test quality before accepting.
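The kind of sweep being asked for can be sketched as follows. This is a hypothetical harness (the names `timeMemset32` and `sweep` are mine, not from the attached test code); it varies both the element count and the fill value rather than hardwiring SIZE and VALUE:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

using Memset32Proc = void (*)(uint32_t*, uint32_t, size_t);

// Time 'proc' over 'turns' repetitions at a given element count, changing
// the fill value on every call to rule out value-dependent effects.
double timeMemset32(Memset32Proc proc, size_t count, int turns) {
    std::vector<uint32_t> buf(count);
    const auto start = std::chrono::steady_clock::now();
    for (int t = 0; t < turns; ++t) {
        proc(buf.data(), 0xA0000000u + static_cast<uint32_t>(t), count);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Sweep a range of counts instead of a single hardwired SIZE.
void sweep(Memset32Proc proc) {
    for (size_t count = 100; count <= 5000; count += 100) {
        std::printf("count=%zu: %.3f ms\n", count,
                    timeMemset32(proc, count, 10000));
    }
}
```

Running the same sweep against both implementations, per count bucket, gives exactly the kind of range-by-range comparison reported later in this thread.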
On 2013/01/08 09:57:26, TomH wrote:
> On 2013/01/08 08:46:32, yupingx.chen_intel.com wrote:
> > I got the test results on an Ubuntu 12.02 x64 platform; the attachment is my test code.
> > The test binaries: memset32_tst and memset16_tst. Compile them yourself:
> > gcc -I ./ -O2 -o memset32_tst memset32_tst.cpp SkUtils_opts_SSE2.cpp memset32_x64_posix.S
> > gcc -I ./ -O2 -o memset16_tst memset16_tst.cpp SkUtils_opts_SSE2.cpp memset16_x64_posix.S
>
> Thanks for attaching that.
>
> If we take code changes based on a microbenchmark, we like to make sure we have an equivalent microbenchmark as part of our suite, so we can evaluate it ourselves on different platforms and repeat as necessary.
>
> Also, I note that your SIZE and TURN parameters are hardwired. Have you looked at all at the sensitivity of your results to these values? What about the repeated use of the same VALUE - does it change the results if you pass a different value to memset() on every call?
>
> Since (as far as I know) we don't have x86 64b posix test platforms in house, I'm going to push really hard on test quality before accepting.

Thanks for your comments. I will add a bench for memset32 and memset16, and do more tests for them. The same VALUE does not affect the results.
Test results (the left column is the memset count range, the right column is the speedup rate):

-------- memset32 --------
   1-100:   11.76%    100-200:  20.87%    200-300:  19.57%    300-400:  16.87%    400-500:  10.45%
 500-600:    6.24%    600-700:   5.85%    700-800:   5.38%    800-900:   4.14%   900-1000:   2.54%
1000-1100:   3.08%   1100-1200:  3.05%   1200-1300:  2.54%   1300-1400:  2.18%   1400-1500:  1.96%
1500-1600:   1.72%   1600-1700:  1.58%   1700-1800:  1.57%   1800-1900:  2.43%   1900-2000:  1.29%
2000-2100:  -0.37%   2100-2200: -0.15%   2200-2300:  3.17%   2300-2400: -0.31%   2400-2500: -0.57%
2500-2600:  -0.36%   2600-2700: -0.51%   2700-2800: -0.49%   2800-2900: -0.05%   2900-3000: -0.27%
3000-3100:  -0.18%   3100-3200: -0.27%   3200-3300: -0.19%   3300-3400: -0.46%   3400-3500: -0.09%
3500-3600:   1.45%   3600-3700:  1.68%   3700-3800:  0.36%   3800-3900: -0.52%   3900-4000: -0.02%
4000-4100:  -1.53%   4100-4200: -1.35%   4200-4300: -0.66%   4300-4400: -0.12%   4400-4500: -0.71%
4500-4600:  -0.71%   4600-4700: -0.88%   4700-4800: -0.16%   4800-4900:  0.41%   4900-5000: -0.09%

-------- memset16 --------
   1-100:   44.58%    100-200:  33.38%    200-300:  33.06%    300-400:  24.96%    400-500:  25.91%
 500-600:   24.25%    600-700:  20.44%    700-800:  18.47%    800-900:  11.52%   900-1000:  11.99%
1000-1100:  19.08%   1100-1200: 15.81%   1200-1300: 15.60%   1300-1400: 14.63%   1400-1500: 13.60%
1500-1600:  13.61%   1600-1700: 10.79%   1700-1800: 10.55%   1800-1900:  9.01%   1900-2000:  8.70%
2000-2100:   8.43%   2100-2200:  7.85%   2200-2300:  7.82%   2300-2400:  8.04%   2400-2500:  8.15%
2500-2600:   7.60%   2600-2700:  7.12%   2700-2800:  6.59%   2800-2900:  6.47%   2900-3000:  6.21%
3000-3100:   6.75%   3100-3200:  6.78%   3200-3300:  6.56%   3300-3400:  6.67%   3400-3500:  6.25%
3500-3600:   6.34%   3600-3700:  6.36%   3700-3800:  6.18%   3800-3900:  5.79%   3900-4000:  3.24%
4000-4100:   2.87%   4100-4200:  2.65%   4200-4300:  3.00%   4300-4400:  2.79%   4400-4500:  2.74%
4500-4600:   2.87%   4600-4700:  2.38%   4700-4800:  2.88%   4800-4900:  2.15%   4900-5000:  2.36%
On 2013/01/14 01:52:04, gyagp wrote:
> Test results (the left column is the memset count range, the right column is the speedup rate)

Are these the speedups vs. the existing SSE2 implementation, or vs. the generic version?
On 2013/01/14 16:04:26, Stephen White wrote:
> On 2013/01/14 01:52:04, gyagp wrote:
> > Test results (the left column is the memset count range, the right column is the speedup rate)
>
> Are these the speedups vs. the existing SSE2 implementation, or vs. the generic version?

The speedups are vs. the existing SSE2 implementation.
Ping. Do we have a Skia team owner to push this through to acceptance?
I will install it and try it out.
So we can do comparisons, I have landed the bench already. Here are my numbers (on macpro):

before:
running bench [640 480] memset16_4000_5000  NONRENDERING: cmsecs = 177.73
running bench [640 480] memset16_3000_4000  NONRENDERING: cmsecs = 143.60
running bench [640 480] memset16_2000_3000  NONRENDERING: cmsecs = 108.94
running bench [640 480] memset16_1000_2000  NONRENDERING: cmsecs = 69.53
running bench [640 480] memset16_800_1000   NONRENDERING: cmsecs = 9.82
running bench [640 480] memset16_600_800    NONRENDERING: cmsecs = 8.46
running bench [640 480] memset16_1_600      NONRENDERING: cmsecs = 17.30
running bench [640 480] memset32_4000_5000  NONRENDERING: cmsecs = 162.52
running bench [640 480] memset32_3000_4000  NONRENDERING: cmsecs = 128.28
running bench [640 480] memset32_2000_3000  NONRENDERING: cmsecs = 94.10
running bench [640 480] memset32_1000_2000  NONRENDERING: cmsecs = 59.86
running bench [640 480] memset32_800_1000   NONRENDERING: cmsecs = 7.46
running bench [640 480] memset32_600_800    NONRENDERING: cmsecs = 6.10
running bench [640 480] memset32_1_600      NONRENDERING: cmsecs = 10.14

after:
running bench [640 480] memset16_4000_5000  NONRENDERING: cmsecs = 170.60
running bench [640 480] memset16_3000_4000  NONRENDERING: cmsecs = 137.69
running bench [640 480] memset16_2000_3000  NONRENDERING: cmsecs = 104.39
running bench [640 480] memset16_1000_2000  NONRENDERING: cmsecs = 66.58
running bench [640 480] memset16_800_1000   NONRENDERING: cmsecs = 9.42
running bench [640 480] memset16_600_800    NONRENDERING: cmsecs = 8.13
running bench [640 480] memset16_1_600      NONRENDERING: cmsecs = 16.58
running bench [640 480] memset32_4000_5000  NONRENDERING: cmsecs = 155.73
running bench [640 480] memset32_3000_4000  NONRENDERING: cmsecs = 123.04
running bench [640 480] memset32_2000_3000  NONRENDERING: cmsecs = 90.25
running bench [640 480] memset32_1000_2000  NONRENDERING: cmsecs = 57.38
running bench [640 480] memset32_800_1000   NONRENDERING: cmsecs = 7.16
running bench [640 480] memset32_600_800    NONRENDERING: cmsecs = 5.83
running bench [640 480] memset32_1_600      NONRENDERING: cmsecs = 9.68

Looks like ~5% speedup. Do we think that is worth the cost of maintaining the extra code? I'm not sure it is. Are there draws in chrome/android that are bottlenecked on memset?
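Reading the before/after pairs off that output, the improvement per case is (before - after) / before; e.g. memset16_4000_5000 goes from 177.73 to 170.60 cmsecs, about 4%. A one-liner makes the arithmetic explicit (illustrative helper, not bench code):

```cpp
#include <cassert>
#include <cmath>

// Percent speedup from a before/after pair of cmsecs readings.
double speedupPercent(double before, double after) {
    return (before - after) / before * 100.0;
}
```

Applied across the pairs above, the large-count cases cluster around 4%, which is where the "~5% speedup" reading comes from.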
> Looks like ~5% speedup. Do we think that is worth the cost of maintaining the
> extra code? I'm not sure it is. Are there draws in chrome/android that are
> bottlenecked on memset?
It's interesting how the OP reported very different results than what we're
measuring; maybe because of Stephen's conjecture?
Chrome/Android is currently spending most of its optimizing attention on ARM
hardware. I don't have any x86 devices handy; here's one of the renderer-process
profiles from two weeks ago off an ARM-based tablet:

24.84%  ChildProcessMai  [kernel]          [k] 0xc003bb44
 3.03%  ChildProcessMai  libdvm.so         [.] 0x29350
 2.49%  ChildProcessMai  libc.so           [.] dlmalloc_stats
 2.45%  ChildProcessMai  libc.so           [.] 0x1073e
 1.61%  ChildProcessMai  libchromeview.so  [.] arm_memset32
 1.38%  ChildProcessMai  libchromeview.so  [.] cc::TileManager::AssignBinsToTiles()
 1.36%  ChildProcessMai  libchromeview.so  [.] gfx::SizeBase<gfx::SizeF, float>::SizeBase(float, float)
 1.35%  ChildProcessMai  libchromeview.so  [.] map2_sd(double const (*) [4], double const*, int, double*)
 1.16%  ChildProcessMai  libchromeview.so  [.] gfx::QuadF::BoundingBox() const
 1.15%  ChildProcessMai  libchromeview.so  [.] void cc::CalculateDrawPropertiesInternal<cc::LayerImpl, std::vector<cc::LayerImpl*, std::alloc
 1.14%  ChildProcessMai  libchromeview.so  [.] 0x1e06d4
 0.91%  ChildProcessMai  libchromeview.so  [.] cc::BinComparator::operator()(cc::Tile const*, cc::Tile const*) const
 0.87%  ChildProcessMai  libchromeview.so  [.] base::debug::TraceLog::AddTraceEventWithThreadIdAndTimestamp(char, unsigned char const*, char
 0.87%  ChildProcessMai  [unknown]         [.] 0x3180a224
This is thrown off by all kernel calls being lumped together since I wasn't
running with kernel symbols, but you can see that the largest consumer of CPU
time that we wrote is Skia's arm_memset32.
I don't have confidence that this is particularly typical, though; if you can't
easily do it in NC, I can try to grab half-a-dozen profiles here to see if the
pattern holds up across multiple websites.
We've not touched this CL in 18 months; close as obsolete?
On 2014/12/22 23:18:12, TomH wrote:
> We've not touched this CL in 18 months; close as obsolete?

Yes, please.
Message was sent while issue was closed.
On 2014/12/23 00:19:44, gyagp wrote:
> On 2014/12/22 23:18:12, TomH wrote:
> > We've not touched this CL in 18 months; close as obsolete?
>
> Yes, please.

I just closed this review, as I found I have the right to do so.