Created: 12 years, 10 months ago by gyagp
Modified: 10 years, 10 months ago
CC: skia-review@googlegroups.com, yupingx.chen@intel.com
Base URL: https://skia.googlecode.com/svn/trunk
Visibility: Public
Description:
Add assembly versions of memset32 and memset16 that utilize 64-bit registers for x86 Posix systems.
Contributed by yupingx.chen@intel.com

Patch Set 1 #
Patch Set 2 : add memset bench #
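The core trick the patch relies on — packing two 32-bit values into one 64-bit register so each store fills two slots at once — can be illustrated in portable C++. This is a sketch of the technique only; the function name is invented here, and the actual patch implements the idea (plus alignment handling and wider stores) in hand-written x86-64 assembly:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Sketch: fill 'count' 32-bit slots with 'value', eight bytes per store.
// Name and structure are illustrative, not taken from the patch.
void memset32_wide_sketch(uint32_t* dst, uint32_t value, size_t count) {
    // Pack two copies of the 32-bit value into one 64-bit word.
    const uint64_t wide = (static_cast<uint64_t>(value) << 32) | value;
    const size_t pairs = count / 2;
    for (size_t i = 0; i < pairs; ++i) {
        // An 8-byte memcpy compiles down to a single 64-bit store.
        std::memcpy(dst + 2 * i, &wide, sizeof(wide));
    }
    if (count & 1) {  // trailing odd element
        dst[count - 1] = value;
    }
}
```

The same packing idea extends to memset16, where one 64-bit store covers four 16-bit values.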
Messages
Total messages: 19
Thank you in advance and happy new year.
On 2013/01/04 03:27:13, gyagp wrote:
> Thank you in advance and happy new year.

Thank you for the patch.

Is there a reason that this needs to be assembly, and cannot use the SSE intrinsics? Relatedly, does the current SSE2 implementation not work for you, or have other missing functionality? I'd rather fix it if possible than add a separate implementation.
On 2013/01/04 15:44:15, Stephen White wrote:
> On 2013/01/04 03:27:13, gyagp wrote:
> > Thank you in advance and happy new year.
>
> Thank you for the patch.
>
> Is there a reason that this needs to be assembly, and cannot use the SSE intrinsics? Relatedly, does the current SSE2 implementation not work for you, or have other missing functionality? I'd rather fix it if possible, than add a separate implementation.

The assembly code will provide 5% - 13% speedup for memset32 and memset16.
On 2013/01/05 01:31:45, gyagp wrote:
> On 2013/01/04 15:44:15, Stephen White wrote:
> > On 2013/01/04 03:27:13, gyagp wrote:
> > > Thank you in advance and happy new year.
> >
> > Thank you for the patch.
> >
> > Is there a reason that this needs to be assembly, and cannot use the SSE intrinsics? Relatedly, does the current SSE2 implementation not work for you, or have other missing functionality? I'd rather fix it if possible, than add a separate implementation.
>
> The assembly code will provide 5% - 13% speedup for memset32 and memset16.

That's great, but I'd rather not fork the implementations (one posix, one non-posix). Could the same speedup be achieved using intrinsics, and shared across platforms?
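For reference, the kind of shared intrinsics version being asked about might look like this — a minimal sketch under an invented name, not Skia's actual SkUtils_opts_SSE2.cpp, and compiled only on SSE2-capable x86 targets:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Sketch: fill 'count' 32-bit slots with 'value' using 16-byte SSE2 stores.
// Name and structure are illustrative only.
void memset32_sse2_sketch(uint32_t* dst, uint32_t value, int count) {
    // Scalar stores until dst reaches a 16-byte boundary.
    while (count > 0 && (reinterpret_cast<uintptr_t>(dst) & 15) != 0) {
        *dst++ = value;
        --count;
    }
    // Broadcast the value into all four lanes of an XMM register.
    const __m128i wide = _mm_set1_epi32(static_cast<int32_t>(value));
    while (count >= 4) {
        _mm_store_si128(reinterpret_cast<__m128i*>(dst), wide);
        dst += 4;
        count -= 4;
    }
    while (count-- > 0) {  // remaining tail
        *dst++ = value;
    }
}
```

Whether intrinsics can match hand-tuned assembly here is exactly the open question in this thread: with intrinsics, instruction selection and scheduling are left to the compiler.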
On 2013/01/05 01:31:45, gyagp wrote:
> The assembly code will provide 5% - 13% speedup for memset32 and memset16.

Using which tests and which parameters on which platforms?
On 2013/01/07 15:48:40, Stephen White wrote:
> On 2013/01/05 01:31:45, gyagp wrote:
> > On 2013/01/04 15:44:15, Stephen White wrote:
> > > On 2013/01/04 03:27:13, gyagp wrote:
> > > > Thank you in advance and happy new year.
> > >
> > > Thank you for the patch.
> > >
> > > Is there a reason that this needs to be assembly, and cannot use the SSE intrinsics? Relatedly, does the current SSE2 implementation not work for you, or have other missing functionality? I'd rather fix it if possible, than add a separate implementation.
> >
> > The assembly code will provide 5% - 13% speedup for memset32 and memset16.
>
> That's great, but I'd rather not fork the implementations (one posix, one non-posix). Could the same speedup be achieved using intrinsics, and shared across platforms?

I have tried to use intrinsics, but failed. The assembly code is not easy to convert to intrinsics.
I got the test results on an Ubuntu 12.02 x64 platform; the attachment is my test code.

The test binaries: memset32_tst and memset16_tst. Compile them yourself:

gcc -I ./ -O2 -o memset32_tst memset32_tst.cpp SkUtils_opts_SSE2.cpp memset32_x64_posix.S
gcc -I ./ -O2 -o memset16_tst memset16_tst.cpp SkUtils_opts_SSE2.cpp memset16_x64_posix.S

-----Original Message-----
From: tomhudson@google.com
Sent: Tuesday, January 08, 2013 12:01 AM
To: Gu, Yang; agl@chromium.org; reed@android.com; piman@chromium.org; senorblanco@chromium.org
Cc: skia-review@googlegroups.com; Chen, YupingX; reply@codereview-hr.appspotmail.com
Subject: Re: Add assembly versions of memset32 and memset16 for x64 posix systems (issue 7033051)

On 2013/01/05 01:31:45, gyagp wrote:
> The assembly code will provide 5% - 13% speedup for memset32 and memset16.

Using which tests and which parameters on which platforms?

https://codereview.appspot.com/7033051/
On 2013/01/08 08:46:32, yupingx.chen_intel.com wrote:
> I got the test results on an Ubuntu 12.02 x64 platform; the attachment is my test code.
> The test binaries: memset32_tst and memset16_tst. Compile them yourself:
> gcc -I ./ -O2 -o memset32_tst memset32_tst.cpp SkUtils_opts_SSE2.cpp memset32_x64_posix.S
> gcc -I ./ -O2 -o memset16_tst memset16_tst.cpp SkUtils_opts_SSE2.cpp memset16_x64_posix.S

Thanks for attaching that.

If we take code changes based on a microbenchmark, we like to make sure we have an equivalent microbenchmark as part of our suite, so we can evaluate it ourselves on different platforms and repeat as necessary.

Also, I note that your SIZE and TURN parameters are hardwired. Have you looked at all at the sensitivity of your results to these values? What about the repeated use of the same VALUE - does it change the results if you pass a different value to memset() on every call?

Since (as far as I know) we don't have x86 64b posix test platforms in house, I'm going to push really hard on test quality before accepting.
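The kind of sweep being asked for can be sketched as follows. This is a hypothetical harness (the names `timeMemset32` and `sweep` are mine, not from the attached test code); it varies both the element count and the fill value rather than hardwiring SIZE and VALUE:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

using Memset32Proc = void (*)(uint32_t*, uint32_t, size_t);

// Time 'proc' over 'turns' repetitions at a given element count, changing
// the fill value on every call to rule out value-dependent effects.
double timeMemset32(Memset32Proc proc, size_t count, int turns) {
    std::vector<uint32_t> buf(count);
    const auto start = std::chrono::steady_clock::now();
    for (int t = 0; t < turns; ++t) {
        proc(buf.data(), 0xA0000000u + static_cast<uint32_t>(t), count);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Sweep a range of counts instead of a single hardwired SIZE.
void sweep(Memset32Proc proc) {
    for (size_t count = 100; count <= 5000; count += 100) {
        std::printf("count=%zu: %.3f ms\n", count,
                    timeMemset32(proc, count, 10000));
    }
}
```

Running the same sweep against both implementations, per count bucket, gives exactly the kind of range-by-range comparison reported later in this thread.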
On 2013/01/08 09:57:26, TomH wrote:
> On 2013/01/08 08:46:32, yupingx.chen_intel.com wrote:
> > I got the test results on an Ubuntu 12.02 x64 platform; the attachment is my test code.
> > The test binaries: memset32_tst and memset16_tst. Compile them yourself:
> > gcc -I ./ -O2 -o memset32_tst memset32_tst.cpp SkUtils_opts_SSE2.cpp memset32_x64_posix.S
> > gcc -I ./ -O2 -o memset16_tst memset16_tst.cpp SkUtils_opts_SSE2.cpp memset16_x64_posix.S
>
> Thanks for attaching that.
>
> If we take code changes based on a microbenchmark, we like to make sure we have an equivalent microbenchmark as part of our suite, so we can evaluate it ourselves on different platforms and repeat as necessary.
>
> Also, I note that your SIZE and TURN parameters are hardwired. Have you looked at all at the sensitivity of your results to these values? What about the repeated use of the same VALUE - does it change the results if you pass a different value to memset() on every call?
>
> Since (as far as I know) we don't have x86 64b posix test platforms in house, I'm going to push really hard on test quality before accepting.

Thanks for your comments. I will add a bench for memset32 and memset16, and do more tests for them. The same VALUE does not affect the results.
Test results (the left column is the memset count range, the right column is the speedup rate):

-------- memset32 --------
   1-100:   11.76%    100-200:  20.87%    200-300:  19.57%    300-400:  16.87%    400-500:  10.45%
 500-600:    6.24%    600-700:   5.85%    700-800:   5.38%    800-900:   4.14%   900-1000:   2.54%
1000-1100:   3.08%   1100-1200:  3.05%   1200-1300:  2.54%   1300-1400:  2.18%   1400-1500:  1.96%
1500-1600:   1.72%   1600-1700:  1.58%   1700-1800:  1.57%   1800-1900:  2.43%   1900-2000:  1.29%
2000-2100:  -0.37%   2100-2200: -0.15%   2200-2300:  3.17%   2300-2400: -0.31%   2400-2500: -0.57%
2500-2600:  -0.36%   2600-2700: -0.51%   2700-2800: -0.49%   2800-2900: -0.05%   2900-3000: -0.27%
3000-3100:  -0.18%   3100-3200: -0.27%   3200-3300: -0.19%   3300-3400: -0.46%   3400-3500: -0.09%
3500-3600:   1.45%   3600-3700:  1.68%   3700-3800:  0.36%   3800-3900: -0.52%   3900-4000: -0.02%
4000-4100:  -1.53%   4100-4200: -1.35%   4200-4300: -0.66%   4300-4400: -0.12%   4400-4500: -0.71%
4500-4600:  -0.71%   4600-4700: -0.88%   4700-4800: -0.16%   4800-4900:  0.41%   4900-5000: -0.09%

-------- memset16 --------
   1-100:   44.58%    100-200:  33.38%    200-300:  33.06%    300-400:  24.96%    400-500:  25.91%
 500-600:   24.25%    600-700:  20.44%    700-800:  18.47%    800-900:  11.52%   900-1000:  11.99%
1000-1100:  19.08%   1100-1200: 15.81%   1200-1300: 15.60%   1300-1400: 14.63%   1400-1500: 13.60%
1500-1600:  13.61%   1600-1700: 10.79%   1700-1800: 10.55%   1800-1900:  9.01%   1900-2000:  8.70%
2000-2100:   8.43%   2100-2200:  7.85%   2200-2300:  7.82%   2300-2400:  8.04%   2400-2500:  8.15%
2500-2600:   7.60%   2600-2700:  7.12%   2700-2800:  6.59%   2800-2900:  6.47%   2900-3000:  6.21%
3000-3100:   6.75%   3100-3200:  6.78%   3200-3300:  6.56%   3300-3400:  6.67%   3400-3500:  6.25%
3500-3600:   6.34%   3600-3700:  6.36%   3700-3800:  6.18%   3800-3900:  5.79%   3900-4000:  3.24%
4000-4100:   2.87%   4100-4200:  2.65%   4200-4300:  3.00%   4300-4400:  2.79%   4400-4500:  2.74%
4500-4600:   2.87%   4600-4700:  2.38%   4700-4800:  2.88%   4800-4900:  2.15%   4900-5000:  2.36%
On 2013/01/14 01:52:04, gyagp wrote:
> Test results (the left column is the memset count range, the right column is the speedup rate)

Are these the speedups vs. the existing SSE2 implementation, or vs. the generic version?
On 2013/01/14 16:04:26, Stephen White wrote:
> On 2013/01/14 01:52:04, gyagp wrote:
> > Test results (the left column is the memset count range, the right column is the speedup rate)
>
> Are these the speedups vs. the existing SSE2 implementation, or vs. the generic version?

The speedups are vs. the existing SSE2 implementation.
Ping. Do we have a Skia team owner to push this through to acceptance?
I will install it and try it out.
So we can do comparisons, I have landed the bench already. Here are my numbers (on macpro):

before:
running bench [640 480] memset16_4000_5000  NONRENDERING: cmsecs = 177.73
running bench [640 480] memset16_3000_4000  NONRENDERING: cmsecs = 143.60
running bench [640 480] memset16_2000_3000  NONRENDERING: cmsecs = 108.94
running bench [640 480] memset16_1000_2000  NONRENDERING: cmsecs = 69.53
running bench [640 480] memset16_800_1000   NONRENDERING: cmsecs = 9.82
running bench [640 480] memset16_600_800    NONRENDERING: cmsecs = 8.46
running bench [640 480] memset16_1_600      NONRENDERING: cmsecs = 17.30
running bench [640 480] memset32_4000_5000  NONRENDERING: cmsecs = 162.52
running bench [640 480] memset32_3000_4000  NONRENDERING: cmsecs = 128.28
running bench [640 480] memset32_2000_3000  NONRENDERING: cmsecs = 94.10
running bench [640 480] memset32_1000_2000  NONRENDERING: cmsecs = 59.86
running bench [640 480] memset32_800_1000   NONRENDERING: cmsecs = 7.46
running bench [640 480] memset32_600_800    NONRENDERING: cmsecs = 6.10
running bench [640 480] memset32_1_600      NONRENDERING: cmsecs = 10.14

after:
running bench [640 480] memset16_4000_5000  NONRENDERING: cmsecs = 170.60
running bench [640 480] memset16_3000_4000  NONRENDERING: cmsecs = 137.69
running bench [640 480] memset16_2000_3000  NONRENDERING: cmsecs = 104.39
running bench [640 480] memset16_1000_2000  NONRENDERING: cmsecs = 66.58
running bench [640 480] memset16_800_1000   NONRENDERING: cmsecs = 9.42
running bench [640 480] memset16_600_800    NONRENDERING: cmsecs = 8.13
running bench [640 480] memset16_1_600      NONRENDERING: cmsecs = 16.58
running bench [640 480] memset32_4000_5000  NONRENDERING: cmsecs = 155.73
running bench [640 480] memset32_3000_4000  NONRENDERING: cmsecs = 123.04
running bench [640 480] memset32_2000_3000  NONRENDERING: cmsecs = 90.25
running bench [640 480] memset32_1000_2000  NONRENDERING: cmsecs = 57.38
running bench [640 480] memset32_800_1000   NONRENDERING: cmsecs = 7.16
running bench [640 480] memset32_600_800    NONRENDERING: cmsecs = 5.83
running bench [640 480] memset32_1_600      NONRENDERING: cmsecs = 9.68

Looks like ~5% speedup. Do we think that is worth the cost of maintaining the extra code? I'm not sure it is. Are there draws in chrome/android that are bottlenecked on memset?
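Reading the before/after pairs off that output, the improvement per case is (before - after) / before; e.g. memset16_4000_5000 goes from 177.73 to 170.60 cmsecs, about 4%. A one-liner makes the arithmetic explicit (illustrative helper, not bench code):

```cpp
#include <cassert>
#include <cmath>

// Percent speedup from a before/after pair of cmsecs readings.
double speedupPercent(double before, double after) {
    return (before - after) / before * 100.0;
}
```

Applied across the pairs above, the large-count cases cluster around 4%, which is where the "~5% speedup" reading comes from.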
> Looks like ~5% speedup. Do we think that is worth the cost of maintaining the
> extra code? I'm not sure it is. Are there draws in chrome/android that are
> bottlenecked on memset?
It's interesting how the OP reported very different results than what we're
measuring; maybe because of Stephen's conjecture?
Chrome/Android is currently spending most of its optimizing attention on ARM
hardware. I don't have any x86 devices handy; here's one of the renderer-process
profiles from two weeks ago off an ARM-based tablet:

24.84%  ChildProcessMai  [kernel]          [k] 0xc003bb44
 3.03%  ChildProcessMai  libdvm.so         [.] 0x29350
 2.49%  ChildProcessMai  libc.so           [.] dlmalloc_stats
 2.45%  ChildProcessMai  libc.so           [.] 0x1073e
 1.61%  ChildProcessMai  libchromeview.so  [.] arm_memset32
 1.38%  ChildProcessMai  libchromeview.so  [.] cc::TileManager::AssignBinsToTiles()
 1.36%  ChildProcessMai  libchromeview.so  [.] gfx::SizeBase<gfx::SizeF, float>::SizeBase(float, float)
 1.35%  ChildProcessMai  libchromeview.so  [.] map2_sd(double const (*) [4], double const*, int, double*)
 1.16%  ChildProcessMai  libchromeview.so  [.] gfx::QuadF::BoundingBox() const
 1.15%  ChildProcessMai  libchromeview.so  [.] void cc::CalculateDrawPropertiesInternal<cc::LayerImpl, std::vector<cc::LayerImpl*, std::alloc
 1.14%  ChildProcessMai  libchromeview.so  [.] 0x1e06d4
 0.91%  ChildProcessMai  libchromeview.so  [.] cc::BinComparator::operator()(cc::Tile const*, cc::Tile const*) const
 0.87%  ChildProcessMai  libchromeview.so  [.] base::debug::TraceLog::AddTraceEventWithThreadIdAndTimestamp(char, unsigned char const*, char
 0.87%  ChildProcessMai  [unknown]         [.] 0x3180a224
This is thrown off by all kernel calls being lumped together since I wasn't
running with kernel symbols, but you can see that the largest consumer of CPU
time that we wrote is Skia's arm_memset32.
I don't have confidence that this is particularly typical, though; if you can't
easily do it in NC, I can try to grab half-a-dozen profiles here to see if the
pattern holds up across multiple websites.
We've not touched this CL in 18 months; close as obsolete?
On 2014/12/22 23:18:12, TomH wrote:
> We've not touched this CL in 18 months; close as obsolete?

Yes, please.
Message was sent while issue was closed.
On 2014/12/23 00:19:44, gyagp wrote:
> On 2014/12/22 23:18:12, TomH wrote:
> > We've not touched this CL in 18 months; close as obsolete?
>
> Yes, please.

I just closed this review, as I found I have the right to do so.