Code review - Issue 7029044: runtime: less aggressive per-thread stack segment caching
https://codereview.appspot.com/7029044/
Message from dvyukov@google.com, 2013-01-03 15:29:31 UTC
Hello golang-dev@googlegroups.com,
I'd like you to review this change to
https://code.google.com/p/go/
Message from dvyukov@google.com, 2013-01-03 17:45:38 UTC
On 2013/01/03 15:29:31, dvyukov wrote:
> Hello golang-dev@googlegroups.com,
>
> I'd like you to review this change to
> https://code.google.com/p/go/
+some people in CC
Message from sougou@google.com, 2013-01-03 23:55:50 UTC
Reran vtocc benchmarks: around 10M queries using 100 clients.
Run 1: go version currently used in production, 0a3866d6cc6b (Sep 24):
qps: 5832 StackSys: 86MB
Run 2: go @tip d0d76b7fb219 (Jan 3):
qps: 5543 StackSys: 77MB
Run 3: Using CL 6997052:
qps: 5673 StackSys: 3MB
Run 4: Using CL 7029044:
qps: 5699 StackSys: 15MB
Conclusion: marginal difference in performance between the two CLs. The older CL uses less memory; maybe the difference would be more pronounced if you passed large objects by value to functions?
As an unrelated observation, the runtime @tip is slower than the one from September.
This is just a summary. I can send you more detailed stats and pprof captures if needed.
Message from dvyukov@google.com, 2013-01-04 06:04:59 UTC
On 2013/01/03 23:55:50, sougou wrote:
> [benchmark results snipped]
Can you please test with varying values for StackCacheSize/StackCacheBatch in src/pkg/runtime/runtime.h?
Currently they are set to 128/32. CL 6997052 uses 16/8. I am inclined towards 32/16 for now (my synthetic tests still show minimal memory consumption and good performance). Another possible point is 64/32.
Message from no.smile.face@gmail.com (nsf), 2013-01-04 07:19:15 UTC
On 2013/01/03 15:29:31, dvyukov wrote:
> Hello golang-dev@googlegroups.com,
> I'd like you to review this change to https://code.google.com/p/go/
For my ray tracer the behaviour is very similar to what I had with the previous patch.
Machine: amd64/linux on i5-3470 CPU.
Before (tip d0d76b7fb219):
Rendering took 2m 34.5s
Sys: 85928184
StackInuse: 102400
StackSys: 31981568
After (tip d0d76b7fb219 + issue7029044_6007.diff):
Rendering took 2m 35.2s
Sys: 55128064
StackInuse: 12885004288
StackSys: 2621440
This time the actual runtime.MemStats numbers instead of staring at the process' RES (resident memory size). Rendering time is similar (a difference of less than 1% is not statistically significant). Oh, and this time I was using 50 rays per pixel instead of 100, just to make the tests quicker. Also note the anomalously high StackInuse number in your patch. Is it a miscalculation?
Message from dvyukov@google.com, 2013-01-04 07:36:59 UTC
On 2013/01/04 07:19:15, nsf wrote:
> [ray tracer benchmark results snipped]
> Also note the anomalously high StackInuse number in your patch. Is it a
> miscalculation?
Thanks! This is actually a bug. Fixed:
https://codereview.appspot.com/7029044/diff2/6007:13007/src/pkg/runtime/runtime.h
https://codereview.appspot.com/7029044/diff2/6007:13007/src/pkg/runtime/stack_test.go
Message from dvyukov@google.com, 2013-01-04 07:39:15 UTC
On 2013/01/04 07:36:59, dvyukov wrote:
> > [ray tracer benchmark results snipped]
> > This time the actual runtime.MemStats numbers instead of staring at process'
> > RES (resident memory size).
Do you miss a part of the sentence?
> > Rendering time is similar (less than 1% difference is
> > not statistically significant).
The results look fine, right?
Message from dvyukov@google.com, 2013-01-04 07:48:43 UTC
On 2013/01/04 06:04:59, dvyukov wrote:
> Can you please test with varying values for StackCacheSize/StackCacheBatch in
> src/pkg/runtime/runtime.h? [rest of quote snipped]
Or perhaps it's already fine?
15MB vs 79-80MB is a good win already. More importantly, StackSys must not grow over time now; it's bounded by 512KB per thread (while currently it slowly grows to infinity).
Well, actually not that slowly. I've run the following funny test -- each line is the StackSys *increase*.
Current behavior:
$ go test -run=StackMem -v -cpu=1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 2>&1 | grep "for stack mem"
stack_test.go:1569: Consumed 106MB for stack mem
stack_test.go:1569: Consumed 48MB for stack mem
stack_test.go:1569: Consumed 52MB for stack mem
stack_test.go:1569: Consumed 71MB for stack mem
stack_test.go:1569: Consumed 71MB for stack mem
stack_test.go:1569: Consumed 53MB for stack mem
... (34 more runs, each consuming another 19-54MB of stack memory)
New behavior:
$ go test -run=StackMem -v -cpu=1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 2>&1 | grep "for stack mem"
stack_test.go:1569: Consumed 13MB for stack mem
stack_test.go:1569: Consumed 0MB for stack mem
stack_test.go:1569: Consumed 0MB for stack mem
stack_test.go:1569: Consumed 0MB for stack mem
stack_test.go:1569: Consumed 0MB for stack mem
stack_test.go:1569: Consumed 0MB for stack mem
stack_test.go:1569: Consumed 0MB for stack mem
stack_test.go:1569: Consumed 0MB for stack mem
stack_test.go:1569: Consumed 0MB for stack mem
stack_test.go:1569: Consumed 0MB for stack mem
...
Either somebody must come up with a good tuning methodology, or let's commit it as-is and tune later.
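The mechanism being benchmarked here (a bounded per-thread cache of fixed-size stack segments, refilled from and flushed to a central free list in batches) can be sketched as a standalone model. Everything below is illustrative: the names, the constants, and the malloc-backed allocation are assumptions for the sketch, not the runtime's actual code.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative constants, in the spirit of StackCacheSize/StackCacheBatch. */
enum { CACHE_SIZE = 32, CACHE_BATCH = 16, SEG_SIZE = 4096 };

/* A batch of free segments; the descriptor lives inside one of the
 * free segments themselves, so it costs no extra memory. */
typedef struct Node Node;
struct Node {
    Node *next;
    void *batch[CACHE_BATCH];
};

static Node *global_pool;           /* central free list (locking elided) */

typedef struct {
    void *cache[CACHE_SIZE];        /* per-thread ring of free segments */
    int pos;                        /* one past the newest entry */
    int cnt;                        /* number of cached segments */
} ThreadCache;

/* Move one batch from the global pool (or fresh memory) into tc. */
static void refill(ThreadCache *tc) {
    Node *n = global_pool;
    if (n != NULL) {
        global_pool = n->next;
    } else {
        /* No cached batch anywhere: carve CACHE_BATCH segments out of
         * new memory, storing the descriptor in the first segment. */
        char *mem = malloc(CACHE_BATCH * SEG_SIZE);
        n = (Node *)mem;
        for (int i = 0; i < CACHE_BATCH; i++)
            n->batch[i] = mem + i * SEG_SIZE;
    }
    for (int i = 0; i < CACHE_BATCH; i++) {
        tc->cache[tc->pos] = n->batch[i];
        tc->pos = (tc->pos + 1) % CACHE_SIZE;
    }
    tc->cnt += CACHE_BATCH;
}

/* Push the oldest CACHE_BATCH segments back to the global pool. */
static void release(ThreadCache *tc) {
    int pos = (tc->pos - tc->cnt + CACHE_SIZE) % CACHE_SIZE;
    Node *n = (Node *)tc->cache[pos];   /* reuse a free segment as the node */
    for (int i = 0; i < CACHE_BATCH; i++) {
        n->batch[i] = tc->cache[pos];
        pos = (pos + 1) % CACHE_SIZE;
    }
    n->next = global_pool;
    global_pool = n;
    tc->cnt -= CACHE_BATCH;
}

void *seg_alloc(ThreadCache *tc) {
    if (tc->cnt == 0)
        refill(tc);
    tc->pos = (tc->pos - 1 + CACHE_SIZE) % CACHE_SIZE;
    tc->cnt--;
    return tc->cache[tc->pos];
}

void seg_free(ThreadCache *tc, void *seg) {
    if (tc->cnt == CACHE_SIZE)      /* cache full: flush a batch first */
        release(tc);
    tc->cache[tc->pos] = seg;
    tc->pos = (tc->pos + 1) % CACHE_SIZE;
    tc->cnt++;
}
```

The cap in seg_free is the essence of the CL: a thread can never hold more than CACHE_SIZE free segments (32 x 4KB = 128KB in this sketch), so per-thread stack memory stays bounded instead of growing without limit, while alloc/free still mostly hit the lock-free per-thread ring.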
Message from msolomon@google.com, 2013-01-04 07:55:36 UTC
From our perspective, capping the growth is a big win and the
performance tradeoff is worth it. I'll let Sugu confirm that with a
production test. Once we can reason about good sizes, we can run a few
more tests, but I suspect it may be workload dependent. In the future
it might be worth considering an environment variable, but I tend to
dislike tunables.
Either way, thanks for putting so much time into this.
On Thu, Jan 3, 2013 at 11:48 PM, <dvyukov@google.com> wrote:
> [earlier messages quoted in full snipped]
> https://codereview.appspot.com/7029044/
Message from dvyukov@google.com, 2013-01-04 08:05:25 UTC
On Fri, Jan 4, 2013 at 11:55 AM, Mike Solomon <msolomon@google.com> wrote:
> From our perspective, capping the growth is a big win and the
> performance tradeoff is worth it. I'll let Sugu confirm that with a
> production test. Once we can reason about good sizes, we can run a few
> more tests, but I suspect it may be workload dependent. In the future
> it might be worth considering an environment variable, but I tend to
> dislike tunables.
When/if I finally submit the improved scheduler, it will allow for
per-processor (per-GOMAXPROCS) state, and stack caches along with other
stuff will move there. That will decrease stack memory further and, I
believe, eliminate any need for tunables (e.g. now you may have, say,
200 threads, each with its own cache; then you will have only 8 procs
with caches).
> Either way, thanks for putting so much time into this.
You are welcome.
> On Thu, Jan 3, 2013 at 11:48 PM, <dvyukov@google.com> wrote:
>> [earlier messages quoted in full snipped]
>> https://codereview.appspot.com/7029044/
Message from sougou@google.com, 2013-01-04 08:05:45 UTC
Sorry, I forgot to mention that StackSys was growing indefinitely in the
first two runs (without the CLs).
For run 3 (old CL), it started off at 3MB and stayed there.
For run 4 (new CL), it started at 14MB and inched up to 15MB.
So, both CLs are good enough. I would personally lean towards the old CL,
but the new CL should also be fine if it offers more balanced performance.
Message from dvyukov@google.com, 2013-01-04 08:11:35 UTC
On Fri, Jan 4, 2013 at 12:05 PM, Sugu Sougoumarane <sougou@google.com> wrote:
> Sorry, I forgot to mention that StackSys was indefinitely growing in the
> first two runs (without the CLs).
> For run 3 (old CL), it started off at 3MB and stayed there.
> For run 4 (new CL), it started at 14MB and inched up to 15.
>
> So, both CLs are good enough. I would personally lean towards the old CL,
> but the new CL should also be fine if it offers more balanced performance.
I'm just afraid that I could badly penalize some other workloads that I
don't know about.
You may try the new CL with e.g. StackCacheSize=32, StackCacheBatch=16, or
with 64/32. On the synthetic tests 32/16 is still good enough, so if it
reduces StackSys for you, I am happy to change it to 32/16.
Message from no.smile.face@gmail.com (nsf), 2013-01-04 12:59:28 UTC
On 2013/01/04 07:39:15, dvyukov wrote:
> Do you miss a part of the sentence?
Not really.
> The results look fine, right?
Yes, everything is fine. :)
Message from sougou@google.com, 2013-01-04 20:15:38 UTC
5M rows using 100 connections:
128/32:
qps: 5885 StackSys: 14.7MB
64/32:
qps: 5876 StackSys: 8.4MB
32/16:
qps: 5850 StackSys: 4.9MB
16/8:
qps: 5892 StackSys: 3.15MB
All other stats were comparable.
Conclusion: no significant change in performance for different values of StackCacheSize/StackCacheBatch, only in the StackSys footprint, and there are diminishing returns below 32/16.
Message from dvyukov@google.com, 2013-01-04 20:36:32 UTC
OK, I will replace the constants with 32/16.
And let's wait for Russ' blessing.
On Sat, Jan 5, 2013 at 12:15 AM, <sougou@google.com> wrote:
> [benchmark results snipped]
> https://codereview.appspot.com/7029044/
Message from dvyukov@google.com, 2013-01-05 10:25:55 UTC
Updated the constants to 32/16.
Message from rsc@golang.org, 2013-01-07 04:18:57 UTC
https://codereview.appspot.com/7029044/diff/22001/src/pkg/runtime/malloc.goc
File src/pkg/runtime/malloc.goc (right):
https://codereview.appspot.com/7029044/diff/22001/src/pkg/runtime/malloc.goc#newcode754
src/pkg/runtime/malloc.goc:754: StackCacheNode *next;
The stack itself is way bigger than this struct. Listing all StackCacheBatch elements in batch will make the code significantly simpler, and a comment would help too:
// A StackCacheNode is a group of StackCacheBatch fixed-size stacks that
// can be transferred between a goroutine and a central cache (stackcache).
// The StackCacheNode contents are stored in one of the stacks, so they can
// only be used when the stacks are free.
typedef struct StackCacheNode StackCacheNode;
struct StackCacheNode
{
StackCacheNode *next;
void *batch[StackCacheBatch];
};
refill() {
...
if(n == nil) {
...
for(i = 0; i < StackCacheBatch; i++)
n->batch[i] = (byte*)n + i*FixedStack;
}
pos = m->stackcachepos;
for(i = 0; i < StackCacheBatch; i++) {
m->stackcache[pos++] = n->batch[i];
pos %= StackCacheSize;
}
...
}
release() {
...
n = (StackCacheNode*)m->stackcache[pos];
for(i = 0; i < StackCacheBatch; i++) {
n->batch[i] = m->stackcache[pos++];
pos %= StackCacheSize;
}
https://codereview.appspot.com/7029044/diff/22001/src/pkg/runtime/malloc.goc#newcode758
src/pkg/runtime/malloc.goc:758: static StackCacheNode *stackcache;
static struct {
Lock;
StackCacheNode *top;
} stackcache;
And then
runtime·lock(&stackcache);
n = stackcache.top;
etc.
https://codereview.appspot.com/7029044/diff/22001/src/pkg/runtime/malloc.goc#newcode793
src/pkg/runtime/malloc.goc:793: static void
// Release oldest StackCacheBatch stack segments to central free list.
https://codereview.appspot.com/7029044/diff/22001/src/pkg/runtime/malloc.goc#newcode799
src/pkg/runtime/malloc.goc:799: pos = (m->stackcachepos - m->stackcachecnt) % StackCacheSize;
This only works because (a) the left hand side of the % is an unsigned type, and (b) StackCacheSize is a power of two. Please use
pos = (m->stackcachepos - m->stackcachecnt + StackCacheSize) % StackCacheSize
which does not rely on either assumption.
https://codereview.appspot.com/7029044/diff/22001/src/pkg/runtime/malloc.goc#newcode840
src/pkg/runtime/malloc.goc:840: pos = (pos - 1) % StackCacheSize;
pos = (pos - 1 + StackCacheSize) % StackCacheSize;
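The two hidden assumptions rsc points out can be checked with a few concrete values; the helper names and ring sizes below are made up for illustration:

```c
#include <assert.h>

/* Naive ring-buffer step-back, as in the code under review: correct
 * only when the operands are unsigned AND the size is a power of two. */
unsigned naive_back(unsigned pos, unsigned cnt, unsigned size)
{
    return (pos - cnt) % size;
}

/* The form rsc asks for: biasing by size before taking the remainder
 * is correct for any size, signed or unsigned. */
int safe_back(int pos, int cnt, int size)
{
    return (pos - cnt + size) % size;
}

void demo(void)
{
    /* Stepping 5 slots back from position 2 in a 16-slot ring: both
     * forms agree, because 16 is a power of two and the unsigned
     * subtraction wraps mod 2^32. */
    assert(naive_back(2, 5, 16) == 13);
    assert(safe_back(2, 5, 16) == 13);

    /* In a 10-slot ring the naive form silently picks the wrong slot:
     * (unsigned)(2 - 5) = 2^32 - 3, and (2^32 - 3) % 10 = 3, not 7. */
    assert(naive_back(2, 5, 10) == 3);
    assert(safe_back(2, 5, 10) == 7);

    /* And with signed operands, C99's % keeps the dividend's sign, so
     * the unbiased expression is not even a valid index. */
    assert((2 - 5) % 10 == -3);
}
```

Since StackCacheSize really is a power of two and the fields are unsigned, the original expression happened to work; the biased form just stops the correctness from depending on those two facts.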
Message from dvyukov@google.com, 2013-01-08 12:33:47 UTC
PTAL
https://codereview.appspot.com/7029044/diff/22001/src/pkg/runtime/malloc.goc
File src/pkg/runtime/malloc.goc (right):
https://codereview.appspot.com/7029044/diff/22001/src/pkg/runtime/malloc.goc#newcode754
src/pkg/runtime/malloc.goc:754: StackCacheNode *next;
On 2013/01/07 04:18:57, rsc wrote:
> [StackCacheNode/batch restructuring suggestion snipped]
Done.
https://codereview.appspot.com/7029044/diff/22001/src/pkg/runtime/malloc.goc#newcode758
src/pkg/runtime/malloc.goc:758: static StackCacheNode *stackcache;
On 2013/01/07 04:18:57, rsc wrote:
> [locked stackcache struct suggestion snipped]
Done.
https://codereview.appspot.com/7029044/diff/22001/src/pkg/runtime/malloc.goc#newcode793
src/pkg/runtime/malloc.goc:793: static void
On 2013/01/07 04:18:57, rsc wrote:
> // Release oldest StackCacheBatch stack segments to central free list.
Done.
https://codereview.appspot.com/7029044/diff/22001/src/pkg/runtime/malloc.goc#newcode799
src/pkg/runtime/malloc.goc:799: pos = (m->stackcachepos - m->stackcachecnt) % StackCacheSize;
On 2013/01/07 04:18:57, rsc wrote:
> pos = (m->stackcachepos - m->stackcachecnt + StackCacheSize) % StackCacheSize
Done.
https://codereview.appspot.com/7029044/diff/22001/src/pkg/runtime/malloc.goc#newcode840
src/pkg/runtime/malloc.goc:840: pos = (pos - 1) % StackCacheSize;
On 2013/01/07 04:18:57, rsc wrote:
> pos = (pos - 1 + StackCacheSize) % StackCacheSize;
Done.
Message from rsc@golang.org, 2013-01-09 19:08:26 UTC
LGTM
Message from dvyukov@google.com, 2013-01-10 05:59:56 UTC
*** Submitted as https://code.google.com/p/go/source/detail?r=88d31369e105 ***
runtime: less aggressive per-thread stack segment caching
Introduce global stack segment cache and limit per-thread cache size.
This greatly reduces StackSys memory on workloads that create lots of threads.
benchmark old ns/op new ns/op delta
BenchmarkStackGrowth 665 656 -1.35%
BenchmarkStackGrowth-2 333 328 -1.50%
BenchmarkStackGrowth-4 224 172 -23.21%
BenchmarkStackGrowth-8 124 91 -26.13%
BenchmarkStackGrowth-16 82 47 -41.94%
BenchmarkStackGrowth-32 73 40 -44.79%
BenchmarkStackGrowthDeep 97231 94391 -2.92%
BenchmarkStackGrowthDeep-2 47230 58562 +23.99%
BenchmarkStackGrowthDeep-4 24993 49356 +97.48%
BenchmarkStackGrowthDeep-8 15105 30072 +99.09%
BenchmarkStackGrowthDeep-16 10005 15623 +56.15%
BenchmarkStackGrowthDeep-32 12517 13069 +4.41%
TestStackMem#1,MB 310 12 -96.13%
TestStackMem#2,MB 296 14 -95.27%
TestStackMem#3,MB 479 14 -97.08%
TestStackMem#1,sec 3.22 2.26 -29.81%
TestStackMem#2,sec 2.43 2.15 -11.52%
TestStackMem#3,sec 2.50 2.38 -4.80%
R=sougou, no.smile.face, rsc
CC=golang-dev, msolomon
https://codereview.appspot.com/7029044
Message from fullung@gmail.com (albert.strasheim), 2013-01-24 11:02:33 UTC
http://code.google.com/p/go/issues/detail?id=4698