Reran vtocc benchmars, around 10M queries using 100 clients. Run 1: go version currently used ...
12 years, 9 months ago
(2013-01-03 23:55:50 UTC)
#3
Reran vtocc benchmars, around 10M queries using 100 clients.
Run 1: go version currently used on production 0a3866d6cc6b (Sep 24):
qps: 5832 StackSys: 86MB
Run 2: go @tip d0d76b7fb219 (Jan 3):
qps: 5543 StackSys: 77MB
Run 3: Using CL 6997052:
qps: 5673 StackSys: 3MB
Run 4: Using CL 7029044:
qps: 5699 StackSys: 15MB
Conclusion: Marginal difference in performance between the two CLs. The older CL
uses less memory. Maybe it will be more pronounced if you passed large objects
by value to functions?
The runtime @tip is slower than the one from September, an unrelated
observation.
This is just a summary. I can send you more detailed stats and pprof captures if
needed.
On 2013/01/03 23:55:50, sougou wrote: > Reran vtocc benchmars, around 10M queries using 100 clients. ...
12 years, 9 months ago
(2013-01-04 06:04:59 UTC)
#4
On 2013/01/03 23:55:50, sougou wrote:
> Reran vtocc benchmars, around 10M queries using 100 clients.
>
> Run 1: go version currently used on production 0a3866d6cc6b (Sep 24):
> qps: 5832 StackSys: 86MB
>
> Run 2: go @tip d0d76b7fb219 (Jan 3):
> qps: 5543 StackSys: 77MB
>
> Run 3: Using CL 6997052:
> qps: 5673 StackSys: 3MB
>
> Run 4: Using CL 7029044:
> qps: 5699 StackSys: 15MB
>
> Conclusion: Marginal difference in performance between the two CLs. The older
CL
> uses less memory. Maybe it will be more pronounced if you passed large objects
> by value to functions?
> The runtime @tip is slower than the one from September, an unrelated
> observation.
>
> This is just a summary. I can send you more detailed stats and pprof captures
if
> needed.
Can you please test with varying values for StackCacheSize/StackCacheBatch in
src/pkg/runtime/runtime.h?
Currently they are set to 128/32. The CL/6997052 is using 16/8. I am inclined
towards 32/16 for now (my synthetic tests show still minimal memory consumption
and good performance). Another possible point is 64/32.
On 2013/01/03 15:29:31, dvyukov wrote: > Hello mailto:golang-dev@googlegroups.com, > > I'd like you to review ...
12 years, 9 months ago
(2013-01-04 07:19:15 UTC)
#5
On 2013/01/03 15:29:31, dvyukov wrote:
> Hello mailto:golang-dev@googlegroups.com,
>
> I'd like you to review this change to
> https://dvyukov%2540google.com%40code.google.com/p/go/
For my ray tracer stuff the behaviour is very similar to what I had with
previous patch.
Machine: amd64/linux on i5-3470 CPU.
Before (tip d0d76b7fb219):
Rendering took 2m 34.5s
Sys: 85928184
StackInuse: 102400
StackSys: 31981568
After (tip d0d76b7fb219 + issue7029044_6007.diff):
Rendering took 2m 35.2s
Sys: 55128064
StackInuse: 12885004288
StackSys: 2621440
This time the actual runtime.MemStats numbers instead of staring at process' RES
(resident memory size). Rendering time is similar (less than 1% difference is
not statistically significant). Oh and this time I was using 50 rays per pixel
instead of 100, just to make tests quicker. Also note the anomally high
StackInuse number in your patch. Is it a miscalculation?
On 2013/01/04 07:19:15, nsf wrote: > On 2013/01/03 15:29:31, dvyukov wrote: > > Hello mailto:golang-dev@googlegroups.com, ...
12 years, 9 months ago
(2013-01-04 07:36:59 UTC)
#6
On 2013/01/04 07:19:15, nsf wrote:
> On 2013/01/03 15:29:31, dvyukov wrote:
> > Hello mailto:golang-dev@googlegroups.com,
> >
> > I'd like you to review this change to
> > https://dvyukov%252540google.com%2540code.google.com/p/go/
>
> For my ray tracer stuff the behaviour is very similar to what I had with
> previous patch.
>
> Machine: amd64/linux on i5-3470 CPU.
>
> Before (tip d0d76b7fb219):
> Rendering took 2m 34.5s
> Sys: 85928184
> StackInuse: 102400
> StackSys: 31981568
> After (tip d0d76b7fb219 + issue7029044_6007.diff):
> Rendering took 2m 35.2s
> Sys: 55128064
> StackInuse: 12885004288
> StackSys: 2621440
>
> This time the actual runtime.MemStats numbers instead of staring at process'
RES
> (resident memory size). Rendering time is similar (less than 1% difference is
> not statistically significant). Oh and this time I was using 50 rays per pixel
> instead of 100, just to make tests quicker. Also note the anomally high
> StackInuse number in your patch. Is it a miscalculation?
Thanks! This is actually a bug. Fixed:
https://codereview.appspot.com/7029044/diff2/6007:13007/src/pkg/runtime/runti...https://codereview.appspot.com/7029044/diff2/6007:13007/src/pkg/runtime/stack...
On 2013/01/04 07:36:59, dvyukov wrote: > On 2013/01/04 07:19:15, nsf wrote: > > On 2013/01/03 ...
12 years, 9 months ago
(2013-01-04 07:39:15 UTC)
#7
On 2013/01/04 07:36:59, dvyukov wrote:
> On 2013/01/04 07:19:15, nsf wrote:
> > On 2013/01/03 15:29:31, dvyukov wrote:
> > > Hello mailto:golang-dev@googlegroups.com,
> > >
> > > I'd like you to review this change to
> > > https://dvyukov%25252540google.com%252540code.google.com/p/go/
> >
> > For my ray tracer stuff the behaviour is very similar to what I had with
> > previous patch.
> >
> > Machine: amd64/linux on i5-3470 CPU.
> >
> > Before (tip d0d76b7fb219):
> > Rendering took 2m 34.5s
> > Sys: 85928184
> > StackInuse: 102400
> > StackSys: 31981568
> > After (tip d0d76b7fb219 + issue7029044_6007.diff):
> > Rendering took 2m 35.2s
> > Sys: 55128064
> > StackInuse: 12885004288
> > StackSys: 2621440
> >
> > This time the actual runtime.MemStats numbers instead of staring at process'
> RES
> > (resident memory size).
Do you miss a part of the sentence?
> > Rendering time is similar (less than 1% difference is
> > not statistically significant).
The results look fine, right?
> > Oh and this time I was using 50 rays per pixel
> > instead of 100, just to make tests quicker. Also note the anomally high
> > StackInuse number in your patch. Is it a miscalculation?
>
> Thanks! This is actually a bug. Fixed:
>
https://codereview.appspot.com/7029044/diff2/6007:13007/src/pkg/runtime/runti...
>
https://codereview.appspot.com/7029044/diff2/6007:13007/src/pkg/runtime/stack...
On 2013/01/04 06:04:59, dvyukov wrote: > On 2013/01/03 23:55:50, sougou wrote: > > Reran vtocc ...
12 years, 9 months ago
(2013-01-04 07:48:43 UTC)
#8
On 2013/01/04 06:04:59, dvyukov wrote:
> On 2013/01/03 23:55:50, sougou wrote:
> > Reran vtocc benchmars, around 10M queries using 100 clients.
> >
> > Run 1: go version currently used on production 0a3866d6cc6b (Sep 24):
> > qps: 5832 StackSys: 86MB
> >
> > Run 2: go @tip d0d76b7fb219 (Jan 3):
> > qps: 5543 StackSys: 77MB
> >
> > Run 3: Using CL 6997052:
> > qps: 5673 StackSys: 3MB
> >
> > Run 4: Using CL 7029044:
> > qps: 5699 StackSys: 15MB
> >
> > Conclusion: Marginal difference in performance between the two CLs. The
older
> CL
> > uses less memory. Maybe it will be more pronounced if you passed large
objects
> > by value to functions?
> > The runtime @tip is slower than the one from September, an unrelated
> > observation.
> >
> > This is just a summary. I can send you more detailed stats and pprof
captures
> if
> > needed.
>
> Can you please test with varying values for StackCacheSize/StackCacheBatch in
> src/pkg/runtime/runtime.h?
> Currently they are set to 128/32. The CL/6997052 is using 16/8. I am inclined
> towards 32/16 for now (my synthetic tests show still minimal memory
consumption
> and good performance). Another possible point is 64/32.
Or perhaps it's already fine?
15 vs 79-80MB is a good win already. More importantly StackSys must not grow
over time now, it's bounded by 512kb per thread (while currently it slowly grows
to infinity).
Well, actually not that slowly. I've run the following funny test -- each line
is StackSys *increase*.
Current behavior:
$ go test -run=StackMem -v
-cpu=1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
2>&1 | grep "for stack mem"
stack_test.go:1569: Consumed 106MB for stack mem
stack_test.go:1569: Consumed 48MB for stack mem
stack_test.go:1569: Consumed 52MB for stack mem
stack_test.go:1569: Consumed 71MB for stack mem
stack_test.go:1569: Consumed 71MB for stack mem
stack_test.go:1569: Consumed 53MB for stack mem
stack_test.go:1569: Consumed 35MB for stack mem
stack_test.go:1569: Consumed 27MB for stack mem
stack_test.go:1569: Consumed 39MB for stack mem
stack_test.go:1569: Consumed 43MB for stack mem
stack_test.go:1569: Consumed 49MB for stack mem
stack_test.go:1569: Consumed 54MB for stack mem
stack_test.go:1569: Consumed 44MB for stack mem
stack_test.go:1569: Consumed 35MB for stack mem
stack_test.go:1569: Consumed 41MB for stack mem
stack_test.go:1569: Consumed 32MB for stack mem
stack_test.go:1569: Consumed 27MB for stack mem
stack_test.go:1569: Consumed 20MB for stack mem
stack_test.go:1569: Consumed 36MB for stack mem
stack_test.go:1569: Consumed 33MB for stack mem
stack_test.go:1569: Consumed 31MB for stack mem
stack_test.go:1569: Consumed 45MB for stack mem
stack_test.go:1569: Consumed 40MB for stack mem
stack_test.go:1569: Consumed 30MB for stack mem
stack_test.go:1569: Consumed 39MB for stack mem
stack_test.go:1569: Consumed 27MB for stack mem
stack_test.go:1569: Consumed 27MB for stack mem
stack_test.go:1569: Consumed 37MB for stack mem
stack_test.go:1569: Consumed 33MB for stack mem
stack_test.go:1569: Consumed 36MB for stack mem
stack_test.go:1569: Consumed 34MB for stack mem
stack_test.go:1569: Consumed 42MB for stack mem
stack_test.go:1569: Consumed 29MB for stack mem
stack_test.go:1569: Consumed 29MB for stack mem
stack_test.go:1569: Consumed 44MB for stack mem
stack_test.go:1569: Consumed 20MB for stack mem
stack_test.go:1569: Consumed 31MB for stack mem
stack_test.go:1569: Consumed 31MB for stack mem
stack_test.go:1569: Consumed 19MB for stack mem
stack_test.go:1569: Consumed 25MB for stack mem
New behavior:
$ go test -run=StackMem -v
-cpu=1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
2>&1 | grep "for stack mem"
stack_test.go:1569: Consumed 13MB for stack mem
stack_test.go:1569: Consumed 0MB for stack mem
stack_test.go:1569: Consumed 0MB for stack mem
stack_test.go:1569: Consumed 0MB for stack mem
stack_test.go:1569: Consumed 0MB for stack mem
stack_test.go:1569: Consumed 0MB for stack mem
stack_test.go:1569: Consumed 0MB for stack mem
stack_test.go:1569: Consumed 0MB for stack mem
stack_test.go:1569: Consumed 0MB for stack mem
stack_test.go:1569: Consumed 0MB for stack mem
...
Either somebody must come up with a good tuning methodology, or let's commit it
as-is and tune later.
From our perspective, capping the growth is a big win and the performance tradeoff is ...
12 years, 9 months ago
(2013-01-04 07:55:36 UTC)
#9
From our perspective, capping the growth is a big win and the
performance tradeoff is worth it. I'll let Sugu confirm that with a
production test. Once we can reason about good sizes, we can run a few
more tests, but I suspect it may be workload dependent. In the future
it might be worth considering an environment variable, but I tend to
dislike tunables.
Either way, thanks for putting so much time into this.
On Thu, Jan 3, 2013 at 11:48 PM, <dvyukov@google.com> wrote:
> On 2013/01/04 06:04:59, dvyukov wrote:
>>
>> On 2013/01/03 23:55:50, sougou wrote:
>> > Reran vtocc benchmars, around 10M queries using 100 clients.
>> >
>> > Run 1: go version currently used on production 0a3866d6cc6b (Sep
>
> 24):
>>
>> > qps: 5832 StackSys: 86MB
>> >
>> > Run 2: go @tip d0d76b7fb219 (Jan 3):
>> > qps: 5543 StackSys: 77MB
>> >
>> > Run 3: Using CL 6997052:
>> > qps: 5673 StackSys: 3MB
>> >
>> > Run 4: Using CL 7029044:
>> > qps: 5699 StackSys: 15MB
>> >
>> > Conclusion: Marginal difference in performance between the two CLs.
>
> The older
>>
>> CL
>> > uses less memory. Maybe it will be more pronounced if you passed
>
> large objects
>>
>> > by value to functions?
>> > The runtime @tip is slower than the one from September, an unrelated
>> > observation.
>> >
>> > This is just a summary. I can send you more detailed stats and pprof
>
> captures
>>
>> if
>> > needed.
>
>
>> Can you please test with varying values for
>
> StackCacheSize/StackCacheBatch in
>>
>> src/pkg/runtime/runtime.h?
>> Currently they are set to 128/32. The CL/6997052 is using 16/8. I am
>
> inclined
>>
>> towards 32/16 for now (my synthetic tests show still minimal memory
>
> consumption
>>
>> and good performance). Another possible point is 64/32.
>
>
> Or perhaps it's already fine?
> 15 vs 79-80MB is a good win already. More importantly StackSys must not
> grow over time now, it's bounded by 512kb per thread (while currently it
> slowly grows to infinity).
>
> Well, actually not that slowly. I've run the following funny test --
> each line is StackSys *increase*.
>
> Current behavior:
> $ go test -run=StackMem -v
>
-cpu=1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
> 2>&1 | grep "for stack mem"
> stack_test.go:1569: Consumed 106MB for stack mem
> stack_test.go:1569: Consumed 48MB for stack mem
> stack_test.go:1569: Consumed 52MB for stack mem
> stack_test.go:1569: Consumed 71MB for stack mem
> stack_test.go:1569: Consumed 71MB for stack mem
> stack_test.go:1569: Consumed 53MB for stack mem
> stack_test.go:1569: Consumed 35MB for stack mem
> stack_test.go:1569: Consumed 27MB for stack mem
> stack_test.go:1569: Consumed 39MB for stack mem
> stack_test.go:1569: Consumed 43MB for stack mem
> stack_test.go:1569: Consumed 49MB for stack mem
> stack_test.go:1569: Consumed 54MB for stack mem
> stack_test.go:1569: Consumed 44MB for stack mem
> stack_test.go:1569: Consumed 35MB for stack mem
> stack_test.go:1569: Consumed 41MB for stack mem
> stack_test.go:1569: Consumed 32MB for stack mem
> stack_test.go:1569: Consumed 27MB for stack mem
> stack_test.go:1569: Consumed 20MB for stack mem
> stack_test.go:1569: Consumed 36MB for stack mem
> stack_test.go:1569: Consumed 33MB for stack mem
> stack_test.go:1569: Consumed 31MB for stack mem
> stack_test.go:1569: Consumed 45MB for stack mem
> stack_test.go:1569: Consumed 40MB for stack mem
> stack_test.go:1569: Consumed 30MB for stack mem
> stack_test.go:1569: Consumed 39MB for stack mem
> stack_test.go:1569: Consumed 27MB for stack mem
> stack_test.go:1569: Consumed 27MB for stack mem
> stack_test.go:1569: Consumed 37MB for stack mem
> stack_test.go:1569: Consumed 33MB for stack mem
> stack_test.go:1569: Consumed 36MB for stack mem
> stack_test.go:1569: Consumed 34MB for stack mem
> stack_test.go:1569: Consumed 42MB for stack mem
> stack_test.go:1569: Consumed 29MB for stack mem
> stack_test.go:1569: Consumed 29MB for stack mem
> stack_test.go:1569: Consumed 44MB for stack mem
> stack_test.go:1569: Consumed 20MB for stack mem
> stack_test.go:1569: Consumed 31MB for stack mem
> stack_test.go:1569: Consumed 31MB for stack mem
> stack_test.go:1569: Consumed 19MB for stack mem
> stack_test.go:1569: Consumed 25MB for stack mem
>
>
> New behavior:
> $ go test -run=StackMem -v
>
-cpu=1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
> 2>&1 | grep "for stack mem"
> stack_test.go:1569: Consumed 13MB for stack mem
> stack_test.go:1569: Consumed 0MB for stack mem
> stack_test.go:1569: Consumed 0MB for stack mem
> stack_test.go:1569: Consumed 0MB for stack mem
> stack_test.go:1569: Consumed 0MB for stack mem
> stack_test.go:1569: Consumed 0MB for stack mem
> stack_test.go:1569: Consumed 0MB for stack mem
> stack_test.go:1569: Consumed 0MB for stack mem
> stack_test.go:1569: Consumed 0MB for stack mem
> stack_test.go:1569: Consumed 0MB for stack mem
> ...
>
>
>
> Either somebody must come up with a good tuning methodology, or let's
> commit it as-is and tune later.
>
> https://codereview.appspot.com/7029044/
On Fri, Jan 4, 2013 at 11:55 AM, Mike Solomon <msolomon@google.com> wrote: > From our ...
12 years, 9 months ago
(2013-01-04 08:05:25 UTC)
#10
On Fri, Jan 4, 2013 at 11:55 AM, Mike Solomon <msolomon@google.com> wrote:
> From our perspective, capping the growth is a big win and the
> performance tradeoff is worth it. I'll let Sugu confirm that with a
> production test. Once we can reason about good sizes, we can run a few
> more tests, but I suspect it may be workload dependent. In the future
> it might be worth considering an environment variable, but I tend to
> dislike tunables.
When/if I finally submit the improved scheduler, it will allow for
per-processor (per-GOMAXPROC) state. And stack caches along with other
stuff will move there. That will decrease stack mem further, and I
believe eliminate any need in tunables (e.g. now you have say 200
threads each with own cache, and then you will have only 8 procs with
caches).
> Either way, thanks for putting so much time into this.
You are welcome.
> On Thu, Jan 3, 2013 at 11:48 PM, <dvyukov@google.com> wrote:
>> On 2013/01/04 06:04:59, dvyukov wrote:
>>>
>>> On 2013/01/03 23:55:50, sougou wrote:
>>> > Reran vtocc benchmars, around 10M queries using 100 clients.
>>> >
>>> > Run 1: go version currently used on production 0a3866d6cc6b (Sep
>>
>> 24):
>>>
>>> > qps: 5832 StackSys: 86MB
>>> >
>>> > Run 2: go @tip d0d76b7fb219 (Jan 3):
>>> > qps: 5543 StackSys: 77MB
>>> >
>>> > Run 3: Using CL 6997052:
>>> > qps: 5673 StackSys: 3MB
>>> >
>>> > Run 4: Using CL 7029044:
>>> > qps: 5699 StackSys: 15MB
>>> >
>>> > Conclusion: Marginal difference in performance between the two CLs.
>>
>> The older
>>>
>>> CL
>>> > uses less memory. Maybe it will be more pronounced if you passed
>>
>> large objects
>>>
>>> > by value to functions?
>>> > The runtime @tip is slower than the one from September, an unrelated
>>> > observation.
>>> >
>>> > This is just a summary. I can send you more detailed stats and pprof
>>
>> captures
>>>
>>> if
>>> > needed.
>>
>>
>>> Can you please test with varying values for
>>
>> StackCacheSize/StackCacheBatch in
>>>
>>> src/pkg/runtime/runtime.h?
>>> Currently they are set to 128/32. The CL/6997052 is using 16/8. I am
>>
>> inclined
>>>
>>> towards 32/16 for now (my synthetic tests show still minimal memory
>>
>> consumption
>>>
>>> and good performance). Another possible point is 64/32.
>>
>>
>> Or perhaps it's already fine?
>> 15 vs 79-80MB is a good win already. More importantly StackSys must not
>> grow over time now, it's bounded by 512kb per thread (while currently it
>> slowly grows to infinity).
>>
>> Well, actually not that slowly. I've run the following funny test --
>> each line is StackSys *increase*.
>>
>> Current behavior:
>> $ go test -run=StackMem -v
>>
-cpu=1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
>> 2>&1 | grep "for stack mem"
>> stack_test.go:1569: Consumed 106MB for stack mem
>> stack_test.go:1569: Consumed 48MB for stack mem
>> stack_test.go:1569: Consumed 52MB for stack mem
>> stack_test.go:1569: Consumed 71MB for stack mem
>> stack_test.go:1569: Consumed 71MB for stack mem
>> stack_test.go:1569: Consumed 53MB for stack mem
>> stack_test.go:1569: Consumed 35MB for stack mem
>> stack_test.go:1569: Consumed 27MB for stack mem
>> stack_test.go:1569: Consumed 39MB for stack mem
>> stack_test.go:1569: Consumed 43MB for stack mem
>> stack_test.go:1569: Consumed 49MB for stack mem
>> stack_test.go:1569: Consumed 54MB for stack mem
>> stack_test.go:1569: Consumed 44MB for stack mem
>> stack_test.go:1569: Consumed 35MB for stack mem
>> stack_test.go:1569: Consumed 41MB for stack mem
>> stack_test.go:1569: Consumed 32MB for stack mem
>> stack_test.go:1569: Consumed 27MB for stack mem
>> stack_test.go:1569: Consumed 20MB for stack mem
>> stack_test.go:1569: Consumed 36MB for stack mem
>> stack_test.go:1569: Consumed 33MB for stack mem
>> stack_test.go:1569: Consumed 31MB for stack mem
>> stack_test.go:1569: Consumed 45MB for stack mem
>> stack_test.go:1569: Consumed 40MB for stack mem
>> stack_test.go:1569: Consumed 30MB for stack mem
>> stack_test.go:1569: Consumed 39MB for stack mem
>> stack_test.go:1569: Consumed 27MB for stack mem
>> stack_test.go:1569: Consumed 27MB for stack mem
>> stack_test.go:1569: Consumed 37MB for stack mem
>> stack_test.go:1569: Consumed 33MB for stack mem
>> stack_test.go:1569: Consumed 36MB for stack mem
>> stack_test.go:1569: Consumed 34MB for stack mem
>> stack_test.go:1569: Consumed 42MB for stack mem
>> stack_test.go:1569: Consumed 29MB for stack mem
>> stack_test.go:1569: Consumed 29MB for stack mem
>> stack_test.go:1569: Consumed 44MB for stack mem
>> stack_test.go:1569: Consumed 20MB for stack mem
>> stack_test.go:1569: Consumed 31MB for stack mem
>> stack_test.go:1569: Consumed 31MB for stack mem
>> stack_test.go:1569: Consumed 19MB for stack mem
>> stack_test.go:1569: Consumed 25MB for stack mem
>>
>>
>> New behavior:
>> $ go test -run=StackMem -v
>>
-cpu=1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
>> 2>&1 | grep "for stack mem"
>> stack_test.go:1569: Consumed 13MB for stack mem
>> stack_test.go:1569: Consumed 0MB for stack mem
>> stack_test.go:1569: Consumed 0MB for stack mem
>> stack_test.go:1569: Consumed 0MB for stack mem
>> stack_test.go:1569: Consumed 0MB for stack mem
>> stack_test.go:1569: Consumed 0MB for stack mem
>> stack_test.go:1569: Consumed 0MB for stack mem
>> stack_test.go:1569: Consumed 0MB for stack mem
>> stack_test.go:1569: Consumed 0MB for stack mem
>> stack_test.go:1569: Consumed 0MB for stack mem
>> ...
>>
>>
>>
>> Either somebody must come up with a good tuning methodology, or let's
>> commit it as-is and tune later.
>>
>> https://codereview.appspot.com/7029044/
Sorry, I forgot to mention that StackSys was indefinitely growing in the first two runs ...
12 years, 9 months ago
(2013-01-04 08:05:45 UTC)
#11
Sorry, I forgot to mention that StackSys was indefinitely growing in the
first two runs (without the CLs).
For run 3 (old CL), it started off at 3MB and stayed there.
For run 4 (new CL), it started at 14MB and inched up to 15.
So, both CLs are good enough. I would personally lean towards the old CL,
but the new CL should also be fine if it offers more balanced performance.
On Fri, Jan 4, 2013 at 12:05 PM, Sugu Sougoumarane <sougou@google.com> wrote: > Sorry, I ...
12 years, 9 months ago
(2013-01-04 08:11:35 UTC)
#12
On Fri, Jan 4, 2013 at 12:05 PM, Sugu Sougoumarane <sougou@google.com> wrote:
> Sorry, I forgot to mention that StackSys was indefinitely growing in the
> first two runs (without the CLs).
> For run 3 (old CL), it started off at 3MB and stayed there.
> For run 4 (new CL), it started at 14MB and inched up to 15.
>
> So, both CLs are good enough. I would personally lean towards the old CL,
> but the new CL should also be fine if it offers more balanced performance.
I just afraid that I can badly penalize some other workloads that I
don't know about.
You may try new CL with e.g. StackCacheSize=32, StackCacheBatch=16 or
64/32. On the synthetic tests 32/16 is still good enough, so if it
reduces StackSys for you, I am happy to change it to 32/16.
12 years, 9 months ago
(2013-01-04 20:15:38 UTC)
#14
5M rows using 100 connections:
128/32:
qps: 5885 StackSys: 14.7MB
64/32:
qps: 5876 StackSys: 8.4MB
32/16:
qps: 5850 StackSys: 4.9MB
16/8:
qps: 5892 StackSys: 3.15MB
All other stats were comparable.
Conclusion. No significant change in performance for different values of
StackCacheSize/StackCacheBatch, except for the StackSys footprint, but there are
diminishing returns below 32/16.
OK, I will replace the constants with 32/16. And let's wait for Russ' blessing. On ...
12 years, 9 months ago
(2013-01-04 20:36:32 UTC)
#15
OK, I will replace the constants with 32/16.
And let's wait for Russ' blessing.
On Sat, Jan 5, 2013 at 12:15 AM, <sougou@google.com> wrote:
> 5M rows using 100 connections:
>
> 128/32:
> qps: 5885 StackSys: 14.7MB
>
> 64/32:
> qps: 5876 StackSys: 8.4MB
>
> 32/16:
> qps: 5850 StackSys: 4.9MB
>
> 16/8:
> qps: 5892 StackSys: 3.15MB
>
> All other stats were comparable.
>
> Conclusion. No significant change in performance for different values of
> StackCacheSize/StackCacheBatch, except for the StackSys footprint, but
> there are diminishing returns below 32/16.
>
> https://codereview.appspot.com/7029044/
https://codereview.appspot.com/7029044/diff/22001/src/pkg/runtime/malloc.goc File src/pkg/runtime/malloc.goc (right): https://codereview.appspot.com/7029044/diff/22001/src/pkg/runtime/malloc.goc#newcode754 src/pkg/runtime/malloc.goc:754: StackCacheNode *next; The stack itself is way bigger than ...
12 years, 9 months ago
(2013-01-07 04:18:57 UTC)
#17
https://codereview.appspot.com/7029044/diff/22001/src/pkg/runtime/malloc.goc
File src/pkg/runtime/malloc.goc (right):
https://codereview.appspot.com/7029044/diff/22001/src/pkg/runtime/malloc.goc#...
src/pkg/runtime/malloc.goc:754: StackCacheNode *next;
The stack itself is way bigger than this struct. Listing all StackCacheBatch
elements in batch will make the code significantly simpler, and a comment would
help too:
// A StackCacheNode is a group of StackCacheBatch fixed-size stacks that
// can be transferred between a goroutine and a central cache (stackcache).
// The StackCacheNode contents are stored in one of the stacks, so they can
// only be used when the stacks are free.
typedef struct StackCacheNode StackCacheNode;
struct StackCacheNode
{
StackCacheNode *next;
void *batch[StackCacheBatch];
}
refill() {
...
if(n == nil) {
...
for(i = 0; i < StackCacheBatch; i++)
n->batch[i] = (byte*)n + i*FixedStack;
}
pos = m->stackcachepos;
for(i = 0; i < StackCacheBatch; i++) {
m->stackcache[pos++] = n->batch[i];
pos %= StackCacheSize;
}
...
}
release() {
...
n = (StackCacheNode*)m->stackcache[pos];
for(i = 0; i < StackCacheBatch; i++) {
n->batch[i] = m->stackcache[pos++];
pos %= StackCacheSize;
}
https://codereview.appspot.com/7029044/diff/22001/src/pkg/runtime/malloc.goc#...
src/pkg/runtime/malloc.goc:758: static StackCacheNode *stackcache;
static struct {
Lock;
StackCacheNode *top;
} stackcache;
And then
runtime.lock(&stackcache);
n = stackcache.top;
etc.
https://codereview.appspot.com/7029044/diff/22001/src/pkg/runtime/malloc.goc#...
src/pkg/runtime/malloc.goc:793: static void
// Release oldest StackCacheBatch stack segments to central free list.
https://codereview.appspot.com/7029044/diff/22001/src/pkg/runtime/malloc.goc#...
src/pkg/runtime/malloc.goc:799: pos = (m->stackcachepos - m->stackcachecnt) %
StackCacheSize;
This only works because (a) the left hand side of the % is an unsigned type, and
(b) StackCacheSize is a power of two. Please use
pos = (m->stackcachepos - m->stackcachecnt + StackCacheSize) % StackCacheSize
which does not rely on either assumption.
https://codereview.appspot.com/7029044/diff/22001/src/pkg/runtime/malloc.goc#...
src/pkg/runtime/malloc.goc:840: pos = (pos - 1) % StackCacheSize;
pos = (pos - 1 + StackCacheSize) % StackCacheSize;
12 years, 9 months ago
(2013-01-08 12:33:47 UTC)
#18
PTAL
https://codereview.appspot.com/7029044/diff/22001/src/pkg/runtime/malloc.goc
File src/pkg/runtime/malloc.goc (right):
https://codereview.appspot.com/7029044/diff/22001/src/pkg/runtime/malloc.goc#...
src/pkg/runtime/malloc.goc:754: StackCacheNode *next;
On 2013/01/07 04:18:57, rsc wrote:
> The stack itself is way bigger than this struct. Listing all StackCacheBatch
> elements in batch will make the code significantly simpler, and a comment
would
> help too:
>
> // A StackCacheNode is a group of StackCacheBatch fixed-size stacks that
> // can be transferred between a goroutine and a central cache (stackcache).
> // The StackCacheNode contents are stored in one of the stacks, so they can
> // only be used when the stacks are free.
> typedef struct StackCacheNode StackCacheNode;
> struct StackCacheNode
> {
> StackCacheNode *next;
> void *batch[StackCacheBatch];
> }
>
> refill() {
> ...
> if(n == nil) {
> ...
> for(i = 0; i < StackCacheBatch; i++)
> n->batch[i] = (byte*)n + i*FixedStack;
> }
> pos = m->stackcachepos;
> for(i = 0; i < StackCacheBatch; i++) {
> m->stackcache[pos++] = n->batch[i];
> pos %= StackCacheSize;
> }
> ...
> }
>
> release() {
> ...
> n = (StackCacheNode*)m->stackcache[pos];
> for(i = 0; i < StackCacheBatch; i++) {
> n->batch[i] = m->stackcache[pos++];
> pos %= StackCacheSize;
> }
Done.
https://codereview.appspot.com/7029044/diff/22001/src/pkg/runtime/malloc.goc#...
src/pkg/runtime/malloc.goc:758: static StackCacheNode *stackcache;
On 2013/01/07 04:18:57, rsc wrote:
> static struct {
> Lock;
> StackCacheNode *top;
> } stackcache;
>
> And then
>
> runtime.lock(&stackcache);
> n = stackcache.top;
> etc.
Done.
https://codereview.appspot.com/7029044/diff/22001/src/pkg/runtime/malloc.goc#...
src/pkg/runtime/malloc.goc:793: static void
On 2013/01/07 04:18:57, rsc wrote:
> // Release oldest StackCacheBatch stack segments to central free list.
Done.
https://codereview.appspot.com/7029044/diff/22001/src/pkg/runtime/malloc.goc#...
src/pkg/runtime/malloc.goc:799: pos = (m->stackcachepos - m->stackcachecnt) %
StackCacheSize;
On 2013/01/07 04:18:57, rsc wrote:
> This only works because (a) the left hand side of the % is an unsigned type,
and
> (b) StackCacheSize is a power of two. Please use
>
> pos = (m->stackcachepos - m->stackcachecnt + StackCacheSize) % StackCacheSize
>
> which does not rely on either assumption.
Done.
https://codereview.appspot.com/7029044/diff/22001/src/pkg/runtime/malloc.goc#...
src/pkg/runtime/malloc.goc:840: pos = (pos - 1) % StackCacheSize;
On 2013/01/07 04:18:57, rsc wrote:
> pos = (pos - 1 + StackCacheSize) % StackCacheSize;
Done.
Issue 7029044: code review 7029044: runtime: less aggressive per-thread stack segment caching
(Closed)
Created 12 years, 10 months ago by dvyukov
Modified 12 years, 9 months ago
Reviewers: albert.strasheim
Base URL:
Comments: 10