Description: runtime: simplify runtime·settype()
This changeset removes buffering of type information
and removes support for SysAlloc from the code.
Patch Set 1
Patch Set 2 : diff -r 60d35f8bbbf2 https://code.google.com/p/go/
Patch Set 3 : diff -r 60d35f8bbbf2 https://code.google.com/p/go/
Patch Set 4 : diff -r 60d35f8bbbf2 https://code.google.com/p/go/
Patch Set 5 : diff -r 60d35f8bbbf2 https://code.google.com/p/go/
Total comments: 2
Patch Set 6 : diff -r 8c800027d5a6 https://code.google.com/p/go/
Patch Set 7 : diff -r 8c800027d5a6 https://code.google.com/p/go/
Total comments: 2
Patch Set 8 : diff -r 1faca3687fe6 https://code.google.com/p/go/
Patch Set 9 : diff -r 1faca3687fe6 https://code.google.com/p/go/
Patch Set 10 : diff -r e86ab7e59e50 https://code.google.com/p/go/
Total comments: 11
Patch Set 11 : diff -r 1f7fdf4ad92d https://code.google.com/p/go/
Total comments: 10
Patch Set 12 : diff -r baa90b763ecd https://code.google.com/p/go/
Total messages: 35
Hello golang-dev@googlegroups.com (cc: dvyukov@google.com, rsc@golang.org), I'd like you to review this change to https://code.google.com/p/go/
https://codereview.appspot.com/9716045/diff/1007/src/pkg/runtime/malloc.goc File src/pkg/runtime/malloc.goc (right): https://codereview.appspot.com/9716045/diff/1007/src/pkg/runtime/malloc.goc#n... src/pkg/runtime/malloc.goc:572: data2 = runtime·mallocgc(nbytes2, FlagNoProfiling|FlagNoPointers, 0, 1); Don't call malloc while holding a spin lock. Malloc might gc, and everyone else waiting on this lock is spinning. stoptheworld won't be able to stop them => deadlock. Calling malloc is ok. Use a better lock.
https://codereview.appspot.com/9716045/diff/1007/src/pkg/runtime/malloc.goc File src/pkg/runtime/malloc.goc (right): https://codereview.appspot.com/9716045/diff/1007/src/pkg/runtime/malloc.goc#n... src/pkg/runtime/malloc.goc:572: data2 = runtime·mallocgc(nbytes2, FlagNoProfiling|FlagNoPointers, 0, 1); On 2013/05/30 17:34:03, khr wrote: > Don't call malloc while holding a spin lock. Malloc might gc, and everyone else > waiting on this lock is spinning. stoptheworld won't be able to stop them => > deadlock. > > Calling malloc is ok. Use a better lock. mallocgc is called with dogc=0.
I did not see that, sorry. I'm still concerned about the spin lock, though, if the holder gets (OS) descheduled for some reason. Why can't we just use a Lock?

On Thu, May 30, 2013 at 11:25 AM, <0xE2.0x9A.0x9B@gmail.com> wrote:
> https://codereview.appspot.com/9716045/diff/1007/src/pkg/runtime/malloc.goc
> File src/pkg/runtime/malloc.goc (right):
>
> src/pkg/runtime/malloc.goc:572: data2 = runtime·mallocgc(nbytes2,
> FlagNoProfiling|FlagNoPointers, 0, 1);
> On 2013/05/30 17:34:03, khr wrote:
>> Don't call malloc while holding a spin lock. Malloc might gc, and everyone else
>> waiting on this lock is spinning. stoptheworld won't be able to stop them =>
>> deadlock.
>>
>> Calling malloc is ok. Use a better lock.
>
> mallocgc is called with dogc=0.
Hello golang-dev@googlegroups.com, khr@golang.org, khr@google.com (cc: dvyukov@google.com, golang-dev@googlegroups.com, rsc@golang.org), Please take another look.
LGTM. Have Dmitry or Carl take a look also.

On Thu, May 30, 2013 at 12:52 PM, <0xE2.0x9A.0x9B@gmail.com> wrote:
> Hello golang-dev@googlegroups.com, khr@golang.org, khr@google.com (cc:
> dvyukov@google.com, golang-dev@googlegroups.com, rsc@golang.org),
>
> Please take another look.
LGTM

What, if any, is the performance consequence of this change?

https://codereview.appspot.com/9716045/diff/18006/src/pkg/runtime/malloc.goc File src/pkg/runtime/malloc.goc (right): https://codereview.appspot.com/9716045/diff/18006/src/pkg/runtime/malloc.goc#... src/pkg/runtime/malloc.goc:509: uintptr ntypes, nbytes2, nbytes3; I realize the names "nbytes{2,3}" and "data{2,3}" come from the earlier revision, but they are very confusing. Could you rename these values to something more descriptive? A follow-up change is okay. It looks like they are for word type data and byte type data; a better name might reflect that.
https://codereview.appspot.com/9716045/diff/18006/src/pkg/runtime/malloc.goc File src/pkg/runtime/malloc.goc (right): https://codereview.appspot.com/9716045/diff/18006/src/pkg/runtime/malloc.goc#... src/pkg/runtime/malloc.goc:509: uintptr ntypes, nbytes2, nbytes3; On 2013/05/30 21:59:08, cshapiro1 wrote:
> I realize the names "nbytes{2,3}" and "data{2,3}" come from the earlier
> revision, but they are very confusing. Could you rename these values to
> something more descriptive? A follow-up change is okay. It looks like they are
> for word type data and byte type data, a better name might reflect that.

I suggest keeping the names and adding a comment to the C code:

// The suffix 2 (nbytes2, data2) means that the variable is related to MType_Words.
// The suffix 3 (nbytes3, data3) means that the variable is related to MType_Bytes.
The "remove sysalloc" part looks good. But the "remove caching" part: NOT LGTM. There is already a very significant slowdown in the memory subsystem since Go1.1, and this adds another 6%.

On test/bench/garbage/parser:

before: garbage.BenchmarkParser 4 2851583751 ns/op
after: garbage.BenchmarkParser 4 3037543341 ns/op

before:
 9.78%  parser  parser  [.] scanblock
 7.83%  parser  parser  [.] sweepspan
 6.61%  parser  parser  [.] runtime.mallocgc
 6.44%  parser  parser  [.] flushptrbuf
 4.06%  parser  parser  [.] runtime.settype_flush
 3.76%  parser  parser  [.] go/scanner.(*Scanner).next
 2.92%  parser  parser  [.] runtime.gettype
 2.44%  parser  parser  [.] runtime.newstack
 2.29%  parser  parser  [.] runtime.memclr

after:
 9.82%  parser  parser  [.] scanblock
 7.63%  parser  parser  [.] sweepspan
 6.20%  parser  parser  [.] flushptrbuf
 5.77%  parser  parser  [.] runtime.mallocgc
 5.11%  parser  parser  [.] runtime.xchg
 4.10%  parser  parser  [.] runtime.settype
 3.32%  parser  parser  [.] go/scanner.(*Scanner).next
 2.93%  parser  parser  [.] runtime.gettype
 2.45%  parser  parser  [.] runtime.newstack
 2.19%  parser  parser  [.] go/scanner.(*Scanner).Scan
 2.13%  parser  parser  [.] runtime.memclr

Note the 5.11% for runtime.xchg. Malloc really should be just a pop from a freelist (or a bump of a pointer); in no way must it take locks (even local ones).
Patchset 5 has runtime·lock() inlined into runtime·settype(). Could you please rerun the benchmark with patchset 5 and state whether the performance of patchset 5 would be LGTM? I can rewrite the for(;;) loop in assembler if necessary.

On 2013/05/31 06:33:07, dvyukov wrote:
> The "remove sysalloc" part looks good. But the "remove caching" NOT LGTM.
> There is already very significant slowdown in the memory subsystem since Go1.1,
> and this adds another 6%.
> [benchmark numbers and perf profiles quoted above elided]
> Note that 5.11% for runtime.xchg.
> Malloc really should be just a pop from freelist (or bump pointer), in no way it
> must take locks (even if local).
I expect the spinlock variable to be zero on entry to settype() in the majority of cases. If mallocgc is called from N threads at the same time, it is likely that the N mspans will be different from each other. I am unable to prove that those N mspans are distinct in all possible situations, so I was forced to put a lock there.
On 2013/05/31 06:40:27, atom wrote: > Patchset 5 has runtime·lock() inlined into runtime·settype(). Could you please > rerun the benchmark with patchset 5 and state whether the performance of > patchset 5 would be LGTM? Slightly better, but still 4% slowdown: before: garbage.BenchmarkParser 4 2814602888 ns/op garbage.BenchmarkParser 4 2796648863 ns/op garbage.BenchmarkParser 4 2786491624 ns/op after: garbage.BenchmarkParser 4 2961402827 ns/op garbage.BenchmarkParser 4 2882486866 ns/op garbage.BenchmarkParser 4 2899995700 ns/op
I uploaded a new patchset. Please rerun the benchmark.
I updated the settype function and fixed the irrational code introduced in the previous changeset. I hope the code is correct now. Please take another look.
https://codereview.appspot.com/9716045/diff/35001/src/pkg/runtime/malloc.goc File src/pkg/runtime/malloc.goc (right):

https://codereview.appspot.com/9716045/diff/35001/src/pkg/runtime/malloc.goc#... src/pkg/runtime/malloc.goc:564: if(DebugTypeAtBlockEnd) { This seems unrelated to this CL.

https://codereview.appspot.com/9716045/diff/35001/src/pkg/runtime/malloc.goc#... src/pkg/runtime/malloc.goc:577: switch(s->types.compression) { This appears to be an unlocked access to a field that is protected by a lock. Shouldn't it be an atomic load? The locking semantics here are puzzling; is there a comment somewhere, e.g., malloc.h, that explains them?

https://codereview.appspot.com/9716045/diff/35001/src/pkg/runtime/malloc.goc#... src/pkg/runtime/malloc.goc:580: { Why start a new block here?

https://codereview.appspot.com/9716045/diff/35001/src/pkg/runtime/malloc.goc#... src/pkg/runtime/malloc.goc:582: // An other OS thread won the race s/An other/Another/

https://codereview.appspot.com/9716045/diff/35001/src/pkg/runtime/malloc.goc#... src/pkg/runtime/malloc.goc:587: // An other OS thread won the race s/An other/Another/
https://codereview.appspot.com/9716045/diff/35001/src/pkg/runtime/malloc.goc File src/pkg/runtime/malloc.goc (right): https://codereview.appspot.com/9716045/diff/35001/src/pkg/runtime/malloc.goc#... src/pkg/runtime/malloc.goc:577: switch(s->types.compression) { On 2013/05/31 19:04:17, iant wrote:
> This appears to be an unlocked access to a field that is protected by a lock.
> Shouldn't it be an atomic load?
>
> The locking semantics here are puzzling; is there a comment somewhere, e.g.,
> malloc.h, that explains them?

The semantics are based on the fact that MTypes_Words is the final state. Once the final state is reached, the value of s->types.data is a constant. I just realized there is a race condition on s->types.data when going from MTypes_Bytes to MTypes_Words. I will attempt to post a fix tomorrow.
https://codereview.appspot.com/9716045/diff/35001/src/pkg/runtime/malloc.goc File src/pkg/runtime/malloc.goc (right):

https://codereview.appspot.com/9716045/diff/35001/src/pkg/runtime/malloc.goc#... src/pkg/runtime/malloc.goc:564: if(DebugTypeAtBlockEnd) { On 2013/05/31 19:04:17, iant wrote:
> This seems unrelated to this CL.
It comes from the old runtime·settype().

https://codereview.appspot.com/9716045/diff/35001/src/pkg/runtime/malloc.goc#... src/pkg/runtime/malloc.goc:577: switch(s->types.compression) { The code has been updated. It isn't race free, but it should be very close to being race free. The non-zero probability of information loss shouldn't be a problem for the garbage collector. There are now more comments in the code.

https://codereview.appspot.com/9716045/diff/35001/src/pkg/runtime/malloc.goc#... src/pkg/runtime/malloc.goc:580: { On 2013/05/31 19:04:17, iant wrote:
> Why start a new block here?
Done.

https://codereview.appspot.com/9716045/diff/35001/src/pkg/runtime/malloc.goc#... src/pkg/runtime/malloc.goc:582: // An other OS thread won the race On 2013/05/31 19:04:17, iant wrote:
> s/An other/Another/
Done.

https://codereview.appspot.com/9716045/diff/35001/src/pkg/runtime/malloc.goc#... src/pkg/runtime/malloc.goc:587: // An other OS thread won the race On 2013/05/31 19:04:17, iant wrote:
> s/An other/Another/
Done.
https://codereview.appspot.com/9716045/diff/41001/src/pkg/runtime/malloc.goc File src/pkg/runtime/malloc.goc (right):

https://codereview.appspot.com/9716045/diff/41001/src/pkg/runtime/malloc.goc#... src/pkg/runtime/malloc.goc:560: // The suffix 3 (nbytes3, data3) means that the variable is related to MType_Bytes. I also experience cognitive pressure when encountering the 2/3 names. That may be fine for a very local thing. At least I would expect 1 - single, 2 - bytes, 3 - words, because words are larger and the final state. But they are actually the other way around. datas, datab, dataw would give a useful hint.

https://codereview.appspot.com/9716045/diff/41001/src/pkg/runtime/malloc.goc#... src/pkg/runtime/malloc.goc:568: if(DebugTypeAtBlockEnd) { drop {}

https://codereview.appspot.com/9716045/diff/41001/src/pkg/runtime/malloc.goc#... src/pkg/runtime/malloc.goc:580: ofs = ((uintptr)v - (s->start<<PageShift)) / size; this seems to be an index, not an offset

https://codereview.appspot.com/9716045/diff/41001/src/pkg/runtime/malloc.goc#... src/pkg/runtime/malloc.goc:635: if(((uintptr*)data3)[j] == typ) { drop {}

https://codereview.appspot.com/9716045/diff/41001/src/pkg/runtime/malloc.goc#... src/pkg/runtime/malloc.goc:640: // Label1 ?

https://codereview.appspot.com/9716045/diff/41001/src/pkg/runtime/malloc.goc#... src/pkg/runtime/malloc.goc:640: // Label1 Now I see you refer to the Label1 below. Perhaps say something like: // possible race condition, see the "Move contents of data3 to data2" comment below

https://codereview.appspot.com/9716045/diff/41001/src/pkg/runtime/malloc.goc#... src/pkg/runtime/malloc.goc:657: if(((uintptr*)data3)[j] == typ) { drop {}

https://codereview.appspot.com/9716045/diff/41001/src/pkg/runtime/malloc.goc#... src/pkg/runtime/malloc.goc:666: // Another OS thread added typ to data3 it is not necessarily another thread; it can be the current thread

https://codereview.appspot.com/9716045/diff/41001/src/pkg/runtime/malloc.goc#... src/pkg/runtime/malloc.goc:685: // Although we may misread some elements of data3[8..N] due to a race condition with Label1, it can also race with 'Another OS thread added typ to data3'
https://codereview.appspot.com/9716045/diff/41001/src/pkg/runtime/malloc.goc File src/pkg/runtime/malloc.goc (right): https://codereview.appspot.com/9716045/diff/41001/src/pkg/runtime/malloc.goc#... src/pkg/runtime/malloc.goc:687: // The probability of the race condition is extremely low. It's fine to have such races now, but we must aim for a 100% precise GC in the future. At the least we need some plan for how to implement it without the race while keeping it fast. It would be a pity to revert this CL in the future.
Now it's quite debatable whether it's a simplification, as the change description says. I've remeasured the performance; the best out of 8 runs is:

garbage.BenchmarkParser 4 2825486313 ns/op

So it's very close, but ~0.5-1% slower.

Before:
4.06% parser parser [.] runtime.settype_flush
1.11% parser parser [.] runtime.xchg
After:
4.65% parser parser [.] runtime.settype
1.30% parser parser [.] runtime.xchg

My conclusion is still the same: the "remove sysalloc" part is good. But the rest either complicates the code, or slows it down, or both.
Considering what's been written in discussion "Better GC and Memory Allocator" (https://groups.google.com/forum/?fromgroups#!topic/golang-dev/pwUh0BVFpY0) and taking into account the fact that further optimizing the current settype() seems impossible, I am thinking of removing MTypes_Bytes. This would make settype() and mgc0.c both simpler and slightly faster, at the cost of increased memory consumption. The cost per object would initially be 8 bytes on 64-bit platforms, with the option of shrinking it to 4 bytes per object in the future. Is there an agreement that this is the way to go? On 2013/06/03 09:04:17, dvyukov wrote: > Now it's quite debatable whether it's a simplification as the change description > says. > I've remeasured the performance, the best out of 8 runs is: > garbage.BenchmarkParser 4 2825486313 ns/op > So it's very close, but ~0.5-1% slower. > Before: > 4.06% parser parser [.] runtime.settype_flush > 1.11% parser parser [.] runtime.xchg > After: > 4.65% parser parser [.] runtime.settype > 1.30% parser parser [.] runtime.xchg > > My conclusion is still the same: > The "remove sysalloc" part is good. > But the rest is either complicates the code, or slows it down, or both.
On 2013/06/03 12:21:51, atom wrote: > Considering what's been written in discussion "Better GC and Memory Allocator" > (https://groups.google.com/forum/?fromgroups#%21topic/golang-dev/pwUh0BVFpY0) and > taking into account the fact that further optimizing the current settype() seems > impossible, I am thinking of removing MTypes_Bytes. This would make settype() > and mgc0.c both simpler and slightly faster, at the cost of increased memory > consumption. The cost per object would initially be 8 bytes on 64-bit platforms, > with the option of shrinking it to 4 bytes per object in the future. > > Is there an agreement that this is the way to go? Simplifications due to removal of sysalloc and MTypes_Bytes will be great. But you need to measure how it affects performance and memory consumption.
On 2013/06/03 12:29:17, dvyukov wrote: > On 2013/06/03 12:21:51, atom wrote: > > Considering what's been written in discussion "Better GC and Memory Allocator" > > (https://groups.google.com/forum/?fromgroups#%2521topic/golang-dev/pwUh0BVFpY0) > and > > taking into account the fact that further optimizing the current settype() > seems > > impossible, I am thinking of removing MTypes_Bytes. This would make settype() > > and mgc0.c both simpler and slightly faster, at the cost of increased memory > > consumption. The cost per object would initially be 8 bytes on 64-bit > platforms, > > with the option of shrinking it to 4 bytes per object in the future. > > > > Is there an agreement that this is the way to go? > > > Simplifications due to removal of sysalloc and MTypes_Bytes will be great. But > you need to measure how it affects performance and memory consumption. A = with MTypes_Bytes B = without MTypes_Bytes linux/386 test/bench/garbage/parser.go: A: 65.087user 1.870system 0:54.170elapsed 123%CPU (390488maxresident)k B: 63.042user 1.960system 0:52.967elapsed 122%CPU (401664maxresident)k test/bench/garbage/tree2.go: A: 12.246user 0.192system 0:11.578elapsed 107%CPU (148576maxresident)k B: 11.827user 0.213system 0:11.349elapsed 106%CPU (173208maxresident)k
On 2013/06/03 13:11:39, atom wrote: > On 2013/06/03 12:29:17, dvyukov wrote: > > On 2013/06/03 12:21:51, atom wrote: > > > Considering what's been written in discussion "Better GC and Memory > Allocator" > > > > (https://groups.google.com/forum/?fromgroups#%252521topic/golang-dev/pwUh0BVFpY0) > > and > > > taking into account the fact that further optimizing the current settype() > > seems > > > impossible, I am thinking of removing MTypes_Bytes. This would make > settype() > > > and mgc0.c both simpler and slightly faster, at the cost of increased memory > > > consumption. The cost per object would initially be 8 bytes on 64-bit > > platforms, > > > with the option of shrinking it to 4 bytes per object in the future. > > > > > > Is there an agreement that this is the way to go? > > > > > > Simplifications due to removal of sysalloc and MTypes_Bytes will be great. But > > you need to measure how it affects performance and memory consumption. > > A = with MTypes_Bytes > B = without MTypes_Bytes > > linux/386 > > test/bench/garbage/parser.go: > A: 65.087user 1.870system 0:54.170elapsed 123%CPU (390488maxresident)k > B: 63.042user 1.960system 0:52.967elapsed 122%CPU (401664maxresident)k > > test/bench/garbage/tree2.go: > A: 12.246user 0.192system 0:11.578elapsed 107%CPU (148576maxresident)k > B: 11.827user 0.213system 0:11.349elapsed 106%CPU (173208maxresident)k Please also measure memory consumption on linux/amd64. You can do it with: $ TIME="%e %M" time ./parser
On 2013/06/03 13:23:43, dvyukov wrote: > Please also measure memory consumption on linux/amd64. You can do it with: > $ TIME="%e %M" time ./parser This could take half an hour to complete with my resources. I uploaded the new code as patchset 12. Could you please run the benchmarks?
On 2013/06/03 13:28:56, atom wrote: > On 2013/06/03 13:23:43, dvyukov wrote: > > Please also measure memory consumption on linux/amd64. You can do it with: > > $ TIME="%e %M" time ./parser > > This could take half an hour to complete with my resources. I uploaded the new > code as patchset 12. Could you please run the benchmarks? About half an hour, as estimated. The parser benchmark exceeds the 512 MB memory limit of the amd64 virtualized environment I am occasionally using for testing Go.
So your benchmarks show 2.22% speedup and 2.86% memory increase. On linux/amd64 I see 2.19% speedup and 5.27% memory increase (it's probably expected, because a word is twice as big). Plus some code simplification. I can't make up my mind right now; it's a difficult decision.
On 2013/06/03 14:55:26, dvyukov wrote:
> So your benchmarks show 2.22% speedup and 2.86% memory increase.
> On linux/amd64 I see 2.19% speedup and 5.27% memory increase (it's probably
> expected, because a word is twice as big).
> + some code simplification
> I can't make up my mind right now, it's a difficult decision.

There are possibilities for lowering the typeinfo memory consumption in the future:

1. Avoid mallocgc() in settype().
2. Try to allocate hashmaps outside of mheap. The runtime knows the type information of all hashmaps, and the garbage collection of hashmaps is fully precise.
3. (already mentioned) Shrinking the 8-byte typeinfo to 4 bytes.

I plan to deal with (1) in a short time. (2) shouldn't be hard; it shares code with (1). (3) requires changes to the runtime, the compiler, and the reflect package, so it may take some time to implement.

If we have a sufficiently firm belief in the realization of these 3 changes, the 5.27% memory increase seems acceptable.
On 2013/06/03 14:55:26, dvyukov wrote: > So your benchmarks show 2.22% speedup and 2.86% memory increase. > On linux/am64 I see 2.19% speedup and 5.27% memory increase (it's probably > expected, because word is twice as big). > + some code simplification > I can't make my mind right now, it's a difficult decision. I have a prototype implementation of settype() without mallocgc(). It uses a new allocator. The results: A = with MTypes_Bytes, with mallocgc in settype B = without MTypes_Bytes, without mallocgc in settype linux/386 test/bench/garbage/parser.go: A: 65.190user 2.057system 0:54.463elapsed 123%CPU 390488maxresident B: 66.798user 2.015system 0:54.812elapsed 125%CPU 381224maxresident test/bench/garbage/tree2.go: A: 12.243user 0.192system 0:11.563elapsed 107%CPU 148576maxresident B: 12.081user 0.219system 0:11.436elapsed 107%CPU 169152maxresident The user times for parser.go are not comparable because in case B the scanblock() function is consuming about 1 second more time than in case A. The garbage collector also runs more often in case B when running parser.go, but this is to be expected because of smaller mheap.
On 2013/06/05 10:11:30, atom wrote:
> I have a prototype implementation of settype() without mallocgc(). It uses a new
> allocator.
>
> The results:
>
> A = with MTypes_Bytes, with mallocgc in settype
> B = without MTypes_Bytes, without mallocgc in settype
>
> linux/386
>
> test/bench/garbage/parser.go:
> A: 65.190user 2.057system 0:54.463elapsed 123%CPU 390488maxresident
> B: 66.798user 2.015system 0:54.812elapsed 125%CPU 381224maxresident
>
> test/bench/garbage/tree2.go:
> A: 12.243user 0.192system 0:11.563elapsed 107%CPU 148576maxresident
> B: 12.081user 0.219system 0:11.436elapsed 107%CPU 169152maxresident
>
> The user times for parser.go are not comparable because in case B the
> scanblock() function is consuming about 1 second more time than in case A. The
> garbage collector also runs more often in case B when running parser.go, but
> this is to be expected because of smaller mheap.

parser looks good. I don't care too much about tree2 for such changes. What do you mean by "new allocator"?

Also please allocate the type array directly when the span is allocated for small blocks; it will simplify settype() significantly. And free it directly when the span is returned to the heap; it should be possible with a custom allocator.
On 2013/06/05 11:50:05, dvyukov wrote: > On 2013/06/05 10:11:30, atom wrote: > > I have a prototype implementation of settype() without mallocgc(). It uses a > new > > allocator. > > > > The results: > > > > A = with MTypes_Bytes, with mallocgc in settype > > B = without MTypes_Bytes, without mallocgc in settype > > > > linux/386 > > > > test/bench/garbage/parser.go: > > A: 65.190user 2.057system 0:54.463elapsed 123%CPU 390488maxresident > > B: 66.798user 2.015system 0:54.812elapsed 125%CPU 381224maxresident > > > > test/bench/garbage/tree2.go: > > A: 12.243user 0.192system 0:11.563elapsed 107%CPU 148576maxresident > > B: 12.081user 0.219system 0:11.436elapsed 107%CPU 169152maxresident > > > > The user times for parser.go are not comparable because in case B the > > scanblock() function is consuming about 1 second more time than in case A. The > > garbage collector also runs more often in case B when running parser.go, but > > this is to be expected because of smaller mheap. > > > parser looks good. I don't care too much about tree2 for such changes. > What do you mean by "new allocator"? https://codereview.appspot.com/10046043 It isn't prepared for code review yet, but you can comment on it. I published it in advance because this CL (9716045) cannot be LGTMed without seeing the allocator source code. Please ignore the style of the source code for now. > Also please allocate the type array directly when the span is allocated for > small blocks, it will simplify settype() significantly. And free it directly > when the span is returned to heap, it should be possible with custom allocator. I agree. Because of recursion it wasn't possible when settype() was using mallocgc(). I would suggest for this to be a separate code review that should be posted after we are done with CL 10046043. Splitting the process into 3 CLs should make it easier to review.
On Wed, Jun 5, 2013 at 4:29 PM, <0xE2.0x9A.0x9B@gmail.com> wrote:
> https://codereview.appspot.com/10046043
>
> It isn't prepared for code review yet, but you can comment on it. I
> published it in advance because this CL (9716045) cannot be LGTMed
> without seeing the allocator source code. Please ignore the style of the
> source code for now.

Kind of a lot of code... and another allocator in the runtime...

Can we do what we've discussed in golang-dev about improved GC -- embed this type table directly at the end of the span itself? Then it won't require all that code. MSpan will need to embed a few slots for large objects (so that a 4K object can fit into a 4K span).

> I agree. Because of recursion it wasn't possible when settype() was
> using mallocgc(). I would suggest for this to be a separate code review
> that should be posted after we are done with CL 10046043. Splitting the
> process into 3 CLs should make it easier to review.
On 2013/06/05 13:03:01, dvyukov wrote:
> Kind of lot of code... and another allocator in the runtime...

I believe an allocator completely separate from mallocgc() is necessary and can be used from multiple places in the runtime.

> Can we do what we've discussed in golang-dev about improved GC --
> embed this type table directly at the end of the span itself? Then it
> won't require all that code.
> MSpan will need to embed few slots for large objects (so that 4K
> object can fit into 4K span).

The typical size of a memory block holding typeinfos is about 1024 bytes. The list of memory allocation requests for the parser benchmark looks as follows (size in bytes, 32-bit platform):

... 2048 512 1024 512 2048 1024 344 512 144 168 256 128 1024 344 512 2048 1024 256 512 1024 512 2048 1024 32 512 1024 512 1024 2048 512 1024 344 512 2048 1024 16 512 1024 512 344 2048 1024 512 512 1024 2048 512 64 1024 344 512 2048 1024 512 256 1024 2048 512 16 512 1024 2048 512 1024 512 1024 2048 88 1024 8 8 ...

A span is aligned to the page size, which is 4096 bytes. Embedding the type table directly into the span itself would mean that the above numbers have to be rounded to a multiple of 4096. This would increase memory consumption.

I would like to close this CL (9716045) soon if possible. There are a lot of other improvements ahead.
On Wed, Jun 5, 2013 at 5:29 PM, <0xE2.0x9A.0x9B@gmail.com> wrote:
> I believe an allocator completely separate from mallocgc() is necessary
> and can be used from multiple places in the runtime.

Where else?

> The typical size of a memory block holding typeinfos is about 1024
> bytes. [list of allocation request sizes quoted above elided]
>
> A span is aligned to a page size, which is 4096 bytes. Embedding the
> type table directly into the span itself would mean that the above
> numbers have to be rounded to a multiple of 4096.

Why do they have to be rounded to 4096? A single 4096-byte span can hold lots of 8-byte allocations and the associated type info; no rounding is required.

> This would increase memory consumption.
>
> I would like to close this CL (9716045) soon if possible. There is a lot
> of other improvements ahead.
On 2013/06/05 13:36:49, dvyukov wrote:
> On Wed, Jun 5, 2013 at 5:29 PM, <mailto:0xE2.0x9A.0x9B@gmail.com> wrote:
> > A span is aligned to a page size, which is 4096 bytes. Embedding the
> > type table directly into span itself would mean that the above numbers
> > have to be rounded to a multiple of 4096.
>
> Why do they have to be rounded to 4096?
> A single 4096-byte span can hold lots of 8-byte allocations and the
> associated type info, no rounding is required.

I don't have enough vitality. I am closing this CL.