Description: runtime: no longer skip stack growth test in short mode
We originally decided to skip this test in short mode
to prevent the parallel runtime test from timing out on the
Plan 9 builder. This should no longer be required since
the issue was fixed in CL 86210043.
Patch Set 1
Patch Set 2: diff -r 98e63c935796 https://code.google.com/p/go
Patch Set 3: diff -r cd08410c17bf https://code.google.com/p/go
Total messages: 22
Hello dvyukov (cc: golang-codereviews@googlegroups.com, rsc), I'd like you to review this change to https://code.google.com/p/go
LGTM.
LGTM. There might be a few more of these too. At least one in http.
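One way to hunt for the remaining short-mode skips is to grep the tree for the skip message; this command is illustrative, not something from the thread:

grep -rn "in short mode" src/pkg/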
Next time you are going to disable a test for one OS, please use a condition checking runtime.GOOS.
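For reference, the suggested pattern might look like this; a sketch only, not code from this CL (scoping the skip to plan9 is an assumption based on the CL description):

package runtime_test

import (
	"runtime"
	"testing"
)

func TestStackGrowth(t *testing.T) {
	// Skip only where the slowness was observed, instead of skipping
	// for every short-mode run on every OS.
	if runtime.GOOS == "plan9" && testing.Short() {
		t.Skip("skipping on plan9 in short mode")
	}
	t.Parallel()
	// ... rest of the test ...
}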
*** Submitted as https://code.google.com/p/go/source/detail?r=9ef11603cde8 ***

runtime: no longer skip stack growth test in short mode

We originally decided to skip this test in short mode
to prevent the parallel runtime test from timing out on the
Plan 9 builder. This should no longer be required since
the issue was fixed in CL 86210043.

LGTM=dave, bradfitz
R=dvyukov, dave, bradfitz
CC=golang-codereviews, rsc
https://codereview.appspot.com/84790044
Message was sent while issue was closed.
This CL appears to have broken the linux-arm-cheney-imx6 builder. See http://build.golang.org/log/d18afd8e1ca997a4672687d4d32f9f68d2fb503c
The test is failing on both linux/arm and solaris. The test seems rather slow on arm machines; could that be the cause of the failure? -- David du Colombier
It also happens sporadically on darwin/amd64 and netbsd/amd64. -- David du Colombier
I can't reproduce builder failures on any of my machines, investigating... -- Aram Hăvărneanu
Hmm, although I can't reproduce the failure, I do see some interesting performance degradation for TestStackGrowth with cpu >= 2:

: oos:runtime; GOMAXPROCS=2 go test -v -cpu=1,2,4 -run TestStackGrowth runtime
=== RUN TestStackGrowth
=== RUN TestStackGrowthCallback
--- PASS: TestStackGrowthCallback (2.23 seconds)
--- PASS: TestStackGrowth (2.23 seconds)
=== RUN TestStackGrowth-2
=== RUN TestStackGrowthCallback-2
--- PASS: TestStackGrowthCallback-2 (6.07 seconds)
--- PASS: TestStackGrowth-2 (28.45 seconds)
=== RUN TestStackGrowth-4
=== RUN TestStackGrowthCallback-4
--- PASS: TestStackGrowthCallback-4 (12.44 seconds)
--- PASS: TestStackGrowth-4 (21.40 seconds)
PASS
ok      runtime 67.345s

: oos:runtime; GOMAXPROCS=2 go test -v -cpu=1,2,4 -run TestStackGrowth runtime
=== RUN TestStackGrowth
=== RUN TestStackGrowthCallback
--- PASS: TestStackGrowthCallback (2.30 seconds)
--- PASS: TestStackGrowth (2.30 seconds)
=== RUN TestStackGrowth-2
=== RUN TestStackGrowthCallback-2
--- PASS: TestStackGrowthCallback-2 (7.99 seconds)
--- PASS: TestStackGrowth-2 (28.72 seconds)
=== RUN TestStackGrowth-4
=== RUN TestStackGrowthCallback-4
--- PASS: TestStackGrowthCallback-4 (12.23 seconds)
--- PASS: TestStackGrowth-4 (21.16 seconds)
PASS
ok      runtime 67.270s

It only happens on one machine though; this other machine is fine:

04757c51-b5c2-4d23-9a69-1e9e305bc4da:aram$ GOMAXPROCS=2 /home/aram/go-up/bin/go test -cpu=1,2,4 -v -run TestStackGrowth runtime
=== RUN TestStackGrowth
=== RUN TestStackGrowthCallback
--- PASS: TestStackGrowthCallback (2.48 seconds)
--- PASS: TestStackGrowth (2.48 seconds)
=== RUN TestStackGrowth-2
=== RUN TestStackGrowthCallback-2
--- PASS: TestStackGrowthCallback-2 (2.61 seconds)
--- PASS: TestStackGrowth-2 (4.57 seconds)
=== RUN TestStackGrowth-4
=== RUN TestStackGrowthCallback-4
--- PASS: TestStackGrowthCallback-4 (5.63 seconds)
--- PASS: TestStackGrowth-4 (5.73 seconds)
PASS
ok      runtime 14.741s

04757c51-b5c2-4d23-9a69-1e9e305bc4da:aram$ GOMAXPROCS=2 /home/aram/go-up/bin/go test -cpu=1,2,4 -v -run TestStackGrowth runtime
=== RUN TestStackGrowth
=== RUN TestStackGrowthCallback
--- PASS: TestStackGrowthCallback (2.34 seconds)
--- PASS: TestStackGrowth (2.34 seconds)
=== RUN TestStackGrowth-2
=== RUN TestStackGrowthCallback-2
--- PASS: TestStackGrowthCallback-2 (3.07 seconds)
--- PASS: TestStackGrowth-2 (4.76 seconds)
=== RUN TestStackGrowth-4
=== RUN TestStackGrowthCallback-4
--- PASS: TestStackGrowthCallback-4 (5.49 seconds)
--- PASS: TestStackGrowth-4 (5.61 seconds)
PASS
ok      runtime 14.344s

I'll test on the builder too.

-- Aram Hăvărneanu
It also happens on the Plan 9 builder, but it's much less frequent than before the semasleep fix (CL 86210043). -- David du Colombier
On Thu, Apr 10, 2014 at 6:05 AM, Aram Hăvărneanu <aram.h@mgk.ro> wrote:
> Hmm, although I can't reproduce the failure, I do see some interesting
> performance degradation for TestStackGrowth with cpu >= 2.

This looks like what was happening on the Plan 9 builder. There, the problem was that stoptheworld called notetsleep, which called semasleep to sleep for 100us, but the tsemacquire system call rounded that down to 0ms, so it was hard-spinning on the cpu instead of sleeping and therefore rescheduling. The fix was to round up so that some real sleeping (and therefore rescheduling) always happens in that situation. GOMAXPROCS>1 would exacerbate this.

It is entirely possible that the runtime for other systems has the same problem. What system are you seeing this on?

Russ
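The shape of that fix, as a standalone sketch (reconstructed from the description above, not quoted from CL 86210043):

package main

import "fmt"

// Plan 9's tsemacquire takes milliseconds. Truncating a 100us timeout
// yields 0ms, so the caller never actually sleeps and hard-spins.
func toMillisTruncating(us int64) int32 { return int32(us / 1000) }

// Rounding up guarantees at least 1ms of real sleep, so the thread
// reschedules instead of spinning.
func toMillisRoundingUp(us int64) int32 { return int32((us + 999) / 1000) }

func main() {
	fmt.Println(toMillisTruncating(100)) // 0: busy spin
	fmt.Println(toMillisRoundingUp(100)) // 1: a real sleep
}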
> This looks like what was happening on the Plan 9 builder. There, the
> problem was that stoptheworld called notetsleep called semasleep to sleep
> for 100us, but in the tsemacquire system call that rounded down to 0ms,
> so it was hard-spinning on the cpu instead of sleeping and therefore
> rescheduling. The fix was to round up so that some real sleeping (and
> therefore rescheduling) always happens in that situation. GOMAXPROCS>1
> would exacerbate this.

That was my first hypothesis, but I can't see it. It looks like there's actual sleeping going on. Also, the problem seems specific to the kernel version. I can see it on older kernels, but on the latest kernel from Joyent I can't trigger this problem at all.

FWIW, this is on the "bad" machine:

: oos:runtime; time runtime.test -test.cpu=4 -test.run TestStackGrowth$ runtime
PASS
real 0m31.928s
user 0m4.615s
sys 0m0.393s
: oos:runtime;

And this is on the "good" machine:

real 0m6.064s
user 0m3.954s
sys 0m0.592s

The faulty machine doesn't have higher user time, and the ratio of the two user times is very close to the ratio of the performance of the two CPUs. So I ran this:

dtrace -n 'syscall:::entry /pid==$target/ { @num[probefunc] = count(); }' -c "./runtime.test -test.v -test.cpu=4 -test.run TestStackGrowth$ runtime"

And compared the output on the bad Solaris machine to the good Solaris machine. This is the bad machine:

dtrace: description 'syscall:::entry ' matched 231 probes
=== RUN TestStackGrowth-4
--- PASS: TestStackGrowth-4 (29.70 seconds)
PASS
dtrace: pid 13974 has exited

getpid 1
getrlimit 1
rexit 1
brk 2
priocntlsys 3
write 3
sysconfig 4
lwp_continue 8
lwp_create 8
uucopy 8
schedctl 9
sigaltstack 9
setcontext 10
lwp_sigmask 25
mmap 26
sigaction 68
nanosleep 5601
yield 36150
lwp_park 48428

And this is the good machine:

dtrace: description 'syscall:::entry ' matched 233 probes
=== RUN TestStackGrowth-4
--- PASS: TestStackGrowth-4 (5.30 seconds)
PASS
dtrace: pid 95944 has exited

close 1
getpid 1
getrlimit 1
open 1
read 1
rexit 1
brk 2
write 3
sysconfig 4
priocntlsys 5
lwp_continue 9
lwp_create 9
uucopy 9
schedctl 10
sigaltstack 10
setcontext 11
lwp_sigmask 28
mmap 28
sigaction 68
nanosleep 13993
lwp_park 73866
yield 151500

The good machine yields a lot more than the bad machine. This could be because either the program state is different (for whatever reason), or the program state is the same but libc yields differently. Let's test this out. We'll look at lwp_park and yield, and aggregate the stacks.
We'll use this:

dtrace -n 'syscall::lwp_park:entry, syscall::yield:entry /pid==$target/ { @num[ustack()] = count(); }' -c "./runtime.test -test.v -test.cpu=4 -test.run TestStackGrowth$ runtime"

This is the bad machine:

libc.so.1`yield+0xa
libc.so.1`sema_post+0x1fc
libc.so.1`sem_post+0x38
runtime.test`runtime.asmsysvicall6+0x50
3141

libc.so.1`__lwp_park+0x17
libc.so.1`sema_reltimedwait+0x28
libc.so.1`sem_reltimedwait_np+0x4b
runtime.test`runtime.asmsysvicall6+0x50
4028

libc.so.1`__lwp_park+0x17
libc.so.1`sema_wait+0x13
libc.so.1`sem_wait+0x38
runtime.test`runtime.asmsysvicall6+0x50
20658

libc.so.1`__lwp_unpark+0x14
libc.so.1`sem_post+0x38
runtime.test`runtime.asmsysvicall6+0x50
24678

libc.so.1`yield+0xa
runtime.test`runtime.asmsysvicall6+0x50
33669

And this is the good machine:

libc.so.1`yield+0xa
libc.so.1`sema_post+0x1fc
libc.so.1`sem_post+0x38
runtime.test`runtime.asmsysvicall6+0x50
175

libc.so.1`__lwp_park+0x17
libc.so.1`sema_reltimedwait+0x28
libc.so.1`sem_reltimedwait_np+0x4b
runtime.test`runtime.asmsysvicall6+0x50
4090

libc.so.1`__lwp_park+0x17
libc.so.1`sema_wait+0x13
libc.so.1`sem_wait+0x38
runtime.test`runtime.asmsysvicall6+0x50
32882

libc.so.1`__lwp_unpark+0x14
libc.so.1`sem_post+0x38
runtime.test`runtime.asmsysvicall6+0x50
36964

libc.so.1`yield+0xa
runtime.test`runtime.asmsysvicall6+0x50
162616

So, we call sem_post (which eventually yields) twenty times less often, we call sem_reltimedwait_np the same number of times, sem_post and sem_wait are called more on the good machine, but not by much, and yield is called five times more often on the good machine.

Now we know something in the program calls yield more often. Yield is called directly only by runtime·osyield. Sem_post is called only by runtime·semawakeup. This syscall data is interesting, but we still don't know exactly where we waste time.
Instead of guessing, we can measure:

/opt/DTT/procsystime -e ./runtime.test -test.v -test.cpu=4 -test.run TestStackGrowth$ runtime

This is the bad machine:

/opt/DTT/procsystime -eT ./runtime.test -test.v -test.cpu=4 -test.run TestStackGrowth$ runtime
=== RUN TestStackGrowth-4
--- PASS: TestStackGrowth-4 (27.24 seconds)
PASS

Elapsed Times for command ./runtime.test -test.v -test.cpu=4 -test.run TestStackGrowth$ runtime,

SYSCALL TIME (ns)
getpid 1672
getrlimit 3374
systeminfo 3375
close 3855
sysconfig 6766
brk 14064
sigaltstack 17449
setcontext 20898
priocntlsys 24026
open 25306
stat 33366
lwp_continue 34503
resolvepath 42401
lwp_sigmask 42990
schedctl 48513
mmapobj 50641
write 73168
getcwd 78745
sigaction 91300
memcntl 155442
lwp_create 195676
uucopy 222416
mmap 251470
yield 166226654
nanosleep 51001870920
lwp_park 190318570199
TOTAL: 241488109189

And this is the good machine:

04757c51-b5c2-4d23-9a69-1e9e305bc4da:runtime$ pfexec /opt/DTT/procsystime -eT ./runtime.test -test.v -test.cpu=4 -test.run TestStackGrowth$ runtime
=== RUN TestStackGrowth-4
--- PASS: TestStackGrowth-4 (5.05 seconds)
PASS

Elapsed Times for command ./runtime.test -test.v -test.cpu=4 -test.run TestStackGrowth$ runtime,

SYSCALL TIME (ns)
getpid 1882
getrlimit 3183
read 4265
systeminfo 5092
sysconfig 10346
brk 11545
close 12225
sigaltstack 17543
lwp_continue 24542
stat 39306
setcontext 40303
resolvepath 49826
lwp_sigmask 50834
write 55122
schedctl 55252
getcwd 64392
open 67225
mmapobj 78281
sigaction 86023
priocntlsys 130477
lwp_create 159157
uucopy 181476
mmap 314897
memcntl 439203
yield 448265812
nanosleep 12854501047
lwp_park 30347023675
TOTAL: 43651692931

You might wonder why the total times are much greater than the real times. That is because this tool measures time spent in system calls across all threads, and there are many threads in a Go process. The lwp_park numbers are very skewed because many threads in a Go process are parked almost all the time. I suspect all threads are created early, they live for the whole process lifetime, and only 4 threads are doing anything; the rest are parked. I won't rely on this hypothesis without testing it, so let's test it:

dtrace -qn 'syscall::lwp_create:entry, syscall::lwp_exit:entry /pid==$target/ { printf("%d %d %s\n", timestamp, tid, probefunc) }' -c "./runtime.test -test.v -test.cpu=4 -test.run TestStackGrowth$ runtime"

This is the bad machine:

=== RUN TestStackGrowth-4
22270995985559 1 lwp_create
22271148086316 1 lwp_create
22271148201956 3 lwp_create
22271866554369 4 lwp_create
22271877971811 1 lwp_create
22271878464099 5 lwp_create
22271890776999 7 lwp_create
22271984380961 4 lwp_create

And this is the good machine:

11721043519613302 1 lwp_create
11721043603792260 1 lwp_create
11721043603941641 3 lwp_create
11721044386989497 4 lwp_create
11721044388789803 1 lwp_create
11721044392657155 6 lwp_create
11721044417728140 7 lwp_create
11721044916077749 4 lwp_create
11721044916149965 8 lwp_create

I wasn't very wrong, but I wasn't perfectly accurate either. Threads are created early and live long, but they are created in two steps. First, a bunch of threads are created; then, with a delay of 0.1s for the bad machine and 0.5s for the good machine, another set of threads is created. For the bad machine this set only has one thread; for the good machine two threads are created almost at the same time. What are these threads?
The obvious suspect is the printing; writing this line:

=== RUN TestStackGrowth-4

could create a thread, but that doesn't explain why two threads are created for the good machine. Let's disable that write to eliminate unknowns.

Bad machine:

23198516208356 1 lwp_create
23198678977029 1 lwp_create
23198679085298 3 lwp_create
23199387572555 4 lwp_create
23199388103588 4 lwp_create
23199388764763 5 lwp_create
23199390491253 5 lwp_create
23199450598099 8 lwp_create

Good machine:

11722471357043810 1 lwp_create
11722471491789019 1 lwp_create
11722471491917276 3 lwp_create
11722472251577077 4 lwp_create
11722472251859397 4 lwp_create
11722472252279168 5 lwp_create
11722472282499772 5 lwp_create
11722472440879247 6 lwp_create

There is still a 0.15s delay, but it's consistent across machines, and at least now they have the same number of threads! Unfortunately, we can't afford to drop -v and still use procsystime, because then we can't time the tests precisely. We can't use time(1) to call procsystime because dtrace initialization takes long enough to invalidate the tests on the good machine (I tried). The best strategy is to abandon procsystime and use dtrace directly. Let's do this. We'll just measure lwp_park block time and plot it. We use this:

dtrace -n 'syscall::lwp_park:entry /pid==$target/ { self->ts = timestamp } syscall::lwp_park:return /self->ts/ { @time[tid] = quantize(timestamp - self->ts); self->ts = 0; }' -c "./runtime.test -test.cpu=4 -test.run TestStackGrowth$ runtime"

We get these nice plots (fixed-width font required). For the bad machine:

1
value ------------- Distribution ------------- count
1024 | 0
2048 |@@@@@@@ 910
4096 |@@@@@@@@@@@@ 1630
8192 |@ 100
16384 |@@ 229
32768 |@@@@@@ 743
65536 |@@@@@ 736
131072 |@@ 268
262144 |@ 100
524288 | 6
1048576 | 1
2097152 | 0
4194304 | 0
8388608 |@@@@ 515
16777216 |@ 134
33554432 | 7
67108864 | 7
134217728 | 1
268435456 | 1
536870912 | 0
1073741824 | 1
2147483648 | 0
4294967296 | 0
8589934592 | 1
17179869184 | 0

2
value ------------- Distribution ------------- count
4096 | 0
8192 | 4
16384 | 8
32768 | 10
65536 | 23
131072 |@ 53
262144 |@@@@@ 396
524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2347
1048576 |@ 40
2097152 | 7
4194304 | 2
8388608 | 5
16777216 | 0

3
value ------------- Distribution ------------- count
8589934592 | 0
17179869184 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
34359738368 | 0

4
value ------------- Distribution ------------- count
512 | 0
1024 | 2
2048 |@@@@@@@@ 2149
4096 |@@@@@@@@@@@@@@@@@@ 4619
8192 |@@@@ 1140
16384 |@@@@ 1028
32768 |@@@ 826
65536 | 113
131072 | 25
262144 | 45
524288 | 29
1048576 | 15
2097152 | 2
4194304 | 4
8388608 | 88
16777216 |@ 242
33554432 | 41
67108864 | 17
134217728 | 14
268435456 | 7
536870912 | 3
1073741824 | 0

5
value ------------- Distribution ------------- count
512 | 0
1024 | 1
2048 |@@ 231
4096 |@@@@@@@@@@@@@@@ 1499
8192 |@ 110
16384 |@ 125
32768 |@@@ 328
65536 |@@ 197
131072 |@@@ 356
262144 | 42
524288 | 22
1048576 | 50
2097152 | 4
4194304 | 1
8388608 |@@@@@@@@ 857
16777216 |@@ 251
33554432 | 15
67108864 | 6
134217728 | 0
268435456 | 0
536870912 | 1
1073741824 | 0
2147483648 | 1
4294967296 | 1
8589934592 | 0

6
value ------------- Distribution ------------- count
1024 | 0
2048 |@@ 451
4096 |@@@@@@@@@@@@@@@@ 3030
8192 |@ 179
16384 |@@ 315
32768 |@@@ 646
65536 |@@ 327
131072 |@@@ 557
262144 |@@ 342
524288 |@ 135
1048576 | 45
2097152 | 10
4194304 | 2
8388608 |@@@@@@@ 1373
16777216 |@@ 331
33554432 | 25
67108864 | 15
134217728 | 4
268435456 | 0

7
value ------------- Distribution ------------- count
512 | 0
1024 | 1
2048 |@@@@@ 1181
4096 |@@@@@@@@@@@@@@@@ 3972
8192 |@@ 414
16384 |@@ 555
32768 |@@@@ 874
65536 |@ 305
131072 |@@ 513
262144 |@ 221
524288 |@ 128
1048576 | 6
2097152 | 0
4194304 | 2
8388608 |@@@@@ 1246
16777216 |@ 285
33554432 | 28
67108864 | 7
134217728 | 5
268435456 | 1
536870912 | 0

8
value ------------- Distribution ------------- count
512 | 0
1024 | 3
2048 |@@@@@@@ 1631
4096 |@@@@@@@@@@@ 2327
8192 |@@ 394
16384 |@@@ 617
32768 |@@ 397
65536 |@@@ 568
131072 |@@@ 590
262144 |@ 311
524288 | 85
1048576 | 8
2097152 | 0
4194304 | 0
8388608 |@@@@@@@ 1496
16777216 |@ 294
33554432 | 13
67108864 | 12
134217728 | 3
268435456 | 3
536870912 | 0
1073741824 | 1
2147483648 | 0

9
value ------------- Distribution ------------- count
1024 | 0
2048 |@@ 3
4096 |@@@@@@@@@ 16
8192 |@ 1
16384 |@ 1
32768 |@@ 3
65536 | 0
131072 |@@@@@ 9
262144 |@ 1
524288 |@@@ 5
1048576 | 0
2097152 | 0
4194304 | 0
8388608 |@@@@@@@@@@@@@@@ 28
16777216 |@@ 4
33554432 | 0
67108864 |@ 1
134217728 | 0
268435456 | 0
536870912 | 0
1073741824 | 0
2147483648 | 0
4294967296 | 0
8589934592 | 0
17179869184 |@ 1
34359738368 | 0

For the good machine:

1
value ------------- Distribution ------------- count
512 | 0
1024 | 9
2048 |@@@@@@@@@ 1533
4096 |@@@@@@@@@@@@ 1896
8192 |@@ 284
16384 |@@ 375
32768 |@@@@ 675
65536 |@@@@@ 890
131072 |@ 203
262144 |@ 100
524288 |@@ 353
1048576 |@ 130
2097152 | 19
4194304 | 17
8388608 | 6
16777216 | 1
33554432 | 1
67108864 | 1
134217728 | 2
268435456 | 0
536870912 | 2
1073741824 | 1
2147483648 | 0

2
value ------------- Distribution ------------- count
4096 | 0
8192 | 15
16384 | 42
32768 |@ 64
65536 |@ 147
131072 |@@@@@@@ 769
262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 3325
524288 |@@ 270
1048576 | 13
2097152 | 0
4194304 | 0
8388608 | 1
16777216 | 0

3
value ------------- Distribution ------------- count
2147483648 | 0
4294967296 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
8589934592 | 0

4
value ------------- Distribution ------------- count
256 | 0
512 | 3
1024 | 27
2048 |@@@@@@@@@ 2661
4096 |@@@@@@@@ 2475
8192 |@ 353
16384 |@@@ 849
32768 |@@@@@ 1461
65536 |@@@@@@@ 2068
131072 |@ 425
262144 | 136
524288 |@@@ 894
1048576 |@ 302
2097152 | 29
4194304 | 32
8388608 | 60
16777216 | 12
33554432 | 2
67108864 | 1
134217728 | 0
268435456 | 1
536870912 | 0

5
value ------------- Distribution ------------- count
512 | 0
1024 | 9
2048 |@@@@@@@ 2047
4096 |@@@@@@@@@@@@@@ 4371
8192 |@ 438
16384 |@@@ 1008
32768 |@@@@@ 1387
65536 |@@@@@ 1562
131072 |@ 321
262144 | 92
524288 |@@ 747
1048576 | 110
2097152 | 31
4194304 | 20
8388608 | 8
16777216 | 6
33554432 | 0
67108864 | 2
134217728 | 5
268435456 | 0

6
value ------------- Distribution ------------- count
512 | 0
1024 | 100
2048 |@@@@@@@@@@@@@@@@@@@@@ 9123
4096 |@@@@@@@ 3155
8192 |@@@ 1363
16384 |@@@ 1478
32768 |@ 626
65536 |@ 352
131072 | 92
262144 | 83
524288 | 121
1048576 |@ 394
2097152 |@ 621
4194304 | 36
8388608 | 6
16777216 | 0

7
value ------------- Distribution ------------- count
512 | 0
1024 | 10
2048 |@@@@@@@ 316
4096 |@@@@ 190
8192 |@ 62
16384 |@@ 111
32768 |@@@@@@ 254
65536 |@@@@@@@ 317
131072 |@@@@ 160
262144 | 22
524288 |@@@@@@ 275
1048576 |@ 50
2097152 | 2
4194304 | 3
8388608 | 3
16777216 | 4
33554432 | 3
67108864 | 0
134217728 | 0
268435456 | 0
536870912 | 0
1073741824 | 0
2147483648 | 1
4294967296 | 0

8
value ------------- Distribution ------------- count
512 | 0
1024 | 14
2048 |@@@@@ 1271
4096 |@@@@@@@@@@@@@ 3326
8192 |@ 253
16384 |@@@ 853
32768 |@@@@@@ 1414
65536 |@@@@@@@ 1755
131072 |@ 374
262144 | 127
524288 |@@@ 657
1048576 |@ 164
2097152 | 25
4194304 | 19
8388608 | 11
16777216 | 4
33554432 | 1
67108864 | 5
134217728 | 1
268435456 | 1
536870912 | 0

9
value ------------- Distribution ------------- count
512 | 0
1024 | 8
2048 |@@@@@@@@@ 2014
4096 |@@@@@@@@@ 2115
8192 |@ 300
16384 |@@@ 615
32768 |@@@@@ 1201
65536 |@@@@@@@ 1677
131072 |@@ 352
262144 |@ 118
524288 |@@@ 615
1048576 |@ 241
2097152 | 29
4194304 | 17
8388608 | 7
16777216 | 2
33554432 | 3
67108864 | 1
134217728 | 0
268435456 | 0
536870912 | 0
1073741824 | 1
2147483648 | 0

Compare the plots for threads 1, 4, 5, 6, 7, 8. Let's take thread 7 for example.

Bad:

131072 |@@ 513
262144 |@ 221
524288 |@ 128
1048576 | 6
2097152 | 0
4194304 | 2
8388608 |@@@@@ 1246
16777216 |@ 285
33554432 | 28

Good:

131072 |@@@@ 160
262144 | 22
524288 |@@@@@@ 275
1048576 |@ 50
2097152 | 2
4194304 | 3
8388608 | 3
16777216 | 4
33554432 | 3

In general, the statistics for time spent parked match at the low values, but the bad machine spends a second "cluster" of time parked, centered around 10ms wait time. I don't know what that 10ms is yet.

-- Aram Hăvărneanu
Please hg sync to pick up CL 86620043 and then run with GODEBUG=gctrace=1.
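For example, combining this with the invocations used earlier in the thread (the exact flags here are an assumption):

GODEBUG=gctrace=1 ./runtime.test -test.v -test.cpu=4 -test.run TestStackGrowth$ runtime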
> Please hg sync to pick up CL 86620043 and then run with
> GODEBUG=gctrace=1.

Okay, bad machine:

cpu=1 (fast): http://paste.ubuntu.com/7232041/
cpu=2 (slow): http://paste.ubuntu.com/7232048/
cpu=4 (slow): http://paste.ubuntu.com/7232053/

Good machine:

cpu=1 (fast): http://paste.ubuntu.com/7232065/
cpu=2 (fast): http://paste.ubuntu.com/7232067/
cpu=4 (fast): http://paste.ubuntu.com/7232068/

I also ran prstat -mLc on the process. This prints the microstate accounting: every second it prints how much time each thread spent in each scheduler state (there are 10 scheduler states).

This is the log for the bad machine: http://paste.ubuntu.com/7231991/
This is the log for the good machine: http://paste.ubuntu.com/7231993/

USR and SYS are self-explanatory, LAT means runnable, LCK + SLP means time waiting for something (sleep, locks, idle), VCX are voluntary context switches, and ICX are involuntary context switches. You can see USR is generally lower for the bad machine, with LCK + SLP almost always close to 100%. ICX are the same, but VCX is an order of magnitude less for the bad machine.

-- Aram Hăvărneanu
grep hires /etc/system on both machines.

The delay is being caused by two or more threads not finishing at roughly the same time during either parfordo or scanblock during phase 3 (my bet is on scanblock). When this happens, parfordo eventually does usleep(1) repeatedly in the waiting thread; scanblock (via getfull) does usleep(100) repeatedly instead. But on Solaris, both of those round up to 1 clock tick, so they end up being 10ms on a default system, and 1ms on a system with hires_tick=1 and hires_hz=1000 (a common alternate configuration).

It is possible that the problem is similar on the arm machines.

I think this test is just doing too much. It is written for a fast machine, and there are slow machines out there (having a minimum 10ms kernel sleep counts as being slow, sorry). I sent CL 86550045 to scale back the test on the builders.

Russ
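To make the arithmetic concrete, here is a small model of that rounding (an assumption about kernel behavior as described above, not kernel source):

package main

import (
	"fmt"
	"time"
)

func main() {
	tick := time.Second / 100     // default 100 Hz clock: one tick = 10ms
	req := 100 * time.Microsecond // scanblock's usleep(100)
	// The kernel rounds the request up to a whole number of ticks.
	ticks := (req + tick - 1) / tick
	fmt.Println(ticks * tick) // 10ms: 100x the requested sleep
}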
On Fri, Apr 11, 2014 at 5:56 AM, Russ Cox <rsc@golang.org> wrote:
> grep hires /etc/system on both machines.

It was off on both systems, but read below. I think we are on the right track. I bumped the clock Hz to 2500 on the bad machine, and the problem disappeared:

oos:runtime$ GOMAXPROCS=2 runtime.test -test.cpu=1,2,4 -test.v -test.run TestStackGrowth$ runtime
=== RUN TestStackGrowth
--- PASS: TestStackGrowth (2.24 seconds)
=== RUN TestStackGrowth-2
--- PASS: TestStackGrowth-2 (2.75 seconds)
=== RUN TestStackGrowth-4
--- PASS: TestStackGrowth-4 (2.45 seconds)
PASS

So we now know what causes this problem, but we still don't know why it works on the newer machine, which worries me. Linux uses HPETs for all timer-related business if it can find HPETs. It's conceivable the ARM builders don't have HPETs. Solaris doesn't use HPETs by default, but it *can* use them if the programmer asks for them. Do you want me to try changing the Solaris runtime to always use HPETs? (We only support amd64, and all amd64 machines have HPETs.)

-- Aram Hăvărneanu
I filed issue 7763 so we have a record of this.

Unfortunately, using HPETs on Solaris requires the proc_clock_highres privilege. I think this is on by default on regular installs, but it's off by default in a Solaris zone, so we can't mandate it.

I hate to introduce complexity and optional machinery, but I'd like to change the runtime to try to use HPETs and fall back to the slower case if it can't. The improvement is dramatic, and the additional complexity seems small. How does that sound?

-- Aram Hăvărneanu
Ok, so I replaced the sleeps with yields, and the problem is solved: http://paste.ubuntu.com/7234785/ -- Aram Hăvărneanu
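The idea, in a runnable analogue (illustrative only: runtime.Gosched yields a goroutine rather than an OS thread, and `ready` stands in for the runtime's real work-available state):

package main

import (
	"runtime"
	"sync/atomic"
)

// waitForWork spins until work is available, yielding on every
// iteration instead of sleeping; sleeping would cost a whole 10ms
// clock tick per wait on a 100 Hz kernel.
func waitForWork(ready *int32) {
	for atomic.LoadInt32(ready) == 0 {
		runtime.Gosched() // was, in spirit: usleep(100)
	}
}

func main() {
	var ready int32
	go func() { atomic.StoreInt32(&ready, 1) }()
	waitForWork(&ready)
}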
Also, I decided against the optional HPET thing; it's complicated and involves signals. I'll pass. -- Aram Hăvărneanu
On Fri, Apr 11, 2014 at 7:25 AM, Aram Hăvărneanu <aram.h@mgk.ro> wrote:
> I hate to introduce complexity and optional machinery, but I'd like to
> change the runtime to try to use HPETs and fall back to the slower case
> if it can't. The improvement is dramatic, and the additional complexity
> seems small. How does that sound?

It's fine, but after the release freeze is over.

As for your bad-vs-good machines, I'm basically certain that the bad machine is using 100 Hz timers and the good machine is using 1000 Hz timers. The only question is how it is that they are configured differently, but I'll leave that to you.

Russ
On Fri, Apr 11, 2014 at 3:24 PM, Russ Cox <rsc@golang.org> wrote:
> It's fine, but after the release freeze is over.

No, it's ridiculously complicated, I'll pass.

> As for your bad-vs-good machines, I'm basically certain that the bad
> machine is using 100 Hz timers and the good machine is using 1000 Hz
> timers.

Yes, you're right, the good machine ticks at 1 kHz. I should have measured before.

-- Aram Hăvărneanu