runtime: improved scheduler
Distribute runnable queues, memory cache
and cache of dead G's per processor.
Faster non-blocking syscall enter/exit.
More conservative worker thread blocking/unblocking.
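In outline, each P owns a local queue of runnable G's and steals work from other P's when its own queue runs dry. The following is a minimal sketch of that idea, not the CL's actual code: the real proc.c synchronizes head/tail with atomic operations and spills excess work to a global run queue, both of which are elided here.

// Illustrative sketch only: per-P ring of runnable G's with stealing.
typedef struct G G;
struct G { G *schedlink; };

enum { RunQSize = 256 };

typedef struct P {
    G        *runq[RunQSize];
    unsigned  runqhead;   // next slot to take from
    unsigned  runqtail;   // next slot to fill
} P;

// runqput: the owning M appends a runnable G to its P's local queue.
static void runqput(P *p, G *gp) {
    p->runq[p->runqtail++ % RunQSize] = gp;
}

// runqget: the owning M takes the next runnable G, or nil when empty;
// an empty queue sends the caller to steal or to the global queue.
static G *runqget(P *p) {
    if (p->runqhead == p->runqtail)
        return 0;
    return p->runq[p->runqhead++ % RunQSize];
}

// runqsteal: grab half of a victim P's queue; queue all but the last
// stolen G locally and return the last one to run directly.
static G *runqsteal(P *p, P *victim) {
    unsigned n = (victim->runqtail - victim->runqhead) / 2;
    G *gp = 0;
    while (n-- > 0) {
        gp = victim->runq[victim->runqhead++ % RunQSize];
        if (n > 0)
            runqput(p, gp);
    }
    return gp;
}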
(2013-02-21 00:33:45 UTC)
#1
Could you please hg upload this again,
% hg clpatch 7314062
abort: codereview issue 7314062 is out of date: patch and recent changes
conflict (d2f4fe93c8d6->127f96912009)
(2013-02-27 21:10:46 UTC)
#5
On 2013/02/27 20:50:58, dvyukov wrote:
> I guess I need to attach some benchmarks.
Unfortunately I am currently on a low-core laptop and can't ssh into anything
serious. I will try to collect benchmark results tomorrow.
(2013-02-27 23:22:31 UTC)
#6
On Wed, Feb 27, 2013 at 1:10 PM, <dvyukov@google.com> wrote:
> On 2013/02/27 20:50:58, dvyukov wrote:
>
>> I guess I need to attach some benchmarks.
>>
>
> Unfortunately I am currently on a low-core laptop and can't ssh into
> anything serious. I will try to collect benchmark results tomorrow.
Hopefully the improved scheduler doesn't hurt performance on a low-core
laptop. :-)
https://codereview.appspot.com/7314062/diff/62004/src/pkg/runtime/proc.c
File src/pkg/runtime/proc.c (right):
https://codereview.appspot.com/7314062/diff/62004/src/pkg/runtime/proc.c#newcode948
src/pkg/runtime/proc.c:948: // Tries to steal from other P's, get g from global queue, polls network connections.
polls network connections !?
(2013-02-28 06:26:38 UTC)
#7
(2013-02-28 09:04:54 UTC)
#8
On Thursday, February 28, 2013, Brad Fitzpatrick wrote:
>
>
> On Wed, Feb 27, 2013 at 1:10 PM, <dvyukov@google.com> wrote:
>
>> On 2013/02/27 20:50:58, dvyukov wrote:
>>
>>> I guess I need to attach some benchmarks.
>>>
>>
>> Unfortunately I am currently on a low-core laptop and can't ssh into
>> anything serious. I will try to collect benchmark results tomorrow.
>
>
> Hopefully the improved scheduler doesn't hurt performance on a low-core
> laptop. :-)
>
>
Good point
(2013-02-28 09:06:41 UTC)
#9
On Thursday, February 28, 2013, wrote:
>
> https://codereview.appspot.com/7314062/diff/62004/src/pkg/runtime/proc.c
> File src/pkg/runtime/proc.c (right):
>
> https://codereview.appspot.com/7314062/diff/62004/src/pkg/runtime/proc.c#newcode948
> src/pkg/runtime/proc.c:948: // Tries to steal from other P's, get g from
> global queue, polls network connections.
> polls network connections !?
>
>
That's why I don't like comments -- they don't fail to compile when they
become incorrect :)
Polling of network connections is Phase II, it will be submitted separately.
(2013-02-28 23:09:46 UTC)
#11
On 2013/02/27 23:22:31, bradfitz wrote:
> On Wed, Feb 27, 2013 at 1:10 PM, <dvyukov@google.com> wrote:
>
> > On 2013/02/27 20:50:58, dvyukov wrote:
> >
> >> I guess I need to attach some benchmarks.
> >>
> >
> > Unfortunately I am currently on a low-core laptop and can't ssh into
> > anything serious. I will try to collect benchmark results tomorrow.
>
>
> Hopefully the improved scheduler doesn't hurt performance on a low-core
> laptop. :-)
Here are some numbers on darwin/amd64, Intel Core 2 Duo 2.13 GHz, 2 cores:
GOMAXPROCS=1 time go test -short std
old:
#1 real 81.743s
#2 real 80.896s
new:
#1 real 81.278s
#2 real 81.319s
GOMAXPROCS=2 time go test -short std
old:
#1 real 85.61s
#2 real 104.93s
new:
#1 real 81.36s
#2 real 82.51s
GOMAXPROCS=1 time ./threadring -n=2000000
old:
real 0m0.907s
user 0m0.894s
sys 0m0.007s
new:
real 0m0.763s
user 0m0.753s
sys 0m0.006s
GOMAXPROCS=2 time ./threadring -n=2000000
old:
real 0m6.070s
user 0m2.650s
sys 0m4.851s
new:
real 0m2.090s
user 0m2.759s
sys 0m0.735s
benchmark old ns/op new ns/op delta
BenchmarkMatmult 26 25 -5.28%
BenchmarkMatmult-2 10 9 -12.78%
benchmark old ns/op new ns/op delta
BenchmarkSyscall 56 80 +44.46%
BenchmarkSyscall-2 57 43 -24.13%
BenchmarkSyscallWork 635 615 -3.15%
BenchmarkSyscallWork-2 315 302 -4.13%
BenchmarkSyscallExcess 2698 80 -97.02%
BenchmarkSyscallExcess-2 1192 42 -96.43%
BenchmarkSyscallExcessWork 2832 596 -78.95%
BenchmarkSyscallExcessWork-2 1966 302 -84.64%
benchmark old ns/op new ns/op delta
BenchmarkCreateGoroutines 355 266 -25.07%
BenchmarkCreateGoroutines-2 2314 701 -69.71%
BenchmarkCreateGoroutinesParallel 358 289 -19.27%
BenchmarkCreateGoroutinesParallel-2 439 163 -62.87%
benchmark old ns/op new ns/op delta
BenchmarkTCPOneShot 373801 197980 -47.04%
BenchmarkTCPOneShot-2 367226 334230 -8.99%
BenchmarkTCPOneShotTimeout 388088 201246 -48.14%
BenchmarkTCPOneShotTimeout-2 438246 348140 -20.56%
BenchmarkTCPPersistent 64992 79598 +22.47%
BenchmarkTCPPersistent-2 58297 41134 -29.44%
BenchmarkTCPPersistentTimeout 65471 77702 +18.68%
BenchmarkTCPPersistentTimeout-2 57122 42875 -24.94%
(2013-02-28 23:26:54 UTC)
#13
> benchmark old ns/op new ns/op delta
> BenchmarkSyscall 56 80 +44.46%
This looks great, except for this number. Any idea why this number
regressed for single GOMAXPROCS?
(2013-02-28 23:28:27 UTC)
#14
On 2013/02/28 23:23:49, remyoudompheng wrote:
> Is it expected that proc.p is updated one day to match the new scheduler?
I have no such plans. Perhaps I need to delete proc.p with this patch.
(2013-02-28 23:29:42 UTC)
#15
On 2013/02/28 23:26:54, dfc wrote:
> > benchmark old ns/op new ns/op delta
> > BenchmarkSyscall 56 80 +44.46%
>
> This looks great, except for this number. Any idea why this number
> regressed for single GOMAXPROCS?
I don't know how to profile on darwin. I am currently benchmarking on linux;
if it has the same regression I will profile.
(2013-02-28 23:36:11 UTC)
#16
On 2013/02/28 23:26:54, dfc wrote:
> > benchmark old ns/op new ns/op delta
> > BenchmarkSyscall 56 80 +44.46%
>
> This looks great, except for this number. Any idea why this number
> regressed for single GOMAXPROCS?
There is also the regression in BenchmarkTCPPersistent. It's intended to be
fixed by the integrated network poller.
(2013-03-01 00:45:46 UTC)
#20
On 2013/03/01 00:07:20, dvyukov wrote:
> linux/amd64, 2 x Intel Xeon E5-2690, 2.90GHz, 16 HT cores, 32 HW threads.
> benchmark old ns/op new ns/op delta
> BenchmarkSyscall 21 30 +39.17%
It seems to be just some issues with code generation in 6c.
Just caching g and m as:
G *gp;
M *mp;
mp = m;
gp = g;
and then doing everything with gp and mp:
gp->gcsp = gp->sched.sp;
gp->gcstack = gp->stackbase;
gp->gcguard = gp->stackguard;
gp->status = Gsyscall;
reduces exec time from 30.2 to 26.2 ns/op.
Then replacing:
-runtime·atomicstore(&mp->p->status, Psyscall);
+mp->p->status = Psyscall;
reduces exec time to 17.7ns.
We can do some of that nanosecond-squeezing after submitting the patch.
Dave, you asked whether it makes sense to add
ATOMIC_LOAD/ATOMIC_STORE/ATOMIC_CAS intrinsics to gc similar to PREFETCH. I
think it can have some impact with the new scheduler.
(2013-03-01 00:50:41 UTC)
#21
> Then replacing:
> -runtime·atomicstore(&mp->p->status, Psyscall);
> +mp->p->status = Psyscall;
>
> reduces exec time to 17.7ns.
Nice catch. I'd vote for making that change to the source (with a
comment of course), and raising a bug against cmd/6c (any others) for
the code generation suckage.
> We can do some of that nanosecond-squeezing after submitting the patch.
> Dave, you asked whether it makes sense to add
> ATOMIC_LOAD/ATOMIC_STORE/ATOMIC_CAS intrinsics to gc similar to
> PREFETCH. I think it can have some impact with the new scheduler.
I agree, I would like to see those intrinsics added post Go 1.1. If
you wanted to open an issue as a reminder, I'd +1 that.
(2013-03-01 01:18:00 UTC)
#22
On Fri, Mar 1, 2013 at 2:50 AM, Dave Cheney <dave@cheney.net> wrote:
> > Then replacing:
> > -runtime·atomicstore(&mp->p->status, Psyscall);
> > +mp->p->status = Psyscall;
> >
> > reduces exec time to 17.7ns.
>
> Nice catch. I'd vote for making that change to the source (with a
> comment of course), and raising a bug against cmd/6c (any others) for
> the code generation suckage.
>
> > We can do some of that nanosecond-squeezing after submitting the patch.
> > Dave, you asked whether it makes sense to add
> > ATOMIC_LOAD/ATOMIC_STORE/ATOMIC_CAS intrinsics to gc similar to
> > PREFETCH. I think it can have some impact with the new scheduler.
>
> I agree, I would like to see those intrinsics added post Go 1.1. If
> you wanted to open an issue as a reminder, I'd +1 that.
>
Sorry, the effect comes mostly from the XCHG instruction.
With atomicstore - 27.6ns
With atomicstore with MOVL instead of XCHGL - 17.2ns
With plain assignment - 15.6ns
But I still think it would be useful; it can cut a few ns here and there
(e.g. runtime.lock/unlock).
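For comparison, the same XCHG-vs-MOV gap is visible with standard C11 atomics on x86-64, where a sequentially consistent store compiles to XCHG (implicitly locked) while a release store is a plain MOV. A standalone illustration, hypothetical and not from the CL:

#include <stdatomic.h>

static _Atomic int status;

void store_seqcst(int v) {
    // compiles to XCHG (or MOV+MFENCE) on x86-64: a full barrier
    atomic_store_explicit(&status, v, memory_order_seq_cst);
}

void store_release(int v) {
    // compiles to a plain MOV on x86-64: release ordering is free there
    atomic_store_explicit(&status, v, memory_order_release);
}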
(2013-03-01 05:14:49 UTC)
#24
On 2013/02/28 23:23:49, remyoudompheng wrote:
> Is it expected that proc.p is updated one day to match the new scheduler?
It would be nice, but I won't make it a requirement. The tricky part that proc.p
was meant to verify is the unlocked fast path in entersyscall/exitsyscall. That
mostly goes away in this CL, left instead to the sysmon thread. The surrounding
code is a bit more complex, but at least the system call path is easier to
understand.
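The shape of that fast path is roughly as follows. This is a simplified standalone C11 sketch, not the CL's code: my_p stands in for the runtime's m->p, and retake_p for the check that runs on the sysmon thread.

#include <stdatomic.h>

enum { Pidle, Prunning, Psyscall };

typedef struct P { _Atomic int status; } P;

static _Thread_local P *my_p;   // stand-in for the runtime's m->p

// entersyscall fast path: no locks, just publish the state.
void entersyscall_fast(void) {
    atomic_store_explicit(&my_p->status, Psyscall, memory_order_release);
}

// exitsyscall fast path: try to reclaim our P; on failure the caller
// takes a slow path (acquire another P, or park the thread).
int exitsyscall_fast(void) {
    int want = Psyscall;
    return my_p != 0 &&
        atomic_compare_exchange_strong(&my_p->status, &want, Prunning);
}

// sysmon-style retake: reclaim a P stuck in a syscall so its run
// queue can be handed to another worker thread.
void retake_p(P *p) {
    int want = Psyscall;
    if (atomic_compare_exchange_strong(&p->status, &want, Pidle)) {
        // hand off p to another worker thread here
    }
}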
(2013-03-01 09:14:53 UTC)
#25
On 2013/03/01 05:11:43, rsc wrote:
> LGTM
>
> Thanks very much for this work.
>
> I patched this into my client and tested that all.bash passes. I also printed
> proc.c (without diffs) and read it all, even the unchanged parts. It looks
> good.
>
>
> As long as it isn't going to break the build, I think we should submit this
> soon, so that it can soak. I realize you probably have some tweaks planned
> still, but it will be easier to review them as deltas than to keep updating
> this CL.
Yeah, it's a pain to work with this CL.
> My only substantial concern is some of the locking. At least nmspinning,
> runqsize, and sysmonwait are accessed both directly and via atomic operations,
> sometimes in a double-checked locking pattern and sometimes just using an
> ordinary read where one might expect an atomic read. Is this safe even on
> multiprocessor ARM? I assume so - you're the locking expert - but I just
> wanted to double-check, as it were.
>
> If there are places where you think an atomic is warranted but you have
> changed it to be non-atomic purely for performance, please mark those with
> comments that say something like:
> // TODO: fast atomic
> and those will serve as markers to revisit if we put atomic intrinsics into
> 6c.
Yes, you are right. Some of the accesses must be atomic; I did not use
atomicload/atomicstore because of performance concerns. I will add the TODOs.
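The double-checked pattern under discussion looks roughly like this. A hypothetical sketch: nmspinning is a field from the CL, but wakep's body here is invented for illustration.

#include <stdatomic.h>

static _Atomic int nmspinning;

// Wake a worker only if nobody is already spinning: a racy relaxed
// read guards the common path, and a CAS settles the race.
void wakep(void) {
    // TODO: fast atomic -- the cheap read here is deliberate.
    if (atomic_load_explicit(&nmspinning, memory_order_relaxed) != 0)
        return;                  // someone is already spinning
    int expected = 0;
    if (!atomic_compare_exchange_strong(&nmspinning, &expected, 1))
        return;                  // lost the race
    // start or unpark a spinning worker thread here
}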
(2013-03-01 11:34:28 UTC)
#27
https://codereview.appspot.com/7314062/diff/76001/src/pkg/runtime/proc.c
File src/pkg/runtime/proc.c (right):
https://codereview.appspot.com/7314062/diff/76001/src/pkg/runtime/proc.c#newc...
src/pkg/runtime/proc.c:28: M* mhead; // m's waiting for work
On 2013/03/01 05:11:43, rsc wrote:
> You left this name unchanged, but given pidle and npidle below, I suggest
> renaming mhead, mwait to midle, mpidle. Your decision.
renamed to midle and nmidle.
https://codereview.appspot.com/7314062/diff/76001/src/pkg/runtime/proc.c#newc...
src/pkg/runtime/proc.c:54: enum { maxgomaxprocs = 1<<8 };
On 2013/03/01 05:11:43, rsc wrote:
> Since this is C and not Go, constants should begin with an upper case letter.
> MaxGomaxprocs perhaps.
Done.
https://codereview.appspot.com/7314062/diff/76001/src/pkg/runtime/proc.c#newc...
src/pkg/runtime/proc.c:399: if(p->runqhead == p->runqtail) {
On 2013/03/01 05:11:43, rsc wrote:
> Can you please add a comment explaining why finding a p without a run queue
> means the remainder of the idle p's have no run queue either?
> Something like:
>
> // Stopped p's are at beginning of list.
> // Once we reach a p without a run queue, the rest don't have one either.
Done.
procresize() puts p's with work at the beginning of the list.
https://codereview.appspot.com/7314062/diff/76001/src/pkg/runtime/proc.c#newc...
src/pkg/runtime/proc.c:487: // Allocate a new m unassociated with any thread.
On 2013/03/01 05:11:43, rsc wrote:
> add
> // Can use p for allocation context if needed.
Done.
https://codereview.appspot.com/7314062/diff/76001/src/pkg/runtime/proc.c#newc...
src/pkg/runtime/proc.c:1162: g->gcsp = g->sched.sp;
On 2013/03/01 05:11:43, rsc wrote:
> Carl has a pending change that adds g->gcpc and sets it here and in
> entersyscallblock, but we're still trying to track down a linux/386 problem in
> that CL, so it probably won't land soon.
>
> Please pick up those changes to entersyscall and entersyscallblock, from
> https://codereview.appspot.com/7301062/diff/24002/src/pkg/runtime/proc.c
> so that Carl doesn't need to reapply them at the inevitable merge conflict.
>
> Thanks.
Done.
https://codereview.appspot.com/7314062/diff/76001/src/pkg/runtime/proc.c#newc...
src/pkg/runtime/proc.c:1209: g->gcsp = g->sched.sp;
On 2013/03/01 05:11:43, rsc wrote:
> This part needs a gcpc too.
Done.
https://codereview.appspot.com/7314062/diff/76001/src/pkg/runtime/proc.c#newc...
src/pkg/runtime/proc.c:1241: s = m->p ? m->p->status : Pidle;
On 2013/03/01 05:11:43, rsc wrote:
> In general please try to avoid ? :. Writing the if/else is longer but in the
> long term usually more readable.
>
> In this specific case, I don't think you need the condition at all: I don't
> see where s is used other than in the condition on the next line, which can
> become:
>
> if(m->p && runtime.cas(&m->p->status, Psyscall, Prunning)) {
>
> (with an && m->p->status == Psyscall in the middle if you think it's
> important)
Done.
https://codereview.appspot.com/7314062/diff/76001/src/pkg/runtime/proc.c#newc...
src/pkg/runtime/proc.c:1250: g->gcstack = (uintptr)nil;
On 2013/03/01 05:11:43, rsc wrote:
> Please clear gcsp here too.
Done.
https://codereview.appspot.com/7314062/diff/76001/src/pkg/runtime/proc.c#newc...
src/pkg/runtime/proc.c:1278: g->gcstack = (uintptr)nil;
On 2013/03/01 05:11:43, rsc wrote:
> Here too.
Done.
https://codereview.appspot.com/7314062/diff/76001/src/pkg/runtime/proc.c#newc...
src/pkg/runtime/proc.c:1624: for(gp = runtime·allg; gp; gp = gp->alllink) {
On 2013/03/01 05:11:43, rsc wrote:
> I did not realize that runtime.NumGoroutine was O(n). Perhaps it is worth a
> TODO to make it O(1). This is kind of embarrassing (my fault not yours).
Yes, it's O(N). I do not want to increment/decrement an atomic counter in
newproc/goexit; they can proceed almost P-locally otherwise.
If it's important we can have per-P counters; then aggregation will be much
faster.
Adding the comment.
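Such a per-P counter scheme might look like the following hypothetical sketch (allp, gomaxprocs, and num_goroutine are illustrative stand-ins, not the CL's code):

#include <stdint.h>

enum { MaxGomaxprocs = 1 << 8 };

typedef struct P {
    int32_t gcount;   // goroutines created minus exited on this P;
                      // touched only by the owning M, so no atomics
} P;

static P  *allp[MaxGomaxprocs];
static int gomaxprocs;

// Aggregate on demand: O(#P) instead of O(#G), slightly racy but
// NumGoroutine is only a snapshot anyway.
int32_t num_goroutine(void) {
    int32_t n = 0;
    for (int i = 0; i < gomaxprocs; i++)
        n += allp[i]->gcount;
    return n;
}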
https://codereview.appspot.com/7314062/diff/76001/src/pkg/runtime/proc.c#newc...
src/pkg/runtime/proc.c:1747: for(i = 1; runtime·sched.runqhead; i++) {
On 2013/03/01 05:11:43, rsc wrote:
> I am a little confused about why i starts at 1 here. Is there a reason to skip
> 0? If so, it's worth a comment; if not, please i = 0.
Done.
Yes, there is a reason.
https://codereview.appspot.com/7314062/diff/76001/src/pkg/runtime/proc.c#newc...
src/pkg/runtime/proc.c:1820: inclocked(int32 v) {
On 2013/03/01 05:11:43, rsc wrote:
> { on next line.
Done.
(2013-03-01 13:30:56 UTC)
#30
This broke FreeBSD and Windows, which assumed that newosproc would only
ever be called with fn == mstart (probably on my suggestion).
CL 7443046, just submitted, should fix those.
This broke Linux/ARM: the builds time out. I don't know why yet. It does
not appear to be the same reason.
Russ
Issue 7314062: code review 7314062: runtime: improved scheduler (Closed)
Created by dvyukov
Comments: 27