Description:
runtime: fix use after close race in Solaris network poller
The Solaris network poller uses event ports, which are
level-triggered. As such, it has to re-arm itself after each
wakeup. The arming mechanism (which runs in its own thread) raced
with the closing of a file descriptor happening in a different
thread. When a network file descriptor is about to be closed,
the network poller is awakened to give it a chance to remove its
association with the file descriptor. Because the poller always
re-armed itself, it raced with code that closed the descriptor.
This change makes the network poller check before re-arming if
the file descriptor is about to be closed, in which case it will
ignore the re-arming request. It uses the per-PollDesc lock in
order to serialize access to the PollDesc.
This change also adds extensive documentation describing the
Solaris implementation of the network poller.
Fixes issue 7410.
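For orientation, the shape the fix eventually takes can be sketched as follows. This is a simplified sketch, not the verbatim patch: runtime·netpolllock shows up in the later patch sets, while runtime·netpollunlock and portfd are assumed names for its counterpart and the runtime's global event port.

    // Sketch: re-arm the event-port association, but skip it when the
    // descriptor is being closed. Callers hold the per-PollDesc lock so
    // this cannot race with runtime_pollUnblock setting pd->closing.
    void
    runtime·netpollupdate(PollDesc *pd, uint32 set, uint32 clear)
    {
        uint32 *ep, old, events;
        uintptr fd;

        if(runtime·netpollclosing(pd))
            return;  // fd is about to be closed; ignore the re-arm request

        fd = runtime·netpollfd(pd);
        ep = (uint32*)runtime·netpolluser(pd);  // per-fd event word
        old = *ep;
        events = (old & ~clear) | set;
        if(events == old)
            return;
        if(events && runtime·port_associate(portfd, PORT_SOURCE_FD, fd, events, (void*)pd) != 0)
            runtime·throw("netpollupdate: port_associate failed");
        *ep = events;
    }

    // Typical caller:
    //     runtime·netpolllock(pd);
    //     runtime·netpollupdate(pd, POLLIN, 0);
    //     runtime·netpollunlock(pd);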
Patch Set 1
Patch Set 2 : diff -r 3512bf47df7e https://code.google.com/p/go
Patch Set 3 : diff -r 3512bf47df7e https://code.google.com/p/go
Total comments: 2
Patch Set 4 : diff -r 3cd546ac3a99 https://code.google.com/p/go
Patch Set 5 : diff -r 3cd546ac3a99 https://code.google.com/p/go
Total comments: 4
Patch Set 6 : diff -r ceb13960d05b https://code.google.com/p/go
Patch Set 7 : diff -r ceb13960d05b https://code.google.com/p/go
Total comments: 8
Patch Set 8 : diff -r 08dcdcdb757b https://code.google.com/p/go
Patch Set 9 : diff -r 08dcdcdb757b https://code.google.com/p/go
Messages
Total messages: 51
Hello golang-codereviews@googlegroups.com, I'd like you to review this change to https://code.google.com/p/go
This should fix both the builders.
[+iant, dmitry] On Wed, Feb 26, 2014 at 10:16 AM, <aram@mgk.ro> wrote: > Reviewers: golang-codereviews, > > Message: > Hello golang-codereviews@googlegroups.com, > > I'd like you to review this change to > https://code.google.com/p/go > > > Description: > runtime: fix use after close race in Solaris network poller > > The Solaris network poller uses event ports, which are > level-triggered. As such, it has to re-arm itself after each > wakeup. The arming mechanism (which runs in its own thread) raced > with the closing of a file descriptor happening in a different > thread. When a network file descriptor is about to be closed, > the network poller is awaken to give it a chance to remove its > association with the file descriptor. Because the poller always > re-armed itself, it raced with code that closed the descriptor. > > This change makes the network poller check before re-arming if > the file descriptor is about to be closed, in which case it will > ignore the re-arming request. > > Fixes issue 7410. > > Please review this at https://codereview.appspot.com/69190044/ > > Affected files (+9, -0 lines): > M src/pkg/runtime/netpoll.goc > M src/pkg/runtime/netpoll_solaris.c > M src/pkg/runtime/runtime.h > > > Index: src/pkg/runtime/netpoll.goc > =================================================================== > --- a/src/pkg/runtime/netpoll.goc > +++ b/src/pkg/runtime/netpoll.goc > @@ -263,6 +263,12 @@ > return &pd->user; > } > > +bool > +runtime·netpollclosing(PollDesc *pd) > +{ > + return pd->closing; > +} > + > // make pd ready, newly runnable goroutines (if any) are enqueued info > gpp list > void > runtime·netpollready(G **gpp, PollDesc *pd, int32 mode) > Index: src/pkg/runtime/netpoll_solaris.c > =================================================================== > --- a/src/pkg/runtime/netpoll_solaris.c > +++ b/src/pkg/runtime/netpoll_solaris.c > @@ -90,6 +90,8 @@ > uintptr fd = runtime·netpollfd(pd); > ep = (uint32*)runtime·netpolluser(pd); > > + if(runtime·netpollclosing(pd)) > + return; > do { > old = *ep; > events = (old & ~clear) | set; > Index: src/pkg/runtime/runtime.h > =================================================================== > --- a/src/pkg/runtime/runtime.h > +++ b/src/pkg/runtime/runtime.h > @@ -942,6 +942,7 @@ > uintptr runtime·netpollfd(PollDesc*); > void runtime·netpollarm(PollDesc*, int32); > void** runtime·netpolluser(PollDesc*); > +bool runtime·netpollclosing(PollDesc*); > void runtime·crash(void); > void runtime·parsedebugvars(void); > void _rt0_go(void); > > > -- > You received this message because you are subscribed to the Google Groups > "golang-codereviews" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to golang-codereviews+unsubscribe@googlegroups.com. > For more options, visit https://groups.google.com/groups/opt_out. >
I don't understand how this avoids the race. I can see that it narrows the time span when the race can occur, but I don't see how the race is avoided. Isn't it possible for closing = true to occur during the loop in runtime·netpollupdate? Would it work to check whether the port is closed in the case where runtime·port_associate fails, and ignore the failure in that case?
On Wed, Feb 26, 2014 at 8:11 PM, <iant@golang.org> wrote:
> I don't understand how this avoids the race. I can see that it narrows
> the time span when the race can occur, but I don't see how the race is
> avoided. Isn't it possible for closing = true to occur during the loop
> in runtime·netpollupdate?

There are three events of importance here: closing=true, unblocking the poller, and poller wakeup (for whatever reason). Unblocking always comes after closing=true, so what really matters here is how the poller wakeup relates to the other two.

So we can have these three situations:

1: wakeup, closing=true, unblock

This is not very interesting at all; it works fine.

2: closing=true, wakeup, unblock

Here the poller awakens for whatever reason after closing is set to true, but before being forcibly unblocked. Netpollupdate sees closing=true and doesn't do anything. That's ok, since we are going to disassociate the fd anyway, and in the time between now and when the fd is actually closed, nothing can use it.

After all this, the poller will be forcibly awakened, and that's ok. We are already disassociated.

3: closing=true, unblock, wakeup

Netpollupdate sees closing=true and will leave the fd disassociated. The poller doesn't unblock again for this fd.

Do you find anything wrong in the explanation above? I could be nuts indeed.

> Would it work to check whether the port is closed in the case where
> runtime·port_associate fails, and ignore the failure in that case?

No, this exposes a different race.

fd=8 is about to be closed, the poller unblocks, the fd is closed, some other thread opens a new socket, gets the same fd=8, netpollupdate runs, it calls port_associate *successfully* (because the fd is valid), but with the wrong PollDesc. Havoc happens when code actually uses that stale, possibly freed, PollDesc.

-- Aram Hăvărneanu
On Wed, Feb 26, 2014 at 11:26 AM, Aram Hăvărneanu <aram@mgk.ro> wrote:
> On Wed, Feb 26, 2014 at 8:11 PM, <iant@golang.org> wrote:
>> I don't understand how this avoids the race. I can see that it narrows
>> the time span when the race can occur, but I don't see how the race is
>> avoided. Isn't it possible for closing = true to occur during the loop
>> in runtime·netpollupdate?
>
> There are three events of importance here, closing=true, unblocking
> poller, and poller wakeup (for whatever reason). Unblocking is always
> after close=true, so what really is important here is how poller wakeup
> relates to the other two.
>
> So we can have this three situations:
>
> 1: wakeup, closing=true, unblock
>
> This is not very interesting at all, works fine.
>
> 2: closing=true, wakeup, unblock
>
> Here the poller awakens for whatever reason after closing is set to true,
> but before being forcibly unblocked. Netpollupdate sees closing=true and
> doesn't do anything. That's ok, since we are going to disassociate the fd
> anyway, and in the time between now and when the fd is actually closed,
> nothing can use it.
>
> After all this, the poller will be forcibly awakened, and that's ok. We
> are already disassociated.
>
> 3: closing=true, unblock, wakeup
>
> Netpollupdate sees closing=true and will leave the fd disassociated. The
> poller doesn't unblock again for this fd.
>
> Do you find anything wrong in the explanation above? I could be nuts
> indeed.

The case I wonder about is:

First runtime·netpoll is called, it calls runtime·netpollupdate, that picks up fd, and sees that closing is false. At this point the kernel preempts the thread.

Another thread calls fd.Close() which calls fd.pd.Evict which calls runtime_pollUnblock which locks the pd, sets pd->closing = true, and eventually proceeds to close the file descriptor.

Now the kernel schedules the first thread again. That thread tries to associate fd, and fails because it is closed.

I can't see any locks that prevent that from happening.

There is no similar problem in the epoll code because that code is careful to never look at the PollDesc in runtime·netpoll, except for the rg and wg fields that it accesses using atomic operations.

Can you explain why this scenario can not occur?

>> Would it work to check whether the port is closed in the case where
>> runtime·port_associate fails, and ignore the failure in that case?
>
> No, this exposes a different race.
>
> fd=8 is about to be closed, the poller unblocks, the fd is closed, some
> other thread opens a new socket, gets the same fd=8, netpollupdate runs,
> it calls port_associate *successfully* (because the fd is valid), but
> with a wrong PortDesc. Havoc happens when code actually uses that stale,
> possibly freed, PortDesc.

Good point, I also don't understand why that can't happen.

Ian
Why do you need to rearm it after polling? You arm it before each netpollwait, and netpoll does not care about notification if nobody is waiting. So it seems that arm before wait is enough. Or is it because of concurrent read/write waiters? Then I do not understand "To effect edge-triggered events" comment. Netpoll does not require notifications to be edge-triggered.
> Why do you need to rearm it after polling? > You arm it before each netpollwait, and netpoll does not care about > notification if nobody is waiting. So it seems that arm before wait is > enough. It doesn't work otherwise, e.g. (with your proposal): : oos:src; go test -v -short -timeout 20s -run 'TestShutdown|TestConnAndListener' net === RUN TestConnAndListener --- FAIL: TestConnAndListener (10.01 seconds) conn_test.go:114: Conn.Read failed: read tcp 127.0.0.1:59245: i/o timeout conn_test.go:80: Conn.Read failed: read tcp 127.0.0.1:52002: i/o timeout === RUN TestShutdown panic: test timed out after 20s goroutine 24 [running]: runtime.panic(0x5ba840, 0xc208001380) /home/aram/go/src/pkg/runtime/panic.c:250 +0xba testing.func·008() /home/aram/go/src/pkg/testing/testing.go:628 +0xf5 created by time.goFunc /home/aram/go/src/pkg/time/sleep.go:121 +0x47 goroutine 16 [chan receive]: testing.RunTests(0x705a20, 0x7dfe20, 0x79, 0x79, 0x0) /home/aram/go/src/pkg/testing/testing.go:504 +0x906 testing.Main(0x705a20, 0x7dfe20, 0x79, 0x79, 0x7d64c0, ...) /home/aram/go/src/pkg/testing/testing.go:435 +0x9e main.main() /tmp/go-build116241956/net/_test/_testmain.go:321 +0x9c goroutine 22 [IO wait]: net.runtime_pollWait(0xfffffd7fff160718, 0x72, 0x0) /home/aram/go/src/pkg/runtime/netpoll.goc:146 +0x76 net.(*pollDesc).Wait(0xc20802c300, 0x72, 0x0, 0x0) /home/aram/go/src/pkg/net/fd_poll_runtime.go:84 +0x46 net.(*pollDesc).WaitRead(0xc20802c300, 0x0, 0x0) /home/aram/go/src/pkg/net/fd_poll_runtime.go:89 +0x42 net.(*netFD).Read(0xc20802c2a0, 0xc208001360, 0xa, 0xa, 0x0, ...) /home/aram/go/src/pkg/net/fd_unix.go:214 +0x30c net.(*conn).Read(0xc208042068, 0xc208001360, 0xa, 0xa, 0x0, ...) /home/aram/go/src/pkg/net/net.go:122 +0xf2 net.TestShutdown(0xc208054000) /home/aram/go/src/pkg/net/net_test.go:53 +0x507 testing.tRunner(0xc208054000, 0x7e03d8) /home/aram/go/src/pkg/testing/testing.go:422 +0x9d created by testing.RunTests /home/aram/go/src/pkg/testing/testing.go:503 +0x8d6 goroutine 21 [chan send]: net.func·029() /home/aram/go/src/pkg/net/conn_test.go:88 +0x45 net.transponder(0xc208054120, 0xfffffd7fff15f7f8, 0xc208042018, 0xc208030480) /home/aram/go/src/pkg/net/conn_test.go:121 +0x24c created by net.TestConnAndListener /home/aram/go/src/pkg/net/conn_test.go:61 +0x4ef goroutine 23 [IO wait]: net.runtime_pollWait(0xfffffd7fff160668, 0x72, 0x0) /home/aram/go/src/pkg/runtime/netpoll.goc:146 +0x76 net.(*pollDesc).Wait(0xc20802c370, 0x72, 0x0, 0x0) /home/aram/go/src/pkg/net/fd_poll_runtime.go:84 +0x46 net.(*pollDesc).WaitRead(0xc20802c370, 0x0, 0x0) /home/aram/go/src/pkg/net/fd_poll_runtime.go:89 +0x42 net.(*netFD).Read(0xc20802c310, 0xc208001340, 0xa, 0xa, 0x0, ...) /home/aram/go/src/pkg/net/fd_unix.go:214 +0x30c net.(*conn).Read(0xc208042060, 0xc208001340, 0xa, 0xa, 0x0, ...) /home/aram/go/src/pkg/net/net.go:122 +0xf2 net.func·056() /home/aram/go/src/pkg/net/net_test.go:34 +0x1e2 created by net.TestShutdown /home/aram/go/src/pkg/net/net_test.go:40 +0x20e exit status 2 FAIL net 20.035s It needs re-arming after many types of events, not just reads and writes, e.g. concurrent accept.
On 2014/03/02 12:56:22, aram wrote: > > Why do you need to rearm it after polling? > > You arm it before each netpollwait, and netpoll does not care about > > notification if nobody is waiting. So it seems that arm before wait is > > enough. > > It doesn't work otherwise, e.g. (with your proposal): why? > : oos:src; go test -v -short -timeout 20s -run > 'TestShutdown|TestConnAndListener' net > === RUN TestConnAndListener > --- FAIL: TestConnAndListener (10.01 seconds) > conn_test.go:114: Conn.Read failed: read tcp 127.0.0.1:59245: i/o timeout > conn_test.go:80: Conn.Read failed: read tcp 127.0.0.1:52002: i/o timeout > === RUN TestShutdown > panic: test timed out after 20s > > goroutine 24 [running]: > runtime.panic(0x5ba840, 0xc208001380) > /home/aram/go/src/pkg/runtime/panic.c:250 +0xba > testing.func·008() > /home/aram/go/src/pkg/testing/testing.go:628 +0xf5 > created by time.goFunc > /home/aram/go/src/pkg/time/sleep.go:121 +0x47 > > goroutine 16 [chan receive]: > testing.RunTests(0x705a20, 0x7dfe20, 0x79, 0x79, 0x0) > /home/aram/go/src/pkg/testing/testing.go:504 +0x906 > testing.Main(0x705a20, 0x7dfe20, 0x79, 0x79, 0x7d64c0, ...) > /home/aram/go/src/pkg/testing/testing.go:435 +0x9e > main.main() > /tmp/go-build116241956/net/_test/_testmain.go:321 +0x9c > > goroutine 22 [IO wait]: > net.runtime_pollWait(0xfffffd7fff160718, 0x72, 0x0) > /home/aram/go/src/pkg/runtime/netpoll.goc:146 +0x76 > net.(*pollDesc).Wait(0xc20802c300, 0x72, 0x0, 0x0) > /home/aram/go/src/pkg/net/fd_poll_runtime.go:84 +0x46 > net.(*pollDesc).WaitRead(0xc20802c300, 0x0, 0x0) > /home/aram/go/src/pkg/net/fd_poll_runtime.go:89 +0x42 > net.(*netFD).Read(0xc20802c2a0, 0xc208001360, 0xa, 0xa, 0x0, ...) > /home/aram/go/src/pkg/net/fd_unix.go:214 +0x30c > net.(*conn).Read(0xc208042068, 0xc208001360, 0xa, 0xa, 0x0, ...) > /home/aram/go/src/pkg/net/net.go:122 +0xf2 > net.TestShutdown(0xc208054000) > /home/aram/go/src/pkg/net/net_test.go:53 +0x507 > testing.tRunner(0xc208054000, 0x7e03d8) > /home/aram/go/src/pkg/testing/testing.go:422 +0x9d > created by testing.RunTests > /home/aram/go/src/pkg/testing/testing.go:503 +0x8d6 > > goroutine 21 [chan send]: > net.func·029() > /home/aram/go/src/pkg/net/conn_test.go:88 +0x45 > net.transponder(0xc208054120, 0xfffffd7fff15f7f8, 0xc208042018, 0xc208030480) > /home/aram/go/src/pkg/net/conn_test.go:121 +0x24c > created by net.TestConnAndListener > /home/aram/go/src/pkg/net/conn_test.go:61 +0x4ef > > goroutine 23 [IO wait]: > net.runtime_pollWait(0xfffffd7fff160668, 0x72, 0x0) > /home/aram/go/src/pkg/runtime/netpoll.goc:146 +0x76 > net.(*pollDesc).Wait(0xc20802c370, 0x72, 0x0, 0x0) > /home/aram/go/src/pkg/net/fd_poll_runtime.go:84 +0x46 > net.(*pollDesc).WaitRead(0xc20802c370, 0x0, 0x0) > /home/aram/go/src/pkg/net/fd_poll_runtime.go:89 +0x42 > net.(*netFD).Read(0xc20802c310, 0xc208001340, 0xa, 0xa, 0x0, ...) > /home/aram/go/src/pkg/net/fd_unix.go:214 +0x30c > net.(*conn).Read(0xc208042060, 0xc208001340, 0xa, 0xa, 0x0, ...) > /home/aram/go/src/pkg/net/net.go:122 +0xf2 > net.func·056() > /home/aram/go/src/pkg/net/net_test.go:34 +0x1e2 > created by net.TestShutdown > /home/aram/go/src/pkg/net/net_test.go:40 +0x20e > exit status 2 > FAIL net 20.035s > > It needs re-arming after many types of events, not just reads and writes, > e.g. concurrent accept. we don't do accept and read/write on the same descriptor
I do not understand what invariant you want to preserve:

1. fd is always armed for POLLIN/POLLOUT
2. fd is armed only for events that somebody waits for

You arm fd in netpollopen, which suggests 1, but then after poll you re-arm only a subset of events, which suggests 2.

Why do you need to arm both in netpollopen and in pollWait? I can't grasp the big picture.
> Good point, I also don't understand why that can't happen.

Well, this can't happen because if closing = false, we won't be trying to associate a stale PollDesc. Sure, another thread can race us and eventually get the same file descriptor, but then we will call port_associate with a *new*, valid PollDesc.

The point is that by checking for closing = true before re-arming, we never associate a dead PollDesc. Yes, we can get into a situation where we have two PollDescs with the same fd, one live, one stale but not yet freed. In that case, however, we will always call port_associate for the live one.
On Fri, Mar 7, 2014 at 12:48 PM, <aram@mgk.ro> wrote:
>> Good point, I also don't understand why that can't happen.
>
> Well this can't happen because if closing = false, we won't be trying to
> associate a stale PortDesc. Sure, another thread can race us and
> eventually get the same file descriptor, but then we will call
> port_associate with a *new*, valid PortDesc.
>
> The point is that by requiring closing = true, we never associate a dead
> PortDesc. Yes, we can get into a situation where we have two PortDescs
> with the same fd, one live, one stale but not yet freed. In that case
> however, we will always call port_associate for the live one.

The sequence would be:

thread 1:
  fd.Close calls
  fd.decref calls
  fd.destroy calls
  fd.pd.Close calls
  runtime_pollClose calls
  runtime·netpollclose

thread 2:
  preempts thread 1
  runtime·netpoll returns from runtime·port_getn with fd's descriptor

thread 1:
  preempts thread 2
  runtime·netpollclose calls
  runtime·port_dissociate
  back to runtime_pollClose
  back to fd.pd.Close
  back to fd.Close
  back to fd.destroy which calls closesocket

Now a new socket is opened with the same descriptor number.

thread 2:
  finally starts running again in runtime·netpoll
  calls runtime·netpollupdate
  netpollupdate still has a PollDesc.
  It has been released by runtime_pollClose
  but netpollupdate still has a pointer to it.
  The closing field is set but nothing checks it.
  calls runtime·port_associate with the old PollDesc

I don't see anything that prevents that sequence from happening. I may well be missing something.

To me this code seems to be full of race conditions. The epoll code was carefully designed to work without any locks.

The epoll code can also be preempted just after epoll_wait returns but before the socket is closed and the PollDesc released. If the PollDesc is reallocated before runtime·netpoll continues, I think there can be a spurious wakeup of a goroutine. But I don't think the consequences are any more severe than that.

Ian
Ian Lance Taylor <iant@golang.org> wrote:
> ...

Thanks for the detailed explanation.

> I don't see anything that prevents that sequence from happening.
> I may well be missing something.

I don't think you are. I believe what you said is correct; it's a real possibility, however unlikely.

<dvyukov@google.com> wrote:
> Why do you need to arm both in netpollopen and in pollWait?
>
> I can't grasp the big picture

Say we open with POLLIN | POLLOUT, we get POLLIN. At this point, POLLOUT is *also* unarmed. At some point we will re-arm POLLIN again (via runtime·netpollarm), but we need to preserve the POLLOUT, otherwise we won't get POLLOUT, ever.

We could always arm POLLIN | POLLOUT in runtime·netpollarm, but then there's a window of time where one is not armed while the other is not processed yet, and that's bad.

-- Aram Hăvărneanu
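A sketch of the mask bookkeeping Aram describes, assuming as in the patch that ep points at a per-fd event word stored in the PollDesc user field and that portfd names the global event port:

    // Re-arm one direction without dropping interest in the other.
    // port_associate replaces the whole event mask for the fd, so the new
    // mask must be derived from the stored word rather than hardcoded.
    uint32 old, events;

    old = *ep;                      // e.g. POLLIN|POLLOUT before the wakeup
    events = (old & ~clear) | set;  // set=POLLIN, clear=0 preserves POLLOUT
    if(events != old)
        runtime·port_associate(portfd, PORT_SOURCE_FD, fd, events, (void*)pd);
    *ep = events;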
I'd like to reference Dmitry's #5 comment in this bug report here: https://code.google.com/p/go/issues/detail?id=6955#c5

Dmitry, care to expand on that note? What mutex business is missing from our code today?

Thanks.

-- Aram Hăvărneanu
On 2014/03/08 19:00:17, aram wrote: > I'd like to reference Dmitry's #5 comment in this bug report here: > https://code.google.com/p/go/issues/detail?id=6955#c5 > > Dmitry, care to expand on that note? What's mutex business is missing > from our code today? Nevermind. Maybe I was thinking about associating a context (PollDesc*) with fd, but this is supported by solaris port_getn. I did not have a detailed plan at that time, so let's just start from scratch.
On 2014/03/08 00:01:56, iant wrote:
> On Fri, Mar 7, 2014 at 12:48 PM, <aram@mgk.ro> wrote:
>>> Good point, I also don't understand why that can't happen.
>>
>> Well this can't happen because if closing = false, we won't be trying to
>> associate a stale PortDesc. Sure, another thread can race us and
>> eventually get the same file descriptor, but then we will call
>> port_associate with a *new*, valid PortDesc.
>>
>> The point is that by requiring closing = true, we never associate a dead
>> PortDesc. Yes, we can get into a situation where we have two PortDescs
>> with the same fd, one live, one stale but not yet freed. In that case
>> however, we will always call port_associate for the live one.
>
> The sequence would be:
>
> thread 1:
>   fd.Close calls
>   fd.decref calls
>   fd.destroy calls
>   fd.pd.Close calls
>   runtime_pollClose calls
>   runtime·netpollclose
>
> thread 2:
>   preempts thread 1
>   runtime·netpoll returns from runtime·port_getn with fd's descriptor
>
> thread 1:
>   preempts thread 2
>   runtime·netpollclose calls
>   runtime·port_dissociate
>   back to runtime_pollClose
>   back to fd.pd.Close
>   back to fd.Close
>   back to fd.destroy which calls closesocket
>
> Now a new socket is opened with the same descriptor number.
>
> thread 2:
>   finally starts running again in runtime·netpolln
>   calls runtime·netpollupdate
>   netpollupdate still has a PollDesc.
>   It has been released by runtime_pollClose
>   but netpollupdate still has a pointer to it.
>   The closing field is set but nothing checks it.
>   calls runtime·port_associate with the old PollDesc
>
> I don't see anything that prevents that sequence from happening. I
> may well be missing something.

I agree that the racy netpollclosing check does not look like fixing anything (other than reducing the probability of bad things).

> To me this code seems to be full of race conditions. The epoll code
> was carefully designed to work without any locks.
>
> The epoll code can also be preempted just after epoll_wait returns but
> before the socket is closed and the PollDesc released. If the
> PollDesc is reallocated before runtime·netpoll continues, I think
> there can be a spurious wakeup of a goroutine. But I don't think the
> consequences are any more severe than that.
>
> Ian
On 2014/03/08 12:28:14, aram wrote:
> Ian Lance Taylor <iant@golang.org> wrote:
>> ...
>
> Thanks for the detailed explanation.
>
>> I don't see anything that prevents that sequence from happening.
>> I may well be missing something.
>
> I don't think you are. I believe what you said is correct, it's a
> real posibility, however unlikely.
>
> <dvyukov@google.com> wrote:
>> Why do you need to arm both in netpollopen and in pollWait?
>>
>> I can't grasp the big picture
>
> Say we open with POLLIN | POLLOUT, we get POLLIN. At this point,
> POLLOUT is *also* unarmed. At some point we will re-arm POLLIN again
> (via runtime·netpollarm), but we need to preserve the POLLOUT
> otherwise we won't get POLLOUT, ever.
>
> We could always arm POLLIN | POLLOUT in runtime·netpollarm, but
> then there's a window of time where one is not armed while the other
> is not processed yet, and that's bad.

I still do not understand why we need to arm it in netpollopen; nobody is waiting at that time.
> I still do not understand why we need to arm it in netpollopen, nobody
> is waiting at that time.

But it is, see comment #8.

-- Aram Hăvărneanu
On Tue, Mar 11, 2014 at 11:23 AM, <dvyukov@google.com> wrote:
> Nevermind. Maybe I was thinking about associating a context (PollDesc*)
> with fd, but this is supported by solaris port_getn. I did not have a
> detailed plan at that time, so let's just start from scratch.

Unfortunately, I still think that we need to associate custom context in port_associate, not a PollDesc*. The reason is that we need access to the previous poll state, and if we put it in PollDesc and access it without locks, it's racy. I think we can get away without introducing a new type that would need per-fd malloc/free. We can just encode the previous poll state as the user context in port_associate, it's just a word.

Then, to get from fd to PollDesc we use an array. If we put locks around add/delete/associate we can safely access that array. We could get late to the party and an fd could disappear, but then it's safe to ignore the EBADFD error because the lock prevents another thread getting the same network fd (it could get the same fd through regular means, but then it's ok; at worst we get a spurious wakeup, just like with epoll).

-- Aram Hăvărneanu
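A sketch of the alternative floated here, with hypothetical names throughout; the CL as submitted kept PollDesc* as the port_associate context and used the per-PollDesc lock instead:

    // Hypothetical design: the port_associate user word carries only the
    // previous poll state, and fd -> PollDesc lookups go through a
    // lock-protected table, so a racing close cannot resurrect a dead
    // descriptor.
    static Lock     netpolltablock;         // guards add/delete/associate
    static PollDesc *netpolltab[1<<16];     // illustrative bound only

    static void
    netpollrearm(int32 fd, uint32 events)
    {
        runtime·lock(&netpolltablock);
        // A closed fd has been removed from the table, and the lock keeps
        // a concurrent re-open of the same fd number from slipping in, so
        // an EBADFD from port_associate would be safe to ignore here.
        if(netpolltab[fd] != nil)
            runtime·port_associate(portfd, PORT_SOURCE_FD, fd, events, (void*)(uintptr)events);
        runtime·unlock(&netpolltablock);
    }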
https://codereview.appspot.com/69190044/diff/40001/src/pkg/runtime/netpoll_so...
File src/pkg/runtime/netpoll_solaris.c (right):

https://codereview.appspot.com/69190044/diff/40001/src/pkg/runtime/netpoll_so...
src/pkg/runtime/netpoll_solaris.c:105: } while(runtime·cas(ep, old, events) != events);
cas returns a success/fail bool, not the old value. Even if it did return the old value, this comparison would be wrong.

https://codereview.appspot.com/69190044/diff/40001/src/pkg/runtime/netpoll_so...
src/pkg/runtime/netpoll_solaris.c:181: runtime·netpollupdate(pd, 0, ev->portev_events & (POLLIN|POLLOUT));
If we unblock a reader due to POLLERR, we do not reset the port association; this will get netpoll into a busy loop. I think you need to use the same conditions as above here.
Here is what I would do:

1. remove port_associate from netpollopen
2. mutate the events mask and call port_associate under the PollDesc lock
3. under the lock we can always check how exactly we need to mutate the OS association and whether the fd is closed or not
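Sketched out, the plan looks roughly like this (runtime·netpolllock appears in the later patch sets; runtime·netpollunlock is assumed as its counterpart):

    // (1) netpollopen no longer arms events; (2)+(3) arming happens under
    // the PollDesc lock, where the event mask and the closing flag can be
    // examined consistently.
    void
    runtime·netpollarm(PollDesc* pd, int32 mode)
    {
        runtime·netpolllock(pd);
        switch(mode) {
        case 'r':
            runtime·netpollupdate(pd, POLLIN, 0);  // checks pd->closing inside
            break;
        case 'w':
            runtime·netpollupdate(pd, POLLOUT, 0);
            break;
        }
        runtime·netpollunlock(pd);
    }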
On Tue, Mar 11, 2014 at 11:47 AM, <dvyukov@google.com> wrote:
> remove port_associate from netpollopen

Do you think associating in netpollarm is enough?

> mutate the events mask and call port_associate under the PollDesc
> Lock

But how can I safely lock on ev->portev_user in netpoll? The PollDesc might have already been freed, no?

-- Aram Hăvărneanu
On Tue, Mar 11, 2014 at 3:30 PM, Aram Hăvărneanu <aram@mgk.ro> wrote:
> On Tue, Mar 11, 2014 at 11:47 AM, <dvyukov@google.com> wrote:
>> remove port_associate from netpollopen
>
> Do you think associating in netpollarm is enough?

I do not see any need for the association while nobody is waiting for events.

>> mutate the events mask and call port_associate under the PollDesc
>> Lock
>
> But how can I safely lock ev->portev_use in netpoll, the PollDesc
> might have already been freed, no?

For exactly this reason I've designed PollDescs to be persistent and type stable. I.e. if you have a PollDesc* you can dereference it at any point in time; it can be a PollDesc for another fd, but it's still a PollDesc.
> I do not see any need for the association while nobody is waiting for events

We have to consume the socket-ready event; it doesn't work without this.
Hello golang-codereviews@googlegroups.com, bradfitz@golang.org, iant@golang.org, dvyukov@google.com (cc: golang-codereviews@googlegroups.com), Please take another look.
On Tue, Mar 11, 2014 at 5:33 PM, <aram@mgk.ro> wrote:
>> I do not see any need for the association while nobody is waiting for
>> events
>
> We have to consume the socket-ready event, it doesn't work without this.

What exactly is not working if we remove port_associate from netpollopen? And how? I don't know how solaris ports work, but for select/epoll/GetQueuedCompletionStatus it's OK to not arm the notification mechanism if you are not interested in events.
In #8 you can see the test timeout panic trace. -- Aram Hăvărneanu
On 2014/03/11 13:37:42, aram2 wrote:
> In #8 you can see the test timeout panic trace.

#8 was about re-arming after polling. I am asking about arming in netpollopen.
> #8 was about re-arming after polling > I am asking about arming in netpollopen Here it is with arming in netpollopen disabled: go test -v -timeout 20s -short net === RUN TestConnAndListener --- PASS: TestConnAndListener (0.00 seconds) === RUN TestDialTimeout --- SKIP: TestDialTimeout (0.00 seconds) dial_test.go:88: skipping test on "solaris"; untested. === RUN TestSelfConnect --- PASS: TestSelfConnect (0.62 seconds) === RUN TestDialError --- PASS: TestDialError (0.00 seconds) dial_test.go:213: test disabled; use -run_error_test to enable === RUN TestInvalidDialAndListenArgs --- PASS: TestInvalidDialAndListenArgs (0.00 seconds) === RUN TestDialTimeoutFDLeak --- SKIP: TestDialTimeoutFDLeak (0.00 seconds) dial_test.go:265: skipping test on "solaris" === RUN TestDialMultiFDLeak --- SKIP: TestDialMultiFDLeak (0.00 seconds) dial_test.go:367: skipping test; error finding or running lsof: exec: "lsof": executable file not found in $PATH === RUN TestDialFailPDLeak --- SKIP: TestDialFailPDLeak (0.00 seconds) dial_test.go:432: skipping test in short mode === RUN TestDialer panic: test timed out after 20s goroutine 34 [running]: runtime.panic(0x5c2ca0, 0xc208001fc0) /home/aram/go/src/pkg/runtime/panic.c:279 +0xf5 testing.func·008() /home/aram/go/src/pkg/testing/testing.go:628 +0xf5 created by time.goFunc /home/aram/go/src/pkg/time/sleep.go:121 +0x47 goroutine 16 [chan receive]: testing.RunTests(0x713710, 0x7f1220, 0x7a, 0x7a, 0x1) /home/aram/go/src/pkg/testing/testing.go:504 +0x906 testing.Main(0x713710, 0x7f1220, 0x7a, 0x7a, 0x7e76a0, ...) /home/aram/go/src/pkg/testing/testing.go:435 +0x9e main.main() /tmp/go-build365758353/net/_test/_testmain.go:323 +0x9c goroutine 32 [IO wait]: net.runtime_pollWait(0xfffffd7fff160790, 0x77, 0x0) /home/aram/go/src/pkg/runtime/netpoll.goc:146 +0x76 net.(*pollDesc).Wait(0xc20807b4f0, 0x77, 0x0, 0x0) /home/aram/go/src/pkg/net/fd_poll_runtime.go:84 +0x46 net.(*pollDesc).WaitWrite(0xc20807b4f0, 0x0, 0x0) /home/aram/go/src/pkg/net/fd_poll_runtime.go:93 +0x42 net.(*netFD).connect(0xc20807b490, 0xfffffd7fff15f848, 0xc208041c40, 0xfffffd7fff15f848, 0xc208041c60, ...) /home/aram/go/src/pkg/net/fd_unix.go:96 +0x1ec net.(*netFD).dial(0xc20807b490, 0xfffffd7fff15f7f8, 0xc208029560, 0xfffffd7fff15f7f8, 0xc208029590, ...) /home/aram/go/src/pkg/net/sock_posix.go:121 +0x333 net.socket(0x649d90, 0x4, 0x2, 0x2, 0x0, ...) /home/aram/go/src/pkg/net/sock_posix.go:91 +0x43a net.internetSocket(0x649d90, 0x4, 0xfffffd7fff15f7f8, 0xc208029560, 0xfffffd7fff15f7f8, ...) /home/aram/go/src/pkg/net/ipsock_posix.go:136 +0x179 net.dialTCP(0x649d90, 0x4, 0xc208029560, 0xc208029590, 0x0, ...) /home/aram/go/src/pkg/net/tcpsock_posix.go:155 +0x11c net.dialSingle(0x649d90, 0x4, 0xc208001f80, 0xf, 0xfffffd7fff15f790, ...) /home/aram/go/src/pkg/net/dial.go:238 +0x426 net.func·019(0x0, 0x0, 0x0, 0x0, 0x0, ...) /home/aram/go/src/pkg/net/dial.go:164 +0x120 net.dial(0x649d90, 0x4, 0xfffffd7fff15f790, 0xc208029590, 0xfffffd7ffeb20e28, ...) /home/aram/go/src/pkg/net/fd_unix.go:40 +0x75 net.(*Dialer).Dial(0xc208079c80, 0x649d90, 0x4, 0xc208001f80, 0xf, ...) /home/aram/go/src/pkg/net/dial.go:171 +0x42b net.TestDialer(0xc208054630) /home/aram/go/src/pkg/net/dial_test.go:504 +0x3cf testing.tRunner(0xc208054630, 0x7f12e0) /home/aram/go/src/pkg/testing/testing.go:422 +0x9d created by testing.RunTests /home/aram/go/src/pkg/testing/testing.go:503 +0x8d6 exit status 2 FAIL net 20.037s : oos:runtime; -- Aram Hăvărneanu
But why does it hang? Maybe there is some other latent issue.
R=golang-dev (assigned by bradfitz@golang.org)
On Tue, Mar 11, 2014 at 6:39 PM, <dvyukov@google.com> wrote: > But why does it hang? Because runtime·netpoll is blocked in port_getn, and calling port_associate during this time will not unblock it, it needs to be associated a priori. -- Aram Hăvărneanu
On 2014/03/11 18:02:08, aram wrote:
> On Tue, Mar 11, 2014 at 6:39 PM, <dvyukov@google.com> wrote:
>> But why does it hang?
>
> Because runtime·netpoll is blocked in port_getn, and calling
> port_associate during this time will not unblock it, it needs to be
> associated a priori.

I still do not understand. What if we call port_associate in netpollopen, but there is already a thread blocked on port_getn? port_getn still hangs, right?

If we don't forcibly unblock outstanding port_getn when we start waiting on a new file descriptor (which we don't do now), then we can only ensure that we call port_associate before parking the waiting goroutine (but not before port_getn). And this works fine for epoll/kqueue; if it does not work for solaris ports, then you need to forcibly unblock outstanding port_getn's when you start waiting on any new file descriptor.

What am I missing?
On Tue, Mar 11, 2014 at 7:12 PM, <dvyukov@google.com> wrote:
> What if we call port_associate in netpollopen, but there is already
> a thread blocked on port_getn, port_getn still hangs, right?

Right.

> If we don't forcibly unblock outstanding port_getn when we start
> waiting of a new file descriptor (which we don't do now), then
> we can only ensure that we call port_associate before parking the
> waiting goroutine (but not before port_getn). And this works fine
> for epoll/kqueue

I don't understand this. Do you mean that we can't guarantee that we will unblock the poller right away? I guess that's true (but see below).

> if it does not work for solaris ports, then you need to forcibly
> unblock outstanding port_getn's when you start waiting on any
> new file descriptor.

Maybe; it could be that the code works just by chance, but see below.

> What am I missing?

A successful connect will always unblock port_getn. We register with the poller when we dial (by calling netpollopen); when we do any kind of I/O we won't block indefinitely because we already have this one event to consume. We re-arm *before* the I/O thread has a chance to do I/O, which means that after it has done its I/O, it will always be registered.

Considering the above, do you still think we need to forcibly unblock?

-- Aram Hăvărneanu
On 2014/03/11 18:23:42, aram wrote:
> On Tue, Mar 11, 2014 at 7:12 PM, <dvyukov@google.com> wrote:
>> What if we call port_associate in netpollopen, but there is already
>> a thread blocked on port_getn, port_getn still hangs, right?
>
> Right.
>
>> If we don't forcibly unblock outstanding port_getn when we start
>> waiting of a new file descriptor (which we don't do now), then
>> we can only ensure that we call port_associate before parking the
>> waiting goroutine (but not before port_getn). And this works fine
>> for epoll/kqueue
>
> I don't understand this.

You've said: "Because runtime·netpoll is blocked in port_getn, and calling port_associate during this time will not unblock it, it needs to be associated a priori."

I do not see how calling port_associate in netpollopen helps to overcome this. netpollopen can happen while runtime·netpoll is already blocked in port_getn.

> Do you mean that we can't guarantee
> that we will unblock the poller right away? I guess that's true
> (but see below).
>
>> if it does not work for solaris ports, then you need to forcibly
>> unblock outstanding port_getn's when you start waiting on any
>> new file descriptor.
>
> Maybe, it could be that the code works just by chance, but see below.
>
>> What am I missing?
>
> A successful connect will always unblock port_getn. We register
> with the pollwe when we dial (by calling netpollopen), when we
> do any kind of I/O we won't block indefinitely because we already
> have this one event to consume. We re-arm *before* the I/O thread
> has a chance to do I/O, which means that after it will have done
> its I/O, it will always be registered.
>
> Considering the above, do you still think we need to forcibly
> unblock?

Sorry, I still do not understand. Probably I should not be reviewing this change.

In runtime·netpoll we don't rearm, we *clear* the association:

runtime·netpollupdate(pd, 0, /*clear=*/ ev->portev_events & (POLLIN|POLLOUT));
Could we review the locking bits? If there are any doubts about the design of the network poller I can create an issue, but can we please review the bug at hand and the proposed fix? Thank you very much.
> You've said: "Because runtime·netpoll is blocked in port_getn, and > calling port_associate during this time will not unblock it, it needs > to be associated a priori." > > I do not see how calling port_associate in netpollopen helps to overcome > this. netpollopen can happen while runtime·netpoll is already blocked > in port_getn. What I wanted to say is that the mere act of calling port_associate will not unblock port_getn (even if the descriptor is ready for I/O). If port_getn is blocked, and we call port_associate, and *then* the descriptor gets the I/O ready notification from asynchronous connect/accept completion, port_getn will unblock. If asynchronous connect completes *before* calling port_associate, port_getn will not unblock. > In runtime·netpoll we don't rearm, we *clear* the association: > runtime·netpollupdate(pd, 0, /*clear=*/ ev->portev_events & > (POLLIN|POLLOUT)); We clear an association only with port_dissociate, after return from port_getn a descriptor is still associated, it's just not registered for that event. Because it's still associated, when we re-register for the type of event we care about, we get what we want. -- Aram Hăvărneanu
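For readers who have not used event ports, a self-contained sketch of the two mechanisms being distinguished here, in plain Solaris C with error handling elided; fd 0 is used purely for illustration:

    #include <port.h>
    #include <poll.h>
    #include <stdio.h>

    int
    main(void)
    {
        int port, fd = 0;
        port_event_t ev;
        uint_t n = 1;

        port = port_create();

        // Register interest: the last argument is an opaque cookie that
        // comes back in ev.portev_user (the Go runtime stores a PollDesc*).
        port_associate(port, PORT_SOURCE_FD, fd, POLLIN, NULL);

        // Blocks until fd is readable. Delivery consumes the event
        // registration, which is why the poller has to re-arm after every
        // wakeup, exactly as discussed in this thread.
        port_getn(port, &ev, 1, &n, NULL);
        printf("events=%#x user=%p\n", (unsigned)ev.portev_events, ev.portev_user);

        // Re-arm to keep receiving readiness notifications.
        port_associate(port, PORT_SOURCE_FD, fd, POLLIN, NULL);
        return 0;
    }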
OK, now I see how it fixes the closing race

https://codereview.appspot.com/69190044/diff/70001/src/pkg/runtime/netpoll_so...
File src/pkg/runtime/netpoll_solaris.c (right):

https://codereview.appspot.com/69190044/diff/70001/src/pkg/runtime/netpoll_so...
src/pkg/runtime/netpoll_solaris.c:109: } while(!runtime·cas(ep, old, events));
Remove this cas-loop: now that it's under the mutex, you can just call port_associate and update the netpolluser word.

https://codereview.appspot.com/69190044/diff/70001/src/pkg/runtime/netpoll_so...
src/pkg/runtime/netpoll_solaris.c:179: if((ev->portev_events & (POLLERR)) != POLLERR)
On both linux and windows we ignore some errors on some sockets (e.g. ECONNABORTED from accept). I don't know whether it can happen on solaris. If it can, then it is safer to always re-arm regardless of errors (there must be no notifications of badly broken sockets, so a single re-arm should not cause performance issues).
On 2014/03/11 19:22:21, aram wrote: > Could we review the locking bits? If there are any doubts about the design of > the network poller I can create an issue, but can we please review the bug at > hand and the proposed fix? > > Thank you very much. Done
On 2014/03/11 19:22:35, aram wrote:
>> You've said: "Because runtime·netpoll is blocked in port_getn, and
>> calling port_associate during this time will not unblock it, it needs
>> to be associated a priori."
>>
>> I do not see how calling port_associate in netpollopen helps to overcome
>> this. netpollopen can happen while runtime·netpoll is already blocked
>> in port_getn.
>
> What I wanted to say is that the mere act of calling port_associate
> will not unblock port_getn (even if the descriptor is ready
> for I/O). If port_getn is blocked, and we call port_associate,
> and *then* the descriptor gets the I/O ready notification from
> asynchronous connect/accept completion, port_getn will unblock. If
> asynchronous connect completes *before* calling port_associate,
> port_getn will not unblock.
>
>> In runtime·netpoll we don't rearm, we *clear* the association:
>> runtime·netpollupdate(pd, 0, /*clear=*/ ev->portev_events &
>> (POLLIN|POLLOUT));
>
> We clear an association only with port_dissociate, after return
> from port_getn a descriptor is still associated, it's just not
> registered for that event. Because it's still associated, when
> we re-register for the type of event we care about, we get what
> we want.

Ah, OK, now it all makes sense. So there are 2 separate things: (1) association with the port and (2) subscription to events. And they differently affect behavior of port_getn. Right?

Please add a comment at the top of netpoll_solaris.c explaining this (in a separate CL if you prefer). This IO readiness mechanism seems to be quite different from all of select, epoll, kqueue and IOCP. It would be nice if there were an explanation of why we call port_associate in netpollopen, why we need to re-arm in netpollwait and why we need to re-arm in netpoll.

One last question: do we need to arm POLLIN|POLLOUT in *netpollopen*? From your description it seems that we need a mere port association, but we don't yet need subscription to any events. Later, when we actually wait on the fd, we will call port_associate with the interesting events and receive the pending notifications. Or will port_associate with an empty event set fail?

Thanks for bearing with me.
> Ah, OK, now it all makes sense. So there are 2 separate things: (1) > association with the port and (2) subscription to events. And they > differently affect behavior of port_getn. Right? Yes. > Please add a comment at the top of netpoll_solaris.c explaining this (in a > separate CL if you prefer). This IO readiness mechanism seems to be quite > different from all of select, epoll, kqueue and IOCP. It would be nice if > there will be an explanation of why we call port_associate in netpollopen, > why we need to re-arm in netpollwait and why we need to re-arm in netpoll. I added a huge comment and smaller comments throughout. > One last question: do we need to arm POLLIN|POLLOUT in *netpollopen*? From > your description it seems that we need a mere port association, but we > don't yet need subscription to any events. Later, when we actually wait > on the fd, we will call port_accosiate with the interesting events and > receive the pending notifications. Or port_accosiate with an empty event > set will fail? Empty port_associate is fine, I have updated code to make empty association in netpollopen, maybe it's clearer this way. > Thanks for bearing with me. Thank you. https://codereview.appspot.com/69190044/diff/70001/src/pkg/runtime/netpoll_so... File src/pkg/runtime/netpoll_solaris.c (right): https://codereview.appspot.com/69190044/diff/70001/src/pkg/runtime/netpoll_so... src/pkg/runtime/netpoll_solaris.c:109: } while(!runtime·cas(ep, old, events)); On 2014/03/12 05:58:43, dvyukov wrote: > remove this cas-loop > now that it's under the mutex, you can just call port_associate and update the > netpolluser word Done. https://codereview.appspot.com/69190044/diff/70001/src/pkg/runtime/netpoll_so... src/pkg/runtime/netpoll_solaris.c:179: if((ev->portev_events & (POLLERR)) != POLLERR) On 2014/03/12 05:58:43, dvyukov wrote: > On both linux and windows we ignore some errors on some sockets (e.g. > ECONNABORTED from accept). I don't know whether it can happen on solaris. If it > can, then it is safer to always re-arm regardless of errors (there must be no > notifications of badly broken sockets, so a single re-arm should not cause > performance issues). Done.
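The empty association described above, as a sketch; portfd stands in for the runtime's global port descriptor, and the real patch routes this through netpollupdate under the PollDesc lock:

    // netpollopen: associate the fd with the port but subscribe to no
    // events yet; netpollarm adds POLLIN/POLLOUT once a goroutine
    // actually waits. The stored event word starts out empty.
    int32
    runtime·netpollopen(uintptr fd, PollDesc *pd)
    {
        *(uint32*)runtime·netpolluser(pd) = 0;
        if(runtime·port_associate(portfd, PORT_SOURCE_FD, fd, 0, (void*)pd) != 0)
            return -1;  // the real code would surface the errno
        return 0;
    }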
Hello golang-codereviews@googlegroups.com, bradfitz@golang.org, iant@golang.org, dvyukov@google.com, aram.h@mgk.ro, gobot@golang.org (cc: golang-codereviews@googlegroups.com), Please take another look.
On 2014/03/13 12:46:29, aram wrote: > > Ah, OK, now it all makes sense. So there are 2 separate things: (1) > > association with the port and (2) subscription to events. And they > > differently affect behavior of port_getn. Right? > > Yes. > > > Please add a comment at the top of netpoll_solaris.c explaining this (in a > > separate CL if you prefer). This IO readiness mechanism seems to be quite > > different from all of select, epoll, kqueue and IOCP. It would be nice if > > there will be an explanation of why we call port_associate in netpollopen, > > why we need to re-arm in netpollwait and why we need to re-arm in netpoll. > > I added a huge comment and smaller comments throughout. Thank you very much. I now feel that I understand what happens here. > > One last question: do we need to arm POLLIN|POLLOUT in *netpollopen*? From > > your description it seems that we need a mere port association, but we > > don't yet need subscription to any events. Later, when we actually wait > > on the fd, we will call port_accosiate with the interesting events and > > receive the pending notifications. Or port_accosiate with an empty event > > set will fail? > > Empty port_associate is fine, I have updated code to make empty > association in netpollopen, maybe it's clearer this way. > > > Thanks for bearing with me. > > Thank you. > > https://codereview.appspot.com/69190044/diff/70001/src/pkg/runtime/netpoll_so... > File src/pkg/runtime/netpoll_solaris.c (right): > > https://codereview.appspot.com/69190044/diff/70001/src/pkg/runtime/netpoll_so... > src/pkg/runtime/netpoll_solaris.c:109: } while(!runtime·cas(ep, old, events)); > On 2014/03/12 05:58:43, dvyukov wrote: > > remove this cas-loop > > now that it's under the mutex, you can just call port_associate and update the > > netpolluser word > > Done. > > https://codereview.appspot.com/69190044/diff/70001/src/pkg/runtime/netpoll_so... > src/pkg/runtime/netpoll_solaris.c:179: if((ev->portev_events & (POLLERR)) != > POLLERR) > On 2014/03/12 05:58:43, dvyukov wrote: > > On both linux and windows we ignore some errors on some sockets (e.g. > > ECONNABORTED from accept). I don't know whether it can happen on solaris. If > it > > can, then it is safer to always re-arm regardless of errors (there must be no > > notifications of badly broken sockets, so a single re-arm should not cause > > performance issues). > > Done.
https://codereview.appspot.com/69190044/diff/110001/src/pkg/runtime/netpoll_s...
File src/pkg/runtime/netpoll_solaris.c (right):

https://codereview.appspot.com/69190044/diff/110001/src/pkg/runtime/netpoll_s...
src/pkg/runtime/netpoll_solaris.c:156: // Updates the the association with a new set of interested events. After
s/the the/the/

https://codereview.appspot.com/69190044/diff/110001/src/pkg/runtime/netpoll_s...
src/pkg/runtime/netpoll_solaris.c:238: runtime·netpolllock(pd);
please move this just before runtime·netpollupdate, because that's what it protects

https://codereview.appspot.com/69190044/diff/110001/src/pkg/runtime/netpoll_s...
src/pkg/runtime/netpoll_solaris.c:252: runtime·netpollupdate(pd, 0, ev->portev_events & (POLLIN|POLLOUT));
I think "ev->portev_events & (POLLIN|POLLOUT)" is still incorrect. You unblock readers when "ev->portev_events & (POLLIN|POLLHUP|POLLERR)", and that's exactly when you need to clear the POLLIN association.

I think you need something along the lines of:

mode = 0;
clear = 0;
if(ev->portev_events & (POLLIN|POLLHUP|POLLERR)) {
	mode += 'r';
	clear |= POLLIN;
}
if(ev->portev_events & (POLLOUT|POLLHUP|POLLERR)) {
	mode += 'w';
	clear |= POLLOUT;
}
if(clear != 0)
	runtime·netpollupdate(pd, 0, clear);

https://codereview.appspot.com/69190044/diff/110001/src/pkg/runtime/netpoll_s...
src/pkg/runtime/netpoll_solaris.c:253:
remove the tabs
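For context, a sketch of how that suggestion slots into the netpoll loop, combined with the locking discussed earlier; netpollunlock is an assumed name, and netpollready is the existing helper declared in the runtime.h hunk in the original patch:

    // Translate the port events into a wakeup mode, clear exactly the
    // bits that fired, then queue any goroutines blocked on the fd.
    uint32 clear;
    int32 mode;
    G *gp;

    gp = nil;
    mode = 0;
    clear = 0;
    if(ev->portev_events & (POLLIN|POLLHUP|POLLERR)) {
        mode += 'r';
        clear |= POLLIN;
    }
    if(ev->portev_events & (POLLOUT|POLLHUP|POLLERR)) {
        mode += 'w';
        clear |= POLLOUT;
    }
    if(clear != 0) {
        runtime·netpolllock(pd);
        runtime·netpollupdate(pd, 0, clear);  // drop fired bits from the stored mask
        runtime·netpollunlock(pd);
    }
    if(mode)
        runtime·netpollready(&gp, pd, mode);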
Ian, do you want to take another look? I am going to l-g-t-m it after the minor fixes.
https://codereview.appspot.com/69190044/diff/110001/src/pkg/runtime/netpoll_s... File src/pkg/runtime/netpoll_solaris.c (right): https://codereview.appspot.com/69190044/diff/110001/src/pkg/runtime/netpoll_s... src/pkg/runtime/netpoll_solaris.c:156: // Updates the the association with a new set of interested events. After On 2014/03/14 06:55:44, dvyukov wrote: > s/the the/the/ Done. https://codereview.appspot.com/69190044/diff/110001/src/pkg/runtime/netpoll_s... src/pkg/runtime/netpoll_solaris.c:238: runtime·netpolllock(pd); On 2014/03/14 06:55:44, dvyukov wrote: > please move this just before runtime·netpollupdate, because that's what it > protects Done. https://codereview.appspot.com/69190044/diff/110001/src/pkg/runtime/netpoll_s... src/pkg/runtime/netpoll_solaris.c:252: runtime·netpollupdate(pd, 0, ev->portev_events & (POLLIN|POLLOUT)); On 2014/03/14 06:55:44, dvyukov wrote: > I think "ev->portev_events & (POLLIN|POLLOUT)" is still incorrect. > You unblock readers when "ev->portev_events & (POLLIN|POLLHUP|POLLERR)", and > that's exactly when you need to clear the POLLIN association. > > I think you need something along the lines of: > > mode = 0; > clear = 0; > if(ev->portev_events & (POLLIN|POLLHUP|POLLERR)) { > mode += 'r'; > clear |= POLLIN; > } > if(ev->portev_events & (POLLOUT|POLLHUP|POLLERR)) { > mode += 'w'; > clear |= POLLOUT; > } > if(clear != 0) > runtime·netpollupdate(pd, 0, clear); Done. https://codereview.appspot.com/69190044/diff/110001/src/pkg/runtime/netpoll_s... src/pkg/runtime/netpoll_solaris.c:253: On 2014/03/14 06:55:44, dvyukov wrote: > remove the tabs Done.
Hello golang-codereviews@googlegroups.com, bradfitz@golang.org, iant@golang.org, dvyukov@google.com, aram.h@mgk.ro, gobot@golang.org (cc: golang-codereviews@googlegroups.com), Please take another look.
LGTM

let's wait for min(Ian reply, 1 day)
LGTM
*** Submitted as https://code.google.com/p/go/source/detail?r=b6b69b97b4b7 ***

runtime: fix use after close race in Solaris network poller

The Solaris network poller uses event ports, which are level-triggered. As such, it has to re-arm itself after each wakeup. The arming mechanism (which runs in its own thread) raced with the closing of a file descriptor happening in a different thread. When a network file descriptor is about to be closed, the network poller is awakened to give it a chance to remove its association with the file descriptor. Because the poller always re-armed itself, it raced with code that closed the descriptor.

This change makes the network poller check before re-arming if the file descriptor is about to be closed, in which case it will ignore the re-arming request. It uses the per-PollDesc lock in order to serialize access to the PollDesc.

This change also adds extensive documentation describing the Solaris implementation of the network poller.

Fixes issue 7410.

LGTM=dvyukov, iant
R=golang-codereviews, bradfitz, iant, dvyukov, aram.h, gobot
CC=golang-codereviews
https://codereview.appspot.com/69190044

Committer: Dmitriy Vyukov <dvyukov@google.com>