|
|
Created:
11 years, 4 months ago by dvyukov Modified:
11 years, 4 months ago Reviewers:
CC:
golang-dev, bradfitz, dfc, mikio, remyoudompheng, rsc Visibility:
Public. |
Descriptionnet: fix flaky test
The test failed on one of the builders with:
timeout_test.go:594: ln.Accept: accept tcp 127.0.0.1:19373: use of closed network connection
http://build.golang.org/log/e83f4a152b37071b9d079096e15913811ad296b5
Patch Set 1 #Patch Set 2 : diff -r d2b512689ae1 https://dvyukov%40google.com@code.google.com/p/go/ #Patch Set 3 : diff -r d2b512689ae1 https://dvyukov%40google.com@code.google.com/p/go/ #
Total comments: 3
Patch Set 4 : diff -r 69e2f3d33300 https://dvyukov%40google.com@code.google.com/p/go/ #MessagesTotal messages: 16
Hello golang-dev@googlegroups.com, I'd like you to review this change to https://dvyukov%40google.com@code.google.com/p/go/
Sign in to reply to this message.
LGTM On Mon, Nov 26, 2012 at 11:15 AM, <dvyukov@google.com> wrote: > Reviewers: golang-dev_googlegroups.com, > > Message: > Hello golang-dev@googlegroups.com, > > I'd like you to review this change to > https://dvyukov%40google.com@**code.google.com/p/go/<http://40google.com@code... > > > Description: > net: fix flaky test > The test failed on one of the builders with: > timeout_test.go:594: ln.Accept: accept tcp 127.0.0.1:19373: use of > closed network connection > http://build.golang.org/log/**e83f4a152b37071b9d079096e15913**811ad296b5<http... > > Please review this at http://codereview.appspot.com/**6859043/<http://codereview.appspot.com/6859043/> > > Affected files: > M src/pkg/net/timeout_test.go > > > Index: src/pkg/net/timeout_test.go > ==============================**==============================**======= > --- a/src/pkg/net/timeout_test.go > +++ b/src/pkg/net/timeout_test.go > @@ -588,8 +588,10 @@ > > ln := newLocalListener(t) > defer ln.Close() > + connected := make(chan bool) > go func() { > s, err := ln.Accept() > + connected <- true > if err != nil { > t.Fatalf("ln.Accept: %v", err) > } > @@ -619,6 +621,7 @@ > t.Fatalf("DialTCP: %v", err) > } > defer c.Close() > + <-connected > for i := 0; i < 1024; i++ { > var buf [1]byte > c.Write(buf[:]) > > >
Sign in to reply to this message.
+mikioh I'm not sure why this works, Dial should not return until the connection has been Accept'd, right ? We use the same pattern in the test immediately above this one. What looks like is happening is the client is connecting, pumping out 1024 x 1 byte writes, then shutting the connection before the Accept has completed. Also, why is this only failing on FreeBSD/386? Then again most of the other *BSD builders are flaky at the same time so we don't have the best coverage. https://codereview.appspot.com/6859043/diff/4001/src/pkg/net/timeout_test.go File src/pkg/net/timeout_test.go (right): https://codereview.appspot.com/6859043/diff/4001/src/pkg/net/timeout_test.go#... src/pkg/net/timeout_test.go:596: t.Fatalf("ln.Accept: %v", err) Must use t.Errorf here https://codereview.appspot.com/6859043/diff/4001/src/pkg/net/timeout_test.go#... src/pkg/net/timeout_test.go:604: if err != nil { Can this be tightened up a bit here, the expect errors are EOF and timeout, right ? https://codereview.appspot.com/6859043/diff/4001/src/pkg/net/timeout_test.go#... src/pkg/net/timeout_test.go:627: c.Write(buf[:]) Please add a check for the err value here.
Sign in to reply to this message.
On 2012/11/26 <dave@cheney.net> wrote: > +mikioh > > I'm not sure why this works, Dial should not return until the connection > has been Accept'd, right ? We use the same pattern in the test > immediately above this one. I suppose it should return when the connection is established at the TCP level, but being accepted by a process is a different thing. Rémy.
Sign in to reply to this message.
I believe Dial can return before the Accept executes. On most systems I believe that the completion of the TCP 3-way handshake is what places the socket on the queue of connections that can be returned by accept(2) and also what makes the connect(2) on the client side finish. It is easy to believe that a fast client might close the connection (send a FIN) before the accept(2) happens. Of course implementations differ on whether a closed connection turns into a failed accept(2). http://www.cl.cam.ac.uk/cgi-bin/manpage?2+accept says: ERROR HANDLING Linux accept() passes already-pending network errors on the new socket as an error code from accept(). This behaviour differs from other BSD socket implementations. For reliable operation the application should detect the network errors defined for the protocol after accept() and treat them like EAGAIN by retrying. In case of TCP/IP these are ENET- DOWN, EPROTO, ENOPROTOOPT, EHOSTDOWN, ENONET, EHOSTUNREACH, EOPNOTSUPP, and ENETUNREACH. but of course you are on BSD so maybe the characterization is wrong, outdated, or backward. Russ
Sign in to reply to this message.
LGTM assuming the other comments are addressed. Sorry for being a stick in the mud about this. I can see how, if nagel was in effect, then those 1024 writes would be coalesced into a single write, which could piggy back on the 3rd packet in the handshake (TCP allows data on the 3rd ACK, assuming there is window size available, right?) then follow up with a FIN before the server even saw the packet. On Tue, Nov 27, 2012 at 8:09 AM, Russ Cox <rsc@golang.org> wrote: > I believe Dial can return before the Accept executes. On most systems > I believe that the completion of the TCP 3-way handshake is what > places the socket on the queue of connections that can be returned by > accept(2) and also what makes the connect(2) on the client side > finish. It is easy to believe that a fast client might close the > connection (send a FIN) before the accept(2) happens. > > Of course implementations differ on whether a closed connection turns > into a failed accept(2). > http://www.cl.cam.ac.uk/cgi-bin/manpage?2+accept says: > > ERROR HANDLING > > Linux accept() passes already-pending network errors on the new socket > as an error code from accept(). This behaviour differs from other BSD > socket implementations. For reliable operation the application should > detect the network errors defined for the protocol after accept() and > treat them like EAGAIN by retrying. In case of TCP/IP these are ENET- > DOWN, EPROTO, ENOPROTOOPT, EHOSTDOWN, ENONET, EHOSTUNREACH, EOPNOTSUPP, > and ENETUNREACH. > > but of course you are on BSD so maybe the characterization is wrong, > outdated, or backward. > > Russ
Sign in to reply to this message.
On Mon, Nov 26, 2012 at 1:09 PM, Russ Cox <rsc@golang.org> wrote: > I believe Dial can return before the Accept executes. On most systems > I believe that the completion of the TCP 3-way handshake is what > places the socket on the queue of connections that can be returned by > accept(2) and also what makes the connect(2) on the client side > finish. It is easy to believe that a fast client might close the > connection (send a FIN) before the accept(2) happens. > > Of course implementations differ on whether a closed connection turns > into a failed accept(2). > http://www.cl.cam.ac.uk/cgi-bin/manpage?2+accept says: > > ERROR HANDLING > > Linux accept() passes already-pending network errors on the new > socket > as an error code from accept(). This behaviour differs from other > BSD > socket implementations. For reliable operation the application > should > detect the network errors defined for the protocol after accept() > and > treat them like EAGAIN by retrying. In case of TCP/IP these are > ENET- > DOWN, EPROTO, ENOPROTOOPT, EHOSTDOWN, ENONET, EHOSTUNREACH, > EOPNOTSUPP, > and ENETUNREACH. > Do that mean there's a possible series of packets I can throw at a Linux Go server to make syscall.Accept return an error and end its http.ListenAndServe remotely? Do we need to change the net package on Linux to treat all those EFOO errors as Temporary in our accept?
Sign in to reply to this message.
On 2012/11/26 Brad Fitzpatrick <bradfitz@golang.org> wrote: > On Mon, Nov 26, 2012 at 1:09 PM, Russ Cox <rsc@golang.org> wrote: >> >> I believe Dial can return before the Accept executes. On most systems >> I believe that the completion of the TCP 3-way handshake is what >> places the socket on the queue of connections that can be returned by >> accept(2) and also what makes the connect(2) on the client side >> finish. It is easy to believe that a fast client might close the >> connection (send a FIN) before the accept(2) happens. >> >> Of course implementations differ on whether a closed connection turns >> into a failed accept(2). >> http://www.cl.cam.ac.uk/cgi-bin/manpage?2+accept says: >> >> ERROR HANDLING >> >> Linux accept() passes already-pending network errors on the new >> socket >> as an error code from accept(). This behaviour differs from other >> BSD >> socket implementations. For reliable operation the application >> should >> detect the network errors defined for the protocol after accept() >> and >> treat them like EAGAIN by retrying. In case of TCP/IP these are >> ENET- >> DOWN, EPROTO, ENOPROTOOPT, EHOSTDOWN, ENONET, EHOSTUNREACH, >> EOPNOTSUPP, >> and ENETUNREACH. > > > Do that mean there's a possible series of packets I can throw at a Linux Go > server to make syscall.Accept return an error and end its > http.ListenAndServe remotely? > > Do we need to change the net package on Linux to treat all those EFOO errors > as Temporary in our accept? We already fixed issue 3395 but indeed more may be hidden. http://code.google.com/p/go/issues/detail?id=3395 Rémy.
Sign in to reply to this message.
> Sorry for being a stick in the mud about this. I can see how, if nagel > was in effect, then those 1024 writes would be coalesced into a single > write, which could piggy back on the 3rd packet in the handshake (TCP > allows data on the 3rd ACK, assuming there is window size available, > right?) then follow up with a FIN before the server even saw the > packet. You don't need any packet coalescing. The kernel is running the TCP protocol independently of the accept(2) system call finishing. You just need enough delay on the machine that all the 1024 writes and the final FIN get processed by the server half of the kernel before the user-level Go program gets around to calling accept(2) to retrieve an fd for the connection. And by then the function has returned and done the deferred ln.Close, so the ln.Accept fails. Russ
Sign in to reply to this message.
> Do that mean there's a possible series of packets I can throw at a Linux Go > server to make syscall.Accept return an error and end its > http.ListenAndServe remotely? Yes, but that's not what is happening here and should already be fixed (see Remy's link). What's happening here is that after a Listen, one goroutine does c := Dial, c.Write c.Write c.Write, c.Close, ln.Close while another goroutine does ln.Accept. In some cases the first goroutine can do its whole sequence (including ln.Close) before the other goroutine does ln.Accept. That is, the one goroutine is closing the *listener* out from under the other. Dave suggested that perhaps the Dial would not return until the ln.Accept had completed; if so the ln.Close could not possibly run before the ln.Accept finished, but I was explaining why that's not the case. Russ
Sign in to reply to this message.
On Mon, Nov 26, 2012 at 1:53 PM, Russ Cox <rsc@golang.org> wrote: > > Do that mean there's a possible series of packets I can throw at a Linux > Go > > server to make syscall.Accept return an error and end its > > http.ListenAndServe remotely? > > Yes, but that's not what is happening here and should already be fixed > (see Remy's link). > Yeah, I was just going on a tangent. Issue 3395 only fixed ECONNABORTED. The manpage suggests there are more? But I guess most servers would have bigger problems if ENETDOWN or ENONET.
Sign in to reply to this message.
> Yeah, I was just going on a tangent. Issue 3395 only fixed ECONNABORTED. > The manpage suggests there are more? But I guess most servers would have > bigger problems if ENETDOWN or ENONET. No idea. Linux man pages are somewhere below Wikipedia on the reliability scale. Russ
Sign in to reply to this message.
So, is this change LGTM? On Tue, Nov 27, 2012 at 2:10 AM, Russ Cox <rsc@golang.org> wrote: >> Yeah, I was just going on a tangent. Issue 3395 only fixed ECONNABORTED. >> The manpage suggests there are more? But I guess most servers would have >> bigger problems if ENETDOWN or ENONET. > > No idea. Linux man pages are somewhere below Wikipedia on the reliability scale. > > Russ
Sign in to reply to this message.
On 2012/11/27 05:49:41, dvyukov wrote: > So, is this change LGTM? > Yes.
Sign in to reply to this message.
LGTM I couldn't repro the issue on freebsd/386 9.0-release but let's see what happens.
Sign in to reply to this message.
*** Submitted as http://code.google.com/p/go/source/detail?r=34e54cf71fed *** net: fix flaky test The test failed on one of the builders with: timeout_test.go:594: ln.Accept: accept tcp 127.0.0.1:19373: use of closed network connection http://build.golang.org/log/e83f4a152b37071b9d079096e15913811ad296b5 R=golang-dev, bradfitz, dave, mikioh.mikioh, remyoudompheng, rsc CC=golang-dev http://codereview.appspot.com/6859043
Sign in to reply to this message.
|