Description

apiserver: analyzing ping timeouts
This is a first fix for #1205451, handling externally
killed juju nodes. Currently the pings from clients to
the API server are not analyzed. With this change,
each ping resets a timer that waits for 3 minutes.
If the timer is not reset by a ping within that time,
its signal tells the pinger to kill the connection.
The broader problem with the presence pinger,
especially with regard to HA environments, is not
covered by this change.
https://code.launchpad.net/~themue/juju-core/055-apiserver-heartbeat-analyzing/+merge/194862
(do not edit description out of merge proposal)
Patch Set 1 #
Total comments: 18
Patch Set 2 : apiserver: analyzing ping timeouts #
Total comments: 14
Patch Set 3 : apiserver: analyzing ping timeouts #
Total comments: 14
Patch Set 4 : apiserver: analyzing ping timeouts #
Messages

Total messages: 9
Please take a look.
This is going in a reasonable direction, but I have a few comments to make. It's a bit of a pity that this adds yet another goroutine per connection, but I think that can be sorted out at some other time (we could probably have one goroutine that deals with all the pinging). The other thing that this CL does not do is actually close the client connection. That can probably be left for a followup though.

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go
File state/apiserver/root.go (right):

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:251: pinger, err := newSrvPinger(r, 3*time.Minute)
This seems weird to me. For a start, it's racy - technically a client is allowed to call Ping twice concurrently, and if it does that, we'll have two srvPingers active. Secondly, it seems odd that if a client never bothers to ping, they will never get one of these allocated, so we'll never get the timeout behaviour. Thirdly, just deleting all the resources without killing the connection seems unlikely to be worth it.

I think you need to set this up inside apiRootForEntity. Additionally, I don't think you should return the srvPinger directly from this method - that exposes its Stop method to the API, which you probably don't want to do. I make a suggestion below.

One other thing: I think the actual value of the duration should perhaps be tied to the value used by the client (perhaps the client should use the value divided by 10 or something).

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:312: reset chan interface{}
s/interface{}/struct{}/

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:315: // newSrvPinger creates a new pinger with a running killer.
This type isn't really a pinger - it doesn't do any pinging. How about something like this?

    type resourceTimeout struct {
        tomb     tomb.Tomb
        timer    *time.Timer
        reset    chan struct{}
        resource killer
    }

    // newResourceTimeout will call resource.Kill if
    // Refresh is not repeatedly called on the returned resourceTimeout
    // within the given timeout.
    func newResourceTimeout(resource killer, timeout time.Duration) *resourceTimeout

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:332: p.reset <- struct{}{}
This needs to make sure that the pinger isn't being stopped by selecting on the tomb's dying chan too.

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:337: // p.tomb.Kill(nil)
Why is this commented out?

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:345: defer close(p.reset)
Why close the reset channel?

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:350: case <-time.After(p.timeout):
I think you should use NewTimer here (and Stop it after the select) to avoid creating a bit of garbage each time a value is received on reset. Even better, just use a single Timer instance inside p and call Reset on that.

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root_test.go
File state/apiserver/root_test.go (right):

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root_test.go#n...
state/apiserver/root_test.go:65: time.Sleep(10 * time.Millisecond)
This test will trigger a race detector error - there's no synchronisation between the Kill call and the read of killed. Better to make Kill send the time on a channel.
Thanks rog -- just emphasising a couple of your points.

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go
File state/apiserver/root.go (right):

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:251: pinger, err := newSrvPinger(r, 3*time.Minute)
On 2013/11/12 16:01:21, rog wrote:
> This seems weird to me. For a start, it's racy - technically a client is
> allowed to call Ping twice concurrently, and if it does that, we'll have two
> srvPingers active. Secondly, it seems odd that if a client never bothers to
> ping, they will never get one of these allocated, so we'll never get the
> timeout behaviour. Thirdly, just deleting all the resources without killing
> the connection seems unlikely to be worth it.
>
> I think you need to set this up inside apiRootForEntity. Additionally, I
> don't think you should return the srvPinger directly from this method - that
> exposes its Stop method to the API, which you probably don't want to do. I
> make a suggestion below.
>
> One other thing: I think the actual value of the duration should perhaps be
> tied to the value used by the client (perhaps the client should use the
> value divided by 10 or something)

Strong agreement with all of this. Create one per connection, use the Ping method purely for resetting the connection-killing timeout. And please make a timeout actually kill the connection -- that's the point of this branch, and I don't think it should be left to a later one.

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:315: // newSrvPinger creates a new pinger with a running killer.
On 2013/11/12 16:01:21, rog wrote:
> This type isn't really a pinger - it doesn't do any pinging.
>
> How about something like this?
>
>     type resourceTimeout struct {
>         tomb     tomb.Tomb
>         timer    *time.Timer
>         reset    chan struct{}
>         resource killer
>     }
>
>     // newResourceTimeout will call resource.Kill if
>     // Refresh is not repeatedly called on the returned resourceTimeout
>     // within the given timeout.
>     func newResourceTimeout(resource killer, timeout time.Duration) *resourceTimeout

FWIW, this is not (should not be) about killing resources -- it should be about closing the connection. The logic for cleaning up resources already exists, and is triggered by closing the connection.
Please take a look.

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go
File state/apiserver/root.go (right):

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:251: pinger, err := newSrvPinger(r, 3*time.Minute)
On 2013/11/12 16:01:21, rog wrote:
> This seems weird to me. For a start, it's racy - technically a client is
> allowed to call Ping twice concurrently, and if it does that, we'll have two
> srvPingers active. Secondly, it seems odd that if a client never bothers to
> ping, they will never get one of these allocated, so we'll never get the
> timeout behaviour. Thirdly, just deleting all the resources without killing
> the connection seems unlikely to be worth it.
>
> I think you need to set this up inside apiRootForEntity. Additionally, I
> don't think you should return the srvPinger directly from this method - that
> exposes its Stop method to the API, which you probably don't want to do. I
> make a suggestion below.
>
> One other thing: I think the actual value of the duration should perhaps be
> tied to the value used by the client (perhaps the client should use the
> value divided by 10 or something)

Found a different approach. Only the part with the duration should be discussed and may be part of the next round. Not sure about it.

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:312: reset chan interface{}
On 2013/11/12 16:01:21, rog wrote:
> s/interface{}/struct{}/

Done.

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:315: // newSrvPinger creates a new pinger with a running killer.
On 2013/11/12 16:01:21, rog wrote:
> This type isn't really a pinger - it doesn't do any pinging.
>
> How about something like this?
>
>     type resourceTimeout struct {
>         tomb     tomb.Tomb
>         timer    *time.Timer
>         reset    chan struct{}
>         resource killer
>     }
>
>     // newResourceTimeout will call resource.Kill if
>     // Refresh is not repeatedly called on the returned resourceTimeout
>     // within the given timeout.
>     func newResourceTimeout(resource killer, timeout time.Duration) *resourceTimeout

Yeah, looks better. Done.

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:332: p.reset <- struct{}{}
On 2013/11/12 16:01:21, rog wrote:
> This needs to make sure that the pinger isn't being stopped
> by selecting on the tomb's dying chan too.

Done.

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:337: // p.tomb.Kill(nil)
On 2013/11/12 16:01:21, rog wrote:
> Why is this commented out?

Argh, forgot after a test run. :/

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:345: defer close(p.reset)
On 2013/11/12 16:01:21, rog wrote:
> Why close the reset channel?

Done.

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:350: case <-time.After(p.timeout):
On 2013/11/12 16:01:21, rog wrote:
> I think you should use NewTimer here (and Stop it after the select) to avoid
> creating a bit of garbage each time a value is received on reset. Even
> better, just use a single Timer instance inside p and call Reset on that.

Done.

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root_test.go
File state/apiserver/root_test.go (right):

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root_test.go#n...
state/apiserver/root_test.go:65: time.Sleep(10 * time.Millisecond)
On 2013/11/12 16:01:21, rog wrote:
> This test will trigger a race detector error - there's no synchronisation
> between the Kill call and the read of killed.
>
> Better to make Kill send the time on a channel.

Done.
Not quite there yet. Some thoughts below.

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go
File state/apiserver/root.go (right):

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go#ne...
state/apiserver/root.go:44: initialRoot: root,
Rather than having initialRoot inside srvRoot, I'd prefer to explicitly include the members we need (srv and rpcConn).

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go#ne...
state/apiserver/root.go:52: r.pingTimeout = newPingTimeout(r.entity.Tag(), action, 3*time.Minute)
There's nothing that cleans this up when the connection is killed.

Also, it would be good to have the timeout in a named constant, and connected somehow to the client ping logic.

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go#ne...
state/apiserver/root.go:240: // Pinger returns a server pinger tracing the client pings and

    // Pinger returns an object that can be pinged
    // by calling its Ping method. If this method
    // is not called frequently enough, the connection
    // will be dropped.

?

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go#ne...
state/apiserver/root.go:296: tag string
Do we need the tag here?

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go#ne...
state/apiserver/root.go:302: // newPingTimeout creates a ping timeout instance

    // newPingTimeout returns a new pingTimeout instance
    // that invokes the given action if there is more
    // than the given timeout interval between calls
    // to its Ping method.

?

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go#ne...
state/apiserver/root.go:311: go func() {
I *think* that if you do:

    defer action()

here that you'll be able to register the pingTimeout as a resource and have the action close the rpc connection.

Alternatively, call

    go pt.action()

below instead of calling it synchronously.

On balance I think I prefer the latter option. I'd make action() not return an error, so the caller can decide what they might want to do with any error.

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go#ne...
state/apiserver/root.go:328: func (pt *pingTimeout) stop() error {
I don't think this is ever called.
Please take a look.

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go
File state/apiserver/root.go (right):

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go#ne...
state/apiserver/root.go:44: initialRoot: root,
On 2013/11/18 18:19:48, rog wrote:
> Rather than having initialRoot inside srvRoot, I'd prefer
> to explicitly include the members we need (srv and rpcConn).

Done.

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go#ne...
state/apiserver/root.go:52: r.pingTimeout = newPingTimeout(r.entity.Tag(), action, 3*time.Minute)
On 2013/11/18 18:19:48, rog wrote:
> There's nothing that cleans this up when the connection is killed.
>
> Also, it would be good to have the timeout in a named constant,
> and connected somehow to the client ping logic.

Yep, the cleanup had been prepared but led to a deadlock in the first approach. It will be changed. Please describe "... connected somehow to the client ping logic" in more detail. Shall different agents influence their ping times, shall this be configurable or fixed by agent type, what is the benefit of this idea, and is it worth the increased complexity?

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go#ne...
state/apiserver/root.go:240: // Pinger returns a server pinger tracing the client pings and
On 2013/11/18 18:19:48, rog wrote:
> // Pinger returns an object that can be pinged
> // by calling its Ping method. If this method
> // is not called frequently enough, the connection
> // will be dropped.
>
> ?

Done.

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go#ne...
state/apiserver/root.go:296: tag string
On 2013/11/18 18:19:48, rog wrote:
> Do we need the tag here?

Removed, I only needed it during testing.

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go#ne...
state/apiserver/root.go:302: // newPingTimeout creates a ping timeout instance
On 2013/11/18 18:19:48, rog wrote:
> // newPingTimeout returns a new pingTimeout instance
> // that invokes the given action if there is more
> // than the given timeout interval between calls
> // to its Ping method.
>
> ?

Done.

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go#ne...
state/apiserver/root.go:311: go func() {
On 2013/11/18 18:19:48, rog wrote:
> I *think* that if you do:
>
>     defer action()
>
> here that you'll be able to register the pingTimeout
> as a resource and have the action close the rpc connection.
>
> Alternatively, call
>
>     go pt.action()
>
> below instead of calling it synchronously.
>
> On balance I think I prefer the latter option.
> I'd make action() not return an error, so the caller
> can decide what they might want to do with any error.

Done.

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go#ne...
state/apiserver/root.go:328: func (pt *pingTimeout) stop() error {
On 2013/11/18 18:19:48, rog wrote:
> I don't think this is ever called.

Now it is -- it had already been intended for the timeout cleanup.
> Please describe "... connected somehow to the client ping logic" more. Shall
> different agent influence their ping times, shall this be configurable or fixed
> by agent type, what is the benefit of this idea and is it worth the increasing
> complexity?

I suggest a specific solution in the comment on the timeout const.

This is really close now, thanks. Just a few small things left.

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root.go
File state/apiserver/root.go (right):

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root.go#ne...
state/apiserver/root.go:33: const timeout = 3 * time.Minute
s/timeout/maxPingInterval/ ? (more obvious what the timeout is)

I think I'd define it inside state/api. Then the api client code can use it to calculate an appropriate ping interval.

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root.go#ne...
state/apiserver/root.go:317: // that invokes the given action if there is more
s/action/action asynchronously/

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root.go#ne...
state/apiserver/root.go:329: go pt.action()
I don't think we want to call this if stop has been called. When I said "below" I meant in the <-timer.C arm of the select statement.

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root_test.go
File state/apiserver/root_test.go (right):

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root_test....
state/apiserver/root_test.go:53: for i := 0; i < 10; i++ {
s/10/2/

I don't think there's really any particular reason to make this test run any longer than necessary.

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root_test....
state/apiserver/root_test.go:60: closed := <-closedc

    select {
    case closed = <-closedc:
    case <-time.After(testing.LongWait):
        c.Fatalf("action never executed")
    }

otherwise the test will hang forever if the action never gets called.

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root_test....
state/apiserver/root_test.go:61: closeDiff := closed.Sub(broken).Nanoseconds() / 1000000
or:

    closeDiff := closed.Sub(broken) / time.Millisecond

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root_test....
state/apiserver/root_test.go:62: c.Assert(closeDiff >= 50 && closeDiff <= 60, gc.Equals, true)

    50 <= closeDiff && closeDiff <= 60

is a nice convention for range checks.

I'd also make the allowable range much larger to cater for dubious scheduling in the cloud, so that this doesn't break too often. Perhaps

    50 <= closeDiff && closeDiff <= 100
Please take a look.

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root.go
File state/apiserver/root.go (right):

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root.go#ne...
state/apiserver/root.go:33: const timeout = 3 * time.Minute
On 2013/11/19 14:44:08, rog wrote:
> s/timeout/maxPingInterval/
> ?
> (more obvious what the timeout is)
>
> I think I'd define it inside state/api. Then the api client
> code can use it to calculate an appropriate ping interval.

Renamed it and added a comment. I would not move it before the corresponding API changes, which could be done in another CL.

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root.go#ne...
state/apiserver/root.go:317: // that invokes the given action if there is more
On 2013/11/19 14:44:08, rog wrote:
> s/action/action asynchronously/

Done.

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root.go#ne...
state/apiserver/root.go:329: go pt.action()
On 2013/11/19 14:44:08, rog wrote:
> I don't think we want to call this if stop has been called.
> When I said "below" I meant in the <-timer.C arm of the select statement.

Ah, I had already wondered.

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root_test.go
File state/apiserver/root_test.go (right):

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root_test....
state/apiserver/root_test.go:53: for i := 0; i < 10; i++ {
On 2013/11/19 14:44:08, rog wrote:
> s/10/2/
>
> i don't think there's really any particular reason to make this test run any
> longer than necessary.

Done.

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root_test....
state/apiserver/root_test.go:60: closed := <-closedc
On 2013/11/19 14:44:08, rog wrote:
>     select {
>     case closed = <-closedc:
>     case <-time.After(testing.LongWait):
>         c.Fatalf("action never executed")
>     }
>
> otherwise the test will hang forever if the action never
> gets called.

I've had luck so far, but that's better.

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root_test....
state/apiserver/root_test.go:61: closeDiff := closed.Sub(broken).Nanoseconds() / 1000000
On 2013/11/19 14:44:08, rog wrote:
> or:
>     closeDiff := closed.Sub(broken) / time.Millisecond

Gnah! Sure. :)

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root_test....
state/apiserver/root_test.go:62: c.Assert(closeDiff >= 50 && closeDiff <= 60, gc.Equals, true)
On 2013/11/19 14:44:08, rog wrote:
>     50 <= closeDiff && closeDiff <= 60
>
> is a nice convention for range checks.
>
> I'd also make the allowable range much larger to cater
> for dubious scheduling in the cloud, so that this doesn't
> break too often.
>
> Perhaps
>
>     50 <= closeDiff && closeDiff <= 100

Done.
On 2013/11/19 15:52:37, mue wrote:
> Please take a look.
> [...]

LGTM, thanks.