Description

apiserver: analyzing ping timeouts
This is a first fix for #1205451, handling externally
killed juju nodes. Currently the pings from clients to
the API server are not analyzed. With this change,
each ping resets a timer that waits for 3 minutes.
If the timer is not reset by a ping within that time,
its signal tells the pinger to kill the connection.
The broader problem with the presence pinger,
especially with regard to HA environments, is not
covered by this change.
https://code.launchpad.net/~themue/juju-core/055-apiserver-heartbeat-analyzing/+merge/194862
(do not edit description out of merge proposal)
Patch Set 1 #
Total comments: 18
Patch Set 2 : apiserver: analyzing ping timeouts #
Total comments: 14
Patch Set 3 : apiserver: analyzing ping timeouts #
Total comments: 14
Patch Set 4 : apiserver: analyzing ping timeouts #
Messages

Total messages: 9
Please take a look.
This is going in a reasonable direction, but I have a few comments to make. It's a bit of a pity that this adds yet another goroutine per connection, but I think that can be sorted out at some other time (we could probably have one goroutine that deals with all the pinging). The other thing that this CL does not do is actually close the client connection. That can probably be left for a followup though.

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go
File state/apiserver/root.go (right):

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:251: pinger, err := newSrvPinger(r, 3*time.Minute)
This seems weird to me. For a start, it's racy - technically a client is allowed to call Ping twice concurrently, and if it does that, we'll have two srvPingers active. Secondly, it seems odd that if a client never bothers to ping, they will never get one of these allocated, so we'll never get the timeout behaviour. Thirdly, just deleting all the resources without killing the connection seems unlikely to be worth it.

I think you need to set this up inside apiRootForEntity. Additionally, I don't think you should return the srvPinger directly from this method - that exposes its Stop method to the API, which you probably don't want to do. I make a suggestion below.

One other thing: I think the actual value of the duration should perhaps be tied to the value used by the client (perhaps the client should use the value divided by 10 or something).

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:312: reset chan interface{}
s/interface{}/struct{}/

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:315: // newSrvPinger creates a new pinger with a running killer.
This type isn't really a pinger - it doesn't do any pinging. How about something like this?

    type resourceTimeout struct {
        tomb     tomb.Tomb
        timer    *time.Timer
        reset    chan struct{}
        resource killer
    }

    // newResourceTimeout will call resource.Kill if
    // Refresh is not repeatedly called on the returned resourceTimeout
    // within the given timeout.
    func newResourceTimeout(resource killer, timeout time.Duration) *resourceTimeout

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:332: p.reset <- struct{}{}
This needs to make sure that the pinger isn't being stopped by selecting on the tomb's dying chan too.

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:337: // p.tomb.Kill(nil)
Why is this commented out?

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:345: defer close(p.reset)
Why close the reset channel?

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:350: case <-time.After(p.timeout):
I think you should use NewTimer here (and Stop it after the select) to avoid creating a bit of garbage each time a value is received on reset. Even better, just use a single Timer instance inside p and call Reset on that.

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root_test.go
File state/apiserver/root_test.go (right):

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root_test.go#n...
state/apiserver/root_test.go:65: time.Sleep(10 * time.Millisecond)
This test will trigger a race detector error - there's no synchronisation between the Kill call and the read of killed. Better to make Kill send the time on a channel.
Thanks rog -- just emphasising a couple of your points.

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go
File state/apiserver/root.go (right):

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:251: pinger, err := newSrvPinger(r, 3*time.Minute)
On 2013/11/12 16:01:21, rog wrote:
> This seems weird to me. For a start, it's racy - technically a client is
> allowed to call Ping twice concurrently, and if it does that, we'll have two
> srvPingers active. Secondly, it seems odd that if a client never bothers to
> ping, they will never get one of these allocated, so we'll never get the
> timeout behaviour. Thirdly, just deleting all the resources without killing
> the connection seems unlikely to be worth it.
>
> I think you need to set this up inside apiRootForEntity. Additionally, I
> don't think you should return the srvPinger directly from this method - that
> exposes its Stop method to the API, which you probably don't want to do. I
> make a suggestion below.
>
> One other thing: I think the actual value of the duration should perhaps be
> tied to the value used by the client (perhaps the client should use the
> value divided by 10 or something)

Strong agreement with all of this. Create one per connection, use the Ping method purely for resetting the connection-killing timeout. And please make a timeout actually kill the connection -- that's the point of this branch, and I don't think it should be left to a later one.

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:315: // newSrvPinger creates a new pinger with a running killer.
On 2013/11/12 16:01:21, rog wrote:
> This type isn't really a pinger - it doesn't do any pinging.
>
> How about something like this?
>
>     type resourceTimeout struct {
>         tomb     tomb.Tomb
>         timer    *time.Timer
>         reset    chan struct{}
>         resource killer
>     }
>
>     // newResourceTimeout will call resource.Kill if
>     // Refresh is not repeatedly called on the returned resourceTimeout
>     // within the given timeout.
>     func newResourceTimeout(resource killer, timeout time.Duration) *resourceTimeout

FWIW, this is not (should not be) about killing resources -- it should be about closing the connection. The logic for cleaning up resources already exists, and is triggered by closing the connection.
Please take a look.

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go
File state/apiserver/root.go (right):

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:251: pinger, err := newSrvPinger(r, 3*time.Minute)
On 2013/11/12 16:01:21, rog wrote:
> This seems weird to me. For a start, it's racy - technically a client is
> allowed to call Ping twice concurrently, and if it does that, we'll have two
> srvPingers active. Secondly, it seems odd that if a client never bothers to
> ping, they will never get one of these allocated, so we'll never get the
> timeout behaviour. Thirdly, just deleting all the resources without killing
> the connection seems unlikely to be worth it.
>
> I think you need to set this up inside apiRootForEntity. Additionally, I
> don't think you should return the srvPinger directly from this method - that
> exposes its Stop method to the API, which you probably don't want to do. I
> make a suggestion below.
>
> One other thing: I think the actual value of the duration should perhaps be
> tied to the value used by the client (perhaps the client should use the
> value divided by 10 or something)

Found a different approach. Only the part with the duration should be discussed and may be part of the next round. Not sure about it.

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:312: reset chan interface{}
On 2013/11/12 16:01:21, rog wrote:
> s/interface{}/struct{}/

Done.

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:315: // newSrvPinger creates a new pinger with a running killer.
On 2013/11/12 16:01:21, rog wrote:
> This type isn't really a pinger - it doesn't do any pinging.
>
> How about something like this?
>
>     type resourceTimeout struct {
>         tomb     tomb.Tomb
>         timer    *time.Timer
>         reset    chan struct{}
>         resource killer
>     }
>
>     // newResourceTimeout will call resource.Kill if
>     // Refresh is not repeatedly called on the returned resourceTimeout
>     // within the given timeout.
>     func newResourceTimeout(resource killer, timeout time.Duration) *resourceTimeout

Yeah, looks better. Done.

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:332: p.reset <- struct{}{}
On 2013/11/12 16:01:21, rog wrote:
> This needs to make sure that the pinger isn't being stopped
> by selecting on the tomb's dying chan too.

Done.

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:337: // p.tomb.Kill(nil)
On 2013/11/12 16:01:21, rog wrote:
> Why is this commented out?

Argh, forgot after a test run. :/

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:345: defer close(p.reset)
On 2013/11/12 16:01:21, rog wrote:
> Why close the reset channel?

Done.

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root.go#newcod...
state/apiserver/root.go:350: case <-time.After(p.timeout):
On 2013/11/12 16:01:21, rog wrote:
> I think you should use NewTimer here (and Stop it after the select) to avoid
> creating a bit of garbage each time a value is received on reset. Even
> better, just use a single Timer instance inside p and call Reset on that.

Done.

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root_test.go
File state/apiserver/root_test.go (right):

https://codereview.appspot.com/24040044/diff/1/state/apiserver/root_test.go#n...
state/apiserver/root_test.go:65: time.Sleep(10 * time.Millisecond)
On 2013/11/12 16:01:21, rog wrote:
> This test will trigger a race detector error - there's no synchronisation
> between the Kill call and the read of killed.
>
> Better to make Kill send the time on a channel.

Done.
Not quite there yet. Some thoughts below.

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go
File state/apiserver/root.go (right):

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go#ne...
state/apiserver/root.go:44: initialRoot: root,
Rather than having initialRoot inside srvRoot, I'd prefer to explicitly include the members we need (srv and rpcConn).

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go#ne...
state/apiserver/root.go:52: r.pingTimeout = newPingTimeout(r.entity.Tag(), action, 3*time.Minute)
There's nothing that cleans this up when the connection is killed.

Also, it would be good to have the timeout in a named constant, and connected somehow to the client ping logic.

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go#ne...
state/apiserver/root.go:240: // Pinger returns a server pinger tracing the client pings and

    // Pinger returns an object that can be pinged
    // by calling its Ping method. If this method
    // is not called frequently enough, the connection
    // will be dropped.

?

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go#ne...
state/apiserver/root.go:296: tag string
Do we need the tag here?

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go#ne...
state/apiserver/root.go:302: // newPingTimeout creates a ping timeout instance

    // newPingTimeout returns a new pingTimeout instance
    // that invokes the given action if there is more
    // than the given timeout interval between calls
    // to its Ping method.

?

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go#ne...
state/apiserver/root.go:311: go func() {
I *think* that if you do:

    defer action()

here that you'll be able to register the pingTimeout as a resource and have the action close the rpc connection.

Alternatively, call

    go pt.action()

below instead of calling it synchronously.

On balance I think I prefer the latter option. I'd make action() not return an error, so the caller can decide what they might want to do with any error.

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go#ne...
state/apiserver/root.go:328: func (pt *pingTimeout) stop() error {
I don't think this is ever called.
Please take a look.

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go
File state/apiserver/root.go (right):

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go#ne...
state/apiserver/root.go:44: initialRoot: root,
On 2013/11/18 18:19:48, rog wrote:
> Rather than having initialRoot inside srvRoot, I'd prefer
> to explicitly include the members we need (srv and rpcConn).

Done.

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go#ne...
state/apiserver/root.go:52: r.pingTimeout = newPingTimeout(r.entity.Tag(), action, 3*time.Minute)
On 2013/11/18 18:19:48, rog wrote:
> There's nothing that cleans this up when the connection is killed.
>
> Also, it would be good to have the timeout in a named constant,
> and connected somehow to the client ping logic.

Yep, the cleanup had been prepared but led to a deadlock in the first approach. It will be changed. Please describe "... connected somehow to the client ping logic" in more detail. Shall different agents influence their ping times, shall this be configurable or fixed by agent type, what is the benefit of this idea, and is it worth the increased complexity?

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go#ne...
state/apiserver/root.go:240: // Pinger returns a server pinger tracing the client pings and
On 2013/11/18 18:19:48, rog wrote:
> // Pinger returns an object that can be pinged
> // by calling its Ping method. If this method
> // is not called frequently enough, the connection
> // will be dropped.
>
> ?

Done.

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go#ne...
state/apiserver/root.go:296: tag string
On 2013/11/18 18:19:48, rog wrote:
> Do we need the tag here?

Removed, I only needed it during testing.

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go#ne...
state/apiserver/root.go:302: // newPingTimeout creates a ping timeout instance
On 2013/11/18 18:19:48, rog wrote:
> // newPingTimeout returns a new pingTimeout instance
> // that invokes the given action if there is more
> // than the given timeout interval between calls
> // to its Ping method.
>
> ?

Done.

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go#ne...
state/apiserver/root.go:311: go func() {
On 2013/11/18 18:19:48, rog wrote:
> I *think* that if you do:
>
>     defer action()
>
> here that you'll be able to register the pingTimeout
> as a resource and have the action close the rpc connection.
>
> Alternatively, call
>
>     go pt.action()
>
> below instead of calling it synchronously.
>
> On balance I think I prefer the latter option.
> I'd make action() not return an error, so the caller
> can decide what they might want to do with any error.

Done.

https://codereview.appspot.com/24040044/diff/20001/state/apiserver/root.go#ne...
state/apiserver/root.go:328: func (pt *pingTimeout) stop() error {
On 2013/11/18 18:19:48, rog wrote:
> I don't think this is ever called.

Now it is -- it had already been intended for the timeout cleanup.
> Please describe "... connected somehow to the client ping logic" more. Shall
> different agent influence their ping times, shall this be configurable or fixed
> by agent type, what is the benefit of this idea and is it worth the increasing
> complexity?

I suggest a specific solution in the comment on the timeout const.

This is really close now, thanks. Just a few small things left.

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root.go
File state/apiserver/root.go (right):

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root.go#ne...
state/apiserver/root.go:33: const timeout = 3 * time.Minute
s/timeout/maxPingInterval/ ? (more obvious what the timeout is)

I think I'd define it inside state/api. Then the api client code can use it to calculate an appropriate ping interval.

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root.go#ne...
state/apiserver/root.go:317: // that invokes the given action if there is more
s/action/action asynchronously/

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root.go#ne...
state/apiserver/root.go:329: go pt.action()
I don't think we want to call this if stop has been called. When I said "below" I meant in the <-timer.C arm of the select statement.

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root_test.go
File state/apiserver/root_test.go (right):

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root_test....
state/apiserver/root_test.go:53: for i := 0; i < 10; i++ {
s/10/2/

I don't think there's really any particular reason to make this test run any longer than necessary.

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root_test....
state/apiserver/root_test.go:60: closed := <-closedc

    select {
    case closed = <-closedc:
    case <-time.After(testing.LongWait):
        c.Fatalf("action never executed")
    }

otherwise the test will hang forever if the action never gets called.

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root_test....
state/apiserver/root_test.go:61: closeDiff := closed.Sub(broken).Nanoseconds() / 1000000
or:

    closeDiff := closed.Sub(broken) / time.Millisecond

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root_test....
state/apiserver/root_test.go:62: c.Assert(closeDiff >= 50 && closeDiff <= 60, gc.Equals, true)

    50 <= closeDiff && closeDiff <= 60

is a nice convention for range checks.

I'd also make the allowable range much larger to cater for dubious scheduling in the cloud, so that this doesn't break too often. Perhaps

    50 <= closeDiff && closeDiff <= 100
Please take a look.

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root.go
File state/apiserver/root.go (right):

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root.go#ne...
state/apiserver/root.go:33: const timeout = 3 * time.Minute
On 2013/11/19 14:44:08, rog wrote:
> s/timeout/maxPingInterval/
> ?
> (more obvious what the timeout is)
>
> I think I'd define it inside state/api. Then the api client
> code can use it to calculate an appropriate ping interval.

Renamed it and added a comment. I would not move it before the corresponding API changes, which could be done in another CL.

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root.go#ne...
state/apiserver/root.go:317: // that invokes the given action if there is more
On 2013/11/19 14:44:08, rog wrote:
> s/action/action asynchronously/

Done.

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root.go#ne...
state/apiserver/root.go:329: go pt.action()
On 2013/11/19 14:44:08, rog wrote:
> I don't think we want to call this if stop has been called.
> When I said "below" I meant in the <-timer.C arm of the select statement.

Ah, I had already wondered.

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root_test.go
File state/apiserver/root_test.go (right):

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root_test....
state/apiserver/root_test.go:53: for i := 0; i < 10; i++ {
On 2013/11/19 14:44:08, rog wrote:
> s/10/2/
>
> i don't think there's really any particular reason to make this test run any
> longer than necessary.

Done.

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root_test....
state/apiserver/root_test.go:60: closed := <-closedc
On 2013/11/19 14:44:08, rog wrote:
>     select {
>     case closed = <-closedc:
>     case <-time.After(testing.LongWait):
>         c.Fatalf("action never executed")
>     }
>
> otherwise the test will hang forever if the action never
> gets called.

I've had luck so far, but that's better.

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root_test....
state/apiserver/root_test.go:61: closeDiff := closed.Sub(broken).Nanoseconds() / 1000000
On 2013/11/19 14:44:08, rog wrote:
> or:
>     closeDiff := closed.Sub(broken) / time.Millisecond

Gnah! Sure. :)

https://codereview.appspot.com/24040044/diff/40001/state/apiserver/root_test....
state/apiserver/root_test.go:62: c.Assert(closeDiff >= 50 && closeDiff <= 60, gc.Equals, true)
On 2013/11/19 14:44:08, rog wrote:
>     50 <= closeDiff && closeDiff <= 60
>
> is a nice convention for range checks.
>
> I'd also make the allowable range much larger to cater
> for dubious scheduling in the cloud, so that this doesn't
> break too often.
>
> Perhaps
>
>     50 <= closeDiff && closeDiff <= 100

Done.
On 2013/11/19 15:52:37, mue wrote:
> Please take a look.
> [...]

LGTM, thanks.