|
|
Descriptionstate: add retry delay during mongo connection
If the connection to the mongo state database fails,
e.g. while the system is still bootstrapping, the
system immediately retries to connect until timeout.
This change adds an increasing delay between those
retries.
https://code.launchpad.net/~themue/juju-core/006-state-retry-delay/+merge/139745
(do not edit description out of merge proposal)
Patch Set 1 #Patch Set 2 : state: add retry delay during mongo connection #
Total comments: 1
Patch Set 3 : state: add retry delay during mongo connection #
Total comments: 8
MessagesTotal messages: 21
Please take a look.
Sign in to reply to this message.
On 2012/12/13 17:00:44, TheMue wrote: > Please take a look. This looks reasonable, but why not use the AttemptStrategy ?
Sign in to reply to this message.
On 2012/12/13 21:08:25, dfc wrote: > On 2012/12/13 17:00:44, TheMue wrote: > > Please take a look. > > This looks reasonable, but why not use the AttemptStrategy ? Lack of knowledge, could you tell me more about it?
Sign in to reply to this message.
Here ya go http://godoc.org/launchpad.net/juju-core/trivial#Attempt There are examples all the way through the goose and juju-core repos, all the workers use it, as well as ec2 for retrying eventually consistent operations. On Fri, Dec 14, 2012 at 8:11 AM, <themue@gmail.com> wrote: > On 2012/12/13 21:08:25, dfc wrote: >> >> On 2012/12/13 17:00:44, TheMue wrote: >> > Please take a look. > > >> This looks reasonable, but why not use the AttemptStrategy ? > > > Lack of knowledge, could you tell me more about it? > > https://codereview.appspot.com/6949044/
Sign in to reply to this message.
On 2012/12/13 21:16:44, dfc wrote: > Here ya go http://godoc.org/launchpad.net/juju-core/trivial#Attempt Great, thx, missed it. Interesting if different strategies are needed. But has to be extended for exponential behavior then. If we're interested in supporting different strategies here it's worth it.
Sign in to reply to this message.
Please take a look.
Sign in to reply to this message.
LGTM
Sign in to reply to this message.
Your overview says it adds "increasing delay", but AttemptStrategy seems to do a fixed delay. I guess you just changed your mind, but delaying does seem like a good idea. LGTM. https://codereview.appspot.com/6949044/diff/3002/state/open.go File state/open.go (right): https://codereview.appspot.com/6949044/diff/3002/state/open.go#newcode70 state/open.go:70: attempt.Next() don't you need to check if attempt.Next is returning false?
Sign in to reply to this message.
NOT LGTM. There is already a retry loop in the juju.Conn, why are we adding an inner loop ?
Sign in to reply to this message.
On 2012/12/17 10:00:32, dfc wrote: > NOT LGTM. > > There is already a retry loop in the juju.Conn, why are we adding an inner loop > ? There is no additional inner loop, only a Sleep() inside of Next() of the AttempStretegy. Increasing delays have to be implemented as an option for AttemptStretegy so that it could be used here later.
Sign in to reply to this message.
Check out juju.Conn. On Mon, Dec 17, 2012 at 9:04 PM, <themue@gmail.com> wrote: > On 2012/12/17 10:00:32, dfc wrote: >> >> NOT LGTM. > > >> There is already a retry loop in the juju.Conn, why are we adding an > > inner loop >> >> ? > > > There is no additional inner loop, only a Sleep() inside of Next() of > the AttempStretegy. > > Increasing delays have to be implemented as an option for > AttemptStretegy so that it could be used here later. > > https://codereview.appspot.com/6949044/
Sign in to reply to this message.
On 2012/12/17 10:05:34, dfc wrote: > Check out juju.Conn. > > On Mon, Dec 17, 2012 at 9:04 PM, <mailto:themue@gmail.com> wrote: > > On 2012/12/17 10:00:32, dfc wrote: > >> > >> NOT LGTM. > > > > > >> There is already a retry loop in the juju.Conn, why are we adding an > > > > inner loop > >> > >> ? > > > > > > There is no additional inner loop, only a Sleep() inside of Next() of > > the AttempStretegy. > > > > Increasing delays have to be implemented as an option for > > AttemptStretegy so that it could be used here later. > > > > https://codereview.appspot.com/6949044/ The error you had in the beginning (quick retries) is due to the mgo mechanism of calling the dial func in a loop until success or timeout.
Sign in to reply to this message.
On 2012/12/17 10:08:34, TheMue wrote: > On 2012/12/17 10:05:34, dfc wrote: > > Check out juju.Conn. > > > > On Mon, Dec 17, 2012 at 9:04 PM, <mailto:themue@gmail.com> wrote: > > > On 2012/12/17 10:00:32, dfc wrote: > > >> > > >> NOT LGTM. > > > > > > > > >> There is already a retry loop in the juju.Conn, why are we adding an > > > > > > inner loop > > >> > > >> ? > > > > > > > > > There is no additional inner loop, only a Sleep() inside of Next() of > > > the AttempStretegy. > > > > > > Increasing delays have to be implemented as an option for > > > AttemptStretegy so that it could be used here later. > > > > > > https://codereview.appspot.com/6949044/ > > The error you had in the beginning (quick retries) is due to the mgo mechanism > of calling the dial func in a loop until success or timeout. Hi Dave, could you please tell me more about the advantage of moving the delay into the Conn loop (together with a reduction of the mgo dial timeout) instead of a delay in the loop which is provided by the mgo dial mechanism? It would help me to decide if another change is needed or if the current approach is ok. Thx mue
Sign in to reply to this message.
On 18/12/2012, at 18:43, themue@gmail.com wrote: > On 2012/12/17 10:08:34, TheMue wrote: >> On 2012/12/17 10:05:34, dfc wrote: >> > Check out juju.Conn. >> > >> > On Mon, Dec 17, 2012 at 9:04 PM, <mailto:themue@gmail.com> wrote: >> > > On 2012/12/17 10:00:32, dfc wrote: >> > >> >> > >> NOT LGTM. >> > > >> > > >> > >> There is already a retry loop in the juju.Conn, why are we adding > an >> > > >> > > inner loop >> > >> >> > >> ? >> > > >> > > >> > > There is no additional inner loop, only a Sleep() inside of Next() > of >> > > the AttempStretegy. >> > > >> > > Increasing delays have to be implemented as an option for >> > > AttemptStretegy so that it could be used here later. >> > > >> > > https://codereview.appspot.com/6949044/ > >> The error you had in the beginning (quick retries) is due to the mgo > mechanism >> of calling the dial func in a loop until success or timeout. > > Hi Dave, > > could you please tell me more about the advantage of moving the delay > into the Conn loop (together with a reduction of the mgo dial timeout) > instead of a delay in the loop which is provided by the mgo dial > mechanism? It would help me to decide if another change is needed or if > the current approach is ok. > > Thx mue > > https://codereview.appspot.com/6949044/
Sign in to reply to this message.
Hi frank, I though you were on vacation? If so, this can easily wait til the new year. On 18/12/2012, at 18:43, themue@gmail.com wrote: > On 2012/12/17 10:08:34, TheMue wrote: >> On 2012/12/17 10:05:34, dfc wrote: >> > Check out juju.Conn. >> > >> > On Mon, Dec 17, 2012 at 9:04 PM, <mailto:themue@gmail.com> wrote: >> > > On 2012/12/17 10:00:32, dfc wrote: >> > >> >> > >> NOT LGTM. >> > > >> > > >> > >> There is already a retry loop in the juju.Conn, why are we adding > an >> > > >> > > inner loop >> > >> >> > >> ? >> > > >> > > >> > > There is no additional inner loop, only a Sleep() inside of Next() > of >> > > the AttempStretegy. >> > > >> > > Increasing delays have to be implemented as an option for >> > > AttemptStretegy so that it could be used here later. >> > > >> > > https://codereview.appspot.com/6949044/ > >> The error you had in the beginning (quick retries) is due to the mgo > mechanism >> of calling the dial func in a loop until success or timeout. > > Hi Dave, > > could you please tell me more about the advantage of moving the delay > into the Conn loop (together with a reduction of the mgo dial timeout) > instead of a delay in the loop which is provided by the mgo dial > mechanism? It would help me to decide if another change is needed or if > the current approach is ok. > > Thx mue > > https://codereview.appspot.com/6949044/
Sign in to reply to this message.
Please take a look.
Sign in to reply to this message.
LGTM. https://codereview.appspot.com/6949044/diff/12001/trivial/attempt.go File trivial/attempt.go (right): https://codereview.appspot.com/6949044/diff/12001/trivial/attempt.go#newcode56 trivial/attempt.go:56: a.delay = time.Duration(1.2 * float64(a.delay)) What motivated the choice of 1.2? https://codereview.appspot.com/6949044/diff/12001/trivial/trivial_test.go File trivial/trivial_test.go (right): https://codereview.appspot.com/6949044/diff/12001/trivial/trivial_test.go#new... trivial/trivial_test.go:20: delta := 10 * time.Millisecond I'm a bit concerned that it will be bard to test these entirely reliably -- but so long as you've run these tests in a few different situations I'm happy.
Sign in to reply to this message.
NOT LGTM. i don't believe this will fix the problem, as discussed online and outlined below. https://codereview.appspot.com/6949044/diff/12001/juju/conn.go File juju/conn.go (right): https://codereview.appspot.com/6949044/diff/12001/juju/conn.go#newcode29 juju/conn.go:29: Behaviour: trivial.ExponentialInterval, there's a reason this delay is short - this strategy is only here to redial when we get an unauthorized access error, which is only likely to happen while the bootstrap process is taking place, which usually only takes about half a seco nd (the 60 second Total here is to cater for excessive VM scheduler variance). the redial loop in NewConn will not fix the issue of rapid dial retries. i believe that's due to the mgo package which has a redial loop with no sleep, and that the correct fix is there.
Sign in to reply to this message.
a few comments on the attempt changes. https://codereview.appspot.com/6949044/diff/12001/trivial/attempt.go File trivial/attempt.go (right): https://codereview.appspot.com/6949044/diff/12001/trivial/attempt.go#newcode12 trivial/attempt.go:12: LinearInterval when would it be appropriate to use LinearInterval? https://codereview.appspot.com/6949044/diff/12001/trivial/attempt.go#newcode13 trivial/attempt.go:13: ExponentialInterval if we are going to have exponential backoff, i think we should do it right, avoid thundering herds, and add a random element so that multiple clients that are all running the same code don't create huge spikes as they all redial at the same time. that said, i'm not sure we need it yet, so i'd be tempted to drop this code for now. i've made some comments anyway, in case the code does land. https://codereview.appspot.com/6949044/diff/12001/trivial/attempt.go#newcode21 trivial/attempt.go:21: Behaviour IntervalBehaviour // Control the delays. we use US spelling as a convention, so Behavior would be more correct. also, i think we should have a maximum value, so the exponential backoff is truncated. for example, if a juju ec2 instance takes about 140 seconds before it accepts connections, using an initial delay of 1s and an exponent of 1.2, we're waiting more than 25 seconds between attempts by the time the instance comes on line. i think that's probably too much and we'd want to bound it. https://codereview.appspot.com/6949044/diff/12001/trivial/attempt.go#newcode56 trivial/attempt.go:56: a.delay = time.Duration(1.2 * float64(a.delay)) we don't need to convert to and from float64 here. a.delay = a.delay * 6 / 5 would be fine in this case. https://codereview.appspot.com/6949044/diff/12001/trivial/trivial_test.go File trivial/trivial_test.go (right): https://codereview.appspot.com/6949044/diff/12001/trivial/trivial_test.go#new... trivial/trivial_test.go:25: want := []time.Duration{0, 200 * time.Millisecond, 400 * time.Millisecond, this are verbose enough now that i'd be tempted to write a function: var quantum = time.Millisecond func durations(xs ...int) []time.Duration { ds := make(time.Duration, len(ms)) for i, m := range ms { ds = time.Duration(m) * quantum } return ds } then we can easily vary the quantum if we find that the test fails on some machines. also, the original test ran for only 0.25s, which worked fine, and it also checked that the final Next took no time, which is no longer checked. there's no way that testing this "trivial" function should add four seconds to our testing time. it might be better to consider a way of testing it without it actually sleeping (for example by making the tests define the "now" and "sleep" functions for the duration of the test)
Sign in to reply to this message.
Copy from the respective email thread: On Fri, Jan 18, 2013 at 8:20 AM, roger peppe <roger.peppe@canonical.com> wrote: >> Alternatively, lets just change the retry delay in juju.Conn to 1 second an call it a day. > > As I've tried to point out above, the retry delay in juju.Conn is irrelevant. > In all the live tests I've seen, I've never seen that delay being > exercised - it's > a highly unusual corner case. > > The right fix is in mgo. mgo doesn't retry every 250ms.. the original reason for the bug was purely to slow down the crazy punching and logging to more reasonable levels. That said, I agree with David. The level of importance and detail being given to that problem is over the top. There are conversations about this for more than *two months*, and this was supposed to be a trivial bug. Unless someone wants to propose that trivial branch, there are much more important things to be doing than fine tuning how fast we connect (!!!). gustavo @ http://niemeyer.net
Sign in to reply to this message.
On 18 January 2013 12:02, <n13m3y3r@gmail.com> wrote: > Copy from the respective email thread: > > On Fri, Jan 18, 2013 at 8:20 AM, roger peppe <roger.peppe@canonical.com> > wrote: >>> >>> Alternatively, lets just change the retry delay in juju.Conn to 1 > > second an call it a day. > >> As I've tried to point out above, the retry delay in juju.Conn is > > irrelevant. >> >> In all the live tests I've seen, I've never seen that delay being >> exercised - it's >> a highly unusual corner case. > > >> The right fix is in mgo. > > > mgo doesn't retry every 250ms.. True. mgo seems to try about three times every 500ms, averaging once every 185ms. http://play.golang.org/p/f4XEMFXKDz > Unless someone wants to propose that trivial branch, there are much > more important things to be doing than fine tuning how fast we connect > (!!!). I agree with that.
Sign in to reply to this message.
|