|
|
Created:
12 years, 6 months ago by dave Modified:
12 years, 5 months ago Reviewers:
mp+109750 Visibility:
Public. |
Descriptioncmd/jujud: add retry strategy
Adding a simple 10 second delay between retry attempts.
https://code.launchpad.net/~dave-cheney/juju-core/go-cmd-jujud-provisioning-retry/+merge/109750
Requires: https://code.launchpad.net/~dave-cheney/juju-core/go-juju-short-attempt/+merge/109744
(do not edit description out of merge proposal)
Patch Set 1 #Patch Set 2 : cmd/jujud: add retry strategy #
Total comments: 2
Patch Set 3 : cmd/jujud: add retry strategy #Patch Set 4 : cmd/jujud: add retry strategy #Patch Set 5 : cmd/jujud: add retry strategy #Patch Set 6 : cmd/jujud: add retry strategy #
Total comments: 14
Patch Set 7 : cmd/jujud: add retry strategy #
Total comments: 1
Patch Set 8 : cmd/jujud: add retry strategy #Patch Set 9 : cmd/jujud: add retry strategy #
Total comments: 2
Patch Set 10 : cmd/jujud: add retry strategy #
Total comments: 5
Patch Set 11 : cmd/jujud: add retry strategy #Patch Set 12 : cmd/jujud: add retry strategy #Patch Set 13 : cmd/jujud: add retry strategy #Patch Set 14 : cmd/jujud: add retry strategy #
MessagesTotal messages: 25
Please take a look.
Sign in to reply to this message.
On 2012/06/11 23:26:20, dfc wrote: > Please take a look. Before I can write a functional test for this, the PA will have to (shortly) grow the ability to start and stop machines.
Sign in to reply to this message.
https://codereview.appspot.com/6295067/diff/2001/cmd/jujud/provisioning.go File cmd/jujud/provisioning.go (right): https://codereview.appspot.com/6295067/diff/2001/cmd/jujud/provisioning.go#ne... cmd/jujud/provisioning.go:47: for attempt := lengthyAttempt.Start(); attempt.Next(); { i'm not convinced that the AttemptStrategy type is worth it here. why not just: for { ... time.Sleep(10 * time.Second) } ? also, why wait 10 seconds always? might it not be better to have at least one attempt at retrying immediately before falling back to occasional retries?
Sign in to reply to this message.
This was an exercise in reusing the ec2 retry code. It's not a perfect fit, evidenced by the requirement to set the total timeout to a year. I'm happy do to it either way based on feedback. On 12/06/2012, at 19:38, rogpeppe@gmail.com wrote: > > https://codereview.appspot.com/6295067/diff/2001/cmd/jujud/provisioning.go > File cmd/jujud/provisioning.go (right): > > https://codereview.appspot.com/6295067/diff/2001/cmd/jujud/provisioning.go#ne... > cmd/jujud/provisioning.go:47: for attempt := lengthyAttempt.Start(); > attempt.Next(); { > i'm not convinced that the AttemptStrategy type is worth it here. > why not just: > > for { > ... > time.Sleep(10 * time.Second) > } > ? > > also, why wait 10 seconds always? might it not be better to have at > least one attempt at retrying immediately before falling back to > occasional retries? > > https://codereview.appspot.com/6295067/
Sign in to reply to this message.
Please take a look. https://codereview.appspot.com/6295067/diff/2001/cmd/jujud/provisioning.go File cmd/jujud/provisioning.go (right): https://codereview.appspot.com/6295067/diff/2001/cmd/jujud/provisioning.go#ne... cmd/jujud/provisioning.go:47: for attempt := lengthyAttempt.Start(); attempt.Next(); { On 2012/06/12 09:38:51, rog wrote: > i'm not convinced that the AttemptStrategy type is worth it here. > why not just: > > for { > ... > time.Sleep(10 * time.Second) > } > ? > > also, why wait 10 seconds always? might it not be better to have at least one > attempt at retrying immediately before falling back to occasional retries? This was an exercise in reusing the ec2 retry code. It's not a perfect fit, evidenced by the requirement to set the total timeout to a year.
Sign in to reply to this message.
This CL is dirty (includes unrelated changes). The likely cause of that is a bug in lbox: the pre-req branch was merged, and the branch being reviewed here was updated with the content from trunk, so the delta between the *original* pre-req branch and this branch include changes from trunk. To fix that right now, merge trunk onto the pre-req branch locally and push the pre-req again. Then, re-propose this branch. I'm sorry for the trouble. I'll try to fix lbox this week still so it doesn't cause this kind of pain anymore.
Sign in to reply to this message.
Please take a look.
Sign in to reply to this message.
On 2012/06/13 02:52:47, dfc wrote: > Please take a look. Taking into consideration, comment #2, https://codereview.appspot.com/6295067/#msg2
Sign in to reply to this message.
The logic is looking nice. A few details to polish: https://codereview.appspot.com/6295067/diff/3020/cmd/jujud/provisioning.go File cmd/jujud/provisioning.go (right): https://codereview.appspot.com/6295067/diff/3020/cmd/jujud/provisioning.go#ne... cmd/jujud/provisioning.go:46: var err error Please drop this line. The use of err here is misleading and unnecessary. The first line within the for creates a new "err" variable pointing at a different address. https://codereview.appspot.com/6295067/diff/3020/cmd/jujud/provisioning.go#ne... cmd/jujud/provisioning.go:47: for attempt := lengthyAttempt.Start(); attempt.Next(); { Agree with Roger.. this seems like a verbose implementation of "for { ...; time.Sleep(10) }". https://codereview.appspot.com/6295067/diff/3020/cmd/jujud/provisioning.go#ne... cmd/jujud/provisioning.go:50: log.Printf("provisioner could not connect to zookeeper: %v", err) s/zookeeper/environment/ The log also seems misplaced. It's asking for a NewProvisioner, and that's failing. How can it tell that the issue is that it can't connect, specifically? I suggest wording it as "provisioner can't open environment: %v", and moving it to the appropriate place inside NewProvisioner. https://codereview.appspot.com/6295067/diff/3020/cmd/jujud/provisioning.go#ne... cmd/jujud/provisioning.go:55: // impossible at this point Stop causes it to exit with nil.. what's impossible? https://codereview.appspot.com/6295067/diff/3020/cmd/jujud/provisioning.go#ne... cmd/jujud/provisioning.go:57: break return nil The "return err" after the break is misleading. The variable is not pointing to the same address as the err we have here, as described above. https://codereview.appspot.com/6295067/diff/3020/cmd/jujud/provisioning.go#ne... cmd/jujud/provisioning.go:59: log.Printf("provisioner reported error, retrying: %v", err) "restarting provisioner after error: %v" The "retrying: <error>" was slightly misleading. The error isn't for the retry. https://codereview.appspot.com/6295067/diff/3020/cmd/jujud/provisioning.go#ne... cmd/jujud/provisioning.go:61: return err panic("unreachable") See above.
Sign in to reply to this message.
Please take a look. https://codereview.appspot.com/6295067/diff/3020/cmd/jujud/provisioning.go File cmd/jujud/provisioning.go (right): https://codereview.appspot.com/6295067/diff/3020/cmd/jujud/provisioning.go#ne... cmd/jujud/provisioning.go:46: var err error On 2012/06/13 20:01:16, niemeyer wrote: > Please drop this line. The use of err here is misleading and unnecessary. The > first line within the for creates a new "err" variable pointing at a different > address. Yup, my mistake. https://codereview.appspot.com/6295067/diff/3020/cmd/jujud/provisioning.go#ne... cmd/jujud/provisioning.go:47: for attempt := lengthyAttempt.Start(); attempt.Next(); { On 2012/06/13 20:01:16, niemeyer wrote: > Agree with Roger.. this seems like a verbose implementation of "for { ...; > time.Sleep(10) }". I am inclined to agree. Let's give this one more pass. If it is still a poor fit, then i'll abandon this proposal. https://codereview.appspot.com/6295067/diff/3020/cmd/jujud/provisioning.go#ne... cmd/jujud/provisioning.go:55: // impossible at this point On 2012/06/13 20:01:16, niemeyer wrote: > Stop causes it to exit with nil.. what's impossible? There is no code path which will cause the p.Stop() to be called. p.Wait() will always exit with an non nil error (otherwise it wouldn't exit). https://codereview.appspot.com/6295067/diff/3020/cmd/jujud/provisioning.go#ne... cmd/jujud/provisioning.go:57: break On 2012/06/13 20:01:16, niemeyer wrote: > return nil > > The "return err" after the break is misleading. The variable is not pointing to > the same address as the err we have here, as described above. Fixed. https://codereview.appspot.com/6295067/diff/3020/cmd/jujud/provisioning.go#ne... cmd/jujud/provisioning.go:59: log.Printf("provisioner reported error, retrying: %v", err) On 2012/06/13 20:01:16, niemeyer wrote: > "restarting provisioner after error: %v" > > The "retrying: <error>" was slightly misleading. The error isn't for the retry. Done. https://codereview.appspot.com/6295067/diff/3020/cmd/jujud/provisioning.go#ne... cmd/jujud/provisioning.go:61: return err On 2012/06/13 20:01:16, niemeyer wrote: > panic("unreachable") > > See above. Done.
Sign in to reply to this message.
https://codereview.appspot.com/6295067/diff/3020/cmd/jujud/provisioning.go File cmd/jujud/provisioning.go (right): https://codereview.appspot.com/6295067/diff/3020/cmd/jujud/provisioning.go#ne... cmd/jujud/provisioning.go:47: for attempt := lengthyAttempt.Start(); attempt.Next(); { On 2012/06/13 23:59:19, dfc wrote: > I am inclined to agree. Let's give this one more pass. If it is still a poor > fit, then i'll abandon this proposal. I still think it's a poor fit, at least until we find a reason for Total to not be "infinite", or for this to have other benefits on top of the plain Sleep call. https://codereview.appspot.com/6295067/diff/8002/cmd/jujud/provisioning.go File cmd/jujud/provisioning.go (right): https://codereview.appspot.com/6295067/diff/8002/cmd/jujud/provisioning.go#ne... cmd/jujud/provisioning.go:52: // not return until the tomb is closed by an error. > There is no code path which will cause the p.Stop() to > be called. p.Wait() will always exit with an non nil > error (otherwise it wouldn't exit). The Provisioner has a Stop method, that is part of the API of this very type, that causes this logic to return with nil. We must be able to call this method, because we'll want to stop the provisioner for sure to halt background activity, but really it doesn't matter if we're calling it or not.. the implementation is internally broken, and the comment above disagrees with the code that is in place just a few lines below.
Sign in to reply to this message.
> The Provisioner has a Stop method, that is part of the API of this very type, > that causes this logic to return with nil. We must be able to call this method, > because we'll want to stop the provisioner for sure to halt background activity, > but really it doesn't matter if we're calling it or not.. the implementation is > internally broken, and the comment above disagrees with the code that is in > place just a few lines below. Sorry, I was not clear in my initial response. There is no code path within jujud which calls p.Stop so p.Wait() will always return a non nil value. I think it is simpler if I just remove this comment.
Sign in to reply to this message.
Please take a look.
Sign in to reply to this message.
On 2012/06/15 01:38:16, dfc wrote: > Please take a look. *le sigh* hang on, branch is dirty, i have to push to the prereq.
Sign in to reply to this message.
Please take a look.
Sign in to reply to this message.
On 2012/06/15 01:42:12, dfc wrote: > Please take a look. LGTM but... no test?
Sign in to reply to this message.
Can't test til the stop/start support lands. Would you like me to hold this branch til then ? On Fri, Jun 15, 2012 at 4:02 PM, <rogpeppe@gmail.com> wrote: > On 2012/06/15 01:42:12, dfc wrote: >> >> Please take a look. > > > LGTM but... no test? > > https://codereview.appspot.com/6295067/
Sign in to reply to this message.
no, i'm fine for it to go in now. we'll remember it, i'm sure. On 15 June 2012 07:04, Dave Cheney <dave@cheney.net> wrote: > Can't test til the stop/start support lands. Would you like me to hold > this branch til then ? > > On Fri, Jun 15, 2012 at 4:02 PM, <rogpeppe@gmail.com> wrote: >> On 2012/06/15 01:42:12, dfc wrote: >>> >>> Please take a look. >> >> >> LGTM but... no test? >> >> https://codereview.appspot.com/6295067/
Sign in to reply to this message.
https://codereview.appspot.com/6295067/diff/9008/cmd/jujud/provisioning.go File cmd/jujud/provisioning.go (right): https://codereview.appspot.com/6295067/diff/9008/cmd/jujud/provisioning.go#ne... cmd/jujud/provisioning.go:46: log.Printf("restarting provisioner after error: %v", err) Dave, this is still not handling the reported issue entirely. The purpose of this branch is to sort out retries, and it's not doing that properly. If we call Stop it has to stop, with errors or not.
Sign in to reply to this message.
Please take a look. https://codereview.appspot.com/6295067/diff/9008/cmd/jujud/provisioning.go File cmd/jujud/provisioning.go (right): https://codereview.appspot.com/6295067/diff/9008/cmd/jujud/provisioning.go#ne... cmd/jujud/provisioning.go:46: log.Printf("restarting provisioner after error: %v", err) On 2012/06/15 21:33:13, niemeyer wrote: > Dave, this is still not handling the reported issue entirely. The purpose of > this branch is to sort out retries, and it's not doing that properly. If we call > Stop it has to stop, with errors or not. Done. nil is now a signal to jujud to exit cleanly.
Sign in to reply to this message.
I suspect it looks good, but can you please push it again since the diff is dirty right now. Also, a minor TODO to add there: https://codereview.appspot.com/6295067/diff/4003/cmd/jujud/provisioning.go File cmd/jujud/provisioning.go (right): https://codereview.appspot.com/6295067/diff/4003/cmd/jujud/provisioning.go#ne... cmd/jujud/provisioning.go:46: } // TODO If the provisioner is stopped, we should really exit // here, even if an error comes out.
Sign in to reply to this message.
Sorry, I didn't realise this one had a prereq. Should be fixed now. https://codereview.appspot.com/6295067/diff/4003/cmd/jujud/provisioning.go File cmd/jujud/provisioning.go (right): https://codereview.appspot.com/6295067/diff/4003/cmd/jujud/provisioning.go#ne... cmd/jujud/provisioning.go:46: } On 2012/06/19 10:34:08, niemeyer wrote: > // TODO If the provisioner is stopped, we should really exit > // here, even if an error comes out. Done.
Sign in to reply to this message.
Please take a look. https://codereview.appspot.com/6295067/diff/4003/cmd/jujud/provisioning.go File cmd/jujud/provisioning.go (right): https://codereview.appspot.com/6295067/diff/4003/cmd/jujud/provisioning.go#ne... cmd/jujud/provisioning.go:46: } On 2012/06/19 10:47:55, dfc wrote: > On 2012/06/19 10:34:08, niemeyer wrote: > > // TODO If the provisioner is stopped, we should really exit > > // here, even if an error comes out. > > Done. Done, but I think this has highlighted an ambiguity. I don't know how to tell the difference between an error due something like p.Stop(), resulting in a watching returning and error while closing, and an error that should be retried, ie, a state brainfart.
Sign in to reply to this message.
LGTM, thanks, and sorry for the trouble on this one. https://codereview.appspot.com/6295067/diff/4003/cmd/jujud/provisioning.go File cmd/jujud/provisioning.go (right): https://codereview.appspot.com/6295067/diff/4003/cmd/jujud/provisioning.go#ne... cmd/jujud/provisioning.go:46: } On 2012/06/19 10:55:10, dfc wrote: > Done, but I think this has highlighted an ambiguity. I don't know how to tell > the difference between an error due something like p.Stop(), resulting in a > watching returning and error while closing, and an error that should be retried, > ie, a state brainfart. Something like p.CleanlyStopped() maybe? For clarity, we'll need to implement proper dying because we'll need to watch notifications of upgrades and exit so that a new version can come up, for example.
Sign in to reply to this message.
*** Submitted: cmd/jujud: add retry strategy Adding a simple 10 second delay between retry attempts. R=rog, niemeyer CC= https://codereview.appspot.com/6295067 https://codereview.appspot.com/6295067/diff/4003/cmd/jujud/provisioning.go File cmd/jujud/provisioning.go (right): https://codereview.appspot.com/6295067/diff/4003/cmd/jujud/provisioning.go#ne... cmd/jujud/provisioning.go:46: } > Something like p.CleanlyStopped() maybe? > > For clarity, we'll need to implement proper dying because we'll need to watch > notifications of upgrades and exit so that a new version can come up, for > example. Done. For the moment nil is interpreted as a signal to exit the retry loop.
Sign in to reply to this message.
|