|
|
DescriptionAdd a provisioner safe mode
A new environ setting provisioner-safe-mode is
introduced (default = false). If true, then the
environ provisioner will not stop instances that
it does not know about. I had to muck about a
little with the logic to handle dead machines
to get the starts to align.
See bug 1254729
https://code.launchpad.net/~wallyworld/juju-core/provisioner-safe-mode/+merge/196666
(do not edit description out of merge proposal)
Patch Set 1 #
Total comments: 24
Patch Set 2 : Add a provisioner safe mode #
Total comments: 4
Patch Set 3 : Add a provisioner safe mode #Patch Set 4 : Add a provisioner safe mode #Patch Set 5 : Add a provisioner safe mode #
Total comments: 4
Patch Set 6 : Add a provisioner safe mode #Patch Set 7 : Add a provisioner safe mode #
Total comments: 11
MessagesTotal messages: 16
Please take a look.
Sign in to reply to this message.
A few thoughts prior to a longer pass. https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... File worker/provisioner/provisioner_task.go (left): https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... worker/provisioner/provisioner_task.go:275: logger.Tracef("unknown: %v", unknown) I think it would be very useful to have this logged at least at INFO. (we do log info by default now, right?) https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... File worker/provisioner/provisioner_task.go (right): https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... worker/provisioner/provisioner_task.go:76: safeModeUnlocked bool I really don't find this "unlocked" vocabulary very helpful, but I guess it's consistent. The mutex controlling it ought to be somewhere within sight though... https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... worker/provisioner/provisioner_task.go:133: var safeModeMutex sync.Mutex Yeah, please don't make this global. https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... worker/provisioner/provisioner_task.go:176: if err := machine.Remove(); err != nil { I don't really love this, because it potentially leaks instances -- if the provisioner bounces in safe mode, all those "unknown" instances will be retained even though we damn well ought to know that they're safe to remove. https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... worker/provisioner/provisioner_task.go:286: // and it will be included in the stopping set anyway. I don't think this is true. Dead machines can certainly have instance ids. I *think* the code's ok, but the comment is misleading. https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... File worker/provisioner/provisioner_test.go (right): https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... worker/provisioner/provisioner_test.go:656: defer stop(c, p) Is no way to test that the provisioner responds to changes in the environ config? Aren't there tests like that already somewhere?
Sign in to reply to this message.
The basic structure looks good (in particular an environ config setting seems like a good approach) but is it necessary to change all that logic in the provisioner task? https://codereview.appspot.com/32710043/diff/1/environs/config/config.go File environs/config/config.go (right): https://codereview.appspot.com/32710043/diff/1/environs/config/config.go#newc... environs/config/config.go:580: "provisioner-safe-mode": false, Can't we just make this into: "provisioner-safe-mode": schema.Omit, and put it at the top? https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... File worker/provisioner/provisioner_task.go (right): https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... worker/provisioner/provisioner_task.go:33: // runs in safe mode or not. perhaps this is a reasonable place to explain what safe mode actually entails? https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... worker/provisioner/provisioner_task.go:133: var safeModeMutex sync.Mutex On 2013/11/26 09:07:38, fwereade wrote: > Yeah, please don't make this global. +1 https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... worker/provisioner/provisioner_task.go:174: for _, machine := range dead { This all seems like more change than is necessary. I'm nervous about the significant logic changes proposed here. For reference, this is the change that I made when I tweaked the provisioner to do the same thing: // Stop all machines that are dead stopping := task.instancesForMachines(dead) // Find running instances that have no machines associated unknown, err := task.findUnknownInstances(stopping) if err != nil { return err } logger.Infof("ignoring unknown instances %v", unknown) unknown = nil // It's important that we stop unknown instances before starting // pending ones, because if we start an instance and then fail to // set its InstanceId on the machine we don't want to start a new // instance for the same machine ID. if err := task.stopInstances(append(stopping, unknown...)); err != nil { return err } In your case, you would end up something like this: // Stop all machines that are dead stopping := task.instancesForMachines(dead) // Find running instances that have no machines associated unknown, err := task.findUnknownInstances(stopping) if err != nil { return err } if task.safeMode() { logger.Infof("ignoring unknown instances %v", unknown) unknown = nil } // It's important that we stop unknown instances before starting // pending ones, because if we start an instance and then fail to // set its InstanceId on the machine we don't want to start a new // instance for the same machine ID. if err := task.stopInstances(append(stopping, unknown...)); err != nil { return err } That is, it's a 4 line tweak that seems to me to be fairly obviously correct. I'm not sure that any other logic needs to be changed, does it?
Sign in to reply to this message.
I think I'm with roger, just clearing out unknown seems cleaner. And I'd like to get more stuff logged at INFO level here, the lack of logging for actually stopping machines makes it harder to be sure that this change wasn't necessary. https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... File worker/provisioner/provisioner_task.go (right): https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... worker/provisioner/provisioner_task.go:138: task.safeModeUnlocked = safeMode Should we run processMachines here? If not, setting safe-mode back to false won't stop unknown instances until we get a fresh change from the watcher.
Sign in to reply to this message.
On 2013/11/26 09:43:00, fwereade wrote: > I think I'm with roger, just clearing out unknown seems cleaner. Ah, ofc -- we just drop the removed machines from the map, leading them to be treated as unknown. Bah. Objections rescinded.
Sign in to reply to this message.
Please take a look. https://codereview.appspot.com/32710043/diff/1/environs/config/config.go File environs/config/config.go (right): https://codereview.appspot.com/32710043/diff/1/environs/config/config.go#newc... environs/config/config.go:580: "provisioner-safe-mode": false, On 2013/11/26 09:36:30, rog wrote: > Can't we just make this into: > > "provisioner-safe-mode": schema.Omit, > > and put it at the top? No, because then return c.m["provisioner-safe-mode"].(bool) returns nil if the attr is not set. And we get a nil pointer error in the provisioner. Unless I'm not understanding something. https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... File worker/provisioner/provisioner_task.go (left): https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... worker/provisioner/provisioner_task.go:275: logger.Tracef("unknown: %v", unknown) On 2013/11/26 09:07:38, fwereade wrote: > I think it would be very useful to have this logged at least at INFO. (we do log > info by default now, right?) Done. https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... File worker/provisioner/provisioner_task.go (right): https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... worker/provisioner/provisioner_task.go:33: // runs in safe mode or not. On 2013/11/26 09:36:30, rog wrote: > perhaps this is a reasonable place to explain what safe mode actually entails? Done. https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... worker/provisioner/provisioner_task.go:76: safeModeUnlocked bool On 2013/11/26 09:07:38, fwereade wrote: > I really don't find this "unlocked" vocabulary very helpful, but I guess it's > consistent. The mutex controlling it ought to be somewhere within sight > though... Done. https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... worker/provisioner/provisioner_task.go:133: var safeModeMutex sync.Mutex On 2013/11/26 09:36:30, rog wrote: > On 2013/11/26 09:07:38, fwereade wrote: > > Yeah, please don't make this global. > > +1 Done. https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... worker/provisioner/provisioner_task.go:138: task.safeModeUnlocked = safeMode On 2013/11/26 09:43:00, fwereade wrote: > Should we run processMachines here? If not, setting safe-mode back to false > won't stop unknown instances until we get a fresh change from the watcher. If we do run processMachines, what ids do we pass in? https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... worker/provisioner/provisioner_task.go:176: if err := machine.Remove(); err != nil { On 2013/11/26 09:07:38, fwereade wrote: > I don't really love this, because it potentially leaks instances -- if the > provisioner bounces in safe mode, all those "unknown" instances will be retained > even though we damn well ought to know that they're safe to remove. Well, isn't that the idea of safe mode? Not to delete stuff? https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... worker/provisioner/provisioner_task.go:286: // and it will be included in the stopping set anyway. On 2013/11/26 09:07:38, fwereade wrote: > I don't think this is true. Dead machines can certainly have instance ids. > > I *think* the code's ok, but the comment is misleading. Yeah. The instance id is no longer available machine.Remove() is called and in the tests I was running the machine was dead and then removed so I got my wires crossed. https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... File worker/provisioner/provisioner_test.go (right): https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... worker/provisioner/provisioner_test.go:656: defer stop(c, p) On 2013/11/26 09:07:38, fwereade wrote: > Is no way to test that the provisioner responds to changes in the environ > config? Aren't there tests like that already somewhere? Not sure. I knew this was an omission. I couldn't see an easy way to set up such a test. I'll have another go.
Sign in to reply to this message.
Please take a look.
Sign in to reply to this message.
Looks great, thanks. A few minor comments below. https://codereview.appspot.com/32710043/diff/1/environs/config/config.go File environs/config/config.go (right): https://codereview.appspot.com/32710043/diff/1/environs/config/config.go#newc... environs/config/config.go:580: "provisioner-safe-mode": false, On 2013/11/26 11:47:52, wallyworld wrote: > On 2013/11/26 09:36:30, rog wrote: > > Can't we just make this into: > > > > "provisioner-safe-mode": schema.Omit, > > > > and put it at the top? > No, because then > > return c.m["provisioner-safe-mode"].(bool) > > returns nil if the attr is not set. And we get a nil pointer error in the > provisioner. Unless I'm not understanding something. Don't do that. Instead do: safeMode, _ := c.m["provisioner-safe-mode"].(bool) return safeMode which will work if the attribute is not present. https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... File worker/provisioner/provisioner_task.go (right): https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... worker/provisioner/provisioner_task.go:186: if err := task.stopInstances(append(stopping, unknown...)); err != nil { I think perhaps this should be moved above the machine.Remove logic. Otherwise, consider what happens if we're in safe mode and the provisioner bounces just before this line of code. AFAICS in that case we will leak the instances, although we should not. If we stop the instances *before* removing their machines from the state, then I *think* things are better, because the worst that can happen is that we try to stop the instances again. https://codereview.appspot.com/32710043/diff/20001/worker/provisioner/provisi... File worker/provisioner/provisioner_task.go (right): https://codereview.appspot.com/32710043/diff/20001/worker/provisioner/provisi... worker/provisioner/provisioner_task.go:170: unknown, err = task.findUnknownInstances(stopping) I wonder if it would be nice to check for unknown instances regardless of safe mode, so we can log that we've seen the unknown instances but are ignoring them. You could then move the "unknown instances" log line to here, saying a little more about what we're doing (or not doing) with them. https://codereview.appspot.com/32710043/diff/20001/worker/provisioner/provisi... worker/provisioner/provisioner_task.go:288: // If a machine is dead, it is already in stopping and Is this test necessary? I think I'd be inclined to code defensively and get the instance id of all machines including dead ones, even though they will probably be in the stopping list. Less code and more obviously correct without needing to know the calling context, I think. I can't see that performance is a particular issue here.
Sign in to reply to this message.
Please take a look. https://codereview.appspot.com/32710043/diff/1/environs/config/config.go File environs/config/config.go (right): https://codereview.appspot.com/32710043/diff/1/environs/config/config.go#newc... environs/config/config.go:580: "provisioner-safe-mode": false, On 2013/11/26 12:28:19, rog wrote: > On 2013/11/26 11:47:52, wallyworld wrote: > > On 2013/11/26 09:36:30, rog wrote: > > > Can't we just make this into: > > > > > > "provisioner-safe-mode": schema.Omit, > > > > > > and put it at the top? > > No, because then > > > > return c.m["provisioner-safe-mode"].(bool) > > > > returns nil if the attr is not set. And we get a nil pointer error in the > > provisioner. Unless I'm not understanding something. > > Don't do that. > Instead do: > > safeMode, _ := c.m["provisioner-safe-mode"].(bool) > return safeMode > > which will work if the attribute is not present. Done. https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... File worker/provisioner/provisioner_task.go (right): https://codereview.appspot.com/32710043/diff/1/worker/provisioner/provisioner... worker/provisioner/provisioner_task.go:186: if err := task.stopInstances(append(stopping, unknown...)); err != nil { On 2013/11/26 12:28:19, rog wrote: > I think perhaps this should be moved above the machine.Remove logic. > Otherwise, consider what happens if we're in safe mode and the provisioner > bounces just before this line of code. AFAICS in that case we will leak the > instances, although we should not. > > If we stop the instances *before* removing their machines from the state, then I > *think* things are better, because the worst that can happen is that we try to > stop the instances again. Done. https://codereview.appspot.com/32710043/diff/20001/worker/provisioner/provisi... File worker/provisioner/provisioner_task.go (right): https://codereview.appspot.com/32710043/diff/20001/worker/provisioner/provisi... worker/provisioner/provisioner_task.go:170: unknown, err = task.findUnknownInstances(stopping) On 2013/11/26 12:28:19, rog wrote: > I wonder if it would be nice to check for unknown instances regardless of safe > mode, so we can log that we've seen the unknown instances but are ignoring them. > > You could then move the "unknown instances" log line to here, saying a little > more about what we're doing (or not doing) with them. Done. https://codereview.appspot.com/32710043/diff/20001/worker/provisioner/provisi... worker/provisioner/provisioner_task.go:288: // If a machine is dead, it is already in stopping and On 2013/11/26 12:28:19, rog wrote: > Is this test necessary? > I think I'd be inclined to code defensively and get the instance id of all > machines including dead ones, even though they will probably be in the stopping > list. Less code and more obviously correct without needing to know the calling > context, I think. I can't see that performance is a particular issue here. Done.
Sign in to reply to this message.
Please take a look.
Sign in to reply to this message.
Please take a look.
Sign in to reply to this message.
Please take a look.
Sign in to reply to this message.
couple more thoughts https://codereview.appspot.com/32710043/diff/120001/worker/provisioner/provis... File worker/provisioner/provisioner_task.go (right): https://codereview.appspot.com/32710043/diff/120001/worker/provisioner/provis... worker/provisioner/provisioner_task.go:143: if err := task.processMachines(nil); err != nil { Can we be sure that machinesMap is filled in at this point? If there's any code path by which it might not be, won't this be equivalent to, uh, stopping all instances? I'll read through some more but it's not immediately apparent (to me, at least). https://codereview.appspot.com/32710043/diff/120001/worker/provisioner/provis... worker/provisioner/provisioner_task.go:156: // one back in. Expand please... why wouldn't we want to block? Even if we really really *don't* want to block until the message is handled -- which STM to explode the possible state space and make analysis much harder -- we should *definitely* handle the tomb dying in any select. https://codereview.appspot.com/32710043/diff/120001/worker/provisioner/provis... worker/provisioner/provisioner_task.go:166: task.safeModeChan <- safeMode I'm pretty sure this method will fail nastily if called concurrently. That we don't call it concurrently is IMO not much of a defence...
Sign in to reply to this message.
https://codereview.appspot.com/32710043/diff/120001/worker/provisioner/provis... File worker/provisioner/provisioner_task.go (right): https://codereview.appspot.com/32710043/diff/120001/worker/provisioner/provis... worker/provisioner/provisioner_task.go:143: if err := task.processMachines(nil); err != nil { On 2013/11/27 10:52:24, fwereade wrote: > Can we be sure that machinesMap is filled in at this point? If there's any code > path by which it might not be, won't this be equivalent to, uh, stopping all > instances? > > I'll read through some more but it's not immediately apparent (to me, at least). processMachines itself populates the machine map https://codereview.appspot.com/32710043/diff/120001/worker/provisioner/provis... worker/provisioner/provisioner_task.go:156: // one back in. On 2013/11/27 10:52:24, fwereade wrote: > Expand please... why wouldn't we want to block? Even if we really really *don't* > want to block until the message is handled -- which STM to explode the possible > state space and make analysis much harder -- we should *definitely* handle the > tomb dying in any select. Roger? https://codereview.appspot.com/32710043/diff/120001/worker/provisioner/provis... worker/provisioner/provisioner_task.go:166: task.safeModeChan <- safeMode On 2013/11/27 10:52:24, fwereade wrote: > I'm pretty sure this method will fail nastily if called concurrently. That we > don't call it concurrently is IMO not much of a defence... I was using a mutex but Roger recommended this code.
Sign in to reply to this message.
I'll take this forward, addressing the comments below, and others. https://codereview.appspot.com/32710043/diff/80001/environs/config/config.go File environs/config/config.go (right): https://codereview.appspot.com/32710043/diff/80001/environs/config/config.go#... environs/config/config.go:595: "provisioner-safe-mode": false, You don't need this (it's overridden from alwaysOptional) https://codereview.appspot.com/32710043/diff/80001/environs/config/config_tes... File environs/config/config_test.go (right): https://codereview.appspot.com/32710043/diff/80001/environs/config/config_tes... environs/config/config_test.go:801: } else { c.Assert(cfg.ProvisionerSafeMode(), jc.IsFalse) } ? https://codereview.appspot.com/32710043/diff/80001/worker/provisioner/provisi... File worker/provisioner/provisioner_task.go (right): https://codereview.appspot.com/32710043/diff/80001/worker/provisioner/provisi... worker/provisioner/provisioner_task.go:175: } } else { logger.Infof("stopping unknown instances %v", unknown) } (and lose the logger call in findUnknownInstances) ? https://codereview.appspot.com/32710043/diff/120001/worker/provisioner/provis... File worker/provisioner/provisioner_task.go (right): https://codereview.appspot.com/32710043/diff/120001/worker/provisioner/provis... worker/provisioner/provisioner_task.go:143: if err := task.processMachines(nil); err != nil { On 2013/11/27 10:58:10, wallyworld wrote: > On 2013/11/27 10:52:24, fwereade wrote: > > Can we be sure that machinesMap is filled in at this point? If there's any > code > > path by which it might not be, won't this be equivalent to, uh, stopping all > > instances? > > > > I'll read through some more but it's not immediately apparent (to me, at > least). > > processMachines itself populates the machine map Actually, I think William has a good point. The machine map is populated with ids that are passed into processMachines. I think we should avoid reading on safeModeChan until we have processed the first event from Changes. https://codereview.appspot.com/32710043/diff/120001/worker/provisioner/provis... worker/provisioner/provisioner_task.go:156: // one back in. On 2013/11/27 10:58:10, wallyworld wrote: > On 2013/11/27 10:52:24, fwereade wrote: > > Expand please... why wouldn't we want to block? Even if we really really > *don't* > > want to block until the message is handled -- which STM to explode the > possible > > state space and make analysis much harder -- we should *definitely* handle the > > tomb dying in any select. > > Roger? The "not block" thing is mainly just paranoia. If we ensure that it never blocks, then there is no way that we can have any kind of deadlock, regardless of the implementation of either task. We could guard against the possibility of concurrent calls to SetSafeMode by adding a mutex, but you're probably right that doing a blocking send, also selecting on dying, is a better approach. This does require a slightly more detailed analysis of the system to verify correctness though. That looks ok though. https://codereview.appspot.com/32710043/diff/120001/worker/provisioner/provis... worker/provisioner/provisioner_task.go:166: task.safeModeChan <- safeMode On 2013/11/27 10:58:10, wallyworld wrote: > On 2013/11/27 10:52:24, fwereade wrote: > > I'm pretty sure this method will fail nastily if called concurrently. That we > > don't call it concurrently is IMO not much of a defence... > > I was using a mutex but Roger recommended this code. Actually it doesn't fail *too* nastily if called concurrently - it just falls back to blocking, something that could be easily addressed by looping and avoiding the blocking send, or using a mutex. It's also very common in Go to have methods that are not safe to call concurrently. That said, I think that just blocking (and selecting on dying) will be fine here. https://codereview.appspot.com/32710043/diff/120001/worker/provisioner/provis... worker/provisioner/provisioner_task.go:217: func asString(instances []instance.Instance) string { I think I'd call this "instanceIds" and have it return []string.
Sign in to reply to this message.
Review moved to https://codereview.appspot.com/34050043/ https://codereview.appspot.com/32710043/diff/80001/worker/provisioner/provisi... File worker/provisioner/provisioner_task.go (right): https://codereview.appspot.com/32710043/diff/80001/worker/provisioner/provisi... worker/provisioner/provisioner_task.go:175: } On 2013/11/27 12:47:19, rog wrote: > } else { > logger.Infof("stopping unknown instances %v", unknown) > } > > (and lose the logger call in findUnknownInstances) > ? Done. https://codereview.appspot.com/32710043/diff/120001/worker/provisioner/provis... File worker/provisioner/provisioner_task.go (right): https://codereview.appspot.com/32710043/diff/120001/worker/provisioner/provis... worker/provisioner/provisioner_task.go:217: func asString(instances []instance.Instance) string { On 2013/11/27 12:47:19, rog wrote: > I think I'd call this "instanceIds" and have it return []string. Done.
Sign in to reply to this message.
|