|
|
Created:
12 years, 4 months ago by fwereade Modified:
12 years, 4 months ago Reviewers:
mp+137585 Visibility:
Public. |
Descriptionstate: Dying unit cannot enter relation scope
https://code.launchpad.net/~fwereade/juju-core/dying-unit-enter-scope/+merge/137585
(do not edit description out of merge proposal)
Patch Set 1 #
Total comments: 14
Patch Set 2 : state: Dying unit cannot enter relation scope #
Total comments: 3
Patch Set 3 : state: Dying unit cannot enter relation scope #
MessagesTotal messages: 16
Please take a look.
Sign in to reply to this message.
I'm a little concerned about how this is being glued together. It at least looks like we are interpreting the error at the wrong level. I could certainly be wrong, though. https://codereview.appspot.com/6864050/diff/1/state/relation.go File state/relation.go (right): https://codereview.appspot.com/6864050/diff/1/state/relation.go#newcode318 state/relation.go:318: } This feels like something that would like to return multiple errors, rather than just stopping at the first one. It also looks large enough that it would be easier to understand as a separate helper function. (Wrap ru.st.runner.Run, call it with parameters, and interpret the error result.) I do think that "CannotEnterScope" is not a useful *human visable* error. You also have 'cannot enter scope' separate from 'cannot examine scope', which I don't fully understand. I guess my point is, "foo not alive" is somewhat meaningful to a user, but "cannot examine scope" or "cannot enter scope" seem to be exposing implementation specifics, vs something users care about. (So might be good to have in a log file, but not useful as the error message a user might see.) Offhand this code also looks a bit "racy". In that you got an error from X and you seem to be trying to disambiguate what the real error is by going around and refreshing state on various objects and seeing if something is/isn't alive. Would it be possible to get the failure from the 'run' itself, rather than just getting ErrAborted? (Possibly add state to ErrAborted so that it can tell you *why* it aborted.)
Sign in to reply to this message.
jameinel, thanks for your comments; I'd be happy to talk more about your concerns, but I'm not quite sure what you mean by the top-level message. Would you expand please? https://codereview.appspot.com/6864050/diff/1/state/relation.go File state/relation.go (right): https://codereview.appspot.com/6864050/diff/1/state/relation.go#newcode318 state/relation.go:318: } On 2012/12/04 09:06:08, jameinel wrote: > This feels like something that would like to return multiple errors, rather than > just stopping at the first one. Please expand... what is improved by emitting/handling multiple errors? > It also looks large enough that it would be easier to understand as a separate > helper function. > (Wrap ru.st.runner.Run, call it with parameters, and interpret the error > result.) I'm not sure the code will be made any clearer by doing that... but then I'm not clear which bit you're suggesting we should carve out. > I do think that "CannotEnterScope" is not a useful *human visable* error. You > also have 'cannot enter scope' separate from 'cannot examine scope', which I > don't fully understand. I'm keenest on the idea of just dropping the "cannot examine scope" annotation -- usually when we get a mongo error we don't know about, we just let it bubble up. > I guess my point is, "foo not alive" is somewhat meaningful to a user, but > "cannot examine scope" or "cannot enter scope" seem to be exposing > implementation specifics, vs something users care about. (So might be good to > have in a log file, but not useful as the error message a user might see.) Good thing that ErrCannotEnterScope is a flow-control construct that will/should never make it as far as a log file, let alone a user ;). > > Offhand this code also looks a bit "racy". In that you got an error from X and > you seem to be trying to disambiguate what the real error is by going around and > refreshing state on various objects and seeing if something is/isn't alive. > > Would it be possible to get the failure from the 'run' itself, rather than just > getting ErrAborted? (Possibly add state to ErrAborted so that it can tell you > *why* it aborted.) > I can't see a mechanism by which the txn package could do that... you seem to be asking for the txn package to do roughly what I do here, but in a completely generic way across an arbitrary number of asserts (in which every asserting op needs to be annotated with N possible failure messages...). Am I missing something?
Sign in to reply to this message.
On Tue, 2012-12-04 at 12:50 +0000, John A Meinel wrote: To start with, I'm fully unfamiliar with this code, so I'm talking > fairly meta about design, and what comes to mind as I read the code. > As I'm sure you're much more familiar with it than I, don't take > things I say as absolute. No worries, the discussion is valued all the same. > My top level feeling is that the code is slightly incorrect, because > something is triggering an abort, and we are divining the what after > the fact. Rather than having the thing which triggers an abort tell us > what the reason for the abort was. > > From what I can tell of the current code, it isn't possible to do so > (we are returning an explicit instance txn.ErrAborted, not an instance > of an error type). > > So fixing that could certainly be out of scope for what you are doing. > Regardless, I think it's a very tricky problem. AFAICT, the txn package generally can't tell which assertion failed; finding out would be ugly and inefficient and complex to a degree that far outstrips that of the check-possible-abort-sources style. (Maybe you have a smart idea? ISTM that in the general case you potentially need (NxM)! additional queries to determine which assert(s) failed, and that you also need to annotate all the assertions with error messages, and... just... eww. (note that there's no point making assertions before attempting the ops -- the state we're asserting could change in the middle. ISTM that throwing a txn at the wall and seeing whether it sticks is likely to be the neatest way to determine whether a change is sensible: if it turns out it's not, we can figure out why it's broken and go from there.)) > Please expand... what is improved by emitting/handling multiple > > errors? > > You are looping over a bunch of possible things that could be a cause > of errors. Is it possible that more than one of them could be the > error (eg. multiple names not started)? You short circuit on the first > one that fails. Which means if you try to do X and it fails because of > Y and Z, but you are only told about Y. So you fix Y, and try X again > this time failing because of Z. Then you fix Z and it works. > > It seems like it could be nice to find out "X failed because of Y and > Z" rather than just "X failed because of Y". > Wrong context, I think -- this "failure" is not "fixable", it's a bald statement that the system state is not what the client thought and the requested change is flat-out never-gonna-happen impossible: either due to lifecycle state (ErrCannotEnterScope); or because it's already happened for some reason (nil) or maybe because a programmer screwed up somewhere ("inconsistent state"). Either way, the only sensible reaction to failing to enter a scope is to back out of what you're trying to do and pretend the relation just doesn't exist, and this is the appropriate reaction regardless of the reason we're failing to enter scope. I don't think there's any value in creating and handling N different errors to express a single situation (that one of the required participants is going away, or gone, and will *never* be able to enter scope.) Roughly the block of > if err := ru.st.runner.Run(ops, "", nil); err == txn.ErrAborted { > ... > } > > The actual code here doesn't need much of the context of the rest of > the function, and it "feels" roughly helper-function sized. > I'm torn: readability/locality is impacted by putting the ops in a separate function to the assert-failure-handling. I do see where you're coming from -- but I'm not too bothered by the size of the method, mainly because the []txn.Op takes up an awful lot of vertical space but very little mental space: it's perfectly clear what we're asking for, and the mind can skate directly over to the interesting bit (determining what it means for that []txn.Op to be aborted) while keeping the [] right there for easy reference. That said, I am finding that the code that generates an []txn.Op is sometimes worth factoring out; I just don't like to do it until there's a compelling reason to do so (ie the value of not repeating the code outweighs the cost of looking elsewhere to understand the details of a method). ... > > > > > Good thing that ErrCannotEnterScope is a flow-control construct > > that will/should never make it as far as a log file, let alone a > > user ;). > > Is it proper to use 'error' as a type to do flow-control? It seems to > fall into all the same issues that people claim for > exceptions-for-flow-control. (vs enumerated values for flow control). > I don't see any real problems with the technique myself, but I haven't really used it mindfully; it's just an idiom that I see commonly, and which STM to lead to pretty clear code... at least in this instance. > I can't see a mechanism by which the txn package could do that... > > you seem to be asking for the txn package to do roughly what I do > > here, but in a completely generic way across an arbitrary number > > of asserts (in which every asserting op needs to be annotated with > > N possible failure messages...). Am I missing something? > > > > https://codereview.appspot.com/6864050/ > > > > The way it is coded today, we probably cannot. Run is returning a > global error object (txn.ErrAborted) rather than a type (_, ok = > err.(ErrAborted)). > As such, you wouldn't want to put the reason for the current abort > onto that object. > > At least in *my* head Exceptions are instance objects, not static > objects, but I see that isn't the case here. > Yeah -- sometimes you see things like os.IsExist(err) and state.IsNotFound(err) but they feel a bit bloaty when a static error does everything we need. There's nothing stopping us doing that in this case *except* that determining the cause of the failure is really hard for the txn package to accomplish. So the reason I'm not +1 is simply because I don't feel I understand > the underlying code enough to vote on it, and have some questions as > to whether it could be done better. I certainly wouldn't block it > landing, though. > > John > =:-> > OK, so I can consider this review to be essentially neutral? ie, I'm still looking for 2 +1s, but I don't need to make specific changes to accommodate your point of view? (I will try extracting the abort handler, but I'm not sure I'll push it unless it STM to be a clear win.)
Sign in to reply to this message.
> In a SQL DB, you're doing some sort of UPDATE/INSERT that violates a > rule. And the requesting INSERT could have "if this fails, it is > because of X". Sort of encoding the context while the action is > happening, rather than inferring it after the fact. My understanding, which may be imperfect, is that generally the asserts in transactions are just smooshed into a great big select that is used to determine what document, if any, to operate on; if no such document is found, bam, ErrAborted. AFAICT there's no way to get any more nuance without more roundtrips, and I don't think that's a good thing to add to txn.
Sign in to reply to this message.
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 ... > Regardless, I think it's a very tricky problem. AFAICT, the txn > package generally can't tell which assertion failed; finding out > would be ugly and inefficient and complex to a degree that far > outstrips that of the check-possible-abort-sources style. > > (Maybe you have a smart idea? ISTM that in the general case you > potentially need (NxM)! additional queries to determine which > assert(s) failed, and that you also need to annotate all the > assertions with error messages, and... just... eww. (note that > there's no point making assertions before attempting the ops -- the > state we're asserting could change in the middle. ISTM that > throwing a txn at the wall and seeing whether it sticks is likely > to be the neatest way to determine whether a change is sensible: if > it turns out it's not, we can figure out why it's broken and go > from there.)) So I realize that TXN can't determine what caused the abort. But isn't there an action happening that triggers the abort? In a SQL DB, you're doing some sort of UPDATE/INSERT that violates a rule. And the requesting INSERT could have "if this fails, it is because of X". Sort of encoding the context while the action is happening, rather than inferring it after the fact. If it isn't actually possible, the change itself is LGTM. John =:-> -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (Cygwin) Comment: Using GnuPG with undefined - http://www.enigmail.net/ iEYEARECAAYFAlC/F+YACgkQJdeBCYSNAAOnzACeLN0v4xo1CyKlP4yxcV+Wu80P /gwAn1SKdcT10bkbzSRhEKzOd4X/Un08 =UrZa -----END PGP SIGNATURE-----
Sign in to reply to this message.
On 2012/12/05 10:37:34, fwereade wrote: > AFAICT there's no way to get any more nuance without more > roundtrips, and I don't think that's a good thing to add to txn. Hmm, ofc, we *could* easily attach the (first) Op that aborted the transaction, and that would cut down on the number of subsequent checks required... not sure how niemeyer would feel about that, though, it's a pretty significant interface change.
Sign in to reply to this message.
looks plausible. a couple of comments below. https://codereview.appspot.com/6864050/diff/1/state/relation.go File state/relation.go (right): https://codereview.appspot.com/6864050/diff/1/state/relation.go#newcode260 state/relation.go:260: // the unit not to be Alive. Once a unit has entered a scope, in stays in scope s/in stays/it stays/ https://codereview.appspot.com/6864050/diff/1/state/relation.go#newcode293 state/relation.go:293: return fmt.Errorf("cannot initialize state for %s: %v", desc, err) how is getting the private address "initializing state"? https://codereview.appspot.com/6864050/diff/1/state/relation.go#newcode310 state/relation.go:310: for _, l := range []lifeRefresher{ru.relation, ru.unit} { perhaps a comment might be appropriate here, saying what the possible things are that can have failed and why this check is the right thing to do. AFAICS we've got 4 possible asserts that can have failed here, and i can't quite see why we don't check 4 things. (is it possible that someone could have set the private address in the meantime, in which case we could see an "inconsistent state" error AFAICS, though the state is not inconsistent?)
Sign in to reply to this message.
So far it's looking good, I only have a problem with the usage of an error as an expected flow-control variable. Are there situations where this error pops out as an error or is it always used for flow-control? https://codereview.appspot.com/6864050/diff/1/state/relation.go File state/relation.go (right): https://codereview.appspot.com/6864050/diff/1/state/relation.go#newcode318 state/relation.go:318: } On 2012/12/04 11:31:45, fwereade wrote: > On 2012/12/04 09:06:08, jameinel wrote: > > I guess my point is, "foo not alive" is somewhat meaningful to a user, but > > "cannot examine scope" or "cannot enter scope" seem to be exposing > > implementation specifics, vs something users care about. (So might be good to > > have in a log file, but not useful as the error message a user might see.) > > Good thing that ErrCannotEnterScope is a flow-control construct that will/should > never make it as far as a log file, let alone a user ;). Here I have my troubles. If it is an expected and valid state we have to react on I wouldn't call it an error. Here I more expect a kind of if result, ok := doThis(); ok { ... }
Sign in to reply to this message.
A few responses... opinions? https://codereview.appspot.com/6864050/diff/1/state/relation.go File state/relation.go (right): https://codereview.appspot.com/6864050/diff/1/state/relation.go#newcode293 state/relation.go:293: return fmt.Errorf("cannot initialize state for %s: %v", desc, err) On 2012/12/07 10:06:21, rog wrote: > how is getting the private address "initializing state"? It's not. The error is stating that it's impossible to initialize the relation unit state; the precise reason for that is just a detail, explained in the %v error. https://codereview.appspot.com/6864050/diff/1/state/relation.go#newcode310 state/relation.go:310: for _, l := range []lifeRefresher{ru.relation, ru.unit} { On 2012/12/07 10:06:21, rog wrote: > perhaps a comment might be appropriate here, saying what > the possible things are that can have failed and why > this check is the right thing to do. If any DocMissing failed, we must already be in scope, as detected by the existence of the scope doc and expressed to the client as `return nil` in :308. > AFAICS we've got 4 possible asserts that can have failed here, and i can't quite > see why we don't check 4 things. (is it possible that someone could have set the > private address in the meantime, in which case we could see an "inconsistent > state" error AFAICS, though the state is not inconsistent?) If we failed to find the private address, we bailed out without even trying a txn; if the DocMissing asserts failed, we returned nil; if the isAlive asserts failed, we return suitable errors; otherwise, who knows what's going on? https://codereview.appspot.com/6864050/diff/1/state/relation.go#newcode318 state/relation.go:318: } On 2012/12/07 10:07:10, TheMue wrote: > On 2012/12/04 11:31:45, fwereade wrote: > > On 2012/12/04 09:06:08, jameinel wrote: > > > I guess my point is, "foo not alive" is somewhat meaningful to a user, but > > > "cannot examine scope" or "cannot enter scope" seem to be exposing > > > implementation specifics, vs something users care about. (So might be good > to > > > have in a log file, but not useful as the error message a user might see.) > > > > Good thing that ErrCannotEnterScope is a flow-control construct that > will/should > > never make it as far as a log file, let alone a user ;). > > Here I have my troubles. If it is an expected and valid state we have to react > on I wouldn't call it an error. Here I more expect a kind of > > if result, ok := doThis(); ok { ... } I'm confused... why is this different from all the other cases where we define specific errors for the purpose of flow control? Is the problem just that I used the words "flow control" and it freaked everyone out? ;)
Sign in to reply to this message.
On 2012/12/07 10:30:04, fwereade wrote: > A few responses... opinions? > I'm confused... why is this different from all the other cases where we define > specific errors for the purpose of flow control? Is the problem just that I used > the words "flow control" and it freaked everyone out? ;) Yes, maybe I've been just to pedantic on your wording. For sure it's normal to react on errors and let it control the program flow. But it sounded as this one is only invented for program flow and the returned error is used as its carrier. Sorry for the trouble. :)
Sign in to reply to this message.
https://codereview.appspot.com/6864050/diff/1/state/relation.go File state/relation.go (right): https://codereview.appspot.com/6864050/diff/1/state/relation.go#newcode293 state/relation.go:293: return fmt.Errorf("cannot initialize state for %s: %v", desc, err) On 2012/12/07 10:30:04, fwereade wrote: > On 2012/12/07 10:06:21, rog wrote: > > how is getting the private address "initializing state"? > > It's not. The error is stating that it's impossible to initialize the relation > unit state; the precise reason for that is just a detail, explained in the %v > error. ah yes, i'd read the txn.DocMissing as pertaining to the private address itself, rather than the relation unit. https://codereview.appspot.com/6864050/diff/1/state/relation.go#newcode310 state/relation.go:310: for _, l := range []lifeRefresher{ru.relation, ru.unit} { On 2012/12/07 10:30:04, fwereade wrote: > On 2012/12/07 10:06:21, rog wrote: > > perhaps a comment might be appropriate here, saying what > > the possible things are that can have failed and why > > this check is the right thing to do. > > If any DocMissing failed, we must already be in scope, as detected by the > existence of the scope doc and expressed to the client as `return nil` in :308. > > > AFAICS we've got 4 possible asserts that can have failed here, and i can't > quite > > see why we don't check 4 things. (is it possible that someone could have set > the > > private address in the meantime, in which case we could see an "inconsistent > > state" error AFAICS, though the state is not inconsistent?) > > If we failed to find the private address, we bailed out without even trying a > txn; if the DocMissing asserts failed, we returned nil; if the isAlive asserts > failed, we return suitable errors; otherwise, who knows what's going on? i think that something encapsulating the above information might be worth putting in a comment, hence my remark. it's not that obvious from a glance at the code (well, it wasn't to me!) that this covers all the cases. https://codereview.appspot.com/6864050/diff/1/state/relation.go#newcode318 state/relation.go:318: } On 2012/12/07 10:30:04, fwereade wrote: > On 2012/12/07 10:07:10, TheMue wrote: > > On 2012/12/04 11:31:45, fwereade wrote: > > > On 2012/12/04 09:06:08, jameinel wrote: > > > > I guess my point is, "foo not alive" is somewhat meaningful to a user, but > > > > "cannot examine scope" or "cannot enter scope" seem to be exposing > > > > implementation specifics, vs something users care about. (So might be good > > to > > > > have in a log file, but not useful as the error message a user might see.) > > > > > > Good thing that ErrCannotEnterScope is a flow-control construct that > > will/should > > > never make it as far as a log file, let alone a user ;). > > > > Here I have my troubles. If it is an expected and valid state we have to react > > on I wouldn't call it an error. Here I more expect a kind of > > > > if result, ok := doThis(); ok { ... } > > I'm confused... why is this different from all the other cases where we define > specific errors for the purpose of flow control? Is the problem just that I used > the words "flow control" and it freaked everyone out? ;) i think returning specific errors is just fine. we do it all the time, and it works well.
Sign in to reply to this message.
Please take a look. https://codereview.appspot.com/6864050/diff/1/state/relation.go File state/relation.go (right): https://codereview.appspot.com/6864050/diff/1/state/relation.go#newcode260 state/relation.go:260: // the unit not to be Alive. Once a unit has entered a scope, in stays in scope On 2012/12/07 10:06:21, rog wrote: > s/in stays/it stays/ Done. https://codereview.appspot.com/6864050/diff/1/state/relation.go#newcode310 state/relation.go:310: for _, l := range []lifeRefresher{ru.relation, ru.unit} { On 2012/12/07 10:40:53, rog wrote: > On 2012/12/07 10:30:04, fwereade wrote: > > On 2012/12/07 10:06:21, rog wrote: > > > perhaps a comment might be appropriate here, saying what > > > the possible things are that can have failed and why > > > this check is the right thing to do. > > > > If any DocMissing failed, we must already be in scope, as detected by the > > existence of the scope doc and expressed to the client as `return nil` in > :308. > > > > > AFAICS we've got 4 possible asserts that can have failed here, and i can't > > quite > > > see why we don't check 4 things. (is it possible that someone could have set > > the > > > private address in the meantime, in which case we could see an "inconsistent > > > state" error AFAICS, though the state is not inconsistent?) > > > > If we failed to find the private address, we bailed out without even trying a > > txn; if the DocMissing asserts failed, we returned nil; if the isAlive asserts > > failed, we return suitable errors; otherwise, who knows what's going on? > > i think that something encapsulating the above information might be worth > putting in a comment, hence my remark. it's not that obvious from a glance at > the code (well, it wasn't to me!) that this covers all the cases. Indeed -- I wasn't trying to say "no comments!", I was just trying to say "this is what I do, I think it makes sense" :). Done.
Sign in to reply to this message.
LGTM https://codereview.appspot.com/6864050/diff/9002/state/relation.go File state/relation.go (right): https://codereview.appspot.com/6864050/diff/9002/state/relation.go#newcode305 state/relation.go:305: return err nice
Sign in to reply to this message.
LGTM, only one small question. https://codereview.appspot.com/6864050/diff/9002/state/relation.go File state/relation.go (right): https://codereview.appspot.com/6864050/diff/9002/state/relation.go#newcode328 state/relation.go:328: return fmt.Errorf("cannot enter scope for %s: inconsistent state", desc) Is an error enough or isn't it worth a panic?
Sign in to reply to this message.
*** Submitted: state: Dying unit cannot enter relation scope R=jameinel, TheMue, rog CC= https://codereview.appspot.com/6864050 https://codereview.appspot.com/6864050/diff/9002/state/relation.go File state/relation.go (right): https://codereview.appspot.com/6864050/diff/9002/state/relation.go#newcode328 state/relation.go:328: return fmt.Errorf("cannot enter scope for %s: inconsistent state", desc) On 2012/12/07 15:11:43, TheMue wrote: > Is an error enough or isn't it worth a panic? I don't think it's very helpful to panic -- we don't get any useful extra information, and we'll take down the uniter anyway -- it'll either recover or logspam us every 5 seconds, and I'd rather have concise logspam :).
Sign in to reply to this message.
|