Issue 6864050: state: Dying unit cannot enter relation scope

jameinel

I'm a little concerned about how this is being glued together. It at least looks ...

12 years, 7 months ago (2012-12-04 09:06:08 UTC) #2

fwereade

jameinel, thanks for your comments; I'd be happy to talk more about your concerns, but ...

12 years, 7 months ago (2012-12-04 11:31:44 UTC) #3

jameinel, thanks for your comments; I'd be happy to talk more about your
concerns, but I'm not quite sure what you mean by the top-level message. Would
you expand please?

https://codereview.appspot.com/6864050/diff/1/state/relation.go
File state/relation.go (right):

https://codereview.appspot.com/6864050/diff/1/state/relation.go#newcode318
state/relation.go:318: }
On 2012/12/04 09:06:08, jameinel wrote:
> This feels like something that would like to return multiple errors, rather
than
> just stopping at the first one.

Please expand... what is improved by emitting/handling multiple errors?

> It also looks large enough that it would be easier to understand as a separate
> helper function.
> (Wrap ru.st.runner.Run, call it with parameters, and interpret the error
> result.)

I'm not sure the code will be made any clearer by doing that... but then I'm not
clear which bit you're suggesting we should carve out.

> I do think that "CannotEnterScope" is not a useful *human visable* error. You
> also have 'cannot enter scope' separate from 'cannot examine scope', which I
> don't fully understand.

I'm keenest on the idea of just dropping the "cannot examine scope" annotation
-- usually when we get a mongo error we don't know about, we just let it bubble
up.

> I guess my point is, "foo not alive" is somewhat meaningful to a user, but
> "cannot examine scope" or "cannot enter scope" seem to be exposing
> implementation specifics, vs something users care about. (So might be good to
> have in a log file, but not useful as the error message a user might see.)

Good thing that ErrCannotEnterScope is a flow-control construct that will/should
never make it as far as a log file, let alone a user ;).

> 
> Offhand this code also looks a bit "racy". In that you got an error from X and
> you seem to be trying to disambiguate what the real error is by going around
and
> refreshing state on various objects and seeing if something is/isn't alive.
> 
> Would it be possible to get the failure from the 'run' itself, rather than
just
> getting ErrAborted? (Possibly add state to ErrAborted so that it can tell you
> *why* it aborted.)
> 

I can't see a mechanism by which the txn package could do that... you seem to be
asking for the txn package to do roughly what I do here, but in a completely
generic way across an arbitrary number of asserts (in which every asserting op
needs to be annotated with N possible failure messages...). Am I missing
something?

Sign in to reply to this message.

fwereade

On Tue, 2012-12-04 at 12:50 +0000, John A Meinel wrote: To start with, I'm fully ...

12 years, 7 months ago (2012-12-05 08:43:42 UTC) #4

On Tue, 2012-12-04 at 12:50 +0000, John A Meinel wrote:
To start with, I'm fully unfamiliar with this code, so I'm talking
> fairly meta about design, and what comes to mind as I read the code.
> As I'm sure you're much more familiar with it than I, don't take
> things I say as absolute.

No worries, the discussion is valued all the same.

> My top level feeling is that the code is slightly incorrect, because
> something is triggering an abort, and we are divining the what after
> the fact. Rather than having the thing which triggers an abort tell us
> what the reason for the abort was.
> 
> From what I can tell of the current code, it isn't possible to do so
> (we are returning an explicit instance txn.ErrAborted, not an instance
> of an error type).
> 
> So fixing that could certainly be out of scope for what you are doing.
> 
Regardless, I think it's a very tricky problem. AFAICT, the txn package
generally can't tell which assertion failed; finding out would be ugly and
inefficient and complex to a degree that far outstrips that of the
check-possible-abort-sources style.

(Maybe you have a smart idea? ISTM that in the general case you potentially need
(NxM)! additional queries to determine which assert(s) failed, and that you also
need to annotate all the assertions with error messages, and... just... eww.
(note that there's no point making assertions before attempting the ops -- the
state we're asserting could change in the middle. ISTM that throwing a txn at
the wall and seeing whether it sticks is likely to be the neatest way to
determine whether a change is sensible: if it turns out it's not, we can figure
out why it's broken and go from there.))

> Please expand... what is improved by emitting/handling multiple 
> > errors?
> 
> You are looping over a bunch of possible things that could be a cause
> of errors. Is it possible that more than one of them could be the
> error (eg. multiple names not started)? You short circuit on the first
> one that fails. Which means if you try to do X and it fails because of
> Y and Z, but you are only told about Y. So you fix Y, and try X again
> this time failing because of Z. Then you fix Z and it works.
> 
> It seems like it could be nice to find out "X failed because of Y and
> Z" rather than just "X failed because of Y".
> 
Wrong context, I think -- this "failure" is not "fixable", it's a bald statement
that the system state is not what the client thought and the requested change is
flat-out never-gonna-happen impossible: either due to lifecycle state
(ErrCannotEnterScope); or because it's already happened for some reason (nil) or
maybe because a programmer screwed up somewhere ("inconsistent state").

Either way, the only sensible reaction to failing to enter a scope is to back
out of what you're trying to do and pretend the relation just doesn't exist, and
this is the appropriate reaction regardless of the reason we're failing to enter
scope. I don't think there's any value in creating and handling N different
errors to express a single situation (that one of the required participants is
going away, or gone, and will *never* be able to enter scope.)

Roughly the block of
> if err := ru.st.runner.Run(ops, "", nil); err == txn.ErrAborted {
> ...
> }
> 
> The actual code here doesn't need much of the context of the rest of
> the function, and it "feels" roughly helper-function sized.
> 

I'm torn: readability/locality is impacted by putting the ops in a separate
function to the assert-failure-handling. I do see where you're coming from --
but I'm not too bothered by the size of the method, mainly because the []txn.Op
takes up an awful lot of vertical space but very little mental space: it's
perfectly clear what we're asking for, and the mind can skate directly over to
the interesting bit (determining what it means for that []txn.Op to be aborted)
while keeping the [] right there for easy reference.

That said, I am finding that the code that generates an []txn.Op is sometimes
worth factoring out; I just don't like to do it until there's a compelling
reason to do so (ie the value of not repeating the code outweighs the cost of
looking elsewhere to understand the details of a method).

...
> 
> > 
> > Good thing that ErrCannotEnterScope is a flow-control construct 
> > that will/should never make it as far as a log file, let alone a 
> > user ;).
> 
> Is it proper to use 'error' as a type to do flow-control? It seems to
> fall into all the same issues that people claim for
> exceptions-for-flow-control. (vs enumerated values for flow control).
> 
I don't see any real problems with the technique myself, but I haven't really
used it mindfully; it's just an idiom that I see commonly, and which STM to lead
to pretty clear code... at least in this instance.

> I can't see a mechanism by which the txn package could do that... 
> > you seem to be asking for the txn package to do roughly what I do 
> > here, but in a completely generic way across an arbitrary number
> > of asserts (in which every asserting op needs to be annotated with
> > N possible failure messages...). Am I missing something?
> > 
> > https://codereview.appspot.com/6864050/
> > 
> 
> The way it is coded today, we probably cannot. Run is returning a
> global error object (txn.ErrAborted) rather than a type (_, ok =
> err.(ErrAborted)).
> As such, you wouldn't want to put the reason for the current abort
> onto that object.
> 
> At least in *my* head Exceptions are instance objects, not static
> objects, but I see that isn't the case here.
> 
Yeah -- sometimes you see things like os.IsExist(err) and state.IsNotFound(err)
but they feel a bit bloaty when a static error does everything we need. There's
nothing stopping us doing that in this case *except* that determining the cause
of the failure is really hard for the txn package to accomplish.

So the reason I'm not +1 is simply because I don't feel I understand
> the underlying code enough to vote on it, and have some questions as
> to whether it could be done better. I certainly wouldn't block it
> landing, though.
> 
> John
> =:->
> 
OK, so I can consider this review to be essentially neutral? ie, I'm still
looking for 2 +1s, but I don't need to make specific changes to accommodate your
point of view? (I will try extracting the abort handler, but I'm not sure I'll
push it unless it STM to be a clear win.)

Sign in to reply to this message.

fwereade

> In a SQL DB, you're doing some sort of UPDATE/INSERT that violates a > ...

12 years, 7 months ago (2012-12-05 10:37:34 UTC) #5

jameinel

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 ... > Regardless, I think it's a very tricky ...

12 years, 7 months ago (2012-12-06 07:10:11 UTC) #6

fwereade

On 2012/12/05 10:37:34, fwereade wrote: > AFAICT there's no way to get any more nuance ...

12 years, 7 months ago (2012-12-07 09:45:07 UTC) #7

rog

looks plausible. a couple of comments below. https://codereview.appspot.com/6864050/diff/1/state/relation.go File state/relation.go (right): https://codereview.appspot.com/6864050/diff/1/state/relation.go#newcode260 state/relation.go:260: // the ...

12 years, 7 months ago (2012-12-07 10:06:21 UTC) #8

TheMue

So far it's looking good, I only have a problem with the usage of an ...

12 years, 7 months ago (2012-12-07 10:07:09 UTC) #9

fwereade

A few responses... opinions? https://codereview.appspot.com/6864050/diff/1/state/relation.go File state/relation.go (right): https://codereview.appspot.com/6864050/diff/1/state/relation.go#newcode293 state/relation.go:293: return fmt.Errorf("cannot initialize state for ...

12 years, 7 months ago (2012-12-07 10:30:04 UTC) #10

A few responses... opinions?

https://codereview.appspot.com/6864050/diff/1/state/relation.go
File state/relation.go (right):

https://codereview.appspot.com/6864050/diff/1/state/relation.go#newcode293
state/relation.go:293: return fmt.Errorf("cannot initialize state for %s: %v",
desc, err)
On 2012/12/07 10:06:21, rog wrote:
> how is getting the private address "initializing state"?

It's not. The error is stating that it's impossible to initialize the relation
unit state; the precise reason for that is just a detail, explained in the %v
error.

https://codereview.appspot.com/6864050/diff/1/state/relation.go#newcode310
state/relation.go:310: for _, l := range []lifeRefresher{ru.relation, ru.unit} {
On 2012/12/07 10:06:21, rog wrote:
> perhaps a comment might be appropriate here, saying what
> the possible things are that can have failed and why
> this check is the right thing to do.

If any DocMissing failed, we must already be in scope, as detected by the
existence of the scope doc and expressed to the client as `return nil` in :308.

> AFAICS we've got 4 possible asserts that can have failed here, and i can't
quite
> see why we don't check 4 things. (is it possible that someone could have set
the
> private address in the meantime, in which case we could see an "inconsistent
> state" error AFAICS, though the state is not inconsistent?)

If we failed to find the private address, we bailed out without even trying a
txn; if the DocMissing asserts failed, we returned nil; if the isAlive asserts
failed, we return suitable errors; otherwise, who knows what's going on?

https://codereview.appspot.com/6864050/diff/1/state/relation.go#newcode318
state/relation.go:318: }
On 2012/12/07 10:07:10, TheMue wrote:
> On 2012/12/04 11:31:45, fwereade wrote:
> > On 2012/12/04 09:06:08, jameinel wrote:
> > > I guess my point is, "foo not alive" is somewhat meaningful to a user, but
> > > "cannot examine scope" or "cannot enter scope" seem to be exposing
> > > implementation specifics, vs something users care about. (So might be good
> to
> > > have in a log file, but not useful as the error message a user might see.)
> > 
> > Good thing that ErrCannotEnterScope is a flow-control construct that
> will/should
> > never make it as far as a log file, let alone a user ;).
> 
> Here I have my troubles. If it is an expected and valid state we have to react
> on I wouldn't call it an error. Here I more expect a kind of
> 
> if result, ok := doThis(); ok { ... }

I'm confused... why is this different from all the other cases where we define
specific errors for the purpose of flow control? Is the problem just that I used
the words "flow control" and it freaked everyone out? ;)

Sign in to reply to this message.

TheMue

On 2012/12/07 10:30:04, fwereade wrote: > A few responses... opinions? > I'm confused... why is ...

12 years, 7 months ago (2012-12-07 10:37:19 UTC) #11

rog

https://codereview.appspot.com/6864050/diff/1/state/relation.go File state/relation.go (right): https://codereview.appspot.com/6864050/diff/1/state/relation.go#newcode293 state/relation.go:293: return fmt.Errorf("cannot initialize state for %s: %v", desc, err) ...

12 years, 7 months ago (2012-12-07 10:40:53 UTC) #12

https://codereview.appspot.com/6864050/diff/1/state/relation.go
File state/relation.go (right):

https://codereview.appspot.com/6864050/diff/1/state/relation.go#newcode293
state/relation.go:293: return fmt.Errorf("cannot initialize state for %s: %v",
desc, err)
On 2012/12/07 10:30:04, fwereade wrote:
> On 2012/12/07 10:06:21, rog wrote:
> > how is getting the private address "initializing state"?
> 
> It's not. The error is stating that it's impossible to initialize the relation
> unit state; the precise reason for that is just a detail, explained in the %v
> error.

ah yes, i'd read the txn.DocMissing as pertaining to the private address itself,
rather than the relation unit.

https://codereview.appspot.com/6864050/diff/1/state/relation.go#newcode310
state/relation.go:310: for _, l := range []lifeRefresher{ru.relation, ru.unit} {
On 2012/12/07 10:30:04, fwereade wrote:
> On 2012/12/07 10:06:21, rog wrote:
> > perhaps a comment might be appropriate here, saying what
> > the possible things are that can have failed and why
> > this check is the right thing to do.
> 
> If any DocMissing failed, we must already be in scope, as detected by the
> existence of the scope doc and expressed to the client as `return nil` in
:308.
> 
> > AFAICS we've got 4 possible asserts that can have failed here, and i can't
> quite
> > see why we don't check 4 things. (is it possible that someone could have set
> the
> > private address in the meantime, in which case we could see an "inconsistent
> > state" error AFAICS, though the state is not inconsistent?)
> 
> If we failed to find the private address, we bailed out without even trying a
> txn; if the DocMissing asserts failed, we returned nil; if the isAlive asserts
> failed, we return suitable errors; otherwise, who knows what's going on?

i think that something encapsulating the above information might be worth
putting in a comment, hence my remark. it's not that obvious from a glance at
the code (well, it wasn't to me!) that this covers all the cases.

https://codereview.appspot.com/6864050/diff/1/state/relation.go#newcode318
state/relation.go:318: }
On 2012/12/07 10:30:04, fwereade wrote:
> On 2012/12/07 10:07:10, TheMue wrote:
> > On 2012/12/04 11:31:45, fwereade wrote:
> > > On 2012/12/04 09:06:08, jameinel wrote:
> > > > I guess my point is, "foo not alive" is somewhat meaningful to a user,
but
> > > > "cannot examine scope" or "cannot enter scope" seem to be exposing
> > > > implementation specifics, vs something users care about. (So might be
good
> > to
> > > > have in a log file, but not useful as the error message a user might
see.)
> > > 
> > > Good thing that ErrCannotEnterScope is a flow-control construct that
> > will/should
> > > never make it as far as a log file, let alone a user ;).
> > 
> > Here I have my troubles. If it is an expected and valid state we have to
react
> > on I wouldn't call it an error. Here I more expect a kind of
> > 
> > if result, ok := doThis(); ok { ... }
> 
> I'm confused... why is this different from all the other cases where we define
> specific errors for the purpose of flow control? Is the problem just that I
used
> the words "flow control" and it freaked everyone out? ;)

i think returning specific errors is just fine. we do it all the time, and it
works well.

Sign in to reply to this message.

fwereade

Please take a look. https://codereview.appspot.com/6864050/diff/1/state/relation.go File state/relation.go (right): https://codereview.appspot.com/6864050/diff/1/state/relation.go#newcode260 state/relation.go:260: // the unit not to ...

12 years, 7 months ago (2012-12-07 15:06:11 UTC) #13

Please take a look.

https://codereview.appspot.com/6864050/diff/1/state/relation.go
File state/relation.go (right):

https://codereview.appspot.com/6864050/diff/1/state/relation.go#newcode260
state/relation.go:260: // the unit not to be Alive. Once a unit has entered a
scope, in stays in scope
On 2012/12/07 10:06:21, rog wrote:
> s/in stays/it stays/

Done.

https://codereview.appspot.com/6864050/diff/1/state/relation.go#newcode310
state/relation.go:310: for _, l := range []lifeRefresher{ru.relation, ru.unit} {
On 2012/12/07 10:40:53, rog wrote:
> On 2012/12/07 10:30:04, fwereade wrote:
> > On 2012/12/07 10:06:21, rog wrote:
> > > perhaps a comment might be appropriate here, saying what
> > > the possible things are that can have failed and why
> > > this check is the right thing to do.
> > 
> > If any DocMissing failed, we must already be in scope, as detected by the
> > existence of the scope doc and expressed to the client as `return nil` in
> :308.
> > 
> > > AFAICS we've got 4 possible asserts that can have failed here, and i can't
> > quite
> > > see why we don't check 4 things. (is it possible that someone could have
set
> > the
> > > private address in the meantime, in which case we could see an
"inconsistent
> > > state" error AFAICS, though the state is not inconsistent?)
> > 
> > If we failed to find the private address, we bailed out without even trying
a
> > txn; if the DocMissing asserts failed, we returned nil; if the isAlive
asserts
> > failed, we return suitable errors; otherwise, who knows what's going on?
> 
> i think that something encapsulating the above information might be worth
> putting in a comment, hence my remark. it's not that obvious from a glance at
> the code (well, it wasn't to me!) that this covers all the cases.

Indeed -- I wasn't trying to say "no comments!", I was just trying to say "this
is what I do, I think it makes sense" :). Done.

Sign in to reply to this message.

rog

LGTM https://codereview.appspot.com/6864050/diff/9002/state/relation.go File state/relation.go (right): https://codereview.appspot.com/6864050/diff/9002/state/relation.go#newcode305 state/relation.go:305: return err nice

12 years, 7 months ago (2012-12-07 15:09:17 UTC) #14

TheMue

LGTM, only one small question. https://codereview.appspot.com/6864050/diff/9002/state/relation.go File state/relation.go (right): https://codereview.appspot.com/6864050/diff/9002/state/relation.go#newcode328 state/relation.go:328: return fmt.Errorf("cannot enter scope ...

12 years, 7 months ago (2012-12-07 15:11:43 UTC) #15

fwereade

12 years, 7 months ago (2012-12-07 16:52:37 UTC) #16

*** Submitted:

state: Dying unit cannot enter relation scope

R=jameinel, TheMue, rog
CC=
https://codereview.appspot.com/6864050

https://codereview.appspot.com/6864050/diff/9002/state/relation.go
File state/relation.go (right):

https://codereview.appspot.com/6864050/diff/9002/state/relation.go#newcode328
state/relation.go:328: return fmt.Errorf("cannot enter scope for %s:
inconsistent state", desc)
On 2012/12/07 15:11:43, TheMue wrote:
> Is an error enough or isn't it worth a panic?

I don't think it's very helpful to panic -- we don't get any useful extra
information, and we'll take down the uniter anyway -- it'll either recover or
logspam us every 5 seconds, and I'd rather have concise logspam :).

Sign in to reply to this message.

Issue 6864050: state: Dying unit cannot enter relation scope

Description

Patch Set 1 #

Patch Set 2 : state: Dying unit cannot enter relation scope #

Patch Set 3 : state: Dying unit cannot enter relation scope #

Messages