Issue 5847053: A spec for how to cleanly stop agents.

fwereade

couple of typos and some vague rambling thoughts; I'll try to write them up in ...

12 years, 1 month ago (2012-03-17 12:20:30 UTC) #2

couple of typos and some vague rambling thoughts; I'll try to write them up in
some more detail for the list.

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst
File source/drafts/stopping-units.rst (right):

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rs...
source/drafts/stopping-units.rst:41: its killed to properly terminate. This
example can be extended to
s/its/it's/

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rs...
source/drafts/stopping-units.rst:59: against all the relevant entitites. [See
Topology Internals]
Please define "relevant entities". When stopping a service, for example, do you
mean we should mark all the following:
 - all units of that service
 - all unit relations of that service
 - all machines running only that service
 - the service itself
?

I think it'd be easier to think of things cascading a bit more -- so individual
entities detected their own impending death by watching for changes to their
immediate parents' statuses... but I may have to think it through a little more.

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rs...
source/drafts/stopping-units.rst:70: - unit agent disables its upstart file
Not sure about this; can't we just always write agents with a "normal exit 0"
stanza and depend on that?

Otherwise, if we get a "weird" failure at this precise point, we won't get an
opportunity to start up again and tidy ourselves up nicely; and besides it seems
more correct to me to let the machine agent *always* tidy up the unit agent
upstart files it originally created (which it will have to do anyway if it times
out...).

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rs...
source/drafts/stopping-units.rst:72: - unit agent process exists
s/exists/exits with a code indicating to upstart that it should not be
restarted/

(?)

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rs...
source/drafts/stopping-units.rst:74: - machine agent watch fires on 'stop' node
deletion, shutsdown the units container.
s/shutsdown/shuts down/

But regardless, I think we need to consider this in a little more detail --
we're not guaranteed to have a container, after all, and we probably shouldn't
be killing whole containers if we're just shutting down a subordinate charm, for
example.

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rs...
source/drafts/stopping-units.rst:78: node contents updated to reflect this.
If the PA kills a machine, it will then be responsible for cleaning up the MA's
nodes *and* any unit nodes? This doesn't seem quite right... sounds like a job
for the GC really (because -- I *think* -- it's going to have to deal with unit
cleanup as a consequence of forcible machine termination. (If the MA doesn't
handle a stop request properly, probably it didn't stop its units either.).

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rs...
source/drafts/stopping-units.rst:89: and garbage collects their topology
footprint and zk state.
I think this needs a bit more detail; see below.

https://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rs...
source/drafts/stopping-units.rst:100: internal data structures.
Sounds sensible. I'm wondering if it would be sensible to extend the general
idea to unit relations, replacing the existing mechanism; in combination with a
gc node, for example, it could be useful for units to hold a don't-delete-yet
lock on other participants in their relations (until the point at which the unit
itself has processed their removal).

(I think we do need something like this: a relation hook can theoretically
execute with an arbitrarily old members list, and the only reason this isn't
currently a problem is because we don't delete the old members' settings nodes,
so the relation hook commands can continue to cheerfully extract data from dead
unit relations. (I think, anyway: the details have grown hazy. Please correct me
if I need it ;)))

But then... if a UA holding a "lock" on another unit relation's settings gets
forcibly killed, we need a way to detect that *that* lock is no longer
relevant... it's not insoluble, sure, but it could get fiddly, and whatever we
do I think we'll need a semi-smart GC that can figure out these sorts of issues,
and I think it's worth speccing out the details of the GC in a bit more depth.

Sign in to reply to this message.

hazmat

Thanks for the review! http://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst File source/drafts/stopping-units.rst (right): http://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst#newcode59 source/drafts/stopping-units.rst:59: against all the relevant entitites. ...

12 years, 1 month ago (2012-03-21 08:35:37 UTC) #3

Thanks for the review!

http://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst
File source/drafts/stopping-units.rst (right):

http://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst...
source/drafts/stopping-units.rst:59: against all the relevant entitites. [See
Topology Internals]
On 2012/03/17 12:20:30, fwereade wrote:
> Please define "relevant entities". When stopping a service, for example, do
you
> mean we should mark all the following:
>  - all units of that service
>  - all unit relations of that service
>  - all machines running only that service
>  - the service itself
> ?
> 
> I think it'd be easier to think of things cascading a bit more -- so
individual
> entities detected their own impending death by watching for changes to their
> immediate parents' statuses... but I may have to think it through a little
more.


Good point, for a service it would set the appropriate value for all the units.
I like the notion of a children watching parents, but its not clear its nesc.,
but ultimately its the assigned machine's responsibility to shutdown the unit,
and it seems just as well.

http://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst...
source/drafts/stopping-units.rst:70: - unit agent disables its upstart file
On 2012/03/17 12:20:30, fwereade wrote:
> Not sure about this; can't we just always write agents with a "normal exit 0"
> stanza and depend on that?
> 
> Otherwise, if we get a "weird" failure at this precise point, we won't get an
> opportunity to start up again and tidy ourselves up nicely; and besides it
seems
> more correct to me to let the machine agent *always* tidy up the unit agent
> upstart files it originally created (which it will have to do anyway if it
times
> out...).

That sounds good, although a wierd failure might just as well exit 0 ;-) But
yeah.. the parent should perform final cleanup, and i like the upstart stanza
approach.

http://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst...
source/drafts/stopping-units.rst:74: - machine agent watch fires on 'stop' node
deletion, shutsdown the units container.
On 2012/03/17 12:20:30, fwereade wrote:
> s/shutsdown/shuts down/
> 
> But regardless, I think we need to consider this in a little more detail --
> we're not guaranteed to have a container, after all, and we probably shouldn't
> be killing whole containers if we're just shutting down a subordinate charm,
for
> example.

a subordinate is handled slightly differently, same protocol different parent. a
subordinate is cleaned up by its parent/primary unit. most of the mechanics of
container vs. not are already abstracted out behind unit deploy factories.

http://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst...
source/drafts/stopping-units.rst:78: node contents updated to reflect this.
On 2012/03/17 12:20:30, fwereade wrote:
> If the PA kills a machine, it will then be responsible for cleaning up the
MA's
> nodes *and* any unit nodes? This doesn't seem quite right... sounds like a job
> for the GC really (because -- I *think* -- it's going to have to deal with
unit
> cleanup as a consequence of forcible machine termination. (If the MA doesn't
> handle a stop request properly, probably it didn't stop its units either.).

machines can't be terminated currently if they have units assigned. but that
does bring up the question of exactly when do the units become unassigned from
the machine. in that particularly case its not identity that we're removing, and
the parent supervisor is the machine agent is already going to be dealing with
shutting it down.

http://codereview.appspot.com/5847053/diff/1/source/drafts/stopping-units.rst...
source/drafts/stopping-units.rst:100: internal data structures.
On 2012/03/17 12:20:30, fwereade wrote:
> Sounds sensible. I'm wondering if it would be sensible to extend the general
> idea to unit relations, replacing the existing mechanism; in combination with
a
> gc node, for example, it could be useful for units to hold a don't-delete-yet
> lock on other participants in their relations (until the point at which the
unit
> itself has processed their removal).
> 
> (I think we do need something like this: a relation hook can theoretically
> execute with an arbitrarily old members list, and the only reason this isn't
> currently a problem is because we don't delete the old members' settings
nodes,
> so the relation hook commands can continue to cheerfully extract data from
dead
> unit relations. (I think, anyway: the details have grown hazy. Please correct
me
> if I need it ;)))
> 
> But then... if a UA holding a "lock" on another unit relation's settings gets
> forcibly killed, we need a way to detect that *that* lock is no longer
> relevant... it's not insoluble, sure, but it could get fiddly, and whatever we
> do I think we'll need a semi-smart GC that can figure out these sorts of
issues,
> and I think it's worth speccing out the details of the GC in a bit more depth.


re extension, I think we'll have more interesting mechanisms for multi-node
workflows. Its possible to do here, but it feels orthogonal to stop problem. I
will note that clint was interested in this sort of feature a few months ago.

per our discussion on irc, gc can be fairly conservative with relations, only
cleaning it out after all endpoints service units that participated in the
relation are gone.

Sign in to reply to this message.

niemeyer

Nice to see this moving forward. Some comments: https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst File source/drafts/stopping-units.rst (right): https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode4 source/drafts/stopping-units.rst:4: ------------- ...

12 years, 1 month ago (2012-03-21 20:07:53 UTC) #5

Nice to see this moving forward. Some comments:

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units...
File source/drafts/stopping-units.rst (right):

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units...
source/drafts/stopping-units.rst:4: -------------
Sorry for nitpicking, but can we please stick to a consistent convention for
headers? I believe that's what we have today:

Top level: ===
Inner level: ---

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units...
source/drafts/stopping-units.rst:30: while still allowing for termination of
errant children.
This is a very dense paragraph. Can it be simplified to something like:

"""
Juju records the desired state in the topology, and preserves that state while
the actions to accomplish it are still taking place.
"""

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units...
source/drafts/stopping-units.rst:32: How it works, what follows is a a worked
example going through a unit
"How it works, what follows"?

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units...
source/drafts/stopping-units.rst:58: actions are recorded in the topology as a
key 'action' value 'destroy'
The "action: destroy" terminology feels a bit out of place here. This is pretty
much a procedure call, but the topology holds state instead. Something like
"destroyed: true" would be more appropriate. It remains true even after the
action has taken place.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units...
source/drafts/stopping-units.rst:65: - Machine writes to /units/unit-x/stop /
sets watch on /units/unit-x/stop and timer.
See below regarding the watch side of it.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units...
source/drafts/stopping-units.rst:69: - executing stop hook.
This should be done after departing from relations.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units...
source/drafts/stopping-units.rst:72: - unit agent updates /units/unit-x/stop to
reflect the action has been performed.
That step doesn't seem necessary. We can reuse the existing liveness mechanism
that is already supported by the agent. When the agent dies, the machine will
know it.

It also seems like a nice idea to have the stop ZooKeeper node existing for as
long as that's the intended state of the unit. We have to need that node on
startup for watching it anyway.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units...
source/drafts/stopping-units.rst:75: - machine agent watch fires on 'stop' node
deletion, shuts down the unit, removing
Also touches the above point.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units...
source/drafts/stopping-units.rst:94: and garbage collects their topology
footprint and zk state.
What about garbage collecting on the next action after the decision that it's
fine to GC it? I don't think we need a specific process that has the task of
GCing the topology.

In both cases, though, we have an issue: when is it ok to GC it? This proposal
is not providing any means for deciding on that, which means the data becomes
eternal. That hole indicates postponing that side of the problem may not be a
good idea. We can do something simple, but it'd be good implement cleaning up at
the same time.

Sign in to reply to this message.

hazmat

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst File source/drafts/stopping-units.rst (right): https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode4 source/drafts/stopping-units.rst:4: ------------- On 2012/03/21 20:07:53, niemeyer wrote: > Sorry for ...

12 years, 1 month ago (2012-03-26 00:42:07 UTC) #6

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units...
File source/drafts/stopping-units.rst (right):

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units...
source/drafts/stopping-units.rst:4: -------------
On 2012/03/21 20:07:53, niemeyer wrote:
> Sorry for nitpicking, but can we please stick to a consistent convention for
> headers? I believe that's what we have today:
> 
> Top level: ===
> Inner level: ---

done.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units...
source/drafts/stopping-units.rst:30: while still allowing for termination of
errant children.
On 2012/03/21 20:07:53, niemeyer wrote:
> This is a very dense paragraph. Can it be simplified to something like:
> 
> """
> Juju records the desired state in the topology, and preserves that state while
> the actions to accomplish it are still taking place.
> """

sounds good.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units...
source/drafts/stopping-units.rst:32: How it works, what follows is a a worked
example going through a unit
On 2012/03/21 20:07:53, niemeyer wrote:
> "How it works, what follows"?

removed

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units...
source/drafts/stopping-units.rst:58: actions are recorded in the topology as a
key 'action' value 'destroy'
On 2012/03/21 20:07:53, niemeyer wrote:
> The "action: destroy" terminology feels a bit out of place here. This is
pretty
> much a procedure call, but the topology holds state instead. Something like
> "destroyed: true" would be more appropriate. It remains true even after the
> action has taken place.

action here was a really a name for a coordinated state transition, ie. an
implied final state. recording the state instead of the activity sounds good. i
have a minor preference for a generic key over per state keys, as i can't think
of any multi-valued concurrent activities/states here, but not guessing via
constraints in the face of ambiguity does seem to be the wiser choice.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units...
source/drafts/stopping-units.rst:69: - executing stop hook.
On 2012/03/21 20:07:53, niemeyer wrote:
> This should be done after departing from relations.
definitely

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units...
source/drafts/stopping-units.rst:72: - unit agent updates /units/unit-x/stop to
reflect the action has been performed.
On 2012/03/21 20:07:53, niemeyer wrote:
> That step doesn't seem necessary. We can reuse the existing liveness mechanism
> that is already supported by the agent. When the agent dies, the machine will
> know it.
> 
> It also seems like a nice idea to have the stop ZooKeeper node existing for as
> long as that's the intended state of the unit. We have to need that node on
> startup for watching it anyway.

i wanted to avoid presence node usage, due its known problems with session
expiration. ie if an agent has a restart in the middle of the stop process, the
machine agent watch may erroneously think the agent is up. The stop node
explicitness feels more robust in this context, and as you've noted its nesc for
the coordination aspect anyways. Its also nice to clean up the local
coordination structure when the coordination is done.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units...
source/drafts/stopping-units.rst:94: and garbage collects their topology
footprint and zk state.
On 2012/03/21 20:07:53, niemeyer wrote:
> What about garbage collecting on the next action after the decision that it's
> fine to GC it? I don't think we need a specific process that has the task of
> GCing the topology.
> 
> In both cases, though, we have an issue: when is it ok to GC it? This proposal
> is not providing any means for deciding on that, which means the data becomes
> eternal. That hole indicates postponing that side of the problem may not be a
> good idea. We can do something simple, but it'd be good implement cleaning up
at
> the same time.

There is some cost associated to doing gc every user action, its n roundtrips
for the client to check. I agree it is simpler for user-gc, but there are other
forms of garbage that could be opportunistically collected with an agent (ie.
old /relations, non topology referenced service nodes) Its not clear that
garbage will be a problem in practice though outside of the topology, so i'm
fine with the user initiated gc confined to topology intents.

Sign in to reply to this message.

hazmat

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst File source/drafts/stopping-units.rst (right): https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode94 source/drafts/stopping-units.rst:94: and garbage collects their topology footprint and zk state. ...

12 years, 1 month ago (2012-03-26 00:59:22 UTC) #7

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units...
File source/drafts/stopping-units.rst (right):

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units...
source/drafts/stopping-units.rst:94: and garbage collects their topology
footprint and zk state.
On 2012/03/26 00:42:07, hazmat wrote:
> On 2012/03/21 20:07:53, niemeyer wrote:
> > What about garbage collecting on the next action after the decision that
it's
> > fine to GC it? I don't think we need a specific process that has the task of
> > GCing the topology.
> > 
> > In both cases, though, we have an issue: when is it ok to GC it? This
proposal
> > is not providing any means for deciding on that, which means the data
becomes
> > eternal. That hole indicates postponing that side of the problem may not be
a
> > good idea. We can do something simple, but it'd be good implement cleaning
up
> at
> > the same time.
> 
> There is some cost associated to doing gc every user action, its n roundtrips
> for the client to check. I agree it is simpler for user-gc, but there are
other
> forms of garbage that could be opportunistically collected with an agent (ie.
> old /relations, non topology referenced service nodes) Its not clear that
> garbage will be a problem in practice though outside of the topology, so i'm
> fine with the user initiated gc confined to topology intents.

i'm still a bit concerned about the lag here. Take a service with 10 units, if
we destroy the service, we end up recording destroy against the 10 units for
their assigned machine coordination and the service. 

I suppose that goes away if we just record the action once against the service,
and have machine agents watching all their assigned service unit's service in
the topology.

In either case, for the next user interaction, we have several roundtrips queued
for the next user interaction one per unit to verify their
completion/destruction. It feels like instilling an unpredictable lag, esp. as
the activity is an ongoing one that may take several minutes (executing the stop
hook) and thus require gc checking for the next several user commands. And in a
multi-user usage, the user who finally deleted the object from the topology
would be random, which also feels questionable from an audit perspective.

Given that, i'd prefer the dedicated gc agent.

Sign in to reply to this message.

hazmat

Please take a look. https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst File source/drafts/stopping-units.rst (right): https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units.rst#newcode94 source/drafts/stopping-units.rst:94: and garbage collects their topology ...

12 years, 1 month ago (2012-03-26 01:25:12 UTC) #8

niemeyer

12 years, 1 month ago (2012-03-26 14:36:47 UTC) #9

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units...
File source/drafts/stopping-units.rst (right):

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units...
source/drafts/stopping-units.rst:75: - machine agent watch fires on 'stop' node
deletion, shuts down the unit, removing
On 2012/03/21 20:07:53, niemeyer wrote:
> Also touches the above point.

This one is still pending.

https://codereview.appspot.com/5847053/diff/5001/source/drafts/stopping-units...
source/drafts/stopping-units.rst:94: and garbage collects their topology
footprint and zk state.
On 2012/03/26 00:59:22, hazmat wrote:
> i'm still a bit concerned about the lag here. 
(...)
> Given that, i'd prefer the dedicated gc agent.

Sounds good, but let's please not swipe the problem under the carpet. If we'll
have a dedicated GC, let's implement it upfront at the same time we implement
the garbage itself. It may be an additional command on juju-admin, for example,
that is called periodically on the zookeeper machine.

While implementing GC, we may figure details about the implementation that might
be best done differently, so we shouldn't be tying ourselves to an
implementation and postponing the decision on how to GC to a latter point.

https://codereview.appspot.com/5847053/diff/9001/source/drafts/stopping-units...
File source/drafts/stopping-units.rst (right):

https://codereview.appspot.com/5847053/diff/9001/source/drafts/stopping-units...
source/drafts/stopping-units.rst:50: are recorded in the topology as a 'destroy'
key with a True value
Thanks. Please just s/destroy/destroyed/, otherwise we would still have an
action to be carried rather than a state.

https://codereview.appspot.com/5847053/diff/9001/source/drafts/stopping-units...
source/drafts/stopping-units.rst:83: Principal actions via the client will
trigger gc activities against
New and unfinished sentence.

https://codereview.appspot.com/5847053/diff/9001/source/drafts/stopping-units...
source/drafts/stopping-units.rst:113: ``condition``: no presence and no stop
node.
Should be "no presence node and existent stop node."

https://codereview.appspot.com/5847053/diff/9001/source/drafts/stopping-units...
source/drafts/stopping-units.rst:128: ``condition``: no presence and no stop
node.
Ditto.

https://codereview.appspot.com/5847053/diff/9001/source/drafts/stopping-units...
source/drafts/stopping-units.rst:136: agent/domain state ( machine, unit,
service), its best to treat them
s/its/it's/

Sign in to reply to this message.

Issue 5847053: A spec for how to cleanly stop agents.

Description

Patch Set 1 #

Patch Set 2 : A spec for how to cleanly stop agents. #

Patch Set 3 : A spec for how to cleanly stop agents. #

Messages