|
|
Created:
12 years, 1 month ago by fwereade Modified:
12 years ago Reviewers:
mp+93357 Visibility:
Public. |
DescriptionA replacement for zookeeper ephemeral nodes, which allows for seamless
reestablishment of presence across separate sessions.
https://code.launchpad.net/~fwereade/juju/go-presence-nodes/+merge/93357
(do not edit description out of merge proposal)
Patch Set 1 #
Total comments: 11
Patch Set 2 : Ephemeral node replacement sketch, DON'T SUBMIT #Patch Set 3 : Ephemeral node replacement sketch #
Total comments: 5
Patch Set 4 : Presence nodes #Patch Set 5 : Presence nodes #
Total comments: 32
Patch Set 6 : Presence nodes #
Total comments: 4
Patch Set 7 : Presence nodes #Patch Set 8 : Presence nodes #
Total comments: 8
Patch Set 9 : Presence nodes #
Total comments: 52
Patch Set 10 : Presence nodes #
Total comments: 4
Patch Set 11 : Presence nodes #
Total comments: 1
Patch Set 12 : Presence nodes #
Total comments: 18
Patch Set 13 : Presence nodes #Patch Set 14 : Presence nodes #Patch Set 15 : Presence nodes #
MessagesTotal messages: 30
Please take a look.
Sign in to reply to this message.
i really like the essential idea (OccupiedW). could do with a bit of refinement though. i think that my comments boil down to something like the following API: package presence type Node struct {...} func Occupy(zk *zookeeper.Conn, path string, timeout time.Duration) (*Node, error) type Event struct { State int // zookeeper state if the zk session has gone down Occupied bool } func OccupiedW(zk *zookeeper.Conn, path string) (bool, <-chan Event, error) func Occupied(zk *zookeeper.Conn, path) (bool, error) https://codereview.appspot.com/5672067/diff/1/state/presence.go File state/presence.go (right): https://codereview.appspot.com/5672067/diff/1/state/presence.go#newcode19 state/presence.go:19: stat, err := n.conn.Exists(n.path) i don't think it's worth calling Exists here. you could just call Create and check for a "node exists" error. https://codereview.appspot.com/5672067/diff/1/state/presence.go#newcode44 state/presence.go:44: // presenceNode holds all the data used by both PresenceNode and PresenceNodeClient. i'd make these two separate types and duplicate the fields. it's only happenstance that they use exactly the same fields now - i think that will change. for example, it would make sense for PresenceNode to contain an instance of ChangeNode. https://codereview.appspot.com/5672067/diff/1/state/presence.go#newcode64 state/presence.go:64: func (p *PresenceNode) Occupy() chan<- bool { rather than returning a channel here, i think it would be better if PresenceNode had a Close method. in fact, i'd probably make NewPresenceNode create and occupy the node - i don't see that it's worth having a separate Occupy method. in fact, why not rename NewPresenceNode to Occupy? func Occupy(conn *zookeeper.Conn, path string, timeout time.Duration) *Node (if we put this in a package called "presence", i think this will read well). https://codereview.appspot.com/5672067/diff/1/state/presence.go#newcode88 state/presence.go:88: type PresenceNodeClient presenceNode i'm not sure that this type is worth having as things are now. i think i'd go with a single function: func OccupiedW(zk *zookeeper.Conn, path string) (bool, <-chan bool, error) then again, if you're going to store the clock skew, you'll need a place to store it. having the timeout in the call seems error prone - what happens if the occupier is using a different time out to the client? how about storing the timeout as the contents of the presence node? you're doing a GetW anyway, so it won't affect performance. https://codereview.appspot.com/5672067/diff/1/state/presence.go#newcode95 state/presence.go:95: func (p *PresenceNodeClient) Occupied(clock *ChangeNode) (occupied bool, err error) { rather than passing in a close ChangeNode, we could just use a well known zk path, which would make things easier for clients (the ChangeNode type would not need to be exported, for example). https://codereview.appspot.com/5672067/diff/1/state/presence.go#newcode128 state/presence.go:128: mtime, err := clock.Change() i think this is overkill - it means that every call to OccupiedW incurs a minimum of three round trips to zk AFAICS. we could compute the clock skew when the PresenceNodeClient is created and trust that the local clock won't go too far out of sync (we could resynchronise once every day or so if we really cared). with the length of timeouts we'll be talking, i don't think a little skew over time is going to matter too much. https://codereview.appspot.com/5672067/diff/1/state/presence.go#newcode147 state/presence.go:147: }(timeout) you can use time.NewTimer for this. https://codereview.appspot.com/5672067/diff/1/state/presence.go#newcode151 state/presence.go:151: if err.(zookeeper.Error) == zookeeper.ZNONODE { if err == zookeeper.ZNONODE { https://codereview.appspot.com/5672067/diff/1/state/presence.go#newcode164 state/presence.go:164: panic("zookeeper connection down") i'd put this first, then you save a level of indentation in the other branch. https://codereview.appspot.com/5672067/diff/1/state/presence.go#newcode188 state/presence.go:188: // note: node could have been created in the meantime i think this has a problem. if the node has been created in the meantime, we'll want to use occupiedW to watch the node timeout rather than just waiting for it to be deleted. if occupied calls ExistsW rather than Exists, then you won't have the race and you can always return false from unoccupiedW. also, i think you can drop nonexistentW - ExistsW will also return an event when the node changes. https://codereview.appspot.com/5672067/diff/1/state/presence.go#newcode208 state/presence.go:208: panic("zookeeper connection down") we shouldn't panic if this happens. perhaps we could define a type Event that encapsulates both zk session events and the occupied/unoccupied status; then we could send the event that's causing us to panic to the receiver and they could deal with it just as they need to deal with any other watch function.
Sign in to reply to this message.
Please take a look.
Sign in to reply to this message.
Please take a look.
Sign in to reply to this message.
as discussed on IRC, i don't think this API is an improvement. in particular the persistent Watcher seems to complicate the interface for no particularly good reason. the way other calls in the zk interface work is to return a current value and a channel to wait for a change of that value - which what the previous API did. that seemed entirely appropriate to me. https://codereview.appspot.com/5672067/diff/5001/state/presence/presence.go File state/presence/presence.go (right): https://codereview.appspot.com/5672067/diff/5001/state/presence/presence.go#n... state/presence/presence.go:37: // oneShot sends a single true value to ch and closes it. i think calls to this would read better if the contents were inlined, making the channel operations clearer. also, the way you're using it, i *think* you could just do close(ch) and it be entirely equivalent. https://codereview.appspot.com/5672067/diff/5001/state/presence/presence.go#n... state/presence/presence.go:45: type Pinger struct { i'm not sure about "Pinger" as a name. it sounds like the implementation detail is leaking into the API naming also, a ping is usually something that involves two parties (i "ping" a host, or a person on IRC, but updating my "present" status doesn't seem very like a ping to me). personally i thought that Occupy expressed the spirit of what was going on quite well. "occupation" of a node suggests nicely that there can be only one occupant and that the occupation is active. if we were using a system slightly less broken than zookeeper, perhaps we could use an implementation that doesn't involve polling, and it would be good if the API could remain substantially the same. https://codereview.appspot.com/5672067/diff/5001/state/presence/presence.go#n... state/presence/presence.go:103: Alive bool i don't see how it's possible to use this in a non-racy way. there's no mutex protection and it changes concurrently. https://codereview.appspot.com/5672067/diff/5001/state/presence/presence.go#n... state/presence/presence.go:107: Error error again, not possible to use this correctly. https://codereview.appspot.com/5672067/diff/5001/state/presence/presence.go#n... state/presence/presence.go:111: func Watch(conn *zookeeper.Conn, path string) (*Watcher, error) { i'm not sure i see the advantage in having a watcher that watches for multiple events. it complicates the implementation, and AFAICS most uses will watch for a single event and then take some action based on that event, probably using some kind of locking to decide, for instance, who is going to provision a new machine when its machine agent has gone away.
Sign in to reply to this message.
Please take a look.
Sign in to reply to this message.
Please take a look.
Sign in to reply to this message.
lots of comments again, sorry. it's looking good in general though. the underlying logic looks good (with the possible exception of the node deletion race) https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go File state/presence/presence.go (right): https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:19: func (n *changeNode) Change() (mtime time.Time, err error) { s/Change/change/ it's not exported, so not worth giving it a name that looks like it might be. https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:34: // Pinger continually updates a target node in zookeeper when run. i'm not sure that the word "target" adds much here and below. https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:46: tick := time.After(p.period) if you're going to use time.After, you might as well inline it into the select statement. If you don't want it inlined, you may as well use time.NewTimer() and Stop the timer if it doesn't go off. https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:49: close(p.closed) i think you don't need the closed channel. given that closing has a buffer size of zero, when the sender sends a value on it, it knows that the run loop has received it and will return. https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:62: // Close just stops updating the target node; AliveW watches will not notice s/just// s/target// https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:66: <-p.closed maybe call p.target.Change here so callers of Close have at least p.period leeway to reinstate a new pinger. https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:78: func StartPing(conn *zk.Conn, path string, period time.Duration) (*Pinger, error) { NewPinger? https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:97: // content of its target node. path is present only for convenience's sake; this i'm not sure that the convenience is worth it. a state should hold the state only IMHO. if you wanted convenience, you could bundle up the zk connection, the path and the watch channel into a local type, say: type watcher struct { conn *zk.Conn path string watch chan<- bool } and then define, for instance: func (w watcher) pstate(stat *zk.Stat, data string) (pstate, error) instead of getPstate. YMMV. i'd probably just pass the extra argument. https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:101: func getPstate(conn *zk.Conn, path string, stat *zk.Stat, data string) (pstate, error) { you could just pass the mtime in. s/data/content/ https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:139: stat, zkWatch, err = conn.ExistsW(path) it's a pity we have this race. it seems to me that it's because we delete the node when pinger finishes. i wonder if things might be cleaner if require the node to exist. that means that if we happen to call Alive on a wrong path, we won't wait for a node that will never appear. as for cleaning up, we could say that a particular content (say empty) means "clean. nothing here". https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:155: // awaitDead sends false to watch when the target node is deleted, or when it has you should probably add that the target node is known to be alive when awaitDead is called, and similar for awaitAlive below. https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:158: dead := time.After(p.timeout) same applies here as in Pinger.run https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:177: go awaitDead(conn, p, zkWatch, watch) i'm not sure it's worth starting a new goroutine here. with a loop around the select the logic is more obvious, i think. you could even put the watch<-false statement after the loop to make it obvious that's the natural post-condition of the loop (apart from the two cases where it's closed). then this logic can be: if !p.alive { break loop } https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:185: // awaitAlive send true to watch when the target node is changed or created. s/send/sends/ https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:205: go awaitAlive(conn, p, zkWatch, watch) use a loop, as in awaitDead. https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:212: // liveness change to be detected. this doesn't read great to me. how about this (borrowing an idiom from the zk docs)? // AliveW returns whether the Pinger at the given node path seems // to be alive. It also returns a channel that will receive // the new status when it changes. If an error is encountered // after AliveW returns, the channel will be closed.
Sign in to reply to this message.
Please take a look.
Sign in to reply to this message.
Please take a look.
Sign in to reply to this message.
Please take a look.
Sign in to reply to this message.
These may be somewhat redundant now, given the changes in the following patch sets, but there's no sense in not publishing them... https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go File state/presence/presence.go (right): https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:19: func (n *changeNode) Change() (mtime time.Time, err error) { On 2012/02/23 15:30:29, rog wrote: > s/Change/change/ > it's not exported, so not worth giving it a name that looks like it might be. Done. https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:34: // Pinger continually updates a target node in zookeeper when run. On 2012/02/23 15:30:29, rog wrote: > i'm not sure that the word "target" adds much here and below. Maybe not in the comments, but I feel "target" is at least as correct a field name as "node". (Not sure if you're actually talking about the field name in the first place) https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:46: tick := time.After(p.period) On 2012/02/23 15:30:29, rog wrote: > if you're going to use time.After, you might as well inline it into the select > statement. If you don't want it inlined, you may as well use time.NewTimer() and > Stop the timer if it doesn't go off. Done. https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:49: close(p.closed) On 2012/02/23 15:30:29, rog wrote: > i think you don't need the closed channel. given that closing has a buffer size > of zero, when the sender sends a value on it, it knows that the run loop has > received it and will return. Done. https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:62: // Close just stops updating the target node; AliveW watches will not notice On 2012/02/23 15:30:29, rog wrote: > s/just// > s/target// Done. https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:66: <-p.closed On 2012/02/23 15:30:29, rog wrote: > maybe call p.target.Change here so callers of Close have at least p.period > leeway to reinstate a new pinger. Done. https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:78: func StartPing(conn *zk.Conn, path string, period time.Duration) (*Pinger, error) { On 2012/02/23 15:30:29, rog wrote: > NewPinger? Yeah, on reflection, I think that fits better. https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:97: // content of its target node. path is present only for convenience's sake; this On 2012/02/23 15:30:29, rog wrote: > i'm not sure that the convenience is worth it. a state should hold the state > only IMHO. if you wanted convenience, you could bundle up the zk connection, the > path and the watch channel into a local type, say: > > type watcher struct { > conn *zk.Conn > path string > watch chan<- bool > } > > and then define, for instance: > > func (w watcher) pstate(stat *zk.Stat, data string) (pstate, error) > instead of getPstate. YMMV. i'd probably just pass the extra argument. I'll see what I can do. https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:101: func getPstate(conn *zk.Conn, path string, stat *zk.Stat, data string) (pstate, error) { On 2012/02/23 15:30:29, rog wrote: > you could just pass the mtime in. > s/data/content/ Done. https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:139: stat, zkWatch, err = conn.ExistsW(path) On 2012/02/23 15:30:29, rog wrote: > it's a pity we have this race. it seems to me that it's because we delete the > node when pinger finishes. i wonder if things might be cleaner if require the > node to exist. that means that if we happen to call Alive on a wrong path, we > won't wait for a node that will never appear. > > as for cleaning up, we could say that a particular content (say empty) means > "clean. nothing here". I'd been thinking more of the case where the Pinger hadn't yet been created -- regardless of cleanup behaviour, I don't think it's sensible to assume that every node we could ever want to watch is guaranteed to exist. https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:155: // awaitDead sends false to watch when the target node is deleted, or when it has On 2012/02/23 15:30:29, rog wrote: > you should probably add that the target node is known to be alive when awaitDead > is called, and similar for awaitAlive below. Done. https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:158: dead := time.After(p.timeout) On 2012/02/23 15:30:29, rog wrote: > same applies here as in Pinger.run Done. https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:177: go awaitDead(conn, p, zkWatch, watch) On 2012/02/23 15:30:29, rog wrote: > i'm not sure it's worth starting a new goroutine here. with a loop around the > select the logic is more obvious, i think. > you could even put the watch<-false statement after the loop to make it obvious > that's the natural post-condition of the loop (apart from the two cases where > it's closed). > > then this logic can be: > if !p.alive { > break loop > } Done. https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:185: // awaitAlive send true to watch when the target node is changed or created. On 2012/02/23 15:30:29, rog wrote: > s/send/sends/ Done. https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:205: go awaitAlive(conn, p, zkWatch, watch) On 2012/02/23 15:30:29, rog wrote: > use a loop, as in awaitDead. Done. https://codereview.appspot.com/5672067/diff/11001/state/presence/presence.go#... state/presence/presence.go:212: // liveness change to be detected. On 2012/02/23 15:30:29, rog wrote: > this doesn't read great to me. > > how about this (borrowing an idiom from the zk docs)? > > // AliveW returns whether the Pinger at the given node path seems > // to be alive. It also returns a channel that will receive > // the new status when it changes. If an error is encountered > // after AliveW returns, the channel will be closed. Nice, thanks.
Sign in to reply to this message.
Please take a look.
Sign in to reply to this message.
i think this is very nearly there. https://codereview.appspot.com/5672067/diff/12004/state/presence/presence.go File state/presence/presence.go (right): https://codereview.appspot.com/5672067/diff/12004/state/presence/presence.go#... state/presence/presence.go:113: n.timeout = time.Duration(0) s/time.Duration(0)/0/ https://codereview.appspot.com/5672067/diff/12004/state/presence/presence.go#... state/presence/presence.go:131: func (n *node) updateExists(content string, stat *zk.Stat) error { i don't like updateExists as a name for this function. it sounds like it's updating the "exists" status of the node, but it's actually primarily about looking at the timeout. how about updateTimeout as a name? also it's probably worth working out how you want to order the functions in this file - caller first or callee first. the predominant order seems to be callee first, so perhaps this function should go before node.update. (personally, i think i prefer to read the top-level functions first followed by the lower-level functions, but i appreciate it's a matter of taste) https://codereview.appspot.com/5672067/diff/12004/state/presence/presence.go#... state/presence/presence.go:140: return err i'd do this in one line: return fmt.Errorf("%s is not a valid presence node: %s", n.path, err) https://codereview.appspot.com/5672067/diff/12004/state/presence/presence.go#... state/presence/presence.go:222: alive := n.alive this function might read better if you named the result parameters.
Sign in to reply to this message.
This is looking very good. There's only one issue worth of attention, and then mostly details. https://codereview.appspot.com/5672067/diff/10004/state/presence/presence.go File state/presence/presence.go (right): https://codereview.appspot.com/5672067/diff/10004/state/presence/presence.go#... state/presence/presence.go:61: func (p *Pinger) Close() { Can we please call this Stop, as the counterpart of Start? https://codereview.appspot.com/5672067/diff/10004/state/presence/presence.go#... state/presence/presence.go:132: clock := changeNode{n.conn, "/clock", ""} I wonder if there's a way to implement this logic that doesn't involve everybody punching a single node in zk. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go File state/presence/presence.go (right): https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go#... state/presence/presence.go:17: // returns the node's new MTime. This allows it to act as an ad-hoc remote clock Please drop after "... new MTime.". It's not clear what "This" and "it" are in this context, and the documentation here may focus on what it does rather than what all the use cases are. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go#... state/presence/presence.go:34: // Pinger continually updates a node in zookeeper when run. s/continually/periodically/ s/when run/while it is running/ https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go#... state/presence/presence.go:39: closing chan bool s/closing/stopping/? Can we please use the Stop term, instead of Close? Stop is the counterpart of Start, and makes more sense here given that we don't Open a Pinger. I've also just realized that NewTicker, besides being "New", also uses "Stop" on its type. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go#... state/presence/presence.go:42: // run calls change on p.target every p.period nanoseconds until p is closed. s/closed/stopped/ https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go#... state/presence/presence.go:60: // watchers time out as late as possible. That's nice. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go#... state/presence/presence.go:61: func (p *Pinger) Close() { s/Close/Stop/ https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go#... state/presence/presence.go:74: // path every period nanoseconds. // StartPinger returns a new Pinger that refreshes the content of path periodically. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go#... state/presence/presence.go:77: _, err := target.change() Nice. It's wise to ping it once before putting it alive. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go#... state/presence/presence.go:94: // update sets the current values of n.alive and n.timeout. s/sets/reads from ZooKeeper/ https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go#... state/presence/presence.go:107: // updateW sets the current values of n.alive and n.timeout, and returns a Same. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go#... state/presence/presence.go:130: // content and stat of the zookeeper node (which must exist). updateExists reads a bit weird, and the documentation doesn't seem entirely right. This method doesn't really care if the node exists or not when it's called, and the "update" in its name means something else than it means for the functions above (update reads from zk, updateW reads from zk, updateExists doesn't even use n.path). I'd suggest something along the lines of setLiveness, but feel free to use something else that makes more sense to you. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go#... state/presence/presence.go:133: now, err := clock.change() Something doesn't feel right here. Every single ping watcher will be touching this file on every single period. This should only be necessary once when the watch is first created, to avoid waiting for the timeout on a presence node that is dead for long. From then on, we can wait for the timeout internally without punching the same node across every single agent. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go#... state/presence/presence.go:150: // called when the node is known to be alive. Please replace the last sentence with: zkWatch must be observing changes on the existent node at n.path. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go#... state/presence/presence.go:180: // only be called when the node is known to be dead. Ditto. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence_tes... File state/presence/presence_test.go (right): https://codereview.appspot.com/5672067/diff/16001/state/presence/presence_tes... state/presence/presence_test.go:61: c.Assert((<-session).Ok(), Equals, true) Please check the event type rather than just Ok. There's a chance we start waiting for a connected event before returning from Dial at some point, and it'd be useful to have this breaking in that case. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence_tes... state/presence/presence_test.go:83: longEnough = period * 4 // longest possible time to detect a close Shouldn't 3 be enough here? https://codereview.appspot.com/5672067/diff/16001/state/presence/presence_tes... state/presence/presence_test.go:92: c.FailNow() Replace both with: c.Fatal("Liveness change not detected") https://codereview.appspot.com/5672067/diff/16001/state/presence/presence_tes... state/presence/presence_test.go:100: t := time.After(longEnough) Why isn't this inlined in the select? https://codereview.appspot.com/5672067/diff/16001/state/presence/presence_tes... state/presence/presence_test.go:104: c.FailNow() Same as above. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence_tes... state/presence/presence_test.go:111: t := time.After(longEnough) Same. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence_tes... state/presence/presence_test.go:117: c.FailNow() Same. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence_tes... state/presence/presence_test.go:242: waitFor(c, altConn, path) Why is waitFor necessary here? As I understand, the StartPinger above should only return after the change has been ack'd by the server, which means a new connection here shouldn't find an arbitrarily different state. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence_tes... state/presence/presence_test.go:271: c.FailNow() This is a Skip rather than an ExpectFailure. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence_tes... state/presence/presence_test.go:280: waitFor(c, altConn, path) Also don't see how that's necessary. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence_tes... state/presence/presence_test.go:292: c.FailNow() Skip as well. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence_tes... state/presence/presence_test.go:301: waitFor(c, s.zkConn, path) Also don't see how that's necessary.
Sign in to reply to this message.
https://codereview.appspot.com/5672067/diff/10004/state/presence/presence.go File state/presence/presence.go (right): https://codereview.appspot.com/5672067/diff/10004/state/presence/presence.go#... state/presence/presence.go:61: func (p *Pinger) Close() { On 2012/02/27 14:50:34, niemeyer wrote: > Can we please call this Stop, as the counterpart of Start? Sorry, that was a left over from a previous review not delivered. There are new versions of these comments on the new patch set. https://codereview.appspot.com/5672067/diff/10004/state/presence/presence.go#... state/presence/presence.go:132: clock := changeNode{n.conn, "/clock", ""} On 2012/02/27 14:50:34, niemeyer wrote: > I wonder if there's a way to implement this logic that doesn't involve everybody > punching a single node in zk. Ditto.
Sign in to reply to this message.
Please take a look.
Sign in to reply to this message.
LGTM https://codereview.appspot.com/5672067/diff/18001/state/presence/presence.go File state/presence/presence.go (right): https://codereview.appspot.com/5672067/diff/18001/state/presence/presence.go#... state/presence/presence.go:99: // ie there's no way it can be dead. s/ie/that is,/
Sign in to reply to this message.
https://codereview.appspot.com/5672067/diff/18001/state/presence/presence.go File state/presence/presence.go (right): https://codereview.appspot.com/5672067/diff/18001/state/presence/presence.go#... state/presence/presence.go:108: period, err := time.ParseDuration(content) You have the new content despite the fact that this function has been run before, which means you can update the timeout without touching /clock, right? It's only "now" that you won't know on every single period, which doesn't really matter since timeout is built on period, which is relative.
Sign in to reply to this message.
Whoops, this is at least 3 reviews' responses backed up... sorry :(. https://codereview.appspot.com/5672067/diff/12004/state/presence/presence.go File state/presence/presence.go (right): https://codereview.appspot.com/5672067/diff/12004/state/presence/presence.go#... state/presence/presence.go:113: n.timeout = time.Duration(0) On 2012/02/27 10:45:52, rog wrote: > s/time.Duration(0)/0/ Done. https://codereview.appspot.com/5672067/diff/12004/state/presence/presence.go#... state/presence/presence.go:131: func (n *node) updateExists(content string, stat *zk.Stat) error { On 2012/02/27 10:45:52, rog wrote: > i don't like updateExists as a name for this function. it sounds like it's > updating the "exists" status of the node, but it's actually primarily about > looking at the timeout. how about updateTimeout as a name? > > also it's probably worth working out how you want to order the functions in this > file - caller first or callee first. the predominant order seems to be callee > first, so perhaps this function should go before node.update. > (personally, i think i prefer to read the top-level functions first followed by > the lower-level functions, but i appreciate it's a matter of taste) Done. https://codereview.appspot.com/5672067/diff/12004/state/presence/presence.go#... state/presence/presence.go:140: return err On 2012/02/27 10:45:52, rog wrote: > i'd do this in one line: > return fmt.Errorf("%s is not a valid presence node: %s", n.path, err) Done. https://codereview.appspot.com/5672067/diff/12004/state/presence/presence.go#... state/presence/presence.go:222: alive := n.alive On 2012/02/27 10:45:52, rog wrote: > this function might read better if you named the result parameters. I'm leaving this as-is due to the type of watch; I definitely want to *return* a <-chan, but if watch is *declared* as a <-chan I can't pass it to waitDead/waitAlive. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go File state/presence/presence.go (right): https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go#... state/presence/presence.go:17: // returns the node's new MTime. This allows it to act as an ad-hoc remote clock On 2012/02/27 14:50:34, niemeyer wrote: > Please drop after "... new MTime.". It's not clear what "This" and "it" are in > this context, and the documentation here may focus on what it does rather than > what all the use cases are. Done. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go#... state/presence/presence.go:34: // Pinger continually updates a node in zookeeper when run. On 2012/02/27 14:50:34, niemeyer wrote: > s/continually/periodically/ > s/when run/while it is running/ Done. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go#... state/presence/presence.go:39: closing chan bool On 2012/02/27 14:50:34, niemeyer wrote: > s/closing/stopping/? > > Can we please use the Stop term, instead of Close? Stop is the counterpart of > Start, and makes more sense here given that we don't Open a Pinger. I've also > just realized that NewTicker, besides being "New", also uses "Stop" on its type. Done. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go#... state/presence/presence.go:42: // run calls change on p.target every p.period nanoseconds until p is closed. On 2012/02/27 14:50:34, niemeyer wrote: > s/closed/stopped/ Done. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go#... state/presence/presence.go:60: // watchers time out as late as possible. On 2012/02/27 14:50:34, niemeyer wrote: > That's nice. Thanks :). https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go#... state/presence/presence.go:61: func (p *Pinger) Close() { On 2012/02/27 14:50:34, niemeyer wrote: > s/Close/Stop/ Done. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go#... state/presence/presence.go:74: // path every period nanoseconds. On 2012/02/27 14:50:34, niemeyer wrote: > // StartPinger returns a new Pinger that refreshes the content of path > periodically. Done. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go#... state/presence/presence.go:77: _, err := target.change() On 2012/02/27 14:50:34, niemeyer wrote: > Nice. It's wise to ping it once before putting it alive. Thanks :). https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go#... state/presence/presence.go:94: // update sets the current values of n.alive and n.timeout. On 2012/02/27 14:50:34, niemeyer wrote: > s/sets/reads from ZooKeeper/ Done. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go#... state/presence/presence.go:107: // updateW sets the current values of n.alive and n.timeout, and returns a On 2012/02/27 14:50:34, niemeyer wrote: > Same. Done. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go#... state/presence/presence.go:130: // content and stat of the zookeeper node (which must exist). On 2012/02/27 14:50:34, niemeyer wrote: > updateExists reads a bit weird, and the documentation doesn't seem entirely > right. This method doesn't really care if the node exists or not when it's > called, and the "update" in its name means something else than it means for the > functions above (update reads from zk, updateW reads from zk, updateExists > doesn't even use n.path). > > I'd suggest something along the lines of setLiveness, but feel free to use > something else that makes more sense to you. setAlive works for me, especially because... https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go#... state/presence/presence.go:133: now, err := clock.change() On 2012/02/27 14:50:34, niemeyer wrote: > Something doesn't feel right here. Every single ping watcher will be touching > this file on every single period. This should only be necessary once when the > watch is first created, to avoid waiting for the timeout on a presence node that > is dead for long. From then on, we can wait for the timeout internally without > punching the same node across every single agent. ...if timeout is already set we're safe just (re-)setting alive to true, and if it's not we can do the clock/delay checks as before. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go#... state/presence/presence.go:150: // called when the node is known to be alive. On 2012/02/27 14:50:34, niemeyer wrote: > Please replace the last sentence with: > > zkWatch must be observing changes on the existent node at n.path. Done. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence.go#... state/presence/presence.go:180: // only be called when the node is known to be dead. On 2012/02/27 14:50:34, niemeyer wrote: > Ditto. Done. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence_tes... File state/presence/presence_test.go (right): https://codereview.appspot.com/5672067/diff/16001/state/presence/presence_tes... state/presence/presence_test.go:61: c.Assert((<-session).Ok(), Equals, true) On 2012/02/27 14:50:34, niemeyer wrote: > Please check the event type rather than just Ok. There's a chance we start > waiting for a connected event before returning from Dial at some point, and it'd > be useful to have this breaking in that case. Done. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence_tes... state/presence/presence_test.go:83: longEnough = period * 4 // longest possible time to detect a close On 2012/02/27 14:50:34, niemeyer wrote: > Shouldn't 3 be enough here? Assuming period of 25ms, and therefore timeout of 50ms: 0ms: pinger fires for the last time 49ms: new watcher checks and sees it's "alive" 99ms: watcher times out and realises it's actually dead 100ms: long enough :) https://codereview.appspot.com/5672067/diff/16001/state/presence/presence_tes... state/presence/presence_test.go:92: c.FailNow() On 2012/02/27 14:50:34, niemeyer wrote: > Replace both with: > > c.Fatal("Liveness change not detected") Done. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence_tes... state/presence/presence_test.go:100: t := time.After(longEnough) On 2012/02/27 14:50:34, niemeyer wrote: > Why isn't this inlined in the select? Er, cos I be dumb. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence_tes... state/presence/presence_test.go:104: c.FailNow() On 2012/02/27 14:50:34, niemeyer wrote: > Same as above. Done. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence_tes... state/presence/presence_test.go:111: t := time.After(longEnough) On 2012/02/27 14:50:34, niemeyer wrote: > Same. Done. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence_tes... state/presence/presence_test.go:117: c.FailNow() On 2012/02/27 14:50:34, niemeyer wrote: > Same. Done. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence_tes... state/presence/presence_test.go:242: waitFor(c, altConn, path) On 2012/02/27 14:50:34, niemeyer wrote: > Why is waitFor necessary here? As I understand, the StartPinger above should > only return after the change has been ack'd by the server, which means a new > connection here shouldn't find an arbitrarily different state. I thought that each connection was guaranteed to see the same timeline, but not necessarily at the same times; and in practice, during development, I did see failures that appeared to me to be consistent with two connections working on different-enough-to-be-wrong timelines; ie that, when you're hitting ZK over more than one connection, mere ordering of statements offers no guarantees wrt total order of operations. This fits with my prior understanding of ZK; have I misunderstood something? ...that said, I did just try to repro the problems *without* waitFors, and I couldn't find them... so I guess we can tentatively chalk it up to transient mid-dev stupidity and just drop waitFor. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence_tes... state/presence/presence_test.go:271: c.FailNow() On 2012/02/27 14:50:34, niemeyer wrote: > This is a Skip rather than an ExpectFailure. Done. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence_tes... state/presence/presence_test.go:280: waitFor(c, altConn, path) On 2012/02/27 14:50:34, niemeyer wrote: > Also don't see how that's necessary. Done. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence_tes... state/presence/presence_test.go:292: c.FailNow() On 2012/02/27 14:50:34, niemeyer wrote: > Skip as well. Done. https://codereview.appspot.com/5672067/diff/16001/state/presence/presence_tes... state/presence/presence_test.go:301: waitFor(c, s.zkConn, path) On 2012/02/27 14:50:34, niemeyer wrote: > Also don't see how that's necessary. Done. https://codereview.appspot.com/5672067/diff/18001/state/presence/presence.go File state/presence/presence.go (right): https://codereview.appspot.com/5672067/diff/18001/state/presence/presence.go#... state/presence/presence.go:99: // ie there's no way it can be dead. On 2012/02/28 09:47:37, rog wrote: > s/ie/that is,/ Done. https://codereview.appspot.com/5672067/diff/18001/state/presence/presence.go#... state/presence/presence.go:108: period, err := time.ParseDuration(content) On 2012/02/28 14:53:59, niemeyer wrote: > You have the new content despite the fact that this function has been run > before, which means you can update the timeout without touching /clock, right? > It's only "now" that you won't know on every single period, which doesn't really > matter since timeout is built on period, which is relative. Good point. Done.
Sign in to reply to this message.
Please take a look.
Sign in to reply to this message.
https://codereview.appspot.com/5672067/diff/22001/state/presence/presence.go File state/presence/presence.go (right): https://codereview.appspot.com/5672067/diff/22001/state/presence/presence.go#... state/presence/presence.go:97: timeoutWasSet := n.timeout != 0 this seems a little oblique to me. i wonder if things might read better if instead of trying to infer first-timeness from n.timeout, setAlive (and update and updateW) had a boolean firstTime argument, which would be set for the calls in Alive and AliveW only. less state to hold in the head.
Sign in to reply to this message.
Please take a look.
Sign in to reply to this message.
https://codereview.appspot.com/5672067/diff/24001/state/presence/presence.go File state/presence/presence.go (right): https://codereview.appspot.com/5672067/diff/24001/state/presence/presence.go#... state/presence/presence.go:93: // content and stat of the zookeeper node (which must exist). comment about firstTime? https://codereview.appspot.com/5672067/diff/24001/state/presence/presence.go#... state/presence/presence.go:113: // check for staleness with the clock node. this comment reads a bit oddly. perhaps: // If this method is not being run for the first time, // we know that the node has just changed, so we // know that it's alive and there's no need to check for // staleness with the clock node. https://codereview.appspot.com/5672067/diff/24001/state/presence/presence.go#... state/presence/presence.go:134: // changes. comment about firstTime?
Sign in to reply to this message.
LGTM.. just a few trivials. https://codereview.appspot.com/5672067/diff/24001/state/presence/presence.go File state/presence/presence.go (right): https://codereview.appspot.com/5672067/diff/24001/state/presence/presence.go#... state/presence/presence.go:98: return fmt.Errorf("%s is not a valid presence node: %s", n.path, err) I think it'd be more useful to have the actual content here: fmt.Errorf("%s presence node has bad data: %q", n.path, content) https://codereview.appspot.com/5672067/diff/24001/state/presence/presence_tes... File state/presence/presence_test.go (right): https://codereview.appspot.com/5672067/diff/24001/state/presence/presence_tes... state/presence/presence_test.go:71: _ = Suite(&PresenceSuite{}) Please move that next to the PresenceSuite type definition above. https://codereview.appspot.com/5672067/diff/24001/state/presence/presence_tes... state/presence/presence_test.go:73: longEnough = period * 4 // longest possible time to detect a close I can't parse what that actually means. Can you please move the comment to a full-line comment right above longEnough, and explain why period * 4 is long enough and how is "close" related to that? https://codereview.appspot.com/5672067/diff/24001/state/presence/presence_tes... state/presence/presence_test.go:132: assertNoChange(c, watch) These tests look pretty good, thanks. https://codereview.appspot.com/5672067/diff/24001/state/presence/presence_tes... state/presence/presence_test.go:253: c.Skip("Waiting on gozk change to eliminate occasional panic after Stop") This went in already. https://codereview.appspot.com/5672067/diff/24001/state/presence/presence_tes... state/presence/presence_test.go:272: c.Skip("Waiting on gozk change to eliminate consistent panic after Stop") Ditto.
Sign in to reply to this message.
Please take a look.
Sign in to reply to this message.
https://codereview.appspot.com/5672067/diff/24001/state/presence/presence.go File state/presence/presence.go (right): https://codereview.appspot.com/5672067/diff/24001/state/presence/presence.go#... state/presence/presence.go:93: // content and stat of the zookeeper node (which must exist). On 2012/02/28 17:12:49, rog wrote: > comment about firstTime? Done. https://codereview.appspot.com/5672067/diff/24001/state/presence/presence.go#... state/presence/presence.go:98: return fmt.Errorf("%s is not a valid presence node: %s", n.path, err) On 2012/03/08 21:58:45, niemeyer wrote: > I think it'd be more useful to have the actual content here: > > fmt.Errorf("%s presence node has bad data: %q", n.path, content) Done. https://codereview.appspot.com/5672067/diff/24001/state/presence/presence.go#... state/presence/presence.go:113: // check for staleness with the clock node. On 2012/02/28 17:12:49, rog wrote: > this comment reads a bit oddly. > perhaps: > // If this method is not being run for the first time, > // we know that the node has just changed, so we > // know that it's alive and there's no need to check for > // staleness with the clock node. Done. https://codereview.appspot.com/5672067/diff/24001/state/presence/presence.go#... state/presence/presence.go:134: // changes. On 2012/02/28 17:12:49, rog wrote: > comment about firstTime? Done. https://codereview.appspot.com/5672067/diff/24001/state/presence/presence_tes... File state/presence/presence_test.go (right): https://codereview.appspot.com/5672067/diff/24001/state/presence/presence_tes... state/presence/presence_test.go:71: _ = Suite(&PresenceSuite{}) On 2012/03/08 21:58:45, niemeyer wrote: > Please move that next to the PresenceSuite type definition above. Done. https://codereview.appspot.com/5672067/diff/24001/state/presence/presence_tes... state/presence/presence_test.go:73: longEnough = period * 4 // longest possible time to detect a close On 2012/03/08 21:58:45, niemeyer wrote: > I can't parse what that actually means. Can you please move the comment to a > full-line comment right above longEnough, and explain why period * 4 is long > enough and how is "close" related to that? Done. https://codereview.appspot.com/5672067/diff/24001/state/presence/presence_tes... state/presence/presence_test.go:132: assertNoChange(c, watch) On 2012/03/08 21:58:45, niemeyer wrote: > These tests look pretty good, thanks. Thanks :). https://codereview.appspot.com/5672067/diff/24001/state/presence/presence_tes... state/presence/presence_test.go:253: c.Skip("Waiting on gozk change to eliminate occasional panic after Stop") On 2012/03/08 21:58:45, niemeyer wrote: > This went in already. Done. https://codereview.appspot.com/5672067/diff/24001/state/presence/presence_tes... state/presence/presence_test.go:272: c.Skip("Waiting on gozk change to eliminate consistent panic after Stop") On 2012/03/08 21:58:45, niemeyer wrote: > Ditto. Done.
Sign in to reply to this message.
Please take a look.
Sign in to reply to this message.
LGTM, thanks!
Sign in to reply to this message.
*** Submitted: Presence nodes A replacement for zookeeper ephemeral nodes, which allows for seamless reestablishment of presence across separate sessions. R=rog, niemeyer CC= https://codereview.appspot.com/5672067
Sign in to reply to this message.
|