Rietveld Code Review Tool

Issue 10495043: state: relation scope event coalescence

Created:
10 years, 10 months ago by fwereade
Modified:
10 years, 10 months ago
Reviewers:
mp+171069, rog
Visibility:
Public.

Description

state: relation scope event coalescence

So, this is an interesting branch, and it surely won't land as it is now. It *started* with lp:1192433; I think it's clear that dying units should advertise their departure from relations before they actually, uh, depart. The obvious answer seemed to be to allow them to write it in the relevant scope doc, because that's watched anyway; but that imposes a bit of a change on the watcher, because now it has to hit the DB to verify the nature of an event. Doing this 100 times for 100 changes in the same few ms seemed a bit crazy, but I might have let that slide had I not started examining the behaviour of the watchers under real usage (as opposed to how we unit test them). And, er, it's a bit alarming.

Watchers are all unit tested like this:

* cause changes
* force sync
* read output channel

...and in that context they work perfectly: all events are coalesced nicely, and everyone's happy. But in *real* usage the story's a bit different. Those same three things are happening, but they're all happening concurrently and independently. So, in practice, anything using a watcher live is almost continuously reading from its out chan; but the syncs happen every 5 seconds and are followed by a fast -- but stuttering, not continuous -- stream of events on the in chan. The upshot is that we barely coalesce at all: 100 scope changes delivered in response to a single sync will be received by the ultimate client as at least 90 separate sends on a continuously-read Changes channel.

Soon, this won't matter much, because the watchers are exposed over the API via the Next call, and we won't need to read from Changes until we get a Next request; so the coalescence *will* be induced. But this then constrains us to use the Next approach, and the Next approach itself will not significantly help coalescence in situations where there's an additional event-interpretation layer client-side (in particular, uniter.filter and the HookQueue bits *do* read as fast as possible and massage the watcher data some more before passing it on).

So... there's some justification for fixing coalescence behaviour anyway; and doing so in this case didn't really seem optional, because we weren't doing it properly in the first place (units can be seen to enter and leave in the same event, when such changes should be entirely stripped). And, better yet, doing so allowed me to batch up the new DB calls (FWIW, I think LifecycleWatcher would really benefit from this too).

But then I had to figure out how to test it. And I think I managed that, but there are a number of problems with the test; primarily that it takes a good 20s to run in the best case, and there appears to be an occasional problem with the scope-manipulation workers, where they just don't get run for a while (and since each starts 200 scope-leaving goroutines sharing the same state conn, that's maybe not so surprising). When this happens, the test takes about 60s on my machine (but still passes, yay). And ofc we can't go around burning 20-60s on a single unit test in state; and I don't really want to futz with the watcher timings, because I'm trying to verify real behaviour.

How would people feel about something like a testing.SlowSuite function, that only called gocheck.Suite if some flag like -juju.testing.slow were passed on the command line? go-bot (or perhaps the putative live-tests CI stuff) could certainly do that, and humans wouldn't need to be bothered by it at all if they didn't want to. Sensible?
https://code.launchpad.net/~fwereade/juju-core/interesting-watcher-derail/+merge/171069 (do not edit description out of merge proposal)
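
For concreteness, a minimal sketch of what such a helper might look like (the package, the flag name, and the os.Args scan are all illustrative assumptions; nothing like this exists yet, and the args are scanned by hand only because suite registration happens at init time, before flags are parsed):

package testing

import (
    "os"

    . "launchpad.net/gocheck"
)

// slowEnabled reports whether -juju.testing.slow was passed on the
// command line. os.Args is scanned directly because gocheck suites
// are registered at init time, before the flag package has parsed
// anything.
func slowEnabled() bool {
    for _, arg := range os.Args[1:] {
        if arg == "-juju.testing.slow" || arg == "--juju.testing.slow" {
            return true
        }
    }
    return false
}

// SlowSuite registers suite with gocheck only when slow tests have
// been explicitly requested, so 20-60s suites don't run on every
// developer machine by default.
func SlowSuite(suite interface{}) {
    if slowEnabled() {
        Suite(suite)
    }
}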

Patch Set 1 #

Total comments: 12
Stats: +285 lines, -54 lines

A [revision details]           1 chunk    +2 lines, -0 lines     0 comments
M state/relationunit.go        2 chunks   +3 lines, -2 lines     0 comments
M state/relationunit_test.go   2 chunks   +136 lines, -0 lines   7 comments
M state/watcher.go             6 chunks   +144 lines, -52 lines  5 comments

Messages

Total messages: 1
rog
10 years, 10 months ago (2013-06-24 16:00:54 UTC) #1
does not quite LGTM because of the time it takes to run the test. i'd like to
see what happens if you're able to reduce the watch polling interval (say to
0.5s).

otherwise, looks good, with some suggestions below.

as for slow tests, go test already provides a -short flag, and i think it would
be fine to use that. it would mean that the default is to run all tests, even
the slow ones, which i think is the right way around.
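
assuming the standard "testing" package is imported alongside gocheck, that would presumably end up looking something like this in the test itself (just a sketch, not code from the branch):

func (s *RelationUnitSuite) TestStressWatchScope(c *C) {
    if testing.Short() {
        c.Skip("skipping stress test in -short mode")
    }
    // ... the rest of the stress test as in the patch ...
}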

https://codereview.appspot.com/10495043/diff/1/state/relationunit_test.go
File state/relationunit_test.go (right):

https://codereview.appspot.com/10495043/diff/1/state/relationunit_test.go#new...
state/relationunit_test.go:468: func (s *RelationUnitSuite)
TestStressWatchScope(c *C) {
it's great to have a test like this, but...
it's quite a beast!

do you think it might be reasonable to split up into a few different
functions so it's more obvious where the side-effects are?

perhaps a little type:

type stressWatchScopeTest struct {
    unitsPerWorker int
    maxLeaveDelay int
    
    allEntered, allLeft sync.WaitGroup
}

func (t *stressWatchScopeTest) enterLeave(c *C) {
    c.Logf("worker %d: opening state...", worker)
    etc
}

func (t *stressWatchScopeTest) verifyChanges(c *C, w *state.RelationScopeWatcher) {
    eventCount := 0
    etc
}

https://codereview.appspot.com/10495043/diff/1/state/relationunit_test.go#new...
state/relationunit_test.go:486: // have to wait ~60s for it.
i hope we can find out what actually goes on there.

https://codereview.appspot.com/10495043/diff/1/state/relationunit_test.go#new...
state/relationunit_test.go:488: workers := runtime.GOMAXPROCS(-1)
workers := runtime.NumCPU()
defer runtime.GOMAXPROCS(runtime.GOMAXPROCS(workers))

?

https://codereview.appspot.com/10495043/diff/1/state/relationunit_test.go#new...
state/relationunit_test.go:493: maxLeaveDelay := 8
i think i'd make this 8 * time.Second and then use

   time.Duration(i) * time.Second % maxLeaveDelay

below, to make it obvious how long our max delay is here.
(it also provides an interesting way of tweaking the test,
by making maxLeaveDelay a smaller prime number)

https://codereview.appspot.com/10495043/diff/1/state/relationunit_test.go#new...
state/relationunit_test.go:500: go func(worker int) {
worker := worker
go func() {
    ...
}()

is more direct and it's easier to verify what
value worker is actually given, rather than
looking all the way down to the func call argument.

https://codereview.appspot.com/10495043/diff/1/state/relationunit_test.go#new...
state/relationunit_test.go:502: st, err := state.Open(state.TestingStateInfo(),
state.TestingDialOpts())
it might be interesting to make each worker open the state too...
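
for example, each worker goroutine could dial and tear down its own connection (a sketch reusing the Open call quoted above; it uses Check rather than Assert because this runs outside the main test goroutine):

go func() {
    // Each worker gets its own state connection instead of sharing
    // the suite's, and closes it when the worker finishes.
    st, err := state.Open(state.TestingStateInfo(), state.TestingDialOpts())
    if !c.Check(err, IsNil) {
        return
    }
    defer st.Close()
    // ... enter and leave scopes via st ...
}()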

https://codereview.appspot.com/10495043/diff/1/state/relationunit_test.go#new...
state/relationunit_test.go:515: finished.Add(1)
as discussed online, this might look nicer as something like this:

endTime := t0.Add(time.Second * time.Duration(i%maxLeaveDelay))
go func() {
    time.Sleep(endTime.Sub(time.Now()))
    err := ru.LeaveScope()
    c.Check(err, IsNil)
    finished.Done()
}()

or put the end time computation inside the func
and add an:

i := i

declaration just before the go statement.

https://codereview.appspot.com/10495043/diff/1/state/watcher.go
File state/watcher.go (right):

https://codereview.appspot.com/10495043/diff/1/state/watcher.go#newcode426
state/watcher.go:426: base map[string]bool
i'd like to see a comment on these fields and on the functions that operate on
the type.
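
something along these lines, say (purely illustrative: the diff field and both types are only inferred from the later comment about delete(info.diff, name), and the real semantics have to come from you):

// base holds the set of units reported in scope by the last event
// sent on the Changes channel.
base map[string]bool

// diff holds changes accumulated since base was last reported: true
// means the unit has entered scope, false means it has left. Entries
// that return to their base state are deleted, so a unit that enters
// and leaves between two events is stripped from the next event
// entirely.
diff map[string]bool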

https://codereview.appspot.com/10495043/diff/1/state/watcher.go#newcode452
state/watcher.go:452: }
delete(info.diff, name)

and lose the map reallocation at the end?

https://codereview.appspot.com/10495043/diff/1/state/watcher.go#newcode534
state/watcher.go:534: i := 0
this doesn't seem to be used

https://codereview.appspot.com/10495043/diff/1/state/watcher.go#newcode572
state/watcher.go:572: sel := D{{"_id", D{{"$in", existIds}}}}
this is a good example why i think our "everything
is a bulk op" policy may be premature. this
is essentially a bulk op, but it doesn't map well
to our "give me a result for each item in this
vector paradigm" - it's more like "here's a bundle
of ids, give us a bundle of results", which is much
easier to implement with mgo's underlying primitives
and may be sufficient in most of the cases we care
about.

just sayin'.
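
in raw mgo terms the shape being described is roughly the following (a sketch with assumed collection and doc names, not the code in the patch):

// Hand mongo a bundle of ids, get back only the docs that exist,
// in a single round trip rather than one query per id.
var docs []relationScopeDoc
sel := D{{"_id", D{{"$in", existIds}}}}
if err := relationScopes.Find(sel).All(&docs); err != nil {
    return nil, err
}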

https://codereview.appspot.com/10495043/diff/1/state/watcher.go#newcode612
state/watcher.go:612: if !sent || info.hasChanges() {
since AFAICS sent always == (out != nil), wouldn't this be a tad nicer as:

if out != nil || info.hasChanges() {
    changes = info.changes()
    out = w.out
} else {
    out = nil
}

without the need for sent at all?
