Rietveld Code Review Tool
Help | Bug tracker | Discussion group | Source code | Sign in
(81)

Issue 74250043: code review 74250043: runtime: fix empty string handling in garbage collector (Closed)

Can't Edit
Can't Publish+Mail
Start Review
Created:
10 years, 8 months ago by rsc
Modified:
10 years, 8 months ago
Reviewers:
r, dave, iant, dvyukov, josharian
CC:
golang-codereviews, dave_cheney.net, r, dvyukov, jasdel, iant, khr
Visibility:
Public.

Description

runtime: fix empty string handling in garbage collector The garbage collector uses type information to guide the traversal of the heap. If it sees a field that should be a string, it marks the object pointed at by the string data pointer as visited but does not bother to look at the data, because strings contain bytes, not pointers. If you save s[len(s):] somewhere, though, the string data pointer actually points just beyond the string data; if the string data were exactly the size of an allocated block, the string data pointer would actually point at the next block. It is incorrect to mark that next block as visited and not bother to look at the data, because the next block may be some other type entirely. The fix is to ignore strings with zero length during collection: they are empty and can never become non-empty: the base pointer will never be used again. The handling of slices already does this (but using cap instead of len). This was not a bug in Go 1.2, because until January all string allocations included a trailing NUL byte not included in the length, so s[len(s):] still pointed inside the string allocation (at the NUL). This bug was causing the crashes in test/run.go. Specifically, the parsing of a regexp in package regexp/syntax allocated a []syntax.Inst with rounded size 1152 bytes. In fact it allocated many such slices, because during the processing of test/index2.go it creates thousands of regexps that are all approximately the same complexity. That takes a long time, and test/run works on other tests in other goroutines. One such other test is chan/perm.go, which uses an 1152-byte source file. test/run reads that file into a []byte and then calls strings.Split(string(src), "\n"). The string(src) creates an 1152-byte string - and there's a very good chance of it landing next to one of the many many regexp slices already allocated - and then because the file ends in a \n, strings.Split records the tail empty string as the final element in the slice. A garbage collection happens at this point, the collection finds that string before encountering the []syntax.Inst data it now inadvertently points to, and the []syntax.Inst data is not scanned for the pointers that it contains. Each syntax.Inst contains a []rune, those are missed, and the backing rune arrays are freed for reuse. When the regexp is later executed, the runes being searched for are no longer runes at all, and there is no match, even on text that should match. On 64-bit machines the pointer in the []rune inside the syntax.Inst is larger (along with a few other pointers), pushing the []syntax.Inst backing array into a larger size class, avoiding the collision with chan/perm.go's inadvertently sized file. I expect this was more prevalent on OS X than on Linux or Windows because those managed to run faster or slower and didn't overlap index2.go with chan/perm.go as often. On the ARM systems, we only run one errorcheck test at a time, so index2 and chan/perm would never overlap. It is possible that this bug is the root cause of other crashes as well. For now we only know it is the cause of the test/run crash. Many thanks to Dmitriy for help debugging. Fixes issue 7344. Fixes issue 7455.

Patch Set 1 #

Patch Set 2 : diff -r 9a60a27637a4 https://code.google.com/p/go/ #

Patch Set 3 : diff -r 9a60a27637a4 https://code.google.com/p/go/ #

Patch Set 4 : diff -r 9a60a27637a4 https://code.google.com/p/go/ #

Total comments: 2

Patch Set 5 : diff -r f74861b1d2b3 https://code.google.com/p/go/ #

Unified diffs Side-by-side diffs Delta from patch set Stats (+54 lines, -2 lines) Patch
M src/pkg/runtime/mgc0.c View 1 2 3 4 2 chunks +6 lines, -2 lines 0 comments Download
A test/gcstring.go View 1 2 3 4 1 chunk +48 lines, -0 lines 0 comments Download

Messages

Total messages: 14
rsc
Hello golang-codereviews@googlegroups.com (cc: dvyukov, iant, khr, r), I'd like you to review this change to ...
10 years, 8 months ago (2014-03-12 02:47:32 UTC) #1
dave_cheney.net
Excellent find Russ. I wonder if this is the cause of the random reports from ...
10 years, 8 months ago (2014-03-12 02:53:58 UTC) #2
rsc
It's possible but only at tip and only since January 27. Russ
10 years, 8 months ago (2014-03-12 02:55:41 UTC) #3
r
LGTM but please explain purpose of test in test code. https://codereview.appspot.com/74250043/diff/60001/test/gcstring.go File test/gcstring.go (right): https://codereview.appspot.com/74250043/diff/60001/test/gcstring.go#newcode3 ...
10 years, 8 months ago (2014-03-12 03:35:22 UTC) #4
dave_cheney.net
I'm seeing this odd error, odessa(~/go/test) % go run run.go # go run run.go -- ...
10 years, 8 months ago (2014-03-12 03:35:44 UTC) #5
rsc
On Tue, Mar 11, 2014 at 11:35 PM, Dave Cheney <dave@cheney.net> wrote: > I'm seeing ...
10 years, 8 months ago (2014-03-12 03:38:44 UTC) #6
dvyukov
LGTM striking story!
10 years, 8 months ago (2014-03-12 03:40:50 UTC) #7
jasdel
On 2014/03/12 03:35:44, dfc wrote: > I'm seeing this odd error, > > odessa(~/go/test) % ...
10 years, 8 months ago (2014-03-12 03:47:07 UTC) #8
rsc
On Tue, Mar 11, 2014 at 11:47 PM, <delpontej@gmail.com> wrote: > This patch fixes all.bash ...
10 years, 8 months ago (2014-03-12 03:48:47 UTC) #9
dave_cheney.net
LGTM. Fixes issue 7455.
10 years, 8 months ago (2014-03-12 03:49:09 UTC) #10
iant
LGTM https://codereview.appspot.com/74250043/diff/60001/test/gcstring.go File test/gcstring.go (right): https://codereview.appspot.com/74250043/diff/60001/test/gcstring.go#newcode1 test/gcstring.go:1: // Copyright 2014 The Go Authors. All rights ...
10 years, 8 months ago (2014-03-12 03:57:56 UTC) #11
rsc
*** Submitted as https://code.google.com/p/go/source/detail?r=b1e34388854e *** runtime: fix empty string handling in garbage collector The garbage ...
10 years, 8 months ago (2014-03-12 03:58:57 UTC) #12
josharian
Awesome. Thanks for the thorough explanation. For future reference -- since I also poked at ...
10 years, 8 months ago (2014-03-12 23:42:13 UTC) #13
rsc
10 years, 8 months ago (2014-03-13 01:15:58 UTC) #14
On Wed, Mar 12, 2014 at 7:42 PM, <josharian@gmail.com> wrote:

> Awesome. Thanks for the thorough explanation.
>
> For future reference -- since I also poked at this but got nowhere --
> what were the primary tools you used in tracking this down?
>

i don't feel like i got that far either honestly.

first i tried dumping the goroutine stacks at every garbage collection, and
i ran a few thousand instances of run.go with that turned on. i made a list
for each run of all the function pc's seen and i tried to find function
pc's that only appeared in failing runs or that were strongly indicative of
a failing run. that would suggest a place where the liveness bitmaps might
be wrong. that was a dead end.

i made the run.go program compile and save two of each regexp and panic
when they disagreed. that gave me a foothold; next was to sleep when they
disagreed so i could attach gdb and look around. using gdb to dump memory
(on a mac so gdb barely works) i found that it was always the rune slice
that had been reused incorrectly. that let me write a Check method on
regexp.Regexp that called Check on Prog that checked that the rune slice
contained actual runes (0...10ffff). if not it panicked. i inserted Check
calls in various places but the one that helped the most turned out to be
in the for _, we := range loop in run.go. maybe another loop just to do
Check after that loop finished would have helped too. at that point i knew
what to look for, and Check could print the address of the reused slice
data.

then i was able to reproduce it using GODEBUG=allocfreetrace=1. the most
interesting bit from that was that you could see the pointer being
allocated and then surviving at least a few garbage collections before
being freed incorrectly. surviving the first garbage collection strongly
suggested it was not a bug in the liveness analysis, because those bugs
matter only until the pointer is linked into the main heap object graph. in
fact the one i looked at most carefully survived 17 garbage collections
before dying mysteriously.

at this point i asked dmitriy about background sweep, because i could see
that the free was happening due to an on-demand sweep during an ordinary
allocation. i described what i knew and he didn't have any useful ideas at
the time. but then before he went to bed he pinged me to say 'maybe
s[len(s):] is mishandled now that there are no terminating NULs.' the data
i'd collected confirmed that this could be the problem and provided the
details needed to explain the specific failure patterns we were seeing.

so really i just flailed around for a few days until dmitriy had the right
insight (to be fair, he removed the terminating NUL too). but my flailing
produced data that backed up the insight.

russ
Sign in to reply to this message.

Powered by Google App Engine
RSS Feeds Recent Issues | This issue
This is Rietveld f62528b