Issue 5244061: code review 5244061: html: rewrite the tokenizer to be more consistent.

Issue 5244061: code review 5244061: html: rewrite the tokenizer to be more consistent. (Closed)

Can't Edit
Can't Publish+Mail
Start Review

Created:
13 years, 9 months ago by nigeltao

Modified:
13 years, 9 months ago

Reviewers:

CC:
andybalholm, gri, golang-dev

Visibility:
Public.

Description

html: rewrite the tokenizer to be more consistent. Previously, the tokenizer made two passes per token. The first pass established the token boundary. The second pass picked out the tag name and attributes inside that boundary. This was problematic when the two passes disagreed. For example, "<p id=can't><p id=won't>" caused an infinite loop because the first pass skipped everything inside the single quotes, and recognized only one token, but the second pass never got past the first '>'. This change rewrites the tokenizer to use one pass, accumulating the boundary points of token text, tag names, attribute keys and attribute values as it looks for the token endpoint. It should still be reasonably efficient: text, names, keys and values are not lower-cased or unescaped (and converted from []byte to string) until asked for. One of the token_test test cases was fixed to be consistent with html5lib. Three more test cases were temporarily disabled, and will be re-enabled in a follow-up CL. All the parse_test test cases pass.

Patch Set 1 #

Patch Set 2 : diff -r f183ae52d934 https://go.googlecode.com/hg/ #

Patch Set 3 : diff -r f183ae52d934 https://go.googlecode.com/hg/ #

Patch Set 4 : diff -r f183ae52d934 https://go.googlecode.com/hg/ #

Total comments: 14

Patch Set 5 : diff -r ceb8b9904cc4 https://go.googlecode.com/hg/ #

Total comments: 4

Patch Set 6 : diff -r 8fd7e6d070c8 https://go.googlecode.com/hg/ #

Created: 13 years, 9 months ago

Download [raw] [tar.bz2]

	Unified diffs	Side-by-side diffs	Delta from patch set	Stats (+272 lines, -227 lines)			Patch
M	src/pkg/html/escape.go	View	1	1 chunk	+10 lines, -0 lines	0 comments	Download
M	src/pkg/html/token.go	View	1 2 3 4 5	14 chunks	+230 lines, -211 lines	0 comments	Download
M	src/pkg/html/token_test.go	View	1 2 3 4	4 chunks	+32 lines, -16 lines	0 comments	Download

Messages

Total messages: 10

Expand All Messages | Collapse All Messages

nigeltao

Hello andybalholm@gmail.com, gri@golang.org (cc: golang-dev@googlegroups.com), I'd like you to review this change to https://go.googlecode.com/hg/

13 years, 9 months ago (2011-10-12 09:36:15 UTC) #1

andybalholm

It looks like a really good design. It'll need a little fine-tuning to bring it ...

13 years, 9 months ago (2011-10-12 16:19:42 UTC) #2

It looks like a really good design. It'll need a little fine-tuning to bring it
into compliance with the spec, but that can be done later. I think the design
will support it easily.

I ran a benchmark comparing it to the old version. It's slightly slower, but
about twice as fast as the single-pass tokenizer that I wrote. (The benchmark is
tokenizing a HTML page output by godoc.)

Before:
html.BenchmarkLowLevel	   10000	    107259 ns/op
html.BenchmarkHighLevel	    5000	    354051 ns/op

After:
html.BenchmarkLowLevel	   10000	    140221 ns/op
html.BenchmarkHighLevel	    5000	    423839 ns/op

http://codereview.appspot.com/5244061/diff/6001/src/pkg/html/token.go
File src/pkg/html/token.go (right):

http://codereview.appspot.com/5244061/diff/6001/src/pkg/html/token.go#newcode221
src/pkg/html/token.go:221: case ' ', '\n', '\t', '\f':
It might be a good idea to include '\r' here too. The HTML5 spec assumes that \r
and \r\n have already been converted to \n by a preprocessing stage, which we
don't have.

http://codereview.appspot.com/5244061/diff/6001/src/pkg/html/token.go#newcode341
src/pkg/html/token.go:341: // If we didn't get a final ">", it's a text token.
Not according to html5lib:

print html5lib.parse(StringIO.StringIO('<p x=1')).printTree()
#document
|  <html html>
|    <html head>
|    <html body>
|      <html p>
|        x="1"

http://codereview.appspot.com/5244061/diff/6001/src/pkg/html/token.go#newcode364
src/pkg/html/token.go:364: case ' ', '\n', '\t', '\f', '/':
Include '\r'?

http://codereview.appspot.com/5244061/diff/6001/src/pkg/html/token_test.go
File src/pkg/html/token_test.go (right):

http://codereview.appspot.com/5244061/diff/6001/src/pkg/html/token_test.go#ne...
src/pkg/html/token_test.go:56: // TODO: check what html5lib does in these two
cases.
print html5lib.parse(StringIO.StringIO("<p id=0</p>")).printTree()
#document
|  <html html>
|    <html head>
|    <html body>
|      <html p>
|        id="0</p"

print html5lib.parse(StringIO.StringIO('<p id="0</p>')).printTree()
#document
|  <html html>
|    <html head>
|    <html body>
|      <html p>
|        id="0</p>"

andybalholm

http://codereview.appspot.com/5244061/diff/6001/src/pkg/html/token.go File src/pkg/html/token.go (right): http://codereview.appspot.com/5244061/diff/6001/src/pkg/html/token.go#newcode134 src/pkg/html/token.go:134: data [2]int It might produce clearer code to use ...

13 years, 9 months ago (2011-10-12 20:09:25 UTC) #3

gri

FYI http://codereview.appspot.com/5244061/diff/6001/src/pkg/html/token.go File src/pkg/html/token.go (right): http://codereview.appspot.com/5244061/diff/6001/src/pkg/html/token.go#newcode140 src/pkg/html/token.go:140: pendingAttr [4]int same here: wouldn't it be better ...

13 years, 9 months ago (2011-10-12 23:04:35 UTC) #4

nigeltao

Yeah, the tokenizer isn't 100% correct, but I think it's good enough to check in ...

13 years, 9 months ago (2011-10-13 00:51:06 UTC) #5

gri

LGTM Please add a comment to the span indicating if it is inclusive or exclusive. ...

13 years, 9 months ago (2011-10-13 22:07:40 UTC) #8

nigeltao

http://codereview.appspot.com/5244061/diff/12001/src/pkg/html/token.go File src/pkg/html/token.go (right): http://codereview.appspot.com/5244061/diff/12001/src/pkg/html/token.go#newcode112 src/pkg/html/token.go:112: start, end int On 2011/10/13 22:07:40, gri wrote: > ...

13 years, 9 months ago (2011-10-13 22:56:55 UTC) #9

nigeltao

13 years, 9 months ago (2011-10-13 22:58:46 UTC) #10

*** Submitted as http://code.google.com/p/go/source/detail?r=b116c220cb2c ***

html: rewrite the tokenizer to be more consistent.

Previously, the tokenizer made two passes per token. The first pass
established the token boundary. The second pass picked out the tag name
and attributes inside that boundary. This was problematic when the two
passes disagreed. For example, "<p id=can't><p id=won't>" caused an
infinite loop because the first pass skipped everything inside the
single quotes, and recognized only one token, but the second pass never
got past the first '>'.

This change rewrites the tokenizer to use one pass, accumulating the
boundary points of token text, tag names, attribute keys and attribute
values as it looks for the token endpoint.

It should still be reasonably efficient: text, names, keys and values
are not lower-cased or unescaped (and converted from []byte to string)
until asked for.

One of the token_test test cases was fixed to be consistent with
html5lib. Three more test cases were temporarily disabled, and will be
re-enabled in a follow-up CL. All the parse_test test cases pass.

R=andybalholm, gri
CC=golang-dev
http://codereview.appspot.com/5244061

Expand All Messages | Collapse All Messages