Rietveld Code Review Tool
Help | Bug tracker | Discussion group | Source code | Sign in
(38)

Issue 5244061: code review 5244061: html: rewrite the tokenizer to be more consistent. (Closed)

Can't Edit
Can't Publish+Mail
Start Review
Created:
13 years, 9 months ago by nigeltao
Modified:
13 years, 9 months ago
Reviewers:
CC:
andybalholm, gri, golang-dev
Visibility:
Public.

Description

html: rewrite the tokenizer to be more consistent. Previously, the tokenizer made two passes per token. The first pass established the token boundary. The second pass picked out the tag name and attributes inside that boundary. This was problematic when the two passes disagreed. For example, "<p id=can't><p id=won't>" caused an infinite loop because the first pass skipped everything inside the single quotes, and recognized only one token, but the second pass never got past the first '>'. This change rewrites the tokenizer to use one pass, accumulating the boundary points of token text, tag names, attribute keys and attribute values as it looks for the token endpoint. It should still be reasonably efficient: text, names, keys and values are not lower-cased or unescaped (and converted from []byte to string) until asked for. One of the token_test test cases was fixed to be consistent with html5lib. Three more test cases were temporarily disabled, and will be re-enabled in a follow-up CL. All the parse_test test cases pass.

Patch Set 1 #

Patch Set 2 : diff -r f183ae52d934 https://go.googlecode.com/hg/ #

Patch Set 3 : diff -r f183ae52d934 https://go.googlecode.com/hg/ #

Patch Set 4 : diff -r f183ae52d934 https://go.googlecode.com/hg/ #

Total comments: 14

Patch Set 5 : diff -r ceb8b9904cc4 https://go.googlecode.com/hg/ #

Total comments: 4

Patch Set 6 : diff -r 8fd7e6d070c8 https://go.googlecode.com/hg/ #

Unified diffs Side-by-side diffs Delta from patch set Stats (+272 lines, -227 lines) Patch
M src/pkg/html/escape.go View 1 1 chunk +10 lines, -0 lines 0 comments Download
M src/pkg/html/token.go View 1 2 3 4 5 14 chunks +230 lines, -211 lines 0 comments Download
M src/pkg/html/token_test.go View 1 2 3 4 4 chunks +32 lines, -16 lines 0 comments Download

Messages

Total messages: 10
nigeltao
Hello andybalholm@gmail.com, gri@golang.org (cc: golang-dev@googlegroups.com), I'd like you to review this change to https://go.googlecode.com/hg/
13 years, 9 months ago (2011-10-12 09:36:15 UTC) #1
andybalholm
It looks like a really good design. It'll need a little fine-tuning to bring it ...
13 years, 9 months ago (2011-10-12 16:19:42 UTC) #2
andybalholm
http://codereview.appspot.com/5244061/diff/6001/src/pkg/html/token.go File src/pkg/html/token.go (right): http://codereview.appspot.com/5244061/diff/6001/src/pkg/html/token.go#newcode134 src/pkg/html/token.go:134: data [2]int It might produce clearer code to use ...
13 years, 9 months ago (2011-10-12 20:09:25 UTC) #3
gri
FYI http://codereview.appspot.com/5244061/diff/6001/src/pkg/html/token.go File src/pkg/html/token.go (right): http://codereview.appspot.com/5244061/diff/6001/src/pkg/html/token.go#newcode140 src/pkg/html/token.go:140: pendingAttr [4]int same here: wouldn't it be better ...
13 years, 9 months ago (2011-10-12 23:04:35 UTC) #4
nigeltao
Yeah, the tokenizer isn't 100% correct, but I think it's good enough to check in ...
13 years, 9 months ago (2011-10-13 00:51:06 UTC) #5
andybalholm
LGTM
13 years, 9 months ago (2011-10-13 14:32:31 UTC) #6
nigeltao
gri, does it LGTY?
13 years, 9 months ago (2011-10-13 21:54:29 UTC) #7
gri
LGTM Please add a comment to the span indicating if it is inclusive or exclusive. ...
13 years, 9 months ago (2011-10-13 22:07:40 UTC) #8
nigeltao
http://codereview.appspot.com/5244061/diff/12001/src/pkg/html/token.go File src/pkg/html/token.go (right): http://codereview.appspot.com/5244061/diff/12001/src/pkg/html/token.go#newcode112 src/pkg/html/token.go:112: start, end int On 2011/10/13 22:07:40, gri wrote: > ...
13 years, 9 months ago (2011-10-13 22:56:55 UTC) #9
nigeltao
13 years, 9 months ago (2011-10-13 22:58:46 UTC) #10
*** Submitted as http://code.google.com/p/go/source/detail?r=b116c220cb2c ***

html: rewrite the tokenizer to be more consistent.

Previously, the tokenizer made two passes per token. The first pass
established the token boundary. The second pass picked out the tag name
and attributes inside that boundary. This was problematic when the two
passes disagreed. For example, "<p id=can't><p id=won't>" caused an
infinite loop because the first pass skipped everything inside the
single quotes, and recognized only one token, but the second pass never
got past the first '>'.

This change rewrites the tokenizer to use one pass, accumulating the
boundary points of token text, tag names, attribute keys and attribute
values as it looks for the token endpoint.

It should still be reasonably efficient: text, names, keys and values
are not lower-cased or unescaped (and converted from []byte to string)
until asked for.

One of the token_test test cases was fixed to be consistent with
html5lib. Three more test cases were temporarily disabled, and will be
re-enabled in a follow-up CL. All the parse_test test cases pass.

R=andybalholm, gri
CC=golang-dev
http://codereview.appspot.com/5244061
Sign in to reply to this message.

Powered by Google App Engine
RSS Feeds Recent Issues | This issue
This is Rietveld f62528b