R=gri There are many, many TODOs before this parser is good enough for the tangled ...
14 years, 5 months ago
(2010-12-02 06:21:56 UTC)
#2
R=gri
There are many, many TODOs before this parser is good enough for the tangled
mess that is all the world's HTML, but the general approach should be clear, and
it does pass the first three tests in the test suite.
I've rewritten it to have an explicit read instead of an unread, driven by a ...
14 years, 5 months ago
(2010-12-03 04:04:24 UTC)
#4
I've rewritten it to have an explicit read instead of an unread, driven by a
bool return value from calling the insertionMode transition function. Having the
transition function explictly call consume (or read) doesn't work so well since
it would have to handle the error (which may be EOF).
Consuming is the norm, and not doing so is the exception. Often the function
returns in return !foo, and it might be a little nicer to reverse the semantics
of the bool return, although I don't have a good variable name that means the
opposite of consumed. Let me think about it.
http://codereview.appspot.com/3355041/diff/5001/src/pkg/html/doc.go
File src/pkg/html/doc.go (right):
http://codereview.appspot.com/3355041/diff/5001/src/pkg/html/doc.go#newcode72
src/pkg/html/doc.go:72: Parsing is done by calling Parse on an io.Reader, which
returns the root of the
On 2010/12/02 22:11:35, gri wrote:
> should this be: "... by calling Parse with an io.Reader, ..." ? (_on_ an
> io.Reader sounds to me like Parse is a method of io.Reader, but then again, I
am
> not a native English speaker).
Done.
http://codereview.appspot.com/3355041/diff/5001/src/pkg/html/parse.go
File src/pkg/html/parse.go (right):
http://codereview.appspot.com/3355041/diff/5001/src/pkg/html/parse.go#newcode16
src/pkg/html/parse.go:16: // ErrorNode means that an error occurred during
parsing.
On 2010/12/02 22:11:35, gri wrote:
> These comments are redundant - the constant names are so clear. Better to just
> have a small comment before the const.
Done.
http://codereview.appspot.com/3355041/diff/5001/src/pkg/html/parse.go#newcode30
src/pkg/html/parse.go:30: // contain a slice of Attributes. Data is unescaped,
so that it looks like
On 2010/12/02 22:11:35, gri wrote:
> s/Data/Text Data/ ?
It's just Data, the same as a Token's Data field.
http://codereview.appspot.com/3355041/diff/5001/src/pkg/html/parse.go#newcode42
src/pkg/html/parse.go:42: type insertionMode func(*parser) insertionMode
On 2010/12/02 22:11:35, gri wrote:
> I wonder if there's a better name for this function type.
"transitionFunction",
> "transition", ?
The spec uses "insertion mode".
http://codereview.appspot.com/3355041/diff/5001/src/pkg/html/parse.go#newcode48
src/pkg/html/parse.go:48: t *Tokenizer
On 2010/12/02 22:11:35, gri wrote:
> Is field t accessed a lot? It deserves a longer name unless it's used
everywhere
> (maybe tstream, stream, ...). Your are not cheap with characters for the next
> fields (token, and hasToken). It seems to me that it should only be accessed
in
> read().
Done.
http://codereview.appspot.com/3355041/diff/5001/src/pkg/html/parse.go#newcode49
src/pkg/html/parse.go:49: token Token
On 2010/12/02 22:11:35, gri wrote:
> I'd rather make this field shorter ("tok" comes to mind), it's used
everywhere.
Done.
http://codereview.appspot.com/3355041/diff/5001/src/pkg/html/parse.go#newcode50
src/pkg/html/parse.go:50: hasToken bool
On 2010/12/02 22:11:35, gri wrote:
> should explain hasToken
>
> As an aside: I am always skeptical when I see a boolean flag to be used with a
> read/unread mechanism. Would it be possible to always have the invariant that
> token is the next token we are looking at, and that it may not be _consumed_
in
> some cases (effectively when you'd do the unread)? In all cases it is
consumed,
> and read is called to get the token afterwards. So instead of doing an unread
> (and somehow one needs to know that there are no 2 unreads following each
other,
> otherwise there's a bug), there's a consume().
I've made insertionMode return a bool saying whether or not to consume.
http://codereview.appspot.com/3355041/diff/5001/src/pkg/html/parse.go#newcode55
src/pkg/html/parse.go:55: // doc is the resultant document.
On 2010/12/02 22:11:35, gri wrote:
> Is this the root or the current sub-node? explain.
It's the root node.
http://codereview.appspot.com/3355041/diff/5001/src/pkg/html/parse.go#newcode59
src/pkg/html/parse.go:59: // Element pointers (section 10.2.3.4).
On 2010/12/02 22:11:35, gri wrote:
> Perhaps this could also be explained in more detail here.
These extra state fields are written but not read anywhere yet. I'll add some
explanatory comments when I implement the part of the specification's algorithm
that refers to them.
http://codereview.appspot.com/3355041/diff/5001/src/pkg/html/parse.go#newcode65
src/pkg/html/parse.go:65: // pop pops from the stack of open elements.
On 2010/12/02 22:11:35, gri wrote:
> It must only be called for non-empty stacks.
>
> (or something along those lines)
Done.
http://codereview.appspot.com/3355041/diff/5001/src/pkg/html/parse.go#newcode81
src/pkg/html/parse.go:81: n := len(p.stack)
On 2010/12/02 22:11:35, gri wrote:
> perhaps invert this, with common (?) case first:
>
> if n := len(p.stack); n > 0 {
> return p.stack[n-1]
> }
> return p.doc
>
> (up to you)
Done.
http://codereview.appspot.com/3355041/diff/5001/src/pkg/html/token.go
File src/pkg/html/token.go (right):
http://codereview.appspot.com/3355041/diff/5001/src/pkg/html/token.go#newcode18
src/pkg/html/token.go:18: // ErrorToken means that an error occurred during
tokenization.
On 2010/12/02 22:11:35, gri wrote:
> Is there a particular reason for the longer names? The shorter ones seem nice.
It's because I'm introducing ErrorNode and TextNode as values of type NodeType.
PTAL. Regarding error handling, the spec defines "parse errors" but IIUC they only terminate parsing ...
14 years, 5 months ago
(2010-12-05 22:58:25 UTC)
#6
PTAL.
Regarding error handling, the spec defines "parse errors" but IIUC they only
terminate parsing in strict mode, which is only really used by conformance
checkers. Real world parsers just recover when encountering malformed HTML, and
the spec defines exactly how to recover. Real world HTML is nonconformant; that
horse has bolted and will never come back.
My parser doesn't do anything on "parse errors" except recover, and my tokenizer
only returns errors in placeholder TODO cases (such as implementing <!--
comments -->) or passing along those errors from its underlying io.Reader.
http://codereview.appspot.com/3355041/diff/15001/src/pkg/html/parse.go
File src/pkg/html/parse.go (right):
http://codereview.appspot.com/3355041/diff/15001/src/pkg/html/parse.go#newcode61
src/pkg/html/parse.go:61: // pop pops from the stack of open elements.
On 2010/12/04 00:13:41, gri wrote:
> pop pops the top of the stack of open nodes.
>
> (matches what you wrote for top())
Done.
http://codereview.appspot.com/3355041/diff/15001/src/pkg/html/parse.go#newcode75
src/pkg/html/parse.go:75: // top returns the top of the stack of open elements.
On 2010/12/04 00:13:41, gri wrote:
> s/element/nodes/
stack is a []*Node, but consists only of element nodes, and not text nodes.
http://codereview.appspot.com/3355041/diff/15001/src/pkg/html/parse.go#newcod...
src/pkg/html/parse.go:115: // read reads the next token. This is usually from
the tokenizer, but it may
On 2010/12/04 00:13:41, gri wrote:
> This comment is not quite right anymore (there's no unread).
Done.
LGTM http://codereview.appspot.com/3355041/diff/26001/src/pkg/html/doc.go File src/pkg/html/doc.go (right): http://codereview.appspot.com/3355041/diff/26001/src/pkg/html/doc.go#newcode73 src/pkg/html/doc.go:73: the parse tree (the document element) as a ...
14 years, 5 months ago
(2010-12-06 22:17:48 UTC)
#7
*** Submitted as http://code.google.com/p/go/source/detail?r=018e9434af09 *** html: first cut at a parser. R=gri CC=golang-dev http://codereview.appspot.com/3355041
14 years, 5 months ago
(2010-12-07 01:02:48 UTC)
#8
Issue 3355041: code review 3355041: html: first cut at a parser.
(Closed)
Created 14 years, 5 months ago by nigeltao
Modified 14 years, 5 months ago
Reviewers:
Base URL:
Comments: 38