Issue 4559048: faster html-sanitizer.js (Closed)

Created:
15 years, 1 month ago by felix8a

Modified:
14 years, 3 months ago

Reviewers:
MikeSamuel, Jasvir, metaweta

CC:
google-caja-discuss_googlegroups.com

Base URL:
http://google-caja.googlecode.com/svn/trunk/

Visibility:
Public.

Description

html-sanitizer is pathologically slow on IE<=8 and Firefox 3.6. This change is a rewrite of the parser that addresses that. For example, sanitizing the html source of http://code.google.com/p/google-caja/issues/list takes this much time on my computers (msec):

  old    new
  25000  580  ie6
  22000  550  ie7
  20000  350  ie8
   4400  100  ff3.6

Most of the slowness is due to this statement:

  htmlText = htmlText.substring(m[0].length);

The old parser is a typical parsing loop: identify the next token at the front of htmlText, process it, then remove it. However, this peel-off-the-head loop is pathologically slow on the above browsers, which I'm guessing is because the substring operation above copies the string tail to a new block of memory rather than re-using the same memory.

The new parser avoids that by first splitting the input into potentially meaningful tokens (eg, '<' '<!--' '&' are tokens). It then recombines these tokens when they happen to be quoted or otherwise not meaningful.

On modern browsers (Chrome-18, Firefox-11, IE-9, Opera-11, Safari-5.1), the sanitization time for the above example is pretty similar for both algorithms, about 30-50msec on my computers.

This change also adds some machinery for performance testing. Some of the test inputs trigger pathological behavior for the old and new sanitizers in modern browsers. The two sanitizers have different pathological cases, so the new one is sometimes slower, but the trouble spots are uncommon patterns like <p title=">>>>[repeated]...">[repeated]..., the runtimes are small (eg, old=5msec new=50msec), and I think the new algorithm is O(n) for all types of inputs. The old algorithm is quadratic for some pathological inputs. (On IE9, one example is old=.09msec new=.06msec; the same example repeated 2000 times is old=1300msec new=4.5msec.)

Since this is a nontrivial change, I've also added more testing, especially regression testing.

There's one incompatible behavior change. The public function makeSaxParser creates a SAX-style parser that calls handlers like startTag, endTag, cdata, etc. The old and new parsers chunk cdata differently. For example, the input x&y will be three cdata events in the old parser, but two in the new. I think this difference doesn't matter. I did a code search for uses of makeSaxParser and found only 2 outside of html-sanitizer. Neither of them cares how cdata is chunked. We don't specify the behavior of makeSaxParser, so this change is not violating any explicit contract. Also, the Java SAX specification explicitly says that character events are not necessarily chunked the same way all the time, so it seems unlikely we're violating any implicit contracts.
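The two loop shapes described above can be contrasted with a minimal sketch. This is not the code from the patch: the function names are hypothetical, the token regexp is borrowed from the split pattern quoted later in this review, and the functions only count tokens instead of dispatching to handlers.

```javascript
// Illustrative sketch only; names are hypothetical and not from the patch.

// Old shape: peel each token off the head of the string. Every iteration
// calls substring(), which on some engines may copy the remaining tail.
function countTokensPeeling(htmlText) {
  // Always matches at least one character of a non-empty string.
  var tokenRe = /^(?:<\/|<!--|<[!?]|[&<>]|[^&<>]+)/;
  var count = 0;
  while (htmlText) {
    var m = tokenRe.exec(htmlText);
    count++;
    htmlText = htmlText.substring(m[0].length);
  }
  return count;
}

// New shape: split once into potentially meaningful tokens, then walk the
// resulting array by index. Assumes split() keeps captured separators,
// which is not true of every engine (see the splitWillCapture discussion).
function countTokensSplitting(htmlText) {
  var parts = htmlText.split(/(<\/|<!--|<[!?]|[&<>])/g);
  var count = 0;
  for (var pos = 0, end = parts.length; pos < end; pos++) {
    // Empty strings appear at the edges and between adjacent separators.
    if (parts[pos]) { count++; }
  }
  return count;
}
```

Both functions see the same token boundaries; the difference is that the second never rebuilds the remaining input string.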

Patch Set 1 #

Total comments: 18

Patch Set 2 : faster html-sanitizer.js #

Patch Set 3 : faster html-sanitizer.js #

Total comments: 19

Patch Set 4 : faster html-sanitizer.js #

Created: 14 years, 3 months ago

	File	Delta from patch set	Chunks	Stats (+1988 lines, -371 lines)	Comments
M	build.xml	1 2	1 chunk	+2 lines, -0 lines	0 comments
M	src/com/google/caja/plugin/html-sanitizer.js	1 2 3	4 chunks	+333 lines, -149 lines	0 comments
A	src/com/google/caja/plugin/html-sanitizer-exp.js	1 2 3	1 chunk	+8 lines, -0 lines	0 comments
A	src/com/google/caja/plugin/html-sanitizer-legacy.js	1	1 chunk	+632 lines, -0 lines	0 comments
M	tests/com/google/caja/plugin/JsHtmlSanitizerTest.java	1 2	1 chunk	+11 lines, -0 lines	0 comments
M	tests/com/google/caja/plugin/css-stylesheet-test.html	1 2	1 chunk	+1 line, -1 line	0 comments
M	tests/com/google/caja/plugin/html-css-sanitizer-test.html	1 2	1 chunk	+1 line, -1 line	0 comments
M	tests/com/google/caja/plugin/html-css-sanitizer-test.js	1 2	1 chunk	+7 lines, -7 lines	0 comments
A	tests/com/google/caja/plugin/html-sanitizer-bench.css	1	1 chunk	+32 lines, -0 lines	0 comments
A	tests/com/google/caja/plugin/html-sanitizer-bench.html	1	1 chunk	+67 lines, -0 lines	0 comments
A	tests/com/google/caja/plugin/html-sanitizer-bench.js	1	1 chunk	+227 lines, -0 lines	0 comments
A	tests/com/google/caja/plugin/html-sanitizer-data.js	1	1 chunk	+306 lines, -0 lines	0 comments
A	tests/com/google/caja/plugin/html-sanitizer-legacy-test.html	1	1 chunk	+49 lines, -0 lines	0 comments
A	tests/com/google/caja/plugin/html-sanitizer-regress.html	1	1 chunk	+43 lines, -0 lines	0 comments
A	tests/com/google/caja/plugin/html-sanitizer-regress.js	1	1 chunk	+67 lines, -0 lines	0 comments
A	tests/com/google/caja/plugin/html-sanitizer-samples.js	1	1 chunk	+2 lines, -0 lines	0 comments
M	tests/com/google/caja/plugin/html-sanitizer-test.html	1 2	1 chunk	+14 lines, -19 lines	0 comments
M	tests/com/google/caja/plugin/html-sanitizer-test.js	1 2	3 chunks	+186 lines, -194 lines	0 comments

Messages

Total messages: 11

felix8a

forgot to mention, I did try the m//g and lastIndex approach, and it turns out ...

15 years, 1 month ago (2011-05-26 23:39:30 UTC) #2

MikeSamuel

http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer-exp.js File src/com/google/caja/plugin/html-sanitizer-exp.js (right): http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer-exp.js#newcode1 src/com/google/caja/plugin/html-sanitizer-exp.js:1: // stub for experimenting with changes to html-sanitizer.js This ...

15 years, 1 month ago (2011-05-27 20:22:28 UTC) #3

felix8a

14 years, 3 months ago (2012-03-13 16:58:20 UTC) #4

I'm going to upload a new patch momentarily, rebased from trunk, with changes
from the comments and a bunch of other minor changes.

I've got someone asking for the fix to the IE substr performance problem, so I'd
like to get this committed soon, and then work on the other performance issues
later, which are minor relative to the IE substr problem.

On 2011/05/27 20:22:28, MikeSamuel wrote:
>
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-...
> File src/com/google/caja/plugin/html-sanitizer-exp.js (right):
> 
>
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-...
> src/com/google/caja/plugin/html-sanitizer-exp.js:1: // stub for experimenting
> with changes to html-sanitizer.js
> This doesn't seem to be included in the html-sanitizer bundle in build.xml.

intentionally excluded.  added a clarifying comment.

>
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-...
> File src/com/google/caja/plugin/html-sanitizer-r4455.js (right):
> 
>
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-...
> src/com/google/caja/plugin/html-sanitizer-r4455.js:1: // Copyright (C) 2006
> Google Inc.
> Is this the old version?  What are the long term plans for this?

originally I thought it would be helpful if users could refer to the old version
of html-sanitizer at a public url, but there doesn't seem to be much value to
that, so I've renamed the files -legacy, and now they're only used for testing.

>
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-...
> File src/com/google/caja/plugin/html-sanitizer.js (right):
> 
>
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-...
> src/com/google/caja/plugin/html-sanitizer.js:192: var ATTR_RE = new RegExp(
> Ok, so this is meant to match a name = value pair?

yes, added a comment.

>
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-...
> src/com/google/caja/plugin/html-sanitizer.js:208: '[^\"\'\\s]*' ) +
> If this is not matched against any string containing the ">" or "/>" closing a
> tag, then please document that fact and ignore the below.  Otherwise, ...

Yeah, there's never a closing > in the string matched.  Clarified in the
comment.

>
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-...
> src/com/google/caja/plugin/html-sanitizer.js:252: // parts if we discover
> they're in a different context.
> Is splitting faster than global matching?  e.g. str.match(/.../g)

I think I did an experiment with parsing using //g at one point, but I don't
remember the result of that.  I'll look at it again when I work on the minor
performance issues marked as TODO.

>
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-...
> src/com/google/caja/plugin/html-sanitizer.js:266: var nextEndComment = 0;
> Maybe initialize to -1 if my comments on uses of nextGT and nextEndComment are
> correct.

I'm not actually using this the way I thought, so I replaced it with a flag
noMoreEndComments, meaning we've scanned to the end of the input and didn't find
any end comment markers, so we don't have to repeat the scan if we see another
open comment marker.
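A rough sketch of how such a flag avoids rescanning, written against a plain string with indexOf rather than the patch's token array. The function name and the returned counts are hypothetical; only the flag name comes from the reply above.

```javascript
// Illustrative sketch only; the real parser works on a token array.
function scanComments(text) {
  var closed = 0;
  var unclosed = 0;
  var pos = 0;
  // Set once a search for '-->' has run to the end of the input and failed.
  var noMoreEndComments = false;
  while (true) {
    var open = text.indexOf('<!--', pos);
    if (open < 0) { break; }
    if (noMoreEndComments) {
      // No need to rescan: an earlier search already covered the rest.
      unclosed++;
      pos = open + 4;
      continue;
    }
    var close = text.indexOf('-->', open + 4);
    if (close < 0) {
      noMoreEndComments = true;
      unclosed++;
      pos = open + 4;
    } else {
      closed++;
      pos = close + 3;
    }
  }
  return { closed: closed, unclosed: unclosed };
}
```

Without the flag, an input made of many '<!--' markers and no '-->' would trigger one scan to the end of the input per marker, which is quadratic.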

>
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-...
> src/com/google/caja/plugin/html-sanitizer.js:267: while (pos < end) {
> The relationship between pos,end,parts and this loop might be more obvious if
> they were declared in the loop
> 
>     for (var pos = 0, end = parts.length; pos < end;) {

done

>
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-...
> src/com/google/caja/plugin/html-sanitizer.js:277: }
> Do your benchmarks take into account that there is a call to pcdata per
> entity.  Some real HTML has large numbers of &nbsp;'s.

yes, some of the test cases are large numbers of repeated & entities, and these
show mostly insignificant performance differences between old and new versions.

> Is the fact that contiguous text segments might be split into separate pcdata
> calls a change from existing behavior?

yes, but I think it doesn't matter.  I've added a comment about it, and also
updated the description of this change to explain my reasoning.

>
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-...
> src/com/google/caja/plugin/html-sanitizer.js:320: p = (nextEndComment < pos + 1)
> ? pos + 1 : nextEndComment;
> Is < pos + 1 the right boundary condition?  You've already incremented pos at
> the top of the loop, so does (pos) point at the token after the "<!--"?

pos + 1 is correct, because '--' and '>' are separate tokens, and we're looking
for the '>'.  added a clarifying comment.

>
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-...
> src/com/google/caja/plugin/html-sanitizer.js:374: var re =
> /(<\/|<!--|<[!?]|[&<>])/g;
> Why is it necessary to split on ">"?

making '>' a separate token lets us quickly find the end of a tag in cases where
quotes aren't involved, which is almost all end tags and many start tags.  I'm
not sure I tried a variant that doesn't split on '>'; I might explore that
option when working on some of the performance TODOs.

> Can you split using a forward lookahead to get consistent behavior on IE?
> var re = /(?=<(?:[/&!?]?|!--|>)/;

'axbxc'.split(/(?=x)/g) is ['a', 'xb', 'xc'], which is not the same, but might
be usable with a tweak of the algorithm.  I'll explore this later.
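For illustration, the two split shapes side by side, plus one possible "tweak" that turns the lookahead result back into separate separator tokens. The helper name is hypothetical and this is only a sketch of the idea, not something from the patch.

```javascript
// Capturing split keeps each separator as its own array element
// (on engines where split() supports captures).
var captured = 'axbxc'.split(/(x)/);      // ['a', 'x', 'b', 'x', 'c']

// Lookahead split leaves each separator glued to the text that follows it.
var lookahead = 'axbxc'.split(/(?=x)/g);  // ['a', 'xb', 'xc']

// Hypothetical tweak: peel the leading separator off each piece.
// leadRe must be anchored with ^ and match the separator at the start.
function separateLeadingToken(pieces, leadRe) {
  var out = [];
  for (var i = 0; i < pieces.length; i++) {
    var piece = pieces[i];
    var m = leadRe.exec(piece);
    if (m) {
      out.push(m[0]);
      if (m[0].length < piece.length) {
        out.push(piece.substring(m[0].length));
      }
    } else {
      out.push(piece);
    }
  }
  return out;
}

var recombined = separateLeadingToken(lookahead, /^x/);
```

Unlike the capturing split, this version never emits empty strings between adjacent separators, so a parser built on it would not need to skip them.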

>
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-...
> src/com/google/caja/plugin/html-sanitizer.js:418: function parseText(parts, tag,
> h, param) {
> I haven't looked clearly at this method yet.  Can you add a comment on what it's
> supposed to do.

done

>
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-...
> src/com/google/caja/plugin/html-sanitizer.js:421: endTagRe[tag.name] = new
> RegExp('^' + tag.name + '(?:[\\s\\/]|$)', 'i');
> Ok.  Even if tag.name can contain any of [\w:-] none of those are special in
> regexps outside charsets.  Obviously . and $ would cause problems.

right
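A small sketch of the per-tag regexp cache being discussed. The RegExp construction is taken from the quoted line; the wrapper function and the hasOwnProperty guard are assumptions added here for a self-contained example.

```javascript
// Illustrative sketch only; getEndTagRe is a hypothetical wrapper.
var endTagRe = {};

function getEndTagRe(name) {
  // Build each regexp once per tag name and reuse it afterwards.
  if (!Object.prototype.hasOwnProperty.call(endTagRe, name)) {
    // Safe only because tag names are limited to [\w:-], none of which
    // is special in a regexp outside a character class.
    endTagRe[name] = new RegExp('^' + name + '(?:[\\s\\/]|$)', 'i');
  }
  return endTagRe[name];
}
```

The trailing `(?:[\\s\\/]|$)` keeps a name like `script` from matching a longer name such as `scripts`, while the `i` flag makes the match case-insensitive.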

>
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-...
> src/com/google/caja/plugin/html-sanitizer.js:451: // For now, optimistically
> assume there are no quoted '>'
> What does this mean?  Is this assumption revisited?  Is this assumption the
> reason for the attr regexp that I commented upon above?

yes, clarified the comment.

>
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-...
> src/com/google/caja/plugin/html-sanitizer.js:500: if (q === '"' || q === "'") {
> charCodeAt can be faster than charAt.
> q = v.charCodeAt(0);
> if (q === 0x22 || q === 0x27) {

done
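As a sketch of the suggested pattern in context: the comparison on character codes is from the comment above, while the surrounding helper, its name, and what it does with the quotes are assumptions made for illustration.

```javascript
// Illustrative sketch only; unquote is a hypothetical helper.
function unquote(v) {
  // Compare character codes instead of one-character strings.
  var q = v.charCodeAt(0);  // NaN for an empty string, so no branch is taken
  if (q === 0x22 || q === 0x27) {  // 0x22 is '"', 0x27 is "'"
    var last = v.length - 1;
    // The closing quote may be missing if the input ended early.
    var endsQuoted = last > 0 && v.charCodeAt(last) === q;
    return v.substring(1, endsQuoted ? last : v.length);
  }
  return v;
}
```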

felix8a

new snapshot, rebased from trunk. no significant changes, just fixing some conflicts with the html-css-sanitizer ...

14 years, 3 months ago (2012-03-20 00:14:43 UTC) #6

MikeSamuel

This is nice work. Comments inline. http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer-exp.js File src/com/google/caja/plugin/html-sanitizer-exp.js (right): http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer-exp.js#newcode6 src/com/google/caja/plugin/html-sanitizer-exp.js:6: */ What do ...

14 years, 3 months ago (2012-03-20 17:26:36 UTC) #7

felix8a

14 years, 3 months ago (2012-03-20 20:16:08 UTC) #8

updated snapshot

http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/h...
File src/com/google/caja/plugin/html-sanitizer-exp.js (right):

http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/h...
src/com/google/caja/plugin/html-sanitizer-exp.js:6: */
On 2012/03/20 17:26:36, MikeSamuel wrote:
> What do you mean by "this file defines"?  Are html objects defined in other
> files, like html-sanitizer-legacy, observably different?

html-sanitizer-test.html says

<script type="text/javascript" src="html-sanitizer-legacy.js"></script>
<script>var html0 = html; html = void 0;</script>

basically imitation module import.

clarified the comment to: "If this file sets the global 'html' to a value
similar to that in html-sanitizer.js"

http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/h...
File src/com/google/caja/plugin/html-sanitizer.js (right):

http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/h...
src/com/google/caja/plugin/html-sanitizer.js:217: '(\')[^\']*(\'|$)' +    // 6,
7 = Single-quoted string
On 2012/03/20 17:26:36, MikeSamuel wrote:
> Do we get any benefit from doing ([\"\'])[\s\S]*?(\4|$) and avoid having two
> sets of quote groups?

I'm wary of backreferences because some regexp engines don't handle them well. 
I'll add a TODO to look into this.

http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/h...
src/com/google/caja/plugin/html-sanitizer.js:221: '(?=[a-z][a-z-]*\\s*=)' +
On 2012/03/20 17:26:36, MikeSamuel wrote:
> Not necessarily relevant to this CL, but if it's easier to drop this case, I
> think we can.  I think HTML5 actually settled on treating
> 
>    <a a= b=c>
> 
> as equivalent to
> 
>     <a a= "b=c">
> 
> http://www.w3.org/TR/html5/tokenization.html#before-attribute-value-state and
> http://www.w3.org/TR/html5/tokenization.html#attribute-value-unquoted-state
> indicate that '=' inside an unquoted value is an error state, but the procedure
> to follow when you don't fail fast is to treat the '=' and following content as
> attribute value content.

ok, adding a TODO

http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/h...
src/com/google/caja/plugin/html-sanitizer.js:230: var ENTITY_RE =
/^(#[0-9]+|#[x][0-9a-f]+|\w+);/i;
On 2012/03/20 17:26:36, MikeSamuel wrote:
> [x] -> x

Done.

http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/h...
src/com/google/caja/plugin/html-sanitizer.js:233: var splitWillCapture =
('a,b'.split(/(,)/).length === 3);
On 2012/03/20 17:26:36, MikeSamuel wrote:
> Ok.  Does it matter whether ','.split(/,/).length == 2 or 0?

no, I don't care whether the result has empty strings or not, I just care whether
groups capture or not.  empty strings get ignored in the switch statement in
parse()
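A sketch of the feature test together with one way a fallback could be written for engines where split() drops captured groups. The feature test is the line quoted above; splitByExec and splitKeepingSeparators are hypothetical names, and the patch may handle this differently.

```javascript
// Feature test from the quoted line: true when split() keeps captures.
var splitWillCapture = ('a,b'.split(/(,)/).length === 3);

// Hypothetical fallback built on exec() and lastIndex.
// re must have the global flag, and is assumed to have a single capture
// group that wraps the whole match.
function splitByExec(str, re) {
  var parts = [];
  var lastPos = 0;
  var m;
  re.lastIndex = 0;
  while ((m = re.exec(str))) {
    parts.push(str.substring(lastPos, m.index));  // text before the separator
    parts.push(m[0]);                             // the separator itself
    lastPos = m.index + m[0].length;
    // Guard against an infinite loop if re ever matches the empty string.
    if (m[0].length === 0) { re.lastIndex++; }
  }
  parts.push(str.substring(lastPos));
  return parts;
}

function splitKeepingSeparators(str, re) {
  return splitWillCapture ? str.split(re) : splitByExec(str, re);
}
```

Like a capturing split, the fallback emits empty strings at the edges and between adjacent separators, so a consumer that ignores empty strings sees the same tokens either way.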

http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/h...
src/com/google/caja/plugin/html-sanitizer.js:316: if (m =
/^(\w+)[^\'\"]*/.exec(next)) {
On 2012/03/20 17:26:36, MikeSamuel wrote:
> No $ required here or at 333?

There are three cases that need to be handled after we see a '<' or a '</':
1. we don't have a valid tag name -> emit pcdata
2. we do have a valid tag name, and there are no quotes -> fast handling of
simple tags
3. we do have a valid tag name, and there are quotes -> slow handling of tags
with attributes.

The non-anchored regexp lets me distinguish the three cases with just one regexp
test.  An anchored regexp would require adding a second regexp test.
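One plausible reading of how a single exec can separate the three cases, as a sketch. The regexp is the one quoted above; the function name, the returned labels, and the length comparison are assumptions about the mechanism, not the patch's actual code.

```javascript
// Illustrative sketch only; classifyTagToken is a hypothetical name.
// 'next' is the token that follows a '<' or '</'.
function classifyTagToken(next) {
  var m = /^(\w+)[^\'\"]*/.exec(next);
  if (!m) {
    return 'pcdata';   // case 1: no valid tag name
  }
  if (m[0].length === next.length) {
    return 'simple';   // case 2: match ran to the end, so no quotes
  }
  return 'quoted';     // case 3: match stopped early at a quote
}
```

Because the pattern has no trailing `$`, it still succeeds when a quote is present, and the match length then indicates whether the slow attribute-parsing path is needed.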

http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/h...
src/com/google/caja/plugin/html-sanitizer.js:325: // slow case, need to parse
attributes
On 2012/03/20 17:26:36, MikeSamuel wrote:
> Don't need to parse attributes on an end tag.

It's possible for someone to write </p foo="a>">, and if we don't parse end-tag
attributes here, the result would be sanitized differently.  This might be a
case that we don't care about, I'll add a TODO.

MikeSamuel

LGTM http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer.js File src/com/google/caja/plugin/html-sanitizer.js (right): http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer.js#newcode217 src/com/google/caja/plugin/html-sanitizer.js:217: '(\')[^\']*(\'|$)' + // 6, 7 = Single-quoted string ...

14 years, 3 months ago (2012-03-22 23:11:39 UTC) #10

@r4823