Rietveld Code Review Tool
Help | Bug tracker | Discussion group | Source code | Sign in
(95)

Issue 4559048: faster html-sanitizer.js (Closed)

Can't Edit
Can't Publish+Mail
Start Review
Created:
15 years, 1 month ago by felix8a
Modified:
14 years, 3 months ago
Reviewers:
MikeSamuel, Jasvir, metaweta
CC:
google-caja-discuss_googlegroups.com
Base URL:
http://google-caja.googlecode.com/svn/trunk/
Visibility:
Public.

Description

html-sanitizer is pathologically slow on IE<=8 and Firefox 3.6. This change is a rewrite of the parser that address that. For example, sanitizing the html source of http://code.google.com/p/google-caja/issues/list takes this much time on my computers (msec): old new 25000 580 ie6 22000 550 ie7 20000 350 ie8 4400 100 ff3.6 Most of the slowness is due to this statement: htmlText = htmlText.substring(m[0].length); The old parser is a typical parsing loop: identify the next token at the front of htmlText, process it, then remove it. However, this peel-off-the-head loop is pathologically slow on the above browsers, which I'm guessing is because the substring operation above will copy the string tail to a new block of memory, rather than re-using the same memory. The new parser avoids that by first splitting the input into potentially meaningful tokens (eg, '<' '<!--' '&' are tokens). And then it recombines these tokens when they happen to be quoted or otherwise not meaningful. On modern browsers (Chrome-18, Firefox-11, IE-9, Opera-11, Safari-5.1), the sanitization time for the above example is pretty similar for both algorithms, about 30-50msec on my computers. This change also adds some machinery for performance testing. Some of the test inputs trigger pathological behavior for the old and new sanitizers in modern browsers. The two sanitizers have different pathological cases, so the new one is sometimes slower, but the trouble spots are uncommon patterns like <p title=">>>>[repeated]...">[repeated]... the runtimes are small (eg, old=5msec new=50msec), and I think the new algorithm is O(n) for all types of inputs. The old algorithm is quadratic for some pathological inputs. (On IE9, one example is old=.09msec new=.06msec, the same example repeated 2000 times is old=1300msec new=4.5msec) Since this is a nontrivial change, I've also added more testing, especially regression testing. There's one incompatible behavior change. The public function makeSaxParser will create a SAX-style parser that calls handlers like startTag, endTag, cdata, etc. The old and new parsers chunk cdata differently. For example, the input x&amp;y will be three cdata events in the old parser, but two in the new. I think this difference doesn't matter. I did a code search for uses of makeSaxParser and found only 2 outside of html-sanitizer. Neither of them cares how cdata is chunked. We don't specify the behavior of makeSaxParser, so this change is not violating any explicit contract. Also, the Java SAX specification explicitly says that character events are not necessarily chunked the same way all the time, so it seems unlikely we're violating any implicit contracts.

Patch Set 1 #

Total comments: 18

Patch Set 2 : faster html-sanitizer.js #

Patch Set 3 : faster html-sanitizer.js #

Total comments: 19

Patch Set 4 : faster html-sanitizer.js #

Unified diffs Side-by-side diffs Delta from patch set Stats (+1988 lines, -371 lines) Patch
M build.xml View 1 2 1 chunk +2 lines, -0 lines 0 comments Download
M src/com/google/caja/plugin/html-sanitizer.js View 1 2 3 4 chunks +333 lines, -149 lines 0 comments Download
A src/com/google/caja/plugin/html-sanitizer-exp.js View 1 2 3 1 chunk +8 lines, -0 lines 0 comments Download
A src/com/google/caja/plugin/html-sanitizer-legacy.js View 1 1 chunk +632 lines, -0 lines 0 comments Download
M tests/com/google/caja/plugin/JsHtmlSanitizerTest.java View 1 2 1 chunk +11 lines, -0 lines 0 comments Download
M tests/com/google/caja/plugin/css-stylesheet-test.html View 1 2 1 chunk +1 line, -1 line 0 comments Download
M tests/com/google/caja/plugin/html-css-sanitizer-test.html View 1 2 1 chunk +1 line, -1 line 0 comments Download
M tests/com/google/caja/plugin/html-css-sanitizer-test.js View 1 2 1 chunk +7 lines, -7 lines 0 comments Download
A tests/com/google/caja/plugin/html-sanitizer-bench.css View 1 1 chunk +32 lines, -0 lines 0 comments Download
A tests/com/google/caja/plugin/html-sanitizer-bench.html View 1 1 chunk +67 lines, -0 lines 0 comments Download
A tests/com/google/caja/plugin/html-sanitizer-bench.js View 1 1 chunk +227 lines, -0 lines 0 comments Download
A tests/com/google/caja/plugin/html-sanitizer-data.js View 1 1 chunk +306 lines, -0 lines 0 comments Download
A tests/com/google/caja/plugin/html-sanitizer-legacy-test.html View 1 1 chunk +49 lines, -0 lines 0 comments Download
A tests/com/google/caja/plugin/html-sanitizer-regress.html View 1 1 chunk +43 lines, -0 lines 0 comments Download
A tests/com/google/caja/plugin/html-sanitizer-regress.js View 1 1 chunk +67 lines, -0 lines 0 comments Download
A tests/com/google/caja/plugin/html-sanitizer-samples.js View 1 1 chunk +2 lines, -0 lines 0 comments Download
M tests/com/google/caja/plugin/html-sanitizer-test.html View 1 2 1 chunk +14 lines, -19 lines 0 comments Download
M tests/com/google/caja/plugin/html-sanitizer-test.js View 1 2 3 chunks +186 lines, -194 lines 0 comments Download

Messages

Total messages: 11
felix8a
15 years, 1 month ago (2011-05-26 23:31:38 UTC) #1
felix8a
forgot to mention, I did try the m//g and lastIndex approach, and it turns out ...
15 years, 1 month ago (2011-05-26 23:39:30 UTC) #2
MikeSamuel
http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer-exp.js File src/com/google/caja/plugin/html-sanitizer-exp.js (right): http://codereview.appspot.com/4559048/diff/1/src/com/google/caja/plugin/html-sanitizer-exp.js#newcode1 src/com/google/caja/plugin/html-sanitizer-exp.js:1: // stub for experimenting with changes to html-sanitizer.js This ...
15 years, 1 month ago (2011-05-27 20:22:28 UTC) #3
felix8a
I'm going to upload a new patch momentarily, rebased from trunk, with changes from the ...
14 years, 3 months ago (2012-03-13 16:58:20 UTC) #4
felix8a
ping
14 years, 3 months ago (2012-03-15 15:15:00 UTC) #5
felix8a
new snapshot, rebased from trunk. no significant changes, just fixing some conflicts with the html-css-sanitizer ...
14 years, 3 months ago (2012-03-20 00:14:43 UTC) #6
MikeSamuel
This is nice work. Comments inline. http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer-exp.js File src/com/google/caja/plugin/html-sanitizer-exp.js (right): http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer-exp.js#newcode6 src/com/google/caja/plugin/html-sanitizer-exp.js:6: */ What do ...
14 years, 3 months ago (2012-03-20 17:26:36 UTC) #7
felix8a
updated snapshot http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer-exp.js File src/com/google/caja/plugin/html-sanitizer-exp.js (right): http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer-exp.js#newcode6 src/com/google/caja/plugin/html-sanitizer-exp.js:6: */ On 2012/03/20 17:26:36, MikeSamuel wrote: > ...
14 years, 3 months ago (2012-03-20 20:16:08 UTC) #8
felix8a
ping?
14 years, 3 months ago (2012-03-22 15:07:50 UTC) #9
MikeSamuel
LGTM http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer.js File src/com/google/caja/plugin/html-sanitizer.js (right): http://codereview.appspot.com/4559048/diff/11002/src/com/google/caja/plugin/html-sanitizer.js#newcode217 src/com/google/caja/plugin/html-sanitizer.js:217: '(\')[^\']*(\'|$)' + // 6, 7 = Single-quoted string ...
14 years, 3 months ago (2012-03-22 23:11:39 UTC) #10
felix8a
14 years, 3 months ago (2012-03-22 23:54:52 UTC) #11
@r4823
Sign in to reply to this message.

Powered by Google App Engine
RSS Feeds Recent Issues | This issue
This is Rietveld f62528b