Descriptionhtml-sanitizer is pathologically slow on IE<=8 and Firefox 3.6.
This change is a rewrite of the parser that address that.
For example, sanitizing the html source of
http://code.google.com/p/google-caja/issues/list
takes this much time on my computers (msec):
old new
25000 580 ie6
22000 550 ie7
20000 350 ie8
4400 100 ff3.6
Most of the slowness is due to this statement:
htmlText = htmlText.substring(m[0].length);
The old parser is a typical parsing loop: identify the next token
at the front of htmlText, process it, then remove it. However,
this peel-off-the-head loop is pathologically slow on the above
browsers, which I'm guessing is because the substring operation
above will copy the string tail to a new block of memory, rather
than re-using the same memory.
The new parser avoids that by first splitting the input into
potentially meaningful tokens (eg, '<' '<!--' '&' are tokens).
And then it recombines these tokens when they happen to be quoted
or otherwise not meaningful.
On modern browsers (Chrome-18, Firefox-11, IE-9, Opera-11,
Safari-5.1), the sanitization time for the above example is
pretty similar for both algorithms, about 30-50msec on my
computers.
This change also adds some machinery for performance testing.
Some of the test inputs trigger pathological behavior for the old
and new sanitizers in modern browsers. The two sanitizers have
different pathological cases, so the new one is sometimes slower,
but the trouble spots are uncommon patterns like
<p title=">>>>[repeated]...">[repeated]...
the runtimes are small (eg, old=5msec new=50msec), and I think
the new algorithm is O(n) for all types of inputs. The old
algorithm is quadratic for some pathological inputs. (On IE9,
one example is old=.09msec new=.06msec, the same example repeated
2000 times is old=1300msec new=4.5msec)
Since this is a nontrivial change, I've also added more testing,
especially regression testing.
There's one incompatible behavior change. The public function
makeSaxParser will create a SAX-style parser that calls handlers
like startTag, endTag, cdata, etc.
The old and new parsers chunk cdata differently. For example,
the input
x&y
will be three cdata events in the old parser, but two in the new.
I think this difference doesn't matter. I did a code search for
uses of makeSaxParser and found only 2 outside of html-sanitizer.
Neither of them cares how cdata is chunked. We don't specify the
behavior of makeSaxParser, so this change is not violating any
explicit contract. Also, the Java SAX specification explicitly
says that character events are not necessarily chunked the same
way all the time, so it seems unlikely we're violating any
implicit contracts.
Patch Set 1 #
Total comments: 18
Patch Set 2 : faster html-sanitizer.js #Patch Set 3 : faster html-sanitizer.js #
Total comments: 19
Patch Set 4 : faster html-sanitizer.js #MessagesTotal messages: 11
|