Rietveld Code Review Tool
Help | Bug tracker | Discussion group | Source code | Sign in
(601)

Issue 157161: Caja-based HtmlParser and Parser Overhaul (Closed)

Can't Edit
Can't Publish+Mail
Start Review
Created:
15 years, 9 months ago by johnfargo
Modified:
15 years, 9 months ago
Reviewers:
Paul Lindner, gilles.devaux, shindig.remailer, etnu
Base URL:
http://svn.apache.org/repos/asf/incubator/shindig/trunk/
Visibility:
Public.

Description

Summary: * Overhauls the GadgetHtmlParser base class and associated test cases * Tweaks the Neko-based HTML parser implementation * Introduces new Caja-based HTML parser This fairly substantial CL reworks the HTML parsing system to better represent (though not fully yet) the way that HTML is handled within gadgets: as tag soup, cleaned up via custom rules after the fact into a legitimate, well-formed document. It's a step toward treating concrete GadgetHtmlParser implementations purely as fragment parsers. Change detail: * All parsing tests factored into base test classes with concrete tests largely just providing a concrete parser implementation. - HTML-equivalence method added utilizing the (fantastic) diff_match_patch library, which ignores whitespace, case, and attributing-encoding differences. * GadgetHtmlParser now does significant cleanup of the DOM it retrieves from parseDomImpl(...), which BTW will soon go the way of the dodo in favor of always using parseFragmentImpl(...) - Creates head element and populates it with all style elements (only), as putting these here cannot break rendering and because HTML requires <style> in head. - Creates body element as well. - Combines multiple <head> elements together, if present. - Prepends head with elements that occurred above a <head> element that occurred in source, if any. - Combines multiple <body> elements together, if present. - Prepends and appends, respectively, elements found before and after the first <body> tag and after the first <head> tag, and elements found after the first <body> tag, without any <head> or <body> parent, to the <body> tag (that was a mouthful). - As noted above, stuffs all <style> elements found in <body> at the end of <head> - If OpenSocial-type <script> elements are treated per spec (ie. having only text, no children), reprocesses this text as HTML and adds as children for template processing. * Introduces CajaHtmlParser - Still has parseDomImpl method, mostly for API compatibility (short-term) with Neko-based HtmlParser implementation, which has subtle differences btw parseDomImpl and parseFragmentImpl which I want to clean up in a follow-up CL (again, obviating the need for parseDomImpl altogether). - Delegates to Caja's DomParser class's parseFragment() method for most parsing needs - Depends on: http://codereview.appspot.com/157089/show to retain comments and fully-formed documents (working w/ Caja folks to get this committed) This CL is NOT fully complete as yet, but I'm sending it out for a look by the community. In addition to the Caja CL dependency, the CL implicitly makes Shindig require Java 1.6 b/c of its use of LinkedList.pop() (Deque interface) and diff_match_patch, whose Maven JAR is built against 1.6. I'll fix both of these, unless a move to 1.6 is deemed reasonable by all.

Patch Set 1 #

Patch Set 2 : Syncing changes to head for proper diffing. #

Patch Set 3 : Reverting default parser to Neko, for the time being; and removing debugging code. #

Patch Set 4 : Minor spacing fixes (self-review). #

Patch Set 5 : Final patch set. Syncs and compiles cleanly against build@head. #

Total comments: 5
Unified diffs Side-by-side diffs Delta from patch set Stats (+1135 lines, -470 lines) Patch
features/src/main/javascript/features/caja/feature.xml View 1 2 3 4 1 chunk +1 line, -1 line 0 comments Download
java/gadgets/pom.xml View 1 2 3 4 1 chunk +4 lines, -0 lines 1 comment Download
java/gadgets/src/main/java/org/apache/shindig/gadgets/parse/GadgetHtmlParser.java View 1 2 3 4 6 chunks +156 lines, -48 lines 4 comments Download
java/gadgets/src/main/java/org/apache/shindig/gadgets/parse/caja/CajaHtmlParser.java View 1 2 3 4 1 chunk +136 lines, -0 lines 0 comments Download
java/gadgets/src/main/java/org/apache/shindig/gadgets/parse/nekohtml/NekoSimplifiedHtmlParser.java View 1 2 3 4 1 chunk +1 line, -2 lines 0 comments Download
java/gadgets/src/main/java/org/apache/shindig/gadgets/servlet/CajaContentRewriter.java View 1 2 3 4 1 chunk +0 lines, -1 line 0 comments Download
java/gadgets/src/test/java/org/apache/shindig/gadgets/parse/AbstractParserAndSerializerTest.java View 1 2 3 4 1 chunk +58 lines, -18 lines 0 comments Download
java/gadgets/src/test/java/org/apache/shindig/gadgets/parse/AbstractParsingTestBase.java View 1 2 3 4 1 chunk +114 lines, -0 lines 0 comments Download
java/gadgets/src/test/java/org/apache/shindig/gadgets/parse/AbstractSocialMarkupHtmlParserTest.java View 1 2 3 1 chunk +198 lines, -0 lines 0 comments Download
java/gadgets/src/test/java/org/apache/shindig/gadgets/parse/CompactHtmlSerializerTest.java View 1 2 3 4 2 chunks +20 lines, -12 lines 0 comments Download
java/gadgets/src/test/java/org/apache/shindig/gadgets/parse/caja/CajaCompactHtmlSerializerTest.java View 1 chunk +32 lines, -0 lines 0 comments Download
java/gadgets/src/test/java/org/apache/shindig/gadgets/parse/caja/CajaParserAndSerializerTest.java View 1 chunk +32 lines, -0 lines 0 comments Download
java/gadgets/src/test/java/org/apache/shindig/gadgets/parse/caja/CajaSocialMarkupHtmlParserTest.java View 1 chunk +32 lines, -0 lines 0 comments Download
java/gadgets/src/test/java/org/apache/shindig/gadgets/parse/nekohtml/NekoCompactHtmlSerializerTest.java View 1 chunk +36 lines, -0 lines 0 comments Download
java/gadgets/src/test/java/org/apache/shindig/gadgets/parse/nekohtml/NekoParserAndSerializeTest.java View 1 2 3 4 1 chunk +39 lines, -39 lines 0 comments Download
java/gadgets/src/test/java/org/apache/shindig/gadgets/parse/nekohtml/SocialMarkupHtmlParserTest.java View 1 2 3 4 1 chunk +5 lines, -128 lines 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/nekohtml/test.html View 1 2 3 4 1 chunk +0 lines, -26 lines 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/nekohtml/test-expected.html View 1 2 3 4 1 chunk +0 lines, -23 lines 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/nekohtml/test-fragment2.html View 1 2 3 4 1 chunk +0 lines, -2 lines 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/nekohtml/test-fragment2-expected.html View 1 2 3 4 1 chunk +0 lines, -2 lines 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/nekohtml/test-fulldocnodoctype-expected.html View 1 chunk +8 lines, -0 lines 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/nekohtml/test-leadingscript-expected.html View 1 2 3 4 1 chunk +5 lines, -1 line 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/nekohtml/test-socialmarkup.html View 1 2 3 4 1 chunk +0 lines, -19 lines 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/nekohtml/test-with-ampersands.html View 1 2 3 4 1 chunk +0 lines, -11 lines 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/nekohtml/test-with-ampersands-expected.html View 1 2 3 4 1 chunk +0 lines, -8 lines 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/nekohtml/test-with-iecond-comments.html View 1 2 3 4 1 chunk +0 lines, -30 lines 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/nekohtml/test-with-iecond-comments-expected.html View 1 2 3 4 1 chunk +0 lines, -4 lines 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/nekohtml/test-with-specialtags.html View 1 2 3 4 1 chunk +0 lines, -61 lines 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/nekohtml/test-with-specialtags-expected.html View 1 2 3 4 1 chunk +0 lines, -33 lines 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/test.html View 1 chunk +26 lines, -0 lines 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/test-expected.html View 1 chunk +23 lines, -0 lines 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/test-fragment.html View 1 chunk +2 lines, -0 lines 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/test-fragment-expected.html View 1 chunk +2 lines, -0 lines 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/test-fragment2.html View 1 chunk +2 lines, -0 lines 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/test-fragment2-expected.html View 1 chunk +2 lines, -0 lines 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/test-fulldocnodoctype.html View 1 chunk +7 lines, -0 lines 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/test-fulldocnodoctype-expected.html View 1 chunk +7 lines, -0 lines 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/test-headnobody.html View 1 chunk +5 lines, -0 lines 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/test-headnobody-expected.html View 1 chunk +5 lines, -0 lines 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/test-socialmarkup.html View 1 chunk +19 lines, -0 lines 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/test-with-ampersands.html View 1 chunk +11 lines, -0 lines 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/test-with-ampersands-expected.html View 1 chunk +8 lines, -0 lines 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/test-with-iecond-comments.html View 1 chunk +30 lines, -0 lines 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/test-with-iecond-comments-expected.html View 1 chunk +4 lines, -0 lines 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/test-with-specialtags.html View 1 chunk +61 lines, -0 lines 0 comments Download
java/gadgets/src/test/resources/org/apache/shindig/gadgets/parse/test-with-specialtags-expected.html View 1 chunk +33 lines, -0 lines 0 comments Download
pom.xml View 1 2 3 4 2 chunks +11 lines, -1 line 0 comments Download

Messages

Total messages: 16
johnfargo
15 years, 9 months ago (2009-11-24 23:51:21 UTC) #1
johnfargo
Syncing changes to head for proper diffing.
15 years, 9 months ago (2009-11-25 00:01:25 UTC) #2
johnfargo
Reverting default parser to Neko, for the time being; and removing debugging code.
15 years, 9 months ago (2009-11-25 00:06:36 UTC) #3
johnfargo
Minor spacing fixes (self-review).
15 years, 9 months ago (2009-11-25 00:16:28 UTC) #4
johnfargo
Updated: Caja piece is committed, pom.xml update reflects same. JDK 6->5 tweaks needed still.
15 years, 9 months ago (2009-12-04 06:29:39 UTC) #5
johnfargo
Final patch set. Syncs and compiles cleanly against build@head.
15 years, 9 months ago (2009-12-04 22:43:36 UTC) #6
johnfargo
Does anyone have thoughts on this CL? It's been sitting for a week and a ...
15 years, 9 months ago (2009-12-04 23:05:09 UTC) #7
Paul Lindner
Didn't have time to deeply look at this. Only obvious thing I noted is that ...
15 years, 9 months ago (2009-12-07 08:25:12 UTC) #8
etnu
http://codereview.appspot.com/157161/diff/3092/2082 File java/gadgets/src/main/java/org/apache/shindig/gadgets/parse/GadgetHtmlParser.java (right): http://codereview.appspot.com/157161/diff/3092/2082#newcode175 java/gadgets/src/main/java/org/apache/shindig/gadgets/parse/GadgetHtmlParser.java:175: styleNodes.add(bodyKid); Why not put all of these into a ...
15 years, 9 months ago (2009-12-07 18:25:41 UTC) #9
fargo
Thanks Paul, change made and tested. Anyone else have commentary? I'd love some input, even ...
15 years, 9 months ago (2009-12-10 05:46:59 UTC) #10
gilles.devaux
I'm all for it, I've been having problems with Caja and Shindig since now Caja ...
15 years, 9 months ago (2009-12-11 01:48:12 UTC) #11
johnfargo
Hey Kevin - thanks for the look http://codereview.appspot.com/157161/diff/3092/2082 File java/gadgets/src/main/java/org/apache/shindig/gadgets/parse/GadgetHtmlParser.java (right): http://codereview.appspot.com/157161/diff/3092/2082#newcode175 java/gadgets/src/main/java/org/apache/shindig/gadgets/parse/GadgetHtmlParser.java:175: styleNodes.add(bodyKid); Mostly ...
15 years, 9 months ago (2009-12-11 23:37:43 UTC) #12
johnfargo
Hey all, I'm planning to commit this code -- let me know if you have ...
15 years, 9 months ago (2009-12-17 00:33:47 UTC) #13
Paul Lindner
go ahead, hopefully the junit 4 cleanups don't collide with what you did... On Wed, ...
15 years, 9 months ago (2009-12-17 00:41:54 UTC) #14
fargo
They did :) But I just kept all my own changes, since mine converted all ...
15 years, 9 months ago (2009-12-17 00:46:15 UTC) #15
fargo
15 years, 9 months ago (2009-12-17 00:54:38 UTC) #16
CL committed. Follow-up comments welcome.

On Wed, Dec 16, 2009 at 4:46 PM, John Hjelmstad <fargo@google.com> wrote:

> They did :) But I just kept all my own changes, since mine converted all
> tests to junit4 also.
>
> -John
>
>
> On Wed, Dec 16, 2009 at 4:41 PM, Paul Lindner <lindner@inuus.com> wrote:
>
>> go ahead, hopefully the junit 4 cleanups don't collide with what you
>> did...
>>
>>
>> On Wed, Dec 16, 2009 at 4:33 PM, John Hjelmstad <johnfargo@gmail.com>wrote:
>>
>>> Hey all,
>>>
>>> I'm planning to commit this code -- let me know if you have any
>>> misgivings; I'll revert if issues are found.
>>>
>>> Thanks,
>>> John
>>>
>>> On Wed, Dec 9, 2009 at 9:46 PM, John Hjelmstad <fargo@google.com> wrote:
>>>
>>>> Thanks Paul, change made and tested. Anyone else have commentary? I'd
>>>> love some input, even if only on GadgetHtmlParser.java, where the bulk of
>>>> Neko-related potential side effects (given that Neko is still marked as
>>>> default parser) are introduced.
>>>>
>>>> --j
>>>>
>>>>
>>>> On Mon, Dec 7, 2009 at 12:25 AM, <lindner@inuus.com> wrote:
>>>>
>>>>> Didn't have time to deeply look at this.  Only obvious thing I noted is
>>>>> that the diff lib should be test scope.
>>>>>
>>>>>
>>>>>
>>>>> http://codereview.appspot.com/157161/diff/3092/2084
>>>>> File java/gadgets/pom.xml (right):
>>>>>
>>>>> http://codereview.appspot.com/157161/diff/3092/2084#newcode133
>>>>> java/gadgets/pom.xml:133: <artifactId>diff_match_patch</artifactId>
>>>>> This should be <scope>test</scope> so we don't include this in the
>>>>> deployed artifacts.
>>>>>
>>>>>
>>>>> http://codereview.appspot.com/157161
>>>>>
>>>>
>>>>
>>>
>>
>
Sign in to reply to this message.

Powered by Google App Engine
RSS Feeds Recent Issues | This issue
This is Rietveld f62528b