RE: Html-Tidy BUG ???

Hi Gary,

Thanks for this background.  Clearly, you have given the subject a great
deal of careful thought.  The SAX-based approach makes a great deal of
sense.

DOM compatibility is an issue for the C version of Tidy as well.  You may
have noticed the discussions on this list.   For much the same reasons,
these discussions seemed to gravitate to the SAX adapter method as well.

I think the tracking issue will go both ways.  I think we should look to
JTidy for direction on the library interface and implementation.  Ditto for
the SAX adapter, of course.

take it easy,
Charlie


-----Original Message-----
From: Gary L Peskin [mailto:garyp@firstech.com]
Sent: Thursday, June 14, 2001 7:55 PM
To: html-tidy@w3.org
Subject: Re: Html-Tidy BUG ???


We have several issues here.  The first is the DOM classes that are shipped
with JTidy itself.  With the exception of the DOMException class, everything
in the DOM package is an interface.  There is no problem at all with
deleting these classes and replacing them with the official W3C DOM classes
and, in fact, I'll do that shortly.

The next problem that we have is developing implementations for each of
those interfaces so that we can implement DOM support.  JTidy is a
faithful port of the released version of HTML Tidy (more on this later).
So, the JTidy parse tree mirrors the C Tidy parse tree.  This tree is NOT a
DOM tree but is a specialized HTML tree which suits Tidy's purposes.

In order to maintain compatibility with C Tidy and make it easy to retrofit
maintenance and enhancements from the C version to the Java version, we have
left the tree alone.  The parse tree is a central data structure and
monkeying with it would generate a lot of porting issues.  So, instead, what
Andy did was create a peer node, called an Adapter, when needed.  The idea
is that when we need to represent something in the DOM, we create a DOM node
which is basically a thin wrapper on the corresponding Tidy node but which
implements the DOM methods and takes into account the differences between
the Tidy node tree structure and the DOM node tree structure.  The Adapter
node contains a reference to the Tidy node and vice versa, and the DOM nodes
are only created as needed, so there is no overhead if you're not using the
DOM support.
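
To make the peer-node idea concrete, here is a minimal sketch in Java.  It
is only an illustration: the names AdapterSketch, TidyNode and NodeAdapter,
and all of their fields and methods, are hypothetical stand-ins and not the
actual JTidy classes, and the adapter exposes only a few DOM-like methods
instead of implementing the full org.w3c.dom.Node interface.

```java
// Hypothetical sketch of the lazy peer-node (Adapter) idea; not JTidy's real code.
class AdapterSketch {

    // Stand-in for a node of the Tidy parse tree (parent / first child / next sibling).
    static class TidyNode {
        final String element;        // tag name, e.g. "a"
        TidyNode parent;
        TidyNode content;            // first child
        TidyNode next;               // next sibling
        private NodeAdapter adapter; // DOM-side peer, created only on demand

        TidyNode(String element) { this.element = element; }

        // Append a child at the end of this node's child list.
        void addChild(TidyNode child) {
            child.parent = this;
            if (content == null) { content = child; return; }
            TidyNode last = content;
            while (last.next != null) last = last.next;
            last.next = child;
        }

        // Create the peer the first time it is asked for, then reuse it.
        NodeAdapter getAdapter() {
            if (adapter == null) adapter = new NodeAdapter(this);
            return adapter;
        }

        boolean hasAdapter() { return adapter != null; }
    }

    // Thin wrapper that answers DOM-style questions by consulting the Tidy node.
    static class NodeAdapter {
        private final TidyNode tidy; // back-reference to the wrapped Tidy node

        NodeAdapter(TidyNode tidy) { this.tidy = tidy; }

        String getNodeName() { return tidy.element; }

        NodeAdapter getParentNode() {
            return tidy.parent == null ? null : tidy.parent.getAdapter();
        }

        NodeAdapter getFirstChild() {
            return tidy.content == null ? null : tidy.content.getAdapter();
        }

        NodeAdapter getNextSibling() {
            return tidy.next == null ? null : tidy.next.getAdapter();
        }
    }

    public static void main(String[] args) {
        TidyNode body = new TidyNode("body");
        TidyNode link = new TidyNode("a");
        body.addChild(link);

        // No adapters exist until somebody asks for the DOM view.
        System.out.println("adapter before DOM access: " + link.hasAdapter());
        NodeAdapter first = body.getAdapter().getFirstChild();
        System.out.println("first child: " + first.getNodeName());
        System.out.println("adapter after DOM access: " + link.hasAdapter());
    }
}
```

The point of the sketch is that the Tidy tree itself is untouched: the only
addition to a Tidy node is the one reference to its peer, and that stays
null for anyone who never asks for the DOM view.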

This is how things were when Andy was unable to continue with the JTidy
development.  Along came DOM level 2 and more and more requests for DOM 1
features that were not implemented in the initial release.  In addition, we
had people using XalanJ1, for example, which needs a separate liaison class
to interface with each DOM model, so someone would have had to create a
TidyLiaison to support Xalan.

As a result of the increasing feature set and complexity of the DOM, I
suggested that it would be a good idea to just have JTidy implement the
SAX2 XMLReader interface so that it could throw off SAX2 events to a SAX2
ContentHandler.  Then the user could plug in Xerces or whatever other XML
parser implementation they wanted, provided that it supplied a
ContentHandler (which Xerces does), and build their own DOM tree.  They
would have a real DOM, and JTidy wouldn't have to worry about keeping up
with all of the DOM features.  As a bonus, you'd get SAX2 support as well.
This way, JDOM could be supported too, I believe, using their SAXBuilder.
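
As a rough illustration of the SAX route, the sketch below feeds SAX2
events into a ContentHandler that builds a standard DOM.  Two assumptions
to note: the emitEvents method is a hypothetical stand-in for whatever
would eventually fire the events (it is not JTidy code, and it skips the
full XMLReader interface), and the DOM-building ContentHandler used here
is the JAXP TransformerHandler with a DOMResult, which is just one of
several handlers that could be plugged in.

```java
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

class SaxToDomSketch {

    // Hypothetical event source: fires the SAX2 events for a tiny document.
    static void emitEvents(ContentHandler handler) throws SAXException {
        handler.startDocument();
        handler.startElement("", "html", "html", new AttributesImpl());
        handler.startElement("", "body", "body", new AttributesImpl());

        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "href", "href", "CDATA", "http://www.w3.org/");
        handler.startElement("", "a", "a", atts);
        char[] text = "W3C".toCharArray();
        handler.characters(text, 0, text.length);
        handler.endElement("", "a", "a");

        handler.endElement("", "body", "body");
        handler.endElement("", "html", "html");
        handler.endDocument();
    }

    // Let a standard ContentHandler turn the events into an org.w3c.dom.Document.
    static Document buildDom() throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler(); // identity
        DOMResult result = new DOMResult();
        handler.setResult(result);
        emitEvents(handler);
        return (Document) result.getNode();
    }

    public static void main(String[] args) throws Exception {
        Document doc = buildDom();
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

The resulting Document comes from whatever DOM implementation the JAXP
installation provides, so the event source never has to implement any DOM
interfaces itself.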
 
Down the road it would be nice for Tidy to implement JAXP as well but that's
another story.

I merrily started coding up the XMLReader support last December but have
been delayed for several reasons.  I am now almost in a position to get
back into it in a few more days and I hope to have it ready about two weeks
after that.

For now, the next best thing is to write out the XHTML output from Tidy and
then read it in using your favorite XML parser.  It's not a fantastic
solution but it does work.
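
A minimal sketch of that workaround in Java follows.  To keep it
self-contained it does not call Tidy at all: the TIDY_OUTPUT string is a
made-up stand-in for XHTML that Tidy has already written out, and the class
and method names are hypothetical.  The sample also leaves out the DOCTYPE
that real XHTML output typically carries; with a DOCTYPE present the XML
parser may try to fetch the DTD, so that would need separate handling.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class ReparseSketch {

    // Stand-in for the XHTML that Tidy would have written out (assumption).
    static final String TIDY_OUTPUT =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
          + "<head><title>Sample</title></head>"
          + "<body><p><a href=\"a.html\">A</a> <a href=\"b.html\">B</a></p></body>"
          + "</html>";

    // Read the saved XHTML back in with the platform's standard XML parser.
    static Document parse(String xhtml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        byte[] bytes = xhtml.getBytes(StandardCharsets.UTF_8);
        return builder.parse(new ByteArrayInputStream(bytes));
    }

    // Collect the href of every "a" element, using only standard DOM calls.
    static List<String> linkTargets(Document doc) {
        List<String> targets = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            targets.add(((Element) anchors.item(i)).getAttribute("href"));
        }
        return targets;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(TIDY_OUTPUT);
        System.out.println(linkTargets(doc));
    }
}
```

Because the Document now comes from the XML parser itself, it can be handed
to an XSLT processor without needing any Tidy-specific liaison class.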

In the meantime, I've followed with great interest the impressive activity
over on SourceForge and on this list as well with respect to the HTML Tidy
project.  Of course, we'd like to port over the improvements and changes to
HTML Tidy at some point.  I haven't seen any mention of a release schedule.
Have I just missed this discussion?  I'd rather wait until the HTML Tidy
folks get to a point where you're comfortable with the stability and feature
set and ready for a release rather than trying to port the changes over as
they occur and try to hit a moving target.

Sorry this post was so long but I didn't have time to make it shorter :)

Gary

"Reitzel, Charlie" wrote:
> 
> Out of dumb curiosity, can anyone familiar w/ JTidy internals tell us what
> the major impediments to W3C DOM compatibility are?
> 
> -----Original Message-----
> From: Valeri.Atamaniouk@nokia.com [mailto:Valeri.Atamaniouk@nokia.com]
> Sent: Thursday, June 14, 2001 10:45 AM
> To: holger.prause@detewe.de; html-tidy@w3.org
> Subject: RE: Html-Tidy BUG ???
> 
> Hello
> 
> The answer is fairly simple: tidy's DOM implementation is not compatible
> with the W3C recommendation.
> 
> BR
> VA
> 
> PS I think you should write a translator from tidy's implementation into
> the standard one (just copy the tree).
> 
> > -----Original Message-----
> > From: ext Holger Prause [mailto:holger.prause@detewe.de]
> > Sent: 12 June 2001 18:05
> > To: html-tidy@w3.org
> > Subject: Html-Tidy BUG ???
> >
> >
> > Hi
> >
> >
> > I am using JTidy (HTML Tidy) to get a DOM out of some HTML files, and
> > then I get all links (all elements with node name "a").  Now I want to
> > take this DOM and process it with XSLT.
> >
> > When I use the following code, I get the following exception:
> >
> > <pre>
> > XSLTProcessor processor = XSLTProcessorFactory.getProcessor();
> >         processor.process(new XSLTInputSource(doc),new
> > XSLTInputSource(new FileInputStream(xslPath)),
> >         new XSLTResultTarget(new FileOutputStream(outputFile)));
> > </pre>
> >
> >
> > XSL Error: Cannot use a DTMLiaison for a input DOM node... pass a
> > org.apache.xalan.xpath.xdom.XercesLiaison instead!
> >
> > XSL Error: SAX Exception
> >
> > org.apache.xalan.xslt.XSLProcessorException:
> >  at
> > org.apache.xalan.xslt.XSLTEngineImpl.error(XSLTEngineImpl.java:1799)
> >
> >  at
> > org.apache.xalan.xslt.XSLTEngineImpl.error(XSLTEngineImpl.java:1691)
> >
> >
> >  at
> > org.apache.xalan.xslt.XSLTEngineImpl.getSourceTreeFromInput(XSLTEngineImpl.java:919)
> >
> >  at
> > org.apache.xalan.xslt.XSLTEngineImpl.process(XSLTEngineImpl.java:643)
> >  at DOMToHtmlSerializer.serialize(DOMToHtmlSerializer.java:39)
> >  at HtmlLinkValidator.validate(HtmlLinkValidator.java:56)
> >  at Main.<init>(Main.java:44)
> >  at Main.main(Main.java:55)
> >
> >
> > OK, I thought, if it wants it that way, I'll pass a XercesLiaison:
> >
> > <pre>
> > XercesLiaison xl = new XercesLiaison();
> >         XSLTProcessor processor =
> > XSLTProcessorFactory.getProcessor(xl);
> >
> >         processor.process(new XSLTInputSource(doc),new
> > XSLTInputSource(new FileInputStream(xslPath)),
> >         new XSLTResultTarget(new FileOutputStream(outputFile)));
> > </pre>
> >
> > Then I get the following exception:
> > XSL Error: SAX Exception
> >
> > org.apache.xalan.xslt.XSLProcessorException: XercesLiaison can not
> > handle nodes of type class org.w3c.tidy.DOMDocumentImpl
> >  at
> > org.apache.xalan.xslt.XSLTEngineImpl.error(XSLTEngineImpl.java:1753)
> >
> >  at
> > org.apache.xalan.xslt.XSLTEngineImpl.error(XSLTEngineImpl.java:1717)
> >
> >  at
> > org.apache.xalan.xslt.XSLTEngineImpl.process(XSLTEngineImpl.java:746)
> >  at DOMToHtmlSerializer.serialize(DOMToHtmlSerializer.java:39)
> >  at HtmlLinkValidator.validate(HtmlLinkValidator.java:56)
> >  at Main.<init>(Main.java:44)
> >  at Main.main(Main.java:55)
> >
> >
> > "
> > org.apache.xalan.xslt.XSLProcessorException: XercesLiaison can not
> > handle nodes of type class org.w3c.tidy.DOMDocumentImpl             "
> >
> > Why is JTidy using its own DOMDocumentImpl (org.w3c.tidy.DOMDocumentImpl)
> > and not the DOMDocumentImpl from W3C (org.w3c.dom.DOMDocumentImpl)?
> >
> > This would have saved me a lot of time.
> >
> >
> >
> > Now what can I do?
> >
> > Solution 1: Write the Tidy DOM to disk, then reparse it with any
> > XML parser, and then process it.
> >
> > Solution 2: Write a wrapper which changes the Tidy DOM to a pure
> > org.w3c.dom.Document and then process it.
> >
> > Solution 3: Search for another tool that does this.
> >
> >
> > Hmm, can any of you, especially the developers of this tool /
> > library, tell me what to do?
> >
> >

Received on Friday, 15 June 2001 16:03:17 UTC