RE: Making HTML Tidy a supported library

Hi Richard,

I see your points.  I didn't think the DOM was designed to do
transformations or conversions itself, but was intended simply as an
interface to the tree.  How the tree gets built, well, that is a problem for
the implementation.

Fact is, Tidy *does* support some of the "save as" options you mention.  I'm
not sure we need to propagate the SGML/ESIS stuff any further.  But that's
just me.  I think that half the point of separating the "pretty printer"
from the parser would be that, once you had parsed and cleaned up an input
document, you can write any new transform/conversion routine you like.  Same
goes for the DOM, btw.  Thus, the Tidy-To-DOM converter might really be
useful, because other 3rd party libs can work w/ DOM instances and do
transforms, etc.  This would be a big win for TidyJ, I imagine.
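
To make the Tidy-To-DOM idea concrete, here is a rough sketch in Java.  It
assumes a made-up TidyNode class standing in for a node of the cleaned-up
Tidy tree (it is not the real Tidy or JTidy data structure) and copies that
tree into an org.w3c.dom.Document using the standard JAXP factory.  The
class and method names are mine, for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class TidyToDom {
    // Hypothetical stand-in for a node of an already cleaned-up Tidy tree.
    static final class TidyNode {
        final String name;  // element name, or null for a text node
        final String text;  // text content, or null for an element
        final Map<String, String> attrs = new LinkedHashMap<>();
        final List<TidyNode> children = new ArrayList<>();

        private TidyNode(String name, String text) {
            this.name = name;
            this.text = text;
        }
        static TidyNode element(String name) { return new TidyNode(name, null); }
        static TidyNode text(String text) { return new TidyNode(null, text); }
        TidyNode attr(String key, String value) { attrs.put(key, value); return this; }
        TidyNode add(TidyNode child) { children.add(child); return this; }
    }

    // Copy the tree rooted at 'root' into a fresh DOM document.
    static Document convert(TidyNode root) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            doc.appendChild(build(doc, root));
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Recursively create the DOM node matching one TidyNode.
    private static org.w3c.dom.Node build(Document doc, TidyNode n) {
        if (n.name == null) {
            return doc.createTextNode(n.text);
        }
        Element e = doc.createElement(n.name);
        for (Map.Entry<String, String> a : n.attrs.entrySet()) {
            e.setAttribute(a.getKey(), a.getValue());
        }
        for (TidyNode child : n.children) {
            e.appendChild(build(doc, child));
        }
        return e;
    }
}
```

Once the Document exists, any DOM-aware library can take over, which is the
whole attraction.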

About the SAX events, I'm not sure I follow exactly what you are saying.
You said, "If HTML information content could not be captured in XML, we
wouldn't have XHTML."  

XHTML *is* valid XML.  But not all HTML is XHTML.  So, sure, for XHTML (or
any other XML) you can fire SAX events.  If it's true that HTML is valid
SGML, you might be able to fire ESIS events from a clean Tidy tree.   I
think, as I write this, I am catching your point.  If you can convert a Tidy
tree to XML, you can fire the equivalent SAX events.  Is that what you have
in mind?
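
If that is the idea, a minimal sketch might look like the following.  The
Node class is hypothetical (a stand-in for a clean, well-nested Tidy tree,
not Tidy's actual structure), attributes are left out to keep it short, and
the trace() helper exists only to show the order in which the events fire.
The ContentHandler, DefaultHandler and AttributesImpl types are the regular
org.xml.sax ones.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class TidyToSax {
    // Hypothetical stand-in for a node of a clean, well-nested Tidy tree.
    static final class Node {
        final String name;  // element name, or null for a text node
        final String text;  // text content, or null for an element
        final List<Node> children = new ArrayList<>();

        private Node(String name, String text) {
            this.name = name;
            this.text = text;
        }
        static Node element(String name) { return new Node(name, null); }
        static Node text(String text) { return new Node(null, text); }
        Node add(Node child) { children.add(child); return this; }
    }

    // Fire the SAX events an XML parser would produce for the same content.
    static void fire(Node root, ContentHandler handler) throws SAXException {
        handler.startDocument();
        walk(root, handler);
        handler.endDocument();
    }

    // Depth-first walk: start event, children, end event.
    private static void walk(Node n, ContentHandler handler) throws SAXException {
        if (n.name == null) {
            char[] chars = n.text.toCharArray();
            handler.characters(chars, 0, chars.length);
            return;
        }
        // Attributes omitted in this sketch, so an empty list is passed.
        handler.startElement("", n.name, n.name, new AttributesImpl());
        for (Node child : n.children) {
            walk(child, handler);
        }
        handler.endElement("", n.name, n.name);
    }

    // Illustration only: record the events as strings, in firing order.
    static List<String> trace(Node root) {
        final List<String> events = new ArrayList<>();
        DefaultHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                events.add("start:" + qName);
            }
            @Override
            public void endElement(String uri, String local, String qName) {
                events.add("end:" + qName);
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                events.add("text:" + new String(ch, start, length));
            }
        };
        try {
            fire(root, recorder);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return events;
    }
}
```

The catch, of course, is that the tree has to be well-nested before you
walk it, which is exactly the cleanup Tidy already does.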

I think the bigger problem w/ DOM and SAX support will be the lack of
standard C++ header files for these interfaces.  For Java, the W3C and Dave
Megginson supply the org.w3c.dom.* and org.xml.sax.* packages, but C++ has
no equivalent.  So we'll have to pick which C++ SAX and DOM to support.  Any
suggestions?

take it easy,
Charles Reitzel

Hope you don't mind that I copy this to the list.


-----Original Message-----
From: Richard A. O'Keefe [mailto:ok@atlas.otago.ac.nz]
Sent: Monday, May 14, 2001 6:46 PM
To: CReitzel@arrakisplanet.com
Subject: RE: FW: Making HTML Tidy a supported library


	My basic thinking is that Tidy couldn't use a DOM, because the
	whole point of Tidy is that you can throw all kinds of mal-formed
	slop at it and produce nice clean code.
	
It is a great pity that there is
 *NO* "save as HTML" method in the DOM
 *NO* "save as XML" method in the DOM
 *NO* "save as ESIS" method in the DOM
because if there were even one of those anyone with a DOM+Javascript
browser just wouldn't _need_ Tidy.  You'd load a page, letting the
browser do whatever cleanup it wanted, and then save out a clean version.

Come to think of it, there is also
 *NO* "load HTML" method in the DOM
 *NO* "load XML" method in the DOM.
Each different DOM implementation has a different way to create a document,
hence the existence of JAX and JDOM.

So you are 100% right that Tidy could be used to _build_ a DOM model,
but would still have to do its own parsing.

	Bjoern's approach of traversing the internal Tidy tree and emitting
	SAX events seems like a good one in principle, but I wonder if you
	can adequately capture HTML in SAX events (designed to present XML).

If HTML information content could not be captured in XML, we wouldn't have
XHTML.  The SAX events are a refinement of the ESIS interface described in
the SGML standard.  The SAX interface even lets you find out exactly where
things were in the original sources, which the DOM does not.

Since JDOM is already set up to be layered over SAX, I'd say that the best
way to handle possibly mucky HTML in Java would be a combination of JTidy
and JDOM linked by SAX.

Received on Monday, 14 May 2001 20:48:47 UTC