- From: Reitzel, Charlie <CReitzel@arrakisplanet.com>
- Date: Mon, 14 May 2001 20:49:13 -0400
- To: "'Richard A. O'Keefe'" <ok@atlas.otago.ac.nz>
- Cc: html-tidy@w3.org
Hi Richard,

I see your points. I didn't think the DOM was designed to do transformations or conversions itself, but was intended simply as an interface to the tree. How the tree gets built, well, that is a problem for the implementation.

Fact is, Tidy *does* support some of the "save as" options you mention. I'm not sure we need to propagate the SGML/ESIS stuff any further. But that's just me.

I think that half the point of separating the "pretty printer" from the parser would be that, once you had parsed and cleaned up an input document, you can write any new transform/conversion routine you like. Same goes for the DOM, btw. Thus, the Tidy-To-DOM converter might really be useful, because other 3rd-party libs can work w/ DOM instances and do transforms, etc. This would be a big win for TidyJ, I imagine.

About the SAX events, I'm not sure I follow exactly what you are saying. You said, "If HTML information content could not be captured in XML, we wouldn't have XHTML." XHTML *is* valid XML. But not all HTML is XHTML. So, sure, for XHTML (or any other XML) you can fire SAX events. If it's true that HTML is valid SGML, you might be able to fire ESIS events from a clean Tidy tree.

I think, as I write this, I am catching your point. If you can convert a Tidy tree to XML, you can fire the equivalent SAX events. Is that what you have in mind?

I think the bigger problem w/ DOM and SAX support will be the lack of standard C++ header files for these interfaces. For Java, the W3C and Dave Megginson supply the org.w3c.dom.* and org.xml.sax.* packages; there is nothing equivalent for C++. So we'll have to pick which C++ SAX and DOM to support. Any suggestions?

take it easy,
Charles Reitzel

Hope you don't mind that I copy this to the list.

-----Original Message-----
From: Richard A. O'Keefe [mailto:ok@atlas.otago.ac.nz]
Sent: Monday, May 14, 2001 6:46 PM
To: CReitzel@arrakisplanet.com
Subject: RE: FW: Making HTML Tidy a supported library

	My basic thinking is that Tidy couldn't use a DOM, because the
	whole point of Tidy is that you can throw all kinds of malformed
	slop at it and produce nice clean code.

It is a great pity that there is
    *NO* "save as HTML" method in the DOM
    *NO* "save as XML" method in the DOM
    *NO* "save as ESIS" method in the DOM
because if there were even one of those, anyone with a DOM+Javascript browser just wouldn't _need_ Tidy. You'd load a page, letting the browser do whatever cleanup it wanted, and then save out a clean version.

Come to think of it, there is also
    *NO* "load HTML" method in the DOM
    *NO* "load XML" method in the DOM.
Each different DOM implementation has a different way to create a document, hence the existence of JAX and JDOM.

So you are 100% right that Tidy could be used to _build_ a DOM model, but would still have to do its own parsing.

	Bjoern's approach of traversing the internal Tidy tree and
	emitting SAX events seems like a good one in principle, but I
	wonder if you can adequately capture HTML in SAX events
	(designed to present XML).

If HTML information content could not be captured in XML, we wouldn't have XHTML. The SAX events are a refinement of the ESIS interface described in the SGML standard. The SAX interface even lets you find out exactly where things were in the original sources, which the DOM does not.

Since JDOM is already set up to be layered over SAX, I'd say that the best way to handle possibly mucky HTML in Java would be a combination of JTidy and JDOM linked by SAX.
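
To make the "traverse the tree and emit SAX events" idea concrete, here is a minimal Java sketch. It is only an illustration under stated assumptions: TreeNode, TreeToSax and EventLog are hypothetical names invented for this example, and TreeNode is a stand-in for a cleaned-up parse tree, not Tidy's or JTidy's actual node type. Only the org.xml.sax interfaces are real.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Walks a (hypothetical) cleaned-up tree depth-first and fires SAX callbacks. */
final class TreeToSax {
    static void emit(TreeNode root, ContentHandler handler) throws SAXException {
        handler.startDocument();
        walk(root, handler);
        handler.endDocument();
    }

    private static void walk(TreeNode node, ContentHandler handler) throws SAXException {
        if (node.name == null) {                 // text node
            char[] chars = node.text.toCharArray();
            handler.characters(chars, 0, chars.length);
            return;
        }
        AttributesImpl atts = new AttributesImpl();
        for (Map.Entry<String, String> e : node.attrs.entrySet()) {
            // No namespaces in this sketch: empty URI, local name == qName.
            atts.addAttribute("", e.getKey(), e.getKey(), "CDATA", e.getValue());
        }
        handler.startElement("", node.name, node.name, atts);
        for (TreeNode child : node.children) {
            walk(child, handler);
        }
        handler.endElement("", node.name, node.name);
    }

    public static void main(String[] args) throws SAXException {
        TreeNode root = TreeNode.element("p").attr("class", "x")
                .add(TreeNode.text("hi"))
                .add(TreeNode.element("br"));
        EventLog log = new EventLog();
        emit(root, log);
        System.out.println(log.events);
    }
}

/** Hypothetical stand-in for a parse-tree node; NOT Tidy's real data structure. */
final class TreeNode {
    final String name;   // element name, or null for a text node
    final String text;   // text content, or null for an element
    final Map<String, String> attrs = new LinkedHashMap<>();
    final List<TreeNode> children = new ArrayList<>();

    private TreeNode(String name, String text) { this.name = name; this.text = text; }
    static TreeNode element(String name) { return new TreeNode(name, null); }
    static TreeNode text(String text) { return new TreeNode(null, text); }
    TreeNode attr(String key, String value) { attrs.put(key, value); return this; }
    TreeNode add(TreeNode child) { children.add(child); return this; }
}

/** Records element and text events as strings so their order is easy to inspect. */
final class EventLog extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        StringBuilder sb = new StringBuilder("start:" + qName);
        for (int i = 0; i < atts.getLength(); i++) {
            sb.append(' ').append(atts.getQName(i)).append('=').append(atts.getValue(i));
        }
        events.add(sb.toString());
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        events.add("end:" + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        events.add("text:" + new String(ch, start, length));
    }
}
```

Any ContentHandler can be plugged in where EventLog is used, which is the point of the approach: the walker only needs a tree that is already well-formed, and whatever consumes SAX events (a DOM or JDOM builder, a serializer) sits on the other side.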
Received on Monday, 14 May 2001 20:48:47 UTC