- From: Christian Peter <cpeter@rostock.igd.fhg.de>
- Date: Wed, 09 Oct 2002 17:15:43 +0200
- To: html-tidy@w3.org
Hi,

I want to convert "any" HTML document to XML and thought JTidy might be a good fit, since the system this converter will be integrated into is written in Java. I took the demo code from SourceForge (http://sourceforge.net/docman/display_doc.php?docid=1298&group_id=13153), got it running, and am now wondering why the XML output file doesn't look as expected. (The demo program calls the instance of class Tidy with xmlOut=true, which is said to set the output to XML format.)

Here are the things confusing me:

First, the generated files start with

    <html>
    <head>
    <meta name="generator" content="HTML Tidy, see www.w3.org" />

rather than with

    <?xml version="1.0" encoding="us-ascii"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"

Why is that? It looks to me as if the output isn't set to XML at all. What do I have to do to really get XML output?

Second, with quite a lot of sites (e.g. www.nasa.gov) I get a parsing error when reading the generated file (with IE or Netscape):

    XML Parsing Error: undefined entity
    Location: file:///C:/prog/3DWS/JTidy/files/www.nasa.gov.xml
    Line Number 208, Column 22:
    size="2">NASA en Español</font></a></td>

Question: which settings are necessary to get this handled properly?

I should tell you that I'm new to XML, and I don't know much about HTML either. But since I'm very bright, I'm sure I'll need just a little help at the beginning and will soon be a valuable contributor to this list ;-)

Many, many thanks for your help and patience!

Christian
Received on Wednesday, 9 October 2002 11:19:10 UTC