JTidy - Beginner's question

Hi,

I want to convert "any" HTML document to XML and thought JTidy 
might be a good fit, since the system into which this converter will 
be integrated is written in Java.

I took the demo code from SourceForge 
(http://sourceforge.net/docman/display_doc.php?docid=1298&group_id=13153), 
got it running, and am now wondering why the XML output file doesn't 
look as expected. (The demo program configures its Tidy instance 
with xmlOut=true, which is supposed to set the output format to XML.)

Here is what confuses me:

First, the generated files start with

   <html>
   <head>
   <meta name="generator" content="HTML Tidy, see www.w3.org" />

rather than with

   <?xml version="1.0" encoding="us-ascii"?>
   <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"

Why is that? It looks to me as if the output isn't set to XML at 
all. What do I have to do to actually get XML output?

Second, for quite a lot of sites (e.g. www.nasa.gov) I get a 
parsing error when opening the generated file in IE or Netscape:

   XML Parsing Error: undefined entity
   Location: file:///C:/prog/3DWS/JTidy/files/www.nasa.gov.xml
   Line Number 208, Column 22:size="2">NASA en
   Espa&ntilde;ol</font></a></td>

Question: which settings are necessary to get this handled properly?
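
To rule out the browsers, the same failure can be reproduced with the 
XML parser that ships with the JDK. This is only a minimal sketch: the 
class name EntityCheck and the helper isWellFormed are made up for 
illustration, and the two test strings are cut-down stand-ins for the 
offending line, not the real nasa.gov output. XML itself predefines 
only five entities (lt, gt, amp, apos, quot), so a named HTML entity 
such as ntilde is rejected when no DTD declares it, while the 
equivalent numeric character reference is accepted.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class EntityCheck {

    // Hypothetical helper: returns true if the text parses as well-formed XML.
    // The default error handler may also print the parser's message to stderr.
    static boolean isWellFormed(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Thrown for well-formedness errors such as an undeclared entity.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Named HTML entity, no DTD declaring it: rejected.
        System.out.println(isWellFormed("<a>Espa&ntilde;ol</a>")); // false
        // Numeric character reference for the same letter: accepted.
        System.out.println(isWellFormed("<a>Espa&#241;ol</a>"));   // true
    }
}
```

So the generated file would parse if the named entities were either 
declared by a DTD or written as numeric references.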

I should mention that I'm new to XML and don't know much about HTML 
either. But since I'm very bright, I'm sure I'll need just a little 
help at the beginning and will soon be a valuable contributor to 
this list ;-)

Many, many thanks for your help and patience!

Christian

Received on Wednesday, 9 October 2002 11:19:10 UTC