JTidy - Beginner's question

From: Christian Peter <cpeter@rostock.igd.fhg.de>
Date: Wed, 09 Oct 2002 17:15:43 +0200
Message-ID: <3DA4481F.6020601@rostock.igd.fhg.de>
To: html-tidy@w3.org


I want to convert "any" HTML document to XML and thought using JTidy 
might be a good idea since the system in which this converter will 
be integrated is written in Java.

I took the demo code from SourceForge, got it running, and am now 
wondering why the XML output file doesn't look as expected. (The demo 
program configures the Tidy instance with xmlOut=true, which is said 
to set the output to XML format.)

Here are the things confusing me:

First, the generated files start with

   <meta name="generator" content="HTML Tidy, see www.w3.org" />

rather than with

   <?xml version="1.0" encoding="us-ascii"?>
   <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"

Why is that? It looks to me as if the output isn't set to XML at 
all. What do I have to do to actually get XML output?

Second, with quite a lot of sites (e.g. www.nasa.gov) I get a 
parsing error when reading the generated file (with IE or Netscape):

   XML Parsing Error: undefined entity
   Location: file:///C:/prog/3DWS/JTidy/files/www.nasa.gov.xml
   Line Number 208, Column 22:
   size="2">NASA en

Question: which settings are necessary to get this handled properly?
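
The "undefined entity" error above can be reproduced without JTidy. 
The sketch below uses only the JDK's built-in XML parser and assumes 
the offending text contains a named HTML entity. The error message is 
cut off right after "NASA en", so the string "Espa&ntilde;ol" is a 
made-up stand-in, and the class and method names are hypothetical. XML 
predefines only five named entities (amp, lt, gt, quot, apos), so an 
HTML name like &ntilde; is rejected unless a DTD declares it, while the 
numeric form &#241; is always accepted.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class EntityCheck {

    /** Returns true if the given string parses as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler ignores warnings and rethrows fatal errors,
            // which keeps the parser from printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(
                xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // Raised for any well-formedness error, including an
            // entity reference that was never declared.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical input: a named HTML entity with no DTD declaring it.
        System.out.println(isWellFormed("<p>NASA en Espa&ntilde;ol</p>")); // false
        // The same character written as a numeric character reference.
        System.out.println(isWellFormed("<p>NASA en Espa&#241;ol</p>"));   // true
    }
}
```

If this is indeed the cause, the setting to look for is the one that 
makes Tidy write numeric references instead of named entities (HTML 
Tidy calls this option numeric-entities; whether the JTidy demo 
exposes it in the same way is an assumption to verify).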

I should tell you that I'm new to XML, and I don't know much about 
HTML either. But since I'm very bright, I'm sure I'll need just a 
little help at the beginning and will soon be a valuable contributor 
to this list ;-)

Many, many thanks for your help and patience!

Received on Wednesday, 9 October 2002 11:19:10 UTC
