Problems with entities HTML -> XML! (new to list) from Niels Peter Strandberg on 2001-01-26 (html-tidy@w3.org from January to March 2001)

From: Niels Peter Strandberg <nielspeter@npstrandberg.com>
Date: Fri, 26 Jan 2001 15:40:33 +0100
To: html-tidy@w3c.org
Message-Id: <200101261439.PAA14104@d1o38.telia.com>

Hi!

(Using jTidy)

I'm converting an HTML file to XML. I have 2 problems that I need to know how to solve.

Code:

        tidy.setXmlOut(true);
        tidy.setFixBackslash(true); // URL FixBackslash
        tidy.setRawOut(true); // RawOut - avoid mapping values > 127 to entities
        tidy.setXmlPi(true); // XmlPi - add <?xml?> for XML docs
        tidy.setQuoteAmpersand(true); // QuoteAmpersand - output naked ampersand as &amp;
        tidy.setTidyMark(false); // TidyMark - add meta element indicating tidied doc
        tidy.setWraplen(99999); // Wraplen - default wrap margin



The result file output:

<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
<head>
<link rel="made" href="wsanchez@apple.com" />
<title>Welcome to Mac OS X!</title>
...........


Problems:

I want to treat this result file as a "normal" XML file. I'm going to transform the result using XSL and XPath.

1) Entities! The &copy; in the output is treated as an entity, so the parser complains. What I want is every entity converted to its actual character (e.g. &copy; -> ©). How can this be done?

2) I opened the result file in XML Spy for Windows. XML Spy tells me that <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"> needs a space somewhere. Do I need the DOCTYPE at all? How do I solve this problem?
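
Both problems could be worked around by post-processing the tidied text before it reaches the XML parser. The sketch below is only an illustration and uses JDK classes, not jTidy: the class name TidyOutputFixer, its method names, and the small entity table are assumptions made up for this example, and a real table would need the full HTML entity list. It rewrites named HTML entities as numeric character references, which any XML parser resolves to the actual character without a DTD, and it drops the DOCTYPE line. The complaint about a missing space is probably because XML requires a system identifier after the public identifier in a PUBLIC doctype, and the HTML 3.2 doctype has none.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jTidy.
class TidyOutputFixer {
    // Deliberately tiny table; extend it with the entities your pages use.
    private static final Map<String, Integer> HTML_ENTITIES = new HashMap<>();
    static {
        HTML_ENTITIES.put("nbsp", 160);
        HTML_ENTITIES.put("copy", 169);
        HTML_ENTITIES.put("reg", 174);
        HTML_ENTITIES.put("eacute", 233);
    }

    private static final Pattern NAMED_ENTITY =
            Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    // Replace known named entities with numeric references (&copy; -> &#169;).
    // Names not in the table, including XML's own &amp; and &lt;, are left alone.
    static String toNumericReferences(String text) {
        Matcher m = NAMED_ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Integer codePoint = HTML_ENTITIES.get(m.group(1));
            String replacement = (codePoint == null)
                    ? m.group()
                    : "&#" + codePoint + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Remove the first DOCTYPE declaration and the whitespace after it.
    // Assumes the declaration has no internal subset in square brackets.
    static String stripDoctype(String text) {
        return text.replaceFirst("(?i)<!DOCTYPE[^>]*>\\s*", "");
    }
}
```

Calling TidyOutputFixer.stripDoctype(TidyOutputFixer.toNumericReferences(tidiedText)) on the tidied output, where tidiedText is whatever string holds it, would give text that a non-validating XML parser should accept, provided every entity in the page is in the table.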


Here is what I want to do:

html (url) -> xml -> xsl or xpath -> xml (DOM or file)

The ideal would be:
html -> DOM (jTidy), then use XPath or XSL to manipulate the DOM tree -> the result could be an XML file, an HTML file, a DOM tree ....
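
The DOM-then-XSL half of that pipeline can be sketched with the JDK's own JAXP classes. This is a minimal illustration, not a finished application: the class name DomPipeline and its methods are made up for the example, and it assumes the tidied output is already well-formed XML held in a string. A Document obtained directly from jTidy could in principle be passed to transform() instead, but whether a given XSLT processor accepts jTidy's DOM implementation is not something this sketch verifies.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper showing XML text -> DOM -> XSLT -> result text.
class DomPipeline {
    // Parse well-formed XML text into a DOM tree.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Apply an XSLT stylesheet (given as text) to a DOM tree and
    // return the serialized result.
    static String transform(Document doc, String stylesheet) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not apply stylesheet", e);
        }
    }
}
```

To get a DOM tree rather than text as the result, a javax.xml.transform.dom.DOMResult could be passed to the transformer in place of the StreamResult.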

Has anyone out there built an application that can do this in one go, and is willing to share it?


Regards, Niels Peter

Received on Friday, 26 January 2001 09:40:07 UTC