HTML-->XML with Tidy

I've been experimenting with using Tidy to generate XHTML and XML
documents.  It is pretty amazing what it does, and most of the generated XML
is excellent.  I have, however, had some problems with the XML "headers,"
i.e. the first few lines of the generated XML files.  I'm not an XML expert,
but my current assignment is to get some web pages into a valid XML format
(for a database).  Some of what I ran into might be interesting to the
readers of this group.  I welcome any comments.

When converting HTML documents to XML, Tidy keeps track of whether there was
already a DOCTYPE specified, and keeps the original if it existed.  This is
probably the "theoretically correct" thing to do in an ideal world.  In
practice, however, it doesn't work with a lot of pages.  The best example I
can give is any page generated by MSHTML (in other words, FrontPage Express,
or any program that uses the Internet Explorer control to create or modify a
web page as mine does).  MSHTML adds the following DOCTYPE to the HTML
document:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

After converting this document to XML, I get a "header" that looks like
this:

<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

This is not a valid XML doctype declaration.  In XML, a PUBLIC identifier
must be followed by a second quoted string (the system identifier) giving
the filename (or URL) of the DTD, and that string is missing here.  I don't
really know how Tidy should detect this case or what it should do about it.
One idea I had was for Tidy to remove invalid doctypes, validating them with
some very simple check, or maybe just trapping for the above doctype and
replacing it with a better one.  But as I said before, I'm really in over my
head with XML, so I can't be of much help here.
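
As a rough illustration of the "very simple check" idea above, here is a
minimal sketch in C.  It is not taken from the Tidy source; the function
name doctype_has_system_id and its helpers are hypothetical.  It only tests
whether a PUBLIC doctype carries a second quoted string after the public
identifier, and it assumes the declaration is passed in as one string.

```c
#include <stddef.h>
#include <string.h>

/* Skip spaces, tabs and line breaks. */
static const char *skip_space(const char *p)
{
    while (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n')
        ++p;
    return p;
}

/* Skip one quoted literal starting at p (p must point at ' or ").
   Returns a pointer just past the closing quote, or NULL if p does not
   start a literal or the literal is unterminated. */
static const char *skip_literal(const char *p)
{
    char quote = *p;
    const char *end;

    if (quote != '"' && quote != '\'')
        return NULL;
    end = strchr(p + 1, quote);
    return end ? end + 1 : NULL;
}

/* Hypothetical helper: returns 1 if the doctype string contains PUBLIC
   followed by two quoted literals (public id, then system id), else 0. */
int doctype_has_system_id(const char *doctype)
{
    const char *p = strstr(doctype, "PUBLIC");

    if (p == NULL)
        return 0;
    p = skip_literal(skip_space(p + 6));   /* public identifier */
    if (p == NULL)
        return 0;
    p = skip_literal(skip_space(p));       /* system identifier */
    return p != NULL;
}
```

With this sketch, the MSHTML doctype shown earlier is rejected because
nothing follows the public identifier, while a declaration that also names
the DTD location is accepted.  What Tidy should then do with a rejected
doctype (drop it or replace it) is left open here.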

Next, I found that Internet Explorer 5.0 chokes on the W3C's Transitional
DTD, giving the following error (error is from parsing the DTD, not the XML
file):

Attribute 'xmlns:' must be a #FIXED attribute. Line 257, Position 4

I guess this is a question of interpretation of the specifications for an
XML DTD.  The IE 5.0 programmers interpreted the spec one way, and the
author of the Transitional DTD interpreted it another way.  Anyway, we
tried using the Strict DTD instead, and IE 5.0 successfully parsed the DTD,
but the XML file wasn't "strict," so it failed the validation (that was no
surprise).  This may be an issue worth looking into.

The last thing I wanted to mention was a minor bug in the command line
parser for Tidy.  The option -asxml is supposed to make Tidy output XML, but
it is actually parsed into the "output XHTML" option instead.  Change line
713 from "xHTML = yes;" to "XmlOut = yes;" to fix the problem.  The
distinction is minor, but this keeps it consistent with the config file's
options.

As far as that goes, it may make sense to add a separate command line option
for XHTML, adding

            else if (strcmp(arg, "asxhtml") == 0)
                xHTML = yes;

right below line 713.  Of course, since the distinction between the XML and
XHTML outputs is minimal, this may be a moot point.  As far as I noticed,
Tidy outputs the original DOCTYPE in XML vs. a generated DOCTYPE in XHTML,
and adds the xmlns attribute to the <html> tag in XHTML; there may be a few
other minor differences that I missed.
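
Putting the two changes together, the option handling might look roughly
like the sketch below.  This is a standalone illustration, not the actual
Tidy source: the Bool type, the two flag variables and the function name
set_output_option are stand-ins declared here so the fragment compiles on
its own, and arg is assumed to arrive with the leading '-' already removed.

```c
#include <string.h>

/* Stand-ins for Tidy's own Bool type and flags (assumptions). */
typedef enum { no = 0, yes = 1 } Bool;

static Bool XmlOut = no;   /* set by -asxml   */
static Bool xHTML  = no;   /* set by -asxhtml */

/* Hypothetical helper: handle one option name.
   Returns 1 if the option was recognised, 0 otherwise. */
int set_output_option(const char *arg)
{
    if (strcmp(arg, "asxml") == 0)
        XmlOut = yes;          /* was "xHTML = yes;" before the fix */
    else if (strcmp(arg, "asxhtml") == 0)
        xHTML = yes;           /* proposed separate XHTML option */
    else
        return 0;
    return 1;
}
```

The point of keeping the two branches separate is that each command line
option then sets the same flag as the config file option of the same name.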

Thanks!

=-=-=-=-=-=-=-=-=-=-=-=-=-=
Douglas Cook - MCP
mailto:cookd@cs.byu.edu
=-=-=-=-=-=-=-=-=-=-=-=-=-=

Received on Wednesday, 18 August 1999 15:27:30 UTC