Re: HTML-->XML with Tidy

On Wed, 18 Aug 1999, Douglas Cook wrote:

> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
> 
> After converting this document to XML, I get a "header" that looks like
> this:
> 
> <?xml version="1.0"?>
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
> 
> This is not a valid XML doctype spec.  It doesn't have the
> second data string giving the filename (or url) of the dtd.  I
> don't really know how Tidy should detect this case or what it
> should do about it.  One idea I had was for Tidy to remove
> invalid doctypes, validating them with some very simple check,
> or maybe just trapping for the above doctype and replacing it
> with a better one.  But as I said before, I'm really in over my
> head with XML, so I couldn't help much.

I guess the simplest option would be to add an empty system
identifier ("") after the public identifier string. That would
satisfy the letter of the XML spec.
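
As a rough sketch of that idea in C: add_empty_system_id is a
hypothetical helper, not a function in Tidy, and it assumes the
declaration has a public identifier but no system identifier.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Tidy): copy a declaration of the form
 *     <!DOCTYPE name PUBLIC "public-id">
 * into out, inserting an empty system identifier before the closing '>'.
 * Assumes the input has a public identifier and no system identifier.
 * Returns 0 on success, -1 if the input does not end with '>' or the
 * output buffer is too small.
 */
int add_empty_system_id(const char *doctype, char *out, size_t outsize)
{
    static const char suffix[] = " \"\">";   /* space, empty literal, '>' */
    size_t len = strlen(doctype);

    if (len == 0 || doctype[len - 1] != '>')
        return -1;
    if (len - 1 + sizeof(suffix) > outsize)  /* sizeof counts the NUL */
        return -1;

    memcpy(out, doctype, len - 1);                  /* drop the final '>' */
    memcpy(out + len - 1, suffix, sizeof(suffix));  /* append suffix + NUL */
    return 0;
}
```

For the doctype quoted above this would give

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "">

which has both literals the XML grammar asks for, though an empty
system identifier does not point a parser at any DTD.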

> Next, I found that Internet Explorer 5.0 chokes on the W3C's
> Transitional DTD, giving the following error (error is from
> parsing the DTD, not the XML file):
> 
> Attribute 'xmlns:' must be a #FIXED attribute. Line 257, Position 4

We have added this to the XHTML DTDs.

> The last thing I wanted to mention was a minor bug in the
> command line parser for Tidy.  The option -asxml is supposed to
> make tidy output XML, but it is actually parsed into the "output
> XHTML" option instead.  Change line 713 from "xHTML = yes;" to
> "XmlOut = yes;" to fix the problem.  The distinction is minor,
> but this keeps it consistent with the config file's options.
> 
> As far as that goes, it may make sense to add a separate command
> line option for xHTML, adding
> 
>             else if (strcmp(arg, "asxhtml") == 0)
>                 xHTML = yes;
> 
> right below line 713.  Of course since the distinction between
> the XML and XHTML outputs is minimal (Tidy outputs the original
> DOCTYPE in XML vs. a generated DOCTYPE in XHTML, and adds the
> xmlns attribute to the <html> tag in XHTML, and possibly a few
> other minor differences that I didn't notice), this may be a
> moot point.

Thanks - I will look into these points.
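
For reference, the option handling you describe might look roughly
like this once both flags are separate. This is only a sketch: the
Bool typedef and parse_output_option are stand-ins I am assuming for
illustration, not the actual code around line 713. XmlOut and xHTML
are the variable names from your message.

```c
#include <string.h>

typedef enum { no, yes } Bool;   /* stand-in for Tidy's Bool (assumption) */

Bool XmlOut = no;   /* output well-formed XML */
Bool xHTML  = no;   /* output XHTML */

/*
 * Hypothetical helper: handle one option name, with the leading '-'
 * already stripped. Returns 1 if the option was recognized, 0 otherwise.
 */
int parse_output_option(const char *arg)
{
    if (strcmp(arg, "asxml") == 0)
        XmlOut = yes;            /* previously set xHTML by mistake */
    else if (strcmp(arg, "asxhtml") == 0)
        xHTML = yes;             /* proposed separate XHTML switch */
    else
        return 0;

    return 1;
}
```

With this, -asxml would match the XML output setting in the config
file, and -asxhtml would select the XHTML output explicitly.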

Regards,

-- Dave Raggett <dsr@w3.org> http://www.w3.org/People/Raggett
phone: +44 122 578 2984 (or 2521) +44 385 320 444 (gsm mobile)
World Wide Web Consortium (on assignment from HP Labs)

Received on Tuesday, 24 August 1999 06:43:24 UTC