- From: Douglas Cook <cookd@cs.byu.edu>
- Date: Wed, 18 Aug 1999 12:25:44 -0700
- To: Tidy <html-tidy@w3.org>
I've been playing around with using Tidy to generate XHTML and XML documents. It is pretty amazing what it does, and most of the generated XML is excellent. I have had some problems with the XML "headers," i.e. the first few lines of the generated XML files. I'm not an XML expert, but my current assignment is to get some web pages into a valid XML format (for a database). I have some experiences that might be interesting to the readers of this group. I welcome any comments.

When converting HTML documents to XML, Tidy keeps track of whether a DOCTYPE was already specified, and keeps the original if there was one. This is probably the "theoretically correct" thing to do in an ideal world. In practice, however, it doesn't work with a lot of pages. The best example I can give is any page generated by MSHTML (in other words, FrontPage Express, or any program that uses the Internet Explorer control to create or modify a web page, as mine does). MSHTML adds the following DOCTYPE to the HTML document:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

After converting this document to XML, I get a "header" that looks like this:

    <?xml version="1.0"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

This is not a valid XML doctype declaration. It doesn't have the second quoted string giving the filename (or URL) of the DTD. I don't really know how Tidy should detect this case or what it should do about it. One idea I had was for Tidy to remove invalid doctypes, validating them with some very simple check, or maybe just to trap the above doctype and replace it with a better one. But as I said before, I'm really in over my head with XML, so I couldn't help much.

Next, I found that Internet Explorer 5.0 chokes on the W3C's Transitional DTD, giving the following error (the error comes from parsing the DTD, not the XML file):

    Attribute 'xmlns:' must be a #FIXED attribute.
    Line 257, Position 4

I guess this is a question of interpretation of the specifications for an XML DTD. The IE 5.0 programmers interpreted the spec one way, and the author of the XML Transitional DTD interpreted it another way. Anyway, we tried using the strict DTD, and IE 5.0 successfully parsed the DTD, but the XML file wasn't "strict," so it failed the validation (that was no surprise). This may be an issue worth looking into.

The last thing I wanted to mention was a minor bug in the command line parser for Tidy. The option -asxml is supposed to make Tidy output XML, but it is actually parsed into the "output XHTML" option instead. Change line 713 from "xHTML = yes;" to "XmlOut = yes;" to fix the problem. The distinction is minor, but this keeps it consistent with the config file's options. As far as that goes, it may make sense to add a separate command line option for xHTML, adding

    else if (strcmp(arg, "asxhtml") == 0)
        xHTML = yes;

right below line 713. Of course, since the distinction between the XML and XHTML outputs is minimal (Tidy outputs the original DOCTYPE in XML vs. a generated DOCTYPE in XHTML, adds the xmlns attribute to the <html> tag in XHTML, and possibly makes a few other minor differences that I didn't notice), this may be a moot point.

Thanks!

=-=-=-=-=-=-=-=-=-=-=-=-=-=
Douglas Cook - MCP
mailto:cookd@cs.byu.edu
=-=-=-=-=-=-=-=-=-=-=-=-=-=
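
The "very simple check" for invalid doctypes suggested in the message could look something like the sketch below. It is only an illustration and not Tidy code: the function names doctype_lacks_system_id and count_quoted_literals are hypothetical, and the check assumes the whole declaration is available as one C string. It relies on the XML rule that a PUBLIC identifier must be followed by a second quoted literal (the system identifier), and it simply counts the quoted literals that follow the PUBLIC keyword.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count quoted literals ("..." or '...') in s, stopping at the end of the
   declaration ('>'), the start of an internal subset ('['), or the end of
   the string. An unterminated literal is not counted. */
static int count_quoted_literals(const char *s)
{
    int count = 0;

    while (*s != '\0' && *s != '>' && *s != '[') {
        if (*s == '"' || *s == '\'') {
            const char *close = strchr(s + 1, *s);
            if (close == NULL)
                break;          /* unterminated literal: stop counting */
            count++;
            s = close + 1;      /* continue after the closing quote */
        } else {
            s++;
        }
    }
    return count;
}

/* Hypothetical helper, not part of Tidy: returns 1 when decl contains a
   PUBLIC keyword that is not followed by both a public and a system
   identifier, 0 otherwise (including when there is no PUBLIC keyword). */
int doctype_lacks_system_id(const char *decl)
{
    const char *kw;

    assert(decl != NULL);
    kw = strstr(decl, "PUBLIC");
    if (kw == NULL)
        return 0;               /* no PUBLIC keyword: nothing to check */
    return count_quoted_literals(kw + strlen("PUBLIC")) < 2;
}
```

Called on the MSHTML doctype quoted in the message, doctype_lacks_system_id reports the missing system identifier. What to do next (drop the doctype, or replace it with a generated one) is a separate decision that this sketch leaves open.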
Received on Wednesday, 18 August 1999 15:27:30 UTC