- From: Murray Altheim <murray@spyglass.com>
- Date: Wed, 11 Dec 1996 17:57:36 -0400
- To: w3c-sgml-wg@w3.org

"XML and HTML
 HTML and XML
 Live together in per-fect Har-mo-neeee..."
     (sung to the tune of "Ebony and Ivory")

An unnamed browser company wishes to understand the ramifications of
XML on the Web, so I thought this might be of interest to this group.
If not, please tune out this thread and sorry for its intrusion. I hope
to hash out some real answers for the HTML community, as well as my
company.

This document is also posted online, where it's a bit easier to read
(markup! yes!):

   http://www.cm.spyglass.com/doc/spec/htmlxml.html

Obviously, how we answer numbers 1, 4, and 5 below may have a great
deal of impact on MS, Netscape, Spyglass, and others, as well as how
XML is accepted in the greater Web community. The other questions are
equally important to different demographics. While I realize some of
this has yet to be decided, I thought I'd get a running start and ask
(and therefore try to answer) some questions.

People wanna know:

 1. What happens when an XML document meets an HTML browser?
 2. What happens when an HTML document meets an XML browser?
 3. What changes must one make to an HTML document to make it
    compatible with an XML browser?
 4. What changes must one make to an XML document to make it
    compatible with an HTML browser?
 5. What changes must an HTML browser developer make to their product
    to allow it to correctly parse XML?
 6. What changes must an XML browser developer make to their product
    to allow it to correctly parse valid HTML?
 7. What changes must an XML browser developer make to their product
    to allow it to correctly parse invalid HTML?

I'll try to break this down by question:

1. XML documents' behavior in an HTML browser

   a. the '.xml' extension is not understood, so the HTML browser
      tries to download it. The user gets to read his XML in emacs.

   b. the '.xml' extension is understood as text/plain, so the user
      gets to read his XML in Netscape. Whoopee.

   c. the HTML browser tries to parse it as HTML.
      All sorts of strange stuff appears on the screen (what is this
      "<?XML>" thingie? why don't any of the images appear, and why do
      I see all these "<IMG/>" (or even "<IMAGE/>") tags?)

2. HTML documents' behavior in an XML browser

   a. the HTML document is not valid, nor even well-formed. The XML
      browser dies a thousand violent deaths. (I've seen this happen.
      Not pretty.)

   b. the HTML document is either well-formed or even (!) valid. But
      it is strictly HTML, so the XML browser hits the first IMG tag
      and just keeps looking for that elusive </IMG>. The user goes to
      bed wondering.

3. HTML document compatibility with an XML browser

   a. This is currently not possible without modification to the SGML
      declaration and the HTML DTD. Now, if one is willing and able to
      make these changes, we might be able to play. In the SGML
      declaration, declare the null end tag NET="/>". In the DTD,
      disallow minimization rules -- require end tags for all
      elements. For standardized DTDs in wide use, like DocBook and
      HTML, this might be a problem. Also, unquoted attribute values
      are disallowed.

   b. Another option, modifying the DTD to include no empty element
      declarations, wouldn't work, as the installed document base
      prohibits such a change.

4. XML document compatibility with an HTML browser

   a. Current HTML browsers will produce noise on PIs (and not process
      the PIs), will not handle the modified NET properly, and, due to
      the arbitrary nature of the markup, will produce unpredictable
      results.

   b. If the XML document is really an HTML document in an XML wrapper
      (see #3), then it's a matter of modifying the browser as in #5
      below.

5. HTML browser parsing XML

   An HTML browser developer must modify the current HTML parsing code
   to take into account:

   a. Both the document character set and encoding are different from
      HTML. XML uses Unicode and UTF-2/UTF-8 (allowing other
      encodings), so unless the browser is already i18n-ed, this may
      be a big problem.

   b. Processing Instructions <?XML ... ?>
      I didn't realize we'd also changed PIC for XML.
      Hmmm.

   c. Funky End Tag Weirdness

   d. A currently unspecified hyperlinking mechanism characterized by
      a different link syntax using IDREFs and IDs (?)

   e. If we care about validity (as does everybody on the Web), there
      will be some XML documents that are broken, i.e., well-formed
      but not valid. (I think we can safely ignore this concern.)

   f. Marked section handling. Easy:

      1. Search for "<!["
      2. Parse forward to next "["
      3. Parse forward to next "]]>"

      All content between #1 and #2 (ignoring whitespace) is the MS
      keyword. This can be IGNORE or INCLUDE, or an entity that
      expands to that. (With no entity expansion in HTML, it'd better
      be INCLUDE or IGNORE.) If the keyword is not IGNORE, then
      default to INCLUDE. (If it's CDATA, what to do?)

      All content between #2 and #3 is the content to be included or
      ignored. All the rest (including the keyword) is markup to be
      discarded by the formatter.

6. XML browser parsing valid HTML

   a. Once the XML browser detects an HTML document, punt.

   b. ?

7. XML browser parsing invalid HTML

   a. Once the XML browser detects an HTML document, punt.

   b. ?

------------

Again, a thousand pardons for the noise this creates, but I think/hope
this will be of benefit to all. All comments, criticisms, etc. to me.
Comments to me either privately or publicly may/will be incorporated
into the online document.

If anyone else has already written something like this, please let me
know and maybe we can work a deal. I've got ten live chickens in my
apartment I might be willing to sell...

Murray

```````````````````````````````````````````````````````````````````````````````
Murray Altheim, Program Manager
Spyglass, Inc., Cambridge, Massachusetts
email: <mailto:murray@spyglass.com>
http:  <http://www.cm.spyglass.com/murray/murray.html>
"Give a monkey the tools and he'll eventually build a typewriter."
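
A minimal sketch of the three-step marked-section scan from item 5f,
written in Python. Several things here are assumptions, not part of
the message: the function name strip_marked_sections is hypothetical,
the keyword comparison is case-insensitive, any keyword other than
IGNORE (including CDATA, which the message leaves open) is treated as
INCLUDE, nested marked sections are not handled, and an unterminated
section is passed through untouched.

```python
def strip_marked_sections(text: str) -> str:
    """Resolve INCLUDE/IGNORE marked sections with a flat, non-nesting scan."""
    out = []
    pos = 0
    while True:
        start = text.find("<![", pos)                 # step 1: find "<!["
        if start == -1:
            out.append(text[pos:])                    # no more sections
            break
        bracket = text.find("[", start + 3)           # step 2: next "["
        close = text.find("]]>", bracket + 1) if bracket != -1 else -1  # step 3
        if bracket == -1 or close == -1:
            # Assumption: leave an unterminated section as-is.
            out.append(text[pos:])
            break
        out.append(text[pos:start])                   # text before the section
        # Keyword sits between "<![" and "[", ignoring whitespace.
        keyword = text[start + 3:bracket].strip().upper()
        if keyword != "IGNORE":                       # anything else => INCLUDE
            out.append(text[bracket + 1:close])
        pos = close + 3                               # skip past "]]>"
    return "".join(out)


if __name__ == "__main__":
    sample = "a<![ INCLUDE [b]]>c<![IGNORE[d]]>e"
    print(strip_marked_sections(sample))
```

The section delimiters and the keyword never reach the output, which
matches "all the rest is markup to be discarded by the formatter". A
real parser would also need to deal with entity references in the
keyword position and with nesting, both of which this sketch skips.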
Received on Wednesday, 11 December 1996 17:55:13 UTC