- From: Christopher R. Maden <crm@ebt.com>
- Date: Wed, 11 Dec 1996 23:28:14 GMT
- To: w3c-sgml-wg@w3.org
[Murray Altheim]
> 1. XML documents' behavior in an HTML browser
>
>    a. the '.xml' extension is not understood, so the HTML browser tries
>       to download it. The user gets to read his XML in emacs.
>    b. the '.xml' extension is understood as text/plain, so the user
>       gets to read his XML in Netscape. Whoopee.
>    c. the HTML browser tries to parse it as HTML. All sorts of strange
>       stuff appears on the screen (what is this "<?XML>" thingie? why
>       don't any of the images appear, and I see all these "<IMG/>" (or
>       even "<IMAGE/>") tags?)

Which happens depends on the browser's configuration. Presumably, the
first time encountering text/xml or text/x-xml, the browser will prompt
the user. I don't see a. or b. as a problem, since that's a matter of
choice for the user. For c., they'll see PIs, but that's really a bug in
most browser implementations. The markup (including <img/>) will
generally be hidden. They'll see all the content, probably run together
in one large paragraph unless the XML happens to include a <p> element.
If one is authoring XML and expecting that HTML browsers may be used
with it, one should use a tagset that is a superset of HTML.

> 2. HTML documents' behavior in an XML browser
>
>    a. the HTML document is not valid, nor even well-formed. The XML
>       browser dies a thousand violent deaths. (I've seen this
>       happen. Not pretty.)

If people would validate their HTML, this wouldn't happen. But I would
assume that an XML browser wouldn't be used (at least not as such) for
text/html; on encountering that, it should act as an HTML user-agent,
with the usual error recovery for bogus markup. Even in XML mode, I
would expect a user-agent to have reasonable error recovery for dealing
with interleaved tags, etc.

>    b. the HTML document is either well-formed or even (!) valid. But it
>       is strictly HTML, so the XML browser hits the first IMG tag and
>       just keeps looking for that elusive </IMG>. The user goes to bed
>       wondering.

See 2a.
for why this shouldn't be an issue, or 3. if it still is.

> 3. HTML document compatibility with an XML browser
>
>    a. This is currently not possible without modification to the
>       declaration and HTML DTD. Now, if one is willing and able to make
>       these changes, we might be able to play. In the SGML declaration,
>       declare the null end tag NET="/>". In the DTD, disallow
>       minimization rules -- require end tags for all elements. For
>       standardized DTDs like DocBook and HTML in wide use, this might
>       be a problem. Also, unquoted attribute values are disallowed.

Require end-tags for all elements in HTML per Prescod; now valid HTML
documents are well-formed XML. No current Web browsers will choke on
this (though some insert extra space on </br> and </p>).

>    b. Another option, modifying the DTD to include no empty element
>       declarations, wouldn't work, as the installed document base
>       prohibits such a change.

Not true - truly valid documents (the minority) declare to which DTD
they comply. Ones that weren't valid to begin with don't care what the
DTD says. It works in the browser.

> 4. XML document compatibility with an HTML browser
>
>    a. Current HTML browser will produce noise on PIs (and not process
>       the PIs), not handle the modified NET properly, and due to the
>       arbitrary nature of the markup, produce unpredictable results.

They will handle the modified NET (at least most of them will - I
haven't actually tried Spyglass's). Anything based on libwww ignores
<[^?].*>.

>    b. If the XML document is really an HTML document in an XML wrapper
>       (see #3), then it's a matter of modifying the browser as in #5
>       below.
> 5. HTML browser parsing XML
>
>    An HTML browser developer must modify the current HTML parsing code
>    to take into account:
>
>    a. Both the document character set and encoding are different from
>       HTML. XML uses Unicode and UTF-2/UTF-8 (allowing other
>       encodings), so unless the browser is already i18n-ed, this may be
>       a big problem.
I don't see it as much of a problem. Non-i18n browsers are probably U.S.
or European, and will deal with UTF-8 OK. UCS-2 will look bad, but this
will be a problem for HTML as well as XML, so I don't see it as an XML
issue.

>    b. Processing Instructions <?XML ... ?> I didn't realize we'd also
>       changed PIC for XML. Hmmm.

Shouldn't change anything for most current Web clients.

>    c. Funky End Tag Weirdness

Not a problem - see above.

>    d. A currently unspecified hyperlinking mechanism characterized by a
>       different link syntax using IDREFs and IDs (?)

I think hyperlinks are the least of a Web client's worries handling XML.
I'd be happy to see the data in a readable way.

>    e. If we care about validity (as does everybody on the Web), there
>       will be some XML documents that are broken, ie., well formed. (I
>       think we can safely ignore this concern.)

Did you mean "not well formed"?

>    f. Marked section handling. Easy:
>       1. Search for "<!["
>       2. Parse forward to next "["
>       3. Parse forward to next "]]>"
>       All content between #1 and #2 (ignoring whitespace) is the MS
>       keyword. This can be IGNORE or INCLUDE, or an entity that
>       expands to that (With no entity expansion in HTML, it'd
>       better be INCLUDE or IGNORE). If the keyword is not IGNORE,
>       then default to INCLUDE. (If it's CDATA, what to do?)
>       All content between #2 and #3 is the content to be included
>       or ignored. All the rest (including the keyword) is markup to
>       be discarded by the formatter.

Only CDATA marked sections are permitted in the document instance
(prod. [19]). Display all marked sections outside <!DOCTYPE ... >.

> 6. XML browser parsing valid HTML
>
>    a. Once the XML browser detects an HTML document, punt.

My preference for 6. and 7. I don't think a text/xml user agent should
attempt to handle text/html. If a single program incorporates a user
agent for both types, fine; but keep the distinction.

> Again, a thousand pardons for the noise this creates, but I
> think/hope this will be of benefit to all.
> All comments, criticisms,
> etc. to me. Comments to me either privately or publicly may/will be
> incorporated into the online document. If anyone else has already
> written something like this, please let me know and maybe we can
> work a deal. I've got ten live chickens in my apartment I might be
> willing to sell...

-Chris
--
<!NOTATION SGML.Geek PUBLIC "-//GCA//NOTATION SGML Geek//EN">
<!ENTITY crism PUBLIC "-//EBT//NONSGML Christopher R. Maden//EN" SYSTEM
"<URL>http://www.ebt.com <TEL>+1.401.421.9550 <FAX>+1.401.521.2030
<USMAIL>One Richmond Square, Providence, RI 02906 USA" NDATA SGML.Geek>
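
The marked-section steps quoted under 5f can be sketched roughly as
follows. This is only a minimal illustration in Python, not code from
any browser: the function name resolve_marked_sections is hypothetical,
keywords are compared case-insensitively (an assumption), nested marked
sections and entity expansion are not handled, and an unterminated
section is simply passed through unchanged. Per the quoted rule, any
keyword other than IGNORE (including CDATA) is treated as INCLUDE.

```python
def resolve_marked_sections(text: str) -> str:
    """Apply the three quoted steps: find "<![", the next "[", then "]]>".

    Hypothetical helper for illustration only; not taken from any parser.
    """
    out = []
    pos = 0
    while True:
        # Step 1: search for the marked-section open delimiter.
        start = text.find("<![", pos)
        if start == -1:
            out.append(text[pos:])
            break
        out.append(text[pos:start])

        # Step 2: parse forward to the next "[" (end of the keyword).
        open_bracket = text.find("[", start + 3)
        # Step 3: parse forward to the next "]]>" (end of the section).
        close = text.find("]]>", open_bracket + 1) if open_bracket != -1 else -1
        if open_bracket == -1 or close == -1:
            # Assumption: leave an unterminated section untouched.
            out.append(text[start:])
            break

        # Content between #1 and #2, ignoring whitespace, is the keyword.
        keyword = text[start + 3:open_bracket].strip().upper()
        # Content between #2 and #3 is what gets included or ignored.
        content = text[open_bracket + 1:close]
        if keyword != "IGNORE":
            out.append(content)  # anything but IGNORE defaults to INCLUDE

        # Everything else (delimiters and keyword) is discarded as markup.
        pos = close + 3
    return "".join(out)
```

For example, calling resolve_marked_sections on
"a<![ INCLUDE [b]]>c<![IGNORE[d]]>e" keeps "b" and drops "d". Treating
CDATA content as displayable text matches the reply above, though a real
user agent would also have to avoid re-parsing that content as markup.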
Received on Wednesday, 11 December 1996 18:43:09 UTC