XML and HTML Interoperability

   "XML and HTML
    HTML and XML
    Live together in per-fect
    Har-mo-neeee..."
    (sung to the tune of "Ebony and Ivory")

An unnamed browser company wishes to understand the ramifications of XML on
the Web, so I thought this might be of interest to this group. If not,
please tune out this thread and sorry for its intrusion. I hope to hash out
some real answers for the HTML community, as well as my company. This
document is also posted online, where it's a bit easier to read (markup!
yes!):

   http://www.cm.spyglass.com/doc/spec/htmlxml.html

Obviously, how we answer numbers 1, 4, and 5 below may have a great deal of
impact on MS, Netscape, Spyglass, and others, as well as how XML is
accepted in the greater Web community. The other questions are equally
important to different demographics. While I realize some of this has yet
to be decided, I thought I'd get a running start and ask (and therefore try
to answer) some questions. People wanna know:

    1. What happens when an XML document meets an HTML browser?
    2. What happens when an HTML document meets an XML browser?
    3. What changes must one make to an HTML document to make it
       compatible with an XML browser?
    4. What changes must one make to an XML document to make it
       compatible with an HTML browser?
    5. What changes must an HTML browser developer make to their
       product to allow it to correctly parse XML?
    6. What changes must an XML browser developer make to their
       product to allow it to correctly parse valid HTML?
    7. What changes must an XML browser developer make to their
       product to allow it to correctly parse invalid HTML?

I'll try to break this down by question:

1. XML documents' behavior in an HTML browser

a. the '.xml' extension is not understood, so the HTML browser tries to
   download it. The user gets to read his XML in emacs.
b. the '.xml' extension is understood as text/plain, so the user gets to read
   his XML in Netscape. Whoopee.
c. the HTML browser tries to parse it as HTML. All sorts of strange stuff
   appears on the screen (what is this "<?XML>" thingie? why don't any of
   the images appear, and I see all these "<IMG/>" (or even "<IMAGE/>")
   tags?)

2. HTML documents' behavior in an XML browser

a. the HTML document is not valid, nor even well-formed. The XML browser dies
   a thousand violent deaths. (I've seen this happen. Not pretty.)
b. the HTML document is either well-formed or even (!) valid. But it is
   strictly HTML, so the XML browser hits the first IMG tag and just keeps
   looking for that elusive </IMG>. The user goes to bed wondering.
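
   Case b is easy to reproduce. A minimal sketch, assuming Python's
   standard xml.etree.ElementTree parser as a stand-in for the "XML
   browser" (the two sample documents are made up for illustration):

```python
import xml.etree.ElementTree as ET

# HTML-style document: IMG has no end tag, as the HTML DTD allows.
HTML_STYLE = '<P>Hello <IMG SRC="logo.gif"> world</P>'
# The same document using the XML-style empty-element tag.
XML_STYLE = '<P>Hello <IMG SRC="logo.gif"/> world</P>'


def is_well_formed(text):
    """Return True if the XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # The parser reaches </P> while IMG is still open and gives up.
        return False


print(is_well_formed(HTML_STYLE))
print(is_well_formed(XML_STYLE))
```

   The first document is rejected because the parser is still waiting
   for </IMG> when it meets </P>; the second is accepted.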

3. HTML document compatibility with an XML browser

a. This is currently not possible without modification to the declaration and
   HTML DTD. Now, if one is willing and able to make these changes, we might be
   able to play. In the SGML declaration, declare the null end tag NET="/>".
   In the DTD, disallow minimization rules -- require end tags for all
   elements. For standardized DTDs like DocBook and HTML in wide use, this
   might be a problem. Also, unquoted attribute values are disallowed.

b. Another option, modifying the DTD to include no empty element
   declarations, wouldn't work, as the installed document base prohibits such
   a change.
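
   Seen from the document side, the changes in (a) amount to closing
   empty elements with "/>" and quoting every attribute value. A rough
   sketch of such a rewrite in Python follows. The function name, the
   list of empty elements, and the regular expressions are my own
   assumptions for illustration; it does not supply omitted end tags
   (e.g. </P>, </LI>) and it will trip over a ">" inside a quoted value.

```python
import re

# Elements declared EMPTY in HTML (a partial, illustrative list).
EMPTY_ELEMENTS = {"IMG", "BR", "HR", "INPUT", "META", "LINK", "BASE", "AREA"}

# A start tag: element name, then optional attributes up to ">".
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)((?:\s[^<>]*)?)>")
# One attribute: name and "=", then a quoted or an unquoted value.
ATTR = re.compile(r"""(\s+[A-Za-z][\w.-]*\s*=\s*)("[^"]*"|'[^']*'|[^\s"'>]+)""")


def quote_attr(match):
    """Wrap an unquoted attribute value in double quotes."""
    prefix, value = match.group(1), match.group(2)
    if value[0] in "\"'":
        return match.group(0)  # already quoted, leave it alone
    return '%s"%s"' % (prefix, value)


def xmlify(html):
    """Quote attribute values and close EMPTY elements with '/>'."""
    def fix_tag(match):
        name, attrs = match.group(1), match.group(2)
        attrs = ATTR.sub(quote_attr, attrs)
        is_empty = name.upper() in EMPTY_ELEMENTS
        if is_empty and not attrs.rstrip().endswith("/"):
            return "<%s%s/>" % (name, attrs.rstrip())
        return "<%s%s>" % (name, attrs)
    return START_TAG.sub(fix_tag, html)


print(xmlify("<P>Hi <IMG SRC=logo.gif WIDTH=100><BR></P>"))
```

   Tags that already end in "/>" and values that are already quoted
   pass through unchanged, so running it twice does no further damage.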

4. XML document compatibility with an HTML browser

a. Current HTML browsers will produce noise on PIs (and not process the
   PIs), not handle the modified NET properly, and, due to the arbitrary
   nature of the markup, produce unpredictable results.

b. If the XML document is really an HTML document in an XML wrapper (see #3),
   then it's a matter of modifying the browser as in #5 below.

5. HTML browser parsing XML

An HTML browser developer must modify the current HTML parsing code to take
into account:

a. Both the document character set and encoding are different from HTML.
   XML uses Unicode and UTF-2/UTF-8 (allowing other encodings), so unless
   the browser is already i18n-ed, this may be a big problem.
b. Processing Instructions: <?XML ... ?>. I didn't realize we'd also
   changed PIC for XML. Hmmm.
c. Funky End Tag Weirdness
d. A currently unspecified hyperlinking mechanism
   characterized by a different link syntax using IDREFs and IDs (?)
e. If we care about validity (as does everybody on the Web), there
   will be some XML documents that are broken, i.e., well-formed but not
   valid. (I think we can safely ignore this concern.)
f. Marked section handling. Easy:
       1. Search for "<!["
       2. Parse forward to next "["
       3. Parse forward to next "]]>"
       All content between #1 and #2 (ignoring whitespace) is the MS
       keyword. This can be IGNORE or INCLUDE, or an entity that expands
       to that (With no entity expansion in HTML, it'd better be INCLUDE
       or IGNORE). If the keyword is not IGNORE, then default to INCLUDE.
       (If it's CDATA, what to do?)
       All content between #2 and #3 is the content to be included or
       ignored. All the rest (including the keyword) is markup to be
       discarded by the formatter.
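
   The three steps in (f) can be sketched in a few lines of Python. This
   is only an illustration under my own assumptions: the function name is
   made up, the keyword is compared exactly (no case folding, no entity
   expansion), nested marked sections are not handled, and an
   unterminated section is passed through untouched. CDATA falls under
   the "not IGNORE" rule and is simply included.

```python
def process_marked_sections(text):
    """Drop IGNORE sections, keep the content of all others."""
    out = []
    pos = 0
    while True:
        start = text.find("<![", pos)                  # step 1
        if start == -1:
            out.append(text[pos:])
            break
        bracket = text.find("[", start + 3)            # step 2
        end = text.find("]]>", bracket + 1) if bracket != -1 else -1  # step 3
        if bracket == -1 or end == -1:
            # Unterminated marked section: leave the rest as-is (assumption).
            out.append(text[pos:])
            break
        out.append(text[pos:start])
        # Everything between step 1 and step 2, minus whitespace, is the keyword.
        keyword = text[start + 3:bracket].strip()
        if keyword != "IGNORE":
            # Anything that is not IGNORE defaults to INCLUDE.
            out.append(text[bracket + 1:end])
        pos = end + 3                                  # discard the "]]>" markup
    return "".join(out)


print(process_marked_sections("a<![ IGNORE [b]]>c<![INCLUDE[d]]>e"))
```

   Only the content between the second "[" and the "]]>" survives, and
   only when the keyword is something other than IGNORE.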

6. XML browser parsing valid HTML

a. Once the XML browser detects an HTML document, punt.
b. ?

7. XML browser parsing invalid HTML

a. Once the XML browser detects an HTML document, punt.
b. ?

------------

Again, a thousand pardons for the noise this creates, but I think/hope this
will be of benefit to all. All comments, criticisms, etc. to me. Comments
to me either privately or publicly may/will be incorporated into the online
document. If anyone else has already written something like this, please
let me know and maybe we can work a deal. I've got ten live chickens in my
apartment I might be willing to sell...

Murray

```````````````````````````````````````````````````````````````````````````````
    Murray Altheim, Program Manager
    Spyglass, Inc., Cambridge, Massachusetts
    email: <mailto:murray@spyglass.com>
    http:  <http://www.cm.spyglass.com/murray/murray.html>
           "Give a monkey the tools and he'll eventually build a typewriter."

Received on Wednesday, 11 December 1996 17:55:13 UTC