Re: XML and HTML Interoperability

[Murray Altheim]

> 1. XML documents' behavior in an HTML browser
> a. the '.xml' extension is not understood, so the HTML browser tries
>    to download it. The user gets to read his XML in emacs.
> b. the '.xml' extension is understood as text/plain, so the user
>    gets to read his XML in Netscape. Whoopee.
> c. the HTML browser tries to parse it as HTML. All sorts of strange
>    stuff appears on the screen (what is this "<?XML>" thingie? why
>    don't any of the images appear, and I see all these "<IMG/>" (or
>    even "<IMAGE/>") tags?)

Which of these happens depends on the browser's configuration.
Presumably, the first time it encounters text/xml or text/x-xml, the
browser will prompt the user.

I don't see a. or b. as a problem, since that's a matter of choice for
the user.  For c., they'll see PIs, but that's really a bug in most
browser implementations.  The markup (including <img/>) will generally
be hidden.  They'll see all the content, probably run together in one
large paragraph unless the XML happens to include a <p> element.
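
As a rough illustration of case c., here is a minimal sketch that uses
Python's html.parser as a stand-in for a lenient HTML browser.  That
substitution is my assumption (real browsers differ in detail), and the
sample memo document and the names TextOnly and visible_text are made
up for the example.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect character data and ignore every tag, known or unknown."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Only text between tags reaches the "screen".
        self.chunks.append(data)


def visible_text(markup):
    """Return what a tag-ignoring renderer would show for this markup."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


# Hypothetical XML document with a PI, unknown elements and an empty IMG.
doc = ('<?XML version="1.0"?><memo><to>Murray</to>'
       '<body>See <IMG SRC="x.gif"/> attached.</body></memo>')

print(visible_text(doc))
```

The unknown tags and the empty IMG are dropped, and the text of the
separate elements runs together with nothing to break it up.  This
parser also swallows the PI, which is what a browser without the PI
bug would do.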

If one is authoring XML and expecting that HTML browsers may be used
with it, one should use a tagset that is a superset of HTML.

> 2. HTML documents' behavior in an XML browser
> a. the HTML document is not valid, nor even well-formed. The XML
>    browser dies a thousand violent deaths. (I've seen this
>    happen. Not pretty.)

If people would validate their HTML, this wouldn't happen.  But I
would assume that an XML browser wouldn't be used (at least not as
such) for text/html; on encountering that, it should act as an HTML
user-agent, with the usual error recovery for bogus markup.

Even in XML mode, I would expect a user-agent to have reasonable error
recovery for dealing with interleaved tags, etc.

> b. the HTML document is either well-formed or even (!) valid. But it
>    is strictly HTML, so the XML browser hits the first IMG tag and
>    just keeps looking for that elusive </IMG>. The user goes to bed
>    wondering.

See 2a. for why this shouldn't be an issue, or 3. if it still is.

> 3. HTML document compatibility with an XML browser
> a. This is currently not possible without modification to the
>    declaration and HTML DTD. Now, if one is willing and able to make
>    these changes, we might be able to play. In the SGML declaration,
>    declare the null end tag NET="/>".  In the DTD, disallow
>    minimization rules -- require end tags for all elements. For
>    standardized DTDs like DocBook and HTML in wide use, this might
>    be a problem. Also, unquoted attribute values are disallowed.

Require end-tags for all elements in HTML, per Prescod; valid HTML
documents are then well-formed XML.  No current Web browsers will
choke on this (though some insert extra space on </br> and </p>).
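
To make the end-tag point concrete, here is a small sketch that runs
two versions of the same fragment through an XML parser.  The choice of
Python's xml.etree.ElementTree, the helper name is_well_formed and the
sample markup are all mine, for illustration only.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the markup parses as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Usual HTML minimization: <br> and <p> are never closed.
minimized = "<html><body><p>one<br>two<p>three</body></html>"

# Same content with an explicit end-tag for every element.
explicit = ("<html><body><p>one<br></br>two</p>"
            "<p>three</p></body></html>")

print(is_well_formed(minimized))  # False
print(is_well_formed(explicit))   # True
```

This checks well-formedness only, not validity against any DTD.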

> b. Another option, modifying the DTD to include no empty element
>    declarations, wouldn't work, as the installed document base
>    prohibits such a change.

Not true - truly valid documents (the minority) declare to which DTD
they comply.  Ones that weren't valid to begin with don't care what
the DTD says; all that matters for them is that they work in the
browser.

> 4. XML document compatibility with an HTML browser
> a. Current HTML browser will produce noise on PIs (and not process
>    the PIs), not handle the modified NET properly, and due to the
>    arbitrary nature of the markup, produce unpredictable results.

They will handle the modified NET (at least most of them will - I
haven't actually tried Spyglass's).  Anything based on libwww ignores

> b. If the XML document is really an HTML document in an XML wrapper
>    (see #3), then it's a matter of modifying the browser as in #5
>    below.

> 5. HTML browser parsing XML
> An HTML browser developer must modify the current HTML parsing code
> to take into account:
> a. Both the document character set and encoding are different from
>    HTML.  XML uses Unicode and UTF-2/UTF-8 (allowing other
>    encodings), so unless the browser is already i18n-ed, this may be
>    a big problem.

I don't see it as much of a problem.  Non-i18n browsers are probably
U.S. or European, and will deal with UTF-8 OK.  UCS-2 will look bad,
but this will be a problem for HTML as well as XML, so I don't see it
as an XML issue.
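
A quick sketch of why the two encodings behave so differently in a
byte-oriented client, at least for ASCII-range text.  The sample string
is made up, and I am using Python's "utf-16-be" codec as a stand-in for
big-endian UCS-2, which it matches for characters in the Basic
Multilingual Plane.

```python
text = "<p>Hello</p>"

utf8 = text.encode("utf-8")
ucs2 = text.encode("utf-16-be")  # stand-in for big-endian UCS-2

# For ASCII-range characters, UTF-8 is byte-for-byte plain ASCII,
# so a client that knows nothing about Unicode reads it unchanged.
print(utf8 == text.encode("ascii"))  # True

# UCS-2 spends two bytes per character; each ASCII character is
# preceded by a zero byte that a byte-oriented client cannot use.
print(ucs2[:6])  # b'\x00<\x00p\x00>'
print(len(utf8), len(ucs2))  # 12 24
```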

> b. Processing Instructions <?XML ... ?> I didn't realize we'd also
>    changed PIC for XML. Hmmm.

Shouldn't change anything for most current Web clients.

> c. Funky End Tag Weirdness

Not a problem - see above.

> d. A currently unspecified hyperlinking mechanism characterized by a
>    different link syntax using IDREFs and IDs (?)

I think hyperlinks are the least of a Web client's worries when
handling XML.  I'd be happy just to see the data in a readable way.

> e. If we care about validity (as does everybody on the Web), there
>    will be some XML documents that are broken, ie., well formed. (I
>    think we can safely ignore this concern.)

Did you mean "not well formed"?

> f. Marked section handling. Easy:
>        1. Search for "<!["
>        2. Parse forward to next "["
>        3. Parse forward to next "]]>"
>        All content between #1 and #2 (ignoring whitespace) is the MS
>        keyword. This can be IGNORE or INCLUDE, or an entity that
>        expands to that (With no entity expansion in HTML, it'd
>        better be INCLUDE or IGNORE). If the keyword is not IGNORE,
>        then default to INCLUDE.  (If it's CDATA, what to do?)
>        All content between #2 and #3 is the content to be included
>        or ignored. All the rest (including the keyword) is markup to
>        be discarded by the formatter.

Only CDATA marked sections are permitted in the document instance
(prod. [19]).  Display all marked sections outside <!DOCTYPE ... >.
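
For what it's worth, the three-step scan quoted above can be sketched
in a few lines.  This is only an illustration under my own assumptions:
the function name process_marked_sections is made up, an unterminated
section is passed through untouched, and CDATA takes the "not IGNORE,
so include it" default.

```python
def process_marked_sections(text):
    """Drop marked-section markup; keep content unless keyword is IGNORE."""
    out = []
    pos = 0
    while True:
        start = text.find("<![", pos)            # step 1: find "<!["
        if start == -1:
            out.append(text[pos:])
            break
        kw_end = text.find("[", start + 3)       # step 2: next "["
        close = text.find("]]>", kw_end + 1) if kw_end != -1 else -1
        if close == -1:
            # Unterminated section: leave the rest of the text alone.
            out.append(text[pos:])
            break
        out.append(text[pos:start])
        keyword = text[start + 3:kw_end].strip().upper()
        content = text[kw_end + 1:close]         # step 3: up to "]]>"
        if keyword != "IGNORE":
            out.append(content)                  # INCLUDE, CDATA, other
        pos = close + 3
    return "".join(out)


print(process_marked_sections("a<![CDATA[<b>]]>c"))        # a<b>c
print(process_marked_sections("x<![ IGNORE [hidden]]>y"))  # xy
```

A real formatter would also have to show CDATA content as literal
characters rather than hand it back to the tag parser; the sketch
returns the raw content and leaves that step out.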

> 6. XML browser parsing valid HTML
> a. Once the XML browser detects an HTML document, punt.

My preference for 6. and 7.  I don't think a text/xml user agent
should attempt to handle text/html.  If a single program incorporates
a user agent for both types, fine; but keep the distinction.
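
Keeping the distinction inside one program could be as simple as
dispatching on the media type.  A minimal sketch; the function name
choose_agent, the return labels and the fall-back of prompting the user
for anything else are my assumptions, not anything a spec requires.

```python
def choose_agent(content_type):
    """Pick a handler name from a Content-Type header value."""
    # Drop parameters such as "; charset=..." and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "text/html":
        return "html"    # HTML user agent, with its usual error recovery
    if media_type in ("text/xml", "text/x-xml"):
        return "xml"     # XML user agent
    return "prompt"      # unknown type: ask the user what to do


print(choose_agent("text/html; charset=iso-8859-1"))  # html
print(choose_agent("text/x-xml"))                     # xml
```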

> Again, a thousand pardons for the noise this creates, but I
> think/hope this will be of benefit to all. All comments, criticisms,
> etc. to me. Comments to me either privately or publicly may/will be
> incorporated into the online document. If anyone else has already
> written something like this, please let me know and maybe we can
> work a deal. I've got ten live chickens in my apartment I might be
> willing to sell...

<!ENTITY crism PUBLIC "-//EBT//NONSGML Christopher R. Maden//EN" SYSTEM
"<URL>http://www.ebt.com <TEL>+1.401.421.9550 <FAX>+1.401.521.2030
<USMAIL>One Richmond Square, Providence, RI 02906 USA" NDATA SGML.Geek>
