Re: XML and HTML Interoperability
[Murray Altheim]
> 1. XML documents' behavior in an HTML browser
>
> a. the '.xml' extension is not understood, so the HTML browser tries
> to download it. The user gets to read his XML in emacs.
> b. the '.xml' extension is understood as text/plain, so the user
> gets to read his XML in Netscape. Whoopee.
> c. the HTML browser tries to parse it as HTML. All sorts of strange
> stuff appears on the screen (what is this "<?XML>" thingie? why
> don't any of the images appear, and I see all these "<IMG/>" (or
> even "<IMAGE/>") tags?)
Which of these happens depends on the browser's configuration.
Presumably, the first time it encounters text/xml or text/x-xml, the
browser will prompt the user.
I don't see a. or b. as a problem, since that's a matter of choice for
the user. For c., they'll see PIs, but that's really a bug in most
browser implementations. The markup (including <img/>) will generally
be hidden. They'll see all the content, probably run together in one
large paragraph unless the XML happens to include a <p> element.
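
As a rough model of case c., the sketch below assumes a parser that drops anything that looks like a tag but leaves processing instructions in the output. The pattern and the function name are illustrative assumptions, not any particular browser's code.

```python
import re

# Assumed model: anything of the form <...> that does not start with "<?"
# is treated as an unknown tag and silently dropped; PIs stay in the output.
TAG_PATTERN = re.compile(r"<[^?][^>]*>")


def render_as_naive_html(markup: str) -> str:
    """Return the text a tag-dropping parser would leave on screen."""
    return TAG_PATTERN.sub("", markup)


sample = (
    '<?XML version="1.0"?>'
    '<doc><title>Hi</title><img src="a.gif"/><para>there</para></doc>'
)
print(render_as_naive_html(sample))
```

The PI survives as visible noise, the empty-element tag disappears along with the rest of the markup, and the remaining text runs together with no paragraph breaks.
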
If one is authoring XML and expecting that HTML browsers may be used
with it, one should use a tagset that is a superset of HTML.
> 2. HTML documents' behavior in an XML browser
>
> a. the HTML document is not valid, nor even well-formed. The XML
> browser dies a thousand violent deaths. (I've seen this
> happen. Not pretty.)
If people would validate their HTML, this wouldn't happen. But I
would assume that an XML browser wouldn't be used (at least not as
such) for text/html; on encountering that, it should act as an HTML
user-agent, with the usual error recovery for bogus markup.
Even in XML mode, I would expect a user-agent to have reasonable error
recovery for dealing with interleaved tags, etc.
> b. the HTML document is either well-formed or even (!) valid. But it
> is strictly HTML, so the XML browser hits the first IMG tag and
> just keeps looking for that elusive </IMG>. The user goes to bed
> wondering.
See 2a. for why this shouldn't be an issue, or 3. if it still is.
> 3. HTML document compatibility with an XML browser
>
> a. This is currently not possible without modification to the
> declaration and HTML DTD. Now, if one is willing and able to make
> these changes, we might be able to play. In the SGML declaration,
> declare the null end tag NET="/>". In the DTD, disallow
> minimization rules -- require end tags for all elements. For
> standardized DTDs like DocBook and HTML in wide use, this might
> be a problem. Also, unquoted attribute values are disallowed.
Require end-tags for all elements in HTML, per Prescod; valid HTML
documents are then well-formed XML. No current Web browsers will choke
on this (though some insert extra space on </br> and </p>).
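
A minimal sketch of the difference, using Python's standard xml.etree.ElementTree as a stand-in for an XML parser (the helper name and the sample strings are assumptions for illustration):

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup as well-formed."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Conventional HTML with omitted end tags: fine for an HTML browser,
# rejected by an XML parser.
html_minimized = "<p>First line<br>second line"

# The same content with every end tag written out.
html_explicit = "<p>First line<br></br>second line</p>"

print(is_well_formed(html_minimized))
print(is_well_formed(html_explicit))
```

Only the second form parses; the first leaves <br> and <p> open when the input ends.
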
> b. Another option, modifying the DTD to include no empty element
> declarations, wouldn't work, as the installed document base
> prohibits such a change.
Not true - truly valid documents (the minority) declare which DTD
they comply with. Ones that weren't valid to begin with don't care what
the DTD says. Either way, it works in the browser.
> 4. XML document compatibility with an HTML browser
>
> a. Current HTML browsers will produce noise on PIs (and not process
> the PIs), not handle the modified NET properly, and due to the
> arbitrary nature of the markup, produce unpredictable results.
They will handle the modified NET (at least most of them will - I
haven't actually tried Spyglass's). Anything based on libwww ignores
<[^?].*>.
> b. If the XML document is really an HTML document in an XML wrapper
> (see #3), then it's a matter of modifying the browser as in #5
> below.
> 5. HTML browser parsing XML
>
> An HTML browser developer must modify the current HTML parsing code
> to take into account:
>
> a. Both the document character set and encoding are different from
> HTML. XML uses Unicode and UCS-2/UTF-8 (allowing other
> encodings), so unless the browser is already i18n-ed, this may be
> a big problem.
I don't see it as much of a problem. Non-i18n browsers are probably
U.S. or European, and will deal with UTF-8 OK. UCS-2 will look bad,
but this will be a problem for HTML as well as XML, so I don't see it
as an XML issue.
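
To see why the two encodings behave so differently in a client that assumes one byte per character, here is a minimal sketch. It assumes the client reads bytes as Latin-1, and it uses Python's utf-16-le codec as a stand-in for UCS-2 (the two agree for characters in the Basic Multilingual Plane); the function name is hypothetical.

```python
def as_seen_by_8bit_client(data: bytes) -> str:
    """Decode bytes the way a one-byte-per-character client would (assumed Latin-1)."""
    return data.decode("latin-1")


text = "Hello, XML"

utf8_bytes = text.encode("utf-8")
ucs2_bytes = text.encode("utf-16-le")  # stand-in for UCS-2, little-endian, no BOM

# ASCII characters are byte-for-byte identical in UTF-8.
print(repr(as_seen_by_8bit_client(utf8_bytes)))

# In UCS-2 every ASCII character is followed by a zero byte.
print(repr(as_seen_by_8bit_client(ucs2_bytes)))
```

Text in the ASCII range comes through UTF-8 unchanged. Characters outside ASCII become multi-byte sequences in UTF-8, so accented letters would still display incorrectly in such a client.
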
> b. Processing Instructions <?XML ... ?> I didn't realize we'd also
> changed PIC for XML. Hmmm.
Shouldn't change anything for most current Web clients.
> c. Funky End Tag Weirdness
Not a problem - see above.
> d. A currently unspecified hyperlinking mechanism characterized by a
> different link syntax using IDREFs and IDs (?)
I think hyperlinks are the least of a Web client's worries handling
XML. I'd be happy to see the data in a readable way.
> e. If we care about validity (as does everybody on the Web), there
> will be some XML documents that are broken, ie., well formed. (I
> think we can safely ignore this concern.)
Did you mean "not well formed"?
> f. Marked section handling. Easy:
> 1. Search for "<!["
> 2. Parse forward to next "["
> 3. Parse forward to next "]]>"
> All content between #1 and #2 (ignoring whitespace) is the MS
> keyword. This can be IGNORE or INCLUDE, or an entity that
> expands to that (With no entity expansion in HTML, it'd
> better be INCLUDE or IGNORE). If the keyword is not IGNORE,
> then default to INCLUDE. (If it's CDATA, what to do?)
> All content between #2 and #3 is the content to be included
> or ignored. All the rest (including the keyword) is markup to
> be discarded by the formatter.
Only CDATA marked sections are permitted in the document instance
(prod. [19]). Display all marked sections outside <!DOCTYPE ... >.
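
The three quoted steps can be sketched as follows. This is a minimal illustration, assuming no nested marked sections and no entity references in the keyword; the function name is hypothetical. CDATA is treated like INCLUDE, so its content is displayed.

```python
def expand_marked_sections(text: str) -> str:
    """Strip marked-section delimiters, keeping content unless the keyword is IGNORE."""
    out = []
    pos = 0
    while True:
        start = text.find("<![", pos)               # step 1: find "<!["
        if start == -1:
            out.append(text[pos:])
            break
        open_bracket = text.find("[", start + 3)    # step 2: next "["
        end = -1
        if open_bracket != -1:
            end = text.find("]]>", open_bracket + 1)  # step 3: next "]]>"
        if end == -1:
            # Unterminated section: pass the remainder through untouched.
            out.append(text[pos:])
            break
        out.append(text[pos:start])
        keyword = text[start + 3:open_bracket].strip().upper()
        if keyword != "IGNORE":
            # INCLUDE, CDATA, or anything unrecognized defaults to display.
            out.append(text[open_bracket + 1:end])
        pos = end + 3
    return "".join(out)


print(expand_marked_sections("a<![CDATA[<b>]]>c"))
print(expand_marked_sections("x<![ IGNORE [hidden]]>y"))
```

A real CDATA section would also need its content protected from later tag recognition, which this sketch does not attempt.
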
> 6. XML browser parsing valid HTML
>
> a. Once the XML browser detects an HTML document, punt.
That's my preference for 6. and 7. I don't think a text/xml user agent
should attempt to handle text/html. If a single program incorporates
a user agent for both types, fine; but keep the distinction.
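
One way to keep that distinction inside a single program is to dispatch on the media type and never let one parser see the other's documents. A minimal sketch, using Python's standard html.parser (lenient) and xml.etree.ElementTree (strict) as stand-ins for the two user agents; the class and function names are assumptions:

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TextCollector(HTMLParser):
    """Lenient HTML handling: collect character data, tolerate missing end tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(body: str, content_type: str) -> str:
    """Pick the parser from the media type; the two code paths never mix."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "text/html":
        collector = TextCollector()
        collector.feed(body)
        collector.close()  # flush any buffered trailing text
        return "".join(collector.chunks)
    if media_type in ("text/xml", "text/x-xml"):
        root = ET.fromstring(body)  # raises ET.ParseError if not well-formed
        return "".join(root.itertext())
    raise ValueError(f"unsupported media type: {media_type}")


print(extract_text("<p>one<br>two", "text/html"))
print(extract_text("<doc>one<br/>two</doc>", "text/xml; charset=utf-8"))
```

The HTML path recovers from the missing end tags, while the same input labelled text/xml would be rejected outright.
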
> Again, a thousand pardons for the noise this creates, but I
> think/hope this will be of benefit to all. All comments, criticisms,
> etc. to me. Comments to me either privately or publicly may/will be
> incorporated into the online document. If anyone else has already
> written something like this, please let me know and maybe we can
> work a deal. I've got ten live chickens in my apartment I might be
> willing to sell...
-Chris
--
<!NOTATION SGML.Geek PUBLIC "-//GCA//NOTATION SGML Geek//EN">
<!ENTITY crism PUBLIC "-//EBT//NONSGML Christopher R. Maden//EN" SYSTEM
"<URL>http://www.ebt.com <TEL>+1.401.421.9550 <FAX>+1.401.521.2030
<USMAIL>One Richmond Square, Providence, RI 02906 USA" NDATA SGML.Geek>