Re: text/html for xml extensions of XHTML from Ian Hickson on 2001-05-01 (www-talk@w3.org from May to June 2001)

From: Ian Hickson <ian@hixie.ch>
Date: Tue, 1 May 2001 16:10:40 -0700 (Pacific Daylight Time)
To: "William F. Hammond" <hammond@csc.albany.edu>
cc: <mozilla-mathml@mozilla.org>, <www-talk@w3.org>
Message-ID: <Pine.WNT.4.31.0105011535390.800-100000@HIXIE.netscape.com>
On Tue, 1 May 2001, William F. Hammond wrote:
>>>
>>> 2.  The instance is served through http as "text/html" and any of
>>>     the following is true:
>>>
>>>     a.  The instance begins with the string "<?xml" .
>>
>> Nope. Here is a document that is valid text/html, but
>> non-well-formed text/xml, and which should therefore be sent
>> through the HTML parser:
>
> SGML validation does not pass on the merits of PI's. In today's
> world the appearance of "<?xml " at the beginning of a text/html
> item clearly indicates XML.

Remember that the XML declaration is optional, and that giving the XML
declaration is discouraged by the XHTML compatibility guidelines (see
section C.1), which are supposed to be followed in order to send XHTML
as text/html.

If you are willing to use the XML declaration as a signal to use XML,
you might as well use text/xml since it's not going to be compatible
with older browsers anyway.


>>    <?xml this is not?>
>>    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0//EN">
>>    <!-- -- -->
>>      This is a comment. This document is not XHTML.
>>      <html xmlns="http://www.w3.org/1999/xhtml"/>
>>      Ok, I'm done now. -->
>>    <html>
>>     <title> Need a title in HTML! </title>
>>     <p> This is a valid HTML document.
>>    </html>
>>
>>>     b.  The instance has a string matching the case-sensitive
>>>         pattern "<!DOCTYPE html PUBLIC .*XHTML" before the first
>>>         document instance tag.
>> Hmm, the valid HTML document above also matches that string.
>
> Well, yes, if you look beyond the end of the "<!DOCTYPE ...>". My
> intention was that the string "XHTML" should be inside the value of
> the FPI, and perhaps the string should be "DTD XHTML".
>
> For the moment I don't know exactly how I would express it. Still I
> think that an xml capable user agent will look bad rolling past a
> correct document type declaration for XHTML.

The moment you get more complicated than "look for a pattern at the
start of the document" you end up having to write a fully fledged
parser. Extreme case in point:

   <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0//EN"
          [ <!-- SYSTEM "not XHTML" --> ]>
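
To make the problem with (a) and (b) concrete, here is a minimal
sketch in Python of the two "look at the start of the document"
checks, run against the valid HTML document quoted earlier in this
thread and against the doctype above. The function names are made up
for illustration, and the "tightened" variant (stop at the first ">")
is only one assumed reading of "inside the FPI", not anything a
browser is known to implement.

```python
import re
import xml.etree.ElementTree as ET

# The valid HTML 4.0 document quoted earlier in this thread.
HTML4_DOC = """<?xml this is not?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0//EN">
<!-- -- -->
  This is a comment. This document is not XHTML.
  <html xmlns="http://www.w3.org/1999/xhtml"/>
  Ok, I'm done now. -->
<html>
 <title> Need a title in HTML! </title>
 <p> This is a valid HTML document.
</html>
"""

# The "extreme case": an HTML 4.0 doctype whose internal subset
# holds a comment that happens to mention XHTML.
EXTREME_DOCTYPE = """<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0//EN"
       [ <!-- SYSTEM "not XHTML" --> ]>"""


def sniff_a(text):
    """Criterion (a): the instance begins with the string '<?xml'."""
    return text.startswith("<?xml")


def sniff_b(text):
    """Criterion (b) taken literally: '.*' may run past the doctype."""
    pattern = r"<!DOCTYPE html PUBLIC .*XHTML"
    return re.search(pattern, text, re.DOTALL) is not None


def sniff_b_tightened(text):
    """Hypothetical tightening: 'XHTML' must precede the first '>'."""
    pattern = r"<!DOCTYPE html PUBLIC [^>]*XHTML"
    return re.search(pattern, text) is not None


def is_well_formed_xml(text):
    """True if a real XML parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print("(a) on HTML 4.0 doc:      ", sniff_a(HTML4_DOC))
    print("(b) on HTML 4.0 doc:      ", sniff_b(HTML4_DOC))
    print("tightened (b) on doc:     ", sniff_b_tightened(HTML4_DOC))
    print("tightened (b) on doctype: ", sniff_b_tightened(EXTREME_DOCTYPE))
    print("doc is well-formed XML:   ", is_well_formed_xml(HTML4_DOC))
```

Both simple checks call the HTML 4.0 document XML even though an XML
parser rejects it. The tightened check gets that document right, but
it is still fooled by the comment in the internal subset, because
telling a comment from the public identifier needs real parsing.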


>>>     c.  The first document instance tag is an open tag for the element
>>>         "html" (all lower case) with a value specified for the attribute
>>>         "xmlns".
>>
>> How do you know it is the first instance tag without having a full
>> XML parser to skip past PIs, comments, internal subsets, and the
>> like?
>
> Surely a user agent in classical mode has a way of knowing what is a
> tag and what is not a tag.

By the time the classical parser has been invoked, it is too late to
back off and switch to an XML parser without a significant performance
hit. (This is definitely the case in Mozilla's architecture; I imagine
it is similar in other browsers that use distinct XML and HTML parsers,
though I may be wrong about that.)


> Since many user agents appear to ignore PI's and document type
> declarations and many extant html offerings do not have document
> type declarations, (c) might reasonably be the sole criterion for
> calling the xml parser.

(c) is the most complicated to implement of the three.


> But does Mozilla call its xml parser for http://www.w3.org/ ?

Nope. If it did, it would render the page without expanding any
character entity references: Mozilla's XML parser is non-validating,
so it skips the DTD and doesn't know what &nbsp;, &middot; and &copy;
are. It would also ignore the print-media specific section of the
stylesheet, which uses uppercase element names and so wouldn't match
any of the lower case elements (line 138 of the first stylesheet).
And it would use an unexpected background colour for the page, because
the stylesheet sets the background on <body> and not <html>, which in
XHTML renders differently from the equivalent in HTML4 (same sheet,
line 5).
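
The entity problem is easy to reproduce outside a browser. The sketch
below uses Python's standard library parsers as stand-ins, which is an
assumption made purely for illustration: they are not Mozilla's
parsers, and Python's XML parser reports an error on the unknown
entity rather than rendering the page without it. The underlying cause
is the same, though: the external DTD is never read, so &nbsp; is
undefined. The sample document and the helper names are invented.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# A small XHTML document that relies on an entity defined only in the DTD.
XHTML_DOC = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>t</title></head>"
    "<body><p>a&nbsp;b</p></body></html>"
)


class TextCollector(HTMLParser):
    """Collects character data; HTML entities are built into the parser."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def xml_parse_error(text):
    """Return the XML parser's error message, or None if parsing succeeds."""
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        return str(exc)
    return None


def html_text(text):
    """Return the character data seen by the HTML parser."""
    parser = TextCollector()
    parser.feed(text)
    parser.close()
    return "".join(parser.chunks)


if __name__ == "__main__":
    print("XML parser error:", xml_parse_error(XHTML_DOC))
    print("HTML parser text:", repr(html_text(XHTML_DOC)))
```

The HTML parser knows the HTML entity set without consulting any DTD,
while the XML parser has nothing to go on once the DTD is skipped.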


Remind me, why would you want to send XML as text/html?

-- 
Ian Hickson                                            )\     _. - ._.)   fL
Invited Expert, CSS Working Group                     /. `- '  (  `--'
The views expressed in this message are strictly      `- , ) -> ) \
personal and not those of Netscape or Mozilla. ________ (.' \) (.' -' ______
Received on Tuesday, 1 May 2001 19:08:18 UTC