Re: XHTML, content type, and content negotiation from Karl Ove Hufthammer on 2000-06-20 (www-html@w3.org from June 2000)

From: Karl Ove Hufthammer <huftis@bigfoot.com>
Date: Tue, 20 Jun 2000 21:46:43 +0200
To: "Tim Taylor" <tim.taylor@iname.com>, <www-html@w3.org>
Message-ID: <021d01bfdaf1$286a6ca0$44349fc3@huftis>
----- Original Message -----
From: "Tim Taylor" <tim.taylor@iname.com>
To: <www-html@w3.org>
Sent: Saturday, June 17, 2000 7:01 AM
Subject: XHTML, content type, and content negotiation


| Is there any stance (official or unofficial) on how User Agents are
| supposed to process an XHTML document returned with a Content-Type of
| text/html?  What if the Content-Type is text/xml?  The XHTML 1.0 Spec is
| silent on this topic.

Here's my strictly unofficial opinion, speaking on behalf of nobody but
myself:

The browser should treat XHTML content served as 'text/html' as XML. Some
points:

* Everybody who writes XHTML does it "on purpose".

* They *want* strict parsing.

* The only reason they use 'text/html' is for their web
  pages to be backwards-compatible (and because
  the XHTML recommendation tells them to!).

The HTML (4.01, not XHTML 1.0) Recommendation doesn't say what user
agents should do when they encounter "bad" HTML. The closest thing I
could find was Appendix B, which only talks about unknown elements and
attributes. These should be ignored (but the content rendered).

A browser should never reject:

<p>foo
<p>blaa

(Not "well-formed" XML, but legal HTML, since the end tag of 'p' is optional.)

But it could, in theory, refuse a document with HTML like this ("tag soup"):

<p><b>foo bar <i>baz</b> xyzzy</p>

IMO, the world (wide web) would be a much better place if all browsers had
acted this way from the start -- it's too late now.
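
To make the distinction between the two snippets above concrete, here is a
minimal sketch in Python. It is not how any real browser works: it assumes a
tiny, hypothetical subset of the HTML 4 elements whose end tag may be omitted
(OPTIONAL_END) and ignores most other HTML rules. The names NestingChecker and
is_properly_nested are made up for this illustration.

```python
from html.parser import HTMLParser

# Assumption: a deliberately small subset of HTML 4 elements with optional end tags.
OPTIONAL_END = {"p", "li", "dt", "dd"}


class NestingChecker(HTMLParser):
    """Tracks open elements and records end tags that do not match."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        # A new <p> implicitly closes a directly enclosing open <p>.
        if tag in OPTIONAL_END and self.stack and self.stack[-1] == tag:
            self.stack.pop()
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            # The end tag does not close the innermost open element.
            self.errors.append(f"unexpected </{tag}>")


def is_properly_nested(markup):
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    # Elements with optional end tags may legally stay open at the end.
    leftover = [tag for tag in checker.stack if tag not in OPTIONAL_END]
    return not checker.errors and not leftover
```

With this sketch, is_properly_nested("<p>foo\n<p>blaa") is accepted, while the
"tag soup" line is flagged because </b> arrives while <i> is still open.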

Doing this today would of course be a very stupid thing to do; the browser
wouldn't render most pages out there. *But*, when it comes to XHTML, it's
*not* a stupid thing to do. XHTML *needs* strict parsing.

The XHTML specification tells us to use 'text/html'. This is a good thing,
since it lets us use XHTML while the pages stay backwards-compatible with
older user agents. Newer user agents, which "know" about XHTML, should still
treat the content as XML. There's no reason not to, since all XHTML web
pages will be valid -- there's no reason to be "backwards-compatible" with
malformed documents.

Browsers *must* refuse to render XML that is not well-formed. The latest HTML
standard is XHTML 1.0 -- HTML implemented as XML. Browsers should refuse to
render XHTML that is not well-formed, even when it's marked as 'text/html'.
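
As a rough illustration of what "refuse if not well-formed" means, here is a
minimal Python sketch using the standard library's non-validating XML parser.
The function name is_well_formed is hypothetical, and the check says nothing
about validity against the XHTML DTD, only about well-formedness.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the string parses as a well-formed XML document."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        # Any well-formedness error (unclosed or mis-nested tags, etc.).
        return False
    return True
```

A strict user agent in the sense described above would render the page only
when such a check succeeds, and show an error otherwise.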

| I'm specifically concerned about the following open Mozilla bug:
|
| <http://bugzilla.mozilla.org/show_bug.cgi?id=26022>
|
| The bug summary and description read:
|
| "XHTML 1.0 document with text/html media-type is treated as HTML 4.0
| document.
|
| Non-html tags in XHTML 1.0 document are ignored when the document
| lebeled with the Internet Mediatype "text/html". To be browsed old web
| browser, some XTHML documents are labeled with  "text/html", not labeled
| whith "text/xml". In new XHTML comformant browser  renders such
| documents as XHTML documents."
|
| Additional comments in the bug report indicate that Mozilla doesn't
| officially support XML,

That's not right. It fully supports XML. Actually, Mozilla's user interface is
written in XUL, which is an XML application.

| so technically it's behaving correctly as an
| "old web browser" [1].  However, Mozilla /will/ one day support XML.

And that's now! :)

| For future reference it would be helpful to know the appropriate
| behavior.  Ideally, this would be in the XHTML spec as it's an ambiguity
| that may interfere with content authors making a smooth transition from
| HTML to XML.  Specifically people advised to start authoring their
| content as XHTML "right now" so that the full transition to XML down the
| road will be easier will be in for a surprise when their XHTML documents
| appear broken in newer browsers.

Yup. That's why we need strict parsing on all XHTML documents.

| Currently, I see two interpretations for the behavior of User Agents
| that support both HTML and XML:
|
| User Agent A: ignores the Content-Type header, instead relies on the
| document content.  In this interpretation, the User Agent would treat
| the document as XML.

I prefer this. It ensures that a document which renders properly as
'text/html' in a browser that supports XHTML will also render in the same
browser when it's served as 'text/xml'. It also makes life much easier for
web authors (since they can easily check whether their documents are valid).
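
The difference between the two interpretations in this thread can be sketched
roughly like this in Python. Everything here is an assumption made for
illustration: the function names, the crude sniffing rule (an XML declaration
or the XHTML namespace near the start of the document), and the "xml"/"html"
return values. No browser is claimed to work this way.

```python
XHTML_NS = "http://www.w3.org/1999/xhtml"


def media_type(content_type: str) -> str:
    """Strip parameters such as '; charset=...' from a Content-Type value."""
    return content_type.split(";")[0].strip().lower()


def looks_like_xhtml(body: str) -> bool:
    """Crude sniffing: XML declaration or XHTML namespace near the start."""
    head = body.lstrip()[:1024]
    return head.startswith("<?xml") or XHTML_NS in head


def parser_for_agent_a(content_type: str, body: str) -> str:
    """Agent A: rely on the document content, fall back to the header."""
    if looks_like_xhtml(body):
        return "xml"
    return "xml" if media_type(content_type) == "text/xml" else "html"


def parser_for_agent_b(content_type: str, body: str) -> str:
    """Agent B: obey the Content-Type header, whatever the content is."""
    return "xml" if media_type(content_type) == "text/xml" else "html"
```

For an XHTML document served as 'text/html', agent A picks the strict XML
parser and agent B the lenient HTML one; for 'text/xml' both agree.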

| User Agent B: obeys the Content-Type header.  An XHTML document returned
| with the Content-Type text/html is treated as HTML.  An XHTML document
| returned with Content-Type text/xml is treated as XML.
|
| I prefer interpretation B.  I picture B's behavior used in conjunction
| with HTTP Content Negotiation (RFC 2295).  This is what I assumed XHTML
| was intended for all along.  I assumed that content authors could rely
| on default styling of HTML elements so long as the document was served
| as text/html.  Only if the document was served as text/xml would styling
| for all elements be necessary for proper rendering in UAs.

Why? The reason we have HTML and XHTML is to provide a fixed set of elements,
which the browser can render in a meaningful way. A speech browser can use a
different voice for headings (e.g. 'h1' elements); a graphical browser will
often render them with a bigger font size (and perhaps in a different colour).
Some browsers will be able to automatically generate a table of contents based
on heading elements. Users can use user style sheets to make sure all documents
are rendered in a way they like (e.g. all headings should be dark blue on a
white background). This is the power of a fixed set of elements. If browsers
didn't use a default "ua.css", there would really not be much reason to use
XHTML instead of just your own, private XML element set.
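
As one example of what a fixed element set buys you, here is a minimal Python
sketch of the table-of-contents idea. It assumes a well-formed XHTML document
that declares the XHTML namespace; the function name table_of_contents and the
(level, title) output format are made up for this illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Map the qualified names of h1..h6 to their heading level.
HEADINGS = {f"{{{XHTML_NS}}}h{level}": level for level in range(1, 7)}


def table_of_contents(xhtml: str):
    """Return (level, title) pairs for all headings, in document order."""
    root = ET.fromstring(xhtml)
    toc = []
    for element in root.iter():
        level = HEADINGS.get(element.tag)
        if level is not None:
            # itertext() also collects text inside inline children like <em>.
            title = "".join(element.itertext()).strip()
            toc.append((level, title))
    return toc
```

This works only because every XHTML document uses the same heading elements;
with a private XML element set the program couldn't know what a heading is.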

XML only says something about how a document should be written. The various
standards say how it should be rendered. For example, the following MathML:

<apply>
  <fn>
    <ci>f</ci>
  </fn>
  <ci>x</ci>
</apply>

should be rendered as f(x) (or perhaps spoken!).
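
A minimal Python sketch of such a rendering step, under the assumption that
only the three elements used above ('apply', 'fn' and 'ci') need to be handled;
the function name render is hypothetical and real MathML has many more
elements:

```python
import xml.etree.ElementTree as ET


def render(node) -> str:
    """Render a tiny subset of MathML content markup as plain text."""
    tag = node.tag.split("}")[-1]  # drop a namespace prefix if present
    if tag == "ci":
        return (node.text or "").strip()
    if tag == "fn":
        # 'fn' just wraps the expression that names the function.
        return render(node[0])
    if tag == "apply":
        # First child is the function, the rest are its arguments.
        operator, *arguments = list(node)
        return f"{render(operator)}({', '.join(render(a) for a in arguments)})"
    raise ValueError(f"unsupported element: {tag}")


example = ET.fromstring(
    "<apply><fn><ci>f</ci></fn><ci>x</ci></apply>"
)
text = render(example)
```

Here text ends up as the string "f(x)"; a speech browser would hand the same
tree to a different renderer.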

SVG is an XML application which defines how SVG should be rendered, i.e. as
images. In the same way, XHTML defines how HTML should be rendered (the way
it's defined in HTML 4.01).

-- 
Karl Ove Hufthammer
Received on Tuesday, 20 June 2000 15:53:27 UTC