Re: Why your XHTML article is wrong from Ian Hickson on 2002-11-20 (www-archive@w3.org from November 2002)

From: Ian Hickson <ian@hixie.ch>
Date: Wed, 20 Nov 2002 07:27:02 +0000 (GMT)
To: Aaron Swartz <me@aaronsw.com>
Cc: "www-archive@w3.org" <www-archive@w3.org>
Message-ID: <Pine.LNX.4.21.0211200508220.18728-100000@dhalsim.dreamhost.com>
First, let me be absolutely clear about what my opinion is, so that we
don't argue at cross-purposes.

I am in favour of XHTML itself, and have nothing against the technology.
The only thing I have a problem with is sending XHTML as text/html.

Second, let's make sure we agree on some core facts:

XHTML sent as text/html is treated as legacy tag soup by UAs. Legacy tag
soup does not support namespaces.

Only XHTML documents that are compatible with legacy tag soup (as defined
by XHTML 1.0 Appendix C) may be sent as text/html.

XML requires that UAs abort with a fatal error when parsing an ill-formed
document. XHTML is an XML application and thus all of XML's parsing rules
apply to XHTML.
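
As a rough illustration of that difference, here is a minimal Python
sketch. The standard-library parsers stand in for an XML UA and a tag
soup UA respectively (an assumption made purely for illustration; real
UAs do not use these), and the function and class names are made up.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# An ill-formed fragment: <b> and <i> overlap, and <p> is never closed.
ILL_FORMED = "<html><body><p>Hello <b><i>world</b></i></body></html>"


def parses_as_xml(markup):
    """Return True if the XML parser accepts the markup, False on a fatal error."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Record start tags the way a lenient tag soup parser reports them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_soup_tags(markup):
    """Feed the markup to the lenient parser and return the tags it saw."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags
```

The XML parser stops at the first well-formedness error, while the
lenient parser carries on and reports every start tag it finds.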


On Tue, 19 Nov 2002, Aaron Swartz wrote:
>>
>> It is suggested that authors should use HTML 4.01 instead of XHTML 
>> [...]
> 
> XHTML is simpler,

XHTML has fewer esoteric syntax rules, agreed.


> more aesthetically pleasing,

That is a subjective argument and not really one I am concerned about. All
XHTML documents can be mapped directly to equivalent HTML documents and
vice versa, meaning that either form can be used for content development,
which is the only time a format's aesthetic qualities matter.


> and works with deployed XML and HTML tools and specs (like
> namespaces).

I am not entirely sure what this means. Certainly, UAs do not support
namespaces in legacy tag soup documents (or XHTML documents sent as
text/html, which are treated as legacy tag soup documents).
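
A small Python sketch of what "does not support namespaces" means in
practice. The standard-library parsers are only stand-ins for an XML UA
and a tag soup UA, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

DOC = (
    '<html xmlns="http://www.w3.org/1999/xhtml" '
    'xmlns:m="http://www.w3.org/1998/Math/MathML">'
    "<body><m:math></m:math></body></html>"
)


def xml_element_names(markup):
    """Element names as a namespace-aware XML parser reports them."""
    return [element.tag for element in ET.fromstring(markup).iter()]


class SoupNames(HTMLParser):
    """Collect start-tag names the way a namespace-unaware parser sees them."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)


def soup_element_names(markup):
    """Return the start-tag names seen by the lenient parser."""
    parser = SoupNames()
    parser.feed(markup)
    parser.close()
    return parser.names
```

The XML parser resolves each prefix to its namespace URI. The lenient
parser sees "m:math" as an opaque element name and the xmlns attributes
as ordinary attributes.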

HTML works equally well with deployed SGML and HTML tools and specs.


>> you are [...] relying on their error handling.
> 
> I don't see why this is bad.

Because error handling is not defined anywhere, and you are therefore
relying on what is basically proprietary technology.


> Relying on such slack is how we can build backwards-compatible specs;

This is very much incorrect, so much so that an entire document metaformat
was developed with one overwhelming requirement: that error handling rules
be explicitly defined. (That format is now known as XML.)


> otherwise upgrading would require a flag day.

An example of a format which is backwards compatible due to good design is
CSS. It has _forward_ compatible parsing rules which ensure that any
conforming UA will treat any document in a predictable way. Thus upgrading
CSS does _not_ require a "flag day".
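
The principle can be sketched with a toy declaration parser in Python.
This is far simpler than the real CSS grammar (it splits naively on
";" and ":"), and the property set and function name are hypothetical,
but it shows the "ignore what you do not understand" rule.

```python
# Hypothetical set of properties understood by an older UA.
KNOWN_PROPERTIES = {"color", "margin", "font-size"}


def parse_declarations(block, known=KNOWN_PROPERTIES):
    """Parse 'name: value; ...' and keep only declarations the UA understands.

    Unknown properties and malformed declarations are skipped rather than
    treated as fatal, so a newer stylesheet degrades predictably.
    """
    accepted = {}
    for declaration in block.split(";"):
        name, separator, value = declaration.partition(":")
        name, value = name.strip().lower(), value.strip()
        if not separator or not name or not value:
            continue  # malformed declaration: ignore it and move on
        if name not in known:
            continue  # property this UA does not know: ignore it
        accepted[name] = value
    return accepted
```

Every UA applying the same rule drops the same declarations, so the
result is predictable even for stylesheets written against a later
level of CSS.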


HTML is a technology with undefined error handling. The slack is in fact
the source of most of the _incompatibilities_ between UAs.


>>  * <script> and <style> elements
> 
> I don't use any.

You are fortunate. Most people use both extensively.


>>  * Documents sent as text/html are handled as tag soup [2] by most UAs.
>>    This means that authors are not checking for validity,
> 
> That doesn't follow.

You are correct, I am missing a step in that argument.

Documents sent as text/html are handled as tag soup by most UAs. Most
authors only check that their documents look good in their UA of choice.
This means that most authors are not checking for validity.


>> the main advantage of using XHTML [is] that it has to be valid
> 
> That's pretty subjective; I don't consider that the main advantage.

It was one of the primary goals of XML's development, as discussed above.


>>  * If you ever switch your XHTML documents from text/html to text/xml,
>>    then you will in all likelihood end up with a considerable number
>>    of XML errors, meaning your content won't be readable by users.
>>    (Most XHTML documents do not validate.)
> 
> Sure, invalid documents are invalid. What does this have to do with 
> XHTML?

XHTML UAs _must_ refuse to render ill-formed documents, per the XML spec.
This does not apply to legacy tag soup (aka HTML) UAs.

This means that if ill-formed XHTML content is sent using an XML MIME type
to UAs, it will no longer be readable by users, as compliant UAs will
refuse to render the content.


>>  * A CSS stylesheet written for an HTML document has subtly different
>>    semantics [...]
> 
> I don't take advantage of these.

Surprisingly, this does indeed appear to be the case.


>>  * The only real advantage to using XHTML rather than HTML is that it
>>    is then possible to use XML tools with it. However, if tools are
>>    being used, then the same tools might as well produce HTML for you.
>>    Alternatively, the tools could take SGML as input instead of XML.
> 
> And tools could parse and produce TeX too. By your reasoning, it'd be 
> safe for the Web to move to TeX.

TeX is not semantically rich, so it is not even relevant here.


>>  * HTML 4.01 contains everything that XHTML contains,
> 
> HTML 4.01 doesn't allow namespaces.

Neither does XHTML sent as text/html.


>>  so there is little reason to use XHTML in the real world.
> 
> Even if the premise was true, that doesn't follow.

Assume for the moment that the premises are true. Why does it not follow?


>> UAs can't handle XHTML sent as text/html as XML
> 
> I agree with this, and I'd like to hear suggestions on how to address 
> this problem.

It's not a problem. Use application/xhtml+xml (or any of the other MIME
types suggested by RFC 3023).


> I have no problem sending my content with a special mime type to a
> client which will do the right thing with it, do you have code that
> will do this for me?

Yes. See:
   http://software.hixie.ch/utilities/cgi/xhtml-for-ie/

Alternatively, see:
   http://www.damowmow.com/playground/demos/mime-mod_rewrite/

This is a problem I've had to solve for people several times.
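
The general idea behind this kind of content negotiation can be
sketched as follows. This is a minimal Python illustration, not the code
at the URLs above; the function name is hypothetical and the handling
of q-values is simplified.

```python
XHTML_TYPE = "application/xhtml+xml"
HTML_TYPE = "text/html"


def choose_content_type(accept_header):
    """Return the XHTML MIME type only if the client explicitly lists it.

    Simplification: a type counts as accepted when it appears with a
    q-value greater than zero. Wildcards such as */* do not count, since
    a UA that sends only a wildcard has not said it can handle XML.
    """
    for item in accept_header.split(","):
        media_type, _, params = item.partition(";")
        if media_type.strip().lower() != XHTML_TYPE:
            continue
        quality = 1.0  # default quality when no q parameter is given
        for param in params.split(";"):
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    quality = float(value.strip())
                except ValueError:
                    quality = 0.0  # unparseable q-value: treat as refused
        if quality > 0:
            return XHTML_TYPE
    return HTML_TYPE
```

A server using something like this would set Content-Type from the
return value, and should also send "Vary: Accept" so that caches do not
hand the XML version to a UA that asked for text/html.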


> Kluge: I've set up http://xhtml.aaronsw.com/ to be identical to
> aaronsw.com except serve pages with an XHTML mime type. Doing this
> found a few problems, as your article suggests it would.

If you'd used HTML4, you would never have had to find problems, because
you would never have had to take existing content and change its MIME
type, effectively thrusting it into a world with new rules.

I am all in favour of using XHTML, _for new content_, sent as an XML MIME
type from the start.


> It also found a Mozilla bug:
> Test case: http://www.aaronsw.com/2002/fixedxmlns
> Do you want to file a bug?

That isn't a bug. Mozilla is a non-validating parser, and as such does not
have to do attribute defaulting.

Anyway, that document is invalid.


>> There are few advantages to using XHTML if you are sending the content
>> as text/html, and many disadvantages.
> 
> You have not listed one disadvantage.

Correction, you have not agreed to one disadvantage. I have listed at
least six, including:

 1. Relying on proprietary error handling technology
 2. Syntactic differences such as <script> and <style> content models
 3. Lack of syntax checking in UAs
 4. Switching to the right MIME type causes problems
 5. Differences in CSS semantics
 6. Differences in DOM semantics


>> Authors should stick to writing valid HTML 4.01 for the time being.
>> Once user agents that support XML and XHTML sent as text/xml are
>> widespread, then authors may reconsider learning and using XHTML.
> 
> This makes no sense. We know that we will have to rewrite HTML pages to 
> be XHTML.

Why? Why would you ever want to do this? Assuming Google is complete (it's
not) there are approximately three billion pages out there, of which the
overwhelming majority is HTML. Even if we assume that only a third of
those are HTML (and the likely number is more like 99%, if not higher, and
that's not even counting the invalid documents labelled as XHTML which
will also need to be corrected before "switching" to XHTML), that's still
one billion HTML documents.

Why do you think we'll need to rewrite these billion documents?

-- 
Ian Hickson                                      )\._.,--....,'``.    fL
"meow"                                          /,   _.. \   _\  ;`._ ,.
http://index.hixie.ch/                         `._.-(,_..'--(,_..'`-.;.'
Received on Wednesday, 20 November 2002 02:27:05 UTC