Re: validator.w3.org and application/xhtml+xml from Terje Bless on 2002-06-29 (www-validator@w3.org from June 2002)

From: Terje Bless <link@pobox.com>
Date: Sat, 29 Jun 2002 09:47:43 +0200
To: William F Hammond <hammond@csc.albany.edu>
cc: www-validator@w3.org
Message-ID: <r01050300-1015-B06D7DE88B3411D690D900039300CF5C@[193.157.66.10]>

William F Hammond <hammond@csc.albany.edu> wrote:

>>Could you please send this to www-validator so it doesn't get lost?

Yes, please do send these issues to www-validator@w3.org;
precisely so they do not get lost or forgotten. :-)

>Shouldn't the W3C validator attempt to parse any content submitted as
>text/html (RFC 2854), text/xml (RFC 3023), application/xml (RFC 3023),
>or application/xhtml+xml (RFC 3236)?

The application/xhtml+xml media type was not defined when the last update
of the public version of the Validator was released (some would argue that
it still isn't in any meaningful way![0]). The current development version
-- which can usually be found on http://validator.w3.org:8001/ -- should
support application/xhtml+xml (but I haven't checked in a while so it may
be broken). The other content types should be supported in the public
version of the Validator. Please let me know if any are missing!

>Isn't it assumed for text/html transfer that any necessary non-default
>encoding information is to be derived from a "charset" spec in the
>Content-Type transfer header?

The W3C and the IETF have incompatible specifications of HTTP defaulting
rules in the text/* media type hierarchy. The IETF specifies "ISO-8859-1"
as the default and the W3C recommends applying no defaulting rules. At the
same time, the W3C seems fond of the "Chicken _And_ Egg" practice of
specifying the encoding information _inside_ the encoded entity (cf. HTML's
"meta" element, and XML's "encoding" attribute), and providing defaulting
rules in the absence of explicit encoding indication.

The only sane course of action is to apply _no_ defaulting rules and refuse
to attempt validation of any resource whose character encoding is not
explicitly labelled. This unfortunately means that our heuristics for
identifying the "Chicken&Egg Encoding" are less than ideal and will fail
when a document relies on the XML defaulting rules.
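
To make the "no defaulting" policy concrete, here is a minimal Python sketch.
It is not the Validator's actual code; the function name, the exception, the
regular expressions and the 1024-byte sniffing window are all assumptions made
for illustration. It accepts an encoding only when it is explicitly labelled,
either in the Content-Type header or inside the entity itself, and refuses
otherwise.

```python
import re


class UnlabelledEncodingError(ValueError):
    """Raised when no explicit character encoding label can be found."""


# Hypothetical patterns, deliberately simple; real-world input is messier.
_HEADER_CHARSET = re.compile(r'charset\s*=\s*"?([A-Za-z0-9._:-]+)"?', re.I)
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.I)


def explicit_charset(content_type, body):
    """Return the explicitly labelled charset, lower-cased.

    Looks at the Content-Type header first, then at an XML declaration
    or an HTML "meta" element near the start of the body. Applies no
    default: raises UnlabelledEncodingError if nothing is labelled.
    """
    match = _HEADER_CHARSET.search(content_type or "")
    if match:
        return match.group(1).lower()

    # "Chicken and egg": sniff the first bytes, assuming the label itself
    # is written in an ASCII-compatible encoding.
    head = body[:1024]
    match = _XML_DECL.search(head) or _META_CHARSET.search(head)
    if match:
        return match.group(1).decode("ascii").lower()

    raise UnlabelledEncodingError("no explicit character encoding label")
```

Note that an XML document with no "encoding" in its declaration is rejected
here, even though XML's own rules would let a parser assume a default. That
is the failure case described above.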

This code is up for review in the not too distant future -- recent
developments have (potentially) badly broken all of Martin Dürst's hard
work in that area -- and it is possible that this situation can be improved
then.

Most likely this will be achieved by making anything served as text/html
a special case, with "Tagsoup Semantics", and then making XML syntax and
semantics the "primary" method for dealing with character encoding
determination. Unfortunately, the deployment of resources with the
application/xhtml+xml media type is likely to be pretty much equal to "nil"
until certain popular UAs decide to treat it in a reasonable manner.
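
One way to picture that split is a dispatch on the media type. The Python
sketch below is only an illustration of the idea, not a description of how
the Validator is or will be implemented; the function name, the returned
labels and the "+xml" suffix check are assumptions.

```python
# Media types named in this thread that would follow XML rules.
XML_TYPES = {"text/xml", "application/xml", "application/xhtml+xml"}


def encoding_rules_for(content_type):
    """Pick which encoding-determination rules apply to a Content-Type.

    Returns "tagsoup" for text/html (the special case), "xml" for the
    XML media types, and "unsupported" for anything else.
    """
    # Drop parameters such as "; charset=..." and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "text/html":
        return "tagsoup"
    if media_type in XML_TYPES or media_type.endswith("+xml"):
        return "xml"
    return "unsupported"
```

With this shape, text/html is the only branch that needs to look for a
"meta" element; everything else goes through the XML path.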

-- 
Everytime I write a rhyme these people thinks its a crime
I tell `em what's on my mind. I guess I'm a CRIMINAL!
I don't gotta say a word I just flip `em the bird and keep goin,
I don't take shit from no one. I'm a CRIMINAL!

Received on Saturday, 29 June 2002 03:49:35 UTC