Re: Document without charset

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Fri, 9 Dec 2005 08:26:30 +0200 (EET)
To: www-validator@w3.org
Message-ID: <Pine.GSO.4.63.0512090803560.6782@korppi.cs.tut.fi>

On Thu, 8 Dec 2005, Jirka Kosek wrote:

> Jukka K. Korpela wrote:
>
>> Apparently the validator uses UTF-8 as the implied default.
>> 
>> The choice is impractical
>
> I can't recall the RFC number off the top of my head, but the HTTP protocol
> assumes ISO-8859-1 as the default for all text/* media types.

HTTP/1.1 is RFC 2616. The clause you are referring to is 3.7.1.

> So it is not "impractical", it is clearly a bug.

There is a definite contradiction between the HTTP protocol definition and 
the HTML 4.01 specification. This has been discussed on different fora
several times. The consensus is that HTML as a higher-level protocol 
trumps the transfer protocol. The HTML 4.01 specification, in clause 
5.2.2, discusses this very theme and concludes: "user agents must not 
assume any default value for the 'charset' parameter". (This does not 
exclude the possibility of ultimately falling back to a default, which may 
depend on the user agent. It just means that in the absence of an HTTP
header with a 'charset' parameter, user agents must not assume ISO-8859-1
or any other 'charset' value but must proceed to other sources of
information, such as a <meta> tag.)
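
As a rough illustration of that lookup order, here is a minimal Python sketch.
The function names, the regular expression and the 1024-byte scan window are
my own assumptions for illustration, not anything taken from the validator or
from a specification. It checks the HTTP 'charset' parameter first, then a
<meta> declaration, and only then hands back whatever fallback the caller
chose, without silently assuming ISO-8859-1.

```python
import re
from email.message import Message
from typing import Optional, Tuple

# Hypothetical pattern: finds charset=... inside a <meta ...> tag.
_META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE
)


def charset_from_header(content_type: Optional[str]) -> Optional[str]:
    """Return the lowercased 'charset' parameter of a Content-Type value, if any."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def resolve_html_charset(
    content_type: Optional[str],
    body: bytes,
    fallback: Optional[str] = None,
) -> Tuple[Optional[str], str]:
    """Return (charset, source); source is 'http', 'meta' or 'fallback'."""
    # 1. An explicit HTTP 'charset' parameter has the highest priority.
    header = charset_from_header(content_type)
    if header:
        return header, "http"
    # 2. No header charset: do NOT assume ISO-8859-1, look at <meta> instead.
    match = _META_CHARSET.search(body[:1024])  # scan window is an assumption
    if match:
        return match.group(1).decode("ascii").lower(), "meta"
    # 3. Nothing declared: the caller's own fallback applies (may be None).
    return fallback, "fallback"
```

The source label is returned alongside the charset so that a caller can warn
when it ended up on the fallback branch.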

> That's why text/xml was superseded by application/xml, where no such
> default is assumed.

The media types for XML are a mess, as you can see from RFC 3023.
The type text/xml has not been superseded; it is an alternative that
can be used - and _should_ be used under some conditions (if we take
RFC 3023 seriously).

> If there were no charset parameter, ISO-8859-1 should be 
> assumed from HTTP point of view,

But not by the definition of the text/xml media type in RFC 3023, which 
specifies US-ASCII as the default when text/xml is transmitted over HTTP
without a 'charset' parameter. Thus, if you submit an XML document to
a validator without 'charset', then the validator is formally required to
treat it as US-ASCII. (I hope the validator doesn't actually behave that way,
or at least that it issues an adequate warning.)
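
To make the conflicting defaults concrete, here is a small Python sketch of
what each specification mentioned above implies when the 'charset' parameter
is missing. The table and function names are hypothetical, and the mapping is
only my reading of RFC 2616 (clause 3.7.1), RFC 3023 and HTML 4.01 (clause
5.2.2), not the behavior of any actual validator.

```python
from typing import Optional

# Defaults implied by the media type's own definition when 'charset' is absent.
# None means "no default": the document itself has to be inspected.
SPEC_DEFAULTS = {
    "text/xml": "us-ascii",   # RFC 3023, also when sent over HTTP
    "text/html": None,        # HTML 4.01, 5.2.2: assume no default
    "application/xml": None,  # no default from the type; XML's own rules apply
}


def implied_charset(media_type: str) -> Optional[str]:
    """Return the charset a strict reading of the specs implies, or None."""
    mt = media_type.strip().lower()
    if mt in SPEC_DEFAULTS:
        return SPEC_DEFAULTS[mt]
    if mt.startswith("text/"):
        return "iso-8859-1"   # RFC 2616, 3.7.1: generic text/* default
    return None


def decodes_under(data: bytes, charset: str) -> bool:
    """Report whether the bytes are valid in the given charset."""
    try:
        data.decode(charset)
        return True
    except UnicodeDecodeError:
        return False
```

For example, decodes_under("é".encode("utf-8"), implied_charset("text/xml"))
is False: a UTF-8 XML document with any non-ASCII character, served as
text/xml without 'charset', is formally not decodable under the default.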

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
Received on Friday, 9 December 2005 06:30:01 GMT
