Re: flakey charset detection from Martin Duerst on 2002-12-05 (www-validator@w3.org from December 2002)

From: Martin Duerst <duerst@w3.org>
Date: Fri, 06 Dec 2002 06:26:43 +0900
To: David Brownell <david-b@pacbell.net>, Karl Dubost <karl@w3.org>
Cc: www-validator@w3.org
Message-Id: <4.2.0.58.J.20021206061347.051e9dc0@localhost>

Hello David,

Many thanks for your comments.

At 12:15 02/12/04 -0800, David Brownell wrote:

>Karl Dubost wrote:
>>At 8:01 -0800 2002-12-04, David Brownell wrote:
>>
>>>I recently validated a xhtml 1.0 page that used to validate just fine, and
>>>instead, I got a message that said things like:
>>
>>Could you give an URI of your document?
>
>http://xmlconf.sourceforge.net/xml/
>
>... you'll notice it's "Content-Type: text/html", which is specified
>(see http://www.ietf.org/rfc/rfc2854.txt section 6) to mean "iso-8859-1".

To quote from RFC 2854:

 >>>>
6. Charset default rules

    The use of an explicit charset parameter is strongly recommended.
    While [MIME] specifies "The default character set, which must be
    assumed in the absence of a charset parameter, is US-ASCII."  [HTTP]
    Section 3.7.1, defines that "media subtypes of the 'text' type are
    defined to have a default charset value of 'ISO-8859-1'".  Section
    19.3 of [HTTP] gives additional guidelines.  Using an explicit
    charset parameter will help avoid confusion.

    Using an explicit charset parameter also takes into account that the
    overwhelming majority of deployed browsers are set to use something
    else than 'ISO-8859-1' as the default; the actual default is either a
    corporate character encoding or character encodings widely deployed
    in a certain national or regional community. For further
    considerations, please also see Section 5.2 of [HTML40].
 >>>>

So what does this say? It says that MIME says US-ASCII, HTTP says
ISO-8859-1, and HTML says that you can't count on a default.
It in no way says that the default is ISO-8859-1.
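
To make the three positions concrete, here is a minimal sketch in Python.
The names SPEC_DEFAULTS and effective_charset are made up for this
illustration; they come from no spec and are not the validator's code. The
only point is that without an explicit charset parameter, the answer
depends on which spec you ask, and for HTML there is no answer at all.

```python
from email.message import Message

# Defaults as summarized in RFC 2854 section 6 (None = no reliable default).
SPEC_DEFAULTS = {
    "MIME": "us-ascii",
    "HTTP": "iso-8859-1",
    "HTML": None,
}

def explicit_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value and returns None if absent.
    return msg.get_content_charset()

def effective_charset(content_type, spec):
    """Hypothetical helper: explicit charset wins, else the spec's default."""
    return explicit_charset(content_type) or SPEC_DEFAULTS[spec]

if __name__ == "__main__":
    for spec in SPEC_DEFAULTS:
        print(spec, effective_charset("text/html", spec))
    print(effective_charset("text/html; charset=UTF-8", "HTML"))
```

With an explicit charset parameter all three lookups agree, which is why
the RFC recommends sending one.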

And if you check reality (around the world, not only in your
neighborhood), you will find that the HTML spec is much
closer to reality than the HTTP spec.
In some cases, e.g. for valid HTML, it makes a lot of sense to
work on moving reality closer to the specs. In other cases,
it makes more sense to move the specs towards reality.
This is such a case.

>>>p.s. Given that it's XHTML, I find the fact that it even _tried_
>>>      using the META element to be worrisome ... that means that
>>>      parsing this document as XML could give different results,
>>>      which breaks all XHTML goals I ever heard.  Not that I've
>>>      tracked XHTML recently, but this seems like trouble.

I would have to go and check the source to see whether it is indeed
checking META, but as you serve the page as text/html, that does not
seem inappropriate. It would complain if it found
contradictory info. The message you saw may also be a generic
one that covers several cases.
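
As an illustration only (I have not checked the validator source, so this
is not its actual logic), a check for contradictory info between the HTTP
header and META could look like the Python sketch below. The function
names and the deliberately simple regular expression are assumptions of
this sketch, not a real HTML parser.

```python
import re

# Simplified pattern for <meta ... content="...; charset=XYZ">.
# Assumption: good enough for a sketch, not a substitute for real parsing.
META_RE = re.compile(
    rb'<meta[^>]+content\s*=\s*["\'][^"\']*charset=([A-Za-z0-9._:-]+)',
    re.IGNORECASE,
)

def meta_charset(html_bytes):
    """Return the charset named in a META element, lowercased, or None."""
    match = META_RE.search(html_bytes)
    return match.group(1).decode("ascii").lower() if match else None

def check_charset(http_charset, html_bytes):
    """Return (charset, conflict) for a document served as text/html.

    charset is None when neither source names one; conflict is True
    when both sources name a charset and the two disagree.
    """
    http = http_charset.lower() if http_charset else None
    meta = meta_charset(html_bytes)
    conflict = bool(http and meta and http != meta)
    return (http or meta, conflict)

if __name__ == "__main__":
    page = (b'<meta http-equiv="Content-Type" '
            b'content="text/html; charset=iso-8859-1" />')
    print(check_charset(None, page))
    print(check_charset("utf-8", page))
```

Here the HTTP value is preferred when both are present, and the conflict
flag is what a checker would turn into a complaint.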

Regards,   Martin.

Received on Thursday, 5 December 2002 16:26:53 UTC