Re: default charset broken from Martin Duerst on 2003-06-08 (www-validator@w3.org from June 2003)

From: Martin Duerst <duerst@w3.org>
Date: Sun, 08 Jun 2003 16:35:18 -0400
To: Kjetil Torgrim Homme <kjetilho@ifi.uio.no>, W3C Validator <www-validator@w3.org>, Terje Bless <link@pobox.com>
Message-Id: <4.2.0.58.J.20030608161446.0749bb20@localhost>

At 19:34 03/06/07 +0200, Kjetil Torgrim Homme wrote:

>[Terje Bless]:

> >   I see no relevance to this other than to support the view that the
> >   HTTP WG also meant for charset to be explicitly specified unless
> >   there was some specific and overweighing reason not to
> >   (i.e. «SHOULD»).
>
>the relevance was that the HTML spec ignored the text of RFC 2068,
>which is even stronger than RFC 2616.

That's because in RFC 2068, and even still in RFC 2616, there
is some language about user agents that get completely confused
if they see a 'charset' parameter in a Content-Type: header.
As far as I remember, that was Netscape 2 and friends.
I haven't heard about such user agents for more than 5 years.

> >   > furthermore, configuring Apache to set include
> >   > charset=iso-8859-1 for all files of type text/html will make it
> >   > impossible for a document to use a different charset since it
> >   > overrides META HTTP-EQUIV.  (another poor choice in the HTML
> >   > Recommendation, IMHO).
> >
> >   Nonsense. In Apache you would use AddDefaultCharset for
> >   iso-8859-1 and use Content-Negotiation to select between
> >   e.g. index.html.utf-8 and index.html.iso-8859-1 (or between
> >   "index.html.utf-8" and "" ;D).

You can also use AddCharset or AddType on specific files
or directories. No need to talk about negotiation.

>my point stands, META can no longer be used.  but this is not
>important.

There are two ways to run things if you have variation.
Either use HTTP headers, or use META. This decision should
be made on a per-directory basis.
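
To illustrate why the two mechanisms should not be mixed within one
directory, here is a minimal Python sketch of the precedence a consumer
following HTML 4.01 would apply: a charset parameter in the HTTP header
wins, and META is consulted only when the header has none. The function
names, the regular expression, and the 1024-byte scan window are
assumptions made for this illustration, not the validator's actual code.

```python
import re
from email.message import Message

# Rough pattern for a charset declared inside a META element (illustrative only).
META_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def header_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_param("charset")


def effective_charset(content_type, body):
    """Return (charset, source), giving the HTTP header priority over META."""
    from_header = header_charset(content_type)
    if from_header:
        return from_header.lower(), "http"
    # Only look at the start of the document, where META is expected.
    match = META_RE.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower(), "meta"
    return None, "unknown"
```

With a server-wide charset=iso-8859-1 in the header, a META declaring
utf-8 in the same document is never consulted, which is the situation
Kjetil describes.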

> >   Defaulting to UTF-8 is intended to be the least-wrong error
> >   recovery procedure (given its inclusiveness and wide applicability
> >   in non-european/north-american contexts), but the result can never
> >   say authoritatively that the page is valid or invalid since we
> >   didn't have enough information to reliably validate it (i.e. the
> >   result is guesswork).
>
>thank you for the explanation, I don't object to that behaviour.

Terje, I'm not sure defaulting to UTF-8 is the best solution.
One problem is that assuming UTF-8 will create a lot of 'decoding
problems' if it's not UTF-8. If the user says it's UTF-8, then
it's a good idea to report all these, but if the user doesn't
claim it's UTF-8, producing all these errors will be confusing.
Maybe we should think this through a bit more.
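
As a rough illustration of the concern, the following Python sketch
counts the positions at which strict UTF-8 decoding fails. The function
name and the sample word are assumptions chosen for this example; it is
not how the validator reports errors.

```python
def utf8_decoding_errors(data):
    """Return the byte offsets at which strict UTF-8 decoding fails."""
    offsets = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice being decoded.
            offsets.append(pos + exc.start)
            pos += exc.end  # skip the offending bytes and continue
    return offsets


# A Norwegian word with three non-ASCII letters, sent as ISO-8859-1.
latin1_bytes = "bl\u00e5b\u00e6rsyltet\u00f8y".encode("iso-8859-1")
print(utf8_decoding_errors(latin1_bytes))
```

Every non-ASCII letter in the ISO-8859-1 input is reported as a separate
decoding error, so an unlabelled Western European page of ordinary
length could easily produce dozens of such messages.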

Regards,   Martin.

Received on Sunday, 8 June 2003 17:32:41 UTC