Re: flakey charset detection from David Brownell on 2002-12-04 (www-validator@w3.org from December 2002)

From: David Brownell <david-b@pacbell.net>
Date: Wed, 04 Dec 2002 15:34:46 -0800
To: Terje Bless <link@pobox.com>
Cc: W3C Validator <www-validator@w3.org>, Karl Dubost <karl@w3.org>
Message-id: <3DEE9116.2000406@pacbell.net>

Terje Bless wrote:
> 
> I suspect the problem here was that HTML 4.01 was trying to fix something
> that it was not in their purview to fix; namely the poor suitability of
> ISO-8859-1 as a default for many web documents.

As a default it's pretty good, but a lot of broken systems (browsers and
servers both) got shipped that didn't work that way, for instance by
letting the default be overridden on servers and having browsers use
charset=.

I agree that changing HTTP or MIME was not in that purview.  Fixing broken
browsers was, though, and requiring charset= would have been too:

> It is highly unfortunate IMO that they chose to do this by overriding HTTP
> instead of adding an additional requirement that HTML 4.01 served over HTTP
> must explicitly set a character encoding; or by simply punting the issue
> back to where it belongs, namely the HTTP specification.

Considering that the "HTTP plus HTML" crew was responsible for most of
this braindamage in the first place, this is pitiful!  MIME has always
said the default for "text" is ASCII, but when HTTP was first written
up it changed that to "iso-8859-1", mostly to benefit HTML (and, as a
side effect, prevent re-use of existing MIME libraries).

So HTTP doesn't really do MIME, because of that, and now it seems like
HTML won't really do HTTP any more either!

> Of course, there is the strong implication that a document that does not
> explicitly specify its encoding is invalid and unparseable, but this is
> wholly intentional given the state of character encoding issues.

So far the *ONLY* user agent I've ever seen that has any problem parsing
such a document is the current W3C validator.  And that's rather new
behavior, with only one weak standards leg to stand on (HTML 4).

Also ... it says that it tried the Appendix F rules for XML, so either it
should NOT do that (effect is to detect encodings like UTF-16 which all
standards agree "must" be explicitly labeled) OR it should also be trying
the standard HTTP rule like it used to, and like other user agents.  It's
not even a useful "pedantic mode" default.

- Dave

Received on Wednesday, 4 December 2002 18:30:16 UTC