Re: default charset broken from Terje Bless on 2003-06-07 (www-validator@w3.org from June 2003)

From: Terje Bless <link@pobox.com>
Date: Sat, 7 Jun 2003 18:41:41 +0200
To: W3C Validator <www-validator@w3.org>
cc: Kjetil Torgrim Homme <kjetilho@ifi.uio.no>
Message-ID: <f02000001-1026-EAF4ACDA990611D7B1DF0030657B83E8@[193.157.66.23]>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Kjetil Torgrim Homme <kjetilho@ifi.uio.no> wrote:

>a Standards Track RFC can't be overridden by a Recommendation from W3C.

See my previous message on this issue.

>>This new behaviour goes some way towards addressing your concern, but
>>you will still find your documents labelled Invalid unless you specify
>>a character encoding.
>
>my document is valid, so this is incorrect behaviour.

The latter part of your sentence hinges on the former and the former is still
up for debate. :-)

>I don't subscribe to cargo cult coding, and I don't care about catering
>to broken software.  also note that this paragraph wasn't in the
>original HTTP/1.1 RFC, and the text in 5.2.2 has not changed since HTML
>4.0 of December 1997.

I see no relevance to this other than to support the view that the HTTP WG
also meant for charset to be explicitly specified unless there was some
specific and overriding reason not to (i.e. «SHOULD»).

>furthermore, configuring Apache to set include charset=iso-8859-1 for
>all files of type text/html will make it impossible for a document to
>use a different charset since it overrides META HTTP-EQUIV.  (another
>poor choice in the HTML Recommendation, IMHO).

Nonsense. In Apache you would use AddDefaultCharset for iso-8859-1 and use
content negotiation to select between e.g. index.html.utf-8 and
index.html.iso-8859-1 (or between "index.html.utf-8" and "" ;D).

>>[3] - <http://validator.w3.org:8001/>. Feedback encouraged!
>
>well, it didn't process http://www.usenet.no.  in fact it assumed UTF-8,
>which there is no basis for doing at all.  IMO, that's a further
>regression.

This is a separate issue. Given the conclusion that the defaults in HTTP and
HTML are to be ignored, we come to the decision on how to treat such documents
in the implementation. In a browser, the most permissive behaviour may be
assumed (with extensive sniffing/guessing to arrive at the least wrong
encoding), but in a Validator we can either reject it out of hand (as we did
in previous versions) or find some defaulting behaviour that is most useful
while still supporting the conclusion that it isn't valid.

Defaulting to UTF-8 is intended to be the least-wrong error recovery procedure
(given its inclusiveness and wide applicability in non-European/North-American
contexts), but the result can never say authoritatively that the page is valid
or invalid since we didn't have enough information to reliably validate it
(i.e. the result is guesswork).
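
To make the recovery procedure concrete, here is a rough sketch in Python. It
is not the Validator's actual code; the function names, the META regex and the
4096-byte scan window are all assumptions made for illustration. It looks for
an explicit charset in the HTTP Content-Type header, then in a META element,
and only then falls back to UTF-8, flagging the result as a guess so that it
is never reported as authoritative.

```python
import re
from email.message import Message

# Assumption for this sketch: UTF-8 is the error-recovery fallback.
FALLBACK_CHARSET = "utf-8"

# Hypothetical, deliberately loose pattern for <meta ... charset=...>.
META_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9._:-]+)""", re.I
)


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    if not header_value:
        return None
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()  # lowercased, None if absent


def resolve_charset(content_type, body):
    """Return (charset, explicit).

    explicit is False when nothing was declared and the fallback was used,
    in which case any validation result is only tentative.
    """
    charset = charset_from_content_type(content_type)
    if charset:
        return charset.lower(), True
    match = META_RE.search(body[:4096])  # scan only the start of the document
    if match:
        return match.group(1).decode("ascii").lower(), True
    return FALLBACK_CHARSET, False
```

A caller would treat `explicit == False` as "encoding not declared": parse the
document as UTF-8 to produce useful diagnostics, but report the outcome as
tentative rather than as Valid or Invalid.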

- -- 
Editor's note: in the last update,   we noted that Larry Wall would "vomment"
on existing RFCs. Some took that to be a cross between "vomit" and "comment."
We are unsure of whether it was a subconscious slip or a typographical error.
We are also unsure of whether or not to regret the error.     -- use.perl.org

-----BEGIN PGP SIGNATURE-----
Version: PGP SDK 3.0.2

iQA/AwUBPuIVxKPyPrIkdfXsEQJ1BACdEyiEDGFF8Cc0tW/mUOpt2YA1AgYAoNFR
kl7Mu9t2nlUecI18pAJnVKzc
=HsK2
-----END PGP SIGNATURE-----

Received on Saturday, 7 June 2003 12:41:45 UTC