Re: default charset broken from Terje Bless on 2003-06-08 (www-validator@w3.org from June 2003)

From: Terje Bless <link@pobox.com>
Date: Sun, 8 Jun 2003 07:54:48 +0200
To: W3C Validator <www-validator@w3.org>
Message-ID: <f02000001-1026-B6C4B3F0997511D7B1DF0030657B83E8@[193.157.66.23]>
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

[ no longer CCing as you're subscribed IIRC ]

Karl Ove Hufthammer <karl@huftis.org> wrote:

>But in my opinion the main problem is that the validator is labeling
>perfectly valid documents as invalid. I think this is more serious than
>not labeling invalid documents as invalid because of character encoding
>issues.

I disagree. I think the worst thing we could do is label something as valid if
there is a chance that it isn't. And as mentioned, the validator is saying
that "I cannot pronounce this to be valid because you did not give me enough
information to reliably test it.", not "This document isn't valid".

I'll accept a charge that the way this is presented needs work, though, if
this distinction isn't already clear?


>>In particular, if we allow for your interpretation above, we would in
>>effect default to ISO-8859-1 not only for pages such as Kjetil's (who
>>are most certainly correct and the author very aware of what he is
>>doing), but also for Joe Web-duh-signer and his clueless little hosting
>>company where there is _no_ conscious decision involved and ISO-8859-1
>>is the _wrong_ value more often then not.
>
>'More often than not'? Isn't ISO-8859-1 the most used encoding for valid
>documents?

ISO-8859-1 is presumably the most widely used encoding in Europe and North
America, and I would assume for both valid and invalid documents (modulo the
Windows-1252 and MacRoman documents, which I lump in with ISO-8859-1 for
simplicity).

But if you look at the __World_Wide__ Web I think it is highly unlikely that
ISO-8859-1 is the correct encoding for the majority of pages in general, and
it grows less likely by the minute as Asia, Latin America, and Africa come
on-line.


>And if there are '_no_ conscious decision involved', I doubt the
>Web pages would be valid even with an explicit character encoding
>declaration.

Character Encoding is far more esoteric and obscure -- i.e. harder to make
people aware of -- than markup validity. I see a lot of people who obviously
have little technical understanding of the markup flavour they're using, and
who are trying to pass validation (because this is obvious), but who I highly
doubt have ever considered, or ever will consider, what encoding they are
using unless the validator tells them there is a problem.



BTW, I'm somewhat playing devil's advocate in this thread. If I were
developing the validator as an in-house tool I would have implemented HTTP
defaulting rules -- and bugger the HTML 4.0 Rec -- and just noted the lack of
an explicit encoding, and possibly checked for signs of Win-1252/MacRoman. The
above is more or less where we ended up after discussion and not where I
originally started out. You may detect artifacts of this scattered around my
arguments. :-)
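
To make the "in-house tool" behaviour above concrete, here is a rough sketch.
It is an illustration only and not validator code: the function name
guess_charset and the wording of the notes are made up for this example, it
looks only at the HTTP Content-Type header (no <meta> or XML declaration), and
the Win-1252/MacRoman check is a crude heuristic based on bytes in the
0x80-0x9F range, which are control characters in ISO-8859-1 but printable in
those two encodings.

```python
import re


def guess_charset(content_type, body):
    """Return (charset, notes) for a text/* response.

    content_type: value of the HTTP Content-Type header, or None.
    body: the raw response body as bytes.
    Hypothetical helper for illustration; not part of any real validator.
    """
    notes = []

    # Use the explicit charset parameter if the server sent one.
    match = re.search(r'charset\s*=\s*"?([^";\s]+)', content_type or "",
                      re.IGNORECASE)
    if match:
        return match.group(1).lower(), notes

    # HTTP/1.1 defaulting rule for text/* without a charset parameter.
    notes.append("no explicit charset; assuming iso-8859-1 (HTTP default)")

    # Bytes 0x80-0x9F are C1 controls in ISO-8859-1, so their presence
    # hints that the page is really Windows-1252 or MacRoman.
    if any(0x80 <= b <= 0x9F for b in body):
        notes.append("bytes in 0x80-0x9F found; "
                     "possibly windows-1252 or MacRoman")

    return "iso-8859-1", notes
```

Calling guess_charset("text/html", body) on a page served without a charset
parameter would fall back to iso-8859-1 and record a note, rather than
refusing to validate.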

- -- 
"I don't want to learn to manage my anger;
 I want to FRANCHISE it!" -- Kevin Martin

-----BEGIN PGP SIGNATURE-----
Version: PGP SDK 3.0.2

iQA/AwUBPuLPp6PyPrIkdfXsEQIb1wCfemfF5KRisMIQPnE8FGJxQLiMZ2cAoJBX
TkFDtDmTSzedRFF1doWXvj1N
=lefY
-----END PGP SIGNATURE-----
Received on Sunday, 8 June 2003 01:54:52 UTC