Re: Fallback to UTF-8 from Andreas Prilop on 2007-11-29 (www-validator@w3.org from November 2007)

From: Andreas Prilop <aprilop2007@trashmail.net>
Date: Thu, 29 Nov 2007 16:25:32 +0100 (MET)
To: www-validator@w3.org
Message-ID: <Pine.GSO.4.63.0711291614110.3541@s5b004.rrzn.uni-hannover.de>

On Thu, 29 Nov 2007, Olivier Thereaux wrote:

>> Given a webpage that does not specify any encoding (charset).
>> Then validator.w3.org reports:
>>
>> (1) No Character Encoding Found! Falling back to UTF-8.
>>
>> (2) Sorry, I am unable to validate this document because on line ...
>>     it contained one or more bytes that I cannot interpret as utf-8
>>
>> This makes no sense; and it doesn't help the user.
>
> You're not suggesting a better procedure, either.

OK, here are my suggestions:

(a) Immediately report "This document cannot be checked" without any
    reference to UTF-8. Since the document cannot be interpreted as
    UTF-8, "charset=utf-8" was most probably not the author's
    intention.

OR

(b) Take ISO-8859-1 as fallback encoding (the default of RFC 2616).
    This will "work" if no bytes from 0x80 to 0x9F are present -
    hence with many of the traditional 8-bit character sets.
    Otherwise (if some bytes from 0x80 to 0x9F are found),
    give the usual errors about "non SGML character number ..."
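
Suggestion (b) could be sketched roughly as follows. This is only an
illustration in Python, not the validator's actual code; the function name
`check_latin1_fallback` and its return shape are hypothetical. It relies on
the fact that every byte value maps to a character in ISO-8859-1, so
decoding never fails, and it then lists the bytes from 0x80 to 0x9F that
would trigger the "non SGML character number" errors.

```python
from typing import List, Tuple


def check_latin1_fallback(data: bytes) -> Tuple[str, List[Tuple[int, int]]]:
    """Decode `data` as ISO-8859-1 and report bytes in the range 0x80-0x9F.

    Hypothetical helper for illustration only. Returns the decoded text
    and a list of (byte offset, byte value) pairs for each suspect byte.
    """
    # ISO-8859-1 assigns a character to all 256 byte values,
    # so this decode cannot raise UnicodeDecodeError.
    text = data.decode("iso-8859-1")

    # Bytes 0x80-0x9F decode to C1 control characters, which are the
    # ones that would be reported as non-SGML characters.
    suspect = [(offset, byte) for offset, byte in enumerate(data)
               if 0x80 <= byte <= 0x9F]
    return text, suspect


if __name__ == "__main__":
    # A Windows-1252 style page with "smart quotes" at 0x93 and 0x94.
    _, problems = check_latin1_fallback(b"\x93quoted\x94")
    for offset, byte in problems:
        print(f"offset {offset}: non SGML character number {byte}")
```

An empty list means the fallback "works" in the sense described above; a
non-empty list gives the positions for which errors would be issued.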

Received on Thursday, 29 November 2007 15:33:01 UTC