Re: Fallback to UTF-8

On Thu, 29 Nov 2007, Olivier Thereaux wrote:

>> Given a webpage that does not specify any encoding (charset),
>> validator.w3.org reports:
>>
>> (1) No Character Encoding Found! Falling back to UTF-8.
>>
>> (2) Sorry, I am unable to validate this document because on line ...
>>     it contained one or more bytes that I cannot interpret as utf-8
>>
>> This makes no sense, and it doesn't help the user.
>
> You're not suggesting a better procedure, either.

OK, here are my suggestions:

(a) Immediately report "This document cannot be checked" without any
    reference to UTF-8. Since the document cannot be decoded as UTF-8,
    "charset=utf-8" was most probably not the author's intention.

OR

(b) Take ISO-8859-1 as the fallback encoding (the default of RFC 2616).
    This will "work" if no bytes from 0x80 to 0x9F are present, and
    hence for documents in many of the traditional 8-bit character sets.
    Otherwise (if some bytes from 0x80 to 0x9F are found),
    give the usual errors about "non SGML character number ..."
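
To make the two suggestions concrete, here is a rough sketch in Python.
It is not the validator's code; the function name, the "policy"
argument and the wording of the per-byte message are my own
assumptions, chosen only to illustrate the behaviour proposed in (a)
and (b) for a document that declares no charset.

```python
from typing import List, Optional, Tuple


def check_undeclared(data: bytes, policy: str = "b") -> Tuple[Optional[str], List[str]]:
    """Hypothetical helper: decode a document that declares no charset.

    Returns (text, messages). text is None when the document cannot
    be checked at all.
    """
    if policy == "a":
        # Suggestion (a): try UTF-8 silently; on failure, report the
        # problem without mentioning UTF-8.
        try:
            return data.decode("utf-8"), []
        except UnicodeDecodeError:
            return None, ["This document cannot be checked"]

    if policy == "b":
        # Suggestion (b): fall back to ISO-8859-1. This decode never
        # fails, because every byte maps to exactly one code point.
        text = data.decode("iso-8859-1")
        # Bytes 0x80..0x9F are the ones that would trigger the usual
        # "non SGML character number ..." errors; the message text
        # below is illustrative only.
        messages = [
            "byte 0x%02X at offset %d is in the range 0x80-0x9F" % (b, i)
            for i, b in enumerate(data)
            if 0x80 <= b <= 0x9F
        ]
        return text, messages

    raise ValueError("policy must be 'a' or 'b'")
```

With (b), a Latin-1 document such as b"caf\xe9" decodes cleanly, while
a document containing e.g. 0x93 (a typical windows-1252 quotation
mark) is still decoded but gets one message per offending byte.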

Received on Thursday, 29 November 2007 15:33:01 UTC