Re: Fallback to UTF-8 from Henri Sivonen on 2008-04-24 (www-validator@w3.org from April 2008)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Thu, 24 Apr 2008 22:08:25 +0300
To: W3C Validator Community <www-validator@w3.org>
Message-Id: <80FA3D73-C52E-4AC3-9951-ED1A443DE4F7@iki.fi>

On Apr 24, 2008, at 20:09 , Jukka K. Korpela wrote:

> David Dorward wrote:
>
>> Looking at the HTML spec, it says 'user agents must not assume any
>> default value for the "charset" parameter'
>> (http://www.w3.org/TR/html4/charset.html). So, following that guidance,
>> the validator shouldn't guess at all and should just state that no
>> encoding was found and that it can't continue until one is specified.
>
> I don't think that's quite the idea. Rather, that no default for the
> parameter (US-ASCII, ISO-8859-1, UTF-8, or any other default) should be
> assumed.
[...]
> In the absence of any particular reason to guess anything else, I think
> a user agent should assume a hypothetical generic encoding (we could
> give it a name, but that's not important right now) that uses 8 bits for
> one character so that octets 0 - 127 have their ASCII values and other
> octets denote undefined graphic characters.

Considering real Web content, it is better to pick Windows-1252 than a
hypothetical generic encoding.
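
As a rough illustration of that fallback, here is a minimal Python
sketch. The function name decode_undeclared is hypothetical, and the
choice of errors="replace" is an assumption made only so that the five
bytes Python's cp1252 codec leaves unmapped do not raise; it is not a
description of what any particular validator or browser does.

```python
def decode_undeclared(raw: bytes) -> str:
    """Decode bytes that arrived with no declared encoding.

    Hypothetical helper: treats the input as Windows-1252 rather than
    as a generic 8-bit encoding with undefined upper-half characters.
    """
    # Python's cp1252 codec leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D
    # unmapped; errors="replace" maps them to U+FFFD instead of raising.
    return raw.decode("cp1252", errors="replace")


if __name__ == "__main__":
    # 0x93 and 0x94 are curly quotes in Windows-1252, a common sight in
    # undeclared legacy content.
    print(decode_undeclared(b"\x93caf\xe9\x94"))
```

Bytes 0 - 127 still decode to their ASCII values, so this agrees with
the generic encoding for ASCII-only input and differs only in giving
the remaining octets a concrete meaning.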

For what it's worth, HTML5 makes it conforming to have ASCII-only pages
without declaring the character encoding, but having non-ASCII
characters without an encoding declaration at either the HTTP level or
the HTML level makes a document invalid. Validator.nu emits a warning
when the encoding is not declared even if the content is ASCII-only.
The point is to flag content management systems that lack encoding
declarations, even when the particular page being validated happens to
be ASCII-only.
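
The rule described above can be sketched in Python as follows. This is
only an illustration of the policy as stated in this message, not
Validator.nu's actual code; the function name, the severity strings and
the message texts are all hypothetical, and "declared" stands for
whatever encoding label was found at the HTTP or HTML level (None if
there was none).

```python
from typing import Optional, Tuple


def check_encoding_declaration(raw: bytes,
                               declared: Optional[str]) -> Tuple[str, str]:
    """Return a (severity, message) pair for a document's raw bytes.

    Hypothetical helper modelling the policy in the message above.
    """
    if declared:
        # An encoding was declared at the HTTP or HTML level.
        return ("ok", "encoding declared as " + declared)
    if all(byte < 0x80 for byte in raw):
        # ASCII-only and undeclared: conforming, but worth flagging
        # because other pages from the same system may not be ASCII-only.
        return ("warning", "no encoding declared; content is ASCII-only")
    # Non-ASCII bytes with no declaration: the document is invalid.
    return ("error", "non-ASCII content without an encoding declaration")


if __name__ == "__main__":
    print(check_encoding_declaration(b"<p>hello</p>", None))
    print(check_encoding_declaration(b"<p>caf\xc3\xa9</p>", None))
    print(check_encoding_declaration(b"<p>caf\xc3\xa9</p>", "utf-8"))
```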

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Thursday, 24 April 2008 19:09:04 UTC