Re: Fallback to UTF-8 from Jukka K. Korpela on 2008-04-24 (www-validator@w3.org from April 2008)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Thu, 24 Apr 2008 23:04:45 +0300
To: "W3C Validator Community" <www-validator@w3.org>
Message-ID: <031001c8a646$723b4e70$0500000a@DOCENDO>

Henri Sivonen wrote:

> Considering the real Web content, it is better to pick Windows-1252
> than a hypothetical generic encoding.

No, it's not, because _in validation_ you don't need to guess at the 
meanings of octets > 127 decimal. You're not supposed to render them 
(apart from echoing them along with error messages, but they're not 
markup-significant) or to process them in any way other than treating 
them as data characters.
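
To illustrate, here is a minimal Python sketch, not the validator's 
actual code. It assumes an ASCII-compatible encoding, where octets 
> 127 never stand for "<", ">" or other markup characters. The helper 
name tag_names and the sample text are made up for the example. It 
looks for tags in the raw octets without decoding anything above 127:

```python
import re

# Hypothetical helper: recognise start and end tags using ASCII octets only.
# Octets > 127 are never matched by this pattern, so they pass through
# as opaque data characters.
TAG = re.compile(rb"</?([A-Za-z][A-Za-z0-9]*)")

def tag_names(raw: bytes) -> list[str]:
    """Return the lowercased tag names found in the raw octet stream."""
    return [m.group(1).decode("ascii").lower() for m in TAG.finditer(raw)]

sample = "<p>na\u00efve caf\u00e9</p>"
for enc in ("utf-8", "iso-8859-1", "windows-1252"):
    # The non-ASCII octets differ per encoding; the markup found does not.
    print(enc, tag_names(sample.encode(enc)))
```

The same tag names come out whichever of the three encodings produced 
the octets, which is the sense in which no guess is needed.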

If you assume windows-1252, then some octets (0x81, 0x8D, 0x8F, 0x90 
and 0x9D) are unassigned, and you may well end up having guessed 
something and then detecting that the guess must be wrong. The document 
could be in some other 8-bit encoding, or in UTF-8, or something else, 
and if you hadn't bet on windows-1252, you would have analyzed the 
markup properly.
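
A small Python sketch of how such a guess gets refuted. It relies on 
Python's cp1252 codec, which leaves the octets 0x81, 0x8D, 0x8F, 0x90 
and 0x9D undefined; other implementations may map them differently. The 
helper name decodes_as is made up for the example:

```python
# Octets that Python's cp1252 codec treats as undefined.
UNASSIGNED_IN_1252 = (0x81, 0x8D, 0x8F, 0x90, 0x9D)

def decodes_as(raw: bytes, encoding: str) -> bool:
    """Return True if every octet sequence in raw is valid in the encoding."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

# U+00C1 (capital A with acute) is the octet pair C3 81 in UTF-8.
doc = "<p>\u00c1</p>".encode("utf-8")

print(decodes_as(doc, "cp1252"))      # the windows-1252 guess hits 0x81
print(decodes_as(doc, "utf-8"))       # the document's real encoding
print(decodes_as(doc, "iso-8859-1"))  # an 8-bit encoding with no gaps
```

A perfectly valid UTF-8 document is thus rejected by the windows-1252 
guess, although its markup consists of ASCII octets only.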

Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/

Received on Thursday, 24 April 2008 20:05:23 UTC