
Re: Fallback to UTF-8

From: Henri Sivonen <hsivonen@iki.fi>
Date: Fri, 25 Apr 2008 10:16:20 +0300
Message-Id: <A75E26AF-4967-4A8F-9BD4-9C09F44B25BA@iki.fi>
To: W3C Validator Community <www-validator@w3.org>

On Apr 24, 2008, at 23:04 , Jukka K. Korpela wrote:

> Henri Sivonen wrote:
>
>> Considering the real Web content, it is better to pick Windows-1252
>> than a hypothetical generic encoding.
>
> No, it's not, because _in validation_ you don't need to make any guess
> on the meanings of octets > 127 decimal.

Validator.nu, for example, checks for bad byte sequences in the
encoding (subject to decoder bugs), looks for the last two
noncharacter code points on each plane, and looks for PUA characters.
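A rough sketch of those checks in Python (not Validator.nu's actual code, which is Java; the ranges follow the Unicode definitions of noncharacters and the Private Use Areas):

```python
def is_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF, plus the last two code points of each plane
    (U+FFFE/U+FFFF, U+1FFFE/U+1FFFF, ..., U+10FFFE/U+10FFFF)."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def is_pua(cp: int) -> bool:
    """BMP Private Use Area plus the plane 15 and plane 16 PUAs."""
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)

def check_bytes(data: bytes, encoding: str) -> list[str]:
    """Report bad byte sequences, noncharacters, and PUA characters."""
    try:
        text = data.decode(encoding)  # strict mode flags bad sequences
    except UnicodeDecodeError as e:
        return [f"bad byte sequence at offset {e.start}"]
    problems = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        if is_noncharacter(cp):
            problems.append(f"noncharacter U+{cp:04X} at index {i}")
        elif is_pua(cp):
            problems.append(f"PUA character U+{cp:04X} at index {i}")
    return problems
```

The "subject to decoder bugs" caveat applies here too: the strict decoder, not this code, decides what counts as a bad byte sequence.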

> You're not supposed to render them
> (apart from echoing them along with error messages, but they're not
> markup-significant) or to process them in any way but treating them as
> data characters.

Rendering source extracts is a significant part of validator UI.

> If you assume windows-1252, then many possible octets will be unassigned
> and you may well have the problem of having guessed something and then
> detected the guess must be wrong. The document could be in some other
> 8-bit encoding, or in UTF-8, or something else, and if you hadn't bet on
> windows-1252, you would have analyzed the markup properly.

Right, but if non-declared non-ASCII is an error, the pass/fail
outcome will be right even if for the wrong reason.
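To illustrate the point with a toy check (my sketch, not the validator's logic): if any octet above 127 is an error in an undeclared-encoding document, the verdict is the same no matter which fallback encoding was guessed; the guess only affects how the offending source extract is rendered.

```python
def undeclared_non_ascii_ok(data: bytes) -> bool:
    """With no declared encoding, any octet > 127 decimal is an error,
    so the pass/fail verdict does not depend on the fallback guess."""
    return all(b <= 0x7F for b in data)

def render_extract(data: bytes) -> str:
    """The Windows-1252 guess matters only for displaying the source
    extract alongside the error message."""
    return data.decode("windows-1252")
```

Here `b"caf\xe9"` fails validation whether the byte 0xE9 was meant as Windows-1252, ISO-8859-2, or a stray UTF-8 fragment; the guess only changes what the user sees quoted back.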

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Friday, 25 April 2008 07:17:11 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Wednesday, 25 April 2012 12:14:29 GMT