
Re: Fallback to UTF-8

From: Henri Sivonen <hsivonen@iki.fi>
Date: Fri, 25 Apr 2008 10:16:20 +0300
Message-Id: <A75E26AF-4967-4A8F-9BD4-9C09F44B25BA@iki.fi>
To: W3C Validator Community <www-validator@w3.org>

On Apr 24, 2008, at 23:04 , Jukka K. Korpela wrote:

> Henri Sivonen wrote:
>
>> Considering the real Web content, it is better to pick Windows-1252
>> than a hypothetical generic encoding.
>
> No, it's not, because _in validation_ you don't need to make any guess
> on the meanings of octets > 127 decimal.

Validator.nu, for example, checks for bad byte sequences in the
encoding (subject to decoder bugs), looks for the last two
noncharacter code points on each plane, and looks for PUA characters.
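A rough sketch of those checks in Python (not Validator.nu's actual code, which is Java; the ranges follow the Unicode definitions of noncharacters and the Private Use Areas):

```python
def is_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF, plus the last two code points of each plane
    (U+FFFE/U+FFFF, U+1FFFE/U+1FFFF, ..., U+10FFFE/U+10FFFF)."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def is_pua(cp: int) -> bool:
    """BMP Private Use Area plus the plane 15 and plane 16 PUAs."""
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)

def check_bytes(data: bytes, encoding: str) -> list[str]:
    """Report bad byte sequences, noncharacters, and PUA characters."""
    try:
        text = data.decode(encoding)  # strict mode flags bad sequences
    except UnicodeDecodeError as e:
        return [f"bad byte sequence at offset {e.start}"]
    problems = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        if is_noncharacter(cp):
            problems.append(f"noncharacter U+{cp:04X} at index {i}")
        elif is_pua(cp):
            problems.append(f"PUA character U+{cp:04X} at index {i}")
    return problems
```

The "subject to decoder bugs" caveat applies here too: the strict decoder, not this code, decides what counts as a bad byte sequence.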

> You're not supposed to render them
> (apart from echoing them along with error messages, but they're not
> markup-significant) or to process them in any way but treating them as
> data characters.

Rendering source extracts is a significant part of validator UI.

> If you assume windows-1252, then many possible octets will be unassigned
> and you may well have the problem of having guessed something and then
> detected the guess must be wrong. The document could be in some other
> 8-bit encoding, or in UTF-8, or something else, and if you hadn't bet on
> windows-1252, you would have analyzed the markup properly.

Right, but if non-declared non-ASCII is an error, the pass/fail
outcome will be right even if for the wrong reason.
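To illustrate the point with a toy check (my sketch, not the validator's logic): if any octet above 127 is an error in an undeclared-encoding document, the verdict is the same no matter which fallback encoding was guessed; the guess only affects how the offending source extract is rendered.

```python
def undeclared_non_ascii_ok(data: bytes) -> bool:
    """With no declared encoding, any octet > 127 decimal is an error,
    so the pass/fail verdict does not depend on the fallback guess."""
    return all(b <= 0x7F for b in data)

def render_extract(data: bytes) -> str:
    """The Windows-1252 guess matters only for displaying the source
    extract alongside the error message."""
    return data.decode("windows-1252")
```

Here `b"caf\xe9"` fails validation whether the byte 0xE9 was meant as Windows-1252, ISO-8859-2, or a stray UTF-8 fragment; the guess only changes what the user sees quoted back.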

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Friday, 25 April 2008 07:17:11 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Wednesday, 25 April 2012 12:14:29 GMT