Re: Fallback to UTF-8 from Henri Sivonen on 2008-04-25 (www-validator@w3.org from April 2008)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Fri, 25 Apr 2008 12:44:35 +0300
To: W3C Validator Community <www-validator@w3.org>
Message-Id: <ED9898F4-94AB-4362-B9DD-C5816956E83E@iki.fi>

On Apr 25, 2008, at 00:29 , Karl Dubost wrote:

> On 25 Apr 2008, at 03:08, Henri Sivonen wrote:
>> Considering the real Web content, it is better to pick Windows-1252  
>> than a hypothetical generic encoding.
>
> nope. If you consider a cluster community approach.

I'm not familiar with that term.

> Windows-1252 will be more popular in the current statistics for the  
> world, but not in some specific regions, like Japan or China for  
> example.

HTML5 allows browsers to be configurable to have another ASCII-based  
last resort encoding that is a better guess for the user's locale.
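
As a rough illustration of a locale-dependent last resort encoding, here is a
minimal Python sketch. The table entries and the function name
last_resort_encoding are hypothetical choices made for this example; they are
not taken from the HTML5 draft, and a real browser would use its own
configuration.

```python
# Hypothetical mapping from a user's locale to a last resort encoding.
# The entries are illustrative assumptions, not a table from HTML5.
LAST_RESORT_BY_LOCALE = {
    "ja": "shift_jis",
    "zh-cn": "gbk",
    "zh-tw": "big5",
    "ru": "windows-1251",
}

# Global default used when the locale gives no better guess.
DEFAULT_LAST_RESORT = "windows-1252"


def last_resort_encoding(locale_tag):
    """Pick a fallback encoding for a locale tag such as 'ja_JP' or 'zh-TW'."""
    tag = locale_tag.lower().replace("_", "-")
    # Try the full tag first (e.g. 'zh-tw'), then the primary language ('zh').
    if tag in LAST_RESORT_BY_LOCALE:
        return LAST_RESORT_BY_LOCALE[tag]
    primary = tag.split("-")[0]
    return LAST_RESORT_BY_LOCALE.get(primary, DEFAULT_LAST_RESORT)
```

A global service has no single user locale to consult, so it would always end
up at the default branch.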

Online validators are global, though. I could enable heuristic  
encoding detection for CJK as I have done for parsetree.validator.nu,  
but I don't have Cyrillic detection code available, for example. Non- 
declared non-ASCII is an error anyway, so the benefit seems small in  
the validator case compared to the software QA cost (making sure that  
you cannot use other decoders to sneak in non-declared non-ASCII).
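
The trade-off can be sketched in a few lines of Python. This is only an
illustration under stated assumptions, not the Validator.nu implementation:
the function name decode_for_validation and the error message are made up for
the example, and the fallback is fixed to Windows-1252 with no heuristic
detection.

```python
from typing import List, Optional, Tuple


def decode_for_validation(data: bytes,
                          declared: Optional[str]) -> Tuple[str, List[str]]:
    """Decode a document for display and collect encoding errors.

    Hypothetical helper: a declared encoding is used as-is; otherwise the
    bytes are decoded as Windows-1252 so that a source extract can still be
    shown, and any non-ASCII byte is reported as an error.
    """
    errors: List[str] = []
    if declared is not None:
        return data.decode(declared, errors="replace"), errors

    # No declaration: non-ASCII content is an error regardless of which
    # fallback decoder is used to render the source extract.
    if any(byte >= 0x80 for byte in data):
        errors.append("non-ASCII bytes without a declared encoding")

    # A few bytes are unassigned in Python's windows-1252 codec,
    # so undecodable bytes are replaced rather than raising.
    return data.decode("windows-1252", errors="replace"), errors
```

With a single fixed fallback there is only one code path to test for
undeclared input, which is the QA cost argument made above.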

> So basically choosing Windows-1252, you will be imposing a local  
> preference on others.

Should I make the source extract equally inconvenient for everyone?  
The document doesn't pass validation anyway.

> Take it another way, once the number of Chinese Web pages will be  
> largely superior than the rest of the world (1), should we change  
> the spec to Big5 or GB2312?
>
> (1) 8.47 billions of pages, fast growth rate (89.4% in 2007)

Growth in any language should be in UTF-8, *declared* as such.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Friday, 25 April 2008 09:45:17 UTC