
Re: Fallback to UTF-8

From: Henri Sivonen <hsivonen@iki.fi>
Date: Fri, 25 Apr 2008 10:20:40 +0300
Message-Id: <1601D4C6-5A25-42AE-AAE1-FBFD07761DA6@iki.fi>
To: W3C Validator Community <www-validator@w3.org>

On Apr 24, 2008, at 23:11 , Frank Ellermann wrote:

> Henri Sivonen wrote:
>
>> Considering the real Web content, it is better to pick Windows-1252
>> than a hypothetical generic encoding.
>
> A good strategy for browsers, not necessarily for validators
> IFF it could accept wild mixtures of Latin-1 and UTF-8 as
> "valid" windows-1252.
[...]
> Your proposal "just assume windows-1252" is an idea for the
> validation step,

That wasn't the proposal. The proposal was: assume Windows-1252 but
treat the upper half (octets 0x80..0xFF) as errors.
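
A minimal Python sketch of that idea, for illustration only: the function
name and the (offset, octet) error format are hypothetical, not taken from
any validator. It decodes the input as Windows-1252 so later messages can
still quote the text, while recording every octet at or above 0x80 as an
error.

```python
from typing import List, Tuple


def decode_windows_1252_strict_upper(data: bytes) -> Tuple[str, List[Tuple[int, int]]]:
    """Decode as Windows-1252, but report every octet >= 0x80 as an error.

    Returns the decoded text and a list of (offset, octet) pairs.
    """
    # Every octet in the upper half counts as an error, whatever it maps to.
    errors = [(offset, octet) for offset, octet in enumerate(data) if octet >= 0x80]
    # Python's cp1252 codec leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D unmapped;
    # errors="replace" turns those into U+FFFD instead of raising.
    text = data.decode("cp1252", errors="replace")
    return text, errors


# UTF-8 input is still decoded (as mojibake), but each byte of the
# multi-byte sequence is flagged.
text, errors = decode_windows_1252_strict_upper(b"caf\xc3\xa9")
```

Pure ASCII input yields an empty error list, so documents that need no
declaration pass silently.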

> but it could have rather odd effects for the
> UTF-8 output of other errors, when the input contains any octet
> in the range 0x80..0x9F, or worse, if the input in fact was
> UTF-8, not windows-1252.

Would mere U+FFFD be better?

> Jukka's proposal avoids most surprises - all octets 0x80..0xFF
> are accepted as "unknown garbage".

I think a quality assurance tool should not *accept* unknown garbage
but emit an error for non-ASCII octets when no encoding is declared.
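
As a rough sketch of such a check in Python (the function name, the
`declared` parameter, and the message wording are all made up for this
example), the tool would stay silent when an encoding is declared and
complain about each non-ASCII octet otherwise:

```python
from typing import List, Optional


def check_undeclared_non_ascii(data: bytes, declared: Optional[str]) -> List[str]:
    """Return one error message per non-ASCII octet if no encoding is declared."""
    if declared is not None:
        # A declared encoding is validated by its own decoder, not here.
        return []
    return [
        f"offset {offset}: non-ASCII octet 0x{octet:02X} without a declared encoding"
        for offset, octet in enumerate(data)
        if octet >= 0x80
    ]


messages = check_undeclared_non_ascii(b"a\xff", None)
```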

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Friday, 25 April 2008 07:21:20 GMT
