Re: Fallback to UTF-8 from Henri Sivonen on 2008-04-25 (www-validator@w3.org from April 2008)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Fri, 25 Apr 2008 10:20:40 +0300
To: W3C Validator Community <www-validator@w3.org>
Message-Id: <1601D4C6-5A25-42AE-AAE1-FBFD07761DA6@iki.fi>

On Apr 24, 2008, at 23:11 , Frank Ellermann wrote:

> Henri Sivonen wrote:
>
>> Considering the real Web content, it is better to pick Windows-1252
>> than a hypothetical generic encoding.
>
> A good strategy for browsers, not necessarily for validators
> IFF it could accept wild mixtures of Latin-1 and UTF-8 as
> "valid" windows-1252.
[...]
> Your proposal "just assume windows-1252" is an idea for the
> validation step,

That wasn't the proposal. The proposal was: Assume Windows-1252 but
treat the upper half (octets 0x80..0xFF) as errors.

> but it could have rather odd effects for the
> UTF-8 output of other errors, when the input contains any octet
> in the range 0x80..0x9F, or worse, if the input in fact was
> UTF-8, not windows-1252.

Would a mere U+FFFD in place of each such octet be better?

> Jukka's proposal avoids most surprises - all octets 0x80..0xFF
> are accepted as "unknown garbage".

I think a quality assurance tool should not *accept* unknown garbage.
It should emit an error for any non-ASCII octet when no encoding has
been declared.
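
To make the proposal concrete, here is a minimal Python sketch. It is
not code from any actual validator; the function name
decode_undeclared and the error message wording are made up for
illustration. It assumes the document carries no encoding declaration,
decodes it as Windows-1252 so that later error messages stay readable,
and reports every octet in 0x80..0xFF as an error.

```python
def decode_undeclared(data: bytes):
    """Decode a document that declares no encoding (hypothetical helper).

    Returns (text, errors). The text is decoded as Windows-1252, and
    every octet in the range 0x80..0xFF is reported as an error.
    """
    errors = [
        (offset, f"non-ASCII octet 0x{octet:02X} without a declared encoding")
        for offset, octet in enumerate(data)
        if octet >= 0x80
    ]
    # Python's "cp1252" codec leaves a few octets (such as 0x81) undefined;
    # errors="replace" turns those into U+FFFD instead of raising.
    text = data.decode("cp1252", errors="replace")
    return text, errors


# Latin-1/Windows-1252 input: readable text, one error.
text, errors = decode_undeclared(b"caf\xe9")

# Input that was in fact UTF-8: two errors, and the text shows the
# familiar two-character misdecoding of the single e-acute.
utf8_text, utf8_errors = decode_undeclared("caf\u00e9".encode("utf-8"))
```

Pure ASCII input yields an empty error list, so documents that stay
within ASCII are unaffected by the fallback.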

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Friday, 25 April 2008 07:21:20 UTC