Re: Fallback to UTF-8 from Frank Ellermann on 2008-04-24 (www-validator@w3.org from April 2008)

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Thu, 24 Apr 2008 22:11:13 +0200
To: www-validator@w3.org
Message-ID: <fuqpdr$mu9$1@ger.gmane.org>

Henri Sivonen wrote:

> Considering the real Web content, it is better to pick Windows-1252  
> than a hypothetical generic encoding.

A good strategy for browsers, but not necessarily for validators
if it means accepting wild mixtures of Latin-1 and UTF-8 as
"valid" windows-1252.

Andreas' example shows that assuming UTF-8 does not work as it
should for validator.w3: it ends in a fatal error instead of
reporting the non-UTF-8 octets.
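
To illustrate what "reporting non-UTF-8 octets" could look like, here
is a minimal Python sketch.  It is not how validator.w3 works; the
function name find_invalid_utf8 is hypothetical, and Python's built-in
UTF-8 decoder stands in for whatever decoder a validator would use.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, bad_bytes) for each span that is not valid UTF-8."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder decodes cleanly, nothing more to report
        except UnicodeDecodeError as err:
            # err.start/err.end are relative to the slice we passed in
            start = pos + err.start
            end = pos + err.end
            problems.append((start, data[start:end]))
            pos = end  # resume after the offending octets
    return problems


# "café" encoded as Latin-1, then checked as if it were UTF-8
print(find_invalid_utf8(b"caf\xe9 ok"))  # [(3, b'\xe9')]
```

The point is only that a decoder error can be turned into a list of
offsets and octets instead of aborting the whole run.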

His Latin-1 example was better: it reported 0x80 as non-Latin-1.

Your proposal "just assume windows-1252" is an idea for the
validation step, but it could have rather odd effects for the
UTF-8 output of other errors, when the input contains any octet
in the range 0x80..0x9F, or worse, if the input in fact was 
UTF-8, not windows-1252.
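
A small Python sketch of both effects, assuming Python's "cp1252"
codec as a stand-in for a windows-1252 decoder (the variable names
are only for illustration):

```python
# UTF-8 input misread as windows-1252: decoding "succeeds", but the
# text that would be echoed in error messages is wrong.
utf8_bytes = "\u00e9".encode("utf-8")      # e-acute, b'\xc3\xa9'
misread = utf8_bytes.decode("cp1252")      # two characters: U+00C3 U+00A9

# An octet in 0x80..0x9F becomes a printable character; 0x80 is the
# euro sign in windows-1252, which is three octets in UTF-8 output.
euro = b"\x80".decode("cp1252")            # U+20AC
euro_out = euro.encode("utf-8")            # b'\xe2\x82\xac'

print(repr(misread), repr(euro), euro_out)
```

In both cases nothing is flagged as an error, which is the surprise
for a validator.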

Jukka's proposal avoids most surprises: all octets 0x80..0xFF
are accepted as "unknown garbage".  He didn't say how that can
be displayed in the error output: as question marks?  As U+FFFD?
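
Either display choice is easy to sketch.  The following Python is
only an illustration of the two options, not Jukka's proposal as
specified; the function name show_unknown and its marker parameter
are hypothetical.

```python
def show_unknown(data: bytes, marker: str = "\ufffd") -> str:
    """Keep ASCII octets as-is; show every octet 0x80..0xFF as `marker`."""
    return "".join(chr(b) if b < 0x80 else marker for b in data)


latin1_input = b"caf\xe9"        # e-acute as one Latin-1 octet
utf8_input = b"caf\xc3\xa9"      # e-acute as two UTF-8 octets

print(show_unknown(latin1_input))       # one U+FFFD after "caf"
print(show_unknown(utf8_input, "?"))    # caf??
```

Note that one marker is emitted per octet, so a multi-octet UTF-8
character shows up as several markers.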

 Frank

Received on Thursday, 24 April 2008 20:10:24 UTC