Re: Fallback to UTF-8 from Frank Ellermann on 2008-04-25 (www-validator@w3.org from April 2008)

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Fri, 25 Apr 2008 18:17:43 +0200
To: www-validator@w3.org
Message-ID: <fut020$klm$1@ger.gmane.org>

Henri Sivonen wrote:

> The proposal was: Assume Windows-1252 but treat the upper 
> half as errors.

Then you could end up reporting tons of errors, instead of
only one error, as in Jukka's proposal.  validator.w3.org
comes close to the ideal "one error" with its UTF-8 approach,
but unfortunately that error is fatal, suppressing anything
else it would find with the other proposals (Andreas, Jukka, you).
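
To make the difference concrete, here is a rough Python sketch of
how the three strategies would count errors on the same input.
The function names and the sample bytes are made up for this
illustration; none of this is actual validator code, and the
proposals are modelled only as far as they are described here.

```python
def errors_per_octet(data: bytes) -> int:
    """Windows-1252 assumed, each octet in 0x80..0xFF is its own error."""
    return sum(1 for b in data if b >= 0x80)


def errors_single(data: bytes) -> int:
    """One collective error if any octet in 0x80..0xFF occurs at all."""
    return 1 if any(b >= 0x80 for b in data) else 0


def utf8_fallback_is_fatal(data: bytes) -> bool:
    """Strict UTF-8 fallback: a decoding failure ends all checking."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# Hypothetical input: ISO-8859-1 text served without a charset.
sample = b"<p>Gr\xfc\xdfe aus K\xf6ln</p>"

print(errors_per_octet(sample))        # one error per non-ASCII octet
print(errors_single(sample))           # a single error for the document
print(utf8_fallback_is_fatal(sample))  # whether UTF-8 decoding gives up
```

On a long legacy page the first count grows with every accented
character, while the second stays at one.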

 [output] 
> Would mere U+FFFD be better?

For Unicode output that is IMO good enough:  There was at least
one error, the missing charset, so the user has to come back anyway.
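
A minimal Python sketch of the "mere U+FFFD" output, assuming the
checker simply shows every undeclared non-ASCII octet as U+FFFD.
The function name is hypothetical; the decoding call is the
standard bytes.decode with the "replace" error handler.

```python
def to_unicode_output(data: bytes) -> str:
    """Keep ASCII as is, show each octet in 0x80..0xFF as U+FFFD."""
    return data.decode("ascii", errors="replace")


# Hypothetical input with one ISO-8859-1 octet (0xF6).
shown = to_unicode_output(b"K\xf6ln")
print(shown == "K\ufffdln")
```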

Of course these strategies fail miserably when the markup is
non-ASCII or worse (UTF-1, UTF-7, UTF-16, BOCU-1, SCSU, etc.),
but to cover such oddities we could declare victory with the
UTF-8 fallback as is - obviously not what we want (for HTML).

>> Jukka's proposal avoids most surprises - all octets 
>> 0x80..0xFF are accepted as "unknown garbage".

> I think a quality assurance tool should not *accept* unknown
> garbage but emit an error on non-declared non-ASCII.

I meant "accept" limited to parsing the input, in the sense of
"not giving up with a fatal error", as validator.w3.org does
when its UTF-8 fallback turns out to be wrong.  Of course
any "unknown garbage" is an error.  But with Jukka's proposal
this is *one* error, neither fatal, nor "thousands of errors".
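
As a sketch of what I mean by "accept", under my own assumptions:
the octets are passed through untouched (ISO-8859-1 is used below
only as an octet-preserving mapping, not as a guess at the real
charset), parsing goes on, and exactly one non-fatal error is
recorded.  The function name and the message text are invented
for this example.

```python
from typing import List, Tuple


def decode_undeclared(data: bytes) -> Tuple[str, List[str]]:
    """Return parsable text plus at most one non-fatal charset error."""
    errors: List[str] = []
    high = [i for i, b in enumerate(data) if b >= 0x80]
    if high:
        # One collective report, however many octets are affected.
        errors.append(
            "no charset declared: %d non-ASCII octet(s), first at offset %d"
            % (len(high), high[0])
        )
    # latin-1 maps each octet to one code point and never fails,
    # so later markup checks still get the whole document.
    return data.decode("latin-1"), errors


text, errors = decode_undeclared(b"<p>caf\xe9<b>")
print(len(errors))
print(len(text))
```

Any further markup checks would then run on the returned text as
usual, which is the point of not making the error fatal.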

 Frank

Received on Friday, 25 April 2008 16:15:47 UTC