Re: Fallback to UTF-8 from Jukka K. Korpela on 2008-04-25 (www-validator@w3.org from April 2008)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Fri, 25 Apr 2008 11:00:21 +0300
To: "W3C Validator Community" <www-validator@w3.org>
Message-ID: <015401c8a6aa$69259160$0500000a@DOCENDO>

Henri Sivonen wrote:

> Validator.nu, for example, checks for bad byte sequences in the
> encoding (subject to decoder bugs), looks for the last two
> non-character code points on each plane and looks for PUA characters.

That's a different issue. The question was about handling data for which 
no encoding has been specified. Hence there is formally no criterion for 
"bad byte sequences", still less for anything related to code points.

> - - if non-declared non-ASCII is an error, the pass/fail
> outcome will be right even if for the wrong reason.

Anything non-declared (even if it consists just of octets in the ASCII 
range) is an error, but at a category level other than validation 
errors. Formally, there is no document to be validated, just some lump 
of octets. Hence, the correct response says this and _could_ refuse to 
do anything else. Even "This document can not be checked" is a bit 
questionable. Which _document_? Better: The submitted data cannot be 
interpreted as a marked-up document.
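
To illustrate the distinction between category levels, here is a rough Python sketch, not a description of any existing validator. The names Report, check and markup_errors are hypothetical, and markup_errors is only a stand-in for the actual validation step. The point is that a missing encoding is reported before validation begins, and that "bad byte sequence" becomes meaningful only once an encoding has been declared.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Report:
    # "precondition": there is no document to validate, only octets.
    # "encoding":     an encoding was declared and the octets violate it.
    # "validation":   the data was decoded and checked as markup.
    category: str
    messages: List[str] = field(default_factory=list)


def markup_errors(text: str) -> List[str]:
    """Hypothetical stand-in for the real markup validation step."""
    return []


def check(data: bytes, declared_encoding: Optional[str]) -> Report:
    if declared_encoding is None:
        # Not a validation error: validation never starts.
        return Report("precondition", [
            "The submitted data cannot be interpreted as a marked-up "
            "document: no character encoding has been specified."
        ])
    try:
        text = data.decode(declared_encoding)
    except LookupError:
        return Report("precondition",
                      [f"Unknown encoding: {declared_encoding}"])
    except UnicodeDecodeError as exc:
        # Only with a declared encoding is there a criterion for this.
        return Report("encoding", [
            f"Bad byte sequence for {declared_encoding} "
            f"at octet offset {exc.start}"
        ])
    return Report("validation", markup_errors(text))
```

With this shape, check(data, None) never produces a validation message at all, even for data consisting solely of octets in the ASCII range.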

If you wish to do something additional to help the user - and this is 
probably a good idea if implemented properly - then the report should 
clearly say what has been done ("Falling back" sounds like an odd 
expression) and it should use a guess that is the least likely to spawn 
wrong or misleading error messages.
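
As a rough sketch of what "clearly say what has been done" might look like, consider the following Python. The function name decode_undeclared is hypothetical, and the particular guess used here (treat octets in the ASCII range as ASCII, and replace every other octet with U+FFFD) is just one assumption about what is least likely to mislead, not a recommendation from any specification.

```python
from typing import List, Tuple


def decode_undeclared(data: bytes) -> Tuple[str, List[str]]:
    """Decode data that has no declared encoding, and say exactly how.

    Assumption for this sketch: the markup itself is in the ASCII range,
    so octets above 0x7F are not interpreted at all but replaced.
    """
    notes = [
        "No character encoding was specified, so the submitted data "
        "cannot be interpreted as a marked-up document."
    ]
    # Count octets for which no interpretation is attempted.
    unknown = sum(1 for octet in data if octet > 0x7F)
    # Each octet outside the ASCII range becomes one U+FFFD.
    text = data.decode("ascii", errors="replace")
    notes.append(
        "As an aid only, octets in the ASCII range were read as ASCII; "
        f"{unknown} other octet(s) were replaced with U+FFFD and were "
        "not checked."
    )
    return text, notes
```

The notes would be shown before any further messages, so that the user can tell which messages stem from the guess and which from the data itself.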

If the additional thing tends to confuse users rather than help them, 
then, well, maybe the validator should just say "I can't process your 
data" in some polite and informative terms.

Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/

Received on Friday, 25 April 2008 08:00:51 UTC