Re: Fallback to UTF-8 from Henri Sivonen on 2008-04-25 (www-validator@w3.org from April 2008)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Fri, 25 Apr 2008 12:25:21 +0300
To: W3C Validator Community <www-validator@w3.org>
Message-Id: <26915A69-747C-4ECB-A3D3-5EF83E922701@iki.fi>

On Apr 25, 2008, at 11:00 , Jukka K. Korpela wrote:

> Henri Sivonen wrote:
>
>> Validator.nu, for example, checks for bad byte sequences in the
>> encoding (subject to decoder bugs), looks for the last two
>> non-character code points on each plane and looks for PUA characters.
>
> That's a different issue. The question was about handling data for which
> no encoding has been specified. Hence there is formally no criterion for
> "bad byte sequences", still less for anything related to code points.

Depends on which spec you read. My point is that while HTML 4.01  
doesn't specify this properly, this is a solved problem (by HTML 5) in  
text/html, so it isn't particularly productive to extrapolate from  
legacy specs in other ways.
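
Once an encoding has been settled on, the checks quoted at the top of
this message are well defined. A minimal Python sketch follows. It is
not Validator.nu's actual code: the function names and message strings
are made up for illustration, and it assumes Python's strict decoders
are an acceptable stand-in for a validating decoder.

```python
def is_noncharacter_plane_end(cp: int) -> bool:
    """True for U+xxFFFE and U+xxFFFF, the last two code points of a plane."""
    return (cp & 0xFFFE) == 0xFFFE


def is_private_use(cp: int) -> bool:
    """True for the BMP Private Use Area and the two private-use planes."""
    return (
        0xE000 <= cp <= 0xF8FF
        or 0xF0000 <= cp <= 0xFFFFD
        or 0x100000 <= cp <= 0x10FFFD
    )


def check_bytes(data: bytes, encoding: str) -> list[str]:
    """Hypothetical helper: report problems found under the given encoding."""
    try:
        # Strict decoding raises on byte sequences invalid in the encoding.
        text = data.decode(encoding)
    except UnicodeDecodeError as exc:
        return [f"bad byte sequence at offset {exc.start}"]

    problems = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        if is_noncharacter_plane_end(cp):
            problems.append(f"noncharacter U+{cp:04X} at index {index}")
        elif is_private_use(cp):
            problems.append(f"private-use character U+{cp:04X} at index {index}")
    return problems
```

Note that the result depends entirely on the encoding passed in, which
is why the checks only make sense after the encoding question has been
answered.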

>> - - if non-declared non-ASCII is an error, the pass/fail
>> outcome will be right even if for the wrong reason.
>
> Anything non-declared (even if it consists just of octets in the ASCII
> range) is an error, but at a category level other than validation
> errors. Formally, there is no document to be validated, just some lump
> of octets. Hence, the correct response says this and _could_ refuse to
> do anything else. Even "This document can not be checked" is a bit
> questionable. Which _document_? Better: The submitted data cannot be
> interpreted as a marked-up document.

How about "The character encoding of the document was not explicit  
(assumed windows-1252) but the document contains non-ASCII."
http://html5.validator.nu/?doc=http%3A%2F%2Fwww.unics.uni-hannover.de%2Fnhtcapri%2Ftest.htm
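
The condition behind that message could be sketched roughly as below.
This is only an illustration in Python, not the validator's
implementation: the function name and the FALLBACK_ENCODING constant
are hypothetical, and the sketch assumes the declared encoding (if any)
has already been extracted from the HTTP header or the document.

```python
from typing import Optional

# Assumed fallback, matching the encoding named in the message above.
FALLBACK_ENCODING = "windows-1252"


def undeclared_encoding_error(data: bytes, declared: Optional[str]) -> Optional[str]:
    """Return an error message if no encoding was declared and the
    data contains non-ASCII bytes; otherwise return None."""
    if declared is not None:
        return None  # an explicit declaration exists, nothing to report
    if all(byte < 0x80 for byte in data):
        return None  # pure ASCII decodes identically under the fallback
    return (
        "The character encoding of the document was not explicit "
        f"(assumed {FALLBACK_ENCODING}) but the document contains non-ASCII."
    )
```

With this shape, an undeclared all-ASCII document passes silently, and
an undeclared document with any byte at or above 0x80 is reported as an
error, whichever encoding the author actually intended.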

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Friday, 25 April 2008 09:26:06 UTC