Re: Fallback to UTF-8

On Apr 25, 2008, at 11:00, Jukka K. Korpela wrote:

> Henri Sivonen wrote:
>>, for example, checks for bad byte sequences in the
>> encoding (subject to decoder bugs), looks for the last two
>> non-character code points on each plane and looks for PUA characters.
> That's a different issue. The question was about handling data for which
> no encoding has been specified. Hence there is formally no criterion for
> "bad byte sequences", still less for anything related to code points.

Depends on which spec you read. My point is that while HTML 4.01
doesn't specify this properly, HTML 5 has solved the problem for
text/html, so it isn't particularly productive to extrapolate from
legacy specs in other ways.
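
For illustration only, the three checks mentioned in the quoted text (bad byte sequences, the last two code points on each plane, PUA characters) could be sketched roughly as below. This is not the checker's actual code; the function names are hypothetical, and the sketch relies on Python's own decoders, so it is subject to their bugs and quirks as well.

```python
from typing import List


def is_plane_end_noncharacter(cp: int) -> bool:
    # The last two code points on each plane: U+FFFE, U+FFFF, U+1FFFE, ...
    # up to U+10FFFF. Other noncharacters are deliberately not covered here.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


def is_private_use(cp: int) -> bool:
    # BMP Private Use Area plus the two supplementary private use planes.
    return (
        0xE000 <= cp <= 0xF8FF
        or 0xF0000 <= cp <= 0xFFFFD
        or 0x100000 <= cp <= 0x10FFFD
    )


def find_suspicious_content(data: bytes, encoding: str) -> List[str]:
    """Return human-readable findings for the given bytes (hypothetical helper)."""
    try:
        text = data.decode(encoding, errors="strict")
    except UnicodeDecodeError as exc:
        # A strict decode failure is treated as a bad byte sequence.
        return [f"bad byte sequence at byte offset {exc.start}"]

    findings = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        if is_plane_end_noncharacter(cp):
            findings.append(f"noncharacter U+{cp:04X} at index {index}")
        elif is_private_use(cp):
            findings.append(f"PUA character U+{cp:04X} at index {index}")
    return findings
```

Note that such checks only make sense once some encoding has been chosen, whether declared or assumed, which is the point of contention above.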

>> - - if non-declared non-ASCII is an error, the pass/fail
>> outcome will be right even if for the wrong reason.
> Anything non-declared (even if it consists just of octets in the ASCII
> range) is an error, but at a category level other than validation
> errors. Formally, there is no document to be validated, just some lump
> of octets. Hence, the correct response says this and _could_ refuse to
> do anything else. Even "This document can not be checked" is a bit
> questionable. Which _document_? Better: The submitted data cannot be
> interpreted as a marked-up document.

How about "The character encoding of the document was not explicit  
(assumed windows-1252) but the document contains non-ASCII."
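
A minimal sketch of how that could behave, assuming a windows-1252 fallback when nothing is declared. The function name and return shape are made up for illustration. Python's "windows-1252" codec leaves five bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, so the sketch decodes with errors="replace" rather than failing on them.

```python
from typing import Optional, Tuple

FALLBACK_ENCODING = "windows-1252"  # assumed fallback, as in the message above

MESSAGE = (
    "The character encoding of the document was not explicit "
    "(assumed windows-1252) but the document contains non-ASCII."
)


def decode_with_fallback(
    data: bytes, declared: Optional[str] = None
) -> Tuple[str, Optional[str]]:
    """Decode data; return (text, error message or None). Hypothetical helper."""
    if declared is not None:
        # An explicit declaration exists, so this particular error never fires.
        return data.decode(declared), None

    # No declaration: decode with the assumed fallback encoding.
    text = data.decode(FALLBACK_ENCODING, errors="replace")

    # Pure ASCII decodes the same way in any ASCII-compatible encoding,
    # so only bytes above 0x7F make the guess matter.
    if any(byte > 0x7F for byte in data):
        return text, MESSAGE
    return text, None
```

With this, an undeclared all-ASCII document passes silently, while an undeclared document with any byte above 0x7F gets the message regardless of whether the guess happened to be right.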

Henri Sivonen

Received on Friday, 25 April 2008 09:26:06 UTC