Re: Fallback to UTF-8 from olivier Thereaux on 2008-04-28 (www-validator@w3.org from April 2008)

From: olivier Thereaux <ot@w3.org>
Date: Mon, 28 Apr 2008 15:42:45 +0900
To: Jukka K. Korpela <jkorpela@cs.tut.fi>
Cc: "W3C Validator Community" <www-validator@w3.org>
Message-Id: <9ACD5683-D85A-40F7-A937-7C72C7798B10@w3.org>

Hi Jukka,

Thanks for your feedback.

On 28-Apr-08, at 3:05 PM, Jukka K. Korpela wrote:
>> * utf-8, because it is the future-looking encoding, also appropriate
>> for most international content.
>
> Future-looking does not apply here. We are dealing with an _error
> condition_: the encoding has not been specified, and you (we) have
> decided that the validator should make a guess, as extra comfort to
> the user.

More and more authoring tools use utf-8 as a default, and my bet is
that even more will in the future. Treating it as a worthwhile guess
for unlabeled content is thus future-looking.

> In the given situation, the validator should encourage the user to
> specify the encoding

We agree. And it does.

[[ No Character Encoding Found!
...
Read the FAQ entry on character encoding for more details and pointers  
on how to fix this problem with your document.]]

This message will be shown whether or not our attempts to transcode
using one or several default encodings succeed.

> Why those three?

Because these are the three encodings that are either declared as a
potential default in the relevant specifications or (in the case of
utf-8) are a rising default and thus a good encoding to try. In other
words, because we can't try every possible encoding.
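
To make the fallback idea concrete, here is a minimal sketch in Python
(not the validator's actual code). The function name and the candidate
list are assumptions for illustration only: utf-8 is the only encoding
named in this thread, and the other two entries are placeholders.

```python
from typing import Optional, Sequence, Tuple

# Assumed candidate list, in the order to try. Only utf-8 is named in
# this thread; the other two entries are illustrative placeholders.
CANDIDATES = ("utf-8", "windows-1252", "iso-8859-1")


def guess_decode(
    raw: bytes, candidates: Sequence[str] = CANDIDATES
) -> Optional[Tuple[str, str]]:
    """Return (encoding, text) for the first candidate that decodes
    raw without error, or None if every candidate fails."""
    for name in candidates:
        try:
            return name, raw.decode(name)  # strict decoding
        except UnicodeDecodeError:
            continue  # malformed in this encoding, try the next one
    return None
```

Whatever this returns, the "No Character Encoding Found!" message would
still be shown, since the guess is only a convenience.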

> Why would you take extra trouble to make such a guess and ultimately
> reject a document just because it is in one of the many 8-bit
> encodings that happens to have characters in positions that make them
> malformed in any of the encodings you tried?
[...]
> Exactly what is wrong with the idea of assuming octets 0...127 decimal
> to have their Ascii meanings and other octets to constitute data
> characters in some unknown encoding? Not knowing what those data
> characters are does not harm validation at all.

That was discussed earlier. In a nutshell: error display, source
display, outline, etc. We need to transcode into the output charset of
the validator, namely utf-8, or forfeit those features. The latter
does not appear to be an option at this point, hence the compromise.
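
A small Python sketch of why transcoding needs a known source encoding
(the helper name is hypothetical, and this is not the validator's
code): the same non-Ascii octet becomes different utf-8 output
depending on which encoding is assumed, and it cannot simply be copied
into a utf-8 page as is.

```python
def to_output_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Transcode raw from source_encoding into utf-8 for output."""
    return raw.decode(source_encoding).encode("utf-8")


octet = b"\xe9"  # one octet above 127, meaning unknown without a label

as_latin1 = to_output_utf8(octet, "iso-8859-1")    # read as e-acute
as_cyrillic = to_output_utf8(octet, "iso-8859-5")  # read as a Cyrillic letter

# The two assumptions give different utf-8 byte sequences.
print(as_latin1 != as_cyrillic)

# Copied through untouched, the octet is not valid utf-8 at all.
try:
    octet.decode("utf-8")
    valid_as_is = True
except UnicodeDecodeError:
    valid_as_is = False
print(valid_as_is)
```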

-- 
olivier

Received on Monday, 28 April 2008 06:43:21 UTC