Re: Fallback to UTF-8 from Jukka K. Korpela on 2008-04-24 (www-validator@w3.org from April 2008)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Thu, 24 Apr 2008 20:09:39 +0300
To: <www-validator@w3.org>
Message-ID: <026301c8a62d$fba98870$0500000a@DOCENDO>

David Dorward wrote:

> Looking at the HTML spec, it says 'user agents must not assume any
> default value for the "charset" parameter'
> (http://www.w3.org/TR/html4/charset.html
> ). So, following that guidance, the validator shouldn't guess at all
> and should just state that no encoding was found and that it can't
> continue until one is specified.

I don't think that's quite the idea. Rather, the point is that no
default for the parameter (US-ASCII, ISO-8859-1, UTF-8, or any other)
should be assumed. Thus, when a user agent encounters a document with no
charset parameter, it _could_ just reject the data as incomprehensible
and nonconforming, but it _should_ make some effort to infer an
encoding. There are many approaches to this.

Anyway, once an inference or guess has been made, a user agent _should_
presumably report a problem or backtrack if it turns out that the data
does not meet the constraints of the guessed encoding. It _should_ IMHO
be honest about it, saying, in effect, something like the following:

"The encoding of the document was not specified in any manner prescribed 
in HTML specifications. Therefore, I tried to make an educated guess and 
came up with the idea that the encoding is xxx. Then I found out that 
this cannot be correct. Now I'm giving up. You need to specify the 
encoding and retry."
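
To make the guess-then-verify idea concrete, here is a minimal Python
sketch. The function name decode_with_guess and the exact wording of the
message are illustrative assumptions, and Python's strict decoding merely
stands in for whatever constraint checking a real validator would do.

```python
def decode_with_guess(data: bytes, guessed: str) -> str:
    """Decode data under a guessed encoding, or say honestly why it failed."""
    try:
        # Strict mode rejects any octet sequence that violates the
        # constraints of the guessed encoding.
        return data.decode(guessed, errors="strict")
    except UnicodeDecodeError as err:
        # Hypothetical report text, modelled on the message quoted above.
        raise ValueError(
            "The encoding of the document was not specified in any manner "
            "prescribed in HTML specifications. An educated guess of "
            f"{guessed} was tried, but octet 0x{data[err.start]:02X} at "
            f"offset {err.start} is not valid in that encoding. "
            "Specify the encoding and retry."
        ) from err
```

Calling decode_with_guess(b"caf\xe9", "utf-8") raises the error, since a
lone octet E9 cannot occur at the end of UTF-8 data, whereas the same
bytes pass under a guess of "iso-8859-1".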

In the absence of any particular reason to guess anything else, I think
a user agent should assume a hypothetical generic encoding (we could
give it a name, but that's not important right now) that uses one octet
per character, so that octets 0-127 have their ASCII values and the
other octets denote undefined graphic characters.

This works well for most encodings actually in use, for the purposes of
validation, unless the document uses non-ASCII characters in names of
elements and attributes (which is permitted in XML but inadvisable and
uncommon in practice). It doesn't matter what the "upper half" octets
mean, since they would normally be just data characters, and validators
don't care about text characters.
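
The generic encoding described above could be sketched in Python roughly
as follows. The name decode_generic_8bit and the use of Private Use Area
code points as the "undefined graphic characters" are assumptions made
only for illustration; any set of 128 distinct placeholders would do.

```python
# Hypothetical choice: octet 0x80 + n maps to U+E080 + n, a Private Use
# Area code point with no assigned meaning.
PLACEHOLDER_BASE = 0xE000


def decode_generic_8bit(data: bytes) -> str:
    """Map octets 0-127 to ASCII and all other octets to placeholders."""
    return "".join(
        chr(octet) if octet < 0x80 else chr(PLACEHOLDER_BASE + octet)
        for octet in data
    )


# The markup survives whether the input was ISO-8859-1 or UTF-8; only
# the data characters differ.
latin1_doc = decode_generic_8bit(b"<p>caf\xe9</p>")
utf8_doc = decode_generic_8bit(b"<p>caf\xc3\xa9</p>")
```

In both results the tags are intact and every upper-half octet has become
one opaque data character, which is all that markup validation needs.
This only holds for ASCII-compatible encodings; UTF-16 data, for example,
would not survive this treatment.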

Now I have a déjà vu feeling: I'm pretty sure this has been discussed at
least once in the current universe, in a context like this. I cannot
recollect any arguments against the simple approach I proposed.

Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/

Received on Thursday, 24 April 2008 17:10:19 UTC