- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Thu, 24 Apr 2008 20:09:39 +0300
- To: <www-validator@w3.org>
David Dorward wrote:

> Looking at the HTML spec, it says 'user agents must not assume any
> default value for the "charset" parameter'
> (http://www.w3.org/TR/html4/charset.html). So, following that guidance,
> the validator shouldn't guess at all and should just state that no
> encoding was found and that it can't continue until one is specified.

I don't think that's quite the idea. Rather, that no default for the parameter (US-ASCII, ISO-8859-1, UTF-8, or any other default) should be assumed. Thus, when a user agent encounters a document with no charset parameter, it _could_ just reject the data as incomprehensible and nonconforming, but it _should_ make some effort at reasoning an encoding. There are many approaches to this.

Anyway, when a reasoning or guess has been made, a user agent _should_ apparently report a problem or backtrack if it turns out that the data does not meet the constraints of the encoding. It _should_ IMHO be honest about it, saying, in effect, something like the following:

"The encoding of the document was not specified in any manner prescribed in HTML specifications. Therefore, I tried to make an educated guess and came up with the idea that the encoding is xxx. Then I found out that this cannot be correct. Now I'm giving up. You need to specify the encoding and retry."

In the absence of any particular reason to guess anything else, I think a user agent should assume a hypothetical generic encoding (we could give it a name, but that's not important right now) that uses 8 bits for one character so that octets 0 - 127 have their ASCII values and other octets denote undefined graphic characters. This works well for most encodings actually in use, for the purposes of validation, unless the document uses non-ASCII characters for names of elements and attributes (which is permitted in XML but unadvisable and uncommon in practice).
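
To make the generic-encoding idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the function name `decode_generic_8bit` and the constant `PLACEHOLDER_BASE` are hypothetical, and the use of Unicode private-use code points to stand for the "undefined graphic characters" is one possible choice, not something any specification prescribes.

```python
# Hypothetical illustration of the "generic 8-bit encoding" idea.
# Assumption: each octet 128-255 is represented by a private-use code point
# (U+F780..U+F7FF), i.e. "some graphic character whose identity is unknown".
PLACEHOLDER_BASE = 0xF700


def decode_generic_8bit(data: bytes) -> str:
    """Decode one octet per character.

    Octets 0-127 keep their ASCII meaning; octets 128-255 become
    distinct placeholder characters with no defined meaning.
    """
    return "".join(
        chr(octet) if octet < 0x80 else chr(PLACEHOLDER_BASE + octet)
        for octet in data
    )


# The same markup, with "é" encoded in ISO-8859-1 and in UTF-8.
latin1_doc = b"<p>caf\xe9</p>"
utf8_doc = b"<p>caf\xc3\xa9</p>"

for doc in (latin1_doc, utf8_doc):
    text = decode_generic_8bit(doc)
    # The ASCII markup survives regardless of what the upper-half octets mean.
    print(text.startswith("<p>caf"), text.endswith("</p>"))
```

In both cases the tags come through unchanged, and only the data characters are opaque, which is all a validator needs as long as element and attribute names are ASCII.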
It doesn't matter what the "upper half" octets mean, since they would normally be just data characters and validators don't care about text characters.

Now I have a déjà-vu feeling: I'm pretty sure this has been discussed at least once in the current universe, in a context like this. I cannot recollect any arguments against the simple approach I proposed.

Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/
Received on Thursday, 24 April 2008 17:10:19 UTC