- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Thu, 24 Apr 2008 20:09:39 +0300
- To: <www-validator@w3.org>
David Dorward wrote:
> Looking at the HTML spec, it says 'user agents must not assume any
> default value for the "charset" parameter'
> (http://www.w3.org/TR/html4/charset.html
> ). So, following that guidance, the validator shouldn't guess at all
> and should just state that no encoding was found and that it can't
> continue until one is specified.
I don't think that's quite the idea. Rather, the point is that no fixed
default for the parameter (US-ASCII, ISO-8859-1, UTF-8, or any other)
should be assumed. Thus, when a user agent encounters a document with no
charset parameter, it _could_ just reject the data as incomprehensible
and nonconforming, but it _should_ make some effort to infer an
encoding. There are many approaches to this.
Anyway, once an inference or guess has been made, a user agent _should_
apparently report a problem or backtrack if it turns out that the data
does not meet the constraints of the guessed encoding. It _should_ IMHO
be honest about it, saying, in effect, something like the following:
"The encoding of the document was not specified in any manner prescribed
in HTML specifications. Therefore, I tried to make an educated guess and
came up with the idea that the encoding is xxx. Then I found out that
this cannot be correct. Now I'm giving up. You need to specify the
encoding and retry."
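As a rough illustration of the "guess, check, then be honest" behaviour described above, here is a minimal Python sketch. The function name report_guess_failure is hypothetical, and strict decoding is only one possible way to test whether the data meets the constraints of a guessed encoding; this is not how the validator itself is implemented.

```python
from typing import Optional


def report_guess_failure(data: bytes, guessed: str) -> Optional[str]:
    """Return None if `data` is valid in the guessed encoding, else a report.

    Hypothetical helper: `guessed` is whatever encoding name the user
    agent came up with by its own heuristics.
    """
    try:
        # Strict decoding fails on the first octet sequence that violates
        # the constraints of the guessed encoding.
        data.decode(guessed, errors="strict")
    except UnicodeDecodeError as err:
        bad_octet = data[err.start]
        return (
            "The encoding of the document was not specified in any manner "
            "prescribed in HTML specifications. Therefore, I tried to make "
            "an educated guess and came up with the idea that the encoding "
            f"is {guessed}. Then I found out that this cannot be correct "
            f"(octet 0x{bad_octet:02X} at offset {err.start}). "
            "Now I'm giving up. You need to specify the encoding and retry."
        )
    return None


if __name__ == "__main__":
    # ISO-8859-1 data wrongly guessed to be UTF-8.
    print(report_guess_failure(b"caf\xe9", "utf-8"))
```

A caller could equally backtrack and try another guess instead of giving up; the sketch only shows the honest-report branch.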
In the absence of any particular reason to guess anything else, I think
a user agent should assume a hypothetical generic encoding (we could
give it a name, but that's not important right now) that uses 8 bits for
one character so that octets 0 - 127 have their ASCII values and other
octets denote undefined graphic characters.
This works well for most encodings actually in use, for the purposes of
validation, unless the document uses non-ASCII characters in names of
elements and attributes (which is permitted in XML but inadvisable and
uncommon in practice). It doesn't matter what the "upper half" octets
mean, since they would normally be just data characters, and validators
don't care which characters appear as text content.
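The hypothetical generic encoding can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names decode_generic and element_names are made up for the example, U+FFFD is an arbitrary stand-in for "undefined graphic character", and the regular expression is a crude substitute for real markup parsing.

```python
import re
from typing import List

# Arbitrary stand-in for an "undefined graphic character" (assumption).
PLACEHOLDER = "\ufffd"


def decode_generic(data: bytes) -> str:
    """Decode one octet per character: 0-127 as ASCII, the rest opaque."""
    return "".join(chr(b) if b < 0x80 else PLACEHOLDER for b in data)


def element_names(text: str) -> List[str]:
    """Crude extraction of start-tag names, for demonstration only."""
    return re.findall(r"<([A-Za-z][A-Za-z0-9]*)", text)


if __name__ == "__main__":
    source = "<p><em>caf\u00e9</em></p>"
    for encoding in ("iso-8859-1", "utf-8"):
        decoded = decode_generic(source.encode(encoding))
        # The markup is identical either way; only the data characters differ.
        print(encoding, element_names(decoded), repr(decoded))
```

Both encodings yield the same element names, because the markup itself consists of ASCII octets only. The approach would not help with encodings that are not ASCII-compatible at the octet level, such as UTF-16.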
Now I have a déjà-vu feeling: I'm pretty sure this has been discussed at
least once in the current universe, in a context like this. I cannot
recollect any arguments against the simple approach I proposed.
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/
Received on Thursday, 24 April 2008 17:10:19 UTC