Re: charset parameter from Terje Bless on 2001-07-27 (www-validator@w3.org from July 2001)

From: Terje Bless <link@pobox.com>
Date: Fri, 27 Jul 2001 10:54:14 +0200
To: W3C Validator <www-validator@w3.org>
Message-ID: <20010727110142-r01010700-aa8bf379-0910-010c@192.168.1.6>

On 27.07.01 at 00:05, Bjoern Hoehrmann <derhoermi@gmx.net> wrote:

>* Terje Bless wrote:
>>>>When it comes time to parse the markup, you already have a charset; the
>>>>XML/HTML rules do not govern HTTP.
>>>They do for conforming applications.
>>
>>No they don't!
>
>A conforming HTML user agent must adhere to all "must"s in the HTML 4
>recommendation. Assuming no default value for the charset parameter is a
>must. Applications that do something different, i.e. assuming some default
>value or don't check if an explicit charset was given, aren't conforming
>user agents.

You fail to distinguish between an "HTML 4 User Agent" and an "HTTP Client
Application". By the time the HTTP Application has finished processing the
response, and hands it over to the HTML 4 User Agent, the character
encoding is "ISO-8859-1" with no way of knowing whether that is by implicit
assumption (HTTP default parameter) or by explicit definition. The HTML 4
User Agent does not and cannot know whether the charset parameter was
present or not.

Just because most browsers today are hybrid HTTP/HTML combinations does not
mean the distinction does not exist.
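
To make the hand-off concrete, here is a minimal sketch in Python. It assumes
a strictly layered design; the names http_layer_charset and
html_layer_receive are hypothetical and exist only for illustration. The
HTTP layer applies the HTTP default when the charset parameter is missing,
and the HTML layer receives nothing but bytes and a charset name.

```python
from email.message import Message

# HTTP/1.1 default for text/* media types when no charset parameter is sent.
HTTP_DEFAULT_CHARSET = "ISO-8859-1"


def http_layer_charset(content_type_header):
    """Return the charset the HTTP layer reports for a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    explicit = msg.get_param("charset")  # None when the parameter is absent
    return explicit if explicit is not None else HTTP_DEFAULT_CHARSET


def html_layer_receive(body, charset):
    """Hypothetical HTML layer: it sees only bytes plus a charset name."""
    return body.decode(charset)


# An implicit default and an explicit label look identical after the hand-off.
implicit = http_layer_charset("text/html")
explicit = http_layer_charset("text/html; charset=ISO-8859-1")
```

In this sketch both calls yield the same string, so nothing downstream can
tell whether the label was defaulted or declared.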


BTW, I don't suppose there is any chance we could get an erratum on the HTML
Rec. that sez "meta" MAY be interpreted by the server but MUST be ignored
by UAs? Pretty please with sugar on top? :-)

Then we could drop this whole issue and just point at the erratum. :-)


>Please note that I don't comment on how applications should behave, nor if
>I like this definition.

Me neither. I'm not arguing what I think is the proper behaviour; I'm
arguing about what is the correct interpretation of the relevant specs.

If I were writing a browser-equivalent application, I would probably assume
UTF-8 if no charset was given in the HTTP response, and complain -- loudly!
-- if the result was not valid UTF-8 (or valid HTML for that matter ;D). If
I were writing a spec I would probably mandate UTF-8 for unlabeled docs,
and strongly discourage the use of other encodings.
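
A rough Python sketch of that "assume UTF-8 and complain" policy follows.
This is the hypothetical browser-like behaviour described above, not what
the Validator does; decode_unlabeled and UnlabeledEncodingError are names
made up for the example.

```python
class UnlabeledEncodingError(Exception):
    """Raised when a document does not decode under the chosen charset."""


def decode_unlabeled(body, declared_charset=None):
    """Decode body, assuming UTF-8 when no charset was declared."""
    charset = declared_charset or "utf-8"
    try:
        # Strict decoding: any invalid byte sequence is an error, not a guess.
        return body.decode(charset, errors="strict")
    except UnicodeDecodeError as err:
        # Complain loudly instead of silently falling back to another charset.
        raise UnlabeledEncodingError(
            f"document is not valid {charset}: {err}"
        ) from err
```

An unlabeled document containing a lone 0xE9 byte (a Latin-1 "e" with acute
accent) is rejected here rather than rendered under a guessed encoding.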

Unfortunately, for the Validator, correctness is the goal rather than
convenience. I just simply don't know what the correct behaviour is here;
and the fact that we need to take deployed browser behaviour and user
expectations into account, for how we respond to whatever we decide the
correct behaviour is, does not make the issue any clearer to me.


At least the XML Rec. seems to have solved some of my problems for XML; it
describes fairly well the expected behaviour when faced with various
encoding variants and labellings. It's not quite unambiguous, but it's
close enough to split hairs on. :-)



Björn, Nick, Martin (and anyone else with an opinion ;D)[0]: could you take
a look at the pseudo-algorithm I posted the other day and tell me of any
problems you see with it? What _exactly_ would you say is the "correct"
behaviour for the Validator? Did I leave out anything?






[0] - BTW, where is Liam these days?
      I haven't seen him around in a while.
Received on Friday, 27 July 2001 05:02:00 UTC