Re: charset parameter

From: Terje Bless (link@pobox.com)
Date: Fri, Jul 27 2001

    Date: Fri, 27 Jul 2001 10:54:14 +0200
    From: Terje Bless <link@pobox.com>
    To: W3C Validator <www-validator@w3.org>
    Message-ID: <20010727110142-r01010700-aa8bf379-0910-010c@192.168.1.6>
    Subject: Re: charset parameter
    
    On 27.07.01 at 00:05, Bjoern Hoehrmann <derhoermi@gmx.net> wrote:
    
    >* Terje Bless wrote:
    >>>>When it comes time to parse the markup, you already have a charset; the
    >>>>XML/HTML rules do not govern HTTP.
    >>>They do for conforming applications.
    >>
    >>No they don't!
    >
    >A conforming HTML user agent must adhere to all "must"s in the HTML 4
    >recommendation. Assuming no default value for the charset parameter is a
    >must. Applications that do something different, i.e. assuming some default
    >value or don't check if an explicit charset was given, aren't conforming
    >user agents.
    
    You fail to distinguish between a "HTML 4 User Agent" and a "HTTP Client
    Application". By the time the HTTP Application has finished processing the
    response, and hands it over to the HTML 4 User Agent, the character
    encoding is "ISO-8859-1" with no way of knowing whether that is by implicit
    assumption (HTTP default parameter) or by explicit definition. The HTML 4
    User Agent does not and cannot know whether the charset parameter was
    present or not.
    
    Just because most browsers today are hybrid HTTP/HTML combinations does not
    mean the distinction does not exist.
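    
    To make the layering concrete, here is a minimal sketch in Python. The
    function names are invented for illustration and do not come from any
    real client; the ISO-8859-1 default is the HTTP/1.1 default for text/*
    types (RFC 2616, section 3.7.1). The point is only that the value handed
    to the HTML side is the same whether the parameter was present or not.

```python
from typing import Optional

# HTTP/1.1 default for text/* media types (RFC 2616, section 3.7.1).
HTTP_DEFAULT_CHARSET = "ISO-8859-1"


def charset_from_content_type(header: str) -> Optional[str]:
    """Return the explicit charset parameter of a Content-Type value, or None."""
    # Everything after the first ';' is a list of name=value parameters.
    for param in header.split(";")[1:]:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset" and value.strip():
            return value.strip().strip('"')
    return None


def http_layer_encoding(header: str) -> str:
    """What the HTTP side hands over: one encoding name, with no provenance."""
    return charset_from_content_type(header) or HTTP_DEFAULT_CHARSET


# Both calls hand the same string to the HTML layer, so it cannot tell
# an implicit default from an explicit label.
implicit = http_layer_encoding("text/html")
explicit = http_layer_encoding("text/html; charset=ISO-8859-1")
```

    An application that wants to apply the HTML 4 rule would have to pass
    the result of charset_from_content_type (including None) across the
    boundary instead of the already-defaulted value.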
    
    
    BTW, I don't suppose there is any chance we could get an errata on the HTML
    Rec. that sez "meta" MAY be interpreted by the server but MUST be ignored
    by UAs? Pretty please with sugar on top? :-)
    
    Then we could drop this whole issue and just point at the errata. :-)
    
    
    >Please note that I don't comment on how applications should behave, nor if
    >I like this definition.
    
    Me neither. I'm not arguing what I think is the proper behaviour; I'm
    arguing about what is the correct interpretation of the relevant specs.
    
    If I were writing a browser-equivalent application, I would probably assume
    UTF-8 if no charset was given in the HTTP response, and complain -- loudly!
    -- if the result was not valid UTF-8 (or valid HTML for that matter ;D). If
    I were writing a spec I would probably mandate UTF-8 for unlabeled docs,
    and strongly discourage the use of other encodings.
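    
    As a rough sketch of that browser-style policy (not what the Validator
    does, and decode_body is a made-up name for the example): use the
    declared charset when there is one, otherwise assume UTF-8 and refuse
    to continue if the bytes are not valid UTF-8.

```python
from typing import Optional


def decode_body(body: bytes, declared_charset: Optional[str]) -> str:
    """Decode with the declared charset, or assume UTF-8 and fail loudly."""
    if declared_charset is not None:
        return body.decode(declared_charset)
    try:
        # Strict decoding: any invalid byte sequence raises instead of
        # being silently replaced.
        return body.decode("utf-8", errors="strict")
    except UnicodeDecodeError as err:
        raise ValueError(
            f"no charset given and body is not valid UTF-8 "
            f"(first bad byte at offset {err.start})"
        ) from err
```

    With this, an unlabeled Latin-1 body such as b"Bj\xf6rn" is rejected
    rather than guessed at, while the same bytes decode fine when the
    response labels them as ISO-8859-1.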
    
    Unfortunately, for the Validator, correctness is the goal rather than
    convenience. I just simply don't know what the correct behaviour is here;
    and the fact that we need to take deployed browser behaviour and user
    expectations into account, for how we respond to whatever we decide the
    correct behaviour is, does not make the issue any clearer to me.
    
    
    At least the XML Rec. seems to have solved some of my problems for XML; it
    describes fairly well the expected behaviour when faced with various
    encoding variants and labellings. It's not quite unambiguous, but it's
    close enough to split hairs on. :-)
    
    
    
    Björn, Nick, Martin (and anyone else with an opinion ;D)[0]: could you take
    a look at the pseudo-algorithm I posted the other day and tell me of any
    problems you see with it? What _exactly_ would you say is the "correct"
    behaviour for the Validator? Did I leave out anything?
    
    
    
    
    
    
    [0] - BTW, where is Liam at these days?
          I haven't seen him around in a while.