Re: charset parameter

From: Terje Bless (link@pobox.com)
Date: Sat, Jul 28 2001

    Date: Sat, 28 Jul 2001 23:01:03 +0200
    From: Terje Bless <link@pobox.com>
    To: W3C Validator <www-validator@w3.org>
    Message-ID: <20010728231840-r01010700-0e457ca8-0910-010c@192.168.1.6>
    Subject: Re: charset parameter
    
    On 28.07.01 at 20:31, Nick Kew <nick@webthing.com> wrote:
    
    >On Fri, 27 Jul 2001, Martin Duerst wrote:
    >
    >>Björn
    >
    >OK, smart****, what's this with charset="ISO-2022-JP" for posting to this
    >list?  Or are you really in Japan?
    
    He's really in Japan. :-)
    
    (
      /You/ try coordinating timezones between Norway (UTC+1), Canada (UTC-5),
      and Japan (UTC+9); and then throw in various conventions for DST...
    
      *sigh*
    
      As luck(?) would have it, both Gerald and I have weird sleep patterns
      and keep odd hours so it kinda works out in the end. :-)
    )
    
    
    >>>assume to be ASCII-compatible (for now).
    >
    >Questionmark: shouldn't that be iso-8859-1?
    
    Doesn't matter. What we need it for will be plain ASCII. OTOH, Latin 1 and
    UTF-8 are strict supersets of ASCII so we can just use UTF-8 and not worry
    about it.
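
A minimal sketch of why the choice doesn't matter for plain-ASCII input,
assuming Python and its built-in codecs. The helper names are made up for
illustration and are not part of the validator:

```python
def decode_all(data, encodings=("ascii", "iso-8859-1", "utf-8")):
    """Decode the same bytes under each encoding; None marks a failure."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = data.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results


def is_charset_agnostic(data):
    """True if ASCII, Latin 1 and UTF-8 all decode the bytes identically."""
    decoded = decode_all(data)
    return None not in decoded.values() and len(set(decoded.values())) == 1


# Every 7-bit byte means the same thing in all three encodings...
print(is_charset_agnostic(bytes(range(128))))
# ...but a single Latin 1 "ö" (0xF6) is enough to make them disagree.
print(is_charset_agnostic("Björn".encode("iso-8859-1")))
```

As long as whatever we look at before the real charset is known stays
within 7 bits, any of the three labels gives the same result.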
    
    
    
    >>>If [HTTP Charset] found, use it (for now).
    >>No. If found, use it, done.
    >
    >>>If [META Charset] found, use unconditionally, overriding HTTP.
    >>No, don't override. See HTML 4, section 5.2.
    >
    >Agreed[...].
    
    Ok. This is assuming that you come down on Björn's side of
    the interpretation of the standards;  that is, they should
    be interpreted in accordance with widespread practice. But
    widespread practice is that META overrides anything in the
    HTTP header  (on the assumption that HTTP isn't set by the
    author and so is likely wrong).  That means that we in one
    place cater to users' expectations, but in the place where
    it's most important we refuse to do the same.
    
    Actually, this leaves us a way out of the whole issue;  we
    can just decide to cater to user expectation,  and what is
    the widespread practice, by always using the META if it is
    available and «fallback» to the HTTP header (including any
    defaulting) only if META isn't available. This is provably
    incorrect behaviour and encourages the META bogosity,  but
    it does give us a cleaner way out of this mess.
    
    
    What I _want_ to do, though, is to accept only a charset from the HTTP
    header, or a charset parameter to the CGI, and then ignore META
    altogether (with some suitable handwaving). This lets us off the hook on
    this whole issue, and discourages use of the META hack. If you have charset
    info in the HTTP header you're fine, and if you use the charset override
    you can check your document but it won't ever be labelled as Valid or give
    you the badge.
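
The policy described above could be sketched roughly as follows. This is
an illustration in Python, not the validator's code; the names
`decide_charset` and `CharsetDecision` are hypothetical, and letting the
override win over the HTTP header is an assumption taken from the quoted
discussion:

```python
from typing import NamedTuple, Optional


class CharsetDecision(NamedTuple):
    charset: str
    source: str           # "override" or "http"
    badge_eligible: bool  # may the result ever be labelled Valid?


def decide_charset(http_charset: Optional[str],
                   cgi_override: Optional[str],
                   meta_charset: Optional[str] = None
                   ) -> Optional[CharsetDecision]:
    """Pick a charset from HTTP or the CGI override; META is ignored."""
    # meta_charset is accepted only to make the point that it is unused.
    if cgi_override:
        # Assumption: an explicit override beats the HTTP header, but a
        # document checked this way never gets the badge.
        return CharsetDecision(cgi_override.lower(), "override", False)
    if http_charset:
        return CharsetDecision(http_charset.lower(), "http", True)
    # No usable label anywhere: nothing to decode the document with.
    return None


print(decide_charset("ISO-8859-1", None, meta_charset="utf-8"))
print(decide_charset(None, "utf-8"))
print(decide_charset(None, None, meta_charset="utf-8"))
```

Note that the last call returns None even though a META charset is
present, which is the whole point of the proposal.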
    
    (
      For XML of the non-text/html variety, I think we have good enough
      heuristics to do a decent job of sniffing until we hit the encoding
      parameter in the XML Declaration. The expectation here is also such
      that you can easily get away with refusing to handle undefined gunk.
    )
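
A rough sketch of that kind of sniffing, loosely following the byte
patterns in Appendix F of the XML 1.0 Recommendation. It is deliberately
incomplete (no UTF-32 or EBCDIC, for instance), the function name
`sniff_xml_encoding` is hypothetical, and the 200-byte window is an
arbitrary assumption:

```python
import re

# (signature, BOM bytes to skip, codec for reading the declaration, default)
_SIGNATURES = [
    (b"\xef\xbb\xbf", 3, "utf-8", "utf-8"),      # UTF-8 BOM
    (b"\xfe\xff", 2, "utf-16-be", "utf-16"),     # UTF-16 BOM, big-endian
    (b"\xff\xfe", 2, "utf-16-le", "utf-16"),     # UTF-16 BOM, little-endian
    (b"\x00<\x00?", 0, "utf-16-be", "utf-16"),   # "<?" in UTF-16, no BOM
    (b"<\x00?\x00", 0, "utf-16-le", "utf-16"),
    (b"<?xm", 0, "ascii", "utf-8"),              # some ASCII-compatible encoding
]

_DECL = re.compile(
    r"""^<\?xml[^>]*?\sencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(data):
    """Return the declared (or implied) encoding, or None for undefined gunk."""
    for signature, bom_len, codec, default in _SIGNATURES:
        if data.startswith(signature):
            # Read just enough to see the XML Declaration, if there is one.
            head = data[bom_len:bom_len + 200].decode(codec, errors="replace")
            match = _DECL.match(head)
            return match.group(1) if match else default
    return None  # no recognizable start: refuse to guess


print(sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
print(sniff_xml_encoding(b"<html>"))
```

Returning None for anything that does not start like XML matches the
"refuse to handle undefined gunk" stance rather than trying to guess.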
    
    Do you think we can get away with that? :-)
    
    
    
    >>>3) Check for a CGI "charset" parameter.
    >>>a) If found, use unconditionally, overriding META, but mark doc invalid.
    >> 
    >>This overrides both META and HTTP, so it should come first. The 'charset'
    >>parameter corresponds to an explicit (per page) user setting in a
    >>browser.
    >
    >er??? WTF is a CGI "charset" parameter?  CGI has exactly what comes to
    >it from HTTP.  Is there some other HTTP charset header set by browsers?
    
    /check?uri=foo;charset=whatever
    
    It's in the current development tree, but hasn't been released yet. See
    <URL:http://validator.w3.org:8001/check?uri=http://localhost/;charset=iso-8859-1>.
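
For illustration, here is one way such a query string could be pulled
apart, treating ";" as the pair separator as in the URL above. This is a
Python sketch and says nothing about how the validator itself parses it;
`parse_check_query` is a made-up name, and accepting "&" as well is an
assumption:

```python
from urllib.parse import urlsplit, unquote_plus


def parse_check_query(url):
    """Split the query of a /check URL into a dict of parameters."""
    query = urlsplit(url).query
    params = {}
    # Treat "&" and ";" alike; a literal ";" inside a value would have
    # to be percent-encoded by the caller.
    for pair in query.replace("&", ";").split(";"):
        if not pair:
            continue
        name, _, value = pair.partition("=")
        params[unquote_plus(name)] = unquote_plus(value)
    return params


url = ("http://validator.w3.org:8001/check"
       "?uri=http://localhost/;charset=iso-8859-1")
print(parse_check_query(url))
```

The "charset" entry of the resulting dict is what would act as the
per-request override.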
    
    
    >MIME part-headers for file upload are a different story.  Indeed, I
    >wonder if they're what Terje has in mind in this argument
    
    Not really, but now that you mention it, they /are/ another headache.
    Thanks just /sooo/ much for reminding me... :-(
    
    
    >When we tell the user to "deal with it", what is the severity of our
    >message?  I'd be inclined to stick with Warning.
    
    "I apologize, but my charset-fu is too weak to do honor to your document".
    
    It's not a warning or error in the document; it's that we can't find any
    charset info anywhere and so we're unable to process it. The status of the
    document is undefined/undetermined. This neatly sidesteps any number of
    nasty issues. :-)