Re: charset parameter

On 28.07.01 at 20:31, Nick Kew <nick@webthing.com> wrote:

>On Fri, 27 Jul 2001, Martin Duerst wrote:
>
>>Bj$BãS(Bn
>
>OK, smart****, what's this with charset="ISO-2022-JP" for posting to this
>list?  Or are you really in Japan?

He's really in Japan. :-)

(
  /You/ try coordinating timezones between Norway (UTC+1), Canada (UTC-5),
  and Japan (UTC+9); and then throw in various conventions for DST...

  *sigh*

  As luck(?) would have it, both Gerald and I have weird sleep patterns
  and keep odd hours so it kinda works out in the end. :-)
)


>>>assume to be ASCII-compatible (for now).
>
>Questionmark: shouldn't that be iso-8859-1?

Doesn't matter. What we need it for will be plain ASCII. OTOH, Latin 1 and
UTF-8 are strict supersets of ASCII so we can just use UTF-8 and not worry
about it.
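
A quick way to convince yourself of the superset claim; this is only a
sketch, and decodes_identically is a made-up helper name, not anything in
the validator:

```python
def decodes_identically(data, encodings=("ascii", "iso-8859-1", "utf-8")):
    """Return True if `data` decodes to the same text under every encoding."""
    results = set()
    for enc in encodings:
        try:
            results.add(data.decode(enc))
        except UnicodeDecodeError:
            # At least one encoding rejects these bytes outright.
            return False
    return len(results) == 1


# Pure ASCII bytes: all three encodings agree.
print(decodes_identically(b"charset=iso-8859-1"))

# A Latin 1 o-umlaut (byte 0xF6) is neither ASCII nor valid UTF-8.
print(decodes_identically("Bj\u00f6rn".encode("iso-8859-1")))
```

As long as everything we look at before the charset is known stays within
ASCII, the choice between the three makes no difference.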



>>>If [HTTP Charset] found, use it (for now).
>>No. If found, use it, done.
>
>>>If [META Charset] found, use unconditionally, overriding HTTP.
>>No, don't override. See HTML 4, section 5.2.
>
>Agreed[...].

Ok. This is assuming that you come down on Björn's side of
the interpretation of the standards;  that is, they should
be interpreted in accordance with widespread practice. But
widespread practice is that META overrides anything in the
HTTP header  (on the assumption that HTTP isn't set by the
author and so is likely wrong).  That means that we in one
place cater to users' expectations, but in the place where
it's most important we refuse to do the same.

Actually, this leaves us a way out of the whole issue;  we
can just decide to cater to user expectation,  and what is
the widespread practice, by always using the META if it is
available and «fallback» to the HTTP header (including any
defaulting) only if META isn't available. This is provably
incorrect behaviour and encourages the META bogosity,  but
it does give us a cleaner way out of this mess.


What I _want_ to do, though, is to accept only a charset from the HTTP
header, or a charset parameter to the CGI, and then ignore META
altogether (with some suitable handwaving). This lets us off the hook on
this whole issue, and discourages use of the META hack. If you have charset
info in the HTTP header you're fine, and if you use the charset override
you can check your document but it won't ever be labelled as Valid or give
you the badge.
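
Roughly, the policy I have in mind looks like the sketch below. The names
(CharsetDecision, decide_charset, badge_eligible) are invented for
illustration and are not taken from the validator source:

```python
from typing import NamedTuple, Optional


class CharsetDecision(NamedTuple):
    charset: Optional[str]   # None when nothing usable was supplied
    source: str              # "override", "http" or "none"
    badge_eligible: bool     # may the result ever be labelled Valid?


def decide_charset(http_charset, cgi_override, meta_charset=None):
    """Pick a charset from the CGI override or the HTTP header only.

    meta_charset is accepted so callers can pass it along, but it is
    deliberately ignored.
    """
    if cgi_override:
        # The user forced a charset: check the document, but no badge.
        return CharsetDecision(cgi_override.lower(), "override", False)
    if http_charset:
        return CharsetDecision(http_charset.lower(), "http", True)
    return CharsetDecision(None, "none", False)


print(decide_charset("ISO-8859-1", None, meta_charset="utf-8"))
print(decide_charset("ISO-8859-1", "Shift_JIS"))
print(decide_charset(None, None, meta_charset="utf-8"))
```

Note that a META charset on its own gets you nothing here; that is the
whole point.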

(
  For XML of the non-text/html variety, I think we have good enough
  heuristics to do a decent job of sniffing until we hit the encoding
  parameter in the XML Declaration. The expectation here is also such
  that you can easily get away with refusing to handle undefined gunk.
)
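
For the curious, the sniffing could go something like this. It is a
simplified sketch along the lines of the autodetection notes in the XML 1.0
spec, handling only UTF-8, UTF-16 and ASCII-compatible encodings (no UTF-32,
no EBCDIC). The function name sniff_xml_encoding is made up, and falling
back to UTF-8 when nothing is declared is my assumption based on the XML
default:

```python
import re

# Byte order marks we recognise, with the encoding family each implies.
_BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

# The encoding="..." part of an XML declaration at the very start of the text.
_DECL = re.compile(
    r"""^<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(head: bytes) -> str:
    """Guess the encoding of an XML document from its first bytes."""
    family = None
    for bom, enc in _BOMS:
        if head.startswith(bom):
            family, head = enc, head[len(bom):]
            break
    if family is None:
        if head.startswith(b"<\x00?\x00"):
            family = "utf-16-le"
        elif head.startswith(b"\x00<\x00?"):
            family = "utf-16-be"
        elif head.startswith(b"<?xml"):
            family = "ascii"      # ASCII-compatible: enough to read the decl
        else:
            return "utf-8"        # no BOM, no declaration: XML default
    # Decode just enough of the document to read the declaration.
    text = head[:200].decode(family, errors="replace")
    match = _DECL.match(text)
    if match:
        return match.group(1).lower()
    return "utf-8" if family == "ascii" else family


print(sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
```

Anything that doesn't fit one of these patterns is exactly the undefined
gunk we can refuse to handle.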

Do you think we can get away with that? :-)



>>>3) Check for a CGI "charset" parameter.
>>>a) If found, use unconditionally, overriding META, but mark doc invalid.
>> 
>>This overrides both META and HTTP, so it should come first. The 'charset'
>>parameter corresponds to an explicit (per page) user setting in a
>>browser.
>
>er??? WTF is a CGI "charset" parameter?  CGI has exactly what comes to
>it from HTTP.  Is there some other HTTP charset header set by browsers?

/check?uri=foo;charset=whatever

It's in the current development tree, but hasn't been released yet. See
<URL:http://validator.w3.org:8001/check?uri=http://localhost/;charset=iso-8859-1>.
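
To spell out what "a CGI charset parameter" means here: it is just another
name=value pair in the query string, separated by ";" in the URL above. A
minimal sketch of pulling it out (cgi_params is a hypothetical helper; the
validator itself does not use this code):

```python
from urllib.parse import urlsplit, unquote_plus


def cgi_params(url):
    """Split a query string on ';' (or '&') into a dict of parameters."""
    query = urlsplit(url).query
    params = {}
    for pair in query.replace("&", ";").split(";"):
        if not pair:
            continue
        name, _, value = pair.partition("=")
        params[unquote_plus(name)] = unquote_plus(value)
    return params


params = cgi_params(
    "http://validator.w3.org:8001/check?uri=http://localhost/;charset=iso-8859-1"
)
print(params["uri"])
print(params.get("charset"))
```

This naive split assumes that any ";" or "&" inside the uri value has been
percent-encoded.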


>MIME part-headers for file upload are a different story.  Indeed, I
>wonder if they're what Terje has in mind in this argument

Not really, but now that you mention it, they /are/ another headache.
Thanks just /sooo/ much for reminding me... :-(


>When we tell the user to "deal with it", what is the severity of our
>message?  I'd be inclined to stick with Warning.

"I apologize, but my charset-fu is too weak to do honor to your document".

It's not a warning or error in the document; it's that we can't find any
charset info anywhere and so we're unable to process it. The status of the
document is undefined/undetermined. This neatly sidesteps any number of
nasty issues. :-)
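
In other words, the result is a three-way thing rather than pass/fail.
Sketched with invented names (Outcome, report) and a plain error count
standing in for whatever the parser actually returns:

```python
from enum import Enum


class Outcome(Enum):
    VALID = "valid"
    INVALID = "invalid"
    UNDETERMINED = "undetermined"   # we never got as far as parsing


def report(charset, error_count=None):
    """Map 'did we find a charset' plus parser errors to an outcome."""
    if charset is None:
        # Not a warning or an error in the document: we simply cannot tell.
        return Outcome.UNDETERMINED
    return Outcome.VALID if error_count == 0 else Outcome.INVALID


print(report(None))
print(report("iso-8859-1", error_count=0))
print(report("iso-8859-1", error_count=3))
```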

Received on Saturday, 28 July 2001 17:18:43 UTC