Re: charset parameter

On 25.07.01 at 16:57, Martin Duerst <duerst@w3.org> wrote:

>At 03:53 01/07/25 +0200, Terje Bless wrote:
>
>>The issue is that the transport protocol says that an absence of an
>>explicit charset parameter on the Content-Type means "ISO-8859-1"; HTML
>>or XML rules don't apply here. When it comes time to parse the markup,
>>you already have a charset; the XML/HTML rules do not govern HTTP.
>
>[The] HTML 4 spec explicitly says that the HTTP default doesn't work.

The HTML Recommendation has no authority to dictate syntax or semantics for
an arbitrary transport protocol. HTML sent over SMTP, encapsulated in MIME,
must conform to RFC 2822 and RFCs 2045-2049 first. As far as SMTP is
concerned, "text/html" is just an opaque block of data whose exact
transport details are dictated by the Content-Transfer-Encoding field.

I'm guessing that the _intent_ was that something labelled "ISO-8859-1"
should be parsed accordingly, until a meta element with, say,
"windows-1250" was encountered, and then _restarted_ with the new encoding
in effect (implicit in this is that it should be compatible with the
transport encoding up to the meta element).
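That "parse, then restart on meta" reading can be sketched as follows. This
is my own illustration, not anything from the HTML spec or the Validator
source; the function name and regex are hypothetical, and a real parser
would need to be far more careful.

```python
import re

# Loose match for <meta ... charset=xxx>; purely illustrative.
META_RE = re.compile(r'<meta[^>]+charset\s*=\s*["\']?([\w-]+)',
                     re.IGNORECASE)

def sniff_charset(raw: bytes, transport_default: str = "iso-8859-1") -> str:
    # Decode provisionally with the transport default. This only works
    # if the real encoding is compatible with the transport encoding up
    # to the meta element, as noted above.
    provisional = raw.decode(transport_default, errors="replace")
    m = META_RE.search(provisional)
    # If a meta element declares a different charset, restart with the
    # new encoding in effect; otherwise stick with the default.
    return m.group(1).lower() if m else transport_default
```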

This obviously does not consider HTTP defaulting behaviour, but even
[RFC 2854] still says that ISO-8859-1 is the default.

See also
<URL:http://lists.w3.org/Archives/Public/www-validator/2001AprJun/0163.html>
for the details Björn posted in April[0].


>>In practice you have to decide between "Assume ISO-8859-1 as that's what
>>/people/ tend to assume" or "Assume nothing as people will get it wrong
>>some part of the time".
>
>Well, in your part, that's what /people/ tend to assume, but in
>this part of the world, assumptions are quite different.

I know. The situation may have changed, but it used to be that we Western
Imperialists -- :-) -- were in the overwhelming majority on the Internet.
In those circumstances, assuming ISO-8859-1 was an acceptable (barely)
compromise. This assumption is still widely held, for better or worse.

What that implies for how the Validator should behave is what I'm
ambivalent about.

As a data point: my impression of the general English skills of
"Easterners" (if you'll pardon my French ;D) is that we will need a
translated version of the Validator for it to be even remotely useful[1].
This might also include localizing it to assume Shift_JIS, Big5, KOI8, or
EUC-JP (etc.).

None of these are ASCII-compatible enough to let us extract a meta element,
AFAICT. It's no better to assume these than it is to assume ISO-Latin-1 by
way of the HTTP 1.1 defaulting rules.



The summary of all this is that I just don't know how the Validator should
best behave. I don't think we can achieve full conformance with all the
relevant specs, because the specs are mutually exclusive. That means we
have to pick our poison: do we agree with one spec, or the other? Or do we
just punt, explain the situation, and hope it will still be useful for
users?



If I take my own preference and modify it to be more in line with what you
and Björn are saying (AFAICT), I think we end up with the following
pseudo-algorithm.

1) Check HTTP for charset.
  a) If found, use it (for now).
  b) If not found, assume ASCII-compatible (for now).

2) Check for META charset (using explicit or implied HTTP charset).
  a) If found, use unconditionally, overriding HTTP.
  b) If not found...
     I. If HTTP had explicit charset, keep using it.
    II. If no HTTP charset, punt and tell the user to "deal with it"

3) Check for a CGI "charset" parameter.
  a) If found, use unconditionally, overriding META, but mark doc invalid.
  b) If not found...
     I. If META or HTTP had explicit charset, keep using it.
    II. If no META or HTTP charset, punt and tell the user to "deal with
        it"
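The three steps above can be sketched in code. This is a minimal sketch of
my pseudo-algorithm only; the function name, parameters, and return shape
are invented for illustration and are not Validator internals.

```python
def pick_charset(http_charset=None, meta_charset=None, cgi_charset=None):
    """Return (charset, valid, note); charset of None means 'punt'.

    Step 1's provisional HTTP charset (or ASCII-compatibility
    assumption) is only used while scanning for META, so it does not
    appear here; this function just resolves the final precedence.
    """
    if cgi_charset:
        # 3a: CGI override wins unconditionally over META and HTTP,
        #     but the document is marked invalid.
        return cgi_charset, False, None
    if meta_charset:
        # 2a: META overrides HTTP unconditionally.
        return meta_charset, True, None
    if http_charset:
        # 2bI / 3bI: keep an explicit HTTP charset.
        return http_charset, True, None
    # 2bII / 3bII: no charset from any source; punt.
    return None, False, "no charset found; tell the user to deal with it"
```

Note that the pure HTTP *default* (absence of a charset parameter) never
reaches this function as `http_charset` -- that is the "refusing to go on
just the HTTP defaulting" property described below.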

This pseudo-algorithm has the property that we accept the HTTP defaulting
behaviour for just long enough to try to find a better source for the
information, while still refusing to go on _just_ the HTTP defaulting.

This, however, still leaves us with the problem that a great majority of
pages rely on the HTTP defaulting, so we are no longer meeting user
expectations. The carrot is that they can use the charset override on the
CGI to get useful behaviour regardless. Unfortunately, this probably is
not behaviour that will be conducive to getting people to fix their pages.




[RFC 2854] - The 'text/html' Media Type, Connolly & Masinter, June 2000


[0] - Useful little bookmarklet for circumventing that horrid search
      engine the W3C list archives use. This one uses Google instead! :-)

<URL:javascript:void(Qr=prompt('Keywords...',''));if(Qr)void(location.href='http://google.com/search?query=site:lists.w3.org+'+escape(Qr)+'&num=10')>

      Use it for looking up a Message-ID like Björn posted.


[1] - Just to make _absolutely_ sure I'm not inadvertently stepping on
      someone's pride here: this is to be considered a failure of the
      Validator to make itself useful and understood, rather than a
      failure of any particular group to understand it!

Received on Thursday, 26 July 2001 00:26:03 UTC