- From: Terje Bless <link@pobox.com>
- Date: Thu, 26 Jul 2001 05:38:15 +0200
- To: W3C Validator <www-validator@w3.org>
On 25.07.01 at 16:57, Martin Duerst <duerst@w3.org> wrote:

>At 03:53 01/07/25 +0200, Terje Bless wrote:
>
>>The issue is that the transport protocol sez that an absence of an
>>explicit charset parameter on the Content-Type means "ISO-8859-1";
>>HTML or XML rules don't apply here. When it comes time to parse the
>>markup, you already have a charset; the XML/HTML rules do not govern
>>HTTP.
>
>[The] HTML 4 spec explicitly says that the HTTP default doesn't work.

The HTML Recommendation has no authority to dictate syntax or semantics
for an arbitrary transport protocol. HTML sent over SMTP, encapsulated
in MIME, must conform to RFC 2822 and RFCs 2045-2049 first; as far as
SMTP is concerned, "text/html" is just an opaque block of data whose
exact transport details are dictated by the Content-Transfer-Encoding
field.

I'm guessing that the _intent_ was that something labelled "ISO-8859-1"
should be parsed accordingly until a meta element with, say,
"windows-1250" was encountered, and then _restarted_ with the new
encoding in effect (implicit in this is that the document must be
compatible with the transport encoding up to the meta element). This
obviously does not consider HTTP's defaulting behaviour, but even
[RFC 2854] still says that ISO-8859-1 is the default.

See also
<URL:http://lists.w3.org/Archives/Public/www-validator/2001AprJun/0163.html>
for the details Björn posted in April[0].

>>In practice you have to decide between "Assume ISO-8859-1 as that's
>>what /people/ tend to assume" or "Assume nothing as people will get
>>it wrong some part of the time".
>
>Well, in your part, that's what /people/ tend to assume, but in
>this part of the world, assumptions are quite different.

I know. The situation may have changed, but it used to be that we
Western Imperialists -- :-) -- were in the overwhelming majority on the
Internet. In those circumstances, assuming ISO-8859-1 was a (barely)
acceptable compromise. That assumption is still widely held, for better
or worse; what it implies for how the Validator should behave is what
I'm ambivalent about.

As a data point, my impression of the general English skills of
"Easterners" (if you'll pardon my French ;D) is that we will need a
translated version of the Validator for it to be even remotely
useful[1]. This might also include localizing it to assume Shift_JIS,
Big5, KOI8, or EUC-JP, etc. None of these are ASCII-compatible enough
to let us extract a meta element, AFAICT, so assuming them is no better
than assuming ISO-Latin-1 by way of the HTTP/1.1 defaulting rules.

The summary of all this is that I just don't know what the best
behaviour for the Validator is. I don't think we can achieve full
conformance with all the relevant specs, because the specs are mutually
exclusive. That means we have to pick our poison: do we side with one
spec or the other? Or do we just punt, explain the situation, and hope
the result is still useful?

If I take my own preference and modify it to be more in line with what
you and Björn are saying (AFAICT), I think we end up with the following
pseudo-algorithm (a rough sketch in code follows below):

1) Check HTTP for charset.
   a) If found, use it (for now).
   b) If not found, assume ASCII-compatible (for now).
2) Check for META charset (using the explicit or implied HTTP charset).
   a) If found, use unconditionally, overriding HTTP.
   b) If not found...
      I.  If HTTP had an explicit charset, keep using it.
      II. If no HTTP charset, punt and tell the user to "deal with it".
3) Check for a CGI "charset" parameter.
   a) If found, use unconditionally, overriding META, but mark the
      document invalid.
   b) If not found...
      I.  If META or HTTP had an explicit charset, keep using it.
      II. If no META or HTTP charset, punt and tell the user to "deal
          with it".
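Here is a minimal sketch of that precedence in Python. Everything in it
is hypothetical illustration: the names (resolve_charset, http_charset,
cgi_charset, META_CHARSET) are mine, not the Validator's, and the meta
sniffing is a crude regex rather than a real parse.

    import re

    # Crude meta-charset sniffer. Because step 1b only ever *assumes*
    # an ASCII-compatible encoding, scanning the raw bytes is enough
    # to find <meta http-equiv="Content-Type"
    #               content="text/html; charset=...">.
    META_CHARSET = re.compile(
        rb'<meta[^>]*charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)',
        re.IGNORECASE)

    def resolve_charset(http_charset, body, cgi_charset):
        """Return (charset, source) following steps 1-3 above; source
        is one of 'cgi-override', 'meta', 'http', or 'punt'."""
        m = META_CHARSET.search(body)
        meta = m.group(1).decode('ascii') if m else None

        # Checking in reverse order of the steps is equivalent, since
        # each later step overrides the earlier ones: CGI beats META
        # (step 3a; the caller should also mark the document invalid),
        # META beats HTTP (step 2a), and an explicit HTTP charset is
        # used only as a last resort (steps 1a, 2.b.I, 3.b.I).
        if cgi_charset:
            return cgi_charset, 'cgi-override'
        if meta:
            return meta, 'meta'
        if http_charset:
            return http_charset, 'http'

        # Steps 2.b.II / 3.b.II: no explicit charset anywhere, so we
        # refuse to fall back on the HTTP default alone.
        return None, 'punt'

So a document served without a charset parameter but carrying
<meta ... charset=windows-1250> comes back as ('windows-1250', 'meta'),
while one with no charset information at all comes back as
(None, 'punt').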
This pseudo-algorithm has the property that we accept the HTTP
defaulting behaviour just long enough to try to find a better source
for the information, while still refusing to go on _just_ the HTTP
default. It does, however, leave us with the problem that the great
majority of pages rely on the HTTP defaulting, so we would no longer
be meeting user expectations. The carrot is that they can use the
charset override on the CGI to get useful behaviour regardless.
Unfortunately, that is probably not behaviour conducive to getting
people to fix their pages.

[RFC 2854] - "The 'text/html' Media Type", Connolly & Masinter,
June 2000.

[0] - A useful little bookmarklet for circumventing that horrid search
engine the W3C list archives use. This one uses Google instead! :-)
<URL:javascript:void(Qr=prompt('Keywords...',''));if(Qr)void(location.href='http://google.com/search?query=site:lists.w3.org+'+escape(Qr)+'&num=10')>
Use it to look up a Message-ID like the one Björn posted.

[1] - Just to make _absolutely_ sure I'm not inadvertently stepping on
anyone's pride here: this should be considered a failure of the
Validator to make itself useful and understood, rather than a failure
of any particular group to understand it!