- From: Martin J. Duerst <mduerst@ifi.unizh.ch>
- Date: Fri, 13 Dec 1996 12:42:01 +0100 (MET)
- To: Larry Masinter <masinter@parc.xerox.com>
- cc: Drazen.Kacar@public.srce.hr, Chris.Lilley@sophia.inria.fr, www-international@w3.org, Alan_Barrett/DUB/Lotus.LOTUSINT@crd.lotus.com, bobj@netscape.com, wjs@netscape.com, erik@netscape.com, Ed_Batutis/CAM/Lotus@crd.lotus.com
On Thu, 12 Dec 1996, Larry Masinter wrote:

> # However, what I (and others) object against very strongly is
> # the combination of RFC 1522 with the exception for ISO-8859-1.
> # If everybody has to use RFC 1522, there must not be an
> # exception for ISO-8859-1. ISO-8859-1 and Western Europe is
> # not really anything special. If we choose RFC 1522, then
> # everybody should use it, there should not be any exceptions.
>
> The exception for ISO-8859-1 for warning messages in HTTP is based on
> the fact that there is an exception for ISO-8859-1 for text documents,
> and that it made no sense for the protocol to be inconsistent.

Larry - What you say here is at best a very weak excuse. I would like to comment specifically on two points:

- The exception for ISO-8859-1
- The "inconsistency" of the protocol

With regard to ISO-8859-1 as a default, Francois has already commented on this in very strong words. In the words I usually use to describe it, the reality is this: ISO-8859-1 comes in as a default only after a long list of other things, which are, in (rough) order of priority:

- The "charset" parameter in the Content-Type header
- The "charset" parameter in the META element
- User settings
- Heuristic guessing by the client

In actual consequence (because all major browsers have a "Document Encoding" submenu), this means that the default is close to worthless. It also means that correct labeling is preferable to relying on the "default". In Section 3.7.1 (as of version 06), the HTTP draft acknowledges this fact:

>> Some HTTP/1.0 software has interpreted a Content-Type header
>> without charset parameter incorrectly to mean "recipient should
>> guess." Senders wishing to defeat this behavior MAY include a
>> charset parameter even when the charset is ISO-8859-1 and SHOULD
>> do so when it is known that it will not confuse the recipient. <<

Actually, the first sentence describes the situation rather complacently.
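For concreteness, an explicitly labelled response of the kind the quoted draft text encourages might look like this (a hypothetical example, not taken from the draft):

    HTTP/1.0 200 OK
    Content-Type: text/html; charset=ISO-8859-1

With such a label present, none of the fallbacks in the priority list above ever comes into play.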
*All* widely deployed HTTP/1.0 software interprets a Content-Type header without "charset" as "use other means to determine the encoding" (see above). This is a bare necessity, because currently maybe 99% of the documents not in ISO-8859-1 are not labelled. This in turn results from the fact that i18n was not considered from the beginning when the Web was created; the originators just took what was in front of their noses, ISO-8859-1.

Now for your claim that "it made no sense for the protocol to be inconsistent". I greatly value protocol consistency, but let's be fair: one might claim consistency only if, for the actual documents:

- It were allowed to send ISO-8859-1 in 8-bit without labeling, while all other encodings had to use QP to avoid using the 8th bit.
- Having a default of ISO-8859-1 actually worked in practice, instead of being virtually worthless as described above.
- Having a default of ISO-8859-1 for document content had proven to be a good solution, rather than a coincidence of history that is creating countless headaches.
- Dealing with HTTP headers and dealing with document content did not, in many implementations, occur in very different places, so that the savings in implementation complexity are nonexistent.

Claiming that the way HTTP now deals with document encodings and the way HTTP 1.1 proposes to deal with the encodings of warning headers have anything to do with consistency is more of a joke than serious engineering.

> It is a historical fact that the web's origin at CERN in western
> Europe gives it a western-European bias. This is perhaps unfortunate
> (if you're not a western European),

I am a western European. I grew up and live in central western Europe, in the same small country where the Web originated. I acknowledge the fact that the Web has its roots in western Europe, and I am even proud of it.
Nevertheless, I think the Web is global, and its protocols and formats have to be designed with the whole world in mind. The historical existence of a western-European bias in some of its components should in no way be used as an argument to perpetuate and spread such a bias to places where it is absolutely not necessary. Historical coincidence should also not be used as an argument to repeat the errors of the past.

> but your proposal that the default
> be UTF-8 doesn't actually advantage much of the world that currently
> has different encodings as their default.

I have proposed various alternatives, all of which are strongly preferable to the current proposal. I would be happy if the current HTTP 1.1 spec said:

- Use RFC 1522 for everything (except ASCII).
- Keep the 8-bit channel open for future use.

so that we might then specify raw UTF-8 in the future, once there is consensus that a reasonable base of implementations can handle it. I would also be happy if the current spec said:

- Use RFC 1522 for whatever you want to encode.
- Use raw UTF-8 in the 8-bit channel.

This would allow much of the world that currently has different encodings at their command to use those encodings.

At this point, some additional points are worth considering. First, if we specified RFC 1522 + UTF-8, the only people affected would be those implementing ISO-8859-1 only. As I believe in the global Web, I hope that this is a small minority. They would have the choice between RFC 1522 and UTF-8. Because ISO-8859-1 is a subset of Unicode, the conversion from ISO-8859-1 to UTF-8 is trivial: easier than the conversion to RFC 1522, and easier than the conversion from anything else (except plain ASCII) to UTF-8. It is really not a big burden on ISO-8859-1 implementers. In addition, as the spec currently stands, the number of warnings is limited (6 in version 06), and they will probably end up as constants in the program or in some file.
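To illustrate just how trivial the ISO-8859-1 to UTF-8 conversion is, here is a minimal sketch in C. The function name and interface are my own invention for illustration, not part of any spec or implementation discussed here:

```c
#include <stddef.h>

/* Hypothetical helper: convert a NUL-terminated ISO-8859-1 string
 * to UTF-8.  Bytes below 0x80 pass through unchanged; every byte
 * from 0x80 to 0xFF becomes exactly two bytes.  'out' must have
 * room for 2 * strlen(in) + 1 bytes.  Returns the output length. */
size_t latin1_to_utf8(const unsigned char *in, unsigned char *out)
{
    size_t n = 0;
    for (; *in != '\0'; in++) {
        if (*in < 0x80) {
            out[n++] = *in;                  /* plain ASCII */
        } else {
            out[n++] = 0xC0 | (*in >> 6);    /* 0xC2 or 0xC3 */
            out[n++] = 0x80 | (*in & 0x3F);  /* low six bits */
        }
    }
    out[n] = '\0';
    return n;
}
```

For example, the ISO-8859-1 byte \374 (u-umlaut) comes out as the two bytes \303\274, exactly the mapping discussed below.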
Thus the server does not have to implement a conversion algorithm; the programmer just has to make sure that (s)he includes the correct string constants. With a list mapping HTML entities to C constant notation, doing the conversion by hand is a matter of a few seconds for the 6 warnings currently defined. Such a list could look as follows:

               HTML       Hex        C          C
               Latin-1    Latin-1    Latin-1    UTF-8 (Unicode)
    &uuml;     FC         \374       \303\274
    General               \xyz       \xyz       \30x\2yz

I'm ready to provide a full list.

As we have seen, the problems on the server side when removing the use of raw ISO-8859-1 are extremely marginal. Now let's have a look at the client side. The argument that conversion from UTF-8 to ISO-8859-1 is easier than from RFC 1522 applies in this case, too. In addition, one should mention that any browser that correctly supports numeric character references has to have an idea about ISO 10646/Unicode; decoding UTF-8 to this is very easy. As we know, the main browsers will support UTF-8 for document content in their next version, and they should not have any problems doing the same for warnings.

> You're proposing that recipients apply heuristics to decide if the
> warning messages are in UTF-8 or ISO-8859-1. This seems like a bad
> idea, to make something that's deterministic into something that's
> heuristic.

This was one of my proposals, and not the most preferred one. I added it because I thought that one argument against removing the ISO-8859-1 default would be that it was already in use. As this seems not to be the case, there is no need for such heuristics. Let's make it deterministically RFC 1522 + 8-bit extension for the future, or, even better, RFC 1522 + deterministic UTF-8.

> The 12-byte overhead for the "=?UTF-8?Q?" and "?=" suffix
> in the warning message isn't so big,

Francois has already commented on this. I would have assumed that the HTTP group as a whole had enough expertise in QP, RFC 1522, and i18n to weed out such strange claims earlier.

> and isn't really "Clogging up the
> 8-bit channel".
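To make the quoted overhead concrete, here is a minimal sketch in C of the RFC 1522 "Q" encoding under discussion. The function name is my own; this is an illustration, not a complete RFC 1522 implementation (line folding and other details are omitted):

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical sketch: wrap a NUL-terminated byte string (e.g. UTF-8
 * text) in an RFC 1522 "Q" encoded-word.  Printable ASCII passes
 * through, space becomes '_', and everything else (including '=',
 * '?' and '_') becomes =XX.  The framing "=?UTF-8?Q?" plus "?=" is
 * the 12-byte overhead quoted above.  'out' must have room for
 * 3 * strlen(in) + 13 bytes.  Returns the output length. */
size_t q_encode_utf8(const unsigned char *in, char *out)
{
    size_t n = 0;
    n += (size_t)sprintf(out + n, "=?UTF-8?Q?");
    for (; *in != '\0'; in++) {
        if (*in == ' ') {
            out[n++] = '_';
        } else if (*in > 0x20 && *in < 0x7F &&
                   *in != '=' && *in != '?' && *in != '_') {
            out[n++] = (char)*in;
        } else {
            n += (size_t)sprintf(out + n, "=%02X", *in);
        }
    }
    n += (size_t)sprintf(out + n, "?=");
    return n;
}
```

Encoding the plain string "Response is stale" this way yields "=?UTF-8?Q?Response_is_stale?=": 17 bytes of payload plus the 12 bytes of framing.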
It's not RFC 1522 and QP that are clogging up the 8-bit channel; it's the ISO-8859-1 "default".

> Perhaps by the time Unicode is widespread -- in the next 3-5 years --

Unicode/UTF-8 support will be widespread next year. And I hope we can still make a small change to the HTTP 1.1 draft so that it will not look obviously outdated next year.

> we'll have a new version of HTTP 2.x or HTTP-NG. I would certainly
> propose that in the future, new versions of HTTP default to UTF-8.

If we remove that ISO-8859-1 "default" from the warnings, we can have UTF-8 as a default even now, or in the future if you think there is not enough consensus yet.

Larry - Unless you have compelling arguments for keeping the ISO-8859-1 "default" for warnings in HTTP 1.1 headers, we should really get serious about removing that "default" from the spec. I have made several proposals for a different specification; if you tell me which of them would be most acceptable to you, I will gladly work out the wording. If you tell me how to subscribe to the HTTP list, I will also propose and defend this change on that list. It may be late for this change, but it can be done quickly, and it is worth doing for the future of HTTP.

Regards, Martin.
Received on Friday, 13 December 1996 06:51:53 UTC