- From: Martin J. Duerst <mduerst@ifi.unizh.ch>
- Date: Mon, 16 Dec 1996 19:24:42 +0100 (MET)
- To: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
Dear HTTP 1.1 specialists,

As a specialist in I18N (coauthor of the HTML I18N spec), I am
extremely busy trying to look at all the many internet drafts in the
works, to help find viable and long-lasting I18N solutions. As
virtually all applications area drafts affect I18N, this takes a lot
of time (besides my normal work!). It is therefore only lately that I
have become aware of some details in the HTTP 1.1 spec that I have
difficulty understanding and that I would propose to change.
Discussions in private, and on another list, have not produced any
serious explanation of why HTTP 1.1 solves these issues the way it
does, and it was suggested that I address them directly to this list.

The areas I am concerned with are the TEXT rule and its explanation
in Section 2.2 on page 16, and the warnings in Section 14.45 on pages
128/9 of draft-ietf-http-v11-07.txt.

The main concern is the choice of "ISO-8859-1 OR RFC 1522" for the
encoding of TEXT and warnings. I will expand on this below.

A second point is the question of whether it is okay for a
server/proxy not to send any warning text, but only the number,
(correctly!) assuming that the client is in better shape to decide on
the language and wording of the text. On the other hand, the draft
gives the strong impression that the warning has to come with text,
so that client implementors will just try to display it, and the user
may end up without any warning text.

A third point is the question of explicitly specifying English as a
default for warnings. Every HTTP implementor should be expected to
have enough knowledge of internet practice to conclude that, given no
other indication of language preference, English is the best choice
anyway. So saying "The default Language is English" is a dummy
statement. On the other hand, it seems to be offensive to some people
(not me!) who worry about the dominance of English in the internet.
It can very well be argued that English as a default should not be
written in stone. Therefore, it might be silently removed if any
other changes in the "Warning" text are necessary.

Now back to the MAIN POINT:

Can anybody explain to me why ISO-8859-1 was chosen as a default for
TEXT in headers and warnings? Given the recommendations of the IAB
charset workshop (draft-weider-iab-char-wrkshop-00.txt), which
repeatedly mentions UTF-8, this seems like a rather antiquated
choice. UTF-8, on the other hand, is extremely well suited for the
purpose: It covers all the characters of the world, is reasonably
compact, and works together smoothly with ASCII. It is clear that
7-bit octets are reserved for ASCII; the 8th bit, a precious
resource, should be used as carefully as possible. Using it for UTF-8
is definitely better than using it for ISO-8859-1.

To improve the situation, I propose four variants for a better
solution. The choice of variant may depend on various factors I am
not fully knowledgeable about, such as the installed base of servers,
proxies, and clients that support ISO-8859-1 and/or RFC 1522. Whoever
knows anything on this topic is welcome to share his/her knowledge.


Solution 1: UTF-8 only
----------------------

Advantages: Very easy to implement on the server. For those who doubt
it, I offer to transcode lists of warnings from ISO 8859-1 (and quite
a few other encodings) to UTF-8. A file with a list of warnings in
various languages can be edited directly if it is in UTF-8, whereas
this is not possible for an RFC 1522-based solution. This also
applies to the other solutions that are based on UTF-8, as the server
can choose to use any of the allowed encodings.

Easy to implement on the client (does not need RFC 1522 code). UTF-8
support will be in the major web clients next year, and in anything
that is serious about Java, anyway. Display support for exotic
scripts such as Tibetan is not an issue, as RFC 1522 has the same
problems.
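The transcoding step mentioned under Solution 1 can be sketched in a
few lines of Python. This is only an illustration, not code from the
original message: the function name latin1_to_utf8 and the sample
strings are assumptions made for the example.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Transcode an ISO-8859-1 octet string to UTF-8.

    Every octet is a valid ISO-8859-1 code, so the decode step
    cannot fail; octets below 0x80 (ASCII) come out unchanged.
    """
    return data.decode("iso-8859-1").encode("utf-8")


# Hypothetical warning texts, used only to show the effect.
ascii_text = b"Response is stale"
french_text = b"r\xe9ponse p\xe9rim\xe9e"  # ISO-8859-1 octets

# Pure ASCII text is identical in both encodings.
print(latin1_to_utf8(ascii_text) == ascii_text)

# Each accented letter grows from one octet to two.
print(len(french_text), len(latin1_to_utf8(french_text)))
```

The octet counts show the cost of the change for ISO-8859-1 text: one
extra octet per accented character, and none for ASCII.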


Solution 2: UTF-8 and RFC 1522
------------------------------

The main advantage of this is that some scripts, in particular Indic
scripts and Georgian, expand by a factor of 3 from native encoding to
UTF-8. Otherwise, there is no good reason for keeping RFC 1522 with
UTF-8 except maybe for installed base.


Solution 3: UTF-8 and ISO-8859-1
--------------------------------

At first sight, this may seem very dangerous and bad design, because
how should one find out whether something is ISO-8859-1 or UTF-8?
Indeed, "guessing" is needed. But guessing is tremendously
simplified, to the extent where it is really difficult to speak about
guessing, by the following facts:

ISO-8859-1 8-bit characters can be divided into three areas: A0-BF,
C0-DF, and E0-FF. A0-BF contains all kinds of symbols, such as 1/4,
copyright, superscript 2,... C0-DF contains upper-case accented
characters, E0-FF contains lower-case accented characters. The range
80-9F is not defined in ISO-8859-1; it is reserved for control
characters (C1), but not used in internet context.

In ISO-8859-1 strings, E0-FF will be relatively frequent, C0-DF
considerably less frequent, and A0-BF even less. Sequences of two
characters with the 8th bit set are also very rare.

In UTF-8, the range 80-FF is divided into leading characters
(L: C0-FF) and trailing characters (T: 80-BF). The following
sequences are legal UTF-8:

    L1 (C0-DF) T
    L2 (E0-EF) T T
    L3 (F0-F7) T T T
    L4 (F8-FB) T T T T
    L5 (FC-FD) T T T T T

So to find an octet sequence that is both legal UTF-8 and reasonable
ISO-8859-1, the best chance is to find a reasonable combination of an
upper-case accented letter followed by a special sign such as
copyright. Can a warning, or any other TEXT, reasonably be expected
to contain such a combination (and no other 8-bit characters that
don't conform to UTF-8)?

Code to test an octet string for UTF-8 compliance is available on
request.
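The "guessing" described for Solution 3 can be sketched as follows.
This is not the code offered in the message, only a minimal Python
illustration of the table of leading and trailing octets given there;
the function names are assumptions. It checks the octet pattern only
and accepts the 5- and 6-octet forms listed in the table, so it is
more permissive than a strict present-day UTF-8 decoder.

```python
def trailing_count(lead: int) -> int:
    """Number of trailing octets expected after a leading octet,
    or -1 if the octet cannot start a sequence (80-BF, FE, FF)."""
    if lead < 0x80:
        return 0  # plain ASCII
    for low, high, count in (
        (0xC0, 0xDF, 1),
        (0xE0, 0xEF, 2),
        (0xF0, 0xF7, 3),
        (0xF8, 0xFB, 4),
        (0xFC, 0xFD, 5),
    ):
        if low <= lead <= high:
            return count
    return -1


def fits_utf8_pattern(data: bytes) -> bool:
    """True if every 8-bit octet is part of a leading octet followed
    by the right number of trailing octets (80-BF)."""
    i = 0
    while i < len(data):
        need = trailing_count(data[i])
        if need < 0:
            return False
        tail = data[i + 1 : i + 1 + need]
        if len(tail) != need or any(not 0x80 <= b <= 0xBF for b in tail):
            return False
        i += 1 + need
    return True


def guess_text_charset(data: bytes) -> str:
    """Guess between UTF-8 and ISO-8859-1 for unlabeled TEXT."""
    if all(b < 0x80 for b in data):
        return "us-ascii"
    return "utf-8" if fits_utf8_pattern(data) else "iso-8859-1"


print(guess_text_charset("café".encode("utf-8")))       # utf-8
print(guess_text_charset("café".encode("iso-8859-1")))  # iso-8859-1
print(guess_text_charset("É©".encode("iso-8859-1")))    # utf-8 (misguess)
```

The last call is the ambiguous case described above: an upper-case
accented letter followed by the copyright sign is valid ISO-8859-1 and
also fits the UTF-8 pattern, so the guess goes wrong.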

The "guessing" solution was also accepted in ftp-wg to provide a
reasonable upgrade path for existing implementations that use
arbitrary unlabeled "charset"s in their filenames.


Solution 4: RFC 1522 only
-------------------------

Not really the best solution, but at least fair to everybody, and
leaving the 8th bit open for the future.


Some readers may argue that HTTP 1.0 already specifies ISO-8859-1 as
the default for TEXT. This is not exactly true. HTTP 1.0 says:

    Recipients of header field TEXT containing octets outside the
    US-ASCII character set may assume that they represent ISO-8859-1
    characters.

Very obviously, this is just a suggestion, not a default. It does not
make sense to tighten this default in the wrong direction. It may
even be that in some places, based on this not-so-tight
specification, implementors have used any encoding for such fields.

It may also be worth contemplating what happens when a UTF-8 string
sent out by a server happens to be displayed on a client that assumes
it can be nothing other than ISO-8859-1. If the UTF-8 string contains
characters that could not have been represented in ISO-8859-1, then
it would have been impossible to reliably send it with HTTP 1.0
anyway. Otherwise, each accented character will arrive as two octets
with the 8th bit set, and display as one or two characters in the
ISO-8859-1 range. While this is of course very unfortunate, it does
not preclude readability. It is a phenomenon that most computer users
dealing with such languages are actually only too familiar with.


Some additional comments for people concerned about these issues:

Previous Discussions
--------------------

To those of you to whom I give the impression of reopening a point
that has already been beaten to death, please note that this is not
true.
"charset" issues and defaults for entity content have been discussed repeatedly and vigurously on this list, and a reasonable solution, considering all the backwards compatibility issues, has been found. However, ever after scanning the list archives of this year in great detail, I have not found any serious discussion of I18N issues in headers and warnings. Procedural Concerns ------------------- The current HTTP 1.1 draft is beyond last call, waiting for becomming an RFC. I do not know whether last minute changes can or should be made, but I have to say I don't care. Whether the issues I mentionned above are solved by a last minute change, a separate RFC, a mutual understanding in this group, or whatever, is of minor concern if they are solved at all. [The reference to RFC1522 has to be changed anyway to its superseding RFC 2047.] Many thanks in advance for you consideration. Regards, Martin Du"rst. ---- Dr.sc. Martin J. Du"rst ' , . p y f g c R l / = Institut fu"r Informatik a o e U i D h T n S - der Universita"t Zu"rich ; q j k x b m w v z Winterthurerstrasse 190 (the Dvorak keyboard) CH-8057 Zu"rich-Irchel Tel: +41 1 257 43 16 S w i t z e r l a n d Fax: +41 1 363 00 35 Email: mduerst@ifi.unizh.ch ----
Received on Monday, 16 December 1996 10:29:31 UTC