URL internationalization!

Martin J. Duerst (mduerst@ifi.unizh.ch)
Tue, 18 Feb 1997 16:09:03 +0100 (MET)

To: URI mailing list <uri@bunyip.com>
Message-Id: <Pine.SUN.3.95q.970218160814.245I-100000@enoshima>

Hello everybody,

[My apologies if you receive this mail twice.]

In the next mail to this list and its counterpart, I will make
proposals for changes/additions to the syntax or the process draft.
These changes are designed to set the direction for consistent
internationalization (i18n) of URLs in the future.
[The current state is best characterized by the word "chaos".]

As is customary for standards, their language is limited to the facts,
without much background. To give you this background, and to show
that the direction pointed to is very reasonable, I am sending
this mail with detailed explanations to show that it actually
works. If you have any further questions, I will be glad to
answer them. If the changes proposed are approved, I will
convert and expand the main part of this mail into an internet
draft.

My proposals to the URL process and syntax draft are based on
many discussions in conferences and on IETF mailing lists, and
talks on this issue at the last Unicode conference and at the
Multilingualism workshop in Sevilla.


0. The Current State
1. The Solution
2. The Proposals
3. The Second Step
4. Backwards Compatibility and Upgrade Path

0. The Current State

URLs currently are sequences of a limited subset of ASCII characters.
To encode other characters, arbitrary and unspecified character
encodings ("charset"s) are used, and the resulting octets are encoded
using the %HH escaping mechanism.

The first step of this is highly unsatisfactory because
it cannot be inverted, i.e. you cannot derive the characters from
the octets, and because even forward conversion cannot be done without
some additional information, i.e. if you know the characters of a file
name, you can't construct the URL, because you don't know the encoding
that the server will expect.
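
To make the ambiguity concrete, here is a small sketch in present-day
Python (which of course did not exist when this practice arose); it
shows how one and the same character yields different %HH sequences
depending on which "charset" the server happens to expect:

```python
from urllib.parse import quote

# One character, "e" with acute accent (U+00E9), produces different
# %HH sequences depending on which charset the server expects.
# Without out-of-band knowledge, neither direction is well-defined.
as_latin1 = quote("é".encode("iso-8859-1"))  # single octet 0xE9
as_utf8 = quote("é".encode("utf-8"))         # octets 0xC3 0xA9

print(as_latin1)  # %E9
print(as_utf8)    # %C3%A9
```

A client that knows only the characters of a resource name cannot
tell which of the two escaped forms the server will accept.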

The above is current practice, and can't be changed overnight,
especially because a lot of protocols and software is affected.
But it is possible to specify the direction to go, and to start
moving in that direction.

You may ask why caring about the internationalization of URLs
is necessary at all. Well, for some parts of URLs (in particular
the query part), i18n is undeniably necessary, and for others,
not doing it puts people and communities using something
other than basic Latin at an unnecessary disadvantage.
Of course, no information provider will be forced to use these
features. But as an example, take the Hong Kong Tourist Association
advertising Hong Kong in the US and in Japan. It will only help
them if the URL to their Japanese site is as understandable,
memorizable, transcribable, or guessable for Japanese speakers
as the ASCII URL is for English speakers.

The arguments that i18n URLs cannot be keyboarded, transcribed,
or mailed may have had some significance in the past, but are
getting less and less valid. On the Web, with forms and Java,
it's easy to offer a keyboarding service for the whole of Unicode.

Anyway, the proposals I am making here are not at all going that
far. I am just mentioning these things above and below to show
that it's actually workable and desirable. Discussions on the
URN list have shown that some people are still very sceptical
about transcribability. The work on HTML 2.0 and HTML i18n
has shown that it is possible to move towards i18n in a
well-defined way even at a rather late stage.

1. The Solution

To solve the problem that character encodings are not known,
several solutions are conceivable. Using tags (a la RFC 1522)
is ugly because these tags would have to be carried around
on paper and everywhere. Having every scheme and mechanism use
different conventions is also rather impracticable because this
leads to confusion (e.g. if the dns part of a URL is different
from the path part because dns and http use different conventions).
Inventing new schemes for internationalization (e.g. ftp->iftp)
is likewise too clumsy.

The solution that remains is to specify preference for a single
character encoding. Because this has to encompass a very wide
range of characters, the Universal Character Set (UCS) of
Unicode/ISO 10646 is the only solution. To be compatible with
ASCII, there exist two UCS Transform Formats, UTF-7 and UTF-8.
UTF-7, however, has various disadvantages. Some of the characters
it uses can have special functions inside URLs. It is not completely
compatible with ASCII. And it has some potential limitations with
regard to the full space of UCS (although the latter point is rather
theoretical).

UTF-8, designed by Rob Pike and others at Bell Labs, is not perfect
for URLs either. But it is fully ASCII-compatible, and doesn't
interfere with URL syntax. Also, the problem that it uses all 8 bits
per octet is solved by the widely deployed %HH-encoding of URLs.
The problem that with %HH-encoding, some URLs can get rather long
will hopefully be addressed by the direct use of the characters
themselves at some point in the future (see below).
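
As a sketch (again in present-day Python, purely for illustration),
the canonical conversion and its inverse look like this; the function
names are mine:

```python
from urllib.parse import quote, unquote

def encode_path_segment(text):
    # Canonical form: characters -> UTF-8 octets -> %HH escapes.
    return quote(text.encode("utf-8"))

def decode_path_segment(escaped):
    # Invertible, because the character encoding is fixed to UTF-8.
    return unquote(escaped, encoding="utf-8")

segment = encode_path_segment("東京")
print(segment)                        # %E6%9D%B1%E4%BA%AC
print(decode_path_segment(segment))  # 東京
```

Because the character encoding is fixed, the round trip is lossless,
which is exactly what the current chaos lacks.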

UTF-8 has an additional advantage, namely that it is
with high probability easily distinguishable from legacy
encodings. This was one of the reasons it was adopted for
internationalizing FTP (see draft-ietf-ftpext-intl-ftp-00.txt).
URNs also have adopted UTF-8 and %HH-encoding (see
draft-ietf-urn-syntax-02.txt). This is important because
URLs and URNs are both part of URIs, and are otherwise
kept in sync syntactically. Also, the general recommendation
from the Internet Architecture Board is to use UTF-8 in
situations where ASCII-compatibility is necessary
(see draft-weider-iab-char-wrkshop-00.txt).
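
The heuristic rests on UTF-8's strict byte-sequence structure: text
in a legacy 8-bit encoding almost never happens to be valid UTF-8.
A minimal check, sketched in present-day Python, is simply a strict
UTF-8 decode:

```python
def looks_like_utf8(octets):
    # Octets after a UTF-8 lead byte must match the pattern 10xxxxxx;
    # isolated high octets from legacy encodings violate this, so a
    # strict decode fails on them with high probability.
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8("café".encode("utf-8")))       # True
print(looks_like_utf8("café".encode("iso-8859-1")))  # False
```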

This has led to the conclusion, in the various discussions I have
been involved in, that UTF-8 is the solution. Good documentation
on this can be found at http://www.alis.com:8085/~yergeau/url-00.html.
Actually, Francois Yergeau realized before I did that UTF-8
is the way to go, and we wouldn't be where we are now if it weren't
for his efforts.

2. The Proposals

In Sevilla, I proposed two steps to move towards the solution.
The first step is to specify UTF-8 as the preferred character
encoding for URLs, while leaving URLs unchanged otherwise.
This brings the target for URLs in sync with the spec for URNs.
For reasons already explained, the proposals I make for changes to
the syntax and process drafts cover only the first step.
For most people caring about i18n, this looks like "not enough".
But it is important to set the basic infrastructure first.
Once we have that, the rest will come almost without effort.

The two important points in the first step are "no enforcement"
and "canonical form" (a better name for the latter might be
"lowest common denominator").

Canonical form means that %HH-escaping is compulsory, and no 8-bit
or native encoding is allowed. This is important to stay in sync
with URNs, to allow UTF-8 URL software and actual URLs to deploy
before raising the expectations of users too much, and to address
the concerns about transcribability of those in the URI community
not very familiar with i18n issues.
Of course, as with URNs, even with the canonical form only, it
will be possible rather quickly to offer nice user interfaces
hiding the %HH (once you know how the conversion is done!).

Due to current use, it is impossible to mandate any character
encoding for URLs; a lot of URLs would immediately become illegal.
Also, URLs are a "hat" over many very different protocols and
mechanisms, which all have their own communities and special
needs. Some URLs, such as the "data:" URL, do not encode characters
at all (but most actually do :-). Others might have very specific
requirements that preclude the use of UTF-8, or may not need
i18n at all for some good reason. However, please note that
there is no need for the protocol or mechanism itself to use
UTF-8 for internationalization. Assuming the protocol or mechanism
is internationalized, with a well-defined way to encode the characters
of UCS, a mapping from UTF-8 to that specific encoding can be defined.
This is preferable to exposing the scheme-specific i18n in URLs,
because it greatly simplifies URL handling and user interface
issues. For examples of scheme- or mechanism-specific i18n,
please see IMAP [RFC 2060, section 5.1.3], defining "modified UTF-7",
and the "UTF-5" from my proposal for domain name i18n.

3. The Second Step

Once the direction towards using UTF-8 as the character encoding
for URLs is firmly set, and other i18n aspects are nicely covered
by software such as browsers and servers, it will be time for
step two, I hope. This means that %HH-escaping will not be
mandatory anymore. Note that this does not mean that binary
UTF-8 will be placed in the middle of an ASCII-only document,
or in a Latin-X document, or in a UCS2/UTF-16 document.

In these cases, and in all others where the character encoding
is implicitly or explicitly known, those characters that can
be represented natively will be encoded using the native encoding.
For the rest of the characters, UTF-8 and %HH-encoding will
be used. This is crucial to make transcoding, for example
in response to an "Accept-Charset:" header in HTTP, as well
as e.g. when cutting/pasting text in a GUI interface, work
correctly. It also conforms to what is done currently with the
basic URL characters, e.g. in a document encoded in EBCDIC.
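
A sketch of this mixed representation (present-day Python; the helper
name is mine, and escaping of reserved ASCII characters is ignored
for brevity):

```python
from urllib.parse import quote

def mixed_form(text, document_charset):
    # Characters representable in the document's own charset stay
    # native; all others fall back to %HH-escaped UTF-8.
    out = []
    for ch in text:
        try:
            ch.encode(document_charset)
            out.append(ch)
        except UnicodeEncodeError:
            out.append(quote(ch.encode("utf-8")))
    return "".join(out)

# In a Latin-1 document, "é" can appear natively, but "東" cannot:
print(mixed_form("é東", "iso-8859-1"))  # é%E6%9D%B1
```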

4. Backwards Compatibility and Upgrade Path

Backwards compatibility is fully guaranteed. No URL that was
legal up to now is made illegal. No URL that works now will
stop working. And it is actually possible to convert
to using UTF-8 gradually, without the servers depending on
all clients being upgraded or vice versa. And even those
cases that already use native document encoding for their
URLs, and rely on them being interpreted octet-wise, can
be dealt with, to the extent they have been working up to now.

If a server e.g. has a Latin-X filesystem locally, and gets a URL
that could be UTF-8, it converts it to Latin-X and checks whether the
resource is there. If the resource is not found, it tries again using
the raw octets directly. Because of the heuristic separation between
UTF-8 and other encodings, and the usually extremely sparse use of
the namespace on a server (except for forms input, see below), the
chances of a collision are vanishingly small. If somebody is not
satisfied with this, it's easy to add an extension that checks all
current resource names and ensures that there are no conflicting
resource names, or that in case of conflict, the user gets a choice.
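
A sketch of this server-side fallback (present-day Python; a set of
local names stands in for the Latin-1 filesystem, which is my
simplification):

```python
def resolve(octets, local_names):
    # First interpretation: treat the request octets as UTF-8.
    try:
        name = octets.decode("utf-8")
        if name in local_names:
            return name
    except UnicodeDecodeError:
        pass
    # Fallback: interpret the raw octets in the local charset
    # (Latin-1 here; every octet maps to some character).
    name = octets.decode("iso-8859-1")
    return name if name in local_names else None

names = {"café"}
print(resolve("café".encode("utf-8"), names))       # café
print(resolve("café".encode("iso-8859-1"), names))  # café
```

Both the new UTF-8 form and the legacy raw-octet form of the same
request find the resource, which is what makes gradual deployment
possible.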

A similar thing can happen on the browser side. A native URL,
in whatever encoding, is converted to UTF-8 and used to probe the
server. If it fails, another probe is made with the raw octet form
of the native URL. Initially, it could be done the other way round,
as long as that can be expected to yield more hits. Caching of the
encoding that led to the last hit, on whatever granularity desirable,
can help to reduce unnecessary connections. And the time penalty is
not that significant, because the current use of native character
representation with the expectation of binary interpretation has
never been guaranteed to work anyway.

Now for those URLs that the server creates on the fly, in particular
the query part that results from form input. This currently is the
biggest headache for people actually working on i18n of browsers and
servers, and a uniform convention to use UTF-8 would improve the
situation a lot. Unfortunately, however, the namespace in this case
is more tightly populated. A server has a reasonable chance, depending
on the character encodings it is expecting, to detect UTF-8. But a
client doesn't have many clues about the server.

Because this is a rather old problem, several solutions have been
proposed, and some of them are (partially) implemented. The first
is (in the context of HTTP) to use a GET with a multipart body,
where each of the parts has an appropriate "charset" tag. The second
is to incorporate a hidden field into the FORM, and guess the encoding
by reasonably assuming that the client is sending the same characters
back, although maybe in a different encoding, and that all fields
use the same character encoding. The third is to use the encoding
of the document that was sent, with UTF-8 also in the case of
UCS2/UTF-16 or raw UCS4. However, it is not certain that on a server
that can transcode to Unicode, all the scripts will be able to
accept Unicode. The fourth is to use the "Accept-Charset=" attribute
defined in RFC 2070 for the <INPUT> and <TEXTAREA> elements in
HTML <FORM>s to signal to the client that the server can handle
UTF-8 forms input. If the legacy text encoding for query parts is
known, this could even be handled transparently by the server.
Because these features are only defined in HTML, it is important
to agree on a single encoding to not bother other formats with
similar hassles.
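
The second (hidden-field) solution can be sketched as follows
(present-day Python; the sentinel value and the candidate list are of
course my own choices for illustration):

```python
SENTINEL = "é"  # known characters the server placed in a hidden field
CANDIDATES = ["utf-8", "iso-8859-1"]

def guess_charset(hidden_field_octets):
    # The client echoes the hidden field back in whatever encoding it
    # used for the whole submission; the candidate that reproduces the
    # known characters is very likely that encoding.
    for charset in CANDIDATES:
        try:
            if hidden_field_octets.decode(charset) == SENTINEL:
                return charset
        except UnicodeDecodeError:
            continue
    return None

print(guess_charset("é".encode("utf-8")))       # utf-8
print(guess_charset("é".encode("iso-8859-1")))  # iso-8859-1
```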

Again, if you have any questions or comments, please contact me.

With kind regards,	Martin.

Dr.sc.  Martin J. Du"rst			    ' , . p y f g c R l / =
Institut fu"r Informatik			     a o e U i D h T n S -
der Universita"t Zu"rich			      ; q j k x b m w v z
Winterthurerstrasse  190			     (the Dvorak keyboard)
CH-8057   Zu"rich-Irchel   Tel: +41 1 257 43 16
 S w i t z e r l a n d	   Fax: +41 1 363 00 35   Email: mduerst@ifi.unizh.ch