- From: Martin J. Duerst <mduerst@ifi.unizh.ch>
- Date: Fri, 14 Feb 1997 14:12:24 +0100 (MET)
- To: unicode@unicode.org, unicore@unicode.org, www-international@w3.org
Hello everybody,

This is about INTERNATIONALIZATION of URLs!
[first, apologies if you receive this mail more than once]

As many of you are probably aware, action is needed to get URL
internationalization to seriously move in the right direction.
As I have participated in many discussions and also given talks on this
issue at the last Unicode conference and at the Multilinguism workshop
in Sevilla, I feel some responsibility to move things ahead. However, I
think I need some support, because the people concerned with URLs in
general are not overly friendly to the idea of internationalizing URLs.

Nevertheless, I think we have a strong case, because for some parts of
URLs (query part), i18n is undeniably necessary, and for others, not
doing it puts people and communities using something other than basic
Latin at an unnecessary disadvantage. The concerns that i18n URLs cannot
be keyboarded, transcribed, or mailed may have had some significance in
the past, but are not valid anymore. For example, it's easy to offer a
keyboarding service for the whole of Unicode on the web.

This mail is sent to some lists that deal with internationalization,
and outlines the solution that others and I think is most feasible.
It then describes the proposals I will make around Feb 18th on the
relevant lists (unfortunately there are two of them; see below).

If you think that this issue is important, and the solution taken makes
sense, I would greatly appreciate it if you could have a look at these
lists and contribute your opinion. See the end of this mail for
instructions on how to subscribe to the lists. If you have any
questions, I will also gladly answer them.


Overview
--------

0. The Current State
1. The Solution
2. How to Get There
3. The Process
4. The Proposals
5. What You Can Do


0. The Current State
--------------------

URLs currently are sequences of a limited subset of ASCII characters.
To encode other characters, arbitrary character encodings ("charset"s)
are used, and the resulting octets are encoded using the %HH escaping
mechanism. The first step of this is highly unsatisfactory because it
cannot be inverted, i.e. you cannot guess the characters from the
octets, and because it cannot even be done without some additional
information, i.e. if you know the characters of a file name, you can't
construct the URL, because you don't know the encoding that the server
will expect.

The above is current practice, and can't be changed overnight,
especially because a lot of protocols and software are affected.
But it is possible to specify the direction to go, and it's possible
to arrive at the goal sooner or later.


1. The Solution
---------------

To solve the problem that character encodings are not known, there
could be different solutions. Using tags (a la RFC 1522) is ugly
because these tags would have to be carried around on paper and
everywhere. Having every scheme and mechanism use different conventions
is also rather impracticable because this leads to confusion (e.g. if
the DNS part of a URL is different from the path part because DNS and
HTTP use different conventions). Inventing new schemes for
internationalization (e.g. ftp->iftp) is likewise too clumsy.

The solution that remains is to specify a preference for a single
character encoding. Because this has to encompass UCS and be compatible
with ASCII, there are mainly two solutions, UTF-7 and UTF-8.

UTF-7, however, has various disadvantages. Some of the characters it
uses can have special functions inside URLs. It is not completely
compatible with ASCII. And it has some potential limitations with
regard to UCS-4 (although the last point is rather theoretical).

UTF-8 is not perfect, either. But the problem that it uses 8 bits is
solved by the already existing %HH-encoding.
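
To make the difference concrete, here is a small sketch in Python (an
illustration only, not part of any of the drafts; the file name "Zürich"
and the variable names are assumptions chosen for the example). It shows
how the same characters give different %HH sequences under different
charsets, why the octets alone cannot be turned back into characters,
and how UTF-8 plus %HH round-trips.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

# Hypothetical file name used only for illustration ("Z", u-umlaut, "rich").
name = "Z\u00fcrich"

# Current practice: the charset is whatever the server happens to use,
# so the same characters can yield different octets and different URLs.
latin1_url = quote(name, encoding="iso-8859-1")
utf8_url = quote(name, encoding="utf-8")
print(latin1_url)  # Z%FCrich
print(utf8_url)    # Z%C3%BCrich

# The legacy form cannot be inverted without extra information:
# the octet 0xFC is a different character in each 8-bit charset.
octets = unquote_to_bytes(latin1_url)
as_latin1 = octets.decode("iso-8859-1")
as_cyrillic = octets.decode("iso-8859-5")
print(as_latin1 == name, as_cyrillic == name)  # True False

# With UTF-8 agreed on as the preferred encoding, both directions work
# without knowing anything about the server.
round_trip = unquote(utf8_url, encoding="utf-8")
print(round_trip == name)  # True
```

Note that the UTF-8 form is longer (two escaped octets for one
character here, and three for most CJK characters).
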
The problem that with %HH-encoding, some URLs can get rather long, will
be addressed by the direct use of the characters themselves where
possible (see below).

UTF-8 has an additional advantage, namely that it is with high
probability easily distinguishable from legacy encodings. This was one
of the reasons it was adopted for internationalizing FTP (see
draft-ietf-ftpext-intl-ftp-00.txt).

URNs also have adopted UTF-8 and %HH-encoding (see
draft-ietf-urn-syntax-02.txt). This is important because URLs and URNs
are both part of URIs, and are otherwise kept in sync syntactically.

Also, the general recommendation from the Internet Architecture Board
is to use UTF-8 in situations where ASCII-compatibility is necessary
(see draft-weider-iab-char-wrkshop-00.txt).

This has led to the conclusion, in the various discussions in which I
have been involved, that UTF-8 is the solution. Good documentation on
this can be found at http://www.alis.com:8085/~yergeau/url-00.html.
Actually, Francois Yergeau was faster than me to realize that UTF-8 is
the way to go, and we wouldn't be where we are now if it weren't for
his efforts.


2. How to Get There
-------------------

In Sevilla, I proposed two steps to move towards the solution.

The FIRST STEP is to specify UTF-8 as the preferred character encoding
for URLs, while leaving URLs unchanged otherwise. The two important
points here are "no enforcement" and "canonical form".

Due to current use, it's impossible to enforce anything anyhow. And
some Unicode critics are allergic to being forced to use Unicode. Also,
URLs are a "hat" over many very different protocols and mechanisms,
which all have their own communities and special needs.

Note that if a protocol or mechanism does not use UTF-8 for i18n, it is
still possible (and preferable) to use UTF-8 in URLs and to define a
scheme-specific mapping from UTF-8 to that specific encoding.
Examples would be the "modified UTF-7" defined for IMAP [RFC 2060] and
the "UTF-5" defined in my proposal for domain name i18n
(draft-duerst-dns-i18n-00.txt).

Canonical form means that only %HH-escaping is used, and no 8-bit or
native encoding. This is important tactically because most people
concerned with URLs in general are very much afraid that anything else
cannot be handled (typed/transcribed/mailed). Although they are mostly
wrong on this, it's difficult to convince them in a short time. Of
course, even with canonical form only, it's always possible to offer a
nice interface hiding the %HH (once you know how the conversion is
done!).

The SECOND STEP is of course that the characters can be used directly
where appropriate, e.g. in HTML documents. It is important to notice
here that to carry URLs in such documents, it's not binary UTF-8 octets
that will be used (unless the whole document is UTF-8, and except for
those parts of the URL that are in %HH-escaping). The characters are
carried in the native encoding of the document! This is important to
make transcoding and editing by the user very easy and natural, and it
conforms to what is done with the ASCII characters e.g. in a document
encoded in EBCDIC.


3. The Process
--------------

The two important documents on URLs that are currently being worked on
are the Syntax Draft and the Process Draft.

The Syntax Draft is an improvement and regrouping of earlier RFCs on
URLs (mainly RFC 1738 and RFC 1630). The current version is
draft-fielding-url-syntax-03.txt. This work is carried out without an
IETF working group; discussions are carried out on the otherwise
defunct URI mailing list. The main function of the Syntax Draft is to
define URL syntax, including encoding of octets/characters and handling
of relative URL references.

The Process Draft is the responsibility of a new IETF working group.
The mailing list of this working group is the URL mailing list.
The main purpose of this work is to define conditions and process for
creating new URLs.


4. The Proposals
----------------

On Feb 18th I will propose changes/additions to both drafts on the
respective lists. These changes, if accepted, will correspond to
STEP ONE (UTF-8 as the preferred character encoding, with %HH
escaping).

My current guess is that there is a good chance for these changes to
get accepted, IF there is enough support for them. My proposal last
December for a similar change/addition to the Syntax Draft was ignored
because nobody seconded it, although there was no opposition to it.

I'm sorry I have to bother you and ask for your help, but experience in
the past has shown that wide support is crucial for i18n issues to get
accepted.


5. What You Can Do
------------------

If you agree that we need to move towards consistent
internationalization of URLs, and that what I described above is the
right direction to go, you should subscribe to the two mailing lists in
question (traffic there is currently low), and contribute to the
discussion. A lot of short contributions from many people are better
than a few long contributions all from a small group.

To subscribe to the URI list (Syntax Draft), send a mail with the text
    subscribe uri <your mail address>
to
    majordomo@services.bunyip.com
The archives for this list are at
    http://www.acl.lanl.gov/URI/archive/archives.html

To subscribe to the URL list (Process Draft), send a mail with the text
    subscribe
to
    ietf-url-request@imc.org
The mailing list archive of this list is at
    http://www.imc.org/ietf-url/

To unsubscribe later, follow the instructions that the mailing list
software will send you when you subscribe.

To get the RFCs and internet-drafts mentioned above, contact your
nearest repository. As an example, RFCs can be found as
    ftp://nic.nordu.net/rfc/rfcNNNN.txt
where NNNN is the RFC number.
Internet-drafts can be found as
    ftp://nic.nordu.net/internet-drafts/<draft-name>
Please note that RFCs and internet-drafts can have a trailing .Z if
they are compressed, so you have to try with and without .Z.


With kind regards,    Martin Du"rst.

----
Dr.sc.  Martin J. Du"rst            ' , . p y f g c R l / =
Institut fu"r Informatik             a o e U i D h T n S -
der Universita"t Zu"rich              ; q j k x b m w v z
Winterthurerstrasse 190             (the Dvorak keyboard)
CH-8057 Zu"rich-Irchel   Tel: +41 1 257 43 16
S w i t z e r l a n d    Fax: +41 1 363 00 35   Email: mduerst@ifi.unizh.ch
----
Received on Friday, 14 February 1997 08:20:53 UTC