Re: URL internationalization!

Francois Yergeau
Thu, 20 Feb 1997 23:11:04 -0500

Message-Id: <>
Date: Thu, 20 Feb 1997 23:11:04 -0500
To: "Roy T. Fielding" <fielding@kiwi.ICS.UCI.EDU>
From: Francois Yergeau <>
Subject: Re: URL internationalization! 
Cc: URI mailing list <>
In-Reply-To: <>

At 11:54 20-02-97 -0800, Roy T. Fielding wrote:
>It would help a great deal if advocates of localization did not use
>the term internationalization; you are just creating unnecessary heat
>instead of solving the problem at hand.

In current terminology, as attested by a great many books and other
documents on the subject, internationalization is a process by which an
application, system, etc. is modified so that it can later support
localization by specialization instead of by extension.  You make the
system support multiple charsets (or a universal one), and use one (or part
of the universal one) in each locale.  You establish a generic message
mechanism, and use it with specific messages in each language.  And so on...

And I think this is very much the thrust of Martin's proposals:
internationalize URLs, so that they can reliably and deterministically be
localized anywhere.

>What Martin (and others) have suggested is that the existing requirements
>on internationalization are too severe.  In essence, he wants to make it
>legitimate for URLs to be localized (or lingua-centric), based on the
>conjecture that it is more important for locals to be able to use the
>most meaningful of names within a URL than it is that non-locals be able
>to use those same URLs at all.

Names within URLs are of relatively minor importance, and I wouldn't care
very much if it were the sole issue.  But URLs are also used for form
submission, may contain the subject of a (potential) mail message, and
possibly other such strings in present and future URL schemes, where it is
much more important, even crucial, to be able to represent non-ASCII
characters unambiguously.
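To make the ambiguity concrete, here is a small illustration (mine, with hypothetical values, in Python) of how the very same percent-encoded form data decodes to entirely different text depending on which charset the receiver happens to assume:

```python
from urllib.parse import unquote_to_bytes

# A hypothetical form-submission value containing accented letters.
component = "d%E9j%E0"

raw = unquote_to_bytes(component)   # the octets: d, 0xE9, j, 0xE0

# With no charset declared, the receiver can only guess:
as_latin1 = raw.decode("latin-1")    # 'déjà' under ISO 8859-1
as_greek = raw.decode("iso8859-7")   # Greek letters under ISO 8859-7

# And the octets are not even valid UTF-8, so that guess fails outright:
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    pass  # 0xE9 would need valid continuation bytes after it
```

Nothing in the URL itself tells the receiver which of these readings was intended; that is the ambiguity at issue.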

>It is my opinion that URLs are, first and foremost, a uniform method of
>describing resource addresses such that they are usable by anyone in
>the world.  In my opinion, an address which has been localized at the
>expense of international usage is not a URL, or at least should be
>strongly discouraged.

I would agree with a somewhat modified version of the above: a pure address
(not a form submission or other such URL) which has been localized *is* a
URL, but should be discouraged on the grounds of its loss of universality.
Discouraged, not forbidden.

Not everyone cares about universality all the time.  There are cases where
actually using the non-ASCII characters you need is much more important
than allowing anyone in the world to type your URL.  The URL syntax should
facilitate that and make it as deterministic as possible, not forbid it or
make it available only by jumping through hoops.

>It is therefore my opinion that any attempt to increase the scope of
>the URL character set to include non-ASCII characters is a bad idea.
>This does not in any way restrict the nature of resources that can
>be addressed by a URL; it just means that the URL chosen should be an
>ASCII mapping, either one chosen by the user or one chosen automatically
>using the %xx encoding.  Yes, this is an inconvenience for non-English-
>based filesystems and resources, but that is the price to pay for true
>internationalization of resource access.

This is more than an inconvenience if the underlying character encoding is
left unspecified.  It is nondeterministic.

Furthermore, I think universal access should be given no more than its
proper weight.  Those who want to provide universal access to a particular
resource SHOULD use %-encoding, but those who do not need, or even want, to
provide that should not be forced to use it for no good reason at all.  But
I'm flexible on that; the main problem is nondeterminism when non-ASCII
characters are actually required.

>>+ To improve this, UTF-8 [RFC 2044] should be used to encode characters
>>+ represented by URLs wherever possible. UTF-8 is fully compatible with
>>+ US-ASCII, can encode all characters of the Universal Character Set,
>>+ and is in most cases easily distinguishable from legacy encodings
>>+ or random octet sequences.
>>+ Schemes and mechanisms and the underlying protocols are suggested
>>+ to start using UTF-8 directly (for new schemes, similar to URNs),
>>+ to make a gradual transition to UTF-8 (see
>>+ for an example), or to define a mapping from their representation
>>+ of characters to UTF-8 if UTF-8 cannot be used directly
>>+ (see draft-duerst-dns-i18n-00.txt for an example).
>>[Comment: the references can be removed from the final text.]
>>+ Note: RFC 2044 specifies UTF-8 in terms of Unicode Version 1.1,
>>+ corresponding to ISO 10646 without amendments. There is widespread
>>+ consensus that this should indeed be Unicode Version 2.0,
>>+ corresponding to ISO 10646 including amendment 5.
>None of the above belongs in this document.  That is the purpose of
>the "defining new URL schemes" document, which was previously removed
>from the discussion of the generic syntax.

These paragraphs address not only new schemes but also existing ones, HTTP
in particular, that badly need internationalization.  New schemes do not
need "a gradual transition to UTF-8", they can jump right in. These
paragraphs are crucially needed.
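As a sketch of what the deterministic behaviour looks like (my illustration, not text from the draft, and the function names are hypothetical): a sender maps characters to UTF-8 octets and then %-encodes them, and a receiver can check whether %-decoded octets form valid UTF-8 before resorting to a legacy-charset guess:

```python
from urllib.parse import quote, unquote_to_bytes

def encode_i18n(text: str) -> str:
    """The deterministic direction: character -> UTF-8 octets -> %XX."""
    return quote(text.encode("utf-8"))

def decode_i18n(component: str) -> str:
    """Try UTF-8 first; the legacy fallback is a guess, and labeled as such."""
    raw = unquote_to_bytes(component)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")  # heuristic fallback, not deterministic

round_trip = decode_i18n(encode_i18n("déjà"))
```

Because almost no legacy-encoded text is accidentally valid UTF-8, the try-UTF-8-first step is what makes the draft's "easily distinguishable" claim workable in practice.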

>That doesn't make any sense -- it is done every day.  Francois had a
>personal URL with a c-cedilla, 

Well I didn't, but I could have :-)  And others do, so...

>and it makes sense to admonish
>implementers that such things do occur and should not result in a
>system crash if such is avoidable.

Right.  In fact, not only MUST the system not crash, but it SHOULD behave
the same as if it had received the corresponding %XX.


François Yergeau <>
Alis Technologies Inc., Montréal
Tél : +1 (514) 747-2547
Fax : +1 (514) 747-2561