Date: Sun, 27 Apr 1997 18:16:02 +0200 (MET DST)
From: "Martin J. Duerst" <firstname.lastname@example.org>
To: Francois Yergeau <email@example.com>
Cc: firstname.lastname@example.org
Subject: Re: UTF-8 and URLs
In-Reply-To: <email@example.com>
Message-Id: <Pine.SUN.3.96.970427175721.245P-100000@enoshima>

On Fri, 25 Apr 1997, Francois Yergeau wrote:

> À 00:25 25-04-97 -0500, Dan Connolly a écrit :
> >> Let's see: we would have an i18n RFC that would allow URLs to contain most
> >> any characters, and a (possibly Draft) standard that would say "All URLs
> >> consist of a restricted set of characters..." (we know which): clear
> >> contradiction.
> >
> >Please don't cite out of context or paraphrase wildly. The _existing_
> >RFC limits the characters in URLs. In fact, the UTF-8-in-%XX encoding
> >propsal doesn't even change that: it just adds semantics to the syntax.
>
> I'm sorry, but I see it differently: the UTF-8-in-%XX proposal doesn't add
> octet values on-the-wire, but it adds, and correctly maps, thousands of
> characters.

It can be seen in different ways. For some of the issues discussed in the syntax draft, in particular all about relative URL processing, it is indeed just semantics and doesn't interfere.

On the other hand, the current draft contains many explanations about the relation between represented characters, octets, and URL characters. Somebody studying it will greatly benefit from being told about the limitations of the model that the current draft assumes, and from being told the direction that is being taken to change the model and eliminate the deficiencies.

Also, the UTF-8-in-%XX proposal, strictly requiring %XX, is indeed just an addition of semantics. However, once it is clear how these semantics are added, the next step, namely removing the %XX requirement and extending the URL character set to most of the Universal Character Set (excluding compatibility characters and stuff), is obvious.
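
As a rough illustration of what "UTF-8-in-%XX" means in practice, here is a minimal Python sketch. It is not taken from the syntax draft or from the proposal text. The function names and the choice of which ASCII characters stay unescaped (letters, digits, and - . _ ~) are assumptions made for this example only. It shows that the octets on the wire stay within the restricted URL character set, while the added semantics let them map back to characters outside ASCII.

```python
from urllib.parse import unquote

# Assumption for this sketch: only these ASCII characters are left unescaped.
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

def utf8_percent_encode(text: str) -> str:
    """Hypothetical helper: characters -> UTF-8 octets -> %XX escapes."""
    out = []
    for octet in text.encode("utf-8"):
        ch = chr(octet)
        # Octets outside the unescaped ASCII set are written as %XX.
        out.append(ch if ch in UNRESERVED else "%{:02X}".format(octet))
    return "".join(out)

def utf8_percent_decode(escaped: str) -> str:
    """Hypothetical helper: %XX escapes -> octets -> characters, read as UTF-8."""
    return unquote(escaped, encoding="utf-8", errors="strict")

# "D\u00fcrst" is "Durst" with u-umlaut (U+00FC), written as an escape here.
print(utf8_percent_encode("D\u00fcrst"))                   # D%C3%BCrst
print(utf8_percent_decode("D%C3%BCrst") == "D\u00fcrst")   # True
```

The encoded form contains only ASCII, so the syntax is unchanged. What changes is that a reader who knows the octets are UTF-8 can recover the intended characters.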
If URLs were closely similar to MIME headers, we could say that this is a transparent user-interface issue, but because URLs include the form on paper, where we agree that transcribing long %XX sequences is a great pain for those that know the actual characters, the situation is different.

I originally proposed the addition of UTF-8-in-%XX to the current draft as an important first step towards fully international URLs, based on experience with the URN compromise. But UTF-8-in-%XX is only the first step, and because we already know the next steps, we definitely should tell this to the reader of the syntax draft, whether in the form of a fully reworked draft or (probably preferable) in the form of a note discussing future developments.

Regards, Martin.