Date: Mon, 21 Apr 1997 12:02:53 +0200 (MET DST)
From: "Martin J. Duerst" <email@example.com>
To: "Roy T. Fielding" <fielding@kiwi.ICS.UCI.EDU>
Cc: firstname.lastname@example.org
Subject: Re: revised "generic syntax" internet draft
In-Reply-To: <email@example.com>
Message-Id: <Pine.SUN.3.96.970421113701.245E-100000@enoshima>

Hello Roy,

On Fri, 18 Apr 1997, you wrote:

> Martin, I haven't forgotten about your very detailed problem statement
> at <http://www.imc.org/ietf-url/mail-archive/0052.html>. My question was
> whether all the other people advocating non-ASCII URLs agree to that
> problem statement,

I guess that even though many people have expressed the problems (and
solutions) in different words, there is wide agreement on them. The
summary you have written also expresses the same problems, but the
solutions you give are not satisfactory to many of us.

> and in particular to the course of action for the
> current draft revision.

There have been various opinions, from "leave it as it is, deal
with internationalization separately" to "take the chance to
recycle and deal with i18n completely". I personally tend towards
the latter, but I think that you and Larry have worked hard on the
current draft, and that there are many aspects of i18n URLs that
need long and detailed specs (such as BIDI), so I think there
should be some middle solution.

The middle solution would be to include, in the current draft, a
clear indication of where we are heading (UTF-8), so that people
stay tuned and can take the necessary steps. For example, if they
have to decide how to set up their server site, and whether to use
some legacy encoding or UTF-8 for filenames, they can choose UTF-8
because they will know that this will make things easier in the
long run. Some other documents can then be written to describe the
more advanced things.

> >and looks into the way configuration information can be
> >setup for Apache to inform it about special needs of scripts
> >and stuff, before he again claims things to be impossible.
>
> It is impossible for Apache to correctly transcode incoming URLs for the
> same reason that it is impossible for current browsers to decode and display
> the encoded octets of received URLs -- a program cannot transcode bytes to
> a different charset unless it knows how the bytes are currently encoded.
> There is nothing you can do in the Apache configuration to change that
> fact, since it is a property of how the URL is generated (either by some
> other part of the server or some part of the user agent or some author
> of any page in the Web).

My comment about Apache configuration was about letting the server
know which encoding the target of the URL (filename, CGI parameter,
...) is in. To convert correctly, you need to know the "charset" of
both the source and the target.

As for the source (I have explained this already), if we expand the
current heuristic "same as target" to "same as target or UTF-8", and
not to "whatever it might be", then in sparse namespaces we have
something like a 99.999% hit rate, and because of the properties of
UTF-8, we only occasionally need a second file system access. For
dense namespaces, we need some information from the browser to
distinguish "same as up to now" from "UTF-8", and I have already
described the "FORM-UTF8: Yes" header that does this job.

> I think there is a way to define UTF-8 preference for URL encoding
> such that it won't break existing services, by forbidding transcoding
> of already-encoded octets.

By "already-encoded", do you mean already encoded with %HH? Of
course, things that are encoded in %HH should be treated as binary
and not messed around with.
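
To make the "same as target or UTF-8" heuristic from the reply above
more concrete, here is a minimal sketch in Python. It is only an
illustration under stated assumptions, not how Apache or any other
server works: the function names, the `exists` callback standing in
for a file system lookup, and the choice of Latin-1 as the example
target charset are all hypothetical.

```python
from urllib.parse import unquote_to_bytes

def candidate_target_names(raw_path, target_charset):
    """Return the names (as octets in the target charset) worth looking up.

    Assumption: the source charset is either UTF-8 or the same as the target.
    """
    octets = unquote_to_bytes(raw_path)
    candidates = []
    # Guess 1: the octets are UTF-8, so transcode them to the target charset.
    # Arbitrary legacy octets are rarely valid UTF-8, so this usually fails
    # fast for legacy URLs and costs no file system access.
    try:
        candidates.append(octets.decode("utf-8").encode(target_charset))
    except (UnicodeDecodeError, UnicodeEncodeError):
        pass
    # Guess 2: "same as target", i.e. use the octets unchanged.
    if octets not in candidates:
        candidates.append(octets)
    return candidates

def resolve(raw_path, target_charset, exists):
    """Try each candidate in turn; report the match and the lookups needed."""
    attempts = 0
    for name in candidate_target_names(raw_path, target_charset):
        attempts += 1
        if exists(name):
            return name, attempts
    return None, attempts

# Hypothetical server whose file names are stored in Latin-1.
files = {b"index.html", b"caf\xe9.html", b"\xc3\xa9.txt"}
for url in ["index.html", "caf%E9.html", "caf%C3%A9.html", "%C3%A9.txt"]:
    print(url, resolve(url, "latin-1", files.__contains__))
```

In this sketch a second lookup happens only when the octets are valid
non-ASCII UTF-8 but the transcoded name does not exist, which is the
last of the four example URLs.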

Once UTF-8 is firmly established, there might be cases where software
looks at the %HH sequences, finds that they look like UTF-8 (which is
very rare for arbitrary octet sequences, unless they are ASCII only),
and converts them to real characters. On converting back from real
characters, UTF-8 would also be used, and so the same octets would be
reproduced even if they were not originally UTF-8. However, apart
from purely user-interface cases, this won't be frequent.

> However, I won't bother to explain that
> until there is broad agreement on what needs to be solved.

Please go on and explain your ideas! Maybe they are even closer to
mine than you think :-).

Regards,
Martin.
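
As an illustration of the %HH round trip described above, here is a
minimal sketch in Python. It is only one possible reading of the
idea: the function names `display_form` and `wire_form` are
hypothetical, and the sketch assumes that only runs of escaped octets
with the high bit set are candidates for display as characters, so
that escaped ASCII such as %20 is left alone.

```python
import re
from urllib.parse import quote, unquote_to_bytes

# A run of one or more %HH escapes whose octets are all >= 0x80.
_HIGH_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")

def _decode_run(match):
    octets = unquote_to_bytes(match.group(0))
    try:
        # Octets that happen to form valid UTF-8 are shown as characters.
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        # Anything else is treated as binary and left escaped.
        return match.group(0)

def display_form(escaped):
    """Show UTF-8-looking %HH runs as real characters (user interface only)."""
    return _HIGH_RUN.sub(_decode_run, escaped)

def wire_form(text):
    """Convert back: every non-ASCII character becomes UTF-8 %HH escapes."""
    return "".join(c if ord(c) < 128 else quote(c) for c in text)

url = "caf%C3%A9%20au%20lait"
shown = display_form(url)
print(shown)
print(wire_form(shown) == url)
```

Because both directions use UTF-8, converting back reproduces the
original octets even when a run was not UTF-8 by intent; a run that is
not valid UTF-8, such as %93%FA%96, simply stays escaped.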