- From: Martin J. Duerst <mduerst@ifi.unizh.ch>
- Date: Mon, 21 Apr 1997 15:33:00 +0200 (MET DST)
- To: Larry Masinter <masinter@parc.xerox.com>
- Cc: Gary Adams - Sun Microsystems Labs BOS <Gary.Adams@east.sun.com>, uri@bunyip.com, fielding@kiwi.ics.uci.edu, Harald.T.Alvestrand@uninett.no
On Tue, 15 Apr 1997, Larry Masinter wrote:

> > Are there any "facts" still in need of investigation
> > or are the only unresolved issues questions of "opinion"? (My opinion
> > is that the current system is already broken, if this could be
> > substantiated would that invalidate the "status quo" as a viable
> > alternative?)
>
> At this point, I think we need not just "facts" but some
> actual "design". Exactly how does this all work in a way that
> actually solves the problem?
>
> Let's suppose someone wants to publish information
> about their product and put up a URL in a magazine.
>
> a) what URLs do they support in their server?

The UTF-8 form of the URL they want to give to their product.
%HH is not necessary; it is already removed by the server core.

> b) what gets printed in the magazine?

The actual natural-language characters of the URL for their product.

> c) what does the user type into the browser?

Those same characters.

> d) what does the browser do with what the user typed
>    in order to turn it into the URL that was generated in (a).

It interprets the characters typed in, encodes them as UTF-8, adds %HH
to be really conforming, and sends that to the server.

> how does this work for
> 1) Japanese (16-bit characters)

Well, the "16-bit" characters take up three bytes in UTF-8, or 9 bytes
with the additional %HH, but otherwise everything works smoothly.

> 2) Hebrew (right to left)

This has been discussed to some extent. The URL should be stored in
logical order. For display/paper, we have to agree on a uniform way of
logical->graphical conversion; a proposal is available on Francois' web
page. The main idea is to display the URL with overall LTR
directionality and to treat all syntactically relevant characters as
having strong LTR directionality (this differs from the basic,
text-oriented BIDI algorithm). Bidirectional controls might or might
not be allowed; if they are, they will be strictly restricted to
individual path elements and similar components.

> What happens with "/" and the path components?

The "/" will be strong LTR. The sequence of path components will always
be displayed LTR; an individual path component that is Hebrew will be
displayed RTL.

> How does directionality get represented?

See above. Probably only implicit directionality is needed, but with a
change for "/" and similar characters (which are neutral in the general
text algorithm). Note that this does not mean the general algorithm has
to change, only that some marks have to be inserted before display.

> What are the considerations
> for ambiguity beyond the familiar 0O0O0O1l1l1l for ASCII?

First, a small clarification: 0O0O0O1l1l1l is explicitly familiar to
this list because it has been mentioned a few times; it is *implicitly*
familiar to most other ASCII users.

There are many considerations, most of them script-specific and
(implicitly) well known to the users of the particular script. Some
ambiguities are due to characters included in Unicode for backwards
compatibility; others are due to the treatment of accented characters
in Unicode/ISO 10646. Some of these ambiguities, in particular the last
kind, have to be resolved by defining a normalization
procedure/algorithm. The data for this is already given by the
equivalence definitions in Unicode; what is needed are a few core
decisions. The main one will probably be "Use precomposed forms for
everything in Unicode 2.0, decomposed forms for everything added after
version 2.0."
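As a small illustration of this last point (Python is used here purely
as a convenient notation; the language choice is not implied by the
discussion above), the precomposed and decomposed spellings of the same
accented word are different code point sequences, produce different
UTF-8 octets and hence different %HH escapes, and only compare equal
after both are normalized to one agreed form:

    import unicodedata
    import urllib.parse

    # Two spellings of "résumé": precomposed e-acute (U+00E9) versus
    # plain "e" followed by the combining acute accent (U+0301).
    precomposed = "r\u00e9sum\u00e9"
    decomposed = "re\u0301sume\u0301"

    print(precomposed == decomposed)        # False: different code points
    print(urllib.parse.quote(precomposed))  # r%C3%A9sum%C3%A9
    print(urllib.parse.quote(decomposed))   # re%CC%81sume%CC%81

    # Normalizing both to the precomposed (NFC) form, as suggested above
    # for characters already present in Unicode 2.0, makes them identical.
    nfc_a = unicodedata.normalize("NFC", precomposed)
    nfc_b = unicodedata.normalize("NFC", decomposed)
    print(nfc_a == nfc_b)                   # True
    print(urllib.parse.quote(nfc_a))        # r%C3%A9sum%C3%A9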
This is the most practical choice (because the Unicode 2.0 book is
widely available and because it agrees with current practice) and leads
to fewer problems when upgrading to newer versions with new characters.
If it stays at that, then the final normalization spec will not contain
much more than is used in practice already today. It is true that we
don't know yet how a complex linguistic notation, say a Latin base
letter with five different diacritics, would get normalized, but
because such beasts are not much used for identifiers at the moment, I
think we will have no trouble keeping our spec ahead of actual
practice.

> When the details of this are worked out, and we actually
> have something that works to allow non-ASCII URLs, then
> we can look and see if %xx-hex encoded UTF-8 encoded Unicode
> actually forms part of the solution. But it doesn't seem
> "trivial" to me, or at all certain that the current proposal
> is actually part of the solution.

The overall design is definitely clear, and it is very clear that it
includes UTF-8. Nobody has seriously challenged this or brought up any
kind of workable alternative. That is what we should make clear in the
current draft.

It is not clear whether the final solution will include %HH-encoding.
My guess is that %HH encoding will go away on the HTTP wire except for
reserved (ASCII) characters. My guess is also that it will go away in
HTML and similar documents. It will of course go away on paper. It will
stay as a fallback, e.g. if I find a terrific Japanese web page of
which I only know the Japanese URL, and I want to help somebody not
familiar with Japanese to have a look at it.

Regards,   Martin.
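A small sketch of that fallback, using a hypothetical Japanese path
element and, again, Python purely for illustration: each character
occupies three octets in UTF-8 and therefore nine characters in %HH
form, and the original characters can be recovered from the pure-ASCII
escaped form.

    import urllib.parse

    # Hypothetical Japanese path element: "nihongo" (Japanese language).
    path = "\u65e5\u672c\u8a9e"

    escaped = urllib.parse.quote(path)      # UTF-8 octets, one %HH per octet
    print(escaped)                          # %E6%97%A5%E6%9C%AC%E8%AA%9E
    print(len(path.encode("utf-8")))        # 9 octets: three per character
    print(len(escaped))                     # 27 characters: nine per character

    # The fallback is reversible: whoever receives the pure-ASCII form
    # can recover the original characters from it.
    print(urllib.parse.unquote(escaped))    # 日本語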
Received on Monday, 21 April 1997 09:34:28 UTC