Message-Id: <199809050801.RAA29238@sh.w3.mag.keio.ac.jp>
Date: Sat, 05 Sep 1998 16:15:07 +0900
To: "Sam Sun" <ssun@CNRI.Reston.VA.US>
From: "Martin J. Duerst" <email@example.com>
Cc: "URI distribution list" <uri@Bunyip.Com>
In-Reply-To: <063d01bdd84a$71a944a0$1c1e1b0a@ssun.CNRI.Reston.Va.US>
Subject: Re: iDNR, an alternative name resolution protocol

At 17:24 98/09/04 -0400, Sam Sun wrote:
> Martin,
>
> We are in the middle of drafting the URL Syntax for the Handle System, which
> is the syntax for Handles referenced in HTML document, not the syntax for
> Handles transmitted over the wire (the latter uses UTF-8).

Well, that's the same for every URI syntax spec. The place where this
is easiest to see is the IMAP URL spec.

> Because not every
> customer of the Handle System uses UTF-8 capable platform, we have to define
> the syntax so that they can enter the Handles in terms of their native
> encoding.

What you have to do is to define the syntax in terms of characters,
independent of encoding. Please note that the URI syntax (RFC 2396)
and the HTML syntax are also defined that way. Roy has explained
that very clearly. I guess we will arrive at the same way of definition
for i18n URIs (IURIs), but that still may take some more work.

> (Customers will not enter hex encoded name because one, they won't
> know the encoding, two, they don't look right.) The Handle System Resolver
> will have to do the job of translating the native encoding to the protocol
> encoding. For the Resolver to do this, the native encoding has to be
> specified with the Handle reference. Otherwise, if we simply default the
> encoding of the Handle reference to be the encoding of the surrounding
> context of the HTML document, the Resolver will have to parse the document.
> Also, it might break if user copy and paste the reference from one browser
> window to another.

Copy-paste has been covered by Roy.

What I guess you should do is to set up handle proxy resolvers.
Each user entering a handle on a system that cannot yet convert URIs
to UTF-8 would be configured to use the appropriate resolver. Each
resolver would only handle one encoding, and would then forward the
handle to the real resolver using UTF-8.

> >Do you mean a syntax definition in octets, or in characters?
> >For octets, things would get extremely nasty. Even ASCII characters
> >have different octets in ASCII, EBCDIC, and UTF-16.
> >For characters, it's basically the syntax of RFC 2396, where the
> >general characters (the category that contains A-Z,...) are extended
> >by the whole ISO 10646 repertoire minus certain cases. The certain
> >cases can be divided into stuff that we will hopefully be able to
> >specify exactly (e.g. precomposed/decomposed stuff,...), and stuff
> >that is up to the commonsense of the users, as currently with 0O or
> >lI1.
>
> The "URI interpreter" might want to view URIs as octets, except that we have
> to define a small set of octets that are used as separators (is this
> doable?). However, the encoding information is necessary for the "protocol
> specific interpreter" to translate the URI into its protocol encoding.

There is a set of characters (not octets) that are defined as separators.
See RFC 2396.

> RFC2396 doesn't address the syntax we are discussing here. Section 1 of
> RFC2396 states that "This document does not discuss the issues and
> recommendation for dealing with characters outside of the US-ASCII character
> set [ASCII]; those recommendations are discussed in a separate document."
> And I suppose this "separate document" is the draft you are working on:).

I guess so.

> >> For example, the URI in HTML document may be defined as:
> >>
> >> <uri scheme> ":" [ <encoding> "@" ] <uri scheme specific string>
>
> The encoding means that the bytes following the "@" are currently used for.
> It's not what to be sent to the server over the wire.
> The message sent to the server is generated by the protocol specific filter,
> and should follow the protocol specification identified by the URI scheme.

This is nonsense. HTML documents don't consist of bytes, they consist
of characters. All characters are encoded in the same way. There are
clearly defined mechanisms for finding out what that encoding is; they
just have to be used.

Putting an <encoding> into the URI doesn't help: it will either be the
same as the whole document, in which case it is redundant and in the
wrong place, or it will be different, in which case it will more or less
severely confuse the browser (and the user), or it will be wrong, in
which case it is less than worthless.

Also, officially and up to now, URIs in HTML can't contain any
characters other than those defined in RFC 2396. On the other hand, what
to do in those error cases where other characters are present is already
described in the HTML 4.0 spec, in one of the appendices (see the i18n
URI page on the W3C web site if you don't find it otherwise). And this
"error behaviour" is already in line with what we are working on here,
and does not in any way need an "<encoding>".

> It will not support mixed encoding. It's the limitation, but it's a step
> more flexible than allowing one encoding (i.e. UTF-8) only. Unlike HTML
> document, identifiers probably don't get mixed encoding that much anyway.

HTML documents don't get mixed! Each HTML document is in exactly
one single encoding!

> >- If it's the current encoding, it will make transcoding very hard work.
> >  In RFC 2070, HTML was designed to be transcoded blindly.
> >
> >- Currently, you don't need this for EBCDIC. What is the result if
> >  part of the octets are to be interpreted according to the encoding
> >  of the document, and others according to the tag, but these two
> >  octet sets overlap.
> >
>
> I think I don't quite get your points here. By transcoding, do you mean to
> translate from one encoding to another?
> The HTML interpreter doesn't transcode any HREF reference. It's the
> protocol specific filter that does the work.

I was speaking about transcoding of the whole HTML document from
one encoding to another. For example, there are proxy servers that
transcode from iso-2022-jp to shift_jis, and so on. They take the
whole HTML doc and change everything that's in there.

And the HTML interpreter, according to the error behaviour specified
in HTML 4.0, will translate illegal codes from whatever encoding to
UTF-8 and escape them.

> >- Nobody would want to write http:us-ascii@//www.w3.org/. Why should
> >  that be necessary for Japanese (or whatever else)?
>
> This is where it ENCOURAGES using UTF-8, since no extra typing would be
> needed:).

So people should know what their computer produces when they type?
They don't have to know that for anything else, so why should they
have to know it for URIs?

> > How would it look on cardboard boxes?
>
> I don't know the answer to this. Maybe we can work it out? It seems to me
> that encoding is a computing issue. When you type in, or copy and paste from
> one (browser) window to another, the encoding needs to be carried along. But
> the encoding probably does need to be specified in any cardboard printing.

Exactly. Now, people know that they can type in these things starting
with http:// and so on in their browser, and get where they want.
You would have to add a two-page explanation to every URI on a
cardboard box to explain in which case which encoding has to be added.
As you say, encoding is a computer issue, and we shouldn't bother
the user with it.

Regards,   Martin.
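
For concreteness, the two mechanisms argued for above can be sketched in a
few lines of Python. The first function follows the HTML 4.0 error
behaviour for URIs containing characters outside RFC 2396 (represent each
such character in UTF-8, then escape every resulting byte as %HH). The
second shows what a single-encoding handle proxy resolver would do before
forwarding. This is only a rough sketch: the function names and the sample
URI are made up for illustration, and no real Handle System interface is
implied.

```python
def escape_non_ascii(uri_text: str) -> str:
    """Escape every non-ASCII character as the %HH form of its UTF-8 bytes.

    ASCII characters, including separators such as ":" and "/", are
    left untouched, so a URI that is already legal passes through as is.
    """
    out = []
    for ch in uri_text:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
    return "".join(out)


def proxy_resolver_forward(raw: bytes, native_encoding: str) -> bytes:
    """Hypothetical single-encoding proxy resolver step.

    The proxy knows the one encoding its users type in, decodes the
    octets to characters, and re-encodes them as UTF-8 for the real
    resolver.
    """
    return raw.decode(native_encoding).encode("utf-8")


if __name__ == "__main__":
    # Illustrative URI only; U+00E9 is "e" with an acute accent.
    print(escape_non_ascii("http://example.org/caf\u00e9"))
    # A shift_jis user and an iso-2022-jp user end up with the same octets.
    name = "\u65e5\u672c"
    a = proxy_resolver_forward(name.encode("shift_jis"), "shift_jis")
    b = proxy_resolver_forward(name.encode("iso2022_jp"), "iso2022_jp")
    print(a == b)
```

Note that neither function takes an encoding label from the URI itself:
the document (or the proxy's configuration) already says what the encoding
is, and the characters are the same whichever encoding carried them.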