- From: Sam Sun <ssun@CNRI.Reston.VA.US>
- Date: Fri, 4 Sep 1998 17:24:47 -0400
- To: "Martin J. Duerst" <duerst@w3.org>
- Cc: "URI distribution list" <uri@Bunyip.Com>
Martin,

We are in the middle of drafting the URL syntax for the Handle System, which is the syntax for Handles referenced in HTML documents, not the syntax for Handles transmitted over the wire (the latter uses UTF-8). Because not every customer of the Handle System uses a UTF-8-capable platform, we have to define the syntax so that they can enter Handles in their native encoding. (Customers will not enter hex-encoded names because, one, they won't know the encoding and, two, they don't look right.) The Handle System Resolver will have to do the job of translating the native encoding into the protocol encoding.

For the Resolver to do this, the native encoding has to be specified with the Handle reference. Otherwise, if we simply default the encoding of the Handle reference to the encoding of the surrounding HTML document, the Resolver will have to parse the document. It might also break if users copy and paste the reference from one browser window to another.

> Do you mean a syntax definition in octets, or in characters?
> For octets, things would get extremely nasty. Even ASCII characters
> have different octets in ASCII, EBCDIC, and UTF-16.
> For characters, it's basically the syntax of RFC 2396, where the
> general characters (the category that contains A-Z, ...) are extended
> by the whole ISO 10646 repertoire minus certain cases. The certain
> cases can be divided into stuff that we will hopefully be able to
> specify exactly (e.g. precomposed/decomposed stuff, ...), and stuff
> that is up to the common sense of the users, as currently with 0O or
> lI1.

The "URI interpreter" might want to view URIs as octets, except that we have to define a small set of octets that are used as separators (is this doable?). However, the encoding information is necessary for the "protocol-specific interpreter" to translate the URI into its protocol encoding.

RFC 2396 doesn't address the syntax we are discussing here. Section 1 of RFC 2396 states that "This document does not discuss the issues and recommendation for dealing with characters outside of the US-ASCII character set [ASCII]; those recommendations are discussed in a separate document." And I suppose this "separate document" is the draft you are working on. :)

>> For example, the URI in an HTML document may be defined as:
>>
>>    <uri scheme> ":" [ <encoding> "@" ] <uri scheme specific string>
>>
>> The <encoding> is optional, and is not needed if the <uri scheme
>> specific string> uses UTF-8.
>
> Things like these were considered. But there are a number of problems:
>
> - What does the encoding parameter mean? Is it the encoding that
>   the bytes following the "@" are currently used for, or is it
>   the encoding that the server is expecting?

The encoding is the one that the bytes following the "@" are currently in; it is not what gets sent to the server over the wire. The message sent to the server is generated by the protocol-specific filter, and should follow the protocol specification identified by the URI scheme.
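To make this concrete, here is a rough sketch of the translation step the Resolver (the protocol-specific filter) would perform. It is only an illustration under my own assumptions: the "hdl" scheme label, the function name, and the charset-lookup heuristic are made up for the example, not taken from the Handle System specification.

```python
# Rough sketch only: illustrative names, not from the Handle System spec.
# Input: the reference as octets, possibly tagged as  scheme:<charset>@<rest>.
# Output: the UTF-8 octets that the protocol expects on the wire.

import codecs

def handle_to_wire(reference_octets: bytes) -> bytes:
    scheme, sep, rest = reference_octets.partition(b":")
    if not sep:
        raise ValueError("missing URI scheme")

    charset = "utf-8"                     # no tag: assume UTF-8 already
    tag, at, remainder = rest.partition(b"@")
    if at:
        try:
            label = tag.decode("ascii")
            codecs.lookup(label)          # is the tag a charset we know?
            charset, rest = label, remainder
        except (LookupError, UnicodeDecodeError):
            pass                          # not an encoding tag; leave rest alone

    handle = rest.decode(charset)         # native encoding -> characters
    return handle.encode("utf-8")         # characters -> protocol encoding

# A Latin-1 reference with a tag and an untagged UTF-8 reference both
# yield the same wire octets; the UTF-8 form needs no extra typing.
latin1 = "hdl:iso-8859-1@10.1000/résumé".encode("iso-8859-1")
utf8   = "hdl:10.1000/résumé".encode("utf-8")
assert handle_to_wire(latin1) == handle_to_wire(utf8)
```

When no tag is present, the reference is assumed to be UTF-8 already (so a UTF-8 reference costs no extra typing), and a tag that doesn't name a known charset is simply treated as part of the reference.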
> - If you start down that road, what about cases where different parts
>   of the URI are in different encodings?

It will not support mixed encodings. That is a limitation, but it is still a step more flexible than allowing only one encoding (i.e., UTF-8). Unlike HTML documents, identifiers probably don't get mixed encodings that much anyway.

> - If it's the current encoding, it will make transcoding very hard work.
>   In RFC 2070, HTML was designed to be transcoded blindly.
>
> - Currently, you don't need this for EBCDIC. What is the result if
>   part of the octets are to be interpreted according to the encoding
>   of the document, and others according to the tag, but these two
>   octet sets overlap?

I don't think I quite get your point here. By transcoding, do you mean translating from one encoding to another? The HTML interpreter doesn't transcode any HREF reference; it's the protocol-specific filter that does the work.

> - Nobody would want to write http:us-ascii@//www.w3.org/. Why should
>   that be necessary for Japanese (or whatever else)?

This is where it ENCOURAGES using UTF-8, since no extra typing would be needed. :)

> How would it look on cardboard boxes?

I don't know the answer to this; maybe we can work it out? It seems to me that encoding is a computing issue. When you type a reference in, or copy and paste it from one (browser) window to another, the encoding needs to be carried along. But the encoding probably does not need to be specified in any cardboard printing.

Regards,

Sam
Received on Friday, 4 September 1998 17:26:36 UTC