- From: Sam Sun <ssun@CNRI.Reston.VA.US>
- Date: Thu, 3 Sep 1998 11:16:36 -0400
- To: "Martin J. Duerst" <duerst@w3.org>
- Cc: "Larry Masinter" <masinter@parc.xerox.com>, "Harald Tveit Alvestrand" <Harald.Alvestrand@maxware.no>, "Jon Davis" <jdavis@inetinit.org>, "URI distribution list" <uri@Bunyip.Com>
Hi Martin,

Very nice to hear from you...

I think what we are really interested in is the legal HREF syntax (under
the A element) in an HTML document. According to the HTML4.0 spec, the
HREF is defined as "href = uri [CT]", where "uri" is based on RFC1630 (I
suppose it needs to be updated to RFC2396 now). So the "uri" is used to
govern the HTML document syntax, and I guess we all agree that it's not
practical to MANDATE UTF-8 as the only encoding allowed?

Actually, the last paragraph in section 3.5 of your draft
(ftp://ftp.parc.xerox.com/pub/masinter/draft-masinter-url-i18n-03.txt)
also said:

  "For example, a URI which contains a string in Japanese might actually
  arrive with a variety of encodings, due to the variety of
  interpretations of deployed systems. While this recommendation
  specifies a canonical encoding of Japanese using %HH-encoded UTF-8, in
  practice many URIs will be presented which contain characters encoded
  using Shift-JIS or EUC-JP, either with %HH encoding or not. Thus, to
  transition to the new regime, URI-interpreting software for Japanese
  should accept all three of the EUC-JP, Shift-JIS and UTF-8 encodings."

Does this really mean that a URI may be entered in any native encoding?
If so, I think it would be helpful to provide a syntax definition for
declaring the encoding of the URI. This would allow URI parsers to
convert to UTF-8 (or any other encoding used by the protocol) correctly
without checking the document context. Otherwise, it could be hard for
URI parsers to figure out the encoding of any particular URI, especially
in a multilingual document or on platforms with multiple input methods
installed.

For example, the URI in an HTML document may be defined as:

  <uri scheme> ":" [ <encoding> "@" ] <uri scheme specific string>

The <encoding> is optional, and is not needed if the <uri scheme
specific string> uses UTF-8.

Regards,
Sam

-----Original Message-----
From: Martin J. Duerst <duerst@w3.org>
To: Sam Sun <ssun@ns.cnri.reston.va.us>
Cc: Larry Masinter <masinter@parc.xerox.com>; Harald Tveit Alvestrand <Harald.Alvestrand@maxware.no>; Jon Davis <jdavis@INETINIT.ORG>; URI distribution list <uri@Bunyip.Com>
Date: Wednesday, September 02, 1998 4:39 AM
Subject: Re: iDNR, an alternative name resolution protocol

>Hello Sam,
>
>Many thanks for your comments. We are still working on the draft,
>so any comments are welcome. In particular, I think one of the
>hard pieces is that in many places, this draft is a meta-spec,
>i.e. it says what other specs should do. That requires very careful
>wording; I think Larry has already done some very good work on that.
>
>As for your comments, here are my answers:
>
>At 16:33 98/09/01 -0400, Sam Sun wrote:
>
>> The draft defines URI as "... both for transmission in network protocols and
>> representation in spoken and written human communication". However, it seems
>> that the URI defined for network protocols may have a different set of
>> requirements from the URI targeted for human communication. A URI defined for
>> a network protocol doesn't need to be concerned with being "user friendly" as
>> much as a URI defined for human consumption. And I think a URI defined for
>> human communication should not require that "everyone in the world be able to
>> read or enter" it, because no single language is "friendly" to everyone in the
>> world.
>>
>> For any particular URI scheme defined for a specific network protocol (e.g.
>> http), it makes it simpler to have a uniform encoding. However, if URI is
>> defined as the guideline for every network protocol to be integrated with
>> web browsers, it doesn't seem practical to enforce any specific encoding.
>> Different URI schemes may map to different network protocols, and different
>> protocols may have their very own encoding (already) defined. In fact, most
>> URI scheme specific Resolvers (telnet, ftp, ldap, ...) treat their URI as
>> "human entered" and convert it into the protocol encoding before sending
>> out the request.
>
>We have to clearly distinguish three things here:
>
>- The URIs as they are seen by humans. On a napkin, cardboard box, or what
>  you want, they don't have an encoding. On the screen, they have an encoding,
>  but the user doesn't and shouldn't care about it.
>
>- URIs as they are used in protocols. Up to now, the main protocol I know
>  that uses URIs is HTTP. FTP, telnet, LDAP,... don't use URIs [directly].
>  Even HTTP in many cases uses only a part of a URI.
>
>- Information in URIs that is somehow used in protocols. These are not URIs.
>  You are right that each protocol should be able to use whatever encoding
>  is appropriate. If the draft says or implies anything else, we have to
>  make it clearer. While FTP i18n is defined to use UTF-8 anyway, and so
>  FTP doesn't make for a particularly interesting example, there is already
>  an example that shows this very clearly: the IMAP URI, RFC 2192.
>  IMAP uses a different encoding (usually denoted as "modified UTF-7")
>  for its folder names. The RFC gives code for conversion between this
>  and UTF-8.
>
>
>> That said, it seems more appropriate to define URI "for representation in
>> spoken and written human communication" ONLY. And the URI encoding should be
>> defined as scheme specific. Some URI schemes (e.g. "http:") may require a
>> single encoding, while other URI schemes (e.g. "hld:") would allow any
>> native encoding to be used. The conversion from the human entered URI to the
>> network protocol is handled by the scheme specific Resolver.
>
>I agree with you that it might be a good idea to start with "representation in
>(spoken and) written human communication". As far as I understand, that was
>what was done in RFC 2396, too. But I would want to make a strong and
>serious warning against anything that makes encodings in URIs dependent on
>scheme specifics. Currently, we can have a look at any URI and always
>read the characters (as long as they are limited to the ASCII repertoire),
>independent of the scheme. If we need individual converters and display
>logic for each URI type, we lose all the benefits of having a *Uniform*
>resource identifier.
>
>
>Regards, Martin.
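
As an illustration of the optional encoding prefix proposed in the reply above (<uri scheme> ":" [ <encoding> "@" ] <uri scheme specific string>), here is a minimal Python sketch of how a parser might split off a declared encoding and convert the rest to %HH-encoded UTF-8. The function names, the rule that the prefix must name a codec Python knows, and the set of characters left unescaped are all assumptions made for this example; none of them comes from the draft or from any deployed software.

```python
import codecs
import re
from urllib.parse import quote, unquote_to_bytes

# Assumption: an encoding label looks like a plain charset name.
_LABEL = re.compile(r"[A-Za-z][A-Za-z0-9._-]*")


def split_encoding_prefix(uri):
    """Split scheme ":" [ encoding "@" ] rest into (scheme, encoding, rest)."""
    scheme, sep, rest = uri.partition(":")
    if not sep:
        raise ValueError("no ':' after the scheme")
    label, at, tail = rest.partition("@")
    if at and _LABEL.fullmatch(label):
        try:
            codecs.lookup(label)  # accept the prefix only if it names a codec
        except LookupError:
            pass
        else:
            return scheme, label, tail
    # No usable prefix: the proposal says UTF-8 is then assumed.
    return scheme, "utf-8", rest


def to_canonical_utf8(uri):
    """Re-encode the scheme specific string as %HH-encoded UTF-8."""
    scheme, encoding, rest = split_encoding_prefix(uri)
    text = unquote_to_bytes(rest).decode(encoding)
    # Caveat: this also unescapes reserved characters such as %2F,
    # which a real implementation would have to leave alone.
    return scheme + ":" + quote(text, safe="/?#&=;:@+$,")


# The same two Japanese characters, declared as Shift-JIS:
print(to_canonical_utf8("http:shift_jis@//example.org/%93%FA%96%7B"))
```

The sketch also shows the ambiguity such a prefix introduces: "@" already appears in many scheme specific strings (user@host, for instance), so a parser has to guess whether the text before it is an encoding label.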
Received on Thursday, 3 September 1998 11:26:24 UTC