- From: Larry Masinter <masinter@parc.xerox.com>
- Date: Thu, 3 Sep 1998 18:19:58 PDT
- To: "Sam Sun" <ssun@CNRI.Reston.VA.US>, "Martin J. Duerst" <duerst@w3.org>
- Cc: "Harald Tveit Alvestrand" <Harald.Alvestrand@maxware.no>, "Jon Davis" <jdavis@inetinit.org>, "URI distribution list" <uri@Bunyip.Com>
> Very nice to hear from you... I think what we are really interested is the
> legal HREF syntax (under A element) in HTML document. According to the
> HTML4.0 spec, the HREF is defined as "href = uri [CT]" where "uri" is based
> on RFC1630 (I suppose it need to update to RFC2396 now). So the "uri" is
> used to govern the HTML document syntax, and I guess we all agree that it's
> not practical to MANDATE UTF-8 as the only encoding allowed?

draft-masinter-url-i18n-03.txt does not 'mandate' UTF-8 (in the sense
of 'MUST' in rfc 2119.) However, it does RECOMMEND UTF-8, in the sense
of strongly encourage.

> Actually, the last paragraph in section 3.5
> (ftp://ftp.parc.xerox.com/pub/masinter/draft-masinter-url-i18n-03.txt) of
> your draft also said:
>
> "For example, a URI which contains a string in Japanese might actually
> arrive with a variety of encodings, due to the variety of
> interpretations of deployed systems. While this recommendation
> specifies a canonical encoding of Japanese using %HH-encoded UTF-8, in
> practice many URIs will be presented which contain characters encoded
> using Shift-JIS or EUC-JP, either with %HH encoding or not. Thus, to
> transition to the new regime, URI-interpreting software for Japanese
> should accept all three of the EUC-JP, Shift-JIS and UTF-8 encodings."

The "might" is intended to note the state of the world as it was,
as opposed to the world we're trying to get to. I believe the right
thing to do is to rewrite this section to make it clear which behaviors
are 'legacy' and which ones are compliant with the current
recommendation.

We want to give separate advice for those pieces of software that
generate URIs and those that interpret them. The generators SHOULD
use UTF-8. The interpreters SHOULD accept other encodings besides
UTF-8. This is because the first recommendation won't be implemented
by the current installed base; however, this is part of engineering
the transition path.
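
To make the generator/interpreter split concrete, here is a minimal
sketch in Python. It is only an illustration, not text from the draft:
the function names, the list of legacy encodings, and the order in
which they are tried are all assumptions made for the example.

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical list of legacy encodings a Japanese-aware interpreter
# might tolerate during the transition (assumption, not from the draft).
LEGACY_ENCODINGS = ("shift_jis", "euc_jp")


def generate_segment(text: str) -> str:
    """Generator side: always emit %HH-encoded UTF-8."""
    return quote(text, safe="", encoding="utf-8")


def interpret_segment(segment: str) -> str:
    """Interpreter side: prefer UTF-8, fall back to legacy encodings.

    The fallback is a heuristic. Some byte sequences are valid in more
    than one legacy encoding, so a wrong guess is possible.
    """
    raw = unquote_to_bytes(segment)
    for encoding in ("utf-8",) + LEGACY_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly; keep the data but mark the damage.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    new_style = generate_segment("日本")
    legacy_style = quote("日本".encode("shift_jis"), safe="")
    print(new_style, "->", interpret_segment(new_style))
    print(legacy_style, "->", interpret_segment(legacy_style))
```

Trying UTF-8 first works reasonably well because text in a legacy
encoding rarely happens to be well-formed UTF-8. The guess among the
legacy encodings is much less reliable, which is one more reason for
generators to move to UTF-8.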

> Does this really mean that URI may be entered in any native encoding?

Not in the sense of "MAY" as in (allowed by this specification), but
in the sense of "might" (it might happen because some software doesn't
implement this specification).

> If so,
> I think it would be helpful to provide the syntax definition used to declare
> the encoding of the URI.

No, if you're going to update your software, update it to generate
UTF-8, don't update it to add some encoding-declaration. That is,
we _don't_ want to recommend some new practice that will further
the current situation where there is no interoperability.

> This allows URI parsers to convert to UTF-8 (or any
> other encoding used by the protocol) correctly without checking the document
> context.

A 'URI interpreter' isn't a 'URI parser'. The parsing itself is simple.

> Otherwise, it could be hard for URI parsers to figure out the
> encoding of any particular URI, especially in multilingual document or on
> platforms with multiple input methods installed.

The point is that it doesn't need to 'figure it out'.

> For example, the URI in HTML document may be defined as:
>
> <uri scheme> ":" [ <encoding> "@" ] <uri scheme specific string>
>
> The <encoding> is optional, and is not needed if the <uri scheme specific
> string> uses UTF-8.

This suggestion would continue to propagate non-interoperability
and has no migration path.

Larry
Received on Thursday, 3 September 1998 21:24:47 UTC