Date: Wed, 9 Apr 1997 18:19:29 +0200 (MET DST)
From: "Martin J. Duerst" <email@example.com>
To: Larry Masinter <firstname.lastname@example.org>
Cc: Edward Cherlin <email@example.com>, firstname.lastname@example.org
Subject: Re: revised "generic syntax" internet draft
In-Reply-To: <33496C83.2F9D@parc.xerox.com>
Message-Id: <Pine.SUN.3.96.970408105030.245J-100000@enoshima>

On Mon, 7 Apr 1997, Larry Masinter wrote:

> The requirements for "Draft Standard" (which is what
> I propose for draft-fielding-url-syntax-04) are different than
> the requirements for "Best Current Practice" (which is
> what will be proposed by the URL-WG on the URL process.)
>
> Between Proposed and Draft, protocol specifications can
> be changed to accommodate the actual experience of implementations.
> The proposed wording isn't based on such experience.

Not true! The proposed wording is based on the NEGATIVE experience of many people dealing with the problem of encoding non-ASCII characters in URLs. The implementation and use experience shows that the current specifications are not sufficient, and that some changes have to be accommodated.

Whether the necessary change is such that the document has to restart at the Proposed Standard level, or whether it can advance to the Draft Standard level, is not very important to me personally. But I understand those who would like it to advance, and I have demonstrated that the criteria formally needed for Draft Standard can very well be met.

The criteria also are met in essence. There is ample experience, in all kinds of application areas, that Unicode/ISO 10646 is the only solution for character <-> octet mapping on a worldwide scale. There is wide agreement that UTF-8 as an encoding of Unicode/ISO 10646 is extremely well suited for the task at hand. There is also practical experience (if it was ever needed :-) that UTF-8 + %HH URLs can be used in browsers and on servers.
That some of this experience has not been made with URLs is a minor issue; most people in the IETF know that 2 + 2 is 4.

There is wide consensus that the proposal made is the right way to go. The discussion showed no real objections to the solution, no other solution that would be even close in suitability, and no serious arguments as to why the solution is not needed.

> I've given reasons for rejecting the proposal wording
> change that was actually made, and I also think that what
> draft-fielding-url-syntax-04 says meets the requirements
> for "draft standard". That is, I'm satisfied with the
> words that exist.

I can very well imagine that you are satisfied with the current NON-wording. As far as I am aware (please correct me if I am wrong) you don't speak/write any language besides English, and in particular no language that uses anything other than the Latin script. It is like somebody who has only numbers on his/her file system and does not mind if there is no spec for how to deal with basic Latin, whether in ASCII or EBCDIC or something else.

The fact is that those people who are affected by what goes on beyond ASCII are not satisfied, as you very well know. You don't have the experience, either with implementations or otherwise, to tell them why or how they could be satisfied with the current state. The Draft-Standard-to-be is not for you personally, it is for the whole world.

You have repeatedly been informed about deficiencies discovered with *actual implementation experience*. You have been informed about a very good solution. You have been informed about ample implementation experience. You also have been informed about why and how the proposed wording conforms to the requirements for Draft Standard. Your attempts, if there were some, to argue about any of these points, have not gone very far. If your bottom line is "I'm satisfied", that's not an argument!

I reiterate that there is consensus on integrating text for UTF-8 as the recommended character encoding into the draft. We have a proposed wording, two paragraphs which I don't think I need to repeat. I have only heard very general arguments against this wording, arguments which I have shown to be untrue or irrelevant. If you want to uphold your position that the proposed wording should not be included in the document, I clearly request you to either

a) show relevant problems in the arguments I have given, or
b) give new and serious arguments for your position, or
c) tell this group exactly what in the current wording it is that you don't like, and how it could be changed.

> I think that it would be reasonable to have a new "Proposed
> Standard" that covers 8-bit URLs in UTF-8 as well as
> the recommendation that 7-bit URLs be encoded with %NN.

It is very reasonable to have a document, or documents, that defines the use of URLs representing characters beyond ASCII in interfaces, text documents,... I have always been very open to this idea, both in public and in private, and I have volunteered to do a great deal of the work involved myself and have already started on it. But that's not the issue we are working on here.

> Since this proposal wouldn't be incompatible with
> draft-fielding-url-syntax-04.txt, it can progress
> independently. I think any proposed standard for UTF-8
> encoded URLs would have a different range of applicability
> than for ASCII URLs.

The proposal of *recommending* UTF-8 for URLs (%HH-encoded) is indeed in no way incompatible with the present state of URLs, and the documents that describe it. That's why it is a very valuable addition to draft-fielding-url-syntax-04.txt, correcting (as far as this is possible at the present state) some deficiencies that clearly have shown up as a result of widespread implementation experience.
The proposal of using URLs in documents without the %HH encoding, as raw 8-bit strings, on the other hand, while being interoperable with what we have now, is not formally compatible as such with the current definitions of URLs, and therefore rightly goes into a different document.

In the hope that this will shorten the discussion, I give some more material here that all supports the inclusion of UTF-8 as the recommended character encoding for URLs into the current draft.

The first tidbit is the specification of a notation for internet IPv6 addresses in the domain name part. Semantically, this is a clear addition that is incompatible with the requirements for Draft Standard. Also, there is (as far as I know) no implementation experience, not even negative implementation experience that has complained about the lack of such a feature. Syntactically, it is clear that it works, but there is as of yet no browser that can handle it correctly. The introduction of a new top level domain touches some currently very hot political issues.

Please note how favorably UTF-8 compares with this: There is clear negative implementation experience, it is supported by a wide range of browsers already, there are actually at least two instances of URLs that are proved to work, and there might have been some hot issues surrounding Unicode three or five years ago, but that has been cleaned up by the IAB workshop and a lot of other things. With all the systems using Unicode, in particular Java, NT, Office 97 and the Newton, there is also most probably much more implementation experience with Unicode than with IPv6.

Yet strangely enough, not even the document editor so strongly concerned about formal and procedural compliance with the requirements for Draft Standard sees any problems with this.

To make this clear, I in no way advocate that we throw out the solution for IPv6.
It is very clear to everybody that a solution is needed and that it is better to find one now than only after people start to complain. It is as clear as 2+2=4. The main problem currently seems to be that there are fewer people who can do basic arithmetic for the area of internationalization than there are people who can do the same for general problems such as internet addresses. But at least one should be able to expect that those who don't have the experience in this field would listen to the arguments of the experts and users in this field.

Another tidbit is IMAP. Those people working on IMAP internationalization and IMAP URLs have expressed very clearly that they would be very happy if URLs moved towards consistent internationalization, and that they would have absolutely no problems adding the necessary text to the IMAP URL specs and doing the necessary conversion from their specific solution to whatever the URL side decided to go with. However, they also had good reasons for making it clear that they couldn't write something like "IMAP uses 'modified UTF-7' for internationalization, and URLs recommend XXX, so there is a need for conversion, which goes like this..." as long as they didn't know what XXX would be. They would have been very happy to know what XXX was from the document we are discussing here, and would immediately have incorporated it into their work, if they had known. It is (also) for cases like this that there is great benefit, and absolutely no harm, in including text specifying UTF-8 as the recommended encoding.

The last tidbit is RFC 1866, the proposed standard for HTML 2.0 of November 1995. While this was a Proposed Standard, there was the very clear working guideline to only document actual features of HTML as of (mid?) 1995.
Yet after some extended discussion, an exception was made, which resulted in the following text:

[in 1.2.1]

    * Its document character set includes [ISO-8859-1] and agrees
      with [ISO-10646]; that is, each code position listed in 13,
      "The HTML Coded Character Set" is included, and each code
      position in the document character set is mapped to the same
      character as [ISO-10646] designates for that code position.

[in 6.1]

    NOTE - To support non-western writing systems, a larger
    character repertoire will be specified in a future version of
    HTML. The document character set will be [ISO-10646], or some
    subset that agrees with [ISO-10646]; in particular, all numeric
    character references must use code positions assigned by
    [ISO-10646].

We have a very similar situation here, namely that some syntax (%HH in our case, &#nnn; in the case of HTML) had no defined character correspondences where it actually needed such correspondences. The solutions again are similar, namely to use Unicode/ISO 10646 for worldwide character correspondences. There are of course some differences, namely that we only issue a recommendation (so as not to break working applications), and that we have to deal with the actual encoding (namely UTF-8) instead of just abstract character values.

The above idea of having defined character semantics may have been rather new at that time, but is well established by now. Also, Unicode/ISO 10646 is now much more established, and much more proven by implementations, than it was at that time.

Some people were rather critical at the time the above text got introduced. But everybody now knows it's the right thing. A very honest statement by a developer from a very major company at one of the recent conferences/workshops shows this very clearly. He said that they did it differently, because they didn't know the right solution, but now they know, and that's what they did.

For URLs, we also know the right solution very well. We shouldn't let the rest of the world wait for it.

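To make the "UTF-8 + %HH" recommendation discussed above concrete, here is a minimal sketch. It is an illustration only, not part of the proposed wording: it uses Python's standard `urllib.parse` module, and the helper names `utf8_percent_encode` / `utf8_percent_decode` and the example path are hypothetical. It also shows why a bare %HH octet has no defined character correspondence unless an encoding is agreed on.

```python
from urllib.parse import quote, unquote


def utf8_percent_encode(text: str) -> str:
    """Map characters to UTF-8 octets, then escape non-ASCII octets as %HH.

    Hypothetical helper name; "/" is left unescaped so paths stay readable.
    """
    return quote(text, safe="/", encoding="utf-8")


def utf8_percent_decode(escaped: str) -> str:
    """Undo %HH escaping and interpret the resulting octets as UTF-8."""
    return unquote(escaped, encoding="utf-8", errors="strict")


# Example path with one non-ASCII character (assumed for illustration).
path = "/people/Dürst.html"

utf8_form = utf8_percent_encode(path)
print(utf8_form)                       # /people/D%C3%BCrst.html
print(utf8_percent_decode(utf8_form))  # /people/Dürst.html

# The same character escaped from ISO-8859-1 octets yields a different URL,
# which is the ambiguity a recommended encoding is meant to remove.
latin1_form = quote(path, safe="/", encoding="latin-1")
print(latin1_form)                     # /people/D%FCrst.html
```

Without an agreed encoding, a receiver seeing `%FC` or `%C3%BC` cannot tell which character was meant; with UTF-8 recommended, the second form has exactly one reading.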
Regards,	Martin.