Re: revised "generic syntax" internet draft from Martin J. Duerst on 1997-04-09 (uri@w3.org from April 1997)

From: Martin J. Duerst <mduerst@ifi.unizh.ch>
Date: Wed, 9 Apr 1997 18:19:29 +0200 (MET DST)
To: Larry Masinter <masinter@parc.xerox.com>
Cc: Edward Cherlin <cherlin@newbie.net>, uri@bunyip.com
Message-Id: <Pine.SUN.3.96.970408105030.245J-100000@enoshima>

On Mon, 7 Apr 1997, Larry Masinter wrote:

> The requirements for "Draft Standard" (which is what
> I propose for draft-fielding-url-syntax-04) are different than
> the requirements for "Best Current Practice" (which is
> what will be proposed by the URL-WG on the URL process.)
> 
> Between Proposed and Draft, protocol specifications can
> be changed to accommodate the actual experience of implementations.
> The proposed wording isn't based on such experience.

Not true! The proposed wording is based on the NEGATIVE
experience of many people dealing with the problem of
encoding non-ASCII characters in URLs. The implementation
and use experience shows that the current specifications
are not sufficient, and that some changes have to be
accommodated.

Whether the necessary change is such that the document
has to restart at the Proposed Standard level, or whether
it can advance to the Draft Standard level, is not very
important to me personally. But I understand those who
would like it to advance, and I have demonstrated that
the criteria formally needed for Draft Standard can very
well be met.

The criteria also are met in essence. There is ample
experience, in all kinds of application areas, that
Unicode/ISO 10646 is the only solution for character
<-> octet mapping on a worldwide scale. There is wide
agreement that UTF-8 as an encoding of Unicode/ISO 10646
is extremely well suited for the task at hand. There
is also practical experience (if it was ever needed :-)
that UTF-8 + %HH URLs can be used in browsers and on
servers. That some of this experience has not been
made with URLs is a minor issue; most people in the
IETF know that 2 + 2 is 4.
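
To make "UTF-8 + %HH" concrete, here is a minimal sketch in Python. The
function names and the sample path are assumptions chosen for illustration
only; they do not come from any draft. Each character is mapped to its
UTF-8 octets, and every octet that is not plain unreserved ASCII is
written as %HH.

```python
from urllib.parse import quote, unquote


def to_utf8_percent(path: str) -> str:
    """Map characters to UTF-8 octets and write non-ASCII octets as %HH.

    Hypothetical helper for illustration; "/" is kept as a separator.
    """
    return quote(path, safe="/", encoding="utf-8")


def from_utf8_percent(escaped: str) -> str:
    """Reverse the mapping: %HH octets are read back as UTF-8."""
    return unquote(escaped, encoding="utf-8", errors="strict")


# Illustrative path containing U+00FC (u with diaeresis).
example = "/Z\u00fcrich/d\u00fcrst.html"
escaped = to_utf8_percent(example)
print(escaped)  # /Z%C3%BCrich/d%C3%BCrst.html
print(from_utf8_percent(escaped) == example)  # True
```

The escaped form contains only ASCII, so it passes unchanged through
anything that handles URLs today.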

There is wide consensus that the proposal made is the
right way to go. The discussion showed no real objections
to the solution, no other solution that would be even
close in suitability, and no serious arguments as to why
the solution is not needed.


> I've given reasons for rejecting the proposal wording
> change that was actually made, and I also think that what
> draft-fielding-url-syntax-04 says meets the requirements
> for "draft standard". That is, I'm satisfied with the
> words that exist.

I can very well imagine that you are satisfied with the
current NON-wording. As far as I am aware (please
correct me if I am wrong) you don't speak/write any
language besides English, and in particular no language
that uses anything other than the Latin script.
It is like somebody who has only numbers on his/her
file system and does not mind if there is no spec
for how to deal with basic Latin, whether in ASCII
or EBCDIC or something else.

The fact is that those people that are affected by what
goes on beyond ASCII are not satisfied, as you very well
know. You don't have the experience, either with
implementations or otherwise, to tell them why or
how they could be satisfied with the current state.
The Draft-Standard-to-be is not for you personally, it is
for the whole world. You have repeatedly been informed
about deficiencies discovered with *actual implementation
experience*. You have been informed about a very good
solution. You have been informed about ample implementation
experience. You also have been informed about why and
how the proposed wording conforms to the requirements for
Draft Standard. Your attempts, if there were any,
to argue about any of these points, have not gone
very far.

If your bottom line is "I'm satisfied", that's not an
argument!

I reiterate that there is consensus on integrating text
for UTF-8 as the recommended character encoding into the
draft. We have a proposed wording, two paragraphs which
I don't think I need to repeat. I have only heard very
general arguments against this wording, arguments which
I have shown to be untrue or irrelevant.

If you want to uphold your position that the proposed
wording should not be included in the document, I clearly
request you to either a) show relevant problems in the
arguments I have given or b) give new and serious
arguments for your position or c) tell this group
exactly what in the current wording it is that you
don't like, and how it could be changed.


> I think that it would be reasonable to have a new "Proposed
> Standard" that covers 8-bit URLs in UTF-8 as well as
> the recommendation that 7-bit URLs be encoded with %NN.

It is very reasonable to have a document, or documents,
that define the use of URLs representing characters beyond
ASCII in interfaces, text documents, etc. I have always been
very open to this idea, both in public and in private, and
I have volunteered to do a great deal of the work involved
myself and have already started on it. But that's not the
issue we are working on here.

> Since this proposal wouldn't be incompatible with
> draft-fielding-url-syntax-04.txt, it can progress
> independently. I think any proposed standard for UTF-8
> encoded URLs would have a different range of applicability
> than for ASCII URLs.

The proposal of *recommending* UTF-8 for URLs (%HH-encoded)
is indeed in no way incompatible with the present state
of URLs, and the documents that describe it. That's why
it is a very valuable addition to draft-fielding-url-syntax-04.txt,
correcting (as far as this is possible at the present state)
some deficiencies that clearly have shown up as a result
of widespread implementation experience.

The proposal of using URLs in documents without the %HH
encoding, as raw 8-bit strings, on the other hand, while
being interoperable with what we have now, is not formally
compatible as such with the current definitions of URLs,
and therefore rightly belongs in a different document.



In the hope that this will shorten the discussion, I give some
more material here, all of which supports the inclusion of UTF-8
as the recommended character encoding for URLs in the current
draft.


The first tidbit is the specification of a notation for
IPv6 addresses in the domain name part. Semantically, this is a
clear addition that is incompatible with the requirements for
Draft Standard. Also, there is (as far as I know) no implementation
experience, not even negative implementation experience in the form
of complaints about the lack of such a feature. Syntactically,
it is clear that it works, but there is as yet no browser
that can handle it correctly. The introduction of a new top
level domain touches some currently very hot political issues.
Please note how favorably UTF-8 compares with this: There is
clear negative implementation experience, it is supported by
a wide range of browsers already, there are actually at least
two instances of URLs that are proved to work, and there might
have been some hot issues surrounding Unicode three or five years
ago, but that has been cleaned up by the IAB workshop and a lot
of other things. With all the systems using Unicode, in particular
Java, NT, Office 97 and the Newton, there is also most probably
much more implementation experience with Unicode than with IPv6.

Yet strangely enough, not even the document editor so strongly
concerned about formal and procedural compliance to the requirements
for Draft Standard sees any problems with this.

To make this clear, I in no way advocate that we throw out
the solution for IPv6. It is very clear to everybody that a
solution is needed and that it is better to find one now than
only after people start to complain. It is as clear as 2+2=4.
The main problem currently seems that there are fewer people
that can do basic arithmetic for general problems such as
internet addresses than there are people that can do the
same for the area of internationalization. But at least one
should be able to expect that those that don't have the
experience in this field would listen to the arguments
of the experts and users in this field.


Another tidbit is IMAP. Those people working on IMAP internationalization
and IMAP URLs have expressed very clearly that they would be very
happy if URLs moved towards consistent internationalization, and
that they would have absolutely no problems adding the necessary
text to the IMAP URL specs and to do the necessary conversion from
their specific solution to whatever the URL side decided to go
with. However, they also had good reasons for making it clear that
they couldn't write something like "IMAP uses 'modified UTF-7'
for internationalization, and URLs recommend XXX, so there
is a need for conversion, which goes like this..." as long
as they didn't know what XXX would be. They would have been very
happy to know what XXX was from the document we are discussing
here, and would immediately have incorporated it into their
work, if they had known. It is (also) for cases like this that
there is great benefit, and absolutely no harm, in including
text specifying UTF-8 as the recommended encoding.
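
As a rough sketch of how small that conversion would be once XXX is
known, assume XXX is UTF-8 + %HH. The Python below decodes a mailbox name
in 'modified UTF-7' (as described in RFC 2060, section 5.1.3) and
re-encodes it as UTF-8 with %HH escapes. The function names and the
sample mailbox name are illustrative assumptions, not text from any IMAP
URL spec, and error handling is kept to a minimum.

```python
import base64
from urllib.parse import quote


def decode_modified_utf7(name: str) -> str:
    """Decode an IMAP 'modified UTF-7' mailbox name (hypothetical helper).

    "&" starts a shifted run that ends at "-"; the run is base64 with ","
    in place of "/", no padding, carrying UTF-16BE. "&-" is a literal "&".
    """
    out = []
    i = 0
    while i < len(name):
        if name[i] != "&":
            out.append(name[i])
            i += 1
            continue
        end = name.index("-", i)  # raises ValueError if unterminated
        chunk = name[i + 1:end]
        if chunk == "":
            out.append("&")
        else:
            b64 = chunk.replace(",", "/")
            b64 += "=" * (-len(b64) % 4)  # restore the omitted padding
            out.append(base64.b64decode(b64).decode("utf-16-be"))
        i = end + 1
    return "".join(out)


def mailbox_to_url_path(name: str) -> str:
    """Convert a modified UTF-7 mailbox name to UTF-8 + %HH (assumed XXX)."""
    return quote(decode_modified_utf7(name), safe="/", encoding="utf-8")


print(mailbox_to_url_path("mail/&ZeVnLIqe-"))
# mail/%E6%97%A5%E6%9C%AC%E8%AA%9E
```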


The last tidbit is RFC 1866, the proposed standard for HTML 2.0
of November 1995. While this was a Proposed Standard, there was
the very clear working guideline to only document actual
features of HTML as of (mid?) 1995. Yet after some extended
discussion, an exception was made, which resulted in the following
text:

[in 1.2.1]

        * Its document character set includes [ISO-8859-1] and
        agrees with [ISO-10646]; that is, each code position listed
        in 13, "The HTML Coded Character Set" is included, and each
        code position in the document character set is mapped to the
        same character as [ISO-10646] designates for that code
        position.

[in 6.1]

      NOTE - To support non-western writing systems, a larger character
      repertoire will be specified in a future version of HTML. The
      document character set will be [ISO-10646], or some subset that
      agrees with [ISO-10646]; in particular, all numeric character
      references must use code positions assigned by [ISO-10646].

We have a very similar situation here, namely that some syntax
(%HH in our case, &#nnn; in the case of HTML) had no defined
character correspondences where it actually needed such correspondences.
The solutions again are similar, namely to use Unicode/ISO 10646
for worldwide character correspondences. There are of course some
differences, namely that we only issue a recommendation (to not
break working applications), and that we have to deal with the
actual encoding (namely UTF-8) instead of just abstract character
values.
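
The parallel can be sketched in a few lines of Python. The helper name
and the sample references are assumptions for illustration. An HTML
numeric character reference names an ISO 10646 code position directly,
whereas the URL side has to go one step further and fix the octets, here
by way of UTF-8.

```python
import html
from urllib.parse import quote


def reference_to_percent(ref: str) -> str:
    """Turn one HTML numeric character reference into UTF-8 + %HH.

    Hypothetical helper: &#nnn; gives the abstract code position,
    UTF-8 gives the octets, and each octet is written as %HH.
    """
    ch = html.unescape(ref)  # e.g. "&#228;" -> the character U+00E4
    return quote(ch, safe="", encoding="utf-8")


print(reference_to_percent("&#228;"))    # %C3%A4
print(reference_to_percent("&#26085;"))  # %E6%97%A5
```

The number in &#228; is the code position itself, while %C3%A4 is the
pair of UTF-8 octets for that same position; this is the extra step that
has to be specified for URLs.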

The above idea of having defined character semantics may have been
rather new at that time, but is well established by now. Also,
Unicode/ISO 10646 is now much more established, and much more proven
by implementations, than it was at that time. Some people were
rather critical at the time the above text got introduced. But
everybody now knows it's the right thing. A very honest statement
by a developer from a very major company at one of the recent
conferences/workshops shows this very clearly. He said that they
did it differently, because they didn't know the right solution,
but now they know, and that's what they did.


For URLs, we also know the right solution very well. We shouldn't
let the rest of the world wait for it.


Regards,	Martin.
Received on Wednesday, 9 April 1997 12:23:31 UTC