Re: revised "generic syntax" and "data:" internet drafts from Martin J. Duerst on 1997-04-07 (uri@w3.org from April 1997)

From: Martin J. Duerst <mduerst@ifi.unizh.ch>
Date: Mon, 7 Apr 1997 16:45:58 +0200 (MET DST)
To: Larry Masinter <masinter@parc.xerox.com>
Cc: Edward Cherlin <cherlin@newbie.net>, uri@bunyip.com
Message-Id: <Pine.SUN.3.96.970407124232.245C-100000@enoshima>

On Fri, 4 Apr 1997, Larry Masinter wrote:

> I'm surprised I have to spell this out. 

In an environment with many participants with different backgrounds
and different opinions, spelling things out is usually much better
than assuming everybody has the same understanding as you.


> Martin has proposed that we add to the "Draft Standard" the following
> wording:
> 
> >    URL creation mechanisms that generate the URL from a source which
> >    is not restricted to a single character->octet encoding are
> >    encouraged, but not required, to transition resource names toward
> >    using UTF-8 exclusively.

The wording is by Roy Fielding. I just had to reiterate the proposal
to put it into the draft because it had been ignored. The reasons
why it was ignored, as far as Larry's explanations go,
seem to be mainly of a procedural nature. Unfortunately, these
reasons have not been discussed before, so that's why we have
to discuss them now.


> I will point out that of all of the implementors of all of the software
> that I'm aware of that contain "URL creation mechanisms", including
> the software products from Alis, Accent, Netscape and Microsoft -- 
> even in the latest versions which purport to support UTF-8 in the
> representation of text--  I have yet to see any product that is a
> "URL creation mechanism" that actually does what Martin is proposing
> we encourage them to do ("transition resource names toward using
> UTF-8 exclusively").

You seem to have a rather restricted view of "URL creation mechanism".
And you seem to forget that the standard is not about URL creation
mechanisms, it is about URLs as such. You asked for two interoperable
implementations. Obviously, interoperability for URLs means that
if I get a URL, for example via mail, and transfer it, for example
to paper and then into a web browser, I get the resource that the URL
denotes. You had ample possibility to check that this is indeed possible
with the two URLs I provided. They use two different schemes/protocols,
and work with a wide set of browsers, so as far as URLs with UTF-8
are concerned, the formal/procedural requirements are met.

As far as "URL creation mechanisms" go, I also mentioned two of them,
namely a) do it by hand, and b) use a system that already uses
UTF-8 to denote file names (such as RS 6000 with the right Unix system
and the right locale). So we have two URL creation mechanisms that
already support UTF-8, two interoperable implementations. The fact
that these interoperable implementations are trivial seems to bother
you, but their triviality is definitely an advantage and not a
disadvantage. The fact that for other configurations things
may be a little more complicated than for these two cases is an issue
that was discussed in quite some detail, with rather satisfactory
results. It is not of concern for the procedural requirement of
"two interoperable implementations".


> Not only aren't they transitioning toward
> using UTF-8 exclusively,

If you worry that the exact wording, as proposed by Roy, is too
restrictive, for example in particular that "towards using UTF-8
exclusively" may give the impression that hitherto working non-
UTF-8 URLs should be discontinued prematurely, then please say so.
I have absolutely no problem with changing the wording, as long
as the basic intention is maintained.


> I've yet to see one that actually uses
> UTF-8 in resource names at all.

Well, there are not that many URLs out there currently that use
anything beyond ASCII. If you can name ten sites without doing
research, it would rather surprise me.


> When I asked for instances of some
> actual practice,

You didn't ask for instances of actual practice. You asked for
"interoperable implementations".

> I got sent two examples of URLs on Martin's own site
> in which the "URL creation mechanism" was careful hand crafting
> of the URLs themselves.

Should I have told you that I had some magic translation software
translating English filenames into Japanese? Indeed I did the
translation from English to Japanese manually, by editing a file.
How else should I have translated "ruby" and "FontComposition" into
their Japanese equivalents?
All the rest was just mechanical. It would be the same if I
wanted to produce these Japanese resource names in EUC or SJIS,
only that I would have to change one setting in a "Save As"
command.
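
To illustrate the point about the "Save As" setting, here is a minimal
sketch in Python (not something used in this discussion) of how the same
resource name yields different URL escapes depending on which
character->octet encoding is chosen. The helper names percent_encode and
percent_decode and the sample name (katakana for "ruby") are assumptions
made for illustration only; they are not the actual file names on the
site mentioned above.

```python
from urllib.parse import quote, unquote_to_bytes


def percent_encode(name: str, charset: str) -> str:
    """Map characters to octets with `charset`, then %HH-escape the octets."""
    # safe="" escapes every octet that is not an unreserved ASCII character.
    return quote(name, safe="", encoding=charset)


def percent_decode(escaped: str, charset: str) -> str:
    """Reverse the escaping; the reader must already know `charset`."""
    return unquote_to_bytes(escaped).decode(charset)


# Hypothetical resource name: U+30EB U+30D3, katakana for "ruby".
name = "\u30eb\u30d3"

for charset in ("utf-8", "euc_jp", "shift_jis"):
    escaped = percent_encode(name, charset)
    # The escapes differ per charset, yet each one round-trips correctly
    # only when decoded with the charset that produced it.
    print(f"{charset:10} {escaped}")
```

The escaped form carries no indication of which encoding produced it,
which is why a single recommended encoding helps whoever receives the URL.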


> Furthermore, none of the implementors
> of the "URL creation mechanisms" have stepped forward to endorse
> this proposal.

First, please don't ignore Francois Yergeau, from Alis!

Obviously, none of them has stepped forward to say anything
against it! And as far as my recollections go from the past two
Unicode conferences, from the symposium in Sevilla, from
a recent trip to Japan, and from private mail, many people
working in the field of software internationalization, including
employees of the companies you mention above, have expressed positive
opinions towards URLs and UTF-8, often clearly expressing their
satisfaction that the issue of representation of characters in
URLs is finally being addressed.

Also, it might help you to know that for host naming,
Microsoft is using UTF-8 as far as they can (i.e. as
long as they are not limited by the current state of
DNS when interfacing to the outside world).
I have no doubt they wouldn't mind if DNS went to UTF-8
directly instead of using my proposal in
draft-duerst-dns-i18n-00.txt or something similar.


> The only voices for it are those who are not actually
> producing "URL creation mechanisms". Even the most ferverent
> believer in UTF-8 would not be so foolish as to create a product
> that 'transitioned' toward 'using UTF-8 exclusively'.

I can repeat it again here: If you have problems with the word
"exclusively", then let's discuss that.


> Certainly, 99.99%
> of the installed "URL creation mechanisms that generate a URL from a
> source that is restricted to a single character->octet encoding"
> do *NOT* "use UTF-8 exclusively".

99% or more of the existing URLs are ASCII only. So they are
UTF-8 by definition :-). As for the "creation mechanisms", how do
you want to count them?
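
The "UTF-8 by definition" remark can be checked mechanically. The
following is a small sketch in Python, assuming a made-up example URL
and a hypothetical helper name; it only shows that a string restricted
to ASCII produces the same octets under ASCII and under UTF-8.

```python
def same_octets_in_utf8(url: str) -> bool:
    """True if `url` is ASCII-only and its ASCII and UTF-8 octets are identical."""
    if not url.isascii():
        # Not an ASCII-only URL, so the comparison does not apply.
        return False
    return url.encode("ascii") == url.encode("utf-8")


# Hypothetical ASCII-only URL, used purely as an example.
print(same_octets_in_utf8("http://example.org/ruby.html"))
```

Because UTF-8 encodes the characters U+0000 to U+007F as the single
octets 0x00 to 0x7F, the comparison holds for every ASCII-only string.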


> It would be irresponsible and ridiculous to insert a recommendation
> into a Draft Standard of a practice that not only did not occur
> in the Proposed Standard but also is not the result of implementation
> experience of the community.

Assume a protocol was being upgraded from proposed to draft standard.
Assume a security hole was found in the protocol, and that it was
rather clear how to fix this. Would it be irresponsible and ridiculous
to insert a recommendation, into a Draft Standard, that this security
hole should be fixed (and a recommendation of how this should be fixed)?
Or would it be irresponsible and ridiculous to leave the security
hole open for some perceived procedural reasons?


At the risk of being accused of repetition, and of being too
clear, I spell out the state of the discussion, as far as
I see it:

- The discussions on the uri and the url lists have led to a
	"rough consensus" that it is a good thing to recommend
	UTF-8 for the encoding of characters into octets to be
	encoded as URLs.

- The above consensus addresses a clear deficiency in the current
	spec with a *recommendation* that is long awaited
	by those who care, while not at all affecting those
	that don't care.

- The above consensus is in accordance with the IAB workshop
	recommendations, the URN syntax, the proposals of the
	ftpext wg, the leading individuals in IMAP
	internationalization, and so on, and with the widely acknowledged
	development of the industry as a whole.

- This consensus has been accepted by the document editor
	of the process draft (url list).

- The document editor of the syntax draft (uri list), while being
	identical to the document editor of the process draft, has
	ignored the above-mentioned consensus without clearly informing
	the list. After checking and investigation, he claimed
	procedural reasons for his decision, which he never
	before gave the list a chance to address.
	After closer investigation, the procedural concerns turn
	out to be nonexistent. The requirement of "two interworking
	implementations" is clearly met, even if these implementations
	turn out to be trivial.

- After "procedural concerns", the document editor currently pursues
	an additional line of reasoning, based on "vendor support".
	Vendor support is in no way procedurally relevant. Also, the
	vendors are ready to take up a reasonable proposal as soon as
	it is nailed down.

- The two paragraphs of text proposed for addition are worded by
	a longstanding expert on URLs and Internet matters in general.
	Nevertheless they may still contain some inaccuracies or
	possibilities for misinterpretation. If this turns out to
	be the case, the wording should be improved as quickly
	as possible. I'm definitely open to discussion.

I hope the above is clear enough.	Regards,	Martin.
Received on Monday, 7 April 1997 10:47:10 UTC