Re: revised "generic syntax" and "data:" internet drafts from Martin J. Duerst on 1997-04-03 (uri@w3.org from April 1997)

From: Martin J. Duerst <mduerst@ifi.unizh.ch>
Date: Thu, 3 Apr 1997 16:53:33 +0200 (MET DST)
To: Larry Masinter <masinter@parc.xerox.com>
Cc: Edward Cherlin <cherlin@newbie.net>, uri@bunyip.com
Message-Id: <Pine.SUN.3.96.970403120236.245D-100000@enoshima>

On Wed, 2 Apr 1997, Larry Masinter wrote:

> In my personal judgement, there was significant controversy
> about adding to a Draft Standard document additional constraints
> that were not part of the Proposed Standard and are not
> implemented in at least two interoperable implementations.

In the current discussion, started by my original proposals in
mid February, there was definitely no "significant controversy"
about procedural matters such as those you mention above.
If you think otherwise, please give the references to the
mailing archive. As you can see below, there is no need for
such a controversy. If you had brought this subject up earlier,
I could have answered as below earlier.

Also, there are in no way any additional constraints.
There are only recommendations. I have clearly shown
that these don't affect existing (or even future)
implementations in any major way. If you want to challenge
this, please give the details.

In addition, the requirement of "two interoperable implementations"
is rather easy to fulfill; so easy that it is hardly worth
bothering about except for procedural reasons. For those who have
been seriously involved in the discussion, this is quite clear,
but I will explain it here in detail (don't blame me, please,
if you think this is too detailed!).

Obviously, on the browser side, the only thing we need is
the ability to input %HH-escaped UTF-8 URLs. There are dozens
of browsers that allow this as of now!

On the server side, we need two sites with some resource
names actually in UTF-8. I will provide two here, one http
and one ftp, with one resource name each. As these two
"implementations" are not independent (both are mine), I hope
somebody else can provide another one. I have provided only
one resource name each in UTF-8, which should be enough,
but if you think that this works by chance rather than in
general, please tell me what other kinds of names you
would like me to provide.


This is the first URL:
---------------------

http://www.ifi.unizh.ch/mml/mduerst/%e3%83%ab%e3%83%93.html

which is actually a link to:
	http://www.ifi.unizh.ch/mml/mduerst/ruby.jp.html

The part %e3%83%ab%e3%83%93 represents the two katakana characters
for "ruby", U+30EB U+30D3, in UTF-8 (please check for yourself).
[The content is an attempt at translating one of my earlier
documents about ruby in HTML into Japanese, never finished :-(.
It's not up to date anymore, but that's obviously irrelevant here.]
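
For those who want to check the escapes for themselves, here is a
minimal sketch in Python. It relies only on the standard urllib.parse
module; the variable names are mine and purely illustrative.

```python
from urllib.parse import quote, unquote

# The two katakana characters for "ruby", given by code point.
ruby = "\u30eb\u30d3"

# Percent-encode the UTF-8 octets of the string.
# quote() encodes as UTF-8 by default and emits upper-case hex digits.
escaped = quote(ruby)
print(escaped)

# Hex digits in %HH escapes are case-insensitive, so the lower-case
# form used in the URL above decodes to the same two characters.
decoded = unquote("%e3%83%ab%e3%83%93")
print(decoded == ruby)
print([f"U+{ord(c):04X}" for c in decoded])
```

Each katakana character takes three octets in UTF-8, which is why the
two characters turn into six %HH escapes.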

I tested this with Netscape Gold 3.01 and with NCSA Mosaic 2.6, both
on a Sun.


This is the second URL:
----------------------

ftp://ftp.ifi.unizh.ch/pub/multilingual/
%E6%9B%B8%E4%BD%93%E7%B5%84%E3%81%BF%E5%90%88%E3%82%8F%E3%81%9B.ps.Z

It is a link to
	ftp://ftp.ifi.unizh.ch/pub/multilingual/FontComposition.ps.Z

The long %HH escapes are the Japanese characters for "shotaikumiawase",
a translation of "font composition". Unicode codepoints available on
request. [The file itself is a prepublication version of a paper
written for the 1995 Unicode conference; I can recommend it for
those that are interested in the subject.]
If you want to avoid typing in the whole long string,
just go to the URL ftp://ftp.ifi.unizh.ch/pub/multilingual/
and click on the one name that is not ASCII. In Netscape,
for example, the filename will show as some garbage, unless
you have a 4.0 version and set "document encoding" to UTF-8.
I guess the same works for Internet Explorer 4.0. Why does
anybody claim that we have no interworking applications,
when things already work even more strictly than we might
like (namely, they use %HH throughout)? In Netscape 3.0,
you can verify the URL as above with "view source".
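
The "garbage" is easy to reproduce. The following Python sketch
decodes the escapes from the second URL once as UTF-8 and once as
Latin-1; the choice of Latin-1 as the wrong guess is my assumption
about what a browser with a Western default would do, not something
I have checked in any particular browser.

```python
from urllib.parse import unquote_to_bytes

# The %HH escapes from the ftp URL above, without the ".ps.Z" suffix.
escaped = ("%E6%9B%B8%E4%BD%93%E7%B5%84%E3%81%BF"
           "%E5%90%88%E3%82%8F%E3%81%9B")

# Undo the %HH escaping only; the result is raw octets.
octets = unquote_to_bytes(escaped)

# Interpreted as UTF-8, the octets form seven Japanese characters.
name = octets.decode("utf-8")
code_points = [f"U+{ord(c):04X}" for c in name]
print(len(name), code_points)

# Interpreted as Latin-1, every octet becomes a character of its own,
# many of them unprintable controls: the "garbage" in the listing.
garbled = octets.decode("latin-1")
print(len(octets), len(garbled))
```

The octets on the wire are identical in both cases; only the
encoding chosen for display differs.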


Some people may wonder how I created these file/resource names.
Well, I used an editor offering multilingual input facilities and
the ability to store a file as UTF-8 to write a shell script
with the needed "ln" commands. One would not even need an editor
capable of UTF-8; a filter converting to UTF-8 would also do the
job. The whole thing gets a little more difficult if your
original filenames are not pure ASCII; you would then need
two different encodings in one and the same shell script file.
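
The same can be sketched without a shell script. The Python fragment
below is only an illustration under my own assumptions: the helper
names are hypothetical, it works in a temporary directory rather
than a real document root, and it creates a hard link where a
symbolic link would do just as well. Handling names as bytes avoids
the problem of mixing two encodings in one script file.

```python
import os
import tempfile

def link_with_utf8_name(directory, existing, new_name):
    """Hard-link `existing` under a second name stored as UTF-8 octets."""
    base = os.fsencode(directory)
    src = os.path.join(base, os.fsencode(existing))
    dst = os.path.join(base, new_name.encode("utf-8"))
    os.link(src, dst)  # same effect as "ln src dst"
    return dst

def legacy_to_utf8(raw_name, legacy_encoding):
    """Transcode a filename given as raw octets in a legacy encoding."""
    return raw_name.decode(legacy_encoding).encode("utf-8")

with tempfile.TemporaryDirectory() as demo_dir:
    # Stand-in for the real document.
    with open(os.path.join(demo_dir, "ruby.jp.html"), "w") as f:
        f.write("placeholder")

    linked = link_with_utf8_name(demo_dir, "ruby.jp.html",
                                 "\u30eb\u30d3.html")

    # The document is now reachable under the UTF-8 name as well.
    with open(linked) as f:
        content = f.read()
    print(content)
```

For names that start out in a legacy encoding, legacy_to_utf8() can
compute the UTF-8 form of the name before the link is created.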


As an aside, I found a very nice feature in Apache 1.2 that
will probably make it very easy to set up a server that serves
UTF-8 names for a site whose file/resource names are in a fixed
legacy encoding, either overall or per directory. It is the module
"rewrite". This can be configured to call a program for
rewrites. If you choose tcs as the program and configure
it so that it converts from UTF-8 to e.g. Latin-1, it will
do a nice job. Tcs has to be changed to work non-buffered,
but that's very easy. If defined on a per-site basis, the
tcs program is started only once. This setup will not yet
address the upgrade path; for this, a general "retry" facility
is needed, i.e. a way to specify "try with this name, if you
don't find the resource, retry with another name". Such
a facility may exist somewhere in Apache, or could probably
be built in, and would have many other uses. The alternative
is to use links, as already explained.
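
To illustrate what such a rewrite program has to do, here is a small
Python stand-in for the unbuffered tcs filter. It is a sketch under
assumptions: that the program is attached through the "prg:" type of
the RewriteMap directive, that it receives one lookup key per line
and answers with one line, and that the string "NULL" signals a
failed lookup. The function names are hypothetical.

```python
import io

def utf8_to_latin1(key: bytes) -> bytes:
    """Convert one UTF-8 name to Latin-1, or return b"NULL" on failure."""
    try:
        return key.decode("utf-8").encode("latin-1")
    except UnicodeError:
        # Not valid UTF-8, or not representable in Latin-1.
        return b"NULL"

def serve(stdin, stdout):
    """Answer one line per request, flushing after each answer."""
    for line in stdin:
        stdout.write(utf8_to_latin1(line.rstrip(b"\r\n")) + b"\n")
        stdout.flush()  # the server waits for each answer: no buffering

# Demonstration with in-memory streams instead of a real server.
requests = io.BytesIO("caf\u00e9.html\n\u30eb\u30d3.html\n".encode("utf-8"))
answers = io.BytesIO()
serve(requests, answers)
print(answers.getvalue())
```

In a real setup one would call serve(sys.stdin.buffer,
sys.stdout.buffer) instead of the demonstration. The second request
fails because katakana cannot be represented in Latin-1; this is
exactly the case where a "retry" facility or a link would be needed.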


The whole story about "reference implementation" shows that
we are dealing here not with adding something new to an
existing protocol, but with *recommending* clear semantics
in an area that was up to now blatantly ignored and needs
some fix, the sooner the better.


> As I said, I edited the document to contain those changes that
> I thought were non-controversial.

I hope it is fair to say that there was rough consensus on
recommending UTF-8 (with %HH) for character encoding. You have
acknowledged this consensus for the process draft, and while
the discussion on both lists did not proceed exactly in parallel,
there is really nothing much that would in the end distinguish
the discussion on both lists.


> >   URL creation mechanisms that generate the URL from a source which
> >   is not restricted to a single character->octet encoding are
> >   encouraged, but not required, to transition resource names toward
> >   using UTF-8 exclusively.
> >   URL creation mechanisms that generate the URL from a source which
> >   is restricted to a single character->octet encoding should use UTF-8
> >   exclusively.  If the source encoding is not UTF-8, then a mapping
> >   between the source encoding and UTF-8 should be used.
> >
> This is an additional requirement that does not correspond,
> as far as I can tell, to any kind of "implementation experience".
> I know of no URL creation mechanisms that actually do this.

See above. "implementation experience" is obviously trivial.


> Further, I think that the complaints that there is a certain
> amount of ambiguity in practice over exactly how one goes
> about doing this are legitimate, and that not only is there
> no "running code", there is not "rough consensus".

The code that we have is obviously more than sufficient.
Rough consensus is there. The word "rough", as I have seen
it interpreted in IETF working groups, takes care of the
case of a single individual raising the same far-fetched
and unrelated complaints over and over, in a rather short
and cryptic manner, even after they have been addressed
in detail.

I don't know exactly what you intend to refer to with
"certain ambiguity". If you mean ambiguities arising from
URLs such as http://0oO0Il1.com/IlIl10oO.html, this is
obviously a problem that is ignored for ASCII, because
of the correct assumption that URL generators learn to
avoid such cases by trial and error if not otherwise.
I do not think that at the present time, things beyond
ASCII need to be specified more explicitly than ASCII
itself, in this respect.

I fully acknowledge that for some cases, more
detailed specifications are highly desirable. I have
talked with many people about the issues involved, and
I have repeatedly volunteered to work on the necessary
documents. However, I do not see any sense in writing
such documents in a void, without a clear commitment
to a good solution in the central document. Actually,
I would like nothing more than to finish the current
controversy on the basic issue and have some time to
work on more documentation. I therefore sincerely hope
that we can stop useless "procedural concerns" as above
as quickly as possible. [Also, as long as we are only
concerned with %HH, the potential ambiguities actually
don't arise :-). %HH is the only thing that should go
into the current draft; I agree that the transition to
using "native" URLs is something more experimental, and
that the necessary documents for it will have to be written.]


> > I'm surprised, too. I thought we had this worked out, and that
> > there was no significant objection or controversy.
> 
> I hope that the domain name from which you post ("newbie.net")
> isn't some kind of joke. If you insist, I will forward you
> the three hundred or so email messages discussing the controversy
> around the proposed additions.

I guess there is no need to do that. Edward is very well aware
of the discussion that went on. Some of the best contributions
to it are from him. He probably followed the discussion more
closely than many others. Threatening him with mail flooding
is beyond what I want to comment on.


Regards,	Martin.
Received on Thursday, 3 April 1997 09:55:24 UTC